FreeBSD is a free, UNIX-like operating system, similar in many ways to
Linux.  Unlike Linux, FreeBSD was not written from scratch; it is based on
the "4.4-Lite" distribution from the University of California, Berkeley's
CSRG group, which had previously released "Net/2" (the successor to the
"Net/1" release), upon which 386BSD (which was extensively described
in DDJ) was based.

There are advantages and disadvantages to this; one of the advantages is
that much of the code in FreeBSD is mature, and well-tested.  In particular,
the networking code is extremely mature.  A corresponding disadvantage is
that, of course, the code *is* "old," and may not always use the best
algorithms.

Information about FreeBSD, including how to obtain it, is available at
http://www.FreeBSD.Org/ -- it is also available on CD-ROM from Walnut Creek
CD-ROM, which sponsors the project.

truss

One of the more useful utilities available on some UNIX systems is "truss"
(also known as "strace" or "trace").  truss is a program that monitors the
system calls of a program.  Because of this, it does not need (unlike a
debugger) to have symbols compiled into the executable; this means that it
can work with _any_ executable, without recompilation.  This is invaluable
in many debugging situations.  And since it is also generally possible to
trace a process that is already running, it means that debugging system
programs and daemons such as init and inetd can be much easier.

FreeBSD, like all 4.4-derived BSD systems, already has a program called
"ktrace."  This functions similarly to truss, although it works quite
differently.  One of the main differences is that, when ktrace enables
tracing, the process dumps information to a file.  In some cases, this can
be extremely undesirable, as the process can fill up a filesystem if the
tracing is not disabled quickly enough.  The difference is essentially
one of synchronous vs. asynchronous behaviour.

truss, on the other hand, works by monitoring the process as it enters and
leaves the kernel, and arranging for the process to stop in those cases.
After truss gets the desired information, it then restarts the process, and
the cycle continues.

Since this requires features -- namely, the ability to have a process
stop when doing a system call -- not currently present in the FreeBSD
kernel, they will have to be added.

When adding or modifying any code for the kernel, care must be taken
in design, implementation, and testing.  Since one of the
original attractions of UNIX was its simple, yet powerful, design, that
should be kept in mind when designing any project.

stopevent() -- the main enchilada

The work of stopping the process will be done in a routine called
stopevent().  To meet our needs (which include truss, but also future
debuggers), stopevent() needs to be easily invoked at many kernel locations,
and needs to cause the process to stop.  Fortunately, the kernel already
provides a couple of different methods to stop a process.

The most common is via a SIGSTOP signal.  This is one of the signals
classified as a "job control" signal.  (Job control was a feature added to
4.2BSD, which allowed a user to suspend and resume processes.  This is
normally done using Control-Z on the terminal, which causes the kernel to
send a SIGTSTP signal to the process group attached to that terminal.)
Unlike SIGTSTP, SIGSTOP cannot be caught, nor can it be ignored, and it
causes the process to stop, when it is noticed.  (More on this below.)

However, one side effect of stopping a process this way is that the parent
is notified.  (This is how the shell regains control of the terminal, when
you suspend a program as described above.  This is also how a debugger is
notified that the process being debugged has stopped.)  While
convenient, and fitting nicely into the traditional UNIX model, this has a
disadvantage for my truss considerations -- it strikes me as rude, when the
process that requested the stop may not be the stopped process' parent.
And, since one of the events that truss would monitor is signal delivery,
that would complicate matters quite a bit.

There is another method available in the kernel to have a process stop --
sleep() (and the slightly-better version available in BSD, tsleep()).
sleep() is used to have a process wait for an event or resource; the
opposite is wakeup().  When a process invokes sleep(), it stops, until it is
woken up, and no signal is generated, and no processes are notified.

This suits my needs better, and is the method I chose.  (An earlier
version of stopevent() used a mechanism similar to the signal method;
however, it had more intimate knowledge of process switching than I was
comfortable with.)

The listing shows the code for stopevent().  As you can see, it is a very
short function.  Despite that, it accomplishes the goals, when invoked
correctly.

It requires some additions to FreeBSD's process structure ("struct proc", in
<sys/proc.h>).  These changes are to add five elements to the structure:

	p_step		Indicates whether or not the process is
			stopped, and whether it can continue

	p_stops		Bitmask of which kinds of events the process
			should stop on

	p_stype		The particular event that caused it to stop

	p_pfsflags	Non-event flags set and used by procfs.

	p_pad3		Padding to make the proc structure end on a
			32-bit boundary.

Lastly, stopevent() also allows for some extra data to be associated with
the event.  For example, in the case of a signal delivery, it arranges
for the signal number to be passed back.  This is done by storing the
data in the p_xstat field in the proc structure; note that this is
also used by exit() and wait() -- this conflict may cause problems
eventually, in which case another entry may have to be added to the
proc structure.

As the code indicates, p_step is set when stopevent() is invoked, and
checked every time the process wakes up.  The tsleep() in the loop is what
controls it.
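
As a rough illustration of that loop (a user-space model, not the kernel
code from the listing), the sketch below sets p_step and keeps sleeping
until something clears it.  The struct model_proc type and the
model_tsleep()/model_wakeup() stand-ins are assumptions made so that the
example compiles outside the kernel; in particular, the tsleep() stand-in
takes fewer arguments than the real one, only counts calls, and pretends
that a "continue" request arrives on its third invocation.

```c
/* Hypothetical stand-in for the fields added to struct proc. */
struct model_proc {
	unsigned int	p_step;		/* non-zero while the process must stay stopped */
	unsigned int	p_stops;	/* events being monitored */
	unsigned int	p_stype;	/* event that caused the stop */
	unsigned int	p_xstat;	/* extra data for the event */
};

static int wakeups;		/* calls to the wakeup() stand-in */
static int sleeps;		/* calls to the tsleep() stand-in */

/* Stand-in for wakeup(): just count the call. */
static void
model_wakeup(void *chan)
{
	(void)chan;
	wakeups++;
}

/*
 * Stand-in for tsleep(): returns immediately.  To imitate a monitor
 * asking the process to continue, the third call clears p_step; the
 * first two imitate wakeups that arrive while it must stay stopped.
 */
static void
model_tsleep(struct model_proc *p, void *chan)
{
	(void)chan;
	if (++sleeps == 3)
		p->p_step = 0;
}

/* Set p_step, then sleep until somebody else clears it. */
static void
model_stopevent(struct model_proc *p, unsigned int event, unsigned int val)
{
	p->p_step = 1;
	do {
		p->p_xstat = val;		/* extra data, e.g. a signal number */
		p->p_stype = event;		/* which event caused the stop */
		model_wakeup(&p->p_stype);	/* notify anyone waiting for the stop */
		model_tsleep(p, &p->p_step);
	} while (p->p_step);
}
```

Re-checking p_step after every wakeup is what makes the loop robust:  a
wakeup on the channel that arrives for any other reason simply puts the
process back to sleep.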

tsleep() and wakeup()

The first argument to tsleep() is the "channel" -- this is an arbitrary
value, typically a pointer.  It can be anything; it is used only by
wakeup().  By making it an element of the proc structure, and one that
relates to what stopevent() is doing, it is more mnemonic, and obvious as to
what is going on.

The second parameter is a priority level, optionally with a flag.  The
priority level indicates what priority the process should have when it
resumes execution; the optional flag, PCATCH, indicates whether the sleep
should be interruptible by a signal or not.

The third parameter is a string to indicate where the process is.  This will
be displayed by ps, and also by the kernel's "status" mechanism (this is
what the "status" control character that shows up with "stty -a" is for).

The fourth parameter is a time limit -- if the time runs out and the process
has not otherwise been woken up, it wakes up anyway.  A time limit of 0
means no limit -- meaning the process will sleep forever (or until a
signal interrupts it, if PCATCH was set).

wakeup() is the other half of sleep()/tsleep() -- it only takes one
parameter, which is the "channel" that was passed in to sleep().  I said
above that this can be *anything* -- wakeup() simply goes through the list
of sleeping processes, and schedules all processes that have a matching
value to be woken up.  (While wakeup() will work on all processes with
the correct value, wakeup_one() only wakes up the first process on the
list.)
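
The channel matching described above can be modelled in a few lines of
ordinary C.  This is only a sketch:  the fixed-size sleepq[] array, the
struct sleeper type, and the model_ function names are assumptions, and
the real kernel keeps its sleeping processes in its own data structures
and performs a context switch in tsleep(), which the model leaves out.

```c
#include <stddef.h>

#define NSLEEPERS 4

/* Hypothetical model of a sleeping process: the channel it waits on. */
struct sleeper {
	const void	*chan;		/* NULL once made runnable */
	int		 runnable;
};

static struct sleeper sleepq[NSLEEPERS];

/* Record that slot i is asleep on chan (tsleep() minus the switch). */
static void
model_sleep(int i, const void *chan)
{
	sleepq[i].chan = chan;
	sleepq[i].runnable = 0;
}

/* Make every sleeper on chan runnable; returns how many were woken. */
static int
model_wakeup(const void *chan)
{
	int i, n = 0;

	for (i = 0; i < NSLEEPERS; i++) {
		if (sleepq[i].chan != NULL && sleepq[i].chan == chan) {
			sleepq[i].chan = NULL;
			sleepq[i].runnable = 1;
			n++;
		}
	}
	return (n);
}

/* Make only the first sleeper on chan runnable; returns 1 or 0. */
static int
model_wakeup_one(const void *chan)
{
	int i;

	for (i = 0; i < NSLEEPERS; i++) {
		if (sleepq[i].chan != NULL && sleepq[i].chan == chan) {
			sleepq[i].chan = NULL;
			sleepq[i].runnable = 1;
			return (1);
		}
	}
	return (0);
}
```

Because the channel is compared only for equality, the address of any
object that both sides agree on will do; nothing is ever read through
the pointer.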

You'll note that stopevent() calls wakeup(&p->p_stype) before tsleep();
this is to wake up any other processes that are waiting for *this*
process to stop.  (Don't worry; this will become clearer later.)

--- SIDEBAR ---
Process Scheduling

I said that wakeup() schedules processes to be woken up; it doesn't actually
run them.  UNIX is a pre-emptive, multitasking system -- user processes that
run for too long are interrupted, and then the kernel runs.  However, the
kernel itself is not pre-empted in this fashion -- the only
pre-emption occurs when switching back to user mode. sleep() is how
the kernel indicates that it is going to be waiting for a while; in
the sleep() function, the kernel simply changes context to another
process.  (It helps to think of the kernel as a multi-threaded
program, with only voluntary switching between them.)  There are few
places where the kernel will actually switch contexts; all are done
through the function mi_switch(), which is invoked from issig() (which
handles signals for processes), tsleep(), and before returning from a
system call.

One thing to bear in mind is that the kernel is, almost always,
running in the context of a process.  This imposes limitations on what
the kernel can do -- since one process cannot arbitrarily affect
another, this means that the kernel cannot do so on behalf of a
process, either.  This is why signals, for example, are delivered only
when the recipient process runs, rather than when the signal is sent.

--- END SIDEBAR ---

In order for the process to actually stop, stopevent() needs to be invoked.
This will be done through the STOPEVENT() macro; it is a simple wrapper that
checks to see if a given event is being checked by the process, and, if so,
it invokes stopevent().  The macro is simply so that a single line can be
dropped into nearly any part of the kernel, with no re-coding needed.  (In
practice, it doesn't completely work this way, unfortunately.  But most of
the time it does.)

The events that are currently checked for are:

	S_SIG	Signal delivery
	S_EXIT	Process exit
	S_EXEC	exec() of a new program
	S_SCE	System call entry
	S_SCX	System call exit
	S_CORE	Coredump

Since it currently uses a bitmask, there are 32 possible events that could
be checked for; those six cover the cases I was interested in, however, and
are enough for now.
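
A wrapper of the kind described above might look like the following
sketch.  The numeric values given to the S_ flags, the struct model_proc
type, and the model_stopevent() stand-in (which records the event
instead of sleeping) are assumptions made for illustration; only the
shape of the macro, a test of the event mask followed by a call, comes
from the description.

```c
/* Event bits; the numeric values are assumptions for this sketch. */
#define S_EXEC	0x01
#define S_SIG	0x02
#define S_SCE	0x04
#define S_SCX	0x08
#define S_CORE	0x10
#define S_EXIT	0x20

struct model_proc {
	unsigned int	p_stops;	/* events being monitored */
	unsigned int	p_stype;	/* last event that stopped the process */
	unsigned int	p_xstat;	/* extra data for that event */
};

static int stop_count;		/* how many times the stand-in was entered */

/* Stand-in for stopevent(): record the event instead of sleeping. */
static void
model_stopevent(struct model_proc *p, unsigned int event, unsigned int val)
{
	p->p_stype = event;
	p->p_xstat = val;
	stop_count++;
}

/* Call stopevent() only when the event is in the monitored set. */
#define STOPEVENT(p, e, v)					\
	do {							\
		if ((p)->p_stops & (e))				\
			model_stopevent((p), (e), (v));		\
	} while (0)
```

Wrapping the body in do { ... } while (0) lets the macro be used as a
single statement, including inside an unbraced if/else, and the test of
the mask keeps the cost close to zero for processes that are not being
monitored.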

STOPEVENT() is placed (with the appropriate arguments) at a convenient place
where a process should stop.  For example, in syscall() (the kernel side of
the system call interface, in /sys/i386/i386/trap.c), the following is
added:

	STOPEVENT(p, S_SCE, callp->sy_narg);

This simply has the process stop, just after system call entry, if that
event is being monitored.  callp->sy_narg contains the number of
32-bit word arguments for the system call.

The listings show where in the kernel I have placed STOPEVENT()s; note that
in almost all cases, there was no extra work to be done.  (This, then, meets
my goal of being simple to add.)  The one exception is kern_sig.c; in two
places, it is necessary to check to see if signals are being monitored at
all.  The first case, in psignal(), is necessary because of some
optimizations done in that function -- psignal() normally will pass on the
signal delivery if the signal is ignored or blocked.  The exception is if
the process is being debugged (because debuggers always get to know about
attempted signals); by adding a check for p->p_stops&S_SIG, we achieve the
same effect.

issig() is the function that actually delivers the signal; unlike psignal(),
issig() always runs in the context of the recipient process.  This is
necessary for our purposes -- there is no way, in the FreeBSD kernel, to
stop *another* process.  (See the sidebar on process scheduling.)  Once in
issig(), we check to see if the process is being debugged (P_TRACED), or if
signal events are being monitored.

One other change bears mentioning, in kern_exit.c.  Here, I have added
a call to procfs_exit(), which is a new function in the procfs code.
This function goes through the open procfs files, and forcibly closes
any one for the specified process id.  This prevents an open procfs
file from referring to a process that has already exited.  (By calling
procfs_exit() after the STOPEVENT(), any monitoring process gets a
last shot at examining the process before the file descriptors it
holds become irrelevant.  And, to answer the obvious question:  yes,
this code is not extremely fast or efficient.
However, the number of open procfs files in a system is likely to be
small, and probably not worth the effort of a more-efficient method.)
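
The linear search can be pictured with the sketch below.  It is a model
only:  struct model_pfsnode, its fields, and model_procfs_exit() are
hypothetical, and "closing" is reduced to clearing a flag, whereas the
real function has to deal with the kernel's own procfs bookkeeping.

```c
#include <stddef.h>

/* Hypothetical record for one open procfs file. */
struct model_pfsnode {
	struct model_pfsnode	*next;
	int			 pid;	/* process the file refers to */
	int			 open;	/* non-zero while still usable */
};

/*
 * Walk the whole list and invalidate every node that refers to pid.
 * The cost is linear in the number of open procfs files.
 * Returns the number of nodes closed.
 */
static int
model_procfs_exit(struct model_pfsnode *head, int pid)
{
	struct model_pfsnode *n;
	int closed = 0;

	for (n = head; n != NULL; n = n->next) {
		if (n->pid == pid && n->open) {
			n->open = 0;
			closed++;
		}
	}
	return (closed);
}
```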

The last major kernel changes are interface related.  So far, although we've
added the code, we have not added a way for the event bitmask to be set,
cleared, queried, etc.  This we do via procfs, the process
pseudo-filesystem.

procfs

procfs was added to 4.4BSD, and adds a filesystem interface to the kernel
process list.  A quick way to find which processes are currently running on
the system is via "ls /proc".  "ls -l /proc" will show the user ids as well.

procfs is laid out as follows:

	/proc
		curproc
		<process id list>
			ctl
			etype
			file
			fpregs
			map
			mem
			note
			notepg
			regs
			status
			
regs contains the register set (we will use them later), and mem
contains the process' memory space.  The layout is based on the Plan 9
/proc filesystem, although not all of the files are implemented.

There are several possible ways to communicate with the target
process; given that procfs presents a file interface, read() and
write() jump out as the most obvious.  However, each of read() and
write() is unidirectional -- to send a command that would return
information would require a write() followed by a read().  Plan 9 uses
the note and notepg files to send commands and get the results; this
is slightly simpler, but not completely sufficient:  because
multiple processes could be examining the same process, the
non-atomicity makes the implementation more difficult.  Fortunately,
there are other methods available.

In this case, we will add the ability to do some process manipulation via
ioctl's.  One complication (which is readily handled) is that ioctl's are
not allowed on regular files, normally; that is solved by simply removing a
pre-emptive error return.  (See listing.)

The rest of the procfs changes are to add the ioctls.  First, they must be
defined -- to do that, we add a new kernel file, <sys/pioctl.h> (for
"process ioctl").  The currently-supported ioctl's are:

	PIOCBIS		Set event flag(s)
	PIOCBIC		Clear event flag(s)
	PIOCSFL		Set non-event flags
	PIOCSTATUS	Get process status information
	PIOCWAIT	Wait for a process to stop
	PIOCCONT	Continue a process

The _IOW and _IOR macros in the listing construct a 32-bit value, that is
used by the ioctl system call to determine what to do with the third
argument to ioctl -- _IOW means that it is copied from user space to kernel
space, and _IOR is the other way around.
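
To make the encoding concrete, here is a sketch of how such a 32-bit
command word is traditionally packed on BSD systems:  direction bits at
the top, then the size of the argument, then a group character and a
command number.  The macros are re-declared under MODEL_ names so that
the example does not depend on any system header, and the two commands
and struct model_status are made up; they are not the definitions in
<sys/pioctl.h>.

```c
/*
 * Traditional BSD-style ioctl command encoding, re-declared under
 * MODEL_ names so the sketch stands alone.
 */
#define MODEL_IOCPARM_MASK	0x1fffUL	/* argument length, 13 bits */
#define MODEL_IOC_OUT		0x40000000UL	/* copy result out to user space */
#define MODEL_IOC_IN		0x80000000UL	/* copy argument in from user space */

#define MODEL_IOC(dir, grp, num, len)					\
	((unsigned long)((dir) |					\
	    (((unsigned long)(len) & MODEL_IOCPARM_MASK) << 16) |	\
	    ((unsigned long)(grp) << 8) | (unsigned long)(num)))

#define MODEL_IOR(g, n, t)	MODEL_IOC(MODEL_IOC_OUT, (g), (n), sizeof(t))
#define MODEL_IOW(g, n, t)	MODEL_IOC(MODEL_IOC_IN, (g), (n), sizeof(t))

/* Hypothetical commands; the group letter and numbers are made up. */
struct model_status { int state; unsigned int events; };
#define MODEL_SETBITS	MODEL_IOW('p', 1, unsigned int)
#define MODEL_GETSTATUS	MODEL_IOR('p', 3, struct model_status)

/* Pick the fields back out of a command word. */
static unsigned long
model_ioc_len(unsigned long cmd)
{
	return ((cmd >> 16) & MODEL_IOCPARM_MASK);
}

static int
model_ioc_copies_in(unsigned long cmd)
{
	return ((cmd & MODEL_IOC_IN) != 0);
}

static int
model_ioc_copies_out(unsigned long cmd)
{
	return ((cmd & MODEL_IOC_OUT) != 0);
}
```

Because the direction and the length travel inside the command word, the
generic ioctl code can do the copying between user and kernel space
itself, and the handler for a particular command only ever sees a kernel
copy of the argument.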

As you can see from the listing for procfs_vnops.c, PIOCBIS, PIOCBIC,
and PIOCSFL are fairly straightforward.
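
Reduced to their effect on the target process, the three commands are
plain bit operations.  The sketch below is a model rather than the
procfs_vnops.c code:  the model_cmd constants, struct model_proc, and
model_pioctl() are hypothetical, and treating the "set non-event flags"
command as a plain assignment is an assumption.

```c
/* Hypothetical subset of struct proc touched by these three commands. */
struct model_proc {
	unsigned int	p_stops;	/* event mask */
	unsigned int	p_pfsflags;	/* non-event flags */
};

enum model_cmd { MODEL_BIS, MODEL_BIC, MODEL_SFL };

/* Apply one command to the target; returns 0, or -1 if cmd is unknown. */
static int
model_pioctl(struct model_proc *p, int cmd, unsigned int arg)
{
	switch (cmd) {
	case MODEL_BIS:			/* set the given event bits */
		p->p_stops |= arg;
		return (0);
	case MODEL_BIC:			/* clear the given event bits */
		p->p_stops &= ~arg;
		return (0);
	case MODEL_SFL:			/* replace the non-event flags */
		p->p_pfsflags = arg;
		return (0);
	default:
		return (-1);
	}
}
```

Passing ~0 to the clear command removes every event from the mask at
once, whatever bits happen to be set.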

Both PIOCSTATUS and PIOCWAIT fill out a procfs_status structure,
provided by the user.  The procfs_status structure indicates whether
the process is stopped on an event or not (and, if so, which event it
has stopped on), which events are being monitored, any extra data set
by STOPEVENT(), and the (currently-unused) flags.

The difference between the two is that PIOCWAIT will put the
requesting process to sleep until the target process stops.  This is,
of course, where the wakeup() in stopevent(), mentioned above, comes into
play.  First, PIOCWAIT checks to see if the process is already stopped -- if
so, it has nothing to do.  If the process is not stopped, however, then it
calls tsleep() on the p_stype; when stopevent() then goes to stop the
process, it first schedules any sleeping processes to be woken up.
Note that, unlike the tsleep() in stopevent(), PCATCH is set, meaning
that the waiting process can be interrupted by signals.

The last currently-supported ioctl is PIOCCONT.  This restarts a process
that was stopped, and is modeled after the ptrace PT_CONTINUE method.  This
allows a stopped process to continue with a specified signal, if desired.
(PT_CONTINUE also allows an address to be specified; however, the PC can be
set via procfs, by opening up /proc/process-id/regs and modifying it.)

PIOCCONT restarts the process by first clearing the p_step flag in the
target process; this tells stopevent() (which is running in the context of
the stopped process!) that it is okay to run now.  It also then wakes up the
process, and is done.
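
The continue step can be modelled as follows.  This is a sketch under
stated assumptions, not the procfs code:  struct model_proc, the
pending_sig field, and the model_ functions are hypothetical, and the
model merely records the requested signal rather than posting it the way
the kernel would.  The EINVAL result for a target that is not stopped
follows the behaviour PIOCCONT is described as having.

```c
#include <errno.h>

struct model_proc {
	unsigned int	p_step;		/* non-zero while stopped in stopevent() */
	int		pending_sig;	/* signal requested by the continue, if any */
	int		woken;		/* set by the wakeup() stand-in */
};

/* Stand-in for a wakeup() on the channel the target sleeps on. */
static void
model_wakeup(struct model_proc *p)
{
	p->woken = 1;
}

/*
 * Continue a stopped process, optionally with a signal.
 * Returns 0 on success, EINVAL if the target is not stopped.
 */
static int
model_pioccont(struct model_proc *p, int signo)
{
	if (p->p_step == 0)
		return (EINVAL);
	if (signo != 0)
		p->pending_sig = signo;	/* the kernel would post the signal */
	p->p_step = 0;		/* lets the loop in stopevent() fall through */
	model_wakeup(p);	/* schedule the stopped process to run */
	return (0);
}
```

The order matters:  p_step has to be cleared before the wakeup, because
the stopped process re-tests it as soon as it runs again.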

procctl

The first program to use the interface is procctl.  This is an
administrative program, which "unsticks" processes -- remember that a
process stopped via stopevent() will not respond to signals.

As programs go, it is currently limited, but simple -- the user must
specify process ID's, and there are no options.  For each process, it
simply clears the event bit mask, and continues the process.  It warns
about failures, but continues.

Although simple, it is functional, and does demonstrate the interface:
PIOCBIC clears the specified bits (~0 indicating that it should clear all
of the bits in the event mask), and PIOCCONT wakes the process up (and
returns EINVAL if the process was not already stopped).

The program was also absolutely necessary during development --
a process that was stopped in stopevent() could not be killed, which
made it very difficult to regain control of the system.

Two obvious improvements to procctl present themselves:  "unsticking"
all processes owned by a specified user, and having the program itself
search the /proc filesystem, rather than specifying the process id's.

truss

The next application is of more general use -- truss.  It is also
significantly more complicated.  It can be broken into two parts:
setup, and the main loop.

At its simplest, truss is invoked as:

	truss program args

This will trace the system calls program makes.  However, it may be
desirable to have the output of truss stored in a file somewhere; the
-o option will do that.  Also, you may wish to ignore certain events;
the -S option will ignore delivered signals.  (This could be extended
to other events in a straightforward fashion.)

Lastly, truss can be used to trace an already-running process, with
the -p option.

After going through the options, if a pid was not specified, truss
needs to set up the process itself.  It does this in setup_and_wait()
-- whose sole purpose is to create a process, exec the program
desired, and leave that process in a stopped state.  Note that the
child sets the event mask, initially, to S_EXEC | S_EXIT, and applies
that to itself (after the vfork() and before the execvp()).

In the event that the specified program is unable to be executed, then
the process will exit, and stop; if, on the other hand, it succeeds,
then the process will stop before returning from exec inside the
kernel.  (In both cases, by "stop," I mean that the process will wait
in stopevent() to be woken up.)

In start_tracing(), truss opens the procfs mem file for the given
process (e.g., /proc/45/mem), and uses the ioctls described above on
that open file to set up the event mask for the target process.  (Note
that the event mask may or may not set S_SIG, depending on the options
given to truss.)

The next thing truss does is determine which emulation type the target
process is.  FreeBSD supports emulation of different operating systems
(such as Linux and SCO); in many cases, the system call numbers are
different for the different operating systems, and so truss needs to
know which mapping to use.  It determines this by reading the procfs
etype file; by default, this is "FreeBSD a.out" -- although "Linux
ELF" and "IBCS2 COFF" are also common possibilities.  As of right
now, truss supports native FreeBSD programs and Linux ELF binaries,
although I will only concentrate on the native (i386_syscall_)
versions in my discussion.
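
Choosing the mapping from the contents of the etype file amounts to a
string comparison.  The sketch below assumes the file's contents have
already been read into a buffer; the model_emul enumeration and
model_emul_from_etype() are hypothetical names, and tolerating a
trailing newline is an assumption about the file's format.

```c
#include <string.h>

enum model_emul { EMUL_UNKNOWN, EMUL_FREEBSD_AOUT, EMUL_LINUX_ELF };

/*
 * Map the text read from the etype file to an emulation.
 * A trailing newline, if present, is ignored.
 */
static enum model_emul
model_emul_from_etype(const char *etype)
{
	size_t len = strcspn(etype, "\n");

	if (len == strlen("FreeBSD a.out") &&
	    strncmp(etype, "FreeBSD a.out", len) == 0)
		return (EMUL_FREEBSD_AOUT);
	if (len == strlen("Linux ELF") &&
	    strncmp(etype, "Linux ELF", len) == 0)
		return (EMUL_LINUX_ELF);
	return (EMUL_UNKNOWN);	/* e.g. "IBCS2 COFF": not handled here */
}
```

A tracer built this way would typically use the result to pick a pair of
entry/exit handlers and a system call name table, and refuse to trace
anything it maps to EMUL_UNKNOWN.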

After that, truss enters the main loop:  wait for the process to stop
(and find out why it stopped); print the desired information; and
restart the process.  It stops when the process has died (pfs.why ==
S_EXIT).
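
The shape of that loop can be shown without a kernel by driving it from
a canned list of stops.  Everything here is a model:  the S_ values,
struct model_status, struct model_target, and the model_wait() and
model_cont() functions are assumptions standing in for the wait and
continue ioctls, and the "print" step is reduced to a counter.

```c
#include <stddef.h>

#define S_EXEC	0x01	/* values are assumptions for this sketch */
#define S_SIG	0x02
#define S_SCE	0x04
#define S_SCX	0x08
#define S_EXIT	0x20

/* Hypothetical shape of what a "wait" returns: why, plus extra data. */
struct model_status {
	unsigned int	why;
	unsigned int	val;
};

/* A scripted target: each wait hands back the next canned stop. */
struct model_target {
	const struct model_status	*script;
	size_t				 nstops;
	size_t				 next;
	size_t				 continues;	/* times restarted */
};

/* Stand-in for waiting on the target; -1 when the script runs out. */
static int
model_wait(struct model_target *t, struct model_status *st)
{
	if (t->next >= t->nstops)
		return (-1);
	*st = t->script[t->next++];
	return (0);
}

/* Stand-in for restarting the target. */
static void
model_cont(struct model_target *t)
{
	t->continues++;
}

/* Wait, report, restart; leave the loop once the target has exited. */
static size_t
model_trace_loop(struct model_target *t)
{
	struct model_status st;
	size_t reported = 0;

	do {
		if (model_wait(t, &st) == -1)
			break;
		reported++;		/* a real tracer prints the event here */
		model_cont(t);
	} while (st.why != S_EXIT);
	return (reported);
}
```

Note that the target is restarted even after the exit event, so that it
is not left waiting in its stopped state once the tracer goes away.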

When a system call is entered or exited, truss calls one of two
processor-specific and OS-specific functions.  For FreeBSD/a.out
executables (i.e., native binaries), the two functions are
i386_syscall_entry() and i386_syscall_exit().  (The Linux equivalents
get the system call number and arguments through different means.)

Under FreeBSD on the x86 architecture, the system call number is
contained in the EAX register; success or failure is indicated by
setting the carry bit in the process status register, and the return
value is in EAX (and, sometimes, EDX as well -- some system calls
return two values, such as fork()).

The first thing i386_syscall_entry() and i386_syscall_exit() do is to
determine whether or not they need to re-open the register file.
After the first call, they should not need to.  Then, they seek to the
beginning of the register file, and read the entire register set.

The system call names live in the file syscalls.c, which contains an
array of all of the system call names available on the system; this
file is generated automatically from the system call configuration
file that is part of the kernel.  It fills one need
easily -- the need to translate system call numbers to system call
names -- but is lacking in another one:  knowing the types of the
arguments.  However, we do have the number of arguments, in the nargs
parameter to i386_syscall_entry().  We can, therefore, get the
arguments by looking in the process' memory space -- the arguments are
at ESP + sizeof(int).
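
Both halves of that job, the name lookup and the argument fetch, are
sketched below.  A byte array stands in for the procfs mem file and
"esp" is an offset into it, so no process is actually inspected; the
shortened name table and the model_ function names are hypothetical, and
every argument is treated as a plain int because the types are not
known.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, shortened name table indexed by system call number. */
static const char *const model_syscallnames[] = {
	"syscall", "exit", "fork", "read", "write", "open", "close"
};
#define MODEL_NSYSCALLS \
	(sizeof(model_syscallnames) / sizeof(model_syscallnames[0]))

/* Translate a number to a name, or NULL when it is out of range. */
static const char *
model_syscall_name(unsigned int num)
{
	return (num < MODEL_NSYSCALLS ? model_syscallnames[num] : NULL);
}

/*
 * Copy nargs int-sized arguments out of an image of the target's memory.
 * The word at esp itself is skipped; the arguments follow it.
 * Returns 0, or -1 if the requested range does not fit in the image.
 */
static int
model_fetch_args(const unsigned char *mem, size_t memlen, size_t esp,
    int nargs, int *args)
{
	size_t start = esp + sizeof(int);
	size_t bytes;

	if (nargs < 0)
		return (-1);
	bytes = (size_t)nargs * sizeof(int);
	if (start > memlen || bytes > memlen - start)
		return (-1);
	memcpy(args, mem + start, bytes);
	return (0);
}
```

Against a live process, the memcpy() would become a seek to
ESP + sizeof(int) in the mem file followed by a read of
nargs * sizeof(int) bytes.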

i386_syscall_exit()'s job is much easier -- it simply has to determine
if the system call failed or not (by checking the carry bit of the
PSW), and print out the value of EAX accordingly (either as an error,
or as a return value).  Note that this does not check for system calls
that return multiple values.
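
The decision made at system call exit fits in one small function.  In
this sketch the saved flags and EAX are passed in as plain integers
rather than read from the regs file, and the output format is made up;
the one hard fact relied on is that the carry flag is bit 0 of the x86
flags register.  MODEL_PSL_C and model_format_result() are hypothetical
names.

```c
#include <stdio.h>

#define MODEL_PSL_C	0x00000001U	/* carry flag: bit 0 of EFLAGS */

/*
 * Describe the outcome of a system call from the saved flags and EAX.
 * The output format is made up for this sketch.  Returns the value of
 * snprintf().
 */
static int
model_format_result(unsigned int eflags, unsigned int eax, char *buf,
    size_t buflen)
{
	if (eflags & MODEL_PSL_C)	/* carry set: EAX holds an errno value */
		return (snprintf(buf, buflen, "ERR#%u", eax));
	return (snprintf(buf, buflen, "= %d (0x%x)", (int)eax, eax));
}
```

Handling calls that return two values would mean looking at EDX as well,
which requires knowing which system calls do so.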

Has this been worthwhile?

At this point, we now have a functional, albeit minimalist, version of
truss for FreeBSD, using kernel code we've added ourselves.  And,
since it can trace the system calls for non-FreeBSD processes, it has
at least one large advantage over ktrace.

Once again, a warning:  modifying the kernel is potentially disastrous; there
are few parts of the operating system you can modify that can cause
equivalent amounts of damage.

Finally, we should consider possible enhancements to the code:  additional
events; copying the event mask to a process after a fork (currently,
it is zeroed out); printing out the system call arguments in an
intelligent fashion, based on their types; etc.

Suggested reading

The Design of the UNIX Operating System, Maurice J. Bach

The Design and Implementation of the 4.4BSD Operating System, McKusick
et al.

http://www.freebsd.org/
