Ray Duncan is a software developer for Laboratory Microsystems. You can reach him at 12555 W. Jefferson Blvd, Ste. 202, Los Angeles, CA 90066.
Regardless of whatever other nits you might pick with OS/2, you could never complain that its array of interprocess communications (IPC) facilities is too austere. In fact, OS/2's IPC support could aptly be termed an "embarrassment of riches" (the title of a recent book, by the way, which has nothing whatsoever to do with protected mode programming). Browsing through the reference manual, one gets the distinct impression that the IBM/Microsoft IPC Design Subcommittee couldn't agree on anything, so they threw in everything! OS/2 offers all of the following classic IPC mechanisms: semaphores, pipes, shared memory, queues, and signals. When the LAN Manager is running, OS/2 also supports an IPC mechanism called mailslots (which will not be mentioned further in this article).
Regrettably, while the IBM and Microsoft manuals are passably complete on the "how to" for each individual IPC method, they are remarkably stingy with the "which to," "why to," and "when to." In this article, I'll try to provide you with a somewhat more cosmic overview of OS/2 IPC, including some ballpark comparisons of capability, performance, and throughput.
Most of the IPC mechanisms listed above rely on named, global objects or data structures that are controlled and maintained by the operating system. The names are said to be in the "file system name space"; that is, they have the general format of filenames, with the same elements and delimiters, and are subject to the same constraints on length and valid characters. The names of IPC objects are distinguished from the names of true files by a reserved path (such as \SHAREMEM\, \PIPE\, or \QUEUES\).
The resemblance between IPC objects and files does not end with their naming. To gain access to most types of IPC objects, a program must first "open" or "create" the object in a manner analogous to opening or creating a file. OS/2 then returns a token (a selector or an arbitrary "handle"), which the process uses to manipulate the IPC object: reading, writing, querying the number of waiting messages, and so on.
In order to understand OS/2 IPC, it's also crucial that you grasp two essential OS/2 terms: processes and threads. In its simplest form, a process is conceptually equivalent to a program loaded for execution under MS-DOS. OS/2 creates a process by allocating memory to hold its code, data, and stack, and by initializing the memory from the contents of a program (.EXE) file. Once it is running, a process can obtain additional resources --such as memory and access to files -- with appropriate system function calls.
The OS/2 module that oversees multitasking, however -- the scheduler -- cares nothing for processes; it deals with entities called threads. A thread consists of a set of register contents, a stack, an execution point, a priority, and a state: executing, ready to execute, or waiting for some event ("blocking"). Each process starts life with a primary thread, whose execution begins at the entry point designated in the .EXE file header, but that thread can start other threads within the same process, all of which execute asynchronously and share ownership of the process's resources.
Here's why the distinction between processes and threads is important when discussing IPC. When a process opens or creates a semaphore, pipe, queue, or shared memory segment, OS/2 returns a handle that can be used by any thread within that process. But when a thread issues an OS/2 function call that blocks on (waits for) an IPC event -- such as the clearing of a semaphore or the availability of a queue message -- the other threads in the same process continue to run unhindered.
Semaphores are simple IPC objects with two states. These two states can, in turn, be viewed in two different ways, depending on how a semaphore is being used. When a semaphore is being used for signalling between threads or processes, it is said to be either "set" or "clear." Typically, one thread sets the semaphore, and then clears it upon the occurrence of some event; other threads, which wish to be notified of the same event, "block on" the semaphore by issuing a "semaphore wait" function call that does not complete until either the semaphore is cleared or a designated timeout interval has elapsed.
When a semaphore is being used for mutual exclusion, it is said to be either "owned" or "available." In this model, the semaphore symbolizes a resource (such as a file or a data structure) that would be corrupted if it was manipulated by more than one thread or process at a time. To prevent such damage, threads or processes cooperate by refraining from accessing the resource unless they have acquired ownership of the corresponding semaphore with an OS/2 function call.
Aside from the two ways in which they may be used, OS/2 semaphores come in three flavors: system semaphores, RAM semaphores, and Fast-Safe RAM semaphores. System semaphores are named, global objects that reside outside every process's memory space and are completely under the control of the operating system. They must be "opened" or "created" with a system call before they can be used. System semaphores support "counting"; that is, a process can make "nested" requests for ownership of the semaphore, and the semaphore will not become available again until a corresponding number of "release" calls have been issued. OS/2 also provides cleanup support for system semaphores; if a process dies owning a semaphore that another process is waiting for, that other process will be notified with a unique error code.
RAM semaphores, on the other hand, reside in memory controlled by a process. They consist of an arbitrary, but properly initialized, doubleword of memory in the application's address space, and the "handle" for a RAM semaphore is just its address (selector and offset) -- no "open" or "create" operation is required. The number of RAM semaphores that a process may use is limited only by the amount of virtual memory it can allocate. RAM semaphores are used to communicate between threads, but since memory segments can be shared, they can also be used to communicate between processes. In the latter case, OS/2 does not provide any assistance if a process dies owning a RAM semaphore and another process is waiting for the same semaphore.
The so-called Fast-Safe RAM semaphores, which were added to OS/2 in Version 1.1, combine characteristics of both system semaphores and RAM semaphores. They are implemented as 14-byte structures in a process's own memory space; so, like plain vanilla RAM semaphores, the number of Fast-Safe RAM semaphores that a process can use is huge. Like system semaphores, Fast-Safe RAM semaphores support "counting" and are also endowed with a certain amount of clean-up assistance by the operating system. Unfortunately, Fast-Safe RAM semaphores must be manipulated with special-purpose function calls --the general purpose set, request, wait, and clear functions employed for both system and RAM semaphores cannot be used --and they support only the "owned/available" model for mutual exclusion.
Pipes, which were first popularized under Unix, are basically conduits for byte streams. In OS/2, processes refer to pipes with handles that are allocated out of the same sequence as file handles, and they read and write pipes with the same function calls as are used for files. The transfer of information through a pipe is much faster than it would be through an intermediary file, however, because the ring buffer that implements a pipe is always kept resident in memory.
OS/2, Version 1.1, supports two different species of pipes: anonymous pipes and named pipes. When a process creates an anonymous pipe, no global name is involved; the system merely returns read and write handles. These handles can be inherited by child processes, which is what enables anonymous pipes to be used for IPC. However, because a child has no way to predict what handle should be used for what, a common practice is for a parent process to redirect the child's standard input and standard output handles to pipe handles, so that the child unknowingly communicates with the parent rather than with the keyboard and display. The corollary handicap of anonymous pipes is that processes which are not direct descendants of a pipe's creator cannot inherit handles for the pipe and thus have no way to access it.
Named pipes, on the other hand, are global objects, and any process --related or unrelated to the pipe's creator --can open the pipe by name to obtain handles for reading and writing. Another important feature of named pipes is that they can be used in either byte stream mode or message mode. In byte stream mode, a named pipe behaves like an anonymous pipe --the exact number of bytes requested is always read or written. In message mode, a named pipe acts more like a first-in-first-out (FIFO) queue: the length of each message written into the pipe is encoded in the pipe, and a read operation returns at most one message at a time regardless of the number of bytes requested. Last but not least, named pipes can be used to communicate between processes running on two different nodes of a network, simply by prefixing the name of the pipe with the name of the target machine.
Shared memory segments are potentially the most efficient of all OS/2's IPC mechanisms. If two or more processes have addressability for the same segment, they can theoretically pass data back and forth at speeds limited only by the CPU's ability to copy bytes from one place to another, with no need for additional calls to the operating system. Of course, the threads and processes using a shared segment are responsible for synchronizing any changes to the segment's content, and this synchronization is often most convenient to accomplish with semaphores (requiring system calls after all).
OS/2 supports two distinct methods by which processes can share memory: creation of named segments, and giving and getting of selectors for anonymous segments. Each method offers different advantages for security and speed of access. Named segments are restricted to a maximum size of 64K bytes; once a named segment is created, any process that knows the name of the segment can "open" it to obtain a selector with which it can read or write the segment. The segment persists until all the processes that have valid selectors for it have either released the selector or terminated.
Anonymous segments, on the other hand, can be any size at all (huge segments, consisting of logically contiguous 64-Kbyte segments, can be as large as available virtual memory), but sharing is more difficult to arrange. The selectors for such shared segments must be explicitly made addressable for each process that needs them, and passed between the processes by some other means of IPC. One technique, called segment giving, requires the process that created a segment to request an additional selector for use by a specific other process, and then to send the selector to that process.
The other technique, segment getting, requires the creating process to pass its own selector for the segment to the other process by some IPC mechanism. The other process then gains addressability to the shared segment by issuing a function call that makes the selector valid. Segment getting allows far pointers to be passed around freely, but it is correspondingly less secure than the use of giveable selectors.
Queues are the most powerful IPC mechanism in OS/2, and inevitably are also the most complex to use. Queues are named global objects, and any process which knows a queue's name can "open" it and write records into it, although only the process which created the queue can read messages from it or destroy it.
In essence, an OS/2 queue is an ordered list of shared memory segments; the operating system maintains and searches the list on behalf of the communicating processes. Data in the queue is not copied from place to place; instead, pointers are passed from the queue writer to the queue reader (the operating system also provides the queue reader with supplementary information such as the process ID of the queue writer). The items in a queue can be ordered in several different ways: first-in-first-out (FIFO), last-in-first-out (LIFO), or by a priority in the range 0 through 15. Moreover, the queue reader has the freedom to inspect and remove queue messages in any arbitrary order, if it needs to.
Writing a message into a queue is a relatively complicated process. First, the queue writer must allocate a "giveable" memory segment and build the queue message in it. Next, the writer must obtain a giveable selector for the segment that is valid for the queue reader. Finally, the writer must request the queue write, passing the giveable selector, and release its own original selector for the segment. Thus, a minimum of four system calls are typically required at the queue writer's end for each queue transaction. At the queue reader's end, luckily, only two system calls are usually required: one to read the message (obtain a pointer to the message and its length), and one to release the selector for the segment containing the message after it has been processed.
Signals, which (like pipes) have their conceptual origin in Unix, are analogous to a hardware interrupt. They are unique among OS/2's IPC mechanisms in that the time of a signal's arrival is not completely under the control of the receiving process. OS/2 supports two classes of signals. The first class, which consists of signals generated by the operating system, includes the following:
SIGINTR        a Ctrl-C was detected
SIGBREAK       a Ctrl-Break was detected
SIGTERM        the process is being terminated
SIGBROKENPIPE  a pipe read or write failed
Signals in the second class are explicitly sent by one process to another. These are known as event flags, and three types are available (each of which may have a distinct handler): Flags A, B, and C. Event flag signals may be accompanied by an arbitrary word (16 bits) of data.
For each signal type, a process may either register its own handler, instruct the system to ignore the signal, or allow the system's default handler to take its usual action. If a particular signal occurs and the process has previously indicated its desire to service that signal type, the primary thread of the process is transferred forcibly to the routine designated as the signal handler. When the handler completes its processing, control is restored to the point of interruption.
The system's default handling of the different signal types varies. SIGTERM terminates the target process. SIGINTR and SIGBREAK are fielded by the ancestor process which has registered an appropriate handler; if this ancestor is CMD.EXE or the Presentation Manager shell, SIGBREAK and SIGINTR are translated to SIGTERM. SIGBROKENPIPE and the Event Flag signals, on the other hand, are by default discarded.
From the preceding discussion and the summary in Table 1, it is clear that the characteristics of OS/2's various IPC facilities vary drastically. Yet, at least several of them can be made to do essentially the same job. How does one assess their relative performance and suitability for a specific application? The OS/2 documentation gives little guidance here, except to note in passing that RAM semaphores are faster than system semaphores, semaphores in general are faster than everything else, and pipes are faster than queues.
Table 1: Characteristics of the OS/2 IPC mechanisms.

IPC Mechanism             Global Name Form   Resident/        Maximum Data Held
                                             Swappable
-------------------------------------------------------------------------------
RAM Semaphore             not applicable     Swappable        set/clear or
                                                              owned/available
Fast-Safe RAM Semaphore   not applicable     Swappable        owned/available
System Semaphore          \SEM\name          Resident         set/clear or
                                                              owned/available
Anonymous Pipe            not applicable     Resident         64 Kbyte
Named Pipe                \PIPE\name         Resident         64 Kbyte
Anonymous Shared Memory   not applicable     Swappable        limited only by
                                                              virtual memory
Named Shared Memory       \SHAREMEM\name     Swappable        64 Kbyte per
                                                              named segment
Queue                     \QUEUES\name       Swappable        limited only by
                                                              virtual memory
Signal (Event Flag)       not applicable     not applicable   16 bits passed
                                                              with signal
In order to get a feel for these issues, I carried out some simple timings on the most commonly used IPC methods, which I will describe shortly. The timings were obtained on an IBM PS/2 Model 80 at 16 MHz with 4 Mbytes of RAM, running under IBM's OS/2 Standard Edition, Version 1.1. The relevant CONFIG.SYS parameters were:
BUFFERS=30
BREAK=OFF
DISKCACHE=64
IOPL=YES
MAXWAIT=3
MEMMAN=SWAP,MOVE
PROTECTONLY=NO
RMSIZE=640
THREADS=128
The only significant processes that were running during the timings were the Presentation Manager shell and two instances of LMI UR/FORTH in PM windows. I judged the system to be lightly loaded, a conclusion supported by my observation that no swapping occurred during the timings (as evidenced by the fixed disk light) and by the fact that the DosMemAvail function returned the size of the largest block of available physical memory as 1,367,520 bytes.
The programs used to obtain the timings were written in LMI UR/FORTH, my own company's protected mode Forth interpreter/compiler for OS/2. Forth is an ideal language for this sort of system probing because it is fast enough for real-time work, yet it affords interactive, direct access to all operating system functions.
Let's look first at the semaphore family. To appraise the relative speeds of system, RAM, and Fast-Safe RAM semaphores for both the "signalling" and "mutual exclusion" models, I timed 100,000 request/release cycles and set/clear cycles for each semaphore type (Table 2). The tare time for the loop was determined by substituting a dummy function for each system call that simply returned a success status; this time was then subtracted from the total before calculating the cycles per second.
Table 2: Semaphore performance on a lightly loaded system.

Semaphore Type            Request/Release     Set/Clear
                          Cycles per Second   Cycles per Second
------------------------------------------------------------------
RAM Semaphore                  16,507             17,156
Fast-Safe RAM Semaphore        17,066             not applicable
System Semaphore                7,464              7,532
As you can see from Table 2, the difference between the performance of system and RAM semaphores is not nearly as great as you might expect from reading the OS/2 technical manuals. Your selection of system, RAM, or Fast-Safe RAM semaphores should really be made on other grounds. I have already mentioned some of the important differences (counting and cleanup), but there are additional subtle differences that might prove important in a real-life project.
First, the apparent performance advantage of RAM semaphores in a lightly loaded system cannot be generalized to a heavily loaded system. System semaphores are implemented in fixed, non-swappable memory owned by the operating system; the access time to a system semaphore will always be consistent. In contrast, RAM semaphores are located in memory owned by a process -- which is by default moveable and swappable. If the segment containing a RAM semaphore has been swapped out to disk, a reference to the semaphore could be delayed for an unpredictable length of time (on the order of tens or even hundreds of milliseconds) until the virtual memory manager can roll the segment back into physical memory.
Another important aspect of system semaphores is that they are implemented in memory below the 640K-byte boundary, so that they can be addressed in either real mode or protected mode. This is vital if you wish to use semaphores to communicate between a closely coupled process and a device driver, and the driver might need to manipulate the semaphore while servicing a hardware interrupt, because the CPU mode at the time of an interrupt cannot be predicted.
Finally, we should note that the location of system semaphores in physical memory severely constrains the number that OS/2 can make available. The memory below the 640K-byte boundary is dear, because it must be conserved for the execution of real-mode programs in the DOS Compatibility Environment. Consequently, the maximum number of system semaphores is 128 in OS/2, Version 1.0, and 256 in OS/2, Version 1.1, and many of these are used up by the operating system itself. If you need large numbers of semaphores in your application, you will have to use RAM or Fast-Safe RAM semaphores and simply work around their other limitations.
As I thought about assessing the relative throughput of message passing using shared memory, pipes, and queues, I realized that simplistic timings of system calls would not be very helpful. The amount of tangential work that is associated with the use of these IPC mechanisms can be fairly extensive (allocating and deallocating memory segments, setting and clearing semaphores to control access to shared segments, copying data to and from local buffers, and so on).
Eventually, I settled upon a timing model which, I think, is at least reasonably parallel to the IPC performed by real applications. I obtained each set of timings by running two processes, a parent and a child. The parent's only function was to launch the child, then serve as a message turnaround point. As the parent received each message from its child via the IPC mechanism under test, it would simply do whatever was necessary to ship the message back to the child again (a more detailed sketch of the timing procedure for each IPC method can be found in Figure 1, Figure 2, and Figure 3). A consistent message size of 512 bytes was used.
The results, which are reported in Table 3, are based on 100,000 message round-trips --from child to parent and back again. The tare times were found and subtracted using equivalent loops where the system calls had been replaced with dummy functions that returned a success status or other reasonable result.
Table 3: Message-passing throughput for 512-byte messages.

IPC Method          Message Round-Trips
                    per Second
------------------------------------------------------------
Shared Memory               661
Anonymous Pipe              346
Queue                        76
IPC performance via shared memory segments, even with the overhead of system calls to set and clear the RAM semaphores that synchronize access to the segments, is seen to be far faster than either pipes or queues. In fact, because processes can easily simulate the behavior of a pipe by explicitly controlling a ring buffer in a shared segment, the use of pipes for any reason other than "transparent" communication with an oblivious child process is probably ill-advised.
Communication by queues turns out, as expected, to be the slowest method. It is an order of magnitude slower than IPC using shared memory, and two orders of magnitude slower than signalling with system semaphores. It seems clear that IPC with queues should be reserved for those occasions where message prioritizing and selective message scanning and extraction are really needed. The complexity of queue manipulation, the number of system calls involved, and the relatively heavy demand for system resources, such as sharable selectors, should deter you from casual use of queues.
As with the semaphores, these comparisons on a lightly-loaded system could turn out quite differently on a heavily-loaded system, where applications have over-committed virtual memory and the virtual memory manager and swapper are constantly busy. Pipe performance should be relatively consistent, because the system buffers used by pipes are not swappable. On the other hand, named shared memory segments, and the giveable shared segments used in queue messages, are swappable, so IPC performance via shared memory or a queue could be quite erratic depending on swapper activity, thread priorities, and so on.
Although OS/2 has gotten off to a slow start, its eventual importance in the desktop computer world can no longer be doubted. I feel strongly that the appearance of the high-performance file system (HPFS) and 80386-specific versions over the next year or so will make it the platform of choice for software developers. Users will migrate more slowly (we have the history of the Macintosh to guide us here), but the benefits of OS/2's multitasking, virtual memory, and graphical user interface will eventually draw them in.
With such a complex system, though, the ad hoc design methods we all used in the CP/M and MS-DOS days will no longer cut the mustard. We need detailed and reliable metrics that can help us make tradeoffs between code size, code complexity, and code performance at every level of an application --in short, we need an understanding of the operating system's overall behavior that has never before been necessary in the microcomputer world. The timings presented in this article are crude and their scope is narrow, but perhaps (with luck) they will inspire successor articles by wiser and more experienced DDJ readers!
Copyright © 1989, Dr. Dobb's Journal