STRUCTURED PROGRAMMING

Chasing Bubbles in the Waterbed

Jeff Duntemann, K16RA/7

Cold fusion, huh? Well, I can top that: I have a waterbed that breaks down water into its component gases without electricity. I'm not sure what these gases are. They could be hydrogen and oxygen, or I suppose they could as well be carbon dioxide and xenon. I'm not macho enough to perform any conclusive experiments. About all I am sure of is that this particular waterbed has generated what seems like hundreds of cubic feet of invisible gas since being filled with equal parts water and Algae-Go-Bye-Bye in May of 1987.

Every week or so I have to chase Mr. Byte off the comforter, yank the bed-clothes off the waterbed, and wheedle ten zillion little bubbles around under the vinyl with my old slide rule, merging tiny bubbles into small bubbles, small bubbles into larger bubbles, larger bubbles into Godzillan bubbles, and finally letting the single monstrous survivor out through the fill hole. Otherwise, an ordinary night between the sheets can sound like one of those old Sea Hunt episodes.

Dynamic Variables Recap

Ever alert for the occasional wild metaphor roosting in the rafters, it occurred to me that we have this same problem in the structured programming world. The faulty waterbed is the heap, and the bubbles -- like the 'ole that Ringo 'ad in his pocket in Yellow Submarine -- are simply blocks of empty space that have been used for awhile and then turned loose.

First, a little background for those who work in languages such as Basic that do not support explicit memory allocation and deallocation. Pascal and Modula-2 (and C as well) set aside a certain amount of memory called the "heap" and allow the running program to create variables out of that memory (we say "on the heap") as needed. These variables are called dynamic variables to differentiate them from static variables, which are named in the source code and allocated at compile time, and exist in the same form and in the same location as long as the program runs. When no longer needed, dynamic variables are destroyed, or deallocated, and the memory they occupied on the heap is made available as grist from which to create other variables. In both Pascal and Modula-2 the allocation routine is called New, and the deallocation routine is called Dispose. (Turbo and QuickPascal have a similar pair of routines called GetMem and FreeMem as well.)

Variables created on the heap have no names. They are accessed through a type of variable called a "pointer" that exists solely to "tie" these dynamic variables into the program's reality. The pointer (which is declared in the usual way variables are declared) is set to point to the dynamic variable when the dynamic variable is allocated using New:

   VAR
       {A pointer to type Integer:}
       MyPointer: ^Integer;
   .
   .
   .
   New(MyPointer);
   MyPointer^: = 42

The caret symbol is the dereference operator. To specify what the pointer points to (rather than to specify the pointer itself) you must use the dereference operator, as shown in the assignment statement. Accessing the item that a pointer points to is called "dereferencing the pointer" and the item pointed to is called the "pointer's referent."

If you could only create dynamic variables as referents of pointers allocated as static variables, it would be interesting but not compellingly useful. Pointers really shine when they are used to connect the components of entire data structures constructed on the heap. Linked lists, queues, stacks, and other data structures can be built from simple data structures and pointers. This is done by making pointers the fields of records:

TYPE
       DynaPtr = ^DynaRec;

      DynaRec =
            RECORD
                  StrData: STRING;
     Next: DynaPtr;
            END

In a linked list, several of these records are allocated on the heap such that the Next pointer in each record points to the "next" record in the list, allocated elsewhere on the heap. The pointer in the last record is set to a value of NIL, which is a sentinel value indicating that the pointer points to nothing.

Standard procedures exist to query the system and find out how much free heap memory is available. The MemAvail function returns the total number of bytes of free memory on the heap. The MaxAvail function returns the size in bytes of the largest single chunk of memory on the heap.

Heaps of Holes

This is a powerful concept, but there's a worm in it: When you deallocate items stored on the heap, the memory where the item had been, becomes a hole," the exact size of the item. This hole is available for use in allocating future dynamic variables, but obviously, you can't allocate a variable any larger than the variable that created the hole by going poof.

This sounds worse than it often is. Items are allocated on the heap in order, side by side. If you allocate 100 records in a row, use them, and then deallocate all 100 records at once; the "hole" resulting from the deallocation will be contiguous, the size of the full 100 records. Problems arise when you allocate a great many items of varying sizes on the heap, and deallocate some but not others, especially if the deallocation is fairly random. Allocating a record in a hole left by a slightly larger record can leave a tiny sliver of unused memory that isn't good for anything at all. Do this often enough, and you may have a great deal of free memory that can't be used at all because it exists in a large number of very small chunks.

This condition is called "heap fragmentation" because the heap becomes divided into a great many separate fragments of memory in varyingly useful sizes. Figure 1 represents this condition in Pascal or Modula-2. The shaded areas represent memory that is in use, pointed to by one of the pointers on the left. The white areas represent memory that is free for use. Note that some of the slices are pretty thin, and may not be large enough to hold anything useful to the currently running application.

I've drawn such a middleman in Figure 2. Between the pointers scattered through the system and the heap itself is an array of handles. I've shown the handles as pointers themselves, but they aren't pointers in the same physical sense that Pascal and Modula-2 pointers are. A handle is means of access. It could be a pointer to the actual block of heap memory, or it could be a pointer into some larger mechanism that maps blocks of storage onto disk sectors, or EMS or extended memory. Also, a handle must contain the size of the memory block it points to. I've not shown this in the figure for simplicity's sake, but a handle might be a very simple record containing a pointer to the memory block on the heap and a 16-bit block size value.

A handle's primary virtues are two: First of all, for each block of memory there is only one handle, and second, the system always knows where every handle is. I show an array of handles in the figure for simplicity; in many cases the handles will be arranged as a linked list. The important thing is that the handles are stored in a form that is always under the system's control.

A handle either points to a block of memory on the heap or it is set to NIL to indicate that the handle is free and may be used. Note in Figure 2 that in two cases, two different pointers point to the same handle. This is entirely equivalent to the situation in Figure 1 where two pointers pointed to the same block of memory on the heap. Any number of pointers may point to one handle, but only one handle may point to any given block of memory.

Packing the Heap

The heap shown in Figure 2 is identical to that shown in Figure 1, that is, fragmented to the point of being useless. Unlike Figure 1, however, the fragmentation in Figure 2 can be fixed by a little scavenging. Because the system knows where every handle is located, and because each handle points to one memory block and each memory block has only one handle pointing to it, the system can eliminate wasted space by moving memory blocks together on the heap until they become contiguous. This condition is shown in Figure 3. All of the available heap space is now in one large block, and no small slivers of memory remain wasted.

In a practical system, a few more things would be necessary, such as a "free list" to keep track of available blocks. Both Turbo Pascal and QuickPascal use such a list, a simple linked list of records located on the heap.

Could a system such as this be built into a Pascal or Modula-2 compiler? Of course. The cost is in speed, and, to a lesser extent, in the memory needed by the handles and the code to manipulate them. Dereferencing a pointer to a pointer involves two separate memory accesses instead of only one. Also, as mentioned before, the garbage collection task itself takes some time, but with some cleverness can be spread so thin as to hardly be noticeable, a la Actor.

Breaking Old Habits

The real problem, though, is breaking current code. Turbo Pascal, in particular, allows a lot of direct manipulation of pointers. If it were as simple as generating handle-manipulation code for New, Dispose, and the dereference operator, there'd be little difficulty with compatibility. However, a great deal of Turbo Pascal code (including a lot of my own) make assumptions about the nature of pointers that would not jive with a handle-based heap manager.

The most obvious case is building pointers from segment and offset addresses using the Ptr function. How many times have you done something like this:

     VAR
            Display: Pointer;
            .
            .
            .
      Display := Ptr($B800,0);

The last thing you want is for a heap manager to decide to move your video refresh buffer somewhere a little closer to the other bubbles.

The kicker is that pointers are often used for things that have nothing at all to do with the heap. And as long as pointers point to things that exist at fixed locations and cannot (or should not) be moved, using handles for heap management with traditional pointer and dereferencing syntax is going to be difficult indeed.

The Trouble with Objects

I'm making a very big deal of all this because sometime soon the matter of heap fragmentation is going to become a very serious problem for the new-born object-oriented languages. I won't speak for C++ because I don't understand it very well as yet (though I'm trying) but for Turbo and QuickPascal we're headed for trouble.

Heap fragmentation has been with us from the beginning, but it's attracted little attention for two major reasons:

1. Many self-taught Pascal programmers hit a conceptual wall when they encounter pointers, and rather than figure them out or yell for help, simply work around them. Thus, a great many Pascal applications don't make use of the heap at all.

2. Seasoned Pascal developers who use the heap heavily have worked out tricks to deal with heap fragmentation. These include padding records out so that most records allocated on the heap are either the same size or convenient multiples of the size of the smallest record, or massaging an algorithm such that records are allocated and deallocated in order rather than at random. "Heap discipline" of this sort is second nature to longtime Pascal developers, and is considered by many to be part of good structured programming practice, even though it violates the spirit of true dynamic allocation.

Unfortunately, when you start dealing with objects in a big way, this kind of heap discipline no longer works. A linked list of objects no longer contains items all of one type. As long as all objects in a list are descended from a common ancestor (such as the Node type shipped as an example with Turbo Pascal 5.5) polymorphism allows the developer to stop worrying about the type of the nodes in a list and let the nodes handle their own business through virtual methods. When the application has to have carnal knowledge of the contents of a list in order to make things happen, most of the benefits of object-oriented techniques get lost. Much of OOP's novelty lies in giving individual objects more autonomy, but heap discipline generally means orchestrating heap management in ways that individual objects -- being just parts of a larger whole -- cannot accomplish.

Compounding the problem is the fact that programmers can no longer ignore the heap once they start using OOP. In QuickPascal all objects are allocated on the heap. You don't get any choices. In Turbo Pascal, objects may be defined statically in the data segment without involving the heap if you like, but such objects may not take part in polymorphic algorithms and are considerably less useful than objects on the heap.

So picture it: An ambitious application consisting of dozens or hundreds of objects popping into being on the heap or poofing into holes more or less at random, without the application's being fully aware of individual objects' exact types and hence their sizes. The old tricks no longer work, and (to state it in line with the metaphor of active data) the heap fragments itself to uselessness in record time.

The Tyranny of the Installed Base

Both Turbo Pascal 5.5 and QuickPascal suffer from this problem, but if a solution is out there, QuickPascal will be the easier fix by far. Because the installed base for QuickPascal is counted in the tens of thousands rather than in the many hundreds of thousands, there is less QuickPascal code to break and fewer QuickPascal users to alienate if a fundamental syntactic change must be made to the language to allow a handle-based heap manager.

But -- most remarkably -- QuickPascal may not need to make any syntactic changes at all. What I originally thought was oversight or sheer clumsiness in the Apple Pascal definition (which QuickPascal follows closely) might hold the solution to the whole mess.

Recall the inconsistent nature of object definition in QuickPascal. Objects are defined like records, allocated like pointers, and then used like records:

   TYPE
          Figure = OBJECT
                     X,Y: Integer;
                     Visible: Boolean;
                     PROCEDURE Show;
                     PROCEDURE Hide;
                 END;

   VAR
                     MyObject = Figure;
          .
          .
          .
          New(MyObject);
          MyObject.A := 17;
          MyObject.Y := 42;
          MyObject.Show;

Ordinarily, New works only with pointers, but QuickPascal allows New to take an object variable identifier as though it were a pointer. In fact, beneath the surface the variable MyObject is a pointer, but once passed to New it can be used as though it were the object itself and not simply a pointer to a nameless block of memory on the heap.

What's important is that, regardless of its physical implementation, a QuickPascal object is not accessed through pointers. Unless you explicitly declare an object as the referent of a pointer type, the dereference operator is not required to access the object's methods and fields. This means that, if Microsoft chose to do so, it could isolate objects in an objects-only heap, and implement a handle-based manager for objects that could include automatic garbage collection. The name of the object would become a pointer to a handle, which would then point to the object's actual location on the heap. This does not involve redefining pointer syntax or the dereference operator, and would not break a single line of Version 1.0 code. For QuickPascal it would solve the heartbreak of heap fragmentation for objects, which is itchier than psoriasis and lots harder to get rid of.

I won't say that this was what the designers of Apple Pascal had in mind when they chose their somewhat idiosyncratic syntax for object creation (though if they contact me I'd love to ask them) but it certainly is a potential solution aching to be implemented. It will have some costs in performance, but I suppose we were naive in assuming that the amazing flexibility of polymorphism would come for free.

Turbo Pascal's designers will have a much harder row to hoe. There is no solution that won't involve breaking megalines of existing code. They're clever, those Borlanders; as clever as you find in this business and then some, but the tyranny of the installed base is going to make the Big Fragmentation Fix for Turbo Pascal a painful one all the way around.

Would that it were as simple as chasing bubbles in a waterbed.