STRUCTURED PROGRAMMING

Chasing Bubbles in the Waterbed

Jeff Duntemann, K16RA/7

Cold fusion, huh? Well, I can top that: I have a waterbed that breaks down water into its component gases without electricity. I'm not sure what these gases are. They could be hydrogen and oxygen, or I suppose they could as well be carbon dioxide and xenon. I'm not macho enough to perform any conclusive experiments. About all I am sure of is that this particular waterbed has generated what seems like hundreds of cubic feet of invisible gas since being filled with equal parts water and Algae-Go-Bye-Bye in May of 1987.

Every week or so I have to chase Mr. Byte off the comforter, yank the bed-clothes off the waterbed, and wheedle ten zillion little bubbles around under the vinyl with my old slide rule, merging tiny bubbles into small bubbles, small bubbles into larger bubbles, larger bubbles into Godzillan bubbles, and finally letting the single monstrous survivor out through the fill hole. Otherwise, an ordinary night between the sheets can sound like one of those old Sea Hunt episodes.

Dynamic Variables Recap

Ever alert for the occasional wild metaphor roosting in the rafters, it occurred to me that we have this same problem in the structured programming world. The faulty waterbed is the heap, and the bubbles -- like the 'ole that Ringo 'ad in his pocket in Yellow Submarine -- are simply blocks of empty space that have been used for awhile and then turned loose.

First, a little background for those who work in languages such as Basic that do not support explicit memory allocation and deallocation. Pascal and Modula-2 (and C as well) set aside a certain amount of memory called the "heap" and allow the running program to create variables out of that memory (we say "on the heap") as needed. These variables are called dynamic variables to differentiate them from static variables, which are named in the source code and allocated at compile time, and exist in the same form and in the same location as long as the program runs. When no longer needed, dynamic variables are destroyed, or deallocated, and the memory they occupied on the heap is made available as grist from which to create other variables. In both Pascal and Modula-2 the allocation routine is called New, and the deallocation routine is called Dispose. (Turbo and QuickPascal have a similar pair of routines called GetMem and FreeMem as well.)

Variables created on the heap have no names. They are accessed through a type of variable called a "pointer" that exists solely to "tie" these dynamic variables into the program's reality. The pointer (which is declared in the usual way variables are declared) is set to point to the dynamic variable when the dynamic variable is allocated using New:

   VAR
       {A pointer to type Integer:}
       MyPointer: ^Integer;
   .
   .
   .
   New(MyPointer);
   MyPointer^: = 42

The caret symbol is the dereference operator. To specify what the pointer points to (rather than to specify the pointer itself) you must use the dereference operator, as shown in the assignment statement. Accessing the item that a pointer points to is called "dereferencing the pointer" and the item pointed to is called the "pointer's referent."

If you could only create dynamic variables as referents of pointers allocated as static variables, it would be interesting but not compellingly useful. Pointers really shine when they are used to connect the components of entire data structures constructed on the heap. Linked lists, queues, stacks, and other data structures can be built from simple data structures and pointers. This is done by making pointers the fields of records:

TYPE
       DynaPtr = ^DynaRec;

      DynaRec =
            RECORD
                  StrData: STRING;
     Next: DynaPtr;
            END

In a linked list, several of these records are allocated on the heap such that the Next pointer in each record points to the "next" record in the list, allocated elsewhere on the heap. The pointer in the last record is set to a value of NIL, which is a sentinel value indicating that the pointer points to nothing.

Standard procedures exist to query the system and find out how much free heap memory is available. The MemAvail function returns the total number of bytes of free memory on the heap. The MaxAvail function returns the size in bytes of the largest single chunk of memory on the heap.

Heaps of Holes

This is a powerful concept, but there's a worm in it: When you deallocate items stored on the heap, the memory where the item had been, becomes a hole," the exact size of the item. This hole is available for use in allocating future dynamic variables, but obviously, you can't allocate a variable any larger than the variable that created the hole by going poof.

This sounds worse than it often is. Items are allocated on the heap in order, side by side. If you allocate 100 records in a row, use them, and then deallocate all 100 records at once; the "hole" resulting from the deallocation will be contiguous, the size of the full 100 records. Problems arise when you allocate a great many items of varying sizes on the heap, and deallocate some but not others, especially if the deallocation is fairly random. Allocating a record in a hole left by a slightly larger record can leave a tiny sliver of unused memory that isn't good for anything at all. Do this often enough, and you may have a great deal of free memory that can't be used at all because it exists in a large number of very small chunks.

This condition is called "heap fragmentation" because the heap becomes divided into a great many separate fragments of memory in varyingly useful sizes. Figure 1 represents this condition in Pascal or Modula-2. The shaded areas represent memory that is in use, pointed to by one of the pointers on the left. The white areas represent memory that is free for use. Note that some of the slices are pretty thin, and may not be large enough to hold anything useful to the currently running application.

The Long Wait for the Garbage Man

So, what happens if you call MaxAvail and the size of the largest free chunk of heap memory is less than the size of the item you need to allocate on the heap? Not much. The best you can do is to try deallocating some other items on the heap and hope that they'll open up a hole large enough to fit the item you need to create. If there's nothing you can turn loose, you're stuck. Really.

When confronted with this problem, most people assume there must be a way to rearrange the blocks of memory on the heap, packing them up nose-to-tail so that all the little empty slivers of memory collect in one place and add up to a single large block of useful space again. It's pretty disconcerting to realize that there is absolutely no general-purpose way to do this in Pascal or Modula-2. (Or C, either.) Every trick involves making assumptions about the order in which items were allocated on the heap and other information that varies from application to application or even from one run time to the next.

This Holy Grail of gathering together all the bubbles in the great waterbed of heap memory is what we call "garbage collection." It is one of the most wretchedly difficult things to do in all computer science, and in many languages it is simply impossible.

Why? It has to do with the way pointers are implemented in our most common implementations of Pascal and Modula-2. Pointers are simply machine addresses. The "long" pointers used in Modula-2 (and all pointers in Turbo and QuickPascal) are 32-bit addresses consisting of a 16-bit segment address and a 16-bit offset address. So a pointer is just 4 bytes somewhere in the megabyte of 8086 real address space, containing the address of the memory block that the pointer points to. (A NIL pointer contains 4 bytes of 0s.)

The problem with rearranging the heap lies in notifying the pointers that the addresses of their referents have changed. A pointer is an address, so you would have to change the value of every pointer that pointed to any block of heap memory involved in a move. The problem with that is simply finding all the pointers. Although you can always find a pointer's referent starting from the pointer, you can't work back from a block of memory to find all pointers that point to it. Keep in mind that any number of pointers may point to a single block of memory on the heap, and those pointers may be anywhere in memory at all, including on the heap, in the code segment, or on the stack. Nothing marks a pointer as being a pointer; furthermore, nearly all 4-byte sequences in memory contain binary patterns that could represent valid addresses.

So there may be tens of thousands of pointers scattered through the megabyte of real address space, and they look no different from 4 bytes of machine code, 4 bytes of data, or 4 bytes of anything. Finding pointers just by looking for them is thus meaningless. And if you can't find every pointer that points to a block of data, you had better leave that block of data right where it is. That in a nutshell is why garbage collection for the standard Pascal/Modula-2 heap is impossible.

What Those Other Guys Do

Outside of the Pascal/C/Modula world things are considerably better. Small-talk and Actor both support fully general automatic garbage collection on their dynamic storage. One reason both pick up their trash is that they have to; both are large, ambitious systems that take up a lot of memory. Without garbage collection, they would drink their heap dry in no time flat.

Actor contains a mechanism that constantly scans its stack and dictionary, looking for "dead" objects to which no references are found. Actor's dictionary is roughly equivalent to a symbol table. If a block of memory is allocated for an object, and that object is not referenced in the dictionary or by any other object, the block is scavenged and made available for new objects.

This happens automatically, and the process continues in the background throughout the execution of an Actor program. This way, the garbage collection overhead (which takes an irritatingly large fraction of the CPU cycles devoted to a program's execution) is spread out evenly through a program's life and is thus less noticeable. Some older systems performed garbage collection all at once, which caused the executing program to stop dead in its tracks (sometimes for minutes at a time) while the garbage collector fiddled memory around. This happened to me once back in my Xerox days while I was learning some internal revision of Small-talk on a creaky old Alto workstation. I was certain the system had croaked, but lo! It was only picking up its dirty socks, and came back to me after what had to be six or seven minutes.

Actor's garbage collector works so well that there is no dispose message to clean up after allocations triggered by New, when a program no longer needs to work with an object, it cuts the object's space loose by setting its internal state to Nil. The scavenger then picks up the trash on its next pass.

Smalltalk/V's documentation says less about its garbage collection system than Actor's, but things seem to work in about the same way: When an object is no longer referenced, it is removed from memory and the memory is reclaimed. Unused symbols do accumulate, however, and must be explicitly purged and their space reclaimed by sending the purgeUnusedSymbols message to the Smalltalk system.

The Secret of the Middleman

There are lots of ways to implement garbage collection, most of them far beyond my understanding. What I will explain, though, is the minimum machinery that makes garbage collection possible. What you need, basically, is a middleman.

I've drawn such a middleman in Figure 2. Between the pointers scattered through the system and the heap itself is an array of handles. I've shown the handles as pointers themselves, but they aren't pointers in the same physical sense that Pascal and Modula-2 pointers are. A handle is means of access. It could be a pointer to the actual block of heap memory, or it could be a pointer into some larger mechanism that maps blocks of storage onto disk sectors, or EMS or extended memory. Also, a handle must contain the size of the memory block it points to. I've not shown this in the figure for simplicity's sake, but a handle might be a very simple record containing a pointer to the memory block on the heap and a 16-bit block size value.

A handle's primary virtues are two: First of all, for each block of memory there is only one handle, and second, the system always knows where every handle is. I show an array of handles in the figure for simplicity; in many cases the handles will be arranged as a linked list. The important thing is that the handles are stored in a form that is always under the system's control.

A handle either points to a block of memory on the heap or it is set to NIL to indicate that the handle is free and may be used. Note in Figure 2 that in two cases, two different pointers point to the same handle. This is entirely equivalent to the situation in Figure 1 where two pointers pointed to the same block of memory on the heap. Any number of pointers may point to one handle, but only one handle may point to any given block of memory.

Packing the Heap

The heap shown in Figure 2 is identical to that shown in Figure 1, that is, fragmented to the point of being useless. Unlike Figure 1, however, the fragmentation in Figure 2 can be fixed by a little scavenging. Because the system knows where every handle is located, and because each handle points to one memory block and each memory block has only one handle pointing to it, the system can eliminate wasted space by moving memory blocks together on the heap until they become contiguous. This condition is shown in Figure 3. All of the available heap space is now in one large block, and no small slivers of memory remain wasted.

In a practical system, a few more things would be necessary, such as a "free list" to keep track of available blocks. Both Turbo Pascal and QuickPascal use such a list, a simple linked list of records located on the heap.

Could a system such as this be built into a Pascal or Modula-2 compiler? Of course. The cost is in speed, and, to a lesser extent, in the memory needed by the handles and the code to manipulate them. Dereferencing a pointer to a pointer involves two separate memory accesses instead of only one. Also, as mentioned before, the garbage collection task itself takes some time, but with some cleverness can be spread so thin as to hardly be noticeable, a la Actor.

Breaking Old Habits

The real problem, though, is breaking current code. Turbo Pascal, in particular, allows a lot of direct manipulation of pointers. If it were as simple as generating handle-manipulation code for New, Dispose, and the dereference operator, there'd be little difficulty with compatibility. However, a great deal of Turbo Pascal code (including a lot of my own) make assumptions about the nature of pointers that would not jive with a handle-based heap manager.

The most obvious case is building pointers from segment and offset addresses using the Ptr function. How many times have you done something like this:

     VAR
            Display: Pointer;
            .
            .
            .
      Display := Ptr($B800,0);

The last thing you want is for a heap manager to decide to move your video refresh buffer somewhere a little closer to the other bubbles.

The kicker is that pointers are often used for things that have nothing at all to do with the heap. And as long as pointers point to things that exist at fixed locations and cannot (or should not) be moved, using handles for heap management with traditional pointer and dereferencing syntax is going to be difficult indeed.

The Trouble with Objects

I'm making a very big deal of all this because sometime soon the matter of heap fragmentation is going to become a very serious problem for the new-born object-oriented languages. I won't speak for C++ because I don't understand it very well as yet (though I'm trying) but for Turbo and QuickPascal we're headed for trouble.

Heap fragmentation has been with us from the beginning, but it's attracted little attention for two major reasons:

    1. Many self-taught Pascal programmers hit a conceptual wall when they encounter pointers, and rather than figure them out or yell for help, simply work around them. Thus, a great many Pascal applications don't make use of the heap at all.

    2. Seasoned Pascal developers who use the heap heavily have worked out tricks to deal with heap fragmentation. These include padding records out so that most records allocated on the heap are either the same size or convenient multiples of the size of the smallest record, or massaging an algorithm such that records are allocated and deallocated in order rather than at random. "Heap discipline" of this sort is second nature to longtime Pascal developers, and is considered by many to be part of good structured programming practice, even though it violates the spirit of true dynamic allocation.

Unfortunately, when you start dealing with objects in a big way, this kind of heap discipline no longer works. A linked list of objects no longer contains items all of one type. As long as all objects in a list are descended from a common ancestor (such as the Node type shipped as an example with Turbo Pascal 5.5) polymorphism allows the developer to stop worrying about the type of the nodes in a list and let the nodes handle their own business through virtual methods. When the application has to have carnal knowledge of the contents of a list in order to make things happen, most of the benefits of object-oriented techniques get lost. Much of OOP's novelty lies in giving individual objects more autonomy, but heap discipline generally means orchestrating heap management in ways that individual objects -- being just parts of a larger whole -- cannot accomplish.

Compounding the problem is the fact that programmers can no longer ignore the heap once they start using OOP. In QuickPascal all objects are allocated on the heap. You don't get any choices. In Turbo Pascal, objects may be defined statically in the data segment without involving the heap if you like, but such objects may not take part in polymorphic algorithms and are considerably less useful than objects on the heap.

So picture it: An ambitious application consisting of dozens or hundreds of objects popping into being on the heap or poofing into holes more or less at random, without the application's being fully aware of individual objects' exact types and hence their sizes. The old tricks no longer work, and (to state it in line with the metaphor of active data) the heap fragments itself to uselessness in record time.

The Tyranny of the Installed Base

Both Turbo Pascal 5.5 and QuickPascal suffer from this problem, but if a solution is out there, QuickPascal will be the easier fix by far. Because the installed base for QuickPascal is counted in the tens of thousands rather than in the many hundreds of thousands, there is less QuickPascal code to break and fewer QuickPascal users to alienate if a fundamental syntactic change must be made to the language to allow a handle-based heap manager.

But -- most remarkably -- QuickPascal may not need to make any syntactic changes at all. What I originally thought was oversight or sheer clumsiness in the Apple Pascal definition (which QuickPascal follows closely) might hold the solution to the whole mess.

Recall the inconsistent nature of object definition in QuickPascal. Objects are defined like records, allocated like pointers, and then used like records:

   TYPE
          Figure = OBJECT
                     X,Y: Integer;
                     Visible: Boolean;
                     PROCEDURE Show;
                     PROCEDURE Hide;
                 END;

   VAR
                     MyObject = Figure;
          .
          .
          .
          New(MyObject);
          MyObject.A := 17;
          MyObject.Y := 42;
          MyObject.Show;

Ordinarily, New works only with pointers, but QuickPascal allows New to take an object variable identifier as though it were a pointer. In fact, beneath the surface the variable MyObject is a pointer, but once passed to New it can be used as though it were the object itself and not simply a pointer to a nameless block of memory on the heap.

What's important is that, regardless of its physical implementation, a QuickPascal object is not accessed through pointers. Unless you explicitly declare an object as the referent of a pointer type, the dereference operator is not required to access the object's methods and fields. This means that, if Microsoft chose to do so, it could isolate objects in an objects-only heap, and implement a handle-based manager for objects that could include automatic garbage collection. The name of the object would become a pointer to a handle, which would then point to the object's actual location on the heap. This does not involve redefining pointer syntax or the dereference operator, and would not break a single line of Version 1.0 code. For QuickPascal it would solve the heartbreak of heap fragmentation for objects, which is itchier than psoriasis and lots harder to get rid of.

I won't say that this was what the designers of Apple Pascal had in mind when they chose their somewhat idiosyncratic syntax for object creation (though if they contact me I'd love to ask them) but it certainly is a potential solution aching to be implemented. It will have some costs in performance, but I suppose we were naive in assuming that the amazing flexibility of polymorphism would come for free.

Turbo Pascal's designers will have a much harder row to hoe. There is no solution that won't involve breaking megalines of existing code. They're clever, those Borlanders; as clever as you find in this business and then some, but the tyranny of the installed base is going to make the Big Fragmentation Fix for Turbo Pascal a painful one all the way around.

Would that it were as simple as chasing bubbles in a waterbed.