Shared Memory and PC Supercomputing

Achieving gigaflop performance on a PC

Stephen Fried

Steve is the president of Microway Inc. and can be reached at P.O. Box 79, Kingston, MA 02364 or at steve@microway.com.

Shared-memory parallel processing is a supercomputer technique used by companies like Cray to build very high-speed computational systems. Shared-memory systems typically employ four to sixteen processors, with the shared memory acting as both a connectivity element and a repository for information. All CPUs in such a system can read or write the shared memory but often employ a hierarchy of other devices to store data--vector registers, data caches, local memory, and crossbar switches. Ideally, these devices reduce average data access time. The more often an item is accessed, the closer it will get placed to the internal units that do the actual computations. Traditionally, shared memory has been limited to supercomputers and super-minis, but it's now making its way into the PC world. This is because of the connectivity advantages it offers over serially linked, distributed-memory parallel architectures which are often too weakly connected to execute fine-grained parallel problems efficiently, especially when the CPUs have both scalar and vector facilities.

In addition to improved connectivity, shared-memory systems are easier to program because the data stored in shared memory can be mapped directly into the global storage structures of high-level languages (Fortran COMMON, for example). Many programmers get reasonable efficiency porting their code to a shared-memory parallel environment by simply using shared memory to hold COMMON and local memory to hold the stack and data structures that are repeatedly accessed. A drawback to shared memory, however, is that the number of CPUs which can be connected together is finite. This isn't a problem for distributed-memory systems, where interconnection bandwidth often scales with the number of CPUs added to the system. This is one of the reasons companies like IBM, with its networks of RS/6000s, and Intel, with its hypercube, have heavily promoted distributed topologies. From a practical perspective, however, parallel versions of many large applications were written for shared-memory supercomputers, and the techniques used to code shared memory often don't map well to distributed-memory architectures. Furthermore, most heavy-duty parallel code is written in Fortran, which, unlike C, doesn't lend itself to distributed processing. In short, there's a schism between distributed- and shared-memory parallel processing that hinges, in part, on the higher-level language of choice.

Shared vs. Distributed Memory Coding Concerns

From the programmer's perspective, there are differences between the shared-memory and distributed-memory approaches to implementing algorithms. The common matrix multiply in Figure 1, for example, depicts three arrays, A, B, and C, stored in the memory of both types of systems just before each of the nodes on the system starts its portion of a matrix multiply. (Figure 1 also shows the number of data-bus or signal lines used to connect the parts of each system together and the bandwidth of each line in Mbytes/sec.) In both systems, A has been divided into quarters. In shared-memory systems where crossbar switches are used to connect multiple banks of memory to multiple processors (not shown), this split isn't needed. However, crossbar switches are prohibitively expensive for the PC market. The key to an efficient parallel matrix multiply is keeping the quarters of A in local RAM. In the distributed-memory system, this is done using message-passing techniques over serial lines. In a shared-memory system (such as the QuadPuter860, an i860-based EISA add-in board my company produces), it's done using block moves from an address in shared memory to local memory. The distributed-memory system also has to scatter the rows of B and gather back the resulting rows of C. The secondary scattering and gathering of data in the shared-memory system is invisible to the program. The rows of B end up in local caches, and if C is stored in noncached shared memory to begin with, it will end up there at the end of the computation. Whichever technique, A must find its way into the highest-speed memory in the system and the rows of B must end up cached between uses.

To determine how efficiently an algorithm runs on a parallel machine, you need to know how long it takes to run on a single CPU, divide by the number of processors, then add in the time to sow the input and harvest back the output. Frequently, the time to start a process can also be important. An i860 can complete a 1024 complex FFT in less than a millisecond: If it takes 15 milliseconds to start a process, the system isn't suitable for doing parallel FFTs. Two of the measures of merit of a parallel system are shown in Table 1 (the top four rows were compiled by Adam Kolawa of Parasoft). The latency is the time required to send a null message and receive back a reply without doing a calculation. The bandwidth of the shared memory (exemplified in the QuadPuter-860) beats most of the distributed-memory systems when it comes to interprocess transfers. Shared-memory supercomputers from companies such as Cray, and i860 VME boards from companies such as Sky use crossbar switches that run at bandwidths over 500 Mbytes per second. While the QuadPuter's five-way arbitrated 64-bit memory probably runs an order of magnitude slower than these heavily connected crossbar systems, the QuadPuter still fares well if you compare its throughput-to-interconnect bandwidth with that of the more expensive systems. The QuadPuter-860 even compares well with i860 supercomputers such as the Intel Delta.

Shared COMMON Memory

Central to utilizing shared memory with Fortran is the i860's virtual memory and Fortran COMMON. Most i860 systems virtualize addressing using the i860's page tables, similar to those of the 80486. Here, physical memory ceases to become a resource of the user, instead becoming an operating-system resource. The trick to using COMMON as an interface between processors running in parallel is to dedicate a region of the shared physical memory to a virtual-address region I call "shared COMMON." This region of memory in the parallel version of OS-860 (Microway's single-threaded kernel) is mapped with a single page table used by all four modules. All accesses to a virtual address in the shared COMMON region are made to the same physical-memory location. Once shared COMMON is implemented, other resources that need to be made available to complete a parallel-

Fortran environment include the ability to send/receive signals between CPU modules and to control whether the pages in shared COMMON are cacheable or not.

Root Nodes and Farming

Since much of the PC parallel market uses MS-DOS, we've developed a methodology for writing shared-memory parallel programs using NDP Fortran (again from Microway) and its i860 run-time support. The technique uses OS-860 to manage the i860's memory, exceptions, and the I/O interface, with DOS running on the host. We borrowed a page out of Inmos T8 history by employing one of the i860 modules as the "root" node. This special node manages all communications between the T800s in a Transputer network and the i860s on the QuadPuter-860. The implication here is that only procedures running on the root node can open or close files or use I/O statements like WRITE or printf(). All other modules have to send messages to the host through the root if they want to communicate with the user. This isn't a limitation if the main applications are numeric intensive.

To illustrate how parallel processing algorithms are organized, I again borrow from Inmos history. The most common metaphor in the transputer business was that of a "farm"; see Figure 2. Since most problems could be mapped onto a "transputer farm," this became the methodology of choice for users who didn't want to learn Occam or get a computer-science degree just for parallel processing. In this paradigm, the root node manages the show, farming out tasks to the four worker nodes, and then harvesting the results. The typical Inmos analogy described farms as a cottage industry, such as sweater knitting. In this model the root node dispatches yarn that gets knitted into bodies, sleeves, and necks by worker nodes. The parts are then passed to a final worker and sewn into sweaters. In many parallel applications, the post-processing stage is nonexistent, and often the root will join the workers after the initial distribution is complete.

Matrix Multiplies on the Farm

The key to doing fast parallel matrix multiplies is having the columns of A stored in high-speed memory. If you examine the problem for vector lengths of 100, for every access of a row of B, you'll make 100 accesses to a column of A. For vector lengths of 40 or less, all of A and at least a row of B will fit into the data cache, and you can run your program out of shared memory without having to worry about data flow. Each processor does one fourth of the dot products, independent of the data-flow strategy. When the vector lengths grow larger than 40, you'll get significant shared-memory and cache thrashing unless you add some sophistication. You can either break A into quarters and move it into shared memory, or "strip mine"--a supercomputer technique in which the data is processed in strips. In this case, each of the CPUs processes strips of A, each containing 2000 elements, or 20 columns. During the first part of a strip-mined matrix multiply, each of the processors multiplies the 20 columns in its cache by all the rows of B. During the second half, the five columns that didn't get processed by the first iteration also get multiplied by all the rows of B. The disadvantage of strip mining as opposed to distributing A into local memory is that all the rows have to be read in for every iteration. As vector lengths increase, a point will come when a jump must be made to storing quarters of A in local memory. As the problem continues to grow even larger, B won't fit in the data cache and the code will have to strip mine B, reading A out of local memory.

A Simple Farming Example

To demonstrate how parallel applications are written with Fortran without resorting to message-passing, I'll examine the parallel computation of fourth-order polynomials using 1000 values of x and 1000 sets of coefficients; see Example 1(a). Our ultimate goal is to make a call from an ordinary main program which carries out this task on four processors, using a worker routine which looks like the original Fortran code.

The worker routine runs on the worker modules and the root itself, which calls a copy of the worker procedure after it has started up the worker modules. POLY_WORKER in Example 1(b) computes 1000 polynomials for a single set of coefficients. It's called as part of a more elaborate process which divides up the tasks among processors and makes calls to the four CPUs with different sets of coefficients.

Distributing four copies of POLY_WORKER and arranging things so that two nearly identical pieces of code can be used on both a single processor and parallel processors is easier said than done. Between the root node (which controls the show) and the worker nodes (which get things done), there's a trail of procedures which set the stage in Example 1(b). The Fortran routines presented here are examples of early QuadPuter code. Since then, we've developed a remote procedure call (RPC) model which uses assembly-coded finite-state machines to hide the Fortran details I'm about to discuss. However, this Fortran code demonstrates how to use signals and shared memory for parallel processing and runs much faster on our hardware than more-sophisticated message-passing techniques which bog down on the i860.

The POLY_WORKER code on each of the worker modules gets called by set-up routines which manage communications with the root module and interface shared memory for POLY_WORKER. Here, there's no movement of arrays from shared to local memory, resulting in a simple WORKER_SET_

UP routine. A worker routine's code begins by waiting at a "start-up barrier" for a signal from the root. When this arrives, it reads shared memory for the coefficients needed by POLY_WORKER along with the start and end indexes and pointers to the input X array and the output RESULT array.

The counterpart of WORKER_SET_UP on the root, which sets up the shared-memory control block, is ROOT_

WORKER_START. The array of shared-memory data structures used to control the workers is the module control block (MCB), located in shared memory, and defined using a Fortran STRUCTURE; see Example 1(c). It contains both the system-wide parameters needed by the worker module and the application-specific information needed by the worker routines. Integers hold pointers to reals (passing pointers in reals is a bad idea in any language). On the root side, I obtain the addresses of arrays like X and RESULT using a VMS Fortran extension, %LOC. On the worker side, I fake out the next module by passing these addresses by value to POLY_WORKER (the Fortran default is to pass by address). This is done using the %VAL VMS extension. Every language used for developing real-world programs ultimately acquires these little tricks that make it possible to get things done without switching tongues.

A Parallel Fortran Run Time

The goal of implementing shared memory on the QuadPuter was to make it possible to run fine-grained vector problems (those in small vector loops can be parallelized) on one to four i860s. To write code that executes on one to four worker modules, it's necessary for the system to know how many workers there are and which module is currently the root. This information makes it possible to write signaling primitives that deal with processors on a logical basis, even though the ultimate code executes using lower-level physical calls ("root wait for worker three" translates into "module 4 wait for module 1," in a system where the root is module 4 and the third worker is worker 1). Logical references make it easier to write code. For example, when you're located on the worker (which is how you feel when you have to write or debug worker code) and you want to send a signal to the root, you shouldn't have to determine who you are and who the root is, but just call a routine which gets the job done. Given these facilities, it's possible to construct worker routines which run on either the root or a worker and which carry out the appropriate tasks.

Logical signaling routines require that when a program gets started, it have some knowledge of the system. The problem is that until all modules in a system have started, the loader won't know what the system contains. There are a couple of solutions to this: You could write a worm-like facility (worms are virus-like programs which can move from processor to processor in a network) or parse the command line at run time. The technique I used to load programs and build a picture of the system was to have each module leave some information behind in shared memory during its start-up phase. When you combine this "left behind" information with the fact that each module is passed its module number on its command line, and that the last module loaded is by definition the root, you have all the ingredients needed by the root to fill in the module-control blocks with the information which describes the physical system.

The formal technique used to implement the start-up procedure starts with the passing of a filename to the shared-memory parallel version of RUN860 (Microway's "alien file" server which runs on the DOS host and provides the i860s with basic input and output services). RUN860 is passed a file containing a command line for each i860 in the system. Each line specifies the application being loaded on the module, module number, physical location of the module's shared-memory resources that will get allocated for local storage (other than local memory), and whether or not to initialize all of shared memory prior to load. The last line in the command file also tells the system how much shared memory to allocate to "shared COMMON"--the region of memory that user programs will be able to address using identical virtual addresses. This line in the command file is always the root, as it is always the last module to get loaded and started.

By running several applications in succession and only initializing common memory when the first application is run, you can build parallel applications which act as numeric filters. In this mode, results are left behind in the shared COMMON region from prior runs, which get processed in succession. This eliminates spooling intermediate results to a hard disk between the stages of a pipelined computation. For many problems, the time it takes to load an Mbyte or less of code is much less than saving temporary data to disk, then reading it back in for the next phase.

Throughout this article, I've discussed a hypothetical declaration of a special type of COMMON which gets handled by the Fortran compiler and linker in a special manner. While this has been implemented, a better technique is to build up this region by a program which executes before the Fortran MAIN program starts. You can think of this routine as an overlay-style array manager. If you're running large applications, COMMON's status is troublesome. While a C-like malloc() approach could be taken, that wouldn't work either, as heaps are designed to hold hundreds of small items. A better approach is to build a database describing the arrays being processed, then use this database to carve up shared memory before the Fortran MAIN starts execution. This gives you shared COMMON divided into a static region and one or more reusable regions, each of which can be used by various stages of an application.

When the MAIN program starts in this scenario, the addresses of arrays are passed to it as if MAIN were a Fortran procedure. The arrays in regions that may get reused will have overlapping addresses. The presumption is that the user will finish one of these phases before starting the next and that he will store data which has to survive across phases in an area allocated for static storage.

The Fortran MAIN program isn't the first to execute because what we've done is write a Fortran "parallel run time" in Fortran instead of in C, which would expose it to the user. As it turns out, Fortran MAIN programs and C main() procedures are never the first routines in an application to execute. They're always called from a UNIX-like CRT0 procedure. Because all the parallel Fortran modules start with a knowledge of the system, it's possible to build a single executable which runs on both the root and worker modules. While this approach wastes memory, it simplifies debugging and code development.

Figure 3 shows the structure of the program that emerges and a map of shared COMMON. As you can see, there's a strong symmetry between both sides. The symmetry breaks down for those tasks that can only be carried out by the host--the initialization of the MCBs and post-processing of data left behind in COMMON after the completion of a parallel task.

Performance and Conclusions

Figure 4 plots the efficiency of a matrix multiply vs. vector length. Below the vector length, I've listed the time for each i860 to do its share of the computation and the total cycle time. The time to do a dot product grows as the square of the vector length n, while the I/O grows linearly with n. As a consequence, large matrix multiplies can be efficiently parallelized, and the efficiency grows to over 90 percent by the time n hits 200. Actually, we've discovered that with users who use the QuadPuter and DOS, typical scientific problems frequently depend more on vector adds and multiplies than dot products. Since these primitives are less memory bound, they tend to parallelize more easily. Typical of our users is a crystallographer whose public-domain code improved by a factor of two after vectorization and a factor of three after parallelization. This code runs in 19 seconds on the QuadPuter and takes 30 seconds on his RS/6000-550.

We designed the QuadPuter to maximize throughput per EISA slot, the goal being to build a system which used five cards and had a throughput of one gigaflop. This was accomplished through the use of cool-running 25-MHz i860s. The time to execute a parallel program on the QuadPuter has computation and shared-memory communication components. Problems which vectorize easily sometimes present a challenge to the communications component, and require the use of memory-partitioning schemes like that in Figure 1. Algorithms that don't vectorize well or are scalar bound are less communications bound and parallelize efficiently using Fortran COMMON mapped to shared memory.

Vector vs. Scalar and Superscalar Devices

Moore's law states that the number of gates available on a square centimeter of silicon doubles roughly once per year. This miracle of technology has brought the cost of a megabyte down from $1 million to just $50 over the last 12 years. It's also made possible the ability to go from relatively simple CPUs, to devices that run at the speed of a Cray.

The Crays of the early '80s consumed 150 KW, yet had less numeric throughput than a 40-MHz i860. What's made devices like the i860 so powerful is that it became possible to pack pipelined numeric devices, data caches, code caches, and 32-bit RISC processors onto a single die. The multipliers on devices like the i860 are, in fact, called "Cray multipliers" (named for Seymore Cray, who invented them when at IBM). Cray multipliers are square silicon devices that take their input operands in on two sides and emit the result on the other two. The results literally flow from the input sides to the output sides. The larger the multiplier, the faster the device that employs it. Floating-point units use Cray multipliers to do the mantissa arithmetic and are sized so this arithmetic flows in a single cycle for single and possibly double precision. These and other flow-through devices have grown as big as they're going to grow. The only way to speed up numeric devices today is to employ more of them running in parallel. The i860 does this by stacking up the adder and multiplier in various pipelined combinations. The other approach to this problem is to use multiple, scalar numeric units. The only other easy way to speed up devices is to increase the size of the data cache.

The main problem with superscalar architectures is writing code for them. Of course, if Intel is on the ball, you probably won't have to. My guess is that Intel will follow the lead of Inmos and initiate multiple scalar operations in a single cycle using a long instruction-decode pipeline. In a superscalar device, it won't be necessary to write pipelined code, although it might be necessary to properly schedule code if you want to take full advantage of the units.

Figure 5 compares the operation of the i860 instruction used in a dot product, with the execution behavior of a hypothetical superscalar device. The i860 instruction is one of the 32 pipelined numeric instructions which specify how the i860's pipelined multiplier and adder feed each other. In this case, operand pairs are pumped into the multiplier once per cycle. They then pass through the multiplier in three cycles and out into the adder, which also takes three cycles to accumulate a result. In a dot product, the results circulate in the adder stages until the entire product is complete, at which point they get added up and stored in memory. The latency through the six stages is six cycles, but the rate at which operands are accepted is one pair per cycle. This translates into two numeric operations per cycle, or 50 megaflops at 25 MHz.

The four units of the superscalar device can be divided. My guess is that an on-chip scheduler would discover when the addresses of operands for instructions in the decode queue weren't aliased and issue instructions as fast as possible to the available numeric units. In this case, the processor would start the first multiplication in unit #1 and start a second in unit #2, one cycle later. Since these are scalar numeric units instead of pipelined units, they should permit a faster two-cycle operation. This means that on the third cycle, you could repeat the process. The actual instruction used for a dot product might be a multiply/accumulate similar to that developed for the Weitek 1167/4167 numeric units. This instruction automatically takes the output of a multiplier and accumulates it in a specific register until the algorithm is complete. In this case, we'd accumulate partial dot products in two registers. Alternatively, some logic in the scheduler might deduce this fact and use the adders as accumulators. Instead of writing a complicated, pipelined, primitive operation that involved 16 to 32 floating-point registers, the superscalar code would use just two lines in its inner loop:

reg1=reg1+x(i)*y(i)
reg2=reg2+x(i+1)*y(i+1)

It might not even be necessary to use registers to hold the dummy arguments used to represent the contents of the adders. One of Intel's goals appears to be replacing registers with memory. This is also Inmos's goal and should be the goal of any CISC company that has to compete with RISC. By making memory operations run at the same speed as register operations, it's possible to build CISC chips competitive in speed with RISC. However, this speed probably comes at a cost in die area which will translate into dollars. On the plus side, the scalar devices mostly run in two cycles, and don't have to be pipelined to yield decent performance--they can get excellent performance executing small regions of scalar code which may not be vectorizable but do have significant numeric content.

In the end, the four numeric units of this hypothetical device are just as efficient at dot products as an i860, which uses two units. However, they're faster at scalar-bound calculations, could be easier to write code for, and could often execute small sections of scalar code at vector speeds. There are two ways they might lose: if the less-complicated nature of RISC devices makes it easier to port RISC technology to smaller geometries that allow you to increase the clock speed more often; or if the smaller size of a RISC device makes it possible to place several RISC devices on the same die. Another problem could be CISC's extra logic, which burns more power. The four numeric units on the right-hand side of Figure 5 take almost twice as many gates as the two units on the left. This causes heating problems that prevent the devices from keeping up with clock-speed increases. Only time will tell if the Intel superscalar CISC gambit will pay off.

--S.F.

Example 1: (a) Parallel computation of fourth-order polynomials using 1000 values of x and 1000 sets of coefficients; (b) POLY_WORKER carries out the task of computing 1000 polynomials for a single set of coefficients; (c) the module control block structure.

(a)      N = 1000
         DO 100 I= 1,N
               DO 100 J= 1,N
     100 Result(j,i) = a(i)*x(j)**4 + b(i)*x(j)**3 +
        &             c(i)*x(j)**2  +d(i)*x(j) +e(i)

(b)      Subroutine POLY(x,result,N,a,b,c,d,e,iend,istart)
         real*4 x(N),result(N),a,b,c,d,e
         integer*4 N,iend,istart
         DO 100 J= istart,iend     100       Result(j) = a*x(j)**4 + b*x(j)**3 +
        &                  c*x(j)**2 + d*x(j) +e
         END

(c)  STRUCTURE/Module_control_block/
     integer*4   istart
     integer*4   iend
     integer*4   x_ptr
     integer*4   res_ptr
     real*4      a
     real*4      b
     real*4      c
     real*4      d
     real*4      e
     END STRUCTURE
     RECORD /Module_control_block/MCB(4)

Table 1: Interprocessor-communication bandwidths and latencies.


System                  Latency               Interprocessor Rate
RS/6000 Ethernet        3.5 millisec           50--100 Kbytes/sec
DELTA                   130 microsec          5--7 Mbytes/sec
RS/6000 Bit-3           240 microsec          10 Mbytes/sec
IBM V-7                 140 microsec          3--5 Mbytes/sec
T800                    <3 microsec          1.5--6 Mbytes/sec
QuadPuter-860           10--20 microsec       67 Mbytes/sec

Figure 1: Matrix-multiply data storage using distributed- and shared-memory architectures.

Figure 2: A six-process/processor farm.

Figure 3: Flow of parallel Fortran shared memory with inset of shared-memory organization.

Figure 4: Efficiency vs. vector length for single-precision matrix multiplies.

Figure 5: Comparing dot products; (a) i860 Vector unit executing m12apm; (b) a pair of superscalar units executing multiply accumulate at the same rate as m12apm.