ACCESSING HARDWARE FROM 80386 PROTECTED MODE PART I

Understanding the 386 architecture may simply be a matter of building on what you already know

Stephen Fried

Stephen is the vice president of MicroWay's R&D. He is well known in the field for his PC numeric and HF chemical laser contributions. You can reach him at MicroWay Inc., P.O. Box 79, Kingston, MA 02364.


At one time, I ran a flight school that taught people how to fly aircraft and sailplanes. Of all the equipment we operated, the trickiest to manage was our fleet of Cessna L-19 "Bird Dogs," which we used to tow gliders and banners. The key to transitioning a pilot into an L-19 was getting the idea across that he wasn't flying an ordinary airplane, but one which had several distinct personalities. For example, if you limit the flaps to 30 degrees and the power to 150 HP, an L-19 flies just like a Skyhawk or C-170 (in fact, it has the same wing as both). However, when you go to full power, or full flap, what you get is an airplane that performs much like a helicopter. This split personality was designed into the L-19 for the Army, which used it for forward air control and covert operations in Vietnam. This same performance made it a great tow plane but a very expensive aircraft to transition pilots into. In fact, over a 15-year period we had three major accidents transitioning experienced pilots, and the Civil Air Patrol (CAP) lost so many Bird Dogs that it was eventually forced to sell them off.

Without a shadow of a doubt, the "Bird Dog" of microprocessors has to be the 80386. In the case of the L-19, our experienced pilots could argue for hours about the right way to make a landing (where you put the flaps down, where you added power, and whether you three-pointed it or landed on the mains). Transitioning the 386 from real to protected mode is every bit as complicated as making a short field landing in a Bird Dog. The process involves building tables in real memory, transitioning to protected mode, transitioning from 16- to 32-bit mode, building paging tables and, finally, transitioning to paged mode. The exact sequence used is a matter of personal choice, and at every stage the processor, and the assembly code that drives it, takes on a distinct personality. The resulting code is as difficult to decipher as any that has ever been written for a computer.

Getting By

Fortunately, it is not necessary for ordinary folk to get involved in the writing of kernel routines. However, if you plan to directly access your 386 AT's facilities, such as the screen buffers, it will be necessary to understand the rudiments of 80386 memory management and how it affects application development.

Just how confused are people about the 80386? Our 80386 protected-mode compilers have been available for over two years, and our products have been used to port millions of lines of Fortran and C. Yet, as I discovered in writing this article, we had been making an incorrect claim that it was possible for the 80386 to run multiple segments, each of which could have up to 4 gigabytes of code or data. As we'll see shortly, this assertion is not really correct.

When I asked myself how I could write outstanding code for a processor that I didn't understand, I quickly came to the conclusion that it isn't necessary to understand the whole 80386 to write code for it; it is enough to understand the two modes that virtually all 386 code runs in.

Protected Mode

The majority of 80386s running in PCs see two types of service: real mode and 32-bit flat protected mode. 99.9 percent of all 386 applications, including those written with DOS extenders, Unix, Xenix, and OS/2, run in 32-bit flat protected mode. The mode is called flat or small because all of the code and data of a program exist in a single segment, which resembles the 32-bit address space of a typical mainframe. In the case of DOS extenders, the processor slips in and out of real mode to access MS-DOS.

Unix, Xenix, and the future 386 release of OS/2 run entirely in protected mode. Because these operating systems are either multitasking or multiuser, the protection of operating system facilities, and therefore all hardware, becomes a major issue. As a result, these operating systems make it impossible to write the kind of fast-running "misbehaved" applications that are the subject of this article. They accomplish this by running the user's code at a low privilege level and making system facilities accessible only from code running at a high privilege level. Therefore, the subject of this article applies to code running on DOS extenders only.

Probably the biggest problem with learning the 80386 is the fact that most of the books on the subject were written for or by operating system types. As it turns out, the 80386 has two sides: A complex one that takes months to fully appreciate and a simple, physical one that is an almost trivial extension of the 8086 architecture.

A 32-bit 8086

The easiest way to approach this multi-personality processor is to treat it like a 32-bit 8086 that can be attached to a piece of hardware that makes paged memory possible. To help facilitate this exposition, imagine that the year is 1984, that we work for Intel, and that we have been asked to design a 32-bit 8086. (We will ignore the fact that the 80286 exists and that we have been asked to create a processor that also runs most 80286 code.)

Recall that the 8086 has six general-purpose 8/16-bit registers which were adapted from similar 8- and 16-bit registers in the 8080. Also recall that the address space of the 8086 was extended to 20 bits by adding four 16-bit segment registers to the 8080 architecture. These point to 64K-byte windows, called "segments," that can be located on any of the 64K paragraphs (addresses that are multiples of 16) that exist in a megabyte of physical memory.

Segmentation was the trick that made the 16-bit registers of the 8086 capable of spanning a 20-bit address space. The problem we are now faced with is that we want to address more than a megabyte and we want to use 32-bit registers for computing addresses.

To generate an upward-compatible architecture, we will now mimic what we did when we expanded the 8080 to 16 bits. We will use segments to act as windows into the address space and let our general-purpose registers contain offsets into the segment windows. We are now faced with several problems. Our segments must be capable of holding a lot of information, and to keep segments from hogging valuable address space, there should be some way to specify their size. Finally, to simplify upward compatibility, we will stay with 16-bit segment registers.

What comes out of these requirements is a segment that can be located at any system address in a 32-bit address space and whose size is not fixed, but specified by a 32-bit integer. Describing the size and location of a 32-bit segment in a 32-bit address space takes 64 bits of information and clearly violates our desire to leave our segment registers 16 bits wide. We resolve this by letting the segment register contain a 16-bit index into a table that is stored in memory and contains the 8 bytes needed to describe our new 32-bit segment. This index is called a "selector," and the table it references a "descriptor table." The location of a segment will henceforth be called its "base," and we use the term "limit" to describe its size.

To make it possible for the processor to access descriptors quickly, we incorporate registers into the processor that hold the base and limit for all the currently active segments. We also expand the number of segment registers from the four of the 8086 (cs, ds, ss, es) to six, by adding two new data segment registers, fs and gs.

Attribute Bits

To implement protection, we must free up a few of the 64 bits that we dedicated to the descriptors above. We do this by reducing the size of the limit from 32 to 20 bits. This limits the size of a segment to a megabyte, so we use one of the 12 bits that have just been freed up to specify the granularity of a segment. When this bit is set to zero, the segment is said to have "byte granularity" and its limit is a 20-bit integer. When this bit is 1, the limit value gets multiplied by 4K-bytes, yielding a 4-gigabyte upper value to the limit, with 4K granularity.
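The limit arithmetic above can be sketched in a few lines of C. This is an illustrative sketch, not code from any DOS extender; the function names are made up for the example. One detail beyond the text: on the real 80386, a page-granular limit is shifted left 12 bits and the low 12 bits are filled with ones, so that the limit always names the last valid byte of the segment.

```c
#include <stdint.h>

/* Highest valid offset in a segment, given the 20-bit limit field and
   the granularity bit taken from its descriptor. */
uint32_t segment_last_offset(uint32_t limit20, int page_granular)
{
    limit20 &= 0xFFFFFu;                  /* only 20 bits are stored */
    if (page_granular)
        return (limit20 << 12) | 0xFFFu;  /* limit counts 4K pages */
    return limit20;                       /* limit counts bytes */
}

/* Segment size in bytes; 64 bits wide so that a full 4 gigabytes fits. */
uint64_t segment_size(uint32_t limit20, int page_granular)
{
    return (uint64_t)segment_last_offset(limit20, page_granular) + 1u;
}
```

With byte granularity, the largest limit (0FFFFFH) gives a 1-megabyte segment; with the granularity bit set, the same limit gives the full 4 gigabytes.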

The other 11 bits that we have carved from our original 32-bit limit are used to specify protection attributes. These include a bit that describes whether the segment is a 16- or 32-bit segment (that is, the processor has 16- and 32-bit default modes, selected by this bit), the privilege level of the segment (0..3), a "present" bit (if this bit is not set, the segment is treated as not present and cannot be used), a DT bit that distinguishes ordinary memory segments from those that describe system resources (which are ignored here), and four TYPE bits that specify 16 possible usages for a segment (among them read-only, read/write, execute-only, and execute/read).

Of the 12 attribute bits, the only ones that we will encounter are those that specify the granularity, protection level, and segment TYPE. The attribute bits are used by the processor to ensure that every time a byte is accessed from memory, the program accessing the byte has the right to access the byte in question, that the byte lies in the segment, and that it is being used for its intended purpose (you can only execute code, not data, and vice versa).

As a result of the depth of protection provided by the processor, bugs that would crash an 8086 system more often than not cause memory protection faults in 80386 systems. This makes debugging much easier. You simply run the program under an 80386 debugger and, when the processor hits the fault, examine the program to determine what caused the illegal access request. It is usually impossible to backtrack after an 8086 error because the original error destroys the processor stack, which causes the CPU to jump to data instead of code, resulting in everything becoming scrambled. Errors such as stack underflows cause immediate exceptions in the 80386, making it possible to backtrack before the processor has destroyed the information needed to track down the bug.

General Registers

So far we have spent all of our time worrying about how to extend the concept of a 64K segment into a general-purpose 32-bit segment. Now that we have created a 32-bit segmented framework for accessing information in memory, we must worry about extending the size of the six 8086 general-purpose registers: ax, bx, cx, dx, si, and di. We will do the same thing with them that we did when we extended the 8080 architecture from 8 to 16 bits. We create a new 32-bit register for each, having its lowest 16 bits named after the corresponding 8086 register, and its two lowest bytes named after the 8-bit registers of the 8086. For example, the 16-bit register ax, which contains the 8-bit registers al and ah in the 8086, gets expanded in the 80386 into a 32-bit register eax, which has a 16-bit component ax and two 8-bit components al and ah.
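The way ax, al, and ah sit inside eax can be modeled with shifts and masks. The sketch below is only a model of the register layout written in C; the helper names are hypothetical and nothing here touches a real register.

```c
#include <stdint.h>

/* Views of a 32-bit eax value, modeled on plain integers. */
uint16_t reg_ax(uint32_t eax) { return (uint16_t)(eax & 0xFFFFu); }      /* low 16 bits */
uint8_t  reg_al(uint32_t eax) { return (uint8_t)(eax & 0xFFu); }         /* bits 0..7   */
uint8_t  reg_ah(uint32_t eax) { return (uint8_t)((eax >> 8) & 0xFFu); }  /* bits 8..15  */

/* Writing ax replaces the low 16 bits and leaves the upper half of eax alone. */
uint32_t write_ax(uint32_t eax, uint16_t value)
{
    return (eax & 0xFFFF0000u) | value;
}
```

For eax = 12345678H, this gives ax = 5678H, ah = 56H, and al = 78H. The upper 16 bits have no name of their own and can only be reached through eax.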

The registers ebx, esi, and edi are used in exactly the same manner as bx, si, and di. We also add some new addressing modes that simplify accesses of vectors but, otherwise, our architecture looks remarkably like an 8086. Addresses are computed in these 32-bit registers and are used as offsets into the segments of up to 4 gigabytes that we developed earlier. Intrasegment jumps and calls are NEAR in the 80386, just as on the 8086, but when running in 32-bit mode, NEAR changes its meaning from the 16-bit offsets of the 8086 to 32-bit offsets.

Supporting Syntax

To simplify the encoding of instructions, we use the same opcodes for mov eax, eax as we did for mov ax, ax, and so on. The size of the register operands is determined in two ways. One of the attribute bits in the segment descriptor describes a segment as being 16 or 32 bits. In addition, when the processor is running in 16-bit mode, a prefix byte can be used ahead of an instruction to indicate that the operands of that instruction are 32 bits wide. The same prefix makes it possible to use 16-bit registers in 32-bit mode.
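To make the prefix rule concrete, here is a small C sketch that emits the machine code for mov ax,ax or mov eax,eax. The function name and its parameters are invented for this illustration. The byte values are the standard 80386 encodings: 89H is the MOV opcode, 0C0H is the ModRM byte naming the accumulator as both operands, and 66H is the operand-size override prefix.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode "mov ax,ax" (16-bit operand) or "mov eax,eax" (32-bit operand)
   for a code segment whose default size is 16 or 32 bits.
   Returns the number of bytes written to out. */
size_t encode_mov_acc_acc(int segment_is_32bit, int operand_is_32bit,
                          uint8_t out[3])
{
    size_t n = 0;

    /* The prefix is needed only when the operand size differs from
       the default set by the segment descriptor. */
    if (!!segment_is_32bit != !!operand_is_32bit)
        out[n++] = 0x66;    /* operand-size override prefix */

    out[n++] = 0x89;        /* MOV r/m,r: same opcode for 16 and 32 bits */
    out[n++] = 0xC0;        /* ModRM: accumulator to accumulator */
    return n;
}
```

In a 16-bit segment, mov eax,eax comes out as 66 89 C0, while in a 32-bit segment the same instruction is just 89 C0, and it is mov ax,ax that needs the prefix.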

The use of an override prefix makes it possible to write 16-bit code, which accesses the 32-bit registers. However, when running in real mode, accessing 32-bit registers does not buy much, as the size of segments in real mode is limited to 64K and the address space is limited to the first megabyte. In fact, we create real mode to make it possible to run 8086 code without doing anything, and to provide an execution environment for setting up descriptor tables in memory, so that the processor is capable of setting itself up before jumping into protected mode.

Other Features

There are a few other features that I should at least mention. The details I have skipped include several control registers, three types of descriptor tables, task segment switches (48-bit intersegment FAR calls), paging tables, and 8086 virtual mode. One of the facilities that we have to mention, the IDT (interrupt descriptor table), makes it possible for the processor to create different interrupt tables for different tasks.

These rather abstract facilities make it possible for this two-personality processor to use software control to exhibit many other personalities (most of which will never see the light of day in the real world). In addition, they make it possible to implement demand-paged virtual memory that is very efficient. Virtual memory is available for all of the operating systems and environments that MicroWay's NDP C works with, making it possible to run mainframe programs on 386 systems that only have 1 to 2 Mbytes of RAM and a lot of free space on a hard disk.

A Quick Look at the Map

The 80386 has a "physical" side that is quite close to the physical side presented by the 8086, and an abstract side that we can ignore. We will now examine this physical side and make the connection between the environment of our protected-mode application and the real-mode resources that the processor takes advantage of for doing I/O in an 80386 "AT" system.

Figure 1 shows the local descriptors for a program running under Phar Lap with the no-page switch on. These values were obtained by running an NDP C program under the Phar Lap 386DEBUG program and using the dl command to dump the local descriptor table. The selector numbers on the left side of the table are the values that a programmer loads into the 386 segment registers to activate a segment. Because 386DEBUG was invoked with paging off, the BASE values in Figure 1 correspond to physical addresses.

Figure 1: The local descriptors for a program running under Phar Lap

  Selector       BASE     Limit     Flags     Use     Gran      Comment

---------------------------------------------------------------------------

     04           53030        FF       92       32     BYTE     DOS EXTEND
     0C          100000       2FF       9A       32     BYTE     USER CODE
     14          100000       2FF       92       32     BYTE     USER DATA
     1C           B8000       FFF       92       32     BYTE     DOS SCREEN
     24           53030        FF       92       32     BYTE     DOS EXTEND
     2C           52F60        B9       92       32     BYTE     DOS EXTEND
     34          000000     FFFFF       92       32     BYTE     1st MEG
     3C        C0000000      FFFF       92       32     BYTE     WEITEK
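The selector values in Figure 1 step by 8 and all end in 4 or 0CH, which follows from the way the 80386 lays out a selector: the upper 13 bits are the index into the descriptor table, bit 2 picks the local or global table, and the low two bits are the requested privilege level. The C sketch below decodes that layout; the structure and function names are made up for the example.

```c
#include <stdint.h>

typedef struct {
    unsigned index;   /* which descriptor in the table */
    unsigned is_ldt;  /* table indicator: 1 = local table, 0 = global table */
    unsigned rpl;     /* requested privilege level, 0..3 */
} selector_fields;

/* Split a 16-bit selector into its three fields. */
selector_fields decode_selector(uint16_t sel)
{
    selector_fields f;
    f.index  = sel >> 3;
    f.is_ldt = (sel >> 2) & 1u;
    f.rpl    = sel & 3u;
    return f;
}

/* Byte offset of the selected 8-byte descriptor within its table. */
unsigned descriptor_offset(uint16_t sel)
{
    return (sel >> 3) * 8u;
}
```

Decoded this way, selector 1CH is entry 3 of the local descriptor table with a requested privilege level of 0, and 04H is entry 0 of the same table.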

The memory map has a number of these selectors pinpointed on its left side. Looking at selectors 0C and 14, we see that their corresponding segments are located at the start of what IBM calls "extended memory" (the start of the second megabyte of memory). If we had invoked 386DEBUG with paging on, the primary difference in our segment memory map would be that 0C and 14 would be moved down into the first megabyte to save memory. However, with paging enabled, it would not be possible to read the physical location of a segment from the selector BASE value, as the processor performs an additional address translation with paging enabled. Therefore, we will examine some of the selectors in Figure 1 that have been set up by the DOS extender before going on to see what happens when paging is enabled.

Selector 1C has been set up so that it contains the current screen buffer. This selector has a base that starts at address 0B800:0 (in 8086 notation) and is 0FFFH + 1 bytes in length (4K bytes). The fact that this segment corresponds exactly to the screen buffer was no accident. The Phar Lap DOS Extender queried the system to find out what kind of graphics adapter was active, and based on this information created an entry in the LDT (local descriptor table) that precisely matched the device. It is also important to point out that the use of selector 1CH is preferred over selector 34H (which maps in the entire first megabyte of RAM) for screen buffer accesses, because an out-of-bounds write will result in a protection fault when using 1CH, but could have disastrous results if 34H were used.
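The relationship between the 8086 paragraph address and the BASE and Limit columns of Figure 1 can be checked with a little arithmetic. The following C sketch models both translations; it is a simplified model with hypothetical function names, and a return value of 0 stands in for the protection fault that the processor would raise.

```c
#include <stdint.h>

/* Real mode: paragraph * 16 + offset gives the physical address. */
uint32_t real_mode_linear(uint16_t paragraph, uint16_t offset)
{
    return ((uint32_t)paragraph << 4) + offset;
}

/* Protected mode, paging off: base + offset, after a limit check.
   Returns 1 and stores the address if the offset is inside the segment;
   returns 0 to model a protection fault. */
int protected_linear(uint32_t base, uint32_t limit, uint32_t offset,
                     uint32_t *linear)
{
    if (offset > limit)
        return 0;               /* out of bounds: the access would fault */
    *linear = base + offset;
    return 1;
}
```

Paragraph 0B800H translates to 0B8000H, which is the BASE that Figure 1 lists for selector 1CH. With a limit of 0FFFH, offset 0FFFH is the last byte that can be reached, and offset 1000H is rejected.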

The selectors 0CH and 14H were created for user code and data. Note that these selectors have the same base and limit. In fact, they are identical in every way, except for the attribute flags. The format of the attribute byte is:

     upper                  lower
|P|DPL|DT|             |TYPE|

Looking over the memory map, we see that the flag byte has only two values: 92H and 9AH. The lower nibble in 92H indicates that the segment is of type 2, which means the segment is read/write (for data only). All but one of the segments must therefore contain data. Looking at the map, we discover that the segment that we have identified with "user code" has an attribute of 9A. The TYPE nibble, 0AH, indicates that selector 0CH is execute/read (code).

The upper nibble contains miscellaneous information about the segment, including the present bit, two bits that specify the privilege level, and a bit which, when set, specifies that the descriptor describes memory (as opposed to a task switch or special system entity). The binary translation of 9 is 1001: the present bit is set, the two privilege bits are 00, and the DT bit is set. This translates into an ordinary memory segment that is marked as present in memory with a privilege level of 0. Privilege level 0 is the highest available, and is frequently referred to as "ring 0."
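The decoding of the two flag values can be written out as a short C sketch. The field positions follow the attribute byte diagram above; the structure and function names are invented for the illustration. The test for a code segment relies on the 80386 convention that the top bit of TYPE is set for executable memory segments.

```c
#include <stdint.h>

typedef struct {
    unsigned present;    /* P: segment is present in memory */
    unsigned dpl;        /* privilege level, 0 (highest) to 3 */
    unsigned is_memory;  /* DT: 1 = ordinary code or data segment */
    unsigned type;       /* TYPE nibble */
} access_fields;

/* Split the attribute (flag) byte of a descriptor into its fields. */
access_fields decode_access(uint8_t flags)
{
    access_fields a;
    a.present   = (flags >> 7) & 1u;
    a.dpl       = (flags >> 5) & 3u;
    a.is_memory = (flags >> 4) & 1u;
    a.type      = flags & 0x0Fu;
    return a;
}

/* For memory segments, TYPE bit 3 distinguishes code (1) from data (0). */
int is_code_segment(uint8_t flags)
{
    return (flags >> 3) & 1;
}
```

Both 92H and 9AH decode to present, privilege level 0, memory segment; they differ only in TYPE, which is 2 for the data segments and 0AH for the user code segment.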

Segments that run in ring 0 are theoretically capable of creating havoc by playing games with system tables that should only be accessed by the operating system or DOS extender. As a practical matter, the only time we have had to deal with invisible system tables, such as the global descriptors, was in the early 80386 days, before the DOS extenders had calls for mapping in new hardware, such as the Weitek coprocessor (which is now automatically mapped in by all DOS extenders).

As long as the program you write goes through the system calls provided by Phar Lap and Eclipse to modify lower-level system tables, such as the interrupt descriptor table, the program that results will conform to the VCPI specification, which means it will run with VCPI operating environments, such as Desqview-386, Netware-386, Phar Lap, and Eclipse.

As a point of interest, Eclipse runs programs in ring 3. There is a movement in the 386 extender industry toward running in ring 3 instead of ring 0. As long as the operating environments continue to provide the memory mapping capabilities that are utilized below, we have no objection to running in ring 3 over 0. However, we think there is, and will continue to be, a need for operating environments that provide direct access to all system resources, as a countermeasure to operating systems such as OS/2 and Unix, which are attempting to shut off access to these facilities.

Real Memory from Protected Mode

To move a block of characters and attributes into screen RAM in an 8086 system, we might employ a block move. This technique is frequently used by spreadsheets that build an image in memory of what the screen is going to contain and then instantaneously move this buffer to screen RAM by using a single processor instruction. To set up a block move in an 8086, we point the ds:si registers at the source, the es:di registers at the destination, place the number of bytes to be moved in cx and then use a rep movsb instruction to have the processor make the transfer for us.

The code for an 80386 block move is identical, except that we now use 32-bit registers to hold 32-bit offsets, and where we used physical paragraphs in ds and es, we now use the appropriate selectors. In addition, where we placed the count in cx, we now place the count in ecx, which is a 32-bit register and makes it possible to move more than 64K with a single instruction. For example, to move a 4K buffer of character/attribute pairs to a text-mode screen buffer located at paragraph B800, we would employ one of the two sequences of code shown in Figure 2, depending on whether we were running in real mode or 80386 32-bit mode under Phar Lap.

Figure 2: 16-bit vs. 32-bit assembly code to move a 4K buffer of character/attribute pairs to a text-mode screen buffer

  Real mode                 32-bit protected mode
  ------------------------------------------------------------------

  mov    ax,0B800H          mov    eax,1CH       ;set destination
  mov    es,ax              mov    es,ax         ;segment
  xor    di,di              xor    edi,edi       ;dest offset = 0
  mov    si,buffer          mov    esi,buffer    ;set source offset
  mov    cx,1000H           mov    ecx,1000H     ;set count
  rep    movsb              rep    movsb         ;perform block move

The program assumes that the buffer being moved is contained by the current data segment in ds. It then sets up a FAR pointer to the destination (screen buffer at B800:0). Note that where the real-mode code used the physical paragraph of the screen buffer, the 80386 uses the selector set up by Phar Lap. Next, the code points si or esi at the buffer to be moved. Again, note that where a 16-bit offset was used by the real mode code, a 32-bit offset is now being used by the 80386 for the 32-bit code. Finally, the program sets the number of bytes to be moved in cx or ecx, and requests the processor to carry out the block move. Except for the first line, these two sequences are virtually identical.
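For readers who think more easily in C than in assembly, the following sketch simulates the same operation on ordinary arrays. It does not touch real screen memory, and the names are invented for the example: fill_image plays the part of the spreadsheet that builds an off-screen image, and block_move stands in for rep movsb, with a limit check added to model the protection that selector 1CH provides.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCREEN_BYTES 0x1000u   /* same byte count as Figure 2 */

/* Fill an off-screen image with character/attribute pairs. */
void fill_image(uint8_t image[SCREEN_BYTES], char ch, uint8_t attr)
{
    size_t i;
    for (i = 0; i + 1 < SCREEN_BYTES; i += 2) {
        image[i]     = (uint8_t)ch;  /* even bytes hold the character */
        image[i + 1] = attr;         /* odd bytes hold the attribute */
    }
}

/* Copy count bytes to dest_offset within a simulated segment.
   Returns 0, copying nothing, if the move would run past the limit;
   this models the fault raised by the processor. */
int block_move(uint8_t *dest_segment, uint32_t dest_limit,
               uint32_t dest_offset, const uint8_t *src, uint32_t count)
{
    if (count == 0)
        return 1;
    if (dest_offset > dest_limit || count - 1 > dest_limit - dest_offset)
        return 0;
    memcpy(dest_segment + dest_offset, src, count);
    return 1;
}
```

A move of 1000H bytes to offset 0 of a segment with limit 0FFFH succeeds; the same move to offset 1 is refused, because its last byte would fall outside the segment.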

Because the selectors in Figure 1 can access all of the memory in the first physical megabyte of RAM, we have just demonstrated that it is possible to access all of a system's "real" memory from a program running in protected mode. In our example, the source buffer is contained by the default data segment, 14H, which is located in "extended" memory above the first megabyte.

48-bit Address Space?

All that remains in our exposé of the 386's flat model is to explore the operation of ports, interrupts, and paging. However, before we leave segmentation, there is one myth we need to explode. The typical text on the 80386 presents the processor as having three address spaces -- virtual, linear, and physical. Up to this point, what we have been exploring is the linear and physical, which are identical when paging is disabled. The mythical address space turns out to be the "virtual" one. The myth was born because individuals who were used to programming in the large or huge models on the 8086 asked, "What would happen if we could write large or huge code on an 80386, instead of small code?" They quickly came to the conclusion that programs written with compilers and operating systems that support 48-bit pointers (the 16 bits of the selector count for 16, and the 32-bit maximum size of the limit counts for 32) would be capable of addressing a 48-bit address space, usually quoted as 64 terabytes!
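The size of this "virtual" space is easy to check. In the 80386 selector format, only 14 of the selector's 16 bits choose a descriptor (13 index bits plus the bit that picks the local or global table); the other two hold a privilege level. The C sketch below does the arithmetic for both counts; the function names are made up for the example.

```c
#include <stdint.h>

/* Bytes reachable with the given number of usable selector bits,
   each selector naming a segment addressed by a 32-bit offset. */
uint64_t virtual_space_bytes(unsigned selector_bits)
{
    return (uint64_t)1 << (selector_bits + 32u);
}

/* Convert a byte count to terabytes (units of 2 to the 40th power). */
uint64_t to_terabytes(uint64_t bytes)
{
    return bytes >> 40;
}
```

Fourteen usable selector bits give 2 to the 46th bytes, which is the 64-terabyte figure; counting all 16 bits of the pointer's selector half would give 256 terabytes.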

We don't know who created this concept, although we suspect that Intel marketing told its system architects (after the last perceived black eye they got from a segmented architecture) that if they had to resort to segmentation again, they had better have a damn good reason. The reality of the situation is that practical program size is limited by the size of what Intel calls the "linear" address space (to 32 bits), and that a 48-bit address space will not become a reality until Intel increases the size of the linear address space in a future device.

To prove the point, we did a calculation of what would happen if we took a simple program that performed a matrix multiply and extended it to handle arrays whose total size was greater than 4 gigabytes. As the total size of the arrays in our problem approaches 4 gigabytes (each of the three arrays approaches 1.3 gigabytes), we have to abandon our 80386 small model, and Phar Lap, in favor of a compiler-supported memory model and operating system that utilizes the virtual address space (which is not the same as demand-paged virtual memory, which we commonly refer to as "virtual memory").

Once our problem hits the 1.4-gigabyte array size, it is impossible to have all three arrays in our 4-gigabyte linear address space at the same time. So, we take advantage of the present bit in the descriptor table to make it possible for our large model operating system to swap arrays as needed. Our large model operating system makes it possible to run large model virtual segments. When we compute the time required to swap our 1.4-gigabyte segments as required by our algorithm, we discover that, assuming we have the world's fastest hard disks, our code runs 100,000 times slower than it did in the small model currently supported by Phar Lap, Unix, and Xenix.

The largest array that our large model supports is 4 gigabytes, which means our problem will span a tiny (in comparison to 64 terabytes) 12-gigabyte address space. But never fear, we have still not finished digging into our bag of 8086 tricks. By resurrecting FAR pointers, the huge model, and tiling, we can hit our 64-terabyte goal -- and for only a cost factor of 400 percent in code efficiency.

What's Next?

That these systems tricks are crucial for future Intel products is quite evident from the 80486, which, unlike the 80386, achieves its best speed with small model code that limits data accesses to the ds segment register only. It's amazing what happens to the best-laid plans of product managers, public relations, and system types when everyone suddenly discovers that the key to selling systems is simplicity (i.e., RISC)! But I hope to convince you next month, in Part II of this article, that the only use for FAR pointers in 80386 code is in operating system kernels.