Digital Video File Formats

Understanding QuickTime and Video for Windows

Mark is president of the San Francisco Canyon Company, which developed QuickTime for Windows for Apple Computer. Canyon publishes the Movie Toolkit, a C++ class library for manipulating QuickTime and AVI movie files, and Canyon Clipz!. Canyon's How to Digitize Video will be published by John Wiley in early 1994. Mark can be reached at 415-398-9957; through CompuServe at 72371,104; or through AppleLink at CANYON.

Fads come and go, and like object-oriented programming, artificial intelligence, and bell-bottom dungarees, digital video is currently in vogue. Desktop digital video was pioneered--and proven--by Apple on the Macintosh. Over two years ago, QuickTime emerged as a strong standard with a loyal and talented following of developers. In late 1992, Apple announced QuickTime for Windows at the same time Microsoft ushered in Video for Windows, each vying to become the desktop standard.

As of yet, there's no clear winner. But with the advent of powerful programs such as Adobe's Premiere for Windows, it's clear that digital video is approaching some stage of maturity. But what is digital video, and how can you, as a programmer, harness its power?

The 30,000-foot Perspective

Digital-video movies on your PC can be viewed as nothing more than a collection of rather large files that otherwise look like regular DOS files. Instead of holding eye-glazing columns of accounts-receivable data, however, these files contain digitized sequences of video and sound. (Incidentally, although I'll talk a lot about digital video, strictly speaking I'm referring to time-based data. For example, a QuickTime movie of a performance of Tosca might contain a track of video, a track of stereo sound, and additional text tracks of the libretto in English, German, and Italian, each synchronized to the other. Another point worth noting is that while the technocrats are well aware of the symmetry of using the Latin terms video ["I see"] and audio ["I hear"], sound is, for some reason, often preferred.)

An implementation of digital video must solve three problems to be viable. First, just like the analog systems that preceded it (CD-DA or NTSC for TV), it must define an architecture. This architecture must be robust enough to endure (so that content providers can be sure that their material won't quickly become obsolete) but flexible enough to adapt to the future. QuickTime and Video for Windows mainly embody their architectures in the data structures of their respective file formats (in the case of Video for Windows, this is called AVI, for Audio Video Interleave). I'll examine these file formats in this article.

Secondly, digital video must provide extremely efficient compression and decompression implementations. A quick exercise in arithmetic shows why. Full-screen (say, 640x480), full-motion (either 24 frames per second in the movies, or 29.97 fps on your TV), uncompressed video needs between 22.1 and 27.6 Mbytes per second. One of my favorite movies, The Great Escape, would need 307 gigabytes to store uncompressed, which, if laid end-to-end--well, you get the picture. And space requirements are only part of the story. Imagine if mass-storage devices were cheap enough that you could afford 307 gigabytes for a single movie. Your hardware would still have to support a sustained data rate of 1 Mbyte per second to play it back.
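The data-rate figures above can be checked with a few lines of C. This is a minimal sketch that assumes 24-bit color (3 bytes per pixel), the pixel depth that reproduces the 22.1 and 27.6 Mbyte figures; the function name is illustrative only.

```c
#include <stdint.h>

/* Bytes per second needed for uncompressed video.
   Passing bytes_per_pixel = 3 is an assumption (24-bit color). */
double uncompressed_rate(uint32_t width, uint32_t height,
                         uint32_t bytes_per_pixel, double fps)
{
    double frame_bytes = (double)width * height * bytes_per_pixel;
    return frame_bytes * fps;
}
```

Under that assumption, uncompressed_rate(640, 480, 3, 24.0) gives 22,118,400 bytes per second, and the same frame at 29.97 fps needs about 27.6 million.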

Finally, a successful digital-video implementation must provide an engine that can play back the movies it digitizes at realistic frame sizes and rates on general-purpose desktop PCs. Both QuickTime and Video for Windows more or less succeed in this goal.

QuickTime Movie File Format

The first important point to note about the QuickTime movie file format is that QuickTime uses a strict subset of the Mac file format on the PC, making life easier for content providers. One consequence of this is that the byte ordering of structured data is Motorola, not Intel (because the Mac implementation came first). This may, at first, make it confusing to relate some of the discussion in this article to a hex dump of a QuickTime file (although I find Motorola hex easier to read!). In general, I'll still talk mostly about the PC, because it uses a simpler subset of the full QuickTime specification.

QuickTime files have a recursive, atom-based format. An atom is prefixed by a 32-bit length and a 32-bit identifier. It can contain either data or more atoms. Figure 1(a) shows the basic atom layout. The semantics of an atom are implied by its identifier. Each atom identifier is a four-character mnemonic, which is also the value of the identifier itself. This may seem odd to Windows programmers, to whom constructs like: #define SOME_ATOM_ID 0x12345678 /* unreadable value */ are more familiar. On almost every other platform, 32-bit compilers have been quite happy to accept character constants like 'moov' or 'mdat'.
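As an illustration of the atom prefix, the following sketch reads the 32-bit length and four-character identifier from a byte buffer in Motorola order. The names ATOM_ID, read_be32, and AtomHeader are hypothetical and are not taken from the listings. The comment on the size field reflects the QuickTime convention that the length counts the eight header bytes as well.

```c
#include <stdint.h>

/* Build a four-character identifier with the first character in the
   most significant byte, matching its Motorola byte order on disk. */
#define ATOM_ID(a, b, c, d) \
    (((uint32_t)(uint8_t)(a) << 24) | ((uint32_t)(uint8_t)(b) << 16) | \
     ((uint32_t)(uint8_t)(c) << 8)  |  (uint32_t)(uint8_t)(d))

/* Read a 32-bit big-endian (Motorola) value from a byte buffer. */
uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

typedef struct {
    uint32_t size;  /* length of the atom, header included */
    uint32_t id;    /* four-character identifier */
} AtomHeader;

/* Decode the 8-byte prefix found at the start of every atom. */
AtomHeader read_atom_header(const uint8_t *p)
{
    AtomHeader h;
    h.size = read_be32(p);
    h.id   = read_be32(p + 4);
    return h;
}
```

Because the identifier is assembled from bytes in file order, the code works the same on Intel and Motorola hosts, and a comparison such as h.id == ATOM_ID('m', 'o', 'o', 'v') stays readable.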

Clearly, a QuickTime movie can be viewed as a tree structure. Normally, of course, trees have a single root, but QuickTime movies have two. On the Mac, movie data (video frames and sound samples) is stored in the file's data fork; the atoms that describe that data can be stored in the resource fork. DOS, of course, does not have this concept, so both tree structures are concatenated, or flattened in QuickTime jargon. QuickTime on the Mac is quite happy to play these flattened movies.

Movie data is stored in the mdat atom, which always comes first. It contains only data, not other atoms, that data being the video frames and sound samples that comprise the actual movie. The moov atom (pronounced, moo-vee) is the root to a structure of atoms that act as an index to the movie data. Figure 1(b) shows the basic structure of all QuickTime movies on the PC.

The moov Atom

I mentioned earlier that atoms can be viewed as a tree. Table 1 shows the basic tree structure of the moov atom. While the semantics of a particular atom constrain it to a certain level in the tree, the ordering of atoms at a given level is arbitrary. Moreover, software that parses the tree is expected to ignore atoms it doesn't recognize. It is this simple facility that gives the QuickTime movie file structure the flexibility to adapt to future needs. You can explore this structure for yourself by using Apple's DUMPMOOV program. Under DOS, it generates output like that shown in Listing One, page 17. Space constraints prevent a detailed examination of each atom; that can be found in Apple's QuickTime Movie Exchange Toolkit and Canyon's Movie Toolkit.

The mvhd atom defines the overall characteristics of the movie, principally its time scale and duration. A time scale is simply the units (in events per second) in which time values are expressed. For example, a time scale of 1000 means that time values are interpreted as milliseconds. However, time scales of 100 or 1000, while seemingly convenient, are not often used. You will more likely see a time scale of 600 because it has more factors, allowing integer arithmetic to be performed with less loss of precision.
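The point about factors can be made concrete. In the sketch below (function names are illustrative), a frame duration is only representable exactly when the frame rate divides the time scale; 600 divides evenly by 24, 25, and 30, while 1000 does not divide by 24 or 30.

```c
#include <stdint.h>

/* Duration of one frame in time-scale units, or 0 if the frame rate
   does not divide the time scale evenly. */
uint32_t units_per_frame(uint32_t time_scale, uint32_t fps)
{
    return (time_scale % fps == 0) ? time_scale / fps : 0;
}

/* Convert a time value to whole milliseconds (truncating). */
uint32_t to_milliseconds(uint32_t value, uint32_t time_scale)
{
    return (uint32_t)(((uint64_t)value * 1000u) / time_scale);
}
```

For example, units_per_frame(600, 24) returns 25, while units_per_frame(1000, 24) returns 0 because 1000/24 is not an integer.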

The Movie Header time scale and duration provide the key to synchronization. Rather than synchronize video and sound to a particular fixed frame rate (like analog systems or Microsoft's AVI), QuickTime synchronizes all its tracks to the Movie Header. In digital video (as opposed to analog video), frame rates do not have to be constant. There's no celluloid driven by sprockets in front of a beam of light. A single digital image can be displayed on the CRT for as short or as long a time as necessary. This is where the stts atom comes into play. Conceptually, it is an array of durations for each frame in the movie, each of which can, of course, be a different value. In practice, a simple compression scheme allows a single value to be applied to multiple frames. For example, Figure 2 specifies that all 1270 frames have a duration of 100.
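A minimal sketch of how such a run-length table might be consulted follows. The SttsEntry structure and function names are assumptions made for illustration; they mirror the Sample Count and Sample Duration fields shown in Figure 2 but are not the on-disk layout.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t sample_count;     /* number of consecutive frames */
    uint32_t sample_duration;  /* duration of each, in time-scale units */
} SttsEntry;

/* Total duration of all frames described by the table. */
uint64_t stts_total_duration(const SttsEntry *e, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)e[i].sample_count * e[i].sample_duration;
    return total;
}

/* Start time of frame `index` (0-based). Returns the total duration
   if `index` is past the last frame. */
uint64_t stts_start_time(const SttsEntry *e, size_t n, uint32_t index)
{
    uint64_t t = 0;
    for (size_t i = 0; i < n; i++) {
        if (index < e[i].sample_count)
            return t + (uint64_t)index * e[i].sample_duration;
        index -= e[i].sample_count;
        t += (uint64_t)e[i].sample_count * e[i].sample_duration;
    }
    return t;
}
```

With the single entry from Figure 2 (1270 frames of duration 100), the total duration is 127,000 units and frame 10 starts at time 1000.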

A QuickTime movie may contain an arbitrary number of trak atoms, one for each of its tracks. Like their analog counterparts, tracks can hold video or sound. Additionally, QuickTime supports text tracks, although they are not yet implemented in the Windows version. Any number of video tracks can be present. QuickTime will choose the one that will look best when played on the target device. For example, you could digitize video into three tracks using 8-bit color, 16-bit color, or 24-bit (so-called) true color. If the movie is played on a PC with a video adapter capable of only 8-bit color, the 8-bit color track will be chosen. Similarly, any number of sound tracks may be present. Each can be recorded in a different language, for example. QuickTime will select the track that matches the current Windows language specification.

Tracks can, and typically do, have time scales and durations different from those in the Movie Header. For example, the natural time scale for a sound track is its sampling rate, say 11.025 kHz. A QuickTime movie can have any number of tracks, and no one type of track is conceptually favored. For example, movies are not required to have video tracks, and sound-only movies are quite common. They are often used in multimedia presentations, along with more conventional movies, instead of Microsoft WAVE files. In this way, a single API can be used to control all aspects of the application.

A track may have an arbitrary number of elst atoms. These atoms are mainly generated by movie-editing software like Adobe's Premiere. They allow selected parts of a track to be played out of sequence. You may have seen the recent Woody Allen movie, Manhattan Murder Mystery, in which Woody Allen and Diane Keaton attempt to blackmail their neighbor, whom they suspect of murder, by recording his girlfriend's audition for a play they've faked. Then they literally cut-and-paste the tape to produce a convincing, but quite different, shake-down message. In QuickTime, elst atoms are the digital equivalent of Woody's razor blade.

The stsd atom tells QuickTime how the track's video or sound data is compressed. The accompanying text box entitled, "Video Compression Technologies" explores this subject further. An additional text box entitled, "Selecting a Decompressor" describes how a decompressor is selected on playback.

A track may have multiple stsd atoms. At first, it's hard to see why this is useful. It implies that different parts of the track can be encoded using different compressors, and appears to be a somewhat esoteric feature. But consider a movie-editing package that might glue together parts of existing movies to form a new movie. If the source movies used different compressors, but multiple stsd atoms weren't supported, the new movie would have to be recompressed using a single compressor. However, each time a frame is compressed with a lossy compressor, it loses quality, much like a video tape that is copied.

The stsc, stco, and stsz atoms are used to extract data from the mdat atom. stsc allows video frames or sound samples to be grouped into chunks, to improve performance on playback. Typically, chunking is performed by postproduction optimization software. It gives the size of each chunk and the number of video frames or sound samples in that chunk. stco gives the offset of each chunk within the mdat atom. stsz gives the size of each video frame or sound sample.
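The way the three tables cooperate can be sketched as follows. This is a simplification under stated assumptions: samples_per_chunk[] is a hypothetical, already-expanded form of the stsc information with one count per chunk, and the offsets are used exactly as recorded in stco. It is not the layout of the atoms on disk.

```c
#include <stdint.h>
#include <stddef.h>

/* Offset of sample `index` (0-based), or (uint32_t)-1 if out of range.
   samples_per_chunk[] is an assumed expanded form of stsc,
   chunk_offset[] comes from stco, and sample_size[] from stsz. */
uint32_t sample_offset(const uint32_t *samples_per_chunk,
                       const uint32_t *chunk_offset, size_t chunk_count,
                       const uint32_t *sample_size, uint32_t index)
{
    uint32_t first = 0;  /* index of the first sample in this chunk */
    for (size_t c = 0; c < chunk_count; c++) {
        if (index < first + samples_per_chunk[c]) {
            uint32_t off = chunk_offset[c];
            for (uint32_t s = first; s < index; s++)
                off += sample_size[s];  /* skip earlier samples in chunk */
            return off;
        }
        first += samples_per_chunk[c];
    }
    return (uint32_t)-1;
}
```

The lookup finds the chunk that holds the sample, starts at that chunk's offset, and then steps over the sizes of the samples that precede it within the chunk.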

The mdat Atom

The mdat atom is simply a stream of video frames and sound samples. Theoretically, the physical ordering is unimportant because the stsc, stco, and stsz atoms are used as indexes. However, in order to play back from relatively slow devices like CD-ROM, seeks must be avoided at all costs, so in practice, physical ordering is extremely important. QuickTime prefers that video frames and sound samples be physically grouped into half-second chunks, with sound leading. The text box entitled, "Sound Encoding Techniques" describes how sound samples are stored.

Reading and Writing QuickTime Movie Files

The first routine we'll need to tackle the task of reading and writing QuickTime movies is a fast WORD and DWORD flip routine, which converts Intel ordering to Motorola, and vice versa. Listing Two, page 18 shows Flip16 and Flip32, both of which can be conveniently called from C/C++ code using the prototypes in Figure 3. In production code, you'll want to implement Flip16Many and Flip32Many in assembler so that you don't have to iterate over Flip16 or Flip32 to flip multiple WORDs or DWORDs.
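For readers without the listing at hand, a portable C equivalent of the flip routines might look like the sketch below. It is not the code from Listing Two; the lowercase names are chosen to keep it distinct from Flip16 and Flip32, and it trades speed for portability.

```c
#include <stdint.h>
#include <stddef.h>

/* Swap the two bytes of a 16-bit value. */
uint16_t flip16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Reverse the four bytes of a 32-bit value. */
uint32_t flip32(uint32_t v)
{
    return (v << 24) | ((v & 0x0000FF00u) << 8) |
           ((v >> 8) & 0x0000FF00u) | (v >> 24);
}

/* Flip an array of 32-bit values in place. */
void flip32_many(uint32_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = flip32(p[i]);
}
```

Flipping is its own inverse, so the same routine converts Intel to Motorola order and back again.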

I'll use recursive descent to parse QuickTime movies. This technique has the advantage of simplicity and elegance, as the structure of the code exactly matches the structure of the file itself. Listing Three, page 18, shows the CollectAtomsFromFile routine, the heart of the recursive-descent logic. Listing Four, page 18, shows the actual parsing code. I use the Windows multimedia I/O calls (mmioOpen, mmioSeek, mmioRead, and mmioClose) for convenience; in this context, they're equivalent to any other I/O interface. I use the Windows mmioFOURCC macro to construct atom identifier constants; if you had a 32-bit compiler (as for Windows NT), you could simply code 'moov', for example, directly.

You may have noticed that CollectAtomsFromFile flips the atom size atmh.lSize (in order to perform arithmetic) but not its identifier atmh.lName. This isn't a bug. Rather, it allows you to code mmioFOURCC constants in their natural, readable Motorola order.
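The recursive-descent idea can be sketched without the Windows I/O calls. The version below is not CollectAtomsFromFile; it is a simplified illustration that walks atoms held in a memory buffer and merely counts them. The container list is taken from Table 1, and all function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define ATOM_ID(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8)  |  (uint32_t)(d))

/* Read a 32-bit value stored in Motorola order. */
uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Atoms that hold other atoms, taken from Table 1. */
int is_container(uint32_t id)
{
    return id == ATOM_ID('m','o','o','v') || id == ATOM_ID('t','r','a','k') ||
           id == ATOM_ID('e','d','t','s') || id == ATOM_ID('m','d','i','a') ||
           id == ATOM_ID('m','i','n','f') || id == ATOM_ID('d','i','n','f') ||
           id == ATOM_ID('s','t','b','l');
}

/* Walk the atoms in buf[0..len), recursing into containers.
   Returns the number of atoms seen, or -1 on a malformed length. */
int count_atoms(const uint8_t *buf, size_t len)
{
    int total = 0;
    size_t pos = 0;
    while (len - pos >= 8) {
        uint32_t size = be32(buf + pos);
        uint32_t id   = be32(buf + pos + 4);
        if (size < 8 || size > len - pos)
            return -1;
        total++;
        if (is_container(id)) {
            int inner = count_atoms(buf + pos + 8, size - 8);
            if (inner < 0)
                return -1;
            total += inner;
        }
        pos += size;  /* unrecognized atoms are skipped, not rejected */
    }
    return total;
}
```

A real parser would dispatch on the identifier of each leaf atom instead of counting, but the shape of the recursion is the same: the code descends exactly where the file nests.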

Although there isn't room here to show the actual code, writing a QuickTime movie uses a structurally parallel technique. I use the following procedure:

Write out a dummy mdat atom with a zero length.
Write out all the movie data (video frames and sound chunks) in the desired order.
At the same time, accumulate, in internal tables, the information you'll need to build the moov atoms.
Seek to the beginning of the file, and then write out the true length of the mdat atom.
Seek to the end of the mdat atom.
Write out all the moov atoms from your internal tables. Mirror the recursive-descent technique to write leaf atoms first, working up the tree toward the root. Each routine that writes an atom returns its length. This way, routines that write nonleaf atoms simply accumulate their length from the routines they call.
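The length bookkeeping in the steps above can be illustrated with a small sketch. Note that this is a variation, not the procedure itself: it applies the write-a-dummy-length-then-patch idea from the mdat steps to every atom, and it writes into a memory buffer rather than seeking in a file. The Writer structure and all function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    uint8_t *data;   /* caller-supplied buffer, assumed large enough */
    size_t   pos;    /* next write position */
} Writer;

/* Store a 32-bit value in Motorola order. */
void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24); p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);  p[3] = (uint8_t)v;
}

/* Start an atom: write a zero length and the identifier, and return
   the position of the length field so it can be patched later. */
size_t begin_atom(Writer *w, const char id[4])
{
    size_t start = w->pos;
    put_be32(w->data + w->pos, 0);
    memcpy(w->data + w->pos + 4, id, 4);
    w->pos += 8;
    return start;
}

/* Finish an atom: patch in its true length and return that length. */
uint32_t end_atom(Writer *w, size_t start)
{
    uint32_t size = (uint32_t)(w->pos - start);
    put_be32(w->data + start, size);
    return size;
}

/* Append raw data (video frames, sound samples, or leaf-atom fields). */
void put_bytes(Writer *w, const void *src, size_t n)
{
    memcpy(w->data + w->pos, src, n);
    w->pos += n;
}
```

Nested atoms fall out naturally: a begin_atom for moov, followed by begin/end pairs for its children, followed by end_atom for moov, leaves every length correct without any routine having to add up the sizes of the atoms beneath it.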

AVI Movie File Format

AVI files are stored as a specialization (form in Microsoft jargon) of the Microsoft RIFF (Resource Interchange File Format) standard. Microsoft defines RIFF as a tagged-file specification used to define standard formats for multimedia files. Other forms are WAVE for waveform audio data and RDIB for bitmaps. An introduction to RIFF can be found in the Windows Multimedia Programmer's Guide, and a discussion of the AVI form can be found in the Video for Windows Development Kit Programmer's Guide.

The AVI RIFF form starts with a standard 12-byte header; see Figure 1(c). Of course, Intel byte ordering is used for all fields. In order to code identifiers, such as RIFF and AVI, naturally, Microsoft provides the mmioFOURCC macro. For example, the following type of construct is common: #define formtypeAVI mmioFOURCC('A', 'V', 'I', ' ').

In general, RIFF files consist of chunks, lists of chunks, or a combination of both. The AVI form specifies which chunks are defined and the order in which they are expected. All programs that read RIFF files are expected to ignore chunks they don't recognize (but preserve them when the file is written). A chunk is very similar in both form and concept to a QuickTime atom; it consists of a 4-byte identifier and a 4-byte length, followed by the chunk data; see Figure 1(d). The semantics of a chunk or list are implied by its identifier. A list of chunks is prefixed by a 12-byte header, as in Figure 1(e).
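For comparison with the atom walker, here is a minimal sketch of walking RIFF chunks in a memory buffer. It relies on three RIFF conventions: the length is stored in Intel order, it excludes the 8-byte chunk header, and chunk data is padded to an even number of bytes. The function names are illustrative; on Windows, mmioDescend and mmioAscend do this work for you.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Read a 32-bit value stored in Intel order. */
uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Count the chunks in buf[0..len), descending into RIFF and LIST
   chunks. Returns -1 on a malformed length. */
int count_chunks(const uint8_t *buf, size_t len)
{
    int total = 0;
    size_t pos = 0;
    while (len - pos >= 8) {
        uint32_t size = le32(buf + pos + 4);  /* excludes the 8-byte header */
        if (size > len - pos - 8)
            return -1;
        total++;
        if (memcmp(buf + pos, "RIFF", 4) == 0 ||
            memcmp(buf + pos, "LIST", 4) == 0) {
            if (size < 4)
                return -1;                    /* no room for the form/list type */
            int inner = count_chunks(buf + pos + 12, size - 4);
            if (inner < 0)
                return -1;
            total += inner;
        }
        pos += 8 + size + (size & 1);         /* data is padded to an even size */
        if (pos > len)
            pos = len;
    }
    return total;
}
```

The 12-byte header of a RIFF or LIST chunk is simply the ordinary 8-byte chunk header followed by a 4-byte form or list type, which is why the recursion starts 12 bytes in and covers size - 4 bytes.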

Microsoft supplies two good programs for exploring AVI files. RIFFWALK works under DOS, and generates output like that shown in Listing Five, page 19. It's worth taking a look at this code; armed with the information discussed so far, you'll be able to infer a lot about the AVI file structure. FILEWALK displays similar output under Windows. Table 2 shows the required chunks and lists of chunks in an AVI file. I'll discuss the highlights of the important chunks shortly. Unlike QuickTime atoms, the ordering of AVI chunks is important.

The idx1 chunk is an optional index into the movi list; Figure 1(f) shows the index-entry format. If the index is present (as denoted by flags in the avih chunk), you are expected to use it to parse the data in the movi list. The ordering of index entries defines the order in which video and sound chunks must be played. One trick about using the index is worth noting. It normally records chunk offsets relative to the start of the movi list. However, Microsoft reportedly changed its mind during the Video for Windows beta period and some early encodings record chunk offsets relative to the beginning of the file. To determine which is being used, I read the first index entry. If its chunk offset is large (greater than 2K), I assume the old encoding; if it is small, I assume the new.
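The offset heuristic just described might be coded along these lines. The function name and the movi_base parameter are assumptions for illustration: movi_base stands for whatever file position the new-style offsets are measured from, which the caller would record while parsing.

```c
#include <stdint.h>

/* Threshold from the heuristic: a first offset larger than 2K is taken
   to be measured from the beginning of the file. */
#define OLD_STYLE_THRESHOLD 2048u

/* Convert an index entry's chunk offset to a file position.
   first_entry_offset: chunk offset stored in the first index entry.
   movi_base: file position that new-style offsets are measured from
              (an assumed parameter supplied by the caller). */
uint32_t index_file_position(uint32_t entry_offset,
                             uint32_t first_entry_offset,
                             uint32_t movi_base)
{
    if (first_entry_offset > OLD_STYLE_THRESHOLD)
        return entry_offset;          /* old encoding: already absolute */
    return movi_base + entry_offset;  /* new encoding: relative to movi */
}
```

The test is a heuristic, not a guarantee; it works because the first chunk of a new-style file sits very close to the start of the movi list, while an old-style offset must at least clear the headers.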

Reading and Writing AVI Movie Files

The Windows multimedia I/O calls (mmioOpen, mmioClose, mmioRead, mmioWrite, mmioSeek, mmioDescend, and mmioAscend) are designed to process RIFF files. In particular, mmioDescend and mmioAscend allow chunks and lists of chunks to be processed quite conveniently. As a point of comparison, QuickTime provides no such assistance, and dealing with AVI files is considerably simpler.

Listing Six (page 19) shows how to parse a basic AVI movie file. For clarity, error checking has been omitted. Again, there isn't room to show the code for writing an AVI file, but I use this technique:

Seek to an offset of 2K into the file.
Write out all the movie data (video frames and sound chunks) in the desired order.
At the same time, accumulate, in internal tables, the information you'll need to build the hdrl list.
Seek back to the beginning of the file and create a RIFF chunk and the required chunks in the hdrl list.
Create a junk chunk to pad the end of the hdrl list to the beginning of the movi list.
Create a movi list chunk.
Seek to the end of the file and create an index chunk.

Conclusion

Content developers often ask whether they should develop for QuickTime or Video for Windows. On the one hand, I think that QuickTime is technically superior. As far as production is concerned, the Intel Smart Video Recorder (ISVR card) can capture QuickTime and AVI movies with equal ease, and products like Adobe's Premiere bring first-rate editing capabilities to both. And on the Mac, where Video for Windows is not even a player, there exists a vast pool of equipment, software, and (most importantly) production talent, all dedicated to QuickTime.

On the other hand, Apple is fast losing ground to Microsoft by daring to play in Microsoft's sandbox. The decision to do Windows was a bold one, but unless Apple begins exhibiting a much stronger commitment to QuickTime for Windows, it may ultimately be overwhelmed.

Mark Florence

Figure 1: (a) Basic atom format; (b) basic QuickTime movie file structure; (c) AVI RIFF form header; (d) basic chunk format; (e) list-header format; (f) AVI index-entry format.

Table 1: moov atom tree structure.

Atom                Purpose
moov                Movie atom.
-mvhd               Movie header. Defines the time scale and duration of the movie.
-trak               Track atom.
--tkhd              Track header. Defines the dimension, time scale, and duration of the track.
--edts              Edit list.
---elst             Edit-list entry. Allows selections of the track to be played out of sequence.
--mdia              Media atom.
---mdhd             Media header. Defines the characteristics of the media holding this track's data.
---hdlr             Handler. On the Mac, defines the component that handles the media.
---minf             Media information.
----vmhd or smhd    Video- or sound-media information header. Defines basic media requirements.
----hdlr            Handler. On the Mac, defines the component that handles the video or sound.
----dinf            Data information.
-----dref           Data reference. On the Mac, can point to another file holding this track's data.
----stbl            Sample table.
-----stsd           Sample description. Describes how the track's video or sound is compressed.
-----stts           Time-to-sample. Gives the duration of each video frame.
-----stss           Sync sample. Indicates the location of key frames.
-----stsc           Sample-to-chunk. Groups video frames or sound samples into chunks.
-----stco           Chunk offset. Gives the offset into the mdat atom of each chunk.
-----stsz           Sample size. Gives the size of each video frame or sound sample.
-trak               As many additional tracks as required.

Figure 2: Applying a duration value to multiple frames.

stts (24) Time To Sample
-Version/Flags: 0x00000000
-Number Of Entries: 1
- 0: Sample Count 1270, Sample Duration 100.

Figure 3: Prototypes for the Flip16 and Flip32 routines, which convert Intel ordering to and from Motorola.

WORD PASCAL Flip16(WORD);
DWORD PASCAL Flip32 (DWORD);

Table 2: AVI file structure.

Code                  Purpose
RIFF AVI              File header.
  LIST hdrl           Defines structure of data in the movi list.
    avih              Defines basic movie format.
    LIST strl         One strl list per stream (video or sound data).
      strh            Defines stream format.
      strf
    LIST strl
      strh
      strf
    ...
  junk                Optionally, provides padding (otherwise ignored).
  LIST movi           Contains actual movie data.
    LIST rec          Groups video and sound data for efficiency.
      ##wb            Sound data.
      ##dc            Video data.
    LIST rec
      ##wb
      ##dc
    ...
  idx1                An optional index into movi list.

Video Compression Technologies

Table 3 summarizes the compression technologies available today for QuickTime and Video for Windows. By the time you read this, more may be known about the Captain Crunch and Indeo R3 compressors (both still in beta at the time of this writing). Those compressors producing encodings that are identical in both systems are indicated with an asterisk. I have deliberately omitted MPEG, motion JPEG, and other technologies, simply because no QuickTime or Video for Windows CODECs exist yet.

I've also indicated a typical frame size and rate, although these numbers should be taken with a grain of salt. I've assumed software-only decompression on a 486/33 PC. In my opinion, the current leader of the pack is clearly CinePak, although Captain Crunch and Indeo R3 show signs of catching up.

When analyzing the performance of a CODEC, the most important gating factor is the CD-ROM transfer rate, because most movies are distributed this way. Consider that common CD-ROMs have 150--200 Kbytes/second transfer rates. A good CODEC will attempt to compress the data as tightly as possible (which gates the maximum playback rate from CD) in such a way that it can be decoded as quickly as possible (which gates the actual playback rate). For example, if the average frame size is 10K, then no more than 15 to 20 fps from CD can be expected, regardless of the speed of the decompressor.
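The two gates described above reduce to a division and a minimum, as in this small sketch (function names are illustrative; rates are in Kbytes per second and sizes in Kbytes).

```c
/* Frame rate the CD-ROM can deliver: transfer rate over frame size. */
double cd_fps_limit(double kbytes_per_sec, double frame_kbytes)
{
    return kbytes_per_sec / frame_kbytes;
}

/* The actual rate is bounded by both the drive and the decompressor. */
double playback_fps(double cd_limit, double decoder_fps)
{
    return cd_limit < decoder_fps ? cd_limit : decoder_fps;
}
```

With 10K frames, cd_fps_limit gives 15 fps at 150 Kbytes per second and 20 fps at 200, matching the figures in the text; a decompressor that manages only 12 fps would lower the result further.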

--M.F.

Table 3: Compression technologies currently available for QuickTime and Video for Windows (*common to both QuickTime for Windows and AVI).

Compressor          Identifier  Frame Size/Rate  Comments
Apple Animation     rle         320x240/12 fps   Optimized for animations and cartoons. Gives poor
                                                 performance for real-life video content.
Apple Graphics      smc         160x120/15 fps   A modest performer optimized for 8-bit content.
                                                 (The identifier is the initials of its patent
                                                 holder, Sean Callaghan.)
Apple Video         rpza        160x120/15 fps   Also known as "road pizza" because of its good
                                                 compression ratio, it is now superseded by CinePak.
Captain Crunch*     klic        320x240/15 fps   From Media Vision; currently in beta.
CinePak*            cvid        320x240/15 fps   The one apparent disadvantage currently is that
                                                 the algorithm is highly asymmetrical. It takes up
                                                 to 100 seconds to compress one second of video.
                                                 For content with a high turnover and a short life,
                                                 this can be critical.
Intel Indeo R2*     rt21        160x120/12 fps   A modest performer without hardware assistance.
                                                 When available, will be superseded by Indeo R3.
                                                 Intel's ISVR card captures directly into this
                                                 format.
Intel Indeo R3*     iv31        320x240/15 fps   Currently in beta.
Intel YVU9          yvu9                         Primarily used only during capture; available
                                                 only in Video for Windows. Content is almost
                                                 always converted into another format.
JPEG                jpeg                         Primarily used only during capture; available
                                                 only in QuickTime. Content is almost always
                                                 converted into another format.
Microsoft RLE       mrle        160x120/12 fps   Optimized for animations and cartoons. Gives poor
                                                 performance for real-life video content. Not the
                                                 same as Apple's Animation compressor.
Microsoft Video 1*  msvc        160x120/12 fps   Media Vision's ProMovie Studio captures directly
                                                 into this format.

Selecting a Decompressor

Key to the flexibility of both QuickTime and Video for Windows is their open architecture for compressors and decompressors (CODECs). Today, vector-quantization compression techniques allow playback rates of approximately 12 to 15 fps of 240x180 frames on most general-purpose computers. Tomorrow, perhaps wavelets or fractals will double this. It's vital that both QuickTime and Video for Windows accommodate this growth without changing their file formats or architecture. Fortunately, they do, and we are starting to see a wide range of powerful CODECs from Apple, Microsoft, and third-party developers.

QuickTime decompressors are structured as components. Components are a Mac concept, ported to the PC in QuickTime for Windows. A component is a special kind of DLL (in Windows, they normally use the .QTC extension) that negotiates its capabilities with its callers through a predefined set of entry points. A single .QTC file, which Windows views as a DLL, can contain multiple components. Full details are in Apple's QuickTime documentation.

You may recall from the general discussion that the stsd atom describes how a track's video is compressed. It does this by encoding the four-character identifier of the compressor. The assignment of these identifiers is regulated by Apple to ensure that they remain unique across all third-party developers. They look just like atom identifiers; for example, cvid is assigned to SuperMac's CinePak CODEC.

When QuickTime for Windows starts to play a video track, it negotiates with all the decompressor components it can find, using the standard Windows LoadLibrary search strategy. Each decompressor is asked, of course, if it can handle the identified encoding. But it also has the opportunity to check if a preferred environment (for example, special hardware) is present. In any event, it will report whether or not it can perform the decompression and, if so, how fast, with the speed measured as the number of milliseconds necessary to decode a 320x240 frame. QuickTime then uses the fastest decompressor.
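The selection rule at the end of that negotiation can be sketched as follows. The Offer structure and pick_decompressor are hypothetical stand-ins for the answers gathered during negotiation; they are not part of the QuickTime component interface.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical record of one decompressor's answer during negotiation. */
typedef struct {
    uint32_t encoding;      /* four-character identifier it was asked about */
    int      can_decode;    /* nonzero if it accepts the job */
    uint32_t ms_per_frame;  /* reported time to decode a 320x240 frame */
} Offer;

/* Return the index of the fastest willing decompressor for `encoding`,
   or -1 if none can handle it. */
int pick_decompressor(const Offer *offers, size_t n, uint32_t encoding)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (offers[i].encoding != encoding || !offers[i].can_decode)
            continue;
        if (best < 0 || offers[i].ms_per_frame < offers[best].ms_per_frame)
            best = (int)i;
    }
    return best;
}
```

A decompressor that declines (for example, because its preferred hardware is absent) is skipped, so a slower software-only decompressor for the same encoding can still win.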

Even when a movie is playing, QuickTime can switch decompressors. For example, if the video frame becomes clipped by another window, QuickTime will repeat the decompressor selection process. It does this because a decompressor that uses hardware assistance may wish to defer to a software-only decompressor for nonrectangular frames.

This elegant scheme is simple and effective, although it does place a burden on the decompressor writer to develop the correct negotiation logic. It has the advantage that decompressors can simply be dropped into the user's system without any SYSTEM.INI changes. For example, content providers can deliver a CD of movies and a proprietary QuickTime decompressor without fear of a conflict with the existing environment or special installation requirements. Further, multiple decompressors for the same encoding can be present, and QuickTime will automatically choose the most appropriate.

Video for Windows decompressors are drivers (DLLs with the extension .DRV) written to the specification Microsoft documents in the Video for Windows Development Kit Programmer's Guide. In a manner similar to QuickTime, the AVI file format encodes the four-character identifier of the compressor in the video stream header, strh. Again, the assignment of these identifiers is regulated by Microsoft to ensure uniqueness, although we can be sure that the level of coordination between Apple and Microsoft is fairly low! Fortunately, where an encoding is supported in both systems, its identifier is constant. For example, Microsoft has also assigned cvid to SuperMac's CinePak CODEC.

Before it plays a video stream, Video for Windows simply takes the encoder identifier, prefixes it with VIDC, and uses it to look up the name of the CODEC in the [drivers] section of SYSTEM.INI; see the [drivers] section in Figure 4.
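Building the lookup key is straightforward. The sketch below forms the key in the style shown in Figure 4; the function name is illustrative, and the uppercasing is done only to match the figure.

```c
#include <ctype.h>
#include <stdio.h>

/* Build the [drivers] key for a four-character compressor identifier,
   for example "cvid" -> "VIDC.CVID". out must hold at least 10 bytes. */
void vidc_key(const char id[4], char out[10])
{
    char up[5];
    for (int i = 0; i < 4; i++)
        up[i] = (char)toupper((unsigned char)id[i]);
    up[4] = '\0';
    snprintf(out, 10, "VIDC.%s", up);
}
```

On Windows, a key built this way could then be looked up in the [drivers] section of SYSTEM.INI with a profile call such as GetPrivateProfileString, which returns the driver file name.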

The scheme is simple and effective, but it has disadvantages compared to QuickTime. An installation procedure of some kind is required, and multiple decompressors for the same encoding cannot coexist. This means that a new version of a decompressor cannot specialize the capabilities of existing versions; it must totally replace them. Imagine that Intel wants to develop a new version of the Indeo decompressor optimized especially for the XYZ video chipset. Under QuickTime, it need only perform this one task, and can defer to other decompressors if the XYZ chip is not present. Under Video for Windows, the new decompressor must assume all the functionality of prior versions.

--M.F.

Figure 4: Example [drivers] section of a Windows SYSTEM.INI file.

[drivers]
VIDC.MSVC=msvidc.drv
VIDC.YVU9=isvy.drv
VIDC.IV31=indeor3.drv
VIDC.RT21=indeo.drv
VIDC.CVID=iccvid.drv
VIDC.MRLE=msrle.drv

Sound Encoding Techniques

Both QuickTime and AVI formats store sound in similar ways. At the time of this writing, neither supported compressed sound. Table 4 summarizes the encoding techniques each uses.

When sound is digitized, analog signals are converted to numbers. The size of those numbers is referred to as the sample size. The rate at which the analog signal is sampled is called the sample rate. In general, the larger the sample size and rate, the better the quality of the digitization. As a point of reference, CD-DA (standard audio CDs) is the equivalent of 16-bit, 44.1 kHz sound.

For sample sizes of 8 bits, each sample represents one of possibly 256 different values; for 16-bit samples, 65,536 discrete values can be represented. You might visualize the difference in quality to be analogous to that more easily perceived between 8- and 16-bit color. Each sample represents the deviation of a waveform from a midpoint. Two conventions exist for the midpoint. In AVI, 8-bit samples use 0x80 as the midpoint (the so-called "raw format"), and 16-bit samples use conventional signed numbers with 0x0000 as the midpoint (so-called "two's-complement" format). QuickTime can use either format with either sample size. Figure 5 shows this more clearly.

To complicate matters a little, Microsoft does not actually use the jargon raw and two's complement. Instead, it uses the acronym PCM to refer to its 8-bit and 16-bit encodings. To convert between the two formats, simply XOR each sample with 0x80 or 0x8000 as appropriate.
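The XOR conversion takes only a few lines. In this sketch (function names are illustrative), the 16-bit version assumes the samples are already in the host's native byte order.

```c
#include <stdint.h>
#include <stddef.h>

/* Toggle 8-bit samples between raw (midpoint 0x80) and two's complement. */
void toggle_format8(uint8_t *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        s[i] ^= 0x80;
}

/* The same for 16-bit samples held in native byte order. */
void toggle_format16(uint16_t *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        s[i] ^= 0x8000;
}
```

Because XOR with the same mask is its own inverse, one routine per sample size serves for conversion in both directions.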

QuickTime stores 16-bit sound samples in Motorola order; AVI uses Intel order. Byte ordering is, of course, moot for 8-bit samples! Consequently, 16-bit sound samples must be flipped when converting from AVI to QuickTime and vice versa.

Most PC sound cards can only digitize and play back at the three standard MPC rates of 11.025, 22.05, and 44.1 kHz. Many QuickTime movies are captured on the Mac and their sample rates can appear as weird numbers like 11.12754 kHz. Both QuickTime and AVI share the same convention for stereo sound in that the left-channel sample appears before the right-channel sample in the stream.

The interplay of sample size, rate, and number of channels has a great effect on the ability of the QuickTime or AVI engine to play back a movie. For example, CD-DA quality sound (16-bit, 44.1 kHz, stereo) requires a sustained data transfer rate of 176.4 Kbytes per second. Single-speed CD-ROM drives are capable of a peak rate of 150 Kbytes per second, which doesn't leave a lot of room for video! For this reason, most digital video movies you'll see today use 8-bit, 11-kHz mono sound (which doesn't sound too bad through most PC speakers). This situation is unlikely to improve much until we see a quantum leap in hardware performance.
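The arithmetic behind these figures is simply rate times bytes per sample times channels. A minimal sketch, with a hypothetical function name and assuming the sample size is a whole number of bytes:

```c
#include <stdint.h>

/* Sustained data rate, in bytes per second, of uncompressed sound.
   bits_per_sample is assumed to be a multiple of 8 (8 or 16 here). */
uint32_t sound_bytes_per_second(uint32_t sample_rate_hz,
                                unsigned bits_per_sample,
                                unsigned channels)
{
    return sample_rate_hz * (bits_per_sample / 8) * channels;
}
```

Calling it with (44100, 16, 2) gives 176,400 bytes per second, more than a single-speed drive can deliver. With (11025, 8, 1) it gives 11,025 bytes per second, well under a tenth of the drive's peak rate, which leaves the rest for video.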

Interleave is a primary characteristic of digital video files, so much so that the AVI file format is named after the concept. However, interleave is mainly a factor for slow playback devices such as CD-ROM. The trick that both the QuickTime and AVI engines have had to master is to stream enough data from the CD-ROM to keep themselves busy. It is a delicate balance of RAM buffer sizes, transfer rates, seek times, and playback rate. Note that streaming is actually quite the opposite of the more conventional caching. A cache (like SMARTDrive) attempts to improve performance by anticipating that data, once read, will be read again. Streaming assumes that data will be read once, from beginning to end, and attempts to steadily supply that data at the same rate that it is consumed.

Although QuickTime and AVI acknowledge the same concept, their engines have different requirements for interleave. QuickTime prefers sound and video in half-second chunks, with sound leading. AVI prefers sound and video interleaved on a frame-by-frame basis. That is, each video frame is physically followed by a frame's worth of sound. To complicate matters, though, sound samples are skewed ahead of video by 0.75 second. In a dump of an AVI file, you'll see the first few sound samples unmatched by video frames, and the last few video frames unmatched by sound samples (Listing Five shows an example of this).
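A "frame's worth" of sound is not always a whole number of samples. At 11,025 samples per second and 10 frames per second, each frame covers 1102.5 samples, and the sound chunks in Listing Five alternate between 0x44F (1103) and 0x44E (1102) bytes. The sketch below reproduces that pattern by rounding the running total up. The rounding rule and the function names are assumptions chosen to match the dump, not a description of how Video for Windows actually divides the samples.

```c
#include <stdint.h>

/* Total sound samples covered by the first `frames` video frames,
   rounded up to a whole sample (assumed rounding rule). */
static uint32_t samples_through_frame(uint32_t frames,
                                      uint32_t sample_rate_hz,
                                      uint32_t frames_per_second)
{
    uint64_t total = (uint64_t)frames * sample_rate_hz;
    return (uint32_t)((total + frames_per_second - 1) / frames_per_second);
}

/* Samples to interleave with one video frame (frame_index counts from 0).
   Taking differences of the running total keeps rounding from accumulating. */
uint32_t samples_for_frame(uint32_t frame_index,
                           uint32_t sample_rate_hz,
                           uint32_t frames_per_second)
{
    return samples_through_frame(frame_index + 1, sample_rate_hz, frames_per_second)
         - samples_through_frame(frame_index, sample_rate_hz, frames_per_second);
}
```

For 11,025 Hz sound at 10 frames per second, frames 0, 1, and 2 get 1103, 1102, and 1103 samples respectively, and every ten frames carry exactly 11,025 samples.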

When an AVI file is converted to QuickTime, or vice versa, the interleave factor should be adjusted to these preferred values. If it is not, you can expect poor playback performance from a CD-ROM.

--M.F.

Table 4: Sound-encoding techniques.

               QuickTime                     AVI
Sample Size    8 bit, 16 bit.                8 bit, 16 bit.
Sample Rate    Continuum of rates,           Normally the discrete MPC rates of
               normally between 11.0         11.025, 22.05, and 44.1 kHz.
               and 44.1 kHz.
Channels       Mono and stereo.              Mono and stereo.
Interleave     Half-second chunks,           Frame-by-frame, sound skewed
               sound leading.                ahead by 0.75 second.

Figure 5: Comparison of "raw" and "two's" sound.

For More Information
QuickTime Developer's Kit
Apple Computer
P.O. Box 319
Buffalo, NY 14207
800-282-2732
$195

Video for Windows
Microsoft Corp.
One Microsoft Way
Redmond, WA 98052-6399
Available free on CompuServe

Canyon Movie Toolkit
San Francisco Canyon Company
150 Post Street, Suite 620
San Francisco, CA 94108
415-398-9957
$795

[LISTING ONE]


moov (16658) Movie Atom
  mvhd (108) Movie Header
  -Version/Flags: 0x00000000
  -Creation Time: Thu Aug 19 13:26:31 1993
  -Modification Time: Thu Aug 19 13:26:31 1993
  -Time Scale: 1000 per second
  -Duration: 127000
  -Preferred Rate: 1
  -Preferred Volume: 0x00ff
  -Matrix:       1           0          0
                 0           1          0
                 0           0          1
  -Preview Time: 0
  -Preview Duration: 0
  -Poster Time: 0
  -Selection Time: 0
  -Selection Duration: 0
  -Current Time: 0
  -Next Track ID: 2
  trak (5524) Track Atom
    tkhd (92) Track Header
    -Version/Flags: 0x0000000f
    -Creation Time: Thu Aug 19 13:26:31 1993
    -Modification Time: Thu Aug 19 13:26:31 1993
    -Track ID: 0
    -Time Scale: 0 per second
    -Duration: 127000
    -Movie Time Offset: 0
    -Priority: 0
    -Layer: 0
    -Alternate Group: 0
    -Volume: 0
    -Matrix:       1           0          0
                   0           1          0
                   0           0          1
    -Track Width: 320
    -Track Height: 240
    edts (36) Edit List
      elst (28) Edit Entry
      -Version/Flags: 0x00000000
      -Number Of Entries: 1
      - Entry 0: Duration 127000, time 0, rate 1.
    mdia (5388) Media Atom
      mdhd (32) Media Header
      -Version/Flags: 0x00000000
      -Creation Time: Thu Aug 19 13:26:31 1993
      -Modification Time: Thu Aug 19 13:26:31 1993
      -Time Scale: 11025 per second
      -Duration: 14001750
      -Language: 0x0000
      -Quality: 0x0000
      hdlr (32) Handler
      -Version/Flags: 0x00000000
      -Component Type: mhlr
      -Component Subtype: soun
      -Component Manufacturer: appl
      -Component Flags: 0x00000000
      -Component Flags Mask: 0x00000000
      minf (5316) Video Media Info
        smhd (16) Sound Media Information
        -Version/Flags: 0x00000000
        -Balance: 0
        hdlr (32) Handler
        -Version/Flags: 0x00000000
        -Component Type: dhlr
        -Component Subtype: alis
        -Component Manufacturer: appl
        -Component Flags: 0x00000000
        -Component Flags Mask: 0x00000000
        dinf (36) Data Info
          dref (28) 00 00 00 00 00 00 00 01 00 00 00 0c 61 6c 69 73
                     00 00 00 01
        stbl (5224) Sample Table
          stsd (52) Sample Description
          -Version/Flags: 0x00000000
          -Number Of Entries: 1
            raw  (36) Sound Description
            -Data reference ID: 0x0000
            -Version: 0x0000
            -Codec Revision Level: 0x0000
            -Codec Vendor: appl
            -Number of Channels: 1
            -Bits/Sample: 8
            -Compression ID: 0
            -Packet Size: 0
            -Sample Rate: 11025.
          stts (24) Time To Sample
          -Version/Flags: 0x00000000
          -Number Of Entries: 1
          - 0: Sample Count 14001750, Sample Duration 1.
          stsc (3832) Sample To Chunk
          -Version/Flags: 0x00000000
          -Number Of Entries: 318
          - 0: First Chunk 1, Sample per Chunk 4410, Chunk Tag 1.
          ...
          - 317: First Chunk 318, Sample per Chunk 1500, Chunk Tag 1.
          stsz (20) Sample Size
          -Version/Flags: 0x00000000
          -Sample Size: 1
          -Number Of Entries: 14001750
          stco (1288) Chunk Offset
          -Version/Flags: 0x00000000
          -Number Of Entries: 318
                8 18268 71942 125877 181638 239367 295397 351595 408044 466170
          -Dumping 1232 bytes
  trak (11018) Track Atom
    tkhd (92) Track Header
    -Version/Flags: 0x0000000f
    -Creation Time: Thu Aug 19 13:26:31 1993
    -Modification Time: Thu Aug 19 13:26:31 1993
    -Track ID: 1
    -Time Scale: 1000 per second
    -Duration: 127000
    -Movie Time Offset: 0
    -Priority: 0
    -Layer: 0
    -Alternate Group: 0
    -Volume: 0
    -Matrix:       1           0          0
                   0           1          0
                   0           0          1
    -Track Width: 320
    -Track Height: 240
    edts (36) Edit List
      elst (28) Edit Entry
      -Version/Flags: 0x00000000
      -Number Of Entries: 1
      - Entry 0: Duration 127000, time 0, rate 1.
    mdia (10882) Media Atom
      mdhd (32) Media Header
      -Version/Flags: 0x00000000
      -Creation Time: Thu Aug 19 13:26:31 1993
      -Modification Time: Thu Aug 19 13:26:31 1993
      -Time Scale: 1000 per second
      -Duration: 127000
      -Language: 0x0000
      -Quality: 0x0000
      hdlr (32) Handler
      -Version/Flags: 0x00000000
      -Component Type: mhlr
      -Component Subtype: vide
      -Component Manufacturer: appl
      -Component Flags: 0x00000000
      -Component Flags Mask: 0x00000000
      minf (10810) Video Media Info
        vmhd (20) Video Media Information Header
        -Version/Flags: 0x00000000
        -Graphics Mode: 64
        -Op Color: 0x0000, 0x0000, 0x0000
        hdlr (32) Handler
        -Version/Flags: 0x00000000
        -Component Type: dhlr
        -Component Subtype: alis
        -Component Manufacturer: appl
        -Component Flags: 0x00000000
        -Component Flags Mask: 0x00000000
        dinf (36) Data Info
          dref (28) 00 00 00 00 00 00 00 01 00 00 00 0c 61 6c 69 73 00 00 00 01
        stbl (10714) Sample Table
          stsd (102) Sample Description
          -Version/Flags: 0x00000000
          -Number Of Entries: 1
            cvid (86) Image Description (cvid)
            -Version: 1
            -Revision Level: 1
            -Vendor: appl
            -Temporal Quality: 0x3ff
            -Spatial Quality: 0x3ff
            -Width (in pixels): 320
            -Height (in pixels): 240
            -Horizontal Resolution: 72
            -Vertical Resolution: 72
            -Data Size: 0
            -Codec name: Movie Toolkit (cvid)
            -Depth: 24
            -Dumping 2 bytes
          stts (24) Time To Sample
          -Version/Flags: 0x00000000
          -Number Of Entries: 1
          - 0: Sample Count 1270, Sample Duration 100.
          stss (356) Sync Sample
          -Version/Flags: 0x00000000
          -Number Of Entries: 85
                1 16 31 46 61 76 91 106 121 136
          -Dumping 300 bytes
          stsc (28) Sample To Chunk
          -Version/Flags: 0x00000000
          -Number Of Entries: 1
          - 0: First Chunk 1, Sample per Chunk 1, Chunk Tag 1.
          stsz (5100) Sample Size
          -Version/Flags: 0x00000000
          -Sample Size: 0
          -Number Of Entries: 1270
                13850 12357 12148 12439 12320 12338 12323 12481 12383 12499
          -Dumping 5040 bytes
          stco (5096) Chunk Offset
          -Version/Flags: 0x00000000
          -Number Of Entries: 1270
                4418 22678 35035 47183 59622 76352 88690 101013 113494 130287
          -Dumping 5040 bytes

[LISTING TWO]



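         ALIGN   16
; Flip16: swap the two bytes of a 16-bit value; result returned in AX
Flip16   PROC    FAR VALUE:WORD
         MOV     AX, VALUE
         ROL     AX, 8                     ; exchanges AH and AL (286 or later)
         RET
Flip16   ENDP


         ALIGN   16
; Flip32: reverse the four bytes of a 32-bit value; result returned in DX:AX
Flip32   PROC    FAR VALUE:DWORD
         MOV     DH, BYTE PTR VALUE        ; lowest byte becomes the highest
         MOV     DL, BYTE PTR VALUE + 1
         MOV     AH, BYTE PTR VALUE + 2
         MOV     AL, BYTE PTR VALUE + 3    ; highest byte becomes the lowest
         RET
Flip32   ENDP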

[LISTING THREE]



typedef int (*ATOMFILPARSER) (HMMIO hmmio, long lName, long lOffset, long lSize);

int CollectAtomsFromFile (HMMIO hmmio, long lOffset, long lSize,
                                                      ATOMFILPARSER apfil) {
    struct {long lSize, lName; } atmh;

    // Process the various atoms as we find them
    for (; lSize > 0; lOffset += atmh.lSize, lSize -= atmh.lSize)  {
        if (lSize < (long) sizeof atmh)
            return FALSE ;
        if (mmioSeek (hmmio, lOffset, SEEK_SET) == -1)
            return FALSE ;
        if (mmioRead (hmmio, (HPSTR) &atmh, sizeof atmh) != (long) sizeof atmh)
            return FALSE ;
        // The size is stored in Motorola order and includes the 8-byte header
        atmh.lSize = Flip32 (atmh.lSize);
        // Reject atoms smaller than a header or larger than the space left
        if (atmh.lSize < (long) sizeof atmh || atmh.lSize > lSize)
            return FALSE ;
        if (! apfil (hmmio, atmh.lName, lOffset+sizeof atmh,
                                                      atmh.lSize-sizeof atmh))
            return FALSE ;
    }
    // If the movie is well-formed, we should end on an atom boundary
    return (lSize == 0);
}
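
The same walk can be tried outside Windows by running it over a buffer in memory. The following is a sketch only: walk_atoms, atom_visitor, read_be32, and count_atoms are names made up for this example, the atom name is passed to the callback as a big-endian 32-bit code rather than an mmioFOURCC value, and any atom whose size is smaller than its own 8-byte header is rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit value stored in Motorola (big-endian) order. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Called once per atom; return nonzero to keep walking. */
typedef int (*atom_visitor)(uint32_t name, const uint8_t *payload,
                            size_t payload_size, void *context);

/* Visit each atom in data[0..size). Returns 1 if the buffer ends exactly
   on an atom boundary, 0 if it is malformed or the visitor stops the walk. */
int walk_atoms(const uint8_t *data, size_t size,
               atom_visitor visit, void *context)
{
    size_t offset = 0;
    while (offset < size) {
        uint32_t atom_size;
        if (size - offset < 8)
            return 0;                          /* no room for a header */
        atom_size = read_be32(data + offset);
        if (atom_size < 8 || atom_size > size - offset)
            return 0;                          /* bad or overrunning size */
        if (!visit(read_be32(data + offset + 4), data + offset + 8,
                   atom_size - 8, context))
            return 0;
        offset += atom_size;
    }
    return 1;
}

/* Example visitor: counts atoms through an int passed as context. */
int count_atoms(uint32_t name, const uint8_t *payload,
                size_t payload_size, void *context)
{
    (void)name; (void)payload; (void)payload_size;
    ++*(int *)context;
    return 1;
}
```

A visitor that recognizes a container atom such as moov can call walk_atoms again on its payload, which mirrors the recursion in Listing Four.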

[LISTING FOUR]



 ...
hmmio = mmioOpen (szFileName, NULL, MMIO_READ | MMIO_DENYNONE);
if (hmmio != NULL) {
    // The offset returned by seeking to the end is the file size
    CollectAtomsFromFile (hmmio, 0, mmioSeek (hmmio, 0, SEEK_END),
                                                      &ParseWholeMovie);
    mmioClose (hmmio, 0);
}
 ...

int ParseWholeMovie (HMMIO hmmio, long lName, long lOffset, long lSize)
{
    switch (lName) {
        case mmioFOURCC ('m','o','o','v'):
            return CollectAtomsFromFile (hmmio, lOffset, lSize,
                                                      &ParseMoovAtom);
        default:
            return TRUE;
    }
}
int ParseMoovAtom (HMMIO hmmio, long lName, long lOffset, long lSize) {
    switch (lName) {
        case mmioFOURCC ('m','v','h','d'):
         /* your code */
            return TRUE;
        case mmioFOURCC ('t','r','a','k'):
           return CollectAtomsFromFile (hmmio, lOffset, lSize, &ParseTrakAtom);
        default:
            return TRUE;
    }
}
int ParseTrakAtom (HMMIO hmmio, long lName, long lOffset, long lSize) {
    switch (lName) {
        case mmioFOURCC ('t','k','h','d'):
         /* your code */
            return TRUE;
        case mmioFOURCC ('e','d','t','s'):
           return CollectAtomsFromFile (hmmio, lOffset, lSize, &ParseEdtsAtom);
        case mmioFOURCC ('m','d','i','a'):
           return CollectAtomsFromFile (hmmio, lOffset, lSize, &ParseMdiaAtom);
        default:
            return TRUE;
    }
}
int ParseEdtsAtom (HMMIO hmmio, long lName, long lOffset, long lSize) {
    switch (lName) {
        case mmioFOURCC ('e','l','s','t'):
            /* your code */
            return TRUE;
        default:
            return TRUE;
    }
}
int ParseMdiaAtom (HMMIO hmmio, long lName, long lOffset, long lSize) {
    switch (lName) {
        case mmioFOURCC ('m','d','h','d'):
            /* your code */
            return TRUE;
        case mmioFOURCC ('h','d','l','r'):
            /* your code */
            return TRUE;
        case mmioFOURCC ('m','i','n','f'):
           return CollectAtomsFromFile (hmmio, lOffset, lSize, &ParseMinfAtom);
        default:
            return TRUE;
    }
}
int ParseMinfAtom (HMMIO hmmio, long lName, long lOffset, long lSize) {
    switch (lName) {
        case mmioFOURCC ('s','t','b','l'):
           return CollectAtomsFromFile (hmmio, lOffset, lSize, &ParseStblAtom);
        default:
            return TRUE;
    }
}
int ParseStblAtom (HMMIO hmmio, long lName, long lOffset, long lSize) {
    switch (lName) {
        case mmioFOURCC ('s','t','s','d'):
            /* your code */
            return TRUE;
        case mmioFOURCC ('s','t','t','s'):
            /* your code */
            return TRUE;
        case mmioFOURCC ('s','t','s','s'):
            /* your code */
            return TRUE;
        case mmioFOURCC ('s','t','s','c'):
            /* your code */
            return TRUE;
        case mmioFOURCC ('s','t','s','z'):
            /* your code */
            return TRUE;
        case mmioFOURCC ('s','t','c','o'):
            /* your code */
            return TRUE;
        default:
            return TRUE;
    }
}

[LISTING FIVE]



00000000    RIFF (00E5DE86) 'AVI '
0000000C        LIST (000007D4) 'hdrl'
00000018            avih (00000038)
                        TotalFrames  : 1270
                        Streams      : 2
                        InitialFrames: 8
                        MaxBytes     : 307200
                        BufferSize   : 30720
                        uSecPerFrame : 100000
                        Rate         : 10.000 fps
                        Size         : (320, 240)
                        Flags        : 0x00000710
                            AVIF_HASINDEX
                            AVIF_ISINTERLEAVED
                            AVIF_VARIABLESIZEREC
                            AVIF_NOPADDING
00000058            LIST (00000074) 'strl'
00000064                strh (00000038)
                            Stream Type   : vids
                            Stream Handler: cvid
                            Samp/Sec      : 10.000
                            Priority      : 0
                            InitialFrames : 0
                            Start         : 0
                            Length        : 1270
                            Length (sec)  : 127.0
                            Flags         : 0x00000000
                            BufferSize    : 14654
                            Quality       : 7500
                            SampleSize    : 0
000000A4                strf (00000028)
                            Size        : (320, 240)
                            Bit Depth   : 24
                            Colors used : 0
                            Compression : cvid
000000D4            LIST (0000005C) 'strl'
000000E0                strh (00000038)
                            Stream Type   : auds
                            Stream Handler: <default>
                            Samp/Sec      : 11025.000
                            Priority      : 0
                            InitialFrames : 8
                            Start         : 0
                            Length        : 1399470
                            Length (sec)  : 126.9
                            Flags         : 0x00000000
                            BufferSize    : 1103
                            Quality       : 7500
                            SampleSize    : 1
00000120                strf (00000010)
                            wFormatTag      : WAVE_FORMAT_PCM
                            nChannels       : 1
                            nSamplesPerSec  : 11025
                            nAvgBytesPerSec : 11025
                            nBlockAlign     : 1
                            nBitsPerSample  : 8
00000138            vedt (00000008)
000007E8        LIST (00E4E7F6) 'movi'
000007F4            LIST (0000045C) 'rec '
00000800                01wb (0000044F)
00000C58            LIST (0000045A) 'rec '
00000C64                01wb (0000044E)
000010BA            LIST (0000045C) 'rec '
000010C6                01wb (0000044F)
0000151E            LIST (0000045A) 'rec '
0000152A                01wb (0000044E)
00001980            LIST (0000045C) 'rec '
0000198C                01wb (0000044F)
00001DE4            LIST (0000045A) 'rec '
00001DF0                01wb (0000044E)
00002246            LIST (0000045C) 'rec '
00002252                01wb (0000044F)
000026AA            LIST (0000045A) 'rec '
000026B6                01wb (0000044E)
00002B0C            LIST (00003A7E) 'rec '
00002B18                00dc (0000361A)
0000613A                01wb (0000044F)
00006592            LIST (000034A8) 'rec '
0000659E                00dc (00003045)
000095EC                01wb (0000044E)
00009A42            ...
00E4E9CA            LIST (00000614) 'rec '
00E4E9D6                00dc (00000608)
00E4EFE6        idx1 (0000EEA0)
00E5DE8E

[LISTING SIX]



void ParseAVIMovie (char *szFileName) {
    MMCKINFO ckAVI, ckAVIH, ckHDRL, ckSTRL, ckSTRH, ckSTRF, ckIDX1, ckMOVI;
    MainAVIHeader avihdr;
    AVIStreamHeader strhdr;
    AVIINDEXENTRY avindx;
    HMMIO hmmio;
    long lStream;

    // Open file; give up if it cannot be opened
    hmmio = mmioOpen (szFileName, NULL, MMIO_READ);
    if (hmmio == NULL)
        return;

    // Read the AVI header
    mmioSeek (hmmio, 0, SEEK_SET);
    ckAVI.ckid = ckidRIFF;
    ckAVI.fccType = formtypeAVI;
    mmioDescend (hmmio, &ckAVI, 0, MMIO_FINDRIFF);
    ckHDRL.ckid = ckidLIST;
    ckHDRL.fccType = listtypeAVIHEADER;
    mmioDescend (hmmio, &ckHDRL, &ckAVI, MMIO_FINDLIST);
    ckAVIH.ckid = ckidAVIMAINHDR;
    mmioDescend (hmmio, &ckAVIH, &ckHDRL, MMIO_FINDCHUNK);
    mmioRead (hmmio, (HPSTR) &avihdr, sizeof(MainAVIHeader));

    // Read each stream header
    for (lStream = 0; lStream < (long) avihdr.dwStreams; lStream++) {
        ckSTRL.ckid = ckidLIST;
        ckSTRL.fccType = listtypeSTREAMHEADER;
        mmioDescend (hmmio, &ckSTRL, &ckHDRL, MMIO_FINDLIST);
        ckSTRH.ckid = ckidSTREAMHEADER;
        mmioDescend (hmmio, &ckSTRH, &ckSTRL, MMIO_FINDCHUNK);
        mmioRead (hmmio, (HPSTR) &strhdr, sizeof(AVIStreamHeader));
        mmioAscend (hmmio, &ckSTRH, 0);

        // Is it video?
        if (strhdr.fccType == streamtypeVIDEO) {
            /* your code */
        }

        // Or is it sound?
        else if (strhdr.fccType == streamtypeAUDIO) {
            /* your code */
        }
        // Loop until all streams processed
        mmioAscend (hmmio, &ckSTRL, 0);
    }

    // Done reading headers
    mmioAscend (hmmio, &ckHDRL, 0);
    mmioAscend (hmmio, &ckAVI, 0);

    // Locate movi data
    mmioSeek (hmmio, 0, SEEK_SET);
    ckAVI.ckid = ckidRIFF;
    ckAVI.fccType = formtypeAVI;
    mmioDescend (hmmio, &ckAVI, 0, MMIO_FINDRIFF);
    ckMOVI.ckid = ckidLIST;
    ckMOVI.fccType = listtypeAVIMOVIE;
    mmioDescend (hmmio, &ckMOVI, &ckAVI, MMIO_FINDLIST);
        /* your code */
    mmioAscend (hmmio, &ckMOVI, 0);
    mmioAscend (hmmio, &ckAVI, 0);

    // Locate index
    mmioSeek (hmmio, 0, SEEK_SET);
    ckAVI.ckid = ckidRIFF;
    ckAVI.fccType = formtypeAVI;
    mmioDescend (hmmio, &ckAVI, 0, MMIO_FINDRIFF);
    ckIDX1.ckid = ckidAVINEWINDEX;
    mmioDescend (hmmio, &ckIDX1, &ckAVI, MMIO_FINDCHUNK);
        /* your code */
    mmioAscend (hmmio, &ckIDX1, 0);
    mmioAscend (hmmio, &ckAVI, 0);

    // Close file
    mmioClose (hmmio, 0);
}
End Listings