Don't stop me if you've heard this one.
A man is sent to prison, and during his first supper inside, a fellow prisoner stands up and says loudly, "Five thousand nine hundred and thirty-three." Everybody laughs. The next night, a different convict jumps up and shouts, "Three thousand and ten," and the crowd breaks up. This continues: The next day in the yard, a prisoner mutters, "Two sixteen," and the prisoners near him snicker and snort. Puzzled, the man finally asks his cell mate what's going on.
"There's one joke book in the prison library," his cell mate explains, "with the ten thousand best jokes, all numbered. Well, a lot of us have been here long enough to have memorized the jokes, so when we want to tell one, we just use its number. My favorite is four hundred and eighty-seven." And he chuckles softly to himself.
The next night at supper, the man stands up and shouts, "Four hundred and eighty-seven." His fellow prisoners stare at him in silence. Crushed, he sits back down and tries to finish his meal.
"I thought you said four hundred and eighty-seven was a good joke," he snarls at his cell mate later that night.
"It's one of the best," the cell mate drawls.
"Then why didn't anybody laugh?"
"You didn't tell it right."
If you're wondering right now what the number of the joke I just told is, you have the right mind-set for a discussion of representation in Lisp. If, furthermore, you think that the number of that joke ought to be four hundred and eighty-seven, and if you've thought about what a different joke it would be if that were the case, you must be a Lisp hacker.
Lisp hackers often point to certain features of the language in explaining why they choose to program in Lisp. These include recursion, the ability to operate on code as data, and extensibility. None of these features are unique to Lisp, though, and none of them gets across why Lisp is unique.
In this month's column, we'll take a look at representation in Lisp, and try to get some sense of the power inherent in Lisp's representation scheme, and we'll examine the wide range of data structures supported by the Common Lisp standard, built on this representation scheme.
"Author: The idea is to imitate Godel's self-referential construction, which as you know is INDIRECT, and depends on the isomorphism set up by Godel-numbering.
"Crab: Oh. Well, in the programming language Lisp, you can talk about your own program directly ... because programs and data have exactly the same form. Godel should have just thought up Lisp...." -- Douglas Hofstadter, a fictional character, and Crab, a crab, in Godel, Esher, Bach, by Douglas Hofstadter.
John McCarthy thought up Lisp 30 years ago as a tool for manipulating symbolic expressions, which is essential for tasks like symbolic integration. But the real point was to make it possible for a program to talk about a program, with an eye to developing provably correct programs. Toward this end, McCarthy used a class of expressions called "s-expressions," or symbolic expressions, based on Alonso Church's work on the lambda calculus. S-expressions permit programs and data in Lisp to have the same form. Programs and data in Lisp are both coded as s-expressions.
What, exactly, are s-expressions? That's really two questions: What do they look like and what do they do, or what's the notation and what's the interpretation? Both questions are fruitful.
The notation McCarthy used for s-expressions was list notation, and the name Lisp is an acronym for LISt Processing. Roughly, s-expressions are lists. But the matter of s-expression notation goes a little deeper than this.
Formally, an s-expression is either an atomic symbol (atom) or a list of s-expressions:
What's an atomic symbol? An atomic symbol, or atom, is, formally, just a string of characters subject to certain constraints. It's a name, a symbol, such as a name for a variable or constant or function in any language. Or, as we shall see, not quite like a name in any other language.
And the notation for lists? A list can be recursively defined to be either the empty list (which is sometimes written using the atom NIL) or an item like the thing on the right below:
In other words, all Lisp lists are binary trees with all their (non-NIL)atoms hanging off their left terminal branches, and all right terminal branches containing NILs. NIL is a primitive Lisp object (atom) used for several purposes, including to represent Boolean false and the empty list.
An alternative notation for lists used in Lisp is called "dot" notation. In dot notation, a (dotted) list can be defined thus:
Some examples of Lisp s-expressions are:
Here are prose descriptions of the five s-expressions above:
A is not a list, nor even a dotted list, but is both an atomic symbol and an s-expression.
(A . B) is a dotted list, but not a true list, because it has a non-NIL right terminal branch.
(A . NIL) is the list containing one element: The atomic symbol A.
(A . (B . NIL)) is the list containing the two elements A and B.
(A . (B . ((C . (D . NIL)) . NIL))) is the list containing three elements: A, B, and the list containing the two elements C and D.
Figure 1 shows pictures of the binary trees that the s-expressions represent. There are reasons for using dot notation in low-level programming, but for most purposes the simpler list notation is used. In this shorthand form, the above forms are written:
Lists in this notation can have any number of elements, but note that the underlying representation is still the binary tree described by the dotted pairs of the corresponding dot notation. This (non-dotted) notation is the form in which all Lisp programs are normally written. Every Lisp program or subprogram begins and ends with a matched pair of parentheses, usually with more pairs inside and with atomic symbols inside them. Every data object is also expressed as one of these parenthesized lists if it is not represented as an atomic symbol.
Lisp students early on decided that the acronym List stood for "Lots of Insipid, Stupid Parentheses." Little Wonder.
"Author: I find indirect self-reference a more general concept, and far more stimulating, than direct self-reference. Moreover, no reference is truly direct -- every reference depends on SOME kind of coding scheme. It's just a question of how implicit it is. Therefore, no self-reference is direct, not even in Lisp."
-- ibid.
We've looked at the notation used to write Lisp programs (list notation) and at the more elemental dot notation and the underlying data structure that lists form. We've seen that lists contain other lists and/or atomic symbols, and that properly-formed combinations of lists and atomic symbols are called symbolic expressions, or s-expressions, and that s-expressions are the universal form in which data objects and programs in Lisp are expressed. So far, these are just notational issues, and don't get at the power of Lisp representation. Now let's consider what these s-expressions mean.
Atomic symbols are named data objects. The object named can be a constant or a variable, and the structure of the object can be anything. Each symbol has associated with it a structure called its "property list," or p-list, which allows it to be treated as a record structure with an extensible set of components, each of which can be an arbitrarily complex named data object. The name can be retrieved, given the object, or can be used to retrieve the data object, and there are functions for creating and deleting data objects, and for manipulating their property lists. The tools for manipulating symbols are powerful, but the real power of Lisp comes in the ability to treat tools themselves symbolically, and operate on them. A Lisp named function is just one particular kind of named data object that atomic symbols can comprise.
The difference between constants and variables is roughly one scope of binding. In Lisp, the values bound to constants and variables are maintained in different data structures.
A Lisp constant is an atomic symbol that possesses an associated constant value. One of the things that can be stored in the p-list of an atomic symbol is what is called an "apval" (constant value). If a symbol has the string "apval" as the first element of a pair of values on its p-list, the symbol is a constant, and its constant value is the second element of that pair.
Variables in Lisp are atomic symbols, too, but they don't have apvals. They get their values via a different list. Lisp maintains a data structure called the "a-list," or association list. This is a list of symbol value pairs representing the current bindings of variables and of function names. At any point while Lisp is running, we are "inside" an s-expression; that is, an s-expression is being evaluated. The function that does the evaluation, called "eval," maintains the a-list. Constant values are not affected by a-list bindings, and the p-list is checked before the a-list when retrieving a value for a symbol.
Named functions are also maintained by eval, and the binding stored on the a-list. It is worth noting that functions are Lisp objects, and have a data type. All functions are of type function.
The meaning of a list is generally dependent on the meanings of its constituent atomic symbols, because a list is a data structure. One use of lists, though, attaches quite a different sort of meaning to a list. This is the use of a list as the invocation of a function. The fundamental unit of interaction with Common Lisp is the form, which is an s-expression, ergo a data object. The form is handed to eval to be interpreted as a function, with a new data object to be returned. In this interpretation, the first element of the list names a function to be applied to the other elements as arguments. (This is over-simplified).
Although the use of a list to represent the invocation of a function and its use to represent a particular data object may seem very different, and are in truth very different, the style of Lisp programming often makes the two uses seem very much alike. Consider the function FIRST, which takes a list as an argument and returns the first element of the list as its value. When you write
you are executing a function on the list VENDORLIST. But it's obvious that you are also referring to a particular data object, specifically the FIRST of VENDOR_LIST. To the implementer of the Lisp system, it matters that FIRST is a function, but for most purposes, the user can think of it as a way of referring to a particular element of a complex data structure. This is a property of functions generally, not restricted to Lisp, but the complete dependence of Lisp on functions makes this a distinctive aspect of the "feel" of programming in Lisp.
In Common Lisp, an object may belong to more than one data type, so it is not generally meaningful to inquire about the data type of an object. For example: the simplest data type is type null. It consists of exactly one item: NIL. NIL is both list and atom, it means both false and the empty list, and it can be written NIL or ( ). It is possible to inquire whether an object belongs to a particular type, and a Boolean function typep is provided for this purpose. It is objects, however, and not variables, that can be typed. A Lisp variable can have any Lisp object as its value.
Dotted lists, true lists, and trees are all different ways of viewing structures of conses. A cons is a Lisp data type, and is another name for the dotted pair: ( <s-expr> . <s-expr> ). What we have been calling a list is a chain of conses branching off the right-side s-expressions. If the list is terminated by NIL, it's a true list, if by some other atom, it's a dotted list.
Many Lisp data types are built from conses. The data type list, for example, is the union of the cons and null data types, so it includes true lists and dotted lists. Lists are supported by a large number of list-manipulation functions. Lists can also be used as sets, and there are set functions, such as union, intersection, and the like, in the Common Lisp definition. The a-list, used by function eval to keep track of bindings, is a list of conses.
Common Lisp also supports sequences and other compound data types, hash tables, arrays, vectors, strings, streams, and structures, among other data types. Conceptually, these, too, can be viewed as being built of cons cells, although some of them may be implemented in some more efficient fashion. Type sequence includes both lists and vectors, which are one-dimensional arrays. An array is an object with components arranged according to a Cartesian coordinate system. Arrays can have their sizes adjusted dynamically after creation, or can be of type simple array, and not permit this. Non-simple arrays can also share data with other non-simple arrays. Element 1,1 of a two-dimensional array foo is specified via the array-reference function aref thus: (aref foo 1 1). The data type vector is a subtype of type array. A string is a vector of type string-char. A structure is an instance of a user-defined data type with a fixed number of named components. These are similar to Pascal records.
Some Lisp data objects are what is called "self-evaluating forms." Because, as I suggested earlier, Lisp is always evaluating forms, even objects such as numbers, characters, and the empty list get evaluated. These objects return themselves as values. Common Lisp defines four broad kinds of numeric data types: integer, ratio, floating point, and complex; and one-character data type, with subtypes.
Integers can be of any length, and are implemented by means of two subtypes: fixnums and bignums. Fixnums are limited in length and are more efficient to use; bignums can be any length. The definition of Common Lisp calls for the distinction between fixnums and bignums to be hidden from the user most of the time. The default radix is decimal, but any radix from 2 to 36 can be specified.
The ratio data type permits precise representation of rational numbers, such as the function 2/3. Common Lisp always converts a ratio to lowest terms, and to an integer if possible.
Common Lisp provides names but not full specifications for four different floating-point data types. These are short, single, double, and long.
Complex numbers are represented in Cartesian form as pairs of numbers of any of the other data types; thus the actual type of a complex number is the type complex and the type of the components, such as complex short float. The components are coerced to have the same type.
There are two important Common Lisp character data subtypes, standard-char and string-char. The former is intended to be as portable a character set as possible; that is, programs written in it should port easily. This is important, because the character-data type provides for such features as specification of font attributes and various modifying flags, the latter for programmers with terminals from Mars. The string-char subtype comprises all characters that can be contained in strings. These can't have any of those funny flags or font attributes attached to them.
Next month we'll look at the full hierarchy -- although one author has called it a heterarchy -- of Common Lisp data types, and at some of the object-oriented extensions to Common Lisp. We'll also look at the way in which Lisp functions are interpreted, which will get us into what Douglas Hofstadter has called a double-entendre:
..double-entendre can happen with Lisp programs that are designed to reach in and change their own structure. If you look at them on the Lisp level, you will say that they change themselves; but if you shift levels, and think of Lisp programs as data to the Lisp interpreter ... then in fact the sole program that is running is the interpreter, and the changes being made are merely changes in pieces of data."
Notes on Notation
<s-expr> ::= <atom> | <list-of-s-expressions>
<list> ::= NIL | ( <s-expr> <list> )
<dotted-list> ::= ( <s-expr>. <s-expr> )
A
(A . B)
(A . NIL)
(A . (B.NIL))
(A . (B . ((C . (D . NIL)) . NIL)))
(No list representation for A, since A is not a list.)
(No list representation for (A . B), since (A. B) is not a "true" list.)
(A) (A B) (A B (C D))
Figure 1: Binary trees that s-expressions represent
A
* *
/ \ / \
/ \ / \
A B A *
/ \
* / \
/ \ B *
/ \ / \
A NIL / \
* * NIL
/ \ / \
/ \ / \
A * C *
/ \ / \
/ \ / \
B NIL D NIL
How Symbols Symbolize
(FIRST VENDOR_LIST)
Common Lisp Data Types