C PROGRAMMING

TEXTSRCH Continued: The Expression Interpreter

Al Stevens

Last month we built an expression parser that processes expressions consisting of keywords, Boolean operators, and parentheses. We will use the expressions in text searching queries for a new "C Programming" column project, a text data base indexing and retrieval system. This month we give the name TEXTSRCH to the project, and continue with the expression interpreter.

The text data base that TEXTSRCH will support is one that consists of a large number of text files that do not change much. Typical applications include files of electronic mail messages and engineering documentation. There are many others, of course, but the characteristic they share is that they are relatively static.

The primary purpose for the data base is to store our files of text in a centrally organized place. The primary purpose for TEXTSRCH is to allow us to find the files we want to read, based on some criteria. Given that we know something about what we are looking for, we want to be able to express those criteria in the form of a query expression and to find the files in the data base that match the expression. A typical expression might be:

   (cobol and fortran) and not pascal

A search of the data base would tell us which files match the query, that is, which files contain the words "cobol" and "fortran" but do not contain the word "pascal."

The expression parser from last month did a lexical scan of the expression, converting the expression's words and operators into tokens, and placing them into a stack in postfix notation. This month we will build the TEXTSRCH expression interpreter. Its function is to extract the keywords from the stack, initiate a search of the data base for a match on each one, and combine the results of the search with the Boolean operators to determine which files match the search criteria. After we build such a file list, we will switch to a text scan to complete the search and find the places in the files where the words and phrases appear. Those processes will come in a later installment. The index that we will build allows us to get down to the file level with a minimum of search time.

First let's consider how we will index our text data base. We won't build or use the index in this month's installment, but we must understand its operation in order to build the expression interpreter.

Our data base will consist of a number of text files and a word index. The index will contain an entry for each keyword or phrase that appears in the data base. The index entry for a particular word will tell us which files contain that word by pointing to a file list. The list is recorded as a bit map where each bit represents a file. At any given time, the number of files in the data base is constant, so the length of the bit map for any entry is fixed at that number of bits.

When we search the index for a word, we will get back a fixed-length bit map with those bits turned on that correspond to the files that contain the word. The mechanics of the index and the search are not important to us now; it is necessary for us to know only that a search returns such a bit map.

With such a map we can tell if a word does or does not appear in a file. Suppose our data base has 16 text files in it (not very many, but enough to illustrate the principle). If we use the index to tell us which files include the word "fortran," it might return to us a bit map such as this:

    0010001011000010

Reading the bits from right to left, we can see that the second, seventh, eighth, tenth, and fourteenth files in the list contain the word. In our data base manager, we will have a table of file names that match the bit positions, so by applying the one bits, we can derive the file names where the word appears. Understand that we have not actually read the file to get this information -- not during the search, that is. Sometime in the past we read all the files to build the index. This is why such data base architectures work best with relatively static data. The most time-intensive process is the construction of the index. After that, searches can be a snap.

You can see how a Boolean query would work. We will search the index for each word in the query. Then we will apply the Boolean operators to the bit maps that the search returns. The final bit map is the one that lists the files that match the complete search criteria.

You will remember from last month that the query begins life in infix notation that the parser translates into postfix notation. The search itself must be driven by an expression interpreter that reads the postfix stack.

Let's see now how to interpret a postfix expression. Consider a simple expression such as this one

  apples and oranges

The postfix stack for this expression would look like this:

  
<and>   
apples   
oranges

The interpreter pops entries from the stack and processes each one according to what it is. If the entry is a binary operator token such as the <and> shown here, the interpreter pops the next two tokens, processes them, and combines the two operands by applying the operator. If the entry is a word or phrase operand, the interpreter calls the search process to convert the word or phrase into a bit map.

Here is how our "apples and oranges" expression would be interpreted. First the interpreter pops the <and> token. Next it pops the oranges token and converts it into a bit map. Then it pops the apples token and converts it into a bit map. Because the <and> token is waiting, the interpreter logically ANDs the two bit maps and returns the result.

To see how the interpreter works with a more complex expression, consider this:

  fortran and not (cobol or pascal)

The postfix stack for this expression looks like this:

  
<and>   
<not>   
<or>   
pascal   
cobol   
fortran

The interpreter pops the <and> token just as it did with the simpler example. This means it expects two more operands to AND together to get the result. The next interim result it must compute is indicated by the <not> token that it pops. This token says that the result of the next operand must be the ones complement of the search result. The next token is the <or>, which says that two operands must be popped and logically ORed together. The pascal and cobol operands come next. The interpreter calls the search function for each of these words and gets back two bit maps, which it ORs together. Then the interpreter takes the ones complement of that result, and the result of all that is the first operand required by the <and> operator. The next token popped is fortran. The interpreter ANDs its bit map with the first bit map, and that delivers the result of the search. Out of breath? What I have just described seems complicated but is a relatively simple algorithm that uses recursion in a function that returns the search result. When it pops an operator, it calls itself to pop the next stack entry or two and returns the result that comes from applying the operator.

Listing One, page 154, is a new version of textsrch.h to replace the one from last month. A notable addition to this file is the MAXFILES global variable. This variable specifies the maximum number of text files in your data base. The bitmap structure defines the multiple-integer bit map that represents the search result. We use a structure here so that we can easily pass the map between functions.

Listing Two, page 154, is exinterp.c, the expression interpreter. It expects the postfix expression to have been built by the parser from last month. It returns a bit map that represents the result of the search. We'll discuss more about the search later. The exinterp function calls the recursive pop function to get the bit map. The pop function is where the real work is done. It pops a token from the postfix stack and processes it. If that token is an operand (a word), the pop function calls the search function, which returns a bit map. The bit map has a bit set for each file that includes the word. If the token is an operator, the pop function calls itself recursively to get the next token or two for the operator to work on. The three functions named and, or, and not modify the bit map to reflect the result of the Boolean operations they represent. We use functions here because the bit map is an array. (Think how much easier this would be with C++.)

Listing Three, page 155, is search.c. This one is a throwaway. Although it works, it is less than efficient. Its purpose is to allow us to test and demonstrate the expression evaluation algorithms. It includes three generic functions that we will replace later. The first one, called init_database, initializes the data base system. For now, all it does is build a list of file names that have the .TXT extension. This stub function uses MS-DOS file structures and the Turbo C findfirst and findnext functions. If you are using a different environment, the substitutions should be simple. What you want is a list of all the file names in the data base built into the array called names.

The second generic stub function is called search. Its purpose is to search the data base to determine which files have the specified word in them. This one cheats. It calls the grep program and redirects its output to a file called "hits." The grep command that it executes looks like this:

   grep -l -i<word>*.txt>hits

This format works with the Turbo C grep program and should work with most others. The -1 parameter tells grep to list only the file names that match the search, and the -i parameter tells it to make the search case insensitive. When the grep program is done, the search function reads the hits file and compares its entries to the file name list that init_database built. For those files that match, the search function sets a bit in the bit map that it builds and returns.

The third generic stub function in search.c is called process_result. It accepts a bit map as the result of an expression search. Remember that search builds a bit map for a single word, and exinterp combines all the bit maps with the operators to build a final bit map. That final one is the one that process_result gets. In this stubbed version, process_result simply lists the files that are found to match the search criteria.

Listing Four, page 156, is textsrch.c, the main function of the TEXTSRCH program and the one that ties the search process together. It calls init_database and then reads the search expression from the console. It calls lexical_scan from last month to convert the expression to postfix notation. If the expression is in error, the program displays a caret under the token where it found the problem. Otherwise, it calls exinterp to interpret the expression and process_result to process the result.

You need to compile express.c from last month and textsrch.c, search.c, and exinterp.c from this month to build the program. Make sure that you use the new textsrch.h from this month.

To run the program, build your data base first. Copy all of your text files into a single subdirectory and give them the file extension .TXT. If the files have word processor stuff embedded in them, run them through your word processor to build straight ASCII text files. Otherwise, grep might not be able to find the words it searches for.

These instructions assume an MS-DOS environment. Make sure that the grep and textsrch programs are in the path and that you are logged onto the subdirectory where you built the data base. Run the textsrch program from the command line. It will prompt you for an expression. Type one in and watch it churn. If your data base is big, it will take a while. Later, after we build an index, the search will be much faster. The result of the program will be a displayed list of files that match the query. You can verify the results with grep.

Next month we'll work on building an index into the data base. Things get interesting because we have several different ways to approach the problem.

OOPSLA

I attended the ACM's fourth annual Convention On Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), held in New Orleans in October, 1989. This was my first trip to the Crescent City, a rare confession from an old jazz musician, and besides the hoopla of OOPSLA, I heard plenty of good jazz and ran into some old friends, both on the conference floor and on bandstands throughout the town.

C++ permeated the show. This is clearly the platform of the very near future. There were demonstrations of C++ language environments from ParcPlace and Apple as well as a number of products that enhance the C++ platform. Absent were Zortech, Intek, and Guidelines, the main purveyors today of complete C++ systems for the PC. We can expect announcements soon, I hope, from some of the other vendors.

One subject popped up frequently in conversations with the C++ language vendors. There is no standard class library, and no one seems to know where the first one will come from. The attitude seems to be that users of the language will each build their own. This reminds me of the early days of C when the few functions described in K&R constituted the only standard we had, and the compiler vendors extended it with their own unique versions of the things left out. There was the Unix library to go from, but many of the PC compilers went their own ways. Market pressure enforced compliance with a de facto standard long before the ANSI X3J11 committee was close to a finished draft. That pressure was created by the overwhelming success of the Microsoft C compiler. Every other compiler vendor tried then to be compatible with good old MSC, and a PC standard of sorts evolved. Some of that standard -- the pieces that were not PC specific -- found its way into the ANSI draft.

Right now there are several books about C++. Most of them describe some specific classes that they use to explain how you build classes rather than to advance a standard for a class library. I did the same thing myself a few columns back with homegrown classes for windows, menus, strings, and linked lists. None of these disparate libraries are likely to drive an industry-accepted standard. I suspect that the first compiler vendor to bundle a class library with a highly successful C++ compiler will hold the lead.

Bjarne Stroustrup makes the point that we mustn't think about a universal class library the way the pure object-oriented data model would have us do. Classes defy universal definition because the universe is too big. Remember that the purpose of the standard C library is to define methods for manipulating generic objects: Files, strings, characters, numbers, memory blocks, program execution, dates, times, and so on. A standard class library must restrict itself as well to classes that have general application, and the hierarchy for a given program is not necessarily a subset of a universally defined hierarchy but rather one contrived to suit the problem at hand. Eventually the builders of vertical applications might come to share their work in a series of application-specific class libraries. I doubt it, though. Few such industry-wide sharings of C function libraries exist.

The Teaching of C

Another frequently-expressed sentiment at OOPSLA was that C++ is too hard to learn, and the language industry fears that programmers will be scared off by this. They compare C++ to one of the new Pascals with object-oriented extensions and conclude that the Pascal route is easier to travel. They are right to some extent. C++ is more difficult to learn than object-oriented Pascal. But then again, traditional C is more difficult to learn than traditional Pascal, yet C became and remains the dominant language. Why? Because, once learned, C becomes the programmer's preferred language, religious language wars notwithstanding. I think that C is more difficult to learn only because it is more difficult to teach. We need to educate the teachers first. They need to know how to present C (and C++) to the student in logical increments.

C teachers should be required to teach Pascal first. Maybe its syntax will help them understand the properties of a programming language whose original purpose was for teaching programming. C instructors around the country are bewildering students with abstruse code examples the likes of which only a terse language such as C will permit. It isn't that you have to write C programs that way, it's just that you can, and the teachers haven't learned to leave that part until later.

One of the first lessons in K&R is about how you can say this:

    
while ((c = getchar( )) != EOF)     
...

Many teachers conclude that this facility of C to support embedded expressions is so powerful that it needs to be taught first thing out of the chute. There is no compelling reason to avoid such expressions once you understand them, but there is also no reason why a newcomer to C shouldn't be spared these details until he or she is ready for it.

The K&R book precedes this example with a less terse one that breaks each of the expressions and operations into individual statements like this:

  
   c = getchar( );
   while (c!= EOF) {
        ...
        c = getchar( );
   }

The purpose of this earlier example is to introduce the experienced programmer to the terser style of code and to illustrate how every statement is an expression and that you can use that feature of C to write fewer lines of code.

Upon seeing such a program, a C veteran is immediately moved to implode everything until it looks like the first example. But the rookie finds the more verbose program a lot easier to comprehend. The student sees a working program before learning a lot of the finer points of C. The extent to which a teacher should delay some of the finer details will depend on the programming experience of the new C student, particularly because more and more curricula are catering to students who pass progressively through layers of prerequisite classes without logging any real-world programming time. Try, for example, to explain to someone who is not a programmer why recursion is important.

Another problem is that teachers teach too many other complex topics in what are supposed to be C courses. They strive to shroud C in mystery. I recently tutored a student who was enrolled in an advanced C course at a community college. The programming assignments required her to use the low-level I/O and interrupt functions of Turbo C to manipulate the sectors of a DOS diskette directory. The class spent more time learning about the internals of DOS's file system than about C. The instructor's rationale for such assignments was that C is used primarily for systems programming, and DOS is the system that the school uses. A student who is going into a non-DOS C environment has been cheated by this approach. To begin with, the instructor's assumption is no longer valid. We use C to develop nearly every kind of software system. By spending so much of the class's time and energy on the learning of DOS and BIOS sector I/O, the instructor has lost valuable opportunities to teach C. This class would have been more appropriately labeled an MS-DOS Systems Programming Course with C as the programming language. Fewer students would have signed up, of course, but the ones who did would have gotten what they paid for, and the others could have been elsewhere in a real C class.

The ANSI Corner

Sometimes an improvement can get in the way. The draft ANSI specification for standard C adds adjacent string literal concatenation to the language, and that is a handy thing to have. With this feature you can code adjacent literals, and the compiler will concatenate them. For example, consider this code:

    printf("Hello," "World");

When the compiler sees adjacent string literals such as these, it combines them. The power of this facility becomes apparent when used along with the preprocessor in this way:

#define PROGRAM "Jiffy SpreadSheet"
#define VERSION "v1.23" 
#define DATE "1990"

printf(PROGRAM VERSION "        Copyright " DATE);

This feature, handy as it is, cost me a bit of time. I had the following array in a program:

    char *names[] = {
          "add",
          "change",
          "delete",
          "cut"
          "paste",
          NULL
     };

See that missing comma after the "cut" literal? I didn't, not for a long time, and I couldn't figure out why this statement worked only some of the time:

    if (strcmp(names[i], name) = = 0)

I could add, change, and delete, but I couldn't cut or paste. After I found the problem, I stared in disbelief, wondering why the compiler hadn't warned me about the missing comma. Then it hit me. Good old ANSI C thought I really meant to build a "cutpaste" string. But, when viewed in the context of how I coded the array, the missing comma was hard to spot, and, once revealed to the old eyeball, looked like it should have been flagged by the compiler as an error. Even after I saw the error, it was not immediately obvious to me that I had coded perfectly acceptable ANSI C language. The compiler, however, looks dispassionately at our code and cannot guess our intentions. What I coded was legitimate C, yet it didn't work worth a hoot.

Here is another use for the # preprocessor operator that we discussed last month. Suppose you have some maximum number of things your program will handle and you define that maximum this way:

  #define MAXTHINGS 500

When the user tries to add the 501st thing, you want to display a message that specifies the limit. We used to do it this way.

  printf("%d entries only", MAX);

This is a contrivance of constants. MAXTHINGS is a constant, and should be included in the string literal. But pre-ANSI C had no convenient way to do that and still preserve the global definition of MAXTHINGS. With the new # operator and adjacent string concatenation, we can do this:

  
#define str(x) #x   
printfSTR(MAX "entries only");

Book of the Month

Get yourself a copy of C Traps and Pitfalls by Andrew Koenig (Addison-Wesley, 1989). It is a collection of the mistakes a C programmer makes. Besides the obvious ones (the dangling else, for example), the book contains many insightful discussions of less obvious but common errors including programmer-induced malfunctions of the dreaded C pointer.

Many of the errors that Koenig describes will be flagged as warnings by contemporary ANSI conforming compilers. But a lot of pre-ANSI code is still out there in maintenance, and we must either turn the warnings off or disregard them, and so we are sometimes susceptible to these errors now just as in the old days.

_C PROGRAMMING COLUMN_ by Al Stevens [LISTING ONE]



/* ----------- textsrch.h ---------- */

#define OK    0
#define ERROR !OK

#define MXTOKS 25           /* maximum number of tokens */
#define MAXFILES 512        /* maximum number of files  */
#define MAPSIZE MAXFILES/16 /* number of ints/map       */

/* ---- the search decision bitmap (one bit per file) ---- */
struct bitmap {
    int map[MAPSIZE];
};

/* ------- the postfix expression structure -------- */
struct postfix  {
    char pfix;      /* tokens in postfix notation */
    char *pfixop;   /* operand strings            */
};

/* --------- the postfix stack ---------- */
extern struct postfix pftokens[];
extern int xp_offset;

/* --------- expression token values ---------- */
#define TERM     0
#define OPERAND 'O'
#define AND     '&'
#define OR      '|'
#define OPEN    '('
#define CLOSE   ')'
#define NOT     '!'
#define QUOTE   '"'

/* ---------- textsrch prototypes ---------- */
struct postfix *lexical_scan(char *expr);
struct bitmap exinterp(void);
void init_database(void);
struct bitmap search(char *word);
void process_result(struct bitmap);

[LISTING TWO]



/* ------------ exinterp.c --------------- */

/*
 * An expression interpreter that processes the
 * tokens on a postfix stack.
 */

#include "textsrch.h"

static struct postfix *pf = pftokens;

static struct bitmap pop(void);
static struct bitmap not(struct bitmap map1);
static struct bitmap and(struct bitmap map1, struct bitmap map2);
static struct bitmap or(struct bitmap map1, struct bitmap map2);

/* ----- entry to the interpreter
         returns a bitmap which indicates the files
         that match a search expression ------------- */
struct bitmap exinterp(void)
{
    /* ------- find the top of the postfix stack ------- */
    while (pf->pfix != TERM)
        pf++;
    /* ------- get the result of the expression ------- */
    return pop();
}

/* ------ pops an operand and converts it to a bit map ------ */
static struct bitmap pop(void)
{
    struct bitmap map1;
    switch ((--pf)->pfix)   {
        case OPERAND:
            map1 = search(pf->pfixop);
            break;
        case NOT:
            map1 = not(pop());
            break;
        case AND:
            map1 = and(pop(), pop());
            break;
        case OR:
            map1 = or(pop(), pop());
            break;
        default:
            break;
    }
    return map1;
}

/* ------- unary <not> operator ----------- */
static struct bitmap not(struct bitmap map1)
{
    int i;
    for (i = 0; i < MAPSIZE; i++)
        map1.map[i] = ~map1.map[i];
    return map1;
}

/* ------- binary <and> operator -------------- */
static struct bitmap and(struct bitmap map1, struct bitmap map2)
{
    int i;
    for (i = 0; i < MAPSIZE; i++)
        map1.map[i] &= map2.map[i];
    return map1;
}

/* ------- binary <or> operator -------------- */
static struct bitmap or(struct bitmap map1, struct bitmap map2)
{
    int i;
    for (i = 0; i < MAPSIZE; i++)
        map1.map[i] |= map2.map[i];
    return map1;
}

[LISTING THREE]



/* ---------- search.c ----------- */

/*
 * stub functions to simulate the TEXTSRCH retrieval process
 */

#include <stdio.h>
#include <dir.h>
#include <string.h>
#include <process.h>
#include "textsrch.h"

static char names[MAXFILES][13];
static int namect;
static void setbit(struct bitmap *map1, int bit);
static int getbit(struct bitmap *map1, int bit);

/* --------- initialize the data base environment --------- */
void init_database(void)
{
    struct ffblk ff;
    int rtn;
    rtn = findfirst("*.txt", &ff, 0);
    while (rtn != -1 && namect < MAXFILES)  {
        strcpy(names[namect++], ff.ff_name);
        rtn = findnext(&ff);
    }
}

/* ------ search the data base for a match on a word ------ */
struct bitmap search(char *word)
{
    int i;
    FILE *fp;
    char str[80];
    struct bitmap map1;

    for (i = 0; i < MAPSIZE; i++)
        map1.map[i] = 0;
    sprintf(str, "grep -l -i %s *.txt >hits", word);
    system(str);
    if ((fp = fopen("hits", "rt")) != NULL) {
        while (fgets(str, 80, fp) != NULL)  {
            *strchr(str, ':') = '\0';
            for (i = 0; i < MAXFILES; i ++) {
                if (stricmp(str+5, names[i]) == 0)  {
                    setbit(&map1, i);
                    break;
                }
            }
        }
        fclose(fp);
    }
    return map1;
}

/* ---- process the result of a query expression search ---- */
void process_result(struct bitmap map1)
{
    int i;
    for (i = 0; i < MAXFILES; i++)
        if (getbit(&map1, i))
            printf("\n%s", names[i]);
}

/* ------ sets a designated bit in the bit map ------ */
static void setbit(struct bitmap *map1, int bit)
{
    int off = bit / 16;
    int mask = 1 << (bit % 16);
    map1->map[off] |= mask;
}

/* ------ tests a designated bit in the bit map ------ */
static int getbit(struct bitmap *map1, int bit)
{
    int off = bit / 16;
    int mask = 1 << (bit % 16);
    return map1->map[off] & mask;
}

[LISTING FOUR]




/* ----------- textsrch.c ------------- */

/*
 * The TEXTSRCH program.
 */

#include <stdio.h>
#include <process.h>
#include <string.h>
#include "textsrch.h"

void main(void)
{
    char expr[80];
    /* -------- initialize the text data base --------- */
    init_database();
    do  {
        /* ----- read the expression from the console ------ */
        printf("\nEnter the search expression:\n");
        gets(expr);
        if (*expr)  {
            /* --- scan for errors and convert to postfix --- */
            if (lexical_scan(expr) == NULL) {
                /* ---- the expression is invalid ---- */
                while(xp_offset--)
                    putchar(' ');
                putchar('^');
                printf("\nSyntax Error");
                exit(1);
            }
            /* --- interpret the expression, search the data base,
                and process the hits ------- */
            process_result(exinterp());
        }
    } while (*expr);
}