LETTERS

WEB: Paeans and Pointers

Dear DDJ,

I was pleased to see Andrew Schulman pay some attention to WEB in his August 1992 "Programmer's Bookshelf" column. This was the first coverage of the topic I've ever seen outside academic journals.

It was interesting to see Knuth admit that logging his errors hadn't diminished their frequency. I wonder if he's measured the effect of using WEB on his stumble rate. I haven't measured it myself, but could swear I make fewer errors since starting to use WEB; it seems to discourage the "code first, think later" style of work that constantly tempts programmers. Perhaps this works in a way similar to how having to explain a program discourages you from loading it down with features that don't belong in it, as both Knuth and Schulman note. For if you can explain your implementation of an algorithm--or better still, if you have explained it--there's a good chance that your code actually works.

One last thing: Readers following up on Schulman's reference to CWEB will be disappointed if their C dialect is ANSI, let alone C++. For these I strongly recommend a system called FWEB, which supports these dialects as well as Fortran and Ratfor. It includes binaries for MS-DOS as well as build procedures for UNIXes, VMS, and MVS. OS/2 isn't explicitly provided for, but is an easy port with the OS/2 GNU tools. FWEB is available by anonymous ftp from lyman.pppl.gov in the directory pub/fweb.

Lew Perin

Jersey City, New Jersey

C Question

Dear DDJ,

Many, if not most, versions of the memcmp and strcmp C library functions perform an unsigned comparison of the supplied strings (although this is not usually stated in the accompanying documentation). I was recently involved in a discussion (with a group of C programmers) about whether these functions should perform signed or unsigned comparisons. Since there was no general agreement, I was wondering if some of your readers would be prepared to offer an opinion (or two). It would seem to me that the answer lies in the precise meaning of the phrase "lexicographical order."

Llewellyn R. Griffiths

Victoria, Australia

One Steak, One Sushi

Dear DDJ,

I'm writing with regard to Homer Tilton's letter on Japanese patents in the July 1992 issue. I suspect that his steak dinner is safe. Automated language translation was one of the first applications computer scientists tackled in the 1960s. It turned out to be an enormous problem. I haven't heard much about it for ten years or more. But I'm sure the language translation folks haven't given up; are making progress; and will be able to do something along the line of what Mr. Tilton desires in another few decades. Perhaps you'll get a letter or two from experts who can explain the state of the art.

A computer program for translating English to Japanese (or any other language) is an enormously difficult undertaking. Almost every English word has multiple meanings, and while Japanese has words or phrases to match most English words, the word that matches one meaning usually is not the same as the word that matches another.

Words and phrases often have context-specific meanings. Even translating the individual words in a phrase "right" is no guarantee that the phrase will make sense or that it will mean what the writer intended it to mean.

Idioms, homonyms, spelling errors, grammatical errors, and major language exceptions all need to be handled.

There are a number of things that need to be rephrased in moving between the languages. Japanese verb tenses are not the same as English verb tenses. While many words map directly between the two languages, some common ones don't.

The language translator is not just dealing with the mechanical replacement of a few tens of thousands of words with one-to-one equivalents while juggling word order to get verbs at the end of the sentence. Analysis must go beyond syntax (bad enough in English) and worry about context. There are lots of special cases and judgment calls. It's a big job.

One example of the problems: A Japanese vendor managed to translate the phrase "track jam" (something that happens when a piece of paper sticks while flowing down a mechanical track) as "the jelly on the track." That happened with human translators. Lord knows what a computer would do.

Pragmatically, even if you had an English-to-Japanese translation program, how would you proofread the result?

An inexpensive alternative to a Japanese law firm would be to hire Japanese-speaking graduate students from the engineering department of a nearby college or university. Have one translate the document into Japanese and a second translate it back. Work with one of them to resolve any inconsistencies you spot in proofreading.

Donald Kenney

Canton, Michigan

Dear DDJ,

I was amazed by the letter by Homer B. Tilton published in the July 1992 issue. I think that if I were to file a patent here in the U.S., I would have to have my patent application professionally localized to the American variety of English, if not to Mexican Spanish. I very much doubt that the government of this country would accept the application written in Japanese. How could one hope for the Japanese government to accept a patent filing in English? I thought this highly Bushy.

I have been working on CAT (computer-aided translation) for some years using various translation software. I have only been successful with automatic translation of simple things such as "hello world." Beyond this level of complexity, the difference in cultural background between the two languages makes it impossible to generate any comprehensible documents automatically. The level of complexity of patent filings in Japanese is such that no average mortal would ever attempt to even read, let alone file, such a document. I doubt that the situation is much different in the U.S. The government formalities and specific wordings required in patent filings in the target language, whichever it may be, simply cannot be captured by rule-based logic or any sort of currently available machine intelligence. Being a native Japanese speaker, I have attempted to improve my English writing skills using various software, without much success. Could software help the average person come up with a correct patent filing even in his or her own language? My experience indicates the answer is negative.

To focus on the cultural-background problem, let us as an example develop a translator from properly written Fortran to properly written C code. These two languages are alike except for their philosophical backgrounds, historical evolution, and system environments. Unless the translation system recognizes and appreciates these differences, the result will still be Fortran code written in C syntax. No serious programmer would accept such code as a C program. Wouldn't the same sentiment apply to the "Japatent" filing? Instead of asking for such an impossible product as automatic patent-filing software, I would go for a real Fortran-to-C translator. If someone comes up with such software, my treat will be sushi instead of cholesterol-rich red meat.

Shohei Nakazawa

Sunnyvale, California

CRCMEN

Dear DDJ,

I take exception to Mark R. Nelson's statement in the article "File Verification Using CRC" (May 1992) that "it is possible that an exceptionally clever virus will be able to defeat it [CRCMAN]." He then continues, "Further modifications to the program would improve its ability to fight viruses. For example, just storing the length of the file along with its CRC would make a virus programmer's job much more difficult, if not impossible."

On page 65, Mark also states, "The challenge to the virus programmer would then be to add new bytes to the end of the file so that the original CRC was restored." This is not only misleading but also demonstrates that Mark is not familiar with a class of viruses known as "stealth" viruses. It is not necessary for a virus to defeat a scheme like CRC-32 to evade detection. There are currently a number of viruses (e.g., 4096, Fish-6, 512) that fool all such schemes, including those that employ cryptographic checksums. It might surprise Mark that none of these viruses attempt a direct attack on checksum algorithms, yet they still succeed in almost every case!

For any detection mechanism to work against such beasts, the verifier must first establish a secure channel between the objects on the disk and its memory buffers. The reason is simple: Stealth viruses intercept disk access and undo the modifications they have made. In other words, the verifier will get clean input. Therefore, no matter how sophisticated the algorithm is, it will be fooled every time.

This does not mean that CRC-32 and such schemes cannot be used to implement an integrity checker. In fact, they are quite adequate as far as catching modifications, as long as the verifier can get to the actual contents of the objects on the disk. If a stealth virus is active in memory, other measures must be taken to gain untampered access to the disk. Under MS/PC DOS, this is not as trivial as it sounds; and it may be one of the reasons why there are many professional software packages that deal with viruses. Homemade solutions such as CRCMAN are simply not adequate. Mark should have at least advised his readers to boot from a clean, write-protected floppy diskette and then run CRCMAN from the same diskette (in case the copy on the hard disk is infected). Otherwise, CRCMAN not only fails to find any modifications, but also may spread the infection to all programs it checks.

Besides the problem mentioned above, the article was excellent, and I must express my admiration for Mark's Data Compression Book (M&T Publishing, 1991). Keep up the good work.

Tarkan Yetiser

Dear DDJ,

Mark Nelson's article, "File Verification Using CRC," and its accompanying C code are clearly written and make for enjoyable reading. The only thing I don't agree with, and find potentially dangerous, is Mark's claim that it would take "a few weeks of computing power" to come up with a file with a CRC given in advance, or to produce two files with the same CRC. This is true if a simple brute-force attack is made, but with a little help from linear algebra it can be done in a matter of seconds; see Example 1.

Example 1

  ttttttttttttttttttttttttttttttttttttttttt
  (41 t's (i.e., 74 hex) followed by the end-of-file mark (1A hex)).

  qtttttqttqqtttttqtttqqqtqqtqqtqqqtttttttt
  (some t's replaced by q's; use s instead of q if you wish).  Both files
   have CRC = 3EEFECA4.

This algebraic approach is based on a simple observation (which follows from the properties of polynomial division): The CRC bits are linear functions of the file bits. To be a bit more specific, consider only the last 32 bits (four bytes) of a given file. They can be thought of as components of a binary vector x of size 32. Let y be the corresponding CRC. It can also be interpreted as a binary vector; moreover, we have y = Ax + b for some constant binary matrix A and vector b. They can be found by setting x equal to various unit and 0 vectors and computing the corresponding CRCs. It follows that Ax = y + b (addition and subtraction are the same in the 0-1 number field) and finally x = A^-1(y + b). I used Maple V to invert A in the 0-1 field and produce the examples below. Once you have A^-1, you can, given any CRC y, find x from the aforementioned formula and thus construct a file whose CRC equals y.

The last four bytes in the files in Example 1 were 74 74 74 1A (hex). Change them to 31 D9 B5 54 to obtain CRC = 00000000. Or change them to 50 F7 10 7A to obtain CRC = BEEFBEEF.

I apologize for this full-scale lecture, but since it is so easy to forge CRCs, I believe that a warning of some sort to the unsuspecting readers is in order. Please feel free to use this material if you like it. In my opinion, message digest algorithms (such as MD5) are much more reliable for virus detection than CRCs. MD5 is described in RFC 1321, available by anonymous FTP from rsa.com.

Miroslav D. Asic

Newark, Ohio

Mark responds: I appreciate (without necessarily understanding) Professor Asic's demonstration of CRC-32 inversion. While my article pointed out that a 32-bit CRC could be broken fairly easily under a brute-force attack, I did not explain that a more sophisticated approach could reduce the procedure to a matter of seconds.

What I do need to re-emphasize is that a CRC check by itself does not constitute protection against a virus. Professor Asic shows us that the algorithm is no match for a card-carrying practitioner of linear algebra. My article just demonstrated that even the mythical "teenage hacker" can invert a CRC given a spare CPU and a C compiler. Let's just hope that Professor Asic and his colleagues side with the good guys in this fight.


Copyright © 1992, Dr. Dobb's Journal