The I-String Strategy

Dr. Dobb's Journal March 1997

A winner in both space and time

By Fred Wild

Fred is a principal of Advantage Software Technologies. He can be contacted at 72774.657@compuserve.com.

C/C++ programmers often shy away from heavy use of strings, because the built-in string operations can be expensive. The C standard library's strXXX() functions use a data format that is simple and compact, but doesn't provide fast comparison or editing. But if you know which operations are important, you can investigate alternatives that provide different trade-offs. For example, you might store all of your strings in a trie (see "Algorithm Alley," DDJ, April 1994) for quick searching, or use hash codes to quickly detect when two strings are different. This month, Fred shows a hash technique that provides both compact storage and fast equality tests.

-- Tim Kientzle

Many programs keep data that needs to be searched. We often find ourselves identifying and finding items according to a name or key. In many cases, that key is a string.

We know from experience that programs that keep data of any respectable volume will not perform well if they are searching for keys as strings. Why? Because string comparison functions are expensive -- much more so than, say, register-value comparisons. Also, the longer and more similar the strings are, the worse the performance will be.

A C/C++ string is a null-terminated array of characters with a pointer to the first character in the array. When you want to compare two strings for equality, you need to traverse the strings and compare their individual characters for equality. If one string terminates before the other, they are not equal. If a character in a given position is different, they are again not equal.

The C library strcmp function does this for you by returning zero (0) if the two strings are equal, a negative value if the first is less than the second, and a positive value if the first is greater than the second. Using strcmp is convenient, but burdensome at run time for two reasons: First, it is a function call; and second, it must laboriously walk through each string from the front each time it is called. There's lots of room for improvement here, mostly by minimizing the number of times you call strcmp.

Abstractions for Strings

Most people who use C++ have either created or used a string class. One feature of nearly every one of those string classes is that it saves the strings' lengths. The classes also usually have comparison operators to allow code to be specific for handling ==, <, and >. Tests for equality can be short-circuited by first ensuring that the lengths are the same before vectoring off into strcmp. If the lengths are not the same, then the strings are not equal. When the lengths are the same, the strings still need to have their characters traversed in entirety before declaring them equal.

Optimizing Tests for Equality

The ultimate string comparison for equality would not require a call to strcmp at all (nor a traversal of each of the characters). How can this be accomplished? By paying some of the cost of comparison up front, before we need to do a comparison.

An adage in software development is, "The longer you take to store something, the faster you can retrieve it." In more-specific terms, an extra preparation step is sometimes all you need to make all subsequent processing of that data more efficient. This is the pivotal idea behind the remainder of the strategy I'll discuss here. As you will see, the up-front cost of preparing for efficient string comparisons for equality is not exorbitant.

The I-String Strategy

I originally heard about this strategy from an associate who said that he'd seen it used in Scheme and LISP implementations. He said it had been "internalizing" the string. In honor of that rumor, I'll call this approach the "I-String strategy." Informally, this is how it works:

Encapsulate strings in Istring class instances.
Keep unique string values in a table.
On construction, look up each new Istring's value in the table and assign/find its unique ID.
When doing comparisons for equality, just compare the IDs.

You encapsulate strings in an Istring class so you get control of how they are created, copied, compared, and destroyed. You'll use the Istring class to perform the preprocessing necessary to optimize comparisons for equality. The Istring class will keep a table of string values it knows about, as well as reference counts for them. When a new instance is created, the string value is found or added to the table, and its ID is saved in the Istring class-instance data. The Istring operator == simply compares the ID values to determine equality, based on the ID values mapping one-to-one onto unique string values. (The complete source code for the Istring class, as well as a make file and test program, are provided electronically; see "Availability," page 3.)

Expected Time and Space Benefits

In most applications the I-String strategy is a winner in space, because the underlying values of Istring instances are mapped onto the same storage locations. This means that redundancy in the storage of equal string values is eliminated. All roads in optimization lead ultimately to malloc and free (or new and delete), and avoiding redundant storage also means avoiding memory-management calls (a time benefit as well).

The I-String strategy is also a winner in speed. When you use Istrings as search keys, you are doing register-value comparisons. Compared with the usual function call and string-scanning overhead of strcmp, Istrings are quite optimal. The benefits of avoiding strcmp overhead also increase as string sizes grow, and as the string values have more character-for-character similarities at their beginnings. Consider, for example, the string values:

window.popup_dialog_1.title

window.popup_dialog_1.xpos

window.popup_dialog_1.ypos

When comparing one of the aforementioned string values against the other with strcmp, you'd need to traverse the bulk of each string to discover ultimately that they do not match. Istrings allow a single 32-bit (address) comparison to do the same job, regardless of the length or similarity of the strings.

Implementation Notes

Now that you understand the strategy, you need to make a few low-level implementation choices:

Physical form for unique ID values. Since a memory address is necessarily unique for unique values, and you need to store the address of string values for other purposes anyway, let's use the unique string value's memory address as the Istring's ID. You'll hold this value in the Istring's unique_string member variable.
String-class pointer values. For a string, you don't need or want to revert to using character pointers. You likely already have a string class, so let's make use of it. The address of one of the string-class instances works equally well as a unique ID. Further, within the Istring class, you'll define the operators -> and * so that they surface the string-class value, which allows clients to directly use all of the string's utility functions. This allows your Istring class to remain a lightweight wrapper around the old standby string classes.
Use of STL. In this example, I'll use STL to gain use of both its string class and its map template class. The map template will be used to create a table to map string values onto their own addresses, which I will in turn use as the Istring's unique string value ID. I'll also use the map template to create a store for the reference counts for each string value. This way, you'll also be able to determine when you can safely destroy the string values. The two maps are class-wide static resources that are shared by all instances of Istring.

Istring Class Declaration

Listing One is the Istring class declaration. You start by declaring two static map class instances: one to hold the unique string-value pointers, the other to keep track of the reference counts for the unique string-value pointers.

Next, you declare the only data value that an Istring possesses, which is the one pointer it keeps to its shared string value. The private members section ends with the declaration of the default constructor for this class. Since the default constructor is declared as a private member, clients must use one of the other constructors (either the constructor taking a string, or the class's copy constructor).

You also declare an assignment operator to allow the shared unique string values to be managed properly when Istrings are copied. (Whenever a copy constructor is needed to satisfy required bookkeeping or storage-management requirements, an assignment operator is usually needed as well.)

Next, you supply dereferencing operators so that clients can get at the string, and use it as they normally would if it were not encapsulated. This is a quick way for the class to export all of the contained class's operations through its interface. When you allow dereferencing of this type, a client can compose a reference to the underlying string by dereferencing the Istring. Example 1, for instance, shows references to the underlying string in the two allowed forms.

Last, you supply the operators that are the focus of the Istring implementation -- equality and inequality testing. Note that you have given the code for the functions inline, and that it consists of simply comparing the unique string-value addresses. By declaring them as inline, you have also avoided function-call overhead in performing the comparison operations.

Construction from a String

Creating an Istring that doesn't have a value doesn't make much sense, since you can't modify an Istring. Instead, the constructor in Listing Two takes a string value as input, and uses that value to correctly set the unique string address from the shared string-to-string-pointer map. If the string is not found, a new entry is made for the string in both the table of string pointers and the reference counts. You handle the creation of entries first, and then allow the general code to deal with adopting the shared pointer and increasing the reference count for it.

Copy and Assignment

Listing Three shows the copy constructor and assignment operator. The only twist in copying is correctly updating the reference count for the string so it will be destroyed only when no active references exist.

Assignment also requires carefully tracking reference counts. If the pointers are already the same, then nothing need be done. If they're different, you need to reduce the count for the value being overwritten, and increase the count on the value being copied.

The destructor in Listing Four also uses the reference count. Only when the reference count drops from 1 to 0 should you remove the shared string from the map.

Istrings versus Reference-Counted Strings

You might be tempted to compare Istrings to reference-counted strings, but this would be an apples-to-oranges comparison. Istrings are designed to behave differently. First, unlike reference counted strings, separate Istrings can be constructed in various places in your program, and still come up sharing the underlying unique string. Reference-counted strings only share a value within a single chain of copying and assignment.

Second, reference-counted strings employ a copy-on-modify strategy of delaying, and potentially avoiding, copying of a string until the value of the string is changed. They are useful for general string purposes. In contrast, Istrings are best thought of as unchanging values that are used most effectively as search keys. Unlike reference-counted strings, they impose a shared value model on all of their instances with no copy-on-modify semantics.

Conclusion

An application that needs to do frequent searches based on names or string key values, through relatively nontrivial collections, will easily and dramatically reduce its search overhead by incorporating the Istrings strategy. This can be done without subjugating the internal data structures of the application entirely to hash tables (which introduces additional issues if your data is ordered by those keys as well), and the Istring class can be used as a plug-compatible replacement for your current use of strings as keys.

DDJ

Listing One

//////////////////////////////////////////////////////////////////////////////  Istring class internalizes strings for fast comparison for equality

using std::string ;

#define StrStrptrMap_t std::map<string, string*, std::less<std::string> >
#define StrptrIntMap_t std::map<string*, int, std::less<string*> >

class Istring {
private:
    static StrStrptrMap_t ustrings ;
    static StrptrIntMap_t ustring_counts ;
    
    string *m_ustring ;       // the "unique string" value being shared

    // Empty Istring case thwarted by declaring def. ctor private
    Istring() { } 
public:
    // Standard parts 
    Istring(const Istring &istr);
    ~Istring();

    // Special ctors
    Istring (const string &str);
    // Assignment
    const Istring & operator=(const Istring &istr);
    // Deferencing operators
    const string * operator->() const { return  m_ustring ; }    // dangerous
    const string & operator*() const  { return *m_ustring ; }    // more safe
    int operator==(const Istring &istr) const 
        { return this->m_ustring == istr.m_ustring ; }
    int operator!=(const Istring &istr) const 
        { return this->m_ustring != istr.m_ustring ; }
};

Back to Article

Listing Two

Istring::Istring (const string &str){
    if ( ustrings.find(str) == ustrings.end() ) {
        string *s = new string(str);
        ustrings[str] = s ;
        ustring_counts[this->m_ustring] = 0 ;
    }
    // Adopt the shared string ptr from the map
    this->m_ustring = ustrings[str] ;
    // Increase usage by one
    int n_users = ustring_counts[this->m_ustring] ;
    ustring_counts[this->m_ustring] = ++n_users ;
}

Back to Article

Listing Three

Istring::Istring(const Istring & istr){
    this->m_ustring = istr.m_ustring ;
    // Increase usage by one
    int n_users = ustring_counts[this->m_ustring] ;
    ustring_counts[this->m_ustring] = ++n_users ;
}
const Istring & 
Istring::operator=(const Istring &istr)
{
    if (this == &istr || this->m_ustring == istr.m_ustring) return *this ;
    // Reduce usage by one on current m_ustring
    int n_users = ustring_counts[this->m_ustring] ;
    ustring_counts[this->m_ustring] = --n_users ;
    this->m_ustring = istr.m_ustring ;
    // Increase usage by one on assigned m_ustring
    n_users = ustring_counts[this->m_ustring] ;
    ustring_counts[this->m_ustring] = ++n_users ;
    return *this ;
}

Back to Article

Listing Four

Istring::~Istring(){
    int n_users = ustring_counts[this->m_ustring] ;
    n_users-- ;

    // We can delete the ustring when it is no longer referenced 
    if (n_users == 0) {
        ustrings.erase(*m_ustring) ;
        ustring_counts.erase(m_ustring) ;
        delete m_ustring ;
    }
    else {
        ustring_counts[this->m_ustring] = n_users ;
    }
}

Back to Article


Copyright © 1997, Dr. Dobb's Journal