Dr. Dobb's Journal March 1997
C/C++ programmers often shy away from heavy use of strings, because the built-in string operations can be expensive. The C standard library's strXXX() functions use a data format that is simple and compact, but doesn't provide fast comparison or editing. But if you know which operations are important, you can investigate alternatives that provide different trade-offs. For example, you might store all of your strings in a trie (see "Algorithm Alley," DDJ, April 1994) for quick searching, or use hash codes to quickly detect when two strings are different. This month, Fred shows a hash technique that provides both compact storage and fast equality tests.
-- Tim Kientzle
Many programs keep data that needs to be searched. We often find ourselves identifying and finding items according to a name or key. In many cases, that key is a string.
We know from experience that programs that keep data of any respectable volume will not perform well if they are searching for keys as strings. Why? Because string comparison functions are expensive -- much more so than, say, register-value comparisons. Also, the longer and more similar the strings are, the worse the performance will be.
A C/C++ string is a null-terminated array of characters with a pointer to the first character in the array. When you want to compare two strings for equality, you need to traverse the strings and compare their individual characters for equality. If one string terminates before the other, they are not equal. If a character in a given position is different, they are again not equal.
The C library strcmp function does this for you by returning zero (0) if the two strings are equal, a negative value if the first is less than the second, and a positive value if the first is greater than the second. Using strcmp is convenient, but burdensome at run time for two reasons: First, it is a function call; and second, it must laboriously walk through each string from the front each time it is called. There's lots of room for improvement here, mostly by minimizing the number of times you call strcmp.
Most people who use C++ have either created or used a string class. One feature of nearly every one of those string classes is that it saves the strings' lengths. The classes also usually have comparison operators to allow code to be specific for handling ==, <, and >. Tests for equality can be short-circuited by first ensuring that the lengths are the same before vectoring off into strcmp. If the lengths are not the same, then the strings are not equal. When the lengths are the same, the strings still need to have their characters traversed in entirety before declaring them equal.
The ultimate string comparison for equality would not require a call to strcmp at all (nor a traversal of each of the characters). How can this be accomplished? By paying some of the cost of comparison up front, before we need to do a comparison.
An adage in software development is, "The longer you take to store something, the faster you can retrieve it." In more-specific terms, an extra preparation step is sometimes all you need to make all subsequent processing of that data more efficient. This is the pivotal idea behind the remainder of the strategy I'll discuss here. As you will see, the up-front cost of preparing for efficient string comparisons for equality is not exorbitant.
I originally heard about this strategy from an associate who said that he'd seen it used in Scheme and LISP implementations. He said it had been "internalizing" the string. In honor of that rumor, I'll call this approach the "I-String strategy." Informally, this is how it works:
You encapsulate strings in an Istring class so you get control of how they are created, copied, compared, and destroyed. You'll use the Istring class to perform the preprocessing necessary to optimize comparisons for equality. The Istring class will keep a table of string values it knows about, as well as reference counts for them. When a new instance is created, the string value is found or added to the table, and its ID is saved in the Istring class-instance data. The Istring operator == simply compares the ID values to determine equality, based on the ID values mapping one-to-one onto unique string values. (The complete source code for the Istring class, as well as a make file and test program, are provided electronically; see "Availability," page 3.)
In most applications the I-String strategy is a winner in space, because the underlying values of Istring instances are mapped onto the same storage locations. This means that redundancy in the storage of equal string values is eliminated. All roads in optimization lead ultimately to malloc and free (or new and delete), and avoiding redundant storage also means avoiding memory-management calls (a time benefit as well).
The I-String strategy is also a winner in speed. When you use Istrings as search keys, you are doing register-value comparisons. Compared with the usual function call and string-scanning overhead of strcmp, Istrings are quite optimal. The benefits of avoiding strcmp overhead also increase as string sizes grow, and as the string values have more character-for-character similarities at their beginnings. Consider, for example, the string values:
window.popup_dialog_1.title
window.popup_dialog_1.xpos
window.popup_dialog_1.ypos
When comparing one of the aforementioned string values against the other with strcmp, you'd need to traverse the bulk of each string to discover ultimately that they do not match. Istrings allow a single 32-bit (address) comparison to do the same job, regardless of the length or similarity of the strings.
Now that you understand the strategy, you need to make a few low-level implementation choices:
Listing One is the Istring class declaration. You start by declaring two static map class instances: one to hold the unique string-value pointers, the other to keep track of the reference counts for the unique string-value pointers.
Next, you declare the only data value that an Istring possesses, which is the one pointer it keeps to its shared string value. The private members section ends with the declaration of the default constructor for this class. Since the default constructor is declared as a private member, clients must use one of the other constructors (either the constructor taking a string, or the class's copy constructor).
You also declare an assignment operator to allow the shared unique string values to be managed properly when Istrings are copied. (Whenever a copy constructor is needed to satisfy required bookkeeping or storage-management requirements, an assignment operator is usually needed as well.)
Next, you supply dereferencing operators so that clients can get at the string, and use it as they normally would if it were not encapsulated. This is a quick way for the class to export all of the contained class's operations through its interface. When you allow dereferencing of this type, a client can compose a reference to the underlying string by dereferencing the Istring. Example 1, for instance, shows references to the underlying string in the two allowed forms.
Last, you supply the operators that are the focus of the Istring implementation -- equality and inequality testing. Note that you have given the code for the functions inline, and that it consists of simply comparing the unique string-value addresses. By declaring them as inline, you have also avoided function-call overhead in performing the comparison operations.
Creating an Istring that doesn't have a value doesn't make much sense, since you can't modify an Istring. Instead, the constructor in Listing Two takes a string value as input, and uses that value to correctly set the unique string address from the shared string-to-string-pointer map. If the string is not found, a new entry is made for the string in both the table of string pointers and the reference counts. You handle the creation of entries first, and then allow the general code to deal with adopting the shared pointer and increasing the reference count for it.
Listing Three shows the copy constructor and assignment operator. The only twist in copying is correctly updating the reference count for the string so it will be destroyed only when no active references exist.
Assignment also requires carefully tracking reference counts. If the pointers are already the same, then nothing need be done. If they're different, you need to reduce the count for the value being overwritten, and increase the count on the value being copied.
The destructor in Listing Four also uses the reference count. Only when the reference count drops from 1 to 0 should you remove the shared string from the map.
You might be tempted to compare Istrings to reference-counted strings, but this would be an apples-to-oranges comparison. Istrings are designed to behave differently. First, unlike reference counted strings, separate Istrings can be constructed in various places in your program, and still come up sharing the underlying unique string. Reference-counted strings only share a value within a single chain of copying and assignment.
Second, reference-counted strings employ a copy-on-modify strategy of delaying, and potentially avoiding, copying of a string until the value of the string is changed. They are useful for general string purposes. In contrast, Istrings are best thought of as unchanging values that are used most effectively as search keys. Unlike reference-counted strings, they impose a shared value model on all of their instances with no copy-on-modify semantics.
An application that needs to do frequent searches based on names or string key values, through relatively nontrivial collections, will easily and dramatically reduce its search overhead by incorporating the Istrings strategy. This can be done without subjugating the internal data structures of the application entirely to hash tables (which introduces additional issues if your data is ordered by those keys as well), and the Istring class can be used as a plug-compatible replacement for your current use of strings as keys.
DDJ
////////////////////////////////////////////////////////////////////////////// Istring class internalizes strings for fast comparison for equality
using std::string ;
#define StrStrptrMap_t std::map<string, string*, std::less<std::string> >
#define StrptrIntMap_t std::map<string*, int, std::less<string*> >
class Istring {
private:
static StrStrptrMap_t ustrings ;
static StrptrIntMap_t ustring_counts ;
string *m_ustring ; // the "unique string" value being shared
// Empty Istring case thwarted by declaring def. ctor private
Istring() { }
public:
// Standard parts
Istring(const Istring &istr);
~Istring();
// Special ctors
Istring (const string &str);
// Assignment
const Istring & operator=(const Istring &istr);
// Deferencing operators
const string * operator->() const { return m_ustring ; } // dangerous
const string & operator*() const { return *m_ustring ; } // more safe
int operator==(const Istring &istr) const
{ return this->m_ustring == istr.m_ustring ; }
int operator!=(const Istring &istr) const
{ return this->m_ustring != istr.m_ustring ; }
};
Istring::Istring (const string &str){
if ( ustrings.find(str) == ustrings.end() ) {
string *s = new string(str);
ustrings[str] = s ;
ustring_counts[this->m_ustring] = 0 ;
}
// Adopt the shared string ptr from the map
this->m_ustring = ustrings[str] ;
// Increase usage by one
int n_users = ustring_counts[this->m_ustring] ;
ustring_counts[this->m_ustring] = ++n_users ;
}
Istring::Istring(const Istring & istr){
this->m_ustring = istr.m_ustring ;
// Increase usage by one
int n_users = ustring_counts[this->m_ustring] ;
ustring_counts[this->m_ustring] = ++n_users ;
}
const Istring &
Istring::operator=(const Istring &istr)
{
if (this == &istr || this->m_ustring == istr.m_ustring) return *this ;
// Reduce usage by one on current m_ustring
int n_users = ustring_counts[this->m_ustring] ;
ustring_counts[this->m_ustring] = --n_users ;
this->m_ustring = istr.m_ustring ;
// Increase usage by one on assigned m_ustring
n_users = ustring_counts[this->m_ustring] ;
ustring_counts[this->m_ustring] = ++n_users ;
return *this ;
}
Istring::~Istring(){
int n_users = ustring_counts[this->m_ustring] ;
n_users-- ;
// We can delete the ustring when it is no longer referenced
if (n_users == 0) {
ustrings.erase(*m_ustring) ;
ustring_counts.erase(m_ustring) ;
delete m_ustring ;
}
else {
ustring_counts[this->m_ustring] = n_users ;
}
}
Copyright © 1997, Dr. Dobb's Journal