This article presents a C++ string manipulation system that features anonymous strings with an efficient implementation. It is inspired by Java's string manipulation system and is largely compatible with it so that the large number of Java programmers will more easily be able to program in C++.
An anonymous string is one that does not have a name. For example, if a, b and c are strings, then the string built by concatenating these three strings, a+b+c, is anonymous.
My earlier limited understanding of C++ prevented me from implementing an efficient anonymous string system. Wherever an anonymous string could have been used, I would instead declare a temporary variable and build the string in it. For example, if a, b and c are string objects, and foo is a function taking a string argument, then foo(a+b+c) would be rewritten as the more long-winded:
    String temp;
    temp << a;
    temp << b;
    temp << c;
    foo(temp);
or the slightly shorter:
    String temp;
    foo(temp << a << b << c);
See an earlier article for a "proof" that efficient anonymous strings could not be implemented. This proof was only valid while my understanding of C++ was limited, and made invalid after further research.
This article presents a system for string manipulation that is based on an earlier article. The system mentioned in that article had one single string class, whereas my new system splits this class into three classes for greater compatibility with Java.
    String(const String&); String& operator = (const String&);
    String_Buffer(const String_Buffer&); String_Buffer& operator = (const String_Buffer&);
    String(const String&); String& operator = (const String&);
See another article for a diagram showing how my string manipulation system forms a part of my broader non-standard I/O library.
My previous string class had four operator <<'s for writing values of four built-in types:
    // Inside class string...
    virtual Writer& operator << (char ch);
    virtual Writer& operator << (int i);
    virtual Writer& operator << (double d);
    virtual Writer& operator << (const char* s);
These four methods have been moved into the string_buffer class so that only string_buffer objects can be appended to and string objects are read-only. The functions are now non-virtual, in accordance with the efficiency-motivated design pattern established in a later article.
    // Inside class string_buffer...
    string_buffer& operator << (char ch);
    string_buffer& operator << (int i);
    string_buffer& operator << (double d);
    string_buffer& operator << (const char* s);
The method set_char_at from class string has been moved to string_buffer so that string objects are read-only and string_buffer objects can be written to. The method set_char_at(int i, char ch) writes ch to location i in the string_buffer object this.
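The split between a read-only string and a writable buffer can be illustrated with a minimal sketch. It is not the code from string.hh: storing the characters in std::string, formatting numbers with std::ostringstream, and the helper names append_formatted, str and build_and_edit are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical sketch of a writable buffer. std::string storage and
// std::ostringstream formatting are assumptions made for brevity.
class String_Buffer {
public:
    String_Buffer& operator << (char ch)       { data_ += ch; return *this; }
    String_Buffer& operator << (int i)         { return append_formatted(i); }
    String_Buffer& operator << (double d)      { return append_formatted(d); }
    String_Buffer& operator << (const char* s) { data_ += s; return *this; }

    // Overwrites the character at position i; i must be a valid index.
    void set_char_at(int i, char ch)
    {
        assert(i >= 0 && static_cast<std::size_t>(i) < data_.size());
        data_[static_cast<std::size_t>(i)] = ch;
    }

    const std::string& str() const { return data_; }

private:
    // Formats any streamable value and appends the resulting text.
    template<class T>
    String_Buffer& append_formatted(const T& value)
    {
        std::ostringstream os;
        os << value;
        data_ += os.str();
        return *this;
    }

    std::string data_;
};

// Appends the four supported built-in types, then edits one character.
std::string build_and_edit()
{
    String_Buffer sb;
    sb << "pi" << '=' << 3 << ' ' << 2.5;
    sb.set_char_at(0, 'P');
    return sb.str();
}
```

Because the operators are non-virtual and return string_buffer&, a chain of << calls resolves statically and keeps appending to the same object.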
The reason I have done this is so that Emacs can syntax-highlight all string methods in a different colour from the rest of the text.
One day I discovered that the operator + can be defined with strings in such a way that it gives rise to efficient anonymous strings, comparable in efficiency to Java's implementation of anonymous strings. The solution is to create a new class which I call string_buffer2 for manipulating anonymous strings. Two operator +'s are then defined like so:
    String_Buffer2 operator + (const String& s1, const String& s2)
    {
        String_Buffer2 result;
        result << s1;
        result << s2;
        return result;
    }

    String_Buffer2& operator + (const String_Buffer2& csb, const String& s)
    {
        String_Buffer2& sb = const_cast<String_Buffer2&>(csb); // Ugly but essential cast
        sb << s;
        return sb;
    }
With the above definitions in force, and if a, b and c are strings (or can be converted to strings via a conversion operator), a statement like String result = a+b+c; behaves as though it had been written:
    String_Buffer2 temp;
    temp << a;
    temp << b;
    temp << c;
    String result = temp;
Compare the above C++ code with how Java handles the anonymous string a+b+c and you will see that, in terms of efficiency, they are practically identical:
    StringBuffer temp = new StringBuffer();
    temp.append(a);
    temp.append(b);
    temp.append(c);
    String result = temp.toString();
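Regarding the ugly but essential cast, Bjarne Stroustrup told me in an email that a const_cast is guaranteed to work unless the original object was declared const. In an expression like a+b+c the object being cast is the non-const temporary returned by the first operator +, so that condition is met.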
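To make the efficiency claim concrete, here is a self-contained sketch of the two operator + definitions with a counter added. It is an illustration only: the std::string storage, the str accessor, the buffers_built counter and the function buffers_used_for_chain are assumptions introduced for this example and are not part of the article's library.

```cpp
#include <cassert>
#include <string>

class String_Buffer2;   // forward declaration

// Read-only string. std::string storage is an assumption made to keep the
// sketch short; the article's class manages its own memory.
class String {
public:
    String() {}
    String(const char* s) : data_(s) {}
    String(const String_Buffer2& sb);   // allows: String result = a + b + c;
    const std::string& str() const { return data_; }
private:
    std::string data_;
};

// Buffer that exists only to build anonymous strings.
class String_Buffer2 {
public:
    String_Buffer2() { ++buffers_built; }
    String_Buffer2& operator << (const String& s) { data_ += s.str(); return *this; }
    const std::string& str() const { return data_; }
    static int buffers_built;   // hypothetical counter, for illustration only
private:
    std::string data_;
};
int String_Buffer2::buffers_built = 0;

inline String::String(const String_Buffer2& sb) : data_(sb.str()) {}

// First + in a chain: creates the buffer.
String_Buffer2 operator + (const String& s1, const String& s2)
{
    String_Buffer2 result;
    result << s1;
    result << s2;
    return result;
}

// Every later + in the chain: appends to the existing temporary.
String_Buffer2& operator + (const String_Buffer2& csb, const String& s)
{
    String_Buffer2& sb = const_cast<String_Buffer2&>(csb);
    sb << s;
    return sb;
}

// Concatenates four strings and reports how many buffers were
// default-constructed while doing so.
int buffers_used_for_chain()
{
    String a("ab"), b("cd"), c("ef"), d("gh");
    int before = String_Buffer2::buffers_built;
    String result = a + b + c + d;
    assert(result.str() == "abcdefgh");
    return String_Buffer2::buffers_built - before;
}
```

The counter only tracks default construction, so it shows that a chain of any length starts exactly one new buffer. Whether the by-value return of the first operator + also copies that buffer depends on the compiler's return value optimisation.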
If you want to build an anonymous string out of arbitrary types, it is necessary to call a function like to_string that is analogous to Java's toString method. Inside the file io.hh I define the following template function for this purpose:
    template<class T>
    String to_string(const T& t)
    {
        String_Buffer sb;
        sb << t;
        return sb;
    }
EXAMPLE: If a, b and c are strings and t is an instance of arbitrary class T, then here is how to build an anonymous string and pass it to method foo:
    foo(a+b+c+to_string(t));
To instantiate this template function, simply call it. Note that, unlike Java's toString method, the explicit call to the to_string function cannot be omitted. The to_string function internally calls operator << (Writer&, const T&), where T is the class of the object passed to to_string, so operator << (Writer&, const T&) must be defined for class T for the to_string function to work.
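The same pattern can be tried out with the standard library alone. In the sketch below std::ostringstream stands in for String_Buffer and Writer, and std::string stands in for String. Those substitutions, along with the names to_string_sketch, Point and describe, are assumptions for illustration rather than the contents of io.hh.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Stand-in for the article's to_string: std::ostringstream plays the role
// of String_Buffer/Writer and std::string the role of String.
template<class T>
std::string to_string_sketch(const T& t)
{
    std::ostringstream sb;
    sb << t;            // requires operator << (std::ostream&, const T&)
    return sb.str();
}

// Hypothetical user type that supplies the required operator <<.
struct Point {
    int x;
    int y;
};

std::ostream& operator << (std::ostream& w, const Point& p)
{
    return w << "(" << p.x << "," << p.y << ")";
}

// Builds an anonymous string from two strings and an arbitrary object.
std::string describe(const std::string& a, const std::string& b, const Point& t)
{
    return a + b + to_string_sketch(t);
}

// Small usage check.
void check_describe()
{
    Point p = {3, 4};
    assert(describe("at ", "point ", p) == "at point (3,4)");
}
```

If Point had no operator <<, the call to to_string_sketch would fail to compile at the point of instantiation, which mirrors the requirement stated above.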
What follows is a list of potential problems faced by the new anonymous string system that is outlined in this article, together with solutions to the problems.
One problem is that, with the above definitions in force, the following counter-intuitive behaviour occurs:
    String_Buffer2 b;
    b << "apple";
    String s = b + " banana";
    cout << s << endl; // outputs "apple banana" as expected
    cout << b << endl; // outputs "apple banana" when "apple" was expected
A solution to this behaviour is to make the string_buffer2 constructor private, with the operator + that creates the buffer declared as a friend, so that string_buffer2 objects can only be created anonymously with operator +, as they should be.
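A minimal sketch of the private-constructor fix might look as follows. The std::string storage, the str accessor and the function join are assumptions for illustration. The copy constructor is left public here so that the first operator + can return the buffer by value, which may differ from how the real class handles it.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

class String_Buffer2;

// Minimal read-only string; std::string storage is an assumption.
class String {
public:
    String(const char* s) : data_(s) {}
    String(const String_Buffer2& sb);
    const std::string& str() const { return data_; }
private:
    std::string data_;
};

class String_Buffer2 {
public:
    String_Buffer2& operator << (const String& s) { data_ += s.str(); return *this; }
    const std::string& str() const { return data_; }
private:
    String_Buffer2() {}   // private: "String_Buffer2 b;" no longer compiles
    friend String_Buffer2 operator + (const String& s1, const String& s2);
    std::string data_;
};

inline String::String(const String_Buffer2& sb) : data_(sb.str()) {}

String_Buffer2 operator + (const String& s1, const String& s2)
{
    String_Buffer2 result;   // allowed here because this function is a friend
    result << s1;
    result << s2;
    return result;
}

String_Buffer2& operator + (const String_Buffer2& csb, const String& s)
{
    String_Buffer2& sb = const_cast<String_Buffer2&>(csb);
    sb << s;
    return sb;
}

// Code outside the class and its friends cannot default-construct a buffer.
static_assert(!std::is_default_constructible<String_Buffer2>::value,
              "String_Buffer2 must not be default-constructible by name");

// Anonymous concatenation still works as before.
std::string join(const String& a, const String& b, const String& c)
{
    String s = a + b + c;
    return s.str();
}

// Small usage check.
void check_join()
{
    assert(join("apple", " banana", "!") == "apple banana!");
}
```

The static_assert requires C++11 or later. It is only there to show that the declaration String_Buffer2 b; from the problem case is now rejected at compile time.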
Unfortunately the following operators have existing semantics and cannot be redefined:
    operator + (int,char*)
    operator + (int,char)
    operator + (char*,char)
    operator + (char*,int)
    operator + (char,char*)
    operator + (char,int)
    operator + (char,char)
The following operator has no existing semantics but still cannot be defined, because C++ requires at least one operand of an overloaded operator to be of class or enumeration type:
    operator + (char*,char*)
The consequence of this is that an expression like so:
    String s = "hello, " + "world";
won't compile. It needs to be rewritten with an explicit call to a string constructor like so:
    String s = String() + "hello, " + "world";
It should be noted that if two string values are known at compile time, then the strings can be concatenated at compile time by writing the literals next to each other, like so:
    String s = "hello, " "world";
Therefore this string class only applies to strings whose values are not known at compile time. Worse still, the following expression has an unexpected result, because pointer arithmetic is applied:
    String s = "hello" + 2;
    cout << s << endl; // outputs "llo" bizarrely
Concatenating "hello" with the number 2 requires, as above, an explicit call to the string constructor like so:
    String s = String() + "hello" + 2;
    cout << s << endl; // outputs "hello2" as expected
Java has a similar problem. If a, b and c are arbitrary types, then to pass the string built by concatenating a, b and c to a method foo, one should write:
    foo(String() + a + b + c)
Otherwise, if a and b are of type int, then foo(a+b+c) is not the concatenation of a, b and c, but the arithmetic sum a+b concatenated with c.
A limitation of two earlier string classes developed by the author (1 and 2) was that the maximum size of the string objects was limited to a compile-time constant. Increasing the value of this constant meant that memory was wasted as every string object was sized at the value of this constant.
The string class featured in this article grows the internal string data structure as the size of the string dictates, resulting in both speed and conservative memory use.
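One common way to grow a buffer on demand is to double its capacity whenever an append would overflow it. The sketch below shows that idea. The doubling policy, the initial capacity of 8 and the name Growable_Buffer are assumptions for illustration and may differ from what string.hh actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical growable character buffer. Doubling from an initial
// capacity of 8 is an assumed policy, not taken from string.hh.
class Growable_Buffer {
public:
    Growable_Buffer() : data_(new char[8]), length_(0), capacity_(8)
    {
        data_[0] = '\0';
    }
    ~Growable_Buffer() { delete[] data_; }

    // Appends s, growing the storage first if it would not fit.
    // s must not point into this buffer's own storage.
    Growable_Buffer& operator << (const char* s)
    {
        std::size_t n = std::strlen(s);
        reserve(length_ + n + 1);                 // +1 for the terminating '\0'
        std::memcpy(data_ + length_, s, n + 1);   // copies the '\0' as well
        length_ += n;
        return *this;
    }

    const char* c_str() const { return data_; }
    std::size_t length() const { return length_; }
    std::size_t capacity() const { return capacity_; }

private:
    // Copying is disabled to keep the sketch short.
    Growable_Buffer(const Growable_Buffer&);
    Growable_Buffer& operator = (const Growable_Buffer&);

    // Ensures room for at least `needed` bytes, doubling as required.
    void reserve(std::size_t needed)
    {
        if (needed <= capacity_) return;
        std::size_t new_capacity = capacity_;
        while (new_capacity < needed) new_capacity *= 2;
        char* bigger = new char[new_capacity];
        std::memcpy(bigger, data_, length_ + 1);
        delete[] data_;
        data_ = bigger;
        capacity_ = new_capacity;
    }

    char* data_;
    std::size_t length_;
    std::size_t capacity_;
};

// Small usage check: the second append forces one doubling.
void check_growth()
{
    Growable_Buffer b;
    b << "hello, " << "world";
    assert(std::strcmp(b.c_str(), "hello, world") == 0);
    assert(b.capacity() == 16);
}
```

With doubling, a short string occupies only a small block, while a long one is reached after a logarithmic number of reallocations instead of one per append.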
In an email Andrew Koenig told me that it was generally not acceptable to change the behaviour of the standard C++ string class. This is because the large amount of existing code that uses the string class and assumes that the class behaves in a certain way would be broken by a change to the behaviour of the string class.
The string manipulating classes presented in this article stand in their own right as a complete C++ Java-like string system. I hope that other people will find my system useful in their own programs, otherwise my system will languish with me as the only user of it!
My String class has two methods for converting String objects to a char* or const char* for accessing C functions:
    const char* const_char_star() const;
    char* char_star(int mem_size, char* s) const;
Here is how to use them:
    void foo(String s)
    {
        printf("%s", s.const_char_star()); // Function printf expects a const char*
        const int MAX_LENGTH = 200;
        char array[MAX_LENGTH];
        bar(s.char_star(MAX_LENGTH, array)); // Function bar expects a char*
    }
It should be noted that the method const_char_star returns a pointer that is only guaranteed to remain valid until the next string manipulation. If a long-term pointer is needed, use the method char_star.
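The difference between the two methods can be sketched as follows. This is a guess at plausible behaviour, not the code from string.hh. In particular, the std::string storage and the choice to truncate when the caller's array is too small are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical sketch of the two conversion methods.
class String {
public:
    String(const char* s) : data_(s) {}

    // Pointer into the internal storage: cheap, but it is only valid
    // until the string is next modified or destroyed.
    const char* const_char_star() const { return data_.c_str(); }

    // Copies into the caller's array s of mem_size bytes and returns s.
    // Truncating to fit is an assumption made for this sketch.
    char* char_star(int mem_size, char* s) const
    {
        assert(mem_size > 0 && s != 0);
        std::size_t room = static_cast<std::size_t>(mem_size) - 1;
        std::size_t n = data_.size() < room ? data_.size() : room;
        std::memcpy(s, data_.data(), n);
        s[n] = '\0';
        return s;
    }

private:
    std::string data_;
};

// Small usage check.
void check_conversions()
{
    String s("hello");
    char array[200];
    assert(std::strcmp(s.char_star(200, array), "hello") == 0);
    assert(std::strcmp(s.const_char_star(), "hello") == 0);
}
```

The copy made by char_star lives in the caller's array, so it remains usable for as long as that array does, regardless of what later happens to the string.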
Add the following lines of Emacs Lisp code to your .emacs file to achieve correct syntax highlighting of the string classes:
    ;; The following code highlights the string classes:
    (kill-local-variable 'c++-font-lock-extra-types)
    (if (not (boundp 'c++-font-lock-extra-types))
        (setq c++-font-lock-extra-types nil))
    (setq-default c++-font-lock-extra-types
                  (append '("[A-Z]" "[A-Z0-9_]+[a-z][a-zA-Z0-9_]*")
                          c++-font-lock-extra-types))
The following program listings are intended for the GNU C++ compiler, but will probably work on other compilers too.
string.hh
t-string.cc
TString.java
io.tar.gz