I've lately been analyzing (and trying to improve) the performance of a string-intensive application. When checking how strings are stored in memory in the .NET Framework, I came across this funny layout for a string object:
(example compiled on an IA32 with VS2010SP1 targeting .NET 3.5)
79330c6c             V-Table(?) for String
00000005             Length+1
00000004             Length
0054 0065 0073 0074  T e s t
0000                 null-char
0000                 ??? (4byte-aligned?)
80000000             unclear
I'd like to know the following:
- Why do we have the string's Length and additionally seemingly Length+1 stored in the structure? Shouldn't the latter be computable? Am I getting something wrong?
- How are strings aligned in memory, i.e. how many trailing null-chars do we get for a length of 3, 4, 5, 6, 7, 8 chars?
- What is the ominous 0x80000000? I didn't find it when, e.g., checking an array of strings built up at runtime. Is that a garbage collection pin or something? Just random data? (I never found anything but 0x00000000 and 0x80000000 in this place...)
- Since strings are immutable, I would actually have expected to find an Int32 for the hash code of the string in the structure, for example instead of Length+1. However, it seems GetHashCode() has no short exit for strings, i.e. hash codes are not cached. Wouldn't that be a good idea?
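To make the last point concrete, this is roughly the kind of short exit I mean. It is only a sketch, written as a wrapper because System.String itself cannot be changed; HashedString is a made-up name, not a framework type, and it costs an extra 4 bytes per wrapped string.

```csharp
using System;

// Hypothetical wrapper (not a framework type): stores the hash code next to
// the string so it is computed once instead of on every GetHashCode() call.
public struct HashedString : IEquatable<HashedString>
{
    private readonly string value;
    private readonly int hash; // computed once, in the constructor

    public HashedString(string value)
    {
        if (value == null) throw new ArgumentNullException("value");
        this.value = value;
        this.hash = value.GetHashCode();
    }

    public string Value { get { return value; } }

    public bool Equals(HashedString other)
    {
        // Cheap rejection on the cached hash before comparing characters.
        return hash == other.hash
            && string.Equals(value, other.value, StringComparison.Ordinal);
    }

    public override bool Equals(object obj)
    {
        return obj is HashedString && Equals((HashedString)obj);
    }

    // The "short exit": no character loop, just the stored value.
    public override int GetHashCode() { return hash; }
}
```

Used as a Dictionary or HashSet key, this avoids rehashing the characters on every lookup, at the price of the additional field.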
A second mystery to me is the way String.Intern works. The documentation tells us:
If you are trying to reduce the total amount of memory your application allocates, keep in mind that interning a string has two unwanted side effects. First, the memory allocated for interned String objects is not likely to be released until the common language runtime (CLR) terminates. The reason is that the CLR's reference to the interned String object can persist after your application, or even your application domain, terminates.
Could somebody please be less cryptic and define the "not likely" and "can" in the documentation above? I would like clarification on the circumstances under which an interned string is actually released once it's no longer used/referenced in the application/process that interned it. Will it ever be released? If the behaviour differs between framework versions, can we have details for each version? (Regardless of the difference concerning String.Empty; that is understood.)
When the application finally terminates, under which circumstances will the string still live on? Will it live on forever? Will it live only if a second application/process references the same interned text? Is it ever collected at all?
In my evaluation tests, at least, it seemed the memory is never collected again during the lifetime of the application, which makes interning completely useless for our use cases. It's also more than unclear why the framework should behave this way. Any good reasons?
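For reference, a test of this kind can be sketched as follows. This is a minimal sketch rather than the exact evaluation code; the class name, the string prefix and the count of 100000 are arbitrary assumptions, and the numbers it prints will vary by runtime and machine.

```csharp
using System;

class InternMemoryTest
{
    // Interns 'count' distinct strings without keeping any reference to them,
    // then reports how many bytes are still allocated after a forced collection.
    public static long MeasureRetained(int count)
    {
        long before = GC.GetTotalMemory(true);

        for (int i = 0; i < count; i++)
        {
            // Built at run time so every string is new; the result is discarded.
            string.Intern("some fairly long prefix to make the effect visible " + i);
        }

        // Nothing in this method references the strings any more.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        return GC.GetTotalMemory(true) - before;
    }

    static void Main()
    {
        Console.WriteLine("Retained after interning: {0:N0} bytes",
                          MeasureRetained(100000));
    }
}
```

Running the same loop without the string.Intern call gives a baseline to compare the retained bytes against.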
Anyway, after evaluating this, I went on a rant when I had the idea to simply implement a string table of my own for interning purposes, using HashSet<WeakString>, where WeakString is a struct holding a WeakReference to a string plus an int for the string's hash code. It turned out to be impossible, since HashSet<T> offers nothing resembling a Get(T item) method. This may not be a problem for mere value types, but for the intended purpose the class is simply useless, even though internally it definitely has all it takes. Whatever references you put into a HashSet, you will never be able to retrieve them in the same quick way HashSet allows you to check whether an element is contained. This could (*cough*) be improved in a future .NET version.
Any good ideas for implementing an application level string pool without wasting too much memory are welcome. To me it seemed one will have to reinvent the wheel here because HashSet lacks a method that probably could be introduced in a few keystrokes...
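One direction that avoids HashSet altogether is a Dictionary keyed by the hash code, with each bucket holding weak references to the pooled strings. The following is only a minimal sketch under assumptions: WeakStringPool and its members are made-up names, it is not thread-safe, and it sticks to APIs available in .NET 3.5. Each WeakReference has its own overhead, so whether this actually saves memory depends on string lengths and on how many duplicates there are.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical application-level pool; name and API are assumptions.
public sealed class WeakStringPool
{
    // Buckets keyed by hash code; each bucket holds weak references to the
    // pooled strings sharing that hash code.
    private readonly Dictionary<int, List<WeakReference>> buckets =
        new Dictionary<int, List<WeakReference>>();

    // Returns the pooled instance equal to 'value', or pools 'value' itself
    // if no equal instance is still alive.
    public string Intern(string value)
    {
        if (value == null) throw new ArgumentNullException("value");

        int hash = value.GetHashCode();
        List<WeakReference> bucket;
        if (!buckets.TryGetValue(hash, out bucket))
        {
            bucket = new List<WeakReference>(1);
            buckets.Add(hash, bucket);
        }

        // Walk backwards so dead entries can be removed while scanning.
        for (int i = bucket.Count - 1; i >= 0; i--)
        {
            string pooled = (string)bucket[i].Target; // null once collected
            if (pooled == null)
                bucket.RemoveAt(i);
            else if (string.Equals(pooled, value, StringComparison.Ordinal))
                return pooled;
        }

        bucket.Add(new WeakReference(value));
        return value;
    }

    // Removes dead references and empty buckets; meant to be called occasionally.
    public void Purge()
    {
        List<int> empty = new List<int>();
        foreach (KeyValuePair<int, List<WeakReference>> pair in buckets)
        {
            pair.Value.RemoveAll(delegate(WeakReference w) { return w.Target == null; });
            if (pair.Value.Count == 0) empty.Add(pair.Key);
        }
        foreach (int key in empty) buckets.Remove(key);
    }
}

class WeakStringPoolDemo
{
    static void Main()
    {
        WeakStringPool pool = new WeakStringPool();
        string a = pool.Intern(new string('x', 3));
        string b = pool.Intern(new string('x', 3)); // equal content, new object
        Console.WriteLine(object.ReferenceEquals(a, b)); // True: b is the pooled a
    }
}
```

Because the pool only holds weak references, a string that nothing else references can be collected, and its dead entry is dropped the next time its bucket is scanned or Purge() runs.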