UniqueStringList Revisited

Introduction

This article is about a class called UniqueStringList, various versions of which I have been using for many years.
I thought it was fully optimized but I recently used a couple of new techniques to make it twice as quick and thought I would share them with you.

Background

Reusing the same copy of a distinct string value is called String Interning. String Interning is a good thing for two main reasons:

It reduces memory consumption: 100,000 references to the same string is far better than 100,000 references to 100,000 copies of the same string.
It speeds up string comparisons: If two string references point to the same place, you know they are equal without having to compare a single character.

The C# compiler automatically Interns all literal strings within your source files when building an assembly.
You may have seen the String.Intern() method? Don't use it in your code! All strings you add to this Intern Pool cannot be removed and will stay there until your app closes!

Although not String Interning per se, in my Optimizing Serialization in .NET article, one of the biggest optimizations was to produce a token for each unique string being serialized. It did not affect the strings on the serializing end but ensured that at the deserializing end, only one copy of the string would exist - automatically interned if you will.

Prevention of duplicate strings is very beneficial if you can remove the duplicates at source. Imagine a CSV file being read with 100,000 lines and assume that some string appears on every line (e.g. "No fault found"). Since each line is read separately and has no knowledge of the lines above and below it, you will be storing 100,000 copies of that same string in memory. If you can Intern all the strings read from a single line, you will end up with just one copy of each unqiue string which can save megabytes of memory.

It is a similar case when reading XML files or retrieving data from a middle tier layer and creating POCOs (Plain Old Class Objects) for each item, the chances are you will be getting duplicates of the strings involved unless you remove them yourself.

Imagine displaying a large amount of data in a grid. You click on the filter icon in the header and, after a short pause, it displays a list of unique items to filter against. Wouldn't it be nice if the process of generating this list was so quick it appeared instantly.

An even simpler scenario is just asking whether the string has been seen before and maybe skipping some processing if it has.

Of course, the process of interning strings is not free, it will take some time and so the less time it takes, the better and that is what this article is about.
Also, don't forget that you will be saving some time by making fewer memory allocations and so ultimately you could get the best of both words: Large memory savings with little or no additional overhead.

So we have some scenarios here for string Interning/identifying duplicates but with slightly different requirements.

The 'tokenizing' scenario needs to get an index back and that index must not subsequently change.
In addition, it is helpful to know whether that index is for an existing string or a new one.
The 'reuse of identical strings' scenario (ie real Interning) requires a string as an output - an index is neither here nor there and it doesn't matter if it is new string or not.
The 'have I seen this string before' scenario simply needs to return a bool.
The 'just give me all the unique strings' scenario doesn't need to return anything when adding - just a way of getting all of them back out at the end.

The code I eventually came up with uses ideas from three sources:

The original code from the Optimizing Serialization in .NET article.
The concept of using Golden Primes and accelerating the bucket count for lower capacities and also having a LoadFactor to have more buckets than capacity to reduce the chance of HashCode collisions.
The .NET 4.0 HashSet<T> code as seen in Reflector.
Instead of storing the strings directly in an array, I use a Slot struct which also stores the hash code of the string and the index of the Next Slot. The storage of the hash code eliminates a lot of string compares and recalcuations especially when expanding capacity. The Next index helps with collisions ensuring that the minimum number of compares is achieved.
Whilst Reflectoring HashSet<T>, I also discovered a HashHelpers class in the System.Collections namespace which has a number of primes precalculated. Unfortunately, this (and a near identical class of the same name in System.Collections.Generic) are marked as internal.
An extremely interesting article by rob tillaart called Optimizing integer divisions with Multiply Shift in C# which shows how to speed up integer divisions, and by implication, modulo operations. The code I have doesn't use divide operations but does use modulo to determine which hash bucket to use for a given hash code.

To promote reusability I have created a MathHelper class and a QuickDivideInfo struct and the code for these is listed below. These incorporate the ideas of precalculating Golden Primes and precalculating the Fast Division information for them.

From the QuickDivideInfo struct, only the ModuloPositive() method is actually used by UniqueStringList but I have included the other methods for completeness and checked them against all possible int numerators. Apart from int.MinValue, all other values return exactly the same result as the '%' operator but 3 times faster.

QuickDivideInfo struct

My time testing has shown that all these methods are inlinable.

I have also added unchecked statements around the arithmetic since there is no chance of overflow. Therefore compiling with the "Check for arithmetic overflow/underflow" option set will not affect the speed increase.

public struct QuickDivideInfo
{
    public readonly int Divisor;
    public readonly long Multiplier;
    public readonly int Shift;

    public QuickDivideInfo(int divisor, long multiplier, int shift)
    {
        Divisor = divisor;
        Multiplier = multiplier;
        Shift = shift;
    }

    public int Divide(int numerator)
    {
        return numerator < 0 ? DivideNegative(numerator) : DividePositive(numerator);
    }

    public int DividePositive(int numerator)
    {
        unchecked
        {
            return (int) ((numerator * Multiplier) >> Shift);
        }
    }

    // Does not work with int.MinValue as the numerator!
    public int DivideNegative(int numerator)
    {
        unchecked
        {
            return (int) -((-numerator * Multiplier) >> Shift);
        }
    }

    public int Modulo(int numerator)
    {
        return numerator < 0 ? ModuloNegative(numerator) : ModuloPositive(numerator);
    }

    public int ModuloPositive(int numerator)
    {
        return numerator - DividePositive(numerator) * Divisor;
    }

    // Does not work with int.MinValue as the numerator!
    public int ModuloNegative(int numerator)
    {
        return numerator - DivideNegative(numerator) * Divisor;
    }
}

MathHelper

For completeness, I have also added the normal precalculated Primes as used by Microsoft and precalculated their Fast Division information too.

public static class MathHelper
{
  public const int Lower31BitMask = 0x7fffffff;

  // Allows quadrupling of bucket table size for smaller sizes then
  // reverting to doubling.
  static readonly int[] GoldenPrimeAccelerators =
    new[]
      {
        389, 1543, 6151, 24593, 98317
      };

  // Based on Golden Primes (as far as possible from nearest two powers of two)
  // at <a href="http://planetmath.org/encyclopedia/GoodHashTablePrimes.html">http://planetmath.org/encyclopedia/GoodHashTablePrimes.html</a>
  // and Optimizing integer divisions with Multiply Shift in C#
  // at <a href="http://www.codeproject.com//KB/cs/FindMulShift.aspx">http://www.codeproject.com//KB/cs/FindMulShift.aspx</a>
  static readonly QuickDivideInfo[] GoldenPrimes =
    new[]
      {
        new QuickDivideInfo(53, 1296593901, 36),    // acceration skip
        new QuickDivideInfo(97, 354224107, 35),     // acceration skip
        new QuickDivideInfo(193, 356059465, 36),    // acceration skip
        new QuickDivideInfo(389, 2826508041, 40),
        new QuickDivideInfo(769, 2859588109, 41),   // acceration skip
        new QuickDivideInfo(1543, 356290223, 39),
        new QuickDivideInfo(3079, 714200473, 41),   // acceration skip
        new QuickDivideInfo(6151, 178753313, 40),
        new QuickDivideInfo(12289, 1431539267, 44), // acceration skip
        new QuickDivideInfo(24593, 2861332257, 46),
        new QuickDivideInfo(49157, 1431510145, 46), // acceration skip
        new QuickDivideInfo(98317, 2862932929, 48),
        new QuickDivideInfo(196613, 715809679, 47),
        new QuickDivideInfo(393241, 1431564749, 49),
        new QuickDivideInfo(786433, 1431653945, 50),
        new QuickDivideInfo(1572869, 2863302429, 52),
        new QuickDivideInfo(3145739, 2863301519, 53),
        new QuickDivideInfo(6291469, 2863305615, 54),
        new QuickDivideInfo(12582917, 1431655197, 54),
        new QuickDivideInfo(25165843, 1431654685, 55),
        new QuickDivideInfo(50331653, 2863311247, 57),
        new QuickDivideInfo(100663319, 2863310877, 58),
        new QuickDivideInfo(201326611, 2863311261, 59),
        new QuickDivideInfo(402653189, 357913937, 57),
        new QuickDivideInfo(805306457, 2863311215, 61),
        new QuickDivideInfo(1610612741, 1431655761, 61),

      };

  // Based on the list of primes in Systems.Collections.Generic.HashHelpers
  static readonly QuickDivideInfo[] Primes =
    new[]
      {
        new QuickDivideInfo(3, 2863311531, 33),
        new QuickDivideInfo(7, 2454267027, 34),
        new QuickDivideInfo(11, 780903145, 33),
        new QuickDivideInfo(17, 2021161081, 35),
        new QuickDivideInfo(23, 2987803337, 36),
        new QuickDivideInfo(29, 2369637129, 36),
        new QuickDivideInfo(37, 3714566311, 37),
        new QuickDivideInfo(47, 2924233053, 37),
        new QuickDivideInfo(59, 582368447, 35),
        new QuickDivideInfo(71, 3871519817, 38),
        new QuickDivideInfo(89, 3088515809, 38),
        new QuickDivideInfo(107, 1284476201, 37),
        new QuickDivideInfo(131, 1049152317, 37),
        new QuickDivideInfo(163, 210795941, 35),
        new QuickDivideInfo(197, 1395319325, 38),
        new QuickDivideInfo(239, 2300233531, 39),
        new QuickDivideInfo(293, 3752599413, 40),
        new QuickDivideInfo(353, 3114763819, 40),
        new QuickDivideInfo(431, 2551071063, 40),
        new QuickDivideInfo(521, 1055193501, 39),
        new QuickDivideInfo(631, 871245347, 39),
        new QuickDivideInfo(761, 1444824741, 40),
        new QuickDivideInfo(919, 2392843587, 41),
        new QuickDivideInfo(1103, 498418689, 39),
        new QuickDivideInfo(1327, 3314277703, 42),
        new QuickDivideInfo(1597, 2753942713, 42),
        new QuickDivideInfo(1931, 284700059, 39),
        new QuickDivideInfo(2333, 1885146383, 42),
        new QuickDivideInfo(2801, 785085061, 41),
        new QuickDivideInfo(3371, 2609342339, 43),
        new QuickDivideInfo(4049, 2172411219, 43),
        new QuickDivideInfo(4861, 904761677, 42),
        new QuickDivideInfo(5839, 188304783, 40),
        new QuickDivideInfo(7013, 2508510773, 44),
        new QuickDivideInfo(8419, 2089581429, 44),
        new QuickDivideInfo(10103, 870641693, 43),
        new QuickDivideInfo(12143, 1448751219, 44),
        new QuickDivideInfo(14591, 602843741, 43),
        new QuickDivideInfo(17519, 2008355049, 45),
        new QuickDivideInfo(21023, 1673613285, 45),
        new QuickDivideInfo(25229, 1394600345, 45),
        new QuickDivideInfo(30293, 2322937451, 46),
        new QuickDivideInfo(36353, 3871413319, 47),
        new QuickDivideInfo(43627, 3225926339, 47),
        new QuickDivideInfo(52361, 167989401, 43),
        new QuickDivideInfo(62851, 1119612165, 46),
        new QuickDivideInfo(75431, 932888921, 46),
        new QuickDivideInfo(90523, 97169703, 43),
        new QuickDivideInfo(108631, 2591110979, 48),
        new QuickDivideInfo(130363, 2159163081, 48),
        new QuickDivideInfo(156437, 1799286465, 48),
        new QuickDivideInfo(187751, 2998385913, 49),
        new QuickDivideInfo(225307, 1249295303, 48),
        new QuickDivideInfo(270371, 2082138815, 49),
        new QuickDivideInfo(324449, 1735095357, 49),
        new QuickDivideInfo(389357, 722922605, 48),
        new QuickDivideInfo(467237, 2409697663, 50),
        new QuickDivideInfo(560689, 2008064911, 50),
        new QuickDivideInfo(672827, 1673386929, 50),
        new QuickDivideInfo(807403, 87154425, 46),
        new QuickDivideInfo(968897, 2324085857, 51),
        new QuickDivideInfo(1162687, 1936720557, 51),
        new QuickDivideInfo(1395263, 403472287, 49),
        new QuickDivideInfo(1674319, 336226223, 49),
        new QuickDivideInfo(2009191, 1120749503, 51),
        new QuickDivideInfo(2411033, 933956447, 51),
        new QuickDivideInfo(2893249, 1556589021, 52),
        new QuickDivideInfo(3471899, 1297157443, 52),
        new QuickDivideInfo(4166287, 135120301, 49),
        new QuickDivideInfo(4999559, 56299961, 48),
        new QuickDivideInfo(5999471, 3002664487, 54),
        new QuickDivideInfo(7199369, 2502219085, 54),
      };

  public static bool IsPrime(int candidate)
  {
    if ((candidate & 1) == 0)
    {
      return candidate == 2;
    }

    int max = (int) Math.Sqrt(candidate);

    for (int i = 3; i <= max; i += 2)
    {
      if (candidate % i == 0)
      {
        return false;
      }
    }

    return true;
  }

  public static int GetPrime(int min)
  {
    return GetPrimeCore(Primes, min);
  }

  public static int GetGoldenPrime(int min)
  {
    return GetPrimeCore(GoldenPrimes, min);
  }

  public static QuickDivideInfo GetPrimeInfo(int min)
  {
    return GetPrimeInfoCore(Primes, min);
  }

  public static QuickDivideInfo GetGoldenPrimeInfo(int min, bool accelerate = true)
  {
    if (accelerate)
    {
      foreach (var goldenPrimeAccelerator in GoldenPrimeAccelerators)
      {
        if (min > goldenPrimeAccelerator) continue;

        min = goldenPrimeAccelerator;
        break;
      }
    }

    return GetPrimeInfoCore(GoldenPrimes, min);
  }

  static QuickDivideInfo GetPrimeInfoCore(QuickDivideInfo[] set, int min)
  {
    for (int i = 0; i < set.Length; ++i)
    {
      if (set[i].Divisor >= min)
      {
        return set[i];
      }
    }

    throw new ArgumentOutOfRangeException("min", "You really need a prime larger than 1,610,612,741?!");
  }

  static int GetPrimeCore(QuickDivideInfo[] set, int min)
  {
    for (int i = 0; i < set.Length; ++i)
    {
      int num = Primes[i].Divisor;
      if (num >= min) return num;
    }

    for (int i = min | 1; i < 2147483647; i += 2)
    {
      if (IsPrime(i))
      {
        return i;
      }
    }

    return min;
  }

  public static bool IsPowerOfTwo(int value)
  {
    return value > 0 && (value & (value - 1)) == 0;
  }

  public static bool IsPowerOfTwo(long value)
  {
    return value > 0 && (value & (value - 1)) == 0;
  }
}

UniqueStringList

I have kept this class lean and mean for performance purposes.

Null values are not supported but are not checked for - that is the responsibility of the caller.
There is no option to remove individual values once added. You can however use the Clear() method to remove them all.
I've added a few additional members such as Count, Contains(), IndexOf() & this[] to show it is still a list but I suspect the Add() and Intern() methods will be used primarily.
If you have an idea of the maximum capacity required then it can be provided in a constructor to presize the internal arrays. Note this is will be rounded up to match the Golden Prime used and the LoadFactor but will not be less than specified.

public sealed class UniqueStringList
{
  const float LoadFactor = .72f;

  Slot[] slots;
  int[] buckets;
  int slotIndex;
  QuickDivideInfo quickDivideInfo;

  public UniqueStringList()
  {
    Expand(0);
  }

  public UniqueStringList(int capacity)
  {
    Expand(capacity);
  }

  public int Count
  {
    get { return slotIndex; }
  }

  public void Clear()
  {
    Array.Clear(slots, 0, slotIndex);
    Array.Clear(buckets, 0, buckets.Length);

    slotIndex = 0;
  }

  public string Intern(string item)
  {
    int index;

    return slotIndex == (index = AddIfMissing(item)) ? item : slots[index].Value;
  }

  public void Intern(ref string item)
  {
    int index;

    if (slotIndex != (index = AddIfMissing(item)))
    {
      item = slots[index].Value;
    }
  }

  public void AddRange(IEnumerable<string> items)
  {
    foreach (var item in items)
    {
      if (item == null) continue;

      AddIfMissing(item);
    }
  }

  public bool Add(string item)
  {
    return slotIndex == AddIfMissing(item);
  }

  public bool Add(string item, out int index)
  {
    return slotIndex == (index = AddIfMissing(item));
  }

  public IEnumerable<string> GetItems()
  {
    var index = 0;

    while(index < slotIndex)
    {
      yield return slots[index++].Value;
    }
  }

  public string this[int index]
  {
    get { return slots[index].Value; }
  }

  public bool Contains(string item)
  {
    return IndexOf(item) != -1;
  }

  public int IndexOf(string item)
  {
    int hashCode = item.GetHashCode() & MathHelper.Lower31BitMask;
    int bucketIndex = quickDivideInfo.ModuloPositive(hashCode);

    for (int i = buckets[bucketIndex] - 1; i >= 0; i = slots[i].Next)
    {
      if (slots[i].HashCode == hashCode && string.Equals(slots[i].Value, item))
      {
        return i;
      }
    }

    return -1;
  }

  int AddIfMissing(string item)
  {
    var hashCode = item.GetHashCode() & MathHelper.Lower31BitMask;
    var bucketIndex = quickDivideInfo.ModuloPositive(hashCode);

    for (int i = buckets[bucketIndex] - 1; i >= 0; i = slots[i].Next)
    {
      if (slots[i].HashCode == hashCode && string.Equals(slots[i].Value, item))
      {
        return i;
      }
    }

    if (slotIndex == slots.Length)
    {
      Expand(slots.Length + 1);
      bucketIndex = quickDivideInfo.ModuloPositive(hashCode);
    }

    slots[slotIndex].HashCode = hashCode;
    slots[slotIndex].Value = item;
    slots[slotIndex].Next = buckets[bucketIndex] - 1;
    buckets[bucketIndex] = ++slotIndex;

    return slotIndex - 1;
  }

  void Expand(int capacity)
  {
    quickDivideInfo = MathHelper.GetGoldenPrimeInfo(Math.Max(quickDivideInfo.Divisor + 1, (int) (capacity / LoadFactor)));
    capacity = Math.Max(capacity, (int) (quickDivideInfo.Divisor * LoadFactor));

    buckets = new int[quickDivideInfo.Divisor];

    Array.Resize(ref slots, capacity);

    for (int i = 0; i < slotIndex; ++i)
    {
      int bucketIndex = quickDivideInfo.ModuloPositive(slots[i].HashCode);

      slots[i].Next = buckets[bucketIndex] - 1;
      buckets[bucketIndex] = i + 1;
    }

  }

  struct Slot
  {
    public int HashCode;
    public string Value;
    public int Next;
  }

}

Using the code

For simple interning use, there are two Intern() overloads:

The first will return the string passed in or an equivalent already seen:

var myString = myUniqueStringList.Intern(myString);

The second just updates (or not) the string passed by ref:

myUniqueStringList.Intern(ref myPOCOClass.AStringField);

There are also two Add() overloads:

Both will return true if the string was new otherwise false

var justAdded = myUniqueStringList.Add("XXX");

The second additionally returns the index of the string in the list:

int index;

if (myUniqueStringList.Add("XXX", ref index) 
{
    ...
}

The AddRange() method allows the additional of multiple strings at once (nulls are checked here and will be ignored).

The GetItems() method returns an enumerator to extract the unique strings in the order they were added. The output from this can populate a string[] or List<string> or whatever.

Points of Interest

Whilst writing this code, I was especially interested in ensuring that calls to the methods on QuickDivideInfo got inlined for speed. I don't know a way to easily check the JIT output so I relied on timings of test cases. I wrote three tests for each method: one to time the normal operation (divide or modulo); one to put the multiplication and shift actually inline; and one to call the QuickDivideInfo method. If the latter two produced roughly equal times then the method was assumed to have inlined. Both should also produce times roughly one third of that method using the normal operator.

I was suprised when my one of the sets produced funny results. At first I thought my method was not being inlined but then I noticed that the test with the physically inlined optimization was also not showing the expected speed up!

This code:

var check = 0;

for (var numerator = 0; numerator <= maxNumerator; numerator++)
{
  check += numerator >= 0
         ? numerator - (int) ((numerator * qdi.Multiplier) >> qdi.Shift) * qdi.Divisor
         : numerator - (int) -((-numerator * qdi.Multiplier) >> qdi.Shift) * qdi.Divisor;
}

return check;

ran 3 times faster than this code:

var check = 0;

for (var numerator = 0; numerator <= maxNumerator; numerator++)
{
  check += numerator >= 0
         ? (int) ((numerator * qdi.Multiplier) >> qdi.Shift)
         : (int) -((-numerator * qdi.Multiplier) >> qdi.Shift);
}

return check;

yet it had more operations to complete!

I eventually discovered that I could get the the expected speed increase back by either inverting the condition to be '<0' or by changing to a if/else construct (with either condition).

So out of the 4 ways of writing the expression, 3 produced fast JITed code but the other, the one I happened to have chosen, produced slow JITed code.

So my conclusion is that there isn't a bug as such in the JIT-compiler since the results were correct but there is definitely something that could be improved.
And it has made me a little uneasy about using ?: in performance-critical code.

Tag uniquestringlist,revisited

推荐.NET配套的通用数据层ORM框架：CYQ.Data 通用数据层框架

IT Technology Blog

Introduction

Background

QuickDivideInfo struct

MathHelper

UniqueStringList

Using the code

Points of Interest

Post Comment

Bulletin

Article Search

Article Categories

Article Archive

New Article

New Comment

Hits Order

Comment Order

Links