Hash Code generation in .NET: a deep dive


A while back there was some debate on a .NET app at $WORK about its various implementations of Object.GetHashCode. My initial vibe after looking at the implementation was to yell at everyone about how bad these hash code implementations were, and how badly they would perform. But after being asked about it, I realized that these were just that: vibes, not facts. So naturally, I had to dive in and actually do the research on hash codes. This blog post is an updated and expanded version of an internal issue I made. While doing more research I also made some interesting discoveries about hash code implementations in other languages, so while this article started out as a .NET hash code deep dive, it morphed into a more general story about hash code generation and how hash tables need some help from you, the programmer, to function efficiently. This is also where I discovered how staggeringly similar Java and .NET’s hash code rules and requirements are.

The reason for this is that .NET is “Microsoft Java” and you cannot convince me otherwise.

The purpose of hash codes

Before we can talk about the properties of a hash code, we first have to properly define what Object.GetHashCode’s actual purpose is. Why do we need such a fundamental method on the root Object type in .NET at all? Well, Microsoft explains it rather clearly in the article Supplemental API remarks for Object.GetHashCode:

A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value.

That tells us that hash codes are necessary for certain collection operations to be fast. Specifically, System.Collections.Generic.HashSet<T> and System.Collections.Generic.Dictionary<K, V> are collections that are based on hash tables. Even more interesting, HashSet has the following note in its source code:

public class HashSet<T>
{
    // This uses the same array-based implementation as Dictionary<TKey, TValue>.
    // [...]
}

HashSet<T> is just a Dictionary<K, V> in a trenchcoat!

Wikipedia has a great article on hash tables and their workings, so if you want to know the nitty-gritty details: that page explains it way better than I ever could.

However, we do need to establish some fundamental theory before we continue:

- a hash table stores its entries in a number of buckets
- to insert or look up an object, the table must map that object to a bucket via some hash function h(x)
- for lookups to work at all, equal objects must always map to the same bucket

Now, if we were only interested in equality we could just ask the programmer to implement Object.Equals or IEqualityComparer for that type. But that only answers a yes/no question of whether two objects are equal; it does not produce a value that we can reproduce later for fast lookup.

And that is where Object.GetHashCode comes in: it is part of the hashing function used by Dictionary and HashSet to select buckets to store your object in. Note that I said part: Object.GetHashCode is not h(x)! The exact internals of Dictionary are out of scope for this article, but the internal implementation uses the value produced by Object.GetHashCode to select a bucket from a range. For our purposes, let’s just say that for our object x, we can define h(x) as follows:

h(x)=SelectBucket(GetHashCode(x))

It may seem pedantic, but I want to be very clear that while Object.GetHashCode is important for hash table based collections, it does not by itself determine where your object ends up in the collection.
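To make that SelectBucket step concrete, here is a minimal sketch of how a hash table might derive a bucket index from a hash code. I’m using Java here (the rules are identical on “Microsoft Java”, after all), and the mask-and-modulo scheme is my own illustration, not the actual Dictionary internals:

```java
public class SelectBucketSketch {
    // Illustrative only: .NET's Dictionary and Java's HashMap each do
    // this differently (prime-sized modulo vs. power-of-two masking),
    // but the shape is always "reduce the hash code into a bucket range".
    public static int selectBucket(int hashCode, int capacity) {
        // clear the sign bit so the index can never be negative,
        // then reduce the hash code into [0, capacity)
        return (hashCode & 0x7fffffff) % capacity;
    }

    public static void main(String[] args) {
        int capacity = 16;
        // equal objects have equal hash codes, so they always
        // land in the same bucket -- that's what makes lookup work
        int a = selectBucket("hello".hashCode(), capacity);
        int b = selectBucket(new String("hello").hashCode(), capacity);
        System.out.println(a == b); // true
    }
}
```

Squeezing 2^32 possible hash codes into a handful of buckets is also why GetHashCode alone never determines the final position of your object.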

Requirements for GetHashCode

Given that Object.GetHashCode is used as part of a hash function, are there any hard requirements on what Object.GetHashCode does? Well, both Java and .NET have only two real requirements:

- if two objects are equal according to Equals, they must return the same hash code
- as long as an object is not modified, it must keep returning the same hash code within a single run of the program

Then there are also two “unofficial official” requirements for Object.GetHashCode: a formally correct implementation doesn’t need to adhere to these, but if you don’t, performance is abysmal and normal operation of hash tables goes out of the window:

- unequal objects should return unequal hash codes as often as possible
- an object’s hash code must not change while that object is stored in a hash table

The first requirement has to do with performance: two unequal objects may return the same hash code, and because of the pigeonhole principle collisions are inevitable if you have more objects than possible hash codes. But this should happen as little as possible to keep the hash table performant.

The second requirement is much more important, because it influences correctness: if a property that is used to compute Object.GetHashCode is changed while the containing object is stored in a hash table, then that object is now stored in the wrong bucket! This can lead to subtle bugs if you try to look up an object whose hash code has changed in the meantime. For example, HashSet.Contains could report false even though your object is in the collection.
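This failure mode is easy to reproduce for real. Below is a Java sketch (the Point class and its field values are my own illustration; Java’s HashSet has the same contract as .NET’s): mutating a field that feeds hashCode strands the object in its old bucket.

```java
import java.util.HashSet;
import java.util.Objects;

public class MutableKeyBug {
    // a deliberately mutable key type -- the root of the problem
    static final class Point {
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        @Override public boolean equals(Object o) {
            return o instanceof Point p && x == p.x && y == p.y;
        }
        @Override public int hashCode() {
            return Objects.hash(x, y);
        }
    }

    public static boolean lostAfterMutation() {
        HashSet<Point> set = new HashSet<>();
        Point p = new Point(1, 2);
        set.add(p);

        p.x = 5; // mutate a field that feeds hashCode()

        // the set still physically holds p (size is 1), but lookups now
        // probe the bucket for the *new* hash code and come up empty
        return !set.contains(p) && set.size() == 1;
    }

    public static void main(String[] args) {
        System.out.println(lostAfterMutation()); // true with these values
    }
}
```

With these particular values the lookup deterministically fails; in general the outcome depends on whether the old and new hash codes happen to select the same bucket, which is exactly why “who knows” is the only honest answer.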

Finally, a hash code is a runtime value: the produced hash code does not need to be consistent between program invocations. This is why .NET has a giant warning not to store hash codes in persistent places: any re-run or reboot may invalidate those hash codes. Most implementations even seed their hash codes with a per-run random seed, mainly to prevent hash flooding attacks (we’ll get back to that later). This virtually guarantees that any attempt to use hash codes out of band will fail spectacularly, which doubles as a neat anti-Hyrum’s-law move1.

Playing around with some examples

// suppose we have two 🌈 magical 🌈 objects of the same type.
// let's run some tests!

class Point
{
    public int X;
    public int Y;

    public override bool Equals(object? other)
    {
        if (other is not Point point)
        {
            return false;
        }

        return X == point.X && Y == point.Y;
    }

    public override int GetHashCode()
    {
        // now what would we put here?
    }
}

Point object1 = GetRandomPoint();
Point object2 = GetRandomPoint();

if (object1.Equals(object2))
{
    // the objects are equal, so their hash codes _must_ also be equal
    Assert(object1.GetHashCode() == object2.GetHashCode());
}
else
{
    // if objects aren't equal, we basically can't say anything meaningful about their hash codes
    // they can be unequal, but they don't have to be!
}

// if we didn't modify the object, then the hash shouldn't change
int hash1 = object1.GetHashCode();
Sleep(10);
int hash2 = object1.GetHashCode();
Assert(hash1 == hash2);

object1.Y = 5;
int hash3 = object1.GetHashCode();
// this next assert is _technically_ illegal. But if you
// want your app to perform well, this should hold, since we changed the objects contents
Assert(hash2 != hash3);

// pop quiz: what happens if you do this?
var set = new HashSet<Point>();
set.Add(object1);
object1.X = 5;
// uh-oh, this is the worst: will the assert fire? Who knows! Since
// we modified object1 we _don't know_
Assert(set.Contains(object1));
set.Add(object1);
// this may print 2, or 1, depending on a variety of factors
Console.WriteLine(set.Count);

So what would that empty GetHashCode method contain? Well, a fully complete and working implementation could be:

public override int GetHashCode()
{
    return 4; // chosen by fair dice roll
}

// or, if we want to at least prevent hyrum's law
private static readonly int _hashCode = Random.Shared.Next();

public override int GetHashCode()
{
    return _hashCode;
}

This is terrible because it reduces the performance of dictionaries and sets to that of linked lists.
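You can watch that degradation happen without breaking correctness. This Java sketch (the BadKey class is my own stand-in) uses the dice-roll hash code and still passes every functional check; it just forces every element into one bucket:

```java
import java.util.HashSet;

public class ConstantHash {
    public static final class BadKey {
        final int value;
        public BadKey(int value) { this.value = value; }

        @Override public boolean equals(Object o) {
            return o instanceof BadKey k && value == k.value;
        }
        // technically legal: equal objects do get equal hash codes.
        // practically terrible: *every* object collides with every other.
        @Override public int hashCode() { return 4; }
    }

    public static boolean stillCorrect(int n) {
        HashSet<BadKey> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add(new BadKey(i));
        }
        // the hash table's *answers* stay right, only its speed is gone:
        // every add/contains above scanned the single shared bucket
        return set.size() == n
            && set.contains(new BadKey(n - 1))
            && !set.contains(new BadKey(n));
    }

    public static void main(String[] args) {
        System.out.println(stillCorrect(10_000)); // true, eventually...
    }
}
```

(Java has softened this particular blow since JDK 8 by turning pathological buckets into trees, but most hash tables, .NET’s included, degrade to a linear scan.)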

Meme showing Captain Jack Sparrow and Lieutenant Norrington. Norrington says 'your hash code is the worst I've ever seen', to which Jack replies 'but it is entirely correct'
technically correct is always the best kind of correct.

Desirable properties for a GetHashCode hash function

Clearly, we also need some hashing theory. For our purposes, the following properties are important:

- speed
- the avalanche effect
- a uniform distribution of outputs
- collision resistance

Collision resistance used to be more of a nice-to-have than a hard requirement, but given that in this day and age hash tables are effectively wired directly to the internet, it has become a more important factor (note, however, that speed gains still outweigh collision resistance; otherwise we’d all use SHA3 and call it a day).

First up, the most important property of a good hashing function for use in GetHashCode is speed: we’re not trying to cryptographically protect data, so we can trade security for speed in this case. It’s also worth noting that a hash table implementation might query the object’s hash code multiple times; any slowdown in GetHashCode directly slows down the hash table.

The avalanche effect is important for small key spaces (such as enums or small strings). The general idea is that if you change even a single bit of the input, the hash output should be drastically different. This is mostly a performance concern, since real-life keys may be closely related, such as monotonically incrementing integer keys. You don’t want these keys to all land in the same bucket, since you’ve just reduced your dictionary to a linked list in that case.

While the avalanche effect doesn’t formally state hard requirements (one could call it the avalanche vibe if so inclined), there is a formalization that does: the Strict Avalanche Criterion (SAC). SAC mandates that if a single bit of the input is flipped, each of the output bits changes with 50% probability. So while conforming to SAC might not be essential, it is desirable for our hash function.
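SAC is measurable. The sketch below flips each input bit of a sample of keys and counts how many of the 32 output bits change; the harness and thresholds are mine, and the mixer is MurmurHash3’s 32-bit finalizer (fmix32), a small, well-studied function that comes close to SAC. An identity “hash” flips exactly one output bit per input bit; a SAC-conforming one averages 16:

```java
import java.util.function.IntUnaryOperator;

public class AvalancheCheck {
    // MurmurHash3's 32-bit finalizer; constants are from the original
    public static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // average number of output bits that flip when one input bit flips
    public static double avalanche(IntUnaryOperator hash) {
        long flipped = 0, trials = 0;
        for (int key = 0; key < 1000; key++) {
            int base = hash.applyAsInt(key);
            for (int bit = 0; bit < 32; bit++) {
                flipped += Integer.bitCount(base ^ hash.applyAsInt(key ^ (1 << bit)));
                trials++;
            }
        }
        return (double) flipped / trials;
    }

    public static void main(String[] args) {
        System.out.println(avalanche(x -> x));                 // exactly 1.0
        System.out.println(avalanche(AvalancheCheck::fmix32)); // close to 16
    }
}
```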

A uniform distribution is important for the load factor of a hash table: if a hash function doesn’t generate a uniform distribution of outputs, some buckets will fill with more entries than others, which hurts performance. Our hash function should therefore approximate a uniform distribution as closely as possible.
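Here is what that looks like when the key space has structure, which real key spaces usually do. In this sketch (the 64-bucket table and the fmix32 mixer are my illustrative choices), keys that are multiples of 64 all share their low bits, so a table that masks low bits crams them into a single bucket unless the hash mixes first:

```java
import java.util.HashSet;
import java.util.function.IntUnaryOperator;

public class DistributionCheck {
    // MurmurHash3's 32-bit finalizer, standing in for a "real" mixer
    public static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // how many distinct buckets do 1000 structured keys occupy
    // in a 64-bucket power-of-two table?
    public static int bucketsUsed(IntUnaryOperator hash) {
        HashSet<Integer> buckets = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            int key = i * 64;                       // low six bits always zero
            buckets.add(hash.applyAsInt(key) & 63); // mask selects the bucket
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        System.out.println(bucketsUsed(x -> x));                    // 1: everything collides
        System.out.println(bucketsUsed(DistributionCheck::fmix32)); // close to 64
    }
}
```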

Lastly, collision resistance is a stronger version of the uniform distribution requirement (remember, GetHashCode does not select a bucket on its own!). This property is important for defending against hash flooding attacks, which can turn a simple lookup cache into a denial-of-service vector. However, unlike with things such as HMAC generation, lacking a strong variant of this requirement does not undermine our entire theoretical basis.
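To see why attackers care, take Java’s String.hashCode: a Bernstein-style multiply-add with multiplier 31 and no random seed. “Aa” and “BB” famously hash to the same value, and because the hash of a concatenation is built from the hashes of its parts, every combination of equal-hash blocks collides too, so n concatenations buy an attacker 2^n colliding keys. A sketch:

```java
import java.util.ArrayList;
import java.util.List;

public class FloodingDemo {
    // "Aa" and "BB" both hash to 2112 under String.hashCode, and
    // hash(s + t) depends only on hash(s), hash(t) and t's length,
    // so every concatenation of these blocks collides as well
    public static List<String> collidingStrings(int blocks) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String s : result) {
                next.add(s + "Aa");
                next.add(s + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = collidingStrings(10); // 2^10 = 1024 distinct strings
        long hashes = keys.stream().map(String::hashCode).distinct().count();
        System.out.println(keys.size() + " keys, " + hashes + " distinct hash value(s)");
        // an unseeded hash table chains all of these into one bucket
    }
}
```

A seeded hash like SipHash makes such collision families impossible to precompute, which is the whole point of the per-run random seed mentioned earlier.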

Generating hash codes according to the internet

Now that we know why we need them, let’s look at how the almighty internet (StackOverflow) thinks we should generate a hash code for a given object. Jon Skeet says to go with Joshua Bloch’s item 11 in Effective Java:

// https://stackoverflow.com/a/263416
// CC-BY-SA 4.0 Jon Skeet

public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        // Suitable nullity checks etc, of course :)
        hash = hash * 23 + field1.GetHashCode();
        hash = hash * 23 + field2.GetHashCode();
        hash = hash * 23 + field3.GetHashCode();
        return hash;
    }
}

He also mentions FNV hashing which basically replaces the addition with a XOR operation, but changes precious little else:

// https://stackoverflow.com/a/263416
// CC-BY-SA 4.0 Jon Skeet
// Duck's note: the reason Jon calls this "not quite FNV" is
// because FNV only operates on the byte representation of its input,
// which is a lot harder to do in high-level languages

// Note: Not quite FNV!
public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = (int) 2166136261;
        // Suitable nullity checks etc, of course :)
        hash = (hash * 16777619) ^ field1.GetHashCode();
        hash = (hash * 16777619) ^ field2.GetHashCode();
        hash = (hash * 16777619) ^ field3.GetHashCode();
        return hash;
    }
}

This first suggested form was what started the entire search into hash codes in the first place, because to me it looks weird and, to be completely honest, amateurish. Even for giants such as Jon Skeet and Joshua Bloch, the reasoning as to why you should do it this way boils down to “it’s good enough, I guess”. Really, Joshua Bloch states the following in Effective Java (3rd ed.):

“They are comparable in quality to the hash functions found in the Java platform libraries’ value types and are adequate for most uses.”

Okay, I can accept the “good enough” part, but that’s not a proper answer. We’re doing math here; we can (and should!) prove things! Just saying “well, it’s what Java does, so it’s fine” isn’t the same as a proper evaluation of the hash function. The best part is that once you’ve seen this style of hash, you’ll spot it everywhere. It has spread as a sort of magic incantation that everybody uses with the caveat “it works, don’t ask me why”. That raises the question: what exactly is this type of hash function, and more importantly: does it work as advertised?

Luckily for us, Jon did link a (now defunct) blog post by Julienne Walker introducing us to the principles of hash functions. And if you search for a bit you’ll find a section called “Bernstein hash” after the legendary djb. This specific version is also known as DJBX33A2 (Daniel J. Bernstein, Times 33 with Addition):

unsigned djb_hash ( void *key, int len )
{
    unsigned char *p = key;
    unsigned h = 0;
    int i;

    for ( i = 0; i < len; i++ )
        h = 33 * h + p[i];

    return h;
}

Well, that looks awfully familiar, doesn’t it? Once you have a name that’s not “that hash code computey thingy”, searching becomes a lot easier. Let’s un-boomer that code a bit and give things better names:

uint DjbHash(byte[] value)
{
    const uint initial = 0;
    const uint factor = 33;
    uint hash = initial;

    foreach (byte b in value)
    {
        hash = (factor * hash) + b;
    }

    return hash;
}

I chose a regular ol’ byte array since that’s the closest thing to a void pointer that C# has. There are basically 2 important components to a Bernstein hash:

- the initial value (0 here; the classic djb2 variant uses 5381)
- the multiplier, 33

As Walker explains, Bernstein posted this to comp.lang.c, but “its theory is not up to snuff”. The constant 33, for example, was never explained by Bernstein other than “any good multiplier works”. However, as much as I’d like to just take this and dunk on Joshua Bloch, I can’t really do that, because the dismissal of the Bernstein hash is done with an equal amount of explanation as its introduction. If I’m flaming Jon Skeet and Joshua Bloch for not being mathematically thorough, then I can’t really throw “but somebody on the interwebs said it was bad” back at them.

Mixing in some math

If you trawl Stack Overflow a bit more, you can discover that other people have in fact scrutinized this hash, or derivatives of it. One interesting observation is that the Bernstein hash closely mimics a Linear Congruential Generator (LCG) of the form:

X_{n+1} = (a * X_n + c) mod m

Where m is usually 2^32 or 2^64.

Or, if you don’t speak italics:

unsigned Xn1 = (a * Xn + c) % m;

Gee whiz, that looks an awful lot like what Bernstein does if you omit the modulo, doesn’t it? The thing is, however, that LCGs are for generating sequences of values! Since we are hashing, we are faced with the problem that c is the input to our hash, not a constant that we can tactically select. But does this actually matter? Well, if you think about it: no. Recall that there are a few properties that are important for our type of hash function:

- speed
- a uniform distribution of outputs
- the avalanche effect

Crucially, we do not need strong cryptographic properties such as:

- preimage resistance (given a hash, it is infeasible to find an input producing it)
- second preimage resistance (given an input, it is infeasible to find a different input with the same hash)
- strong, adversarial collision resistance (it is infeasible to find any two colliding inputs)

Mapping this to an LCG, that basically means that the ideal property of its period (the number of iterations before it restarts its sequence) would be equal to m, which would guarantee a perfectly uniform distribution! Enter the Hull–Dobell theorem, which states that an LCG will have a period equal to m if, and only if:

- m and c are coprime
- a − 1 is divisible by all prime factors of m
- a − 1 is divisible by 4 if m is divisible by 4

We can almost conform to this theorem, since we control a and m. The only thing we do not control is c, and thus we can never guarantee that c and m are coprime. But that’s the thing: if you conform to the Hull–Dobell theorem, you get a perfect LCG, and as stated before, “pretty decent” results are sufficient for what we’re after. Given that we pick m to be 2^32, we can pick an a that satisfies everything except that first, uncontrollable condition: a − 1 must be divisible by 2 (the only prime factor of 2^32) and by 4. Can you guess such an a? Right! a = 33 suddenly makes a whole lot of sense, since 33 − 1 = 32 is divisible by both.
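The theorem is easy to check empirically for a toy modulus. In this sketch (the parameters are my own picks), a = 5, c = 1, m = 16 satisfies all three Hull–Dobell conditions and achieves the full period of 16, while changing c to 2 breaks coprimality with m and halves the period:

```java
public class LcgPeriod {
    // number of steps before X_{n+1} = (a*X_n + c) mod m,
    // started at X_0 = 0, revisits its starting state
    public static int period(int a, int c, int m) {
        int x = 0, steps = 0;
        do {
            x = (a * x + c) % m;
            steps++;
        } while (x != 0);
        return steps;
    }

    public static void main(String[] args) {
        // c=1 is coprime to 16, a-1=4 is divisible by 2 and by 4: full period
        System.out.println(period(5, 1, 16)); // 16
        // c=2 shares a factor with m=16: only half the states are reachable
        System.out.println(period(5, 2, 16)); // 8
    }
}
```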

Now, remember that as Bernstein says, “any good multiplier works”, and the most important properties we can deduce from our LCG are that a must not be even and, if possible, should be prime.

Also recall that Jon Skeet mentioned the Fowler–Noll–Vo (FNV) hash. FNV is stated to be “better” than the Bernstein hash, but it is mainly used because, just like the Bernstein hash, its implementation fits on the back of an envelope:

unsigned fnv_hash(void *key, int len)
{
    unsigned char *p = key;
    unsigned h = 2166136261;
    int i;

    for (i = 0; i < len; i++)
    {
        h = (h * 16777619) ^ p[i];
    }

    return h;
}

If you squint, you’ll see an LCG-ish construct appearing here as well.

So now we know why these hashes work, and why they became popular. But that doesn’t answer a fundamental question: are they actually good, and are there better alternatives?

Code is cheap, show me the benchmarks

It turns out that experts have not been idle and the state of the art has advanced quite a bit:

- MurmurHash and CityHash
- SipHash, designed specifically to resist hash flooding
- xxHash
- wyhash and its successor rapidhash

These are all non-cryptographic hash functions: the only reason we’re using them over SHA3 and friends is S P E E D. That raises the question: is there a faster algorithm? Well, luckily, Aras did all the hard work for us and benchmarked a bunch of non-cryptographic hashes. What’s interesting is that SipHash is faster than FNV and Bernstein hashing, but not by a whole lot. Another interesting observation is how slow Bernstein and FNV hashing are compared to the more optimized variants. Granted, this is most likely due to the modern hashes using SSE and AVX to speed up throughput, but the whole reason people stick with Bernstein is that it’s “easy and fast”. Well, the current state of the art has produced some fancy results that make that vibe no longer true.

So that begs the question: what do languages use for their hashing needs? Well, after some trawling, you end up with the following table:

Language         Algorithm
PHP              heavily optimized Bernstein (Yes, really)
Java             Bernstein (Yes, really)
JavaScript (V8)  rapidhash
Ruby             SipHash-1-3
Python           SipHash, dependent on configuration
Rust             SipHash-1-3
Swift            SipHash-2-4 and SipHash-1-3
C++ (libc++)     Murmur2 or CityHash
Go               AES (but not true AES)
.NET             xxHash32
Zig              wyhash

Several notes:

- For JavaScript, I only looked at V83.
- Go’s “not true AES” means the runtime mixes bits with AES CPU instructions where available, not that it runs the full cipher.

So… what do I use?

Well, the basic answer is “not Java”, but that’s been true for as long as that language has existed. In order of “engineering prudence”, the actual answers are:

- Use whatever your language’s standard library offers (System.HashCode in .NET).
- If you need more control, use a well-tested library implementation of a modern algorithm such as xxHash or SipHash.
- Only implement a published, benchmarked algorithm yourself as a last resort.

As for our earlier .NET example, we can use the aforementioned System.HashCode struct to compute our hash codes. Our Point example from earlier can easily be rewritten to:

public override int GetHashCode()
{
    return HashCode.Combine(X, Y);
}

This is the ideal solution to me: .NET gets high quality hash codes from you, can update the chosen algorithm whenever it wants to, and you do not need a PhD in cryptography to figure all this stuff out by yourself.

Unless you’re a masochist like me of course.
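As a closing aside, Java’s equivalent convenience API is java.util.Objects.hash, and, true to the table earlier, it still computes the Effective Java multiply-by-31 scheme under the hood rather than a modern mixer. A quick sketch showing the equivalence:

```java
import java.util.Objects;

public class JavaCombine {
    public static final class Point {
        final int x, y;
        public Point(int x, int y) { this.x = x; this.y = y; }

        @Override public boolean equals(Object o) {
            return o instanceof Point p && x == p.x && y == p.y;
        }
        @Override public int hashCode() {
            return Objects.hash(x, y); // Java's HashCode.Combine analogue
        }
    }

    public static void main(String[] args) {
        // Objects.hash is specified via Arrays.hashCode: start at 1,
        // then repeatedly multiply by 31 and add the next element's hash
        int manual = 31 * (31 * 1 + Integer.hashCode(3)) + Integer.hashCode(4);
        System.out.println(new Point(3, 4).hashCode() == manual); // true
    }
}
```

Unlike System.HashCode, that result is fully specified and stable across runs, which is convenient right up until someone floods your HashMap.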

Further reading

Footnotes

  1. Completely unrelated, but people relying on behavior with a giant DO NOT RELY ON THIS sign above it is why Go’s map iteration order is randomized.

    Meme showing D.W. from Arthur standing in front of a sign saying 'do not rely on this behavior' with her proclaiming 'that sign won't stop me because I can't read!'. Her character is marked with the text 'developers'
    Reading? DOCUMENTATION? What kind of madman do you take me for?!
  2. depending on how old a version you find, you may encounter one that writes h * 33 as h = ((h << 5) + h) + p[i], which is exactly the same, but optimized to use shifts instead of multiplication:

    # remember that left shifting a by x
    # is effectively the same as saying (a * (2 ** x))
    ((h << 5) + h) == h * 33
    ((h * (2 ** 5)) + h) == h * 33
    ((h * 32) + h) == h * 33
    h * 33 == h * 33
  3. No, Bun and Firefox’s SpiderMonkey don’t count: they are dying embers of a VC AI pump ‘n dump and Google’s “look FTC, we’re not a browser monopoly” scheme respectively.