Hash Functions all the way down

A while ago I needed fast hash function for ~32 byte keys. We already had MurmurHash used in a bunch of places, so I started with that. But then I tried xxHash and that was a bit faster! So I dropped xxHash into the codebase, landed the thing to mainline and promptly left for vacation, with a mental note of “someday should look into other hash functions, or at least move them all under a single folder”.

So that’s what I did: “hey I’ll move source code of MurmurHash, SpookyHash and xxHash under a single place”. But that quickly spiraled out of control:

The things I found! Here’s what you find in a decent-sized codebase, with many people working on it:

  • Most places use a decent hash function – in our case Murmur2A for 32/64 bit hashes, and SpookyV2 for 128 bit hashes. That’s not a bad place to be in!
  • Murmur hash takes seed as input, and naturally almost all places in code copy-pasted the same random hex value as the seed 🙂
  • There are at least several copies of either FNV or djb2 hash function implementations scattered around, used in random places.
  • Some code uses really, REALLY bad hash functions. There’s even a comment above it, added a number of years ago, when someone found out it’s bad – however they only stopped using it in their own place, and did not replace other usages. Life always takes the path of least resistance 🙂 Thankfully, the places where said hash function was used were nothing “critical”.
  • While 95% of code that uses non-cryptographic hash functions uses them strictly for runtime reasons (i.e. they don’t actually care about exact value of the hash), there are some pieces that hash something, andserialize the hashed value. Each of these need to be looked at in detail, whether you can easily switch to a new hash function.
  • Some other hash related code (specifically, a struct we have to hold 128 bit hashed value, Hash128), were written in a funky way, ages ago. And of course some of our code got that wrong (thankfully, all either debug code, or test mocking code, or something not-critical). Long story short, do not have struct constructors like this: Hash128(const char* str, int len=16)!
    • Someone will think this accepts a string to hash, not “bytes of the hash value”.
    • Someone will pass "foo" into it, and not provide length argument, leading to out-of-bounds reads.
    • Some code will accidentally pass something like 0 to a function that accepts a Hash128, and because C++ is C++, this will get turned into a Hash128(NULL, 16) constructor, and hilarity will ensue.
    • Lesson: be careful with implicit constructors (use explicit). Be careful with default arguments. Don’t set types to const char* unless it’s really a string.

So what started out as “move some files” branch, ended up being a “move files, switch most of hash functions, remove some bad hash functions, change some code, fix some wrong usages, add tests and comments” type of thing. It’s a rabbit hole of hash functions, all the way down!