Data locality is very important. Keeping data close together in memory is a huge win on essentially every modern computing device because of the inherently multi-level memory hierarchy of our systems. Contiguous data leads to speed, and sometimes to counter-intuitive results, like insert-and-shift into the middle of a dynamic array actually being faster than the equivalent linked list insertion[1].
Despite these facts, the C++11 (and beyond) standard library interestingly provides a hash table implementation, std::unordered_map, that is a separate chaining hash table with linked list buckets. This design implies a strong trade-off: while elements inserted into the map are guaranteed to be stable in memory (never needing to be moved or copied), we now must chase pointers when walking through the buckets in the hash table.
What if we allow keys to be moved or copied around as the hash table grows? If we relax the requirement that key/value pairs never move after insertion, we open the door to implementing a hash table using open addressing. Here, the table is stored as a single large array whose slots each contain either a key/value pair or nothing. When an insertion collides, the table is scanned for an empty slot by "probing" through the array according to some probing strategy. The best-known strategy is linear probing, which has developed a bit of a bad reputation. But is the disdain deserved?
In this post, I'm going to walk you through some results that motivated the development of an open source (insertion-only) probing hash table framework distributed as part of the MeTA toolkit, and attempt to convince you that naive linear probing (or, at least, blocked probing) strategies are not nearly as bad as you might initially think. In particular, we'll benchmark hash tables with both integer and string keys, looking at how they perform in terms of build time, memory consumption, and query throughput.