Hashing for large scale similarity

Similarity computation is a very common task in real-world machine learning and data mining problems such as recommender systems, spam detection, online advertising etc. Consider a tweet recommendation problem where one has to find tweets similar to the tweet user previously clicked. This problem becomes extremely challenging when there are billions of tweets created each day.

In this post, we will discuss the two most common similarity metric, namely Jaccard similarity and Cosine similarity; and Locality Sensitive Hashing based approximation of those metrics.


How Hash Algorithms Work

This page was written for people who really want to know exactly how hash algorithms work. The following is a complete detailed step-by-step walk through of exactly how hash algorithms work. This is written mainly for people with very good knowledge about computers, encryption, and logical operators. I did try to write it so that everyone could understand it, but you might find it boring/dry/confusing if you aren’t really interested in how hash algorithms work and know a fair amount already.


An immutable radix tree implementation in Golang

Provides the iradix package that implements an immutable radix tree. The package only provides a single Tree implementation, optimized for sparse nodes.

As a radix tree, it provides the following:

  • O(k) operations. In many cases, this can be faster than a hash table since the hash function is an O(k) operation, and hash tables have very poor cache locality.
  • Minimum / Maximum value lookups
  • Ordered iteration

A tree supports using a transaction to batch multiple updates (insert, delete) in a more efficient manner than performing each operation one at a time.

For a mutable variant, see go-radix.


Ideal Hash Trees

“Hash Trees with nearly ideal characteristics are described. These Hash Trees require no initial root hash table yet are faster and use significantly less space than chained or double hash trees. Insert, search and delete times are small and constant, independent of key set size, operations are O(1). Small worst-case times for insert, search and removal operations can be guaranteed and misses cost less than successful searches. Array Mapped Tries(AMT), first described in Fast and Space Efficient Trie Searches, Bagwell [2000], form the underlying data structure. The concept is then applied to external disk or distributed storage to obtain an algorithm that achieves single access searches, close to single access inserts and greater than 80 percent disk block load factors. Comparisons are made with Linear Hashing, Litwin, Neimat, and Schneider [1993] and B-Trees, R.Bayer and E.M.McCreight [1972]. In addition two further applications of AMTs are briefly described, namely, Class/Selector dispatch tables and IP Routing tables. Each of the algorithms has a performance and space usage that is comparable to contemporary implementations but simpler…”


Perceptual Hashing

“Perceptual hashes are another category of hashing functions that map source data into hashes while maintaining correlation. These types of functions allow us to make meaningful comparisons between hashes in order to indirectly measure the similarity between the source data…”

“A perceptual hash is a fingerprint of a multimedia file derived from various features from its content. Unlike cryptographic hash functions which rely on the avalanche effect of small changes in input leading to drastic changes in the output, perceptual hashes are “close” to one another if the features are similar…”