Why Bloom filters work the way they do

“Imagine you’re a programmer who is developing a new web browser. There are many malicious sites on the web, and you want your browser to warn users when they attempt to access dangerous sites. For example, suppose the user attempts to access http://domain/etc. You’d like a way of checking whether domain is known to be a malicious site. What’s a good way of doing this?

An obvious naive way is for your browser to maintain a list or set data structure containing all known malicious domains. A problem with this approach is that it may consume a considerable amount of memory. If you know of a million malicious domains, and domains need (say) an average of 20 bytes to store, then you need 20 megabytes of storage. That’s quite an overhead for a single feature in your web browser. Is there a better way?

In this post I’ll describe a data structure which provides an excellent way of solving this kind of problem. The data structure is known as a Bloom filter. Bloom filter are much more memory efficient than the naive “store-everything” approach, while remaining extremely fast. I’ll describe both how Bloom filters work, and also some extensions of Bloom filters to solve more general problems…”