Xor Filters: Faster and Smaller Than Bloom Filters

Though the Bloom filter is a textbook algorithm, it has some significant downsides. A major one is that it needs many data accesses and many hash values to check that an object is part of the set. In short, it is not optimally fast.

Can you do better? Yes. Among other alternatives, Fan et al. introduced Cuckoo filters which use less space and are faster than Bloom filters. While implementing a Bloom filter is a relatively simple exercise, Cuckoo filters require a bit more engineering.

Could we do even better while limiting the code to something you can hold in your head?

It turns out that you can with xor filters. We just published a paper called Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters that will appear in the Journal of Experimental Algorithmics.

https://lemire.me/blog/2019/12/19/xor-filters-faster-and-smaller-than-bloom-filters/

Data modeling with Amazon DynamoDB

Modeling your data in the DynamoDB database structure requires a different approach from modeling in traditional relational databases. Alex DeBrie has written a number of applications using DynamoDB and is the creator of DynamoDBGuide.com, a free resource for learning DynamoDB. In this session, we review the key principles of modeling your DynamoDB tables and teach you some practical patterns to use in your data models. Leave this session with steps to follow and principles to guide you as you work with DynamoDB.

https://www.youtube.com/watch?v=DIQVJqiSUkE

Uber Go Style Guide

Styles are the conventions that govern our code. The term style is a bit of a misnomer, since these conventions cover far more than just source file formatting—gofmt handles that for us.

The goal of this guide is to manage this complexity by describing in detail the Dos and Don’ts of writing Go code at Uber. These rules exist to keep the code base manageable while still allowing engineers to use Go language features productively.

This guide was originally created by Prashant Varanasi and Simon Newton as a way to bring some colleagues up to speed with using Go. Over the years it has been amended based on feedback from others.

This documents idiomatic conventions in Go code that we follow at Uber. A lot of these are general guidelines for Go, while others extend upon external resources:

  1. Effective Go
  2. The Go common mistakes guide

All code should be error-free when run through golint and go vet. We recommend setting up your editor to:

  • Run goimports on save
  • Run golint and go vet to check for errors

You can find information in editor support for Go tools here: https://github.com/golang/go/wiki/IDEsAndTextEditorPlugins

https://github.com/uber-go/guide/blob/master/style.md

Go implementation of C++ STL iterators and algorithms

Although Go doesn’t have generics, we deserve to have reuseable general algorithms. iter helps improving Go code in several ways:

  • Some simple loops are unlikely to be wrong or inefficient, but calling algorithm instead will make the code more concise and easier to comprehend. Such as AllOfFindIfAccumulate.
  • Some algorithms are not complicated, but it is not easy to write them correctly. Reusing code makes them easier to reason for correctness. Such as ShuffleSamplePartition.
  • STL also includes some complicated algorithms that may take hours to make it correct. Implementing it manually is impractical. Such as NthElementStablePartitionNextPermutation.
  • The implementation in the library contains some imperceptible performance optimizations. For instance, MinmaxElement is done by taking two elements at a time. In this way, the overall number of comparisons is significantly reduced.

There are alternative libraries have similar goals, such as gostlgods and go-stp. What makes iter unique is:

  • Non-intrusive. Instead of introducing new containers, iter tends to reuse existed containers in Go (slice, string, list.List, etc.) and use iterators to adapt them to algorithms.
  • Full algorithms (>100). It includes almost all algorithms come before C++17. Check the Full List.

https://github.com/disksing/iter

How Uber stores financial transactions in ledgers using Amazon DynamoDB

Each day, millions of people move around the world with Uber. The associated financial transactions are stored in Uber’s Ledger Store with provable completeness backed by Amazon DynamoDB. In this session, we discuss why provable completeness is key for compliant storage of financial and other ledger-like use cases, and we explain how this can be implemented at global scale by using DynamoDB.

https://www.youtube.com/watch?v=iN6mhI5hFt4

How Verizon Media implemented push notification using Amazon DynamoDB

Verizon Media had to create a better, stronger, and faster push notification system to serve the requirements of iconic Verizon brands, fulfill push notification completion time of 27 million devices in under three minutes, and consistently show the push “toast” on all users’ lock screens. Verizon decided to use Amazon DynamoDB and other AWS services such as Amazon ElastiCache and Amazon SQS in conjunction with its own Vespa search engine to power all the use cases of its brands. It also uses Kubernetes to orchestrate microservices across many Amazon EC2 instances.

https://www.youtube.com/watch?v=FwWT6a3ikZ4

Go Modules from the private repository in the pipeline

Go Modules is a significant step forward in making the Golang dependency management “humanly natural”. However, when I recently tried to convert a project to Go 1.12 forcing it to use the Go Modules I came across a nasty issue. My problem was that some of the “in house” built modules are kept in the private Github repositories.

https://medium.com/@ikolomiyets/go-modules-from-the-private-repository-in-the-pipeline-5921d3ba0e40

Parsing 18 billion JSON lines with Go

At my employer Tjek we recently decided to rebuild our event pipeline and move it to Google BigQuery to reduce complexity in the stack and to remove some services that were no longer maintainable.

BigQuery offers a bunch of nice tools for querying and visualizing our data, which would enable a number of our internal teams to work directly with the data & share it with customers without having to make a request to the engineering department.

The new service that would replace the old http service was easy to write, but then came the question of moving the historic data into BigQuery.

To migrate the old data, we had to backfill our entire dataset, accumulated over the last 10 years, into BigQuery and this is the story of how it was done.

https://itnext.io/parsing-18-billion-lines-json-with-go-738be6ee5ed2

The Lesson to Unlearn

“The most damaging thing you learned in school wasn’t something you learned in any specific class. It was learning to get good grades.

When I was in college, a particularly earnest philosophy grad student once told me that he never cared what grade he got in a class, only what he learned in it. This stuck in my mind because it was the only time I ever heard anyone say such a thing.

For me, as for most students, the measurement of what I was learning completely dominated actual learning in college. I was fairly earnest; I was genuinely interested in most of the classes I took, and I worked hard. And yet I worked by far the hardest when I was studying for a test.

In theory, tests are merely what their name implies: tests of what you’ve learned in the class. In theory you shouldn’t have to prepare for a test in a class any more than you have to prepare for a blood test. In theory you learn from taking the class, from going to the lectures and doing the reading and/or assignments, and the test that comes afterward merely measures how well you learned…”

http://paulgraham.com/lesson.html