Strategies for Working with Message Queues

Message queues like Apache Kafka are a common component of distributed systems. This blog post will look at several different strategies for improving performance when working with message queues.

Model Overview

Kafka consists of topics which have one or more partitions.

Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

With this structured commit log, each consumer follows the same basic steps:

  1. A consumer is assigned a particular topic-partition (either manually or automatically via a consumer group)
  2. The previous offset is read so that the consumer will begin where it last left off
  3. Messages are consumed from Kafka
  4. Messages are processed in some way
  5. The processed message offset is committed back to Kafka

Other types of message queues (like AMQP) have a similar flow – messages are consumed, processed and acknowledged. Generally we rely on idempotent message processing – that is the ability to process the same message twice with no ill effect – and err on the side of only committing if we’re certain we’ve done what we need to. This gives us durability and guarantees that every message will be processed, even if our consumer process crashes.

Xor Filters: Faster and Smaller Than Bloom Filters

Though the Bloom filter is a textbook algorithm, it has some significant downsides. A major one is that it needs many data accesses and many hash values to check that an object is part of the set. In short, it is not optimally fast.

Can you do better? Yes. Among other alternatives, Fan et al. introduced Cuckoo filters which use less space and are faster than Bloom filters. While implementing a Bloom filter is a relatively simple exercise, Cuckoo filters require a bit more engineering.

Could we do even better while limiting the code to something you can hold in your head?

It turns out that you can with xor filters. We just published a paper called Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters that will appear in the Journal of Experimental Algorithmics.

Go implementation of C++ STL iterators and algorithms

Although Go doesn’t have generics, we deserve to have reuseable general algorithms. iter helps improving Go code in several ways:

  • Some simple loops are unlikely to be wrong or inefficient, but calling algorithm instead will make the code more concise and easier to comprehend. Such as AllOfFindIfAccumulate.
  • Some algorithms are not complicated, but it is not easy to write them correctly. Reusing code makes them easier to reason for correctness. Such as ShuffleSamplePartition.
  • STL also includes some complicated algorithms that may take hours to make it correct. Implementing it manually is impractical. Such as NthElementStablePartitionNextPermutation.
  • The implementation in the library contains some imperceptible performance optimizations. For instance, MinmaxElement is done by taking two elements at a time. In this way, the overall number of comparisons is significantly reduced.

There are alternative libraries have similar goals, such as gostlgods and go-stp. What makes iter unique is:

  • Non-intrusive. Instead of introducing new containers, iter tends to reuse existed containers in Go (slice, string, list.List, etc.) and use iterators to adapt them to algorithms.
  • Full algorithms (>100). It includes almost all algorithms come before C++17. Check the Full List.

How Twitch monitors its services with Amazon CloudWatch

Twitch is the leading service and community for multiplayer entertainment and is owned by Amazon. Twitch also provides social and features and micro-transaction features that drive content engagement for its audiences. These services operate at a high transaction volume.

Twitch uses Amazon CloudWatch to monitor its business-critical services. It emits custom metrics then visualizes and alerts based on predefined thresholds for these key metrics. The high volume of transactions handled by the Twitch services makes it difficult to design a metric ingestion strategy that provides sufficient throughput of data while balancing the cost of data ingestion.

Amazon CloudWatch client-side aggregations is a new feature of the PutMetricData API service that helps customers to aggregate data on the client-side, which increases throughput and efficiency. In this blog post we’ll show you how Twitch uses client-side data aggregations to build a more effective metric ingestion architecture while achieving substantial cost reductions.

PySnooper – Never use print for debugging again

PySnooper is a poor man’s debugger.

You’re trying to figure out why your Python code isn’t doing what you think it should be doing. You’d love to use a full-fledged debugger with breakpoints and watches, but you can’t be bothered to set one up right now.

You want to know which lines are running and which aren’t, and what the values of the local variables are.

Most people would use print lines, in strategic locations, some of them showing the values of variables.

PySnooper lets you do the same, except instead of carefully crafting the right print lines, you just add one decorator line to the function you’re interested in. You’ll get a play-by-play log of your function, including which lines ran and when, and exactly when local variables were changed.

What makes PySnooper stand out from all other code intelligence tools? You can use it in your shitty, sprawling enterprise codebase without having to do any setup. Just slap the decorator on, as shown below, and redirect the output to a dedicated log file by specifying its path as the first argument.

Orchestrating backend services with AWS Step Functions

The problem

In many use cases, there are processes that need to execute multiple tasks. We build micro-services or server-less functions like AWS Lambda functions to carry out these tasks. Almost all these services are stateless functions and there is need of queues or databases to maintain the state of individual tasks and the process as a whole. Writing code that orchestrates these tasks can be both painful and hard to debug and maintain. It’s not easy to maintain the state of a process in an ecosystem of micro-services and server-less functions.

Python for NLP: Introduction to the TextBlob Library

This is the seventh article in my series of articles on Python for NLP. In my previous article, I explained how to perform topic modeling using Latent Dirichlet Allocation and Non-Negative Matrix factorization. We used the Scikit-Learn library to perform topic modeling.

In this article, we will explore TextBlob, which is another extremely powerful NLP library for Python. TextBlob is built upon NLTK and provides an easy to use interface to the NLTK library. We will see how TextBlob can be used to perform a variety of NLP tasks ranging from parts-of-speech tagging to sentiment analysis, and language translation to text classification.