Oh shit, git!

Git is hard: screwing up is easy, and figuring out how to fix your mistakes is fucking impossible. Git documentation has a chicken-and-egg problem: you can’t search for how to get yourself out of a mess unless you already know the name of the thing you need to know about in order to fix your problem.

So here are some bad situations I’ve gotten myself into, and how I eventually got myself out of them, in plain English.
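
The flavor of fix the post collects, sketched here from memory rather than quoted (the post’s exact commands may differ): Git keeps a log of everywhere HEAD has pointed, so even a botched reset is usually recoverable.

    # show everything HEAD has pointed at, including "lost" commits
    git reflog
    # jump back to the state before the mistake; the HEAD@{2} index
    # is illustrative, pick the right entry from the reflog output
    git reset --hard HEAD@{2}

Note that reset --hard also throws away uncommitted changes, so read the reflog output carefully before jumping.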

http://ohshitgit.com/

Moving persistent data out of Redis

Historically, we have used Redis in two ways at GitHub:

We used it as an LRU cache to conveniently store the results of expensive computations over data originally persisted in Git repositories or MySQL. We call this transient Redis.

We also enabled persistence, which gave us durability guarantees over data that was not stored anywhere else. We used it to store a wide range of values: from sparse data with high read/write ratios, like configuration settings, counters, or quality metrics, to very dynamic information powering core features like spam analysis. We call this persistent Redis.

Recently we made the decision to disable persistence in Redis and stop using it as a source of truth for our data. The main motivations behind this choice were to:

  • Reduce the operational cost of our persistence infrastructure by removing some of its complexity.
  • Take advantage of our expertise operating MySQL.
  • Gain some extra performance by eliminating the I/O latency incurred when writing large changes in server state to disk.

Transitioning all that information transparently involved planning and coordination. For each problem domain using persistent Redis, we considered the volume of operations, the structure of the data, and the different access patterns to predict the impact on our current MySQL capacity and the need to provision new hardware.

For the majority of callsites, we replaced persistent Redis with GitHub::KV, a MySQL key/value store of our own built atop InnoDB, with features like key expiration. We were able to use GitHub::KV almost exactly as we had used Redis: from trending repositories and users for the explore page, to rate limiting, to spammy user detection.
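
GitHub::KV is plain SQL underneath. As a rough sketch of the pattern rather than GitHub’s actual schema, an InnoDB key/value table with expiration can be as small as this (the table layout and key name here are illustrative assumptions):

    CREATE TABLE key_values (
      `key`        VARBINARY(255)  NOT NULL,
      `value`      MEDIUMBLOB      NOT NULL,
      `expires_at` DATETIME        NULL,
      PRIMARY KEY (`key`)
    ) ENGINE=InnoDB;

    -- reads treat expired rows as missing, mimicking Redis TTLs
    SELECT `value` FROM key_values
     WHERE `key` = 'explore:trending-repos'
       AND (`expires_at` IS NULL OR `expires_at` > NOW());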

http://githubengineering.com/moving-persistent-data-out-of-redis

Context aware MySQL pools via HAProxy

At GitHub we use MySQL as our main datastore. While repository data lives in Git, metadata is stored in MySQL: Issues, Pull Requests, Comments, and so on. We also authenticate against MySQL via a custom Git proxy (babeld). To serve the high load GitHub operates under, we use MySQL replication to scale out read load.

We have different clusters providing different types of services, but the single-writer-multiple-readers design applies to them all. Depending on traffic growth, application demand, operational tasks, and other constraints, we take replicas in and out of our pools. Depending on workload, some replicas may lag more than others.

Displaying up-to-date data is important. We have tooling that helps us keep replication lag to a minimum, and it typically doesn’t exceed 1 second. However, lag does sometimes happen, and when it does, we want to set those lagging replicas aside, let them catch their breath, and avoid sending traffic their way until they have caught up.

We set out to create a self-managing topology that will exclude lagging replicas automatically, handle disasters gracefully, and yet allow for complete human control and visibility.
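
In HAProxy terms, the moving part is a per-server health check: each replica sits in a backend, and a small HTTP endpoint on the replica host reports whether its lag is acceptable. A minimal sketch along those lines (the endpoint, port, and thresholds are made-up examples, not GitHub’s actual configuration):

    backend mysql_ro
      mode tcp
      balance roundrobin
      # the check hits a local agent that returns 200 while replication
      # lag is under threshold, and an error status once it is not
      option httpchk GET /replica-health
      default-server port 9876 inter 1000 rise 2 fall 2
      server replica1 10.0.0.11:3306 check
      server replica2 10.0.0.12:3306 check

A lagging replica then drops out of the pool on its own and rejoins once it catches up, while operators can still pull a server manually at any time.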

http://githubengineering.com/context-aware-mysql-pools-via-haproxy/

Brubeck, a statsd-compatible metrics aggregator

“…Taking an existing application and rewriting it in another language very rarely gives good results. Especially in the case of a Node.js server, you go from having an event loop written in C (libuv, the cross-platform library that powers the Node.js event framework, is written in C) to having, well… an event loop written in C.

A straight port of statsd to C would hardly offer the performance improvement we required. Instead of micro-optimizing a straight port to squeeze performance out of it, we focused on redesigning the architecture of the application so it became efficient, and then implemented it as simply as possible: that way, the app will run fast even with few optimizations, and the code will be less complex and hence more reliable.

The first thing we changed in Brubeck was the event-loop-based approach of the original statsd. Evented I/O on a single socket is a waste of cycles; while receiving 4 million packets per second, polling for read events will give you unsurprisingly predictable results: there is always a packet ready to be read. Because of this, we replaced the event loop with several worker threads sharing a single listen socket.
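
A minimal sketch of that design in C (the shape of it, not Brubeck’s actual source): several threads block on recv() on the same bound UDP socket, and the kernel hands each datagram to exactly one of them.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NWORKERS 4

    static void *worker(void *arg) {
        int sock = *(int *)arg;
        char buf[1500];
        for (;;) {
            /* every worker blocks on the same socket; the kernel
               delivers each datagram to exactly one of them */
            ssize_t n = recv(sock, buf, sizeof(buf) - 1, 0);
            if (n < 0)
                continue;
            buf[n] = '\0';
            /* parse the statsd line in buf and aggregate it here */
        }
        return NULL;
    }

    int main(void) {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8125); /* statsd's usual UDP port */
        if (sock < 0 || bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("socket/bind");
            return 1;
        }

        pthread_t threads[NWORKERS];
        for (int i = 0; i < NWORKERS; i++)
            pthread_create(&threads[i], NULL, worker, &sock);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }

Build with cc -pthread. On Linux, SO_REUSEPORT with one socket per thread is a common refinement of the same idea.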

Several threads working on aggregating the same metrics means that access to the metrics table needs to be synchronized. We used a modified version of a concurrent, read-lock-free hash table with optimistic locking on writes, optimized for applications with high read-to-write ratios, which performs exceedingly well for our use case…”
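
The read side of that trade-off is easy to sketch in C11. This is a simplified illustration, not the table Brubeck uses (real optimistic schemes add per-bucket version counters so readers can detect a concurrent write and retry): readers chase bucket chains through acquire loads and never lock, while writers serialize on a mutex and publish fully-built nodes with a release store.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <string.h>

    #define NBUCKETS 1024

    struct node {
        const char *key;
        _Atomic long value;          /* the aggregated metric */
        struct node *_Atomic next;
    };

    static struct node *_Atomic buckets[NBUCKETS];
    static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

    /* lock-free read path: safe because writers only publish
       fully-initialized nodes with a release store */
    struct node *lookup(const char *key, unsigned hash) {
        struct node *n = atomic_load_explicit(&buckets[hash % NBUCKETS],
                                              memory_order_acquire);
        for (; n; n = atomic_load_explicit(&n->next, memory_order_acquire))
            if (strcmp(n->key, key) == 0)
                return n;
        return NULL;
    }

    /* writers serialize among themselves but never block readers */
    void insert(struct node *n, unsigned hash) {
        pthread_mutex_lock(&write_lock);
        n->next = atomic_load(&buckets[hash % NBUCKETS]);
        atomic_store_explicit(&buckets[hash % NBUCKETS], n,
                              memory_order_release);
        pthread_mutex_unlock(&write_lock);
    }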

http://githubengineering.com/brubeck/

How to undo (almost) anything with Git

“One of the most useful features of any version control system is the ability to “undo” your mistakes. In Git, “undo” can mean many slightly different things.

When you make a new commit, Git stores a snapshot of your repository at that specific moment in time; later, you can use Git to go back to an earlier version of your project.

In this post, I’m going to take a look at some common scenarios where you might want to “undo” a change you’ve made and the best way to do it using Git…”
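
Scenarios of the sort the post walks through, using standard Git commands (the post’s own examples may differ):

    # fix the most recent commit, its message or its contents
    git commit --amend

    # undo an already-pushed commit by adding a new commit that reverses it
    git revert <sha>

    # un-commit the last commit locally but keep its changes staged
    git reset --soft HEAD~1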

https://github.com/blog/2019-how-to-undo-almost-anything-with-git