Fast forward another couple of months and I’m at my current job as a backend developer for Digg. When I joined, back in April of 2015, the stack at Digg was primarily Python, with the exception of two services written in, wait for it, Node. I was even more thrilled to be assigned the task of reworking one of those services, which had been causing issues in our pipeline.
Our troublesome Node service had a fairly straightforward purpose. Digg uses Amazon S3 for storage, which is peachy, except that S3 has no support for batch GET operations. Rather than putting all the onus on our Python web server to request 100 or more keys at a time from S3, the decision was made to take advantage of Node’s easy async code patterns and great concurrency handling. And so Octo, the S3 content fetching service, was born.
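To make the idea concrete, here is a minimal sketch of the fan-out pattern a service like Octo relies on: take a batch of keys, fire off one GET per key concurrently, and hand back the results together. This is not Octo’s actual code. It is written in Python with asyncio rather than Node, and `fetch_key` and `fetch_batch` are hypothetical names, with `fetch_key` standing in for a real S3 GET.

```python
import asyncio
from typing import Dict, List


async def fetch_key(key: str) -> bytes:
    # Hypothetical stand-in for a single S3 GET; a real service would
    # call an S3 client here instead of sleeping.
    await asyncio.sleep(0.01)  # simulate network latency
    return f"content-of-{key}".encode()


async def fetch_batch(keys: List[str]) -> Dict[str, bytes]:
    # Start one fetch per key and wait for all of them to finish.
    # gather() returns results in the same order as its inputs.
    bodies = await asyncio.gather(*(fetch_key(k) for k in keys))
    return dict(zip(keys, bodies))


if __name__ == "__main__":
    keys = [f"story/{i}" for i in range(100)]
    result = asyncio.run(fetch_batch(keys))
    print(len(result))
```

Because the fetches overlap, a 100-key batch takes roughly as long as one simulated GET instead of 100 of them in a row. Note that this sketch puts no cap on how many requests are in flight at once.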
Node Octo performed well, except for when it didn’t. Once a day it needed to handle a traffic spike where the requests per minute jumped from 50 to 200+. Also keep in mind that for each request, Octo typically fetches somewhere between 10 and 100 keys from S3. At 200 requests a minute and up to 100 keys per request, that’s potentially 20,000 S3 GETs a minute. The logs showed that our service slowed down substantially during these spikes, but the trouble was it didn’t always recover. As such, we were stuck bouncing our EC2 instances every couple of weeks after Octo would seize up and fall flat on its face.