How we tracked down (what seemed like) a memory leak in one of our Go microservices

A blog special from the Detectify backend team:

The backend team at Detectify has been working with Go for some years now, and it’s the language we chose to power our microservices. We think Go is a fantastic language, and it has proven to perform very well for our operations. It also comes with a great toolset, including a tool we’ll touch on later called pprof.

However, even though Go performs very well, we noticed that one of our microservices exhibited behavior very similar to a memory leak.

In this post we will go step by step through our investigation of this problem, the thought process behind our decisions, and the details needed to understand and fix it.


When AWS Autoscale Doesn’t

The premise behind autoscaling in AWS is simple: you can maximize your ability to handle load spikes and minimize costs by automatically scaling your application out and in based on metrics like CPU or memory utilization. If you need 100 Docker containers to support your load during the day but only 10 at night when load is lower, then running 100 containers at all times means you’re using 900% more capacity than you need every night. With a constant container count, you’re either spending more money than you need most of the time, or your service will likely fall over during a load spike.

Full-system dynamic tracing on Linux using eBPF and bpftrace

Linux has two well-known tracing tools:

  • strace allows you to see what system calls are being made.
  • ltrace allows you to see what dynamic library calls are being made.

Though useful, these tools are limited. What if you want to trace what happens inside a system call or library call? What if you want to do more than just logging calls, e.g. you want to compile statistics on certain behavior? What if you want to trace multiple processes and correlate data from multiple sources?

In 2019, there’s finally a decent answer to that on Linux: bpftrace, based on eBPF technology. Bpftrace allows you to write small programs that execute whenever an event occurs.

This article shows you how to set up bpftrace and teaches you its basic usage. I’ll also give an overview of what the tracing ecosystem looks like (e.g. “what’s eBPF?”) and how it came to be what it is today.
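As a taste of that basic usage, here is the canonical one-liner from the bpftrace documentation, which counts system calls by process name. The program between the quotes runs in the kernel each time the raw_syscalls:sys_enter tracepoint fires, and the per-process counts are printed when you press Ctrl-C (running it requires root):

```bpftrace
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
```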

Efficient File Copying On Linux

In response to my last post about dd, a friend of mine noticed that GNU cp always uses a 128 KB buffer size when copying a regular file; this is also the buffer size used by GNU cat. If you use strace to watch what happens when copying a file, you should see a lot of 128 KB read/write sequences:

$ strace -s 8 -xx cp /dev/urandom /dev/null
read(3, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
write(4, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
read(3, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
write(4, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
read(3, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
write(4, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
read(3, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072
write(4, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072

As you can see, each copy is operating on buffers 131072 bytes in size, which is 128 KB. GNU cp is part of the GNU coreutils project, and if you go diving into the coreutils source code you’ll find this buffer size is defined in the file src/ioblksize.h. The comments in this file are really fascinating. The author of the code in this file (Jim Meyering) did a benchmark using dd if=/dev/zero of=/dev/null with different values of the block size parameter, bs. On a wide variety of systems, including older Intel CPUs, modern high-end Intel CPUs, and even an IBM POWER7 CPU, a 128 KB buffer size is fastest. I used gnuplot to graph these results, shown below. Higher transfer rates are better, and the different symbols represent different system configurations.

[Graph: dd transfer rate vs. buffer size]

BPF Compiler Collection (BCC)

BCC is a toolkit for creating efficient kernel tracing and manipulation programs, and includes several useful tools and examples. It makes use of extended BPF (Berkeley Packet Filters), formally known as eBPF, a new feature that was first added to Linux 3.15. Much of what BCC uses requires Linux 4.1 and above.

eBPF was described by Ingo Molnár as:

One of the more interesting features in this cycle is the ability to attach eBPF programs (user-defined, sandboxed bytecode executed by the kernel) to kprobes. This allows user-defined instrumentation on a live kernel image that can never crash, hang or interfere with the kernel negatively.

BCC makes BPF programs easier to write, with kernel instrumentation in C (it includes a C wrapper around LLVM) and front-ends in Python and Lua. It is suited for many tasks, including performance analysis and network traffic control.

Iris – The fastest backend web framework for Go

Go is a great technology stack for building scalable, web-based, back-end systems for web applications.

When you think about building web applications and web APIs, or simply HTTP servers in Go, does your mind go to the standard net/http package? If so, you then have to deal with common needs like dynamic (a.k.a. parameterized) routing, security and authentication, real-time communication, and many other issues that net/http doesn’t solve.

The net/http package is not complete enough to quickly build well-designed back-end web systems. When you realize this, you might be thinking along these lines:

  • Ok, the net/http package doesn’t suit me, but there are so many frameworks; which one will work for me?!
  • Each one of them tells me that it is the best. I don’t know what to do!

The truth

I did some deep research and benchmarks with ‘wrk’ and ‘ab’ in order to choose which framework would suit me and my new project. Sadly, the results were really disappointing to me.

I started wondering whether Golang wasn’t as fast on the web as I had read… but before giving up on Golang and going back to developing with Node.js, I told myself:

“Makis, don’t lose hope. Give Golang at least a chance. Try to build something totally new without basing it on the ‘slow’ code you saw earlier; learn the secrets of this language and make others follow your steps!”

These are the words I told myself that day [13 March 2016].

The same day, later that night, I was reading a book about Greek mythology. I saw the name of an ancient goddess and was immediately inspired to give a name to this new web framework (which I had already started writing): Iris.

Two months later, I’m writing this intro.

I’m still here because Iris has succeeded in being the fastest Go web framework.

memleax – detects memory leak of a running process

memleax attaches to a running process, hooks the memory allocation/free APIs, records all memory blocks, and reports in real time any blocks that live longer than 5 seconds (you can change this threshold with the -e option).

This makes it very convenient to use: there is no need to recompile the program or restart the target process. You run memleax to monitor the target process, wait for the real-time memory leak report, and then kill memleax (e.g. with Ctrl-C) to stop monitoring.

NOTE: Since memleax does not run for the whole lifetime of the target process, it assumes that long-lived memory blocks are leaks. The downside is that you have to set the expiry threshold with the -e option to suit your scenario; the upside, besides the convenience, is that allocations made during process initialization are skipped.

Dynamic tracing talk

Dynamic tracing is a kind of post-modern, advanced debugging technology. It helps software engineers answer difficult questions about software systems at very low cost and in a very short time, so that problems can be troubleshot and resolved faster.

It has risen and flourished against a particular backdrop: we live in an era of rapid Internet growth, and as engineers we face challenges on two fronts. The first is scale: whether it is the number of users or the size of the server fleet, everything is growing rapidly. The second is complexity: our business logic is ever more complex, and so are the software systems we run. We know they are divided into many, many layers: the operating system kernel; above it, all kinds of system software such as databases and web servers; higher up, the virtual machines, interpreters, and just-in-time (JIT) compilers for scripting and other high-level languages; and on top of all these layers of abstraction, the application-level business logic with its own large body of complex code.

Sysdig vs DTrace vs Strace: a Technical Discussion

First off, let me start with a big thank you to all of you for your interest in sysdig! We have been overwhelmed by the positive response from the community, and by the quality of the comments, questions, and contributions we’re receiving.

For the uninitiated, sysdig is a system-level exploration and troubleshooting tool for Linux with native support for containers. In this post, I want to try to answer two important and recurring questions we’ve received:

  1. “How does sysdig work?”
  2. “How is this different from the plethora of tools already available to analyze a Linux system or the processes that run on top of it (SystemTap, LTTng, DTrace, strace, and ktap, to name a few)?”

I’ll address both questions by providing a technical breakdown of sysdig’s architecture. But before doing that, let’s look at two very well-known tools: strace and DTrace.
