The premise behind autoscaling in AWS is simple: you can maximize your ability to handle load spikes and minimize costs if you automatically scale your application out based on metrics like CPU or memory utilization. If you need 100 Docker containers to support your load during the day but only 10 when load is lower at night, running 100 containers at all times means that you’re using 900% more capacity than you need every night. With a constant container count, you’re either spending more money than you need to most of the time or your service will likely fall over during a load spike.
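The arithmetic behind that 900% figure can be checked with a short sketch. The function names here are hypothetical, and the proportional scaling rule in `desired_count` is an assumed simplification for illustration, not the algorithm AWS actually uses.

```python
import math


def excess_capacity_percent(provisioned: int, needed: int) -> float:
    """Return how far provisioned capacity exceeds need, as a percentage of need."""
    if needed <= 0:
        raise ValueError("needed must be positive")
    return (provisioned - needed) / needed * 100


def desired_count(current: int, metric: float, target: float, minimum: int = 1) -> int:
    """Assumed proportional rule: scale the count by metric/target, rounding up."""
    if target <= 0:
        raise ValueError("target must be positive")
    return max(minimum, math.ceil(current * metric / target))


# Figures from the text: 100 containers running, 10 needed at night.
print(f"{excess_capacity_percent(100, 10):.0f}% more capacity than needed")  # 900%

# Hypothetical spike: 10 containers at 90% CPU against a 50% target.
print(desired_count(current=10, metric=90.0, target=50.0))  # 18
```

Under this assumed rule, a fleet that sits far below its target utilization shrinks the same way it grows, which is where the overnight savings come from.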
Linux has two well-known tracing tools:
- strace allows you to see what system calls are being made.
- ltrace allows you to see what dynamic library calls are being made.
Though useful, these tools are limited. What if you want to trace what happens inside a system call or library call? What if you want to do more than just logging calls, e.g. you want to compile statistics on certain behavior? What if you want to trace multiple processes and correlate data from multiple sources?
This article shows you how to set up bpftrace and teaches you its basic usage. I’ll also give an overview of what the tracing ecosystem looks like (e.g. “what’s eBPF?”) and how it came to be what it is today.
In response to my last post about dd, a friend of mine noticed that GNU cp always uses a 128 KB buffer size when copying a regular file; this is also the buffer size used by GNU cat. If you use strace to watch what happens when copying a file, you should see a lot of 128 KB read/write sequences:
$ strace -s 8 -xx cp /dev/urandom /dev/null
...
read(3, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
write(4, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
read(3, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
write(4, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
read(3, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
write(4, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
read(3, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072
write(4, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072
...
As you can see, each copy is operating on buffers 131072 bytes in size, which is 128 KB. GNU cp is part of the GNU coreutils project, and if you go diving into the coreutils source code you’ll find this buffer size is defined in the file src/ioblksize.h. The comments in this file are really fascinating. The author of the code in this file (Jim Meyering) did a benchmark using dd if=/dev/zero of=/dev/null with different values of the block size parameter, bs. On a wide variety of systems, including older Intel CPUs, modern high-end Intel CPUs, and even an IBM POWER7 CPU, a 128 KB buffer size is fastest. I used gnuplot to graph these results, shown below. Higher transfer rates are better, and the different symbols represent different system configurations.
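To make the read/write pattern in the strace output concrete, here is a minimal sketch of a copy loop that uses the same 131072-byte buffer. It is an illustration only, not the coreutils implementation; `copy_file` and `BUFFER_SIZE` are names made up for this example.

```python
import os

BUFFER_SIZE = 128 * 1024  # 131072 bytes, the size seen in the strace output


def copy_file(src_path: str, dst_path: str, buffer_size: int = BUFFER_SIZE) -> int:
    """Copy src_path to dst_path in fixed-size chunks; return the bytes copied."""
    total = 0
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while True:
                chunk = os.read(src_fd, buffer_size)  # one read(2) per iteration
                if not chunk:  # an empty read means end of file
                    break
                view = memoryview(chunk)
                while view:  # os.write may write fewer bytes than requested
                    written = os.write(dst_fd, view)
                    view = view[written:]
                    total += written
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
    return total
```

Running a script that calls `copy_file` on a large regular file under strace should show alternating read and write calls with a requested size of 131072, much like the cp trace above.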
BCC is a toolkit for creating efficient kernel tracing and manipulation programs, and includes several useful tools and examples. It makes use of extended BPF (Berkeley Packet Filters), formally known as eBPF, a new feature that was first added to Linux 3.15. Much of what BCC uses requires Linux 4.1 and above.
eBPF was described by Ingo Molnár as:
One of the more interesting features in this cycle is the ability to attach eBPF programs (user-defined, sandboxed bytecode executed by the kernel) to kprobes. This allows user-defined instrumentation on a live kernel image that can never crash, hang or interfere with the kernel negatively.
BCC makes BPF programs easier to write, with kernel instrumentation in C (and includes a C wrapper around LLVM), and front-ends in Python and Lua. It is suited for many tasks, including performance analysis and network traffic control.
Go is a great technology stack for building scalable, web-based, back-end systems for web applications.
When you think about building web applications and web APIs, or simply building HTTP servers in Go, does your mind go to the standard net/http package? Then you have to deal with some common situations like dynamic (a.k.a. parameterized) routing, security and authentication, real-time communication and many other issues that net/http doesn’t solve.
The net/http package is not complete enough to quickly build well-designed back-end web systems. When you realize this, you might be thinking along these lines:
- Ok, the net/http package doesn’t suit me, but there are so many frameworks, which one will work for me?!
- Each one of them tells me that it is the best. I don’t know what to do!
I did some deep research and benchmarks with ‘wrk’ and ‘ab’ in order to choose which framework would suit me and my new project. The results, sadly, were really disappointing to me.
I started wondering if golang wasn’t as fast on the web as I had read… but, before I let Golang go and continued to develop with nodejs, I told myself:
‘Makis, don’t lose hope, give at least a chance to Golang. Try to build something totally new without basing it off the “slow” code you saw earlier; learn the secrets of this language and make others follow your steps!‘.
These are the words I told myself that day [13 March 2016].
Later that same night, I was reading a book about Greek mythology. I saw an ancient goddess’ name and was immediately inspired to give that name to this new web framework (which I had already started writing) – Iris.
Two months later, I’m writing this intro.
memleax attaches to a running process, hooks the memory allocation/free APIs, records all memory blocks, and reports in real time the blocks that live longer than 5 seconds (you can change this time with the -e option).
So it is very convenient to use. There is no need to recompile the program or restart the target process. You run memleax to monitor the target process, wait for the real-time memory leak report, and kill it (e.g. with Ctrl-C) to stop monitoring.
memleax does not follow the target process through its whole lifetime; instead, it assumes that long-lived memory blocks are leaks. The downside is that you have to set the expiry threshold with the -e option to suit your scenario. The upside, besides convenience, is that memory allocated during process initialization is skipped.
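The expiry heuristic described above can be sketched in a few lines. This is a toy model of the idea only, not memleax's actual code: the `LeakTracker` class, its method names, and the explicit `now` timestamps are all assumptions made for illustration.

```python
import time
from typing import Dict, List, Optional, Tuple


class LeakTracker:
    """Toy model: blocks that stay allocated past a threshold are suspected leaks."""

    def __init__(self, expire_seconds: float = 5.0) -> None:
        self.expire_seconds = expire_seconds  # plays the role of the -e option
        self._live: Dict[int, Tuple[int, float]] = {}  # address -> (size, alloc time)

    def on_alloc(self, address: int, size: int, now: Optional[float] = None) -> None:
        """Record a block when an allocation call returns."""
        self._live[address] = (size, time.monotonic() if now is None else now)

    def on_free(self, address: int) -> None:
        """Forget a block once it is freed; unknown addresses are ignored."""
        self._live.pop(address, None)

    def expired(self, now: Optional[float] = None) -> List[Tuple[int, int, float]]:
        """Return (address, size, age) for blocks older than the threshold."""
        now = time.monotonic() if now is None else now
        return sorted(
            (addr, size, now - born)
            for addr, (size, born) in self._live.items()
            if now - born > self.expire_seconds
        )


tracker = LeakTracker(expire_seconds=5.0)
tracker.on_alloc(0x1000, 64, now=0.0)  # never freed: reported once it expires
tracker.on_alloc(0x2000, 32, now=1.0)
tracker.on_free(0x2000)                # freed in time: never reported
tracker.on_alloc(0x3000, 16, now=4.0)  # still young at t=6: not reported yet
print(tracker.expired(now=6.0))        # [(4096, 64, 6.0)]
```

Because tracking only starts when the tracker is created, blocks allocated earlier are never recorded, which mirrors how allocations made during process initialization are skipped.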