Efficient File Copying On Linux

In response to my last post about dd, a friend of mine noticed that GNU cp always uses a 128 KB buffer size when copying a regular file; this is also the buffer size used by GNU cat. If you use strace to watch what happens when copying a file, you should see a lot of 128 KB read/write sequences:

$ strace -s 8 -xx cp /dev/urandom /dev/null
...
read(3, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
write(4, "\x61\xca\xf8\xff\x1a\xd6\x83\x8b"..., 131072) = 131072
read(3, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
write(4, "\xd7\x47\x8f\x09\xb2\x3d\x47\x9f"..., 131072) = 131072
read(3, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
write(4, "\x12\x67\x90\x66\xb7\xed\x0a\xf5"..., 131072) = 131072
read(3, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072
write(4, "\x9e\x35\x34\x4f\x9d\x71\x19\x6d"..., 131072) = 131072
...

As you can see, each copy is operating on buffers 131072 bytes in size, which is 128 KB. GNU cp is part of the GNU coreutils project, and if you go diving into the coreutils source code you’ll find this buffer size is defined in the file src/ioblksize.h. The comments in this file are really fascinating. The author of the code in this file (Jim Meyering) did a benchmark using dd if=/dev/zero of=/dev/null with different values of the block size parameter, bs. On a wide variety of systems, including older Intel CPUs, modern high-end Intel CPUs, and even an IBM POWER7 CPU, a 128 KB buffer size is fastest. I used gnuplot to graph these results, shown below. Higher transfer rates are better, and the different symbols represent different system configurations.

[Graph: dd transfer rate vs. buffer size across the benchmarked systems; higher is better, symbols denote different system configurations.]
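
To get a feel for the numbers yourself, here is a rough re-run of that benchmark as a Python sketch (the buffer-size range and the 256 MiB transfer size are my choices, not values from the coreutils comments). On Linux it copies /dev/zero to /dev/null at each buffer size and prints the throughput:

import time

TOTAL = 1 << 28  # copy 256 MiB at each buffer size

with open('/dev/zero', 'rb', buffering=0) as src, \
     open('/dev/null', 'wb', buffering=0) as dst:
    for shift in range(12, 22):  # 4 KB up to 2 MB
        bufsize = 1 << shift
        copied = 0
        start = time.perf_counter()
        while copied < TOTAL:
            dst.write(src.read(bufsize))
            copied += bufsize
        elapsed = time.perf_counter() - start
        print('%7d KB buffer: %6.0f MB/s'
              % (bufsize // 1024, TOTAL / elapsed / 1e6))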

https://eklitzke.org/efficient-file-copying-on-linux

An nginx config for 2017

With HTTP/2 in every browser, load balancing with automatic failover, IPv6, a sorry page, separate blog server, HTML5 SSE and A+ HTTPS.

nginx (pronounced ‘Engine X’) has excellent official documentation, but putting all the logic together can take a while. An average web app in 2017 might want:

HTTP/2 support in all browsers

For speed! One of the pages on our blog loads in 1.9s over HTTP/1.1. The same page loads in 600ms over HTTP/2.

IPv6 support

If you’re working on IoT devices, which often require IPv6.

Load balancing between multiple app servers with automatic failover.

So you can upgrade your app without taking it offline.

A branded ‘sorry’ page

Just in case you break both app servers at the same time.

A separate server that handles blogs and marketing content

So you can keep your blog independent of the main app and update it on its own schedule.

Correct proxy headers for working GeoIP and logging.

So your app servers can see the proper origin of browser requests, despite the proxy. Because asking customers for their country when you already know is a waste of their time.

Support for HTML5 Server Sent Events

For realtime streaming.

An A+ on the SSL Labs test

So users can connect privately to your site.

The various www vs non-www, HTTP vs HTTPS combinations redirected to a single HTTPS site.

This ensures there’s only one secure copy of every resource, for both clarity and SEO purposes.

We encourage you to check out the official nginx docs. However…
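
As a taste of what the full article builds up, here is a minimal sketch of the load-balancing, HTTP/2, and redirect pieces. The upstream addresses, server name, and certificate paths are placeholders, not the article's actual config:

# Two app servers; nginx fails over automatically when one stops responding.
upstream app {
    server 10.0.0.1:3000 fail_timeout=5s;
    server 10.0.0.2:3000 fail_timeout=5s;
}

# Redirect plain HTTP (v4 and v6) to the single canonical HTTPS site.
server {
    listen 80;
    listen [::]:80;
    return 301 https://example.com$request_uri;
}

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name example.com;

    ssl_certificate     /etc/ssl/example.com.crt;
    ssl_certificate_key /etc/ssl/example.com.key;

    location / {
        proxy_pass http://app;
        # Preserve the real client address for GeoIP and logging.
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
        # Keep connections open and unbuffered so HTML5 SSE streams work.
        proxy_http_version 1.1;
        proxy_set_header Connection '';
        proxy_buffering off;
    }
}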

https://certsimple.com/blog/nginx-http2-load-balancing-config

Screaming-fast Python 3.5+ web micro-framework integrated with pipelining HTTP server based on uvloop and picohttpparser

Is it possible? Probably not, until recently. Many large companies have been investigating migrations to other programming languages to boost performance and save on server costs, but there is no real need. Python can be the right tool for the job, and a lot of work on performance is happening in the community. CPython 3.6 boosted overall interpreter performance with a new dictionary implementation, and CPython 3.7 is going to be even faster thanks to a faster calling convention and dictionary lookup caches. For number-crunching tasks you can use PyPy with its just-in-time code compilation. PyPy can now run the NumPy test suite, and its overall compatibility with C extensions has improved drastically. Later this year PyPy is expected to reach Python 3.5 conformance.

All this great work inspired me to innovate in one of the areas where Python is used extensively: web and microservice development.
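
A minimal Japronto app looks roughly like the hello world from the project's README (treat the details as approximate and check the repo for the canonical version):

from japronto import Application

# A handler receives a request and returns a response built from it.
def hello(request):
    return request.Response(text='Hello world!')

app = Application()
app.router.add_route('/', hello)
app.run(debug=True)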

https://medium.com/@squeaky_pl/million-requests-per-second-with-python-95c137af319#.sz5xqq9cq
https://github.com/squeaky-pl/japronto

How both TCP and Ethernet checksums fail

At Twitter, a team had an unusual failure where corrupt data ended up in memcache. The root cause appears to have been a switch that was corrupting packets. Most packets were being dropped and the throughput was much lower than normal, but some were still making it through. The hypothesis is that occasionally the corrupt packets had valid TCP and Ethernet checksums. One “lucky” packet stored corrupt data in memcache. Even after the switch was replaced, the errors continued until the cache was cleared. [Update 2016-02-12: Root cause found: this also involved a kernel bug!]

I was very excited to hear about this error, because it is a real-world example of something I wrote about seven years ago: the TCP checksum is weak. However, the Ethernet CRC is strong, so how could a corrupt packet pass both checks? The answer is that the Ethernet CRC is recalculated by switches. If the switch corrupts the packet in a way that happens to preserve the TCP checksum, the hardware blindly computes a new, valid Ethernet CRC as the packet goes out.

As Mark Callaghan pointed out, this is a very rare scenario and you should never blame the network without strong evidence. However, it isn’t impossible, and others have written about similar incidents. My conclusion is that if you are creating a new network protocol, you should append a 4-byte CRC (I suggest CRC32C, which is implemented in hardware on recent Intel, AMD, and ARM CPUs). An alternative is to use an encrypted protocol such as TLS, since its cryptographic hashes catch this kind of corruption (and fixed a similar incident).
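
To see just how weak the TCP checksum is, note that it is a ones'-complement sum of 16-bit words, so any corruption that merely reorders those words is invisible to it. A small self-contained demonstration (my own sketch, not code from the article):

import struct

def internet_checksum(data):
    """RFC 1071 checksum: ones'-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b'\x00'
    total = sum(struct.unpack('!%dH' % (len(data) // 2), data))
    while total >> 16:  # fold the carries back in
        total = (total & 0xffff) + (total >> 16)
    return ~total & 0xffff

original = b'\xde\xad\xbe\xef\xca\xfe'
corrupted = b'\xca\xfe\xbe\xef\xde\xad'  # first and last words swapped

assert original != corrupted
assert internet_checksum(original) == internet_checksum(corrupted)
print('both checksum to 0x%04x' % internet_checksum(original))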

The rest of this article describes the details about how this is possible, mostly so I don’t forget them.

http://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html

Basics of Making a Rootkit: From syscall to hook!

WARNING: This tutorial is for educational purposes only, and by NO MEANS should you actually be malicious when (or after) making a rootkit. I thought I’d share how to do this for any security-minded people who would like to learn more about how to prevent or look for rootkits. This will be done in C on Linux, probably using libraries and functions you’ve never seen. It is also advisable to do this in a VM to get the hang of compiling and loading modules. Messing with the kernel can cause things to go crazy, if not break entirely; you have been warned.

PyFilesystem – A Python interface to filesystems of all kinds

PyFilesystem is a Python module I started some time in 2008, and since then it has been very much a part of my personal standard library. I’ve used it in personal and professional projects, as have many other developers and organisations.

Recap

If you aren’t familiar with PyFilesystem, it’s an abstraction layer for filesystems. Essentially anything with files and directories (a hard drive, a zip file, an FTP server, network filesystems, etc.) may be wrapped with a common interface. With it, you can write code that is agnostic as to where the files are physically located.

Here’s a quick example that recursively counts the lines of code in a directory:

from fs import open_fs

def count_python_loc(fs):
    """Count non-blank lines of Python code."""
    count = 0
    for path in fs.walk.files(filter=['*.py']):
        with fs.open(path) as python_file:
            count += sum(1 for line in python_file if line.strip())
    return count

projects_fs = open_fs('~/projects')
print(count_python_loc(projects_fs))

The fs argument to count_python_loc is an FS object, which encapsulates everything you would need to do with a filesystem. Because of this abstraction, the same code will work with any filesystem. For instance, counting the lines of code in a zip file is a single line change:

projects_fs = open_fs('zip://projects.zip')
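
Other backends work the same way. For instance, assuming PyFilesystem's mem:// opener and its fs.copy module, you could stage a project tree in an in-memory filesystem and reuse the exact same function (a hypothetical session, not from the article):

from fs import open_fs
from fs.copy import copy_fs

# Copy the project tree into RAM, then count lines with the same function.
mem_fs = open_fs('mem://')
copy_fs(open_fs('~/projects'), mem_fs)
print(count_python_loc(mem_fs))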

https://www.willmcgugan.com/blog/tech/post/announcing-pyfilesystem-2/

Tutorial – Write a System Call

A while back, I wrote about writing a shell in C, a task which lets you peek under the covers of a tool you use daily. Underneath even a simple shell are many operating system calls, like read, fork, exec, wait, write, and chdir (to name a few). Now, it’s time to continue this journey down another level, and learn just how these system calls are implemented in Linux.

What is a system call?

Before we start implementing system calls, we’d better make sure we understand exactly what they are. A naive programmer—like me not that long ago—might define a system call as any function provided by the C library. But this isn’t quite true. Although many functions in the C library align nicely with system calls (like chdir), other ones do quite a bit more than simply ask the operating system to do something (such as fork or fprintf). Still others simply provide programming functionality without using the operating system, such as qsort and strtok.

In fact, a system call has a very specific definition. It is a way of requesting that the operating system kernel do something on your behalf. Operations like tokenizing a string don’t require interacting with the kernel, but anything involving devices, files, or processes definitely does.

System calls also behave differently under the hood than a normal function. Rather than simply jumping to some code from your program or a library, your program has to ask the CPU to switch into kernel mode, and then go to a predefined location within the kernel to handle your system call. This can be done in a few different ways, such as a processor interrupt, or special instructions such as syscall or sysenter. In fact, the modern way of making a system call in Linux is to let the kernel provide some code (called the VDSO) which does the right thing to make a system call. Here’s an interesting SO question on the topic.

Thankfully, all that complexity is handled for us. No matter how a system call is made, it all comes down to looking up the particular system call number in a table to find the correct kernel function to call. Since all you need is a table entry and a function, it’s actually very easy to implement your own system call. So let’s give it a shot!
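
You can watch this table lookup from userspace, too. libc exposes a generic syscall(2) wrapper, so the following Python sketch invokes getpid by its x86-64 table number (39; the numbers differ on other architectures):

import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)

# getpid's entry in arch/x86/entry/syscalls/syscall_64.tbl is number 39.
SYS_getpid = 39

pid_via_syscall = libc.syscall(SYS_getpid)
print(pid_via_syscall, os.getpid())  # both print the same PID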

https://brennan.io/2016/11/14/kernel-dev-ep3/

GNU Parallel Tutorial

This tutorial shows off much of GNU parallel’s functionality. It is meant to teach you GNU parallel’s options, not to show realistic real-world examples.
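
For a flavour of the basics before diving in, two classic invocations (adapted from the sort of examples the tutorial and man page use):

$ # Compress every log file, running one job per CPU core.
$ parallel gzip ::: /var/log/*.log

$ # Read arguments from stdin like xargs; {} is the input, {.} drops its extension.
$ find . -name '*.wav' | parallel lame {} -o {.}.mp3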

https://www.gnu.org/software/parallel/parallel_tutorial.html#GNU-Parallel-Tutorial

Shell tool for executing jobs in parallel

http://linuxsoft.cern.ch/cern/centos/7/cern/x86_64/repoview/parallel.html
http://linuxsoft.cern.ch/cern/centos/7/cern/x86_64/Packages/parallel-20150522-1.el7.cern.noarch.rpm