Interactive cli tool for HTTP inspection

Interactive cli tool for HTTP inspection

Wuzz command line arguments are similar to cURL’s arguments, so it can be used to inspect/modify requests copied from the browser’s network inspector with the “copy as cURL” feature.

Screaming-fast Python 3.5+ web micro-framework integrated with pipelining HTTP server based on uvloop and picohttpparser

Is it possible? Probably not until recently. Many large companies have been investigating migrating to other programming languages to boost their operation performance and save on server prices but there is no need really. Python can be right tool for the job and there is a lot of work around performance in the community happening. CPython 3.6 boosted overall interpreter performance with new dictionary implementation, CPython 3.7 is gonna be even faster thanks to introducing faster call convention and dictionary lookup caches. For number crunching tasks you can use PyPy with its just-in-time code compilation. Since recently it can run NumPy test suite and improved overall compatibility with C extensions drastically. Later this year PyPy is expected to reach Python 3.5 conformance.

All this great work inspired me to innovate in one of the areas which Python is used extensively, web and micro-services development.

How both TCP and Ethernet checksums fail

At Twitter, a team had a unusual failure where corrupt data ended up in memcache. The root cause appears to have been a switch that was corrupting packets. Most packets were being dropped and the throughput was much lower than normal, but some were still making it through. The hypothesis is that occasionally the corrupt packets had valid TCP and Ethernet checksums. One “lucky” packet stored corrupt data in memcache. Even after the switch was replaced, the errors continued until the cache was cleared. [Update 2016-02-12: Root cause found: this also involved a kernel bug!]

I was very excited to hear about this error, because it is a real-world example of something I wrote about seven years ago: The TCP checksum is weak. However, the Ethernet CRC is strong, so how could a corrupt packet pass both checks? The answer is that the Ethernet CRC is recalculated by switches. If the switch corrupts the packet and it has the same TCP checksum, the hardware blindly recalculates a new, valid Ethernet CRC when it goes out.

As Mark Callaghan pointed out, this is a very rare scenario and you should never blame the network without strong evidence. However, it isn’t impossible and others have written about similar incidents. My conclusion is that if you are creating a new network protocol, please append a 4 byte CRC (I suggest CRC32C, implemented in hardware on recent Intel, AMD, and ARM CPUs). An alternative is to use an encryption protocol (e.g. TLS), since they include cryptographic hashes (which fixed a similar incident).

The rest of this article describes the details about how this is possible, mostly so I don’t forget them.

Introducing the ‘Startup Kit Serverless Workload’

“What’s the easiest way to get started on AWS?” is a common question. Although there are many well established paths to getting started, including using AWS Elastic Beanstalk, serverless computing is a rapidly growing alternative.

Serverless computing allows you to build and run applications and services without thinking about servers. On AWS, the AWS Lambda service is the central building block for serverless computing. AWS also provides several other services to support serverless architectures. These include Amazon API Gateway, which you can use with Lambda to create a RESTful API, and Amazon DynamoDB, a NoSQL cloud database service that frees you from the burden of setting up a database cluster.

A completely serverless architecture is shown in the following diagram.



Goroutines, Nonblocking I/O, And Memory Usage

I am generally a fan of Go’s approach to concurrency: writing code with goroutines is a lot easier than writing traditional nonblocking network servers in a language like C or C++. However, while working on a highly concurrent network proxy I came across an interesting realization about how the Go concurrency model makes it harder to write programs that do a lot of concurrent I/O with efficient memory usage.

The program in question is a network proxy akin to HAProxy or Envoy. Typically the proxy has a very large number of clients connected, but most of those clients are actually idle with no outstanding network requests. Each client connection has a read buffer and a write buffer. Therefore the naive memory usage of such a program is at least: #connections * (readbuf_sz + writebuf_sz).

There’s a trick you can do in a C or C++ program of this nature to reduce memory usage. Suppose that typically 5% of the client connections are actually active, and the other 95% are idle with no pending reads or writes. In this situation you can create a pool of buffer objects. When connections are actually active they acquire buffers to use for reading/writing from the pool, and when the connections are idle they release the buffers back to the pool. This reduces the number of allocated buffers to approximately the number of buffers actually needed by active connections. In this case using this technique will give a 20x memory reduction, since only 5% as many buffers will be allocated compared to the naive approach.

The reason this technique works at all is due to how nonblocking reads and writes work in C. In C you use a system call like select(2) or epoll_wait(2) to get a notification that a file descriptor is ready to be read/written, and then after that you explicitly call read(2) or write(2) yourself on that file descriptor. This gives you the opportunity to acquire a buffer after the call to select/epoll, but before making the read call…

uThreads: Concurrent User Threads in C++(and C)

uThreads is a concurrent library based on cooperative scheduling of user-level threads(fibers) implemented in C++. User-level threads are lightweight threads that execute on top of kernel threads to provide concurrency as well as parallelism. Kernel threads are necessary to utilize processors, but they come with the following drawbacks:

  • Each suspend/resume operation involves a kernel context switch
  • Thread preemption causes additional overhead
  • Thread priorities and advanced scheduling causes additional overhead

Cooperative user-level threads, on the other hand, provide light weight context switches and omit the additional overhead of preemption and kernel scheduling. Most Operating Systems only support a 1:1 thread mapping (1 user-level thread to 1 kernel-level thread), where multiple kernel threads execute at the same time to utilize multiple cores and provide parallelism. e.g., Linux supports only 1:1 thread mapping. There is also N:1 thread mapping, where multiple user-level threads can be mapped to a single kernel-level thread. The kernel thread is not aware of the user-level threads existence. For example, Facebook’s folly::fiber, libmill, and libtask use N:1 mapping. Having N:1 mapping means if the application blocks at the kernel level, all user-level threads are blocked and application cannot move forward. One way to address this is to only block on user level, hence, blocking user-level threads. This setting works very well with IO bound applications, however, if a user thread requires using a CPU for a while, it can block other user threads and the task is better to be executed asynchronously on another core to prevent this from happening. In order to avoid this problem, user threads can be mapped to multiple kernel-level threads. Thus, creating the third scenario with M:N or hybrid mapping. e.g., go and uC++ use M:N mapping.

uThreads supports M:N mapping of uThreads (user-level threads) over kThreads (kernel-level threads) with cooperative scheduling. kThreads can be grouped together by Clusters, and uThreads can migrate among Clusters. Figure 1 shows the structure of an application implemented using uThreads using a single ReadyQueue Scheduler. You can find the documentation here


Figure 1: uThreads Architecture

Some thoughts on asynchronous API design in a post-async/await world

I’ve recently been exploring the exciting new world of asynchronous I/O libraries in Python 3 – specifically asyncio and curio. These two libraries make some different design choices. This is an essay that I wrote to try to explain to myself what those differences are and why I think they matter, and distill some principles for designing event loop APIs and asynchronous libraries in Python. This is a quickly changing area and the ideas here are very much still under development, so this text probably assumes all kinds of background knowledge and possibly that you live inside my head – but maybe you’ll find it interesting anyway. I’d love to hear what you think or discuss further.

I don’t understand Python’s Asyncio

asyncio is supposed to implement asynchronous IO with the help of coroutines. Originally implemented as a library around the yield and yield from expressions it’s now a much more complex beast as the language evolved at the same time. So here is the current set of things that you need to know exist:

  • event loops
  • event loop policies
  • awaitables
  • coroutine functions
  • old style coroutine functions
  • coroutines
  • coroutine wrappers
  • generators
  • futures
  • concurrent futures
  • tasks
  • handles
  • executors
  • transports
  • protocols