Lessons from Building a Node App in Docker

Here are some tips and tricks that I learned the hard way when developing and deploying web applications written for node.js using Docker.

In this tutorial article, we’ll set up the socket.io chat example in docker, from scratch to production-ready, so hopefully you can learn these lessons the easy way. In particular, we’ll see how to:

  • Actually get started bootstrapping a node application with docker.
  • Not run everything as root (bad!).
  • Use binds to keep your test-edit-reload cycle short in development.
  • Manage node_modules in a container for fast rebuilds (there’s a trick to this).
  • Ensure repeatable builds with npm shrinkwrap.
  • Share a Dockerfile between development and production.

This tutorial assumes you already have some familiarity with Docker and node.js. If you’d like a gentle intro to docker first, you can try my slides about docker (discussion on hacker news) or try one of the many, many other docker intros out there.


Load Balancing DNS Traffic with NGINX and NGINX Plus

NGINX Plus R9 introduces the ability to reverse proxy and load balance UDP traffic, a significant enhancement to NGINX Plus’ Layer 4 load-balancing capabilities.

This blog post looks at the challenges of running a DNS server in a modern application infrastructure to illustrate how the open source NGINX software and NGINX Plus can effectively and efficiently load balance both UDP and TCP traffic (for brevity, we’ll refer to NGINX Plus for the rest of the post).

Load Balancing DNS Traffic with NGINX and NGINX Plus

6 Lesser Known Python Data Analysis Libraries

Python offers a great environment and rich set of libraries to developers while working with data. There are tons of useful libraries out there for novice or experienced developers or analysts for helping out with processing or visualizing datasets. Some of the libraries are really popular and used by millions of developers, for example – Pandas, Numpy, Scikit-learn, NTLK etc. Some of the libraries are not so well known and turned out to be handy in my experience. This article introduces 6 such Python libraries when working with data. Readers might already be familiarized with some of them, but I hope this article still proves to be useful.


Cassandra anti-patterns: Queues and queue-like datasets

Deletes in Cassandra

Cassandra uses a log-structured storage engine. Because of this, deletes do not remove the rows and columns immediately and in-place. Instead, Cassandra writes a special marker, called a tombstone, indicating that a row, column, or range of columns was deleted. These tombstones are kept for at least the period of time defined by the gc_grace_seconds per-table setting. Only then a tombstone can be permanently discarded by compaction.

This scheme allows for very fast deletes (and writes in general), but it’s not free: aside from the obvious RAM/disk overhead of tombstones, you might have to pay a certain price when reading data back if you haven’t modelled your data well.

Specifically, tombstones will bite you if you do lots of deletes (especially column-level deletes) and later perform slice queries on rows with a lot of tombstones.

Symptoms of a wrong data model

To illustrate this scenario, let’s consider the most extreme case – using Cassandra as a durable queue, a known anti-pattern, e.g.

CREATE TABLE queues ( name text, enqueued_at timeuuid, payload blob, PRIMARY KEY (name, enqueued_at) );

Having enqueued 10000 10-byte messages and then dequeued 9999 of them, one by one, let’s peek at the last remaining message using cqlsh with TRACING ON:

SELECT enqueued_at, payload FROM queues WHERE name = 'queue-1' LIMIT 1;


Tensorflow – Neural Network Playground


What Is a Neural Network?

It’s a technique for building a computer program that learns from data. It is based very loosely on how we think the human brain works. First, a collection of software “neurons” are created and connected together, allowing them to send messages to each other. Next, the network is asked to solve a problem, which it attempts to do over and over, each time strengthening the connections that lead to success and diminishing those that lead to failure. For a more detailed introduction to neural networks, Michael Nielsen’s Neural Networks and Deep Learning is a good place to start. For more a more technical overview, try Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.


Running Ansible 2 Programmatically

Ansible 2 is out, and that means it’s time to upgrade the previous article on Running Ansible Programmatically for Ansible 2, which has significant API changes under the hood.

Use Case

At work, we are spinning up hosted trials for a historically on-premise product (no multi-tenancy).

To ensure things run smoothly, we need logging and reporting of Ansible runs while these trials spin up or are updated.

Each server instance (installation of the application) has unique data (license, domain configuration, etc).

Running Ansible programmatically gives us the most flexibility and has proven to be a reliable way to go about this.

At the cost of some code complexity, we gain the ability to avoid generating host and variable files on the system (although dynamic host generations may have let us do this – this is certainly not THE WAY™ to solve this problem).

There are ways to accomplish all of this without diving into Ansible’s internal API. The trade-off seems to be control vs “running a CLI command from another program”, which always feels dirty.

Learning some of Ansible’s internals was fun, so I went ahead and did it.

Overall, there’s just more control when calling the Ansible API programmatically.


Statistics for Software

Software development begins as a quest for capability, doing what could not be done before. Once that what is achieved, the engineer is left with the how. In enterprise software, the most frequently asked questions are, “How fast?” and more importantly, “How reliable?”

Questions about software performance cannot be answered, or even appropriately articulated, without statistics.

Yet most developers can’t tell you much about statistics. Much like math, statistics simply don’t come up for typical projects. Between coding the new and maintaining the old, who has the time?

Engineers must make the time. I understand fifteen minutes can seem like a big commitment these days, so maybe bookmark it. Insistent TLDR seekers can head for our instrumentation section or straight to the summary.

For the dedicated few, class is in session. It’s time to broach the topic, learn what works, and take the guesswork out of software. A few core practices go a long way in generating meaningful systems analysis. And a few common mistakes set projects way back. This guide aims to lighten software maintenance and speed up future development through answers made possible by the right kinds of applied statistics.


How to write a Bloom filter in C++

Bloom filters are data structures which can efficiently determine whether an element is possibly a member of a set or definitely not a member of a set. This article will cover a simple implementation of a C++ bloom filter. It’s not going to cover what bloom filters are or much of the math behind them, as there are other great resources covering those topics.


Debugging of CPython processes with gdb

pdb has been, is and probably always will be the bread and butter of Python programmers, when they need to find the root cause of a problem in their applications, as it’s a built-in and easy to use debugger. But there are cases, when pdb can’t help you, e.g. if your app has got stuck somewhere, and you need to attach to a running process to find out why, without restarting it. This is where gdb shines.