Monitoring and Tuning the Linux Networking Stack: Receiving Data

This blog post explains how computers running the Linux kernel receive packets, as well as how to monitor and tune each component of the networking stack as packets flow from the network toward userland programs.
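As a small taste of the monitoring side, here is a minimal sketch that parses the per-CPU counters in /proc/net/softnet_stat. It assumes the common layout in which each row holds hexadecimal columns and the first three are packets processed, packets dropped, and "time squeeze" events. The function name and dictionary keys are made up for illustration, and column meanings can differ between kernel versions, so check them against your kernel source first.

```python
def parse_softnet_stat(text):
    """Parse the contents of /proc/net/softnet_stat into one dict per row.

    Assumption: the first three hex columns are processed, dropped and
    time_squeeze. Later columns are ignored because they vary by kernel.
    """
    stats = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        processed, dropped, time_squeeze = (int(f, 16) for f in fields[:3])
        stats.append({
            "row": len(stats),  # row index; usually, not always, the CPU number
            "processed": processed,
            "dropped": dropped,
            "time_squeeze": time_squeeze,
        })
    return stats


# Hypothetical two-CPU sample in the same shape as the real file.
SAMPLE = (
    "0000272d 00000000 00000001 00000000 00000000\n"
    "000034d9 00000002 00000000 00000000 00000000\n"
)

for row in parse_softnet_stat(SAMPLE):
    print(row)
```

On a Linux machine the same function can be fed the real file, for example by passing it the result of open("/proc/net/softnet_stat").read(). A steadily growing dropped or time_squeeze column is usually a hint that the receive path deserves a closer look.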

http://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/

Designing a fast Hash Table

This article describes the many design decisions that go into creating a fast, general-purpose hash table. It culminates with a benchmark between my own emilib::HashSet and C++11’s std::unordered_set. If you are interested in hash tables and designing one yourself (no matter which language you are programming in), this article might be for you.
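To make one such design decision concrete, here is a toy hash set that uses open addressing with linear probing and a power-of-two capacity. It is only an illustrative sketch, not emilib::HashSet or the article's code: the class name, the 50% load-factor threshold, and the lack of deletion support are all assumptions chosen to keep it short.

```python
class LinearProbingSet:
    """Toy open-addressing hash set with linear probing (no deletion)."""

    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, capacity=8):
        # Assumption: capacity is a power of two, so "hash & mask" can
        # replace the slower modulo when picking a slot.
        self._slots = [self._EMPTY] * capacity
        self._size = 0

    def _probe(self, key):
        # Walk forward from the hashed slot until the key or an empty
        # slot is found. The load-factor cap guarantees an empty slot.
        mask = len(self._slots) - 1
        i = hash(key) & mask
        while self._slots[i] is not self._EMPTY and self._slots[i] != key:
            i = (i + 1) & mask
        return i

    def add(self, key):
        # Grow before the table would pass 50% full so probes stay short.
        if (self._size + 1) * 2 > len(self._slots):
            self._grow()
        i = self._probe(key)
        if self._slots[i] is self._EMPTY:
            self._slots[i] = key
            self._size += 1

    def __contains__(self, key):
        return self._slots[self._probe(key)] is not self._EMPTY

    def __len__(self):
        return self._size

    def _grow(self):
        # Double the table and reinsert every key at its new position.
        old = self._slots
        self._slots = [self._EMPTY] * (len(old) * 2)
        self._size = 0
        for key in old:
            if key is not self._EMPTY:
                self.add(key)


s = LinearProbingSet()
for word in ["apple", "pear", "apple", "plum"]:
    s.add(word)
print(len(s), "pear" in s, "kiwi" in s)
```

Keeping all keys in one flat array is what makes this layout cache-friendly; the trade-off is that performance degrades quickly as the table fills, which is why the sketch resizes early.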

http://www.ilikebigbits.com/blog/2016/8/28/designing-a-fast-hash-table

ArangoDB is designed as a native multi-model database

ArangoDB is a distributed and highly scalable database for all data models. ArangoDB is fully-certified for DC/OS including persistent primitives. Setup and maintenance of a cluster is extremely easy.

Key Features in a Nutshell

Document

  • JOINs
  • Transactions
  • Schemaless
  • JSON Objects
  • Secondary Indexes
  • Compact Storage

Graph

Cluster

Native Multi-Model Database for Graph, Document, Key/Value and Search


https://github.com/joowani/python-arango

Itsy Bitsy Data Structures

Why should I care?

Data structures might not be the juiciest topic in the world, but they are hugely important to growing as an engineer. Knowing data structures doesn’t just make your programs faster and more efficient; it also helps you organize your code and your thoughts so that you can build more complicated programs without a ton of mental overhead.

But data structures are scary!

Yeah, lots of computer science topics are intimidating, and that’s largely a fault of how they are taught. In this guide, we’re going to do a high-level pass over a lot of the key things you need to know in order to dive into them deeper. It’s more about introducing you to the shared language of data structures.

Okay so where do I begin?

Awesome! Head on over to the itsy-bitsy-data-structures.js file.

https://github.com/thejameskyle/itsy-bitsy-data-structures

Text summarization with TensorFlow

Every day, people rely on a wide variety of sources to stay informed — from news stories to social media posts to search results. Being able to develop Machine Learning models that can automatically deliver accurate summaries of longer text can be useful for digesting such large amounts of information in a compressed form, and is a long-term goal of the Google Brain team.

Summarization can also serve as an interesting reading comprehension test for machines. To summarize well, machine learning models need to be able to comprehend documents and distill the important information, tasks which are highly challenging for computers, especially as the length of the documents increases.

In an effort to push this research forward, we’re open-sourcing TensorFlow model code for the task of generating news headlines on Annotated English Gigaword, a dataset often used in summarization research. We also specify the hyper-parameters in the documentation that achieve better than published state-of-the-art on the most commonly used metric as of the time of writing. Below we also provide samples generated by the model.

https://research.googleblog.com/2016/08/text-summarization-with-tensorflow.html

S3-compatible object storage on Cassandra

In an industry where so many people want to change the world, it’s fair to say that low cost object storage has done just that. Building a business that requires flexible low-latency storage is now affordable in a way we couldn’t imagine before.

When building the Exoscale public cloud offering, we knew that a simple object storage service, protected by Swiss privacy laws, would be crucial. After looking at the existing object storage software projects, we decided to build our own solution: Pithos.

Pithos is an open source S3-compatibility layer for Apache Cassandra, the column database. In other words, it allows you to use standard S3 tools to store objects in your own Cassandra cluster. If this is the first time that you’ve looked at object storage software, then you may wonder why Pithos is built on top of a NoSQL database, but it’s not all that unusual.

https://www.linkedin.com/pulse/s3-compatible-object-storage-cassandra-matthew-revell

Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department

“What is the relationship like between your team and the data scientists?” This is, without a doubt, the question I’m most frequently asked when conducting interviews for data platform engineers. It’s a fine question – one that, given the state of engineering jobs in the data space, is essential to ask as part of doing due diligence in evaluating new opportunities. I’m always happy to answer. But I wish I didn’t have to, because this is a question that is motivated by skepticism and fear.

Why is that? If you read the recruiting propaganda of data science and algorithm development departments in the valley, you might be convinced that the relationship between data scientists and engineers is highly collaborative, organic, and creative. Just like peas and carrots.

However, it’s not a well kept secret that this is seldom the case. Most shops foster a relationship between engineers and scientists that lies somewhere in the spectrum between non-existent and highly dysfunctional.

http://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/

BPF Compiler Collection (BCC)

BCC is a toolkit for creating efficient kernel tracing and manipulation programs, and includes several useful tools and examples. It makes use of extended BPF (Berkeley Packet Filters), formally known as eBPF, a new feature that was first added to Linux 3.15. Much of what BCC uses requires Linux 4.1 and above.

eBPF was described by Ingo Molnár as:

One of the more interesting features in this cycle is the ability to attach eBPF programs (user-defined, sandboxed bytecode executed by the kernel) to kprobes. This allows user-defined instrumentation on a live kernel image that can never crash, hang or interfere with the kernel negatively.

BCC makes BPF programs easier to write, with kernel instrumentation in C (and includes a C wrapper around LLVM), and front-ends in Python and Lua. It is suited for many tasks, including performance analysis and network traffic control.

https://github.com/iovisor/bcc