Command-line tools can be 235x faster than your Hadoop cluster

As I was browsing the web and catching up on some sites I visit periodically, I found a cool article from Tom Haydenabout using Amazon Elastic Map Reduce (EMR) and mrjob in order to compute some statistics on win/loss ratios for chess games he downloaded from the millionbase archive, and generally have fun with EMR. Since the data volume was only about 1.75GB containing around 2 million chess games, I was skeptical of using Hadoop for the task, but I can understand his goal of learning and having fun with mrjob and EMR. Since the problem is basically just to look at the result lines of each file and aggregate the different results, it seems ideally suited to stream processing with shell commands. I tried this out, and for the same amount of data I was able to use my laptop to get the results in about 12 seconds (processing speed of about 270MB/sec), while the Hadoop processing took about 26 minutes (processing speed of about 1.14MB/sec).

After reporting that the time required to process the data with 7 c1.medium machine in the cluster took 26 minutes, Tom remarks

“This is probably better than it would take to run serially on my machine but probably not as good as if I did some kind of clever multi-threaded application locally.”

This is absolutely correct, although even serial processing may beat 26 minutes. Although Tom was doing the project for fun, often people use Hadoop and other so-called Big Data ™ tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques.

One especially under-used approach for data processing is using standard shell tools and commands. The benefits of this approach can be massive, since creating a data pipeline out of shell commands means that all the processing steps can be done in parallel. This is basically like having your own Storm cluster on your local machine. Even the concepts of Spouts, Bolts, and Sinks transfer to shell pipes and the commands between them. You can pretty easily construct a stream processing pipeline with basic commands that will have extremely good performance compared to many modern Big Data ™ tools.

An additional point is the batch versus streaming analysis approach. Tom mentions in the beginning of the piece that after loading 10000 games and doing the analysis locally, that he gets a bit short on memory. This is because all game data is loaded into RAM for the analysis. However, considering the problem for a bit, it can be easily solved with streaming analysis that requires basically no memory at all. The resulting stream processing pipeline we will create will be over 235 times faster than the Hadoop implementation and use virtually no memory.

http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

Hadoop filesystem at Twitter

“Twitter runs multiple large Hadoop clusters that are among the biggest in the world. Hadoop is at the core of our data platform and provides vast storage for analytics of user actions on Twitter. In this post, we will highlight our contributions to ViewFs, the client-side Hadoop filesystem view, and its versatile usage here.

ViewFs makes the interaction with our HDFS infrastructure as simple as a single namespace spanning all datacenters and clusters. HDFS Federation helps with scaling the filesystem to our needs for number of files and directories while NameNode High Availability helps with reliability within a namespace. These features combined add significant complexity to managing and using our several large Hadoop clusters with varying versions. ViewFs removes the need for us to remember complicated URLs by using simple paths. Configuring ViewFs itself is a complex task at our scale. Thus, we run TwitterViewFs, a ViewFs extension we developed, that dynamically generates a new configuration so we have a simple holistic filesystem view…”

https://blog.twitter.com/2015/hadoop-filesystem-at-twitter

Predict Social Network Influence with R and H2O Ensemble Learning

“H2O is an awesome machine learning framework. It is really great for data scientists and business analysts “who need scalable and fast machine learning”. H2O is completely open source and what makes it important is that works right of the box. There seems to be no easier way to start with scalable machine learning. It hast support for R, Python, Scala, Java and also has a REST API and a own WebUI. So you can use it perfectly for research but also in production environments.

H2O is based on Apache Hadoop and Apache Spark which gives it enormous power with in-memory parallel processing…”

https://flipboard.com/topic/machinelearning/predict-social-network-influence-with-r-and-h2o-ensemble-learning/a-BwZuC4crRlScQ4Vvphg0BA%3Aa%3A214019425-edc42df137%2Fthinktostart.com

Introducing streaming k-means in Spark 1.2

“A key advantage of Spark is that its machine learning library (MLlib) and its library for stream processing (Spark Streaming) are built on the same core architecture for distributed analytics. This facilitates adding extensions that leverage and combine components in novel ways without reinventing the wheel. We have been developing a family of streaming machine learning algorithms in Spark within MLlib. In this post we describe streaming k-means clustering, included in the recently released Spark 1.2…”

Introducing streaming k-means in Apache Spark 1.2

Command-line tools can be 235x faster than your Hadoop cluster

“As I was browsing the web and catching up on some sites I visit periodically, I found a cool article from Tom Hayden about using Amazon Elastic Map Reduce (EMR) and mrjob in order to compute some statistics on win/loss ratios for chess games he downloaded from the millionbase archive, and generally have fun with EMR. Since the data volume was only about 1.75GB containing around 2 million chess games, I was skeptical of using Hadoop for the task, but I can understand his goal of learning and having fun with mrjob and EMR. Since the problem is basically just to look at the result lines of each file and aggregate the different results, it seems ideally suited to stream processing with shell commands. I tried this out, and for the same amount of data I was able to use my laptop to get the results in about 12 seconds (processing speed of about 270MB/sec), while the Hadoop processing took about 26 minutes (processing speed of about 1.14MB/sec)…”

http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html