This tutorial shows off much of GNU parallel‘s functionality. The tutorial is meant to learn the options in GNU parallel. The tutorial is not to show realistic examples from the real world.
Linux based distributions have featured set of commands which provide way to configure networking in easy and powerful way through command-line. These set of commands are available from net-tools package which has been there for a long time on almost all distributions, and includes commands like: ifconfig, route, nameif, iwconfig, iptunnel, netstat, arp.
These commands are just about sufficient in configuring the network in a way any novice or an expert Linux user would want, but due to advancement in Linux kernel over past years and unmaintainable of this packaged set of commands, they are getting deprecated and a more powerful alternative which has ability to replace all of these commands is emerging.
This alternative has also been there for quite some time now and is much more powerful than any of these commands. Rest of sections would highlight this alternative and compare it with one of the command from net-tools package i.e. ifconfig.
As I was browsing the web and catching up on some sites I visit periodically, I found a cool article from Tom Haydenabout using Amazon Elastic Map Reduce (EMR) and mrjob in order to compute some statistics on win/loss ratios for chess games he downloaded from the millionbase archive, and generally have fun with EMR. Since the data volume was only about 1.75GB containing around 2 million chess games, I was skeptical of using Hadoop for the task, but I can understand his goal of learning and having fun with mrjob and EMR. Since the problem is basically just to look at the result lines of each file and aggregate the different results, it seems ideally suited to stream processing with shell commands. I tried this out, and for the same amount of data I was able to use my laptop to get the results in about 12 seconds (processing speed of about 270MB/sec), while the Hadoop processing took about 26 minutes (processing speed of about 1.14MB/sec).
After reporting that the time required to process the data with 7 c1.medium machine in the cluster took 26 minutes, Tom remarks
“This is probably better than it would take to run serially on my machine but probably not as good as if I did some kind of clever multi-threaded application locally.”
This is absolutely correct, although even serial processing may beat 26 minutes. Although Tom was doing the project for fun, often people use Hadoop and other so-called Big Data ™ tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques.
One especially under-used approach for data processing is using standard shell tools and commands. The benefits of this approach can be massive, since creating a data pipeline out of shell commands means that all the processing steps can be done in parallel. This is basically like having your own Storm cluster on your local machine. Even the concepts of Spouts, Bolts, and Sinks transfer to shell pipes and the commands between them. You can pretty easily construct a stream processing pipeline with basic commands that will have extremely good performance compared to many modern Big Data ™ tools.
An additional point is the batch versus streaming analysis approach. Tom mentions in the beginning of the piece that after loading 10000 games and doing the analysis locally, that he gets a bit short on memory. This is because all game data is loaded into RAM for the analysis. However, considering the problem for a bit, it can be easily solved with streaming analysis that requires basically no memory at all. The resulting stream processing pipeline we will create will be over 235 times faster than the Hadoop implementation and use virtually no memory.
jq is like
sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that
grep and friends let you play with text.
I’ve written a few shell scripts in my time and have read many more, and I see the same issues cropping up again and again (unfortunately even in my own scripts sometimes).
While there are lots of shell programming pitfalls, at least the interpreter will tell you immediately about them. The mistakes I describe below, generally mean that your script will run fine now, but if the data changes or you move your script to another system, then you may have problems.
I think some of the reason shell scripts tend to have lots of issues is that commonly one doesn’t learn shell scripting like “traditional” programming languages. Instead scripts tend to evolve from existing interactive command line use, or are based on existing scripts which themselves have propagated the limitations of ancient shell script interpreters.
It’s definitely worth spending the relatively small amount of time required to learn the shell script language correctly, if one uses linux/BSD/Mac OS X desktops or servers, where it is commonly used.
This is a catalog of things UNIX-like/POSIX-compliant operating systems can do atomically, making them useful as building blocks for thread-safe and multi-process-safe programs without mutexes or read/write locks. The list is by no means exhaustive and I expect it to be updated frequently for the foreseeable future.
The philosophy here is to let the kernel do as much work as possible. At my most pessimistic, I trust the kernel developers more than a trust myself. More practically, it’s stupid to spend CPU time locking around an operation that’s already atomic. Added 2010-01-07.