Data Science Competitions 101: Anatomy and Approach

I recently participated in a weekend-long data science hackathon, titled ‘The Smart Recruits’. Organized by the amazing folks at Analytics Vidhya, it saw some serious competition. Although my performance can be classified as decent at best (47 out of 379 participants), it was among the more satisfying ones I have participated in on both AV (profile) and Kaggle (profile) over the last few months. Thus, I decided it might be worthwhile to try and share some insights as a data science autodidact.

This is how our bodies betray us in a lie

Let me start with a question: How do you know if a person is lying? If you’re like most people, your first response will be something like “Liars don’t make eye contact.” In a survey of 2,520 adults in sixty-three countries, 70 percent of respondents gave that answer. People also tend to list other allegedly telltale signs of lying, such as fidgeting, nervousness and rambling. In an interview with the New York Times, psychologist Charles Bond, who studies deception, said the stereotype of what liars do “would be less puzzling if we had more reason to imagine that it was true.” It turns out that there’s no “Pinocchio effect,” no single nonverbal cue that will betray a liar. Judging a person’s honesty is not about identifying one stereotypical reveal, such as fidgeting or averted eyes. Rather, it’s about how well or poorly our multiple channels of communication — facial expressions, posture, movement, vocal qualities, speech — cooperate.


The early architecture of Uber consisted of a monolithic backend application written in Python that used Postgres for data persistence. Since that time, the architecture of Uber has changed significantly, to a model of microservices and new data platforms. Specifically, in many of the cases where we previously used Postgres, we now use Schemaless, a novel database sharding layer built on top of MySQL. In this article, we’ll explore some of the drawbacks we found with Postgres and explain the decision to build Schemaless and other backend services on top of MySQL.

The Architecture of Postgres

We encountered many Postgres limitations:

  • Inefficient architecture for writes
  • Inefficient data replication
  • Issues with table corruption
  • Poor replica MVCC support
  • Difficulty upgrading to newer releases

We’ll look at all of these limitations through an analysis of Postgres’s representation of table and index data on disk, especially when compared to the way MySQL represents the same data with its InnoDB storage engine. Note that the analysis that we present here is primarily based on our experience with the somewhat old Postgres 9.2 release series. To our knowledge, the internal architecture that we discuss in this article has not changed significantly in newer Postgres releases, and the basic design of the on-disk representation in 9.2 hasn’t changed significantly since at least the Postgres 8.3 release (now nearly 10 years old).

Introduction to Zipline in Python

Zipline is a Python library for trading applications that powers the Quantopian service mentioned above. It is an event-driven system that supports both backtesting and live-trading.

In this article, we will learn how to install Zipline, implement a moving average crossover strategy, and calculate P&L, portfolio value, and similar statistics.

This article is divided into the following four sections:

  • Benefits of Zipline
  • Installation (how to install Zipline locally)
  • Structure (the format for writing code in Zipline)
  • Coding a moving average crossover strategy with Zipline

Benefits of Zipline

  • Ease of use
  • Zipline comes “batteries included” as many common statistics like moving average and linear regression can be readily accessed from within a user-written algorithm.
  • Input of historical data and output of performance statistics are based on Pandas DataFrames to integrate nicely into the existing PyData ecosystem
  • Statistical and machine learning libraries like matplotlib, scipy, statsmodels, and sklearn support development, analysis, and visualization of state-of-the-art trading systems
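
To make the moving average crossover idea concrete before any Zipline code, here is a minimal sketch in plain Python. It does not use Zipline's API at all; the function names, the window sizes (2 and 4), and the price series are made up for illustration. The rule assumed here is the common one: signal a buy when the short average crosses above the long average, and a sell when it crosses below.

```python
def simple_moving_average(prices, window):
    """Return one average per price; positions without a full window are None."""
    averages = []
    for i in range(len(prices)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(prices[i + 1 - window:i + 1]) / window)
    return averages


def crossover_signals(prices, short_window, long_window):
    """Return 1 (buy), -1 (sell) or 0 (no action) for each price."""
    short_ma = simple_moving_average(prices, short_window)
    long_ma = simple_moving_average(prices, long_window)
    signals = []
    prev_diff = None
    for short_value, long_value in zip(short_ma, long_ma):
        if short_value is None or long_value is None:
            signals.append(0)  # not enough history yet
            continue
        diff = short_value - long_value
        if prev_diff is not None and prev_diff <= 0 < diff:
            signals.append(1)   # short average crossed above the long average
        elif prev_diff is not None and prev_diff >= 0 > diff:
            signals.append(-1)  # short average crossed below the long average
        else:
            signals.append(0)
        prev_diff = diff
    return signals


# Hypothetical closing prices, chosen only to produce one buy and one sell.
prices = [10, 9, 8, 9, 11, 13, 12, 9, 7, 6]
signals = crossover_signals(prices, short_window=2, long_window=4)
print(signals)
```

In a backtesting library, the same comparison would typically run once per bar inside the strategy's event handler, with the library supplying the price history and tracking the resulting positions.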

Goodbye, Object Oriented Programming

There’s a great quote by Joe Armstrong, the creator of Erlang:

The problem with object-oriented languages is they’ve got all this implicit environment that they carry around with them. You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.

The first OO language I used was C++, then Smalltalk, and finally .NET and Java.

I was gung-ho to leverage the benefits of Inheritance, Encapsulation, and Polymorphism. The Three Pillars of the Paradigm.

I was eager to gain the promise of Reuse and leverage the wisdom gained by those who came before me in this new and exciting landscape.

I couldn’t contain my excitement at the thought of mapping my real-world objects into their Classes and expected the whole world to fall neatly into place.

I couldn’t have been more wrong.



At the lower levels, Uber’s engineers primarily write in Python, Node.js, Go, and Java. We started with two main languages: Node.js for the Marketplace team, and Python for everyone else. These first languages still power most services running at Uber today.

We adopted Go and Java for performance reasons, and we provide first-class support for both languages. Java takes advantage of the open source ecosystem and integrates with external technologies, like Hadoop and other analytics tools. Go gives us efficiency, simplicity, and runtime speed.

We rip out and replace older Python code as we break up the original code base into microservices. An asynchronous programming model gives us better throughput. We use Tornado with Python, but Go’s native support for concurrency is ideal for most new performance-critical services.
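
As a rough illustration of why an asynchronous model can improve throughput, the sketch below uses Python's standard asyncio module rather than Tornado, and the `fetch` coroutine is a hypothetical stand-in for a real network call. The three simulated requests wait concurrently, so the total time is close to the longest single delay rather than the sum of all three.

```python
import asyncio


async def fetch(name, delay):
    """Hypothetical I/O-bound call; asyncio.sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    # gather() runs the coroutines concurrently and returns their
    # results in the order the coroutines were passed in.
    return await asyncio.gather(
        fetch("a", 0.03),
        fetch("b", 0.02),
        fetch("c", 0.01),
    )


results = asyncio.run(main())
print(results)
```

While one request is waiting, the event loop is free to make progress on the others, which is the property that matters for services that spend most of their time on I/O.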

We write tools in C and C++ when it’s necessary (like for high-efficiency, high-speed code at the system level). We use software that’s written in those languages—HAProxy, for example—but for the most part, we don’t actually work in them.

And, of course, those working at the top of the stack write in languages beyond Java, Go, Python, and Node.

Connecting your App to a Wi-Fi Device

With the growth of the Internet of Things, connecting Android applications to Wi-Fi enabled devices is becoming more and more common. Whether you’re building an app for a remote viewfinder, to set up a connected light bulb, or to control a quadcopter, if it’s Wi-Fi based you will need to connect to a hotspot that may not have Internet connectivity.

From Lollipop onwards the OS became a little more intelligent, allowing multiple network connections and not routing data to networks that don’t have Internet connectivity. That’s very useful for users as they don’t lose connectivity when they’re near Wi-Fis with captive portals. Data routing APIs were added for developers, so you can ensure that only the appropriate app traffic is routed over the Wi-Fi connection to the external device.

Visualizing How Developers Rate Their Own Programming Skills

Stack Overflow, the favorite destination for software developers when something breaks for no apparent reason, recently released their 2016 Stack Overflow Survey Results with responses to the questions of “where they work, what they build, and who they are.” You can download the released dataset containing all 56,030 cleaned responses here.

One variable present in the dataset but surprisingly unaddressed in the official Stack Overflow analysis is the programming_ability field: "On a scale of 1-10, how would you rate your programming ability?"

I took a look at the 46,982 users who identified their programming ability in the survey. On average, developers rate themselves 7.09 / 10. And like most 1-10 rating scales, the distribution of self-assessments is unimodal around 7 and 8, with relatively rare 9’s and 10’s.
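
The summary above boils down to dropping respondents who left the field blank, then taking the mean and counting each rating. A minimal sketch with the standard library follows; the ratings listed are invented placeholder values, not survey data, and only the field name `programming_ability` comes from the dataset.

```python
from collections import Counter
from statistics import mean

# Placeholder values for the programming_ability field (NOT real survey data).
# None marks a respondent who skipped the question.
programming_ability = [7, 8, None, 7, 6, 9, None, 8, 7, 5, 10, 8]

# Keep only respondents who actually rated themselves.
ratings = [r for r in programming_ability if r is not None]

average = mean(ratings)
distribution = Counter(ratings)

print(f"{len(ratings)} of {len(programming_ability)} respondents answered")
print(f"average self-rating: {average:.2f} / 10")
for rating in sorted(distribution):
    # One '#' per respondent gives a quick text histogram.
    print(f"{rating:>2}: {'#' * distribution[rating]}")
```

On the real dataset the same steps would run over the full column after loading the released file, which is where the 46,982 answered responses and the 7.09 average come from.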