Writing a cron job microservice with Serverless and AWS Lambda

We recently needed to create a new cron job to fetch all users from our database who are coming to the end of their trial and insert them into our customer.io database. Cron jobs are easy to write but difficult to set up. You can edit /etc/crontab on the server; if you’re using Heroku, you can use their Scheduler; or you can use some implementation of cron in your programming language of choice (e.g. Node.js).

The cron job that we needed to write was unrelated to our application code, so whilst we could have put the functionality in there, it seemed like the wrong place. Alternatively, we could have put the code onto a new server, but that would mean provisioning a new box for something that only runs once a day for 10 seconds, which seems very wasteful and expensive.
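
As a rough illustration of how small such a job is, here is a minimal Python sketch of a scheduled, Lambda-style function. Everything in it is an assumption made for illustration: the in-memory USERS list stands in for the real database, the three-day window and the optional "today" event field are hypothetical, and the customer.io call is left out entirely. Only the handler(event, context) shape follows the usual AWS Lambda convention for Python.

```python
from datetime import date, timedelta

# Hypothetical stand-in for the users table; a real job would query the database.
USERS = [
    {"email": "a@example.com", "trial_ends": date(2017, 1, 10)},
    {"email": "b@example.com", "trial_ends": date(2017, 1, 12)},
    {"email": "c@example.com", "trial_ends": date(2017, 2, 1)},
]


def users_ending_trial(users, today, window_days=3):
    """Return users whose trial ends between today and today + window_days, inclusive."""
    cutoff = today + timedelta(days=window_days)
    return [u for u in users if today <= u["trial_ends"] <= cutoff]


def handler(event, context):
    """Entry point that a daily schedule would invoke once per run."""
    # The "today" override is an assumption that makes the sketch easy to test.
    today = date.fromisoformat(event["today"]) if "today" in event else date.today()
    expiring = users_ending_trial(USERS, today)
    # A real job would push these users to customer.io here; this sketch only reports them.
    return {"count": len(expiring), "emails": [u["email"] for u in expiring]}
```

Calling handler({"today": "2017-01-09"}, None) selects the first two users and skips the third, whose trial ends outside the window.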


The Algorithms Behind Probabilistic Programming

We recently introduced our report on probabilistic programming. The accompanying prototype allows you to explore the past and future of the New York residential real estate market.

This post gives a feel for the content in our report by introducing the algorithms and technology that make probabilistic programming possible. We’ll dive even deeper into these algorithms in conversation with the Stan Group on Tuesday, February 7 at 1 pm ET / 10 am PT. Please join us!


How both TCP and Ethernet checksums fail

At Twitter, a team had an unusual failure where corrupt data ended up in memcache. The root cause appears to have been a switch that was corrupting packets. Most packets were being dropped and the throughput was much lower than normal, but some were still making it through. The hypothesis is that occasionally the corrupt packets had valid TCP and Ethernet checksums. One “lucky” packet stored corrupt data in memcache. Even after the switch was replaced, the errors continued until the cache was cleared. [Update 2016-02-12: Root cause found: this also involved a kernel bug!]

I was very excited to hear about this error, because it is a real-world example of something I wrote about seven years ago: The TCP checksum is weak. However, the Ethernet CRC is strong, so how could a corrupt packet pass both checks? The answer is that the Ethernet CRC is recalculated by switches. If the switch corrupts the packet and it has the same TCP checksum, the hardware blindly recalculates a new, valid Ethernet CRC when it goes out.
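
To make the weakness concrete, here is a small Python sketch (not taken from the original article) comparing the 16-bit ones' complement checksum used by TCP with a 32-bit CRC. The sample bytes and the two corruptions are made up for illustration, and zlib.crc32 computes the same CRC-32 polynomial that Ethernet uses, without any of the Ethernet framing details.

```python
import struct
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum, as used by TCP, UDP and IP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


original = b"\x12\x34\x56\x78\x9a\xbc"
# Corruption 1: the first two 16-bit words are swapped.
swapped = b"\x56\x78\x12\x34\x9a\xbc"
# Corruption 2: one word is incremented and another decremented by the same amount.
offset = b"\x12\x35\x56\x77\x9a\xbc"

# The checksum is a plain sum of words, so both corruptions go unnoticed.
same_sum_swapped = internet_checksum(original) == internet_checksum(swapped)
same_sum_offset = internet_checksum(original) == internet_checksum(offset)

# The CRC depends on bit positions, so both corruptions change it.
same_crc_swapped = zlib.crc32(original) == zlib.crc32(swapped)
same_crc_offset = zlib.crc32(original) == zlib.crc32(offset)
```

Because the checksum is just a sum, any reordering of 16-bit words or any set of changes that cancel out leaves it unchanged, while the CRC catches both.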

As Mark Callaghan pointed out, this is a very rare scenario and you should never blame the network without strong evidence. However, it isn’t impossible and others have written about similar incidents. My conclusion is that if you are creating a new network protocol, please append a 4-byte CRC (I suggest CRC32C, implemented in hardware on recent Intel, AMD, and ARM CPUs). An alternative is to use an encryption protocol (e.g. TLS), since such protocols include cryptographic hashes (which fixed a similar incident).
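
A minimal sketch of that advice in Python, under stated assumptions: the frame and unframe helpers and the message layout (payload followed by a big-endian 4-byte trailer) are hypothetical, and zlib.crc32 is used as a stand-in because CRC32C is not in the Python standard library. A real protocol following the suggestion above would swap in a CRC32C implementation.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC of the payload (hypothetical layout)."""
    return payload + struct.pack("!I", zlib.crc32(payload))


def unframe(message: bytes) -> bytes:
    """Verify and strip the trailing CRC; raise ValueError if it does not match."""
    if len(message) < 4:
        raise ValueError("message too short to contain a CRC")
    payload = message[:-4]
    (expected,) = struct.unpack("!I", message[-4:])
    if zlib.crc32(payload) != expected:
        raise ValueError("CRC mismatch: payload is corrupt")
    return payload
```

The receiver calls unframe on every message and treats a ValueError as a corrupt packet, so corruption that slips past the TCP checksum is rejected before it can be stored.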

The rest of this article describes the details about how this is possible, mostly so I don’t forget them.


Serverless at re:Invent 2016 – Wrap-up

The re:Invent 2016 conference was an exciting week to be working on serverless at AWS. We announced new features like support for C# and dead letter queues, and launched new application constructs with Lambda such as Lambda@Edge, AWS Greengrass, Amazon Lex, and AWS Step Functions. We also added support for surfacing services built using API Gateway in the AWS Marketplace, expanded the capabilities for custom authorizers, and launched a reference developer portal for managing APIs. Catch up on all the great re:Invent launches here.

In addition to the serverless mini-con with deep dive talks and best practices, we also had deep customer talks by folks from Thomson Reuters, Vevo, Expedia, and FINRA. If you weren’t able to attend the mini-con or missed a specific session, here is a quick link to the entire Serverless Mini Conference Playlist. Other interesting sessions from other tracks are listed below.

Individual Sessions from the Mini Conference

Other Interesting Sessions


Tutorial on Deep Learning

This series of talks is part of the Foundations of Machine Learning Boot Camp. Videos for each talk will be available through the links below.

Lecture 1: Tutorial on Deep Learning I
Lecture 2: Tutorial on Deep Learning II 
Lecture 3: Tutorial on Deep Learning III
Lecture 4: Tutorial on Deep Learning IV 


Five Reasons to Consider Amazon API Gateway for Your Next Microservices Project

APIs have become an integral part of application design, and architects and developers spend significant time designing the API tier. Netflix, one of the early adopters of polyglot services and APIs, shared some of the advantages of implementing an API layer in their services architecture. Chris Richardson, the founder of the original Cloud Foundry and an expert in microservices, articulated the importance of the API gateway pattern. According to Chris, not only does the API gateway optimize communication between clients and the application, but it also encapsulates the details of the microservices.

Before implementing an API gateway:


After implementing an API gateway:



Get Started with Amazon Elasticsearch Service: How Many Data Instances Do I Need?

Welcome to the first in a series of blog posts about Elasticsearch and Amazon Elasticsearch Service, where we will provide the information you need to get started with Elasticsearch on AWS.

How many instances will you need?
When you create an Amazon Elasticsearch Service domain, this is one of the first questions to answer.


How to Become a Data Scientist – On your own

Big Data, Data Sciences, and Predictive Analytics are the talk of the town, and it doesn’t matter which town you are referring to: it’s everywhere, from the White House hiring DJ Patil as the first chief data scientist to the United Nations using predictive analytics to forecast bombings on schools. Dozens of startups spring up every month, stretching the human imagination of how the underlying technologies can be used to improve our lives and everything we do. Data science is in demand and its growth is on steroids. According to LinkedIn, “Statistical Analysis” and “Data Mining” are the two top skills to get hired this year. Gartner says there are 4.4 million jobs for data scientists (and related titles) worldwide in 2015, 1.9 million in the US alone. One data science job creates another three non-IT jobs, so we are talking about some 13 million jobs altogether. The question is what YOU can do to secure a job and make your dreams come true, and how YOU can become someone who would qualify for these 4.4 million jobs worldwide.

There are at least 50 data science degree programs offered by universities worldwide; they cost from 50,000 to 270,000 US$ and take 1 to 4 years of your life. A degree might be a good option if you are looking to join college soon, and it has its own benefits over other programs in similar or not-so-similar disciplines. I find these programs very expensive for people from developing countries, or for working professionals who would have to commit X years of their lives.

Then there are a few very good summer programs, fellowships and boot camps that promise to make you a data scientist in a very short span of time. Some of them are free but almost impossible to get into, others require a PhD or advanced degree, and some cost between 15,000 and 25,000 US$ for 2 months or so. While these are very good options for recent PhD graduates to gain some real industry experience, we have yet to see their quality and performance against a veteran industry analyst. A few of the ones that I really like are Data Incubator, Insight Fellowship, Metis Bootcamp, Data Science for Social Good and the famous Zipfian Academy programs.

Let me also mention a few paid resources that I am a fan of before I tell you how to do all that for free. The first is the Explore Data Science program by Booz Allen; it costs $1,250 but is worth every penny. The second is a set of recorded lectures by Tim Chartier on DVD, called Big Data: How Data Analytics Is Transforming the World; it costs 80 bucks and is worth your investment. Next on the list are two courses by MIT: Tackling the Big Data Challenges, which costs $500 and provides you with a very solid theoretical foundation in big data, and The Analytics Edge, which costs only 100 bucks and gives a superb introduction to how analytics can be used to solve day-to-day business problems. If you can spare a few hours a day, Udacity offers a perfect Nanodegree for Data Analysts that costs $200/month and can be completed in 6 months or so; they offer this in partnership with Facebook, Zipfian Academy, and MongoDB. Thinkful has a wonderful program for $500/month that connects you live with a mentor to guide you toward becoming a data scientist.

Ok, so what can one do to become a data scientist if he/she cannot afford, or cannot get selected for, the aforementioned competitive and expensive programs? What can someone from a developing country do to improve his/her chances of getting hired in this very important field, or even to use these advanced skills to improve their own surroundings, communities and countries?