How to Become a Data Scientist – On your own

Big Data, Data Sciences, and Predictive Analytics are the talk of the town and it doesn’t matter which town you are referring to, it’s everywhere, from the White House hiring DJ Patil as the first chief data scientist to the United Nations using predictive analytics to forecast bombings on schools. There are dozens of Startups springing out every month stretching human imagination of how the underlying technologies can be used to improve our lives and everything we do. Data science is in demand and its growth is on steroids. According to Linkedin, “Statistical Analysis” and “Data Mining” are two top-most skills to get hired this year. Gartner says there are 4.4 million jobs for data scientists (and related titles) worldwide in 2015, 1.9 million in the US alone.  One data science job creates another three non-IT jobs, so we are talking about some 13 million jobs altogether. The question is what YOU can do to secure a job and make your dreams come true, and how YOU can become someone that would qualify for these 4.4 million jobs worldwide.

There are at least 50 data science degree programs by universities worldwide offering diplomas in this discipline, it costs from 50,000 to 270,000 US$ and takes 1 to 4 years of your life. It might be a good option if you are looking to join college soon, and it has its own benefits over other programs in similar or not-to-so similar disciplines. I find these programs very expensive for the people from developing countries or working professionals to commit X years of their lives.

Then there are few very good summer programs, fellowships and boot camps that promise you to make a data scientists in very short span of time, some of them are free but almost impossible to get in, while other requires a PhD or advanced degree, and some would cost between 15,000 to 25,000 US$ for 2 months or so. While these are very good options for recent Ph.D. graduates to gain some real industry experience, we have yet to see their quality and performance against a veteran industry analyst. Few of the ones that I really like are Data Incubator, Insight Fellowship,  Metis Bootcamp, Data Science for Social Goods and the famous Zipfian Academy programs.

Let me also mention few paid resources that I am a fan of before I tell you how to do all that for free. First one is the Explore Data Science program by Booz Allen, it costs 1,250 $ but worth a single penny. Second one is recorded lectures by Tim Chartier on DVD, called Big Data: How Data Analytics is transforming the world, it costs 80 bucks and worth your investment. The next in the list are two courses by MIT, Tackling the Big Data Challenges, that costs 500$ and provides you a very solid theoretical foundation on big data, and The Analytics Edge, that costs only 100 bucks and gives a superb introduction on how the analytics can be used to solve day-to-day business problems. If you can spare few hours a day then Udacity offers a perfect Nanodegree for Data Analysts that costs 200$/month can be completed in 6 months or so, they offer this in partnership with Facebook, Zipfian Academy, and MongoDB. ThinkFul has a wonderful program for 500$/month to connect you live with a mentor to guide you to become a data scientist.

Ok, so what one can do to become a data scientist if he/she cannot afford or get selected in the aforementioned competitive and expensive programs. What someone from a developing country can do to improve his/her chances of getting hired in this very important field or even try to use these advanced skills to improve their own surroundings, communities and countries.

http://www.datasciencecentral.com/profiles/blogs/how-to-become-a-data-scientist-for-free

AMP + Progressive Web Apps: Start fast, stay engaged – Google I/O 2016

Alex Russell on AMP + Progressive Web Apps: Start fast, stay engaged.

AMP delivers outstanding page-load performance for users browsing content on the mobile web, which is hugely important on limited or flaky networks. AMP gets content in front of users fast.

Progressive Web Apps deliver reliable performance for re-visits to sites thanks to Service Workers and the App Shell architecture. This technique allows sites to deliver rich experiences without worrying about networks.

Until now, however, these approaches for accelerating the mobile web have appeared to be in conflict. What if it were possible to use them in conjunction to deliver fast initial loading and reliable second-visit performance, as well as advanced features like offline reading and richer UI treatment?

Come learn about how to make AMP-based PWAs and hear about how this architecture is working for real-world publishers today.

https://developers.google.com/web/shows/google-io/2016/amp-progressive-web-apps-start-fast-stay-engaged-google-io-2016

How Discord Stores Billions of Messages

Discord continues to grow faster than we expected and so does our user-generated content. With more users comes more chat messages. In July, we announced 40 million messages a day, in December we announced 100 million, and as of this blog post we are well past 120 million. We decided early on to store all chat history forever so users can come back at any time and have their data available on any device. This is a lot of data that is ever increasing in velocity, size, and must remain available. How do we do it? Cassandra!

https://blog.discordapp.com/how-discord-stores-billions-of-messages-7fa6ec7ee4c7#.9ndho7gkx

Reindexing Data with Elasticsearch

Sooner or later, you’ll run into a problem of reindexing the data of your Elasticsearch instances. When we do Elasticsearch consulting for clients we always look at whether they have some way to efficiently reindex previously indexed data. The reasons for reindexing vary – from data type changes, analysis changes, to introduction of new fields that that need to be populated. No matter the case, you may either reindex from your source of truth or treat your Elasticsearch instance as such. Up to Elasticsearch 2.3 we had to use external tools to help us with this operation, like Logstash or stream2es. We even wrote about how to approach reindexing of data with Logstash. However, today we would like to look at the new functionality that will be added to Elasticsearch 2.3 – the re-index API.

The pre-requisites are quite low – you only need Elasticsearch 2.3 (not yet officially released as of this writing) and you need to be able to run a command on it. And that’s it, nothing more is needed and Elasticsearch will do the rest for us.

https://sematext.com/blog/2016/03/21/reindexing-data-with-elasticsearch/

Caching at Reddit

Performance matters. One of the first tools we as developers reach for when looking to get more performance out of a system is caching. As Reddit has grown in users and response times have improved, the amount of caching has grown to be quite large as well.

In this post we’ll talk about some of the nuts-and-bolts numbers of Reddit’s caching infrastructure—the number of instances, size of instances, and overall throughput. We hope that sharing this information may help others gauge what type of performance and sizing they can expect when building similar clusters. At the very least, we hope you’ll find it interesting to see a bit more about how Reddit works under the hood.

We’ll also go over the Reddit-specific type of work our caches do, how we use mcrouter to manage our caches more effectively, and the custom monitoring (MemcachedSlabCollector and mcsauna) we’ve written to help us understand what’s going on behind the scenes. We’ll also talk about some of the more subtle issues that we’ve run into when deploying changes to our caches.

Caching at Reddit

Dismissing Python Garbage Collection at Instagram

By dismissing the Python garbage collection (GC) mechanism, which reclaims memory by collecting and freeing unused data, Instagram can run 10% more efficiently. Yes, you heard it right! By disabling GC, we can reduce the memory footprint and improve the CPU LLC cache hit ratio. If you’re interested in knowing why, buckle up!

https://engineering.instagram.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172#.mf5lv4rkd