On Machine Learning

“Recently, I wrote how we do classification at CB Insights. The post outlines some of the things that I have been thinking about how to apply machine learning for a given problem along with the process that we adopted for the classification problem at CB Insights, but also gave me a good opportunity to reflect even further about the machine learning process; shortcomings of papers, books and even traditional education system when it comes to teach the machine learning.

My aim is not to focus on the algorithms, methods or classifiers but rather to offer a broader picture on how to approach a machine learning problem, and in the meantime give couple of bad advices. I will offer my bad advice for a classification problem(algorithm=classifier) and be warned that they may generalize better than your favorite classifier.(I will try not to overfit, but let me know if I do so in the comments.)…”



Understanding Paxos

“The first time I heard of the Paxos algorithm was during my bachelor’s degree way back in 2004, when I participated in a Distributed Algorithms course. In the past few years Paxos came up multiple times, usually in the context of a robust implementation of some scalable storage system. It is almost always uttered in awe, as to say “that unbelievably complex algorithm, the Paxos. Beware!”. I’ve decided to reread the original paper, and try to explain what Paxos does, and how it does it. In a nutshell, Paxos solves the problem of resiliently replicating an increasingly growing (ordered) list of items…”


“The previous post gave a general overview of the Paxos algorithm. Here’s a quick recap: Paxos implements a resilient distributed log, such that items can be added and each item is assigned a unique (and increasing) index. The algorithm can be split into three main blocks: a leader election, a consensus on a single item (also called the Synod algorithm) and managing the entire log. In this post I want to go into more depth about the Synod algorithm…”



Using TCP keepalive with Go

“If you have ever written some TCP socket code, you may have wondered: “What will happen to my connection if the network cable is unplugged or the remote machine crashes?”.

The short answer is: nothing. The remote end of the connection won’t be able to send a FIN packet, and the local OS will not detect that the connection is lost. So it’s up to you as the developer to address this scenario…”


Writing a Simple Garbage Collector in C

“People seem to think that writing a garbage collector is really hard, a deep magic understood by a few great sages and Hans Boehm (et al). Well it’s not. In fact, it’s rather straight forward. I claim that the hardest part in writing a GC is writing the memory allocator, which is as hard to write as it is to look up the malloc example in K&R.

A few important things to note before we begin. First, our code will be dependent on the Linux kernel. Not GNU/Linux, but the Linux kernel. Secondly, our code will be 32-bit and not one bit more. Thirdly. Please don’t use this code. I did not intend for it to be wholly correct and there may be subtle bugs I did not catch. Regardless, the ideas themselves are still correct. Now, let’s get started…”


Little Book of R for Time Series

“By Avril Coghlan, Parasite Genomics Group, Wellcome Trust Sanger Institute, Cambridge, U.K. Email: alc@sanger.ac.uk

This is a simple introduction to time series analysis using the R statistics software.

There is a pdf version of this booklet available at https://media.readthedocs.org/pdf/a-little-book-of-r-for-time-series/latest/a-little-book-of-r-for-time-series.pdf.

If you like this booklet, you may also like to check out my booklet on using R for biomedical statistics, http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/, and my booklet on using R for multivariate analysis, http://little-book-of-r-for-multivariate-analysis.readthedocs.org/…”


Why Racket? Why Lisp?

“In prac­ti­cal pro­gram­ming projects, Lisps are rare, and Racket es­pe­cially so. Thus, be­fore I em­barked on my Lisp ad­ven­ture, I wanted to un­der­stand the costs & ben­e­fits of us­ing a Lisp. Why do Lisps have such a great rep­u­ta­tion, but so few users? Was I see­ing some­thing every­one else missed? Or did they know some­thing I didn’t? To find out, I read what­ever I could find about Lisps, in­clud­ing Paul Gra­ham’s Hack­ers & Painters and Pe­ter Seibel’s Prac­ti­cal Com­mon Lisp…”


The S in REST

“REST is a vast improvement over complex things like SOAP and CORBA, but I think we still have a way to go before we’ve reached simple. REST is an acronym for REpresentational State Transfer, and I think the “state” part of that acronym gives rise to a lot of incidental complexity as systems grow.

You can think of state as a combination of value and time, and in the RESTful case, the time dimension is almost always “now”. The trouble then comes the absence of a coordinated notion of time.

Almost every program I write today depends on at least one RESTful service, and my program is just one component in an ensemble. As we develop systems that call systems that call systems, what are the odds that everybody participating in the ensemble has the same notion of time? What happens when systems have different notions of time? When we are talking about services in the small, it’s not a much of a concern, but for networked systems in the large, conflating value and time into state makes our systems increasingly difficult to reason about…”