spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython. Independent research has found spaCy’s syntactic parser to be the fastest of the systems evaluated. If your application needs to process entire web dumps, spaCy is the library you want to be using.
Scikit-learn with plotting.
Scikit-plot is the result of an unartistic data scientist’s dreadful realization that visualization is one of the most crucial components in the data science process, not just a mere afterthought.
Gaining insights is simply a lot easier when you’re looking at a colored heatmap of a confusion matrix complete with class labels rather than a single-line dump of numbers enclosed in brackets. Besides, if you ever need to present your results to someone (virtually any time anybody hires you to do data science), you show them visualizations, not a bunch of numbers in Excel.
That said, there are a number of visualizations that frequently pop up in machine learning. Scikit-plot is a humble attempt to provide aesthetically-challenged programmers (such as myself) the opportunity to generate quick and beautiful graphs and plots with as little boilerplate as possible.
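To make the "labels versus a bracketed dump" point concrete, here is a minimal sketch in plain Python. It does not use Scikit-plot's API; the helper names (`confusion_matrix`, `format_matrix`) and the toy labels are made up for illustration. It builds the counts that a confusion-matrix heatmap would color and prints them both ways.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Count predictions; rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def format_matrix(matrix, labels):
    """Render the matrix as a text table with class labels on both axes."""
    width = max(len(label) for label in labels) + 2
    header = " " * width + "".join(label.rjust(width) for label in labels)
    rows = [
        label.ljust(width) + "".join(str(n).rjust(width) for n in row)
        for label, row in zip(labels, matrix)
    ]
    return "\n".join([header] + rows)


# Toy data, invented for this example.
labels = ["cat", "dog", "bird"]
y_true = ["cat", "cat", "dog", "dog", "bird", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat", "bird"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)                         # the "numbers enclosed in brackets" view
print(format_matrix(matrix, labels))  # the same counts with class labels
```

Even as plain text, the labeled version shows at a glance which classes get confused with which; a colored heatmap takes that one step further.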
- Example Machine Learning – Notebook by Randal S. Olson, supported by Jason H. Moore. University of Pennsylvania Institute for Bioinformatics
- Python Machine Learning Book – 400 pages rich in useful material: just about everything you need to know to get started with machine learning, from theory to actual code that you can put directly into action.
- Learn Data Science – The initial beta release consists of four major topics: Linear Regression, Logistic Regression, Random Forests, and K-Means Clustering.
- Scikit-learn Tutorial – By Jake VanderPlas, University of Washington.
- Machine Learning – This repo contains a collection of IPython notebooks detailing various machine learning algorithms. In general, the mathematics follows that presented in Dr. Andrew Ng’s Machine Learning course taught at Stanford University (materials available from iTunes U, Stanford Machine Learning), Dr. Tom Mitchell’s course at Carnegie Mellon, and Christopher M. Bishop’s “Pattern Recognition And Machine Learning”.
- Research Computing Meetup – Linux and Python for data analysis (tutorials). University of Colorado, Computational Science and Engineering.
- Theano Tutorial – A brief IPython notebook-based tutorial on basic Theano concepts, including a toy multi-layer perceptron example.
- IPython Theano Tutorials – A collection of tutorials in ipynb format that illustrate how to do various things in Theano.
- IPython Notebooks – Demonstrations and use cases for many of the most widely used “data science” Python libraries. Implementations of the exercises presented in Andrew Ng’s “Machine Learning” class on Coursera. Implementations of the assignments from Google’s Udacity course on deep learning.
- ISLR Python – This repository contains Python code for a selection of tables, figures and LAB sections from the book ‘An Introduction to Statistical Learning with Applications in R’ by James, Witten, Hastie, Tibshirani (2013).
- Graphing Data with IPython Notebook – Graphing bike path data with IPython Notebook and pandas.
Kaggle released a series of tutorials on their blog. I recommend it to anyone who is starting out or wants to learn more about the tool.
How we use natural language processing to qualify leads
In this blog post I’ll explain how we’re making our sales process at Xeneta more effective by training a machine learning algorithm to predict the quality of our leads based upon their company descriptions.
Head over to GitHub if you want to check out the script immediately, and feel free to suggest improvements as it’s under continuous development.
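For a rough feel of the idea before opening the script, here is a minimal sketch of classifying leads from their descriptions. It is not the script on GitHub: it assumes a bag-of-words naive Bayes model with add-one smoothing, and the class name, the "good"/"bad" labels and the training descriptions are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesLeadClassifier:
    """Hypothetical multinomial naive Bayes over company descriptions."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of descriptions
        self.vocabulary = set()

    def fit(self, descriptions, labels):
        for text, label in zip(descriptions, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label in self.doc_counts:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training data: descriptions a sales rep has already qualified.
descriptions = [
    "Global freight forwarder shipping containers by sea",
    "Logistics company handling ocean freight and container shipping",
    "Importer of furniture shipping goods from Asia",
    "Local bakery selling bread and pastries",
    "Law firm offering legal advice to startups",
    "Mobile game studio building casual games",
]
labels = ["good", "good", "good", "bad", "bad", "bad"]

model = NaiveBayesLeadClassifier().fit(descriptions, labels)
print(model.predict("Freight company shipping containers worldwide"))  # good
print(model.predict("Studio building mobile games"))                   # bad
```

With only six examples this is a toy; a real setup would need many more labeled descriptions and a held-out set to check how often the predictions are right.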
It started with a request from business development representative Edvard, who was tired of performing the tedious task of going through big Excel sheets filled with company names, trying to identify which ones we ought to contact.
This kind of pre-qualification of sales leads can take hours, as it forces the sales representative to figure out what every single company does (e.g. by reading about them on LinkedIn) so that he/she can make a qualified guess at whether or not the company is a good fit for our SaaS app.
And how do you make a qualified guess?