Text Data Preprocessing: A Walkthrough in Python

In a pair of previous posts, we first discussed a framework for approaching textual data science tasks, and followed that up with a discussion on a general approach to preprocessing text data. This post will serve as a practical walkthrough of a text data preprocessing task using some common Python tools.

Data preprocessing
Preprocessing, in the context of the textual data science framework.

Our goal is to go from what we will describe as a chunk of text (not to be confused with text chunking), meaning a lengthy, unprocessed single string, to a list (or several lists) of cleaned tokens that would be useful for further text mining and/or natural language processing tasks.
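
To make that goal concrete, here is a minimal sketch using only the Python standard library. The function name preprocess, the sample string, and the specific cleaning steps (lowercasing, stripping punctuation, dropping purely numeric tokens) are assumptions chosen for illustration; a real pipeline would likely lean on a dedicated NLP library and more careful rules.

```python
import string


def preprocess(chunk):
    """Turn one raw string into a list of cleaned, lowercase tokens."""
    # Lowercase so that "Text" and "text" end up as the same token.
    text = chunk.lower()
    # Replace every ASCII punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Split on whitespace to get candidate tokens.
    tokens = text.split()
    # Drop tokens made up only of digits (an illustrative choice).
    return [t for t in tokens if not t.isdigit()]


# Hypothetical sample input, not taken from the original post.
sample = "Hello, World! This is a chunk of text: 3 sentences, 1 string."
tokens = preprocess(sample)
print(tokens)
```

Whether to drop numbers, keep case, or remove stop words depends on the downstream task, so each step here should be treated as optional.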



The amazing power of word vectors

For today’s post, I’ve drawn material not just from one paper, but from five! The subject matter is ‘word2vec’ – the work of Mikolov et al. at Google on efficient vector representations of words (and what you can do with them). The papers are:

From the first of these papers (‘Efficient estimation…’) we get a description of the Continuous Bag-of-Words and Continuous Skip-gram models for learning word vectors (we’ll talk about what a word vector is in a moment…). From the second paper we get more illustrations of the power of word vectors, some additional information on optimisations for the skip-gram model (hierarchical softmax and negative sampling), and a discussion of applying word vectors to phrases. The third paper (‘Linguistic Regularities…’) describes vector-oriented reasoning based on word vectors and introduces the famous “King – Man + Woman = Queen” example. The last two papers give a more detailed explanation of some of the very concisely expressed ideas in the Mikolov papers.
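
The “King – Man + Woman = Queen” idea can be illustrated with a toy sketch. The three-dimensional vectors below are made up by hand for this example and are not trained word2vec embeddings; the helper names cosine and analogy are likewise hypothetical. The point is only to show the arithmetic: subtract one vector, add another, then look for the nearest remaining word by cosine similarity.

```python
import math

# Hand-made toy vectors (NOT trained embeddings). The dimensions loosely
# stand for (royalty, maleness, femaleness).
vectors = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy("king", "man", "woman")
print(result)
```

With real embeddings the vocabulary has many thousands of words and the match is approximate rather than exact, but the lookup follows the same pattern.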

Check out the word2vec implementation on Google Code.


We are publishing pre-trained word vectors for 90 languages, trained on Wikipedia. These are vectors in dimension 300, trained with the default parameters of fastText.


Opinion Mining

Extraction of opinions from free text

There’s a lot of buzz around the term “Sentiment Analysis” and the various ways of doing it. Great! So you report with reasonable accuracies what the sentiment about a particular brand or product is.

Opinion Mining and Sentiment Analysis

After publishing this report, your client comes back to you and says “Hey this is good. Now can you tell me ways in which I can convert the negative sentiments into positive sentiments?” – Sentiment Analysis stops there and we enter the realms of Opinion Mining. Opinion Mining is about having a deeper understanding of the review that was written. Typically, a detailed review will not just have a sentiment attached to it. It will have information and valuable feedback that can literally help to build the next strategy. Over time, some powerful methods have been developed using Natural Language Processing and computational linguistics to extract these subjective opinions.

Opinion Mining

In this blog we will study the stepping stone to Opinion Mining – grammatically tagging a sentence. It will help us break a sentence down into its underlying grammatical structure – nouns, verbs, adjectives etc. that will help us associate what was said about what. Once we are capable enough to do that, we can extract useful opinions that will help us answer the question posed by our client above.
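
As a rough illustration of why grammatical tags help associate “what was said about what”, here is a toy sketch. The lexicon below is a hand-written, hypothetical stand-in for a real part-of-speech tagger (which would use context rather than a lookup table), and the pairing rule, attaching each adjective to the most recent noun, is a deliberate simplification.

```python
# Hypothetical toy lexicon standing in for a trained part-of-speech tagger.
LEXICON = {
    "the": "DET",
    "is": "VERB",
    "but": "CONJ",
    "battery": "NOUN",
    "screen": "NOUN",
    "great": "ADJ",
    "dim": "ADJ",
}


def tag(sentence):
    """Attach a tag to each word; unknown words are tagged 'UNK'."""
    return [(w, LEXICON.get(w, "UNK")) for w in sentence.lower().split()]


def opinions(tagged):
    """Pair each adjective with the most recent noun seen before it."""
    pairs = []
    last_noun = None
    for word, pos in tagged:
        if pos == "NOUN":
            last_noun = word
        elif pos == "ADJ" and last_noun is not None:
            pairs.append((last_noun, word))
    return pairs


# Made-up review sentence for illustration.
tagged = tag("The battery is great but the screen is dim")
pairs = opinions(tagged)
print(pairs)
```

Even this crude rule separates the praise for one feature from the complaint about another, which a single sentence-level sentiment score would blur together.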


Boosting Sales With Machine Learning

How we use natural language processing to qualify leads

In this blog post I’ll explain how we’re making our sales process at Xeneta more effective by training a machine learning algorithm to predict the quality of our leads based upon their company descriptions.

Head over to GitHub if you want to check out the script immediately, and feel free to suggest improvements as it’s under continuous development.
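
The author's actual script lives on GitHub. Purely as an illustration of the general idea of classifying leads from their descriptions, here is a small naive Bayes sketch in plain Python. The algorithm choice, the function names train and predict, the labels, and the four training descriptions are all assumptions made up for this example; they may not match what the original script does.

```python
import math
from collections import Counter


def train(examples):
    """Count words per label from (description, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        n = sum(word_counts[label].values())
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data; real descriptions and labels would come from past leads.
examples = [
    ("global freight forwarding and container shipping", "good"),
    ("ocean shipping logistics for importers", "good"),
    ("family bakery selling bread and cakes", "bad"),
    ("local dental clinic and orthodontics", "bad"),
]
model = train(examples)
print(predict(model, "container shipping and logistics"))
print(predict(model, "dental clinic"))
```

A production version would need far more labelled examples and proper text cleaning before a model like this could be trusted.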

The problem

It started with a request from business development representative Edvard, who was tired of performing the tedious task of going through big Excel sheets filled with company names, trying to identify which ones we ought to contact.

An example of a list of potential companies to contact, pulled from sec.gov

This kind of pre-qualification of sales leads can take hours, as it forces the sales representative to figure out what every single company does (e.g. by reading about them on LinkedIn) so that he/she can make a qualified guess at whether or not the company is a good fit for our SaaS app.

And how do you make a qualified guess?


Sentiment analysis in Python using NLTK

“NLTK, an external library for Python, makes it incredibly easy to utilise natural language processing techniques with only a few lines of code. This, of course, has great use in the fields of predictive modelling, sentiment analysis, speech recognition and more. What follows is a walkthrough of a basic sentiment analyser, written over the course of an evening, with much credit being given to Laurent Luce’s excellent walkthrough that got me up to speed on this stuff very quickly. My analyser, due to scale, doesn’t use a database, but instead reads from a list of positive tweets (postweets.txt) and negative tweets (negtweets.txt) that were created by me. They are as follows:…”