Sarcasm Detection with Machine Learning in Spark

This post is inspired by a site I found whilst searching for a way to detect sarcasm within sentences. As humans we sometimes struggle detecting sarcasm when we have a lot more contextual information available to us. People are emotive when they speak, they use certain tones and these traits can help us understand when someone is being sarcastic. However we don’t always catch it! So how the hell could a computer detect this, when all it has is text.

Well one way is to Machine Learning. I wondered if I could set up a machine learning model that could accurately predict sarcasm (or accurately enough for it to be effective). This search led me to the above link site where the author Mathieu Cliche cleverly came up with the idea of using tweets as the training set.

As with any machine learning algorithm, its level of accuracy is only as good as the training data it’s provided. Create a large catalog of sarcastic sentences could be rather challenging. However searching for tweets that contain the hastag #sarcasm or #sarcastic would provide me with a vast amount of training data (providing a good percentage of those tweets are actually sarcstic).

Using that approach as the basis, I developed a Spark application using the MlLib api that would use the Naive Bayes classifier to detect sarcasm in sentences – This post will cover the basics and I will be expanding on this next time to utilise sarcastic tweets to train my model.

Predict Social Network Influence with R and H2O Ensemble Learning

“H2O is an awesome machine learning framework. It is really great for data scientists and business analysts “who need scalable and fast machine learning”. H2O is completely open source and what makes it important is that works right of the box. There seems to be no easier way to start with scalable machine learning. It hast support for R, Python, Scala, Java and also has a REST API and a own WebUI. So you can use it perfectly for research but also in production environments.

H2O is based on Apache Hadoop and Apache Spark which gives it enormous power with in-memory parallel processing…”

Introducing streaming k-means in Spark 1.2

“A key advantage of Spark is that its machine learning library (MLlib) and its library for stream processing (Spark Streaming) are built on the same core architecture for distributed analytics. This facilitates adding extensions that leverage and combine components in novel ways without reinventing the wheel. We have been developing a family of streaming machine learning algorithms in Spark within MLlib. In this post we describe streaming k-means clustering, included in the recently released Spark 1.2…”