Just how data-hungry is deep learning? It is an important question for those of us who don’t have an ocean of data from somewhere like Google or Facebook and still want to see what this deep learning thing is all about. If you have a moderate amount of your own data and your fancy new model gets mediocre performance, it is often hard to tell whether the fault is in your model architecture or in the amount of data that you have. Learning curves and other techniques for diagnosing training-in-progress can help, and much ink has been spilled offering guidance to young deep learners. We wanted to add to this an empirical case study in the tradeoff between data size and model performance for sentiment analysis.
We asked that question ourselves in the course of our work on sunny-side-up, a project assessing deep learning techniques for sentiment analysis (check out our post on learning about deep learning, which also introduces the project). Most real-world text corpora have orders of magnitude fewer documents than, for instance, the popular Amazon Reviews dataset. Even one of the stalwart benchmark datasets for sentiment analysis, IMDB Movie Reviews, has a “mere” tens of thousands of reviews compared to the Amazon dataset’s millions. While deep learning methods have claimed exceptional performance on the IMDB set, some of the top performers are trained on outside datasets. If you were trying to do sentiment analysis on small collections of documents in under-resourced languages like Hausa or Aymara, then 30 million Amazon movie reviews might not be a great analogue.
To examine the effects of data size in deep learning for text, I’ll look at the performance of Zhang, Zhao and LeCun’s Crepe convolutional network architecture on differently sized subsets of the Amazon reviews set. The arXiv manuscript for Crepe claims impressive performance on (a different set of) Amazon reviews, so this is an interesting test bed for examining how much data such an algorithm might need for a sentiment task. Their paper suggests that performance degrades on datasets numbering in the hundreds of thousands of documents (which is still pretty big). But the datasets they compare differ in many ways besides size, so it is hard to know how much data size itself affects performance. Let’s look at performance on differently sized samples from the same dataset.
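For readers who want to try this on their own corpus, here is a minimal sketch of one way to draw differently sized samples from a single dataset. It is an illustration under my own assumptions, not the code used in sunny-side-up: the function name `nested_subsets`, the example sizes, and the choice to make the samples nested (each smaller sample is contained in every larger one) are all hypothetical. Nesting is just one reasonable design, since it means a larger sample differs from a smaller one only by the documents added. The sketch does not balance classes, which you may want for sentiment labels.

```python
import random


def nested_subsets(n_docs, sizes, seed=0):
    """Pick document indices for several sample sizes from one dataset.

    Hypothetical helper for illustration. Returns a dict mapping each
    requested size to a list of indices into the full dataset. Smaller
    samples are prefixes of larger ones, so the samples are nested.
    """
    if not sizes:
        raise ValueError("sizes must not be empty")
    if min(sizes) < 1 or max(sizes) > n_docs:
        raise ValueError("each size must be between 1 and n_docs")

    # Shuffle once with a fixed seed so the samples are reproducible.
    rng = random.Random(seed)
    order = list(range(n_docs))
    rng.shuffle(order)

    # Every sample is a prefix of the same shuffled order.
    return {size: order[:size] for size in sorted(sizes)}


if __name__ == "__main__":
    # Illustrative numbers only, not the sizes used in this post.
    samples = nested_subsets(n_docs=100_000, sizes=[1_000, 10_000, 100_000])
    for size, indices in samples.items():
        print(size, len(indices))
```

Training the same architecture on each sample and scoring every model on one shared held-out test set keeps the comparison about data size alone, which is the variable of interest here.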