An Introduction to Stock Market Data Analysis with R (Part 1)

Around September of 2016 I wrote two articles on using Python for accessing, visualizing, and evaluating trading strategies (see part 1 and part 2). These were my most popular posts until I published my article on learning programming languages (featuring my dad’s story as a programmer), and they have been translated into both Russian (at a link that no longer appears to work) and Chinese (here and here). R has excellent packages for analyzing stock data, so I feel there should be a “translation” of the post for using R for stock data analysis.

This post is the first in a two-part series on stock data analysis using R, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah. In these posts, I will discuss basics such as obtaining the data from Yahoo! Finance, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking. The final post will include practice problems. This first post discusses topics up to introducing moving averages.

NOTE: The information in this post is of a general nature containing information and opinions from the author’s perspective. None of the content of this post should be considered financial advice. Furthermore, any code written here is provided without any form of guarantee. Individuals who choose to use it do so at their own risk.


In-depth introduction to machine learning in 15 hours of expert videos

In January 2014, Stanford University professors Trevor Hastie and Rob Tibshirani (authors of the legendary Elements of Statistical Learning textbook) taught an online course based on their newest textbook, An Introduction to Statistical Learning with Applications in R (ISLR). I found it to be an excellent course in statistical learning (also known as “machine learning”), largely due to the high quality of both the textbook and the video lectures. And as an R user, it was extremely helpful that they included R code to demonstrate most of the techniques described in the book.

If you are new to machine learning (and even if you are not an R user), I highly recommend reading ISLR from cover-to-cover to gain both a theoretical and practical understanding of many important methods for regression and classification. It is available as a free PDF download from the authors’ website.

If you decide to attempt the exercises at the end of each chapter, there is a GitHub repository of solutions provided by students you can use to check your work.

As a supplement to the textbook, you may also want to watch the excellent course lecture videos (linked below), in which Dr. Hastie and Dr. Tibshirani discuss much of the material. In case you want to browse the lecture content, I’ve also linked to the PDF slides used in the videos.

11 IPython Tutorials for Data Science and Machine Learning

The 11 IPython Tutorials
  • Example Machine Learning – Notebook by Randal S. Olson, supported by Jason H. Moore. University of Pennsylvania Institute for Bioinformatics
  • Python Machine Learning Book – 400 pages of useful material covering just about everything you need to know to get started with machine learning … from theory to the actual code that you can directly put into action!
  • Learn Data Science – The initial beta release consists of four major topics: Linear Regression, Logistic Regression, Random Forests, K-Means Clustering
  • Machine Learning – This repo contains a collection of IPython notebooks detailing various machine learning algorithms. In general, the mathematics follows that presented by Dr. Andrew Ng’s Machine Learning course taught at Stanford University (materials available from iTunes U, Stanford Machine Learning), Dr. Tom Mitchell’s course at Carnegie Mellon, and Christopher M. Bishop’s “Pattern Recognition And Machine Learning”.
  • Research Computing Meetup – Linux and Python for data analysis (tutorials). University of Colorado, Computational Science and Engineering.
  • Theano Tutorial – A brief IPython notebook-based tutorial on basic Theano concepts, including a toy multi-layer perceptron example.
  • IPython Theano Tutorials – A collection of tutorials in ipynb format that illustrate how to do various things in Theano.
  • IPython Notebooks – Demonstrations and use cases for many of the most widely used “data science” Python libraries. Implementations of the exercises presented in Andrew Ng’s “Machine Learning” class on Coursera. Implementations of the assignments from Google’s Udacity course on deep learning.

MXNet – Deep Learning Framework of Choice at AWS

Machine learning is playing an increasingly important role in many areas of our businesses and our lives and is being employed in a range of computing tasks where programming explicit algorithms is infeasible.

At Amazon, machine learning has been key to many of our business processes, from recommendations to fraud detection, from inventory levels to book classification to abusive review detection. And there are many more application areas where we use machine learning extensively: search, autonomous drones, robotics in fulfillment centers, text and speech recognition, etc.

Among machine learning algorithms, a class of algorithms called deep learning has come to represent those algorithms that can absorb huge volumes of data and learn elegant and useful patterns within that data: faces inside photos, the meaning of a text, or the intent of a spoken word. A set of programming models has emerged to help developers define and train AI models with deep learning, along with open source frameworks that put deep learning in the hands of mere mortals. Some examples of popular deep learning frameworks that we support on AWS include Caffe, CNTK, MXNet, TensorFlow, Theano, and Torch.

Among all these popular frameworks, we have concluded that MXNet is the most scalable framework. We believe that the AI community would benefit from putting more effort behind MXNet. Today, we are announcing that MXNet will be our deep learning framework of choice. AWS will contribute code and improved documentation as well as invest in the ecosystem around MXNet. We will partner with other organizations to further advance MXNet.

Python is the Growing Platform for Applied Machine Learning

You should pick the right tool for the job.

The specific predictive modeling problem that you are working on should dictate the specific programming language, libraries and even machine learning algorithms to use.

But, what if you are just getting started and looking for a platform to learn and practice machine learning?

In this post, you will discover that Python is the growing platform for applied machine learning, likely to outpace and topple R in terms of adoption and perhaps capability.

After reading this post you will know:

  • That search volume for Python machine learning is growing fast and has already outpaced R.
  • That the percentage of Python machine learning jobs is growing and has already outpaced R.
  • That Python is used by nearly 50% of polled practitioners and growing.

Let’s get started.

Data Science Competitions 101: Anatomy and Approach

I recently participated in a weekend-long data science hackathon, titled ‘The Smart Recruits’. Organized by the amazing folks at Analytics Vidhya, it saw some serious competition. Although my performance can be classified as decent at best (47 out of 379 participants), it was among the more satisfying ones I have participated in on both AV (profile) and Kaggle (profile) over the last few months. Thus, I decided it might be worthwhile to try and share some insights as a data science autodidact.

Machine learning for financial prediction: experimentation with David Aronson’s latest work

One of the first books I read when I began studying the markets a few years ago was David Aronson’s Evidence Based Technical Analysis. The engineer in me was attracted to the ‘Evidence Based’ part of the title. This was soon after I had digested a trading book that claimed a basis in chaos theory, the link to which actually turned out to be non-existent. Apparently using complex-sounding terms in the title of a trading book lends some measure of credibility. Anyway, Evidence Based Technical Analysis is largely a justification of a scientific approach to trading, including a method for rigorous assessment of the presence of data mining bias in backtest results. There is also a compelling discussion based in cognitive psychology of the reasons that some traders turn away from objective methods and embrace subjective beliefs. I find this area fascinating.

Readers of this blog will know that I am very interested in using machine learning to profit from the markets. Imagine my delight when I discovered that David Aronson had co-authored a new book with Timothy Masters titled Statistically Sound Machine Learning for Algorithmic Trading of Financial Instruments – which I will herein refer to as SSML. I quickly devoured the book and have used it as a handy reference ever since. While it is intended as a companion to Aronson’s (free) software platform for strategy development, it contains numerous practical tips for any machine learning practitioner and I’ve implemented most of his ideas in R.

I used SSML to guide my early forays into machine learning for trading, and this series describes some of those early experiments. A detailed review of everything I learned from SSML and all the research it inspired would be too voluminous to relate here, so what follows is an account of some of the more significant and practical lessons I encountered along the way.

This post will focus on feature engineering and also introduce the data mining approach. The next post will focus on algorithm selection and ensemble methods for combining the predictions of numerous learners.