Introduction to Python Data Analytics: Basic Time Series Analysis
When I started at Palantir, I got a crash course in data analytics. In the process, I've realized that pure software engineering skills are just one piece of a broader set of tools needed to interact with data in sophisticated ways. For a software engineer that works with data regularly, some basic data analytics go a long way to being more effective on your team.
In this post, we'll deal with one of the simplest yet most powerful forms of data analysis: time series. Time series data is data that demonstrates how things change over time. For example, you might use time series to plot stock prices or income growth, or any other metric for which the x axis is time.
At the end of the post, I've also included a simple function you can use to analyze the comparative performance of any two stocks using existing APIs.
Today, we're going to pull a couple of datasets from Quandl, a tool that exposes stock ticker and other data for analysis. Some of their datasets are expensive, but others come for free. We'll use a couple of the free datasets today to do the analysis. To plot the data, we'll go ahead and use plotly in a Jupyter Notebook.
For instructions on installing Jupyter (a great tool for these analyses), go here.
Step 1: Ask the right question
Obviously, data analysis of this sort doesn't serve much of a purpose if we don't have the right questions in mind. Oftentimes, we may want to see how one variable is impacting another or how two variables compare over time. Below are a few of the questions we may want to understand.
How did the performance of two equities compare over time? How did rising GDP affect consumer discretionary spending? What trends can we extract by looking at a local real estate market?
Today, we're going to dig into the following: How has the growth of ecommerce stocks compared to the performance of brick-and-mortar stores over the past 10 years?
To address this, we're going to pull some stock data from Quandl.
Step 2: Find and Load your data
The first step to any analysis is loading your data into a place you can analyze it. This could take the form of a loaded excel spreadsheet, hitting an external api, loading a CSV into Pandas, or some other method of getting data into your analytic environment.
In this case, we're going to load the data directly from Quandl's python API because it's the most direct and readily reproduced. First, import the proper packages:
# numbers and dataframe processing
import pandas as pd
import numpy as np
# graphing
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode()
import cufflinks as cf
# wrappers that help loading data
import quandl
quandl.ApiConfig.api_key = 'your_quandl_api_key'
Now that we've imported all our tools, it should be easy to load the right stocks. We're going to quickly load a few historical stock prices from Quandl: Amazon and Walmart.
Using Quandl's python package, we can quickly load our data:
stocks_to_analyze = quandl.get(['WIKI/AMZN.11','WIKI/WMT.11'])
stocks_to_analyze.columns = ["Amazon", "Walmart"]
Now, we have a pandas dataframe locally with the full stock history of Amazon and Walmart. Visualized in Jupyter notebook, this looks something like this:
Step 3: Plot, Transform, Plot, Transform..
Now that we've loaded our data, we can take a crack at plotting it. Time series data has a few quirks that come into play when plotting.
No filter
For example, here's what our raw data looks like without any filtering:
stocks_to_analyze.iplot()
There are a couple problems with this:
- Our two companies went public at different times. Therefore, we have more historical data for Walmart.
- Their magnitude is completely different. We want to see their trends from a comparative perspective, so this often proves problematic.
Limiting Time Frame
To solve problem (1), we can simply take a subset of our data. Let's take a look at only data from 2000 -> present:
# write a new df
stocks_2000_to_present = stocks_to_analyze['2000-01-01':]
Now, our graph should look much better:
stocks_2000_to_present.iplot()
That's slightly better, but we still need to deal with issue (2).
Indexing Data
We can easily index data using pandas. An index is a way to measure direct changes in a value based on distance to the mean value. In time series analysis, we can use it to put Amazon and Walmart's stocks on a similar scale.
This is super easy using our basic Pandas apis:
stocks_as_indices = stocks_2000_to_present / stocks_2000_to_present.mean()
Now, let's just plot it again and see how it has improved:
stocks_as_indices.iplot()
This is a nice step in that it makes our picture a lot more clear. We can actually see that Walmart was growing during this period, although it obviously was not matching Amazon's massive growth (nor could we expect it to).
At this point, we've done a fairly nice basic plot, but let's smooth out our lines there. Currently, we're analyzing every day's worth of data. To smooth out this curve, let's apply a moving average to the graph.
Indexing Data
A moving average (AKA rolling mean) is essentially a window of data in which a group of values is used to plot the value for a given day. For example, plotting a rolling mean of 3 days would mean that today's measurement would really be an average of todays, yesterdays, and the day before.
This has an incredible effect when visualizing time series data. It serves to completely smooth our curves based on the size of the window. It's also incredibly easy in pandas.
Tip: When calculating a moving average on datasets in which you have weekly measurements, try using a window size that is a multiple of 7. This will adjust for weekly variations. For example, sales might pick up on the weekends, so you'll want to make sure you're including a full week each time.
In this case, we can calculate a 20 day (4 weeks in the stock market) moving average like so:
stocks_as_indices.rolling(window=20).mean().iplot()
The impact on our graph is immediately apparent. We have eliminated a great deal of the noise here and can much more readily assess stock performance.
This might be enough to go show someone what we've found, but we can actually take it a step farther to compare the performance of the two stocks directly.
Step 4: Direct Performance comparisons
By taking our two precomputed indices and comparing them as a ratio, we can actually get a more direct sense of the magnitude of Amazon's massive outperformance of Walmart's stock.
If the stocks performed at exactly the same level, we would expect this line to stay roughly flat. If Walmart were outperforming Amazon, it would go down, and if Amazon were outperforming Walmart, it would go up.
The results look like so:
(stocks_as_indices.Amazon / stocks_as_indices.Walmart).iplot()
Step 5: Parameterizing our work
This is great and all, but at the end of the day, is it really a surprise that Amazon (a fast-growing, volatile stock riding a wave of ecommerce) outperformed an entrenched giant like Walmart? Probably not.
It would be great to not have to do all of our work again, though, if we wanted to compare another 2 (or 10) different stocks.
One of the best parts of Python analytics, then, is the ability to quickly parameterize all of the work we just did. Below, I have included a function that will, given any n stocks found in Quandl's WIKI
dataset, index and plot the data for any date range. Feel free to use it in your projects!
def get_stocks(tickers=['AAPL', 'GOOG'], stock_display_names=None, start_date='2000-01-01', end_date=None, window=20):
quandl_stocks = []
for ticker in tickers:
quandl_stocks.append("WIKI/" + ticker + ".11")
stocks_to_analyze = quandl.get(quandl_stocks)
# display prefs
if stock_display_names is not None:
stocks_to_analyze.columns = stock_display_names
# transformations
if end_date is None:
if start_date is None:
stocks_to_analyze_transformed = (stocks_to_analyze / stocks_to_analyze.mean()).rolling(window=window).mean()
else:
stocks_to_analyze_transformed = (stocks_to_analyze[start_date:] / stocks_to_analyze[start_date:].mean()).rolling(window=window).mean()
elif start_date is None:
stocks_to_analyze_transformed = (stocks_to_analyze[:end_date] / stocks_to_analyze[:end_date].mean()).rolling(window=window).mean()
else:
stocks_to_analyze_transformed = (stocks_to_analyze[start_date:end_date] / stocks_to_analyze[start_date:end_date].mean()).rolling(window=window).mean()
return stocks_to_analyze_transformed
Testing our function
Here are just a few examples of how we might use this function:
Major Car Companies
get_stocks(['TSLA', 'GM', 'FAST', 'F', 'FCAU'], ['Tesla', 'General Motors', 'Ferrari', 'Ford', 'Fiat Chrysler'], window=50).iplot()
Motley Fool top Value Stocks of 2017
From here: https://www.fool.com/investing/2017/04/21/top-value-stocks-of-2017.aspx
get_stocks(['IBM', 'OC', 'SWK'], ['IBM', 'Owens Corning', 'Black and Decker'], window=30).iplot()
Motif Investing's Casino Gambling Motif
https://www.motifinvesting.com/motifs/casino-gambling
get_stocks(['MGM', 'CZR', 'LVS', 'WYNN', 'MLCO'], ['MGM Resorts', 'Caesars Entertainment Corporation', 'Las Vegas Sands', 'Wynn Resorts', 'Melco Resorts and Entertainment'], start_date='2010-01-01', window=30).iplot()
Concluding thoughts
Obviously, time series is just one way to evaluate a data source, but it is a great first step in making sense of your data's quality (i.e. revealing outliers, gaps, etc.) and also understanding historical trends.
Next time, we'll dig into more advanced analytic techniques such as regression and correlation!