Regression analysis using Python
This tutorial covers regression analysis using the Python StatsModels package with Quandl integration. For motivational purposes, here is what we are working towards: a regression analysis program which receives multiple dataset names from Quandl.com, automatically downloads the data, analyses it, and plots the results in a new window. Actual outputs are shown in the examples further down.
Types of Regression Analysis
Linear regression analysis fits a straight line to some data in order to capture the linear relationship between that data. The regression line is constructed by optimizing the parameters of the straight line function such that the line best fits a sample of (x, y) observations, where y is a variable dependent on the value of x. Regression analysis is used extensively in economics, risk management, and trading. One cool application of regression analysis is in calibrating certain stochastic process models such as the Ornstein-Uhlenbeck stochastic process.
Nonlinear regression analysis uses a curved function, usually a polynomial, to capture the nonlinear relationship between the two variables. The regression is often constructed by optimizing the parameters of a higher-order polynomial such that the curve best fits a sample of (x, y) observations. In the article, Ten Misconceptions about Neural Networks in Finance and Trading, it is shown that a neural network is essentially approximating a multiple nonlinear regression function between the inputs into the neural network and the outputs.
The case for linear vs. nonlinear regression analysis in finance remains open. The issue with linear models is that they often underfit and may also impose strong assumptions on the variables, while the main issue with nonlinear models is that they often overfit. Training and data-preparation techniques can be used to minimize overfitting.
A multiple linear regression analysis is used for predicting the values of a dependent variable, Y, using two or more independent variables, e.g. X1, X2, ..., Xn. For example, you could try to forecast share prices using one fundamental indicator like the PE ratio, or you could use multiple indicators together like the PE, DY, and DE ratios and the share's EPS. Interestingly, there is almost no difference between a multiple linear regression and a perceptron (also known as an artificial neuron, the building block of neural networks). Both are calculated as the weighted sum of the input vector plus some constant or bias which is used to shift the function. The only difference is that the weighted sum in the perceptron is fed into an activation function which is often nonlinear.
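The parallel can be sketched in a few lines of code. The input values, weights, and bias below are made-up numbers purely for illustration; only the structure of the two computations matters.

```python
import numpy as np

# A hypothetical 3-feature input (e.g. PE, DY, DE ratios) with illustrative
# weights and bias; the names and values are assumptions for this sketch.
x = np.array([15.0, 0.03, 0.8])   # input vector
w = np.array([0.02, 5.0, -0.5])   # learned weights
b = 0.1                           # constant / bias term

# Multiple linear regression: a weighted sum of the inputs plus a bias.
linear_output = np.dot(w, x) + b

# A perceptron computes the same weighted sum, then feeds it through a
# (usually nonlinear) activation function such as the logistic sigmoid.
perceptron_output = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
```

The only structural difference is the final sigmoid; remove it and the perceptron is the regression.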
If the objective of the multiple linear regression is to classify patterns between different classes, and not to regress a quantity, then another approach is to make use of clustering algorithms. Clustering is particularly useful when the data contains multiple classes and more than one linear relationship. Once the data set has been partitioned, further regression analysis can be performed on each class. Some useful clustering algorithms are the K-Means Clustering Algorithm and one of my favourite computational intelligence algorithms, Ant Colony Optimization.
The image on the right shows how the K-Means clustering algorithm can be used to partition data into clusters (classes). Regression can then be performed on each class individually.
Logistic Regression Analysis - linear regressions deal with continuous-valued series whereas a logistic regression deals with categorical (discrete) values. Discrete values are difficult to work with because they are non-differentiable, so gradient-based optimization techniques don't apply directly.
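The standard workaround is to model the probability of each category with the logistic (sigmoid) function, which is differentiable. A minimal sketch on made-up one-dimensional data (the values and labels are assumptions for illustration):

```python
import numpy as np

# Toy data: the sigmoid maps the linear predictor to a probability,
# giving a differentiable objective even though the labels are discrete.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])            # categorical (0/1) outcomes

w, b = 0.0, 0.0
for _ in range(2000):                        # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))   # predicted probabilities
    w -= 0.1 * np.mean((p - y) * x)          # gradient of the log-loss
    b -= 0.1 * np.mean(p - y)

# Classify as 1 when the predicted probability exceeds 0.5
predictions = (1.0 / (1.0 + np.exp(-(w * x + b))) > 0.5).astype(int)
```

On this separable toy data the fitted model recovers the labels exactly.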
Stepwise Regression Analysis - this is the name given to the iterative construction of a multiple regression model. It works by automatically selecting statistically significant independent variables to include in the regression analysis. This is achieved by either growing or pruning the set of variables included in the regression analysis.
Many other regression analyses exist and, in particular, mixed models are worth mentioning here. Mixed models are an extension of the generalized linear model in which the linear predictor contains random effects in addition to the usual fixed effects. This decision tree can be used to help determine the right components for a model.
Approximating a Regression Analysis
Note: if the formulae below don't show up correctly, try refreshing your browser
After deciding on a regression model you must select a technique for approximating the regression analysis. This involves optimizing the free parameters of the regression model such that some objective function, which measures how well the model 'fits' the dataset, is minimized. The most commonly used approach is called the least squares method.
The least squares method minimizes the sum of the squared errors, where the errors are the residuals between the fitted curve and the set of data points. The residuals can be calculated using perpendicular distances or vertical distances. The errors are squared so that the residuals form a continuous, differentiable quantity.
In the case of vertical offsets the error is equal to the difference between the value y_i from the dataset and the value f(x_i) computed from the regression line (in a multiple regression, f also depends on the additional explanatory variables), so the objective becomes SSE = Σ [y_i − f(x_i)]². In the case of perpendicular offsets the error is the distance, d_i, between the data point, (x_i, y_i), and the point along the regression curve perpendicular to it. For a straight line, this distance can be calculated by solving a quadratic equation.
These concepts should be familiar to statisticians as well as machine learning enthusiasts because the sum-squared error is the same objective function used when training neural networks.
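The vertical-offset objective can be sketched directly. The data points below are made-up values chosen to lie exactly on a known line, so the behaviour of the objective is easy to see:

```python
import numpy as np

# Sketch of the least-squares objective: the residuals are the vertical
# offsets between the data points and a candidate line, and the objective
# is the sum of their squares.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])    # lies exactly on y = 1 + 2x

def sse(a, b):
    residuals = y - (a + b * x)       # vertical offsets from the line
    return np.sum(residuals ** 2)     # squaring keeps the objective smooth

perfect_fit_error = sse(1.0, 2.0)     # the true line has zero error
perturbed_error = sse(1.0, 2.5)       # any other line scores worse
```

Minimizing sse over (a, b) is exactly the optimization problem the rest of this section solves.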
Simple linear regression analysis
This optimization problem for straight lines was further simplified by Kenney and Keeping, who introduced the concept of the center of mass of the dataset, (x̄, ȳ), and related this to the y-intercept of the fitted line. This optimization problem is mathematically modelled as,
Sum of x squares: ss_xx = Σ (x_i − x̄)²
Sum of y squares: ss_yy = Σ (y_i − ȳ)²
These two measurements can be combined to calculate the overall sum of squares,
Overall sum of squares: ss_xy = Σ (x_i − x̄)(y_i − ȳ)
Using just these three variables, ss_xx, ss_yy, and ss_xy, and the center of mass, it is possible to construct the straight line (linear regression) of the form y = a + bx, with slope b = ss_xy / ss_xx and intercept a = ȳ − b·x̄, which minimizes the sum squared error of the residuals between the line and the data points.
These parameters are all that is needed to draw the linear regression analysis which fits a set of observed data points. Lastly, the overall quality of the regression analysis is measured using the correlation coefficient, r² = ss_xy² / (ss_xx · ss_yy).
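The Kenney and Keeping quantities translate directly into code. The (x, y) sample below is made up for illustration; the three sums of squares plus the centre of mass fully determine the fitted line:

```python
import numpy as np

# Illustrative sample of (x, y) observations with a near-linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.9])

x_bar, y_bar = x.mean(), y.mean()           # centre of mass of the dataset
ss_xx = np.sum((x - x_bar) ** 2)            # sum of x squares
ss_yy = np.sum((y - y_bar) ** 2)            # sum of y squares
ss_xy = np.sum((x - x_bar) * (y - y_bar))   # overall sum of squares

b = ss_xy / ss_xx                           # slope of the fitted line
a = y_bar - b * x_bar                       # y-intercept via the centre of mass
r_squared = ss_xy ** 2 / (ss_xx * ss_yy)    # correlation coefficient
```

For this sample the fit is almost perfect, so r_squared comes out very close to 1.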
Iterative methods - for harder problems (fun)
For more complex functions iterative methods need to be applied. An iterative procedure is one which generates a sequence of improving approximate solutions to a particular problem. This is also known as a search or optimization algorithm. There are two classes of optimization algorithms, exhaustive and heuristic. Exhaustive techniques are referred to as "brute force" methods because they deterministically try every combination. Heuristic methods, on the other hand, use knowledge about the optimization problem to locate good solutions.
One widely used heuristic is the gradient of a function, because where this is equal to zero the function is at either a local minimum or maximum. Gradient methods such as gradient descent, the Gauss-Newton method, and the Levenberg-Marquardt algorithm adjust the solution along the derivative such that the objective function is either minimized or maximized.
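Here is a gradient-descent sketch for the same line-fitting problem: instead of the closed-form solution, the parameters are repeatedly stepped against the gradient of the sum-squared error until it is (approximately) minimized. The data, learning rate, and iteration count are illustrative choices:

```python
import numpy as np

# Noiseless synthetic data on the line y = 3x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 1.0

a, b = 0.0, 0.0                  # intercept and slope, arbitrary start
learning_rate = 0.02
for _ in range(5000):
    residuals = (a + b * x) - y
    a -= learning_rate * 2 * np.mean(residuals)        # d(MSE)/da
    b -= learning_rate * 2 * np.mean(residuals * x)    # d(MSE)/db
```

After enough iterations a and b converge to the true intercept and slope; on more complex, non-convex objectives this is where the heuristic and evolutionary methods below become attractive.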
Another widely used heuristic is line of sight, a.k.a. direct methods. These methods don't use gradients but instead generate points within the search space and "look for" the optima. Examples of such algorithms include random search, pattern search, grid search, hill climbers, simulated annealing, and even the particle swarm optimization algorithm. Evolutionary computation is another popular metaheuristic for solving complex optimization problems; these algorithms are inspired by the processes found in natural evolution. In this category we find such algorithms as genetic algorithms, grammatical evolution, and the differential evolution algorithm.
Data considerations
Because the error is the squared distance between the data point and the regression line, large distances have disproportionately large errors which cause the regression analysis to converge on a solution with a poor correlation coefficient. As such, outliers should ideally be removed from the dataset. That said, identifying outliers can be a somewhat tricky task.
One might also consider applying weights to different points in the data set. As an example, consider an investor analyzing a multi-year time series. He might decide to place a greater weight (importance) on recent years because he assumes them to be a more accurate reflection of future prices. The technique used in this instance is weighted least squares regression analysis.
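Weighted least squares can be sketched with plain NumPy: scaling each observation by the square root of its weight and then running an ordinary least-squares solve is equivalent to minimizing the weighted sum of squared residuals. The price series and weight scheme below are made-up assumptions for illustration:

```python
import numpy as np

t = np.arange(10, dtype=float)           # time index, 0 = oldest
prices = 2.0 * t + 5.0                   # true trend: slope 2
prices[:3] += 20.0                       # stale early data we want to discount

weights = np.linspace(0.1, 1.0, 10)      # more importance on recent points
sw = np.sqrt(weights)

X = np.column_stack([np.ones_like(t), t])            # design matrix [1, t]
coeffs, *_ = np.linalg.lstsq(X * sw[:, None], prices * sw, rcond=None)
intercept, slope = coeffs
```

Because the distorted early observations carry little weight, the fitted slope lands closer to the true trend than an unweighted fit would.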
Quandl Integration
A recurring challenge with any quantitative analysis is the availability of good quality data. Luckily for us, Quandl.com has taken on the data challenge and indexed millions of economics, financial, societal, and country specific datasets. That data is also available through a free API (Application Programming Interface) supported by the Quandl Python package.
Downloading Quandl data
To make more than 50 API calls per day you need to sign up with Quandl.com to get a free authentication token. This can then be used to download datasets through Quandl for Python. For instructions on installing Quandl for Python check out PyPI or the GitHub page. To get a dataset from Quandl, e.g. quandl.com/WIKI/AAPL-Apple-Inc-AAPL-Prices-Dividends-Splits-and-Trading-Volume, paste its name (WIKI/AAPL) into the Quandl.get() function,
import Quandl
data_set = Quandl.get("WIKI/AAPL", authtoken="your token here")
The Quandl.get() function also supports a number of data transformations and manipulations which allow you to specify how you would like the data to be returned, including,
- order:String - ("asc" | "desc")
- rows:int - the amount of historical data to extract
- frequency:String - ("daily" | "weekly" | "monthly" | "quarterly" | "annual")
- transformation:String - ("diff" | "rdiff" | "normalize" | "cumul")
- returns:String - ("numpy")
Here is an example of a detailed API call using multiple data transformations,
import Quandl
data_set = Quandl.get("WIKI/AAPL", rows=50, order="desc", frequency="weekly", transformation="normalize", returns="numpy", authtoken="your token here")
Abstraction
For flexibility and reusability I abstracted the Quandl API with a class called QuandlSettings. A QuandlSettings object contains the parameters required to construct any Quandl API call. I also added an additional column parameter which allows the user to specify which column of the dataset to include in the regression analysis.
The dataset name is decoupled from the QuandlSettings class to improve the reusability of QuandlSettings objects. Consider how the quandl_args_prices object is reused for each dataset in the economic regression analysis example below,
Custom download method
A custom download method was created in the RegressionAnalysis class which receives a QuandlSettings object and the name of the dataset to be downloaded. In order to extract the correct column for the regression analysis a for loop was used,
There must be a more efficient method than the for loop, but I don't know it yet. Please also take note that np.arange(1, quandl_settings.rows + 1, 1) creates an array of numbers increasing from 1 to quandl_settings.rows. This is used because the StatsModels regression analysis model does not support dates (yet), so these values represent time.
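The column extraction can be sketched on a small synthetic record array of the kind Quandl.get(..., returns="numpy") hands back. The column names ("Date", "Open", "Close") and values here are assumptions for illustration:

```python
import numpy as np

# A stand-in for the structured array returned by the Quandl API,
# newest rows first, as in the original download method.
quandl_data_set = np.array(
    [("2014-01-03", 11.0, 11.5),
     ("2014-01-02", 10.5, 11.0),
     ("2014-01-01", 10.0, 10.5)],
    dtype=[("Date", "U10"), ("Open", "f8"), ("Close", "f8")])

column = 2                                     # index of the desired column
col_name = quandl_data_set.dtype.names[column] # look up its name
prices = quandl_data_set[col_name][::-1]       # reverse into chronological order

time = np.arange(1, len(prices) + 1, 1)        # integer stand-in for dates
```

Indexing the array by column name replaces the for loop over rows, and the [::-1] slice puts the prices into chronological order.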
Python StatsModels
StatsModels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.
I recently started using StatsModels and I've been very impressed. It provides efficient implementations of many statistical tools including, simple linear regression models, generalized linear models, discrete choice models, robust linear models, time series analysis tools including ARMA, nonparametric estimators, datasets, statistical tests, and more.
Abstraction
As with the QuandlSettings class, a StatsModelsSettings class was created to improve the reusability of configurations for the regression analysis. At this point in time, these settings are restricted to changing the power of the fitted curve, and specifying whether or not confidence lines around the regression should be computed and plotted,
Note that when the exponent is equal to 1.0 the fitted curve is a straight line. When this is greater than 1.0, the curve begins to take on nonlinearities.
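The effect of the exponent can be sketched with a plain least-squares solve: raising it adds powers of x to the design matrix, so the same solver fits a curve instead of a straight line. The data below is synthetic and the fit function is a simplified stand-in for what StatsModels does internally:

```python
import numpy as np

x = np.linspace(-2, 2, 21)
y = x ** 2                              # clearly nonlinear data

def fit(exponent):
    # Columns 1, x, ..., x^exponent of the design matrix
    X = np.vander(x, int(exponent) + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coeffs                   # fitted values

linear_error = np.sum((y - fit(1)) ** 2)     # exponent 1: straight line underfits
quadratic_error = np.sum((y - fit(2)) ** 2)  # exponent 2: curve captures the data
```

The straight line leaves a large residual on this data while the quadratic fits it essentially exactly.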
Ordinary Least Squares
StatsModels includes an ordinary least squares method. Our run_ordinary_least_squares() method wraps it with Quandl data and a StatsModelsSettings object.
In this wrapper method the data from Quandl is treated as the dependent variable, the array of values from 1 to rows is treated as the independent variable (time / dates), and a StatsModelsSettings object is used to store values for the parameters used to compute the regression analysis. The method is implemented as follows,
To print the results of the regression analysis from StatsModels you can add the following command, print(statsmodel_regression.summary()). The output should look something like this,
The Regression Analysis Class
A RegressionAnalysis class was created so that it would be easy to create and store multiple regressions. The RegressionAnalysis class encapsulates the run_ordinary_least_squares() and the get_quandl_data() methods. This class is shown at the end of this article.
MatplotLib
The final piece of the puzzle is to plot the results. Because we want to be able to plot multiple regressions on one canvas, the plotting functionality and the RegressionAnalysis class are decoupled. For this Matplotlib was used. Matplotlib is a 2D plotting library which produces figures in a variety of hardcopy formats and interactive environments across platforms.
A plot_regression_lines() function was defined as a global method. It receives a list of RegressionAnalysis objects as an argument and plots each out, one by one.
Example usage
By combining object-oriented programming, the Quandl API, and existing Python packages we have created a program which can do simple regression analysis on any Quandl dataset! The remainder of this article shows some simple example applications of the program; in future articles and tutorials I will construct ever more sophisticated analysis tools.
Fundamental Analysis: Google vs. Yahoo vs. Apple revenues
Quandl.com contains historical fundamental indicators as well as company data for many US companies. This is the code we would need to type if we wanted to compare the revenues of Google, Yahoo, and Apple over the past five years,
And here are the results,
Technical Analysis: Trade entry and exit positions
Regression analysis is used extensively in trading. Technical analysts use the "regression channel" to calculate entry and exit positions into a particular stock.
Another application is pairs trading which monitors the performance of two historically correlated securities. When the correlation temporarily weakens, i.e. one stock moves up while the other moves down, the pairs trade shorts the outperforming stock and buys the underperforming one, betting that the "spread" between the two would eventually converge.
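The spread-monitoring part of a pairs trade can be sketched in a few lines. The two price series below are synthetic and purely illustrative; the z-score threshold of 2 is a common rule of thumb, not part of the program above:

```python
import numpy as np

rng = np.random.default_rng(0)
stock_a = 100 + np.cumsum(rng.normal(0, 1, 250))  # random-walk price series
stock_b = stock_a + rng.normal(0, 1, 250)         # historically tracks stock_a
stock_b[-1] += 8.0                                # temporary divergence today

spread = stock_a - stock_b
z_score = (spread[-1] - spread.mean()) / spread.std()

# A |z| above ~2 suggests shorting the outperformer and buying the
# underperformer, betting the spread reverts to its historical mean.
signal = abs(z_score) > 2.0
```

Here the engineered divergence pushes the spread several standard deviations from its mean, so the entry signal fires.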
If we wanted to compare the past 350 weeks worth of prices for Google and Yahoo with the regression channel (confidence intervals), we would use the following code,
And here are the results,
Economics: GDP comparison of BRICS nations
Another area in finance where regression analysis is often used is econometrics. If we wanted to compare the past 15 years of GDP values for the BRICS nations (Brazil, Russia, India, China, and South Africa), we would just need to produce the following code,
And here are the results (as you all guessed, China was #1)
Conclusion and Source Code
In conclusion, regression analysis is a simple and yet useful tool. It can be used to help explain and compare various datasets and is used extensively in finance, trading, risk management, and econometrics. That having been said, regression analysis is not immune to fault and asserts strong requirements on the data being analysed. For a great discussion on the risks and problems with using regression analysis click here.
Comments
Very interesting article Stuart! Looking forward to some more.

Great blog Stuart! I appreciate the time put into it. I'm trying to use Stepwise regression to select more important predictor variables, i.e. momentum, price rate of change, volume rate of change, etc., for predicting stock price. Do you know if the scipy module has a method like that?

Nice guide! I've been trying to work on a response time integration of a similar problem, so I was trying to run your code to see how it works, but the program can't get the data from quandl.com, and is complaining that I don't have a directory called quandl. Any advice? Thanks,
The geek

Hi,
I'm really new to Python (I downloaded it today, but I did some Matlab before); your blog/article is really well explained and really useful. I'm trying to do a nonlinear multivariate regression, in 45 dimensions, to capture a trend in finance. Do you have some example code for that, or do you know which tools I have to download and where? I already have my data in an Excel sheet and I would like my Python regression to be dynamic (my Excel sheet is dynamic). Thanks a lot for any help!
🙂

Hi Stuart,
Great article.
Here is how to get rid of your loop for getting the appropriate column in the quandl_data_set:
 quandl_data_set is a recarray object in numpy (a record array), which is essentially an array with column names and dtypes (data types) for those columns. You can index those columns from the array using the names of the columns, as you would for key-value pairs in a dictionary.
 The method is therefore to first find out the name of the column you want and then index the array using this column name. You do this as follows,
# quandl_data_set.dtype.names is a list of strings containing the names of the columns
col_name = quandl_data_set.dtype.names[quandl_settings.column]
# use the column name to get the right column of data
quandl_prices = quandl_data_set[col_name][::-1]
Now there are two things to note. Firstly, indexing in Python starts at 0, so make sure you're getting the column you had hoped for, i.e. if you want the fourth column name you must index as quandl_data_set.dtype.names[3]. Secondly, your method reverses the order of the prices so that the data at the bottom of the column comes first, which in this case puts the data into chronological order - to do the same in this example we have used [::-1], which reverses the order to match what you have in your original loop.

Thank you for this very bright presentation. It will really help teachers, students and professionals !

Wow, your post on regression analysis is so great! First, I got to learn enough theory and then many methods for conducting the linear regression. Enjoyed it very much. I hope that I will be able to apply regression with Python to my data on decision making (from a psychological perspective; i.e., behavioural data).
Thanks again,
Python IS so great,
Regards,
Freddy

Very comprehensible code and explanation. But the only problem is that some errors come up when it runs in my browser with IPython. Problems like the dataframe object having no attribute data.load_data().