Regression analysis using Python
This tutorial covers regression analysis using the Python StatsModels package with Quandl integration. For motivational purposes, here is what we are working towards: a regression analysis program which receives multiple data-set names from Quandl.com, automatically downloads the data, analyses it, and plots the results in a new window.
Types of Regression Analysis
Linear regression analysis fits a straight line to some data in order to capture the linear relationship between two variables. The regression line is constructed by optimizing the parameters of the straight line function such that the line best fits a sample of (x, y) observations where y is a variable dependent on the value of x. Regression analysis is used extensively in economics, risk management, and trading. One cool application of regression analysis is in calibrating certain stochastic process models such as the Ornstein-Uhlenbeck stochastic process.
Non-linear regression analysis uses a curved function, usually a polynomial, to capture the non-linear relationship between the two variables. The regression is often constructed by optimizing the parameters of a higher-order polynomial such that the curve best fits a sample of (x, y) observations. In the article, Ten Misconceptions about Neural Networks in Finance and Trading, it is shown that a neural network is essentially approximating a multiple non-linear regression function between the inputs into the neural network and the outputs.
The case for linear vs. non-linear regression analysis in finance remains open. The issue with linear models is that they often under-fit and may also impose assumptions on the variables. The main issue with non-linear models is that they often over-fit. Training and data-preparation techniques can be used to minimize over-fitting.
A multiple linear regression analysis is used for predicting the values of a dependent variable, Y, using two or more independent variables e.g. X1, X2, ..., Xn. For example, you could try to forecast share prices using one fundamental indicator like the PE ratio, or you could use multiple indicators together like the PE, DY, DE ratios, and the share's EPS. Interestingly there is almost no difference between a multiple linear regression and a perceptron (also known as an artificial neuron, the building block of neural networks). Both are calculated as the weighted sum of the input vector plus some constant or bias which is used to shift the function. The only difference is that the input signal into the perceptron is fed into an activation function which is often non-linear.
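To make the comparison with the perceptron concrete, here is a minimal sketch. The function names, the input values, and the choice of tanh as the activation are all assumptions made for illustration; they do not come from any particular library.

```python
import math


def linear_regression_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a constant (the regression intercept)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def perceptron_output(inputs, weights, bias, activation=math.tanh):
    """The same weighted sum, passed through an activation function."""
    return activation(linear_regression_output(inputs, weights, bias))


# Hypothetical indicator values and coefficients, for illustration only.
inputs = [0.5, -1.0, 2.0]
weights = [0.2, 0.4, 0.1]
bias = 0.3

linear_value = linear_regression_output(inputs, weights, bias)
neuron_value = perceptron_output(inputs, weights, bias)
# With an identity activation the perceptron reduces to the linear regression.
identity_value = perceptron_output(inputs, weights, bias, activation=lambda z: z)
```

With the identity activation the two outputs coincide, which is the sense in which the models are almost the same.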
If the objective of the multiple linear regression is to classify patterns between different classes and not regress a quantity then another approach is to make use of clustering algorithms. Clustering is particularly useful when the data contains multiple classes and more than one linear relationship. Once the data set has been partitioned further regression analysis can be performed on each class. Some useful clustering algorithms are the K-Means Clustering Algorithm and one of my favourite computational intelligence algorithms, Ant Colony Optimization.
The image on the right shows how the K-Means clustering algorithm can be used to partition data into clusters (classes). Regression can then be performed on each class individually.
Logistic Regression Analysis - linear regressions deal with continuous valued series whereas a logistic regression deals with categorical (discrete) outcomes. Discrete values are difficult to work with directly because they are non-differentiable, so a logistic regression instead models the probability of each class with the smooth logistic function, and it is this probability that is fitted to the data.
Stepwise Regression Analysis - this is the name given to the iterative construction of a multiple regression model. It works by automatically selecting statistically significant independent variables to include in the regression analysis. This is achieved by either growing or pruning the set of variables included in the regression analysis.
Many other regression analyses exist, and in particular, mixed models are worth mentioning here. A mixed model is an extension to the generalized linear model in which the linear predictor contains random effects in addition to the usual fixed effects. This decision tree can be used to help determine the right components for a model.
Approximating a Regression Analysis
After deciding on a regression model you must select a technique for approximating the regression analysis. This involves optimizing the free parameters of the regression model such that some objective function, which measures how well the model 'fits' the data-set, is minimized or maximized. The most commonly used approach is called the least squares method.
The least squares method minimizes the sum of the errors squared, where the errors are the residuals between the fitted curve and the set of data points. The residual can be calculated using perpendicular distances or vertical distances. The errors are squared so that the residuals form a continuous differentiable quantity.
In the case of vertical offsets the error is equal to the difference between the value from the data-set and the computed value from the regression line, $e_i = y_i - f(x_i)$, where in a multiple regression $f$ also depends on the additional explanatory variables. In the case of perpendicular offsets the error is the sum of the distances, $d_i$, between the data points, $(x_i, y_i)$, and the points along the regression curve, $(\hat{x}_i, f(\hat{x}_i))$, perpendicular to them. For a straight line, this can be calculated by solving a quadratic equation.
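The difference between the two kinds of offset can be sketched for a straight line. This example uses the standard point-to-line distance formula for the perpendicular case, which is my substitution rather than the quadratic mentioned above; the function names and data are made up for illustration.

```python
import math


def vertical_offset(x, y, intercept, slope):
    """Signed vertical distance between a data point and the line."""
    return y - (intercept + slope * x)


def perpendicular_offset(x, y, intercept, slope):
    """Shortest (perpendicular) distance between a data point and the line."""
    return abs(vertical_offset(x, y, intercept, slope)) / math.sqrt(1.0 + slope ** 2)


# Hypothetical observations and a candidate line y = 1 + 1 * x.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0)]
intercept, slope = 1.0, 1.0

sse_vertical = sum(vertical_offset(x, y, intercept, slope) ** 2 for x, y in points)
sse_perpendicular = sum(
    perpendicular_offset(x, y, intercept, slope) ** 2 for x, y in points
)
```

For a line, the perpendicular offset is the vertical offset scaled by 1 / sqrt(1 + slope^2), so the two objectives differ most when the line is steep.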
These concepts should be familiar to statisticians as well as machine learning enthusiasts because the sum-squared error is the same objective function used when training neural networks.
Simple linear regression analysis
This optimization problem for straight lines was further simplified by Kenney and Keeping who introduced the concept of the center of mass of the dataset, $(\bar{x}, \bar{y})$, and related this to the y intercept of the fitted line. This optimization problem is mathematically modelled as,
Sum of x squares: $$ss_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Sum of y squares: $$ss_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
These two measurements are joined by a third, the overall sum of squares, which combines the x and y deviations,
Overall sum of squares: $$ss_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
Using just these three variables, $ss_{xx}$, $ss_{yy}$, $ss_{xy}$, and the center of mass it is possible to construct the straight line (linear regression) of the form, $y = a + bx$, which minimizes the sum squared error, $\sum_{i=1}^{n} (y_i - a - b x_i)^2$, of the residuals between the line and the datapoints,
$$b = \frac{ss_{xy}}{ss_{xx}}, \qquad a = \bar{y} - b\bar{x}$$
These parameters are all that is needed to draw the linear regression analysis which fits a set of observed data points. Lastly, the overall quality of the regression analysis is measured using the correlation coefficient,
$$r^2 = \frac{ss_{xy}^2}{ss_{xx}\, ss_{yy}}$$
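The closed-form fit can be sketched in a few lines of plain Python. The sums of squared deviations about the center of mass are computed directly; the function name fit_line and the sample data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least squares line y = intercept + slope * x from sums of squares."""
    n = len(xs)
    x_bar = sum(xs) / n  # center of mass, x coordinate
    y_bar = sum(ys) / n  # center of mass, y coordinate
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    r_squared = ss_xy ** 2 / (ss_xx * ss_yy)  # squared correlation coefficient
    return intercept, slope, r_squared


# Hypothetical observations, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
intercept, slope, r_squared = fit_line(xs, ys)
```

No iteration is needed here, which is why the straight line is usually treated as the easy case.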
Iterative methods - for harder problems (fun)
For more complex functions iterative methods need to be applied. An iterative procedure is one which generates a sequence of improving approximate solutions to a particular problem. This is also known as a search or optimization algorithm. There are two classes of optimization algorithms, exhaustive or heuristic. Exhaustive techniques are referred to as "brute force" methods because they deterministically try every combination. Heuristic methods on the other hand use knowledge about the optimization problem to locate good solutions.
One widely used heuristic is the gradient of a function because when this is equal to zero, that point in the function is a stationary point, typically a local minimum or maximum. Gradient methods such as gradient descent, the Gauss-Newton method, and the Levenberg-Marquardt algorithm use the derivative to adjust the solution step by step until they reach a point where the derivative is (close to) zero.
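As a rough illustration of a gradient method, the sketch below fits a straight line by plain gradient descent on the mean squared error. The learning rate, iteration count, and data are assumptions chosen so that this small example converges; they are not general recommendations.

```python
def gradient_descent_line(xs, ys, learning_rate=0.01, iterations=20000):
    """Fit y = intercept + slope * x by following the negative gradient."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_intercept = -2.0 * sum(residuals) / n
        grad_slope = -2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        # Step against the gradient; the steps shrink as the gradient nears zero.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


# Hypothetical observations, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
gd_intercept, gd_slope = gradient_descent_line(xs, ys)
```

For a straight line this is overkill, but the same loop carries over to functions that have no closed-form solution.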
Another widely used heuristic is line of sight, a.k.a. direct methods. These methods don't use gradients but instead generate points within the search space and "look for" the optima. Examples of such algorithms include random search, pattern search, grid search, hill climbers, simulated annealing, and even the particle swarm optimization algorithm. Evolutionary computation is another popular family of metaheuristics for solving complex optimization problems; these algorithms are inspired by the processes found in natural evolution. In this category we find such algorithms as genetic algorithms, grammatical evolution, and the differential evolution algorithm.
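A direct method can be sketched with a simple pattern (compass) search, which only ever compares objective values. This is one basic variant written for illustration; the function names, starting point, and step-halving rule are assumptions rather than a reference implementation.

```python
def sum_squared_error(params, xs, ys):
    """Objective: sum of squared vertical residuals for a straight line."""
    intercept, slope = params
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def pattern_search(objective, start, step=1.0, tolerance=1e-6, max_iterations=100000):
    """Probe each coordinate in both directions; halve the step when stuck."""
    current = list(start)
    best = objective(current)
    for _ in range(max_iterations):
        if step <= tolerance:
            break
        improved = False
        for index in range(len(current)):
            for direction in (step, -step):
                candidate = list(current)
                candidate[index] += direction
                value = objective(candidate)
                if value < best:  # keep any probe that lowers the error
                    current, best, improved = candidate, value, True
        if not improved:
            step /= 2.0  # no probe helped, so look closer to the current point
    return current, best


# Hypothetical observations, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
(ps_intercept, ps_slope), ps_error = pattern_search(
    lambda p: sum_squared_error(p, xs, ys), start=[0.0, 0.0]
)
```

No derivative is computed anywhere, so the same search works on objectives that are not differentiable.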
Because the error is the squared distance between the data point and the regression line, large distances have disproportionately large errors which cause the regression analysis to converge on a solution with a poor correlation coefficient. As such, outliers should ideally be removed from the data-set. That said, identifying outliers can be a somewhat tricky task.
One might also consider applying weights to different points in the data set. As an example consider an investor who is analyzing a multi-year time series. He might decide to place a greater weight (importance) on recent years because he assumes that to be an accurate reflection of the future prices. The technique used in this instance is weighted least squares regression analysis.
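Weighting can be sketched by replacing each sum in the straight-line fit with a weighted sum. The function name and the halving weight scheme below are assumptions for illustration; in practice the weights would reflect how much the investor trusts each observation.

```python
def fit_weighted_line(xs, ys, weights):
    """Weighted least squares line y = intercept + slope * x."""
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    s_xy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical yearly observations, oldest first.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

# Each year back counts half as much as the year after it.
recency_weights = [0.5 ** (len(xs) - 1 - i) for i in range(len(xs))]
weighted_fit = fit_weighted_line(xs, ys, recency_weights)

# Equal weights reproduce the ordinary least squares fit.
equal_fit = fit_weighted_line(xs, ys, [1.0] * len(xs))

# Putting all the weight on the last two years gives the line through them.
recent_only_fit = fit_weighted_line(xs, ys, [0.0, 0.0, 0.0, 1.0, 1.0])
```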
A recurring challenge with any quantitative analysis is the availability of good quality data. Luckily for us, Quandl.com has taken on the data challenge and indexed millions of economic, financial, societal, and country-specific datasets. That data is also available through a free API (Application Programming Interface) supported by the Quandl Python package.
Downloading Quandl data
If you need more than 50 API calls per day, you need to sign up with Quandl.com to get a free authentication token. This can then be used to download datasets through Quandl for Python. For instructions on installing Quandl for Python check out PyPI or the GitHub page. To get a data-set from Quandl e.g. quandl.com/WIKI/AAPL-Apple-Inc-AAPL-Prices-Dividends-Splits-and-Trading-Volume paste its name (WIKI/AAPL) into the Quandl.get() function,
import Quandl
data_set = Quandl.get("WIKI/AAPL", authtoken="your token here")
The Quandl.get() function also supports a number of data transformations and manipulations which allow you to specify how you would like the data to be returned, including,
- order:String - ("asc"|"desc")
- rows:int - the amount of historical data to extract
- frequency:String - ("daily"|"weekly"|"monthly"|"quarterly"|"annual")
- transformation:String - ("diff"|"rdiff"|"normalize"|"cumul")
- returns:String - ("numpy"|)
Here is an example of a detailed API call using multiple data transformations,
import Quandl
data_set = Quandl.get("WIKI/AAPL", rows=50, order="desc",
                      frequency="weekly", transformation="normalize",
                      returns="numpy", authtoken="your token here")
For flexibility and re-usability I abstracted the Quandl API with a class called QuandlSettings. A QuandlSettings object contains the parameters required to construct any Quandl API call. I also added an additional column parameter which allows the user to specify which column of the dataset to include in the regression analysis.
The dataset name is decoupled from the QuandlSettings class to improve re-usability of QuandlSettings objects. Consider how the quandl_args_prices object is reused for each dataset in the economic regression analysis example below,
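The idea of a reusable settings object can be sketched as follows. This is not the original QuandlSettings class: the field layout, the defaults, and the as_get_kwargs() helper are assumptions for illustration, and only the parameter names mirror the Quandl.get() options listed above.

```python
from dataclasses import dataclass


@dataclass
class QuandlSettings:
    """Hypothetical container for the parameters of a Quandl API call."""
    rows: int
    column: int  # which column of the dataset to regress on
    frequency: str = "weekly"
    transformation: str = "normalize"
    order: str = "desc"

    def as_get_kwargs(self, authtoken=""):
        """Keyword arguments in the shape Quandl.get() expects."""
        return {
            "rows": self.rows,
            "order": self.order,
            "frequency": self.frequency,
            "transformation": self.transformation,
            "returns": "numpy",
            "authtoken": authtoken,
        }


# One settings object, reused for every dataset name.
quandl_args_prices = QuandlSettings(rows=50, column=4)
kwargs = quandl_args_prices.as_get_kwargs(authtoken="your token here")

# The dataset name is supplied separately at call time, for example:
#   data_set = Quandl.get("WIKI/AAPL", **kwargs)
```

Because the dataset name is not stored on the object, the same settings can drive any number of downloads.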
Custom download method
A custom download method was created in the RegressionAnalysis class which receives a QuandlSettings object and the name of the dataset to be downloaded. In order to extract the correct column for the regression analysis a for loop was used,
There must be a more efficient method than the for loop, but I don't know it yet. Please also take note that np.arange(1, quandl_settings.rows + 1, 1) creates an array of numbers increasing from 1 to quandl_settings.rows. This is used because the StatsModels regression analysis model does not support dates (yet) so these values represent time.
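The column extraction and the time index can be sketched without Quandl at all. The rows below are made-up stand-ins for downloaded data, and both helper names are hypothetical; this is not the original download method.

```python
def extract_column(rows_of_values, column):
    """Pull a single column out of row-oriented data."""
    return [row[column] for row in rows_of_values]


def time_index(rows):
    """1, 2, ..., rows: the same values as np.arange(1, rows + 1, 1)."""
    return list(range(1, rows + 1))


# Made-up rows standing in for a downloaded dataset: (open, high, low, close).
downloaded = [
    (10.0, 10.5, 9.8, 10.2),
    (10.2, 10.9, 10.1, 10.8),
    (10.8, 11.0, 10.4, 10.5),
]

closing_prices = extract_column(downloaded, column=3)  # dependent variable
time_values = time_index(len(downloaded))              # independent variable

# If the data is held in a two-dimensional NumPy array instead, the slice
# data[:, 3] selects the same column without an explicit loop.
```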
StatsModels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.
I recently started using StatsModels and I've been very impressed. It provides efficient implementations of many statistical tools, including simple linear regression models, generalized linear models, discrete choice models, robust linear models, time series analysis tools including ARMA, non-parametric estimators, datasets, statistical tests, and more.
As with the QuandlSettings class, a StatsModelsSettings class was created to improve the re-usability of configurations for the regression analysis. At this point in time, these settings are restricted to changing the power of the fitted curve, and specifying whether or not confidence lines around the regression should be computed and plotted,
Note that when the exponent is equal to 1.0 the fitted curve is a straight line. When this is greater than one, the curve begins to take on non-linearities.
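A settings object of this kind can be sketched as a small dataclass. The field names exponent and confidence and the apply_exponent() helper are assumptions based on the description above, not the original class.

```python
from dataclasses import dataclass


@dataclass
class StatsModelsSettings:
    """Hypothetical regression settings: curve power and confidence lines."""
    exponent: float = 1.0
    confidence: bool = False


def apply_exponent(time_values, settings):
    """Raise the independent variable to the configured power."""
    return [float(t) ** settings.exponent for t in time_values]


linear = StatsModelsSettings()
quadratic = StatsModelsSettings(exponent=2.0, confidence=True)

straight_inputs = apply_exponent([1, 2, 3], linear)    # unchanged: straight line
curved_inputs = apply_exponent([1, 2, 3], quadratic)   # squared: curved fit
```

Fitting a straight line against the transformed inputs yields a curve in the original time axis, which is how a single exponent can switch the model between linear and non-linear.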
Ordinary Least Squares
StatsModels includes an ordinary least squares method. Our run_ordinary_least_squares() method wraps it with Quandl data and a StatsModelsSettings object.
In this wrapper method the data from Quandl is treated as the dependent variable, the array of values from 1 to rows is treated as the independent variable (time / dates), and a StatsModelsSettings object is used to store values for the parameters used to compute the regression analysis. The method is implemented as follows,
To print the results of the regression analysis from StatsModels you can add the following command, print(statsmodel_regression.summary()). The output should look something like this,
The Regression Analysis Class
A RegressionAnalysis class was created so that it would be easy to create and store multiple regressions. The RegressionAnalysis class encapsulates the run_ordinary_least_squares() and the get_quandl_data() methods. This class is shown at the end of this article.
The final piece of the puzzle is to plot the results. Because we want to be able to plot multiple regressions on one canvas, plotting functionality and the RegressionAnalysis class are decoupled. For this Matplotlib was used. Matplotlib is a 2D plotting library which produces figures in a variety of hard copy formats and environments across platforms.
A plot_regression_lines() function was defined at module level. It receives a list of RegressionAnalysis objects as an argument and plots each one in turn.
By combining object oriented programming, the Quandl API, and existing Python packages we have created a program which can do simple regression analysis on any Quandl dataset! The remainder of this article will show some simple example applications of the program and in future articles and tutorials I will construct ever more sophisticated analysis tools.
Fundamental Analysis: Google vs. Yahoo vs. Apple revenues
Quandl.com contains historical fundamental indicators as well as company data for many US companies. This is the code we would need to type if we wanted to compare the revenues of Google, Yahoo, and Apple over the past five years,
And here are the results,
Technical Analysis: Trade entry and exit positions
Regression analysis is used extensively in trading. Technical analysts use the "regression channel" to calculate entry and exit positions into a particular stock.
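One common way to draw a regression channel is to fit a line through the prices and offset it by a multiple of the residual standard deviation. The sketch below follows that convention as an assumption; the width of two standard deviations, the function name, and the prices are illustrative, and the original program may have used StatsModels confidence intervals instead.

```python
import statistics


def regression_channel(prices, width=2.0):
    """Return (lower, middle, upper) channel lines for a price series."""
    n = len(prices)
    times = list(range(1, n + 1))
    t_bar = sum(times) / n
    p_bar = sum(prices) / n
    slope = sum((t - t_bar) * (p - p_bar) for t, p in zip(times, prices)) / sum(
        (t - t_bar) ** 2 for t in times
    )
    intercept = p_bar - slope * t_bar
    middle = [intercept + slope * t for t in times]
    # Spread of the prices around the fitted line.
    spread = statistics.pstdev([p - m for p, m in zip(prices, middle)])
    upper = [m + width * spread for m in middle]
    lower = [m - width * spread for m in middle]
    return lower, middle, upper


# Made-up weekly prices.
prices = [2.0, 4.1, 5.9, 8.2, 9.8]
lower, middle, upper = regression_channel(prices)
```

A price near the lower line would be read as a potential entry and a price near the upper line as a potential exit.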
Another application is pairs trading which monitors the performance of two historically correlated securities. When the correlation temporarily weakens, i.e. one stock moves up while the other moves down, the pairs trade shorts the outperforming stock and buys the under-performing one, betting that the "spread" between the two would eventually converge.
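The pairs trade can be sketched by tracking how far the latest spread sits from its historical average. Using a simple price difference as the spread, a z-score, and a threshold of 1.5 are all assumptions for illustration; the prices are made up.

```python
import statistics


def spread_z_score(prices_a, prices_b):
    """How many standard deviations the latest spread is from its mean."""
    spreads = [a - b for a, b in zip(prices_a, prices_b)]
    mean = statistics.mean(spreads)
    deviation = statistics.pstdev(spreads)
    return (spreads[-1] - mean) / deviation


def pairs_signal(z_score, threshold=1.5):
    """Short the outperformer and buy the underperformer on a wide spread."""
    if z_score > threshold:
        return "short A, buy B"
    if z_score < -threshold:
        return "buy A, short B"
    return "no trade"


# Made-up prices: stock A jumps away from stock B in the final period.
prices_a = [10.0, 11.0, 12.0, 13.0, 18.0]
prices_b = [9.0, 10.0, 11.0, 12.0, 12.0]

z = spread_z_score(prices_a, prices_b)
signal = pairs_signal(z)
```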
If we wanted to compare the past 350 weeks' worth of prices for Google and Yahoo with the regression channel (confidence intervals), we would use the following code,
And here are the results,
Economics: GDP comparison of BRICS nations
Another area in finance where regression analysis is often used is econometrics. If we wanted to compare the past 15 years of GDP values for the BRICS nations (Brazil, Russia, India, China, and South Africa), we would just need to write the following code,
And here are the results (as you all guessed, China was #1)
Conclusion and Source Code
In conclusion, regression analysis is a simple and yet useful tool. It can be used to help explain and compare various data-sets and is used extensively in finance, trading, risk management, and econometrics. That having been said, regression analysis is not immune to fault and asserts strong requirements on the data being analysed. For a great discussion on the risks and problems with using regression analysis click here.