# Regression analysis using Python

This tutorial covers regression analysis using the Python StatsModels package with Quandl integration. For motivational purposes, here is what we are working towards: a regression analysis program which receives multiple data-set names from Quandl.com, automatically downloads the data, analyses it, and plots the results in a new window.

##### Types of Regression Analysis

Linear regression analysis fits a straight line to some data in order to capture the linear relationship between that data. The regression line is constructed by optimizing the parameters of the straight-line function such that the line best fits a sample of (x, y) observations, where y is a variable dependent on the value of x. Regression analysis is used extensively in economics, risk management, and trading. One cool application of regression analysis is in calibrating certain stochastic process models, such as the Ornstein-Uhlenbeck stochastic process.

Non-linear regression analysis uses a curved function, usually a polynomial, to capture the non-linear relationship between the two variables. The regression is often constructed by optimizing the parameters of a higher-order polynomial such that the curve best fits a sample of (x, y) observations. In the article, Ten Misconceptions about Neural Networks in Finance and Trading, it is shown that a neural network is essentially approximating a multiple non-linear regression function between the inputs into the neural network and the outputs.

The case for linear vs. non-linear regression analysis in finance remains open. The issue with linear models is that they often under-fit and may also impose assumptions on the variables, while the main issue with non-linear models is that they often over-fit. Training and data-preparation techniques can be used to minimize over-fitting.

Multiple linear regression analysis is used for predicting the values of a dependent variable, Y, using two or more independent variables, e.g. X1, X2, ..., Xn. For example, you could try to forecast share prices using one fundamental indicator like the PE ratio, or you could use multiple indicators together like the PE, DY, and DE ratios, and the share's EPS. Interestingly, there is almost no difference between a multiple linear regression and a perceptron (also known as an artificial neuron, the building block of neural networks). Both are calculated as the weighted sum of the input vector plus some constant or bias which is used to shift the function. The only difference is that the input signal into the perceptron is fed into an activation function which is often non-linear.
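To make that parallel concrete, here is a minimal sketch (with made-up indicator values and weights) showing that a multiple linear regression prediction and a perceptron differ only by the activation function:

```python
import numpy as np

# Hypothetical inputs for one share: PE, DY, DE ratios and EPS (made-up numbers)
x = np.array([15.2, 0.03, 0.8, 5.1])

# Illustrative regression coefficients (weights) and intercept (bias)
w = np.array([0.4, -2.0, -0.1, 0.6])
b = 1.5

# A multiple linear regression prediction is just a weighted sum plus a bias
linear_output = np.dot(w, x) + b

# A perceptron computes the same weighted sum, then applies an activation function
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

perceptron_output = sigmoid(np.dot(w, x) + b)
```

With a linear (identity) activation, the two models are identical.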

If the objective of the multiple linear regression is to classify patterns between different classes, and not to regress a quantity, then another approach is to make use of clustering algorithms. Clustering is particularly useful when the data contains multiple classes and more than one linear relationship. Once the data set has been partitioned, further regression analysis can be performed on each class. Some useful clustering algorithms are the K-Means Clustering Algorithm and one of my favourite computational intelligence algorithms, Ant Colony Optimization.

The image on the right shows how the K-Means clustering algorithm can be used to partition data into clusters (classes). Regression can then be performed on each class individually.

Logistic Regression Analysis - linear regressions deal with continuous valued series whereas a logistic regression deals with categorical (discrete) values. Discrete values are difficult to work with because they are non-differentiable, so gradient-based optimization techniques don't apply.

Stepwise Regression Analysis - this is the name given to the iterative construction of a multiple regression model. It works by automatically selecting statistically significant independent variables to include in the regression analysis. This is achieved by either growing or pruning the variables included in the regression analysis.

Many other regression analyses exist, and in particular, mixed models are worth mentioning here. A mixed model is an extension of the generalized linear model in which the linear predictor contains random effects in addition to the usual fixed effects. This decision tree can be used to help determine the right components for a model.

#### Approximating a Regression Analysis


After deciding on a regression model you must select a technique for approximating the regression analysis. This involves optimizing the free parameters of the regression model such that some objective function, which measures how well the model 'fits' the data-set, is minimized. The most commonly used approach is called the least squares method.

The least squares method minimizes the sum of the squared errors, where the errors are the residuals between the fitted curve and the set of data points. The residuals can be calculated using vertical distances or perpendicular distances. The errors are squared so that the residuals form a continuous differentiable quantity.

$R^2=\sum [y_i - f(x_i, a_1, a_2, ..., a_n)]^2$

$R^2=\sum [d_i]^2$

In the case of vertical offsets the error is the difference between the $y_i$ value from the data-set and the computed $y_r$ value from the regression line, where $y_r = f(x_i, a_1, a_2, ..., a_n)$ and $a_1, a_2, ..., a_n$ are the free parameters of the regression function. In the case of perpendicular offsets the error is the distance, $d_i$, between the data point, $(x_i, y_i)$, and the point on the regression curve, $(x_r, y_r)$, perpendicular to it. For a straight line, this distance can be calculated by solving a quadratic equation.
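As a small illustration of the vertical-offset objective, the following sketch (with made-up data points) computes the sum of squared errors for two candidate straight lines; the one closer to the data produces a much smaller error:

```python
import numpy as np

# Vertical-offset least squares error for a candidate straight line y = m*x + c
def sum_squared_error(m, c, xs, ys):
    residuals = ys - (m * xs + c)  # vertical distances between data and line
    return np.sum(residuals ** 2)

# Made-up observations that roughly follow y = 2x
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.1, 3.9, 6.2, 7.8])

# A line close to the data has a much smaller error than a poor one
good = sum_squared_error(2.0, 0.0, xs, ys)
bad = sum_squared_error(0.0, 5.0, xs, ys)
```

The least squares method searches for the $m$ and $c$ that make this quantity as small as possible.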

These concepts should be familiar to statisticians as well as machine learning enthusiasts because the sum-squared error is the same objective function used when training neural networks.

##### Simple linear regression analysis

This optimization problem for straight lines was further simplified by Kenney and Keeping who introduced the concept of the center of mass of the dataset, $(\bar{x},\bar{y})$, and related this to the y intercept of the fitted line. This optimization problem is mathematically modelled as,

Sum of x squares

$ss_{xx} = \sum_{i=1}^n{(x_i - \bar{x})^2}$

$ss_{xx} = ( \sum_{i=1}^n{x_i^2} ) - n \bar{x}^2$

Sum of y squares

$ss_{yy} = \sum_{i=1}^n{(y_i - \bar{y})^2}$

$ss_{yy} = ( \sum_{i=1}^n{y_i^2} ) - n \bar{y}^2$

These two measurements are combined with a third quantity, the sum of the cross products of the x and y deviations,

Sum of xy cross products

$ss_{xy} = \sum_{i=1}^n{(x_i - \bar{x})(y_i - \bar{y})}$

$ss_{xy} = (\sum_{i=1}^n{x_i y_i}) - n \bar{x} \bar{y}$

Using just these three variables, $ss_{xx}, ss_{yy}, ss_{xy}$, and the center of mass it is possible to construct the straight line (linear regression) of the form, $y = mx + c$, which minimizes the sum squared error, $R^2$, of the residuals between the line and the datapoints.

$m=\frac{ss_{xy}}{ss_{xx}}$

$c = \bar{y} - m\bar{x}$

These parameters are all that is needed to draw the linear regression line which fits a set of observed data points. Lastly, the overall quality of the regression analysis is measured using the coefficient of determination, $r^2$ (the square of the correlation coefficient),

$r^2 = \frac{ss_{xy}^2}{ss_{xx} ss_{yy}}$
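These closed-form expressions are straightforward to implement. Here is a minimal sketch (with made-up data) that computes $ss_{xx}$, $ss_{yy}$, $ss_{xy}$, the slope $m$, the intercept $c$, and $r^2$ exactly as in the formulas above:

```python
import numpy as np

# Closed-form simple linear regression using the sum-of-squares formulas above
def simple_linear_regression(x, y):
    n = len(x)
    x_bar, y_bar = np.mean(x), np.mean(y)          # centre of mass
    ss_xx = np.sum(x ** 2) - n * x_bar ** 2
    ss_yy = np.sum(y ** 2) - n * y_bar ** 2
    ss_xy = np.sum(x * y) - n * x_bar * y_bar
    m = ss_xy / ss_xx                              # slope
    c = y_bar - m * x_bar                          # intercept via the centre of mass
    r2 = ss_xy ** 2 / (ss_xx * ss_yy)              # coefficient of determination
    return m, c, r2

# Made-up observations that roughly follow y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
m, c, r2 = simple_linear_regression(x, y)
```

For this data the fitted line is close to $y = 2x$ with $r^2$ near 1, indicating a very good fit.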

##### Iterative methods - for harder problems (fun)

For more complex functions iterative methods need to be applied. An iterative procedure is one which generates a sequence of improving approximate solutions to a particular problem. This is also known as a search or optimization algorithm. There are two classes of optimization algorithms: exhaustive and heuristic. Exhaustive techniques are referred to as "brute force" methods because they deterministically try every combination. Heuristic methods, on the other hand, use knowledge about the optimization problem to locate good solutions.

One widely used heuristic is the gradient of a function, because when this is equal to zero, that point in the function is either a local minimum or maximum. Gradient methods such as gradient descent, the Gauss-Newton method, and the Levenberg-Marquardt algorithm iteratively adjust the solution so that the objective function is minimized or maximized by driving its derivative towards zero.

Another widely used family is direct search methods. These methods don't use gradients but instead generate points within the search space and "look for" the optima. Examples of such algorithms include random search, pattern search, grid search, hill climbers, simulated annealing, and even the particle swarm optimization algorithm. Evolutionary computation is another popular metaheuristic for solving complex optimization problems; it is inspired by the processes found in natural evolution. In this category we find algorithms such as genetic algorithms, grammatical evolution, and the differential evolution algorithm.

#### Data considerations

Because the error is the squared distance between the data point and the regression line, large distances contribute disproportionately large errors, which can cause the regression analysis to converge on a solution with a poor correlation coefficient. As such, outliers should ideally be removed from the data-set. That said, identifying outliers can be a somewhat tricky task. One might also consider applying weights to different points in the data set. As an example, consider an investor who is analyzing a multi-year time series. He might decide to place a greater weight (importance) on recent years because he assumes them to be a more accurate reflection of future prices. The technique used in this instance is weighted least squares regression analysis.

#### Quandl Integration

A recurring challenge with any quantitative analysis is the availability of good quality data. Luckily for us, Quandl.com has taken on the data challenge and indexed millions of economics, financial, societal, and country specific datasets. That data is also available through a free API (Application Programming Interface) supported by the Quandl Python package.

To make more than 50 API calls per day you need to sign up with Quandl.com to get a free authentication token. This can then be used to download datasets through Quandl for Python. For instructions on installing Quandl for Python check out PyPI or the GitHub page. To get a data-set from Quandl, e.g. quandl.com/WIKI/AAPL-Apple-Inc-AAPL-Prices-Dividends-Splits-and-Trading-Volume, paste its name (WIKI/AAPL) into the Quandl.get() function,

```python
import Quandl

data_set = Quandl.get("WIKI/AAPL", authtoken="your token here")
```


The Quandl.get() function also supports a number of data transformations and manipulations which allow you to specify how you would like the data to be returned, including,

• order:String - ("asc"|"desc")
• rows:int - the number of rows of historical data to extract
• frequency:String - ("daily"|"weekly"|"monthly"|"quarterly"|"annual")
• transformation:String - ("diff"|"rdiff"|"normalize"|"cumul")
• returns:String - ("numpy"|"pandas")

Here is an example of a detailed API call using multiple data transformations,

```python
import Quandl

data_set = Quandl.get("WIKI/AAPL", rows=50, order="desc", frequency="weekly",
                      transformation="normalize", returns="numpy",
                      authtoken="your token here")
```

##### Abstraction

For flexibility and re-usability I abstracted the Quandl API with a class called QuandlSettings. A QuandlSettings object contains the parameters required to construct any Quandl API call. I also added an additional column parameter which allows the user to specify which column of the dataset to include in the regression analysis.

The dataset name is decoupled from the QuandlSettings class to improve the re-usability of QuandlSettings objects. Consider how the quandl_args_prices object is reused for each dataset in the economic regression analysis example below,

A custom download method was created in the RegressionAnalysis class which receives a QuandlSettings object and the name of the dataset to be downloaded. In order to extract the correct column for the regression analysis a for loop was used,

There must be a more efficient method than the for loop, but I don't know it yet. Please also note that np.arange(1, quandl_settings.rows + 1, 1) creates an array of numbers increasing from 1 to quandl_settings.rows. This is used because the StatsModels regression analysis model does not support dates (yet), so these values represent time.
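The original download snippet is not reproduced here, but the idea can be sketched as follows, using a small NumPy record array to stand in for the object returned by Quandl.get(..., returns="numpy"); the `column` and `rows` values mirror the hypothetical QuandlSettings fields:

```python
import numpy as np

# Stand-in for the record array returned by Quandl.get(..., returns="numpy"):
# a Date column followed by price columns, newest row first
quandl_data_set = np.array(
    [("2014-01-03", 77.28, 77.45),
     ("2014-01-02", 79.38, 79.02),
     ("2014-01-01", 79.12, 79.20)],
    dtype=[("Date", "U10"), ("Open", "f8"), ("Close", "f8")],
)

column = 2  # hypothetical QuandlSettings.column: which field to regress on
rows = len(quandl_data_set)

# Extract the chosen column with a for loop, reversing into chronological order
quandl_prices = []
for row in reversed(quandl_data_set):
    quandl_prices.append(row[column])
quandl_prices = np.array(quandl_prices)

# StatsModels doesn't support dates (yet), so 1..rows stands in for time
time_index = np.arange(1, rows + 1, 1)
```

A comment at the bottom of this article shows a more Pythonic alternative using the array's dtype field names instead of a loop.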

#### Python StatsModels

StatsModels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.

I recently started using StatsModels and I've been very impressed. It provides efficient implementations of many statistical tools including, simple linear regression models, generalized linear models, discrete choice models, robust linear models, time series analysis tools including ARMA, non-parametric estimators, datasets, statistical tests, and more.

##### Abstraction

As with the QuandlSettings class, a StatsModelsSettings class was created to improve the re-usability of configurations for the regression analysis. At this point in time, these settings are restricted to changing the power of the fitted curve, and specifying whether or not confidence lines around the regression should be computed and plotted,

Note that when the exponent is equal to 1.0 the fitted curve is a straight line. When it is greater than one, the curve begins to take on non-linearities.
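The original class is not shown here; a minimal sketch of such a settings container (the field names are assumptions, not the author's code) might look like:

```python
# Hypothetical sketch of the StatsModelsSettings container described above
class StatsModelsSettings:
    def __init__(self, exponent=1, confidence=False):
        self.exponent = exponent      # 1 = straight line, >1 = polynomial curve
        self.confidence = confidence  # plot confidence lines around the fit?

# Example: request a quadratic fit with confidence lines
settings = StatsModelsSettings(exponent=2, confidence=True)
```

Keeping these options in one object means a single configuration can be reused across many regression analyses.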

##### Ordinary Least Squares

StatsModels includes an ordinary least squares method. Our run_ordinary_least_squares() method wraps it with Quandl data and a StatsModelsSettings object.

In this wrapper method the data from Quandl is treated as the dependent variable, the array of values from 1 to rows is treated as the independent variable (time / dates), and a StatsModelsSettings object is used to store values for the parameters used to compute the regression analysis. The method is implemented as follows,

To print the results of the regression analysis from StatsModels you can add the following command, print(statsmodel_regression.summary()). The output should look something like this,

##### The Regression Analysis Class

A RegressionAnalysis class was created so that it would be easy to create and store multiple regressions. The RegressionAnalysis class encapsulates the run_ordinary_least_squares() and the get_quandl_data() methods. This class is shown at the end of this article.

##### MatplotLib

The final piece of the puzzle is to plot the results. Because we want to be able to plot multiple regressions on one canvas, the plotting functionality and the RegressionAnalysis class are decoupled. For this, Matplotlib was used. Matplotlib is a 2D plotting library which produces figures in a variety of hard-copy formats and interactive environments across platforms.

A plot_regression_lines() function was defined as a global method. It receives a list of RegressionAnalysis objects as an argument and plots each out, one by one.
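The original function is not shown here; the following sketch illustrates the idea. The attribute names and the use of plain dictionaries in place of RegressionAnalysis objects are assumptions, and the non-interactive Agg backend is selected so the example runs without a display:

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a buffer, not a window
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sketch of the global plotting function described above.
# Each 'analysis' is assumed to carry dates, raw prices, and fitted values.
def plot_regression_lines(regression_analyses):
    figure, axes = plt.subplots()
    for analysis in regression_analyses:
        axes.plot(analysis["dates"], analysis["prices"], ".", label=analysis["name"])
        axes.plot(analysis["dates"], analysis["fitted"], "-")
    axes.legend()
    return figure

# One synthetic regression analysis standing in for a RegressionAnalysis object
dates = np.arange(1, 11)
analysis = {"name": "demo", "dates": dates,
            "prices": 2 * dates + np.random.randn(10),
            "fitted": 2 * dates}
figure = plot_regression_lines([analysis])

# Render the figure to an in-memory PNG instead of opening a window
buffer = io.BytesIO()
figure.savefig(buffer, format="png")
```

In the real program the loop would draw each RegressionAnalysis in the list, one by one, on the same canvas.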

#### Example usage

By combining object-oriented programming, the Quandl API, and existing Python packages, we have created a program which can do simple regression analysis on any Quandl dataset! The remainder of this article will show some simple example applications of the program, and in future articles and tutorials I will construct ever more sophisticated analysis tools.

##### Fundamental Analysis: Google vs. Yahoo vs. Apple revenues

Quandl.com contains historical fundamental indicators as well as company data for many US companies. This is the code we would need to type if we wanted to compare the revenues of Google, Yahoo, and Apple over the past five years,

And here are the results,

##### Technical Analysis: Trade entry and exit positions

Regression analysis is used extensively in trading. Technical analysts use the "regression channel" to calculate entry and exit positions into a particular stock.

Another application is pairs trading which monitors the performance of two historically correlated securities. When the correlation temporarily weakens, i.e. one stock moves up while the other moves down, the pairs trade shorts the outperforming stock and buys the under-performing one, betting that the "spread" between the two would eventually converge.

If we wanted to compare the past 350 weeks worth of prices for Google and Yahoo with the regression channel (confidence intervals), we would use the following code,

And here are the results,

##### Economics: GDP comparison of BRICS nations

Another area in finance where regression analysis is often used is econometrics. If we wanted to compare the past 15 years of GDP values for the BRICS nations (Brazil, Russia, India, China, and South Africa), we would just need to produce the following code,

And here are the results (as you all guessed, China was #1).

##### Conclusion and Source Code

In conclusion, regression analysis is a simple yet useful tool. It can be used to help explain and compare various data-sets and is used extensively in finance, trading, risk management, and econometrics. That having been said, regression analysis is not immune to fault and imposes strong requirements on the data being analysed. For a great discussion of the risks and problems with using regression analysis click here.

1. Very interesting article Stuart! Looking forward to some more.

• Thanks Arne, I'm glad you enjoyed it! The source code is now available on Github as a PyCharms project, but I suspect any IDE would work.

2. Great blog Stuart! I appreciate the time put into it. I'm trying to use Stepwise regression to select more important predictor variables, i.e. momentum, price rate of change, volume rate of change, etc., for predicting stock price. Do you know if the scipy module has a method like that?

• Hi Peter, thanks for the compliment.

I don't think SciPy has an implementation of an automatic stepwise regression, but according to scikit-learn the LARS Lasso method is similar (http://scikit-learn.org/stable/modules/linear_model.html). The main difference between the two methods is that the LARS Lasso includes all the variables in the regression, but updates the independent variables' coefficients proportionally to those variables' correlations with the dependent variable.

If this is not sufficient, it might be easy to put it together yourself, perhaps even using one of the feature selection algorithms discussed here (http://sebastianraschka.com/Articles/2014_sequential_sel_algos.html) and tracking the quality of the regression function using R^2. If you have any further questions, or need some help implementing the algorithm for yourself please drop me a mail through the contact form on the website.

This is an interesting problem 🙂

• Thanks a lot for the reply! I'll give it my best shot with your advice, and will definitely drop a message if I get nowhere.

3. Nice guide! I've been trying to work on a response time integration of a similar problem, so I was trying to run your code to see how it works, but the program can't get the data from quandl.com, and is complaining that I don't have a directory called quandl. Any advice? Thanks,

The geek

4. Hi,

I'm really really new to Python (I downloaded it today, but I did some Matlab before). Your blog/article is really well explained and really useful. I'm trying to do a non-linear multivariate regression, in 4-5 dimensions, to capture a trend in finance. Do you have some example code for that, or do you know which tools I have to download, and where? I already have my data on an Excel sheet and I would like my Python regression to be dynamic (my Excel sheet is dynamic). Thanks a lot for any help!

🙂

5. Hi Stuart,

Great article.

Here is how to get rid of your loop for getting the appropriate column in the quandl_data_set:

- quandl_data_set is a recarray object in numpy (a record array), which is essentially an array with column names and dtypes (data types) for those columns. You can index those columns from the array using the names of the columns as you would for key-value pairs in a dictionary.

- The method is therefore to first find out the name of the column you want and then index the array using this column name. You do this as follows

```python
# quandl_data_set.dtype.names is a tuple of strings containing the names of the columns
col_name = quandl_data_set.dtype.names[quandl_settings.column]

# use the column name to get the right column of data
quandl_prices = quandl_data_set[col_name][::-1]
```

- Now there are two things to note. Firstly, indexing in Python starts at 0, so make sure you're getting the column you had hoped for, i.e. if you want the fourth column name you must index as quandl_data_set.dtype.names[3]. Secondly, your method reverses the order of the prices so that the data at the bottom of the column comes first, which in this case puts the data into chronological order - to do the same in this example we have used [::-1], which reverses the order to match what you have in your original loop.

• Thanks Jerry. This is fantastic and a much more Pythonic approach, thank you. I'm going to add improving this old code to my never ending to-do list.

6. Thank you for this very bright presentation. It will really help teachers, students and professionals !

• Pleasure!

7. Wow, your post on regression analysis is so great! First, I got to learn enough theory and then many methods for conducting the linear regression. Enjoyed it very much. I hope that I will be able to apply regression with Python to my data on decision making (from a psychological perspective, i.e., behavioural data).

Thanks again,

Python IS so great,

Regards,

Freddy

8. Very comprehensible code and explanation. The only problem is that some errors come up when it runs in my browser with IPython, like the dataframe object has no attribute data.load_data().


10. There is a small typo in run_ordinary_least_squares(): ols_prices should be ols_data.
