Dimensionality Reduction Techniques
The curse of dimensionality is the phenomena whereby an increase in the dimensionality of a data set results in exponentially more data being required to produce a representative sample of that data set. To combat the curse of dimensionality, numerous linear and nonlinear dimensionality reduction techniques have been developed. These techniques aim to reduce the number of dimensions (variables) in a data set through either feature selection or feature extraction without significant loss of information. Feature extraction is the process of transforming the original data set into a data set with fewer dimensions. Two well known, and closely related, feature extraction techniques are Principal Component Analysis (PCA) and Self Organizing Maps (SOM). One can think of dimensionality reduction like a system of aqueducts to make sense of a river of data.
Dimensionality reduction techniques are also used to reduce two undesired characteristics in data namely noise (variance) and redundancy (highly correlated variables). In the context of quantitative and computational finance dimensionality reduction techniques have been used for,
 Hedging fixed income portfolios
 Constructing interest rate models (see section 3.3)
 Developing asset allocation models (see alternative methods)
 Identifying trading strategies & signals (an interesting SOM application)
Discussing each is beyond the scope of this post. That having been said, I invite the reader to follow the above links to writeups on the applications of dimensionality reduction.
The remainder of this article draws much inspiration from a fantastic and succinct tutorial by Professor Alexander Gorban from the University of Leicester.
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a statistical algorithm used to transform a set of possibly correlated variables into a set of uncorrelated linear recombinations of those variables called principal components. Put simply, the principal components, , are linear re combinations of the variables in our data set, , where the weights, , are derived from the eigenvectors of the covariance or correlation matrix of our data set. Symbolically this is expressed as,
The first principal component is the straight line that minimizes the sum of the squared distances from the data points. It is the least squares approximation of the data set by a single line. Therefore, the first principal component explains the highest amount of variance from the data set. The residuals are then extracted from the dataset and the next principal component is calculated. As such, each successive component explains less of the variance,
It is often sufficient to select the first principal components, where the amount of variance explained by the principal components , is greater than or equal to some threshold,
Thereby reducing the dimensionality of the problem from the number of variables in , , to the number of principal components, . Some challenges exist when using PCA. Firstly, the algorithm is sensitive to the scale of the variables in the data set, therefore it is advisable to perform mean centering and rather use the correlation matrix of as it is normalized. Another challenge with PCA is that it is inherently linear. Nonlinear adaptations of PCA exist including nonlinear PCA and Kernel PCA which exploits the awesomeness of the Kernel Trick:
For an example application of PCA to a financial data set in WEKA, a popular Java data mining software, scroll down the appendix.
Self Organizing Maps (SOM)
Self Organizing Maps (SOMs) were originally invented by Kohonen in the mid 1990's and are also sometimes referred to as Kohonen Networks. A SOM is a multidimensional scaling technique which constructs an approximation of the probability density function of some underlying data set, , which also preserves the topological structure of that data set.
This is done by mapping input vectors, , in the data set, , to weight vectors, , (neurons) in the feature map, . Preserving the topological structure simply means that if two input vectors are close together in , then the neurons to which those input vectors map in will also be close together. This is the characteristics of SOMs.
The stochastic and batch mapping (learning) algorithms work as follows,
STOCHASTIC LEARNING
 foreach
 Initialize() [2]
 while stopping condition(s) not true:
 Select random input vector
 Find the best matching neuron for [3]
 foreach :
 Update [1]
 end for
 end
BATCH LEARNING
 foreach
 Initialize() [2]
 while stopping condition(s) not true:
 foreach :
 foreach :
 if is in neighborhood of : [3]
 Add to
 end if
 if is in neighborhood of : [3]
 end for
 end for
 foreach :
 update mean over
 end for
 foreach :
 end
[1] The update rule for each weight vector is:
where
and is the neighborhood function of the best matching neuron, , is the weight vector being updated, and is the dimension in the weight vector being updated.
[2] Various initialization strategies exist. Three popular strategies include (1) assigning random weights to the weight vectors, (2) assigning the first input vectors to the weight vectors, where is the number of neurons in the SOM, and (3) randomly sampling weights in the subspace between the first and second principal components for the data set. It is best to test different initialization strategies empirically when developing the model.
[3] The best matching neuron is the one with the smallest euclidean distance from the input vector. The neighborhood function of the best matching unit are the neurons which get updated with the best matching neuron (but to a smaller degree). One popular neighborhood function is the Gaussian Kernel function. The radius of the neighborhood function and the learning rate should be monotonically decreasing functions of time.
If the number of neurons in the SOM is less than the number of patterns in the data set, then we will have reduced the dimensionality of the data set ... but not the dimensionality of the input or weight vectors. As such, the type of dimensionality reduction performed by a SOM is different to that done by PCA and SOM is actually more similar to a clustering algorithm such as Kmeans Clustering.
The difference between SOM and clustering, however, is that a clustering of a data set will (generally speaking) preserve the probability density function of the data set, but not the topological structure of the data set. This makes SOMs especially useful for visualization. By defining a secondary function which transforms a given weight vector into a colour (for example) we are able to visualize the topology, similarity, and the probability density function of the underlying data set in a lower dimension (usually two dimensions because of the grid).
As with PCA, SOM also suffers from some issues. Firstly, the number of neurons needs to be specified upfront; if too few are specified then heterogeneous classes of input vectors may be mapped to the same neuron; if too many are specified homogeneous input vectors may be mapped to different neurons. To deal with this, the growing self organizing map was developed. Another challenge is the computational overhead of finding the best matching neuron for an input vector, shortcut winner search was developed in response to this.
Appendix: Application of PCA
"Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization. It is also wellsuited for developing new machine learning schemes." [Source]
One of the features in WEKA is a tool for selecting attributes and performing dimensionality reduction. One of the supported algorithms is Principal Component Analysis. This example applies PCA to a .CSV file containing twelve correlated technical indicators. Redundancy is one of the qualities of data which causes models (especially machine learning models) to overfit.
If we load this into WEKA we will see some basic descriptive statistics of the data set including histograms for each variable (technical indicator), and their minimum value, maximum value, mean sample statistic, and standard deviation sample statistic.
In the select attributes tab, choose the Principal Components attribute evaluator and WEKA will automatically select the Ranker search method.
After clicking start, WEKA extracts the first five principal components. As can be seen, the correlation coefficients of the first three principal components to the closing price are 0.6224, 0.3660, and 0.1643 respectively. Knowing PCA, these three components are uncorrelated and should, in theory at least, contain different information about the movement of the index.
Please note that the above trivial example is intended simply as an illustration of WEKA's capabilities and nothing more.

Stu, as always, a great article. I love the illustrations you used, and particularly the use of SOM to identify similarities in trading strategies and signals  such a perfect use case!

Hello Stuart,
Your blog is very informative and helps me to proceed with my research confidently. Very simple concept explanation with suitable examples. I am totally inspired to do more research in computational finance. 
For the love of God, keep writing these articles.

Hi Stuart,
I have a question. can you explain how I can use the first three components while using these data for classification.
thank you for your responding
Comments