# Dimensionality Reduction Techniques

The curse of dimensionality is the phenomena whereby an increase in the dimensionality of a data set results in exponentially more data being required to produce a representative sample of that data set. To combat the curse of dimensionality, numerous linear and non-linear dimensionality reduction techniques have been developed. These techniques aim to reduce the number of dimensions (variables) in a data set through either feature selection or feature extraction without significant loss of information. Feature extraction is the process of transforming the original data set into a data set with fewer dimensions. Two well known, and closely related, feature extraction techniques are Principal Component Analysis (PCA) and Self Organizing Maps (SOM). One can think of dimensionality reduction like a system of aqueducts to make sense of a river of data.

The Curse of Dimensionality - With one dimension (top left) there are only 10 possible positions therefore 10 datum are required to create a representative sample which 'covers' the problem space. With two dimensions there are 10^2 = 100 possible positions therefore 100 datum are required to create a representative sample which 'covers' the problem space. With just three dimensions there are now 10^3 = 1000 possible positions therefore 1000 datum are required to create a representative sample which 'covers' the problem space ... Whilst hard to illustrate graphically, this exponential growth in the required number of datum continues to grow exponentially indefinitely [Image Source]

Dimensionality reduction techniques are also used to reduce two undesired characteristics in data namely noise (variance) and redundancy (highly correlated variables). In the context of quantitative and computational finance dimensionality reduction techniques have been used for,

Discussing each is beyond the scope of this post. That having been said, I invite the reader to follow the above links to write-ups on the applications of dimensionality reduction.

The remainder of this article draws much inspiration from a fantastic and succinct tutorial by Professor Alexander Gorban from the University of Leicester.

### Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical algorithm used to transform a set of possibly correlated variables into a set of uncorrelated linear re-combinations of those variables called principal components. Put simply, the principal components, $Y$, are linear re combinations of the variables in our data set, $\textbf{X}$, where the weights, $\textbf{e}_j^T$, are derived from the eigenvectors of the co-variance or correlation matrix of our data set. Symbolically this is expressed as,

$Y_{ij} = \textbf{e}_j^T \textbf{X} = e_{1j} X_{i1} + e_{2j} X_{i2} + ... + e_{pj}X_{ip}$

The first principal component is the straight line that minimizes the sum of the squared distances from the data points. It is the least squares approximation of the data set by a single line. Therefore, the first principal component explains the highest amount of variance from the data set. The residuals are then extracted from the data-set and the next principal component is calculated. As such, each successive component explains less of the variance,

$Var(Y_1) >Var(Y_2) >... >Var(Y_p)$

It is often sufficient to select the first $k$ principal components, where the amount of variance explained by the principal components $Y_1, ..., Y_k$, is greater than or equal to some threshold,

$\sum^k_{i=1} Var(Y_i) \geq 95\%$

Thereby reducing the dimensionality of the problem from the number of variables in $\textbf{X}$, $m$, to the number of principal components, $k$. Some challenges exist when using PCA. Firstly, the algorithm is sensitive to the scale of the variables in the data set, therefore it is advisable to perform mean centering and rather use the correlation matrix of $\textbf{X}$ as it is normalized. Another challenge with PCA is that it is inherently linear. Non-linear adaptations of PCA exist including non-linear PCA and Kernel PCA which exploits the awesomeness of the Kernel Trick:

For an example application of PCA to a financial data set in WEKA, a popular Java data mining software, scroll down the appendix.

### Self Organizing Maps (SOM)

Self Organizing Maps (SOMs) were originally invented by Kohonen in the mid 1990's and are also sometimes referred to as Kohonen Networks. A SOM is a multi-dimensional scaling technique which constructs an approximation of the probability density function of some underlying data set, $\textbf{X}$, which also preserves the topological structure of that data set.

This is done by mapping input vectors, $\textbf{x}_i$, in the data set, $\textbf{X}$, to weight vectors, $\textbf{w}_j$, (neurons) in the feature map, $\textbf{W}$. Preserving the topological structure simply means that if two input vectors are close together in $\textbf{X}$, then the neurons to which those input vectors map in $\textbf{W}$ will also be close together. This is the characteristics of SOMs.

Illustration of the mapping done by a Self Organizing Map between the input vectors (input layer) and the weight vectors (feature map). [Source]

The stochastic and batch mapping (learning) algorithms work as follows,

STOCHASTIC LEARNING

• foreach $\textbf{w}_i \in \textbf{W}$
• Initialize($\textbf{w}_i$) [2]
• while stopping condition(s) not true:
• Select random input vector $\textbf{x}_i \in \textbf{X}$
• Find the best matching neuron for $\textbf{x}_i$ [3]
• foreach $\textbf{w}_i \in \textbf{W}$:
• Update $\textbf{w}_i$[1]
• end for
• end

BATCH LEARNING

• foreach $\textbf{w}_i \in \textbf{W}$
• Initialize($\textbf{w}_i$) [2]
• while stopping condition(s) not true:
• foreach $\textbf{w}_i \in \textbf{W}$:
• $X_i \leftarrow \emptyset$
• foreach $\textbf{x}_i \in \textbf{X}$:
• if $\textbf{x}_i$ is in neighborhood of $\textbf{w}_i$[3]
• Add $\textbf{x}_i$ to $X_i$
• end if
• end for
• end for
• foreach $\textbf{w}_i \in \textbf{W}$:
• $\textbf{w}_i \leftarrow$ update mean over $X_i$
• end for
• end

[1] The update rule for each weight vector is:

$\textbf{w}_i(t+1) = \textbf{w}_i(t) + (\Delta w_i1(t), ..., \Delta w_{ij}(t))$

where $\Delta w_{ij}(t) = h_{k, ij}(t)[\textbf{x}_i - \textbf{w}_i (t)]$

and $h_{k, ij}$ is the neighborhood function of the best matching neuron, $\textbf{w}_k$, $i$ is the weight vector being updated, and $j$ is the dimension in the weight vector being updated.

[2] Various initialization strategies exist. Three popular strategies include (1) assigning random weights to the weight vectors, (2) assigning the first $n$ input vectors to the weight vectors, where $n$ is the number of neurons in the SOM, and (3) randomly sampling weights in the subspace between the first and second principal components for the data set. It is best to test different initialization strategies empirically when developing the model.

[3] The best matching neuron is the one with the smallest euclidean distance from the input vector. The neighborhood function of the best matching unit are the neurons which get updated with the best matching neuron (but to a smaller degree). One popular neighborhood function is the Gaussian Kernel function. The radius of the neighborhood function and the learning rate should be monotonically decreasing functions of time.

If the number of neurons in the SOM is less than the number of patterns in the data set, then we will have reduced the dimensionality of the data set ... but not the dimensionality of the input or weight vectors. As such, the type of dimensionality reduction performed by a SOM is different to that done by PCA and SOM is actually more similar to a clustering algorithm such as K-means Clustering.

The difference between SOM and clustering, however, is that a clustering of a data set will (generally speaking) preserve the probability density function of the data set, but not the topological structure of the data set. This makes SOMs especially useful for visualization. By defining a secondary function which transforms a given weight vector into a colour (for example) we are able to visualize the topology, similarity, and the probability density function of the underlying data set in a lower dimension (usually two dimensions because of the grid).

Illustration of the mapping done by the SOM as well as assigning a colour to a particular weight vector for visualization of similarity [Image Source]

An even cooler illustration of the resulting mapping done by a SOM where a colour was assigned for similarity and a height was assigned for density [Image Source]

As with PCA, SOM also suffers from some issues. Firstly, the number of neurons needs to be specified upfront; if too few are specified then heterogeneous classes of input vectors may be mapped to the same neuron; if too many are specified homogeneous input vectors may be mapped to different neurons. To deal with this, the growing self organizing map was developed. Another challenge is the computational overhead of finding the best matching neuron for an input vector, shortcut winner search was developed in response to this.

### Appendix: Application of PCA

"Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes." [Source]

One of the features in WEKA is a tool for selecting attributes and performing dimensionality reduction. One of the supported algorithms is Principal Component Analysis. This example applies PCA to a .CSV file containing twelve correlated technical indicators. Redundancy is one of the qualities of data which causes models (especially machine learning models) to over-fit.

Correlation Matrix Technical Indicators

If we load this into WEKA we will see some basic descriptive statistics of the data set including histograms for each variable (technical indicator), and their minimum value, maximum value, mean sample statistic, and standard deviation sample statistic.

In the select attributes tab, choose the Principal Components attribute evaluator and WEKA will automatically select the Ranker search method.

After clicking start, WEKA extracts the first five principal components. As can be seen, the correlation coefficients of the first three principal components to the closing price are 0.6224, 0.3660, and 0.1643 respectively. Knowing PCA, these three components are uncorrelated and should, in theory at least, contain different information about the movement of the index.

Please note that the above trivial example is intended simply as an illustration of WEKA's capabilities and nothing more.

1. Stu, as always, a great article. I love the illustrations you used, and particularly the use of SOM to identify similarities in trading strategies and signals - such a perfect use case!

• Thanks Simon! I appreciate the kind words 🙂

In the future, I would like to post some specific applications and / or tutorials of Self Organizing Maps. There is a lot of potential to use them on large financial data sets.

Let me know if you want to collaborate!

2. Hello Stuart,
Your blog is very informative and helps me to proceed with my research confidently. Very simple concept explanation with suitable examples. I am totally inspired to do more research in computational finance.

• Hi Renu, thanks! I am very happy to hear that :), what is your area of research?

3. For the love of God, keep writing these articles.

4. Hi Stuart,

I have a question. can you explain how I can use the first three components while using these data for classification.