
Turing Finance | October 9, 2016




Monte Carlo K-Means Clustering of Countries

In the first part of this three-part series, What Drives Real GDP Growth?, I identified four themes which drive real GDP growth. These themes are based on 19 socioeconomic indicators whose average Spearman and Pearson correlations to real GDP growth were statistically significant and greater than 0.2 across the 188 country sample.

In this article the 188 countries are clustered based on those 19 socioeconomic indicators using a Monte Carlo K-Means clustering algorithm implemented in Python. Clustering can help to reduce the amount of work required to identify attractive investment opportunities by grouping similar countries together and generalizing about them.

This article goes into some detail on distance metrics, classes of clustering algorithms, the K-Means clustering algorithm, and measures of clustering quality, before discussing the results from clustering the countries and drawing conclusions. If you are already comfortable with the theory, feel free to jump to the sections which sound interesting.

  1. Clustering theory - Measures of similarity and distance
  2. Clustering theory - Classes of clustering algorithms
  3. Clustering theory - The K-Means clustering algorithm
  4. Clustering theory - Measures of clustering quality
  5. Clustering theory - Monte Carlo Methods in clustering
  6. Clustering results - Visualizations and centroid analysis
  7. Clustering results - Conclusions and further research
  8. Appendix A - All the code and some data!

Clustering Theory - Measures of Similarity & Distance

Clustering is the process of partitioning a set of heterogeneous (different) objects into subsets of homogeneous (similar) objects. At the heart of cluster analysis is the assumption that given any two objects you can quantify the similarity or dissimilarity between those objects. In continuous search spaces distance measures similarity.

Below I have written about similarity metrics for continuous search spaces. For each I have included a formula (given two vectors, \textbf{p} and \textbf{q}) and Python code. All the Python code used to write this article is available.

Manhattan Distance

An interesting paper entitled "On the Surprising Behavior of Distance Metrics in High Dimensional Space" studied different distance metrics in high dimensional spaces and found that Manhattan distance and fractional distances were preferable to the more traditional Euclidean distance for clustering.

Manhattan distance is equal to the sum of the absolute differences between the corresponding coordinates of the two vectors,

Manhattan Distance,

d(\textbf{p}, \textbf{q})=\sum^n_{i=1}|p_i-q_i|

Euclidean Distance

Euclidean distance is the ordinary distance between two points or vectors. Some derivatives of Euclidean distance include squared Euclidean distance and half squared Euclidean distance,

Euclidean Distance,

d(\textbf{p}, \textbf{q})=\sqrt{\sum^n_{i=1}(p_i - q_i)^2}


Squared Euclidean Distance,

d(\textbf{p}, \textbf{q})=\sum^n_{i=1}(p_i - q_i)^2


Half Squared Euclidean Distance,

d(\textbf{p}, \textbf{q})=\frac{1}{2}\sum^n_{i=1}(p_i - q_i)^2

Fractional Distance

Interestingly, Manhattan distance, Euclidean distance, and fractional distances are all closely related and are often described in terms of L-norms. Given the following distance equation,


d^f(\textbf{p}, \textbf{q})=\Bigg[\sum^n_{i=1}\big|p_i - q_i\big|^f\Bigg]^{\frac{1}{f}}

Notice that when f=1 we get the same value as the Manhattan distance, and when f=2 we get the same value as the Euclidean distance. Now, if Manhattan distances perform better than Euclidean distances in high dimensional spaces, it implies that for values of f < 1 performance may improve further still. The paper mentioned earlier found that at 20 dimensions the performance of the fractional distance metric converged for f < 0.3.

During this study I had no luck using the Fractional Distance metric. My (unproved) hypothesis is that the performance of the Fractional Distance metric is sensitive to the scale of the values in the vectors being clustered.

Alternative Distance Metrics

One can also use correlations as a measure of similarity for continuous valued search spaces. Popular measurements include the Pearson coefficient, cosine similarity, and the Tanimoto coefficient. If your data is not continuous, i.e. it is discrete, then you can either use a function to convert it to a continuous space or use metrics such as Hamming Distance, the Jaccard Index, or Levenshtein Distance (for strings) to measure similarity.

Python Code - Distance Metrics

The Similarity class written below contains implementations of all the distance metrics mentioned above. The default value for fractional distances is currently set to 2.0 (Euclidean distance) but can be adjusted.
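The full Similarity class is linked in Appendix A. A minimal sketch of what such a class might look like is shown below; the class and method names here are illustrative assumptions and the vectors are taken to be numpy arrays:

```python
import numpy as np


class Similarity:
    """Sketch of the distance metrics discussed above (illustrative only;
    the full implementation linked in Appendix A may differ in detail)."""

    def __init__(self, fractional_power=2.0):
        # A default of 2.0 reduces the fractional distance to Euclidean distance.
        self.fractional_power = fractional_power

    def manhattan_distance(self, p, q):
        # Sum of the absolute coordinate-wise differences.
        return np.sum(np.abs(p - q))

    def euclidean_distance(self, p, q):
        # Ordinary straight-line distance between the two vectors.
        return np.sqrt(np.sum((p - q) ** 2))

    def squared_euclidean_distance(self, p, q):
        return np.sum((p - q) ** 2)

    def half_squared_euclidean_distance(self, p, q):
        return 0.5 * np.sum((p - q) ** 2)

    def fractional_distance(self, p, q):
        # Generalized L-norm: f = 1 gives Manhattan, f = 2 gives Euclidean.
        f = self.fractional_power
        return np.sum(np.abs(p - q) ** f) ** (1.0 / f)
```

For example, Similarity(fractional_power=0.5).fractional_distance(p, q) computes the fractional distance with f = 0.5.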



Clustering Theory - Classes of Clustering Algorithms

The two main classes of clustering algorithms are hierarchical clustering and partitional clustering. Hierarchical clustering forms clusters by merging small clusters into larger ones or dividing large clusters into smaller ones. Partitional clustering forms clusters by dividing the input data set into mutually exclusive sub sets.

The differences between hierarchical and partitional clustering mostly have to do with the inputs required. Hierarchical clustering only requires a similarity measure, whereas partitional clustering may require a number of additional inputs, most commonly the number of clusters, k. Generally speaking, hierarchical clustering algorithms are also better suited to categorical data.

Hierarchical Clustering

There are two types of hierarchical clustering namely agglomerative clustering and divisive clustering. Agglomerative clustering is a bottom-up approach and involves merging smaller clusters (each input pattern by itself) into larger clusters. Divisive clustering is a top down approach which starts with one large cluster (all of the input patterns) and divides them into smaller and smaller clusters until each input pattern is by itself in a cluster.

Partitional Clustering

In this article we will focus on partitional clustering algorithms. The two main categories of partitional clustering algorithms are centroid-based clustering and density-based clustering. This article focuses on centroid-based clustering; in particular the popular K-means clustering algorithm.


Clustering Theory - The K-Means Clustering Algorithm

The K-Means Clustering algorithm is a centroid-based partitional clustering algorithm which works using the mean-shift heuristic. The K-Means clustering algorithm consists of three steps (Initialization, Assignment, and Update). These steps are repeated until either the clustering has converged or the number of iterations has been exceeded, i.e. the computational budget has been exhausted.


A set of k centroids is randomly initialized in the search space. These centroids must be in the same order of magnitude as the data patterns being clustered. In other words, it does not make sense to initialize random vectors with values between 0 and 1 if the values in the data patterns are between 0 and 100. To remedy this the data can be normalized between 0 and 1 using,

X_{i}' = \frac{X_i - \min(X)}{\max(X) - \min(X)} \in [0, 1]

K-Means Clustering Initialization

Note: make sure you normalize your data across each attribute, not each pattern
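As a sketch, assuming the data set is a two-dimensional numpy array with one row per pattern and one column per attribute, the normalization might look like this (the function name is illustrative, not the article's original code):

```python
import numpy as np


def normalize_min_max(data):
    """Scale each attribute (column) of the data set into [0, 1]."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)   # minimum of each attribute
    col_max = data.max(axis=0)   # maximum of each attribute
    # Guard against constant columns to avoid division by zero.
    col_range = np.where(col_max - col_min == 0.0, 1.0, col_max - col_min)
    return (data - col_min) / col_range
```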


Once the centroids have been randomly initialized in the space, we iterate through each pattern in the data set and assign it to the closest centroid. Try to do this step in parallel, especially if you have a very large number of patterns in the data set.

K-Means Clustering Assignment
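A minimal sketch of the assignment step, assuming patterns and centroids are numpy arrays and distance is one of the metrics from the Similarity class sketched earlier:

```python
import numpy as np


def assign_patterns(patterns, centroids, distance):
    """Return, for each pattern, the index of its closest centroid."""
    assignments = []
    for z in patterns:
        # Distance from this pattern to every centroid; keep the closest.
        distances = [distance(z, c) for c in centroids]
        assignments.append(int(np.argmin(distances)))
    return np.array(assignments)
```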


Once the patterns have been assigned to their centroids, the mean-shift heuristic is applied. This heuristic replaces each value in each centroid with the mean of that value over the patterns which have been assigned to that centroid. This shifts the centroid towards the high-dimensional mean of the patterns which belong to it. The problem with the mean-shift heuristic is that it is sensitive to outliers. To overcome this one can either use the K-medoids clustering algorithm or one can standardize the data to dampen the effect of outliers using,

X_{i, 1\sigma} = \frac{X_i - \mu_{X}}{\sigma_{X}}

K-Means Clustering Update
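A sketch of the mean-shift update, using the assignments produced by the function above (illustrative, not the article's original code):

```python
import numpy as np


def update_centroids(patterns, assignments, centroids):
    """Replace each centroid with the mean of the patterns assigned to it."""
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = patterns[assignments == k]
        if len(members) > 0:
            # Mean-shift heuristic: move the centroid to the mean of its members.
            new_centroids[k] = members.mean(axis=0)
        # An empty cluster keeps its previous centroid in this sketch.
    return new_centroids
```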


The assignment and update steps are then repeated over a number of iterations until the clustering converges on a solution. A really nice GIF showing this process is shown below,

K-Means Clustering Gif

Python Code - Additions to the Clustering class

The Python methods below are extensions to the Clustering class which allow it to perform the K-Means clustering algorithm. This involves assigning each pattern to its nearest centroid and updating the centroids using the mean-shift heuristic.
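Tying the three steps together, a single K-Means run might look like the following sketch, which reuses the assign_patterns and update_centroids helpers above and assumes the patterns have already been normalized into [0, 1]:

```python
import numpy as np


def k_means(patterns, k, distance, iterations=100, seed=None):
    """One run of K-Means: initialize, then alternate assignment and update."""
    rng = np.random.default_rng(seed)
    # Random initialization of k centroids inside the unit hypercube.
    centroids = rng.random((k, patterns.shape[1]))
    assignments = np.zeros(len(patterns), dtype=int)
    for _ in range(iterations):
        assignments = assign_patterns(patterns, centroids, distance)
        new_centroids = update_centroids(patterns, assignments, centroids)
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return centroids, assignments
```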



Clustering Theory - Measures of Clustering Quality

Assuming you have a measure of similarity and a clustering of the data, you still need an objective function that measures the quality of that clustering. Most clustering quality metrics try to optimize the clustering in terms of its inter- and intra-cluster distances. Put simply, these metrics try to ensure that patterns in the same cluster are close together and patterns in different clusters are far apart.

Quantization Error

Quantization error measures the round off error introduced by quantization i.e. mapping a set of input values to a finite smaller set. This is basically what we are doing by clustering n patterns into k clusters.

Quantization error can be calculated using the following very ugly formula,

Q_T=\frac{\sum^{|Z|}_{i=1}||\textbf{z}_i - \textbf{c}_{z_i}||_2^2}{|Z|}.

where \textbf{Z} is the set of data points we are clustering and ||x||_2^2 just means we take the l^2 norm of x and square it. Written this way, the quantization error assumes we are using the Euclidean distance metric.

Alternatively we can calculate quantization error using the following less ugly formula,

Q_T=\sum^n_{i=1} \bigg(\big|d(\textbf{z}_i, \textbf{c}_{z_i})\big|\bigg)^2 \div n

where n = |Z| i.e. the number of patterns in the data set, \textbf{z}_i is the i^{th} pattern in Z, \textbf{c}_{z_i} is the centroid to which \textbf{z}_i is closest, and \big|d(\textbf{z}_i, \textbf{c}_{z_i})\big| is the absolute distance between \textbf{z}_i and \textbf{c}_{z_i} using some distance metric, d().

Quantization Error

Note: The image also assumes we are using the Manhattan distance. 

In the above illustration of the Quantization Error we calculate the squared absolute distance between every pattern and the centroid it has been assigned to, sum these, and divide by the number of patterns.
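A sketch of the quantization error using the notation above, assuming assignments maps each pattern to the index of its closest centroid as in the K-Means sketch earlier:

```python
def quantization_error(patterns, centroids, assignments, distance):
    """Average squared distance between each pattern and its assigned centroid."""
    total = 0.0
    for z, k in zip(patterns, assignments):
        total += abs(distance(z, centroids[k])) ** 2
    return total / len(patterns)
```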

Davies–Bouldin index

The Davies-Bouldin criterion is based on a ratio of intra-cluster and inter-cluster distances of a particular clustering. Mathematically the Davies-Bouldin index is calculated as,

DB=\frac{1}{k}\sum^k_{i=1}\max_{j \neq i}\big\{D_{ij}\big\} \quad \textrm{where} \quad D_{ij}=\frac{\bar{d_i}+\bar{d_j}}{d_{ij}}


D_{ij} is the intra-to-inter cluster distance ratio for the i^{th} and j^{th} clusters; \bar{d_i} is the average distance between each pattern in the i^{th} cluster and the centroid of that cluster (and likewise for \bar{d_j}); and d_{ij} is the distance between the i^{th} and j^{th} centroids. The Davies-Bouldin index is the average, over all k clusters, of the worst intra-to-inter cluster distance ratio for each cluster.

Davies Bouldin Index

Note: The image assumes we are using the Manhattan distance. 

In the above illustration of the Davies-Bouldin index we have three clusters consisting of three patterns each. The first step is to calculate the average intra-cluster distances for each cluster, d(\textrm{A}), d(\textrm{B}), \textrm{and } d(\textrm{C}). The next step is to calculate the distances between each pair of centroids, d(\textrm{A}, \textrm{B}), d(\textrm{B}, \textrm{C}), \textrm{and }d(\textrm{A}, \textrm{C}). Lastly we return the maximum ratio of the sum of any pair of intra-cluster distances to the distance between those two centroids.
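A sketch of the Davies-Bouldin index as described above; this is illustrative code, not the article's original implementation:

```python
import numpy as np


def davies_bouldin(patterns, centroids, assignments, distance):
    """Average, over all clusters, of the worst intra-to-inter distance ratio."""
    k = len(centroids)
    # Average distance from each cluster's patterns to its own centroid
    # (zero for an empty cluster in this sketch).
    intra = np.zeros(k)
    for i in range(k):
        members = patterns[assignments == i]
        if len(members) > 0:
            intra[i] = np.mean([distance(z, centroids[i]) for z in members])
    worst_ratios = []
    for i in range(k):
        ratios = [(intra[i] + intra[j]) / distance(centroids[i], centroids[j])
                  for j in range(k) if j != i]
        worst_ratios.append(max(ratios))
    return float(np.mean(worst_ratios))
```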

Silhouette Index

The Silhouette Index is one of the most popular ways of measuring the quality of a particular clustering. It measures how similar each pattern is to the patterns in its own cluster compared to the patterns in other clusters.

s_i = \frac{\bar{d'_{z_i}}-\bar{d_{z_i}}}{\max\big\{\bar{d'_{z_i}},\bar{d_{z_i}}\big\}}

For a pattern z_i, \bar{d'_{z_i}} is the minimum average distance from pattern z_i to the patterns in a different cluster (minimized over the other clusters), and \bar{d_{z_i}} is the average distance from pattern z_i to the other patterns in the same cluster as z_i.

A high silhouette value indicates that z_i is well-matched to its own cluster and poorly-matched to neighboring clusters. One should aim to maximize s_i for each pattern in the data set.

Silhouette Index

Note: The image also assumes we are using the Manhattan distance. 

In the above illustration of the Silhouette Index we have the same three clusters, each consisting of the same three patterns, as in the previous image. The image shows how to calculate the Silhouette Index for just patterns z_1 and z_9. For both we first calculate the average distance from the pattern to the other patterns in its own cluster, d(z_1) and d(z_9). Next we calculate the average distance from each pattern to the patterns of the next closest cluster, d'(z_1) and d'(z_9). For each pattern the silhouette value is then the difference between the average distance to the patterns of the other cluster and the average distance to the patterns of its own cluster, divided by the larger of those two numbers. To calculate the Silhouette Index of the whole clustering we calculate the silhouette value for every pattern and average them.
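A naive sketch of the Silhouette Index follows; it computes the full set of pairwise distances rather than memoizing them, which illustrates the computational cost complained about in the next section:

```python
import numpy as np


def silhouette_index(patterns, assignments, distance):
    """Mean silhouette value over all patterns (naive, un-memoized version)."""
    labels = np.unique(assignments)
    n = len(patterns)
    values = []
    for i in range(n):
        z, own = patterns[i], assignments[i]
        # Average distance to the other patterns in the same cluster.
        same = [distance(z, patterns[j]) for j in range(n)
                if j != i and assignments[j] == own]
        if not same:
            values.append(0.0)  # convention for a pattern alone in its cluster
            continue
        a = np.mean(same)
        # Smallest average distance to the patterns of any other cluster.
        b = min(np.mean([distance(z, patterns[j]) for j in range(n)
                         if assignments[j] == other])
                for other in labels if other != own)
        values.append((b - a) / max(a, b))
    return float(np.mean(values))
```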

The good, the bad, and the ugly

After having spent the last few months using these metrics I have concluded that none of them are perfect,

  1. Quantization Error - the computational complexity of this metric is the smallest; however, the metric is biased towards a large number of clusters because as you add more centroids the clusters become smaller (more compact), and in the extreme you may have one pattern assigned to each centroid, in which case the quantization error is minimized. That said, the results were the most believable using this metric. The Good.
  2. Davies-Bouldin - as you increase the value of k the distances between the centroids naturally decrease on average. Because this term is in the denominator, you end up dividing by a smaller number for larger values of k. The result is that the metric is biased towards solutions with a smaller number of clusters. The Bad.
  3. Silhouette Index - the computational complexity of this metric is crazy. Assuming you calculate the distance from each pattern \textbf{z}_i to every other pattern in order to work out which cluster is the closest, and you do that for each pattern, you need to calculate |\textbf{Z}| \times (|\textbf{Z}|-1) distances. In this example that equates to 35,156 calculations. If you memoize perfectly, the number of calculations for each pass is halved. The metric also seems to be biased against solutions with a larger number of clusters because, as you might have guessed, the distances between the centroids shrink, so each pattern is more likely to appear to belong to a neighboring cluster. The Ugly.

The following analysis of the different metrics demonstrates the biases quite nicely; despite the fact that they are supposed to be measuring the same thing they are almost perfectly negatively correlated.

        QE        DB        SI
QE    1.000    -0.965    -0.894
DB   -0.965     1.000     0.949
SI   -0.894     0.949     1.000

Clustering Quality

Python Code - Clustering

Before you can evaluate the fitness of a given clustering you need to actually cluster the patterns. The Clustering class contains methods which assign patterns to their nearest centroids.


Python Code - Objective Functions

The ClusteringQuality class measures the quality of a given Clustering of the input patterns.


Clustering Theory - Monte Carlo Methods in Clustering

The two biggest problems with the K-Means clustering algorithm are that:

  1. It is sensitive to the random initialization of the centroids, and
  2. It is sensitive to the number of centroids initialized, k.

For these reasons the K-Means clustering algorithm is often restarted multiple times. Because the initialization is (usually) random, we are essentially sampling random high-dimensional starting positions for the centroids, which is also known as a Monte Carlo simulation. In order to compare solutions across independent simulations, we need a measure of clustering quality such as those discussed previously.

Deterministic Initialization

I said that the initialization is usually random because there are deterministic initialization techniques for the K-Means clustering algorithm including grid-based initialization and principal component based initialization.

Stochastic Initialization

For stochastic (random) initialization of the centroids one can use either pseudo random numbers or quasi random numbers. The difference is that the next number in a pseudo random sequence is independent of the previous numbers, whereas in a quasi random sequence the next number depends on the previous numbers. This dependence allows quasi random sequences to cover more of the surface of the search space.

pseudo vs quasi random numbers

Comparison of a pseudo random sequence (left) and a quasi random sequence (right) in a two-dimensional space
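There are several low-discrepancy sequences one could use for quasi random initialization; as a sketch, a simple Halton sequence (built from van der Corput sequences with a different prime base per dimension) can be generated as follows. This is an illustrative example, not necessarily the sequence used to produce the image above:

```python
import numpy as np


def halton_sequence(n_points, dimensions):
    """Generate a quasi random (low-discrepancy) Halton sequence in [0, 1]^d."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
              53, 59, 61, 67, 71]  # one base per dimension (up to 20 here)

    def van_der_corput(index, base):
        # Reverse the base-`base` digits of `index` around the decimal point.
        result, fraction = 0.0, 1.0 / base
        while index > 0:
            result += (index % base) * fraction
            index //= base
            fraction /= base
        return result

    points = np.empty((n_points, dimensions))
    for d in range(dimensions):
        points[:, d] = [van_der_corput(i + 1, primes[d]) for i in range(n_points)]
    return points
```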

Picking the right k

In addition to testing different initializations, we can test different values of k in a Monte Carlo framework. Currently there is no optimal way of dynamically determining the right number of clusters, although techniques for determining the right value of k are always being researched. I prefer to empirically try different values of k and compare the results, although this is time-consuming, especially on large data sets.

Monte Carlo Method - Code

The Monte Carlo simulation used to analyze the Clustering Quality results is shown below,
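The original simulation code is linked in Appendix A. A simplified sketch of the Monte Carlo driver, reusing the k_means and quantization_error sketches above and keeping the clustering with the lowest quantization error, might look like this:

```python
def monte_carlo_k_means(patterns, k_values=(6, 7, 8), simulations=1000,
                        distance=None, seed=0):
    """Restart K-Means many times per value of k; keep the best clustering."""
    if distance is None:
        distance = Similarity().euclidean_distance
    best = None  # (quantization error, k, centroids, assignments)
    for k in k_values:
        for sim in range(simulations):
            centroids, assignments = k_means(patterns, k, distance,
                                             seed=seed + sim)
            qe = quantization_error(patterns, centroids, assignments, distance)
            if best is None or qe < best[0]:
                best = (qe, k, centroids, assignments)
    return best
```

This sketch only tracks the quantization error; the study also compared the Davies-Bouldin and Silhouette values across runs, which is how the biases of the three metrics were analyzed.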



Clustering Results - Visualization and Centroid Analysis

The final results presented below represent the best clustering found over the range k = \{6, 7, 8\}, with 1,000 independent simulations for each value of k. Euclidean distance and the Quantization Error were the distance and quality metrics used in the Monte Carlo K-Means clustering. The data set used to produce the results was the normalized point-in-time 2014 data set consisting of the 19 socioeconomic indicators identified as being positively correlated to real GDP growth. These were identified in my previous post, What Drives Real GDP Growth?

When viewing the results keep in mind that the numbering of the clusters is arbitrary, so being in cluster one may or may not be better than being in cluster two. In other words, it is a nominal scale.

Map Visualization for ~2014

The following map represents the best possible clustering for k=\{6,7,8\} based on normalized 2014 socioeconomic data. The centroids are broken down in the Cluster Breakdown & Centroid Analysis sub-section.


Map Visualization for ~2004

The following map represents how the countries would have been clustered if 2004 data were used together with the same centroids identified from the 2014 data.


Cluster Breakdown & Centroid Analysis

Each one of the tabs below breaks a cluster down into the countries which belong to it and compares that cluster's centroid to the median centroid across each one of the 19 socioeconomic indicators we clustered on.

Clustering Centroid Analysis 1


Countries in this Cluster in 2014

Czech Republic


New Zealand
United Kingdom

Clustering Centroid Analysis 2


Countries in this Cluster in 2014

Antigua and Barbuda
Cape Verde
Costa Rica
Dominican Republic
El Salvador


Saint Kitts and Nevis
Saint Vincent and the Grenadines
South Africa
Trinidad and Tobago

Clustering Centroid Analysis 3


Countries in this Cluster in 2014

Sao Tome and Principe

Clustering Centroid Analysis 4


Countries in this Cluster in 2014

Equatorial Guinea

North Korea

Saint Lucia
South Sudan
Sri Lanka

Clustering Centroid Analysis 5


Countries in this Cluster in 2014

Bosnia and Herzegovina


Slovak Republic

Clustering Centroid Analysis 6


Countries in this Cluster in 2014

United States

Clustering Centroid Analysis 7


Countries in this Cluster in 2014

Burkina Faso
Central African Republic

Ivory Coast

Papua New Guinea
Sierra Leone
Solomon Islands

Clustering Centroid Analysis 8


Countries in this Cluster in 2014

Hong Kong

San Marino

Saudi Arabia
South Korea
The Bahamas


Clustering Results - Conclusions and Further Research

Being a quant is not about risk management, derivatives pricing, or algorithmic trading; it is about challenging the way things are done and finding better ways of doing them usually using statistical and computational methods. This way of thinking can be applied to anything including how we think about the world. I started writing this series of articles because I was frustrated by the use of over-simplified buzzwords such as 'The African Growth Story', 'Third World vs. First World', and 'BRICs nations' to describe topics as complex as the world and growth. The conclusions below have presented me with a better framework for thinking about the world and growth; my hope is that it will be as useful to you as it is to me.


China has joined the USA as an outlier

In 2004, the United States was an outlier and occupied a cluster on its own. This cluster was characterized by a low exchange rate at PPP, high imports, high exports, high household expenditures, high industrial production, and relatively high government revenues and spending, particularly on health. The biggest remaining differences are that much more investment is flowing into China and that its population is (obviously) far larger, with a better demographic profile (population between 15 and 64). In terms of industrial production China has also overtaken the USA. These are shown in a side by side comparison below,

Geography is a poor measure of similarity

Many companies adjust their strategies according to regions of the world. What the results suggest is that countries in the same region of the world are about as likely to be clustered together as not. This indicates that, as a crude clustering technique, seeing the world as regions of similar countries with similar growth prospects is sub-optimal.

Some colloquialisms are right; others are just wrong

Colloquialisms such as eastern vs. western European countries show up on the map and are (for lack of a better word) right. However, colloquialisms such as BRICS (Brazil, Russia, India, China, and South Africa) are clearly motivated more by the political economy than by the actual economy. Here are my thoughts on some common colloquialisms,

  1. Eastern vs. Western Europe - There appears to be a clear distinction between the countries in cluster one and those in clusters five and two. The past ten years have seen changes in Spain, Ireland, the Czech Republic, and other nearby countries. This is probably the result of the sovereign debt crisis.
  2. Eastern vs. Western Countries - This is an over-simplification. Most Asian countries occupy different clusters, and traditional Western countries such as the USA and the UK do not actually occupy the same cluster.
  3. BRICS nations - Brazil, Russia, India, China, and South Africa are in different clusters. Whilst they may have trade agreements in place, that doesn't imply that these nations have the same social, demographic, and economic make-up or the same potential for future real-GDP growth.
  4. The African Growth Story - Whilst African capital markets have done well in the past ten years, this does not seem to be reflected in major changes to the social, demographic, and economic make-up of the continent. What is interesting is that India and Pakistan are no longer clustered with Central and Southern African countries.
  5. Northern Africa vs. Southern Africa - There is a clear distinction between Northern African countries (Morocco, Algeria, Egypt, Libya, etc.) and the rest of Africa. Surprisingly, South Africa is now clustered with these countries.
  6. Emerging vs. Developed Nations - Again, this appears to be an over-simplification. There seem to be multiple stages of development, which are discussed in the next section.

There are many more colloquialisms and I apologize for not commenting on all of them; these six are just the ones which I encounter most frequently in everyday life. Please comment if you spot other interesting relationships.

There are multiple stages of developing countries

It is impossible to quantify how good it is to be in one cluster versus another because we do not know the relative importance of each socioeconomic indicator. In some cases we cannot even be sure whether a large or small value is good or bad. For example, is large government spending still good if the government is ineffective? Despite this, I have attempted to construct a crude metric for ranking each one of the clusters:

Rank value = Exports + Household Expenditures + Imports + Improved Sanitation + Improved Water + Population + Population Aged 15 To 64 + Population Growth  + Total Investment  + Urban percentage + Cellphone Subscriptions + Government Revenues + Government Spending + Health Expenditure + Industrial Production + Internet Users - Exchange Rate At PPP - Unemployment - Age Dependency Ratio Old

Based on this metric the relative rank of each cluster is shown below,

Rank Value    Rank    Number of Countries
10.238        1       2
5.191         2       22
5.146         3       20
3.827         4       20
3.825         5       45
3.111         6       32
3.078         7       4
1.799         8       43

This ranking isn't perfect, but it reaffirms our belief that the world is an unequal place and that the distribution of countries is skewed towards lower rank values. In other words, there are more worse-off countries than well-off ones.

This ranking also implies multiple stages of development for countries. The stage of development is inversely proportional to the real-GDP growth potential (countries in the earlier stages of development have greater growth potential) BUT the difficulty associated with unlocking that growth potential is also inversely proportional to the stage of development (countries in the earliest stages of development will find it harder to realize their growth potential).

So what does that mean for investors? I think it means that distinctions should be made between countries in different stages of development. This is because whilst the most under-developed countries represent investments with the greatest return potential, they are also more risky and will probably require longer terms to pay off. Ideally these factors should be weighed against one another and compared to the investor's risk-return appetite.

Further Research

In the third part of this article series I am going to look at predictive models for real-GDP growth. The article will take a while to complete, because I am planning to implement a few models and also take a deeper look at the data.

In addition to this, the following topics are worthy of further research:

  1. Are fractional distances sensitive to the scale of the vectors being clustered?
  2. What unbiased alternatives are there to the clustering quality metrics presented here?
  3. To which socioeconomic indicators are the clustering results the most sensitive?
  4. More detailed analysis of each cluster, and of particular countries in each cluster.


Appendix A - All the code and some data!

I have removed the code from this article because of slow load times on the page. You can still obtain a copy of the Python file (in .txt format) and the data by following the links below.

To run the full code you will need the raw data file; you can download a copy of it here. Please let me know if you can identify any optimizations; the code is already memoized to reduce the number of unnecessary distance calculations and (tries to) make use of numpy arrays over conventional lists.


