Monte Carlo K-Means Clustering of Countries
In the first part of this three-part series, What Drives Real GDP Growth?, I identified four themes which drive real GDP growth. These themes are based on 19 socioeconomic indicators whose average Spearman and Pearson correlations to real GDP growth were statistically significant and greater than 0.2 across the 188 country sample.
In this article the 188 countries are clustered based on those 19 socioeconomic indicators using a Monte Carlo K-Means clustering algorithm implemented in Python. Clustering can help to reduce the amount of work required to identify attractive investment opportunities by grouping similar countries together and generalizing about them.
This article goes into some detail about distance metrics, classes of clustering algorithms, the K-Means clustering algorithm, and measures of clustering quality, before discussing the results from clustering the countries and drawing conclusions. If you are already comfortable with these topics, feel free to navigate to whichever sections sound interesting.
- Clustering theory - Measures of similarity and distance
- Clustering theory - Classes of clustering algorithms
- Clustering theory - The K-Means clustering algorithm
- Clustering theory - Measures of clustering quality
- Clustering theory - Monte Carlo Methods in clustering
- Clustering results - Visualizations and centroid analysis
- Clustering results - Conclusions and further research
- Appendix A - All the code and some data!
Clustering Theory - Measures of Similarity & Distance
Clustering is the process of partitioning a set of heterogeneous (different) objects into subsets of homogeneous (similar) objects. At the heart of cluster analysis is the assumption that given any two objects you can quantify the similarity or dissimilarity between those objects. In continuous search spaces distance measures similarity.
Below I have written about similarity metrics for continuous search spaces. For each metric I have included a formula (given two vectors, $\mathbf{x}$ and $\mathbf{y}$) and Python code. All the Python code used to write this article is available.
Manhattan Distance
An interesting paper entitled "On the Surprising Behavior of Distance Metrics in High Dimensional Space" studied the behaviour of different distance metrics in high-dimensional spaces and found that Manhattan and fractional distances were preferable to the more traditional Euclidean distance for clustering.
Manhattan distance is equal to the sum of the absolute differences between each coordinate of the two vectors,

$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$$
Euclidean Distance
Euclidean distance is the ordinary distance between two points or vectors. Some derivatives of Euclidean distance include squared Euclidean distance and half squared Euclidean distance,

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$d_{\text{squared}}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} (x_i - y_i)^2$$

$$d_{\text{half squared}}(\mathbf{x}, \mathbf{y}) = \frac{1}{2} \sum_{i=1}^{n} (x_i - y_i)^2$$
Fractional Distance
Interestingly, Manhattan, Euclidean, and fractional distances are all closely related and are often described in terms of L-norms. Given the following distance equation,

$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^f \right)^{\frac{1}{f}}$$

notice that when $f = 1$ we get the same value as Manhattan distance, and when $f = 2$ we get the same value as Euclidean distance. Now if Manhattan distances perform better than Euclidean distances in high-dimensional spaces, it implies that for fractional values $0 < f < 1$ performance may improve further still. The paper I mentioned earlier found that at 20 dimensions the performance of the fractional distance metric converged beyond a certain value of $f$.
During this study I had no luck using the Fractional Distance metric. My (unproved) hypothesis is that the performance of the Fractional Distance metric is sensitive to the scale of the values in the vectors being clustered.
Alternative Distance Metrics
One can also use correlations as a measure of similarity for continuous valued search spaces. Popular measurements include the Pearson coefficient, cosine similarity, and the Tanimoto coefficient. If your data is not continuous, i.e. it is discrete, then you can either use a function to convert it to a continuous space or use metrics such as Hamming distance, the Jaccard index, or Levenshtein distance (for strings) to measure similarity.
Python Code - Distance Metrics
The Similarity class written below contains implementations of all the distance metrics mentioned above. The default value for fractional distances is currently set to 2.0 (Euclidean distance) but can be adjusted.
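Since the original listing has been moved off the page, here is a minimal sketch of what the distance-metric portion of such a Similarity class might look like. The class layout, method names, and constructor argument are my own assumptions rather than the original code:

```python
import numpy as np


class Similarity:
    """Sketch of a distance/similarity helper for continuous vectors."""

    def __init__(self, e=1e-10):
        # Small constant that can be used to avoid divide-by-zero issues.
        self.e = e

    def manhattan_distance(self, x, y):
        # Sum of absolute coordinate-wise differences (L1 norm).
        d = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
        return float(np.sum(d))

    def euclidean_distance(self, x, y):
        # Ordinary straight-line distance (L2 norm).
        d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        return float(np.sqrt(np.sum(d ** 2)))

    def fractional_distance(self, x, y, f=2.0):
        # Generalized L-norm: f=1 is Manhattan, f=2 is Euclidean, 0 < f < 1 is fractional.
        d = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
        return float(np.sum(d ** f) ** (1.0 / f))
```

For example, `Similarity().fractional_distance([0, 0], [3, 4], f=2.0)` returns 5.0, the same value as the Euclidean distance.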
Clustering Theory - Classes of Clustering Algorithms
The two main classes of clustering algorithms are hierarchical clustering and partitional clustering. Hierarchical clustering forms clusters by merging small clusters into larger ones or dividing large clusters into smaller ones. Partitional clustering forms clusters by dividing the input data set into mutually exclusive subsets.
The differences between hierarchical and partitional clustering mostly have to do with the inputs required. Hierarchical clustering only requires a similarity measure, whereas partitional clustering may require a number of additional inputs, most commonly the number of clusters, $k$. Generally speaking, hierarchical clustering algorithms are also better suited to categorical data.
Hierarchical Clustering
There are two types of hierarchical clustering, namely agglomerative clustering and divisive clustering. Agglomerative clustering is a bottom-up approach and involves merging smaller clusters (each input pattern by itself) into larger clusters. Divisive clustering is a top-down approach which starts with one large cluster (all of the input patterns) and divides it into smaller and smaller clusters until each input pattern is in a cluster by itself.
Partitional Clustering
In this article we will focus on partitional clustering algorithms. The two main categories of partitional clustering algorithms are centroid-based clustering and density-based clustering. This article focuses on centroid-based clustering; in particular the popular K-means clustering algorithm.
Clustering Theory - The K-Means Clustering Algorithm
The K-Means clustering algorithm is a centroid-based partitional clustering algorithm which works using the mean-shift heuristic. The K-Means clustering algorithm consists of three steps (initialization, assignment, and update). These steps are repeated until either the clustering has converged or the number of iterations has been exceeded, i.e. the computational budget has been exhausted.
Initialization
A set of centroids is randomly initialized in the search space. These centroids must be in the same order of magnitude as the data patterns being clustered. In other words, it does not make sense to initialize random vectors with values between 0 and 1 if the values in the data patterns are between 0 and 100. To remedy this the data can be normalized between 0 and 1 using,

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Note: make sure you normalize your data across each attribute, not each pattern
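As a concrete (if not original) illustration, per-attribute min-max normalization can be done in a couple of lines of numpy; note that the statistics are taken down each column (attribute), not across each row (pattern):

```python
import numpy as np

def normalize_columns(data):
    """Min-max normalize each attribute (column) of a patterns-by-attributes matrix to [0, 1]."""
    data = np.asarray(data, dtype=float)
    col_min, col_max = data.min(axis=0), data.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # guard against constant columns
    return (data - col_min) / span
```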
Assignment
Once the centroids have been randomly initialized in the space, we iterate through each pattern in the data set and assign it to the closest centroid. Try and do this step in parallel, especially if you have a very large number of patterns in the data set.
Update
Once the patterns have been assigned to their centroids, the mean-shift heuristic is applied. This heuristic replaces each value in each centroid with the mean of that value over the patterns which have been assigned to that centroid. This shifts the centroid towards the high-dimensional mean of the patterns which belong to it. The problem with the mean-shift heuristic is that it is sensitive to outliers. To overcome this one can either use the K-medoids clustering algorithm or one can standardize the data to dampen the effect of outliers using,

$$x' = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of each attribute.
Iteration
These three steps are repeated for a number of iterations until the clustering has converged on a solution. A really nice GIF illustrating this process is shown below,
Python Code - Additions to the Clustering class
The Python methods below are extensions to the Clustering class which allow it to perform the K-means clustering algorithm. This involved updating the centroids using the mean-shift heuristic.
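The listing itself is no longer embedded in the page, so the following is a small, self-contained sketch of the mean-shift update step described above; the function name and signature are my own, not the original class method:

```python
import numpy as np

def update_centroids(patterns, centroids, assignments):
    """Mean-shift heuristic: replace each centroid with the mean of the patterns
    assigned to it (assignments[i] is the centroid index of pattern i)."""
    patterns = np.asarray(patterns, dtype=float)
    new_centroids = []
    for k, old in enumerate(centroids):
        members = patterns[[i for i, a in enumerate(assignments) if a == k]]
        # If no patterns were assigned to this centroid, leave it where it is.
        new_centroids.append(members.mean(axis=0) if len(members) > 0 else old)
    return np.array(new_centroids)
```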
Clustering Theory - Measures of Clustering Quality
Assuming you have a measure of similarity and a clustering of the data, you still need an objective function which measures the quality of that clustering. Most clustering quality metrics try to optimize the clustering in terms of its inter- and intra-cluster distances. Put simply, these metrics try to ensure that patterns in the same cluster are close together and patterns in different clusters are far apart.
Quantization Error
Quantization error measures the round-off error introduced by quantization, i.e. mapping a set of input values to a smaller, finite set. This is basically what we are doing by clustering patterns into $k$ clusters.
Quantization error can be calculated using the following very ugly formula,

$$QE = \sum_{\mathbf{z} \in D} \left\| \mathbf{z} - \mathbf{c}_{\mathbf{z}} \right\|_2^2$$

where $D$ is the set of data points we are clustering, $\mathbf{c}_{\mathbf{z}}$ is the centroid to which $\mathbf{z}$ has been assigned, and $\| \cdot \|_2^2$ just means we take the $L_2$ norm of the difference and square it. So what this means is that this form of the quantization error assumes we are using the Euclidean distance metric.
Alternatively we can calculate quantization error using the following less ugly formula,

$$QE = \sum_{i=1}^{n} d(\mathbf{z}_i, \mathbf{c}_i)^2$$

where $n$ is the number of patterns in the data set, $\mathbf{z}_i$ is the $i^{\text{th}}$ pattern in the data set, $\mathbf{c}_i$ is the centroid to which $\mathbf{z}_i$ is closest, and $d(\mathbf{z}_i, \mathbf{c}_i)$ is the absolute distance between $\mathbf{z}_i$ and $\mathbf{c}_i$ using some distance metric, $d$.
Note: The image also assumes we are using the Manhattan distance.
In the above illustration of the Quantization Error we are calculating the sum of the squared absolute distances between every pattern and the centroid it has been assigned to.
Davies–Bouldin index
The Davies-Bouldin criterion is based on a ratio of intra-cluster and inter-cluster distances for a particular clustering. Mathematically the Davies-Bouldin index is calculated as,

$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} D_{i,j} \quad \text{with} \quad D_{i,j} = \frac{\bar{d}_i + \bar{d}_j}{d_{i,j}}$$

where $D_{i,j}$ is the intra-to-inter cluster distance ratio between the $i^{\text{th}}$ and $j^{\text{th}}$ clusters, $\bar{d}_i$ is the average distance between each pattern in the $i^{\text{th}}$ cluster and the centroid of that cluster (likewise for $\bar{d}_j$), and $d_{i,j}$ is the distance between the $i^{\text{th}}$ and $j^{\text{th}}$ centroids. The Davies-Bouldin index is the average of the worst intra-to-inter cluster distance ratios over all $K$ clusters.
Note: The image assumes we are using the Manhattan distance.
In the above illustration of the Davies-Bouldin index we have three clusters consisting of three patterns each. The first step is to calculate the average intra-cluster distance, $\bar{d}_i$, for each cluster. The next step is to calculate the distances between each pair of centroids, $d_{i,j}$. Lastly we return the maximum ratio of the sum of any pair of average intra-cluster distances to the distance between those two centroids.
Silhouette Index
The Silhouette Index is one of the most popular ways of measuring a particular clustering's quality. It measures how similar each pattern is to the patterns in its own cluster compared to patterns in other clusters. The silhouette value of a pattern $i$ is defined as,

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

where $b(i)$ is the minimum average distance from pattern $i$ to the patterns in a different cluster (minimized over those clusters), and $a(i)$ is the average distance from pattern $i$ to the other patterns in the same cluster as $i$.
A high silhouette value indicates that $i$ is well matched to its own cluster and poorly matched to neighboring clusters. One should aim to maximize $s(i)$ for each pattern in the data set.
Note: The image also assumes we are using the Manhattan distance.
In the above illustration of the Silhouette Index we have the same three clusters, consisting of the same three patterns each, as in the previous image. The image shows how to calculate the Silhouette Index for just two of the patterns. In the first step we calculate, for both patterns, the average distance from each pattern to the other patterns in its own cluster, $a(i)$. Next we calculate the average distance from each pattern to the patterns of the next closest cluster, $b(i)$. For each pattern we calculate the silhouette value as the difference between the average distance to the patterns of another cluster and the average distance to the patterns of its own cluster, divided by the maximum of those two numbers. To calculate the Silhouette Index of the clustering we calculate the silhouette value for every pattern and average them.
The good, the bad, and the ugly
After having spent the last few months using these metrics I have concluded that none of them are perfect,
- Quantization Error - the computational complexity of this metric is the smallest, however the metric is biased towards a large number of clusters because, as you add more centroids, the clusters become smaller (more compact), and in the extreme you may have one pattern assigned to each centroid, in which case the quantization error is minimized. That said, the results were the most believable using this metric. The Good.
- Davies-Bouldin - as you increase the value of $k$ the distances between the centroids will naturally decrease on average. Because this term is in the denominator, you end up dividing by a smaller number for larger values of $k$. The result is that the metric is biased towards solutions with a smaller number of clusters. The Bad.
- Silhouette Index - the computational complexity of this metric is crazy. Assuming you calculate the distance from each pattern to every other pattern in order to work out which cluster is closest, and you do that for every pattern, you need to calculate $n(n-1)$ distances. In this example that equates to $188 \times 187 = 35{,}156$ calculations. If you memoize perfectly, the number of calculations for each pass is halved. The metric also seems to be biased against solutions with a larger number of clusters because, as you might have guessed, the distances between the centroids become smaller, so each pattern is more likely to appear to belong to a neighboring cluster. The Ugly.
The following analysis of the different metrics demonstrates the biases quite nicely; despite the fact that they are supposed to be measuring the same thing they are almost perfectly negatively correlated.
        QE        DB        SI
QE      1.0      -0.965    -0.894
DB     -0.965     1.0       0.949
SI     -0.894     0.949     1.0
Python Code - Clustering
Before you can evaluate the fitness of a given clustering you need to actually cluster the patterns. The Clustering class contains methods which assign patterns to their nearest centroids.
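As with the other listings, the original class is not embedded here; a minimal sketch of the assignment step might look like the following (the function name is assumed):

```python
import numpy as np

def assign_to_centroids(patterns, centroids):
    """Return, for each pattern, the index of its nearest centroid (Euclidean distance)."""
    patterns = np.asarray(patterns, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    return [int(np.argmin(np.linalg.norm(centroids - p, axis=1))) for p in patterns]
```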
Python Code - Objective Functions
The ClusteringQuality class measures the quality of a given Clustering of the input patterns.
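Since the listing was removed from the page, the sketch below shows one possible shape for such a class, implementing the three quality metrics discussed above with Euclidean distances. The class and method names here are assumptions, not the original code:

```python
import numpy as np


class ClusteringQuality:
    """Quality metrics for patterns, centroids, and assignments
    (assignments[i] is the index of pattern i's centroid)."""

    def __init__(self, patterns, centroids, assignments):
        self.patterns = np.asarray(patterns, dtype=float)
        self.centroids = np.asarray(centroids, dtype=float)
        self.assignments = np.asarray(assignments, dtype=int)

    def quantization_error(self):
        # Sum of squared distances between each pattern and its assigned centroid.
        diffs = self.patterns - self.centroids[self.assignments]
        return float(np.sum(np.linalg.norm(diffs, axis=1) ** 2))

    def davies_bouldin(self):
        k = len(self.centroids)
        # Average distance from each cluster's members to its centroid.
        spread = np.zeros(k)
        for j in range(k):
            members = self.patterns[self.assignments == j]
            if len(members) > 0:
                spread[j] = np.mean(np.linalg.norm(members - self.centroids[j], axis=1))
        worst = []
        for i in range(k):
            ratios = [(spread[i] + spread[j]) /
                      np.linalg.norm(self.centroids[i] - self.centroids[j])
                      for j in range(k) if j != i]
            worst.append(max(ratios))
        return float(np.mean(worst))

    def silhouette_index(self):
        n, k = len(self.patterns), len(self.centroids)
        # Pairwise pattern-to-pattern distances: the expensive O(n^2) part.
        dists = np.linalg.norm(self.patterns[:, None, :] - self.patterns[None, :, :], axis=2)
        scores = []
        for i in range(n):
            own = self.assignments[i]
            same = self.assignments == own
            same[i] = False
            if not same.any():
                continue  # silhouette is undefined for singleton clusters
            a = dists[i][same].mean()
            b = min(dists[i][self.assignments == j].mean()
                    for j in range(k)
                    if j != own and np.any(self.assignments == j))
            scores.append((b - a) / max(a, b))
        return float(np.mean(scores))
```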
Clustering Theory - Monte Carlo Methods in Clustering
The two biggest problems with the K-Means clustering algorithm are that:
- It is sensitive to the random initialization of the centroids and
- The choice of the number of centroids to initialize, $k$.
For these reasons the K-Means clustering algorithm is often restarted multiple times. Because the initialization is (usually) random, we are essentially sampling random high-dimensional starting positions for the centroids, which is also known as a Monte Carlo simulation. In order to compare solutions across independent simulations, we need a measure of clustering quality such as those discussed previously.
Deterministic Initialization
I said that the initialization is usually random because there are deterministic initialization techniques for the K-Means clustering algorithm including grid-based initialization and principal component based initialization.
Stochastic Initialization
For stochastic (random) initialization of the centroids one can use either pseudo-random numbers or quasi-random numbers. The difference is that each number in a pseudo-random sequence is generated independently of the previous numbers, whereas in a quasi-random sequence each new number depends on the previous numbers. Such dependent random numbers cover a greater surface of the search space.

Comparison of a pseudo random sequence (left) and a quasi random sequence (right) in a two-dimensional space
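The original article does not say which generator it used, but as an illustration, SciPy's qmc module (SciPy 1.7 or later) can produce a quasi-random Sobol sequence to compare against numpy's pseudo-random numbers:

```python
import numpy as np
from scipy.stats import qmc  # low-discrepancy (quasi-random) sequences

dimensions = 19  # one dimension per socioeconomic indicator

# Pseudo-random initialization: each point is drawn independently.
pseudo = np.random.default_rng(0).uniform(0.0, 1.0, size=(8, dimensions))

# Quasi-random (Sobol) initialization: points depend on one another and
# are spread more evenly over the unit hypercube.
quasi = qmc.Sobol(d=dimensions, scramble=True, seed=0).random(n=8)
```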
Picking the right k
In addition to testing different initializations, we can test different values of $k$ in a Monte Carlo framework. Currently there is no optimal way of dynamically determining the right number of clusters, although techniques for determining the right value of $k$ are always being researched. I prefer to simply try different values of $k$ empirically and compare the results, although this is time-consuming, especially on large data sets.
Monte Carlo Method - Code
The Monte Carlo simulation used to analyze the Clustering Quality results is shown below,
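The simulation code is no longer embedded in the page, but its structure was essentially: restart K-Means many times from random centroids, score each run with a quality metric, and keep the best. The sketch below is my own compact, self-contained version of that idea, using quantization error as the quality metric:

```python
import numpy as np

def k_means(patterns, k, iterations=100, rng=None):
    """One run of K-Means: random initialization, assignment, mean-shift update."""
    if rng is None:
        rng = np.random.default_rng()
    patterns = np.asarray(patterns, dtype=float)
    centroids = rng.uniform(patterns.min(axis=0), patterns.max(axis=0),
                            size=(k, patterns.shape[1]))
    for _ in range(iterations):
        # Assign each pattern to its nearest centroid.
        distances = np.linalg.norm(patterns[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Mean-shift each centroid towards the mean of its assigned patterns.
        for j in range(k):
            members = patterns[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignments

def monte_carlo_k_means(patterns, k, simulations=1000, seed=0):
    """Restart K-Means `simulations` times and keep the clustering with the
    lowest quantization error (sum of squared pattern-to-centroid distances)."""
    patterns = np.asarray(patterns, dtype=float)
    rng = np.random.default_rng(seed)
    best_error, best = np.inf, None
    for _ in range(simulations):
        centroids, assignments = k_means(patterns, k, rng=rng)
        error = np.sum(np.linalg.norm(patterns - centroids[assignments], axis=1) ** 2)
        if error < best_error:
            best_error, best = error, (centroids, assignments)
    return best, best_error
```

In practice the same loop can be wrapped in an outer loop over candidate values of $k$, comparing the best quality score found for each.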
Clustering Results - Visualization and Centroid Analysis
The final results presented below represent the best clustering found across 1,000 independent simulations for each value of $k$. Euclidean distance and the quantization error were the distance and quality metrics used in the Monte Carlo K-Means clustering. The data set used to produce the results was the normalized point-in-time 2014 data set consisting of the 19 socioeconomic indicators identified as being positively correlated to real GDP growth in my previous post, What Drives Real GDP Growth?.
When viewing the results keep in mind that the numbering of the clusters is arbitrary, so being in cluster one may or may not be better than being in cluster two. In other words, it is a nominal scale.
Map Visualization for ~2014
The following map represents the best clustering found for $k = 8$ based on normalized 2014 socioeconomic data. The centroids are broken down in the Cluster Breakdown & Centroid Analysis sub-section.
Map Visualization for ~2004
The following map represents how the countries would have been clustered if 2004 data had been used with the same centroids identified from the 2014 data.
Cluster Breakdown & Centroid Analysis
Each one of the tabs below breaks a cluster down into the countries which belong to it and compares that cluster's centroid to the median centroid across each one of the 19 socioeconomic indicators we clustered on.
Countries in this Cluster in 2014
Australia
Austria
Belgium
Canada
Czech Republic
Denmark
Estonia
Finland
France
Germany
Iceland
Japan
Luxembourg
Malta
Netherlands
New Zealand
Norway
Sweden
Switzerland
United Kingdom
Countries in this Cluster in 2014
Albania
Algeria
Antigua and Barbuda
Armenia
Azerbaijan
Belize
Cape Verde
Colombia
Costa Rica
Cuba
Dominica
Dominican Republic
Ecuador
Egypt
El Salvador
Fiji
Grenada
Jamaica
Jordan
Kazakhstan
Kosovo
Libya
Maldives
Mauritius
Mexico
Moldova
Morocco
Panama
Paraguay
Peru
Philippines
Romania
Saint Kitts and Nevis
Saint Vincent and the Grenadines
Seychelles
South Africa
Suriname
Syria
Thailand
Trinidad and Tobago
Tunisia
Turkey
Tuvalu
Ukraine
Uzbekistan
Countries in this Cluster in 2014
Indonesia
Iran
Sao Tome and Principe
Vietnam
Countries in this Cluster in 2014
Bangladesh
Bhutan
Bolivia
Botswana
Djibouti
Equatorial Guinea
Gabon
Gambia
Guatemala
Guyana
Honduras
India
Iraq
Kyrgyzstan
Laos
Mongolia
Myanmar
Namibia
Nepal
Nicaragua
North Korea
Pakistan
Saint Lucia
Samoa
Senegal
South Sudan
Sri Lanka
Swaziland
Tajikistan
Tonga
Turkmenistan
Vanuatu
Countries in this Cluster in 2014
Barbados
Bosnia and Herzegovina
Bulgaria
Croatia
Cyprus
Georgia
Greece
Hungary
Ireland
Italy
Latvia
Lithuania
Macedonia
Montenegro
Poland
Portugal
Serbia
Slovak Republic
Slovenia
Spain
Countries in this Cluster in 2014
China
United States
Countries in this Cluster in 2014
Afghanistan
Angola
Benin
Burkina Faso
Burundi
Cambodia
Cameroon
Central African Republic
Chad
Comoros
Congo
Congo-Brazzaville
Eritrea
Ethiopia
Ghana
Guinea
Guinea-Bissau
Haiti
Ivory Coast
Kenya
Kiribati
Lesotho
Liberia
Madagascar
Malawi
Mali
Mauritania
Mozambique
Niger
Nigeria
Papua New Guinea
Rwanda
Sierra Leone
Solomon Islands
Somalia
Sudan
Tanzania
Timor-Leste
Togo
Uganda
Yemen
Zambia
Zimbabwe
Countries in this Cluster in 2014
Argentina
Bahrain
Belarus
Brazil
Brunei
Chile
Hong Kong
Israel
Kuwait
Lebanon
Malaysia
Oman
Qatar
Russia
San Marino
Saudi Arabia
Singapore
South Korea
The Bahamas
UAE
Uruguay
Venezuela
Clustering Results - Conclusions and Further Research
Being a quant is not about risk management, derivatives pricing, or algorithmic trading; it is about challenging the way things are done and finding better ways of doing them usually using statistical and computational methods. This way of thinking can be applied to anything including how we think about the world. I started writing this series of articles because I was frustrated by the use of over-simplified buzzwords such as 'The African Growth Story', 'Third World vs. First World', and 'BRICs nations' to describe topics as complex as the world and growth. The conclusions below have presented me with a better framework for thinking about the world and growth; my hope is that it will be as useful to you as it is to me.
Conclusions
China has joined the USA as an outlier
In 2004, the United States was an outlier and occupied a cluster of its own. This cluster was characterized by a low exchange rate at PPP, high imports, high exports, high household expenditure, high industrial production, and relatively high government revenues and spending, particularly on health. At this point in time the biggest differences are still that a much larger amount of investment is flowing into China, and that China's population is (obviously) a lot larger with a better demographic profile (population aged between 15 and 64). In terms of industrial production China has also overtaken the USA. These are shown in a side-by-side comparison below,
Geography is a poor measure of similarity
Many companies adjust their strategies according to regions of the world. What the results suggest is that countries in the same region of the world are about as likely to be clustered together as not. This indicates that, as a crude clustering technique, seeing the world as regions of similar countries with similar growth prospects is sub-optimal.
Some colloquialisms are right; others are just wrong
Colloquialisms such as eastern vs. western European countries show up on the map and are (for lack of a better word) right. However colloquialisms such as BRICs (Brazil, Russia, India, China, and South Africa) are clearly motivated more by the political economy than the actual economy. Here are my thoughts on some common colloquialisms,
- Eastern vs. Western Europe - There appears to be a clear distinction between those countries in cluster one and those in clusters five and two. The past ten years have seen changes in Spain, Ireland, the Czech Republic, and other nearby countries. This is probably the result of the sovereign debt crisis.
- Eastern vs. Western Countries - This is an over-simplification. Most Asian countries occupy different clusters, and the traditional Western countries such as the USA and the UK do not actually occupy the same cluster.
- BRICs nations - Brazil, Russia, India, China, and South Africa are in different clusters. Whilst they may have trade agreements in place, that doesn't imply that these nations have the same social, demographic, and economic make-up or the same potential for future real-GDP growth.
- The African Growth Story - Whilst capital markets have done well in the past ten years, this does not seem to be reflected in major changes to the social, demographic, and economic make-up of the continent. What is interesting is that India and Pakistan are no longer clustered with Central and Southern African countries.
- Northern Africa vs. Southern Africa - There is a clear distinction between Northern African countries (Morocco, Algeria, Egypt, Libya, etc.) and the Rest of Africa. Surprisingly, South Africa is now clustered with these countries.
- Emerging vs. Developed Nations - Again this appears to be an over-simplification. There seem to be multiple stages of development, which are discussed in the next section.
There are many more colloquialisms and I apologize for not commenting on all of them; these six are just the ones which I encounter most frequently in my everyday life. Please comment if you spot other interesting relationships.
There are multiple stages of developing countries
It is impossible to quantify how good it is to be in one cluster versus another because we do not know the relative importance of each socioeconomic indicator. In some cases we cannot even be sure whether a large or small value is good or bad. For example, is large government spending still good if the government is ineffective? Despite this, I have attempted to construct a crude metric for ranking each one of the clusters:
Rank value = Exports + Household Expenditures + Imports + Improved Sanitation + Improved Water + Population + Population Aged 15 To 64 + Population Growth + Total Investment + Urban percentage + Cellphone Subscriptions + Government Revenues + Government Spending + Health Expenditure + Industrial Production + Internet Users - Exchange Rate At PPP - Unemployment - Age Dependency Ratio Old
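As a minimal illustration of how this metric might be computed from a cluster centroid (the indicator keys below are shortened placeholders, not the actual column names in the data set):

```python
# 'Good' indicators add to the rank value; 'bad' indicators subtract from it.
POSITIVE = ["exports", "household_expenditure", "imports", "improved_sanitation",
            "improved_water", "population", "population_15_to_64", "population_growth",
            "total_investment", "urban_percentage", "cellphone_subscriptions",
            "government_revenues", "government_spending", "health_expenditure",
            "industrial_production", "internet_users"]
NEGATIVE = ["exchange_rate_at_ppp", "unemployment", "age_dependency_ratio_old"]

def rank_value(centroid):
    """Crude ranking metric: sum of the 'good' indicators minus the 'bad' ones,
    where `centroid` maps indicator names to normalized values."""
    return sum(centroid[key] for key in POSITIVE) - sum(centroid[key] for key in NEGATIVE)
```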
Based on this metric the relative rank of each cluster is shown below,
Cluster   Rank Value   Rank   Count
6         10.238       1      2
8          5.191       2      22
1          5.146       3      20
5          3.827       4      20
2          3.825       5      45
4          3.111       6      32
3          3.078       7      4
7          1.799       8      43
This ranking isn't perfect, but it reaffirms our belief that the world is an unequal place and that the distribution of countries is skewed towards lower rank values. In other words, there are more worse-off countries than well-off ones.
This ranking also implies multiple stages of development for countries. The stage of development is inversely proportional to the real-GDP growth potential (countries in the earlier stages of development have greater growth potential) BUT the difficulty associated with unlocking that growth potential is also inversely proportional to the stage of development (countries in the earliest stages of development will find it harder to realize their growth potential).
So what does that mean for investors? I think it means that distinctions should be made between countries in different stages of development. This is because whilst the most under-developed countries represent investments with the greatest return potential, they are also more risky and will probably require longer terms to pay off. Ideally these factors should be weighed against one another and compared to the investor's risk-return appetite.
Further Research
In the third part of this article series I am going to look at predictive models for real-GDP growth. The article will take a while to complete, because I am planning to implement a few models and also take a deeper look at the data.
In addition to this, the following topics are worthy of further research:
- Are fractional distances sensitive to the scale of the vectors being clustered?
- What unbiased alternatives are there to the clustering quality metrics presented here?
- To which socioeconomic indicators are the clustering results the most sensitive?
- More detailed analysis of each cluster, and of particular countries in each cluster.
Appendix A - All the code and some data!
I have removed the code from this article because of slow load times on the page. You can still obtain a copy of the Python file (in .txt format) and the data by following the links below.
To run the code you will need the raw data file; you can download a copy of it here. Please let me know if you can identify any optimizations; the code is already memoized to reduce the number of unnecessary distance calculations and (tries to) make use of numpy arrays over conventional lists.