Monte Carlo KMeans Clustering of Countries
In the first part of this threepart series, What Drives Real GDP Growth?, I identified four themes which drive real GDP growth. These themes are based on 19 socioeconomic indicators whose average Spearman and Pearson correlations to real GDP growth were statistically significant and greater than 0.2 across the 188 country sample.
In this article the 188 countries are clustered based on those 19 socioeconomic indicators using a Monte Carlo KMeans clustering algorithm implemented in Python. Clustering can help to reduce the amount of work required to identify attractive investment opportunities by grouping similar countries together and generalizing about them.
This article goes into sufficient detail about distance metrics, measures of cluster quality, clustering algorithms, the KMeans clustering algorithm, before discussing the results from clustering the countries and drawing conclusions. If you are quite comfortable with the topic, feel free to navigate to those sections which sound interesting.
 Clustering theory  Measures of similarity and distance
 Clustering theory  Classes of clustering algorithms
 Clustering theory  The KMeans clustering algorithm
 Clustering theory  Measures of clustering quality
 Clustering theory  Monte Carlo Methods in clustering
 Clustering results  Visualizations and centroid analysis
 Clustering results  Conclusions and further research
 Appendix A  All the code and some data!
Clustering Theory  Measures of Similarity & Distance
Clustering is the process of partitioning a set of heterogeneous (different) objects into subsets of homogeneous (similar) objects. At the heart of cluster analysis is the assumption that given any two objects you can quantify the similarity or dissimilarity between those objects. In continuous search spaces distance measures similarity.
Below I have written about similarity metrics for continuous search spaces. For each I have included formulas (Given two vectors, and ) and Python code. All the Python code used to write this article is available.
Manhattan Distance
An interesting paper entitled "On the Surprising Behavior of Distance Metrics in High Dimensional Space" studied the different distance metrics in high dimensional spaces and found that using Manhattan distance and fractional distance were preferable to using the more traditional Euclidean distance measures for clustering,
Manhattan distance is equal to the sum of the absolute distances of each coordinate in the vector,
Euclidean Distance
Euclidean distance is the ordinary distance between two points or vectors. Some derivatives of Euclidean distance include squared Euclidean distance and half squared Euclidean distance,
Fractional Distance
Interestingly Manhattan Distances, Euclidean Distances, and Fractional Distances are all closely related and are often described in terms of Lnorms. Given the following distance equation,
Notice that when , we get the same value as using Manhattan distance and when we get the same value as using Euclidean distance. Now if Manhattan distances perform better than Euclidean distances in high dimensional spaces it implies that for values performance may improve still further. The paper I mentioned earlier found that at 20 dimensions performance of the fractional distance metric converged after .
During this study I had no luck using the Fractional Distance metric. My (unproved) hypothesis is that the performance of the Fractional Distance metric is sensitive to the scale of the values in the vectors being clustered.
Alternative Distance Metrics
One can also use correlations as a measure of similarity for continuous valued search spaces. Popular measurements include the Pearson coefficient, cosine similarity, and the tanimoto coefficient. If your data is not continuous i.e. it is discrete, then you can either use a function to convert it to a continuous space or us metrics such as Hamming Distance, the Jaccard Index, or Levenshtein Distance (for strings) to measure similarity.
Python Code  Distance Metrics
The Similarity class written below contains implementations of all the distance metrics mentioned above. The default value for fractional distances is currently set to 2.0 (Euclidean distance) but can be adjusted.
Clustering Theory  Classes of Clustering Algorithms
The two main classes of clustering algorithms are hierarchical clustering and partitional clustering. Hierarchical clustering forms clusters by merging small clusters into larger ones or dividing large clusters into smaller ones. Partitional clustering forms clusters by dividing the input data set into mutually exclusive sub sets.
The differences between hierarchical and partitional clustering mostly has to do with the inputs required. Hierarchical clustering only requires a similarity measure whereas partitional clustering may require a number of additional inputs, most commonly the number of clusters, . Generally speaking, hierarchical clustering algorithms are also better suited to categorical data.
Hierarchical Clustering
There are two types of hierarchical clustering namely agglomerative clustering and divisive clustering. Agglomerative clustering is a bottomup approach and involves merging smaller clusters (each input pattern by itself) into larger clusters. Divisive clustering is a top down approach which starts with one large cluster (all of the input patterns) and divides them into smaller and smaller clusters until each input pattern is by itself in a cluster.
Partitional Clustering
In this article we will focus on partitional clustering algorithms. The two main categories of partitional clustering algorithms are centroidbased clustering and densitybased clustering. This article focuses on centroidbased clustering; in particular the popular Kmeans clustering algorithm.
Clustering Theory  The KMeans Clustering Algorithm
The KMeans Clustering algorithm is a centroidbased partitional clustering algorithm which works using the meanshift heuristic. The Kmeans clustering algorithm consists of three steps (Initialization, Assignment, and Update). These steps are repeated until either the clustering has converged or the number of iterations has been exceeded a.k.a the computational budget has been exhausted.
Initialization
A set of centroids are randomly initialized in the search space. These centroids must be in the same order of magnitude as the data patterns being clustered. In other words, it does not make sense to initialize random vectors with values between 0 and 1 if the values in the data patterns are between 0 to 100. To remedy this the data can be normalized between 0 and 1 using,
Note: make sure you normalize your data across each attribute, not each pattern
Assignment
Once the centroids have been randomly initialized in the space, we iterate through each pattern in the data set and assign it to the closest centroid. Try and do this step in parallel, especially if you have a very large number of patterns in the data set.
Update
Once the patterns have been assigned to their centroids, the meanshift heuristic is applied. This heuristic replaces each value in each centroid with the mean of that value over the patterns which have been assigned to that centroid. This shifts the centroid towards the highdimensional mean of the patterns which belong to it. The problem with the meanshift heuristic is that it is sensitive to outliers. To overcome this one can either use the Kmedoids clustering algorithm or one can standardize the data to dampen the effect of outliers using,
Iteration
These three steps are repeated for a number of iterations until the clustering has converged on a solution. A really nice GIF showing this is shown below,
Python Code  Additions to the Clustering class
The Python methods below are extensions to the Clustering class which allow it to perform the Kmeans clustering algorithm. This involved updating the centroids using the meanshift heuristic.
Clustering Theory  Measures of Clustering Quality
Assuming you have a measure of similarity and a clustering of the data, you still need an objective function measures the quality of that clustering. Most clustering quality metrics try to optimize the clustering in terms of its inter and intracluster distances. Put simply these metrics try to ensure that patterns in the same cluster are close together and patterns in different clusters are far apart.
Quantization Error
Quantization error measures the round off error introduced by quantization i.e. mapping a set of input values to a finite smaller set. This is basically what we are doing by clustering patterns into clusters.
Quantization error can be calculated using the following very ugly formula,
.
where is the set of data points we are clustering and just means we want to take the norm of and square it. So what this means is that quantization error is assuming we are using the Euclidean distance metric.
Alternatively we can calculate quantization error using the following less ugly formula,
where i.e. the number of patterns in the data set, is the pattern in , is the centroid to which is closest, and is the absolute distance between and using some distance metric, .
Note: The image also assumes we are using the Manhattan distance.
In the above illustration of the Quantization Error we are calculating the sum of the squared absolute distances between every pattern and the centroid it has been assigned to.
Davies–Bouldin index
The DaviesBouldin criterion is based on a ratio of intracluster and intercluster distances of a particular clustering. Mathematically the DaviesBouldin index is calculated as,
where
is the ratio between the and cluster in the data. , is the average distance between each pattern in the cluster and the centroid of that cluster; and is equal to the distance between the and centroid. The DaviesBouldin index is the sum of the worst intratointer cluster distance ratios over all clusters.
Note: The image assumes we are using the Manhattan distance.
In the above illustration of the DaviesBouldin index we have three clusters consisting of three patterns each. The first step is to calculate the average intracluster distances for each cluster . The next step is to calculate the distances between each pair of centroids, . Lastly we return the maximum ratio of the sum of any pair of intra cluster distances and the distance between those two centroids.
Silhouette Index
The Silhouette Index is one of the most popular ways of measuring a particular clustering's quality. It measure how similar each pattern is to the patterns in it's own cluster as compared to patterns in other clusters.
for a pattern then is the minimum average distance from pattern to the patterns in a different cluster (minimized over clusters), and is the average distance from pattern to the patterns in the same cluster as .
A high silhouette value indicates that is wellmatched to its own cluster, and poorlymatched to neighboring clusters. One should aim to maximize for each pattern in the data set.
Note: The image also assumes we are using the Manhattan distance.
In the above illustration of the Silhouette index we have the same three clusters consisting of the same three patterns each as in our last image. This image shows how to calculate the Silhouette index for just patterns and . In the first step for both we calculate the average distance from each pattern to the patterns in its own cluster and . Next we calculate the average distance from each pattern to the patterns of the next closest cluster, and . For each pattern we calculate the silhouette index as the difference between the average distance to the patterns of another cluster and the average distance to the patterns of its own cluster divided by the max of those two numbers. To calculate the Silhouette Index of the clustering we calculate the Silhouette Index for every pattern and average them.
The good, the bad, and the ugly
After having spent the last few months using these metrics I have concluded that none of them are perfect,
 Quantization Error  the computational complexity of this metric is the smallest, however the metric is biased towards a large number of clusters because as you add more centroid the clusters become smaller (more compact) and in the extreme you may have one pattern assigned to each centroid. In this case the Quantization error is minimized. The results were the most believable using this metric The Good.
 DaviesBouldin  as you increase the value of the distances between each centroid will naturally decrease on average. Because this term is in the denominator you end up dividing by a smaller number for larger values of . The result of this is that the metric is biased towards solutions with a smaller number of clusters The Bad.
 Silhouette Index  the computational complexity of this metric is crazy. Assuming you calculate the distance from each pattern to every other pattern to calculate which cluster is the closest and you do that for each pattern, you need to calculate distances. In this example that equates to 35,156 calculations. If you memoize perfectly the number of calculations for each pass is halved. Also, the metric seems to be biased against solutions with a larger number of clusters because, as you might have guessed, the distances between the centroids is smaller so each pattern is more likely to belong more to a neighboring cluster The Ugly.
The following analysis of the different metrics demonstrates the biases quite nicely; despite the fact that they are supposed to be measuring the same thing they are almost perfectly negatively correlated.
x  QE  DB  SI 

QE  1.0  0.965  0.894 
SB  0.965  1.0  0.949 
SI  0.894  0.949  1.0 
Python Code  Clustering
Before you can evaluate the fitness of a given clustering you need to actually cluster the patterns. The Clustering class contains methods which assign patterns to their nearest centroids.
Python Code  Objective Functions
The ClusteringQuality class measures the quality of a given Clustering of the input patterns.
Clustering Theory  Monte Carlo Methods in Clustering
The two biggest problems with the KMeans clustering algorithm are that:
 It is sensitive to the random initialization of the centroids and
 The number of centroids initialized,
For these reasons it the Kmeans clustering algorithm is often restarted multiple times. Because the initialization is (usually) random we are essentially sampling random highdimensional starting positions for the centroids which is also known as a Monte Carlo simulation. In order to compare solutions across independent simulations, we need a measure of cluster quality such as those discussed previously.
Deterministic Initialization
I said that the initialization is usually random because there are deterministic initialization techniques for the KMeans clustering algorithm including gridbased initialization and principal component based initialization.
Stochastic Initialization
For stochastic (random) initialization of the centroids one can either use pseudo random numbers or quasi random numbers. The difference is that the next random number in a pseudo random sequence is independent of the previous random numbers whereas in a quasi random number sequence the next random number is dependent on the previous random numbers. Dependent random numbers cover a greater surface of the search space.
Picking the right k
In addition to testing different initializations, we can test different values of in a Monte Carlo framework. Currently there is no optimal way of dynamically determining the right number of clusters, although techniques for determining the right value are always being researched. I prefer to just empirically try different values and compare the results although this is timeconsuming, especially on large data sets.
Monte Carlo Method  Code
The Monte Carlo simulation used to analyze the Clustering Quality results is shown below,
Clustering Results  Visualization and Centroid Analysis
The final results presented below represent the best clustering found over the range over 1,000 independent simulations per each value of . Euclidean distances and the Quantization Error were the distance and quality metric used in the MonteCarlo KMeans clustering. The data set used to produce the results was the normalized pointintime 2014 dataset consisting of the 19 socioeconomic indicators identified as being positively correlated to realGDP growth. These were identified in my previous post, What Drives RealGDP Growth?.
When viewing the results keep it in mind that the number of each cluster is arbitrary so being in cluster one may or may not be better than being in cluster two. In other words, it is an nominal scale.
Map Visualization for ~2014
The following map represents the best possible clustering for based on normalized 2014 socioeconomic data. The centroids are broken down in the Cluster Breakdown & Centroid Analysis subsection.
.
Map Visualization for ~2004
The following map represents what the countries would have been clustered on if 2004 data were used and the same centroids identified in the 2014 data were used to cluster the countries.
.
Cluster Breakdown & Centroid Analysis
Each one of the tabs below breaks down the cluster into the country which belong to it and compares the centroid to the median centroid across each one of the 19 socioeconomic indicators we clustered on.
Countries in this Cluster in 2014
Australia
Austria
Belgium
Canada
Czech Republic
Denmark
Estonia
Finland
France
Germany
Iceland
Japan
Luxembourg
Malta
Netherlands
New Zealand
Norway
Sweden
Switzerland
United Kingdom
Countries in this Cluster in 2014
Albania
Algeria
Antigua and Barbuda
Armenia
Azerbaijan
Belize
Cape Verde
Colombia
Costa Rica
Cuba
Dominica
Dominican Republic
Ecuador
Egypt
El Salvador
Fiji
Grenada
Jamaica
Jordan
Kazakhstan
Kosovo
Libya
Maldives
Mauritius
Mexico
Moldova
Morocco
Panama
Paraguay
Peru
Philippines
Romania
Saint Kitts and Nevis
Saint Vincent and the Grenadines
Seychelles
South Africa
Suriname
Syria
Thailand
Trinidad and Tobago
Tunisia
Turkey
Tuvalu
Ukraine
Uzbekistan
Countries in this Cluster in 2014
Indonesia
Iran
Sao Tome and Principe
Vietnam
Countries in this Cluster in 2014
Bangladesh
Bhutan
Bolivia
Botswana
Djibouti
Equatorial Guinea
Gabon
Gambia
Guatemala
Guyana
Honduras
India
Iraq
Kyrgyzstan
Laos
Mongolia
Myanmar
Namibia
Nepal
Nicaragua
North Korea
Pakistan
Saint Lucia
Samoa
Senegal
South Sudan
Sri Lanka
Swaziland
Tajikistan
Tonga
Turkmenistan
Vanuatu
Countries in this Cluster in 2014
Barbados
Bosnia and Herzegovina
Bulgaria
Croatia
Cyprus
Georgia
Greece
Hungary
Ireland
Italy
Latvia
Lithuania
Macedonia
Montenegro
Poland
Portugal
Serbia
Slovak Republic
Slovenia
Spain
Countries in this Cluster in 2014
China
United States
Countries in this Cluster in 2014
Afghanistan
Angola
Benin
Burkina Faso
Burundi
Cambodia
Cameroon
Central African Republic
Chad
Comoros
Congo
CongoBrazzaville
Eritrea
Ethiopia
Ghana
Guinea
GuineaBissau
Haiti
Ivory Coast
Kenya
Kiribati
Lesotho
Liberia
Madagascar
Malawi
Mali
Mauritania
Mozambique
Niger
Nigeria
Papua New Guinea
Rwanda
Sierra Leone
Solomon Islands
Somalia
Sudan
Tanzania
TimorLeste
Togo
Uganda
Yemen
Zambia
Zimbabwe
Countries in this Cluster in 2014
Argentina
Bahrain
Belarus
Brazil
Brunei
Chile
Hong Kong
Israel
Kuwait
Lebanon
Malaysia
Oman
Qatar
Russia
San Marino
Saudi Arabia
Singapore
South Korea
The Bahamas
UAE
Uruguay
Venezuela
Clustering Results  Conclusions and Further Research
Being a quant is not about risk management, derivatives pricing, or algorithmic trading; it is about challenging the way things are done and finding better ways of doing them usually using statistical and computational methods. This way of thinking can be applied to anything including how we think about the world. I started writing this series of articles because I was frustrated by the use of oversimplified buzzwords such as 'The African Growth Story', 'Third World vs. First World', and 'BRICs nations' to describe topics as complex as the world and growth. The conclusions below have presented me with a better framework for thinking about the world and growth; my hope is that it will be as useful to you as it is to me.
Conclusions
China has joined the USA as an outlier
In 2004, the United States was an outlier and occupied a cluster on its own. This cluster was characterized by a low exchange rate at PPP, high imports, high exports, high household expenditures, high industrial production, and relatively high government revenues and spending particularly on health. At this point in time the biggest differences are still that there is a much larger amount of investment happening into China and their population is (obviously) a lot larger with a better demographic (population between 15 and 64). In terms of industrial production China has also overtaken the USA. These are shown in a side by side comparison below,
Geography is a poor measure of similarity
Many companies adjust their strategies according to regions of the world. What the results suggest is that countries in regions of the world are as likely to be clustered together as they are not. This would indicate that as a crude clustering technique, seeing the world as regions of similar countries with similar growth prospects is suboptimal.
Some colloquialisms are right; others are just wrong
Colloquialisms such as eastern vs. western European countries show up on the map and are (for lack of a better word) right. However colloquialisms such as BRICs (Brazil, Russia, India, China, and South Africa) are clearly motivated more by the political economy than the actual economy. Here are my thoughts on some common colloquialisms,
 Eastern vs. Western Europe  There appears to be a clear way distinction between those countries in cluster one versus those in clusters five and two. The past ten years have seen changes in Spain, Ireland, the Czech Republic, and other nearby countries. This is probably the result of the sovereign debt crisis.
 Eastern vs. Western Countries  This is an oversimplification. Most Asian countries occupy different clusters, and the traditional Western countries such as the USA and the UK do not actually occupy the same cluster.
 BRICs nations  Brazil, Russia, India, China, and South Africa are in different clusters. Whilst they may have trade agreements in place, that doesn't imply that these nations have the same social, demographic, and economic makeup or the same potential for future realGDP growth.
 The African Growth Story  Whilst capital markets have done well in the past ten years, this does not seem to have reflected in major changes to the social, demographic, and economic makeup of the continent. What is interesting is that India and Pakistan are no longer clustered with Central and Southern African countries.
 Northern Africa vs. Southern Africa  There is a clear distinction between Northern African countries (Morocco, Algeria, Egypt, Libya, etc.) and the Rest of Africa. Surprisingly, South Africa is now clustered with these countries.
 Emerging vs. Developed Nations  Again this appears to be an oversimplification. There seem to be stages of development which is discussed in the next section.
There are many more colloquialisms and I apologize for not commenting on all of them, these six are just the ones which I encounter most frequently in my everyday life. Please comment if you spot other interesting relationships.
There are multiple stages of developing countries
It is impossible to quantify how good it is to be in one cluster vs. another cluster because we do not know the relative importance of each socioeconomic indicator. In some cases we cannot even be sure that a large or small value is good or bad. For example, is a large government spending still good if the government is ineffective? Despite this fact, I have attempted to construct a crude metric for ranking each one of the clusters:
Rank value = Exports + Household Expenditures + Imports + Improved Sanitation + Improved Water + Population + Population Aged 15 To 64 + Population Growth + Total Investment + Urban percentage + Cellphone Subscriptions + Government Revenues + Government Spending + Health Expenditure + Industrial Production + Internet Users  Exchange Rate At PPP  Unemployment  Age Dependency Ratio Old
Based on this metric the relative rank of each cluster is shown below,
Cluster  Rank Value  Rank  Count 

6 
10.238  1  2 
8 
5.191  2  22 
1 
5.146  3  20 
5 
3.827  4  20 
2 
3.825  5  45 
4 
3.111  6  32 
3 
3.078  7  4 
7 
1.799  8  43 
This ranking isn't perfect, but it reaffirms our beliefs that the world is an unequal place, and that the distribution of countries is skewed towards lower rank values. In other words there are more worse off countries than well off ones.
This ranking also implies multiple stages of development for countries. The stage of development is inversely proportional to the realGDP growth potential (countries in the earlier stages of development have greater growth potential) BUT the difficulty associated with unlocking that growth potential is also inversely proportional to the stage of development (countries in the earliest stages of development will find it harder to realize their growth potential).
So what does that mean for investors? I think it means that distinctions should be made between countries in different stages of development. This is because whilst the most underdeveloped countries represent investments with the greatest return potential, they are also more risky and will probably require longer terms to pay off. Ideally these factors should be weighed against one another and compared to the investor's riskreturn appetite.
Further Research
In the third part of this article series I am going to look at predictive models for realGDP growth. The article will take a while to complete, because I am planning to implement a few models and also take a deeper look at the data.
In addition to this, the following topics are worthy of further research:
 Are fractional distances sensitive to the scale of the vectors being clustered?
 What unbiased alternatives are there to the clustering quality metrics presented here.
 To which socioeconomic indicators are the clustering results the most sensitive?
 More detailed analysis of each cluster, and of particular countries in each cluster.
Appendix A  All the code and some data!
I have removed the code from this article because of slow load times on the page. You can still obtain a copy of the Python file (in .txt format) and the data by following the links below.
All the code I used to write this article is shown at the end. To run it you will need the raw data file; you can download a copy of it here. Please let me know if you can identify any optimizations, the code is already memoized to reduce the number on unnecessary distance calculations and (tries) to make use of numpy arrays over conventional lists.

Probably one of the best explanation for k mean cluster and the way to assess its quality! Kudos! Well done! Look forward to part 3!

I am, as always, impressed by the clarity of your work. Keep up with such articles, I am myself interested in quantitative computing in my spare time and I always find here a great source of inspiration. Bravo

A very detailed explanation with a practical application in finance/econ. Lots of value add, thank you for posting Stuart!

Thank you very much Stuart! I have been learning a lot with your website!

Thank you man. You're the best. You explain all the stuffs in a best way.
Comments