Introduction to Graph Clustering

+8

Previously I wrote a blog on how to divide the countries of the world into 10 even groups. While I did the grouping manually and intuitively, I wonder if there is a more scientific way to do the same. After some research, there are indeed many methodologies in doing similar task. In this blog, I will go through some of these.

Before we begin, let me explain some basic terminology in the science of map and grouping:

  • Vector: a list of numbers.
  • Matrix: a table of numbers.
  • Graph: a group of related and connected objects. In our case, it is the world map.
  • Vertex or node: the object in a graph, e.g. the numbered circles in the diagram below. In our case, these are the countries.
  • Edge: the line connecting the vertices. In our case, these can be the country borders.
  • Clustering: the process of grouping nodes, which is what I am trying to do here.
A graph with 6 nodes and 7 edges

To explain the different methods, in this blog I will attempt to divide the South American countries into two groups. South America is used because it is relatively simple with only 12 countries (excluding France), and there are no island countries (because for island countries we need to decide how to set the edges, more on that later).

Hierarchical Clustering

Hierarchical clustering is to build clusters like a family tree. It can either starts bottom-up by merging smaller clusters into larger ones, or top-down by breaking down large clusters into smaller ones. Here I will use the first method on the South American countries.

First treat the 12 countries as 12 clusters, and calculate the distance between each of them (I am using the capitals here).

ARBOBRCLCOECGYPYPESRUYVE
Argentina0
Bolivia22350
Brazil234021600
Chile1135190230090
Colombia46602436366442480
Ecuador43572137377637857300
Guyana4605281527524667177923910
Paraguay10371465146215513770357835700
Peru313610773165246818801324295825130
Suriname451328682535466720992680346347631320
Uruguay2062361227213404770449446351069329445270
Venezuela509330043589490310281755104341042745138751640

It turns out that Argentina (Buenos Aires) and Uruguay (Montevideo) are the closest together, so let's merge them to form a new cluster. Now, recentre Argentina + Uruguay and recalculate the distance between it and the remaining nodes:

BOBRCLCOECGYPYPESRVEAR+UY
Argentina + Uruguay22962304123847144424461810483214451851280

The next closest pair is Guyana + Suriname (346 km). So merge these, recalculate the distance and repeat the process until it forms the number of desired clusters (in my case, until two left). The whole process can be drawn out like a tree (no this is not the result of the Copa América 😛):

Hierarchical clustering of South American countries
The result of hierarchical clustering

K-means

K-means is a popular way to group points closed to each other into k number of clusters. It starts by having an educated guess on the centroids of the clusters and then iteratively improving the location of the centroids until they converge (i.e. they stop changing).

In my example, I am trying to form 2 groups (k = 2). Again I use the capitals to represent the location of each country, and I start by randomly setting the two centroids at Lat/Long {0, -70} (near the border between Colombia & Brazil) and {-20, -50} (southwest of Brasilia).

Capitals of South American countries, with my initial guess in red

The next step is to check whether the capitals are closer to Point 1 or Point 2.

CountryCapitalDistance (km)Group
Point 1Point 2
ArgentinaBuenos Aires403318222
BoliviaLa Paz184619551
BrazilBrasília29915182
ChileSantiago372125302
ColombiaBogotá68437961
EcuadorQuito94838041
GuyanaGeorgetown151631111
ParaguayAsunción31119802
PeruLima154830161
SurinameParamaribo177129311
UruguayMontevideo413117572
VenezuelaCaracas121638611

This gives us the preliminary grouping of the capitals/countries. The next step is to recalculate the two centroids based on the two groups.

  • Point 1 (green): {-0.1435, -68.2838}
  • Point 2 (blue): {-28.7952, -58.1563}
Countries are divided into group 1 (green) and group 2 (blue), and then new centroids are calculated

You can see how the centroids move and "improve". Now repeat the calculation of the distances between the capitals and the two new centroids:

CountryCapitalDistance (km)Group
Point 1Point 2
ArgentinaBuenos Aires39676462
BoliviaLa Paz181917072
BrazilBrasília283117872
ChileSantiago371312962
ColombiaBogotá83340861
EcuadorQuito113938441
GuyanaGeorgetown136439591
ParaguayAsunción30223922
PeruLima163827011
SurinameParamaribo160238671
UruguayMontevideo40586972
VenezuelaCaracas119244671

You can see that Bolivia is now switched to Group 2. Keep repeating the same until the centroids and the groupings do not change anymore, and now you have the two clusters.

After two iterations, the South American countries are divided into two clusters

Spectral Clustering

So far the two methods above are purely based on the distance between the country capitals. We have not considered the country border (edges) yet. The issue is that in the case of non-bordering countries that are near each other (e.g. Djibouti & Yemen), these will likely be put into the same cluster although they are 7 edges apart.

To properly consider the relationship between countries based on borders, we do need to look at the edges and turn the countries into a graph as I mentioned earlier. (Again for simplicity sake, I am ignoring France and Panama)

South American countries represented as a graph

Spectral clustering requires some basic understanding of linear algebra which I will not go into the details of the maths behind it (I don't know that well myself either 😅). Rather I would show you how it is done. In this method, there are three steps involved:

  1. Construct a Laplacian matrix representation of the graph.
  2. Compute the eigenvalue and eigenvector of the matrix.
  3. Apply the eigenvector to the nodes to form the clusters.

Let's see how it is done. In simple term, the Laplacian matrix is a table of numbers that represents the above graph. You can read the Wikipedia page to learn more, but basically the formula is L = D - A, or put simply, a table where the diagonal is the number of edges (number of bordering countries), and the rest of the numbers are either 0 if they don't border each other, or -1 if they do. (Note that all rows and columns sum up to 0). So for South America, the Laplacian matrix looks like this:

ARBOBRCLCOECGYPYPESRUYVE
Argentina5-1-1-1000-100-10
Bolivia-15-1-1000-1-1000
Brazil-1-190-10-1-1-1-1-1-1
Chile-1-1030000-1000
Colombia00-104-100-100-1
Ecuador0000-1200-1000
Guyana00-1000300-10-1
Paraguay-1-1-1000030000
Peru0-1-1-1-1-1005000
Suriname00-1000-100200
Uruguay-10-1000000020
Venezuela00-10-10-100003

The next step is to find the eigenvalue and eigenvector of the above matrix. In simple term, the eigenvector (v) is a special vector of a matrix (T) such that multiplying by it will scale the vector linearly by a constant eigenvalue (λ), i.e. Tv = λv. You can think of it as a way to enlarge a photo without distorting it.

There is no magic formula to calculate the eigenvalue and eigenvector of a matrix. The simplest way is to use the Excel Goal Seek tool under What-If Analysis to iterate until it finds the best possible answer. I will not go through the method here, but you can check out YouTube videos such as this and this on how this can be done. (Or if you are already using Matlab then you don't need me to teach you how).

With that, it calculates the second smallest eigenvalue of our Laplacian matrix to be 1.08958 (you don't want the eigenvalue of 0 which is useless to us), and the corresponding eigenvector is

-0.83
-0.82
0.22
-1.21
-0.02
-0.75
1.96
-0.75
-0.66
2.39
-0.67
1.13

Applying the eigenvector to our list of countries and sorting it, we have a group of countries with negative numbers and another group with positive numbers. i.e.

Group 1Group 2
Chile-1.21Brazil0.22
Argentina-0.83Venezuela1.13
Bolivia-0.82Guyana1.96
Paraguay-0.75Suriname2.39
Ecuador-0.75
Uruguay-0.67
Peru-0.66
Colombia-0.02
The result of spectral clustering
Colour scale of the spectral clustering result

As you can see, since the South American countries are all well interconnected, it is difficult to form two distinct clusters. But when I apply the same calculation to both North and South America, the result is much more obvious.

Group 1Group 2
Uruguay-0.504Costa Rica0.103
Paraguay-0.501Nicaragua0.359
Argentina-0.501Honduras0.599
Suriname-0.499El Salvador0.680
Chile-0.497Guatemala0.732
Bolivia-0.495Belize0.794
Guyana-0.492Mexico0.821
Brazil-0.485USA0.902
Peru-0.472Canada0.944
Venezuela-0.469
Ecuador-0.451
Colombia-0.410
Panama-0.157
North and South America form 2 distinct clusters

Modularity Clustering

Modularity is a way to quantify how closely related the nodes are within a cluster. In this method, we try to maximise the combined modularity score of the clusters so that we are putting closely related nodes together rather than unrelated ones.

The formula for calculating modularity looks quite daunting but it is actually much simpler than what it seems. Let me explain:

1*yfEFKqD7VmC7lDXXrjL_CA.png

where
Q = modularity score, what we are trying to calculate for our clusters
m = total number of edges in our graph, i.e. total number of country borders
Aij = adjacency score between node i and j, 1 if node i borders node j, or 0 if they don't
γ = resolution factor, equals to 1 by default unless you want to add weight to the expected probability of node i and j being connected
ki and kj = degrees of node i and j. In the standard case where we treat all borders equal, this is the number of borders (edges) that country i and j (node i and j) have.
δ(ci, cj) = Kronecker delta, just a fancy way to say that if i = j then it is 0 else 1, i.e. only add up when country i and j are different, e.g. count Argentina-Chile, not Argentina-Argentina.

So how does it work? Let's say we want to examine the modularity score of the two clusters formed by spectral clustering above. First there are in total 23 edges in the graph (23 borders, m=23). For Group 1, we need to sum up for all node pairs, e.g. for Chile - Argentina:

  • Aij = 1 (they do border each other)
  • ki = 3 (Chile borders 3 other countries)
  • kj = 5 (Argentina borders 5)
Hence,
 Aij - (ki kj / 2m) = 1 - (3 x 5) / (2 x 23) = 0.674

And for say Colombia - Bolivia,
 Aij - (ki kj / 2m) = 0 - (4 x 5) / (2 x 23) = -0.435

So on so forth. The calculation gives Q1 = 0.0681 and Q2 = 0.0647, and hence a total score of 0.133.

Let's say I want to even out the number of countries in each group so I move Colombia to Group 2. Redoing the same calculation gives a new total score of 0.148, so it is not a bad idea! And if I want to have exactly 6 countries in each group so I move Peru to Group 2 too, the calculation gives a new total score of 0.124. So I should probably think twice before doing that.

The result of modularity clustering

Considerations

There are many other methods of clustering, some are focusing on the finding the nodes that are closely connected to each other, others are trying to find nodes with similar topology, e.g. countries like Brazil, Germany, DR Congo are similar as they tie other countries together. Some methodology is limited to finding a small number of clusters (e.g. 2) and fails when trying to form larger number of clusters (e.g. 10 groups of COTW), or when there is no clear distinction among the nodes to form a cluster, or unable to ensure each cluster has the similar number of nodes.

Another consideration comes to how we define the edges, especially for island countries. There are only 4 bridges/tunnels in the world that connect non-neighbouring countries together, namely:

  • United Kingdom - France: Channel Tunnel
  • Denmark - Sweden: Øresund Bridge
  • Bahrain - Saudi Arabia: King Fahd Causeway
  • Singapore - Malaysia: Johor Causeway and others

For others, how should the edges be defined? For example, should Japan be connected to Russia, and Taiwan? and South Korea? Or should United States be connected to Russia? The Bahamas? Kiribati? How about the territories (e.g. Canada-Denmark)? And don't get me started on Cyprus (to Greece or not to Greece, that is the question 😄).

Besides, should all borders and edges be treated equal? They are in my calculation above, but not all borders are the same, e.g. there is no land crossing between Colombia and Panama. Some bordering countries are culturally and geopolitically very far apart, e.g. Russia-North Korea, China-Afghanistan.

Final Thoughts

After reading all these, you may ask the question why? Why do we spend so much effort into clustering? It is very obvious to anyone that North and South America form two different groups without all these calculation. But we cannot always do it manually and visually, especially when the number of nodes is huge and the edges are complex.

The application of clustering goes far beyond countries and maps. What if these countries are high-schoolers and the graph represents their social network on TikTok? Should I push ad selling American goods to the popular Brazilian influencer who has so many connections? Clustering can help identify how genes are grouped together, or help AI identify objects in photos, or identify fraudulent activities in your credit card transactions, or for Natural Language Processing (NLP) to understand words (or dog barks) with similar meanings. And the opportunities are endless.

As you can see, doing these calculation manually is almost impossible (it's tedious and error-prone). People would write Python scripts and let machine learning do the hard work. If you find this interesting, study Discrete Maths and Data Science in university and your future is bright!

Advertisement time: I do have few quizzes here with graphs, have fun with these!

3 Comments
+2
Level 78
Jun 26, 2025
Great and detailed blog!!

I can’t help but think about how Topological Data Analysis tools must be very applicable here too. I imagine you could construct a Rips Complex based on various geopolitical similarities. I’m not well versed enough in TDA to know how to interpret the persistent homology, but I’m sure there’s an R package that could easily do all the heavy lifting for you!

+2
Level 70
Jun 26, 2025
Thanks. I don't know TDA but will research into it. I do have another project in mind which is to divide the countries (and subdivisions) of the world based on different metrics like population and GDP, but find doing it manually too hard and unsatisfactory. I'll see if there are free tools that can do it better.
+1
Level 78
Jun 26, 2025
Sounds like it should be fun!