Which are the best clustering metrics? (explained simply)

Clustering is a common unsupervised learning approach, but it can be difficult to know which evaluation metrics best measure its performance. In this post, I explain why we need to consider different metrics, and which is best to choose.

Stephen Allwright

5 Sep 2022

What are unsupervised clustering algorithms?

Clustering algorithms are a machine learning technique used to find distinct groups in a dataset when we don’t have a supervised target to aim for. Typical examples are finding customers with similar behaviour patterns or products with similar characteristics, and other tasks where the goal is to find groups with distinct characteristics.

[Figure: clustering algorithm illustration]

How to measure clustering performance

For supervised learning problems, such as a regression model that predicts house prices, there is a target that you are trying to predict. From this target, you can easily calculate some form of accuracy by using metrics such as RMSE, MAPE, or MAE.
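To make the contrast with clustering concrete, here is a minimal sketch of those three supervised metrics. The house prices and predictions are made-up numbers purely for illustration, and RMSE is taken as the square root of scikit-learn's mean squared error.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

# Hypothetical house prices: the known target and a model's predictions
y_true = np.array([200_000, 350_000, 500_000])
y_pred = np.array([210_000, 330_000, 520_000])

mae = mean_absolute_error(y_true, y_pred)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root of the mean squared error
mape = mean_absolute_percentage_error(y_true, y_pred)  # returned as a fraction, not a percentage

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, MAPE: {mape:.4f}")
```

All three need `y_true`, which is exactly what a clustering problem lacks.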

However, when implementing a clustering algorithm for a dataset with no such target to aim for, an ‘accuracy’ score is not possible. We, therefore, need to look for other types of measurement that give us an indication of performance.

The most common measurement of clustering performance is the distinctness or uniqueness of the clusters created. This is because the most common goal for clustering is to create clusters that are as unique as possible.

What are the criteria of good clustering?

The goal of clustering is to find distinct patterns or behaviour in a dataset. Therefore, the criteria of good clustering are distinct groups with as little similarity between them as possible.

How do you evaluate the accuracy of clustering?

There is no measure of accuracy for clustering models as there isn’t a target variable to measure the accuracy against. Instead, we need to find other ways of measuring performance, such as the similarity or distinctness of the groups created.

Which are the best clustering metrics?

The most common ways of measuring the performance of clustering models are to measure either the distinctiveness of, or the similarity between, the created groups. Given this, there are three common metrics to use:

Silhouette Score
Calinski-Harabasz Index
Davies-Bouldin Index

What is Silhouette Score?

Silhouette Score is the mean Silhouette Coefficient across all samples. For each sample, the coefficient is calculated from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) as (b - a) / max(a, b). This score is between -1 and 1, where the higher the score the more well-defined and distinct your clusters are.

It can be calculated using scikit-learn in the following way:

from sklearn import metrics
from sklearn.cluster import KMeans

# X is assumed to be your feature matrix, with shape (n_samples, n_features)
my_model = KMeans().fit(X)
labels = my_model.labels_
metrics.silhouette_score(X, labels)
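To see where the number comes from, the sketch below computes the score by hand and compares it with scikit-learn. The synthetic data from `make_blobs`, the choice of three clusters, and the helper name `manual_silhouette` are my own assumptions for illustration, not part of any library.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

# Synthetic data: 300 points scattered around 3 centres (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)


def manual_silhouette(X, labels):
    """Mean of s = (b - a) / max(a, b) over all samples."""
    dist = pairwise_distances(X)  # Euclidean distance between every pair of points
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a point alone in its cluster gets a score of 0
        # a: mean distance to the other points in the same cluster
        # (the distance of a point to itself is 0, so only the divisor changes)
        a = dist[i, same].sum() / (n_same - 1)
        # b: lowest mean distance to the points of any other cluster
        b = min(
            dist[i, labels == other].mean()
            for other in cluster_ids
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


manual = manual_silhouette(X, labels)
library = silhouette_score(X, labels)
print(manual, library)
```

The two values should agree to within floating point error, which shows that the score is an average over individual samples rather than over clusters.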

What is Calinski-Harabasz Index?

Calinski-Harabasz Index is calculated as the ratio of between-cluster dispersion to within-cluster dispersion in order to measure the distinctiveness between groups. Like the Silhouette Score, the higher the score the more well-defined the clusters are. This score has no upper bound, meaning that there is no universally ‘acceptable’ or ‘good’ value.

It can be calculated using scikit-learn in the following way:

from sklearn import metrics
from sklearn.cluster import KMeans

# X is assumed to be your feature matrix, with shape (n_samples, n_features)
my_model = KMeans().fit(X)
labels = my_model.labels_
metrics.calinski_harabasz_score(X, labels)
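As a rough sketch of what happens under the hood, the index can be reproduced by summing squared distances directly. The synthetic data and the helper name `manual_calinski_harabasz` are assumptions of mine for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data: 300 points scattered around 3 centres (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)


def manual_calinski_harabasz(X, labels):
    """(between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k))."""
    n_samples = len(X)
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in cluster_ids:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Between: squared distance from centroid to overall mean, weighted by cluster size
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Within: squared distances from each member to its own centroid
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n_samples - k))


manual = manual_calinski_harabasz(X, labels)
library = calinski_harabasz_score(X, labels)
print(manual, library)
```

Because the result scales with the number of samples and the spread of the data, it is most useful for comparing different clusterings of the same dataset.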

What is Davies-Bouldin Index?

Davies-Bouldin Index is the average similarity of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Unlike the previous two metrics, this score measures the similarity of your clusters, meaning that the lower the score the better the separation between your clusters. The lowest possible score is 0.

It can be calculated using scikit-learn in the following way:

from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# X is assumed to be your feature matrix, with shape (n_samples, n_features)
my_model = KMeans().fit(X)
labels = my_model.labels_
davies_bouldin_score(X, labels)
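The same kind of by-hand check works here. In this sketch, the synthetic data and the helper name `manual_davies_bouldin` are my own assumptions, and distances are Euclidean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic data: 300 points scattered around 3 centres (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)


def manual_davies_bouldin(X, labels):
    """Average, over clusters, of the worst-case similarity to another cluster."""
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    centroids = np.array([X[labels == c].mean(axis=0) for c in cluster_ids])
    # Scatter: mean distance from each cluster's members to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(cluster_ids)
    ])
    worst = []
    for i in range(k):
        # Similarity of cluster i to cluster j: combined scatter over centroid distance
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k)
            if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))


manual = manual_davies_bouldin(X, labels)
library = davies_bouldin_score(X, labels)
print(manual, library)
```

Tight clusters (small scatter) that sit far apart (large centroid distance) give small ratios, which is why a lower score indicates better separation.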

Which is the best clustering evaluation metric?

The most commonly used metric for measuring the performance of a clustering algorithm is the Silhouette Score. This is likely due to it being bounded between -1 and 1, which makes the score easy to interpret and to compare against models built on different datasets.

How do you compare clustering methods?

In order to compare the performance of clustering methods, it helps to use a metric which has an upper and lower bound, so that scores are on the same scale regardless of the data. The most common clustering metric, Silhouette Score, can therefore be used for comparison as it’s bounded between -1 and 1.
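As a minimal sketch of such a comparison, the snippet below scores a few candidate models on the same data and picks the one with the highest Silhouette Score. The synthetic dataset, the particular candidates, and their names in the dictionary are assumptions chosen for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: 300 points scattered around 4 centres (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=7)

# Candidate clustering methods to compare on the same data
candidates = {
    "kmeans_k2": KMeans(n_clusters=2, n_init=10, random_state=7),
    "kmeans_k4": KMeans(n_clusters=4, n_init=10, random_state=7),
    "agglomerative_k4": AgglomerativeClustering(n_clusters=4),
}

# Fit each model and score its cluster labels
scores = {
    name: silhouette_score(X, model.fit_predict(X))
    for name, model in candidates.items()
}

best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Highest Silhouette Score:", best)
```

The same loop works for choosing the number of clusters for a single algorithm: add one candidate per value of `n_clusters` and compare the scores.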




Stephen Allwright

I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.
