What are unsupervised clustering algorithms?
Clustering algorithms are unsupervised learning techniques used to find distinct groups in a dataset. Typical examples include grouping customers with similar behaviour patterns or products with similar characteristics. In each case the goal is to find groups whose members resemble one another and differ from the members of other groups.
Can I measure the accuracy of a clustering model?
For supervised learning problems, such as a linear regression model that predicts house prices, there is a target that you are trying to predict. Against this target you can measure some form of accuracy using metrics such as RMSE, MAPE or MAE. A clustering algorithm, however, is applied to a dataset with no such target, so an ‘accuracy’ score is not possible. We therefore need other types of measurement that give an indication of performance. The most common is the distinctness of the clusters created: if all clusters look the same, you haven't achieved your goal of creating clusters with unique characteristics. There are three common metrics for measuring the distinctness of clusters: the Silhouette Coefficient, the Calinski-Harabasz Index and the Davies-Bouldin Index.
Which performance metrics are usable for clustering models?
The Silhouette Coefficient ranges from -1 to 1; the higher the score, the better defined and more distinct your clusters are. It can be calculated using scikit-learn in the following way:
from sklearn import metrics
from sklearn.cluster import KMeans

# X is your feature matrix, shape (n_samples, n_features)
my_model = KMeans().fit(X)
labels = my_model.labels_  # cluster assigned to each row of X
metrics.silhouette_score(X, labels)
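To see how the Silhouette Coefficient responds to cluster separation, the following is a minimal sketch on synthetic data. The blob centres, the spread values and the helper name `silhouette_for_blobs` are illustrative assumptions, not part of the original example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative cluster centres (an assumption chosen for this sketch).
CENTERS = [(-5, -5), (0, 5), (5, -5)]


def silhouette_for_blobs(cluster_std, random_state=0):
    """Generate three blobs with the given spread, cluster them, and score the result."""
    X, _ = make_blobs(
        n_samples=300,
        centers=CENTERS,
        cluster_std=cluster_std,
        random_state=random_state,
    )
    labels = KMeans(n_clusters=3, n_init=10, random_state=random_state).fit_predict(X)
    return silhouette_score(X, labels)


# Tight blobs are easy to tell apart; wide blobs overlap heavily.
tight_score = silhouette_for_blobs(cluster_std=0.5)
loose_score = silhouette_for_blobs(cluster_std=5.0)

print(f"tight blobs: {tight_score:.3f}")
print(f"loose blobs: {loose_score:.3f}")
```

Both scores stay within the -1 to 1 range, and the tightly packed blobs should score noticeably higher than the overlapping ones.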
For the Calinski-Harabasz Index, as with the Silhouette Coefficient, a higher score means better-defined clusters. The score has no upper bound, so there is no universally ‘acceptable’ or ‘good’ value; instead, track it throughout the development of your model to see whether it improves. It can be calculated using scikit-learn in the following way:
from sklearn import metrics
from sklearn.cluster import KMeans

# X is your feature matrix, shape (n_samples, n_features)
my_model = KMeans().fit(X)
labels = my_model.labels_  # cluster assigned to each row of X
metrics.calinski_harabasz_score(X, labels)
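Because the Calinski-Harabasz score has no fixed ‘good’ value, one way to use it is to compare candidate models on the same data. The sketch below does this for several cluster counts; the synthetic dataset and the range of `k` values are assumptions made purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data with three well-separated groups (illustrative assumption).
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=1.0,
    random_state=0,
)

# Fit one model per candidate cluster count and record its score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

# The candidate with the highest score is the best separated on this data.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: {score:.1f}")
print(f"best k: {best_k}")
```

The absolute values only mean something relative to each other, so the comparison is valid only between models fitted on the same dataset.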
Unlike the previous two metrics, the Davies-Bouldin Index measures the similarity between your clusters, so the lower the score, the better the separation between them. The lowest possible score is 0. It can be calculated using scikit-learn in the following way:
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# X is your feature matrix, shape (n_samples, n_features)
my_model = KMeans().fit(X)
labels = my_model.labels_  # cluster assigned to each row of X
davies_bouldin_score(X, labels)
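To make the idea of cluster ‘similarity’ concrete, here is a rough sketch that computes the Davies-Bouldin score by hand and compares it with scikit-learn's result. The helper name `davies_bouldin_by_hand` and the synthetic data are assumptions for illustration. The calculation follows the usual definition: each cluster's average distance to its own centroid, compared with the distance between centroids.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score


def davies_bouldin_by_hand(X, labels):
    """Mean, over clusters, of each cluster's worst-case similarity to another cluster."""
    cluster_ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in cluster_ids])

    # Spread of each cluster: average distance of its points to its own centroid.
    spreads = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(cluster_ids)
    ])

    # Distance between every pair of centroids.
    centroid_dists = np.linalg.norm(
        centroids[:, None, :] - centroids[None, :, :], axis=2
    )
    np.fill_diagonal(centroid_dists, np.inf)  # a cluster is not compared with itself

    # Similarity of clusters i and j: combined spread relative to their separation.
    similarity = (spreads[:, None] + spreads[None, :]) / centroid_dists
    return similarity.max(axis=1).mean()


# Synthetic data (illustrative assumption).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

manual = davies_bouldin_by_hand(X, labels)
reference = davies_bouldin_score(X, labels)
print(f"by hand: {manual:.4f}, scikit-learn: {reference:.4f}")
```

Wide clusters that sit close together produce a large ratio, which is why a lower score indicates better separation.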
Which performance metric should I choose for my clustering algorithm?
The most commonly used metric for measuring the performance of a clustering algorithm is the Silhouette Coefficient. This is likely due to its bounded range of -1 to 1, which makes the score easy to interpret and makes it possible to compare models across different datasets.