Silhouette or Elbow? That is the Question.


Photo by Vladimir Mokry on Unsplash

When you want to cluster a dataset with no labels, one of the most common questions you encounter is “what is the right number of clusters?”. This question often arises when you work with, for example, the k-means algorithm, which requires you to fix the number of clusters in advance. I have encountered this question many times, and I am sure you have as well.

The problem can become more controversial when you work with high-dimensional data. Real-world data are often high-dimensional, and you need to reduce their dimensionality to visualize and analyze them. The clustering results in the original space can differ from those in the dimension-reduced space. You used the same algorithm, yet you see a discrepancy. That means the clustering results are sensitive to the dimensionality of the space, mostly because of how distance metrics behave in high-dimensional spaces, a problem also referred to as the curse of dimensionality.

— The clustering results can be different in the original space compared to the dimension-reduced space. In other words, the clustering results are sensitive to the dimensionality of the space. Do not panic!

In this article, I do not want to explain the math behind the silhouette and elbow methods; you can easily find it elsewhere. Instead, I would like to share my experience working with these methods, along with some insights that may help you.

— There is no right number of clusters but there is an optimal one.

Let me be direct: there is no right number of clusters, but there is an optimal one. Suppose you select, for example, the k-means algorithm for your problem. You must run the algorithm for several consecutive values of k, i.e., the number of clusters. Then, you must compute the clustering performance for each k. Now, you are able to determine the k that works well for your problem.

First, you must select a performance metric that evaluates the clustering quality as needed. Then, you must run the clustering algorithm with several configurations and evaluate the performance of each run. Now, you have everything you need to determine the number of clusters that suits your problem.
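For concreteness, here is a minimal sketch of that procedure with scikit-learn, using the silhouette score as the evaluation metric. The make_blobs data, the range of k, and the random_state are illustrative choices of mine, not part of the original article.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data; in practice X is your own feature matrix.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Run the algorithm for several consecutive k and score each run.
results = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    results[k] = silhouette_score(X, labels)  # chosen evaluation metric

best_k = max(results, key=results.get)  # higher silhouette is better
print(results)
print("optimal k:", best_k)
```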

The next question you must answer is “What is a good metric to evaluate the clustering performance?”. The answer is, of course, “It depends”.

— You can cluster using inertia and evaluate using the silhouette.

The scoring metric or objective function in a clustering algorithm can be different from the performance metric that you want to use for evaluation. For example, the k-means algorithm is designed to cluster data by minimizing the sum of within-cluster variances, also known as inertia. However, you may want to use the silhouette coefficients to evaluate the clustering performance.
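As a small sketch of that split (the data and parameters below are illustrative): KMeans.fit minimizes inertia internally, and the silhouette is something you compute afterwards on the resulting labels.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Inertia: the objective k-means minimizes, i.e. the sum of squared
# distances from each point to its assigned cluster center.
manual_inertia = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(manual_inertia, km.inertia_)  # the two agree up to floating-point error

# Silhouette: a separate metric, computed on the labels only after the fit.
print(silhouette_score(X, km.labels_))
```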

You cannot easily change the objective function in the k-means algorithm due to optimization challenges. However, you definitely can use a different performance metric to evaluate the results of the k-means algorithm. Now, you may ask: why not use the same metric in both stages? The answer is that I also wish we could do that easily, but k-means clustering is an NP-hard problem. Under the standard configuration, the algorithm converges to a good local optimum in a reasonable time. However, if you change the objective function, there is no guarantee that it converges to any good result. Note that there is some work that modifies the objective function of the k-means algorithm, but similar concerns still exist.

In general, when the objective function in an optimization problem such as a clustering algorithm becomes more complex, the search space becomes more rugged. In this case, there is a high chance that the search algorithm does not converge as needed.

— So what?

The silhouette and elbow methods are two simple, yet important, methods to find the optimum number of clusters. The silhouette method uses the silhouette coefficient, and the elbow method uses inertia, the original scoring function of the k-means algorithm.

The elbow method only uses intra-cluster distances while the silhouette method uses a combination of inter- and intra-cluster distances. So, you can expect that they end up with different results.
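To make that distinction concrete, here are the standard definitions (stated briefly; the article deliberately skips the math). For a point i, let a(i) be the mean distance to the other points in its own cluster and b(i) the mean distance to the points of the nearest other cluster. Then

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
\text{inertia} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2 ,
$$

where \mu_j is the centroid of cluster C_j. Inertia only involves distances within each cluster, while s(i) contrasts the intra-cluster term a(i) with the inter-cluster term b(i).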

According to the literature, the elbow method is often used with inertia. However, the elbow method, in general, only uses a heuristic to determine the elbow of a curve as a special point. In cluster analysis, this special point indicates the number of clusters. So, you may want to use the elbow method with scoring functions other than inertia, although it is not that common.
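Because the elbow heuristic is really just “find the bend in a curve”, it can be decoupled from inertia. Here is a minimal sketch of one common heuristic, picking the point farthest from the chord between the curve's endpoints; the elbow_point helper and the example numbers are mine, not from the article, and you can feed it inertia or any other monotone score curve.

```python
import numpy as np

def elbow_point(ks, scores):
    """Return the k at the 'elbow' of a monotone score curve.

    Heuristic: the elbow is the point farthest from the straight line
    joining the first and last points of the (normalized) curve.
    """
    ks = np.asarray(ks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Normalize both axes so the distance is scale-independent.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (scores - scores[0]) / (scores[-1] - scores[0])
    # Distance (up to a constant factor) from each point to the chord.
    dist = np.abs(x * (y[-1] - y[0]) - y * (x[-1] - x[0]))
    return int(ks[np.argmax(dist)])

# Works with inertia, but any other score curve can be plugged in instead.
ks = [2, 3, 4, 5, 6, 7, 8]
inertias = [1200.0, 700.0, 300.0, 250.0, 220.0, 200.0, 190.0]
print(elbow_point(ks, inertias))  # -> 4 for this illustrative curve
```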

The Last Word

I suggest using the silhouette method since it uses both inter- and intra-cluster distances in its scoring function, while the elbow method only uses intra-cluster distances. However, this does not mean the silhouette method is always better. The silhouette and elbow methods are two out of many methods to determine the number of clusters in a dataset. If you want to learn more, you can also read about information-criterion methods such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). As I said above, none of them is strictly better or worse than the others. They all capture different characteristics of your data.
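If you want to explore the information-criterion route, keep in mind that AIC and BIC need a likelihood, so they are typically applied to model-based clustering such as Gaussian mixtures rather than to plain k-means. Here is a minimal sketch with scikit-learn's GaussianMixture, where the data and the range of component counts are again illustrative choices of mine:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Lower BIC/AIC is better; compare a range of candidate component counts.
for k in range(2, 11):
    gm = GaussianMixture(n_components=k, random_state=42).fit(X)
    print(f"k={k}: BIC={gm.bic(X):.1f}, AIC={gm.aic(X):.1f}")
```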

