DBSCAN Clustering Best Practices

栏目: IT技术 · 发布时间: 3年前

内容简介:Practical and informative guide of using DBSCAN Clustering and discussion about its advantages, disadvantages with pseudocodeDBSCAN(Density-Based Spatial Clustering of Applications with Noise) is a very famous density-based clustering algorithm. Intuitivel

Practical and informative guide of using DBSCAN Clustering and discussion about its advantages, disadvantages with pseudocode

May 8 ·6min read

DBSCAN(Density-Based Spatial Clustering of Applications with Noise) is a very famous density-based clustering algorithm. Intuitively, the DBSCAN algorithm can find all the dense regions of the sample points and treat these dense regions as clusters one by one.

The DBSCAN algorithm has the following characteristics:

  • Density-based, robust to noise points which are away from the density core
  • No need to know the number of clusters
  • Can create clusters of arbitrary shape

DBSCAN is usually suitable for cluster analysis of lower-dimensional data.

Basic concept

The basic concepts of DBSCAN can be summarized by 1, 2, 3, and 4.

:one: Core idea: Based on density.

Intuitively, the DBSCAN algorithm can find all the dense regions of the sample points and treat these dense regions as clusters one by one.

DBSCAN Clustering Best Practices

:two: Algorithm parameters: neighbourhood radius (eps/ɛ) and the minimum number of points (min_points).

These two algorithm parameters can actually describe what is dense in the cluster

When the number of points within the neighbourhood radius (eps) is greater than the minimum number of points (min_points), it is dense.

DBSCAN Clustering Best Practices

:three: Types of points: core points, boundary points and noise points.

  • The point where the number of sample points in the neighbourhood radius (eps) is greater than or equal to min_points is called the core point.
  • Points that do not belong to the core point but are within the neighbourhood of a certain core point are called boundary points.
  • Noise points are neither core points nor boundary points.

DBSCAN Clustering Best Practices

The relationship between :four: kinds of points: direct density, reachable density, connected density, and non-density connection.

If P is the core point and Q is in the R neighbourhood of P, then the density from P to Q is called direct. If P to Q density is direct, then Q to P does not necessarily have relation direct density.

If there is a core point S such that the density from S to P and Q is reachable, then the density of P and Q are connected density.

The density connection is symmetrical, if P and Q density are connected, then Q and P are also connected with a certain density. Two points connected by density belong to the same cluster.

If the two points do not belong to the density connection relationship, the two points are not density connected. Two points that are not connected by density belong to different clusters, or there are noise points.

DBSCAN Clustering Best Practices

DBSCAN algorithm steps

DBSCAN Clustering Best Practices

The algorithm step of DBSCAN is divided into two steps.

1. Find the core point to form a temporary cluster.

Scan all sample points. If the number of points within a radius of a certain sample point is> = MinPoints

It will be included in the list of core points, and the points with direct density will form corresponding temporary clusters.

2. Combine temporary clusters to obtain clusters.

For each temporary cluster, check whether the point in it is a core point. If so, merge the temporary cluster corresponding to the point with the current temporary cluster to obtain a new temporary cluster.

This operation is repeated until each point in the current temporary cluster is either not in the core point list, or the point with a direct density is already in the temporary cluster, and the temporary cluster is upgraded to a cluster.

Continue to perform the same merge operation on the remaining temporary clusters until all the temporary clusters are processed.

DBSCAN Clustering Best Practices

DBSCAN use examples

Generate sample points

import numpy as np
import pandas as pd
from sklearn import datasets
%matplotlib inlineX,_ = datasets.make_moons(500,noise = 0.1,random_state=1)
df = pd.DataFrame(X,columns = [‘feature1’,’feature2'])df.plot.scatter(‘feature1’,’feature2', s = 100,alpha = 0.6, title = ‘dataset by make_moon’)

DBSCAN Clustering Best Practices

Call the DBSCAN interface to complete the clustering

from sklearn.cluster import dbscancore_samples,cluster_ids = dbscan(X, eps = 0.2, min_samples=20)df = pd.DataFrame(np.c_[X,cluster_ids],columns = [‘feature1’,’feature2',’cluster_id’])
df[‘cluster_id’] = df[‘cluster_id’].astype(‘i2’)df.plot.scatter(‘feature1’,’feature2', s = 100,
 c = list(df[‘cluster_id’]),cmap = ‘rainbow’,colorbar = False,
 alpha = 0.6,title = ‘DBSCAN cluster result’)

DBSCAN Clustering Best Practices

DBSCAN algorithm generate overlapping clusters?

No. DBSCAN will produce mutually exclusive clusters. If minimum cluster size is 1, these clusters will also be exhaustive.

To generate overlapping clusters (also called hierarchical clusters), a variant of DBSCAN called HDBSCAN can be used

Summary of DBSCAN

Compared with the traditional K-Means algorithm, the biggest difference of DBSCAN is that there is no need to input the number of categories k. Of course, its biggest advantage is that it can find clusters of arbitrary shapes, not like K-Means, which is generally only used for convex ones. Sample set clustering. At the same time, it can find outliers while clustering, which is similar to the BIRCH algorithm.

Advantages

  • It is possible to cluster dense data sets of any shape. In contrast, clustering algorithms such as K-Means are generally only applicable to convex data sets.

DBSCAN Clustering Best Practices

We can clearly see the difference between DBSCAN and K-means (K-means trying to create spherical clusters)
  • Anomalies can be found while clustering, and are not sensitive to anomalies in the data set.
  • The clustering results are not biased. In contrast, the initial value of the clustering algorithm such as K-Means has a great influence on the clustering results.

Disadvantages

  • If the density of the sample set is not uniform and the difference in cluster spacing is very different, the cluster quality is poor, then DBSCAN clustering is generally not suitable.
  • If the sample set is large, the clustering convergence time is long, at this time, the size of the KD tree or the spherical tree established when searching for the nearest neighbour can be improved.

Conclusion

So when do we need to use DBSCAN to cluster?

Generally speaking, if the data set is dense and the data set is not convex, then DBSCAN will be much better than K-Means clustering.

If the data set is not dense, DBSCAN is not recommended for clustering.

We have made the third clustering algorithm i.e. DBSCAN Clustering. For related codes visit my Github repository .

Stay Tuned!


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

算法设计与分析基础

算法设计与分析基础

乐威汀 (Anany Levitin) / 清华大学出版社 / 2003-8 / 39.00元

《算法设计与分析基础(影印版)》由清华大学出版社出版。一起来看看 《算法设计与分析基础》 这本书的介绍吧!

JS 压缩/解压工具
JS 压缩/解压工具

在线压缩/解压 JS 代码

JSON 在线解析
JSON 在线解析

在线 JSON 格式化工具

随机密码生成器
随机密码生成器

多种字符组合密码