DBSCAN Clustering Best Practices

栏目: IT技术 · 发布时间: 5年前

内容简介：Practical and informative guide of using DBSCAN Clustering and discussion about its advantages, disadvantages with pseudocodeDBSCAN(Density-Based Spatial Clustering of Applications with Noise) is a very famous density-based clustering algorithm. Intuitivel

Practical and informative guide of using DBSCAN Clustering and discussion about its advantages, disadvantages with pseudocode

Pawan Jain

May 8 ·6min read

DBSCAN(Density-Based Spatial Clustering of Applications with Noise) is a very famous density-based clustering algorithm. Intuitively, the DBSCAN algorithm can find all the dense regions of the sample points and treat these dense regions as clusters one by one.

The DBSCAN algorithm has the following characteristics:

Density-based, robust to noise points which are away from the density core
No need to know the number of clusters
Can create clusters of arbitrary shape

DBSCAN is usually suitable for cluster analysis of lower-dimensional data.

Basic concept

The basic concepts of DBSCAN can be summarized by 1, 2, 3, and 4.

:one: Core idea: Based on density.

Intuitively, the DBSCAN algorithm can find all the dense regions of the sample points and treat these dense regions as clusters one by one.

:two: Algorithm parameters: neighbourhood radius (eps/ɛ) and the minimum number of points (min_points).

These two algorithm parameters can actually describe what is dense in the cluster

When the number of points within the neighbourhood radius (eps) is greater than the minimum number of points (min_points), it is dense.

:three: Types of points: core points, boundary points and noise points.

The point where the number of sample points in the neighbourhood radius (eps) is greater than or equal to min_points is called the core point.
Points that do not belong to the core point but are within the neighbourhood of a certain core point are called boundary points.
Noise points are neither core points nor boundary points.

The relationship between :four: kinds of points: direct density, reachable density, connected density, and non-density connection.

If P is the core point and Q is in the R neighbourhood of P, then the density from P to Q is called direct. If P to Q density is direct, then Q to P does not necessarily have relation direct density.

If there is a core point S such that the density from S to P and Q is reachable, then the density of P and Q are connected density.

The density connection is symmetrical, if P and Q density are connected, then Q and P are also connected with a certain density. Two points connected by density belong to the same cluster.

If the two points do not belong to the density connection relationship, the two points are not density connected. Two points that are not connected by density belong to different clusters, or there are noise points.

DBSCAN algorithm steps

The algorithm step of DBSCAN is divided into two steps.

1. Find the core point to form a temporary cluster.

Scan all sample points. If the number of points within a radius of a certain sample point is> = MinPoints

It will be included in the list of core points, and the points with direct density will form corresponding temporary clusters.

2. Combine temporary clusters to obtain clusters.

For each temporary cluster, check whether the point in it is a core point. If so, merge the temporary cluster corresponding to the point with the current temporary cluster to obtain a new temporary cluster.

This operation is repeated until each point in the current temporary cluster is either not in the core point list, or the point with a direct density is already in the temporary cluster, and the temporary cluster is upgraded to a cluster.

Continue to perform the same merge operation on the remaining temporary clusters until all the temporary clusters are processed.

DBSCAN use examples

Generate sample points

import numpy as np
import pandas as pd
from sklearn import datasets
%matplotlib inlineX,_ = datasets.make_moons(500,noise = 0.1,random_state=1)
df = pd.DataFrame(X,columns = [‘feature1’,’feature2'])df.plot.scatter(‘feature1’,’feature2', s = 100,alpha = 0.6, title = ‘dataset by make_moon’)

Call the DBSCAN interface to complete the clustering

from sklearn.cluster import dbscancore_samples,cluster_ids = dbscan(X, eps = 0.2, min_samples=20)df = pd.DataFrame(np.c_[X,cluster_ids],columns = [‘feature1’,’feature2',’cluster_id’])
df[‘cluster_id’] = df[‘cluster_id’].astype(‘i2’)df.plot.scatter(‘feature1’,’feature2', s = 100,
 c = list(df[‘cluster_id’]),cmap = ‘rainbow’,colorbar = False,
 alpha = 0.6,title = ‘DBSCAN cluster result’)

DBSCAN algorithm generate overlapping clusters?

No. DBSCAN will produce mutually exclusive clusters. If minimum cluster size is 1, these clusters will also be exhaustive.

To generate overlapping clusters (also called hierarchical clusters), a variant of DBSCAN called HDBSCAN can be used

Summary of DBSCAN

Compared with the traditional K-Means algorithm, the biggest difference of DBSCAN is that there is no need to input the number of categories k. Of course, its biggest advantage is that it can find clusters of arbitrary shapes, not like K-Means, which is generally only used for convex ones. Sample set clustering. At the same time, it can find outliers while clustering, which is similar to the BIRCH algorithm.

Advantages

It is possible to cluster dense data sets of any shape. In contrast, clustering algorithms such as K-Means are generally only applicable to convex data sets.

We can clearly see the difference between DBSCAN and K-means (K-means trying to create spherical clusters)

Anomalies can be found while clustering, and are not sensitive to anomalies in the data set.
The clustering results are not biased. In contrast, the initial value of the clustering algorithm such as K-Means has a great influence on the clustering results.

Disadvantages

If the density of the sample set is not uniform and the difference in cluster spacing is very different, the cluster quality is poor, then DBSCAN clustering is generally not suitable.
If the sample set is large, the clustering convergence time is long, at this time, the size of the KD tree or the spherical tree established when searching for the nearest neighbour can be improved.

Conclusion

So when do we need to use DBSCAN to cluster?

Generally speaking, if the data set is dense and the data set is not convex, then DBSCAN will be much better than K-Means clustering.

If the data set is not dense, DBSCAN is not recommended for clustering.

We have made the third clustering algorithm i.e. DBSCAN Clustering. For related codes visit my Github repository .

Stay Tuned!

以上就是本文的全部内容，希望对大家的学习有所帮助，也希望大家多多支持码农网

查看所有标签

猜你喜欢:

DBSCAN Clustering Best Practices

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

你的品牌，价值千万

温迪 / 人民邮电出版社 / 2018-7-1 / 49.00元

“大道无术，万法归心。” 不管是互联网、社交媒体，还是 AI 怎样让人眼花缭乱。从“真心”出发塑造的个人品牌，都将带你从容面对任何一种变化的冲击。现代生活变得越来越透明，如果你不懂得如何真实、精准地定位和呈现自己，你的个人品牌在碎片信息中被误解、被曲解就是一种必然。本书分四步引导你剖析自己、发现自我，构建可持续的品牌生态系统，策划品牌战略，提升个人呈现力，并在最后带你勾画出一幅完整的个人......一起来看看《你的品牌，价值千万》这本书的介绍吧!

码农工具