DBSCAN Clustering Best Practices

栏目: IT技术 · 发布时间: 4年前

内容简介:Practical and informative guide of using DBSCAN Clustering and discussion about its advantages, disadvantages with pseudocodeDBSCAN(Density-Based Spatial Clustering of Applications with Noise) is a very famous density-based clustering algorithm. Intuitivel

Practical and informative guide of using DBSCAN Clustering and discussion about its advantages, disadvantages with pseudocode

May 8 ·6min read

DBSCAN(Density-Based Spatial Clustering of Applications with Noise) is a very famous density-based clustering algorithm. Intuitively, the DBSCAN algorithm can find all the dense regions of the sample points and treat these dense regions as clusters one by one.

The DBSCAN algorithm has the following characteristics:

  • Density-based, robust to noise points which are away from the density core
  • No need to know the number of clusters
  • Can create clusters of arbitrary shape

DBSCAN is usually suitable for cluster analysis of lower-dimensional data.

Basic concept

The basic concepts of DBSCAN can be summarized by 1, 2, 3, and 4.

:one: Core idea: Based on density.

Intuitively, the DBSCAN algorithm can find all the dense regions of the sample points and treat these dense regions as clusters one by one.

DBSCAN Clustering Best Practices

:two: Algorithm parameters: neighbourhood radius (eps/ɛ) and the minimum number of points (min_points).

These two algorithm parameters can actually describe what is dense in the cluster

When the number of points within the neighbourhood radius (eps) is greater than the minimum number of points (min_points), it is dense.

DBSCAN Clustering Best Practices

:three: Types of points: core points, boundary points and noise points.

  • The point where the number of sample points in the neighbourhood radius (eps) is greater than or equal to min_points is called the core point.
  • Points that do not belong to the core point but are within the neighbourhood of a certain core point are called boundary points.
  • Noise points are neither core points nor boundary points.

DBSCAN Clustering Best Practices

The relationship between :four: kinds of points: direct density, reachable density, connected density, and non-density connection.

If P is the core point and Q is in the R neighbourhood of P, then the density from P to Q is called direct. If P to Q density is direct, then Q to P does not necessarily have relation direct density.

If there is a core point S such that the density from S to P and Q is reachable, then the density of P and Q are connected density.

The density connection is symmetrical, if P and Q density are connected, then Q and P are also connected with a certain density. Two points connected by density belong to the same cluster.

If the two points do not belong to the density connection relationship, the two points are not density connected. Two points that are not connected by density belong to different clusters, or there are noise points.

DBSCAN Clustering Best Practices

DBSCAN algorithm steps

DBSCAN Clustering Best Practices

The algorithm step of DBSCAN is divided into two steps.

1. Find the core point to form a temporary cluster.

Scan all sample points. If the number of points within a radius of a certain sample point is> = MinPoints

It will be included in the list of core points, and the points with direct density will form corresponding temporary clusters.

2. Combine temporary clusters to obtain clusters.

For each temporary cluster, check whether the point in it is a core point. If so, merge the temporary cluster corresponding to the point with the current temporary cluster to obtain a new temporary cluster.

This operation is repeated until each point in the current temporary cluster is either not in the core point list, or the point with a direct density is already in the temporary cluster, and the temporary cluster is upgraded to a cluster.

Continue to perform the same merge operation on the remaining temporary clusters until all the temporary clusters are processed.

DBSCAN Clustering Best Practices

DBSCAN use examples

Generate sample points

import numpy as np
import pandas as pd
from sklearn import datasets
%matplotlib inlineX,_ = datasets.make_moons(500,noise = 0.1,random_state=1)
df = pd.DataFrame(X,columns = [‘feature1’,’feature2'])df.plot.scatter(‘feature1’,’feature2', s = 100,alpha = 0.6, title = ‘dataset by make_moon’)

DBSCAN Clustering Best Practices

Call the DBSCAN interface to complete the clustering

from sklearn.cluster import dbscancore_samples,cluster_ids = dbscan(X, eps = 0.2, min_samples=20)df = pd.DataFrame(np.c_[X,cluster_ids],columns = [‘feature1’,’feature2',’cluster_id’])
df[‘cluster_id’] = df[‘cluster_id’].astype(‘i2’)df.plot.scatter(‘feature1’,’feature2', s = 100,
 c = list(df[‘cluster_id’]),cmap = ‘rainbow’,colorbar = False,
 alpha = 0.6,title = ‘DBSCAN cluster result’)

DBSCAN Clustering Best Practices

DBSCAN algorithm generate overlapping clusters?

No. DBSCAN will produce mutually exclusive clusters. If minimum cluster size is 1, these clusters will also be exhaustive.

To generate overlapping clusters (also called hierarchical clusters), a variant of DBSCAN called HDBSCAN can be used

Summary of DBSCAN

Compared with the traditional K-Means algorithm, the biggest difference of DBSCAN is that there is no need to input the number of categories k. Of course, its biggest advantage is that it can find clusters of arbitrary shapes, not like K-Means, which is generally only used for convex ones. Sample set clustering. At the same time, it can find outliers while clustering, which is similar to the BIRCH algorithm.

Advantages

  • It is possible to cluster dense data sets of any shape. In contrast, clustering algorithms such as K-Means are generally only applicable to convex data sets.

DBSCAN Clustering Best Practices

We can clearly see the difference between DBSCAN and K-means (K-means trying to create spherical clusters)
  • Anomalies can be found while clustering, and are not sensitive to anomalies in the data set.
  • The clustering results are not biased. In contrast, the initial value of the clustering algorithm such as K-Means has a great influence on the clustering results.

Disadvantages

  • If the density of the sample set is not uniform and the difference in cluster spacing is very different, the cluster quality is poor, then DBSCAN clustering is generally not suitable.
  • If the sample set is large, the clustering convergence time is long, at this time, the size of the KD tree or the spherical tree established when searching for the nearest neighbour can be improved.

Conclusion

So when do we need to use DBSCAN to cluster?

Generally speaking, if the data set is dense and the data set is not convex, then DBSCAN will be much better than K-Means clustering.

If the data set is not dense, DBSCAN is not recommended for clustering.

We have made the third clustering algorithm i.e. DBSCAN Clustering. For related codes visit my Github repository .

Stay Tuned!


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

Node.js开发指南

Node.js开发指南

郭家寶(BYVoid) / 人民邮电出版社 / 2012-7 / 45.00元

Node.js是一种方兴未艾的新技术,诞生于2009年。经过两年的快速变化,Node.js生态圈已经逐渐走向稳定。Node.js采用了以往类似语言和框架中非常罕见的技术,总结为关键词就是:非阻塞式控制流、异步I/O、单线程消息循环。不少开发者在入门时总要经历一个痛苦的思维转变过程,给学习带来巨大的障碍。 而本书的目的就是帮助读者扫清这些障碍,学会使用Node.js进行Web后端开发,同时掌握事件驱......一起来看看 《Node.js开发指南》 这本书的介绍吧!

RGB转16进制工具
RGB转16进制工具

RGB HEX 互转工具

SHA 加密
SHA 加密

SHA 加密工具

HSV CMYK 转换工具
HSV CMYK 转换工具

HSV CMYK互换工具