“The unsupervised learning is the way most people will learn in the future. You have this model of how the world works in your head and you’re refining it to predict what you think is going to happen in the future.” — Mark Zuckerberg
Unsupervised learning forms a very niche part of Machine Learning, simply because most tasks have a label on them (supervised). However, in cases where we lack labelled data, clustering methods can help us find patterns by making inferences on the dataset. Common areas in which clustering is applied include customer segmentation (for ad targeting), population analysis (understanding demographics) and anomaly detection.
Some people view unsupervised learning as a 'grey area' of Machine Learning because the clusters the algorithms output can be hard to interpret: there is no single metric that tells us how meaningful those predicted clusters are.
K-Means
K-means clustering is a distance-based clustering method for finding clusters and cluster centers in a set of unlabelled data. This is a fairly tried and tested method and can be implemented easily using scikit-learn.
The goal of K-Means is fairly straightforward: to group points that are 'similar' (based on distance) together. It does this by treating the mean of the data points in a group as the center of the corresponding cluster (its centroid).
The core idea is this: update the cluster centroids iteratively, and continue the iterative process until some convergence criterion is met.
How it works:
- To start, one chooses the desired number of clusters, say K.
- K random points are selected from the dataset to be the cluster centroids.
- At each step, select an unassigned data point and check which centroid is closest to it. Closeness is measured by a distance metric, normally Euclidean distance.
- After each point is allocated a cluster, recalculate the new cluster centroid using the mean of the data points within the same cluster. This serves to optimise the location of the cluster centroid.
- Repeat steps 3 and 4 until either a) the centroids have stabilised (their movement does not pass a threshold) or b) the desired number of iterations has been reached.
Note that the centroids are in fact not actual data points in the dataset.
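To make these steps concrete, here is a minimal NumPy sketch of the loop described above. This is my own illustrative code, not the scikit-learn implementation we use later, and it glosses over edge cases such as empty clusters.

import numpy as np

def kmeans_sketch(x, k, n_iters=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: pick K random points from the dataset as the initial centroids
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iters):
        # step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 4: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([x[labels == j].mean(axis=0) for j in range(k)])
        # step 5: stop once the centroids have stabilised (movement below a threshold)
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids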
An example using the Iris Dataset
For those who are not familiar with this dataset, it contains data on variations of iris flowers, with features including sepal_length, sepal_width, petal_length and petal_width. Since this is an unsupervised problem, I will not reveal how many different types of irises there are (to find out more about the dataset, please visit this link).
We will use the iris dataset to showcase how K-Means works. Let's first import the data and visualise it.
import numpy as np
from sklearn import datasets, cluster
import matplotlib.pyplot as plt
%matplotlib inline

iris = datasets.load_iris()
x = iris.data[:, [1, 3]]  # takes the 2nd and 4th columns of the data (sepal width, petal width)

plt.scatter(x[:, 0], x[:, 1])
plt.xlabel('Sepal width')
plt.ylabel('Petal width')
plt.show()
From the above visualisation, it seems like there might be two distinct clusters. Now, let's use the K-Means algorithm from scikit-learn.
from sklearn.cluster import KMeans

x_ = iris.data  # note that we use all 4 columns here
random_state = 2020

kmeans = KMeans(n_clusters=2, init='k-means++', random_state=random_state)
x_fit = kmeans.fit_transform(x_)  # distances of each point to the two cluster centres
label = kmeans.labels_
color = ['red', 'green', 'blue']

plt.figure(figsize=(8, 6))
plt.scatter(x_fit[:, 0], x_fit[:, 1], c=[color[i] for i in label], marker='+')
plt.title('Kmeans on Iris dataset with 2 clusters', fontweight='bold', fontsize=14)
plt.show()
It looks like segregation using n_clusters = 2 yields pretty decent results, except for a small number of data points that may be clustered wrongly (green ones in the centre that should probably be red).
From the above, we determined the number of clusters n_clusters by visually inspecting the data (using two features). Clearly, determining the optimal n_clusters is a very important step in clustering, especially for K-Means, where this number has to be specified in the algorithm beforehand. Visually inspecting two features is not sufficient: there may be additional clusters if our data is high-dimensional, and human judgment fails quickly with anything beyond three dimensions.
How to determine the optimal number of clusters
Method 1: Elbow Plots
This method computes the total intra-cluster sum of squared errors (SSE), i.e. the sum of squared distances from each point to its assigned centroid, which scikit-learn exposes via the inertia_ attribute. The point at which the elbow kinks (where the gradient changes sharply) suggests diminishing returns, and we would generally want to take the n_clusters just before this occurs. Intuitively, we want a number k that minimises the SSE, but increasing k will naturally drive the SSE towards zero (in the extreme, every data point is its own cluster). In the example below, n_clusters = 3 seems to be the most optimal.
elbow = []
kmax = 10

for k in range(2, kmax + 1):
    kmeans = KMeans(n_clusters=k).fit(x_)
    elbow.append(kmeans.inertia_)

plt.figure(figsize=(8, 6))
plt.plot(np.arange(2, kmax + 1), elbow)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia (Intra cluster sum of squares)')
plt.title('Inertia vs n_clusters to determine optimal cluster size', fontweight='bold')
plt.show()
However, this method may not work well when the data is not very clustered. Hence, it may be better to do a cross-check with our next method.
Method 2: Silhouette Plots
The silhouette coefficient measures how well separated the clusters are: for each point it compares the mean distance to the other points in its own cluster (a) with the mean distance to the points in the nearest other cluster (b), giving s = (b - a) / max(a, b). In general, clustering aims to minimise intra-cluster distance while maximising inter-cluster distance, and the silhouette captures both. Using both methods will ensure that n_clusters is more optimally defined.
from sklearn.metrics import silhouette_score

sil = []
kmax = 10

for k in range(2, kmax + 1):
    kmeans = KMeans(n_clusters=k).fit(x_)
    labels = kmeans.labels_
    sil.append(silhouette_score(x_, labels, metric='euclidean'))

plt.figure(figsize=(8, 6))
plt.plot(np.arange(2, kmax + 1), sil)
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette vs n_clusters to determine optimal cluster size', fontweight='bold')
plt.show()
Ideally, we would want the silhouette coefficient to be as high as possible (i.e. clusters that are well separated relative to their internal spread). In this instance, n_clusters = 2 seems to be the most optimal.
Reconciling the difference between the Elbow and Silhouette plots
What should we do, since the elbow plot suggests n_clusters = 3 while the silhouette plot suggests n_clusters = 2?
Method 1: Try to visualise the data as much as we can! In this method, we use Principal Component Analysis (PCA) so that we can do a 3-D plot of the data. Since the topic is clustering, I will not delve into the details of PCA (let me know in the comments if you would like it covered!).
from sklearn import decomposition
from mpl_toolkits.mplot3d import Axes3D

x_full = iris.data

pca = decomposition.PCA()
xr = pca.fit_transform(x_full)

kmeans = cluster.KMeans(n_clusters=3, init='k-means++', random_state=random_state)
kmeans.fit(xr)
label = kmeans.labels_
color = ['red', 'green', 'blue']

plt.figure(figsize=(12, 12))
fig = plt.subplot(1, 1, 1, projection='3d')
fig.scatter(xr[:, 0], xr[:, 1], xr[:, 2], c=[color[i] for i in label])
# plot the cluster centroids using all three principal components
fig.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            kmeans.cluster_centers_[:, 2], c='yellow', s=100)
fig.set_xlabel('Principal Component 1')
fig.set_ylabel('Principal Component 2')
fig.set_zlabel('Principal Component 3')
plt.show()
From the 3-D plot above, it does seem like there might be three different clusters, though we cannot be completely certain, as the separation between the clusters is not large.
Method 2: Use descriptive statistics to make sense of the clusters
Descriptive statistics involves using measures such as the mean, max, min and median to find out how different the clusters are. In unsupervised learning, there isn't a 'right' answer to the problem at hand, and most of the time finding the optimal number of clusters involves looking at the resulting clusters (i.e. comparing the descriptive statistics of the different clusters).
(In this case, however, if we want to 'cheat', we do know that there are in fact three different iris variations in the dataset, since it is labelled!)
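As a rough sketch of what this could look like in practice (using pandas, and assuming the kmeans model and iris data from the earlier code are still in memory), we can group the data by predicted cluster and compare summary statistics:

import pandas as pd

# put the features and the predicted cluster labels into one DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['cluster'] = kmeans.labels_

# compare mean, median, min and max of every feature across the clusters
summary = df.groupby('cluster').agg(['mean', 'median', 'min', 'max'])
print(summary.round(2))

If the clusters are meaningful, their summary statistics should differ noticeably, for example one cluster having clearly larger petal measurements than the others.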
Evaluation of K-Means
Advantages:
- Easy to use and easy to understand
- Suitable for datasets where hierarchical relationship amongst clusters can be easily detected
- Relatively low time complexity and high computing efficiency which leads to high scalability
Disadvantages:
- Not suitable for non-convex data
- Highly sensitive to the initialisation of cluster centroids; it is worth tuning hyperparameters such as n_init and init='random' to ensure that the clustering result you get is consistent (see the short sketch after this list)
- Highly sensitive to outliers; the 'mean' in K-Means means that outliers will skew the cluster centroids quite significantly, whereas algorithms that use the mode or median are less prone to outliers
- Easily drawn into local optima
- Clustering result is sensitive to the number of clusters
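To illustrate the initialisation point above, here is a small, purely illustrative sketch (the parameter values are arbitrary examples, and x_ is the full iris array defined earlier) comparing a single random initialisation against k-means++ with several restarts:

from sklearn.cluster import KMeans

for seed in [0, 1, 2]:
    km_random = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(x_)
    km_pp = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=seed).fit(x_)
    # a lower inertia (SSE) that stays stable across seeds indicates a more consistent solution
    print(seed, round(km_random.inertia_, 2), round(km_pp.inertia_, 2))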
Final Thoughts and Closing Remarks
K-Means is a good model to get started with on any clustering problem, as it can fit high-dimensional data. However, it is important to know what kind of distance metric is being used, how sensitive the model is to outliers and, most importantly, how domain knowledge can be used to tune the model accordingly.
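On the distance metric point: since K-Means relies on Euclidean distance, features measured on larger numeric scales can dominate the result. A common precaution (a minimal sketch of my own, not part of the workflow above) is to standardise the features before clustering:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# scale each feature to zero mean and unit variance before clustering
x_scaled = StandardScaler().fit_transform(iris.data)
kmeans_scaled = KMeans(n_clusters=3, init='k-means++', random_state=2020).fit(x_scaled)
print(kmeans_scaled.inertia_)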
In the second part of this 3-part series, I will explore a different dataset and use Agglomerative Clustering (one of two variants of hierarchical clustering) for our analysis. I will also go through methods to obtain descriptive statistics of the clusters yielded by the model.
Pro Tip: Explaining the clusters generated is equally (if not more) important than obtaining the clusters themselves!