K-Means Clustering — One rule to group them all


Photo by Ellen Qin on Unsplash

This article will try to:

  • Explain graphically the basics of K-Means Clustering, an unsupervised machine learning algorithm
  • Discuss how to find the optimum value of K and the centroid locations of the clusters

What you are already supposed to know:

  • Basic mathematics and Euclidean geometry

In machine learning, one of the frequently encountered problems is grouping similar data together. You know the income levels of a set of people and want to group those with similar incomes together, to see who has low income and who has high or very high income, which could be helpful in devising a good marketing strategy. Or you have the shopping data of your customers and want to group those with similar shopping preferences together. Or, as a biology student, you want to know which cells share similar properties, based upon the data you have about them.

All the problems stated above fall under the domain of an unsupervised machine learning method called clustering. There are a number of clustering algorithms out there, but when it comes to the simplest one, the award goes to K-Means clustering. To understand the algorithm, let’s suppose we have the following data:

We have ten data points, with their x and y coordinates given. Here the variables are simply called x and y, but in practical situations they could be something like monthly income and daily spending in a market-segmentation case. Let’s start with two clusters and first visualize the data to get an idea about it:

The above graph shows the data we have at hand (Graph 1) and the two possible clusters we can make out of it (Graph 2). Note that the decision on the number of clusters (2 in the present case) is at this point arbitrary, a random guess. Later you will find out how this initial guess leads us to the optimum number of clusters for the data. The question now is how to calculate and analytically find the two clusters.

  1. First, assume two random points anywhere near the data and consider them the centres of the two clusters (the centroids)
  2. Find the distance of every data point from both centroids
  3. Assign every data point to the centroid it is nearest to, thus forming two clusters
  4. Calculate the centre of each cluster so formed and shift its centroid there
  5. Go back to step 2 and repeat until the clusters no longer change (see the sketch after this list)
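To make the steps concrete, here is a minimal sketch of this loop in Python with NumPy. It is an illustrative implementation, not this article’s original code; `points` is any (n, 2) array of data and `centroids` holds the initial guesses:

```python
import numpy as np

def k_means(points, centroids, max_iters=100):
    """Plain K-Means: repeat steps 2-5 until the clusters stop changing.
    Assumes no cluster ever ends up empty (true for this walkthrough)."""
    for _ in range(max_iters):
        # Step 2: distance of every point from every centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its nearest centroid
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the centre of its cluster
        new_centroids = np.array([points[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

With the starting centroids used in the walkthrough below, this would be called as `k_means(data, np.array([[4.0, 1.0], [6.0, 1.0]]))`.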

Let’s apply the above logic to the given data. We know that in the Cartesian coordinate system the distance between two points (x1, y1) and (x2, y2) is given by:

d = √((x2 − x1)² + (y2 − y1)²)

Using the above formula, we can calculate the distance of each point from the assumed centres. Let’s place our centroids at C1 = (4, 1) and C2 = (6, 1), as shown in the graph below:
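For instance, take a point at (1, 3) — an assumed value, chosen only because it reproduces the 3.6 and 5.4 distances quoted below — and measure it against both centroids:

```python
import math

def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

c1, c2 = (4, 1), (6, 1)
point = (1, 3)  # assumed example point, not necessarily from the article's table

print(round(distance(point, c1), 1))  # 3.6 -> nearer to C1
print(round(distance(point, c2), 1))  # 5.4
```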

If we calculate the distance of each point from the two centroids, the result will be as shown in the table below:

Based upon the above distance values, each point is assigned to the centroid to which its distance is minimum, i.e. the one it is nearer to. Consider the first data point: its distance from C1 is 3.6 and from C2 is 5.4. As it is nearer to C1, it is assigned to that centroid. Doing the same for each and every point, the assignment is as shown below:

As you can see from the above table, each point has been assigned to a centroid based upon its distance from the assumed centroids. After the assignment, the new position of each centroid is calculated by taking the centre of the points assigned to it, i.e. the mean of each coordinate, as shown in the table. Hence the new positions of the centroids are C1 = (2.6, 3.8) and C2 = (7.6, 3.4), as shown in the graph below:
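The update itself is nothing more than a per-coordinate mean. The cluster members below are hypothetical, picked so that their mean lands on the C1 = (2.6, 3.8) reported above:

```python
import numpy as np

# Hypothetical members of cluster 1 (not the article's exact table)
cluster = np.array([[1, 3], [2, 5], [3, 4], [2, 2], [5, 5]])

# New centroid: mean of the x coordinates and mean of the y coordinates
new_centroid = cluster.mean(axis=0)
print(new_centroid)  # [2.6 3.8]
```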

The above graph shows how a single loop of calculation has brought the two centroids closer to the data. If you run another loop, you will see that the centroids do not move any further and no data point changes its assignment from C1 to C2 or vice versa. This is where the calculation loop, or iteration, stops.

The above process of assigning data to centroids to form clusters is called K-Means clustering, and the value of K is the number of clusters formed; in the current case, K is 2.

The final two clusters are as shown below:

How to decide the value of K?

We started with K = 2 and random coordinates, i.e. we assigned random values to both the number of centroids and their positions. Although we have found a way to calculate the final positions of the centroids, the question that remains is: what value of K fits the given data in the most optimum manner? We will find the answer to this question too, but let’s first understand what a cost function is.

It is a function which measures the imperfection of the model for every value of K. Ideally, each point in a cluster should be as close to its centroid as possible; in a perfect world, the distance between each point in a cluster and its centroid would be zero. We calculate these distances and take the sum of their squares as the cost, our measure of imperfection. We repeat this for each cluster and then look for the value of K which minimizes the total cost.
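As a sketch, this cost (often called the within-cluster sum of squares, or inertia) can be computed from the points, their cluster labels, and the centroids:

```python
import numpy as np

def cost(points, labels, centroids):
    """Sum of squared distances of every point from its own centroid."""
    total = 0.0
    for k, c in enumerate(centroids):
        diffs = points[labels == k] - c  # offsets of cluster k's points
        total += (diffs ** 2).sum()      # squared distances, summed
    return total
```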

Referring to Table 1 and Table 2, we have each point sorted into a cluster along with its distance from its respective centroid, which can be summarized as below:

Each of the distances is squared and added together; the total sum over both clusters is 15.72 + 10.76 = 26.84. This figure is the cost of the model when we set K = 2. Likewise, the cost can be calculated for different values of K, and when the cost values are plotted against K, we get a graph as shown below:

The optimum value of K, as per the above graph, is the one where the curve shows maximum deflection (marked in red), i.e. where the cost curve makes an elbow. Higher values of K reduce the cost further, but they lead to overfitting: in the limit, every point becomes its own cluster and the cost drops to zero while telling us nothing. This method of finding the optimum K is known as the elbow method.
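If scikit-learn is available, the same curve can be produced by fitting KMeans for a range of K values and reading off its inertia_ attribute (scikit-learn’s name for this cost). The data below is a random placeholder for the article’s ten points, so treat this as an illustrative sketch:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.random((10, 2)) * 8   # placeholder for the ten points above

ks = range(1, 8)
costs = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    costs.append(model.inertia_)  # within-cluster sum of squares

plt.plot(ks, costs, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Cost (sum of squared distances)")
plt.show()
```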

This is all about K-Means clustering and how the optimum value of K and the positions of the centroids are found. Hope you like it. Do post your comments.


Thanks,

Have a good time 

If you have any queries about the article, you can reach me on LinkedIn

