K-Means Clustering — One rule to group them all


Photo by Ellen Qin on Unsplash

This article will try to:

  • Explain graphically the basics of K-Means Clustering, an unsupervised machine learning algorithm
  • Discuss the method of finding the optimum value of K and the centroid locations of the clusters

What you are already supposed to know:

· Basic mathematics and Euclidean geometry

In machine learning, one of the most frequently encountered problems is grouping similar data together. Suppose you know the income levels of a group of people and want to group those with similar incomes together, to identify the low-income and the high- or very-high-income segments, which you think can be helpful in devising a perfect marketing strategy. Or you have customers’ shopping data and want to group customers with similar shopping preferences together. Or, as a biology student, you want to know which cells share similar properties based upon the data you have about them.

All the problems stated above fall under the domain of an unsupervised machine learning method called clustering. Although there are a number of clustering algorithms out there, when it comes to the simplest one, the award goes to K-Means clustering. To understand the algorithm, let’s suppose we have the following data:

We have ten data points, each with its x and y coordinates given. Here the variables are represented as x and y, but in practical situations they could be something like monthly income and daily spending in a market-segmentation case. Let’s start with two clusters and first visualize the above data to get an idea about it:

The above graph shows the data we have at hand (Graph 1) and the two possible clusters we can make out of it (Graph 2). Note that the number of clusters (2 in the present case) is chosen arbitrarily at this point, as an initial guess. Later you will find out how this initial guess leads us to the optimum number of clusters for the data. The question now is how to find the two clusters analytically:

  1. First, assume two random points anywhere near the data and treat them as the centres of the two clusters (the centroids)
  2. Find the distance of every data point from both centres
  3. Assign every data point to the centroid it is nearest to, thereby forming two clusters
  4. Calculate the centre of each of the formed clusters and shift the centroids there
  5. Go to step 2 and repeat the process until the clusters no longer change
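
These five steps translate almost line for line into code. Below is a minimal NumPy sketch of the loop (the function name and the sample-based initialisation are my own choices, not the article’s):

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain K-Means following the five steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: initialise centroids; here we sample k of the data points,
    # a common stand-in for "random points anywhere near the data"
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: distance of every point from every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its nearest centroid
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```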

Let’s apply the above logic to the given data. We know that in the Cartesian coordinate system the distance between two points (x1, y1) and (x2, y2) is given by:

d = √((x2 − x1)² + (y2 − y1)²)

Using the above formula, we can calculate the distance of each point from the assumed centres. Let’s place our centroids at C1 = (4, 1) and C2 = (6, 1), as shown in the graph below:
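
As a quick check in code, the snippet below applies this formula to one point and both centroids. The point (1, 3) is my own guess at the article’s first data point, chosen only because its distances from C1 and C2 round to the 3.6 and 5.4 quoted below (the original data table is not reproduced here):

```python
import math

def distance(p, q):
    # Euclidean distance between p = (x1, y1) and q = (x2, y2)
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

c1, c2 = (4, 1), (6, 1)
point = (1, 3)  # assumed value, not from the article

print(round(distance(point, c1), 1))  # 3.6
print(round(distance(point, c2), 1))  # 5.4
```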

If we calculate the distance of each point from the two centroids, the result will be as shown in the table below:

Based upon these distance values, each point is assigned to the centroid to which its distance is minimum, i.e. the one it is nearer to. Consider the first data point: its distance from C1 is 3.6 and from C2 is 5.4. As it is nearer to C1, it is assigned to that centroid. Doing the same for each and every point gives the assignment shown below:

As you can see from the above table, each point has been assigned to a centroid based upon its distance from the assumed centroids. After the assignment, the new position of each centroid is calculated as the centre of the points assigned to its cluster, i.e. the mean of each coordinate, as shown in the table. Hence the new positions of the centroids are C1 = (2.6, 3.8) and C2 = (7.6, 3.4), as shown in the graph below:

The above graph shows how a single loop of calculation has brought the two centroids closer to the data. If you run another loop of calculations, you will see that the centroids do not move any further and no data point changes its centroid from C1 to C2 or vice versa. This is where the calculation loop, or recursion, is stopped.
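
The centroid update itself is just a coordinate-wise mean. A small sketch with placeholder clusters (the article’s actual per-cluster point lists are not reproduced here):

```python
import numpy as np

# hypothetical stand-ins for the points assigned to each cluster
cluster_1 = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 5.0]])
cluster_2 = np.array([[7.0, 3.0], [8.0, 4.0]])

# new centroid = mean of each coordinate over the cluster's points
print(cluster_1.mean(axis=0))  # [2. 4.]
print(cluster_2.mean(axis=0))  # [7.5 3.5]
```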

The above process of assigning data points to centroids to form clusters is called K-Means clustering, and the value of K is the number of clusters formed; in the current case, K = 2.

The final two clusters are as shown below:

How to decide the value of K?

We started with K = 2 and random coordinates, i.e. we assigned arbitrary values to both the number of centroids and their positions. Although we have found a way to calculate the final positions of the centroids, the question that still remains is: which value of K fits the given data best? We will try to answer this question too, but let’s first understand what a cost function is.

It is a function that measures the imperfection of the model for every value of K. Ideally, each point in a cluster should be as close to its centroid as possible, i.e. the distance between each point in a particular cluster and its centroid should approach zero. We will calculate this distance and take the sum of the squares of the distances as the cost, or measure of imperfection. We will repeat the same for each cluster and find the value of K that minimizes this cost.
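
Written out (in my notation, not the article’s), this cost is the within-cluster sum of squares:

$$J(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $\mu_k$ is the centroid of cluster $C_k$ and the inner sum runs over the points assigned to that cluster.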

Referring to Table 1 and Table 2, we have each point sorted into a cluster along with its distance from its respective centroid, which can be summarized as below:

Each of the distances is squared and added together; the total sum for both clusters is 15.72 + 10.76 = 26.48. This figure is the cost of the model when we set K = 2. Likewise, the cost can be calculated for different values of K, and when the cost values are plotted against the K values, we get a graph like the one shown below:
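
The same arithmetic in code, using the two per-cluster sums quoted above:

```python
# squared-distance sums for the two clusters, as given in the text
cost_cluster_1 = 15.72
cost_cluster_2 = 10.76

print(cost_cluster_1 + cost_cluster_2)  # 26.48, the cost of the model at K = 2
```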

The optimum value of K, as per the above graph, is the one where the curve shows maximum deflection (marked in red), i.e. where the cost curve makes an elbow. Higher values of K reduce the cost further, but they lead to overfitting. This method of finding the optimum value of K is known as the elbow method.
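
In practice the curve is produced by re-running the clustering for a range of K values and plotting the cost each time. A minimal sketch using scikit-learn (my choice of library; the article itself names none):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)  # placeholder data, not the article's ten points

ks = range(1, 9)
# inertia_ is scikit-learn's name for the within-cluster sum of squares
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, costs, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Cost (within-cluster sum of squares)")
plt.show()  # the bend ('elbow') in this curve marks a good choice of K
```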

This is all about K-Means clustering and how the optimum value of K and the positions of the centroids are found. Hope you liked it. Do post your comments.


Thanks,

Have a good time 

If you have any queries about the article, you can reach me on LinkedIn

