内容简介:· Basic mathematics and Euclidian geometryIn machine learning, one of the frequently encountered problems is grouping similar data together. You know the income level of people and now you want to group people with similar income levels together. You want
K-Means Clustering — One rule to group them all
This article will try to:
- Explain graphically the basics of K-Means Clustering, an unsupervised machine learning algorithm
- Discuss the method of finding optimum value of K and centroid location of the clusters
What you are already supposed to know:
· Basic mathematics and Euclidian geometry
In machine learning, one of the frequently encountered problems is grouping similar data together. You know the income level of people and now you want to group people with similar income levels together. You want to know who are the people with low income power, the people with high or very high income power, which you think can be helpful in devising a perfect marketing strategy. You have the shopping data of customers with you and now you want to group customers with similar shopping preferences together or you being the bio student want to know which cells share similar properties based upon the data about the cells you have at your hand.
All the above stated problems come under the domain of an unsupervised machine learning method called clustering. Although there are a number of clustering algorithms there but when it comes to the simplest one, the award will go to K-Means clustering. To understand the algorithm let’s suppose we have the following data with us:
We have ten data points with us with their x and y coordinates given. Here we have variables represented as x and y but in practical situations, they can be different like monthly income or daily spent in market segmentation case. Let’s start with two clusters and first visualize the above data to get an idea about it:
The above graph shows the data we have at our hand (Graph 1) and the two possible clusters that can we make out of it (Graph 2). Note that, the decision of finding the number of clusters (2 in the present case) in any data is totally arbitrary & based upon a random guess. Later you will find out how this random guess at first leads us to the optimum number of clusters possible in a data. The question now is how to calculate and analytically find out the two clusters.
- First, assume two random points anywhere near the data & consider them as centre of two clusters (centroids)
- Find the distance of every data point from both the centres
- Assign every data point to the centroid to which it is nearest, hence make two clusters
- Calculate the centre of both the formed clusters and shift the centroids there.
- Go to step 1 and repeat the process until there is no change in the clusters formed.
Let’s apply the above logic to the given data. We know in the cartesian coordinate system the distance between two points (x1, y1) and (x2, y2) is given by:
Using the above formula, we can calculate the distance between each point from assumed centres. Let’s assume our centroids at C1 = (4,1) and C2 = (6,1), graphically speaking, as shown in the below graph:
If we calculate the distance of each point from two centroids the result will be as shown in the below table:
Based upon the above distance values, each point will be assigned to the centroid to which its distance is minimum or to which it is nearer e.g. consider the first data point, its distance form C1 is 3.6 and from C2 is 5.4. As it is nearer to C1, it will be assigned to this particular centroid. Doing the same to each and every point, the assignment will be as shown below:
As you can see from the above table, each point, based upon its distance from the assumed centroids has been assigned to a centroid. After the assignment, the new position of the centroids is calculated by calculating the centre of the points assigned to each cluster by calculating the mean distance of each coordinate as shown in the table. Hence the new position of centroids will be C1 = (2.6, 3.8) and C2 = (7.6, 3.4) as shown in the graph below:
The above graph shows how the single loop of calculation has bought the two centroids closer to data. If you run another loop of calculations, you can see that the centroids do not move any further and there is no data point which changes its centroid from C1 to C2 or vice versa. This is where the calculation loop or recursion is stopped.
The above process of assigning data to centroids to make clusters out of it is called K-Means clustering and the value of K is the number of clusters formed e.g. the value of K in the current case will be 2.
The final two clusters are as shown below:
How to decide the value of K?
We started with K=2 with random coordinates, i.e. we assigned random values to both, number of centroids and position of centroids. Although, we finally found out the way to calculate the final position of centroids the question that still remains is that what will be the value of K which would fit the given data in the most optimum manner. We would try to find the answer to this question too but let’s first understand what cost function is.
It is a function which gives the measure of the imperfection of the model with respect to every value of K . Ideally speaking, each point in a cluster should be as close to its centroid as possible or the distance between each point in a particular cluster from its centroid should be zero. We will calculate this distance & consider the sum of squares of the distances as the cost or measure of imperfection. We will repeat the same for each cluster & will find the value of K which will minimize this cost.
Refer to Table-1 and Table-2, we have each point sorted in a cluster and distance of the points from their respective centroids which can be summarized as below:
Each of the distance is squared & added together, the total sum for both the clusters is 15.72 + 10.76 = 26.84. This figure is the cost of the model when we set K = 2. Likewise, the cost can be calculated for different values of K and when the cost values are plotted against K values, we will get a graph as shown below:
The optimum value of K , as per the above graph would be the one where it shows maximum deflection (marked in red) or where the cost curve makes an elbow. Although the higher values of K reduce the cost further but they lead to overfitting. This method of finding out the value of optimum K is known as the elbow method.
This is all about K-Means clustering & how Optimum value of K and position of centroids is found out. Hope you like it. Do post your comments
Further reading:
Thanks,
Have a good time
If you have any queries about the article, you can reach me on LinkedIn
以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
深入浅出Web设计(中文版)
(美)瓦特罗尔、(美)西罗托 / O'Reilly Taiwan公司 / 东南大学出版社 / 2010-11 / 99.00元
你将从《深入浅出Web设计(中文版)》学到什么?你曾经希望看看书就能学到真正的网站设计吗?曾经想过该如何同时达成让网站看起来美观,又能有效率地沟通信息,还要通过可访问性与可用性的策略吗?《深入浅出Web设计》正是精通上述主题的秘笈。我们将学到如何设计一个绝佳、用户友好的网站,上谈客户需求,下说手绘分镜表,乃至完成在线所需的HTML与css主文件……而且会有一个真正可以运作的网站!一起来看看 《深入浅出Web设计(中文版)》 这本书的介绍吧!