An SVM model is a representation of the dataset as points in space, arranged so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
1. Maximal Margin Classifier
2. Support Vector Classifier
3. Support Vector Machines
4. Support Vector Machine for more than two classes
Any new incoming data point is then mapped to one of these categories based on which side of the gap it falls on.
For example, in the image above, we can clearly see that there are two categories in the dataset: a blue category and a pink category. Our aim is to differentiate between the two. One simple way of doing that is to draw a line between the two categories, but as we can see, there is an infinite number of lines that can cleanly divide the dataset into two parts. What we actually do is choose the hyperplane that maximizes the margin between the classes. The data points (vectors) touching the two outer lines are called support vectors.

This simple two-dimensional, linearly separable example extends to datasets with more dimensions: each time, the idea is to find a hyperplane that divides the data into the different categories. Now we are going to go through the related mathematics and discuss the different terms used in SVMs.
In general, the discussion of SVM is divided into three parts according to how SVM evolved.
- Maximal Margin Classifier
- Support Vector Classifier
- Support Vector Machine
We will slowly work our way toward the Support Vector Machine, but for a proper understanding of SVMs we first have to go through the Maximal Margin Classifier and the Support Vector Classifier. Before that, the short sketch below shows what fitting a linear SVM looks like in code.
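As a purely hypothetical illustration (the article itself does not prescribe a library; scikit-learn is assumed here), the snippet below fits a linear SVM to a tiny two-class toy dataset and prints the support vectors that define the maximum-margin hyperplane.

```python
# A minimal sketch: fit a linear SVM on a small, linearly separable toy dataset
# and inspect the support vectors. Assumes NumPy and scikit-learn are installed.
import numpy as np
from sklearn.svm import SVC

# Two clearly separated clusters: class -1 (left) and class 1 (right)
X = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 1.0],
              [5.0, 5.5], [6.0, 5.0], [5.5, 6.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin (maximal margin) classifier
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("Support vectors:\n", clf.support_vectors_)
print("Hyperplane coefficients (beta_1, beta_2):", clf.coef_[0])
print("Intercept (beta_0):", clf.intercept_[0])
print("Prediction for a new point [3, 3]:", clf.predict([[3.0, 3.0]]))
```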
Maximal Margin Classifier
Maximal Margin Classifier is a model that is used to classify the observations into two parts using a hyperplane.
What is a Hyperplane?
Simply put, a hyperplane is a subspace of a $p$-dimensional space having $p - 1$ dimensions. For example, in two-dimensional space, the hyperplane will be of one dimension, i.e. a line. Similarly, in the case of three dimensions, it will be a two-dimensional plane.
In two dimensions, the equation of the hyperplane is given by,
$\beta_0 + \beta_1X_1 + \beta_2X_2 = 0$
$where\ the\ vector\ (X_1,\ X_2)\ lies\ on\ the\ hyperplane$
We can also notice the similarity of this equation to the equation of a line. It is fairly easy to extend this equation and find the equation of a hyperplane in $p$ dimensions.
$\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p = 0$
Now if,
$\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p > 0$
then the vector is on one side of the hyperplane, and if,
$\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p < 0$
then the vector is on the other side of the hyperplane.

To sum it up, our main aim in the case of the Maximal Margin Classifier is to fit a hyperplane to a training data matrix $X$ of size $n \times p$, containing $n$ training observations in $p$-dimensional space, such that every observation falls into one of the two classes divided by the hyperplane.
If we represent the classes (labels) for all $n$ observations as,
$y_1, ..., y_n \in \{-1, 1\}$
where -1 represents one class and 1 represents the other class. Our main aim, for any incoming test vector,
$x^* = (x_1^*\ ...\ x_p^*)^T$
is that our model allots this incoming test vector to one of the two classes. The following function gives the class of the incoming test vector,
$f(x^*) = \beta_0 + \beta_1x_{1}^* + \beta_2x_{2}^* + ... + \beta_px_{p}^*$
If the value of this function is positive, we assign the observation to class 1; otherwise, we assign it to class -1. A minimal numeric sketch of this rule is given below.
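The following snippet illustrates the decision rule above with made-up coefficients $\beta$ (purely hypothetical values chosen for the example):

```python
# Classify a test vector by the sign of f(x*) = beta_0 + beta_1*x_1 + ... + beta_p*x_p.
# The coefficients here are made-up values used only to illustrate the rule.
import numpy as np

beta_0 = -1.0                      # intercept
beta = np.array([0.5, 2.0, -1.5])  # beta_1 ... beta_p (here p = 3)

def classify(x_star):
    """Return class 1 if f(x*) > 0, otherwise class -1."""
    f = beta_0 + np.dot(beta, x_star)
    return 1 if f > 0 else -1

print(classify(np.array([1.0, 1.0, 0.5])))  # f = -1 + 0.5 + 2 - 0.75 = 0.75 -> class 1
print(classify(np.array([0.0, 0.0, 2.0])))  # f = -1 + 0 + 0 - 3 = -4     -> class -1
```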
A simple issue with this approach is that there is an infinite number of hyperplanes that can divide a perfectly separable dataset. The problem therefore reduces to choosing the best possible hyperplane to divide the observations into two parts. A natural choice is to compute the perpendicular distance of each observation from the candidate hyperplanes; the hyperplane that produces the maximum margin on both sides is the one chosen.
Once we have the hyperplane, it is fairly easy to predict the classes of test observations. The only assumption we are making here is that a hyperplane dividing the observations in the training set will also divide the observations in the test set, which is not always true. Therefore, this model can lead to overfitting when $p$ is large.

We have already mentioned that the points lying on the dashed lines are called support vectors, and it turns out that the position of the hyperplane depends only on the support vectors and not on the other observations in the dataset; the sketch below illustrates this.
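As a quick, hypothetical demonstration of that last point (again assuming scikit-learn), we can fit a near-hard-margin linear SVM, drop an observation that is not a support vector, refit, and observe that the hyperplane is essentially unchanged:

```python
# Sketch: the fitted hyperplane depends only on the support vectors.
# Removing a non-support-vector observation leaves it (almost) unchanged.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 1.5], [0.5, 0.5],
              [5.0, 5.5], [6.0, 5.0], [7.0, 7.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf_full = SVC(kernel="linear", C=1e6).fit(X, y)
sv_idx = set(clf_full.support_)  # indices of the support vectors

# Drop one observation that is NOT a support vector and refit
drop = next(i for i in range(len(X)) if i not in sv_idx)
mask = np.arange(len(X)) != drop
clf_reduced = SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])

print("Coefficients with all points:", clf_full.coef_[0], clf_full.intercept_)
print("Coefficients without point", drop, ":", clf_reduced.coef_[0], clf_reduced.intercept_)
```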
This is how we define a Maximal margin classifier. There are a few issues with Maximal Margin Classifier.
- It does not work on datasets where no clear separating hyperplane exists between the classes.
- The addition of a single observation near the hyperplane can change the hyperplane considerably, making the classifier very sensitive to individual observations.
Simply put, we would rather create a hyperplane that almost separates the classes than one that must separate them perfectly. This brings us to the concept of the Support Vector Classifier.
Support Vector Classifier
In the case of the Support Vector Classifier, we allow a few observations to be on the wrong side of the hyperplane. This makes the model a little more robust to individual observations and helps us classify most of the other observations better.
Support vector classifier is also known as a soft margin classifier.
The observations on the wrong side of the hyperplane are obviously misclassified by the model, but allowing these few misclassifications helps improve the overall accuracy of the model.
There is not much difference in the idea behind generating the model: in the case of the Support Vector Classifier as well, we want to maximize the value of the margin $M$, subject to
$y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_px_{ip}) \geq M(1 - \epsilon_{i})$
$where\ \epsilon_i \geq 0\ and\ \displaystyle \sum _{i=1}^{n} \epsilon_i \leq C$
For any given observation $i$, the slack variable $\epsilon_i$ tells us where that observation is located relative to the hyperplane and the margin. If the $i$th slack variable is zero, the observation is on the correct side of the margin. If
$\epsilon_i > 0$
then observation $i$ is on the wrong side of the margin, and if
$\epsilon_i > 1$
then the observation is on the wrong side of the hyperplane.

Turning to the tuning variable $C$, we can see that $C$ is the number that determines the count and severity of the violations of the margin and the hyperplane that we will tolerate. The value of $C$ is a tuning parameter and is generally chosen by cross-validation. $C$ also controls the bias-variance trade-off of the model: if the value of $C$ is small, we allow fewer observations to be on the wrong side, which yields a classifier that fits the training data closely (low bias but potentially high variance), and vice-versa. Again, similar to the Maximal Margin Classifier, it turns out that not all observations get to decide the position of the hyperplane and the margin; the fit depends only on the observations that lie on or inside the margin. A sketch of choosing $C$ by cross-validation is given below.
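As a hypothetical illustration of tuning by cross-validation (again assuming scikit-learn, which the article itself does not specify), the sketch below searches over several values of the regularization parameter. Note that scikit-learn's `C` is a penalty on margin violations, so it behaves roughly inversely to the budget $C$ used in the formulation above.

```python
# Sketch: choosing the regularization parameter of a linear SVM by cross-validation.
# scikit-learn's C penalizes margin violations (larger C -> fewer violations allowed),
# so it plays a roughly inverse role to the budget C in the text above.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print("Best C found by 5-fold cross-validation:", search.best_params_["C"])
print("Cross-validated accuracy:", round(search.best_score_, 3))
```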
If we expand these points a little we can get to the Support Vector Machines. Let’s discuss them in some detail.
Support Vector Machines
In the Support Vector Machine, we introduce another ingredient called the kernel, which results from enlarging the feature space of the support vector classifier in a specific way. Following our discussion of the support vector classifier, its equation can be re-written as,
$f(x) = \beta_0 +\displaystyle \sum_{i=1}^n \alpha_i\langle x, x_i\rangle$
$where\ \langle x, x_i\rangle\ is\ the\ inner\ product\ between\ the\ new\ point\ x\ and\ the\ other\ points\ x_i$
The exact implementation of the inner product is deliberately left as a black box; we can proceed without knowing its details.
We can directly replace all the instances of the inner product with a general term called the kernel.
$f(x) = \beta_0 +\displaystyle \sum_{i \in S} \alpha_iK(x, x_i)$
where $S$ is the set of support vectors, as only the support vectors are responsible for the creation of the hyperplane. For $p$ predictors, one commonly used kernel is
$K(x_i, x_{i'}) = (1 +\displaystyle \sum_{j=1}^p x_{ij}x_{i'j})^d$
which is known as a polynomial kernel of degree $d$. This type of kernel leads to a much more flexible decision boundary; a brief sketch of using one is given below.
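As a small, hypothetical example (assuming scikit-learn), a polynomial kernel can be plugged into an SVM directly, and it comfortably handles data that no linear hyperplane can separate:

```python
# Sketch: an SVM with a degree-3 polynomial kernel on a dataset that is not
# linearly separable, compared against a plain linear SVM.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Concentric circles: no straight line (hyperplane) can separate these classes.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0)
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0)  # coef0 supplies the "1 +" term

print("Linear kernel CV accuracy:    ", cross_val_score(linear_svm, X, y, cv=5).mean())
print("Polynomial kernel CV accuracy:", cross_val_score(poly_svm, X, y, cv=5).mean())
```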
Support Vector Machine for more than two classes
In the discussion of support vector machines so far, we have not really talked about the case where the number of possible classes is more than two. We can handle such problems by extending the simple SVM in two ways.
- One versus One Classification: here we fit an SVM for every pair of classes. Finally, for a test vector, we choose the class to which it is assigned most often across these pairwise classifiers.
- One versus All Classification: here we fit $K$ SVMs, each comparing one of the $K$ classes to the remaining $K - 1$ classes. Finally, we assign any upcoming test vector to the class $k$ for which the following quantity is the largest,
$\beta_{0k} +\beta_{1k}x_1^* +\beta_{2k}x_2^* +...+\beta_{pk}x_p^*$
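As a final hypothetical sketch (again assuming scikit-learn), both multi-class strategies are available out of the box: `SVC` uses one-versus-one internally, while `OneVsRestClassifier` implements the one-versus-all approach.

```python
# Sketch: multi-class SVMs on the classic 3-class iris dataset.
# SVC handles multi-class problems via one-versus-one; OneVsRestClassifier
# wraps it into a one-versus-all (one-versus-rest) scheme.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

one_vs_one = SVC(kernel="linear", C=1.0)                       # one-vs-one by default
one_vs_all = OneVsRestClassifier(SVC(kernel="linear", C=1.0))  # one-vs-rest wrapper

print("One-vs-one CV accuracy:", cross_val_score(one_vs_one, X, y, cv=5).mean())
print("One-vs-all CV accuracy:", cross_val_score(one_vs_all, X, y, cv=5).mean())
```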
That's it for this version of the Support Vector Machine discussion. Feel free to express your thoughts in the comments and share this post with your friends.