Machine Learning Basics: Support Vector Regression

Category: IT Technology · Published: 4 years ago

Learn to build a Support Vector Regression (SVR) model in Machine Learning and analyze the results.

In previous stories I explained how to build Linear and Polynomial Regression models in Python. In this article, we will go through the program for building a Support Vector Regression model on non-linear data.

Overview of SVR

Support Vector Machine (SVM) is a very popular Machine Learning algorithm that is used for both Regression and Classification. Support Vector Regression is similar to Linear Regression in that the equation of the line is y = wx + b. In SVR, this straight line is referred to as the hyperplane. The data points on either side of the hyperplane that are closest to it are called Support Vectors, and they are used to plot the boundary lines.

Unlike other Regression models that try to minimize the error between the real and predicted values, SVR tries to fit the best line within a threshold value a (the distance between the hyperplane and a boundary line). Thus, we can say that the SVR model tries to satisfy the condition -a < y - (wx + b) < a. Points that fall inside this boundary contribute no error; the fitted model is determined by the points on or outside it.
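The threshold condition above corresponds to what is usually called the epsilon-insensitive loss: residuals smaller than the threshold cost nothing, and larger ones are penalized only by their excess. The following is a minimal NumPy sketch of that idea, not scikit-learn's internal implementation. The function name and the sample values of w, b and a are made up for illustration.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the tube of half-width epsilon, linear outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

# Hypothetical line y = wx + b and threshold a (illustrative values only)
w, b, a = 2.0, 1.0, 0.5
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.2, 3.0, 6.0])

# The first two points lie within distance a of the line, so they cost nothing.
# The third point is 1.0 away from the line, so its loss is 1.0 - 0.5 = 0.5.
losses = epsilon_insensitive_loss(y, w * x + b, a)
print(losses)
```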

For non-linear regression, a kernel function maps the data into a higher-dimensional space, where a linear fit is performed. Here we will use the RBF kernel.
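As a rough illustration of what the RBF kernel measures, the sketch below computes K(x1, x2) = exp(-gamma * ||x1 - x2||^2) for a few pairs of points. The helper name and the gamma value of 1.0 are assumptions chosen for the example, not settings taken from this article.

```python
import numpy as np

def rbf_kernel_value(x1, x2, gamma=1.0):
    """K(x1, x2) = exp(-gamma * ||x1 - x2||^2); gamma is an illustrative choice."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

same = rbf_kernel_value([1.0], [1.0])  # identical points: similarity is exactly 1
near = rbf_kernel_value([1.0], [1.5])  # close points: similarity stays high
far = rbf_kernel_value([1.0], [4.0])   # distant points: similarity decays toward 0
print(same, near, far)
```

The kernel acts as a similarity score that falls off with distance, which is what lets a linear fit in the transformed space follow a curved relationship in the original one.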

In this example, we will go through the implementation of Support Vector Regression (SVR), in which we will predict the Marks of a student based on the number of hours they put into study.

Problem Analysis

In this data, we have one independent variable, Hours of Study, and one dependent variable, Marks. In this problem, we have to train an SVR model with this data to understand the relationship between the Hours of Study and the Marks of the student, and to predict a student's mark based on the number of hours they dedicate to studies.

Step 1: Importing the libraries

In this first step, we import the libraries required to build the ML model: NumPy and matplotlib, plus the Pandas library for data analysis.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Step 2: Importing the dataset

In this step, we use pandas to load the data from my GitHub repository into a Pandas DataFrame using the function “pd.read_csv”.

We go through our dataset and assign the independent variable (X) to the column “Hours of Study” and the dependent variable (y) to the last column, which is the “Marks” to be predicted.

dataset = pd.read_csv('https://raw.githubusercontent.com/mk-gurucharan/Regression/master/SampleData.csv')
X = dataset.iloc[:, 0].values
y = dataset.iloc[:, 1].values
y = np.array(y).reshape(-1,1)
dataset.head(5)

>>
Hours of Study   Marks
32.502345        31.707006
53.426804        68.777596
61.530358        62.562382
47.475640        71.546632
59.813208        87.230925

We use the .iloc indexer to slice the DataFrame and assign these columns to X and y. Hours of Study is the independent variable and is assigned to X. The dependent variable to be predicted is the last column, Marks, and it is assigned to y. We reshape y to a column vector using reshape(-1,1).
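To make the resulting shapes concrete, here is a small sketch that applies the same slicing to a toy DataFrame built from the first three rows shown above. The variable names toy, X_toy and y_toy are illustrative only.

```python
import pandas as pd

# Toy frame reusing the first three rows of the dataset preview
toy = pd.DataFrame({
    "Hours of Study": [32.502345, 53.426804, 61.530358],
    "Marks": [31.707006, 68.777596, 62.562382],
})

X_toy = toy.iloc[:, 0].values                 # first column as a 1-D array
y_toy = toy.iloc[:, 1].values.reshape(-1, 1)  # second column as a column vector

print(X_toy.shape)  # (3,)
print(y_toy.shape)  # (3, 1)
```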

Step 3: Feature Scaling

Most available data span varying ranges and magnitudes, which makes building the model difficult. Thus, the data needs to be rescaled to a smaller range, which helps the model train more accurately. In this dataset, StandardScaler standardizes the data (zero mean and unit variance), giving small values near zero. For example, the score of 87.23092513 is normalized to 1.00475931 and the score of 53.45439421 is normalized to -1.22856288.

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X.reshape(-1,1))
y = sc_y.fit_transform(y.reshape(-1,1))

Many of the common Regression and Classification models either handle feature scaling internally or are not very sensitive to it. The SVR class does not scale its inputs, and hence we normalize the data to a limited range ourselves.
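As a sanity check on what StandardScaler does, the sketch below compares its output with the formula (value - mean) / standard deviation and confirms that inverse_transform recovers the original values. It uses only the five Marks values from the dataset preview, so the scaled numbers will not match those quoted for the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Five Marks values from the dataset preview, as a column vector
marks = np.array([[31.707006], [68.777596], [62.562382], [71.546632], [87.230925]])

scaler = StandardScaler()
scaled = scaler.fit_transform(marks)

# Same computation by hand; StandardScaler uses the population std (ddof=0)
manual = (marks - marks.mean(axis=0)) / marks.std(axis=0)

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaled)

print(np.allclose(scaled, manual), np.allclose(restored, marks))
```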

Step 4: Splitting the dataset into the Training set and Test set

In building any ML model, we need to split the data into a training set and a test set. The SVR model is trained on the training set and its predictions are evaluated on the test set. Out of 100 rows, 80 rows are used for training and the model is tested on the remaining 20 rows, as given by the condition test_size=0.2.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
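Because the call above does not fix a random seed, each run produces a different split, so the tables shown later will not be reproduced exactly. One option, sketched here on made-up data, is to pass random_state. The value 0 and the demo arrays are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data with 100 rows, only to show the split sizes
X_demo = np.arange(100, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo + 1.0

# A fixed random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)        # (80, 1) (20, 1)
print(np.array_equal(X_te, X_te2))   # True: same seed, same split
```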

Step 5: Training the Support Vector Regression model on the Training set

In this step, the SVR class is imported from sklearn.svm and an instance is assigned to the variable regressor. The kernel “rbf” (Radial Basis Function) is used to introduce non-linearity to the SVR model, because our data is non-linear. regressor.fit is used to fit the model to X_train and y_train, reshaping the data accordingly.

from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X_train.reshape(-1,1), y_train.ravel())  # SVR expects a 1-D target array
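After fitting, the support vectors mentioned in the overview can be inspected through the support_ and support_vectors_ attributes of the fitted SVR object. The sketch below does this on synthetic data rather than the article's dataset. The sine-shaped data, the seed and the variable names are assumptions, and C and epsilon are simply the scikit-learn defaults written out.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic non-linear data (illustrative only)
rng = np.random.default_rng(0)
X_syn = np.sort(rng.uniform(-2.0, 2.0, size=(80, 1)), axis=0)
y_syn = np.sin(X_syn).ravel() + rng.normal(0.0, 0.1, size=80)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_syn, y_syn)

n_support = len(model.support_)          # indices of training points kept as support vectors
support_points = model.support_vectors_  # the support vectors themselves
print(n_support, "of", len(X_syn), "training points are support vectors")
```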

Step 6: Predicting the Test set Results

In this step, we predict the scores of the test set using the SVR model we built. The regressor.predict function is used to predict the values for X_test, and the predicted values are assigned to y_pred. Because the model was trained on scaled data, the predictions are converted back to the original scale with sc_y.inverse_transform. We now have two sets of values, y_test (real values) and y_pred (predicted values).

y_pred = regressor.predict(X_test)
y_pred = sc_y.inverse_transform(y_pred.reshape(-1,1)).ravel()  # inverse_transform needs a 2-D array
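The same scale, predict, unscale sequence applies when predicting marks for a new raw number of study hours. The following is a minimal sketch on synthetic data. The helper predict_marks, the generated data and the short names sx, sy and svr are hypothetical and not part of the article's code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic hours/marks data (illustrative only)
rng = np.random.default_rng(1)
hours = rng.uniform(20.0, 70.0, size=(100, 1))
marks = 1.3 * hours + rng.normal(0.0, 5.0, size=(100, 1))

sx, sy = StandardScaler(), StandardScaler()
svr = SVR(kernel="rbf").fit(sx.fit_transform(hours), sy.fit_transform(marks).ravel())

def predict_marks(raw_hours):
    """Scale raw hours, predict in scaled space, then map back to marks."""
    scaled_hours = sx.transform(np.asarray(raw_hours, dtype=float).reshape(-1, 1))
    scaled_pred = svr.predict(scaled_hours)
    return sy.inverse_transform(scaled_pred.reshape(-1, 1)).ravel()

estimate = predict_marks([45.0])
print(estimate)
```

Note that new inputs go through transform, not fit_transform, so they are scaled with the statistics learned from the training data.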

Step 7: Comparing the Test Set with Predicted Values

In this step, we shall display the values of y_test as Real Values and y_pred values as Predicted Values for each X_test against each other in a Pandas DataFrame.

df = pd.DataFrame({'Real Values':sc_y.inverse_transform(y_test.reshape(-1,1)).ravel(), 'Predicted Values':y_pred})
df

>>
Real Values   Predicted Values
31.707006     53.824386
76.617341     61.430210
65.101712     63.921849
85.498068     80.773056
81.536991     72.686906
79.102830     60.357810
95.244153     89.523157
52.725494     54.616087
95.455053     82.003370
80.207523     81.575287
79.052406     67.225121
83.432071     73.541885
85.668203     78.033983
71.300880     76.536061
52.682983     63.993284
45.570589     53.912184
63.358790     76.077840
57.812513     62.178748
82.892504     64.172003
83.878565     93.823265

We can see that there is a significant deviation between the predicted values and the real values of the test set, and hence we can conclude that this model is not a perfect fit for this data.
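To put a number on that deviation, one could compute summary error metrics from the table above. The sketch below copies the twenty rows as printed and computes the mean absolute error and root mean squared error with NumPy. The choice of these two metrics is an assumption; the article itself does not report them.

```python
import numpy as np

# Values copied from the Real Values / Predicted Values table
real_values = np.array([31.707006, 76.617341, 65.101712, 85.498068, 81.536991,
                        79.102830, 95.244153, 52.725494, 95.455053, 80.207523,
                        79.052406, 83.432071, 85.668203, 71.300880, 52.682983,
                        45.570589, 63.358790, 57.812513, 82.892504, 83.878565])
predicted_values = np.array([53.824386, 61.430210, 63.921849, 80.773056, 72.686906,
                             60.357810, 89.523157, 54.616087, 82.003370, 81.575287,
                             67.225121, 73.541885, 78.033983, 76.536061, 63.993284,
                             53.912184, 76.077840, 62.178748, 64.172003, 93.823265])

errors = real_values - predicted_values
mae = np.mean(np.abs(errors))         # average size of the error, in marks
rmse = np.sqrt(np.mean(errors ** 2))  # penalizes large errors more heavily
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```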

Step 8: Visualising the SVR results

In this last step, we visualize the results of the SVR model by plotting the real values “y_test” and the predicted values “y_pred” on a graph.

plt.scatter(sc_X.inverse_transform(X_test), sc_y.inverse_transform(y_test.reshape(-1,1)), color = 'red')  # real values
plt.scatter(sc_X.inverse_transform(X_test), y_pred, color = 'green')  # predicted values
plt.title('SVR Regression')
plt.xlabel('Hours of Study')
plt.ylabel('Marks')
plt.show()
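To also draw the fitted curve itself, one option is to predict over a dense grid of scaled inputs and convert both axes back to the original units. The sketch below shows this on synthetic data. The function plot_svr_curve, the generated data and the n_points parameter are hypothetical and not part of the article's code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def plot_svr_curve(hours, marks, n_points=200):
    """Fit an RBF SVR on standardized data and draw its curve in original units."""
    sx, sy = StandardScaler(), StandardScaler()
    X_s = sx.fit_transform(hours)
    y_s = sy.fit_transform(marks).ravel()
    model = SVR(kernel="rbf").fit(X_s, y_s)

    # Dense grid in scaled space, spanning the observed range
    grid_s = np.linspace(X_s.min(), X_s.max(), n_points).reshape(-1, 1)
    curve = sy.inverse_transform(model.predict(grid_s).reshape(-1, 1)).ravel()

    fig, ax = plt.subplots()
    ax.scatter(hours, marks, color="red", label="data")
    ax.plot(sx.inverse_transform(grid_s).ravel(), curve, color="blue", label="SVR fit")
    ax.set_title("SVR Regression")
    ax.set_xlabel("Hours of Study")
    ax.set_ylabel("Marks")
    ax.legend()
    return fig, ax

# Synthetic data (illustrative only)
rng = np.random.default_rng(2)
demo_hours = rng.uniform(20.0, 70.0, size=(60, 1))
demo_marks = 1.3 * demo_hours + rng.normal(0.0, 5.0, size=(60, 1))
fig, ax = plot_svr_curve(demo_hours, demo_marks)
```

Calling plt.show() afterwards displays the figure in an interactive session.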
