The Most Basic Machine Learning Algorithm in Python: Linear Regression

Category: IT Technology · Published: 5 years ago


Learn the concepts of the linear regression algorithm and a working example in Python. This article follows the steps discussed in Andrew Ng’s machine learning course and implements it in Python.

Linear regression is arguably the most basic machine learning algorithm. With so many advanced machine learning algorithms, libraries, and techniques available today, linear regression may not seem important. But it is always a good idea to learn the basics, because they make the concepts behind the more advanced methods much clearer. In this article, I will explain the linear regression algorithm step by step.

Ideas and Formulas

Linear regression uses the very basic idea of prediction. Here is the formula:

Y = BX + C

We all learned this formula in school. Just to remind you, this is the equation of a straight line. Here, Y is the dependent variable, X is the independent variable, B is the slope, and C is the intercept. For linear regression, it is typically written as:

$$h = \theta_0 + \theta_1 X$$

Here, ‘h’ is the hypothesis, that is, the predicted value of the dependent variable, X is the input feature, and theta0 and theta1 are the coefficients. The theta values start from an initial guess (random values or, as in the example later in this article, zeros). Then, using gradient descent, we update the theta values to minimize the cost function. The cost function and gradient descent are explained next.
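To make the hypothesis concrete, here is a minimal sketch that evaluates the straight-line formula on a few inputs. The input values and the coefficients are hypothetical, chosen only for illustration; they are not taken from the dataset used later.

```python
import numpy as np

def hypothesis(theta, X):
    # Straight-line prediction: intercept theta[0] plus slope theta[1] times X.
    return theta[0] + theta[1] * X

# Hypothetical inputs and coefficients, for illustration only.
X = np.array([1.0, 2.0, 3.0])
theta = [0.5, 2.0]

h = hypothesis(theta, X)
print(h)  # [2.5 4.5 6.5]
```

Because X is a NumPy array, the formula is applied to every input at once, so no explicit loop is needed.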

Cost Function and Gradient Descent

The cost function measures how far the predictions are from the original dependent variable. Here is the formula:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2$$
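As a small numeric illustration, the sketch below computes the cost for two choices of theta. It assumes the cost is half the mean squared error, the convention used in Andrew Ng's course. The function name `cost` and the three data points are hypothetical and lie exactly on the line y = 1 + 2x.

```python
import numpy as np

def cost(theta, X, y):
    # Half the mean squared error between predictions and targets.
    m = len(y)
    predictions = theta[0] + theta[1] * X
    return np.sum((predictions - y) ** 2) / (2 * m)

# Hypothetical data lying exactly on the line y = 1 + 2x.
X = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

cost_zero = cost([0.0, 0.0], X, y)   # every prediction is 0
cost_exact = cost([1.0, 2.0], X, y)  # predictions match y exactly
print(cost_zero, cost_exact)
```

With all-zero theta the cost is (9 + 25 + 49) / 6, roughly 13.83. With the true intercept and slope it drops to 0, which is the smallest value the cost can take.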

The goal is to minimize the cost function so that the hypothesis is close to the original dependent variable. To do that, we need to optimize the theta values. Taking the partial derivatives of the cost function with respect to theta0 and theta1 gives the gradient. To update the theta values, we subtract the corresponding partial derivative, scaled by the learning rate alpha, from each theta value:

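$$\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$$
$$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$$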

After taking the partial derivatives, the update formulas become:

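$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x^{(i)}$$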

Here, m is the number of training examples and alpha is the learning rate. This article covers linear regression with one variable, which is why there are only two theta values. With more input variables, there is one theta value for each variable plus one for the intercept.
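A single update can be worked through by hand and checked in code. The sketch below is only an illustration: the helper name `gradient_step`, the three data points, and the learning rate of 0.1 are all assumptions made for this example.

```python
import numpy as np

def gradient_step(theta, X, y, alpha):
    # One simultaneous update of both coefficients.
    m = len(y)
    error = (theta[0] + theta[1] * X) - y  # h(x) - y for every example
    grad0 = np.sum(error) / m              # partial derivative w.r.t. theta0
    grad1 = np.sum(error * X) / m          # partial derivative w.r.t. theta1
    return [theta[0] - alpha * grad0, theta[1] - alpha * grad1]

# Hypothetical data lying on y = 1 + 2x, starting from theta = [0, 0].
X = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

theta = gradient_step([0.0, 0.0], X, y, alpha=0.1)
print(theta)
```

Starting from zeros, the errors are -3, -5 and -7, so the two partial derivatives are -15/3 = -5 and -34/3. After one step theta0 becomes 0.5 and theta1 becomes about 1.133, both moving toward the true values 1 and 2.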

Working Example

The dataset I am going to use is from Andrew Ng’s machine learning course on Coursera. Here is the process of implementing linear regression step by step in Python.

1. Import the packages and the dataset.

import numpy as np
import pandas as pd
df = pd.read_csv('ex1data1.txt', header = None)
df.head()

[Output: the first five rows of the dataset]

In this dataset, column 0 is the input feature and column 1 is the output variable or dependent variable. We will use column 0 to predict column 1 using the straight-line formula above.

2. Plot column 1 against column 0.

[Figure: scatter plot of column 1 against column 0]

The relationship between the input variable and the output variable looks roughly linear. Linear regression works best when the relationship is linear.
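The plotting code for this step is not shown above. A minimal sketch of how such a scatter plot might be produced is given here. The helper name `plot_data` and the three sample rows are hypothetical stand-ins; with the real data, the DataFrame loaded in step 1 would be passed instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_data(df):
    # Scatter column 1 (output) against column 0 (input feature).
    fig, ax = plt.subplots()
    ax.scatter(df[0], df[1])
    ax.set_xlabel("input feature (column 0)")
    ax.set_ylabel("output variable (column 1)")
    return ax

# Hypothetical stand-in rows, used only so the sketch runs on its own.
sample = pd.DataFrame({0: [1.0, 2.0, 3.0], 1: [3.1, 4.9, 7.2]})
ax = plot_data(sample)
```

In a script, calling plt.show() afterwards opens the figure. In a notebook the figure is normally displayed automatically.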

3. Initialize the theta values. I am initializing the theta values as zeros, but other starting values should work as well.

theta = [0,0]

4. Define the hypothesis and the cost function as per the formulas discussed before.

def hypothesis(theta, X):
    return theta[0] + theta[1]*X

def cost_calc(theta, X, y):
    # 1/(2*m), not 1/2*m (which evaluates to m/2); m is set in step 5
    return (1/(2*m)) * np.sum((hypothesis(theta, X) - y)**2)

5. Calculate the number of training examples as the length of the DataFrame, then define the function for gradient descent. This function updates the theta values for a fixed number of iterations (epochs), which should be large enough for the cost to settle near its minimum. After each update, it recalculates the cost with the new theta values and stores it so that we can keep track of the cost.

m = len(df)  # number of training examples

def gradient_descent(theta, X, y, epoch, alpha):
    cost = []  # cost after each iteration
    for _ in range(epoch):
        # Compute predictions once so both thetas update simultaneously
        hx = hypothesis(theta, X)
        theta[0] -= alpha * np.sum(hx - y) / m
        theta[1] -= alpha * np.sum((hx - y) * X) / m
        cost.append(cost_calc(theta, X, y))
    return theta, cost

6. Finally, define the predict function. It gets the updated theta from the gradient descent function and returns the hypothesis (the predicted output variable) together with the cost history and the final theta values.

def predict(theta, X, y, epoch, alpha):
    theta, cost = gradient_descent(theta, X, y, epoch, alpha)
    return hypothesis(theta, X), cost, theta

7. Using the predict function, find the hypothesis, cost, and updated theta values. I chose a learning rate of 0.01 and will run the algorithm for 2000 epochs (iterations).

y_predict, cost, theta = predict(theta, df[0], df[1], 2000, 0.01)

The final theta values are -3.79 and 1.18.
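One way to sanity-check values like these is to compare gradient descent against a closed-form least-squares fit. The sketch below does this on a small hypothetical dataset, not on the course data. Note that `np.polyfit` is not part of the method described in this article; it is used here only as an independent reference. The helper name `fit_line_gd`, the learning rate, and the iteration count are assumptions for this example.

```python
import numpy as np

def fit_line_gd(X, y, alpha, epochs):
    # Plain batch gradient descent for a one-variable linear model.
    theta0, theta1 = 0.0, 0.0
    m = len(y)
    for _ in range(epochs):
        error = theta0 + theta1 * X - y
        step0 = alpha * np.sum(error) / m
        step1 = alpha * np.sum(error * X) / m
        theta0, theta1 = theta0 - step0, theta1 - step1
    return theta0, theta1

# Hypothetical data that roughly follows y = 2x.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 6.2, 7.9, 10.1])

gd_intercept, gd_slope = fit_line_gd(X, y, alpha=0.05, epochs=5000)
ref_slope, ref_intercept = np.polyfit(X, y, 1)  # returns [slope, intercept]

print(gd_intercept, gd_slope)
print(ref_intercept, ref_slope)
```

If gradient descent has run long enough, the two pairs of numbers should agree to several decimal places. A noticeable gap usually means that more iterations or a different learning rate is needed.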

8. Plot the original y and the hypothesis or the predicted y in the same graph.

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(df[0], df[1], label = 'Original y')
plt.scatter(df[0], y_predict, label = 'predicted y')
plt.legend(loc = "upper left")
plt.xlabel("input feature")
plt.ylabel("Original and Predicted Output")
plt.show()

[Figure: scatter plot of the original y and the predicted y against the input feature]

The hypothesis plot is a straight line, as expected from the formula, and the line follows the overall trend of the data points.

9. Remember, we kept track of the cost function in each iteration. Let’s plot the cost function.

plt.figure()
plt.scatter(range(0, len(cost)), cost)
plt.show()

[Figure: cost plotted against the iteration number]

As I mentioned before, our purpose was to optimize the theta values to minimize the cost. As you can see from this graph, the cost dropped sharply in the beginning and then flattened out. This suggests that gradient descent has settled near the minimum and the theta values are close to optimal.
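The flattening of the cost curve can also be used as a stopping rule instead of a fixed number of epochs. The following is a minimal sketch of that idea, not the approach used in this article: the function name `gradient_descent_tol`, the tolerance, the learning rate, and the small dataset are all assumptions for illustration.

```python
import numpy as np

def gradient_descent_tol(X, y, alpha=0.05, tol=1e-9, max_iter=100000):
    # Stop once the cost changes by less than tol between iterations.
    theta0, theta1 = 0.0, 0.0
    m = len(y)
    costs = []
    for _ in range(max_iter):
        error = theta0 + theta1 * X - y
        step0 = alpha * np.sum(error) / m
        step1 = alpha * np.sum(error * X) / m
        theta0, theta1 = theta0 - step0, theta1 - step1
        new_error = theta0 + theta1 * X - y
        costs.append(np.sum(new_error ** 2) / (2 * m))
        if len(costs) > 1 and abs(costs[-2] - costs[-1]) < tol:
            break
    return [theta0, theta1], costs

# Hypothetical data that roughly follows y = 2x.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 6.2, 7.9, 10.1])

theta, costs = gradient_descent_tol(X, y)
print(theta, len(costs))
```

The length of the returned cost list shows how many iterations were actually needed. A looser tolerance stops earlier at the price of slightly less accurate theta values.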

I hope this was helpful. Here is the link to the dataset used in this article:

Here are implementations of some other machine learning algorithms:

Polynomial Regression From Scratch in Python

Logistic Regression in Python From Scratch to End With a Real Dataset

Multiclass Classification With Logistic Regression One vs All Method From Scratch Using Python

Logistic Regression with Python Using Optimization Function

The Workflow of Neural Network Explained

