Detailing and Building a Linear Regression model from scratch
Linear Regression is a popular linear Machine Learning algorithm for regression-based problems. It is usually one of the first algorithms you learn, due to its simplicity and the way it builds into other algorithms like Logistic Regression and Neural Networks. In this story we are going to implement it from scratch so that we can build our intuition about what is going on inside a Linear Regression model.
Link to Github Repo…
Note: There are many frameworks with highly optimized code, such as Scikit-Learn, TensorFlow and PyTorch, so it is usually unnecessary to build your own algorithm from scratch. However, building it ourselves sharpens our intuition of what is happening inside the model, which helps when we attempt to improve our model's performance.
A linear model is an algorithm that makes a prediction by simply computing a weighted sum of the input features plus a bias term (also referred to as the intercept term). Putting that into perspective, when we use a Linear Regression model we hope to explain the relationship between a dependent variable (e.g. house price) and one or more independent variables (e.g. location, bedrooms, area, etc.).
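To make this concrete, here is a minimal sketch (with made-up feature values and weights, not taken from any real dataset) of what that weighted sum looks like for a single instance:

import numpy as np

# hypothetical single instance with two features: [bedrooms, area]
x = np.array([3, 120.0])
W = np.array([10.5, 0.8])  # one weight per feature (made-up values)
b = 25.0                   # bias / intercept term

y_hat = np.dot(x, W) + b   # weighted sum of the features plus the bias
print(y_hat)               # 152.5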
When we train a model we are attempting to set its parameters so that the resulting line best fits the training data. In other words, when we train a Linear Regression model we are trying to find the value of theta that minimizes the cost function. The most common cost function for regression models is the Root Mean Squared Error (RMSE); however, it is much easier to minimize the Mean Squared Error (MSE), which leads to the same result.
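As a quick illustration (again with made-up values), the MSE is the mean of the squared errors and the RMSE is simply its square root, which is why minimizing one minimizes the other:

import numpy as np

y_true = np.array([3.0, -0.5, 2.0])    # toy labels
y_pred = np.array([2.5, 0.0, 2.0])     # toy predictions

mse = np.mean((y_pred - y_true) ** 2)  # Mean Squared Error
rmse = np.sqrt(mse)                    # Root Mean Squared Error
print(mse, rmse)                       # 0.1666... 0.4082...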
Creating the Model
If you have never written a Machine Learning algorithm from scratch, I greatly encourage you to do so. John Sullivan wrote a very useful story titled 6 Steps To Write Any Machine Learning Algorithm From Scratch: Perceptron Case Study, which is the best advice I have managed to find on the internet about writing algorithms from scratch.
Chunking the Algorithm
- Randomly initialize parameters for the hypothesis function
- Calculate the partial derivatives (read more about this here)
- Update parameters
- Repeat steps 2–3 for n iterations (or until the cost function is minimized)
- Inference
Implementation
For this section, I will be leveraging three Python packages: NumPy for linear algebra, Scikit-Learn, a popular Machine Learning framework, and Matplotlib to visualize our data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
First, we need a dataset. To create one I will use sklearn.datasets.make_regression, which allows you to generate a random regression problem — see the Documentation. Next, I will split the data into training and test sets with sklearn.model_selection.train_test_split — Documentation.
# creating the data set
X, y = make_regression(n_samples=100, n_features=1, n_targets=1, noise=20, random_state=24)

# splitting training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=24)
Let's use matplotlib.pyplot to see how our data looks — Documentation.
# visualize
plt.scatter(X, y)
plt.show()
Now we can begin our implementation of Linear Regression. The first step of our chunk is to randomly initialize parameters for our hypothesis function.
def param_init(X):
"""
Initialize parameters for linear regression model
__________________
Input(s)
X: Training data
__________________
Output(s)
params: Dictionary containing coefficients
"""
params = {} # initialize dictionary
_, n_features = X.shape # shape of training data
    # initializing coefficients to 0
params["W"] = np.zeros(n_features)
params["b"] = 0
return params
Excellent. Next we want to calculate the partial derivatives and update our parameters. We do this using a very important Machine Learning algorithm called Gradient Descent . Ergo, we can implement steps 2–4 with Gradient Descent.
def gradient_descent(X, y, params, alpha, n_iter):
"""
Gradient descent to minimize cost function
__________________
Input(s)
X: Training data
y: Labels
    params: Dictionary containing initialized coefficients
    alpha: Model learning rate
    n_iter: Number of iterations of gradient descent
__________________
Output(s)
params: Dictionary containing optimized coefficients
"""
W = params["W"]
b = params["b"]
m = X.shape[0] # number of training instances
for _ in range(n_iter):
        # prediction with current coefficients
y_pred = np.dot(X, W) + b
# taking the partial derivative of coefficients
dW = (2/m) * np.dot(X.T, (y_pred - y))
db = (2/m) * np.sum(y_pred - y)
# updates to coefficients
W -= alpha * dW
b -= alpha * db
params["W"] = W
params["b"] = b
return params
Using these functions we can train our Linear Regression model on our training data to obtain model parameters that we will require for inference.
def train(X, y, alpha=0.01, n_iter=1000):
"""
    Train Linear Regression model with Gradient Descent
__________________
Input(s)
X: Training data
y: Labels
alpha: Model learning rate
n_iter: Number of iterations
__________________
Output(s)
params: Dictionary containing optimized coefficients
"""
init_params = param_init(X)
params = gradient_descent(X, y, init_params, alpha, n_iter)
return params
Now, when we run this function we would obtain the optimized weights from our training data that we will use for inference on our test data. Next, we need to create a predict function for our inference that uses our stored weights.
def predict(X_test, params):
"""
    Make predictions with the trained Linear Regression model
__________________
Input(s)
    X_test: Unseen data
    params: Dictionary containing optimized weights from training
__________________
Output(s)
y_preds: Predictions of model
"""
y_preds = np.dot(X_test, params["W"]) + params["b"]
return y_preds
Great! Let’s run these functions and plot them to see what happens…
params = train(X_train, y_train) # train model
y_preds = predict(X_test, params)  # inference

plt.scatter(X_test, y_test)
plt.plot(X_test, y_preds, color="red")
plt.title("Predictions Dummy Regression Data")
plt.xlabel("X axis")
plt.ylabel("Y axis")
plt.show()
Our line of best fit seems to be pretty decent. To be entirely sure of our implementation, we are fortunate enough to have many Machine Learning libraries with optimized code to compare against. Here I am simply going to compare the RMSE of our implementation with that of Scikit-Learn.
Note: It may also be worthwhile to plot their line of best fit on the same plot; if it covers yours, you are on the right track (see the sketch after the comparison below).
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
sklearn_y_preds = lin_reg.predict(X_test)

print(f"My implementation: {np.sqrt(mean_squared_error(y_test, y_preds))}\nSklearn implementation: {np.sqrt(mean_squared_error(y_test, sklearn_y_preds))}")

>>>> My implementation: 20.986105292320207
Sklearn implementation: 20.986105292320207
It’s a match!
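As mentioned in the note above, a quick visual sanity check is to overlay the two lines of best fit. A minimal sketch, reusing the variables we already have:

plt.scatter(X_test, y_test, label="test data")
plt.plot(X_test, y_preds, color="red", label="from scratch")
plt.plot(X_test, sklearn_y_preds, color="green", linestyle="--", label="scikit-learn")
plt.legend()
plt.show()

If the green dashed line sits on top of the red one, both implementations have found essentially the same parameters.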
For this implementation we used procedural programming, which can become quite complex and cumbersome to maintain, and does not make the most of Python being an Object-Oriented Programming (OOP) language. On that note, here is an OOP implementation of our Linear Regression model.
class LinReg():
"""
Custom made Linear Regression class
"""
    def __init__(self, alpha=0.01, n_iter=1000):
self.alpha = alpha
self.n_iter = n_iter
self.params = {}
def param_init(self, X_train):
"""
Initialize parameters for linear regression model
__________________
Input(s)
X: Training data
"""
        _, n_features = X_train.shape  # shape of training data
        # initializing coefficients to 0
self.params["W"] = np.zeros(n_features)
self.params["b"] = 0
return self
def gradient_descent(self, X_train, y_train):
"""
Gradient descent to minimize cost function
__________________
Input(s)
X: Training data
        y: Labels
        __________________
        Output(s)
        self: Model with optimized coefficients stored in self.params
"""
W = self.params["W"]
b = self.params["b"]
m = X_train.shape[0]
for _ in range(self.n_iter):
            # prediction with current coefficients
y_pred = np.dot(X_train, W) + b
# taking the partial derivative of coefficients
dW = (2/m) * np.dot(X_train.T, (y_pred - y_train))
db = (2/m) * np.sum(y_pred - y_train)
# updates to coefficients
W -= self.alpha * dW
b -= self.alpha * db
self.params["W"] = W
self.params["b"] = b
return self
def train(self, X_train, y_train):
"""
        Train Linear Regression model with Gradient Descent
__________________
Input(s)
X: Training data
y: Labels
        __________________
        Output(s)
        self: Trained model with optimized coefficients stored in self.params
"""
        self.param_init(X_train)
        self.gradient_descent(X_train, y_train)
return self
def predict(self, X_test):
"""
        Make predictions with the trained Linear Regression model
__________________
Input(s)
        X_test: Unseen data
__________________
Output(s)
y_preds: Predictions of model
"""
y_preds = np.dot(X_test, self.params["W"]) + self.params["b"]
return y_preds
Let's instantiate this class and have it make predictions on our test data…
linreg = LinReg()
linreg.train(X_train, y_train)
linreg.predict(X_test)

>>>> array([   4.73888182,  -90.06369632,   80.39799712,   66.76983607,
              -49.97207144,   93.77905208,   34.30778991,  -38.2209702 ,
               78.03331698,   53.81416352,  102.96993005,  151.71946744,
               95.52801857,  104.82707085,   98.0492089 ,   45.05150211,
               -7.29917923,  -78.41675446,  -27.14118529,  -98.52923336,
              170.75840972, -106.22126739,   24.86194847,  -21.39127805,
               50.24074837])
As a sanity check we will test if the predictions are the same as our procedural implementation (since we know this is similar to the scikit-learn implementation already).
linreg.predict(X_test) == y_preds

>>>> array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])
And there you have it!
Note: The problem we used as an example is a simple linear regression problem (univariate Linear Regression), y = WX + b, where W is the weight and b is the bias.
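That said, nothing in the implementation is tied to a single feature: the weight vector is initialized with one entry per column of X, so the same functions handle multivariate data. A quick sketch on another toy dataset, generated here purely for illustration:

# multivariate sketch: 3 features instead of 1
X_multi, y_multi = make_regression(n_samples=100, n_features=3, noise=20, random_state=24)
multi_params = train(X_multi, y_multi)
print(multi_params["W"].shape)  # (3,) -- one weight per feature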
Assumptions
Before deciding that we want to use Linear Regression it is important to understand the assumptions that the model makes of our data so that we can perform the necessary feature engineering to align with our model.
For more on feature engineering, see my previous story on this topic…
Linear Relationship — Linear Regression only captures linear relationships, and therefore assumes that there is a linear relationship between the features and the target we want to predict. We can check the validity of this assumption by plotting scatter graphs of the features against the target variable.
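A minimal sketch of that check, reusing the dataset we generated earlier (trivial here since it only has one feature, but the same loop works for any number of columns):

# scatter plot of each feature against the target
for i in range(X.shape[1]):
    plt.scatter(X[:, i], y)
    plt.xlabel(f"feature {i}")
    plt.ylabel("target")
    plt.show()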
Multicollinearity — Refers to a situation in which two or more explanatory variables in a multiple regression model are highly related (Source: Wikipedia). When we implement Linear Regression we are assuming that there is little to no multicollinearity since when present it weakens the statistical power of the regression model. To identify whether multicollinearity is present in our data we can use a correlation matrix to see which features are highly correlated and remove one of the highly correlated features.
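A minimal sketch of that check; since our toy dataset only has one feature, the feature matrix below is generated purely to demonstrate:

# hypothetical feature matrix with 3 columns, just to demonstrate the check
X_demo, _ = make_regression(n_samples=100, n_features=3, noise=20, random_state=0)

corr = np.corrcoef(X_demo, rowvar=False)  # treat columns as variables
print(np.round(corr, 2))
# off-diagonal values close to +/-1 would suggest multicollinearity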
Homoscedasticity — In statistics, a sequence (or a vector) of random variables is homoscedastic if all its random variables have the same finite variance (Source: Wikipedia). Linear Regression assumes that the noise (also referred to as the error term or random disturbance) is the same across all observations and does not depend on the values of the independent variables. We can use a scatter plot of the residuals against the predicted values to check for homoscedasticity. We should expect no pattern in the distribution, because if there is a pattern the data is heteroscedastic.
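A minimal sketch of that residual plot, reusing the test-set predictions from earlier:

residuals = y_test - y_preds   # residuals on the test set
plt.scatter(y_preds, residuals)
plt.axhline(0, color="red")
plt.xlabel("predicted values")
plt.ylabel("residuals")
plt.show()
# a roughly constant spread around zero, with no fanning out, suggests homoscedasticity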
Multivariate Normality — We assume the variables we use for Linear Regression follow the Normal Distribution. We can check this using a histogram or a QQ-plot, and use a non-linear transformation (e.g. a log-transformation) to change the distribution of the data to fit our assumption.
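A short sketch of both checks on the residuals; note that it assumes SciPy is available, which the rest of this story does not otherwise use:

from scipy import stats  # assumed to be installed alongside scikit-learn

residuals = y_test - y_preds   # same residuals as in the homoscedasticity check

plt.hist(residuals, bins=15)   # histogram check
plt.show()

stats.probplot(residuals, dist="norm", plot=plt)  # QQ-plot check
plt.show()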
Autocorrelation — The correlation of a signal with a delayed copy of itself as a function of delay; informally, it is the similarity between observations as a function of the time lag between them (Source: Wikipedia). To put it simply, autocorrelation occurs when the residuals are not independent of each other. We may use a scatter plot to visually check for autocorrelation, or the Durbin-Watson test, which tests the null hypothesis that the residuals are not linearly autocorrelated.
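A minimal sketch of the Durbin-Watson check; this assumes statsmodels is installed, which is not otherwise used in this story:

from statsmodels.stats.stattools import durbin_watson  # assumed extra dependency

residuals = y_test - y_preds
print(durbin_watson(residuals))
# values near 2 suggest little autocorrelation; values toward 0 or 4 suggest
# positive or negative autocorrelation respectively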
Pros
- Easy to implement
- Easy to interpret the output coefficients
- Less complex
Cons
- Sensitive to Outliers
- Assumes linear relationship between independent and dependent variables
- Assumes independence between features
For more about Linear Regression you can read the Wikipedia page.
Wrap Up
Linear Regression is one of the simplest Machine Learning models and most likely the first you will (and should) learn. It is a very useful tool for analyzing the relationship between variables but, due to the model's many assumptions and its simplification of real-world problems, it is not recommended for most practical applications.
Thank you for taking the time to go through this story. If there is anything that I have missed or you’d like me to clarify, leave a response in the comments. Additionally, If you’d like to get in contact with me, I am most accessible on LinkedIn.