Assumptions in Linear Regression you might not know.


DS INTO THE REAL WORLD


The model should conform to these assumptions to produce the best Linear Regression fit to the data.

Sparsh Gupta

Jul 16 · 6 min read

Photo by Joseph Barrientos on Unsplash

All the images (plots) were generated and modified by the author.

Introduction

Linear Regression is a method of modelling the best linear relationship between one or more independent variables and a dependent variable. The simplest form of Linear Regression, with one independent and one dependent variable, can be defined by the following equation:

$$y = \beta_0 + \beta_1 x$$

where

x is the independent variable,

y is the dependent variable,

β₁ is the coefficient of x, i.e. the slope,

β₀ is the intercept (constant), i.e. the value of y at which the line crosses the y-axis (x = 0).
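As a minimal sketch of how the slope and intercept of the simple model y = β₀ + β₁x can be estimated, the snippet below uses the ordinary least squares closed form. The function name and the data are made up for illustration, and only the Python standard library is used.

```python
from statistics import mean

def fit_simple_linear_regression(x, y):
    """Return (beta0, beta1) minimising the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    # Intercept: the fitted line always passes through (x_bar, y_bar).
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Illustrative data lying exactly on y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
beta0, beta1 = fit_simple_linear_regression(x, y)
print(beta0, beta1)  # 1.0 2.0
```

With noisy data the same two formulas apply; the recovered values are then estimates rather than the exact coefficients.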

Linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).

— Wikipedia

Linear Regression Types

1. Simple Linear Regression: the simplest form of regression, involving one independent variable and one dependent variable as described above, where we fit a straight line to the data.

2. Multiple Linear Regression: the more general form of regression, involving multiple independent variables and one dependent variable. It can be expressed by the following equation:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

where

x₁ to xₙ are the independent variables,

y is the dependent variable,

β₁ to βₙ are the coefficients of the respective x features, and

β₀ is the intercept (constant), i.e. the value of y when all the x features are zero.
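For readers who prefer matrix notation, the multiple regression model is often written compactly as below. This is a sketch under stated assumptions rather than part of the original derivation: the m observations are stacked into a vector y, X is the m × (n + 1) matrix whose first column is all ones (for the intercept), and an error vector ε is added, which the equations above leave implicit. The least squares estimate on the right assumes that XᵀX is invertible.

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$$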

Assumptions in Linear Regression

1. Linear Relationship: it is assumed that the relationship between the independent variables and the dependent variable is linear. More precisely, the model must be linear in its coefficients, which are what we estimate when building the model and use when predicting.

The predictor variables are treated as fixed values, and each one may itself be any function of the raw inputs (polynomial, trigonometric, etc.). The response, however, must remain a strictly linear combination of the coefficients.

Polynomial regression relies on exactly this assumption. It uses linear regression to fit the response variable as a polynomial function of a predictor variable:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n$$

The model is non-linear in x, but it is still linear in the coefficients.
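To make the "linear in the coefficients" point concrete, here is a small sketch that fits a quadratic by treating 1, x and x² as three ordinary features and solving the least squares normal equations. The helper names and the data are hypothetical, and the tiny Gaussian elimination solver is only there to keep the example free of external libraries.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pivot on the largest remaining entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_polynomial(x, y, degree):
    """Least squares fit of y = b0 + b1*x + ... + b_degree*x**degree."""
    # Design matrix: each row is [1, x, x**2, ...]; the model is linear in b.
    design = [[xi ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    # Normal equations: (X^T X) b = X^T y.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Illustrative data generated from y = 1 - 2x + 0.5x^2 without noise.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [1.0 - 2.0 * xi + 0.5 * xi ** 2 for xi in x]
coefficients = fit_polynomial(x, y, degree=2)
print([round(c, 6) for c in coefficients])
```

The fitting step never needs to know that the second and third features are powers of x, which is why polynomial regression counts as a linear model.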

2. Homoscedasticity (Constant Variance): it is assumed that the residual terms (that is, the “noise” or random disturbance in the relationship between the features and the target) have constant variance, i.e. the spread of the error term is the same regardless of the values of the predictor variables.

There should be no clear pattern in the distribution of the residuals; if there is a specific pattern, the data is heteroscedastic. The leftmost graph shows no definite pattern among the error terms, i.e. their spread is constant. The middle graph shows a pattern where the error decreases and then increases with the estimated values, violating the constant variance rule. The rightmost graph also reveals a specific pattern, where the error terms decrease as the predicted values grow, which again indicates heteroscedasticity. More formally, two or more normal distributions are homoscedastic if they share a common covariance matrix.
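One rough numerical companion to the residual plot is to compare the residual variance at low and high fitted values. The sketch below is a crude heuristic of my own framing, loosely in the spirit of split-sample variance comparisons, and not a formal statistical test; the function name and the data are hypothetical. A ratio near 1 is consistent with constant variance, while a ratio far from 1 suggests heteroscedasticity.

```python
from statistics import pvariance

def residual_variance_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values over the lower half."""
    # Sort the residuals by their fitted value.
    pairs = sorted(zip(fitted, residuals), key=lambda pair: pair[0])
    ordered = [residual for _, residual in pairs]
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return pvariance(upper) / pvariance(lower)

fitted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
steady = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # same spread everywhere
fanning = [0.5, -0.5, 0.5, -0.5, 2.0, -2.0, 2.0, -2.0]  # spread grows with fitted value
print(residual_variance_ratio(fitted, steady))   # 1.0
print(residual_variance_ratio(fitted, fanning))  # 16.0
```

Note that a two-way split would miss the "decreases then increases" pattern described above, so it complements the plot rather than replacing it.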

3. Multivariate Normality: it is assumed that the error terms are normally distributed with a mean of zero. (When the model includes an intercept, the least squares residuals also sum to zero by construction.) A less widely known fact is that, as the sample size grows large, the normality assumption for the residuals matters less and less, because the sampling distribution of the coefficient estimates becomes approximately normal anyway.

The Q-Q plot above shows that the errors or residuals are normally distributed. The error term can be seen as the composite of many minor, independent errors. As the number of these minor errors increases, the distribution of the error term tends towards the normal distribution. This tendency is described by the Central Limit Theorem. The t-test and F-test are exact only if the error term is normally distributed; in large samples they remain approximately valid.
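A Q-Q plot pairs the sorted residuals with the quantiles a normal distribution would produce, and the points fall on a straight line when the residuals are normal. The sketch below computes those pairs and summarises their straightness with a correlation coefficient. The plotting positions (i + 0.5) / n are one common convention that I am assuming here, and the function names and data are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist, mean

def qq_points(residuals):
    """Return (theoretical normal quantiles, sorted residuals)."""
    n = len(residuals)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, sorted(residuals)

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)

def qq_correlation(residuals):
    """Close to 1 when the Q-Q points lie on a straight line."""
    theoretical, observed = qq_points(residuals)
    return pearson(theoretical, observed)

# Deterministic illustration: idealised normal residuals versus a skewed set.
n = 50
normal_like = [NormalDist(0.0, 2.0).inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [exp(value) for value in normal_like]
r_normal = qq_correlation(normal_like)
r_skewed = qq_correlation(skewed)
print(round(r_normal, 3), round(r_skewed, 3))
```

The correlation is only a summary; looking at the plotted points still shows where the departure from normality happens (tails, skew, outliers).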

4. No Multicollinearity: multicollinearity is the degree of inter-correlation among the independent variables used in the model. It is assumed that the independent feature variables are uncorrelated, or only very weakly correlated, with each other. As a practical rule of thumb, the correlation between two independent features should not exceed 0.3 (30%), since stronger correlation weakens the statistical power of the model. To identify highly correlated features, pair plots (scatter plots) and heatmaps of the correlation matrix can be used.

Highly correlated features should not be used together in the model. A regression coefficient is interpreted as the expected change in the outcome for a one-unit change in its feature while all the other features are held constant. When two features tend to change in unison, a change in one comes with a change in the other, so the latter cannot be held constant as the model requires, and the expected interpretation of the regression coefficients no longer holds.
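The pairwise screening described above can be sketched without any plotting: compute the Pearson correlation for every pair of features and flag the pairs above a threshold. The 0.3 default mirrors the rule of thumb quoted in this section; the feature names, the data, and the function names are invented for illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)

def correlated_pairs(features, threshold=0.3):
    """Return (name_a, name_b, r) for feature pairs with |r| above threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical housing features: area and rooms move together, age does not.
features = {
    "area": [50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
    "rooms": [2.0, 2.0, 3.0, 3.0, 4.0, 4.0],
    "age": [10.0, 30.0, 20.0, 20.0, 30.0, 10.0],
}
flagged = correlated_pairs(features)
for name_a, name_b, r in flagged:
    print(name_a, name_b, round(r, 3))
```

Pairwise correlation cannot detect a feature that is a combination of several others, so it is a first screen rather than a complete check.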

5. No Auto-correlation: it is assumed that there is no auto-correlation among the residuals. Auto-correlation occurs when the residual errors depend on one another; the residuals should not be positively or negatively correlated with each other and should be spread evenly. It usually occurs in time series models, where the next instant depends on the previous instant. The presence of correlation in the residual terms also reduces the model’s predictive accuracy.

Autocorrelation can be tested with the help of the Durbin-Watson test. The Durbin-Watson test statistic is defined as:

$$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}$$

where $e_t$ is the residual at time t and T is the number of observations.

The Durbin-Watson test statistic always has a value between 0 and 4. A value of 2.0 indicates that no autocorrelation is detected in the sample. Values between 0 and 2 indicate positive autocorrelation, and values between 2 and 4 indicate negative autocorrelation.
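The statistic is simple to compute by hand. The sketch below assumes the usual definition, the sum of squared differences between consecutive residuals divided by the sum of squared residuals, with the residuals given in time order. The function name and the two residual series are made up for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    # Numerator: squared changes between consecutive residuals.
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    # Denominator: total squared residuals.
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Residuals that flip sign every step: negative autocorrelation (value above 2).
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
# Residuals that stay on one side for a while: positive autocorrelation (value below 2).
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
print(durbin_watson(alternating))
print(durbin_watson(persistent))
```

How far from 2 the value must be before it counts as significant depends on the sample size and the number of predictors, which is what published critical-value tables are for.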

6. No Extrapolation: extrapolation is estimation beyond the original observation range. It is assumed that the trained model will only be asked to predict the dependent variable for independent feature values that lie within the range of the training data. The model therefore cannot guarantee the quality of predictions for feature values beyond that range.
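A simple guard against accidental extrapolation is to check each new input against the range seen during training before predicting. The sketch below does this per feature; the function name and the data are hypothetical. It is a necessary check rather than a sufficient one, since a point can sit inside every individual range and still be far from any combination of values that appeared in the training data.

```python
def is_within_training_range(training_rows, new_row):
    """True if every feature of new_row lies within the min/max seen in training."""
    # Transpose the rows so that each tuple holds one feature's training values.
    columns = list(zip(*training_rows))
    return all(min(col) <= value <= max(col) for col, value in zip(columns, new_row))

# Hypothetical training data with two features per row.
training_rows = [(1.0, 10.0), (2.0, 30.0), (4.0, 20.0)]
print(is_within_training_range(training_rows, (3.0, 25.0)))  # True
print(is_within_training_range(training_rows, (5.0, 25.0)))  # False
```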

Conclusion

We have explained the most important assumptions to check before fitting a Linear Regression model to a given set of data. These assumptions are a formal way of ensuring that the predictive ability of the linear regression model is good enough to give the best possible results for a given data set. If they are not satisfied, a Linear Regression model can still be built, but satisfying them provides good confidence in the model's predictions.
