Assumptions in Linear Regression you might not know.


DS INTO THE REAL WORLD


The model should conform to these assumptions to produce the best Linear Regression fit to the data.

Jul 16 · 6 min read


Photo by Joseph Barrientos on Unsplash

— All the images (plots) were generated or modified by the Author.

Introduction

Linear Regression is a method of modelling the best linear relationship between independent variables and a dependent variable. The simplest form of Linear Regression can be defined by the following equation with one independent and one dependent variable:

y = β₀ + β₁x
Simple Linear Regression

x is the independent variable,

y is the dependent variable,

β₁ is the coefficient of x, i.e. the slope, and

β₀ is the intercept (constant), which gives the distance of the line from the origin along the y-axis.
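For instance, such a fit takes only a few lines; here is a minimal sketch using scikit-learn on synthetic data (the numbers and variable names are illustrative, not from the article):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100).reshape(-1, 1)  # single independent variable
y = 2 * x.ravel() + 1 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(x, y)
print("slope (beta_1):", model.coef_[0])        # estimated coefficient of x
print("intercept (beta_0):", model.intercept_)  # estimated constant term
```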

Linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).

— Wikipedia

Linear Regression Types

1. Simple Linear Regression — The simplest form of regression, involving one independent variable and one dependent variable, as explained above, where we fit a line to the data.

2. Multiple Linear Regression — A more complex form of regression, involving multiple independent variables and one dependent variable, which can be described by the following equation:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
Multiple Linear Regression

x₁ to xₙ are the independent variables,

y is the dependent variable,

β₁ to βₙ are the coefficients of the respective x features, and

β₀ is the intercept (constant), which gives the distance of the line from the origin along the y-axis.
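A multiple regression can be fitted in the same way; the following sketch uses statsmodels OLS on invented data (the coefficient values are assumptions chosen for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data with two predictors: y = 1 + 3*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, size=200)

X_const = sm.add_constant(X)        # adds the column for the intercept beta_0
results = sm.OLS(y, X_const).fit()  # ordinary least squares
print(results.params)               # [beta_0, beta_1, beta_2]
```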

Assumptions in Linear Regression


Photo by Tom Roberts on Unsplash

1. Linear Relationship — It is assumed that the relationship between the independent variables and the dependent variable is linear, i.e. the model must be linear in its coefficients, which are what we estimate during model building and use for prediction.


Image by Author

The predictor variables are treated as fixed values and can enter through any complex function, such as a polynomial or trigonometric transformation, but the model must remain strictly linear in the coefficients of the predictor variables.

Polynomial Regression

This assumption is what enables Polynomial Regression, which uses linear regression to fit the response variable as an arbitrary polynomial function of a predictor variable; the relationship remains linear in the coefficients.
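As a sketch of how that looks in code (using scikit-learn's PolynomialFeatures on synthetic data, an assumed choice rather than the article's own), the predictor is expanded into polynomial terms while the fit itself stays linear in the coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Quadratic ground truth: y = 0.5*x^2 - x + 2 plus noise
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=150).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + 2 + rng.normal(0, 0.5, size=150)

# Expand x into [x, x^2]; the model remains linear in its coefficients
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)  # approximately 2, [-1, 0.5]
```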

2. Homoscedasticity (Constant Variance) — It is assumed that the residual terms (that is, the “noise” or random disturbance in the relationship between the features and the target) have constant variance, i.e. the spread of the error term is the same across different values of the independent features, regardless of the values of the predictor variables.


Image by Author — Modified

There should be no clear pattern in the distribution of the residuals; if there is a specific pattern, the data is heteroscedastic. The leftmost graph shows no definite pattern among the error terms, i.e. the variance is constant. The middle graph shows a pattern where the error first decreases and then increases with the estimated values, violating the constant-variance rule, and the rightmost graph reveals a pattern where the error terms decrease with the predicted values, again representing heteroscedasticity. Two or more normal distributions are homoscedastic if they share a common covariance (or correlation) matrix.
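One common numeric check for this assumption is the Breusch-Pagan test, a standard choice though not one the article prescribes; here is a minimal sketch with deliberately heteroscedastic data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data whose noise grows with x, i.e. heteroscedastic on purpose
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=300)
y = 2 * x + rng.normal(0, x, size=300)  # noise std proportional to x

X = sm.add_constant(x)
residuals = sm.OLS(y, X).fit().resid

# Breusch-Pagan: null hypothesis = constant variance (homoscedasticity)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, X)
print("Breusch-Pagan p-value:", lm_pvalue)  # small p-value flags heteroscedasticity
```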

3. Multivariate Normality — It is assumed that the error terms are normally distributed, i.e. the mean of the error terms is zero (and hence their sum is zero as well). A less widely known fact is that, as the sample size grows large, the normality assumption for the residuals is no longer needed.

Q-Q plot of the residuals — Image by Author

The above Q-Q plot shows that the errors or residuals are normally distributed. The error term can be seen as the composite of many minor errors; as the number of these minor errors increases, the distribution of the error term tends towards the normal distribution. This tendency is explained by the Central Limit Theorem. Note that the t-test and F-test are only applicable if the error term is normally distributed.
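A quick code-level check of residual normality, sketched here with SciPy's Shapiro-Wilk test on simulated residuals (a Q-Q plot like the one above can be drawn with scipy.stats.probplot):

```python
import numpy as np
from scipy import stats

# Residuals from any fitted model; here, simply simulated normal errors
rng = np.random.default_rng(3)
residuals = rng.normal(0, 1, size=200)

# Shapiro-Wilk: null hypothesis = the residuals are normally distributed
stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)  # large p-value -> no evidence against normality

# The Q-Q plot itself:
# import matplotlib.pyplot as plt
# stats.probplot(residuals, dist="norm", plot=plt); plt.show()
```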

4. No Multicollinearity — Multicollinearity is the degree of inter-correlation among the independent variables used in the model. It is assumed that the independent feature variables are uncorrelated, or only very weakly correlated, with each other, which is what makes them independent. As a practical rule of thumb, the correlation between two independent features should not exceed 30%, as stronger correlation weakens the statistical power of the model. Pair plots (scatter plots) and heatmaps (correlation matrices) can be used to identify highly correlated features.


Correlation Heatmap — Image by Author

Highly correlated features should not be used together in the model, because they tend to change in unison. The usual interpretation of a regression coefficient is the expected change in the outcome when that feature changes while all other features are held constant; when two features are strongly correlated, one cannot change while the other stays constant, so this interpretation no longer holds and the weighted coefficients the model learns become unreliable.
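Beyond heatmaps, the Variance Inflation Factor (VIF) is a standard diagnostic for multicollinearity; here is a sketch with statsmodels, where the feature x3 is deliberately constructed as a near-copy of x1:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Synthetic features where x3 is almost a copy of x1 -> multicollinearity
rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(0, 0.05, size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X.corr())  # pairwise correlations, the numbers behind a heatmap

# VIF per feature: values well above ~5-10 signal multicollinearity
X_const = add_constant(X)
for i in range(1, X_const.shape[1]):  # skip the constant column
    print(X_const.columns[i], variance_inflation_factor(X_const.values, i))
```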

5. No Auto-correlation — It is assumed that there is no autocorrelation among the residuals, i.e. the residual errors are not correlated with each other, positively or negatively, and are well spread throughout. Autocorrelation mainly occurs in time-series data, where each instant depends on the previous instant. The presence of correlation in the residual terms also reduces the model’s predictive power.

Autocorrelation can be tested with the help of the Durbin-Watson test. The Durbin-Watson test statistic is defined as:

d = Σₜ₌₂ᵀ (eₜ − eₜ₋₁)² / Σₜ₌₁ᵀ eₜ²
Durbin-Watson Equation, where eₜ is the residual of the t-th observation and T is the total number of observations

The Durbin-Watson test statistic always has a value between 0 and 4. A value of exactly 2.0 indicates that no autocorrelation is detected in the sample. Values between 0 and 2 indicate positive autocorrelation, and values between 2 and 4 indicate negative autocorrelation.
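The statistic is available directly in statsmodels; a small sketch on simulated residuals with deliberate positive autocorrelation:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Simulated residuals with positive autocorrelation (an AR(1) process)
rng = np.random.default_rng(5)
e = np.zeros(300)
for t in range(1, 300):
    e[t] = 0.8 * e[t - 1] + rng.normal()

print("Durbin-Watson statistic:", durbin_watson(e))  # well below 2 here
```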

6. No Extrapolation — Extrapolation is estimation beyond the original observation range. It is assumed that the trained model can predict values of the dependent variable only for independent feature values that lie within the range of the training data. Therefore, the model cannot guarantee the quality of predictions for values beyond the range of the trained independent features.


Image by Author — Modified
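A pragmatic safeguard, sketched below on assumed illustrative numbers rather than anything from the article, is to flag prediction requests that fall outside the feature range seen during training:

```python
import numpy as np

# A simple guard: flag predictions requested outside the training range
x_train_min, x_train_max = 0.0, 10.0   # range seen during training (illustrative)
x_new = np.array([2.5, 7.1, 14.3])

outside = (x_new < x_train_min) | (x_new > x_train_max)
for value, flag in zip(x_new, outside):
    status = "extrapolation - prediction unreliable" if flag else "within training range"
    print(f"x = {value}: {status}")
```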

Conclusion:

We have explained the most important assumptions that should be checked before fitting a Linear Regression model to a given set of data. These assumptions are a formal measure to ensure that the predictions of the built linear regression model are good enough to give us the best possible results for a given data set. A Linear Regression model can still be built even if these assumptions are not satisfied, but satisfying them gives us good confidence in the model’s predictive power.

