DS INTO THE REAL WORLD
Assumptions in Linear Regression you might not know.
The model should conform to these assumptions to produce the best Linear Regression fit to the data.
Jul 16 · 6 min read
— All the images (plots) are generated and modified by Author.
Introduction
Linear Regression is a method of modelling the best linear relationship between the independent variables and the dependent variable. The simplest form of Linear Regression, with one independent and one dependent variable, is defined by the following equation:
$$y = \beta_0 + \beta_1 x$$
where
x is the independent variable,
y is the dependent variable,
β₁ is the coefficient of x, i.e. the slope,
β₀ is the intercept (constant), i.e. the value of y at which the line crosses the y-axis (where x = 0).
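As a minimal sketch of how the slope and intercept can be estimated by ordinary least squares, the snippet below fits a line to a handful of made-up points. The function name `fit_simple_ols` and the data are illustrative assumptions, not part of the original article.

```python
def fit_simple_ols(x, y):
    """Return (beta_0, beta_1) minimising the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    beta_1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1


# Made-up data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

beta_0, beta_1 = fit_simple_ols(x, y)
print(beta_0, beta_1)  # 1.0 2.0
```

Because the points lie exactly on a line, the recovered intercept and slope match the values used to generate them; with noisy data the estimates would only approximate the underlying line.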
Linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
Linear Regression Types
1. Simple Linear Regression: The simplest form of regression, involving one independent variable and one dependent variable, as explained above, where we fit a straight line to the data.
2. Multiple Linear Regression: A more general form of regression, involving multiple independent variables and one dependent variable, which can be expressed by the following equation:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
where
x₁ to xₙ are the independent variables,
y is the dependent variable,
β₁ to βₙ are the coefficients of the respective x features, and
β₀ is the intercept (constant), i.e. the value of y when all the independent variables are zero.
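The multiple-regression form can be sketched with NumPy's least-squares solver. The data below is made up and generated without noise from known coefficients, so the fit should recover them; the variable names are illustrative assumptions.

```python
import numpy as np

# Made-up data generated exactly from y = 1 + 2*x1 - 3*x2 (no noise).
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept beta_0.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta ~ y.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta)  # approximately [1, 2, -3]
```

The column of ones is a design choice: it lets the intercept be estimated together with the other coefficients in a single solve.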
Assumptions in Linear Regression
1. Linear Relationship: The relationship between the independent variables and the dependent variable is assumed to be linear in the coefficients, i.e. the dependent variable is modelled as a weighted sum of the predictor terms, with the weights (coefficients) estimated when the model is fitted.
The predictor variables are treated as fixed values and may themselves be any function of the raw inputs, such as polynomial or trigonometric terms. Each coefficient, however, must enter the model linearly, as a multiplier of its predictor term.
This is what makes Polynomial regression possible: it uses linear regression to fit the response variable as an arbitrary polynomial function of a predictor variable, because the model remains linear in the coefficients.
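To illustrate why a polynomial fit is still a linear regression, the sketch below builds a design matrix whose columns are powers of x and solves for the coefficients by ordinary least squares. The data is made up and noise-free; the names are illustrative assumptions.

```python
import numpy as np

# Made-up data from y = 2 - x + 0.5*x^2 (no noise).
x = np.linspace(-2.0, 2.0, 9)
y = 2.0 - 1.0 * x + 0.5 * x**2

# Columns are 1, x, x^2. They are nonlinear in x, but each column is
# multiplied by a single coefficient, so least squares still applies.
X_design = np.vander(x, N=3, increasing=True)

coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(coeffs)  # approximately [2, -1, 0.5]
```

The same pattern works for trigonometric or other transformed predictors: transform the inputs first, then fit a model that is linear in the coefficients.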
2. Homoscedasticity (Constant Variance): It is assumed that the residual terms (that is, the “noise” or random disturbance in the relationship between the features and the target) have constant variance, i.e. the spread of the error term is the same regardless of the values of the predictor variables.
A plot of the residuals against the predicted values should show no clear pattern; if there is a specific pattern, the data is heteroscedastic. The leftmost graph shows no definite pattern among the error terms, i.e. the variance is constant, whereas the middle graph shows a pattern where the error decreases and then increases with the estimated values, violating the constant variance rule. The rightmost graph also reveals a specific pattern, where the error terms decrease with the predicted values, representing heteroscedasticity. More generally, two or more normal distributions are homoscedastic if they share a common covariance (or correlation) matrix.
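Besides inspecting a residual plot, a rough numeric check is to compare the residual variance for low and high fitted values. The sketch below is an informal check in the spirit of the Goldfeld-Quandt test, not a formal test; the helper `variance_ratio` and the simulated residuals are assumptions made for illustration.

```python
import numpy as np


def variance_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values divided by
    the residual variance in the lower half."""
    order = np.argsort(fitted)
    sorted_res = np.asarray(residuals)[order]
    half = len(sorted_res) // 2
    low, high = sorted_res[:half], sorted_res[-half:]
    return np.var(high, ddof=1) / np.var(low, ddof=1)


rng = np.random.default_rng(0)
fitted = np.linspace(1.0, 10.0, 200)

# Simulated residuals: one set with constant spread, and one whose
# spread grows in proportion to the fitted value.
res_constant = rng.normal(0.0, 1.0, size=200)
res_growing = rng.normal(0.0, 1.0, size=200) * fitted

ratio_constant = variance_ratio(fitted, res_constant)
ratio_growing = variance_ratio(fitted, res_growing)
print(ratio_constant, ratio_growing)
```

A ratio near 1 is consistent with constant variance, while a ratio far from 1 suggests heteroscedasticity. This split only detects spread that rises or falls steadily, so a pattern like the decrease-then-increase case would still need a plot.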
3. Multivariate Normality: It is assumed that the error terms are normally distributed with a mean of zero. (When the model includes an intercept, the least-squares residuals also sum to zero by construction.) A less widely known fact is that, as the sample size grows, the normality assumption for the residuals matters less, because under mild conditions the coefficient estimates are approximately normally distributed anyway.
The above q-q plot shows that the errors or residuals are normally distributed. The error term can be seen as the composite of many minor errors. As the number of these minor errors increases, the distribution of the error term tends to approach the normal distribution, a tendency described by the Central Limit Theorem. The t-test and F-test are exact only if the error term is normally distributed; in large samples they remain approximately valid.
4. No Multicollinearity: Multicollinearity is the degree of inter-correlation among the independent variables used in the model. It is assumed that the independent feature variables are uncorrelated, or only weakly correlated, with each other. As a practical rule of thumb, the correlation between two independent features should not exceed about 30%, because stronger correlation inflates the variance of the coefficient estimates and weakens the statistical power of the model. To identify highly correlated features, pair plots (scatter plots) and heatmaps (correlation matrices) can be used.
Highly correlated features should not be used together in the model, because such features tend to change in unison. A regression coefficient is interpreted as the expected change in the outcome when its feature changes while all other features are held constant. When two features are highly correlated, a change in one is accompanied by a change in the other, so the other cannot be held constant and this interpretation of the coefficients breaks down.
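The correlation-matrix check can be sketched as follows: compute the pairwise correlations and list the feature pairs whose absolute correlation exceeds a threshold. The default of 0.3 mirrors the 30% rule of thumb mentioned above; the helper name `correlated_pairs`, the feature names, and the simulated data are assumptions for illustration.

```python
import numpy as np


def correlated_pairs(X, names, threshold=0.3):
    """Return (name_a, name_b, r) for feature pairs with |r| above the threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns of X are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)               # generated independently of x1
x3 = x1 + 0.1 * rng.normal(size=300)    # nearly a copy of x1
X = np.column_stack([x1, x2, x3])

flagged = correlated_pairs(X, ["x1", "x2", "x3"])
print(flagged)
```

Only the pair (x1, x3) should be flagged here, since x3 was constructed from x1. In practice, one feature from each flagged pair is usually dropped or the pair is combined into a single feature.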
5. No Auto-correlation: It is assumed that there is no auto-correlation among the residuals. Auto-correlation occurs when there is a dependency between residual errors; the residuals should not be correlated with each other, positively or negatively, and should be spread randomly. It usually arises in time series models, where the next instant depends on the previous instant. Correlation in the residual terms also reduces the reliability of the model's predictions.
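Autocorrelation can be tested with the help of the Durbin-Watson test. The Durbin-Watson test statistic is defined as:
$$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}$$
where e_t is the residual at time t and T is the number of observations.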
The Durbin-Watson test statistic always has a value between 0 and 4. A value of 2.0 indicates that no autocorrelation is detected in the sample. Values between 0 and 2 indicate positive autocorrelation, and values between 2 and 4 indicate negative autocorrelation.
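The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch, using two made-up residual sequences chosen to show both ends of the scale (the function name `durbin_watson` is an illustrative assumption):

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Residuals that flip sign at every step: negative autocorrelation.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
# Residuals that stay on one side for a while: positive autocorrelation.
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

d_alternating = durbin_watson(alternating)  # 20 / 6, about 3.33
d_trending = durbin_watson(trending)        # 4 / 6, about 0.67
print(d_alternating, d_trending)
```

The alternating sequence lands between 2 and 4 and the trending sequence between 0 and 2, matching the interpretation given above.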
6. No Extrapolation: Extrapolation is estimation beyond the original observation range. It is assumed that the trained model predicts the dependent variable only for independent feature values that lie within the range of the training data. The model therefore cannot guarantee the reliability of predictions for feature values beyond the range it was trained on.
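One simple safeguard is to flag new inputs that fall outside the range seen during training before predicting on them. The sketch below does this for a single feature; the function name `out_of_range` and the values are illustrative assumptions.

```python
def out_of_range(x_train, x_new):
    """Return the values in x_new that fall outside [min(x_train), max(x_train)]."""
    low, high = min(x_train), max(x_train)
    return [x for x in x_new if x < low or x > high]


# Made-up training values and new values to predict on.
x_train = [2.0, 3.5, 5.0, 7.5, 9.0]
x_new = [1.0, 4.0, 9.0, 12.0]

flagged = out_of_range(x_train, x_new)
print(flagged)  # [1.0, 12.0]
```

With several features, a per-feature range check like this is only a first approximation, since a combination of values can be unusual even when each value is individually within range.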
Conclusion:
We have explained the most important assumptions to check before applying a Linear Regression model to a given data set. These assumptions are a formal way of ensuring that the predictive quality of the fitted linear regression model is good enough to give the best possible results for that data set. If the assumptions are not satisfied, a Linear Regression model can still be built, but satisfying them gives good confidence in the model's predictions.