When I say “linear regression”, most people start thinking about the good old Ordinary Least Squares (OLS) regression. If you are not familiar with the term, these equations might help…
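For a data point with features x_1, x_2, …, x_p, a linear regression model predicts

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

where β_0 is the bias (intercept) and β_1, …, β_p are the weights.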
Did you also think about OLS? If yes then you are on the right track. But there’s more to linear regression than just OLS! First, let us look at OLS a bit more closely.
OLS
The name of this technique comes from its cost function. Here, we take the sum of squared errors (the differences between the ground truths and the predictions) and try to minimize it. By minimizing the cost function we obtain the optimal value of the vector β (which contains the bias and the weights). In the plot below, the contours (concentric ellipses) of the cost function are shown. After the minimization, we get β as the point at the center.
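Concretely, with n training points the OLS cost function is

$$J(\beta) = \sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2$$

and setting its gradient to zero gives the familiar closed-form solution $\hat{\beta} = (X^\top X)^{-1} X^\top y$, where X is the design matrix (with a leading column of ones for the bias).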
At first, it seems like OLS is enough for any regression problem. But as we increase the number of features and the complexity of the data, OLS tends to overfit the training data. The concept of overfitting is vast and deserves a separate article (you can find plenty of them), so I'm only going to give you a brief idea. Overfitting means the model has learned the training data so well that it fails to generalize. In other words, the model has learned even the small-scale (insignificant) variations in the training data, so it fails to produce good predictions on unseen (validation and test) data. To tackle the problem of overfitting we can use many techniques. Adding a regularization (penalty) term to our cost function is one such technique. But what term should we use? We generally use one of the following two methods.
Ridge
In this case, we add the sum of the squares of the weights to our least-squares cost function. So now it looks something like this…
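$$J(\beta) = \sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

Here λ ≥ 0 is the regularization strength: the larger it is, the more heavily large weights are penalized.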
But how does this term prevent overfitting? Adding it is equivalent to adding an extra constraint on the possible values of β: to achieve the minimum cost, the sum of the β_j²'s must not exceed a certain value (say r). This prevents the model from assigning very large weights to some features over others, thus tackling overfitting. Mathematically,
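$$\sum_{j=1}^{p} \beta_j^2 \le r$$

(each value of λ corresponds to some value of r).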
In other words, β should lie inside (or on) the circle with radius √r centered at the origin. Here's the visualization…
Notice that because of the constraint (red circle), the final value of β is closer to the origin than it was in the OLS.
Lasso
The only difference between Ridge and Lasso is the regularization term. Here, we add the sum of the absolute values of the weights to our least-squares cost function. So the cost function becomes…
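$$J(\beta) = \sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$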
In this case, the constraint can be written as…
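$$\sum_{j=1}^{p} \lvert \beta_j \rvert \le r$$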
Now we can visualize the constraint as a square instead of a circle.
It is worth noting that if the contours hit a corner of the square, one feature is completely neglected (its weight becomes exactly 0). For higher-dimensional feature spaces, we can use this trick to reduce the number of features.
Note: We do not include the bias (β_0) in the regularization term, because only very large weights (the β_j's for j > 0) on the features contribute to overfitting. The bias is just an intercept, so it has little to do with overfitting.
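If you want to see the shrinkage and the Lasso sparsity in action, here is a minimal sketch using scikit-learn; the toy data and the alpha values are arbitrary choices for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Toy data: a few informative features plus some pure-noise features
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
true_beta = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.5, -1.0, 0.0])
y = X @ true_beta + 0.5 * rng.randn(100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # penalty: sum of squared weights
lasso = Lasso(alpha=0.1).fit(X, y)    # penalty: sum of absolute weights

print("OLS   :", np.round(ols.coef_, 2))
print("Ridge :", np.round(ridge.coef_, 2))   # weights shrunk toward 0
print("Lasso :", np.round(lasso.coef_, 2))   # some weights exactly 0
```

Just like in the note above, scikit-learn leaves the intercept out of the penalty.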
Phew… that was a lot about regularization. The common thing among the above methods is that they all measure residuals/errors (ground truth minus prediction) along the y-axis. We could also consider errors along the x-axis and proceed similarly. See the plot below.
What if we use a different kind of error?
Major axis (Orthogonal) regression
In this case, we consider errors in both directions (along the x-axis and the y-axis): we minimize the sum of squared perpendicular distances between the observed data points and the fitted line. Let's visualize this by taking only one feature.
Let our model be
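$$y = \beta_0 + \beta_1 x$$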
Then the regression coefficients can be obtained by minimizing
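$$\sum_{i=1}^{n} \Big[ (X_i - x_i)^2 + (Y_i - y_i)^2 \Big]$$

where (X_i, Y_i) are the observed points and (x_i, y_i) are the corresponding points on the fitted line.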
under the constraints
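$$y_i = \beta_0 + \beta_1 x_i, \qquad i = 1, \dots, n$$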
Reduced Major axis regression
This is very similar to the above method, with a slight change. Here, we minimize the sum of the areas of the rectangles formed by the observed points (X_i, Y_i) and the fitted points (x_i, y_i).
The total area extended by n data points is,
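$$\sum_{i=1}^{n} \lvert X_i - x_i \rvert \, \lvert Y_i - y_i \rvert$$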
The constraints here are the same as orthogonal regression.
When should you use orthogonal regression?
One should go for orthogonal and reduced major axis regression when uncertainties are present in both the study (y) and the explanatory (x) variables.
One interesting thing about orthogonal regression is that it produces a fit that is symmetric with respect to y-errors and x-errors. In OLS we don't get this symmetry, because we minimize either the y-errors or the x-errors, not both.
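Here is a small numpy sketch of the single-feature case, using the usual closed-form slope expressions, that compares OLS (y on x, and x on y) with the major axis and reduced major axis fits; the data and noise levels are made up for illustration.

```python
import numpy as np

def slopes(x, y):
    """Slopes of the fitted line for OLS (both directions), major axis, and RMA."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxx, syy = np.var(x), np.var(y)
    sxy = np.cov(x, y, bias=True)[0, 1]

    ols_y_on_x = sxy / sxx                       # minimizes y-errors only
    ols_x_on_y = syy / sxy                       # minimizes x-errors only (written as dy/dx)
    major_axis = (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    rma = np.sign(sxy) * np.sqrt(syy / sxx)      # reduced major axis

    return ols_y_on_x, ols_x_on_y, major_axis, rma

# Noise in both x and y, true slope = 2
rng = np.random.RandomState(1)
t = np.linspace(0, 10, 200)
x = t + 0.8 * rng.randn(200)
y = 2 * t + 1 + 0.8 * rng.randn(200)

print(slopes(x, y))   # the two OLS slopes differ; major axis and RMA sit in between
print(slopes(y, x))   # swap x and y: major axis and RMA slopes are just the reciprocals (same line)
```

Swapping x and y leaves the orthogonal and reduced major axis lines unchanged, while the two OLS fits give genuinely different lines.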
Still curious? Watch a video that I made recently…
I hope you enjoyed reading this. Until next time… happy learning!