Terence teaches in University of San Francisco's MS in Data Science program and you might know him as the creator of the ANTLR parser generator.
Linear and logistic regression models are important because they are interpretable, fast, and form the basis of deep learning neural networks. They are also extremely simple; we're just fitting lines (or hyperplanes) through training data. Unfortunately, linear models have a tendency to chase outliers in the training data, which often leads to models that don't generalize well to new data. To produce models that generalize better, we all know to regularize our models. There are many forms of regularization, such as early stopping and dropout for deep learning, but for isolated linear models, Lasso (L1) and Ridge (L2) regularization are most common. The mathematics behind fitting linear models and regularization are well described elsewhere, such as in the excellent book The Elements of Statistical Learning (ESL) by Hastie, Tibshirani, and Friedman. The world certainly doesn't need yet another article on the mechanics of regularized linear models, so I'm going to assume that you're familiar with the basics.
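If you'd like to see the difference concretely before diving in, here is a quick, hedged illustration (mine, not from the article) that fits plain, Ridge (L2), and Lasso (L1) regressions with scikit-learn on synthetic data containing one outlier. The data sizes, noise level, outlier size, and alpha values are all arbitrary assumptions chosen only for demonstration.

```python
# A minimal sketch: comparing plain, Ridge (L2), and Lasso (L1) linear regression
# on synthetic data with one outlier. All sizes and alphas are arbitrary choices.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter; the remaining three are noise features.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)
y[0] += 20  # an outlier that the unregularized fit will chase

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.5)):
    model.fit(X, y)
    print(f"{type(model).__name__:>16}: {np.round(model.coef_, 2)}")
# Ridge shrinks every coefficient toward zero; Lasso pushes the least
# informative coefficients toward (and often exactly to) zero.
```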
What's lacking is a simple and intuitive explanation for what exactly is going on during regularization. The goal of this article is to explain how regularization behaves visually, dispelling some myths and answering important questions along the way. This article has exploded way beyond my initial intentions, so let me start out by summarizing the key elements; that way you can quickly get back to watching YouTube or playing Animal Crossing.
TL;DR The key nuggets
- Personally, my biggest initial stumbling block was this: The math used to implement regularization does not correspond to the pictures commonly used to explain regularization. Take a look at the oft-copied picture (shown below left) from page 71 of ESL in the section on “Shrinkage Methods.” Students see this multiple times in their careers but have trouble mapping it to the relatively straightforward mathematics used to regularize linear model training. The simple reason is that that illustration shows how we regularize models conceptually, with hard constraints, not how we actually implement regularization, with soft constraints! The math actually corresponds to the picture on the right, with the loss function in blue-red (blue = lower, red = higher loss) and the regularization penalty term in orange, emanating from the origin (0,0).
Hard constraint illustration from ESL page 71. Soft constraint with non-regularized loss function (blue-red) term and penalty term (orange).
Regularization conceptually uses a hard constraint to prevent coefficients from getting too large (the cyan circles from the ESL picture). For implementation purposes, however, we convert that “subject to” hard constraint to a soft constraint by adding the constraint as a term to the loss function. The hard constraint in the regularized loss function, $\sum_{j=1}^p \beta_j^2 \le t$, becomes a term in the equation, with a new constant $\lambda$ that controls regularization: $Loss(\beta) + \lambda \sum_{j=1}^p \beta_j^2$. The penalty term is a soft constraint because there is no threshold or clipping; larger coefficients are simply more expensive. The effect is to shift the ordinary loss function “bowl” upwards and the loss function minimum towards the origin, as shown to the right; that bowl is the addition of the loss and penalty bowls. The only thing changing in that picture is lambda; the training data remains the same. The moving bowl is the non-regularized loss function plus the L2 penalty term.
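To make the soft-constraint math concrete, here is a minimal sketch (my own, not code from the article) that minimizes MSE plus an L2 penalty for a two-coefficient regression on made-up data; the lambda values are arbitrary. The closed-form minimizer used below follows directly from setting the gradient of the penalized loss to zero.

```python
# A sketch of the soft constraint: instead of "minimize MSE subject to
# sum(beta^2) <= t", we minimize MSE + lambda * sum(beta^2). The data and
# lambda values are made up purely to show the minimum sliding toward (0, 0).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([4.0, -3.0]) + rng.normal(scale=0.3, size=50)
n = len(y)

def regularized_loss(beta, lam):
    residual = y - X @ beta
    return np.mean(residual ** 2) + lam * np.sum(beta ** 2)  # MSE + L2 penalty

for lam in (0.0, 1.0, 10.0, 100.0):
    # Minimizer of mean((y - X b)^2) + lam * sum(b^2):
    # setting the gradient to zero gives (X'X + n*lam*I) b = X'y.
    beta = np.linalg.solve(X.T @ X + n * lam * np.eye(2), X.T @ y)
    print(f"lambda={lam:>6}: beta={np.round(beta, 3)}, "
          f"loss={regularized_loss(beta, lam):.3f}")
# Larger lambda pulls the minimum of the combined "bowl" toward the origin;
# nothing is clipped, large coefficients simply become more expensive.
```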
- Here's a key question about L1 Lasso: Does L1 encourage coefficients to shrink to zero or does it simply not discourage zeros? It encourages zeros, which I verified by running lots of simulations of random loss functions (different minima locations and shapes). In the following images, green dots indicate the location of a loss function minimum that results in a zero regularized coefficient. Blue indicates a loss function that does not result in a zero coefficient and orange indicates a near miss. L1 tends not to give near misses, so the simulation on the left is just blue/green.
The L1 diamond hard constraint on the left tends to zero coefficients for any loss function whose minimum is in the zone perpendicular to the diamond edges. The L2 circular constraint tends to zero coefficients for loss function minimums sitting really close to or on one of the axes.
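For the curious, here is a rough, simplified re-creation of that kind of simulation (my own sketch, not the article's code): it draws random axis-aligned quadratic loss functions, finds the best coefficients on the L1 diamond versus the L2 circle by brute-force search over the constraint boundary, and counts how often a coefficient lands on (numerically) zero. The constraint size t, the sampling distributions, and the zero tolerance are all assumptions.

```python
# Count how often the hard-constrained optimum has a zero coefficient for
# random quadratic losses, comparing the L1 diamond and the L2 circle.
import numpy as np

rng = np.random.default_rng(2)
t = 1.0

# Densely sampled boundaries of the two constraint regions.
theta = np.linspace(0, 2 * np.pi, 4000)
circle = np.column_stack([t * np.cos(theta), t * np.sin(theta)])    # ||b||_2 = t
s = np.linspace(-t, t, 2000)
diamond = np.vstack([np.column_stack([s, t - np.abs(s)]),
                     np.column_stack([s, np.abs(s) - t])])          # |b1|+|b2| = t

def zero_rate(boundary, inside, trials=5000, tol=1e-2):
    zeros = 0
    for _ in range(trials):
        m = rng.uniform(-4, 4, size=2)          # random loss-minimum location
        a, b = rng.uniform(0.2, 3.0, size=2)    # random axis-aligned bowl shape
        if inside(m):                           # minimum already satisfies constraint
            best = m
        else:                                   # otherwise the optimum is on the boundary
            loss = a * (boundary[:, 0] - m[0]) ** 2 + b * (boundary[:, 1] - m[1]) ** 2
            best = boundary[np.argmin(loss)]
        zeros += np.min(np.abs(best)) < tol
    return zeros / trials

print("L1 zero-coefficient rate:", zero_rate(diamond, lambda m: np.sum(np.abs(m)) <= t))
print("L2 zero-coefficient rate:", zero_rate(circle, lambda m: np.hypot(*m) <= t))
# The diamond's tips sit on the axes, so the constrained optimum snaps to an
# axis for a wide range of minima; the circle only touches the axes at four points.
```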
- If L1 encourages zero coefficients, why does it do that?! Why doesn't L2 encourage zero coefficients? Imagine we have some training data where the variable associated with $\beta_1$ is very predictive but the other variable is not. In that case, we would see a loss function with contour lines kind of like the following, where the big black dot is the non-regularized loss function minimum.
For the L1 case, the optimal location is the purple dot at the diamond tip. This is true because any movement of $\beta$ away from that spot increases the loss. Take a look at the contour line emanating from the L1 purple dot. The associated ellipse has the same loss value at all locations. Any position outside of that ellipse, away from the black dot, has higher loss; any position inside that ellipse has lower loss. Any movement in either direction along the diamond edge, away from the purple dot, increases the loss because $\beta$ immediately moves outside of the contour associated with the L1 purple dot. Moving the black dot up and to the left, however, would start to make nonzero L1 coefficients more likely.
For the L2 case, the optimal location is at the other purple dot, not on the axis like the L1 case. Because the L2 purple dot is inside the contour line that passes through the L1 dot on the axis, it has a lower loss value than either of the orange dots. Moving the purple dot along the circular constraint line in any direction would increase the loss. The optimal L2 coefficients are nonzero, despite the fact that the non-regularized loss function minimum (black dot) is very close to the axis.
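One way to see the same asymmetry with arithmetic rather than pictures is to work the penalized (soft-constraint) version of a simplified, separable quadratic loss. The sketch below is my own worked example, not the article's: the curvatures, the minimum location, and lambda are assumptions, and the closed forms are the standard soft-thresholding (L1) and shrinkage (L2) solutions for this special case.

```python
# Worked example assuming a separable quadratic loss
#   loss(b) = a1*(b1 - m1)^2 + a2*(b2 - m2)^2
# whose minimum (m1, m2) sits near an axis, like the black dot above.
import numpy as np

a = np.array([1.0, 1.0])    # bowl curvature in each direction (assumed)
m = np.array([2.0, 0.3])    # non-regularized minimum: predictive b1, weak b2
lam = 1.0                   # regularization strength (assumed)

# L1 (lasso) penalty lam*(|b1| + |b2|): per-coordinate soft-thresholding.
beta_l1 = np.sign(m) * np.maximum(np.abs(m) - lam / (2 * a), 0.0)

# L2 (ridge) penalty lam*(b1^2 + b2^2): per-coordinate shrinkage, never exactly 0.
beta_l2 = a * m / (a + lam)

print("L1 solution:", beta_l1)   # -> [1.5, 0.0]   b2 is snapped exactly to zero
print("L2 solution:", beta_l2)   # -> [1.0, 0.15]  b2 is shrunk but stays nonzero
```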
Deep dive for those lured in by my cool visualizations
The text and visualizations from the summary above were pulled from the following subsections that go into much more detail. There are also many more simulations, so I recommend checking it out if a hazy regularization understanding has been bugging the crap out of you for years!
- 1. A quick review of linear regression
- 2. How regularization works conceptually
- 3. The difference between L1 and L2 regularization
- 4. How we express regularization in practice
Acknowledgements
I'd like to thank mathematicians Steve Devlin, David Uminsky, and Jeff Hamrick, also faculty in University of San Francisco's MS in Data Science program, for helping me understand the mathematics and why L1 regularization encourages zero coefficients.
Resources
Here are the original papers on Ridge and Lasso regression:
- Ridge Regression: Biased Estimation for Nonorthogonal Problems by Hoerl and Kennard, Technometrics, 1970.
- Regression Shrinkage and Selection via the Lasso by Tibshirani in Journal of the Royal Statistical Society, 1996.
Here are some in-depth articles:
- My MSDS621 project could prove useful for those interested in the implementation of regularization by gradient descent: Using gradient descent to fit regularized linear models
- Lecture notes on ridge regression by Wessel N. van Wieringen.
- Deep Learning Basics Lecture 3: Regularization I (slides) by Yingyu Liang at Princeton University.
- Regularized Regression from @bradleyboehmke at University of Cincinnati.
If you are new to deep learning, check out Jeremy Howard's full course called Practical Deep Learning for Coders. Then you might be interested in some articles that relate the regularization described in this article to deep learning.
- Understanding AdamW: Weight decay or L2 regularization? by Sylvain Gugger and Jeremy Howard
- Three Mechanisms of Weight Decay Regularization by Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse