Gradient Descent, the Learning Rate, and the importance of Feature Scaling


Photo by Steve Arrington on Unsplash

The content of this post is a partial reproduction of a chapter from the book “Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide”.

Introduction

What do gradient descent, the learning rate, and feature scaling have in common? Let's see…

Every time we train a deep learning model, or any neural network for that matter, we're using gradient descent (with backpropagation). We use it to minimize a loss by updating the parameters/weights of the model.

The parameter update depends on two values: a gradient and a learning rate. The learning rate gives you control of how big (or small) the updates are going to be. A bigger learning rate means bigger updates and, hopefully, a model that learns faster.

But there is a catch, as always… if the learning rate is too big, the model will not learn anything. This leads us to two fundamental questions:

  • How big is "too big"?
  • Is there anything I can do to use a bigger learning rate?

Unfortunately, there is no clear-cut answer for the first question. It will always depend on many factors.

But there is an answer for the second one: feature scaling! How does that work? Well, that's why I've written this post: to show you, in detail, how gradient descent, the learning rate, and feature scaling are connected.

In this post we will:

  • define a model and generate a synthetic dataset
  • randomly initialize the parameters
  • explore the loss surface and visualize the gradients
  • understand the effects of using different learning rates
  • understand the effects of feature scaling

Model

The model must be simple and familiar, so you can focus on the inner workings of gradient descent. So, I will stick with a model as simple as it can be: a linear regression with a single feature x!

y = b + w x + ε

Simple Linear Regression

In this model, we use a feature (x) to try to predict the value of a label (y). There are three elements in our model:

  • parameter b, the bias (or intercept), which tells us the expected average value of y when x is zero
  • parameter w, the weight (or slope), which tells us how much y increases, on average, if we increase x by one unit
  • and that last term (why does it always have to be a Greek letter?), epsilon, which is there to account for the inherent noise, that is, the error we cannot get rid of

Data Generation

We know our model already. In order to generate synthetic data for it, we need to pick values for its parameters. I chose b = 1 and w = 2.

First, let’s generate our feature (x): we use NumPy’s rand method to randomly generate 100 (N) points between 0 and 1.

Then, we plug our feature (x) and our parameters b and w into our equation to compute our labels (y). But we need to add some Gaussian noise (epsilon) as well; otherwise, our synthetic dataset would be a perfectly straight line.

We can generate noise using NumPy’s randn method, which draws samples from a normal distribution (of mean 0 and variance 1), and then multiply it by a factor to adjust for the level of noise. Since I don’t want to add too much noise, I picked 0.1 as my factor.
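A minimal sketch of this data generation step, assuming a fixed random seed of 42 so the numbers that follow are reproducible (the seed itself is an arbitrary choice):

```python
import numpy as np

true_b = 1
true_w = 2
N = 100

np.random.seed(42)  # assumed seed, so results are reproducible
x = np.random.rand(N, 1)               # feature: N points in [0, 1)
epsilon = 0.1 * np.random.randn(N, 1)  # Gaussian noise, scaled by 0.1
y = true_b + true_w * x + epsilon      # labels: y = b + w*x + noise
```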

Synthetic Dataset

Train-Validation-Test Split

It is beyond the scope of this post to explain the reasoning behind the train-validation-test split, but there are two points I’d like to make:
  1. The split should always be the first thing you do — no preprocessing, no transformations; nothing happens before the split — that’s why we do this immediately after the synthetic data generation
  2. In this post we will use only the training set — so I did not bother to create a test set, but I performed a split nonetheless to highlight point #1 :-)
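A minimal sketch of the split, assuming an 80/20 train-validation ratio (the ratio is my choice for illustration):

```python
# Shuffle the indices so the split is random
idx = np.arange(N)
np.random.shuffle(idx)

# First 80% of the shuffled indices for training, the rest for validation
train_idx = idx[:int(N * 0.8)]
val_idx = idx[int(N * 0.8):]

x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
```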
Train-validation split


Synthetic data

Random Initialization

In our example, we already know the true values of the parameters, but this will obviously never happen in real life: if we knew the true values, why even bother to train a model to find them?!

OK, given that we’ll never know the true values of the parameters, we need to set initial values for them. How do we choose them? It turns out a random guess is as good as any other.

So, we can randomly initialize the parameters/weights (we have only two, b and w).
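A sketch of the initialization. Re-seeding with 42 here is an assumption, but it happens to produce the exact values quoted below:

```python
np.random.seed(42)      # assumed seed; these draws match the values below
b = np.random.randn(1)  # b = 0.4967...
w = np.random.randn(1)  # w = -0.1382...
```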

Random start point

Our randomly initialized parameters are: b = 0.49 and w = -0.13. Are these parameters any good?


Obviously not… but, exactly how bad are they? That's what the loss is for. Our goal will be to minimize it.

Loss Surface

After choosing a random starting point for our parameters, we use them to make predictions, compute the corresponding errors, and aggregate these errors into a loss. Since this is a linear regression, we’re using Mean Squared Error (MSE) as our loss. The code below performs these steps:
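A minimal version of those steps (the printed value assumes the seed used above):

```python
# Compute our model's predictions, given the current b and w
yhat = b + w * x_train

# Compute the errors and aggregate them into the MSE loss
error = yhat - y_train
loss = (error ** 2).mean()

print(loss)  # roughly 2.74, given the initialization above
```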

Making predictions and computing the loss

We have just computed the loss (2.74) corresponding to our randomly initialized parameters (b = 0.49 and w = -0.13). Now, what if we did the same for ALL possible values of b and w? Well, not all possible values, but all combinations of evenly spaced values in a given range?

We could vary b between -2 and 4, while varying w between -1 and 5, for instance, each range containing 101 evenly spaced points. If we compute the losses corresponding to each different combination of the parameters b and w inside these ranges, the result would be a grid of losses, a matrix of shape (101, 101).
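One way to sketch this grid with NumPy broadcasting (variable names are mine):

```python
# 101 evenly spaced values for each parameter
bs, ws = np.meshgrid(np.linspace(-2, 4, 101), np.linspace(-1, 5, 101))

# Broadcast the N training points against the whole (101, 101) grid:
# shapes (N, 1, 1) and (101, 101) -> predictions of shape (N, 101, 101)
predictions = bs + ws * x_train.reshape(-1, 1, 1)
errors = predictions - y_train.reshape(-1, 1, 1)

# Average squared errors over the N points -> (101, 101) grid of losses
loss_surface = (errors ** 2).mean(axis=0)
```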

These losses are our loss surface, which can be visualized in a 3D plot, where the vertical axis (z) represents the loss values. If we connect the combinations of b and w that yield the same loss value, we’ll get an ellipse. Then, we can draw this ellipse in the original b x w plane (in blue, for a loss value of 3). This is, in a nutshell, what a contour plot does. From now on, we’ll always use the contour plot, instead of the corresponding 3D version.

The plots below show us the loss surface for the suggested ranges of parameters, using our training set to compute the loss for each combination of b and w.


Loss surface

In the center of the plot, where parameters (b, w) have values close to (1, 2), the loss is at its minimum value. This is the point we’re trying to reach using gradient descent.

At the bottom, slightly to the left, there is the random start point, corresponding to our randomly initialized parameters (b = 0.49 and w = -0.13).

This is one of the nice things about tackling a simple problem like a linear regression with a single feature: we have only two parameters, and thus we can compute and visualize the loss surface.

Cross-Sections

Another nice thing is that we can cut a cross-section in the loss surface to check what the loss looks like if the other parameter were held constant.

Let’s start by making b = 0.52 (the value from our evenly spaced range that is closest to our initial random value for b, 0.4967) — we cut a cross-section vertically (the red dashed line) on our loss surface (left plot), and we get the resulting plot on the right:


Vertical cross-section — parameter b is fixed

What does this cross-section tell us? It tells us that, if we keep b constant (at 0.52), the loss, seen from the perspective of parameter w, can be minimized if w gets increased (up to some value between 2 and 3).

Sure, different values of b produce different cross-section loss curves for w. And those curves will depend on the shape of the loss surface (more on that later, in the “Learning Rate” section).

OK, so far, so good… what about the other cross-section? Let’s cut it horizontally now, making w = -0.16 (the value from our evenly spaced range that is closest to our initial random value for w, -0.1382). The resulting plot is on the right:


Horizontal cross-section — parameter w is fixed

Now, if we keep w constant (at -0.16), the loss, seen from the perspective of parameter b, can be minimized if b gets increased (up to some value close to 2).

In general, the purpose of this cross-section is to get the effect on the loss of changing a single parameter, while keeping everything else constant. This is, in a nutshell, a gradient :-)

Visualizing Gradients

From the previous section, we already know that to minimize the loss, both b and w needed to be increased. So, keeping the spirit of using gradients, let’s increase each parameter a little bit (always keeping the other one fixed!). By the way, in this example, a little bit equals 0.12 (for convenience’s sake, so it results in a nicer plot).

What effect do these increases have on the loss? Let’s check it out:


Computing (approximate) gradients, geometrically

On the left plot, increasing w by 0.12 yields a loss reduction of 0.21. The geometrically computed and roughly approximate gradient is given by the ratio between the two values: -1.79. How does this result compare to the actual value of the gradient (-1.83)? It is actually not bad for a crude approximation… Could it be better? Sure, if we make the increase in w smaller and smaller (like 0.01, instead of 0.12), we’ll get better and better approximations… in the limit, as the increase approaches zero, we’ll arrive at the precise value of the gradient. Well, that’s the definition of a derivative!

The same reasoning goes for the plot on the right: increasing b by the same 0.12 yields a bigger loss reduction of 0.35. Bigger loss reduction, bigger ratio, bigger gradient — and bigger error, too, since the geometric approximation (-2.90) is farther away from the actual value (-3.04).
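For the record, we don't have to rely on the geometric approximation: for the MSE loss the gradients have a closed form, which follows from differentiating the mean of (b + w*x − y)². A sketch of both the approximation and the exact values (names are mine):

```python
# Finite-difference approximation: nudge w by 0.12, keep b fixed
delta = 0.12
loss_up = ((b + (w + delta) * x_train - y_train) ** 2).mean()
approx_w_grad = (loss_up - loss) / delta  # roughly -1.79, as above

# Exact gradients of the MSE loss:
# dMSE/db = 2 * mean(yhat - y),  dMSE/dw = 2 * mean(x * (yhat - y))
b_grad = 2 * error.mean()              # roughly -3.04
w_grad = 2 * (x_train * error).mean()  # roughly -1.83
```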

Updating Parameters

Finally, we use the gradients to update the parameters. Since we are trying to minimize our losses, we reverse the sign of the gradient for the update.

There is still another (hyper-)parameter to consider: the learning rate, denoted by the Greek letter eta (that looks like the letter n), which is the multiplicative factor that we need to apply to the gradient for the parameter update.

b = b - η * ∂MSE/∂b
w = w - η * ∂MSE/∂w

Updating parameters b and w
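In code, a single update step could look like this (the 0.1 value for eta is just a placeholder choice):

```python
lr = 0.1  # eta, the learning rate; a placeholder value

# Move each parameter against its gradient, scaled by the learning rate
b = b - lr * b_grad
w = w - lr * w_grad
```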

We can also interpret this a bit differently: each parameter is going to have its value updated by a constant value eta (the learning rate), but this constant is going to be weighted by how much that parameter contributes to minimizing the loss (its gradient).

Honestly, I believe this way of thinking about the parameter update makes more sense: first, you decide on a learning rate that specifies your step size, while the gradients tell you the relative impact (on the loss) of taking a step for each parameter. Then you take a given number of steps that’s proportional to that relative impact: more impact, more steps.

“How to choose a learning rate?”

Unfortunately, that is a topic on its own and beyond the scope of this post.

Learning Rate

The learning rate is the most important hyper-parameter — there is a gigantic amount of material on how to choose a learning rate, how to modify the learning rate during training, and how the wrong learning rate can completely ruin the model training.

Maybe you’ve seen this famous graph below (from Stanford’s CS231n class) that shows how a learning rate that is too big or too small affects the loss during training.

Loss curves for learning rates that are too low, too high, and good (Source: Stanford’s CS231n)

Most people will see it (or have seen it) at some point in time. This is pretty much general knowledge, but I think it needs to be thoroughly explained and visually demonstrated to be truly understood. So, let’s start!

I will tell you a little story (trying to build an analogy here, please bear with me!): imagine you are coming back from hiking in the mountains and you want to get back home as quickly as possible. At some point in your path, you can either choose to go ahead or to make a right turn.

The path ahead is almost flat, while the path to your right is kinda steep. The steepness is the gradient. If you take a single step one way or the other, it will lead to different outcomes (you’ll descend more if you take one step to the right instead of going ahead).

But, here is the thing: you know that the path to your right is getting you home faster, so you don’t take just one step, but multiple steps in that direction: the steeper the path, the more steps you take! Remember, “more impact, more steps”! You just cannot resist the urge to take that many steps; your behavior seems to be completely determined by the landscape. This analogy is getting weird, I know…

But, you still have one choice: you can adjust the size of your step. You can choose to take steps of any size, from tiny steps to long strides. That’s your learning rate.

OK, let’s see where this little story brought us so far… that’s how you’ll move, in a nutshell:

updated location = previous location + step size * number of steps

Now, compare it to what we did with the parameters:

updated value = previous value - learning rate * gradient

You got the point, right? I hope so because the analogy completely falls apart now… at this point, after moving in one direction (say, the right turn we talked about), you’d have to stop and move in the other direction (for just a fraction of a step, because the path was almost flat, remember?). And so on and so forth… Well, I don’t think anyone has ever returned from hiking in such an orthogonal zigzag path…

Anyway, let’s explore further the only choice you have: the size of your step, I mean, the learning rate.


Small Learning Rate

It makes sense to start with baby steps, right? This means using a small learning rate. Small learning rates are safe(r), as expected. If you were to take tiny steps while returning home from your hiking, you’d be more likely to arrive there safe and sound — but it would take a lot of time. The same holds true for training models: small learning rates will likely get you to (some) minimum point, eventually. Unfortunately, time is money, especially when you’re paying for GPU time in the cloud… so, there is an incentive to try bigger learning rates.

How does this reasoning apply to our model? From computing our (geometric) gradients, we know we need to take a given number of steps: 1.79 (for parameter w) and 2.90 (for parameter b), respectively. Let’s set our step size to 0.2 (small-ish). It means we move 0.36 for w and 0.58 for b.

IMPORTANT: in real life, a learning rate of 0.2 is usually considered BIG — but in our very simple linear regression example, it still qualifies as small-ish.

Where does this movement lead us? As you can see in the plots below (as shown by the new dots to the right of the original ones), in both cases, the movement took us closer to the minimum — more so on the right because the curve is steeper.


Using a small-ish learning rate

Big Learning Rate

What would have happened if we had used a big learning rate instead, say, a step size of 0.8? As we can see in the plots below, we start to, literally, run into trouble…


Using a BIG learning rate

Even though everything is still OK on the left plot, the right plot shows us a completely different picture: we ended up on the other side of the curve. That is not good… you’d be going back and forth, alternately hitting both sides of the curve.

“Well, even so, I may still reach the minimum, why is it so bad?”

In our simple example, yes, you’d eventually reach the minimum because the curve is nice and round.

But, in real problems, the “curve” has some really weird shape that allows for bizarre outcomes, such as going back and forth without ever approaching the minimum.

In our analogy, you moved so fast that you fell down and hit the other side of the valley, then kept going down like a ping-pong ball. Hard to believe, I know, but you definitely don’t want that…

Very Big Learning Rate

Wait, it may get worse than that… let’s use a really big learning rate, say, a step size of 1.1!


Using a REALLY BIG learning rate


OK, that is bad… on the right plot, not only did we end up on the other side of the curve again, but we actually climbed up. This means our loss increased, instead of decreasing! How is that even possible? You’re moving so fast downhill that you end up climbing it back up?! Unfortunately, the analogy cannot help us anymore. We need to think about this particular case in a different way…

First, notice that everything is fine on the left plot. The enormous learning rate did not cause any issues because the left curve is less steep than the one on the right. In other words, the curve on the left can take bigger learning rates than the curve on the right.

What can we learn from it?

Too big, for a learning rate, is a relative concept: it depends on how steep the curve is or, in other words, on how big the gradient is.

We do have many curves, many gradients: one for each parameter. But we only have one single learning rate to choose (sorry, that’s the way it is!).

It means that the size of the learning rate is limited by the steepest curve. All other curves must follow suit, meaning they’d be using a sub-optimal learning rate, given their shapes.

The reasonable conclusion is: it is best if all the curves are equally steep, so the learning rate is closer to optimal for all of them!

“Bad” Feature

How do we achieve equally steep curves? I’m on it! First, let’s take a look at a slightly modified example, which I am calling the “bad” dataset:

  • I multiplied our feature (x) by 10, so it is in the range [0, 10] now, and renamed it bad_x
  • but since I do not want the labels (y) to change, I divided the original true_w parameter by 10 and renamed it bad_w — this way, both bad_w * bad_x and w * x yield the same results
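A sketch of these two bullet points in code, reusing the arrays from the original data generation:

```python
# Scale the feature UP by 10 and the true weight DOWN by 10
bad_w = true_w / 10
bad_x = x * 10

# The labels do not change: true_b + bad_w * bad_x equals
# true_b + true_w * x, so we keep using the same y
```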
Generating "bad" dataset

Then I performed the same split as before for both the original and bad datasets and plotted the training sets side by side, as you can see below:
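A minimal sketch, reusing the shuffled indices from the first split so both training sets contain exactly the same data points:

```python
# Reuse the same shuffled indices, so the splits are identical
bad_x_train, y_train = bad_x[train_idx], y[train_idx]
bad_x_val, y_val = bad_x[val_idx], y[val_idx]
```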

Train-validation split for the "bad" dataset


Same data, different scales for feature x

The only difference between the two plots is the scale of feature x. Its range was [0, 1], now it is [0, 10]. The label y hasn’t changed, and I did not touch true_b.

Does this simple scaling have any meaningful impact on our gradient descent? Well, if it didn’t, I wouldn’t be asking, right? Let’s compute a new loss surface and compare it to the one we had before:


Loss surface — before and after scaling feature x

Look at the contour values of the plot above: the dark blue line was 3.0, and now it is 50.0! For the same range of parameter values, loss values are much bigger.

Let’s look at the cross-sections before and after we multiplied feature x by 10:


Comparing cross-sections: before and after

What happened here? The red curve got much steeper (bigger gradient), and thus we must use a smaller learning rate to safely descend along it.

More importantly, the difference in steepness between the red and the black curves increased.

This is exactly what WE NEED TO AVOID!

Do you remember why?

Because the size of the learning rate is limited by the steepest curve!

How can we fix it? Well, we ruined it by scaling it 10x bigger… perhaps we can make it better if we scale it in a different way.

Scaling / Standardizing / Normalizing

Different how? There is this beautiful thing called the StandardScaler, which transforms a feature in such a way that it ends up with zero mean and unit standard deviation.

How does it achieve that? First, it computes the mean and the standard deviation of a given feature (x) using the training set (N points):

μ = (1/N) Σ x_i
σ = √( (1/N) Σ (x_i − μ)² )

Mean and Standard Deviation, as computed in the StandardScaler

Then it uses both values to scale the feature:

scaled x_i = (x_i − μ) / σ

Standardization

If we were to recompute the mean and the standard deviation of the scaled feature, we would get 0 and 1, respectively. This preprocessing step is commonly referred to as normalization, although, technically, it should always be referred to as standardization.
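With Scikit-Learn, this is a two-step affair: in keeping with point #1 from the beginning of the post, the scaler must be fit on the training set only. A sketch, assuming the bad_x_train / bad_x_val arrays from the split above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn the mean and standard deviation from the TRAINING set only...
scaler.fit(bad_x_train)

# ...then apply the same transformation to both sets
scaled_x_train = scaler.transform(bad_x_train)
scaled_x_val = scaler.transform(bad_x_val)
```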

