How to make your optimizer less sensitive to the choice of hyperparameters
The task of an optimizer is to find a set of weights for which a neural network (NN) model yields the lowest possible loss. If you only had one weight and a loss function like the one depicted below, you wouldn’t have to be a genius to find the solution.
Unfortunately, you normally have a multitude of weights and a loss landscape that is far from simple, not to mention no longer suited for a 2D drawing.
Finding a minimum of such a function is no longer a trivial task. The most common optimizers, like Adam or SGD, require very time-consuming hyperparameter tuning and can get caught in local minima. The importance of choosing a hyperparameter such as the learning rate can be summarized by the following picture:
The recently proposed LookAhead optimizer makes the optimization process less sensitive to suboptimal hyperparameters and therefore lessens the need for extensive hyperparameter tuning.
It sounds like something worth exploring!
The algorithm
Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of “fast weights” generated by another optimizer.
The optimizer keeps two sets of weights: fast weights θ and slow weights ϕ. They are both initialized with the same values. A standard optimizer (e.g. Adam, SGD, …) with a certain learning rate η is used to update the fast weights θ for a defined number of steps k, resulting in some new values θ’.
Then a crucial thing happens: the slow weights ϕ are moved along the direction defined by the difference of weight vectors θ’ - ϕ. The length of this step is controlled by the parameter α, the slow weights learning rate.
The process is then repeated, starting by resetting the fast weights to the newly computed slow weights ϕ’. You can see the pseudocode below:
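The steps above can be sketched in plain Python. This is a minimal sketch rather than the pseudocode from the paper: it assumes plain gradient descent as the inner optimizer, and the function name `lookahead_minimize`, the toy objective and all default values are made up for illustration.

```python
def lookahead_minimize(grad, init, k=5, alpha=0.5, eta=0.1, outer_steps=20):
    """Lookahead wrapped around plain gradient descent (illustrative only).

    grad: function mapping a list of weights to a list of partial derivatives.
    """
    slow = list(init)                      # slow weights phi
    for _ in range(outer_steps):
        fast = list(slow)                  # reset fast weights theta to phi
        for _ in range(k):                 # k steps of the inner optimizer
            g = grad(fast)
            fast = [w - eta * gi for w, gi in zip(fast, g)]
        # move phi along the direction (theta' - phi), scaled by alpha
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow


# Toy objective (assumed for illustration): f(w) = (w0 - 3)^2 + (w1 + 1)^2
def toy_grad(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


weights = lookahead_minimize(toy_grad, [0.0, 0.0])
print(weights)  # should end up close to the minimum at (3, -1)
```

Here `k`, `alpha` and `eta` play the roles of k, α and η from the description, and `slow` and `fast` stand for ϕ and θ.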
What’s the point of this?
To answer this question, we will study the (slightly modified) picture from the LookAhead publication, but as an introduction let’s first look at another picture. If our model only had three weights, the loss function could be easily visualized, as in the picture below.
Obviously, real-life models have many more than three weights, resulting in a weight space of higher dimensionality. Nevertheless, we can still visualize the loss by projecting it onto a hyperplane in such a space.
That is what is presented in the LookAhead paper:
We see a projection of the objective function (in this case it’s accuracy, but it could be loss just as well) onto a hyperplane in the weight space. Different colors correspond to different objective function values: the brighter the color, the better the value. The behavior of the LookAhead optimizer is shown in the following way: the blue dashed line represents the trajectory of the fast weights θ (with blue squares indicating ten subsequent states), while the violet line shows the direction θ’ - ϕ along which the slow weights are updated. The violet triangles indicate two subsequent slow-weights values ϕ, ϕ’. The distance between the triangles is defined by the slow-weights learning rate α.
We can see that the standard optimizer (in this case SGD) traverses a sub-optimal green region, whereas the second slow-weight state is already much closer to the optimum. The paper describes it more elegantly:
When oscillating in the high curvature directions, the fast weights updates make rapid progress along the low curvature directions. The slow weights help smooth out the oscillations through the parameter interpolation. The combination of fast weights and slow weights improves learning in high curvature directions, reduces variance, and enables Lookahead to converge rapidly in practice.
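The smoothing effect described in the quote can be illustrated on a one-dimensional toy problem. This is a sketch under stated assumptions, not an experiment from the paper: the loss f(w) = 0.5 · curvature · w², the learning rate, and the helper names `gradient_descent` and `lookahead_descent` are all chosen for illustration. The learning rate is deliberately slightly too large, so every plain gradient step multiplies w by -1.05 and the iterate oscillates with growing amplitude.

```python
def gradient_descent(w, eta, curvature, steps):
    """Plain gradient descent on f(w) = 0.5 * curvature * w**2."""
    for _ in range(steps):
        w = w - eta * curvature * w
    return w


def lookahead_descent(w, eta, curvature, k, alpha, outer_steps):
    """Lookahead around the same gradient descent step."""
    slow = w
    for _ in range(outer_steps):
        fast = gradient_descent(slow, eta, curvature, k)  # k fast steps from phi
        slow = slow + alpha * (fast - slow)               # interpolate toward theta'
    return slow


CURVATURE = 100.0   # a steep (high curvature) direction
ETA = 0.0205        # slightly too large: each step multiplies w by about -1.05

# Both runs use 50 gradient evaluations in total.
plain = gradient_descent(1.0, ETA, CURVATURE, steps=50)
wrapped = lookahead_descent(1.0, ETA, CURVATURE, k=5, alpha=0.5, outer_steps=10)

print(f"plain gradient descent: |w| = {abs(plain):.3e}")
print(f"with Lookahead:         |w| = {abs(wrapped):.3e}")
```

With these values the plain run moves away from the minimum at 0, while the wrapped run shrinks toward it: averaging ϕ with θ’ cancels most of the sign-flipping overshoot. This is a deterministic toy case and says nothing about how large the effect is on a real network.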
How to use it in Keras?
Now to the practical side of the method: so far there is only an unofficial Keras implementation, which can easily be used with your current optimizer:
As you can see, apart from the optimizer itself, Lookahead expects two arguments:
- sync_period, which corresponds to the previously introduced k, the number of steps after which the two sets of weights are synchronized,
- slow_step, which corresponds to α, the learning rate of the slow weights.
To check that it works as expected, you can set slow_step to 1 and compare the behavior of Lookahead with that of a regular optimizer.
For α = 1, the LookAhead update step ϕ’ = ϕ + α(θ’ - ϕ) reduces to ϕ’ = θ’,
which means that LookAhead reduces to its underlying standard optimizer. We can also see it in the modified weights trajectory picture:
Now the end state of the slow weights is the same as the end state of the fast weights.
You can test it using the following code:
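The same check can be sketched without Keras. The `LookaheadSketch` class below is a hypothetical pure-Python stand-in, not the unofficial Keras implementation: only the argument names `sync_period` and `slow_step` are taken from the text, while the toy loss, `sgd_step` and `train` are assumptions made for illustration.

```python
class LookaheadSketch:
    """Hypothetical stand-in for a Lookahead wrapper; not the Keras class."""

    def __init__(self, inner_step, sync_period=5, slow_step=0.5):
        self.inner_step = inner_step      # function: (weights, grads) -> new weights
        self.sync_period = sync_period    # k
        self.slow_step = slow_step        # alpha
        self.slow = None
        self.count = 0

    def step(self, weights, grads):
        if self.slow is None:
            self.slow = list(weights)     # phi starts equal to theta
        weights = self.inner_step(weights, grads)
        self.count += 1
        if self.count % self.sync_period == 0:
            # phi' = phi + alpha * (theta' - phi)
            self.slow = [s + self.slow_step * (w - s)
                         for s, w in zip(self.slow, weights)]
            weights = list(self.slow)     # reset theta to the new phi
        return weights


def sgd_step(weights, grads, lr=0.1):
    """Plain gradient descent step used as the inner optimizer."""
    return [w - lr * g for w, g in zip(weights, grads)]


def toy_grad(w):
    # Gradient of the assumed toy loss f(w) = w0**2 + 3 * w1**2
    return [2.0 * w[0], 6.0 * w[1]]


def train(step_fn, steps=23):
    w = [1.0, -2.0]
    for _ in range(steps):
        w = step_fn(w, toy_grad(w))
    return w


plain = train(sgd_step)
alpha_one = train(LookaheadSketch(sgd_step, sync_period=5, slow_step=1.0).step)
alpha_half = train(LookaheadSketch(sgd_step, sync_period=5, slow_step=0.5).step)

# Compare with a tolerance: phi + 1.0 * (theta' - phi) can differ from theta'
# by floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(plain, alpha_one))
print("max difference for slow_step=1.0:", max_diff)
print("plain:", plain, "slow_step=0.5:", alpha_half)
```

With `slow_step=1.0` the wrapped run should match the plain one up to rounding, whereas `slow_step=0.5` follows a visibly different trajectory.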
Final word
LookAhead is an effective optimization algorithm which, at a negligible computational cost, makes the process of finding the minimum of a loss function more stable. What is more, it requires less hyperparameter tuning.
It is said to be particularly effective when combined with the Rectified Adam optimizer. I will cover this topic in my next article.