Andrew Ng’s advice for Hyperparameter Tuning and Regularisation from his Deep Learning Specialisation course.
I have recently been going through Coursera’s Deep Learning Specialisation course, designed and taught by Andrew Ng. The second sub-course is Improving Deep Neural Networks: Hyperparameter Tuning, Regularisation, and Optimisation. Before I started this sub-course I had already done all of those steps for traditional machine learning algorithms in my previous projects. I’ve tuned hyperparameters for decision trees such as max_depth and min_samples_leaf, and for SVMs tuned C, kernel, and gamma. For regularisation I have applied Ridge (L2 penalty), Lasso (L1 penalty), and ElasticNet (L1 and L2) to regression models. So I thought it would not be much more than translating those concepts over to neural networks. I was somewhat right, but given how Andrew Ng explains the mathematics and visually represents the inner workings of these optimisation methods, I now have a much greater understanding at a fundamental level.
In this article I want to go over some of Andrew’s explanations for these techniques, accompanied by some mathematics and diagrams.
Hyperparameter Tuning
Here are a few popular hyperparameters that are tuned for deep networks:
- α (alpha): learning rate
- β (beta): momentum
- number of layers
- number of hidden units
- learning rate decay
- mini-batch size
There are others specific to particular optimisation techniques; for instance, the Adam optimiser has β1, β2, and ε.
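As a quick illustration, here is a minimal sketch of where these hyperparameters appear as optimiser arguments, assuming tf.keras (a framework choice of mine, not something the article uses); the values shown are the library defaults, not tuned choices.

```python
import tensorflow as tf

# Adam exposes its own hyperparameters alongside the learning rate.
# The values below are the tf.keras defaults.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # alpha
    beta_1=0.9,           # decay rate for the first-moment (momentum) estimate
    beta_2=0.999,         # decay rate for the second-moment (RMS) estimate
    epsilon=1e-7,         # small constant for numerical stability
)
```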
Grid Search vs Random Search
Suppose we are tuning more than one hyperparameter for a model. One of them will probably have more influence on train/validation accuracy than the others. In that case we want to try a wider variety of values for the more impactful hyperparameter, but at the same time we don’t want to run too many models, since training each one is time consuming.
For this example, let’s say we are optimising two different hyperparameters, α and ε. We know α is more important and should be tuned over as many different values as possible, but we still want to try 5 different ε values. With a grid search, choosing 5 different α values gives 5 × 5 = 25 models, one for each combination of the 5 α and 5 ε values.
But we want to try more α values without increasing the number of models. Here is Andrew’s solution:
For this, we use a random search: we choose 25 random values for each of α and ε and pair them up, using one pair per model. We still run only 25 models, but now we try 25 different values of α instead of the 5 we tried with the grid search.
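Here is a minimal numpy sketch of this idea; the sampling ranges and the training call are illustrative assumptions of mine, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_models = 25

# Random search: every model gets its own (alpha, epsilon) pair, so all
# 25 alpha values are distinct rather than 5 values reused 5 times each.
alphas = 10 ** rng.uniform(-4, 0, size=n_models)     # assumed range: 1e-4 to 1
epsilons = 10 ** rng.uniform(-9, -7, size=n_models)  # assumed range: 1e-9 to 1e-7

for alpha, eps in zip(alphas, epsilons):
    # Hypothetical training step: stands in for fitting one model with this
    # (alpha, eps) pair and recording its validation accuracy.
    print(f"model: alpha={alpha:.5f}, epsilon={eps:.2e}")
```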
Bonus: Using a coarse-to-fine search can improve tuning further. This involves zooming into the smaller region of hyperparameter space that performed best, then sampling more models within that region to tune the hyperparameters more precisely.
Choosing a Scale
When trying out different hyperparameter values, choosing the correct scale can be difficult, especially when you need to search thoroughly across both a range of very large numbers and a range of very small ones.
The learning rate is a hyperparameter that can vary enormously from model to model: it could be between 0.000001 and 0.000002, or between 0.8 and 0.9. It is very hard to search fairly across two such different ranges at once on a linear scale, but we can solve this by using a log scale.
Let’s say we are looking at values between 0.0001 and 1 for α. Sampling on a linear scale means only about 10% of the attempted α values land between 0.0001 and 0.1, while 90% land between 0.1 and 1, so the lower part of such a wide range is barely searched. On a base-10 log scale, 25% of the α values land in each decade: 0.0001 to 0.001, 0.001 to 0.01, 0.01 to 0.1, and 0.1 to 1. This way we have a thorough search of α: the range 0.0001 to 0.1 receives 10% of the samples on a linear scale but 75% on a log scale.
Here is a little bit of mathematics with a numpy function to demonstrate how this works for a random value of α; the snippet below is a minimal sketch of the course’s recipe for sampling the exponent uniformly.
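```python
import numpy as np

# Draw the exponent r uniformly from [-4, 0], then set alpha = 10**r.
# Each decade (1e-4 to 1e-3, ..., 1e-1 to 1) then has equal probability (25%).
r = -4 * np.random.rand()  # r is uniform in [-4, 0]
alpha = 10 ** r            # alpha is log-uniform in [1e-4, 1]
print(alpha)
```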
Regularisation
Overfitting can be a huge problem for models with high variance. It can be solved by getting more training data, but that’s not always possible, so a great alternative is regularisation.
L2 Regularisation (‘Weight Decay’)
Regularisation utilises one of two penalty techniques, L1 or L2; with neural networks, L2 is predominantly used.
We must first look at the cost function for a neural network:
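In the course’s notation (reconstructed here in LaTeX), for an L-layer network trained on m examples:

```latex
J\left(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}\right)
  = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)
```

where 𝓛 is the per-example loss, and w^[l] and b^[l] are the weights and bias of layer l.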
And then add the L2 penalty term, which includes the Frobenius Norm:
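```latex
J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)
  + \frac{\lambda}{2m} \sum_{l=1}^{L} \left\lVert w^{[l]} \right\rVert_F^2,
\qquad
\left\lVert w^{[l]} \right\rVert_F^2
  = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \left( w_{ij}^{[l]} \right)^2
```

Here λ is the regularisation hyperparameter, and the Frobenius norm simply sums the squares of every entry of each weight matrix.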
With L2 regularisation the weight is reduced not only by the learning rate and backpropagation but also by the penalty term, which includes the regularisation hyperparameter λ (lambda). The larger λ is, the smaller w becomes.
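This is easiest to see in the gradient-descent update, which with the penalty included becomes:

```latex
w^{[l]} := w^{[l]} - \alpha \left( (\text{from backprop}) + \frac{\lambda}{m} w^{[l]} \right)
         = \left(1 - \frac{\alpha \lambda}{m}\right) w^{[l]} - \alpha \,(\text{from backprop})
```

Since the factor (1 − αλ/m) is slightly less than 1, every step multiplicatively shrinks the weights, which is why L2 regularisation is also called “weight decay”.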
How does regularisation prevent overfitting?
We see that L2 regularisation uses the λ penalty to reduce the weights w, but how does this reduce variance and prevent overfitting of the model?
If w is small, z (= wa + b) will shrink too: a large positive z becomes smaller and a large negative z becomes larger, both moving closer to 0. Passing such a z through the activation function then has a roughly linear effect, since the tanh curve is close to linear near 0. A network whose units all operate in this near-linear regime computes something close to a linear function, which cannot fit an overly complex, overfitted decision boundary.
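A quick numpy check of this near-linearity (the sample values are illustrative, not from the article):

```python
import numpy as np

# Near 0, tanh(z) is almost the identity function; far from 0 it saturates.
z_small = np.array([-0.10, -0.01, 0.01, 0.10])
z_large = np.array([-3.0, 3.0])

print(np.tanh(z_small))  # ~[-0.0997, -0.0100, 0.0100, 0.0997]: close to z itself
print(np.tanh(z_large))  # ~[-0.995, 0.995]: saturated, strongly non-linear
```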