Improving Deep Neural Networks

Category: IT Technology · Published: 5 years ago

Andrew Ng’s advice for Hyperparameter Tuning and Regularisation from his Deep Learning Specialisation course.

Pawel Kadysz

I have recently been going through Coursera’s Deep Learning Specialisation course, designed and taught by Andrew Ng. The second sub-course is Improving Deep Neural Networks: Hyperparameter Tuning, Regularisation, and Optimisation. Before starting it, I had already done all of those steps for traditional machine learning algorithms in my previous projects. I have tuned hyperparameters such as max_depth and min_samples_leaf for decision trees, and C, kernel, and gamma for SVMs. For regularisation I have applied Ridge (L2 penalty), Lasso (L1 penalty), and ElasticNet (L1 and L2) to regression models. So I expected the course to be little more than translating those concepts over to neural networks. I was somewhat right, but thanks to the way Andrew Ng explains the mathematics and visualises the inner workings of these optimisation methods, I now understand them at a much more fundamental level.

In this article I want to go over some of Andrew’s explanations of these techniques, accompanied by some mathematics and diagrams.

Hyperparameter Tuning

Here are a few popular hyperparameters that are tuned for deep networks:

  • α (alpha): learning rate
  • β (beta): momentum
  • number of layers
  • number of hidden units
  • learning rate decay
  • mini-batch size

Others are specific to particular optimisation techniques; for instance, Adam has β1, β2, and ε.

Grid Search vs Random Search

Suppose we are tuning more than one hyperparameter for a model. One of them will probably influence train/validation accuracy more than another. In that case we want to try a wider variety of values for the more impactful hyperparameter, but we also don’t want to run too many models, as that is time-consuming.

For this example, let us say we are optimising two hyperparameters, α and ε. We know α is more important and should be tuned by trying as many different values as possible, but we still want to try 5 different ε values. With a grid search, if I also choose 5 different α values, that comes to 25 models: one for each combination of the 5 α values and the 5 ε values.

But we want to try more α values without increasing the number of models. Here is Andrew’s solution:

For this, we use a random search: we draw 25 random values of α and 25 random values of ε, and each model uses one (α, ε) pair. We still only have to run 25 models, but we get to try 25 different values of α instead of the 5 we tried in the grid search.

Left: Grid Search, Right: Random Search
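The difference in coverage can be sketched in a few lines of Python. This is only an illustration: the candidate values, the sampling ranges, and the variable names are assumptions, and no model is actually trained.

```python
import itertools
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Grid search: 5 candidate values per hyperparameter -> 5 x 5 = 25 models.
# These candidate values are illustrative assumptions.
alpha_grid = [0.1, 0.3, 0.5, 0.7, 0.9]
epsilon_grid = [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
grid_trials = list(itertools.product(alpha_grid, epsilon_grid))

# Random search: 25 models, each with its own freshly drawn (alpha, epsilon) pair.
# Uniform sampling is used here purely for simplicity.
random_trials = [
    (random.uniform(0.0001, 1.0), random.uniform(1e-9, 1e-5))
    for _ in range(25)
]

distinct_alphas_grid = len({alpha for alpha, _ in grid_trials})
distinct_alphas_random = len({alpha for alpha, _ in random_trials})

print("grid:  ", len(grid_trials), "models,", distinct_alphas_grid, "distinct alpha values")
print("random:", len(random_trials), "models,", distinct_alphas_random, "distinct alpha values")
```

Both searches cost 25 training runs, but the grid only ever probes 5 distinct values of α, while the random search probes a new α in every run.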

Bonus: a coarse-to-fine search can improve tuning further. This involves zooming into the smaller region of hyperparameter values that performed best, then creating more models within that region to tune those hyperparameters more precisely.
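A minimal sketch of the coarse-to-fine idea, for a single hyperparameter α. The function validation_score is a hypothetical stand-in for training a model and measuring validation accuracy, and its peak at α = 0.3 is made up for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible


def validation_score(alpha):
    """Hypothetical stand-in for 'train a model, return validation accuracy'.

    It peaks at alpha = 0.3 purely for illustration.
    """
    return 1.0 - (alpha - 0.3) ** 2


def sample_and_rank(low, high, n):
    """Draw n alpha values uniformly from [low, high], best score first."""
    alphas = [random.uniform(low, high) for _ in range(n)]
    return sorted(alphas, key=validation_score, reverse=True)


# Coarse pass over the whole range.
coarse = sample_and_rank(0.0, 1.0, 25)
top = coarse[:5]

# Fine pass: zoom into the region spanned by the best coarse results.
fine_low, fine_high = min(top), max(top)
fine = sample_and_rank(fine_low, fine_high, 25)

best_alpha = max(coarse[0], fine[0], key=validation_score)
print(f"coarse best: {coarse[0]:.4f}")
print(f"fine region: [{fine_low:.4f}, {fine_high:.4f}]")
print(f"overall best: {best_alpha:.4f}")
```

How many of the top results define the fine region (5 here) is itself a judgment call; a wider region is safer, a narrower one is more precise.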

Choosing a Scale

When trying out different hyperparameter values, choosing the correct scale can be difficult, especially when the search has to cover both a range of really large numbers and a range of really small numbers thoroughly.

The learning rate is a hyperparameter that can vary enormously from model to model: the best value may lie between 0.000001 and 0.000002, or between 0.8 and 0.9. It is very hard to search both of these ranges fairly at once on a linear scale, but we can solve this issue by using a log scale.

Let’s say we are looking at values between 0.0001 and 1 for α. On a linear scale, only about 10% of the sampled α values fall between 0.0001 and 0.1, and 90% fall between 0.1 and 1. This is bad, because three of the four orders of magnitude are barely searched. On a base-10 log scale, 25% of the α values fall between 0.0001 and 0.001, 25% between 0.001 and 0.01, 25% between 0.01 and 0.1, and the final 25% between 0.1 and 1. This gives a thorough search of α: the range 0.0001 to 0.1 gets 10% of the samples on a linear scale but 75% on a log scale.

Left: Linear Scale, Right: Log Scale

Here is a little bit of mathematics with a numpy function to demonstrate how this works for a random value for α.
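A minimal numpy sketch of that sampling, assuming the range 0.0001 to 1 from the example above. The idea is to draw the exponent r uniformly between -4 and 0 and then set α = 10^r; the sample size and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 100_000

# Log scale: draw the exponent uniformly, then exponentiate.
# r is uniform in [-4, 0), so alpha = 10**r lies in [0.0001, 1).
r = rng.uniform(-4.0, 0.0, size=n)
alpha_log = 10.0 ** r

# Linear scale: draw alpha itself uniformly from the same range.
alpha_linear = rng.uniform(0.0001, 1.0, size=n)

# Fraction of samples that land below 0.1 under each scheme.
frac_log = np.mean(alpha_log < 0.1)
frac_linear = np.mean(alpha_linear < 0.1)

print(f"log scale:    {frac_log:.3f} of samples below 0.1")
print(f"linear scale: {frac_linear:.3f} of samples below 0.1")
```

With this many samples the two fractions should come out close to the 75% and 10% quoted above.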

Regularisation

Overfitting, caused by high variance, can be a huge problem for models. It can be addressed by getting more training data, but that is not always possible, so a great alternative is regularisation.

L2 Regularisation (‘Weight Decay’)

Regularisation uses one of two penalty techniques, L1 or L2; with neural networks, L2 is predominantly used.

We must first look at the cost function for a neural network:

Cost Function
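In the notation commonly used in the course, which is assumed here, m is the number of training examples, L is the number of layers, and the script 𝓛 is the loss on a single example. The unregularised cost is then the average loss over the training set:

```latex
J\left(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}\right)
  = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)
```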

And then add the L2 penalty term, which includes the Frobenius Norm:

L2 penalty term, which includes the Frobenius Norm
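Written out under the usual conventions (assumed here: m training examples, L layers, per-example loss 𝓛, and w^{[l]} the weight matrix of layer l), the regularised cost adds the squared Frobenius norm of every weight matrix, scaled by λ/2m:

```latex
J_{\text{reg}}
  = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)
  + \frac{\lambda}{2m} \sum_{l=1}^{L} \left\lVert w^{[l]} \right\rVert_F^2,
\qquad
\left\lVert w^{[l]} \right\rVert_F^2
  = \sum_{i} \sum_{j} \left( w^{[l]}_{ij} \right)^2
```

The Frobenius norm is simply the sum of the squares of all entries of the matrix, so the penalty grows whenever any weight grows in magnitude.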

With L2 regularisation, each update reduces the weight not only by the usual learning-rate-scaled gradient from backpropagation but also by an extra term that is proportional to the weight itself and includes the regularisation hyperparameter λ (lambda). The larger λ is, the smaller w becomes.

Weight Decay
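The name "weight decay" can be checked numerically. The sketch below assumes the standard update w := w − α·(dw_backprop + (λ/m)·w), where dw_backprop stands in for the gradient of the unregularised cost; all numeric values and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

alpha = 0.1  # learning rate
lam = 0.7    # regularisation strength lambda (illustrative value)
m = 50       # number of training examples (illustrative value)

w = rng.standard_normal((3, 4))            # weights of one layer
dw_backprop = rng.standard_normal((3, 4))  # stand-in for the unregularised gradient

# Gradient of the regularised cost: the backprop term plus (lambda / m) * w.
dw = dw_backprop + (lam / m) * w
w_new = w - alpha * dw

# Same update, rearranged: shrink w by a constant factor, then take the usual step.
decay = 1.0 - alpha * lam / m
w_new_decay_form = decay * w - alpha * dw_backprop

print("forms agree:", np.allclose(w_new, w_new_decay_form))
print("decay factor:", decay)
```

Because the decay factor is slightly below 1, every step multiplies the weights by a number just under 1 before the ordinary gradient step is applied, and a larger λ pushes that factor lower.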

How does regularisation prevent overfitting?

We see that L2 regularisation uses the λ penalty to reduce the weights w, but how does this reduce variance and prevent overfitting of the model?

λ rises, w falls, changing the magnitude of z

If w is small, the magnitude of z drops too: a large positive z becomes smaller and a large negative z becomes less negative, so both move towards 0. Passing such a z through the activation function then has a more linear effect, because the tanh curve is close to linear near 0. A network whose activations all operate in this near-linear regime behaves more like a linear model, which has lower variance and is less able to overfit.
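The near-linearity of tanh around 0 is easy to confirm with the standard library. The helper linear_gap below is a hypothetical name introduced only for this sketch; it measures how far tanh(z) is from the straight line a = z.

```python
import math


def linear_gap(z):
    """Absolute difference between tanh(z) and the linear approximation z."""
    return abs(math.tanh(z) - z)


# Small |z|: tanh(z) is almost exactly z. Large |z|: tanh saturates near 1.
for z in (0.01, 0.1, 0.5, 2.0, 5.0):
    print(f"z = {z:>4}: tanh(z) = {math.tanh(z):.4f}, gap from linear = {linear_gap(z):.4f}")
```

The gap is negligible for small z and grows quickly once z moves into the flat, saturated part of the curve.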

