Note: As an Amazon Associate I earn from qualifying purchases. I get commissions for purchases made through links in this post. See a fuller disclaimer here.
Another Note: This series assumes you know some middle/high school maths (really just algebra) to get the most from the sections.
Companion Resource
While the notes below are my thoughts on generalizing gradient descent, I am following a book that goes into much more detail as I try to present a high-level view of what I'm learning. You can get that book, Grokking Deep Learning, at Manning Publications for the ebook version (which is cheaper), or get the physical copy from Amazon here: https://amzn.to/2YVTrmz
Chapter 5: Generalizing Gradient Descent Notes
If Chapter 4 was looking to introduce you to gradient descent (GD), Chapter 5 is looking to generalize that concept in a few different ways:
* Multiple input nodes with one output node
* Freezing One Weight
* One input node with multiple output nodes
* Multiple input and output nodes
Gradient Descent w/ multiple input nodes & one output node
- Since you have multiple input nodes that share one output node, the `delta` that was calculated needs to be distributed back to each of the input nodes. Doing this will give you the appropriate `weight_delta` for each node.
- Remember: the `weight_delta` value is telling you how far your prediction is (positive or negative) from the actual value, in relation to the respective input value. Using math, the equation would look like the following: `weight_delta = that specific input value * the delta calculated`.
- After finding this `weight_delta` value, you would then calculate the new `weight` value with `weight -= alpha * that specific input node's weight_delta`. With this new `weight`, each input node's contribution to the prediction would be `pred = input node value * new weight value` (summed across the input nodes). You repeat this process over x iterations; a minimal code sketch of this loop follows below.
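To make that loop concrete, here is a minimal sketch with made-up input values, a made-up `true` value, and a made-up learning rate `alpha`; this is my own illustration of the idea, not the book's exact code:

```python
# gradient descent with multiple inputs and one output (illustrative sketch)
inputs = [8.5, 0.65, 1.2]      # one value per input node (made up)
weights = [0.1, 0.2, -0.1]     # one weight per input node (made up)
true = 1.0                     # the value the prediction should reach
alpha = 0.01                   # learning rate

for iteration in range(40):
    # the single output is the weighted sum of all inputs
    pred = sum(i * w for i, w in zip(inputs, weights))
    delta = pred - true
    # the shared delta is scaled by each input to get that weight's weight_delta
    weight_deltas = [i * delta for i in inputs]
    # update every weight with its own weight_delta
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
```

After enough iterations, `pred` settles very close to `true`, which is the "repeat over x iterations" step described above.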
Freezing One Weight
- Freezing one weight basically allows you to see which of the input nodes has the biggest influence on your prediction value. Another way of saying it is how Trask puts it: "`a` (or just an input node in your neural network) may be a powerful input with lots of predictive power, but if the network accidentally figures out how to predict accurately on the training data without it, then it will never learn to incorporate `a` into its prediction" [1].
- If you're wondering how you would freeze one weight, you would just make that weight's `weight_delta` value `0` on every iteration. If you multiply anything by `0`, you'll always get the value `0`, so the update for that weight is always `0`. Essentially, the weight value will always stay the same, as if you were "freezing" the weight in a certain state. A minimal sketch of this is below.
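Here is a minimal sketch of that idea, using the same made-up numbers as the earlier snippet; the only change from plain gradient descent is zeroing out one entry of `weight_deltas` before the update:

```python
# freezing the first weight (illustrative sketch)
inputs = [8.5, 0.65, 1.2]
weights = [0.1, 0.2, -0.1]
true = 1.0
alpha = 0.01
frozen_index = 0               # which weight to freeze

for iteration in range(40):
    pred = sum(i * w for i, w in zip(inputs, weights))
    delta = pred - true
    weight_deltas = [i * delta for i in inputs]
    weight_deltas[frozen_index] = 0    # its update is always 0, so the weight never moves
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
```

The network still learns to drive the error down using the other weights, which is exactly why the frozen input may never get incorporated into the prediction.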
Gradient Descent with one input node & multiple output nodes
- This time gradient descent is the reverse of the first subtopic. You have one input node having an influence on three different output nodes. Since the three output nodes share one input node, each output node gets its own `delta` value telling you how far off that output's prediction is from its true value.
- Equations to keep in mind (applied per output node):
*`pred = one input node value * initial weight`
*`delta = pred - true`
- Because of the three output nodes, `weight_delta` is going to be a list: `weight_deltas = one input node value * each output node's delta`
- Finally, you would repeat the first subtopic's approach to calculating the new `weight` values to test out the new prediction; see the sketch below.
- The difference between the first subtopic and this subtopic is just which side of the neural network (input vs. output) has one or more nodes. Then you do the necessary multiplication.
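Here is a minimal sketch of that setup, with one made-up input value and three made-up true values; again, this is my own illustration, not the book's exact code:

```python
# gradient descent with one input and multiple outputs (illustrative sketch)
input_value = 0.65               # the single input node (made up)
weights = [0.3, 0.2, 0.9]        # one weight per output node (made up)
trues = [0.1, 1.0, 0.1]          # one true value per output node (made up)
alpha = 0.1

for iteration in range(40):
    preds = [input_value * w for w in weights]          # one prediction per output
    deltas = [p - t for p, t in zip(preds, trues)]      # one delta per output
    weight_deltas = [input_value * d for d in deltas]   # each delta scaled by the single input
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
```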
Gradient Descent with multiple input and output nodes
- This last subtopic is when you have multiple input and output nodes. If you understood the first and third subtopics (in these notes and the book), then this shouldn’t be as hard to fathom.
- For each row of weight values and input values, you're going to find the `delta` values.
- After you find the `delta` values, you have to calculate each row's `weight_delta` values for each output.
- Finally, you calculate the new weight values for each column in the row and assign those as the new weight values to use in the prediction. See the code snippet below.
```python
# this code snippet assumes that you have already calculated your weight_deltas
# this nested for loop assigns a new weight to each column in a row
# (i -> each row in the matrix, j -> each column in that row)
# you go through all the columns (j) in row (i), then move to the next row and start at column 0
for i in range(len(weights)):
    for j in range(len(weights[0])):
        weights[i][j] -= alpha * weight_deltas[i][j]
```
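For context on where those `weight_deltas` come from, here is a minimal end-to-end sketch with made-up inputs, true values, and a 3x3 weight matrix (rows for output nodes, columns for input nodes); it is my own illustration of the idea, not the book's exact code:

```python
# gradient descent with multiple inputs and multiple outputs (illustrative sketch)
inputs = [8.5, 0.65, 1.2]              # one value per input node (made up)
trues = [0.1, 1.0, 0.1]                # one true value per output node (made up)
weights = [[0.1, 0.1, -0.3],           # weights[i][j]: input j -> output i
           [0.1, 0.2,  0.0],
           [0.0, 1.3,  0.1]]
alpha = 0.01

for iteration in range(40):
    # each output's prediction is the weighted sum of all inputs for that row
    preds = [sum(i * w for i, w in zip(inputs, row)) for row in weights]
    deltas = [p - t for p, t in zip(preds, trues)]
    # each weight's delta is its row's (output's) delta scaled by its column's input
    weight_deltas = [[inputs[j] * deltas[i] for j in range(len(inputs))]
                     for i in range(len(deltas))]
    # the same nested update loop as the snippet above
    for i in range(len(weights)):
        for j in range(len(weights[0])):
            weights[i][j] -= alpha * weight_deltas[i][j]
```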
A few GIFs on Gradient Descent
So in between this post's notebook and the previous post's notebook, there is a lot of talk about gradient descent in Deep Learning. However, I want to show a visual representation of what is actually going on with the math. I'm not well versed in matplotlib (as of yet), so I believe the GIFs below do a good job of showing/plotting what is going on mathematically.
With both GIFs below you see that (whether it's the dots or the black line), both are trying to get to the lowest point in the parabola. Trask says, "What you're really trying to do with the neural network is find the lowest point on this big error plane (the parabolas below), where the lowest point refers to the lowest error" [2]. This "lowest error" means you have reached a point in your iterations where your `pred = input * weights` is actually very close to the values that you want to see, or your `true` values in this case [2].
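As a small aside on why those error curves are parabolas: with the squared error used in the earlier chapters, the error as a function of a single weight is quadratic, so plotting error against the weight gives a U-shape whose bottom is the lowest error. A tiny sketch with made-up numbers:

```python
# error as a function of one weight is a parabola: (input * weight - true) ** 2
input_value, true = 2.0, 4.0
for weight in [0.0, 1.0, 2.0, 3.0, 4.0]:
    pred = input_value * weight
    error = (pred - true) ** 2
    print(weight, error)     # errors: 16.0, 4.0, 0.0, 4.0, 16.0 -> lowest at weight = 2.0
```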
Jupyter Notebook
As always, the Jupyter notebook (chap5_generalizingGradientDescent | Kaggle) is provided for you to follow along with.
As always, until next time ✌
References
[1] Trask, Andrew W. "Learning Multiple Weights at a Time: Generalizing Gradient Descent." Grokking Deep Learning, Manning Publications, 2019, p. 263.
[2] Trask, Andrew W. "Learning Multiple Weights at a Time: Generalizing Gradient Descent." Grokking Deep Learning, Manning Publications, 2019, p. 267.
[3] Ng, Andrew. "Linear Regression with One Variable | Gradient Descent - [Andrew Ng]." YouTube, 22 June 2020. https://www.youtube.com/watch?v=F6GSRDoB-Cg&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=8
[4] Tejani, Alykhan. "A Brief Introduction to Gradient Descent." alykhantejani.github.io, 22 June 2020. https://alykhantejani.github.io/a-brief-introduction-to-gradient-descent/