Note: As an Amazon Associate I earn from qualifying purchases. I get commissions for purchases made through links in this post. See a fuller disclaimer here.
Another Note: This series assumes you know some middle/high school maths (really just algebra) to get the most from the sections.
Companion Resource
While the notes below are my thoughts on generalizing gradient descent, I am following a book that goes into much more detail as I try to present a high-level view of what I'm learning. You can get that book, Grokking Deep Learning, at Manning Publications for the ebook version (which is cheaper), or get the physical copy from Amazon here: https://amzn.to/2YVTrmz
Chapter 5: Generalizing Gradient Descent Notes
If Chapter 4 was looking to introduce you to gradient descent (GD), Chapter 5 is looking to generalize that concept in a few different ways:
* Multiple input nodes with one output node
* Freezing One Weight
* One input node with multiple output nodes
* Multiple input and output nodes
Gradient Descent w/ multiple input nodes & one output node
- Since you have multiple input nodes that share one output node, the `delta` that was calculated needs to be distributed back to each of the input nodes. Doing this will give you the appropriate `weight_delta` for each node.
- Remember: the `weight_delta` value is telling you how far your prediction is (positive or negative) from the actual value, in relation to the respective input value. Using math, the equation would look like the following: `weight_delta = that specific input value * the delta calculated`.
- After finding this `weight_delta` value, you would then calculate the new `weight` value with `weight -= alpha * that specific input node's weight_delta`. With this new `weight`, each input node's contribution to the prediction would be `pred = input node value * new weight value` (summed across the input nodes). You repeat this process over x iterations; a minimal code sketch of this loop follows below.
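To make that loop concrete, here is a minimal sketch with made-up input values, a made-up `true` value, and a made-up learning rate `alpha`; this is my own illustration of the idea, not the book's exact code:

```python
# gradient descent with multiple inputs and one output (illustrative sketch)
inputs = [8.5, 0.65, 1.2]      # one value per input node (made up)
weights = [0.1, 0.2, -0.1]     # one weight per input node (made up)
true = 1.0                     # the value the prediction should reach
alpha = 0.01                   # learning rate

for iteration in range(40):
    # the single output is the weighted sum of all inputs
    pred = sum(i * w for i, w in zip(inputs, weights))
    delta = pred - true
    # the shared delta is scaled by each input to get that weight's weight_delta
    weight_deltas = [i * delta for i in inputs]
    # update every weight with its own weight_delta
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
```

After enough iterations, `pred` settles very close to `true`, which is the "repeat over x iterations" step described above.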
Freezing One Weight
- Freezing one weight basically allows you to see which of the input nodes has the biggest influence on your prediction value. Another way of saying it is how Trask puts it: "`a` (or just an input node in your neural network) may be a powerful input with lots of predictive power, but if the network accidentally figures out how to predict accurately on the training data without it, then it will never learn to incorporate `a` into its prediction" [1].
- If you're wondering how you would freeze one weight, you would just make that weight's `weight_delta` value `0` on every iteration. If you multiply anything by `0`, you'll always get the value `0`, so the update for that weight is always `0`. Essentially, the weight value will always stay the same, as if you were "freezing" the weight in a certain state. A minimal sketch of this is below.
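Here is a minimal sketch of that idea, using the same made-up numbers as the earlier snippet; the only change from plain gradient descent is zeroing out one entry of `weight_deltas` before the update:

```python
# freezing the first weight (illustrative sketch)
inputs = [8.5, 0.65, 1.2]
weights = [0.1, 0.2, -0.1]
true = 1.0
alpha = 0.01
frozen_index = 0               # which weight to freeze

for iteration in range(40):
    pred = sum(i * w for i, w in zip(inputs, weights))
    delta = pred - true
    weight_deltas = [i * delta for i in inputs]
    weight_deltas[frozen_index] = 0    # its update is always 0, so the weight never moves
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
```

The network still learns to drive the error down using the other weights, which is exactly why the frozen input may never get incorporated into the prediction.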
Gradient Descent with one input node & multiple output nodes
- This time gradient descent is the reverse of the first subtopic. You have one input node having an influence on three different output nodes. Since the three output nodes share one input node, each output node gets its own `delta` value telling you how far off that output's prediction is from its true value.
- Equations to keep in mind (applied per output node):
*`pred = one input node value * initial weight`
*`delta = pred - true`
- Because of the three output nodes, `weight_delta` is going to be a list: `weight_deltas = one input node value * each output node's delta`
- Finally, you would repeat the first subtopic's approach to calculating the new `weight` values to test out the new prediction; see the sketch below.
- The difference between the first subtopic and this subtopic is just which side of the neural network (input vs. output) has one or more nodes. Then you do the necessary multiplication.
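Here is a minimal sketch of that setup, with one made-up input value and three made-up true values; again, this is my own illustration, not the book's exact code:

```python
# gradient descent with one input and multiple outputs (illustrative sketch)
input_value = 0.65               # the single input node (made up)
weights = [0.3, 0.2, 0.9]        # one weight per output node (made up)
trues = [0.1, 1.0, 0.1]          # one true value per output node (made up)
alpha = 0.1

for iteration in range(40):
    preds = [input_value * w for w in weights]          # one prediction per output
    deltas = [p - t for p, t in zip(preds, trues)]      # one delta per output
    weight_deltas = [input_value * d for d in deltas]   # each delta scaled by the single input
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
```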
Gradient Descent with multiple input and output nodes
- This last subtopic is when you have multiple input and output nodes. If you understood the first and third subtopics (in these notes and the book), then this shouldn’t be as hard to fathom.
- For each row of weight values and input values, you're going to find the `delta` values.
- After you find the `delta` values, you have to calculate each row's `weight_delta` values for each output.
- Finally, you calculate the new weight values for each column in the row and assign those as the new weight values to use in the prediction. See the code snippet below.
```python
# this code snippet assumes that you have already calculated your weight_deltas
# this nested for loop assigns a new weight to each column in a row
# (i -> each row in the matrix, j -> each column in that row)
# you go through all the columns (j) in row (i), then move to the next row and start at column 0
for i in range(len(weights)):
    for j in range(len(weights[0])):
        weights[i][j] -= alpha * weight_deltas[i][j]
```
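For context on where those `weight_deltas` come from, here is a minimal end-to-end sketch with made-up inputs, true values, and a 3x3 weight matrix (rows for output nodes, columns for input nodes); it is my own illustration of the idea, not the book's exact code:

```python
# gradient descent with multiple inputs and multiple outputs (illustrative sketch)
inputs = [8.5, 0.65, 1.2]              # one value per input node (made up)
trues = [0.1, 1.0, 0.1]                # one true value per output node (made up)
weights = [[0.1, 0.1, -0.3],           # weights[i][j]: input j -> output i
           [0.1, 0.2,  0.0],
           [0.0, 1.3,  0.1]]
alpha = 0.01

for iteration in range(40):
    # each output's prediction is the weighted sum of all inputs for that row
    preds = [sum(i * w for i, w in zip(inputs, row)) for row in weights]
    deltas = [p - t for p, t in zip(preds, trues)]
    # each weight's delta is its row's (output's) delta scaled by its column's input
    weight_deltas = [[inputs[j] * deltas[i] for j in range(len(inputs))]
                     for i in range(len(deltas))]
    # the same nested update loop as the snippet above
    for i in range(len(weights)):
        for j in range(len(weights[0])):
            weights[i][j] -= alpha * weight_deltas[i][j]
```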
A few GIFs on Gradient Descent
So in between this post's notebook and the previous post's notebook, there is a lot of talk about gradient descent in Deep Learning. However, I want to show a visual representation of what is actually going on with the math. I'm not well versed in matplotlib (as of yet), so I believe the GIFs below do a good job of showing/plotting what is going on mathematically.
With both GIFs below you see that (whether it's the dots or the black line), both are trying to get to the lowest point in the parabola. Trask says, "What you're really trying to do with the neural network is find the lowest point on this big error plane (the parabolas below), where the lowest point refers to the lowest error" [2]. This "lowest error" means you have reached a point in your iterations where your `pred = input * weights` is actually very close to the values that you want to see, or your `true` values in this case [2].
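As a small aside on why those error curves are parabolas: with the squared error used in the earlier chapters, the error as a function of a single weight is quadratic, so plotting error against the weight gives a U-shape whose bottom is the lowest error. A tiny sketch with made-up numbers:

```python
# error as a function of one weight is a parabola: (input * weight - true) ** 2
input_value, true = 2.0, 4.0
for weight in [0.0, 1.0, 2.0, 3.0, 4.0]:
    pred = input_value * weight
    error = (pred - true) ** 2
    print(weight, error)     # errors: 16.0, 4.0, 0.0, 4.0, 16.0 -> lowest at weight = 2.0
```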
Jupyter Notebook
As always, the Jupyter notebook (chap5_generalizingGradientDescent | Kaggle) is provided for you to follow along with.
As always, until next time ✌
References
[1] Trask, Andrew W. "Learning Multiple Weights at a Time: Generalizing Gradient Descent." Grokking Deep Learning, Manning Publications, 2019, p. 263.
[2] Trask, Andrew W. "Learning Multiple Weights at a Time: Generalizing Gradient Descent." Grokking Deep Learning, Manning Publications, 2019, p. 267.
[3] Ng, Andrew. "Linear Regression with One Variable | Gradient Descent - [Andrew Ng]." YouTube, 22 June 2020. https://www.youtube.com/watch?v=F6GSRDoB-Cg&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=8
[4] Tejani, Alykhan. "A Brief Introduction to Gradient Descent." alykhantejani.github.io, 22 June 2020. https://alykhantejani.github.io/a-brief-introduction-to-gradient-descent/