Backpropagation is the beating heart of deep learning, and a solid understanding of it is required for any practitioner. Although there are already many good resources on the internet that explain backpropagation, they approach it from very different angles, and each suits a particular type of audience. In this post, I'm going to combine intuition, animated graphs, and code so that beginner and intermediate students of deep learning can digest it more easily. A good test of your understanding of any algorithm is whether you can code it yourself from scratch. After reading this post, you should have an idea of how to implement your own version of backpropagation in Python.
Real-valued circuits, and a force of correction
Mathematically, backpropagation is the process of computing gradients for the components of a function by applying the chain rule. In the case of neural networks, the function of interest is the loss function. I like the interpretation from Andrej Karpathy's CS231n: treat the compute graph as a real-valued circuit made of gates. The gates are the operations in the function, e.g. addition, multiplication, exponentiation, matrix multiplication, etc.
This is a great mental model because it means backpropagation is a local process. Every gate in the circuit diagram gets some inputs and can immediately compute:
(1) its output value
(2) the local gradient of its output with respect to its inputs
Keep in mind that this is still the forward pass of the neural net. Quoting the CS231n note:
In the forward pass, the gates can do this completely independently without being aware of any of the details of the full circuit that they are embedded in.
However, once the forward pass is over, during the backward pass (backpropagation) the gate learns the gradient of the final output of the entire circuit with respect to its own output value. It then applies the chain rule: it takes that upstream gradient and multiplies it into every local gradient it computed for its inputs. The backward pass can be implemented as a recursive traversal from the output of the circuit back to all of the inputs.
Intuitively, the final effect of backprop and its associated weight updates is that the circuit "wants" to output a value closer to whatever target value we have. Take the gradient of the add gate (-4) in the graph above as an example: it means that changing q by +1 will change f by -4. If we'd like a higher f, we could make q lower. This is essentially what a gradient is; people sometimes call it "sensitivity". Another great metaphor is a force of correction: the sign of the gradient indicates the direction of correction, and the magnitude indicates its strength.
Backpropagation can thus be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make adjustments to the final output value.
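To make this concrete: the graph referenced above is not reproduced here, but the quoted numbers match the classic CS231n circuit f(x, y, z) = (x + y) * z with x = -2, y = 5, z = -4. Assuming that example, a tiny sketch of the forward and backward pass looks like this:

```python
# Assumed example: the CS231n circuit f(x, y, z) = (x + y) * z
# with x = -2, y = 5, z = -4.
x, y, z = -2.0, 5.0, -4.0

# Forward pass: each gate computes its output from its own inputs.
q = x + y          # add gate:      q = 3
f = q * z          # multiply gate: f = -12

# Backward pass: chain rule, from the output back to the inputs.
df_df = 1.0              # gradient of f with respect to itself
df_dq = z * df_df        # multiply gate's local gradient w.r.t. q is z -> -4
df_dz = q * df_df        # multiply gate's local gradient w.r.t. z is q ->  3
df_dx = 1.0 * df_dq      # add gate's local gradient w.r.t. x is 1      -> -4
df_dy = 1.0 * df_dq      # add gate's local gradient w.r.t. y is 1      -> -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```

The -4 that flows to q is exactly the force of correction described above: nudging q up by one unit pushes f down by four.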
From function to compute graph
One of the best ways to visualize backprop is to draw the compute graph of the function. Let’s look at this weird function below to demonstrate how to draw its compute graph and then do backprop on it manually.
To calculate its gradients, we can decompose it into add, sigmoid, square gates as shown below:
Concretely, the process consists of three high-level steps:
- Build the compute graph from operations (gates)
- Run the forward pass through each operation
- Run the backward pass based on (1) the values computed in the forward pass, and (2) the backward function for each gate to calculate their local gradients
As shown in the animation above, you can follow along and manually calculate the values. I’ll show how to implement it in the last section, but now let’s look at a trick that will help us simplify the process.
Staged computation
Any kind of differentiable function can act as a gate, and we can group multiple gates into a single gate whenever it is convenient.
Since we never explicitly solve for the full gradient analytically, the choice of which function components to treat as gates becomes something to consider. Take the sigmoid function for example: σ(x) = 1 / (1 + e^(-x)).
We can decompose it into add, multiply, negate, exponentiate, and reciprocal gates, as shown in the following compute graph:
A simple sigmoid already involves so many operations and gradients that this seems unnecessarily complex. What we could do instead is use a single sigmoid gate that applies to the output of the red box, together with a function that calculates its gradient. The gradient of the sigmoid is very simple: dσ(x)/dx = σ(x) * (1 - σ(x)).
This way we avoid a lot of unnecessary computation. It saves us time, space, energy, makes the code more modular and easier to read, and avoids numerical problems.
The code for the sigmoid gate could look something like:
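The original snippet is not reproduced here, but a minimal sketch of such a gate might look like this (the class and method names are my own, not necessarily the post's):

```python
import math

class SigmoidGate:
    """A sketch of a sigmoid gate: it caches its output so the backward pass can reuse it."""

    def forward(self, x):
        # sigmoid(x) = 1 / (1 + e^(-x))
        self.out = 1.0 / (1.0 + math.exp(-x))
        return self.out

    def backward(self, dout):
        # Local gradient sigmoid(x) * (1 - sigmoid(x)),
        # chained with the gradient flowing in from above (dout).
        return dout * self.out * (1.0 - self.out)

gate = SigmoidGate()
out = gate.forward(0.5)    # forward pass output
dx = gate.backward(1.0)    # gradient w.r.t. the gate's input
```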
It has both the forward pass output and the function to compute the local gradient for the backward pass in just a few lines of code. In the next section, I will fit this into the bigger picture and show how to write a mini autograd library with such components.
Autograd: a recursive approach
To let a computer calculate the gradients of any function expressed as a directed acyclic graph (DAG) using the chain rule, we need to write code for the three high-level steps mentioned before. Such a program is often called auto-differentiation, or autograd. As you'll see next, we can structure our code around a Tensor class that defines the data and the operations, so that it can not only build dynamic compute graphs on the fly but also backpropagate through them recursively.
Forward pass: building the graph
A Tensor object has data, grad, a _backward() function, a _prev set of Tensor nodes, and an _op operation. When we execute an expression, it builds the compute graph on the fly, since we have overridden Python operators such as +, * and ** with customized dunder methods. The current Tensor's _backward(), _prev and _op are defined by its parent(s), i.e. the Tensor(s) that produced it. For example, the c in c = a + b has its _backward() defined in __add__, with _prev = {a, b} and _op = '+'. This way we can define any operation we want and let Python construct the graph.
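Here is a minimal sketch of such a Tensor class, in the spirit of micrograd. The exact implementation details are assumptions, but the fields and dunder methods follow the description above:

```python
class Tensor:
    def __init__(self, data, _prev=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # how to push gradients back to the parents
        self._prev = set(_prev)         # the Tensor(s) that produced this one
        self._op = _op                  # the operation that produced this one

    def __add__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data + other.data, (self, other), '+')

        def _backward():
            # d(out)/d(self) = 1 and d(out)/d(other) = 1, chained with out.grad
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data * other.data, (self, other), '*')

        def _backward():
            # d(out)/d(self) = other.data and d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, n):
        out = Tensor(self.data ** n, (self,), f'**{n}')

        def _backward():
            # d(out)/d(self) = n * self.data**(n - 1)
            self.grad += n * (self.data ** (n - 1)) * out.grad
        out._backward = _backward
        return out

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other)
```

Every operator returns a new Tensor that records its parents in _prev and closes over them in _backward, which is all the graph structure backprop needs.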
Here we are talking about neural networks, so the expression we care about is the loss function. Take the MSE loss for example (using a single scalar data point for simplicity): MSELoss = (wx - y)**2. The graph is:
Notice that subtraction is actually negation and addition. I named the intermediate nodes for illustration purposes only. With the graph ready, we can implement backprop!
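Before moving on to the backward pass, here is how building that graph looks with the Tensor sketch above (the values are made up for illustration):

```python
# Build the MSE loss graph for a single scalar data point (made-up values).
w = Tensor(2.0)   # weight
x = Tensor(3.0)   # input
y = Tensor(5.0)   # target

pred = w * x      # intermediate node, _op = '*'
diff = pred - y   # really pred + (-y): negation plus addition
loss = diff ** 2  # MSELoss = (wx - y)**2 -> loss.data == 1.0
```

By the time loss exists, the whole compute graph is already in memory, with each node's _prev pointing back at its operands.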
Backward pass: topological sort
backward() should compute gradients one by one, starting from the current Tensor node and moving to its ancestors. The order of traversal needs to be topologically sorted to make sure the dependencies are satisfied at every step. One simple way to implement this topological sort is a depth-first search. Here's an animation showing how it's executed for the MSELoss example.
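A sketch of what backward() can look like, again following micrograd; this is an assumption about the implementation rather than the post's exact code:

```python
def backward(self):
    # Topologically sort the graph with a depth-first search, so that each node
    # runs its _backward() only after every node that consumes it has run.
    topo, visited = [], set()

    def build_topo(node):
        if node not in visited:
            visited.add(node)
            for parent in node._prev:
                build_topo(parent)
            topo.append(node)

    build_topo(self)

    # Seed the output gradient and apply the chain rule in reverse topological order.
    self.grad = 1.0
    for node in reversed(topo):
        node._backward()

Tensor.backward = backward   # attach it to the Tensor class sketched earlier
```

Calling loss.backward() on the MSE example runs the DFS once and fills in w.grad, x.grad and y.grad in a single reverse sweep.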
This is graph traversal 101: a plain old DFS. If you have a big, complex neural net, you just replace the upper-left corner that is wx with your big graph, and the DFS will go there and compute the gradients, provided that all the operations and their _backward are defined in the Tensor class. Some of the obvious operations we need here include max, sum, matrix multiplication, transpose, etc. To use the gradients to update your weights, do:
for p in <your_parameters>:
    p.data -= learning_rate * p.grad
There you have it, the autograd algorithm in a few lines of code. It is the backbone of any deep learning framework. Now that the hard part is behind us, to complete the implementation of your mini deep learning framework, you just need to implement a model interface with a collection of layer types, loss functions, and some optimizers.
Recommended readings
This post is heavily inspired by Andrej Karpathy's awesome CS231n lectures and his beautifully written micrograd, the tiniest autograd engine. If you would like to see a different, more extended version of a DIY deep learning framework that closely resembles PyTorch, check out the one implemented in Grokking Deep Learning by Andrew Trask. If you prefer diving into PyTorch's autograd directly instead, Elliot Waite has a great video on YouTube that will save you a lot of time digging through the source code. Keep on learning!