Automatic differentiation is the foundation upon which deep learning frameworks rest. Deep learning models are typically trained using gradient-based techniques, and autodiff makes it easy to get gradients, even from enormous, complex models. ‘Reverse-mode autodiff’ is the autodiff method used by most deep learning frameworks, due to its efficiency and accuracy.
Let’s:
- Look at how reverse-mode autodiff works.
- Create a minimal autodiff framework in Python.
The small autodiff framework will deal with scalars. We can (and will later) use NumPy to vectorise it.
Note on terminology: from now on ‘autodiff’ will refer to ‘reverse-mode autodiff’. ‘Gradient’ is used loosely, but in this context generally means ‘first order partial derivative’.
How does autodiff work?
Let’s start with an example.
```python
a = 4
b = 3
c = a + b  # = 4 + 3 = 7
d = a * c  # = 4 * 7 = 28
```
Q1: What is the gradient of $d$ with respect to $a$, i.e. $\frac{\partial{d}}{\partial{a}}$? (Go ahead and try this!)
Solving the ‘traditional’ way:
There are many ways to solve Q1, but let’s use the product rule, i.e. if $ y = x_1x_2 $ then $y’ = x_1’x_2 + x_1x_2’$.
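Carrying that out for our example (a sketch of the working, with $x_1 = a$ and $x_2 = c = a + b$):

$ \frac{\partial{d}}{\partial{a}} = \frac{\partial{a}}{\partial{a}} * c + a * \frac{\partial{c}}{\partial{a}} = (a + b) + a * 1 = 2a + b = 2 * 4 + 3 = 11 $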
Phew… and if you wanted to know $\frac{\partial{d}}{\partial{b}}$ you’d have to carry out the process again.
Solving the autodiff way
We’ll now look at the autodiff way to solve Q1. Here is a figure:
On the left we see the system represented as a graph. Each variable is a node; e.g. $d$ is the topmost node, and $a$ and $b$ are leaf nodes at the bottom.
On the right we see the system from autodiff’s point of view. Let’s call the values on the graph edges local derivatives . By using local derivatives and simple rules, we will be able to compute the derivatives that we want.
Here is the answer to Q1, calculated the autodiff way. Can you see how it relates to the figure?
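Written out using the local derivatives from the graph edges (reconstructed here; it corresponds to the two routes described next):

$ \frac{\partial{d}}{\partial{a}} = \frac{\partial{\bar{d}}}{\partial{a}} + \frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{a}} = c + a * 1 = 7 + 4 = 11 $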
We get this answer from the graph by finding the routes from $d$ to $a$ (not going against the dotted arrows), and then applying the following rules:
- Multiply the edges of a route.
- Add together the different routes.
The first route is straight from $d$ to $a$, which gives us the $\frac{\partial{\bar{d}}}{\partial{a}}$ term. The second route is from $d$ to $c$ to $a$, which gives us the term $\frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{a}}$.
Our autodiff implementation will go down the graph and compute the derivative of $d$ with respect to every sub-node, rather than just computing it for a particular node, as we have just done with $d$. Notice that we could compute the gradient of $d$ with respect to $c$, and $b$, without much more work.
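For example (worked out here for illustration), the only route from $d$ to $b$ is $d \to c \to b$, so:

$ \frac{\partial{d}}{\partial{b}} = \frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{b}} = a * 1 = 4 $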
Local derivatives
We saw ‘local derivatives’ on the graph edges above, written in the form: $ \frac{\partial \bar{y} }{\partial x}$.
The bar is to convey that we are doing a simpler kind of differentiation.
In general: to get a local derivative, treat the variables going into a node as not being functions of other variables.
For example, recall that $d = a * c$. Then compare $ \frac{\partial{d}}{\partial{a}} = 2a + b $, to $ \frac{\partial \bar{d} }{\partial a} = c $. The local derivative $\frac{\partial \bar{d} }{\partial a} = c$, is obtained by treating $c$ as a constant before differentiating the expression for $d$.
Local derivatives make our lives easier.
It is often easy to define the local derivatives of simple functions, and adding functions to the autodiff framework is easy if you know the local derivatives. E.g.
Addition: $n = a + b$. The local derivatives are: $\frac{\partial \bar{n} }{\partial a} = 1$ and $\frac{\partial \bar{n} }{\partial b} = 1$.
Multiplication: $n = a * b$. The local derivatives are: $\frac{\partial \bar{n} }{\partial a} = b$ and $\frac{\partial \bar{n} }{\partial b} = a$.
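Two more, which we will use later when we add `Neg` and `Inv` nodes (for subtraction and division):

Negation: $n = -a$. The local derivative is: $\frac{\partial \bar{n} }{\partial a} = -1$.

Reciprocal: $n = \frac{1}{a}$. The local derivative is: $\frac{\partial \bar{n} }{\partial a} = -\frac{1}{a^{2}}$.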
Let’s create the framework
Description of the implementation.
A node contains two pieces of data:
- `value`: the scalar value of the node.
- `grad`: a list of (child node, local derivative) pairs.
The function `get_gradients` uses the nodes’ `grad` data to work its way down the graph (using a stack rather than recursion), computing the gradients. It uses the rules we saw above:
- Multiply the edges of a route (into a node).
- Add together the different routes (that lead to a node).
The tuples in `stack` (in `get_gradients`) are similar to the tuples in `grad`, but they contain the current route value instead of the local derivative value.
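To make this concrete, here is an illustrative trace (not shown in the original post) of `get_gradients(d)` for the example graph, where `d.grad` is `[(a, 7), (c, 4)]` and `c.grad` is `[(a, 1), (b, 1)]`:

```python
# Illustrative trace of get_gradients(d), for a = 4, b = 3, c = a + b, d = a * c.
# stack starts as a copy of d.grad:                [(a, 7), (c, 4)]
# pop (c, 4): gradients[c] = 4; push c's children,
#   multiplying in their local derivatives:        [(a, 7), (a, 1 * 4), (b, 1 * 4)]
# pop (b, 4): gradients[b] = 4   (b is a Var, so nothing to push)
# pop (a, 4): gradients[a] = 4
# pop (a, 7): gradients[a] = 4 + 7 = 11            ("add together the different routes")
```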
Implementing just enough to solve the example above:
```python
from collections import defaultdict


class Var:
    """A leaf node (a node with no children)."""
    def __init__(self, value):
        self.value = value  # the scalar value of the node.


class Add:
    """The node that results from adding two nodes."""
    def __init__(self, a, b):
        self.value = a.value + b.value
        self.grad = [(a, 1), (b, 1)]  # child nodes & corresponding 'local derivatives'


class Mul:
    """The node that results from multiplying two nodes."""
    def __init__(self, a, b):
        self.value = a.value * b.value
        self.grad = [(a, b.value), (b, a.value)]


def get_gradients(parent_node):
    """Go down the graph, and compute derivative of `parent_node` with respect to each node."""
    gradients = defaultdict(lambda: 0)
    stack = parent_node.grad.copy()  # list of (node, route_value) tuples.
    while stack:
        node, route_value = stack.pop()
        gradients[node] += route_value  # "Add together the different routes."
        if not isinstance(node, Var):  # if the node has children, put them onto the stack.
            for child_node, child_local_derivative in node.grad:
                stack.append((child_node, child_local_derivative * route_value))  # "Multiply the edges of a route."
    return dict(gradients)
```
Solving the example above:
```python
a = Var(4)
b = Var(3)
c = Add(a, b)  # = 4 + 3 = 7
d = Mul(a, c)  # = 4 * 7 = 28

gradients = get_gradients(d)

print('d.value =', d.value)
print("The partial derivative of d with respect to a =", gradients[a])
```
```
d.value = 28
The partial derivative of d with respect to a = 11
```
Success!
We also got gradients for the other nodes:
```python
print('gradients[b] =', gradients[b])
print('gradients[c] =', gradients[c])
```
```
gradients[b] = 4
gradients[c] = 4
```
Let’s take a look at the `grad` values (the local derivatives):
```python
print('dict(d.grad)[a] =', dict(d.grad)[a])
print('dict(d.grad)[c] =', dict(d.grad)[c])
print('dict(c.grad)[a] =', dict(c.grad)[a])
print('dict(c.grad)[b] =', dict(c.grad)[b])
```
```
dict(d.grad)[a] = 7
dict(d.grad)[c] = 4
dict(c.grad)[a] = 1
dict(c.grad)[b] = 1
```
We saw these in our example above as:
$ \frac{\partial \bar{d} }{\partial a} = c = 7$
$ \frac{\partial \bar{d} }{\partial c} = a = 4$
$ \frac{\partial \bar{c} }{\partial a} = 1$
$ \frac{\partial \bar{c} }{\partial b} = 1$
A few small improvements
Let’s:
- Enable the use of operators, such as +, *, - …
- Add a few more functions
```python
class Ops:
    """Enables use of +, *, -, etc."""
    def __add__(self, other):
        return Add(self, other)

    def __mul__(self, other):
        return Mul(self, other)

    def __sub__(self, other):
        return Add(self, Neg(other))

    def __truediv__(self, other):
        return Mul(self, Inv(other))


class Var(Ops):
    def __init__(self, value):
        self.value = value


class Add(Ops):
    def __init__(self, a, b):
        self.value = a.value + b.value
        self.grad = [(a, 1), (b, 1)]


class Mul(Ops):
    def __init__(self, a, b):
        self.value = a.value * b.value
        self.grad = [(a, b.value), (b, a.value)]


class Neg(Ops):
    def __init__(self, var):
        self.value = -1 * var.value
        self.grad = [(var, -1)]


class Inv(Ops):
    def __init__(self, var):
        self.value = 1 / var.value
        self.grad = [(var, -var.value ** -2)]
```
Some more examples
We can get the gradients of arbitrary functions made from the functions we’ve added to the framework. E.g.
```python
a = Var(230.3)
b = Var(33.2)

def f(a, b):
    return (a / b - a) * (b / a + a + b) * (a - b)

y = f(a, b)

gradients = get_gradients(y)

print("The partial derivative of y with respect to a =", gradients[a])
print("The partial derivative of y with respect to b =", gradients[b])
```
```
The partial derivative of y with respect to a = -153284.83150602411
The partial derivative of y with respect to b = 3815.0389441500993
```
We can use numerical estimates to check that we’re getting correct results:
```python
delta = Var(1e-8)

numerical_grad_a = (f(a + delta, b) - f(a, b)) / delta
numerical_grad_b = (f(a, b + delta) - f(a, b)) / delta

print("The numerical estimate for a =", numerical_grad_a.value)
print("The numerical estimate for b =", numerical_grad_b.value)
```
```
The numerical estimate for a = -153284.89243984222
The numerical estimate for b = 3815.069794654846
```
It’s easy to add more functions
You just need to be able to define the local derivatives.
```python
import numpy as np


class Sin(Ops):
    def __init__(self, var):
        self.value = np.sin(var.value)
        self.grad = [(var, np.cos(var.value))]


class Exp(Ops):
    def __init__(self, var):
        self.value = np.exp(var.value)
        self.grad = [(var, self.value)]


class Log(Ops):
    def __init__(self, var):
        self.value = np.log(var.value)
        self.grad = [(var, 1. / var.value)]
```
Let’s check that these work:
```python
a = Var(43.)
b = Var(3.)
c = Var(2.)

def f(a, b, c):
    f = Sin(a * b) + Exp(c - (a / b))
    return Log(f * f) * c

y = f(a, b, c)

gradients = get_gradients(y)

print("The partial derivative of y with respect to a =", gradients[a])
print("The partial derivative of y with respect to b =", gradients[b])
print("The partial derivative of y with respect to c =", gradients[c])
```
```
The partial derivative of y with respect to a = 60.85353612046653
The partial derivative of y with respect to b = 872.2331479536114
The partial derivative of y with respect to c = -3.2853671032530305
```
```python
delta = Var(1e-8)

numerical_grad_a = (f(a + delta, b, c) - f(a, b, c)) / delta
numerical_grad_b = (f(a, b + delta, c) - f(a, b, c)) / delta
numerical_grad_c = (f(a, b, c + delta) - f(a, b, c)) / delta

print("The numerical estimate for a =", numerical_grad_a.value)
print("The numerical estimate for b =", numerical_grad_b.value)
print("The numerical estimate for c =", numerical_grad_c.value)
```
```
The numerical estimate for a = 60.85352186602222
The numerical estimate for b = 872.232160009645
The numerical estimate for c = -3.285367089489455
```
That’s the end of our minimal autodiff implementation!
Of course there’s various features missing, such as:
- Vectorisation.
- Placeholder variables.
- Nth derivatives.
- Optimisations.
- All the other amazing stuff deep learning / autodiff frameworks can do.
The most fruitful addition to our minimal framework would be vectorisation (but not carried out in the manner that follows).
A naive vectorisation
We will look at an incredibly computationally inefficient way to vectorise our autodiff framework. (It is not recommended to use this for anything, because of how slow it is.)
The approach is to put `Var` objects into NumPy arrays. NumPy's element-wise operations then go through the operator overloads we defined on `Ops`, building up the computational graph one scalar at a time:
```python
import numpy as np

to_var = np.vectorize(lambda x: Var(x))        # convert NumPy array into array of Var objects
to_vals = np.vectorize(lambda var: var.value)  # get values from array of Var objects
```
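To see what these helpers give us, here is a quick sanity check (an illustrative snippet, not from the original post; the names `xs`, `ws` and `y` are made up). Element-wise NumPy operations on arrays of `Var` objects dispatch to the `Ops` overloads, so they build a graph of scalar nodes:

```python
# Element-wise multiply builds one Mul node per element; np.sum chains Add nodes.
xs = to_var(np.array([1.0, 2.0, 3.0]))
ws = to_var(np.array([4.0, 5.0, 6.0]))

y = np.sum(xs * ws)              # a single graph node: 1*4 + 2*5 + 3*6 = 32
print(y.value)                   # 32.0
print(get_gradients(y)[ws[0]])   # dy/dw_0 = x_0 = 1.0
```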
A single linear layer artificial neural network (fitting noise to noise):
```python
import matplotlib.pyplot as plt


def update_weights(weights, gradients, lrate):
    for i in range(weights.shape[0]):
        for j in range(weights.shape[1]):
            weights[i, j].value -= lrate * gradients[weights[i, j]]


np.random.seed(0)

input_size = 50
output_size = 10
lrate = 0.001

x = to_var(np.random.random(input_size))
y_true = to_var(np.random.random(output_size))
weights = to_var(np.random.random((input_size, output_size)))

loss_vals = []
for i in range(100):
    y_pred = np.dot(x, weights)
    loss = np.sum((y_true - y_pred) * (y_true - y_pred))
    loss_vals.append(loss.value)
    gradients = get_gradients(loss)
    update_weights(weights, gradients, lrate)

plt.plot(loss_vals)
plt.xlabel("Time step"); plt.ylabel("Loss"); plt.title("Single linear layer learning")
plt.show()
```
Coming soon: a part two, where we look at how to vectorise our minimal framework more efficiently.
References:
- Géron, A. (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow.
- https://en.wikipedia.org/wiki/Automatic_differentiation
- https://stackoverflow.com/a/43184871/4572034 (using a stack instead of recursive function)