From scratch: reverse-mode automatic differentiation (in Python)


Automatic differentiation is the foundation upon which deep learning frameworks rest. Deep learning models are typically trained using gradient-based techniques, and autodiff makes it easy to get gradients, even from enormous, complex models. ‘Reverse-mode autodiff’ is the autodiff method used by most deep learning frameworks, due to its efficiency and accuracy.

Let’s:

  • Look at how reverse-mode autodiff works.
  • Create a minimal autodiff framework in Python.

The small autodiff framework will deal with scalars. We can (and will later) use NumPy to vectorise it.

Note on terminology: from now on ‘autodiff’ will refer to ‘reverse-mode autodiff’. ‘Gradient’ is used loosely, but in this context generally means ‘first order partial derivative’.

How does autodiff work?

Let’s start with an example.

a = 4
b = 3
c = a + b  # = 4 + 3 = 7
d = a * c  # = 4 * 7 = 28

Q1: What is the gradient of $d$ with respect to $a$, i.e. $\frac{\partial{d}}{\partial{a}}$? (Go ahead and try this!)

Solving the ‘traditional’ way:

There are many ways to solve Q1, but let's use the product rule, i.e. if $y = x_1x_2$ then $y' = x_1'x_2 + x_1x_2'$.
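
Working the product rule through for our example, with $d = a * c$ and $c = a + b$:

$\frac{\partial{d}}{\partial{a}} = \frac{\partial{a}}{\partial{a}} * c + a * \frac{\partial{c}}{\partial{a}} = 1 * (a + b) + a * 1 = 2a + b = 2 * 4 + 3 = 11$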

Phew… and if you wanted to know $\frac{\partial{d}}{\partial{b}}$ you’d have to carry out the process again.

Solving the autodiff way

We’ll now look at the autodiff way to solve Q1. Here is a figure:

[Figure: the example represented as a graph of nodes (left), and the same graph with local derivatives on its edges (right).]

On the left we see the system represented as a graph. Each variable is a node; e.g. $d$ is the topmost node, and $a$ and $b$ are leaf nodes at the bottom.

On the right we see the system from autodiff’s point of view. Let’s call the values on the graph edges ‘local derivatives’. By using local derivatives and simple rules, we will be able to compute the derivatives that we want.

Here is the answer to Q1, calculated the autodiff way. Can you see how it relates to the figure?

$\frac{\partial{d}}{\partial{a}} = \frac{\partial{\bar{d}}}{\partial{a}} + \frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{a}} = c + a * 1 = 7 + 4 = 11$

We get this answer from the graph by finding the routes from $d$ to $a$ (not going against the dotted arrows), and then applying the following rules:

  • Multiply the edges of a route.
  • Add together the different routes.

The first route is straight from $d$ to $a$, which gives us the $\frac{\partial{\bar{d}}}{\partial{a}}$ term. The second route is from $d$ to $c$ to $a$, which gives us the term $\frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{a}}$.

Our autodiff implementation will go down the graph and compute the derivative of $d$ with respect to every sub-node, rather than computing it only for a particular node, as we just did for $a$. Notice that we could compute the gradients of $d$ with respect to $c$ and $b$ without much more work.
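
For example, applying the same route rules:

$\frac{\partial{d}}{\partial{c}} = \frac{\partial{\bar{d}}}{\partial{c}} = a = 4$

$\frac{\partial{d}}{\partial{b}} = \frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{b}} = a * 1 = 4$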

Local derivatives

We saw ‘local derivatives’ on the graph edges above, written in the form: $ \frac{\partial \bar{y} }{\partial x}$.

The bar is to convey that we are doing a simpler kind of differentiation.

In general: to get a local derivative, treat the variables going into a node as not being functions of other variables.

For example, recall that $d = a * c$. Compare $\frac{\partial{d}}{\partial{a}} = 2a + b$ with $\frac{\partial \bar{d} }{\partial a} = c$. The local derivative $\frac{\partial \bar{d} }{\partial a} = c$ is obtained by treating $c$ as a constant before differentiating the expression for $d$.
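
Numerically, with $a = 4$, $b = 3$ and $c = 7$, the full derivative is $\frac{\partial{d}}{\partial{a}} = 2a + b = 11$, while the local derivative is $\frac{\partial \bar{d} }{\partial a} = c = 7$: they are different quantities, and the route rules are what combine local derivatives into the full derivatives we want.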

Local derivatives make our lives easier.

It is often easy to define the local derivatives of simple functions, and adding functions to the autodiff framework is easy if you know the local derivatives. E.g.

Addition: $n = a + b$. The local derivatives are: $\frac{\partial \bar{n} }{\partial a} = 1$ and $\frac{\partial \bar{n} }{\partial b} = 1$.

Multiplication: $n = a * b$. The local derivatives are: $\frac{\partial \bar{n} }{\partial a} = b$ and $\frac{\partial \bar{n} }{\partial b} = a$.

Let’s create the framework

Description of the implementation.

A node contains two pieces of data:

  • value: the scalar value of the node.
  • grad: a list of (child node, local derivative) tuples. (Leaf Var nodes have no children, so they only store value.)

The function get_gradients uses the nodes' grad data to walk down the graph, computing the gradients. It uses the rules we saw above:

  • Multiply the edges of a route (into a node).
  • Add together the different routes (that lead to a node).

The tuples in stack (in get_gradients) are similar to the tuples in grad, but they pair each node with the current route value (the product of the local derivatives along the route so far), rather than with a single local derivative.
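
To make this concrete, here is a rough trace of get_gradients(d) for the example above (using the node values from the code below):

stack = [(a, 7), (c, 4)]               # d.grad: the local derivatives of d w.r.t. a and c
pop (c, 4): gradients[c] = 4           # c.grad = [(a, 1), (b, 1)], so push (a, 1*4) and (b, 1*4)
pop (b, 4): gradients[b] = 4           # b is a Var, nothing to push
pop (a, 4): gradients[a] = 4           # a is a Var, nothing to push
pop (a, 7): gradients[a] = 4 + 7 = 11  # the two routes to a are added together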

Implementing just enough to solve the example above:

from collections import defaultdict

class Var:
    """A leaf node (a node with no children)."""
    def __init__(self, value):
        self.value = value  # the scalar value of the node.

class Add:
    """The node that results from adding two nodes."""
    def __init__(self, a, b):
        self.value = a.value + b.value
        self.grad = [(a, 1), (b, 1)]  # child nodes & corresponding 'local derivatives'

class Mul:
    """The node that results from multiplying two nodes."""
    def __init__(self, a, b):
        self.value = a.value * b.value
        self.grad = [(a, b.value), (b, a.value)]

def get_gradients(parent_node):
    """Go down the graph, and compute derivative of `parent_node` with respect to each node."""
    gradients = defaultdict(lambda : 0)
    stack = parent_node.grad.copy()  # list of (node, route_value) tuples.
    while stack:
        node, route_value = stack.pop()
        gradients[node] += route_value  # "Add together the different routes."
        if not isinstance(node, Var): 
            # if the node has children, put them onto the stack.
            for child_node, child_route_value in node.grad:
                stack.append((child_node, child_route_value * route_value))  # "Multiply the edges of a route."
    return dict(gradients)

Solving the example above:

a = Var(4)
b = Var(3)
c = Add(a, b) # = 4 + 3 = 7
d = Mul(a, c) # = 4 * 7 = 28

gradients = get_gradients(d)

print('d.value =', d.value)
print("The partial derivative of d with respect to a =", gradients[a])
d.value = 28
The partial derivative of d with respect to a = 11

Success!

We also got gradients for the other nodes:

print('gradients[b] =', gradients[b])
print('gradients[c] =', gradients[c])
gradients[b] = 4
gradients[c] = 4

Let’s take a look at the grad values (the local derivatives):

print('dict(d.grad)[a] =', dict(d.grad)[a])
print('dict(d.grad)[c] =', dict(d.grad)[c])
print('dict(c.grad)[a] =', dict(c.grad)[a])
print('dict(c.grad)[b] =', dict(c.grad)[b])
dict(d.grad)[a] = 7
dict(d.grad)[c] = 4
dict(c.grad)[a] = 1
dict(c.grad)[b] = 1

We saw these in our example above as:

$ \frac{\partial \bar{d} }{\partial a} = c = 7$

$ \frac{\partial \bar{d} }{\partial c} = a = 4$

$ \frac{\partial \bar{c} }{\partial a} = 1$

$ \frac{\partial \bar{c} }{\partial b} = 1$

A few small improvements

Let’s:

  • Enable the use of operators, such as +, *, - …
  • Add a few more functions.

class Ops:
    """Enables use of +, *, -, etc."""
    def __add__(self, other):
        return Add(self, other)
    
    def __mul__(self, other):
        return Mul(self, other)
    
    def __sub__(self, other):
        return Add(self, Neg(other))
    
    def __truediv__(self, other):
        return Mul(self, Inv(other))

class Var(Ops):
    def __init__(self, value):
        self.value = value
        
class Add(Ops):
    def __init__(self, a, b):
        self.value = a.value + b.value
        self.grad = [(a, 1), (b, 1)]

class Mul(Ops):
    def __init__(self, a, b):
        self.value = a.value * b.value
        self.grad = [(a, b.value), (b, a.value)]
    
class Neg(Ops):
    def __init__(self, var):
        self.value = -1 * var.value
        self.grad = [(var, -1)]

class Inv(Ops):
    def __init__(self, var):
        self.value = 1 / var.value
        self.grad = [(var, -var.value ** -2 )]

Some more examples

We can get the gradients of arbitrary functions made from the functions we’ve added to the framework. E.g.

a = Var(230.3)
b = Var(33.2)

def f(a, b):
    return (a / b - a) * (b / a + a + b) * (a - b)

y = f(a, b)

gradients = get_gradients(y)

print("The partial derivative of y with respect to a =", gradients[a])
print("The partial derivative of y with respect to b =", gradients[b])
The partial derivative of y with respect to a = -153284.83150602411
The partial derivative of y with respect to b = 3815.0389441500993

We can use numerical estimates to check that we’re getting correct results:

delta = Var(1e-8)
numerical_grad_a = (f(a + delta, b) - f(a, b)) / delta
numerical_grad_b = (f(a, b + delta) - f(a, b)) / delta
print("The numerical estimate for a =", numerical_grad_a.value)
print("The numerical estimate for b =", numerical_grad_b.value)
The numerical estimate for a = -153284.89243984222
The numerical estimate for b = 3815.069794654846

It’s easy to add more functions

You just need to be able to define the local derivatives.

import numpy as np

class Sin(Ops):
    def __init__(self, var):
        self.value = np.sin(var.value)
        self.grad = [(var, np.cos(var.value))]

class Exp(Ops):
    def __init__(self, var):
        self.value = np.exp(var.value)
        self.grad = [(var, self.value)]
        
class Log(Ops):
    def __init__(self, var):
        self.value = np.log(var.value)
        self.grad = [(var, 1. / var.value)]

Let’s check that these work:

a = Var(43.)
b = Var(3.)
c = Var(2.)

def f(a, b, c):
    f = Sin(a * b) + Exp(c - (a / b))
    return Log(f * f) * c

y = f(a, b, c)

gradients = get_gradients(y)

print("The partial derivative of y with respect to a =", gradients[a])
print("The partial derivative of y with respect to b =", gradients[b])
print("The partial derivative of y with respect to c =", gradients[c])
The partial derivative of y with respect to a = 60.85353612046653
The partial derivative of y with respect to b = 872.2331479536114
The partial derivative of y with respect to c = -3.2853671032530305

And checking numerically, as before:

delta = Var(1e-8)
numerical_grad_a = (f(a + delta, b, c) - f(a, b, c)) / delta
numerical_grad_b = (f(a, b + delta, c) - f(a, b, c)) / delta
numerical_grad_c = (f(a, b, c + delta) - f(a, b, c)) / delta

print("The numerical estimate for a =", numerical_grad_a.value)
print("The numerical estimate for b =", numerical_grad_b.value)
print("The numerical estimate for c =", numerical_grad_c.value)
The numerical estimate for a = 60.85352186602222
The numerical estimate for b = 872.232160009645
The numerical estimate for c = -3.285367089489455
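
It is just as easy to go further. For instance (a hypothetical addition, not used in the checks above), a cosine node needs only its value and local derivative:

class Cos(Ops):
    def __init__(self, var):
        self.value = np.cos(var.value)
        self.grad = [(var, -np.sin(var.value))]  # d/dx cos(x) = -sin(x)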

That’s the end of our minimal autodiff implementation!

Of course there’s various features missing, such as:

  • Vectorisation.
  • Placeholder variables.
  • Nth derivatives.
  • Optimisations.
  • All the other amazing stuff deep learning / autodiff frameworks can do.

The most fruitful addition to our minimal framework would be vectorisation (but not carried out in the manner that follows).

A naive vectorisation

We will look at an incredibly computationally inefficient way to vectorise our autodiff framework. (It is not recommended to use this for anything, because of how slow it is.)

The approach is to fill NumPy arrays with Var objects, so that NumPy's element-wise operations dispatch to the overloaded operators we defined above:

import numpy as np

to_var = np.vectorize(lambda x : Var(x))  # convert NumPy array into array of Var objects
to_vals = np.vectorize(lambda var : var.value)  # get values from array of Var objects
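
For example, a quick sanity check (a hypothetical snippet, relying on NumPy applying + and * element-wise to arrays of objects):

x = to_var(np.array([1., 2., 3.]))
w = to_var(np.array([4., 5., 6.]))
y = np.sum(x * w)  # element-wise Mul nodes, combined by np.sum into Add nodes
gradients = get_gradients(y)
print(y.value)                          # 32.0
print([gradients[node] for node in x])  # [4.0, 5.0, 6.0]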

A single linear layer artificial neural network (fitting noise to noise):

import matplotlib.pyplot as plt

def update_weights(weights, gradients, lrate):
    for i in range(weights.shape[0]):
        for j in range(weights.shape[1]):
            weights[i, j].value -= lrate * gradients[weights[i, j]]

np.random.seed(0)

input_size = 50
output_size = 10
lrate = 0.001

x = to_var(np.random.random(input_size))
y_true = to_var(np.random.random(output_size))
weights = to_var(np.random.random((input_size, output_size)))

loss_vals = []
for i in range(100):

    y_pred = np.dot(x, weights)

    loss = np.sum((y_true - y_pred) * (y_true - y_pred))
    loss_vals.append(loss.value)
    
    gradients = get_gradients(loss)

    update_weights(weights, gradients, lrate)

    
plt.plot(loss_vals)
plt.xlabel("Time step"); plt.ylabel("Loss"); plt.title("Single linear layer learning")
plt.show()

[Figure: plot of loss against time step, titled ‘Single linear layer learning’.]

Coming soon: a part two, where we look at how to vectorise our minimal framework more efficiently.
