From scratch: reverse-mode automatic differentiation (in Python)


Automatic differentiation is the foundation upon which deep learning frameworks rest. Deep learning models are typically trained using gradient-based techniques, and autodiff makes it easy to get gradients, even from enormous, complex models. ‘Reverse-mode autodiff’ is the autodiff method used by most deep learning frameworks, due to its efficiency and accuracy.

Let’s:

  • Look at how reverse-mode autodiff works.
  • Create a minimal autodiff framework in Python.

The small autodiff framework will deal with scalars. We can (and will later) use NumPy to vectorise it.

Note on terminology: from now on ‘autodiff’ will refer to ‘reverse-mode autodiff’. ‘Gradient’ is used loosely, but in this context generally means ‘first order partial derivative’.

How does autodiff work?

Let’s start with an example.

a = 4
b = 3
c = a + b  # = 4 + 3 = 7
d = a * c  # = 4 * 7 = 28

Q1: What is the gradient of $d$ with respect to $a$, i.e. $\frac{\partial{d}}{\partial{a}}$? (Go ahead and try this!)

Solving the ‘traditional’ way:

There are many ways to solve Q1, but let's use the product rule, i.e. if $y = x_1x_2$ then $y' = x_1'x_2 + x_1x_2'$.
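
Working the product rule through for our example, with $d = a * c$ and $c = a + b$:

$\frac{\partial{d}}{\partial{a}} = \frac{\partial{a}}{\partial{a}} * c + a * \frac{\partial{c}}{\partial{a}} = 1 * (a + b) + a * 1 = 2a + b = 2 * 4 + 3 = 11$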

Phew… and if you wanted to know $\frac{\partial{d}}{\partial{b}}$ you’d have to carry out the process again.

Solving the autodiff way

We’ll now look at the autodiff way to solve Q1. Here is a figure:

[Figure: the example represented as a graph of nodes (left), and the same graph with local derivatives on its edges (right).]

On the left we see the system represented as a graph. Each variable is a node; e.g. $d$ is the topmost node, and $a$ and $b$ are leaf nodes at the bottom.

On the right we see the system from autodiff’s point of view. Let’s call the values on the graph edges ‘local derivatives’. By using local derivatives and simple rules, we will be able to compute the derivatives that we want.

Here is the answer to Q1, calculated the autodiff way. Can you see how it relates to the figure?

$\frac{\partial{d}}{\partial{a}} = \frac{\partial{\bar{d}}}{\partial{a}} + \frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{a}} = c + a * 1 = 7 + 4 = 11$

We get this answer from the graph by finding the routes from $d$ to $a$ (not going against the dotted arrows), and then applying the following rules:

  • Multiply the edges of a route.
  • Add together the different routes.

The first route is straight from $d$ to $a$, which gives us the $\frac{\partial{\bar{d}}}{\partial{a}}$ term. The second route is from $d$ to $c$ to $a$, which gives us the term $\frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{a}}$.

Our autodiff implementation will go down the graph and compute the derivative of $d$ with respect to every sub-node, rather than computing it only for a particular node, as we just did for $a$. Notice that we could compute the gradients of $d$ with respect to $c$ and $b$ without much more work.
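
For example, applying the same route rules:

$\frac{\partial{d}}{\partial{c}} = \frac{\partial{\bar{d}}}{\partial{c}} = a = 4$

$\frac{\partial{d}}{\partial{b}} = \frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{b}} = a * 1 = 4$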

Local derivatives

We saw ‘local derivatives’ on the graph edges above, written in the form: $ \frac{\partial \bar{y} }{\partial x}$.

The bar is to convey that we are doing a simpler kind of differentiation.

In general: to get a local derivative, treat the variables going into a node as not being functions of other variables.

For example, recall that $d = a * c$. Compare $\frac{\partial{d}}{\partial{a}} = 2a + b$ with $\frac{\partial \bar{d} }{\partial a} = c$. The local derivative $\frac{\partial \bar{d} }{\partial a} = c$ is obtained by treating $c$ as a constant before differentiating the expression for $d$.
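
Numerically, with $a = 4$, $b = 3$ and $c = 7$, the full derivative is $\frac{\partial{d}}{\partial{a}} = 2a + b = 11$, while the local derivative is $\frac{\partial \bar{d} }{\partial a} = c = 7$: they are different quantities, and the route rules are what combine local derivatives into the full derivatives we want.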

Local derivatives make our lives easier.

It is often easy to define the local derivatives of simple functions, and adding functions to the autodiff framework is easy if you know the local derivatives. E.g.

Addition: $n = a + b$. The local derivatives are: $\frac{\partial \bar{n} }{\partial a} = 1$ and $\frac{\partial \bar{n} }{\partial b} = 1$.

Multiplication: $n = a * b$. The local derivatives are: $\frac{\partial \bar{n} }{\partial a} = b$ and $\frac{\partial \bar{n} }{\partial b} = a$.

Let’s create the framework

Description of the implementation.

A node contains two pieces of data:

  • value: the scalar value of the node.
  • grad: a list of (child node, local derivative) tuples. (Leaf Var nodes have no children, so they only store value.)

The function get_gradients uses the nodes' grad data to walk down the graph, computing the gradients. It uses the rules we saw above:

  • Multiply the edges of a route (into a node).
  • Add together the different routes (that lead to a node).

The tuples in stack (in get_gradients) are similar to the tuples in grad, but they pair each node with the current route value (the product of the local derivatives along the route so far), rather than with a single local derivative.
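
To make this concrete, here is a rough trace of get_gradients(d) for the example above (using the node values from the code below):

stack = [(a, 7), (c, 4)]               # d.grad: the local derivatives of d w.r.t. a and c
pop (c, 4): gradients[c] = 4           # c.grad = [(a, 1), (b, 1)], so push (a, 1*4) and (b, 1*4)
pop (b, 4): gradients[b] = 4           # b is a Var, nothing to push
pop (a, 4): gradients[a] = 4           # a is a Var, nothing to push
pop (a, 7): gradients[a] = 4 + 7 = 11  # the two routes to a are added together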

Implementing just enough to solve the example above:

from collections import defaultdict

class Var:
    """A leaf node (a node with no children)."""
    def __init__(self, value):
        self.value = value  # the scalar value of the node.

class Add:
    """The node that results from adding two nodes."""
    def __init__(self, a, b):
        self.value = a.value + b.value
        self.grad = [(a, 1), (b, 1)]  # child nodes & corresponding 'local derivatives'

class Mul:
    """The node that results from multiplying two nodes."""
    def __init__(self, a, b):
        self.value = a.value * b.value
        self.grad = [(a, b.value), (b, a.value)]

def get_gradients(parent_node):
    """Go down the graph, and compute derivative of `parent_node` with respect to each node."""
    gradients = defaultdict(lambda : 0)
    stack = parent_node.grad.copy()  # list of (node, route_value) tuples.
    while stack:
        node, route_value = stack.pop()
        gradients[node] += route_value  # "Add together the different routes."
        if not isinstance(node, Var): 
            # if the node has children, put them onto the stack.
            for child_node, child_route_value in node.grad:
                stack.append((child_node, child_route_value * route_value))  # "Multiply the edges of a route."
    return dict(gradients)

Solving the example above:

a = Var(4)
b = Var(3)
c = Add(a, b) # = 4 + 3 = 7
d = Mul(a, c) # = 4 * 7 = 28

gradients = get_gradients(d)

print('d.value =', d.value)
print("The partial derivative of d with respect to a =", gradients[a])
d.value = 28
The partial derivative of d with respect to a = 11

Success!

We also got gradients for the other nodes:

print('gradients[b] =', gradients[b])
print('gradients[c] =', gradients[c])
gradients[b] = 4
gradients[c] = 4

Let’s take a look at the grad values (the local derivatives):

print('dict(d.grad)[a] =', dict(d.grad)[a])
print('dict(d.grad)[c] =', dict(d.grad)[c])
print('dict(c.grad)[a] =', dict(c.grad)[a])
print('dict(c.grad)[b] =', dict(c.grad)[b])
dict(d.grad)[a] = 7
dict(d.grad)[c] = 4
dict(c.grad)[a] = 1
dict(c.grad)[b] = 1

We saw these in our example above as:

$ \frac{\partial \bar{d} }{\partial a} = c = 7$

$ \frac{\partial \bar{d} }{\partial c} = a = 4$

$ \frac{\partial \bar{c} }{\partial a} = 1$

$ \frac{\partial \bar{c} }{\partial b} = 1$

A few small improvements

Let’s:

  • Enable the use of operators, such as +, *, - …
  • Add a few more functions.

class Ops:
    """Enables use of +, *, -, etc."""
    def __add__(self, other):
        return Add(self, other)
    
    def __mul__(self, other):
        return Mul(self, other)
    
    def __sub__(self, other):
        return Add(self, Neg(other))
    
    def __truediv__(self, other):
        return Mul(self, Inv(other))

class Var(Ops):
    def __init__(self, value):
        self.value = value
        
class Add(Ops):
    def __init__(self, a, b):
        self.value = a.value + b.value
        self.grad = [(a, 1), (b, 1)]

class Mul(Ops):
    def __init__(self, a, b):
        self.value = a.value * b.value
        self.grad = [(a, b.value), (b, a.value)]
    
class Neg(Ops):
    def __init__(self, var):
        self.value = -1 * var.value
        self.grad = [(var, -1)]

class Inv(Ops):
    def __init__(self, var):
        self.value = 1 / var.value
        self.grad = [(var, -var.value ** -2 )]

Some more examples

We can get the gradients of arbitrary functions made from the functions we’ve added to the framework. E.g.

a = Var(230.3)
b = Var(33.2)

def f(a, b):
    return (a / b - a) * (b / a + a + b) * (a - b)

y = f(a, b)

gradients = get_gradients(y)

print("The partial derivative of y with respect to a =", gradients[a])
print("The partial derivative of y with respect to b =", gradients[b])
The partial derivative of y with respect to a = -153284.83150602411
The partial derivative of y with respect to b = 3815.0389441500993

We can use numerical estimates to check that we’re getting correct results:

delta = Var(1e-8)
numerical_grad_a = (f(a + delta, b) - f(a, b)) / delta
numerical_grad_b = (f(a, b + delta) - f(a, b)) / delta
print("The numerical estimate for a =", numerical_grad_a.value)
print("The numerical estimate for b =", numerical_grad_b.value)
The numerical estimate for a = -153284.89243984222
The numerical estimate for b = 3815.069794654846

It’s easy to add more functions

You just need to be able to define the local derivatives.

import numpy as np

class Sin(Ops):
    def __init__(self, var):
        self.value = np.sin(var.value)
        self.grad = [(var, np.cos(var.value))]

class Exp(Ops):
    def __init__(self, var):
        self.value = np.exp(var.value)
        self.grad = [(var, self.value)]
        
class Log(Ops):
    def __init__(self, var):
        self.value = np.log(var.value)
        self.grad = [(var, 1. / var.value)]

Let’s check that these work:

a = Var(43.)
b = Var(3.)
c = Var(2.)

def f(a, b, c):
    f = Sin(a * b) + Exp(c - (a / b))
    return Log(f * f) * c

y = f(a, b, c)

gradients = get_gradients(y)

print("The partial derivative of y with respect to a =", gradients[a])
print("The partial derivative of y with respect to b =", gradients[b])
print("The partial derivative of y with respect to c =", gradients[c])
The partial derivative of y with respect to a = 60.85353612046653
The partial derivative of y with respect to b = 872.2331479536114
The partial derivative of y with respect to c = -3.2853671032530305

And checking numerically, as before:

delta = Var(1e-8)
numerical_grad_a = (f(a + delta, b, c) - f(a, b, c)) / delta
numerical_grad_b = (f(a, b + delta, c) - f(a, b, c)) / delta
numerical_grad_c = (f(a, b, c + delta) - f(a, b, c)) / delta

print("The numerical estimate for a =", numerical_grad_a.value)
print("The numerical estimate for b =", numerical_grad_b.value)
print("The numerical estimate for c =", numerical_grad_c.value)
The numerical estimate for a = 60.85352186602222
The numerical estimate for b = 872.232160009645
The numerical estimate for c = -3.285367089489455
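
It is just as easy to go further. For instance (a hypothetical addition, not used in the checks above), a cosine node needs only its value and local derivative:

class Cos(Ops):
    def __init__(self, var):
        self.value = np.cos(var.value)
        self.grad = [(var, -np.sin(var.value))]  # d/dx cos(x) = -sin(x)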

That’s the end of our minimal autodiff implementation!

Of course there’s various features missing, such as:

  • Vectorisation.
  • Placeholder variables.
  • Nth derivatives.
  • Optimisations.
  • All the other amazing stuff deep learning / autodiff frameworks can do.

The most fruitful addition to our minimal framework would be vectorisation (but not carried out in the manner that follows).

A naive vectorisation

We will look at an incredibly computationally inefficient way to vectorise our autodiff framework. (It is not recommended to use this for anything, because of how slow it is.)

The approach is to fill NumPy arrays with Var objects, so that NumPy's element-wise operations dispatch to the overloaded operators we defined above:

import numpy as np

to_var = np.vectorize(lambda x : Var(x))  # convert NumPy array into array of Var objects
to_vals = np.vectorize(lambda var : var.value)  # get values from array of Var objects
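
For example, a quick sanity check (a hypothetical snippet, relying on NumPy applying + and * element-wise to arrays of objects):

x = to_var(np.array([1., 2., 3.]))
w = to_var(np.array([4., 5., 6.]))
y = np.sum(x * w)  # element-wise Mul nodes, combined by np.sum into Add nodes
gradients = get_gradients(y)
print(y.value)                          # 32.0
print([gradients[node] for node in x])  # [4.0, 5.0, 6.0]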

A single linear layer artificial neural network (fitting noise to noise):

import matplotlib.pyplot as plt

def update_weights(weights, gradients, lrate):
    for i in range(weights.shape[0]):
        for j in range(weights.shape[1]):
            weights[i, j].value -= lrate * gradients[weights[i, j]]

np.random.seed(0)

input_size = 50
output_size = 10
lrate = 0.001

x = to_var(np.random.random(input_size))
y_true = to_var(np.random.random(output_size))
weights = to_var(np.random.random((input_size, output_size)))

loss_vals = []
for i in range(100):

    y_pred = np.dot(x, weights)

    loss = np.sum((y_true - y_pred) * (y_true - y_pred))
    loss_vals.append(loss.value)
    
    gradients = get_gradients(loss)

    update_weights(weights, gradients, lrate)

    
plt.plot(loss_vals)
plt.xlabel("Time step"); plt.ylabel("Loss"); plt.title("Single linear layer learning")
plt.show()

[Figure: plot of loss against time step, titled ‘Single linear layer learning’.]

Coming soon: a part two, where we look at how to vectorise our minimal framework more efficiently.
