From scratch: reverse-mode automatic differentiation (in Python)


Automatic differentiation is the foundation upon which deep learning frameworks lie. Deep learning models are typically trained using gradient-based techniques, and autodiff makes it easy to get gradients, even from enormous, complex models. ‘Reverse-mode autodiff’ is the autodiff method used by most deep learning frameworks, due to its efficiency and accuracy.

Let’s:

  • Look at how reverse-mode autodiff works.
  • Create a minimal autodiff framework in Python.

The small autodiff framework will deal with scalars. We can (and will later) use NumPy to vectorise it.

Note on terminology: from now on ‘autodiff’ will refer to ‘reverse-mode autodiff’. ‘Gradient’ is used loosely, but in this context generally means ‘first order partial derivative’.

How does autodiff work?

Let’s start with an example.

a = 4
b = 3
c = a + b  # = 4 + 3 = 7
d = a * c  # = 4 * 7 = 28

Q1: What is the gradient of $d$ with respect to $a$, i.e. $\frac{\partial{d}}{\partial{a}}$? (Go ahead and try this!)

Solving the ‘traditional’ way:

There are many ways to solve Q1, but let’s use the product rule, i.e. if $ y = x_1x_2 $ then $y’ = x_1’x_2 + x_1x_2’$. Applying it to $d = a * c$, where $c = a + b$:

$ \frac{\partial{d}}{\partial{a}} = \frac{\partial{a}}{\partial{a}} * c + a * \frac{\partial{c}}{\partial{a}} = c + a = (a + b) + a = 2a + b = 2 * 4 + 3 = 11 $

Phew… and if you wanted to know $\frac{\partial{d}}{\partial{b}}$ you’d have to carry out the process again.

Solving the autodiff way

We’ll now look at the autodiff way to solve Q1. Here is a figure:

[Figure: the example represented as a graph (left), and the same graph with the local derivatives on its edges (right).]

On the left we see the system represented as a graph. Each variable is a node; e.g. $d$ is the topmost node, and $a$ and $b$ are leaf nodes at the bottom.

On the right we see the system from autodiff’s point of view. Let’s call the values on the graph edges ‘local derivatives’. By using local derivatives and simple rules, we will be able to compute the derivatives that we want.

Here is the answer to Q1, calculated the autodiff way. Can you see how it relates to the figure?

$ \frac{\partial{d}}{\partial{a}} = \frac{\partial{\bar{d}}}{\partial{a}} + \frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{a}} = c + a * 1 = 7 + 4 = 11 $

We get this answer from the graph by finding the routes from $d$ to $a$ (not going against the dotted arrows), and then applying the following rules:

  • Multiply the edges of a route.
  • Add together the different routes.

The first route is straight from $d$ to $a$, which gives us the $\frac{\partial{\bar{d}}}{\partial{a}}$ term. The second route is from $d$ to $c$ to $a$, which gives us the term $\frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{a}}$.

Our autodiff implementation will go down the graph and compute the derivative of $d$ with respect to every sub-node, rather than just computing it for a particular node, as we have just done with $d$. Notice that we could compute the gradient of $d$ with respect to $c$, and $b$, without much more work.
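
For example, reading the routes for $c$ and $b$ off the same graph (a short worked check; the framework below produces the same values):

$ \frac{\partial{d}}{\partial{c}} = \frac{\partial{\bar{d}}}{\partial{c}} = a = 4 $

$ \frac{\partial{d}}{\partial{b}} = \frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{b}} = a * 1 = 4 $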

Local derivatives

We saw ‘local derivatives’ on the graph edges above, written in the form: $ \frac{\partial \bar{y} }{\partial x}$.

The bar is to convey that we are doing a simpler kind of differentiation.

In general: to get a local derivative, treat the variables going into a node as not being functions of other variables.

For example, recall that $d = a * c$. Then compare $ \frac{\partial{d}}{\partial{a}} = 2a + b $, to $ \frac{\partial \bar{d} }{\partial a} = c $. The local derivative $\frac{\partial \bar{d} }{\partial a} = c$, is obtained by treating $c$ as a constant before differentiating the expression for $d$.
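
Combining the local derivatives with the route rules recovers the full derivative we computed the traditional way:

$ \frac{\partial{\bar{d}}}{\partial{a}} + \frac{\partial{\bar{d}}}{\partial{c}} * \frac{\partial{\bar{c}}}{\partial{a}} = c + a * 1 = (a + b) + a = 2a + b $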

Local derivatives make our lives easier.

It is often easy to define the local derivatives of simple functions, and adding functions to the autodiff framework is easy if you know the local derivatives. E.g.

Addition: $n = a + b$. The local derivatives are: $\frac{\partial \bar{n} }{\partial a} = 1$ and $\frac{\partial \bar{n} }{\partial b} = 1$.

Multiplication: $n = a * b$. The local derivatives are: $\frac{\partial \bar{n} }{\partial a} = b$ and $\frac{\partial \bar{n} }{\partial b} = a$.

Let’s create the framework

Description of the implementation.

A node contains two pieces of data:

  • value: the scalar value of the node.
  • grad: the node’s children and the corresponding ‘local derivatives’.

The function get_gradients uses the nodes’ grad data to go through the graph recursively, computing the gradients. It uses the rules we saw above:

  • Multiply the edges of a route (into a node).
  • Add together the different routes (that lead to a node).

The tuples in stack (in get_gradients) are similar to the tuples in grad, but they contain the current route value instead of the local derivative value.

Implementing just enough to solve the example above:

from collections import defaultdict

class Var:
    """A leaf node (a node with no children)."""
    def __init__(self, value):
        self.value = value  # the scalar value of the node.

class Add:
    """The node that results from adding two nodes."""
    def __init__(self, a, b):
        self.value = a.value + b.value
        self.grad = [(a, 1), (b, 1)]  # child nodes & corresponding 'local derivatives'

class Mul:
    """The node that results from multiplying two nodes."""
    def __init__(self, a, b):
        self.value = a.value * b.value
        self.grad = [(a, b.value), (b, a.value)]

def get_gradients(parent_node):
    """Go down the graph, and compute derivative of `parent_node` with respect to each node."""
    gradients = defaultdict(lambda : 0)
    stack = parent_node.grad.copy()  # list of (node, route_value) tuples.
    while stack:
        node, route_value = stack.pop()
        gradients[node] += route_value  # "Add together the different routes."
        if not isinstance(node, Var): 
            # if the node has children, put them onto the stack.
            for child_node, child_route_value in node.grad:
                stack.append((child_node, child_route_value * route_value))  # "Multiply the edges of a route."
    return dict(gradients)

Solving the example above:

a = Var(4)
b = Var(3)
c = Add(a, b) # = 4 + 3 = 7
d = Mul(a, c) # = 4 * 7 = 28

gradients = get_gradients(d)

print('d.value =', d.value)
print("The partial derivative of d with respect to a =", gradients[a])
d.value = 28
The partial derivative of d with respect to a = 11

Success!

We also got gradients for the other nodes:

print('gradients[b] =', gradients[b])
print('gradients[c] =', gradients[c])
gradients[b] = 4
gradients[c] = 4

Let’s take a look at the grad values (the local derivatives):

print('dict(d.grad)[a] =', dict(d.grad)[a])
print('dict(d.grad)[c] =', dict(d.grad)[c])
print('dict(c.grad)[a] =', dict(c.grad)[a])
print('dict(c.grad)[b] =', dict(c.grad)[b])
dict(d.grad)[a] = 7
dict(d.grad)[c] = 4
dict(c.grad)[a] = 1
dict(c.grad)[b] = 1

We saw these in our example above as:

$ \frac{\partial \bar{d} }{\partial a} = c = 7$

$ \frac{\partial \bar{d} }{\partial c} = a = 4$

$ \frac{\partial \bar{c} }{\partial a} = 1$

$ \frac{\partial \bar{c} }{\partial b} = 1$

A few small improvements

Let’s:

  • Enable the use of operators, such as +, *, - …
  • Add a few more functions.

class Ops:
    """Enables use of +, *, -, etc."""
    def __add__(self, other):
        return Add(self, other)
    
    def __mul__(self, other):
        return Mul(self, other)
    
    def __sub__(self, other):
        return Add(self, Neg(other))
    
    def __truediv__(self, other):
        return Mul(self, Inv(other))

class Var(Ops):
    def __init__(self, value):
        self.value = value
        
class Add(Ops):
    def __init__(self, a, b):
        self.value = a.value + b.value
        self.grad = [(a, 1), (b, 1)]

class Mul(Ops):
    def __init__(self, a, b):
        self.value = a.value * b.value
        self.grad = [(a, b.value), (b, a.value)]
    
class Neg(Ops):
    def __init__(self, var):
        self.value = -1 * var.value
        self.grad = [(var, -1)]

class Inv(Ops):
    def __init__(self, var):
        self.value = 1 / var.value
        self.grad = [(var, -var.value ** -2 )]
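
As a quick sanity check (a small example of my own, not from the original article), division builds a Mul node over an Inv node, and get_gradients recovers the expected derivatives of $a / b$:

a = Var(4.)
b = Var(2.)
y = a / b  # = Mul(a, Inv(b)); y.value = 2.0

gradients = get_gradients(y)
print(gradients[a])  # 1 / b = 0.5
print(gradients[b])  # -a / b**2 = -1.0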

Some more examples

We can get the gradients of arbitrary functions made from the functions we’ve added to the framework. E.g.

a = Var(230.3)
b = Var(33.2)

def f(a, b):
    return (a / b - a) * (b / a + a + b) * (a - b)

y = f(a, b)

gradients = get_gradients(y)

print("The partial derivative of y with respect to a =", gradients[a])
print("The partial derivative of y with respect to b =", gradients[b])
The partial derivative of y with respect to a = -153284.83150602411
The partial derivative of y with respect to b = 3815.0389441500993

We can use numerical estimates to check that we’re getting correct results:

delta = Var(1e-8)
numerical_grad_a = (f(a + delta, b) - f(a, b)) / delta
numerical_grad_b = (f(a, b + delta) - f(a, b)) / delta
print("The numerical estimate for a =", numerical_grad_a.value)
print("The numerical estimate for b =", numerical_grad_b.value)
The numerical estimate for a = -153284.89243984222
The numerical estimate for b = 3815.069794654846

It’s easy to add more functions

You just need to be able to define the local derivatives.

import numpy as np

class Sin(Ops):
    def __init__(self, var):
        self.value = np.sin(var.value)
        self.grad = [(var, np.cos(var.value))]

class Exp(Ops):
    def __init__(self, var):
        self.value = np.exp(var.value)
        self.grad = [(var, self.value)]
        
class Log(Ops):
    def __init__(self, var):
        self.value = np.log(var.value)
        self.grad = [(var, 1. / var.value)]

Let’s check that these work:

a = Var(43.)
b = Var(3.)
c = Var(2.)

def f(a, b, c):
    f = Sin(a * b) + Exp(c - (a / b))
    return Log(f * f) * c

y = f(a, b, c)

gradients = get_gradients(y)

print("The partial derivative of y with respect to a =", gradients[a])
print("The partial derivative of y with respect to b =", gradients[b])
print("The partial derivative of y with respect to c =", gradients[c])
The partial derivative of y with respect to a = 60.85353612046653
The partial derivative of y with respect to b = 872.2331479536114
The partial derivative of y with respect to c = -3.2853671032530305
delta = Var(1e-8)
numerical_grad_a = (f(a + delta, b, c) - f(a, b, c)) / delta
numerical_grad_b = (f(a, b + delta, c) - f(a, b, c)) / delta
numerical_grad_c = (f(a, b, c + delta) - f(a, b, c)) / delta

print("The numerical estimate for a =", numerical_grad_a.value)
print("The numerical estimate for b =", numerical_grad_b.value)
print("The numerical estimate for c =", numerical_grad_c.value)
The numerical estimate for a = 60.85352186602222
The numerical estimate for b = 872.232160009645
The numerical estimate for c = -3.285367089489455
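
Following the same pattern, here is one more node as a sketch (a sigmoid of my own, not from the original article); only the value and the local derivative need to be defined:

class Sigmoid(Ops):
    def __init__(self, var):
        self.value = 1 / (1 + np.exp(-var.value))
        # local derivative of the logistic function: sigmoid(x) * (1 - sigmoid(x))
        self.grad = [(var, self.value * (1 - self.value))]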

That’s the end of our minimal autodiff implementation!

Of course there’s various features missing, such as:

  • Vectorisation.
  • Placeholder variables.
  • Nth derivatives.
  • Optimisations.
  • All the other amazing stuff deep learning / autodiff frameworks can do.

The most fruitful addition to our minimal framework would be vectorisation, though not carried out in the naive manner that follows.

A naive vectorisation

We will look at an incredibly computationally inefficient way to vectorise our autodiff framework. (It is not recommended to use this for anything, because of how slow it is.)

The approach is to wrap every scalar in a Var node and work with NumPy arrays of these objects; NumPy’s array operations then call our overloaded operators element-wise, building the same computation graph as before.

import numpy as np

to_var = np.vectorize(lambda x : Var(x))  # convert NumPy array into array of Var objects
to_vals = np.vectorize(lambda var : var.value)  # get values from array of Var objects
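
For instance (a small check of my own, not from the original), an element-wise product of two Var arrays followed by np.sum builds a single node, whose gradients can be read off as usual:

x = to_var(np.array([1., 2., 3.]))
w = to_var(np.array([4., 5., 6.]))

y = np.sum(x * w)  # element-wise Mul nodes chained into Add nodes; y.value = 32.0

gradients = get_gradients(y)
print(gradients[x[0]])  # = w[0].value = 4.0
print(gradients[w[2]])  # = x[2].value = 3.0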

A single linear layer artificial neural network (fitting noise to noise):

import matplotlib.pyplot as plt

def update_weights(weights, gradients, lrate):
    for i in range(weights.shape[0]):
        for j in range(weights.shape[1]):
            weights[i, j].value -= lrate * gradients[weights[i, j]]

np.random.seed(0)

input_size = 50
output_size = 10
lrate = 0.001

x = to_var(np.random.random(input_size))
y_true = to_var(np.random.random(output_size))
weights = to_var(np.random.random((input_size, output_size)))

loss_vals = []
for i in range(100):

    y_pred = np.dot(x, weights)

    loss = np.sum((y_true - y_pred) * (y_true - y_pred))
    loss_vals.append(loss.value)
    
    gradients = get_gradients(loss)

    update_weights(weights, gradients, lrate)

    
plt.plot(loss_vals)
plt.xlabel("Time step"); plt.ylabel("Loss"); plt.title("Single linear layer learning")
plt.show()

[Figure: the loss curve plotted by the code above (loss vs. time step, "Single linear layer learning").]
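
Once the loop finishes, the to_vals helper defined earlier can pull the trained weights back out as a plain float array (a small usage note, not from the original):

trained_weights = to_vals(weights)
print(trained_weights.shape)  # (50, 10), i.e. (input_size, output_size)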

Coming soon: a part two, where we look at how to vectorise our minimal framework more efficiently.
