Automatic Differentiation via Contour Integration

Might dendritic trees be used to compute partial derivatives? (Image taken from [5].)

Introduction:

Given the usefulness of partial derivatives for closed-loop control, it is natural to ask how large branching structures in the brain and other biological systems might compute derivatives. After some reflection I realised that an important result in complex analysis due to Cauchy, the Cauchy Integral Formula, may be used to compute derivatives with a simple forward propagation of signals using a Monte Carlo method.

In this article I introduce Cauchy’s formula, explain how it has a natural interpretation as an expected value, and demonstrate its reliability with concrete applications. These include gradient descent (also discovered by Cauchy), computing partial derivatives, and simulating a spring pendulum using the Euler-Lagrange equations.

Furthermore, I also address more conceptual issues towards the end of the article in the discussion section, where I provide a natural physical interpretation of complex numbers in terms of waves.

Note: While reading this article, you may find it helpful to experiment with the functions in the following Jupyter notebook.

Cauchy’s Integral Formula for derivatives:

Derivation of the formula:

The Cauchy Integral Formula for first derivatives may be stated as follows. Given a function $f: \mathbb{C} \rightarrow \mathbb{C}$ that is assumed to be holomorphic, and a simple, closed, piecewise-smooth and positively oriented curve $\gamma$ in $\mathbb{C}$ that encloses $z_0$:

\begin{equation} f'(z_0) = \frac{1}{2\pi i} \oint_{\gamma} \frac{f(z)}{(z-z_0)^2} \, dz \end{equation}

An important consequence of this definition is that (1) is applicable to any holomorphic function, which includes polynomials with complex coefficients, trigonometric functions and hyperbolic functions. The keyword here is polynomials. By Taylor’s theorem, any analytic real-valued function may be approximated arbitrarily well by polynomials, so Cauchy’s method is applicable to the large class of differentiable functions that admit such approximations.

In order to apply (1) numerically it is convenient to derive a simpler formulation using the unit circle as the boundary. By letting $\gamma = \{z_0 + e^{i\theta} : \theta \in [0, 2\pi]\}$, so that $z = z_0 + e^{i\theta}$ and $dz = i e^{i\theta} \, d\theta$, we obtain:

\begin{equation} \forall z_0 \in A, f'(z_0) = \frac{1}{2\pi} \int_{0}^{2\pi} f(z_0+e^{i\theta}) \cdot e^{-i\theta} \, d\theta \end{equation}

and if we are mainly interested in functions of a real variable then $z_0 = x \in \mathbb{R}$ and the domain $A$ is of the form:

\begin{equation} A = \{x + e^{i\theta}: x, \theta \in \mathbb{R}\} \end{equation}

Deterministic computation of derivatives:

Given (2), if $\delta$ denotes the angular sampling interval in the discrete case, then with $N = \text{round}(2\pi/\delta)$ and $\theta_n = n \cdot \delta$ we have the approximation

\begin{equation} f'(x) \approx \frac{1}{N} \sum_{n=1}^{N} f(x + e^{i\theta_n}) \cdot e^{-i\theta_n} \end{equation}

and we may implement this integration procedure in Julia as follows:

function nabla(f, x::Float64, delta::Float64)

    N = round(Int,2*pi/delta)
    thetas = collect(1:N)*delta

    ## collect arguments and rotations:
    rotations = map(theta -> exp(-im*theta),thetas)
    arguments = x .+ conj.(rotations)

    ## calculate the expectation:
    expectation = 1.0/N*real(sum(map(f,arguments).*rotations))

    return expectation

end
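As a quick sanity check (a hypothetical usage, not from the original post), the analytical answer for $f(x) = x^2$ at $x = 3$ is $2x = 6$:

## derivative of x^2 at x = 3.0, using a sampling interval of 0.05:
nabla(x -> x^2, 3.0, 0.05) ## ~ 6.0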

If you are familiar with Julia you may notice that, as with any finite summation, the result is invariant under permutation of the summands. The computation may therefore be organised in a tree-like structure where a large number of local computations occur in parallel. However, given that most biological networks are inherently noisy, some scientists may object that this program is deterministic.

In the next section I explain why this is generally a non-issue. On the contrary, random sampling using intrinsic noise allows cheap, fast and reliable Monte Carlo estimates of an integral.

Cauchy’s formula as a Monte Carlo method:

Interpretation as an Expected Value:

If we define the real-valued function:

\begin{equation} g: \theta \mapsto \text{Re}(f(x+e^{i\theta})\cdot e^{-i\theta}) \end{equation}

then by the intermediate-value theorem,

\begin{equation} \forall x \in \mathbb{R} \, \exists \theta^* \in [0,2\pi], g(\theta^*) = \text{Re}(f'(x)) \end{equation}

It follows that we may interpret $f'(x)$ as an expected value, and the factor $\frac{1}{2\pi}$ in (2) as the probability density function of a continuous uniform distribution on $[0, 2\pi]$. This may serve as a basis for cheap stochastic gradient estimates using Monte Carlo methods.
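Concretely, for $\theta_m$ drawn independently and uniformly from $[0, 2\pi]$, the Monte Carlo estimator is:

\begin{equation} f'(x) = \mathbb{E}[g(\theta)] \approx \frac{1}{M} \sum_{m=1}^{M} g(\theta_m) \end{equation}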

Monte Carlo estimates of the gradient:

It is worth noting that (2) isn’t a high-dimensional integral. Nevertheless, a biological network may be inherently noisy and face severe computational constraints, so a Monte Carlo method may be both useful and a lot more biologically plausible.

Using half the number of samples, this may be implemented in Julia with only minor changes:

function mc_nabla(f, x::Float64, delta::Float64)

    N = round(Int,2*pi/delta)

    ## sample, with replacement, only half the number of points:
    sample = rand(1:N,round(Int,N/2))
    thetas = sample*delta

    ## collect arguments and rotations:
    rotations = map(theta -> exp(-im*theta),thetas)
    arguments = x .+ conj.(rotations)

    ## calculate the expectation:
    expectation = 2.0/N*real(sum(map(f,arguments).*rotations))

    return expectation

end
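As a rough comparison (hypothetical usage), both estimators should agree with $\cos(0) = 1$ for $f = \sin$, with the Monte Carlo version showing some sampling noise:

## deterministic vs. stochastic estimates of d/dx sin(x) at x = 0:
println(nabla(sin, 0.0, 0.05))    ## ~ 1.0
println(mc_nabla(sin, 0.0, 0.05)) ## ~ 1.0, up to sampling noise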

I’d like to add that by sampling randomly using intrinsic noise we are taking into account the epistemic uncertainty of the biological network, which doesn’t generally know how to sample $\theta \in [0, 2\pi]$ optimally for arbitrary functions $f$. So if we assume bounded computational resources, this would generally be a better procedure to follow even in the absence of intrinsic noise.

Practical Applications:

Performing gradient descent:

Given the great importance of gradient descent in machine learning, the reader might wonder whether Cauchy’s definition of the complex derivative may be used to perform gradient descent. The answer is yes:

function gradient_descent(f, x_p::Float64, alpha::Float64, delta::Float64=0.05)

  ## delta is the sampling interval passed to nabla; take 100 steps:
  for i=1:100

    x_n = x_p - alpha*nabla(f,x_p,delta)
    x_p = x_n

  end

  return x_p

end

…which may be tested on concrete problems such as this one:

## function:
g(x) = (x-1)^2 + (x-2)^4 + (x-3)^6

## initial value: 
x_p = 5.0 

## learning rate: 
alpha = 0.01 

x_min = gradient_descent(g,x_p,alpha)

The reader should find a minimizer around $x \approx 2.17$, which Wolfram Alpha agrees with. A good sign.

Computing Partial Derivatives:

We may also use Cauchy’s formula to compute partial derivatives of functions of several variables such as:

\begin{equation} q(x,y,z) = x + y^2 + \cos(z) \end{equation}

Using the Kronecker delta, $\delta_{ij}$, we may mask all but one coordinate and thereby construct three functions of a single variable where the other variables are held constant. I could explain how to proceed, but I think this is better left as an exercise for the reader; a minimal sketch is given below for readers who want to check their work.
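In this sketch, the vectorised `q` and the helper `partial` are illustrative names rather than part of the original post, and `nabla` is the function defined above:

## Kronecker delta as a masking vector:
kd(i,n) = [j==i for j in 1:n]

## q as a function of a vector argument X = [x, y, z]:
q(X) = X[1] + X[2]^2 + cos(X[3])

## partial derivative of q with respect to its i-th argument at the point P:
function partial(q, P::Vector{Float64}, i::Int, delta::Float64)
    n = length(P)
    ## vary the i-th coordinate, holding the others fixed at P:
    slice = v -> q(v*kd(i,n) .+ P.*(ones(n)-kd(i,n)))
    return nabla(slice, P[i], delta)
end

## example: the partial derivative with respect to y at (1.0, 2.0, pi) is 2y = 4.0:
partial(q, [1.0, 2.0, pi], 2, 0.05)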

In any case, we are now ready to consider the more interesting challenge of simulating dynamical systems, which is very useful for an organism capable of counterfactual reasoning.

Simulating a Spring Pendulum using the Euler-Lagrange equations:

A plot of the angle against the angular velocity.

Let’s consider a pendulum with a mass $m$ on the end of a spring with stiffness $k$ and equilibrium length $l$. By doing a drawing, you may discover the auxiliary variable $x$, the extension of the spring:

\begin{equation} x = \Delta l \end{equation}

After a bit more reflection, you should find the following expressions for the kinetic energy $T$ and the potential energy $V$:

\begin{equation} T(\dot{x},\dot{\theta}) = \frac{1}{2}m(\dot{x}^2+(l+x)^2\dot{\theta}^2) \end{equation}

\begin{equation} V(x,\theta) = -mg(l+x)\cos(\theta) + \frac{1}{2}kx^2 \end{equation}

We may then define the Lagrangian:

\begin{equation} \mathcal{L}(x,\theta,\dot{x},\dot{\theta}) = T-V= \frac{1}{2}m(\dot{x}^2+(l+x)^2\dot{\theta}^2)+mg(l+x)\cos(\theta) - \frac{1}{2}kx^2 \end{equation}

which may be defined in Julia:

## The Lagrangian, where X = [x, theta, dx, dtheta] and we assume the unextended spring
## length is 1.0 m, the mass is 1.0 kg, the stiffness is 1.0 N/m and g = 9.8 m/s^2:
L(X) = 0.5*(X[3]^2+(1.0+X[1])^2*X[4]^2)+9.8*(1.0+X[1])*cos(X[2])-0.5*X[1]^2

Given initial conditions, we may simulate this pendulum system using the Euler-Lagrange equations:

\begin{equation} \frac{\partial \mathcal{L}}{\partial x} = \frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot{x}} \end{equation}

\begin{equation} \frac{\partial \mathcal{L}}{\partial \theta} = \frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot{\theta}} \end{equation}

where $\frac{\partial \mathcal{L}}{\partial x}$ and $\frac{\partial \mathcal{L}}{\partial \theta}$ describe the generalised forces acting on the system with respect to $x$ and $\theta$ respectively.
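For concreteness, differentiating the Lagrangian above gives:

\begin{equation} \frac{\partial \mathcal{L}}{\partial x} = m(l+x)\dot{\theta}^2 + mg\cos(\theta) - kx \end{equation}

\begin{equation} \frac{\partial \mathcal{L}}{\partial \theta} = -mg(l+x)\sin(\theta) \end{equation}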

The full simulation in Julia for 100 time-steps using the leapfrog method may then be defined:

## The Lagrangian, where X = [x, theta, dx, dtheta] and we assume the unextended spring
## length is 1.0 m, the mass is 1.0 kg, the stiffness is 1.0 N/m and g = 9.8 m/s^2:
L(X) = 0.5*(X[3]^2+(1.0+X[1])^2*X[4]^2)+9.8*(1.0+X[1])*cos(X[2])-0.5*X[1]^2

## Kronecker delta, which will come in handy:
kd(i,n) = [j==i for j in 1:n]

## angular sampling interval used by nabla:
delta = 0.05

## All four variables [x, theta, dx, dtheta] will be stored in this array:
Z = zeros(100,4)

## The initial conditions:
Z[1,:] = [1.0,pi/4,0.0,0.0]

## simulate the pendulum system:
for i=1:99

    ## partial derivatives of L, holding the other coordinates fixed at Z[i,:]:
    dd_x = v -> L(v*kd(1,4) .+ Z[i,:].*(ones(4)-kd(1,4)))
    dd_theta = v -> L(v*kd(2,4) .+ Z[i,:].*(ones(4)-kd(2,4)))

    ## update the spring extension:
    Z[i+1,1] = Z[i,1] + Z[i,3]*0.01 + 0.5*nabla(dd_x,Z[i,1],delta)*0.0001

    ## update the angle:
    Z[i+1,2] = Z[i,2] + Z[i,4]*0.01 + 0.5*nabla(dd_theta,Z[i,2],delta)*0.0001

    ## update dx and dtheta using the averaged forces (leapfrog):
    Z[i+1,3] = Z[i,3] + 0.5*(nabla(dd_x,Z[i,1],delta)+nabla(dd_x,Z[i+1,1],delta))*0.01
    Z[i+1,4] = Z[i,4] + 0.5*(nabla(dd_theta,Z[i,2],delta)+nabla(dd_theta,Z[i+1,2],delta))*0.01

end
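To reproduce a phase plot like the one above, one possibility (assuming the Plots.jl package, which is not part of the original listing) is:

## plot the angle against the angular velocity:
using Plots
plot(Z[:,2], Z[:,4], xlabel="theta", ylabel="dtheta/dt", legend=false)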

Discussion:

At this point the reader might still have questions. I can’t address all of them, but I will try to address the physical interpretation of complex numbers.

It’s worth noting that a complex number $z = r \cdot e^{i\theta}$ has both a magnitude $r$ and an angle $\theta$. Therefore multiplication by a complex number has a geometric interpretation as re-scaling by $r$ and rotation by $\theta$. This leads to a natural physical interpretation.

For any angle $\theta = \frac{2\pi}{N}$ we may define the frequency $N$, which specifies a sampling rate and may be physically interpreted as the propagation of a wave whose amplitude corresponds to $r$. So complex numbers may be represented in biological networks through the propagation of waves.

References:

  1. L.D. Landau & E.M. Lifshitz. Mechanics (Volume 1 of A Course of Theoretical Physics). Pergamon Press, 1969.
  2. W. Rudin. Real and Complex Analysis. 3rd ed., McGraw-Hill, 1986.
  3. Aidan Rocke. AutoDiff (2020). GitHub repository, https://github.com/AidanRocke/AutoDiff
  4. Aidan Rocke. Twitter thread. https://twitter.com/bayesianbrain/status/1202650626653597698
  5. Duncan E. Donohue & Giorgio A. Ascoli. A Comparative Computer Simulation of Dendritic Morphology. PLOS Computational Biology, June 6, 2008.
  6. Michael London & Michael Häusser. Dendritic Computation. Annu. Rev. Neurosci. 2005, 28:503-532.
  7. Warren S. McCulloch & Walter Pitts. A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biology, Vol. 52, No. 1/2, pp. 99-115, 1990.