ForwardDiff.jl
Forward Mode Automatic Differentiation for Julia
We present ForwardDiff, a Julia package for forward-mode automatic differentiation (AD) featuring performance competitive with low-level languages like C++. Unlike recently developed AD tools in other popular high-level languages such as Python and MATLAB [1, 2, 3], ForwardDiff takes advantage of just-in-time (JIT) compilation [4] to transparently recompile AD-unaware user code, enabling efficient support for higher-order differentiation and differentiation using custom number types (including complex numbers). For gradient and Jacobian calculations, ForwardDiff provides a variant of vector-forward mode that avoids expensive heap allocation and makes better use of memory bandwidth than traditional vector mode.
In our numerical experiments, we demonstrate that for nontrivially large dimensions, ForwardDiff's gradient computations can be faster than a reverse-mode implementation from the Python-based autograd package. We also illustrate how ForwardDiff is used effectively within JuMP [5], a modeling language for optimization. According to our usage statistics, 41 unique repositories on GitHub depend on ForwardDiff, with users from diverse fields such as astronomy, optimization, finite element analysis, and statistics. This document is an extended abstract that has been accepted for presentation at AD2016, the 7th International Conference on Algorithmic Differentiation.
ForwardDiff implements a Julia representation of a multidimensional dual number, whose behavior on scalar functions is defined as:

f\left(x + \sum_{i=1}^{k} y_i \epsilon_i\right) = f(x) + f'(x) \sum_{i=1}^{k} y_i \epsilon_i \qquad (1)

where \epsilon_i \epsilon_j = 0 for all indices i and j. Storing the additional \epsilon components allows for a vector forward-mode implementation of the sort developed by Khan and Barton [6]. In our formulation, orthogonal \epsilon components are appended to input vector components to track their individual directional derivatives:

f(x_1 + \epsilon_1, \ldots, x_k + \epsilon_k) = f(x) + \sum_{i=1}^{k} \frac{\partial f}{\partial x_i}(x) \, \epsilon_i \qquad (2)
Vector forward mode enables the calculation of entire gradients in a single pass of the program defining f, but at the cost of additional memory and operations. Specifically, every dual number must allocate an \epsilon vector equal in size to the input vector, and the number of operations required for derivative propagation scales linearly with the input dimension. In practice, especially in memory-managed languages like Julia, the cost of rapidly allocating and deallocating large \epsilon vectors on the heap can lead to slowdowns that practically outweigh the advantage of fewer passes through f.
ForwardDiff's implementation works around this pitfall by stack-allocating the \epsilon vectors, as well as permitting their size to be tunable at runtime relative to the input dimension and performance characteristics of the target function. We call ForwardDiff's strategy chunk mode, since it allows us to compute the gradient in bigger or smaller chunks of the input vector. The \epsilon vector length is then the chunk size of the computation. For a chunk size N and an input vector of length k, it takes \lceil k/N \rceil passes through f to compute \nabla f(x). For example, it takes two passes through f to evaluate the gradient at a vector of length 4 with chunk size 2:
f(x_1 + \epsilon_1, x_2 + \epsilon_2, x_3, x_4) = f(x) + \frac{\partial f}{\partial x_1}(x)\,\epsilon_1 + \frac{\partial f}{\partial x_2}(x)\,\epsilon_2
f(x_1, x_2, x_3 + \epsilon_1, x_4 + \epsilon_2) = f(x) + \frac{\partial f}{\partial x_3}(x)\,\epsilon_1 + \frac{\partial f}{\partial x_4}(x)\,\epsilon_2 \qquad (3)
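As a usage illustration, here is a minimal sketch of selecting a chunk size with the current ForwardDiff API (GradientConfig and Chunk are the names used by recent ForwardDiff releases; the interface available when this abstract was written differed):

```julia
using ForwardDiff

f(x) = sum(abs2, x)   # example target function
x = rand(4)

# Request a fixed chunk size of N = 2, so this k = 4 gradient
# is computed in ceil(4/2) = 2 passes through f.
cfg = ForwardDiff.GradientConfig(f, x, ForwardDiff.Chunk{2}())
ForwardDiff.gradient(f, x, cfg)
```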
ForwardDiff implements a multidimensional dual number as the type Dual{N,T}, where the type parameter N denotes the length of the \epsilon vector and the type parameter T denotes the element type (e.g. Dual{2,Float64} has two Float64 \epsilon components). This type has two fields: value, which stores the scalar component, and partials, which stores the stack-allocated \epsilon vector. It is straightforward to overload base Julia methods on the Dual type; here is an example using sin, cos, and - (univariate negation):
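The following is a minimal sketch of such overloads; ForwardDiff's actual definitions differ in detail (for instance, the real package wraps the \epsilon components in a dedicated Partials type), but the structure is the same:

```julia
# Sketch of a multidimensional dual number (not ForwardDiff's exact definition).
struct Dual{N,T<:Real} <: Real
    value::T               # scalar component
    partials::NTuple{N,T}  # stack-allocated ε components
end

# Each rule applies f(a + ε) = f(a) + f′(a)ε to every ε component.
Base.sin(d::Dual) = Dual(sin(d.value), map(p -> cos(d.value) * p, d.partials))
Base.cos(d::Dual) = Dual(cos(d.value), map(p -> -sin(d.value) * p, d.partials))
Base.:-(d::Dual)  = Dual(-d.value, map(-, d.partials))
```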
These method definitions are all that is required to support the following features:
arbitrary-order derivatives of sin or cos (through nesting Dual types)
derivatives of complex sin or cos via types of the form Complex{Dual{N,T}}
derivatives of sin or cos over custom types, e.g. Custom{Dual{N,T}} or Dual{N,Custom}
We unfortunately do not have room in this abstract to adequately cover the latter two items; a proper discussion would require a more thorough exposition of Julia's multiple dispatch and JIT-compilation facilities.
Instead, we discuss how instances of the Dual type can be nested to enable the use of vector-mode AD for higher-order derivatives. For example, the type Dual{M,Dual{N,T}} can be used to compute M × N second-order derivatives. As a simple demonstration of the scalar case, we use an instance of the type Dual{1,Dual{1,Float64}} to take the second derivative of sin at the Julia prompt (the notation \epsilon[d,k] is used to denote the kth \epsilon partial nested at level d):
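Continuing the sketch above, a second derivative of sin can be computed as follows. Nesting additionally requires the product rule, since the chain rules above multiply Dual-valued derivatives by Dual-valued partials; this is a hypothetical illustration, not ForwardDiff's actual session output:

```julia
# Product rule, needed once Duals are nested inside Duals.
Base.:*(a::Dual{N}, b::Dual{N}) where {N} =
    Dual(a.value * b.value,
         map((pa, pb) -> pa * b.value + a.value * pb, a.partials, b.partials))

# x = 1.0, seeding the inner ε for the first derivative and
# the outer ε for the second derivative.
x = Dual(Dual(1.0, (1.0,)), (Dual(1.0, (0.0,)),))
d = sin(x)

d.value.value              # sin(1.0)  ≈  0.8414709848078965
d.value.partials[1]        # cos(1.0)  ≈  0.5403023058681398 (first derivative)
d.partials[1].partials[1]  # -sin(1.0) ≈ -0.8414709848078965 (second derivative)
```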
Algebraically, the above example is equivalent to the use of hyper-dual numbers described by Fike and Alonso [7]. In fact, a Dual instance with d levels of nesting implements a dth-order hyper-dual number, with the added advantage of scaling to arbitrary dimensions. For example, an instance of Dual{M,Dual{N,Dual{L,T}}} can be used to take M × N × L third-order derivatives in a single pass of the target function.
In this section, we present timing results for gradient calculations of the Rosenbrock (4) and Ackley (5) functions. Recalling (1) and the discussion in Section 2, increasing the chunk size reduces the number of evaluations of the univariate functions within f. We choose Ackley and Rosenbrock as our target functions in order to provide a contrast between the relative gains of increasing chunk sizes when the target function contains many and few expensive univariate functions, respectively.
f(x) = \sum_{i=1}^{k-1} \left[ 100\,(x_{i+1} - x_i^2)^2 + (1 - x_i)^2 \right] \qquad (4)
f(x) = -a \exp\left(-b \sqrt{\tfrac{1}{k} \sum_{i=1}^{k} x_i^2}\right) - \exp\left(\tfrac{1}{k} \sum_{i=1}^{k} \cos(c\,x_i)\right) + a + e, \qquad a = 20,\; b = 0.2,\; c = 2\pi \qquad (5)
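As a concrete reference, the two target functions can be written in Julia as follows. These are the standard textbook forms with the usual Ackley constants; the benchmark repository linked below contains the exact definitions used:

```julia
# Rosenbrock: few expensive univariate operations per term.
rosenbrock(x) = sum(100 * (x[i+1] - x[i]^2)^2 + (1 - x[i])^2 for i in 1:length(x)-1)

# Ackley: many expensive univariate operations (exp, sqrt, cos).
function ackley(x; a = 20.0, b = 0.2, c = 2π)
    k = length(x)
    -a * exp(-b * sqrt(sum(abs2, x) / k)) -
        exp(sum(cos(c * xi) for xi in x) / k) + a + exp(1)
end
```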
Table 1 shows evaluation times for calculating gradients of our two target functions using ForwardDiff versus a naive equivalent C++ implementation. Various chunk sizes N were tested, while the input size was held fixed. For the sake of simplicity, ForwardDiff's Dual{N,T} type was translated into a hardcoded C++ class for each N.
[Table 1: gradient evaluation times for ForwardDiff versus the C++ implementation across chunk sizes N]
Table 1 helps illustrate that the optimal chunk size for a given problem is a result of a trade-off between memory bandwidth, memory alignment, cache performance, and function evaluation cost. For example, note that ForwardDiff's performance can worsen when moving from one chunk size to the next larger one, and that the C++ implementation's performance with both functions worsens when going from N = 1 to N = 2. The former observation is likely due to the large memory bandwidth cost relative to the cost of the arithmetic operations, while the latter observation is likely due to poor alignment of the input vector (since each 2-dimensional dual number is essentially a struct of three double values: one for the instance value, and two for the partial components).
Table 2: Gradient evaluation times for autograd (reverse mode) versus ForwardDiff (forward mode).

Function    | Input Size | autograd Time (s) | ForwardDiff Time (s) | ForwardDiff Multithreaded Time (s)
------------|------------|-------------------|----------------------|-----------------------------------
Ackley      | 10         | 0.001204          | 0.000007             | 0.000007
Ackley      | 100        | 0.008472          | 0.000058             | 0.000056
Ackley      | 1000       | 0.081499          | 0.006351             | 0.002620
Ackley      | 10000      | 0.835441          | 0.564828             | 0.253798
Ackley      | 100000     | 8.361769          | 56.850198            | 24.394373
Rosenbrock  | 10         | 0.000866          | 0.000003             | 0.000003
Rosenbrock  | 100        | 0.004395          | 0.000034             | 0.000041
Rosenbrock  | 1000       | 0.040702          | 0.003010             | 0.001582
Rosenbrock  | 10000      | 0.411095          | 0.302277             | 0.159703
Rosenbrock  | 100000     | 4.173851          | 30.365882            | 14.111776
Table 2 compares the gradient computation time of the reverse-mode implementation of the Python-based autograd package versus the forward-mode implementation of ForwardDiff for varying input sizes. We also include results obtained using our experimental multithreaded implementation, which show a 2x speedup using 4 threads compared to our single-threaded implementation.
Both functions have linear complexity in the input dimension k; therefore reverse mode, which requires a constant number of passes through each function, scales linearly, while our forward mode, which requires \lceil k/N \rceil passes through each function (with fixed chunk size N), scales quadratically. The results in Table 2 agree with this complexity analysis. Nevertheless, there is a huge performance gap between these two implementations, such that autograd is slower on these examples when k ≤ 10,000, despite reverse mode being a superior algorithm in principle for computing gradients.
The code used to generate the timings in this section can be found at https://github.com/JuliaDiff/ForwardDiff.jl/tree/jr/benchmarks/benchmark. Julia benchmarks were run using Julia version 0.5.0-dev+3200, C++ benchmarks were compiled with clang-600.0.57 using -O2, and Python benchmarks were run using Python version 2.7.9.
Effective use of ForwardDiff has brought improvements to JuMP [5], a domain-specific language for optimization embedded in Julia where users provide closed-form algebraic expressions using a specialized syntax. JuMP, and similar commercial tools like AMPL [8], compute derivatives of user models as input to nonlinear optimization solvers, which is quite different from ForwardDiff's original target case of differentiating general user-defined code.
JuMP computes sparse Hessians by using the graph coloring approach of [9], which requires computing a small number of Hessian-vector products in order to recover the full Hessian. JuMP's forward-over-reverse mode implementation for Hessian-vector products makes use of ForwardDiff's chunk mode, essentially computing Hessian-matrix products instead of independent Hessian-vector products. This use of chunk mode yielded speedups of 30% on benchmarks presented in [5] (under review). The results in [5] include this speedup but are not accompanied by a discussion of the methodology of chunk mode.
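JuMP's forward-over-reverse implementation is internal to JuMP, but the underlying idea of nesting differentiation to obtain Hessian-vector products can be illustrated with ForwardDiff alone. The sketch below uses forward-over-forward mode (not JuMP's actual forward-over-reverse code), with a hypothetical helper hvp:

```julia
using ForwardDiff

# Hessian-vector product H(x) * v as the directional derivative of ∇f at x
# along v, i.e. d/dt ∇f(x + t v) evaluated at t = 0. ForwardDiff handles
# the nested Dual numbers transparently.
hvp(f, x, v) = ForwardDiff.derivative(t -> ForwardDiff.gradient(f, x .+ t .* v), 0.0)

f(x) = sum(abs2, x) + x[1] * x[2]
x, v = rand(3), rand(3)
hvp(f, x, v)   # matches ForwardDiff.hessian(f, x) * v
```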
On the user-facing side, ForwardDiff has enabled JuMP to be, to our knowledge, the first algebraic modeling language (AML) to perform automatic differentiation of user-defined functions embedded within closed-form expressions. We reproduce an example from [5] illustrating a user-defined square root function within a JuMP optimization model:
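The following is a sketch of such a model in the JuMP 0.x syntax of the period; the registration API and macro names have since changed, so treat this as illustrative rather than the verbatim listing from [5]:

```julia
using JuMP

# A user-defined square root computed iteratively by Newton's method;
# there is no closed-form algebraic expression for JuMP to inspect,
# so its derivatives are obtained via ForwardDiff.
function squareroot(x)
    z = x  # Newton's method starting point
    while abs(z * z - x) > 1e-13
        z = z - (z * z - x) / (2z)
    end
    return z
end

m = Model()
JuMP.register(:squareroot, 1, squareroot, autodiff=true)
@variable(m, x[1:2], start=0.5)
@objective(m, Max, x[1] + x[2])
@NLconstraint(m, squareroot(x[1]^2 + x[2]^2) <= 1)
# solving requires an NLP solver such as Ipopt
```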
JuMP computes gradients of squareroot with ForwardDiff, which are then integrated within the reverse-mode computations of JuMP. We do not yet support second-order derivatives of user-defined functions. While this functionality is immature and leaves room for improvement (we could attempt to tape the user-defined functions and calculate their gradients in reverse mode), it already creates a new and useful way for JuMP users to seamlessly interact with AD when small parts of their model cannot easily be expressed in closed algebraic form.
We are currently investigating several avenues of research that could improve ForwardDiff's performance and usability. We are in the preliminary phases of implementing SIMD vectorization of derivative propagation. We intend to address perturbation confusion [10] by intercepting unwanted perturbations at compile time using Julia's metaprogramming capabilities. Finally, we wish to improve our support for matrix operations such as eigenvalue computations by directly overloading linear algebra functions, a technique which has already seen use in [2].

References

[2] D. Maclaurin, D. Duvenaud, and R. P. Adams. Gradient-based Hyperparameter Optimization through Reversible Learning. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Proceedings, pages 2113–2122. JMLR.org, 2015.

[5] I. Dunning, J. Huchette, and M. Lubin. JuMP: A Modeling Language for Mathematical Optimization. arXiv e-prints, August 2015.