Capstan.jl
A Cassette-based automatic differentiation package for the Julia language
view repo
We show how forward-mode automatic differentiation (AD) can be employed within larger reverse-mode computations to dynamically differentiate broadcast operations in a GPU-friendly manner. Our technique fully exploits the broadcast Jacobian's inherent sparsity structure, and unlike a pure reverse-mode approach, this "mixed-mode" approach does not require a backwards pass over the broadcasted operation's subgraph, obviating the need for several reverse-mode-specific programmability restrictions on user-authored broadcast operations. Most notably, this approach allows broadcast fusion in primal code despite the presence of data-dependent control flow. We discuss an experiment in which a Julia implementation of our technique outperformed pure reverse-mode TensorFlow and Julia implementations for differentiating through broadcast operations within an HM-LSTM cell update calculation.
READ FULL TEXT VIEW PDFA Cassette-based automatic differentiation package for the Julia language
In recent years, the prevalence of gradient-based optimization in machine learning (ML) has motivated an upsurge in the development of ML-specific modeling languages that incorporate automatic differentiation (AD) as a fundamental feature. However, contemporary ML research routinely seeks to utilize new modeling and optimization techniques that push these frameworks’ AD capabilities to - and past - their limit. Both practical and exploratory implementations of such techniques demand advanced features such as nested differentiation, differentiation through data-dependent control flow, domain-specific hardware specialization, distributed parallelism, checkpointing, and more
maclaurin2015hyperparameter ; chen2018tvm ; lacey2016fpga ; li2018adhalide ; baydin2017adsurvey ; paszke2017pytorch ; abadi2016tensorflow .In the pursuit of solutions capable of incorporating such features, it has become clear that modeling languages’ expressiveness must necessarily be constrained for the sake of differentiability. Recent endeavors wei2017dlvm ; breuleux2017myia ; frostig2018jax ; elliott2018categories ; wei2018manifesto ; innes2018zygote ; pytorch2018scriptdocs ; shaikhha2018diffprogarray that explore this tradeoff have been strongly guided by well-established methods from programming language (PL) research, provoking the evolution of a new research area known as differentiable programming. This is quite a natural development, as the narrative of traditional AD research has always been richly intertwined with PL and mathematical programming research.
Particularly relevant in this regard is the work of Siskind and Pearlmutter, whose stated vision aligns surprisingly well with the goals of contemporary differentiable programming research: “Our vision: …a unified intermediate language that supports both compiler optimizations and AD transformations for a variety of source and target languages.” siskind2008vlad When viewed through the lens of naumann2000oja , it can be concluded that “optimal” differentiation of such a language cannot be achieved via pure forward- or reverse-mode approaches, but rather demands a mixed-mode approach. Achieving optimality, in this case, is often defined as minimizing the number of multiply-adds required to differentiate a given program via selecting the optimal mode for each subregion, and is known as the Optimal Jacobian Accumulation (OJA) problem. This problem has been shown to be NP-complete naumann2008npcompleteoja .
Despite this general theoretical intractability, mixed-mode AD offers a host of other advantages that can still be leveraged in practice by heuristically exploiting the local structure of the target language’s primitive operations. This paper’s primary concern is the application of this idea to a very common scientific computing primitive: the
broadcast operation vanderwalt2011numpyarray . In §2, we present the broadcast operation and a reverse-mode-interleavable forward-mode method for its differentiation that exploits the special sparsity structure of its total Jacobian. In §3, we discuss an experiment that demonstrates our method’s superiority over pure reverse-mode approaches for the differentiation of a data-dependent HM-LSTM cell update calculation on the GPU. Via this paper, we wish to motivate the development of a new generation of differentiating compilers that do not operate purely in the forward or reverse modes, but rather choose the optimal mode for each target subprogram when such a choice is naively determinable from local structure.Throughout this paper, we will append a period to a function to denote broadcasting that function over its arguments. We define the broadcast of a function as:
Here, the arguments are multidimensional arrays of arbitrary shape ^{1}^{1}1Note that scalars and single-element arrays are equivalent under this definition of broadcast., subject to the constraint that each dimension of any argument must either have the same length as that dimension in other arguments, or must have length . Each is equivalent to the corresponding , but where length- dimensions are “copied” along that dimension to match its maximum length across all , such that all are of equal shape. The function is then mapped elementwise across all , resulting in the outputs , each of which is the same shape as any .^{2}^{2}2Shrewd broadcast implementations index directly into the arguments to perform elementwise invocations (as is done in Eq. (1)) rather than explicitly materialize the arguments.
For example, broadcasting over an matrix , a scalar , and an
-element vector
yields:(1) | ||||
To denote broadcast for binary infix operators, we prepend (instead of append) the period, e.g. .
Assuming that a pair of broadcasted operations have compatible shapes and are relatively side-effect free, the broadcast of their composition generally obeys the following relation:
(2) |
In programs containing broadcast operations, Eq. (2) can be exploited to perform broadcast fusion, a compiler-level optimization that transforms compositions of broadcast calls into a single broadcast call. This optimization imparts a couple of performance benefits. First, by obviating the need to compute and store intermediate results, broadcast fusion reduces memory usage, temporary allocations, and kernel invocations required to complete the computation. Second, broadcast fusion allows the fused broadcast operation to be parallelized without re-synchronization between intermediary broadcast operations darte2000fusion ; kennedy1994fusion .
A common way to formulate forward-mode AD is via the algebra of dual numbers. Dual numbers are similar to complex numbers, but instead of appending the imaginary unit to , the dual number algebra appends the infinitesimal perturbation where to . Via Taylor series expansion, one can show that unary function application on dual numbers is defined as:
(3) |
This formulation is commonly implemented by defining a library of mathematical primitives that act on a Dual type consisting of the pair, e.g. cos(Dual(x, y)) Dual(cos(x), -sin(x) * y). This Dual type can then be used to automatically differentiate any program composed of the defined primitives.
While this formulation is usually straightforward to implement, it is also quite limited - only a single scalar derivative can be calculated per call of the target function. To overcome this limitation, we can use an extended formulation of the dual numbers known as the multidimensional dual numbers, which are defined as:
(4) | |||
(5) |
Multidimensional dual numbers allow for a “vector forward-mode” implementation of gradient calculation, where orthogonal components are appended to orthogonal input components to compute their individual directional derivatives pearlmutter2007lazy ; revels2016forwarddiff :
(6) |
To extract the components as a tuple from a multidimensional dual number, we utilize the tangent extraction function , defined as
(7) |
Note that utilizes the notion of tagged perturbations; only extracts perturbations that are marked with the “tag” , represented here via the bracket syntax . This tagging machinery is necessary (but unfortunately not sufficient) to avoid a class of AD bugs known as perturbation confusion siskind2008nesting ; siskind2005perturbation ; manzyuk2012confusion .
Broadcasted operations generally take the form where is the input arity and is the output arity. While might be broadcasted over millions of input elements, the arities and are generally relatively small (often ). To automatically differentiate such a function in the forward-mode, we can define a Jacobian operator using multidimensional dual numbers:
(8) |
where is the tangent extraction function defined in Eq. (7). Note that we are broadcasting over the output tuple of in order to extract all components of the Jacobian. For example, for , the definition expands to the following:
The following observation is frequently utilized throughout the rest of the paper:
(9) |
where vec
is notation for vectorization (i.e. “flattening” the given tensor to a vector), and
diag is notation for extracting the diagonal of a square matrix. In other words, Eq. (9) directly computes all elementwise partial derivatives of the total Jacobian of . This approach exploits the sparsity structure imposed on the Jacobian by the broadcast operation, avoiding calculation of the zero-valued cross-element partial derivatives (the off-diagonal elements of the matrices) by construction. Note, however, that if entails side-effects that induce cross-element dependence, then the total Jacobian is not fully recovered by this method, since the uncalculated cross-element partial derivatives may be nonzero in this case.Eq. (9) has significant performance and programmability implications when exploited within larger reverse-mode AD computations, since it enables the differentiation of fully-fused broadcast subgraphs without requiring the construction of a backwards pass. Specifically, Eq. (9
) can be employed to easily calculate and cache the intermediate Jacobian of the broadcast subgraph during the forward pass of the overall reverse-mode computation. This Jacobian can then be backpropagated during the reverse pass instead of backpropagating through the broadcast subgraph directly. In this way, Eq. (
9) allows one to treat entire broadcast subgraphs as fused forward-mode primitives, obviating the need for reversible representations of these subgraphs.Definition | Forward (Primal) | Reverse (Adjoint) |
---|---|---|
To illustrate the use of Eq. (9) within a reverse-mode computation, consider the example defined in Table 1. The left column defines the target function , the center column expresses the primal forward pass of the computation, and the right column expresses the adjoint pass used to compute and . In the adjoint pass, the actual calculation of can be accomplished via the usual reverse-mode approach of decomposing into a reversible subgraph built from known primitive operations. The calculation of and , however, can be accomplished via Eq. (9) without requiring the construction of reverse-mode computation subgraph at all:
(10) |
Note that while it is mathematically useful to discuss and as separate computations as we have done here, practical AD implementations leveraging this method can exploit the implicit computation of that occurs as part of computing . Instead of applying the operator immediately as is done in Eq. (8), the primal and dual computation results can be extracted simultaneously. In other words, the step in forward pass code can be replaced with a step that simultaneously calculates , , and by simply invoking with dual number inputs.
Fusing the primal and derivative calculations in this manner avoids redundant computation, but requires that the resulting partial derivatives be cached until they are backpropagated in the reverse pass. This fusion, then, may not be desirable if there is not sufficient memory available to sustain such a cache. Conversely, computing during the reverse pass redundantly computes , but the resulting partial derivatives can be backpropagated immediately, and thus their storage can be freed (or reused) immediately. Ultimately, the choice of whether should be applied in the forward pass or in the reverse pass depends on the memory/compute bandwidth of the overall computation.
The previous sections implemented as a forward-mode differentiation operator, but could have also been implemented via reverse-mode AD without invalidating Eq. (9). Why, then, is forward-mode the better choice for this use case? The answer to this question can be summarized in three points:
1: If , then reverse-mode is algorithmically superior to forward-mode. However, is generally low-arity, and in practice, forward-mode often outperforms reverse-mode for low-arity functions regardless of the
ratio. There are two reasons for this. First, reverse-mode implementations often incur relatively high constant costs that are not amortized in the low-arity regime. Second, forward-mode’s additional chain rule applications can be offset for low-arity functions by leveraging stack allocation schemes that make better use of cache bandwidth and allow for the exploitation of instruction-level parallelism
revels2016forwarddiff .2: If the target function contains data-dependent control flow, reverse-mode implementations must dynamically allocate the data-dependent regions of the computation graph^{3}^{3}3This requirement is not implementation-specific, but rather a hard theoretical limit; capturing intermediate values which depend on run time data will always require run time allocation in the general case, though certain optimizations may alleviate this burden in special cases. This requirement applies even to reverse-mode tools that claim to be “tapeless” by statically generating backwards pass code merrinboer2017tangent ; innes2018zygote , or performing equivalent transformation via language-level constructs such as delimited continuations or closures wang2018shiftreset . As Pearlmutter and Siskind remark, it is “impossible” to “eliminate the tape from reverse-mode AD” because “the tape stores intermediate values computed during the forward phase that are needed during the reverse phase.” pearlmutter2008lambda . For low-arity functions, the overhead of dynamic trace allocation can easily dwarf the cost of the target function’s primal evaluation. For broadcasted operations, this high overhead would be incurred for every elementwise invocation, rendering the reverse-mode approach in this case wholly unsuitable for the GPU where excessive dynamic allocation is infeasible.
3: Following from the previous point, using forward-mode for broadcast differentiation allows data-dependent control flow to occur within broadcasted scalar operations, thus avoiding several disadvantages inherent to vectorized control flow primitives currently employed by reverse-mode frameworks (e.g. TensorFlow’s where abadi2016tensorflow ). The first disadvantage is programmability; vectorized control flow primitives are often more cumbersome to use than their naive scalar counterparts. The second disadvantage is that many vectorized control flow primitives require computing untaken branches. While these primitives do have the benefit of clearly avoiding warp divergence on the GPU, the experiment described in §3 demonstrates that this benefit does not necessarily offset the cost of computing untaken branches on newer GPU architectures - especially if the difference in cost between branches is substantial - since newer architectures support executing different instructions across a warp without forcing serialized execution.
In this section, we describe an experiment performed to compare this paper’s forward-mode broadcast differentiation technique with existing reverse-mode approaches. Our test case for this experiment was a cell update calculation that occurs during the execution of a hierarchical multiscale LSTM (HM-LSTM) chung2016hierarchical , described in §3.1. In §3.2, we describe the three different AD implementations used to calculate gradients for our test case (TensorFlow-based reverse-mode, Julia-based reverse-mode, and Julia-based forward-mode). Finally, in §3.4, we analyze GPU performance measurements obtained from benchmarking these implementations.
Note that all implementation, benchmark, and test code is publicly available in its entirety at paperrepo .
The HM-LSTM cell update calculation is a real-world example of a broadcast operation that is amenable to differentiation via Eq. (9). For a given time step and layer , the update calculation for the cell is:
(11) |
where and are memory gates, is a cell proposal vector, and is a boundary state.
We chose this operation as our experimental test case because it is self-contained, hinges on data-dependent control flow, has a substantial computational cost difference between branches, and is relevant to a machine learning audience.
The benchmarks described in the following sections are primarily concerned with the calculation of , , , and .
The first implementation tested in our experiment was a TensorFlow-based implementation derived from finkelstein2017hmlstm . This implementation makes use of TensorFlow’s vectorized control flow primitive where, which eagerly computes both branches of the conditional statement before returning the branch specified by the given predicate. This primitive sidesteps actual branching and thus avoids two potential pitfalls discussed in §2.2.4: dynamic trace allocation and warp divergence. As discussed in that section, avoiding these two perceived pitfalls comes at the cost of restricting the programming model and limiting opportunities for optimizations such as broadcast fusion.
A visualization of the implementation’s post-optimization intermediate representation (IR) can be found in paperrepo , depicted as a computation graph in the High Level Optimizer (HLO) format. From this graph, it can be seen that TensorFlow’s XLA compiler broke up the entire computation into six separate kernels, each representing a partially fused region of the forward and reverse passes, including broadcasted select operations that were generated from the initial code’s where invocations.
The second implementation tested in our experiment was a reverse-mode implementation in the Julia language bezanson2017julia . This implementation was directly derived from the HLO graph of the TensorFlow implementation described in the previous section. The intent was to exactly mirror TensorFlow’s operations at the abstraction level of its HLO representation in order to better bridge comparisons between the reverse-mode TensorFlow and forward-mode Julia implementations. To accomplish this, the HLO graph operations were manually transcribed as native Julia code, additionally using the CUDAnative package to enable execution on the GPU besard2017cudanative .
The third implementation tested in our experiment was a native Julia implementation of forward-mode broadcast differentiation as described by Eq. (9). Like the reverse-mode Julia implementation, this implementation also employed the CUDAnative package to enable execution on the GPU. In many other ways, however, this forward-mode implementation differs substantially from the reverse-mode implementations.
The principle difference was the manner in which the primal calculation was expressed. While the reverse-mode implementations expressed control flow via vectorized primitives, the forward-mode approach allows the fusion of control flow into the broadcasted kernel without incurring the reverse-mode-specific performance penalties discussed in §2.2.4. Thus, the forward-mode implementation’s primal calculation was derived by using the relation from Eq. (2) to fuse Eq. (11) into a single broadcastable kernel:
(12) | ||||
(13) |
An implementation of the operator (defined by Eq. (8)) was then applied to Eq. (12) to calculate the required gradients. The operator itself was implemented using the multidimensional dual number type provided by the ForwardDiff package, which represents a dual number as a pure Julia struct with two fields; one for the primal scalar, and one for a stack-allocated vector of perturbation coefficients revels2016forwarddiff .
Python code was executed using Python 3.6.3 with TensorFlow 1.5.0 and its XLA JIT compiler (which includes an unreleased version of LLVM close to LLVM 6.0). Julia code was executed using Julia 0.7.0-DEV.5025, built on LLVM 6.0. All required versions of all Julia packages used are publicly available: CUDAnative.jl version 0.8.1, CUDAdrv.jl 0.8.3, LLVM.jl 0.9.8, and ForwardDiff 0.7.5. We used the CUDA toolkit at version 9.1.85, in combination with NVIDIA driver 390.30 and Linux 4.13 from Ubuntu 17.10.
The implementations described in §3.2 were tested on NVIDIA Tesla V100, Tesla P100, and GTX 1080Ti GPUs in combination with 2 hexa-core Intel Xeon E5-2603 v4 CPUs and 64 GiB of DDR4 memory.
We measure the performance of individual implementations using the NVIDIA profiling tools from the CUDA toolkit. We only report kernel timings, excluding, e.g., memory transfers and CUDA API interactions, because the different implementations were designed to behave identically from an API point of view.
For the sake of accurate comparison, our Julia-based benchmarks followed TensorFlow’s configuration where possible, e.g., using page-locked memory allocated asynchronously using the driver API, performing an identical amount of memory transfers, launch kernels identically (using at most 64 threads and a corresponding number of blocks), etc.
Fig. 1 shows the execution times to compute the aforementioned derivatives for each implementation described in §3.2 across three generations of NVIDIA GPUs, with , , , and taking random 32-bit floating point matrix values and and taking -element random 32-bit floating point vector values where . As can be seen in Fig. 1, the forward-mode Julia implementation features a speedup of x, x, and x over the reverse-mode Julia implementation on the Volta, Pascal, and Kepler architectures, respectively. Compared to the reverse-mode TensorFlow implementation, these speedups are x, x and x, respectively.
As mentioned in §2.2.4, a substantial advantage of the forward-mode approach is that it avoids the computation of untaken branches by allowing data-dependent control flow to be fused within the broadcasted operation itself. However, this kind of fine-grained branching has traditionally been considered unfavorable for GPUs, which typically require threads within a so-called “warp” (a group of typically 32 threads) to execute in lockstep. If threads within a warp branch to different instructions, the hardware must execute both branches on all threads within the warp and mask out the results of untaken branches on each thread. This is known as warp divergence, and can decrease performance significantly bialas2015benchmarking .
Fortunately, recent hardware improvements found on NVIDIA’s Volta architecture can drastically mitigate the negative impact of warp divergence in many cases. This architecture enables independent thread scheduling by maintaining a program counter and call stack for every thread separately nvidia2017v100 , thus allowing threads to execute different instructions without requiring serialized execution. The effects of this architectural improvement can be seen in Fig. 2
, which shows the ratio of the forward-mode Julia implementation’s execution time between uniformly distributed random control inputs and warp-uniform control inputs. Executing on a Tesla V100 GPU, the overhead of the thread-divergent implementation in terms of kernel execution time drops from
to on Kepler and Pascal. When looking at total application execution time, this cost is even lower (below ), as kernels also execute faster on more recent hardware.In addition to the main experiment, a different experiment was performed to explore how various indicators of GPU utilization scale as the arity of a forward-mode-differentiated broadcast operation increases.
As previously stated, the ForwardDiff package’s multidimensional dual number implementation utilizes a stack-allocated vector to store perturbation coefficients. Recalling §2.2.1, the length of every input, output, and intermediate dual number’s perturbation vector is equal to the input arity of the target operation. Thus, increasing the input arity of the target operation increases register pressure. On CPUs, increased register usage can quickly result in excessive stack pressure, such that temporary values must be spilled into memory. On GPUs, however, many more registers are available; for example, the Tesla V100 contains 65,536 32-bit registers on each of its 84 Shared Multiprocessors (SM) nvidia2017v100 . This advantage is offset by the large number of threads executing concurrently on the GPU, since each thread reserves a number of registers for exclusive use. The balance between active thread count and register usage is captured by the occupancy metric, precisely defined as the ratio of active warps on an SM to the maximum number of active warps supported by the SM. With increased register usage, fewer warps can be allocated on each SM, and occupancy drops.
To assess the impact of a broadcasted operation’s input arity on the performance of its forward-mode differentiation by dual numbers, we designed an artificial benchmark and measured its achieved occupancy and effective hardware utilization: the computation of where
and each is a random matrix with 32-bit floating point elements. This benchmark provides a balanced workload for which it is easy to increase the arity and measure the subsequent effect on hardware utilization. Fig. 3 shows how occupancy drops steadily from 5 arguments on, at which point the amount of registers exceeds 32 and insufficient warps can be launched to satisfy the maximum number of concurrent warps per SM. Hardware utilization is initially limited by the low complexity of the kernel. It does not drop as strongly as the occupancy, since higher arity also increases the workload of the kernel, but at 18 arguments both compute and bandwidth utilization drop below 60% and the kernel can be considered latency-bound due to low occupancy.
In this paper, we presented a reverse-mode-interleavable forward-mode method for the differentiation of broadcast that outperforms pure reverse-mode methods on the GPU and simultaneously obviates the need for reverse-mode-specific programmability restrictions on user-authored broadcasted operations. This mixed-mode technique is, in fact, already well-utilized in the Julia ecosystem. It was first introduced in 2016 by the ReverseDiff package (developed by this paper’s first author) revels2016reversediff , whose original implementation of the method has since propagated to the Flux and Zygote packages innes2018flux ; innes2018zygote . This method was also recently employed for the differentiation of Julia broadcast operations on TPUs fischer2018tpu .
In the future, higher-order mixed-mode AD is likely to present interesting new challenges in the vein of perturbation/sensitivity confusion. For example, consider the forward-mode differentiation of the broadcast of a function that closes over variables naively tracked by a surrounding reverse-mode implementation. More research is needed to identify these potentially problematic scenarios and explore their ramifications.
Additional work has been planned to implement first-class mixed-mode AD for Julia within the upcoming Capstan package revels2018capstan , which will build on recently developed tools enabling third-party packages to extend Julia’s compiler with new, context-specific behaviors by dynamically injecting code transformation passes into Julia’s just-in-time (JIT) compilation cycle revels2018cassette .
Jarrett Revels was supported by MIT’s NEC Corporation Fund for Research in Computers and Communications, MIT’s Skoltech Next Generation Program and a Lawrence Livermore National Lab Subcontract. Tim Besard was supported by the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), and by the Research Foundation – Flanders (FWO). Valentin Churavy was supported by NSF DMS-1312831, Darpa XDATA, and an ARAMCO MITEI grant. The authors would also like to thank James Bradbury, Peter Ahrens, Mike Innes, Deniz Yuret, Ekin Akyürek and Simon Danisch for multiple helpful conversations and contributions to Julia’s AD and GPU ecosystems.
Tvm: end-to-end compilation stack for deep learning.
In SysML Conference, 2018.Gradient-based hyperparameter optimization through reversible learning.
In International Conference on Machine Learning, pages 2113–2122, 2015.Automatic differentiation in pytorch.
In NIPS-W, 2017.Dlvm: A modern compiler framework for neural network dsls.
Urbana, 51:61801, 2017.
Comments
There are no comments yet.