Divide-and-Conquer Checkpointing for Arbitrary Programs with No User Annotation

08/22/2017 ∙ by Jeffrey Mark Siskind, et al.

Classical reverse-mode automatic differentiation (AD) imposes only a small constant-factor overhead in operation count over the original computation, but has storage requirements that grow, in the worst case, in proportion to the time consumed by the original computation. This storage blowup can be ameliorated by checkpointing, a process that reorders application of classical reverse-mode AD over an execution interval to trade off space for time. Applying checkpointing in a divide-and-conquer fashion to strategically chosen nested execution intervals can break classical reverse-mode AD into stages, which can reduce the worst-case growth in storage from linear to sublinear. Doing this has been fully automated only for computations of particularly simple form, with checkpoints spanning execution intervals resulting from a limited set of program constructs. Here we show how the technique can be automated for arbitrary computations. The essential innovation is to apply the technique at the level of the language implementation itself, thus allowing checkpoints to span any execution interval.


1 Introduction

Reverse-mode automatic differentiation (AD) traverses the run-time dataflow graph of a calculation in reverse order, in a so-called reverse sweep, so as to calculate a Jacobian-transpose-vector product with the Jacobian of the given original (or primal) calculation [speelpenning80]. Although the number of arithmetic operations involved in this process is only a constant factor greater than that of the primal calculation, some values involved in the primal dataflow graph must be saved for use in the reverse sweep, thus imposing considerable storage overhead. This is accomplished by replacing the primal computation with a forward sweep that performs the primal computation while saving the requisite values on a data structure known as the tape. A technique called checkpointing [volin1985aco] reorders portions of the forward and reverse sweeps to reduce the maximal length of the requisite tape. Doing so, however, requires (re)computation of portions of the primal, and saving, as snapshots, the requisite program state to support such (re)computation. Overall space savings result when the space saved by reducing the maximal length of the requisite tape exceeds the space cost of storing the snapshots. Such space saving incurs a time cost in (re)computation of portions of the primal. Different checkpointing strategies lead to a space-time tradeoff.
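The taping mechanism can be illustrated with a minimal sketch (a Python illustration of our own, not the implementation of any AD system named here) for a chain of unary primitives: the forward sweep runs the primal while pushing each input value onto a tape, and the reverse sweep pops the tape in reverse order to accumulate the derivative.

```python
import math

# Minimal sketch of tape-based reverse-mode AD over a chain of unary
# primitives. The tape grows linearly with the number of primal steps,
# which is exactly the storage blowup that checkpointing ameliorates.

PRIMITIVES = {
    'sin': (math.sin, math.cos),                    # (f, f')
    'square': (lambda x: x * x, lambda x: 2 * x),
    'exp': (math.exp, math.exp),
}

def reverse_mode(steps, x):
    tape = []
    for name in steps:                  # forward sweep: primal with taping
        f, _ = PRIMITIVES[name]
        tape.append(x)
        x = f(x)
    xbar = 1.0
    for name in reversed(steps):        # reverse sweep: consume tape in reverse
        _, df = PRIMITIVES[name]
        xbar *= df(tape.pop())
    return x, xbar
```

For instance, `reverse_mode(['sin', 'square', 'exp'], 0.5)` returns the primal value exp(sin(0.5)^2) together with its derivative by the chain rule.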


Figure 1: Checkpointing in reverse-mode AD. See text for description.

We introduce some terminology that will be useful in describing checkpointing. An execution point is a point in time during the execution of a program. A program point is a location in the program code. Since program fragments might be invoked zero or more times during the execution of a program, each execution point corresponds to exactly one program point but each program point may correspond to zero or more execution points. An execution interval is a time interval spanning two execution points. A program interval is a fragment of code spanning two program points. Program intervals are usually constrained so that they nest, i.e., they do not cross one boundary of a syntactic program construct without crossing the other. Each program interval may correspond to zero or more execution intervals, those execution intervals whose endpoints result from the same invocation of the program interval. Each execution interval corresponds to at most one program interval. An execution interval might not correspond to a program interval because the endpoints might not result from the same invocation of any program interval.


Figure 2: Divide-and-conquer checkpointing in reverse-mode AD. See text for description.

Figs. 1 and 2 illustrate the process of performing reverse-mode AD with and without checkpointing. Control flows from top to bottom, and along the direction of the arrow within each row. The symbols u, v, and p denote execution points in the primal: u is the start of the computation whose derivative is desired, v is the end of that computation, and each p is an intermediate execution point in that computation. Reverse mode involves various sweeps, whose execution intervals are represented as horizontal green, red, and blue lines. Green lines denote (re)computation of the primal without taping. Red lines denote computation of the primal with taping, i.e., the forward sweep of reverse mode. Blue lines denote computation of the Jacobian-transpose-vector product, i.e., the reverse sweep of reverse mode. The vertical black lines denote collections of execution points across the various sweeps that correspond to execution points in the primal, each particular execution point being the intersection of a horizontal line and a vertical line. In portions of Figs. 1 and 2 other than Fig. 1(a), we refer to execution points for sweeps other than the primal in a given collection with the symbols u, v, and p when the intent is clear. The vertical violet, gold, pink, and brown lines denote execution intervals for the lifetimes of various saved values. Violet lines denote the lifetime of a value saved on the tape during the forward sweep and used during the reverse sweep. The value is saved at the execution point at the top of the violet line and used once at the execution point at the bottom of that line. Gold and pink lines denote the lifetime of a snapshot. (The distinction between gold and pink lines, the meaning of brown lines, and the meaning of the black tick marks on the left of the gold and pink lines will be explained in Section LABEL:sec:intuition.) The snapshot is saved at the execution point at the top of each gold or pink line and used at various other execution points during its lifetime. Green lines emanating from a gold or pink line indicate restarting a portion of the primal computation from a saved snapshot.

Fig. 1(a) depicts the primal computation, y = f(x), which takes t time steps, with x being a portion of the program state at the start execution point u and y being a portion of the program state at the end execution point v, computed from x. This is performed without taping (green). Fig. 1(b) depicts classical reverse mode without checkpointing. An uninterrupted forward sweep (red) is performed for the entire length of the primal, then an uninterrupted reverse sweep (blue) is performed for the entire length. Since the tape values are consumed in the reverse of the order in which they are saved, the requisite tape length is t. Fig. 1(c) depicts a checkpoint introduced for the execution interval [p1, p2). This interrupts the forward sweep and delays a portion of that sweep until the reverse sweep. Execution proceeds by a forward sweep (red) that tapes during the execution interval [u, p1), a primal sweep (green) without taping during the execution interval [p1, p2), a taping forward sweep (red) during the execution interval [p2, v), a reverse sweep (blue) during the execution interval [p2, v), a taping forward sweep (red) during the execution interval [p1, p2), a reverse sweep (blue) during the execution interval [p1, p2), and then a reverse sweep (blue) during the execution interval [u, p1). The forward sweep for the execution interval [p1, p2) is delayed until after the reverse sweep for the execution interval [p2, v). As a result of this reordering, the tapes required for those sweeps are not simultaneously live. Thus the requisite tape length is the maximum of the two tape lengths, not their sum. This savings comes at a cost. To allow such out-of-order execution, a snapshot (gold) must be saved at p1 and the portion of the primal during the execution interval [p1, p2) must be computed twice, first without taping (green) then with taping (red).
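The essence of this reordering can be sketched for a chain of n steps with a single split point p (a simplified variant of the figure's scenario; the function names here are ours): only one segment's tape is ever live, so the peak tape length is max(p, n - p) rather than n, at the cost of running the prefix of the primal twice.

```python
import math

# Sketch of checkpointed reverse mode for a chain x_{i+1} = step(x_i, i)
# of n steps, split at index p. The suffix [p, n) is taped and reversed
# first; the prefix [0, p) is then re-run from a snapshot, taped, and
# reversed. Returns the primal value, the derivative, and the peak tape
# length actually observed.

def reverse_with_one_checkpoint(step, dstep, x0, n, p):
    snapshot = x0                         # snapshot (gold) of the start state
    x = x0
    for i in range(p):                    # primal without taping (green)
        x = step(x, i)
    tape = []
    for i in range(p, n):                 # taping forward sweep (red)
        tape.append(x)
        x = step(x, i)
    y, peak = x, len(tape)
    ybar = 1.0
    for i in reversed(range(p, n)):       # reverse sweep (blue) on the suffix
        ybar *= dstep(tape.pop(), i)
    x = snapshot
    for i in range(p):                    # delayed taping forward sweep (red)
        tape.append(x)
        x = step(x, i)
    peak = max(peak, len(tape))
    for i in reversed(range(p)):          # reverse sweep (blue) on the prefix
        ybar *= dstep(tape.pop(), i)
    return y, ybar, peak
```

The result matches full-tape reverse mode, with a peak tape of max(p, n - p) entries instead of n.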

A checkpoint can be introduced into a portion of the forward sweep that has been delayed, as shown in Fig. 1(d). An additional checkpoint can be introduced for an execution interval nested inside that of the first checkpoint. This will delay a portion of the already delayed forward sweep even further. As a result, the portions of the tape needed for the three resulting execution intervals are not simultaneously live, thus further reducing the requisite tape length, but requiring more (re)computation of the primal (green). The execution intervals for multiple checkpoints must either be disjoint or must nest; the execution interval of one checkpoint cannot cross one endpoint of the execution interval of another checkpoint without crossing the other endpoint.

Execution intervals for checkpoints can be specified in a variety of ways.

program interval

Execution intervals of specified program intervals constitute checkpoints.

subroutine call site

Execution intervals of specified subroutine call sites constitute checkpoints.

subroutine body

Execution intervals of specified subroutine bodies constitute checkpoints [volin1985aco].

Nominally, these have the same power; with any one, one could achieve the effect of the other two. Specifying a subroutine body could be accomplished by specifying all call sites of that subroutine. Specifying some call sites but not others could be accomplished by having two variants of the subroutine, one whose body is specified and one whose body is not, and calling the appropriate one at each call site. Specifying a program interval could be accomplished by extracting that interval as a subroutine.

Tapenade [hascoet2004tug] allows the user to specify program intervals for checkpoints with the c$ad checkpoint-start and c$ad checkpoint-end pragmas. Tapenade, by default, checkpoints all subroutine calls [dauvergne2006tdf]. This default can be overridden for named subroutines with the -nocheckpoint command-line option and for both named subroutines and specific call sites with the c$ad nocheckpoint pragma.

Figure 3: Binary checkpoint tree for the decomposition of Fig. 2(b). See text for description.

Recursive application of checkpointing in a divide-and-conquer fashion, i.e., “treeverse,” can divide the forward and reverse sweeps into stages run sequentially [griewank1992alg]. The key idea is that only one stage is live at a time, thus requiring a shorter tape. However, the state of the primal computation at various intermediate execution points needs to be saved as snapshots, in order to (re)run the requisite portion of the primal to allow the forward and reverse sweeps for each stage to run in turn. This process is illustrated in Fig. 2. Consider a root execution interval [u, v) of the derivative calculation. Without checkpointing, the forward and reverse sweeps span the entire root execution interval, as shown in Fig. 2(a). One can divide the root execution interval into two subintervals [u, p) and [p, v) at the split point p and checkpoint the first subinterval [u, p). This divides the forward (red) and reverse (blue) sweeps into two stages. These two stages are not simultaneously live. If the two subintervals are the same length, this halves the storage needed for the tape at the expense of running the primal computation for [u, p) twice, first without taping (green), then with taping (red). This requires a single snapshot (gold) at u. This process can be viewed as constructing a binary checkpoint tree (Fig. 3) whose nodes are labeled with execution intervals, where the intervals of the children of a node are adjacent, the interval of a node is the disjoint union of the intervals of its children, and left children are checkpointed.
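The recursion just described can be sketched as follows (bisection split points; a hedged illustration of the technique, not checkpointVLAD's implementation): each internal node snapshots its start state, runs the primal untaped to the split, reverses the right half first, then reverses the left half from the snapshot.

```python
import math

# Sketch of binary divide-and-conquer checkpointing for a chain
# x_{i+1} = step(x_i, i). Each leaf of the checkpoint tree tapes at most
# `leaf` steps, bounding the peak tape length independently of n; the
# price is O(log n) simultaneously live snapshots (the recursion depth)
# and repeated untaped (re)computation of primal prefixes.

def dnc_reverse(step, dstep, x, lo, hi, ybar, leaf=4):
    if hi - lo <= leaf:                   # leaf stage: tape (red), reverse (blue)
        tape = []
        for i in range(lo, hi):
            tape.append(x)
            x = step(x, i)
        for i in reversed(range(lo, hi)):
            ybar *= dstep(tape.pop(), i)
        return ybar
    mid = (lo + hi) // 2                  # bisection split point
    snapshot = x                          # snapshot (gold) at lo
    for i in range(lo, mid):              # primal without taping (green)
        x = step(x, i)
    ybar = dnc_reverse(step, dstep, x, mid, hi, ybar, leaf)         # right child
    return dnc_reverse(step, dstep, snapshot, lo, mid, ybar, leaf)  # left child
```

For a chain of n steps this builds a complete binary checkpoint tree, in the spirit of Figs. 2(e) and 2(f).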

Figure 4: (a) Left-branching binary checkpoint tree; (b) the equivalent n-ary checkpoint tree. See text for description.

One can construct a left-branching binary checkpoint tree over the same root execution interval with the split points p1, p2, and p3 (Fig. 4a). This can also be viewed as constructing an n-ary checkpoint tree where all children but the rightmost are checkpointed (Fig. 4b). This leads to nested checkpoints for the execution intervals [u, p1), [u, p2), and [u, p3), as shown in Fig. 2(c). Since the starting execution point u is the same for these intervals, a single snapshot (gold) with a longer lifetime suffices. These checkpoints divide the forward (red) and reverse (blue) sweeps into four stages. This allows the storage needed for the tape to be reduced arbitrarily (i.e., the red and blue segments can be made arbitrarily short), by rerunning successively shorter prefixes of the primal computation (green) without taping, running only short segments (red) with taping. This requires an increase in time for (re)computation of the primal (green).

Figure 5: Right-branching binary checkpoint tree. See text for description.

Alternatively, one can construct a right-branching binary checkpoint tree over the same root execution interval with the same split points p1, p2, and p3 (Fig. 5). This also divides the forward (red) and reverse (blue) sweeps into four stages. With this, the requisite tape length (the maximal length of the red and blue segments) can be reduced arbitrarily while running the primal (green) just once, by saving more snapshots (gold and pink), as shown in Fig. 2(d). This requires an increase in space for storage of the live snapshots (gold and pink).

Thus we see that divide-and-conquer checkpointing can make the requisite tape arbitrarily small with either left- or right-branching binary checkpoint trees. This involves a space-time tradeoff. The left-branching binary checkpoint trees require a single snapshot but an increase in time for (re)computation of the primal (green). The right-branching binary checkpoint trees require an increase in space for storage of the live snapshots (gold and pink) but (re)run the primal only once.

Figure 6: Complete binary checkpoint tree. See text for description.

One can also construct a complete binary checkpoint tree over the same root execution interval with the same split points p1, p2, and p3 (Fig. 6). This constitutes application of the approach from Fig. 2(b) in a divide-and-conquer fashion as shown in Fig. 2(e). This also divides the forward (red) and reverse (blue) sweeps into four stages. One can continue this divide-and-conquer process further, with more split points, more snapshots, and more but shorter stages, as shown in Fig. 2(f). This leads to an increase in space for storage of the live snapshots (gold and pink) and an increase in time for (re)computation of the primal (green). Variations of this technique can trade off between different improvements in space and/or time complexity, leading to overhead in a variety of sublinear asymptotic complexity classes in one or both. In order to apply this technique, we must be able to construct a checkpoint tree of the desired shape with appropriate split points. This in turn requires the ability to interrupt the primal computation at appropriate execution points, save the interrupted execution state as a capsule, and restart the computation from the capsules, sometimes repeatedly. (The correspondence between capsules and snapshots will be discussed in Section LABEL:sec:intuition.)

Figure 7: (a) Binary and (b) n-ary checkpoint trees for Fig. 2(e); (c) binary and (d) n-ary checkpoint trees for Fig. 2(f). See text for description.

Any given divide-and-conquer decomposition of the same root execution interval with the same split points can be viewed as either a binary checkpoint tree or an n-ary checkpoint tree. Thus Fig. 2(e) can be viewed as either Fig. 7(a) or Fig. 7(b). Similarly, Fig. 2(f) can be viewed as either Fig. 7(c) or Fig. 7(d). Thus we distinguish between two algorithms to perform divide-and-conquer checkpointing.

binary

An algorithm that constructs a binary checkpoint tree.

treeverse

The algorithm from [griewank1992alg, Figs. 2 and 3] that constructs an n-ary checkpoint tree.

There is, however, a simple correspondence between associated binary and n-ary checkpoint trees. The n-ary checkpoint tree is derived from the binary checkpoint tree by coalescing each maximal sequence of left branches into a single node. Thus we will see, in Section LABEL:sec:binomial, that these two algorithms exhibit the same properties.
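The coalescing correspondence can be illustrated with a hypothetical tree encoding of our own devising: leaves are intervals ('leaf', lo, hi) and internal binary nodes are ('bin', left, right); coalescing walks each maximal chain of left branches and gathers the nodes hanging off it as the children of a single n-ary node.

```python
# Derive the n-ary checkpoint tree from a binary checkpoint tree by
# coalescing each maximal sequence of left branches into a single node.
# The encoding here is illustrative, not taken from the paper.

def to_nary(tree):
    if tree[0] == 'leaf':
        return tree
    children = []
    node = tree
    while node[0] == 'bin':        # walk down the chain of left branches
        _, left, right = node
        children.append(to_nary(right))
        node = left
    children.append(node)          # the leftmost descendant (a leaf)
    children.reverse()             # restore left-to-right order
    return ('nary', children)
```

A fully left-branching binary tree over four leaves coalesces into a single 4-ary node, while a right-branching tree yields a cascade of binary n-ary nodes, mirroring Figs. 7(a)/(b).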

Note that (divide-and-conquer) checkpointing does not incur any space or time overhead in the forward or reverse sweeps themselves (i.e., the number of violet lines and the total length of red and blue lines). Any space overhead results from the snapshots (gold and pink) and any time overhead results from (re)computation of the primal (green).

Several design choices arise in the application of divide-and-conquer checkpointing in addition to the choice of binary vs. n-ary checkpoint trees.

  • What root execution interval(s) should be subject to divide-and-conquer checkpointing?

  • Which execution points are candidate split points? The divide-and-conquer process of constructing the checkpoint tree will select actual split points from these candidates.

  • What is the shape or depth of the checkpoint tree, i.e., what is the termination criterion for the divide-and-conquer process?

Since the leaf nodes of the checkpoint tree correspond to stages, the termination criterion and the number of evaluation steps in the stage at each leaf node (the length of a pair of red and blue lines) are mutually constrained. The number of live snapshots at a leaf (how many gold and pink lines are crossed by a horizontal line drawn leftward from that stage, the pair of red and blue lines, to the root) depends on the depth of the leaf and its position in the checkpoint tree. Different checkpoint trees, with different shapes resulting from different termination criteria and split points, can lead to a different maximal number of live snapshots, resulting in different storage requirements. The amount of (re)computation of the primal (the total length of the green lines) can also depend on the shape of the checkpoint tree; thus different checkpoint trees, with different shapes resulting from different termination criteria and split points, can lead to different compute-time requirements. Different strategies for specifying the termination criterion and the split points therefore influence the space-time tradeoff.

We make a distinction between several different approaches to selecting root execution intervals subject to divide-and-conquer checkpointing.

loop

Execution intervals resulting from invocations of specified DO loops are subject to divide-and-conquer checkpointing.

entire derivative calculation

The execution interval for an entire specified derivative calculation is subject to divide-and-conquer checkpointing.

We further make a distinction between several different approaches to selecting candidate split points.

iteration boundary

Iteration boundaries of the DO loop specified as the root execution interval are taken as candidate split points.

arbitrary

Any execution point inside the root execution interval can be taken as a candidate split point.

We further make a distinction between several different approaches to specifying the termination criterion and deciding which candidate split points to select as actual split points.

bisection

Split points are selected so as to divide the computation dominated by a node in half as one progresses successively from right to left among children [griewank1992alg, equation (12)]. One can employ a variety of termination criteria, including that from [griewank1992alg, p. 46]. If the termination criterion is such that the total number of leaves is a power of two, one obtains a complete binary checkpoint tree. A termination criterion that bounds the number of evaluation steps in a leaf limits the size of the tape and achieves logarithmic overhead in both asymptotic space and time complexity compared with the primal.

binomial

Split points are selected using the criterion from [griewank1992alg, equation (16)]. The termination criterion from [griewank1992alg, p. 46] is usually adopted to achieve the desired properties discussed in [griewank1992alg]. Different termination criteria can be selected to control space-time tradeoffs.

fixed space overhead

One can bound the size of the tape and the number of snapshots to obtain sublinear but superlogarithmic overhead in asymptotic time complexity compared with the primal.

fixed time overhead

One can bound the size of the tape and the (re)computation of the primal to obtain sublinear but superlogarithmic overhead in asymptotic space complexity compared with the primal.

logarithmic space and time overhead

One can bound the size of the tape and obtain logarithmic overhead in both asymptotic space and time complexity compared with the primal. The constant factor is less than that of bisection checkpointing.
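A sketch of the combinatorics underlying these guarantees (following the binomial result of [griewank1992alg]; the function names and the demonstration are ours): with s simultaneously live snapshots and at most r forward sweeps over any primal step, binomial checkpointing can reverse a chain of up to C(s + r, s) steps, so letting s and r grow together keeps both overheads logarithmic in the chain length.

```python
from math import comb

# Binomial capacity of divide-and-conquer checkpointing (Griewank 1992):
# with s live snapshots and at most r (re)computations of any primal
# step, a chain of up to beta(s, r) = C(s + r, s) steps can be reversed.

def beta(s, r):
    return comb(s + r, s)

def log_overhead_depth(t):
    """Smallest d such that d snapshots and d sweeps suffice for a chain
    of t steps; d grows logarithmically in t (the 'logarithmic space and
    time overhead' regime sketched above)."""
    d = 0
    while beta(d, d) < t:
        d += 1
    return d
```

Holding s fixed while letting r grow gives the fixed-space-overhead regime; holding r fixed while letting s grow gives the fixed-time-overhead regime.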

We elaborate on the strategies for selecting actual split points from candidate split points and the associated termination criteria in Section LABEL:sec:binomial.

Divide-and-conquer checkpointing has to date been provided by AD systems only in special cases. For example, Tapenade allows the user to select invocations of a specified DO loop as the root execution interval for divide-and-conquer checkpointing with the c$ad binomial-ckp pragma, taking iteration boundaries of that loop as candidate split points. Tapenade employs binomial selection of split points and a fixed space overhead termination criterion. Note, however, that Tapenade only guarantees this fixed space overhead property for DO loop bodies that take constant time. Similarly, ADOL-C [griewank1996apf] contains a nested taping mechanism for time-integration processes [kowarz2006ocf] that also performs divide-and-conquer checkpointing, but this applies only to code formulated as a time-integration process.

Here, we present a framework for applying divide-and-conquer checkpointing to arbitrary code with no special annotation or refactoring required. An entire specified derivative calculation is taken as the root execution interval, rather than invocations of a specified DO loop. Arbitrary execution points are taken as candidate split points, rather than iteration boundaries. As discussed below in Section LABEL:sec:binomial, both binary and n-ary (treeverse) checkpoint trees are supported. Furthermore, as discussed below in Section LABEL:sec:binomial, both bisection and binomial checkpointing are supported. Additionally, all of the above termination criteria are supported: fixed space overhead, fixed time overhead, and logarithmic space and time overhead. Any combination of the above checkpoint-tree generation algorithms, split-point selection methods, and termination criteria is supported. In order to apply this framework, we must be able to interrupt the primal computation at appropriate execution points, save the interrupted execution state as a capsule, and restart the computation from the capsules, sometimes repeatedly. This is accomplished by building divide-and-conquer checkpointing on top of a general-purpose mechanism for interrupting and resuming computation. This mechanism is similar to engines [haynes1984engines] and is orthogonal to AD. We present several implementations of our framework, which we call checkpointVLAD. In Section LABEL:sec:example, we compare the space and time usage of our framework with that of Tapenade on an example.
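The flavor of such an interruption and resumption mechanism can be conveyed with Python generators (a loose analogue of engines, not checkpointVLAD's actual interface; unlike a true capsule, a paused generator cannot be restarted more than once, which is one reason a CPS-based implementation is needed in general).

```python
import math

def primal(x, steps):
    """A primal computation that yields control after every step."""
    for _ in range(steps):
        x = math.sin(x)        # stand-in for one primal step
        yield
    return x

def engine(gen, fuel):
    """Run `gen` for at most `fuel` steps. Returns ('done', value) if it
    finished, otherwise ('interrupted', gen): the paused generator plays
    the role of a capsule from which the computation can be resumed."""
    try:
        for _ in range(fuel):
            next(gen)
    except StopIteration as stop:
        return ('done', stop.value)
    return ('interrupted', gen)
```

A first call such as `engine(primal(0.3, 10), 4)` interrupts after four steps and hands back a capsule; a second call with ample fuel resumes it to completion.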

Note that one cannot generally achieve the space and time guarantees of divide-and-conquer checkpointing with program-interval, subroutine-call-site, or subroutine-body checkpointing unless the call tree has the same shape as the requisite checkpoint tree. Furthermore, one cannot generally achieve the space and time guarantees of divide-and-conquer checkpointing for DO loops by specifying the loop body as a program-interval checkpoint, as that would lead to a right-branching checkpoint tree and behavior analogous to Fig. 2(d). Moreover, if one allows split points at arbitrary execution points, the resulting checkpoint execution intervals may not correspond to program intervals.

Some form of divide-and-conquer checkpointing is necessary. One may wish to take the gradient of a long-running computation, even if it has low asymptotic time complexity. The length of the tape required by reverse mode without divide-and-conquer checkpointing increases with increasing run time. Modern computers can execute several billion floating point operations per second, even without GPUs and multiple cores, which only exacerbate the problem. If each such operation required storage of a single eight-byte double precision number, modern terabyte RAM sizes would fill up after a few seconds of computation. Thus without some form of divide-and-conquer checkpointing, it would not be possible to efficiently take the gradient of a computation that takes more than a few seconds.
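The arithmetic behind this claim can be made explicit (the rates below are illustrative round numbers of our choosing, not measurements):

```python
# Back-of-the-envelope arithmetic: taping one eight-byte double per
# floating-point operation at a few billion operations per second
# exhausts a terabyte of RAM in well under a minute on a single core;
# GPUs and multiple cores shrink this to seconds.

ops_per_second = 4e9                          # assumed scalar FLOP rate
bytes_per_op = 8                              # one double taped per op
ram_bytes = 1e12                              # one terabyte of RAM

tape_growth = ops_per_second * bytes_per_op   # 32 GB of tape per second
seconds_to_fill = ram_bytes / tape_growth     # ~31 seconds
```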

Machine learning methods in general, and deep learning methods in particular, require taking gradients of long-running high-dimensional computations, particularly when training deep neural networks in general or recurrent neural networks over long time series. Thus variants of divide-and-conquer checkpointing have been rediscovered and deployed by the machine learning community in this context [chen-etal-2016a, gruslys-etal-2016a]. These implementations are far from automatic, and depend on compile-time analysis of the static primal flow graphs.

The general strategy of divide-and-conquer checkpointing, the n-ary treeverse algorithm, the bisection and binomial strategies for selecting split points, and the termination criteria that provide fixed space overhead, fixed time overhead, and logarithmic space and time overhead were all presented in [griewank1992alg]. Furthermore, Tapenade has implemented divide-and-conquer checkpointing with the n-ary treeverse algorithm, the binomial strategy for selecting split points, and the termination criterion that provides fixed space overhead, but only for root execution intervals corresponding to invocations of specified DO loops that meet certain criteria, with split points restricted to iteration boundaries of those loops. To our knowledge, the binary checkpoint-tree algorithm presented here and the framework for allowing it to achieve all of the same guarantees as the n-ary treeverse algorithm are new. However, our central novel contribution here is providing a framework for supporting either the binary checkpoint-tree algorithm or the n-ary treeverse algorithm, either bisection or binomial split-point selection, and any of the termination criteria of fixed space overhead, fixed time overhead, or logarithmic space and time overhead, in a way that supports taking the entire derivative calculation as the root execution interval and taking arbitrary execution points as candidate split points, by integrating the framework into the language implementation.

Some earlier work [heller1998checkpointing, stovboun2000tool, kang2003implementation] presaged the work here. That work seems to have received far less exposure and attention than it deserved, perhaps because the ideas therein were so advanced and intricate that they were difficult to communicate clearly. Moreover, the authors report difficulties in getting their implementations to be fully functional. Our work here formulates the requisite ideas and mechanisms carefully and precisely, using methods from the programming-language community: formulation of divide-and-conquer checkpointing of a function as divide-and-conquer application of reverse mode to two functions whose composition is the original function; formulation of the requisite decomposition as a precise and abstract interruption and resumption interface; formulation of semantics precisely through specification of evaluators; use of CPS evaluators to specify an implementation of the interruption and resumption interface; and systematic derivation of a compiler from that evaluator via CPS conversion. Together these allow a complete, correct, comprehensible, and fully general implementation.

2 The Limitations of Divide-and-Conquer Checkpointing with Split Points at Fixed Syntactic Program Points like Loop Iteration Boundaries

Consider the example in Fig. LABEL:fig:example-fortran. This example, while contrived, is a simple caricature of a situation that arises commonly in practice: modeling a physical system with an adaptive grid. An initial state vector x is repeatedly transformed by a state update process and, upon termination, an aggregate property y of the final state is computed by a function f. We wish to compute the gradient of that property y with respect to the initial state x. Here, the state update process first rotates the value pairs at adjacent odd-even coordinates of the state x by an angle theta and then rotates those at adjacent even-odd coordinates. The rotation angle theta is taken to be proportional to the magnitude of the values rotated. The adaptive grid manifests in two nested update loops. The outer loop has duration n, specified as an input hyperparameter. The duration of the inner loop varies wildly as some function of another input hyperparameter s and the outer-loop index i, one that is small on most iterations of the outer loop but large on a few. If the split points were limited to iteration boundaries of the outer loop, as would be common in existing implementations, the increase in space or time requirements would grow faster than sublinearly. The issue is that for the desired sublinear growth properties to hold, it must be possible to select arbitrary execution points as split points. In other words, the granularity of the divide-and-conquer decomposition must be primitive atomic computations, not loop iterations. The distribution of run time across the program is not modularly reflected in the static syntactic structure of the source code, in this case the loop structure. Often the user is unaware of, or even unconcerned with, the micro-level structure of atomic computations, and does not wish to break the modularity of the source code to expose it. Yet the user may still wish to reap the sublinear space or time overhead benefits of divide-and-conquer checkpointing. Moreover, the relative duration of different paths through a program may vary from loop iteration to loop iteration in a fashion that is data dependent, as shown by the above example, and not even statically determinable. We will now proceed to discuss an implementation strategy for divide-and-conquer checkpointing that does not constrain split points to loop iteration boundaries or other syntactic program constructs and does not constrain checkpoints to program intervals or other syntactic program constructs. Instead, it can take any execution point as a split point and introduce a checkpoint at any resulting execution interval.