Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization

10/07/2019 · Paras Jain et al. (UC Berkeley)

Modern neural networks are increasingly bottlenecked by the limited capacity of on-device GPU memory. Prior work explores dropping activations as a strategy to scale to larger neural networks under memory constraints. However, these heuristics assume uniform per-layer costs and are restricted to simple architectures with linear graphs, limiting their usability. In this paper, we formalize the problem of trading off DNN training time and memory requirements as the tensor rematerialization optimization problem, a generalization of prior checkpointing strategies. We introduce Checkmate, a system that solves for optimal schedules in reasonable times (under an hour) using off-the-shelf MILP solvers, then uses these schedules to accelerate millions of training iterations. Our method scales to complex, realistic architectures and is hardware-aware through the use of accelerator-specific, profile-based cost models. In addition to reducing training cost, Checkmate enables real-world networks to be trained with up to 5.1× larger input sizes.


1 Introduction

Deep learning is rapidly pushing the limits of memory capacity on neural network accelerators as researchers train neural networks on high-resolution images [dong_superres, kim_superres, tai_superres], 3D point-clouds [yangHDNETExploitingHD, chenMultiview3DObject2017], and long Natural Language Processing (NLP) sequence data [devlin_bert:_2018, vaswani_attention_2017, child_generating_2019]. In these applications, GPU memory usage is dominated by the intermediate activation tensors (see Figure 1) needed for backpropagation.

The limited availability of high bandwidth on-device memory creates a memory wall that stifles exploration of novel architectures. Authors of state-of-the-art models cite memory as a limiting factor in image classification [krizhevsky_imagenet_2012, gomez_reversible_2017, he_deep_2016], semantic segmentation [chen_deeplab:_2016, pohlen2017FRRN], and NLP [child_generating_2019, liu_roberta:_2019, dai_transformer-xl:_2019].

When there is insufficient RAM to cache all activation tensors for backpropagation, select tensors can instead be discarded during forward evaluation. When a discarded tensor is needed as a dependency for gradient calculation, it can be rematerialized. As illustrated in Figure 2, rematerializing a node allows a large DNN to fit within memory at the expense of additional computation.

chen_training_2016 and griewank_algorithm_2000 present heuristics for rematerialization, referring to the problem as checkpointing. However, their approaches cannot be applied generally to nonlinear DNN structures such as residual connections. Furthermore, they make the strong assumption that the graph has uniform compute cost across all nodes. Prior work also assumes that gradients may never be rematerialized. These assumptions limit the efficiency and generality of prior approaches.

Our work formalizes tensor rematerialization as a constrained optimization problem. Using off-the-shelf numerical solvers, we are able to discover optimal rematerialization strategies for arbitrary deep neural networks in TensorFlow with non-uniform computation and memory costs. We demonstrate that optimal rematerialization allows larger batch sizes and substantially reduced memory usage with minimal computational overhead across a range of image classification and semantic segmentation architectures. As a consequence, our approach allows researchers to easily explore larger models, at larger batch sizes, on more complex signals with minimal computation overhead.

In particular, the contributions of this work include:

  • a formalization of the rematerialization problem as a mixed integer linear program with a substantially more flexible search space than prior work, in Section 4.5.

  • an algorithm to translate a feasible solution into a concrete execution plan and a static training graph.

  • an implementation of optimal tensor rematerialization in Tensorflow at runtime.

  • Checkmate, a system that enables training models with up to 5.1× larger input sizes and with up to 2.6× larger batch sizes than prior art at minimal overhead.

Figure 1: Memory consumed by activations far outweighs that of parameters for popular model architectures. Moreover, advances in GPU DRAM capacity are quickly utilized by researchers; the dashed line notes the memory limit of the GPU used to train each model.
Figure 2: Tensor rematerialization enables training larger models. A large DNN is trained by evaluating forward pass operations (blue) and then computing gradients (red). Current frameworks retain each forward activation for backpropagation, leading to memory exhaustion (OOM). Instead, we deliberately deallocate an activation early in the schedule and rematerialize it later, just before its gradient operation needs it; the working set now fits within RAM.

2 Optimal Rematerialization

Tensor rematerialization enables training neural networks with activation memory requirements larger than the capacity of current hardware.

In Figure 2, we examine a three-layer neural network that exceeds device memory (OOMs) under the current practice of caching every dependency. The corresponding gradient nodes are shown in red. Our proposed policy discards the activation of an intermediate node once the forward pass no longer needs it, thereby freeing memory. However, that node's gradient operation depends on its value, so the activation is rematerialized in time for the gradient computation.

Prior art assumes networks are linear graphs, or path graphs, where node $v_i$ depends solely on $v_{i-1}$, with edge set $E = \{(v_{i-1}, v_i)\}$. For example, chen_training_2016 propose a strategy where the graph is divided into $\sqrt{n}$ segments, each with $\sqrt{n}$ nodes. During the forward pass, the strategy retains the activation of the endpoint of each segment. During backpropagation, each segment is recomputed from the activation of the endpoint of the previous segment. This results in an $O(n)$ rematerialization overhead while reducing memory to $O(\sqrt{n})$. As Chen assumes networks are linear graphs, a node inside a segment cannot be rematerialized independently of the rest of its segment.
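As a rough worked illustration (our own example under the unit-cost, unit-memory assumption, not a figure from the paper): for a linear graph with $n = 100$ nodes, this strategy stores the $\sqrt{n} = 10$ segment endpoints and, during backpropagation, holds at most one segment of $10$ activations live while recomputing its interior nodes once each:

$$\text{peak memory} \approx 2\sqrt{n} = 20 \ \text{activations} = O(\sqrt{n}), \qquad \text{extra compute} \approx n - \sqrt{n} = 90 \ \text{evaluations} = O(n).$$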

Linear graph assumptions dramatically limit the applicability of prior work to popular DNN architectures. For example, the popular ResNet50 [he_deep_2016] requires treating each residual block as a single node, which leads to inefficient solutions. For other networks with longer skip connections (e.g., U-Net [ronneberger_u-net:_2015]), the vast majority of the graph is incompatible.

Prior work also makes similar assumptions about the cost model for DNN graphs, namely that all nodes require an equivalent number of operations to evaluate. In the VGG19 [simonyan_very_2014] architecture, the largest layer is seven orders of magnitude more expensive than the smallest layer. This leads to suboptimal solutions.

Our work makes few assumptions on neural network graphs. We explore a solution space that allows for (a) arbitrary graphs with several inputs and outputs for each node, (b) variable memory costs across layers and (c) variable computation costs for each layer (such as FLOPs or profiled runtimes). We constrain solutions to simply be correct (a node’s dependencies must be materialized before it can be evaluated) and within the RAM budget (at any point during execution, resident tensors must fit into RAM).

To address this generalized problem, we seek solutions that minimize the time it takes to perform a single training iteration, subject to the correctness and memory constraints outlined above. To solve this optimization problem efficiently, we project schedules into space and time. Modeling schedules this way allows us to cast the objective as a linear expression. The problem can then be solved using off-the-shelf mixed integer linear program solvers such as GLPK [noauthor_glpk_nodate] or COIN-OR Branch-and-Cut [forrest_coin-or_2019]. An optimal solution to the linear program will be optimal for arbitrary graphs with variable memory and variable compute profiles, factors that are neglected by prior work.

3 Related Work

We categorize the large body of related work into activation compression, rematerialization and checkpointing, reversible architectures, and distributed computation.

Activation compression In some DNN applications, it is possible to process compressed representations with minimal accuracy loss. gueguen_faster_2018 develop a DNN that classifies ImageNet images given discrete cosine transform codes from partially decoded JPEG images, accelerating network evaluation and possibly requiring less memory.

jain_gist:_2018 observe that feature maps, or activations, dominate memory usage during training, and halve memory usage on average by reducing the precision of activations. Still, compression only reduces memory usage by a constant factor and reduces accuracy at higher compression rates. In contrast, deleting and recomputing values maintains the same accuracy at a range of memory budgets. Also, as our solver operates on arbitrary computation graphs, compression operations may be included.

Checkpointing and rematerialization Prior work has addressed checkpointing and rematerialization in linear graphs. In work from the differential equation community that inspired later work in reverse-mode automatic differentiation, griewank_algorithm_2000 develop a procedure for checkpointing in idealized linear computation graphs with $O(\log n)$ memory usage, where $n$ denotes the number of computation nodes. Griewank proves that the proposed logarithmic checkpointing strategy is optimal for linear graphs with unit cost and memory per node. chen_training_2016 propose a heuristic for rematerialization in similarly idealized unit-cost, unit-memory linear graphs with $O(\sqrt{n})$ memory usage at an $O(n)$ computational overhead, intended for DNN training. However, the approach is far from optimal in practice, as DNN layers vary significantly in memory usage and computational cost [sze2017efficient]. chen_training_2016 also develop a greedy algorithm that checkpoints layers of a network in roughly memory-equal segments, with a hyperparameter for the size of such segments. Still, neither procedure is cost aware, nor does either deallocate checkpoints when possible. gruslys_memory-efficient_2016 develop a dynamic programming algorithm for checkpoint selection in unrolled recurrent neural network training, exploiting their linear forward graphs. To extend checkpointing to branching networks, feng_cutting_2018 provide a dynamic program to select checkpoints that partition a nonlinear computation graph, but ignore layer costs and memory usage. doi:10.1080/10556788.2018.1459621 develop a divide-and-conquer checkpointing strategy for programs. beaumont:hal-02131552 give a dynamic program for checkpoint selection in join networks, where multiple linear graphs merge at the loss node.

Intermediate value recomputation is also common in register allocation. Compiler backends lower an intermediate representation of code to an architecture-specific executable binary. During lowering, an abstract static single assignment (SSA) graph of values and operations [rosen_global_1988, cytron_efficiently_1991] is concretized by mapping values to a finite number of registers. If insufficient registers are available for an SSA-form computation graph, values are spilled to main memory by storing and later loading the value. Register allocation has been formulated as a graph coloring problem [chaitin_register_1981], as an integer program [goodwin_optimal_1996, lozano_combinatorial_2018], and as a network flow problem [koes_global_2006].

As an optimization, register allocators can rematerialize, or recompute, constants and values with register-resident dependencies if the cost of doing so is less than the cost of a spill [chaitin_register_1981, briggs_rematerialization_1992, punjani_register_2004]. While similar to our setup, register rematerialization is limited to exceptional values that can be recomputed in a single instruction with dependencies already in registers. For example, memory offset computations can be cheaply recomputed, and loads of constants can be statically resolved. In contrast, Checkmate can recompute entire subgraphs of the program’s data-flow.

During the evaluation of a single kernel, GPUs spill per-thread registers to a thread-local region of global memory (i.e. local memory) [micikevicius_local_2011, nvidia_nvidia_2017]. NN training executes DAGs of kernels and stores intermediate values in shared global memory. This produces a wide range of value sizes, from 4-byte floats to gigabyte tensors, whereas CPU and GPU registers range from 1 to 64 bytes. Our problem of inter-kernel memory scheduling thus differs in scale from the classical problem of register allocation within a kernel or program. Rematerialization is more appropriate than copying values out of core, as the cost of spilling values from global GPU memory to main memory (RAM) is substantial [micikevicius_local_2011, jain_gist:_2018], though such spilling is possible [meng_training_2017].

Reversible Networks gomez_reversible_2017 propose a reversible (approximately invertible) residual DNN architecture, where intermediate temporary values can be recomputed from values derived later in the standard forward computation. Reversibility allows forward-pass activations to be recomputed during the backward pass rather than stored, similar to gradient checkpointing. bulo_-place_2018 replace only ReLU and batch normalization layers with invertible variants, reconstructing their inputs during the backward pass and reducing memory usage by up to 50%. However, this approach has a limit to its memory savings and does not support a range of budgets. Reversibility is not yet widely used to save memory, but is a promising complementary approach.

Distributed computation An orthogonal approach to address the limited memory problem is distributed-memory computations and gradient accumulation. However, model parallelism requires access to additional expensive compute accelerators, fast networks, and non-trivial partitioning of model state to balance communication and computation [gholami2018integrated, jiaDataModelParallelism, mccandlishEmpiricalModelLargeBatch2018]. Gradient accumulation enables larger batch sizes by computing the gradients in sub-batches across a mini-batch. However, gradient accumulation often degrades performance as batch normalization performs poorly on small minibatch sizes [wuGroupNormalization2018, ioffe_batch_2015].

4 Optimal Rematerialization with Integer Linear Programming

In this section, we develop an optimal solver that schedules computation, memory allocation and garbage collection during the evaluation of general data-flow graphs including those used in neural network training. Our proposed scheduler minimizes computation or execution time while guaranteeing that the schedule will not exceed device memory limitations. The rematerialization problem is formulated as a mixed integer linear program (MILP) that can be solved with standard commercial or open-source solvers.

4.1 Problem definition

A computation or data-flow graph $G = (V, E)$ has $n$ nodes $V = \{v_1, \dots, v_n\}$ that represent operations yielding values (e.g. tensors). These operations depend on the results of other operations, with dependencies specified by the edges $E$. Operations are neural network layers such as convolutions, fully connected layers, and activation functions, and values include activations and gradients stored in memory. Operations may also depend on parameters, or weights, which are stored in memory.

In this work, we impose a topological order over the nodes, such that operation $v_i$ may only depend on the results of operations $v_j$ with $j < i$. This topological ordering specifies an execution order for the graph and is given by user code in eager-execution frameworks such as PyTorch [paszke_automatic_2017]. For convenience, we refer to nodes by their index in the ordering. Separating ordering and allocation is common in compiler design, and both GCC [olesen_register_2011] and LLVM [lattner_llvm:_2002] have separate instruction scheduling and register allocation passes.

We wish to find a feasible schedule specifying the order of memory allocations, evaluations, and garbage collections such that the total computational cost is minimized, subject to a global memory budget of $M_{\text{budget}}$ bytes. This schedule will be optimal with respect to the topological order and a cost model for the computational expense of operations.

Figure 3: The dependencies of a node can only be garbage collected after it is evaluated. $U_{t,k}$ measures the memory used after evaluating node $v_k$ in stage $t$ and before deallocating its dependencies. Some dependencies may be deallocated during garbage collection, while others may not due to a forward edge to a later node.

4.2 Partitioning the schedule

Any schedule that evaluates all nodes in the computation graph can be partitioned into frontier-advancing stages, where in stage $t$, operation $v_t$ is evaluated for the first time. Each operation in the graph can be evaluated once per stage if needed to advance the frontier. The partitioned schedule is represented by binary decision variables $R$ and $S$ that indicate whether a node is to be computed and whether it is to be checkpointed at each point in evaluation.

Let $R_{t,i} \in \{0,1\}$ be a binary variable, where $R_{t,i} = 1$ indicates that operation $v_i$ should be evaluated in stage $t$. This computation has cost $C_i$, measured in FLOPs or latency, and the result of the operation consumes memory $M_i$ in bytes.

Further, let $S_{t,i} = 1$ indicate that the result of operation $v_i$ should be retained in memory from stage $t-1$ into stage $t$, such that the result is available for use during stage $t$. This generalizes checkpointing [griewank_algorithm_2000, chen_training_2016, gruslys_memory-efficient_2016, siskind_divide-and-conquer_2018, feng_cutting_2018], as values can be retained and deallocated many times in our schedules.

The decision variables $R$ and $S$ are instantiated as binary lower triangular matrices. Coupled with an aggressive memory deallocation policy that frees memory as soon as possible (Section 4.4), $R$ and $S$ are sufficient to express the evaluation schedule.

4.3 Scheduling with ample memory

First, consider neural network evaluation on a processor with ample memory. Even without a memory constraint, our solver must ensure that checkpointed and computed operations have their dependencies resident in memory. Minimizing the total cost of computation across stages subject to these dependency constraints yields the ample-memory ILP (4.3):

$$
\begin{aligned}
\min_{R,\,S}\quad & \sum_{t=1}^{n} \sum_{i=1}^{t} C_i\, R_{t,i} \\
\text{s.t.}\quad & R_{t,j} \le R_{t,i} + S_{t,i} && \forall t,\ \forall (i,j) \in E \\
& S_{t,i} \le R_{t-1,i} + S_{t-1,i} && \forall t,\ \forall i \\
& R_{t,t} = 1 && \forall t \\
& R_{t,i},\, S_{t,i} \in \{0,1\} && \forall t,\ \forall i
\end{aligned}
$$

Constraints encode boolean logical formulas for feasibility via arithmetic operations.

Dependencies must be resident The first constraint ensures that an operation is computed in stage $t$ only if all of its dependencies are resident in memory: dependencies can either be recomputed or retained from the previous stage. That is, $R_{t,j} \le R_{t,i} + S_{t,i}$ if operation $v_j$ depends on operation $v_i$. Similarly, the second constraint, $S_{t,i} \le R_{t-1,i} + S_{t-1,i}$, encodes that retaining a value requires it to either be computed or already be checkpointed in the previous stage.

Frontier advancement The constraint $R_{t,t} = 1$ partitions the schedule into stages and imposes a topological ordering over operations, requiring exactly one new operation to be evaluated per stage. For connected computation graphs with a single leaf node, we could relax this constraint and require only that the final node is eventually evaluated. The ILP solver would then need to choose an evaluation order in conjunction with the rematerialization and checkpointing pattern. However, imposing an execution order substantially accelerates solving in practice. Analogously, production compilers such as LLVM [lattner_llvm:_2002, olesen_register_2011] and GCC separately allocate registers and schedule instructions.

The infinite-memory ILP has $O(n^2)$ binary decision variables and $O(n\,(|E| + n))$ constraints.
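To make the formulation concrete, the following is a minimal sketch (our own, not the authors' released implementation) of the ample-memory ILP above, built with the open-source PuLP modeler on a toy graph; the node count, costs, and edges are hypothetical placeholders, and any MILP backend (CBC, GLPK, Gurobi) could be used.

```python
# Sketch of the ample-memory ILP of Section 4.3 using PuLP (assumptions: toy graph).
import pulp

n = 4                                        # hypothetical node count (forward + backward ops)
C = [1.0, 4.0, 2.0, 3.0]                     # hypothetical per-node costs (e.g., profiled ms)
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]     # hypothetical dependency edges (i -> j, i < j)

prob = pulp.LpProblem("checkmate_ample_memory", pulp.LpMinimize)

# R[t, i]: evaluate node i in stage t; S[t, i]: node i checkpointed into stage t.
# Both are lower triangular: only i <= t is instantiated.
R = {(t, i): pulp.LpVariable(f"R_{t}_{i}", cat="Binary") for t in range(n) for i in range(t + 1)}
S = {(t, i): pulp.LpVariable(f"S_{t}_{i}", cat="Binary") for t in range(n) for i in range(t + 1)}

# Objective: total cost of all evaluations across stages.
prob += pulp.lpSum(C[i] * R[t, i] for t in range(n) for i in range(t + 1))

for t in range(n):
    prob += R[t, t] == 1                                  # frontier advancement
    for (i, j) in edges:
        if j <= t:
            prob += R[t, j] <= R[t, i] + S[t, i]          # dependency recomputed or checkpointed
    for i in range(t + 1):
        # a checkpoint must come from the previous stage (0 if no previous stage exists)
        prev = (R[t - 1, i] + S[t - 1, i]) if (t >= 1 and i <= t - 1) else 0
        prob += S[t, i] <= prev

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("status:", pulp.LpStatus[prob.status], "cost:", pulp.value(prob.objective))
```

The memory accounting variables and budget constraint of Section 4.4 would be layered on top of this model in the same way.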

4.4 Constraining memory utilization

For a schedule specified by $R$ and $S$ to be feasible, the memory used at all points of evaluation must be less than $M_{\text{budget}}$. To constrain memory usage, we introduce memory accounting variables $U$ into the ILP. Let $U_{t,k}$ denote the memory used while computing node $v_k$ in stage $t$. $U_{t,k}$ is defined recursively in terms of auxiliary binary variables $\mathrm{Free}_{t,i,k}$ for $i \in \mathrm{DEPS}[k] \cup \{k\}$, which specify whether node $v_i$ may be deallocated in stage $t$ after evaluating node $v_k$.

We assume that (1) network inputs and parameters are always resident in memory and (2) enough space is allocated for gradients of the loss with respect to parameters (we reserve space for gradients because many optimizers, such as SGD with momentum, maintain gradient statistics; parameter gradients are typically small, the same size as the parameters themselves). Additionally, at the beginning of a stage, all checkpointed values are resident in memory. Hence, with $M_{\text{static}}$ denoting the memory reserved for inputs, parameters, and their gradients, we initialize the recurrence as

$$U_{t,0} = M_{\text{static}} + \sum_{i=1}^{n} M_i\, S_{t,i}. \tag{1}$$

Suppose $U_{t,k-1}$ bytes of memory are utilized after possibly evaluating operation $v_{k-1}$. Before evaluating operation $v_k$, $v_{k-1}$ and the dependencies of $v_{k-1}$ may be deallocated if no longer used. Then, an output tensor for the result of operation $v_k$ is allocated, consuming memory $R_{t,k}\,M_k$. This yields recurrence (2), depicted in Figure 3:

$$U_{t,k} = U_{t,k-1} - \mathrm{mem\_freed}_t(v_{k-1}) + R_{t,k}\, M_k. \tag{2}$$

Here, $\mathrm{mem\_freed}_t(v_k)$ expresses the amount of memory that can be freed by deallocating $v_k$ and its dependencies. Treating $R$ and $S$ as booleans, where $\mathbb{1}[\cdot]$ is the indicator function,

$$\mathrm{mem\_freed}_t(v_k) = \sum_{i \in \mathrm{DEPS}[k] \cup \{k\}} M_i \cdot \mathrm{Free}_{t,i,k} \tag{3}$$

$$\mathrm{Free}_{t,i,k} = R_{t,k} \wedge \lnot S_{t+1,i} \wedge \bigwedge_{j \in \mathrm{USERS}[i],\, j > k} \lnot R_{t,j} \tag{4}$$

Logically, $M_i$ bytes are deallocated if dependency $v_i$ is not checkpointed for the next stage nor used for later computation within the stage. That is, $\mathrm{Free}_{t,i,k} = 1$ if and only if $v_i$ can be deallocated in stage $t$ after evaluating $v_k$. Predicating on $R_{t,k}$ in (4) ensures values are only freed once. To express $\mathrm{Free}$ in our ILP, (4) must be defined arithmetically with linear constraints. Applying De Morgan's law for union and intersection interchange,

$$\mathrm{Free}_{t,i,k} = \mathbb{1}\!\left[\,\mathrm{num\_hazards}(t,i,k) = 0\,\right], \qquad \mathrm{num\_hazards}(t,i,k) = (1 - R_{t,k}) + S_{t+1,i} + \sum_{j \in \mathrm{USERS}[i],\, j > k} R_{t,j}, \tag{5}$$

where $\mathrm{num\_hazards}(t,i,k)$ is introduced simply for notational convenience. Relation (5) is implemented with linear cast-to-boolean constraints, where $\kappa$ is the maximum value $\mathrm{num\_hazards}(t,i,k)$ can assume,

$$1 - \mathrm{Free}_{t,i,k} \le \mathrm{num\_hazards}(t,i,k) \tag{6a}$$
$$\kappa\,(1 - \mathrm{Free}_{t,i,k}) \ge \mathrm{num\_hazards}(t,i,k) \tag{6b}$$
$$\mathrm{Free}_{t,i,k} \in \{0,1\} \tag{6c}$$

4.5 Complete Integer Linear Program formulation

The complete memory-constrained MILP follows in (4.5), with $O\!\left(n\,(|E| + n)\right)$ variables and constraints:

$$
\begin{aligned}
\min_{R,\,S,\,U,\,\mathrm{Free}}\quad & \sum_{t=1}^{n} \sum_{i=1}^{t} C_i\, R_{t,i} \\
\text{s.t.}\quad & \text{the correctness constraints of Section 4.3,} \\
& \text{the memory accounting constraints (1), (2), (6a), (6b), (6c),} \\
& U_{t,k} \le M_{\text{budget}} && \forall t,\ \forall k.
\end{aligned}
$$

4.6 Constraints implied by optimality

Problem (4.5) can be simplified by removing constraints implied by the optimality of a solution. In (1), all values with $S_{t,i} = 1$ are allocated space, even if they are unused. If such a value is unused, the checkpoint is spurious, and the solver can set $S_{t,i} = 0$ to reduce memory usage if needed.

Further, $\mathrm{Free}_{t,k,k} = 1$ only if operation $v_k$ is spuriously evaluated with no uses of its result. Hence, the solver can set $R_{t,k} = 0$ to reduce cost. When constructing the MILP, we eliminate the variables $\mathrm{Free}_{t,k,k}$, assumed to be 0, by modifying (3) to sum only over $i \in \mathrm{DEPS}[k]$. Note that the eliminated variables can be computed inexpensively from $R$ and $S$ after solving.

4.7 Generating an execution plan

Given a feasible solution $(R, S, U, \mathrm{Free})$ to (4.5), we generate a concrete execution plan that evaluates the computation graph with bounded memory usage. This execution plan, or schedule, is constructed via a row-major scan of the solution matrices, detailed in Algorithm 1.

A concrete execution plan is a program consisting of a sequence of statements, each of which is an allocation, a computation, or a deallocation. Statement %r = allocate v defines a virtual register %r for the result of the operation corresponding to node v, used to track memory usage during execution. Such a register must be allocated before an instance of statement compute v, %r in the plan, which invokes the operation and generates an output value tracked by the register %r. Finally, statement deallocate %r deletes the virtual register, marks the output value for garbage collection, and updates the tracked memory usage.

The execution plan generated by Algorithm 1 is further optimized by moving deallocations earlier in the plan if possible. For example, spurious checkpoints that are unused in a stage can be deallocated at the start of the stage rather than during the stage. Note that this code motion is unnecessary for feasibility as the solver guarantees that the unoptimized schedule will not exceed the desired memory budget.


Checkpoint all (ideal): No rematerialization; the default in deep learning frameworks.
Griewank et al. [griewank_algorithm_2000]: revolve procedure.
Chen et al. √n [chen_training_2016]: √n checkpointing heuristic.
Chen et al. greedy [chen_training_2016]: greedy checkpointing heuristic, with search over the segment-size parameter.
AP √n: Chen et al. √n on articulation points + optimal R solve.
AP greedy: Chen et al. greedy on articulation points + optimal R solve.
Linearized √n: Chen et al. √n on topological sort + optimal R solve.
Linearized greedy: Chen et al. greedy on topological sort + optimal R solve.
Optimal MILP: MILP as formulated in Section 4.
Table 1: Rematerialization baselines and our extensions to make them applicable to non-linear architectures.

4.8 Generating static computation graph

For implementation, the concrete execution plan can either be interpreted, or encoded as a static computation graph. In this work, we generate a static graph from the plan, which is executed by a numerical machine learning framework. See Section 5.2 for implementation details.

4.9 Cost model

To estimate the runtime of a training iteration under a rematerialization plan, we apply an additive cost model, the objective of Section 4.3, incurring cost $C_i$ each time node $v_i$ is evaluated. Costs are determined prior to MILP construction by profiling network layers on target hardware with random inputs across a range of batch sizes and input shapes, and exclude static graph construction and input generation time.

As neural network operations consist of dense numerical kernels such as matrix multiplication, these runtimes are low variance and largely independent of the specific input data [jia_exploring_2018, sivathanu_astra:_2019]. However, forward pass time per batch item decreases with increasing batch size due to improved data parallelism [canziani_analysis_2016], so it is important to compute costs with appropriate input dimensions.

We statically determine the memory consumption of each value in the data-flow graph as input sizes are known. Values are multi-dimensional tensors stored at 4 byte floating point precision. The consumption is used to construct memory constraints (1-2).
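As an illustration of this kind of profile-based cost model, the sketch below (our own, not Checkmate's profiler; the layer choice, shapes, and trial counts are arbitrary assumptions) times a single Keras layer on random inputs of the training shape and takes the median over several trials after a warm-up phase:

```python
# Hedged sketch of per-layer cost profiling on the target accelerator.
import time
import numpy as np
import tensorflow as tf

def profile_layer(layer, input_shape, batch_size, trials=50, warmup=10):
    """Return the median runtime (seconds) of `layer` on random inputs."""
    x = tf.random.normal((batch_size, *input_shape))
    fn = tf.function(lambda t: layer(t))        # trace once into a static graph
    for _ in range(warmup):                     # warm-up: tracing, cuDNN autotuning
        _ = fn(x)
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        y = fn(x)
        _ = y.numpy()                           # force the (possibly asynchronous) kernel to finish
        times.append(time.perf_counter() - start)
    return float(np.median(times))

conv = tf.keras.layers.Conv2D(64, 3, padding="same")
cost = profile_layer(conv, input_shape=(224, 224, 3), batch_size=32)
print(f"profiled cost: {cost * 1e3:.2f} ms")
```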

  Input: graph G = (V, E), feasible solution (R, S, Free)
  Output: execution plan P
  Initialize execution plan P ← [ ] and the set of live registers ← { }
  for t = 1 to n do
     for k = 1 to t do
        if R[t, k] = 1 then
           // Materialize v_k
           emit %r_k = allocate v_k
           emit compute v_k, %r_k
        end if
        // Free v_k and its dependencies
        for i ∈ DEPS[k] ∪ {k} do
           if Free[t, i, k] = 1 then
              emit deallocate %r_i
           end if
        end for
     end for
  end for
Algorithm 1: Generate execution plan
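For reference, a compact Python rendering of the same row-major scan (our illustration, not the paper's released code; `R[t][k]` and `FREE[(t, i, k)]` come from the MILP solution, and `deps[k]` is the hypothetical dependency list of node `k`):

```python
# Sketch of Algorithm 1: emit allocate / compute / deallocate statements.
def generate_plan(n, R, FREE, deps):
    plan = []
    for t in range(n):                            # stage t
        for k in range(t + 1):                    # nodes considered within stage t
            if R[t][k]:
                plan.append(("allocate", k))      # %r_k = allocate v_k
                plan.append(("compute", k))       # compute v_k, %r_k
            # free v_k and its dependencies that the solver marked dead
            for i in deps[k] + [k]:
                if FREE.get((t, i, k), 0):
                    plan.append(("deallocate", i))
    return plan
```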

4.10 Extension: Recomputing nodes within a stage

In Section 4.2, we allowed each operation in a computation graph to be evaluated in multiple stages, though a given operation could only be evaluated once per stage. Results are cached if needed later within a stage. This allows certain nodes to be evaluated up to $n$ times, a substantially more flexible search space than prior work. However, the schedule could be partitioned into any number of stages to allow more evaluations of a given node, which may be useful for highly connected graphs. For example, one frontier-advancing stage in our formulation could be split into multiple sub-stages, allowing correspondingly more re-evaluations of each node. In Section 5, we evaluate without splitting stages, so the number of stages equals the number of nodes $n$.

Figure 4: Computational overhead versus memory budget for (a) the VGG16 image classification NN [simonyan_very_2014], (b) the MobileNet image classification NN, and (c) the U-Net semantic segmentation NN [ronneberger_u-net:_2015]. Overhead is measured with respect to the best possible strategy without a memory restriction, based on a profile-based cost model of a single NVIDIA Tesla V100 GPU. For U-Net (c), at the V100 memory budget of 16 GB, we achieve a speedup over the best baseline, linearized greedy, and a larger speedup over the next best, linearized √n. Takeaway: our model- and hardware-aware solver produces in-budget solutions with the lowest overhead on linear networks (a-b), and dramatically lowers memory consumption and overhead when residual connections are used (c).

5 Evaluation

Figure 5: Overview of the Checkmate system.

In this section, we demonstrate that optimal rematerialization with Checkmate significantly decreases DNN training memory usage at minimal overhead (Figure 4). Furthermore, by allowing at most a single extra forward pass, we can substantially increase state-of-the-art network batch sizes over frameworks that store all intermediate values, and over the best rematerialization baseline (Figure 6). We compare our proposed solver against eight baseline rematerialization heuristics on representative image classification and high-resolution semantic segmentation models. As prior work is largely limited to simplified computation graphs, we propose novel extensions where necessary for comparison.

5.1 Rematerialization baselines and generalizations

Table 1 summarizes baseline rematerialization strategies. The nominal evaluation strategy stores all features generated during the forward pass for use during the backward pass – this is the default in frameworks such as Tensorflow. Hence, every layer is computed once. We refer to this baseline as Checkpoint all, an ideal approach given ample memory.

On the linear architectures, such as VGG16 and MobileNet (v1), we directly apply prior work from griewank_algorithm_2000 and chen_training_2016, baselines referred to as Griewank and Walther revolve, Chen et al. √n, and Chen et al. greedy. To build a tradeoff curve for computation versus memory budget, we search over the segment-size hyperparameter in Chen's greedy strategy. However, these baselines cannot be used for modern architectures with even simple residual connections. For a fair comparison, we extend Chen's √n and greedy algorithms to apply to general computation graphs with residual connections or branching structure (e.g. ResNet50 and U-Net).

chen_training_2016 suggest manually annotating good checkpointing candidates in a computation graph. For the first extensions, denoted AP √n and AP greedy, we automatically identify articulation points, or cut vertices, i.e., vertices that disconnect the forward-pass DAG, and use these as candidates. The heuristics then select a subset of these candidates, and we work backwards from the checkpoints to identify which nodes require recomputation.

Still, some networks, including U-Net, have few articulation points. We therefore also extend Chen's heuristics by treating the original graph as a linear network with nodes connected in topological order, again backing out the minimal recomputations from the selected checkpoints. These extensions are referred to as Linearized √n and Linearized greedy.

Sections 5.1.1 and 5.1.2 provide more details on our generalizations. Note that all proposed generalizations exactly reproduce the original heuristics on linear networks.

Figure 6: Maximum batch size possible on a single NVIDIA Tesla V100 GPU when using different generalized rematerialization strategies with at most a single extra forward pass. We enable a substantially larger batch size than the current practice of caching all activations (on U-Net), and than the best checkpointing scheme (on MobileNet).

5.1.1 AP √n and AP greedy

We identify articulation points (APs) in the undirected form of the forward-pass data-flow graph as candidates for checkpointing. Articulation points are vertices that increase the number of connected components of (i.e., disconnect) the graph if removed, and can be identified in $O(|V| + |E|)$ time via a modified DFS traversal [holder_graph_2008]. An articulation point is a good candidate for checkpointing, as vertices after it in the topological order have no dependencies on vertices before it. DNN computation graphs are connected, so each intermediate tensor can be reconstructed from a single articulation point earlier in the topological order, or from the input if there is no such AP. APs include the input and output nodes of residual blocks in ResNet, but not vertices inside blocks. We apply Chen's heuristics to checkpoint a subset of these candidates, then solve for the optimal recomputation plan $R$ to restore correctness. Solving for $R$ ensures that the dependencies of a node are in memory when it is computed.

We could find $R$ by solving optimization (4.5) with additional constraints on $S$ that encode the heuristically selected checkpoints. However, as $S$ is given, the optimization is solvable in $O(n(|V| + |E|))$ time via a graph traversal per row of $S$ that fills in entries of $R$ whenever a needed value is not in memory.
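For instance, candidate selection for the AP baselines can be prototyped in a few lines (a sketch assuming the networkx library; the toy graph below is our own and not from the paper):

```python
# Select checkpoint candidates as articulation points of the undirected forward graph.
import networkx as nx

def checkpoint_candidates(forward_edges):
    """forward_edges: list of (u, v) dependency edges of the forward-pass DAG."""
    g = nx.Graph()                       # undirected form of the data-flow graph
    g.add_edges_from(forward_edges)
    return sorted(nx.articulation_points(g))   # found via DFS in O(V + E)

# Two chained residual blocks: 0->1->2 with skip 0->2, then 2->3->4 with skip 2->4.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]
print(checkpoint_candidates(edges))      # [2]: the block boundary; vertices inside blocks are not APs
```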

5.1.2 Linearized √n and Linearized greedy

The data-flow graph $G$ can be treated as a linear graph $G'$ with edges connecting consecutive vertices in a topological order, $G' = (V, \{(v_i, v_{i+1}) \mid 1 \le i < n\})$. While $G'$ does not properly encode the data dependencies of $G$, it is a linear graph that baselines can analyze. To extend a baseline, we apply it to $G'$, generate a checkpoint matrix $S$ from the resulting checkpoint set, and solve for the optimal $R$ with respect to the original graph $G$, as in the AP baselines.

5.2 Evaluation Setup

The feasible set of our optimal MILP formulation includes all possible schedules produced by the baselines. This allows us to leverage the same execution planning, static graph generation, and testing infrastructure across the rematerialization strategies. Together, these components form the Checkmate system, illustrated in Figure 5.

Our framework is implemented in Tensorflow 2.0 [abadi_tensorflow:_2016], accepting user-defined models expressed via the high-level Keras interface. We extract the forward and backward computation graph, then construct optimization problem (4.5) as a mixed integer linear program with the Gurobi mathematical programming library. To accelerate problem construction, the decision variables $R$ and $S$ are expressed as lower triangular matrices, as are the memory accounting variables $U$. $\mathrm{Free}$ is represented via a binary matrix. Solutions are generated with a user-configurable time limit (in seconds), though the large majority of problems solve within minutes. Problems with exceptionally large batch sizes or heavily constrained memory budgets may reach this time limit if the solver cannot prove the optimization to be infeasible. Finally, solutions are translated into concrete execution plans and are used to construct a new static training graph.
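For illustration, the lower-triangular variable layout and solver time limit might be declared as in the following gurobipy fragment (a hypothetical sketch, not the released Checkmate code; the node count and time limit are placeholders):

```python
# Sketch of variable declaration and solver configuration with gurobipy.
import gurobipy as gp
from gurobipy import GRB

n = 16                                                    # hypothetical number of graph nodes
tri = [(t, i) for t in range(n) for i in range(t + 1)]    # lower-triangular index set

m = gp.Model("checkmate_milp")
R = m.addVars(tri, vtype=GRB.BINARY, name="R")   # R[t, i]: recompute node i in stage t
S = m.addVars(tri, vtype=GRB.BINARY, name="S")   # S[t, i]: keep node i checkpointed into stage t
U = m.addVars(tri, lb=0.0, name="U")             # U[t, i]: memory in use while evaluating i in stage t

m.Params.TimeLimit = 3600                        # user-configurable wall-clock limit (seconds)
# ... the objective and constraints of Sections 4.3-4.5 would be added here ...
m.optimize()
```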

5.3 Results

Figure 4 compares rematerialization strategies on VGG16, MobileNet, and U-Net. The y-axis shows the computational overhead of checkpointing in terms of time, as compared to the baseline without rematerialization. The time is computed by profiling each individual layer of the network. The x-axis shows the total memory budget required to run each model with the specified batch size, computed for single-precision training. Except for the √n heuristics, each rematerialization algorithm has a knob to trade off the amount of recomputation against memory usage (a smaller memory budget leads to higher overhead).

Takeaways: For all three DNNs, our optimal Checkmate formulation produces clearly faster execution plans than the algorithms proposed by chen_training_2016 and griewank_algorithm_2000, and is substantially faster than the next best strategy on U-Net at the NVIDIA V100 memory budget. Our framework allows training a U-Net at a batch size of 32 images per GPU with less than 10% higher overhead. Without rematerialization, or with the original baselines without our extensions, this would require 23 GB of memory. In fact, we can increase the batch size even further, to 57.

It is interesting to compute the largest batch size that we can use when training these models on a single GPU. The maximum batch size enabled by different rematerialization strategies is shown in Figure 6. The y-axis shows the theoretical maximum batch size we could feasibly train with bounded compute cost. This is calculated by enforcing that the total cost may exceed the cost of a standard training iteration by at most the cost of one additional forward pass. That is, in Figure 6 the cost is at most an additional forward pass higher, provided the specified batch size would have fit in GPU memory. We reformulate Problem (4.5) to maximize a batch size variable $b$ subject to modified memory constraints that scale the per-item memory costs by $b$, and subject to an additional cost constraint (7):

$$\sum_{t=1}^{n} \sum_{i=1}^{t} C_i\, R_{t,i} \;\le\; \sum_{i=1}^{n} C_i \;+\; \sum_{i \in \mathrm{FWD}} C_i, \tag{7}$$

where $\mathrm{FWD}$ denotes the set of forward-pass nodes.

The modified integer program has quadratic constraints, and is difficult to solve. We set a time limit of one day for the experiment, but Gurobi may be unable to reach optimality within that limit. Figure 6 then provides a lower bound on the maximum batch size that Checkmate can achieve.

For fair comparison on the non-linear graphs used in U-Net and ResNet, we use the AP √n and AP greedy generalizations of Chen's algorithms described in Section 5.1.1. For U-Net, the baselines perform slightly worse than the checkpoint-all strategy due to their constrained search space. Let $M_{\text{static}}$ be the memory reserved for inputs, parameters, and gradients, as in (1), and let $\mathrm{mem}(b)$ be the activation memory a baseline strategy uses at batch size $b$. The maximum baseline batch size is estimated with (8), where the minimization is taken with respect to hyperparameters, if any:

$$b^{*} = \max\;\Big\{\, b \;:\; M_{\text{static}} + \min_{\text{hyperparameters}} \mathrm{mem}(b) \;\le\; M_{\text{budget}} \,\Big\}. \tag{8}$$
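Estimate (8) can be computed with a simple search, for example as in the sketch below (our own illustration; `mem_fn(b)` is a hypothetical callback returning a strategy's peak activation memory at batch size `b`, already minimized over its hyperparameters):

```python
# Largest batch size b with static_bytes + mem_fn(b) <= budget_bytes,
# assuming mem_fn is monotone in b (binary search).
def max_baseline_batch_size(mem_fn, static_bytes, budget_bytes, upper=4096):
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if static_bytes + mem_fn(mid) <= budget_bytes:
            lo = mid
        else:
            hi = mid - 1
    return lo
```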

Costs are measured in FLOPs, determined statically. The U-Net, FCN8, and SegNet semantic segmentation networks are evaluated at a high input resolution, and the classification networks ResNet50, VGG19, and MobileNet at standard classification resolution.

Takeaways: We can increase the batch size of U-Net substantially at high resolution, an unprecedented result. For many tasks such as semantic segmentation, where U-Net is commonly used, only small batch sizes are otherwise possible, depending on resolution. This is sub-optimal for batch normalization layers, so being able to increase the batch size substantially at a representative resolution is quite significant. Orthogonal approaches to achieve this include model parallelism and distributed-memory batch normalization, which can be significantly more difficult to implement and have high communication costs. Furthermore, for MobileNet, Checkmate allows a larger batch size than the best baseline solution, Chen's greedy heuristic, and than common practice, checkpointing all activations.

6 Conclusions

One of the main challenges when training large neural networks is the limited capacity of high-bandwidth memory on accelerators such as GPUs and TPUs. This has created a memory wall that limits the size of the models that can be trained. Critically, the bottleneck for state-of-the-art model development is now memory rather than data and compute availability, and we expect this trend to worsen in the near future. To address this challenge, we proposed a novel checkpointing and rematerialization algorithm which allows large models to be trained with limited available memory. Our method does not make the strong assumptions required in prior work. In particular, our proposed approach supports general non-linear computation graphs such as residual networks and captures the impact of non-uniform memory usage and computation cost throughout the graph with a hardware-aware, profile-guided cost model.

We presented a MILP formulation for the problem, implemented the Checkmate system for optimal checkpointing and rematerialization in Tensorflow, and tested the proposed system on a range of neural network models, including VGG16, VGG19, ResNet50, MobileNet, U-Net, FCN, and SegNet. Furthermore, we showed that Checkmate enables practitioners to train a high-resolution U-Net on a single V100 GPU at an unprecedented batch size, as well as at a substantially larger batch size for MobileNet; the former is well beyond the maximum batch size attainable with prior art.

7 Acknowledgments

In addition to NSF CISE Expeditions Award CCF-1730628, this research is supported by gifts from Alibaba, Amazon Web Services, Ant Financial, CapitalOne, Ericsson, Facebook, Futurewei, Google, Intel, Microsoft, NVIDIA, Scotiabank, Splunk and VMware. This research is also supported by the NSF Graduate Research Fellowship.

References