In recent years, deep learning has grown significantly, driven by methodologies that enable scalable training of deep neural networks (DNNs) and by the development of more powerful hardware. Increasing the capacity of a DNN has been observed to improve its performance. For example, AmoebaNet-B [real2019regularized] scaled with GPipe [huang2019gpipe] has 557 million parameters and achieved a top-1 accuracy of 84.4%, which was the state-of-the-art result at the time, and GPT-2 [radford2019language] is a Transformer-based [vaswani2017attention] language model with 1.5 billion parameters (see Figure 1 of [huang2019gpipe] for the effect of model scaling). However, training such a massive model is very resource intensive. One can mitigate this issue by reducing the size of the model without losing performance, for example by pruning the model [han2015learning, alvarez2016learning], designing more efficient architectures [howard2017mobilenets, tan2019efficientnet], searching for architectures under resource constraints [cai2018proxylessnas], and many more.
We may wonder whether a rather direct approach is possible: can we train a massive model fast enough, given a large pool of devices? One obstacle is that common optimization techniques for training a neural network are sequential in nature. These algorithms repeatedly compute the gradient of the loss with respect to one mini-batch at a time and update the model parameters using that gradient. With abundant computational resources, data parallelism [krizhevsky2012imagenet] is commonly used to speed up the overall optimization procedure by dividing the mini-batch into micro-batches and delegating the per-micro-batch computation to the available devices. With careful hyperparameter tuning, this effectively reduces the training time up to a certain mini-batch size, which may depend on the model, the optimization algorithm, and the data [goyal2017accurate, shallue2018measuring]. One drawback of data-parallel training is that each device holds its own copy of the network for executing the subdivided task, and the network parameters must be synchronized after each parameter update. This may induce a heavy communication load when there are many parameters to synchronize.
Note that data parallelism is not applicable when the model is so big that the gradient cannot be computed even when a single data point is fed into the network. Model parallelism [dean2012large] is a method for training such a massive model, which partitions the model into several pieces and places them on different devices. Each device computes only a small part of the model and updates only the parameters in that part. However, model parallelism suffers from under-utilization of devices. Since most neural networks consist of a sequence of layers, a device holding a later part of the model must wait until the devices holding the earlier parts have finished their computation.
Another possible solution is to use gradient checkpointing [chen2016training], which saves memory by storing only a subset of the activation maps and re-computing the discarded activation maps when necessary. This requires certain parts of the model to be computed twice, so the overall training time increases.
It is beneficial to combine different types of parallelization strategies [krizhevsky2014one, pmlr-v80-jia18a, shazeer2018mesh, huo2018decoupled, harlap2018pipedream, huang2019gpipe, guan2019xpipe], and recent lines of research ask how to find an optimal strategy [jia2018beyond, mirhoseini2017device, mirhoseini2018a, zhou2019gdp]. Among them, pipeline parallelism is a way to accelerate neural network training by combining model parallelism with data pipelining, either in a synchronous way as in GPipe [huang2019gpipe] or in an asynchronous way as in [huo2018decoupled], PipeDream [harlap2018pipedream], and XPipe [guan2019xpipe]. We remark that gradient checkpointing (also called re-materialization) is further combined in GPipe to allow training even bigger models.
In this paper, we design and implement torchgpipe, a ready-to-use library for GPipe in PyTorch [paszke2017automatic]. In particular, we develop a set of design components for optimized pipeline-parallel computations in PyTorch’s define-by-run and eager execution environment. We show that each component is necessary to fully benefit from pipeline parallelism in such an environment, and demonstrate the efficiency of torchgpipe by conducting speed and memory benchmarks on AmoebaNet-D [real2019regularized] and U-Net [RonnebergerFB15] when trained with the library.
The rest of the paper is organized as follows. In section 2, we discuss how the forward and backward passes can be decomposed into subtasks (under certain assumptions), describe the device placement strategy of micro-batch pipeline parallelism, and show the desired order of execution per device. In section 3, we discuss complications in achieving the optimal timeline of pipeline parallelism in PyTorch and explain how torchgpipe resolves them. Additionally, we relax the assumption that the model is sequentially composed, and provide a way of expressing models with long skip connections so that pipeline parallelism still applies without giving up efficiency. Then, in section 4, we demonstrate that the optimization components suggested in the paper are essential for performance, and evaluate the performance of the proposed library.
2 Pipeline Parallelism
Suppose that we have a neural network which is represented as a composition of a sequence of subnetworks. Let us denote the subnetworks by $f^1, \ldots, f^n$ with parameters $\theta^1, \ldots, \theta^n$, and let the full network be $f = f^n \circ \cdots \circ f^1$, parameterized by $\theta = (\theta^1, \ldots, \theta^n)$. For clarity, we call $f^j$ the $j$th partition of $f$ and assume that the parameters of the partitions are mutually disjoint.
When training the network, gradient-based methods such as stochastic gradient descent require computing the outcome $f(x)$ of the network given a mini-batch $x$ of training data and the corresponding loss, and the gradient $g$ of the loss with respect to the network parameter $\theta$. These two stages are called the forward and backward pass, respectively.
Since $f$ is sequentially composed, $f(x)$ in the forward pass can be computed by letting $x^0 = x$ and sequentially applying the partitions as $x^j = f^j(x^{j-1})$ for $j = 1, \ldots, n$. Furthermore, if $x$ consists of $m$ smaller batches $x_1, \ldots, x_m$ called micro-batches, computing $f(x)$ dissolves into tasks $F_{i,j}$ where $x_i^0 = x_i$ and
$$F_{i,j}: \quad x_i^j \leftarrow f^j(x_i^{j-1})$$
for $i = 1, \ldots, m$ and $j = 1, \ldots, n$, assuming that $f$ does not involve any intra-batch computation. One prominent exception to this is batch normalization [pmlr-v37-ioffe15]. (Applying pipeline parallelism to a network with batch normalization is feasible, although the computation is no longer identical. Indeed, this discrepancy also exists in the data-parallel training scheme and it may result in degradation of the result.) The loss is obtained by aggregating $x_1^n, \ldots, x_m^n$ and evaluating the loss function on them.
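The decomposition of the forward pass into per-micro-batch tasks can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: partitions are modeled as functions on lists of floats, and the helper names `forward_full`, `split`, and `forward_tasks` are hypothetical and not part of any library. The sketch only checks that, when no partition mixes samples within a batch, running the tasks per micro-batch gives the same result as running the whole mini-batch.

```python
from typing import Callable, List, Sequence

Batch = List[float]
Partition = Callable[[Batch], Batch]


def forward_full(partitions: Sequence[Partition], x: Batch) -> Batch:
    """Apply the partitions in order to the whole mini-batch."""
    for f_j in partitions:
        x = f_j(x)
    return x


def split(x: Batch, m: int) -> List[Batch]:
    """Split a mini-batch into at most m contiguous micro-batches."""
    size = -(-len(x) // m)  # ceiling division
    return [x[k:k + size] for k in range(0, len(x), size)]


def forward_tasks(partitions: Sequence[Partition], x: Batch, m: int) -> Batch:
    """Run one task per (micro-batch, partition) pair and concatenate the results."""
    micro = split(x, m)
    for f_j in partitions:            # partition index j
        for i in range(len(micro)):   # micro-batch index i
            micro[i] = f_j(micro[i])  # one forward task for micro-batch i on partition j
    return [v for mb in micro for v in mb]


# Three toy partitions, each acting on every sample independently.
partitions = [
    lambda xs: [2.0 * v for v in xs],
    lambda xs: [v + 1.0 for v in xs],
    lambda xs: [v * v for v in xs],
]
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
assert forward_tasks(partitions, x, m=3) == forward_full(partitions, x)
```

A partition that computes batch statistics, such as batch normalization, would break the final equality, which is the exception noted in the text.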
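In a similar fashion, the backward pass is decomposed into tasks $B_{i,j}$ where $dx_i^n$ is the gradient of the loss with respect to $x_i^n$ and
$$B_{i,j}: \quad dx_i^{j-1} \leftarrow \partial_x f^j(dx_i^j), \qquad g_i^j \leftarrow \partial_{\theta^j} f^j(dx_i^j)$$
for $i = 1, \ldots, m$ and $j = 1, \ldots, n$. Here
$$\partial_x f^j : v \mapsto v^\top \cdot \left.\frac{d f^j}{d x}\right|_{x = x_i^{j-1}}$$
is a function which does backward propagation (also known as the vector-Jacobian product) through the partition, and $\partial_{\theta^j} f^j$ is defined likewise. As a result, we get the gradient of the loss with respect to $\theta^j$ by summing $g_i^j$ over all $i$.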
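The last step, summing per-micro-batch gradients, can be checked on a toy problem. The sketch below assumes a scalar model $y = w x$ with a summed squared-error loss; the model, the data, and the function name `grad_w` are illustrative assumptions, not taken from the paper.

```python
import math
from typing import List


def grad_w(w: float, xs: List[float], ts: List[float]) -> float:
    """Gradient with respect to w of the summed loss 0.5 * (w * x - t) ** 2."""
    return sum((w * x - t) * x for x, t in zip(xs, ts))


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ts = [1.0, 0.0, 2.0, 1.0]

# Gradient computed on the whole mini-batch at once.
full = grad_w(w, xs, ts)

# Gradients computed per micro-batch, then summed.
m = 2
size = len(xs) // m
g = [grad_w(w, xs[k:k + size], ts[k:k + size]) for k in range(0, len(xs), size)]

assert math.isclose(sum(g), full)
```

If the loss is averaged over the mini-batch rather than summed, each micro-batch gradient has to be rescaled accordingly before accumulation.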
Note that there are data dependencies between tasks. For example, $F_{i,j}$ requires $x_i^{j-1}$, which is only available after $F_{i,j-1}$; hence $F_{i,j-1}$ must be completed before starting $F_{i,j}$, and the same applies to $B_{i,j}$ and $B_{i,j+1}$. Figure 3 shows the full dependency graph for one fixed choice of $m$ and $n$.
Given the sets of tasks $\{F_{i,j}\}$ and $\{B_{i,j}\}$ and a pool of devices which can work in parallel, different parallelization strategies have their own rules for assigning tasks to devices. Each device computes its assigned tasks as soon as their dependencies are resolved. In the setting above, all dependencies are among tasks with the same micro-batch index $i$. Hence, one can effectively parallelize the tasks by assigning tasks with different micro-batch indices to different devices, which is exactly data parallelism.
2.1 Dependency Graph of GPipe
Pipeline parallelism’s strategy is to assign tasks with respect to the partition index $j$, so that the $j$th partition lies entirely in the $j$th device. In addition, it is enforced that $F_{i,j}$ must be completed before executing $F_{i+1,j}$, and $B_{i,j}$ must be completed before executing $B_{i-1,j}$.
In addition to the micro-batch pipelining, GPipe [huang2019gpipe] further reduces the memory requirement by utilizing gradient checkpointing for each $B_{i,j}$. Since the $j$th device executes the tasks $B_{i,j}$ one at a time, only the activation maps obtained from $F_{i,j}$ are needed to complete $B_{i,j}$. By recomputing the forward pass $F_{i,j}$ right before executing $B_{i,j}$, memory consumption is reduced by a factor of $m$. Moreover, the re-computation can take place while the device is waiting for $B_{i,j+1}$ to be done. This is summarized in Figure 3, where dashed arrows denote the execution order between independent tasks induced by the micro-batch order, and $F'_{i,j}$ denotes the re-computation of $F_{i,j}$.
We remark that re-computations for the last micro-batch, i.e., $F'_{m,j}$ for $j = 1, \ldots, n$, are unnecessary. This is because on the $j$th device the last task in the forward pass is $F_{m,j}$, so discarding its intermediate activations in the forward pass and re-computing them at the beginning of the backward pass does not reduce memory and only slows down the pipeline. For this reason, $F'_{m,j}$ is omitted from the graph.
2.2 Device-wise Execution Order
To summarize, in pipeline parallelism (with checkpointing) each device is assigned a set of tasks with a prescribed order. Each device executes the given tasks one by one as soon as cross-device dependencies are met. However, there is a missing component in this picture: data transfer between the devices. For illustration, the full execution order that device $j$ must follow is shown in Figure 3. Here data transfer operations are explicitly denoted as ‘receive’ and ‘send’ for emphasis.
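One plausible reading of this device-wise order can be written down as a short sketch. The function name `device_schedule` and the string labels are hypothetical; the sketch assumes 1-based indices, that the first device receives no activation and the last device sends none, that the last device obtains its gradient directly from the loss, and that re-computation is skipped for the last micro-batch as remarked above.

```python
from typing import List


def device_schedule(j: int, m: int, n: int) -> List[str]:
    """List the operations of device j for m micro-batches and n partitions."""
    ops = []
    # Forward pass: micro-batches in increasing order.
    for i in range(1, m + 1):
        if j > 1:
            ops.append(f"receive x[{i}] from device {j - 1}")
        ops.append(f"F[{i},{j}]")
        if j < n:
            ops.append(f"send x[{i}] to device {j + 1}")
    # Backward pass: micro-batches in decreasing order.
    for i in range(m, 0, -1):
        if i < m:
            ops.append(f"F'[{i},{j}]")  # re-computation, skipped for the last micro-batch
        if j < n:
            ops.append(f"receive dx[{i}] from device {j + 1}")
        ops.append(f"B[{i},{j}]")
        if j > 1:
            ops.append(f"send dx[{i}] to device {j - 1}")
    return ops


for op in device_schedule(j=2, m=2, n=3):
    print(op)
```

Placing the re-computation before the receive reflects the goal of overlapping re-computation with the incoming gradient copy.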
3 torchgpipe: A PyTorch Library for GPipe
torchgpipe is a PyTorch library for micro-batch pipeline parallelism with checkpointing, also known as GPipe. The library provides a simple way to apply GPipe to a generic sequential module written in PyTorch. The usage of torchgpipe resembles that of the data parallel module of PyTorch: just wrap your model with the wrapper.
Users must specify the number of micro-batches and how consecutive layers form partitions. Here we remark that, even though we simplified our assumption to the model being a sequence of partitions, torchgpipe strictly requires the model to be a sequence of layers, which gives users flexibility in how to split the model. torchgpipe will assume that each layer is a non-divisible, black-box, and referentially transparent algorithm. (Referential transparency is required especially for checkpointing: it ensures that recomputation is identical to the computation done in the forward pass.)
For convenience, the library provides the submodule torchgpipe.balance which computes a partition whose pairwise resource discrepancy is small, where resource consumption is computed by profiling. Specifically, we used the algorithm from [barany2015block].
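To make the balancing problem concrete, here is a small sketch that splits a sequence of per-layer costs into contiguous blocks so that the most expensive block is as cheap as possible. This is a plain dynamic program chosen for brevity; it is not the algorithm of [barany2015block] and not the code of torchgpipe.balance, and the function name `balance_by_cost` and the example costs are hypothetical.

```python
from functools import lru_cache
from typing import List, Sequence, Tuple


def balance_by_cost(costs: Sequence[float], n: int) -> List[int]:
    """Return block sizes of n contiguous, non-empty blocks minimizing the largest block cost."""
    if not 1 <= n <= len(costs):
        raise ValueError("need 1 <= n <= number of layers")

    # prefix[k] is the total cost of the first k layers.
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def best(end: int, blocks: int) -> Tuple[float, Tuple[int, ...]]:
        # Best (bottleneck cost, block sizes) for the first `end` layers in `blocks` blocks.
        if blocks == 1:
            return prefix[end], (end,)
        result = None
        for cut in range(blocks - 1, end):
            head, sizes = best(cut, blocks - 1)
            bottleneck = max(head, prefix[end] - prefix[cut])
            if result is None or bottleneck < result[0]:
                result = (bottleneck, sizes + (end - cut,))
        return result

    return list(best(len(costs), n)[1])


# Hypothetical profiled costs of five layers, split over two devices.
print(balance_by_cost([1, 2, 3, 4, 5], 2))
```

The returned list has the same shape as a partition specification: the number of consecutive layers assigned to each device.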
As torchgpipe is built on PyTorch equipped with the CUDA backend, we will often assume that devices are NVIDIA GPUs throughout this section. Nevertheless, the underlying principle of the library applies in general to implementing pipeline parallelism in any eager execution environment.
3.1 Complications in PyTorch
Our primary concern is efficiency. As we discussed in subsection 2.2, in order for pipeline parallelism to work as desired, the tasks must be assigned to each device in the correct order. There are several complications to achieve this in PyTorch.
First of all, kernels are issued to each device on the fly due to PyTorch’s define-by-run style and its eager execution behavior (as opposed to construct-and-run type frameworks). Hence, one must design the host code carefully so that not only are device-bound tasks issued in the correct order within each device, but also the execution of tasks on devices (asynchronous to the CPU) is not delayed by the Python interpreter failing to request it ahead of time. This kind of delay may happen when some of the tasks are CPU-intensive or involve many cheap kernel calls. As a solution, torchgpipe introduces a deterministic clock-cycle which gives the total ordering of the tasks.
Secondly, the computation graph for the backward pass is constructed dynamically during the forward pass in PyTorch. In other words, “it avoids ever materializing a ‘forward graph’, recording only what is necessary to differentiate the computation” [paszke2017automatic]. Since PyTorch neither records the forward computation graph nor maintains a gradient tape, the automatic differentiation (autograd) engine of PyTorch does back-propagation solely with respect to this graph. This implies that the autograd engine may not run exactly in the reverse order of execution of the forward pass, unless enforced by the structure of the graph. To deal with this, we develop a pair of primitive functions called ‘fork’ and ‘join’ to create explicit dependencies on the fly in the backward computation graph.
Thirdly, communication between several devices can cause two-way synchronization if not carefully managed. This may cause under-utilization, since the sender may wait to synchronize with the receiver even when there is no explicit dependency between the copy and the next task in the queue, or vice versa. torchgpipe avoids this issue by using non-default CUDA streams so that copies never block computations unless the computation must wait for the data.
Lastly, torchgpipe attempts to relax the restriction of micro-batch pipeline parallelism that the model must be sequential. Although any neural network can be written in a sequential form in principle, this requires knowing the entire computation graph ahead of time, which is not the case in PyTorch. In particular, if there is a tensor which skips from a layer on one device to a layer on a later, non-adjacent device, the tensor will be copied to all devices in between, since torchgpipe cannot know about the skip ahead of time. To circumvent this issue, we design an interface to signify which intermediate tensors are skipped and which layers use them.
3.2 Optimization Components
In the remainder of this section, we explain how the components of torchgpipe are designed and why each of them is essential for performance.
3.2.1 Forward Dependency: Deterministic Clock-cycle
As we discussed in subsection 3.1, the total ordering of tasks is determined by the host code in the forward pass. Each device implicitly understands the dependency between tasks by the order in which they are assigned by the CPU. Ideally, if tasks could be assigned to devices at no cost, the CPU could assign tasks to devices in any order as long as the ordering within each device is correct. However, this assumption is not realistic: launching kernels on a GPU is not free for the CPU, memory transfer between GPUs may require synchronization, and a task may be CPU-intensive. For this reason, we minimize the delay coming from the CPU by sorting all tasks by their distance to $F_{1,1}$.
We call this the deterministic clock-cycle (algorithm 1). In the algorithm, the CPU executes the clock cycles starting from the counter $k = 1$ up to $k = m + n - 1$. In the $k$th clock cycle, all copy kernels for data needed to execute the tasks $F_{i,j}$ with $i + j - 1 = k$ are issued first, and then the computation kernels for executing those tasks are registered to the corresponding devices (which can be safely multithreaded, since tasks in the same clock cycle are independent).
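The schedule can be sketched as a small generator. This is an illustrative sketch under the indexing used in this section (micro-batch index $i$ and partition index $j$ both start at 1); the name `clock_cycles` is used here only for illustration, and copy and compute kernels are not modeled.

```python
from typing import Iterator, List, Tuple


def clock_cycles(m: int, n: int) -> Iterator[List[Tuple[int, int]]]:
    """Yield, for k = 1, ..., m + n - 1, the tasks (i, j) with i + j - 1 == k."""
    for k in range(1, m + n):
        # j must satisfy 1 <= j <= n and 1 <= i = k - j + 1 <= m.
        yield [(k - j + 1, j) for j in range(max(1, k - m + 1), min(k, n) + 1)]


for k, tasks in enumerate(clock_cycles(m=4, n=3), start=1):
    print(k, tasks)
```

Within one cycle every task belongs to a different device and a different micro-batch, which is why the tasks of a cycle can be issued independently.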
3.2.2 Backward Dependency: Fork and Join
Suppose now that we run a forward pass according to the deterministic clock-cycle. The resulting computation graph for the backward pass will look like the plain dependency graph of the tasks rather than the graph with the micro-batch ordering, even when the forward tasks on device $j$ were executed in order. From such a graph, the autograd engine of PyTorch would never know that $B_{i+1,j}$ must be executed before $B_{i,j}$, and this disrupts the timeline of the backward pass. For this reason, virtual dependencies (dashed arrows in Figure 3) must be explicitly drawn during the forward pass.
We design a pair of primitive functions called Fork and Join to express such a dependency. Basically, Fork is the autograd function mapping a tensor $x$ to the pair $(x, \emptyset)$, where $\emptyset$ is an empty tensor, and Join is the autograd function mapping a pair $(x, \emptyset)$ to the tensor $x$. (In principle, the tensor which indicates the virtual dependency can be arbitrary. We chose the empty tensor, however, to remove any unnecessary computation caused by the tensor, such as gradient accumulation in PyTorch.) Now, the dependency of $F_{i+1,j}$ upon $F_{i,j}$ (which translates to the dependency of $B_{i,j}$ upon $B_{i+1,j}$ in the backward computation graph) can be expressed as
$$(x_i^j, \emptyset) \leftarrow \mathrm{Fork}(x_i^j), \qquad x_{i+1}^{j-1} \leftarrow \mathrm{Join}(x_{i+1}^{j-1}, \emptyset).$$
See Figure 4 for illustration.
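The effect of these virtual dependencies on the backward order can be examined without PyTorch by modeling the backward tasks as a plain dependency graph. The sketch below is an abstraction, not the autograd engine: task $(i, j)$ stands for the backward task of micro-batch $i$ on partition $j$, and the helper names `backward_graph` and `depends_on` are hypothetical. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, Set, Tuple

Task = Tuple[int, int]


def backward_graph(m: int, n: int, virtual_edges: bool) -> Dict[Task, Set[Task]]:
    """Map each backward task (i, j) to the set of tasks it must wait for."""
    preds = {(i, j): set() for i in range(1, m + 1) for j in range(1, n + 1)}
    for (i, j) in preds:
        if j < n:
            preds[(i, j)].add((i, j + 1))  # data dependency on the next partition
        if virtual_edges and i < m:
            preds[(i, j)].add((i + 1, j))  # virtual dependency, as created by Fork and Join
    return preds


def depends_on(preds: Dict[Task, Set[Task]], task: Task, other: Task) -> bool:
    """Return True if `task` transitively waits for `other`."""
    stack, seen = [task], set()
    while stack:
        for p in preds[stack.pop()]:
            if p == other:
                return True
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return False


m, n = 3, 2
plain = backward_graph(m, n, virtual_edges=False)
forked = backward_graph(m, n, virtual_edges=True)

# Without virtual edges nothing orders micro-batch 2 before micro-batch 1 on device 1.
assert not depends_on(plain, (1, 1), (2, 1))
assert depends_on(forked, (1, 1), (2, 1))

# With virtual edges, any valid order visits micro-batches in decreasing order per device.
order = list(TopologicalSorter(forked).static_order())
for j in range(1, n + 1):
    positions = [order.index((i, j)) for i in range(m, 0, -1)]
    assert positions == sorted(positions)
```

In the plain graph a scheduler is free to pick either micro-batch first, which is the uncontrolled ordering that the virtual dependencies rule out.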
3.2.3 Concurrent Copy and Computation: Streams
PyTorch issues every device-bound kernel to the default stream unless specified otherwise. A stream is a device-bound sequence of kernels that is executed in order. Kernels in the same stream are guaranteed to be executed in the prescribed order, but kernels in different streams can be interleaved, and can even overlap when possible. In particular, nearly all CUDA devices with compute capability 1.1 and higher support concurrent copy and execution: data transfer between devices can always overlap with kernel execution (see [nvidia2007cuda]).
torchgpipe registers every copy kernel to non-default streams while keeping computation kernels on the default stream. This allows device $j$ to process $F_{i,j}$ concurrently with sending $x_{i-1}^j$ to device $j+1$ and/or receiving $x_i^{j-1}$ from device $j-1$. Moreover, each device uses different streams for each micro-batch. Since there is no true dependency between different micro-batches, this use of streams is safe and allows copies to occur as fast as possible. See Figure 5 for illustration.
3.2.4 Autograd Functions with Shared Memory
So far in this section, we have not discussed how to schedule the re-computation task $F'_{i,j}$ when gradient checkpointing is in use. It must be scheduled prior to the back-propagation task $B_{i,j}$, upon completion of $B_{i+1,j}$. This must be encoded in the computation graph as well for the autograd engine. Indeed, PyTorch supports such functionality via an in-house autograd function for checkpointing.
Checkpointing in PyTorch is implemented by defining an autograd function which computes the function as usual in the forward pass but stores only the inputs instead of the intermediate activation maps. In the backward pass, this function constructs a local computation graph for the backward pass by recomputing the function using the stored inputs, and computes gradients by back-propagating through the local graph. However, this tightly binds $F'_{i,j}$ and $B_{i,j}$ together. Ultimately, we would like to insert the instruction to wait for the result of $B_{i,j+1}$ to be copied from device $j+1$ to device $j$ in between $F'_{i,j}$ and $B_{i,j}$, so that $F'_{i,j}$ and the copy can happen concurrently.
For such fine-grained order control, torchgpipe implements checkpointing with two separate autograd functions, Checkpoint and Recompute. At the execution time of the task $F_{i,j}$, a pair of Checkpoint and Recompute functions which share memory is generated. This shared memory is used in the backward pass for transferring the local computation graph made by executing Recompute to Checkpoint for back-propagation. By arranging the functions so that $F'_{i,j}$, the synchronization for receiving $dx_i^j$, and $B_{i,j}$ are executed in this order during the backward pass, it is ensured that re-computation and copy can happen concurrently.
3.3 Dealing with Non-sequential Models
In section 2, we assumed that the model is composed of partitions in sequence. In principle, any neural network can be represented in this form by sorting all nodes in the forward computation graph of $f$ in topological order. Hence, pipeline parallelism is applicable to any model.
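However, consider a symptomatic case in which all the partitions except the first and the last one are parallel, that is, the intermediate partitions do not depend on one another. To write such a model in a sequential form, each intermediate partition must be extended so that, in addition to computing its own output, it passes along every tensor that a later partition still needs. In this case, it is quite inefficient to use pipeline parallelism in its native form, since at the boundary between two consecutive devices the whole tuple of such tensors must be copied instead of the single tensor that is the only data required to compute the next partition.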
torchgpipe provides a submodule, torchgpipe.skip, which allows users to indicate from which layer to which layer a tensor is skipped. With the decorator @skippable, a user-defined layer can stash a tensor for later use or pop a stashed one via the yield operator in Python, without returning it. In particular, this does not change the input and output signature of a layer. Hence, minimal effort is needed to add a skip connection to a preexisting sequential model.
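The idea of stashing and popping through `yield` while keeping a one-input, one-output signature can be imitated in plain Python. The following is a toy model written for illustration only: it is not the implementation of torchgpipe.skip, and the helpers `stash`, `pop`, `run_layer`, and `run_sequential` as defined here are hypothetical stand-ins that operate on numbers instead of tensors.

```python
import inspect


def stash(name, value):
    """Request that `value` be stored under `name`."""
    return ("stash", name, value)


def pop(name):
    """Request the value previously stored under `name`."""
    return ("pop", name)


def run_layer(layer, x, skips):
    """Call layer(x); if it is a generator, serve its stash and pop requests."""
    out = layer(x)
    if not inspect.isgenerator(out):
        return out
    reply = None
    try:
        while True:
            request = out.send(reply)
            if request[0] == "stash":
                skips[request[1]] = request[2]
                reply = None
            else:
                reply = skips.pop(request[1])
    except StopIteration as stop:
        return stop.value  # the layer's ordinary return value


def run_sequential(layers, x):
    """Run the layers in order; skipped values travel outside the main data path."""
    skips = {}
    for layer in layers:
        x = run_layer(layer, x, skips)
    return x


def encoder(x):
    yield stash("skip", x)  # remember the input for a later layer
    return x * 2


def middle(x):
    return x + 1  # unaware of the skip connection


def decoder(x):
    skipped = yield pop("skip")  # fetch the stashed value
    return x + skipped


assert run_sequential([encoder, middle, decoder], 3) == 10
```

The intermediate layer never sees the stashed value, which is the property that lets a skip connection bypass the devices in between.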
3.3.1 Hiding Skip Tensors in the Graph: Portals
Adding skip connections to the dependency graph (Figure 3) is fairly straightforward. Indeed, no additional dependency is introduced no matter how many skip connections are added, hence only the copy kernels for skip connections need extra care. In torchgpipe, this is taken care of by portals, which consist of three autograd functions, PortalBlue, PortalOrange, and PortalCopy, sharing memory like Checkpoint and Recompute in subsubsection 3.2.4. They respectively save the skip tensor, load the tensor, and move the saved tensor to the skipped device (and vice versa in the backward pass). This mechanism is illustrated in Figure 6.
4 Experiments
Every experiment was conducted with NVIDIA Tesla P40 GPUs with CUDA 10.1.243, each having 22 GiB of memory. For reproducibility, the code for all benchmarks provided in this section is made available in the repository.
4.1 Effects of Optimization Components
We conducted an experiment to show that every component of torchgpipe is necessary to achieve maximal efficiency. Starting from a baseline which has only the deterministic clock-cycle, each remaining component (backward dependency via Fork and Join, non-default streams for copy kernels, and portals for skip connections) is added incrementally. We report the throughput, GPU utilization, and memory usage under each setting to measure how each component contributes to the performance of torchgpipe. We find that the addition of each component gives a speed-up, and with all components torchgpipe runs nearly twice as fast as the baseline. Results can be found in Table 1.
We used U-Net for the experiment. Details of the architecture can be found in subsubsection 4.2.2, and we set the hyper-parameters determining the size of the model as in the speed benchmark. In settings without portals, the model is implemented as a fully sequential version where skip connections are encoded as inputs and outputs of the layers that they pass through, as described in the symptomatic example of subsection 3.3. For the setting with all components, the model is implemented with torchgpipe.skip while the architecture is identical.
We also visualized per-GPU timelines to help in understanding the role of each component, as illustrated in Figure 7. The explanation for each picture is summarized as follows.
|Optimization components||Throughput||Speed up||Utilization||Memory usage|
(a) With the deterministic clock-cycle, all kernels are issued in the correct order during the forward pass, as illustrated by the left part of the timeline. However, without explicit dependencies encoded in the computation graph, the autograd engine processes the micro-batches in an uncontrollable order, so the timeline of the backward pass is disrupted.
(b) With the backward dependency, kernels are now issued in the correct, deterministic order in the backward pass.
(c) By using non-default copy streams, copies and computations are now concurrent, as illustrated by the overlapping blue and red bars.
(d) Portals remove unnecessary copies caused by transferring the skipping tensor to all devices in between. This is illustrated by the red bars being shorter than in (c).
4.2 Performance Benchmarks
To demonstrate the efficiency of torchgpipe, we report performance benchmarks similar to those conducted for GPipe [huang2019gpipe].
4.2.1 AmoebaNet-D Speed Benchmark
We measured the throughput of AmoebaNet-D with various numbers of devices. For this, we measured the throughput of the model when torchgpipe is applied with $n$ partitions and $m$ micro-batches. Here throughput means the number of samples processed per second.
The experiment is conducted for each pair $(n, m)$ of the number of partitions $n$ and the number of micro-batches $m$. When $m = 1$, we applied checkpointing to all micro-batches to make a fair comparison of the loss due to checkpointing with [huang2019gpipe]. (torchgpipe does not use checkpointing on the last micro-batch by default, as explained in section 2. This means that no checkpointing is applied when $m = 1$.) The model we used is our implementation of a sequential version of AmoebaNet-D in PyTorch. (We tried to make it as close as possible to the model in the official repository of TensorFlow.)
The model is trained by plain SGD for 10 epochs, and we report the average throughput over the epochs except the first one. To exclude the overhead caused by data loading, we used a synthesized dataset which consists of 10,000 images of a fixed dimension. For each setting, the batch size and the number of micro-batches are chosen to maximize the throughput. The relative speed-up is calculated against the baseline case and reported in Table 4. We included the speed-up of GPipe for comparison.
The relative speed-up of torchgpipe shows a similar trend to that of GPipe. We remark that the differences in performance reported in Table 4 might be due to many unknown factors such as the balance of the partitions, discrepancies between the implementations, differences in devices, and so on.
4.2.2 U-Net Memory Benchmark
To evaluate the effectiveness of torchgpipe for models with long skip connections, we used U-Net [RonnebergerFB15] for 2-dimensional segmentation. The version of U-Net we used has five down-sampling layers and five up-sampling layers, and two hyper-parameters $B$ and $C$ determining the size of the model. Here $B$ stands for the number of convolution blocks in between down-sampling layers, and $C$ stands for the number of output channels of the first convolution. Channels are doubled after each down-sampling layer (or halved after each up-sampling layer, respectively). Our implementation of U-Net is more symmetric than the original model proposed in [RonnebergerFB15], for effective balancing.
We conducted an experiment to measure the ability of torchgpipe to train a bigger model. For 1, 2, 4 and 8 GPUs, we found the maximum $(B, C)$ that occupies each number of devices. In all settings, the input and output sizes are fixed and the batch size is set to 32. The total memory usage for training each model is reported in Table 4. Here each parameter consumes 8 bytes for itself and its gradient.
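The 8-bytes-per-parameter accounting can be turned into a quick estimate of how much of the reported total is taken by parameters and their gradients. The sketch below uses the parameter counts and memory figures from the memory benchmark table; the helper name `parameter_memory_gib` is hypothetical, and the remainder (activations, workspace, and so on) is not broken down here.

```python
GIB = 2 ** 30  # bytes per GiB


def parameter_memory_gib(num_parameters: float, bytes_per_parameter: int = 8) -> float:
    """Memory taken by parameters and their gradients, in GiB."""
    return num_parameters * bytes_per_parameter / GIB


# (number of parameters, reported total memory usage in GiB), copied from the table.
reported = {
    "Naive-1": (362.2e6, 20.3),
    "Pipeline-1": (2.21e9, 20.5),
    "Pipeline-2": (4.99e9, 43.4),
    "Pipeline-4": (7.80e9, 79.1),
    "Pipeline-8": (15.82e9, 154.1),
}

for name, (params, total_gib) in reported.items():
    own = parameter_memory_gib(params)
    print(f"{name}: {own:.1f} GiB for parameters and gradients out of {total_gib} GiB")
```

Comparing the two columns gives a rough sense of how the share of memory spent on parameters differs between the naive and the pipelined settings.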
4.2.3 U-Net Speed Benchmark
We also measured the throughput of U-Net with various numbers of devices. Naive-1 denotes the baseline without pipeline parallelism or checkpointing, and Pipeline-1, -2, -4, -8 denote that the model is trained with torchgpipe with the corresponding number of partitions. The hyper-parameters $(B, C)$ determining the size of U-Net are held fixed in this experiment. The batch size, the number of micro-batches ($m$), and the balance of partitions are chosen to maximize the throughput. For each setting, throughput is measured as in subsubsection 4.2.1, except for the image size. The result is summarized in Table 4.
|U-Net|($B$, $C$)|Parameters|Memory usage|
|Naive-1||(6, 72)||362.2M||20.3 GiB|
|Pipeline-1||(11, 128)||2.21B||20.5 GiB|
|Pipeline-2||(24, 128)||4.99B||43.4 GiB|
|Pipeline-4||(24, 160)||7.80B||79.1 GiB|
|Pipeline-8||(48, 160)||15.82B||154.1 GiB|
|U-Net||Throughput||Speed up||Batch size|
5 Conclusion
In this paper, we introduced torchgpipe, a ready-to-use library in PyTorch for micro-batch pipeline parallelism with checkpointing as proposed by GPipe [huang2019gpipe]. This library is designed and implemented in PyTorch’s define-by-run and eager execution environment. The ablation study and performance benchmarks presented in section 4 demonstrate that all components of torchgpipe are essential to achieve the desired advantages of pipeline parallelism with checkpointing in an eager execution environment. We believe that the general principles established in this paper apply to any other framework with an eager execution environment.
We tried to avoid going too deep into the technical details involved in torchgpipe. Our code is available at https://github.com/kakaobrain/torchgpipe for those who are interested in further details, and for those who want to apply pipeline parallelism to their models in PyTorch.