GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

by   Yanping Huang, et al.

GPipe is a scalable pipeline parallelism library that enables learning of giant deep neural networks. It partitions network layers across accelerators and pipelines execution to achieve high hardware utilization. It leverages recomputation to minimize activation memory usage. For example, using partitions over 8 accelerators, it is able to train networks that are 25x larger, demonstrating its scalability. It also guarantees that the computed gradients remain consistent regardless of the number of partitions. It achieves an almost linear speedup without any changes in the model parameters: when using 4x more accelerators, training the same model is up to 3.5x faster. We train a 557 million parameter AmoebaNet model on ImageNet and achieve a new state-of-the-art 84.3% top-1 accuracy. We then use this learned model as an initialization for training 7 different popular image classification datasets and obtain results that exceed the best published ones on 5 of them, including pushing the CIFAR-10 accuracy to 99% and the CIFAR-100 accuracy to 91.3%.


1 Introduction

Figure 1: Strong correlation between top-1 accuracy on ImageNet 2012 validation dataset and model size for representative state-of-the-art image classification models in recent years [49, 50, 23, 54, 24, 57, 45]. Red dot shows top-1 accuracy for a giant AmoebaNet model trained by GPipe.

Deep neural networks have advanced many machine learning tasks, including speech recognition [11], visual recognition [57, 45], and language processing [17]. Their successes have been largely due to the models' capacity to learn complex features from vast amounts of data. Increasing the size of models has been shown to dramatically improve task performance. One of the most challenging and popular machine learning tasks is the ImageNet visual recognition challenge [16], where researchers compete to create the most accurate model for classifying the images in the dataset. The winner of the 2014 ImageNet challenge was GoogleNet [49], which achieved top-1 accuracy with million parameters. The winner of the 2017 ImageNet challenge was Squeeze-and-Excitation Networks [24], which achieved top-1 accuracy with million parameters. This corresponds to more than a times increase in the number of parameters in the best visual recognition models, as shown in Figure 1. However, memory available on accelerators such as GPUs has only increased from 12 GB in 2014 (Nvidia K40) to 32 GB in 2018 (Nvidia V100). Hence, training even bigger neural networks can be challenging when faced with the accelerator memory limits.

There is an increasing need to scale up deep neural networks. Modern machine learning datasets are growing faster than ever in both size and quality. Image classification datasets such as OpenImages, JFT [47], and hashtagged Instagram [35] contain hundreds of millions of high definition images. Higher image resolutions provide greater detail of the object but consume more memory. This leads to contention between the memory allocated to model parameters and the memory allocated to network activations, reinforcing the need to break the accelerator memory limit. The larger volume of training data helps reduce over-fitting and allows deep neural networks to grow bigger. Meanwhile, the emergence of deep learning supercomputers such as Nvidia DGX and Google TPU enables efficient parallelism by providing fast interconnections between accelerators. The memory restrictions have limited the scale of deep neural networks and confined researchers to smaller-scale problems with fewer parameters. For example, while the average ImageNet resolution is , it has been shown that increasing input image size can lead to higher accuracy [23]. However, most current models are engineered to only use input image size or to fit within accelerator memory limits. Our work focuses on removing this limiting factor in scaling up deep neural networks.

To overcome the memory limitation, we propose to use pipeline parallelism to scale up deep neural network training. We design and implement GPipe, a distributed machine learning library that uses synchronous mini-batch gradient descent for training. GPipe partitions a model across different accelerators and automatically splits a mini-batch of training examples into smaller micro-batches. By pipelining the execution across micro-batches, accelerators can operate in parallel. In addition, GPipe automatically recomputes the forward activations during backpropagation to further reduce memory consumption. Gradients are consistently accumulated across micro-batches, so that the number of partitions does not affect the model quality. Therefore, GPipe allows researchers to easily deploy more accelerators to train larger models, and also to scale the performance without tuning hyperparameters.

GPipe maximizes memory allocation for model parameters. In experiments, we show that GPipe can support models up to 25 times larger using 8 accelerators without reducing the batch size. The implementation of GPipe is very efficient: with 4 times more accelerators we can achieve a 3.5 times speedup for training giant neural networks. GPipe can be combined with data parallelism [51] to scale training in a complementary way.

Finally, we demonstrate the empirical power of giant neural networks on image classification tasks. We increase the number of parameters of an AmoebaNet model to 557 million and train it with input image size of on the ImageNet ILSVRC 2012 dataset. Our scaled-up AmoebaNet model attains 84.3% top-1 validation accuracy. To the best of our knowledge, it outperforms all other models trained from scratch on the ImageNet dataset (Mahajan et al.'s model [35] achieved higher top-1 accuracy, but it was pretrained on non-public external data, namely Instagram images with hashtags). Furthermore, we use this learned giant model as an initialization for training seven datasets that span a wide range of tasks from general recognition to fine-grained classification. We find that giant models perform well on those datasets, obtaining results that are competitive with state-of-the-art models. For example, we push the CIFAR-10 accuracy to 99% and the CIFAR-100 accuracy to 91.3%.

In summary, this paper introduces GPipe, a scalable model parallelism library for training giant deep learning models with the following key contributions:

  • It supports models up to 25 times larger using 8 accelerators due to recomputation and model parallelism.

  • It achieves up to 3.5 times speedup with four times more accelerators using pipelining in our experiments.

  • It guarantees consistent training regardless of the number of partitions due to synchronous gradient descent.

  • It advances the performance of visual recognition tasks on multiple datasets, including pushing ImageNet top-5 accuracy to , CIFAR-10 accuracy to 99%, and CIFAR-100 accuracy to 91.3%.

2 Related Work

Deep neural networks typically consist of a sequence of layers. During training, a neural network first uses the current model parameters to compute predictions from input mini-batches in the forward pass. Then, the gradients are computed by backpropagating prediction errors (Figure 2(a)). Computing gradients in each intermediate layer requires both gradients from upper layers and the cached output activations from the forward pass. Thus, activation memory requirements typically grow in proportion to the number of layers, leaving less space for storing model parameters.

Various approaches have been studied to allow accelerators to train bigger models. They come with different trade-offs between memory, performance, and model quality. One common method is to recompute the forward pass activations during backpropagation [21, 8], which significantly reduces the memory required to cache activations. However, this method is still limited by the memory size of a single accelerator. Another approach is to swap memory between accelerators and the host [15]. However, this approach often slows down training because of the limited communication bandwidth between the host and accelerators.

Standard parallelism techniques, including data parallelism and model parallelism, provide orthogonal ways to use more accelerators for training. Data parallelism [51] effectively scales up the global mini-batch size. It lets each machine compute the gradient on a mini-batch of training examples. Each machine either synchronously or asynchronously updates the model parameters at the end of each training step [4, 34]. Data parallelism is widely used due to its simplicity and effectiveness. Because the batch size is proportional to the number of machines and different batch sizes often require different hyperparameters, scaling deep network training purely by data parallelism has become more challenging.

Model parallelism is a complementary technique to data parallelism. A naive strategy is to divide the computation into partitions and assign different partitions to different accelerators [30, 33]. This approach is straightforward when networks consist mainly of parallel branches. However, many deep learning models stack layers sequentially, which makes it challenging to parallelize computation efficiently. A naive partition strategy may result in only one accelerator being active during computation, significantly underutilizing accelerator compute capacity (Figure 2(b)).

Pipelining is a common parallel algorithm [32] that integrates model and data parallelism. Petrowski et al. explored accelerating neural network training via pipelining on early parallel machines [43]. Chen et al. used pipeline computation to approximate expensive backpropagation [9]. Wu et al. [53] parallelized the computation of stacked recurrent neural networks on GPUs in a pipelined fashion. Recently, PipeDream [22] introduced a pipelining approach to reduce communication overhead for synchronized training using parameter servers [34]. However, it suffered from inconsistency and staleness issues in the backward pass, which could lead to unstable training and poor model quality. PipeDream maintained multiple versions of model parameters on the accelerator to address the consistency issue. These constraints could prevent PipeDream from scaling up to bigger models. Similarly, DualPipe [6] optimized pipeline performance by assuming that there exists a robust way to predict future model parameters for back-propagation. Unlike these approaches, GPipe does not have any inconsistency or staleness issue. It integrates recomputation with pipeline parallelism to maximize memory and compute utilization. It offers effective and efficient synchronous training of large scale deep neural networks.

3 Methods

Figure 2: (a) An example neural network with sequential layers is partitioned across four accelerators. $F_k$ is the composite forward computation function of the $k$-th partition. $B_k$ is the corresponding backpropagation function. $B_k$ depends on both $B_{k+1}$ from the upper layer and the intermediate activations of $F_k$. (b) The naive model parallelism strategy leads to severe under-utilization due to the sequential nature of the network. Only one accelerator is active at a time. (c) Pipeline parallelism divides the input mini-batch into smaller micro-batches. It enables different accelerators to work on different micro-batches at the same time. $F_{k,t}$ and $B_{k,t}$ refer to forward and backward computation of the $t$-th micro-batch on the $k$-th partition. Gradients are applied synchronously at the end of each mini-batch.

This section describes the main design features of GPipe. This library is implemented using the TensorFlow [1] framework. The core algorithm can be implemented using other frameworks [27, 7, 40] as well. It will be open sourced in the coming months.

3.1 Interface

The caller of the GPipe interface specifies a sequential list of layers. Each layer specifies its model parameters $w_i$, its stateless forward computation function $f_i$, and an optional cost estimation function $c_i$ that estimates the static computation cost of the $i$-th layer given the shapes of all inputs to the layer. Neighboring layers can be combined into a composite layer. For example, the composite layer $p_k$ may be composed of consecutive layers from the $i$-th layer to the $j$-th layer. In this case, $p_k$'s model parameters would be the union of $w_i$, $w_{i+1}$, …, $w_j$ and its forward function would be $F_k = f_j \circ \dots \circ f_{i+1} \circ f_i$. The corresponding back-propagation function $B_k$ is derived from $F_k$ using TensorFlow's automatic symbolic differentiation mechanism. Its cost estimator $C_k$ is constructed based on $c_i$, $c_{i+1}$, …, $c_j$.
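The interface described above can be illustrated with a minimal Python sketch. This is not GPipe's actual API: the `Layer` record, the `compose` helper, and the choice to sum the per-layer costs are all assumptions made for illustration, and scalars stand in for tensors.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Layer:
    """Hypothetical layer record: parameters, stateless forward function, optional cost."""
    params: List[float]
    forward: Callable[[List[float], float], float]      # forward(params, x) -> y
    cost: Callable[[int], float] = lambda input_size: 1.0


def compose(layers: Sequence[Layer]) -> Layer:
    """Combine consecutive layers into one composite layer."""
    sizes = [len(layer.params) for layer in layers]

    def forward(params: List[float], x: float) -> float:
        # Apply the layers in order, handing each one its own slice of parameters.
        offset = 0
        for layer, size in zip(layers, sizes):
            x = layer.forward(params[offset:offset + size], x)
            offset += size
        return x

    def cost(input_size: int) -> float:
        # Assumption: the composite cost is the sum of the per-layer estimates.
        return sum(layer.cost(input_size) for layer in layers)

    # The composite parameters are the concatenation (union) of the layer parameters.
    params = [p for layer in layers for p in layer.params]
    return Layer(params, forward, cost)


scale = Layer([2.0], lambda p, x: p[0] * x)
shift = Layer([3.0], lambda p, x: x + p[0])
composite = compose([scale, shift])
result = composite.forward(composite.params, 5.0)   # (2 * 5) + 3
```

Keeping the forward function stateless, with parameters passed in explicitly, is what makes it straightforward to re-run a composite layer later, for example during recomputation.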

3.2 Algorithm

After users define their network layers in terms of model parameters $w_i$, cost estimation function $c_i$, and forward computation function $f_i$, GPipe partitions the network into $K$ composite layers and places the $k$-th composite layer onto the $k$-th accelerator, where $K$ is the number of partitions users specified. Communication primitives are automatically inserted by GPipe at the partition boundaries to allow data exchange between neighboring partitions. The partitioning algorithm is heuristic-based. It simply minimizes the variance of each composite layer's estimated cost. We expect that better partitioning algorithms can potentially improve the performance of GPipe.
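To make the variance-minimizing heuristic concrete, the sketch below exhaustively searches over contiguous splits of a layer list and keeps the split whose per-partition costs have the lowest variance. The brute-force search, the function name `partition_layers`, and the cost values are illustrative assumptions; the text does not specify how GPipe performs the search.

```python
from itertools import combinations
from statistics import pvariance


def partition_layers(costs, num_partitions):
    """Split layers into contiguous partitions with minimal variance of partition cost."""
    n = len(costs)
    best_bounds, best_var = None, float("inf")
    # Choose num_partitions - 1 cut points between consecutive layers.
    for cuts in combinations(range(1, n), num_partitions - 1):
        bounds = (0, *cuts, n)
        totals = [sum(costs[a:b]) for a, b in zip(bounds, bounds[1:])]
        var = pvariance(totals)
        if var < best_var:
            best_bounds, best_var = bounds, var
    # Return the layer indices assigned to each partition.
    return [list(range(a, b)) for a, b in zip(best_bounds, best_bounds[1:])]


layer_costs = [4, 2, 2, 1, 3, 3, 1]        # hypothetical per-layer cost estimates
partitions = partition_layers(layer_costs, 4)
```

With these hypothetical costs the search finds four partitions of cost 4 each. Exhaustive search is exponential in the number of cut points, so a real implementation would need a cheaper strategy for deep networks.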

During training, GPipe first divides a mini-batch of size $N$ into $T$ micro-batches at the first layer. Each micro-batch contains $N/T$ examples. For instance, an image input tensor with shape [$N$, $H$, $W$, $C$] is reshaped into [$T$, $N/T$, $H$, $W$, $C$]. During the forward pass (Figure 2(c)), the $k$-th accelerator starts to compute $F_{k,t}$ as soon as it finishes the $(t-1)$-th micro-batch and receives inputs from $F_{k-1,t}$. At the same time, the $(k+1)$-th accelerator can start to compute $F_{k+1,t-1}$. Each accelerator repeats this process $T$ times to finish the forward pass of the whole mini-batch. There are still up to $K-1$ idle time steps per accelerator, which we refer to as bubble overhead, as depicted in Figure 2(c). This bubble time is $O\left(\frac{K-1}{T+K-1}\right)$ and is amortized by the number of micro-batches $T$. The last accelerator is also responsible for concatenating the outputs across micro-steps and computing the final loss.
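The forward-pass schedule can be sketched with a small simulation. It assumes an idealized pipeline in which every partition takes exactly one time step per micro-batch; the function names and the choice of 4 partitions and 8 micro-batches are illustrative and do not come from the experiments.

```python
def forward_schedule(num_partitions, num_micro_batches):
    """Return schedule[step][k]: the micro-batch index on accelerator k, or None if idle."""
    num_steps = num_micro_batches + num_partitions - 1
    return [
        [step - k if 0 <= step - k < num_micro_batches else None
         for k in range(num_partitions)]
        for step in range(num_steps)
    ]


def bubble_fraction(schedule):
    """Fraction of (step, accelerator) slots in which the accelerator is idle."""
    slots = [cell for row in schedule for cell in row]
    return slots.count(None) / len(slots)


schedule = forward_schedule(num_partitions=4, num_micro_batches=8)
idle = bubble_fraction(schedule)   # 3 idle steps out of 11 per accelerator
```

Under this idealized model each accelerator idles for one step per upstream or downstream neighbor it has to wait on, so the idle fraction shrinks as the number of micro-batches grows relative to the number of partitions.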

During the backward pass, gradients for each micro-batch are computed based on the same model parameters as the forward pass. Gradients are applied to update model parameters across accelerators only at the end of each mini-batch. Therefore, GPipe maintains the same synchronous nature of gradient descent, independent of the number of partitions. This is important because deep learning training is sensitive to hyperparameters such as learning rate schedules and dropout probabilities. Such guarantee frees researchers from the time consuming process of re-tuning hyperparameters.
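The consistency argument can be checked on a toy problem. The sketch below uses a one-parameter linear model with mean squared error, which is an assumption chosen for brevity: it accumulates gradients over micro-batches using the same parameter value and compares the result with the gradient of the whole mini-batch.

```python
def grad_sum(w, xs, ys):
    """Sum over examples of d/dw (w * x - y)**2."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def full_batch_gradient(w, xs, ys):
    """Gradient of the mean squared error over the whole mini-batch."""
    return grad_sum(w, xs, ys) / len(xs)


def accumulated_gradient(w, xs, ys, num_micro_batches):
    """Accumulate micro-batch gradients, then normalize once by the mini-batch size.

    Assumes the mini-batch size is divisible by num_micro_batches.
    """
    size = len(xs) // num_micro_batches
    total = 0.0
    for t in range(num_micro_batches):
        lo, hi = t * size, (t + 1) * size
        total += grad_sum(w, xs[lo:hi], ys[lo:hi])   # same w for every micro-batch
    return total / len(xs)


xs = [float(i) for i in range(1, 9)]
ys = [2.0 * x + 1.0 for x in xs]
w = 0.5
g_full = full_batch_gradient(w, xs, ys)
g_acc = accumulated_gradient(w, xs, ys, num_micro_batches=4)
```

The two values agree up to floating-point rounding because the parameters are only updated after all micro-batches have contributed, which is the property that makes the result independent of how the mini-batch is split.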

If batch normalization [25] is used in the network, the sufficient statistics of inputs during training are computed over each micro-batch, and over replicas if necessary [41]. We also track the moving average of the sufficient statistics using the entire mini-batch for use during evaluation.

3.3 Optimization 

The computation of the backward pass at layer $i$ requires both the upper layer gradients $b_{i+1}$ and the cached activations $f_i(x)$. Therefore, the total cached activations need $O(N \times L)$ space without optimization, where $N$ is the mini-batch size and $L$ is the number of layers in the network. In order to reduce activation memory requirements, GPipe recomputes the forward passes. Each accelerator only stores output activations at the partition boundaries, rather than activations of all intermediate layers within the partition. During the backward pass, the $k$-th accelerator recomputes the composite forward function $F_k$ and requires only the cached activations at the partition boundaries. As a result, the size of peak activation memory reduces to $O\left(N + \frac{L}{K} \times \frac{N}{T}\right)$, where $\frac{N}{T}$ is the micro-batch size and $\frac{L}{K}$ is the number of layers in one partition.
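A rough back-of-the-envelope comparison follows from this description. The sketch assumes every layer produces one unit of activation memory per example, counts boundary activations for the whole mini-batch, and counts recomputed activations for one micro-batch within one partition. The function names and the example sizes are hypothetical.

```python
def cached_units_without_recompute(batch_size, num_layers):
    """Every layer caches activations for every example in the mini-batch."""
    return batch_size * num_layers


def peak_units_with_recompute(batch_size, num_layers, num_partitions, num_micro_batches):
    """Boundary activations for the mini-batch plus one partition's recomputed micro-batch."""
    boundary = batch_size
    layers_per_partition = num_layers // num_partitions
    micro_batch_size = batch_size // num_micro_batches
    return boundary + layers_per_partition * micro_batch_size


baseline = cached_units_without_recompute(batch_size=128, num_layers=32)
optimized = peak_units_with_recompute(
    batch_size=128, num_layers=32, num_partitions=4, num_micro_batches=16
)
```

In this toy setting the per-accelerator peak drops from 4096 units to 192 units. Real layers have very different activation sizes, so this only illustrates the scaling, not measured memory.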

As depicted in Figure 2(c), the aggregation of the loss during the forward pass introduces a bubble of idleness between the forward and backward passes. The bubble is amortized over the number of micro-steps $T$. In our experiments, we found that the bubble overhead was quite small. This is partly because recomputation during the backward pass can be scheduled earlier without waiting for gradients from earlier layers. Figure 2(c) assumes partitions are evenly balanced. However, memory requirements and computation flops at different layers are often quite imbalanced. For example, in many modern image models, such as ResNet, Inception, NasNets, and AmoebaNets, the number of convolution filters doubles every time the spatial dimensions of the activation tensors are reduced. The activation memory footprint per layer decreases linearly at later layers while the number of model parameters per layer increases quadratically. Therefore, imperfect partitioning algorithms will lead to load imbalance when partitioning those layers. Better partitioning algorithms can potentially improve the performance over our heuristic approach.

4 Results

This section provides a detailed analysis of the scalability and performance of GPipe. We evaluated ResNet and AmoebaNet in the experiments. ResNet is a representative neural network for image classification. AmoebaNet was the previous state-of-the-art image model on ImageNet. Both networks allowed us to increase the model size by changing the number of layers or the number of filters. We ran the experiments on TPU-v2s, each of which has 8 accelerator cores and 64 GB memory (8 GB per accelerator).

4.1 Memory 

                                   Naive-1   Pipeline-1  Pipeline-2  Pipeline-4  Pipeline-8
AmoebaNet-D (L, F)                 (6, 208)  (6, 416)    (6, 544)    (12, 544)   (24, 512)
# of Model Parameters              82M       318M        542M        1.05B       1.8B
Total Peak Model Parameter Memory  1.05GB    3.8GB       6.45GB      12.53GB     24.62GB
Total Peak Activation Memory       6.26GB    3.46GB      8.11GB      15.21GB     26.24GB
Table 1: Maximum model size of AmoebaNet supported by GPipe under different scenarios. Naive-1 refers to the sequential version without GPipe. Pipeline-$k$ means $k$ partitions with GPipe using $k$ accelerators. $L$ and $F$ control the number of layers and the number of filters of AmoebaNet, respectively. We recorded the maximum model size by increasing $L$ and $F$ until we reached the limits of accelerator memory in each scenario. Input image size was and the batch size was . GPipe divided the mini-batch into micro-batches. It supported up to 1.8 billion parameters with 8 accelerators. Total peak model parameter memory and activation memory across all accelerators are also shown.

GPipe uses recomputation and pipeline parallelism for better memory utilization. We expect that both methods can enable bigger models, which we verified experimentally in this section. To do this, we fixed the input image size at and the mini-batch size at . We studied the effect of each method on the maximum AmoebaNet model size that would fit with $K$ accelerators, $K \in \{1, 2, 4, 8\}$. An AmoebaNet model consists of a sequence of two repeated layer modules called the normal cell and the reduction cell. The normal cell preserves the input activation size. The reduction cell reduces the spatial dimension of the activation but increases the activation filter size. The capacity of an AmoebaNet is configured by two hyperparameters, $L$ and $F$. $L$ defines the number of normal cells stacked between reduction cells and $F$ specifies the number of filters in the first normal cell. We increased $L$ and $F$ until we reached the limits of accelerator memory. We then compared training a model with and without GPipe on a single accelerator to understand the benefits that GPipe introduces. We also partitioned AmoebaNet across different numbers of accelerators to study the payoff of pipeline parallelism. We reported the maximum model size, total peak activation memory, and total peak model parameter memory across accelerators under different scenarios in Table 1.

First, we found that GPipe enabled roughly 4 times bigger models on a single accelerator (318 million versus 82 million parameters in Table 1). Without recomputation, a single accelerator can train up to 82 million model parameters due to memory limits. Recomputation and mini-batch splitting reduced activation memory from 6.26 GB to 3.46 GB, enabling 318 million parameters on a single accelerator. For each model parameter, GPipe consumed 12 bytes: the parameter itself, its moving average, and its momentum each consume one single-precision float.
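The 12 bytes per parameter can be turned into a quick memory estimate. The sketch below assumes decimal gigabytes (10^9 bytes) and ignores any other per-parameter state; the function name is illustrative.

```python
BYTES_PER_FLOAT32 = 4
COPIES_PER_PARAMETER = 3   # the parameter, its moving average, and its momentum


def parameter_memory_gb(num_parameters):
    """Estimated parameter memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * BYTES_PER_FLOAT32 * COPIES_PER_PARAMETER / 1e9


mem_318m = parameter_memory_gb(318e6)   # single-accelerator model with recomputation
```

For 318 million parameters this gives about 3.8 GB, in line with the Pipeline-1 parameter memory reported in Table 1. The other columns of the table do not match this simple estimate as closely, so it should be read as an approximation only.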

Second, we saw that with pipeline parallelism the maximum model size grew roughly in proportion to the number of partitions, as expected. GPipe was capable of enabling AmoebaNet with 1.8 billion parameters across 8 accelerators, about a 5.7 times increase compared to the 318 million parameters on a single accelerator (Table 1). In total, GPipe supported models that are 25 times bigger using 8 accelerators in this experiment. The maximum model size was not a perfect linear function of the number of partitions because of the non-uniform distribution of model parameters over layers in AmoebaNet. This made it challenging to distribute layers evenly across multiple accelerators. With improvements to the partitioning algorithm, GPipe would be capable of allocating even larger models.

Figure 3: (a) Performance of AmoebaNet-D (4, 512) under different scenarios. This model could not fit into one accelerator. It achieved 3.5 times speedup compared to the baseline case naive-2: naive model parallelism with 2 partitions. (b) Performance of ResNet-101 under different scenarios. Pipeline-$k$ means $k$ partitions with GPipe using $k$ accelerators. The baseline naive-1 refers to the sequential version without GPipe. The image size for both models was fixed at . Note that ResNet-101 is a small model that does not benefit from model parallelism, but it allowed us to analyze system performance and identify overheads easily.

4.2 Performance 

Figure 4: Time step breakdowns from ResNet-101 runs with 2 (a) and 4 (b) partitions, respectively. We analyzed the trace files and measured the times spent on different categories. Setup overhead measured the time to divide and reshape the inputs for pipelining. Bubble overhead measured the idle time between forward and backward passes. Load imbalance measured the waiting time for the next micro-batch due to imbalanced partitions. Recompute accounted for the recomputation time during the backward pass. Weight update measured the time for applying gradients.

In this section, we evaluated various factors that trade off GPipe performance for better memory utilization. For example, recomputation of forward passes reduces activation memory but inevitably introduces computation overhead. Pipeline parallelism partitions networks across accelerators, but it can have overheads such as imbalanced workload and bubbles of idleness. It also requires setup time to divide and reshape the inputs. In our experiments, we measured the effects of pipeline parallelism and recomputation on the model throughput of ResNet-101 and AmoebaNet-D (4, 512). We fixed the image size at . We adjusted the mini-batch size to maximize the throughput. To isolate the effects of pipeline parallelism, we used $k$ accelerators to train a model with $k$ partitions. Since training AmoebaNet-D (4, 512) requires at least two accelerators, we reported the speedup with respect to the no-pipelining case with two partitions in Figure 3(a). We reported the speedup of ResNet-101 with respect to the sequential case without recomputation in Figure 3(b). To assess the overhead cost, we carefully studied the trace files from ResNet-101 runs to identify key factors that affect performance. We also examined how the effects of these factors change with the number of partitions in Figures 4(a) and 4(b).

We observed that the benefits of pipeline parallelism outweigh the performance overhead introduced. We saw an almost linear speedup in training AmoebaNet-D (4, 512). Compared to the naive approach with two partitions, distributing AmoebaNet-D (4, 512) across four times more accelerators achieved 3.5 times speedup. ResNet-101 is a relatively small model that doesn't need model parallelism for training, but it allowed us to analyze system performance easily. The relative throughput of ResNet-101 using GPipe with one partition is . Recomputation thus introduced about overhead. As ResNet-101 was distributed across more accelerators, performance increased. It achieved about times speedup with accelerators. In both examples, GPipe provided a way to increase throughput using more accelerators, complementary to the traditional data parallelism approach.

To study opportunities for future performance improvements, we identified key factors that would affect GPipe performance. We measured the times spent on the different activities listed in Figure 4(a). We showed the distributions of these times for ResNet-101 with 2 and 4 partitions in Figures 4(a) and 4(b), respectively. We found that recomputation time was the main contributor to GPipe overhead, taking up to of the total step time. Another source of overhead was load imbalance. With two partitions, it was only , but with four partitions, it rose up to overhead. Load balancing becomes increasingly difficult with more partitions in the network. Thus finding a good partitioning algorithm can help reduce this overhead in general. The theoretical bubble overhead is $O\left(\frac{K-1}{T+K-1}\right)$, where $K$ is the number of partitions and $T$ is the number of micro-batches in each mini-batch. The observed bubble overhead was slightly lower than the theoretical value, partly because recomputation was scheduled early to overlap with the bubble. Weight update time for gradient aggregation at the end of the pipeline was also small thanks to high-speed interconnections between the accelerators.

Model                      Image Size  # Parameters  Top-1 Accuracy (%)  Top-5 Accuracy (%)
Incep-ResNet V2 [48]                   55.8M         80.4                95.3
ResNeXt-101 [54]                       83.6M         80.9                95.6
PolyNet [56]                           92.0M         81.3                95.8
Dual-Path-Net-131 [10]                 79.5M         81.5                95.8
SENet [24]                             146M          82.7                96.2
AmoebaNet-C (6, 228) [13]              155.3M        83.5                96.5
AmoebaNet-B (6, 512)                   557M
Table 2: Single-model, single-crop classification accuracy for AmoebaNet-B (6, 512) compared to other published state-of-the-art models on the ImageNet ILSVRC 2012 validation dataset. The data in this table suggest that better model quality might be obtained with higher model capacity (# of parameters) and more computation (larger input image size). Note that Mahajan et al.'s model [35] achieved higher top-1 accuracy, but it was pretrained on non-public external data (Instagram images with hashtags). *Hu et al. reported classification error using an ensemble of models with multiple crops on the ImageNet 2017 test dataset, which is different from the ImageNet 2012 validation dataset in this table.

4.3 Model quality

4.3.1 Consistent Training

GPipe performs synchronous training over the micro-batches. In this section, we verified the hypothesis that the end-to-end convergence accuracy using GPipe is the same within statistical errors, regardless of the number of partitions. We trained AmoebaNet-D (2, 128) several times for epochs and measured the final validation accuracy on ImageNet. We chose AmoebaNet-D (2, 128) since it was the winning image model by training cost in the DAWNBench competition [12]. We adopted the same hyperparameters and training procedure reported in DAWNBench. As a baseline, we trained AmoebaNet-D (2, 128) times using the official open source implementation and computed the mean and standard deviation of the final accuracy. Using the same hyperparameters and training procedures, we trained the same network using GPipe with and partitions. We found that the resulting accuracy fell within two standard deviations from the mean, as expected.

4.3.2 Scaling up Giant Models

We verified the hypothesis that scaling up existing neural networks can achieve even better model quality. As a proof of concept, we trained an AmoebaNet-B (6, 512) with 557 million model parameters and input image size of on the ImageNet ILSVRC-2012 dataset. We followed the same hyperparameters and input pre-processing as described in [45] to train AmoebaNet-B (6, 512). We employed the RMSProp optimizer with a decay of and , regularization , label smoothing coefficient and an auxiliary head with weight . We applied the same drop-path schedule to intermediate layers as in NasNet [57], and dropout to the final layer with probability . We used a learning rate schedule that decays every epochs at a rate of with an initial learning rate of times the batch size. The network was divided into 4 partitions, and we performed training using both model and data parallelism. We adopted mixed precision training [37] where activations are represented in half precision. Unlike other mixed precision training strategies, we didn't scale the loss values thanks to the wide dynamic range of bfloat16 on TPUs. We used the ImageNet ILSVRC-2012 dataset for training and reported the validation accuracy in Table 2. This giant model reached 84.3% top-1 validation accuracy with single-crop.

4.3.3 Transfer Learning

Large neural networks are not only applicable to datasets like ImageNet, but also relevant for other datasets through transfer learning [44, 19, 46]. One successful approach to transfer learning is to use ImageNet pre-trained models as initialization for training on a target dataset. In this section, we evaluate the transfer learning performance of the best giant model found in Section 4.3.2, which achieved 84.3% top-1 accuracy on ImageNet.

We ran transfer learning experiments on the following datasets: CIFAR-10, CIFAR-100 [31], Birdsnap [2], Stanford Cars [29], FGVC Aircraft [36], Oxford-IIIT Pets [39], and Food-101 [3]. This spanned a range of tasks from general object recognition to fine-grained classification.

We trained an AmoebaNet-B (6, 512) model for each of these datasets. We changed the number of output units in the last softmax classification layer to the number of classes in the target dataset. This softmax layer was initialized randomly, while all other layers were initialized with the best parameters trained on ImageNet. We selected the learning rate and weight regularization parameters for each dataset on a hold-out subset of the training dataset. For other hyperparameters we used the same ones as in ImageNet training. We adopted the image pre-processing procedure that is widely used for training on CIFAR datasets. In all our transfer learning experiments, input images to the network during training were resized to , horizontally flipped randomly, and augmented using cutout [18]. We trained the models for steps using stochastic gradient descent with momentum. Each mini-batch contained 256 examples. We reported the averaged single-crop accuracy on test sets across 5 fine-tuning runs for each dataset.

Dataset # Training Examples # Test Examples # Classes Our Model Accuracy (%) Previously Reported Result (%)
CIFAR-10 50,000 10,000 10 [13]
CIFAR-100 50,000 10,000 100 [13]
Stanford Cars 8,144 8,041 196 [13]
Oxford-IIIT Pets 3,680 3,369 37 [42]
Food-101 75,750 25,250 101 [14]
FGVC Aircraft 6,667 3,333 100 [55]
Birdsnap 47,386 2,443 500 [52]
Table 3: Transfer learning results using AmoebaNet-B (6, 512) initialized with the best ImageNet model, using an input image size of and single crop at test time. Our results were averaged across 5 fine-tuning runs. Baseline results for CIFAR-10, CIFAR-100, and Stanford Cars are from Cubuk et al. [13], whose models were trained from scratch. For all other datasets, the baselines used some form of ImageNet pre-training. *For some datasets, Ngiam et al. [38] achieved better results by pre-training with data from a large private dataset (JFT-300M).

We found that our giant models performed well on the target datasets, obtaining results that were competitive with state-of-the-art models (Table 3). For example, they reduced the CIFAR-10 error rate to 1% and the CIFAR-100 error rate to 8.7%. These results corroborate the finding of Kornblith et al. [28] that ImageNet performance correlates well with transfer learning performance.

5 Discussion

Our work validates the hypothesis that bigger models and more computation lead to higher model quality. This hypothesis is also supported by past advances in visual recognition tasks shown in Figure 1 and by recent progress in other fields, such as BigGAN [5] and BERT [17]. Those results suggest that accuracy improvements on machine learning tasks may be obtained by further increasing the scale of neural networks beyond the limits of accelerator memory. Moreover, the availability of bigger datasets such as JFT-300M [47] and hashtagged Instagram [35] also reduces the risk of over-fitting and encourages giant networks with higher capacity.

GPipe supports models up to -billion parameters with accelerators in our experiments, inviting future research on searching for efficient network architectures with billions of parameters. As a proof of concept, we only scaled up the capacity of AmoebaNet to 557 million parameters by doubling the number of filters in our experiments. This is not necessarily the most effective way to grow the model size. There might exist better ways to augment a model, such as increasing the number of layers or employing more branches of transformations.

GPipe allows us to revisit some of the choices in network architecture design that might have been made because of limited accelerator memory. For example, one design choice in existing image classification models is to aggressively reduce the spatial dimensions of inputs in the first few layers. Employing convolution or pooling layers with non-unity stride values at the beginning greatly reduces the activation memory requirement. However, some lower-level input features might be lost because of these aggressive early reductions. We verified this hypothesis by running a control experiment that compared aggressive reduction with delayed reduction. We reduced the stride value of the first convolution layer and increased the stride value of the last convolution layer on AmoebaNet-D (2, 256). As a result, the activation memory footprint increased four times but the model size stayed the same. This change improved the ImageNet top-1 accuracy of the network from to .
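To see why delaying the reduction multiplies activation memory by four while leaving the parameter count untouched, consider a single convolution layer. The sketch below uses the standard output-size formula; the 224-pixel input, 3x3 kernel, channel counts, "same"-style padding, and the change from stride 2 to stride 1 are illustrative assumptions, not the AmoebaNet-D (2, 256) configuration.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_stats(in_size, in_channels, out_channels, kernel, stride):
    """Return (activation count, parameter count) for one square conv layer."""
    padding = kernel // 2  # assumed "same"-style padding
    out_size = conv_output_size(in_size, kernel, stride, padding)
    activations = out_size * out_size * out_channels
    # Weights plus biases: the stride does not appear here at all.
    params = kernel * kernel * in_channels * out_channels + out_channels
    return activations, params


# Hypothetical first layer: 224x224 RGB input, 32 filters of size 3x3.
aggressive_acts, aggressive_params = conv_stats(224, 3, 32, kernel=3, stride=2)
delayed_acts, delayed_params = conv_stats(224, 3, 32, kernel=3, stride=1)
activation_ratio = delayed_acts // aggressive_acts
```

Halving the stride doubles each spatial dimension of the output, so the layer stores four times as many activations with exactly the same weights. Every later layer that sits before the compensating stride increase sees the same enlarged feature maps, which is consistent with the fourfold growth in activation memory reported above.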

GPipe can scale training by employing even more accelerators without changes in the hyperparameters. Therefore, it can be combined with data parallelism to scale neural network training using even more accelerators in a complementary way. Pure data parallelism with stochastic gradient descent suffers from inferior model generalization when the global mini-batch size is extremely large. Significant re-tuning and optimization are required to train on ImageNet without loss of accuracy when the global mini-batch size is greater than  [20, 26].
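The complementarity can be made concrete with a little arithmetic. The sketch below assumes that a fixed pool of accelerators is divided into data-parallel replicas, each of which is a pipeline of `partitions` accelerators processing one mini-batch of `replica_batch` examples per step. The function name and all numbers are hypothetical.

```python
def data_parallel_layout(accelerators, partitions, replica_batch):
    """Return (number of replicas, global mini-batch size).

    Each replica is a pipeline spanning `partitions` accelerators, so the
    number of data-parallel replicas is accelerators // partitions.
    """
    if partitions < 1 or accelerators % partitions != 0:
        raise ValueError("accelerators must be a positive multiple of partitions")
    replicas = accelerators // partitions
    return replicas, replicas * replica_batch


# Hypothetical cluster of 64 accelerators, 256 examples per replica.
pure_data_parallel = data_parallel_layout(64, partitions=1, replica_batch=256)
hybrid = data_parallel_layout(64, partitions=8, replica_batch=256)
```

Under these assumptions, pure data parallelism uses 64 replicas and a global mini-batch of 16,384 examples, while the hybrid layout uses the same 64 accelerators with 8 replicas and a global mini-batch of 2,048. Pipeline parallelism therefore absorbs additional accelerators without pushing the global mini-batch into the regime that needs re-tuning.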

GPipe enables pipeline parallelism for any neural network that consists of a sequence of layers. It can be further applied to more deep learning tasks such as object detection, image segmentation, and natural language processing. The training efficiency of GPipe can be further improved by better graph partitioning algorithms.
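As an illustration of what a partitioning algorithm for a sequence of layers has to do, the sketch below splits a list of per-layer costs into at most `k` contiguous partitions so that the most expensive partition is as cheap as possible. It is a textbook binary search over the bottleneck cost with a greedy feasibility check, not the partitioner used by GPipe; the function name and the integer layer costs are hypothetical.

```python
def partition_layers(costs, k):
    """Split `costs` (positive integers, one per layer) into at most `k`
    contiguous groups, minimising the largest group sum.

    Returns (bottleneck cost, list of groups).
    """
    if not costs or k < 1:
        raise ValueError("need at least one layer and one partition")

    def groups_needed(cap):
        # Greedily fill each group up to `cap` and count the groups used.
        groups, current = 1, 0
        for cost in costs:
            if current + cost > cap:
                groups += 1
                current = cost
            else:
                current += cost
        return groups

    # The answer lies between the largest single layer and the total cost.
    low, high = max(costs), sum(costs)
    while low < high:
        mid = (low + high) // 2
        if groups_needed(mid) <= k:
            high = mid
        else:
            low = mid + 1

    # Rebuild the groups for the optimal bottleneck `low`.
    parts, current, total = [], [], 0
    for cost in costs:
        if current and total + cost > low:
            parts.append(current)
            current, total = [], 0
        current.append(cost)
        total += cost
    parts.append(current)
    return low, parts


# Hypothetical per-layer costs split across 4 accelerators.
layer_costs = [4, 2, 3, 7, 1, 5, 6, 2]
bottleneck, partitions = partition_layers(layer_costs, k=4)
```

In a synchronous pipeline the slowest partition bounds the time of every step, so balancing the partitions in this way is one route to higher utilization. A practical partitioner would also have to account for communication between partitions and for memory limits, which this sketch ignores.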

6 Conclusion

In this work, we introduce GPipe, a scalable model parallelism library that addresses the memory bottleneck for giant neural networks. It allows researchers to explore deeper and more complex deep learning models. For example, GPipe supports models up to 25 times larger with 8 accelerators, demonstrating its scalability. Moreover, it can achieve up to a 3.5 times speedup with 4 times more accelerators without tuning. In all cases, it converges to the same accuracy as the sequential version without any changes to the model hyperparameters. Furthermore, we demonstrate the power of our framework by training a giant AmoebaNet model that achieves 84.3% top-1 / 97% top-5 ImageNet validation accuracy, 99% CIFAR-10 accuracy, and 91.3% CIFAR-100 accuracy.


Acknowledgments

We wish to thank Esteban Real, Alok Aggarwal, Xiaodan Song, Naveen Kumar, Mark Heffernan, Rajat Monga, Megan Kacholia, Samy Bengio, and Jeff Dean for their support and valuable input; Patrick Nguyen, Xiaoqiang Zheng, Yonghui Wu, Noam Shazeer, Barret Zoph, Ekin Cubuk, Tianqi Chen, and Vijay Vasudevan for helpful discussions and inspirations; and the larger Google Brain team.