1 Introduction
Deep neural networks have advanced many machine learning tasks, including speech recognition
[11], visual recognition [57, 45], and language processing [17]. Their successes are largely due to the model's capacity to learn complex features from vast amounts of data. Increasing the size of models has been shown to dramatically improve task performance. One of the most challenging and popular machine learning tasks is the ImageNet visual recognition challenge [16], where researchers compete to create the most accurate model for classifying the images in the dataset. The winner of the 2014 ImageNet challenge was GoogLeNet
[49], which achieved top-1 accuracy with million parameters. The winner of the 2017 ImageNet challenge was Squeeze-and-Excitation Networks [24], which achieved top-1 accuracy with million parameters. This corresponds to more than a times increase in the number of parameters in the best visual recognition models, as shown in Figure 1. However, memory available on accelerators such as GPUs has only increased from 12 GB in 2014 (Nvidia K40) to 32 GB in 2018 (Nvidia V100). Hence, training even bigger neural networks can be challenging when faced with accelerator memory limits.
There is an increasing need to scale up deep neural networks. Modern machine learning datasets are growing faster than ever in both size and quality. Image classification datasets such as OpenImages, JFT [47], and hashtagged Instagram [35]
contain hundreds of millions of high-definition images. Higher image resolutions provide greater detail of the object but consume more memory. This leads to contention between memory allocated to model parameters and to network activations, reinforcing the need to break the accelerator memory limit. The larger volume of training data helps reduce overfitting and enables deep neural networks to grow bigger. Meanwhile, the emergence of deep learning supercomputers such as Nvidia DGX and Google TPU enables efficient parallelism by providing fast interconnections between accelerators. Memory restrictions have limited the scale of deep neural networks and confined researchers to smaller-scale problems with fewer parameters. For example, while the average ImageNet resolution is
, it has been shown that increasing the input image size can lead to higher accuracy [23]. However, most current models are engineered to use a limited input image size or to fit within accelerator memory limits. Our work focuses on removing this limiting factor in scaling up deep neural networks.
To overcome the memory limitation, we propose to use pipeline parallelism to scale up deep neural network training. We design and implement GPipe, a distributed machine learning library that uses synchronous mini-batch gradient descent for training. GPipe partitions a model across different accelerators and automatically splits a mini-batch of training examples into smaller micro-batches
. By pipelining the execution across micro-batches, accelerators can operate in parallel. In addition, GPipe automatically recomputes the forward activations during backpropagation to further reduce memory consumption. Gradients are consistently accumulated across micro-batches, so the number of partitions does not affect model quality. GPipe therefore allows researchers to easily deploy more accelerators to train larger models, and to scale performance without retuning hyperparameters.
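The gradient-accumulation property described above can be illustrated on a toy model: averaging per-micro-batch gradients, weighted by their share of the mini-batch, reproduces the full mini-batch gradient exactly, which is why the number of micro-batches does not change what is learned. This is a minimal sketch on a linear least-squares model, not GPipe's actual implementation.

```python
import numpy as np

def grad_full_batch(w, x, y):
    """Gradient of the mean squared error 0.5*(x@w - y)^2 over the whole mini-batch."""
    return x.T @ (x @ w - y) / len(x)

def grad_microbatched(w, x, y, num_micro):
    """Accumulate gradients over micro-batches, weighting each by its share of
    the mini-batch, before a single synchronous parameter update."""
    total = np.zeros_like(w)
    for xs, ys in zip(np.split(x, num_micro), np.split(y, num_micro)):
        total += grad_full_batch(w, xs, ys) * (len(xs) / len(x))
    return total

rng = np.random.default_rng(0)
x, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)
# The accumulated gradient matches the full mini-batch gradient exactly:
assert np.allclose(grad_full_batch(w, x, y), grad_microbatched(w, x, y, 4))
```

Because the two gradients are identical, training with 1, 2, or 8 micro-batches per mini-batch follows the same optimization trajectory.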
GPipe maximizes memory allocation for model parameters. In experiments, we show that GPipe can support models up to times larger using accelerators without reducing the batch size. The implementation of GPipe is very efficient: with times more accelerators we can achieve a times speedup for training giant neural networks. GPipe can be combined with data parallelism [51] to scale training in a complementary way.
Finally, we demonstrate the empirical power of giant neural networks on image classification tasks. We increase the number of parameters of an AmoebaNet model to 557 million and train it with a larger input image size on the ImageNet ILSVRC 2012 dataset. Our scaled-up AmoebaNet model attains top-1 / top-5 validation accuracy. To the best of our knowledge, it outperforms all other models trained from scratch on the ImageNet dataset.¹ Furthermore, we use this learned giant model as an initialization for training on seven datasets that span a wide range of tasks, from general recognition to fine-grained classification. We find that giant models perform well on those datasets, obtaining results that are competitive with state-of-the-art models. For example, we push CIFAR-10 accuracy to and CIFAR-100 accuracy to .

¹Mahajan et al.'s model [35] achieved top-1 accuracy but it was pretrained on non-public external data (Instagram images with hashtags).
In summary, this paper introduces GPipe, a scalable model parallelism library for training giant deep learning models with the following key contributions:


It supports models up to times larger using accelerators, due to recomputation and model parallelism.

It achieves up to times speedup with four times more accelerators using pipelining in our experiments.

It guarantees consistent training regardless of the number of partitions due to synchronous gradient descent.

It advances the performance of visual recognition tasks on multiple datasets, including pushing ImageNet top-5 accuracy to , CIFAR-10 accuracy to , and CIFAR-100 accuracy to .
2 Related Work
Deep neural networks typically consist of a sequence of layers. During training, a neural network first uses the current model parameters to compute predictions from input minibatches in the forward pass. Then, the gradients are computed by backpropagating prediction errors (Figure LABEL:fig_forward_backward). Computing gradients in each intermediate layer requires both gradients from upper layers and the cached output activations from the forward pass. Thus, activation memory requirements typically grow in proportion to the number of layers, leaving less space for storing model parameters.
Various approaches have been explored to allow accelerators to train bigger models, with different trade-offs between memory, performance, and model quality. One common method is to recompute forward-pass activations during backpropagation [21, 8], which significantly reduces the memory required to cache activations. However, this method is still limited by the memory size of a single accelerator. Another approach is to swap memory between accelerators and the host [15], but this often slows down training because of the limited communication bandwidth between the host and accelerators.
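The recomputation idea cited above [21, 8] can be sketched concretely: cache only one activation per segment during the forward pass, then re-run each segment's forward computation during backpropagation. The sketch below uses a toy chain of `tanh` layers with a hand-written derivative; it is an illustration of the technique, not the cited systems' code, and assumes the layer count divides evenly into segments.

```python
import numpy as np

def backward_full(grad_out, x, num_layers):
    """Baseline: cache every intermediate activation of a tanh chain,
    then backpropagate through all of them."""
    acts = [x]
    for _ in range(num_layers):
        acts.append(np.tanh(acts[-1]))
    g = grad_out
    for out in reversed(acts[1:]):      # d tanh(z)/dz = 1 - tanh(z)^2
        g = g * (1.0 - out ** 2)
    return g

def backward_checkpointed(grad_out, x, num_layers, seg):
    """Recomputation: cache only one activation per `seg` layers on the
    forward pass; re-run each segment forward during backprop.
    Assumes num_layers % seg == 0."""
    boundaries, a = [x], x
    for i in range(num_layers):
        a = np.tanh(a)
        if (i + 1) % seg == 0 and i + 1 < num_layers:
            boundaries.append(a)        # segment-boundary activations only
    g = grad_out
    for s in reversed(range(len(boundaries))):
        a, outs = boundaries[s], []
        for _ in range(seg):            # recompute the segment's forward pass
            a = np.tanh(a)
            outs.append(a)
        for out in reversed(outs):      # backprop through the segment
            g = g * (1.0 - out ** 2)
    return g

x = np.linspace(-1.0, 1.0, 5)
g = np.ones_like(x)
assert np.allclose(backward_full(g, x, 6), backward_checkpointed(g, x, 6, 2))
```

The checkpointed version stores 3 activations instead of 6 here; for L layers split into segments of length s, storage drops from O(L) to O(L/s) cached tensors at the cost of one extra forward pass.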
Standard parallelism techniques, including data parallelism and model parallelism, provide orthogonal ways to use more accelerators for training. Data parallelism [51] effectively scales up the global mini-batch size: each machine computes the gradient on a mini-batch of training examples and updates the model parameters either synchronously or asynchronously at the end of each training step [4, 34]. Data parallelism is widely used due to its simplicity and effectiveness. However, because the batch size is proportional to the number of machines, and different batch sizes often require different hyperparameters, scaling deep network training purely through data parallelism has become more challenging.
Model parallelism is a complementary technique to data parallelism. A naive strategy is to divide the computation into partitions and assign different partitions to different accelerators [30, 33]. This approach is straightforward when networks consist mainly of parallel branches. However, many deep learning models stack layers sequentially, presenting a challenge to parallelize computation efficiently. A naive partition strategy may result in only one accelerator active during computation, significantly underutilizing accelerator compute capacity (Figure LABEL:fig_naive_partition).
Pipelining is a common parallel algorithm [32] that integrates model and data parallelism. Petrowski et al. explored accelerating training neural networks via pipelining on early parallel machines [43]. Chen et al. used pipeline computation to approximate expensive backpropagation [9]. Wu et al. [53]
parallelized the computation of stacked recurrent neural networks on GPUs in a pipelined fashion. Recently, PipeDream
[22] introduced a pipelining approach to reduce communication overhead for synchronized training using parameter servers [34]. However, it suffered from inconsistency and staleness issues in the backward pass, which could lead to unstable and poor model quality. To address the consistency issue, PipeDream maintained multiple versions of model parameters on the accelerator; these constraints could prevent PipeDream from scaling up to bigger models. Similarly, DualPipe [6] optimized pipeline performance by assuming that there exists a robust way to predict future model parameters for backpropagation. Unlike these approaches, GPipe has no inconsistency or staleness issues. It integrates recomputation with pipeline parallelism to maximize memory and compute utilization, offering effective and efficient synchronous training of large-scale deep neural networks.

3 Methods
This section describes the main design features of GPipe. The library is implemented with the TensorFlow
[1] framework; the core algorithm could be implemented in other frameworks [27, 7, 40] as well. It will be open-sourced in the coming months.

3.1 Interface
The caller of the GPipe interface specifies a sequential list of layers. Each layer specifies its model parameters, its stateless forward computation function,
and an optional cost estimation function
that estimates the static computation cost of the layer given the shapes of all its inputs. Neighboring layers can be combined into a composite layer. For example, a composite layer may be composed of consecutive layers from the th layer to the th layer; in this case, its model parameters are the union of its members' parameters and its forward function is the composition of their forward functions. The corresponding backpropagation function is derived using TensorFlow's automatic symbolic differentiation mechanism, and its cost estimator is constructed from the members' cost estimators.

3.2 Algorithm
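The interface above can be sketched in plain Python. The names `LayerSpec` and `compose` are hypothetical illustrations of the layer-list/composite-layer idea, not GPipe's actual API, and the example uses toy scalar "layers" rather than real network ops.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LayerSpec:
    """A sequential layer: model parameters, a stateless forward function,
    and an optional static cost estimator over input shapes."""
    params: dict
    forward: Callable                  # forward(params, inputs) -> outputs
    cost: Optional[Callable] = None    # cost(input_shapes) -> estimated cost

def compose(layers):
    """Combine consecutive layers into one composite layer: its parameters are
    the union of the members' and its forward function chains theirs."""
    merged = {}
    for i, layer in enumerate(layers):
        for name, value in layer.params.items():
            merged[f"layer{i}/{name}"] = value   # namespaced union of params
    def forward(params, x):
        for i, layer in enumerate(layers):
            prefix = f"layer{i}/"
            sub = {k[len(prefix):]: v for k, v in params.items()
                   if k.startswith(prefix)}
            x = layer.forward(sub, x)
        return x
    return LayerSpec(params=merged, forward=forward)

# Two toy layers: scale by a parameter, then shift by a parameter.
scale = LayerSpec({"s": 2.0}, lambda p, x: x * p["s"])
shift = LayerSpec({"b": 1.0}, lambda p, x: x + p["b"])
block = compose([scale, shift])
assert block.forward(block.params, 3.0) == 7.0   # 3 * 2 + 1
```

In the real library the forward functions are TensorFlow graph builders and the backward functions come from automatic differentiation rather than being written by hand.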
After users define their network layers in terms of model parameters, cost estimation functions, and forward computation functions, GPipe partitions the network into composite layers and places the th composite layer onto the th accelerator, where
is the number of partitions specified by the user. Communication primitives are automatically inserted by GPipe at the partition boundaries to allow data exchange between neighboring partitions. The partitioning algorithm is heuristic-based: it simply minimizes the variance of the composite layers' estimated costs. We expect that better partitioning algorithms could further improve the performance of GPipe.
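To make the partitioning step concrete, here is a small sketch that splits a sequence of per-layer cost estimates into contiguous groups. It minimizes the cost of the most expensive group, a simple proxy for the variance-minimizing heuristic described above; GPipe's actual heuristic and the cost values are not from the paper.

```python
from functools import lru_cache

def balanced_partition(costs, k):
    """Split per-layer cost estimates into k contiguous groups, minimizing
    the cost of the most expensive group (an illustrative stand-in for
    GPipe's variance-minimizing heuristic)."""
    n = len(costs)
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def solve(i, parts):
        """Best (max group cost, first cut) for costs[i:] split into `parts` groups."""
        if parts == 1:
            return prefix[n] - prefix[i], n
        best = (float("inf"), n)
        for j in range(i + 1, n - parts + 2):   # leave >= 1 layer per group
            head = prefix[j] - prefix[i]
            tail, _ = solve(j, parts - 1)
            best = min(best, (max(head, tail), j))
        return best

    groups, i = [], 0
    for parts in range(k, 0, -1):
        _, j = solve(i, parts)
        groups.append(costs[i:j])
        i = j
    return groups

# Layer costs that double after each reduction, as in many image models:
print(balanced_partition([1, 1, 2, 2, 4, 4, 8, 8], 4))
```

On this example the split is [1, 1, 2, 2], [4, 4], [8], [8]: the cheap early layers are grouped together so every partition carries a similar load.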
During training, GPipe first divides a mini-batch into micro-batches at the first layer, each containing
examples. For instance, an image input tensor of shape [
, , , ] is reshaped into [, , , , ]. During the forward pass (Figure LABEL:fig_pipeline_partition), the th accelerator starts to compute as soon as it finishes the th micro-batch and receives the corresponding inputs from the previous partition; at the same time, the next accelerator can start its own computation. Each accelerator repeats this process until the forward pass of the whole mini-batch is finished. There is still some idle time per accelerator, which we refer to as the bubble overhead, as depicted in Figure LABEL:fig_pipeline_partition; this bubble time is amortized over the number of micro-batches. The last accelerator is also responsible for concatenating the outputs across micro-steps and computing the final loss. During the backward pass, gradients for each micro-batch are computed with the same model parameters as in the forward pass, and gradients are applied to update model parameters across accelerators only at the end of each mini-batch. GPipe therefore maintains the same synchronous nature of gradient descent, independent of the number of partitions. This is important because deep learning training is sensitive to hyperparameters such as learning-rate schedules and dropout probabilities; such a guarantee frees researchers from the time-consuming process of re-tuning hyperparameters.
3.3 Optimization
The computation of the backward pass at a layer requires both the upper layer's gradients and that layer's cached output activations. Without optimization, the total cached activations therefore require space proportional to the mini-batch size times the number of layers in the network. To reduce activation memory requirements, GPipe recomputes the forward passes: each accelerator stores only the output activations at the partition boundaries, rather than the activations of all intermediate layers within its partition. During the backward pass, the th accelerator recomputes its composite forward function, requiring only the cached activations at its partition boundaries. As a result, the peak activation memory is reduced to a quantity proportional to the micro-batch size times the number of layers in one partition, plus the boundary activations for the mini-batch.
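The original notation of this accounting was lost in extraction; as a sketch, using our own symbols (N for mini-batch size, L for total layers, K for partitions, T for micro-batches), the reduction reads roughly:

```latex
% Without recomputation, every layer caches activations for the full mini-batch:
\text{activation memory} = O(N \times L)
% With GPipe, only partition-boundary activations for the mini-batch plus one
% micro-batch's activations inside a single partition of L/K layers are live:
\text{activation memory} = O\!\left(N + \frac{L}{K} \times \frac{N}{T}\right)
```

The first term covers the boundary caches and the second the transient activations recomputed for one micro-batch within one partition.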
As depicted in Figure LABEL:fig_pipeline_partition, the aggregation of the loss during the forward pass introduces a bubble of idleness between the forward and backward passes. This bubble is amortized over the number of micro-steps. In our experiments, we found the bubble overhead to be quite small, partly because recomputation during the backward pass can be scheduled early, without waiting for gradients from earlier layers. Figure LABEL:fig_pipeline_partition assumes evenly balanced partitions. However, memory requirements and computation flops at different layers are often quite imbalanced. In many modern image models, such as ResNet, Inception, NasNet, and AmoebaNet, the number of convolution filters doubles every time the spatial dimensions of the activation tensors are reduced, so the activation memory footprint per layer decreases linearly at later layers while the number of model parameters per layer increases quadratically. Imperfect partitioning algorithms therefore lead to load imbalance when partitioning those layers; better partitioning algorithms could improve on our heuristic approach.
4 Results
This section provides a detailed analysis of the scalability and performance of GPipe. We evaluated ResNet and AmoebaNet in our experiments. ResNet is a representative neural network for image classification; AmoebaNet was the previous state-of-the-art image model on ImageNet. Both networks allow us to increase the model size by changing the number of layers or the number of filters. We ran the experiments on TPUv2s, each of which has eight accelerator cores and 64 GB of memory (8 GB per accelerator).
4.1 Memory
                                    Naive-1    Pipeline-1  Pipeline-2  Pipeline-4  Pipeline-8
AmoebaNet-D (L, F)                  (6, 208)   (6, 416)    (6, 544)    (12, 544)   (24, 512)
# of Model Parameters               82M        318M        542M        1.05B       1.8B
Total Peak Model Parameter Memory   1.05GB     3.8GB       6.45GB      12.53GB     24.62GB
Total Peak Activation Memory        6.26GB     3.46GB      8.11GB      15.21GB     26.24GB
GPipe uses recomputation and pipeline parallelism for better memory utilization. We expect both methods to enable bigger models, which we verified experimentally in this section. To do so, we fixed the input image size and the mini-batch size, and studied the effect of each method on the maximum AmoebaNet model size that fits in accelerator memory. An AmoebaNet model consists of a sequence of two repeated layer modules, the normal cell and the reduction cell. The normal cell preserves the input activation size; the reduction cell reduces the spatial dimensions of the activation but increases the number of activation filters. The capacity of an AmoebaNet is configured by two hyperparameters, L and F: L defines the number of normal cells stacked between reduction cells, and F specifies the number of filters in the first normal cell. We increased L and F until we reached the limits of accelerator memory. We then compared training a model with and without GPipe on a single accelerator to understand the benefits GPipe introduces, and partitioned AmoebaNet across different numbers of accelerators to study the payoff of pipeline parallelism. Table 1 reports the maximum model size, total peak activation memory, and total peak model parameter memory across accelerators under the different scenarios.
First, we found that GPipe enabled nearly 4 times bigger models on a single accelerator. Without recomputation, a single accelerator can train up to 82 million model parameters due to memory limits. Recomputation and mini-batch splitting reduced activation memory from 6.26 GB to 3.46 GB, enabling 318 million parameters on a single accelerator. GPipe consumed 12 bytes per model parameter: the parameter itself, its moving average, and its momentum each take one single-precision float.
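The 12-bytes-per-parameter accounting can be checked against Table 1 directly; this small sketch just does the arithmetic.

```python
BYTES_PER_FP32 = 4
STATES_PER_PARAM = 3   # the parameter, its moving average, and its momentum

def parameter_memory_gb(num_params):
    """Memory consumed by parameters plus their optimizer state, in GB."""
    return num_params * STATES_PER_PARAM * BYTES_PER_FP32 / 1e9

# The 318M-parameter single-accelerator configuration from Table 1:
print(f"{parameter_memory_gb(318e6):.2f} GB")   # ~3.8 GB, matching the table
```

The same arithmetic reproduces the other columns of Table 1 to within rounding (e.g. 542M parameters give about 6.5 GB against the reported 6.45 GB).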
Second, we saw that with pipeline parallelism the maximum model size was roughly proportional to the number of partitions, as expected. GPipe was capable of training an AmoebaNet with 1.8 billion parameters across 8 accelerators, a 5.6-fold increase over the 318 million that fit on a single accelerator. In total, GPipe supported models about 22 times bigger (1.8B vs. 82M parameters) using 8
accelerators in this experiment. The maximum model size was not a perfectly linear function of the number of partitions because of the non-uniform distribution of model parameters over layers in AmoebaNet, which makes it challenging to distribute layers evenly across multiple accelerators. With improvements to the partitioning algorithm, GPipe could allocate even larger models.
4.2 Performance
In this section, we evaluated the factors that trade GPipe performance for better memory utilization. For example, recomputation of forward passes reduces activation memory but inevitably introduces computation overhead. Pipeline parallelism partitions networks across accelerators, but it incurs overheads such as imbalanced workloads and bubbles of idleness, and it requires setup time to divide and reshape the inputs. In our experiments, we measured the effects of pipeline parallelism and recomputation on the model throughput of ResNet-101 and AmoebaNet-D (4, 512). We fixed the image size and adjusted the mini-batch size to maximize throughput. To isolate the effects of pipeline parallelism, we used as many accelerators as partitions. Since training AmoebaNet-D (4, 512) requires at least two accelerators, we reported speedups with respect to the no-pipelining case with two partitions in Figure LABEL:fig_amoebanet_speedup, and speedups of ResNet-101 with respect to the sequential case without recomputation in Figure LABEL:fig_resnet_speedup. To assess the overhead costs, we carefully studied trace files from the ResNet-101 runs to identify the key factors that affect performance, and examined how their effects change with the number of partitions in Figures LABEL:fig_step_time_2 and LABEL:fig_step_time_4.
We observed that the benefits of pipeline parallelism outweigh the performance overhead it introduces. We saw an almost linear speedup in training AmoebaNet-D (4, 512): compared to the naive approach with two partitions, distributing AmoebaNet-D (4, 512) across four times more accelerators achieved a times speedup. ResNet-101 is a relatively small model that does not need model parallelism for training, but it allowed us to analyze system performance easily. The relative throughput of ResNet-101 using GPipe with one partition is ; recomputation thus introduced about overhead. As ResNet-101 was distributed across more accelerators, performance increased, achieving about a times speedup with accelerators. In both examples, GPipe provided a way to increase throughput using more accelerators, complementary to the traditional data parallelism approach.
To study opportunities for future performance improvements, we identified the key factors that affect GPipe performance. We measured the time spent on the different activities listed in Figure LABEL:fig_step_time_2, and show the distributions of these times for ResNet-101 with 2 and 4 partitions in Figures LABEL:fig_step_time_2 and LABEL:fig_step_time_4, respectively. We found that recomputation time was the main contributor to GPipe overhead, taking up a substantial fraction of the total step time. Another source of overhead was load imbalance: it was small with two partitions but grew with four. Load balancing becomes increasingly difficult with more partitions, so finding a good partitioning algorithm can help reduce this overhead. The theoretical bubble overhead depends on K, the number of partitions, and T, the number of micro-batches in each mini-batch. The observed bubble overhead was slightly lower than the theoretical value, partly because recomputation was scheduled early to overlap with the bubble. The weight update time for gradient aggregation at the end of the pipeline was also small, thanks to the high-speed interconnections between the accelerators.
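The paper's exact bubble-overhead expression was lost in extraction; counting the pipeline schedule gives our reading of it, (K - 1) idle steps out of (T + K - 1) total, which the sketch below evaluates for a few micro-batch counts.

```python
def bubble_fraction(k, t):
    """Idle fraction of the pipeline schedule with k partitions and t
    micro-batches: (k - 1) bubble steps out of (t + k - 1) total steps.
    (Our reading of the amortized bubble cost, derived from the schedule.)"""
    return (k - 1) / (t + k - 1)

# More micro-batches amortize the bubble away:
for t in (1, 4, 16, 64):
    print(f"T={t:2d}: bubble = {bubble_fraction(4, t):.3f}")
```

With 4 partitions, the idle fraction falls from 3/4 at one micro-batch toward zero as T grows, consistent with the observation that a sufficiently large number of micro-batches makes the bubble overhead small.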
Model                       Image Size   # Parameters   Top-1 Accuracy (%)   Top-5 Accuracy (%)
Incep-ResNet V2 [48]                     55.8M          80.4                 95.3
ResNeXt-101 [54]                         83.6M          80.9                 95.6
PolyNet [56]                             92.0M          81.3                 95.8
DualPathNet-131 [10]                     79.5M          81.5                 95.8
SENet [24]                               146M           82.7                 96.2
AmoebaNet-C (6, 228) [13]                155.3M         83.5                 96.5
AmoebaNet-B (6, 512)                     557M
4.3 Model quality
4.3.1 Consistent Training
GPipe performs synchronous training over the micro-batches. In this section, we verified the hypothesis that the end-to-end convergence accuracy using GPipe is the same, within statistical error, regardless of the number of partitions. We trained AmoebaNet-D (2, 128) several times for the same number of epochs and measured the final validation accuracy on ImageNet. We chose AmoebaNet-D (2, 128) because it was the winning image model by training cost in the DAWNBench competition [12], and we adopted the same hyperparameters and training procedure reported in DAWNBench.² As a baseline, we trained AmoebaNet-D (2, 128)
several times using the official open-source implementation and computed the mean and standard deviation of the final accuracy. Using the same hyperparameters and training procedure, we trained the same network using GPipe with
different numbers of partitions. We found that the resulting accuracy fell within two standard deviations of the mean, as expected.

²https://github.com/stanfordfuturedata/dawnbenchentries/blob/master/ImageNet/train/google_amoeba_net_d_tpu_tensorflow18.json

4.3.2 Scaling up Giant Models
We verified the hypothesis that scaling up existing neural networks can achieve even better model quality. As a proof of concept, we trained an AmoebaNet-B (6, 512) with 557 million model parameters and a larger input image size on the ImageNet ILSVRC 2012 dataset. We followed the same hyperparameters and input preprocessing as described in [45]
to train AmoebaNet-B (6, 512). We employed the RMSProp optimizer with decay and epsilon settings, L2 regularization, a label smoothing coefficient, and an auxiliary head. We applied the same drop-path schedule to intermediate layers as in NasNet [57], and dropout to the final layer. We used a learning rate schedule that decays every few epochs, with an initial learning rate scaled by the batch size. The network was divided into 4 partitions, and we performed training using both model and data parallelism. We adopted mixed-precision training [37], in which activations are represented in half precision. Unlike other mixed-precision training strategies, we did not need to scale the loss values, thanks to the wide dynamic range of bfloat16 on TPUs. We used the ImageNet ILSVRC 2012 dataset for training and report the validation accuracy in Table 2. This giant model reached top-1 / top-5 validation accuracy with a single crop.

4.3.3 Transfer Learning
Large neural networks are applicable not only to datasets like ImageNet, but also to other datasets through transfer learning
[44, 19, 46]. One successful approach to transfer learning is to use ImageNet-pretrained models as initialization for training on a target dataset. In this section, we evaluate the transfer learning performance of the best giant model found in Section 4.3.2, which achieved the highest top-1 accuracy on ImageNet. We ran transfer learning experiments on the following datasets: CIFAR-10, CIFAR-100 [31], Birdsnap [2], Stanford Cars [29], FGVC Aircraft [36], Oxford-IIIT Pets [39], and Food-101 [3]. These span a range of tasks from general object recognition to fine-grained classification.
We trained an AmoebaNet-B (6, 512) model for each of these datasets. We changed the number of output units in the last softmax classification layer to the number of classes in the target dataset. This softmax layer was initialized randomly, while all other layers were initialized with the best parameters trained on ImageNet. We selected the learning rate and
weight regularization parameters for each dataset on a held-out subset of the training data; for all other hyperparameters we used the same values as in ImageNet training. We adopted the image preprocessing procedure widely used for training on CIFAR datasets. In all our transfer learning experiments, input images were resized, horizontally flipped at random, and augmented using cutout [18]. We trained the models for steps using stochastic gradient descent with momentum, with each mini-batch containing 256 examples. We report the single-crop accuracy on the test set averaged across 5 fine-tuning runs for each dataset.
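The initialization scheme described above (copy all pretrained layers, re-create only the softmax head for the target class count) can be sketched generically. This is an illustration, not the authors' code; the parameter names and the `init_for_transfer` helper are hypothetical, and the "checkpoint" below is a toy stand-in for a real pretrained model.

```python
import numpy as np

def init_for_transfer(pretrained, feature_dim, num_target_classes, seed=0):
    """Build parameters for fine-tuning: every layer except the final softmax
    is copied from the pretrained model; the softmax head is re-created with
    the target dataset's class count and random initialization."""
    rng = np.random.default_rng(seed)
    params = {k: v.copy() for k, v in pretrained.items()
              if not k.startswith("softmax/")}
    params["softmax/w"] = rng.normal(0.0, 0.01,
                                     size=(feature_dim, num_target_classes))
    params["softmax/b"] = np.zeros(num_target_classes)
    return params

# Toy stand-in for an ImageNet checkpoint: a backbone plus a 1000-way head.
pretrained = {"backbone/w": np.ones((8, 16)),
              "softmax/w": np.ones((16, 1000)),
              "softmax/b": np.zeros(1000)}
cifar10_params = init_for_transfer(pretrained, feature_dim=16,
                                   num_target_classes=10)
assert cifar10_params["softmax/w"].shape == (16, 10)
assert np.array_equal(cifar10_params["backbone/w"], pretrained["backbone/w"])
```

Fine-tuning then trains all of these parameters jointly on the target dataset, with only the randomly initialized head starting from scratch.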
Dataset            # Training Examples   # Test Examples   # Classes   Our Model Accuracy (%)   Previously Reported Result (%)
CIFAR-10           50,000                10,000            10                                   [13]
CIFAR-100          50,000                10,000            100                                  [13]
Stanford Cars      8,144                 8,041             196                                  [13]
Oxford-IIIT Pets   3,680                 3,369             37                                   [42]
Food-101           75,750                25,250            101                                  [14]
FGVC Aircraft      6,667                 3,333             100                                  [55]
Birdsnap           47,386                2,443             500                                  [52]
We found that our giant models performed well on the target datasets, obtaining results competitive with state-of-the-art models, as shown in Table 3. For example, they substantially reduced the CIFAR-10 and CIFAR-100 error rates. These results corroborate Kornblith et al.'s [28] finding that ImageNet performance correlates well with transfer learning performance.
5 Discussion
Our work validates the hypothesis that bigger models and more computation lead to higher model quality. This hypothesis is also supported by past advances in visual recognition tasks, shown in Figure 1, and by recent progress in other fields, such as BigGAN [5] and BERT [17]. These results suggest that further accuracy improvements on machine learning tasks may be obtained by increasing the scale of neural networks beyond the limits of accelerator memory. Moreover, the availability of bigger datasets such as JFT-300M [47] and hashtagged Instagram [35] also reduces the risk of overfitting and encourages giant networks with higher capacity.
GPipe supports models of up to 1.8 billion parameters with 8 accelerators in our experiments, inviting future research on searching for efficient network architectures with billions of parameters. As a proof of concept, we scaled up the capacity of AmoebaNet to 557 million parameters only by doubling the number of filters. This is not necessarily the most effective way to grow model size: there may be better forms of model augmentation, such as increasing the number of layers or employing more branches of transformations.
GPipe allows us to revisit some choices in network architecture design that might have been made because of limited accelerator memory. For example, one common design choice in existing image classification models is to aggressively reduce the spatial dimensions of the inputs in the first few layers: employing convolution or pooling layers with non-unity strides at the beginning greatly reduces the activation memory requirement, but some lower-level input features may be lost to these aggressive early reductions. We verified this hypothesis with a control experiment that compared aggressive reduction with delayed reduction. We reduced the stride of the first convolution layer and increased the stride of the last convolution layer in AmoebaNet-D (2, 256). As a result, the activation memory footprint increased four times while the model size stayed the same, and the ImageNet top-1 accuracy of the network improved.
GPipe can scale training by employing more accelerators without changes to the hyperparameters. It can therefore be combined with data parallelism to scale neural network training using even more accelerators in a complementary way. Pure data parallelism with stochastic gradient descent runs into model generalization issues when the global mini-batch size is extremely large; significant re-tuning and optimization are required to train on ImageNet without loss of accuracy when the global mini-batch size grows very large [20, 26].
GPipe enables pipeline parallelism for any neural network that consists of a sequence of layers. It can be further applied to other deep learning tasks, such as object detection, image segmentation, and natural language processing. The training efficiency of GPipe can be further improved by better graph partitioning algorithms.
6 Conclusion
In this work, we introduce GPipe, a scalable model parallelism library that addresses the memory bottleneck of giant neural networks, allowing researchers to explore deeper and more complex deep learning models. GPipe supports models many times larger than a single accelerator can hold, demonstrating its scalability, and achieves a near-proportional speedup with more accelerators without tuning. In all cases, it converges to the same accuracy as the sequential version without any changes to the model hyperparameters. Furthermore, we demonstrate the power of our framework by training a giant AmoebaNet model that achieves strong top-1 / top-5 ImageNet validation accuracy as well as strong CIFAR-10 and CIFAR-100 accuracy.
Acknowledgments
We wish to thank Esteban Real, Alok Aggarwal, Xiaodan Song, Naveen Kumar, Mark Heffernan, Rajat Monga, Megan Kacholia, Samy Bengio, and Jeff Dean for their support and valuable input; Patrick Nguyen, Xiaoqiang Zheng, Yonghui Wu, Noam Shazeer, Barret Zoph, Ekin Cubuk, Tianqi Chen, and Vijay Vasudevan for helpful discussions and inspirations; and the larger Google Brain team.
References
 [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for largescale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[2]
T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur.
Birdsnap: Largescale finegrained visual categorization of birds.
In
IEEE Conference on Computer Vision and Pattern Recognition
, pages 2019–2026, 2014. 
[3]
L. Bossard, M. Guillaumin, and L. J. V. Gool.
Food101  mining discriminative components with random forests.
In D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, ECCV 2014, volume 8694 of Lecture Notes in Computer Science, pages 446–461. Springer, 2014.  [4] L. Bottou. Largescale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
 [5] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
 [6] C.C. Chen, C.L. Yang, and H.Y. Cheng. Efficient and robust parallel dnn training through model parallelism on multigpu platform. arXiv preprint arXiv:1809.02839, 2018.
 [7] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
 [8] T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
 [9] X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide. Pipelined back-propagation for context-dependent deep neural networks. In Thirteenth Annual Conference of the International Speech Communication Association, 2012.
 [10] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In Advances in Neural Information Processing Systems (NIPS), pages 4467–4475, 2017.
 [11] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani. State-of-the-art speech recognition with sequence-to-sequence models. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778, 2018.
 [12] C. Coleman, D. Kang, D. Narayanan, L. Nardi, T. Zhao, J. Zhang, P. Bailis, K. Olukotun, C. Re, and M. Zaharia. Analysis of DAWNBench, a time-to-accuracy machine learning performance benchmark. arXiv preprint arXiv:1806.01427, 2018.
 [13] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
 [14] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie. Large scale fine-grained categorization and domain-specific transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 [15] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223–1231. Curran Associates, Inc., 2012.
 [16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
 [17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 [18] T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
 [19] R. Girshick. Fast R-CNN. In International Conference on Computer Vision, pages 1440–1448, 2015.
 [20] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 [21] A. Griewank and A. Walther. Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Transactions on Mathematical Software (TOMS), 26(1):19–45, 2000.
 [22] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, and P. Gibbons. PipeDream: Fast and efficient pipeline parallel DNN training. arXiv preprint arXiv:1806.03377, 2018.
 [23] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
 [24] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. CVPR, 2018.
 [25] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
 [26] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, et al. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205, 2018.
 [27] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
 [28] S. Kornblith, J. Shlens, and Q. V. Le. Do better ImageNet models transfer better? CoRR, abs/1805.08974, 2018.
 [29] J. Krause, J. Deng, M. Stark, and L. Fei-Fei. Collecting a large-scale dataset of fine-grained cars. In Second Workshop on Fine-Grained Visual Categorization (FGVC2), 2013.
 [30] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
 [31] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 2009.
 [32] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to parallel computing: design and analysis of algorithms, volume 400. Benjamin/Cummings Redwood City, 1994.
 [33] S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson, and E. P. Xing. On model parallelization and scheduling strategies for distributed machine learning. In Advances in neural information processing systems, pages 2834–2842, 2014.
 [34] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598, 2014.
 [35] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. ECCV, 2018.
 [36] S. Maji, E. Rahtu, J. Kannala, M. B. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. CoRR, abs/1306.5151, 2013.
 [37] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaev, G. Venkatesh, et al. Mixed precision training. ICLR, 2018.
 [38] J. Ngiam, D. Peng, V. Vasudevan, S. Kornblith, Q. Le, and R. Pang. Domain adaptive transfer learning. 2018.
 [39] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505, 2012.
 [40] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
 [41] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. MegDet: A large mini-batch object detector. CVPR, 7, 2017.
 [42] Y. Peng, X. He, and J. Zhao. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing, 27(3):1487–1500, 2018.
 [43] A. Petrowski, G. Dreyfus, and C. Girault. Performance analysis of a pipelined back-propagation parallel algorithm. IEEE Transactions on Neural Networks, 4(6):970–981, Nov 1993.
 [44] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 512–519, 2014.
 [45] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
 [46] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, 2017.
 [47] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 843–852. IEEE, 2017.
 [48] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
 [49] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [50] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
 [51] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.
 [52] X.-S. Wei, C.-W. Xie, J. Wu, and C. Shen. Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76:704–714, 2018.
 [53] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. Transactions of the Association for Computational Linguistics, 2017.
 [54] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
 [55] F. Yu, D. Wang, and T. Darrell. Deep layer aggregation. In IEEE Conference on Computer Vision and Pattern Recognition 2018, 2018.
 [56] X. Zhang, Z. Li, C. C. Loy, and D. Lin. PolyNet: A pursuit of structural diversity in very deep networks. In CVPR, pages 3900–3908. IEEE, 2017.
 [57] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. CVPR, 2018.