Deep neural networks have advanced many machine learning tasks, including speech recognition, visual recognition [57, 45], and language processing. Their successes have been largely due to the model’s capacity to learn complex features from vast amounts of data. Increasing the size of models has been shown to dramatically improve task performance. One of the most challenging and popular machine learning tasks is to solve the ImageNet visual recognition challenge, where researchers compete to create the most accurate model that classifies given images in the dataset. The winner of the 2014 ImageNet challenge was GoogleNet, which achieved top-1 accuracy with million parameters. The winner of the 2017 ImageNet challenge was Squeeze-and-Excitation Networks, which achieved top-1 accuracy with million parameters. This corresponds to more than a times increase in the number of parameters in the best visual recognition models, as shown in Figure 1. However, memory available on accelerators such as GPUs has only increased from 12 GB in 2014 (Nvidia K40) to 32 GB in 2018 (Nvidia V100). Hence, training even bigger neural networks can be challenging when faced with the accelerator memory limits.
There is a growing need to scale up deep neural networks. Modern machine learning datasets are growing faster than ever in terms of dataset size and quality. Image classification datasets such as OpenImages, JFT, and hashtagged Instagram contain hundreds of millions of high definition images. Higher image resolutions provide greater details of the object but consume more memory. This leads to a contention between memory allocated to model parameters and network activations, reinforcing the need to break the accelerator memory limit. The larger volume of training data helps reduce over-fitting and allows deep neural networks to grow bigger. Meanwhile, the emergence of deep learning supercomputers such as Nvidia DGX and Google TPU enables efficient parallelism by providing fast interconnections between accelerators. The memory restrictions have limited the scale of deep neural networks and confined researchers to smaller-scale problems with fewer parameters. For example, while the average ImageNet resolution is, it has been shown that increasing input image size can lead to higher accuracy. However, most current models are engineered to only use input image size or to fit within accelerator memory limits. Our work focuses on removing this limiting factor of scaling up deep neural networks.
To overcome the memory limitation, we propose to use pipeline parallelism to scale up deep neural network training. We design and implement GPipe, a distributed machine learning library that uses synchronous mini-batch gradient descent for training. GPipe partitions a model across different accelerators and automatically splits a mini-batch of training examples into smaller micro-batches. By pipelining the execution across micro-batches, accelerators can operate in parallel. In addition, GPipe automatically recomputes the forward activations during backpropagation to further reduce memory consumption. Gradients are consistently accumulated across micro-batches, so that the number of partitions does not affect the model quality. Therefore, GPipe allows researchers to easily deploy more accelerators to train larger models, and also to scale the performance without tuning hyperparameters.
GPipe maximizes memory allocation for model parameters. In experiments, we show that GPipe can support models up to times larger using accelerators without reducing the batch size. The implementation of GPipe is very efficient: with times more accelerators we can achieve a times speedup for training giant neural networks. GPipe can be combined with data parallelism  to scale training in a complementary way.
Finally, we demonstrate the empirical power of giant neural networks on image classification tasks. We increase the number of parameters for an AmoebaNet model to 557 million and train it with input image size of on ImageNet ILSVRC 2012 dataset. Our scaled-up AmoebaNet model attains top-1 / top-5 validation accuracy. To the best of our knowledge, it outperforms all other models trained from scratch on the ImageNet dataset. (Mahajan et al.’s model achieved top-1 accuracy but it was pretrained on non-public external data, namely Instagram images with hashtags.) Furthermore, we use this learned giant model as an initialization for training on seven datasets that span a wide range of tasks from general recognition to fine-grained classification. We find that giant models perform well on those datasets, obtaining results that are competitive with state-of-the-art models. For example, we push the CIFAR-10 accuracy to and CIFAR-100 accuracy to .
In summary, this paper introduces GPipe, a scalable model parallelism library for training giant deep learning models with the following key contributions:
-  It supports models up to times larger using accelerators due to recomputation and model parallelism.
-  It achieves up to times speedup with four times more accelerators using pipelining in our experiments.
-  It guarantees consistent training regardless of the number of partitions due to synchronous gradient descent.
-  It advances the performance of visual recognition tasks on multiple datasets, including pushing ImageNet top-5 accuracy to , CIFAR-10 accuracy to , and CIFAR-100 accuracy to .
2 Related Work
Deep neural networks typically consist of a sequence of layers. During training, a neural network first uses the current model parameters to compute predictions from input mini-batches in the forward pass. Then, the gradients are computed by backpropagating prediction errors (Figure LABEL:fig_forward_backward). Computing gradients in each intermediate layer requires both gradients from upper layers and the cached output activations from the forward pass. Thus, activation memory requirements typically grow in proportion to the number of layers, leaving less space for storing model parameters.
Various approaches have been studied to allow accelerators to train bigger models. They come with different trade-offs between memory, performance, and model quality. One common method is to recompute the forward pass activations during backpropagation [21, 8], which significantly reduces the memory required to cache activations. However, this method is still limited by the memory size of a single accelerator. Another approach is to swap memory between accelerators and the host. However, this approach often slows down training because of the limited communication bandwidth between the host and accelerators.
Standard parallelism techniques including data parallelism and model parallelism provide orthogonal ways to use more accelerators for training. Data parallelism effectively scales up the global mini-batch size. It lets each machine compute the gradient on a mini-batch of training examples. Each machine either synchronously or asynchronously updates the model parameters at the end of each training step [4, 34]. Data parallelism is widely used due to its simplicity and effectiveness. Because the batch size is proportional to the number of machines and different batch sizes often require different hyperparameters, scaling deep net training purely by data parallelism has become more challenging.
Model parallelism is a complementary technique to data parallelism. A naive strategy is to divide the computation into partitions and assign different partitions to different accelerators [30, 33]. This approach is straightforward when networks consist mainly of parallel branches. However, many deep learning models stack layers sequentially, presenting a challenge to parallelize computation efficiently. A naive partition strategy may result in only one accelerator active during computation, significantly underutilizing accelerator compute capacity (Figure LABEL:fig_naive_partition).
Pipelining is a common parallel algorithm that integrates model and data parallelism. Petrowski et al. explored accelerating the training of neural networks via pipelining on early parallel machines. Chen et al. used pipeline computation to approximate expensive backpropagation. Wu et al. parallelized computation of stacked recurrent neural networks on GPUs in a pipelined fashion. Recently, PipeDream introduced a pipelining approach to reduce communication overhead for synchronized training using parameter servers. However, it suffered from inconsistency and staleness issues in the backward pass, which could lead to unstable training and poor model quality. PipeDream maintained multiple versions of model parameters on the accelerator to address the consistency issue. These constraints could prevent PipeDream from scaling up to bigger models. Similarly, DualPipe optimized pipeline performance by assuming that there exists a robust way to predict future model parameters for back-propagation. Unlike these approaches, GPipe does not have any inconsistency or staleness issue. It integrates recomputation with pipeline parallelism to maximize memory and compute utilization. It offers effective and efficient synchronous training of large scale deep neural networks.
This section describes the main design features of GPipe. The library is implemented using the TensorFlow framework. The core algorithm can be implemented using other frameworks [27, 7, 40] as well. It will be open sourced in the coming months.
The caller of the GPipe interface specifies a sequential list of layers. Each layer specifies its model parameters $w_i$, its stateless forward computation function $f_i$, and an optional cost estimation function $c_i$ that estimates the static computation cost of the $i$-th layer given the shapes of all inputs to the layer. Neighboring layers can be combined into a composite layer. For example, the composite layer $p_k$ may be composed of consecutive layers from the $i$-th layer to the $j$-th layer. In this case, $p_k$’s model parameters would be the union of $w_i$, $w_{i+1}$, …, $w_j$ and its forward function would be $F_k = f_j \circ \dots \circ f_{i+1} \circ f_i$. The corresponding back-propagation function $B_k$ is derived from $F_k$ using TensorFlow’s automatic symbolic differentiation mechanism. Its cost estimator is constructed based on $c_i$, …, $c_j$.
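The composition of neighboring layers described above can be illustrated with a minimal sketch in plain Python. The `Layer` class and `compose` function are hypothetical names introduced here, not part of GPipe's interface, and summing the member costs is only one plausible way to build the composite cost estimator, since the text says only that it is constructed from the member estimators.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Layer:
    """Hypothetical stand-in for a layer: parameters, forward function, cost estimator."""
    params: List[float]
    forward: Callable[[float], float]
    cost: Callable[[Sequence[int]], float]


def compose(layers: List[Layer]) -> Layer:
    """Merge consecutive layers into a single composite layer."""
    # Composite parameters: the union (here, a concatenation) of member parameters.
    params = [p for layer in layers for p in layer.params]

    # Composite forward function: apply the member functions in order.
    def forward(x: float) -> float:
        for layer in layers:
            x = layer.forward(x)
        return x

    # Composite cost: assumed here to be the sum of member costs.
    def cost(input_shape: Sequence[int]) -> float:
        return sum(layer.cost(input_shape) for layer in layers)

    return Layer(params, forward, cost)


# Two toy layers: multiply by 2, then add 1.
double = Layer([2.0], lambda x: 2.0 * x, lambda shape: 1.0)
add_one = Layer([1.0], lambda x: x + 1.0, lambda shape: 0.5)
block = compose([double, add_one])

print(block.forward(3.0))  # 2 * 3 + 1
print(block.params)
print(block.cost((1,)))
```

In a real framework the backward function of the composite would come from automatic differentiation of the composed forward function rather than being written by hand.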
After users define their network layers in terms of model parameters $w_i$, cost estimation function $c_i$, and forward computation function $f_i$, GPipe partitions the network into $K$ composite layers and places the $k$-th composite layer onto the $k$-th accelerator, where $K$ is the number of partitions users specified. Communication primitives are automatically inserted by GPipe at the partition boundaries to allow data exchange between neighboring partitions. The partitioning algorithm is heuristic-based. It simply minimizes the variance of each composite layer’s estimated cost. We expect that better partitioning algorithms can potentially improve the performance of GPipe.
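As an illustration of the variance-minimizing objective, the sketch below searches all ways to cut a sequence of per-layer cost estimates into contiguous groups and keeps the split whose group costs have the smallest variance. This is an assumption-laden toy, not GPipe's actual heuristic: `partition_layers` is a hypothetical name, the costs are made up, and exhaustive search is only practical for small layer counts.

```python
from itertools import combinations
from statistics import pvariance


def partition_layers(costs, num_partitions):
    """Split layers into contiguous groups whose total costs have minimal variance."""
    n = len(costs)
    if not 1 <= num_partitions <= n:
        raise ValueError("need 1 <= num_partitions <= number of layers")

    best_bounds, best_var = None, float("inf")
    # Choose K-1 cut points between layers and score each candidate split.
    for cuts in combinations(range(1, n), num_partitions - 1):
        bounds = (0, *cuts, n)
        group_costs = [sum(costs[a:b]) for a, b in zip(bounds, bounds[1:])]
        var = pvariance(group_costs)
        if var < best_var:
            best_bounds, best_var = bounds, var

    # Return the layer indices assigned to each partition.
    return [list(range(a, b)) for a, b in zip(best_bounds, best_bounds[1:])]


# Hypothetical per-layer cost estimates that grow with depth.
layer_costs = [1.0, 1.0, 2.0, 2.0, 4.0, 4.0]
print(partition_layers(layer_costs, 2))
```

With these costs the most balanced two-way split puts the first four layers (total cost 6) on one accelerator and the last two (total cost 8) on the other.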
During training, GPipe first divides a mini-batch of size $N$ into $T$ micro-batches at the first layer. Each micro-batch contains $N/T$ examples. For instance, an image input tensor with shape [$N$, $H$, $W$, $C$] is reshaped into [$T$, $N/T$, $H$, $W$, $C$]. During the forward pass (Figure LABEL:fig_pipeline_partition), the $k$-th accelerator starts to compute $F_{k,t}$ as soon as it finishes the $(t-1)$-th micro-batch and receives inputs from $F_{k-1,t}$. At the same time, the $(k+1)$-th accelerator can start to compute $F_{k+1,t-1}$. Each accelerator repeats this process $T$ times to finish the forward pass of the whole mini-batch. There is still up to $O(K-1)$ idle time per accelerator, which we refer to as bubble overhead, as depicted in Figure LABEL:fig_pipeline_partition. This bubble time is $O\left(\frac{K-1}{T+K-1}\right)$ when amortized over the number of micro-batches $T$. The last accelerator is also responsible for concatenating the outputs across micro-steps and computing the final loss.
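The micro-batch split and the resulting forward-pass schedule can be sketched in plain Python. The function names are hypothetical, and the schedule assumes an idealized pipeline in which every partition takes exactly one time step per micro-batch, so accelerator k works on micro-batch t at step t + k (zero-indexed).

```python
def split_minibatch(examples, num_micro_batches):
    """Split a mini-batch of N examples into T equally sized micro-batches."""
    n = len(examples)
    if n % num_micro_batches != 0:
        raise ValueError("mini-batch size must be divisible by the number of micro-batches")
    size = n // num_micro_batches
    return [examples[i * size:(i + 1) * size] for i in range(num_micro_batches)]


def forward_schedule(num_partitions, num_micro_batches):
    """schedule[step][k] is the micro-batch run by accelerator k at that step, or None if idle."""
    total_steps = num_micro_batches + num_partitions - 1
    schedule = []
    for step in range(total_steps):
        row = []
        for k in range(num_partitions):
            t = step - k  # accelerator k lags accelerator 0 by k steps
            row.append(t if 0 <= t < num_micro_batches else None)
        schedule.append(row)
    return schedule


micro_batches = split_minibatch(list(range(8)), num_micro_batches=4)
schedule = forward_schedule(num_partitions=3, num_micro_batches=4)
idle_steps = [sum(row[k] is None for row in schedule) for k in range(3)]

print(micro_batches)
for row in schedule:
    print(row)
print(idle_steps)
```

Under these assumptions the forward pass takes T + K - 1 steps, and each accelerator is idle for K - 1 of them, which is the bubble discussed above.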
During the backward pass, gradients for each micro-batch are computed based on the same model parameters as the forward pass. Gradients are applied to update model parameters across accelerators only at the end of each mini-batch. Therefore, GPipe maintains the same synchronous nature of gradient descent, independent of the number of partitions. This is important because deep learning training is sensitive to hyperparameters such as learning rate schedules and dropout probabilities. This guarantee frees researchers from the time-consuming process of re-tuning hyperparameters.
The computation of the backward pass at layer $i$ requires both the upper-layer gradients and the cached activations of layer $i$. Therefore, the total cached activations need $O(N \times L)$ space without optimization, where $N$ is the mini-batch size and $L$ is the number of layers in the network. In order to reduce activation memory requirements, GPipe recomputes the forward passes. Each accelerator only stores output activations at the partition boundaries, rather than activations of all intermediate layers within the partition. During the backward pass, the $k$-th accelerator recomputes the composite forward function $F_k$ and requires only the cached activations at the partition boundaries. As a result, the size of peak activation memory reduces to $O\left(N + \frac{L}{K} \times \frac{N}{T}\right)$, where $\frac{N}{T}$ is the micro-batch size and $\frac{L}{K}$ is the number of layers in one partition.
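To give a feel for the savings, the following sketch compares the two activation-memory estimates under stated assumptions: memory is counted in abstract units of one layer's activations for one example, constant factors are ignored, the unoptimized cost is taken as N × L, and the cost with boundary caching plus recomputation is taken as N + (L/K) × (N/T). The function names and the sample sizes are hypothetical.

```python
def activation_memory_without_recompute(batch_size, num_layers):
    """Every layer caches activations for every example: N * L units."""
    return batch_size * num_layers


def activation_memory_with_recompute(batch_size, num_layers,
                                     num_partitions, num_micro_batches):
    """Boundary activations for the mini-batch plus one micro-batch inside one partition."""
    layers_per_partition = num_layers / num_partitions
    micro_batch_size = batch_size / num_micro_batches
    return batch_size + layers_per_partition * micro_batch_size


# Hypothetical sizes: N = 128 examples, L = 48 layers, K = 4 partitions, T = 16 micro-batches.
baseline = activation_memory_without_recompute(128, 48)
pipelined = activation_memory_with_recompute(128, 48, 4, 16)
print(baseline, pipelined, baseline / pipelined)
```

The numbers are only indicative, since real layers differ widely in activation size, but they show how both more partitions and more micro-batches shrink the second term.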
As depicted in Figure LABEL:fig_pipeline_partition, the aggregation of the loss during the forward pass introduces a bubble of idleness between the forward and backward passes. The bubble is amortized over the number of micro-steps $T$. In our experiments, we found that the bubble overhead was quite small. This is partly because recomputation during the backward pass can be scheduled earlier without waiting for gradients from earlier layers. Figure LABEL:fig_pipeline_partition assumes partitions are evenly balanced. However, memory requirements and computation flops at different layers are often quite imbalanced. For example, in many modern image models, such as ResNet, Inception, NasNets, and AmoebaNets, the number of convolution filters doubles every time the spatial dimensions of the activation tensors are reduced. The activation memory footprint per layer decreases linearly at later layers while the number of model parameters per layer increases quadratically. Therefore, imperfect partitioning algorithms will lead to load imbalance when partitioning those layers. Better partitioning algorithms can potentially improve the performance over our heuristic approach.
This section provides a detailed analysis of the scalability and performance of GPipe. We evaluated ResNet and AmoebaNet in the experiments. ResNet is a representative neural network for image classification. AmoebaNet was the previous state-of-the-art image model on ImageNet. Both networks allowed us to increase the model size by changing the number of layers or the number of filters. We ran the experiments on TPU-v2s, each of which has 8 accelerator cores and 64 GB memory (8 GB per accelerator).
|AmoebaNet-D (L, F)||(6, 208)||(6, 416)||(6, 544)||(12, 544)||(24, 512)|
|# of Model Parameters||82M||318M||542M||1.05B||1.8B|
|Total Peak Model Parameter Memory||1.05GB||3.8GB||6.45GB||12.53GB||24.62GB|
|Total Peak Activation Memory||6.26GB||3.46GB||8.11GB||15.21GB||26.24GB|
GPipe uses recomputation and pipeline parallelism for better memory utilization. We expect that both methods can enable bigger models, which we verified experimentally in this section. To do this, we fixed the input image size at and the mini-batch size at . We studied the effect of each method on the maximum AmoebaNet model size that would fit with accelerators, . An AmoebaNet model consists of a sequence of two repeated layer modules called the normal cell and the reduction cell. The normal cell preserves the input activation size. The reduction cell reduces the spatial dimension of the activation but increases the activation filter size. The capacity of an AmoebaNet is configured by two hyperparameters, $L$ and $F$. $L$ defines the number of normal cells stacked between reduction cells and $F$ specifies the number of filters in the first normal cell. We increased $L$ and $F$ until we reached the limits of accelerator memory. We then compared training a model with and without GPipe on a single accelerator to understand the benefits that GPipe introduces. We also partitioned AmoebaNet across different numbers of accelerators to study the payoff of pipeline parallelism. We reported the maximum model size, total peak activation memory, and total peak model parameter memory across accelerators under different scenarios in Table 1.
First, we found that GPipe enabled times bigger models on a single accelerator. Without recomputation, a single accelerator can train up to 82 million model parameters due to memory limits (Table 1). Recomputation and mini-batch splitting reduced activation memory from 6.26 GB to 3.46 GB, enabling 318 million parameters on a single accelerator. For each model parameter, GPipe consumed 12 bytes: the parameter itself, its moving average, and its momentum each consume one single-precision float.
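The 12-bytes-per-parameter figure makes the parameter memory easy to estimate. The sketch below assumes 4 bytes per single-precision float, three stored values per parameter as stated above, and decimal gigabytes (10^9 bytes); `parameter_memory_gb` is a hypothetical helper name.

```python
BYTES_PER_FLOAT32 = 4
VALUES_PER_PARAMETER = 3  # the weight, its moving average, and its momentum


def parameter_memory_gb(num_parameters):
    """Estimated parameter memory in decimal gigabytes."""
    return num_parameters * BYTES_PER_FLOAT32 * VALUES_PER_PARAMETER / 1e9


for count in (318_000_000, 542_000_000, 1_050_000_000):
    print(count, round(parameter_memory_gb(count), 2))
```

For 318 million parameters this gives roughly 3.8 GB, in line with the corresponding entry of Table 1. The other entries agree only approximately, which is expected given that the parameter counts in the table are rounded.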
Second, we saw that with pipeline parallelism the maximum model size was proportional to the number of partitions, as expected. GPipe was capable of enabling AmoebaNet with 1.8 billion parameters across accelerators, a times increase compared to that on a single accelerator. In total, GPipe supported models that are times bigger using accelerators in this experiment. The maximum model size was not a perfect linear function of the number of partitions because of the non-uniform distribution of model parameters over layers in AmoebaNet. This made it challenging to distribute layers evenly across multiple accelerators. With improvements from the partitioning algorithms, GPipe would be capable of allocating even larger models.
In this section, we evaluated various factors that trade off GPipe performance for better memory utilization. For example, recomputation of forward passes reduces activation memory but inevitably introduces computation overhead. Pipeline parallelism partitions networks across accelerators, but it can have overheads such as imbalanced workload and bubbles of idleness. It also requires setup time to divide and reshape the inputs. In our experiments, we measured the effects of pipeline parallelism and recomputation on the model throughput of ResNet-101 and AmoebaNet-D (4, 512). We fixed the image size at . We adjusted the mini-batch size to maximize the throughput. To isolate the effects of pipeline parallelism, we used $K$ accelerators to train a model with $K$ partitions. Since training AmoebaNet-D (4, 512) requires at least two accelerators, we reported the speedup with respect to the no-pipelining case with two partitions in Figure LABEL:fig_amoebanet_speedup. We reported the speedup of ResNet-101 with respect to the sequential case without recomputation in Figure LABEL:fig_resnet_speedup. To assess the overhead cost, we carefully studied the trace files from ResNet-101 runs to identify key factors that affect performance. We also examined how the effects of these factors change with the number of partitions in Figure LABEL:fig_step_time_2 and LABEL:fig_step_time_4.
We observed that the benefits of pipeline parallelism outweigh the performance overhead introduced. We saw an almost linear speed up in training AmoebaNet-D (4, 512). Compared to the naive approach with two partitions, distributing AmoebaNet-D (4, 512) across four times more accelerators achieved times speedup. ResNet-101 is a relatively smaller model that doesn’t need model parallelism for training. But it allowed us to analyze system performance easily. The relative throughput of ResNet-101 using GPipe with one partition is . Recomputation thus introduced about overhead. As ResNet-101 was distributed across more accelerators, performance increased. It achieved about times speedup with accelerators. In both examples, GPipe provided a way to increase throughput using more accelerators, complementary to the traditional data parallelism approach.
To study opportunities for future performance improvements, we identified key factors that would affect GPipe performance. We measured the times spent on different activities listed in Figure LABEL:fig_step_time_2. We showed the distributions of these times for ResNet-101 with 2 and 4 partitions in Figure LABEL:fig_step_time_2 and LABEL:fig_step_time_4, respectively. We found that recomputation time was the main contributor to GPipe overhead, taking up to of the total step time. Another source of overhead was load imbalance. With two partitions, it was only , but with four partitions, it rose up to overhead. Load balancing becomes increasingly difficult with more partitions in the network. Thus, finding a good partitioning algorithm can help reduce this overhead in general. The theoretical bubble overhead is $O\left(\frac{K-1}{T+K-1}\right)$, where $K$ is the number of partitions and $T$ is the number of micro-batches in each mini-batch. The observed bubble overhead was slightly lower than the theoretical value, partly because recomputation was scheduled early to overlap with the bubble. Weight update time for gradient aggregation at the end of the pipeline was also small thanks to high-speed interconnections between the accelerators.
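To see how the theoretical bubble shrinks as micro-batches are added, the sketch below evaluates the idle fraction for a few settings. It assumes the idealized count in which a pass over T micro-batches and K equally loaded partitions takes T + K - 1 steps, of which each accelerator is idle for K - 1; `bubble_fraction` is a hypothetical helper name.

```python
from fractions import Fraction


def bubble_fraction(num_partitions, num_micro_batches):
    """Idle fraction per accelerator in an idealized K-stage, T-micro-batch pipeline."""
    if num_partitions < 1 or num_micro_batches < 1:
        raise ValueError("both arguments must be positive")
    return Fraction(num_partitions - 1, num_micro_batches + num_partitions - 1)


for t in (1, 4, 16, 32):
    print(t, float(bubble_fraction(4, t)))
```

With four partitions the idle fraction falls from three quarters with a single micro-batch to under a tenth with 32, which illustrates why the bubble is described as amortized over the number of micro-batches.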
|Model||Image Size||# Parameters||Top-1 Accuracy (%)||Top-5 Accuracy (%)|
|AmoebaNet-C (6, 228)|| ||155.3M||83.5||96.5|
|AmoebaNet-B (6, 512)|| ||557M|| || |
4.3 Model quality
4.3.1 Consistent Training
GPipe performs synchronous training over the micro-batches. In this section, we verified the hypothesis that the end-to-end convergence accuracy using GPipe is the same within statistical errors, regardless of the number of partitions. We trained AmoebaNet-D (2, 128) several times for epochs and measured the final validation accuracy on ImageNet. We chose AmoebaNet-D (2, 128) since it was the winning image model by training cost in the DAWNBench competition. We adopted the same hyperparameters and training procedure reported in DAWNBench (https://github.com/stanford-futuredata/dawn-bench-entries/blob/master/ImageNet/train/google_amoeba_net_d_tpu_tensorflow18.json). As a baseline, we trained AmoebaNet-D (2, 128) times using the official open source implementation and computed the mean and standard deviation of the final accuracy. Using the same hyperparameters and training procedures, we trained the same network using GPipe with and partitions. We found that the resulting accuracy fell within two standard deviations from the mean, as expected.
4.3.2 Scaling up Giant Models
We verified the hypothesis that scaling up existing neural networks can achieve even better model quality. As a proof of concept, we trained an AmoebaNet-B (6, 512) with 557 million model parameters and input image size of on the ImageNet ILSVRC-2012 dataset. We followed the same hyperparameters and input pre-processing as described in to train AmoebaNet-B (6, 512). We employed the RMSProp optimizer with a decay of and , regularization , label smoothing coefficient and an auxiliary head with weight . We applied the same drop-path schedule to intermediate layers as in NasNet, and dropout to the final layer with probability . We used a learning rate schedule that decays every epochs at a rate of with an initial learning rate of times the batch size. The network was divided into 4 partitions, and we performed training using both model and data parallelism. We adopted mixed precision training where activations are represented in half precision. Unlike other mixed precision training strategies, we didn’t scale the loss values thanks to the wide dynamic range of bfloat16 on TPUs. We used the ImageNet ILSVRC-2012 dataset for training and reported the validation accuracy in Table 2. This giant model reached top-1 / top-5 validation accuracy with single-crop.
4.3.3 Transfer Learning
Large neural networks are not only applicable to datasets like ImageNet, but also relevant for other datasets through transfer learning [44, 19, 46]. One successful approach to transfer learning is to use ImageNet pre-trained models as initialization for training on a target dataset. In this section, we evaluate the transfer learning performance of the best giant model found in Section 4.3.2 that achieved top-1 accuracy on ImageNet.
We ran transfer learning experiments on the following datasets: CIFAR-10, CIFAR-100 , Birdsnap , Stanford Cars , FGVC Aircraft , Oxford-IIIT Pets , and Food-101 . This spanned a range of tasks from general object recognition to fine-grained classification.
We trained an AmoebaNet-B (6, 512) model for each of these datasets. We changed the number of output units in the last softmax classification layer to the number of classes in the target dataset. This softmax layer was initialized randomly, while all other layers were initialized with the best parameters trained on ImageNet. We selected the learning rate and weight regularization parameters for each dataset on a hold-out subset of the training dataset. For other hyperparameters we used the same ones as in ImageNet training. We adopted the image pre-processing procedure that is widely used for training on CIFAR datasets. In all our transfer learning experiments, input images to the network during training were resized to , horizontally flipped randomly and augmented using cutout. We trained the models for steps using stochastic gradient descent with momentum. Each mini-batch contained 256 examples. We reported the averaged single-crop accuracy on test sets across 5 fine-tuning runs for each dataset.
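The augmentation steps named above (random horizontal flip followed by cutout) can be sketched on a toy single-channel image stored as nested lists. This is an illustrative reading of those operations, not the pipeline used in the experiments: the function names, the patch size, and the choice to fill the patch with zeros are assumptions, and the resize step is omitted.

```python
import random


def random_flip(image, rng):
    """Flip the image left-to-right with probability 0.5; returns a new image."""
    if rng.random() < 0.5:
        return [row[::-1] for row in image]
    return [row[:] for row in image]


def cutout(image, size, rng):
    """Zero out a square patch centered at a random pixel, clipped at the borders."""
    height, width = len(image), len(image[0])
    cy, cx = rng.randrange(height), rng.randrange(width)
    top, bottom = max(0, cy - size // 2), min(height, cy + size // 2)
    left, right = max(0, cx - size // 2), min(width, cx + size // 2)
    out = [row[:] for row in image]
    for r in range(top, bottom):
        for c in range(left, right):
            out[r][c] = 0
    return out


rng = random.Random(0)                   # seeded for repeatability
image = [[1] * 8 for _ in range(8)]      # toy 8x8 image of ones
augmented = cutout(random_flip(image, rng), size=4, rng=rng)
num_masked = sum(pixel == 0 for row in augmented for pixel in row)

for row in augmented:
    print(row)
```

Because the patch center may lie near an edge, the masked region can be smaller than the nominal patch, which is how cutout is commonly implemented.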
|Dataset||# Training Examples||# Test Examples||# Classes||Our Model Accuracy (%)||Previously Reported Result (%)|
We found that our giant models performed well on the target datasets, obtaining results that were competitive with state-of-the-art models in Table 3. For example, they reduced CIFAR-10 error rate to and CIFAR-100 error rate to . These results corroborated Kornblith et al.’s findings that ImageNet performance correlated well with transfer learning performance.
Our work validates the hypothesis that bigger models and more computation would lead to higher model quality. This hypothesis is also supported by past advances in visual recognition tasks shown in Figure 1 and the recent progress in other fields such as BigGAN and BERT. Those results suggest that accuracy improvements of machine learning tasks may be obtained by further increases in the scale of neural networks beyond the limits of accelerator memory. Moreover, the availability of bigger datasets such as JFT-300M and hashtagged Instagram also reduces the risk of over-fitting and encourages giant networks with higher capacity.
GPipe supports models up to 1.8 billion parameters with accelerators in our experiments, inviting future research on searching for efficient network architectures with billions of parameters. As a proof of concept, we only scaled up the capacity of AmoebaNet to 557 million parameters by doubling the number of filters in our experiments. This does not mean that it is the most effective way to grow the model size. There might exist better ways for model augmentation, like increasing the number of layers or employing more branches of transformations.
GPipe allows us to revisit some of the choices in network architecture design that might have been made due to limited accelerator memory. For example, one of the design choices of existing image classification models is to aggressively reduce the spatial dimensions of inputs at the first few layers. Employing convolution or pooling layers with non-unity stride values at the beginning greatly reduces the activation memory requirement. Some lower-level input features might be omitted because of the aggressive early reductions. We verified this hypothesis by running a control experiment that compared aggressive reduction with delayed reduction. We reduced the stride value of the first convolution layer and increased the stride value at the last convolution layer on AmoebaNet-D (2, 256). As a result, the activation memory footprint increased four times but the model size stayed the same. This change improved the ImageNet top-1 accuracy of the network from to .
GPipe can scale training by employing even more accelerators without changes in the hyperparameters. Therefore, it can be combined with data parallelism to scale neural network training using even more accelerators in a complementary way. Pure data parallelism with stochastic gradient descent runs into inferior model generalization issues when the size of the global mini-batch is extremely large. Significant re-tuning and optimization is required to train on ImageNet without loss of accuracy when the global mini-batch size is greater than [20, 26].
GPipe enables pipeline parallelism for any neural network that consists of a sequence of layers. It can be further applied to more deep learning tasks such as object detection, image segmentation, and natural language processing. The training efficiency of GPipe can be further improved by better graph partition algorithms.
In this work, we introduce GPipe, a scalable model parallelism library that addresses the memory bottleneck for giant neural networks. It allows researchers to explore deeper and more complex deep learning models. For example, GPipe supports models up to times larger with accelerators, demonstrating its scalability. Moreover, it can achieve a times speedup with times more accelerators without tuning. In all cases, it converges to the same accuracy as the sequential version without any changes to the model hyperparameters. Furthermore, we demonstrate the power of our framework by training a giant AmoebaNet model that achieves top-1 / top-5 ImageNet validation accuracy, CIFAR-10 accuracy, and CIFAR-100 accuracy.
We wish to thank Esteban Real, Alok Aggarwal, Xiaodan Song, Naveen Kumar, Mark Heffernan, Rajat Monga, Megan Kacholia, Samy Bengio, and Jeff Dean for their support and valuable input; Patrick Nguyen, Xiaoqiang Zheng, Yonghui Wu, Noam Shazeer, Barret Zoph, Ekin Cubuk, Tianqi Chen, and Vijay Vasudevan for helpful discussions and inspirations; and the larger Google Brain team.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
-  T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In , pages 2019–2026, 2014.
-  L. Bossard, M. Guillaumin, and L. J. V. Gool. Food-101 - mining discriminative components with random forests. In D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, ECCV 2014, volume 8694 of Lecture Notes in Computer Science, pages 446–461. Springer, 2014.
-  L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
-  A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
-  C.-C. Chen, C.-L. Yang, and H.-Y. Cheng. Efficient and robust parallel dnn training through model parallelism on multi-gpu platform. arXiv preprint arXiv:1809.02839, 2018.
-  T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
-  T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
-  X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide. Pipelined back-propagation for context-dependent deep neural networks. In Thirteenth Annual Conference of the International Speech Communication Association, 2012.
-  Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In Advances in Neural Information Processing Systems (NIPS), pages 4467–4475, 2017.
-  C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani. State-of-the-art speech recognition with sequence-to-sequence models. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778, 2018.
-  C. Coleman, D. Kang, D. Narayanan, L. Nardi, T. Zhao, J. Zhang, P. Bailis, K. Olukotun, C. Re, and M. Zaharia. Analysis of dawnbench, a time-to-accuracy machine learning performance benchmark. arXiv preprint arXiv:1806.01427, 2018.
-  E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
-  Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie. Large scale fine-grained categorization and domain-specific transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223–1231. Curran Associates, Inc., 2012.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
-  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
-  T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
-  R. Girshick. Fast r-cnn. In International Conference on Computer Vision, pages 1440–1448, 2015.
-  P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
-  A. Griewank and A. Walther. Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Transactions on Mathematical Software (TOMS), 26(1):19–45, 2000.
-  A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, and P. Gibbons. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. CVPR, 2018.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
-  X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, et al. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. arXiv preprint arXiv:1807.11205, 2018.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
-  S. Kornblith, J. Shlens, and Q. V. Le. Do better imagenet models transfer better? CoRR, abs/1805.08974, 2018.
-  J. Krause, J. Deng, M. Stark, and L. Fei-Fei. Collecting a large-scale dataset of fine-grained cars. In Second Workshop on Fine-Grained Visual Categorization (FGVC2), 2013.
-  A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 2009.
-  V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to parallel computing: design and analysis of algorithms, volume 400. Benjamin/Cummings Redwood City, 1994.
-  S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson, and E. P. Xing. On model parallelization and scheduling strategies for distributed machine learning. In Advances in neural information processing systems, pages 2834–2842, 2014.
-  M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598, 2014.
-  D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. ECCV, 2018.
-  S. Maji, E. Rahtu, J. Kannala, M. B. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. CoRR, abs/1306.5151, 2013.
-  P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaev, G. Venkatesh, et al. Mixed precision training. ICLR, 2018.
-  J. Ngiam, D. Peng, V. Vasudevan, S. Kornblith, Q. Le, and R. Pang. Domain adaptive transfer learning. 2018.
-  O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3498–3505, 2012.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
-  C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. Megdet: A large mini-batch object detector. CVPR, 7, 2017.
-  Y. Peng, X. He, and J. Zhao. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing, 27(3):1487–1500, 2018.
-  A. Petrowski, G. Dreyfus, and C. Girault. Performance analysis of a pipelined backpropagation parallel algorithm. IEEE Transactions on Neural Networks, 4(6):970–981, Nov 1993.
-  A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 512–519, 2014.
-  E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
-  E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, 2017.
-  C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 843–852. IEEE, 2017.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, 2017.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
-  L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.
-  X.-S. Wei, C.-W. Xie, J. Wu, and C. Shen. Mask-cnn: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76:704–714, 2018.
-  Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. Transactions of the Association for Computational Linguistics, 2017.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
-  F. Yu, D. Wang, and T. Darrell. Deep layer aggregation. In IEEE Conference on Computer Vision and Pattern Recognition 2018, 2018.
-  X. Zhang, Z. Li, C. C. Loy, and D. Lin. Polynet: A pursuit of structural diversity in very deep networks. In CVPR, pages 3900–3908. IEEE, 2017.
-  B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. CVPR, 2018.