XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training

10/24/2019 · Lei Guan, et al.

We propose XPipe, an efficient asynchronous pipeline model parallelism approach for multi-GPU DNN training. XPipe is designed to make use of multiple GPUs to concurrently and continuously train different parts of a DNN model. To improve GPU utilization and achieve high throughput, it splits a mini-batch into a set of micro-batches and allows the overlapping of the pipelines of multiple micro-batches, including those belonging to different mini-batches. Most importantly, the weight prediction strategy adopted by XPipe enables it to effectively address the weight inconsistency and staleness issues incurred by the asynchronous pipeline parallelism. As a result, XPipe incorporates the advantages of both synchronous and asynchronous pipeline parallelism approaches. It can achieve high throughput while obtaining very comparable (even slightly better) model quality as the synchronous counterpart. Experimental results show that XPipe outperforms other existing synchronous and asynchronous model parallelism approaches.


I Introduction

Deep Neural Networks (DNNs) have been recognized as one of the most effective tools for many machine learning tasks, including image and video analysis, language translation, and speech recognition [26, 13, 17, 11]. Training a DNN model, however, often takes hours, days, or even weeks. The long training time is mainly because training involves a huge amount of data and a massive number of model parameters (also known as weights) [31, 24].

In the past few years, there has been a growing need to quickly scale up DNNs. One reason is that the image datasets to which we apply DNNs contain more images at higher resolutions, for example, the JFT [7] and OpenImages [15] datasets. The need also arises when a DNN is used to simultaneously recognize more classes of subjects or objects [28], which requires many more layers and weights. Such increases inevitably create a higher demand on the memory of the training devices and on the training throughput. Sometimes, breaking a model into pieces and training them on multiple GPUs may be the only choice for training a neural network with a huge number of parameters.

Data parallelism is currently the most commonly used approach for utilizing multiple GPU devices to speed up DNN training. In data parallelism, each GPU holds a full copy of the DNN weights and is assigned a subset of the training data. Weight updates happen only when the gradients on all GPUs have been aggregated. Another, orthogonal approach is model parallelism [1, 10, 9], where the DNN structure is divided into subsets of layers and each GPU only keeps a part of the DNN model. The naive model parallelism strategy is to divide the DNN into a set of stages (each including one or more consecutive layers) and assign each stage to a GPU [18]. Each GPU computes and transmits activations to the next GPU in the forward direction (unless it owns the last layer), and computes and transmits gradients to the previous GPU in the backward direction (unless it keeps the first layer). The inter-GPU communication overhead in model parallelism can be much smaller than that in data parallelism. However, the naive approach always works serially: in each feedforward-backpropagation round, after a GPU completes its forward step, it waits until all its subsequent GPUs finish their forward and backward steps before it starts its backward step. As a result, the GPUs are active sequentially, one per pipeline unit, causing serious under-utilization of the GPUs.

To this end, we propose XPipe, an efficient asynchronous pipeline model parallelism method. This work is motivated by the state-of-the-art synchronous pipeline approach GPipe [8] as well as the asynchronous pipeline approaches PipeDream [5] and SpecTrain [2], which are reviewed in detail in the next section. XPipe inherits the pipeline structure of PipeDream and SpecTrain but uses a micro-batch as the basic processing unit and adopts a more efficient strategy to address the weight inconsistency and staleness issues incurred by asynchronous pipeline parallelism. In addition, adopting fine-grained micro-batches makes it easy for XPipe to scale up the mini-batch size. On the other hand, although both XPipe and GPipe introduce micro-batches into pipeline training, XPipe allows the cross-training of micro-batches from different mini-batches, giving rise to better GPU utilization and higher throughput than GPipe. In summary, XPipe incorporates the advantages of both synchronous and asynchronous pipeline model parallelism approaches. It provides high throughput, scales up to large batch sizes easily, and incurs almost no accuracy drop.

We evaluated XPipe using three popular Convolutional Neural Network (CNN) models on two different image datasets. The experimental results, reported in detail below, demonstrate the effectiveness of our proposal. In comparison to PipeDream and SpecTrain, XPipe effectively alleviates the accuracy drop and achieves model quality very comparable to (even slightly better than) that of GPipe. At the same time, XPipe obtains consistently higher throughput than GPipe regardless of the number of mini-batch partitions. For example, for training Inception-V3 on Tiny ImageNet, XPipe provides an average of 20.0% (up to 31.9%) and 88.1% (up to 150.8%) throughput improvement over GPipe on 2-GPU and 4-GPU systems, respectively.

II Related Work

Pipelining has been widely applied to accelerate neural network training [21, 12, 3, 22]. Pipeline model parallelism has recently been proposed to efficiently speed up DNN training. According to the way the weights are updated, existing pipeline model parallelism approaches can be roughly classified into two categories: synchronous pipeline model parallelism and asynchronous pipeline model parallelism.

Synchronous pipeline model parallelism. The state-of-the-art synchronous pipeline model parallelism approach is GPipe [8], which was proposed to address the low GPU utilization of the naive model parallelism strategy and to overcome the memory limitations of scaling up DNNs. The noteworthy feature of GPipe is that it first splits a mini-batch into a set of smaller micro-batches, so that the training has a finer data unit; each mini-batch is trained equivalently through the training of its set of micro-batches. Introducing micro-batches into pipeline training makes GPipe very good at scaling up the mini-batch size. More importantly, GPipe trains each set of micro-batches in a pipelined manner, which, to some extent, allows multiple GPUs to train concurrently. In this way, GPU utilization is significantly improved compared to the naive model parallelism strategy. Meanwhile, GPipe is a synchronous-parallel approach and thus trains DNNs without degrading their model quality. However, since the micro-batches from the same mini-batch flow through all the GPUs sequentially, GPipe is unable to keep every GPU busy training the model at all times and thus still suffers from a load imbalance problem.

Asynchronous pipeline model parallelism. Asynchronous model-parallel (AMP) training [4] was also proposed to overcome the low device utilization of the naive model parallelism. AMP training allows asynchronous (and thus faster) weight updates as long as enough gradients are accumulated. However, AMP faces serious weight inconsistency and staleness issues due to the cross-training of multiple mini-batches. Besides that, Harlap et al. [5] proposed another asynchronous pipeline parallel approach called PipeDream. Similar to AMP training, PipeDream lets multiple workers process concurrently by simultaneously training multiple mini-batches in the pipeline. To address the weight inconsistency issue incurred by the cross-training of multiple mini-batches, PipeDream keeps a copy of the weights for each mini-batch active in the pipeline. However, keeping these weight copies wastes GPU memory, especially for DNNs with a massive number of model parameters. On the other hand, PipeDream suffers from a staleness problem because it uses different versions of the weights within one feedforward-backpropagation round [2]. The staleness issue slows down convergence and degrades the model quality as well. To simultaneously alleviate the inconsistency and staleness issues in asynchronous pipeline model parallelism, Chen et al. [2] proposed SpecTrain. It adopts the same pipeline structure as PipeDream and enables the cross-training of multiple mini-batches, thus achieving high GPU utilization. Instead of storing the weights for each active mini-batch in the pipeline, SpecTrain addresses the weight inconsistency and staleness issues through weight prediction. Based on the observation that the smoothed gradients used in Momentum SGD [23] reflect the trend of weight updates, SpecTrain uses, in both the forward and backward passes, the smoothed gradient times the weight version difference to predict the future weights. However, as shown in the experiments later, SpecTrain is still unable to fully solve the inconsistency and staleness issues and often incurs an accuracy drop.

III Method

(a) XPipe (T = 2)
(b) XPipe (T = 4)
Fig. 1: Illustration of the XPipe workflow on the 4-GPU system. Top: XPipe with T = 2; bottom: XPipe with T = 4.

III-A Workflow

In XPipe, each mini-batch of size B is split into T smaller micro-batches. Thus a micro-batch of size B/T becomes the basic data processing unit throughout the pipeline training. Figures 1(a) and 1(b) illustrate the workflow of XPipe on the 4-GPU system with T = 2 and T = 4, respectively. The number inside each box refers to the index of the micro-batch whose forward or backward pass is being performed. The white boxes denote forward passes; grey boxes indicate backward passes; orange boxes refer to the backward passes of the last (T-th) micro-batch of each mini-batch, at the end of which the weights are updated. The grey dashed lines with arrows in Figure 1(a) depict the round trip of processing the third mini-batch (i.e., micro-batches 5 and 6); the grey dashed lines with arrows in Figure 1(b) depict the round trip of processing the second mini-batch (i.e., micro-batches 5, 6, 7 and 8). In the workflow of XPipe, each mini-batch is trained equivalently through the training of its T micro-batches. For example, in Figure 1(a), micro-batches 1 and 2 correspond to mini-batch 1, and so on. Similarly, for XPipe with T = 4 (as shown in Figure 1(b)), micro-batches 1 to 4 correspond to mini-batch 1, and so on. The red arrowed lines in Figures 1(a) and 1(b) depict the weight prediction, which is described in detail later.

One noteworthy feature of XPipe is that the micro-batches corresponding to the same mini-batch share the same weights in their forward and backward passes. A weight update does not happen instantly when a micro-batch completes its backward pass. Instead, during the backward passes the gradients are accumulated and applied to update the model parameters only when the T-th (last) micro-batch of the mini-batch completes its backward pass (as shown by the orange boxes in Figure 1).
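
To make this accumulate-then-update rule concrete, below is a minimal PyTorch-style sketch (our own illustration, not the released XPipe code); model, criterion, and optimizer are assumed to be defined elsewhere, and (x, y) form one mini-batch of size B.

    import torch

    def train_one_mini_batch(model, criterion, optimizer, x, y, T):
        """Split the mini-batch (x, y) into T micro-batches, accumulate their
        gradients, and apply a single weight update after the T-th backward pass."""
        optimizer.zero_grad()
        for micro_x, micro_y in zip(x.chunk(T), y.chunk(T)):
            loss = criterion(model(micro_x), micro_y)
            (loss / T).backward()   # gradients are summed into the .grad buffers
        optimizer.step()            # one update per mini-batch, after the last micro-batch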

Beyond that, as depicted in Figures 1(a) and 1(b), XPipe interleaves the execution of micro-batches belonging to different mini-batches. In this way, all GPUs can continuously and concurrently train their submodels after the steady phase starts, giving rise to high GPU utilization. Unfortunately, the cross-training of micro-batches results in weight inconsistency and staleness issues. For example, in Figure 1(a), GPU 0 uses the initial weights to perform the forward pass of the fifth micro-batch. However, by the time GPU 0 is ready to run its backward pass, the weights on GPU 0 have been updated twice, namely after the backward passes of micro-batches 2 and 4. Moreover, as shown in Figure 1(a), throughout its training round, the third mini-batch uses different versions of the weights to perform its forward and backward passes on each GPU. This staleness issue further slows down convergence and hurts the model quality.

III-B Weight Prediction

In this section, we propose an efficient weight prediction strategy to simultaneously address the weight inconsistency and staleness issues arising in asynchronous pipeline training. Instead of using the smoothed gradients, XPipe performs weight prediction based on Adam [14] updates, in which running averages of the first and second moments of the gradients are used.

Among the T micro-batches corresponding to a mini-batch, we refer to the micro-batch with the minimum index as the bellwether. Each mini-batch is thus allocated a bellwether that is in charge of weight prediction. For instance, the bellwether of the third mini-batch in Figure 1(a) is micro-batch 5, and the bellwether of the second mini-batch in Figure 1(b) is micro-batch 5 as well. The noteworthy feature of the bellwether is that, among the T micro-batches of its mini-batch, it is always the first to perform both the forward and backward passes.

We use the weight version difference s to measure the number of weight updates that happen between the current pipeline unit and the pipeline unit at which the corresponding micro-batch on GPU 0 completes its training round trip. The version difference should always be calculated first when the bellwether is ready to perform weight prediction.

For the forward pass, the bellwether calculates the version difference via

(1)

where K refers to the number of GPUs and k is the index of each GPU.

At the backward pass, the version difference turns to

(2)

For both the forward and backward passes, the bellwether of the t-th mini-batch uses the following formula to predict the corresponding future weights:

Ŵ_t = W − lr · s · m̂_t / (√v̂_t + ε)    (3)

where W denotes the currently cached weights, lr is the learning rate, and s is the version difference, with the moment estimates maintained by

m_t = β1·m_{t−1} + (1 − β1)·g_t,   v_t = β2·v_{t−1} + (1 − β2)·g_t²,   m̂_t = m_t / (1 − β1^t),   v̂_t = v_t / (1 − β2^t)    (4)

In (4), g_t refers to the gradients of the stochastic objective corresponding to the t-th mini-batch; m_t is the biased first-moment estimate; v_t is the biased second raw moment estimate; m̂_t is the bias-corrected first-moment estimate; v̂_t is the bias-corrected second raw moment estimate; g_t² refers to the elementwise square g_t ⊙ g_t; β1, β2, and ε are constant values.
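
As a concrete reading of the prediction rule above, the following sketch (our own illustration, not the authors' released code) computes the predicted weights for one parameter tensor from the cached moment estimates; the version difference s, obtained from (1) or (2), is taken as an input.

    import torch

    @torch.no_grad()
    def predict_weights(w, m, v, step, s, lr, beta1=0.9, beta2=0.999, eps=1e-8):
        """Predict the weights s updates ahead, following Eqs. (3)-(4):
        w_hat = w - lr * s * m_hat / (sqrt(v_hat) + eps)."""
        m_hat = m / (1 - beta1 ** step)   # bias-corrected first-moment estimate
        v_hat = v / (1 - beta2 ** step)   # bias-corrected second raw moment estimate
        return w - lr * s * m_hat / (v_hat.sqrt() + eps)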

Figure 1 illustrates the main idea of weight prediction by the bellwether. The red arrowed lines stand for the weight prediction performed by the bellwether. Each of them starts from the pipeline unit where the bellwether starts its forward pass and points to the pipeline unit at which its corresponding mini-batch on GPU 0 finishes the whole training round. In Figures 1(a) and 1(b), Ŵ_t denotes the predicted weights corresponding to the t-th mini-batch. On each GPU, when the T micro-batches of a mini-batch are ready to perform their forward or backward passes in sequence, the bellwether first calculates the version difference s; then weight prediction is performed using the current weights and the version difference to generate the future weights Ŵ_t through (3). Following that, the other micro-batches directly apply Ŵ_t to perform their forward or backward passes.

In the following, we illustrate the weight prediction procedure of XPipe using the pipeline training with T = 4 on the 4-GPU system. As shown in Figure 1(b), on GPU 0, when the second mini-batch (i.e., micro-batches 5, 6, 7 and 8) is ready to perform the forward pass, micro-batch 5 first uses formula (1) to calculate the version difference and then applies formula (3) to calculate the future weights for the second mini-batch (i.e., Ŵ_2). After that, micro-batches 6, 7 and 8 directly make use of Ŵ_2 to perform their forward passes. To avoid repeatedly doing weight prediction, the Ŵ_2 generated by the bellwether (micro-batch 5 here) is temporarily cached and then directly used by the other micro-batches within the same mini-batch.

Likewise, at the backward pass, the bellwether again takes charge of predicting the future weights. As shown in Figure 1(b), when micro-batch 5 is ready to perform the backward pass, it first uses formula (2) to calculate the version difference and then applies (3) to predict the future weights. As with the forward-pass prediction, the predicted weights are cached and then reused by the subsequent micro-batches for their backward passes, to avoid repetitive weight predictions.
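
A minimal sketch of this cache-and-reuse pattern, assuming a hypothetical compute_version_diff helper for Eqs. (1)-(2) and the predict_weights sketch above; it is only meant to show that the prediction is performed once per mini-batch and pass.

    # Hypothetical cache keyed by (mini-batch index, pass type); not taken from the paper's code.
    prediction_cache = {}

    def predicted_weights_for(mini_batch_idx, pass_type, is_bellwether,
                              current_w, m, v, step, lr):
        key = (mini_batch_idx, pass_type)            # pass_type: "forward" or "backward"
        if is_bellwether:
            s = compute_version_diff(pass_type)       # Eq. (1) or (2); assumed helper
            prediction_cache[key] = predict_weights(current_w, m, v, step, s, lr)
        return prediction_cache[key]                  # non-bellwether micro-batches reuse it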

IV Experimental Results

IV-A Implementation Details

We implemented XPipe using PyTorch [20] version 1.2.0. The code of XPipe will be released on GitHub. In the implementation of XPipe, each GPU is allocated one process. Each process is in charge of managing the local memory, data transfer between the CPU and its GPU, gradient calculation, weight updates, as well as communicating with the other processes. PyTorch provides the distributed package (i.e., torch.distributed) for message passing among multiple processes. In XPipe, each process uses the MPI communication backend for inter-GPU communication. Non-blocking communication primitives (e.g., isend and irecv) are used to overlap inter-GPU communication and GPU computation.
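
As an illustration of how non-blocking primitives can overlap communication with computation, here is a minimal sketch (our own, not XPipe's released code); it assumes a PyTorch build with MPI support, and the tensor shapes and toy stage are placeholders.

    import torch
    import torch.distributed as dist

    def run_stage(stage, micro_batches):
        """Forward each micro-batch and ship its activation to the next GPU with a
        non-blocking send, so the next micro-batch's computation overlaps the transfer."""
        rank, world = dist.get_rank(), dist.get_world_size()
        pending = []
        for x in micro_batches:
            out = stage(x)                                        # local GPU computation
            if rank < world - 1:
                pending.append(dist.isend(tensor=out.detach(), dst=rank + 1))
        for req in pending:                                       # drain outstanding sends
            req.wait()

    if __name__ == "__main__":
        dist.init_process_group(backend="mpi")                    # requires MPI-enabled PyTorch
        run_stage(torch.nn.Linear(1024, 1024),
                  [torch.randn(32, 1024) for _ in range(4)])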

IV-B Model Partition

The premise of pipeline model parallelism is to partition DNN layers across multiple GPUs. A few prior works concentrate on efficient partitioning [19, 5, 8]. Designing an efficient partitioning algorithm is not the focus of this paper. In the experiments, we simply partition the DNN layers across the GPUs with a roughly equal number of layers per GPU to balance their training time, while assigning slightly more layers to the earlier GPUs to achieve time/memory balance across GPUs.
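
A minimal sketch of such a partition (our own illustration, not the paper's partitioning code): split a flat list of layers into contiguous stages of roughly equal size, letting the earlier GPUs take the extra layers when the division is uneven.

    def partition_layers(layers, num_gpus):
        """Split `layers` into `num_gpus` contiguous stages with roughly equal
        layer counts; the earlier stages absorb the remainder."""
        base, extra = divmod(len(layers), num_gpus)
        stages, start = [], 0
        for gpu in range(num_gpus):
            size = base + (1 if gpu < extra else 0)   # earlier GPUs get one extra layer
            stages.append(layers[start:start + size])
            start += size
        return stages

    # Example: 10 layers over 4 GPUs -> stage sizes [3, 3, 2, 2]
    print([len(s) for s in partition_layers(list(range(10)), 4)])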

IV-C Experiment Setup

We conducted all the experiments on a 4-GPU system, which is equipped with 4 GeForce RTX2080X Nvidia GPUs. The host CPU is an Intel i9-9940X (@3.30 GHz).

Three popular CNN models are chosen as the benchmark networks in our experiments: VGG-16 [25], ResNet-101 [6] and Inception-V3 [27]. Two image datasets are used. The first is CIFAR-10 [16], which includes 60,000 32×32 images in total: 50,000 images for training and 10,000 images for validation. The second is Tiny ImageNet [30], which is categorized into 200 classes, each having 500 training images and 50 validation images. Standard data augmentation schemes, including flipping, padding, and random cropping, are used for both datasets. For CIFAR-10, the images are normalized using mean = [0.4914, 0.4822, 0.4465] and std = [0.2023, 0.1994, 0.2010]. For Tiny ImageNet, each image is first scaled up to a higher resolution; the images are then loaded into the range [0, 1] and normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
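
The preprocessing described above corresponds roughly to the torchvision pipelines below; the padding/crop parameters and the Tiny ImageNet target resolution are our assumptions, since they are not fully specified here.

    from torchvision import transforms

    # CIFAR-10: flip, pad-and-crop augmentation, then normalization with the stated statistics.
    cifar_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),              # padding size is an assumption
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),                             # loads pixel values into [0, 1]
        transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                             std=[0.2023, 0.1994, 0.2010]),
    ])

    # Tiny ImageNet: scale up, then normalize with the stated statistics.
    tiny_imagenet_train = transforms.Compose([
        transforms.Resize(224),                            # target resolution is an assumption
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])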

In the experiments, we compared XPipe with the following state-of-the-art pipeline model parallelism approaches: PipeDream (with weight stashing) [5], SpecTrain [2], and GPipe [8]. The following three measures were taken to ensure fairness. First, as with XPipe, we implemented PipeDream, SpecTrain and GPipe using the PyTorch framework. Second, before the pipeline training starts, all the evaluated methods adopt the same model partitioning approach to split the model across GPUs. Third, each evaluated approach takes advantage of the same strategy (i.e., automatically recomputing the forward pass during the backward pass [8]) for better memory utilization. In all the experiments, for XPipe, we empirically set the constants β1, β2, and ε used in (4). Meanwhile, the elements of both m and v were initialized with randomly generated values ranging from 0 to 1 (scaled by a constant factor).
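
The recomputation strategy mentioned above (re-running the forward pass during the backward pass to save activation memory) can be expressed with PyTorch's checkpointing utility; a minimal sketch, not tied to the paper's implementation:

    import torch
    from torch.utils.checkpoint import checkpoint

    stage = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
        torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    )
    x = torch.randn(32, 1024, requires_grad=True)
    y = checkpoint(stage, x)    # intermediate activations are recomputed during backward
    y.sum().backward()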

IV-D Results and Discussions

Comparison with PipeDream and SpecTrain. In this section, we compare XPipe with two recently proposed asynchronous pipeline model parallelism approaches, PipeDream and SpecTrain. Since GPipe without mini-batch partitioning automatically reduces to the naive pipeline approach, we trained GPipe with T = 1 to simulate the behavior of the naive approach and regarded its learning results as the baseline. We also trained XPipe with T = 1 to isolate the effect of mini-batch partitioning. We selected VGG-16 (https://github.com/kuangliu/pytorch-cifar) and Inception-V3 (https://github.com/weiaicunzai/pytorch-cifar100) as the benchmark networks and used 4 GPUs to train them on CIFAR-10 for 90 epochs. The learning rate was initialized to 1e-2 and divided by 10 every 30 epochs. We trained the models using Momentum SGD with the momentum factor set to 0.9 and a weight decay of 5e-4. The batch size for all the evaluated methods was 128.
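
The optimizer and learning-rate schedule described above map to the following standard PyTorch setup (a sketch; the model here is only a stand-in for VGG-16 or Inception-V3):

    import torch

    model = torch.nn.Linear(512, 10)   # stand-in for VGG-16 / Inception-V3
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                                momentum=0.9, weight_decay=5e-4)
    # Divide the learning rate by 10 every 30 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(90):
        # ... one training epoch over CIFAR-10 with batch size 128 ...
        scheduler.step()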

Approach     Min. Val. Loss    Max. Val. Top-1 Accuracy
VGG-16
  baseline   0.289             92.10%
  PipeDream  0.289             91.93% (-0.17%)
  SpecTrain  0.293             91.56% (-0.54%)
  XPipe      0.269             92.18% (+0.08%)
Inception-V3
  baseline   0.283             93.26%
  PipeDream  0.287             92.90% (-0.36%)
  SpecTrain  0.296             92.78% (-0.48%)
  XPipe      0.257             93.21% (-0.05%)
TABLE I: Results on CIFAR-10. Top: results for VGG-16; bottom: results for Inception-V3. The values inside the parentheses denote the relative variation of validation top-1 accuracy compared to the results produced by the baseline. The best results are highlighted in boldface.

Figure 2 depicts the learning curves, and Table I summarizes the obtained minimum validation loss and maximum validation top-1 accuracy. XPipe converges very fast, and its learning curves match well with those of the baseline. Besides, the experimental results show that XPipe obtains the lowest validation loss and top-1 accuracy very comparable to that of the baseline. On average, XPipe achieves a 0.015% top-1 validation accuracy improvement over the baseline. In contrast, PipeDream and SpecTrain incur an average top-1 validation accuracy drop of 0.265% and 0.51%, respectively. Note that XPipe with T = 1 makes use of the same version differences as SpecTrain to perform weight prediction in both the forward and backward passes. The experimental results therefore verify that Adam-based weight prediction provides a more effective solution for weight prediction.

Fig. 2: Experimental results for training VGG-16 and Inception-V3 on CIFAR-10. Top row, panels (a) VGG-16 and (b) Inception-V3: validation loss vs. epochs; bottom row, panels (c) VGG-16 and (d) Inception-V3: validation top-1 accuracy vs. epochs.
Fig. 3: Validation top-1 accuracy versus epochs for training Inception-V3 on Tiny ImageNet.
Fig. 4: Validation top-1 accuracy versus epochs for training ResNet-101 on Tiny ImageNet.

Comparison with GPipe. In this section, we compare XPipe with GPipe, a state-of-the-art synchronous pipeline model parallelism approach. We selected Inception-V3 and ResNet-101 as the benchmark networks (https://github.com/pytorch/vision/tree/master/torchvision/models) and used 4 GPUs to train them on Tiny ImageNet for 70 epochs. We compared XPipe and GPipe by running them with three different settings of T. For all the conducted experiments, we let both XPipe and GPipe use the same hyper-parameters. The batch size for both XPipe and GPipe was fixed at 100. The learning rate was initialized to 1e-2 and divided by 10 at the 40th and 60th epochs. We trained the models using Momentum SGD with the momentum factor set to 0.9 and a weight decay of 5e-4.

Figures 3 and 4 depict the top-1 validation accuracy versus epochs. The obtained minimum validation loss and maximum validation top-1 accuracy are summarized in Table II. XPipe converges very fast, and its learning curves on both Inception-V3 and ResNet-101 match well with (and even converge faster than) those of GPipe, independent of the setting of T. These results again verify the learning effectiveness of XPipe. Table II shows that XPipe almost always achieves a lower loss value and higher validation top-1 accuracy than GPipe. On average, XPipe obtains 0.26% and 0.67% top-1 validation accuracy improvements over GPipe on Inception-V3 and ResNet-101, respectively.

Partition   Method   Min. Val. Loss   Max. Val. Top-1 Acc.
Inception-V3
            GPipe    1.543            62.66%
            XPipe    1.546            62.62% (-0.04%)
            GPipe    1.549            63.24%
            XPipe    1.542            63.54% (+0.30%)
            GPipe    1.600            63.28%
            XPipe    1.596            63.72% (+0.44%)
ResNet-101
            GPipe    1.508            63.24%
            XPipe    1.438            64.04% (+0.80%)
            GPipe    1.495            64.60%
            XPipe    1.459            65.06% (+0.46%)
            GPipe    1.560            64.08%
            XPipe    1.540            64.82% (+0.74%)
TABLE II: Results on Tiny ImageNet. Top: results for Inception-V3; bottom: results for ResNet-101. Each pair of GPipe/XPipe rows corresponds to one partition setting. The values inside the parentheses denote the relative variation of validation top-1 accuracy compared to the results produced by GPipe. The best results are highlighted in boldface.
Fig. 5: Throughputs of PipeDream, SpecTrain, and XPipe for (a) VGG-16 and (b) Inception-V3 on 2-GPU and 4-GPU systems.
Fig. 6: Throughputs of GPipe and XPipe for training Inception-V3 and ResNet-101 on 2-GPU and 4-GPU systems. Panels (a) and (b) depict the throughputs for Inception-V3 on the 2-GPU and 4-GPU systems, respectively; panels (c) and (d) show the throughputs for ResNet-101 on the 2-GPU and 4-GPU systems.

Throughput Study. In this section, we compare the throughput of XPipe with that of PipeDream, SpecTrain, and GPipe using 2 and 4 GPUs, respectively. Here, throughput is defined as the number of training samples processed per second. For PipeDream, SpecTrain, and XPipe, the throughput measurement refers to the number of samples processed per second during the steady phase. We divided the comparison into two groups. For the first group, we compared the throughput of XPipe with that of PipeDream and SpecTrain. We selected VGG-16 and Inception-V3 as the benchmark networks and trained them on the CIFAR-10 dataset for one epoch. The batch size for all evaluated approaches was 128. For XPipe, we always set T = 1 to isolate the effect of mini-batch partitioning. For the second group, we compared the throughput of XPipe with that of GPipe. We selected Inception-V3 and ResNet-101 as the benchmark networks and trained them on Tiny ImageNet for one epoch. We compared the throughput of XPipe and GPipe under the same three settings of T as before. The batch size for both GPipe and XPipe was set separately for the 2-GPU and 4-GPU systems.
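
A small sketch (our own illustration) of how the reported throughput can be measured: count the samples processed after a warm-up period, so that only the steady phase of the pipeline is timed, and divide by the elapsed wall-clock time.

    import time

    def measure_throughput(train_step, batches, batch_size, warmup=10):
        """Return training samples per second, skipping the first `warmup`
        mini-batches so that only the steady phase is timed."""
        start, counted = None, 0
        for i, batch in enumerate(batches):
            if i == warmup:
                start = time.time()
            train_step(batch)
            if i >= warmup:
                counted += batch_size
        return counted / (time.time() - start)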

Figure 5 illustrates the results for the first group; Figure 6 illustrates the results for the second group. It is worth noting that these experiments were conducted to compare throughput across state-of-the-art pipeline approaches; all the evaluated pipeline approaches could consistently obtain higher throughput if a better model partition method were applied. We can draw the following conclusions from the throughput results. First, the throughput of XPipe is slightly inferior to that of PipeDream and SpecTrain, even though all of them adopt the same pipeline structure. This is because XPipe uses a more computation-intensive weight prediction strategy to guarantee effective learning. Second, as with GPipe, XPipe enables the same, larger mini-batch size for training. This is reasonable because both GPipe and XPipe use the fine-grained micro-batch as the basic data processing unit in the pipeline training. Third, the throughput of GPipe is very sensitive to the choice of T. This is because the pipeline structure of GPipe varies with the selection of T, and different selections of T give rise to different proportions of 'bubble' (idle) time. In contrast, the pipeline structure of XPipe is stable and independent of T. Therefore, XPipe consistently achieves very high throughput. For Inception-V3, XPipe provides an average of 20.0% (up to 31.9%) and 88.1% (up to 150.8%) throughput improvement over GPipe on 2-GPU and 4-GPU systems, respectively. For ResNet-101, XPipe provides an average of 10.8% (up to 21.2%) and 84.6% (up to 142.7%) throughput improvement over GPipe on 2-GPU and 4-GPU systems, respectively.

Robustness Study. In this section, we study the robustness of XPipe by using two other popular optimization methods for pipeline training: RMSProp [29] and Adam [14]. We again trained GPipe with T = 1 to simulate the behavior of the naive model parallel approach and regarded its results as the baseline. We selected VGG-16 as the benchmark network and trained it on CIFAR-10 for 50 epochs using 4 GPUs. The learning rate was fixed at 1e-4. The batch size for all approaches was 128. For RMSProp, the value of momentum was set to 0.9; for Adam, the exponential decay rates for the first and second moment estimates were set to 0.9 and 0.999, respectively.
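
The optimizer settings in this robustness study correspond to the following PyTorch constructors (a sketch; the model is only a stand-in for VGG-16):

    import torch

    model = torch.nn.Linear(512, 10)   # stand-in for VGG-16
    rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-4, momentum=0.9)
    adam = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))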

Figure 7 shows the robustness study results. The experimental results demonstrate the effectiveness of XPipe regardless of the optimization method used. When using either RMSProp or Adam as the optimization method, the learning curves of XPipe converge quickly and match well with those of the baseline. This demonstrates that the Adam-based weight prediction strategy is very robust and guarantees effective learning, independent of the choice of optimizer.

Fig. 7: Validation top-1 accuracy versus epochs when using (a) the RMSProp and (b) the Adam optimizer.

V Conclusions

In this work, we propose an efficient asynchronous pipeline model parallelism method called XPipe. XPipe interleaves the pipeline training of micro-batches belonging to different mini-batches, so as to ensure that each GPU concurrently and continuously trains the DNN model, thereby providing high throughput. Moreover, the effective weight prediction scheme enables XPipe to address the weight inconsistency and staleness issues in asynchronous pipeline training. Overall, XPipe provides high throughput, scales up the mini-batch size easily, and achieves accuracy that is very comparable to (even slightly better than) that of the state-of-the-art synchronous counterpart.

Acknowledgment

The work is partially supported by the China Scholarship Council (CSC), and Major State Research Development Program, China (2016YFB0201305). Lei Guan thanks Zhihui Yang, Tao Sun, and Bao Wang for stimulating discussions.

References

  • [1] T. Ben-Nun and T. Hoefler (2018) Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. arXiv preprint arXiv:1802.09941. Cited by: §I.
  • [2] C. Chen, C. Yang, and H. Cheng (2018) Efficient and robust parallel dnn training through model parallelism on multi-gpu platform. arXiv preprint arXiv:1809.02839. Cited by: §I, §II, §IV-C.
  • [3] X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide (2012) Pipelined back-propagation for context-dependent deep neural networks. In Thirteenth Annual Conference of the International Speech Communication Association, Cited by: §II.
  • [4] A. L. Gaunt, M. A. Johnson, M. Riechert, D. Tarlow, R. Tomioka, D. Vytiniotis, and S. Webster (2017) Ampnet: asynchronous model-parallel training for dynamic neural networks. arXiv preprint arXiv:1705.09786. Cited by: §II.
  • [5] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, and P. Gibbons (2018) Pipedream: fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377. Cited by: §I, §II, §IV-B, §IV-C.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §IV-C.
  • [7] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §I.
  • [8] Y. Huang, Y. Cheng, D. Chen, H. Lee, J. Ngiam, Q. V. Le, and Z. Chen (2018) GPipe: efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965. Cited by: §I, §II, §IV-B, §IV-C.
  • [9] Z. Huo, B. Gu, and H. Huang (2018) Training neural networks using features replay. In Advances in Neural Information Processing Systems, pp. 6659–6668. Cited by: §I.
  • [10] Z. Huo, B. Gu, Q. Yang, and H. Huang (2018) Decoupled parallel backpropagation with convergence guarantee. arXiv preprint arXiv:1804.10574. Cited by: §I.
  • [11] N. Kalchbrenner, E. Grefenstette, and P. Blunsom (2014) A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188. Cited by: §I.
  • [12] M. Kamruzzaman, S. Swanson, and D. M. Tullsen (2013) Load-balanced pipeline parallelism. In SC’13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–12. Cited by: §II.
  • [13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §I.
  • [14] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §III-B, §IV-D.
  • [15] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, et al. (2017) OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages. Cited by: §I.
  • [16] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §IV-C.
  • [17] H. Lee, P. Pham, Y. Largman, and A. Y. Ng (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in neural information processing systems, pp. 1096–1104. Cited by: §I.
  • [18] S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson, and E. P. Xing (2014) On model parallelization and scheduling strategies for distributed machine learning. In Advances in neural information processing systems, pp. 2834–2842. Cited by: §I.
  • [19] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean (2017) Device placement optimization with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2430–2439. Cited by: §IV-B.
  • [20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. Cited by: §IV-A.
  • [21] A. Petrowski, G. Dreyfus, and C. Girault (1993) Performance analysis of a pipelined backpropagation parallel algorithm. IEEE Transactions on Neural Networks 4 (6), pp. 970–981. Cited by: §II.
  • [22] R. Pittman, H. Guan, X. Shen, S. Lim, and R. M. Patton (2018) Exploring flexible communications for streamlining dnn ensemble training pipelines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 64. Cited by: §II.
  • [23] N. Qian (1999) On the momentum term in gradient descent learning algorithms. Neural networks 12 (1), pp. 145–151. Cited by: §II.
  • [24] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538. Cited by: §I.
  • [25] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §IV-C.
  • [26] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §I.
  • [27] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016-06) Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §IV-C.
  • [28] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) Deepface: closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708. Cited by: §I.
  • [29] T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: §IV-D.
  • [30] L. Yao and J. Miller (2015) Tiny imagenet classification with convolutional neural networks. CS 231N 2 (5), pp. 8. Cited by: §IV-C.
  • [31] Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer (2018) Imagenet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, pp. 1. Cited by: §I.