In recent years the compute requirements for training state of the art deep neural networks have rapidly increased (Amodei and Hernandez, 2018). To keep training times manageable, practitioners use increasingly parallel training setups where the workload is divided across multiple workers. This is most commonly done using data parallelism in the form of mini-batch training. This has been used to efficiently utilize individual accelerators as well as scaling training to large clusters of devices, usually in the form of synchronized distributed SGD (Chen et al., 2016).
Scaling training by increasing the batch size has drawbacks. Beyond a certain point, larger batch sizes do not reduce the number of steps required to train a model and thus cannot reduce training time further (Shallue et al., 2019). Shallue et al. (2019) empirically show that this point varies depending on the network and dataset. Further, for a given compute budget, the range of acceptable hyperparameters (learning rate, momentum) can shrink with increased batch sizes. Finally, they show that tuning is necessary for good performance; heuristically scaling the hyperparameters for large batch sizes does not always result in efficient training. The cost of the tuning required for efficient large batch size training could also be prohibitive. These downsides have sparked interest in alternative forms of parallelism.
Pipeline parallelism is one such alternative where the model is divided sequentially into segments we call pipeline stages. Each worker is assigned to one stage and inputs proceed sequentially through the stages, similar to an assembly line (Figure 1). This form of parallelism has the advantage that each worker only performs a subset of the computation which can allow them to specialize.
An example of this can be seen in GPipe (Huang et al., 2018) where workers specialize by holding a subset of the model parameters. This allows GPipe to train larger models than can fit on a single worker. Li and Pedram (2017) discuss other hardware advantages of such training, for example energy efficiency. Pipeline parallel training commonly uses a form of mini-batch SGD which sequentially feeds samples from a batch through the pipeline (filling the pipeline) and waits for the resulting gradients before updating the parameters (draining the pipeline) and processing the next batch.
Filling and draining the pipeline for each update can significantly lower hardware utilization when batch sizes are small compared to the number of pipeline stages (Figure 2). Pipelined backpropagation (PB) is a technique that avoids this overhead by updating the weights without draining the pipeline (Pétrowski et al., 1993). This can result in an inconsistency between the weights used for the forward and backwards passes for a given sample. Even with Weight stashing (Harlap et al., 2018), which saves the weights used on the forward pass for use on the backwards pass, the weights used to calculate the gradient may have been updated before the resulting gradient is applied, in which case the gradient is said to be stale. For these reasons PB may not match SGD training.
Recent works have explored training networks through a combination of data parallelism and pipelined backpropagation. SpecTrain (Chen et al., 2018) uses a form of weight prediction to mitigate both stale gradients and inconsistent weights. PipeMare (Yang et al., 2019) applies discrepancy correction (a form of backward weight prediction) to mitigate for inconsistent weights and learning rate rescheduling (a new form of learning rate warmup) to help with stale gradients. Zhuang et al. (2019) propose Gradient Shrinking, which exponentially scales the gradients for each stage depending on the delay.
Unlike prior work, we eliminate batch parallelism by having each stage process a single sample at a time. We attain parallelism using fine-grained pipelined parallelism where each pipeline stage only consists of a single layer. This enables highly specialized workers which can have significant hardware advantages. Our contributions are as follows:
We explore the use of fine-grained pipelined backpropagation without batch parallelism. We show that this could be a viable method for accelerating training with an update size of one.
We propose two methods, Spike Compensation and a new variant of weight prediction, Linear Weight Prediction, to mitigate the issues of pipelined backpropagation: inconsistent weights and stale gradients.
We analyze our methods and show how they can counteract the effects of stale gradients. We provide mathematically motivated settings for our methods removing the need for hyperparameter tuning. We also show that the methods restore the benefits of momentum for ill-conditioned problems with delay.
2 Pipelined Backpropagation
Pipeline parallelism is an interesting alternative or supplement to standard data parallelism (the differences between pipeline and data parallelism are analyzed in Appendix A). To perform SGD training using pipeline parallelism, the same weights must be used on the forward and backwards passes. To satisfy this, the pipeline needs to be empty before updating the weights. While the pipeline is filling or draining, some workers sit idle, which lowers utilization. The fill and drain overhead is illustrated in Figure 2.
We assume our pipeline has $S$ pipeline stages and that each stage performs a single forward and a single backward transformation at each time step. Each sample is processed in $2S-1$ time steps. Performing a mini-batch SGD update with $N$ samples takes roughly $N+2S-2$ steps (this assumes the workers are unable to speed up processing when they only perform one of the transformations; otherwise it may be about $N+S-1$). The work performed only corresponds to $N$ fully utilized steps so the overall utilization is upper bounded by:

$$\frac{N}{N+2S-2} \tag{1}$$

Unless $N \gg S$ this represents a significant overhead.
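The bound can be evaluated numerically to see how quickly utilization drops for small batches. The following is a minimal sketch, assuming a pipeline with `num_stages` stages, a batch of `num_samples` samples, and a fill-and-drain cost of `num_samples + 2 * num_stages - 2` time steps per update; the function name is illustrative.

```python
def pipeline_utilization(num_samples: int, num_stages: int) -> float:
    """Upper bound on worker utilization for fill-and-drain pipeline SGD.

    Assumes one forward and one backward transformation per stage per step,
    so one update over `num_samples` samples takes about
    num_samples + 2 * num_stages - 2 steps.
    """
    total_steps = num_samples + 2 * num_stages - 2
    return num_samples / total_steps


# Utilization for a 78-stage pipeline at several batch sizes.
for n in (1, 32, 1024):
    print(n, round(pipeline_utilization(n, num_stages=78), 3))
```

For a fixed pipeline depth the bound only approaches one when the batch is much larger than the number of stages, which is the regime that fine-grained pipelines with small batches do not reach.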
Pipelined backpropagation (Pétrowski et al., 1993) avoids the fill and drain overhead by relaxing the constraint that the same weights must be used for the forward and backwards passes. In PB the pipeline is not drained before an update is applied; instead the parameters are updated as soon as gradients have been obtained. This keeps all workers utilized after the pipeline is filled for the first time (Figure 2). We assume an update size ($N$) of one (alternatively, $N$ could be set to match some reference batch size for which known hyperparameters exist; we do not explore this). We compare the weight updates of PB and SGD. We write SGD as:

$$\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(x_t; \theta_t) \tag{2}$$

where $\theta_t$ is the set of all model weights, $x_t$ is the sample at time $t$, $\eta$ is the learning rate, and $\mathcal{L}$ is the loss function. For PB we define $w^s_t$ to be the weights for pipeline stage $s$ as seen by the $t$-th sample, $x_t$, as it propagates backwards through the network. $\hat{\theta}_t$ is defined as the concatenation (denoted by $\|$) of $w^s_t$ for all stages (this corresponds to the weights on the blue line in Figure 2):

$$\hat{\theta}_t = w^0_t \,\|\, w^1_t \,\|\, \cdots \,\|\, w^{S-1}_t \tag{3}$$

The weight update for $x_t$ can then be written as:

$$\hat{\theta}_{t+1} = \hat{\theta}_t - \eta\, G(x_t; \tilde{\theta}_t, \hat{\theta}_t) \tag{4}$$

where $G$ approximates the gradient and $\tilde{\theta}_t$ is the network state used for the forward pass of the network (this corresponds to the weights on the red line in Figure 2). For pipelined backpropagation with $N=1$:

$$\tilde{\theta}_t = w^0_{t-2(S-1)} \,\|\, w^1_{t-2(S-2)} \,\|\, \cdots \,\|\, w^s_{t-2(S-1-s)} \,\|\, \cdots \,\|\, w^{S-1}_t \tag{5}$$

Inconsistent Weights Different weights are used during the forward and backwards passes, $\tilde{\theta}_t \neq \hat{\theta}_t$. The resulting sample gradient is not the true sample gradient. The inconsistency is greater for earlier stages in the pipeline. If weight stashing (Harlap et al., 2018) is used to mitigate weight inconsistency the resulting update is:

$$\hat{\theta}_{t+1} = \hat{\theta}_t - \eta \nabla \mathcal{L}(x_t; \tilde{\theta}_t) \tag{6}$$

Weight stashing requires the overhead of storing parameter versions along with the activations.

Stale Gradients In PB each gradient is obtained using weights from various time steps. By the time the gradient is obtained, the weights have already been updated. This results in stale gradients (also known as delayed gradients), an issue that also occurs in asynchronous SGD training (Lian et al., 2015; Avron et al., 2015). The gradient staleness varies by stage; earlier stages suffer from a greater degree of staleness. The length of the grey lines in Figure 2 is proportional to the age of the weights, which is also a measure of the gradient delay for each stage. The depth of the pipeline determines the maximum delay. Weight stashing does not address gradient delay because $\tilde{\theta}_t$ in equation 6 is a delayed version of $\hat{\theta}_t$.
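To make the per-stage staleness concrete, the sketch below lists the delay of each stage. It assumes zero-indexed stages, one sample entering the pipeline per time step, and a last stage that runs its forward and backward transformation in the same step, so that stage `s` of `num_stages` sees a delay of `2 * (num_stages - 1 - s)` steps. These indexing conventions are assumptions made for illustration.

```python
def stage_delays(num_stages: int) -> list[int]:
    """Steps between the forward and backward pass for each pipeline stage.

    Assumes zero-indexed stages and one sample entering per time step; the
    last stage has no delay and the first stage has the maximum delay.
    """
    return [2 * (num_stages - 1 - s) for s in range(num_stages)]


print(stage_delays(4))  # delays shrink towards the end of the pipeline
```

Under these assumptions the maximum delay grows linearly with pipeline depth, matching the observation that earlier stages suffer the most from stale gradients.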
We introduce two compensation methods for pipelined backpropagation: weight prediction and spike compensation. We formulate them for SGD with momentum (SGDM); both methods require momentum, and they can be adapted for other momentum based optimizers. We write SGDM as:

$$v_{t+1} = m v_t + g_t \tag{7}$$

$$w_{t+1} = w_t - \eta v_{t+1} \tag{8}$$

where $w_t$ are the weights (parameters) at time $t$, $v_t$ is the velocity (sometimes called momentum), $m$ is the momentum coefficient, and $\eta$ is the learning rate. We use $g_t$ to represent a gradient estimate for time $t$. The estimate can correspond to a delayed gradient, and is potentially calculated with inconsistent weights.

We describe and analyze our methods for a constant delay, $D$, without modeling the pipeline or inconsistency. When we use the methods for PB we apply them to each stage separately, with the corresponding delay set to the number of steps between the forward and backwards passes for that stage. To simplify notation we drop the superscript representing the stage index for $w$, $v$, and $g$. We represent a delayed gradient with $g_t = G(w_{t-D})$. We write the gradient as a function of the weights alone; in SGD the gradient may also depend on inputs, labels or other data.
3.1 Small batch size training
We define the per-worker batch size to be the number of samples that each pipeline stage processes at a time and the update-size to be the number of samples that contribute to the gradient in each update. We set both of these to one in our experiments. Larger values can potentially be used but this is outside the scope of this work.
Since the optimal learning rate and momentum depend on the update size $N$, we scale the values used by the SGDM reference according to Chiley et al. (2019). This corresponds to scaling the expected update size linearly with the batch size and scaling the momentum such that the decay per sample is the same. This allows for a fair comparison of techniques even though different update sizes are used (Appendix H.4). The scaling rules are:

$$m = m_r^{N/N_r} \tag{9}$$

$$\eta = \frac{(1-m)\,N}{(1-m_r)\,N_r}\,\eta_r \tag{10}$$

where $\eta_r$, $m_r$ and $N_r$ are the reference learning rate, momentum coefficient and batch size and $\eta$, $m$ and $N$ are the new values (we use $N=1$).
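The scaling can be sketched in a few lines. This is an illustrative helper, not the authors' code; it assumes SGDM in the form $v \leftarrow m v + g$, $w \leftarrow w - \eta v$, so that the steady-state contribution of one gradient is $\eta / (1 - m)$, and the function and argument names are hypothetical.

```python
def scale_hyperparameters(lr_ref: float, m_ref: float,
                          batch_ref: int, batch_new: int = 1):
    """Rescale learning rate and momentum from a reference batch size.

    The momentum is scaled so the decay per sample is unchanged, and the
    learning rate so that lr / ((1 - m) * batch) stays constant.
    """
    m_new = m_ref ** (batch_new / batch_ref)
    lr_new = (1 - m_new) * batch_new / ((1 - m_ref) * batch_ref) * lr_ref
    return lr_new, m_new


lr, m = scale_hyperparameters(lr_ref=0.1, m_ref=0.9, batch_ref=128)
print(lr, m)  # a much smaller learning rate and a momentum close to one
```

Scaling down to a batch size of one produces a momentum coefficient very close to one, which is why the mitigation methods below operate in a high-momentum regime.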
3.2 Spike Compensation
We introduce spike compensation (SC) to mitigate the effects of delayed gradients in pipelined backpropagation. The method uses a modified weight update which increases the contribution of the latest gradient relative to the velocity. For a delay of $D$ this can generally be written as:

$$v_{t+1} = m v_t + g_t \tag{11}$$

$$w_{t+1} = w_t - \eta\,(a\, v_{t+1} + b\, g_t) \tag{12}$$

where $a$ and $b$ are functions of the delay (we could absorb either $a$ or $b$ into $\eta$ but use this form to keep consistent with other methods). We refer to this form as generalized spike compensation (GSC). To reason about sensible choices for $a$ and $b$ we can look at the contribution of each gradient over time in the no-delay case vs the delay case (see Figure 3). When a gradient is obtained with some delay $D$, this gradient would already have contributed to $D$ weight updates in the no-delay case. The total contribution of the gradient so far would have been:

$$\sum_{k=0}^{D-1} m^k = \frac{1-m^D}{1-m} \tag{13}$$

This inspires our default choice of $a$ and $b$ for spike compensation which we will refer to as SCD:

$$a = m^D \quad \text{and} \quad b = \frac{1-m^D}{1-m} \tag{14}$$

For this choice, the missing weight update is applied immediately and the contribution of the gradient at later time steps will match that of the no-delay case. The total contribution of each gradient to the weights over the course of training is unchanged; this only changes how the gradients are applied over time. The modified weight update can equivalently be seen as approximating the velocity in the no-delay case with $a\, v_{t+1} + b\, g_t$. This uses the latest gradient to estimate the gradient terms in the velocity that have not been observed yet due to the delay. Note that for a delay of zero, SCD reduces to standard SGD with momentum.
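A single spike-compensated step can be sketched as follows. This is a minimal scalar illustration, assuming SGDM in the form $v \leftarrow m v + g$ and the default coefficients $a = m^D$, $b = (1 - m^D)/(1 - m)$; the function names are hypothetical.

```python
def scd_coefficients(m: float, delay: int):
    """Default spike compensation coefficients for momentum m and a given delay."""
    a = m ** delay
    b = (1 - m ** delay) / (1 - m)
    return a, b


def gsc_step(w: float, v: float, grad: float, lr: float,
             m: float, a: float, b: float):
    """One generalized spike compensation update; a=1, b=0 is plain SGDM."""
    v_next = m * v + grad
    w_next = w - lr * (a * v_next + b * grad)
    return w_next, v_next


a, b = scd_coefficients(m=0.9, delay=3)
# Total contribution of one gradient over all future steps: b now plus a
# geometric tail a * (1 + m + m^2 + ...), which should equal 1 / (1 - m).
print(a / (1 - 0.9) + b)
```

The printed total matches the no-delay value of $1/(1-m)$, which illustrates that the spike only changes when the contribution is applied, not how large it is overall.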
3.3 Linear Weight Prediction
Both the weight inconsistency and gradient delay arise from the fact that we cannot access the (future) weights used on the backwards pass when we compute the forward pass. The goal of weight prediction is to estimate the backwards weights on the forward pass. The weights we want to estimate are:

$$w_{t+D} = w_t - \eta \sum_{k=0}^{D-1} v_{t+k+1} \tag{15}$$

where $D$ is the delay (number of update steps between the forward and backwards passes). The future velocities are unknown but can be estimated by assuming a constant gradient $\hat{g}$ over the prediction horizon, i.e. the number of iterations over which the prediction is made. This gives:

$$\hat{v}_{t+k+1} = m^{k+1} v_t + \hat{g} \sum_{i=0}^{k} m^i \tag{16}$$

which results in predicted weights:

$$\hat{w}(t, D) = w_t - \eta \sum_{k=0}^{D-1} \Big( m^{k+1} v_t + \hat{g} \sum_{i=0}^{k} m^i \Big) \tag{17}$$

We have several good choices for $\hat{g}$, including setting it to zero or estimating it based on recent gradients. In this work we focus on weight prediction where the direction of the velocity does not change, i.e. $\hat{g}$ is collinear with $v_t$. We refer to this as linear weight prediction (LWP). The estimate for the weights at time $t$ and delay $D$ can then be written in terms of past weights and velocities as:

$$\hat{w}^{\mathrm{v}}(t, T) = w_t - \eta\, T\, v_t \tag{18}$$

where $T$ is a hyperparameter we call the horizon of the weight prediction. For SGDM without modifications, we can equivalently write the estimate in terms of the previous weights alone:

$$\hat{w}^{\mathrm{w}}(t, T) = w_t + T\,(w_t - w_{t-1}) \tag{19}$$

When combined with spike compensation (and potentially with other optimizers) the predictions given by equations 18 and 19 differ. When this is the case we refer to the two types as LWP (velocity form) and LWP (weight difference form), respectively. We can write the update step as:

$$g_t = G(\hat{w}(t-D, T)) \tag{20}$$

$$v_{t+1} = m v_t + g_t \tag{21}$$

$$w_{t+1} = w_t - \eta v_{t+1} \tag{22}$$

In the rest of this paper we use LWPD to denote LWP with our default choice of $T = D$. This is equivalent to choosing $\hat{g} = (1-m)\, v_t$ in equation 17 which would result in a constant velocity. This form is closely related to the weight prediction used in SpecTrain (Chen et al., 2018) which extends the prediction horizon and also predicts weights on the backwards pass (see Appendix C).
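The two forms of linear weight prediction can be compared directly. The sketch below is illustrative and assumes plain SGDM in the form $v \leftarrow m v + g$, $w \leftarrow w - \eta v$, in which case $w_t - w_{t-1} = -\eta v_t$ and both forms give the same prediction; the function names are hypothetical.

```python
def lwp_velocity_form(w: float, v: float, lr: float, horizon: float) -> float:
    """Predict future weights by extrapolating along the current velocity."""
    return w - lr * horizon * v


def lwp_weight_difference_form(w: float, w_prev: float, horizon: float) -> float:
    """Predict future weights by extrapolating the last weight change."""
    return w + horizon * (w - w_prev)


# One plain SGDM step, followed by a prediction four steps ahead.
lr, m = 0.1, 0.9
w_prev, v_prev, grad = 1.0, 0.0, 0.5
v = m * v_prev + grad
w = w_prev - lr * v
print(lwp_velocity_form(w, v, lr, horizon=4),
      lwp_weight_difference_form(w, w_prev, horizon=4))
```

The weight difference form needs only the previous weights, while the velocity form needs access to the optimizer state; which is more convenient depends on the implementation.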
3.4 Combined Mitigation
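Spike compensation and weight prediction can be combined resulting in the following update step:

$$g_t = G(\hat{w}(t-D, T)) \tag{23}$$

$$v_{t+1} = m v_t + g_t \tag{24}$$

$$w_{t+1} = w_t - \eta\,(a\, v_{t+1} + b\, g_t) \tag{25}$$

where, as before, $T$ is the horizon of the weight prediction and $a$ and $b$ are the coefficients for the spike compensation. When combined with spike compensation we have:

$$\hat{w}^{\mathrm{v}}(t, T) = w_t - \eta\, T\, v_t, \qquad \hat{w}^{\mathrm{w}}(t, T) = w_t + T\,(w_t - w_{t-1}) = w_t - \eta\, T\,(a\, v_t + b\, g_{t-1}) \tag{26}$$

In the combination, $\hat{w}^{\mathrm{w}}$ can be interpreted as using spike compensation to estimate the velocity used in the weight prediction (this weight prediction also corresponds to a different choice of $\hat{g}$ in equation 17, using the most recent gradient estimate).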
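The combined update can be exercised on a toy problem. The following is a minimal scalar sketch, not the authors' implementation: it assumes a constant delay, the velocity form of the weight prediction, and that the first `delay` gradients are evaluated at the initial weights. The function name `delayed_sgdm` and its arguments are hypothetical.

```python
from collections import deque


def delayed_sgdm(grad_fn, w0, lr, m, delay, steps, horizon=0.0, a=1.0, b=0.0):
    """Scalar SGDM with a constant gradient delay and optional mitigation.

    horizon > 0 enables linear weight prediction (velocity form);
    (a, b) != (1, 0) enables generalized spike compensation.
    """
    w, v = w0, 0.0
    # Forward-pass weights whose gradients have not arrived yet (oldest first).
    # Assumption: the first `delay` gradients are evaluated at w0.
    in_flight = deque([w0] * delay)
    trajectory = [w]
    for _ in range(steps):
        # Run the "forward pass" at the predicted weights.
        in_flight.append(w - lr * horizon * v)
        grad = grad_fn(in_flight.popleft())  # gradient delayed by `delay` steps
        v = m * v + grad
        w = w - lr * (a * v + b * grad)
        trajectory.append(w)
    return trajectory


m, delay = 0.9, 4
a, b = m ** delay, (1 - m ** delay) / (1 - m)
# Gradient of the quadratic loss 0.5 * w**2 is simply w.
baseline = delayed_sgdm(lambda w: w, 1.0, 0.01, m, delay, steps=200)
combined = delayed_sgdm(lambda w: w, 1.0, 0.01, m, delay, steps=200,
                        horizon=delay, a=a, b=b)
print(abs(baseline[-1]), abs(combined[-1]))
```

With `delay=0`, `horizon=0`, `a=1` and `b=0` the loop reduces to ordinary SGDM, which makes it easy to check a mitigation setting against the undelayed baseline.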
3.5 Analysis for a Convex Quadratic
In this section we analyze the optimization of a convex quadratic loss with gradient delay. We find that our methods:
Improve convergence for large condition numbers
Allow higher learning rates for large momentum values
Restore the benefits of momentum for poorly conditioned losses
The loss has the form:

$$\mathcal{L}(w) = \frac{1}{2} w^\top A\, w, \qquad A = \mathrm{diag}(\lambda_1, \ldots, \lambda_n) \tag{27}$$

where $w$ corresponds to the parameters being optimized and $\lambda_1 \geq \cdots \geq \lambda_n > 0$ are the eigenvalues of the quadratic. As shown in e.g. Goh (2017), any positive definite quadratic can be written in this form through a coordinate transformation. Since $A$ is diagonal, each coordinate of the gradient is independent of other coordinates. This allows us to analyze the convergence for each coordinate separately. For simplicity we assume that the gradient is deterministic. A similar analysis would hold for the expected values of $w$ if each gradient sample was assumed to be noisy but unbiased.
In Appendix D we derive the state transition equations for SGDM with delay and our methods. Since the gradient here is linear, and the coordinates are independent, inserting it into the transition equations results in a linear recurrence relation for each coordinate. For component $w^{(k)}$, with associated eigenvalue $\lambda_k$, the characteristic polynomial for the recurrence relation of each method is:
where GDM stands for gradient descent with momentum, GSC is general spike compensation, LWP is linear weight prediction, $z$ parameterizes the polynomials and other symbols have the same meaning as in Section 3.4. Note that since the gradient is linear, GSC and LWP are equivalent for a certain choice of $a$, $b$ and $T$ as shown in Appendix D. Even though this is the case, the characteristic polynomial of the combination cannot be obtained from either method.
Linear recurrence relations have a well known solution in terms of the roots of the corresponding characteristic equation. The resulting sequence for component $w^{(k)}$, corresponding to the characteristic polynomial with roots $r_i$, can be written as:

$$w^{(k)}_t = \sum_i p_i(t)\, r_i^t \tag{32}$$

where $p_i(t)$ is a polynomial. The order of the polynomial is one less than the multiplicity of the corresponding root $r_i$. The coefficients of the polynomials are determined by the initial conditions.
For our analysis we assume that all components start with some error and look at the rate of convergence in the limit $t \to \infty$. A component converges to the optimal value of $0$ if $|r_{\max}| < 1$, where $r_{\max}$ is the root with the largest magnitude. In the limit, the slowest term of equation 32 will dominate so the error for this component, $\epsilon^{(k)}_t$, will decay as:

$$|\epsilon^{(k)}_t| \sim |r_{\max}|^t$$
The overall rate of convergence is determined by the slowest component. The slowest component can depend on the roots of high order polynomials, which are difficult to determine analytically, so we turn to computational analysis. For a given delay, we can compute the roots of the characteristic polynomials 28-31, including $r_{\max}$, as a function of the normalized rate $\eta\lambda$ and the momentum $m$. Figure 4 shows heatmaps of $|r_{\max}|$ for each method for a delay of one and our default values of $a$, $b$ and $T$. Note that the region of stability is significantly reduced by the delay, especially for large momentum values. Our compensation methods counteract this, allowing larger learning rates to be used for high momentum values. SCD in particular strictly increases the region of stability, while the other methods slightly decrease it for small momentum coefficients.
Figure 4 also allows us to reason about more than a single component at a time. Let us assume that we have multiple components, a condition number $\kappa = \lambda_1 / \lambda_n$ and a dense spectrum of eigenvalues between the smallest eigenvalue $\lambda_n$ and the largest eigenvalue $\lambda_1$. The same learning rate and momentum are used for all components. The overall convergence rate is determined by the component with the largest $|r_{\max}|$. This corresponds to the largest value in a horizontal line segment between $\eta\lambda_n$ and $\eta\lambda_1$ on the root heatmaps. With a log scale the line segment has a constant length determined by $\kappa$.
Figure 5 shows the convergence speed as a function of $\kappa$ for the different methods. We measure the half-life $-\ln 2 / \ln |r_{\max}|$ where $|r_{\max}|$ is obtained by finding the lowest max magnitude over all intervals of sufficient length. The methods improve the rate of convergence compared to the delayed baseline. The combination performs the best, which also holds for larger delays as is shown in Figure 6.
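The half-life follows from asking after how many steps an error that decays geometrically is halved. The sketch below is a small illustration under the assumption that the error shrinks by a factor `r_max_abs` (the magnitude of the dominant root) per step; the function name is hypothetical.

```python
import math


def half_life(r_max_abs: float) -> float:
    """Steps needed to halve an error that decays as r_max_abs ** t."""
    if not 0.0 < r_max_abs < 1.0:
        raise ValueError("the dominant root magnitude must lie in (0, 1)")
    return -math.log(2.0) / math.log(r_max_abs)


for r in (0.5, 0.9, 0.99, 0.999):
    print(r, round(half_life(r), 1))
```

The half-life grows rapidly as the dominant root magnitude approaches one, so small reductions in that magnitude correspond to large gains in convergence speed.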
As mentioned earlier, GSC and LWP can be equivalent for a convex quadratic. The fact that LWPD slightly outperforms SCD indicates that our selection of is better than the selection of and as given in equation 14 in this case. Figure 7 shows the effect of different values of . It shows that values close to are optimal but do not outperform the combination LWP+SCD. This seems to indicate that “overcompensating” for the delays, by predicting weights further out in LWP or equivalently by using larger spikes in SC, seems to produce better optimization trajectories. The resulting root heatmaps resemble the ones for the no-delay Nesterov baseline (see LWP+SCD in Figure 4, LWP with looks similar). Note that adding Nesterov to the delay is not sufficient to get this effect. In Appendix E we show the effect of extended horizons for both the convex quadratic and a neural network.
Figure 7 also reveals that without mitigation ($T = 0$ is equal to GDM with delay), the optimal momentum is zero. In the no-delay case the optimal momentum is given by $m = \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^2$ (Zhang and Mitliagkas, 2017) which increases with the condition number. Our compensation methods restore the benefits of momentum for high condition numbers. Overall the combined mitigation performs the best. Extended horizons for LWP or the equivalent coefficients for GSC also outperform our default choice in this case but are unable to match the combination LWP+SCD.
To efficiently run small batch, fine-grained pipelined backpropagation on a GPU, we developed a framework described in Appendix G.1. The majority of experiments are done with the pre-activation residual networks proposed by He et al. (2016). To enable training at a batch size of one we replace batch normalization (Ioffe and Szegedy, 2015) with group normalization (Wu and He, 2018). In ResNets, batch normalization slightly outperforms group normalization so the results are not directly comparable to the baselines found in He et al. (2016). Hyperparameters are adopted from He et al. (2016) and scaled for batch size one training (Section 3.1). We combine each convolution layer and its associated normalization and non-linearity into a single pipeline stage. In our implementation the sum nodes between residual blocks also become pipeline stages. For our mitigation methods we use the default hyperparameters for LWPD and SCD without further tuning. The results can potentially be improved with a hyperparameter search. Other experiment details as well as run to run variability can be seen in Appendix H.
Pipelined backpropagation without mitigation suffers from a loss of accuracy compared to the SGDM baseline (Figures 8 and 9, Table 1). The size of the degradation depends on the depth of the pipeline (Table 1). This is expected given that longer pipelines produce larger delays.
Mitigating for the delay improves the performance of PB training. For relatively shallow networks, PB training has minimal degradation. All mitigation methods tested fully recover the SGDM baseline accuracy (Figure 8). For CIFAR10 ResNet20 training, PB+LWP+SCD produces the best accuracy (in our training setup one form of LWP+SCD outperformed the other; see Appendix H.5).
When training deeper networks, such as ImageNet ResNet50 with 78 pipeline stages (Figure 9), PB training incurs an accuracy loss of 0.6%. (Wu and He (2018) report an accuracy of 75.9%; they do this by extending and modifying the learning rate schedule we used, which we adopted from He et al. (2016).) LWPD and SCD are not able to fully recover the baseline accuracy but LWP+SCD produces competitive results. PB training of CIFAR10 ResNet110 leads to an accuracy which is 1.0% worse than SGD training. Although LWP+SCD recovers most of the accuracy loss it does not fully close the gap. Weight stashing does not help with PB training in our setting (Table 2 in Appendix B). While SpecTrain works well in training CIFAR networks, it still exhibits an accuracy degradation on ImageNet training (Appendix C.1).
Even without any hyperparameter tuning, PB+LWP+SCD mostly produces results which are competitive with SGD training for both CIFAR and ImageNet. Where LWP+SCD is not sufficient, hyperparameter tuning, a learning rate warmup, or additional delay mitigation methods can potentially help recover full accuracy.
Pipelined backpropagation training works well for shallow networks but does not perform as well as SGD for deeper networks without mitigation. The pipeline geometry determines the number of steps between the forward and backward passes which cause gradient delay and weight inconsistency. In Appendix B we explore the effects of weight inconsistency and find that it is insignificant in our setting. In cases where weight inconsistency is an issue, weight stashing or similar techniques (e.g. discrepancy correction from Yang et al. 2019) can be applied.
The effect of the delays depends on the total change in the local loss surface over the course of the delay. For a small change, a delayed gradient is roughly equal to the true non-delayed gradient and is therefore unlikely to have an adverse impact. The change in the model parameters, which also causes the weight inconsistency, is indicative of the change in the local loss surface. The effects of the delay may therefore depend on the learning rate, phase of training, etc. Since the model parameters usually change most rapidly at the start of training, a learning rate warmup may help stabilize PB training. Such methods can be combined with our mitigation strategies to improve performance.
Using a small per-worker batch size decreases the length of the delays when measured in number of samples. If the learning rate is adjusted to keep the contribution of each sample the same, this reduces the total change in model parameters over the course of the delay and thus the adverse effects of the delay. The use of small batch size training therefore helps mitigate the delays of fine-grained PB.
Small batch size training prevents the use of batch normalization (BN) so we opted to use group normalization (GN). In additional exploratory experiments (not shown) we observed that BN seems to significantly decrease the effects of delayed gradients compared to GN. The use of other small batch size alternatives to BN such as Online Normalization (Chiley et al., 2019), Weight Standardization (Qiao et al., 2019) or Filter Response Normalization (Singh and Krishnan, 2019) may boost delay tolerance. Optimizers such as ADAM may also increase delay tolerance.
We introduced spike compensation and linear weight prediction to mitigate the effects of delays in PB. These methods require momentum to be effective. When we scale the hyperparameters for small batch size training, we keep the half-life of the momentum the same when measured in the number of samples. This results in a very high momentum coefficient which we find works well and boosts the performance of our mitigation methods (Appendix F). Other works claim that momentum is not necessary for small batch size training (Smith and Le, 2018). However momentum may still make it easier to mitigate the effects of the delays and enable the use of existing hyper-parameter settings. In Section 3.5 we show our methods restore some of the traditional advantages of momentum in the delayed setting.
We find “overcompensating” for the delays can result in better optimization trajectories (Section 3.5, Appendix E). One way to do this is to combine spike compensation and weight prediction. We show this combination enables training moderately deep neural networks such as ResNet50 for ImageNet without a loss of accuracy. Overcompensating for large delays, like those in ResNet110, can adversely impact performance. In such cases using one method can work better than the combination (Appendix E).
With mitigation, PB is a promising alternative to batch parallel training. It overcomes the fill and drain overhead of traditional pipeline parallel SGD training. This could enable the design of highly efficient pipeline parallel hardware accelerators that benefit from specialized workers.
We are grateful to Vithu Thangarasa and Ron Estrin for their feedback on the manuscript. We thank Min Xu for his help with the dataloader used in our GProp experiments and Chuan-Yung Tsai for insightful discussions.
- AI and compute.
- Revisiting asynchronous linear solvers: provable convergence rate through randomization. Journal of the ACM (JACM) 62 (6), pp. 51.
- Efficient and robust parallel dnn training through model parallelism on multi-gpu platform. ArXiv abs/1809.02839.
- Revisiting distributed synchronous SGD. CoRR abs/1604.00981.
- cuDNN: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.
- Online normalization for training neural networks. In Advances in Neural Information Processing Systems 32, pp. 8431–8441.
- MetaInit: Initializing learning by learning to initialize. In Advances in Neural Information Processing Systems, pp. 12624–12636.
- ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
- Pytorch-vgg-cifar10.
- At stability’s edge: how to adjust hyperparameters to preserve minima selection in asynchronous training of neural networks?. arXiv preprint arXiv:1909.12340.
- Why momentum really works. Distill.
- Taming momentum in a distributed asynchronous environment. arXiv preprint arXiv:1907.11612.
- PipeDream: fast and efficient pipeline parallel dnn training. ArXiv abs/1806.03377.
- Deep residual learning for image recognition.
- Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645.
- GPipe: efficient training of giant neural networks using pipeline parallelism. ArXiv abs/1811.06965.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456.
- Learning multiple layers of features from tiny images. Citeseer.
- CATERPILLAR: coarse grain reconfigurable architecture for accelerating the training of deep neural networks. CoRR abs/1706.00517.
- Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 2737–2745.
- Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics 15 (3), pp. 715–732.
- PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
- Performance analysis of a pipelined backpropagation parallel algorithm. IEEE transactions on neural networks 4 6, pp. 970–81.
- Weight standardization. CoRR abs/1903.10520.
- Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research 20 (112), pp. 1–49.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Filter response normalization layer: eliminating batch dependence in the training of deep neural networks. arXiv preprint arXiv:1911.09737.
- A bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations.
- Group normalization. In The European Conference on Computer Vision (ECCV).
- PipeMare: asynchronous pipeline parallel DNN training. arXiv preprint arXiv:1910.05124.
- Fixup initialization: Residual learning without normalization. In 7th International Conference on Learning Representations, ICLR 2019, pp. 1–16.
- Yellowfin and the art of momentum tuning. arXiv preprint arXiv:1706.03471.
- Fully decoupled neural network learning using delayed gradients. ArXiv abs/1906.09108.
Appendix A Batch Parallel vs Pipeline Parallel Computation
Pipeline parallelism differs from batch parallelism in several ways:
The training memory requirements differ. In both cases we assume an $L$ layer network trained with $W$ workers. During neural network training, the activations of many layers must be stored for the gradient calculation. For batch parallelism the activation memory required is $O(LW)$. To compute the backwards pass, each worker has to store activations for roughly every layer. In the pipeline parallel setting, each worker is responsible for storing the activations of approximately $L/W$ layers. The first worker must store its activations for roughly $2W$ steps. The second worker needs to keep activations for $2(W-1)$ steps and so on. The total activation memory comes out to be approximately the same, $O(LW)$; however, the per-worker memory requirements can be very different. Pipeline parallelism generally requires less memory for storing model parameters, potentially requiring only a single copy of each parameter. Unless special methods are used, batch parallelism may need to keep $W$ copies of the model.
The communication pattern is different. In pipeline parallelism each worker sends activations and the corresponding gradients to their neighbors. In distributed mini-batch training every worker must send the gradients for all model parameters and receive updated values after every batch. The bandwidth requirements in each case depend on the exact model used, the batch size, as well as other factors.
Both pipeline parallel training and synchronized distributed batch parallel training can suffer from worker balancing bottlenecks. When using pipeline parallelism, care must be taken to balance the throughput of all workers since the overall speed is determined by the slowest worker. This load balancing issue could be handled in software (Harlap et al., 2018) without requiring users to manually specify the model division. In synchronized distributed SGD care must be taken to balance the throughput and master node communication of all workers since the overall speed is determined by the slowest worker.
Batch normalization (Ioffe and Szegedy, 2015) requires batch parallelism. In our work we are interested in replacing batch parallelism with fine-grained pipeline parallelism. We therefore operate at a per-worker batch size of one, which does not work well with Batch Normalization. Newer normalization techniques such as Group Normalization (Wu and He, 2018), Weight Standardization (Qiao et al., 2019), Filter Response Normalization (Singh and Krishnan, 2019) and Online Normalization (Chiley et al., 2019) work well with small batch sizes. Alternatively, initialization methods can be used to enable training without normalization (Zhang et al., 2019; Dauphin and Schoenholz, 2019).
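The activation memory comparison in the first point above can be checked with a small calculation. The sketch below is illustrative only: it assumes the layers are split evenly over the workers and that pipeline worker i (counting from zero) holds its activations for roughly 2(W - i) steps. The function names and the unit of memory (one layer's activations for one input) are our own assumptions.

```python
def batch_parallel_activation_memory(num_layers, num_workers):
    """Per-worker activation memory, in units of one layer's activations."""
    # Each worker stores activations for roughly every layer of its own input.
    return [num_layers] * num_workers


def pipeline_parallel_activation_memory(num_layers, num_workers):
    """Per-worker activation memory under the assumed 2 * (W - i) holding time."""
    layers_per_worker = num_layers / num_workers
    # Worker i keeps activations until the matching gradient returns,
    # assumed here to take about 2 * (num_workers - i) steps.
    return [layers_per_worker * 2 * (num_workers - i) for i in range(num_workers)]


if __name__ == "__main__":
    layers, workers = 48, 4
    bp = batch_parallel_activation_memory(layers, workers)
    pp = pipeline_parallel_activation_memory(layers, workers)
    print("batch parallel   :", bp, "total", sum(bp))
    print("pipeline parallel:", pp, "total", sum(pp))
```

Under these assumptions both totals grow proportionally to the number of layers times the number of workers, while the pipeline parallel per-worker requirement is largest for the first worker and smallest for the last.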
Appendix B Inconsistent Weights vs Stale Gradients
In pipelined backpropagation, gradients are delayed and computed with inconsistent weights. This can lead to accuracy degradation and instability. In this section we investigate the relative importance of these two effects. We do this by comparing training with delayed gradients using either inconsistent or consistent weights. In Appendix G.2 we describe how we can simulate this in PyTorch (Paszke et al., 2019) without using pipelined backpropagation.
Figure 10 shows the effects of delay on the final accuracy of CIFAR10 ResNet20 training with or without inconsistent weights. As can be seen, even modest delays affect the final accuracy of training. Weight inconsistency does not cause an additional loss of accuracy for small delays but causes a rapid loss of accuracy beyond a certain delay. The transition point where weight inconsistency starts to affect training will depend on the dataset and architecture. Harlap et al. (2018) and Chen et al. (2018) make opposing claims about the effect of weight inconsistency. Harlap et al. (2018) introduce weight stashing to fix weight inconsistency and claim its use is necessary for convergence. Chen et al. (2018) show that weight stashing has no effect on training in their experiments and conclude that it should be omitted to avoid its memory overhead. Our results suggest that the effects of weight inconsistency depend on the magnitude of the delays, which reconciles the two claims.
We also investigate the effect of weight inconsistency in our fine-grained pipelined backpropagation setup. Table 2 compares PB training with and without weight stashing. The results suggest that weight stashing is not beneficial in our setup so we do not use it in other experiments. This indicates that weight inconsistency is likely not an issue and the accuracy losses of PB primarily stem from the gradient delay. As mentioned in the discussion section, the small batch sizes we use combined with the hyperparameter scaling may reduce the effects of the delay. For larger batch sizes weight inconsistency may be a bigger issue.
Appendix C Forms of Weight Prediction
The goal of weight prediction is to estimate future weights to combat gradient delay and weight inconsistency. Linear Weight Prediction (LWP) gives a general form for predicting the network state $T$ steps into the future by using the velocity. In Pipelined Backpropagation the delay varies for different stages. By default (LWP$_D$) we set $T$ equal to the delay $D$ for every stage (see red arrows in Figure 11). Other works have proposed related forms of weight prediction.
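As a minimal sketch of linear weight prediction, the code below assumes two forms: a velocity form that extrapolates along the optimizer velocity, `w - lr * horizon * v`, and a weight difference form that extrapolates along the last weight update, `w + horizon * (w - w_prev)`. The SGDM update convention (`v <- m * v + g`, `w <- w - lr * v`) and all function names are assumptions made for illustration.

```python
def predict_velocity_form(w, v, lr, horizon):
    """Assumed velocity form: extrapolate along the current optimizer velocity."""
    return [wi - lr * horizon * vi for wi, vi in zip(w, v)]


def predict_weight_difference_form(w, w_prev, horizon):
    """Assumed weight difference form: extrapolate along the last weight update."""
    return [wi + horizon * (wi - pi) for wi, pi in zip(w, w_prev)]


def sgdm_step(w, v, grad, lr, momentum):
    """Plain SGDM under the assumed convention: v <- m * v + g, w <- w - lr * v."""
    v_new = [momentum * vi + gi for vi, gi in zip(v, grad)]
    w_new = [wi - lr * vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


if __name__ == "__main__":
    lr, momentum, horizon = 0.1, 0.9, 3
    w_prev, v = [1.0, -2.0], [0.0, 0.0]
    w, v = sgdm_step(w_prev, v, grad=[0.5, -1.0], lr=lr, momentum=momentum)
    print(predict_velocity_form(w, v, lr, horizon))
    print(predict_weight_difference_form(w, w_prev, horizon))
```

With plain SGDM the last weight update equals `-lr * v`, so the two forms agree up to rounding. They can differ when the applied update is not simply `-lr * v`, for example when a gradient term is added directly to the weight update.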
LWP is closely related to the weight prediction proposed in SpecTrain (Chen et al., 2018). SpecTrain extends the prediction horizon such that all stages predict to the same time step. This form of time synchronization is first described by Harlap et al. (2018) as Vertical Sync. The forward prediction horizon is depicted in green in Figure 11. With the extended prediction horizon, SpecTrain must also predict weights on the backward pass to address inconsistency. The prediction horizon for the backward pass weights is depicted in blue in Figure 11. This can be seen as using a stage dependent extended prediction horizon (Appendix E).
Discrepancy correction (Yang et al., 2019) can be seen as a form of weight prediction. Whereas LWP and SpecTrain predict weights into the future to mitigate gradient delay and weight inconsistency, PipeMare's discrepancy correction estimates, during the backward pass, the weights that were used on the forward pass. This can only deal with weight inconsistency, but potentially provides a more accurate prediction. Discrepancy correction uses a separate exponential tracker for its prediction, whereas LWP uses the optimizer velocity directly. In Appendix B we show that weight inconsistency is not a significant issue in our setting so we primarily focus on mitigating the effects of gradient delay.
DANA (Hakimi et al., 2019) is another variant of weight prediction that has been used in the ASGD setting but is not directly applicable to Pipelined Backpropagation.
C.1 SpecTrain Experimental Results
Table 3 compares the final validation accuracy of CIFAR10 training using SpecTrain and our methods. Although SpecTrain does very well in these settings, it is not able to recover the SGDM reference accuracy on ImageNet training, unlike the combined method LWP+SC$_D$.
Appendix D State Transition Equations
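In order to analyze and compare our methods, we view the optimization as a dynamical system in terms of its state transition equation. A similar approach is used in (O’Donoghue and Candes, 2015; Goh, 2017; Giladi et al., 2019). We assume that $\mathcal{L}(w_t)$ is the underlying loss function we are trying to minimize, where $w_t$ are the weights at time $t$. For neural networks, $\mathcal{L}$ could be the mean training loss, i.e. the expected loss over all training samples. We assume that for a given sample or time step, the gradient with respect to the weights is $\nabla\mathcal{L}(w_t) + R$, where $R$ is a random variable. The expectation of $R$ (over all samples) is assumed to be zero.
We are interested in comparing the dynamics of delayed SGDM, weight prediction, spike compensation and the combined mitigation. These can all be seen as special cases of the combined mitigation given in Section 3.4 for the appropriate choice of $a$, $b$ and $T$. The velocity form of the combined mitigation, LWP$^v$+SC, results in a complicated state transition equation which cannot be easily analyzed without further simplifications. The velocity form can be approximated with the weight difference form, LWP$^w$+SC. This form is simple to analyze so we use it for the rest of the analysis.
We analyze the systems in expectation and do not try to estimate the variance. Let $\bar{w}_t$ and $\bar{v}_t$ be the expected weights and velocity at time $t$. We can then write the expected state update for the combined mitigation at time $t$ in terms of previous expected values as:
$$\bar{v}_{t+1} = m\,\bar{v}_t + \bar{g}_t \tag{34}$$
$$\bar{w}_{t+1} = \bar{w}_t - \eta\,(a\,\bar{v}_{t+1} + b\,\bar{g}_t) \tag{35}$$
where $m$ is the momentum, $\eta$ is the learning rate, $a$, $b$ are the coefficients for general spike compensation and $\bar{g}_t$ is the expected gradient arriving at time $t$. This gradient is calculated using weight prediction with horizon $T$ from weights delayed by $D$ time steps:
$$\bar{g}_t = \nabla\mathcal{L}\big(\bar{w}_{t-D} + T\,(\bar{w}_{t-D} - \bar{w}_{t-D-1})\big)$$
We can isolate $\bar{v}_{t+1}$ from equation 35:
$$\bar{v}_{t+1} = \frac{1}{\eta a}\,(\bar{w}_t - \bar{w}_{t+1}) - \frac{b}{a}\,\bar{g}_t$$
Shifting the time index we obtain an expression for $\bar{v}_t$ which we can insert into equation 34:
$$\bar{w}_{t+1} = (1+m)\,\bar{w}_t - m\,\bar{w}_{t-1} - \eta\,(a+b)\,\bar{g}_t + \eta\,m\,b\,\bar{g}_{t-1}$$
By inserting appropriate values for $a$, $b$ and $T$ we can obtain the state transition equations for general spike compensation (GSC, $T=0$), linear weight prediction (LWP, $a=1$, $b=0$) and SGDM with delay ($a=1$, $b=0$, $T=0$):
$$\text{GSC:}\quad \bar{w}_{t+1} = (1+m)\,\bar{w}_t - m\,\bar{w}_{t-1} - \eta\,(a+b)\,\nabla\mathcal{L}(\bar{w}_{t-D}) + \eta\,m\,b\,\nabla\mathcal{L}(\bar{w}_{t-D-1})$$
$$\text{LWP:}\quad \bar{w}_{t+1} = (1+m)\,\bar{w}_t - m\,\bar{w}_{t-1} - \eta\,\nabla\mathcal{L}\big((T+1)\,\bar{w}_{t-D} - T\,\bar{w}_{t-D-1}\big)$$
$$\text{SGDM:}\quad \bar{w}_{t+1} = (1+m)\,\bar{w}_t - m\,\bar{w}_{t-1} - \eta\,\nabla\mathcal{L}(\bar{w}_{t-D})$$
We note that unlike the state transition equation of SGDM, the equations for LWP and GSC both contain $\bar{w}_{t-D-1}$. This means that the mitigation methods generally do not correspond to a simple change in the hyperparameter values of SGDM. Similarly, the combination of GSC and LWP has an additional $\bar{w}_{t-D-2}$ term and thus does not simply correspond to a different setting of $a$, $b$ or $T$ for either method.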
The equations for LWP and GSC contain the same weight terms which could indicate that they operate in similar ways. If the gradient is well approximated as a linear function on the line segment:
In this case GSC and LWP are equivalent for the same learning rate and momentum if:
This is equivalent to assuming zero future gradient over the prediction horizon in equation 17 instead of a constant velocity. GSC is equivalent to LWP with horizon for the same learning rate if the approximation in 43 holds and:
This shows that LWP and GSC are closely related. Both methods compensate for a delay but at different points in time. Weight prediction changes how the gradient is computed, spike compensation changes how it is applied. Each method has its advantages. Spike compensation has minimal overhead and doesn’t require an estimate of the delay ahead of time. Weight prediction might introduce memory overhead by adding a new copy of the weights (depending on the implementation and hardware), but may help reduce weight inconsistency. The combination of the two methods can be useful in cases where we want to overcompensate for the delay. A similar effect can be achieved with either method by changing the horizon but their combination offers increased weight consistency without requiring an additional weight prediction on the backwards pass.
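The relationship between the two methods can be checked numerically. The sketch below is not the authors' code. It assumes the update `v <- m * v + g`, `w <- w - lr * (a * v + b * g)` with the weight difference prediction `w_hat = w[t-D] + T * (w[t-D] - w[t-D-1])`, and a one-dimensional quadratic loss whose gradient is exactly linear. Under these assumptions, eliminating the velocity gives `w[t+1] = (1+m) w[t] - m w[t-1] - lr (a+b) g[t] + lr m b g[t-1]`, and matching the gradient coefficients of the GSC and LWP forms gives the conditions `a + b = T + 1` and `m * b = T`. The spike compensation coefficients `a = m**D` and `b = (1 - m**D) / (1 - m)` used in the example are also an assumption.

```python
def run_combined(lr, m, a, b, horizon, delay, steps, curvature=1.0, w0=1.0):
    """Delayed SGDM with coefficients (a, b) and weight difference prediction
    on the loss 0.5 * curvature * w**2. Returns the weights and gradients."""
    weights, grads, v = [w0], [], 0.0
    for t in range(steps):
        w_d = weights[max(t - delay, 0)]        # w[t-D]; history before t=0 is w0
        w_d1 = weights[max(t - delay - 1, 0)]   # w[t-D-1]
        w_hat = w_d + horizon * (w_d - w_d1)    # predicted weights
        g = curvature * w_hat                   # gradient of the quadratic at w_hat
        v = m * v + g
        weights.append(weights[-1] - lr * (a * v + b * g))
        grads.append(g)
    return weights, grads


def transition_residual(weights, grads, lr, m, a, b):
    """Largest violation of the state transition equation for t >= 1."""
    worst = 0.0
    for t in range(1, len(grads)):
        rhs = ((1 + m) * weights[t] - m * weights[t - 1]
               - lr * (a + b) * grads[t] + lr * m * b * grads[t - 1])
        worst = max(worst, abs(weights[t + 1] - rhs))
    return worst


if __name__ == "__main__":
    lr, m, delay = 0.01, 0.9, 4
    a = m ** delay                       # assumed spike compensation coefficients
    b = (1 - m ** delay) / (1 - m)
    weights, grads = run_combined(lr, m, a, b, horizon=delay, delay=delay, steps=100)
    print("max residual:", transition_residual(weights, grads, lr, m, a, b))
    equivalent_horizon = m * b           # horizon T satisfying m * b = T
    print("a + b =", a + b, " T + 1 =", equivalent_horizon + 1)
```

For the assumed coefficients the equivalent horizon is `m + m**2 + ... + m**D`, the displacement factor obtained when the velocity decays geometrically with no new gradients.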
Appendix E Extended Weight Prediction Horizons
In Section 3.5 we discuss how overcompensating for delays can help improve convergence speed. One way to do this is to predict weights more than $D$ (the delay) steps into the future with linear weight prediction. Figure 12 shows the effect of scaling the weight prediction horizon on the convergence rate when optimizing a convex quadratic. We see that horizon lengths of around $T \approx 2D$ seem to give the best results.
We repeated this experiment for ResNet20 (with group normalization) trained on CIFAR10. We used a delay of $D = 4$ for all layers with consistent weights and a batch size of 32 for a total delay of 128 samples (which is in the range of many of our CP experiments). The learning rate and momentum were scaled according to equation 9 using the default reference values referenced in the experiments section. The results can be seen in Figure 13. We can see that the training loss curve looks somewhat similar to the convergence speed for the convex quadratic, with the lowest loss obtained for $T \approx 2D$. The validation accuracy also peaks for $T \approx 2D$.
We also test this hypothesis in the Pipelined Backpropagation setting. We explore the use of weight prediction with a horizon which is double that of the delay (LWP$_{2D}$). We also experiment with overcompensating for the delay by doubling the effect of Spike Compensation (SC$_{2D}$, which replaces $D$ with $2D$ in equation 14). We observe that overcompensating can improve the final accuracy in most cases (Table 4). We note that in these networks weight inconsistency does not seem to be an issue (see Appendix B). In cases where weight inconsistency is an issue, doubling the prediction horizon can reduce training stability. The same may apply to networks with large delays. One such example may be training ResNet110 on CIFAR10 (Table 4), where standard weight prediction outperforms methods which overcompensate for delay.
Appendix F Effects of Momentum Scaling
Throughout this work we heuristically scale the momentum and learning rate for small batch size training according to equation 9. This enables us to use pipelined backpropagation without hyperparameter tuning for existing networks, which is important for the practicality of PB training. These rules increase the momentum significantly compared to other heuristics which might keep it constant or lower it. In Section 3.5 we show that momentum loses some of its benefits with delays. However, our compensation methods, Spike Compensation and Linear Weight Prediction, likely benefit from high momentum. In this section we look at the effects of different momentum values, while keeping the total contribution from each gradient the same. We do this by selecting a specific value of $m$ in equation 9 (ignoring the first expression) and then scaling the learning rate according to the second expression.
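The procedure can be sketched as follows. Equation 9 is not reproduced in this appendix, so the code assumes it has the form `m = m_ref ** (B / B_ref)` and `lr = (1 - m) * B / ((1 - m_ref) * B_ref) * lr_ref`, which keeps the total per-sample contribution `lr / ((1 - m) * B)` constant. The function names and the example reference values are hypothetical.

```python
def lr_for_momentum(lr_ref, m_ref, batch_ref, batch, m):
    """Learning rate that keeps lr / ((1 - m) * batch) equal to the reference value."""
    return (1 - m) * batch / ((1 - m_ref) * batch_ref) * lr_ref


def scale_hyperparameters(lr_ref, m_ref, batch_ref, batch):
    """Assumed form of the scaling rule: scale the momentum, then the learning rate."""
    m = m_ref ** (batch / batch_ref)
    return lr_for_momentum(lr_ref, m_ref, batch_ref, batch, m), m


def per_sample_contribution(lr, m, batch):
    """Total weight change caused by one sample's gradient over all future steps."""
    return lr / ((1 - m) * batch)


if __name__ == "__main__":
    lr_ref, m_ref, batch_ref, batch = 0.1, 0.9, 128, 8   # hypothetical reference values
    print("default scaling:", scale_hyperparameters(lr_ref, m_ref, batch_ref, batch))
    # Momentum sweep: pick m directly and rescale only the learning rate.
    for m in (0.0, 0.5, 0.9, 0.99):
        lr = lr_for_momentum(lr_ref, m_ref, batch_ref, batch, m)
        print(m, lr, per_sample_contribution(lr, m, batch))
```

Every row of the sweep has the same per-sample contribution, so differences between runs can be attributed to the momentum value rather than to the effective step size.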
The experiments involve training ResNet20 (with group normalization) on CIFAR10. We use a batch size of 8 and a delay of 12 for all layers for a total delay of 96 samples (which is in the range of many of our CP experiments). Figure 13(a) shows this when consistent weights are used. We can see that for the baseline with no delay a wide range of momentum values can be used, including no momentum, but very large values cause accuracy loss. With delay, small values of momentum are better and the accuracy falls off relatively quickly for larger values. With our compensation methods the best accuracy is obtained for large momentum values. Spike compensation has no effect for low (zero) momentum values and therefore matches the delayed baseline for small momentum values. Weight prediction for small momentum values tries to predict future weights based on recent gradients without sufficient smoothing and performs worse than the baseline. The combined mitigation exceeds the best results for the no-delay baseline for a range of large momentum values.
Figure 13(b) shows the same experiment performed with inconsistent weights (using the most recent weights on the backward pass instead of the delayed weights used on the forward pass). Most of the observations from the previous experiment hold in this case as well. The most notable difference is the poor performance of all methods when low momentum is used. This suggests that small momentum values adversely affect weight consistency. These runs do not use a tuned learning rate or a learning rate warmup, which could likely help stabilize lower momentum values. Using our formulation of momentum causes a warmup in the step size while the velocity is building up. This effect could contribute to larger momentum values performing better. Another factor may be the exponential smoothing of weight updates with momentum. Without this, a couple of relatively large gradients could cause a large weight inconsistency for some time steps, potentially destabilizing training.
Appendix G Computational Setup
G.1 Simulating Pipelined Backpropagation on GPUs
One of the goals of this work is to explore PB training of modern deep networks such as ResNet50. In particular we are interested in simulating fully pipeline parallel training with a maximal number of pipeline stages and no batch parallelism. For ResNet50 this results in about 150 stages if we naively make every convolution, normalization, and non-linearity into a stage. Combining convolution, normalization, and ReLU into one stage still makes ResNet50 a 50 stage network. Most modern deep learning frameworks are not well suited for such experimentation. To enable efficient simulation of fully pipeline parallel training, we built a mini-framework, GProp. GProp is implemented in C++ using cuDNN (Chetlur et al., 2014) kernels and Thrust.
Overall fine-grained pipeline parallelism is not very efficient on GPUs. While speed is a consideration, the goal is not to be competitive with data parallel training on GPUs. We only aim to simulate pipeline parallel training and evaluate its potential as an alternative to data parallel training. As discussed before, other compute architectures could reap significant benefits from pipeline parallelism. In this section we discuss some of the implementation details and some of the potential limitations of GPUs at batch size one training.
Compute Utilization. At small batch sizes, the amount of computation per kernel might be insufficient to utilize all compute resources. With pipeline parallelism a large number of kernels can run in parallel. Launching multiple kernels in parallel can significantly increase compute utilization.
Kernel Launch Rate. The compute throughput of the GPU is equal to the rate at which kernels are launched multiplied by the work done by each kernel. As the work per kernel decreases, the kernel launch rate must increase to maintain compute throughput. Among other factors, the work depends on the batch size. As described previously, launching kernels in parallel can compensate for the decrease in work per kernel caused by batch size one training. For smaller networks the work done per kernel is generally less, therefore the kernel launch rate must be higher for good utilization. GPUs have a kernel launch rate limit which can become the training bottleneck. This is an issue for smaller networks such as ResNet20 for CIFAR10.
Bandwidth Limitations. Without significant weight reuse, GPUs become memory bandwidth limited. For convolutional layers the weights are reused over the spatial and batch dimensions. Weight reuse increases as the spatial dimensions of the inputs increase. This makes bandwidth less of an issue for ImageNet (i1k) scale networks when compared to CIFAR10 scale networks.
There are a few other challenges to small batch sized training. At small batch sizes optimizer overheads become significant. Each optimizer step requires loading the entire model, consuming significant memory bandwidth. At large batch sizes this is amortized over the batch size. For a batch size of one the optimizer steps consume a large fraction of the total memory bandwidth. Similarly, the time required for any new memory allocations cannot be amortized over the batch size.
In GProp the network is split into structures we call stages that act as pipeline stages. Each stage manages all resources needed to compute the forward and backward passes for the corresponding part of the network (Figure 15). In our experiments we sometimes group several components together into a single stage. One example of this is grouping convolution, normalization, and ReLU into a stage.
GProp uses CUDA streams to run the stages in parallel. GProp also supports splitting the network over multiple GPUs and uses a different thread to launch the stages on each GPU. We found that using multiple threads to launch kernels on a single GPU did not raise the kernel launch rate limit. We suspect this is potentially due to some sort of locking mechanism in cuDNN.
Validation of the GProp framework using CIFAR10 VGG11. Mean of ten runs (shading shows the standard deviation).
G.2 Simulating Delayed Gradients
Weight inconsistency and delayed gradients are potential issues in pipelined backpropagation. To better understand these issues we simulated weight inconsistency and delayed gradients in a PyTorch (Paszke et al., 2019) environment using a modified optimizer. The modified optimizer has a buffer of old parameter values. To apply a delay $D$, the model is loaded with parameters from $D$ time steps ago and a forward and backward pass is performed. The resulting gradients are then used to update a master copy of the weights. Weight inconsistency is simulated by loading the model with parameters from $D$ time steps ago, doing the forward pass, then loading the model with the master weights before doing the backward pass. While this is not an exact model of PB, this setup allows for the simulation of PB’s issues and fast iteration on potential methods to overcome them. This technique can also be used to simulate PB by having different delays for different layers based on the depth of the layer. This simulation method does not allow simultaneously launching multiple kernels and is therefore not efficient for small batch sizes. Our simulations are done using a constant delay across layers. This upper bounds the effect of weight inconsistency and delayed gradients. This setup can also be used to simulate ASGD training by making $D$ a random variable which models the distribution of GPU communications with the master node in ASGD.
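The buffering scheme described above can be illustrated without PyTorch. The following pure-Python sketch is not the modified optimizer used in the experiments. It uses a hypothetical two-weight linear model `y = w2 * w1 * x` with a squared error loss, a fixed delay for both weights, and plain gradient descent, all chosen only to keep the example short.

```python
from collections import deque


def train_with_delay(delay, consistent, steps=200, lr=0.05, x=1.0, target=2.0):
    """Gradient descent on 0.5 * (w2 * w1 * x - target)**2 with delayed gradients."""
    master = [0.5, 0.5]                                   # master copy of (w1, w2)
    history = deque([tuple(master)], maxlen=delay + 1)    # buffer of old parameters
    for _ in range(steps):
        # Oldest buffered parameters: `delay` steps old once the buffer is full.
        w1_old, w2_old = history[0]
        # Forward pass with the delayed parameters.
        hidden = w1_old * x
        error = w2_old * hidden - target
        # Backward pass: delayed weights (consistent) or master weights (inconsistent).
        w2_bwd = w2_old if consistent else master[1]
        grad_w2 = error * hidden
        grad_w1 = error * w2_bwd * x
        # Apply the delayed gradients to the master copy.
        master = [master[0] - lr * grad_w1, master[1] - lr * grad_w2]
        history.append(tuple(master))
    return master


if __name__ == "__main__":
    for delay in (0, 4):
        for consistent in (True, False):
            w1, w2 = train_with_delay(delay, consistent)
            print(f"delay={delay} consistent={consistent} w1*w2={w1 * w2:.6f}")
```

With a delay of zero the buffer always returns the master weights, so the consistent and inconsistent variants coincide. Replacing the fixed `delay` with a per-layer value or a random draw would correspond to the PB and ASGD variants mentioned above.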
Appendix H Experiment Details
H.1 VGG Experiments
H.2 GProp Validation
To validate our framework implementation, we compare batch parallel SGD and fill & drain SGD training. We trained each setting, as well as the same network in PyTorch, 10 times to validate similar behavior. Figure 16 shows the optimization of the different SGD training modes for the first 20 epochs. Numerical precision, network initialization, and data loading / augmentation randomness make a numerical comparison of distinct runs impractical. Instead we show the mean and standard deviation of 10 runs. The different SGD modes in GProp are consistent and also match PyTorch’s SGD convergence.
He et al. (2016) modified the original ResNet formulation given by He et al. (2016) by introducing the ResNet pre-activation block. We adopt the hyperparameters and data preprocessing from He et al. (2016). Our experiments are done at batch size one where Batch Normalization is not effective. We replace Batch Normalization with Group Normalization. For ImageNet ResNet50 training, we used an initial group size of two as outlined in the Group Normalization paper. Wu and He (2018) do not tune Group Normalization for CIFAR10 training. We use the same initial group size of two for our CIFAR10 experiments.
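Since Group Normalization computes its statistics within a single sample, it remains well defined at batch size one. The sketch below is a simplified pure-Python illustration for one sample. It interprets a group size of two as two channels per group, which is an assumption, and it omits the learnable scale and shift parameters of the full layer. The function name and input layout are hypothetical.

```python
import math


def group_norm(channels, group_size=2, eps=1e-5):
    """Normalize one sample; `channels` is a list of per-channel value lists."""
    if len(channels) % group_size != 0:
        raise ValueError("number of channels must be divisible by group_size")
    out = []
    for start in range(0, len(channels), group_size):
        group = channels[start:start + group_size]
        values = [x for ch in group for x in ch]
        # Statistics are shared by all channels and positions in the group.
        mean = sum(values) / len(values)
        var = sum((x - mean) ** 2 for x in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(x - mean) * scale for x in ch] for ch in group])
    return out


if __name__ == "__main__":
    sample = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.0, 0.0, 1.0], [2.0, 2.0, 2.0]]
    for channel in group_norm(sample):
        print([round(x, 3) for x in channel])
```

No batch dimension appears anywhere in the computation, which is why the result does not depend on how many samples are processed together.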
H.4 Hyperparameter comparison
As mentioned in Section 3.1 we use the hyperparameters published in (He et al., 2016) and scale them using the rules described by Chiley et al. (2019). Figures 16(a) and 16(b) show that the hyperparameters produced using these scaling rules result in training curves similar to the reference when training VGG11 on the CIFAR10 dataset.
H.5 LWP$^v$ vs LWP$^w$
Table 6 shows the results of using the two variants of LWP. When combined with SC, LWP$^v$ outperforms LWP$^w$. When the weight form is used, the most recent gradient has a large effect on the velocity estimate used for the weight prediction. For small batch sizes this estimate might be noisy, decreasing the effectiveness of LWP. A similar effect can be observed for LWP in general (Appendix F) when very small momentum values are used, which also leads to noisy predictions.