, and natural language processing (NLP) [devlin2018bert, vaswaniattention]. As models and datasets have grown in complexity, training times have increased significantly [tan2019efficientnet, devlin2018bert]. To tackle this challenge, data-parallel approaches are widely used to accelerate the training of DNN models [dean2012large]. To scale data-parallelism to more workers while preserving the computational efficiency of each worker, it is important to increase the overall batch size proportionally with the number of workers. However, increasing the batch size often leads to a significant loss in test accuracy, a problem remedied by a number of recent ideas, including increasing the learning rate during the training process as well as a learning-rate warm-up procedure [goyal2017accurate, you2017scaling, you2019large]. Using these techniques, large batch size training has been successfully applied to state-of-the-art distributed systems [jia2018highly, ott2018scaling]. However, increasing evidence suggests that there is a maximum mini-batch size beyond which the number of iterations required to converge increases [ma2017power]. Furthermore, driven by recent advances in low-precision arithmetic [gupta2015deep, wang2018training, sun2019hybrid], there has been a renaissance in the computational capability of deep learning training hardware, resulting in accelerator throughputs exceeding 100s of TeraOps/s [fleischer2018scalable, nvidia, dean20201, ibmchip]. This dramatic increase in throughput can cause an imbalance between computation and communication, resulting in large-scale training platforms that are severely communication constrained.
To mitigate these communication bottlenecks in DNN training, several gradient compression techniques have been proposed [seide20141, strom2015scalable, chen2018adacomp, lin2017deep]. Most of these techniques exploit error feedback or 'local memory' (preserving gradient residues from compression) to demonstrate significant compression rates and good convergence properties. However, current error-feedback gradient compression techniques cannot be directly applied to large-scale distributed training. There are two primary challenges. (a) Gradient build-up: As addressed in [ivkin2019comm, yu2018gradiveq, vogels2019powersgd, hongkong2019gtopk], compressed data can be gathered, but not reduced. This results in a dramatically decreased compression rate as the number of workers increases. (b) Large batch size with scaled learning rate: As shown in [stich2018sparsified], for a convex problem, the noise term in the error-feedback gradient increases as the cube of the learning rate. [alistarh2018convergence] also shows that an increased learning rate can add large noise to error-feedback gradient compression in non-convex and distributed settings. Thus, the scaled learning rates needed for large-batch training can significantly increase gradient noise and cause performance degradation (or even divergence), particularly for complex models and datasets.
In this paper, we propose a new gradient compression algorithm, ScaleCom, that provides solutions to both of these challenges. ScaleCom provides significant compression rates (65-400X) while enabling convergence in large-scale distributed training (64 workers). To the best of our knowledge, this is the first compression algorithm that has been extensively evaluated on large datasets and batch sizes and shown to be fully compatible with conventional all-reduce schemes, as shown in Table 1.
| Compressor | scalability | overhead (FLOPs/element) | compr. rate | convergence | empirical exp. | LB |
|---|---|---|---|---|---|---|
| Top-k [strom2015scalable, dryden2016communication] |  | (sort) | >100X | not guaranteed | broadly tested | no |
| AdaComp [chen2018adacomp] |  | (quasi-sort) | 40-200X | not guaranteed | broadly tested | no |
| DGC [lin2017deep] |  | (sample-based sort) | 270-600X | not guaranteed | broadly tested | no |
| PowerSGD [vogels2019powersgd] |  | low-rank approximation | 40-128X | not guaranteed | small datasets | yes |
| gTop-k [hongkong2019gtopk] |  | local top-k merge | >100X | not guaranteed | up to 6% degrad. | no |
| ScaleCom (ours) | constant | (chunk-wise sort) | 65-400X | guaranteed | broadly tested | yes |

n is the model size. Complexities hold unless an explicit assumption is made. H(.) is the hash-function computation and r is the number of rows of the sketch table. "Broadly tested" includes a wide range of applications with large datasets. LB: large batch size training with a scaled learning rate.
1.1 Challenges and Related Works
Error-feedback gradient compression and all-reduce: Error-feedback gradient compression was first introduced in [seide20141] and later widely applied across application domains [strom2015scalable, chen2018adacomp, lin2017deep, dryden2016communication]. The error-feedback gradient (also referred to as the "residue" [chen2018adacomp] or local memory) is the difference between a worker's computed gradient and its compressed gradient. When compressed gradients from multiple workers are sent to a centralized parameter server (for reduction), they cause a "gradient build-up" problem. Specifically, as shown in Figure 1(a), since different workers pick different gradients during compression, the overall compression ratio for the accumulated gradients decreases linearly with the number of workers $n$. This effect is especially dramatic in large-scale distributed systems, as shown in Figure 1(b). Recently, there has been a body of work focused on the gradient build-up issue. [yu2018gradiveq] emphasizes the importance of commutability in gradient compression to enable efficient aggregation in ring all-reduce. [vkj2019powerSGD] proposed low-rank methods for error-feedback gradient compression that reduce the aggregation complexity. [ivkin2019comm] used the reduction property of sketch tables to achieve 40X compression rates. [tang2019doublesqueeze] applied double compression to achieve linear speedup. [hongkong2019gtopk] merged each worker's top-$k$ elements to approximate the all-reduce of the global top-$k$ elements. In spite of all these efforts, none of these techniques have been shown to comprehensively work on large models, large datasets, and large numbers of workers with the desired constant complexity.
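To make the build-up effect concrete, the following is a minimal, self-contained Python sketch (the function names are ours, not from any library) of local top-k error-feedback compression. Because each worker selects its own indices, the summed (reduced) gradient has close to n*k non-zeros rather than k, so the effective compression ratio drops roughly n-fold:

```python
import random

def top_k_sparsify(vec, k):
    """Keep the k largest-magnitude entries of vec, zero the rest."""
    idx = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for i in idx:
        out[i] = vec[i]
    return out

def ef_compress(grad, memory, k):
    """Error feedback: add the residue, compress, keep the rest as new residue."""
    acc = [m + g for m, g in zip(memory, grad)]
    comp = top_k_sparsify(acc, k)
    new_memory = [a - c for a, c in zip(acc, comp)]
    return comp, new_memory

# Gradient build-up: different workers pick different indices, so the union
# of non-zeros in the reduced (summed) gradient grows with the worker count n.
random.seed(0)
d, k, n = 1000, 10, 16
reduced = [0.0] * d
for _ in range(n):
    grad = [random.gauss(0, 1) for _ in range(d)]
    comp, _ = ef_compress(grad, [0.0] * d, k)
    reduced = [r + c for r, c in zip(reduced, comp)]
nnz = sum(1 for v in reduced if v != 0.0)
# nnz approaches n * k (here up to 160), not k.
```
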
Large batch size training: Furthermore, many of these compression techniques have not been shown to work well in large batch size training scenarios, where communication bottlenecks limit system performance and scalability. [chen2018adacomp] and [lin2017deep] scaled mini-batch sizes by 8X and achieved baseline accuracies for CIFAR10 models. Similarly, [vkj2019powerSGD] linearly scaled the learning rate and batch size by 16X and reduced communication time by 54% on ResNet18 (CIFAR10). Overall, most recent studies have primarily focused on small datasets, and it remains unclear whether gradient compression techniques work well on large models and datasets. As shown in Figure 1(c), we observe that a naive error-feedback gradient compression scheme [strom2015scalable] can cause significant accuracy degradation in large batch size training scenarios (Transformer on WMT14 En-De).
Convergence analyses: Prior works provided convergence analyses for error-feedback gradient compression in both convex and non-convex optimization contexts and show convergence similar to traditional stochastic gradient descent (SGD). The results suggest that the essence of network convergence is the contraction property of compressors, defined as the "energy" preserved in the compressed gradients relative to the full gradients, as shown in Eqn. (4) of [stich2018sparsified]. The results show that both random-$k$ and top-$k$ compression can achieve convergence properties similar to SGD. Later, [shi2019undtopk] reported the advantages of the top-$k$ compressor. Recent analyses [karimireddy2019error] also proved that error feedback can enable biased gradient compressors to reach the target test accuracy with high compression rates. In theory, compressors are quite flexible (biased or unbiased).
In this paper, we introduce a new gradient compression technique, ScaleCom, that resolves the two important issues central to scalability: (i) it enables compression to work effectively with all-reduce, and (ii) it is applicable to large batch size training on large datasets. In comparison to existing compression methods, our primary contributions include:
We explore local memory (error feedback) similarity across workers and use this property to design a commutative compressor, which we call cyclic local top-$k$ (CLT-$k$). The CLT-$k$ operator solves the gather (gradient build-up) issue and is compatible with all-reduce operations.
To apply gradient compression in large batch size training, we propose a novel low-pass filter for local memory updates. This filter cleans out disruptive noise and enhances local memory similarity. Thus, our filter scales the CLT-$k$ compressor to much larger-scale distributed training.
We present theoretical analysis to show that ScaleCom guarantees the same convergence rate as SGD and enjoys linear speedup with the number of workers. ScaleCom mitigates gradient noise induced by scaled learning rates and keeps communication cost constant with the number of workers. Moreover, we have also observed that ScaleCom has similar convergence properties as the ideal (but impractical) true top-$k$ compression.
Experimentally, we have verified that ScaleCom shows no degradation across a wide range of applications (datasets) including vision (ImageNet), language (WMT), and speech (SWB300), in both standard (8 workers) and large batch size (64 workers) training.
2 Gradient Sparsification in All-Reduce
A commutative compressor between gradient averaging and sparsification, following definition (1), is desired for communication-efficient distributed training. There are two advantages of commutative compressors: (i) theoretically, with this setting, error-feedback gradient compression has convergence guarantees [alistarh2018convergence], and (ii) it resolves the 'gradient build-up' issue and keeps communication cost constant with the number of workers [yu2018gradiveq].
Besides commutativity, recent studies [lin2017deep, stich2018sparsified, alistarh2018convergence, shi2019undtopk] suggest that the top-$k$ compressor has good contraction properties and test accuracies from both theoretical and empirical perspectives. Thus, an optimized compressor should have both (i) the commutative property and (ii) the top-$k$ contraction property. To satisfy these, we designed our compressor based on the following two observations:
(i) Memory similarity: Although local memory (gradient residue) is never exchanged amongst workers, it is correlated in the sense that local gradients are computed from samples drawn from the same training set. Figure 2(a) shows the pairwise cosine distance (between workers 0 and 1) of local memory in the first 90 iterations of ResNet18 (CIFAR10) with the conventional local top-$k$ compressor (top-0.1% is used) [strom2015scalable]. (The cosine distance of two real-valued vectors $a$ and $b$ is defined as $1 - \frac{a^\top b}{\|a\|\,\|b\|}$.) The cosine distance decreases quickly over the iterations, i.e., local memory similarity improves quickly and stays correlated over much of the training process (Appendix-A shows different statistical metrics). Finally, we observe that this phenomenon is agnostic to increasing worker count when the learning rate and per-worker batch size stay the same, as shown in Figure 2(a).
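The cosine-distance metric used in Figure 2(a) can be computed as follows (a straightforward sketch of the standard definition; the function name is ours):

```python
import math

def cosine_distance(a, b):
    """1 - cos(angle) between two real-valued vectors:
    0 for perfectly aligned residues, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

# Identical (parallel) residues -> ~0; orthogonal residues -> 1.
aligned = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```
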
(ii) True vs. local top-$k$: The local memory similarity amongst workers offers a critical insight: a local worker's top-$k$ indices may be used to approximate the true top-$k$ indices. In Figure 2(b), the area under the blue curve represents the all-reduced error-feedback gradient magnitudes (the sum of local memory and newly computed gradients); the area to the right of the grey line corresponds to its top-$k$ entries (i.e., the true top-$k$). The true top-$k$ area overlaps more than 70% with the red histogram representing the local top-$k$ of worker 0, suggesting that true top-$k$ and local top-$k$ have sufficiently overlapping indices and similar contraction properties.
Cyclic Local Top-$k$ (CLT-$k$) Compressor:
Based on the similarity between local memories, we propose a novel, efficient commutative compressor for all-reduce distributed training: cyclic local top-$k$ (CLT-$k$). It works as follows: in each iteration, we sequentially select a leading worker in cyclical order. The leading worker sorts its error-feedback gradient and obtains its local top-$k$ indices. All other workers follow the leading worker's top-$k$ index selection for compressing their own local error-feedback gradients. Formally, the CLT-$k$ compressor is described as follows.
Let $\mathbb{I}(x)$ denote the index set corresponding to the indices of the $k$ largest entries (in magnitude) of a vector $x \in \mathbb{R}^d$. To be more specific, the set is defined by
$$\mathbb{I}(x) = \big\{ i : |x_i| \ge |x_{[k]}| \big\},$$
where $x_{[k]}$ denotes the $k$-th largest entry of $x$ in magnitude. Suppose that there are $n$ vectors $x^1, \ldots, x^n$. Then, we have $n$ local top-$k$ sets, i.e., $\mathbb{I}(x^1), \ldots, \mathbb{I}(x^n)$. For a vector $y$, the proposed CLT-$k$ compressor with worker $i$ as the leader, denoted by $\mathrm{CLT}_k^i(y)$, is defined entry-wise as
$$\big[\mathrm{CLT}_k^i(y)\big]_j = \begin{cases} y_j, & j \in \mathbb{I}(x^i) \\ 0, & \text{otherwise.} \end{cases}$$
Remark 1. Note that when $y = x^i$, CLT-$k$ reduces to the classical top-$k$ operator on $x^i$. When $y \ne x^i$, CLT-$k$ sets to 0 the entries of $y$ whose indices do not belong to $\mathbb{I}(x^i)$.
Remark 2. It is easy to verify that (3) satisfies the commutative property in (1). Moreover, Figure 2(b) suggests that the histogram of the error-feedback gradient of CLT-$k$ highly overlaps with that of top-$k$. Thus, the proposed CLT-$k$ compressor features an efficient implementation in all-reduce, has the desired commutative property, and shares a similar contraction property with true top-$k$.
Remark 3. We note that the proposed compressor can naturally be extended to ring all-reduce settings.
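A minimal sketch of the CLT-k operator (our own illustrative code, not the paper's implementation), checking the commutative property on a toy example: averaging then compressing equals compressing then averaging, because every worker zeroes out exactly the same index set chosen by the leader:

```python
def top_k_indices(vec, k):
    """Index set of the k largest-magnitude entries of vec."""
    return set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])

def clt_k(vec, leader_idx_set):
    """CLT-k: keep only the entries whose indices the leader selected."""
    return [v if i in leader_idx_set else 0.0 for i, v in enumerate(vec)]

# Commutativity with averaging: because every worker zeroes out the SAME
# indices, compress-then-average equals average-then-compress.
workers = [[0.1, -3.0, 0.5, 2.0], [0.2, -2.5, 0.4, 1.0]]
leader = top_k_indices(workers[0], k=2)          # leading worker's local top-k
avg_then_comp = clt_k([(a + b) / 2 for a, b in zip(*workers)], leader)
comp_then_avg = [(a + b) / 2 for a, b in
                 zip(clt_k(workers[0], leader), clt_k(workers[1], leader))]
```

Note that the reduced result carries at most k non-zero entries, which is what makes CLT-k directly summable inside all-reduce.
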
Low-Pass Filtering in Memory Accumulation:
Large batch size training schemes usually require significantly scaled-up learning rates. As shown in Figure 2(c), when the learning rate is increased from 0.01 to 1 (100X), the cosine distance becomes much larger (orange line), suggesting drastically reduced local memory similarity, which may degrade the performance of the CLT-$k$ compressor. Moreover, the scaled learning rate causes rapid model changes and incurs larger gradient noise, which makes it more difficult to compress gradients in large batch size settings. To address these challenges, we propose to apply low-pass filtering [dsp] to the local memory accumulation. This low-pass filtering is one kind of weighted error-feedback technique [gtech2020compress, tencent2018compress], but it focuses on large batch size training and aims to mitigate noise from the incoming residual gradients. Our filter passes the signals of computed gradients with smoother changes and attenuates the gradient noise caused by rapid model changes, which (i) mitigates undesirable noise caused by the scaled learning rate, and (ii) improves local memory similarity among workers. Formally, our method is described as follows. Assuming $n$ workers, the distributed training problem is
$$\min_{\theta}\; f(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta), \qquad f_i(\theta) = \mathbb{E}_{\xi \sim \mathcal{D}_i}\big[\ell(\theta; \xi)\big],$$
where $f_i$ denotes the objective function at the $i$-th worker, $\theta$ is the optimization variable (the weights of the neural net), $\xi$ represents a sample at node $i$, and $\mathcal{D}_i$ stands for the data distribution. This work focuses on fully-synchronized distributed training, so the data distributions at different nodes are identical. Let $\mathcal{B}_i^t$ denote the mini-batch at the $i$-th worker at iteration $t$; the gradient estimate is written as
$$g_i^t = \frac{1}{|\mathcal{B}_i^t|} \sum_{\xi \in \mathcal{B}_i^t} \nabla \ell(\theta_t; \xi),$$
where $\nabla \ell(\theta_t; \xi)$ denotes the gradient of the loss function w.r.t. sample $\xi$ at node $i$, and $|\mathcal{B}_i^t|$ is the batch size of the sampled data at the $i$-th worker. Here we use $m_i^t$ as the gradient residue (local memory) at the $i$-th worker and $\tilde{g}_i^t$ as the compressed gradient after scaling by the step size $\gamma$. These quantities are computed locally, and $\tilde{g}_i^t$ is sent to update the shared weights $\theta$. Then, the low-pass filter on memory can be written as
$$m_i^{t+1} = \beta \big( m_i^t + \gamma g_i^t - \tilde{g}_i^t \big),$$
where $\beta$ is the discounting factor ($0 < \beta \le 1$), and $t$ is the iteration index. Empirically, we verify that the use of low-pass filters improves the similarity among local memories for CLT-$k$ in the case of a scaled learning rate, as shown by the green and red lines in Figure 2(c). Figure 2(d) shows that when the learning rate is significantly increased (100X), with the use of the low-pass filter, our CLT-$k$ compressor can still maintain sufficient area overlap in the histograms with true top-$k$ compressors, providing a necessary and desirable contraction property for robust and scalable training. It should be noted that, intuitively, this filtering method has a connection to momentum SGD: momentum SGD can be viewed as a form of filtering (moving average) on current and past gradients, which smooths out noisy gradients to update the weights more accurately. Analogously, we perform filtering on the residual gradients to improve signal integrity in local memory.
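As a sketch (under our reading of the filtered memory update above; the function name is ours), the residue update and its effect for beta = 1 versus beta = 0.1:

```python
def filtered_memory_update(memory, grad, compressed, beta, lr):
    """Low-pass-filtered residue update: beta = 1 recovers plain error
    feedback; beta < 1 attenuates noise before it enters local memory."""
    return [beta * (m + lr * g - c) for m, g, c in zip(memory, grad, compressed)]

# One step with lr = 1: accumulated value [1, 2], compression keeps only index 1.
plain    = filtered_memory_update([0.0, 0.0], [1.0, 2.0], [0.0, 2.0], beta=1.0, lr=1.0)
filtered = filtered_memory_update([0.0, 0.0], [1.0, 2.0], [0.0, 2.0], beta=0.1, lr=1.0)
# plain keeps the full residue on index 0; filtered damps it 10x.
```
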
3 Scalable Sparsified Gradient Compression (ScaleCom)
In this section, we describe the details of our algorithm, ScaleCom, and its convergence properties. In ScaleCom, each worker first applies the CLT-$k$ compressor as shown in (3). Sparsified data is directly added (reduced) across workers (integrated with all-reduce), avoiding 'gradient build-up'. After all-reduce, each worker applies a low-pass filter to its local gradient accumulation, improving the workers' memory similarity and smoothing out abrupt noise induced by scaled learning rates. For simplicity, we use the parameter-server protocol to explain our algorithm, but it can naturally be extended to ring all-reduce implementations. The whole process is summarized in Algorithm 1, where $t$ denotes the iteration index. In the rest of this section, we provide formal convergence properties for ScaleCom (see Appendix-C for proofs and Appendix-D for details of the convergence analysis and theory exposition).
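The per-iteration flow can be sketched end-to-end (an illustrative simulation under our reading of the algorithm; all names are ours, not the paper's code). Because all workers compress with the leader's index set, the reduced gradient carries exactly k non-zero entries regardless of the worker count:

```python
import random

def one_scalecom_step(memories, grads, k, beta, lr, leader):
    """One ScaleCom iteration (sketch): the leader's local top-k index set is
    shared, all workers compress with it, the sparse gradients are reduced,
    and each worker applies the low-pass filter to its residue."""
    acc = [[m + lr * g for m, g in zip(mem, gr)] for mem, gr in zip(memories, grads)]
    # Leader's local top-k indices, followed by every worker.
    idx = set(sorted(range(len(acc[leader])),
                     key=lambda i: abs(acc[leader][i]), reverse=True)[:k])
    comp = [[v if i in idx else 0.0 for i, v in enumerate(a)] for a in acc]
    # All-reduce (average) of the commonly-indexed sparse gradients.
    reduced = [sum(c[i] for c in comp) / len(comp) for i in range(len(comp[0]))]
    # Low-pass-filtered residue update per worker.
    new_memories = [[beta * (a[i] - c[i]) for i in range(len(a))]
                    for a, c in zip(acc, comp)]
    return reduced, new_memories

random.seed(1)
d, k, n = 200, 5, 8
mems = [[0.0] * d for _ in range(n)]
grads = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
reduced, mems = one_scalecom_step(mems, grads, k, beta=0.1, lr=0.05, leader=0)
nnz = sum(1 for v in reduced if v != 0.0)
# nnz == k independent of n: no gradient build-up.
```
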
We establish the contraction property of the CLT-$k$ compressor based on the Hamming distance, which measures the overlap of two index sets. Suppose $\mathbb{I}$ is a set of indices of a vector $x \in \mathbb{R}^d$. Define a binarized vector $b(\mathbb{I}) \in \{0,1\}^d$ as follows: $b(\mathbb{I})_j = 1$ if $j \in \mathbb{I}$; otherwise, $b(\mathbb{I})_j = 0$. Suppose $\mathbb{I}_1$ and $\mathbb{I}_2$ are two sets of indices. The Hamming distance between the two sets is the number of positions at which the binarized vectors $b(\mathbb{I}_1)$ and $b(\mathbb{I}_2)$ differ; normalizing by $2k$ gives the normalized Hamming distance $D_H(\mathbb{I}_1, \mathbb{I}_2) \in [0, 1]$.
Suppose $x$ is a vector whose top-$k$ index set is $\mathbb{I}^*$, and $x$ is sparsified by another index set $\mathbb{I}$. If $D_H(\mathbb{I}, \mathbb{I}^*) \le \delta$, we have the following contraction property for this compressor $C$: $\|x - C(x)\|^2 \le (1 - \gamma_C)\,\|x\|^2$, where $\gamma_C$ depends on $\delta$ and on $\gamma_k$, the contraction coefficient of top-$k$ sparsification.
We can see that, depending on $\delta$, the contraction coefficient satisfies $0 < \gamma_C \le \gamma_k$. Specialized to the proposed CLT-$k$ compressor, in each iteration an index set is generated from a local worker in a cyclic fashion. Let $\bar{x}_t$ denote the averaged error-feedback gradient among all workers. We assume $D_H(\mathbb{I}(x_t^i), \mathbb{I}^*(\bar{x}_t)) \le \delta$, which indicates that there exists a minimal overlap between the local top-$k$ indices from worker $i$ and the global true top-$k$ given by $\bar{x}_t$. It follows that the CLT-$k$ compressor satisfies the contraction property with coefficient $\gamma_C$.
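For concreteness, the index-set overlap can be measured as follows (an illustrative sketch; the 2k normalization follows the binarized-vector definition above, and the function names are ours):

```python
def top_k_indices(vec, k):
    """Index set of the k largest-magnitude entries of vec."""
    return set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])

def normalized_hamming_distance(set_a, set_b, k):
    """Fraction of mismatched positions between two k-sized index sets:
    0 means identical selections, 1 means fully disjoint selections."""
    sym_diff = len(set_a ^ set_b)   # mismatches counted on both sides
    return sym_diff / (2 * k)

local = top_k_indices([5.0, -4.0, 3.0, 0.1, 0.2], k=2)   # {0, 1}
glob  = top_k_indices([5.0, -0.5, 3.0, 0.1, 0.2], k=2)   # {0, 2}
dist  = normalized_hamming_distance(local, glob, k=2)    # 0.5: half the picks agree
```
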
Before showing the theoretical results, we make the following assumptions.
A.1 We suppose that the size of the gradient is upper bounded, i.e., $\|\nabla f(\theta)\| \le G$, and that the objective function is gradient Lipschitz continuous with constant $L$ and lower bounded, i.e., $f(\theta) \ge f^* > -\infty$.
A.2 We assume that the gradient estimate is unbiased, i.e., $\mathbb{E}[g_i^t] = \nabla f_i(\theta_t)$, and has bounded variance, i.e., $\mathbb{E}\|g_i^t - \nabla f_i(\theta_t)\|^2 \le \sigma^2$. By leveraging the contraction property of CLT-$k$, we have the following convergence-rate guarantees.
Theorem 1. Under assumptions A.1-A.2, suppose the sequence $\{\theta_t\}$ is generated by CLT-$k$. Then, when the learning rate $\gamma$ and discounting factor $\beta$ are chosen appropriately as functions of the number of workers $n$ and the total number of iterations $T$, we have
$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\,\|\nabla f(\theta_t)\|^2 = O\!\left(\frac{1}{\sqrt{nT}}\right).$$
Remark 4. Theorem 1 showcases the linear speedup achieved by CLT-$k$, meaning that the optimality gap (i.e., the size of the gradient) decreases as the number of workers increases. Next, we give an analysis to show how the number of workers and the corresponding correlation between the workers jointly affect the convergence in terms of the contraction coefficient, especially for the case when $n$ is large.
Assume that the gradients at different workers are positively correlated, i.e., there exists a positive constant that lower-bounds their pairwise inner products. Then, for a sufficiently large number of workers $n$, the contraction coefficient of CLT-$k$ remains bounded away from 1.
Remark 5. It can be seen that, under these conditions, the contraction constant decreases with the number of workers $n$. In the limiting case, ScaleCom is able to find a first-order stationary point for any $n$.
Given a pre-defined $k$, the contraction coefficient of CLT-$k$ given in (7) depends on the top-$k$ contraction coefficient and the Hamming distance. The top-$k$ contraction property has been widely investigated in the literature. Theoretically, the upper bound of the top-$k$ contraction error is $1 - k/d$ (i.e., $\|x - \mathrm{top}_k(x)\|^2 \le (1 - k/d)\|x\|^2$), which is the same as random-$k$ when the components of the gradient are uniform. Practically, this quantity is observed to be much smaller [shi2019undtopk].
On the other hand, the Hamming distance measures the overlap between two top-$k$ index sets. Figure 3 shows the normalized Hamming distance over iterations and for various numbers of workers. The smaller the normalized Hamming distance, the closer the local top-$k$ selection is to the true top-$k$. It demonstrates that, empirically, the overlap between the local top-$k$ indices from one worker and the global true top-$k$ indices after all-reduce is reasonable (the overlap is in the range of 0.6-0.8), which indicates a good contraction property of the CLT-$k$ compressor in practice. This further affects the discounting factor in low-pass filtering, as shown in Theorem 1.
Large datasets and small batch size:
A large dataset with a small batch size introduces more noise in the gradients, decreasing the statistical similarity between workers, and is thus harder to deal with. In the analysis above, we assumed a minimum index overlap (in Hamming distance) between workers to guarantee a contraction coefficient < 1, which is a mild assumption in practice. Figure 3 shows that when the per-worker batch size is 32, the Hamming distance is still above 0.32, which is consistent with our pilot experiments, where we tried a per-worker mini-batch of 8 with 128 workers on CIFAR10 without any noticeable degradation. This indicates ScaleCom's applicability in challenging training conditions (large datasets and small mini-batch sizes).
4 Experimental Results
We apply ScaleCom to three major applications: vision (ImageNet, CIFAR10), language (WMT14 En-De), and speech (SWB300). Experiments are run on IBM POWER System AC922 systems using implementations in PyTorch (see Appendix-E for experimental details). We adopt [gpusorting] to accelerate sorting, which divides the whole buffer into chunks and parallelizes sorting within each chunk. As suggested in [chen2018adacomp, lin2017deep], we use 1-5 warm-up epochs (<10% of total training epochs) for compression; no compression is applied during the warm-up period. A conservative engineering guideline is proposed for the compression-rate setting of each layer, based upon its FLOPs-to-gradient-size ratio: 25X for layers with the largest ratios, 50X for intermediate ratios, and 400X for ratios in (0, 128]. It should be noted that this guideline is based on the per-worker mini-batch size: 32 for vision and speech, and 4.5k for language. As the per-worker mini-batch size changes, the compression rate is adjusted accordingly. In addition, to demonstrate the robustness of ScaleCom, a much more aggressive compression rate for the Transformer-based language model is tested at both standard and large batch sizes.
Standard Batch Size: In these experiments, we adopt hyper-parameter settings from [he2016deep, cui2017embedding, vaswaniattention] (including learning rates and momentum) to achieve excellent baseline accuracy (listed in Table 2). The same hyper-parameters are used in the ScaleCom experiments, in which we set $\beta = 1$ in the low-pass filter, as there is no need to filter the gradients in standard batch size experiments. The experimental results are summarized in Table 2, and convergence curves are shown in Figure 4. With compression rates of 65-400X, ScaleCom achieves accuracies very close to the baseline for all workloads.
| Model (Dataset) Accuracy or [other metrics] | #GPU | BSZ | Comp. Rate | Baseline | Comp. |
|---|---|---|---|---|---|
| Transformer-base (WMT14 En-De) [BLEU] | 8 | 36K | 47X (65X) | 27.64 | 27.27 (27.24) |
| 4-bidirectional-LSTM Speech (SWB300) [WER] | 4 | 128 | 400X | 10.4 | 10.1 |

Numbers in parentheses: more aggressive compression is applied without significant degradation.
Large Batch Size Scaling: To evaluate the scalability of our methods, we follow [goyal2017accurate, ott2018scaling, zhang2020improving] to achieve state-of-the-art baseline accuracy with large-scale distributed settings (listed in Table 3). Compression experiments use the same hyper-parameters as the baselines. From Section 2.2, as we scale up the mini-batch size and learning rates in large-scale distributed training, the gradient noise increases and local memory similarity becomes weaker among workers, which could damage network performance. As shown by the gray lines in Figure 5, when the low-pass filter is not applied ($\beta = 1$), although the small dataset (CIFAR10) still shows good accuracy, large datasets (ImageNet, WMT14, and SWB300) start to show degradation. Once the proposed low-pass filter is applied ($\beta = 0.1$), ScaleCom achieves almost identical test accuracies compared to the non-compressed baseline on every large network studied, as shown in Table 3 and Figure 5. (We observed that convergence is robust to $\beta$ in the range of 0.1-0.3 across different networks.)
| Model (Dataset) Accuracy or [other metrics] | #GPU | BSZ | Comp. Rate | Baseline | Comp. |
|---|---|---|---|---|---|
| Transformer-base (WMT14 En-De) [BLEU] | 64 | 288K | 47X (115X) | 27.79 | 28.03 (27.59) |
| 4-bidirectional-LSTM Speech (SWB300) [WER] | 12 | 1536 | 100X | 9.9 | 10.0 |

Numbers in parentheses: more aggressive compression is applied without significant degradation.
5 End-to-end System Performance
In this section, we quantify the improvement in end-to-end training time achieved by ScaleCom. We consider a distributed training system comprised of multiple accelerator chips connected to a parameter server. Each accelerator chip consists of multiple cores with private scratchpad memory. The systematic performance-analysis framework presented in [venkataramani2019memory] is used to estimate performance. Given a system configuration (compute throughput, memory capacity, interconnect topology and bandwidth), the framework analytically explores possible ways to map DNN computations onto the accelerator system and provides performance estimates (Appendix-F provides further details on end-to-end system performance).
We present the performance impact of ScaleCom by varying three key factors: (i) the peak compute capability per worker (100 and 300 TFLOPs), (ii) the mini-batch size per worker (8 and 32), and (iii) the number of workers (8, 32 and 128). When the mini-batch per worker is increased, the gradient/weight communication becomes less frequent, limiting the scope of end-to-end performance benefits from ScaleCom. This is evident from Figure 6a, where the communication time (as a fraction of total time) decreases from 56% to 20% when the mini-batch per worker is increased from 8 to 32. Consequently, with 100 TFLOPs of peak compute per worker, ScaleCom achieves a total training speedup of 2X to 1.23X even with a 100X compression ratio. The fraction of communication time grows with the increase in peak TFLOPs (100 to 300), resulting in speedups of 4.1X to 1.75X.
The key trait of ScaleCom is its performance scalability to larger numbers of workers, independent of the mini-batch size per worker. This is shown in Figure 6b, where the communication cost of prior top-k approaches increases linearly with the number of workers, whereas that of ScaleCom remains constant. With ScaleCom, the gradient/weight communication is < 3% of the total training time even with a large number of workers (128) and a small mini-batch per worker (8), leaving the training throughput limited only by computational inefficiency.
Cost of index communication and synchronization: To enable all workers to select the same gradients, ScaleCom incurs an additional overhead for communicating the top-k indices. As the index vector has the same degree of compression as the gradient vector, it occupies only 0.5% of the baseline communication time. Also, this cost remains constant, independent of the number of workers. ScaleCom also incurs an additional synchronization during index communication. As in fully synchronous SGD, the slowest worker determines when gradient communication can begin; once this point is reached by all workers, the additional synchronization costs little extra time.
6 Conclusions
Gradient compression is a promising technique to resolve communication bottlenecks but has not been widely adopted in today's training systems. The two primary reasons for this are the lack of demonstrations on large batch sizes (and datasets) and the incompatibility of compression techniques with all-reduce schemes. In this paper, we propose a new compression algorithm, ScaleCom, that resolves both of these issues. We theoretically analyze ScaleCom's convergence properties and demonstrate its scalability, robustness, and excellent compression rates (65-400X) using experiments on a spectrum of models, datasets and batch sizes, laying the foundation for its introduction in large-scale systems.
Broader Impact
The amount of compute for DNN training doubles every 3 to 4 months [openai]; this is faster than Moore's law, which doubles the number of transistors every 2 years. The latest language model, GPT-3 [brown2020language], takes 175 billion parameters to achieve state-of-the-art performance on several NLP tasks such as common-sense reasoning and word prediction. Training, designing, and optimizing these gigantic models requires tremendous time (cost) and computational power. Our research results on compression in large-scale distributed training have two broad benefits:
(i) Reducing the time and cost to train DNN models: We believe that communication times will bottleneck the training times of distributed systems, and this will become even more severe with recent significant improvements in the computational capability of deep learning training hardware. To address this bottleneck, compression techniques have been eagerly researched over the past few years and implemented in some practical training systems [parthasarathi2019realizing]. Our research results on the scalability of gradient compression aim to push this to larger-scale distributed training systems, which is needed for training expensive and powerful gigantic models. We believe that a scalable compression solution can accelerate machine-learning research and reduce the cost for companies and research institutes to develop state-of-the-art DNNs for real applications and complicated datasets.
(ii) Energy consumption and environmental concerns: Training DNNs, especially big models, consumes tremendous energy and has started to raise concerns about CO2 emissions. As indicated in [strubell2019energy], Transformer training with neural architecture search can cause as much CO2 emission as five cars' lifetimes. Today, most DNN training runs on distributed systems, and energy is mainly consumed in data communication: 32-bit I/O communication takes 3-4 orders of magnitude more energy (pJ) than a 32-bit floating-point ADD [nvlink]. Thus, efficient communication is crucial to reduce energy consumption and mitigate concerns about the carbon footprint of DNN training, especially for large-scale distributed training of gigantic DNNs. Our research cuts communication data size by 65-400X and scales this method to larger-scale distribution, which will reduce energy consumption and mitigate environmental concerns in gigantic DNN training. This helps to fight climate change and global warming.
Meanwhile, we would like to point out that, although our compression scheme guarantees theoretical convergence and shows no accuracy loss compared to baseline training over the tested models and applications, there could still be concerns about the impact of lossy gradient compression on neural-network convergence. Especially when gradient compression is applied directly without fine-tuning hyper-parameters, training could still be subject to instability; thus, it is recommended to examine the compression scheme over a wider range of models and applications. Our conservative compression selection rules (described in Section 4) help mitigate this concern; however, task-specific robustness studies are recommended for special applications.
The authors would like to thank Jintao Zhang and Ashish Ranjan for helpful technical discussions, Kim-Khanh Tran, Anthony Giordano, I-Hsin Chung, Ming-Hung Chen, Kaoutar El maghraoui, and Jeffrey Burns for the computing infrastructure, and Leland Chang, Arvind Kumar, Yulong Li, Shubham Jain, Sunil Shukla, Ankur Agrawal, Marcel Schaal, Mauricio Serrano, Wei Wang and the team for the chip platform targeted in this work. This research is realized by generous collaborations across IBM Research. Funding of this work is fully supported by IBM Research.
- (1) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
- (2) M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” arXiv preprint arXiv:1905.11946, 2019.
- (3) X. Cui, V. Goel, and G. Saon, “Embedding-based speaker adaptive training of deep neural networks,” arXiv preprint arXiv:1710.06937, 2017.
- (4) J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- (5) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. of Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
- (6) J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, et al., “Large scale distributed deep networks,” in Proc. of Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.
- (7) P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch sgd: Training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
- (8) Y. You, I. Gitman, and B. Ginsburg, “Scaling SGD batch size to 32k for imagenet training,” arXiv preprint arXiv:1708.03888, vol. 6, 2017.
- (9) Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C.-J. Hsieh, “Large batch optimization for deep learning: Training bert in 76 minutes,” in Proc. of International Conference on Learning Representations, 2019.
- (10) X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, et al., “Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes,” arXiv preprint arXiv:1807.11205, 2018.
- (11) M. Ott, S. Edunov, D. Grangier, and M. Auli, “Scaling neural machine translation,” in Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 1–9, 2018.
- (12) S. Ma, R. Bassily, and M. Belkin, “The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning,” arXiv preprint arXiv:1712.06559, 2017.
- (13) S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in Proc. of International Conference on Machine Learning, pp. 1737–1746, 2015.
- (14) N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan, “Training deep neural networks with 8-bit floating point numbers,” in Proc. of Advances in Neural Information Processing Systems, pp. 7675–7684, 2018.
- (15) X. Sun, J. Choi, C.-Y. Chen, N. Wang, S. Venkataramani, V. V. Srinivasan, X. Cui, W. Zhang, and K. Gopalakrishnan, “Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks,” in Proc. of Advances in Neural Information Processing Systems, pp. 4901–4910, 2019.
- (16) B. Fleischer, S. Shukla, M. Ziegler, J. Silberman, J. Oh, V. Srinivasan, J. Choi, S. Mueller, A. Agrawal, T. Babinsky, et al., “A scalable multi-teraops deep learning processor core for ai training and inference,” in 2018 IEEE Symposium on VLSI Circuits, pp. 35–36, IEEE, 2018.
- (17) R. Krashinsky, O. Giroux, S. Jones, N. Stam, and S. Ramaswamy, “Nvidia ampere architecture in-depth,” NVIDIA blog: https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/, 2020.
- (18) J. Dean, “1.1 the deep learning revolution and its implications for computer architecture and chip design,” in 2020 IEEE International Solid-State Circuits Conference-(ISSCC), pp. 8–14, IEEE, 2020.
- (19) J. Oh, S. Lee, M. K. Kang, M. Ziegler, J. Silberman, and A. e. Agrawal, “A 3.0 tflops 0.62v scalable processor core for high compute utilization ai training and inference,” in Proc. of Symposia on VLSI Technology and Circuits, 2020.
- (20) F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns,” in Proc. of Annual Conference of the International Speech Communication Association, 2014.
- (21) N. Strom, “Scalable distributed dnn training using commodity gpu cloud computing,” in Proc. of Annual Conference of the International Speech Communication Association, 2015.
- (22) C.-Y. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrishnan, “Adacomp: Adaptive residual gradient compression for data-parallel distributed training,” in Proc. of AAAI Conference on Artificial Intelligence, 2018.
- (23) Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
- (24) N. Ivkin, D. Rothchild, E. Ullah, I. Stoica, R. Arora, et al., “Communication-efficient distributed SGD with sketching,” in Proceedings of Neural Information Processing Systems (NeurIPS), pp. 13144–13154, 2019.
- (25) M. Yu, Z. Lin, K. Narra, S. Li, Y. Li, N. S. Kim, A. Schwing, M. Annavaram, and S. Avestimehr, “Gradiveq: Vector quantization for bandwidth-efficient gradient aggregation in distributed cnn training,” in Proc. of Advances in Neural Information Processing Systems, pp. 5123–5133, 2018.
- (26) T. Vogels, S. P. Karimireddy, and M. Jaggi, “Powersgd: Practical low-rank gradient compression for distributed optimization,” in Proc. of Advances in Neural Information Processing Systems, pp. 14236–14245, 2019.
- (27) S. Shi, Q. Wang, K. Zhao, Z. Tang, Y. Wang, X. Huang, and X. Chu, “A distributed synchronous sgd algorithm with global top-k sparsification for low bandwidth networks,” in 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 2238–2247, IEEE, 2019.
- (28) S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified sgd with memory,” in Proc. of Advances in Neural Information Processing Systems, pp. 4447–4458, 2018.
- (29) D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli, “The convergence of sparsified gradient methods,” in Proc. of Advances in Neural Information Processing Systems, pp. 5973–5983, 2018.
- (30) N. Dryden, T. Moon, S. A. Jacobs, and B. Van Essen, “Communication quantization for data-parallel training of deep neural networks,” in 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC), pp. 1–8, IEEE, 2016.
- (31) T. Vogels, S. P. Karimireddy, and M. Jaggi, “PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization,” in Proc. of Advances in Neural Information Processing Systems, 2019.
- (32) H. Tang, X. Lian, C. Yu, T. Zhang, and J. Liu, “Doublesqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression,” arXiv preprint arXiv:1905.05957, 2019.
- (33) S. Shi, X. Chu, K. C. Cheung, and S. See, “Understanding Top-k Sparsification in Distributed Deep Learning,” 2019.
- (34) S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi, “Error feedback fixes signsgd and other gradient compression schemes,” arXiv preprint arXiv:1901.09847, 2019.
- (35) S. Venkataramani, V. Srinivasan, J. Choi, P. Heidelberger, L. Chang, and K. Gopalakrishnan, “Memory and interconnect optimizations for peta-scale deep learning systems,” in 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 225–234, IEEE, 2019.
- (36) A. V. Oppenheim and R. W. Schafer, “Discrete-time signal processing, 3rd edition,” 2009.
- (37) A. Abdi and F. Fekri, “Quantized compressive sampling of stochastic gradients for efficient communication in distributed deep learning,” in Proc. of AAAI Conference on Artificial Intelligence, 2020.
- (38) J. Wu, W. Huang, H. Junzhou, and T. Zhang, “Error compensated quantized sgd and its applications to large-scale distributed optimization,” in 2018 International Conference on Machine Learning (ICML), pp. PMLR 80:5325–5333, 2018.
- (39) P. Kipfer, “Chapter 46: Improved gpu sorting,” in GPU Gems 2, 2005.
- (40) W. Zhang, X. Cui, A. Kayi, M. Liu, U. Finkler, B. Kingsbury, G. Saon, Y. Mroueh, A. Buyuktosunoglu, P. Das, et al., “Improving efficiency in large-scale decentralized distributed training,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3022–3026, IEEE, 2020.
- (41) D. Amodei, D. Hernandez, G. Sastry, J. Clark, G. Brockman, and I. Sutskever, “Ai and compute,” OpenAI blog: https://openai.com/blog/ai-and-compute/, 2018.
- (42) T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
- (43) S. H. K. Parthasarathi, N. Sivakrishnan, P. Ladkat, and N. Strom, “Realizing petabyte scale acoustic modeling,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 422–432, 2019.
- (44) E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for deep learning in nlp,” arXiv preprint arXiv:1906.02243, 2019.
- (45) A. Ishii, D. Foley, E. Anderson, B. Dally, G. Dearth, L. Dennison, M. Hummel, and J. Schafer, “Nvswitch and dgx-2 nvlink-switching chip and scale-up compute server,” 2018.
- (46) W. Zhang, X. Cui, U. Finkler, G. Saon, A. Kayi, A. Buyuktosunoglu, B. Kingsbury, D. Kung, and M. Picheny, “A highly efficient distributed deep learning system for automatic speech recognition,” arXiv preprint arXiv:1907.05701, 2019.
Appendix A Observations in Local Memory Similarity
We observed the similarity of local memory across workers through Q–Q (quantile–quantile) plots, as shown in Figure A1(a)-(c). In Figure A1(a), the linearity of the points in the Q-Q plot suggests that worker 1’s local memory (accumulated gradient) magnitudes have a statistical distribution very similar to worker 2’s. The red line is the linear regression fit for the blue dots; its overall R² score is 0.99, indicating that the local memory magnitude distributions of workers 1 and 2 are very similar. This is consistent with our observations of pairwise cosine distance shown in Figure 2(a). Memory accumulation (accumulating gradients over iterations) reduces gradient variation and enhances the similarity across workers. On the other hand, when plotting the computed gradients (right after backward computation) in a Q-Q plot, we do not observe such excellent similarity between workers 1 and 2, as shown in Figure A1(b); in this case, the linear regression R² score is 0.89. In Figure A1(c), we compare worker 1’s error-feedback gradient (local memory + computed gradient) magnitudes with the globally all-reduced error-feedback gradient magnitudes. From the plot, we can observe that their distribution quantiles are highly correlated (linear regression R² score of 0.99). The Spearman’s rank correlation coefficient between worker 1’s and the all-reduced error-feedback gradient magnitudes is 0.657 (p-value = 0). This indicates that we can plausibly use a local worker’s top-$k$ to approximate the true top-$k$ selection.
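The Q-Q comparison and rank-correlation measurement described above can be sketched as follows. This is a minimal illustration with synthetic magnitude data standing in for the two workers' local memories (the arrays and sample sizes are hypothetical, not the paper's actual training data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two workers' accumulated-gradient (local memory)
# magnitudes; in a real run these come from the training job.
mem_w1 = np.abs(rng.standard_normal(10_000))
mem_w2 = np.abs(rng.standard_normal(10_000))

# Q-Q comparison: sorted quantiles of one distribution plotted against the
# other; the R^2 of a linear fit measures how similar the distributions are.
q1, q2 = np.sort(mem_w1), np.sort(mem_w2)
r_squared = np.corrcoef(q1, q2)[0, 1] ** 2  # near 1.0 for similar distributions

def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks
    (argsort of argsort gives each element's rank; fine without ties)."""
    ra = a.argsort().argsort()
    rb = b.argsort().argsort()
    return np.corrcoef(ra, rb)[0, 1]

# Rank correlation between one worker's magnitudes and the (simulated)
# all-reduced magnitudes, analogous to the 0.657 coefficient reported above.
allreduced = (mem_w1 + mem_w2) / 2
rho = spearman(mem_w1, allreduced)
```

A high quantile R² with a moderate Spearman coefficient matches the qualitative picture in Figure A1: similar distributions overall, with partially aligned top-$k$ rankings.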
Appendix B Preliminaries
Before showing the convergence proofs, we provide the following table to highlight the notation and definitions of the variables used in the proofs.
Recall that the optimization problem is
$$\min_{x\in\mathbb{R}^d} \; f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x),$$
and the gradient estimate at worker $i$ is as follows:
$$g_t^i := \frac{1}{|\mathcal{B}_t^i|}\sum_{\zeta\in\mathcal{B}_t^i}\nabla F(x_t;\zeta). \tag{A2}$$
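As a concrete illustration of the per-worker mini-batch gradient estimate and its data-parallel average, the sketch below uses a toy quadratic loss $F(x;\zeta)=\tfrac{1}{2}\|x-\zeta\|^2$ (so $\nabla F(x;\zeta)=x-\zeta$); all names and sizes are hypothetical stand-ins for a real DNN workload:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy distributed setup: n workers, each holding its own local data shard.
n_workers, dim, batch = 4, 8, 32
data = [rng.standard_normal((1000, dim)) for _ in range(n_workers)]
x = rng.standard_normal(dim)

def local_grad_estimate(x, worker_data, batch_size, rng):
    """Mini-batch gradient estimate at one worker: average the per-sample
    gradients grad F(x; z) = x - z over a sampled mini-batch B_t^i."""
    idx = rng.choice(len(worker_data), size=batch_size, replace=False)
    return np.mean(x - worker_data[idx], axis=0)

# Data-parallel step: all-reduce (here, a plain average) of the n estimates.
g = np.mean([local_grad_estimate(x, d, batch, rng) for d in data], axis=0)
```

The averaged estimate `g` is what a compression scheme would transmit in sparsified form instead of sending each dense `local_grad_estimate` result.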
| Symbol | Eq./Assumption | Definition |
| --- | --- | --- |
| $n$ | — | total number of workers |
| $\beta$ | — | discounting factor of low-pass filter |
| $\mathcal{I}_k$ | (2) | index set of top-$k$ entries |
| $\mathcal{B}_t^i$ | n/a | mini-batch set at worker $i$ |
| $g_t^i$ or $\nabla F_i(x_t)$ | (A2) | gradient estimate by worker $i$ |
| $m_t^i$ | — | memory at the $i$-th node |
| $t$ | — | index of iteration |
| $T$ | — | total number of iterations |
| $G$ | A.1 | upper bound of gradients’ size |
| $L$ | A.1 | gradient Lipschitz constant |
| $f^*$ | A.1 | global minimum of $f$ |
We also use several standard inequalities as follows.
1. Young’s inequality with parameter $\epsilon > 0$ is
$$\langle a, b\rangle \le \frac{\|a\|^2}{2\epsilon} + \frac{\epsilon\|b\|^2}{2},$$
where $a, b$ are vectors. One variant of Young’s inequality is
$$\|a + b\|^2 \le (1+\epsilon)\|a\|^2 + \left(1+\frac{1}{\epsilon}\right)\|b\|^2.$$
2. The quadrilateral identity is
$$2\langle a - b,\, c - d\rangle = \|a - d\|^2 + \|b - c\|^2 - \|a - c\|^2 - \|b - d\|^2.$$
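These standard inequalities are easy to sanity-check numerically. The sketch below verifies Young's inequality, its variant, and the quadrilateral identity on random vectors; it is a verification aid, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, c, d = rng.standard_normal((4, 16))
eps = 0.5  # Young's parameter; any eps > 0 works

# Young's inequality: <a, b> <= ||a||^2/(2*eps) + eps*||b||^2/2
assert a @ b <= (a @ a) / (2 * eps) + eps * (b @ b) / 2 + 1e-12

# Variant: ||a + b||^2 <= (1 + eps)*||a||^2 + (1 + 1/eps)*||b||^2
s = a + b
assert s @ s <= (1 + eps) * (a @ a) + (1 + 1 / eps) * (b @ b) + 1e-12

# Quadrilateral identity (holds with equality):
# 2<a - b, c - d> = ||a-d||^2 + ||b-c||^2 - ||a-c||^2 - ||b-d||^2
lhs = 2 * (a - b) @ (c - d)
rhs = (a - d) @ (a - d) + (b - c) @ (b - c) \
    - (a - c) @ (a - c) - (b - d) @ (b - d)
assert abs(lhs - rhs) < 1e-9
```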
Appendix C Proof of Lemma 1 on Contraction Property
Suppose $x$ is a $d$-dimensional vector and $\mathcal{I}_k$ is its top-$k$ index set. $x$ is sparsified by the compressor $C(\cdot)$ via another index set $\mathcal{I}'_k$ with $|\mathcal{I}'_k| = k$.
Since $|\mathcal{I}_k| = |\mathcal{I}'_k| = k$, there is in general an overlap of indices between $\mathcal{I}_k$ and $\mathcal{I}'_k$. Taking the expectation over all possible combinations and permutations, we have
$$\mathbb{E}\,\|x - C(x)\|^2 \le \left(1 - \frac{k}{d}\right)\|x\|^2,$$
which completes the proof. ∎
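The contraction property can be checked empirically in the simplest case, where the compressor is the exact top-$k$ selector (a special case of the lemma, not the full CLT-$k$ construction): the residual after dropping the $d-k$ smallest-magnitude entries satisfies the deterministic bound $\|x - C(x)\|^2 \le (1 - k/d)\|x\|^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 1000, 100

def top_k(x, k):
    """Keep the k largest-magnitude entries of x, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

# Contraction: the d-k smallest squared entries average at most the overall
# mean square, so their sum is at most (1 - k/d) * ||x||^2.
x = rng.standard_normal(d)
residual = x - top_k(x, k)
assert residual @ residual <= (1 - k / d) * (x @ x) + 1e-12
```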
Appendix D Convergence Performance Analysis of CLT-$k$
Under assumptions A.1-A.2, suppose the sequence $\{x_t\}$ is generated by CLT-$k$. Then, when the learning rate $\eta$ is chosen sufficiently small, we have
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\,\|\nabla f(x_t)\|^2 \le \frac{c_1\left(f(x_1) - f^*\right)}{\sqrt{nT}} + \frac{c_2}{T},$$
where $f^*$ is the global minimum of $f$, and $c_1$ and $c_2$ are some constants.
D.1 Proof of Lemma 3
Define a “virtual” sequence $\tilde{x}_t$ as the following:
$$\tilde{x}_t := x_t - \frac{1}{n}\sum_{i=1}^{n} m_t^i.$$
For simplicity of notation, we denote $\frac{1}{n}\sum_{i=1}^{n} g_t^i$ as $\bar{g}_t$. When $t \ge 1$, it is easy to check that
$$x_{t+1} - \frac{1}{n}\sum_{i=1}^{n} m_{t+1}^i = x_t - \frac{1}{n}\sum_{i=1}^{n}\left(m_t^i + \eta\, g_t^i\right),$$
so we have
$$\tilde{x}_{t+1} = \tilde{x}_t - \eta\,\bar{g}_t.$$
From A.2, we have
Then, we are able to quantify the size of the memory term as follows:
where in $(a)$ we apply Lemma 1.
Squaring both sides of (A23), we have
where in $(b)$ we use Young’s inequality with parameter $\epsilon$.
Taking expectation on both sides of (A25), we have
Since the size of the gradients is bounded by $G$, we have
It is obvious that when
then the sequence will be upper bounded by a constant. Therefore, we require
i.e., such that the condition holds. Then, we can choose $\eta$ small enough, i.e.,
so that when .
Next, in order to get , we need
which is equivalent to
and , we have . Note that here is always greater than . The reasons are as follows: first, we know that
Adding on both sides, we will get
Dividing on both sides, we can arrive at
which is the desired result.
Then, iterating (A29) gives
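The virtual-sequence identity used at the start of this proof — that $\tilde{x}_t = x_t - \frac{1}{n}\sum_i m_t^i$ evolves exactly like uncompressed averaged SGD — can be checked numerically. The sketch below uses a plain top-$k$ compressor with error feedback as a simplified stand-in for CLT-$k$, with random placeholder gradients:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k, lr, T = 4, 50, 5, 0.1, 20

def top_k(x, k):
    """Keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

x = rng.standard_normal(d)
mem = [np.zeros(d) for _ in range(n)]
x_virtual = x - np.mean(mem, axis=0)  # virtual sequence: x_t - avg memory

for _ in range(T):
    grads = [rng.standard_normal(d) for _ in range(n)]  # placeholder gradients
    deltas = []
    for i in range(n):
        acc = mem[i] + lr * grads[i]   # error feedback: add residue to gradient
        delta = top_k(acc, k)          # compress: transmit only k entries
        mem[i] = acc - delta           # keep the dropped part as local memory
        deltas.append(delta)
    x = x - np.mean(deltas, axis=0)    # apply the averaged compressed update
    x_virtual = x_virtual - lr * np.mean(grads, axis=0)
    # Identity: the virtual sequence takes plain (uncompressed) SGD steps.
    assert np.allclose(x - np.mean(mem, axis=0), x_virtual)
```

The memory cancels telescopically in the update, which is exactly why the proof can analyze $\tilde{x}_t$ as an ordinary SGD iterate plus a bounded perturbation.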