ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training

04/21/2021 · Chia-Yu Chen, et al. · IBM

Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained. To overcome this limitation, numerous gradient compression techniques have been proposed and have demonstrated high compression ratios. However, most existing methods do not scale well to large scale distributed systems (due to gradient build-up) and/or fail to evaluate model fidelity (test accuracy) on large datasets. To mitigate these issues, we propose a new compression technique, Scalable Sparsified Gradient Compression (ScaleCom), that leverages similarity in the gradient distribution amongst learners to provide significantly improved scalability. Using theoretical analysis, we show that ScaleCom provides favorable convergence guarantees and is compatible with gradient all-reduce techniques. Furthermore, we experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic and provides high compression rates (65-400X) and excellent scalability (up to 64 learners and 8-12X larger batch sizes over standard training) across a wide range of applications (image, language, and speech) without significant accuracy loss.


1 Introduction

Over the past decade, DNNs have surpassed traditional machine learning models on a wide range of applications including computer vision he2016deep tan2019efficientnet, speech cui2017embedding, and natural language processing (NLP) devlin2018bert vaswaniattention. As models and datasets have grown in complexity, training times have increased significantly tan2019efficientnet devlin2018bert. To tackle this challenge, data-parallel approaches are widely used to accelerate the training of DNN models dean2012large. In order to scale data-parallelism techniques to more workers while preserving the computational efficiency of each worker, it is important to increase the overall batch size proportionally with the number of workers. However, increasing the batch size often leads to a significant loss in test accuracy, which has been remedied by a number of recent ideas, including increasing the learning rate during the training process and using a learning rate warm-up procedure goyal2017accurate you2017scaling you2019large. Using these techniques, large batch size training has been successfully applied to state-of-the-art distributed systems jia2018highly ott2018scaling. However, increasing evidence suggests that there is a maximum mini-batch size beyond which the number of iterations required to converge increases ma2017power. Furthermore, driven by recent advances in low-precision arithmetic gupta2015deep wang2018training sun2019hybrid, there has been a renaissance in the computational capability of deep learning training hardware, resulting in accelerator throughputs exceeding 100s of TeraOps/s fleischer2018scalable nvidia dean20201 ibmchip. This dramatic increase in throughput can cause an imbalance between computation and communication, resulting in large-scale training platforms that are severely communication constrained.

To mitigate these communication bottlenecks in DNN training, several gradient compression techniques have been proposed seide20141 strom2015scalable chen2018adacomp lin2017deep. Most of these techniques exploit error feedback or 'local memory' (preserving the gradient residues left over from compression) to achieve significant compression rates with good convergence properties. However, current error-feedback gradient compression techniques cannot be directly applied to large-scale distributed training. There are two primary challenges. (a) Gradient build-up: As addressed in ivkin2019comm yu2018gradiveq vogels2019powersgd hongkong2019gtopk, compressed data can be gathered, but not reduced. This results in a dramatically decreased compression rate as the number of workers increases. (b) Large batch size with scaled learning rate: As shown in stich2018sparsified, for a convex problem, the noise term in the error-feedback gradient increases as the cube of the learning rate. alistarh2018convergence also shows that an increased learning rate can add large noise to error-feedback gradient compression in non-convex and distributed settings. Thus, the scaled learning rates needed for large batch size training can significantly increase gradient noise and cause performance degradation (or even divergence), particularly for complex models and datasets.

In this paper, we propose a new gradient compression algorithm, ScaleCom, that addresses both of these challenges. ScaleCom provides significant compression rates (65-400X) while enabling convergence in large-scale distributed training (64 workers). To the best of our knowledge, this is the first compression algorithm that has been extensively evaluated on large datasets and batch sizes and shown to be fully compatible with conventional all-reduce schemes, as shown in Table 1.

Compressor | scalability / overhead (FLOPs/element) | compr. rate | convergence | empirical exp. | LB
Top-k strom2015scalable dryden2016communication | sort | >100X | not guaranteed | broadly tested | no
AdaComp chen2018adacomp | quasi-sort | 40-200X | not guaranteed | broadly tested | no
DGC lin2017deep | sample-based sort | 270-600X | not guaranteed | broadly tested | no
PowerSGD vogels2019powersgd | low-rank approximation | 40-128X | not guaranteed | small datasets | yes
gTop-k hongkong2019gtopk | local top-k merge | >100X | not guaranteed | up to 6% degrad. | no
SketchSGD ivkin2019comm | constant (sketch table) | 40X | guaranteed | transformer | no
ScaleCom (ours) | constant (chunk-wise sort) | 65-400X | guaranteed | broadly tested | yes

Notes: n is the model size; "not guaranteed" means convergence is not guaranteed unless an explicit assumption is made; H(.) denotes the hash-function computation and r the number of rows of the sketch table; "broadly tested" covers a wide range of applications with large datasets; LB = large batch size training with scaled learning rates.

Table 1: Comparing different compressors for error-feedback SGD

1.1 Challenges and Related Works

Error-feedback gradient compression and all-reduce: Error-feedback gradient compression was first introduced by seide20141 and later widely applied to various application domains strom2015scalable chen2018adacomp lin2017deep dryden2016communication. The error-feedback gradient (also referred to as "residues" chen2018adacomp or local memory) is the difference between a worker's computed gradient and its compressed gradient. When compressed gradients from multiple workers are sent to a centralized parameter server (for reduction), they cause a "gradient build-up" problem. Specifically, as shown in Figure 1(a), since different workers pick different gradients during compression, the overall compression ratio for the accumulated gradients decreases linearly with the number of workers n, i.e., the effective compression ratio drops to roughly 1/n of the per-worker rate. This effect is especially dramatic in large-scale distributed systems, as shown in Figure 1(b). Recently, there has been a body of work focused on the gradient build-up issue. yu2018gradiveq emphasizes the importance of commutability in gradient compression to enable efficient aggregation in ring all-reduce. vkj2019powerSGD proposed low-rank methods for error-feedback gradient compression that reduce the aggregation complexity. ivkin2019comm used the reduction property of sketch tables to achieve 40X compression rates. tang2019doublesqueeze applied double compression to achieve linear speedup. hongkong2019gtopk merged each worker's local top-k elements to approximate the all-reduce of the global top-k elements. In spite of all these efforts, none of these techniques has been shown to work comprehensively on large models and datasets with a high number of learners, with the desired constant communication complexity.
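To make the build-up effect concrete, the toy sketch below (NumPy; the 0.1% density, dimensions, and worker counts are arbitrary illustrative choices, not the paper's settings) counts how many distinct coordinates the reduced vector touches when each worker keeps its own local top-k versus when all workers keep a shared index set:

```python
import numpy as np

def local_topk_indices(g, k):
    # indices of the k largest-magnitude entries of g
    return np.argpartition(np.abs(g), -k)[-k:]

d, k = 100_000, 100          # model size and per-worker top-k (0.1% density)
rng = np.random.default_rng(0)

for n in (4, 16, 64):        # number of workers
    grads = [rng.standard_normal(d) for _ in range(n)]
    union = set()
    for g in grads:
        union.update(local_topk_indices(g, k).tolist())
    # independent local top-k: the reduced vector touches ~n*k coordinates (gather, not reduce)
    # shared index set: the reduced vector touches exactly k coordinates
    print(f"workers={n:3d}  local top-k nonzeros ~{len(union):6d}   shared-index nonzeros = {k}")
```

With independent local selections the reduced vector stays roughly n*k-sparse (a gather), while a shared index set keeps it exactly k-sparse regardless of n; the latter is the property ScaleCom's commutative compressor is designed to preserve.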

Large batch size training: Furthermore, many of these compression techniques have not been shown to work well in large batch size training scenarios, where communication bottlenecks limit system performance and scalability. chen2018adacomp and lin2017deep scaled mini-batch sizes by 8X and achieved baseline accuracies for CIFAR10 models. Similarly, vkj2019powerSGD linearly scaled the learning rate and batch size by 16X and reduced communication time by 54% for ResNet18 (CIFAR10). Overall, most recent studies have primarily focused on small datasets, and it remains unclear whether gradient compression techniques work well on large models and datasets. As shown in Figure 1(c), we observe that a naive error-feedback gradient compression scheme strom2015scalable can cause significant accuracy degradation in large batch size training scenarios (Transformer on WMT14 En-De).

Convergence analyses of error-feedback gradient compression: In addition to empirical results, stich2018sparsified and alistarh2018convergence provided convergence analyses of error-feedback gradient compression in both convex and non-convex optimization settings and showed convergence similar to traditional stochastic gradient descent (SGD). The results suggest that the essence of network convergence is the contraction property of the compressor, defined as the "energy" preserved in the compressed gradients relative to the full gradients, as shown in Eqn. (4) of stich2018sparsified. The results show that both random-k and top-k compression can achieve convergence properties similar to SGD. Later, shi2019undtopk reported the advantages of the top-k compressor. Recent analyses karimireddy2019error also proved that error feedback enables biased gradient compressors to reach the target test accuracy with high compression rates. In theory, compressors are thus quite flexible (biased or unbiased).

Figure 1: Challenges for gradient compression in large batch size training: (a) illustration of the 'gradient build-up' issue for compressed gradients; compressed gradients cannot be reduced directly and must instead be gathered, and the gather operation does not scale with the number of workers (red). (b) Communication bottlenecks due to gradient build-up: as the number of workers increases, communication from the parameter server to the workers becomes the bottleneck. In this experiment, ResNet50 (ImageNet), a bandwidth of 32 GB/s, and a compression rate of 112X are used; the performance model is based on venkataramani2019memory. (c) In large batch size training, standard local top-k gradient compression strom2015scalable can cause model divergence: Transformer on WMT14 En-De with a 288k batch size and 64 workers.

1.2 Contributions

In this paper, we introduce a new gradient compression technique, ScaleCom, that resolves the two issues central to scalability: (i) it enables compression to work effectively with all-reduce, and (ii) it is applicable to large batch size training on large datasets. In comparison to existing compression methods, our primary contributions include:

  1. We explore local memory (error feedback) similarity across workers and use this property to design a commutative compressor, which we call cyclic local top-k (CLT-k). The CLT-k operator solves the gather (gradient build-up) issue and is compatible with all-reduce operations.

  2. To apply gradient compression in large batch size training, we propose a novel low-pass filter for the local memory update. This filter removes disruptive noise and enhances local memory similarity, allowing the CLT-k compressor to scale to much larger distributed training.

  3. We present a theoretical analysis showing that ScaleCom guarantees the same convergence rate as SGD and enjoys linear speedup with the number of workers. ScaleCom mitigates the gradient noise induced by scaled learning rates and keeps the communication cost constant with the number of workers. Moreover, we observe that ScaleCom has convergence properties similar to the ideal (but impractical) true top-k compression.

  4. Experimentally, we verify that ScaleCom shows no degradation across a wide range of applications (datasets) including vision (ImageNet), language (WMT), and speech (SWB300), in both standard (8 workers) and large batch size (64 workers) training.

2 Gradient Sparsification in All-Reduce

A commutative compressor between gradient averaging and sparsification following definition (1) is desired for communication-efficient distributed training. There are two advantages for commutative compressors: (i) theoretically, with this setting, error-feedback gradient compression has convergence guarantees alistarh2018convergence , and (ii) this resolves the ‘gradient build-up’ issue and keeps communication cost constant with the number of workers yu2018gradiveq .

sparse( (1/n) Σ_{i=1..n} g_i ) = (1/n) Σ_{i=1..n} sparse(g_i)    (1)

Besides commutativity, recent studies lin2017deep stich2018sparsified alistarh2018convergence shi2019undtopk suggest that the top-k compressor has good contraction properties and test accuracies, from both theoretical and empirical perspectives. Thus, an optimized compressor should have both (i) the commutative property and (ii) the top-k contraction property. To satisfy these, we designed our compressor based on the following two observations:

(i) Memory similarity: Although local memory (gradient residue) is never exchanged amongst workers, it is correlated in the sense that local gradients are computed from samples drawn from the same training set. Figure 2(a) shows the pairwise cosine distance (between workers 0 and 1; the cosine distance of two real-valued vectors a and b is defined as 1 − a·b/(‖a‖‖b‖)) of local memory in the first 90 iterations of ResNet18 (CIFAR10) with the conventional local top-k compressor (top 0.1% is used) strom2015scalable. The cosine distance decreases quickly over the iterations, i.e., local memory similarity improves rapidly and the memories stay correlated over much of the training process (Appendix A shows different statistical metrics). Finally, we observe that this phenomenon is agnostic to the number of workers when the learning rate and per-worker batch size stay the same, as shown in Figure 2(a).

(ii) True vs. local top-k: The local memory similarity amongst workers offers a critical insight: a local worker's top-k indices may be used to approximate the true top-k indices. In Figure 2(b), the area under the blue curve represents the magnitudes of the all-reduced error-feedback gradients (the sum of local memory and newly computed gradients), among which the area to the right of the grey line corresponds to its top-k entries (i.e., true top-k; top 0.1% is used here). The true top-k area overlaps more than 70% with the red histogram representing the local top-k of worker 0, suggesting that true top-k and local top-k have sufficiently overlapping indices and similar contraction properties.
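Both observations can be checked with a few lines of PyTorch. The sketch below is illustrative only: it uses random correlated vectors as stand-ins for local memories and computes the pairwise cosine distance from Figure 2(a), together with the index overlap between one worker's local top-k and the top-k of the averaged error-feedback gradient:

```python
import torch

def cosine_distance(a, b):
    # 1 - cos(a, b), the pairwise metric plotted in Figure 2(a)
    return 1.0 - torch.dot(a, b) / (a.norm() * b.norm())

def topk_overlap(local_ef, all_reduced_ef, k):
    # fraction of the true top-k indices (of the averaged error-feedback
    # gradient) that also appear in one worker's local top-k
    local_idx = set(local_ef.abs().topk(k).indices.tolist())
    true_idx = set(all_reduced_ef.abs().topk(k).indices.tolist())
    return len(local_idx & true_idx) / k

d, k, n = 10_000, 10, 8                       # toy sizes, not the paper's settings
torch.manual_seed(0)
shared = torch.randn(d)                        # common signal component across workers
memories = [shared + 0.5 * torch.randn(d) for _ in range(n)]

print("cosine distance (worker 0 vs 1):", cosine_distance(memories[0], memories[1]).item())
print("top-k overlap  (worker 0 vs avg):", topk_overlap(memories[0], torch.stack(memories).mean(0), k))
```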

Cyclic Local Top-k (CLT-k) Compressor:

Based on the similarity between local memories, we propose a novel, efficient commutative compressor for all-reduce distributed training, cyclic local top-k (CLT-k). It works as follows: in each iteration, we sequentially select a leading worker in cyclical order. The leading worker sorts its error-feedback gradient and obtains its local top-k indices. All other workers follow the leading worker's top-k index selection when compressing their own local error-feedback gradients. Formally, the CLT-k compressor is described as follows.

Figure 2: Similarity analysis of error-feedback gradient compression on ResNet18 (CIFAR10): (a) cosine distance between workers' memories over iterations; (b) histogram (in log scale) of element-wise residual gradient magnitudes at iteration 90 in epoch 0; (c) cosine distance between workers' memories for varying learning rates and values of the low-pass filter's β in CLT-k; (d) histogram (in log scale) of element-wise residual gradient magnitudes at iteration 90 in epoch 0 with a scaled learning rate (1.0) and the low-pass filter (β = 0.1).

Let I_k(x) denote the index set corresponding to the indices of the k largest entries (in magnitude) of a vector x. To be more specific, the set is defined by

I_k(x) = { i : |x_i| is among the k largest magnitudes of x }    (2)

Suppose that there are n such vectors x^(1), ..., x^(n) (one per worker). Then we have n local top-k sets, i.e., I_k(x^(1)), ..., I_k(x^(n)). For a vector x^(i), the proposed CLT-k compressor with worker j as the leader, denoted CLT_k^(j), is defined entry-wise as

[CLT_k^(j)(x^(i))]_l = x^(i)_l if l ∈ I_k(x^(j)), and 0 otherwise    (3)

Remark 1. Note that when j = i, CLT-k reduces to the classical top-k operator on x^(i). When j ≠ i, CLT-k keeps only the entries of x^(i) whose indices belong to I_k(x^(j)) and sets the other entries to 0.

Remark 2. It is easy to verify that (3) satisfies the commutative property in (1). Moreover, Figure 2(b) suggests that the histogram of the error-feedback gradient under CLT-k highly overlaps with that of true top-k. Thus, the proposed CLT-k compressor admits an efficient all-reduce implementation, has the desired commutative property, and shares a similar contraction property with true top-k.

Remark 3. We note that the proposed compressor can naturally be extended to ring all-reduce settings.
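For illustration, a minimal PyTorch sketch of the CLT-k operator in (3) is given below, together with a numerical check of the commutative property (1); the tensor sizes and the choice of worker 0 as leader are arbitrary (in ScaleCom the leader rotates cyclically every iteration):

```python
import torch

def clt_k(x_i, x_leader, k):
    """CLT-k as in (3): keep x_i's entries at the leader's local top-k indices."""
    idx = x_leader.abs().topk(k).indices        # I_k(x^(j)) of the leading worker
    out = torch.zeros_like(x_i)
    out[idx] = x_i[idx]
    return out

torch.manual_seed(0)
n, d, k = 8, 1000, 10
xs = [torch.randn(d) for _ in range(n)]
leader = xs[0]                                  # worker selected cyclically per iteration

# Commutative property (1): compressing the average equals averaging the
# compressed vectors, because every worker masks with the same index set.
lhs = clt_k(torch.stack(xs).mean(0), leader, k)
rhs = torch.stack([clt_k(x, leader, k) for x in xs]).mean(0)
print(torch.allclose(lhs, rhs))                 # True
```

Because every worker masks with the same leader-chosen index set, averaging and compression commute, so the sparse vectors can be summed directly in an all-reduce.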

Low-Pass Filtering in Memory Accumulation:

Large batch size training schemes usually require the learning rate to be scaled up significantly. As shown in Figure 2(c), when the learning rate is increased from 0.01 to 1 (100X), the cosine distance becomes much larger (orange line), indicating drastically reduced local memory similarity, which may degrade the performance of the CLT-k compressor. In addition, the scaled learning rate causes rapid model changes and incurs larger gradient noise, which makes it harder to compress gradients in large batch size settings. To address these challenges, we propose to apply low-pass filtering dsp to the local memory accumulation. This low-pass filtering is a form of weighted error feedback gtech2020compress tencent2018compress, but it targets large batch size training and aims to mitigate noise in the incoming residual gradients. Our filter passes the smoothly varying components of the computed gradients and attenuates the gradient noise caused by rapid model changes, which (i) mitigates the undesirable noise caused by the scaled learning rate and (ii) improves local memory similarity among workers. Formally, our method is described as follows. Assuming n workers, the distributed training problem is

min_x f(x) = (1/n) Σ_{i=1..n} f_i(x),  with  f_i(x) = E_{ξ∼D}[ ℓ(x; ξ) ]    (4)

where f_i denotes the objective function at the i-th worker, x is the optimization variable (the weights of the neural network), ξ represents a sample at node i, and D stands for the data distribution. This work focuses on fully synchronized distributed training, so the data distributions at different nodes are identical. Let B_{t,i} denote the mini-batch at the i-th worker at iteration t; the gradient estimate is written as g_{t,i} = (1/b) Σ_{ξ∈B_{t,i}} ∇ℓ(x_t; ξ), where ∇ℓ(x_t; ξ) denotes the gradient of the loss function w.r.t. the sample ξ at node i, and b is the batch size of the sampled data at the i-th worker. Here we use m_{t,i} to denote the gradient residue (local memory) at the i-th worker and v_{t,i} to denote the compressed gradient after scaling by the step size γ. These quantities are computed locally, and v_{t,i} is sent out to update the shared weights x. Then, the low-pass filter on the memory can be written as

(5)

where β is the discounting factor (0 < β ≤ 1) and t is the iteration index. Empirically, we verify that the low-pass filter improves the similarity of the local memories for CLT-k under scaled learning rates, as shown by the green and red lines in Figure 2(c). Figure 2(d) shows that even when the learning rate is increased significantly (100X), with the low-pass filter our CLT-k compressor still maintains sufficient histogram overlap with the true top-k compressor, providing the contraction property needed for robust and scalable training. It should also be noted that, intuitively, this filtering method is connected to momentum SGD: momentum SGD can be viewed as a form of filtering (moving average) over current and past gradients, which smooths out noisy gradients to update the weights more accurately. Analogously, we perform filtering on the residual gradients to improve signal integrity in local memory.
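As a rough illustration, the sketch below implements one plausible convex-combination form of the memory update with discounting factor β; this specific formula is an assumption made for illustration (the actual update is defined by eq. (5)), chosen so that β = 1 recovers plain error-feedback accumulation:

```python
import torch

def memory_update(m, g, v, gamma, beta):
    # Assumed low-pass form: blend the old memory with the new residual
    # (m + gamma*g - v); beta = 1 recovers plain error-feedback accumulation.
    residual = m + gamma * g - v
    return (1.0 - beta) * m + beta * residual

# toy usage with made-up values
m = torch.zeros(5)
g = torch.tensor([1.0, -2.0, 0.5, 3.0, -0.1])
v = torch.zeros(5)
print(memory_update(m, g, v, gamma=0.1, beta=0.1))
```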

1: Input: initialize shared variable x_0 and local memories m_{0,i} = 0
2: for t = 0, 1, ..., T−1 do
3:     for workers i = 1, ..., n in parallel do
4:          Select ξ_{t,i}    ▷ set up mini-batch
5:          Compute a stochastic gradient g_{t,i}    ▷ each worker computes gradients
6:          v_{t,i} ← CLT-k(m_{t,i} + γ g_{t,i})    ▷ CLT-k compression (3)
7:          Update local memory m_{t+1,i}    ▷ low-pass filtering (5)
8:     end for
9:     Upload v_{t,i} to the server    ▷ comm. from workers to parameter server
10:     v_t ← (1/n) Σ_i v_{t,i}    ▷ gradient reduction
11:     Download v_t to each worker    ▷ comm. from parameter server to workers
12:     x_{t+1} ← x_t − v_t
13: end for
Algorithm 1 ScaleCom: Scalable Sparsified Gradient Compression

3 Scalable Sparsified Gradient Compression (ScaleCom)

In this section, we describe the details of our algorithm, ScaleCom, and its convergence properties. In ScaleCom, each worker first applies the CLT-k compressor as shown in (3). The sparsified data are directly added (reduced) across workers (integrated with all-reduce), avoiding 'gradient build-up'. After the all-reduce, each worker applies a low-pass filter to its local gradient accumulation, which improves the workers' memory similarity and smooths out abrupt noise induced by scaled learning rates. For simplicity, we use the parameter-server protocol to explain our algorithm, but it can naturally be extended to ring all-reduce implementations. The whole process is summarized in Algorithm 1 (throughout, t denotes the iteration index). In the rest of this section, we provide formal convergence properties for ScaleCom (see Appendix C for the proof and Appendix D for details of the convergence analysis).
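The sketch below simulates one iteration of Algorithm 1 in a single PyTorch process, with the n workers represented as tensors and the parameter-server reduction replaced by a plain average; the low-pass-filter update uses the same assumed convex-combination form as the earlier sketch rather than the exact eq. (5):

```python
import torch

def scalecom_step(x, memories, grads, t, gamma=0.1, beta=0.1, k=10):
    """One simulated ScaleCom iteration (Algorithm 1 sketch) for n workers."""
    n = len(grads)
    leader = t % n                                            # cyclic leader selection
    ef = [memories[i] + gamma * grads[i] for i in range(n)]   # error-feedback gradients
    idx = ef[leader].abs().topk(k).indices                    # leader's local top-k indices
    vs = []
    for i in range(n):
        v = torch.zeros_like(x)
        v[idx] = ef[i][idx]                                   # CLT-k compression (3)
        # assumed low-pass filtering (see the memory_update sketch above)
        memories[i] = (1.0 - beta) * memories[i] + beta * (ef[i] - v)
        vs.append(v)
    v_avg = torch.stack(vs).mean(0)                           # sparse reduction: all workers share idx
    return x - v_avg                                          # weight update

torch.manual_seed(0)
d, n = 1000, 4
x = torch.randn(d)
memories = [torch.zeros(d) for _ in range(n)]
for t in range(3):
    grads = [torch.randn(d) for _ in range(n)]                # stand-in stochastic gradients
    x = scalecom_step(x, memories, grads, t)
```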

Contraction Property:

We establish the contraction property of the CLT-k compressor based on the Hamming distance, which measures the overlap of two index sets. Suppose I is a set of indices of a vector x. Define a binarized vector b(I) as follows: b_i(I) = 1 if i ∈ I, and b_i(I) = 0 otherwise. Suppose I_1 and I_2 are two sets of indices. The Hamming distance between the two sets is then defined via their binarized vectors as:

(6)
Lemma 1.

Suppose x is a vector and its top-k index set is I_k(x), and x is sparsified by another index set I. If the two index sets overlap sufficiently (i.e., their Hamming distance is small enough), we have the following contraction property for this compressor:

(7)

where the second factor is the contraction coefficient of top-k sparsification.

We can see that the contraction coefficient in (7) depends on the Hamming distance. Specialized to the proposed CLT-k compressor, at each iteration an index set is generated from one local worker in a cyclic fashion. Consider the averaged error-feedback gradient among all workers; we assume a minimal overlap between the leading worker's local top-k indices and the global true top-k indices of this average. Therefore,

(8)

It follows that the CLT-k compressor satisfies the required contraction property.
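As a quick empirical check of this argument, the sketch below measures (i) the fraction of the true top-k indices of an averaged error-feedback gradient that a single worker's local top-k recovers (a proxy for the normalized Hamming-distance overlap) and (ii) the energy preserved by the resulting CLT-k output; the correlated random vectors are synthetic stand-ins for real error-feedback gradients:

```python
import torch

def index_overlap(local, avg, k):
    # proxy for the Hamming-distance overlap: fraction of the global true
    # top-k indices also selected by the leader's local top-k
    a = set(local.abs().topk(k).indices.tolist())
    b = set(avg.abs().topk(k).indices.tolist())
    return len(a & b) / k

def preserved_energy(avg, local, k):
    # ||CLT_k(avg)||^2 / ||avg||^2, the "energy" the compressor preserves
    idx = local.abs().topk(k).indices
    return (avg[idx].norm() ** 2 / avg.norm() ** 2).item()

torch.manual_seed(0)
d, k, n = 10_000, 25, 8
shared = torch.randn(d)
workers = [shared + 0.5 * torch.randn(d) for _ in range(n)]   # correlated EF gradients
avg = torch.stack(workers).mean(0)
print("overlap with true top-k:", index_overlap(workers[0], avg, k))
print("preserved energy       :", preserved_energy(avg, workers[0], k))
```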

Convergence Analysis:

Before showing the theoretical results, we make the following assumptions.

A.1 We suppose that the size of the gradient is upper bounded, i.e., ‖∇f_i(x)‖ ≤ G, and that the objective function is gradient Lipschitz continuous with constant L and lower bounded, i.e., f(x) ≥ f*.

A.2 We assume that the gradient estimate is unbiased, i.e., E[g_{t,i}] = ∇f_i(x_t), and has bounded variance, i.e., E‖g_{t,i} − ∇f_i(x_t)‖² ≤ σ².

By leveraging the contraction property of CLT-k, we obtain the following convergence rate guarantees.

Theorem 1.

Under assumptions A.1-A.2, suppose the sequence {x_t} is generated by CLT-k. Then, when the learning rate γ and the discounting factor β are chosen as

(9)

where T denotes the total number of iterations, we have

(10)

Remark 4. Theorem 1 showcases the linear speedup achieved by CLT-k, meaning that the optimality gap (i.e., the size of the gradient) decreases as the number of workers increases. Next, we analyze how the number of workers and the correlation between workers jointly affect convergence, especially when the number of workers is large.

Lemma 2.

Let denote and . Assume that gradients at different workers are positively correlated, (i.e., exists a positive constant such that ), and , then if we have such that , where .

Remark 5. It can be seen that if and , then , implying that the contraction constant is decreased w.r.t. . If , we will have , showing that in this case ScaleCom is able to find the first-order stationary point for any .

Discussion:

Given a pre-defined k, the contraction coefficient of CLT-k given in (7) depends on the top-k contraction coefficient and the Hamming distance. The top-k contraction property has been widely investigated in the literature. Theoretically, the worst-case bound on the top-k contraction coefficient matches that of random-k, attained when the components of the gradient are uniform in magnitude; in practice, it is observed to be much smaller than this worst-case value shi2019undtopk.

Figure 3: Normalized Hamming distance between true top-k and CLT-k, observed to lie between 0.6 and 0.8. This is measured using ResNet18 on CIFAR10 with learning rate 0.1 and a compression rate of 400X at epoch 0. The per-worker batch size is 32.

On the other hand, the Hamming distance measures the overlap between two top-k index sets. Figure 3 shows the normalized Hamming distance over iterations and for various numbers of workers. The smaller the Hamming distance, the closer CLT-k is to true top-k. It demonstrates that, empirically, the overlap between the local top-k indices from one worker and the global true top-k indices after all-reduce is substantial (the normalized overlap is in the range 0.6-0.8), which indicates a good contraction property of the CLT-k compressor in practice. This in turn affects the discounting factor in the low-pass filter, as shown in Theorem 1.

Large datasets and small batch size: A large dataset with a small batch size introduces more noise in the gradients, decreasing the statistical similarity between workers, and is thus harder to deal with. In the analysis above, we assumed a minimum overlap (in the Hamming-distance sense) between workers so that the contraction coefficient remains below 1, which is a mild assumption in practice. Figure 3 shows that when the per-worker batch size is 32, the normalized Hamming-distance overlap is still above 0.32, which is consistent with our pilot experiments, where we tried a per-worker mini-batch of 8 with 128 workers on CIFAR10 without any noticeable degradation. This indicates ScaleCom's applicability in challenging training conditions (large datasets and small mini-batch sizes).

4 Experimental Results

We apply ScaleCom to three major applications: vision (ImageNet, CIFAR10), language (WMT14 En-De), and speech (SWB300). Experiments are run on IBM POWER System AC922 systems using implementations in PyTorch (see Appendix E for experimental details). We adopt gpusorting to accelerate sorting, which divides the whole buffer into chunks and parallelizes sorting within each chunk. As suggested in chen2018adacomp lin2017deep, we use 1-5 warm-up epochs (<10 total training epochs) for compression; no compression is applied during the warm-up period. A conservative engineering guideline is used to set the compression rate of each layer based on its FLOPs-to-gradient-size ratio: 25X for the largest ratios, 50X for intermediate ratios, and 400X for ratios in (0, 128]. It should be noted that this guidance is based on the per-worker mini-batch size, 32 for vision and speech and 4.5k for language; as the per-worker mini-batch size changes, the compression rate is adjusted accordingly. In addition, to demonstrate the robustness of ScaleCom, a much more aggressive compression rate for the Transformer-based language model is tested in both standard and large batch size settings.

Standard Batch Size: In these experiments, we adopt the hyper-parameter settings from he2016deep cui2017embedding vaswaniattention (including learning rates and momentum) to achieve excellent baseline accuracy (listed in Table 2). The same hyper-parameters are used in the ScaleCom experiments, in which we set β = 1 in the low-pass filter, as there is no need to filter the gradients in standard batch size experiments. The experimental results are summarized in Table 2, and convergence curves are shown in Figure 4. With compression rates of 65-400X, ScaleCom achieves accuracies very close to the baseline on all workloads.

Model (Dataset) [metric: accuracy unless noted] | #GPU | BSZ | Comp. Rate | Baseline | Comp.
ResNet34 (CIFAR10) | 4 | 128 | 92X | 93.78 | 93.98
ResNet18 (ImageNet) | 8 | 256 | 112X | 70.482 | 70.172
ResNet50 (ImageNet) | 8 | 256 | 96X | 76.442 | 75.988
MobileNetV2 (ImageNet) | 8 | 256 | 155X | 71.644 | 71.524
Transformer-base (WMT14 En-De) [BLEU] | 8 | 36K | 47X (65X) | 27.64 | 27.27 (27.24)
4-bidirectional-LSTM Speech (SWB300) [WER] | 4 | 128 | 400X | 10.4 | 10.1

 

Values in parentheses indicate a more aggressive compression rate, applied without significant degradation.

Table 2: Baseline vs. compressed training at standard batch size on image, language, and speech models
Figure 4: Standard batch size training curves with ScaleCom on (a) ResNet18 on the ImageNet dataset, (b) MobileNetV2 (width multiplier 1.0) on ImageNet, (c) Transformer-base machine translation (ScaleCom corresponds to 65X in Table 2), and (d) the LSTM-based speech model on the SWB300 dataset. Convergence and accuracy are preserved across models and datasets. Final training results are summarized in Table 2.

Large Batch Size Scaling: To evaluate the scalability of our method, we follow goyal2017accurate ott2018scaling zhang2020improving to achieve state-of-the-art baseline accuracy in large-scale distributed settings (listed in Table 3). The compression experiments use the same hyper-parameters as the baselines. As discussed in Section 2, when we scale up the mini-batch size and learning rate in large-scale distributed training, the gradient noise increases and the local memory similarity among workers weakens, which can hurt network performance. As shown by the gray lines in Figure 5, when the low-pass filter is not applied (β = 1), the small dataset (CIFAR10) still reaches good accuracy, but the large datasets (ImageNet, WMT14, and SWB300) start to show degradation. Once the proposed low-pass filter is applied (β = 0.1), ScaleCom achieves almost identical test accuracy to the non-compressed baseline on every large network studied, as shown in Table 3 and Figure 5 (we observed that convergence is robust to the choice of β in the range 0.1-0.3 across the different networks).

Model (Dataset) [metric: accuracy unless noted] | #GPU | BSZ | Comp. Rate | Baseline | Comp.
ResNet34 (CIFAR10) | 32 | 1024 | 92X | 93.75 | 93.36
ResNet18 (ImageNet) | 64 | 2048 | 112X | 70.285 | 69.879
ResNet50 (ImageNet) | 64 | 2048 | 96X | 76.473 | 75.895
MobileNetV2 (ImageNet) | 64 | 2048 | 155X | 71.487 | 71.014
Transformer-base (WMT14 En-De) [BLEU] | 64 | 288K | 47X (115X) | 27.79 | 28.03 (27.59)
4-bidirectional-LSTM Speech (SWB300) [WER] | 12 | 1536 | 100X | 9.9 | 10.0

 

Values in parentheses indicate a more aggressive compression rate, applied without significant degradation.

Table 3: Baseline vs. compressed training at large batch size on image, language, and speech models
Figure 5: Large batch size training curves with ScaleCom on (a) ResNet18 on the ImageNet dataset, (b) MobileNetV2 (width multiplier 1.0) on ImageNet, (c) Transformer-base machine translation (ScaleCom corresponds to 115X in Table 3), and (d) the LSTM-based speech model on the SWB300 dataset.

5 End-to-end System Performance

In this section, we quantify the improvement in end-to-end training time achieved by ScaleCom. We consider a distributed training system comprised of multiple accelerator chips connected to a parameter server. Each accelerator chip consists of multiple cores with private scratchpad memory. The systematic performance analysis framework presented in venkataramani2019memory is used to estimate performance. Given a system configuration (compute throughput, memory capacity, interconnect topology and bandwidth), the framework analytically explores possible ways to map DNN computations onto the accelerator system and provides performance estimates (Appendix F provides further details on end-to-end system performance).

We present the performance impact of ScaleCom by varying three key factors: (i) the peak compute capability per worker (100 and 300 TFLOPs), (ii) the mini-batch size per worker (8 and 32), and (iii) the number of workers (8, 32, and 128). When the mini-batch per worker is increased, gradient/weight communication becomes less frequent, limiting the scope of end-to-end performance benefits from ScaleCom. This is evident from Figure 6a, where the communication time (as a fraction of total time) decreases from 56% to 20% when the mini-batch per worker is increased from 8 to 32. Consequently, with 100 TFLOPs of peak compute per worker, ScaleCom achieves total training speedups of 2X to 1.23X even with a 100X compression ratio. The fraction of communication time grows as the peak compute increases from 100 to 300 TFLOPs, resulting in speedups of 4.1X to 1.75X.
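These speedups are consistent with a simple Amdahl-style estimate in which only the gradient/weight communication shrinks by the compression ratio; the sketch below reproduces the 100-TFLOPs case using the communication fractions quoted above (56% and 20%) and a 100X compression ratio, ignoring index traffic and other overheads:

```python
def training_speedup(comm_fraction, compression_ratio):
    # communication time shrinks by the compression ratio; compute time is unchanged
    compressed_time = (1.0 - comm_fraction) + comm_fraction / compression_ratio
    return 1.0 / compressed_time

for frac in (0.56, 0.20):            # mini-batch/worker = 8 and 32 (Figure 6a)
    print(f"comm fraction {frac:.2f} -> speedup ~{training_speedup(frac, 100):.2f}x")
# prints ~2.24x and ~1.25x, in line with the ~2x and ~1.23x quoted above
```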

The key trait of ScaleCom is its performance scalability to larger numbers of workers, independent of the mini-batch size per worker. This is shown in Figure 6b, where the communication cost of prior top-k approaches increases linearly with the number of workers, whereas that of ScaleCom remains constant. With ScaleCom, gradient/weight communication takes <3% of total training time even with a large number of workers (128) and a small mini-batch per worker (8), leaving training throughput limited only by the efficiency of the computation itself.

Figure 6: Stacked bar charts for ResNet50 (ImageNet): (a) different per-worker mini-batch sizes and (b) different numbers of workers.

Cost of index communication and synchronization: To enable all workers to select the same gradients, ScaleCom incurs an additional overhead for communicating the top-k indices. As the index vector has the same degree of compression as the gradient vector, it occupies only 0.5% of the baseline communication time. Moreover, this cost remains constant, independent of the number of workers. ScaleCom also incurs an additional synchronization during the index communication. As in fully synchronous SGD, the slowest worker determines when gradient communication can begin; once this point is reached by all workers, the additional synchronization costs little extra time.

6 Conclusion

Gradient compression is a promising technique for resolving communication bottlenecks, but it has not been widely adopted in today's training systems. The two primary reasons for this are the lack of demonstrations on large batch sizes (and datasets) and the incompatibility of compression techniques with all-reduce schemes. In this paper, we propose a new compression algorithm, ScaleCom, that resolves both of these issues. We provide a theoretical convergence analysis for ScaleCom and experimentally demonstrate its scalability, robustness, and excellent compression rates (65-400X) across a spectrum of models, datasets, and batch sizes, laying the foundation for its adoption in large-scale systems.

Broader Impact

The amount of compute used for DNN training doubles every 3 to 4 months openai; this is faster than Moore's law, which doubles the number of transistors every 2 years. The latest language model, GPT-3 brown2020language, uses 175 billion parameters to achieve state-of-the-art performance on several NLP tasks such as common-sense reasoning and word prediction. Training, designing, and optimizing these gigantic models requires tremendous time (cost) and computation power. Our research results on compression in large-scale distributed training have two broad benefits:

(i) Reducing the time and cost to train DNN models: We believe that communication time will bottleneck the training time of distributed systems, and that this will become even more severe with recent significant improvements in the computational capability of deep learning training hardware. To address this bottleneck, compression techniques have been actively researched over the past few years and implemented in some practical training systems parthasarathi2019realizing. Our results on the scalability of gradient compression aim to push these techniques to larger-scale distributed training systems, which are needed to train expensive and powerful gigantic models. We believe that a scalable compression solution can accelerate machine learning research and reduce the cost for companies and research institutes to develop state-of-the-art DNNs for real applications and complex datasets.

(ii) Energy consumption and environmental concerns: Training DNNs, especially big models, consumes tremendous energy and is starting to raise concerns about CO2 emissions. As indicated in strubell2019energy, Transformer training with neural architecture search can emit as much CO2 as five cars over their lifetimes. Today, most DNN training runs on distributed systems, and energy is mainly consumed in data communication: 32-bit I/O communication takes 3-4 orders of magnitude more energy (in pJ) than a 32-bit floating-point ADD nvlink. Thus, efficient communication is crucial to reduce energy consumption and mitigate concerns about the carbon footprint of DNN training, especially for large-scale distributed training of gigantic DNNs. Our research cuts communication data volume by 65-400X and scales this method to larger distributed systems, which reduces energy consumption and mitigates environmental concerns in training gigantic DNNs. This helps to fight climate change and global warming.

Meanwhile, we would like to point out, although our compression scheme guarantees theoretical convergence and shows no accuracy loss compared to baseline training over the tested models and applications, there could still be concerns about the impact of lossy gradient compression on neural network convergence performance. Especially when gradient compression is applied directly without fine tuning hyper-parameters, training could still be subject to instability, and thus it is recommended to examine the compression scheme over a wider range of models and applications. Our conservative compression selection rules (described in section 4) help mitigate this concern, however task-specific robustness studies are recommended for special applications.

Acknowledgments

The authors would like to thank Jintao Zhang and Ashish Ranjan for helpful technical discussions, Kim-Khanh Tran, Anthony Giordano, I-Hsin Chung, Ming-Hung Chen, Kaoutar El maghraoui, and Jeffrey Burns for the computing infrastructure, and Leland Chang, Arvind Kumar, Yulong Li, Shubham Jain, Sunil Shukla, Ankur Agrawal, Marcel Schaal, Mauricio Serrano, Wei Wang and the team for the chip platform targeted in this work. This research is realized by generous collaborations across IBM Research. Funding of this work is fully supported by IBM Research.

References

Appendix A Observations in Local Memory Similarity

We observed the local memory's similarity through Q-Q (quantile-quantile) plots, as shown in Figure A1(a)-(c). In Figure A1(a), the linearity of the points in the Q-Q plot suggests that worker 1's local memory (accumulated gradient) magnitudes have a statistical distribution very similar to worker 2's. The red line is the linear regression fit to the blue dots; its overall R² score is 0.99, indicating that the local memory magnitude distributions of workers 1 and 2 are very similar. This is consistent with our observations of the pairwise cosine distance shown in Figure 2(a). Memory accumulation (accumulating gradients over iterations) reduces gradient variation and enhances similarity across workers. On the other hand, when plotting the computed gradients (right after the backward pass) in a Q-Q plot, we do not observe the same excellent similarity between workers 1 and 2, as shown in Figure A1(b); in this case, the R² score of the linear regression fit is 0.89. In Figure A1(c), we compare worker 1's error-feedback gradient (local memory + computed gradient) magnitudes with the globally all-reduced error-feedback gradient magnitudes. From the plot, we observe that their distribution quantiles are highly correlated (R² = 0.99). The Spearman's rank correlation coefficient between worker 1's and the all-reduced error-feedback gradient magnitudes is 0.657 (p-value ≈ 0). This indicates that we can plausibly use a local worker's top-k selection to approximate the true top-k selection.

Figure A1: Q-Q plots of local memory and gradient magnitudes on ResNet18 (CIFAR10) with 8 workers, at iteration 100 in epoch 0. Local top-k strom2015scalablea (top 0.1%) and a learning rate of 0.01 are used in the experiments. Red lines are the linear regression fits. (a) Worker 1's local memory magnitude quantiles versus worker 2's; (b) worker 1's computed gradient magnitudes versus worker 2's; (c) worker 1's error-feedback gradient (local memory + computed gradient) magnitudes versus the all-reduced (global) error-feedback gradient magnitudes.

Appendix B Preliminaries

Before showing the convergence proofs, we give the following table to highlight the notations and definitions of the variables used in the proofs.

Recall the optimization problem is

(A1)

and the gradient estimate is as the following

(A2)
Parameter | Expression/Definition | Representation
d | — | problem dimension
d_H | (6) | Hamming distance
n | — | total number of workers
γ | — | step size
β | — | discounting factor of the low-pass filter
α | — | contraction parameter
x | — | optimization variable
f_i | — | individual objective function at worker i
f | (A1) | objective function
I_k(·) | (2) | index set of the top-k entries
B_{t,i} | n/a | mini-batch set at worker i
g_{t,i} | (A2) | gradient estimate at worker i
m_{t,i} | — | memory at the i-th node
t | — | index of iteration
T | — | total number of iterations
G | A.1 | upper bound on the gradients' size
L | A.1 | gradient Lipschitz constant
f* | A.1 | global minimum of f
Table A1: Definition of parameters used in the proofs

We also use several standard inequalities as follows:

1. Young's inequality with parameter ε > 0 is

⟨a, b⟩ ≤ (ε/2)‖a‖² + (1/(2ε))‖b‖²    (A3)

where a and b are vectors.

One variant of Young's inequality is

‖a + b‖² ≤ (1 + ε)‖a‖² + (1 + 1/ε)‖b‖²    (A4)

2. The quadrilateral identity is

(A5)

Appendix C Proof of Lemma 1 on Contraction Property

Proof.

Suppose x is a d-dimensional vector and I_k(x) is its top-k index set, and suppose x is sparsified by the compressor via another index set I.

We have

(A6)
(A7)
(A8)
(A9)

Since the Hamming distance is bounded, there is an overlap of indices between I and I_k(x). Taking the expectation over all possible combinations and permutations, we have

(A10)
(A11)
(A12)
(A13)
(A14)
(A15)

which completes the proof. ∎

Appendix D Convergence Performance Analysis of CLT-

Lemma 3.

Under assumptions A.1-A.2, suppose the sequence is generated by CLT-k. Then, under a suitable step-size condition, we have

(A16)

where , and and are some constants.

D.1 Proof of Lemma 3

Proof.

Define a “virtual” sequence as the following

(A17)

For simplicity of notation, we denote as . When , it is easy to check that

(A18)

so we have

(A19)

From A. 2, we have

(A20)

since .

Then, we are able to quantify as follows:

(A21)
(A22)
(A23)

where we apply Lemma 1.

Squaring both sides of (A23), we have

(A24)
(A25)

where we use Young's inequality with parameter ε.

Taking expectation on both sides of (A25), we have

(A26)
(A27)

Since the size of the gradients is bounded by G, we have

(A28)
(A29)

It is obvious that when

(A30)

then the sequence will be upper bounded by a constant. Therefore, we require

(A31)

i.e., such that . Then, we can choose small enough, i.e.,

(A32)

so that when .

Next, in order to get , we need

(A33)

which is equivalent to

(A34)

Therefore, when

(A35)

and , we have . Note that here is always greater than . The reasons are as follows: first, we know that

(A36)

Adding on both sides, we will get

(A37)

which implies

(A38)

Dividing on both sides, we can arrive at

(A39)

which is the desired result.

Then, iterating (A29) gives

(A40)