1. Introduction
Training deep learning models is a major workload on large-scale computing systems. While such training may be parallelized in many ways (Ben-Nun and Hoefler, 2019), the dominant and simplest form is data parallelism. In data-parallel training, the model is replicated across different compute nodes. After the computation of the local gradient on each process finishes, the distributed gradients are accumulated across all processes, usually using an allreduce (Chan et al., 2007) operation. However, not all gradient components are equally important, and the communication of the distributed gradients can be sparsified significantly, introducing up to 99.9% zero values without significant loss of accuracy. Only the nonzero values of the distributed gradients are then accumulated across all processes. See (Hoefler et al., 2021) for an overview of gradient and other sparsification approaches in deep learning.
However, sparse reductions suffer from scalability issues. Specifically, the communication volume of the existing sparse reduction algorithms grows with the number of processes P. Taking the allgather-based sparse reduction (Renggli et al., 2019; Wang et al., 2020; Shi et al., 2019a) as an example, its communication volume is proportional to P, which eventually surpasses that of a dense allreduce as P increases. Other, more complex algorithms (Renggli et al., 2019) suffer from significant fill-in during the reduction, which also leads to a quick increase of the data volume as P grows, and may degrade to dense representations on the fly. For example, let us assume the model has 1 million weights and is 99% sparse at each node; thus, each node contributes its 10,000 largest gradient values and their indexes to the calculation. Let us now assume that the computation is distributed across 128 data-parallel nodes and the reduction uses a dissemination algorithm (Hensgen et al., 1988; Li et al., 2013) with 7 stages. In stage one, each process communicates its 10,000 values to be summed up. Each process then enters the next stage with up to 20,000 values. Those again are summed up, leading to up to 40,000 values in stage 3 (if the value indexes do not overlap). The number of values grows exponentially until the algorithm terminates after 7 stages with 640,000 values (nearly dense!). Even with overlapping indexes, the fill-in will quickly diminish the benefits of gradient sparsity in practice and lead to large and suboptimal communication volumes (Renggli et al., 2019).
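The worst-case growth above can be sketched with a few lines of Python (the function name is ours, purely for illustration): with no index overlap, the number of nonzeros a process sends doubles at every stage of a dissemination reduction until it approaches the dense size n.

```python
import math

def worst_case_sends(k, n, num_procs):
    """Values each process sends at every stage of a dissemination
    reduction, assuming no index overlap (worst-case fill-in)."""
    stages = math.ceil(math.log2(num_procs))
    sends = []
    nnz = k
    for _ in range(stages):
        sends.append(nnz)
        nnz = min(2 * nnz, n)  # partial sums accumulate disjoint indexes
    return sends

# 1M weights, 99% sparse (k = 10,000), 128 processes -> 7 stages
print(worst_case_sends(10_000, 1_000_000, 128))
# -> [10000, 20000, 40000, 80000, 160000, 320000, 640000]
```

The final stage communicates 640,000 values, matching the nearly dense outcome described above.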
We show how to solve or significantly alleviate the scalability issues of large sparse allreduce operations, leading to an asymptotically optimal O(k) sparse reduction algorithm. Our intuitive and effective scheme, called Ok-Topk, is easy to implement and can be extended with several features to improve its performance: (1) explicit sparsity load balancing distributes the communication and computation more evenly, leading to higher performance; (2) a shifted schedule and bucketing during the reduction phase avoid local hotspots; and (3) an efficient selection scheme for top-k values avoids costly sorting, leading to a significant speedup.
We implement Ok-Topk in PyTorch (Paszke et al., 2019) and compare it to four other sparse allreduce approaches. Specifically, Ok-Topk enables:
a novel sparse allreduce incurring less than 6k (asymptotically optimal) communication volume, which is more scalable than the existing algorithms;

a parallel SGD optimizer using the proposed sparse allreduce with high training speed and a convergence guarantee;

an efficient and accurate prediction of the top-k values by regarding the gradient values (along the time dimension) as a slowly changing stochastic process.
We study the parallel scalability and the convergence of different neural networks, including image classification (VGG16 (Simonyan and Zisserman, 2014) on Cifar10), speech recognition (LSTM (Hochreiter and Schmidhuber, 1997) on AN4), and natural language processing (BERT (Devlin et al., 2018) on Wikipedia), on the Piz Daint supercomputer with a Cray Aries HPC network. Compared with the state-of-the-art approaches, Ok-Topk achieves the fastest time-to-solution (i.e., reaching the target accuracy/score in the shortest time for full training), and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs). We expect speedups to be even bigger in cloud environments with commodity networks. The code of Ok-Topk is available at: https://github.com/Shigangli/OkTopk

2. Background and Related Work
Mini-batch stochastic gradient descent (SGD) (Bottou et al., 2018) is the mainstream method to train deep neural networks. Let b be the mini-batch size, x_t the neural network weights at iteration t, s a sample in a mini-batch, and ℓ a loss function. During training, the forward pass computes the loss for each sample as ℓ(x_t, s), and the backward pass then computes a stochastic gradient G_t = (1/b) Σ_s ∇ℓ(x_t, s). The model is trained iteratively such that x_{t+1} = x_t − γ G_t, where γ is the learning rate.
To scale the training process to parallel machines, data parallelism (Goyal et al., 2017; Sergeev and Del Balso, 2018; You et al., 2018, 2019b; Li et al., 2020a, b) is the common method, in which the mini-batch is partitioned among P workers and each worker maintains a copy of the entire model. Gradient accumulation across workers is often implemented using a standard dense allreduce (Chan et al., 2007), leading to about 2n communication volume, where n is the number of gradient components (equal to the number of model parameters). However, recent deep learning models (Devlin et al., 2018; Real et al., 2019; Radford et al., 2019; Brown et al., 2020) have scaled rapidly from millions to billions of parameters, and the proportionally increasing overhead of dense allreduce becomes the main bottleneck in data-parallel training.
Gradient sparsification (Aji and Heafield, 2017; Alistarh et al., 2018; Cai et al., 2018; Renggli et al., 2019; Shi et al., 2019a, b, c; Han et al., 2020; Fei et al., 2021; Xu et al., 2021) is a key approach to lower the communication volume. By top-k selection, i.e., only selecting the largest k (in terms of absolute value) of the n gradient components, the gradient becomes very sparse (commonly around 99%). Sparse gradients are accumulated across workers using a sparse allreduce. Then, the accumulated sparse gradient is used in the SGD optimizer to update the model parameters, which is called Top-k SGD. The convergence of Top-k SGD has been proved both theoretically and empirically (Alistarh et al., 2018; Renggli et al., 2019; Shi et al., 2019a). However, the parallel scalability of the existing sparse allreduce algorithms is limited, which makes it very difficult to obtain real performance improvements, especially on machines (e.g., supercomputers) with high-performance interconnection networks (Shanley, 2003; Alverson et al., 2012; Sensi et al., 2020; Foley and Danskin, 2017).
Algorithms | Bandwidth | Latency
Dense (Chan et al., 2007) | 2nβ | 2log₂(P)α
Top-k A (Renggli et al., 2019; Wang et al., 2020) | 2k(P−1)β | log₂(P)α
Top-k DSA (Renggli et al., 2019) | [4k, 2n+2k]β ¹ | O(log₂(P))α
gTop-k (Shi et al., 2019b) | 4k·log₂(P)β | 2log₂(P)α
Gaussian-k (Shi et al., 2019a) | 2k(P−1)β | log₂(P)α
Ok-Topk | [2k, 6k]β ¹ | O(P + log₂(P))α

¹ Intervals denote [best case, worst case].
Table 1 summarizes the existing dense and sparse allreduce approaches. We assume all sparse approaches use the coordinate (COO) format to store the sparse gradient, which consumes 2k storage, i.e., k values plus k indexes. There are other sparse formats (see (Hoefler et al., 2021) for an overview), but format selection for a given sparsity is not the topic of this work. To model the communication overhead, we assume bidirectional and direct point-to-point communication between the compute nodes, and use the classic latency-bandwidth cost model: the cost of sending a message of size m is α + βm, where α is the latency and β is the transfer time per word.
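A minimal sketch of the bandwidth terms from Table 1 (the helper names are ours, and the latency terms are omitted) makes the scalability contrast concrete: the allgather-based volume grows with P, while Ok-Topk's stays bounded by a constant multiple of k.

```python
def msg_cost(m, alpha, beta):
    """Latency-bandwidth cost of sending one message of m words."""
    return alpha + beta * m

# Bandwidth terms (in words) from Table 1, latency omitted for brevity:
def dense_volume(n):              # Rabenseifner's allreduce, ~2n
    return 2 * n

def topk_allgather_volume(k, p):  # Top-k A: gather 2k words from P-1 peers
    return 2 * k * (p - 1)

def ok_topk_volume(k):            # Ok-Topk: bounded by 6k regardless of P
    return 6 * k

# With n = 1e6 and 99% sparsity (k = 1e4), the allgather-based approach
# already exceeds the dense volume at P = 256, while Ok-Topk stays far below.
n, k = 1_000_000, 10_000
assert topk_allgather_volume(k, 256) > dense_volume(n)
assert ok_topk_volume(k) < dense_volume(n)
```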
For dense allreduce, Rabenseifner’s algorithm (Chan et al., 2007) reaches the lower bound (Thakur et al., 2005) on the bandwidth term (i.e., about 2n). Top-k A represents the allgather-based approach (Renggli et al., 2019; Shi et al., 2019b), in which each worker gathers the sparse gradients of all P workers and then conducts the sparse reduction locally. Although Top-k A is easy to realize and does not suffer from the fill-in problem, the bandwidth overhead of allgather is proportional to P (Thakur et al., 2005; Chan et al., 2007) and thus not scalable. Top-k DSA represents the Dynamic Sparse Allreduce used in SparCML (Renggli et al., 2019), which consists of a reduce-scatter followed by an allgather (motivated by Rabenseifner’s algorithm) on the sparse gradients. In the best case, where the indexes of the top-k values fully overlap across workers and the top-k values are uniformly distributed in the gradient space, Top-k DSA only incurs about 4k communication volume. However, this best case is almost never encountered in real distributed training; Top-k DSA usually suffers from the fill-in problem and switches to a dense allgather once sparsity no longer brings a benefit, which incurs about 2n+2k communication volume. gTop-k (Shi et al., 2019b) implements the sparse allreduce using a reduction tree followed by a broadcast tree. To solve the fill-in problem, gTop-k hierarchically selects top-k values in each level of the reduction tree, which results in about 4k·log₂(P) communication volume. Gaussian-k (Shi et al., 2019a) uses the same sparse allreduce algorithm as Top-k A with a further optimization for top-k selection. For our Ok-Topk, the communication volume is bounded by 6k. Although Ok-Topk has a slightly higher latency term than the others, we target large-scale models and thus the bandwidth term dominates. Since the bandwidth term of Ok-Topk is only related to k, our algorithm is more efficient and scalable than all the others. See Section 5.4 for experimental results.

Gradient quantization (Dryden et al., 2016; Alistarh et al., 2017; Wen et al., 2017; Horváth et al., 2019; Nadiradze et al., 2021; Bernstein et al., 2018), which reduces the precision and uses a smaller number of bits to represent gradient values, is another technique to reduce the communication volume. Note that this method is orthogonal to gradient sparsification; a combination of sparsification and quantization is studied in SparCML (Renggli et al., 2019).
Another issue is the sparsification overhead. Although gradient sparsification significantly reduces the local message size, top-k selection on many-core architectures, such as GPUs, is not efficient. A naive method is to sort all values and then select the top k components. Asymptotically optimal comparison-based sorting algorithms, such as merge sort and heap sort, have O(n·log(n)) complexity but are not GPU-friendly. Bitonic sort is GPU-friendly but requires O(n·log²(n)) comparisons. Quickselect-based (Mahmoud et al., 1995) top-k selection has an average complexity of O(n) but again is not GPU-friendly. Bitonic top-k (Shanbhag et al., 2018) is a GPU-friendly algorithm with complexity O(n·log²(k)), but still not good enough for large n. To lower the overhead of sparsification, Gaussian-k (Shi et al., 2019a) approximates the distribution of gradient values by a Gaussian distribution with the same mean and standard deviation, then estimates a threshold using the percent-point function and selects the values above the threshold. The top-k selection in Gaussian-k is GPU-friendly with complexity O(n), but it usually underestimates the value of k because of the difference between the Gaussian and the real distributions (see Section 3.1.3). Adjusting the threshold adaptively (e.g., lowering the threshold for an underestimated k) (Shi et al., 2019a) is difficult to do accurately. In Ok-Topk, we use a different method for the top-k selection. We observe that the distribution of gradient values changes slowly during training. Therefore, we periodically calculate the accurate threshold and reuse it in the following iterations within a period. Empirical results show that this threshold reuse strategy achieves both accuracy (see Section 5.2) and efficiency (see Section 5.4) when selecting local and global top-k values in Ok-Topk.

3. O(k) Sparse Allreduce
In this section, we present the sparse allreduce algorithm of Ok-Topk, analyze its complexity using the aforementioned latency (α) - bandwidth (β) cost model, and prove its optimality. We use the COO format to store the sparse gradient. Since the algorithm incurs less than 6k communication volume, we call it O(k) sparse allreduce.
3.1. Sparse allreduce algorithm
O(k) sparse allreduce mainly includes two phases: (1) split and reduce, and (2) balance and allgatherv. During the two phases, we propose an efficient top-k selection strategy to select (what we call) local top-k values and global top-k values, respectively. Specifically, the semantics of O(k) sparse allreduce are defined by G_t = Topk(Σ_{p=1}^{P} Topk(G_t^p)), where G_t^p is the sparse gradient on worker p at training iteration t, the inner Topk operator is the local top-k selection, and the outer Topk operator is the global top-k selection.
3.1.1. Split and reduce
Figure 1 presents the split and reduce phase. Suppose we have 4 workers and each worker has a 2D gradient of size 16x4. In Figure 1(a), each worker selects the local top-k values to sparsify the gradient. How to efficiently select the top-k values will be discussed in Section 3.1.3. A straightforward split and reduce for the sparse gradients is presented in Figure 1(b), in which the 2D space of the sparse gradient is evenly partitioned into P regions and worker i is responsible for the reduction on region i. Each worker receives P−1 sparse regions from the other workers and then conducts the reduction locally. However, this simple partitioning method may lead to severe load imbalance, since the top-k values may not be evenly distributed among the regions. In an extreme case, all local top-k values fall into region 0 of each worker; then worker 0 has to receive a total of 2k(P−1) elements (i.e., values and indexes) while the other workers receive zero elements.
Without loss of generality, we can make a more balanced partition (as shown in Figure 1(c)) based on our observations for deep learning tasks: the coordinate distribution of the local top-k values of the gradient is approximately consistent among the workers at the coarse-grained (e.g., region-wise) level, and changes slowly during training. To achieve a balanced partition, each worker calculates the local boundaries of the P regions by balancing its local top-k values. Then, a consensus is made among workers by globally averaging the (P−1)-dimensional boundary vectors, which requires an allreduce with a message size of P−1 elements. The boundaries are recalculated after every T_b iterations; we empirically set T_b = 64 to get a performance benefit from periodic space repartition, as shown in Section 5.3. Note that the small overhead of this allreduce is amortized by the reuse in the following T_b − 1 iterations, resulting in only 1/T_b of its cost per iteration; the overhead of boundary recalculation can therefore be ignored. After making a balanced split, each worker approximately receives 2k/P elements from each of the other workers. Therefore, the overhead is

T_split-reduce = (P−1)α + 2k((P−1)/P)β ≈ (P−1)α + 2kβ.    (1)
We further apply two optimizations to split and reduce: destination rotation and bucketing. As shown in Figure 2(a), the naive communication pattern is that all workers send data to worker j at step j, which may lead to endpoint congestion (Wu et al., 2019). To avoid these hotspots, we rotate the destinations of each worker as shown in Figure 2(b). Furthermore, to utilize network parallelism, we bucketize the communications: the messages within a bucket are sent out simultaneously using non-blocking point-to-point communication functions, and communication in the current bucket can be overlapped with the computation (i.e., local reduction) of the previous bucket.
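The rotated schedule can be sketched in a few lines (the function name is ours): at step s, worker w sends its chunk to worker (w + s) mod P, so every worker is the destination of exactly one message per step and no endpoint is congested.

```python
def rotated_schedule(p):
    """schedule[s] = list of (src, dst) pairs at step s+1 of P-1 steps."""
    return [[(w, (w + s) % p) for w in range(p)] for s in range(1, p)]

# Every step is a perfect matching: all destinations are distinct.
for step in rotated_schedule(4):
    dsts = [d for _, d in step]
    assert len(set(dsts)) == len(dsts)
```

In contrast, the naive schedule (all workers target worker j at step j) makes one worker receive P−1 messages at once.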
3.1.2. Balance and allgatherv
Figure 3 presents the balance and allgatherv phase. First, each worker selects the global top-k values from the reduced local top-k values in the region that the worker is in charge of. Note that the global top-k selection happens only locally, according to an estimated threshold (discussed in detail in Section 3.1.3). Next, each worker packages the selected global top-k values (and the corresponding indexes) into a consecutive buffer. Similar to the local top-k values, the global top-k values may also not be evenly distributed among the workers, causing load imbalance. In an extreme case, all global top-k values are located in one worker. The classic recursive-doubling-based allgatherv (Thakur et al., 2005) would then incur 2k·log₂(P) communication volume, namely log₂(P) total steps with each step causing up to 2k traffic.
To bound the communication volume by 4k, we conduct a data balancing step after packaging. Before balancing, an allgather is required to collect the consecutive buffer sizes from all workers, which only incurs log₂(P)α overhead (the bandwidth term can be ignored). Then, each worker uses the collected buffer sizes to generate the communication scheme (i.e., which data chunk should be sent to which worker) for data balancing. We use point-to-point communication (blue arrows in the data balancing step of Figure 3) to realize the scheme. The overhead of data balancing is bounded by the extreme case in which all global top-k values are located in one worker, where data balancing costs α + 2k((P−1)/P)β; data balancing in any other case costs less. At last, an allgatherv using recursive doubling on the balanced data has an overhead of log₂(P)α + 2k((P−1)/P)β. Therefore, the overhead of balance and allgatherv is
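The target of the balancing step can be sketched as follows (a sketch with a hypothetical helper; the real implementation additionally derives the point-to-point chunk transfers from these targets): given the per-worker buffer sizes collected by the small allgather, compute the even split each worker should end up with, so the subsequent recursive-doubling allgatherv moves a bounded volume per step.

```python
def balanced_targets(sizes):
    """Even split of the total: worker i ends with total // P elements,
    plus one extra for the first total % P workers."""
    total, p = sum(sizes), len(sizes)
    return [total // p + (1 if i < total % p else 0) for i in range(p)]

# Extreme case: all 2k elements (k = 4000 values + indexes) start on one worker.
sizes = [8000, 0, 0, 0]
print(balanced_targets(sizes))   # -> [2000, 2000, 2000, 2000]
assert max(balanced_targets(sizes)) <= sum(sizes) // len(sizes) + 1
```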
T_balance-allgatherv ≈ (2log₂(P) + 1)α + 4k((P−1)/P)β ≈ (2log₂(P) + 1)α + 4kβ.    (2)
By adding the costs of the two phases, the total overhead of O(k) sparse allreduce is
T_total ≈ (P + 2log₂(P))α + 6k((P−1)/P)β ≈ (P + 2log₂(P))α + 6kβ.    (3)
3.1.3. Efficient selection for top-k values
Ok-Topk relies on estimated thresholds to approximately select the local and global top-k values. The key idea is to regard the gradient values (along the time dimension) as a slowly changing stochastic process: the statistics (such as top-k thresholds) of the gradient change very slowly. Therefore, we only calculate the accurate thresholds for local and global top-k values after every T iterations, and then reuse the thresholds in the following T−1 iterations. The accurate threshold can be obtained by sorting the gradient values and returning the k-th largest value. Top-k selection according to a threshold only requires n comparisons and is quite efficient on GPUs. The overhead of the accurate threshold calculation is amortized by the reuse.
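The threshold-reuse strategy can be sketched as a small class (names hypothetical; a real implementation would use a GPU top-k kernel instead of `heapq`): every `period` iterations the exact k-th largest magnitude is recomputed, and in between, selection is a single comparison pass over the n values.

```python
import heapq

class ThresholdTopK:
    def __init__(self, k, period):
        self.k, self.period, self.thr = k, period, None

    def select(self, grad, it):
        """Return (index, value) pairs whose magnitude reaches the
        cached threshold; refresh the threshold every `period` iterations."""
        if it % self.period == 0 or self.thr is None:
            # exact k-th largest magnitude (amortized over the period)
            self.thr = heapq.nlargest(self.k, map(abs, grad))[-1]
        # cheap pass: n comparisons against the cached threshold
        return [(i, g) for i, g in enumerate(grad) if abs(g) >= self.thr]

sel = ThresholdTopK(k=2, period=4)
print(sel.select([0.1, -0.9, 0.05, 0.7], it=0))  # -> [(1, -0.9), (3, 0.7)]
```

Between refreshes the number of selected values may drift slightly above or below k, which is exactly the deviation measured in Section 5.2.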
We validate our claim with the empirical results from different deep learning tasks presented in Figure 4. The gradient value distribution shown in Figure 4 is for a selected iteration where Ok-Topk uses a threshold calculated more than 25 iterations earlier. We can see that the threshold of Ok-Topk is still very close to the accurate threshold (see Section 5.2 for the accuracy verification of the top-k selections of Ok-Topk over a full training run). On the contrary, Gaussian-k severely underestimates the value of k by predicting a larger threshold, especially after the first few training epochs. This is because, as training progresses, the gradient values get closer to zero; a Gaussian distribution with the same mean and standard deviation as the real distribution usually has a longer tail than the real distribution (see Figure 4).

3.1.4. Pseudocode of O(k) sparse allreduce
We present the pseudocode of O(k) sparse allreduce in Algorithm 1. In Lines 2-4, the local top-k threshold is re-evaluated after every T iterations. In Lines 5-7, the region boundaries are re-evaluated after every T_b iterations. Split and reduce is conducted in Line 8, which returns the reduced local top-k values in region i and the indexes of the local top-k values. In Lines 9-12, the global top-k threshold is re-evaluated after every T iterations. Balance and allgatherv is conducted in Line 13, which returns a sparse tensor with the global top-k values and their indexes. Line 14 calculates the intersection of the indexes of the local top-k values and the indexes of the global top-k values. This intersection (used in Ok-Topk SGD in Section 4) covers the indexes of the local values which eventually contribute to the global top-k values.

3.2. Lower bound for communication volume
Theorem 3.1.
For sparse gradients stored in COO format, O(k) sparse allreduce incurs at least 2k(P−1)/P communication volume.
Proof.
For O(k) sparse allreduce, each worker eventually obtains the k global top-k values. Assume that all workers receive fewer than k(P−1)/P values from the others, which means that each worker already has more than k/P of the global top-k values locally. By adding up the number of global top-k values in each worker, we obtain more than k global top-k values, which is impossible. Therefore, each worker has to receive at least k(P−1)/P values. Considering the corresponding indexes, the lower bound is 2k(P−1)/P.
∎
The lower bound in Theorem 3.1 is achieved by O(k) sparse allreduce in the following special case: all local top-k values of worker i fall into region i, which worker i is in charge of, so that the reduction across workers is no longer required. Furthermore, the global top-k values are uniformly distributed among the workers, namely each worker has exactly k/P of the global top-k values. Then, an allgather to obtain the global top-k values (plus indexes) incurs 2k(P−1)/P communication volume. Therefore, the lower bound is tight. Since O(k) sparse allreduce incurs at most about 6k communication volume (see Equation 3), it is asymptotically optimal.
4. Ok-Topk SGD Algorithm
In this section, we discuss how O(k) sparse allreduce works with the SGD optimizer for distributed training. An algorithmic description of Ok-Topk SGD is presented in Algorithm 2. The key point of the algorithm is to accumulate the residuals (i.e., the gradient values not contributing to the global top-k values) locally, since they may be chosen by the top-k selection in future training iterations. Empirical results of existing work (Renggli et al., 2019; Shi et al., 2019b; Alistarh et al., 2018) show that residual accumulation benefits convergence. We use ε_t^p to represent the residuals maintained by worker p at iteration t. In Line 4 of Algorithm 2, the residuals are accumulated with the newly generated gradient G_t^p to obtain the accumulator A_t^p. Then, in Line 5, A_t^p is passed to O(k) sparse allreduce (presented in Algorithm 1), which returns the sparse tensor U_t containing the global top-k values and the index set I_t^p marking the values in A_t^p which contribute to U_t. In Line 6, the residuals are updated by setting the values in A_t^p marked by I_t^p to zero. In Line 7, U_t is applied to update the model parameters.
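The residual-accumulation loop can be sketched on a single worker as follows (a toy sketch: `topk_with_indexes` is a local stand-in for the distributed sparse allreduce, and all names are ours): values that do not make the top-k stay in the residual buffer and may be selected in a later iteration.

```python
def topk_with_indexes(v, k):
    """Indexes and values of the k largest-magnitude components."""
    idx = sorted(range(len(v)), key=lambda i: -abs(v[i]))[:k]
    return {i: v[i] for i in idx}

def step(x, grad, resid, k, lr):
    acc = [r + g for r, g in zip(resid, grad)]        # Line 4: accumulate
    update = topk_with_indexes(acc, k)                # stand-in for Line 5
    new_resid = [0.0 if i in update else a            # Line 6: keep the rest
                 for i, a in enumerate(acc)]
    new_x = [xi - lr * update.get(i, 0.0)             # Line 7: apply update
             for i, xi in enumerate(x)]
    return new_x, new_resid

x, resid = [0.0] * 4, [0.0] * 4
x, resid = step(x, [0.5, -0.1, 0.02, -0.8], resid, k=2, lr=1.0)
print(x, resid)  # -> [-0.5, 0.0, 0.0, 0.8] [0.0, -0.1, 0.02, 0.0]
```

Note how the two small components survive in the residual; if they keep receiving contributions, they will eventually pass the top-k threshold and be applied.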
4.1. Convergence proof
Unless otherwise stated, ‖·‖ denotes the 2-norm.
Theorem 4.1.
Consider the Ok-Topk SGD algorithm when minimizing a smooth, non-convex objective function f. Then there exists a learning rate schedule (γ_t) such that the following holds:

lim_{T→∞} min_{t∈[0,T]} E[‖∇f(x_t)‖²] = 0.    (4)
Proof.
The update to x_t in Ok-Topk SGD is Topk(Σ_{p=1}^{P} Topk(A_t^p)), while the top-k components of the sum of the updates across all workers, i.e., the true global top-k values intended to be applied, are Topk(Σ_{p=1}^{P} A_t^p). We assume the difference between the update calculated by Ok-Topk SGD and the true global top-k values is bounded by the norm of the true gradient G_t. Then, we make the following assumption:
Assumption 1.
There exists a (small) constant ξ such that, for every iteration t ≥ 0, we have:

‖Topk(Σ_{p=1}^{P} Topk(A_t^p)) − Topk(Σ_{p=1}^{P} A_t^p)‖² ≤ ξ ‖G_t‖².    (5)
We validate Assumption 1 with empirical results on different deep learning tasks in Section 5.1. We then follow the convergence proof for Top-k SGD in the non-convex case, presented in the work of Alistarh et al. (2018), to prove Theorem 4.1.
∎
Regarding Theorem 4.1, we note the following. First, since we analyze non-convex objectives, we settle for a weaker notion of convergence than in the convex case (where one can prove convergence to a global minimum): for a given sequence of learning rates, we prove that the algorithm converges to a stationary point with vanishing gradient. Second, like most theoretical results of this type, it does not prescribe a precise setting for the hyperparameters, except for indicating diminishing learning rates.
5. Evaluations
We conduct our experiments on the CSCS Piz Daint supercomputer. Each Cray XC50 compute node contains an Intel Xeon E5-2690 CPU and one NVIDIA P100 GPU with 16 GB of memory. We utilize the GPU for acceleration in all following experiments. The compute nodes of Piz Daint are connected by a Cray Aries interconnect in a Dragonfly topology.
Tasks | Models | Parameters | Dataset
Image classification | VGG16 (Simonyan and Zisserman, 2014) | 14,728,266 | Cifar10
Speech recognition | LSTM (Hochreiter and Schmidhuber, 1997) | 27,569,568 | AN4 (Acero and Stern, 1990)
Language processing | BERT (Devlin et al., 2018) | 133,547,324 | Wikipedia (Devlin et al., 2018)
We use three neural networks from different deep learning domains, summarized in Table 2, for evaluation. For VGG16, we use the SGD optimizer with an initial learning rate of 0.1; for LSTM, we use the SGD optimizer with an initial learning rate of 1e-3; for BERT, we use the Adam (Kingma and Ba, 2014) optimizer with an initial learning rate of 2e-4, β1 = 0.9, β2 = 0.999, a weight decay of 0.01, and linear decay of the learning rate. For BERT, the sparse allreduce is conducted on the gradients and the Adam optimizer is applied afterwards. We compare the performance of Ok-Topk with the parallel SGD schemes using the dense and sparse allreduce algorithms listed in Table 1, which cover the state-of-the-art. For a fair comparison, all schemes are implemented in PyTorch (Paszke et al., 2019) with mpi4py as the communication library, which is built against Cray-MPICH 7.7.16. Commonly, the gradients of the network layers are located in non-contiguous buffers. We use Dense to denote a single dense allreduce on a long message aggregated from the gradients of all neural network layers. Furthermore, we use Dense-Ovlp to denote dense allreduce with the optimization of communication and computation overlap. For Dense-Ovlp, the gradients are grouped into buckets and the message aggregation is conducted within each bucket; once the aggregated message in a bucket is ready, a dense allreduce is fired. The sparse allreduce counterparts (i.e., Top-k A, Top-k DSA, gTop-k, and Gaussian-k) are discussed in Section 2. In all following experiments, we define the density as k/n, where n is the number of components in the gradient.
We utilize the GPU-accelerated top-k function provided by PyTorch (Paszke et al., 2019) to realize the top-k selection in Top-k A, Top-k DSA, and gTop-k, as well as the periodic threshold re-evaluation in Ok-Topk.
5.1. Evaluating the empirical value of ξ
To validate Assumption 1, we present the empirical values of ξ when training the models until convergence with different densities in Figure 5. For VGG16 and BERT, the value of ξ increases quickly in the first few epochs or training iterations, and then becomes stable. For LSTM, the value of ξ gradually increases at the beginning and tends to plateau in the second half of training. For all three models, the value of ξ with a higher density is generally smaller than that with a lower density, especially in the stable intervals. This can be explained simply by the fact that the higher the density, the higher the probability that the results of the sparse and dense allreduces are close.

As shown in Equation 14 of (Alistarh et al., 2018), the effect of ξ is dampened by both diminishing and small (i.e., less than 1) learning rates. If ξ < 1 (satisfied by all three models in Figure 5) or not too much larger than 1, we consider it to have no significant effect on convergence. Although ξ grows slightly in Figure 5(b), which is caused by the decreasing norm of the true gradient as the model converges, a small learning rate (e.g., 0.001) can restrain its effect. Overall, Assumption 1 empirically holds with relatively low, stable, or slowly growing values of ξ.
5.2. Top-k values selection
We verify the accuracy of the top-k selection strategy used by Ok-Topk on the different neural network models. For VGG16 and LSTM, the models are trained for 160 epochs until convergence with T = 32. Recall that T is the period of threshold re-evaluation. For BERT, the model is trained for 200,000 iterations (more than 20 hours on 32 GPU nodes) with T = 128. The numbers of local and global top-k values selected by Ok-Topk during training are monitored. We also record the values of k predicted by Gaussian-k for comparison. The results are reported in Figure 6. We can see that the numbers of both local and global top-k values selected by Ok-Topk are very close to the accurate number for a given density, except that Ok-Topk overestimates the value of k in the early epochs of VGG16 and LSTM. For both local and global top-k on the three models, the average deviation from the accurate number is below 11%. For example, the average deviation for local top-k selection on BERT is only 1.4%. These results demonstrate the accuracy of the threshold reuse strategy adopted by Ok-Topk. On the contrary, Gaussian-k overestimates the value of k in the first few epochs and then severely underestimates it (an order of magnitude lower than the accurate number) in the following epochs. This can be explained by the difference between the Gaussian and the real distributions, as discussed in Section 3.1.3.
As a comparison, we also count the density of the output buffer (i.e., the accumulated gradient) for Top-k DSA (Top-k A has the same density), which expands to 13.2% and 34.5% on average for VGG16 (local density = 1.0%, on 16 GPUs) and LSTM (local density = 2.0%, on 32 GPUs), respectively. These statistics show the effect of the fill-in issue for Top-k DSA.
5.3. Optimizations for load balancing in Ok-Topk
To evaluate the effectiveness of the load balancing optimizations of O(k) sparse allreduce, we train BERT for 8,192 iterations and report the average values.
First, we evaluate the periodic space repartition strategy (discussed in Section 3.1.1) for load balancing in the split and reduce phase. The results are presented in Figure 7(a). Recall that we set the repartition period to 64. The repartition overhead is counted and amortized in the runtime of the balanced reduce. In the naive reduce, the gradient space is partitioned into P equal-sized regions, regardless of the coordinate distribution of the local top-k values. The balanced reduce achieves 1.13x to 1.75x speedup over the naive one, with a trend of more significant speedup on more GPUs. This trend can be explained by the fact that load imbalance in the naive reduce incurs up to 2k(P−1) communication volume (proportional to P), whereas the balanced reduce incurs less than 2k communication volume and is thus more scalable.
Next, we evaluate the data balancing strategy (discussed in Section 3.1.2) in the balance and allgatherv phase. Although data balancing helps to bound the bandwidth overhead of the allgatherv, there is no need to conduct it if the data is roughly balanced already. Empirically, we choose to conduct data balancing before the allgatherv if the maximum data size among the workers is more than four times the average data size, and otherwise use an allgatherv directly. Figure 7(b) presents the results for the iterations in which data balancing is triggered. Data balancing plus allgatherv achieves 1.12x to 1.43x speedup over the direct allgatherv. For similar reasons as in split and reduce, more speedup is achieved on more GPUs.
5.4. Case studies on training time and model convergence
We study the training time and model convergence using the real-world applications listed in Table 2. For the training time per iteration, we report the average value over full training. To better understand the results, we make a further breakdown of the training time into sparsification (i.e., top-k selection from the gradient), communication (i.e., dense or sparse allreduces), and computation (i.e., forward and backward passes) plus I/O (i.e., sampling from the dataset).
As discussed in Section 5.2, Gaussian-k usually underestimates the value of k, which makes the actual density far lower than the setting. Both empirical and theoretical results (Renggli et al., 2019; Shi et al., 2019b; Alistarh et al., 2018) show that a very low density jeopardizes convergence. To make a fair comparison between the counterparts for both training time and model accuracy, we gradually scale the predicted threshold of Gaussian-k until the number of selected values exceeds 3k/4. Such a threshold adjustment is also suggested by (Shi et al., 2019a), although it is difficult to make accurate. The threshold adjustment may slightly increase the sparsification overhead of Gaussian-k, but compared with the other overheads it can be ignored (see the following results).
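The adjustment loop can be sketched as follows (a sketch; the function name and the 0.9 relaxation factor are our illustrative choices, not values from Gaussian-k): the estimated threshold is relaxed until at least 3k/4 values are selected.

```python
def adjusted_count(grad, thr, k, scale=0.9):
    """Relax an overestimated threshold until >= 3k/4 values pass it."""
    count = sum(1 for g in grad if abs(g) >= thr)
    while count < 3 * k / 4:
        thr *= scale  # lower the threshold step by step
        count = sum(1 for g in grad if abs(g) >= thr)
    return count, thr

grad = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
count, thr = adjusted_count(grad, thr=0.75, k=4)
assert count >= 3   # at least 3k/4 = 3 values selected
```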
5.4.1. Image classification
Figure 8 presents the weak-scaling results for training VGG16 on Cifar10. DenseOvlp outperforms Dense by overlapping communication and computation. Although Topk-A and Topk-DSA have lower communication overhead than DenseOvlp, their high sparsification overhead cancels out the benefit of the cheaper communication. Note that the communication overhead of gTopk appears much higher than the others; this is because the overhead of the hierarchical top-k selections in the reduction tree (performed over multiple steps) is also counted as communication overhead. Among all sparse allreduce schemes, Gaussian has the lowest sparsification overhead. Ok-Topk has the lowest communication overhead; by using the threshold-reuse strategy, Ok-Topk has only a slightly higher sparsification overhead than Gaussian. When scaling from 16 GPUs to 32 GPUs, the communication overhead of Topk-A and Gaussian almost doubles. This is because the allgather-based sparse allreduce is not scalable (see the performance model in Table 1). On 32 GPU nodes, Ok-Topk outperforms the other schemes by 1.51x to 8.83x in total training time.
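The scaling behavior can be checked with a back-of-the-envelope volume model (constants are simplified here; Table 1 gives the precise terms). Per process, the allgather-based sparse allreduce receives roughly 2k items (values plus indexes) from each of the other processes, while a bandwidth-optimal dense allreduce moves about 2n elements regardless of the process count:

```python
def sparse_allgather_volume(k, num_procs):
    # Each process receives k values and k indexes from every other
    # process: volume grows linearly with the number of processes.
    return (num_procs - 1) * 2 * k

def dense_allreduce_volume(n, num_procs):
    # A ring (bandwidth-optimal) allreduce moves about 2n elements
    # per process, essentially independent of the process count.
    return 2 * n * (num_procs - 1) / num_procs
```

For the introduction's running example of n = 10^6 weights at 99% sparsity (k = 10,000), the sparse allgather volume already exceeds the dense volume beyond roughly n/k = 100 processes, which matches the doubling observed when scaling from 16 to 32 GPUs.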
Figure 9 presents the top-1 test accuracy as a function of runtime when training VGG16 on Cifar10 for 160 epochs. On both 16 and 32 GPUs, the accuracy achieved by Ok-Topk is very close to that of the dense allreduce. We did not perform any hyperparameter optimization other than simply diminishing the learning rate. The accuracy results are consistent with those reported in the machine learning community (Ayinde and Zurada, 2018; Shi et al., 2019a). On both 16 and 32 GPUs, Ok-Topk achieves the fastest time-to-solution.
5.4.2. Speech recognition
Figure 10 presents the weak-scaling results for training LSTM on AN4. Similar to the results for VGG16, Ok-Topk scales better than its counterparts. On 64 GPUs, Ok-Topk outperforms the other schemes by 1.34x to 7.71x in total training time.
Figure 11 presents the test Word Error Rate (WER, lower is better) as a function of runtime when training for 160 epochs. On 32 GPUs, Ok-Topk is 1.39x faster than DenseOvlp and achieves a WER of 0.309, very close to DenseOvlp (0.308). On 64 GPUs, all schemes achieve higher WERs than on 32 GPUs. This is because model accuracy is compromised by the larger global batch size, an effect also observed in many other deep learning tasks (You et al., 2018, 2019a; Ben-Nun and Hoefler, 2019). How to tune hyperparameters for better accuracy with large batches is beyond the scope of this work. Surprisingly, on 64 GPUs, Ok-Topk, Gaussian, Topk-A, and Topk-DSA achieve lower WERs than DenseOvlp, which may be caused by the noise introduced by sparsification. Overall, on both 32 and 64 GPUs, Ok-Topk achieves the fastest time-to-solution.
5.4.3. Natural language processing
BERT (Devlin et al., 2018) is a popular language model based on the Transformer (Vaswani et al., 2017). The model is usually pretrained on a large dataset and then fine-tuned for various downstream tasks. Pretraining is commonly much more expensive (years on a single GPU) than fine-tuning; therefore, we focus on pretraining in the evaluation.
Figure 12 presents the weak-scaling results for pretraining BERT. When scaling to 256 GPUs, the communication overhead of Topk-A and Gaussian is even higher than that of the dense allreduce, which again demonstrates that the allgather-based sparse allreduce is not scalable. Topk-DSA exhibits better scalability than the allgather-based sparse allreduce, but its communication overhead also increases significantly, since the fill-in problem becomes more severe with more workers (Renggli et al., 2019). On 256 GPUs, Ok-Topk outperforms all counterparts by 3.29x to 12.95x. Using 32 GPU nodes as the baseline, Ok-Topk achieves 76.3% parallel efficiency on 256 GPUs in weak scaling, which demonstrates its strong scalability.
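Weak-scaling parallel efficiency here is simply the baseline iteration time divided by the measured time on the larger configuration (ideal weak scaling keeps the time constant); a one-line helper (name is ours) makes the definition explicit:

```python
def weak_scaling_efficiency(t_base, t_scaled):
    # The per-GPU workload is fixed in weak scaling, so the ideal
    # runtime is constant; efficiency is baseline time over measured
    # time on the larger configuration.
    return t_base / t_scaled
```

The reported 76.3% on 256 GPUs thus means each iteration takes about 1/0.763 ≈ 1.31x longer than on the 32-GPU baseline.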
In Figure 13, we report the training loss when pretraining BERT from scratch on the Wikipedia dataset (containing 114.5 million sequences with a maximum length of 128) for 400,000 iterations. Eventually, the training loss of Ok-Topk decreases to 2.43, very close to DenseOvlp (2.33). These results show that Ok-Topk has a convergence rate similar to the dense allreduce for BERT pretraining. Compared with DenseOvlp, Ok-Topk reduces the total training time on 32 GPUs from 150 hours to 47 hours (more than a 3x speedup), and it also outperforms Gaussian by 1.30x. Since pretraining BERT is very costly (in both energy and time), Figure 13 only presents results for two important baselines: Gaussian, which has the highest training throughput among all baselines, and DenseOvlp, a lossless approach. Since the other baselines are inferior to Gaussian in training throughput and no better than DenseOvlp in convergence rate, comparing Ok-Topk against these two baselines in Figure 13 suffices to show its advantage.
6. Conclusion
Ok-Topk is a novel scheme for distributed deep learning training with sparse gradients. The sparse allreduce of Ok-Topk incurs a communication volume of less than 6k, which is asymptotically optimal and more scalable than its counterparts. Ok-Topk enables efficient and accurate top-k value prediction by exploiting the temporal locality of gradient value statistics. Empirical results for data-parallel training of real-world deep learning models on the Piz Daint supercomputer show that Ok-Topk significantly improves training throughput while preserving model accuracy. The throughput improvement would be even more significant on commodity clusters with low-bandwidth networks. We foresee that our scheme will play an important role in scalable distributed training of large-scale models with low communication overhead. In future work, we aim to further utilize Ok-Topk to reduce the communication overhead in distributed training with hybrid data and pipeline parallelism (Li and Hoefler, 2021; Fan et al., 2021; Narayanan et al., 2021, 2019).
Acknowledgements.
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 programme (grant agreements DAPP, No. 678880; EPiGRAM-HS, No. 801039; and MAELSTROM, No. 955513). We also thank the Swiss National Supercomputing Centre for providing the computing resources and technical support.

References

Environmental robustness in automatic speech recognition. In International Conference on Acoustics, Speech, and Signal Processing, pp. 849–852.
Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021.
QSGD: communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems 30, pp. 1709–1720.
The convergence of sparsified gradient methods. arXiv preprint arXiv:1809.10505.
Cray XC series network. Cray Inc., White Paper WP-Aries01-1112.
Building efficient ConvNets using redundant feature pruning. arXiv preprint arXiv:1802.07653.
Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Computing Surveys (CSUR) 52 (4), pp. 1–43.
SignSGD: compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 560–569.
Optimization methods for large-scale machine learning. SIAM Review 60 (2).
Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Long live TIME: improving lifetime for training-in-memory engines by structured gradient sparsification. In Proceedings of the 55th Annual Design Automation Conference, pp. 1–6.
Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 19 (13), pp. 1749–1783.
BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Communication quantization for data-parallel training of deep neural networks. In 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC), pp. 1–8.
DAPPLE: a pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 431–445.
Efficient sparse collective communication and its application to accelerate distributed deep learning. In Proceedings of the 2021 ACM SIGCOMM Conference, pp. 676–691.
Ultra-performance Pascal GPU and NVLink interconnect. IEEE Micro 37 (2), pp. 7–17.
Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
Adaptive gradient sparsification for efficient federated learning: an online learning approach. In 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), pp. 300–310.
Two algorithms for barrier synchronization. International Journal of Parallel Programming 17 (1), pp. 1–17.
Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
Sparsity in deep learning: pruning and growth for efficient inference and training in neural networks. arXiv preprint arXiv:2102.00554.
Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115.
Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Taming unbalanced training workloads in deep learning with partial collective operations. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
Breaking (global) barriers in parallel stochastic optimization with wait-avoiding group averaging. IEEE Transactions on Parallel and Distributed Systems 32 (7), pp. 1725–1739.
NUMA-aware shared-memory collective communication for MPI. In Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, pp. 85–96.
Chimera: efficiently training large-scale neural networks with bidirectional pipelines. arXiv preprint arXiv:2107.06925.
Analysis of quickselect: an algorithm for order statistics. RAIRO – Theoretical Informatics and Applications 29 (4), pp. 255–276.
Asynchronous decentralized SGD with quantized and local updates. Advances in Neural Information Processing Systems 34.
PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 1–15.
Memory-efficient pipeline-parallel DNN training. In International Conference on Machine Learning, pp. 7937–7947.
PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8026–8037.
Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9.
Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789.
SparCML: high-performance sparse communication for machine learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–15.
An in-depth analysis of the Slingshot interconnect. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC20).
Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799.
Efficient top-k query processing on massively parallel hardware. In Proceedings of the 2018 International Conference on Management of Data, pp. 1557–1570.
InfiniBand network architecture. Addison-Wesley Professional.
Understanding top-k sparsification in distributed deep learning. arXiv preprint arXiv:1911.08772.
A distributed synchronous SGD algorithm with global top-k sparsification for low bandwidth networks. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 2238–2247.
A convergence analysis of distributed SGD with communication-efficient gradient sparsification. In IJCAI, pp. 3411–3417.
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications 19 (1), pp. 49–66.
Attention is all you need. arXiv preprint arXiv:1706.03762.
FFT-based gradient sparsification for the distributed training of deep neural networks. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, pp. 113–124.
TernGrad: ternary gradients to reduce communication in distributed deep learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), pp. 1508–1518.
Network congestion avoidance through packet-chaining reservation. In Proceedings of the 48th International Conference on Parallel Processing, pp. 1–10.
DeepReduce: a sparse-tensor communication framework for federated deep learning. Advances in Neural Information Processing Systems 34.
Large-batch training for LSTM and beyond. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16.
Large batch optimization for deep learning: training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.
ImageNet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, pp. 1–10.