I Introduction
With the increase of training data volume and the growing complexity of deep neural networks (DNNs), distributed computing environments (such as GPU clusters) are widely adopted to accelerate the training of DNNs. The data-parallel synchronous stochastic gradient descent (SSGD) method is one of the commonly used optimizers to minimize the objective function of large-scale DNNs [1][2]. Compared to SGD on a single worker, SSGD distributes the workload to multiple workers to accelerate the training, but it also introduces the communication overhead of exchanging model parameters or gradients in each iteration. Assume that P workers train a single DNN model with SSGD. In every iteration, all workers take different mini-batches of data to calculate the model gradients in parallel. Then they need to average the gradients before updating the model parameters, which involves significant data communication [3]. Because the computing power of computational units (e.g., GPUs and Google TPUs) grows much faster than network speed, network communication has become the training bottleneck when the communication-to-computation ratio is high [4]. Many large IT companies use expensive high-speed networks such as 40/100 Gbps InfiniBand or Ethernet to alleviate the communication pressure, but many researchers and small companies can only use consumer-level GPUs connected by low-bandwidth networks such as 1 Gbps Ethernet (1GbE).
To conquer the communication challenge, one can either increase the workload of each worker by choosing a large batch size, or reduce the required data communication in each iteration. Very recently, many large-batch SGD techniques with sophisticated optimization strategies [5][6][7][8][9] have been proposed to increase the scaling efficiency without losing model accuracy. On the other hand, gradient sparsification, quantization and compression methods [10][11][12][13][14][15][16] have been proposed to dramatically reduce the size of exchanged gradients without affecting the convergence rate. Among the model/gradient size reduction techniques, Top-k sparsification is one of the key approaches [17][12][18]; it can sparsify the gradients to around 0.1% density (i.e., the vast majority of gradients are zeros and there is no need to transfer these zeroed-out values) [11][12].
Aggregation Algorithm  Complexity  Time Cost
DenseAllReduce  O(P + m)  2(P-1)α + 2((P-1)/P)mβ
TopKAllReduce  O(kP)  α log P + 2k(P-1)β
Ours (gTopKAllReduce)  O(k log P)  2α log P + 4kβ log P

Note: k = ρ × m. α and β are machine dependent and constant on a specific machine.
Top-k sparsification has been a key gradient compression method, with empirical and theoretical studies in [17][12][16] verifying that only a small number of gradients need to be averaged during the gradient aggregation phase without impairing model convergence or accuracy. However, the sparsified gradients are generally associated with irregular indices, which makes it a challenge to efficiently accumulate the selected gradients from all workers (worker and GPU are interchangeable in this paper). The ring-based AllReduce method on dense gradients (DenseAllReduce) has an O(P + m) communication complexity [19], where P is the number of workers and m is the size of the parameters (or gradients). In Top-k sparsification, assuming the density of the gradients on each worker is ρ, we have k = ρ × m, and the corresponding indices of the non-zero values differ across workers and iterations, so each worker generally needs to transfer 2k values (k gradients and k indices) in each iteration. Note that the gradient sparsification method is not well suited to parameter-server (PS) based [20] SSGD because the workers have to pull the whole model without any compression/sparsification in every iteration, whereas decentralized SSGD with AllReduce is a better fit for gradient compression. However, with sparse gradients, the DenseAllReduce method cannot be directly used to accumulate the sparse gradients with irregular indices, and recent solutions use the AllGather collective [21][22], which is inefficient even if k ≪ m. The AllGather collective has an O(kP) communication complexity [21]. We use TopKAllReduce to denote the method of averaging irregularly indexed Top-k gradients via AllGather. When scaling to a large number of workers (i.e., P is large), even high sparsification ratios still generate significant communication overhead.
In fact, the main idea of Top-k sparsification is based on the fact that gradients with larger absolute values contribute more to the model convergence. We observe that one can further select the Top-k gradients from the accumulated results of the P groups of Top-k values generated by the P workers. In other words, even though the P workers can generate a maximum of k × P non-zero gradients for the model update, we can pick only the Top-k gradients (in terms of absolute values) to update the model in each iteration. Based on this observation, we propose an efficient Top-k sparsification method that tackles the inefficiency of TopKAllReduce without affecting the model convergence. Specifically, instead of accumulating the irregularly indexed non-zero gradients from all workers, we choose the global Top-k (gTop-k) gradients in terms of absolute values. (In this paper, we mainly discuss decentralized SSGD with AllReduce to apply gTop-k sparsification, but it is also applicable to PS-based distributed SGD.) gTop-k can elegantly make use of a tree structure to select the global Top-k values from all workers, which we call gTopKAllReduce, such that the communication complexity is reduced from O(kP) to O(k log P). We summarize the communication complexities of different gradient aggregation solutions in Table I.
In this paper, we first implement the gTopKAllReduce algorithm, which provides a much more efficient summation of global Top-k sparse gradients from distributed workers. Then we integrate our proposed gTopKAllReduce into gTop-k SSGD under PyTorch (https://pytorch.org/), one of the most popular deep learning frameworks, and MPI (https://www.openmpi.org/). On a 32-node GPU cluster connected by 1 Gbps Ethernet, gTop-k SSGD achieves 2.7-12.8x speedup over SSGD built on the highly optimized libraries Horovod [23] and NCCL (https://developer.nvidia.com/nccl). Compared to Top-k SSGD, gTop-k SSGD is up to 1.7x faster in the evaluated experiments on various DNNs and datasets. Our contributions are summarized as follows:
We observe that the accumulated results of Top-k sparsification can be further sparsified before being applied to the model update.

We propose an efficient global Top-k sparsification algorithm for distributed SGD, called gTop-k SSGD, to accelerate distributed training of deep neural networks without losing model convergence or accuracy.

We implement the proposed gTop-k SSGD atop the popular framework PyTorch and MPI, and we release all our experimental parameters for reproducibility.

gTop-k SSGD achieves significant improvements in real-world applications with various DNNs and datasets under low-bandwidth networks (e.g., 1 Gbps Ethernet).
The rest of the paper is organized as follows. We introduce the preliminaries in Section II, where the background information and the main problem are clarified. In Section III, we present our observations from Top-k sparsification and propose an efficient gTop-k SSGD algorithm. We then present the detailed experimental study in Section IV. Section V introduces the related work, and we conclude the paper in Section VI.
II Preliminaries
In this section, we briefly introduce the background knowledge of training DNNs and the distributed SGD used for large-scale models. We also describe the current Top-k sparsification technique for compressing gradients in distributed SGD. For ease of presentation, some frequently used notations are summarized in Table II.
Notation  Description
P  The number of workers in the cluster.
M  The size of a message in bytes.
α  Latency (startup time) of the network between two workers.
β  Transmission time per byte between two nodes.
ρ  Density of the gradients for aggregation.
m  The size of the gradients to aggregate, and k = ρ × m.
t_iter  Time of an iteration.
t_f  Time of the forward pass in each iteration.
t_b  Time of the backward propagation in each iteration.
t_u  Time of the model update in each iteration.
t_c  Time of the communication cost in each iteration.
II-A DNNs
Deep neural networks (DNNs) are generally stacked with many hierarchical layers, and each layer is a transformation function of its input values. We can formulate an L-layer DNN as Eq. 1:
$a^{(l)} = f^{(l)}\big(W^{(l)}, a^{(l-1)}\big), \quad l = 1, 2, \ldots, L$   (1)
where $a^{(l-1)}$ and $a^{(l)}$ are the input and output of layer $l$ ($1 \leq l \leq L$ for an L-layer DNN) respectively; the inputs of the current layer are the outputs of its previous layer(s). The function $f^{(l)}$ is the transformation function, which consists of an operation (e.g., an inner product or a convolution) and an activation function (e.g., ReLU). $W^{(l)}$ are the trainable model parameters, which are iteratively updated during model training using mini-batch stochastic gradient descent (SGD) optimizers and the backpropagation algorithm [24].

II-B Mini-batch SGD
An objective function $\mathcal{L}(W, D)$ measures the difference between the prediction values and the ground truth of a DNN. The mini-batch SGD optimizer updates the parameters iteratively to minimize the objective function. Specifically, there are three phases in each iteration during training: 1) Feed-forward phase: a mini-batch of data $D_t$ is read as the input of the DNN and is fed forward through the network from the first layer to the last layer, which finally produces the prediction values used to evaluate the objective function $\mathcal{L}(W, D_t)$. 2) Backward-propagation phase: the gradients w.r.t. the parameters and the inputs are calculated from the last layer back to the first layer. 3) Update phase: the parameters are updated with the generated gradients using the following formula:
$W_{t+1} = W_t - \eta \cdot \nabla \mathcal{L}(W_t, D_t)$   (2)
where $\eta$ is the learning rate. For single-worker training, phases 1) and 2) are the main time costs of an iteration and are computation-intensive, so the average time of one iteration can be approximated by t_iter ≈ t_f + t_b.
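To make the three phases concrete, the following minimal PyTorch sketch runs one mini-batch SGD iteration. The model, batch size, and learning rate here are illustrative placeholders, not the configurations evaluated in this paper.

```python
import torch
import torch.nn as nn

# Illustrative model and data; not the networks used in the experiments.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(128, 784)            # a mini-batch of inputs
y = torch.randint(0, 10, (128,))     # ground-truth labels

# 1) Feed-forward phase: compute predictions and the objective value.
loss = criterion(model(x), y)
# 2) Backward-propagation phase: compute gradients w.r.t. the parameters.
optimizer.zero_grad()
loss.backward()
# 3) Update phase: W <- W - eta * gradient (Eq. 2).
optimizer.step()
```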
II-C Synchronized SGD
Synchronized SGD (SSGD) with data parallelism is widely applied to train models with multiple workers (say P workers, indexed by g). Each worker keeps a consistent model, takes a different mini-batch of data, and computes its gradients in parallel through phases 1) and 2). Since the data read by different workers are not the same, the generated gradients differ across workers in each iteration; therefore, to keep the result exactly the same as single-worker mini-batch SGD, the gradients from different workers are averaged before updating the model. The parameter update formula is rewritten as
$W_{t+1} = W_t - \eta \cdot \frac{1}{P}\sum_{g=1}^{P} \nabla \mathcal{L}(W_t, D_t^{g})$   (3)
where $\nabla \mathcal{L}(W_t, D_t^{g})$ denotes the gradients of worker g at iteration t. The gradients are located on different workers without shared memory, so the averaging operation involves communication, which generally becomes the system bottleneck. The average iteration time of SSGD can be approximated by t_iter ≈ t_f + t_b + t_c. Assuming weak scaling on P workers with SSGD, the scaling efficiency can be approximated by
$e = \frac{t_f + t_b}{t_f + t_b + t_c}$   (4)
t_c is generally related to the number of workers P and the model/gradient size m. Therefore, with larger P, it is crucial to reduce t_c to achieve a lower iteration time and thus higher scaling efficiency.
II-D DenseAllReduce
In Eq. 3, the summation of the gradients from all workers can be directly implemented by an AllReduce collective [25][23], which we denote DenseAllReduce. The ring-based AllReduce algorithm [25] (which is also included in NCCL) is an efficient implementation on dense-GPU clusters. To understand the time cost of DenseAllReduce, we revisit the time model of ring-based AllReduce. According to [26][27], the time cost of ring-based AllReduce can be represented by
$T_{dar} = 2(P-1)\alpha + 2\frac{P-1}{P}m\beta$   (5)
where α is the latency (startup time) of a message transfer between two nodes, and β is the transmission time per byte between two nodes, following the alpha-beta communication model [28].
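As an illustration of Eq. 5, the small helper below evaluates the ring-based AllReduce cost under the alpha-beta model; the α, β, and message-size values in the example call are placeholders rather than measured numbers.

```python
def ring_allreduce_time(P, m, alpha, beta):
    """Ring-based AllReduce cost under the alpha-beta model (Eq. 5):
    2(P-1) latency terms plus 2(P-1)/P * m bytes transferred per worker."""
    return 2 * (P - 1) * alpha + 2.0 * (P - 1) / P * m * beta

# Example with placeholder network parameters (not measured values):
print(ring_allreduce_time(P=32, m=100e6, alpha=0.5e-3, beta=8e-9))
```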
II-E Top-k sparsification
From Eq. 5, it is noted that the communication cost increases linearly with m and P. To reduce the size of the transferred messages, Top-k sparsification [12] has been proposed to make the gradients very sparse. With Top-k sparsification, each worker only contributes its k largest (in absolute value) gradients to the summation in each iteration, and the zeroed-out values are stored locally and accumulated for the next iteration. Both theoretical and empirical studies have verified that Top-k sparsification has little impact on model convergence and accuracy [17][12][16]. The pseudo-code of Top-k sparsification SSGD is shown in Algorithm 1. Note that the implementation of TopKAllReduce in Algorithm 1 is completely different from DenseAllReduce, because the non-zero values of the sparsified gradients may come from inconsistent indices on different workers, and efficient implementations of such a sparse AllReduce are non-trivial. Current methods [21][22] use AllGather to implement TopKAllReduce, in which each worker's sparsified gradients are gathered as a dense vector combined with the corresponding indices, say [G̃, I]. The sizes of G̃ and I are both k. According to the communication model of AllGather [19], the time cost for all-gathering messages of size 2k is
$T_{tar} = \alpha\log_2 P + 2k(P-1)\beta$   (6)
From Eq. 6, we can see that T_tar increases linearly with P. Therefore, Top-k sparsification is also difficult to scale to large clusters on low-bandwidth networks. In this paper, we propose a global Top-k (gTop-k) sparsification approach to address this problem.
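The sketch below illustrates the idea behind Algorithm 1: local Top-k selection with residual accumulation, followed by AllGather-based aggregation. It assumes mpi4py and NumPy and is a simplified illustration (function names are ours), not the implementation evaluated in this paper.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P = comm.Get_size()

def topk_sparsify(grad, residual, k):
    """Add the local residual, keep the k largest-magnitude entries,
    and store the zeroed-out values back into the residual."""
    acc = grad + residual
    idx = np.argpartition(np.abs(acc), -k)[-k:]
    residual[:] = acc
    residual[idx] = 0.0               # selected values leave the residual
    return acc[idx], idx

def topk_allreduce(values, indices, m):
    """AllGather-based aggregation of irregularly indexed Top-k gradients."""
    gathered = comm.allgather((values, indices))   # list of (values, indices)
    dense = np.zeros(m)
    for v, i in gathered:
        dense[i] += v
    return dense / P                  # averaged, still irregularly sparse
```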
III Methodology
In this section, we first present some observations on Top-k sparsification SSGD, and then we present our proposed global Top-k sparsification algorithm. For ease of presentation, we assume that the number of workers P is a power of 2.
III-A Observations from Top-k sparsification
In the previous section, we introduced Top-k sparsification SSGD, in which k values are selected on each local worker and then accumulated across all workers. We examined the distribution of the non-zero values of the accumulated result, which is generated by the summation of the sparse gradients from all workers. We found that not all of these values (whose number is between k and k × P) contribute to the model convergence. Specifically, the accumulated result can be further sparsified such that only a smaller number of non-zero gradients are needed for the model update. In other words, one can further select the k largest absolute values from the accumulated result to update the model while maintaining the model convergence. In this scenario, the values that are not selected are already-summed gradients that would be neither applied to the model nor stored in the local residuals, which could eventually damage the model convergence. Therefore, if only k elements of the accumulated result are selected to update the model, the remaining elements should be put back as residuals with their corresponding indices so that they are accumulated locally and contribute to model updates in future iterations. This is required to ensure the convergence of gTop-k.
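A minimal sketch of this further sparsification and residual handling is given below; the function name is ours for illustration and the input is assumed to be the dense vector holding the already-summed sparse gradients.

```python
import numpy as np

def further_sparsify(summed, k):
    """Keep only the k largest-magnitude entries of the already-summed
    sparse gradients; everything else is returned to be re-accumulated
    locally as residuals."""
    idx = np.argpartition(np.abs(summed), -k)[-k:]
    update = np.zeros_like(summed)
    update[idx] = summed[idx]         # applied to the model
    leftover = summed - update        # put back into the local residuals
    return update, leftover
```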
III-B The key idea of gTop-k
According to the above observations, we only need the k largest absolute values of all the sparsified gradients, each worker contributing k non-zero values. Therefore, the problem is formulated as a global Top-k (gTop-k) selection over the sparsified gradients, which reside on distributed workers, instead of TopKAllReduce. We again let the pair of non-zero values and corresponding indices denote a worker's sparsified gradients, whose number of non-zero values is k. We first use an AllGather-based version to illustrate the key idea of gTop-k sparsification, and then we present our efficient algorithm. In Algorithm 1, the aggregated result contains up to k × P non-zero values contributing updates to the model. Different from Top-k sparsification, we further sparsify this result by selecting its k largest absolute values. A straightforward implementation of gTop-k is shown in Algorithm 2. Note that this version is only used to illustrate the key idea of how the gradients that update the model are selected; the efficient algorithm is presented in the next subsection. An example of gTop-k sparsification using AllGather is shown in Fig. 1.
III-C gTopKAllReduce: An efficient AllReduce algorithm for gTop-k sparsification
From Eq. 6, we can see that the AllGather collective is inefficient for performing the AllReduce operation on irregularly indexed gradients. For the same density, the main purpose of our proposed algorithm is to eliminate the strong dependence of the time cost on the variable P. For ease of presentation, we first define a Top-k operation, ⊕, on two sparse vectors, say A and B, both of which have k non-zero values.
Definition 1
A Top-k operation ⊕ on two k-sparse vectors A and B: C = A ⊕ B, where $C_i = (A + B)_i$ if $|(A + B)_i|$ is among the k largest absolute values of A + B, and $C_i = 0$ otherwise.
Note that the number of non-zero values of C is also k. During SSGD training, A and B reside on different workers without shared memory, so the two sparse vectors must be exchanged to obtain the global Top-k sparse vector C = A ⊕ B. The ⊕ operation for two distributed workers is shown in Fig. 2, which demonstrates that it can be implemented efficiently by a send operation (network communication) followed by a local Top-k selection over at most 2k non-zero values.
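A minimal sketch of the ⊕ operation on two k-sparse vectors represented as (values, indices) pairs is shown below; the function name gtopk_merge is ours for illustration.

```python
import numpy as np

def gtopk_merge(vals_a, idx_a, vals_b, idx_b, k, m):
    """Pairwise gTop-k merge: sum two k-sparse vectors of length m and keep
    only the k entries with the largest absolute values (Definition 1)."""
    dense = np.zeros(m)
    dense[idx_a] += vals_a
    dense[idx_b] += vals_b            # at most 2k non-zeros after the sum
    idx = np.argpartition(np.abs(dense), -k)[-k:]
    return dense[idx], idx

# Tiny usage example with k = 2 and m = 6:
v, i = gtopk_merge(np.array([0.9, -0.2]), np.array([0, 3]),
                   np.array([0.5, 0.4]), np.array([3, 5]), k=2, m=6)
print(v, i)   # keeps the two largest-magnitude summed entries
```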
When scaling to P workers (assume P is a power of 2), since the final global Top-k result is obtained by repeatedly applying the local ⊕ operation, we use a recursive-doubling technique to reduce the total transfer size. To show this procedure for gTop-k, Fig. 3 illustrates the selection of the global Top-k values. There are log P rounds of communication for P workers. At each round, pairs of workers perform the ⊕ operation of Fig. 2 in parallel. After log P rounds, the first worker (rank 0) finally holds the global Top-k values.
Following the illustration of the recursive-doubling gTop-k procedure, we propose the gTop-k based AllReduce, called gTopKAllReduce for short. The complete procedure is shown in Algorithm 3. The algorithm first packs the non-zero values of the local sparse vector into a send buffer and allocates a receive buffer for the peer's values at each communication round. The ⊕ operation is then applied to the local and received sparse vectors, and the result is kept for the next round. The "Recv" and "Send" operations are a matched pair and can be implemented with MPI interfaces. Since the result is by then stored only on the first worker (rank 0), it is broadcast to all the other workers using the flat-tree algorithm [29]. Finally, the algorithm records the global indices that are used for the model update.
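The following sketch illustrates the tree-structured reduction toward rank 0 and the final broadcast of Algorithm 3 using mpi4py (an assumption for illustration; the paper's implementation is built on PyTorch and MPI). The pairwise Top-k merge is inlined to keep the example self-contained.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, P = comm.Get_rank(), comm.Get_size()   # P is assumed to be a power of two

def gtopk_allreduce(vals, idx, k, m):
    """Reduce k-sparse (values, indices) toward rank 0 in log2(P) rounds,
    then broadcast the global Top-k result to all workers."""
    step = 1
    while step < P:
        if rank % (2 * step) == 0:
            # Receive the peer's sparse vector and merge it with the local one.
            pv, pi = comm.recv(source=rank + step)
            dense = np.zeros(m)
            dense[idx] += vals
            dense[pi] += pv
            idx = np.argpartition(np.abs(dense), -k)[-k:]
            vals = dense[idx]
        else:
            # This worker sends its partial result and then waits for the
            # broadcast of the final global Top-k.
            comm.send((vals, idx), dest=rank - step)
            break
        step *= 2
    # Rank 0 now holds the global Top-k values and indices; share them.
    return comm.bcast((vals, idx), root=0)
```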
III-D Communication complexity analysis of gTopKAllReduce
There are two main processes in gTopKAllReduce. The first is the computation of the global Top-k result. From Fig. 3, the first worker takes part in the communication at every round, so we only need to analyze the cost at rank 0. Rank 0 takes log P rounds of communication to compute the global Top-k. In each round, rank 0 receives 2k elements (k values and k indices) from another worker, which costs α + 2kβ. Thus, the overall time cost of the first process is α log P + 2kβ log P. In the second process, the global Top-k values held by the first worker are broadcast to all the other workers [29]. In summary, the time cost of gTopKAllReduce is
$T_{gtar} = 2\alpha\log_2 P + 4k\beta\log_2 P$   (7)
This O(k log P) communication complexity is much lower than the O(kP) complexity of TopKAllReduce, especially when P is large.
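As a worked comparison, using the cost expressions summarized in Table I (which are reconstructions consistent with Eqs. 5-7 above), the k-dependent bandwidth terms for a 32-worker cluster differ by roughly a factor of three, and the gap grows with P:

For $P = 32$: $T_{tar} \approx 5\alpha + 62k\beta$ while $T_{gtar} \approx 10\alpha + 20k\beta$, and in general the ratio of the bandwidth terms is $\frac{2k(P-1)\beta}{4k\beta\log_2 P} = \frac{P-1}{2\log_2 P}$, which grows almost linearly with P.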
III-E gTop-k SSGD with gTopKAllReduce
With the above efficient implementation of gTopKAllReduce, we improve the gTop-k SSGD of Algorithm 2 by replacing its AllGather-based aggregation with an invocation of gTopKAllReduce (Algorithm 3). The improved version of the gTop-k SSGD training algorithm is shown in Algorithm 4. Compared to Top-k SSGD, gTop-k SSGD only introduces an extra local Top-k selection, whose overhead is much smaller than the communication overhead, while it substantially reduces the communication complexity.
III-F System overview
We implement our proposed gTop-k SSGD atop PyTorch and MPI. The sparsification (i.e., the local Top-k selection) and residual operations introduce extra overheads, but they can be parallelized with the feed-forward and backward computation. Therefore, we separate the gradient-sparsification-related operations from the feed-forward and backward operations. Specifically, a dedicated thread handles the gradient sparsification and residual management for communication, while the main thread takes charge of the feed-forward/backward computation. An overview of our system architecture is shown in Fig. 4.
Note that gradient sparsification is done on the GPU, i.e., the Top-k selection is invoked on the GPU, and the handler then transfers the sparsified results to the CPU for communication. This design has two advantages: first, when the number of gradients is large, the GPU can be more efficient than the CPU; second, because the density is generally set to a very small value, transferring only the non-zero values through PCIe is much faster than transferring the whole gradients.
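A minimal sketch of this design is shown below: the Top-k selection runs on the GPU with torch.topk, and only the selected values and indices are copied to the CPU for MPI communication. The helper name and sizes are illustrative.

```python
import torch

def gpu_topk_to_cpu(grad_gpu, k):
    """Select the k largest-magnitude gradients on the GPU, then move only
    the selected values/indices over PCIe for CPU-side communication.
    Assumes a flattened 1-D gradient tensor."""
    _, indices = torch.topk(grad_gpu.abs(), k)
    values = grad_gpu[indices]            # keep the original signs
    return values.cpu().numpy(), indices.cpu().numpy()

# Usage example (falls back to CPU if no GPU is available):
device = "cuda" if torch.cuda.is_available() else "cpu"
vals, idx = gpu_topk_to_cpu(torch.randn(10000, device=device), k=10)
```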
IV Experimental Study
We conduct experimental evaluations to show the effectiveness of our proposed gTop-k SSGD with real-world applications on a 32-GPU cluster. We first validate the convergence of our proposed gTop-k SSGD, which should be nearly consistent with the dense version. Then we evaluate the time cost and efficiency of gTopKAllReduce and compare it with the dense AllReduce (DenseAllReduce) and the Top-k AllReduce (TopKAllReduce) for different message sizes. After that, we compare the training efficiency of the three SSGD algorithms (i.e., SSGD with dense gradients, Top-k SSGD, and gTop-k SSGD). We also break down the training process of an iteration into several time-consuming phases to analyze the extra overheads introduced by gTop-k sparsification.
IV-A Experimental setup
Hardware: The distributed environment is configured as a GPU cluster with 32 machines, each of which is equipped with one Nvidia P102-100 GPU. The network between machines is 1 Gbps Ethernet (1GbE). Details of the hardware are shown in Table III. Each machine has a low-performance configuration similar to a personal desktop computer.
Hardware  Model 
CPU  Intel(R) Celeron(R) CPU N3350 @ 1.10GHz 
GPU  Nvidia P102-100 (3200 CUDA cores and 5GB memory) 
PCIe  PCIe x1 lane with a maximum bandwidth of 250 MB/s 
Memory  4GB DDR3 with a 16GB swap file 
Disk  256GB SSD 
Network  1 Gbps Ethernet (1GbE) 
Software: All GPU machines are installed with Ubuntu 16.04, the Nvidia GPU driver at version 390.48, and CUDA 9.1. The communication libraries are OpenMPI 3.1.1 (https://www.openmpi.org/) and NCCL 2.1.5 (https://developer.nvidia.com/nccl). We use the highly optimized distributed training library Horovod (https://github.com/uber/horovod) [23] at version 1.4.1. The deep learning framework is PyTorch at version 0.4.0 with cuDNN 7.1.
Dataset  Training samples  Validation samples  Input size 
Cifar10  50000  10000  32 x 32 
ImageNet  1.2 million  10000  224 x 224 
PTB  923000  73000  10000 
Model  Dataset  # of Epochs  Batch size  Learning rate 
VGG16  Cifar10  140  128  0.1 
ResNet20  Cifar10  140  128  0.1 
AlexNet  ImageNet  45  64  0.01 
ResNet50  ImageNet  15  256  0.01 
LSTMPTB  PTB  40  100  1.0 

Note: All models are trained with single-precision floating point (i.e., 32-bit).
DNNs: We choose various DNNs from several areas of AI applications with different datasets. The datasets include Cifar10 [30] and ImageNet [31] for image classification, and the Penn Treebank corpus (PTB) [32] for language modeling. The sizes of the evaluated datasets are shown in Table IV. For each dataset, we use one or two benchmark deep models. For the Cifar10 dataset, we use the VGG16 model [33] and the ResNet20 model [34]. For the ImageNet dataset, the AlexNet model [35] and the ResNet50 model [34] are used. We exploit a 2-layer LSTM language model (LSTM-PTB) for the PTB dataset, which is similar to the one in [12]. The details of the deep models are shown in Table V. All models are trained using momentum SGD.
IV-B Convergence comparison
The convergence of Top-k sparsification SSGD has been verified to be nearly consistent with the dense version in previous work [17][12][16], so we do not include the convergence curves of Top-k SSGD. We compare our gTop-k SSGD with the original SSGD using dense gradients. It has been shown that a warm-up strategy in the first several epochs helps the model converge better [12], so we adopt a similar warm-up configuration. Specifically, the first few epochs use dynamically decreasing densities and smaller learning rates, and the remaining epochs adopt a fixed small density, which means we compress the exchanged gradients to hundreds of times smaller messages from the fifth epoch onward.
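A sketch of such a warm-up schedule is given below; the per-epoch densities are hypothetical values for illustration only and are not the exact settings used in our experiments.

```python
def density_schedule(epoch, warmup_epochs=4, final_density=0.001):
    """Warm-up density schedule: start with relatively dense gradients and
    decay to the target density (all values here are illustrative)."""
    warmup = [0.25, 0.0625, 0.015625, 0.004]   # hypothetical per-epoch densities
    if epoch < warmup_epochs:
        return warmup[epoch]
    return final_density

print([density_schedule(e) for e in range(6)])
```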
Convergence on the Cifar10 dataset: The convergence of the VGG16 and ResNet20 models on the Cifar10 dataset is shown in Fig. 5. The results show that the convergence of ResNet20 is almost the same as the baseline, while the VGG16 model even converges slightly better than the baseline.
Convergence on the ImageNet dataset: The convergence of the AlexNet and ResNet50 models on the ImageNet dataset is shown in Fig. 6. Again, the results show that the convergence of the two networks is close to the baselines. On the AlexNet model, the convergence of gTop-k SSGD is slightly worse than the baseline, which could be caused by the very low density applied to the convolutional layers, while the fully connected layers hold a large proportion of the parameters. On the other hand, gTop-k sparsification works well on the ResNet50 model, which converges slightly faster than the baseline.
Convergence on the LSTM network: The convergence of LSTM-PTB on the PTB dataset is shown in Fig. 7. It is also noted that gTop-k SSGD converges close to SSGD under the chosen density.
In summary, three different types of DNNs on different benchmark datasets show that our proposed gTop-k sparsification for SSGD does not damage the model during training and keeps the model convergence very close to that of the dense version of SSGD.
IV-C Communication speed
Before demonstrating the efficiency of gTop-k SSGD, we first evaluate the communication speed of the cluster. We test the point-to-point communication time with various message sizes because the performance of point-to-point communication plays an important role in MPI collectives and in our gTopKAllReduce. After that, we evaluate the speeds of DenseAllReduce, TopKAllReduce and gTopKAllReduce for different sizes of sparse vectors and a scaling number of workers on the 1GbE cluster.
Point-to-point communication: We test the point-to-point communication speed using the OSU Micro-Benchmark (http://mvapich.cse.ohiostate.edu/benchmarks/) at version 5.5. The time costs of point-to-point communication between two machines are shown in Fig. 8, in which we run the benchmark multiple times to calculate the mean and standard deviation of the reported values. It can be seen that the time used for transferring a message is a linear function of the message size, which verifies the alpha-beta model. Based on the measured data, we can estimate α and β.
Time performance of AllReduce operations: Since P and k are the two main factors affecting the performance of TopKAllReduce and gTopKAllReduce, we compare their time performance along these two dimensions based on the measured α and β and the time cost models in Table I. First, we compare the time cost as the number of workers increases, for a message size approximating the model size of ResNet-50 and a fixed density. Second, for a cluster of 64 workers, we compare how the time cost changes as the message size increases. The time comparison is shown in Fig. 9. From the left of Fig. 9, when the number of workers is small, TopKAllReduce is slightly faster than gTopKAllReduce. However, when the number of workers scales up, TopKAllReduce becomes much worse than gTopKAllReduce. Furthermore, our proposed gTopKAllReduce is much more efficient than TopKAllReduce when scaling to large message sizes. To summarize, a larger number of workers or a larger message size makes gTopKAllReduce increasingly advantageous over TopKAllReduce.
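For reference, the alpha-beta parameters can be estimated from such point-to-point measurements with a simple linear fit, as in the sketch below; the sample sizes and times are placeholders, not the measured data of Fig. 8.

```python
import numpy as np

# Fit the alpha-beta model T(M) = alpha + beta * M to measured point-to-point
# times; the numbers below are illustrative placeholders only.
sizes = np.array([1e3, 1e4, 1e5, 1e6, 1e7])                  # message sizes (bytes)
times = np.array([0.6e-3, 0.7e-3, 1.4e-3, 8.6e-3, 81e-3])    # seconds (illustrative)

beta, alpha = np.polyfit(sizes, times, 1)                    # slope, intercept
print(f"alpha ~= {alpha:.2e} s, beta ~= {beta:.2e} s/byte")
```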
Model  Batch size  Iteration time  Throughput 
VGG16  128  0.097s  1317 Images/s 
ResNet20  128  0.146s  876 Images/s 
AlexNet  64  0.369s  173 Images/s 
ResNet50  256  4.842s  52 Images/s 
IV-D Training efficiency
Single-GPU training speed: We first report the average training speed of one iteration on a single GPU, as shown in Table VI. It can be seen that the computation time of each iteration ranges from tens of milliseconds to several seconds, so scaling such models on 1GbE clusters is challenging, especially for models with a large number of parameters (e.g., AlexNet), because of their high communication-to-computation ratios.
Scaling efficiency: After integrating gTopKAllReduce into gTop-k SSGD, we evaluate how much speedup can be achieved on the low-bandwidth network with different models and different numbers of workers. The scaling efficiency of SSGD with the three different AllReduce algorithms is shown in Fig. 10. It can be seen that dense SSGD has the worst scaling efficiency because the full size of the gradients makes communication very slow on 1GbE clusters. Top-k SSGD achieves some improvement over dense SSGD on a small number of workers, but it shows an obvious performance decrease when scaling to 32 GPUs. In contrast, our proposed gTop-k SSGD achieves much more stable scaling efficiency even on clusters with a larger number of GPUs. For example, when scaling to 32 GPUs, our proposed gTop-k SSGD achieves 2.7-12.8x speedups over dense SSGD and 1.1-1.7x improvements over Top-k SSGD across the evaluated models. In particular, gTop-k SSGD is up to 12.8x and 1.7x faster than dense SSGD and Top-k SSGD, respectively, on the AlexNet model. A summary of the training throughput on the different models is shown in Table VII.
Model  Dense SSGD  Top-k SSGD  gTop-k SSGD  S_dense  S_topk 
VGG16  403  2016  3020  7.5  1.5 
ResNet20  9212  22272  25280  2.7  1.1 
AlexNet  39  296  505  12.8  1.7 
ResNet50  343  978  1251  3.65  1.3 

Note: The throughput is measured in processed images per second (i.e., the unit is Images/s). S_dense indicates the speedup of gTop-k SSGD compared to dense SSGD, and S_topk indicates the speedup of gTop-k SSGD compared to Top-k SSGD.
IV-E Time performance analysis
We use the 32-worker cases to analyze the time performance of gTop-k SSGD. To better understand the overheads of gTop-k sparsification, we break down the time of an iteration into three parts: the GPU computation time, the local sparsification time, and the communication time. Note that in this work we do not consider overlapping computation and communication during backward propagation. The main reason is that deep models like ResNet-50 consume a large amount of GPU memory, so the local mini-batch size cannot be set very large and the per-mini-batch computation time is short. Since we still need to reduce the communication-to-computation ratio to alleviate the impact of communication, an effective method is to accumulate the gradients of several small mini-batches before a single update. In our ResNet-50 experiments, we accumulate the gradients of several local mini-batches for a single update, so the effective mini-batch size is 256. Therefore, pipelining backward propagation and communication contributes little on low-bandwidth networks. Nevertheless, gTop-k sparsification is also applicable to the wait-free backward propagation algorithm [36] and the optimal gradient merging algorithm [37].
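A minimal PyTorch sketch of this local gradient accumulation is shown below; the model, accumulation factor, and batch sizes are illustrative, and the communication step is only indicated by a comment.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accum_steps = 4                                   # e.g., 4 micro-batches per update
data = [(torch.randn(64, 10), torch.randint(0, 2, (64,))) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = criterion(model(x), y) / accum_steps   # scale so the sum is a mean
    loss.backward()                               # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        # In gTop-k SSGD, the accumulated gradients would be sparsified and
        # exchanged here before the update.
        optimizer.step()
        optimizer.zero_grad()
```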
The time breakdown for the evaluated models is shown in Fig. 11. On one hand, for the VGG16 and AlexNet models, the communication overheads are much larger than the computation because VGG16 and AlexNet have three fully connected layers with a large number of parameters while their computation is fast. This also explains why the scaling efficiency of SSGD in Fig. 10 is low for these models even with gTop-k sparsification. On the other hand, the communication and sparsification overheads are much smaller than the computation for ResNet20 and ResNet50, which indicates low communication-to-computation ratios, so the scaling efficiency remains high even on the low-bandwidth network.
Furthermore, it is noted that the time used by gradient sparsification is comparable to the computation time for the VGG16 and AlexNet models. The main reason is that the Top-k selection on the GPU is inefficient: it generally requires a sort over the whole gradient vector, which is non-trivial to parallelize efficiently on SIMD architectures [38]. We leave this as a future optimization direction.
IV-F Convergence sensitivity to densities
To understand the sensitivity of model convergence to the density, we run experiments with different density values using VGG16 and ResNet20 on the Cifar10 dataset. The convergence curves are shown in Fig. 12. It can be seen that even a very low density does not have a large impact on the convergence of either model. However, a trade-off should be made to balance the sparsification ratio against the convergence speed. On one hand, higher sparsification brings higher scaling efficiency for a larger number of workers; on the other hand, one should be careful about the upper bound of sparsity beyond which model convergence is negatively affected.
V Related Work
Gradient size reduction in communication is crucial for distributed training using synchronous SGD. Researchers have proposed quantization, sparsification, and lossless data stream compression. Gradients and weights in DNN models are usually stored as single-precision floating-point numbers, which means that each gradient requires 32 bits. Gupta et al. [39] propose a 16-bit wide fixed-point number representation for model parameters and gradients to improve computation and energy efficiency. To keep the model accuracy, researchers [40][41] propose the mixed-precision technique, which updates the model with 32-bit precision while the computation is performed in 16-bit precision. Hubara et al. [10] further quantize model parameters to 4-bit precision without losing accuracy, and 2-bit [42] and even 1-bit [14][43] schemes have also been proposed for minimal quantization. 1-bit quantization uses the smallest possible number of bits for a single value and can achieve up to a 32x smaller message size than the 32-bit counterpart. However, quantization errors remain because some values become zeros if they exceed the numerical range that the precision can represent, even with careful design, which to some extent hurts the model accuracy. Error compensation techniques [13][44][12] have been proposed to address these quantization errors. Nevertheless, even 1-bit gradients can only achieve a 32x size reduction, which may not be enough for large models on slow networks.
In terms of gradient sparsification, which zeroes out a large proportion of gradients to reduce the communication size, Aji et al. [17] and Chen et al. [11] empirically demonstrate that the vast majority of gradients are not needed to update the model at each iteration, which indicates that the gradients can be made very sparse while the model still converges, thanks to the accumulation of gradient residuals. Aji et al. [17] use a static threshold to determine the value of k, while Chen et al. [11] propose a dynamic version. Lin et al. [12] further propose several optimization tricks (including a warm-up strategy, momentum correction, and gradient clipping) to address the accuracy loss introduced by dropping a large number of gradients, and they show that Top-k sparsification SSGD can converge very close to SSGD with dense gradients. The above quantization and sparsification techniques can be combined to achieve a higher gradient compression ratio with no (or very small) accuracy loss. For example, Lin et al. [12] achieve up to 270x and 600x compression ratios without loss of accuracy, while Sattler et al. [45] achieve up to 37208x with only 1% lower accuracy. Furthermore, after quantization and sparsification, one can apply lossless compression techniques such as run-length encoding to further reduce the number of transferred bytes [12]. Focusing on sparsification, the two works [21] and [22] are most closely related to ours. They recognize that efficient sparse AllReduce algorithms are non-trivial to implement, and they both propose the AllGather solution. However, the cost of the AllGather method increases linearly with the number of workers, so it can be inefficient when scaling to large clusters.
VI Conclusion and Future Work
In this work, we first show that the accumulated results of Top-k gradients can be further sparsified by choosing only the largest absolute gradients before updating the model, with little impact on the model convergence. We then identify that Top-k sparsification is inefficient in averaging the gradients from all workers because the indices of the Top-k gradients differ across workers, so one has to use the AllGather collective to collect all the Top-k gradients and indices. The AllGather method for Top-k aggregation (TopKAllReduce) scales linearly with the number of workers (i.e., the communication complexity is O(kP), where P is the number of workers), so it has low scalability on large clusters. To this end, we propose a communication-efficient global Top-k (gTop-k) sparsification approach for SSGD. The gradient aggregation algorithm based on gTop-k, named gTopKAllReduce, only requires a communication complexity of O(k log P), which is much lower than that of TopKAllReduce. Experimental studies on a variety of deep neural networks, including convolutional neural networks and a recurrent neural network (LSTM), verify that gTop-k SSGD has little impact on model convergence (the convergence curves are similar to those of SSGD with dense gradients). The experiments conducted on the 32-GPU cluster interconnected with 1 Gbps Ethernet show that our proposed gTop-k SSGD has much higher scaling efficiency than SSGD and Top-k SSGD. In future work, we would like to integrate the gTop-k sparsification method with quantization to achieve higher compression ratios, and we will also study stronger theoretical convergence analysis of gTop-k sparsification for convex and non-convex optimization problems.
VII Acknowledgements
We would like to thank MassGrid.com for their support in providing the GPU cluster for the experiments.
References
 [1] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., “Large scale distributed deep networks,” in Advances in neural information processing systems, 2012, pp. 1223–1231.
 [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015.
 [3] S. Shi, W. Qiang, and X. Chu, “Performance modeling and evaluation of distributed deep learning frameworks on GPUs,” in The Fourth IEEE International Conference on Big Data Intelligence and Computing (DataCom 2018). IEEE, 2018.
 [4] S. Shi, X. Chu, and B. Li, “A dag model of synchronous stochastic gradient descent in distributed deep learning,” in Parallel and Distributed Systems (ICPADS), 2018 IEEE 24rd International Conference. IEEE, 2018.
 [5] D. Das, S. Avancha, D. Mudigere, K. Vaidynathan, S. Sridharan, D. Kalamkar, B. Kaul, and P. Dubey, “Distributed deep learning using synchronous stochastic gradient descent,” arXiv preprint arXiv:1602.06709, 2016.
 [6] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
 [7] W. Wang and N. Srebro, “Stochastic nonconvex optimization with large minibatches,” arXiv preprint arXiv:1709.08728, 2017.
 [8] Y. You, Z. Zhang, C.J. Hsieh, J. Demmel, and K. Keutzer, “ImageNet training in minutes,” in Proceedings of the 47th International Conference on Parallel Processing. ACM, 2018, p. 1.
 [9] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu et al., “Highly scalable deep learning training system with mixedprecision: Training ImageNet in four minutes,” arXiv preprint arXiv:1807.11205, 2018.

 [10] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
 [11] C.-Y. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrishnan, “AdaComp: Adaptive residual gradient compression for data-parallel distributed training,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [12] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” in International Conference on Learning Representations, 2018.
 [13] J. Wu, W. Huang, J. Huang, and T. Zhang, “Error compensated quantized SGD and its applications to largescale distributed optimization,” International Conference on Machine Learning, 2018.
 [14] J. Bernstein, Y.X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: compressed optimisation for nonconvex problems,” arXiv preprint arXiv:1802.04434, 2018.
 [15] D. Alistarh, T. Hoefler, M. Johansson, S. Khirirat, N. Konstantinov, and C. Renggli, “The convergence of sparsified gradient methods,” arXiv preprint arXiv:1809.10505, 2018.
 [16] S. U. Stich, J.B. Cordonnier, and M. Jaggi, “Sparsified SGD with memory,” arXiv preprint arXiv:1809.07599, 2018.

 [17] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 440–445.
 [18] P. Jiang and G. Agrawal, “A linear speedup analysis of distributed deep learning with sparse and quantized communication,” in Advances in Neural Information Processing Systems, 2018, pp. 2530–2541.
 [19] E. Chan, M. Heimlich, A. Purkayastha, and R. Van De Geijn, “Collective communication: theory, practice, and experience,” Concurrency and Computation: Practice and Experience, vol. 19, no. 13, pp. 1749–1783, 2007.
 [20] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.Y. Su, “Scaling distributed machine learning with the parameter server.” in OSDI, vol. 14, 2014, pp. 583–598.
 [21] C. Renggli, D. Alistarh, and T. Hoefler, “SparCML: Highperformance sparse communication for machine learning,” arXiv preprint arXiv:1802.08021, 2018.
 [22] J. Fang, H. Fu, G. Yang, and C.J. Hsieh, “RedSync: Reducing synchronization traffic for distributed deep learning,” arXiv preprint arXiv:1808.04357, 2018.
 [23] A. Sergeev and M. Del Balso, “Horovod: fast and easy distributed deep learning in TensorFlow,” arXiv preprint arXiv:1802.05799, 2018.
 [24] R. HechtNielsen, “Theory of the backpropagation neural network,” in Neural networks for perception. Elsevier, 1992, pp. 65–93.
 [25] A. Gibiansky, “Bringing HPC techniques to deep learning.(2017),” URL http://research. baidu. com/bringinghpctechniquesdeeplearning, 2017.
 [26] T. Hoefler, W. Gropp, R. Thakur, and J. L. Träff, “Toward performance models of MPI implementations for understanding application scaling issues,” in European MPI Users’ Group Meeting. Springer, 2010, pp. 21–30.
 [27] R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of collective communication operations in MPICH,” The International Journal of High Performance Computing Applications, vol. 19, no. 1, pp. 49–66, 2005.
 [28] S. Sarvotham, R. Riedi, and R. Baraniuk, “Connectionlevel analysis and modeling of network traffic,” in Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement. ACM, 2001, pp. 99–103.
 [29] J. PješivacGrbović, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel, and J. J. Dongarra, “Performance analysis of MPI collective operations,” Cluster Computing, vol. 10, no. 2, pp. 127–143, 2007.
 [30] A. Krizhevsky, V. Nair, and G. Hinton, “Cifar10 (canadian institute for advanced research),” URL http://www. cs. toronto. edu/kriz/cifar. html, 2010.
 [31] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei, “ImageNet: A largescale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. Ieee, 2009, pp. 248–255.
 [32] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, “Building a large annotated corpus of English: The Penn Treebank,” Computational linguistics, vol. 19, no. 2, pp. 313–330, 1993.
 [33] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [34] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [35] A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” arXiv preprint arXiv:1404.5997, 2014.
 [36] H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and E. P. Xing, “Poseidon: an efficient communication architecture for distributed deep learning on GPU clusters,” in Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference. USENIX Association, 2017, pp. 181–193.
 [37] S. Shi, X. Chu, and B. Li, “MGWFBP: Efficient data communication for distributed synchronous sgd algorithms,” in INFOCOM 2019IEEE Conference on Computer Communications, IEEE, 2019.
 [38] A. Shanbhag, H. Pirk, and S. Madden, “Efficient TopK query processing on massively parallel hardware,” in Proceedings of the 2018 International Conference on Management of Data. ACM, 2018, pp. 1557–1570.
 [39] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in International Conference on Machine Learning, 2015, pp. 1737–1746.
 [40] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaev, G. Venkatesh et al., “Mixed precision training,” in International Conference on Learning Representations, 2018.
 [41] A. Svyatkovskiy, J. KatesHarbeck, and W. Tang, “Training distributed deep recurrent neural networks with mixed precision on GPU clusters,” in Proceedings of the Machine Learning on HPC Environments. ACM, 2017, p. 10.
 [42] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in neural information processing systems, 2017, pp. 1509–1519.
 [43] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1bit stochastic gradient descent and its application to dataparallel distributed training of speech DNNs,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 [44] N. Dryden, T. Moon, S. A. Jacobs, and B. Van Essen, “Communication quantization for dataparallel training of deep neural networks,” in Machine Learning in HPC Environments (MLHPC), Workshop on. IEEE, 2016, pp. 1–8.
 [45] F. Sattler, S. Wiedemann, K.R. Müller, and W. Samek, “Sparse binary compression: Towards distributed deep learning with minimal communication,” arXiv preprint arXiv:1805.08768, 2018.