With the growth of training data volume and the increasing complexity of deep neural networks (DNNs), distributed computing environments (such as GPU clusters) are widely adopted to accelerate the training of DNNs. The data-parallel synchronous stochastic gradient descent (S-SGD) method is one of the commonly used optimizers to minimize the objective function of large-scale DNNs. Compared to SGD on a single worker, S-SGD distributes the workload to multiple workers to accelerate training, but it also introduces the communication overhead of exchanging model parameters or gradients in each iteration. Assume that there are $P$ workers training a single DNN model with S-SGD. In every iteration, all workers take different mini-batches of data to calculate the model gradients in parallel. They then need to average the gradients before updating the model parameters, which involves significant data communication. Because the computing power of computational units (e.g., GPUs and Google TPUs) grows much faster than network speed, network communication has become the training bottleneck when the communication-to-computation ratio is high. Many large IT companies use expensive high-speed networks such as 40/100 Gbps InfiniBand (IB) or Ethernet to alleviate the communication pressure, but many researchers and small companies can only use consumer-level GPUs connected by low-bandwidth networks such as 1 Gbps Ethernet.
To address the communication challenge, one can either increase the workload of workers by choosing a large batch size, or reduce the required data communication in each iteration. Very recently, many large-batch SGD techniques have been proposed with sophisticated optimization strategies to increase the scaling efficiency without losing model accuracy. On the other hand, gradient sparsification, quantization and compression methods have been proposed to dramatically reduce the size of exchanged gradients without affecting the convergence rate. Among the model/gradient size reduction techniques, Top-k sparsification is one of the key approaches: it sparsifies the gradients to a very low density, so that the vast majority of gradients are zeroed out and need not be transferred.
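|Aggregation Algorithm||Complexity||Time Cost|
|DenseAllReduce||$O(m)$||$2(P-1)\alpha + 2\frac{P-1}{P}m\beta$|
|TopKAllReduce||$O(kP)$||$\log(P)\alpha + 2(P-1)k\beta$|
|gTopKAllReduce||$O(k\log P)$||$2\log(P)\alpha + 4k\log(P)\beta$|
Note: $k = \rho \times m$. $\alpha$ and $\beta$ are machine dependent and constant on a specific machine.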
Top-k sparsification has been a key gradient compression method with empirical and theoretical studies, in which researchers have verified that only a small number of gradients need to be averaged during gradient aggregation without impairing model convergence or accuracy. However, the sparsified gradients are generally associated with irregular indices, which makes it a challenge to efficiently accumulate the selected gradients from all workers (worker and GPU are used interchangeably in this paper). The ring-based AllReduce method used on dense gradients (DenseAllReduce) has an $O(m)$ communication complexity, where $m$ is the size of parameters (or gradients); we use $P$ to denote the number of workers. In Top-k sparsification, assume that the density of gradients is $\rho$ on each worker, so that $k = \rho \times m$ values are selected. The corresponding indices of the non-zero values are irregular across workers and iterations, so each worker generally needs to transfer $2k$ values ($k$ gradients and $k$ indices) in each iteration. Note that gradient sparsification is not well suited to parameter server (PS) based S-SGD because the workers should pull the whole model without any compression/sparsification in every iteration, whilst decentralized S-SGD with AllReduce is better suited to gradient compression. However, with sparse gradients, the DenseAllReduce method cannot be directly used to accumulate all the sparse gradients with irregular indices, and recent solutions use the AllGather collective, which is inefficient even when $\rho$ is very small. The AllGather collective has an $O(kP)$ communication complexity. We use TopKAllReduce to denote the method of averaging irregularly indexed Top-k gradients by adopting AllGather. When scaling to a large number of workers (i.e., $P$ is large), even high sparsification ratios still generate significant communication overhead.
The main idea of Top-k sparsification is based on the fact that gradients with larger absolute values contribute more to model convergence. We observe that one can further select the Top-k gradients from the accumulated results of the $P$ groups of Top-k values generated by the $P$ workers. In other words, even though $P$ workers can generate up to $k \times P$ non-zero gradients for a model update, we can pick the Top-k gradients (in terms of absolute values) for the model update in each iteration. Based on this observation, we propose an efficient Top-k sparsification method to tackle the difficulty of TopKAllReduce without affecting model convergence. Specifically, instead of accumulating the irregularly indexed non-zero gradients from all workers, we choose the global Top-k (gTop-k) gradients in terms of absolute values. (In this paper, we mainly discuss applying gTop-k sparsification to decentralized S-SGD with AllReduce, but it is also applicable to PS-based distributed SGD.) gTop-k can make use of a tree structure to select the global Top-k values from all workers, which we call gTopKAllReduce, such that the communication complexity is reduced from $O(kP)$ to $O(k \log P)$. We summarize the communication complexities of different gradient aggregation solutions in Table I.
In this paper, we first implement the gTopKAllReduce algorithm, which provides a much more efficient summation of global Top-k sparse gradients from distributed workers. Then we integrate our proposed gTopKAllReduce into gTop-k S-SGD under PyTorch (https://pytorch.org/), which is one of the most popular deep learning frameworks, and MPI (https://www.open-mpi.org/). On a 32-node GPU cluster connected by 1-Gbps Ethernet, gTop-k S-SGD achieves a 2.7-12.8x speedup over S-SGD with the highly optimized libraries Horovod and NCCL (https://developer.nvidia.com/nccl). gTop-k S-SGD is also generally faster than Top-k S-SGD in the evaluated experiments on various DNNs and datasets. Our contributions are summarized as follows:
We observe that the accumulated results of Top-k sparsification can be further sparsified before being applied to the model.
We propose an efficient global Top-k sparsification algorithm for distributed SGD, called gTop-k S-SGD, to accelerate distributed training of deep neural networks without losing model convergence and accuracy.
We implement the proposed gTop-k S-SGD atop the popular framework PyTorch and MPI, and we also release all our experimental parameters for reproducibility.
gTop-k S-SGD achieves significant improvement on real-world applications with various DNNs and datasets under low-bandwidth networks (e.g., 1 Gbps Ethernet).
The rest of the paper is organized as follows. We introduce the preliminaries in Section II, in which some background information and the main problem are clarified. In Section III, we present our observation from Top-k sparsification and propose an efficient gTop-k S-SGD algorithm. Then we present the detailed experimental study in Section IV. Section V introduces the related work, and finally we conclude the paper in Section VI.
In this section, we briefly introduce the background knowledge on the training of DNNs and the distributed SGD used for large-scale models. We also describe the current Top-k sparsification technique for compressing gradients in distributed SGD. For ease of presentation, some frequently used notations are summarized in Table II.
|$P$||The number of workers in the cluster.|
|$m$||The size of a message in bytes.|
|$\alpha$||Latency (startup time) of the network between two workers.|
|$\beta$||Transmission time per byte between two nodes.|
|$\rho$||Density of the gradients for aggregation.|
|$k$||The size of gradients for aggregation, and $k = \rho \times m$.|
|$t_{iter}$||Time of an iteration.|
|$t_f$||Time of the forward pass in each iteration.|
|$t_b$||Time of the backward propagation in each iteration.|
|$t_u$||Time of the model update in each iteration.|
|$t_c$||Time of communication cost in each iteration.|
Deep neural networks (DNNs) are generally stacked with many hierarchical layers, and each layer is a transformation function of the input values. We can formulate DNNs as Eq. 1.
$$a^{(l)} = f(W^{(l)}, x^{(l)}) \tag{1}$$
where $x^{(l)}$ and $a^{(l)}$ are the input and output of layer $l$ ($l = 1, 2, \ldots, L$ for $L$-layer DNNs) respectively. Inputs of the current layer are the outputs of its previous layer(s) (e.g., $x^{(l)} = a^{(l-1)}$). The function $f$ is the transformation applied by the layer, and $W^{(l)}$ are the trainable model parameters, which are iteratively updated during model training using mini-batch stochastic gradient descent (SGD) optimizers and the backpropagation algorithm.
II-B Mini-batch SGD
There is an objective function $\mathcal{L}(W, D)$ to define the differences between the prediction values and the ground truth of a DNN. The mini-batch SGD optimizer updates the parameters iteratively to minimize the objective function. To be specific, there are three phases in each iteration during training: 1) Feed-forward phase: a mini-batch of data ($D_i$) is read as the input of a DNN and is fed forward across the neural network from the first layer to the last layer, which finally generates the prediction values used to evaluate the objective function $\mathcal{L}$. 2) Backward-propagation phase: the gradients w.r.t. parameters and inputs are calculated from the last layer to the first layer. 3) Update phase: the parameters are updated by the afore-generated gradients using the following formula:
$$W_{i+1} = W_i - \eta \cdot \nabla \mathcal{L}(W_i, D_i) \tag{2}$$
where $\eta$ is the learning rate. For single-worker training, phases 1) and 2) are the main time costs of an iteration, which are computing-intensive tasks. So the average time of one iteration can be approximated by $t_{iter} = t_f + t_b$.
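The three phases of a mini-batch SGD iteration can be sketched on a toy problem. The following is a minimal illustration only, assuming a one-parameter model $y = w x$ with a mean squared-error loss; the function name `sgd_step` and the data are hypothetical and unrelated to the DNNs evaluated in this paper.

```python
def sgd_step(w, batch, lr):
    """One mini-batch SGD iteration for the toy model y = w * x (squared loss).

    `batch` is a list of (x, y) pairs; `lr` is the learning rate.
    Returns the updated parameter and the loss measured before the update.
    """
    n = len(batch)
    # 1) Feed-forward phase: compute predictions and the objective value.
    preds = [w * x for x, _ in batch]
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / n
    # 2) Backward-propagation phase: gradient of the loss w.r.t. w.
    grad = sum(2.0 * (p - y) * x for p, (x, y) in zip(preds, batch)) / n
    # 3) Update phase: gradient descent step.
    return w - lr * grad, loss


if __name__ == "__main__":
    # Hypothetical data generated by y = 2x, so w should approach 2.
    data = [(1.0, 2.0), (2.0, 4.0)]
    w = 0.0
    for it in range(5):
        w, loss = sgd_step(w, data, lr=0.1)
        print(f"iteration {it}: loss={loss:.4f}, w={w:.4f}")
```

On a single worker, phases 1) and 2) dominate the iteration time, which is why the iteration time is approximated by the forward and backward times alone.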
II-C Synchronized SGD
Synchronized SGD (S-SGD) with data parallelism is widely applied to train models with multiple workers (say $P$ workers, indexed by $g$). Each worker keeps a consistent model, takes a different mini-batch of data, forwards it through phase 1), and then follows phase 2) to calculate the gradients in parallel. Since the data read by different workers are not the same, the generated gradients differ across workers in each iteration; therefore, to keep the update explicitly the same as in mini-batch SGD, the gradients from different workers should be averaged before updating the model. The update formula of the parameters is rewritten as
$$W_{i+1} = W_i - \eta \frac{1}{P} \sum_{g=1}^{P} G_i^g \tag{3}$$
where $G_i^g$ denotes the gradients of worker $g$ at the $i$-th iteration. The gradients are located on different workers without shared memory, so the averaging operation involves communication costs, which generally become the system bottleneck. The average iteration time of S-SGD can be approximated by $t_{iter} = t_f + t_b + t_c$. Assuming that we use weak scaling on $P$ workers with S-SGD, the scaling efficiency can be approximated by
$$e = \frac{t_f + t_b}{t_f + t_b + t_c} \tag{4}$$
$t_c$ is generally related to $P$ and the model/gradient size $m$. Therefore, with larger $P$, it is crucial to reduce $m$ to achieve lower $t_c$ and thus higher scaling efficiency.
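To make the effect of the communication time on weak scaling concrete, the sketch below evaluates the ratio of the single-worker iteration time (forward plus backward) to the S-SGD iteration time (forward plus backward plus communication). The function name and all timings are hypothetical and are not measurements from our experiments.

```python
def scaling_efficiency(t_f: float, t_b: float, t_c: float) -> float:
    """Weak-scaling efficiency: (t_f + t_b) / (t_f + t_b + t_c)."""
    compute = t_f + t_b
    return compute / (compute + t_c)


if __name__ == "__main__":
    # Hypothetical per-iteration times in seconds.
    t_f, t_b = 0.1, 0.2
    for t_c in (0.0, 0.1, 0.3, 0.9):
        e = scaling_efficiency(t_f, t_b, t_c)
        print(f"t_c = {t_c:.1f} s -> efficiency = {e:.2f}")
```

With no communication the efficiency is 1, and it halves as soon as the communication time equals the computation time, which is why reducing the communicated volume matters most on slow networks.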
In Eq. 3, the summation of $G_i^g$ (i.e., $\sum_{g=1}^{P} G_i^g$) can be directly implemented by an AllReduce collective, which we denote as DenseAllReduce. The ring-based AllReduce algorithm (which is also included in NCCL) is an efficient implementation on dense-GPU clusters. To understand the time cost of DenseAllReduce, we revisit the time model of the ring-based AllReduce, whose time cost can be represented by
$$t_c^{dar} = 2(P-1)\alpha + 2\frac{P-1}{P} m \beta \tag{5}$$
where $\alpha$ is the latency (startup time) of a message transfer between two nodes, and $\beta$ is the transmission time per byte between two nodes using the alpha-beta communication model.
II-E Top-k sparsification
From Eq. 5, it is noted that as $P$ or $m$ becomes large, the communication cost increases linearly. To reduce the size of the transferred messages $m$, Top-k sparsification is proposed to make the gradients highly sparse. Using Top-k sparsification, each worker only needs to contribute its $k$ largest absolute values of gradients to be summed up in each iteration, and the zeroed-out values of gradients are stored locally and accumulated at the next iteration. Both theoretical and empirical studies have verified that Top-k sparsification has little impact on model convergence and accuracy. The pseudo-code of Top-k sparsification S-SGD is shown in Algorithm 1. Note that in Algorithm 1, the implementation of TopKAllReduce is completely different from DenseAllReduce, since the non-zero values of the sparsified gradients may come from inconsistent indices on different workers. Efficient implementations of such a sparse AllReduce are non-trivial. Current methods use AllGather to implement TopKAllReduce, in which the sparsified gradients are gathered as a dense vector of values combined with a vector of their corresponding indices. Both vectors have size $k$. According to the communication model of AllGather, the time cost for all-gathering messages of size $2k$ is
$$t_c^{tar} = \log(P)\alpha + 2(P-1)k\beta \tag{6}$$
From Eq. 6, we can see that the time cost of TopKAllReduce increases linearly with $P$. Therefore, Top-k sparsification also has difficulty scaling to large clusters on low-bandwidth networks. In this paper, we propose a global Top-k (gTop-k) sparsification approach to address the problem.
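The local selection step of Top-k sparsification with residual accumulation can be sketched in plain Python as follows. This is an illustrative sketch under simplifying assumptions rather than the Algorithm 1 implementation: gradients are Python lists, the sparse result is stored as an {index: value} dictionary, and the function name `topk_sparsify` is hypothetical.

```python
import heapq


def topk_sparsify(grad, residual, k):
    """Select the k largest-magnitude entries of (grad + residual).

    Returns (sparse, new_residual), where `sparse` maps index -> value for
    the selected entries and `new_residual` keeps every unselected entry so
    that it is accumulated into the next iteration.
    """
    acc = [g + r for g, r in zip(grad, residual)]
    keep = set(heapq.nlargest(k, range(len(acc)), key=lambda i: abs(acc[i])))
    sparse = {i: acc[i] for i in sorted(keep)}
    new_residual = [0.0 if i in keep else v for i, v in enumerate(acc)]
    return sparse, new_residual


if __name__ == "__main__":
    # Hypothetical gradients of a 4-parameter model over two iterations.
    residual = [0.0, 0.0, 0.0, 0.0]
    for grad in ([0.5, -0.1, 0.05, -0.9], [0.0, -0.2, 0.0, 0.0]):
        sparse, residual = topk_sparsify(grad, residual, k=2)
        print("sent:", sparse, "residual:", residual)
```

In the second iteration, the small entry at index 1 that was dropped in the first iteration is added to the new gradient and becomes large enough to be selected, which is the purpose of keeping residuals locally.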
In this section, we first present some observations from Top-k sparsification S-SGD, and then we present our proposed global Top-k sparsification algorithm. For ease of presentation, we assume that the number of workers $P$ is a power of 2.
III-A Observations from Top-k sparsification
In the previous section, we introduced Top-k sparsification S-SGD, in which $k$ values are selected on each local worker and then accumulated across all the workers. We examine the distribution of the non-zero values of the aggregated gradient, which is generated by the summation of the sparse gradients from all workers. We found that not all of these non-zero values (whose number is $K$, with $k \le K \le k \times P$) contribute to the model convergence. Specifically, the aggregated gradient can be further sparsified such that only a smaller number of non-zero gradients are needed for model updates. In other words, one can further select the $k$ largest absolute values from the aggregated gradient to update the model while maintaining the model convergence. In this scenario, however, the selection leaves $K - k$ afore-summed gradients that are neither applied to the model nor stored in the local residuals, which could eventually damage the model convergence. Therefore, if only $k$ elements are selected from the aggregated gradient to update the model, the remaining $K - k$ elements should be put back as residuals with their corresponding indices, so that they are accumulated locally and contribute to model updates in future iterations. This is required to ensure the convergence of gTop-k.
III-B The key idea of gTop-k
According to the above observations, only the $k$ largest absolute values are needed from all the sparsified gradients $\tilde{G}_i^g$, where $g = 1, 2, \ldots, P$. Therefore, the problem is formulated as the global Top-k (gTop-k) selection from the sparsified gradients instead of TopKAllReduce, while the sparsified gradients are located on distributed workers. We again consider the non-zero values and corresponding indices of the aggregated gradient, whose number of non-zero values is $K$. We first use the AllGather version to illustrate the key idea of gTop-k sparsification, and then we present our new efficient algorithm for gTop-k sparsification. In Algorithm 1, the aggregated gradient has $K$ non-zero values contributing updates to the model. Different from Top-k sparsification, we further sparsify the aggregated gradient by selecting its $k$ largest absolute values. The straightforward implementation of gTop-k is shown in Algorithm 2. Please note that this version is only used to illustrate the key idea of how to select the gradients that update the model. The efficient algorithm is presented in the next subsection. An example of gTop-k sparsification using AllGather is shown in Fig. 1.
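The AllGather-based idea can be illustrated with a single-process simulation. The sketch below is not Algorithm 2 itself: it assumes each worker's sparsified gradient is an {index: value} dictionary, emulates the gather-and-sum in a loop, and uses the hypothetical name `gtopk_allgather`. It also returns, per worker, the contributed entries that were not selected, which should be put back into that worker's residuals.

```python
import heapq
from collections import defaultdict


def gtopk_allgather(local_sparse, k):
    """Global Top-k selection over the sum of all workers' sparse gradients.

    `local_sparse` is a list with one {index: value} dict per worker.
    Returns (global_topk, returned), where `returned[g]` holds the entries of
    worker g that were not selected and must go back to its residuals.
    """
    total = defaultdict(float)
    for sparse in local_sparse:  # emulates AllGather followed by a summation
        for i, v in sparse.items():
            total[i] += v
    keep = set(heapq.nlargest(k, total, key=lambda i: abs(total[i])))
    global_topk = {i: total[i] for i in sorted(keep)}
    returned = [{i: v for i, v in s.items() if i not in keep}
                for s in local_sparse]
    return global_topk, returned


if __name__ == "__main__":
    # Hypothetical Top-2 selections of four workers.
    workers = [{0: 0.5, 3: -0.9}, {1: 0.4, 3: 0.2},
               {0: 0.3, 2: -0.1}, {1: 0.35, 5: 0.6}]
    selected, returned = gtopk_allgather(workers, k=2)
    print("global Top-k:", selected)
    print("put back into residuals:", returned)
```

In this toy input the summed gradient has five non-zero entries, which lies between $k = 2$ and $k \times P = 8$, and only two of them are used for the model update.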
III-C gTopKAllReduce: An efficient AllReduce algorithm for gTop-k sparsification
From Eq. 6, we can see that the AllGather collective is inefficient for conducting the AllReduce operation on irregularly indexed gradients. Under the same density, the main purpose of our proposed efficient algorithm is to eliminate the high impact of the variable $P$ on the time cost. For ease of presentation, we first define a Top-k operation, $\top$, of two sparse vectors, say $\tilde{G}^a$ and $\tilde{G}^b$, both of which have $k$ non-zero values.
A Top-k operation $\top$: $\tilde{G}^{a,b} = \tilde{G}^a \top \tilde{G}^b = mask \odot (\tilde{G}^a + \tilde{G}^b)$, where $mask = |\tilde{G}^a + \tilde{G}^b| \ge thr$, and $thr$ is the $k$-th largest value of $|\tilde{G}^a + \tilde{G}^b|$.
Note that the number of non-zero values of $\tilde{G}^{a,b}$ is also $k$. During training of S-SGD, $\tilde{G}^a$ and $\tilde{G}^b$ are located on different workers without shared memory. One should exchange the two sparse vectors to obtain a global Top-k sparse vector $\tilde{G}^{a,b}$. The operation for two distributed workers is shown in Fig. 2, which demonstrates that $\top$ can be efficiently implemented by a send operation (network communication), followed by a local Top-k selection on a maximum of $2k$ non-zero values.
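The local part of the pairwise Top-k operation (after one worker has received the other's sparse vector) can be sketched as below. This is a minimal illustration assuming sparse vectors are stored as {index: value} dictionaries; the function name `topk_merge` is hypothetical.

```python
import heapq


def topk_merge(a, b, k):
    """Sum two sparse vectors and keep the k largest-magnitude entries.

    `a` and `b` map index -> value and each hold at most k entries, so the
    sum has at most 2k non-zero entries before the selection.
    """
    merged = dict(a)
    for i, v in b.items():
        merged[i] = merged.get(i, 0.0) + v
    keep = heapq.nlargest(k, merged, key=lambda i: abs(merged[i]))
    return {i: merged[i] for i in sorted(keep)}


if __name__ == "__main__":
    # Hypothetical sparse vectors of two workers with k = 2.
    g_a = {0: 0.5, 3: -0.9}
    g_b = {1: 0.4, 3: 0.2}
    print(topk_merge(g_a, g_b, k=2))
```

Because the output again has at most $k$ entries, the message size stays constant when the operation is applied repeatedly across workers.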
When scaling to $P$ workers (assume that $P$ is a power of 2), since the final $k$ is equal to the local $k$, we use a recursive doubling technique to reduce the total transfer size. An example of this recursive doubling procedure selecting the global Top-k values is shown in Fig. 3. There are $\log_2 P$ rounds of communication for $P$ workers. At the $j$-th round, there are $P/2^j$ pairs of workers that perform the $\top$ operation, which is the same as in Fig. 2, in parallel. After $\log_2 P$ rounds, the first worker (rank 0) finally generates the global Top-k values (i.e., $\tilde{G} = \tilde{G}^1 \top \tilde{G}^2 \top \cdots \top \tilde{G}^P$).
According to the illustration of recursive doubling gTop-k, we propose the gTop-k based AllReduce, called gTopKAllReduce for short. The complete procedure of gTopKAllReduce is shown in Algorithm 3. The algorithm first selects the non-zero values of the sparse gradient vector into a send buffer, which is to be sent to other workers. It then allocates a receive buffer to hold the send buffer received from another worker at each communication round. The following lines implement the $\top$ operation, whose result is stored in the send buffer for the next communication round. The functions “Recv” and “Send” are a paired operation and can be implemented with the interfaces of MPI. Since the result so far is only stored at the first worker (rank 0), the algorithm then broadcasts the result to all the other workers, which also requires $\log_2 P$ communications using the flat-tree algorithm. Finally, the algorithm records the mask that indicates which indices are used in the global result.
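The communication pattern of the tree-based reduction can be simulated in a single process. The sketch below is a simulation only and does not use MPI: the list `bufs` stands in for the per-rank send buffers, the pairing of ranks at each round is one possible schedule that ends at rank 0, and the names `topk_merge` and `gtopk_allreduce` are hypothetical.

```python
import heapq


def topk_merge(a, b, k):
    """Sum two sparse {index: value} vectors and keep the k largest |values|."""
    merged = dict(a)
    for i, v in b.items():
        merged[i] = merged.get(i, 0.0) + v
    keep = heapq.nlargest(k, merged, key=lambda i: abs(merged[i]))
    return {i: merged[i] for i in sorted(keep)}


def gtopk_allreduce(local_sparse, k):
    """Simulate the tree reduction for P = 2^n workers, then a broadcast.

    Returns (per_rank_results, rounds), where `rounds` is the number of
    reduction rounds, i.e. log2(P).
    """
    P = len(local_sparse)
    assert P > 0 and P & (P - 1) == 0, "P is assumed to be a power of 2"
    bufs = [dict(s) for s in local_sparse]
    rounds, step = 0, 1
    while step < P:
        # Rank r + step "sends" its buffer to rank r, which merges locally.
        for r in range(0, P, 2 * step):
            bufs[r] = topk_merge(bufs[r], bufs[r + step], k)
        step *= 2
        rounds += 1
    result = bufs[0]  # after log2(P) rounds only rank 0 holds the result
    return [dict(result) for _ in range(P)], rounds  # emulated broadcast


if __name__ == "__main__":
    # Hypothetical Top-2 selections of four workers.
    workers = [{0: 0.5, 3: -0.9}, {1: 0.4, 3: 0.2},
               {0: 0.3, 2: -0.1}, {1: 0.35, 5: 0.6}]
    results, rounds = gtopk_allreduce(workers, k=2)
    print("rounds:", rounds, "result on every rank:", results[0])
```

Each round halves the number of active ranks while every message carries at most $k$ values and $k$ indices. Note that entries dropped at an intermediate round cannot be recovered later, so on this toy input the tree-based selection differs from selecting the Top-k of the full sum.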
III-D Communication complexity analysis of gTopKAllReduce
There are two main processes in gTopKAllReduce. The first one is the calculation of the global Top-k values. From Fig. 3, the first worker takes part in the communication at every round, so we only need to analyze the time cost of rank 0. Rank 0 takes $\log_2 P$ rounds of communication to calculate the global Top-k values. In each communication, rank 0 receives $2k$ elements ($k$ values and $k$ indices) from another worker, which takes a time cost of $\alpha + 2k\beta$. Thus, the overall time cost of the first process is $\alpha \log_2 P + 2k\beta \log_2 P$. In the second process, the global Top-k values in the first worker are broadcast to all the other workers. The broadcast operation takes $\alpha \log_2 P + 2k\beta \log_2 P$ according to the flat-tree algorithm. In summary, the time cost of gTopKAllReduce is
$$t_c^{gar} = 2\alpha \log_2 P + 4k\beta \log_2 P \tag{7}$$
The communication complexity is much lower than that of TopKAllReduce, especially when $P$ is large.
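The three $\alpha$-$\beta$ cost models can be compared numerically. The sketch below assumes the cost expressions written in the comments for the ring-based DenseAllReduce, the AllGather-based TopKAllReduce, and the tree-based gTopKAllReduce. As further assumptions, $m$ and $k$ are counted in 4-byte elements, $\beta$ is the time per element, and the values of $\alpha$, $\beta$, $m$ and $\rho$ are hypothetical rather than the values measured on our cluster.

```python
import math


def t_dense_allreduce(P, m, alpha, beta):
    # Ring AllReduce: 2(P-1)*alpha + 2*((P-1)/P)*m*beta
    return 2 * (P - 1) * alpha + 2 * (P - 1) / P * m * beta


def t_topk_allreduce(P, k, alpha, beta):
    # AllGather of k values and k indices: log2(P)*alpha + 2(P-1)*k*beta
    return math.log2(P) * alpha + 2 * (P - 1) * k * beta


def t_gtopk_allreduce(P, k, alpha, beta):
    # Tree reduction plus broadcast: 2*log2(P)*alpha + 4*k*log2(P)*beta
    return 2 * math.log2(P) * alpha + 4 * k * math.log2(P) * beta


if __name__ == "__main__":
    alpha = 5e-4            # hypothetical latency in seconds
    beta = 3.2e-8           # hypothetical time per 4-byte element at ~1 Gbps
    m = 25_000_000          # hypothetical number of gradient elements
    k = int(0.001 * m)      # hypothetical density of 0.001
    for P in (4, 8, 16, 32, 64):
        print(f"P={P:3d}  dense={t_dense_allreduce(P, m, alpha, beta):.4f}s"
              f"  topk={t_topk_allreduce(P, k, alpha, beta):.4f}s"
              f"  gtopk={t_gtopk_allreduce(P, k, alpha, beta):.4f}s")
```

Under these assumed parameters, the AllGather-based scheme is cheaper for a few workers because it pays fewer latency terms, while the tree-based scheme becomes cheaper as $P$ grows because its bandwidth term scales with $\log_2 P$ instead of $P$.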
III-E gTop-k S-SGD with gTopKAllReduce
With the above efficient implementation of gTopKAllReduce, we improve the gTop-k S-SGD in Algorithm 2 by replacing its AllGather-based aggregation lines with a single line that invokes gTopKAllReduce, shown in Algorithm 3. The improved version of the gTop-k S-SGD training algorithm is shown in Algorithm 4. Compared to Top-k S-SGD, gTop-k S-SGD only introduces one extra computation step in Algorithm 4, whose overhead is much smaller than the communication overhead, while it substantially reduces the communication complexity.
III-F System overview
We implement our proposed gTop-k S-SGD atop PyTorch and MPI. The sparsification (i.e., local Top-k selection) and residual operations have extra overheads, but they can be parallelized with the feed-forward and backward computation. Therefore, we separate the gradient sparsification related operations from the feed-forward and backward operations. To be specific, there is a dedicated thread that processes gradient sparsification and residual management for communication, while the main thread takes charge of feed-forward/backward computation. An overview of our system architecture is shown in Fig. 4.
Note that gradient sparsification is done on the GPU, which means that the Top-k selection is invoked on the GPU, and then the handler transfers the sparsified results to the CPU for communication. This design has two advantages. First, when the number of gradients is large, the GPU can be more efficient than the CPU. Second, because the density is generally set to a very small value, transferring only the non-zero values through PCIe can be much faster than transferring the whole gradients.
IV Experimental Study
We conduct experimental evaluations to show the effectiveness of our proposed gTop-k S-SGD with real-world applications on a 32-GPU cluster. We first validate the convergence of our proposed gTop-k S-SGD, which should have nearly consistent convergence behavior with the dense version. Then we evaluate the time cost and efficiency of gTopKAllReduce and compare it with the dense AllReduce (DenseAllReduce) and the Top-k AllReduce (TopKAllReduce) for different message sizes. After that, we compare the training efficiency of the three S-SGD algorithms (i.e., S-SGD with dense gradients, Top-k S-SGD, and gTop-k S-SGD). We also break down the training process of an iteration into several time-consuming phases to analyze the extra overheads introduced by gTop-k sparsification.
IV-A Experimental setup
Hardware: The distributed environment is configured as a 32-GPU cluster with 32 machines, each of which is equipped with one Nvidia P102-100 GPU. The network between two machines is 1 Gbps Ethernet (1GbE). Details of the hardware are shown in Table III. Each machine has a low-performance configuration, similar to a personal desktop computer.
|CPU||Intel(R) Celeron(R) CPU N3350 @ 1.10GHz|
|GPU||Nvidia P102-100 (3200 CUDA cores and 5GB Memory)|
|PCI-e||PCI-e x1 lane with a maximum bandwidth of 250 MB/s|
|Memory||4GB DDR3 with a 16GB swap file|
|Network||1 Gbps Ethernet (1GbE)|
Software: All GPU machines are installed with Ubuntu-16.04, the Nvidia GPU driver at version 390.48 and CUDA-9.1. The communication libraries are OpenMPI-3.1.1 (https://www.open-mpi.org/) and NCCL-2.1.5 (https://developer.nvidia.com/nccl). We use the highly optimized distributed training library Horovod (https://github.com/uber/horovod) at version 1.4.1. The deep learning framework is PyTorch at version 0.4.0 with cuDNN-7.1.
|Dataset||Training samples||Validation samples||Input size|
|ImageNet||1.2 million||10000||224 × 224|
|Batch size||Learning rate|
Note: All models are trained with the single precision floating point (i.e., 32-bit).
DNNs: We choose various DNNs from several areas of AI applications with different datasets. The datasets include Cifar-10 and ImageNet for image classification and the Penn Treebank corpus (PTB) for language modeling. The data sizes of the evaluated datasets are shown in Table IV. For each dataset, we use one or two benchmark deep models. For the Cifar-10 dataset, we use the VGG-16 model and the ResNet-20 model. For the ImageNet dataset, the AlexNet model and the ResNet-50 model are used. We use a 2-layer LSTM language model (LSTM-PTB) for the PTB dataset, which is similar to that used in prior work. The details of the deep models are shown in Table V. All models are trained using momentum SGD.
IV-B Convergence comparison
The convergence of Top-k sparsification S-SGD has been verified to be nearly consistent with the dense version in much previous work, so we do not include the convergence curves of Top-k S-SGD. We compare our gTop-k S-SGD with the original S-SGD with dense gradients running on multiple workers. It has been shown that a warmup strategy in the first several epochs helps the model converge better, so we adopt a similar warmup configuration. To be specific, the first four epochs use dynamic densities and smaller learning rates, and the remaining epochs adopt a fixed low density, which means that from the fifth epoch onward the communicated messages are hundreds of times smaller.
Convergence on the Cifar-10 dataset: The convergence of the VGG-16 and ResNet-20 models on the Cifar-10 dataset is shown in Fig. 5. The results show that the convergence rate of ResNet-20 is almost the same as the baseline, while the VGG-16 model converges even slightly better than the baseline.
Convergence on the ImageNet dataset: The convergence of the AlexNet and ResNet-50 models on the ImageNet dataset is shown in Fig. 6. Again, the results show that the convergence rates of the two networks are close to the baselines. On the AlexNet model, the convergence of gTop-k S-SGD is slightly worse than the baseline, which could be caused by the very low density applied to the convolutional layers while the fully connected layers hold a large proportion of the parameters. On the other hand, gTop-k sparsification works well on the ResNet-50 model, which converges slightly faster than the baseline.
Convergence on the LSTM network: The convergence of LSTM-PTB on the PTB dataset is shown in Fig. 7. gTop-k S-SGD again converges close to S-SGD at the evaluated density.
In summary, three different types of DNNs on different benchmark datasets show that our proposed gTop-k sparsification on S-SGD does not damage the model during training and keeps model convergence very close to the dense version of S-SGD.
IV-C Communication speed
Before we demonstrate the efficiency of gTop-k S-SGD, we first evaluate the communication speed of the cluster. We test the point-to-point communication time with various message sizes because the performance of point-to-point communication plays an important role in MPI collectives and in our gTopKAllReduce. After that, we evaluate the time cost of DenseAllReduce, TopKAllReduce and gTopKAllReduce with different sizes of sparse vectors and an increasing number of workers on the 1GbE cluster.
Point-to-point communication: We test the point-to-point communication speed using the OSU Micro-Benchmark (http://mvapich.cse.ohio-state.edu/benchmarks/) at version 5.5. The time costs of point-to-point communication between two machines are shown in Fig. 8, in which we run the benchmark repeatedly to calculate the mean and standard deviation of the reported values. It can be seen that the time used for transferring a message is a linear function of the message size, which verifies the $\alpha$-$\beta$ model. Based on the measured data, we can obtain the values of $\alpha$ and $\beta$.
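One simple way to estimate the two parameters from such measurements is an ordinary least-squares fit of a straight line through the (message size, time) points. The sketch below illustrates this idea and is not necessarily the procedure used to obtain our values; the function name `fit_alpha_beta` is hypothetical, and the sample points are synthetic numbers generated from an assumed latency and per-byte time, not benchmark output.

```python
def fit_alpha_beta(sizes, times):
    """Least-squares fit of the line  time = alpha + beta * size.

    `sizes` are message sizes in bytes and `times` are transfer times in
    seconds. Returns (alpha, beta).
    """
    n = len(sizes)
    mean_s = sum(sizes) / n
    mean_t = sum(times) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(sizes, times))
    var = sum((s - mean_s) ** 2 for s in sizes)
    beta = cov / var
    alpha = mean_t - beta * mean_s
    return alpha, beta


if __name__ == "__main__":
    # Synthetic points generated from assumed alpha = 4e-4 s, beta = 9e-9 s/byte.
    sizes = [1e3, 1e4, 1e5, 1e6]
    times = [4e-4 + 9e-9 * s for s in sizes]
    alpha, beta = fit_alpha_beta(sizes, times)
    print(f"alpha = {alpha:.3e} s, beta = {beta:.3e} s/byte")
```

The intercept of the fitted line estimates the startup latency, and the slope estimates the transmission time per byte.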
Time performance of AllReduce operations: Since $P$ and $m$ are the two main factors affecting the performance of TopKAllReduce and gTopKAllReduce, we compare their time performance in two dimensions (i.e., $P$ and $m$) based on the measured $\alpha$ and $\beta$ and the time cost models in Table I. First, we compare how the time cost changes as the number of workers increases, with the message size fixed to the approximate model size of ResNet-50 and the density fixed. Second, in the configuration of a cluster with 64 workers, we compare how the time cost changes as the size of the message increases. The time comparison is shown in Fig. 9. From the left of Fig. 9, when the number of workers is small, TopKAllReduce is slightly faster than gTopKAllReduce. However, when the number of workers scales up, TopKAllReduce becomes much worse than gTopKAllReduce. Furthermore, our proposed gTopKAllReduce is much more efficient than TopKAllReduce when scaling to large message sizes. To summarize, a larger number of workers or a larger message size makes gTopKAllReduce more efficient than TopKAllReduce.
|Model||Batch size||Iteration time||Throughput|
IV-D Training efficiency
Single-GPU training speed: We first show the average training speed of one iteration on a single GPU in Table VI. It can be seen that the computation time of each iteration ranges from tens of milliseconds to several seconds, so scaling such models on 1GbE clusters is challenging, especially for models with a large number of parameters (e.g., AlexNet), because of their high communication-to-computation ratios.
Scaling efficiency: After integrating gTopKAllReduce into gTop-k S-SGD, we examine how much speedup can be achieved on low-bandwidth networks with different models and different numbers of workers. The scaling efficiency of S-SGD with the three AllReduce algorithms is shown in Fig. 10. It can be seen that dense S-SGD has the worst scaling efficiency because the full size of the gradients makes communication very slow on 1GbE clusters. Top-k S-SGD achieves some improvement over S-SGD on a small number of workers, but its performance decreases noticeably when scaling to more GPUs. In contrast, our proposed gTop-k S-SGD achieves much more stable scaling efficiency even on clusters with a larger number of GPUs. On the largest configuration, gTop-k S-SGD is faster on average than both dense S-SGD and Top-k S-SGD, and its largest speedups over both are obtained on the AlexNet model. A summary of the training throughput on different models is shown in Table VII.
Note: The throughput is measured in processed images per second (i.e., the unit is Images/s). The two speedup columns report the speedup of gTop-k compared to the dense version and the speedup of gTop-k compared to Top-k, respectively.
IV-E Time performance analysis
We analyze the time performance of gTop-k S-SGD in the distributed setting. To better understand the overheads of gTop-k sparsification, we break down the time of an iteration into three parts: GPU computation time, local sparsification time, and communication time. Note that in this work, we do not consider the parallelization between computation and communication during backward propagation. The main reason is that some deep models like ResNet-50 consume a large amount of memory, so the mini-batch size cannot be set too large and the computation time per mini-batch is short. We therefore need to reduce the communication-to-computation ratio to alleviate the impact of communication, and an effective method is to accumulate the gradients of several small mini-batches before a single update. In our ResNet-50 experiments, each worker accumulates the gradients of multiple local mini-batches for a single update, which enlarges the effective mini-batch size accordingly. Therefore, pipelining backward propagation with communication contributes little on low-bandwidth networks. Nevertheless, gTop-k sparsification is also applicable to the wait-free backward propagation algorithm and the optimal gradient merge algorithm.
The time breakdown for the evaluated models is shown in Fig. 11. On the one hand, for the VGG-16 and AlexNet models, the communication overheads are much larger than the computation because VGG-16 and AlexNet have three fully connected layers with a large number of parameters, while their computation is fast. This also explains why the scaling efficiency of S-SGD in Fig. 10 is low on these models even with gTop-k sparsification. On the other hand, the communication and sparsification overheads are much smaller than the computation for ResNet-20 and ResNet-50, which indicates low communication-to-computation ratios, so that the scaling efficiency remains high even on the low-bandwidth network.
Furthermore, the time used by gradient sparsification is comparable to the computation time on the VGG-16 and AlexNet models. The main reason is that Top-k selection on the GPU is inefficient: it generally requires a sort operation over the whole gradients, which is non-trivial to parallelize efficiently on SIMD architectures. We leave this as a future optimization direction.
IV-F Convergence sensitivity to densities
To understand the sensitivity of model convergence to densities, we run experiments with different values of the density $\rho$ using VGG-16 and ResNet-20 on the Cifar-10 dataset. The convergence curves are shown in Fig. 12. It can be seen that even a very low density does not have a big impact on the convergence of either model. However, a trade-off should be made to balance a high sparsification ratio against the convergence speed. On the one hand, higher sparsification brings higher scaling efficiency on a larger number of workers. On the other hand, one should be careful about the upper bound of the sparsity beyond which model convergence is negatively affected.
V Related Work
Gradient size reduction in communication is crucial for distributed training using synchronous SGD. Researchers have proposed quantization, sparsification, and lossless data stream compression. Gradients and weights in DNN models are stored as single-precision floating-point numbers, which means that each gradient requires 32 bits. Gupta et al.  propose a 16-bit fixed-point number representation for model parameters and gradients to improve computation and energy efficiency. To preserve model accuracy, researchers  propose the mixed-precision technique, which updates the model in 32-bit precision while performing the computation in 16-bit precision. Hubara et al.  further quantize model parameters to 4-bit precision without losing accuracy, and 2-bit  and even 1-bit  quantization have also been proposed. One bit is the smallest number of bits that can represent a single value, so 1-bit quantization yields messages up to 32× smaller than their 32-bit counterparts. However, quantization errors remain even with careful design, because some values become zero when they fall outside the numerical range that the reduced precision can represent, which to some extent hurts the model accuracy. Error compensation techniques  are further proposed to address these quantization errors. However, even 1-bit gradients can only achieve a 32× size reduction, which is still not enough for large models on slow networks.
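As a rough illustration of 1-bit quantization with error compensation, the sketch below encodes each gradient as a sign plus one shared scale and keeps the quantization error locally for the next iteration. This is a generic scheme written for illustration under the assumption of a mean-absolute-value scale; it is not the exact method of any of the cited works, and the function names are hypothetical.

```python
def one_bit_quantize(grad, residual):
    """Quantize to sign * scale and keep the quantization error locally."""
    corrected = [g + r for g, r in zip(grad, residual)]
    scale = sum(abs(v) for v in corrected) / len(corrected)
    signs = [1 if v >= 0 else -1 for v in corrected]
    decoded = [s * scale for s in signs]
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return signs, scale, new_residual


def compression_ratio(num_values, float_bits=32):
    """Dense float message size over (1 bit per value + one float scale)."""
    return (float_bits * num_values) / (num_values + float_bits)


signs, scale, residual = one_bit_quantize([0.5, -1.5, 1.0, -1.0], [0.0] * 4)
print(signs, scale, residual)   # [1, -1, 1, -1] 1.0 [-0.5, -0.5, 0.0, 0.0]
print(compression_ratio(1_000_000) < 32)   # True: the ratio approaches but never exceeds 32
```

The second function shows why the reduction from quantization alone is bounded by the original bit width, no matter how many values are sent.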
In terms of gradient sparsification, which zeroes out a large proportion of gradients to reduce the communication size, Aji et al.  and Chen et al.  empirically demonstrate that up to gradients are not needed to update the model at each iteration, which indicates that the model can still converge with very sparse gradients as long as the gradient residuals are accumulated. Aji et al.  use static threshold selection to determine the value of $k$, while Chen et al.  propose a dynamic version. Lin et al.  further propose several optimization tricks (including the warmup strategy, momentum correction, and gradient clipping) to address the accuracy loss introduced by dropping a large number of gradients, and they show that Top-$k$ sparsification S-SGD can converge very close to S-SGD with dense gradients. The above quantization and sparsification techniques can be combined to achieve a higher gradient compression ratio with no (or very small) accuracy loss. For example, Lin et al.  achieve compression ratios of up to 270× and 600× without loss of accuracy, while Sattler et al.  achieve up to 37208× with only 1% lower accuracy. Furthermore, after quantization and sparsification, one can apply lossless compression techniques such as run-length encoding to further reduce the number of bytes transferred .
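The idea of sparsification with accumulated gradient residuals can be sketched as follows. Only the k largest-magnitude entries are sent, and everything that is dropped is added back to the next gradient. The function name and the toy values are hypothetical, and the sketch omits the warmup, momentum correction, and clipping tricks mentioned above.

```python
def sparsify_with_residual(grad, residual, k):
    """Keep the k largest-magnitude entries; carry the rest to the next step."""
    acc = [g + r for g, r in zip(grad, residual)]
    order = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)
    kept = set(order[:k])
    sparse = {i: acc[i] for i in sorted(kept)}          # index -> value to send
    new_residual = [0.0 if i in kept else v for i, v in enumerate(acc)]
    return sparse, new_residual


residual = [0.0] * 4
sparse1, residual = sparsify_with_residual([0.125, -1.0, 0.5, 0.0625], residual, k=1)
sparse2, residual = sparsify_with_residual([0.125, 0.0, 0.5, 0.0], residual, k=1)
print(sparse1)    # {1: -1.0}
print(sparse2)    # {2: 1.0}
print(residual)   # [0.25, 0.0, 0.0, 0.0625]
```

In the second step, index 2 is selected only because its dropped value from the first step was accumulated, which is why small gradients are delayed rather than lost.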
Focusing on sparsification, the two other works  and  are most closely related to ours. Both recognize that efficient sparse AllReduce algorithms are non-trivial to implement, and both propose an AllGather-based solution. However, the cost of the AllGather method increases linearly with the number of workers. Therefore, AllGather could be inefficient when scaling to large clusters.
VI Conclusion and Future Work
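The linear cost of an AllGather-based aggregation can be seen in a small single-process simulation. Each worker contributes k index/value pairs, and every worker has to receive the pairs of all other workers before it can sum them. The function name and the toy gradients are hypothetical; no real communication library is used.

```python
def allgather_aggregate(worker_grads, dim):
    """Sum sparse gradients (index -> value) after gathering all of them.

    Returns the dense sum and the number of pairs worker 0 must receive.
    """
    dense = [0.0] * dim
    for sparse in worker_grads:
        for i, v in sparse.items():
            dense[i] += v
    received_pairs = sum(len(s) for s in worker_grads[1:])
    return dense, received_pairs


# Four hypothetical workers, each sending k = 2 pairs.
workers = [
    {0: 1.0, 3: -0.5},
    {0: 0.5, 2: 2.0},
    {1: -4.0, 3: -0.25},
    {2: 1.0, 4: 0.125},
]
dense, received = allgather_aggregate(workers, dim=5)
print(dense)      # [1.5, -4.0, 3.0, -0.75, 0.125]
print(received)   # 6, i.e. k * (P - 1) with k = 2 and P = 4
```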
In this work, we first show that the accumulated results of Top-$k$ gradients can be further sparsified by choosing the gradients with the largest absolute values before updating the model, with little impact on model convergence. We then identify that Top-$k$ sparsification is inefficient at averaging the gradients from all workers: because the indices of the Top-$k$ gradients differ across workers, one has to use the AllGather collective to collect all the Top-$k$ gradients and their indices. The AllGather method for Top-$k$ aggregation (TopKAllReduce) has a cost that is linear in the number of workers (i.e., the communication complexity is $O(kP)$, where $P$ is the number of workers), so it scales poorly to large clusters. To this end, we propose a global Top-$k$ (gTop-$k$) sparsification approach for S-SGD, which is communication-efficient. The gradient aggregation algorithm based on gTop-$k$, named gTopKAllReduce, only requires a communication complexity of $O(k\log P)$, which is much lower than that of TopKAllReduce. Experimental studies on various deep neural networks, including convolutional neural networks and recurrent neural networks (LSTM), verify that gTop-$k$ S-SGD has little impact on model convergence (the convergence curves are similar to those of S-SGD with dense gradients). Experiments conducted on a 32-GPU cluster interconnected with 1 Gbps Ethernet show that our proposed gTop-$k$ S-SGD has much higher scaling efficiency than S-SGD and Top-$k$ S-SGD.
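One way to avoid gathering every worker's selection is a pairwise tree reduction in which each merge keeps only k entries, so the number of merge rounds grows with the logarithm of the number of workers. The sketch below is a simplified single-process simulation under that assumption; it is not the gTopKAllReduce implementation itself, and the function names and toy gradients are hypothetical.

```python
import heapq


def merge_topk(a, b, k):
    """Sum two sparse gradients (index -> value) and keep the k largest |v|."""
    merged = dict(a)
    for i, v in b.items():
        merged[i] = merged.get(i, 0.0) + v
    keep = heapq.nlargest(k, merged, key=lambda i: abs(merged[i]))
    return {i: merged[i] for i in sorted(keep)}


def tree_global_topk(worker_grads, k):
    """Reduce sparse gradients pairwise; returns (result, number of rounds)."""
    level = list(worker_grads)
    rounds = 0
    while len(level) > 1:
        nxt = [merge_topk(level[i], level[i + 1], k)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an unpaired worker moves up unchanged
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds


workers = [
    {0: 1.0, 3: -0.5},
    {0: 0.5, 2: 2.0},
    {1: -4.0, 3: -0.25},
    {2: 1.0, 4: 0.125},
]
result, rounds = tree_global_topk(workers, k=2)
print(result)   # {1: -4.0, 2: 3.0}
print(rounds)   # 2 rounds for 4 workers
```

Because each intermediate merge discards entries, the result is in general an approximation of the k largest entries of the full sum; in this toy example the two coincide.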
In future work, we would like to integrate gTop-$k$ sparsification with quantization to achieve higher compression ratios, and we will also study theoretical convergence guarantees for gTop-$k$ sparsification on convex and non-convex optimization problems.
We would like to thank MassGrid.com for providing the GPU cluster used in our experiments.
-  J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., “Large scale distributed deep networks,” in Advances in neural information processing systems, 2012, pp. 1223–1231.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
-  S. Shi, W. Qiang, and X. Chu, “Performance modeling and evaluation of distributed deep learning frameworks on GPUs,” in The Fourth IEEE International Conference on Big Data Intelligence and Computing (DataCom 2018). IEEE, 2018.
-  S. Shi, X. Chu, and B. Li, “A DAG model of synchronous stochastic gradient descent in distributed deep learning,” in Parallel and Distributed Systems (ICPADS), 2018 IEEE 24th International Conference on. IEEE, 2018.
-  D. Das, S. Avancha, D. Mudigere, K. Vaidynathan, S. Sridharan, D. Kalamkar, B. Kaul, and P. Dubey, “Distributed deep learning using synchronous stochastic gradient descent,” arXiv preprint arXiv:1602.06709, 2016.
-  P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
-  W. Wang and N. Srebro, “Stochastic nonconvex optimization with large minibatches,” arXiv preprint arXiv:1709.08728, 2017.
-  Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer, “ImageNet training in minutes,” in Proceedings of the 47th International Conference on Parallel Processing. ACM, 2018, p. 1.
-  X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu et al., “Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes,” arXiv preprint arXiv:1807.11205, 2018.
-  I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
-  C.-Y. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrishnan, “Adacomp: Adaptive residual gradient compression for data-parallel distributed training,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” in International Conference on Learning Representations, 2018.
-  J. Wu, W. Huang, J. Huang, and T. Zhang, “Error compensated quantized SGD and its applications to large-scale distributed optimization,” International Conference on Machine Learning, 2018.
-  J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: compressed optimisation for non-convex problems,” arXiv preprint arXiv:1802.04434, 2018.
-  D. Alistarh, T. Hoefler, M. Johansson, S. Khirirat, N. Konstantinov, and C. Renggli, “The convergence of sparsified gradient methods,” arXiv preprint arXiv:1809.10505, 2018.
-  S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with memory,” arXiv preprint arXiv:1809.07599, 2018.
-  A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 440–445.
-  P. Jiang and G. Agrawal, “A linear speedup analysis of distributed deep learning with sparse and quantized communication,” in Advances in Neural Information Processing Systems, 2018, pp. 2530–2541.
-  E. Chan, M. Heimlich, A. Purkayastha, and R. Van De Geijn, “Collective communication: theory, practice, and experience,” Concurrency and Computation: Practice and Experience, vol. 19, no. 13, pp. 1749–1783, 2007.
-  M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server.” in OSDI, vol. 14, 2014, pp. 583–598.
-  C. Renggli, D. Alistarh, and T. Hoefler, “SparCML: High-performance sparse communication for machine learning,” arXiv preprint arXiv:1802.08021, 2018.
-  J. Fang, H. Fu, G. Yang, and C.-J. Hsieh, “RedSync: Reducing synchronization traffic for distributed deep learning,” arXiv preprint arXiv:1808.04357, 2018.
-  A. Sergeev and M. Del Balso, “Horovod: fast and easy distributed deep learning in TensorFlow,” arXiv preprint arXiv:1802.05799, 2018.
-  R. Hecht-Nielsen, “Theory of the backpropagation neural network,” in Neural networks for perception. Elsevier, 1992, pp. 65–93.
-  A. Gibiansky, “Bringing HPC techniques to deep learning,” URL http://research.baidu.com/bringing-hpc-techniques-deep-learning, 2017.
-  T. Hoefler, W. Gropp, R. Thakur, and J. L. Träff, “Toward performance models of MPI implementations for understanding application scaling issues,” in European MPI Users’ Group Meeting. Springer, 2010, pp. 21–30.
-  R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of collective communication operations in MPICH,” The International Journal of High Performance Computing Applications, vol. 19, no. 1, pp. 49–66, 2005.
-  S. Sarvotham, R. Riedi, and R. Baraniuk, “Connection-level analysis and modeling of network traffic,” in Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement. ACM, 2001, pp. 99–103.
-  J. Pješivac-Grbović, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel, and J. J. Dongarra, “Performance analysis of MPI collective operations,” Cluster Computing, vol. 10, no. 2, pp. 127–143, 2007.
-  A. Krizhevsky, V. Nair, and G. Hinton, “Cifar-10 (Canadian Institute for Advanced Research),” URL http://www.cs.toronto.edu/~kriz/cifar.html, 2010.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
-  M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, “Building a large annotated corpus of English: The Penn Treebank,” Computational linguistics, vol. 19, no. 2, pp. 313–330, 1993.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” arXiv preprint arXiv:1404.5997, 2014.
-  H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and E. P. Xing, “Poseidon: an efficient communication architecture for distributed deep learning on GPU clusters,” in Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference. USENIX Association, 2017, pp. 181–193.
-  S. Shi, X. Chu, and B. Li, “MG-WFBP: Efficient data communication for distributed synchronous SGD algorithms,” in INFOCOM 2019-IEEE Conference on Computer Communications. IEEE, 2019.
-  A. Shanbhag, H. Pirk, and S. Madden, “Efficient Top-K query processing on massively parallel hardware,” in Proceedings of the 2018 International Conference on Management of Data. ACM, 2018, pp. 1557–1570.
-  S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in International Conference on Machine Learning, 2015, pp. 1737–1746.
-  P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaev, G. Venkatesh et al., “Mixed precision training,” in International Conference on Learning Representations, 2018.
-  A. Svyatkovskiy, J. Kates-Harbeck, and W. Tang, “Training distributed deep recurrent neural networks with mixed precision on GPU clusters,” in Proceedings of the Machine Learning on HPC Environments. ACM, 2017, p. 10.
-  W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in neural information processing systems, 2017, pp. 1509–1519.
-  F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
-  N. Dryden, T. Moon, S. A. Jacobs, and B. Van Essen, “Communication quantization for data-parallel training of deep neural networks,” in Machine Learning in HPC Environments (MLHPC), Workshop on. IEEE, 2016, pp. 1–8.
-  F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, “Sparse binary compression: Towards distributed deep learning with minimal communication,” arXiv preprint arXiv:1805.08768, 2018.