A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks

01/14/2019 · Shaohuai Shi et al., Hong Kong Baptist University

Distributed synchronous stochastic gradient descent (S-SGD) with data parallelism requires very high communication bandwidth between computational workers (e.g., GPUs) to exchange gradients iteratively. Recently, Top-k sparsification techniques have been proposed to reduce the volume of data to be exchanged among workers and thus alleviate the network pressure. Top-k sparsification can zero out a significant portion of gradients without impacting the model convergence. However, the sparse gradients must be transferred together with their indices, and the irregular indices make sparse gradient aggregation difficult. Current methods that use AllGather to accumulate the sparse gradients have a communication complexity of O(kP), where P is the number of workers, which is inefficient on low-bandwidth networks with a large number of workers. We observe that not all Top-k gradients from P workers are needed for the model update, and therefore we propose a novel global Top-k (gTop-k) sparsification mechanism to address the difficulty of aggregating sparse gradients. Specifically, we choose the global Top-k largest absolute values of gradients from P workers, instead of accumulating all local Top-k gradients, to update the model in each iteration. The gradient aggregation method based on gTop-k sparsification, namely gTopKAllReduce, reduces the communication complexity from O(kP) to O(k log_2 P). Through extensive experiments on different DNNs, we verify that gTop-k S-SGD has nearly consistent convergence performance with S-SGD. We evaluate the training efficiency of gTop-k on a cluster with 32 GPU machines which are inter-connected with 1 Gbps Ethernet. The experimental results show that our method achieves up to 2.7-12× higher scaling efficiency than S-SGD with dense gradients, and 1.1-1.7× improvement over the existing Top-k S-SGD.


I Introduction

With the increase of training data volume and the growing complexity of deep neural networks (DNNs), distributed computing environments (such as GPU clusters) are widely adopted to accelerate the training of DNNs. The data-parallel synchronous stochastic gradient descent (S-SGD) method is one of the commonly used optimizers to minimize the objective function of large-scale DNNs [1][2]. Compared to SGD on a single worker, S-SGD distributes the workloads to multiple workers to accelerate the training, but it also introduces the communication overhead of exchanging model parameters or gradients in each iteration. Assume that there are P workers training a single DNN model with S-SGD. In every iteration, all workers take different mini-batches of data to calculate the model gradients in parallel. Then they need to average the gradients before updating the model parameters, which involves significant data communication [3]. Because the computing power of computational units (e.g., GPUs and Google TPUs) grows much faster than the network speed, network communication performance has become the training bottleneck when the communication-to-computation ratio is high [4]. Many large IT companies use expensive high-speed networks such as 40/100 Gbps InfiniBand (IB) or Ethernet to alleviate the communication pressure, but many researchers and small companies can only use consumer-level GPUs connected by low-bandwidth networks such as 1 Gbps Ethernet.

To conquer the communication challenge, one can either increase the workload of each worker by choosing a large batch size, or reduce the required data communication in each iteration. Very recently, many large-batch SGD techniques have been proposed with sophisticated optimization strategies [5][6][7][8][9] to increase the scaling efficiency without losing model accuracy. On the other hand, gradient sparsification, quantization and compression methods [10][11][12][13][14][15][16] have been proposed to dramatically reduce the size of exchanged gradients without affecting the convergence rate. Among the model/gradient size reduction techniques, Top-k sparsification is one of the key approaches [17][12][18]: it can sparsify the gradients to approximately 0.001 density (99.9% of the gradients are zeroed out and there is no need to transfer these zero values) [11][12].

Aggregation Algorithm | Complexity | Time Cost
DenseAllReduce | O(P + m) | 2(P-1)α + 2((P-1)/P) m β
TopKAllReduce | O(kP) | log_2(P) α + 2k(P-1) β
Ours (gTopKAllReduce) | O(k log_2 P) | 2 log_2(P) α + 4k log_2(P) β
  • Note: k = ρm. α and β are machine dependent and constant on a specific machine.

TABLE I: Communication complexity of gradient aggregation algorithms

Top-k sparsification has been a key gradient compression method with empirical and theoretical studies in [17][12][16], in which researchers have verified that only a small number of gradients need to be averaged during the gradient aggregation phase without impairing model convergence or accuracy. However, the sparsified gradients are generally associated with irregular indices, which makes it a challenge to efficiently accumulate the selected gradients from all workers (worker and GPU are interchangeable in this paper). The ring-based AllReduce method used on dense gradients (DenseAllReduce) has an O(P + m) communication complexity [19], where P is the number of workers and m is the size of the parameters (or gradients). In Top-k sparsification, assume that the density of the gradients is ρ on each worker, so that k = ρm values are selected; since the corresponding indices of the non-zero values are irregular across workers and iterations, it generally needs to transfer 2k values (gradients and indices) in each iteration. Note that the gradient sparsification method is not suitable for the parameter server (PS) based [20] S-SGD because the workers should pull the whole model without any compression/sparsification in every iteration, whilst decentralized S-SGD with AllReduce is better suited to gradient compression. However, with sparse gradients, the DenseAllReduce method cannot be directly used to accumulate all the sparse gradients with irregular indices, and recent solutions use the AllGather collective [21][22], which is inefficient even if k ≪ m. The AllGather collective has an O(kP) communication complexity [21]. We use TopKAllReduce to denote the method of averaging irregularly indexed Top-k gradients by adopting AllGather. When scaling to a large number of workers (i.e., P is large), even high sparsification ratios still generate significant communication overhead.

In fact, the main idea of Top-k sparsification is based on the fact that gradients with larger absolute values contribute more to the model convergence. We observe that one can further select the Top-k gradients from the accumulated results of the P groups of Top-k values generated by the P workers. In other words, even though P workers can generate up to kP non-zero gradients for the model update, we can pick only the Top-k gradients (in terms of absolute values) for the model update in each iteration. Based on this observation, we propose an efficient sparsification method to tackle the difficulty of TopKAllReduce without affecting the model convergence. Specifically, instead of accumulating the irregularly indexed non-zero gradients from all workers, we choose the global Top-k (gTop-k) gradients in terms of absolute values. gTop-k (in this paper, we mainly discuss decentralized S-SGD with AllReduce to apply gTop-k sparsification, but it is also applicable to PS-based distributed SGD) can elegantly make use of a tree structure to select the global Top-k values from all workers, which we call gTopKAllReduce, such that the communication complexity is reduced from O(kP) to O(k log_2 P). We summarize the communication complexities of different gradient aggregation solutions in Table I.

In this paper, we first implement the gTopKAllReduce algorithm, which provides a much more efficient global Top-k sparse gradient summation from distributed workers. Then we integrate our proposed gTopKAllReduce into gTop-k S-SGD under PyTorch (https://pytorch.org/), which is one of the most popular deep learning frameworks, and MPI (https://www.open-mpi.org/). On a 32-node GPU cluster connected by 1 Gbps Ethernet, gTop-k S-SGD achieves 2.7-12.8× speedup over S-SGD with the highly optimized libraries Horovod [23] and NCCL (https://developer.nvidia.com/nccl). Compared to Top-k S-SGD, gTop-k S-SGD is up to 1.7× faster in the evaluated experiments on various DNNs and datasets. Our contributions are summarized as follows:

  • We observe that the accumulated results of Top-k sparsification can be further sparsified before being applied to the model update.

  • We propose an efficient global Top-k sparsification algorithm for distributed SGD, called gTop-k S-SGD, to accelerate distributed training of deep neural networks without losing model convergence and accuracy.

  • We implement the proposed gTop-k S-SGD atop the popular framework PyTorch and MPI, and we also release all our experimental parameters for reproducibility.

  • gTop-k S-SGD achieves significant improvements on real-world applications with various DNNs and datasets under low-bandwidth networks (e.g., 1 Gbps Ethernet).

The rest of the paper is organized as follows. We introduce the preliminaries in Section II, in which some background information and the main problem are clarified. In Section III, we present our observation from Top-k sparsification and propose an efficient gTop-k S-SGD algorithm. Then we demonstrate the detailed experimental study in Section IV. Section V introduces the related work, and finally we conclude the paper in Section VI.

II Preliminaries

In this section, we briefly introduce the background knowledge of training DNNs and the distributed SGD used for large-scale models. We also illustrate the current Top-k sparsification technique for compressing gradients in distributed SGD. For ease of presentation, some frequently used notations are summarized in Table II.

Notation | Description
P | The number of workers in the cluster.
M | The size of a message in bytes.
α | Latency (startup time) of the network between two workers.
β | Transmission time per byte between two nodes.
ρ | Density of the gradients for aggregation.
m | The size of the gradients to aggregate, and k = ρm.
t_iter | Time of an iteration.
t_f | Time of the forward pass in each iteration.
t_b | Time of the backward propagation in each iteration.
t_u | Time of the model update in each iteration.
t_c | Time of the communication cost in each iteration.
TABLE II: Frequently used notations

II-A DNNs

Deep neural networks (DNNs) are generally stacked with many hierarchical layers, and each layer is a transformer function of its input values. We can formulate a DNN as Eq. 1:

a^(l) = f(W^(l), a^(l-1)), l = 1, 2, ..., L,   (1)

where a^(l-1) and a^(l) are the input and output of layer l (l = 1, 2, ..., L for an L-layer DNN) respectively. Inputs of the current layer are the outputs of its previous layer(s). The function f is the transformer function, which consists of an operation (e.g., inner product or convolution) and an activation function (e.g., ReLU). W = {W^(1), W^(2), ..., W^(L)} are the trainable model parameters, which are iteratively updated during model training using mini-batch stochastic gradient descent (SGD) optimizers and the backpropagation algorithm [24].
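As a concrete and purely illustrative instance of Eq. 1, the following PyTorch sketch stacks a few layers whose forward pass composes the per-layer transformer functions; the layer sizes are arbitrary placeholders, not from the paper.

```python
import torch
import torch.nn as nn

# A toy L-layer DNN: each layer applies an operation (inner product) followed by
# an activation (ReLU), matching the form a^(l) = f(W^(l), a^(l-1)) of Eq. 1.
class ToyDNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(784, 256),  # W^(1)
            nn.Linear(256, 128),  # W^(2)
            nn.Linear(128, 10),   # W^(3)
        ])

    def forward(self, x):
        a = x
        for layer in self.layers[:-1]:
            a = torch.relu(layer(a))   # a^(l) = ReLU(W^(l) a^(l-1) + b^(l))
        return self.layers[-1](a)      # the last layer produces the predictions
```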

II-B Mini-batch SGD

There is an objective function L(W, D) that measures the differences between the prediction values and the ground truth of a DNN. The mini-batch SGD optimizer updates the parameters iteratively to minimize the objective function. To be specific, there are three phases in each iteration during training: 1) Feed-forward phase: a mini-batch of data (X_t) is read as input of the DNN and is fed forward across the neural network from the first layer to the last layer, which finally generates the prediction values used to evaluate the objective function L. 2) Backward-propagation phase: the gradients w.r.t. the parameters and the inputs are calculated from the last layer to the first layer. 3) Update phase: the parameters are updated with the afore-generated gradients using the following formula:

W_{t+1} = W_t - η ∇L(W_t, X_t),   (2)

where η is the learning rate. For single-worker training, phases 1) and 2) are the main time costs of an iteration, and they are compute-intensive tasks. So the average time of one iteration can be approximated by t_iter = t_f + t_b + t_u.
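For reference, a minimal single-worker PyTorch iteration showing the three phases and the update of Eq. 2 is sketched below; `model`, `loss_fn`, and `loader` are placeholder names rather than objects defined in the paper.

```python
import torch

def train_one_iteration(model, loss_fn, loader, lr=0.01):
    x, y = next(iter(loader))       # read a mini-batch X_t
    # 1) Feed-forward phase (t_f): compute predictions and the objective L.
    loss = loss_fn(model(x), y)
    # 2) Backward-propagation phase (t_b): compute gradients dL/dW.
    model.zero_grad()
    loss.backward()
    # 3) Update phase (t_u): W_{t+1} = W_t - eta * gradient (Eq. 2).
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad
    return loss.item()
```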

II-C Synchronous SGD

Synchronous SGD (S-SGD) with data parallelism is widely applied to train models with multiple workers (say P workers, indexed by p). Each worker keeps a consistent model, takes a different mini-batch of data X_t^p, feeds it forward through phase 1), and then follows phase 2) to calculate the gradients in parallel. Since the data read by different workers are not the same, the generated gradients are inconsistent in each iteration; therefore, to keep the update equivalent to mini-batch SGD, the gradients from different workers should be averaged before updating the model. The update formula of the parameters is rewritten as

W_{t+1} = W_t - η (1/P) Σ_{p=1}^{P} G_t^p,   (3)

where G_t^p = ∇L(W_t, X_t^p) denotes the gradients of worker p at the t-th iteration. The gradients are located on different workers without shared memory, so the averaging operation of the gradients involves communication costs, which generally becomes the system bottleneck. The average iteration time of S-SGD can be approximated by t_iter = t_f + t_b + t_u + t_c. Assume that we use weak-scaling on P workers with S-SGD; the scaling efficiency can be approximated by

e = (t_f + t_b + t_u) / (t_f + t_b + t_u + t_c).   (4)

t_c is generally related to the number of workers P and the model/gradient size m. Therefore, with larger P, it is crucial to reduce t_c to achieve a lower iteration time and thus higher scaling efficiency.
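A hedged sketch of the update in Eq. 3 using torch.distributed is shown below; this is one common way to implement dense S-SGD rather than the authors' released code.

```python
import torch
import torch.distributed as dist

def ssgd_step(model, loss_fn, x, y, lr):
    """One S-SGD iteration on worker p: local forward/backward, then
    average the gradients over all P workers before the update (Eq. 3)."""
    P = dist.get_world_size()
    loss = loss_fn(model(x), y)     # forward on the local mini-batch X_t^p
    model.zero_grad()
    loss.backward()                 # local gradients G_t^p
    with torch.no_grad():
        for param in model.parameters():
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum over workers
            param.grad.div_(P)                                 # average
            param -= lr * param.grad                           # W_{t+1} = W_t - eta * avg grad
    return loss.item()
```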

II-D DenseAllReduce

In Eq. 3, the summation of the gradients (i.e., Σ_{p=1}^{P} G_t^p) can be directly implemented by an AllReduce collective [25][23], which we denote as DenseAllReduce. The ring-based AllReduce algorithm [25] (which is also included in NCCL) is an efficient implementation on dense-GPU clusters. To understand the time cost of DenseAllReduce, we revisit the time model of the ring-based AllReduce. According to [26][27], the time cost of the ring-based AllReduce can be represented by

T_DAR = 2(P-1)α + 2((P-1)/P) m β,   (5)

where α is the latency (startup time) of a message transfer between two nodes, and β is the transmission time per byte between two nodes, using the alpha-beta communication model [28].
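The model in Eq. 5 is easy to evaluate numerically; the helper below is a small illustration, with α (seconds) and β (seconds per byte) as assumed example constants rather than measured values from the paper.

```python
def dense_allreduce_time(P, m_bytes, alpha, beta):
    """Ring-based AllReduce cost model of Eq. 5:
    T_DAR = 2(P-1)*alpha + 2*((P-1)/P)*m*beta."""
    return 2 * (P - 1) * alpha + 2 * (P - 1) / P * m_bytes * beta

# Example: 32 workers, a ~100 MB gradient message, and illustrative
# 1GbE-like constants (latency in the tens of microseconds, ~8e-9 s/byte).
print(dense_allreduce_time(P=32, m_bytes=100 * 2**20, alpha=5e-5, beta=8e-9))
```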

Input: The dataset: D
The initialized weights: W
The mini-batch size per worker: b
The number of workers: P
The number of iterations to train: T
The number of gradients to select: k

1:G_0^p = 0;
2:for t = 1 → T do
3:     Sample a mini-batch of data X_t^p from D;
4:     G_t^p = G_{t-1}^p + ∇L(W_t, X_t^p);
5:     Select threshold thr = the k-th largest value of |G_t^p|;
6:     Mask = |G_t^p| > thr;
7:     G̃_t^p = G_t^p ⊙ Mask; // G̃_t^p has k non-zero values
8:     G_t^p = G_t^p ⊙ ¬Mask; // Store the residuals
9:     G_t = TopKAllReduce(G̃_t^p); // G_t = (1/P) Σ_{p=1}^{P} G̃_t^p
10:     W_{t+1} = W_t - η G_t;
11:end for
12:procedure TopKAllReduce(G̃_t^p)
13:     V^p, I^p = the non-zero values of G̃_t^p and their indices;
14:     [V, I] = AllGather([V^p, I^p]);
15:     G_t = 0;
16:     for q = 0 → P-1 do
17:          G_t[I^q] += V^q;
18:     end for
19:     G_t = G_t / P;
20:     Return G_t;
21:end procedure
Algorithm 1 S-SGD with Top-k sparsification on worker p [12][21][22]

II-E Top-k sparsification

From Eq. 5, it is noted that as m or P becomes large, the communication cost increases accordingly. To reduce the size of transferred messages, Top-k sparsification [12] has been proposed to introduce very high sparsity into the gradients. Using Top-k sparsification, each worker only needs to contribute the k largest absolute values of its gradients to be summed up in each iteration, and the zeroed-out values of the gradients are stored locally and accumulated for the next iteration. Both theoretical and empirical studies have verified that Top-k sparsification has little impact on the model convergence and accuracy [17][12][16]. The pseudo-code of Top-k sparsification S-SGD is shown in Algorithm 1. Note that at Line 9 of Algorithm 1, the implementation of TopKAllReduce is completely different from DenseAllReduce for efficiency reasons, since the non-zero values of G̃_t^p may come from inconsistent indices on different workers. Efficient implementations of such a sparse AllReduce are non-trivial. Current methods [21][22] use AllGather to implement TopKAllReduce, in which the sparsified gradients are gathered as a dense vector combined with the corresponding indices, say [V^p, I^p]. The sizes of V^p and I^p are both k. According to the communication model of AllGather [19], the time cost for all-gathering the 2k values per worker is

T_TAR = log_2(P) α + 2k(P-1) β.   (6)

From Eq. 6, we can see that T_TAR increases linearly with P. Therefore, Top-k sparsification also has difficulty scaling to large clusters on low-bandwidth networks. In this paper, we propose a global Top-k (gTop-k) sparsification approach to address this problem.
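For concreteness, here is a compact PyTorch + mpi4py sketch of the per-worker steps of Algorithm 1 (local Top-k selection with residual accumulation, followed by AllGather-based aggregation). It is a simplified illustration under the assumptions of a flat 1-D gradient tensor and exactly k contributed values per worker; it is not the authors' implementation.

```python
import numpy as np
import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD

def topk_sparsify(grad, residual, k):
    """Local Top-k selection with residual accumulation (Lines 4-8 of Algorithm 1)."""
    acc = grad + residual
    _, idx = torch.topk(acc.abs(), k)   # indices of the k largest absolute values
    values = acc[idx]
    new_residual = acc.clone()
    new_residual[idx] = 0.0             # zeroed-out gradients stay as local residuals
    return values, idx, new_residual

def topk_allreduce(values, idx, m):
    """AllGather-based aggregation (TopKAllReduce): O(kP) transferred values."""
    all_vals = comm.allgather(values.cpu().numpy())
    all_idx = comm.allgather(idx.cpu().numpy())
    dense = np.zeros(m, dtype=np.float32)
    for v, i in zip(all_vals, all_idx):
        dense[i] += v                    # accumulate each worker's sparse contribution
    return torch.from_numpy(dense / comm.Get_size())
```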

III Methodology

In this section, we first demonstrate some observations from Top-k sparsification S-SGD, and then we present our proposed global Top-k sparsification algorithm. For ease of presentation, we assume that the number of workers P is a power of 2.

III-A Observations from Top-k sparsification

In the previous section, we introduced Top-k sparsification S-SGD, in which k values are selected on each local worker and then accumulated across all the workers. We look into the distribution of the non-zero values of G_t, which is generated by the summation of the sparse gradients from all workers. We found that not all non-zero values of G_t (whose number of elements is at most kP) contribute to the model convergence. Specifically, G_t can be further sparsified such that only a smaller number of non-zero gradients are needed for the model update. In other words, one can further select the Top-k largest absolute values, G̃_t, from G_t to update the model while maintaining the model convergence. In this scenario, selecting G̃_t from G_t leaves some afore-summed gradients that are neither applied to the model nor stored in the local residuals, which could eventually damage the model convergence. Therefore, if only k elements of G_t are selected to update the model, the remaining elements should be put back as residuals with their corresponding indices so that they can be accumulated locally and contribute to model updates in future iterations. Keeping these remaining elements as residuals is important to ensure the convergence of gTop-k.

Input: The dataset: D
The initialized weights: W
The mini-batch size per worker: b
The number of workers: P
The number of iterations to train: T
The number of gradients to select: k

1:G_0^p = 0;
2:for t = 1 → T do
3:     Sample a mini-batch of data X_t^p from D;
4:     G_t^p = G_{t-1}^p + ∇L(W_t, X_t^p);
5:     Select threshold thr = the k-th largest value of |G_t^p|;
6:     Mask = |G_t^p| > thr;
7:     G̃_t^p = G_t^p ⊙ Mask; // G̃_t^p has k non-zero values
8:     G_t^p = G_t^p ⊙ ¬Mask; // Store the residuals
9:     G_t = SparseAllReduce(G̃_t^p); // G_t = Σ_{p=1}^{P} G̃_t^p
10:     // At this time all workers have a consistent G_t
11:     Select global threshold thr_g = the k-th largest value of |G_t|;
12:     Mask_g = |G_t| > thr_g;
13:     G̃_t = G_t ⊙ Mask_g;
14:     G_t^p = G_t^p + G_t ⊙ ¬Mask_g; // Store extra residuals
15:     W_{t+1} = W_t - η G̃_t;
16:end for
Algorithm 2 Naive version of S-SGD with gTop-k on worker p

III-B The key idea of gTop-k

According to the above observations, we only need the k largest absolute values of the summation of all the sparsified gradients G̃_t^p (p = 1, 2, ..., P), where k ≪ m. Therefore, the problem is formulated as a global Top-k (gTop-k) selection on G_t = Σ_{p=1}^{P} G̃_t^p instead of TopKAllReduce, while the G̃_t^p are located on distributed workers. We again let [V^p, I^p] denote the non-zero values and corresponding indices of G̃_t^p, whose number of non-zero values is k. We first use the AllGather version to illustrate the key idea of gTop-k sparsification, and then we present our new efficient algorithm for gTop-k sparsification. At Line 9 of Algorithm 1, G_t has up to kP non-zero values contributing updates to W. Different from Top-k sparsification, we further sparsify G_t by selecting the k largest absolute values from G_t. The straightforward implementation of gTop-k is shown in Algorithm 2. Please note that this version is only used to illustrate the key idea of how to select the gradients that update the model; the efficient algorithm is presented in the next subsection. An example of gTop-k sparsification using AllGather on P workers is shown in Fig. 1.

Fig. 1: An example of gTop-k using AllGather on P workers.

III-C gTopKAllReduce: An efficient AllReduce algorithm for gTop-k sparsification

From Eq. 6, we can see that the AllGather collective is inefficient for conducting the AllReduce operation on irregularly indexed gradients. For the same density, the main purpose of our proposed efficient algorithm is to eliminate the high impact of the variable P on the time cost. For ease of presentation, we first define a Top-k operation, ⊕, on two sparse vectors, say G^a and G^b, both of which have k non-zero values.

Definition 1

A Top-k operation ⊕: G^{a,b} = G^a ⊕ G^b, where G^{a,b} = Mask ⊙ (G^a + G^b), Mask = |G^a + G^b| > thr, and thr is the k-th largest value of |G^a + G^b|.

Note that the number of non-zero values of G^{a,b} is also k. During the training of S-SGD, G^a and G^b are located on different workers without shared memory. One should exchange the two sparse vectors to achieve the global Top-k sparse vector: G^{a,b} = G^a ⊕ G^b. The ⊕ operation for two distributed workers is shown in Fig. 2, which demonstrates that ⊕ can be efficiently implemented by a send operation (network communication), followed by a local Top-k selection on a maximum of 2k non-zero values.

Fig. 2: An implementation of ⊕ for two distributed sparse vectors G^a and G^b. The second worker sends its non-zero elements V^b combined with their indices I^b to the first worker; the first worker then has the index information needed to add the received values, i.e., G^a + G^b, and it easily computes G^{a,b} = G^a ⊕ G^b according to Definition 1.
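The ⊕ operation of Definition 1 can be sketched on (value, index) pairs as below; this assumes both inputs carry k entries with unique indices, and it mirrors what the receiving worker in Fig. 2 computes locally after one Recv.

```python
import numpy as np

def local_topk_merge(vals_a, idx_a, vals_b, idx_b, k):
    """Compute G^{a,b} = G^a (+) G^b: add the two sparse vectors on the union of
    their indices (at most 2k candidates), then keep the k largest absolute values."""
    merged = {}
    for v, i in zip(vals_a, idx_a):
        merged[int(i)] = merged.get(int(i), 0.0) + float(v)
    for v, i in zip(vals_b, idx_b):
        merged[int(i)] = merged.get(int(i), 0.0) + float(v)
    idx = np.array(list(merged.keys()))
    vals = np.array(list(merged.values()))
    keep = np.argsort(np.abs(vals))[-k:]   # Top-k of the merged candidates
    return vals[keep], idx[keep]
```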

When scaling to P workers (assume that P is a power of 2), since the final G̃_t is equal to the pairwise ⊕ reduction of all local sparse vectors, we use a recursive doubling technique to reduce the total transfer size. To show this recursive doubling procedure used for gTop-k, Fig. 3 gives an example of selecting the global Top-k values. There are log_2(P) rounds of communications for P workers. At the i-th round, there are P/2^i pairs of workers performing the ⊕ operation of Fig. 2 in parallel. After log_2(P) rounds, the first worker (rank 0) finally holds the global Top-k values (i.e., G̃_t).

Fig. 3: An example of gTop-k for P distributed sparse vectors. It only requires log_2(P) rounds of network communications to select the global Top-k.

Input: The sparsified gradients: G̃_t^p
The number of non-zero elements: k
The number of workers: P
The rank of the worker: p

1:sends = [V^p, I^p]; // the non-zero values of G̃_t^p and their indices
2:Initialize recvs with the same size as sends;
3:localG = G̃_t^p;
4:for i = 1 → log_2(P) do
5:     step = 2^{i-1};
6:     if p mod step == 0 then // worker p is still active in round i
7:          peer = p XOR step;
8:          if p mod 2^i == 0 then // p acts as a receiver
9:               source = peer;
10:               Recv(recvs, source=source);
11:               localG = localG ⊕ recvs; sends = the non-zero values and indices of localG;
12:          else // p acts as a sender
13:               dest = peer;
14:               Send(sends, dest=dest);
15:          end if
16:     end if
17:     Barrier();
18:end for
19:Bcast(G̃_t = localG, root=0);
20:V_g, I_g = the non-zero values of G̃_t and their indices;
21:Mask_g = zeros with the same number of elements as G̃_t^p;
22:Mask_g[I_g] = 1;
23:Return G̃_t, Mask_g;
Algorithm 3 gTopKAllReduce

According to the illustration of the recursive doubling procedure for gTop-k, we propose the gTop-k based AllReduce, called gTopKAllReduce for short. The complete procedure is shown in Algorithm 3. Line 1 selects the non-zero values of the sparse G̃_t^p and assigns them to the variable "sends", which should be sent to other workers. Line 2 allocates the buffer "recvs" to receive the "sends" from another worker at each communication round. Lines 4-18 are the procedure of the recursive ⊕ operations, whose result is stored in "localG" (with its non-zero values and indices in "sends") for the next round of communication. The functions "Recv" and "Send" in Lines 10 and 14 are a pair of operations and can be implemented by the corresponding MPI interfaces. Since the result so far is only stored on the first worker (rank 0), Line 19 broadcasts G̃_t to all the other workers, which requires additional rounds of communications using the flat-tree algorithm [29]. Finally, Lines 20-22 record the mask Mask_g, which indicates the indices that are used in G̃_t.
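Below is a hedged mpi4py sketch of the recursive procedure behind Algorithm 3. It assumes the number of workers is a power of two, each worker starts with exactly k (value, index) pairs, and pickle-based send/recv/bcast are used for brevity; a tuned implementation would use buffer-based MPI calls.

```python
import math
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def merge_topk(va, ia, vb, ib, k):
    """The (+) operation of Definition 1 on (value, index) pairs."""
    merged = {}
    for v, i in zip(np.concatenate([va, vb]), np.concatenate([ia, ib])):
        merged[int(i)] = merged.get(int(i), 0.0) + float(v)
    idx = np.array(list(merged.keys()))
    vals = np.array(list(merged.values()))
    keep = np.argsort(np.abs(vals))[-k:]
    return vals[keep], idx[keep]

def gtopk_allreduce(vals, idx, k):
    """Select the global Top-k (value, index) pairs over all P workers.
    Pairwise merges reduce the result onto rank 0 in log2(P) rounds, and a
    broadcast then distributes it back to every worker."""
    P, rank = comm.Get_size(), comm.Get_rank()
    for i in range(int(math.log2(P))):
        step = 2 ** i
        if rank % step == 0:                       # worker still active in round i
            if rank % (2 * step) == 0:             # receiver
                rv, ri = comm.recv(source=rank + step, tag=i)
                vals, idx = merge_topk(vals, idx, rv, ri, k)
            else:                                   # sender, then becomes idle
                comm.send((vals, idx), dest=rank - step, tag=i)
    # Rank 0 now holds the global Top-k; broadcast it to all workers.
    vals, idx = comm.bcast((vals, idx) if rank == 0 else None, root=0)
    return vals, idx
```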

III-D Communication complexity analysis of gTopKAllReduce

There are two main processes in gTopKAllReduce. The first one is the calculation of G̃_t. From Fig. 3, the first worker takes part in the communication at every round, so we only need to analyze the time cost of rank 0. On the worker of rank 0, it takes log_2(P) rounds of communications to calculate G̃_t. In each communication round, rank 0 receives 2k values (gradients and indices) from another worker, which takes a time cost of α + 2kβ. Thus, the overall time cost of the first process is log_2(P)(α + 2kβ). In the second process, the global Top-k values (i.e., G̃_t) on the first worker are broadcast to all the other workers, and the broadcast operation takes log_2(P)(α + 2kβ). In summary, the time cost of gTopKAllReduce is

T_gTAR = 2 log_2(P) α + 4k log_2(P) β.   (7)

The communication complexity is therefore O(k log_2 P), which is much lower than that of TopKAllReduce, especially when P is large.
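To compare Eq. 6 and Eq. 7 numerically, one can plug in measured α and β; the helper below is illustrative only, and the constants (including 8 bytes per transferred value, i.e., a 4-byte gradient plus a 4-byte index) are assumptions rather than the paper's measurements.

```python
import math

def topk_allreduce_time(P, k, alpha, beta, bytes_per_value=8):
    """Eq. 6: AllGather-based aggregation of 2k (value, index) entries per worker."""
    return math.log2(P) * alpha + 2 * k * (P - 1) * bytes_per_value * beta

def gtopk_allreduce_time(P, k, alpha, beta, bytes_per_value=8):
    """Eq. 7: recursive merge onto rank 0 plus a broadcast, ~2*log2(P) rounds."""
    return 2 * math.log2(P) * (alpha + 2 * k * bytes_per_value * beta)

# Example: 32 workers, 25M gradients at density 0.001 (k = 25,000),
# and illustrative 1GbE-like constants.
P, k, alpha, beta = 32, 25_000, 5e-5, 8e-9
print(topk_allreduce_time(P, k, alpha, beta), gtopk_allreduce_time(P, k, alpha, beta))
```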

III-E gTop-k S-SGD with gTopKAllReduce

With the above efficient implementation of gTopKAllReduce, we improve the gTop-k S-SGD of Algorithm 2 by replacing the SparseAllReduce and global selection steps (Lines 9-13) with a single invocation of gTopKAllReduce (Algorithm 3). The improved version of the gTop-k S-SGD training algorithm is shown in Algorithm 4. Compared to Top-k S-SGD, gTop-k S-SGD only introduces an extra computation (Line 10 in Algorithm 4) whose overhead is much smaller than the communication overhead, while gTop-k S-SGD greatly reduces the communication complexity.

Input: The dataset: D
The initialized weights: W
The mini-batch size per worker: b
The number of workers: P
The number of iterations to train: T
The number of gradients to select: k

1:G_0^p = 0;
2:for t = 1 → T do
3:     Sample a mini-batch of data X_t^p from D;
4:     G_t^p = G_{t-1}^p + ∇L(W_t, X_t^p);
5:     Select threshold thr = the k-th largest value of |G_t^p|;
6:     Mask = |G_t^p| > thr;
7:     G̃_t^p = G_t^p ⊙ Mask; // G̃_t^p has k non-zero values
8:     G_t^p = G_t^p ⊙ ¬Mask; // Store the residuals
9:     G̃_t, Mask_g = gTopKAllReduce(G̃_t^p, k, P, p);
10:     G_t^p = G_t^p + G̃_t^p ⊙ ¬Mask_g; // Store extra residuals
11:     W_{t+1} = W_t - η G̃_t;
12:end for
Algorithm 4 gTopKAllReduce-based S-SGD on worker p

III-F System overview

We implement our proposed gTop-k S-SGD atop PyTorch and MPI. The sparsification (i.e., local Top-k selection) and residual operations introduce extra overheads, but they can be parallelized with the feed-forward and backward computation. Therefore, we separate the gradient-sparsification-related operations from the feed-forward and backward operations. To be specific, one thread processes the gradient sparsification and residual management for communication, while the main thread takes charge of the feed-forward/backward computation. An overview of our system architecture is shown in Fig. 4.

Fig. 4: An overview of the gTop-k based distributed training system.

Note that gradient sparsification is done on the GPU, which means that the Top-k selection is invoked on the GPU, and the handler then transfers the sparsified results to the CPU for communication. Such a design has two advantages: first, when the number of gradients is large, the GPU can be more efficient than the CPU for Top-k selection; second, because the density is generally set to a very small value, transferring only the non-zero values through PCIe can be much faster than transferring the whole gradients.
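The decoupling described above can be sketched with a background thread and a queue, as below; this is only an illustrative structure (the function names are ours, standing in for the operations of Sections III-C and III-E), not the released implementation.

```python
import queue
import threading
import torch

def sparsify_on_gpu(grad, k):
    # Placeholder for the GPU-side Top-k selection described above.
    k = min(k, grad.numel())
    _, idx = torch.topk(grad.abs().flatten(), k)
    return grad.flatten()[idx], idx

def communicate(values, idx):
    # Placeholder for gTopKAllReduce and the subsequent model update.
    pass

grad_queue = queue.Queue()

def comm_worker(k=1000):
    """Background thread: sparsification, residual handling, and communication."""
    while True:
        grad = grad_queue.get()
        if grad is None:                       # shutdown signal
            break
        values, idx = sparsify_on_gpu(grad, k)
        communicate(values.cpu(), idx.cpu())   # only non-zero values cross PCIe

threading.Thread(target=comm_worker, daemon=True).start()
# The main thread runs forward/backward and enqueues gradients, e.g.:
# grad_queue.put(param.grad.detach())
```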

IV Experimental Study

We conduct experimental evaluations to show the effectiveness of our proposed gTop-k S-SGD on real-world applications on a 32-GPU cluster. We first validate the convergence of our proposed gTop-k S-SGD, which should have nearly consistent convergence behavior with the dense version. Then we evaluate the time cost and efficiency of gTopKAllReduce and compare it with the dense AllReduce (DenseAllReduce) and the Top-k AllReduce (TopKAllReduce) for different message sizes. After that, we compare the training efficiency of the three S-SGD algorithms (i.e., S-SGD with dense gradients, Top-k S-SGD, and gTop-k S-SGD). We also break down the training process of an iteration into several time-consuming phases to analyze the extra overheads introduced by gTop-k sparsification.

IV-A Experimental setup

Hardware: The distributed environment is configured as a 32-GPU cluster with 32 machines, each of which is equipped with one Nvidia P102-100 GPU. The network between two machines is 1 Gbps Ethernet (1GbE). Details of the hardware are shown in Table III. Each machine has a low-performance configuration just like a personal desktop computer.

Hardware | Model
CPU | Intel(R) Celeron(R) CPU N3350 @ 1.10GHz
GPU | Nvidia P102-100 (3200 CUDA cores and 5GB Memory)
PCI-e | PCI-e x1 lane with a maximum bandwidth of 250 MB/s
Memory | 4GB DDR3 with a 16GB swap file
Disk | 256GB SSD
Network | 1 Gbps Ethernet (1GbE)
TABLE III: The experimental setup of hardware.

Software: All GPU machines are installed with Ubuntu-16.04, the Nvidia GPU driver at version 390.48, and CUDA-9.1. The communication libraries are OpenMPI-3.1.1 (https://www.open-mpi.org/) and NCCL-2.1.5 (https://developer.nvidia.com/nccl). We use the highly optimized distributed training library Horovod (https://github.com/uber/horovod) [23] at version 1.4.1. The deep learning framework is PyTorch at version 0.4.0 with cuDNN-7.1.

Dataset | Training samples | Validation samples | Input size
Cifar-10 | 50000 | 10000 | 32 x 32
ImageNet | 1.2 million | 10000 | 224 x 224
PTB | 923000 | 73000 | 10000
TABLE IV: Datasets for evaluation.
Model | Dataset | # of Epochs | Batch size | Learning rate
VGG-16 | Cifar-10 | 140 | 128 | 0.1
ResNet-20 | Cifar-10 | 140 | 128 | 0.1
AlexNet | ImageNet | 45 | 64 | 0.01
ResNet-50 | ImageNet | 15 | 256 | 0.01
LSTM-PTB | PTB | 40 | 100 | 1.0
  • Note: All models are trained with single-precision floating point (i.e., 32-bit).

TABLE V: Deep models for training.

DNNs: We choose various DNNs from several areas of AI applications with different datasets. The datasets include Cifar-10 [30] and ImageNet [31] for image classification and the Penn Treebank corpus (PTB) [32] for language modeling. The sizes of the evaluated datasets are shown in Table IV. For each dataset, we use one or two benchmarking deep models. For the Cifar-10 dataset, we use the VGG-16 model [33] and the ResNet-20 model [34]. For the ImageNet dataset, the AlexNet model [35] and the ResNet-50 model [34] are used. We exploit a 2-layer LSTM language model (LSTM-PTB) for the PTB dataset, which is similar to that in [12]. The details of the deep models are shown in Table V. All the models are trained using SGD with momentum.

IV-B Convergence comparison

The convergence of Top-k sparsification S-SGD has been verified to be nearly consistent with the dense version in much previous work [17][12][16], so we do not include the convergence curves of Top-k S-SGD. We compare our gTop-k S-SGD with the original S-SGD with dense gradients. It has been shown that a warmup strategy in the first several epochs helps the model converge better [12], so we adopt a similar warmup configuration. To be specific, the first four epochs use dynamic densities and smaller learning rates, and the remaining epochs adopt a fixed small density, which means we can compress the gradients into communication messages that are hundreds of times smaller from the fifth epoch onward.

Fig. 5: The convergence of the deep models on the Cifar-10 dataset.

Convergence on the Cifar-10 dataset: The convergence of the VGG-16 and ResNet-20 models on the Cifar-10 dataset is shown in Fig. 5. The results show that the convergence rate of ResNet-20 is almost the same as the baseline, while the VGG-16 model even converges slightly better than the baseline.

Fig. 6: The convergence of the deep models on the ImageNet dataset.

Convergence on the ImageNet dataset: The convergence curves of the AlexNet and ResNet-50 models on the ImageNet dataset are shown in Fig. 6. Again, the results show that the convergence rates of the two networks are close to the baselines. On the AlexNet model, the convergence of gTop-k S-SGD with a very low density is slightly worse than the baseline, which could be caused by the very low density affecting the convolutional layers while the fully connected layers hold a large proportion of the parameters. On the other hand, gTop-k sparsification works well on the ResNet-50 model, which converges slightly faster than the baseline.

Convergence on the LSTM network: The convergence of LSTM-PTB on the PTB dataset is shown in Fig. 7. It is also noted that gTop-k S-SGD converges close to S-SGD under the evaluated density.

Fig. 7: The convergence of the LSTM-PTB model on the PTB dataset.

In summary, the results on three different types of DNNs and benchmarking datasets show that our proposed gTop-k sparsification for S-SGD does not damage the model during training and keeps the model convergence very close to the dense version of S-SGD.

IV-C Communication speed

Before we demonstrate the efficiency of gTop-k S-SGD, we first evaluate the communication speed of the cluster. We test the point-to-point communication time with various message sizes, because the performance of point-to-point communication plays an important role in MPI collectives and in our gTopKAllReduce. After that, we evaluate the speeds of DenseAllReduce, TopKAllReduce, and gTopKAllReduce with different sizes of sparse vectors and different numbers of workers on the 1GbE cluster.

Point-to-point communication: We test the point-to-point communication speed using the OSU Micro-Benchmark (http://mvapich.cse.ohio-state.edu/benchmarks/) at version 5.5. The time costs of the point-to-point communication between two machines are shown in Fig. 8, where we run the benchmark multiple times to calculate the mean and standard deviation of the reported values. It can be seen that the time used for transferring a message is a linear function of the message size, which verifies the α-β model. Based on the measured data, we can estimate the values of α and β for our cluster.

Fig. 8: Data transfer time in milliseconds with respect to the message size on our experimental cluster.
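Given measured (message size, transfer time) pairs like those behind Fig. 8, α and β can be recovered with a linear least-squares fit; the measurements below are made-up placeholders used only to illustrate the procedure.

```python
import numpy as np

# Hypothetical measurements: message sizes in bytes and mean transfer times in seconds.
sizes = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
times = np.array([1.2e-4, 2.0e-4, 9.5e-4, 8.6e-3, 8.4e-2])

# Fit t = alpha + beta * size (the alpha-beta model) with linear least squares.
A = np.vstack([np.ones_like(sizes), sizes]).T
(alpha, beta), *_ = np.linalg.lstsq(A, times, rcond=None)
print(f"alpha ~= {alpha:.2e} s, beta ~= {beta:.2e} s/byte")
```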

Time performance of AllReduce operations: Since P and m are the two main factors affecting the performance of TopKAllReduce and gTopKAllReduce, we compare their time performance along these two dimensions based on the measured α and β and the time cost models in Table I. First, we compare the time cost as the number of workers increases, using the approximate model size of ResNet-50 as the message size and a fixed density. Second, on a cluster with 64 workers, we compare how the time cost changes as the message size increases. The time comparison is shown in Fig. 9. From the left of Fig. 9, when the number of workers is small, TopKAllReduce is slightly faster than gTopKAllReduce. However, when the number of workers scales up, TopKAllReduce becomes much worse than gTopKAllReduce. Furthermore, our proposed gTopKAllReduce is much more efficient than TopKAllReduce when scaling to large message sizes. To summarize, a larger number of workers or a larger message size makes gTopKAllReduce more efficient relative to TopKAllReduce.

Fig. 9: Left: Time used by the AllReduce algorithms on different numbers of workers at a fixed message size and density. Right: The time cost with respect to the message size on a cluster with 32 workers. The lower the better.
Model | Batch size | Iteration time | Throughput
VGG-16 | 128 | 0.097s | 1317 Images/s
ResNet-20 | 128 | 0.146s | 876 Images/s
AlexNet | 64 | 0.369s | 173 Images/s
ResNet-50 | 256 | 4.842s | 52 Images/s
TABLE VI: Training speed on a single P102-100 GPU.

IV-D Training efficiency

Single-GPU training speed: We first demonstrate the average training speed of one iteration on a single GPU, which is shown in Table VI. It can be seen that the computation time of each iteration ranges from tens of milliseconds to several seconds, so scaling these models on 1GbE clusters is challenging, especially for models with a large number of parameters (e.g., AlexNet), because of their high communication-to-computation ratios.

Fig. 10: Comparison of the scaling efficiency of S-SGD with dense AllReduce (DenseAllReduce), Top-k sparsification (TopKAllReduce), and gTop-k sparsification (gTopKAllReduce). The higher the better.

Scaling efficiency: After integrating gTopKAllReduce into gTop-k S-SGD, we investigate how much speedup can be achieved on low-bandwidth networks with different models and different numbers of workers. The scaling efficiencies of S-SGD with the three AllReduce algorithms are shown in Fig. 10. It can be seen that dense S-SGD has the worst scaling efficiency, because the full size of the gradients makes the communication very slow on 1GbE clusters. Top-k S-SGD achieves some improvement over S-SGD on a small number of workers, but it shows an obvious performance decrease when scaling to more GPUs. In contrast, our proposed gTop-k S-SGD achieves much more stable scaling efficiency even on clusters with a larger number of GPUs. For example, when scaling to 32 GPUs, our proposed gTop-k S-SGD is several times faster than dense S-SGD on average, and it also improves upon Top-k S-SGD on average. In particular, gTop-k S-SGD is up to 12.8× and 1.7× faster than S-SGD and Top-k S-SGD respectively on the AlexNet model. A summary of the training throughput on different models is shown in Table VII.

Model | Dense S-SGD | Top-k | gTop-k | Speedup vs. Dense | Speedup vs. Top-k
VGG-16 | 403 | 2016 | 3020 | 7.5 | 1.5
ResNet-20 | 9212 | 22272 | 25280 | 2.7 | 1.1
AlexNet | 39 | 296 | 505 | 12.8 | 1.7
ResNet-50 | 343 | 978 | 1251 | 3.65 | 1.3
  • Note: The throughput is measured in processed images per second (i.e., the unit is Images/s). "Speedup vs. Dense" indicates the speedup of gTop-k compared to the dense version, and "Speedup vs. Top-k" indicates the speedup of gTop-k compared to Top-k.

TABLE VII: The system training throughput on a 32-GPU cluster.

IV-E Time performance analysis

We analyze the time performance of gTop-k S-SGD on the 32-GPU cases. To better understand the overheads of gTop-k sparsification, we break down the time of an iteration into three parts: GPU computation time, local sparsification time, and communication time. Note that in this work, we do not consider overlapping computation and communication during backward propagation. The main reason is that for some deep models like ResNet-50, which consume a large amount of memory, the mini-batch size cannot be set too large, so the computation time per mini-batch is short. Since we still need to reduce the communication-to-computation ratio to alleviate the impact of communication, an effective method is to accumulate the gradients of several small, not-yet-applied mini-batches. In our ResNet-50 experiments, we set a smaller local mini-batch size and accumulate the gradients over several mini-batches for a single update, so the effective mini-batch size is 256. Therefore, pipelining backward propagation with communication contributes little on low-bandwidth networks. Nevertheless, gTop-k sparsification is also applicable to the wait-free backward propagation algorithm [36] and the optimal gradient merge algorithm [37].
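A minimal sketch of the gradient-accumulation trick mentioned above is given below (run several small mini-batches, sum their gradients, then apply one update); the accumulation count is a placeholder, not the paper's setting.

```python
import torch

def accumulate_and_step(model, loss_fn, micro_batches, optimizer, accum_steps=4):
    """Accumulate gradients over `accum_steps` small mini-batches, then update once.
    PyTorch keeps summing into .grad across backward() calls until zero_grad()."""
    optimizer.zero_grad()
    for step, (x, y) in enumerate(micro_batches):
        loss = loss_fn(model(x), y) / accum_steps   # scale so the sum is an average
        loss.backward()                              # gradients accumulate in p.grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()                         # one update for the effective batch
            optimizer.zero_grad()
```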

Fig. 11: Time breakdown of computation, compression and communication. "Compu." indicates forward and backward computation, "Compr." indicates the compression (sparsification) operation, and "Commu." indicates gTop-k gradient communication.

The time breakdown for the evaluated models is shown in Fig. 11. On one hand, in the time breakdowns of the VGG-16 and AlexNet models, the communication overheads are much larger than the computation, because VGG-16 and AlexNet have three fully connected layers equipped with a large number of parameters while their computation is fast. This also explains why the scaling efficiency of S-SGD for these models in Fig. 10 is low even with gTop-k sparsification. On the other hand, the communication and sparsification overheads are much smaller than the computation for ResNet-20 and ResNet-50, which indicates low communication-to-computation ratios, so the scaling efficiency remains high even on the low-bandwidth network.

Furthermore, it is noted that the time used by gradient sparsification is comparable to the computation time on the VGG-16 and AlexNet models. The main reason is that Top-k selection on the GPU is inefficient: it generally requires a sort operation over the whole gradient vector, which is non-trivial to parallelize efficiently on SIMD architectures [38]. We leave this as a future optimization direction.

IV-F Convergence sensitivity to densities

To understand the sensitivity of model convergence to the density, we run experiments with different density values using VGG-16 and ResNet-20 on the Cifar-10 dataset. The convergence curves are shown in Fig. 12. It can be seen that even a very low density does not have a big impact on the convergence of either model. However, a trade-off should be made to balance the sparsification ratio and the convergence speed. On one hand, higher sparsification brings higher scaling efficiency for a larger number of workers. On the other hand, one should also be careful about the upper bound of sparsity beyond which the model convergence would be negatively affected.

Fig. 12: Training losses of VGG-16 and ResNet-20 with different densities.

V Related Work

Gradient size reduction in communication is crucial for distributed training using synchronous SGD. Researchers have proposed quantization, sparsification, and lossless data stream compression. Gradients or weights in DNN models are normally stored as single-precision floating point numbers, which means each gradient requires 32 bits. Gupta et al. [39] propose a 16-bit wide fixed-point number representation for model parameters and gradients to improve computation and energy efficiency. To keep the model accuracy, researchers [40][41] propose mixed-precision techniques which update the model in 32-bit precision while the computation is done in 16-bit precision. Hubara et al. [10] further quantize model parameters to 4-bit precision without losing accuracy, and 2-bit [42] and even 1-bit [14][43] quantization schemes have also been proposed for minimal quantization. 1-bit quantization uses the smallest possible number of bits per value and can achieve up to a 32× smaller message size than the 32-bit counterpart. However, quantization errors still exist because some values become zeros if they exceed the numerical range that the precision can represent, even with careful design, which to some extent hurts the model accuracy. Error compensation techniques [13][44][12] have been further proposed to address this issue of quantization errors. However, even 1-bit gradients can only achieve a 32× reduction in size, which is often not enough for large models on slow networks.

In terms of gradient sparsification, which zeroes out a large proportion of gradients to reduce the communication size, Aji et al. [17] and Chen et al. [11] empirically demonstrate that the vast majority of gradients are not needed for the model update at each iteration, which indicates that the gradients can be made very sparse while the model still converges, thanks to the accumulation of gradient residuals. Aji et al. [17] use a static threshold to determine which gradients are selected, while Chen et al. [11] propose a dynamic version. Lin et al. [12] further propose several optimization tricks (including a warmup strategy, momentum correction, and gradient clipping) to address the accuracy loss introduced by dropping a large number of gradients, and they show that Top-k sparsification S-SGD can converge very close to S-SGD with dense gradients. The above techniques of quantization and sparsification can be combined to achieve a higher compression ratio of gradients with no (or very small) accuracy loss. For example, Lin et al. [12] achieve up to 270× and 600× compression ratios without loss of accuracy, while Sattler et al. [45] achieve up to 37208× compression with only 1% lower accuracy. Furthermore, after quantization and sparsification, one can use lossless compression techniques like run-length encoding to further reduce the number of transferred bytes [12].

Focusing on sparsification, the two other works [21] and [22] are the most related to ours. They have realized that efficient sparse AllReduce algorithms are non-trivial to implement, and they both propose the AllGather solution. However, the AllGather method has a cost that increases linearly with the number of workers. Therefore, AllGather can be inefficient when scaling to large clusters.

VI Conclusion and Future Work

In this work, we first show that the accumulated results of Top-k gradients can be further sparsified by choosing only the largest absolute gradients before updating the model, which has little impact on the model convergence. We then identify that Top-k sparsification is inefficient in averaging the gradients from all workers, because the indices of the Top-k gradients differ across workers, so one has to use the AllGather collective to collect all the Top-k gradients and indices. The AllGather method for Top-k aggregation (TopKAllReduce) is linearly expensive in the number of workers (i.e., the communication complexity is O(kP), where P is the number of workers), so it has very low scalability on large-scale clusters. To this end, we propose a global Top-k (gTop-k) sparsification approach for S-SGD, which is communication-efficient. The gradient aggregation algorithm based on gTop-k, named gTopKAllReduce, only requires a communication complexity of O(k log_2 P), which is much lower than that of TopKAllReduce. Experimental studies on a variety of deep neural networks, including convolutional neural networks and recurrent neural networks (LSTM), verify that gTop-k S-SGD has little impact on the model convergence (the convergence curves are similar to those of S-SGD with dense gradients). The experiments conducted on the 32-GPU cluster inter-connected with 1 Gbps Ethernet show that our proposed gTop-k S-SGD has much higher scaling efficiency than S-SGD and Top-k S-SGD.

In future work, we would like to integrate the gTop-k sparsification method with quantization to achieve higher compression ratios, and we will also study the theoretical convergence properties of gTop-k sparsification for convex and non-convex optimization problems.

VII Acknowledgements

We would like to thank MassGrid.com for their support on providing the GPU cluster for experiments.

References