## 1 Introduction

Large-scale distributed training plays an important role in deep learning to deal with large amounts of training data and models with deep architectures. An efficient distributed training algorithm aims at maximizing the convergence rate while minimizing the communication cost. Synchronous Parallel SGD (S-PSGD) is the de facto distributed learning algorithm in practice. Recently, Decentralized Parallel SGD (D-PSGD) [dpsgd] and its asynchronous variant Asynchronous Decentralized Parallel SGD (AD-PSGD) [adpsgd] have been applied to a broad variety of deep learning tasks. Compared to S-PSGD, (A)D-PSGD replaces global weight synchronization with model averaging among neighboring learners in a peer-to-peer fashion while achieving the same convergence rate. In [icassp19], AD-PSGD was first applied to automatic speech recognition (ASR) to significantly shorten the acoustic model training time. In [interspeech19], it was discovered that (A)D-PSGD can converge with a much larger batch size than S-PSGD, which enables a larger degree of parallelism for distributed training. One drawback of (A)D-PSGD, however, is that when the number of learners grows, it requires more rounds of communication to reach consensus, slowing down convergence. Figure 1 illustrates (A)D-PSGD convergence curves for the 2000-hour Switchboard (SWB2000) and ImageNet tasks when running with different numbers of learners. It shows that convergence slows down as the number of learners increases.

In this paper, we investigate techniques to improve large-scale (A)D-PSGD based training. In Section 2 we formulate the (A)D-PSGD problem. In Section 3, we first analyze why the fixed model averaging among neighboring learners, as proposed in the original (A)D-PSGD, incurs slow convergence to consensus. Based on that, we propose a randomized mixing scheme, Randomization Accelerated Decentralized Parallel SGD (RAND-PSGD), that can significantly improve the spectral gap of the mixing matrix and thus speed up the convergence to consensus, while maintaining the same communication cost. We further investigate the “Delay-by-one” Decentralized Parallel SGD (D1D-PSGD) scheme, which ensures that the weights used to calculate gradients and the consensus weights differ by precisely one iteration of gradient calculation. D1D-PSGD reaches consensus quickly while maintaining the decentralized training structure, so it can still converge under a larger batch size than S-PSGD, at the cost of placing a global synchronization in a separate communication thread.

We describe the implementation details in Section 4, present experimental results of RAND-PSGD and D1D-PSGD, and discuss the trade-offs of each design choice in Section 6. We discuss related work in Section 7 and conclude with a summary in Section 8.

## 2 Problem Formulation

Stochastic gradient descent (SGD) is currently the dominant approach to optimizing deep neural networks. In SGD, models are iteratively updated as shown in Eq.1

$$\theta_{k+1} = \theta_k - \alpha \frac{1}{M} \sum_{m=1}^{M} \nabla f(\theta_k; \xi_{k,m}) \tag{1}$$

where $\theta_k$ are the parameters after iteration $k$. The gradient is computed using model $\theta_k$ on $M$ randomly drawn data samples indexed by the random variable $\xi_{k,m}$. The samples form a mini-batch and $M$ is the batch size. $\alpha$ is the learning rate.

In (A)D-PSGD, the weights update rule is given in Eq.2:

$$\Theta_{k+1} = \Theta_k \mathbf{T} - \alpha\, g(\Phi_k, \xi_k) \tag{2}$$

where $\Theta_k$ is a matrix with each column consisting of model parameters in each learner at iteration $k$; $\mathbf{T}$ is a doubly stochastic mixing matrix for model averaging among learners given a network topology; $\Phi_k$ is a matrix with each column consisting of model parameters used for computing gradient in each learner at iteration $k$. In the asynchronous mode, $\Phi_k$ may not be equal to $\Theta_k$; $\xi_k$ is a matrix with each column consisting of indexing random variables for mini-batch samples used for computing gradients in each learner at iteration $k$ and $g(\Phi_k, \xi_k)$ is a matrix with each column consisting of gradients computed in each learner at iteration $k$. In S-PSGD, all models are collected and averaged by the total number of learners $L$ after all learners finish their gradient computation and local model update. The average is then broadcast to each learner. In this case, it can be easily seen that the mixing matrix is $\mathbf{T}_u = \frac{1}{L}\mathbf{1}\mathbf{1}^\top$ where $\mathbf{1} = [1, \dots, 1]^\top$. In other words, S-PSGD is a special case of (A)D-PSGD.
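As a rough illustration of the decentralized update described above (mix the per-learner models with a doubly stochastic matrix, then subtract the scaled local gradients), the following NumPy sketch runs the synchronous case on a toy quadratic objective. The function name `dpsgd_step`, the toy objective, and all constants are assumptions made for this example and are not taken from the paper's implementation.

```python
import numpy as np

def dpsgd_step(theta, mixing, grad_fn, lr):
    """One decentralized update: mix models across learners, then apply local gradients.

    theta:   (dim, num_learners) array, one column per learner.
    mixing:  (num_learners, num_learners) doubly stochastic matrix.
    grad_fn: maps the models used for gradient computation to gradients of the same shape.
    """
    grads = grad_fn(theta)  # synchronous case: gradients use the current local models
    return theta @ mixing - lr * grads

# Toy problem (assumption): learner i minimizes 0.5 * ||x - c_i||^2,
# so the global optimum is the mean of the centers c_i.
rng = np.random.default_rng(0)
dim, num_learners = 4, 8
centers = rng.normal(size=(dim, num_learners))

def grad_fn(models):
    return models - centers

# Uniform mixing matrix: every learner averages over all learners (the S-PSGD case).
uniform = np.full((num_learners, num_learners), 1.0 / num_learners)

theta = rng.normal(size=(dim, num_learners))
for _ in range(200):
    theta = dpsgd_step(theta, uniform, grad_fn, lr=0.1)

average_model = theta.mean(axis=1)
print(np.abs(average_model - centers.mean(axis=1)).max())
```

Swapping `uniform` for a sparser doubly stochastic matrix changes only how quickly the columns of `theta` agree with each other, which is the topic of the next section.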

## 3 Randomized Mixing

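The mixing strategy in S-PSGD with $\mathbf{T}_u = \frac{1}{L}\mathbf{1}\mathbf{1}^\top$, where $L$ is the number of learners, is fast to reach consensus but it may be communication heavy as models have to be transferred among all learners. One way to reduce the communication cost is to use local averaging. For instance, each learner only averages models with its left and right neighbors in a ring [icassp19][interspeech19]. In this case, the mixing matrix is given by

$$\mathbf{T}_r = \begin{bmatrix}
\frac{1}{3} & \frac{1}{3} & 0 & \cdots & 0 & \frac{1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} & \cdots & 0 & 0 \\
0 & \frac{1}{3} & \frac{1}{3} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \frac{1}{3} & \frac{1}{3} \\
\frac{1}{3} & 0 & 0 & \cdots & \frac{1}{3} & \frac{1}{3}
\end{bmatrix} \tag{3}$$

Since each learner only needs to communicate with its immediate neighbors, the communication can be significantly reduced compared to averaging across all learners. It can be shown that, as a doubly stochastic matrix, $\mathbf{T}_r^k$ will converge to $\mathbf{T}_u$:

$$\lim_{k \to \infty} \mathbf{T}_r^k = \mathbf{T}_u \tag{4}$$

The speed of convergence is controlled by the spectral gap between the largest (which is always 1) and the second largest eigenvalues of $\mathbf{T}_r$. Suppose $\lambda$ is the second largest eigenvalue of $\mathbf{T}_r$. We have

$$\left\| \mathbf{T}_r^k - \mathbf{T}_u \right\|_2 \le \lambda^k \tag{5}$$

Given the circulant structure, the eigenvalues of $\mathbf{T}_r$ are simply the Fourier transform of the first row. The second largest eigenvalue is given by

$$\lambda = \frac{1}{3} + \frac{2}{3}\cos\frac{2\pi}{L} \tag{6}$$

When $L$ is large, $\lambda$ is very close to 1, which indicates a small spectral gap and therefore a slow convergence to consensus.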
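The shrinking spectral gap can be checked numerically. The sketch below assumes equal weights of 1/3 for a learner and its two ring neighbors, builds the corresponding circulant mixing matrix, and compares its second largest eigenvalue with the closed form obtained from the Fourier transform of the first row. The helper names are illustrative and not from the paper.

```python
import numpy as np

def ring_mixing_matrix(num_learners):
    """Ring averaging (assumed weights): each learner averages itself and its two neighbors."""
    T = np.zeros((num_learners, num_learners))
    for i in range(num_learners):
        for j in (i - 1, i, i + 1):
            T[i, j % num_learners] += 1.0 / 3.0
    return T

def second_largest_eigenvalue(T):
    """Second largest eigenvalue of a symmetric mixing matrix."""
    eigvals = np.linalg.eigvalsh(T)  # returned in ascending order
    return eigvals[-2]

for num_learners in (8, 16, 32, 64, 128):
    lam = second_largest_eigenvalue(ring_mixing_matrix(num_learners))
    closed_form = 1.0 / 3.0 + 2.0 / 3.0 * np.cos(2.0 * np.pi / num_learners)
    print(f"L={num_learners:4d}  lambda={lam:.6f}  "
          f"closed form={closed_form:.6f}  gap={1.0 - lam:.6f}")
```

The printed gap column decreases as the number of learners grows, which matches the slow consensus observed for large rings.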

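In this work, we investigate a randomized mixing strategy to accelerate the convergence without increasing the communication cost. Under this strategy, learners form a ring and the indices of the learners are randomly shuffled:

$$\sigma: \{1, 2, \dots, L\} \to \{\sigma(1), \sigma(2), \dots, \sigma(L)\} \tag{7}$$

where $\sigma$ is a random permutation of the set $\{1, 2, \dots, L\}$ of learner indices. A learner averages models with its left and right neighbors in the mapped indices. The resulting mixing matrix $\mathbf{T}_k$ of iteration $k$ constructed this way is obviously a doubly stochastic matrix and we have

$$\mathbf{T}_k = \mathbf{P}_k^\top \mathbf{T}_r \mathbf{P}_k \tag{8}$$

where $\mathbf{T}_r$ is the fixed ring mixing matrix of Eq.3 and $\mathbf{P}_k$ is a random permutation matrix. Moreover, we have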

(9) |

It can be shown that

(10) |

It follows that

(11) |

which gives

(12) |

Comparing Eq.12 with Eqs. 5 and 6, we can see that this randomized mixing strategy converges much faster to consensus than the fixed mixing strategy in Eq.3.
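The comparison can also be illustrated empirically. The following sketch, a toy simulation under stated assumptions rather than the paper's analysis, applies pure mixing (no gradient term) to random per-learner models for a few rounds, once with a fixed ring and once with a freshly permuted ring per round. It assumes ring weights of 1/3 and builds the permuted matrix by conjugating the fixed ring matrix with a permutation matrix.

```python
import numpy as np

def ring_mixing_matrix(num_learners):
    """Ring averaging (assumed weights): each learner averages itself and its two neighbors."""
    T = np.zeros((num_learners, num_learners))
    for i in range(num_learners):
        for j in (i - 1, i, i + 1):
            T[i, j % num_learners] += 1.0 / 3.0
    return T

def consensus_error(models):
    """Frobenius distance of the per-learner models (columns) from their average."""
    return np.linalg.norm(models - models.mean(axis=1, keepdims=True))

rng = np.random.default_rng(0)
num_learners, dim, rounds = 64, 16, 20
T_ring = ring_mixing_matrix(num_learners)
start = rng.normal(size=(dim, num_learners))

fixed = start.copy()
randomized = start.copy()
for _ in range(rounds):
    fixed = fixed @ T_ring
    perm = rng.permutation(num_learners)
    P = np.eye(num_learners)[perm]               # permutation matrix for this round
    randomized = randomized @ (P.T @ T_ring @ P)  # ring over shuffled learner indices

print("fixed ring:     ", consensus_error(fixed) / consensus_error(start))
print("randomized ring:", consensus_error(randomized) / consensus_error(start))
```

Both variants exchange the same number of messages per round, since every learner still averages with exactly two neighbors. Only the choice of neighbors changes.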

## 4 Design and Implementation

In each iteration of (A)D-PSGD, a learner calculates gradients in one thread while concurrently exchanging its weights with its left and right neighbors in another thread.

In RAND-PSGD, a learner picks two random neighbors to communicate with in each iteration. To achieve this, in each iteration each learner generates a random permutation of all the learner IDs to construct a communication ring (i.e., generates a new mixing matrix). For this iteration, a learner communicates with its two neighbors in the newly constructed communication ring. We let each learner start with the same random seed to guarantee that all learners generate the same random permutation. As in (A)D-PSGD, each learner sends two messages and receives two messages in each iteration. Assuming all learners are connected to the same communication switch, RAND-PSGD has the same communication cost as (A)D-PSGD.
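The neighbor selection described above can be sketched as follows. The function `ring_neighbors`, the default seed value, and the choice to derive the generator from the pair (seed, iteration) are assumptions made for this illustration. The text only states that all learners start from the same random seed.

```python
import numpy as np

def ring_neighbors(rank, num_learners, iteration, seed=1234):
    """Return the (left, right) neighbor IDs of learner `rank` in this iteration's ring."""
    # Every learner uses the same (seed, iteration) pair, so every learner
    # derives the same permutation without any extra communication.
    rng = np.random.default_rng([seed, iteration])
    ring = rng.permutation(num_learners)  # ring[p] is the learner ID at ring position p
    position = int(np.where(ring == rank)[0][0])
    left = int(ring[(position - 1) % num_learners])
    right = int(ring[(position + 1) % num_learners])
    return left, right

# Each learner calls the function with its own rank and the shared iteration counter.
for iteration in range(3):
    print(iteration, [ring_neighbors(rank, 8, iteration) for rank in range(8)])
```

Seeding from the iteration number keeps learners consistent even if one of them skips or repeats a call, at the cost of constructing a generator per iteration. Advancing one shared-seed generator in lock step would be an equally valid design.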

In D1D-PSGD, we design the strategy in such a way that the model averaging and the gradient computation on the RHS of Eq.2 are carried out concurrently. In addition, the model averaging indicated by the mixing matrix is realized with allreduce divided by the number of learners. (An allreduce operation is a reduction operation, which is both associative and commutative such as summation, followed by a broadcast operation. A global summation is an example of an allreduce operation.)

(13) |

On the other hand, the model used for computing the gradients in each learner is the model from the previous round of allreduce

(14) |

hence the name delay-by-one decentralized parallel SGD. The difference between D1D-PSGD and S-PSGD is that S-PSGD requires consensus on gradients before model update, which results in homogeneous models across learners when computing gradients. D1D-PSGD has models updated locally on each learner using different gradients before pushing the models for allreduce across learners, which introduces slight heterogeneity to local models that can be helpful for convergence as demonstrated in our experimental results in Section 6. In contrast, homogeneous models enforced by S-PSGD cannot converge with a large batch size and aggressive learning rate for our ASR task setting [interspeech19]. A good allreduce implementation can finish each round of communication after effectively 2 messages are sent across the communication network, independent of the number of learners [fsu-allreduce]. We choose NVIDIA NCCL [nccl] as our allreduce implementation. Even though D1D-PSGD has the most favorable spectral gap (i.e., 1) while incurring the same communication cost as (A)D-PSGD and RAND-PSGD, it requires a global synchronization (i.e., allreduce). It therefore suffers from the straggler problem in a distributed setting, and its communication speed is bounded by the slowest communication link. Table 1 summarizes the design choices for each algorithm.

| Algorithm | Consensus Convergence* | Time to Communicate in Each Iteration | Straggler Avoidance |
|---|---|---|---|
| (A)D-PSGD | Slow | 2* | Y |
| RAND-PSGD | Medium | 2* | Y |
| D1D-PSGD | Fast | 2* | N |
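To make the link between allreduce and the mixing-matrix view concrete, the sketch below simulates a sum-allreduce in plain NumPy, divides by the number of learners, and checks that the result matches multiplying the model matrix by a uniform mixing matrix. The `allreduce_sum` helper is a single-process stand-in written for this example. It is not the NCCL API, and no real communication takes place.

```python
import numpy as np

def allreduce_sum(local_models):
    """Simulated allreduce: every learner receives the element-wise sum of all models."""
    total = np.sum(local_models, axis=0)
    return [total.copy() for _ in local_models]

rng = np.random.default_rng(0)
num_learners, dim = 6, 5
local_models = [rng.normal(size=dim) for _ in range(num_learners)]

# Model averaging realized as allreduce followed by division by the number of learners.
averaged = [summed / num_learners for summed in allreduce_sum(local_models)]

# The same operation in matrix form: one column per learner, uniform mixing matrix.
theta = np.stack(local_models, axis=1)  # shape (dim, num_learners)
uniform = np.full((num_learners, num_learners), 1.0 / num_learners)
mixed = theta @ uniform

print(max(np.abs(averaged[i] - mixed[:, i]).max() for i in range(num_learners)))
```

After this step every learner holds the same averaged model, which is why the global averaging reaches consensus in a single round.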

## 5 Methodology

### 5.1 Hardware and Software

We experiment on an IBM cluster with a node architecture similar to the current fastest supercomputer in the world, Summit [Top500]. This cluster is based on IBM POWER System AC922 nodes with IBM POWER9 CPUs and NVIDIA Volta V100 GPUs, all connected together with NVIDIA’s high-speed NVLink dual links totaling 50GB/s bandwidth in each direction. Each node contains 22 cores, 512GB of DDR4 memory, 96GB of High Bandwidth Memory (HBM2) for use by the accelerators and is equipped with 6 GPUs. Nodes are connected with Mellanox EDR 100G Infiniband interconnect technology, and each node has a combined network bandwidth of 25GB/s. Each node is equipped with 500GB NVME storage. We use PyTorch v1.1.0 and IBM Spectrum MPI along with XL compiler suite v16.1.1. For each learner, we use 4 I/O processes to drive the data loading.

### 5.2 Models and Dataset

The hybrid acoustic model used in the SWB2000 experiments is an LSTM with 6 bi-directional layers. Each layer has 1,024 cells (512 cells in each direction). A linear projection layer with 256 hidden units is inserted between the LSTM layers and the softmax layer with 32,000 output units. These 32,000 units correspond to context-dependent hidden Markov model (HMM) states. The LSTM is unrolled with 21 frames and trained with non-overlapping feature subsequences of that length. The feature input is a fusion of 40-dim FMLLR, 100-dim i-Vector and 40-dim logmel with its delta and double delta. The total input dimensionality is 260. The model size is 165MB. The language model is built using publicly available training data from a broad variety of sources. There are 36M 4-grams built on a vocabulary of 85K words. The test set is the Hub5 2000 evaluation set including two parts: 2.1 hours of switchboard (SWB) test set and 1.6 hours of call-home (CH) test set.

Our second benchmark dataset is a collection of natural images used as a part of the 2012 edition of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2012). The training set is a subset of the hand-labeled ImageNet database and contains 1.2 million images. The validation dataset has 50,000 images. Each image maps to one of the 1000 non-overlapping object categories. The model we use is ResNet-50 [resnet].

## 6 Experimental Results

### 6.1 Convergence Results

Figure 2 illustrates convergence results for AD-PSGD, RAND-PSGD and D1D-PSGD up to 64 learners. For the SWB2000 task, we use the same hyper-parameter settings as described in [interspeech19], and the total batch size across all the learners is 8192 (e.g., batch size 128 per learner in the 64 learners setting). For the ImageNet task, we use the same hyper-parameter settings as described in [facebook-1hr], and the batch size for each learner is set to 32. Up to 64 learners, RAND-PSGD and D1D-PSGD perform similarly and outperform AD-PSGD. Table 3 summarizes the WER of ASR models trained by each algorithm.

| | Single Learner | 16 Learners | | | 32 Learners | | | 64 Learners | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | Baseline | AD | RAND | D1D | AD | RAND | D1D | AD | RAND | D1D |
| SWB | 7.5 | 7.6 | 7.6 | 7.4 | 7.9 | 7.7 | 7.6 | 8.1 | 7.8 | 7.5 |
| CH | 13.0 | 13.2 | 13.1 | 13.3 | 13.6 | 13.4 | 13.1 | 14.0 | 13.4 | 13.3 |

WER comparison after 16 epochs for AD-PSGD, RAND-PSGD, and D1D-PSGD up to 64 learners. AD is short for AD-PSGD, RAND is short for RAND-PSGD, D1D is short for D1D-PSGD. Single learner baseline is trained with batch size 256 under a well-tuned training recipe. For all the other settings, the total batch size is 8192.

| 16 Learners | | 32 Learners | | 64 Learners | | 96 Learners | | 128 Learners | |
|---|---|---|---|---|---|---|---|---|---|
| WER (SWB/CH) | Time (hr) | WER (SWB/CH) | Time (hr) | WER (SWB/CH) | Time (hr) | WER (SWB/CH) | Time (hr) | WER (SWB/CH) | Time (hr) |
| 7.4/13.3 | 5.88 | 7.6/13.1 | 3.60 | 7.5/13.3 | 2.28 | 7.7/13.2 | 2.10 | 7.7/13.3 | 1.98 |

### 6.2 Runtime Results

Figure 3 shows the speedup up to 11 nodes (i.e. 66 GPUs). D1D-PSGD is the fastest, as NCCL implements sophisticated software pipelining to overlap message exchanges on the bidirectional network link. AD-PSGD and RAND-PSGD run at a similar speed.

The table of WER and training time for 16 to 128 learners presents the tradeoff of running time and model accuracy using D1D-PSGD. Our system can finish ASR training in under 2 hours with 128 GPUs, with slight model accuracy degradation. Note that we keep the total batch size at 8192 across all the runs; thus, when there are many learners in the system, the per-learner batch size decreases and so does the computation efficiency. Also, when the batch size per learner gets smaller, the sample variance per learner gets larger, which might explain the slight model accuracy degradation for 128 learners.
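The scaling behavior implied by the reported D1D-PSGD training times can be worked out directly. The short sketch below uses only the learner counts, the training times in hours, and the total batch size of 8192 reported above. Taking the 16-learner run as the reference point for speedup and efficiency is an assumption of this example.

```python
# Reported D1D-PSGD training times (hours) for each learner count.
learners = [16, 32, 64, 96, 128]
hours = [5.88, 3.60, 2.28, 2.10, 1.98]
total_batch_size = 8192

base_learners, base_hours = learners[0], hours[0]
speedups = [base_hours / t for t in hours]
# Efficiency: achieved speedup divided by the ideal (linear) speedup.
efficiencies = [s / (n / base_learners) for s, n in zip(speedups, learners)]
per_learner_batch = [total_batch_size / n for n in learners]

for n, t, s, e, b in zip(learners, hours, speedups, efficiencies, per_learner_batch):
    print(f"{n:4d} learners: {t:.2f} hr, speedup {s:.2f}x, "
          f"efficiency {e:.0%}, per-learner batch {b:.1f}")
```

The falling efficiency at 96 and 128 learners is consistent with the shrinking per-learner batch size noted above.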

### 6.3 Discussion

In a supercomputer environment like ours, where even the slowest communication link bandwidth is 25GB/s and all computing devices are highly homogeneous, D1D-PSGD guarantees the best convergence and can deliver near-metal runtime performance. In a cloud data center environment where network links are usually slow and the straggler problem becomes more prominent, an algorithm built on a global barrier such as allreduce is unlikely to be deployed, and we suggest using RAND-PSGD as it achieves a convergence rate close to that of D1D-PSGD, has the same traffic cost as AD-PSGD and does not rely on global synchronization. Researchers find that AD-PSGD outperforms allreduce-based algorithms significantly when network links are standard 10Gbit/s Ethernet because allreduce speed is bounded by the slowest link [adpsgd, adpsgd-rabbat].

## 7 Related Work

The Parameter Server [distbelief, adam] based asynchronous parallel SGD approach was first proposed to solve the straggler problem in distributed training. Due to the staleness issue [zhang-ijcai-2016] in the parameter server design, Synchronous Parallel SGD regained its popularity [facebook-1hr, ddl, revisit-sync-sgd]. (Asynchronous) Decentralized Parallel SGD is a state-of-the-art distributed deep learning algorithm [dpsgd, adpsgd, adpsgd-rabbat] that removes the need for a centralized parameter server and relaxes the need for lock-step synchronization. Researchers have recently applied (A)D-PSGD to training ASR models in record time [icassp19, interspeech19]. One drawback of (A)D-PSGD is that when the number of learners grows, convergence is hampered. To circumvent this problem, [adpsgd] increases rounds of communication when network links are free, [icassp19] adapts similar techniques to D1D-PSGD, and [interspeech19] groups learners on the same server as one super learner and only applies (A)D-PSGD among super learners. None of the prior work studies the root cause of this efficiency issue. This paper is the first that formalizes the relationship between convergence in (A)D-PSGD and the number of learners and proposes remedies. Orthogonal to (A)D-PSGD, many existing works [deepspeech2, bmuf, msr-1bit, seide2014on, amazon-million-hr] adopt a synchronous approach where all learners periodically synchronize and obtain the same set of weights. Among them, [bmuf, amazon-million-hr] reduce the communication frequency by improving the underlying optimizers.

## 8 Summary

We identify that in (A)D-PSGD based training with fixed local model averaging, the spectral gap of the mixing matrix decreases when the number of learners grows. This gives rise to slow convergence to consensus and hence decreases the training efficiency. Our proposed algorithms, RAND-PSGD and D1D-PSGD, improve the spectral gap at no extra communication cost compared to (A)D-PSGD. We show the effectiveness of our proposed techniques on both ASR and computer vision tasks. On an IBM supercomputer, our system is able to train SWB2000 to reach a WER of 7.5% on SWB and 13.3% on CH using 64 V100 GPUs in 2.28 hours, and to reach a WER of 7.7% on SWB and 13.3% on CH using 128 V100 GPUs in 1.98 hours.