1 Introduction
Neural networks with deep architectures have been the dominant acoustic modeling approach for automatic speech recognition (ASR) in recent years. They have yielded state-of-the-art performance compared to previous technologies based on hidden Markov models (HMMs) and Gaussian mixtures [Hinton_DNNSPM]. On some tasks, Deep Learning (DL) has achieved near human-level ASR performance [saoninterspeech2017, msrspeech]. It is commonly agreed that the success of DL for ASR relies on the availability of large amounts of training data and high-performance computing. Distributed DL is therefore not only preferred but necessary in DL ASR to guarantee fast turnaround time for model training.
Unlike widely-studied computer vision tasks such as ImageNet [facebook1hr, ddl, lars, tencentimgnet, terngrad], few studies have been published on how to accelerate distributed training for ASR tasks on large public datasets (e.g., SWB2000), with the exceptions of [msr1bit, deepspeech2]. Compared to computer vision tasks such as ImageNet, ASR tasks behave very differently under distributed computing: (1) state-of-the-art acoustic models conventionally can only be trained with a relatively small batch size (e.g., 256) [saoninterspeech2017], unlike ImageNet, where a ResNet model can be trained with a batch size of 8192 or larger [facebook1hr, lars, tencentimgnet]; (2) the computation/communication ratio is low in ASR tasks. In Section 3.1, we demonstrate that SWB2000 with a state-of-the-art LSTM is five times more challenging to scale out than a ResNet for ImageNet. Therefore, we need to revisit distributed training strategies beyond standard synchronous SGD for acceleration. In this paper, we address these two issues by (1) increasing the batch size for a high-performance LSTM model without impairing model accuracy and (2) using the Asynchronous Decentralized Parallel SGD (ADPSGD) [adpsgd] approach to reduce communication cost and remove runtime bottlenecks.

2 Background
A neural network training algorithm seeks to find a set of parameters $W$ that minimizes the discrepancy between the network output and the ground truth. This is usually accomplished by defining a differentiable cost function $f$ and iteratively updating each of the model parameters using some variant of the gradient descent algorithm:

(1a)   $g_i = \nabla f(W_t; x_i)$
(1b)   $\bar{g} = \frac{1}{B}\sum_{i=1}^{B} g_i$
(1c)   $W_{t+1} = W_t - \alpha \bar{g}$

where $W_t$ represents the model parameters at iteration $t$, $x_i$ is the $i$-th sample in the minibatch, $\alpha$ is the step size (also known as the learning rate), and $B$ is the batch size.
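As a concrete illustration, the minibatch update in Equations (1a)-(1c) can be sketched in a few lines of plain Python. Scalar weights and a caller-supplied `grad_fn` stand in for the real model and backpropagation; the names are illustrative, not from any framework.

```python
def sgd_step(w, batch, grad_fn, lr):
    """One minibatch gradient descent step (Equations 1a-1c)."""
    grads = [grad_fn(w, x) for x in batch]   # (1a) per-sample gradients
    g_bar = sum(grads) / len(batch)          # (1b) average over the batch
    return w - lr * g_bar                    # (1c) step of size lr (alpha)
```

For example, with the quadratic cost $f(w; x) = (w - x)^2 / 2$, whose gradient is $w - x$, a single step moves $w$ toward the batch mean of the targets.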
Distributed DL systems, the de facto approach to training large DL tasks, typically adopt a Parameter Server (PS) [distbelief] architecture. Figure 1(a) depicts a PS architecture in which each learner calculates gradients and transfers them to the PS. The PS then updates the weights and sends them back to the learners. The timestamp of the PS's weights, a scalar counter, is increased by 1 (i.e., from $t$ to $t+1$) for each update. Staleness [zhangijcai2016] is defined as the discrepancy between the timestamp of the learner's weights used to calculate gradients and the timestamp of the PS's weights. To achieve convergence matching a single-learner system, Synchronous SGD (SSGD) is often used. In SSGD, the PS weight update rule is given in Equation 2 (throughout this paper, we use $\lambda$ to denote the number of learners): each learner calculates gradients and receives updated weights in lockstep with the others. The weights used to calculate the gradients are always the same as those on the PS, thus the staleness is 0.
(2)   $W_{t+1} = W_t - \frac{\alpha}{\lambda}\sum_{l=1}^{\lambda} g_l(W_t)$

where $g_l$ is the minibatch gradient computed by learner $l$.
The summation operation in Equation 2 is commutative and associative; this is known as the "Reduce" operation in the High Performance Computing field. Allreduce is the operation that reduces (e.g., sums) all the elements and broadcasts the reduction result to each participant. When the message to be "Allreduced" is large, as in the DL case, the optimal algorithm maximizes communication bandwidth utilization by breaking the large message into chunks and pipelining the reduction with message transfer in a ring topology [fsuallreduce]. Such an algorithm can finish the Allreduce operation in 2M/Bandwidth time, where $M$ is the size of the message, regardless of the number of participants. Many such implementations exist, most notably [paddlepaddle, nccl, ddl].
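To make the chunked ring algorithm concrete, here is a toy simulation in plain Python (a model of the algorithm, not an MPI/NCCL implementation): each of the $n$ participants fully reduces one chunk during a reduce-scatter phase, then the completed chunks circulate during an allgather phase, so each node transfers roughly $2M$ bytes in total regardless of $n$.

```python
def ring_allreduce(vectors):
    """Toy simulation of ring Allreduce (sum) over n participants.
    vectors: one equal-length list per node; length divisible by n.
    Returns each node's buffer; all end up equal to the elementwise sum."""
    n = len(vectors)
    chunk = len(vectors[0]) // n
    assert len(vectors[0]) == chunk * n, "message must split into n chunks"
    bufs = [list(v) for v in vectors]
    seg = lambda c: slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: after n-1 steps node i fully owns chunk (i+1) % n.
    for step in range(n - 1):
        sends = [((i - step) % n, bufs[i][seg((i - step) % n)]) for i in range(n)]
        for i, (c, data) in enumerate(sends):
            j = (i + 1) % n  # right neighbor in the ring
            bufs[j][seg(c)] = [a + b for a, b in zip(bufs[j][seg(c)], data)]

    # Allgather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, bufs[i][seg((i + 1 - step) % n)]) for i in range(n)]
        for i, (c, data) in enumerate(sends):
            bufs[(i + 1) % n][seg(c)] = data
    return bufs
```

Each step moves only one chunk per node, which is what lets message transfer overlap with the reduction arithmetic in a real pipelined implementation.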
One key drawback of SSGD is that one slow learner can slow down the entire training, which is known as the straggler problem [mapreduce] in distributed computing. To avoid this problem, practitioners proposed Asynchronous SGD (ASGD), which allows each learner to calculate gradients and asynchronously push gradients to and pull weights from the PS. The weight update rule in ASGD is given in Equation 3:
(3)   $W_{t+1} = W_t - \alpha\, g_l(W_{t-\tau_l})$

where $\tau_l$ is the staleness of the weights learner $l$ used to compute its gradient.
Staleness in ASGD is proportional to the number of learners in the system[zhangijcai2016, distbelief] and can severely harm convergence[revisitsyncsgd, zhang2016icdm]. To achieve the best model accuracy, most distributed deep learning tasks adopt SSGD only[deepspeech2, facebook1hr, nvidialmscaling, fbmtscaling].
To avoid the straggler problem while maintaining competitive model accuracy, decentralized distributed computing algorithms [wildfire] have been proposed, in both a synchronous form, Decentralized Parallel SGD (DPSGD) [dpsgd], and an asynchronous form, Asynchronous Decentralized Parallel SGD (ADPSGD) [adpsgd]. The architecture of a decentralized SGD system is depicted in Figure 1(b), where each learner calculates gradients, updates its weights, and averages its weights with its neighbor in a ring topology. The DPSGD/ADPSGD weight update rule is defined in Equation 4.
(4)   $W_{t+1}^{l} = \frac{1}{2}\left(W_t^{l} + W_t^{l'}\right) - \alpha\, g_l(W_t^{l})$

where $l'$ is the neighbor with which learner $l$ averages weights at iteration $t$.
In DPSGD, assuming the pairwise weight averaging can be executed multiple times after each gradient calculation, all the learners will reach the same weights; thus the staleness can be zero. In ADPSGD, since computation is overlapped with communication, the staleness is 1 at best. ADPSGD has shown excellent runtime and convergence performance on computer vision tasks with CNN-type models (e.g., ResNet and VGG) [adpsgd]. In this paper, we demonstrate that ADPSGD also achieves excellent runtime and convergence performance on the SWB2000 speech recognition task with an LSTM model.
3 Design and Implementation
We describe how to increase the batch size to enable efficient distributed computing for SWB2000-LSTM in Section 3.1. We then describe the design and implementation of different distributed learning strategies in Section 3.2.
3.1 Increase Batch Size
A sufficiently large batch size is necessary for efficient distributed DL for two reasons: (1) the larger the batch, the more learners that can be used; (2) gradient computation is more efficient with a larger batch size. In Figure 2(a), the blue bars show the computation time per epoch under different batch sizes for the SWB2000-LSTM task (we describe the details of the LSTM model in Section 4.2), measured on a P100 GPU. It takes 8.58 hrs to finish one epoch under batch size 256, as compared to 18.33 hrs under batch size 32. Furthermore, a smaller batch size requires more frequent communication. In Figure 2(a), the orange bars show the minimum bandwidth required for communication time to break even with computation time. Batch size 32 per learner requires almost 4X the bandwidth (4.98GB/s) of batch size 256 (1.33GB/s). Conventional wisdom on the SWB2000-LSTM task is that a batch size larger than 256 significantly lowers the model accuracy [saoninterspeech2017]. The hyperparameter setup for the batch size 256 configuration is: the learning rate is set to 0.1, momentum is set to 0.9, and the learning rate anneals by a fixed factor every epoch from the 11th epoch. The training finishes in 16 epochs. Inspired by the work proposed in [facebook1hr], we are able to increase the batch size from 256 to 2560 without decreasing model accuracy by (1) linearly warming up the base learning rate from 0.1 to 1 in the first 10 epochs and (2) annealing the learning rate by the same fixed factor from the 11th epoch. Figure 2(b) plots the held-out loss w.r.t. epochs for batch size 256 and batch size 2560; they are indistinguishable by epoch 16.

In the ImageNet-ResNet task, a batch size of 32 takes about 0.18 sec to compute on one P100 GPU, whereas the same batch size for the SWB2000-LSTM task takes only 0.067 sec. Moreover, the ImageNet-ResNet model size is about 100MB, whereas the SWB2000-LSTM model size is 165MB. The combination of shorter computation time and larger model size makes SWB2000-LSTM 5X more challenging to parallelize than ImageNet-ResNet.
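The warm-up schedule above can be sketched as a small function. The annealing factor is not specified in this copy of the text, so `anneal=0.5` below is purely an illustrative placeholder, as are the function and argument names:

```python
def lr_at_epoch(epoch, base_lr=0.1, peak_lr=1.0, warmup_epochs=10, anneal=0.5):
    """Learning rate for a 1-indexed epoch: linear warm-up from base_lr
    to peak_lr over the first warmup_epochs epochs, then multiplicative
    annealing by `anneal` every subsequent epoch."""
    if epoch <= warmup_epochs:
        # interpolate so epoch 1 -> base_lr and epoch warmup_epochs -> peak_lr
        return base_lr + (peak_lr - base_lr) * (epoch - 1) / (warmup_epochs - 1)
    return peak_lr * anneal ** (epoch - warmup_epochs)
```

With these placeholder values the rate climbs from 0.1 to 1.0 across epochs 1-10 and then decays each epoch through epoch 16.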
3.2 System Design
Three distributed SGD algorithms are considered and implemented: Synchronous SGD (SYNC), Asynchronous Decentralized Parallel SGD (ADPSGD), and a hybrid of the two (HYBRID). In Figure 3, we sketch the integration of our system APIs (in bold font) with the underlying DL framework. We assume the underlying DL framework provides the following functionality: g = calGrad(W), which calculates the gradients g based on weights W, and W' = apply(W, g), which applies gradients g to weights W and returns the updated weights W'.


SYNC: Summing the weights and taking their average after every iteration, as shown in Figure 3(a), is equivalent to applying the weight update using the averaged gradients. We use the fastest Allreduce implementation available to us (DDL Allreduce [ddl]) to implement the SYNC strategy. As we will show in Section 5.2, DDL Allreduce is 1.2X-3X faster than the open-source MPI_Allreduce implementation in OpenMPI [openmpi].
ADPSGD: Figure 4 shows the system architecture of ADPSGD. Assuming $\lambda$ learners in the system ($\lambda$ is an even number), we designate learners with odd IDs ($1, 3, \ldots, \lambda-1$) as senders and learners with even IDs ($2, 4, \ldots, \lambda$) as receivers. Bipartition of the communication graph guarantees acyclic communication, thus it is deadlock-free. Each sender communicates with its left and right neighbors in alternate iterations. Each sender process runs two threads: a main thread and a communication thread. The main thread calculates the gradients and applies the weight updates. When a new set of weights $W_s$ is generated, the sender signals its communication thread to send $W_s$ to its neighbor learner $r$, receive learner $r$'s weights $W_r$, and then update its weights to the average of $W_s$ and $W_r$. Additionally, as required in the proof of [adpsgd], the mixing matrix of the learners needs to be doubly stochastic and symmetric, which implies that the weight update on the GPU and the weight averaging on the CPU must not interleave, so that the receiver has a consistent view of the weights on the sender. We enforce this atomicity via a condition variable [mesa]. Similarly, each receiver runs a main thread and a communication thread. In each iteration, a receiver's main thread calculates gradients and then updates its weights in an atomic region. Meanwhile, the receiver's communication thread waits until the weights are received from its neighbor $s$. Then, the communication thread does the following in an atomic region: (i) it sends its weights to the engaging neighbor and (ii) it updates its weights by averaging them with the received weights. The atomicity is enforced via a mutex lock [ArpaciDusseau14Book]. Compared to the implementation in [adpsgd], this implementation updates weights on the GPU, which runs faster, and no gradient information needs to be extracted from the underlying solver. The disadvantage is that each sending operation is triggered only when new gradients are calculated; there is no way to exchange weights more frequently even when the network is free.
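The sender-side handshake can be sketched with Python threads. This is an illustrative toy (scalar weights, a precomputed list of "gradients", and the neighbor's weights supplied as a list), not the paper's implementation; a condition variable keeps the weight update and the pairwise averaging from interleaving, mirroring the atomicity requirement above:

```python
import threading

class Sender:
    """Toy ADPSGD sender: a main thread updates weights, a communication
    thread averages them with a neighbor's weights, strictly alternating."""

    def __init__(self, w):
        self.w = w
        self.cv = threading.Condition()
        self.ready = False  # True => fresh weights awaiting exchange

    def comm_loop(self, neighbor_weights):
        # Communication thread: one pairwise average per gradient step.
        for w_r in neighbor_weights:
            with self.cv:
                while not self.ready:
                    self.cv.wait()
                # Atomic region: the neighbor sees a consistent view of w.
                self.w = 0.5 * (self.w + w_r)
                self.ready = False
                self.cv.notify_all()

    def train(self, grads, lr, neighbor_weights):
        t = threading.Thread(target=self.comm_loop, args=(neighbor_weights,))
        t.start()
        for g in grads:
            with self.cv:
                while self.ready:          # previous exchange not yet done
                    self.cv.wait()
                self.w -= lr * g           # "GPU" weight update
                self.ready = True
                self.cv.notify_all()
        t.join()
        return self.w
```

The strict alternation enforced by the `ready` flag is the toy analogue of the constraint that a sender can only exchange weights once per gradient calculation.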
Also, GPU weight updates need to wait if the weights are being changed by the communication thread.

HYBRID: Note that the communication threads in ADPSGD essentially run an Allreduce over learners $s$ and $r$. By replacing the point-to-point message passing with an Allreduce over all learners, we can leverage the optimized fast Allreduce implementation and also minimize the weight discrepancy among learners. In essence, push_weights(W') signals the communication thread to conduct an Allreduce, which runs concurrently with the gradient calculation, and pull_weights simply retrieves the Allreduced result from the last push and takes the average. The handshaking between the communication thread and the main thread is a fast lock-free implementation [dyce]. When the gradient calculation takes longer than the Allreduce, this scheme completely overlaps communication with computation.
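The HYBRID learner loop, expressed with the calGrad/apply APIs from Section 3.2 plus the push/pull pair, might look like the following sketch. The concurrency is elided: push_weights and pull_weights are passed in as callables, and in a real system the push would kick off an Allreduce that overlaps with the next gradient computation. All names here are illustrative.

```python
def hybrid_train_loop(W, batches, cal_grad, apply_update, push_weights, pull_weights):
    """One learner's HYBRID loop: compute, apply, hand off to the
    communication thread, and adopt the averaged weights."""
    for batch in batches:
        g = cal_grad(W, batch)       # g = calGrad(W)
        W_new = apply_update(W, g)   # W' = apply(W, g)
        push_weights(W_new)          # signal the async Allreduce
        W = pull_weights()           # averaged result of the last push
    return W
```

With a single learner and trivially synchronous push/pull stubs, the loop degenerates to plain SGD, which is an easy sanity check of the call order.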
                         SYNC    HYBRID    ADPSGD
Comm/Compute Overlap             ✓         ✓
Straggler avoidance                        ✓
Staleness                0       1         1 at best
Table 1 summarizes the runtime and staleness comparison between different algorithms.
4 Methodology
4.1 Software and Hardware
PyTorch 0.5.0 is the underlying DL framework. We use the CUDA 9.2 compiler, CUDA-aware OpenMPI 3.1.1, and the g++ 4.8.5 compiler to build our communication library, which connects with PyTorch via a Python-C interface.
We develop and experiment with our systems on a production-run cluster, ClusterA, which has 4 servers in total. Each server is equipped with a 14-core Intel Xeon E5-2680 v4 2.40GHz processor, 1TB of main memory, and 4 P100 GPUs. We also run a 32-GPU experiment on a 4-server high-GPU-density experimental cluster, ClusterB. Each ClusterB server is equipped with an 18-core Intel Xeon E5-2697 2.3GHz processor, 1TB of main memory, and 8 V100 GPUs. The servers in both clusters are connected by 100Gbit/s Ethernet. In both clusters, GPUs and CPUs are connected via a PCIe Gen3 bus, which has a 16GB/s peak bandwidth in each direction.
4.2 DL Models and Dataset
The acoustic model is an LSTM with 6 bidirectional layers. Each layer contains 1,024 cells (512 cells in each direction). On top of the LSTM layers, there is a linear projection layer with 256 hidden units, followed by a softmax output layer with 32,000 units corresponding to context-dependent HMM states. The LSTM is unrolled over 21 frames and trained with non-overlapping feature subsequences of that length. The feature input is a fusion of FMLLR (40-dim), i-vector (100-dim), and log-mel with its delta and double delta (40-dim). The language model (LM) is rebuilt using publicly available training data, including Switchboard, Fisher, Gigaword, and Broadcast News and Conversations. Its vocabulary has 85K words and it has 36M 4-grams.
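As a sanity check on the ~165MB model size quoted in Section 3.1, the architecture above admits a back-of-the-envelope parameter count. The fused input dimension used below is an assumption (40 FMLLR + 100 i-vector + 120 for 40-dim log-mel with its Δ and ΔΔ = 260; the exact fused dimension is not stated here); with it, the fp32 model size lands in the same ballpark:

```python
def lstm_param_count(input_dim=260, hidden=512, layers=6, proj=256, outputs=32000):
    """Approximate parameter count for the acoustic model of Section 4.2:
    6 bidirectional LSTM layers (512 cells per direction), a 256-unit
    linear projection, and a 32,000-way softmax output layer."""
    def lstm_dir(in_dim):
        # 4 gates, each with input weights, recurrent weights, and a bias
        return 4 * (in_dim + hidden + 1) * hidden
    total = 2 * lstm_dir(input_dim)                   # layer 1, two directions
    total += (layers - 1) * 2 * lstm_dir(2 * hidden)  # upper layers see 1024-dim input
    total += (2 * hidden + 1) * proj                  # linear projection
    total += (proj + 1) * outputs                     # softmax layer
    return total

size_mb = lstm_param_count() * 4 / 1e6  # fp32 bytes -> MB, roughly 170MB
```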
5 Experimental Results
5.1 Convergence Results


        Baseline   SYNC    HYBRID   ADPSGD
SWB     7.5%       7.6%    7.6%     7.6%
CH      13.0%      13.1%   13.1%    13.2%
Our LSTM baseline trained on a single GPU (batch size 256) gives a WER of 7.5%/13.0% on the Switchboard/CallHome (SWB/CH) subsets of the NIST Hub5 2000 evaluation test set after cross-entropy training, which is a competitive baseline. We compare this baseline with SYNC, HYBRID, and ADPSGD in Figure 5(a) for held-out loss and in Table 2 for WER.
5.2 Runtime Results
Figure 5(b) plots the speedup of the different algorithms on up to 16 GPUs on ClusterA. SYNC-OpenMPI is the slowest. ADPSGD achieves the best speedup (about 11X on 16 GPUs) and finishes training in 13.98 hours. ADPSGD does not achieve linear speedup because it requires the CPU-based weight averaging and the GPU weight update to occur atomically, which could be remedied by offloading the GPU weight update to the CPU. HYBRID does not significantly outperform SYNC-DDL even though the computation is long enough to hide the communication. This is because HYBRID asynchronously calls DDL, which relies on the NVIDIA NCCL library [nccl] for intra-server reduction. NCCL competes heavily with training for GPU resources (e.g., stream processors and memory) when used asynchronously.
Table 3 shows the speedup measured for each algorithm when one of the 16 GPUs slows down. ADPSGD is immune to the straggler problem, whereas the speedup of the other algorithms quickly diminishes. Figure 6 shows a snapshot of the number of minibatches processed by each GPU in one epoch when half of the GPUs in ClusterA are shared with other users. ADPSGD automatically balances the workload per GPU; SYNC and HYBRID would force each GPU to process the same number of minibatches in this scenario.

Slowdown       ADPSGD                       SYNC-DDL / HYBRID
               Time/epoch (hr)   Speedup    Time/epoch (hr)   Speedup
no slowdown    0.87              10.88      1.09 / 1.03       8.70 / 9.26
2X             0.89              10.63      1.67 / 1.63       5.71 / 5.83
10X            0.91              10.42      6.24 / 6.46       1.52 / 1.47
100X           0.92              10.38      57.73 / 60.80     0.16 / 0.16
5.3 Experiments on 32 GPUs
We ran an experiment on ClusterB with 32 GPUs and a batch size of 80 per GPU (it takes 195 hours to finish training the SWB2000 task on a single V100 GPU with a batch size of 80). SYNC-DDL, HYBRID, and ADPSGD complete one epoch in 0.75 hrs (16.25X speedup), 0.71 hrs (17.17X speedup), and 0.83 hrs (14.69X speedup), respectively. HYBRID trains SWB2000 to reach WER 7.6% on SWB and WER 13.1% on CH in 11.5 hrs.
ADPSGD requires more CPU resources than SYNC and HYBRID to average and pass weights, and ClusterB has a less favorable CPU/GPU ratio. Furthermore, 8 GPUs share one PCIe bus on ClusterB, and ADPSGD quickly saturates the 16GB/s bandwidth. ADPSGD would run significantly faster on clusters with a higher CPU/GPU ratio, higher main-memory bandwidth, and/or an advanced GPU-CPU interconnect (e.g., NVLink [nvlink]).
6 Conclusion and Future work
In this paper, we made the following contributions: (1) We described, for the first time, a hyperparameter setup for the SWB2000-LSTM speech recognition task with a batch size of 2560, a batch size large enough to enable efficient distributed training. (2) We implemented and compared different distributed learning algorithms for this task. Our system trains a model to WER 7.6% on SWB and WER 13.1% on CH in less than 12 hours. To the best of our knowledge, this is the fastest system that trains this task to this level of accuracy. Our future work includes: (1) implementing wait-free ADPSGD as proposed in [adpsgd] to further improve convergence and runtime; (2) experimenting with hardware with better memory bandwidth, CPU/GPU ratio, and CPU-GPU interconnect; (3) experimenting with different types of speech recognition workloads; (4) exploring other methods to increase the batch size and/or using mixed-precision training as in [lars, tencentimgnet].
7 Related Work
Distributed DL has been applied to speech recognition [deepspeech2, bmuf], computer vision [facebook1hr], language modeling [nvidialmscaling], and machine translation [fbmtscaling] tasks. To reduce the cost of communication, researchers have proposed gradient quantization [msr1bit, naigangnips18] and gradient compression [adacomp, terngrad]. All these works adopt synchronous training, which can become unacceptably slow in a resource-sharing or cloud environment [gadei]. Asynchronous SGD based on the parameter-server architecture is known to have inferior convergence and should be avoided when possible [revisitsyncsgd, facebook1hr, deepspeech2, zhang2016icdm]. This work is the first to apply Asynchronous Decentralized Parallel SGD (ADPSGD), which is theoretically guaranteed to converge at the same rate as SGD [adpsgd], to the challenging SWB2000-LSTM task, achieving state-of-the-art model accuracy in record time.