Neural networks with deep architectures have been the dominant acoustic modeling approach for automatic speech recognition (ASR) in recent years. They have yielded state-of-the-art performance compared to previous technologies based on hidden Markov models (HMMs) and Gaussian mixtures [Hinton_DNNSPM]. In some tasks, Deep Learning (DL) has achieved near human-level ASR performance [saon-interspeech-2017, msr-speech]. It is commonly agreed that the success of DL for ASR relies on the availability of large amounts of training data and high-performance computing. Therefore, distributed DL is not only preferred but also necessary in DL ASR to guarantee fast turnaround time for model training.
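A neural network training algorithm seeks to find a set of parameters that minimizes the discrepancy between the network output $\hat{y}_i$ and the ground truth $y_i$. This is usually accomplished by defining a differentiable cost function $C(\hat{y}_i, y_i)$ and iteratively updating each of the model parameters using some variant of the gradient descent algorithm:
$$\theta(k+1) = \theta(k) - \alpha \nabla\theta(k), \qquad \nabla\theta(k) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial C(\hat{y}_i, y_i)}{\partial \theta(k)} \tag{1}$$
where $\theta(k)$ represents the model parameter at iteration $k$, $\alpha$ is the step size (also known as the learning rate), and $n$ is the batch size.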
Distributed DL systems, the de facto approach to training large DL tasks, typically adopt a Parameter Server (PS) [distbelief] architecture. Figure 1(a) depicts a PS architecture in which each learner calculates gradients and transfers them to the PS. The PS then updates the weights and sends them back to the learners. The timestamp of the PS's weights, a scalar counter, is increased by 1 (i.e., from $i$ to $i+1$) for each update. Staleness [zhang-ijcai-2016] is defined as the difference between the timestamp of the learner's weights that were used to calculate the gradients and the timestamp of the PS's weights. To achieve convergence that matches that of a single-learner system, Synchronous SGD (SSGD) is often used. In SSGD, the PS weight update rule is given in Equation 2, where $\lambda$ denotes the number of learners throughout this paper: each learner calculates gradients and receives updated weights in lockstep with the others. The weights used to calculate the gradients are always the same as those on the PS, thus the staleness is 0.
$$\theta(k+1) = \theta(k) - \alpha \cdot \frac{1}{\lambda}\sum_{l=1}^{\lambda}\nabla\theta^{l}(k) \tag{2}$$
Here $\theta(k)$ denotes the PS weights at iteration $k$, $\alpha$ is the learning rate, and $\nabla\theta^{l}(k)$ is the gradient calculated by learner $l$.
The summation operation in Equation 2 is commutative and associative; this is known as the "Reduce" operation in the High Performance Computing field. AllReduce is the operation that reduces (e.g., sums) all the elements and broadcasts the reduction result to each participant. When the message to be AllReduced is large, as in the DL case, the optimal algorithm maximizes communication bandwidth utilization by breaking a large message into chunks and pipelining the reduction operation with message transfer in a ring topology [fsu-allreduce]. Such an algorithm can finish the AllReduce operation in roughly $2M/\text{Bandwidth}$ time, where $M$ is the size of the message, regardless of the number of participants. Many such implementations exist, most notably [paddle-paddle, nccl, ddl].
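The chunked ring scheme described above can be illustrated with a small single-process simulation. This is a minimal sketch, not any library's implementation: the function name `ring_allreduce` and the in-memory "sends" are assumptions made for illustration, and real systems overlap these steps over a network.

```python
import numpy as np

def ring_allreduce(buffers):
    """Sum equally sized 1-D arrays across p simulated ring participants."""
    p = len(buffers)
    # Each participant splits its message into p chunks (copies, so inputs are untouched).
    chunks = [np.array_split(b.astype(float), p) for b in buffers]

    # Phase 1 (reduce-scatter): in step s, participant r sends chunk (r - s) % p
    # to its right neighbor, which accumulates it. After p - 1 steps,
    # participant r holds the fully reduced chunk (r + 1) % p.
    for step in range(p - 1):
        sends = [((r - step) % p, chunks[r][(r - step) % p].copy()) for r in range(p)]
        for r, (idx, data) in enumerate(sends):
            chunks[(r + 1) % p][idx] += data

    # Phase 2 (all-gather): the reduced chunks travel once more around the ring.
    for step in range(p - 1):
        sends = [((r + 1 - step) % p, chunks[r][(r + 1 - step) % p].copy()) for r in range(p)]
        for r, (idx, data) in enumerate(sends):
            chunks[(r + 1) % p][idx] = data

    return [np.concatenate(c) for c in chunks]

def traffic_per_participant(message_size, p):
    """Data sent by each participant: 2 * (p - 1) chunks of size M / p."""
    return 2 * (p - 1) * message_size / p
```

Each participant sends $2(p-1)$ chunks of size $M/p$, which stays below $2M$ for any number of participants $p$; this is where the roughly $2M/\text{Bandwidth}$ cost comes from.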
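One key drawback of SSGD is that one slow learner can slow down the entire training, which is known as the straggler problem [mapreduce] in distributed computing. To avoid this problem, practitioners proposed Asynchronous SGD (ASGD), which allows each learner to calculate gradients and asynchronously push/pull the gradients/weights to/from the PS. The weight update rule in ASGD is given in Equation 3:
$$\theta(k+1) = \theta(k) - \alpha \nabla\theta^{l}, \qquad l \in \{1, \dots, \lambda\} \tag{3}$$
where $\nabla\theta^{l}$ is the gradient pushed by any one learner $l$ of the $\lambda$ learners, which may have been calculated from weights older than $\theta(k)$.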
Staleness in ASGD is proportional to the number of learners in the system[zhang-ijcai-2016, distbelief] and can severely harm convergence[revisit-sync-sgd, zhang2016icdm]. To achieve the best model accuracy, most distributed deep learning tasks adopt SSGD only[deepspeech2, facebook-1hr, nvidia-lm-scaling, fb-mt-scaling].
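The relationship between staleness and the number of learners can be illustrated with the PS timestamp counter described earlier. The sketch below is a simplified model under stated assumptions: learners take turns in strict round-robin order and pull fresh weights right after pushing. The function names are hypothetical.

```python
def asgd_staleness(num_learners, num_updates):
    """Staleness of each applied gradient when learners push in round-robin order."""
    ps_timestamp = 0
    pulled_at = [0] * num_learners       # timestamp of the weights each learner holds
    staleness = []
    for k in range(num_updates):
        learner = k % num_learners
        # Staleness: PS timestamp minus the timestamp of the weights used for this gradient.
        staleness.append(ps_timestamp - pulled_at[learner])
        ps_timestamp += 1                # the PS applies the gradient: i -> i + 1
        pulled_at[learner] = ps_timestamp  # the learner pulls the fresh weights
    return staleness

def ssgd_staleness(num_learners, num_updates):
    """In lockstep training every gradient is computed from the current PS weights."""
    ps_timestamp = 0
    staleness = []
    for _ in range(num_updates):
        pulled_at = [ps_timestamp] * num_learners   # all learners pull together
        staleness.extend(ps_timestamp - t for t in pulled_at)
        ps_timestamp += 1                # one aggregated update per iteration
    return staleness
```

Under these assumptions the steady-state ASGD staleness is the number of learners minus one, while SSGD staleness stays at zero.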
To avoid the straggler problem and maintain competitive model accuracy, decentralized distributed algorithms [wildfire] have been proposed, in both a synchronous form, Decentralized Parallel SGD (DPSGD) [dpsgd], and an asynchronous form, Asynchronous Decentralized Parallel SGD (ADPSGD) [adpsgd]. The architecture of a decentralized SGD system is depicted in Figure 1(b), where each learner calculates the gradients, updates its weights, and averages its weights with its neighbor in a ring topology. The DPSGD/ADPSGD weight update rule is defined in Equation 4.
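One common way to write a pairwise-averaging update of this kind is sketched below. The notation is assumed here for illustration: $\theta_s(k)$ is the weight vector of learner $s$ at iteration $k$, $\theta_{s'}(k)$ is that of its current ring neighbor, and $\alpha$ is the learning rate. The ordering of the averaging and gradient steps is also an assumption; an implementation may apply the gradient step before averaging.

$$\theta_s(k+1) = \frac{\theta_s(k) + \theta_{s'}(k)}{2} - \alpha \nabla\theta_s(k)$$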
In DPSGD, assuming the pair-wise weight averaging can be executed multiple times after each gradient calculation, all the learners will reach the same weights, so the staleness can be zero. In ADPSGD, since computation is overlapped with communication, the staleness is 1 at best. ADPSGD has shown excellent runtime and convergence performance on computer vision tasks with CNN-type models (e.g., ResNet and VGG) [adpsgd]. In this paper, we demonstrate that ADPSGD also achieves excellent runtime and convergence performance on the SWB2000 speech recognition task with an LSTM model.
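The claim that repeated pair-wise averaging drives all learners to the same weights can be checked numerically. The following is a minimal sketch assuming an even number of learners on a ring and an alternating pairing pattern; `pairwise_average_round` and `run_consensus` are hypothetical helper names, not part of any cited system.

```python
import numpy as np

def pairwise_average_round(weights, offset):
    """Average weights between disjoint ring neighbors (i, i + 1), starting at `offset`.

    `weights` has shape (num_learners, dim); num_learners must be even so that
    the pairs do not overlap.
    """
    n = len(weights)
    if n % 2:
        raise ValueError("number of learners must be even")
    out = weights.copy()
    for i in range(offset, offset + n, 2):
        a, b = i % n, (i + 1) % n
        out[a] = out[b] = 0.5 * (weights[a] + weights[b])
    return out

def run_consensus(weights, rounds):
    """Alternate between the two pairings of the ring for a number of rounds."""
    for r in range(rounds):
        weights = pairwise_average_round(weights, offset=r % 2)
    return weights
```

Each round preserves the global mean of the weights, and with enough rounds every learner converges to that mean, which corresponds to the zero-staleness case described above.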
3 Design and Implementation
We describe how to increase the batch size to enable efficient distributed computing for SWB2000-LSTM in Section 3.1. We then describe the design and implementation for different distributed learning strategies in Section 3.2.
3.1 Increase Batch Size
A sufficiently large batch size is necessary for efficient distributed DL for two reasons: (1) the larger the batch, the more learners can be used; (2) gradient computation is more efficient with a larger batch size. In Figure 2(a), blue bars show the computation time per epoch under different batch sizes for the SWB2000-LSTM task (the LSTM model used in this paper is described in Section 4.2), measured on a P100 GPU. It takes 8.58 hrs to finish one epoch with batch size 256, as compared to 18.33 hrs with batch size 32. Furthermore, a smaller batch size requires more frequent communication. In Figure 2(a), orange bars show the minimum bandwidth required to transfer the gradients so that communication time and computation time break even. Batch size 32 per learner requires almost 4X the bandwidth (4.98GB/s) of batch size 256 (1.33GB/s). Conventional wisdom for the SWB2000-LSTM task is that a batch size larger than 256 significantly lowers model accuracy [saon-interspeech-2017]. The hyperparameter setup for the batch size 256 configuration is as follows: the learning rate is 0.1, the momentum is 0.9, and the learning rate is annealed by a fixed factor every epoch starting from the 11th epoch. Training finishes in 16 epochs. Inspired by [facebook-1hr], we are able to increase the batch size from 256 to 2560 without decreasing model accuracy by (1) linearly warming up the learning rate from 0.1 to 1 in the first 10 epochs and (2) annealing the learning rate by a fixed factor from the 11th epoch. Figure 2(b) plots the held-out loss with respect to epochs for batch size 256 and batch size 2560; they are indistinguishable by epoch 16.
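The large-batch schedule described above (linear warm-up followed by per-epoch annealing) can be sketched as a small function. Several details are assumptions made for illustration: epochs are 1-indexed, the warm-up reaches the peak rate exactly at epoch 10, and the annealing factor is left as a required argument (`anneal_factor`, a hypothetical name) because its value is not given in this text.

```python
def learning_rate(epoch, anneal_factor, base_lr=0.1, peak_lr=1.0, warmup_epochs=10):
    """Learning rate for a 1-indexed epoch: linear warm-up, then per-epoch annealing.

    Requires warmup_epochs > 1. `anneal_factor` is supplied by the caller.
    """
    if epoch < 1:
        raise ValueError("epoch is 1-indexed")
    if epoch <= warmup_epochs:
        # Linear interpolation: epoch 1 -> base_lr, epoch warmup_epochs -> peak_lr.
        frac = (epoch - 1) / (warmup_epochs - 1)
        return base_lr + frac * (peak_lr - base_lr)
    # From epoch warmup_epochs + 1 onward, multiply by anneal_factor once per epoch.
    return peak_lr * anneal_factor ** (epoch - warmup_epochs)

# Example: the full 16-epoch schedule with an illustrative factor of 0.5.
schedule = [learning_rate(e, anneal_factor=0.5) for e in range(1, 17)]
```

The factor 0.5 in the example is purely illustrative; the shape of the schedule is the point of the sketch.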
In the ImageNet-ResNet task, a batch size of 32 takes about 0.18 sec to compute on one P100 GPU, whereas the same batch size for the SWB2000-LSTM task takes only 0.067 sec. Moreover, the ImageNet-ResNet model size is about 100MB, whereas the SWB2000-LSTM model size is 165MB. Relative to its computation time, SWB2000-LSTM must therefore communicate about 4.4x more data (a 1.65x larger model and an about 2.7x shorter computation time), which makes it roughly 5x more challenging to parallelize than the ImageNet-ResNet task.
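The comparison above can be reproduced as a short calculation. This is a sketch under the assumption that parallelization difficulty scales with the amount of model state exchanged per second of gradient computation; the function name is hypothetical.

```python
def comm_to_compute_ratio(model_mb, seconds_per_batch):
    """Megabytes of model state to exchange per second of gradient computation."""
    return model_mb / seconds_per_batch

# Figures quoted in the text for batch size 32 on one P100 GPU.
resnet = comm_to_compute_ratio(model_mb=100, seconds_per_batch=0.18)
lstm = comm_to_compute_ratio(model_mb=165, seconds_per_batch=0.067)

# How much more communication-bound the LSTM task is than the ResNet task.
relative_difficulty = lstm / resnet
```

With the quoted numbers, `relative_difficulty` comes out at roughly 4.4, in line with the "about 5x" characterization.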
3.2 System Design
Three distributed SGD algorithms are considered and implemented: Synchronous (SYNC), Asynchronous Decentralized Parallel SGD (ADPSGD), and a hybrid of the two (HYBRID). In Figure 3, we sketch how our system APIs (in bold font) integrate with the underlying DL framework. We assume the underlying DL framework provides the following functionalities: g=calGrad(W), which calculates the gradients g based on the weights W, and W'=apply(W,g), which applies the gradients g to the weights W and returns the updated weights W'.
SYNC: Summing the weights and taking their average after every iteration, as shown in Figure 3(a), is equivalent to updating the weights with the averaged gradients. We use the fastest Allreduce implementation available to us (DDL-Allreduce [ddl]) to implement the SYNC strategy. As we will show in Section 5.2, DDL-Allreduce is 1.2X-3X faster than the open-source MPI_Allreduce implementation in OpenMPI [openmpi].
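The equivalence between averaging weights and averaging gradients can be checked numerically. The sketch below assumes plain SGD without momentum and that all learners start the iteration from identical weights; `cal_grad` and `apply` are toy stand-ins for the framework's calGrad and apply functions, here for a linear least-squares model.

```python
import numpy as np

def cal_grad(W, X, y):
    """Gradient of the mean squared error of a linear model (toy stand-in for calGrad)."""
    return 2.0 * X.T @ (X @ W - y) / len(y)

def apply(W, g, lr=0.1):
    """Plain SGD step (toy stand-in for apply)."""
    return W - lr * g

rng = np.random.default_rng(0)
W = rng.normal(size=3)                                   # identical starting weights
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8))  # one data shard per learner
          for _ in range(4)]

# Option A: every learner takes a local step, then the weights are averaged.
avg_of_weights = np.mean([apply(W, cal_grad(W, X, y)) for X, y in shards], axis=0)

# Option B: the gradients are averaged first, then one step is taken.
avg_grad = np.mean([cal_grad(W, X, y) for X, y in shards], axis=0)
step_with_avg_grad = apply(W, avg_grad)
```

The two results agree because the SGD step is linear in the gradient; the identity would not hold if learners started from different weights.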
ADPSGD: Figure 4 shows the system architecture of ADPSGD. Assuming $\lambda$ learners in a system ($\lambda$ is an even number), we designate learners with odd ids as senders and learners with even ids as receivers. This bipartition of the communication graph guarantees acyclic communication and is therefore deadlock-free. Each sender communicates with its left and right neighbors in alternate iterations. Each sender process runs two threads: a main thread and a communication thread. The main thread calculates the gradients and applies the weight updates. When a new set of weights is generated, the sender signals its communication thread to send them to the current neighbor, receive that neighbor's weights, and then update its own weights to the average of the two. Additionally, as required by the proof in [adpsgd], the weight matrix of the learners needs to be doubly stochastic and symmetric, which implies that the weight update on the GPU and the weight averaging on the CPU must not interleave, so that the receiver has a consistent view of the sender's weights. We enforce this atomicity via a condition variable [mesa]. Similarly, each receiver runs a main thread and a communication thread. In each iteration, a receiver's main thread calculates gradients and then updates its weights in an atomic region. Meanwhile, the receiver's communication thread waits until weights are received from a neighbor. Then, the communication thread does the following in an atomic region: (i) sends its weights to the engaging neighbor and (ii) updates its weights by averaging them with the received weights. The atomicity is enforced via a mutex lock [ArpaciDusseau14-Book]. Compared to the implementation in [adpsgd], this implementation updates weights on the GPU, which runs faster, and no gradient information needs to be extracted from the underlying solver. The disadvantage is that each send is triggered only when new gradients are calculated, so weights cannot be exchanged more frequently even when the network is free. Also, GPU weight updates need to wait if the weights are being changed in the communication thread.
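The sender/receiver pairing and the doubly stochastic, symmetric requirement can be made concrete with a small sketch. It assumes 0-indexed learner ids on a ring and that senders use the left neighbor on even iterations and the right neighbor on odd ones; that direction convention and both function names are hypothetical.

```python
import numpy as np

def partner_of_sender(sender_id, iteration, num_learners):
    """Ring neighbor an odd-id sender talks to: left on even iterations, right on odd."""
    step = -1 if iteration % 2 == 0 else 1
    return (sender_id + step) % num_learners

def mixing_matrix(iteration, num_learners):
    """Averaging matrix for one iteration: each sender/receiver pair averages its weights."""
    if num_learners % 2:
        raise ValueError("num_learners must be even")
    W = np.zeros((num_learners, num_learners))
    for s in range(1, num_learners, 2):          # odd ids are senders
        r = partner_of_sender(s, iteration, num_learners)
        W[s, s] = W[r, r] = W[s, r] = W[r, s] = 0.5
    return W
```

Because the number of learners is even, every neighbor of an odd id is an even id, so each receiver is engaged by exactly one sender per iteration and the resulting matrix is symmetric with rows and columns summing to one.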
HYBRID: Note that the communication threads in ADPSGD essentially run an Allreduce over a pair of neighboring learners. By replacing the point-to-point message passing with an Allreduce over all learners, we can leverage the optimized fast Allreduce implementation and also minimize the weight discrepancy among different learners. In essence, push_weights(W') signals the communication thread to conduct an Allreduce, which runs concurrently with the gradient calculation, and pull_weights simply retrieves the Allreduce-d results from the last push and takes an average. The handshaking between the communication thread and the main thread is a fast lock-free implementation [dyce]. When gradient calculation takes longer than the Allreduce, this scheme should completely overlap communication with computation.
| | SYNC | HYBRID | ADPSGD |
|---|---|---|---|
| Staleness | 0 | 1 | at best 1 |
Table 1 summarizes the runtime and staleness comparison between different algorithms.
4.1 Software and Hardware
PyTorch 0.5.0 is the underlying DL framework. We use the CUDA 9.2 compiler, the CUDA-aware OpenMPI 3.1.1, and g++ 4.8.5 compiler to build our communication library, which connects with PyTorch via a Python-C interface.
We develop and evaluate our systems on a production cluster, ClusterA, which has 4 servers in total. Each server is equipped with a 14-core Intel Xeon E5-2680 v4 2.40GHz processor, 1TB main memory, and 4 P100 GPUs. We also run a 32-GPU experiment on a 4-server high-GPU-density experimental cluster, ClusterB. Each ClusterB server is equipped with an 18-core Intel Xeon E5-2697 2.3GHz processor, 1TB main memory, and 8 V100 GPUs. In both clusters, the servers are connected by 100Gbit/s Ethernet, and GPUs and CPUs are connected via a PCIe Gen3 bus, which has a 16GB/s peak bandwidth in each direction.
4.2 DL Models and Dataset
The acoustic model is an LSTM with 6 bi-directional layers. Each layer contains 1,024 cells (512 cells in each direction). On top of the LSTM layers, there is a linear projection layer with 256 hidden units, followed by a softmax output layer with 32,000 units corresponding to context-dependent HMM states. The LSTM is unrolled over 21 frames and trained with non-overlapping feature subsequences of that length. The feature input is a fusion of FMLLR (40-dim), i-Vector (100-dim), and logmel with its delta and double delta (40-dim × 3).
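As a sanity check on the architecture described above, its parameter count can be estimated with a few lines of arithmetic. This is a sketch under stated assumptions: a PyTorch-style LSTM parameterization (four gates, input and recurrent weight matrices, and two bias vectors per direction), a fused input of 40 + 100 + 120 = 260 dimensions, biases on both linear layers, and 4-byte single-precision parameters.

```python
def lstm_direction_params(input_size, hidden_size):
    """Parameters of one LSTM direction: 4 gates, input and recurrent weights, two biases."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

def linear_params(in_features, out_features):
    """Weights plus bias of a fully connected layer."""
    return in_features * out_features + out_features

feature_dim = 40 + 100 + 40 * 3    # FMLLR + i-Vector + logmel with delta and double delta
hidden = 512                       # cells per direction
layers = 6

# First bi-directional layer reads the fused features.
total_params = 2 * lstm_direction_params(feature_dim, hidden)
# Layers 2-6 read the concatenated forward/backward outputs of the previous layer.
total_params += (layers - 1) * 2 * lstm_direction_params(2 * hidden, hidden)
total_params += linear_params(2 * hidden, 256)     # linear projection layer
total_params += linear_params(256, 32000)          # softmax output layer

size_mib = total_params * 4 / 2**20                # 4 bytes per fp32 parameter
```

Under these assumptions the model has about 43.2 million parameters, or roughly 164.6 MiB, which is close to the 165MB model size quoted in Section 3.1.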
The language model (LM) is rebuilt using publicly available training data, including Switchboard, Fisher, Gigaword, and Broadcast News and Conversations. Its vocabulary has 85K words and it has 36M 4-grams.
5 Experimental Results
5.1 Convergence Results
Our LSTM baseline trained on a single GPU (batch size 256) gives a WER of 7.5%/13.0% on the Switchboard/CallHome (SWB/CH) portions of the NIST Hub5 2000 evaluation test set after Cross-Entropy training, which is a competitive baseline. We compare this baseline with SYNC, HYBRID, and ADPSGD in Figure 5(a) for held-out loss and in Table 2 for WER.
5.2 Runtime Results
Figure 5(b) plots the speed-up for different algorithms with up to 16 GPUs on ClusterA. SYNC-OpenMPI is the slowest. ADPSGD achieves the best speed-up (about 11X over 16 GPUs) and finishes the training in 13.98 hours. ADPSGD does not achieve linear speed-up because it requires the CPU-based weight averaging and the GPU weight update to occur atomically, which could be remedied by offloading the GPU weight update to the CPU. HYBRID does not outperform SYNC-DDL significantly, even though the computation is long enough to hide the communication. This is because HYBRID asynchronously calls DDL, which relies on the NVIDIA NCCL library [nccl] for the intra-server reduction. NCCL competes heavily with training for GPU resources (e.g., stream processors and memory) when used asynchronously.
Table 3 shows the speed-up measured for each algorithm when one of the 16 GPUs slows down. ADPSGD is immune to the straggler problem, whereas the speed-up of the other algorithms quickly diminishes. Figure 6 shows a snapshot of the number of minibatches processed by each GPU in one epoch when half of ClusterA is shared with other users. ADPSGD automatically balances the workload per GPU, whereas SYNC and HYBRID would force each GPU to process the same number of minibatches in this scenario.
5.3 Experiments on 32 GPUs
We ran an experiment on ClusterB with 32 GPUs and batch size 80 per GPU (for reference, it takes 195 hours to finish training the SWB2000 task on a single V100 GPU with batch size 80). SYNC-DDL, HYBRID, and ADPSGD complete one epoch in 0.75 hrs (16.25x speed-up), 0.71 hrs (17.17x speed-up), and 0.83 hrs (14.69x speed-up), respectively. HYBRID trains SWB2000 to reach WER 7.6% on SWB and WER 13.1% on CH in 11.5 hrs.
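The speed-up figures above can be cross-checked with simple arithmetic. This sketch assumes that the 195-hour single-GPU run uses the same 16-epoch schedule described in Section 3.1; the per-epoch times are the rounded values quoted in the text, so small rounding differences are expected.

```python
single_gpu_total_hr = 195.0     # full training on one V100 with batch size 80
epochs = 16                     # assumption: same 16-epoch schedule as in Section 3.1
single_gpu_epoch_hr = single_gpu_total_hr / epochs

# Per-epoch times on 32 GPUs, as quoted in the text.
epoch_hr = {"SYNC-DDL": 0.75, "HYBRID": 0.71, "ADPSGD": 0.83}

# Speed-up relative to a single GPU.
speedup = {name: single_gpu_epoch_hr / t for name, t in epoch_hr.items()}

# Total HYBRID training time implied by its per-epoch time.
hybrid_total_hr = epoch_hr["HYBRID"] * epochs
```

Under this assumption the computed speed-ups are about 16.25x, 17.17x, and 14.68x, matching the reported values to within rounding, and the implied HYBRID training time of about 11.4 hours is consistent with the reported 11.5 hrs.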
ADPSGD requires more CPU resources than SYNC and HYBRID to average and pass weights. ClusterB has a less favorable CPU/GPU ratio. Furthermore, 8 GPUs share one PCIe bus on ClusterB, and ADPSGD quickly saturates its 16GB/s bandwidth. ADPSGD would run significantly faster if deployed on clusters with a higher CPU/GPU ratio, higher main memory bandwidth, and/or an advanced GPU-CPU interconnect (e.g., NVLink [nvlink]).
6 Conclusion and Future work
In this paper, we made the following contributions: (1) We described the hyper-parameter setup for the SWB2000-LSTM speech recognition task using a batch size of 2560, which is sufficiently large to enable efficient distributed training. (2) We implemented and compared different distributed learning algorithms for this task. Our system trains a model to WER 7.6% on SWB and WER 13.1% on CH in less than 12 hours. To the best of our knowledge, this is the fastest system that trains this task to this level of accuracy. Our future work includes: (1) implementing wait-free ADPSGD as proposed in [adpsgd] to further improve convergence and runtime; (2) experimenting with hardware with better memory bandwidth, CPU/GPU ratio, and CPU-GPU interconnect; (3) experimenting with different types of speech recognition workloads; and (4) exploring other methods to increase the batch size and/or using mixed-precision training as in [lars, tencent-imgnet].
7 Related Work
Distributed DL has been applied to speech recognition [deepspeech2, bmuf], computer vision [facebook-1hr], language modeling [nvidia-lm-scaling], and machine translation [fb-mt-scaling] tasks. To reduce the cost of communication, researchers have proposed gradient quantization [msr-1bit, naigang-nips18] and gradient compression [adacomp, terngrad]. All these works adopt a synchronous training method, which would become unacceptably slow in a resource-sharing or Cloud environment [gadei]. Asynchronous SGD, based on the parameter-server architecture, is known to have inferior performance and should be avoided when possible [revisit-sync-sgd, facebook-1hr, deepspeech2, zhang2016icdm]. This work is the first to apply Asynchronous Decentralized Parallel SGD (ADPSGD), which is theoretically guaranteed to converge at the same rate as SGD [adpsgd], to the challenging SWB2000-LSTM task, achieving state-of-the-art model accuracy in record time.