There is great interest in automatic speech recognition (ASR) systems because of the success of deep learning [1, 2, 3, 4, 5] and the popularization of speech interfaces, e.g., smartphones and smart speakers. In practice, rapid execution of ASR decoding is essential for a better user experience. Reduction of sequence length [6, 5, 7] and parallel computing [8, 9, 10] have mainly been investigated for rapid computation of likelihoods and efficient traversal of the search space.
Beam search is a breadth-first search algorithm that restricts the search space to reduce computational complexity in both memory and execution time. During the search, hypotheses are expanded from a root node, and the expanded nodes at each depth level (or time step in time-synchronous beam search for ASR) are stored in a FIFO (First-In First-Out) queue for further expansion at the next depth level. Thread parallelism and GPU-based execution accelerate matrix multiplication and element-wise operations. However, the loop over hypotheses still remains, and the decoder network needs to be executed once per hypothesis in the case of an attention-based encoder-decoder network. Therefore, there is room to improve recognition time by concatenating hypotheses and processing them in a batch.
Dixon et al. [8] proposed GPU-based computation of acoustic scores, and Chong et al. [9] and Chen et al. [10] further extended the search algorithm by executing graph traversal on the GPU. These studies focused on efficient computation of WFST (Weighted Finite-State Transducer) based decoding.
Different from these earlier works, we focus on a faster beam search algorithm for end-to-end attention-based encoder-decoder networks. We first vectorize the beam-size hypotheses and compute the posterior probabilities for hypothesis expansion at the next time step in a batch. This eliminates the for-loop over the beam size that was originally managed by a FIFO (First-In First-Out) queue. Next, we vectorize multiple input speech utterances to reduce the for-loop over the input speech data. Unlike during training, this is not trivial, because several pruning and thresholding techniques are applied per utterance for efficient decoding. During beam search, the encoder network generates the hidden vectors of all utterances at once, and the attention network and the decoder network process the hypotheses in a batch. This algorithm is executable on both CPU and GPU without significant code modification. In the experiments, we evaluate the effectiveness of the hypothesis and speech vectorization method assuming the following two scenarios:
Online decoding: hypotheses are vectorized to eliminate the loop for hypothesis traversal. Vectorization of hypotheses enables execution of the attention and decoder networks for all hypotheses in a batch. (Footnote: this is not a pure time-synchronous beam search because we used an attention mechanism and a bidirectional LSTM in the experiment. The proposed search algorithm is applicable to other online neural network architectures for pure time-synchronous beam search.)
Offline decoding: multiple speech utterances are additionally vectorized and processed together. Vectorization of utterances and hypotheses enables execution of the encoder network for all utterances in a batch, and of the attention and decoder networks for all hypotheses in a batch.
The rest of the paper is organized as follows. The original implementation of beam search is described in Section 2. Our contributions, vectorization of hypotheses and utterances, are described in Section 3. Experiments using the Librispeech corpus and the CSJ corpus are presented in Section 4, followed by conclusions in Section 5.
2 Beam search
Let be a set of hypotheses in the FIFO-queue at decoding time step . Hypothesis has its own label history accumulated up to time step :
where denotes the -th output label of in distinct output label set .
At next time step , the decoder network generates new labels with its posterior probabilities which leads to hypotheses. Let be a set of indices for output labels, and for current hypotheses. Then the hypotheses at next time step are stored in a queue as the following equation:
Each hypothesis has a score which is an accumulation of log posterior probability up to decoding time step , and it is updated by adding the output of decoder network:
where is the probability of label calculated by output of the decoder network. Let be a set of posterior probabilities generated by the decoder network and be the posterior probability of -th label. In this paper, we follow the notation in :
where is the decoder state and is the context vector. Please refer to  for detail.
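The expansion and score accumulation described above can be illustrated with a minimal sketch. It assumes a single utterance, and `toy_decoder` is a hypothetical stand-in for the attention-based decoder network (it is not part of the paper or of ESPnet); the dictionary keys are likewise illustrative.

```python
import math


def toy_decoder(prev_label, state):
    """Hypothetical stand-in for the decoder network.

    Returns log posteriors over 3 labels and an updated internal state.
    """
    probs = [0.1, 0.2, 0.7] if prev_label == 0 else [0.6, 0.3, 0.1]
    return [math.log(p) for p in probs], state + 1


def expand(queue):
    """Expand every hypothesis in the queue by every output label."""
    expanded = []
    for hyp in queue:  # one decoder call per hypothesis
        log_probs, new_state = toy_decoder(hyp["labels"][-1], hyp["state"])
        for label, log_prob in enumerate(log_probs):
            expanded.append({
                "labels": hyp["labels"] + [label],
                # accumulated score: previous score plus new log posterior
                "score": hyp["score"] + log_prob,
                "state": new_state,
            })
    return expanded


queue = [
    {"labels": [0], "score": -0.1, "state": 0},
    {"labels": [2], "score": -0.4, "state": 0},
]
expanded = expand(queue)  # 2 hypotheses x 3 labels = 6 candidates
```

The outer `for hyp in queue` loop is the per-hypothesis loop that the vectorization in Section 3 is meant to remove.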
To reduce the search space, the expanded hypotheses are pruned at each time step. In the experiment, we pruned the hypotheses in a two-step procedure: local pruning and global pruning. In local pruning, the log probabilities computed by the decoder network at time step are sorted in descending order, and the top probabilities are selected as candidates. When we define the function that selects the top -candidates with their indices from the set of hypotheses as , the local pruning is represented as:
resulting hypotheses (and corresponding accumulated scores), where is a set of selected indices. At the global pruning, they are further pruned to hypotheses as:
Other search parameters, e.g., labels and cells in recurrent connections, are pruned for next time step by tracking the indices. When we define this function as , the hypotheses, for example, is represented as:
The decoder network in Eq. (6) takes the previous label information at time step to output the posterior probabilities at time step . In addition to the previous label, networks with recurrent connections have internal states (e.g., in Eq. (6), in Eq. (7), and the attention weight) that will be used in future time steps. These states also need to be pruned in the same way as the hypotheses. At the implementation level, each hypothesis is represented as a dictionary consisting of these states and is stored in the FIFO queue to reduce the execution of Eq. (10).
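A minimal sketch of the two-step pruning for one utterance follows, with each hypothesis held as a dictionary so that its internal state travels with it. The function names, the dictionary keys, and the string-valued states are illustrative assumptions, not the authors' implementation.

```python
import heapq
import math


def local_prune(log_probs, k_local):
    """Top-k_local (label, log_prob) pairs of a single hypothesis."""
    return heapq.nlargest(k_local, enumerate(log_probs), key=lambda p: p[1])


def prune_step(queue, decoder_outputs, k_local, beam_size):
    """Local pruning per hypothesis, then global pruning over all candidates."""
    candidates = []
    for hyp, (log_probs, new_state) in zip(queue, decoder_outputs):
        for label, log_prob in local_prune(log_probs, k_local):
            candidates.append((hyp["score"] + log_prob, hyp, label, new_state))
    best = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    # Surviving hypotheses keep the history and state of their parent.
    return [
        {"labels": hyp["labels"] + [label], "score": score, "state": state}
        for score, hyp, label, state in best
    ]


queue = [
    {"labels": [5], "score": -0.5, "state": "s0"},
    {"labels": [7], "score": -1.0, "state": "s1"},
]
decoder_outputs = [
    ([math.log(p) for p in (0.1, 0.6, 0.2, 0.1)], "s0_next"),
    ([math.log(p) for p in (0.7, 0.15, 0.1, 0.05)], "s1_next"),
]
new_queue = prune_step(queue, decoder_outputs, k_local=2, beam_size=2)
```

With these toy values, four candidates survive local pruning and two survive global pruning. Because the state is stored inside each dictionary, no separate index-tracking step is needed in this loop-based form.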
3 Hypotheses and speech vectorization
In this section, we reformulate the beam search algorithm in Section 2 by vectorizing the hypotheses and eliminating the loop with regard to beam size . We further batch multiple utterances to reduce the computation time, assuming the offline decoding scenario. In the online decoding scenario, the batch size is set to 1. Figure 1 shows an overview of the proposed hypothesis expansion and pruning techniques at time step .
For this purpose, we vectorize each element of the dictionary consisting of the internal states described in Section 2.2. At time step , the previous labels are defined as a vector of "start-of-sequence" symbols:
and the accumulated scores are defined as:
The size of each vector is given in square brackets. By concatenating utterances, the encoder network can compute the hidden representations for all utterances at once. The output of the encoder network is then duplicated for each hypothesis to match the number of hypotheses. Then, the decoder network computes the posterior probabilities for all beam hypotheses of all utterances in a batch. Let be the calculated posterior probabilities for candidates. The attention-based decoder network in Eqs. (6) and (7) is replaced as:
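The initialization and the duplication of the encoder output can be pictured with the following NumPy sketch. The names `B` (utterances), `W` (beam size), `T`, `D`, and `SOS` are illustrative, and random numbers stand in for the encoder output. The initial score vector is an assumption: only the first beam slot of each utterance is given score 0, a common way to keep identical copies of the start hypothesis from filling the beam; the paper's own definition is not reproduced here.

```python
import numpy as np

B, W = 2, 3   # utterances in the batch and beam size (toy values)
T, D = 4, 5   # encoder frames and hidden size (toy values)
SOS = 0       # assumed id of the start-of-sequence symbol

rng = np.random.default_rng(0)
enc_out = rng.standard_normal((B, T, D))  # stand-in for the encoder output

# Previous labels: one start-of-sequence symbol per hypothesis slot, size [B*W].
prev_labels = np.full(B * W, SOS, dtype=np.int64)

# Accumulated scores, size [B*W].
# Assumption: only the first slot of each utterance is "live" at the start.
scores = np.full((B, W), -np.inf)
scores[:, 0] = 0.0
scores = scores.reshape(B * W)

# Duplicate each utterance's encoder output W times: [B, T, D] -> [B*W, T, D].
# Rows are ordered utterance-major: (u0, u0, u0, u1, u1, u1).
enc_dup = np.repeat(enc_out, W, axis=0)
```

With this layout, row `b * W + w` of every vectorized quantity belongs to hypothesis `w` of utterance `b`, so a single decoder call can process all `B * W` rows.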
After the expansion of hypotheses, local pruning is applied to reduce the number of hypotheses from to for all hypotheses and utterances. We define this function as where is the number of top- candidates to return, and is the target index of selection (PyTorch supports this function as torch.topk). The selected log probabilities are added to the accumulated score. To match the dimensions of the log probabilities and the accumulated score, we duplicated the accumulated score up to by introducing a new axis:
The accumulated score at time step is:
where is the indices of top- output label candidates of hypotheses of utterances. The accumulated score is re-sized to for global pruning targeting candidates for utterances. The global pruning is represented as:
The duplicated labels are pruned and concatenated to update the hypotheses:
where is the operation for element-wise concatenation of accumulated label history and the current label.
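One vectorized expansion and pruning step, as described in this section, can be sketched in NumPy as follows. This is an illustration under assumptions rather than the ESPnet implementation: `topk` is a small helper written with `np.argsort` in place of `torch.topk`, rows are laid out utterance-major, and all names and toy values are hypothetical.

```python
import numpy as np


def topk(x, k):
    """Top-k values and indices along the last axis, best first."""
    idx = np.argsort(-x, axis=-1, kind="stable")[..., :k]
    return np.take_along_axis(x, idx, axis=-1), idx


def vectorized_step(log_probs, scores, histories, n_utts, beam, k_local):
    """log_probs: [B*W, V], scores: [B*W], histories: [B*W, L]."""
    # Local pruning: [B*W, V] -> [B*W, K].
    local_scores, local_labels = topk(log_probs, k_local)
    # Add the accumulated score through a new axis: [B*W, 1] + [B*W, K].
    cand_scores = scores[:, None] + local_scores
    # Resize to [B, W*K] so that pruning is done per utterance.
    cand_scores = cand_scores.reshape(n_utts, beam * k_local)
    cand_labels = local_labels.reshape(n_utts, beam * k_local)
    # Global pruning: [B, W*K] -> [B, W].
    new_scores, best = topk(cand_scores, beam)
    parent = best // k_local  # parent hypothesis index inside the utterance
    labels = np.take_along_axis(cand_labels, best, axis=-1)
    # Gather parent histories by flat index and append the new labels.
    flat_parent = (np.arange(n_utts)[:, None] * beam + parent).reshape(-1)
    new_histories = np.concatenate(
        [histories[flat_parent], labels.reshape(-1, 1)], axis=1)
    return new_scores.reshape(-1), new_histories, flat_parent


# Toy example: B = 2 utterances, W = 2 hypotheses each, V = 4 labels, K = 2.
log_probs = np.log(np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.70, 0.15, 0.10, 0.05],
    [0.30, 0.20, 0.40, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]))
scores = np.array([-0.5, -1.0, -0.2, -0.3])
histories = np.array([[0, 2], [0, 3], [0, 1], [0, 2]])

new_scores, new_histories, flat_parent = vectorized_step(
    log_probs, scores, histories, n_utts=2, beam=2, k_local=2)
```

The returned `flat_parent` plays the role of the tracked indices: the same gather can be applied to any other per-hypothesis state (decoder cells, attention weights) so that it stays aligned with the surviving hypotheses.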
3.2 Shallow fusion of external modules
During beam search, the scores of an RNNLM (recurrent neural network language model) and the CTC prefix score are integrated by shallow fusion. ESPnet combines these scores, and the final log probability, , is defined as a weighted sum of the CTC prefix score (), the decoder network score (), and the RNNLM score ():
where and are hyper-parameters that control the contribution of each score. Please refer to  for further detailed explanation. Eq. (13) is rewritten as Eq. (22) to combine the scores of RNNLM and CTC.
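The weighted sum can be sketched as below. The exact form is an assumption: it follows the interpolation commonly used in joint CTC/attention decoding, where one weight balances the CTC and decoder scores and a second weight scales the language model score. The parameter names `ctc_weight` and `lm_weight` and the toy values are hypothetical and are not the settings used in the experiments.

```python
def fuse_scores(log_p_ctc, log_p_att, log_p_lm, ctc_weight, lm_weight):
    """Shallow fusion of per-label log scores (assumed interpolation form).

    fused = ctc_weight * ctc + (1 - ctc_weight) * att + lm_weight * lm
    """
    return [
        ctc_weight * ctc + (1.0 - ctc_weight) * att + lm_weight * lm
        for ctc, att, lm in zip(log_p_ctc, log_p_att, log_p_lm)
    ]


# Toy log scores for two candidate labels.
fused = fuse_scores(
    log_p_ctc=[-1.0, -2.0],
    log_p_att=[-2.0, -0.5],
    log_p_lm=[-3.0, -1.0],
    ctc_weight=0.5,
    lm_weight=0.1,
)
```

In the vectorized search, the same expression would be applied to whole score arrays before local pruning, so fusion adds no extra loop over hypotheses.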
4.1 Experimental setup
We used English and Japanese speech corpora, Librispeech [14] and CSJ (Corpus of Spontaneous Japanese) [15, 16] (recipes are available in ESPnet [13]). As input features, we used 80-dimensional log Mel filterbank coefficients and pitch features with their delta and delta-delta features (80+3=83 dimensions) extracted using Kaldi tools [17]. Joint CTC/attention-based encoder-decoder networks were trained using PyTorch [18].
On the Librispeech corpus, we used an 8-layer BLSTM as the encoder network. The 2nd and 3rd layers from the bottom of the encoder network subsample the hidden vectors by a factor of 2. Each BLSTM layer has 320 cells in each direction and is followed by a linear projection layer with 320 units to combine the forward and backward LSTM outputs. The decoder network has a 1-layer LSTM with 300 cells. The number of labels was set to 29, including alphabetic characters and special tokens. On the CSJ corpus, we used a 4-layer BLSTM as the encoder network with the subsampling technique. Each BLSTM layer has 1024 cells in each direction and is followed by a linear projection layer with 320 units to combine the forward and backward LSTM outputs. The decoder network has a 1-layer LSTM with 1024 cells. The number of labels was set to 3,260, including Japanese Kanji/Hiragana/Katakana characters and special tokens.
Beam search was performed using an Intel Xeon Processor E5-2667 v3 for CPU-based search and a Tesla K80 for GPU-based search. As evaluation sets, we used 1,000 randomly selected utterances ( minutes) from the Librispeech corpus and evaluation set 1 ( minutes) from the CSJ corpus.
4.1.1 Search parameters
In the case of shallow fusion, we used and on both Librispeech and CSJ. The beam size was set to 20 in decoding under all conditions. For recognition without vectorization, we used thread parallelism and process parallelism to accelerate decoding. In the case of thread parallelism, we controlled an environment variable and activated OpenMP. We did not change the other parameters and left them to the PyTorch back end. In the case of process parallelism, the test data is split into multiple subsets, and each subset is recognized independently in parallel using multiple CPU cores.
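The two baseline parallelization schemes can be illustrated with a short sketch. The paper does not name the environment variable; `OMP_NUM_THREADS`, the standard OpenMP setting, is assumed here. The helper `split_into_subsets` and the utterance ids are hypothetical.

```python
import os


def split_into_subsets(utterance_ids, n_procs):
    """Round-robin split of the test set into n_procs disjoint subsets."""
    return [utterance_ids[i::n_procs] for i in range(n_procs)]


# Thread parallelism (assumption: the standard OpenMP variable is used).
# It typically has to be set before the numerical library is loaded.
os.environ["OMP_NUM_THREADS"] = "8"

# Process parallelism: each subset would be decoded by its own process,
# for example one decoding job per CPU core.
utterance_ids = [f"utt{i:04d}" for i in range(10)]
subsets = split_into_subsets(utterance_ids, n_procs=4)
```

Process parallelism scales with the number of cores but repeats model loading in every process, whereas the vectorized search keeps a single process and batches the work instead.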
4.2 Online decoding scenario
Table 1 shows the recognition time (minutes) and the real-time factor on Librispeech when hypotheses are vectorized, assuming the online scenario. The first row shows the recognition time of the conventional beam search algorithm using the attention decoder (ATT) and the attention decoder with RNNLM (+RNNLM). "batch" is the number of utterances concatenated in a batch, "threads" is the number of threads used for thread parallelism, and "procs" is the number of CPU cores used for process parallelism. In the case of ATT, the recognition time of the original beam search was 318.5 minutes, and it decreased to 85.0 minutes by vectorizing the 20 beam hypotheses on the CPU. It further decreased to 30.4 minutes by changing the processing unit to a GPU. Recognition based on ATT+RNNLM also showed a speed improvement. Table 2 shows the recognition time (minutes) and the real-time factor on the CSJ corpus. The results on CSJ also showed the effectiveness of hypothesis vectorization combined with the GPU under all conditions, ATT and ATT+RNNLM.
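Table 1: Recognition time in minutes (real-time factor in parentheses) on Librispeech, online decoding scenario.
| Processing unit | batch | threads | procs | ATT | +RNNLM |
|---|---|---|---|---|---|
| | 1 | 1 | 1 | 318.5 (2.6) | 518.3 (4.2) |
| CPU | 1 | 1 | 1 | 85.0 (0.7) | 108.2 (0.9) |
| GPU | 1 | 1 | 1 | 30.4 (0.2) | 33.0 (0.3) |

Table 2: Recognition time in minutes (real-time factor in parentheses) on CSJ, online decoding scenario.
| Processing unit | batch | threads | procs | ATT | +RNNLM |
|---|---|---|---|---|---|
| | 1 | 1 | 1 | 591.3 (5.4) | 713.7 (6.5) |
| CPU | 1 | 1 | 1 | 163.6 (1.5) | 190.4 (1.7) |
| GPU | 1 | 1 | 1 | 32.2 (0.3) | 32.2 (0.3) |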
Our algorithm achieved a significant gain over the conventional beam search algorithm on both the Librispeech corpus and the CSJ corpus by vectorizing 20 hypotheses and eliminating the for-loop for hypothesis traversal. In the case of both ATT and ATT+RNNLM, the real-time factors of the proposed method were less than 1.0, so it is applicable to the online decoding scenario.
4.3 Offline decoding scenario
Table 3 shows the recognition time of thread parallelism (threads > 1), process parallelism (procs > 1), and our hypothesis and speech vectorization method (batch > 1) on the Librispeech corpus. When we used 8 threads and decoded using the decoder network, the recognition time was comparable to the single-thread execution in Table 1.
When multiple utterances were vectorized and recognized on the CPU using the decoder network, the recognition time was 96.1 minutes. This was comparable to process parallelism (80.3 minutes) even though our program consumed only one CPU core. The recognition time further decreased to 16.0 minutes by changing the processing unit to a GPU. Comparison with Table 1 shows the advantage of utterance vectorization: in the case of GPU-based execution, the recognition time without utterance vectorization was 30.4 minutes, whereas vectorization of multiple utterances decreased it to 16.0 minutes. In the case of ATT+RNNLM, execution on one CPU core with vectorization of utterances and hypotheses took 104.8 minutes, which was comparable to the recognition time of process parallelism. Again, execution on the GPU decreased the recognition time from 104.8 minutes to 16.1 minutes.
Table 4 shows the recognition time on CSJ. When recognition was performed using the score of the decoder network, the recognition time decreased from 591.3 minutes (in Table 2) to 127.6 minutes, and it further decreased to 16.1 minutes by changing the processing unit to a GPU.
By vectorizing 8 utterances, our algorithm showed recognition time comparable to process parallelism with 8 CPU cores on both corpora. In addition, GPU-based execution can fully exploit the advantage of the GPU and achieved a further reduction of recognition time in the case of both ATT and ATT+RNNLM.
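The size of these gains can be checked with simple ratios of the recognition times reported above for the ATT condition. The helper `speedup` and the dictionary layout are illustrative; the numbers are taken from the text and Tables 1 and 2.

```python
def speedup(baseline_minutes, new_minutes):
    """Ratio of recognition times (larger means faster)."""
    return baseline_minutes / new_minutes


# Hypothesis vectorization on CPU: original search vs. vectorized (ATT).
hypothesis_gain = {
    "Librispeech": speedup(318.5, 85.0),
    "CSJ": speedup(591.3, 163.6),
}

# Utterance vectorization on GPU: single utterance vs. batched (ATT).
utterance_gain = {
    "Librispeech": speedup(30.4, 16.0),
    "CSJ": speedup(32.2, 16.1),
}

for name, gains in (("hypotheses", hypothesis_gain),
                    ("utterances", utterance_gain)):
    for corpus, value in gains.items():
        print(f"{name:10s} {corpus:12s} {value:.1f}x")
```

Rounded to one decimal place, the ratios are 3.7 and 3.6 for hypothesis vectorization and 1.9 and 2.0 for utterance vectorization on the GPU.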
4.4 Fusion of CTC prefix score
Table 5 shows the recognition time when the RNNLM score and the CTC prefix score are used in shallow fusion. The recognition time of the original beam search was 742.9 minutes, and it decreased by vectorizing hypotheses. Use of the GPU further decreased the recognition time to 270.6 minutes. We further vectorized 8 utterances in a batch. The recognition time was then 51.3 minutes, which was better than the result with an 8-core CPU.
In the case of ATT+RNNLM/CTC, computation of the CTC prefix score requires operations proportional to the length of the hidden vector sequence generated by the encoder network. This computation slows down decoding, especially when a large set of labels is used, and the effect was significant on the CSJ corpus (3,260 vs. 29 labels). The recognition time of our algorithm on the GPU with a batch of 8 utterances was 343.0 minutes, which was better than the original program with a single-core CPU (742.9 minutes). However, it was slightly slower than the one with an 8-core CPU (210.4 minutes). Acceleration of the CTC prefix score computation is one of our future directions.
5 Conclusions
In this paper, we proposed a novel approach to reduce the recognition time of the beam search algorithm by vectorizing search hypotheses and multiple input utterances. We achieved a 3.7× speedup over the original beam search algorithm by vectorizing hypotheses on the Librispeech corpus, and a 3.6× speedup on the CSJ corpus. We further proposed a technique to batch multiple utterances. In the case of GPU-based execution, vectorization of multiple utterances achieved a further 1.9× speedup on the Librispeech corpus and a 2.0× speedup on the CSJ corpus. The implementation is available in the open source project ESPnet.
We would like to thank Dr. Rohit Prabhavalkar at Google for many insightful discussions.
- [1] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig, “The Microsoft 2016 conversational speech recognition system,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5934–5938.
- [2] Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, and Michael Picheny, “Building competitive direct acoustics-to-word models for English conversational speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4759–4763.
- [3] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4774–4778.
- [4] Jinyu Li, Guoli Ye, Amit Das, Rui Zhao, and Yifan Gong, “Advancing acoustic-to-word CTC model,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5794–5798.
- [5] Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan, “Advances in joint CTC-Attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in Interspeech, 2017, pp. 949–953.
- [6] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
- [7] Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney, “Improved training of end-to-end attention models for speech recognition,” in Proc. Interspeech, 2018, pp. 7–11.
- [8] Paul R Dixon, Tasuku Oonishi, and Sadaoki Furui, “Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition,” Computer Speech & Language, vol. 23, no. 4, pp. 510–526, 2009.
- [9] Jike Chong, Ekaterina Gonina, Youngmin Yi, and Kurt Keutzer, “A fully data parallel WFST-based large vocabulary continuous speech recognition on a graphics processing unit,” in Proc. Interspeech 2009, 2009, pp. 1183–1186.
- [10] Zhehuai Chen, Justin Luitjens, Hainan Xu, Yiming Wang, Daniel Povey, and Sanjeev Khudanpur, “A GPU-based WFST decoder with exact lattice generation,” arXiv preprint arXiv:1804.03243, 2018.
- [11] Xavier L Aubert, “An overview of decoding techniques for large vocabulary continuous speech recognition,” Computer Speech & Language, vol. 16, no. 1, pp. 89–114, 2002.
- [12] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4835–4839.
- [13] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, “ESPnet: end-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
- [14] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “LIBRISPEECH: An ASR corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
- [15] Kikuo Maekawa, Hanae Koiso, Sadaoki Furui, and Hitoshi Isahara, “Spontaneous speech corpus of Japanese,” in International Conference on Language Resources and Evaluation (LREC), 2000, vol. 2, pp. 947–952.
- [16] Kikuo Maekawa, “Corpus of Spontaneous Japanese: Its design and evaluation,” in ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003.
- [17] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, “The Kaldi speech recognition toolkit,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec. 2011.
- [18] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in PyTorch,” in NIPS-W, 2017.
- [19] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, “End-to-end attention-based large vocabulary speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4945–4949.