Automatic speech recognition (ASR) with Deep Neural Networks (DNN) operates in a hybrid framework using several models. These models include DNN acoustic models (AM) that estimate the posterior probabilities of Hidden Markov Model (HMM) states, language models (LM) that estimate probabilities of word sequences, punctuation models, and inverse text normalization (ITN) models that handle number and date formatting. These models are optimized independently [hinton2012deep] and then combined using a Weighted Finite State Transducer (WFST) for efficient decoding.
End-to-end (E2E) speech recognition techniques such as connectionist temporal classification (CTC) [graves2006connectionist], listen, attend and spell (LAS) [las] and RNN-T [rnnt-graves, rnnt_arch, streaming_rnnt] have become successful because of advances in neural networks that model context and history in audio and text sequences [sak2014long]. E2E speech recognition combines all components of a hybrid ASR model, such as the AM, LM, punctuation model and ITN, into one component and predicts words directly from the input acoustics.
E2E models simplify the training process for a new ASR system, but to run in a server-side application they need to meet the following constraints: 1) be streamable with constrained latency, and 2) match or improve on the computational efficiency and WER of the baseline hybrid system.
LC BLSTMs [lcblstm2, lcblstm_xue] are widely used to build ASR systems with constrained latency. They achieve this by using bi-directional context within short audio chunks without consuming the whole utterance. In this work, we use LC BLSTM for the Audio Encoder of RNN-T to achieve streamable ASR. Popular hybrid ASR decoding techniques such as the static decoder [povey2011kaldi] and the dynamic decoder [dynamic_decoder] use hyperparameters (beams) to prune hypotheses and improve computational efficiency. Inspired by these works, we modify the RNN-T beam search to make it computationally more efficient. We evaluate our model under various latency-control settings using ‘throughput’, defined as the number of audio seconds processed per wall-clock second on a fixed server CPU architecture, and rtf@40, defined as the real time factor at 40 concurrent audio streams.
The rest of the paper is organized as follows. In Section 2, we review the RNN-T model and the LC-BLSTM layer. We present our proposed changes in RNN-T beam search to improve computational efficiency in Section 3. We discuss our experimental setup and summarize our findings in Section 4. Finally, we conclude with a discussion of future work in Section 5.
2 RNN Transducer
The framework of the RNN-T ASR system is illustrated in Fig. 1. RNN-T for ASR has three main components: the Audio Encoder, the Text Predictor and the Joiner. The Audio Encoder encodes the audio frames up to time t as an audio embedding. The Text Predictor encodes the text history up to the current index in the reference or hypothesis as a text embedding. These embeddings are then fed to the Joiner, which combines them to produce a probability distribution over the output units. By incorporating both audio and text when producing probabilities over output symbols, RNN-T can overcome the conditional independence assumptions of CTC models [graves2006connectionist]. In RNN-T, the output units include a special blank symbol that decides whether to move to the next time frame or to emit more output units from the same time frame on the next Joiner call. After every Joiner call we either move along the time (t) axis to process the next audio frame, or we update the hypothesis and emit more symbols from the same time frame t. The former is done when the Joiner emits the blank symbol, whereas the latter is done for a non-blank emission.
2.1 Latency Controlled BLSTM for RNN-T
Unidirectional Audio Encoder models such as LSTMs base their predictions only on the audio history to the left and thus tend to yield worse word error rates than bi-directional encoders that have full left and right context. For live streaming applications, we are constrained to use unidirectional encoders because the transcript should be made available with the minimum possible delay as audio is fed in. However, some applications permit a certain maximum latency between consuming parts of the input audio and producing the transcript for them. In such cases it greatly helps to use some amount of right context in the Audio Encoder to improve WER. A traditional BLSTM cannot produce a transcript until the whole audio stream is processed. LC BLSTM [lcblstm_xue, lcblstm2] allows streamable applications with constrained latency. LC BLSTM (Fig. 2) has two LSTMs: left-lstm, which runs from left to right along the time axis, and right-lstm, which runs from right to left.
In order to run LC BLSTM for RNN-T, the audio sequence is first divided into overlapping chunks of a fixed size. The amount of overlap between chunks is equal to the minimum amount of right context available to frames in the chunk. As shown in Fig. 2, the amount of right context available is largest at the first frame of the chunk and decreases linearly to this minimum at the end of the chunk. This allows every frame in the chunk to have some amount of right context to generate a high-quality audio embedding without delaying the generation of the embedding until the whole audio stream is ingested.
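The chunking scheme can be illustrated with a minimal sketch. It assumes that the chunk size counts only the frames whose embeddings are produced by a chunk and that the right-context frames are appended as lookahead, so consecutive windows overlap by exactly the right context; the text does not settle this convention. The names `chunk_windows` and `right_context_of` are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Window = Tuple[int, int, int]  # (start, emit_end, window_end), end-exclusive


def chunk_windows(num_frames: int, chunk_size: int, right_context: int) -> List[Window]:
    """Split frame indices into overlapping windows.

    Frames [start, emit_end) get their embeddings from this window.
    Frames [emit_end, window_end) are lookahead only and are processed
    again as the first frames of the next window.
    """
    if chunk_size <= 0 or right_context < 0:
        raise ValueError("chunk_size must be positive and right_context non-negative")
    windows = []
    for start in range(0, num_frames, chunk_size):
        emit_end = min(start + chunk_size, num_frames)
        window_end = min(emit_end + right_context, num_frames)
        windows.append((start, emit_end, window_end))
    return windows


def right_context_of(frame: int, window: Window) -> int:
    """Number of future frames visible to `frame` inside its window."""
    start, emit_end, window_end = window
    if not start <= frame < emit_end:
        raise ValueError("frame is not emitted by this window")
    return window_end - 1 - frame


# Toy example: 10 frames, chunks of 4 frames, 2 frames of right context.
windows = chunk_windows(num_frames=10, chunk_size=4, right_context=2)
```

In this toy example the first frame of the first window sees five future frames, while the last emitted frame of that window sees only the two lookahead frames, which matches the linear decrease described above.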
3 Improving Beam search for RNN-T
Inspired by the speed improvements that decoders such as Kaldi [povey2011kaldi] obtain through pruning, we modify the RNN-T beam search described in [rnnt-graves] to prune unlikely paths early during decoding and so improve ‘throughput’ and real time factor. Our modified algorithm is presented in Algorithm 1. We use the same symbols as Algorithm 1 in [rnnt-graves].
In order to explain RNN-T beam search, let us assume that a hypothesis at time t has an audio embedding and a text embedding. These embeddings are fed to the Joiner, which combines them to produce probabilities over the output units at t. As explained in Section 2, the output units include a special blank symbol that decides whether to move to the next time frame t+1 or to emit more output units from the same time frame t. In order to ensure that at least the top W (beam size) hypotheses that are being moved to t+1 have higher probability than the ones that can still be generated from t, a beam search is performed using two sets of hypotheses, A and B.
Set A contains hypotheses that are still being considered for time t, whereas set B contains hypotheses that have already emitted a blank symbol at time t and are now in time frame t+1. As soon as B has W hypotheses more probable than the most probable hypothesis in A, the beam search criterion is met at time t and we can start processing frame t+1. During beam search we pick the best hypothesis in A and expand it with either blank or non-blank symbols. The expansion with blank moves a hypothesis to B, whereas expansions with non-blank symbols are put back in A, which results in an expanded set A at t. We introduce expand_beam to limit the number of expanded hypotheses that are added to A. For a Joiner call, we first compute the best log probability among the non-blank output units and only consider non-blank output units whose log probability is within expand_beam of this best value for addition to A.
We also introduce state_beam and use it as an additional hyperparameter of the beam search. If the best hypothesis in A is worse than the best hypothesis in B by more than state_beam in log space, we assume that future expansions of the hypotheses in A are too unlikely to compete with the hypotheses already in B. We always use the natural logarithm of the numerical value when discussing expand_beam and state_beam in the rest of the paper.
In Section 4 we show that we can improve ‘throughput’ from 53 to 65 and decrease rtf@40 from 0.75 to 0.60 by using expand_beam and state_beam, with negligible WER impact.
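The two pruning rules in this section can be sketched in a few lines of Python. This is an illustrative sketch and not the paper's Algorithm 1: the function names, the `joiner_log_probs` callback, the use of index 0 for the blank symbol, and the merging of duplicate hypotheses with a log-sum are all assumptions. The sketch stops expanding a frame when the state_beam condition fires, which is one plausible reading of the rule. The toy Joiner at the end is made up so that the example runs on its own.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Hyp = Tuple[int, ...]  # emitted non-blank unit ids
BLANK = 0              # assumed index of the blank symbol
NEG_INF = float("-inf")


def log_add(a: float, b: float) -> float:
    """Return log(exp(a) + exp(b)) without overflow."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def pruned_beam_search(
    num_frames: int,
    joiner_log_probs: Callable[[int, Hyp], Sequence[float]],
    beam_size: int = 5,
    expand_beam: float = 2.3,
    state_beam: float = 4.6,
) -> Tuple[Hyp, float]:
    """Return the best hypothesis and its log probability."""
    B: Dict[Hyp, float] = {(): 0.0}
    for t in range(num_frames):
        A, B = B, {}  # A: still at frame t; B: already moved to frame t + 1
        while A:
            hyp = max(A, key=A.get)
            score = A[hyp]
            if B:
                # state_beam rule: A has fallen too far behind B.
                if score < max(B.values()) - state_beam:
                    break
                # Usual criterion: B holds beam_size hypotheses better than A's best.
                if sum(s > score for s in B.values()) >= beam_size:
                    break
            del A[hyp]
            log_probs = joiner_log_probs(t, hyp)
            # Blank moves the hypothesis to the next frame.
            B[hyp] = log_add(B.get(hyp, NEG_INF), score + log_probs[BLANK])
            # expand_beam rule: keep only non-blank units close to the best one.
            best_non_blank = max(lp for k, lp in enumerate(log_probs) if k != BLANK)
            for k, lp in enumerate(log_probs):
                if k != BLANK and lp >= best_non_blank - expand_beam:
                    new_hyp = hyp + (k,)
                    A[new_hyp] = log_add(A.get(new_hyp, NEG_INF), score + lp)
        B = dict(sorted(B.items(), key=lambda kv: kv[1], reverse=True)[:beam_size])
    best = max(B, key=B.get)
    return best, B[best]


def toy_joiner(t: int, hyp: Hyp, targets: Hyp = (1, 2)) -> Sequence[float]:
    """Made-up Joiner: favours one target unit per frame, then blank."""
    behind = len(hyp) <= t and len(hyp) < len(targets)
    favoured = targets[len(hyp)] if behind else BLANK
    return [math.log(0.9 if k == favoured else 0.05) for k in range(3)]


best_hyp, best_score = pruned_beam_search(2, toy_joiner, beam_size=2)
```

With the toy Joiner, the search emits one target unit per frame followed by a blank, so the best hypothesis is the two-unit target sequence. Setting `expand_beam` and `state_beam` to `float("inf")` disables both pruning rules.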
The dataset used for our experiments was sampled from English videos shared publicly on Facebook. The data does not contain any user-identifiable information and is completely anonymized. The training set consists of around 1M videos with 13.7K hours in total. We use two test sets: vid-clean and vid-noisy. Vid-clean has 1.4K videos (about 20.9 hours), whereas vid-noisy, which is more acoustically challenging, has 1.3K videos (about 20.1 hours). More information about our data sets can be found in [base_hybrid_training].
The architecture of the RNN-T model (Figure 1) used for the experiments in this paper is as follows. The Audio Encoder has two components: a 5-layer LC BLSTM with 704 dimensions and an Audio Encoder Linear Projection Layer (AELPL). We use subsampling of 2 across the time dimension after the first LC BLSTM layer to improve training and inference speed. The LC BLSTM uses a right context of 20 frames (200 ms) and a chunk size of 240 frames (2400 ms) during training. The Text Predictor also has two components: a 2-layer LSTM of 704 dimensions and a Text Predictor Linear Projection Layer (TPLPL). The Joiner is a sequence of three layers, a summation layer, a ReLU [relu] layer and a softmax layer, and produces probabilities over the output units. We used a token set consisting of 200 sentence pieces, learnt using the sentence piece library [sentence_piece]. The entire model consists of 62M parameters.
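The Joiner described above (summation, ReLU, softmax) can be sketched in plain Python. The sketch assumes that both projected embeddings already have one entry per output unit, which the text does not state; the function name `joiner` and the example values are hypothetical.

```python
import math
from typing import List, Sequence


def joiner(audio_proj: Sequence[float], text_proj: Sequence[float]) -> List[float]:
    """Sum the two projected embeddings, apply ReLU, then softmax."""
    if len(audio_proj) != len(text_proj):
        raise ValueError("projected embeddings must have the same length")
    # Summation layer followed by ReLU.
    activations = [max(0.0, a + t) for a, t in zip(audio_proj, text_proj)]
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(activations)
    exps = [math.exp(x - peak) for x in activations]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-unit example with made-up embedding values.
probs = joiner([1.0, -2.0, 0.5], [0.5, 1.0, -0.5])
```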
The input to the network consists of globally normalized 80-dimensional log Mel-filterbank features, extracted with 25ms FFT windows and 10ms frame shifts. We use the Adam optimizer [kingma2014adam] with a learning rate of 0.0004, a dropout probability of 0.3 and the LB policy of SpecAugment [spec_aug] during training. Dropout is applied in all LC BLSTM layers of the Audio Encoder and all LSTM layers of the Text Predictor. The RNN-T training was run for 25 epochs.
The latency budget can be chosen by setting the Decoding Threshold at inference time. The Decoding Threshold is defined as the chunk size (in milliseconds) used during inference for the LC BLSTM Audio Encoder. If not explicitly specified, we use a Decoding Threshold of 800ms for our experiments. We use the same amount of right context during training and inference.
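Because chunk sizes are quoted both in frames and in milliseconds, a small conversion helper makes the relationship explicit. It uses the 10ms frame shift from the feature extraction and counts frames at the input rate, ignoring the subsampling factor of 2 inside the encoder; the helper name `ms_to_frames` is hypothetical.

```python
FRAME_SHIFT_MS = 10  # frame shift of the log Mel-filterbank features


def ms_to_frames(duration_ms: int, frame_shift_ms: int = FRAME_SHIFT_MS) -> int:
    """Convert a duration in milliseconds to a number of input frames."""
    if duration_ms < 0 or duration_ms % frame_shift_ms != 0:
        raise ValueError("duration must be a non-negative multiple of the frame shift")
    return duration_ms // frame_shift_ms


training_chunk_frames = ms_to_frames(2400)     # training chunk size
right_context_frames = ms_to_frames(200)       # right context
default_threshold_frames = ms_to_frames(800)   # default Decoding Threshold
```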
A beam size of 5, expand_beam of 2.3 and state_beam of 4.6 were used during inference. INT8 quantization from PyTorch was used during inference to speed up decoding. RNN-T is fully neural and does not use an external LM.
The baseline hybrid ASR system [base_hybrid_training] consisted of a 5-layer LC BLSTM model with 800 hidden units and an external WFST language model trained using transcripts from the same training data. The hybrid ASR system was trained using the model and policy described in [base_hybrid_training], first to minimize the cross-entropy loss (20 epochs) and then with the LF-MMI criterion (8 epochs) [lffmi]. The hybrid ASR system also used INT8 quantization at inference time to speed up decoding.
4.3 Impact of expand and state beam on WER and Throughput
As discussed in Section 3, the introduction of expand_beam and state_beam in the beam search of RNN-T allows us to limit the number of hypotheses in the sets A and B at inference time. This boosts the ‘throughput’ by around 22 percent and decreases rtf@40 by 20 percent with negligible impact on WER. We achieve a ‘throughput’ of 65 and rtf@40 of 0.60 with expand_beam = 2.3 and state_beam = 4.6, compared to a ‘throughput’ of 53 and rtf@40 of 0.75 without the improved beam search. Table 2 shows the WERs for the vid-noisy test set with different values of these parameters.
|Expand Beam||State Beam||WER||Throughput||rtf@40|
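The relative improvements quoted in this section follow directly from the reported figures, as the short check below shows. The helper name `relative_change` is hypothetical.

```python
def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old


# Figures reported in this section.
throughput_gain = relative_change(53, 65)       # about 0.226, "around 22 percent"
rtf_reduction = -relative_change(0.75, 0.60)    # about 0.20, "20 percent"
```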
4.4 Comparing Hybrid ASR Model with RNN-T
At 65MB, the RNN-T model is smaller than our production hybrid ASR model [base_hybrid_training], while obtaining a comparable WER. In addition to being larger, the hybrid ASR model also requires its various components, such as the acoustic model, language model, punctuation model and inverse text normalization, to be trained individually. Each of these components has its own data and training pipelines that need to be maintained separately. The RNN-T model combines these components into a single model that can be trained end-to-end, which simplifies the training and deployment process.
As seen in Table 3, RNN-T achieves a similar WER and better ‘throughput’ and rtf@40 than the hybrid system on both test sets. A Decoding Threshold of 800ms was used for both the hybrid and the RNN-T ASR systems.
4.5 Impact of Decoding Threshold on WER and Throughput
The latency budget of ASR systems varies depending on the application they are used for. The chunk size of the LC BLSTM layers used at inference time can be different from the training chunk size. We defined the Decoding Threshold as the chunk size used during inference in Section 4.2. This parameter gives us a flexible latency budget: by adjusting it at inference time, we can trade off the latency (and throughput) of the system against WER. For larger values of the Decoding Threshold, the latency between the input audio and the produced transcript is larger, but the system can achieve a better WER, because the average amount of right context available per frame increases. For smaller values, the model needs to perform more computation per time step, as the right context of each audio chunk has to be re-processed for the subsequent chunk, which reduces throughput.
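The re-processing cost can be made concrete with a rough calculation. The sketch below assumes that each chunk of Decoding Threshold length is extended by the right context as lookahead, so the encoder processes chunk plus right-context frames for every chunk of output; this convention is an assumption, and the function name is hypothetical. It covers only the Audio Encoder input and is not a model of the measured throughput.

```python
def encoder_frames_per_output_frame(chunk_ms: int, right_context_ms: int) -> float:
    """Encoder input processed per unit of audio, counting re-processed lookahead."""
    if chunk_ms <= 0 or right_context_ms < 0:
        raise ValueError("chunk_ms must be positive and right_context_ms non-negative")
    return (chunk_ms + right_context_ms) / chunk_ms


# Right context of 200ms, as used in training and inference.
overhead_300 = encoder_frames_per_output_frame(300, 200)    # about 1.67
overhead_2000 = encoder_frames_per_output_frame(2000, 200)  # 1.1
```

Under this assumption, a 300ms chunk makes the encoder process roughly two thirds more frames than the audio contains, while a 2000ms chunk adds only about ten percent.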
As seen in Table 4, the ‘throughput’ decreases from 74 to 48 and rtf@40 increases from 0.53 to 0.81 when the Decoding Threshold is decreased from 2000ms to 300ms for the vid-noisy data set. For our model, the WER increases by 8.9% (relative) for vid-noisy and 13.3% (relative) for vid-clean over the same range. We only show ‘throughput’ and rtf@40 for the vid-noisy test set in Table 4; vid-clean follows a similar pattern.
RNN-T models can also be made streamable by using only unidirectional LSTMs in the Audio Encoder. However, our best unidirectional RNN-T only achieves a WER of 16.4% for vid-clean and 23.6% for vid-noisy. Tuning the Decoding Threshold at inference time allows us to get better WERs than with unidirectional models while keeping ASR streamable.
In this work we show that RNN-T systems are suitable for streaming ASR with latency constraints. Our experiments demonstrate that RNN-T can achieve a good trade-off between latency and WER with LC-BLSTM. Our work improves on existing work in two ways: first, the changes we propose to the beam search procedure improve rtf@40 by 20% relative without impacting WER; second, we show that an RNN-T equipped with LC-BLSTM layers achieves a better WER than one with only unidirectional LSTMs, while remaining streamable. The use of LC-BLSTMs also allows the latency of the models to be controlled at inference time. Future directions include using contextual information for RNN-T.
The authors would like to thank Awni Hannun and Yun Wang for the discussions and suggestions about this work.