E2E ASR has gained popularity due to its simplicity in training and decoding. An all-neural E2E model eliminates the need to individually train the components of a conventional model (i.e., acoustic, pronunciation, and language models), and directly outputs subword (or word) symbols [1, 2, 3, 4, 5]. With large-scale training, E2E models perform competitively with more sophisticated conventional systems on Google traffic [6, 7]. Given its all-neural nature, an E2E model can be reasonably downsized to fit on mobile devices.
To bridge the quality gap between a streaming recurrent neural network transducer (RNN-T) and a large conventional model, a two-pass framework has been proposed that uses a non-streaming LAS decoder to rescore the RNN-T hypotheses. The rescorer attends to the audio encoding from the encoder and computes sequence-level log-likelihoods of the first-pass hypotheses. The two-pass model achieves a 17%-22% relative WER reduction (WERR) compared to RNN-T, and has a WER similar to that of a large conventional model.
A class of neural correction models post-processes hypotheses using only text information, and can be considered second-pass models [11, 12, 13]. These models typically use beam search to generate new hypotheses, in contrast to rescoring, where one leverages external language models trained on large text corpora. For example, one neural correction model takes first-pass text hypotheses and generates new sequences to improve numeric utterance recognition. A transformer-based spelling correction model has been proposed to correct the outputs of a connectionist temporal classification model in Mandarin ASR. In addition, text-to-speech (TTS) audio has been leveraged to train an attention-based neural spelling corrector that improves LAS decoding. These neural correction models typically use only text as input, while the aforementioned two-pass model attends to acoustics alone for second-pass processing.
In this work, we propose to combine acoustics and first-pass text hypotheses for second-pass decoding based on the deliberation network. Deliberation models have been used in state-of-the-art machine translation and for generating intermediate representations in speech-to-text translation. Our deliberation model has a similar structure: an RNN-T model generates the first-pass hypotheses, and the deliberation decoder attends to both acoustics and first-pass hypotheses for second-pass decoding. We encode first-pass hypotheses bidirectionally to leverage context information for decoding. Note that the first-pass hypotheses are sequences of wordpieces and are usually short in Voice Search (VS), so the encoding should have limited impact on latency.
Our experiments use the same training data as in [20, 21], drawn from multiple domains such as Voice Search, YouTube, far-field, and telephony. We first analyze the behavior of the deliberation model, including performance when attending to multiple RNN-T hypotheses, the contribution of each attention mechanism, and rescoring vs. beam search. We apply additional encoder (AE) layers and minimum WER (MWER) training to further improve quality. The results show that our MWER-trained 8-hypothesis deliberation model performs 11% relatively better than LAS rescoring in VS WER, and up to 15% better for proper noun recognition. Joint training further improves VS slightly (2%), and significantly on a proper noun test set (9%). As a result, our best deliberation model achieves a WER of 5.0% on VS, which is 21% relatively better than the large conventional model (6.3% VS WER). Lastly, we analyze the computational complexity of the deliberation model and show decoding examples to illustrate its strengths.
2 Deliberation Based Two-Pass E2E ASR
2.1 Model Architecture
As shown in Fig. 1, our deliberation network consists of three major components: a shared encoder, an RNN-T decoder, and a deliberation decoder, similar to [10, 16]. The shared encoder takes log-mel filterbank energies x = (x_1, ..., x_T), where T denotes the number of frames, and generates an encoding e. The encoder output e is fed to an RNN-T decoder to produce first-pass decoding results y_r in a streaming fashion. The deliberation decoder then attends to both e and y_r to predict a new sequence y_d. We use a bidirectional encoder to further encode y_r for useful context information, and denote its output h_b. Note that we could use multiple hypotheses {y_r^1, ..., y_r^H}, where H is the number of hypotheses; in this scenario we encode each hypothesis separately using the same bidirectional encoder, and then concatenate their outputs in time to form h_b. We keep the audio encoder unidirectional due to latency considerations. Two attention layers then attend separately to the acoustic encoding e and the first-pass hypothesis encoding h_b. The two resulting context vectors are concatenated as inputs to a LAS decoder.
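The two attention streams and the concatenated context can be sketched as follows. This is a minimal single-head dot-product sketch with toy dimensions; the actual model uses four-head attention, and all shapes and names here are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys):
    """Single-head dot-product attention: returns one context vector."""
    scores = keys @ query / np.sqrt(query.shape[-1])  # (T,)
    weights = softmax(scores)
    return weights @ keys  # (d,)

rng = np.random.default_rng(0)
d = 8
e = rng.normal(size=(50, d))     # acoustic encoding, T = 50 frames
h_b = rng.normal(size=(240, d))  # hypothesis encodings concatenated in time
q = rng.normal(size=(d,))        # decoder state at the current step

c_a = attend(q, e)               # acoustic context vector
c_h = attend(q, h_b)             # hypothesis context vector
decoder_input = np.concatenate([c_a, c_h])  # fed to the LAS decoder
```

At every decoding step the LAS decoder thus sees information from both the audio and the full first-pass hypotheses.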
There are two major differences between our model and LAS rescoring. First, the deliberation model attends to both the acoustic encoding and the first-pass hypotheses, while the rescorer attends only to the acoustic embedding. Second, our deliberation model encodes the first-pass hypotheses bidirectionally, while LAS rescoring relies only on unidirectional encoding for decoding.
2.1.1 Additional Encoder Layers
Previous work shows that the incompatibility between an RNN-T encoder and a LAS decoder leads to a gap between the rescoring model and a LAS-only model. To help adaptation, we introduce a 2-layer LSTM as an additional encoder (shown as a dashed box in Fig. 1 to indicate that it is optional) to further encode the shared encoder outputs. We show in Sect. 4 that additional encoder layers improve both the deliberation and LAS rescoring models.
2.2 Training

A deliberation model is typically trained from scratch by jointly optimizing all components. However, we find that training a two-pass model from scratch tends to be unstable in practice, and thus use a two-step training process: first train the RNN-T, then fix the RNN-T parameters and train only the deliberation decoder and additional encoder layers, as in [7, 10].
2.2.1 MWER Loss
We apply the MWER loss in training, which optimizes the expected word error rate using the n-best hypotheses:

L_MWER(x, y*) = Σ_{i=1}^{b} P̂(y_d^i | x) [ W(y_d^i, y*) − Ŵ ]     (1)

where y_d^i is the i-th hypothesis from the deliberation decoder, and W(y_d^i, y*) the number of word errors of y_d^i w.r.t. the ground truth target y*. P̂(y_d^i | x) is the probability of the i-th hypothesis, normalized over all hypotheses in the beam to sum to 1, and b is the beam size. Ŵ denotes the average number of word errors over the beam, subtracted for variance reduction. In practice, we combine the MWER loss with the cross-entropy (CE) loss to stabilize training: L = L_MWER + λ L_CE, where λ is a small weight (e.g., 0.01).
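The MWER objective for a single utterance can be sketched as below, assuming the hypothesis log-likelihoods and word-error counts are given; subtracting the mean word-error baseline over the beam is the standard variance-reduction step in MWER training:

```python
import numpy as np

def mwer_loss(log_probs, word_errors):
    """MWER loss over an n-best list: expected (mean-subtracted) word errors
    under the hypothesis distribution renormalized over the beam."""
    log_probs = np.asarray(log_probs, dtype=float)
    w = np.asarray(word_errors, dtype=float)
    # Renormalize over the beam so the probabilities sum to 1.
    p_hat = np.exp(log_probs - log_probs.max())
    p_hat = p_hat / p_hat.sum()
    w_bar = w.mean()  # average word errors over the beam (baseline)
    return float(np.sum(p_hat * (w - w_bar)))

# 4-best list: hypothesis log-likelihoods and word errors vs. the reference.
loss = mwer_loss([-1.0, -2.0, -2.5, -3.0], [0, 1, 1, 2])
```

In training this term would be added to a small multiple of the CE loss (the λ weight above); here the loss is negative because the most probable hypothesis has fewer errors than the beam average.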
2.2.2 Joint Training
Training the deliberation decoder while fixing the RNN-T parameters is not optimal, since the model components are not jointly updated. We propose to use a combined loss to train all modules jointly:

L_joint(θ_e, θ_r, θ_d) = L_RNNT(θ_e, θ_r) + λ L_CE(θ_e, θ_r, θ_d)     (2)

where L_RNNT is the RNN-T loss and L_CE the CE loss for the deliberation decoder. θ_e, θ_r, and θ_d denote the parameters of the shared encoder, the RNN-T decoder, and the deliberation decoder, respectively. Note that a jointly trained model can be further trained with the MWER loss. The joint training is similar to "deep finetuning" but without a pre-trained decoder.
2.3 Decoding

Our decoding consists of two passes: 1) decode using the RNN-T model to obtain the first-pass hypotheses, and 2) attend to both the acoustic encoding and the encoded first-pass hypotheses, and perform a second beam search to generate the output sequence. We are also curious how rescoring performs given the bidirectional encoding of first-pass hypotheses. In rescoring, we run the deliberation decoder on each hypothesis in a teacher-forcing mode. Note that, unlike LAS rescoring, when rescoring a hypothesis the deliberation network sees all candidate hypotheses. We compare rescoring and beam search in Sect. 4.
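Rescoring in teacher-forcing mode amounts to feeding a hypothesis's own tokens back into the decoder and summing the log-probabilities assigned to them. A toy sketch, with a hypothetical uniform decoder standing in for the real model:

```python
import numpy as np

def rescore(hyp_ids, step_log_probs_fn):
    """Teacher-forced rescoring: feed the hypothesis's own tokens and sum the
    log-probabilities the decoder assigns to each of them."""
    score, prev = 0.0, None
    for tok in hyp_ids:
        log_probs = step_log_probs_fn(prev)  # distribution over the vocab
        score += log_probs[tok]
        prev = tok  # teacher forcing: condition on the hypothesis token
    return score

# Toy decoder: uniform distribution over a 4-symbol vocabulary.
uniform = lambda prev: np.log(np.full(4, 0.25))

# Pick the candidate with the highest teacher-forced log-likelihood.
best = max([[0, 1], [2, 3, 1]], key=lambda h: rescore(h, uniform))
```

Under the uniform toy decoder the shorter hypothesis wins; with the real deliberation decoder the score additionally reflects attention over the audio and over all candidate hypotheses.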
3 Experimental Setup
3.1 Training and Test Data

We train on the multidomain datasets of [20, 21], which include anonymized and hand-transcribed English utterances from general Google traffic, far-field environments, telephony conversations, and YouTube. We augment the clean training utterances by artificially corrupting them using a room simulator and varying degrees of noise and reverberation, such that the signal-to-noise ratio (SNR) is between 0 dB and 30 dB. We also use mixed-bandwidth utterances at 8 kHz or 16 kHz for training.
Our main test set includes ~14K anonymized, hand-transcribed VS utterances sampled from Google traffic. To evaluate proper noun recognition, we report performance on a side-by-side (SxS) test set and 4 voice command test sets. The SxS set contains utterances where the LAS rescoring model performs worse than a state-of-the-art conventional model, one reason being proper nouns. The voice command test sets include 3 TTS test sets created using Parallel WaveNet: Songs, Contacts-TTS, and Apps, where the commands include song, contact, and app names, respectively. The Contacts-Real set contains anonymized, hand-transcribed utterances from Google traffic for communicating with a contact, e.g., "Call Jon Snow".
3.2 Architecture Details and Training
Our first-pass RNN-T model has the same architecture as in prior work. The RNN-T encoder consists of 8 Long Short-Term Memory (LSTM) layers, and the prediction network contains 2 LSTM layers. Each LSTM layer has 2,048 hidden units followed by a 640-dimensional projection. A time-reduction layer is added after the second encoder layer to improve inference speed without accuracy loss. Outputs of the encoder and prediction network are fed to a joint network with 640 hidden units, followed by a softmax layer predicting 4,096 mixed-case wordpieces.
The deliberation decoder can attend to multiple hypotheses; RNN-T hypotheses of different lengths are thus padded with an end-of-sentence label to a length of 120. Each subword unit in a hypothesis is mapped to a vector by a 96-dimensional embedding layer, and then encoded by a 2-layer bidirectional LSTM encoder, where each layer has 2,048 hidden units followed by a 320-dimensional projection. Each of the two attention models is multi-headed attention with four heads. The two output context vectors are concatenated and fed to a 2-layer LAS decoder (2,048 hidden units followed by a 640-dimensional projection per layer). The LAS decoder has a 4,096-dimensional softmax layer to predict the same mixed-case wordpieces as the RNN-T.
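The padding and per-hypothesis encoding above can be sketched as follows. The EOS id and the identity stand-in for the 2-layer bidirectional LSTM are assumptions for illustration only:

```python
import numpy as np

EOS = 0          # end-of-sentence label used for padding (assumed id)
MAX_LEN = 120    # fixed padded length, as in the text

def pad_hypotheses(hyps, max_len=MAX_LEN, eos=EOS):
    """Pad variable-length wordpiece id sequences with EOS to a fixed length."""
    out = np.full((len(hyps), max_len), eos, dtype=np.int32)
    for i, h in enumerate(hyps):
        out[i, :len(h)] = h[:max_len]
    return out

def encode_and_concat(padded, embed, encoder_fn):
    """Embed and encode each hypothesis separately, then concatenate in time."""
    encoded = [encoder_fn(embed[seq]) for seq in padded]  # each (MAX_LEN, d)
    return np.concatenate(encoded, axis=0)                # (H * MAX_LEN, d)

rng = np.random.default_rng(0)
vocab, emb_dim = 4096, 96
embed = rng.normal(size=(vocab, emb_dim))

def identity_encoder(x):
    return x  # stand-in for the 2-layer bidirectional LSTM

hyps = [[12, 7, 3], [12, 7, 9, 5]]  # two first-pass hypotheses (toy ids)
padded = pad_hypotheses(hyps)
h_b = encode_and_concat(padded, embed, identity_encoder)
```

With H hypotheses of padded length 120, the concatenated encoding has H × 120 time steps for the hypothesis attention to range over.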
For feature extraction, we use 128-dimensional log-Mel features computed from 32-ms windows at a rate of 10 ms. Each feature is stacked with the three previous frames to form a 512-dimensional vector, and then downsampled to a 30-ms frame rate. Our models are trained in TensorFlow using the Lingvo framework on 8×8 Tensor Processing Unit (TPU) slices with a global batch size of 4,096.
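The frame stacking and downsampling step can be sketched directly; the zero left-padding for the earliest frames is an assumption for illustration:

```python
import numpy as np

def stack_and_downsample(feats, stack=4, rate=3):
    """Stack each 10-ms frame with its 3 previous frames (4 x 128 = 512 dims),
    then keep every 3rd stacked frame for an effective 30-ms frame rate."""
    T, d = feats.shape
    # Left-pad with zeros so the earliest frames have "previous" context.
    padded = np.concatenate([np.zeros((stack - 1, d)), feats], axis=0)
    # Column block i holds frame t - (stack - 1) + i; the last block is frame t.
    stacked = np.concatenate([padded[i:i + T] for i in range(stack)], axis=1)
    return stacked[::rate]

feats = np.random.default_rng(0).normal(size=(100, 128))  # 1 s of 10-ms frames
out = stack_and_downsample(feats)
```

One second of 10-ms frames (100 × 128) becomes 34 stacked frames of 512 dimensions at a 30-ms rate, cutting the encoder's sequence length by a factor of 3.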
3.3 Computational Complexity
We estimate the computational complexity of the deliberation decoder using the number of floating-point operations (FLOPS) required:

FLOPS = H · N · S_B + b · N · S_D + b · A     (3)

where S_B is the size of the bidirectional encoder, N the number of decoded tokens, and H the number of first-pass hypotheses. S_D denotes the size of the LAS decoder, and b the second-pass beam search size. A is the FLOPS required for the two attention layers; we compute it as the sum of the source- and query-matrix sizes multiplied by the number of time frames and decoded tokens, respectively. Our deliberation decoder contains roughly 66M parameters in total, of which the attention layers account for 2M, with the remainder split between the bidirectional encoder and the LAS decoder.
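The estimate can be coded as a small helper: the bidirectional encoder runs over every token of every first-pass hypothesis, the LAS decoder runs per beam per decoded token, and the attention cost is added per beam. The component sizes below are illustrative placeholders, not the paper's actual breakdown:

```python
def estimate_decoder_gflops(num_hyps, num_tokens, beam,
                            s_bienc, s_las, attn_flops):
    """Rough second-pass cost in GFLOPS using a parameters-per-token heuristic:
    encode every first-pass hypothesis token with the bidirectional encoder,
    run the LAS decoder per beam per output token, and add attention FLOPS."""
    flops = (num_hyps * num_tokens * s_bienc   # encode first-pass hypotheses
             + beam * num_tokens * s_las       # second-pass beam search
             + beam * attn_flops)              # two attention layers
    return flops / 1e9

# Toy component sizes (illustrative placeholders).
g = estimate_decoder_gflops(num_hyps=8, num_tokens=14, beam=8,
                            s_bienc=30e6, s_las=30e6, attn_flops=10e6)
```

Even with these toy numbers, the hypothesis-encoding term is of the same order as the beam-search term, which matches the later observation that the bidirectional encoder drives most of the extra cost.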
4 Results

In this section, we analyze the importance of individual components of the deliberation model through ablation studies, improve the model with MWER training and AE layers, and select one of our best deliberation models for comparison.
4.1 Number of RNN-T Hypotheses
The deliberation decoder may attend to multiple first-pass hypotheses. We encode the hypotheses separately and then concatenate them as input to the attention layer. We use a beam size of 8 for RNN-T decoding. Unless stated otherwise, reported WERs are for the VS test set. The third row of Table 1 shows that the WER improves only slightly when increasing the number of RNN-T hypotheses from 1 to 8. After applying MWER training, however, the WER improves steadily from 5.4% to 5.1%. We suspect that MWER training specifically helps deliberation attend to the relevant parts of the first-pass hypotheses. Since the 8-hypothesis model gives the best performance, we use it in subsequent experiments. For simplicity, MWER training is not used in the ablations that follow.
| Model | 1 hyp | 2 hyp | 4 hyp | 8 hyp |
4.2 Acoustics vs. Text
We investigate how the two attention mechanisms (acoustic vs. text) contribute to deliberation, and thus train separate models that attend to either acoustics (E5) or text (E6) alone in both training and inference. Table 2 shows that both E5 and E6 perform significantly better than the baseline RNN-T model (B0), with a 9% WERR. Using both attention mechanisms (E4), the model gains another 11% relative improvement. It may seem surprising that E6 performs on par with E5; we note this could be because E6 has a bidirectional encoder while E5 does not.
4.3 Additional Encoder Layers
To help the deliberation decoder better adapt to the shared encoder, we add AE layers as dedicated encoding for the deliberation decoder. The AE consists of a 2-layer LSTM with 2,048 hidden units followed by a 640-dimensional projection per layer. Beam search is used for decoding. Table 3 shows that with AE layers (E7) the model performs around 4% better than without (E4). Similarly, applying AE to LAS beam search (B1 vs. B2) yields a similar improvement.
| ID | Model | WER (%) |
| E7 | E4 + AE | 5.2 |
| B2 | LAS + AE | 5.8 |
| ID | Model | Decoding | VS | SxS | Songs | Contacts-TTS | Apps | Contacts-Real | Estimated GFLOPS |
| B5 | LAS | Beam search | 5.5 | 29.0 | 11.7 | 14.7 | 22.9 | 8.3 | 4.8 |
| E10 | + Joint training | Beam search | 5.0 | 24.3 | 9.6 | 13.4 | 22.0 | 6.4 | 8.8 |
4.4 Deliberation Rescoring

We also use the deliberation decoder to rescore first-pass RNN-T results, expecting the bidirectional encoding to help compared to LAS rescoring. Table 5 shows that deliberation rescoring (E8) performs 5% relatively better than LAS rescoring (B3). AE layers are added to both models.
| ID | Model | WER (%) |
| B3 | LAS + AE | 6.0 |
4.5 Comparisons

From the above analysis, an MWER-trained 8-hypothesis deliberation model with AE layers performs best, and we therefore use it for the comparisons below.
In Table 4, we compare deliberation models with an RNN-T baseline and a LAS rescoring model on different recognition tasks, including VS and proper noun recognition. We include two deliberation models: an MWER-trained 8-hypothesis deliberation model with AE layers (E9), and a jointly trained version (E10). For the LAS two-pass model, we add AE layers and evaluate both rescoring (B4) and beam search (B5). All models are MWER trained except the RNN-T model, for which we found little improvement. First, the two-pass models perform substantially better than RNN-T (B0) on both the VS task (15%-25% WERR) and the rare word test sets (e.g., up to 30% for E10 on the SxS set). This confirms that second-pass decoding brings additional benefits. Second, the MWER-trained 8-hypothesis deliberation model with AE layers (E9) performs significantly better than LAS rescoring (B4) or beam search (B5). When beam search is used for both the deliberation and LAS models, the WERR is 7% for VS and 8% for the SxS set. We also observe significant improvements on the voice command test sets. Third, joint training (E10) brings an additional 2% relative improvement for VS, 9% for the SxS set, and uniform improvements on the voice command test sets.
To understand where the improvement comes from, in Fig. 2 we show an example of deliberation attention distribution on the RNN-T hypotheses (x-axis) at every step of the second-pass decoding (y-axis). We can see the attention selects mainly one wordpiece when the first-pass result is correct (e.g. “_weather”, “_in”, etc). However, when the first-pass output is wrong (e.g. “ond” and “on”), the attention looks ahead at “_Nevada” for context information for correction. We speculate that the attention functions similarly as a context-aware language model on the first-pass sequence.
In Table 4, we also report gigaFLOPS (GFLOPS) estimated using Eq. (3) for a 90th-percentile VS utterance, which has roughly 109 audio frames and a decoded sequence of 14 tokens. Since the deliberation decoder is larger than the LAS decoder (67M vs. 33M parameters), it requires around 1.8 times the GFLOPS of LAS rescoring. The increase mainly comes from the bidirectional encoding of 8 first-pass hypotheses. However, this computation can be parallelized across hypotheses and should thus have limited impact on latency. Latency estimation is complicated, and we will quantify it in future work.
4.6 Decoding Examples
Lastly, we compare decoding examples between deliberation and LAS rescoring in Table 6. One type of win for deliberation involves URLs, where the deliberation model corrects and concatenates string pieces into a single token because it sees the whole first-pass hypothesis. A second type involves proper nouns: leveraging context, deliberation recognizes that a word should be a proper noun (e.g., "Walmart"). Third, the deliberation decoder corrects semantic errors (e.g., "china" → "train"). On the other hand, we also see some losses for deliberation due to over-correction of proper nouns or spelling differences. The former probably stems from patterns learned in training, while the latter is benign and does not affect semantics.
| LAS rescoring | Deliberation |
| Where my job application | Walmart job application |
| china near me | train near me |
| bio of Chesty Fuller | bio of Chester Fuller |
| 2016 Kia Forte5 | 2016 Kia Forte 5 |
5 Conclusion

We presented a new two-pass E2E ASR model based on the deliberation network. Our best model obtains significant improvements over LAS rescoring on both the VS task and proper noun recognition: 12% and 23% WERR, respectively. The model also performs 21% relatively better than a large conventional model on VS. Although the model requires more computation than LAS rescoring, batching across hypotheses can mitigate the latency impact.
- Graves  A. Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.
- Rao et al.  K. Rao, H. Sak, and R. Prabhavalkar. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 193–199. IEEE, 2017.
- Chan et al.  W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pages 4960–4964. IEEE, 2016.
- Bahdanau et al.  D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. In Proc. ICASSP, pages 4945–4949. IEEE, 2016.
- Kim et al. [2017a] S. Kim, T. Hori, and S. Watanabe. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proc. ICASSP, pages 4835–4839. IEEE, 2017a.
- He et al.  Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, et al. Streaming end-to-end speech recognition for mobile devices. In Proc. ICASSP, pages 6381–6385. IEEE, 2019.
- Chiu et al.  C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al. State-of-the-art speech recognition with sequence-to-sequence models. In Proc. ICASSP, pages 4774–4778. IEEE, 2018.
- Pundak and Sainath  G. Pundak and T. Sainath. Lower frame rate neural network acoustic models. In Proc. Interspeech 2016, pages 22–26, 2016.
- Lüscher et al.  C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney. RWTH ASR systems for LibriSpeech: Hybrid vs attention. In Proc. Interspeech, pages 231–235, 2019.
- Sainath et al.  T. N. Sainath, R. Pang, D. Rybach, Y. He, R. Prabhavalkar, W. Li, M. Visontai, Q. Liang, T. Strohman, Y. Wu, I. McGraw, and C.-C. Chiu. Two-pass end-to-end speech recognition. In Proc. Interspeech 2019, pages 2773–2777, 2019.
- Zhang et al. [2019a] H. Zhang, R. Sproat, A. H. Ng, F. Stahlberg, X. Peng, K. Gorman, and B. Roark. Neural models of text normalization for speech applications. Computational Linguistics, 45(2):293–337, 2019a.
- Zhang et al. [2019b] S. Zhang, M. Lei, and Z. Yan. Investigation of transformer based spelling correction model for CTC-based end-to-end Mandarin speech recognition. In Proc. Interspeech, pages 2180–2184, 2019b.
- Guo et al.  J. Guo, T. N. Sainath, and R. J. Weiss. A spelling correction model for end-to-end speech recognition. In Proc. ICASSP, pages 5651–5655. IEEE, 2019.
- Kannan et al.  A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar. An analysis of incorporating an external language model into a sequence-to-sequence model. In Proc. ICASSP, pages 5824–5828. IEEE, 2018.
- Peyser et al.  C. Peyser, H. Zhang, T. N. Sainath, and Z. Wu. Improving performance of end-to-end ASR on numeric sequences. In Proc. Interspeech 2019, pages 2185–2189, 2019.
- Xia et al.  Y. Xia, F. Tian, L. Wu, J. Lin, T. Qin, N. Yu, and T.-Y. Liu. Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems, pages 1784–1794, 2017.
- Hassan et al.  H. Hassan, A. Aue, C. Chen, V. Chowdhary, J. Clark, C. Federmann, X. Huang, M. Junczys-Dowmunt, W. Lewis, M. Li, et al. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567, 2018.
- Sung et al.  T.-W. Sung, J.-Y. Liu, H.-Y. Lee, and L.-S. Lee. Towards end-to-end speech-to-text translation with two-pass decoding. In Proc. ICASSP, pages 7175–7179. IEEE, 2019.
- Schuster and Nakajima  M. Schuster and K. Nakajima. Japanese and Korean voice search. In Proc. ICASSP, pages 5149–5152. IEEE, 2012.
- Narayanan et al.  A. Narayanan, A. Misra, K. C. Sim, G. Pundak, A. Tripathi, M. Elfeky, P. Haghani, T. Strohman, and M. Bacchiani. Toward domain-invariant speech recognition via large scale training. In Proc. SLT 2018, pages 441–447. IEEE, 2018.
- Narayanan et al. [2019, to appear] A. Narayanan, R. Prabhavalkar, C.-C. Chiu, D. Rybach, T. Sainath, and T. Strohman. Recognizing long-form speech using streaming end-to-end models. In Proc. ASRU. IEEE, 2019 (to appear).
- Prabhavalkar et al.  R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and A. Kannan. Minimum word error rate training for attention-based sequence-to-sequence models. In Proc. ICASSP, pages 4839–4843. IEEE, 2018.
- Kim et al. [2017b] C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. N. Sainath, and M. Bacchiani. Generation of large-scale simulated utterances in virtual rooms to train deep neural networks for far-field speech recognition in Google Home. In Proc. Interspeech, pages 379–383, 2017b.
- Yu et al.  D. Yu, M. L. Seltzer, J. Li, J.-T. Huang, and F. Seide. Feature learning in deep neural networks: Studies on speech recognition tasks. arXiv preprint arXiv:1301.3605, 2013.
- Oord et al.  A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
- Hochreiter and Schmidhuber  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Vaswani et al.  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- Abadi et al.  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
- Shen et al.  J. Shen, P. Nguyen, Y. Wu, Z. Chen, M. X. Chen, Y. Jia, A. Kannan, T. Sainath, Y. Cao, C.-C. Chiu, et al. Lingvo: A modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295, 2019.