Input Length Matters: An Empirical Study Of RNN-T And MWER Training For Long-form Telephony Speech Recognition

10/08/2021
by   Zhiyun Lu, et al.
Google

End-to-end models have achieved state-of-the-art results on several automatic speech recognition tasks. However, they perform poorly when evaluated on long-form data, e.g., minutes-long conversational telephony audio. One reason the model fails on long-form speech is that it has only seen short utterances during training. This paper presents an empirical study on the effect of training utterance length on the word error rate (WER) for the RNN-transducer (RNN-T) model. We compare two widely used training objectives, log loss (or RNN-T loss) and minimum word error rate (MWER) loss. We conduct experiments on telephony datasets in four languages. Our experiments show that for both losses, the WER on long-form speech reduces substantially as the training utterance length increases. The average relative WER reduction is 15.7% for log loss and 8.8% for MWER loss. In addition, MWER loss achieves a lower WER than log loss when trained on short utterances; this difference between the two losses diminishes as the input length increases.


1 Introduction

Automatic speech recognition (ASR) on telephony speech [1, 2, 3] is an important problem with many real-world applications, e.g., medical conversations and call centers. It is a challenging ASR task because of its long-form and conversational nature, the presence of non-speech in the audio, and artifacts from the noisy communication channel. End-to-end (E2E) models [4] have achieved state-of-the-art performance on many benchmarks [5], like Librispeech. These datasets often have good acoustic conditions and high speech quality. However, when we apply E2E models to telephony conversations, the long-form speech and noisy acoustic conditions present a big challenge to the models' generalization capability and robustness [6, 7]. In this paper, we investigate methods to improve E2E model performance on long-form telephony speech recognition. We focus our study on the RNN transducer (RNN-T) [8, 9, 10, 11, 12].

One difficulty of telephony speech recognition is that the audio is long-form: recording lengths range from 30 seconds to more than 10 minutes. In order to recognize long audio with E2E models, [7, 13] segmented the audio into short utterances before running inference. However, the segmented utterances lose useful context information [14], e.g., about the speaker or topic, and imperfect segmentation can introduce additional segmentation errors into the system. To correct the errors introduced by segmentation, special handling like overlapping inference [13] is needed in decoding. This is not ideal, as it makes the system more complex and increases computational latency. In this work, we apply E2E models without any segmentation on the long-form data at test time. The model is able to leverage the context information, and the system stays simple, with no changes to the inference infrastructure.

Figure 1: WERs on the Spanish and Portuguese test sets across different lengths of training examples. The red curve is for log loss and the blue curve for MWER loss. For both losses, the WERs improve as we increase the training example length.

E2E models are known to be sensitive to distribution mismatch between training and inference and susceptible to overfitting. In particular, if the model is trained on short utterances, it fails to generalize to long-form speech [6, 7]. To alleviate the issue,  [7] proposed various regularization techniques, and [6] proposed to simulate long-form characteristics during training by manipulating models’ LSTM states. In this work, we adopt a simpler but more direct approach: we train the model on longer segments while retaining the correct acoustic context.

We investigate the effect of training utterance length on the word error rate (WER) for long-form speech. The training data are long recordings of telephone calls. In addition to the text labels, the transcriptions also contain manually annotated start and end times of the speech segments. This allows us to prepare the training set under different segmentations, by merging consecutive speech segments into longer ones to form training examples. We retain the non-speech audio between speech segments, and a single training example can contain multiple speakers. See Fig. 2 for an illustration.

Furthermore, we compare two widely used training objectives for the RNN-T model on the long-form task: the log loss (or RNN-T loss) and the minimum word error rate (MWER) loss [15, 16, 17, 18]. The log loss optimizes the log probability of the label sequence by marginalizing over alignments [8]. It can be computed efficiently with the forward-backward algorithm. In theory, maximizing the log likelihood of the training data yields an unbiased estimate of the parameters when the training data is abundant. In practice, however, it suffers from exposure bias [19, 20, 21]: the model makes predictions conditioned on the ground-truth labels during training, but on its own (possibly erroneous) predicted labels during inference, and this mismatch between training and inference hurts generalization. The MWER loss, on the other hand, directly minimizes the expected number of word errors, the evaluation metric at test time. MWER is a variant of edit-based minimum Bayes risk [22], where the expectation is approximated with an empirical average over the N-best hypotheses from beam search. Compared to the log loss, the MWER loss is conditioned not only on the reference but also on the competing hypotheses, and its training procedure reduces exposure bias by running inference at training time. However, MWER training is more computationally expensive.

We conduct experiments on telephony datasets in four languages, each with a few hundred hours of audio. Our main observations are three-fold. Firstly, increasing the length of training examples significantly improves the WER on the long-form task for both the log loss and the MWER loss; the average relative WER reduction is 15.7% for log loss and 8.8% for MWER loss. Secondly, the MWER loss performs slightly better than the log loss, at the price of a higher computational cost; the gain is more compelling when the training examples are short. Fig. 1 compares the two losses across different training example lengths on the Spanish and Portuguese tasks. Lastly, training first with log loss on long utterances and then fine-tuning with MWER on short utterances achieves a good WER at a relatively low computational cost. The rest of the paper is organized as follows: we describe the method in §2, present experimental results in §3, and conclude in §4.

2 Method

In this section, we describe the telephony datasets and how we apply segmentation to create longer training examples in §2.1. We introduce the details of the log loss and MWER loss in §2.2.

2.1 Create long training examples

Figure 2: Training examples are generated by segmenting long audio recordings: consecutive transcribed speech segments are merged together with the non-speech audio (slash-filled intervals) in between them, and there can be speaker changes within one example.

The datasets consist of recordings of telephone calls in 4 languages: Australian English (En), French (Fr), Mexican Spanish (Es), and Brazilian Portuguese (Pt). The total number of hours and the average recording length in seconds are summarized in Table 1. There are 2 telephony test sets for Es and Fr, and 1 test set for Pt and En. For both training and test sets, the audio is a few minutes long. We also include an out-of-domain (OOD) test set collected from YouTube videos for each language, in order to monitor generalization to non-telephony long-form speech.

lang.   train set          test set: tel 1    test set: tel 2    test set: OOD
Es      552.1h / 130.4s    25.3h / 101.8s     24.7h / 184.5s     9.7h / 512.3s
Pt      614.6h / 431.0s    19.2h / 139.8s     -                  9.8h / 519.0s
En      215.1h / 62.2s     19.3h / 130.4s     -                  6.4h / 493.2s
Fr      250.1h / 106.6s    19.9h / 125.9s     22.6h / 52.2s      10.0h / 611.9s

Table 1: Dataset statistics: total hours of the dataset / average length of the recordings in seconds.

seg.     Es    Pt    En    Fr
raw
short
medium
long

Table 2: Length of training examples (mean ± std) in seconds under different segmentation regimes.

Each minutes-long audio recording contains multiple annotated transcripts, each a tuple of (text label, start time, end time); in Fig. 2, we refer to such a tuple as a segment. We sort the raw transcription segments by start time (the timestamp information is only used to segment the audio, not during training) and group consecutive segments into longer units, which we use as training examples; for example, in Fig. 2 the first two segments form one training example and the following segments form the next. We control the length of the training examples by choosing how many raw segments to merge. To retain the correct acoustic context, we keep the non-speech audio, such as background noise and music, in between speech segments, and there can be speaker changes within one example. In principle, one could also use any voice activity detector (VAD) [23, 24, 25] to segment the audio and control the training example length by varying the threshold on non-speech intervals; in our case, we simply use the segmentation provided in the annotation.
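To make the merging step concrete, below is a minimal Python sketch of grouping consecutive annotated segments into longer training examples. The `Segment` tuple, the `merge_segments` helper, and the duration threshold `target_len` are illustrative assumptions; the paper controls length by how many raw segments are merged (the raw/short/medium/long regimes of Table 2), and a duration target is just one simple way to implement that.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float  # annotated start time in seconds
    end: float    # annotated end time in seconds
    text: str     # transcript of the speech segment


def merge_segments(segments: List[Segment], target_len: float) -> List[Segment]:
    """Greedily merge consecutive transcribed segments into longer training
    examples of roughly `target_len` seconds. The merged span runs from the
    first segment's start to the last segment's end, so the non-speech audio
    in between is kept."""
    segments = sorted(segments, key=lambda s: s.start)
    examples, group = [], []
    for seg in segments:
        group.append(seg)
        if group[-1].end - group[0].start >= target_len:
            examples.append(Segment(group[0].start, group[-1].end,
                                    " ".join(s.text for s in group)))
            group = []
    if group:  # flush the remaining segments as a final (shorter) example
        examples.append(Segment(group[0].start, group[-1].end,
                                " ".join(s.text for s in group)))
    return examples


# Toy usage: three short segments merged under a ~10-second target length.
segs = [Segment(0.0, 2.1, "hello"), Segment(4.0, 8.5, "how are you"),
        Segment(12.0, 15.0, "fine thanks")]
print(merge_segments(segs, target_len=10.0))
```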

We prepare the training sets under 4 segmentation regimes, which we refer to as raw, short, medium, and long. Table 2 summarizes the mean and standard deviation of the training example lengths under each regime. For raw segmentation, we use the transcribed segments provided in the annotation without any merging. For short, medium, and long, raw segments are merged into longer utterances; from short to long, the number of merged segments, i.e., the training example length, successively increases.

2.2 Optimization objectives

We denote the input utterance as $x$ and a token sequence as $y$; $y^*$ denotes the ground-truth target sequence. The RNN-T model outputs a probability distribution over tokens at every alignment point, i.e., every pair of input frame and label position. From these outputs, the probability $P(y \mid x)$ of any token sequence can be derived by marginalizing over the possible alignments using the forward-backward algorithm [8].

                                          log loss                       MWER loss
lang.   test set        initialization    raw    short   medium   long   raw    short   medium   long
Es Telephony 1 34.0 29.7 20.8 18.6 18.4 21.4 19.4 18.2 17.8
Telephony 2 38.1 33.9 25.9 24.4 24.7 27.3 25.0 25.4 24.1
OOD YouTube 17.6 18.4 15.9 15.7 17.6 17.1 16.2 15.8 15.4
Pt Telephony 35.0 29.0 24.0 22.5 21.8 25.8 22.5 21.9 21.7
OOD YouTube 17.0 17.5 21.9 18.0 17.5 17.1 18.0 18.0 17.1
En Telephony 24.7 18.0 17.6 17.2 17.0 17.3 17.2 17.0 16.9
OOD YouTube 11.4 12.3 12.0 11.8 11.5 12.0 11.8 11.7 11.6
Fr Telephony 1 28.0 23.8 22.6 22.4 22.4 23.2 22.2 22.5 22.1
Telephony 2 29.3 24.5 24.4 24.0 23.8 24.0 23.7 24.0 23.8
OOD YouTube 17.6 18.1 17.0 16.7 18.6 18.0 17.2 16.5 16.7
Table 3: WER (%) under different training example lengths. The 3rd column gives the WERs of the initialized models; the first raw/short/medium/long block is for models trained with log loss, and the second block for MWER loss. For both losses, the WER improves as the segmentation length increases. The average relative improvement is 15.7% for log loss and 8.8% for MWER.

Log loss is defined as the negative log probability of the ground-truth sequence.
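Formally, this definition is $\mathcal{L}_{\mathrm{log}} = -\log P(y^* \mid x)$. As a concrete illustration of the marginalization over alignments behind this quantity, below is a minimal NumPy sketch of the forward algorithm for the RNN-T log loss [8]. It operates on pre-computed per-(frame, label-position) log-probabilities; the array names, shapes, and the toy example are assumptions for illustration, not the paper's implementation.

```python
import numpy as np


def rnnt_log_prob(logp_blank: np.ndarray, logp_label: np.ndarray) -> float:
    """Forward algorithm for RNN-T (Graves, 2012).

    logp_blank[t, u]: log P(blank | frame t, u labels emitted), shape (T, U+1)
    logp_label[t, u]: log P(y_{u+1} | frame t, u labels emitted), shape (T, U)
    Returns log P(y | x), marginalized over all alignments.
    """
    T, U_plus_1 = logp_blank.shape
    U = U_plus_1 - 1
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            candidates = []
            if t > 0:  # arrive by consuming a frame (blank) from (t-1, u)
                candidates.append(alpha[t - 1, u] + logp_blank[t - 1, u])
            if u > 0:  # arrive by emitting label y_u from (t, u-1)
                candidates.append(alpha[t, u - 1] + logp_label[t, u - 1])
            alpha[t, u] = np.logaddexp.reduce(candidates)
    # Terminate with a final blank from the last lattice node.
    return float(alpha[T - 1, U] + logp_blank[T - 1, U])


# Toy example: T=3 frames, U=2 labels, with blank/label scores normalized
# pairwise (a simplification; a real joint network normalizes over all tokens).
rng = np.random.default_rng(0)
T, U = 3, 2
logits = rng.normal(size=(T, U + 1, 2))
logp = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
log_p_ref = rnnt_log_prob(logp[..., 0], logp[:, :U, 1])
print("log P(y*|x) =", log_p_ref, " log loss =", -log_p_ref)
```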

MWER loss. We use $W(y, y^*)$ to denote the number of word errors in a hypothesis $y$ relative to the ground truth $y^*$. Edit-based minimum Bayes risk (EMBR) [22] is defined as the expected number of word errors, Eq. 1. For E2E models, the expectation is approximated with empirical samples; in this work, we use the $N$-best hypotheses from beam search as the empirical samples, following [17, 18, 26], as this has been shown to work better than sampling from the model distribution. We denote the $N$-best list as $\mathcal{B}$. The MWER loss is defined in Eq. 2,

$$\mathcal{L}_{\mathrm{EMBR}} = \mathbb{E}_{y \sim P(y \mid x)}\left[\, W(y, y^*) \,\right] \qquad (1)$$
$$\mathcal{L}_{\mathrm{MWER}} = \sum_{y \in \mathcal{B}} \hat{P}(y \mid x)\left( W(y, y^*) - \bar{W} \right) \qquad (2)$$

where $\hat{P}(y \mid x) = P(y \mid x) / \sum_{y' \in \mathcal{B}} P(y' \mid x)$ is the probability re-normalized over the $N$-best list, and $\bar{W} = \frac{1}{N}\sum_{y \in \mathcal{B}} W(y, y^*)$ is the average number of word errors of the $N$-best hypotheses. The MWER loss boosts the probability of hypotheses with fewer word errors than average, and reduces the probability of those that are worse than average. In practice, to stabilize MWER training [18], we interpolate the MWER loss with the log loss using a hyper-parameter $\lambda$:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MWER}} + \lambda\, \mathcal{L}_{\mathrm{log}} \qquad (3)$$

There are two main differences between the log loss and the MWER loss. First, the log loss only increases the probability of the ground-truth sequence, while MWER performs discriminative training among competing hypotheses. Second, the log loss feeds the ground-truth label history into the prediction network of the RNN-T, while MWER feeds the (potentially erroneous) predicted label history, which helps reduce exposure bias [19].
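As a concrete illustration of Eqs. 1-3, the NumPy sketch below computes the MWER loss from an N-best list: word errors via a word-level edit distance, probabilities re-normalized over the N-best list, and interpolation with the log loss. The function names, the toy hypotheses, and the `log_p_ref` argument (which would come from the forward algorithm sketched above) are illustrative assumptions, not the paper's Lingvo implementation.

```python
import numpy as np


def word_errors(hyp: str, ref: str) -> int:
    """Word-level edit distance W(y, y*) between a hypothesis and the reference."""
    h, r = hyp.split(), ref.split()
    d = np.zeros((len(h) + 1, len(r) + 1), dtype=int)
    d[:, 0] = np.arange(len(h) + 1)
    d[0, :] = np.arange(len(r) + 1)
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = d[i - 1, j - 1] + (h[i - 1] != r[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return int(d[len(h), len(r)])


def mwer_loss(nbest: list, log_probs: np.ndarray, ref: str,
              log_p_ref: float, lam: float = 0.0) -> float:
    """MWER loss over an N-best list, interpolated with the log loss (Eq. 3).

    nbest:      N hypothesis strings from beam search.
    log_probs:  log P(y|x) of each hypothesis under the model.
    log_p_ref:  log P(y*|x) of the ground truth, for the log-loss term.
    lam:        interpolation weight (lambda in Eq. 3).
    """
    errs = np.array([word_errors(h, ref) for h in nbest], dtype=float)
    # Re-normalize probabilities over the N-best list (P-hat in Eq. 2).
    p_hat = np.exp(log_probs - np.logaddexp.reduce(log_probs))
    w_bar = errs.mean()                                  # average word errors W-bar
    loss_mwer = float(np.sum(p_hat * (errs - w_bar)))    # Eq. 2
    return loss_mwer + lam * (-log_p_ref)                # Eq. 3


# Toy usage with N = 3 hypotheses.
hyps = ["the cat sat", "the cat sat down", "a cat sat"]
print(mwer_loss(hyps, np.log(np.array([0.5, 0.3, 0.2])),
                ref="the cat sat down", log_p_ref=np.log(0.3), lam=0.01))
```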

3 Experiment

We describe the model and experimental setups in §3.1. We present the main results in §3.2, and two-stage training results in §3.3.

3.1 Setup

Model Architecture. We use the state-of-the-art RNN-T model [27] with the Conformer encoder [5]. The output tokens are 4,096 word-pieces. For the acoustic front-end, we use 128-dimensional log-Mel features, computed with a 32ms window and shifted every 10ms. The log-Mel features from 4 contiguous frames are stacked to form a 512-dimensional input, and then subsampled by a factor of 3.
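Below is a minimal sketch of the frame stacking and subsampling described above, assuming a `[num_frames, 128]` log-Mel matrix has already been computed. The stacking window of 4 and the subsampling factor of 3 follow the text; the function name, the simple slicing strategy, and the toy input are assumptions for illustration.

```python
import numpy as np


def stack_and_subsample(log_mel: np.ndarray, stack: int = 4, factor: int = 3) -> np.ndarray:
    """Stack `stack` contiguous 128-dim log-Mel frames into 512-dim vectors,
    then keep every `factor`-th stacked frame (subsampling by 3)."""
    num_frames, _ = log_mel.shape                       # e.g. (T, 128)
    usable = num_frames - stack + 1
    stacked = np.stack([log_mel[i:i + stack].reshape(-1) for i in range(usable)])
    return stacked[::factor]                            # shape (~T/3, 512)


# 10 seconds of audio at a 10 ms frame shift -> ~1000 frames of 128-dim features.
feats = np.random.randn(1000, 128).astype(np.float32)
print(stack_and_subsample(feats).shape)                 # (333, 512)
```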

Initialization model. We first pre-train the RNN-T model on YouTube and multi-domain data before training on the telephony data. For the Fr, Es, and Pt tasks, the initialization model is pre-trained on YouTube segments labeled by the predictions of a ROVER ensemble teacher model, together with multi-domain data covering search, far-field, and telephony; please refer to [28] for more details. For the En task, the initialization model is pre-trained on multi-domain data, the same as in [6].

All experiments are run on 8x8 Cloud TPU [29] using the TensorFlow Lingvo toolkit [30]. For optimization, we use mini-batches of size 128 with the Adam optimizer [31]. SpecAugment [32] is applied for both log loss and MWER training. For MWER, the number of hypotheses $N$ is 4, and the MWER loss is interpolated with the log loss as in Eq. 3.

3.2 Effect of input length on WER

Figure 3: WERs on the Spanish Telephony 1 test set across different input lengths. The red bars are for log loss and the blue bars for MWER. The blue dashed line marks the lowest WER of all models. When we train on the raw segments, the deletion error dominates the WER. For both losses, the deletions reduce significantly as we increase the length of the training examples. MWER achieves the best WER, with a slight gain in substitution error compared to the log loss.

In Table 3, we report the WERs on the test sets under different training example lengths. The 3rd column is the WER of the initialization checkpoint after pre-training; the first raw/short/medium/long block gives the WERs of the log-loss models, and the second block those of the MWER-loss models. For both losses, the trend is that the longer the training examples, the better the WER. There is a 38% relative WER reduction on the Es Tel 1 set when training with "long" segments compared to "raw" segments using log loss. The average relative gain across all in-domain test sets is 15.7% for log loss and 8.8% for MWER. [33] observed a similar trend in a teacher-student training framework. Comparing WERs across the two losses, the WERs of MWER are lower than those of the log loss. When the training examples are short, the benefit of MWER is more substantial, while the gain diminishes as the training examples get longer. [21] observed a similar trend for the oracle WER. We hypothesize that the lack of diversity in the $N$-best list for longer utterances [34], coupled with the use of a small $N$, impairs MWER training on long utterances.
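For example, the 38% figure follows directly from the Es Tel 1 row of Table 3 for the log loss: $(29.7 - 18.4)/29.7 \approx 0.38$, i.e., moving from the raw-segment WER of 29.7% to the long-segment WER of 18.4% is a 38% relative reduction.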

To examine the sources of the models' mistakes, we break down the word errors into deletions (del), substitutions (sub), and insertions (ins); Fig. 3 visualizes the breakdown on the Es Tel 1 test set. When we train on the raw segments with log loss, deletions dominate the WER, making up 50% of the total errors. The MWER model is better, but still suffers from relatively high deletions. A possible explanation is that the raw transcribed segments contain a minimal amount of non-speech audio, which makes the model less robust against noise and incurs high deletions. As we increase the length of the training examples, from an average of 2.1 seconds to 25.0 seconds, the deletions reduce significantly for both MWER and log loss, which in turn gives a much lower total WER. MWER achieves the best WER overall, with a slight gain in substitutions compared to the log loss.

On the out-of-domain long-form YouTube set, increasing input length can improve WER in some cases, like Spanish and French, with the improvement from reduced deletions. But the gain is not always consistent, because of overfitting to the telephony domain.

Lastly, we evaluate the WER on short-form data, to verify whether running inference on long-form speech improves over the segment-then-decode approach. We segment the En OOD YouTube set into short utterances with an average length of 11.1 seconds, evaluate the WER on the segmented test set, and compare it with the long-form decoding result (the En OOD YouTube row of Table 3) in Table 4. Decoding on the long audio has lower WERs, with improvements in all of del/sub/ins. Besides, the short-form WER is relatively stable across different training example lengths, which reassures us that training with longer utterances does not hurt short-form performance.

                            training segment
loss         test segment   raw    short   medium   long
log loss     long           12.3   12.0    11.8     11.5
             short          14.5   14.0    14.4     13.9
MWER loss    long           12.0   11.8    11.7     11.6
             short          14.1   14.0    13.9     14.2
Table 4: WER (%) of decoding on long-form vs. on the segmented short utterances in En OOD dataset. Long-form inference is better.

3.3 Best of both worlds: efficient two-stage training recipe

Despite its good WERs, MWER training is computationally expensive, and its gains over the log loss diminish as the input length increases. We therefore experiment with a two-stage training recipe: we first train with log loss on long segments, and then fine-tune with MWER loss on raw segments. We hope to get the best of both worlds: the fast training speed of the log loss and its benefit from longer inputs, together with the good WER of the MWER loss. For the MWER fine-tuning, we experiment with fine-tuning either the full model or only the decoder, i.e., the prediction network and the joint network of the RNN-T. The intuition is that the 1st stage trains the model, especially the encoder, to capture long-form characteristics with the log loss, and the 2nd stage fine-tunes the decoder to output better predictions with the MWER loss. In the two-stage experiments, we use a smaller $\lambda$ in Eq. 3: a reduced value for full-model fine-tuning, and 0 for decoder-only fine-tuning.
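A minimal sketch of the decoder-only fine-tuning idea is shown below: freeze the encoder so that the 2nd-stage optimizer only updates the prediction and joint networks. The `ToyRNNT` class, its tiny dense layers, and the attribute names are illustrative assumptions; the paper's model is a Conformer RNN-T trained in Lingvo, so this only demonstrates the freezing mechanism, not the actual recipe.

```python
import tensorflow as tf


class ToyRNNT(tf.keras.Model):
    """A toy stand-in for an RNN-T with separate encoder / prediction / joint
    sub-networks, used only to illustrate decoder-only fine-tuning."""

    def __init__(self, num_tokens: int = 4096):
        super().__init__()
        self.encoder = tf.keras.layers.Dense(256, name="encoder")
        self.prediction_network = tf.keras.layers.Dense(256, name="prediction_network")
        self.joint_network = tf.keras.layers.Dense(num_tokens, name="joint_network")

    def call(self, inputs):
        feats, label_emb = inputs                          # (B, T, 512), (B, U, 128)
        enc = self.encoder(feats)                          # (B, T, 256)
        pred = self.prediction_network(label_emb)          # (B, U, 256)
        joint = enc[:, :, None, :] + pred[:, None, :, :]   # (B, T, U, 256)
        return self.joint_network(joint)                   # (B, T, U, num_tokens)


model = ToyRNNT()
_ = model((tf.zeros([1, 10, 512]), tf.zeros([1, 5, 128])))  # build the variables

# Stage 2, decoder-only variant: freeze the encoder so that MWER fine-tuning
# only updates the prediction network and the joint network.
model.encoder.trainable = False
print(len(model.trainable_variables), "trainable tensors after freezing the encoder")
```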

                                                          2nd stage (MWER)
lang.   test     best of Table 3   1st stage (log loss)   full mdl   dec only
Es Tel 1 17.8 18.4 17.9 17.5
Tel 2 24.1 24.7 24.4 24.4
OOD 15.4 17.6 18.4 17.0
Pt Tel 1 21.7 21.8 21.6 21.8
OOD 17.0 17.5 17.4 17.4
En Tel 1 16.9 17.0 16.9 16.9
OOD 11.4 11.5 11.5 11.4
Fr Tel 1 22.1 22.4 21.9 22.0
Tel 2 23.7 23.8 23.3 23.5
OOD 16.5 18.6 17.0 17.1
Table 5: WER (%) of the two-stage training. The 2nd-stage MWER fine-tuning consistently improves over the 1st stage on the telephony test sets, and gets close to, or in some cases outperforms, the best WER from Table 3.
                   MWER    log loss (1st stage)   2nd stage: full mdl   2nd stage: dec only
segmentation       long    long                   raw                   raw
seconds / step     3.88    2.67                   1.35                  1.08
# steps            60k     8k                     50k                   16k
total hours        44.5    8.6                    18.7                  4.8
Table 6: Computation costs of different training recipes on the Spanish task. The total cost of two-stage training with decoder fine-tuning is 13.4 hours, roughly 30% of the cost of MWER training on long segments.

In Table 5, the first two result columns are copied from Table 3: the 3rd column is the best WER on each test set, and the 4th column is the WER of the 1st-stage log-loss training, i.e., the "long" column of the log-loss block in Table 3. The 5th and 6th columns are the WERs after MWER fine-tuning on raw segments, fine-tuning all weights and only the decoder weights, respectively. On the in-domain telephony test sets, fine-tuning with MWER consistently improves over the 1st-stage WER, and gets close to, or even outperforms, the best WER on some test sets.

Table 6 compares the computation costs of different training recipes. We break down the training time into two parts: the average seconds per training step and the number of steps until the best test WER; the last row is the total training time in hours. Comparing MWER with log loss on long segments (2nd and 3rd columns), MWER takes longer to train: it is slower per step, since it performs beam search and error computation, and it also takes more steps to converge. The 2nd-stage MWER fine-tunes on raw segments, which is about 3 times faster per step than training on long segments. Since the decoder has far fewer parameters than the encoder, fine-tuning only the decoder converges in far fewer steps than fine-tuning the full model, reducing the MWER fine-tuning time from 18.7 hours to 4.8 hours. To summarize, first training with log loss on long segments and then fine-tuning with MWER on raw segments achieves good WERs at a relatively low computational cost.
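For the fine-tuning columns, where per-step times and step counts are both listed, the bookkeeping is straightforward: $1.35 \times 50{,}000 / 3600 \approx 18.7$ hours for full-model fine-tuning, $1.08 \times 16{,}000 / 3600 = 4.8$ hours for decoder-only fine-tuning, and the two-stage total with decoder-only fine-tuning is $8.6 + 4.8 = 13.4$ hours, i.e., about $13.4 / 44.5 \approx 30\%$ of the cost of MWER training on long segments.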

4 Conclusion

In this work, we compare RNN-T and MWER training across different training utterance lengths on long-form telephony speech recognition. Our results show that training on longer utterances can greatly reduce the WERs on long-form data. The average relative WER reduction is 15.7% for log loss and 8.8% for MWER loss. Future work includes improving MWER training on long utterances and incorporating an external language model to further improve WER.

5 Acknowledgement

We are grateful to Chung-Cheng Chiu, Wei Han, Yu Zhang, Arun Narayanan, Qiujia Li, Yongqiang Wang, Ruoming Pang, Hank Liao, Basi García, Han Lu, Qian Zhang, Hasim Sak, Oren Litvin for their help and suggestions.

6 References


  • [1] Z. Tüske, G. Saon, and B. Kingsbury, “On the limit of English conversational speech recognition,” in Proc. Interspeech, 2021.
  • [2] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke, “The Microsoft 2017 conversational speech recognition system,” in Proc. ICASSP, 2018.
  • [3] K. J. Han, S. Hahm, B. Kim, J. Kim, and I. Lane, “Deep learning-based telephony speech recognition in the wild,” in Proc. Interspeech, 2017.
  • [4] Y. Zhang, J. Qin, D. Park, W. Han, C. Chiu, R. Pang, Q. Le, and Y. Wu, “Pushing the limits of semi-supervised learning for automatic speech recognition,” in Proc. NeurIPS SAS Workshop, 2020.
  • [5] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020.
  • [6] A. Narayanan, R. Prabhavalkar, C. Chiu, D. Rybach, T. N. Sainath, and T. Strohman, “Recognizing long-form speech using streaming end-to-end models,” in Proc. ASRU, 2019.
  • [7] C. Chiu, A. Narayanan, W. Han, R. Prabhavalkar, Y. Zhang, N. Jaitly, R. Pang, T. N. Sainath, P. Nguyen, L. Cao, and Y. Wu, “RNN-T models fail to generalize to out-of-domain audio: Causes and solutions,” in Proc. SLT, 2021.
  • [8] A. Graves, “Sequence transduction with recurrent neural networks,” in Proc. ICML Representation Learning Workshop, 2012.
  • [9] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. ICASSP, 2013.
  • [10] J. Li, R. Zhao, H. Hu, and Y. Gong, “Improving RNN transducer modeling for end-to-end speech recognition,” in Proc. ASRU, 2019.
  • [11] A. Zeyer, A. Merboldt, R. Schlüter, and H. Ney, “A new training pipeline for an improved neural transducer,” in Proc. Interspeech, 2020.
  • [12] G. Saon, Z. Tüske, D. Bolanos, and B. Kingsbury, “Advancing RNN transducer technology for speech recognition,” in Proc. ICASSP, 2021.
  • [13] C. Chiu, W. Han, Y. Zhang, R. Pang, S. Kishchenko, P. Nguyen, A. Narayanan, H. Liao, S. Zhang, A. Kannan, R. Prabhavalkar, Z. Chen, T. N. Sainath, and Y. Wu, “A comparison of end-to-end models for long-form speech recognition,” in Proc. ASRU, 2019.
  • [14] T. Hori, N. Moritz, C. Hori, and J. Le Roux, “Advanced long-context end-to-end speech recognition using context-expanded transformers,” in Proc. Interspeech, 2021.
  • [15] L. Lu, Z. Meng, N. Kanda, J. Li, and Y. Gong, “On minimum word error rate training of the hybrid autoregressive transducer,” in Proc. Interspeech, 2021.
  • [16] J. Guo, G. Tiwari, J. Droppo, M. V. Segbroeck, C. Huang, A. Stolcke, and R. Maas, “Efficient minimum word error rate training of RNN-transducer for end-to-end speech recognition,” in Proc. Interspeech, 2020.
  • [17] C. Weng, C. Yu, J. Cui, C. Zhang, and D. Yu, “Minimum bayes risk training of RNN-transducer for end-to-end speech recognition,” in Proc. Interspeech, 2020.
  • [18] R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C. Chiu, and A. Kannan, “Minimum word error rate training for attention-based sequence-to-sequence models,” in Proc. ICASSP, 2018.
  • [19] X. Cui, B. Kingsbury, G. Saon, D. Haws, and Z. Tuske, “Reducing exposure bias in training recurrent neural network transducers,” in Proc. Interspeech, 2021.
  • [20] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” in Proc. ICLR, 2016.
  • [21] Q. Li, Y. Zhang, B. Li, L. Cao, and P. C. Woodland, “Residual energy-based models for end-to-end speech recognition,” in Proc. Interspeech, 2021.
  • [22] M. Shannon, “Optimizing expected word error rate via sampling for speech recognition,” in Proc. Interspeech, 2017.
  • [23] S. Thomas, G. Saon, M. V. Segbroeck, and S. Narayanan, “Improvements to the IBM speech activity detection system for the DARPA RATS program,” in Proc. ICASSP, 2015.
  • [24] M. Lavechin, R. Bousbib, H. Bredin, E. Dupoux, A. Cristia, M. Gill, and L. Garcia-Perera, “End-to-end domain-adversarial voice activity detection,” in Proc. Interspeech, 2020.
  • [25] X. Xu, H. Dinkel, M. Wu, and K. Yu, “A lightweight framework for online voice activity detection in the wild,” in Proc. Interspeech, 2021.
  • [26] D. Povey, Discriminative training for large vocabulary speech recognition, Ph.D. thesis, University of Cambridge, 2005.
  • [27] B. Li, A. Gulati, J. Yu, T. N. Sainath, C. Chiu, A. Narayanan, S. Chang, R. Pang, Y. He, J. Qin, W. Han, Q. Liang, Y. Zhang, T. Strohman, and Y. Wu, “A better and faster end-to-end model for streaming ASR,” in Proc. ICASSP, 2021.
  • [28] T. Doutre, W. Han, C. Chi, R. Pang, O. Siohan, and L. Cao, “Bridging the gap between streaming and non-streaming asr systems by distilling ensembles of CTC and RNN-T models,” in Proc. Interspeech, 2021.
  • [29] N. P. Jouppi, C. Young, N. Patil, et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. Symposium on Computer Architecture, 2017.
  • [30] J. Shen, P. Nguyen, et al., “Lingvo: A modular and scalable framework for sequence-to-sequence modeling,” arXiv:1902.08295, 2019.
  • [31] D.P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, San Diego, 2015.
  • [32] D. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. Cubuk, and Q. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019.
  • [33] T. Doutre, W. Han, M. Ma, Z. Lu, C. Chiu, R. Pang, A. Narayanan, A. Misra, Y. Zhang, and L. Cao, “Improving streaming automatic speech recognition with non-streaming model distillation on unsupervised data,” in Proc. ICASSP, 2021.
  • [34] R. Prabhavalkar, Y. He, D. Rybach, S. Campbell, A. Narayanan, T. Strohman, and T. N. Sainath, “Less is more: Improved RNN-T decoding using limited label context and path merging,” in Proc. ICASSP, 2021.