have gained large popularity in the automatic speech recognition (ASR) community over the last few years. These models replace the components of a conventional ASR system, namely the acoustic model (AM), pronunciation model (PM) and language model (LM), with a single neural network. They are a fraction of the size of a conventional ASR system, making them attractive for on-device ASR applications. Specifically, on-device means that instead of streaming audio from the device to the server, recognizing text on the server, and then streaming results back to the device, recognition is performed entirely on the device. This has important implications for reliability, privacy and latency.
Running an ASR model on-device presents numerous additional user interaction constraints. First, we require that recognition results be streaming; the recognized words should appear on the screen as they are spoken. Second, the delay between when a user stops speaking and when the hypothesis is finalized, which we refer to as latency, must be low. RNN-T models, which meet these on-device constraints, have been shown to be competitive in terms of quality in recent studies [12, 19]. But under low-latency constraints, they lag behind a conventional server-side streaming ASR system. At the other end of the spectrum, non-streaming models, such as LAS, have been shown to outperform a conventional ASR system. However, LAS models are not streaming, as they must attend to the entire audio segment. Recently, a 2-pass RNN-T+LAS model was proposed, in which LAS rescores hypotheses from RNN-T. This model was shown to abide by user interaction constraints and to offer comparable performance to a conventional model.
In this paper, we extend this 2-pass work in several directions, to develop an on-device E2E model that surpasses a conventional model in both WER and latency. First, on the quality front, we train our model on multi-domain audio-text utterance pairs, utilizing sources from different domains including search traffic, telephony data and YouTube data. This not only increases acoustic diversity, but also increases the vocabulary seen by the E2E model, as it is trained solely on audio-text pairs, which are a small fraction of the text-only LM data used by a conventional model. Because the transcription and audio characteristics vary between domains, we also explore adding the domain-id as an input to the model. We find that by training with multi-domain data and feeding in a domain-id, we are able to improve upon a model trained on voice search data only. Second, also on the quality front, we address improving robustness to different pronunciations. Conventional models handle this by using a lexicon that can have multiple pronunciations for a word. Since our E2E models directly predict word-pieces, we address this by including accented English data from different locales. Third, given the increased number of audio-text pairs used in training, we explore using a constant learning rate rather than gradually decaying the learning rate over time, thereby giving even weight to the training examples as training progresses.
We also explore various ideas to improve the latency of our model. We define endpointer (EP) latency as the amount of time it takes for the microphone to close after a user stops speaking. To make a fair comparison, this metric excludes network latency and computation time when comparing the on-device and server endpointer latencies. Typically, an external voice activity detector (VAD) is used to make microphone-closing decisions. For conventional ASR systems, an end-of-query (EOQ) endpointer [28, 6, 7] is often used for improved EP latency. Recently, integrating the EOQ endpointer into the E2E model by predicting the end-of-query symbol, </s>, to aid in closing the microphone was shown to improve latency. First, we build on this work, introducing a penalty in RNN-T training for emitting </s> too early or too late. Second, we improve the computation latency of the 2nd-pass rescoring model. Specifically, we reduce the 2nd-pass run time of LAS by batching inference over multiple arcs of a rescoring lattice, and by offloading part of the computation to the first pass. LAS rescoring also obtains a better tradeoff between WER and EP latency due to the improved recognition quality.
2 Model Architecture
The proposed 2-pass E2E architecture is shown in Figure 1. Let us denote input acoustic frames as $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_T)$, where $\mathbf{x}_t \in \mathbb{R}^d$ are stacked log-mel filterbank energies ($d = 512$) and $T$ is the number of frames in $\mathbf{x}$. In the 1st-pass, each acoustic frame $\mathbf{x}_t$ is passed through a shared encoder, consisting of a multi-layer LSTM, to get output $\mathbf{e}^s_t$, which is then passed to an RNN-T decoder (consisting of a prediction network and a joint network) that predicts $\mathbf{y}_r$, the output sequence, in a streaming fashion. Here $\mathbf{y}_r$ is a sequence of word-piece tokens. In the 2nd-pass, the full output of the shared encoder, $\mathbf{e}^s$, is passed to a small additional encoder to generate $\mathbf{e}^a$, which is then passed to an LAS decoder. We add the additional encoder since it is found to be useful to adapt the encoder output to be more suitable for LAS. During training, the LAS decoder computes output $\mathbf{y}_l$ according to $\mathbf{e}^a$. During decoding, the LAS decoder rescores multiple top hypotheses from RNN-T, $\mathbf{y}_r$, represented as a lattice. Specifically, we run the LAS decoder on each lattice arc in the teacher-forcing mode, with attention on $\mathbf{e}^a$, to update the probability in the arc. At the end, the top output sequence with the highest probability is extracted from the rescored lattice.
3 Quality Improvements
3.1 Multi-domain Data
Our E2E model is trained on audio-text pairs only, which is a small fraction of data compared to the trillion-word text-only data a conventional LM is trained with. Previous work [12, 25] used only search utterances. To increase vocabulary and diversity of training data, we explore using more data by incorporating multi-domain utterances, as described in prior work. These multi-domain utterances span domains of search, farfield, telephony and YouTube. All datasets are anonymized and hand-transcribed, except for YouTube utterances, whose transcription is done in a semi-supervised fashion [18, 30].
One of the issues with using multi-domain data is that each domain has different transcription conventions. For example, search data has numerics in the written domain (e.g., $100) while YouTube queries are often in the spoken domain (one hundred dollars). Another issue is with respect to multiple speakers. Search queries contain only one speaker per utterance, while YouTube queries contain multiple speakers. Since a main goal is to improve the quality of search queries, we explore feeding a domain-id to the E2E model as a one-hot vector, with the id being one of the 4 domains. Following prior work, we find it adequate to feed the domain-id only to the RNN-T encoder.
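As a minimal sketch of the domain-id input, the snippet below builds a one-hot vector over the 4 domains and appends it to every acoustic frame. The concatenation point, the domain ordering and the helper names (`one_hot_domain`, `append_domain_id`) are assumptions made for illustration; the text only states that the id is a one-hot vector fed to the RNN-T encoder.

```python
# Assumed ordering of the 4 domains named in the text.
DOMAINS = ("search", "farfield", "telephony", "youtube")

def one_hot_domain(domain):
    """Return a one-hot list with a 1.0 at the index of `domain`."""
    vec = [0.0] * len(DOMAINS)
    vec[DOMAINS.index(domain)] = 1.0
    return vec

def append_domain_id(frames, domain):
    """Concatenate the one-hot domain-id to every acoustic frame (assumed design)."""
    dom = one_hot_domain(domain)
    return [list(frame) + dom for frame in frames]

frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy 3-dimensional frames
augmented = append_domain_id(frames, "telephony")
print(augmented[0])  # [0.1, 0.2, 0.3, 0.0, 0.0, 1.0, 0.0]
```

With real features the frame dimension would grow from 512 to 516 under this assumed layout.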
3.2 Robustness to Accents
Conventional ASR systems operate on phonemic representations of a word. Specifically, a lexicon maps each word in the vocabulary to a few pronunciations, represented as a sequence of phonemes, and this mapping is fixed before training. This poses challenges when it comes to accents; building an English recognizer that is accurate for American, Australian, British, Canadian, Indian, and Irish English variants is challenging because of phonetic variations.
Attempting to solve these issues by merging the phoneme sets is difficult. Using a lexicon with an on-device E2E system significantly increases the memory footprint, since the size of the lexicon can be upwards of 0.5 GB. In addition, the increased number of phonemes causes confusion and creates data sparsity problems. Finally, decisions regarding the phoneme set and the pronunciations of a word are not made directly from data.
Instead, our E2E model directly predicts word pieces. The model itself decides how to handle pronunciation and phonetic variations based on data. Its size is fixed regardless of the number of variants. As a simple strategy to improve robustness to different accents, we explore including additional training data from different English-accented locales, using the same data as in prior work. Specifically, we use data from Australia, New Zealand, United Kingdom, Ireland, India, Kenya, Nigeria and South Africa. We down-weight the data proportion from these locales by a fixed factor during training. This factor was chosen empirically to be the largest value that did not degrade performance on the American English set.
Spelling conventions vary from one variant of English to another. Since our training data was transcribed using the spelling convention of the locale, using the raw transcript can potentially cause unnecessary confusion during training. The E2E model might try to learn to detect the accent in order to decide which spelling convention to use, thus degrading robustness. Instead, we used VarCon to convert the transcripts to the American spelling convention. For each word in the target, we use VarCon's many-to-one mapping for conversion, and then use the converted sentence as a target. In addition, during inference when evaluating accented test sets, we convert all reference transcripts to the American spelling as well.
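The many-to-one conversion can be sketched as a per-word dictionary lookup. The tiny table `TO_AMERICAN` below is a hand-written stand-in, not the actual VarCon data or file format, and whitespace tokenization is an assumption.

```python
# Hypothetical many-to-one table: several variants map to one American spelling.
TO_AMERICAN = {
    "colour": "color",
    "centre": "center",
    "organise": "organize",
    "organize": "organize",
}

def normalize_spelling(transcript, mapping=TO_AMERICAN):
    """Map each whitespace-separated word to its American spelling, if listed."""
    return " ".join(mapping.get(word, word) for word in transcript.split())

print(normalize_spelling("paint the centre a brighter colour"))
# paint the center a brighter color
```

The same function would be applied to training targets and to the reference transcripts of the accented test sets, so that both sides use one convention.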
3.3 Learning Rates
Our past work has explored using an exponentially-decaying learning rate when training both RNN-T and LAS [12, 25]. Given the increased amount of multi-domain training data compared to search-only data, we explore using a constant learning rate. To help the model converge, we maintain an exponential moving average (EMA) of the weights during training and use the EMA weights for evaluation.
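A minimal sketch of weight averaging follows. The decay value and the oscillating toy "weights" are assumptions chosen to make the smoothing visible; the text does not state the decay used.

```python
def ema_update(ema_weights, weights, decay):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

# Toy illustration: with a constant learning rate the raw weight keeps
# bouncing between 0.0 and 2.0, while its EMA settles close to the mean 1.0.
ema = [1.0]
for step in range(200):
    raw = [2.0 if step % 2 == 0 else 0.0]
    ema = ema_update(ema, raw, decay=0.9)

print("last raw weight:", raw[0])
print("EMA weight:", round(ema[0], 3))
```

Evaluating with `ema` rather than `raw` is what lets a constant learning rate be used without the final checkpoint inheriting the noise of the last few updates.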
4 Latency Improvements
An external voice activity detector (VAD)-based endpointer is often used to detect speech and filter out non-speech. It declares an end-of-query (EOQ) as soon as the VAD observes speech followed by a fixed interval of silence. EOQ-based endpointers, which directly predict </s>, have been shown to improve latency. The EOQ detector can also be folded into the E2E system for joint endpointing and recognition by introducing a </s> token into the training target vocabulary of the RNN-T model. During beam search decoding, </s> is a special symbol that signals the microphone should be closed. Premature prediction of </s> causes deletion errors, while late prediction increases latency.
In this work we extend the joint RNN-T endpointer (EP) model and address the above issue by applying additional early and late penalties on the </s> token. Specifically, during training, for every input frame $t \in \{1, \ldots, T\}$ and every label index $u \in \{1, \ldots, U\}$, RNN-T computes a $U \times T$ matrix of probabilities $P_{\text{RNN-T}}(y_u \mid t)$, which is used in the training loss computation. Here label $y_U$ is </s>, the last label in the sequence. We denote $t_{\langle/s\rangle}$ as the frame index after the last non-silence phoneme, obtained from the forced alignment of the audio with a conventional model. The RNN-T log-probability $\log P_{\text{RNN-T}}(y_U \mid t)$ is modified to include a penalty at each time step $t$ for predicting </s> too early or too late. $t_{\text{buffer}}$ gives a grace period after the reference $t_{\langle/s\rangle}$ before this late penalty is applied, while $\alpha_{\text{early}}$ and $\alpha_{\text{late}}$ are scales on the early and late penalties respectively. All hyperparameters are tuned experimentally.
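The early and late penalties can be sketched as two hinge terms subtracted from the </s> log-probability at each frame. The linear hinge form, the parameter names (`t_eos`, `t_buffer`, `alpha_early`, `alpha_late`) and the numeric values are assumptions consistent with the description above (a grace period before the late penalty, and separate scales), not a confirmed reproduction of the training loss.

```python
def eos_penalty(t, t_eos, t_buffer, alpha_early, alpha_late):
    """Assumed hinge penalty for emitting </s> at frame t.

    Early: grows linearly the further t lies before the reference frame t_eos.
    Late: grows linearly once t exceeds t_eos plus the grace period t_buffer.
    """
    early = max(0.0, alpha_early * (t_eos - t))
    late = max(0.0, alpha_late * (t - t_eos - t_buffer))
    return early + late

def penalized_log_prob(log_prob, t, t_eos, t_buffer, alpha_early, alpha_late):
    """Subtract the penalty from the RNN-T log-probability of </s> at frame t."""
    return log_prob - eos_penalty(t, t_eos, t_buffer, alpha_early, alpha_late)

# Toy values: reference end of speech at frame 50, 5-frame grace period.
for t in (45, 52, 60):
    print(t, eos_penalty(t, t_eos=50, t_buffer=5, alpha_early=2.0, alpha_late=1.0))
# 45 10.0   (5 frames early, scaled by 2.0)
# 52 0.0    (inside the grace period)
# 60 5.0    (5 frames past the grace period, scaled by 1.0)
```

Setting the early scale larger than the late scale, as in the toy values, favors avoiding deletions over shaving latency; the opposite choice would close the microphone more aggressively.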
In this work, the RNN-T model is trained on a mix of data from different domains. This poses a challenge for the endpointer models as different applications may require different endpointing behaviors. Endpointing aggressively for short search-like queries is preferable, but can result in deletions for long-form transcription tasks like YouTube. Since the goal of this work is to improve the latency of search queries, we utilize the fed-in domain-id to only add the </s> token for the search queries, which addresses the latency on search queries while not affecting other domains.
4.2 LAS Rescoring
We apply LAS rescoring to a tree-based lattice, instead of rescoring an N-best list, for efficiency, as it avoids duplicate computation on the common prefixes between candidate sequences. We further reduce the LAS latency with batch inference of the arcs when expanding each lattice branch for rescoring, as it utilizes matrix-matrix multiplication more efficiently. Furthermore, we reduce the 2nd-pass latency by offloading the computation of the additional encoder as well as the attention source keys and values to the 1st-pass in a streaming fashion, whose outputs are cached to be used in the 2nd-pass.
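To illustrate why a tree-based lattice saves computation, the sketch below merges an N-best list into a prefix tree and compares the number of decoder steps needed. The hypotheses are invented, each tree arc is assumed to cost one LAS decoder step, and the end-of-sentence token is ignored.

```python
def build_prefix_tree(hypotheses):
    """Merge token sequences into a nested-dict prefix tree (one arc per dict key)."""
    root = {}
    for hyp in hypotheses:
        node = root
        for token in hyp:
            node = node.setdefault(token, {})
    return root

def count_arcs(node):
    """Total number of arcs, i.e. decoder steps when shared prefixes are scored once."""
    return sum(1 + count_arcs(child) for child in node.values())

hypotheses = [  # toy N-best list
    ["play", "some", "jazz"],
    ["play", "some", "jazz", "music"],
    ["play", "sum", "jazz"],
    ["pay", "some", "jazz"],
]
nbest_steps = sum(len(h) for h in hypotheses)
tree_steps = count_arcs(build_prefix_tree(hypotheses))
print(nbest_steps, tree_steps)  # 13 9
```

Arcs that leave the same tree node share one decoder state, so in a real implementation they could be scored together in a single batched call, which is where the matrix-matrix multiplication benefit mentioned above would come from.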
5 Experimental Details
All models are trained using a 128-dimensional log-mel feature frontend. The features are computed using 32 msec windows with a 10 msec hop. Features from 4 contiguous frames are stacked to form a 512-dimensional input representation, which is further sub-sampled by a factor of 3 and passed to the model. Following [12, 25], all LSTM layers in the model are unidirectional, with 2,048 units and a projection layer with 640 units. The shared encoder consists of 8 LSTM layers, with a time-reduction layer after the 2nd layer. The RNN-T decoder consists of a prediction network with 2 LSTM layers, and a joint network with a single feed-forward layer with 640 units. The additional LAS-specific encoder consists of 2 LSTM layers. The LAS decoder consists of multi-head attention with 4 attention heads, which is fed into 2 LSTM layers. Both decoders are trained to predict 4,096 word pieces.
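The stacking and sub-sampling step can be sketched as below. Stacking each frame with the 3 frames that follow it and dropping incomplete windows at the end are assumptions; the text only specifies 4 contiguous frames and a sub-sampling factor of 3.

```python
def stack_and_subsample(frames, stack=4, stride=3):
    """Concatenate `stack` contiguous frames, keeping every `stride`-th window."""
    stacked = []
    for start in range(0, len(frames) - stack + 1, stride):
        window = frames[start:start + stack]
        stacked.append([value for frame in window for value in frame])
    return stacked

# 100 toy frames of 128 log-mel values each (about 1 s of audio at a 10 ms hop).
frames = [[float(t)] * 128 for t in range(100)]
out = stack_and_subsample(frames)
print(len(out), len(out[0]))  # 33 512
```

Each output vector has 4 x 128 = 512 dimensions, and the effective frame rate drops from one frame per 10 ms to one per 30 ms before the encoder.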
The RNN-T model has 120M parameters. The additional encoder and the LAS decoder have 57M parameters. All parameters are quantized to 8-bit fixed-point, as in our previous work. The total model size in memory/disk is 177MB. All models are trained in TensorFlow using the Lingvo toolkit on Tensor Processing Unit (TPU) slices with a global batch size of 4,096.
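The reported size follows directly from the parameter counts at one byte per parameter. The sketch assumes 1 MB = 10^6 bytes and no storage overhead beyond the weights; the 32-bit figure is included only for contrast.

```python
rnnt_params = 120_000_000  # RNN-T (shared encoder + RNN-T decoder)
las_params = 57_000_000    # additional encoder + LAS decoder
total_params = rnnt_params + las_params

size_8bit_mb = total_params * 1 / 1e6   # 8-bit fixed-point: 1 byte per parameter
size_32bit_mb = total_params * 4 / 1e6  # 32-bit float, for comparison only

print(size_8bit_mb, size_32bit_mb)  # 177.0 708.0
```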
In addition to the diverse training sets described in Sec. 3.1 and 3.2, multi-condition training (MTR) [20, 14] and random data down-sampling to 8 kHz are also used to further increase data diversity. Noisy data is generated at signal-to-noise ratios (SNR) from 0 to 30 dB, with an average SNR of 12 dB, and with T60 times ranging from 0 to 900 msec, averaging 500 msec. Noise segments are sampled from YouTube and daily life noisy environmental recordings. Both 8 kHz and 16 kHz versions of the data are generated, each with equal probability, to make the model robust to varying sample rates.
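As a rough sketch of the additive-noise part of this augmentation, the snippet below scales a noise segment so that the mixture reaches a target SNR. Reverberation (the T60 component) and the room simulation are not modeled, and the signals are synthetic stand-ins.

```python
import math

def mean_power(signal):
    """Average squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches snr_db, then add."""
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    scaled_noise = [scale * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise

speech = [math.sin(0.05 * i) for i in range(1600)]        # toy "speech"
noise = [((i * 37) % 11 - 5) / 5.0 for i in range(1600)]  # deterministic pseudo-noise
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=12.0)

measured = 10 * math.log10(mean_power(speech) / mean_power(scaled_noise))
print(round(measured, 6))  # 12.0
```

In training, `snr_db` would be drawn per utterance from the 0 to 30 dB range described above rather than fixed.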
The main test set includes 14K Voice-search utterances (VS) extracted from Google traffic. Additionally, we use test sets with numeric (Num) and multi-talker interfering speech data (MT), with 4K and 6K utterances, respectively, to test robustness of the proposed models. Accented test sets come from the following locales: Australia (en-au), United Kingdom (en-gb), India (en-in), Kenya (en-ke), Nigeria (en-ng), and South Africa (en-za), with approximately 14K, 10K, 5K, 12K, 15K and 10K utterances, respectively. All test sets are anonymized and hand-transcribed.
In this section, all results presented are without endpointer and LAS rescoring.
6.1.1 Domain-ID Models
First, we analyze the behavior of RNN-T when training with multi-domain (MD) data. Table 1 shows the behavior on 3 datasets when training with Voice Search (VS) vs. multi-domain data. The conventional model is also listed. The table shows that while behavior on VS and MT improves with MD data compared to VS-only training, performance on the numeric set degrades significantly due to the spoken-domain issue of MD data discussed in Section 3.1. However, once we train with a domain-id (DI), performance across all 3 sets improves, and the model outperforms the conventional model on two of them.
6.1.2 Robustness to Accents
Next, we explore the behavior when including accented English data in training. Table 2 shows that the MD+DI model degrades significantly on accented test sets compared to the baseline conventional model, which is trained with a large lexicon. The model that also includes accented data in training improves on all accented sets. This demonstrates that injecting data with alternative accents helps for E2E models that are trained directly to output word pieces, bypassing a lexicon.
6.1.3 Learning Rates
Next, we explore performance of RNN-T when decaying the learning rate (LR) compared to using a constant LR, which should be more beneficial given the larger number of utterances in the MD training set. Table 3 shows that using a constant LR improves performance on two of the test sets by 7% and 8% relative, respectively, without significantly harming performance on a third. Note that other types of learning-rate schedule could also help; we leave further optimization of the learning-rate schedule for future work.
In this section, we analyze results with the various latency improvements proposed in Section 4. The endpointer latency is measured by the median (EP50) and the 90-percentile latency (EP90).
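For concreteness, EP50 and EP90 can be computed from per-utterance endpointer latencies as below. The nearest-rank percentile definition and the latency values are assumptions for illustration; the text does not specify which percentile convention is used.

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]

# Invented per-utterance endpointer latencies in milliseconds.
latencies_ms = [300, 410, 430, 370, 950, 510, 460, 600, 820, 440]

ep50 = percentile(latencies_ms, 50)
ep90 = percentile(latencies_ms, 90)
print(ep50, ep90)  # 440 820
```

Reporting EP90 alongside the median matters because a few slow microphone closings dominate the perceived responsiveness.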
6.2.1 E2E Endpointer
We first apply an external EOQ-based endpointer to the E4 RNN-T model. The endpointer model and the RNN-T model are optimized independently. This degrades WER since the endpointer might cut off the decoding hypotheses when the speaker has a short pause, or when the ASR model is not confident and delays its outputs. We report the best operating point that balances WER and latency gains, obtained by sweeping endpointer parameters during decoding (for the E2E EP, we sweep an added penalty to </s> during decoding). With the acoustic endpointer alone, WER degrades from 6.2% (no EP) to 7.4% to achieve a 450ms EP50 latency and an 860ms EP90 latency. The joint RNN-T EP model that predicts </s> as a target in the RNN-T model training (E5) obtains a WER of 6.8% and reduces EP50 and EP90 by 20ms and 70ms, respectively. As in prior work, E5 also combines EOQ for better endpointing coverage. It has a better WER and latency tradeoff than E4, which uses the acoustic EP alone.
6.2.2 Second-Pass LAS Rescoring
Next, we explore adding LAS rescoring (E6), where LAS is first trained with cross-entropy and then with MWER [22, 25]. The RNN-T model is kept unchanged during LAS training. Table 5 shows that adding LAS for rescoring reduces WER by 10% relative, from 6.8% to 6.1%, while not affecting EP latency. As a comparison, we also list the server model (B0), and will discuss this in the next section.
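The relative figure quoted above can be checked with a one-line helper; the function name is ours.

```python
def relative_reduction(baseline, new):
    """Fractional reduction of `new` with respect to `baseline`."""
    return (baseline - new) / baseline

# WER without and with LAS rescoring, as reported in the text.
print(round(100 * relative_reduction(6.8, 6.1), 1))  # 10.3
```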
In order to show the improvement in LAS computation latency from batch inference, we benchmark the wall time of the second-pass rescoring when running the recognition system on 100 search utterances on a Google Pixel 4 phone. Inference is run on the phone's CPU. In Table 6, we show that batch inference reduces both median and 90th-percentile computation latency by around 32% for LAS rescoring, achieving a 90th-percentile latency of 97ms.
6.3 Comparison to Conventional Model
In this section, we compare the proposed RNN-T+LAS model (0.18GB in model size) to a state-of-the-art conventional model. This model uses a low-frame-rate (LFR) acoustic model which emits context-dependent phonemes (0.1GB), a 764k-word pronunciation model (2.2GB), a 1st-pass 5-gram language model (4.9GB), as well as a larger 2nd-pass MaxEnt language model (80GB). Similar to how the E2E model incurs cost with a 2nd-pass LAS rescorer, the conventional model also incurs cost with the MaxEnt rescorer. We found that for voice-search traffic, the 50% computation latency for the MaxEnt rescorer is around 2.3ms and the 90% computation latency is around 28ms. In Figure 2, we compare both the WER and EP90 of the conventional and E2E models. The figure shows that for an EP90 operating point of 550ms or above, the E2E model has a better WER and EP latency tradeoff compared to the conventional model. At the operating point of matching 90% total latency (EP90 latency + 90% 2nd-pass rescoring computation latency) of the E2E and server models, Table 5 shows that E2E gives an 8% relative improvement over conventional, while being more than 400 times smaller in size.
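The size comparison can be verified from the component sizes listed above; the dictionary keys are descriptive labels of our own.

```python
# Component sizes of the conventional system in GB, as listed in the text.
conventional_gb = {
    "acoustic model": 0.1,
    "pronunciation model": 2.2,
    "1st-pass 5-gram LM": 4.9,
    "2nd-pass MaxEnt LM": 80.0,
}
e2e_gb = 0.18  # RNN-T + LAS

total_gb = sum(conventional_gb.values())
ratio = total_gb / e2e_gb
print(round(total_gb, 1), round(ratio))  # 87.2 484
```

The ratio of roughly 484 is consistent with the "more than 400 times smaller" statement.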
[1] TensorFlow: large-scale machine learning on heterogeneous distributed systems. Available online: http://download.tensorflow.org/paper/whitepaper2015.pdf
[2] VarCon open source dictionary. http://wordlist.aspell.net/varcon-readme/
[3] (2017) Effectively Building Tera Scale MaxEnt Language Models Incorporating Non-Linguistic Signals. In Proc. Interspeech.
[4] (2015) Listen, attend and spell. CoRR abs/1508.01211.
[5] (2019) Joint Endpointing and Decoding with End-to-End Models. In Proc. ICASSP.
[6] Endpoint detection using grid long short-term memory networks for streaming speech recognition. In Proc. Interspeech.
[7] (2019) A unified endpointer using multitask and multidomain training. In Proc. ASRU.
[8] (2018) State-of-the-art speech recognition with sequence-to-sequence models. In Proc. ICASSP.
[9] (2018) Monotonic chunkwise attention. In Proc. ICLR.
[10] (2013) Speech recognition with deep neural networks. In Proc. ICASSP.
[11] (2012) Sequence transduction with recurrent neural networks. CoRR abs/1211.3711.
[12] (2019) Streaming End-to-end Speech Recognition For Mobile Devices. In Proc. ICASSP.
[13] (2000) Speech and Language Processing.
[14] (2017) Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home. In Proc. Interspeech.
[15] (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proc. ICASSP, pp. 4835–4839.
[16] (2018) Multi-dialect speech recognition with a single sequence-to-sequence model. In Proc. ICASSP, pp. 4749–4753.
[17] (2012) Improving Wideband Speech Recognition using Mixed-bandwidth Training Data in CD-DNN-HMM. In Proc. SLT.
[18] (2013) Large Scale Deep Neural Network Acoustic Modeling with Semi-supervised Training Data for YouTube Video Transcription. In Proc. ASRU.
[19] (2019) Recognizing Long-Form Speech Using Streaming End-to-End Models. To appear in Proc. ASRU.
[20] (2016) Far-Field ASR Without Parallel Data. In Proc. Interspeech.
[21] (1992) Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization 30 (4).
[22] (2018) Minimum Word Error Rate Training for Attention-based Sequence-to-sequence Models. In Proc. ICASSP.
[23] (2016) Lower frame rate neural network acoustic models. In Proc. Interspeech.
[24] (2017) Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer. In Proc. ASRU, pp. 193–199.
[25] (2019) Two-Pass End-to-End Speech Recognition. In Proc. Interspeech.
[26] (2012) Japanese and Korean voice search. In Proc. ICASSP.
[27] (2005) Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. Artificial Neural Networks: Formal Models and Their Applications-ICANN, pp. 799–804.
[28] (2017) Improved End-of-Query Detection for Streaming Speech Recognition. In Proc. Interspeech.
[29] (2019) Lingvo: a modular and scalable framework for sequence-to-sequence modeling.
[30] (2017) Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition. In Proc. Interspeech.
[31] (2017) Attention Is All You Need. CoRR abs/1706.03762.