End-to-end models such as the recurrent neural network transducer (RNN-T) [10, 9, 25], attention-based encoder-decoder models [2, 5], and the transformer transducer [36, 34] have become increasingly popular alternatives to conventional hybrid systems for automatic speech recognition (ASR). These models produce hypotheses in an autoregressive fashion by conditioning the output on all previously predicted labels, thus making fewer conditional independence assumptions than conventional hybrid systems. End-to-end ASR models have been shown to achieve state-of-the-art results [23, 11] on popular public benchmarks, as well as on large-scale industrial datasets [3, 26].
The increase in modeling power afforded by conditioning on all previous predictions, however, comes at the cost of a more complicated decoding process: computing the most likely label sequence exactly is intractable since it involves a discrete search over an exponential number of sequences, each of which corresponds to a distinct model state. Instead, decoding is performed using an approximate beam search, with various heuristics to improve performance [4, 31]. Since model states corresponding to different label histories are unique, beam search decoding produces a tree of hypotheses rooted at the start-of-sentence label. This, combined with a limited beam size, restricts the diversity of decoded hypotheses, a problem that becomes increasingly severe for longer utterances.
In this work, we conduct a detailed investigation of a specific aspect of streaming end-to-end models (RNN-T, in our work): we study the importance of conditioning the output sequence on the full history of previously predicted labels, and investigate modifications to the beam search process which can be applied to models by limiting context. The first part of this question has been investigated in a few recent papers, in different contexts. Ghodsi et al. [7] find that in low-resource settings, where training data is limited, the use of word-piece units allows for a stateless prediction network (i.e., one which conditions on only one previous label) without a significant loss in accuracy. Zhang et al. [36] investigate the impact of varying label context in the transformer-transducer model (RNN-T which replaces LSTMs with transformer networks), finding that a context of 3-4 previous graphemes achieves performance similar to a full-context baseline on the Librispeech dataset. Finally, Variani et al. [32] find that the hybrid autoregressive transducer (HAT; RNN-T with an ‘internal language model (LM)’), trained to output phonemes and decoded with a separate lexicon and grammar, achieves similar performance when context is limited to two previous phoneme labels on a large-scale task. (It should be noted, however, that in this case the effective context is larger than two previous phonemes due to the linguistic constraints introduced by the lexicon and the n-gram LM.)
Our work differs from the previously mentioned works in two ways. First, we study the question on a large-scale task using a model trained to output word-piece targets and decoded without an external lexicon or language model. Since word-pieces naturally capture longer context than graphemes or phonemes, our results allow us to measure the effective context that is captured by full-context RNN-T models. Second, we consider modifications to the beam search process which are enabled by the use of limited context models (or the baseline, with approximations). This process, described in detail in Section 3.1, allows us to generate lattices by merging paths during the search process, unlike RNN-T systems which are typically decoded to produce trees. The proposed approach is similar to previous work on efficient rescoring with neural LMs [19] and on generating lattices in attention-based encoder-decoder models [35]. The ability to produce rich lattices from sequence-to-sequence models has many potential applications: e.g., they can be used for spoken term detection; as inputs to spoken language understanding systems; or for computing word-level posteriors for word- or utterance-level confidence estimation. In experimental evaluations, we find that the models require 5-gram contexts (i.e., conditioning on the four previously predicted labels) in order to obtain WERs comparable to the baseline. Additionally, we find that the proposed path-merging scheme is an effective technique to improve search efficiency. The proposed scheme reduces the number of model evaluations by up to 5.3%, while simultaneously improving the oracle WERs in the decoded lattice by up to 36% without any degradation in the WER.
The organization of the rest of the paper is as follows: in Section 2 we introduce the baseline RNN-T models; in Section 3 we describe how we limit label context during training, and our proposed path-merging scheme during decoding to produce lattices; we describe our experimental setup and results in Sections 4 and 5, respectively, before concluding in Section 6.
2 The Recurrent Neural Transducer (RNN-T)
We assume that the input speech utterance has been parameterized into suitable acoustic features: $\mathbf{x} = [\mathbf{x}_1, \ldots, \mathbf{x}_T]$, where $\mathbf{x}_t \in \mathbb{R}^d$ (log-mel filterbank energies, in this work). Each utterance has a corresponding label sequence, $\mathbf{y} = [y_1, \ldots, y_U]$, where $y_u \in \mathcal{Y}$ (word-pieces, in this work). The RNN-T model defines a probability distribution over the output label sequences conditioned on the input acoustics, $P(\mathbf{y}|\mathbf{x})$, by marginalizing over all possible alignment sequences, $\hat{\mathbf{y}}$. Specifically, RNN-T introduces a special blank symbol, $\langle b \rangle$, and defines the set of all valid frame-level alignments, $\mathcal{B}(\mathbf{x}, \mathbf{y})$, as the set of all label sequences, $\hat{\mathbf{y}} = (\hat{y}_1, \ldots, \hat{y}_{T+U})$, where $\hat{y}_i \in \mathcal{Y} \cup \{\langle b \rangle\}$, such that $\hat{\mathbf{y}}$ is identical to $\mathbf{y}$ after removing all $\langle b \rangle$ symbols.
The RNN-T model is depicted in Figure 1. As can be seen in the figure, the model consists of three components: an acoustic encoder (a stack of unidirectional LSTM layers, in this work) which transforms the input acoustic sequence into a higher-level representation, $\mathbf{h}^{\text{enc}}$; a prediction network (another stack of unidirectional LSTMs, in this work); and a joint network which combines these to produce a distribution over the output symbols and $\langle b \rangle$:
$$P(\mathbf{y}|\mathbf{x}) = \sum_{\hat{\mathbf{y}} \in \mathcal{B}(\mathbf{x}, \mathbf{y})} \prod_{i=1}^{T+U} P(\hat{y}_i \mid \mathbf{x}_1, \ldots, \mathbf{x}_{t_i+1}, y_0, \ldots, y_{u_i}) \qquad (1)$$
where $y_0 = \langle \text{sos} \rangle$ is a special symbol denoting the start of the sentence, and $u_i$ and $t_i$ denote the number of non-blank and blank symbols, respectively, in the partial alignment sequence $(\hat{y}_1, \ldots, \hat{y}_{i-1})$. Note that the prediction network is only input with non-blank symbols. The summation in Equation 1 and the gradients of the log-likelihood function can be computed using the forward-backward algorithm.
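As an illustration of the marginalization in Equation 1, the following minimal Python sketch sums over all valid alignments with the forward recursion and checks the result against explicit enumeration. It assumes the common convention that every alignment ends with a blank emitted at the last frame, and it takes the per-step probabilities as precomputed tables rather than network outputs. The names `blank`, `label`, `rnnt_forward`, and `rnnt_brute_force` are illustrative only and are not part of the system described here.

```python
from itertools import combinations


def rnnt_forward(blank, label):
    """Sum the probability of all valid alignments with the forward recursion.

    blank[t][u]: probability of emitting blank at frame t after u labels.
    label[t][u]: probability of emitting the next reference label at frame t
                 after u labels (only read for u < U).
    """
    T, U = len(blank), len(blank[0]) - 1
    alpha = [[0.0] * (U + 1) for _ in range(T)]
    alpha[0][0] = 1.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # reached (t, u) by emitting blank at (t-1, u)
                alpha[t][u] += alpha[t - 1][u] * blank[t - 1][u]
            if u > 0:  # reached (t, u) by emitting a label at (t, u-1)
                alpha[t][u] += alpha[t][u - 1] * label[t][u - 1]
    # Assumption: every alignment terminates with a blank at the last frame.
    return alpha[T - 1][U] * blank[T - 1][U]


def rnnt_brute_force(blank, label):
    """Enumerate every alignment of length T+U explicitly (exponential cost)."""
    T, U = len(blank), len(blank[0]) - 1
    total = 0.0
    # Choose which of the first T+U-1 positions carry labels; the last is blank.
    for label_positions in combinations(range(T + U - 1), U):
        t = u = 0
        prob = 1.0
        for i in range(T + U):
            if i in label_positions:
                prob *= label[t][u]
                u += 1
            else:
                prob *= blank[t][u]
                t += 1
        total += prob
    return total


# Toy tables for T = 3 frames and U = 2 labels (values are arbitrary).
blank = [[0.6, 0.5, 0.9], [0.3, 0.4, 0.8], [0.2, 0.7, 0.9]]
label = [[0.4, 0.5, 0.0], [0.7, 0.6, 0.0], [0.8, 0.3, 0.0]]
forward_total = rnnt_forward(blank, label)
brute_total = rnnt_brute_force(blank, label)
```

The two totals agree, but the recursion visits each of the T(U+1) lattice nodes once, whereas the enumeration grows combinatorially with T and U.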
3 Limiting Prediction Network Context
A vanilla RNN-T model is conditioned on all previous predictions, and can thus be thought of as a full-context model. We can limit the context of the RNN-T model by modifying the prediction network to only depend on a fixed number of previous labels. Specifically, a model with $N$-gram context is conditioned on at most $N-1$ previous labels, so that the output distribution of the RNN-T computes $P(\hat{y}_i \mid \mathbf{x}_1, \ldots, \mathbf{x}_{t_i+1}, y_{u_i-N+2}, \ldots, y_{u_i})$, where $u_i$ and $t_i$ are the numbers of non-blank and blank symbols emitted before step $i$. Since our baseline full-context RNN-T model uses a stack of LSTM layers to model the prediction network, in this work we limit the context through the use of an LSTM-based prediction network that is reset and sequentially fed only the sequence of the last $N-1$ labels at each step. This ensures that our results are comparable to the baseline configuration. However, other choices would also be reasonable to model a limited context prediction network: e.g., a transformer as in [36], or a simple feed-forward network with $N-1$ inputs. Each of these choices involves a different tradeoff in terms of computation versus runtime memory usage, and we leave the study of these alternate architectures for future work.
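The reset-and-refeed scheme can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the LSTM used in this work: `toy_step` is a hypothetical stand-in for one recurrent update, and `limited_context` and `prediction_state` are illustrative names. The point is only that the resulting state depends on nothing but the last N-1 labels.

```python
def limited_context(labels, n):
    """Return the (at most) n-1 most recent non-blank labels for an n-gram model."""
    if n < 2:
        raise ValueError("an n-gram context needs n >= 2")
    return tuple(labels[-(n - 1):])


def toy_step(state, lab):
    """Hypothetical stand-in for one recurrent update (deterministic toy function)."""
    return (31 * state + sum(ord(c) for c in lab)) % 1_000_003


def prediction_state(labels, n):
    """Reset the state, then feed only the limited context, one label at a time."""
    state = 0  # reset at every step of the search
    for lab in limited_context(labels, n):
        state = toy_step(state, lab)
    return state


# With a 3-gram context the two histories share the last two labels,
# so they produce the same prediction network state.
same_3gram = prediction_state(["a", "cat", "sat"], 3) == prediction_state(["the", "cat", "sat"], 3)
# With a 4-gram context the first label is still visible, so the states differ.
same_4gram = prediction_state(["a", "cat", "sat"], 4) == prediction_state(["the", "cat", "sat"], 4)
```

Because the state is recomputed from scratch, each step costs N-1 recurrent updates; caching states by context tuple would be one way to trade memory for computation.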
3.1 Decoding with Path Merging to Create Lattices
Traditional decoding algorithms for RNN-T [10, 31] only produce trees that are rooted at the start-of-sentence label, since distinct label sequences result in unique model states (i.e., the state of the prediction network, since the encoder state is not conditioned on the label sequence). In a limited context model, however, model states are identical if two paths on the beam share the same local label history. This allows for additional optimizations during the search, as illustrated in Figure 2(a) for a 3-gram limited context model. In this case, the model states for the partial hypotheses ‘a cat sat’ and ‘the cat sat’ are identical (indicated by using the same color to represent both states). For this example, assume that ‘the cat sat’ is a lower cost (i.e., higher probability) partial path. Since future costs of identical label sequences starting from the blue states are identical, the current lower cost hypothesis is guaranteed to be better than the current higher cost hypothesis for all future steps. Therefore, we can remove the higher cost path from the active beam, and instead merge it with the lower cost path to create a lattice. This has the effect of ‘freeing up’ space on the beam, while retaining the alternative paths in the final lattice where they can be used for downstream applications. Note that this can have a large impact since end-to-end models are typically decoded with a small number of candidates in the beam for efficiency, and thus the beam diversity tends to reduce for longer utterances. We note that a similar mechanism has been proposed previously by Zapotoczny et al. [35] in the context of lattice generation for attention-based encoder-decoder models, and by Liu et al. [19] in the context of efficiently rescoring lattices with neural LMs. To the best of our knowledge, our work is the first to apply these ideas to streaming end-to-end models such as RNN-T to optimize the search process.
In contrast, in a full-context RNN-T model, illustrated in Figure 2(b), model states are distinct for two partial hypotheses if they correspond to distinct label sequences (represented using distinct colors in the figure). For full-context models, at least in principle, a higher cost partial path at an intermediate point in the search could eventually be part of the lowest cost complete path. However, it is still possible to employ the same path merging scheme in full-context RNN-T models through an approximation if two partial hypotheses share the same local history, by retaining the model state corresponding to the lower cost partial hypothesis as illustrated in the figure. Note that in this case the retained state still corresponds to the full context of the entire label sequence until that point. As we demonstrate in Section 5, our proposed path-merging scheme results in a more efficient search process while improving word error rate.
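The merging step itself can be sketched as follows. This is a minimal Python illustration rather than the decoder used in this work: it assumes that all hypotheses on the beam are at the same frame (as in a breadth-first search), that costs are negative log-probabilities, and that the lattice is recorded simply as a list of (merged path, surviving path) pairs. `Hyp` and `merge_paths` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyp:
    labels: tuple  # non-blank label history of the partial hypothesis
    cost: float    # negative log-probability (lower is better)


def merge_paths(beam, n):
    """Merge hypotheses that share their last n-1 labels.

    Returns the surviving hypotheses (lowest cost per context) and a list of
    (merged, survivor) pairs that a lattice builder could keep as extra arcs.
    """
    survivors = {}  # context key -> lowest cost hypothesis seen so far
    merged = []
    for hyp in sorted(beam, key=lambda h: h.cost):
        key = hyp.labels[-(n - 1):]
        if key in survivors:
            # Same local history: drop from the beam, keep as a lattice arc.
            merged.append((hyp, survivors[key]))
        else:
            survivors[key] = hyp
    return list(survivors.values()), merged


beam = [
    Hyp(("a", "cat", "sat"), 2.3),
    Hyp(("the", "cat", "sat"), 1.1),
    Hyp(("the", "cat", "sang"), 1.9),
]
active, merged = merge_paths(beam, n=3)
```

In this example ‘a cat sat’ is merged into ‘the cat sat’, which frees one slot on the beam for a new competing hypothesis. For a full-context model the same function gives the approximate scheme, with the survivor keeping its own full-history state.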
4 Experimental Setup
Model Architecture: Our experimental setup is similar to our previous work . The input acoustics are parameterized using 128-dimensional log-mel filterbank energies computed over the 16 kHz range following , with 32 ms windows and a 10 ms hop. In order to reduce the effective frame rate, features from four adjacent frames are concatenated together (to produce 512-dimensional features), which are further sub-sampled by a factor of 3, so that the effective input frame rate is 30 ms. In this work, we also apply SpecAugment masks using the configuration described in , which we find to improve performance over the system in . The encoder network in all of our experiments is modeled using a stack of 8 unidirectional LSTM layers, each of which contains 2,048 units and a projection layer of 640 units. We add a time-reduction layer after the second LSTM layer which stacks two adjacent outputs and sub-samples them by a factor of 2, so that the effective encoder frame rate is 60 ms. The prediction network is modeled using two layers of unidirectional LSTMs, each of which contains 2,048 units with a projection layer of 640 units. The joint layer is modeled as a single-layer feed-forward network with 640 units. The output of the joint network produces a distribution over 4,096 word-pieces which are derived from a large text corpus. In total, each of our models contains 120 million trainable parameters.
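The frame-rate arithmetic above can be checked with a short Python sketch. It assumes that stacking concatenates each frame with the three frames that follow it and that incomplete windows at the end of the utterance are dropped; neither detail is specified here, and `stack_and_subsample` is a hypothetical helper.

```python
def stack_and_subsample(frames, stack=4, stride=3):
    """Concatenate `stack` adjacent frames, then keep every `stride`-th result."""
    stacked = [
        sum(frames[i:i + stack], [])  # list concatenation of adjacent frames
        for i in range(len(frames) - stack + 1)
    ]
    return stacked[::stride]


HOP_MS = 10          # hop between log-mel frames
FEATURE_DIM = 128    # log-mel filterbank dimension

# One second of toy features: frame t is filled with the value t.
frames = [[float(t)] * FEATURE_DIM for t in range(100)]
inputs = stack_and_subsample(frames)

stacked_dim = len(inputs[0])           # 4 * 128 features per stacked frame
input_hop_ms = HOP_MS * 3              # sub-sampling by 3 at the input
encoder_hop_ms = input_hop_ms * 2      # time-reduction layer sub-samples by 2
```

Each encoder output therefore covers 60 ms of audio, which reduces the number of joint network evaluations per second of speech during decoding.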
Training Sets: Our models are trained on a diverse set of utterances from multiple domains including voice search, telephony, far-field, and YouTube. All utterances are anonymized and hand-transcribed; YouTube transcriptions are obtained using a semi-supervised approach. In order to improve robustness to environmental distortions, models are trained with additional noisy data using a room simulator. Noise samples are drawn from YouTube and from daily life noisy recordings. The noisy data are generated with an SNR of between 0 dB and 30 dB, with an average SNR of 12 dB; T60 reverberation times range from 0–900 ms with an average of 500 ms. We employ mixed-bandwidth training, by randomly downsampling the training data to 8 kHz 50% of the time.
Test Sets: Results are reported in two test domains: the first consists of utterances drawn from Google voice search traffic (VS: 11,585 utterances; 56,069 words); the second set consists of data drawn from non-voice search Google traffic (NVS: 12,426 utterances; 127,105 words). The utterances in the NVS set tend to be longer (both in terms of duration and label token length) on average than utterances in the VS set. The 90th percentile token sequence length is 14 word-pieces for the VS set, and 28 word-pieces for the NVS set. Results are also reported on more challenging versions of the VS (VS-hard: 9,662 utterances; 46,673 words) and NVS test sets (NVS-hard: 19,411 utterances; 198,215 words) that contain more variation in terms of background noise, volume, and accents. All test set utterances are anonymized and hand-transcribed.
Training: Models are trained using Lingvo in TensorFlow, with Tensor Processing Units (TPUs). Models are optimized using synchronized stochastic gradient descent with mini-batches of 4,096 utterances using the Adam optimizer.
Full and Limited Context Models: In addition to the full-context RNN-T model (baseline), we study the impact of limiting context by training models with varying amounts of context ranging from 2–10 (lc-2gram – lc-10gram; i.e., conditioning on 1 – 9 previously predicted labels, respectively).
Decoding: All results are reported after decoding models using the breadth-first search decoding algorithm. (Similar results are obtained using the best-first search decoding algorithm; however, these are omitted due to space limitations.) The limited context models are always evaluated using the path-merging process proposed in Section 3.1. We also consider the approximate path-merging scheme described in Section 3.1 applied to the baseline during inference. Note that in this case, the baseline is always trained with full-context; during evaluation, we merge paths if the last 1–9 non-blank labels are identical for two paths (base-pm2 – base-pm10), and retain the state corresponding to the lower cost path. All models are decoded with a maximum beam size of 10 partial hypotheses and a local beam of 10 (i.e., maximum allowable absolute difference between log-likelihoods of partial hypotheses).
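The two pruning parameters can be illustrated with a small Python sketch. It is a simplified reading of the description above, in which the local beam is measured against the best hypothesis on the beam; `prune` and the `(labels, log_prob)` tuple layout are assumptions made for illustration.

```python
def prune(hyps, max_size=10, local_beam=10.0):
    """Keep hypotheses within `local_beam` of the best log-likelihood,
    and at most `max_size` of them.

    hyps: list of (labels, log_prob) tuples.
    """
    if not hyps:
        return []
    ranked = sorted(hyps, key=lambda h: h[1], reverse=True)
    best = ranked[0][1]
    within_beam = [h for h in ranked if best - h[1] <= local_beam]
    return within_beam[:max_size]


hyps = [(("the", "cat"), -1.0), (("a", "cat"), -5.0), (("the", "hat"), -12.0)]
kept = prune(hyps)                      # the last entry is 11.0 below the best
kept_single = prune(hyps, max_size=1)   # the size limit applies after the local beam
```

Path merging interacts with the size limit: every merged hypothesis leaves a slot that a distinct competing hypothesis can occupy.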
Evaluation Metrics: Results are reported in terms of both word error rate (WER) as well as the oracle WER in the lattice (for the systems with path-merging) or the N-best list (for the baseline configuration). In order to evaluate whether the proposed path merging scheme described in Section 3.1 improves computational efficiency, results are also reported in terms of the number of model states which are expanded during the search. Specifically, we report the average number of joint network evaluations per utterance for each of the test sets (average model states), which serves as a good proxy to measure the cost of the search.
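For concreteness, WER and oracle WER can be computed as in the following Python sketch. It handles the N-best case by scoring an explicit list of alternatives; computing the oracle WER of a lattice requires a search over lattice paths, which is not shown. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(refs, hyps):
    """Corpus-level WER (%) of one hypothesis per reference."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words


def oracle_wer(refs, nbest_lists):
    """Corpus-level WER (%) when the best alternative is picked per utterance."""
    errors = sum(
        min(edit_distance(r.split(), h.split()) for h in nbest)
        for r, nbest in zip(refs, nbest_lists)
    )
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words


refs = ["the cat sat"]
nbest_lists = [["a cat sat", "the cat sat", "the cat sang"]]
top1_wer = wer(refs, [nbest[0] for nbest in nbest_lists])
best_possible = oracle_wer(refs, nbest_lists)
```

The oracle WER is a lower bound on what a perfect second-pass rescorer could reach, which is why it is used here to measure the richness of the decoded alternatives.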
5 Results
Table 1: WER / Oracle WER (%) for each system on the four test sets.
|System|VS|NVS|VS-hard|NVS-hard|
|---|---|---|---|---|
|baseline|6.0 / 1.7|3.3 / 1.1|7.6 / 2.8|7.4 / 3.9|
|lc-2gram|6.4 / 1.1|3.5 / 0.7|8.1 / 2.2|8.0 / 3.3|
|lc-3gram|6.2 / 1.2|3.3 / 0.7|7.8 / 2.2|7.6 / 3.1|
|lc-4gram|6.1 / 1.3|3.3 / 0.7|7.6 / 2.2|7.2 / 3.0|
|lc-5gram|6.0 / 1.3|3.2 / 0.7|7.6 / 2.4|7.2 / 2.9|
|lc-7gram|6.0 / 1.5|3.2 / 0.8|7.7 / 2.6|7.3 / 3.2|
|lc-10gram|6.0 / 1.7|3.3 / 0.9|7.5 / 2.7|7.2 / 3.4|
|base-pm2|6.2 / 1.1|3.4 / 0.6|7.9 / 2.1|7.6 / 3.1|
|base-pm3|6.1 / 1.1|3.3 / 0.6|7.7 / 2.1|7.5 / 3.1|
|base-pm4|6.1 / 1.3|3.3 / 0.6|7.7 / 2.3|7.4 / 3.0|
|base-pm5|6.0 / 1.4|3.2 / 0.7|7.6 / 2.4|7.4 / 3.1|
|base-pm7|6.0 / 1.5|3.2 / 0.8|7.7 / 2.5|7.3 / 3.3|
|base-pm10|6.0 / 1.6|3.3 / 0.9|7.5 / 2.7|7.2 / 3.4|
Our results are presented in Table 1. First, we consider the systems trained by limiting prediction network context and evaluated with path-merging. As can be seen in the table, a model with 5-gram context (i.e., conditioned on four previous labels) performs as well as the baseline model across all test sets. In fact, this model outperforms the baseline on the NVS and NVS-hard sets which tend to contain longer utterances. We hypothesize that this is likely a consequence of our proposed path merging strategy which removes redundant paths from the beam, thus allowing the model to represent diverse competing hypotheses as described in Figure 2 and Section 3.1. This is further supported by comparing the oracle WERs of the limited context systems relative to the baseline. Models with smaller contexts have more opportunities for path-merging, and thus achieve lower oracle WERs relative to models with larger contexts; this comes, however, at the cost of a degradation in the WER. Using 5-gram contexts provides the best balance between the two evaluation metrics, allowing the model to improve oracle WERs by 14.3–36.4% across the various test sets, with larger improvements on the longer NVS/NVS-hard sets. We also note that our results are in contrast to previous studies which have been conducted on smaller-scale tasks with limited training data [36, 7].
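The relative oracle WER improvements quoted above follow directly from the baseline and lc-5gram rows of Table 1, as the short Python check below shows. It assumes that the four result columns correspond, in order, to the VS, NVS, VS-hard, and NVS-hard test sets; that ordering is inferred from the discussion in the text rather than stated in the table header.

```python
# Oracle WERs (%) copied from the baseline and lc-5gram rows of Table 1.
# Assumption: column order is VS, NVS, VS-hard, NVS-hard.
baseline_oracle = {"VS": 1.7, "NVS": 1.1, "VS-hard": 2.8, "NVS-hard": 3.9}
lc5gram_oracle = {"VS": 1.3, "NVS": 0.7, "VS-hard": 2.4, "NVS-hard": 2.9}

relative_improvement = {
    name: 100.0 * (baseline_oracle[name] - lc5gram_oracle[name]) / baseline_oracle[name]
    for name in baseline_oracle
}

smallest = round(min(relative_improvement.values()), 1)
largest = round(max(relative_improvement.values()), 1)
```

Under this column assignment the smallest gain is on VS-hard and the largest on NVS, and both NVS sets improve more than their VS counterparts, consistent with the observation that longer utterances benefit most.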
Second, we consider the baseline model which is trained with full-context, but evaluated with path-merging using varying amounts of context. From the results in Table 1, we observe that the baseline models evaluated with path-merging perform at least as well as the full-context baseline, if not better, as long as we use at least 5-gram context during path merging. As was the case with the models trained with limited context, using path-merging with 5-gram contexts appears to provide the best WER relative to the baseline configuration. Finally, we note that these systems achieve similarly large oracle WER improvements as the systems trained with limited context.
5.1 Reducing Search Complexity Through Path-Merging
[Table 2: System | Average Model States]
In Table 2 we measure the complexity of the search process for each of the systems in Table 1. For the purposes of this analysis, all models are evaluated with exactly the same pruning parameters as described in Section 4. As can be observed in the table, the use of path merging consistently results in a more efficient search; this is true both for systems trained with limited context (i.e., lc-2gram – lc-10gram) and for the baseline (i.e., base-pm2 – base-pm10). In general, systems which merge hypotheses more aggressively (e.g., lc-2gram) result in fewer average model states than systems which merge hypotheses less aggressively (e.g., lc-10gram). Comparing the complexity of the search for the baseline relative to the two best performing configurations in terms of WER, namely lc-5gram and base-pm5, we observe that path merging improves efficiency by 5.3% and 4.5%, respectively, without any degradation in WER.
6 Conclusions and Discussion
In this work, we studied the impact of label conditioning in the RNN-T model on WER and its implications for improving the efficiency of the search. In experimental evaluations, we found that a full-context RNN-T model with word-piece outputs performs comparably to a model trained with context limited to four previous labels. We also investigated modifications to the decoding strategy which are enabled by limiting context: either exactly – by training a model with limited context; or approximately in the baseline – by merging paths during decoding to produce lattices if two paths share the same local label history. The proposed path-merging strategy was shown to improve the oracle WER in the lattice by up to 36%, while improving the efficiency of the search by reducing the number of model evaluations by up to 5% without any degradation in WER.
The ability to create dense lattices which represent alternative hypotheses has a number of potential applications; future work will investigate its effectiveness in improving word-level confidence estimates, and in improving WER by rescoring lattices in the second pass, taking advantage of the improved oracle error rates. Finally, we note that our work opens up interesting research directions for new RNN-T architectures which limit prediction network context to just a few previous labels; replacing recurrent LSTM units through such a process can reduce sequential dependencies in the system, thus potentially improving execution speed.
References
[1] TensorFlow: A System for Large-Scale Machine Learning. In Proc. of Symposium on Operating Systems.
[2] (2016) Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. In Proc. of ICASSP.
[3] (2018) State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. In Proc. of ICASSP.
[4] (2017) Towards Better Decoding and Language Model Integration in Sequence to Sequence Models. In Proc. of Interspeech.
[5] (2015) Attention-Based Models for Speech Recognition. In Proc. of NeurIPS.
[6] (2008) Spoken Language Understanding. IEEE Signal Processing Magazine 25 (3), pp. 50–58.
[7] (2020) RNN-Transducer with Stateless Prediction Network. In Proc. of ICASSP.
[8] (2006) Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proc. of ICML.
[9] (2013) Speech Recognition with Deep Neural Networks. In Proc. of ICASSP.
[10] (2012) Sequence Transduction with Recurrent Neural Networks. arXiv preprint arXiv:1211.3711.
[11] (2020) Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv preprint arXiv:2005.08100.
[12] (1997) Long Short-Term Memory. Neural Computation 9 (8), pp. 1735–1780.
[13] (2017) In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proc. Symposium on Computer Architecture.
[14] (1997) Estimating Confidence using Word Lattices. In Proc. of European Conference on Speech Communication and Technology.
[15] (2017) Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home. In Proc. of Interspeech.
[16] (2014) Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
[17] (2012) Improving Wideband Speech Recognition using Mixed-Bandwidth Training Data in CD-DNN-HMM. In Proc. of SLT.
[18] (2013) Large Scale Deep Neural Network Acoustic Modeling with Semi-supervised Training Data for YouTube Video Transcription. In Proc. of ASRU.
[19] (2014) Efficient Lattice Rescoring using Recurrent Neural Network Language Models. In Proc. of ICASSP, pp. 4908–4912.
[20] (2007) Rapid and Accurate Spoken Term Detection. In Proc. of Interspeech.
[21] (1995) Continuous Speech Recognition. IEEE Signal Processing Magazine 12 (3), pp. 24–42.
[22] (2019) Recognizing Long-Form Speech using Streaming End-to-End Models. In Proc. of ASRU.
[23] (2019) SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proc. of Interspeech.
[24] (2020) SpecAugment on Large Scale Datasets. In Proc. of ICASSP.
[25] (2017) Exploring Architectures, Data and Units for Streaming End-to-End Speech Recognition with RNN-Transducer. In Proc. of ASRU.
[26] (2020) A Streaming On-Device End-To-End Model Surpassing Server-Side Conventional Model Quality and Latency. In Proc. of ICASSP.
[27] (2019) Two-Pass End-to-End Speech Recognition. In Proc. of Interspeech.
[28] (2012) Japanese and Korean Voice Search. In Proc. of ICASSP.
[29] (2019) Lingvo: A Modular and Scalable Framework for Sequence-to-Sequence Modeling. arXiv preprint arXiv:1902.08295.
[30] (2014) Sequence to Sequence Learning with Neural Networks. In Proc. of NeurIPS.
[31] (2019) Monotonic Recurrent Neural Network Transducer and Decoding Strategies. In Proc. of ASRU.
[32] (2020) Hybrid Autoregressive Transducer (HAT). In Proc. of ICASSP.
[33] (2017) Attention Is All You Need. In Proc. of NeurIPS.
[34] (2019) Transformer-Transducer: End-to-End Speech Recognition with Self-Attention. arXiv preprint arXiv:1910.12977.
[35] (2019) Lattice Generation in Attention-Based Speech Recognition Models. In Proc. of Interspeech.
[36] (2020) Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss. In Proc. of ICASSP.