where $\mathbf{x} = x_1, \dots, x_T$ is a sequence of acoustic features and
$\mathbf{w}$ is a candidate word sequence. In conventional modeling, the posterior probability is factorized as a product of an acoustic likelihood and a prior language model (LM) via Bayes' rule. The acoustic likelihood is further factorized as a product of conditional distributions of acoustic features given a phonetic sequence, the acoustic model (AM), and conditional distributions of a phonetic sequence given a word sequence, the pronunciation model. The phonetic model structure, pronunciations, and LM are commonly represented and combined as weighted finite-state transducers (WFSTs) [30, 31].
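Written out, the factorization just described takes the familiar form (a standard reconstruction of the stripped equations; the symbols $\mathbf{x}$, $\mathbf{w}$, $\bar{\phi}$ are ours):

```latex
P(\mathbf{w} \mid \mathbf{x}) \propto p(\mathbf{x} \mid \mathbf{w})\, P(\mathbf{w}),
\qquad
p(\mathbf{x} \mid \mathbf{w}) = \sum_{\bar{\phi}} p(\mathbf{x} \mid \bar{\phi})\, P(\bar{\phi} \mid \mathbf{w}),
```

where $P(\mathbf{w})$ is the LM, $p(\mathbf{x} \mid \bar{\phi})$ the AM, and $P(\bar{\phi} \mid \mathbf{w})$ the pronunciation model, with $\bar{\phi}$ ranging over phonetic sequences.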
The probabilistic factorization of ASR allows a modular design that provides practical advantages for training and inference. The modularity permits training acoustic and language models independently and on different data sets. A decent acoustic model can be trained with a relatively small amount of transcribed audio data, whereas the LM can be trained on text-only data, which is available in vast amounts for many languages. Another advantage of the modular modeling approach is that it allows dynamic modification of the component models via well-proven, principled methods such as vocabulary augmentation, LM adaptation, and contextual biasing.
From a modeling perspective, the modular factorization of the posterior probability has the drawback that the parameters are trained separately with different objective functions. In conventional ASR this has been addressed to some extent by introducing discriminative approaches for acoustic modeling [3, 33, 27] and language modeling [8, 28]. These optimize the posterior probability by varying the parameters of one model while the other is frozen.
Recent developments in the neural network literature, particularly sequence-to-sequence (Seq2Seq) modeling [2, 24, 10, 9, 40], have allowed the full optimization of the posterior probability, learning a direct mapping between feature sequences and orthographic (so-called character-based) labels, like graphemes or wordpieces. One class of these models uses a recurrent neural network (RNN) architecture such as the long short-term memory (LSTM) model followed by a softmax layer to produce label posteriors. All the parameters are optimized at the sequence level using, e.g., the connectionist temporal classification (CTC) criterion. The letter models [13, 18, 12] and word model are examples of this neural model class. Except for the modeling unit, these models are very similar to conventional acoustic models and perform well when combined with an external LM during decoding (beam search) [42, 5].
Another category of direct models is the encoder-decoder networks. Instead of passing the encoder activations to a softmax layer to obtain the label posteriors, these models use a decoder network that combines the encoder information with an embedding of the previously-decoded label history to produce the next label posterior. The encoder and decoder parameters are optimized together using a time-synchronous objective function, as with the RNN transducer (RNNT), or a label-synchronous objective, as with listen, attend, and spell (LAS). These models are usually trained with character-based units and decoded with a basic beam search. There have been extensive efforts to develop decoding algorithms that can use external LMs, so-called fusion methods [16, 11, 39, 21, 37]. However, these methods have shown relatively small gains on large-scale ASR tasks. In the current state of encoder-decoder ASR models, the general belief is that encoder-decoder models with character-based units outperform the corresponding phonetic-based models. This has led some to conclude that a lexicon or external LM is unnecessary [35, 43, 22].
This paper proposes the hybrid autoregressive transducer (HAT), a time-synchronous encoder-decoder model that couples the powerful probabilistic capability of Seq2Seq models with an inference algorithm that preserves modularity and supports external lexicon and LM integration. The HAT model provides an internal LM quality measure, useful for deciding whether an external language model is beneficial. The finite-history version of the HAT model is also presented and used to show how much label-context history is needed to train a state-of-the-art ASR model on a large-scale training corpus.
2 Time-Synchronous Estimation of Posterior: Previous Work
For an acoustic feature sequence $\mathbf{x} = x_1, \dots, x_T$ corresponding to a word sequence $\mathbf{w}$, assume $\bar{y} = y_1, \dots, y_U$ is a tokenization of $\mathbf{w}$ where each $y_u$ is either a phonetic unit or a character-based unit from a finite-size alphabet $\mathcal{Y}$. The character tokenization of transcript <S> Hat and the corresponding acoustic feature sequence are depicted on the vertical and horizontal axes of Figure 1, respectively. Define an alignment path as a sequence of edges $\bar{e} = e_1, \dots, e_N$, where $N$, the path length, is the number of edges in the path. Each path passes through a sequence of nodes $(t_i, u_i)$ with $0 \le t_i \le T$ and $0 \le u_i \le U$, starting at $(0, 0)$ and ending at $(T, U)$. The dotted path and the bold path in Figure 1 are two left-to-right alignment paths between $\mathbf{x}$ and $\bar{y}$. For any sequence pair $\mathbf{x}$ and $\bar{y}$, there is an exponential number of alignment paths. Each time-synchronous model typically defines a subset of these paths as the permitted path inventory, usually referred to as the path lattice. Denote by $\mathcal{B}$ the function that maps permitted alignment paths to the corresponding label sequence. For most models, this mapping is many-to-one. Besides the definition of the lattice, time-synchronous models usually differ in how they define the alignment path posterior probability $P(\bar{e} \mid \mathbf{x})$. Once defined, the posterior probability for a time-synchronous model is calculated by marginalizing over these alignment path posteriors:
$$P(\bar{y} \mid \mathbf{x}) = \sum_{\bar{e}\,:\, \mathcal{B}(\bar{e}) = \bar{y}} P(\bar{e} \mid \mathbf{x}) \qquad (2)$$

The model parameters are optimized by maximizing $\log P(\bar{y} \mid \mathbf{x})$.
Cross-Entropy (CE): In the cross-entropy models, the alignment lattice contains only one path, which is derived in a pre-processing step known as forced alignment. These models encode the feature sequence via a stack of RNN layers to output activations $f_1, \dots, f_T$. The activations are then passed to a linear layer (aka logits layer) to produce the unnormalized class-level scores $s_t \in \mathbb{R}^{|\mathcal{Y}|}$. The conditional posterior $P(l_t \mid \mathbf{x})$ is derived by normalizing the class-level scores using the softmax function.
The alignment path posterior is derived by imposing a conditional independence assumption on the labels given the feature sequence $\mathbf{x}$: $P(\bar{e} \mid \mathbf{x}) = \prod_{t=1}^{T} P(l_t \mid \mathbf{x})$.
For inference with an external LM, the search is defined as:

$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \; \log p(\mathbf{x} \mid \mathbf{w}) + \lambda \log P_{\text{LM}}(\mathbf{w})$$

where $\lambda$ is a scalar and $\log p(\mathbf{x} \mid \mathbf{w})$ is a pseudo-likelihood sequence-level score derived by applying Bayes' rule on the time posteriors: $\log p(\mathbf{x} \mid \bar{l}\,) \propto \sum_{t} \left[\log P(l_t \mid \mathbf{x}) - \log P(l_t)\right]$, where $P(l_t)$ is a label prior.
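As a concrete illustration, the frame-level pseudo-likelihood can be sketched as follows (a minimal sketch; the data-structure choices and names are ours, not the paper's):

```python
import math

def pseudo_log_likelihood(frame_log_posteriors, label_prior, alignment):
    """Frame-level pseudo-likelihood used in hybrid decoding:
    sum_t [log P(l_t | x) - log P(l_t)].

    frame_log_posteriors: per-frame dicts mapping label -> log P(label | x, t)
    label_prior: dict mapping label -> prior probability P(label)
    alignment: one label per frame (e.g. from a forced alignment)
    """
    score = 0.0
    for t, label in enumerate(alignment):
        # Divide the posterior by the label prior (subtract in log space).
        score += frame_log_posteriors[t][label] - math.log(label_prior[label])
    return score
```

The label priors are typically estimated from the frame-level alignment statistics of the training data.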
Connectionist Temporal Classification (CTC): The CTC model augments the label alphabet with a blank symbol, <b>, and defines the alignment lattice to be the grid of all paths that reduce to the ground-truth label sequence (i.e., have an edit distance of zero to it) after removal of all blank symbols and consecutive repetitions. The dotted path in Figure 1 is a valid CTC path. Like the CE models, the CTC models use a stack of RNNs followed by a logits layer and a softmax layer to produce local posteriors. The sequence-level alignment posterior is then calculated by making a similar conditional independence assumption on the alignment label sequence given $\mathbf{x}$: $P(\bar{e} \mid \mathbf{x}) = \prod_{t=1}^{T} P(z_t \mid \mathbf{x})$. Finally, $P(\bar{y} \mid \mathbf{x})$ is calculated by marginalizing over the alignment posteriors with Eq 2.
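To make the marginalization of Eq 2 concrete, here is a minimal pure-Python sketch of the standard CTC forward algorithm (our own illustration, not the paper's code); it sums the probabilities of all blank-augmented paths that collapse to a given label sequence:

```python
import math

def ctc_marginal(log_probs, labels, blank="<b>"):
    """Forward algorithm: log of the total probability of all CTC paths that
    collapse to `labels` after removing blanks and consecutive repetitions.

    log_probs: per-frame dicts mapping label -> log P(label | x, t),
               conditionally independent across frames.
    """
    # Extended sequence: blanks interleaved with the labels.
    ext = [blank]
    for y in labels:
        ext += [y, blank]
    S = len(ext)
    NEG_INF = float("-inf")

    def logadd(a, b):
        if a == NEG_INF:
            return b
        if b == NEG_INF:
            return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    # alpha[s] = log prob of path prefixes ending at extended position s.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            acc = alpha[s]
            if s >= 1:
                acc = logadd(acc, alpha[s - 1])
            # A blank may be skipped when the two surrounding labels differ.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                acc = logadd(acc, alpha[s - 2])
            new[s] = acc + log_probs[t][ext[s]]
        alpha = new
    return logadd(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)
```

The result equals the brute-force sum over every path in the CTC lattice, but runs in O(T · U) time.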
Recurrent Neural Network Transducer (RNNT): The RNNT lattice encodes all the paths of length $T + U$ edges starting from the bottom-left corner of the lattice in Figure 1 and ending in the top-right corner. Unlike the CE/CTC lattice, RNNT allows staying in the same time frame and emitting multiple labels, like the two consecutive vertical edges in the bold path of Figure 1. Each alignment path can be denoted by a sequence of labels $z_1, \dots, z_{T+U}$, prepended by $z_0 = \text{<S>}$. Here $z_i$ is a random variable representing the edge label at step $i$, given that the previous labels have already been emitted: it equals <b> for a horizontal edge (advancing one time frame) and the next output label for a vertical edge. The indicator function $\mathbb{1}(z_i = \text{<b>})$ returns $1$ if $z_i$ is the blank symbol and $0$ otherwise. For the bold path in Figure 1, the label sequence can be read off the edges in order.
The local posteriors in RNNT are calculated under the assumption that their value is independent of the path prefix: if $\mathcal{B}(z_1, \dots, z_{i-1}) = y_1, \dots, y_u$ at time frame $t$, then $P(z_i \mid \mathbf{x}, z_1, \dots, z_{i-1}) = P(z_i \mid t, y_1, \dots, y_u)$. The local posterior is calculated using the encoder-decoder architecture. The encoder accepts input $\mathbf{x}$ and outputs vectors $f_1, \dots, f_T$. The decoder takes the label sequence and outputs vectors $g_0, \dots, g_U$, where the vector $g_0$ corresponds to the <S> symbol. The unnormalized score of the next label corresponding to position $(t, u)$ is defined by a joint network applied to $f_t$ and $g_u$. The joint network is usually a multi-layer non-recurrent network. The local posterior is calculated by normalizing the above score over all labels using the softmax function. The alignment path posterior is derived by chaining the above quantity over the path: $P(\bar{z} \mid \mathbf{x}) = \prod_{i=1}^{T+U} P(z_i \mid t_i, y_1, \dots, y_{u_i})$.
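A minimal sketch of this local posterior computation (the single linear joint layer and the parameter names `W`, `b` are our simplification of the paper's multi-layer joint network):

```python
import math

def rnnt_local_posterior(f_t, g_u, W, b):
    """Local posterior P(z | t, y_1..u): softmax over joint scores.

    f_t, g_u: encoder and decoder activation vectors (same dimension).
    W, b: illustrative linear-layer weights (one row / bias per label,
          including blank).
    """
    h = [ft + gu for ft, gu in zip(f_t, g_u)]        # combine encoder/decoder
    scores = [sum(w_i * h_i for w_i, h_i in zip(row, h)) + b_k
              for row, b_k in zip(W, b)]             # one logit per label
    m = max(scores)                                  # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]                     # sums to 1 over labels
```

The same function is evaluated at every lattice position $(t, u)$, and the path posterior is the product of these local values along the path.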
3 Hybrid Autoregressive Transducer
The Hybrid Autoregressive Transducer (HAT) model is a time-synchronous encoder-decoder model which distinguishes itself from other time-synchronous models by 1) formulating the local posterior probability differently, 2) providing a measure of its internal language model quality, and 3) offering a mathematically-justified inference algorithm for external LM integration.
The RNNT lattice definition is used for the HAT model. Alternative lattice functions could be explored; this choice is selected merely to provide a fair comparison of the HAT and RNNT models in the experimental section. To calculate the local conditional posterior, the HAT model differentiates between horizontal and vertical edges in the lattice. For an edge corresponding to position $(t, u)$, the posterior probability of taking a horizontal edge and emitting <b> is modeled by a Bernoulli distribution $b_{t,u}$, which is a function of the entire past history of labels and features. The vertical move is modeled by a posterior distribution $P(y_{u+1} \mid t, y_1, \dots, y_u)$, defined over the labels in $\mathcal{Y}$. The alignment posterior is formulated as:

$$P(\bar{z} \mid \mathbf{x}) = \prod_{i=1}^{T+U} b_{t_i, u_i}^{\,\mathbb{1}(z_i = \text{<b>})} \left[ (1 - b_{t_i, u_i})\, P(z_i \mid t_i, y_1, \dots, y_{u_i}) \right]^{1 - \mathbb{1}(z_i = \text{<b>})}$$
Blank (duration) distribution: The input feature sequence is fed to a stack of RNN layers to output encoded vectors $f_1, \dots, f_T$. The label sequence is also fed to a stack of RNN layers to output $g_0, \dots, g_U$. The conditional Bernoulli distribution is then calculated as:

$$b_{t,u} = \sigma\!\left(\langle w, f_t + g_u \rangle + \beta\right)$$

where $\sigma$ is the sigmoid function, $w$ is a weight vector, $\langle \cdot, \cdot \rangle$ is the dot product, and $\beta$ is a bias term.
Label distribution: The encoder function encodes the input features and the decoder function encodes the label embeddings. At each position $(t, u)$, a joint score is calculated over all labels $y \in \mathcal{Y}$:

$$s(t, u) = J(f_t + g_u)$$

where $J$ can be any function that maps to a $|\mathcal{Y}|$-dimensional score vector. The label posterior distribution is derived by normalizing the score function across all labels in $\mathcal{Y}$:

$$P(y_{u+1} = y \mid t, y_1, \dots, y_u) = \frac{\exp\left(s(t,u)_y\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(s(t,u)_{y'}\right)} \qquad (6)$$
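The blank/label factorization can be sketched as follows (a minimal illustration with our own parameter names; the paper's joint is a small network rather than one linear layer). Note that the blank probability and the scaled label probabilities sum to one by construction:

```python
import math

def hat_local_posterior(f_t, g_u, w_blank, beta, W_label, b_label):
    """HAT local distribution at lattice position (t, u).

    The blank (horizontal) move gets its own Bernoulli b_{t,u}; the label
    (vertical) distribution is a separate softmax scaled by 1 - b_{t,u}.
    All parameter names are illustrative.
    """
    h = [ft + gu for ft, gu in zip(f_t, g_u)]
    # Bernoulli blank probability: sigmoid of a scalar projection.
    a = sum(wi * hi for wi, hi in zip(w_blank, h)) + beta
    b_tu = 1.0 / (1.0 + math.exp(-a))
    # Label softmax over the non-blank alphabet.
    scores = [sum(wi * hi for wi, hi in zip(row, h)) + bk
              for row, bk in zip(W_label, b_label)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    labels = [(1.0 - b_tu) * e / z for e in exps]
    return b_tu, labels  # b_tu + sum(labels) == 1
```

Unlike RNNT, where blank competes with the labels inside one softmax, here the blank probability can be read off on its own, which is what makes the internal LM score of the next subsection possible.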
3.2 Internal Language Model Score
The separation of the blank and label posteriors allows the HAT model to produce a local and sequence-level internal LM score. The local score at each label position is defined as $s_{\text{prior}}(y_{u+1}) = J(g_u)$; in other words, this is exactly the posterior score of Eq 6 but with the effect of the encoder activations $f_t$ eliminated. The intuition here is that a language-model quality measure at label $y_{u+1}$ should be a function only of the previous labels, not of the time frame or the acoustic features. Furthermore, this score can be normalized to produce a prior distribution $P_{\text{prior}}(y_{u+1} \mid y_1, \dots, y_u)$ for the next label. The sequence-level internal LM score is:

$$\log P_{\text{prior}}(\bar{y}) = \sum_{u=0}^{U-1} \log P_{\text{prior}}(y_{u+1} \mid y_1, \dots, y_u)$$
Note that the most accurate way of calculating the internal language model is by marginalizing the posterior over the acoustic features. However, the exact sum is not easy to derive, so an approximation such as Laplace's method for integrals is needed. Meanwhile, the above definition can be justified for the special case when the joint function decomposes additively, $J(f_t + g_u) \approx J(f_t) + J(g_u)$ (proof in Appendix A).
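Under the additive-decomposition assumption, the internal LM score amounts to running the label softmax on the decoder activation alone; a sketch (parameter names are illustrative):

```python
import math

def internal_lm_logprob(g_u, W_label, b_label, next_label_idx):
    """Internal LM log-probability of the next label: the HAT label
    softmax with the encoder contribution removed, so only the decoder
    activation g_u feeds the (here linear) joint."""
    scores = [sum(wi * gi for wi, gi in zip(row, g_u)) + bk
              for row, bk in zip(W_label, b_label)]
    # Log-normalizer computed stably.
    m = max(scores)
    logz = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[next_label_idx] - logz
```

Because the encoder term cancels inside the softmax up to the additive constant discussed in Appendix A, this quantity behaves like a next-label prior and normalizes to one over the alphabet.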
The HAT model inference searches for the word sequence $\hat{\mathbf{w}}$ that maximizes:

$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \; \log P(\bar{y} \mid \mathbf{x}) - \lambda_1 \log P_{\text{prior}}(\bar{y}) + \lambda_2 \log P_{\text{LM}}(\mathbf{w}) \qquad (9)$$

where $\lambda_1$ and $\lambda_2$ are scalar weights. Subtracting the internal LM score (prior) from the path posterior leads to a pseudo-likelihood (Bayes' rule), which justifies combining it with an external LM score. We use a conventional FST decoder with a decoder graph encoding the phone context-dependency (if any), the pronunciation dictionary, and the (external) n-gram LM. The partial path hypotheses are augmented with the corresponding state of the model. That is, a hypothesis consists of the time frame, a state in the decoder graph FST, and a state of the model. Paths with equivalent history, i.e., an equal label sequence without blanks, are merged and the corresponding hypotheses are recombined.
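The resulting objective can be illustrated as a simple n-best rescoring rule (a sketch; in the paper the combination happens inside the FST beam search rather than over a fixed n-best list):

```python
def rescore_hypotheses(hyps, lam1, lam2):
    """Pick the hypothesis maximizing
    log P(y|x) - lam1 * log P_prior(y) + lam2 * log P_LM(w).

    hyps: list of (words, log_posterior, log_internal_lm, log_external_lm).
    """
    def score(h):
        _, log_post, log_ilm, log_elm = h
        return log_post - lam1 * log_ilm + lam2 * log_elm
    return max(hyps, key=score)[0]
```

With `lam1 = lam2 = 0` this reduces to plain posterior decoding; increasing `lam1` discounts the internal LM so that the external LM can take over.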
The training set (M utterances), development set (k utterances), and test set (hours) are all anonymized, hand-transcribed, and representative of Google traffic queries. The training examples are log-Mel features extracted from a ms window every ms. The training examples are noisified with different noise styles as detailed in prior work. Each training example is force-aligned to get frame-level phoneme alignments. All models (baselines and HAT) are trained to predict phonemes and are decoded with a lexicon and an n-gram language model covering a M-word vocabulary. The LM was trained on anonymized audio transcriptions and web documents. A maximum-entropy LM is applied in an additional lattice-rescoring pass. The decoding hyper-parameters are swept on a separate development set.
Three time-synchronous baselines are presented. All models use a stack of LSTM layers for the encoder. For the models with a decoder network (RNNT and HAT), each label is embedded and fed to a decoder network consisting of a stack of LSTM layers. The encoder and decoder activations are projected and their sum is passed to the joint network. The joint network is a tanh layer followed by a linear layer and a softmax. The encoder and decoder networks are shared between the blank and label posteriors in the HAT model, so it has exactly the same number of parameters as the RNNT baseline. The CE and CTC models have M float parameters, while the RNNT and HAT models have M parameters.
Table 1: WER % reported as total (del/ins/sub).

| Model | 1st best | 10-best oracle | 2nd pass |
|---|---|---|---|
| CE | 8.9 (1.8/1.6/5.5) | 4.0 (0.8/0.5/2.6) | 8.3 (1.7/1.6/5.0) |
| CTC | 8.9 (1.8/1.6/5.5) | 4.1 (0.8/0.6/2.6) | 8.4 (1.8/1.6/5.0) |
| RNNT | 8.0 (1.1/1.5/5.4) | 2.1 (0.2/0.4/1.6) | 7.5 (1.2/1.4/4.9) |
| HAT | 6.6 (0.8/1.5/4.3) | 1.5 (0.1/0.4/1.0) | 6.0 (0.8/1.3/3.9) |
HAT model performance: For the inference algorithm of Eq 9, we empirically swept $\lambda_1$ and $\lambda_2$ and report the best-performing values. Figure 3 plots the WER for different values of $\lambda_1$, for the baseline HAT model and the HAT model trained with multi-task learning (MTL). The best WER on the convex curve shows that significantly down-weighting the internal LM and relying more on the external LM yields the best decoding quality. The HAT model outperforms the baseline RNNT model by 1.4% absolute WER (a 17.5% relative gain); see Table 1. Furthermore, using a 2nd-pass LM gives an extra gain, which is expected given the low oracle WER of 1.5%. Comparing the different types of errors, the HAT model error pattern differs from all baselines: in all baselines the deletion and insertion error rates are in the same range, whereas the HAT model makes twice as many insertion as deletion errors. (We are examining this behavior and hope to have an explanation for it in the final version of the paper.)
Internal language model: The internal language model score proposed in subsection 3.2 can be analytically justified when the joint function decomposes additively over its encoder and decoder inputs. The joint function is usually a tanh followed by a linear layer. The approximation holds iff the tanh pre-activation falls in the linear range of the tanh function. Figure 3 presents the statistics of this pre-activation vector (means and one standard deviation around the means), accumulated on the test set. The linear range of the tanh function is indicated by the two dotted horizontal lines in Figure 3. The means, and a large portion of one standard deviation around the means, fall within the linear range of the tanh, which suggests that the decomposition of the joint function into a sum of two joint functions is plausible.
Figure 4(a) shows the prior cost, $-\log P_{\text{prior}}(\bar{y})$, during the first epochs of training, evaluated on both the train and test sets. The curves are plotted for both the HAT and RNNT models. Note that in the case of RNNT, since the blank and label posteriors are mixed together, a softmax is applied to normalize the non-blank scores to get the label posteriors, so the curve serves only as an approximation. In the early steps of training, the prior cost goes down, which suggests that the model is learning an internal language model. However, after the first few epochs, the prior cost starts increasing, which suggests that the decoder network deviates from being a language model. Note that both the train and test curves behave this way. One explanation for this observation is that, in maximizing the posterior, the model prefers not to choose parameters that also maximize the prior. We evaluated the performance of the HAT model when it is further constrained with a cross-entropy multi-task learning (MTL) criterion that minimizes the prior cost. Applying this criterion results in a decreasing prior cost, Figure 4(b). However, this did not impact the WER: after sweeping $\lambda_1$, the WER for the HAT model with the MTL loss is as good as that of the baseline HAT model (blue curve in Figure 3). Of course, this might be because the internal language model is still weaker than the external language model used for inference.
This observation might also be explained by Bayes' rule: $\log p(\mathbf{x} \mid \bar{y}) = \log P(\bar{y} \mid \mathbf{x}) - \log P(\bar{y}) + \log p(\mathbf{x})$. The observation that the posterior cost goes down while the prior cost goes up suggests that the model is implicitly maximizing the likelihood term on the left side of the above equation. This might be why subtracting the internal language model log-probability from the log-posterior and replacing it with a strong external language model during inference led to superior performance.
Limited vs. infinite context: Assuming that the decoder network is not taking advantage of the full label history, a natural question is how much label history is actually needed to achieve peak performance. The HAT model performance for different context sizes is shown in Table 2. A context size of zero means feeding no label history to the decoder network, which is similar to making a conditional independence assumption for calculating the posterior at each time-label position. The performance of this model is on a par with that of the CE and CTC baseline models, cf. Table 1. The HAT model with the smallest nonzero context shows a relative WER degradation compared to the infinite-history HAT model, while the HAT models with slightly longer contexts match the performance of the baseline HAT model.
While the posterior cost of the HAT model with a finite context size is somewhat worse than the baseline HAT model loss, the WER of both models is the same. One reason for this behavior may be that the external LM compensates for any shortcoming of the model. Another explanation is the problem of exposure bias, which refers to the mismatch between training and inference for encoder-decoder models. Error propagation in an infinite-context model can be much more severe than in a finite-context model, simply because the finite-context model has a chance to reset its decision after every few labels.
Since a small finite context is sufficient to perform as well as an infinite context, one can simply replace all the expensive RNN kernels in the decoder network with an embedding table containing one vector for each possible label history of that finite length. In other words, computation is traded for memory, which can significantly reduce the total training and inference cost.
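A sketch of this trade (the class, names, and the lazy dict are ours; a real implementation would use a dense trainable array of size $(|\mathcal{Y}|+1)^k \times d$):

```python
class FiniteContextDecoder:
    """Replaces the recurrent decoder with a lookup table over all
    possible last-k label histories, as described above."""

    def __init__(self, vocab, k, dim):
        self.vocab = list(vocab)  # label alphabet (blank handled elsewhere)
        self.k = k                # context size
        self.dim = dim            # embedding dimension
        self.table = {}           # history tuple -> embedding vector

    def _key(self, history):
        # Pad short histories with the start symbol so every key has length k;
        # only the last k labels matter, by construction.
        h = (["<S>"] * self.k + list(history))[-self.k:]
        return tuple(h)

    def g(self, history):
        # One vector per distinct k-label history, created lazily here.
        key = self._key(history)
        if key not in self.table:
            self.table[key] = [0.0] * self.dim
        return self.table[key]
```

Every call with the same trailing k labels hits the same table entry, so the decoder does constant work per step regardless of utterance length.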
The HAT model is a step toward better acoustic modeling while preserving the modularity of a speech recognition system. One of the key advantages of the HAT model is the introduction of an internal language model quantity that can be measured to better understand encoder-decoder models and decide if equipping them with an external language model is beneficial. According to our analysis, the decoder network does not behave as a language model but more like a finite context model. We presented a few explanations for this observation. Further in-depth analysis is needed to confirm the exact source of this behavior and how to construct models that are really end-to-end, meaning the prior and posterior models behave as expected for a language model and acoustic model.
-  (2015) Rapid vocabulary addition to context-dependent decoder graphs. In Interspeech 2015, Cited by: §1.
-  (2013) Joint language and translation modeling with recurrent neural networks. Cited by: §1.
-  (1986) . In Proc. ICASSP, Vol. 86, pp. 49–52. Cited by: §1.
-  (1983) A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence (2), pp. 179–190. Cited by: §1.
-  (2017) Exploring neural transducers for end-to-end speech recognition. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 206–213. Cited by: §1.
-  (2017) Effectively building tera scale maxent language models incorporating non-linguistic signals. Cited by: §4.
-  (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §1.
-  (2000) Discriminative training on language model. In Sixth International Conference on Spoken Language Processing, Cited by: §1.
-  (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §1.
-  (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §1.
-  (2016) Towards better decoding and language model integration in sequence to sequence models. arXiv preprint arXiv:1612.02695. Cited by: §1, §2.
-  (2016) Wav2letter: an end-to-end convnet-based speech recognition system. arXiv preprint arXiv:1609.03193. Cited by: §1.
-  (2009) From speech to letters-using a novel neural network architecture for grapheme based asr. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 376–380. Cited by: §1.
-  (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. Cited by: §1, §2.
-  (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711. Cited by: §1.
-  (2015) On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535. Cited by: §1.
-  (2015) Composition-based on-the-fly rescoring for salient n-gram biasing. In Interspeech 2015, International Speech Communications Association, Cited by: §1.
-  (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. Cited by: §1.
-  (2019) Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. Cited by: §4, §4.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
-  (2018) End-to-end speech recognition with word-based rnn language models. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 389–396. Cited by: §1.
-  (2019) Model unit exploration for sequence-to-sequence speech recognition. arXiv preprint arXiv:1902.01955. Cited by: §1.
-  (1976) Continuous speech recognition by statistical methods. Proceedings of the IEEE 64 (4), pp. 532–556. Cited by: §1.
-  (2013) Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1700–1709. Cited by: §1.
-  (2018) An analysis of incorporating an external language model into a sequence-to-sequence model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5828. Cited by: §1.
-  (2017) Efficient implementation of the room simulator for training deep neural network acoustic models. arXiv preprint arXiv:1712.03439. Cited by: §4.
-  (2009) Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3761–3764. Cited by: §1.
-  (2002) Discriminative training of language models for speech recognition. In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. I–325. Cited by: §1.
-  (1986) Memoir on the probability of the causes of events. Statistical Science 1 (3), pp. 364–378. Cited by: §3.2.
-  (2002) Weighted finite-state transducers in speech recognition. Computer Speech & Language 16 (1), pp. 69–88. Cited by: §1.
-  (2008) Speech recognition with weighted finite-state transducers. In Handbook of Speech Processing, J. Benesty, M. Sondhi, and Y. Huang (Eds.), pp. 559–582. Cited by: §1.
-  Continuous speech recognition using multilayer perceptrons with hidden Markov models. In International Conference on Acoustics, Speech, and Signal Processing, pp. 413–416. Cited by: §2.
-  (2005) Discriminative training for large vocabulary speech recognition. Ph.D. Thesis, University of Cambridge. Cited by: §1.
-  (2015) Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. Cited by: §4.
-  (2018) No need for a lexicon? evaluating the value of the pronunciation lexica in end-to-end models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5859–5863. Cited by: §1.
-  (2015) Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947. Cited by: §2.
-  (2019) Component fusion: learning replaceable language model component for end-to-end speech recognition system. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5361–5635. Cited by: §1.
-  (2016) Neural speech recognizer: acoustic-to-word lstm model for large vocabulary speech recognition. arXiv preprint arXiv:1610.09975. Cited by: §1.
-  (2017) Cold fusion: training seq2seq models together with language models. arXiv preprint arXiv:1708.06426. Cited by: §1.
-  (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
-  (2017) End-to-end training of acoustic models for large vocabulary continuous speech recognition with TensorFlow. Cited by: §4.
-  (2017) Comparison of decoding strategies for ctc acoustic models. arXiv preprint arXiv:1708.04469. Cited by: §1.
-  (2018) A comparison of modeling units in sequence-to-sequence speech recognition with the transformer on mandarin chinese. In International Conference on Neural Information Processing, pp. 210–220. Cited by: §1.
Appendix A Internal Language Model Score
The softmax function is invertible up to an additive constant. In other words, for any two real-valued vectors $z, z' \in \mathbb{R}^d$ and real constant $c$, $\mathrm{Softmax}(z) = \mathrm{Softmax}(z')$ if and only if $z' = z + c\mathbf{1}$.

If $z' = z + c\mathbf{1}$, then for every component $i$, $\mathrm{Softmax}(z')_i = \frac{e^{z_i + c}}{\sum_j e^{z_j + c}} = \frac{e^{z_i}}{\sum_j e^{z_j}} = \mathrm{Softmax}(z)_i,$

which proves the if direction. For the other direction, the proof goes as follows: from $\mathrm{Softmax}(z) = \mathrm{Softmax}(z')$, for every $i$, $\frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{e^{z'_i}}{\sum_j e^{z'_j}};$

taking the logarithm of both sides: $z_i - z'_i = \log \sum_j e^{z_j} - \log \sum_j e^{z'_j},$

which completes the proof since the right-hand side (RHS) of the above equation is constant for any $i$. ∎
For any two random variables $A$ and $B$ and some real-valued function $f$, if the posterior distribution of $A$ given $B$ is $P(a \mid b) = \mathrm{Softmax}(f(b))_a$,

then $f(b)_a = \log P(a \mid b) + C(b)$, for some value $C(b)$ that is constant with respect to $a$.

The proof is straightforward following Lemma 1. ∎
According to Eq 7
for some real-valued constant. The first equality holds by Corollary 1, and the second equality holds by Bayes' rule.
Applying exponential function and marginalizing over results in:
note that the constant term is dropped, which makes the left-hand side (LHS) proportional to the RHS of the first equation. The second equation holds since the marginalized quantity is a distribution.
Finally note that
where the first equality comes from the definition and the second from the assumption made in the statement of the proposition. Applying the exponential function and marginalizing:
Finally note that , thus:
which completes the proof. ∎