1 Introduction
The automatic speech recognition (ASR) problem is probabilistically formulated as maximum a posteriori (MAP) decoding [23, 4]:
(1)  W* = argmax_W P(W | X)

where X = x_1, ..., x_T is a sequence of acoustic features and W is a candidate word sequence. In conventional modeling, the posterior probability P(W | X) is factorized as a product of an acoustic likelihood p(X | W) and a prior
language model (LM) via Bayes’ rule. The acoustic likelihood is further factorized as a product of conditional distributions of acoustic features given a phonetic sequence, the acoustic model (AM), and conditional distributions of a phonetic sequence given a word sequence, the pronunciation model. The phonetic model structure, pronunciations, and LM are commonly represented and combined as weighted finite-state transducers (WFSTs) [30, 31]. The probabilistic factorization of ASR allows a modular design that provides practical advantages for training and inference. The modularity permits training acoustic and language models independently and on different data sets. A decent acoustic model can be trained with a relatively small amount of transcribed audio data, whereas the LM can be trained on text-only data, which is available in vast amounts for many languages. Another advantage of the modular modeling approach is that it allows dynamic modification of the component models with well-proven, principled methods such as vocabulary augmentation [1], LM adaptation, and contextual biasing [17].
From a modeling perspective, the modular factorization of the posterior probability has the drawback that the parameters are trained separately with different objective functions. In conventional ASR this has been addressed to some extent by introducing discriminative approaches for acoustic modeling [3, 33, 27] and language modeling [8, 28]. These optimize the posterior probability by varying the parameters of one model while the other is frozen.
Recent developments in the neural network literature, particularly
sequence-to-sequence (Seq2Seq) modeling [2, 24, 10, 9, 40], have allowed the full optimization of the posterior probability, learning a direct mapping between feature sequences and orthographic (so-called character-based) labels, like graphemes or wordpieces. One class of these models uses a recurrent neural network (RNN) architecture such as the long short-term memory (LSTM) model [20] followed by a softmax layer to produce label posteriors. All the parameters are optimized at the sequence level using, e.g., the
connectionist temporal classification (CTC) [14] criterion. The letter models [13, 18, 12] and word model [38] are examples of this neural model class. Except for the modeling unit, these models are very similar to conventional acoustic models and perform well when combined with an external LM during decoding (beam search) [42, 5]. Another category of direct models are the encoder-decoder networks. Instead of passing the encoder activations to a softmax layer for the label posteriors, these models use a decoder network that combines the encoder information with an embedding of the previously-decoded label history to produce the next label posterior. The encoder and decoder parameters are optimized together using a time-synchronous objective function, as with the RNN transducer (RNNT) [15], or a label-synchronous objective, as with listen, attend, and spell (LAS) [7]. These models are usually trained with character-based units and decoded with a basic beam search. There have been extensive efforts to develop decoding algorithms that can use external LMs, the so-called fusion methods [16, 11, 39, 21, 37]. However, these methods have shown relatively small gains on large-scale ASR tasks [25]. In the current state of encoder-decoder ASR models, the general belief is that encoder-decoder models with character-based units outperform the corresponding phonetic-based models. This has led some to conclude that a lexicon or external LM is unnecessary
[35, 43, 22]. This paper proposes the hybrid autoregressive transducer (HAT), a time-synchronous encoder-decoder model that couples the powerful probabilistic capability of Seq2Seq models with an inference algorithm that preserves modularity and external lexicon and LM integration. The HAT model provides an internal LM quality measure, useful for deciding whether an external language model is beneficial. A finite-history version of the HAT model is also presented and used to show how much label-context history is needed to train a state-of-the-art ASR model on a large-scale training corpus.
2 Time-Synchronous Estimation of the Posterior: Previous Work
For an acoustic feature sequence x = x_1, ..., x_T corresponding to a word sequence W, assume y = y_1, ..., y_U is a tokenization of W where each y_u is either a phonetic unit or a character-based unit from a finite-size alphabet V. The character tokenization of the transcript <S> Hat and the corresponding acoustic feature sequence are depicted on the vertical and horizontal axes of Figure 1, respectively. Define an alignment path as a sequence of edges e_1, ..., e_N, where N, the path length, is the number of edges in the path. Each path passes through a sequence of nodes (t_n, u_n) with 0 <= t_n <= T and 0 <= u_n <= U. The dotted path and the bold path in Figure 1 are two left-to-right alignment paths between x and y. For any sequence pair x and y, there is an exponential number of alignment paths. Each time-synchronous model typically defines a subset of these paths as the permitted path inventory, usually referred to as the path lattice. Denote by B the function that maps permitted alignment paths to the corresponding label sequence; for most models this map is many-to-one. Besides the definition of the lattice, time-synchronous models usually differ in how they define the alignment path posterior probability P(π | x). Once defined, the posterior probability for a time-synchronous model is calculated by marginalizing over these alignment path posteriors:

(2)  P(y | x) = Σ_{π : B(π) = y} P(π | x)

The model parameters are optimized by maximizing log P(y | x).
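The marginalization in Eq 2 is computed with a forward (dynamic-programming) recursion rather than by enumerating paths. Below is a minimal sketch for a CTC-style lattice (the model described later in this section); the function name and NumPy interface are illustrative, not from the paper:

```python
import numpy as np

def forward_log_prob(log_probs, labels, blank=0):
    """log P(y | x) = logsumexp over all paths pi with B(pi) = y (Eq 2),
    computed with the CTC forward recursion over the extended label
    sequence <b> y1 <b> y2 ... <b>."""
    T = log_probs.shape[0]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    alpha = np.full(S, -np.inf)
    alpha[0] = log_probs[0, ext[0]]          # start with the leading blank ...
    alpha[1] = log_probs[0, ext[1]]          # ... or with the first label
    for t in range(1, T):
        new = np.full(S, -np.inf)
        for s in range(S):
            cands = [alpha[s]]               # stay on the same lattice state
            if s > 0:
                cands.append(alpha[s - 1])   # advance by one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])   # skip the blank between distinct labels
            new[s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
        alpha = new
    # valid paths end on the last label or on the trailing blank
    return np.logaddexp(alpha[-1], alpha[-2])
```

The log-domain recursion keeps the sum over exponentially many paths numerically stable in O(T · U) time.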
Cross-Entropy (CE): In the cross-entropy models, the alignment lattice contains only one path, which is derived by a preprocessing step known as forced alignment. These models encode the feature sequence x via a stack of RNN layers to output activations f_1, ..., f_T. The activations are then passed to a linear layer (a.k.a. the logits layer) to produce the unnormalized class-level scores. The conditional posterior P(y_t | x) is derived by normalizing the class-level scores with the softmax function. The alignment path posterior is derived by imposing a conditional independence assumption on the label sequence given the feature sequence x: P(π | x) = ∏_t P(y_t | x).
For inference with an external LM, the search is defined as:

(3)  W* = argmax_W [ log p(x | y(W)) + λ log P(W) ]

where λ is a scalar and log p(x | y) is a pseudo-likelihood sequence-level score derived by applying Bayes’ rule to the time posteriors: p(x_t | y_t) ∝ P(y_t | x) / P(y_t), where P(y_t) is a label prior [32].
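The pseudo-likelihood conversion in Eq 3 amounts to dividing frame posteriors by a label prior in the log domain and summing over frames. A minimal illustration (function and variable names are assumptions):

```python
import numpy as np

def pseudo_log_likelihood(frame_log_posteriors, frame_labels, label_log_prior):
    """Hybrid-style scaled likelihood: log p(x_t | y_t) up to a constant is
    log P(y_t | x) - log P(y_t), summed over the forced-alignment path."""
    score = 0.0
    for t, y in enumerate(frame_labels):
        score += frame_log_posteriors[t, y] - label_log_prior[y]
    return score
```

This score is then interpolated with λ · log P(W) from the external LM during the search.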
Connectionist Temporal Classification (CTC): The CTC model [14] augments the label alphabet with a blank symbol, <b>, and defines the alignment lattice to be the grid of all paths that have edit distance of zero to the ground-truth label sequence after removal of all blank symbols and consecutive repetitions. The dotted path in Figure 1 is a valid CTC path. Similar to the CE models, the CTC models also use a stack of RNNs followed by a logits layer and a softmax layer to produce local posteriors. The sequence-level alignment posterior is then calculated by making a similar conditional independence assumption on the alignment label sequence given x: P(π | x) = ∏_t P(π_t | x). Finally, P(y | x) is calculated by marginalizing over the alignment posteriors with Eq 2.
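The many-to-one map B for CTC (remove consecutive repetitions, then remove blanks) can be sketched as follows; the function name is illustrative:

```python
def ctc_collapse(path, blank="<b>"):
    """Map an alignment path to its label sequence: drop a symbol if it
    repeats the previous path symbol, then drop blanks (the map B)."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out
```

Note the order matters: a blank between two identical labels separates them, so "a <b> a" collapses to "a a", not "a".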
Recurrent Neural Network Transducer (RNNT): The RNNT lattice encodes all the paths of T + U edges starting from the bottom-left corner of the lattice in Figure 1 and ending in the top-right corner. Unlike the CE/CTC lattice, RNNT allows staying in the same time frame and emitting multiple labels, like the two consecutive vertical edges in the bold path in Figure 1. Each alignment path can be denoted by a sequence of edge labels prepended by <S>. Here ŷ_{t,u} is a random variable representing the edge label corresponding to time frame t when the previous u labels have already been emitted. The indicator function 1(ŷ) returns 1 if ŷ = <b> and 0 otherwise. The bold path in Figure 1 corresponds to one such edge label sequence. The local posteriors in RNNT are calculated under the assumption that their value is independent of the path prefix, i.e., they depend only on the position (t, u) and the label history y_{0:u}, not on how the path reached (t, u). The local posterior is calculated using the encoder-decoder architecture. The encoder accepts input x and outputs vectors f_1, ..., f_T. The decoder takes the label sequence and outputs vectors g_0, ..., g_U, where the vector g_0 corresponds to the <S> symbol. The unnormalized score of the next label corresponding to position (t, u) is defined as s(t, u) = J(f_t, g_u). The joint network J is usually a multi-layer non-recurrent network. The local posterior is calculated by normalizing the above score over all labels (including <b>) using the softmax function. The alignment path posterior is derived by chaining the above quantity over the path: P(π | x) = ∏_{(t,u) ∈ π} P(ŷ_{t,u} | x, y_{0:u}).
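A single RNNT local posterior evaluation can be sketched as below. The one-layer tanh joint network is a common choice (cf. [19]), but the exact architecture and the names here are assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnnt_local_posterior(f_t, g_u, W, b):
    """RNNT local posterior at lattice position (t, u): a joint network
    (here tanh of the summed activations followed by a linear layer) maps
    the encoder vector f_t and decoder vector g_u to scores over
    V ∪ {<b>}, normalized with a single softmax."""
    h = np.tanh(f_t + g_u)    # combine encoder and decoder activations
    scores = W @ h + b        # logits over labels plus blank
    return softmax(scores)
```

In RNNT the blank competes with the real labels inside this one softmax, which is the point of contrast with the HAT model of the next section.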
3 Hybrid Autoregressive Transducer
The Hybrid Autoregressive Transducer (HAT) model is a time-synchronous encoder-decoder model which distinguishes itself from other time-synchronous models by 1) formulating the local posterior probability differently, 2) providing a measure of its internal language model quality, and 3) offering a mathematically-justified inference algorithm for external LM integration.
3.1 Formulation
The RNNT lattice definition is used for the HAT model. Alternative lattice functions could be explored; this choice is selected merely to provide a fair comparison of the HAT and RNNT models in the experimental section. To calculate the local conditional posterior, the HAT model differentiates between horizontal and vertical edges in the lattice. For the edge corresponding to position (t, u), the posterior probability of taking a horizontal edge and emitting <b> is modeled by a Bernoulli distribution b_{t,u}, which is a function of the entire past history of labels and features. The vertical move is modeled by a posterior distribution P(ŷ_{t,u} | x, y_{0:u}) defined over the labels in V. The alignment posterior is formulated as:

(4)  P(π | x) = ∏_{(t,u) ∈ π} b_{t,u}^{1(ŷ_{t,u})} [(1 − b_{t,u}) P(ŷ_{t,u} | x, y_{0:u})]^{1 − 1(ŷ_{t,u})}
Blank (duration) distribution: The input feature sequence x is fed to a stack of RNN layers to output encoded vectors f_1, ..., f_T. The label sequence is also fed to a stack of RNN layers to output g_0, ..., g_U. The conditional Bernoulli distribution is then calculated as:

(5)  b_{t,u} = σ(⟨w, f_t + g_u⟩ + b_0)

where σ is the sigmoid function, w is a weight vector, ⟨·,·⟩ is the dot product, and b_0 is a bias term.

Label distribution: The encoder function encodes input features into f_t and the decoder function encodes label embeddings into g_u. At each position (t, u), the joint score is calculated over all labels k in V:

(6)  s_k(t, u) = [J(f_t, g_u)]_k

where J can be any function that maps (f_t, g_u) to a |V|-dim score vector. The label posterior distribution is derived by normalizing the score functions across all labels in V:

(7)  P(ŷ_{t,u} = k | x, y_{0:u}) = exp(s_k(t, u)) / Σ_{k' ∈ V} exp(s_{k'}(t, u))
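Eqs 5-7 at a single lattice position can be sketched as follows. The additive form f_t + g_u inside the blank score is one plausible parameterization consistent with the description above, not necessarily the exact one used in the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hat_local_posterior(f_t, g_u, w, b0, J):
    """HAT local posterior at (t, u): a Bernoulli blank probability (Eq 5)
    kept separate from a softmax label distribution (Eqs 6-7). J is any
    function mapping (f_t, g_u) to a |V|-dim score vector; w and b0
    parameterize the blank distribution."""
    b_tu = sigmoid(np.dot(w, f_t + g_u) + b0)   # P(horizontal / blank move)
    label_post = softmax(J(f_t, g_u))           # P(next label | vertical move)
    # Eq 4: a blank edge contributes b_tu; a label-k edge contributes
    # (1 - b_tu) * label_post[k].
    return b_tu, label_post
```

Because the blank is factored out, the full edge distribution b_tu + (1 − b_tu) · Σ_k label_post[k] still sums to one, while the label softmax alone can later be re-used to extract an internal LM.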
3.2 Internal Language Model Score
The separation of blank and label posteriors allows the HAT model to produce a local and a sequence-level internal LM score. The local score at each label position u is defined as s_k(u) = [J(0, g_u)]_k. In other words, this is exactly the posterior score of Eq 6 but with the effect of the encoder activations f_t eliminated. The intuition here is that a language-model quality measure at label position u should be a function only of the previous labels and not of the time frame or the acoustic features. Furthermore, this score can be normalized with a softmax to produce a prior distribution P_prior(y_{u+1} | y_{0:u}) for the next label. The sequence-level internal LM score is:

(8)  ln P_prior(y) = Σ_{u=0}^{U−1} ln P_prior(y_{u+1} | y_{0:u})

Note that the most accurate way of calculating the internal language model is by marginalizing the posterior over all feature sequences, P(y) = Σ_x P(y | x) P(x). However, the exact sum is not easy to derive, thus some approximation like Laplace’s method for integrals [29] is needed. Meanwhile, the above definition can be justified for the special case when the joint function decomposes as J(f_t, g_u) = J1(f_t) + J2(g_u) (proof in Appendix A).
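The internal LM score of Eq 8 can be sketched by zeroing the encoder contribution before normalization. Here J is assumed to accept an all-zero encoder vector; names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def internal_lm_log_prob(g_seq, next_labels, J, zero_f):
    """Sequence-level internal LM score (Eq 8): at each label position the
    encoder input to the joint network is replaced by a zero vector, the
    remaining scores are normalized into a prior over the next label, and
    the log-priors are summed along the reference label sequence."""
    total = 0.0
    for g_u, y_next in zip(g_seq, next_labels):
        prior = softmax(J(zero_f, g_u))   # Eq 6 with encoder effect removed
        total += np.log(prior[y_next])
    return total
```

During decoding (Eq 9), this is the quantity subtracted from the log-posterior before the external LM score is added.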
3.3 Decoding
The HAT model inference searches for the word sequence that maximizes:

(9)  W* = argmax_W [ ln P(y(W) | x) − λ1 ln P_prior(y(W)) + λ2 ln P(W) ]

where λ1 and λ2 are scalar weights. Subtracting the internal LM score, the prior ln P_prior(y), from the path posterior leads to a pseudo-likelihood (Bayes’ rule), which justifies combining it with an external LM score. We use a conventional FST decoder with a decoder graph encoding the phone context-dependency (if any), the pronunciation dictionary, and the (external) n-gram LM. The partial path hypotheses are augmented with the corresponding state of the model. That is, a hypothesis consists of the time frame t
, a state in the decoder graph FST, and a state of the model. Paths with equivalent history, i.e., an equal label sequence without blanks, are merged and the corresponding hypotheses are recombined.

4 Experiments
The training set (M utterances), development set (k utterances), and test set (hours of audio) are all anonymized, hand-transcribed, and representative of Google traffic queries. The training examples are log-Mel features extracted with a sliding window [41]. The training examples are noisified with different noise styles, as detailed in [26]. Each training example is force-aligned to get the frame-level phoneme alignment. All models (baselines and HAT) are trained to predict phonemes and are decoded with a lexicon and an n-gram language model that cover a vocabulary of M words. The LM was trained on anonymized audio transcriptions and web documents. A maximum-entropy LM [6] is applied in an additional lattice-rescoring pass. The decoding hyperparameters are swept on a separate development set.

Three time-synchronous baselines are presented. All models use a stack of LSTM layers for the encoder. For the models with a decoder network (RNNT and HAT), each label is embedded and fed to a decoder network consisting of a stack of LSTM layers. The encoder and decoder activations are projected to a common dimension and their sum is passed to the joint network. The joint network is a tanh layer followed by a linear layer and a softmax, as in [19]. The encoder and decoder networks are shared between the blank and the label posteriors in the HAT model, so that it has exactly the same number of parameters as the RNNT baseline. The CE and CTC models have M float parameters while the RNNT and HAT models have M parameters.
Table 1: WER (del/ins/sub) for the baseline and HAT models.

Model | 1st-pass 1st-best | 1st-pass 10-best Oracle | 2nd-pass
CE    | 8.9 (1.8/1.6/5.5) | 4.0 (0.8/0.5/2.6)       | 8.3 (1.7/1.6/5.0)
CTC   | 8.9 (1.8/1.6/5.5) | 4.1 (0.8/0.6/2.6)       | 8.4 (1.8/1.6/5.0)
RNNT  | 8.0 (1.1/1.5/5.4) | 2.1 (0.2/0.4/1.6)       | 7.5 (1.2/1.4/4.9)
HAT   | 6.6 (0.8/1.5/4.3) | 1.5 (0.1/0.4/1.0)       | 6.0 (0.8/1.3/3.9)
HAT model performance: For the inference algorithm of Eq 9, we empirically observed which values of λ1 and λ2 lead to the best performance. Figure 3 plots the WER for different values of λ1, for the baseline HAT model and the HAT model trained with multi-task learning (MTL). The location of the best WER on the convex curve shows that significantly down-weighting the internal LM and relying more on the external LM yields the best decoding quality. The HAT model outperforms the baseline RNNT model in absolute WER (Table 1). Furthermore, using a 2nd-pass LM gives an extra gain, which is expected given the low oracle WER. Comparing the different types of errors, the HAT model error pattern appears different from all baselines: in all baselines, the deletion and insertion error rates are in the same range, whereas the HAT model makes twice as many insertion as deletion errors.¹

¹ We are examining this behavior and hope to have an explanation for it in the final version of the paper.
Internal language model: The internal language model score proposed in Subsection 3.2 can be analytically justified when the joint function decomposes as a sum of an encoder-only and a decoder-only term. The joint function is usually a tanh function followed by a linear layer [19]. The decomposition holds when f_t + g_u falls in the linear range of the tanh function. Figure 3 presents the statistics (mean and standard deviation) of this vector, accumulated on the test set. The linear range of the tanh function is indicated by the two dotted horizontal lines in Figure 3. The means and a large portion of one standard deviation around the means fall within the linear range of the tanh, which suggests that the decomposition of the joint function into a sum of two joint functions is plausible.
Figure 4(a) shows the prior cost, the negative log-probability of Eq 8, during the first epochs of training, evaluated on both the train and test sets. The curves are plotted for both the HAT and RNNT models. Note that in the case of RNNT, since blank and label posteriors are mixed together, a softmax is applied to normalize the non-blank scores to get the label posteriors, so the curve serves only as an approximation. In the early steps of training, the prior cost goes down, which suggests that the model is learning an internal language model. However, after the first few epochs, the prior cost starts increasing, which suggests that the decoder network deviates from being a language model. Note that both the train and test curves behave like this. One explanation of this observation is that, in maximizing the posterior, the model prefers not to choose parameters that also maximize the prior. We evaluated the performance of the HAT model when it is further constrained with a cross-entropy multi-task learning (MTL) criterion that minimizes the prior cost. Applying this criterion results in a decreasing prior cost, Figure 4(b). However, this did not impact the WER: after sweeping λ1, the WER for the HAT model with the MTL loss is as good as that of the baseline HAT model, the blue curve in Figure 3. Of course, this might be because the internal language model is still weaker than the external language model used for inference.
This observation might also be explained by Bayes’ rule: ln p(x | y) + ln P(y) = ln P(y | x) + ln p(x). The observation that the posterior cost goes down while the prior cost goes up suggests that the model is implicitly maximizing the log-likelihood term on the left side of the above equation. This might be why subtracting the internal language model log-probability from the log-posterior and replacing it with a strong external language model during inference leads to superior performance.
Limited vs Infinite Context: Assuming that the decoder network is not taking advantage of the full label history, a natural question is how much of the label history is actually needed to achieve peak performance. The HAT model performance for different context sizes is shown in Table 2. A context size of 0 means feeding no label history to the decoder network, which is similar to making a conditional independence assumption for calculating the posterior at each time-label position. The performance of this model is on a par with the performance of the CE and CTC baseline models, c.f. Table 1. The HAT model with a context size of 1 shows a relative WER degradation compared to the infinite-history HAT model. However, the HAT models with contexts 2 and 4 match the performance of the baseline HAT model.

While the posterior cost of the HAT model with context size 2 is somewhat worse than the baseline HAT model loss, the WER of the two models is the same. One reason for this behavior may be that the external LM compensates for any shortcoming of the model. Another explanation is the problem of exposure bias [34], which refers to the mismatch between training and inference for encoder-decoder models. The error propagation in an infinite-context model can be much more severe than in a finite-context model, simply because the finite-context model has a chance to reset its decision after every few labels.
Table 2: HAT model performance for different label context sizes.

Context        | 0    | 1   | 2   | 4   | ∞
1st-pass WER   | 8.5  | 7.4 | 6.6 | 6.6 | 6.6
posterior cost | 34.6 | 5.6 | 5.2 | 4.7 | 4.6
Since a finite context of 2 is sufficient to perform as well as an infinite context, one can simply replace all the expensive RNN kernels in the decoder network with an embedding table holding one vector for each possible label context of size 2. In other words, this trades computation for memory, which can significantly reduce the total training and inference cost.
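The computation-for-memory trade described above can be sketched as an embedding table indexed by the last k labels; all names and sizes below are hypothetical:

```python
import numpy as np

class FiniteContextDecoder:
    """Replace the decoder RNN with a lookup table: one embedding row per
    possible label k-gram context, so the decoder state g_u depends only on
    the last k labels. Minimal sketch with randomly initialized rows."""

    def __init__(self, vocab_size, dim, k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        self.vocab_size = vocab_size
        # vocab_size ** k rows: one per context, padding with label 0
        self.table = rng.standard_normal((vocab_size ** k, dim))

    def context_id(self, history):
        # keep only the last k labels; pad on the left with label 0
        ctx = ([0] * self.k + list(history))[-self.k:]
        cid = 0
        for label in ctx:
            cid = cid * self.vocab_size + label
        return cid

    def __call__(self, history):
        # g_u is a table lookup instead of an RNN forward pass
        return self.table[self.context_id(history)]
```

The table grows as |V|^k, so this only pays off for the small k (here 2) that Table 2 shows is sufficient.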
5 Conclusion
The HAT model is a step toward better acoustic modeling while preserving the modularity of a speech recognition system. One of the key advantages of the HAT model is the introduction of an internal language model quantity that can be measured to better understand encoder-decoder models and to decide whether equipping them with an external language model is beneficial. According to our analysis, the decoder network does not behave as a language model but more like a finite-context model. We presented a few explanations for this observation. Further in-depth analysis is needed to confirm the exact source of this behavior and how to construct models that are truly end-to-end, meaning the prior and posterior models behave as expected for a language model and an acoustic model.
References
 [1] (2015) Rapid vocabulary addition to context-dependent decoder graphs. In Interspeech 2015, Cited by: §1.
 [2] (2013) Joint language and translation modeling with recurrent neural networks. Cited by: §1.

 [3] (1986) Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In Proc. ICASSP, Vol. 86, pp. 49–52. Cited by: §1.
 [4] (1983) A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence (2), pp. 179–190. Cited by: §1.
 [5] (2017) Exploring neural transducers for end-to-end speech recognition. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 206–213. Cited by: §1.
 [6] (2017) Effectively building tera scale maxent language models incorporating nonlinguistic signals. Cited by: §4.
 [7] (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §1.
 [8] (2000) Discriminative training on language model. In Sixth International Conference on Spoken Language Processing, Cited by: §1.

 [9] (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §1.
 [10] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §1.
 [11] (2016) Towards better decoding and language model integration in sequence to sequence models. arXiv preprint arXiv:1612.02695. Cited by: §1, §2.
 [12] (2016) Wav2letter: an end-to-end convnet-based speech recognition system. arXiv preprint arXiv:1609.03193. Cited by: §1.
 [13] (2009) From speech to letters - using a novel neural network architecture for grapheme based ASR. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 376–380. Cited by: §1.

 [14] (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. Cited by: §1, §2.
 [15] (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711. Cited by: §1.
 [16] (2015) On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535. Cited by: §1.
 [17] (2015) Composition-based on-the-fly rescoring for salient n-gram biasing. In Interspeech 2015, International Speech Communications Association, Cited by: §1.
 [18] (2014) Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. Cited by: §1.
 [19] (2019) Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. Cited by: §4, §4.
 [20] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §1.
 [21] (2018) End-to-end speech recognition with word-based RNN language models. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 389–396. Cited by: §1.
 [22] (2019) Model unit exploration for sequence-to-sequence speech recognition. arXiv preprint arXiv:1902.01955. Cited by: §1.
 [23] (1976) Continuous speech recognition by statistical methods. Proceedings of the IEEE 64 (4), pp. 532–556. Cited by: §1.

 [24] (2013) Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1700–1709. Cited by: §1.
 [25] (2018) An analysis of incorporating an external language model into a sequence-to-sequence model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5828. Cited by: §1.
 [26] (2017) Efficient implementation of the room simulator for training deep neural network acoustic models. arXiv preprint arXiv:1712.03439. Cited by: §4.
 [27] (2009) Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3761–3764. Cited by: §1.
 [28] (2002) Discriminative training of language models for speech recognition. In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. I–325. Cited by: §1.
 [29] (1986) Memoir on the probability of the causes of events. Statistical Science 1 (3), pp. 364–378. Cited by: §3.2.
 [30] (2002) Weighted finite-state transducers in speech recognition. Computer Speech & Language 16 (1), pp. 69–88. Cited by: §1.
 [31] (2008) Speech recognition with weighted finitestate transducers. In Handbook of Speech Processing, J. Benesty, M. Sondhi, and Y. Huang (Eds.), pp. 559–582. Cited by: §1.

 [32] (1990) Continuous speech recognition using multilayer perceptrons with hidden Markov models. In International Conference on Acoustics, Speech, and Signal Processing, pp. 413–416. Cited by: §2.
 [33] (2005) Discriminative training for large vocabulary speech recognition. Ph.D. Thesis, University of Cambridge. Cited by: §1.
 [34] (2015) Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. Cited by: §4.
 [35] (2018) No need for a lexicon? evaluating the value of the pronunciation lexica in end-to-end models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5859–5863. Cited by: §1.
 [36] (2015) Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947. Cited by: §2.
 [37] (2019) Component fusion: learning replaceable language model component for end-to-end speech recognition system. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5361–5635. Cited by: §1.
 [38] (2016) Neural speech recognizer: acoustic-to-word LSTM model for large vocabulary speech recognition. arXiv preprint arXiv:1610.09975. Cited by: §1.
 [39] (2017) Cold fusion: training seq2seq models together with language models. arXiv preprint arXiv:1708.06426. Cited by: §1.
 [40] (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.

 [41] (2017) End-to-end training of acoustic models for large vocabulary continuous speech recognition with TensorFlow. Cited by: §4.
 [42] (2017) Comparison of decoding strategies for CTC acoustic models. arXiv preprint arXiv:1708.04469. Cited by: §1.
 [43] (2018) A comparison of modeling units in sequence-to-sequence speech recognition with the transformer on Mandarin Chinese. In International Conference on Neural Information Processing, pp. 210–220. Cited by: §1.
Appendix A Internal Language Model Score
Lemma 1.
The softmax function is invertible up to an additive constant. In other words, for any two real-valued vectors u, v ∈ R^n and real constant c,
softmax(u) = softmax(v) if and only if u_i = v_i + c for i = 1, ..., n.
Proof.
If u_i = v_i + c for all i, then

(10)  softmax(u)_i = e^{v_i + c} / Σ_j e^{v_j + c} = e^{v_i} / Σ_j e^{v_j} = softmax(v)_i

which proves the "if" direction. For the other direction, the proof goes as follows: softmax(u) = softmax(v) means

e^{u_i} / Σ_j e^{u_j} = e^{v_i} / Σ_j e^{v_j}

taking the logarithm of both sides:

u_i − ln Σ_j e^{u_j} = v_i − ln Σ_j e^{v_j}

which implies:

u_i − v_i = ln Σ_j e^{u_j} − ln Σ_j e^{v_j}

This completes the proof since the right-hand side (RHS) of the above equation is constant for any i. ∎
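Lemma 1 can be checked numerically: shifting every logit by the same constant leaves the softmax unchanged, and log softmax(u) − u is a constant vector (the additive constant of the lemma, equal to −logsumexp(u)):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # max-shift uses the lemma for stability
    return e / e.sum()

u = np.array([0.3, -1.2, 2.0])
c = 5.7
# forward direction: a constant shift does not change the softmax
shift_invariant = np.allclose(softmax(u), softmax(u + c))
# reverse direction: the logits are recovered up to one constant
recovered_shift = np.log(softmax(u)) - u   # every entry = -logsumexp(u)
```

The same shift invariance is what makes the usual max-subtraction trick in softmax implementations exact rather than approximate.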
Corollary 1.
For any two random variables X and Y and some real-valued score function S, if the posterior distribution of Y given X is
P(Y | X) = softmax(S(X, Y)), then S(X, Y) = ln P(Y | X) + C(X), for some value C(X) constant in Y.
Proof.
The proof is straightforward following Lemma 1. ∎
Proposition 1.
If the joint function decomposes as J(f_t, g_u) = J1(f_t) + J2(g_u), then, up to the approximation of dropping the normalization constant below, softmax(J2(g_u)) = P(y_{u+1} | y_{0:u}), i.e., the internal LM prior of Eq 8 matches the true label prior.
Proof.
According to Eq 7,

(12)  P(y_{u+1} | x, y_{0:u}) = softmax(s(t, u)) = softmax(J(f_t, g_u))

which implies:

J(f_t, g_u) = ln P(y_{u+1} | x, y_{0:u}) + C(x, y_{0:u}) = ln p(x | y_{0:u+1}) − ln p(x | y_{0:u}) + ln P(y_{u+1} | y_{0:u}) + C(x, y_{0:u})

for some real-valued quantity C(x, y_{0:u}) that is constant in y_{u+1}. The first equality holds by Corollary 1, and the second equality holds by Bayes’ rule.
Applying the exponential function and marginalizing over x results in:

(14)  Σ_x e^{J(f_t, g_u)} ∝ Σ_x [p(x | y_{0:u+1}) / p(x | y_{0:u})] P(y_{u+1} | y_{0:u}) = P(y_{u+1} | y_{0:u}) Σ_x p(x | y_{0:u+1}) / p(x | y_{0:u})

Note that C(x, y_{0:u}) is dropped, which makes the left-hand side (LHS) proportional to the RHS of the first equation. The second equality holds since P(y_{u+1} | y_{0:u}) is not a function of x and can be pulled out of the sum.
Finally, note that

(15)  s(t, u) = J(f_t, g_u) = J1(f_t) + J2(g_u)

where the first equality comes from the definition and the second is the assumption made in the statement of the proposition. Applying the exponential function and marginalizing over x:

(16)  Σ_x e^{J(f_t, g_u)} = e^{J2(g_u)} Σ_x e^{J1(f_t)} ∝ e^{J2(g_u)}

since the second term is a constant and not a function of y_{u+1}. Comparing Eq 14 and Eq 16:
e^{J2(g_u)} ∝ P(y_{u+1} | y_{0:u})

Finally, note that Σ_{y_{u+1}} P(y_{u+1} | y_{0:u}) = 1, thus normalizing both sides gives:

P_prior(y_{u+1} | y_{0:u}) = softmax(J2(g_u)) = P(y_{u+1} | y_{0:u})

which completes the proof. ∎