1 Introduction & Related Work
Language models (LMs) based on long short-term memory (LSTM) neural networks [1] have been widely applied to automatic speech recognition. Both in lattice rescoring [2, 3] and direct one-pass recognition [4, 5, 6], large improvements over count-based n-gram language models are observed. The recurrent update of cell states on each new word theoretically allows modeling of an unlimited context.

One of the most common neural-network-based acoustic modeling methods is the hybrid hidden Markov model (HMM) approach [7], which still gives state-of-the-art performance on several widely used corpora [5, 6]. The joint probability of a word sequence and the input features requires a full summation over all HMM-state sequences. This leads to a high search complexity, since no exact word boundaries are available and all word sequences have to be tracked separately. However, with an n-gram LM, the LM probability for paths sharing the same n-gram context is the same, which enables a complexity reduction. The Viterbi approximation is therefore widely applied instead of the full-sum: the selection of the best word sequence hypothesis is derived from its single best HMM-state sequence only. This allows an independent treatment of each state path of each word sequence, and exact word boundaries of each path are revealed. Recombination of paths sharing the same n-gram context can then be applied to largely simplify the search.

But due to the unlimited context of an LSTM LM, search is again forced to track each word sequence separately, even in Viterbi decoding. Although recombination can be forced by taking a truncated context, a certain performance degradation is observed [4, 8]. This motivates us to reconsider the full-sum over HMM-state sequences, and to explore the potential improvements of using better probabilities in decision making.
Full-sum-based training criteria, such as connectionist temporal classification (CTC) [9], have been widely adopted to improve model training. But even for models trained with full-sum-related criteria, Viterbi decoding is still a standard approach for evaluation [10, 11]. One related work is the prefix search decoding proposed in [9], where certain heuristics are used to apply a section-wise full-sum search, which is claimed to outperform best-path decoding. [12] extended it to the subword level without the heuristics. Overall, there is only limited work investigating full-sum-based decoding and its effect on decision making.

In this work, we revisit the joint probability used in decoding of hybrid HMM-based speech recognition. We argue that by applying the full-sum instead of the Viterbi approximation, the improved probabilities should also benefit decision making. We apply full-sum decoding using prefix-tree search with some modifications. The proposed full-sum decoding is evaluated on both the Switchboard and Librispeech corpora using state-of-the-art systems. Models trained with the cross-entropy (CE) and lattice-based state-level minimum Bayes risk (sMBR) [13] criteria are used. Additionally, both maximum a posteriori (MAP) and confusion network (CN) decoding, as approximated variants of the general Bayes decision rule, are evaluated. Consistent improvements over the strong baselines are achieved in almost all cases. We also discuss the tuning effort, efficiency and some limitations of full-sum decoding.
2 Full-Sum Decoding
Let $w_1^N$ and $x_1^T$ denote a word sequence of length $N$ and input features of $T$ frames, respectively. In the hybrid HMM approach, the key quantity used for decision making is their joint probability:
$p(x_1^T, w_1^N) = p(w_1^N) \cdot \sum_{s_1^T} p(x_1^T, s_1^T \mid w_1^N)$  (1)
$p(x_1^T, w_1^N) \approx p(w_1^N) \cdot \max_{s_1^T} p(x_1^T, s_1^T \mid w_1^N)$  (2)
where $s_1^T$ denotes the HMM-state sequences modeling the word sequence $w_1^N$. By applying the Viterbi approximation, the joint probability of each word sequence is quantified by its single best state sequence only. Beyond the numerical loss, the effect of this approximation on the final decision for the best word sequence is unclear.
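The difference between the full-sum and the Viterbi approximation can be illustrated on a toy HMM by enumerating all state paths explicitly. This is a minimal sketch; the 2-state topology, transition and emission values below are made up purely for illustration:

```python
import itertools
import numpy as np

def path_probs(emit, trans, init):
    """Joint probabilities p(x_1^T, s_1^T) of every state path of a tiny HMM."""
    T, S = emit.shape
    probs = []
    for path in itertools.product(range(S), repeat=T):
        p = init[path[0]] * emit[0, path[0]]
        for t in range(1, T):
            p *= trans[path[t - 1], path[t]] * emit[t, path[t]]
        probs.append(p)
    return np.array(probs)

# toy 2-state left-to-right HMM over 3 frames; all numbers are illustrative
emit = np.array([[0.6, 0.3],
                 [0.4, 0.5],
                 [0.2, 0.7]])        # emission likelihoods p(x_t | s)
trans = np.array([[0.7, 0.3],
                  [0.0, 1.0]])       # transition probabilities p(s | s')
init = np.array([1.0, 0.0])          # start in the first state

probs = path_probs(emit, trans, init)
full_sum = probs.sum()   # Equation (1): sum over all state sequences
viterbi = probs.max()    # Equation (2): single best state sequence only
```

The full-sum score is always at least as large as the Viterbi score, and the gap grows when probability mass is spread over many competing state paths.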
2.1 Decision rules
In speech recognition, the general Bayes decision rule is given by:

$x_1^T \rightarrow \hat{w}_1^{\hat{N}} = \operatorname{argmin}_{w_1^N} \sum_{\tilde{w}_1^{\tilde{N}}} p(\tilde{w}_1^{\tilde{N}} \mid x_1^T) \cdot \mathcal{L}(w_1^N, \tilde{w}_1^{\tilde{N}})$

where $\mathcal{L}$ is the cost function between two word sequences. Instead of the Levenshtein distance, the sentence-level 0-1 cost function is often used to simplify the optimization. This leads to the MAP decision rule:

$x_1^T \rightarrow \hat{w}_1^{\hat{N}} = \operatorname{argmax}_{w_1^N} p(x_1^T, w_1^N)$
Here, differences between using Equation (1) and Equation (2) may result in either better or worse recognition.
The MAP decision rule aims to minimize the expected sentence error rate, which is suboptimal for speech recognition. A thorough study of different cost functions in terms of word error rate (WER) is given in [14, 15], but the experimental verification there is done using Equation (2). Decision rules other than MAP usually include further summations (e.g. for word posteriors), which can interact with the summation at the HMM level. Thus, the impact of the full-sum on the decision becomes even more complicated.
We also investigate this with CN decoding, which is a common approach to better approximate the Bayes decision rule. We follow the pivot-arc-based approach in [16] to construct the CN by clustering the word arcs of the lattice into slots. The resulting hypothesis space is a superset of the original one and all paths have the same length. In this case, the Hamming distance can be directly used as the cost function, and the optimization problem is converted into slot-wise local decisions:

$\hat{w}_k = \operatorname{argmax}_{w} p_k(w \mid x_1^T)$

for each slot $k$, with $p_k(w \mid x_1^T)$ the posterior mass of word $w$ in slot $k$.
Here the posterior probability is obtained by normalizing the joint probability over all word sequences in the word lattice.
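Once the slot-wise posteriors are available, the CN decision itself is a simple per-slot argmax. A minimal sketch with hypothetical posterior values, where an empty string stands in for the epsilon (word deletion) arc:

```python
# each slot maps candidate words (incl. "" for the epsilon arc) to posterior mass;
# the values below are made-up examples, not output of a real system
slots = [
    {"the": 0.7, "a": 0.3},
    {"cat": 0.5, "cap": 0.4, "": 0.1},
    {"sat": 0.9, "": 0.1},
]

best = [max(slot, key=slot.get) for slot in slots]   # slot-wise local argmax
hypothesis = " ".join(w for w in best if w)          # drop epsilon slots
```

Note that the chosen hypothesis may combine words from different lattice paths, which is exactly why the CN hypothesis space is a superset of the original one.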
2.2 Prefix-tree search
For large vocabulary continuous speech recognition (LVCSR), explicit enumeration of all possible word sequences is computationally infeasible. Prefix-tree search using dynamic programming (DP) [17] is a very efficient procedure to handle this, and we apply full-sum decoding within this framework with some modifications. Without recombination, all state paths are characterized by their complete word history, and the probability summation within each word sequence becomes feasible. This is mainly realized by modifying the auxiliary function in the DP recursion:

$Q_{w_1^N}(t, s) = \sum_{s'} p(x_t, s \mid s') \cdot Q_{w_1^N}(t-1, s')$

where the usual maximization over the predecessor states $s'$ is replaced by a summation.
At each target state of each time frame, the probability summation over all incoming paths is carried out, and the merged path continues further in the search. Paths of pronunciation variants are normalized and summed up as well. Note that without recombination, this does not increase complexity compared to Viterbi decoding, since the major difference is just replacing the maximization with a summation. The rest of the search procedure is rather straightforward.
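The claim that only the reduction operator changes can be made concrete in a small sketch of the per-word-sequence DP pass, where `np.max` recovers the Viterbi score and `np.sum` yields the full-sum; the HMM values are the same illustrative toy numbers as used above, not from a real system:

```python
import numpy as np

def dp_score(emit, trans, init, reduce_fn):
    """One DP pass over the frames for a single word sequence.
    reduce_fn=np.max gives the Viterbi score, reduce_fn=np.sum the full-sum."""
    q = init * emit[0]                                  # Q(1, s)
    for t in range(1, emit.shape[0]):
        # Q(t, s): reduce over predecessors s' of Q(t-1, s') * p(s | s'),
        # then multiply by the emission probability of frame t
        q = reduce_fn(q[:, None] * trans, axis=0) * emit[t]
    return reduce_fn(q)

# illustrative 2-state HMM over 3 frames
emit = np.array([[0.6, 0.3], [0.4, 0.5], [0.2, 0.7]])
trans = np.array([[0.7, 0.3], [0.0, 1.0]])
init = np.array([1.0, 0.0])

viterbi = dp_score(emit, trans, init, np.max)
full_sum = dp_score(emit, trans, init, np.sum)
```

Both variants visit exactly the same (frame, state) cells, which is why the full-sum does not increase the search complexity.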
Standard score-based beam pruning is applied to maintain a reasonable size of the search space. Ideally, an infinite pruning threshold would be needed to include all state paths of each word sequence, but we observe that the summation already saturates at a normal threshold. By increasing the pruning threshold, the number of paths grows dramatically, but the contribution of paths with very low probability is negligible and eventually has no effect on the results. For each final word sequence hypothesis, the joint probability in Equation (1) then contains the full-sum over all plausible state sequences.
Note that the resulting lattice from the full-sum decoding is a tree-like structure with certain approximations. Within each word sequence, the word boundaries of each arc are taken from the single best state path. We represent probabilities by scores, taking the negative logarithm, which is common in decoding. The score at each arc boundary is computed from the probability summation of all partial paths merged at this boundary. The score marked on each word arc is the score difference between its right and left boundary, which does not correspond to the probability of this word. However, the accumulated score from the first till the last arc corresponds to the correct probability summation over all state paths of this word sequence. During CN construction, word boundaries are used as part of the arc distance measure for clustering and are discarded afterwards. The correct full-sum is used to compute the posteriors for arc clustering and decision making.
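Since scores are negative log probabilities, merging partial paths at a boundary amounts to a log-sum-exp. A small sketch of this stabilized summation, assuming scores in the negative-log domain (the Viterbi counterpart would simply keep `min(scores)`):

```python
import math

def merge_scores(scores):
    """Full-sum merge of partial-path scores (negative log probabilities):
    returns -log(sum_i exp(-score_i)), factoring out the best path so that
    the exponentials stay in a numerically safe range."""
    best = min(scores)
    return best - math.log(sum(math.exp(best - s) for s in scores))

# merging two paths with probabilities 0.5 and 0.25 gives probability 0.75
merged = merge_scores([-math.log(0.5), -math.log(0.25)])
```

This also illustrates why very improbable paths can be pruned safely: a path whose score is far above `best` contributes an exponentially small term to the sum.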
2.3 Parameter tuning
In practice, an acoustic scale $\alpha$ and an LM scale $\lambda$ are often applied for an optimal weighting among the knowledge sources. For MAP Viterbi decoding, $\alpha$ can be set to 1 and only $\lambda$ needs to be tuned. But for full-sum decoding, both of them need to be tuned. This tuning effort can be largely reduced by first tuning $\lambda$ with $\alpha$ set to 1, and then linearly scaling both of them with their ratio fixed. In our experience, this reaches the same optimum as a full grid search in most cases.
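The two-stage procedure can be sketched as follows. Here `wer` is a hypothetical callable that runs recognition with the given acoustic and LM scales and returns the WER; the synthetic surrogate below only stands in for a real recognizer and its numbers carry no meaning:

```python
def tune_scales(wer, lm_grid, factor_grid):
    """Two-stage tuning: first tune the LM scale with the acoustic scale fixed
    to 1, then rescale both scales by a common factor, keeping their ratio."""
    lm = min(lm_grid, key=lambda l: wer(1.0, l))                # stage 1
    factor = min(factor_grid, key=lambda f: wer(f, f * lm))     # stage 2
    return factor, factor * lm

# synthetic WER surface whose optimum depends on the scale ratio and the
# overall magnitude of the scales (a made-up stand-in for real recognition)
def toy_wer(am_scale, lm_scale):
    return (lm_scale / am_scale - 12.0) ** 2 + (am_scale - 0.7) ** 2

am, lm = tune_scales(toy_wer, lm_grid=range(8, 16),
                     factor_grid=[0.5, 0.6, 0.7, 0.8, 1.0])
```

Compared to a full grid search over both scales, this reduces the number of recognition runs from the product to the sum of the two grid sizes.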
3 Experiments
3.1 Experimental settings
The proposed full-sum decoding is implemented based on the RWTH ASR toolkit [18] with the extensions described in [4]. Both MAP and CN decoding, as approximated variants of the Bayes decision rule, are evaluated. For the MAP results, a one-pass recognition setup using an LSTM LM without recombination is applied. The word lattices generated by the one-pass recognition are used to further apply CN decoding. Additionally, models trained with the CE and lattice-based sMBR criteria are evaluated. Note that the full-sum is not applied in training. Experiments are conducted on both the Switchboard corpus [19] and the Librispeech corpus [20]. Acoustic and LM scales are optimized on the development sets.
3.2 Switchboard
The acoustic models are the same as described in [11, 6]; they consist of six BLSTM layers with 500 units per direction. Both the CE- and sMBR-based models are trained on the 300h Switchboard-1 Release 2 corpus using 40-dimensional gammatone features [21]. The language model consists of two LSTM layers with 1024 units and is described in [4]. The Hub5’00 and Hub5’01 datasets are used as development and test sets, respectively.
Table 1 compares the proposed full-sum decoding to the standard Viterbi decoding for different models and decision rules. Only one case improves with the MAP decision rule, but consistent improvements are obtained with CN decoding for both models. Interestingly, for the CE model the performance of Viterbi decoding even degrades from MAP to CN, which is not the case for full-sum decoding.
Figure 1 shows an efficiency comparison of the two decoding variants with the sMBR model and the MAP decision rule on the Hub5’00 dataset. A smaller real time factor (RTF) is obtained with stronger pruning. Full-sum decoding is more sensitive to strong pruning: besides the direct search errors caused by strong pruning, it also suffers from an indirect influence. During search with strong pruning, non-negligible state paths of a partial word sequence hypothesis are pruned away, which introduces a certain loss to the probability summation. This effect can accumulate during the search and eventually leads to more search errors. On the other hand, when the WER converges with less pruning, there is no efficiency loss for full-sum decoding.
Table 1: WER [%] of Viterbi and full-sum decoding on Switchboard for different models and decision rules.

Model  Decision  Decoding  Hub5’00  Hub5’01
CE     MAP       Viterbi   12.2     12.2
                 full-sum  12.2     12.2
       CN        Viterbi   12.4     12.4
                 full-sum  12.1     12.2
sMBR   MAP       Viterbi   11.7     11.5
                 full-sum  11.6     11.5
       CN        Viterbi   11.7     11.5
                 full-sum  11.4     11.3
Table 2: WER [%] of Viterbi and full-sum decoding on Librispeech for different models and decision rules.

Model  Decision  Decoding  dev-clean  dev-other  test-clean  test-other
CE     MAP       Viterbi   2.4        5.8        2.8         6.3
                 full-sum  2.4        5.7        2.8         6.2
       CN        Viterbi   2.4        5.7        2.8         6.1
                 full-sum  2.4        5.6        2.8         6.0
sMBR   MAP       Viterbi   2.2        5.1        2.6         5.5
                 full-sum  2.1        5.0        2.5         5.4
       CN        Viterbi   2.2        5.0        2.6         5.4
                 full-sum  2.1        4.8        2.5         5.3
3.3 Librispeech
We use the state-of-the-art hybrid HMM system described in [5]. Both the CE- and sMBR-based acoustic models consist of six BLSTM layers with 1000 units per direction, and are trained on the complete Librispeech training data using 50-dimensional gammatone features. The language model has two LSTM layers with 4096 units.
A detailed WER comparison of the two decoding variants with different models and decision rules is shown in Table 2. Even with such a strong baseline of very low WER, further improvements are obtained with full-sum decoding in almost all scenarios. This again verifies the benefit of improved probabilities for decision making. Note that the models are unchanged; they are only applied differently in decoding, so the improvements come without any additional training cost.
One interesting observation is that larger improvements are achieved on the ’other’ datasets than on the ’clean’ ones. To better understand this, we inspect the state posteriors of the acoustic models’ top-5 outputs at each frame. We accumulate these posteriors at each frame and average them over each dev set for each model, as shown in Figure 2. This partially reflects the models’ distribution over the states: both models tend to produce a sharper distribution on the dev-clean set than on the dev-other set. One interpretation is that when a model is confident on a very limited number of state paths, there is less difference between full-sum and Viterbi decoding.
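The sharpness measure described above can be sketched as the averaged total posterior mass of the top-5 states per frame. The toy posterior matrices below are made up; a real system would use the acoustic model's frame-wise softmax outputs:

```python
import numpy as np

def avg_topk_mass(posteriors, k=5):
    """Average over frames of the summed posterior mass of the k most likely
    states; values near 1 indicate sharp (confident) frame-wise distributions."""
    topk = np.sort(posteriors, axis=1)[:, -k:]   # k largest posteriors per frame
    return topk.sum(axis=1).mean()

# toy example over 4 frames and 10 states: a uniform distribution is maximally
# unsure, while one dominant state per frame signals a confident model
uniform = np.full((4, 10), 0.1)
sharp = np.tile([0.9] + [0.1 / 9] * 9, (4, 1))
```

For the uniform case the top-5 mass is exactly 0.5, while the sharp case concentrates almost all mass in the top-5 states, mirroring the dev-other versus dev-clean contrast discussed above.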
4 Conclusion
In this paper, we showed that applying the full-sum instead of the Viterbi approximation in decoding yields more accurate probabilities that improve decision making. The proposed full-sum decoding was verified on different corpora, models and decision rules, and showed consistent improvements in almost all cases. We showed that full-sum decoding is more sensitive to strong pruning, but incurs no efficiency loss with normal pruning. One major advantage is that even for a state-of-the-art system with very low WER, further improvements can be achieved with negligible extra cost.
One possible future direction is to investigate the effect of full-sum decoding in noisy conditions, where models are in general less confident. Another is to investigate the joint effect with models trained using full-sum-based training criteria, such as CTC.
5 Acknowledgements
This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 694537, project “SEQCLAS”) and from a Google Focused Award. The work reflects only the authors’ views and none of the funding parties is responsible for any use that may be made of the information it contains.
We thank Christoph Lüscher, Kazuki Irie, Markus Kitza and Wilfried Michel for providing the models.
References
 [1] Sepp Hochreiter and Jürgen Schmidhuber, “Long ShortTerm Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [2] Martin Sundermeyer, Zoltán Tüske, Ralf Schlüter, and Hermann Ney, “Lattice Decoding and Rescoring with Long-Span Neural Network Language Models,” INTERSPEECH, pp. 661–665, 2014.
 [3] Shankar Kumar, Michael Nirschl, Daniel N. HoltmannRice, Hank Liao, Ananda Theertha Suresh, and Felix X. Yu, “Lattice Rescoring Strategies for Long Short Term Memory Language Models in Speech Recognition,” ASRU, pp. 165–172, 2017.
 [4] Eugen Beck, Wei Zhou, Ralf Schlüter, and Hermann Ney, “LSTM Language Models for LVCSR in FirstPass Decoding and LatticeRescoring,” 2019, https://arxiv.org/abs/1907.01030.
 [5] Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “RWTH ASR Systems for LibriSpeech: Hybrid vs Attention,” INTERSPEECH, pp. 231–235, 2019.
 [6] Markus Kitza, Pavel Golik, Ralf Schlüter, and Hermann Ney, “Cumulative Adaptation for BLSTM Acoustic Models,” INTERSPEECH, 2019.
 [7] Herve A. Bourlard and Nelson Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, Norwell, MA, USA, 1993.
 [8] Javier Jorge, Adrià Giménez, Javier Iranzo-Sánchez, Jorge Civera, Albert Sanchis, and Alfons Juan, “Real-time One-pass Decoder for Speech Recognition Using LSTM Language Models,” INTERSPEECH, pp. 3820–3824, 2019.
 [9] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in ICML, 2006, vol. 148, pp. 369–376.
 [10] Naoyuki Kanda, Xugang Lu, and Hisashi Kawai, “Maximum A Posteriori Based Decoding for CTC Acoustic Models,” in INTERSPEECH, 2016, pp. 1868–1872.
 [11] Wilfried Michel, Ralf Schlüter, and Hermann Ney, “Comparison of LatticeFree and LatticeBased Sequence Discriminative Training Criteria for LVCSR,” INTERSPEECH, pp. 1601–1605, 2019.
 [12] Jennifer Drexler and James R. Glass, “Subword Regularization and Beam Search Decoding for End-to-end Automatic Speech Recognition,” in ICASSP, 2019, pp. 6266–6270.
 [13] Matthew Gibson and Thomas Hain, “Hypothesis Spaces for Minimum Bayes Risk Training in Large Vocabulary Speech Recognition,” in INTERSPEECH, 2006.
 [14] Ralf Schlüter, Markus Nussbaum-Thom, and Hermann Ney, “On the Relationship Between Bayes Risk and Word Error Rate in ASR,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 1103–1112, Aug. 2011.
 [15] R. Schlüter, M. NussbaumThom, and H. Ney, “Does the Cost Function Matter in Bayes Decision Rule?,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 292–301, Feb. 2012.
 [16] Björn Hoffmeister, Bayes Risk Decoding and its Application to System Combination, Ph.D. thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany, 2011.
 [17] Hermann Ney and Stefan Ortmanns, “Dynamic Programming Search for Continuous Speech Recognition,” IEEE Signal Processing Magazine, vol. 16, no. 5, pp. 64–83, Sep. 1999.
 [18] S. Wiesler, A. Richard, P. Golik, R. Schlüter, and H. Ney, “RASR/NN: The RWTH Neural Network Toolkit for Speech Recognition,” in ICASSP, 2014, pp. 3281–3285.
 [19] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone Speech Corpus for Research and Development,” in ICASSP, 1992, vol. 1, pp. 517–520.
 [20] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” ICASSP, pp. 5206–5210, 2015.
 [21] Ralf Schlüter, Ilja Bezrukov, Hermann Wagner, and Hermann Ney, “Gammatone Features and Feature Combination for Large Vocabulary Speech Recognition,” ICASSP, pp. 649–652, 2007.