1 Introduction
The quality of the data used for language modeling (LM) directly determines the accuracy of an automatic speech recognition (ASR) system. Most commercial ASR systems use n-gram based LMs trained on large amounts of text data. In spoken dialog systems like Amazon Alexa and Google Home, transcribed user utterances, collected from live traffic, form a large part of the training data. The transcribed data is also used for optimizing the LM parameters. Consequently, the data for training and tuning the LM always comes from past user usage. When a new application intent is added to the system (an example of a new application intent is the ability to ask for cooking recipes), the past user data is no longer sufficient to model the user interactions expected after the release of the application intent. This mismatch often leads to reduced recognition quality on recently launched applications.
In the absence of text data for training n-gram models, a common approach is to use a probabilistic context-free grammar (pCFG), similar to the one described in [1]. Figure 1 shows an example grammar for a new application intent, “cooking recipes”. The grammar is written to encompass the expected user interaction for the new application. The non-terminals (DISH_NAME in the example) are expanded by specifying a catalog of entities for each non-terminal. Further, weights can be specified for the phrases and entities in the grammar to assign a probability to each sentence generated by the pCFG. In this sense, the pCFG can be regarded as an LM that is specific to a particular application intent.
The LM of an ASR system can be adapted to the new application by combining the pCFG with an existing model. A common approach is to sample data from the grammar [3], [4], or to transform out-of-domain sentences into in-domain sentences using a simulator [5], and combine this data with pre-existing training data. However, certain grammars, e.g. those with large non-terminal catalogs, can encode exponentially many different paths, and hence a large amount of data must be generated from these grammars to capture all possible variations. Another approach is to represent the final LM as a union of the grammar and n-gram models, where each of them is represented as an FST [6], [7]. However, rapidly growing spoken dialogue systems introduce tens of new applications every month, and the final model can become quite large and lead to increased latency. In section 2, we introduce a novel method for extracting n-gram counts directly from the grammar, thereby eliminating the need to generate sampled data. These counts can then be used to train an n-gram based model or as additional data for training neural network based models.
Language model adaptation has also been explored in prior work. [8] adapted a recurrent neural network LM to a single domain, while [9], [10] built multi-domain neural network models by using domain vectors to modify the model output. However, an important aspect missing from these approaches is a way to optimize the LM for multiple domains while maintaining performance on past usage. In section 3, we introduce a constrained optimization criterion that minimizes the loss on new application intents while not degrading the performance on existing applications supported by the spoken dialogue system. Our work can be extended to adapting neural language models by applying the same constrained loss during network training.

2 Estimating n-gram models from pCFG
A language model estimates the probability of a given sequence of words $W = w_1, \ldots, w_N$. N-gram LMs make the assumption that the probability of word $w_i$ depends only on the previous $n-1$ words, so that the probability can be written as:

$$P(W) = \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \qquad (1)$$

For maximum likelihood estimation, it can be shown that $P(w_i \mid h) = c(h, w_i) / c(h)$, where $h = w_{i-n+1}, \ldots, w_{i-1}$ and $c(\cdot)$ is the count of the n-gram in the training data. Furthermore, for the most commonly used smoothing techniques [11], n-gram counts are sufficient to estimate the n-gram LM. In this paper, we extract fractional n-gram counts directly from a pCFG using the algorithm described in section 2.1, and use smoothing techniques that can work with expected counts to estimate the n-gram probabilities.
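As a concrete illustration of the maximum-likelihood estimate above, the following sketch collects n-gram and history counts and computes $P(w_i \mid h) = c(h, w_i) / c(h)$. The function names and toy sentences are illustrative, not from the paper; counts are stored as floats so the same code also accepts the fractional counts discussed later.

```python
from collections import defaultdict

def ngram_counts(sentences, n=2):
    """Collect n-gram and history counts from tokenized sentences."""
    c_ngram = defaultdict(float)
    c_hist = defaultdict(float)
    for words in sentences:
        padded = ["<s>"] * (n - 1) + words + ["</s>"]
        for i in range(n - 1, len(padded)):
            hist = tuple(padded[i - n + 1:i])
            c_ngram[hist + (padded[i],)] += 1.0
            c_hist[hist] += 1.0
    return c_ngram, c_hist

def ml_prob(word, hist, c_ngram, c_hist):
    """Maximum-likelihood P(word | hist) = c(hist, word) / c(hist)."""
    hist = tuple(hist)
    if c_hist[hist] == 0.0:
        return 0.0
    return c_ngram[hist + (word,)] / c_hist[hist]
```

For example, with the two toy utterances below, the bigram “ask for” always follows “ask”, so its maximum-likelihood probability is 1.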
2.1 Expected n-gram counts
A pCFG can be represented as a weighted finite state transducer (FST), where each arc represents either a word or a non-terminal, and the weights on the arcs are the probabilities associated with them; each non-terminal is itself also represented as an FST. When a sentence is sampled from the FST, the non-terminal arcs are replaced by the corresponding non-terminal FSTs to generate the final sentence.
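To make the replacement of non-terminal arcs concrete, here is a minimal sketch of a pCFG as a table of weighted expansions, with sampling that recursively substitutes non-terminals. The grammar, phrases, weights, and symbol names are invented for illustration and are not the paper's actual grammar.

```python
import random

# Toy pCFG as weighted expansions; UPPERCASE symbols are non-terminals.
# The grammar, phrases, and weights here are invented for illustration.
GRAMMAR = {
    "ROOT": [(["how", "do", "i", "make", "DISH_NAME"], 0.7),
             (["recipe", "for", "DISH_NAME"], 0.3)],
    "DISH_NAME": [(["pasta"], 0.5), (["apple", "pie"], 0.5)],
}

def sample(symbol="ROOT", rng=random):
    """Sample a sentence, recursively replacing each non-terminal by an
    expansion drawn according to its weight (mirroring how non-terminal
    arcs are replaced by the corresponding FSTs)."""
    expansions = [e for e, _ in GRAMMAR[symbol]]
    weights = [w for _, w in GRAMMAR[symbol]]
    out = []
    for tok in rng.choices(expansions, weights=weights)[0]:
        out.extend(sample(tok, rng) if tok in GRAMMAR else [tok])
    return out
```

The probability of a sampled sentence is the product of the weights of the expansions used, which is exactly the path probability of the composed FST.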
Each sequence $s$ generated from the FST has an associated probability $p(s)$. The expected count of an n-gram $g$ within this sequence is $c(g, s)\,p(s)$, where $c(g, s)$ is the count of $g$ in $s$. The expected count of an n-gram for the entire FST can be calculated by generating all possible sequences and summing their expected counts:

$$C(g) = \sum_{s} c(g, s)\,p(s) \qquad (2)$$

Computing this sum is non-trivial, as it requires enumerating all the sequences in the FST and calculating the normalized probability $p(s)$ for each. However, we can use dynamic programming to efficiently compute the n-gram counts as well as the normalized probability for each n-gram.
Each n-gram $g$ is represented as a sequence of arcs in the FST, and the expected count of the n-gram can be computed as the sum of the probabilities of all paths that include the corresponding sequence of arcs. If $q_s$ is the initial state of the arc sequence and $q_e$ is its final state, then the sum of probabilities for the sequence of arcs is:

$$p(g) = \frac{\alpha(q_s)\,w(g)\,\beta(q_e)}{Z} \qquad (3)$$

where $\alpha(q_s)$ and $\beta(q_e)$ are the unnormalized sums of the probabilities of the paths ending in state $q_s$ and starting in state $q_e$ respectively, $w(g)$ is the unnormalized probability of the arc sequence, and $Z$ is the normalization factor (the sum of the probabilities of all paths in the FST). The values of $\alpha$ and $\beta$ for every state in the FST and the normalization factor $Z$ can be computed efficiently using the forward-backward algorithm [12]. The expected count of the n-gram is then equal to $p(g)$.
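The forward-backward quantities in equation (3) can be sketched for an acyclic word FST as follows. The arc-list representation and the function names are assumptions made for illustration, and only single-arc (unigram) sequences are shown for brevity; longer n-grams multiply the arc weights along the sequence in the same way.

```python
from collections import defaultdict

def forward_backward(arcs, start, final):
    """Forward (alpha) and backward (beta) unnormalized path sums for an
    acyclic weighted word FST. arcs: list of (src, dst, word, weight),
    assumed sorted in topological order of the source state."""
    alpha = defaultdict(float)
    beta = defaultdict(float)
    alpha[start] = 1.0
    beta[final] = 1.0
    for src, dst, _, w in arcs:            # forward pass
        alpha[dst] += alpha[src] * w
    for src, dst, _, w in reversed(arcs):  # backward pass
        beta[src] += beta[dst] * w
    Z = alpha[final]                       # sum of all path weights
    return alpha, beta, Z

def unigram_expected_count(arcs, alpha, beta, Z, word):
    """Expected count of a one-word n-gram: equation (3) applied to every
    arc carrying `word`, i.e. the sum of alpha(src) * w * beta(dst) / Z."""
    return sum(alpha[s] * w * beta[d] / Z
               for s, d, arc_word, w in arcs if arc_word == word)
```

On a tiny automaton where “play” and “start” both lead into “music”, the expected count of “music” is 1 and that of “play” equals its arc probability.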
To compute the counts of all n-grams in the FST, we traverse the FST in forward topological order. For each state $q$, we iterate over all the incoming n-grams $g$ with their corresponding accumulated probabilities $\alpha_g(q)$. For each outgoing arc $a$ at the state, with word $w_a$ and next state $n(a)$, we calculate the expected count of the extended n-gram using the following iterative equation:

$$C(g\,w_a) = \frac{\alpha_g(q)\,w(a)\,\beta(n(a))}{Z} \qquad (4)$$
and propagate the n-gram to the next state of the arc along with its accumulated probability $\alpha_g(q)\,w(a)$. (Iterating over all n-grams can be expensive for adjacent non-terminals or cyclic grammars; this can be mitigated by applying heuristics such as not propagating forward n-grams with counts lower than a given threshold.)

2.2 Estimating Language Models
Given the fractional counts extracted from a pCFG, smoothing algorithms that use fractional n-gram counts [13], [14] can be used to build the n-gram model. Using count-of-counts methods like Katz smoothing [15] and modified Kneser-Ney smoothing [11] requires extracting count-of-counts from the pCFG, which can be done in a way similar to extracting fractional n-gram counts. Similarly, the n-gram counts can be used to train a feed-forward neural network or to interpolate with an existing neural network model [16]. However, for simplicity, in this paper we use Katz smoothing on scaled fractional counts to build the final n-gram model.

3 Optimizing interpolation weights
A pre-existing LM can be adapted to the new application intent by linear interpolation with the LM from section 2, where the probability of a word sequence is calculated using a different interpolation weight $\lambda_i$ for each LM in the mixture:

$$P(W) = \sum_{i} \lambda_i P_i(W) \qquad (5)$$

such that $\sum_{i} \lambda_i = 1$. In this approach, the interpolation weights are estimated by minimizing a loss, e.g. perplexity, on representative data from the new application intent. When bootstrapping an existing LM with a grammar, however, just optimizing on the application's data is not sufficient; we must also make sure that the final LM does not degrade on existing applications. Hence, we propose a constrained optimization problem:
$$\min_{\lambda_1, \lambda_2} \; L(D_{\text{new}}; \lambda_1, \lambda_2) \qquad (6)$$

$$\text{subject to} \quad L(D_{\text{past}}; \lambda_1, \lambda_2) \le L(D_{\text{past}}; \lambda_1 = 1)$$

where $\lambda_1, \lambda_2$ are the interpolation weights for the pre-existing LM and the application intent LM respectively, $L$ is the loss function being minimized, $D_{\text{new}}$ is representative data for the new application, and $D_{\text{past}}$ is a development set based on past usage.

3.1 Loss functions
The actual loss function to be minimized (or constrained) depends on the adaptation data available for a new application intent. In this paper, we propose to use three loss functions:
3.1.1 Negative squared minimization
When no data is available for an application intent, the only way to optimize the LM for the new intent is to assign the maximum possible interpolation weight to the application intent LM. Hence, the loss can be expressed as:

$$L = -\lambda_2^2 \qquad (7)$$

Minimizing this loss maximizes the application intent's interpolation weight until the constraint on past data is violated.
3.1.2 Perplexity minimization
When sample text data from the new application intent is available for LM adaptation, one can minimize the perplexity of the interpolated model on that data:

$$L = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i)\right) \qquad (8)$$

where $P(w_i)$ is the probability assigned by the interpolated model to the word $w_i$ and $N$ is the total number of words in the data.
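Equations (5) and (8) combine into the following sketch, which computes the perplexity of a linearly interpolated model. The per-word probability-list input format is an assumption made for illustration.

```python
import math

def interpolated_ppl(per_model_word_probs, lambdas):
    """Perplexity (equation (8)) of the linear mixture in equation (5).
    per_model_word_probs: one list per component LM, holding that model's
    probability for each word of the adaptation data (an assumed format)."""
    n_words = len(per_model_word_probs[0])
    log_sum = 0.0
    for i in range(n_words):
        # Mixture probability of word i: sum_j lambda_j * P_j(w_i).
        p = sum(lam * probs[i]
                for lam, probs in zip(lambdas, per_model_word_probs))
        log_sum += math.log(p)
    return math.exp(-log_sum / n_words)
```

With two component models and equal weights, each word's mixture probability is the average of the two models' probabilities, and the perplexity is the inverse geometric mean of those values.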
3.1.3 Expected WER minimization
When transcribed audio is available for an application intent, we can directly minimize the expected recognition error rate of the ASR system, as explained in [17]. [18] showed that perplexity is often not correlated with WER, and hence it may be better to directly minimize the WER when possible. The expected WER loss is computed by summing the recognition error over all hypotheses $h$, weighted by their posterior probabilities $P(h \mid X)$:

$$L = \sum_{h} E(h)\,P(h \mid X) \qquad (9)$$

where $E(h)$ is the number of recognition errors in hypothesis $h$. To make the sum over all possible hypotheses tractable, we approximate it by restricting $h$ to an n-best list of ASR hypotheses.
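A minimal sketch of the n-best approximation to equation (9). Renormalizing the hypothesis scores into posteriors over the n-best list is an assumption here, as is the (num_errors, score) input format.

```python
def expected_wer(nbest, ref_len):
    """N-best approximation of equation (9): recognition errors of each
    hypothesis weighted by its posterior probability. nbest holds
    (num_errors, score) pairs; the scores are renormalized over the
    n-best list to form posteriors (an assumption in this sketch)."""
    total = sum(score for _, score in nbest)
    expected_errors = sum(err * score / total for err, score in nbest)
    return expected_errors / ref_len
```

For instance, a two-hypothesis list in which the error-free hypothesis carries 60% of the posterior mass and a two-error hypothesis carries 40% yields 0.8 expected errors.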
3.2 Constrained optimization
The constraint on past usage is implemented as a penalty term [19] in the final optimization function:

$$L_{\text{total}} = L(D_{\text{new}}) + \mu \, \max\bigl(0,\; L_c(D_{\text{past}}) - L_{\text{pre}}\bigr) \qquad (10)$$

where $L_c$ is the loss function used for the constraint, $L_{\text{pre}}$ is the loss of the pre-existing LM on the past-usage data, and $\mu$ is the penalty coefficient that controls the tolerance of the constraint. $\mu$ can be set either statically to a high value or changed dynamically with every iteration of the optimization, as suggested in [20]. In this paper, we use a static penalty value of 1000. The loss function for the constraint can be either perplexity or expected WER. However, we expect that constraining on WER will lead to better interpolation weights for the new application intent.
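The penalty formulation in equation (10) can be sketched as follows, with a simple grid search over the application-intent weight standing in for the paper's actual optimizer; the helper names and the toy loss functions in the example are illustrative assumptions.

```python
def penalized_loss(lam, loss_new, loss_past, loss_pre, mu=1000.0):
    """Equation (10): new-intent loss plus a penalty for any degradation
    on past usage beyond the pre-existing LM's loss (mu = 1000, the
    static value used in the paper)."""
    violation = max(0.0, loss_past(lam) - loss_pre)
    return loss_new(lam) + mu * violation

def best_weight(loss_new, loss_past, loss_pre, steps=101, mu=1000.0):
    """Grid search over the application-intent weight lambda_2 in [0, 1]
    (the pre-existing LM gets 1 - lambda_2); a sketch of the idea, not
    the paper's optimizer."""
    grid = [i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda lam: penalized_loss(
        lam, loss_new, loss_past, loss_pre, mu))
```

With the negative squared loss of equation (7) and a toy past-usage loss that grows linearly in the weight, the search pushes the weight up exactly until the constraint binds.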
3.3 Scaling to multiple applications
The proposed method easily scales to adding multiple application intent grammars to an existing n-gram LM at the same time. We can extend equation (6) with one loss per application:

$$\min_{\lambda} \; \sum_{j=1}^{M} L_j(D_j) \quad \text{subject to} \quad L(D_{\text{past}}) \le L(D_{\text{past}}; \lambda_1 = 1)$$

where $L_j$ is the loss function chosen for application $j$ and $M$ is the total number of applications. The choice of loss function and constraint is independent for each application intent, which enables us to choose the loss function for each intent depending on the availability of data for that application.
Table 1: Comparison of loss functions and constraints. PPL and WER are reported on the past-usage test set (Past Data) and on the three new application intents.

| Constraint | Loss function | Past Data PPL | Past Data rel. WER | Stock Price PPL | Stock Price WER | Flights Info PPL | Flights Info WER | Recipes PPL | Recipes WER |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | — | 32.4 | — | 41.6 | 6.80 | 85.6 | 6.45 | 90.9 | 7.68 |
| PPL | min L2 | 32.6 | 0% | 41.6 | 6.82 | 80.3 | 6.29 | 90.9 | 7.65 |
| PPL | PPL | 32.4 | 0% | 37.6 | 6.67 | 65.7 | 5.82 | 76.3 | 5.96 |
| PPL | expWER | 32.7 | 0% | 37.3 | 6.64 | 62.4 | 5.75 | 73.8 | 5.78 |
| expWER | min L2 | 34.4 | 1% | 35.2 | 6.54 | 65.3 | 5.74 | 83.2 | 6.49 |
| expWER | PPL | 34.5 | 1% | 35.1 | 6.52 | 64.3 | 5.70 | 81.0 | 6.26 |
| expWER | expWER | 34.5 | 1% | 35.1 | 6.54 | 61.9 | 5.63 | 79.2 | 6.14 |
4 Experiments
The training data for the baseline language model is a combination of transcribed data from users of a personal assistant system as well as other in-house data sources (data crawled from reddit.com, amazon forums, and amazon customer reviews). We build one LM for each data source and interpolate them into a single LM. The interpolation weights are optimized on a held-out subset of transcribed data from users.
We evaluate our approach on grammars written for new application intents that have little or no coverage in the LM training data. We tested with three application intents: getting stock prices, getting flight information, and asking for recipes.
For running ASR experiments, we use an in-house ASR system (trained on only a subset of the data used for our production system) based on [21, 22]. The acoustic model is a DNN trained on concatenated LFBE and i-vector features, and the baseline LM is a 4-gram model with a vocabulary of 200K words, trained with modified Kneser-Ney smoothing [11]. The LMs for the data extracted from application grammars were trained using Katz smoothing, and we limited the number of new vocabulary words (new relative to the baseline) to 10K. Each intent was tested for perplexity (PPL) and word error rate (WER) on a held-out test set targeted towards the intent, containing about 500 to 700 utterances (the test set was collected in-house by language experts and is not reflective of actual live usage of the application). We also report the PPL and the relative WER degradation on a past-usage test set, dubbed Past Data.
4.1 N-gram count extraction
In section 2.1, we proposed extracting n-gram counts directly from the pCFG written for a new application intent. In figure 2, we compare the proposed method with a model trained on data sampled from the pCFGs for the three application intents. The application LM is interpolated with the baseline model using weights estimated by constrained optimization with expected word error rate as both the loss and the constraint. We find that the difference in performance between the two techniques varies across applications: the perplexity difference is small for Recipes and Stock Price, while the proposed method is significantly better for Flights Info. Similar results are observed for word error rate, although the Stock Price word error rate degrades with a larger number of samples.
We also examined the total number of trigrams (without replacing non-terminals) in the grammars for the application intents, as well as the total number of non-terminals in the three grammars, as shown in table 2. The Flight Info and Stock Price application grammars have a much larger number of trigrams. However, Flight Info has a large number of non-terminal pairs (non-terminals appearing next to each other), which leads to an exponentially large number of n-grams that can be sampled; hence we observe a large drop in perplexity and word error rate when using the proposed method.
Table 2: Grammar statistics for the three application intents.

|  | Stock Price | Recipes | Flight Info |
|---|---|---|---|
| # of trigrams | 152k | 49k | 306k |
| # of non-terminals | 3 | 12 | 23 |
| # of non-terminal pairs | 3 | 23 | 203 |
4.2 Comparing loss functions and constraints
Section 3 describes different methods for estimating interpolation weights, covering both the loss to be minimized and the constraint. Table 1 compares the different loss functions when constraining on the perplexity of the baseline LM on past data, and the same loss functions with expected WER as the constraint. While both constraints ensure that the WER on past data degrades only marginally, constraining on expected WER leads to larger WER improvements for the new applications. The improvements vary across the application intents and exceed 10% relative for the Flights Info and Recipes applications, which can be attributed to the initial model having a high perplexity on these applications.
We also find that negative squared (L2) minimization works quite well with the expected WER constraint, and the improvement in WER is comparable to the other loss functions. This is quite useful, as it shows that we can adapt to new application intents even when no adaptation data is available. The expected WER constraint uses the acoustic scores of each word when calculating the posterior probability, thereby reducing the dependency on the LM scores. This may explain why expected WER is a less strict constraint than perplexity and assigns larger interpolation weights to new application intent LMs.
4.3 Scalability
In the previous experiments, each application intent was optimized separately on its own loss function. However, our proposed method allows us to add as many new application intents into the optimization as we want. In figure 3, we show the word error rate change for the three application intents as more application intents are added into the optimization (the new applications were chosen randomly from a set of held-out application intents; some examples are Shopping, Calendar events, and Donate to charity). We find that even though the word error rate increases with every new application added, the increase is not significant even with 12 new applications.
5 Conclusion
We propose a new method for adapting the language model of a spoken dialogue system, specifically designed to scale to support multiple application intents. The proposed constrained optimization function ensures that the accuracy of the system does not degrade as new applications are added to the system. We also show that using word error rate as a constraint leads to better adaptation for the new applications and removes the need for any adaptation data. In the future, we want to extend the same framework for adaptation of neural network based LMs.
References
 [1] Daniel Jurafsky, Chuck Wooters, Jonathan Segal, Andreas Stolcke, Eric Fosler, Gary Tajchman, and Nelson Morgan, “Using a stochastic context-free grammar as a language model for speech recognition,” in Proceedings of ICASSP. IEEE, 1995, vol. 1, pp. 189–192.
 [2] http://www.openfst.org/twiki/bin/view/grm/thrax.
 [3] Lucian Galescu, Eric Ringger, and James Allen, “Rapid language model development for new task domains,” 1998.
 [4] Karl Weilhammer, Matthew N Stuttle, and Steve Young, “Bootstrapping language models for dialogue systems,” in Ninth International Conference on Spoken Language Processing, 2006.
 [5] Grace Chung, Stephanie Seneff, and Chao Wang, “Automatic induction of language model data for a spoken dialogue system,” in 6th SIGdial Workshop on Discourse and Dialogue, 2005.
 [6] Ariya Rastrow, Bjorn Hoffmeister, Sri Venkata Surya Siva Rama Krishna, Rohit Krishna Prasad, et al., “Speech recognition with combined grammar and statistical language models,” Sept. 20 2016, US Patent 9,449,598.
 [7] Mehryar Mohri, Fernando Pereira, and Michael Riley, “Weighted finite-state transducers in speech recognition,” Computer Speech & Language, vol. 16, no. 1, pp. 69–88, 2002.
 [8] Junho Park, Xunying Liu, Mark JF Gales, and Phil C Woodland, “Improved neural network based language modelling and adaptation,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
 [9] Tanel Alumäe, “Multi-domain neural network language model,” 2013.
 [10] Ottokar Tilk and Tanel Alumäe, “Multi-domain recurrent neural network language model for medical speech recognition,” in Baltic HLT, 2014, pp. 149–152.
 [11] Stanley F Chen and Joshua Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech & Language, vol. 13, no. 4, pp. 359–394, 1999.

 [12] Lawrence R Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
 [13] Hui Zhang and David Chiang, “Kneser-Ney smoothing on expected counts,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, vol. 1, pp. 765–774.
 [14] Vitaly Kuznetsov, Hank Liao, Mehryar Mohri, Michael Riley, and Brian Roark, “Learning n-gram language models from uncertain data,” 2016.
 [15] Slava Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” IEEE transactions on acoustics, speech, and signal processing, vol. 35, no. 3, pp. 400–401, 1987.
 [16] Graham Neubig and Chris Dyer, “Generalizing and hybridizing count-based and neural language models,” arXiv preprint arXiv:1606.00499, 2016.
 [17] Xunying Liu, Mark JF Gales, and Philip C Woodland, “Use of contexts in language model interpolation and adaptation,” in Tenth Annual Conference of the International Speech Communication Association, 2009.
 [18] Rukmini Iyer, Mari Ostendorf, and Marie Meteer, “Analyzing and predicting language model improvements,” in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 1997, pp. 254–261.
 [19] Alice E Smith, David W Coit, Thomas Baeck, David Fogel, and Zbigniew Michalewicz, “Penalty functions.”

 [20] Özgür Yeniay, “Penalty function methods for constrained optimization with genetic algorithms,” Mathematical and Computational Applications, vol. 10, no. 1, pp. 45–56, 2005.
 [21] Sree Hari Krishnan Parthasarathi, Bjorn Hoffmeister, Spyros Matsoukas, Arindam Mandal, Nikko Strom, and Sri Garimella, “fMLLR based feature-space speaker adaptation of DNN acoustic models,” in Proceedings of Interspeech, 2015.
 [22] Sri Garimella, Arindam Mandal, Nikko Strom, Bjorn Hoffmeister, Spyros Matsoukas, and Sree Hari Krishnan Parthasarathi, “Robust i-vector based adaptation of DNN acoustic model for speech recognition,” in Proceedings of Interspeech, 2015.