Voice assistant systems rely heavily on complex language models, which are used as a second-pass reranking step for hypotheses generated by a first-pass speech recognizer [1, 2, 3, 4]. While a significant fraction of queries in voice assistant systems relate to real-world entities (such as names of places, products, works of art, and people), language models do not explicitly model them. The problem of recognizing entities becomes more critical when some entities are not adequately represented in the training data. For example, Figure 1 shows two possible outputs from a first-pass speech recognition system. The correct output is about playing a song by a specific singer. The singer and the song have not been seen together in the training data, but by incorporating the cross-entity relationship from a relevant knowledge-base, we can recover the correct output of the speech recognition system.
Unlike previous work [5, 6] that has considered modeling entities directly in a language model, this paper proposes a dynamic reranking approach that requires no transcribed training data. Our model makes use of raw text and a knowledge-base consisting of entities in the world. In order to train a reranker, we create an artificial n-best list for each training sentence and train the reranking model on it. We use the contrastive estimation method (which is different from the noise contrastive estimation method) to maximize the likelihood of each sentence in the training data in contrast to the artificial sentences in its n-best list. One main strength of our method is that we are no longer bound to local word-based features, which makes it possible to embed global features such as phrase-based interactions between entities (such as the "played-by" relationship in Figure 1).
This paper treats language modeling as a global reranking problem in which the reranker makes use of many features, including bidirectional LSTM-based sentence representations, the n-gram language model probability, cross-entity relationships, and entity frequencies from the knowledge-base. In other words, we maximize the probability of the whole sentence given its artificial negative examples. With this approach, we improve the word error rate of our in-house speech recognition system in absolute terms. This reduction is particularly interesting to us because we only target one aspect of the language modeling problem.
In summary, the main contributions of this paper are as follows:
Designing a reranker, based on global features, to incorporate entities in the language model.
Proposing an effective approach for generating artificial n-best lists for training a reranker. By applying this idea, we do not need to have transcribed training data.
Introducing features from a knowledge-base and showing that they improve the performance of our speech recognition system.
A recurrent language model uses a recurrent function $f$ to derive an intermediate representation $h_i = f(h_{i-1}, x_i; \theta)$ for every word $w_i$ in a sentence $s$, where $x_i$ is the embedding of $w_i$ and $\theta$ is the set of parameters of the recurrent model. The next-word distribution is

$$p(w_i \mid w_1, \ldots, w_{i-1}) = \mathrm{softmax}(W h_i + b) \qquad (1)$$

In Eq. 1, $\mathrm{softmax}$ is the softmax function, $W \in \mathbb{R}^{|V| \times d}$ and $b \in \mathbb{R}^{|V|}$ are the output layer and bias term, and $|V|$ is the vocabulary size.
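As a concrete illustration, here is a minimal NumPy sketch of the output layer in Eq. 1; the dimensions and random values are toy placeholders, not the model's actual configuration:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the vocabulary logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_word_distribution(h, W, b):
    # Eq. 1: project the hidden state onto the vocabulary and normalize.
    return softmax(W @ h + b)

# Toy dimensions: hidden size 4, vocabulary size 6.
rng = np.random.default_rng(0)
h = rng.normal(size=4)
W = rng.normal(size=(6, 4))
b = np.zeros(6)
p = next_word_distribution(h, W, b)
assert abs(p.sum() - 1.0) < 1e-9 and (p > 0).all()
```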
Representing entities in language models has been considered in previous work. For example, entities are modeled by [5, 12] as additional information in an n-gram language model. Ahn et al. use a fact-based model to incorporate information available in a knowledge-base. In contrast to their word-based model, our model can capture global information beyond words. Ji et al. use a dynamic model to incorporate multiple entities while processing the data. Their model achieves a slight improvement in perplexity, but it is not clear whether it can improve the error rate on large datasets. The recent work by Biadsy et al. shows the effectiveness of log-linear models with global features using transcribed data. We instead use the contrastive estimation method to maximize the probability of a correct sentence given its implicit negative examples. This enables us to use global features in our model, as well as word-based features, without using any transcribed data.
This section describes our approach, which is based on the contrastive estimation method. We list the features and describe the neural network architecture used in our model.
3.1 Data Assumption
We assume availability of the following sources for training the entity-aware language model:
Raw text: a large corpus of raw text $D = \{s_1, \ldots, s_N\}$, where $s_i$ is the $i$-th sentence in the dataset and each sentence consists of $n$ words: $s_i = w_1 \cdots w_n$. In this paper, we train our model on music queries. These queries usually ask a voice system to play or download a particular piece of music.
Knowledge-base: a database in which each entry consists of a tuple of fields. In this paper, we use a knowledge-base of music information with three fields: artist name, song title, and the frequency of usage of the song in our in-house application.
3.2 Contrastive Estimation
Contrastive estimation requires creating implicit negative examples for each training sample, which is done by injecting artificial noise into the correct example. For every sentence $s$ in the training data, we approximate its probability with respect to its negative examples $\mathcal{N}(s)$ and the model parameters $\theta$:

$$p(s \mid \theta) \approx \frac{\exp\{\mathrm{score}(s; \theta)\}}{\exp\{\mathrm{score}(s; \theta)\} + \sum_{s' \in \mathcal{N}(s)} \exp\{\mathrm{score}(s'; \theta)\}}$$

where $\mathrm{score}(s; \theta)$ is the scoring function of the sentence given the model parameters $\theta$. The objective function for the training data is:

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \log p(s_i \mid \theta) - \lambda \|\theta\|_2^2$$

where $\lambda$ is a constant coefficient for L2 regularization.
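The per-sentence objective can be sketched as follows; `score_pos` and `scores_neg` stand in for the model's scoring function applied to the correct sentence and its negatives:

```python
import math

def ce_log_prob(score_pos, scores_neg):
    # Contrastive estimation: normalize the positive score against the
    # scores of its implicit negative examples only (not the full space).
    m = max([score_pos] + scores_neg)  # subtract max for numerical stability
    denom = math.exp(score_pos - m) + sum(math.exp(s - m) for s in scores_neg)
    return (score_pos - m) - math.log(denom)

# The training loss for one sentence is the negative log-probability
# (the L2 term is omitted in this sketch).
loss = -ce_log_prob(2.0, [1.0, 0.5, -0.3, 1.2, 0.1])
assert loss > 0
```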
3.2.1 Creating Negative Examples
The definition of the negative example function depends on the task. Since we target outputs from a voice system, our observations show that most real errors come from the model confusing two phonetically similar words or phrases. We use a simple phonetic similarity function to randomly pick words in a training sentence and replace them with one of their phonetically similar alternatives. We then rerank the negative samples by their n-gram language model probability and pick the five highest-scoring ones. This is mainly because real n-best lists usually consist of relatively fluent sentences, while many of the raw negative samples are not fluent. This reranking step lets us avoid totally irrelevant negative samples. Figure 2 shows a real example of the negative examples created by our method.
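A minimal sketch of this negative-sampling procedure, assuming a hypothetical phonetic-confusion table and a stand-in scorer; in the real pipeline, a phonetic similarity function supplies the confusions and the n-gram LM supplies the scores:

```python
import random

# Hypothetical phonetic-confusion table; a real system would derive this
# from a pronunciation lexicon or grapheme-to-phoneme output.
CONFUSIONS = {
    "play": ["pray", "plate"],
    "stone": ["stan", "stones"],
    "some": ["sum"],
}

def make_negatives(sentence, lm_score, n_samples=30, n_keep=5, n_swaps=2, seed=0):
    """Corrupt random words with phonetically similar ones, then keep the
    n_keep corruptions the language model considers most fluent."""
    rng = random.Random(seed)
    words = sentence.split()
    candidates = set()
    for _ in range(n_samples):
        out = list(words)
        swappable = [i for i, w in enumerate(out) if w in CONFUSIONS]
        for i in rng.sample(swappable, min(n_swaps, len(swappable))):
            out[i] = rng.choice(CONFUSIONS[out[i]])
        if out != words:
            candidates.add(" ".join(out))
    return sorted(candidates, key=lm_score, reverse=True)[:n_keep]

# Toy stand-in scorer (shorter string = higher score); not a real LM.
negs = make_negatives("play some stone", lm_score=lambda s: -len(s))
assert 0 < len(negs) <= 5 and "play some stone" not in negs
```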
3.2.2 Comparison to Noise Contrastive Estimation (NCE)
Noise contrastive estimation is a popular method in language modeling. In this method, the probability of a word is maximized given its negative samples, which come from a noise probability distribution separate from the current model. The method is then framed as a binary classification problem in which the label is one for the correct word and zero for the negative examples. NCE has attractive properties, such as being self-normalized. One challenge in NCE is that one must be able to define a well-formed probability distribution for the negative examples. This is straightforward for word-level language modeling; for example, the noise distribution can be the categorical distribution derived from word counts in the training data. In our case, we are interested in changing more than one word in a sentence to create negative examples, which makes it hard to define a well-formed noise distribution. Although contrastive estimation is not as principled as NCE, it does not impose such a strict limitation on how negative examples are defined.
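For comparison, the per-word binary NCE loss can be sketched as follows; the scores and noise probabilities below are toy values, not taken from the paper:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nce_loss(score_pos, log_noise_pos, neg_pairs, k):
    """Binary NCE with k noise samples: classify the observed word as real
    (label 1) and the noise samples as fake (label 0). Each posterior
    compares the model's log-score to log(k * q(w)) under the noise q."""
    loss = -math.log(sigmoid(score_pos - math.log(k) - log_noise_pos))
    for score_neg, log_noise_neg in neg_pairs:
        loss -= math.log(1.0 - sigmoid(score_neg - math.log(k) - log_noise_neg))
    return loss

loss = nce_loss(1.5, math.log(0.01),
                [(-0.2, math.log(0.05)), (0.3, math.log(0.02))], k=2)
assert loss > 0
```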
For a sentence $s$ with $n$ words, the following features are used:
Recurrent representation: We use a bidirectional LSTM (BiLSTM) to compute a sentence-level representation for each sentence. In other words, we have two independent LSTMs: one for the forward pass that sweeps a sentence from left to right, and one for the backward pass that does a reverse sweep. The forward and backward LSTMs give the following outputs:

$$\overrightarrow{h}_i = \mathrm{LSTM}_f(\overrightarrow{h}_{i-1}, e(w_i)) \qquad \overleftarrow{h}_i = \mathrm{LSTM}_b(\overleftarrow{h}_{i+1}, e(w_i))$$

where $e(w_i)$ is the word embedding vector for word $w_i$. We use the concatenation of the final representations, $r(s) = [\overrightarrow{h}_n ; \overleftarrow{h}_1]$, as the recurrent representation of the sentence. The LSTM parameters are updated during training with backpropagation.
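A short sketch of the bidirectional sweep and concatenation; a plain tanh RNN stands in for the LSTM cells to keep the example compact, and all dimensions are toy values:

```python
import numpy as np

def rnn_sweep(embs, W, U, b):
    # A plain tanh RNN stands in for the LSTM here; only the final
    # hidden state is needed for the sentence representation.
    h = np.zeros(U.shape[0])
    for e in embs:
        h = np.tanh(W @ e + U @ h + b)
    return h

def sentence_representation(embs, params_fwd, params_bwd):
    # Forward sweep left-to-right, backward sweep right-to-left,
    # then concatenate the two final states.
    h_fwd = rnn_sweep(embs, *params_fwd)
    h_bwd = rnn_sweep(embs[::-1], *params_bwd)
    return np.concatenate([h_fwd, h_bwd])

rng = np.random.default_rng(1)
d_emb, d_hid = 3, 5
embs = [rng.normal(size=d_emb) for _ in range(4)]
mk = lambda: (rng.normal(size=(d_hid, d_emb)),
              rng.normal(size=(d_hid, d_hid)),
              np.zeros(d_hid))
rep = sentence_representation(embs, mk(), mk())
assert rep.shape == (2 * d_hid,)
```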
N-gram LM probability: the log-probability of the sentence under an n-gram language model. This score is fixed during training. The n-gram language model feature is obtained with 10-way jack-knifing in order to avoid overfitting to the training data.
Phrase pair co-occurrence: For every pair of non-overlapping phrases in a sentence $s$, we count the number of entries in the knowledge-base that have the first phrase as an artist and the second as a song name (or vice versa). We then quantize this value into one of a fixed number of bins (based on the maximum possible co-occurrence count in the knowledge-base) and look the bin up in an embedding dictionary. The cross-entity phrase co-occurrence feature is thus a fixed-dimensional embedding vector.
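A sketch of the co-occurrence count and bin quantization, using a hypothetical three-entry knowledge-base; the entries and bin count are illustrative only:

```python
# Hypothetical miniature knowledge-base: (artist, song, request_count) entries.
KB = [
    ("cold play", "yellow", 900),
    ("cold play", "fix you", 400),
    ("adele", "hello", 1200),
]

def cooccurrence_count(phrase_a, phrase_b):
    # Count KB entries pairing the two phrases as (artist, song)
    # in either order, per the "or vice versa" in the text.
    return sum(1 for artist, song, _ in KB
               if {phrase_a, phrase_b} == {artist, song})

def quantize(value, max_value, n_bins):
    # Map a raw count into one of n_bins bins; the bin index is what
    # gets looked up in the embedding dictionary.
    if max_value == 0:
        return 0
    return min(int(value / max_value * n_bins), n_bins - 1)

count = cooccurrence_count("cold play", "yellow")
assert count == 1
assert quantize(count, max_value=4, n_bins=8) == 2
```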
Subsentence knowledge-base frequency: We compute the sum of the frequency field over each phrase in a sentence, separately for the artist and song fields. We quantize these values into a fixed integer range and represent them as embedding vectors. The final features are two embedding vectors, one for the artist field and one for the song field.
Cross-entity and intra-entity word-based mutual information: We observed that many phrasal entities have different surface forms (such as abbreviated first names). Therefore, it can be useful to incorporate word-level features for the words inside entities. This is done by computing the mutual information between words across the artist and song fields. We use the average normalized pointwise mutual information between words:

$$\mathrm{npmi}(x, y) = \frac{\mathrm{pmi}(x, y)}{-\log p(x, y)}, \qquad \mathrm{pmi}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$

where the probabilities are calculated from the frequency information in the knowledge-base. We scale the value to lie in $[0, 1]$ and then quantize it into 100 bins. Finally, we look the bin up in an embedding dictionary to obtain the feature representing cross-entity word-based mutual information. Similarly, we use a separate embedding dictionary for the feature representing intra-entity word-based mutual information.
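The normalized PMI and binning step can be sketched directly; in practice the probabilities passed in would come from knowledge-base frequencies:

```python
import math

def npmi(p_xy, p_x, p_y):
    # Normalized pointwise mutual information in [-1, 1]:
    # pmi(x, y) / -log p(x, y).
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)

def to_bin(value, n_bins=100):
    # Scale from [-1, 1] to [0, 1], then quantize into n_bins bins.
    scaled = (value + 1.0) / 2.0
    return min(int(scaled * n_bins), n_bins - 1)

# Perfectly associated words: p(x, y) = p(x) = p(y) -> npmi = 1.
assert abs(npmi(0.1, 0.1, 0.1) - 1.0) < 1e-12
# Independent words: p(x, y) = p(x) p(y) -> npmi = 0.
assert abs(npmi(0.01, 0.1, 0.1)) < 1e-12
assert to_bin(1.0) == 99 and to_bin(-1.0) == 0
```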
3.4 Network Architecture
All the features, except the n-gram probability, are concatenated and passed through a hidden layer with a rectified linear unit (ReLU) [15] activation. The output of the hidden layer is multiplied by a vector $v$ to produce a scalar network score. Finally, we use a linear interpolation between the network score and the n-gram feature. All parameters in this model, except the n-gram probabilities, are tuned during backpropagation.
Figure 3 shows a graphical depiction of the network structure.
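Putting the pieces together, a sketch of the final scoring function; the interpolation weight `alpha`, feature dimensions, and random parameters are placeholders (in the model they are learned or tuned):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sentence_score(features, ngram_logprob, W_h, b_h, v, alpha):
    """Concatenated features -> ReLU hidden layer -> scalar score,
    linearly interpolated with the fixed n-gram log-probability."""
    hidden = relu(W_h @ features + b_h)
    nn_score = v @ hidden
    return alpha * nn_score + (1.0 - alpha) * ngram_logprob

rng = np.random.default_rng(2)
d_feat, d_hid = 12, 8
feats = rng.normal(size=d_feat)
score = sentence_score(feats, ngram_logprob=-42.0,
                       W_h=rng.normal(size=(d_hid, d_feat)),
                       b_h=np.zeros(d_hid),
                       v=rng.normal(size=d_hid),
                       alpha=0.5)
assert isinstance(score, float)
```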
4.1 Data and Setting
Our pipeline uses a domain classifier to classify queries in a query stream, and we select queries tagged as belonging to the music domain. We mapped infrequent words in the training data to the unknown symbol. Our heldout data is a small set of transcribed queries ( sentences) from a mobile phone application, and our blind test set is from a distant microphone with relatively poor-quality speech signals ( sentences). Both the heldout and test data have five hypotheses per sentence. The knowledge-base consists of 13 million entries; for each entry, we select the song title, artist name, and the frequency of requests. The training data has million sentences consisting of 24 million words. The vocabulary size after converting infrequent words to the unknown symbol is .
4.2 Negative examples
We first train a KenLM n-gram language model with 10-way jack-knifing on the training data. We sample 30 random negative examples for each sentence in the training data and use the probabilities from the language model to keep the 5 best examples. Our observation is that the quality of some of the negative examples is still not promising. Therefore, we keep only the top one million training examples whose negative examples have the highest average n-gram probability. We train the same n-gram model on the full training data to calculate probabilities on the heldout and test data.
4.3 LSTM language model and reranker
Since our reranker uses only a portion of the training data, we might lose some performance due to the smaller training set. Therefore, the final score of each sentence in decoding is summed with the score from an LSTM model trained on the whole training data. To train the standard LSTM language model, we use noise contrastive estimation with a fixed number of samples per minibatch of sentences. The output bias term is initialized to $\log(1/|V|)$, where $|V|$ is the vocabulary size. We use L2 regularization, and stochastic gradient descent with momentum and learning rate decay to train the model parameters. We apply dropout in training. Word embedding vectors are initialized randomly.
We use a similar LSTM model in both directions inside the reranker, but with a smaller dimension for the LSTM representation. Early stopping is applied based on word error rate improvement on the heldout data. Separate embedding dimensions are used for the knowledge-base features. We use the Dynet library for implementing all of our models.
Table 1 shows the experimental results on the heldout and test data. As the table shows, the reranker, trained on only one-fifth of the real training data, performs better than both the LSTM and n-gram language models. When the reranker is combined with the LSTM language model, the performance improves further; we believe this is due to the larger training set of the LSTM model. The final ensemble achieves an absolute improvement over the LSTM language model on the test data. Another observation is that the LSTM LM is more effective than the n-gram LM for reranking on the heldout set but not on the test set (rows 2 and 3 in Table 1). A possible explanation is the difference in quality of the n-best hypotheses between the heldout and test sets: recall that the speech for the heldout set was collected from mobile phones, whereas the test set was collected from a distant device.
|System||Heldout WER||Test WER|
|First-pass (no reranker)||11.52||12.52|
|Reranker + LSTM||9.82||11.12|
In this paper, we have presented an effective method to encode real-world entities into a language model. We designed a simple but effective approach to create artificial n-best lists, thus obviating the need for annotated data. Our experiments show an improvement on both the heldout and test datasets. One interesting direction to pursue is to incorporate a small amount of transcribed data and apply our approach to a combination of transcribed and artificial training data. This raises certain challenges when dealing with a mixture of real and artificial n-best lists, such as deciding on the proportion of real n-best lists relative to artificial ones; future work should study this problem.
We thank the anonymous reviewers for their valuable feedback. This research was conducted while the first author was an intern in the language modeling group at Microsoft in Sunnyvale, California. We would like to thank the researchers in the group for helpful discussions and assistance on different aspects of the problem.
-  Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, no. Feb, pp. 1137–1155, 2003.
-  M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural networks for language modeling,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
-  G. Melis, C. Dyer, and P. Blunsom, “On the state of the art of evaluation in neural language models,” arXiv preprint arXiv:1707.05589, 2017.
-  F. Biadsy, M. Ghodsi, and D. Caseiro, "Effectively building tera scale maxent language models incorporating non-linguistic signals," Interspeech 2017, 2017.
-  M. Levit, A. Stolcke, S. Chang, and S. Parthasarathy, “Token-level interpolation for class-based language models,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5426–5430.
-  S. Ahn, H. Choi, T. Pärnamaa, and Y. Bengio, “A neural knowledge language model,” arXiv preprint arXiv:1608.00318, 2016.
-  N. A. Smith and J. Eisner, “Contrastive estimation: Training log-linear models on unlabeled data,” in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005, pp. 354–362.
-  M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, Y. W. Teh and M. Titterington, Eds., vol. 9. Chia Laguna Resort, Sardinia, Italy: PMLR, 13–15 May 2010, pp. 297–304. [Online]. Available: http://proceedings.mlr.press/v9/gutmann10a.html
-  T. Mikolov, S. Kombrink, L. Burget, J. Černockỳ, and S. Khudanpur, “Extensions of recurrent neural network language model,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 5528–5531.
-  T. Mikolov and G. Zweig, “Context dependent recurrent neural network language model.” SLT, vol. 12, pp. 234–239, 2012.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  M. Levit, A. Stolcke, R. Subba, S. Parthasarathy, S. Chang, S. Xie, T. Anastasakos, and B. Dumoulin, “Personalization of word-phrase-entity language models,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
-  Y. Ji, C. Tan, S. Martschat, Y. Choi, and N. A. Smith, "Dynamic entity representations in neural language models," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2017, pp. 1831–1840. [Online]. Available: http://www.aclweb.org/anthology/D17-1195
-  G. Bouma, “Normalized (pointwise) mutual information in collocation extraction,” Proceedings of GSCL, pp. 31–40, 2009.
-  V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
-  K. Heafield, “KenLM: faster and smaller language model queries,” in Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, United Kingdom, July 2011, pp. 187–197. [Online]. Available: https://kheafield.com/papers/avenue/kenlm.pdf
-  G. E. Hinton, A Practical Guide to Training Restricted Boltzmann Machines. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 599–619. [Online]. Available: https://doi.org/10.1007/978-3-642-35289-8_32
-  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” Journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  G. Neubig, C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A. Anastasopoulos, M. Ballesteros, D. Chiang, D. Clothiaux, T. Cohn et al., “Dynet: The dynamic neural network toolkit,” arXiv preprint arXiv:1701.03980, 2017.