N-gram Language Modeling using Recurrent Neural Network Estimation

03/31/2017
by Ciprian Chelba et al.

We investigate the effective memory depth of RNN models by using them for n-gram language model (LM) smoothing. Experiments on a small corpus (UPenn Treebank, one million words of training data and a 10k vocabulary) found the LSTM cell with dropout to be the best model for encoding the n-gram state when compared with feed-forward and vanilla RNN models. When preserving the sentence-independence assumption, the LSTM n-gram matches the LSTM LM performance for n=9 and slightly outperforms it for n=13. When allowing dependencies across sentence boundaries, the LSTM 13-gram almost matches the perplexity of the unlimited-history LSTM LM. LSTM n-gram smoothing also has the desirable property of improving with increasing n-gram order, unlike the Katz or Kneser-Ney back-off estimators. Using multinomial distributions as targets in training instead of the usual one-hot target is only slightly beneficial for low n-gram orders. Experiments on the One Billion Words benchmark show that the results hold at larger scale: while LSTM smoothing for short n-gram contexts does not provide significant advantages over classic n-gram models, it becomes effective with long contexts (n > 5); depending on the task and amount of data, it can match fully recurrent LSTM models at about n=13. This may have implications when modeling short-format text, e.g. voice search/query LMs. Building LSTM n-gram LMs may be appealing for some practical situations: the state of an n-gram LM can be succinctly represented with (n-1)*4 bytes storing the identities of the words in the context, and batches of n-gram contexts can be processed in parallel. On the downside, the n-gram context encoding computed by the LSTM is discarded, making the model more expensive than a regular recurrent LSTM LM.
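The core idea (feeding a fixed-length context of n-1 words through an LSTM and reading the next-word distribution off its final state) can be sketched as follows. This is an illustrative PyTorch sketch, not the authors' implementation; the class name LSTMNgramLM, layer sizes, and dropout rate are assumptions made for the example.

```python
# Illustrative sketch (not the paper's code): an LSTM "n-gram" LM that reads a
# fixed-length context of n-1 word ids and predicts the next word. The LSTM
# state is recomputed from scratch for every context and then discarded, so a
# context is fully described by its n-1 word ids ((n-1)*4 bytes as int32) and
# batches of contexts can be processed independently in parallel.
import torch
import torch.nn as nn


class LSTMNgramLM(nn.Module):  # hypothetical name and sizes, for illustration only
    def __init__(self, vocab_size, order=9, embed_dim=256, hidden_dim=512, dropout=0.5):
        super().__init__()
        self.order = order                      # n in "n-gram"; context length is n-1
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.drop = nn.Dropout(dropout)         # dropout is the key regularizer per the abstract
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: LongTensor of shape (batch, n-1)
        emb = self.drop(self.embed(context_ids))
        outputs, _ = self.lstm(emb)             # fresh zero state for every context
        final = self.drop(outputs[:, -1, :])    # encoding of the full (n-1)-word context
        return self.out(final)                  # next-word logits over the vocabulary


# Usage: cross-entropy against the next word (the usual one-hot target); the paper
# notes that multinomial targets help only slightly at low n-gram orders.
model = LSTMNgramLM(vocab_size=10000, order=9)
contexts = torch.randint(0, 10000, (32, 8))     # batch of 32 contexts of n-1 = 8 words
targets = torch.randint(0, 10000, (32,))
loss = nn.functional.cross_entropy(model(contexts), targets)
```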
