Variational Smoothing in Recurrent Neural Network Language Models

01/27/2019 ∙ by Lingpeng Kong, et al. ∙ Google

We present a new theoretical perspective of data noising in recurrent neural network language models (Xie et al., 2017). We show that each variant of data noising is an instance of Bayesian recurrent neural networks with a particular variational distribution (i.e., a mixture of Gaussians whose weights depend on statistics derived from the corpus such as the unigram distribution). We use this insight to propose a more principled method to apply at prediction time and propose natural extensions to data noising under the variational framework. In particular, we propose variational smoothing with tied input and output embedding matrices and an element-wise variational smoothing method. We empirically verify our analysis on two benchmark language modeling datasets and demonstrate performance improvements over existing data noising methods.




1 Introduction

Recurrent neural networks (RNNs) are state-of-the-art models in various language processing tasks. However, their performance heavily depends on proper regularization at training time (Melis et al., 2018b; Merity et al., 2018). The two predominant approaches to regularize RNNs are dropout (randomly zeroing out neurons; Srivastava et al., 2014) and $L_2$ regularization (applying a penalty to model parameters; Hoerl & Kennard, 1970). Recently, Xie et al. (2017) proposed data noising to regularize language models. Their method is formulated as a data augmentation method that randomly replaces words with other words drawn from a proposal distribution. For example, we can use the unigram distribution, which models the number of word occurrences in the training corpus, or a more sophisticated proposal distribution that takes into account the number of bigram types in the corpus. Data noising has been shown to improve perplexity on small and large corpora; this improvement is complementary to other regularization techniques and translates to improvements on downstream models such as machine translation.

Xie et al. (2017) derived connections between data noising and smoothing in classical $n$-gram language models, which we review in §2. In smoothing (Chen & Goodman, 1996), since empirical counts for unseen sequences are zero, we smooth our estimates by a weighted average of higher order and lower order $n$-gram models. There are various ways to choose the weights and the lower order models, leading to different smoothing techniques, with Kneser-Ney smoothing widely considered to be the most effective. Xie et al. (2017) showed that the pseudocounts of noised data correspond to a mixture of different $n$-gram models.

In this paper, we provide a new theoretical foundation for data noising and show that it can be understood as a form of Bayesian recurrent neural network with a particular variational distribution (§4.1 and §4.2). Our derivation relates data noising to dropout and variational dropout (Gal & Ghahramani, 2016b), and naturally leads to a data dependent regularization coefficient. We use this insight to arrive at a more principled way to do prediction with data noising (i.e., by taking the mean of the variational distribution, as opposed to the mode) and propose several extensions under the variational framework in §4.3. Specifically, we show how to use variational smoothing with tied input and output embeddings and propose element-wise smoothing. In §5, we validate our analysis in language modeling experiments on the Penn Treebank (Marcus et al., 1994) and Wikitext-2 (Merity et al., 2017) datasets.

2 Recurrent Neural Network Language Models

We consider a language modeling problem where the goal is to predict the next word $x_t$ given previously seen context words $x_{<t} = \{x_0, x_1, \ldots, x_{t-1}\}$. Let $\mathbf{W}$ be parameters of a recurrent neural network, and $p(x_t \mid x_{<t}; \mathbf{W}) = \mathrm{RNN}(x_{<t}; \mathbf{W})$. Following previous work in language modeling (Melis et al., 2018b; Merity et al., 2018), we use LSTM (Hochreiter & Schmidhuber, 1997) as our RNN function, although other variants such as GRU (Cho et al., 2014) can be used as well.

Given a training corpus $\mathcal{D} = \{x_0, \ldots, x_T\}$, the likelihood we would like to maximize is:

$$\mathcal{L}(\mathbf{W}; \mathcal{D}) = \prod_{t=1}^{T} p(x_t \mid x_{<t}; \mathbf{W}).$$

Directly optimizing the (log) likelihood above often leads to overfitting. We typically augment the objective function with a regularizer (e.g., an $L_2$ regularizer where $\mathbf{W} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$) or use dropout by randomly zeroing out neurons (Srivastava et al., 2014).

Data noising as smoothing.

Xie et al. (2017) proposed a method to regularize recurrent neural network language models by noising the data. For each input word in $x_{<t}$ (sometimes also the corresponding output word), we replace it with another word sampled from a proposal distribution $\mathcal{Q}$ with probability $\gamma$. They introduced various methods for choosing $\gamma$ and $\mathcal{Q}$.

For example, if $\gamma = \gamma_0$ for all words and $\mathcal{Q}$ is the unigram distribution of words in the corpus, the expectation of the noised likelihood corresponds to a mixture of $n$-gram models with fixed weights (i.e., linear interpolation smoothing).

Another option is to set $\gamma(x_{t-1}) = \gamma_0 \frac{\mathrm{distinct}(x_{t-1}, \bullet)}{\mathrm{count}(x_{t-1})}$, where $\mathrm{distinct}(x_{t-1}, \bullet)$ denotes the number of distinct continuations preceded by word $x_{t-1}$ (i.e., the number of bigram types that have $x_{t-1}$ as the first word), and $\mathrm{count}(x_{t-1})$ denotes the number of times $x_{t-1}$ appears in the corpus. For this choice of $\gamma$, when $\mathcal{Q}$ is the unigram distribution, the expectation corresponds to absolute discounting; whereas if $\mathcal{Q}(x) \propto \mathrm{distinct}(\bullet, x)$ and we replace both the input and output words, it corresponds to bigram Kneser-Ney smoothing. We summarize their proposed methods in Table 1.

At prediction (test) time, Xie et al. (2017) do not apply any noising and directly predict $p(x_t \mid x_{<t}; \mathbf{W})$. They showed that a combination of smoothing and dropout achieves the best result on their language modeling and machine translation experiments.

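Name | Noised | $\gamma$ | $\mathcal{Q}$
Blank noising | $x_{t-1}$ | $\gamma_0$ | “_” with probability 1
Linear interpolation | $x_{t-1}$ | $\gamma_0$ | unigram
Absolute discounting | $x_{t-1}$ | $\gamma_0\, \mathrm{distinct}(x_{t-1}, \bullet) / \mathrm{count}(x_{t-1})$ | unigram
Table 1: Variants of data noising techniques proposed in Xie et al. (2017) for context word $x_{t-1}$ and target word $x_t$. Blank noising replaces an input word with a blank word, denoted by “_”.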
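As a rough illustration of the input-noising variants summarized above, the following is a minimal sketch and not the implementation of Xie et al. (2017). The function names (`noising_stats`, `noise_prob`, `noise_inputs`) are hypothetical, the corpus is assumed to be a flat list of tokens, and only input words are noised with a unigram proposal.

```python
import random
from collections import Counter


def noising_stats(corpus):
    """Collect unigram counts and the number of distinct continuations per word."""
    counts = Counter(corpus)
    continuations = {}
    for prev, nxt in zip(corpus, corpus[1:]):
        continuations.setdefault(prev, set()).add(nxt)
    distinct = {w: len(continuations.get(w, ())) for w in counts}
    return counts, distinct


def noise_prob(word, counts, distinct, gamma0, scheme):
    """Noising probability for one input word under the chosen scheme."""
    if scheme == "linear_interpolation":
        return gamma0
    if scheme == "absolute_discounting":
        # gamma0 * distinct(word, .) / count(word)
        return gamma0 * distinct[word] / counts[word]
    raise ValueError(f"unknown scheme: {scheme}")


def noise_inputs(seq, counts, distinct, gamma0, scheme, rng):
    """Independently replace each input token by a sample from the unigram distribution."""
    vocab = list(counts)
    weights = [counts[w] for w in vocab]
    noised = []
    for w in seq:
        if rng.random() < noise_prob(w, counts, distinct, gamma0, scheme):
            noised.append(rng.choices(vocab, weights=weights, k=1)[0])
        else:
            noised.append(w)
    return noised
```

Because a word cannot have more distinct continuations than occurrences, the absolute discounting probability in this sketch never exceeds the base value passed as `gamma0`.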

3 Bayesian Recurrent Neural Networks

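In Bayesian RNNs, we define a prior $p(\mathbf{W})$ over our parameters and consider the following maximization problem:

$$\mathcal{L}(\mathbf{W}; \mathcal{D}) = p(\mathbf{W}) \prod_{t=1}^{T} p(x_t \mid x_{<t}; \mathbf{W}).$$

A common prior is the standard normal distribution $p(\mathbf{W}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$.

For recurrent neural networks, the posterior over $\mathbf{W}$ given the data $\mathcal{D}$ is intractable. We approximate the posterior with a variational distribution $q(\mathbf{W})$, and minimize the KL divergence between the variational distribution and the true posterior by:

$$\mathrm{KL}\big(q(\mathbf{W}) \,\|\, p(\mathbf{W} \mid \mathcal{D})\big) = \mathrm{KL}\big(q(\mathbf{W}) \,\|\, p(\mathbf{W})\big) - \sum_{t=1}^{T} \int q(\mathbf{W}) \log p(x_t \mid x_{<t}; \mathbf{W}) \, d\mathbf{W} + \log p(\mathcal{D}).$$

The integral is often approximated with Monte Carlo integration with one sample $\tilde{\mathbf{W}} \sim q(\mathbf{W})$ (in the following, we use $\tilde{x}$ to denote a sample from a distribution $q(x)$):

$$\int q(\mathbf{W}) \log p(x_t \mid x_{<t}; \mathbf{W}) \, d\mathbf{W} \approx \log p(x_t \mid x_{<t}; \tilde{\mathbf{W}}). \tag{1}$$

At test time, for a new sequence $y_{0:T}$, we can either set $\mathbf{W}$ to be the mean of $q(\mathbf{W})$ or sample and average the results: $p(y_t \mid y_{<t}) = \frac{1}{S} \sum_{s=1}^{S} p(y_t \mid y_{<t}; \tilde{\mathbf{W}}_s)$, where $S$ is the number of samples and $\tilde{\mathbf{W}}_s \sim q(\mathbf{W})$.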

4 Variational Smoothing

We now provide theoretical justifications for data smoothing under the variational framework. In a recurrent neural network language model, there are three types of parameters: an input (word) embedding matrix $\mathbf{E}$, an LSTM parameter matrix $\mathbf{R}$, and an output embedding matrix $\mathbf{O}$ that produces logits for the final softmax function. We have $\mathbf{W} = \{\mathbf{E}, \mathbf{R}, \mathbf{O}\}$.


4.1 Linear Interpolation Smoothing

We first focus on the simplest data noising method—linear interpolation smoothing—and show how to extend it to other data noising methods subsequently.

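Word embedding matrix $\mathbf{E}$.

Denote the word vector corresponding to word $i \in \{1, \ldots, V\}$ in the input embedding matrix $\mathbf{E}$ by $\mathbf{e}_i$. We obtain a similar effect to linear interpolation smoothing by using the following mixture of Gaussians variational distribution for $\mathbf{e}_i$:

$$\mathbf{e}_i \sim q(\mathbf{e}_i) = (1 - \gamma_0 + \gamma_0 U_i)\, \mathcal{N}(\mathbf{e}_i, \sigma \mathbf{I}) + \gamma_0 \sum_{j \neq i} U_j\, \mathcal{N}(\mathbf{e}_j, \sigma \mathbf{I}),$$

where $U_i$ is the unigram probability of word $i$ and $\sigma$ is small. In other words, with probability $\gamma_0 U_j$, we replace the embedding $\mathbf{e}_i$ with another embedding sampled from a normal distribution centered at $\mathbf{e}_j$.

Note that noising the input word is equivalent to choosing a different word embedding vector to be used in a standard recurrent neural network. Under the variational framework, we sample a different word embedding matrix $\tilde{\mathbf{E}} \sim q(\mathbf{E})$ for every sequence at training time, since the integral is approximated with Monte Carlo by sampling from $q(\mathbf{W})$ (Eq. 1).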
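To make the per-sequence sampling step concrete, here is a minimal NumPy sketch under stated assumptions: each word keeps its own embedding with probability $1 - \gamma_0 + \gamma_0 U_i$ and otherwise borrows the embedding of word $j$ with probability $\gamma_0 U_j$, and $\sigma$ is treated as a variance. The function names are hypothetical, and this is an illustration rather than the authors' code.

```python
import numpy as np


def mixture_proportions(unigram, gamma0):
    """pi[i, j]: probability that word i uses the embedding of word j."""
    V = unigram.shape[0]
    pi = np.tile(gamma0 * unigram, (V, 1))           # gamma0 * U_j in every row
    pi[np.arange(V), np.arange(V)] += 1.0 - gamma0   # extra mass for keeping word i
    return pi


def sample_embedding_matrix(E, unigram, gamma0, sigma, rng):
    """Draw one noised embedding matrix, e.g. once per training sequence."""
    V, d = E.shape
    pi = mixture_proportions(unigram, gamma0)
    # Pick a mixture component (a source word j) for every word i.
    source = np.array([rng.choice(V, p=pi[i]) for i in range(V)])
    # Draw from the Gaussian centered at the chosen embedding (sigma is a variance here).
    noise = np.sqrt(sigma) * rng.standard_normal((V, d))
    return E[source] + noise, source
```

With `gamma0 = 0` and `sigma = 0` the sketch returns the base matrix unchanged, which corresponds to training without noising.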

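The above formulation is related to word embedding dropout (Dai & Le, 2015; Iyyer et al., 2015; Kumar et al., 2016), although in word embedding dropout $q(\mathbf{e}_i) = (1 - \gamma_0)\, \mathcal{N}(\mathbf{e}_i, \sigma \mathbf{I}) + \gamma_0\, \mathcal{N}(\mathbf{0}, \sigma \mathbf{I})$.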


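For $\mathrm{KL}(q(\mathbf{E}) \,\|\, p(\mathbf{E}))$, we use Proposition 1 in Gal & Ghahramani (2016a) and approximate the divergence between a mixture of Gaussians and $p(\mathbf{E}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ as:

$$\mathrm{KL}\big(q(\mathbf{E}) \,\|\, p(\mathbf{E})\big) \approx \sum_{i=1}^{V} \sum_{j=1}^{V} \frac{\pi^i_j}{2}\, \mathbf{e}_j^\top \mathbf{e}_j + \text{const},$$

where $V$ is the vocabulary size, $\pi^i_j$ is the mixture proportion for word $i$: $(1 - \gamma_0 + \gamma_0 U_i)$ for $j = i$ and $\gamma_0 U_j$ otherwise, and the constant collects terms that do not depend on the embedding vectors. In practice, the KL term directly translates to an $L_2$ regularizer on each word embedding vector, but the regularization coefficient is data dependent. More specifically, the regularization coefficient for word vector $\mathbf{e}_i$, taking into account contributions from $q(\mathbf{e}_j)$ for $j \neq i$, is:

$$\lambda_i = \frac{(1 - \gamma_0 + \gamma_0 U_i) + \sum_{j \neq i} \gamma_0 U_i}{2} = \frac{1 - \gamma_0 + V \gamma_0 U_i}{2}. \tag{3}$$

In other words, the variational formulation of data smoothing results in a regularization coefficient that is a function of corpus statistics (the unigram distribution).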
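The data dependent coefficient can be checked numerically. The sketch below assumes the mixture proportions described above (weight $1 - \gamma_0 + \gamma_0 U_i$ on a word's own embedding and $\gamma_0 U_j$ on every other word $j$) and accumulates, for each embedding vector, the weight it receives from all words' variational distributions. Under that assumption the total reduces to $(1 - \gamma_0 + V \gamma_0 U_i)/2$; the function names are hypothetical.

```python
import numpy as np


def reg_coefficients(unigram, gamma0):
    """Accumulate, for each embedding, the mixture weight it receives from every word."""
    V = unigram.shape[0]
    pi = np.tile(gamma0 * unigram, (V, 1))
    pi[np.arange(V), np.arange(V)] += 1.0 - gamma0
    # Column j holds the weights that all words i place on embedding j.
    return pi.sum(axis=0) / 2.0


def reg_coefficients_closed_form(unigram, gamma0):
    """Closed form of the same quantity: (1 - gamma0 + V * gamma0 * U_i) / 2."""
    V = unigram.shape[0]
    return (1.0 - gamma0 + V * gamma0 * unigram) / 2.0
```

Frequent words receive a larger coefficient in this sketch, since their embeddings are more likely to be sampled as replacements.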

Other parameters.

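For other parameters $\mathbf{R}$ and $\mathbf{O}$, we can use either simple variational distributions such as $q(\mathbf{R}) = \mathcal{N}(\mathbf{R}, \sigma \mathbf{I})$ and $q(\mathbf{O}) = \mathcal{N}(\mathbf{O}, \sigma \mathbf{I})$, which become standard $L_2$ regularizers on these parameters; or incorporate dropout by setting $q(\mathbf{r}_i) = (1 - \alpha)\, \mathcal{N}(\mathbf{r}_i, \sigma \mathbf{I}) + \alpha\, \mathcal{N}(\mathbf{0}, \sigma \mathbf{I})$, where $\alpha$ is the dropout probability and $\mathbf{r}_i$ is the $i$-th row of $\mathbf{R}$ (Gal & Ghahramani, 2016b).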


In the original noising formulation (Xie et al., 2017), given a sequence {a, b, a, c}, it is possible to get a noised sequence {d, b, a, c}, since the decision to noise at every timestep is independent of others. In the variational framework, we use Monte Carlo integration $\tilde{\mathbf{W}} \sim q(\mathbf{W})$ for each sequence to compute $p(x_t \mid x_{<t}; \tilde{\mathbf{W}})$. We can either sample one embedding matrix $\tilde{\mathbf{E}}$ per sequence, similar to Gal & Ghahramani (2016b), or sample one embedding matrix per timestep (Melis et al., 2018a). While the first method is computationally more efficient (we use it in our experiments), if we decide to noise the first a to d, we will have a noised sequence where every a is replaced by d: {d, b, d, c}. At training time, we go through sequence $x_{0:T}$ multiple times (once per epoch) to get different noised sequences.
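The difference between the two sampling granularities can be sketched at the token level. The example below is a simplification under stated assumptions: it uses a uniform proposal over a small hypothetical vocabulary instead of the unigram distribution, and it works on symbols rather than embedding vectors. The function names are hypothetical.

```python
import random


def noise_per_timestep(seq, vocab, gamma0, rng):
    """Original data noising: an independent decision at every position."""
    return [rng.choice(vocab) if rng.random() < gamma0 else w for w in seq]


def noise_per_sequence(seq, vocab, gamma0, rng):
    """One sample per sequence: the decision is made once per word type
    and reused at every position where that word occurs."""
    mapping = {}
    for w in sorted(set(seq)):
        mapping[w] = rng.choice(vocab) if rng.random() < gamma0 else w
    return [mapping[w] for w in seq]
```

For the sequence {a, b, a, c}, `noise_per_sequence` always maps both occurrences of a to the same symbol, whereas `noise_per_timestep` may change only one of them.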


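For predictions, Xie et al. (2017) do not noise any input. In the variational framework, this corresponds to taking the mode of the variational distribution, since $1 - \gamma_0 + \gamma_0 U_i$ is almost always greater than $\gamma_0 U_j$. They also reported that they did not observe any improvements by sampling. Using the mode is rather uncommon in Bayesian RNNs. A standard approach is to take the mean, so we can set the test-time embedding of word $i$ to:

$$\bar{\mathbf{e}}_i = (1 - \gamma_0 + \gamma_0 U_i)\, \mathbf{e}_i + \gamma_0 \sum_{j \neq i} U_j\, \mathbf{e}_j.$$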
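The mean of a mixture of Gaussians is the proportion-weighted average of the component means. Assuming the mixture proportions of linear interpolation smoothing (weight $1 - \gamma_0 + \gamma_0 U_i$ on $\mathbf{e}_i$ and $\gamma_0 U_j$ on every other $\mathbf{e}_j$), the mean simplifies to $(1 - \gamma_0)\, \mathbf{e}_i + \gamma_0 \sum_j U_j \mathbf{e}_j$, so it can be computed for the whole matrix without forming a $V \times V$ matrix. The following NumPy sketch, with hypothetical function names, compares the two forms.

```python
import numpy as np


def mean_embeddings(E, unigram, gamma0):
    """Mixture mean for every word: (1 - gamma0) * e_i + gamma0 * sum_j U_j e_j."""
    corpus_mean = unigram @ E   # unigram-weighted average embedding, shape (d,)
    return (1.0 - gamma0) * E + gamma0 * corpus_mean


def mean_embeddings_explicit(E, unigram, gamma0):
    """Same quantity computed from the full V x V matrix of mixture proportions."""
    V = E.shape[0]
    pi = np.tile(gamma0 * unigram, (V, 1))
    pi[np.arange(V), np.arange(V)] += 1.0 - gamma0
    return pi @ E
```

The first form costs one matrix-vector product, which matters when the vocabulary is large.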

Extension to absolute discounting.

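It is straightforward to extend linear interpolation smoothing to absolute discounting by replacing $\gamma_0$ with $\gamma_i = \gamma_0 \frac{\mathrm{distinct}(i, \bullet)}{\mathrm{count}(i)}$ in the variational distribution for $\mathbf{e}_i$ (§4.1) and in Eq. 3 above.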

4.2 Kneser-Ney Smoothing

We now consider the variational analog of Kneser-Ney data noising. Instead of smoothing towards lower order $n$-grams, Kneser-Ney uses models that take into account contextual diversity. For example, for bigram Kneser-Ney smoothing, we replace the unigram distribution with $K_i \propto \mathrm{distinct}(\bullet, i)$, where $\mathrm{distinct}(\bullet, i)$ denotes the number of distinct bigrams that end with word $i$. As a result, even if bigrams such as “San Francisco” and “Los Angeles” appear frequently in the corpus, “Francisco” and “Angeles” will not have high probabilities in $K$ since they often follow “San” and “Los”.

Recall that similar to absolute discounting, Kneser-Ney also uses $\gamma_i = \gamma_0 \frac{\mathrm{distinct}(i, \bullet)}{\mathrm{count}(i)}$. However, if we replace an input word $x_{t-1}$, Xie et al. (2017) proposed to also replace the corresponding output word $x_t$. The intuition behind this is that since the probability of replacing an input word $x_{t-1}$ is proportional to the number of distinct bigrams that start with $x_{t-1}$, when we replace the input word, we also need to replace the output word (e.g., if we replace “San” by a word sampled from $K$, we should also replace “Francisco”).
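The effect of counting bigram types instead of tokens can be seen on a toy corpus. The sketch below is illustrative only: the function names are hypothetical, and the corpus is a made-up token list in which "francisco" is frequent but always follows "san".

```python
from collections import Counter


def continuation_distribution(corpus):
    """Probability of each word proportional to the number of distinct bigram types ending in it."""
    bigram_types = set(zip(corpus, corpus[1:]))
    ends = Counter(second for _, second in bigram_types)
    total = sum(ends.values())
    return {w: ends[w] / total for w in set(corpus)}


def unigram_distribution(corpus):
    """Relative frequency of each word."""
    counts = Counter(corpus)
    return {w: c / len(corpus) for w, c in counts.items()}


corpus = "san francisco fog san francisco hills the fog the hills".split()
K = continuation_distribution(corpus)
U = unigram_distribution(corpus)
```

In this toy corpus "francisco" and "fog" have the same unigram probability, but "francisco" gets a lower continuation probability because it only ever follows one word.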

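Word embedding matrix $\mathbf{E}$.

To get (bigram) Kneser-Ney smoothing, we use the following mixture of Gaussians variational distribution for $\mathbf{e}_i$:

$$\mathbf{e}_i \sim q(\mathbf{e}_i) = (1 - \gamma_i + \gamma_i K_i)\, \mathcal{N}(\mathbf{e}_i, \sigma \mathbf{I}) + \gamma_i \sum_{j \neq i} K_j\, \mathcal{N}(\mathbf{e}_j, \sigma \mathbf{I}).$$

Output embedding matrix $\mathbf{O}$.

For the output embedding matrix, we also use the same variational distribution as the word embedding matrix:

$$\mathbf{o}_i \sim q(\mathbf{o}_i) = (1 - \gamma_i + \gamma_i K_i)\, \mathcal{N}(\mathbf{o}_i, \sigma \mathbf{I}) + \gamma_i \sum_{j \neq i} K_j\, \mathcal{N}(\mathbf{o}_j, \sigma \mathbf{I}).$$

Following similar derivations in the previous subsection, it is straightforward to show that the approximated KL term introduces the following regularization term to the overall objective:

$$\sum_{i=1}^{V} \lambda_i\, \mathbf{e}_i^\top \mathbf{e}_i, \qquad \lambda_i = \frac{(1 - \gamma_i + \gamma_i K_i) + \sum_{j \neq i} \gamma_j K_i}{2},$$

and similarly for the output embedding matrix $\mathbf{O}$.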


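Recall that we use Monte Carlo integration $\tilde{\mathbf{W}} \sim q(\mathbf{W})$ for each sequence to compute $p(x_t \mid x_{<t}; \tilde{\mathbf{W}})$ at training time. Since we need to noise the output word when the input word is replaced, we sample $\tilde{\mathbf{e}}_i$ at timestep $t$ (for $x_{t-1} = i$) and decide whether we will keep the original (input and output) words or replace them. If we decide to replace, we need to sample two new words, one from the input and one from the output variational distribution. For other words $k$ that are not the target at the current timestep (i.e., $k \neq x_t$), we can either assume that $\tilde{\mathbf{o}}_k = \mathbf{o}_k$ or alternatively sample $\tilde{\mathbf{o}}_k \sim q(\mathbf{o}_k)$ (which can be expensive since we need to sample an additional $V - 1$ times per timestep).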


For predictions, Xie et al. (2017) also take the mode of the variational distribution (assuming $1 - \gamma_i + \gamma_i K_i$ is almost always greater than $\gamma_i K_j$). We use the mean instead, similar to what we do in variational linear interpolation smoothing.

4.3 Extensions

Our derivations above provide insights into new methods under the variational smoothing framework. We describe three variants below:

Tying input and output embeddings.

Inan et al. (2017) and Press & Wolf (2017) showed that tying the input and output embedding matrices (i.e., using the same embedding matrix for both) improves language modeling performance and reduces the number of parameters significantly. We take inspiration from this work and sample both $\tilde{\mathbf{E}}$ and $\tilde{\mathbf{O}}$ from the same base matrix. As a result, we have fewer parameters to train due to this sharing mechanism, but we still have different samples of input and output embedding matrices per sequence (alternatively, we could sample one matrix for both the input and output embeddings; we found that this approach is slightly worse in our preliminary experiments). Similar to previous results in language modeling (Inan et al., 2017; Melis et al., 2018b), our experiments demonstrate that tying improves the performance considerably.

Combining smoothing and dropout.

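We can combine variational smoothing and variational dropout by modifying the variational distribution to incorporate a standard Gaussian component:

$$\mathbf{e}_i \sim q(\mathbf{e}_i) = \alpha\, \mathcal{N}(\mathbf{0}, \sigma \mathbf{I}) + (1 - \alpha) \sum_{j=1}^{V} \pi^i_j\, \mathcal{N}(\mathbf{e}_j, \sigma \mathbf{I}),$$

where $\alpha$ is the dropout probability and $\pi^i_j$ are the mixture proportions of the smoothing distribution. Note that in this formulation, we either drop an entire word by setting its embedding vector to zero (i.e., similar to word embedding dropout and blank noising) or choose an embedding vector from the set of words in the vocabulary.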

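However, it is more common to apply dropout to each dimension of the input embedding and output embedding matrix. In this formulation, we have:

$$e_{i,k} \sim q(e_{i,k}) = \alpha\, \mathcal{N}(0, \sigma) + (1 - \alpha) \sum_{j=1}^{V} \pi^i_j\, \mathcal{N}(e_{j,k}, \sigma), \tag{6}$$

where $e_{i,k}$ is the $k$-th dimension of $\mathbf{e}_i$. At training time, while we sample the dropout component multiple times (once per word embedding dimension for each word), we only sample the smoothing component once per word to ensure that when the element is not noised, we still use the same base embedding vector. We use this variant of smoothing and dropout throughout our experiments for our models.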

Element-wise smoothing.

The variational formulation above allows us to derive a variant of data smoothing that samples each element of the embedding vector $\mathbf{e}_i$ independently (and similarly for $\mathbf{o}_i$). Consider the variational distribution in Eq. 6. At training time, if we sample both the dropout and the smoothing components multiple times (once per word embedding dimension for each word), we arrive at a new element-wise smoothing method. The main difference between this model and the previous model is that each dimension in the input (and output) embedding vector is sampled independently. As a result, the vector that is used is a combination of elements from various word vectors. Notice that the mean under this new scheme is still the same as sampling per vector, so we do not need to change anything at test time. One major drawback of this model is that it is computationally expensive since we need to sample each element of each embedding vector.
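A minimal NumPy sketch of element-wise smoothing is given below, under stated assumptions: the mixture proportions are supplied as a row-stochastic $V \times V$ matrix `pi`, the Gaussian noise is omitted (the small-variance limit), and dropout is applied without rescaling. The function name is hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def sample_elementwise(E, pi, alpha, rng):
    """Element-wise smoothing followed by element-wise dropout.

    E: (V, d) base embeddings. pi: (V, V) mixture proportions, rows sum to 1.
    alpha: dropout probability.
    """
    V, d = E.shape
    # For every word i and every dimension k, pick a source word independently.
    source = np.stack([rng.choice(V, size=d, p=pi[i]) for i in range(V)])
    # Gather element k from the embedding of word source[i, k].
    sampled = E[source, np.arange(d)]
    # Element-wise dropout mask: keep an element with probability 1 - alpha.
    keep = rng.random((V, d)) >= alpha
    return sampled * keep, source
```

The gather step is what makes this variant costly compared with plain dropout: the mask alone is a single element-wise product, whereas the source indices require one draw per embedding element.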

5 Experiments

5.1 Setup

We evaluate our approaches on two standard language modeling datasets: Penn Treebank (PTB) and Wikitext-2. We use a two-layer LSTM as our base language model. We perform non-episodic training with batch size 64 using RMSprop (Hinton, 2012) as our optimization method. We tune the RMSprop learning rate and the $L_2$ regularization hyperparameter for all models on a development set by a grid search, and use perplexity on the development set to choose the best model. We also tune $\gamma_0$. We use recurrent dropout (Semeniuta et al., 2016) for $\mathbf{R}$ and set it to 0.2, and apply (element-wise) input and output embedding dropouts for $\mathbf{E}$ and $\mathbf{O}$ and set it to 0.5 when the LSTM hidden size is 512 and 0.7 when it is 1024, based on preliminary experiments. We tie the input and output embedding matrices in all our experiments (i.e., $\mathbf{E} = \mathbf{O}$), except for the vanilla LSTM model, where we report results for both tied and untied. Our preliminary experiments are consistent with previous work (Inan et al., 2017; Melis et al., 2018b; Merity et al., 2018) showing that tying the input and output embedding matrices results in better models with fewer parameters.

5.2 Models

We compare the following methods in our experiments:

  • Baseline: a vanilla LSTM language model. We evaluate two variants: with tied input and output embeddings and without.

  • Data noising (DN): an LSTM language model trained with data noising using linear interpolation smoothing or bigram Kneser-Ney smoothing (Xie et al., 2017).

  • Variational smoothing (VS): an LSTM language model with variational smoothing using linear interpolation or Kneser-Ney. For both models, we use the mean of the variational distribution at test time (we discuss results using sampling at test time in §5.4).

  • Variational element-wise smoothing: for the smaller PTB dataset, we evaluate an LSTM language model that uses element-wise Kneser-Ney variational smoothing and dropout. We also use the mean at test time.

Model | LSTM hidden size | # of params. | PTB Dev | PTB Test | Wikitext-2 Dev | Wikitext-2 Test
Vanilla LSTM (Xie et al., 2017) | 512 | - | 84.3 | 80.4 | - | -
Vanilla LSTM (Xie et al., 2017) | 1500 | - | 81.6 | 77.5 | - | -
DN: Kneser-Ney (Xie et al., 2017) | 512 | - | 79.9 | 76.9 | - | -
DN: Kneser-Ney (Xie et al., 2017) | 1500 | - | 76.2 | 73.4 | - | -
Var. dropout (Gal & Ghahramani, 2016b) | 1500 | - | - | 73.4 | - | -
Vanilla LSTM: untied | 512 | 14M/38M | 89.6 | 84.5 | 106.3 | 100.8
Vanilla LSTM: tied | 512 | 9M/21M | 80.0 | 74.0 | 90.6 | 86.6
DN: linear interpolation | 512 | 9M/21M | 79.4 | 73.3 | 88.9 | 84.6
DN: Kneser-Ney | 512 | 9M/21M | 75.0 | 70.7 | 86.1 | 82.1
VS: linear interpolation | 512 | 9M/21M | 76.3 | 71.2 | 84.0 | 79.6
VS: Kneser-Ney | 512 | 9M/21M | 74.5 | 70.6 | 84.9 | 80.9
VS: element-wise | 512 | 9M/21M | 70.5 | 66.8 | - | -
Vanilla LSTM: untied | 1024 | 37M/85M | 90.3 | 85.5 | 97.6 | 91.9
Vanilla LSTM: tied | 1024 | 27M/50M | 75.9 | 70.2 | 85.2 | 81.0
DN: linear interpolation | 1024 | 27M/50M | 75.5 | 70.2 | 84.3 | 80.1
DN: Kneser-Ney | 1024 | 27M/50M | 71.4 | 67.3 | 81.9 | 78.3
VS: linear interpolation | 1024 | 27M/50M | 71.7 | 67.8 | 80.5 | 76.6
VS: Kneser-Ney | 1024 | 27M/50M | 70.8 | 66.9 | 80.9 | 76.7
VS: element-wise | 1024 | 27M/50M | 68.6 | 64.5 | - | -
Table 2: Perplexity on PTB and Wikitext-2 datasets. DN and VS denote data noising and variational smoothing. The two numbers (*M/*M) in the # of params. column denote the number of parameters for PTB and Wikitext-2 respectively.

5.3 Results

Our results are summarized in Table 2. Consistent with previous work on tying the input and output embedding matrices in language models (Inan et al., 2017; Melis et al., 2018b; Merity et al., 2018), we see a large reduction in perplexity (lower is better) when doing so. While our numbers are generally better, the results are also consistent with Xie et al. (2017), who show that linear interpolation data noising is slightly better than a vanilla LSTM with dropout, and that Kneser-Ney data noising outperforms these methods for both the medium (512) and large (1024) models.

Variational smoothing improves over data noising in all cases, both for linear interpolation and Kneser-Ney. Recall that the main differences between variational smoothing and data noising are: (1) using the mean at test time, (2) having a data dependent regularization coefficient that comes from the KL term (the data dependent coefficient places a larger penalty on vectors that are sampled more often), and (3) how each method interacts with the input and output embedding tying mechanism. In the variational framework, if we sample one matrix for the input and output embeddings, it effectively noises the output words even for linear interpolation. If we sample two matrices from the same base matrix, these matrices can be different at training time even if the parameters are tied. As described in §4.3, we use the latter in our experiments. Our results suggest that the choice of the proposal distribution to sample from is less important for variational smoothing. In our experiments, Kneser-Ney outperforms linear interpolation on PTB but linear interpolation is slightly better on Wikitext-2.

Element-wise variational smoothing performs the best for both small and large LSTM models on PTB. We note that this improvement comes at a cost, since this method is computationally expensive. Element-wise dropout can be implemented efficiently by sampling a mask (zero or one with some probability) and multiplying the entire embedding matrix with this mask. In element-wise smoothing, we need to sample an index for each dimension and reconstruct the embedding matrix for each timestep. It took about one day to train the smaller model as opposed to a couple of hours without element-wise smoothing. As a result, we were unable to train this on a much bigger dataset such as Wikitext-2. Nonetheless, our results show that applying smoothing in the embedding (latent) space results in better models.

5.4 Discussions

Sampling at test time.

In order to better understand another prediction method for these models, we perform experiments where we sample at test time for both data noising and variational smoothing (instead of taking the mode or the mean). We use twenty samples and average the log likelihood. Our results agree with Xie et al. (2017), who mentioned that sampling does not provide additional benefits. In our experiments, the perplexities of 512- and 1024-dimensional DN-Kneser-Ney models increase to 96.5 and 85.1 (from 75.0 and 71.4) on the PTB validation set. For VS-Kneser-Ney models, the perplexities increase to 89.8 and 78.7 (from 74.5 and 70.8). Both our results and Xie et al. (2017) suggest that introducing data dependent noise at test time is detrimental for recurrent neural network language models.
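One way to turn sampled predictions into a single perplexity is sketched below. It assumes that per-token natural-log probabilities are available for each of the sampled parameter sets and that the log likelihood is averaged over samples before exponentiating; the exact aggregation used in the experiments is not specified beyond averaging the log likelihood, and the function name is hypothetical.

```python
import numpy as np


def perplexity_from_samples(token_log_probs):
    """token_log_probs: (S, T) natural-log probabilities of T tokens under S samples."""
    # Average the log likelihood of each token over the S samples.
    avg_log_lik = token_log_probs.mean(axis=0)
    # Perplexity is the exponential of the negative mean log likelihood per token.
    return float(np.exp(-avg_log_lik.mean()))
```

With twenty samples, `token_log_probs` would have twenty rows, one per sampled set of embedding matrices.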

Sensitivity to $\gamma_0$.

We evaluate the sensitivity of variational smoothing to the hyperparameter $\gamma_0$. Figure 1 shows perplexities on the PTB validation set for a variant of our models. The model generally performs well for $\gamma_0$ between 0.1 and 0.3, but it becomes progressively worse as we increase $\gamma_0$ since there is too much noise.

Other applications.

While we focus on language modeling in this paper, the proposed technique is applicable to other language processing tasks. For example, Xie et al. (2017) showed that data smoothing improves machine translation, so our techniques can be used in that setup as well. It is also interesting to consider how variational smoothing interacts with and/or can be applied to memory augmented language models (Tran et al., 2016; Grave et al., 2017; Yogatama et al., 2018) and state-of-the-art language models (Yang et al., 2017). We leave these for future work.


Figure 1: Validation set perplexities on PTB for VS: Kneser-Ney (512 dimensions).

6 Conclusion

We showed that data noising in recurrent neural network language models can be understood as a Bayesian recurrent neural network with a variational distribution that consists of a mixture of Gaussians whose mixture weights are a function of the proposal distribution used in data noising (e.g., the unigram distribution, the Kneser-Ney continuation distribution). We proposed using the mean of the variational distribution at prediction time as a better alternative to using the mode. We combined it with variational dropout, presented two extensions (i.e., variational smoothing with tied input and output embedding matrices and element-wise smoothing), and demonstrated language modeling improvements on Penn Treebank and Wikitext-2.


References

  • Chen & Goodman (1996) Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In Proc. of ACL, 1996.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proc. of EMNLP, 2014.
  • Dai & Le (2015) Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Proc. of NIPS, 2015.
  • Gal & Ghahramani (2016a) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proc. of ICML, 2016a.
  • Gal & Ghahramani (2016b) Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Proc. of NIPS, 2016b.
  • Grave et al. (2017) Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. In Proc. of ICLR, 2017.
  • Hinton (2012) Geoffrey Hinton. Neural networks for machine learning, 2012. Lecture 6.5.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Hoerl & Kennard (1970) Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
  • Inan et al. (2017) Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. In Proc. of ICLR, 2017.
  • Iyyer et al. (2015) Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daume III. Deep unordered composition rivals syntactic methods for text classification. In Proc. of ACL, 2015.
  • Kumar et al. (2016) Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In Proc. of ICML, 2016.
  • Marcus et al. (1994) Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The penn treebank: Annotating predicate argument structure. In Proc. of the Workshop on Human Language Technology, 1994.
  • Melis et al. (2018a) Gabor Melis, Charles Blundell, Tomas Kocisky, Karl Moritz Hermann, Chris Dyer, and Phil Blunsom. Pushing the bounds of dropout. arXiv preprint, 2018a.
  • Melis et al. (2018b) Gabor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. In Proc. of ICLR, 2018b.
  • Merity et al. (2017) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In Proc. of ICLR, 2017.
  • Merity et al. (2018) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. In Proc. of ICLR, 2018.
  • Press & Wolf (2017) Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 157–163. Association for Computational Linguistics, 2017.
  • Semeniuta et al. (2016) Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. In Proc. of COLING, 2016.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • Tran et al. (2016) Ke Tran, Arianna Bisazza, and Christof Monz. Recurrent memory networks for language modeling. In Proc. of NAACL-HLT, 2016.
  • Xie et al. (2017) Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Levy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. Data noising as smoothing in neural network language models. In Proc. of ICLR, 2017.
  • Yang et al. (2017) Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: A high-rank rnn language model. arXiv preprint arXiv:1711.03953, 2017.
  • Yogatama et al. (2018) Dani Yogatama, Yishu Miao, Gabor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, and Phil Blunsom. Memory architectures in recurrent neural network language models. In Proc. of ICLR, 2018.