1 Introduction
Recurrent neural networks (RNNs) are state-of-the-art models in various language processing tasks. However, their performance heavily depends on proper regularization at training time (Melis et al., 2018b; Merity et al., 2018). The two predominant approaches to regularize RNNs are dropout (randomly zeroing out neurons; Srivastava et al., 2014) and $L_2$ regularization (applying a penalty to model parameters; Hoerl & Kennard, 1970). Recently, Xie et al. (2017) proposed data noising to regularize language models. Their method is formulated as a data augmentation method that randomly replaces words with other words drawn from a proposal distribution. For example, we can use the unigram distribution, which models the number of word occurrences in the training corpus, or a more sophisticated proposal distribution that takes into account the number of bigram types in the corpus. Data noising has been shown to improve perplexity on small and large corpora; this improvement is complementary to other regularization techniques and translates to improvements on downstream models such as machine translation.
Xie et al. (2017) derived connections between data noising and smoothing in classical $n$-gram language models, which we review in §2. In smoothing (Chen & Goodman, 1996), since empirical counts for unseen sequences are zero, we smooth our estimates by a weighted average of higher order and lower order $n$-gram models. There are various ways to choose the weights and the lower order models, leading to different smoothing techniques, with Kneser-Ney smoothing widely considered to be the most effective. Xie et al. (2017) showed that the pseudocounts of noised data correspond to a mixture of different $n$-gram models.
In this paper, we provide a new theoretical foundation for data noising and show that it can be understood as a form of Bayesian recurrent neural network with a particular variational distribution (§4.1 and §4.2). Our derivation relates data noising to dropout and variational dropout (Gal & Ghahramani, 2016b), and naturally leads to a data dependent regularization coefficient. We use this insight to arrive at a more principled way to do prediction with data noising (i.e., by taking the mean of the variational distribution, as opposed to the mode) and propose several extensions under the variational framework in §4.3. Specifically, we show how to use variational smoothing with tied input and output embeddings and propose elementwise smoothing. In §5, we validate our analysis in language modeling experiments on the Penn Treebank (Marcus et al., 1994) and WikiText-2 (Merity et al., 2017) datasets.
2 Recurrent Neural Network Language Models
We consider a language modeling problem where the goal is to predict the next word $x_t$ given previously seen context words $x_{<t} = \{x_0, x_1, \ldots, x_{t-1}\}$. Let $W$ be parameters of a recurrent neural network, and $p(x_t \mid x_{<t}; W) = \mathrm{RNN}(x_{<t}; W)$. Following previous work in language modeling (Melis et al., 2018b; Merity et al., 2018), we use an LSTM (Hochreiter & Schmidhuber, 1997) as our RNN function, although other variants such as the GRU (Cho et al., 2014) can be used as well.
Given a training corpus $\mathcal{D} = \{x_0, \ldots, x_T\}$, the likelihood we would like to maximize is:
$$\mathcal{L}(W; \mathcal{D}) = \prod_{t=1}^{T} p(x_t \mid x_{<t}; W).$$
Directly optimizing the (log) likelihood above often leads to overfitting. We typically augment the objective function with a regularizer (e.g., an $L_2$ regularizer on $W$) or use dropout by randomly zeroing out neurons (Srivastava et al., 2014).
Data noising as smoothing.
Xie et al. (2017) proposed a method to regularize recurrent neural network language models by noising the data. For each input word $x_i$ in $x_{<t}$ (sometimes also the corresponding output word), we replace it with another word sampled from a proposal distribution $\mathcal{Q}$ with probability $\gamma$. They introduced various methods for choosing $\gamma$ and $\mathcal{Q}$. For example, if $\gamma = \gamma_0$ for all $x_i$ and $\mathcal{Q}$ is the unigram distribution of words in the corpus, the expected prediction over noised contexts, $\mathbb{E}_{\tilde{x}_{<t}}[p(x_t \mid \tilde{x}_{<t}; W)]$, corresponds to a mixture of $n$-gram models with fixed weights (i.e., linear interpolation smoothing).
Another option is to set $\gamma = \gamma_0\,\mathrm{distinct}(x_i, \bullet)/\mathrm{count}(x_i)$, where $\mathrm{distinct}(x_i, \bullet)$ denotes the number of distinct continuations preceded by word $x_i$ (i.e., the number of bigram types that have $x_i$ as the first word), and $\mathrm{count}(x_i)$ denotes the number of times $x_i$ appears in the corpus. For this choice of $\gamma$, when $\mathcal{Q}$ is the unigram distribution, the expectation corresponds to absolute discounting; whereas if $\mathcal{Q}(x) \propto \mathrm{distinct}(\bullet, x)$ and we replace both the input and output words, it corresponds to bigram Kneser-Ney smoothing. We summarize their proposed methods in Table 1.
At prediction (test) time, Xie et al. (2017) do not apply any noising and directly predict $p(x_t \mid x_{<t}; W)$. They showed that a combination of smoothing and dropout achieves the best result on their language modeling and machine translation experiments.
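Table 1: Data noising methods proposed by Xie et al. (2017).
Name                  Noised            $\gamma$                                                         $\mathcal{Q}$
Blank noising         $x_i$             $\gamma_0$                                                       blank token "_"
Linear interpolation  $x_i$             $\gamma_0$                                                       unigram
Absolute discounting  $x_i$             $\gamma_0\,\mathrm{distinct}(x_i, \bullet)/\mathrm{count}(x_i)$  unigram
Kneser-Ney            $x_i$, $x_{i+1}$  $\gamma_0\,\mathrm{distinct}(x_i, \bullet)/\mathrm{count}(x_i)$  $\propto \mathrm{distinct}(\bullet, x)$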
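As an illustration of the noising schemes summarized above, the following minimal sketch draws a noised context under either a constant noising probability (linear interpolation) or a count-based one (absolute discounting). The function names, the toy corpus, and the use of Python's standard `random` module are assumptions made for this example; they are not taken from Xie et al. (2017).

```python
import random
from collections import Counter

def unigram_distribution(corpus):
    """Return (words, probs) for the unigram distribution of a token list."""
    counts = Counter(corpus)
    total = sum(counts.values())
    words = sorted(counts)
    return words, [counts[w] / total for w in words]

def absolute_discounting_gamma(corpus, gamma0):
    """gamma(x) = gamma0 * distinct(x, .) / count(x) for every word x."""
    counts = Counter(corpus)
    continuations = {}
    for first, second in zip(corpus, corpus[1:]):
        continuations.setdefault(first, set()).add(second)
    return {w: gamma0 * len(continuations.get(w, ())) / counts[w] for w in counts}

def noise_inputs(context, gamma, words, probs, rng):
    """Replace each context word by a draw from the proposal with prob. gamma[word]."""
    noised = []
    for w in context:
        if rng.random() < gamma[w]:
            noised.append(rng.choices(words, weights=probs, k=1)[0])
        else:
            noised.append(w)
    return noised

corpus = "a b a c a b d".split()          # toy corpus (assumed)
words, probs = unigram_distribution(corpus)
gamma_li = {w: 0.2 for w in words}        # linear interpolation: constant gamma
gamma_ad = absolute_discounting_gamma(corpus, 0.2)  # word-dependent gamma
noised = noise_inputs(corpus[:-1], gamma_ad, words, probs, random.Random(0))
```

Passing `gamma_li` instead of `gamma_ad` to `noise_inputs` gives the linear interpolation variant; only the noising probability changes, not the proposal.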
3 Bayesian Recurrent Neural Networks
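In Bayesian RNNs, we define a prior $p(W)$ over our parameters $W$ and consider the following maximization problem:
$$\mathcal{L}(W; \mathcal{D}) \propto p(W) \prod_{t=1}^{T} p(x_t \mid x_{<t}; W).$$
A common prior is the standard normal distribution $p(W) = \mathcal{N}(0, I)$.
For recurrent neural networks, the posterior over $W$ given the data $\mathcal{D}$ is intractable. We approximate the posterior with a variational distribution $q(W)$, and minimize the KL divergence between the variational distribution and the true posterior by:
$$\mathrm{KL}(q(W)\,\|\,p(W \mid \mathcal{D})) = \mathrm{KL}(q(W)\,\|\,p(W)) - \int q(W) \log p(\mathcal{D} \mid W)\,dW + \text{constant}.$$
The integral is often approximated with Monte Carlo integration with one sample $\tilde{W} \sim q(W)$: [Footnote 1: In the following, we use $\tilde{x}$ to denote a sample from a distribution $q(x)$.]
$$\int q(W) \log p(\mathcal{D} \mid W)\,dW \approx \log p(\mathcal{D} \mid \tilde{W}). \tag{1}$$
At test time, for a new sequence $y_{0:T}$, we can either set $W$ to be the mean of $q(W)$ or sample and average the results: $p(y_{0:T}) = \frac{1}{S}\sum_{s=1}^{S} p(y_{0:T} \mid \tilde{W}_s)$, where $S$ is the number of samples and $\tilde{W}_s \sim q(W)$.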
4 Variational Smoothing
We now provide theoretical justifications for data smoothing under the variational framework. In a recurrent neural network language model, there are three types of parameters: an input (word) embedding matrix $E$, an LSTM parameter matrix $R$, and an output embedding matrix $O$ that produces logits for the final softmax function. We have $W = \{E, R, O\}$.
4.1 Linear Interpolation Smoothing
We first focus on the simplest data noising method—linear interpolation smoothing—and show how to extend it to other data noising methods subsequently.
Word embedding matrix $E$.
Denote the word vector corresponding to word $v_i \in \mathcal{V}$ in the input embedding matrix $E$ by $e_i$. We obtain a similar effect to linear interpolation smoothing by using the following mixture of Gaussians variational distribution for $e_i$:
$$e_i \sim q(e_i) = (1 - \gamma_0)\,\mathcal{N}(e_i, \sigma I) + \gamma_0 \sum_{v_j \in \mathcal{V}} \mathcal{U}_j\,\mathcal{N}(e_j, \sigma I), \tag{2}$$
where $\mathcal{U}_j$ is the unigram probability of word $v_j$ and $\sigma$ is small. In other words, with probability $\gamma_0\,\mathcal{U}_j$, we replace the embedding $e_i$ with another embedding sampled from a normal distribution centered at $e_j$.
Note that noising the input word is equivalent to choosing a different word embedding vector to be used in a standard recurrent neural network. Under the variational framework, we sample a different word embedding matrix $\tilde{E}$ for every sequence at training time, since the integral is approximated with Monte Carlo by sampling from $q(E)$ (Eq. 1).
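A minimal sketch of drawing one noised embedding matrix per sequence from a mixture of this form is given below. It assumes embeddings stored as plain Python lists, treats `sigma` as the standard deviation of the Gaussian components, and uses hypothetical names (`sample_embedding_matrix`, `unigram`, `gamma0`); it is meant only to make the sampling procedure concrete.

```python
import random

def sample_embedding_matrix(E, unigram, gamma0, sigma, rng):
    """Draw one noised embedding matrix from the mixture of Gaussians.

    For each word i: with probability 1 - gamma0 the component centred at
    E[i] is chosen; otherwise a word j is drawn from the unigram distribution
    and the component centred at E[j] is chosen. Gaussian noise with standard
    deviation sigma (an assumption of this sketch) is then added.
    """
    indices = list(range(len(E)))
    sampled = []
    for i in indices:
        if rng.random() < gamma0:
            j = rng.choices(indices, weights=unigram, k=1)[0]
        else:
            j = i
        sampled.append([rng.gauss(mu, sigma) for mu in E[j]])
    return sampled

E = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # toy 3-word vocabulary (assumed)
unigram = [0.5, 0.3, 0.2]
E_tilde = sample_embedding_matrix(E, unigram, gamma0=0.2, sigma=0.01,
                                  rng=random.Random(0))
```

Calling the function once per training sequence and looking up every input word in `E_tilde` corresponds to the one-sample Monte Carlo approximation described above.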
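KL term.
For $\mathrm{KL}(q(e_i)\,\|\,p(e_i))$, we use Proposition 1 in Gal & Ghahramani (2016a) and approximate the KL divergence between a mixture of Gaussians $q(e_i)$ and $p(e_i) = \mathcal{N}(0, I)$ as:
$$\mathrm{KL}(q(e_i)\,\|\,p(e_i)) \approx \sum_{j=1}^{V} \frac{\pi^i_j}{2}\, e_j^\top e_j + \text{constant},$$
where $V$ is the vocabulary size and $\pi^i_j$ is the mixture proportion for word $v_i$: $1 - \gamma_0 + \gamma_0\,\mathcal{U}_i$ for $j = i$ and $\gamma_0\,\mathcal{U}_j$ otherwise. In practice, the KL term directly translates to an $L_2$ regularizer on each word embedding vector, but the regularization coefficient is data dependent. More specifically, the regularization coefficient for word vector $e_i$ taking into account contributions from $\mathrm{KL}(q(e_k)\,\|\,p(e_k))$ for $k \in \{1, \ldots, V\}$ is:
$$\lambda_i = \frac{1 - \gamma_0 + \gamma_0\,\mathcal{U}_i}{2} + \sum_{k \neq i} \frac{\gamma_0\,\mathcal{U}_i}{2} = \frac{1 - \gamma_0}{2} + V\,\frac{\gamma_0\,\mathcal{U}_i}{2}. \tag{3}$$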
In other words, the variational formulation of data smoothing results in a regularization coefficient that is a function of corpus statistics (the unigram distribution).
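To make the data dependence concrete, the sketch below computes a coefficient for each word by summing, over every word's variational distribution, the mixture weight placed on that word's component, and halving the sum. The closed form used for comparison, $(1-\gamma_0)/2 + V\gamma_0\,\mathcal{U}_i/2$, is an assumption that follows from the mixture proportions described above and drops constants that do not depend on the embeddings; the function names are hypothetical.

```python
def coefficients_by_summation(unigram, gamma0):
    """Sum the weight on component j across all V variational distributions."""
    V = len(unigram)
    coeffs = []
    for j in range(V):
        total = 0.0
        for k in range(V):
            # Weight that q(e_k) places on the component centred at e_j.
            weight = gamma0 * unigram[j]
            if j == k:
                weight += 1.0 - gamma0
            total += weight
        coeffs.append(total / 2.0)
    return coeffs

def coefficients_closed_form(unigram, gamma0):
    """Closed form of the same quantity: (1 - gamma0)/2 + V * gamma0 * U_i / 2."""
    V = len(unigram)
    return [(1.0 - gamma0) / 2.0 + V * gamma0 * u / 2.0 for u in unigram]

unigram = [0.5, 0.3, 0.2]                  # toy unigram distribution (assumed)
lam = coefficients_closed_form(unigram, gamma0=0.2)
```

Under these assumptions, more frequent words receive larger coefficients, since their embeddings are used as replacements more often.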
Other parameters.
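For other parameters $R$ and $O$, we can use either simple variational distributions such as $q(R) = \mathcal{N}(R, \sigma I)$ and $q(O) = \mathcal{N}(O, \sigma I)$, which become standard $L_2$ regularizers on these parameters; or incorporate dropout by setting $q(r_i) = (1 - \alpha)\,\mathcal{N}(r_i, \sigma I) + \alpha\,\mathcal{N}(0, \sigma I)$, where $\alpha$ is the dropout probability and $r_i$ is the $i$-th row of $R$ (Gal & Ghahramani, 2016b).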
Training.
In the original noising formulation (Xie et al., 2017), given a sequence {a, b, a, c}, it is possible to get a noised sequence {d, b, a, c}, since the decision to noise at every timestep is independent of others. In the variational framework, we use Monte Carlo integration $\tilde{E} \sim q(E)$ for each sequence to compute $p(x_{0:T} \mid \tilde{W})$. We can either sample one embedding matrix per sequence similar to Gal & Ghahramani (2016b), or sample one embedding matrix per timestep (Melis et al., 2018a). While the first method is computationally more efficient (we use it in our experiments), if we decide to noise the first a to d, we will have a noised sequence where every a is replaced by d: {d, b, d, c}. At training time, we go through sequence $x_{0:T}$ multiple times (once per epoch) to get different noised sequences.
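The difference between the two sampling schemes can be sketched at the word level as follows. This is an illustrative simplification that operates on word identities rather than embedding matrices; the function names and the toy unigram distribution are assumptions.

```python
import random

def noise_per_timestep(seq, gamma0, words, probs, rng):
    """Original data noising: an independent decision at every position."""
    noised = []
    for w in seq:
        if rng.random() < gamma0:
            noised.append(rng.choices(words, weights=probs, k=1)[0])
        else:
            noised.append(w)
    return noised

def noise_per_sequence(seq, gamma0, words, probs, rng):
    """One sample per sequence: one decision per word type, reused at
    every position where that word occurs."""
    mapping = {}
    for w in sorted(set(seq)):
        if rng.random() < gamma0:
            mapping[w] = rng.choices(words, weights=probs, k=1)[0]
        else:
            mapping[w] = w
    return [mapping[w] for w in seq]

words = ["a", "b", "c", "d"]
probs = [0.4, 0.2, 0.2, 0.2]               # toy unigram distribution (assumed)
seq = ["a", "b", "a", "c"]
per_step = noise_per_timestep(seq, 0.3, words, probs, random.Random(1))
per_seq = noise_per_sequence(seq, 0.3, words, probs, random.Random(1))
```

With `noise_per_sequence`, both occurrences of "a" always map to the same word, which mirrors the {d, b, d, c} example; with `noise_per_timestep` they may differ.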
Predictions.
For predictions, Xie et al. (2017) do not noise any input. In the variational framework, this corresponds to taking the mode of the variational distribution, since $1 - \gamma_0$ is almost always greater than $\gamma_0\,\mathcal{U}_j$. They also reported that they did not observe any improvements by sampling. Using the mode is rather uncommon in Bayesian RNNs. A standard approach is to take the mean, so we can set:
$$\bar{e}_i = (1 - \gamma_0)\,e_i + \gamma_0 \sum_{v_j \in \mathcal{V}} \mathcal{U}_j\,e_j.$$
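The mean of a mixture distribution is the weighted average of its component means, so the test-time embedding can be sketched as below. The sketch assumes the linear interpolation mixture (weight $1-\gamma_0$ on the word's own embedding and $\gamma_0$ spread over the unigram distribution) and uses plain Python lists; `mean_embeddings` is a hypothetical name.

```python
def mean_embeddings(E, unigram, gamma0):
    """Return the mean of the variational distribution for every word."""
    dim = len(E[0])
    # Unigram-weighted average of all embedding vectors; shared by every word.
    avg = [sum(u * row[d] for u, row in zip(unigram, E)) for d in range(dim)]
    return [[(1.0 - gamma0) * row[d] + gamma0 * avg[d] for d in range(dim)]
            for row in E]

E = [[1.0, 0.0], [0.0, 1.0]]               # toy 2-word vocabulary (assumed)
unigram = [0.75, 0.25]
E_mean = mean_embeddings(E, unigram, gamma0=0.2)
```

Because the unigram-weighted average is the same for all words, it only needs to be computed once after training.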
Extension to absolute discounting.
4.2 Kneser-Ney Smoothing
We now consider the variational analog of Kneser-Ney data noising. Instead of smoothing towards lower order $n$-grams, Kneser-Ney uses models that take into account contextual diversity. For example, for bigram Kneser-Ney smoothing, we replace the unigram distribution with $\mathcal{K}_i = \mathrm{distinct}(\bullet, v_i) / \sum_{v_j \in \mathcal{V}} \mathrm{distinct}(\bullet, v_j)$, where $\mathrm{distinct}(\bullet, v_i)$ denotes the number of distinct bigrams that end with word $v_i$. As a result, even if bigrams such as "San Francisco" and "Los Angeles" appear frequently in the corpus, "Francisco" and "Angeles" will not have high probabilities in $\mathcal{K}$ since they often follow "San" and "Los".
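The continuation-based proposal can be computed from bigram types as in the following sketch. The toy corpus and the function name `continuation_distribution` are assumptions for illustration.

```python
def continuation_distribution(corpus):
    """K(w) is proportional to the number of distinct bigram types ending in w."""
    preceding = {}
    for first, second in zip(corpus, corpus[1:]):
        preceding.setdefault(second, set()).add(first)
    total = sum(len(s) for s in preceding.values())
    vocab = sorted(set(corpus))
    return {w: len(preceding.get(w, ())) / total for w in vocab}

corpus = ("san francisco is big san francisco is far "
          "los angeles is big").split()
K = continuation_distribution(corpus)
unigram_francisco = corpus.count("francisco") / len(corpus)
```

In this toy corpus "francisco" only ever follows "san", so its continuation probability `K["francisco"]` is lower than its unigram probability, whereas "is" follows two different words and receives more mass.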
Recall that similar to absolute discounting, Kneser-Ney also uses $\gamma_{\mathrm{AD}} = \gamma_0\,\mathrm{distinct}(v_i, \bullet)/\mathrm{count}(v_i)$. However, if we replace an input word $x_i$, Xie et al. (2017) proposed to also replace the corresponding output word $x_{i+1}$. The intuition behind this is that since the probability of replacing an input word $x_i$ is proportional to the number of distinct bigrams that start with $x_i$, when we replace the input word, we also need to replace the output word (e.g., if we replace "San" by a word sampled from $\mathcal{K}$, we should also replace "Francisco").
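Word embedding matrix $E$.
To get (bigram) Kneser-Ney smoothing, we use the following mixture of Gaussians variational distribution for $e_i$:
$$e_i \sim q(e_i) = (1 - \gamma_{\mathrm{AD}})\,\mathcal{N}(e_i, \sigma I) + \gamma_{\mathrm{AD}} \sum_{v_j \in \mathcal{V}} \mathcal{K}_j\,\mathcal{N}(e_j, \sigma I), \tag{4}$$
where $\gamma_{\mathrm{AD}}$ is the count-based noising probability for word $v_i$ and $\mathcal{K}_j$ is the continuation probability of word $v_j$.
Output embedding matrix $O$.
For the output embedding matrix, we also use the same variational distribution as the word embedding matrix:
$$o_i \sim q(o_i) = (1 - \gamma_{\mathrm{AD}})\,\mathcal{N}(o_i, \sigma I) + \gamma_{\mathrm{AD}} \sum_{v_j \in \mathcal{V}} \mathcal{K}_j\,\mathcal{N}(o_j, \sigma I). \tag{5}$$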
KL term.
Following similar derivations in the previous subsection, it is straightforward to show that the approximated KL term introduces the following regularization term to the overall objective:
and similarly for the output embedding matrix $O$.
Training.
Recall that we use Monte Carlo integration for each sequence to compute at training time. Since we need to noise the output word when the input word is replaced, we sample at (for ) and decide whether we will keep the original (input and output) words or replace them. If we decide to replace, we need to sample two new words from and respectively. For other words that are not the target at the current timestep (i.e., ), we can either assume that or alternatively sample (which can be expensive since we need to sample additional times per timestep).
Predictions.
For predictions, Xie et al. (2017) also take the mode of the variational distribution (assuming $1 - \gamma_{\mathrm{AD}}$ is almost always greater than $\gamma_{\mathrm{AD}}\,\mathcal{K}_j$). We use the mean instead, similar to what we do in variational linear interpolation smoothing.
4.3 Extensions
Our derivations above provide insights into new methods under the variational smoothing framework. We describe three variants in the following:
Tying input and output embeddings.
Inan et al. (2017) and Press & Wolf (2017) showed that tying the input and output embedding matrices (i.e., using the same embedding matrix for both) improves language modeling performance and reduces the number of parameters significantly. We take inspiration from this work and sample both $\tilde{E}$ and $\tilde{O}$ from the same base matrix. As a result, we have fewer parameters to train due to this sharing mechanism, but we still have different samples of input and output embedding matrices per sequence. [Footnote 2: Alternatively, we could sample one matrix for both the input and output embeddings. We found that this approach is slightly worse in our preliminary experiments.] Similar to previous results in language modeling (Inan et al., 2017; Melis et al., 2018b), our experiments demonstrate that tying improves the performance considerably.
Combining smoothing and dropout.
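We can combine variational smoothing and variational dropout by modifying the variational distribution to incorporate a zero-centered Gaussian component:
$$e_i \sim q(e_i) = \alpha\,\mathcal{N}(0, \sigma I) + (1 - \alpha)\Big[(1 - \gamma_0)\,\mathcal{N}(e_i, \sigma I) + \gamma_0 \sum_{v_j \in \mathcal{V}} \mathcal{U}_j\,\mathcal{N}(e_j, \sigma I)\Big],$$
where $\alpha$ is the dropout probability. Note that in this formulation, we either drop an entire word by setting its embedding vector to zero (i.e., similar to word embedding dropout and blank noising) or choose an embedding vector from the set of words in the vocabulary.
However, it is more common to apply dropout to each dimension of the input embedding and output embedding matrix. In this formulation, we have:
$$e_{i,j} \sim q(e_{i,j}) = \alpha\,\mathcal{N}(0, \sigma) + (1 - \alpha)\Big[(1 - \gamma_0)\,\mathcal{N}(e_{i,j}, \sigma) + \gamma_0 \sum_{v_k \in \mathcal{V}} \mathcal{U}_k\,\mathcal{N}(e_{k,j}, \sigma)\Big]. \tag{6}$$
At training time, while we sample the dropout decision (with probability $\alpha$) multiple times (once per word embedding dimension for each word), we only sample the smoothing decision (with probability $\gamma_0$) once per word to ensure that when the element is not dropped, we still use the same base embedding vector. We use this variant of smoothing and dropout throughout our experiments for our models.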
Elementwise smoothing.
The variational formulation above allows us to derive a variant of data smoothing that samples each element of the embedding vector $e_{i,j}$ independently (and similarly for $o_{i,j}$). Consider the variational distribution in Eq. 6. At training time, if we sample both the dropout decision and the smoothing decision multiple times (once per word embedding dimension for each word), we arrive at a new elementwise smoothing method. The main difference between this model and the previous model is that each dimension in the input (and output) embedding vector is sampled independently. As a result, the vector that is used is a combination of elements from various word vectors. Notice that the mean under this new scheme is still the same as sampling per vector, so we do not need to change anything at test time. One major drawback of this model is that it is computationally expensive since we need to sample each element of each embedding vector.
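A minimal sketch of elementwise sampling is shown below. It assumes the Gaussian noise is negligible (so sampled elements are copied exactly), stores embeddings as plain Python lists, and uses hypothetical names (`sample_elementwise`, `proposal`); it illustrates the idea rather than the implementation used in the experiments.

```python
import random

def sample_elementwise(E, proposal, gamma, alpha, rng):
    """Sample every element of every embedding vector independently.

    For element (i, d): with probability alpha it is dropped (set to zero);
    otherwise, with probability gamma it is copied from dimension d of a word
    j drawn from the proposal, and with probability 1 - gamma it keeps E[i][d].
    """
    indices = list(range(len(E)))
    out = []
    for i, row in enumerate(E):
        new_row = []
        for d in range(len(row)):
            if rng.random() < alpha:
                new_row.append(0.0)
                continue
            if rng.random() < gamma:
                j = rng.choices(indices, weights=proposal, k=1)[0]
            else:
                j = i
            new_row.append(E[j][d])
        out.append(new_row)
    return out

E = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy embedding matrix (assumed)
proposal = [0.5, 0.3, 0.2]
sample = sample_elementwise(E, proposal, gamma=0.3, alpha=0.2,
                            rng=random.Random(0))
```

The nested loop makes the cost explicit: one smoothing draw is needed per element of the embedding matrix, rather than one per word.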
5 Experiments
5.1 Setup
We evaluate our approaches on two standard language modeling datasets: Penn Treebank (PTB) and WikiText-2. We use a two-layer LSTM as our base language model. We perform non-episodic training with batch size 64 using RMSprop (Hinton, 2012) as our optimization method. We tune the RMSprop learning rate and the $L_2$ regularization hyperparameter $\lambda$ for all models on a development set by a grid search, and use perplexity on the development set to choose the best model. We also tune $\gamma_0$. We use recurrent dropout (Semeniuta et al., 2016) for $R$ and set it to 0.2, and apply (elementwise) input and output embedding dropouts for $E$ and $O$ and set it to 0.5 or 0.7 depending on the setting, based on preliminary experiments. We tie the input and output embedding matrices in all our experiments (i.e., $E = O$), except for the vanilla LSTM model, where we report results for both tied and untied. [Footnote 3: Our preliminary experiments are consistent with previous work (Inan et al., 2017; Melis et al., 2018b; Merity et al., 2018) showing that tying the input and output embedding matrices results in better models with fewer parameters.]
5.2 Models
We compare the following methods in our experiments:

Baseline: a vanilla LSTM language model. We evaluate two variants: with tied input and output embeddings and without.

Data noising (DN): an LSTM language model trained with data noising using linear interpolation smoothing or bigram Kneser-Ney smoothing (Xie et al., 2017).

Variational smoothing (VS): an LSTM language model with variational smoothing using linear interpolation or Kneser-Ney. For both models, we use the mean of the variational distribution at test time. [Footnote 4: We discuss results using sampling at test time in §5.4.]

Variational elementwise smoothing: for the smaller PTB dataset, we evaluate an LSTM language model that uses elementwise Kneser-Ney variational smoothing and dropout. We also use the mean at test time.
Table 2: Perplexity on PTB and WikiText-2 (lower is better).
Model                                   Hidden size  # of params.  PTB Dev  PTB Test  WikiText-2 Dev  WikiText-2 Test
Vanilla LSTM (Xie et al., 2017)         512          -             84.3     80.4      -               -
Vanilla LSTM (Xie et al., 2017)         1500         -             81.6     77.5      -               -
DN: Kneser-Ney (Xie et al., 2017)       512          -             79.9     76.9      -               -
DN: Kneser-Ney (Xie et al., 2017)       1500         -             76.2     73.4      -               -
Var. dropout (Gal & Ghahramani, 2016b)  1500         -             -        73.4      -               -
Vanilla LSTM: untied                    512          14M/38M       89.6     84.5      106.3           100.8
Vanilla LSTM: tied                      512          9M/21M        80.0     74.0      90.6            86.6
DN: linear interpolation                512          9M/21M        79.4     73.3      88.9            84.6
DN: Kneser-Ney                          512          9M/21M        75.0     70.7      86.1            82.1
VS: linear interpolation                512          9M/21M        76.3     71.2      84.0            79.6
VS: Kneser-Ney                          512          9M/21M        74.5     70.6      84.9            80.9
VS: elementwise                         512          9M/21M        70.5     66.8      -               -
Vanilla LSTM: untied                    1024         37M/85M       90.3     85.5      97.6            91.9
Vanilla LSTM: tied                      1024         27M/50M       75.9     70.2      85.2            81.0
DN: linear interpolation                1024         27M/50M       75.5     70.2      84.3            80.1
DN: Kneser-Ney                          1024         27M/50M       71.4     67.3      81.9            78.3
VS: linear interpolation                1024         27M/50M       71.7     67.8      80.5            76.6
VS: Kneser-Ney                          1024         27M/50M       70.8     66.9      80.9            76.7
VS: elementwise                         1024         27M/50M       68.6     64.5      -               -
5.3 Results
Our results are summarized in Table 2. Consistent with previous work on tying the input and output embedding matrices in language models (Inan et al., 2017; Melis et al., 2018b; Merity et al., 2018), we see a large reduction in perplexity (lower is better) when doing so. While our numbers are generally better, the results are also consistent with those of Xie et al. (2017), who show that linear interpolation data noising is slightly better than a vanilla LSTM with dropout, and that Kneser-Ney data noising outperforms these methods for both the medium (512) and large (1024) models.
Variational smoothing improves over data noising in all cases, both for linear interpolation and Kneser-Ney. Recall that the main differences between variational smoothing and data noising are: (1) using the mean at test time, (2) having a data dependent regularization coefficient that comes from the KL term, [Footnote 5: The data dependent coefficient penalizes more heavily the vectors that are sampled more often.] and (3) how each method interacts with the input and output embedding tying mechanism. [Footnote 6: In the variational framework, if we sample one matrix for the input and output embeddings, it effectively noises the output words even for linear interpolation. If we sample two matrices from the same base matrix, these matrices can be different at training time even if the parameters are tied. As described in §4.3, we use the latter in our experiments.] Our results suggest that the choice of the proposal distribution to sample from is less important for variational smoothing. In our experiments, Kneser-Ney outperforms linear interpolation on PTB but linear interpolation is slightly better on WikiText-2.
Elementwise variational smoothing performs the best for both small and large LSTM models on PTB. We note that this improvement comes at a cost, since this method is computationally expensive. [Footnote 7: Elementwise dropout can be implemented efficiently by sampling a mask (zero or one with some probability) and multiplying the entire embedding matrix with this mask. In elementwise smoothing, we need to sample an index for each dimension and reconstruct the embedding matrix for each timestep.] It took about one day to train the smaller model, as opposed to a couple of hours without elementwise smoothing. As a result, we were unable to train this on a much bigger dataset such as WikiText-2. Nonetheless, our results show that applying smoothing in the embedding (latent) space results in better models.
5.4 Discussions
Sampling at test time.
In order to better understand another prediction method for these models, we perform experiments where we sample at test time for both data noising and variational smoothing (instead of taking the mode or the mean). We use twenty samples and average the log likelihood. Our results agree with Xie et al. (2017), who mentioned that sampling does not provide additional benefits. In our experiments, the perplexities of the 512- and 1024-dimensional DN-Kneser-Ney models increase to 96.5 and 85.1 (from 75.0 and 71.4) on the PTB validation set. For the VS-Kneser-Ney models, the perplexities increase to 89.8 and 78.7 (from 74.5 and 70.8). Both our results and those of Xie et al. (2017) suggest that introducing data dependent noise at test time is detrimental for recurrent neural network language models.
Sensitivity to $\gamma_0$.
We evaluate the sensitivity of variational smoothing to the hyperparameter $\gamma_0$. Figure 1 shows perplexities on the PTB validation set for a variant of our models. The model generally performs well when $\gamma_0$ is between 0.1 and 0.3, but it becomes progressively worse as we increase $\gamma_0$ since there is too much noise.
Other applications.
While we focus on language modeling in this paper, the proposed technique is applicable to other language processing tasks. For example, Xie et al. (2017) showed that data smoothing improves machine translation, so our techniques can be used in that setup as well. It is also interesting to consider how variational smoothing interacts with and/or can be applied to memory augmented language models (Tran et al., 2016; Grave et al., 2017; Yogatama et al., 2018) and state-of-the-art language models (Yang et al., 2017). We leave these for future work.
6 Conclusion
We showed that data noising in recurrent neural network language models can be understood as a Bayesian recurrent neural network with a variational distribution that consists of a mixture of Gaussians whose mixture weights are a function of the proposal distribution used in data noising (e.g., the unigram distribution, the Kneser-Ney continuation distribution). We proposed using the mean of the variational distribution at prediction time as a better alternative to using the mode. We combined it with variational dropout, presented two extensions (i.e., variational smoothing with tied input and output embedding matrices and elementwise smoothing), and demonstrated language modeling improvements on Penn Treebank and WikiText-2.
References
Chen & Goodman (1996) Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In Proc. of ACL, 1996.
Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. of EMNLP, 2014.
Dai & Le (2015) Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Proc. of NIPS, 2015.
Gal & Ghahramani (2016a) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proc. of ICML, 2016a.
Gal & Ghahramani (2016b) Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Proc. of NIPS, 2016b.
Grave et al. (2017) Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. In Proc. of ICLR, 2017.
Hinton (2012) Geoffrey Hinton. Neural networks for machine learning, 2012. Lecture 6.5.
Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Hoerl & Kennard (1970) Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
Inan et al. (2017) Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. In Proc. of ICLR, 2017.
Iyyer et al. (2015) Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daume III. Deep unordered composition rivals syntactic methods for text classification. In Proc. of ACL, 2015.
Kumar et al. (2016) Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In Proc. of ICML, 2016.
Marcus et al. (1994) Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The Penn Treebank: Annotating predicate argument structure. In Proc. of the Workshop on Human Language Technology, 1994.
Melis et al. (2018a) Gabor Melis, Charles Blundell, Tomas Kocisky, Karl Moritz Hermann, Chris Dyer, and Phil Blunsom. Pushing the bounds of dropout. arXiv preprint, 2018a.
Melis et al. (2018b) Gabor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. In Proc. of ICLR, 2018b.
Merity et al. (2017) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In Proc. of ICLR, 2017.
Merity et al. (2018) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. In Proc. of ICLR, 2018.
Press & Wolf (2017) Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 157–163. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/E172025.
Semeniuta et al. (2016) Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. In Proc. of COLING, 2016.
Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
Tran et al. (2016) Ke Tran, Arianna Bisazza, and Christof Monz. Recurrent memory networks for language modeling. In Proc. of NAACL-HLT, 2016.
Xie et al. (2017) Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Levy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. Data noising as smoothing in neural network language models. In Proc. of ICLR, 2017.
Yang et al. (2017) Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953, 2017.
Yogatama et al. (2018) Dani Yogatama, Yishu Miao, Gabor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, and Phil Blunsom. Memory architectures in recurrent neural network language models. In Proc. of ICLR, 2018.