Noisin: Unbiased Regularization for Recurrent Neural Networks

05/03/2018, by Adji B. Dieng et al.

Recurrent neural networks (RNNs) are powerful models of sequential data. They have been successfully used in domains such as text and speech. However, RNNs are susceptible to overfitting; regularization is important. In this paper we develop Noisin, a new method for regularizing RNNs. Noisin injects random noise into the hidden states of the RNN and then maximizes the corresponding marginal likelihood of the data. We show how Noisin applies to any RNN and we study many different types of noise. Noisin is unbiased--it preserves the underlying RNN on average. We characterize how Noisin regularizes its RNN both theoretically and empirically. On language modeling benchmarks, Noisin improves over dropout by as much as 12.2%. We also compared the state-of-the-art language model of Yang et al. (2017), both with and without Noisin. On the Penn Treebank, the method with Noisin more quickly reaches state-of-the-art performance.


1 Introduction

RNNs are powerful models of sequential data (Robinson & Fallside, 1987; Werbos, 1988; Williams, 1989; Elman, 1990; Pearlmutter, 1995). RNNs have achieved state-of-the-art results on many tasks, including language modeling (Mikolov & Zweig, 2012; Yang et al., 2017), text generation (Graves, 2013), image generation (Gregor et al., 2015), speech recognition (Graves et al., 2013; Chiu et al., 2017), and machine translation (Sutskever et al., 2014; Wu et al., 2016).

The main idea behind an RNN is to posit a sequence of recursively defined hidden states, and then to model each observation conditional on its state. The key element of an RNN is its transition function. The transition function determines how each hidden state is a function of the previous observation and previous hidden state; it defines the underlying recursion. There are many flavors of RNNs—examples include the ERNN (Elman, 1990), the LSTM (Hochreiter & Schmidhuber, 1997), and the GRU (Cho et al., 2014). Each flavor amounts to a different way of designing and parameterizing the transition function.

We fit an RNN by maximizing the likelihood of the observations with respect to its parameters, those of the transition function and of the observation likelihood. But RNNs are very flexible and they overfit; regularization is crucial. Researchers have explored many approaches to regularizing RNNs, such as Tikhonov regularization (Bishop, 1995), dropout and its variants (Srivastava et al., 2014; Zaremba et al., 2014; Gal & Ghahramani, 2016; Wan et al., 2013), and zoneout (Krueger et al., 2016). (See the related work section below for more discussion.)

In this paper, we develop noisin, an effective new way to regularize an RNN. The idea is to inject random noise into its transition function and then to fit its parameters to maximize the corresponding marginal likelihood of the observations. We can easily apply noisin to any flavor of RNN and we can use many types of noise.

Figure 1 demonstrates how an RNN can overfit and how noisin can help. The plot involves a language modeling task where the RNN models a sequence of words. The horizontal axis is epochs of training; the vertical axis is perplexity, which is an assessment of model fitness (lower numbers are better). The figure shows how the model fits both the training set and the validation set. As training proceeds, the vanilla RNN improves its fit to the training set but its performance on the validation set degrades—it overfits. The performance of the RNN with noisin continues to improve on both the training set and the validation set.

noisin regularizes the RNN by smoothing its loss, averaging over local neighborhoods of the transition function. Further, noisin requires that the noise-injected transition function be unbiased. This means that, on average, it preserves the transition function of the original RNN.

With this requirement, we show that noisin provides explicit regularization, i.e., it is equivalent to fitting the usual RNN loss plus a penalty function of its parameters. We can characterize the penalty as a function of the variance of the noise. Intuitively, it penalizes the components of the model that are sensitive to noise; this induces robustness to how future data may be different from the observations.

We examine noisin with the LSTM and the LSTM with dropout, which we call the dropout-LSTM, and we explore several types of noise distributions. We study performance with two benchmark datasets on a language modeling task. noisin improves over the LSTM on both the Penn Treebank and the Wikitext-2 datasets; it also improves over the dropout-LSTM on both datasets.

Figure 1: Training and validation perplexity for the deterministic RNN and the RNN regularized with noisin. The settings were the same for both. We used additive Gaussian noise on an ERNN with sigmoid activations and one hidden layer. The RNN overfits after only five epochs, even as its training loss continues to decrease. This is not the case for the RNN regularized with noisin.

Related work.  Many techniques have been developed to address overfitting in RNNs. The most traditional regularization technique is weight decay (the L1 and L2 penalties). However, Pascanu et al. (2013) showed that such simple regularizers prevent the RNN from learning long-range dependencies.

Another technique for regularizing RNNs is to normalize the hidden states or the observations (Ioffe & Szegedy, 2015; Ba et al., 2016; Cooijmans et al., 2016). Though powerful, this class of approaches can be expensive.

Other types of regularization, including what we study in this paper, involve auxiliary noise variables. The most successful noise-based regularizer for neural networks is dropout (Srivastava et al., 2014; Wager et al., 2013; Noh et al., 2017). Dropout has been adapted to RNNs by only pruning their input and output matrices (Zaremba et al., 2014) or by putting judiciously chosen priors on all the weights and applying variational methods (Gal & Ghahramani, 2016). Still other noise-based regularization prunes the network by dropping updates to the hidden units of the RNN  (Krueger et al., 2016; Semeniuta et al., 2016). More recently Merity et al. (2017) extended these techniques.

Involving noise variables in RNNs has been used in contexts other than regularization. For example Jim et al. (1996) analyze the impact of noise on convergence and long-term dependencies. Other work introduces auxiliary latent variables that enable RNNs to capture the high variability of complex sequential data such as music, audio, and text (Bayer & Osendorfer, 2014; Chung et al., 2015; Fraccaro et al., 2016; Goyal et al., 2017).

2 Recurrent Neural Networks

Likelihood                        Log normalizer A(η)     Hessian A''(η)
Bernoulli (binary data)           log(1 + exp(η))         exp(η) / (1 + exp(η))²
Gaussian (real-valued data)       σ²η²/2                  σ²
Poisson (count data)              exp(η)                  exp(η)
Categorical (categorical data)    logsumexp(η)            diag(π) − π πᵀ
Table 1: Expression for the log normalizer and its Hessian for different likelihoods. Here σ² is the observation variance in the Gaussian case and π = softmax(η) in the categorical case.

Consider a sequence of observations, x_1, …, x_T. An RNN factorizes its joint distribution according to the chain rule of probability,

p( x_1, …, x_T ) = ∏_{t=1}^{T} p( x_t | x_1, …, x_{t−1} ).    (1)

To capture dependencies, the RNN expresses each conditional probability as a function of a low-dimensional recurrent hidden state,

p( x_t | x_1, …, x_{t−1} ) = p( x_t | h_t ).

The likelihood can be of any form. We focus on the exponential family

p( x_t | h_t ) = ν( x_t ) exp( x_tᵀ η_t − A( η_t ) ),   with   η_t = W h_t,    (2)

where ν(·) is the base measure, η_t is the natural parameter—a linear function of the hidden state h_t—and A(·) is the log-normalizer. The matrix W is called the prediction or output matrix of the RNN.
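To make Eq. 2 concrete for the categorical case used in language modeling, the following small sketch (ours, in PyTorch; the names W, h_t, and x_t follow the notation above) computes log p(x_t | h_t) with the logsumexp log-normalizer of Table 1:

    import torch

    def categorical_log_likelihood(W, h_t, x_t):
        # Natural parameter: a linear function of the hidden state (Eq. 2).
        eta = W @ h_t
        # Exponential family log-likelihood with A(eta) = logsumexp(eta).
        return eta[x_t] - torch.logsumexp(eta, dim=0)

This is exactly the per-step objective that standard cross-entropy training of an RNN language model maximizes.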

The hidden state h_t at time t is a parametric function f_θ( x_{t−1}, h_{t−1} ) of the previous hidden state h_{t−1} and the previous observation x_{t−1}; the parameters θ are shared across all time steps. The function f_θ is the transition function of the RNN; it defines a recurrence relation for the hidden states and renders h_t a function of all the past observations x_1, …, x_{t−1}; these properties match the chain rule decomposition in Eq. 1.

The particular form of the transition function f_θ determines the RNN. Researchers have designed many flavors, including the LSTM and the GRU (Hochreiter & Schmidhuber, 1997; Cho et al., 2014). In this paper we will study the LSTM. However, the methods we develop can be applied to all types of RNNs.

LSTM.  We now describe the LSTM, a variant of the RNN that we study in Section 5. The LSTM is built from the simpler ERNN (Elman, 1990). In an ERNN, the transition function is

h_t = φ( W_h h_{t−1} + W_x x_{t−1} ),

where we dropped an intercept term to avoid cluttered notation. Here, W_h is called the recurrent weight matrix and W_x is called the embedding matrix or input matrix. The function φ(·) is called an activation or squashing function, which stabilizes the transition dynamics by bounding the hidden state. Typical choices for the squashing function include the sigmoid and the hyperbolic tangent.
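As a minimal illustration of the transition function (our sketch; the weight names W_h and W_x and the choice of tanh are ours), one ERNN step is just:

    import torch

    def ernn_step(x_prev, h_prev, W_h, W_x):
        # h_t = phi(W_h h_{t-1} + W_x x_{t-1}), with phi = tanh and the intercept dropped.
        return torch.tanh(h_prev @ W_h + x_prev @ W_x)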

The LSTM was designed to avoid optimization issues, such as vanishing (or exploding) gradients. Its transition function composes four ERNNs, three with sigmoid activations and one with a tanh activation:

i_t = σ( W_i h_{t−1} + U_i x_{t−1} )    (3)
f_t = σ( W_f h_{t−1} + U_f x_{t−1} )    (4)
o_t = σ( W_o h_{t−1} + U_o x_{t−1} )    (5)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh( W_c h_{t−1} + U_c x_{t−1} )    (6)
h_t = o_t ⊙ tanh( c_t )    (7)

The idea is that the memory cell c_t captures long-term dependencies (Hochreiter & Schmidhuber, 1997).
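The following sketch spells out Eqs. 3-7 with explicit tensor operations; the names for the gate weights are ours and biases are omitted as in the ERNN above. It is the standard LSTM cell rather than the paper's exact code:

    import torch

    def lstm_step(x_prev, h_prev, c_prev, params):
        W_i, U_i, W_f, U_f, W_o, U_o, W_c, U_c = params
        i = torch.sigmoid(h_prev @ W_i + x_prev @ U_i)                 # input gate (Eq. 3)
        f = torch.sigmoid(h_prev @ W_f + x_prev @ U_f)                 # forget gate (Eq. 4)
        o = torch.sigmoid(h_prev @ W_o + x_prev @ U_o)                 # output gate (Eq. 5)
        c = f * c_prev + i * torch.tanh(h_prev @ W_c + x_prev @ U_c)   # memory cell (Eq. 6)
        h = o * torch.tanh(c)                                          # hidden state (Eq. 7)
        return h, c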

However, LSTMs have high model complexity and, consequently, they easily memorize data. Regularization is crucial. In the next section, we develop a new regularization method for RNNs called noisin.

3 Noise-Injected RNNs

noisin is built from noise-injected RNNs. These are RNNs whose hidden states are computed using auxiliary noise variables. There are several advantages to injecting noise into the hidden states of an RNN. For example, it prevents the dimensions of the hidden states from co-adapting and forces individual units to capture useful features.

We define noise-injected RNNs as any RNN following the generative process

ε_t ∼ q( ε_t ; μ, γ )    (8)
h̃_t = g_θ( x_{t−1}, h̃_{t−1}, ε_t )    (9)
x_t ∼ p( x_t | h̃_t ),    (10)

where the likelihood p( x_t | h̃_t ) is an exponential family as in Eq. 2. The noise variables ε_t are drawn from a distribution q with mean μ and scale γ. For example, q can be a zero-mean Gaussian with variance γ². We will study many types of noise distributions.

The noisy hidden state h̃_t is a parametric function of the previous observation x_{t−1}, the previous noisy hidden state h̃_{t−1}, and the noise ε_t. Therefore, conditional on the noise ε_{1:T}, the transition function g_θ defines a recurrence relation on h̃_t.

The function g_θ determines the noise-injected RNN. In this paper, we propose functions g_θ that meet the criterion described below.

Unbiased noise injection.  Injecting noise at each time step limits the amount of information carried by the hidden states. In limiting their capacity, noise injection is a form of regularization. In Section 4 we show that noise injection under exponential family likelihoods corresponds to explicit regularization under an unbiasedness condition.

We define two flavors of unbiasedness: strong unbiasedness and weak unbiasedness. Let h̃_t denote the unrolled recurrence at time t; it is a random variable via the noise ε_{1:t}. Under the strong unbiasedness condition, the transition function g_θ must satisfy the relationship

E_{ε_{1:t}} [ h̃_t ] = h_t,    (11)

where h_t is the hidden state of the underlying RNN. This is satisfied by injecting the noise at the last layer of the RNN. Weak unbiasedness imposes a looser constraint. Under weak unbiasedness, g_θ must satisfy

E_{ε_t} [ g_θ( x_{t−1}, h̃_{t−1}, ε_t ) ] = f_θ( x_{t−1}, h̃_{t−1} ),    (12)

where f_θ is the transition function of the underlying RNN. What weak unbiasedness means is that the noise should be injected in such a way that driving the noise to zero leads back to the original RNN. Two possible choices for g_θ that meet this condition are the following:

g_θ( x_{t−1}, h̃_{t−1}, ε_t ) = f_θ( x_{t−1}, h̃_{t−1} ) + ε_t    (13)
g_θ( x_{t−1}, h̃_{t−1}, ε_t ) = f_θ( x_{t−1}, h̃_{t−1} ) ⊙ ε_t    (14)
In Eq. 13 the noise has mean zero whereas in Eq. 14 it has mean one. These choices of g_θ correspond to additive noise and multiplicative noise, respectively. Note that f_θ can be the transition function of any RNN, including the RNN with dropout or the stochastic RNNs (Bayer & Osendorfer, 2014; Chung et al., 2015; Fraccaro et al., 2016; Goyal et al., 2017). For example, to implement unbiased noise injection with multiplicative noise for the LSTM, the only change from the original LSTM is to replace Eq. 7 with

h̃_t = o_t ⊙ tanh( c_t ) ⊙ ε_t.

Such noise-injected hidden states can be stacked to build a multi-layered noise-injected LSTM that meets the weak unbiasedness condition.
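A sketch of this change, reusing lstm_step from the earlier sketch, is shown below; the Gaussian choice of noise and the name gamma for the spread are our assumptions, and any unit-mean noise from Table 2 would do:

    import torch

    def noisy_lstm_step(x_prev, h_prev, c_prev, params, gamma=0.1):
        h, c = lstm_step(x_prev, h_prev, c_prev, params)
        # Unit-mean multiplicative noise, so the expectation of the noisy state equals
        # the deterministic output of Eq. 7 (the weak unbiasedness condition, Eq. 12).
        eps = 1.0 + gamma * torch.randn_like(h)
        return h * eps, c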

Dropout.  We now consider dropout from the perspective of unbiasedness. Consider the LSTM as described in Section 2. Applying dropout to it corresponds to injecting Bernoulli-distributed noise as follows:

i_t = σ( W_i ( ε_t^h ⊙ h_{t−1} ) + U_i ( ε_t^x ⊙ x_{t−1} ) )
f_t = σ( W_f ( ε_t^h ⊙ h_{t−1} ) + U_f ( ε_t^x ⊙ x_{t−1} ) )
o_t = σ( W_o ( ε_t^h ⊙ h_{t−1} ) + U_o ( ε_t^x ⊙ x_{t−1} ) )
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh( W_c ( ε_t^h ⊙ h_{t−1} ) + U_c ( ε_t^x ⊙ x_{t−1} ) )
h_t = o_t ⊙ tanh( c_t ),

where ε_t^x and ε_t^h are Bernoulli-distributed masks applied to the input and the hidden state. This general form of dropout encapsulates existing dropout variants. For example, when the noise variables ε_t^h are set to one we recover the variant of dropout in Zaremba et al. (2014).

Because of the nonlinearities, dropout does not meet the unbiasedness desideratum of Eq. 12: in general, E_{ε_t} [ h_t ] ≠ f_θ( x_{t−1}, h_{t−1} ), where f_θ( x_{t−1}, h_{t−1} ) is the hidden state of the LSTM as described in Section 2. Here, at each time step t, ε_t denotes the set of noise variables ε_t^x and ε_t^h.

Dropout is therefore biased and does not preserve the underlying RNN. However, dropout has been used widely and successfully in practice and has many nice properties. For example, it regularizes by acting like an ensemble method (Goodfellow et al., 2016). We study the dropout-LSTM in Section 5 as a variant of the RNN that can benefit from noisin, the method proposed in this paper.
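A quick numerical check (ours, not from the paper) illustrates the bias: even with inverted-dropout masks of mean one, the expectation of the squashed pre-activation differs from the noise-free value because tanh is nonlinear.

    import numpy as np

    rng = np.random.default_rng(0)
    z = np.array([1.5, -0.8, 0.3])                    # illustrative pre-activations
    p = 0.5                                           # keep probability
    eps = rng.binomial(1, p, size=(100000, 3)) / p    # Bernoulli masks rescaled to mean one
    print(np.tanh(z * eps).mean(axis=0))              # approx [ 0.50, -0.46, 0.27]
    print(np.tanh(z))                                 # approx [ 0.91, -0.66, 0.29]

The two rows differ, so injecting the noise inside the nonlinearity does not preserve the underlying transition on average.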

  Input: Data x_{1:T}, initial hidden state h̃_0, noise distribution q with spread γ, and learning rate ρ.
  Output: learned parameters θ and W.
  Initialize parameters θ and W
  for iteration n = 1, 2, … do
     for time step t = 1, …, T do
        Sample noise ε_t ∼ q
        Compute state h̃_t = f_θ( x_{t−1}, h̃_{t−1} ) ⊙ ε_t
     end for
     Compute loss as in Eq. 17
     Update θ
     Update W
     Change learning rate ρ according to some schedule.
  end for
Algorithm 1 noisin with multiplicative noise.
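A minimal PyTorch sketch of this procedure is given below. It assumes a model object exposing transition, output, and init_hidden (hypothetical names, not the paper's code), uses Gaussian multiplicative noise with spread gamma and one Monte Carlo sample, and omits truncation and the learning-rate schedule for brevity.

    import torch
    import torch.nn.functional as F

    def train_noisin(model, data, optimizer, n_epochs, gamma):
        # data: LongTensor of token ids with shape (T, batch).
        for epoch in range(n_epochs):
            h = model.init_hidden()
            loss = 0.0
            for t in range(1, data.size(0)):
                h = model.transition(data[t - 1], h)       # deterministic transition f_theta
                eps = 1.0 + gamma * torch.randn_like(h)    # unit-mean noise with spread gamma
                h = h * eps                                # noisy hidden state, Eq. 14
                logits = model.output(h)                   # natural parameter eta_t = W h_t
                loss = loss + F.cross_entropy(logits, data[t])  # negative log-likelihood of x_t
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()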

Unbiased noise-injection with noisin.  Deterministic RNNs are learned using truncated backpropagation through time with the maximum likelihood objective—the log likelihood of the data. Backpropagation through time builds gradients by unrolling the RNN into a feed-forward neural network and applying backpropagation (Rumelhart et al., 1988). The RNN is then optimized using gradient descent or stochastic gradient descent (Robbins & Monro, 1951).

noisin applies the same procedure to the expected log-likelihood under the injected noise,

L(θ, W) = E_q [ log p( x_1, …, x_T | h̃_1, …, h̃_T ) ].    (15)

In more detail this is

L(θ, W) = ∑_{t=1}^{T} E_q [ log p( x_t | h̃_t ) ].    (16)

Notice this objective is a Jensen bound on the marginal log-likelihood of the data,

L(θ, W) ≤ log E_q [ p( x_1, …, x_T | h̃_1, …, h̃_T ) ] = log p( x_1, …, x_T ).

The expectations in the objective of Eq. 16 are intractable due to the nonlinearities in the model and the form of the noise distribution. We approximate the objective using Monte Carlo,

L(θ, W) ≈ (1/S) ∑_{s=1}^{S} ∑_{t=1}^{T} log p( x_t | h̃_t^{(s)} ),   where the h̃_t^{(s)} are computed from noise samples ε_{1:T}^{(s)} ∼ q.

When using one sample (S = 1), the training procedure is just as easy as for the underlying RNN. The loss in this case, under the exponential family likelihood, becomes

L̂(θ, W) = −∑_{t=1}^{T} ( x_tᵀ W h̃_t − A( W h̃_t ) ) + const,    (17)

where the constant does not depend on the parameters. Algorithm 1 summarizes the procedure for multiplicative noise. The only change from traditional RNN training is in how the hidden state is updated: the noise sampling and state-update steps in the inner loop.

Table 2: Expressions for the standard noise distributions used in this paper (Bernoulli, Gamma, Gumbel, Laplace, Logistic, Beta, and Chi-Square) and their scaled versions. Here γ is the noise spread; it determines the amount of regularization. For example, it is the standard deviation for Gaussian noise and the scale parameter for Gamma noise. The constant appearing in the scaled Gumbel noise is the Euler-Mascheroni constant.

Controlling the noise level.  noisin is amenable to any RNN and any noise distribution. As with all regularization techniques, noisin comes with a free parameter that determines the amount of regularization: the spread of the noise.

Certain noise distributions have bounded variance; for example the Bernoulli and the Beta distributions. This limits the amount of regularization one can afford. To circumvent this bounded-variance issue, we rescale the noise to have unbounded variance. Table 2 shows the expression of the variance of the original noise and its scaled version for several distributions. It is the scaled noise that is used in noisin.
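The exact scalings of Table 2 are not reproduced here, but the idea can be sketched as follows: for multiplicative noise we construct unit-mean distributions whose variance γ² is free and can grow without bound. The parameter choices below are our own illustrations, not necessarily the paper's.

    import numpy as np

    def scaled_noise(dist, gamma, size, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        if dist == "gaussian":
            return 1.0 + gamma * rng.standard_normal(size)   # mean 1, variance gamma^2
        if dist == "gamma":
            # Gamma(shape=1/gamma^2, scale=gamma^2): mean 1, variance gamma^2.
            return rng.gamma(1.0 / gamma**2, gamma**2, size)
        if dist == "bernoulli":
            # Bernoulli(p)/p with p = 1/(1 + gamma^2): mean 1, variance gamma^2,
            # which grows without bound as p goes to zero.
            p = 1.0 / (1.0 + gamma**2)
            return rng.binomial(1, p, size) / p
        raise ValueError(f"unknown distribution: {dist}")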

4 Unbiased Regularization for RNNs

In Section 3, we introduced the concept of unbiasedness in the context of RNNs as a desideratum for noise injection to preserve the underlying RNN. In this section we prove unbiasedness leads to an explicit regularizer that forces the hidden states to be robust to noise.

4.1 Unbiased noise injection is explicit regularization

A valid regularizer is one that adds a nonnegative term to the risk. This section shows that unbiased noise injection with exponential family likelihoods leads to valid regularizers.

Consider the loss in Eq. 17 for an exponential family likelihood. The exponential family provides a general notation for the types of data encountered in practice: binary, count, real-valued, and categorical. Table 1 shows the expression of the log normalizer A(·) and its Hessian for these types of data. The log normalizer has many useful properties. For example, it is convex and infinitely differentiable.

Assume without loss of generality that we observe one sequence x_1, …, x_T. Consider the empirical risk function for the noise-injected RNN. It is defined as

R̃(θ, W) = −(1/T) ∑_{t=1}^{T} E_q [ x_tᵀ W h̃_t − A( W h̃_t ) ].

With a little algebra we can decompose this risk into the sum of two terms,

R̃(θ, W) = R(θ, W) + Ω(θ, W),    (18)

where R(θ, W) is the empirical risk for the underlying RNN and Ω(θ, W) is

Ω(θ, W) = (1/T) ∑_{t=1}^{T} ( E_q [ A( W h̃_t ) ] − A( W h_t ) − x_tᵀ W ( E_q [ h̃_t ] − h_t ) ).

Because the second term in Eq. 18 is not always guaranteed to be non-negative, noise injection is not explicit regularization in general. However, under the strong unbiasedness condition, this term corresponds to a valid regularization term and, via a second-order Taylor expansion, simplifies to

Ω(θ, W) ≈ (1/2T) ∑_{t=1}^{T} E_q [ ‖ W̃_t ( h̃_t − h_t ) ‖²₂ ],

where the matrix W̃_t = ( ∇²A( W h_t ) )^{1/2} W is the prediction matrix of the underlying RNN rescaled by the square root of ∇²A( W h_t )—the Hessian of the log-normalizer of the likelihood. This Hessian is also the Fisher information matrix of the RNN. We provide a detailed proof in Section 7.

noisin requires that we minimize the objective of the underlying RNN while also minimizing Ω(θ, W). Minimizing Ω(θ, W) induces robustness—it is equivalent to penalizing hidden units that are too sensitive to noise.
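As a sanity check of this decomposition (our example, not from the paper), one can verify numerically for a single time step with a Bernoulli likelihood, where A(η) = log(1 + exp(η)), that unbiased multiplicative noise adds a nonnegative penalty that is close to its second-order Taylor approximation:

    import numpy as np

    rng = np.random.default_rng(1)
    w, h, x, gamma = 1.3, 0.7, 1.0, 0.5               # prediction weight, hidden state, observation, spread
    eps = 1.0 + gamma * rng.standard_normal(500000)   # unit-mean noise (strong unbiasedness holds)

    def A(eta):                                       # Bernoulli log-normalizer
        return np.log1p(np.exp(eta))

    risk_clean = -(x * w * h - A(w * h))
    risk_noisy = -np.mean(x * w * h * eps - A(w * h * eps))
    omega = risk_noisy - risk_clean                   # the penalty; nonnegative by Jensen's inequality

    sig = 1.0 / (1.0 + np.exp(-w * h))
    taylor = 0.5 * (gamma * w * h) ** 2 * sig * (1.0 - sig)
    print(omega, taylor)                              # both approx 0.021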

4.2 Connections

In this section, we give intuition for how noisin ties to ensemble methods and empirical Bayes.

The ensemble method perspective.  noisin can be interpreted as an ensemble method. The objective in Eq. 16 corresponds to averaging the predictions of infinitely many RNNs at each time step in the sequence. This is known as an ensemble method and has a regularization effect (Poggio et al., 2002). However, ensemble methods are costly because they require training all the sub-models in the ensemble. With noisin, at each time step in the sequence, one of the infinitely many RNNs is trained and, because of parameter sharing, the RNN trained at the next time step uses better settings of the weights. This makes training the whole model efficient. (See Algorithm 1.)

The empirical Bayes perspective.  Consider a noise-injected RNN. We write its joint distribution as

p( x_{1:T}, h̃_{1:T} ) = ∏_{t=1}^{T} p( x_t | h̃_t ; W ) p( h̃_t | h̃_{t−1}, x_{t−1} ; θ ).

Here p( x_t | h̃_t ; W ) denotes the likelihood and p( h̃_t | h̃_{t−1}, x_{t−1} ; θ ) is the prior over the noisy hidden states; it is parameterized by the weights θ. From the perspective of Bayesian inference this is an unknown prior. When we optimize the objective in Eq. 16, we are learning the weights θ. This is equivalent to learning the prior over the noisy hidden states and is known as empirical Bayes (Robbins, 1964). It consists of obtaining point estimates of the prior parameters in a hierarchical model and using those point estimates to define the prior.

5 Empirical Study

LSTM + noisin (word-level perplexity on the Penn Treebank):
Method: Medium Dev, Medium Test, Large Dev, Large Test
None
Gaussian
Logistic
Laplace
Gamma
Bernoulli: 75.7, 71.4, 72.8, 68.3
Gumbel
Beta
Chi-Square

dropout-LSTM + noisin (word-level perplexity on the Penn Treebank):
Method: Medium Dev, Medium Test, Large Dev, Large Test
Dropout (D)
D + Gaussian: 70.0, 66.1
D + Logistic
D + Laplace
D + Gamma
D + Bernoulli: 70.0, 66.1
D + Gumbel
D + Beta: 73.0, 69.2
D + Chi-Square
Table 3: noisin improves the performance of the LSTM and the dropout-LSTM on the Penn Treebank dataset. This table shows word-level perplexity scores on the medium and large settings for both the validation (or dev) and the test set.
LSTM + noisin (word-level perplexity on Wikitext-2):
Method: Medium Dev, Medium Test, Large Dev, Large Test
None
Gaussian
Logistic
Laplace
Gamma
Bernoulli
Gumbel
Beta: 91.1, 87.2, 86.9, 82.9
Chi-Square

dropout-LSTM + noisin (word-level perplexity on Wikitext-2):
Method: Medium Dev, Medium Test, Large Dev, Large Test
Dropout (D)
D + Gaussian
D + Logistic
D + Laplace: 85.6, 82.1
D + Gamma
D + Bernoulli: 80.8, 76.8
D + Gumbel
D + Beta
D + Chi-Square
Table 4: noisin improves the performance of the LSTM and the dropout-LSTM on the Wikitext-2 dataset. This table shows word-level perplexity scores on the medium and large settings for both the validation (or dev) and the test set. D is short for dropout; "D + distribution" refers to noisin applied to the dropout-LSTM with the specified noise distribution.
Model: # Parameters, Dev, Test
(Zaremba et al., 2014) - LSTM
(Gal & Ghahramani, 2016) - Variational LSTM (MC)
(Merity et al., 2016) - Pointer Sentinel-LSTM
(Grave et al., 2016) - LSTM + continuous cache pointer
(Inan et al., 2016) - Tied Variational LSTM + augmented loss
(Zilly et al., 2016) - Variational RHN
(Melis et al., 2017) - 2-layer skip connection LSTM
(Merity et al., 2017) - AWD-LSTM + continuous cache pointer
(Krause et al., 2017) - AWD-LSTM + dynamic evaluation
(Yang et al., 2017) - AWD-LSTM-MoS + dynamic evaluation: 48.3, 47.7
(This paper) - AWD-LSTM-MoS + noisin + dynamic evaluation: 48.4, 47.6
Table 5: When applied to the model of Yang et al. (2017), noisin reaches the same state-of-the-art perplexity on the Penn Treebank in fewer training epochs.

The noisin result uses multiplicative Gamma-distributed noise.

We presented noisin, a method that relies on unbiased noise injection to regularize any RNN. noisin is simple and can be integrated with any existing RNN-based model. In this section, we focus on applying noisin to the LSTM and the dropout-LSTM. We use language modeling as a testbed. Regularization is crucial in language modeling because the input and prediction matrices scale linearly with the size of the vocabulary. This results in networks with very high capacity.

We used noisin under two noise regimes: additive noise and multiplicative noise. We found that additive noise uniformly performs worse than multiplicative noise for the LSTM. We therefore report results only on multiplicative noise.

We used noisin with several noise distributions: Gaussian, Logistic, Laplace, Gamma, Bernoulli, Gumbel, Beta, and Chi-Square. We found that, overall, the only property of these distributions that matters is their variance. The variance determines the amount of regularization for noisin; it is the spread parameter γ in Algorithm 1. We outlined in Section 3 how to set the noise level for a given distribution so as to benefit from unbounded variance.

We also found that these distributions, when used with noisin on the LSTM, perform better than the dropout-LSTM on the Penn Treebank.

Another interesting finding is that noisin when applied to the dropout-LSTM performs better than the original dropout-LSTM.

Next we describe the two benchmark datasets used: Penn Treebank and Wikitext-2. We then provide details on the experimental settings for reproducibility. Finally, we present the results in Table 3, Table 4, and Table 5.

Penn Treebank.  The Penn Treebank portion of the Wall Street Journal (Marcus et al., 1993) is a long-standing benchmark dataset for language modeling. We use the standard split, where sections 0 to 20 (929K tokens) are used for training, sections 21 to 22 (73K tokens) for validation, and sections 23 to 24 (82K tokens) for testing (Mikolov et al., 2010). We use a vocabulary of size 10K that includes the special token unk for rare words and the end-of-sentence indicator eos.

Wikitext-2.  The Wikitext-2 dataset (Merity et al., 2016) has been recently introduced as an alternative to the Penn Treebank dataset. It is sourced from Wikipedia articles and is approximately twice the size of the Penn Treebank dataset. We use a vocabulary of size 33,278 and no further preprocessing steps.

Experimental settings.  To assess the capabilities of noisin as a regularizer on its own, we used the basic settings for RNN training (Zaremba et al., 2014). We did not use weight decay or pointers (Merity et al., 2016).

We considered two settings in our experiments: a medium-sized network and a large network. The medium-sized network has layers with hidden units each. This results in a model complexity of million parameters. The large network has layers with hidden units each. This leads to a model complexity of million parameters.

For each setting, we set the dimension of the word embeddings to match the number of hidden units in each layer. Following initialization guidelines in the literature, we initialize all embedding weights uniformly in a small interval centered at zero. All other weights were initialized uniformly in [−1/√H, 1/√H], where H is the number of hidden units in a layer. All biases were initialized to zero. We fixed the random seed for reproducibility.

We train the models using truncated backpropagation through time with averaged stochastic gradient descent (Polyak & Juditsky, 1992). The LSTM was unrolled for a fixed number of steps, and we used the same batch size for both datasets. To avoid the problem of exploding gradients, we clip the gradients to a maximum norm. The initial learning rate was the same for all experiments and was divided by a constant factor whenever the perplexity on the validation set deteriorated.

For the dropout-LSTM, dropout was applied to the input, recurrent, and output layers.

The models were implemented in PyTorch. The source code is available upon request.

Results on the Penn Treebank.  The results on the Penn Treebank are shown in Table 3. The best results for the non-regularized LSTM correspond to a small network; larger networks overfit and require regularization. In general, noisin improves any given RNN, including the dropout-LSTM. For example, noisin with multiplicative Bernoulli noise performs better than the dropout-LSTM for both the medium and large settings. noisin also improves the performance of the dropout-LSTM itself on this dataset.

Results on the Wikitext-2 dataset.  Results on the Wikitext-2 dataset are presented in Table 4. We observe the same trend as for the Penn Treebank dataset: noisin improves both the underlying LSTM and the dropout-LSTM, and it improves the generalization capabilities of the dropout-LSTM on this dataset.

6 Discussion

We proposed noisin, a simple method for regularizing RNNs. noisin injects noise into the hidden states such that the underlying RNN is preserved. noisin maximizes a lower bound of the log marginal likelihood of the data—the expected log-likelihood under the injected noise. We showed that noisin is an explicit regularizer that imposes a robustness constraint on the hidden units of the RNN. On a language modeling benchmark noisin improves the generalization capabilities of both the LSTM and the dropout-LSTM.

7 Detailed Derivations

We derive in full detail the risk of noisin and show that it can be written as the sum of the risk of the original RNN and a regularization term.

Assume without loss of generality that we observe one sequence x_1, …, x_T. The risk of a noise-injected RNN is

R̃(θ, W) = −(1/T) ∑_{t=1}^{T} E_q [ log p( x_t | h̃_t ) ].

Expand this in more detail and write η̃_t in lieu of W h̃_t to avoid cluttering of notation. Then

R̃(θ, W) = −(1/T) ∑_{t=1}^{T} E_q [ x_tᵀ η̃_t − A( η̃_t ) ] + const.

The risk for the underlying RNN—R(θ, W)—is similar when we replace η̃_t with η_t = W h_t,

R(θ, W) = −(1/T) ∑_{t=1}^{T} ( x_tᵀ η_t − A( η_t ) ) + const.

Therefore we can express the risk of noisin as a function of the risk of the underlying RNN,

R̃(θ, W) = R(θ, W) + (1/T) ∑_{t=1}^{T} ( E_q [ A( η̃_t ) ] − A( η_t ) − x_tᵀ ( E_q [ η̃_t ] − η_t ) ).

Under the strong unbiasedness condition, E_q [ η̃_t ] = W E_q [ h̃_t ] = W h_t = η_t, so the last term vanishes and

R̃(θ, W) = R(θ, W) + Ω(θ, W),   where   Ω(θ, W) = (1/T) ∑_{t=1}^{T} ( E_q [ A( η̃_t ) ] − A( η_t ) ).

Using the convexity property of the log-normalizer of exponential families and Jensen's inequality,

E_q [ A( η̃_t ) ] ≥ A( E_q [ η̃_t ] ).

Using the strong unbiasedness condition a second time we conclude A( E_q [ η̃_t ] ) = A( η_t ), and hence Ω(θ, W) ≥ 0. Therefore Ω(θ, W) is a valid regularizer. A second-order Taylor expansion of A(·) around η_t and the strong unbiasedness condition yield

Ω(θ, W) ≈ (1/2T) ∑_{t=1}^{T} E_q [ ( η̃_t − η_t )ᵀ ∇²A( η_t ) ( η̃_t − η_t ) ] = (1/2T) ∑_{t=1}^{T} E_q [ ‖ W̃_t ( h̃_t − h_t ) ‖²₂ ],

where the matrix W̃_t = ( ∇²A( η_t ) )^{1/2} W is the original prediction matrix rescaled by the square root of the Hessian of the log-normalizer, which is the Fisher information matrix of the underlying RNN. This regularization term forces the hidden units to be robust to noise. Under weak unbiasedness, the proof holds under the assumption that the true data-generating distribution is an RNN.

Acknowledgements

We thank Francisco Ruiz for presenting our paper at ICML, 2018. We thank the Princeton Institute for Computational Science and Engineering (PICSciE), the Office of Information Technology’s High Performance Computing Center and Visualization Laboratory at Princeton University for the computational resources. This work was supported by ONR N00014-15-1-2209, ONR 133691-5102004, NIH 5100481-5500001084, NSF CCF-1740833, the Alfred P. Sloan Foundation, the John Simon Guggenheim Foundation, Facebook, Amazon, and IBM.
