1 Introduction
Recurrent neural networks (RNNs) are powerful models of sequential data (Robinson & Fallside, 1987; Werbos, 1988; Williams, 1989; Elman, 1990; Pearlmutter, 1995). RNNs have achieved state-of-the-art results on many tasks, including language modeling (Mikolov & Zweig, 2012; Yang et al., 2017), sequence generation (Graves, 2013), image generation (Gregor et al., 2015), speech recognition (Graves et al., 2013; Chiu et al., 2017), and machine translation (Sutskever et al., 2014; Wu et al., 2016).
The main idea behind an RNN is to posit a sequence of recursively defined hidden states, and then to model each observation conditional on its state. The key element of an RNN is its transition function. The transition function determines how each hidden state is a function of the previous observation and previous hidden state; it defines the underlying recursion. There are many flavors of RNNs; examples include the Elman RNN, or ERNN (Elman, 1990), the LSTM (Hochreiter & Schmidhuber, 1997), and the GRU (Cho et al., 2014). Each flavor amounts to a different way of designing and parameterizing the transition function.
We fit an RNN by maximizing the likelihood of the observations with respect to its parameters, those of the transition function and of the observation likelihood. But RNNs are very flexible and they overfit; regularization is crucial. Researchers have explored many approaches to regularizing RNNs, such as Tikhonov regularization (Bishop, 1995), dropout and its variants (Srivastava et al., 2014; Zaremba et al., 2014; Gal & Ghahramani, 2016; Wan et al., 2013), and zoneout (Krueger et al., 2016). (See the related work section below for more discussion.)
In this paper, we develop noisin, an effective new way to regularize an RNN. The idea is to inject random noise into its transition function and then to fit its parameters to maximize the corresponding marginal likelihood of the observations. We can easily apply noisin to any flavor of RNN and we can use many types of noise.
Figure 1 demonstrates how an RNN can overfit and how noisin can help. The plot involves a language modeling task where the RNN models a sequence of words. The horizontal axis is epochs of training; the vertical axis is perplexity, which is an assessment of model fitness (lower numbers are better). The figure shows how the model fits to both the training set and the validation set. As training proceeds, the vanilla RNN improves its fit to the training set but its performance on the validation set degrades: it overfits. The performance of the RNN with noisin continues to improve on both the training set and the validation set.
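To make the vertical axis concrete, here is a minimal sketch of how perplexity is usually computed from per-token log-probabilities: the exponential of the average negative log-likelihood. The function name `perplexity` and the toy inputs are illustrative assumptions, not taken from the paper's code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy check: a model that gives every token probability 1/4
# is as uncertain as a uniform choice among 4 words.
uniform_log_probs = [math.log(0.25)] * 10
ppl = perplexity(uniform_log_probs)
```

Lower perplexity means the model assigns higher probability to the held-out words, which is why an increasing validation curve signals overfitting.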
noisin regularizes the RNN by smoothing its loss, averaging over local neighborhoods of the transition function. Further, noisin requires that the noise-injected transition function be unbiased. This means that, on average, it preserves the transition function of the original RNN.
With this requirement, we show that noisin provides explicit regularization, i.e., it is equivalent to fitting the usual RNN loss plus a penalty function of its parameters. We can characterize the penalty as a function of the variance of the noise. Intuitively, it penalizes the components of the model that are sensitive to noise; this induces robustness to how future data may be different from the observations.
We examine noisin with the LSTM and the LSTM with dropout, which we call the dropout-LSTM, and we explore several types of noise distributions. We study performance with two benchmark datasets on a language modeling task. noisin improves over the LSTM by as much as on the Penn Treebank dataset and on the Wikitext-2 dataset; it improves over the dropout-LSTM by as much as on the Penn Treebank and on Wikitext-2.
Related work. Many techniques have been developed to address overfitting in RNNs. The most traditional regularization technique is weight decay ($L_1$ and $L_2$). However, Pascanu et al. (2013) showed that such simple regularizers prevent the RNN from learning long-range dependencies.
Another technique for regularizing RNNs is to normalize the hidden states or the observations (Ioffe & Szegedy, 2015; Ba et al., 2016; Cooijmans et al., 2016). Though powerful, this class of approaches can be expensive.
Other types of regularization, including what we study in this paper, involve auxiliary noise variables. The most successful noise-based regularizer for neural networks is dropout (Srivastava et al., 2014; Wager et al., 2013; Noh et al., 2017). Dropout has been adapted to RNNs by only pruning their input and output matrices (Zaremba et al., 2014) or by putting judiciously chosen priors on all the weights and applying variational methods (Gal & Ghahramani, 2016). Still other noise-based regularization prunes the network by dropping updates to the hidden units of the RNN (Krueger et al., 2016; Semeniuta et al., 2016). More recently, Merity et al. (2017) extended these techniques.
Noise variables have also been used in RNNs in contexts other than regularization. For example, Jim et al. (1996) analyze the impact of noise on convergence and long-term dependencies. Other work introduces auxiliary latent variables that enable RNNs to capture the high variability of complex sequential data such as music, audio, and text (Bayer & Osendorfer, 2014; Chung et al., 2015; Fraccaro et al., 2016; Goyal et al., 2017).
2 Recurrent Neural Networks
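Likelihood  Log-normalizer $A(\eta)$
Bernoulli (binary data)  $\log(1 + \exp(\eta))$
Gaussian (real-valued data, unit variance)  $\eta^2 / 2$
Poisson (count data)  $\exp(\eta)$
Categorical (categorical data)  $\mathrm{logsumexp}(\eta)$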
Consider a sequence of observations, $x_{1:T} = (x_1, \ldots, x_T)$. An RNN factorizes its joint distribution according to the chain rule of probability,
$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}). \tag{1}$$
To capture dependencies, the RNN expresses each conditional probability as a function of a low-dimensional recurrent hidden state,
$$h_t = f_\theta(x_{t-1}, h_{t-1}), \qquad p(x_t \mid x_{1:t-1}) = p(x_t \mid h_t).$$
The likelihood $p(x_t \mid h_t)$ can be of any form. We focus on the exponential family,
$$p(x_t \mid h_t) = \nu(x_t) \exp\left\{ (V^\top h_t)^\top x_t - A(V^\top h_t) \right\}, \tag{2}$$
where $\nu(\cdot)$ is the base measure, $V^\top h_t$ is the natural parameter (a linear function of the hidden state $h_t$), and $A(V^\top h_t)$ is the log-normalizer. The matrix $V$ is called the prediction or output matrix of the RNN.
The hidden state $h_t$ at time $t$ is a parametric function $f_\theta(x_{t-1}, h_{t-1})$ of the previous hidden state $h_{t-1}$ and the previous observation $x_{t-1}$; the parameters $\theta$ are shared across all time steps. The function $f_\theta$ is the transition function of the RNN. It defines a recurrence relation for the hidden states and renders $h_t$ a function of all the past observations $x_{1:t-1}$; these properties match the chain rule decomposition in Eq. 1.
The particular form of $f_\theta$ determines the RNN. Researchers have designed many flavors, including the LSTM and the GRU (Hochreiter & Schmidhuber, 1997; Cho et al., 2014). In this paper we will study the LSTM. However, the methods we develop can be applied to all types of RNNs.
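As an illustration of the exponential family likelihood, the sketch below computes the natural parameter as a linear map of a hidden state and evaluates a categorical log-probability as "natural parameter at the observed class minus log-normalizer". The log-normalizers are the standard textbook forms (the Gaussian entry assumes unit variance); the names `natural_parameter`, `categorical_log_prob`, and the toy matrix `V` are hypothetical, chosen only for this example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


# Standard log-normalizers A(eta) for scalar natural parameters.
LOG_NORMALIZERS = {
    "bernoulli": lambda eta: math.log1p(math.exp(eta)),
    "gaussian": lambda eta: 0.5 * eta * eta,  # unit variance assumed
    "poisson": lambda eta: math.exp(eta),
}


def natural_parameter(V, h):
    """eta = V^T h, with V stored as hidden_dim rows of output_dim entries."""
    return [sum(V[k][j] * h[k] for k in range(len(h))) for j in range(len(V[0]))]


def categorical_log_prob(label, eta):
    """log p(x | eta) = eta[label] - logsumexp(eta); the base measure is 1."""
    return eta[label] - logsumexp(eta)


V = [[0.5, -1.0, 0.0], [0.2, 0.3, -0.4]]  # hypothetical 2 x 3 prediction matrix
h = [1.0, -2.0]                            # hypothetical hidden state
eta = natural_parameter(V, h)
class_probs = [math.exp(categorical_log_prob(c, eta)) for c in range(3)]
# Bernoulli: exp(eta * 1 - A(eta)) is the sigmoid of eta.
bernoulli_p1 = math.exp(0.7 - LOG_NORMALIZERS["bernoulli"](0.7))
```

Subtracting the log-normalizer is what makes the class probabilities sum to one, whatever the hidden state is.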
LSTM. We now describe the LSTM, a variant of RNN that we study in Section 5. The LSTM is built from the simpler ERNN (Elman, 1990). In an ERNN, the transition function is
$$f_\theta(x_{t-1}, h_{t-1}) = s(W h_{t-1} + U x_{t-1}),$$
where we dropped an intercept term to avoid cluttered notation. Here, $W$ is called the recurrent weight matrix and $U$ is called the embedding matrix or input matrix. The function $s(\cdot)$ is called an activation or squashing function, which stabilizes the transition dynamics by bounding the hidden state. Typical choices for the squashing function include the sigmoid and the hyperbolic tangent.
The LSTM was designed to avoid optimization issues, such as vanishing (or exploding) gradients. Its transition function composes four ERNNs, three with sigmoid activations and one with a $\tanh$ activation:
$$f_t = \sigma(U_f x_{t-1} + W_f h_{t-1}) \tag{3}$$
$$i_t = \sigma(U_i x_{t-1} + W_i h_{t-1}) \tag{4}$$
$$o_t = \sigma(U_o x_{t-1} + W_o h_{t-1}) \tag{5}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(U_c x_{t-1} + W_c h_{t-1}) \tag{6}$$
$$h_t = o_t \odot \tanh(c_t) \tag{7}$$
The idea is that the memory cell $c_t$ captures long-term dependencies (Hochreiter & Schmidhuber, 1997).
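The LSTM transition can be sketched in a few lines of plain Python: three sigmoid gates and one tanh candidate, each an ERNN-style map of the previous observation and previous hidden state, followed by the memory-cell and hidden-state updates. This is a minimal sketch without intercepts; the parameter layout (a dictionary of `(U, W)` pairs keyed by `"f"`, `"i"`, `"o"`, `"c"`) and all function names are assumptions made for the example.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def preactivation(U, W, x_prev, h_prev):
    """ERNN-style linear part: U x_{t-1} + W h_{t-1} (intercept dropped)."""
    return [a + b for a, b in zip(matvec(U, x_prev), matvec(W, h_prev))]


def lstm_step(params, x_prev, h_prev, c_prev):
    """One LSTM transition; returns the new hidden state and memory cell."""
    pre = {name: preactivation(U, W, x_prev, h_prev)
           for name, (U, W) in params.items()}
    f = [sigmoid(a) for a in pre["f"]]    # forget gate
    i = [sigmoid(a) for a in pre["i"]]    # input gate
    o = [sigmoid(a) for a in pre["o"]]    # output gate
    g = [math.tanh(a) for a in pre["c"]]  # candidate cell update
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Toy usage with all-zero weights: every gate equals 0.5 and the candidate is 0.
hidden, inputs = 2, 3
zeros_U = [[0.0] * inputs for _ in range(hidden)]
zeros_W = [[0.0] * hidden for _ in range(hidden)]
params = {name: (zeros_U, zeros_W) for name in ("f", "i", "o", "c")}
h, c = lstm_step(params, [1.0, 0.0, 0.0], [0.1, -0.2], [1.0, -1.0])
```

With zero weights the memory cell is simply halved at each step, which makes the role of the forget gate easy to see.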
However, LSTMs have a high model complexity and, consequently, they easily memorize data. Regularization is crucial. In the next section, we develop a new regularization method for RNNs called noisin.
3 Noise-Injected RNNs
noisin is built from noise-injected RNNs. These are RNNs whose hidden states are computed using auxiliary noise variables. There are several advantages to injecting noise into the hidden states of RNNs. For example, it prevents the dimensions of the hidden states from co-adapting and forces individual units to capture useful features.
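We define noise-injected RNNs as any RNN following the generative process
$$\epsilon_t \sim \varphi(\cdot\,; \mu, \gamma) \tag{8}$$
$$z_t = g_\theta(x_{t-1}, z_{t-1}, \epsilon_t) \tag{9}$$
$$x_t \mid z_t \sim p(x_t \mid z_t) \tag{10}$$
where the likelihood $p(x_t \mid z_t)$ is an exponential family as in Eq. 2. The noise variables $\epsilon_{1:T}$ are drawn from a distribution $\varphi(\cdot\,; \mu, \gamma)$ with mean $\mu$ and scale $\gamma$. For example, $\varphi$ can be a zero-mean Gaussian with variance $\gamma^2$. We will study many types of noise distributions.
The noisy hidden state $z_t$ is a parametric function $g_\theta$ of the previous observation $x_{t-1}$, the previous noisy hidden state $z_{t-1}$, and the noise $\epsilon_t$. Therefore, conditional on the noise $\epsilon_{1:T}$, the transition function $g_\theta$ defines a recurrence relation on $z_{1:T}$.
The function $g_\theta$ determines the noise-injected RNN. In this paper, we propose functions $g_\theta$ that meet the criterion described below.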
Unbiased noise injection. Injecting noise at each time step limits the amount of information carried by hidden states. By limiting their capacity, noise injection acts as a form of regularization. In Section 4 we show that noise injection under exponential family likelihoods corresponds to explicit regularization when an unbiasedness condition holds.
We define two flavors of unbiasedness: strong unbiasedness and weak unbiasedness. Let $z_t(\epsilon_{1:t})$ denote the unrolled recurrence at time $t$; it is a random variable via the noise $\epsilon_{1:t}$. Under the strong unbiasedness condition, the transition function $g_\theta$ must satisfy the relationship
$$\mathbb{E}_{p(\epsilon_{1:t})}\left[ z_t(\epsilon_{1:t}) \right] = h_t, \tag{11}$$
where $h_t$ is the hidden state of the underlying RNN. This is satisfied by injecting the noise at the last layer of the RNN. Weak unbiasedness imposes a looser constraint. Under weak unbiasedness, $g_\theta$ must satisfy
$$\mathbb{E}_{p(\epsilon_t)}\left[ g_\theta(x_{t-1}, z_{t-1}, \epsilon_t) \right] = f_\theta(x_{t-1}, z_{t-1}), \tag{12}$$
where $f_\theta$ is the transition function of the underlying RNN. What weak unbiasedness means is that the noise should be injected in such a way that driving the noise to zero leads to the original RNN. Two possible choices for $g_\theta$ that meet this condition are the following:
$$g_\theta(x_{t-1}, z_{t-1}, \epsilon_t) = f_\theta(x_{t-1}, z_{t-1}) + \epsilon_t \tag{13}$$
$$g_\theta(x_{t-1}, z_{t-1}, \epsilon_t) = f_\theta(x_{t-1}, z_{t-1}) \odot \epsilon_t \tag{14}$$
In Eq. 13 the noise has mean zero whereas in Eq. 14 it has mean one. These choices of $g_\theta$ correspond to additive noise and multiplicative noise respectively. Note that $f_\theta$ can be any RNN, including the RNN with dropout or the stochastic RNNs (Bayer & Osendorfer, 2014; Chung et al., 2015; Fraccaro et al., 2016; Goyal et al., 2017). For example, to implement unbiased noise injection with multiplicative noise for the LSTM, the only change from the original LSTM is to replace Eq. 7 with
$$z_t = o_t \odot \tanh(c_t) \odot \epsilon_t,$$
where $o_t$ is the output gate and $c_t$ is the memory cell.
Such noise-injected hidden states can be stacked to build a multi-layered noise-injected LSTM that meets the weak unbiasedness condition.
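The two noise-injection schemes can be checked numerically. The sketch below applies additive zero-mean Gaussian noise and multiplicative mean-one gamma noise to a fixed transition output, and estimates the mean of the noisy state by Monte Carlo; under either scheme the estimate should stay close to the noise-free value. The helper names, the gamma parameterization (shape `1 / spread**2`, scale `spread**2`), and the sample size are assumptions made for this illustration.

```python
import random


def inject_noise(f_value, eps, mode):
    """Combine the deterministic transition output with noise, elementwise."""
    if mode == "additive":
        return [f + e for f, e in zip(f_value, eps)]
    if mode == "multiplicative":
        return [f * e for f, e in zip(f_value, eps)]
    raise ValueError(f"unknown mode: {mode}")


def sample_noise(dim, mode, spread, rng):
    """Additive: N(0, spread^2). Multiplicative: gamma with mean 1, std `spread`."""
    if mode == "additive":
        return [rng.gauss(0.0, spread) for _ in range(dim)]
    if mode == "multiplicative":
        shape = 1.0 / spread ** 2  # gamma(shape, scale=1/shape): mean 1, variance 1/shape
        return [rng.gammavariate(shape, 1.0 / shape) for _ in range(dim)]
    raise ValueError(f"unknown mode: {mode}")


def monte_carlo_mean(f_value, mode, spread, n_samples, seed=0):
    """Average the noisy state over many independent noise draws."""
    rng = random.Random(seed)
    totals = [0.0] * len(f_value)
    for _ in range(n_samples):
        eps = sample_noise(len(f_value), mode, spread, rng)
        z = inject_noise(f_value, eps, mode)
        totals = [t + zk for t, zk in zip(totals, z)]
    return [t / n_samples for t in totals]


f_value = [0.3, -0.7]  # hypothetical output of the noise-free transition
additive_mean = monte_carlo_mean(f_value, "additive", 0.5, 20000)
multiplicative_mean = monte_carlo_mean(f_value, "multiplicative", 0.5, 20000)
```

Because the noise enters after the nonlinearity, its mean (zero or one) is all that is needed for the noisy state to average back to the original transition.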
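Dropout. We now consider dropout from the perspective of unbiasedness. Consider the LSTM as described in Section 2. Applying dropout to it corresponds to injecting Bernoulli-distributed noise as follows:
$$f_t = \sigma\left(U_f (x_{t-1} \odot \epsilon^{xf}_t) + W_f (h_{t-1} \odot \epsilon^{hf}_t)\right)$$
$$i_t = \sigma\left(U_i (x_{t-1} \odot \epsilon^{xi}_t) + W_i (h_{t-1} \odot \epsilon^{hi}_t)\right)$$
$$o_t = \sigma\left(U_o (x_{t-1} \odot \epsilon^{xo}_t) + W_o (h_{t-1} \odot \epsilon^{ho}_t)\right)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(U_c (x_{t-1} \odot \epsilon^{xc}_t) + W_c (h_{t-1} \odot \epsilon^{hc}_t)\right)$$
$$z_t = o_t \odot \tanh(c_t)$$
This general form of dropout encapsulates existing dropout variants. For example, when the noise variables on the recurrent connections, $\epsilon^{hf}_t$, $\epsilon^{hi}_t$, $\epsilon^{ho}_t$, and $\epsilon^{hc}_t$, are set to one, we recover the variant of dropout in Zaremba et al. (2014).
Because of the nonlinearities, dropout does not meet the unbiasedness desideratum of Eq. 12: in general $\mathbb{E}_{p(\epsilon_t)}[z_t] \neq h_t$, where $h_t$ is the hidden state of the LSTM as described in Section 2. Here, at each time step $t$, $\epsilon_t$ denotes the set of all noise variables used at that step.
Dropout is therefore biased and does not preserve the underlying RNN. However, dropout has been widely and successfully used in practice and has many nice properties. For example, it regularizes by acting like an ensemble method (Goodfellow et al., 2016). We study the dropout-LSTM in Section 5 as a variant of RNN that can benefit from noisin, the method proposed in this paper.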
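A one-dimensional example shows why noise placed inside a nonlinearity is biased while noise placed outside is not. The sketch uses rescaled Bernoulli noise with mean one and computes both expectations exactly by enumerating the two noise values. The scalar weight, the keep probability, and the variable names are hypothetical choices for illustration.

```python
import math


def expectation(values_and_probs):
    """Exact expectation of a discrete random variable."""
    return sum(v * p for v, p in values_and_probs)


keep_prob = 0.5
# Rescaled Bernoulli noise: eps = b / keep_prob with b ~ Bernoulli(keep_prob), so E[eps] = 1.
noise = [(0.0, 1.0 - keep_prob), (1.0 / keep_prob, keep_prob)]

w, h_prev = 1.0, 1.0
clean = math.tanh(w * h_prev)  # noise-free transition output

# Dropout-style: noise enters before the nonlinearity.
inside = expectation([(math.tanh(w * h_prev * e), p) for e, p in noise])
# Multiplicative injection after the nonlinearity.
outside = expectation([(math.tanh(w * h_prev) * e, p) for e, p in noise])
```

Here `outside` equals `clean` exactly, whereas `inside` averages `tanh(0)` and `tanh(2)` and therefore differs from `tanh(1)`.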
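Unbiased noise injection with noisin. Deterministic RNNs are learned using truncated backpropagation through time with the maximum likelihood objective, that is, the log-likelihood of the data. Backpropagation through time builds gradients by unrolling the RNN into a feed-forward neural network and applying backpropagation (Rumelhart et al., 1988). The RNN is then optimized using gradient descent or stochastic gradient descent (Robbins & Monro, 1951).
noisin applies the same procedure to the expected log-likelihood under the injected noise,
$$\mathcal{L} = \mathbb{E}_{p(\epsilon_{1:T})}\left[ \log p(x_{1:T} \mid z_{1:T}(\epsilon_{1:T})) \right]. \tag{15}$$
In more detail this is
$$\mathcal{L} = \sum_{t=1}^{T} \mathbb{E}_{p(\epsilon_{1:t})}\left[ \log p(x_t \mid z_t(\epsilon_{1:t})) \right]. \tag{16}$$
Notice this objective is a Jensen bound on the marginal log-likelihood of the data,
$$\mathcal{L} \leq \log \mathbb{E}_{p(\epsilon_{1:T})}\left[ p(x_{1:T} \mid z_{1:T}(\epsilon_{1:T})) \right] = \log p(x_{1:T}).$$
The expectations in the objective of Eq. 16 are intractable due to the nonlinearities in the model and the form of the noise distribution. We approximate the objective using Monte Carlo with $K$ noise samples $\epsilon^{(k)}_{1:T}$,
$$\mathcal{L} \approx \frac{1}{K} \sum_{k=1}^{K} \sum_{t=1}^{T} \log p(x_t \mid z_t(\epsilon^{(k)}_{1:t})).$$
When using one sample ($K = 1$), the training procedure is just as easy as for the underlying RNN. The loss in this case, under the exponential family likelihood, becomes
$$\widehat{\mathcal{L}} = -\sum_{t=1}^{T} \left[ (V^\top z_t)^\top x_t - A(V^\top z_t) \right] + c, \tag{17}$$
where $c$ is a constant that does not depend on the parameters. Algorithm summarizes the procedure for multiplicative noise. The only change from traditional RNN training is when updating the hidden state in lines and .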
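The one-sample loss can be sketched for a categorical likelihood as follows. The sketch uses a simple tanh (ERNN-style) transition rather than an LSTM to stay short, multiplies the transition output by a fresh noise vector at each step, and accumulates the negative log-likelihood. It is an illustrative forward pass only, not the paper's algorithm: the function `noisy_rnn_nll`, the weight shapes, and the gamma noise with shape 4 are assumptions.

```python
import math
import random


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def noisy_rnn_nll(tokens, U, W, V, sample_eps):
    """Negative log-likelihood of tokens[1:] with multiplicative noise on the hidden state.

    U: hidden x vocab input matrix, W: hidden x hidden recurrent matrix,
    V: hidden x vocab prediction matrix, sample_eps(n): returns n noise values.
    """
    hidden = len(W)
    z = [0.0] * hidden
    nll = 0.0
    for prev_tok, tok in zip(tokens[:-1], tokens[1:]):
        # Deterministic transition on the previous token and previous noisy state.
        f = [math.tanh(U[k][prev_tok] + sum(W[k][j] * z[j] for j in range(hidden)))
             for k in range(hidden)]
        # The only change from standard training: multiply by mean-one noise.
        eps = sample_eps(hidden)
        z = [fk * ek for fk, ek in zip(f, eps)]
        logits = [sum(V[k][c] * z[k] for k in range(hidden)) for c in range(len(V[0]))]
        nll -= logits[tok] - logsumexp(logits)
    return nll


rng = random.Random(0)


def gamma_noise(n, shape=4.0):
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n)]  # mean 1


def no_noise(n):
    return [1.0] * n  # recovers the underlying deterministic RNN


tokens = [0, 2, 1, 1]  # hypothetical token ids, vocabulary of size 3
U = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
W = [[0.5, -0.3], [0.2, 0.1]]
V = [[0.3, -0.1, 0.2], [-0.4, 0.6, 0.0]]
noisy_loss = noisy_rnn_nll(tokens, U, W, V, gamma_noise)
clean_loss = noisy_rnn_nll(tokens, U, W, V, no_noise)
zero_V = [[0.0] * 3 for _ in range(2)]
uniform_loss = noisy_rnn_nll(tokens, U, W, zero_V, gamma_noise)
```

Passing `no_noise` recovers the loss of the underlying RNN, which mirrors the requirement that setting the noise to its mean gives back the original model.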
Controlling the noise level. noisin is amenable to any RNN and any noise distribution. As with all regularization techniques, noisin comes with a free parameter that determines the amount of regularization: the spread of the noise.
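One way to give any finite-variance noise a freely adjustable spread is to standardize it and then rescale. The sketch below does this for Bernoulli noise and verifies the resulting mean and variance exactly. This is one plausible rescaling written for illustration; it is an assumption and not necessarily the rescaling used in the paper, and the names `rescaled_bernoulli` and `gamma` are hypothetical.

```python
import math


def rescaled_bernoulli(p, gamma):
    """Support and probabilities of eps = 1 + gamma * (b - p) / sqrt(p (1 - p)), b ~ Bernoulli(p)."""
    std = math.sqrt(p * (1.0 - p))
    return [
        (1.0 + gamma * (0.0 - p) / std, 1.0 - p),  # b = 0
        (1.0 + gamma * (1.0 - p) / std, p),        # b = 1
    ]


def mean_and_variance(dist):
    """Exact mean and variance of a discrete distribution given as (value, prob) pairs."""
    mean = sum(v * q for v, q in dist)
    var = sum((v - mean) ** 2 * q for v, q in dist)
    return mean, var


raw_variance = 0.3 * (1.0 - 0.3)  # a plain Bernoulli(0.3) variance, never above 0.25
mean, var = mean_and_variance(rescaled_bernoulli(p=0.3, gamma=2.5))
```

The rescaled noise keeps mean one, so it remains suitable for multiplicative injection, while its variance equals `gamma ** 2` and can be made as large as desired.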
Certain noise distributions have bounded variance; for example the Bernoulli and the Beta distributions. This limits the amount of regularization one can afford. To circumvent this bounded variance issue, we rescale the noise to have unbounded variance. Table 2 shows the expression of the variance of the original noise and its scaled version for several distributions. It is the scaled noise that is used in noisin.
4 Unbiased Regularization for RNNs
In Section 3, we introduced the concept of unbiasedness in the context of RNNs as a desideratum for noise injection to preserve the underlying RNN. In this section we prove unbiasedness leads to an explicit regularizer that forces the hidden states to be robust to noise.
4.1 Unbiased noise injection is explicit regularization
A valid regularizer is one that adds a nonnegative term to the risk. This section shows that unbiased noise injection with exponential family likelihoods leads to valid regularizers.
Consider the loss in Eq. 17 for an exponential family likelihood. The exponential family provides a general notation for the types of data encountered in practice: binary, count, real-valued, and categorical. Table 1 shows the expression of the log-normalizer $A(\cdot)$ for these types of data. The log-normalizer has many useful properties. For example, it is convex and infinitely differentiable.
Assume without loss of generality that we observe one sequence $x_{1:T}$. Consider the empirical risk function for the noise-injected RNN. It is defined as
$$\mathcal{R}(z) = -\mathbb{E}_{p(\epsilon_{1:T})}\left[ \log p(x_{1:T} \mid z_{1:T}(\epsilon_{1:T})) \right].$$
With a little algebra we can decompose this risk into the sum of two terms,
$$\mathcal{R}(z) = \mathcal{R}(h) + \mathrm{Reg}, \tag{18}$$
where $\mathcal{R}(h)$ is the empirical risk for the underlying RNN and $\mathrm{Reg}$ is
$$\mathrm{Reg} = \sum_{t=1}^{T} \mathbb{E}_{p(\epsilon_{1:t})}\left[ A(V^\top z_t) - A(V^\top h_t) - (V^\top (z_t - h_t))^\top x_t \right].$$
Because the second term in Eq. 18 is not always guaranteed to be non-negative, noise injection is not explicit regularization in general. However, under the strong unbiasedness condition, this term corresponds to a valid regularization term and simplifies, to second order, to
$$\mathrm{Reg} \approx \frac{1}{2} \sum_{t=1}^{T} \mathrm{tr}\left( J_t \, \mathrm{Cov}(z_t) \, J_t^\top \right),$$
where the matrix $J_t = \nabla^2 A(V^\top h_t)^{1/2} V^\top$ is the prediction matrix $V$ of the underlying RNN rescaled by the square root of $\nabla^2 A(V^\top h_t)$, the Hessian of the log-normalizer of the likelihood. This Hessian is also the Fisher information matrix of the RNN. We provide a detailed proof in Section 7.
noisin requires that we minimize the objective of the underlying RNN while also minimizing $\mathrm{Reg}$. Minimizing $\mathrm{Reg}$ induces robustness: it is equivalent to penalizing hidden units that are too sensitive to noise.
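A scalar worked example makes the penalty tangible. For a Poisson likelihood the log-normalizer is the exponential function, and with additive zero-mean Gaussian noise on a scalar hidden state the Jensen gap has a closed form, so it can be compared with the second-order penalty. The setting (scalar weight `v`, scalar state `h`, Gaussian noise with standard deviation `sigma`) is an assumption chosen so that everything is exact; it is not taken from the paper.

```python
import math


def poisson_jensen_gap(v, h, sigma):
    """Exact E[A(v (h + eps))] - A(v h) for A = exp and eps ~ N(0, sigma^2).

    Uses E[exp(v * eps)] = exp(v^2 sigma^2 / 2) for Gaussian eps.
    """
    return math.exp(v * h) * math.expm1(0.5 * (v * sigma) ** 2)


def quadratic_penalty(v, h, sigma):
    """Second-order approximation: 0.5 * v^2 * A''(v h) * Var(eps), with A'' = exp."""
    return 0.5 * v * v * math.exp(v * h) * sigma ** 2


v, h = 0.5, 1.0  # hypothetical prediction weight and hidden state
gap_small = poisson_jensen_gap(v, h, sigma=0.2)
approx_small = quadratic_penalty(v, h, sigma=0.2)
gap_large = poisson_jensen_gap(v, h, sigma=1.0)
```

The gap is always non-negative, grows with the noise variance, and for small noise is well approximated by the quadratic term, which scales with the squared prediction weight and the curvature of the log-normalizer.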
4.2 Connections
In this section, we intuit that noisin has ties to ensemble methods and empirical Bayes.
The ensemble method perspective. noisin can be interpreted as an ensemble method. The objective in Eq. 16 corresponds to averaging the predictions of infinitely many RNNs at each time step in the sequence. This is known as an ensemble method and has a regularization effect (Poggio et al., 2002). However ensemble methods are costly as they require training all the submodels in the ensemble. With noisin, at each time step in the sequence, one of the infinitely many RNNs is trained and because of parameter sharing, the RNN being trained at the next time step will use better settings of the weights. This makes training the whole model efficient. (See Algorithm .)
The empirical Bayes perspective. Consider a noise-injected RNN. We write its joint distribution as
$$p(x_{1:T}, z_{1:T}) = \prod_{t=1}^{T} p(x_t \mid z_t) \, p(z_t \mid z_{t-1}, x_{t-1}; \theta).$$
Here $p(x_t \mid z_t)$ denotes the likelihood and $p(z_t \mid z_{t-1}, x_{t-1}; \theta)$ is the prior over the noisy hidden states; it is parameterized by the weights $\theta$. From the perspective of Bayesian inference this is an unknown prior. When we optimize the objective in Eq. 16, we are learning the weights $\theta$. This is equivalent to learning the prior over the noisy hidden states and is known as empirical Bayes (Robbins, 1964). It consists of obtaining point estimates of prior parameters in a hierarchical model and using those point estimates to define the prior.
5 Empirical Study
Medium  Large  

Method  Dev  Test  Dev  Test  
None  
Gaussian  
Logistic  
Laplace  
Gamma  
Bernoulli  75.7  71.4  72.8  68.3  
Gumbel  
Beta  
Chi 
Medium  Large  

Method  Dev  Test  Dev  Test  
Dropout (D)  
D + Gaussian  70.0  66.1  
D + Logistic  
D + Laplace  
D + Gamma  
D + Bernoulli  70.0  66.1  
D + Gumbel  
D + Beta  73.0  69.2  
D + Chi 
Medium  Large  

Method  Dev  Test  Dev  Test  
None  
Gaussian  
Logistic  
Laplace  
Gamma  
Bernoulli  
Gumbel  
Beta  91.1  87.2  86.9  82.9  
Chi 
Medium  Large  

Method  Dev  Test  Dev  Test  
Dropout (D)  
D + Gaussian  
D + Logistic  
D + Laplace  85.6  82.1  
D + Gamma  
D + Bernoulli  80.8  76.8  
D + Gumbel  
D + Beta  
D + Chi 
Model  # Parameters  Dev  Test 
(Zaremba et al., 2014)  LSTM  M  
(Gal & Ghahramani, 2016)  Variational LSTM (MC)  M  
(Merity et al., 2016)  Pointer Sentinel-LSTM  M  
(Grave et al., 2016)  LSTM + continuous cache pointer  
(Inan et al., 2016)  Tied Variational LSTM + augmented loss  M  
(Zilly et al., 2016)  Variational RHN  M  
(Melis et al., 2017)  2-layer skip connection LSTM  M  
(Merity et al., 2017)  AWD-LSTM + continuous cache pointer  M  
(Krause et al., 2017)  AWD-LSTM + dynamic evaluation  M  
(Yang et al., 2017)  AWD-LSTM-MoS + dynamic evaluation  M  48.3  47.7 
(This paper)  AWD-LSTM-MoS + noisin + dynamic evaluation  M  48.4  47.6 
Multiplicative gamma-distributed noise with shape and scale .
We presented noisin, a method that relies on unbiased noise injection to regularize any RNN. noisin is simple and can be integrated with any existing RNN-based model. In this section, we focus on applying noisin to the LSTM and the dropout-LSTM. We use language modeling as a testbed. Regularization is crucial in language modeling because the input and prediction matrices scale linearly with the size of the vocabulary. This results in networks with very high capacity.
We used noisin under two noise regimes: additive noise and multiplicative noise. We found that additive noise uniformly performs worse than multiplicative noise for the LSTM. We therefore report results only on multiplicative noise.
We used noisin with several noise distributions: Gaussian, Logistic, Laplace, Gamma, Bernoulli, Gumbel, Beta, and Chi-Square. We found that overall the only property that matters with these distributions is the variance. The variance determines the amount of regularization for noisin. It is the parameter in Algorithm . We outlined in Section 3 how to set the noise level for a given distribution so as to benefit from unbounded variance.
We also found that these distributions, when used with noisin on the LSTM, perform better than the dropout-LSTM on the Penn Treebank.
Another interesting finding is that noisin, when applied to the dropout-LSTM, performs better than the original dropout-LSTM.
Next we describe the two benchmark datasets used: Penn Treebank and Wikitext-2. We then provide details on the experimental settings for reproducibility. We finally present the results in Table 5 and Table 5.
Penn Treebank. The Penn Treebank portion of the Wall Street Journal (Marcus et al., 1993) is a long-standing benchmark dataset for language modeling. We use the standard split, where sections to ( tokens) are used for training, sections to ( tokens) for validation, and sections to ( tokens) for testing (Mikolov et al., 2010). We use a vocabulary of size that includes the special token unk for rare words and the end of sentence indicator eos.
Wikitext-2. The Wikitext-2 dataset (Merity et al., 2016) has been recently introduced as an alternative to the Penn Treebank dataset. It is sourced from Wikipedia articles and is approximately twice the size of the Penn Treebank dataset. We use a vocabulary size of and no further preprocessing steps.
Experimental settings. To assess the capabilities of noisin as a regularizer on its own, we used the basic settings for RNN training (Zaremba et al., 2014). We did not use weight decay or pointers (Merity et al., 2016).
We considered two settings in our experiments: a medium-sized network and a large network. The medium-sized network has layers with hidden units each. This results in a model complexity of million parameters. The large network has layers with hidden units each. This leads to a model complexity of million parameters.
For each setting, we set the dimension of the word embeddings to match the number of hidden units in each layer. Following initialization guidelines in the literature, we initialize all embedding weights uniformly in the interval . All other weights were initialized uniformly between where is the number of hidden units in a layer. All the biases were initialized to . We fixed the seed to for reproducibility.
We train the models using truncated backpropagation through time with averaged stochastic gradient descent (Polyak & Juditsky, 1992) for a maximum of epochs. The LSTM was unrolled for steps. We used a batch size of for both datasets. To avoid the problem of exploding gradients, we clip the gradients to a maximum norm of . We used an initial learning rate of for all experiments. This is divided by a factor of if the perplexity on the validation set deteriorates.
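The two training safeguards described here, gradient clipping by norm and learning-rate decay on validation deterioration, can be sketched in plain Python. The function names and every numeric value in the usage lines are hypothetical placeholders; they are not the settings used in the experiments.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector so its Euclidean norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm or total == 0.0:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


def next_learning_rate(lr, prev_val_ppl, val_ppl, decay_factor):
    """Divide the learning rate by decay_factor when validation perplexity gets worse."""
    return lr / decay_factor if val_ppl > prev_val_ppl else lr


# Placeholder values, for illustration only.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
lr_after_worse = next_learning_rate(1.0, prev_val_ppl=100.0, val_ppl=105.0, decay_factor=2.0)
lr_after_better = next_learning_rate(1.0, prev_val_ppl=100.0, val_ppl=95.0, decay_factor=2.0)
```

Clipping preserves the direction of the gradient and only shortens it, which is why it is a common guard against exploding gradients in recurrent models.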
For the dropout-LSTM, the values used for dropout on the input, recurrent, and output layers were respectively.
The models were implemented in PyTorch. The source code is available upon request.
Results on the Penn Treebank. The results on the Penn Treebank are illustrated in Table 5. The best results for the non-regularized LSTM correspond to a small network. This is because larger networks overfit and require regularization. In general noisin improves any given RNN including dropout-LSTM. For example noisin with multiplicative Bernoulli noise performs better than dropout RNN for both medium and large settings. noisin improves the performance of the dropout-LSTM by as much as on this dataset.
Results on the Wikitext-2 dataset. Results on the Wikitext-2 dataset are presented in Table 5. We observe the same trend as for the Penn Treebank dataset: noisin improves the underlying LSTM and dropout-LSTM. For the dropout-LSTM, it improves its generalization capabilities by as much as on this dataset.
6 Discussion
We proposed noisin, a simple method for regularizing RNNs. noisin injects noise into the hidden states such that the underlying RNN is preserved. noisin maximizes a lower bound of the log marginal likelihood of the data, namely the expected log-likelihood under the injected noise. We showed that noisin is an explicit regularizer that imposes a robustness constraint on the hidden units of the RNN. On a language modeling benchmark noisin improves the generalization capabilities of both the LSTM and the dropout-LSTM.
7 Detailed Derivations
We derive in full detail the risk of noisin and show that it can be written as the sum of the risk of the original RNN and a regularization term.
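Assume without loss of generality that we observe one sequence $x_{1:T}$. The risk of a noise-injected RNN is
$$\mathcal{R}(z) = -\mathbb{E}_{p(\epsilon_{1:T})}\left[ \log p(x_{1:T} \mid z_{1:T}(\epsilon_{1:T})) \right].$$
Expand this in more detail and write $z_t$ in lieu of $z_t(\epsilon_{1:t})$ to avoid cluttering of notation. Then
$$\mathcal{R}(z) = -\sum_{t=1}^{T} \mathbb{E}_{p(\epsilon_{1:t})}\left[ \log \nu(x_t) + (V^\top z_t)^\top x_t - A(V^\top z_t) \right].$$
The risk for the underlying RNN, $\mathcal{R}(h)$, is similar when we replace $z_t$ with $h_t$,
$$\mathcal{R}(h) = -\sum_{t=1}^{T} \left[ \log \nu(x_t) + (V^\top h_t)^\top x_t - A(V^\top h_t) \right].$$
Therefore we can express the risk of noisin as a function of the risk of the underlying RNN,
$$\mathcal{R}(z) = \mathcal{R}(h) + \sum_{t=1}^{T} \mathbb{E}_{p(\epsilon_{1:t})}\left[ A(V^\top z_t) - A(V^\top h_t) - (V^\top (z_t - h_t))^\top x_t \right].$$
Under the strong unbiasedness condition, $\mathbb{E}_{p(\epsilon_{1:t})}[z_t] = h_t$, so the term that is linear in $z_t - h_t$ vanishes and the second term reduces to
$$\mathrm{Reg} = \sum_{t=1}^{T} \left( \mathbb{E}_{p(\epsilon_{1:t})}\left[ A(V^\top z_t) \right] - A(V^\top h_t) \right).$$
Using the convexity property of the log-normalizer of exponential families and Jensen's inequality,
$$\mathbb{E}_{p(\epsilon_{1:t})}\left[ A(V^\top z_t) \right] \geq A\left( V^\top \mathbb{E}_{p(\epsilon_{1:t})}[z_t] \right).$$
Using the strong unbiasedness condition a second time, the right-hand side equals $A(V^\top h_t)$, and we conclude that $\mathrm{Reg} \geq 0$. Therefore $\mathrm{Reg}$ is a valid regularizer. A second-order Taylor expansion of $A(V^\top z_t)$ around $V^\top h_t$ and the strong unbiasedness condition yield
$$\mathrm{Reg} \approx \frac{1}{2} \sum_{t=1}^{T} \mathrm{tr}\left( J_t \, \mathrm{Cov}(z_t) \, J_t^\top \right),$$
where the matrix $J_t = \nabla^2 A(V^\top h_t)^{1/2} V^\top$ is the original prediction matrix rescaled by the square root of the Hessian of the log-normalizer, which is also the Fisher information matrix of the underlying RNN. This regularization term forces the hidden units to be robust to noise. Under weak unbiasedness, the proof holds under the assumption that the true data generating distribution is an RNN.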
Acknowledgements
We thank Francisco Ruiz for presenting our paper at ICML, 2018. We thank the Princeton Institute for Computational Science and Engineering (PICSciE), the Office of Information Technology’s High Performance Computing Center and Visualization Laboratory at Princeton University for the computational resources. This work was supported by ONR N000141512209, ONR 1336915102004, NIH 51004815500001084, NSF CCF1740833, the Alfred P. Sloan Foundation, the John Simon Guggenheim Foundation, Facebook, Amazon, and IBM.
References
 Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 Bayer & Osendorfer (2014) Bayer, J. and Osendorfer, C. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
 Bishop (1995) Bishop, C. M. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.
 Chiu et al. (2017) Chiu, C.-C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss, R. J., Rao, K., Gonina, K., et al. State-of-the-art speech recognition with sequence-to-sequence models. arXiv preprint arXiv:1712.01769, 2017.
 Cho et al. (2014) Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 Chung et al. (2015) Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pp. 2980–2988, 2015.
 Cooijmans et al. (2016) Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., and Courville, A. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.
 Elman (1990) Elman, J. L. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
 Fraccaro et al. (2016) Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pp. 2199–2207, 2016.
 Gal & Ghahramani (2016) Gal, Y. and Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1019–1027, 2016.
 Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT press, 2016.
 Goyal et al. (2017) Goyal, A., Sordoni, A., Côté, M.-A., Ke, N., and Bengio, Y. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pp. 6716–6726, 2017.
 Grave et al. (2016) Grave, E., Joulin, A., and Usunier, N. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426, 2016.
 Graves (2013) Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 Graves et al. (2013) Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649. IEEE, 2013.
 Gregor et al. (2015) Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Inan et al. (2016) Inan, H., Khosravi, K., and Socher, R. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.

 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.
 Jim et al. (1996) Jim, K.-C., Giles, C. L., and Horne, B. G. An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks, 7(6):1424–1438, 1996.
 Krause et al. (2017) Krause, B., Kahembwe, E., Murray, I., and Renals, S. Dynamic evaluation of neural sequence models. arXiv preprint arXiv:1709.07432, 2017.
 Krueger et al. (2016) Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N. R., Goyal, A., Bengio, Y., Larochelle, H., Courville, A., et al. Zoneout: Regularizing rnns by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.
 Marcus et al. (1993) Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 19(2):313–330, 1993.
 Melis et al. (2017) Melis, G., Dyer, C., and Blunsom, P. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.
 Merity et al. (2016) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
 Merity et al. (2017) Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182, 2017.
 Mikolov & Zweig (2012) Mikolov, T. and Zweig, G. Context dependent recurrent neural network language model. SLT, 12:234–239, 2012.
 Mikolov et al. (2010) Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., and Khudanpur, S. Recurrent neural network based language model. In Interspeech, volume 2, pp. 3, 2010.
 Noh et al. (2017) Noh, H., You, T., Mun, J., and Han, B. Regularizing deep neural networks by noise: Its interpretation and optimization. In Advances in Neural Information Processing Systems, pp. 5113–5122, 2017.
 Pascanu et al. (2013) Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. International Conference on Machine Learning, 28:1310–1318, 2013.
 Pearlmutter (1995) Pearlmutter, B. A. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5):1212–1228, 1995.
 Poggio et al. (2002) Poggio, T., Rifkin, R., Mukherjee, S., and Rakhlin, A. Bagging regularizes. Technical report, Massachusetts Institute of Technology, 2002.
 Polyak & Juditsky (1992) Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
 Robbins (1964) Robbins, H. The empirical bayes approach to statistical decision problems. The Annals of Mathematical Statistics, 35(1):1–20, 1964.
 Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
 Robinson & Fallside (1987) Robinson, A. and Fallside, F. The utility driven dynamic error propagation network. University of Cambridge Department of Engineering, 1987.
 Rumelhart et al. (1988) Rumelhart, D. E., Hinton, G. E., Williams, R. J., et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
 Semeniuta et al. (2016) Semeniuta, S., Severyn, A., and Barth, E. Recurrent dropout without memory loss. arXiv preprint arXiv:1603.05118, 2016.
 Srivastava et al. (2014) Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
 Wager et al. (2013) Wager, S., Wang, S., and Liang, P. S. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, pp. 351–359, 2013.
 Wan et al. (2013) Wan, L., Zeiler, M., Zhang, S., Cun, Y. L., and Fergus, R. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1058–1066, 2013.
 Werbos (1988) Werbos, P. J. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.
 Williams (1989) Williams, R. J. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science, 1989.
 Wu et al. (2016) Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 Yang et al. (2017) Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953, 2017.
 Zaremba et al. (2014) Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
 Zilly et al. (2016) Zilly, J. G., Srivastava, R. K., Koutník, J., and Schmidhuber, J. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.