Regularizing Recurrent Networks - On Injected Noise and Norm-based Methods

10/21/2014 ∙ by Saahil Ognawala, et al. ∙ 0

Advancements in parallel processing have led to a surge in multilayer perceptrons' (MLP) applications and deep learning in the past decades. Recurrent Neural Networks (RNNs) give additional representational power to feedforward MLPs by providing a way to treat sequential data. However, RNNs are hard to train using conventional error backpropagation methods because of the difficulty in relating inputs over many time-steps. Regularization approaches from the MLP sphere, like dropout and noisy weight training, have been insufficiently applied and tested on simple RNNs. Moreover, solutions have been proposed to improve convergence in RNNs but not enough to improve their ability to remember long term dependencies. In this study, we aim to empirically evaluate the remembering and generalization ability of RNNs on polyphonic musical datasets. The models are trained with injected noise, random dropout and norm-based regularizers, and their respective performances are compared to well-initialized plain RNNs and advanced regularization methods like fast-dropout. We conclude with evidence that training with noise does not improve performance as conjectured by a few works in RNN optimization before ours.




1 Introduction

Recurrent Neural Networks are variations of multilayer-perceptron-based function approximators, which are used for prediction on time-series data. Such data may be text in various languages, a musical sequence, a video, or a trend analysis in the financial domain. For MLP training, the most popular techniques are all based on some form of backpropagation of gradients with respect to the weights (rumelhart1988learning). To train an RNN, the backpropagation of gradients is performed in time, on a time-unfolded representation of the network.

When such a time series network is trained by traditional backpropagation on error gradients, it suffers from one of two peculiar analytical problems: exploding gradients or vanishing gradients. When the error gradients are backpropagated through what is essentially a set of identical weight vectors, the gradients may grow smaller (vanishing gradients) or larger (exploding gradients) exponentially fast, until they become insignificant for training purposes or lead to instability. Conceptually, the problem of vanishing gradients exists in any deep neural network that relies on propagating its error downwards to train the weights. This issue is particularly harmful in the case of RNNs because it damages the capability of a network to learn properties of the problem that are long-term dependent. In simple terms, this means that due to its inherent nature of being time-series, a recurrent network needs to store not only the state representation of the input at time $t$, but also of those seen at $t-1, t-2, \dots, t-n$. This problem, in the presence of vanishing gradients, becomes intractable for $n$ exceeding a few dozen.

Due to the unstable behaviour of RNNs in dynamic space, they were not touched upon extensively until some sophisticated second-order optimization methods were introduced for feedforward neural networks (martens2010deep), that were extended to RNNs. Also groundbreaking have been the advances in the form of structural solutions like Long Short-Term Memory (LSTM) (hochreiter1997long) that established state-of-the-art results on text prediction tasks, pathological tasks and such.

To date, there have been no empirical studies of claims such as the ones made in pascanu2012difficulty that regularization of recurrent weights by means of restricting the growth of their norm will fail to prevent vanishing gradients. There have also been no evaluations of the standard regularization-for-overfitting techniques from MLP training when applied to RNNs for remembering long term dependencies. In this study, we aim to evaluate the effect of norm-based regularization methods, artificial noise injection and dropout in weights before propagating derivatives on the ability of the network to remember long term dependencies, as well as on convergence.

2 Related Work


A related experimental study discusses the latest optimization trends in RNNs, including gradient clipping, second order optimization methods like Hessian-free, leaky integration units (LSTMs are also discussed as a part of this), momentum tricks in stochastic gradient descent (SGD), powerful output probability models based on deterministic variations of Restricted Boltzmann Machines, and the use of sparse gradients as a regularization trick. The evaluations presented in that study are on the same music datasets that we use in ours, in addition to the Penn Treebank corpus of text data.


Another line of work describes deep recurrent networks that consist of denoising autoencoders (vincent2008extracting) at each time-step, which extract rich features out of audio signals by learning time-series representations from deliberately corrupted input. The noise itself is not modelled by the autoencoder, which is the key idea behind learning a denoised input representation.

RNNs are typically described as a set of three transition functions, viz. input-to-hidden, hidden-to-hidden and hidden-to-output. pascanu2013construct delve into the matter of “depth” in RNNs by describing and evaluating the workings of an RNN when one or more of these three transitions are made deeper than a single layer.

The study by hochreiter1997long is a solution to the long term dependency problem in RNNs. In this, the authors propose a structural variation of a conventional RNN where, by adding additional short-term memory units that fire randomly, the long time-delay remembering capability of an RNN increases significantly. graves2013generating extended the study of LSTMs by applying the idea to generate complex sequences of words in a text corpus, and handwriting patterns learned from real-valued positional information in calligraphy. zaremba2014recurrent improved generalization in LSTMs by applying Dropout (hinton2012improving) only to the non-recurrent connections.

murray1994enhanced present an analysis of noisy MLP training models, where the cost function is appended with a noise term to improve trajectory of the training curve, generalization of the network and increase fault tolerance from data. The results were shown to be particularly useful in the field of VLSI network design.

The study of jim1996analysis attempts to extend the noisy gradient descent model from feedforward networks to RNNs. The authors focus on convergence of RNNs, rather than the long term dependency problem. The noisy update model is applied to automata solving problems, which typically do not have pathologically long sequences that need to be remembered at arbitrary time delays.

In an analysis by schaefer2008learning, the authors claim that the widely discussed problem of long-term dependency identification in RNNs does not really exist. This claim is validated by working a pathological sequence task through an RNN, and demonstrating its performance on increasing time delays between the relevant input and output values. However, this study does not present results on standard audio, video or text corpus data that are used in other pertinent publications in RNN.

3 Formulation of RNNs

RNNs are semantically applicable to tasks that are based on temporal consistency. Other than universal function approximators, a way of looking at MLPs is as orthogonal representation of the input features. RNNs exploit this representation technique by duplicating hidden layers of MLP in time-steps and fully connecting the consecutive hidden layers in time. Therefore, we get an unfolded representation of RNNs in time as shown in Fig. 1.

Figure 1: RNN unfolded in time. (Adapted from sutskever2013training with permission.)

We, hence, define an RNN as

$$h_t = f_h\left(W_{rec}\, h_{t-1} + W_{in}\, x_t + b_h\right) \tag{2}$$

$$y_t = f_y\left(W_{out}\, h_t + b_y\right)$$

At time-step $t$, $x_t$ is the input, $h_t$ is the activation of the hidden layer, and $y_t$ is the output of the network. The complete parameter set of the model is given by the input-to-hidden weights, $W_{in}$, hidden-to-hidden weights, $W_{rec}$, hidden-to-output weights, $W_{out}$, hidden layer bias, $b_h$, and output layer bias, $b_y$. $f_y$ and $f_h$ are the non-linear activation functions at the output and hidden layers, respectively.
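The definition above can be sketched as a forward pass in plain Python. This is only an illustrative sketch: the function names (`matvec`, `rnn_forward`), the zero initial hidden state, and the choice of tanh hidden units with sigmoid outputs are assumptions made for the example, and weights are stored as lists of rows to avoid external dependencies.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_forward(inputs, w_in, w_rec, w_out, b_h, b_y):
    """Run a simple RNN over a sequence and return one output vector per time-step."""
    h = [0.0] * len(b_h)  # assumed initial hidden state h_0 = 0
    outputs = []
    for x in inputs:
        # hidden-to-hidden term + input-to-hidden term + hidden bias
        pre = [r + i + b for r, i, b in zip(matvec(w_rec, h), matvec(w_in, x), b_h)]
        h = [math.tanh(a) for a in pre]
        # hidden-to-output transition with a sigmoid output non-linearity
        y = [sigmoid(o + b) for o, b in zip(matvec(w_out, h), b_y)]
        outputs.append(y)
    return outputs

# Toy usage: 2 inputs, 2 hidden units, 1 output, 3 time-steps.
w_in = [[0.5, -0.2], [0.1, 0.3]]
w_rec = [[0.9, 0.0], [0.0, 0.9]]
w_out = [[1.0, -1.0]]
ys = rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w_in, w_rec, w_out, [0.0, 0.0], [0.0])
```

The same hidden state `h` is carried from one time-step to the next, which is what the unfolded representation in Fig. 1 depicts.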

4 Exploding Gradients and Effect on Long-term Dependencies

bengio1994learning and pascanu2012difficulty explain the dynamics of the weight training using backpropagation through time in RNNs.

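Consider the error function, $E$, applied on the outputs of the RNN, with $E_t$ its value at time-step $t$. Calculating the error gradient

$$\frac{\partial E}{\partial \theta} = \sum_{1 \le t \le T} \frac{\partial E_t}{\partial \theta}$$

$$\frac{\partial E_t}{\partial \theta} = \sum_{1 \le k \le t} \frac{\partial E_t}{\partial h_t}\, \frac{\partial h_t}{\partial h_k}\, \frac{\partial^{+} h_k}{\partial \theta} \tag{4}$$

Where $\theta$ is the concatenation of the weight matrices and bias vectors of the model, and $\partial^{+} h_k / \partial \theta$ is the immediate partial derivative of $h_k$ with respect to $\theta$, taken with $h_{k-1}$ held constant.

It is clear from Eq. (4) that the derivative of the loss function at every time-step, $t$, is affected by the activations at time-steps $k \le t$.

Furthermore, consider the term $\partial h_t / \partial h_k$ on the right hand side of Eq. (4)

$$\frac{\partial h_t}{\partial h_k} = \prod_{t \ge i > k} \frac{\partial h_i}{\partial h_{i-1}}$$

The multiplication of the real valued derivatives at time-steps $i$ successively for all indices $k$ in Eq. (4) may lead to the norm of the product growing very large or vanishing to zero, exponentially fast in time. This is harmful as far as storing long term time dependencies goes, because by the time the error gradient at $t$ has been propagated back to $k$, the norm explosion or vanishing may have made the training regime unsuitable for any meaningful updates.

This compounding of the error gradient can happen in one of two opposite directions, both depending on the largest absolute eigenvalue (spectral radius), $\rho$, of the recurrent weight matrix. If the spectral radius is much less than 1, the gradient might vanish over time (if using a sigmoid-like non-linearity). On the other hand, if the spectral radius is bigger than 1, the gradient might explode over time.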
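The compounding effect can be illustrated with a scalar recurrence, where the recurrent "matrix" is a single weight `w` and its spectral radius is simply `|w|`. The sketch below is an assumption-laden toy: the function name `gradient_factor`, the starting state, the number of steps and the optional linear unit are all chosen for illustration only.

```python
import math

def gradient_factor(w, steps=50, b=0.0, h0=0.5, linear=False):
    """Return |d h_T / d h_0| for a scalar RNN with recurrent weight w.

    With linear=False the unit is h_t = tanh(w * h_{t-1} + b);
    with linear=True it is h_t = w * h_{t-1} + b.
    """
    h, product = h0, 1.0
    for _ in range(steps):
        pre = w * h + b
        if linear:
            h, local = pre, w                  # derivative of a linear unit
        else:
            h = math.tanh(pre)
            local = w * (1.0 - h * h)          # chain rule through tanh
        product *= local                       # one factor per time-step
    return abs(product)

vanishing = gradient_factor(0.5)               # |w| < 1 with a squashing unit
exploding = gradient_factor(1.1, linear=True)  # |w| > 1 without saturation
```

Each time-step contributes one multiplicative factor, so the product shrinks or grows geometrically with the number of steps.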

5 Demonstration with a Simple Regime

Let us demonstrate the delicate nature of training a recurrent weight matrix, using an over-simplified architecture (a more expansive explanation, also from a dynamical systems perspective, can be found in pascanu2012understanding).

In Eq. 2, assume that there is no new input coming at every time-step, so that the second term with $x_t$ becomes unnecessary. Furthermore, assume that $h_t$ is a single dimension variable, which means that $W_{rec}$ and $b_h$ have dimensions [1, 1] and [1] respectively; we write them as the scalars $w$ and $b$.

Our objective, then, is to start with a zero value and reach a given target value in a set number of time-steps. Fig. 2 shows the training graph over 10000 different initialization sets of $w$ and $b$. The third axis represents the squared loss of the model.

Figure 2: Training regime of a simple RNN. $w$: weight, $b$: bias, third axis: squared loss

The steep wall perpendicular to the parameter space represents an explosion in gradients of the loss function. When the largest eigenvalue in the parameter matrix explodes, the curvature of the error surface compounds too, which is what the wall illustrates.
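A surface of this kind can be computed with a few lines of plain Python. The sketch below is not the authors' script: the tanh non-linearity, the 50 time-steps, the target value 0.7 and the grid range are assumptions picked for illustration, and `linspace` is a small hypothetical helper.

```python
import math

def final_state(w, b, steps=50):
    """Iterate h_t = tanh(w * h_{t-1} + b) from h_0 = 0 and return h_T."""
    h = 0.0
    for _ in range(steps):
        h = math.tanh(w * h + b)
    return h

def loss_surface(w_values, b_values, target=0.7, steps=50):
    """Squared loss between h_T and the target for every (w, b) pair."""
    return [[(final_state(w, b, steps) - target) ** 2 for b in b_values]
            for w in w_values]

def linspace(lo, hi, n):
    """Evenly spaced values from lo to hi inclusive (hypothetical helper)."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

# 100 x 100 grid, i.e. 10000 (w, b) initializations as in the text.
grid = linspace(-3.0, 3.0, 100)
surface = loss_surface(grid, grid)
```

Plotting `surface` against the two grids with any 3D plotting tool gives a picture comparable in spirit to Fig. 2; the exact shape depends on the assumed target and number of steps.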

The thing most noteworthy is that when the search routine is at a point on the top surface of the error curve, it makes its next step in a direction perpendicular to the face of the wall. Depending on the learning rate, it might then fall to the ground beyond the valley where the error reaches its minimum. This is not such a big problem, because the search must come back to the valley region, given enough time to explore the ground region. Note, however, that this holds only until the search direction collides with the wall again, at which point a small change in the norm of the update would take the search back to the top of the hill to repeat the entire search process.

The key, then, is to have a method that would smoothen the minima valley and decrease the slope of the steep wall so as to allow optimization to move in a less arbitrary fashion given a sufficiently small learning rate. A more acceptable routine may look like the one shown in Fig. 3.

Figure 3: More desirable training regime of a simple RNN.

6 Existing Solutions

6.1 Initialization and Momentum Tricks

Momentum (polyak1964some) with the SGD method has the added advantage of preserving the directions of consistent change over multiple updates. The persistent change in directions can be thought of as the dominant velocity in which the update moves during the optimization process. sutskever2013importance describe gradient descent with momentum as

$$v_{t+1} = \mu v_t - \alpha \nabla E(\theta_t)$$

$$\theta_{t+1} = \theta_t + v_{t+1}$$

Where $\theta_t$ is the weight matrix after $t$ updates, $v_t$ is the update value, $\mu$ is the momentum, $\alpha$ is the step rate of learning and $\nabla E(\theta_t)$ is the partial derivative of the error function w.r.t. the parameter $\theta_t$.

nesterov1983method introduced the Nesterov Accelerated Gradient (NAG) method for effective velocity preservation in the optimization process. In the manner of classical momentum, NAG can be formalized as

$$v_{t+1} = \mu v_t - \alpha \nabla E(\theta_t + \mu v_t)$$

$$\theta_{t+1} = \theta_t + v_{t+1}$$

The small, but key, difference between the classical momentum method and NAG is that in the latter, first a partial update to the parameters is done using the last update value, and then the gradient calculation is done for the next update.
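The difference between the two update rules can be made concrete with a scalar sketch. The function names, the default step rate and momentum, and the toy objective $E(\theta) = \theta^2$ are assumptions for illustration; only the order of "look ahead, then take the gradient" is the point.

```python
def momentum_step(theta, velocity, grad_fn, step_rate=0.1, momentum=0.9):
    """Classical momentum: gradient is evaluated at the current parameters."""
    velocity = momentum * velocity - step_rate * grad_fn(theta)
    return theta + velocity, velocity

def nag_step(theta, velocity, grad_fn, step_rate=0.1, momentum=0.9):
    """NAG: first apply the previous velocity, then evaluate the gradient there."""
    lookahead = theta + momentum * velocity
    velocity = momentum * velocity - step_rate * grad_fn(lookahead)
    return theta + velocity, velocity

def minimise(step_fn, theta=5.0, steps=200):
    """Minimise the toy objective E(theta) = theta**2 with the given update rule."""
    velocity = 0.0
    grad_fn = lambda t: 2.0 * t  # gradient of the toy objective
    for _ in range(steps):
        theta, velocity = step_fn(theta, velocity, grad_fn)
    return theta

theta_momentum = minimise(momentum_step)
theta_nag = minimise(nag_step)
```

Both rules share the same signature here, so they can be swapped in an optimization loop without other changes.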

The second trick presented by sutskever2013importance is related to the random initialization of the hidden-to-hidden and input-to-hidden weight matrices. The sparsifying technique presented here is inspired by martens2010deep, where all but 15 (or some other fixed number) of the connection weights are set to zero, and the rest are sampled from a Gaussian distribution. The reasoning behind this weight setting has been that a sparse connection matrix would help to diversify the incoming connections from a lower layer.

As a second initialization step, the spectral-radius is kept close to 1, so as to decrease the possibility of the gradients exploding or vanishing over a long time delay, when using sigmoid transfer function.

6.2 Echo-State Networks

It has been argued by jaeger2004harnessing that a random draw from a pre-determined distribution can be used to set the input-to-hidden and hidden-to-hidden connection weights, instead of learning them iteratively. This method, however, is not applied to the hidden-to-output layer connections, which are trained using closed form solutions that involve calculating the pseudo-inverse of a Hessian matrix.

A completely random draw without controlling the distribution parameters might be harmful for setting such weights, though. For instance, if the spectral radius of the hidden-to-hidden weight matrix is much higher or lower than 1, there is a clear possibility that the long term dependency effects are either intractable or vanish, respectively, over time. Hence, we follow the general rule that the spectral radius of the hidden-to-hidden weight matrix is restricted to be close to 1 (1.1, 0.9 etc.) and the input-to-hidden weights are drawn with a small standard deviation of about 0.001.

6.3 Hessian Free Optimization

martens2010deep propose a second order Hessian-Free (HF) optimization method, inspired by Newton’s method, to train deep neural networks with random initializations. HF method obviates the need for pre-training in deep models, which was previously thought to be the most promising way of starting the optimization process, due to the presence of deep pathologies (hinton2006fast; hinton2006reducing).

With respect to the objective function, $f$, HF concerns itself with optimizing a simpler sub-objective of $f$ by finding local approximations to it. This is done as follows: for a parameter update from $\theta_n$ to $\theta_{n+1}$, it optimizes a sub-objective function

$$q_{\theta_n}(\theta) = M_{\theta_n}(\theta) + \lambda R_{\theta_n}(\theta)$$

The term $M_{\theta_n}$ represents a quadratic approximation to $f$. Normally, $M_{\theta_n}$ is chosen to be the Taylor-series expansion of $f$ to second-order terms. This is the same expansion term that is used for Newton's optimization methods with the key difference that there are no additional assumptions like a low-rank matrix. This would, typically, make the optimization harder since it would involve an inversion of a large matrix. What differentiates HF from other second order optimization methods is that it is made possible to partially optimize the sub-objective by the conjugate gradient method, instead of gradient descent.

The term $R_{\theta_n}$ is a regularization function, weighted by $\lambda$, that penalizes the solution as it moves farther away from $\theta_n$ (this modification to the HF method of martens2010deep was proposed by sutskever2013training).

6.4 Long Short-Term Memory (LSTM)

Figure 4: LSTM cell schematic (graves2013generating)

While not particularly a solution to the exploding/vanishing gradients problem, LSTMs (hochreiter1997long) have been systematically shown (graves2013generating) to have state-of-the-art performance on sequence generation and long-range time series prediction tasks. LSTM alleviates the temporal dependency preservation problem of plain RNNs by structurally modifying the naive neural nodes of the RNN model to produce a more complex LSTM memory cell.

LSTM cell consists of the following novel links, as in Fig. 4, in addition to the conventional hidden units

  • Input gate to control the in-flow of an input vector into the hidden state. Takes a value from $[0, 1]$.

  • Output gate to control the out-flow of a hidden state activation to the next layer of the LSTM-RNN. Takes a value from $[0, 1]$.

  • Forget gate to control the value retention of a memory cell. This link uses the input vector and hidden activation value to determine whether the activation is fed back to the unit for retention over longer time sequences. Takes a value from $[0, 1]$.

The original LSTM by hochreiter1997long uses SGD for training, but it suffers from the exploding gradient problem. In order to solve that, the solution of graves2013generating uses a gradient clipping technique to limit the norm of the gradients and hence stop them from growing too large with time. Even so, the structural complexity of LSTM memory units makes it difficult to implement and harder to train on most systems that do not allow calculation of arbitrary gradients.
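Limiting the norm of a gradient can be sketched as a simple rescaling. This is one common form of clipping (rescaling the whole vector when its Euclidean norm exceeds a threshold) and is shown as an assumption, not as the exact scheme used by graves2013generating; `clip_by_norm` and `max_norm` are hypothetical names.

```python
import math

def clip_by_norm(gradient, max_norm):
    """Rescale a gradient vector so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm or norm == 0.0:
        return list(gradient)          # already small enough, keep as is
    scale = max_norm / norm
    return [g * scale for g in gradient]  # same direction, reduced length

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
```

Rescaling keeps the direction of the update and only shortens its length, so a single exploding step cannot throw the parameters far away.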

6.5 Fast-Dropout RNNs

wang2013fast suggest an approximation for dropout (hinton2012improving) in deep neural networks. The suggestion is to treat every neuron as a random variable, whose incoming connections are randomly set to zero with a probability of $p$. It would be safe to assume that such a random variable tends to be Gaussian over a sufficiently large number (approximately 10 or more) of incoming connections. The resulting models had orders of magnitude better training times than a naive dropout approach, and the test results matched, and were sometimes better than, those of plain MLPs.

bayer2013fast verified the validity of the fast-dropout approach on RNNs. This was done by concatenating the input-to-hidden and hidden-to-hidden weights into a single array, and applying the same approximation to the incoming connections as in wang2013fast. Fast-dropout applied to RNNs, works as a regularizer, because the Gaussian approximation of the dropout term leads to a local derivative of the random variable representation of the node, that acts as an additive regularization term.

Fast-dropout, when applied with the initialization tricks of Sec. 6.1 on standard music datasets, produces state-of-the-art results.

7 Norm-based Regularizers

The first method of regularization in RNNs that we evaluate is Tikhonov regularization (bishop1991improving) on the input-to-hidden, recurrent and hidden-to-output weight matrices. It has been claimed in previous RNN related works (pascanu2012difficulty) that L1 and L2 penalties on the weight matrix, when added to the cost function of the estimator, may work against improving the long-term dependency remembrance of the network and only partially alleviate the exploding gradients problem.

Using the same example for demonstration as in Sec. 5, we illustrate the effect of L1 and L2 regularizers on the training regime of a time-series network.
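As a rough sketch of what such penalties look like in code, the functions below add an L1 or L2 term to a given data loss. The names (`l1_penalty`, `l2_penalty`, `regularised_loss`) and the use of a single coefficient `lam` for all weight matrices are assumptions for illustration.

```python
def l1_penalty(weights, lam):
    """lam times the sum of absolute values of all entries of a weight matrix."""
    return lam * sum(abs(w) for row in weights for w in row)

def l2_penalty(weights, lam):
    """lam times the sum of squared entries of a weight matrix."""
    return lam * sum(w * w for row in weights for w in row)

def regularised_loss(data_loss, weight_matrices, lam, kind="L2"):
    """Add the chosen norm-based penalty over all given weight matrices."""
    penalty_fn = l1_penalty if kind == "L1" else l2_penalty
    return data_loss + sum(penalty_fn(w, lam) for w in weight_matrices)

w_rec = [[1.0, -2.0]]
total = regularised_loss(0.5, [w_rec], lam=0.1, kind="L1")
```

Both penalties grow with the size of the weights, which is why they pull the recurrent matrix towards a smaller spectral radius.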

8 Stochastic Noise Injection

Noise injection is used as a regularization method in feedforward-only neural networks (bremermann1991brain, flower1993summed, jabri1992weight) to improve generalization. The motive behind adding stochastic noise of different natures to the synaptic weights is to improve fault tolerance in the input and gracefully handle unseen data during prediction.

Adding noise to the weights during optimization works as a regularizer by, essentially, converting the state-space search into a search in a more coarse region of the weight space than what would have been without the additional noise. This property of noisy training has been exploited for training the recurrent weights in RNNs too. By adjusting the weight space to a grainier region, not only are we promised faster convergence but also a cure for the exploding gradients problem. A detailed analysis of Gaussian noise injection in recurrent weight matrix and its behaviour as a regularizer is given in appendix A.

In RNNs, the work of jim1996analysis demonstrates the application of stochastic noise to the recurrent layers, in much the same way as in feedforward MLPs. In the following subsections, we use the additive and multiplicative noise models of jim1996analysis to evaluate the performance of a recurrent network in terms of preserving long term dependencies in musical chord sequences. Our analysis of the noisy recurrent weight training model is followed by the noisy input-to-hidden weight model.

8.1 Noise in Recurrent Weights

The first type of noise injection we analyze is in the recurrent weight matrix. In all the analyzed noisy training methods, we restrict ourselves to non-cumulative noise models. In non-cumulative noise methods, the intensity of noise injected at each time-step, $t$, is independent of the amount of noise injected at earlier time-steps. As we saw earlier, backpropagation-through-time in RNNs trains essentially the same set of weights in time-space and, hence, we postulate that cumulatively increasing the noise intensity in time space might decrease the convergence performance of the network.

Other than the cumulative nature of the recurrent weight noise, there are two main considerations for deciding the nature of noise that must be injected at each recurrent layer

  1. Should the same noise vector be inserted at every time-step in the unrolled representation of the network (per-sequence noise) or a different noise vector be sampled for every time-step (per-time-step noise)?

  2. Should the noise be a multiplicative factor of the state of weight vector (multiplicative noise) or simply an additive noise vector sampled from a given distribution (additive noise)?

8.1.1 Additive Noise

Additive noise in recurrent weights at time-step $t$ is given by

$$\tilde{W}_{rec}^{(t)} = W_{rec} + \epsilon_t$$

$\tilde{W}_{rec}^{(t)}$ is the modified version of $W_{rec}$ after adding the noise term. The noise vector, $\epsilon_t$, is chosen from a zero-mean normal distribution with standard deviation $\sigma$

$$\epsilon_t \sim \mathcal{N}(0, \sigma^2)$$

In the per-time-step recurrent noise model, we sample a new noise vector, $\epsilon_t$, for every time-step in the unrolled representation, for every iteration of weight update in the optimization process. In the per-sequence recurrent noise model, we sample a new noise vector, $\epsilon$, for every iteration in the optimization process and add the same noise to each time-step in the network.

8.1.2 Multiplicative Noise

Multiplicative noise in recurrent weights, analogously, is given by

$$\tilde{W}_{rec}^{(t)} = W_{rec} + W_{rec} \odot \epsilon_t$$

The nature of $\epsilon_t$ is the same as before.

As with additive noise, multiplicative noise is also evaluated on the two variants of per-time-step noise and per-sequence noise models.

In both, additive and multiplicative noise models, the perturbation of the weight matrix is done only during the optimization period, and not during forward propagation. During weight training, the original values of the weight matrices are preserved even as noise is added for the gradient calculation for backpropagation-through-time.
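The four variants (additive or multiplicative, per-time-step or per-sequence) can be sketched as follows. This is a minimal sketch under assumptions: the multiplicative form is taken to be `w + w * noise`, the noise is zero-mean Gaussian with standard deviation `sigma`, and all function names and the fixed `seed` are hypothetical. The original weight matrix is never modified, matching the statement that the unperturbed values are preserved.

```python
import random

def sample_noise(rows, cols, sigma, rng):
    """Draw a matrix of independent zero-mean Gaussian noise values."""
    return [[rng.gauss(0.0, sigma) for _ in range(cols)] for _ in range(rows)]

def perturb(weights, noise, multiplicative=False):
    """Return a noisy copy of the weights; the original matrix is left untouched."""
    if multiplicative:
        return [[w + w * e for w, e in zip(wr, er)] for wr, er in zip(weights, noise)]
    return [[w + e for w, e in zip(wr, er)] for wr, er in zip(weights, noise)]

def noisy_weights(weights, steps, sigma, per_sequence=False,
                  multiplicative=False, seed=0):
    """Return one perturbed recurrent matrix for each unrolled time-step."""
    rng = random.Random(seed)
    rows, cols = len(weights), len(weights[0])
    # per-sequence: one noise draw shared by all time-steps of this update
    shared = sample_noise(rows, cols, sigma, rng) if per_sequence else None
    result = []
    for _ in range(steps):
        # per-time-step: a fresh noise draw at every unrolled step
        noise = shared if per_sequence else sample_noise(rows, cols, sigma, rng)
        result.append(perturb(weights, noise, multiplicative))
    return result

w_rec = [[0.0, 1.0], [0.5, -0.5]]
per_step = noisy_weights(w_rec, steps=3, sigma=0.1)
per_seq = noisy_weights(w_rec, steps=3, sigma=0.1, per_sequence=True)
```

In a training loop, `noisy_weights` would be called once per weight update, and the gradient computed with the perturbed matrices would then be applied to the clean `w_rec`.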

8.2 Noise in Feedforward layers

As with noise in the recurrent weight matrix, we would like to close the loop on experimentation by applying the noisy weights training on the feedforward connections too.

During training of feedforward connections with backpropagation of gradients, we use the following weight formulae for noisy weights


We only work with per-time-step noise model for feedforward layers.

9 Dropout as a Regularizer

Random dropout in MLP connections is used as a generalization technique (hinton2012improving), that works by preventing co-adaptation of multiple features in the training set. A variation of dropout in the activation units is DropConnect (wan2013regularization), where random elements from the weight matrix are dropped instead.

We use the DropConnect model on the recurrent weight matrix to try to improve the long-term dependency preserving tendency of our network. As with stochastic noise injection, dropout in recurrent weights can be applied in two different ways

  1. A possibly unique set of weights are dropped out at every time-step (per-time-step dropout).

  2. Same set of weights are dropped out at every time-step (per-sequence dropout).

After searching over the range 0–1, we find the best dropout rate suitable for the recurrent connections.
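The two dropout variants listed above differ only in when the mask is sampled. The sketch below assumes a Bernoulli mask with drop probability `drop_prob` and no rescaling of the surviving weights; the function names and the fixed `seed` are hypothetical.

```python
import random

def sample_mask(rows, cols, drop_prob, rng):
    """Mask entry is 0.0 (dropped) with probability drop_prob, else 1.0 (kept)."""
    return [[0.0 if rng.random() < drop_prob else 1.0 for _ in range(cols)]
            for _ in range(rows)]

def dropconnect_weights(weights, steps, drop_prob, per_sequence=False, seed=0):
    """Return one masked copy of the recurrent matrix for each time-step."""
    rng = random.Random(seed)
    rows, cols = len(weights), len(weights[0])
    # per-sequence dropout: the same mask is reused at every time-step
    shared = sample_mask(rows, cols, drop_prob, rng) if per_sequence else None
    masked = []
    for _ in range(steps):
        # per-time-step dropout: a possibly different mask at every time-step
        mask = shared if per_sequence else sample_mask(rows, cols, drop_prob, rng)
        masked.append([[w * m for w, m in zip(wr, mr)]
                       for wr, mr in zip(weights, mask)])
    return masked

w_rec = [[1.0, 2.0], [3.0, 4.0]]
per_seq = dropconnect_weights(w_rec, steps=4, drop_prob=0.5, per_sequence=True)
```

A search over the dropout rate then amounts to calling this with different values of `drop_prob` in the range 0 to 1.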

10 Experiments

10.1 Datasets

For evaluating the proposed regularization techniques, we use musical datasets. These are note-based representations of score sheets from four sources: JSB Chorales (harmonized chorales of J.S. Bach), Piano-midi.de (classical music from different sources), Nottingham (folk tunes) and MuseData (classical music).

The dimensionality at each time-step for all four datasets is 88. After dividing the original dataset into training, validation and testing sets (approximately 60%–20%–20% respectively), we split the training and validation samples into chunks of 100 time-steps each. We choose this number because in our experience, for a dataset such as music scores, a length of 100 is long enough to make remembering long term dependencies a necessity while at the same time not making it unreasonably difficult for a network to do so. For samples that are smaller than 100 steps long, we pad them with zeros at the front.

We do no such splitting or prefixing for the test dataset, and use the original sized data chunks for prediction.
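The chunking and front-padding step can be sketched as below. The function name `chunk_sequence` is hypothetical, and treating the leftover tail of a long sequence like a short sample (padding it at the front) is an assumption; the text only states the padding rule for samples shorter than 100 steps.

```python
def chunk_sequence(sequence, chunk_len=100, dim=88):
    """Split a sequence of note vectors into chunks of chunk_len time-steps.

    Chunks shorter than chunk_len are padded with all-zero frames at the front.
    """
    chunks = []
    for start in range(0, len(sequence), chunk_len):
        chunk = sequence[start:start + chunk_len]
        padding = [[0.0] * dim for _ in range(chunk_len - len(chunk))]
        chunks.append(padding + chunk)  # zeros go before the real frames
    return chunks

# Toy usage: a 250-step piece with every note switched on in every frame.
piece = [[1.0] * 88 for _ in range(250)]
chunks = chunk_sequence(piece)
```

Test sequences would bypass this function entirely and be fed to the network at their original length.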

10.2 Model Description

Our setup for all four polyphonic music datasets consists of one hidden layer of neurons at each time-step of the RNN. The number of hidden units in the layer is listed in appendix B. The hidden units use the hyperbolic-tangent (tanh) non-linearity and the output nodes use the sigmoid. The model parameters are tasked with describing the random variable, $x_{t+1,k}$, such that

$$y_{t,k} = P\left(x_{t+1,k} = 1 \mid x_1, \dots, x_t\right)$$

Where $x_{t,k}$ denotes the state of note $k$ at time-step $t$ which, if present, is 1 and 0 otherwise.

The loss function which is optimized by this RNN is a mean cross-entropy (CE) loss over all time-steps

$$L(\theta) = -\frac{1}{N(T-1)} \sum_{n=1}^{N} \sum_{t=1}^{T-1} \sum_{k} \left[ x^{(n)}_{t+1,k} \log y^{(n)}_{t,k} + \left(1 - x^{(n)}_{t+1,k}\right) \log\left(1 - y^{(n)}_{t,k}\right) \right]$$

$k$ denotes the note index, $t$ denotes the time-step and $n$ denotes the training sample index.
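A cross-entropy loss of this kind can be sketched in plain Python. The exact normalisation is an assumption: here the loss is summed over notes and averaged over all time-steps of all samples, and predictions are clipped by a small `eps` to keep the logarithm finite. The caller is expected to pass targets already aligned with the predictions (for next-step prediction, the targets are the inputs shifted by one time-step).

```python
import math

def mean_cross_entropy(targets, predictions, eps=1e-12):
    """Binary cross-entropy summed over notes, averaged over time-steps and samples.

    Both arguments are nested lists indexed as [sample][time-step][note].
    """
    total, frames = 0.0, 0
    for sample_t, sample_p in zip(targets, predictions):
        for frame_t, frame_p in zip(sample_t, sample_p):
            for x, y in zip(frame_t, frame_p):
                y = min(max(y, eps), 1.0 - eps)  # avoid log(0)
                total -= x * math.log(y) + (1.0 - x) * math.log(1.0 - y)
            frames += 1
    return total / frames

# Toy usage: one sample, one time-step, two notes, uninformative predictions.
loss = mean_cross_entropy([[[1.0, 0.0]]], [[[0.5, 0.5]]])
```

With this normalisation the loss of a model that always predicts 0.5 grows linearly with the number of notes per frame, which is one reason reported values depend on the chosen averaging.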

10.3 Results

On the four datasets, we report the average CE errors in Tab. 1. The results for RNN with norm-based regularizer (RNN-NBR), per-time-step noise (RNN-N), per-sequence noise (RNN-NS), multiplicative noise per-time-step (RNN-MN), multiplicative noise per-sequence (RNN-MNS), dropout per-time-step (RNN-DO), dropout per-sequence (RNN-DOS) and feedforward noise (RNN-FF) are given compared to plain RNNs (with initialization in correct regime) and fast dropout RNN (RNN-FD). Advanced training methods like fast dropout and RNN-NADE (boulanger2012modeling) perform measurably better on this data.

We see that injecting stochastic noise or randomly dropping out weights in recurrent layers during training does not necessarily improve the performance of the RNN training or generalization to the test set. In fact, for most datasets, simply tuning the initialization parameters viz. standard deviation of the weight parameter sampling, sparsification of the weight matrix and spectral radius of the recurrent weight vectors, provides better test performance on the musical datasets, than using the noise injection techniques.

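| Model | JSB Chorales | Nottingham | Piano-midi.de | MuseData |
|---|---|---|---|---|
| Plain-RNN | 8.58 | 3.43 | 7.58 | 6.99 |
| RNN-FD | 8.01 | 3.09 | 7.39 | 6.75 |
| RNN-NBR | 8.83 | 3.70 | 7.78 | 8.62 |
| RNN-N | 8.92 | 3.56 | 7.66 | 8.40 |
| RNN-NS | 8.96 | 3.58 | 7.74 | 8.40 |
| RNN-MN | 8.64 | 3.51 | 7.71 | 8.13 |
| RNN-MNS | 8.64 | 3.50 | 7.70 | 8.12 |
| RNN-DO | 8.48 | 3.49 | 7.65 | 7.98 |
| RNN-DOS | 8.55 | 3.57 | 7.67 | 8.00 |
| RNN-FF | 8.67 | 3.54 | 7.69 | 8.10 |

Table 1: Test set results on polyphonic musical datasets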

As postulated by bayer2013fast, we observe too that the largest eigenvalue, when training with stochastic noise or dropout in recurrent weights, gets stuck at a lower spectral radius after a fixed number of epochs over multiple tries. There is less incentive for weight matrices with lower spectral radii to change their values by a bigger amount, due to the lack of error information that can be stored over longer time delays. This can be seen in Fig. 5 and Fig. 6. However, this is not the case with norm-based regularizers where the spectral radius continues to grow, albeit very slowly (Fig. 7).

Figure 5: Training with multiplicative noise per-sequence
Figure 6: Training with dropout
Figure 7: Training with L2 regularizer

Tab. 2 in appendix B gives the range of values from which we generate the regularization coefficient $\lambda$ for the norm-based regularizer. Fig. 8 shows the average logarithmic test errors over different $\lambda$ for both L1 and L2 regularizers.

Figure 8: Regularizer vs. mean test-error for JSB Chorales

Fig. 9 shows the average test errors over different $\sigma$ (standard deviation) of additive stochastic noise. The general trend indicates that the network performance decreases as $\sigma$ increases.

Figure 9: Additive noise vs. mean test-error for JSB Chorales

Fig. 10 shows the average test errors over different values of $p$ (probability that an incoming recurrent weight is set to zero), for uniform dropout per-sequence. The general trend indicates that the network performance improves as $p$ is increased.

Figure 10: Dropout probability (per-sequence) vs. mean test-error for JSB Chorales

11 Conclusion

Through an exhaustive set of experiments with noisy weight updates, random dropout and norm-based regularization approaches, we have shown that conjectures about the inefficacy of MLP-specific regularizers on RNNs are verifiable. pascanu2012difficulty conjectured that a norm-based penalty on the loss function may reduce the training regime of an RNN to a single point attractor, since the eigenvalues of the weight matrix are never allowed to grow by more than a limited amount. A matrix of weights with such a low spectral radius would not suffer from exploding or vanishing gradients, but at the cost of storing long term dependency effects. We can see this from the demonstration of a simple RNN (Fig. 3). In fact, the analytic presentation of the noisy weight training method shows that noise in weights can also be explained as a loss regularization term.

As the results of stochastic noise, L1 and L2 regularizers on RNNs have not been sufficiently tackled by past works in the field, we believe that we have closed a much needed empirical gap by showing that second order optimization methods, structural solutions or more sophisticated methods of training are indeed imperative to deal with the issues of vanishing gradients and long term dependency in recurrent networks.


Appendix A Analysis of Noisy Weights

In this section we attempt to show that adding stochastic noise to the weight matrix is equivalent to adding a regularization term to the loss function of the RNN.

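Let us define the pre-synaptic activation of the incoming connections to one hidden unit as $a = \sum_i w_i x_i$. Then, upon adding multiplicative noise to the weight vector, we have

$$\tilde{a} = \sum_i \left(w_i + w_i \epsilon_i\right) x_i = \sum_i w_i x_i + \sum_i w_i \epsilon_i x_i$$

The noise, $\epsilon_i$, is drawn from a zero mean Gaussian ($\epsilon_i \sim \mathcal{N}(0, \sigma^2)$). Additionally, considering $w_i$ and $x_i$ as constants, we have

$$E[\tilde{a}] = \sum_i w_i x_i + \sum_i w_i x_i\, E[\epsilon_i] = \sum_i w_i x_i = a \tag{16}$$

This shows that the expected value of $\tilde{a}$ is the same as the pre-synaptic signal that is not perturbed by noise.

For computing the variance of $\tilde{a}$, we know that,

$$\mathrm{Var}[\tilde{a}] = \mathrm{Var}\Big[\sum_i w_i \epsilon_i x_i\Big] + \mathrm{Var}\Big[\sum_i w_i x_i\Big] + 2\, \mathrm{Cov}\Big[\sum_i w_i \epsilon_i x_i,\ \sum_i w_i x_i\Big]$$

The second and third terms on the right hand side of this decomposition are zero since the variance in question is that of a constant, $\sum_i w_i x_i$.

For the first term of the decomposition, with the noise components $\epsilon_i$ taken to be independent,

$$\mathrm{Var}\Big[\sum_i w_i \epsilon_i x_i\Big] = \sum_i w_i^2 x_i^2\, \mathrm{Var}[\epsilon_i] = \sigma^2 \sum_i w_i^2 x_i^2$$

Where $\sigma$ is the standard deviation of the Gaussian noise matrix, $\epsilon$.

Putting this back into the decomposition, we get

$$\mathrm{Var}[\tilde{a}] = \sigma^2 \sum_i w_i^2 x_i^2$$

The forms of $E[\tilde{a}]$ and $\mathrm{Var}[\tilde{a}]$ imply that,

$$\tilde{a} \sim \mathcal{N}\Big(\sum_i w_i x_i,\ \sigma^2 \sum_i w_i^2 x_i^2\Big)$$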

This means that if the multiplicative noise is assumed to have been sampled from a Gaussian distribution, it is equivalent to assume that the pre-synaptic activations are sampled from a Gaussian.
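The claim about the mean and variance of the noisy pre-synaptic activation can be checked numerically. The sketch below assumes multiplicative noise of the form `w + w * noise` with independent zero-mean Gaussian noise of standard deviation `sigma` per weight; the function names, the sample count and the toy values are hypothetical.

```python
import random

def noisy_preactivation(weights, inputs, sigma, rng):
    """One draw of the pre-synaptic activation under multiplicative weight noise."""
    return sum((w + w * rng.gauss(0.0, sigma)) * x for w, x in zip(weights, inputs))

def empirical_moments(weights, inputs, sigma, samples=20000, seed=0):
    """Monte Carlo estimate of the mean and variance of the noisy activation."""
    rng = random.Random(seed)
    draws = [noisy_preactivation(weights, inputs, sigma, rng) for _ in range(samples)]
    mean = sum(draws) / samples
    var = sum((d - mean) ** 2 for d in draws) / samples
    return mean, var

def analytic_moments(weights, inputs, sigma):
    """Mean sum(w*x) and variance sigma^2 * sum((w*x)^2) for independent noise."""
    mean = sum(w * x for w, x in zip(weights, inputs))
    var = sigma ** 2 * sum((w * x) ** 2 for w, x in zip(weights, inputs))
    return mean, var

weights, inputs, sigma = [0.5, -1.0, 2.0], [1.0, 0.5, -0.25], 0.1
mc_mean, mc_var = empirical_moments(weights, inputs, sigma)
an_mean, an_var = analytic_moments(weights, inputs, sigma)
```

The Monte Carlo estimates should approach the analytic values as the number of samples grows, which is the sense in which the noisy activation behaves like a Gaussian sample around the clean one.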

This equivalence leads us to the sampling form of the pre-synaptic activation described by bayer2013fast, rather than a smooth Gaussian approximation.

In place of , let us use , which we define as –

Where, .
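The equivalence between perturbing the weights and sampling the activation directly can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the example values of the weights and inputs, and the multiplicative form `w * (1 + eps)` are assumptions made for illustration and are not taken from the original experiments.

```python
import math
import random
import statistics


def noisy_weight_sample(w, x, sigma, rng):
    """Perturb each weight multiplicatively, then compute the activation."""
    return sum(wi * (1.0 + rng.gauss(0.0, sigma)) * xi for wi, xi in zip(w, x))


def activation_sample(w, x, sigma, rng):
    """Draw the activation directly from the moment-matched Gaussian."""
    mean = sum(wi * xi for wi, xi in zip(w, x))
    std = sigma * math.sqrt(sum((wi * xi) ** 2 for wi, xi in zip(w, x)))
    return mean + std * rng.gauss(0.0, 1.0)


def compare(w, x, sigma, n=20000, seed=0):
    """Return (mean, std) of n draws from each of the two sampling schemes."""
    rng = random.Random(seed)
    a = [noisy_weight_sample(w, x, sigma, rng) for _ in range(n)]
    b = [activation_sample(w, x, sigma, rng) for _ in range(n)]
    return (
        (statistics.fmean(a), statistics.pstdev(a)),
        (statistics.fmean(b), statistics.pstdev(b)),
    )


if __name__ == "__main__":
    # Hypothetical weights and inputs for a single hidden unit.
    print(compare([0.5, -1.2, 0.8], [1.0, 0.3, -0.7], sigma=0.1))
```

Both schemes should report nearly the same mean and standard deviation, up to Monte Carlo error.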

Using the above conversion of to its sampling form, , we may define an effective loss function as follows:


We will analyse the terms on the right-hand side of Eq. 21 one at a time.

Consider the first term –


Using the expectation value from Eq. 16.

For a pre-synaptic activation, , Eq. 22 is similar to the usual backpropagation term with respect to a loss function, . Therefore, we may simply use the following form of the gradient term:


Consider now the second term of Eq. 21


This is the same as the post-synaptic gradient term, scaled by the standard deviation of the noise, , and independent of the actual weight values.

Hence, we can write Eq. 21 as –


Where the second term on the right hand side is the regularization term due to multiplicative noise addition to the synaptic weights.

A similar analysis can be done for dropout in the recurrent weight matrix, where the Gaussian distribution of the noise vector can be replaced by a Bernoulli distribution approximation when choosing


Appendix B Hyper Parameters for RNN Models

For each of the eight RNN models whose results are listed in Tab. 1, we generate 50 experiments with model hyper parameters chosen from the ranges given in Tab. 2.

The best configurations for all datasets are listed in Tab. 3, Tab. 4, Tab. 5 and Tab. 6.

| Group | Parameter | Range |
|---|---|---|
| Initialization parameters | for | {1e-3, 1, 1e-4} |
| | for | {1e-1, 1e-2, 1e-3} |
| | Sparsify | {15, 25, 50} |
| | limit | {0.9, 1.0, 1.1} |
| Regularizer | Regularizer | {L1, L2} |
| | Regularizer | [10e-2, 10e-4] |
| | Dropout | [0.0, 1.0] |
| Additive and multiplicative noise | for | [0.01, 0.1] |
| Optimizer (rmsprop) parameters | Momentum | {0.9, 0.95, 0.99} |
| | Step rate | {1e-2, 1e-3, 1e-4} |
| | Batch size | {27, 81} |

Table 2: RNN hyper parameter ranges
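Drawing the 50 configurations per model from the ranges in Tab. 2 could look roughly like the sketch below. The source does not state how values were drawn, so uniform sampling is an assumption: a uniform choice over each discrete set and a uniform draw over each interval. All dictionary keys are hypothetical names, since several parameter symbols are missing from the table, and a given model would only use the entries that apply to it.

```python
import random

# Hypothetical key names; the values follow Tab. 2.
# Lists are discrete sets, tuples are (low, high) intervals.
SEARCH_SPACE = {
    "init_param_a": [1e-3, 1, 1e-4],
    "init_param_b": [1e-1, 1e-2, 1e-3],
    "sparsify": [15, 25, 50],
    "limit": [0.9, 1.0, 1.1],
    "regularizer": ["L1", "L2"],
    "regularizer_strength": (10e-4, 10e-2),
    "dropout": (0.0, 1.0),
    "noise_param": (0.01, 0.1),
    "momentum": [0.9, 0.95, 0.99],
    "step_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [27, 81],
}


def sample_config(rng):
    """Draw one configuration from SEARCH_SPACE."""
    config = {}
    for name, spec in SEARCH_SPACE.items():
        if isinstance(spec, tuple):
            low, high = spec
            config[name] = rng.uniform(low, high)  # assumed uniform over the interval
        else:
            config[name] = rng.choice(spec)  # assumed uniform over the set
    return config


def sample_configs(n=50, seed=0):
    """Draw n configurations reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [sample_config(rng) for _ in range(n)]


if __name__ == "__main__":
    for config in sample_configs(3):
        print(config)
```

Fixing the seed keeps the set of sampled configurations reproducible across runs.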
| Group | Parameter | Values (in source order) |
|---|---|---|
| Initialization | for | 0.0001, 0.001, 0.001, 0.0001, 0.0001, 0.001, 0.001, 0.001 |
| | for | 0.1, 0.1, 0.001, 0.01, 0.001, 0.01, 0.001, 0.1 |
| | Sparsify | 15, 50, 50, 50, 25, 25, 50, 15 |
| | limit | 1.1, 0.9, 1.0, 0.9, 0.9, 1.0, 1.0, 0.9 |
| Regularizer | Regularizer | L2 |
| | Dropout | 0.92, 0.56, - |
| Noise | for | 0.01, 0.04, 0.06, 0.01, 0.09 |
| Optimizer | Momentum | 0.9, 0.99, 0.90, 0.95, 0.90, 0.95, 0.95, 0.90 |
| | Step rate | 0.001, 0.0001, 0.001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001 |
| | Batch size | 81, 27, 27, 81, 27, 81, 81, 81 |
| Hidden layer | # hidden | 200, 200, 200, 200, 200, 200, 200, 200 |

Table 3: Best configurations for JSB Chorales
| Group | Parameter | Values (in source order) |
|---|---|---|
| Initialization | for | 0.0001, 0.0001, 0.001, 0.0001, 0.0001, 0.001, 0.001, 0.0001 |
| | for | 0.1, 0.1, 0.01, 0.001, 0.001, 0.01, 0.1, 0.001 |
| | Sparsify | 15, 25, 25, 15, 25, 15, 25, 15 |
| | limit | 0.9, 1.1, 1.0, 1.0, 1.0, 0.9, 1.1, 1.1 |
| Regularizer | Regularizer | L2 |
| | Dropout | 0.36, 0.78 |
| Noise | for | 0.01, 0.02, 0.02, 0.06, 0.05 |
| Optimizer | Momentum | 0.95, 0.95, 0.95, 0.95, 0.95, 0.90, 0.90, 0.95 |
| | Step rate | 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.001, 0.001, 0.0001 |
| | Batch size | 81, 27, 27, 27, 27, 81, 81, 27 |
| Hidden layer | # hidden | 200, 200, 200, 200, 200, 200, 200, 200 |

Table 4: Best configurations for Nottingham
| Group | Parameter | Values (in source order) |
|---|---|---|
| Initialization | for | 0.0001, 0.001, 0.001, 0.0001, 0.0001, 0.001, 0.0001, 0.0001 |
| | for | 0.001, 0.1, 0.001, 0.001, 0.1, 0.001, 0.01, 0.1 |
| | Sparsify | 15, 25, 15, 15, 15, 15, 25, 50 |
| | limit | 0.90, 1.0, 1.0, 1.0, 0.90, 1.0, 0.90, 0.90 |
| Regularizer | Regularizer | L2 |
| | Dropout | 0.69, 0.51 |
| Noise | for | 0.05, 0.04, 0.04, 0.02, 0.08 |
| Optimizer | Momentum | 0.95, 0.95, 0.99, 0.90, 0.95, 0.90, 0.95, 0.90 |
| | Step rate | 0.0001, 0.0001, 0.0001, 0.001, 0.0001, 0.0001, 0.0001, 0.0001 |
| | Batch size | 27, 27, 81, 81, 81, 27, 81, 81 |
| Hidden layer | # hidden | 100, 100, 100, 100, 100, 100, 100, 100 |

Table 5: Best configurations for
| Group | Parameter | Values (in source order) |
|---|---|---|
| Initialization | for | 0.001, 0.0001, 0.0001, 0.001, 0.001, 0.0001, 0.0001, 0.0001 |
| | for | 0.01, 0.01, 0.1, 0.1, 0.1, 0.001, 0.1, 0.001 |
| | Sparsify | 25, 50, 15, 50, 50, 50, 25, 15 |
| | limit | 1.0, 0.9, 1.1, 1.0, 1.0, 1.1, 1.0, 0.9 |
| Regularizer | Regularizer | L1 |
| | Dropout | 0.93, 0.80 |
| Noise | for | 0.02, 0.02, 0.04, 0.09, 0.01 |
| Optimizer | Momentum | 0.90, 0.99, 0.95, 0.95, 0.90, 0.90, 0.90, 0.95 |
| | Step rate | 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001 |
| | Batch size | 81, 27, 27, 81, 81, 27, 81, 81 |
| Hidden layer | # hidden | 600, 600, 600, 600, 600, 600, 600, 600 |

Table 6: Best configurations for MuseData