Recurrent Dropout without Memory Loss

03/16/2016 ∙ by Stanislau Semeniuta, et al. ∙ Google Universität Lübeck 0

This paper presents a novel approach to recurrent neural network (RNN) regularization. Differently from the widely adopted dropout method, which is applied to forward connections of feed-forward architectures or RNNs, we propose to drop neurons directly in recurrent connections in a way that does not cause loss of long-term memory. Our approach is as easy to implement and apply as the regular feed-forward dropout and we demonstrate its effectiveness for Long Short-Term Memory network, the most popular type of RNN cells. Our experiments on NLP benchmarks show consistent improvements even when combined with conventional feed-forward dropout.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


Tensorflow RNN for spanish sentiment analysis

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Recurrent Neural Networks, LSTMs in particular, have recently become a popular tool among NLP researchers for their superior ability to model and learn from sequential data. These models have shown state-of-the-art results on various public benchmarks ranging from sentence classification [Wang et al.2015, Irsoy and Cardie2014, Liu et al.2015] and various tagging problems [Dyer et al.2015] to language modelling [Kim et al.2015, Zhang et al.2015]

, text generation 

[Zhang and Lapata2014] and sequence-to-sequence prediction tasks [Sutskever et al.2014].

Having shown excellent ability to capture and learn complex linguistic phenomena, RNN architectures are prone to overfitting. Among the most widely used techniques to avoid overfitting in neural networks is the dropout regularization [Hinton et al.2012]

. Since its introduction it has become, together with the L2 weight decay, the standard method for neural network regularization. While showing significant improvements when used in feed-forward architectures, e.g., Convolutional Neural Networks 

[Krizhevsky et al.2012], the application of dropout in RNNs has been somewhat limited. Indeed, so far dropout in RNNs has been applied in the same fashion as in feed-forward architectures: it is typically injected in input-to-hidden and hidden-to-output connections, i.e., along the input axis, but not between the recurrent connections (time axis). Given that RNNs are mainly used to model sequential data with the goal of capturing short- and long-term interactions, it seems natural to also regularize the recurrent weights. This observation has led us and other researchers [Moon et al.2015, Gal2015] to the idea of applying dropout to the recurrent connections in RNNs.

In this paper we propose a novel recurrent dropout technique and demonstrate how our method is superiour to other recurrent dropout methods recently proposed in [Moon et al.2015, Gal2015]. Additionally, we answer the following questions which helps to understand how to best apply recurrent dropout: (i) how to apply the dropout in recurrent connections of the LSTM architecture in a way that prevents possible corruption of the long-term memory; (ii) what is the relationship between our recurrent dropout and the widely adopted dropout in input-to-hidden and hidden-to-output connections; (iii) how the dropout mask in RNNs should be sampled: once per step or once per sequence. The latter question of sampling the mask appears to be crucial in some cases to make the recurrent dropout work and, to the best of our knowledge, has received very little attention in the literature. Our work is the first one to provide empirical evaluation of the differences between these two sampling approaches.

Regarding empirical evaluation, we first highlight the problem of information loss in memory cells of LSTMs when applying recurrent dropout. We demonstrate that previous approaches of dropping hidden statevectors cause loss of memory while our proposed method to use dropout mask in hidden state update

vectors does not suffer from this problem. We experiment on three widely adopted NLP tasks: word- and character-level Language Modeling and Named Entity Recognition. The results demonstrate that our

recurrent dropout helps to achieve better regularization and yields improvements across all the tasks, even when combined with the conventional feed-forward dropout. Furthermore, we compare our dropout scheme with the recently proposed alternative recurrent dropout methods and show that our technique is superior in almost all cases.

2 Related Work

Neural Network models often suffer from overfitting, especially when the number of network parameters is large and the amount of training data is small. This has led to a lot of research directed towards improving their generalization ability. Below we primarily discuss some of the methods aimed at improving regularization of RNNs.

DBLP:journals/corr/PhamKL13 and DBLP:journals/corr/ZarembaSV14 have shown that LSTMs can be effectively regularized by using dropout in forward connections. While this already allows for effective regularization of recurrent networks, it is intuitive that introducing dropout also in the hidden state may force it to create more robust representations. Indeed, moon2015rnndrop have extended the idea of dropping neurons in forward direction and proposed to drop cell states as well showing good results on a Speech Recognition task. Bluche2015b carry out a study to find where dropout is most effective, e.g. input-to-hidden or hidden-to-output connections. The authors conclude that it is more beneficial to use it once in the correct spot, rather than to put it everywhere. DBLP:journals/corr/BengioVJS15 have proposed an algorithm called scheduled sampling to improve performance of recurrent networks on sequence-to-sequence labeling tasks. A disadvantage of this work is that the scheduled sampling is specifically tailored to this kind of tasks, what makes it impossible to use in, for example, sequence-to-label tasks. gal2015dropout uses insights from variational Bayesian inference to propose a variant of LSTM with dropout that achieves consistent improvements over a baseline architecture without dropout.

The main contribution of this paper is a new recurrent dropout technique, which is most useful in gated recurrent architectures such as LSTMs and GRUs. We demonstrate that applying dropout to arbitrary vectors in LSTM cells may lead to loss of memory thus hindering the ability of the network to encode long-term information. In other words, our technique allows for adding a strong regularizer on the model weights responsible for learning short and long-term dependencies without affecting the ability to capture long-term relationships, which are especially important to model when dealing with natural language. Finally, we compare our method with alternative recurrent dropout methods recently introduced in [Moon et al.2015, Gal2015] and demonstrate that our method allows to achieve better results.

3 Recurrent Dropout

In this section we first show how the idea of feed-forward dropout [Hinton et al.2012] can be applied to recurrent connections in vanilla RNNs. We then introduce our recurrent dropout method specifically tailored for gated architectures such as LSTMs and GRUs. We draw parallels and contrast our approach with alternative recurrent dropout techniques recently proposed in [Moon et al.2015, Gal2015] showing that our method is favorable when considering potential memory loss issues in long short-term architectures.

3.1 Dropout in vanilla RNNs

Vanilla RNNs process the input sequences as follows:


where is the input at time step ; and are hidden vectors that encode the current and previous states of the network; is parameter matrix that models input-to-hidden and hidden-to-hidden (recurrent) connections; is a vector of bias terms, and

is the activation function.

As RNNs model sequential data by a fully-connected layer, dropout can be applied by simply dropping the previous hidden state of a network. Specifically, we modify Equation 1 in the following way:


where is the dropout function defined as follows:


where is the dropout rate and

is a vector, sampled from the Bernoulli distribution with success probability


3.2 Dropout in LSTM networks

(a) Moon et al., 2015
(b) Gal, 2015
(c) Ours
Figure 1: Illustration of the three types of dropout in recurrent connections of LSTM networks. Dashed arrows refer to dropped connections. Input connections are omitted for clarity.

Long Short-Term Memory networks [Hochreiter and Schmidhuber1997] have introduced the concept of gated inputs in RNNs, which effectively allow the network to preserve its memory over a larger number of time steps during both forward and backward passes, thus alleviating the problem of vanishing gradients [Bengio et al.1994]. Formally, it is expressed with the following equations:


where are input, output and forget gates at step ; is the vector of cell updates and is the updated cell vector used to update the hidden state ;

is the sigmoid function and

is the element-wise multiplication.

gal2015dropout proposes to drop the previous hidden state when computing values of gates and updates of the current step, where he samples the dropout mask once for every sequence:


moon2015rnndrop propose to apply dropout directly to the cell values and use per-sequence sampling as well:


In contrast to dropout techniques proposed by gal2015dropout and moon2015rnndrop, we propose to apply dropout to the cell update vector as follows:


Different from methods of [Moon et al.2015, Gal2015], our approach does not require sampling of the dropout masks once for every training sequence. On the contrary, as we will show in Section 4, networks trained with a dropout mask sampled per-step achieve results that are at least as good and often better than per-sequence sampling. Figure 1 shows differences between approaches to dropout.

The approach of [Gal2015] differs from ours in the overall strategy – they consider network’s hidden state as input to subnetworks that compute gate values and cell updates and the purpose of dropout is to regularize these subnetworks. Our approach considers the architecture as a whole with the hidden state as its key part and regularize the whole network. The approach of [Moon et al.2015] on the other hand is seemingly similar to ours. In Section 3.2 we argue that our method is a more principled way to drop recurrent connections in gated architectures.

It should be noted that while being different, the three discussed dropout schemes are not mutually exclusive. It is in general possible to combine our approach and the other two. We expect the merge of our scheme and that of [Gal2015] to hold the biggest potential. The relations between recurrent dropout schemes are however out of scope of this paper and we rather focus on studying the relationships of different dropout approaches with the conventional forward dropout.

Gated Recurrent Unit (GRU) networks are a recently introduced variant of a recurrent network with hidden state protected by gates [Cho et al.2014]. Different from LSTMs, GRU networks use only two gates and to update the cell’s hidden state :


Similarly to the LSTMs, we propoose to apply dropout to the hidden state updates vector :


We found that an intuitive idea to drop previous hidden states directly, as proposed in moon2015rnndrop, produces mixed results. We have observed that it helps the network to generalize better when not coupled with the forward dropout, but is usually no longer beneficial when used together with a regular forward dropout.

The problem is caused by the scaling of neuron activations during inference. Consider the hidden state update rule in the test phase of an LSTM network. For clarity, we assume every gate to be equal to 1:


where are update vectors computed by Eq. 4 and is the probability to not drop a neuron. As was, in turn, computed using the same rule, we can rewrite this equation as:


Recursively expanding for every timestep results in the following equation:


Pushing inside parenthesis, Eq. 16 can be written as:


Since is a value between zero and one, sum components that are far away in the past are multiplied by a very low value and are effectively removed from the summation. Thus, even though the network is able to learn long-term dependencies, it is not capable of exploiting them during test phase. Note that our assumption of all gates being equal to 1 helps the network to preserve hidden state, since in a real network gate values lie within (0, 1) interval. In practice trained networks tend to saturate gate values [Karpathy et al.2015] what makes gates to behave as binary switches. The fact that moon2015rnndrop have achieved an improvement can be explained by the experimentation domain. DBLP:journals/corr/LeJH15 have proposed a simple yet effective way to initialize vanilla RNNs and reported that they have achieved a good result in the Speech Recognition domain while having an effect similar to the one caused by Eq. 17. One can reduce the influence of this effect by selecting a low dropout rate. This solution however is partial, since it only increases the number of steps required to completely forget past history and does not remove the problem completely.

One important note is that the dropout function from Eq. 3 can be implemented as:


In this case the above argument holds as well, but instead of observing exponentially decreasing hidden states during testing, we will observe exponentially increasing values of hidden states during training.

Our approach addresses the problem discussed previously by dropping the update vectors . Since we drop only candidates, we do not scale the hidden state directly. This allows for solving the scaling issue, as Eq. 17 becomes:


Moreover, since we only drop differences that are added to the network’s hidden state at each time-step, this dropout scheme allows us to use per-step mask sampling while still being able to learn long-term dependencies. Thus, our approach allows to freely apply dropout in the recurrent connections of a gated network without hindering its ability to process long-term relationships.

We note that the discussed problem does not affect vanilla RNNs because they overwrite their hidden state at every timestep. Lastly, the approach of gal2015dropout is not affected by the issue as well.

4 Experiments

First, we empirically demonstrate the issues linked to memory loss when using various dropout techniques in recurrent nets (see Sec. 3.2). For this purpose we experiment with training LSTM networks on one of the synthetic tasks from [Hochreiter and Schmidhuber1997], specifically the Temporal Order task. We then validate the effectiveness of our recurrent dropout when applied to vanilla RNNs, LSTMs and GRUs on three diverse public benchmarks: Language Modelling, Named Entity Recognition, and Twitter Sentiment classification.

4.1 Synthetic Task


In this task the input sequences are generated as follows: all but two elements in a sequence are drawn randomly from {C, D} and the remaining two symbols from {A, B}. Symbols from {A, B} can appear at any position in the sequence. The task is to classify a sequence into one of four classes ({AA, AB, BA, BB}) based on the order of the symbols. We generate data so that every sequence is split into three parts with the same size and emit one meaningful symbol in first and second parts of a sequence. The prediction is taken after the full sequence has been processed. We use two modes in our experiments:

Short with sequences of length 15 and Medium with sequences of length 30.

Setup. We use LSTM with one layer that contains 256 hidden units and recurrent dropout

with 0.5 strength. Network is trained by SGD with a learning rate of 0.1 for 5k epochs. The networks are trained on 200 mini-batches with 32 sequences and tested on 10k sequences.

Sampling moon2015rnndrop gal2015dropout; Ours
short sequences medium sequences short sequences medium sequences
Train Test Train Test Train Test Train Test
per-step 100% 100% 25% 25% 100% 100% 100% 100%
per-sequence 100% 25% 100% <25% 100% 100% 100% 100%
Table 1: Accuracies on the Temporal Order task.
Dropout rate Sampling moon2015rnndrop gal2015dropout Ours
Valid Test Valid Test Valid Test
0.0 130.0 125.2 130.0 125.2 130.0 125.2
0.25 per-step 113.0 108.7 119.8 114.2 106.1 100.0
0.5 per-step 124.0 116.5 118.3 112.5 102.8 98.0
0.25 per-sequence 121.0 113.0 120.5 114.0 106.3 100.7
0.5 per-sequence 137.7 126.2 125.2 117.9 103.2 96.8
0.0 94.1 89.5 94.1 89.5 94.1 89.5
0.25 per-step 113.5 105.8 92.9 88.4 91.6 87.0
0.5 per-step 140.6 130.1 98.6 92.5 100.6 95.5
0.25 per-sequence 105.7 99.9 94.5 89.7 92.4 87.6
0.5 per-sequence 125.4 117.4 98.4 92.5 107.8 101.8
Table 2: Perplexity scores of the LSTM network on word level Language Modeling task (lower is better). Upper and lower parts of the table report results without and with forward dropout respectively. Networks with forward dropout use 0.2 and 0.5 dropout rates in input and output connections respectively. Values in bold show best results for each of the recurrent dropout schemes with and without forward dropout.

Results. Table 1 reports the results on the Temporal Order task when recurrent dropout is applied using our method and methods from [Moon et al.2015] and  [Gal2015]. Using dropout from [Moon et al.2015] with per-sequence sampling, networks are able to discover the long-term dependency, but fail to use it on the test set due to the scaling issue. Interestingly, in Medium case results on the test set are worse than random. Networks trained with per-step sampling exhibit different behaviour: in Short case they are capable of capturing the temporal dependency and generalizing to the test set, but require 10-20 times more iterations to do so. In Medium case these networks do not fit into the allocated number of iterations. This suggests that applying dropout to hidden states as suggested in [Moon et al.2015] corrupts memory cells hindering the long-term memory capacity of LSTMs.

In contrast, using our recurrent dropout methods, networks are able to solve the problem in all cases. We have also ran the same experiments for longer sequences, but found that the results are equivalent to the Medium case. We also note that the approach of [Gal2015] does not seem to exhibit the memory loss problem.

4.2 Word Level Language Modeling

Data. Following MikolovPTB we use the Penn Treebank Corpus to train our Language Modeling (LM) models. The dataset contains approximately 1 million words and comes with pre-defined training, validation and test splits, and a vocabulary of 10k words.


In our LM experiments we use recurrent networks with a single layer with 256 cells. Network parameters were initialized uniformly in [-0.05, 0.05]. For training, we use plain SGD with batch size 32 with the maximum norm gradient clipping

[Pascanu et al.2013]

. Learning rate, clipping threshold and number of Backpropagation Through Time (BPTT) steps were set to 1, 10 and 35 respectively. For the learning rate decay we use the following strategy: if the validation error does not decrease after each epoch, we divide the learning rate by 1.5. The aforementioned choices were largely guided by the work of DBLP:journals/corr/MikolovJCMR14. To ease reproducibility of our results on the LM and synthetic tasks, we have released the source code of our experiments


Results. Table 2 reports the results for LSTM networks. We also present results when the dropout is applied directly to hidden states as in [Moon et al.2015] and results of networks trained with the dropout scheme of [Gal2015]. We make the following observations: (i) our approach shows better results than the alternatives; (ii) per-step mask sampling is better when dropping hidden state directly; (iii) on this task our method using per-step sampling seems to yield results similar to per-sequence sampling; (iv) in this case forward dropout yields better results than any of the three recurrent dropouts; and finally (v) both our approach and that of [Gal2015] are effective when combined with the forward dropout, though ours is more effective.

We make the following observations: (i) dropping hidden state updates yields better results than dropping hidden states; (ii) per-step mask sampling is better when dropping hidden state directly; (iii) contrary to our expectations, when we apply dropout to hidden state updates per-step sampling seems to yield results similar to per-sequence sampling; (iv) applying dropout to hidden state updates rather than hidden states in some cases leads to a perplexity decrease by more than 30 points; and finally (v) our approach is effective even when combined with the forward dropout – for LSTMs we are able to bring down perplexity on the validation set from 130 to 91.6.

Figure 2: Learning curves of LSTM networks when training without and with 0.25 per-step recurrent dropout. Solid and dashed lines show training and validation errors respectively. Best viewed in color.

To demonstrate the effect of our approach on the learning process, we also present learning curves of LSTM networks trained with and without recurrent dropout (Fig. 2). Models trained using our recurrent dropout scheme have slower convergence than models without dropout and usually have larger training error and lower validation errors. This behaviour is consistent with what is expected from a regularizer and is similar to the effect of the feed-forward dropout applied to non-recurrent networks [Hinton et al.2012].

4.3 Character Level Language Modeling

Dropout rate Sampling moon2015rnndrop gal2015dropout Ours
Valid Test Valid Test Valid Test
0.0 1.460 1.457 1.460 1.457 1.460 1.457
0.25 per-step 1.435 1.394 1.345 1.308 1.338 1.301
0.5 per-step 1.610 1.561 1.387 1.348 1.355 1.316
0.25 per-sequence 1.433 1.390 1.341 1.304 1.356 1.319
0.5 per-sequence 1.691 1.647 1.408 1.369 1.496 1.450
0.0 1.362 1.326 1.362 1.326 1.362 1.326
0.25 per-step 1.471 1.428 1.381 1.344 1.358 1.321
0.5 per-step 1.668 1.622 1.463 1.425 1.422 1.380
0.25 per-sequence 1.455 1.413 1.387 1.348 1.403 1.363
0.5 per-sequence 1.681 1.637 1.477 1.435 1.567 1.522
Table 3: Bit-per-character scores of the LSTM network on character level Language Modelling task (lower is better). Upper and lower parts of the table report results without and with forward dropout respectively. Networks with forward dropout use 0.2 and 0.5 dropout rates in input and output connections respectively. Values in bold show best results for each of the recurrent dropout schemes with and without forward dropout.

Data. We train our networks on the dataset described in the previous section. It contains approximately 6 million characters, and a vocabulary of 50 characters. We use the provided partitions train, validation and test partitions.

Setup. We use networks with 1024 units to solve the character level LM task. The characters are embedded into 256 dimensional space before being processed by the LSTM. All parameters of the networks are initialized uniformly in [-0.01, 0.01]. We train our networks on non-overlapping sequences of 100 characters. The networks are trained with the Adam [DBLP:journals/corr/KingmaB14] algorithm with initial learning rate of 0.001 for 50 epochs. We decrease the learning rate by 0.97 after every epoch starting from epoch 10. To avoid exploding gradints, we use MaxNorm gradient clipping with threshold set to 10.

Results. Results of our experiments are given in Table 3. Note that on this task regularizing only the recurrent connections is more beneficial than only the forward ones. In particular, LSTM networks trained with our approach and the approach of [Gal2015] yield a lower bit-per-character (bpc) score than those trained with forward dropout onlyWe attribute it to pronounced long term dependencies. In addition, our approach is the only one that improves over baseline LSTM with forward dropout. The overall best result is achieved by a network trained with our dropout with 0.25 dropout rate and per-step sampling, closely followed by network with gal2015dropout dropout.

4.4 Named Entity Recognition

Data. To assess our recurrent Named Entity Recognition (NER) taggers when using recurrent dropout we use a public benchmark from CONLL 2003 [Tjong Kim Sang and De Meulder2003]. The dataset contains approximately 300k words split into train, validation and test partitions. Each word is labeled with either a named entity class it belongs to, such as Location or Person, or as being not named. The majority of words are labeled as not named entities. The vocabulary size is about 22k words.

Setup. Previous state-of-the-art NER systems have shown the importance of using word context features around entities. Hence, we slightly modify the architecture of our recurrent networks to consume the context around the target word by simply concatenating their embeddings. The size of the context window is fixed to 5 words (the word to be labeled, two words before and two words after). The recurrent layer size is 1024 units. The network inputs include only word embeddings (initialized with pretrained word2vec embeddings [Mikolov et al.2013]

and kept static) and capitalization features. For training we use the RMSProp algorithm 

[Dauphin et al.2015] with fixed at 0.9 and a learning rate of 0.01 and multiply the learning rate by 0.99 after every epoch. We also combine our recurrent dropout (with per-sequence mask sampling) with the conventional forward dropout with the rate 0.2 in input and 0.5 in output connections. Lastly, we found that using nonlinearity resulted in higher performance than .

To speed up the training we use a length expansion approach described in [Ng et al.2015], where training is performed in two stages: (i) we first sample short 5-words input sequences with their contexts and train for 25 epochs; (ii) we fine tune the network on input 15-words sequences for 10 epochs. We found that further fine tuning on longer sequences yielded negligible improvements. Such strategy allows us to significantly speed up the training when compared to training from scratch on full-length input sentences. We use full sentences for testing.

Dropout rate RNN LSTM GRU
5 word long sequences
0.0 85.60 85.32 86.00
0.25 86.48 86.42 86.70
15 word long sequences
0.0 86.12 86.12 86.44
0.25 86.95 86.88 87.10
Table 4: F1 scores (higher is better) on NER task.

Results. F1 scores of our taggers are reported in Table 4 when trained on short 5-word and longer 15-word input sequences. We note that the gap between networks trained with and without our dropout scheme is larger for networks trained on shorter sequences. It suggests that dropout in recurrent connections might have an impact on how well a network generalizes to sequences that are longer than the ones used during training. The gain from using recurrent dropout is larger for the LSTM network. We have experimented with higher recurrent dropout rates, but found that it led to excessive regularization.

4.5 Twitter Sentiment Analysis

Dropout rate Twitter13 LiveJournal14 Twitter15 Twitter14 SMS13 Sarcasm14
0 67.54 71.20 59.35 68.90 64.51 53.58
0.25 66.35 67.70 59.91 67.76 64.46 48.74
0 67.97 69.82 57.84 67.95 61.47 53.49
0.25 69.11 71.39 61.35 68.08 65.45 53.80
0 67.09 70.80 59.07 67.02 67.11 51.01
0.25 69.04 72.10 60.34 69.65 65.73 54.77
Table 5: F1 scores (higher is better) on Sentiment Evaluation task

Data. We use Twitter sentiment corpus from SemEval-2015 Task 10 (subtask B) [Rosenthal et al.2015]. It contains 15k labeled tweets split into training and validation partitions. The total number of words is approximately 330k and the vocabulary size is 22k. The task consists of classifying a tweet into three classes: positive, neutral, and negative. Performance of a classifier is measured by the average of F1 scores of positive and negative classes. We evaluate our models on a number of datasets that were used for benchmarking during the last years.

Setup. We use recurrent networks in the standard sequence labeling manner - we input words to a network one by one and take the label at the last step. Similarly to [Severyn and Moschitti2015], we use 1 million of weakly labeled tweets to pre-train our networks. We use networks composed of 500 neurons in all cases. Our models are trained with the RMSProp algorithm with a learning rate of 0.001. We use our recurrent dropout regularization with per-step mask sampling. All the other settings are equivalent to the ones used in the NER task.

Results. The results of these experiments are presented in Table 5. Note that in this case our algorithm decreases the performance of the vanilla RNNs while this is not the case for LSTM and GRU networks. This is due to the nature of the problem: differently from LM and NER tasks, a network needs to aggregate information over a long sequence. Vanilla RNNs notoriously have difficulties with this and our dropout scheme impairs their ability to remember even further. The best result over most of the datasets is achieved by the GRU network with recurrent dropout. The only exception is the Twitter2015 dataset, where the LSTM network shows better results.

5 Conclusions

This paper presents a novel recurrent dropout method specifically tailored to the gated recurrent neural networks. Our approach is easy to implement and is even more effective when combined with conventional forward dropout. We have shown that for LSTMs and GRUs applying dropout to arbitrary cell vectors results in suboptimal performance. We discuss in detail the cause of this effect and propose a simple solution to overcome it. The effectiveness of our approach is verified on three different public NLP benchmarks.

Our findings along with our empirical results allow us to answer the questions posed in Section 1: i) while is straight-forward to use dropout in vanilla RNNs due to their strong similarity with the feed-forward architectures, its application to LSTM networks is not so straightforward. We demonstrate that recurrent dropout is most effective when applied to hidden state update vectors in LSTMs rather than to hidden states; (ii) we observe an improvement in the network’s performance when our recurrent dropout is coupled with the standard forward dropout, though the extent of this improvement depends on the values of dropout rates; (iii) contrary to our expectations, networks trained with per-step and per-sequence mask sampling produce similar results when using our recurrent dropout method, both being better than the dropout scheme proposed by moon2015rnndrop.

While our experimental results show that applying recurrent dropout method leads to significant improvements across various NLP benchmarks (especially when combined with conventional forward dropout), its benefits for other tasks, e.g., sequence-to-sequence prediction, or other domains, e.g., Speech Recognition, remain unexplored. We leave it as our future work.


This project has received funding from the European Union’s Framework Programme for Research and Innovation HORIZON 2020 (2014-2020) under the Marie Skłodowska-Curie Agreement No. 641805. Stanislau Semeniuta thanks the support from Pattern Recognition Company GmbH. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU used for this research.