We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. Dropout, the most successful technique for regularizing neural networks, does not work well with RNNs and LSTMs. In this paper, we show how to correctly apply dropout to LSTMs, and show that it substantially reduces overfitting on a variety of tasks. These tasks include language modeling, speech recognition, image caption generation, and machine translation.
The Recurrent Neural Network (RNN) is a neural sequence model that achieves state-of-the-art performance on important tasks that include language modeling Mikolov (2012), speech recognition Graves et al. (2013), and machine translation Kalchbrenner & Blunsom (2013). It is known that successful applications of neural networks require good regularization. Unfortunately, dropout Srivastava (2013), the most powerful regularization method for feedforward neural networks, does not work well with RNNs. As a result, practical applications of RNNs often use models that are too small because large RNNs tend to overfit. Existing regularization methods give relatively small improvements for RNNs Graves (2013). In this work, we show that dropout, when correctly used, greatly reduces overfitting in LSTMs, and evaluate it on three different problems.
The code for this work can be found in https://github.com/wojzaremba/lstm.
Dropout Srivastava (2013) is a recently introduced regularization method that has been very successful with feed-forward neural networks. While much work has extended dropout in various ways Wang & Manning (2013); Wan et al. (2013), there has been relatively little research in applying it to RNNs. The only paper on this topic is by Bayer et al. (2013), who focus on “marginalized dropout” Wang & Manning (2013), a noiseless deterministic approximation to standard dropout. Bayer et al. (2013) claim that conventional dropout does not work well with RNNs because the recurrence amplifies noise, which in turn hurts learning. In this work, we show that this problem can be fixed by applying dropout to a certain subset of the RNNs’ connections. As a result, RNNs can now also benefit from dropout.
Independently of our work, Pham et al. (2013) developed the very same RNN regularization method and applied it to handwriting recognition. We rediscovered this method and demonstrated strong empirical results over a wide range of problems. Other work that applied dropout to LSTMs is Pachitariu & Sahani (2013).
There have been a number of architectural variants of the RNN that perform better on problems with long term dependencies Hochreiter & Schmidhuber (1997); Graves et al. (2009); Cho et al. (2014); Jaeger et al. (2007); Koutník et al. (2014); Sundermeyer et al. (2012). In this work, we show how to correctly apply dropout to LSTMs, the most commonly-used RNN variant; this way of applying dropout is likely to work well with other RNN architectures as well.
In this paper, we consider the following tasks: language modeling, speech recognition, and machine translation. Language modeling is the first task where RNNs have achieved substantial success Mikolov et al. (2010, 2011); Pascanu et al. (2013). RNNs have also been successfully used for speech recognition Robinson et al. (1996); Graves et al. (2013) and have recently been applied to machine translation, where they are used for language modeling, re-ranking, or phrase modeling Devlin et al. (2014); Kalchbrenner & Blunsom (2013); Cho et al. (2014); Chow et al. (1987); Mikolov et al. (2013).
In this section we describe the deep LSTM (Section 3.1). Next, we show how to regularize them (Section 3.2), and explain why our regularization scheme works.
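We let subscripts denote timesteps and superscripts denote layers. All our states are $n$-dimensional. Let $h_t^l \in \mathbb{R}^n$ be a hidden state in layer $l$ in timestep $t$. Moreover, let $T_{n,m}: \mathbb{R}^n \rightarrow \mathbb{R}^m$ be an affine transform ($Wx + b$ for some $W$ and $b$). Let $\odot$ be element-wise multiplication and let $h_t^0$ be an input word vector at timestep $t$. We use the activations $h_t^L$ to predict $y_t$, since $L$ is the number of layers in our deep LSTM.
The RNN dynamics can be described using deterministic transitions from previous to current hidden states. The deterministic state transition is a function

$$\mathrm{RNN}: h_t^{l-1}, h_{t-1}^{l} \rightarrow h_t^{l}$$

For classical RNNs, this function is given by

$$h_t^{l} = f\left(T_{n,n} h_t^{l-1} + T_{n,n} h_{t-1}^{l}\right), \quad \text{where } f \in \{\mathrm{sigm}, \tanh\}$$

The LSTM has complicated dynamics that allow it to easily “memorize” information for an extended number of timesteps. The “long term” memory is stored in a vector of memory cells $c_t^l \in \mathbb{R}^n$. Although there are many LSTM architectures that differ in their connectivity structure and activation functions, all LSTM architectures have explicit memory cells for storing information for long periods of time. The LSTM can decide to overwrite the memory cell, retrieve it, or keep it for the next time step. The LSTM architecture used in our experiments is given by the following equations Graves et al. (2013):

$$\mathrm{LSTM}: h_t^{l-1}, h_{t-1}^{l}, c_{t-1}^{l} \rightarrow h_t^{l}, c_t^{l}$$

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T_{2n,4n} \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^{l} \end{pmatrix}$$

$$c_t^{l} = f \odot c_{t-1}^{l} + i \odot g$$

$$h_t^{l} = o \odot \tanh(c_t^{l})$$

In these equations, $\mathrm{sigm}$ and $\tanh$ are applied element-wise. Figure 1 illustrates the LSTM equations.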
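As a rough illustration of the LSTM transition described above, the following NumPy sketch computes one timestep for one layer. The function name `lstm_step`, the packing of the four gates into a single weight matrix `W` of shape `(4n, 2n)`, and the gate ordering (input, forget, output, candidate) are assumptions made for this example, not details taken from the released code.

```python
import numpy as np

def sigm(x):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_below, h_prev, c_prev, W, b):
    """One LSTM timestep for a single layer (illustrative sketch).

    h_below: (n,) hidden state of the layer below at the current timestep
    h_prev:  (n,) hidden state of this layer at the previous timestep
    c_prev:  (n,) memory cells of this layer at the previous timestep
    W, b:    affine transform from 2n inputs to 4n gate pre-activations
             (assumed layout: rows ordered as input, forget, output, candidate)
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_below, h_prev]) + b
    i = sigm(z[0:n])             # input gate
    f = sigm(z[n:2 * n])         # forget gate
    o = sigm(z[2 * n:3 * n])     # output gate
    g = np.tanh(z[3 * n:4 * n])  # candidate cell update
    c = f * c_prev + i * g       # new memory cells
    h = o * np.tanh(c)           # new hidden state
    return h, c
```

Stacking layers amounts to feeding the returned `h` of one layer as `h_below` of the next layer at the same timestep.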
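The main contribution of this paper is a recipe for applying dropout to LSTMs in a way that successfully reduces overfitting. The main idea is to apply the dropout operator only to the non-recurrent connections (Figure 2). The following equations describe it more precisely, where $\mathbf{D}$ is the dropout operator that sets a random subset of its argument to zero:

$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T_{2n,4n} \begin{pmatrix} \mathbf{D}(h_t^{l-1}) \\ h_{t-1}^{l} \end{pmatrix}$$

$$c_t^{l} = f \odot c_{t-1}^{l} + i \odot g$$

$$h_t^{l} = o \odot \tanh(c_t^{l})$$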
Our method works as follows. The dropout operator corrupts the information carried by the units, forcing them to perform their intermediate computations more robustly. At the same time, we do not want to erase all the information from the units. It is especially important that the units remember events that occurred many timesteps in the past. Figure 3 shows how information could flow from an event that occurred at timestep $t-2$ to the prediction in timestep $t+2$ in our implementation of dropout. We can see that the information is corrupted by the dropout operator exactly $L+1$ times, and this number is independent of the number of timesteps traversed by the information. Standard dropout perturbs the recurrent connections, which makes it difficult for the LSTM to learn to store information for long periods of time. By not using dropout on the recurrent connections, the LSTM can benefit from dropout regularization without sacrificing its valuable memorization ability.
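The recipe above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the names `dropout` and `lstm_step_with_dropout` are hypothetical, the gate layout in `W` is assumed, and the rescaling of surviving units by `1 / (1 - p)` ("inverted dropout") is a common convention assumed here rather than something the text specifies. The point it shows is that only the input arriving from the layer below is masked, while the recurrent state passes through untouched.

```python
import numpy as np

def sigm(x):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def dropout(x, p, rng):
    """Zero each element with probability p and rescale the survivors.

    The 1 / (1 - p) rescaling is an assumption (inverted dropout).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def lstm_step_with_dropout(h_below, h_prev, c_prev, W, b, p, rng):
    """LSTM timestep with dropout on the non-recurrent input only.

    W has shape (4n, 2n) and b has shape (4n,); rows are assumed to be
    ordered as input, forget, output, candidate.
    """
    n = h_prev.shape[0]
    x = dropout(h_below, p, rng)  # non-recurrent connection: dropped
    # h_prev is the recurrent connection: deliberately NOT dropped
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigm(z[0:n])
    f = sigm(z[n:2 * n])
    o = sigm(z[2 * n:3 * n])
    g = np.tanh(z[3 * n:4 * n])
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```

At evaluation time one would call the same function with `p = 0.0`, which leaves the input unchanged.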
We present results in four domains: language modeling (Section 4.1), speech recognition (Section 4.2), machine translation (Section 4.3), and image caption generation (Section 4.4).
We conducted word-level prediction experiments on the Penn Tree Bank (PTB) dataset Marcus et al. (1993), which consists of 929k training words, 73k validation words, and 82k test words. It has 10k words in its vocabulary. We downloaded it from Tomas Mikolov’s webpage (http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz). We trained regularized LSTMs of two sizes; these are denoted the medium LSTM and large LSTM. Both LSTMs have two layers and are unrolled for 35 steps. We initialize the hidden states to zero. We then use the final hidden states of the current minibatch as the initial hidden state of the subsequent minibatch (successive minibatches sequentially traverse the training set). The size of each minibatch is 20.
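One way to arrange the data so that the final hidden states of one minibatch are valid initial states for the next is sketched below. The generator name `sequential_batches` and its exact slicing are assumptions made for illustration: the corpus is cut into `batch_size` contiguous rows, and each successive minibatch takes the next `num_steps` tokens from every row.

```python
import numpy as np

def sequential_batches(token_ids, batch_size, num_steps):
    """Yield (inputs, targets) pairs, each of shape (batch_size, num_steps).

    Row r of successive batches walks contiguously through the r-th slice
    of the corpus, so the hidden state left over for row r after one batch
    can initialize row r in the next batch.
    """
    data = np.asarray(token_ids)
    row_len = len(data) // batch_size
    # Drop the tail that does not fill a full row, then lay rows side by side.
    data = data[: batch_size * row_len].reshape(batch_size, row_len)
    # Targets are inputs shifted by one token, hence the "- 1".
    num_batches = (row_len - 1) // num_steps
    for k in range(num_batches):
        start = k * num_steps
        x = data[:, start:start + num_steps]
        y = data[:, start + 1:start + num_steps + 1]
        yield x, y
```

A training loop would start from zero states, run the model on each `(x, y)` pair in order, and pass the states returned for one pair into the call for the next pair.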
Model | Validation set | Test set |
---|---|---|
A single model | | |
Pascanu et al. (2013) | | 107.5 |
Cheng et al. | | 100.0 |
non-regularized LSTM | 120.7 | 114.5 |
Medium regularized LSTM | 86.2 | 82.7 |
Large regularized LSTM | 82.2 | 78.4 |
Model averaging | | |
Mikolov (2012) | | 83.5 |
Cheng et al. | | 80.6 |
2 non-regularized LSTMs | 100.4 | 96.1 |
5 non-regularized LSTMs | 87.9 | 84.1 |
10 non-regularized LSTMs | 83.5 | 80.0 |
2 medium regularized LSTMs | 80.6 | 77.0 |
5 medium regularized LSTMs | 76.7 | 73.3 |
10 medium regularized LSTMs | 75.2 | 72.0 |
2 large regularized LSTMs | 76.9 | 73.6 |
10 large regularized LSTMs | 72.8 | 69.5 |
38 large regularized LSTMs | 71.9 | 68.7 |
Model averaging with dynamic RNNs and n-gram models | | |
Mikolov & Zweig (2012) | | 72.9 |
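The figures in this table are word-level perplexities. As a hedged sketch of how such numbers can be computed, the code below evaluates perplexity from the probabilities a model assigns to the correct next words, and combines several models by taking the arithmetic mean of their probabilities. The averaging rule and the function names are assumptions for illustration; the text does not spell out how the ensemble predictions were combined.

```python
import numpy as np

def perplexity(target_probs):
    """Perplexity = exp of the mean negative log-probability of the targets."""
    p = np.asarray(target_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(p))))

def ensemble_perplexity(per_model_target_probs):
    """Perplexity of an ensemble that averages model probabilities.

    per_model_target_probs: array of shape (num_models, num_tokens), where
    entry [m, k] is the probability model m assigns to the k-th correct word.
    Arithmetic averaging across models is an assumption of this sketch.
    """
    probs = np.asarray(per_model_target_probs, dtype=float)
    return perplexity(probs.mean(axis=0))
```

Averaging tends to help when the models make different mistakes: two models that are each confident on different tokens yield an average that is moderately confident on all of them.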
The medium LSTM has 650 units per layer and its parameters are initialized uniformly in $[-0.05, 0.05]$. As described earlier, we apply 50% dropout on the non-recurrent connections. We train the LSTM for 39 epochs with a learning rate of 1, and after 6 epochs we decrease it by a factor of 1.2 after each epoch. We clip the norm of the gradients (normalized by minibatch size) at 5. Training this network takes about half a day on an NVIDIA K20 GPU.
The large LSTM has 1500 units per layer and its parameters are initialized uniformly in $[-0.04, 0.04]$. We apply 65% dropout on the non-recurrent connections. We train the model for 55 epochs with a learning rate of 1; after 14 epochs we start to reduce the learning rate by a factor of 1.15 after each epoch. We clip the norm of the gradients (normalized by minibatch size) at 10 Mikolov et al. (2010). Training this network takes an entire day on an NVIDIA K20 GPU.
For comparison, we trained a non-regularized network. We optimized its parameters to get the best validation performance. The lack of regularization effectively constrains the size of the network, forcing us to use a small network because larger networks overfit. Our best performing non-regularized LSTM has two hidden layers with 200 units per layer, and its weights are initialized uniformly in $[-0.1, 0.1]$. We train it for 4 epochs with a learning rate of 1 and then we decrease the learning rate by a factor of 2 after each epoch, for a total of 13 training epochs. The size of each minibatch is 20, and we unroll the network for 20 steps. Training this network takes 2-3 hours on an NVIDIA K20 GPU.
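The training recipes above share two ingredients: a learning rate that stays constant for some epochs and is then divided by a fixed factor after each further epoch, and clipping of the gradient norm. The sketch below shows one plausible reading of both. The function names, the 1-indexed epoch convention, and the choice to clip the joint norm over all parameter gradients are assumptions of this example; gradients are assumed to be already normalized by minibatch size.

```python
import numpy as np

def learning_rate(epoch, base_lr, constant_epochs, decay):
    """Learning rate for a 1-indexed epoch.

    The rate equals base_lr for the first `constant_epochs` epochs and is
    divided by `decay` once for every epoch after that.
    """
    return base_lr / decay ** max(0, epoch - constant_epochs)

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total <= max_norm:
        return [g.copy() for g in grads]
    scale = max_norm / total
    return [g * scale for g in grads]
```

For instance, `learning_rate(epoch, 1.0, 4, 2.0)` keeps the rate at 1.0 through epoch 4 and halves it at every epoch from the fifth onward; these values are used here only to illustrate the call.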
Deep Neural Networks have been used for acoustic modeling for over half a century (see Bourlard & Morgan (1993) for a good review). Acoustic modeling is a key component in mapping acoustic signals to sequences of words, as it models $p(s_t \mid X)$ where $s_t$ is the phonetic state at time $t$ and $X$ is the acoustic observation. Recent work has shown that LSTMs can achieve excellent performance on acoustic modeling Sak et al. (2014), yet relatively small LSTMs (in terms of the number of their parameters) can easily overfit the training set. A useful metric for measuring the performance of acoustic models is frame accuracy, which is measured at each $s_t$ for all timesteps $t$. Generally, this metric correlates with the actual metric of interest, the Word Error Rate (WER). Since computing the WER involves using a language model and tuning the decoding parameters for every change in the acoustic model, we decided to focus on frame accuracy in these experiments. Table 2 shows that dropout improves the frame accuracy of the LSTM. Not surprisingly, the training frame accuracy drops due to the noise added during training, but as is often the case with dropout, this yields models that generalize better to unseen data. Note that the test set is easier than the training set, as its accuracy is higher. We report the performance of an LSTM on an internal Google Icelandic Speech dataset, which is relatively small (93k utterances), so overfitting is a great concern.
Model | Training set | Validation set |
---|---|---|
Non-regularized LSTM | 71.6 | 68.9 |
Regularized LSTM | 69.4 | 70.5 |
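Frame accuracy, the metric reported in the table above, can be computed as the fraction of timesteps at which the most probable predicted phonetic state matches the reference state. The sketch below assumes the acoustic model outputs a `(T, S)` array of per-frame state posteriors; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def frame_accuracy(state_posteriors, true_states):
    """Fraction of frames whose arg-max state equals the reference state.

    state_posteriors: (T, S) array, row t holds the model's distribution
                      over the S phonetic states at timestep t
    true_states:      (T,) integer array of reference state indices
    """
    predicted = np.argmax(np.asarray(state_posteriors), axis=1)
    return float(np.mean(predicted == np.asarray(true_states)))
```

Multiplying the result by 100 gives a percentage on the same scale as the table.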
We formulate a machine translation problem as a language modelling task, where an LSTM is trained to assign high probability to a correct translation of a source sentence. Thus, the LSTM is trained on concatenations of source sentences and their translations Sutskever et al. (2014) (see also Cho et al. (2014)). We compute a translation by approximating the most probable sequence of words using a simple beam search with a beam of size 12. We ran an LSTM on the WMT’14 English to French dataset, on the “selected” subset from Schwenk (2014) which has 340M French words and 304M English words. Our LSTM has 4 hidden layers, and both its layers and word embeddings have 1000 units. Its English vocabulary has 160,000 words and its French vocabulary has 80,000 words. The optimal dropout probability was 0.2. Table 3 shows the performance of an LSTM trained with and without dropout. While our LSTM does not beat the phrase-based LIUM SMT system Schwenk et al. (2011), our results show that dropout improves the translation performance of the LSTM.
Model | Test perplexity | Test BLEU score |
---|---|---|
Non-regularized LSTM | 5.8 | 25.9 |
Regularized LSTM | 5.0 | 29.03 |
LIUM system | | 33.30 |
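The translation experiments decode with a simple beam search. A minimal, generic version is sketched below to make the procedure concrete. It is not the authors' decoder: the callback `next_log_probs`, the token names, and the handling of finished hypotheses are all assumptions of this example.

```python
def beam_search(next_log_probs, bos, eos, beam_size, max_len):
    """Approximate the most probable token sequence with beam search.

    next_log_probs: hypothetical callback mapping a prefix (list of tokens)
                    to a dict {token: log-probability of that next token}
    Returns (log_probability, sequence) of the best hypothesis found.
    """
    beams = [(0.0, [bos])]  # unfinished hypotheses
    finished = []           # hypotheses that have emitted eos
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        # Keep only the beam_size best extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[0])
```

With `beam_size=1` this reduces to greedy decoding; a wider beam can recover sequences whose first word is not the single most probable one.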
We applied the dropout variant to the image caption generation model of Vinyals et al. (2014). The image caption generation model is similar to the sequence-to-sequence model of Sutskever et al. (2014), except that the input image is mapped onto a vector with a highly accurate pre-trained convolutional neural network (Szegedy et al., 2014), and this vector is converted into a caption with a single-layer LSTM (see Vinyals et al. (2014) for details of the architecture). We test our dropout scheme on the LSTM only: the convolutional neural network is not trained on the image caption dataset (MSCOCO (Lin et al., 2014)) because the dataset is not large.
Our results are summarized in Table 4. In brief, dropout helps relative to not using dropout, but using an ensemble eliminates the gains attained by dropout. Thus, in this setting, the main effect of dropout is to produce a single model that is as good as an ensemble, which is a reasonable improvement given the simplicity of the technique.
Model | Test perplexity | Test BLEU score |
---|---|---|
Non-regularized model | 8.47 | 23.5 |
Regularized model | 7.99 | 24.3 |
10 non-regularized models | 7.5 | 24.4 |
We presented a simple way of applying dropout to LSTMs that results in large performance increases on several problems in different domains. Our work makes dropout useful for RNNs, and our results suggest that our implementation of dropout could improve performance on a wide variety of applications.
We wish to acknowledge Tomas Mikolov for useful comments on the first version of the paper.
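Jaeger, H., Lukoševičius, M., Popovici, D., and Siewert, U. Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3):335–352, 2007.
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1058–1066, 2013.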