Implementation of abstractive summarization using LSTM with Residual Recurrent Attention in the encoder-decoder architecture with local attention.
In this paper, we propose a recurrent neural network (RNN) with residual attention (RRA) to learn long-range dependencies from sequential data. We propose to add residual connections across timesteps to the RNN, which explicitly enhances the interaction between the current state and hidden states that are several timesteps apart. This also allows training errors to be directly back-propagated through the residual connections and effectively alleviates the gradient vanishing problem. We further reformulate an attention mechanism over the residual connections. An attention gate is defined to summarize the individual contributions of multiple previous hidden states in computing the current state. We evaluate RRA on three tasks: the adding problem, pixel-by-pixel MNIST classification and sentiment analysis on the IMDB dataset. Our experiments demonstrate that RRA yields better performance, faster convergence and more stable training compared to a standard LSTM network. Furthermore, RRA shows performance that is highly competitive with state-of-the-art methods.
Deep neural networks (DNN) have shown significant improvements in several application domains including image recognition [Krizhevsky, Sutskever, and Hinton2012], natural language processing [Mikolov et al.2013] and speech recognition [Hinton et al.2012]. Recurrent neural networks (RNNs), a particular type of DNN, are powerful at processing complicated sequential data. By using recurrent connections, the previous context information can be captured and used to predict the next hidden state output. However, training an RNN remains difficult due to the gradient vanishing and exploding problems [Pascanu, Mikolov, and Bengio2013], especially when the RNN needs to learn very long dependencies from sequential inputs. The main issue is that training an RNN using back-propagation through time (BPTT) [Williams and Hinton1986] entails multiplying gradients by the recurrent weight matrix a large number of times (specifically, once for each time step). If this matrix contains small values (namely, if its largest eigenvalue is less than 1), then gradient contributions from “far away” states become zero and have no influence on future states; this is the gradient vanishing problem. On the other hand, if the weights in the matrix are large, the gradient signal grows without bound and learning diverges; this is the gradient exploding problem. To alleviate the effects of gradient vanishing, many methods have been proposed. Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber1997] can be seen as the most successful of these techniques. The memory cell introduced in LSTM has its own input, forget and output gates to control whether to store the context information or remove it from memory. This allows LSTM networks to capture longer-range relational dependencies from input sequences than a regular RNN.
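The effect of repeated multiplication by the recurrent weight matrix can be illustrated numerically. The following is a minimal sketch, not part of the original experiments: it ignores the activation derivative and uses a scaled orthogonal matrix (an assumption made so that every eigenvalue has the same magnitude), and the helper names `backprop_norms` and `scaled_orthogonal` are hypothetical.

```python
import numpy as np

def scaled_orthogonal(n, scale, seed=0):
    """Return scale * Q, where Q is a random orthogonal matrix (illustrative choice)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return scale * q

def backprop_norms(W, steps):
    """Norm of a unit gradient vector after each of `steps` multiplications by W^T."""
    g = np.ones(W.shape[0]) / np.sqrt(W.shape[0])  # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        g = W.T @ g  # one step of back-propagation through time (linearized)
        norms.append(np.linalg.norm(g))
    return norms

# Eigenvalue magnitudes below 1 shrink the gradient; above 1 they blow it up.
small = backprop_norms(scaled_orthogonal(16, 0.9), 100)
large = backprop_norms(scaled_orthogonal(16, 1.1), 100)
print(f"after 100 steps: scale 0.9 -> {small[-1]:.2e}, scale 1.1 -> {large[-1]:.2e}")
```

With a scale of 0.9 the norm decays geometrically toward zero, and with 1.1 it grows geometrically, which mirrors the vanishing and exploding cases described above.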
The gradient vanishing problem is not limited to recurrent neural networks and can also appear in feedforward neural networks, particularly when training very deep networks. If we treat an RNN in its unfolded form, a shallow RNN with multiple timesteps is equivalent to a very deep network. Residual learning [He et al.2016] provides a novel learning scheme for ultra-deep convolutional neural networks (CNN) (e.g. more than 1000 layers) by introducing residual connections across layers. These shortcut connections link far-away layers so that the training error signal can be back-propagated from higher layers to lower layers directly, which alleviates the gradient vanishing problem. Inspired by the success of residual learning in CNNs on computer vision tasks, this work reformulates residual learning for recurrent networks in order to learn ultra-long-range dependencies across timesteps in sequence learning.
Unlike residual learning [He et al.2016], where an identity shortcut connection adds the input to the output of stacked layers (i.e. $F(x)+x$, where $F$ is the residual function), in the context of sequence learning we reformulate the recurrent residual connection to have attention over multiple preceding steps. This results in a residual function with attention across timesteps: the output of a recurrent model plus an attention-weighted combination of earlier hidden states. At each timestep $t$, in computing the current state $h_t$, this reformulation ensures recurrent units are able to look back as far as $K+1$ past timesteps and control the relative contribution of each hidden state to the current state $h_t$.
Attention mechanisms have been widely studied in machine translation [Bahdanau, Cho, and Bengio2014], image captioning [Xu et al.2015], object detection [Ba, Mnih, and Kavukcuoglu2015] and generative models [Mnih et al.2014, Gregor et al.2015]. However, these attention models are either layer-based or network-based: they can only receive attended information from a previous layer or a separate network. By casting the attention mechanism onto the recurrent residual connection, the recurrent unit provides a more natural approach to sequence learning, because it explicitly looks back at multiple preceding steps and automatically decides how much previous information should be “seen” by weighting them. For a specific sequential pattern (e.g. an English or German sentence), the semantic dependencies between words that are far apart can be stronger than those between two adjacent words. Figure 1 gives an example which intuitively supports our assumption. The word “drawing” is explicitly involved in predicting the word “her”, but the word “girl” clearly also makes a significant semantic contribution. Essentially, the sentence is saying “The girl is beautiful”; however, regular RNNs have difficulty capturing this meaning. Thus, it is reasonable to explicitly consider information that is several steps apart when learning semantic meaning from sequential data. In this work, we address this problem by casting the attention mechanism onto residual connections over timesteps in a recurrent network.
The benefits of recurrent residual attention (RRA) are twofold: (1) RRA enhances the interactions between hidden states that are several steps apart, that is, RRA allows the training error to be back-propagated across multiple timesteps. (2) The attention over residual connections gives a more natural way for past hidden states to selectively “attend” to future states in sequence learning.
Our main contributions are summarized as follows:
We propose a novel learning scheme for sequential data that reformulates residual learning with attention in a recurrent network. The code will be made publicly available soon.
A new gate, the attention gate, is defined in the LSTM RNN to control the individual contribution of context information from multiple previous hidden states.
Our proposed RRA shows promising performance compared to a standard LSTM network on three benchmark tasks: the adding problem, pixel-by-pixel MNIST and sentiment analysis. RRA also outperforms or matches the state-of-the-art methods.
The rest of this paper is structured as follows. Section 2 (Related Work) reviews the related work. In Section 3 (Models), we elaborate the reformulation of residual learning with attention in a recurrent manner. We describe our experiments and discussions in Section 4 (Experiments) and conclude this work in Section 5 (Conclusion).
Recurrent Neural Network (RNN). The RNN is a powerful network architecture for processing sequential data. It has been widely used in natural language processing [Socher et al.2011], speech recognition [Graves, Mohamed, and Hinton2013] and handwriting recognition [Graves et al.2009] in recent years. An RNN allows cyclical connections and reuses its weights across different instances of neurons, each associated with a different time step. This explicitly supports the network in learning the entire history of previous states and mapping them to current states. With this property, an RNN is able to map an arbitrary-length sequence to a fixed-length vector. However, RNNs are known to be difficult to train due to the gradient vanishing problem.
The vanishing problem was originally identified in [Hochreiter and Schmidhuber1997], and LSTM (long short-term memory) was then proposed to prevent the gradient from vanishing during training. Therefore, compared to a traditional RNN, LSTM has the ability to learn long-term dependencies between inputs and outputs. LSTM has recently become very popular in machine translation [Cho et al.2014], speech recognition [Graves, Mohamed, and Hinton2013] and sequence learning [Sutskever, Vinyals, and Le2014]. Another special type of RNN is the Gated Recurrent Unit (GRU) [Cho et al.2014]. It simplifies LSTM by removing the memory cell and provides a different way to prevent the vanishing gradient problem. Our work falls into this category and aims to alleviate gradient vanishing when learning ultra-long dependencies.
Residual Learning. Previous work [Simonyan and Zisserman2015, Szegedy et al.2015] has shown that network depth is of crucial importance in neural network architectures, but deeper networks are more challenging to train. Residual learning [He et al.2016] paves a way for training such networks. The residual mapping between layers enables networks to be substantially deeper (e.g. with hundreds of layers), leads to more efficient optimization and, most importantly, yields better performance. Shortcut skip connections across multiple layers force a direct information flow in both the forward and backward passes, so that feedforward signals as well as feedback errors can be passed easily. Adding residual connections across layers has shown its powerful capability in computer vision [He et al.2016, Szegedy et al.2017]. Inspired by this, our work incorporates residual connections across multiple preceding steps to learn long and complex dependencies from sequential data.
Attention Mechanism. Attention in neural networks [Bahdanau, Cho, and Bengio2014] is designed to assign weights to different inputs instead of treating all inputs equally as the original neural networks do. It can be seen as an additional network that is now widely incorporated into different neural networks, leading to a new variety of models [Xu et al.2015, Ba, Mnih, and Kavukcuoglu2015, Mnih et al.2014, Gregor et al.2015]. Formally, an attention model takes a set of inputs and a context. It returns a weighted output that summarizes the inputs based on how each of them is related to the context. The weights correspond to the relevance of each input to the context and sum to 1, e.g. the weights in Figure 2 (c). They determine the relative contribution of each input to the final output. However, the current state-of-the-art attention methods are either layer-based or network-based, and attention has not been well studied in a recurrent manner. This work reformulates attention over residual connections in a recurrent network.
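The generic attention model described above can be sketched in a few lines. This is an illustrative example only: the dot-product relevance score and the softmax normalization are assumptions chosen for simplicity (the text does not fix a scoring function), and the function name `attend` is hypothetical.

```python
import numpy as np

def attend(inputs, context):
    """Summarize `inputs` (shape [n, d]) with weights given by relevance to `context` (shape [d])."""
    scores = inputs @ context                 # assumed relevance: dot product with the context
    scores = scores - scores.max()            # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax: non-negative, sums to 1
    return weights @ inputs, weights          # weighted output and the weights themselves

inputs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
context = np.array([2.0, 0.0])
output, weights = attend(inputs, context)
print(weights, output)
```

Inputs that align with the context receive larger weights, so they contribute more to the summarized output.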
This section describes our proposed approach to learning recurrent residual attention from sequential data. We first introduce the existing approach to sequence learning with recurrent networks and explain our intuition for extending recurrent networks to learn more complex dependencies. Then we describe how to reformulate residual connections for RNNs, followed by casting the attention mechanism onto the recurrent residual connection. Here, we use LSTM as the base recurrent network to elaborate our approach, but it can be easily generalized to a plain RNN or GRU.
where $x$ and $y$ are the input sequence and target sequence respectively. The input sequence length may differ from the target sequence length. $h_t$ is the hidden state produced by the model from the previous hidden state $h_{t-1}$ and a new input $x_t$. The recurrent model can be a standard RNN or one of its variants. Equation (3) can be viewed as a general form of recurrent learning algorithm that is able to capture the semantic dependencies across timesteps. For example, the hidden state $h_{t-1}$ is explicitly used for outputting $h_t$, while the hidden states before $h_{t-1}$ are only implicitly involved.
This challenges existing RNNs in tasks that need the model to explicitly capture long-range semantic dependencies between states that are several timesteps apart, such as the task described in Figure 1. Adding a shortcut connection that skips one or multiple timesteps and enforces a direct information flow across timesteps is a way to explicitly use earlier hidden states in computing future states. This leads to recurrent residual learning.
An overview of reformulating a recurrent network to have residual connections is illustrated in Figure 2 (b), in which a shortcut connection is designed to impose a fluent information flow across timesteps. With a residual connection in the recurrent network, at a given timestep $t$, the hidden state $h_t$ can be computed as:
where the first term is an RNN model with its own weights that receives $x_t$ and $h_{t-1}$ as a regular RNN does. Here we keep the RNN model receiving $h_{t-1}$ so as to form a residual skip connection across timesteps. The second term, $F$, approximates the residual function with its own weights. $F$ can be an identity function that simply returns an earlier hidden state. With this formulation, when computing a hidden state $h_t$, an earlier hidden state can be explicitly considered besides $x_t$ and $h_{t-1}$. If $F$ approximates 0, equation (4) reduces to a plain RNN.
Making $F$ weight multiple previous hidden states, rather than a single one, leads to recurrent residual learning with attention over timesteps:
where the attention weight matrix controls the relative contribution of the past hidden states.
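The recurrence described above can be sketched for a plain tanh RNN. This is an illustration under stated assumptions rather than the formulation used in the experiments: the attention weights are taken to be one scalar per past state, normalized by their sum; the window is assumed to cover the $K$ states preceding $h_{t-1}$; and the names `rnn_residual_attention`, `W_x`, `W_h`, `b` and `w_att` are hypothetical.

```python
import numpy as np
from collections import deque

def rnn_residual_attention(xs, W_x, W_h, b, w_att):
    """Run a tanh RNN whose state adds an attention-weighted sum of K older hidden states."""
    K = len(w_att)
    n = W_h.shape[0]
    w = w_att / w_att.sum()  # assumed normalization: weights sum to 1
    # history[0] is h_{t-1}; history[1..K] are the K older states (zeros before the sequence starts)
    history = deque([np.zeros(n)] * (K + 1), maxlen=K + 1)
    outputs = []
    for x in xs:
        recurrent = np.tanh(W_x @ x + W_h @ history[0] + b)       # regular RNN term
        residual = sum(w[k] * history[k + 1] for k in range(K))    # attention over older states
        h = recurrent + residual                                   # residual connection across timesteps
        history.appendleft(h)                                      # the oldest state drops off the right
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
hs = rnn_residual_attention(
    xs=rng.standard_normal((6, 3)),
    W_x=rng.standard_normal((4, 3)),
    W_h=rng.standard_normal((4, 4)),
    b=np.zeros(4),
    w_att=np.array([0.5, 0.3, 0.2]),
)
print(hs.shape)
```

If all attention weights were driven toward zero before normalization were applied, the residual term would vanish and the loop would reduce to a plain RNN, which matches the remark about the residual function approximating 0.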
Figure 2 (d) gives our design for reformulating attention on residual connections in a recurrent network. The recurrent residual attention is considered at each timestep; this can be viewed as a sliding attention window of size $K$ over timesteps. To make the past states selectively “attend” to future states, we make the residual attention affect the memory cell directly: a new gate, the attention gate, is added to the LSTM cell, giving the LSTM residual attention. Equation (5) is then reformulated as
Figure 3 shows the internal gates of the RRA cell, where the attention gate controls the relative contributions of the past states. The hidden state of each gate within RRA can be computed as:
where $i_t$, $f_t$ and $o_t$ are the input, forget and output gates respectively, $c_t$ is the memory cell and $\sigma$ is the sigmoid function. Equations (7) - (8) are from the original LSTM. The attention gate defined in equation (9) summarizes the relative contributions of the past hidden states within the attention window. The hidden state $h_{t-1}$ is used in the original way, and the earlier states are attended at each step so as to form a residual (shortcut) connection across timesteps. The attention weights are normalized; while softmax is more often used here, we found our normalization more straightforward and faster in BPTT without losing performance. In equation (10), following residual networks [He et al.2016], element-wise addition is used to form the residual function of attention, which directly affects the memory cell when outputting $h_t$.
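One possible reading of the RRA cell is sketched below. Several details are assumptions, not confirmed by the text: the attention gate is taken to be a weighted sum of the older hidden states with sum-normalized scalar weights, and it is added element-wise to the memory cell inside the output nonlinearity. The function `rra_lstm_step` and its parameter layout (gates stacked in the order input, forget, output, candidate) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rra_lstm_step(x, past_h, c_prev, W, U, b, w_att):
    """One LSTM step with an added attention gate.

    past_h[0] is h_{t-1}; past_h[1:] are the K older hidden states in the window.
    """
    n = c_prev.shape[0]
    z = W @ x + U @ past_h[0] + b            # stacked pre-activations, shape (4n,)
    i = sigmoid(z[0:n])                      # input gate
    f = sigmoid(z[n:2 * n])                  # forget gate
    o = sigmoid(z[2 * n:3 * n])              # output gate
    g = np.tanh(z[3 * n:])                   # candidate cell update
    c = f * c_prev + i * g                   # standard LSTM memory cell
    w = w_att / w_att.sum()                  # assumed sum normalization instead of softmax
    a = sum(wk * hk for wk, hk in zip(w, past_h[1:]))  # attention gate over older states
    h = o * np.tanh(c + a)                   # assumed: element-wise addition to the cell for output
    return h, c

rng = np.random.default_rng(1)
n, d, K = 4, 3, 2
W, U, b = rng.standard_normal((4 * n, d)), rng.standard_normal((4 * n, n)), np.zeros(4 * n)
past = [rng.standard_normal(n) for _ in range(K + 1)]
h, c = rra_lstm_step(rng.standard_normal(d), past, np.zeros(n), W, U, b, np.array([0.7, 0.3]))
print(h.shape, c.shape)
```

Under these assumptions the only new parameters are the $K$ attention weights, and when the older states are all zero the step is identical to a standard LSTM step.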
By defining an attention gate in RNNs, only additional differentiable parameters over the residual connection are introduced. The optimization can be carried out with standard back-propagation through time (BPTT) [Williams and Hinton1986], as in regular RNNs.
In this section, we explore the performance of the proposed RRA on multiple tasks, including the adding problem, pixel-by-pixel MNIST image classification and sentiment analysis on the IMDB dataset.
Our implementation was based on Theano (http://www.deeplearning.net/software/theano/). We conducted all our experiments on a single Titan Xp with 12G memory. The weights of the input-to-hidden and hidden-to-output layers were initialized by drawing from a uniform distribution whose range depends on the number of units. The RNN internal weights were orthogonally initialized [Saxe, McClelland, and Ganguli2014]. The attention weights were randomly initialized. By default, the attention window size was $K=10$, which means that the past hidden states within a window of size 10 are considered at every timestep. The initial learning rate was set to 0.0001 and a dropout rate of 0.5 was used after the recurrent layer. Gradients were clipped to 1 to prevent exploding gradients. All models were configured to have only one recurrent layer and were trained for a given number of iterations without early stopping. All experimental settings for LSTM and RRA are the same.
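Two of the settings above, orthogonal initialization and gradient clipping, can be sketched as follows. This is a NumPy illustration rather than the Theano code used in the experiments; the QR-based construction is one common way to draw an orthogonal matrix, and clipping by global norm is an assumption, since the text does not say whether values or norms were clipped. The helper names are hypothetical.

```python
import numpy as np

def orthogonal(n, rng):
    """Draw an n-by-n orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # flip column signs so the result does not depend on QR conventions

def clip_by_norm(grad, max_norm=1.0):
    """Rescale `grad` so that its L2 norm is at most `max_norm` (assumed clipping scheme)."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

rng = np.random.default_rng(0)
W_hh = orthogonal(128, rng)                    # recurrent weights for 128 hidden units
clipped = clip_by_norm(np.array([3.0, 4.0]))   # norm 5 is rescaled to norm 1
print(np.allclose(W_hh.T @ W_hh, np.eye(128)), clipped)
```

An orthogonal recurrent matrix has all singular values equal to 1, so repeated multiplication neither shrinks nor amplifies the signal at initialization.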
This task was originally defined in [Hochreiter and Schmidhuber1997] for testing the ability of an RNN to capture long dependencies in sequential data. The task is to add two numbers that are randomly selected from a sequence. For a given sequence of length $S$, each element is a pair of two components: the first is a real number sampled uniformly from $[0,1]$, and the second is an indicator that decides whether to add the number (if it is 1) or to ignore it (if it is 0). Only two numbers in each sequence are marked with 1 for addition: the first is placed in the first 10% of the sequence and the second in the last 50% of the sequence. This yields sequences with a long-range dependency in which only two remote inputs are significant. A naive strategy is to always predict a target output of 1 regardless of the input sequence [Le, Jaitly, and Hinton2015, Arjovsky, Shah, and Bengio2016]; it gives an expected mean squared error (MSE) of 0.167, which is used as the baseline to beat.
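The data for this task can be generated in a few lines, which also makes the 0.167 baseline easy to check: the target is the sum of two independent uniform numbers on $[0,1]$, whose mean is 1 and whose variance is $2 \cdot \tfrac{1}{12} = \tfrac{1}{6} \approx 0.167$. The sketch below assumes values drawn uniformly from $[0,1]$ and requires a sequence length of at least 2; the function name `adding_problem` is hypothetical.

```python
import numpy as np

def adding_problem(num_examples, seq_len, rng):
    """Generate adding-problem inputs of shape [N, S, 2] and targets of shape [N]."""
    values = rng.uniform(0.0, 1.0, size=(num_examples, seq_len))
    markers = np.zeros((num_examples, seq_len))
    first_end = max(1, seq_len // 10)        # first marked position lies in the first 10%
    second_start = seq_len // 2              # second marked position lies in the last 50%
    first = rng.integers(0, first_end, size=num_examples)
    second = rng.integers(second_start, seq_len, size=num_examples)
    rows = np.arange(num_examples)
    markers[rows, first] = 1.0
    markers[rows, second] = 1.0
    x = np.stack([values, markers], axis=-1)  # each element is a (value, indicator) pair
    y = values[rows, first] + values[rows, second]
    return x, y

rng = np.random.default_rng(0)
x, y = adding_problem(20000, 100, rng)
baseline_mse = np.mean((y - 1.0) ** 2)        # always predicting 1
print(x.shape, round(float(baseline_mse), 3))
```

A model is only useful on this task once its test MSE drops below this constant-prediction baseline.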
We used 128 hidden units for both LSTM and RRA, the batch size was set to 50, and the models were optimized with ADADELTA [Zeiler2012]. We generated 100,000 training examples and 10,000 test examples. Figure 4 presents the performance of LSTM and RRA on the test dataset as we varied the sequence length $S$. For the shorter sequence length, LSTM consistently beats the baseline after around 4,400 iterations, while RRA beats the baseline after approximately 2,200 iterations. When we increased $S$ to 500, the task became harder because the dependency between the target output and the two relevant inputs becomes more remote, which requires the model to capture longer dependencies. In the first 40,000 iterations, both LSTM and RRA struggled to minimize the MSE. RRA started to beat the baseline after 43,000 iterations, significantly earlier than LSTM, which started to beat the baseline after around 92,000 iterations.
Although this task does not play to the advantage of RRA, since there are only two significant numbers in each sequence, RRA demonstrates good performance in learning long-range dependencies.
This task is to classify MNIST digits [LeCun et al.1998] as suggested by [Le, Jaitly, and Hinton2015]. Each 28-by-28 image in MNIST is treated as sequential data and fed to the recurrent network, which leads to pixel sequences of length 784. Two versions of pixel-by-pixel MNIST were considered: (1) normal MNIST, in which the pixel sequence is read in order from left to right, top to bottom; (2) permuted MNIST, in which the pixel sequence is randomly permuted. We configured both networks to have 256 hidden units, and the optimizer was replaced with RMSprop, which provides steadier improvement on this task for both networks. The training batch size is 50. LSTM is used as the baseline to beat, as plain RNNs have been shown to perform poorly on such tasks in [Le, Jaitly, and Hinton2015, Arjovsky, Shah, and Bengio2016].
Figure 5 reports the test accuracy against iterations. On normal pixel-by-pixel MNIST (Figure 5 (left)), similar to previous work [Arjovsky, Shah, and Bengio2016], both LSTM and RRA show good performance. RRA achieves 98.58%, beating the 97.66% of LSTM. In addition, RRA yields faster convergence and more stable improvement compared to the standard LSTM.
The task becomes more challenging when the order of the pixels in each image is randomly permuted. By applying the same permutation to every image, the dependencies across pixels become longer than in the original pixel order. This requires models to learn and remember more complicated dependencies across different timesteps. As shown in Figure 5 (right), RRA shows superior capability in capturing such long and complicated dependencies. It achieves 95.84% against 91.2% for LSTM and, again, converges faster.
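Both input variants can be produced with a reshape and a single fixed index permutation. The sketch below uses a placeholder array in place of real MNIST images so that it stays self-contained; the function name `to_pixel_sequences` and the seed used for the permutation are hypothetical.

```python
import numpy as np

def to_pixel_sequences(images, permutation=None):
    """Flatten [N, 28, 28] images to [N, 784, 1] sequences, optionally reordering the pixels."""
    # Row-major flattening reads each image left to right, top to bottom.
    seqs = images.reshape(images.shape[0], -1, 1)
    if permutation is not None:
        seqs = seqs[:, permutation, :]   # the same permutation is applied to every image
    return seqs

images = np.arange(2 * 28 * 28, dtype=float).reshape(2, 28, 28)  # placeholder for MNIST images
fixed_perm = np.random.default_rng(0).permutation(28 * 28)       # drawn once, reused for all images

normal = to_pixel_sequences(images)
permuted = to_pixel_sequences(images, fixed_perm)
print(normal.shape, permuted.shape)
```

Because the permutation is fixed, the task stays learnable, but neighbouring pixels in the image can end up hundreds of timesteps apart in the sequence.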
We further compared RRA with recently proposed methods, IRNN [Le, Jaitly, and Hinton2015], URNN [Arjovsky, Shah, and Bengio2016] and RWA [Ostmeyer and Cowell2017], in Table 1. RRA achieves state-of-the-art performance on both normal and permuted pixel-by-pixel MNIST. It should be noted that neither URNN nor RWA is able to beat LSTM on normal MNIST in their configurations. In contrast, RRA achieves slightly better performance than LSTM on normal MNIST and outperforms LSTM on permuted MNIST by a larger margin.
|Models||Normal MNIST||Permuted MNIST|
|Models||Reported Error Rate|
|BoW (bnc)[Maas et al.2011]||12.20%|
|BoW (bΔt'c) [Maas et al.2011]||11.77%|
|LDA [Maas et al.2011]||32.58%|
|LSA [Maas et al.2011]||17.04%|
|Full+BoW [Maas et al.2011]||11.67%|
|Full+unlabelled+BoW [Maas et al.2011]||11.11%|
|WRRBM [Dahl, Adams, and Larochelle2012]||12.58%|
|WRRBM+BoW(bnc) [Dahl, Adams, and Larochelle2012]||10.77%|
|MNB-uni [Wang and Manning2012]||16.45%|
|MNB-bi [Wang and Manning2012]||13.41%|
|SVM-uni [Wang and Manning2012]||13.05%|
|SVM-bi [Wang and Manning2012]||10.84%|
|NBSVM-uni [Wang and Manning2012]||11.71%|
|NBSVM-bi [Wang and Manning2012]||8.78%|
|seq2-bown-CNN [Johnson and Zhang2015]||14.70%|
|Paragraph Vector [Le and Mikolov2014]||7.42%|
|LSTM with tuning and dropout [Dai and Le2015]||13.50%|
|LSTM initialized with word2vec embeddings [Dai and Le2015]||10.00%|
|LM-LSTM [Dai and Le2015]||7.64%|
|SA-LSTM [Dai and Le2015]||7.24%|
|SA-LSTM with linear gain [Dai and Le2015]||9.17%|
|SA-LSTM with joint training [Dai and Le2015]||14.70%|
|TS-ATT [Yuan, Hu, and Huang2016]||13.75%|
|SS-ATT [Yuan, Hu, and Huang2016]||13.26%|
|Bidirectional RRA (K=5)||9.05%|
To evaluate the performance of RRA on sentiment analysis, we conducted experiments on the IMDB review dataset [Maas et al.2011] (http://ai.stanford.edu/~amaas/data/sentiment/). This dataset consists of 100,000 movie reviews from IMDB. The dataset is split into 75% for training and 25% for testing. Only 25,000 of the training reviews are labeled and the remaining 50,000 are unlabeled; all testing reviews are labeled. In this task, we used the 25,000 labeled training reviews and the 25,000 test reviews for binary sentiment classification (positive or negative), so random guessing yields 50% accuracy. Unlike some previous approaches, e.g. Bag-of-Words (BoW) and Latent Dirichlet Allocation (LDA) [Blei, Ng, and Jordan2003], the review sentences are treated as sequential data. This task is particularly challenging because the average review length is 281 words and the longest review reaches 2,956 words. This requires our model to have a strong ability to capture long-range semantic dependencies among words.
In our experiments, we limited the word vocabulary size to 10,000; all other words were mapped to an “unk” token. We used 128 units for embeddings and 128 units for both LSTM and RRA with the ADADELTA [Zeiler2012] optimizer, and the batch size was set to 16. We tested RRA with three different attention window sizes $K$. Figure 6 presents the test error against iterations for the original LSTM and for RRA with different $K$. Each model was trained for around 15 epochs without early stopping. We can see that RRA fits the dataset quite well from 4,000 iterations onward, considerably faster than LSTM. With varied attention window sizes, we found that the test error is not very sensitive to $K$; RRA obtains slightly better results when $K=5$. We conjecture that for a certain pattern of sequence (e.g. the English sequences in this task), the semantic contributions from previous hidden states within a small window are sufficient to compute the current state.
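The vocabulary truncation described above can be sketched with the standard library. This is an illustration, not the preprocessing code used in the experiments: whether the “unk” token counts toward the 10,000 entries is an assumption, and the helper names `build_vocab` and `encode` are hypothetical.

```python
from collections import Counter

def build_vocab(tokenized_reviews, max_size=10000, unk="unk"):
    """Map the most frequent tokens to integer ids; id 0 is reserved for the unknown token."""
    counts = Counter(tok for review in tokenized_reviews for tok in review)
    vocab = {unk: 0}                       # assumed: the unk token occupies one vocabulary slot
    for tok, _ in counts.most_common():
        if len(vocab) >= max_size:
            break
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(review, vocab, unk="unk"):
    """Replace each token with its id, falling back to the unk id for out-of-vocabulary words."""
    return [vocab.get(tok, vocab[unk]) for tok in review]

reviews = [["good", "movie"], ["good", "plot"], ["bad", "movie"]]
vocab = build_vocab(reviews, max_size=3)   # tiny limit to show the truncation
print(vocab, encode(["good", "bad", "movie"], vocab))
```

The resulting integer sequences are what an embedding layer of 128 units would consume before the recurrent layer.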
In order to compare RRA with recent methods, we include more recently reported baselines. Table 2 shows the performance comparison. It shows that RRA can effectively learn good representations from input word sequences for sentiment classification compared to previous non-sequential representations, e.g. BoW, LDA and LSA with SVM classifiers. RRA is also highly competitive with the recent approaches LM-LSTM and SA-LSTM (which used 1024 units for memory cells and 512 embedding units, with 50,000 unlabeled reviews for pre-training). It should be noted that our models were based solely on the proposed RRA with only 128 hidden units, without using additional unlabeled data for pre-training or word2vec embeddings. With bidirectional RRA, the performance of our model is close to the state of the art.
We also visualized the attention weights for $K=5$ and $K=10$ in Figure 7. The evolution of the normalized weights of the attention units suggests that the attention gate learns to control the relative contributions of the previous hidden states within the attention window. These states are explicitly considered in predicting $h_t$, in contrast with standard RNN/LSTM and other variants, where history information is only indirectly considered via $h_{t-1}$.
RRA alleviates gradient vanishing. In BPTT, the gradient vanishes when the gradient contribution from a timestep is close to 0. Because BPTT sums the gradient contributions from every timestep, a dependency across timesteps cannot be captured if the corresponding contribution is 0. RRA explicitly enforces shortcut connections across timesteps and passes the error signal directly back to earlier hidden states. The attention over the residual connections controls the relative contribution across multiple timesteps, which keeps the gradient from shrinking to 0, particularly when learning dependencies from long and complex sequences. Our experiments in Figures 4, 5 and 6 have demonstrated the stability of RRA in learning long and complex sequences.
Relation to related work. Several RNN variants have recently been proposed to address the gradient vanishing problem in recurrent networks. IRNN [Le, Jaitly, and Hinton2015] is an RNN that is composed of ReLUs and initialized with an identity weight matrix. URNN [Arjovsky, Shah, and Bengio2016] uses a unitary hidden-to-hidden matrix by generalizing orthogonal matrices to the complex domain. In contrast, this work focuses on explicitly using multiple previous hidden states via residual connections with attention. Higher order RNNs (HORNNs) [Soltani and Jiang2016], proposed for language modeling, are similar to our work, but there are key differences: (1) RRA uses $h_{t-1}$ as in a regular RNN and attends to the earlier states so as to form a residual connection with attention, while HORNN feeds the multiple preceding states in directly. (2) RRA introduces far fewer parameters, e.g., when each unit is required to consider the past 3 states, RRA only introduces 2 additional parameters while HORNN introduces 0.3 million more weights compared to a plain RNN. Recurrent Weighted Average (RWA) [Ostmeyer and Cowell2017] also explores attention in RNNs. The difference is that RWA performs a weighted average over every preceding timestep when computing each hidden state, whereas RRA is more flexible by considering $K+1$ past states with residual attention.
Limitation of RRA. Although RRA shows its ability to capture long-range dependencies across timesteps, with faster convergence and more stable training than a standard LSTM on multiple tasks, it also has a limitation: training is slower than with standard LSTMs. For example, on permuted MNIST, LSTM took an average of 394s for one epoch while RRA ($K=5$) took 760s and RRA ($K=10$) took 773.6s. We conjecture that the additional time is spent computing the derivative of the residual attention and passing the error signal directly from current states to states that are several steps away. However, it should be noted that none of our experiments used early stopping; when it is applied to both RRA and LSTM, RRA can finish training and stop much earlier than LSTM.
In this paper we introduced RRA to learn long-term dependencies from sequential data. The residual shortcut connection can effectively pass the error signal across timesteps that are several steps apart and thereby counteract the gradient vanishing problem. The attention mechanism defined over timesteps provides a more natural way to summarize the individual contributions of the past hidden states in predicting future hidden states. We compared RRA to a standard implementation of LSTM. RRA shows superior performance, more stable training and faster convergence on the adding problem, pixel-by-pixel MNIST classification and sentiment analysis. Even without additional mechanisms such as word2vec embeddings or pre-training with unlabeled data, RRA demonstrates competitive performance compared to recent methods. Future work will extend RRA to different sequence learning scenarios, including machine translation and speech recognition.
We thank Mathias Niepert and Brandon Malone for their discussions and suggestions on this work.
[Dahl, Adams, and Larochelle2012] Training restricted Boltzmann machines on word observations. ICML.