Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems

12/29/2015 ∙ by Colin Raffel, et al. ∙ Columbia University

We propose a simplified model of attention which is applicable to feed-forward neural networks and demonstrate that the resulting model can solve the synthetic "addition" and "multiplication" long-term memory problems for sequence lengths which are both longer and more widely varying than the best published results for these tasks.

1 Models for Sequential Data

Many problems in machine learning are best formulated using sequential data and appropriate models for these tasks must be able to capture temporal dependencies in sequences, potentially of arbitrary length. One such class of models are recurrent neural networks (RNNs), which can be considered a learnable function $f$ whose output $h_t = f(x_t, h_{t-1})$ at time $t$ depends on input $x_t$ and the model’s previous state $h_{t-1}$. Training of RNNs with backpropagation through time (Werbos, 1990) is hindered by the vanishing and exploding gradient problem (Pascanu et al., 2012; Hochreiter & Schmidhuber, 1997; Bengio et al., 1994), and as a result RNNs are in practice typically only applied in tasks where sequential dependencies span at most hundreds of time steps. Very long sequences can also make training computationally inefficient due to the fact that RNNs must be evaluated sequentially and cannot be fully parallelized.

1.1 Attention

A recently proposed method for easier modeling of long-term dependencies is “attention”. Attention mechanisms allow for a more direct dependence between the state of the model at different points in time. Following the definition from (Bahdanau et al., 2014), given a model which produces a hidden state $h_t$ at each time step, attention-based models compute a “context” vector $c_t$ as the weighted mean of the state sequence $h$ by

$$c_t = \sum_{j=1}^{T} \alpha_{tj} h_j$$

where $T$ is the total number of time steps in the input sequence and $\alpha_{tj}$ is a weight computed at each time step $t$ for each state $h_j$. These context vectors are then used to compute a new state sequence $s$, where $s_t$ depends on $s_{t-1}$, $c_t$ and the model’s output at $t-1$. The weightings $\alpha_{tj}$ are then computed by

$$e_{tj} = a(s_{t-1}, h_j), \qquad \alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})}$$

where $a$ is a learned function which can be thought of as computing a scalar importance value for $h_j$ given the value of $h_j$ and the previous state $s_{t-1}$. This formulation allows the new state sequence $s$ to have more direct access to the entire state sequence $h$. Attention-based RNNs have proven effective in a variety of sequence transduction tasks, including machine translation (Bahdanau et al., 2014), image captioning (Xu et al., 2015), and speech recognition (Chan et al., 2015; Bahdanau et al., 2015). Attention can be seen as analogous to the “soft addressing” mechanisms of the recently proposed Neural Turing Machine (Graves et al., 2014) and End-To-End Memory Network (Sukhbaatar et al., 2015) models.
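As a rough illustration of the weighted mean described above, the following NumPy sketch computes a single context vector from a state sequence. The function `score` stands in for the learned function $a$; its additive tanh form and the parameter names `W_s`, `W_h` and `v` are assumptions made for this example, not details given in the text.

```python
import numpy as np


def softmax(e):
    """Numerically stable softmax over a 1-D array of scores."""
    w = np.exp(e - np.max(e))
    return w / np.sum(w)


def score(s_prev, h_j, W_s, W_h, v):
    """Hypothetical learned function a(s_{t-1}, h_j) returning a scalar."""
    return float(v @ np.tanh(W_s @ s_prev + W_h @ h_j))


def context_vector(s_prev, h, W_s, W_h, v):
    """Compute c_t = sum_j alpha_tj * h_j for a state sequence h of shape (T, D)."""
    e = np.array([score(s_prev, h_j, W_s, W_h, v) for h_j in h])
    alpha = softmax(e)          # weights are non-negative and sum to 1
    return alpha @ h, alpha     # weighted mean of the states, and the weights


rng = np.random.default_rng(0)
T, D = 6, 4                      # illustrative sizes
h = rng.normal(size=(T, D))      # hidden state sequence h_1..h_T
s_prev = rng.normal(size=D)      # previous state s_{t-1}
W_s, W_h = rng.normal(size=(D, D)), rng.normal(size=(D, D))
v = rng.normal(size=D)
c_t, alpha = context_vector(s_prev, h, W_s, W_h, v)
```

Because the score depends on the previous state, a new set of weights has to be computed at every output step, which is the dependence that the simplification in the next section removes.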

1.2 Feed-Forward Attention

A straightforward simplification to the attention mechanism described above which would allow it to be used to produce a single vector $c$ from an entire sequence could be formulated as follows:

$$e_t = a(h_t), \qquad \alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}, \qquad c = \sum_{t=1}^{T} \alpha_t h_t \tag{1}$$

As before, $a$ is a learnable function, but it now only depends on $h_t$. In this formulation, attention can be seen as producing a fixed-length embedding $c$ of the input sequence by computing an adaptive weighted average of the state sequence $h$. A schematic of this form of attention is shown in Figure 1. Sønderby et al. (2015) compared the effectiveness of a standard recurrent network to a recurrent network augmented with this simplified version of attention on the task of protein sequence analysis.
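The simplified mechanism can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments; the scoring function `a` below is a hypothetical fixed linear projection chosen only so the example runs.

```python
import numpy as np


def feed_forward_attention(h, a):
    """Collapse a state sequence h of shape (T, D) into one vector c of shape (D,).

    `a` maps a single state vector to a scalar score; it depends only
    on h_t, not on any previous state.
    """
    e = np.array([a(h_t) for h_t in h])   # scores e_t = a(h_t)
    w = np.exp(e - e.max())                # stabilised exponentials
    alpha = w / w.sum()                    # softmax over time steps
    c = alpha @ h                          # c = sum_t alpha_t * h_t
    return c, alpha


# Hypothetical scoring function: a fixed linear projection for illustration.
w_score = np.array([1.0, -0.5, 0.25])


def a(h_t):
    return float(w_score @ h_t)


h_short = np.arange(12, dtype=float).reshape(4, 3)    # T = 4
h_long = np.arange(30, dtype=float).reshape(10, 3)    # T = 10
c_short, alpha_short = feed_forward_attention(h_short, a)
c_long, alpha_long = feed_forward_attention(h_long, a)
```

Both calls return a vector of the same size even though the two sequences differ in length, which is what makes the output usable as a fixed-length embedding.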

Figure 1: Schematic of our proposed “feed-forward” attention mechanism (cf. (Cho, 2015) Figure 1). Vectors in the hidden state sequence $h_t$ are fed into the learnable function $a(h_t)$ to produce a probability vector $\alpha$. The vector $c$ is computed as a weighted average of $h_t$, with weighting given by $\alpha$.

A consequence of using an attention mechanism is the ability to integrate information over time. It follows that by using this simplified form of attention, a model could handle variable-length sequences even if the calculation of $h_t$ was feed-forward, i.e. $h_t = f(x_t)$. Using a feed-forward $f$ could also result in large efficiency gains as the computation could be completely parallelized. We investigate the capabilities of this “feed-forward attention” model in Section 2.

We note here that feed-forward models without attention can be used for sequential data when the sequence length $T$ is fixed, but when $T$ varies across sequences, some form of temporal integration is necessary. An obvious straightforward choice, which can be seen as an extreme oversimplification of attention, would be to compute $c$ as the unweighted average of the state sequence $h_t$, i.e.

$$c = \frac{1}{T} \sum_{t=1}^{T} h_t \tag{2}$$

This form of integration has been used to collapse the temporal dimension of audio (Dieleman, 2014) and text document (Lei et al., 2015) sequences. We will also explore the effectiveness of this approach.
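To see why the unweighted average can be read as a degenerate case of attention, note that a constant score gives every time step the same softmax weight $1/T$. The short check below illustrates this with random states; the sizes and the seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 8, 5
h = rng.normal(size=(T, D))

# Unweighted average over time.
c_mean = h.mean(axis=0)

# Attention with a constant score: the softmax gives alpha_t = 1/T for every t.
e = np.zeros(T)
alpha = np.exp(e) / np.exp(e).sum()
c_const = alpha @ h
```

Here `c_mean` and `c_const` agree up to floating-point rounding, so any benefit of attention over averaging must come from the scores varying across time steps.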

2 Toy Long-Term Memory Problems

A common way to measure the long-term memory capabilities of a given model is to test it on the synthetic problems originally proposed by Hochreiter & Schmidhuber (1997). In this paper, we will focus on the “addition” and “multiplication” problems; due to space constraints, we refer the reader to (Hochreiter & Schmidhuber, 1997) or (Sutskever et al., 2013) for their specification. As proposed by Hochreiter & Schmidhuber (1997), we define accuracy as the proportion of sequences for which the absolute error between predicted value and the target value was less than .04. Applying our feed-forward model to these tasks is somewhat disingenuous because they are commutative and therefore may be easier to solve with a model which ignores temporal order. However, as we further argue in Section 2.4, we believe these tasks provide a useful demonstration of our model’s ability to refer to arbitrary locations in the input sequence when computing its output.
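For orientation, the sketch below generates data for one commonly used variant of these tasks and implements the accuracy criterion stated above. The exact specification is in the cited papers; the details assumed here (values drawn uniformly from [0, 1], exactly two marked positions chosen anywhere in the sequence, and a target equal to their sum or product) are assumptions of this example and may differ from the setup used in the experiments. The names `toy_batch` and `accuracy` are hypothetical.

```python
import numpy as np


def toy_batch(batch_size, length, rng, multiply=False):
    """Generate one batch for an assumed variant of the addition/multiplication task.

    Returns x of shape (batch_size, length, 2) and targets of shape (batch_size,).
    x[..., 0] holds random values; x[..., 1] marks the two values to combine.
    """
    values = rng.uniform(0.0, 1.0, size=(batch_size, length))
    markers = np.zeros((batch_size, length))
    for n in range(batch_size):
        i, j = rng.choice(length, size=2, replace=False)  # two distinct positions
        markers[n, [i, j]] = 1.0
    if multiply:
        targets = np.where(markers == 1.0, values, 1.0).prod(axis=1)
    else:
        targets = (values * markers).sum(axis=1)
    return np.stack([values, markers], axis=-1), targets


def accuracy(predicted, target, tolerance=0.04):
    """Proportion of sequences whose absolute error is below the tolerance."""
    return float(np.mean(np.abs(predicted - target) < tolerance))


rng = np.random.default_rng(0)
x, y = toy_batch(batch_size=100, length=50, rng=rng)
x_mul, y_mul = toy_batch(batch_size=100, length=50, rng=rng, multiply=True)
```

In both variants the target is unchanged if the time steps are reordered, which is the commutativity caveat raised in the paragraph above.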

2.1 Model Details

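For all experiments, we used the following model: First, the state $h_t$ was computed from the input at each time step $x_t$ by $h_t = \mathrm{LReLU}(W_{xh} x_t + b_{xh})$ where $W_{xh} \in \mathbb{R}^{D \times 2}$, $b_{xh} \in \mathbb{R}^{D}$ and $\mathrm{LReLU}(x) = \max(x, .01x)$ is the “leaky rectifier” nonlinearity, as proposed by Maas et al. (2013). We found that this nonlinearity improved early convergence so we used it in all of our models. We tested models where the context vector $c$ was then computed either as in Equation (1), with $a(h_t) = \tanh(W_{hc} h_t + b_{hc})$ where $W_{hc} \in \mathbb{R}^{1 \times D}$, $b_{hc} \in \mathbb{R}$, or simply as the unweighted mean of $h$ as in Equation (2). We then computed an intermediate vector $s = \mathrm{LReLU}(W_{cs} c + b_{cs})$ where $W_{cs} \in \mathbb{R}^{D \times D}$, $b_{cs} \in \mathbb{R}^{D}$, from which the output was computed as $y = \mathrm{LReLU}(W_{sy} s + b_{sy})$ where $W_{sy} \in \mathbb{R}^{1 \times D}$, $b_{sy} \in \mathbb{R}$. For all experiments, we set $D = 100$.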
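The forward pass of such a model can be sketched in NumPy as below. This is not the Lasagne code used for the experiments. The parameter names, the helper names `init_params` and `forward`, and the random weight values are illustrative, and the tanh score and 0.01 leak slope should be treated as assumptions of this sketch. The layer sizes (two input features, a state width of 100) are chosen so that the parameter totals match the 10,602 and 10,501 figures reported in Section 2.2.

```python
import numpy as np


def lrelu(x, leak=0.01):
    """Leaky rectifier: max(x, leak * x)."""
    return np.maximum(x, leak * x)


def init_params(D=100, n_in=2, seed=0):
    """Random illustrative parameters with the layer shapes assumed above."""
    rng = np.random.default_rng(seed)
    shapes = {
        "W_xh": (D, n_in), "b_xh": (D,),   # input -> state
        "W_hc": (1, D), "b_hc": (1,),      # state -> attention score
        "W_cs": (D, D), "b_cs": (D,),      # context -> intermediate vector
        "W_sy": (1, D), "b_sy": (1,),      # intermediate vector -> output
    }
    return {k: (rng.normal(scale=0.1, size=s) if k.startswith("W") else np.zeros(s))
            for k, s in shapes.items()}


def forward(x, p, use_attention=True):
    """Map one sequence x of shape (T, n_in) to a scalar prediction."""
    h = lrelu(x @ p["W_xh"].T + p["b_xh"])               # (T, D), no recurrence
    if use_attention:
        e = np.tanh(h @ p["W_hc"].T + p["b_hc"])[:, 0]   # (T,) scores in [-1, 1]
        alpha = np.exp(e) / np.exp(e).sum()              # softmax over time
        c = alpha @ h                                    # adaptive weighted average
    else:
        c = h.mean(axis=0)                               # unweighted average
    s = lrelu(p["W_cs"] @ c + p["b_cs"])                 # (D,)
    return float(lrelu(p["W_sy"] @ s + p["b_sy"])[0])


params = init_params()
n_params = sum(v.size for v in params.values())
n_params_unweighted = n_params - params["W_hc"].size - params["b_hc"].size
y_short = forward(np.random.default_rng(1).uniform(size=(50, 2)), params)
y_long = forward(np.random.default_rng(2).uniform(size=(10000, 2)), params)
```

The same parameters handle a sequence of 50 steps and one of 10,000 steps, since every layer either acts on one time step at a time or on the pooled vector.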

We used the squared error of the output against the target value for each sequence as an objective. Parameters were optimized using “adam”, a recently proposed stochastic optimization technique (Kingma & Ba, 2014), with the optimization hyperparameters $\beta_1$ and $\beta_2$ set to the values suggested by Kingma & Ba (2014) (.9 and .999 respectively). All weight matrices were initialized with entries drawn from a Gaussian distribution with a mean of zero and, for a matrix $W \in \mathbb{R}^{M \times N}$, a standard deviation of $1/\sqrt{N}$. All bias vectors were initialized with zeros. We trained on mini-batches of 100 sequences and computed the accuracy on a held-out test set of 1000 sequences every epoch, defined as 1000 parameter updates. We stopped training when either 100% accuracy was attained on the test set, or after 100 epochs. All networks were implemented using Lasagne (Dieleman et al., 2015), which is built on top of Theano (Bastien et al., 2012; Bergstra et al., 2010).
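For readers unfamiliar with the optimizer, a single Adam update with the two decay rates quoted above (.9 and .999) can be sketched as follows. The learning rate, the constant `eps = 1e-8` and the toy quadratic objective are assumptions of this example; the experiments themselves relied on Lasagne rather than hand-written updates.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new (theta, m, v).

    t is the 1-based update counter used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


# Minimise the squared error (theta - 3)^2 as a toy objective.
theta = np.array([0.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
```

With bias correction, the very first update moves each parameter by roughly the learning rate regardless of the gradient's scale, which is part of why a small grid of learning rates is usually enough to tune it.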

| Sequence length $T_0$ | Addition, attention | Addition, unweighted | Multiplication, attention | Multiplication, unweighted |
|---|---|---|---|---|
| 50 | 1 | 1 | 1 | 2 |
| 100 | 1 | 1 | 2 | 2 |
| 500 | 1 | 1 | 4 | 8 |
| 1000 | 1 | 2 | 2 | 33 |
| 5000 | 2 | 8 | 15 | 89.8% |
| 10000 | 3 | 17 | 6 | 80.8% |

Table 1: Number of epochs required to achieve perfect accuracy, or accuracy after 100 epochs (values given as percentages), for the experiment described in Section 2.2.

2.2 Fixed-Length Experiment

Traditionally, the sequence lengths tested in each task are drawn uniformly from $[T_0, 1.1T_0]$ for different values of $T_0$. As $T_0$ increases, the model must be able to handle longer-term dependencies. The largest value of $T_0$ attained using RNNs with different training, regularization, and model structures has varied from a few hundred (Martens & Sutskever, 2011; Sutskever et al., 2013; Le et al., 2015; Krueger & Memisevic, 2015; Arjovsky et al., 2015) to a few thousand (Hochreiter & Schmidhuber, 1997; Jaeger, 2012). We therefore tested our proposed feed-forward attention models for $T_0 \in \{50, 100, 500, 1000, 5000, 10000\}$. The required number of epochs or accuracy after 100 epochs for each task, sequence length, and temporal integration method (adaptively weighted attention or unweighted mean) is shown in Table 1. For fair comparison, we report the best result achieved using any learning rate in $\{.0003, .001, .003, .01\}$. From these results, it’s clear that the feed-forward attention model can quickly solve these long-term memory problems for all sequence lengths we tested. Our model is also efficient: Processing one epoch of 100,000 sequences with $T_0 = 10000$ took 254 seconds using an NVIDIA GTX 980 Ti GPU, while processing the same data with a single-layer vanilla RNN with a hidden dimensionality of 100 (resulting in a comparable number of parameters) took 917 seconds on the same hardware. In addition, there is a clear benefit to using the attention mechanism of Equation (1) instead of a simple unweighted average over time, and the attention mechanism incurs only a marginal increase in the number of parameters (10,602 vs. 10,501, or less than 1%).

2.3 Variable-length Experiment

Because the range of sequence lengths $[T_0, 1.1T_0]$ is small compared to the range of $T_0$ values we evaluated, we further tested whether it was possible to train a single model which could cope with sequences with highly varying lengths. To our knowledge, such a variant of these tasks has not been studied before. We trained models of the same architecture used in the previous experiment on minibatches of sequences whose lengths were chosen uniformly at random between 50 and 10000 time steps. Using the attention mechanism of Equation (1), on held-out test sets of 1000 sequences, our model achieved 99.9% accuracy on the addition task and 99.4% on the multiplication task after training for 100 epochs. This suggests that a single feed-forward network with attention can simultaneously handle both short and very long sequences, with a marginal decrease in accuracy. Using an unweighted average over time, we were only able to achieve accuracies of 77.4% and 55.5% on the variable-length addition and multiplication tasks, respectively.
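The text does not say how sequences of different lengths were combined into a minibatch. One possible approach, assumed here purely for illustration, is to pad every sequence to the longest length in the batch and mask the padded positions out of the softmax so that they receive zero weight. The function name `masked_attention` and the random inputs are hypothetical.

```python
import numpy as np


def masked_attention(h, e, lengths):
    """Attention-weighted average over padded sequences.

    h: (B, T_max, D) padded states, e: (B, T_max) scores,
    lengths: (B,) true length of each sequence (each at least 1).
    """
    T_max = h.shape[1]
    mask = np.arange(T_max)[None, :] < lengths[:, None]   # True on real steps
    e = np.where(mask, e, -np.inf)                         # padding gets zero weight
    w = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = w / w.sum(axis=1, keepdims=True)
    return np.einsum("bt,btd->bd", alpha, h), alpha


rng = np.random.default_rng(0)
lengths = rng.integers(50, 10001, size=4)      # lengths between 50 and 10000 inclusive
T_max, D = int(lengths.max()), 3
h = rng.normal(size=(4, T_max, D))
e = rng.normal(size=(4, T_max))
c, alpha = masked_attention(h, e, lengths)
```

With the mask in place, each row of the result equals what attention would produce on the unpadded sequence alone, so the amount of padding does not affect the output.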

2.4 Discussion

A clear limitation of our proposed model is that it will fail on any task where temporal order matters because computing an average over time discards order information. For example, on the two-symbol temporal order task (Hochreiter & Schmidhuber, 1997) where a sequence must be classified in terms of whether two symbols $X$ and $Y$ appear in the order $X, X$; $X, Y$; $Y, X$; or $Y, Y$, our model can differentiate between the $X, X$ and $Y, Y$ cases perfectly but cannot differentiate between the $X, Y$ and $Y, X$ cases at all. Nevertheless, we submit that for some real-world tasks involving sequential data, temporal order is substantially less important than being able to handle very long sequences. For example, in Joachims’ seminal paper on text document categorization (Joachims, 1998), he posits that “word stems work well as representation units and that their ordering in a document is of minor importance for many tasks”. In fact, the current state-of-the-art system for document classification still uses order-agnostic sequence integration (Lei et al., 2015). We have also shown in parallel work that our proposed feed-forward attention model can be used effectively for pruning large-scale (sub)sequence retrieval searches, even when the sequences are very long and high-dimensional (Raffel & Ellis, 2016).
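The order-invariance behind this limitation is easy to check numerically: because each weight depends only on its own state, shuffling the time steps leaves the pooled vector unchanged up to floating-point rounding. The scoring weights `w` and the tanh score below are illustrative assumptions.

```python
import numpy as np


def attention_pool(h, w):
    """Feed-forward attention with a hypothetical linear-tanh score."""
    e = np.tanh(h @ w)                     # one score per time step
    alpha = np.exp(e) / np.exp(e).sum()    # softmax over time
    return alpha @ h


rng = np.random.default_rng(0)
h = rng.normal(size=(20, 8))          # state sequence, T = 20
w = rng.normal(size=8)                # illustrative score weights
perm = rng.permutation(len(h))        # shuffle the time steps

c_original = attention_pool(h, w)
c_shuffled = attention_pool(h[perm], w)
```

Since `c_original` and `c_shuffled` coincide, any two sequences that contain the same states in a different order are indistinguishable to the layers that follow.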

Our experiments explicitly demonstrate that including an attention mechanism can allow a model to refer to specific points in a sequence when computing its output. They also provide an alternate argument for the claim made by Bahdanau et al. (2014) that attention helps models handle very long and widely variable-length sequences. We are optimistic that our proposed feed-forward model will prove beneficial in additional real-world problems requiring order-agnostic temporal integration of long sequences. Further investigation is warranted; to facilitate future work, all of the code used in our experiments is available online (https://github.com/craffel/ff-attention/tree/master/toy_problems).

3 Acknowledgements

We thank Sander Dieleman, Bart van Merriënboer, Søren Kaae Sønderby, Brian McFee, and our anonymous reviewers for discussion and feedback.

References