Many large-scale applications, in particular information retrieval tasks, require efficient algorithms to compress documents and query them. For example, at test time, systems may have to process millions of queries simultaneously and in real time. The content-based attention mechanism (Bahdanau et al., 2015)
is a recently introduced architecture that allows the system to focus on particular parts of the document depending on the query. It has proven to be very beneficial in many applications of deep learning, but its expensive computations often prevent it from being used in large-scale applications. In this work we introduce a family of linear attention mechanisms that overcome these limitations while still offering, to some extent, the benefits of the traditional attention mechanism.
Notations: Let $D = (x_1, \dots, x_n)$ represent a document sequence of $n$ tokens and let us consider $k$ queries on this document. Let $q$
represent one of these queries, which is encoded into a column vector representation $q \in \mathbb{R}^d$ (for example the last state of a recurrent neural network). The document is processed with a recurrent neural network which, at each timestep $t$, computes a hidden state $h_t$ of size $d$. Let $H \in \mathbb{R}^{n \times d}$ be the matrix composed of all the hidden states of the document stacked vertically, i.e. whose row $t$ is $h_t^\top$.
2 Classic softmax attention mechanism
2.1 Definition and complexity
In this work, we consider the following form of softmax attention mechanism, which computes a representation $a_q$ of the document conditioned on the question $q$:

$$a_q = H^\top \operatorname{softmax}(Hq),$$

where $Hq \in \mathbb{R}^n$ represents the inner products of $q$ with all the hidden states $h_1, \dots, h_n$ of the document. The softmax then converts these inner products into probabilities that are used to compute a weighted sum of the hidden states stacked in $H$. (Note that this form is found in memory networks (Sukhbaatar et al., 2015), but other forms are common, all with similar complexities and memory requirements, in particular the one introduced by Bahdanau et al. (2015). We present this particular form because it is the most similar to the cheap mechanism that we introduce in the next section.)
This mechanism involves two matrix-vector products, each of cost $O(nd)$, which results in an overall $O(nd)$ complexity for a single query lookup. If, instead of a single query, we would like to process $k$ queries, the complexity becomes $O(knd)$. If $n$ or $k$ is very large, this complexity is prohibitive and restricts the scale of the potential applications.
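As a point of reference, the lookup above can be sketched in a few lines of NumPy (all names and sizes are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a vector of scores.
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_attention_lookup(H, q):
    # H: (n, d) matrix of hidden states, q: (d,) query vector.
    # Cost is O(nd) per lookup: one matrix-vector product for the
    # scores Hq and one for the weighted sum H^T softmax(Hq).
    scores = H @ q          # inner products of q with each h_t
    p = softmax(scores)     # attention probabilities over timesteps
    return H.T @ p          # weighted sum of the hidden states

rng = np.random.default_rng(0)
H = rng.standard_normal((50, 8))   # n = 50 timesteps, d = 8 hidden units
q = rng.standard_normal(8)
a = softmax_attention_lookup(H, q)
```

Every lookup must revisit all $n$ rows of $H$, which is the cost the next sections aim to remove.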
Furthermore, the classic softmax attention mechanism does not allow storing a fixed-size representation of the document: all $n$ hidden states of the network have to be stored, resulting in a variable-size representation that requires $O(nd)$ memory space. This can also be prohibitive when $n$ is large.
2.2 Applications of the softmax attention mechanisms and limitations
In this section, we describe a few use cases of the softmax attention mechanism and how its computational cost may limit the scale of its applications.
In machine translation (Bahdanau et al., 2015), the document is the source sentence to be translated, composed of $n$ words. The translated sentence is generated iteratively, and at each new timestep an attention lookup is performed: the number of words $k$ of the translated sequence corresponds to the number of required attention lookups. This may significantly slow down the translation of long sentences (large $n$ and large $k$) and prevent real-time translation.
In question answering (Hermann et al., 2015), the document is usually a text of $n$ words. The query is a question about the document, and there might be $k$ questions per document. In practice, $k$ is not known in advance and may be very large. The cost of current softmax attention mechanisms may prevent real-time question answering for many users.
In information retrieval tasks (such as a search engine), each document $D_i$ may be a long sequence (such as a webpage). A query $q$ could be a single question about a fact implicitly contained in one of these documents. The classic softmax attention mechanism would require scanning all the words of every document all over again for each new query.
In network architectures with external memory (Graves et al., 2014; Sukhbaatar et al., 2015), $H$ represents the memory to be queried. Current attention mechanisms may limit the size of the memory and the number of queries, so it seems particularly important to develop more efficient memory mechanisms. One possibility would be a memory architecture whose size does not scale linearly with the number of facts to be stored; another would be a linear-size memory with a sublinear query algorithm.
More generally, the softmax attention mechanism is prohibitive in large-scale applications with long sequences (large $n$), an extremely high number of queries $k$ (possibly to be processed in real time) and strong memory constraints. There is thus a potential interest in developing cheaper attention mechanisms that satisfy the following properties:
At test time, a computational complexity independent of the document size $n$, as opposed to the $O(nd)$ complexity of current attention mechanisms. Such a cheap attention would have very little overhead compared to a recurrent model with no attention (in terms of the sequence size $n$).
At test time, a fixed-size representation of the document, as opposed to the $O(nd)$ memory representations of current attention mechanisms.
At training time, if there are $k$ queries per document, an algorithm which does not scale in $O(kn)$ but only in $O(k + n)$.
The linear attention mechanism that we introduce in the next section satisfies these requirements, allowing us to potentially tackle problems at a much larger scale. As expected, our early experiments show that these computational gains come at the price of slightly lower accuracy than the softmax attention mechanism, yet definitely better than no attention.
3 Cheap linear attention mechanism
3.1 Definition and complexity
In this section, we introduce the simplest version of the linear attention mechanism; more sophisticated extensions are described in the next section. It results from the removal of the softmax, leading to the following linear attention mechanism:

$$a_q = H^\top H q = C q, \qquad C = H^\top H,$$

where $C$ is a square matrix of dimension $d \times d$, computed in $O(nd^2)$. Most importantly, $C$ depends only on the document (not on the query $q$). This implies that once $C$ is computed, any attention lookup costs only $O(d^2)$, i.e. a complexity independent of $n$, the length of the document sequence. For $k$ queries, the resulting attention complexity is $O(kd^2)$, i.e. a speedup of $n/d$ compared to the classic softmax attention mechanism ($O(knd)$). Furthermore, each document can be summarized into the matrix $C$, i.e. a fixed-size representation of size $d^2$, instead of the $n \times d$ matrix $H$ of hidden states required by the softmax attention. Note that if $n < d$ there is no memory improvement, in which case it is more suitable to store $H$ rather than the singular matrix $C$ of rank at most $n$. Notice that $C$ can be seen as the non-centered covariance matrix of the hidden states.
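A minimal sketch of this lookup, assuming the reconstruction $a_q = Cq$ with $C = H^\top H$ (illustrative NumPy, with illustrative sizes):

```python
import numpy as np

def linear_attention_lookup(C, q):
    # C = H^T H is the d x d non-centred covariance of the hidden
    # states; a lookup is a single matrix-vector product, O(d^2),
    # independent of the document length n.
    return C @ q

rng = np.random.default_rng(1)
n, d = 200, 16
H = rng.standard_normal((n, d))
q = rng.standard_normal(d)

C = H.T @ H                        # computed once per document, O(n d^2)
a = linear_attention_lookup(C, q)

# Removing the softmax makes the lookup equivalent to H^T (H q):
assert np.allclose(a, H.T @ (H @ q))
```

Once $C$ is cached, any number of further queries touch only the $d \times d$ matrix, never the full document.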
3.2 Computation of C
The matrix $C$ is equal to $H^\top H$. Computing it that way still requires storing all the hidden states and then performing a large matrix multiplication. To avoid this memory footprint at test time, we can notice that

$$C = H^\top H = \sum_{t=1}^{n} h_t h_t^\top,$$

which suggests an iterative way to compute it:

$$C_t = C_{t-1} + h_t h_t^\top,$$

with $C_0 = 0$ and $C = C_n$. This iterative process avoids storing all the hidden states, and the matrix $C$ can eventually be computed using only $O(d^2)$ memory space.
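The iterative computation can be sketched as follows (illustrative NumPy; only the $d \times d$ accumulator is kept in memory):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 8
H = rng.standard_normal((n, d))    # stands in for the stream of hidden states

# C_0 = 0; C_t = C_{t-1} + h_t h_t^T. Only the d x d accumulator is
# kept in memory, never the full (n, d) history of hidden states.
C = np.zeros((d, d))
for h_t in H:
    C += np.outer(h_t, h_t)

# The streaming result matches the batch product H^T H.
assert np.allclose(C, H.T @ H)
```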
Although the complexity of computing $C$ is still linear in the size $n$ of the sequence, this computation has to be done only a single time per document, in contrast with the classic attention mechanism, for which the document has to be scanned all over again for each new query $q$.
3.3 Backpropagation through $C$
Using the iterative procedure to compute $C_n$ does not require storing all the intermediate matrices $C_t$ during backpropagation. The attention lookup process can be written as

$$a_q = C_n q, \qquad C_n = \sum_{t=1}^{n} h_t h_t^\top.$$

Naive automatic differentiation tools may save all the states $C_t$ of the matrix in the forward pass, which is unnecessary given that the corresponding gradient of the loss $L$ with respect to $h_t$ can be written as:

$$\frac{\partial L}{\partial h_t} = \left( \frac{\partial L}{\partial C_n} + \frac{\partial L}{\partial C_n}^\top \right) h_t,$$

which shows that it is unnecessary to store the intermediate states $C_t$.
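This gradient identity can be checked numerically. In the sketch below, the linear loss $L = v \cdot (C_n q)$ and all names are illustrative assumptions, chosen only so that $\partial L / \partial C_n$ has a simple closed form ($v q^\top$):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 6, 5
H = rng.standard_normal((n, d))    # rows are the hidden states h_t
q = rng.standard_normal(d)
v = rng.standard_normal(d)         # arbitrary linear loss: L = v . (C_n q)

def loss(H):
    C = H.T @ H                    # C_n = sum_t h_t h_t^T
    return v @ (C @ q)

# Closed-form gradient: with G = dL/dC_n = v q^T,
# dL/dh_t = (G + G^T) h_t  -- no intermediate C_t is needed.
G = np.outer(v, q)
analytic = H @ (G + G.T)           # row t is (G + G^T) h_t

# Finite-difference check of dL/dh_t, entry by entry.
eps = 1e-6
numeric = np.zeros_like(H)
for t in range(n):
    for i in range(d):
        Hp = H.copy(); Hp[t, i] += eps
        Hm = H.copy(); Hm[t, i] -= eps
        numeric[t, i] = (loss(Hp) - loss(Hm)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```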
3.4 Summary of the computational advantages
Table 1 summarizes the computational and memory benefits of using linear attention mechanisms compared to the original softmax attention. The forward encoding pass is slightly more expensive for the linear attention mechanism because it has to perform an outer product at each timestep to update the matrix $C_t$.
|                         | Softmax attention | Linear attention |
|-------------------------|-------------------|------------------|
| a) Query complexity     | $O(nd)$           | $O(d^2)$         |
| b) Document compression | $O(nd)$           | $O(d^2)$         |
| c) Encoding complexity  | $O(nd^2)$         | $O(nd^2)$        |
4 Gated linear attention mechanisms
We can generalize the cheap linear attention described previously by incorporating non-linear functions to update $C_t$:

$$C_t = F_t \odot C_{t-1} + G_t \odot \left( \bar h_t \bar h_t^\top \right),$$

where $F_t$, $G_t$ and $\bar h_t$ are (non-linear) functions of $h_t$ and $C_{t-1} h_t$, and $\odot$ denotes the element-wise product. Their intended roles are as follows:
The quantity $C_{t-1} h_t$ is useful because it measures to some extent how much of $h_t$ is already contained in $C_{t-1}$. Suppose that $C_{t-1}$ already contains the unit-norm vector $h_t$ and otherwise only vectors orthogonal to $h_t$; then $C_{t-1} h_t = h_t$, which gives information on the presence or not of $h_t$ in the matrix $C_{t-1}$.
$F_t$ and $G_t$ control to what extent the network remembers the previous matrix $C_{t-1}$ and how much new information it writes to it.
$\bar h_t$ lets the network precisely update certain regions of the matrix $C_{t-1}$. For example, $\bar h_t$ could be the element-wise product of $h_t$ and a sigmoid whose input is $C_{t-1} h_t$.
Backpropagation requires knowing the intermediate values of $C_t$ at each timestep. Instead of storing them in the forward pass, which would be prohibitive memory-wise, we can incrementally re-compute each $C_{t-1}$ starting from the final matrix $C_n$ by inverting the successive transformations. If we memorize in the forward pass the values of $F_t$, $G_t$ and $\bar h_t$, we can use them to compute $C_{t-1}$ from $C_t$:

$$C_{t-1} = \left( C_t - G_t \odot \bar h_t \bar h_t^\top \right) \oslash F_t,$$

where $\oslash$ denotes element-wise division.
In the experiments below, we use a particular instance of the general model above, which we call gated linear attention. It is defined by $F_t = G_t = 1$ (the all-ones matrix) and $\bar h_t = \sigma(W_g h_t) \odot h_t$, where $\sigma$ is the sigmoid function and $\odot$ is the element-wise product. In other words, the network now has the capacity to control the information it adds to the matrix $C_t$. The full mechanism can be written as:

$$\bar h_t = \sigma(W_g h_t) \odot h_t, \qquad C_t = C_{t-1} + \bar h_t \bar h_t^\top, \qquad a_q = C_n q.$$
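A sketch of the general gated update and of its reversal during backpropagation. The gate parameterizations below (sigmoid weights `W_g`, a random stand-in forget gate `F_t`) are illustrative assumptions consistent with the description above, not a prescribed implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
d = 6
W_g = 0.1 * rng.standard_normal((d, d))       # illustrative gate weights

C = np.zeros((d, d))
saved = None
for t in range(20):
    h_t = rng.standard_normal(d)
    hbar = sigmoid(W_g @ h_t) * h_t           # gated hidden state (bar h_t)
    F_t = sigmoid(rng.standard_normal((d, d)))  # stand-in forget gate in (0, 1)
    G_t = np.ones((d, d))                     # write gate (here: all ones)
    saved = (F_t, G_t, hbar, C.copy())        # C.copy() kept ONLY to verify below
    C = F_t * C + G_t * np.outer(hbar, hbar)  # C_t = F ⊙ C_{t-1} + G ⊙ (h̄ h̄^T)

# Backward pass: invert the last update to recover C_{t-1} from C_t,
# using only the F_t, G_t and bar h_t memorized in the forward pass.
F_t, G_t, hbar, C_prev = saved
C_recovered = (C - G_t * np.outer(hbar, hbar)) / F_t
assert np.allclose(C_recovered, C_prev)
```

The element-wise division is safe as long as $F_t$ has no zero entries, which a sigmoid gate guarantees.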
5 Experiments

The cheap linear attention mechanisms are designed for large-scale applications with a very large number of queries per document, processed in real time. Research datasets are not really suitable to highlight their computational efficiency in practice; therefore, we focus on comparing their accuracy results.
We evaluated the basic and gated versions of the linear attention mechanism on a question answering task. We used the CNN dataset released by Hermann et al. (2015), which is composed of Cloze-style questions on news documents, with several questions per document. We did not aim to reach state-of-the-art results but simply to compare the different versions of attention. As such, we used a simple architecture, which only requires a few hours to train. We fixed the architecture for all our experiments so that the models only differ by their attention part. More precisely, the common architecture is composed of a single-layer GRU network to encode the query and a separate single-layer GRU network to encode the document. (Note that for their baseline model without attention, Hermann et al. (2015) concatenated the question and the document. Although this improves performance considerably, it does not allow computing a representation of the document independent of the query, as it requires knowing the question in advance. Therefore we encoded the query and the document with two independent networks.) We used ADAM to train our networks. For the two GRU networks, we chose a small hidden size and word embeddings of size 100.
At test time, an optimized implementation should yield a speedup of $n/d$ for each attention lookup. (These are the complexity gains for the attention lookups only; we do not consider the forward pass necessary for both softmax and linear attentions.) However, at this stage, we are more interested in comparing accuracy rather than speed. The speedup would be better illustrated in applications with a (very) large number of queries per document and relatively long documents, but such public datasets are still rare.
Our early experiments on question answering suggest that linear mechanisms and their gated extensions significantly improve over models with no attention. As expected, the accuracy results of softmax attention are better, but the gap can be reduced by adding non-linear gates to the basic linear mechanism. We believe that more sophisticated extensions could further improve the results.
In terms of memory, the linear attention mechanisms can be seen as a trade-off between no-attention models and classic softmax models. They compress the document sequence into a $d \times d$ representation, which can store more information than the $d$-length vector of the last hidden state of a classic recurrent network, but obviously less than the $n$ hidden states stored by a softmax attention mechanism. This is probably most suitable for tasks with relatively long sequences and an extremely high number of lookups. Nevertheless, for extremely long sequences, we believe that fixed-size representations may not capture enough information, and further research should focus on sublinear (possibly adaptive, depending on how much information is contained in the sequence) representations.
This representation can not only store more information than a $d$-length vector, but it also acts as skip connections from the past hidden states to the output. As a result, we observed that it can capture longer-term dependencies and that training optimization is easier, as it is less prone to the vanishing gradient problem.
A potential extension of this cheap mechanism is to interleave the updates of $C_t$ and $h_t$ to create a new flavor of recurrent unit, which uses second-order information about the past hidden states ($C_t$ can be seen as a non-centered covariance matrix). The recurrent unit would take as input not only the previous hidden state $h_{t-1}$ and the current input $x_t$, but also the product $C_{t-1} h_{t-1}$, which evaluates to some extent how much of $h_{t-1}$ is already stored in $C_{t-1}$.
We introduced a new family of attention mechanisms, called linear attention mechanisms, which, with little computational overhead, yield better and easier-to-optimize models compared to standard recurrent networks with no attention. Their $O(d^2)$ attention lookup complexity, constant with respect to the document length, and their fixed-size $O(d^2)$ memory requirements make them very appealing alternatives for building large-scale information retrieval systems, for which the computational costs of traditional softmax attention mechanisms are prohibitive. More precisely, we believe that the linear attention mechanisms would be suitable for large-scale tasks with some of the following three properties:
long sequences, long enough that a recurrent network with no attention is unable to capture the relevant long-term dependencies.
many attention lookups, such that traditional softmax attention mechanisms would be too slow. This is particularly important for real-time systems which have to process extremely large loads of queries simultaneously (for example millions of queries per hour).
a requirement to store documents into fixed-size representations.
- Bahdanau et al. (2015) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In ICLR’2015, arXiv:1409.0473, 2015.
- Graves et al. (2014) Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
- Hermann et al. (2015) Hermann, Karl Moritz, Kočiský, Tomáš, Grefenstette, Edward, Espeholt, Lasse, Kay, Will, Suleyman, Mustafa, and Blunsom, Phil. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS), 2015. URL http://arxiv.org/abs/1506.03340.
- Sukhbaatar et al. (2015) Sukhbaatar, Sainbayar, szlam, arthur, Weston, Jason, and Fergus, Rob. End-to-end memory networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2440–2448. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5846-end-to-end-memory-networks.pdf.