Attention, please! A Critical Review of Neural Attention Models in Natural Language Processing

02/04/2019 ∙ by Andrea Galassi, et al. ∙ University of Bologna

Attention is an increasingly popular mechanism used in a wide range of neural architectures. Because of the fast-paced advances in this domain, a systematic overview of attention is still missing. In this article, we define a unified model for attention architectures for natural language processing, with a focus on architectures designed to work with vector representation of the textual data. We discuss the dimensions along which proposals differ, the possible uses of attention, and chart the major research activities and open challenges in the area.




1 Introduction

In many problems that involve the processing of natural language, the elements composing the source text each have a different relevance to the task at hand. For instance, in aspect-based sentiment analysis, cue words such as “good” or “bad” could be relevant to some aspects under consideration, but not to others. In machine translation, some words in the source text could be irrelevant to the translation of the next word. In a visual question-answering task, background pixels could be irrelevant to a question regarding an object in the foreground, but relevant to questions regarding the scenery.

Arguably, effective solutions to such problems should factor in a notion of relevance, so as to focus the computational resources on a restricted set of important elements. One possible approach would be to tailor solutions to the specific genre at hand through feature engineering, in order to better exploit known regularities of the input. For example, in the argumentative analysis of persuasive essays, one could decide to assign particular emphasis to the final sentence. However, such an approach is not always viable, especially if the input is long or very information-rich, as in text summarization, where the output is a condensed version of a possibly lengthy text sequence. Another approach of increasing popularity amounts to machine-learning the relevance of the input elements. In that way, neural architectures can automatically weigh the relevance of any region of the input, and take such a weight into account while performing the main task. The commonest solution to this problem is a mechanism known as attention.


Figure 1: Example of attention visualization for an aspect-based sentiment analysis task, from attention-rationales (their Figure 6). Words are highlighted according to attention scores. Phrases in bold are the words considered relevant to the task, or human rationales.

Attention was first introduced in natural language processing (NLP) for machine translation tasks by DBLP:journals/corr/BahdanauCB14, but the idea of glimpses had already been proposed in computer vision by larochelle2010learning, following the observation that biological retinas fixate on relevant parts of the optic array, while resolution falls off rapidly with eccentricity. The term visual attention became especially popular after mnih2014recurrent proposed an architecture that could adaptively select and then process a sequence of regions or locations at high resolution, using a progressively lower resolution for more peripheral pixels, thereby significantly outperforming the state of the art in several image classification tasks as well as in dynamic visual control problems such as object tracking.

Besides offering a performance gain, the attention mechanism can also be used as a tool for interpreting the behaviour of neural architectures, which are notoriously difficult to understand. Indeed, neural networks are sub-symbolic architectures, so the knowledge they gather is stored in numeric elements that do not provide any means of interpretation by themselves. It then becomes hard, if not impossible, to pinpoint the reasons behind a wrong output of a neural architecture. Interestingly, attention could provide a key to interpret and explain, at least to some extent, neural network behaviour [explainability]. The weights computed by attention could point us to relevant information discarded by the neural network, or to irrelevant elements of the input source that have been factored in and could explain a surprising output. By inspecting the network’s attention, for instance by visually highlighting attention weights, one could attempt to investigate and understand the outcome of neural networks. Hence, weight visualization is now common practice, and a number of specific tools have been devised for this type of analysis [visual-interrogation, interactive-visualization]. Figure 1 shows an example of attention visualization in the context of aspect-based sentiment analysis.

Addressed Task: Related Works

Document Classification: yang2016hierarchical, rationales-supervised
Summarization: abstrctive-summarization, word-attention, ijcai2018-577
Language Modelling: NIPS2015_5846, frustrating
Question Answering: NIPS2015_5945, NIPS2015_5846, D16-1053, Goldilocks, consensus-attention-reader, Attention-sum-reader, iterative-alternating, wang2016inner, dos-santos-attentive-pooling, P17-1055-attention-over-attention, gated-attention-comprehension, Reasonet, DBLP:conf/aaai/KunduN18, DBLP:conf/aaai/ZhuWQL18, hermitian
Question Answering over Knowledge Base: cross-attention

Machine Translation: global-local, supervised-translation-mi, coverage-embeddings, interactive-attention, coverage, multi-head-disagreement, DBLP:journals/corr/BahdanauCB14, supervised-translation-liu, NIPS2017_7181, D17-1151, transparent-attention, modeling-localness, DBLP:conf/aaai/ChenWUSZ18, word-attention, P18-1167, zhao2018attention-via-attention, temperature-control, history-attention, sparse-constrained
Translation Quality Estimation
Pun Recognition: WECA

Multimodal Tasks: MARN
Image Captioning: xu2015show-soft-hard
Visual Question Answering: co-attention, stackedattentionimage, DBLP:conf/ijcai/SongZGS18
Task-oriented Language Grounding: DBLP:conf/aaai/ChaplotSPRS18

Information Extraction
Coreference Resolution: P18-2017, P18-2063
Named Entity Recognition: DBLP:conf/aaai/0001FLH18, constrained-softmax
Optical Character Recognition Correction: P18-1220

Social Application
Abusive content detection: DeeperAttention

Entity Disambiguation: DBLP:conf/aaai/NieCWLP18
Natural Language Inference: decomposable, cafe, sparsemax, N16-1170, multi-dimensional, reinforced-self-att
Semantic Relatedness: multi-dimensional
Semantic Role Labelling: linguistically-informed-attention, DBLP:conf/aaai/TanWXCS18
Textual Entailment: DBLP:journals/corr/RocktaschelGHKB15, hermitian, self-attentive-embedding
Word Sense Disambiguation: glosses

Constituency Parsing: NIPS2015_5635, P18-1249
Dependency Parsing: C18-1067, biaffine-attention, constrained-softmax

Sentiment Analysis: D16-1021, HEAT, attention-aspect-sentiment, attention-rationales, effective-attention-sentiment, sentiment-auxiliary, interactive-attention-network, multi-grained, AAAI1816541, contrastive-contextual-co-attention, dual-attention, multi-dimensional, self-attentive-embedding, DBLP:conf/aaai/LiWZY18, what-to-learn
Agreement/Disagreement Identification: hybrid
Argumentation Mining: D18-1402
Emoji prediction: emoji
Emotion Cause Analysis: co-attention-emotion-cause
Emotion Classification: dual-attention

Table 1: Non-exhaustive list of works that exploit attention, grouped by the task(s) addressed.

For all these reasons, attention has become a frequent component of neural architectures for NLP [DBLP:journals/jair/GattK18, trends-NLP]. Table 1 presents a non-exhaustive list of neural architectures where the introduction of an attention mechanism has brought about a significant gain, grouped by the NLP task they address. The spectrum of tasks involved is remarkably broad. Besides NLP and computer vision [xu2015show-soft-hard, DRAW, self-a-gan], attentive models have been successfully adopted in many other fields, such as speech recognition [speech-recognition, listen-attend-spell, sperber2018], recommendation [DBLP:conf/aaai/WangHCHL018, ijcai2018-546], time-series analysis [8476227, DBLP:conf/aaai/SongRTS18], and mathematical problems [pointer-networks].

In NLP, after an initial exploration by a number of seminal papers [DBLP:journals/corr/BahdanauCB14, NIPS2015_5846, NIPS2017_7181], a fast-paced development of new attention models and attentive architectures ensued, resulting in a highly diversified architectural landscape. Because of, and adding to, this overall complexity, it is not unheard of for different groups of authors to independently follow similar intuitions, leading to the development of almost identical attention models. For instance, the concepts of inner attention [wang2016inner] and word attention [word-attention] are arguably one and the same. Unsurprisingly, the same terms have also been introduced by different authors to define different concepts, further adding to the ambiguity in the literature. For example, the term context vector is used with different meanings by DBLP:journals/corr/BahdanauCB14, class-aware, and yang2016hierarchical.

In this article, we offer a systematic overview of attention models developed for NLP. To this end, we provide a general model of attention for NLP tasks, and use it to chart the major research activities in this area. We restrict our analysis to attentive architectures designed to work with vector representations of data, as is typically the case in NLP. Readers interested in attention models for tasks where data has a graph representation can refer to the survey by graph-attention-survey, which specifically addresses attention models on graphs, framing them according to the characteristics of the graph and of the general task.

What this survey does not offer is a comprehensive account of all the neural architectures for NLP that use an attention mechanism, not only because of the sheer volume of new articles featuring architectures that increasingly rely on such a mechanism, but also because our purpose is to produce a synthesis and a critical outlook rather than a flat listing of research activities. For the same reason, we do not offer a quantitative evaluation of different types of attention mechanisms, since such mechanisms in general are embedded in larger neural network architectures devised to address specific tasks, and it would be pointless in many cases to make comparisons using different standards. Focused comparative studies have been carried out, for instance, in the domain of machine translation, by D17-1151 and P18-1167.

This survey is structured as follows. In Section 2 we define a general model of attention and we describe its components. We use the machine-translation architecture by DBLP:journals/corr/BahdanauCB14 as an illustration and an instance of the general model. In Section 3 we propose a taxonomy for different attention models, divided by compatibility function, input representation, distribution function, and multiplicity. Section 4 discusses how attention can be combined with knowledge about the task or the data. In Section 5 we discuss open challenges, current trends and future directions. Section 6 concludes.

2 The Attention Function

The attention mechanism is a part of a neural architecture that makes it possible to dynamically highlight relevant features of the input data, which in NLP is typically a sequence of textual elements. It can be applied directly to the raw input, or to a higher-level representation thereof. The core idea behind attention is to compute a weight distribution over the input sequence, assigning higher values to the more relevant elements.

To illustrate, we briefly describe a classic attention architecture, called RNNsearch [DBLP:journals/corr/BahdanauCB14]. We chose RNNsearch because of its historical significance, and also because of its simplicity compared to other architectures that we will describe further on.

2.1 An Example for Machine Translation and Alignment

RNNsearch uses attention in a machine translation task. The objective is to compute an output sequence $y = (y_1, \dots, y_m)$ that is a translation of an input sequence $x = (x_1, \dots, x_n)$. The architecture consists of an encoder followed by a decoder, as Figure 2 illustrates.

Figure 2: Architecture of RNNsearch [DBLP:journals/corr/BahdanauCB14] (left) and its attention model (right).

The encoder is a Bidirectional Recurrent Neural Network (BiRNN) [birnn] that computes an annotation term $h_i$ for every input term $x_i$ of $x$ (Eq. 1):

$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}] \qquad (1)$$
The decoder consists of two elements in cascade: the attention function and an RNN. At each time step $t$, the attention function produces an embedding $c_t$ of the input sequence, called a context vector. The subsequent RNN, characterized by a hidden state $s_t$, computes from such an embedding a probability distribution over all possible output symbols, pointing to the most probable symbol $y_t$ (Eq. 2):

$$p(y_t \mid y_1, \dots, y_{t-1}, x) = g(y_{t-1}, s_t, c_t) \qquad (2)$$
The context vector is obtained as follows. At each time step $t$, the attention function takes as input the previous hidden state $s_{t-1}$ of the RNN and the annotations $h_1, \dots, h_n$. Such inputs are processed through an alignment model $a$ (Eq. 3) to obtain a set of scalar values $e_{t,i}$ which score the match between the inputs around position $i$ and the outputs around position $t$:

$$e_{t,i} = a(s_{t-1}, h_i) \qquad (3)$$

These scores are then normalized through a softmax function, so as to obtain a set of weights $\alpha_{t,i}$ (Eq. 4):

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})} \qquad (4)$$

Finally, the context vector $c_t$ is computed as a weighted sum of the annotations $h_i$, based on their weights (Eq. 5):

$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i \qquad (5)$$
Quoting DBLP:journals/corr/BahdanauCB14, the use of attention “relieve[s] the encoder from the burden of having to encode all information in the source sentence into a fixed length-vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.”
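To make the computation concrete, the per-step attention of Eqs. 3–5 can be sketched in NumPy. The alignment model is assumed here to take RNNsearch's additive form, $v_a^{\top} \tanh(W_a s_{t-1} + U_a h_i)$; all parameter names and dimensions below are illustrative only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array (Eq. 4)."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def rnnsearch_attention_step(s_prev, H, W_a, U_a, v_a):
    """One decoding step of the RNNsearch attention function.

    s_prev : (d_s,)   previous decoder hidden state s_{t-1}
    H      : (n, d_h) encoder annotations h_1..h_n (one row per input token)
    W_a, U_a, v_a : learnable parameters of the additive alignment model
    Returns the attention weights alpha_t and the context vector c_t.
    """
    # Eq. 3: e_{t,i} = v_a^T tanh(W_a s_{t-1} + U_a h_i), for all i at once
    e = np.tanh(s_prev @ W_a + H @ U_a) @ v_a   # shape (n,)
    # Eq. 4: normalize the energy scores into weights
    alpha = softmax(e)                          # shape (n,)
    # Eq. 5: context vector as the weighted sum of annotations
    c = alpha @ H                               # shape (d_h,)
    return alpha, c

# Toy usage with random parameters (dimensions are illustrative).
rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 5, 8, 6, 4
alpha, c = rnnsearch_attention_step(
    rng.normal(size=d_s), rng.normal(size=(n, d_h)),
    rng.normal(size=(d_s, d_a)), rng.normal(size=(d_h, d_a)),
    rng.normal(size=d_a))
assert np.isclose(alpha.sum(), 1.0) and c.shape == (d_h,)
```

Note that the contribution $U_a h_i$ does not depend on $t$, which is why it can be pre-computed once per input sentence in an actual implementation.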

2.2 A Unified Attention Model

Symbol Name Definition
$x$ Input sequence Sequence of textual elements constituting the raw input.
$K$ Keys Matrix of $n_k$ vectors $k_i$, each of size $d_k$, whereupon attention weights are computed.
$V$ Values Matrix of $n_k$ vectors $v_i$, each of size $d_v$, whereupon attention is applied. Each $v_i$ and its corresponding $k_i$ offer two, possibly different, interpretations of the same entity.
$q$ Query Vector of size $d_q$, or sequence thereof, with respect to which attention is computed.
kaf, qaf, vaf Annotation functions Functions that encode the input sequence and query, producing $K$, $q$, and $V$, respectively.
$e$ Energy scores Vector of size $n_k$, whose scalar elements (energy “scores” $e_i$) represent the relevance of the corresponding $k_i$, according to the compatibility function: $e = f(q, K)$.
$a$ Attention weights Vector of size $n_k$, whose scalar elements (attention “weights” $a_i$) represent the relevance of the corresponding $v_i$ according to the attention model: $a = g(e)$.
$f$ Compatibility function Function that evaluates the relevance of $K$ with respect to $q$, returning a vector of energy scores: $e = f(q, K)$.
$g$ Distribution function Function that computes the attention weights from the energy scores: $a = g(e)$.
$Z$ Weighted values Matrix of $n_k$ vectors $z_i$, each of size $d_v$, representing the application of $a$ to $V$: $z_i = a_i v_i$.
$c$ Context vector Vector of size $d_v$, offering a compact representation of $Z$: $c = \sum_i z_i$.

Table 2: Notation.

The characteristics of an attention model depend on the structure of the data whereupon it operates, and on the desired output structure. The unified model we propose is based on, and extends, the models proposed by frustrating and NIPS2017_7181. It comprises a core part shared by almost all the models found in the surveyed literature, as well as some additional components that, although not universally present, are still found in the majority of models in the literature.

Figure 3: Core attention model.

Figure 4: General attention model.

Figure 3 illustrates the core attention model, which is part of the general model shown in Figure 4. Table 2 lists the key terms and symbols. The core of the attention mechanism maps a sequence of vectors $K$, the keys, to a distribution $a$ of weights. $K$ encodes the data features whereupon attention is computed. For instance, $K$ may contain word or character embeddings of a document, or the internal states of a recurrent architecture, as happens with the annotations $h_i$ in RNNsearch; the keys could also include multiple features or representations of the same object (e.g., both a one-hot encoding and an embedding of a word), or even – if the task calls for it – representations of entire documents.

In most cases, another input element $q$, called the query (the concept of “query” in attention models should not be confused with that used in tasks such as question answering or information retrieval: in our model, the “query” is an element of a general architecture and is task-independent), is used as a reference when computing the attention distribution. The attention mechanism gives emphasis to the input elements considered relevant to the task according to the query, if a query is defined, or else to those considered inherently relevant to the task, if no query is defined. In RNNsearch, for instance, $q$ is a single element, namely the RNN hidden state $s_{t-1}$. In other architectures, $q$ ranges from embeddings of actual textual queries, to contextual information, to background knowledge, and so on, and as such it can take the form of a matrix rather than a vector. For example, in their Document Attentive Reader, iterative-alternating make use of two query vectors.

From the keys and the query, a vector $e$ of energy scores is computed through a compatibility function $f$ (Eq. 6):

$$e = f(q, K) \qquad (6)$$

Function $f$ corresponds to RNNsearch’s “alignment model”, and to what other authors call the “energy function” [zhao2018attention-via-attention]. Energy scores are then transformed into attention weights $a$ using what we call a distribution function $g$ (Eq. 7):

$$a = g(e) \qquad (7)$$

Such weights are the outcome of the core attention mechanism. The commonest distribution function is the softmax function, as in RNNsearch, which normalizes all the scores into a probability distribution. The weights $a$ represent the relevance of each element to the given task, with respect to $K$ and $q$.

The computation of these weights may already be sufficient for some tasks, such as the classification task addressed by P17-1055-attention-over-attention. Nevertheless, in most cases the task requires the computation of a new representation of the keys. In such cases, it is common to have another input element: a sequence of vectors $V$, the values, representing the data whereupon the attention computed from $K$ and $q$ is to be applied. Each element $v_i$ of $V$ corresponds to one and only one element $k_i$ of $K$, and the two can be seen as different representations of the same data. Indeed, many architectures, including RNNsearch, do not distinguish between $K$ and $V$. The distinction between keys and values was introduced by frustrating, who use different representations of the input for computing the attention distribution and the contextual information. The weights $a$ and the values $V$ are thus combined to obtain a new set $Z$ of weighted representations of $V$ (Eq. 8):

$$z_i = a_i v_i \qquad (8)$$

These are then merged together so as to produce a compact representation of $Z$, usually called the context vector $c$ (Eq. 9; although most authors use this terminology, we shall remark that yang2016hierarchical, class-aware and other authors use the term context vector to refer to other elements of the attention architecture):

$$c = \sum_{i=1}^{n_k} z_i \qquad (9)$$

The commonest way of obtaining $c$ from $Z$ is by summation, as in Eq. 9. However, alternatives have been proposed, including the use of gating functions [multi-dimensional]. Either way, $c$ will be mainly determined by the values associated with higher attention weights.
What we described so far is a synthesis of the most frequent architectural choices made in the design of attentive architectures. Other options will be explored in Section 3.4.
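As a sketch, the core model of Eqs. 6–9 condenses into a few lines of NumPy; the compatibility function $f$ and distribution function $g$ are pluggable arguments, and the dot-product default and softmax used below are arbitrary illustrative choices, not part of the unified model itself.

```python
import numpy as np

def softmax(e):
    """A common choice of distribution function g (Eq. 7)."""
    e = e - e.max()
    w = np.exp(e)
    return w / w.sum()

def core_attention(q, K, V, f=None, g=softmax):
    """Core attention model of Eqs. 6-9.

    q : (d_k,)    query
    K : (n_k, d_k) keys
    V : (n_k, d_v) values (one value per key)
    f : compatibility function mapping (q, K) to energy scores e
    g : distribution function mapping e to attention weights a
    Returns the attention weights a and the context vector c.
    """
    if f is None:                 # illustrative default: dot-product compatibility
        f = lambda q, K: K @ q
    e = f(q, K)                   # Eq. 6: energy scores
    a = g(e)                      # Eq. 7: attention weights
    Z = a[:, None] * V            # Eq. 8: z_i = a_i * v_i
    c = Z.sum(axis=0)             # Eq. 9: context vector by summation
    return a, c
```

Passing a different `f` reproduces the compatibility-function variants surveyed in Section 3.2, while swapping `g` changes the distribution function of Section 3.3.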

2.3 The Uses of Attention

Attention makes it possible to estimate the relevance of the input elements, and to combine those elements into a compact representation – the context vector – that condenses the characteristics of the most relevant elements. Because the context vector is smaller than the original input, it requires fewer computational resources to be processed at later stages, yielding a computational gain.

For tasks such as document classification, where usually there is only $K$ in input and no query, the attention mechanism can be seen as an instrument to encode the input into a compact form. The computation of such an embedding can be seen as a form of feature selection, and as such it can be applied to any set of features sharing the same representation. This applies to cases where features come from different domains [MARN] or from different levels of a neural architecture [transparent-attention], or where they simply represent different aspects of the input document [genre-aware-book].

When the generation of a text sequence is required, as in machine translation, attention makes it possible to exploit a dynamic representation of the input sequence, whereby the whole input does not have to be encoded into a single vector. At each time step, the encoding is tailored to the task, and is conditioned in particular on an embedding of the previous state of the decoder. More generally, the possibility of computing attention with respect to a query allows us to create representations of the input that depend on the task context, creating specialized embeddings.

Since attention can create contextual representations of an element, it can also be used to build sequence-to-sequence annotators without resorting to RNNs or CNNs, as suggested by NIPS2017_7181, who rely on attention mechanisms alone to obtain a whole encoder/decoder architecture.

Attention can also be used as a tool for selecting specific words. This is the case, for example, in dependency parsing, where linguistically-informed-attention rely on self-attention to predict dependencies, and in cloze question-answering tasks [Attention-sum-reader, P17-1055-attention-over-attention]. In the latter case, attention can be applied to a textual document or to a vocabulary so as to perform a classification among the words.

Finally, attention can come in handy when multiple interacting input sequences have to be considered in combination. In tasks such as question answering, where the input consists of two textual sequences—for instance, the question and the document, or the question and the possible answers—an encoding of such sequences can be obtained according to the mutual interactions between the elements of such sequences, rather than by applying a more rigid a-priori defined model. We will discuss this technique, which is known as co-attention, in Section 3.4.2.

3 A Taxonomy for Attention Models

Attention models can be described on the basis of the following orthogonal dimensions: the nature of the inputs (Section 3.1), the compatibility function (Section 3.2), the distribution function (Section 3.3), and the number of distinct inputs/outputs, which we refer to as “multiplicity” (Section 3.4). Moreover, attention modules can themselves be used inside larger attention models to obtain complex architectures, as in hierarchical input models (Section 3.1.1) or in some multiple-input co-attention models (Section 3.4.2).

3.1 Input Representation

In NLP-related tasks, $K$ and $V$ are usually representations of parts of documents, such as sequences of characters, words, or sentences. These components are usually embedded into continuous vector representations and then processed through key/value annotation functions (called kaf and vaf in Figure 4), so as to obtain the hidden representations resulting in $K$ and $V$. A typical annotation function could be a recurrent neural layer such as an RNN, a Gated Recurrent Unit (GRU), a Long Short-Term Memory (LSTM), or a Convolutional Neural Network (CNN). In this way, $k_i$ and $v_i$ represent an input element relative to its local context. If the layers in charge of annotation are trained together with the attention model, they can learn to encode information useful to the attention model.

Alternatively, $K$/$V$ can be taken to represent each input element in isolation, rather than in context. For instance, they could be one-hot encodings of words or characters, or pre-trained word embeddings. This results in an application of the attention mechanism directly to the raw input, a model known as inner attention [wang2016inner]. Such a model has been proven effective by several authors, who have exploited it in different fashions [DeeperAttention, NIPS2017_7181, word-attention, DBLP:conf/aaai/LiWZY18]. The resulting architecture has a smaller number of layers and hyper-parameters, which reduces the computational resources needed for training.

We have assumed so far two different input sources: the input sequence, represented by $K$ and $V$, and the query, represented by $q$. However, some architectures, known as self-attentive or intra-attentive models, compute attention based only on the input sequence. This usually happens in one of two cases: when the query term $q$ is missing, as in yang2016hierarchical, self-attentive-embedding, or when the keys and the query represent the same data, as in [NIPS2017_7181, self-a-gan]. The former case results in simplified compatibility functions (see Section 3.2). The latter can be dealt with in multiple ways. One possibility is to consider the whole $K$ to be the query, and apply a technique we will describe in Section 3.4.2, known as co-attention. As a special case, one can also construct a query which represents the entire set $K$, for example by computing an average vector. Alternatively, one could consider the representation of a part of $K$ as the query. For example, AAAI1816541 apply an attention mechanism to a selected subset of $K$, and then use the resulting context vector as the query in a further attention module. That could be considered a case of hierarchical input architecture (see Section 3.1.1). Finally, one could apply attention multiple times, each time using a different $k_i$ as the query. In this case, the weights will represent the relevance of each key with respect to $k_i$, yielding $n_k$ separate context embeddings, one per key. Attention could thus be used as a sequence-to-sequence model, as an alternative to CNNs or RNNs (see Figure 5). This is especially interesting, since it could overcome a well-known shortcoming of RNNs: their limited ability to model long-range dependencies [rnn-difficult].

Figure 5: Example of use of attention in a sequence-to-sequence model.
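Assuming, for illustration, a dot-product compatibility function and a softmax distribution, this sequence-to-sequence use of attention, with each key taking its turn as the query, can be sketched as follows.

```python
import numpy as np

def softmax(e):
    """Softmax distribution function over a vector of energy scores."""
    e = e - e.max()
    w = np.exp(e)
    return w / w.sum()

def self_attention_seq2seq(K, V):
    """Apply attention once per key, using each k_i as the query in turn.

    K : (n_k, d_k) keys; V : (n_k, d_v) values.
    Returns one context vector per input position, shape (n_k, d_v).
    """
    contexts = []
    for q in K:               # each element of K acts as the query
        e = K @ q             # dot-product compatibility (illustrative choice)
        a = softmax(e)        # attention weights for this position
        contexts.append(a @ V)
    return np.stack(contexts)
```

Each output row is a context vector that re-encodes one input position in terms of all the others, which is how attention can substitute an RNN or CNN as a sequence annotator.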

$q$ can also represent a single element of the input sequence. This is the case, for example, in the work by transparent-attention, whose attention architecture operates on different encodings of the same element, obtained by the subsequent application of RNN layers. The context embeddings obtained for each component individually can then be concatenated, producing a new representation of the document that encodes the most relevant representation of each component for the given task.

We have so far considered the input to be a sequence of characters, words, or sentences, which is the case most of the time. However, the input can also be something else, such as a juxtaposition of features or of relevant aspects of the same textual element. For instance, MARN and ijcai2018-577 consider inputs composed of different sources, while for genre-aware-book and dynamic-meta-embeddings the input represents different aspects of the same document. In that case, embeddings of the input can be collated together and fed into an attention model as multiple keys, as long as the embeddings share the same representation. This makes it possible to highlight the most relevant elements of the input and to operate a feature selection, which can reduce the dimensionality of the representation via the context embedding produced by the attention mechanism.

3.1.1 Hierarchical Input Architectures

If portions of input data can be meaningfully grouped together into higher level structures, hierarchical input attention models can be exploited to subsequently apply multiple attention modules at different levels of the composition, as illustrated in Figure 6.

Consider, for instance, data naturally associated with a two-level semantic structure, such as characters (the “micro”-elements) forming words (the “macro”-elements), or words forming sentences. Attention can first be applied to the representations of the micro-elements, so as to build aggregate representations of the macro-elements, such as context vectors. Attention can then be applied again to the sequence of macro-element embeddings, in order to compute an embedding for the whole document. With this model, attention first highlights the most relevant micro-elements within each macro-element, and then the most relevant macro-elements in the document. For instance, yang2016hierarchical apply attention first at the word level, for each sentence in turn, to compute sentence embeddings. Then, they apply attention again on the sentence embeddings to obtain a document representation. With reference to the model introduced in Section 2, a context vector is computed for each sentence, and then all such embeddings are used together as keys to compute the document-level weights and, eventually, the document’s context vector $c$.

If representations for both micro-level and macro-level elements are available, one can compute attention on one level and then exploit the result as a key or query to compute attention on the other, yielding two different micro/macro representations of the document. In this way, attention makes it possible to identify the most relevant elements for the task at both levels. The attention-via-attention model by zhao2018attention-via-attention defines a hierarchy with characters at the micro level and words at the macro level. Both characters and words act as keys. Attention is first computed on the word embeddings, thus obtaining a document representation in the form of a context vector, which in turn acts as the query guiding the application of character-level attention to the keys (the character embeddings), yielding the weights and a context vector for the character level.

AAAI1816541 identify a single “target” macro-object as a set of words, which do not necessarily have to form a sequence in the document, and then use such a macro-object as keys. The context vector produced by a first application of the attention mechanism on the target is then used as the query in a second application of the attention mechanism, with the keys being the document’s word embeddings.

Figure 6: Hierarchical input attention models defined by yang2016hierarchical (left), zhao2018attention-via-attention (center), and AAAI1816541 (right). The attention functions on different levels are applied sequentially, left-to-right.
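A two-level scheme in the spirit of yang2016hierarchical can be sketched as follows; the importance vectors `w_word` and `w_sent` (hypothetical names) are learnable parameters that play the role of the query at each level, and softmax is assumed as the distribution function.

```python
import numpy as np

def softmax(e):
    """Softmax over a vector of energy scores."""
    e = e - e.max()
    w = np.exp(e)
    return w / w.sum()

def attend(K, w_imp):
    """Query-less attention: score keys against a learned importance vector
    and return their weighted sum (the context vector)."""
    a = softmax(K @ w_imp)
    return a @ K

def hierarchical_encoding(doc, w_word, w_sent):
    """Two-level hierarchical attention.

    doc    : list of sentences, each a (n_words, d) array of word embeddings
    w_word : (d,) importance vector for word-level attention
    w_sent : (d,) importance vector for sentence-level attention
    Returns a single (d,) document embedding.
    """
    # Word-level attention yields one context vector per sentence...
    sent_embs = np.stack([attend(S, w_word) for S in doc])
    # ...which then act as keys for sentence-level attention.
    return attend(sent_embs, w_sent)
```

Here, for simplicity, keys and values coincide at both levels, as in many of the surveyed architectures.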

3.2 Compatibility Functions

Name Equation Reference

similarity $e = \mathrm{sim}(q, K)$ DBLP:journals/corr/GravesWD14
multiplicative or dot $e = q^{\top} K$ global-local
scaled multiplicative $e = \frac{q^{\top} K}{\sqrt{d_k}}$ NIPS2017_7181
general or bilinear $e = q^{\top} W K$ global-local
biased general $e = K (W q + b)$ iterative-alternating
activated general $e = \mathrm{act}(q^{\top} W K + b)$ interactive-attention-network
concat $e = w_{imp}^{\top} \mathrm{act}(W [K; q] + b)$ global-local
additive $e = w_{imp}^{\top} \mathrm{act}(W_1 K + W_2 q + b)$ DBLP:journals/corr/BahdanauCB14
deep $e = w_{imp}^{\top} E^{(L)}$, with $E^{(l)} = \mathrm{act}(W_l E^{(l-1)} + b_l)$ and $E^{(1)} = \mathrm{act}(W_1 K + W_0 q)$ DeeperAttention
location-based $e = \mathrm{act}(W q + b)$ global-local

Table 3: Summary of compatibility functions found in the literature. $W$, $W_0$, …, $W_L$, $b$, $b_0$, …, $b_L$, and $w_{imp}$ are learnable parameters; $d_k$ is the size of the keys.

The compatibility function is a crucial part of the attention architecture, because it defines how keys and queries are matched or combined. In this discussion of compatibility functions, we shall consider a data model where $q$ and the $k_i$ are mono-dimensional vectors. For example, if $K$ represents a document, each $k_i$ may be the embedding of a sentence, a word, or a character. In such a model, $q$ and $k_i$ may have the same structure, and thus the same size, although that is not always necessary. However, in some architectures $q$ can consist of a sequence of vectors or a matrix, a possibility we explore in Section 3.4.2.

Some common compatibility functions are listed in Table 3. Two main approaches can be identified. The first is to match and compare $q$ and $K$. For instance, the idea behind the similarity attention proposed by DBLP:journals/corr/GravesWD14 is that the most relevant keys are those most similar to the query. Accordingly, the authors present a model that relies on a similarity function (sim in Table 3) to compute the energy scores. For example, they rely on cosine similarity, a choice well suited to cases where the query and the keys share the same semantic representation. A similar idea is followed by the widely used multiplicative or dot attention, where the dot product between $q$ and $K$ is computed. A variation of this model is scaled multiplicative attention, where a scaling factor is introduced to improve performance with large keys [NIPS2017_7181]. General attention, proposed by global-local, extends this concept to accommodate keys and queries with different representations; to that end, it introduces a learnable matrix parameter $W$. In what could be called biased general attention, iterative-alternating introduce a learnable bias, so as to consider some keys as relevant independently of the input. Activated general attention [interactive-attention-network] employs a non-linear activation function. In Table 3, act is a placeholder for a non-linear activation function such as the hyperbolic tangent (tanh), the rectified linear unit (ReLU) [relu], or the scaled exponential linear unit (SELU) [self-norm-selu].

A different approach amounts to combining rather than comparing keys and query, using them together to compute a joint representation, which is then multiplied by an importance vector (our terminology; as previously noted, this vector is termed context vector by yang2016hierarchical and other authors), which has to adhere to the same semantics as the new representation. Such a vector defines, in a way, relevance, and could be an additional query element, as offered by AAAI1816541, or a learnable parameter. In the latter case, we speculate that the analysis of a machine-learned importance vector could provide additional information on the model. One of the simplest models that follow this approach is the concat attention by global-local, where a joint representation is given by juxtaposing keys and query. Additive attention works similarly, except that the contributions of keys and query can be computed separately. For example, DBLP:journals/corr/BahdanauCB14 pre-compute the contribution of the keys in order to reduce the computational footprint. Moreover, additive attention could in principle accommodate queries of different sizes. In additive and concat attention the keys and the query are fed into a single neural layer. We speak instead of deep attention if multiple layers are employed [DeeperAttention]. Table 3 illustrates a deep attention function with multiple levels of depth.
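As an illustration, the following sketch computes energy scores with four of the compatibility functions discussed above, using NumPy with random stand-ins for the learnable parameters (W, W1, W2, b, and w_imp are our own names, not taken from the cited works):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_k = 4, 5
K = rng.normal(size=(n_k, d))        # keys, one row per element
q = rng.normal(size=d)               # query vector

e_dot = K @ q                        # multiplicative (dot): e_i = k_i . q
e_scaled = K @ q / np.sqrt(d)        # scaled multiplicative
W = rng.normal(size=(d, d))
e_general = K @ (W @ q)              # general (bilinear): e_i = k_i^T W q
W1 = rng.normal(size=(d, d)); W2 = rng.normal(size=(d, d))
b = np.zeros(d); w_imp = rng.normal(size=d)
# additive: project key and query separately, combine, then score
e_additive = np.tanh(K @ W1.T + q @ W2.T + b) @ w_imp
```

Each variant yields one energy score per key; in a trained architecture the parameters would be learned jointly with the rest of the network.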

Finally, in some models the attention distribution depends only on the query, disregarding the keys. In that case, it is called location-based attention. The energy associated with each key is thus computed as a function of the key’s position, independently of its content [global-local]. Conversely, as we mentioned in Section 2.2, attention can also be computed based on the keys alone, without any query. In that case, we speak of self-attention. Table 3 does not explicitly list the compatibility functions for self-attention, which are but a special case of the more general functions.

3.3 Distribution Functions

Attention distribution maps energy scores to attention weights. The choice of the distribution function depends on the properties the distribution is required to have—for instance, whether it is required to be a probability distribution, a set of probability scores, or a set of Boolean scores—on the need to enforce sparsity, and on the need to take into account the keys’ positions.

One possible distribution function is the logistic sigmoid, as proposed by domain-enablement. In this way, each weight is constrained between 0 and 1, thus ensuring that the values and their corresponding weighted counterparts share the same boundaries. The same range can also be enforced on the context vector’s elements by using a softmax function, as is commonly done. In that case, we refer to soft attention.
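A minimal sketch of soft attention, assuming our own names and shapes rather than those of a specific paper: softmax turns energy scores into a probability distribution, and the context vector is the weighted sum of the values.

```python
import numpy as np

def soft_attention(energies, V):
    """Softmax over energy scores, then the context vector
    as the attention-weighted sum of the value rows."""
    e = energies - energies.max()        # shift for numerical stability
    a = np.exp(e) / np.exp(e).sum()      # attention weights, sum to 1
    return a, a @ V                      # weights and context vector

V = np.array([[1., 0.], [0., 1.], [1., 1.]])
a, c = soft_attention(np.array([2.0, 2.0, 2.0]), V)
# equal energies give uniform weights, so c simply averages the values
```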

With sigmoid and softmax alike, all the key/value elements have a relevance, small as it may be. Yet, it can be argued that, in some cases, some parts of the input are completely irrelevant, and if considered they would likely introduce noise rather than contribute useful information. In such cases, attention distributions that ignore some of the keys altogether could be exploited, also obtaining a reduction of the computational footprint. That can be done through the sparsemax distribution [sparsemax], which truncates to zero the scores under a certain threshold by exploiting the geometric properties of the probability simplex.
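Sparsemax can be computed in closed form as the Euclidean projection of the scores onto the probability simplex. A NumPy sketch of the algorithm of Martins and Astudillo (2016):

```python
import numpy as np

def sparsemax(z):
    """Project the score vector z onto the probability simplex;
    scores below a data-dependent threshold get exactly zero weight."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                      # descending order
    k = np.arange(1, z.size + 1)
    # support: largest k such that 1 + k * z_(k) exceeds the running sum
    support = 1 + k * z_sorted > np.cumsum(z_sorted)
    k_z = k[support][-1]
    tau = (np.cumsum(z_sorted)[k_z - 1] - 1) / k_z   # threshold
    return np.maximum(z - tau, 0.0)
```

With one clearly dominant score, sparsemax assigns it all the probability mass and zeroes out the rest, whereas softmax would still give every element a small positive weight.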

Since in some tasks the relevant features are found in the neighborhood of a certain position, it could be helpful to focus the attention only on a specific portion of the input. If the position is known in advance, one can apply a positional mask, by adding or subtracting a given value from the energy scores before the application of the softmax [multi-dimensional]. Since the location may not be known in advance, the hard attention model by xu2015show-soft-hard considers the keys in a dynamically determined location. Such a solution is less expensive at inference time, but it is not differentiable. For this reason, it requires more advanced training techniques, such as reinforcement learning or variance reduction.

Local attention [global-local] extends this idea, while preserving differentiability. Guided by the intuition that in machine translation only a small segment of the input can be considered relevant at each time step, local attention takes into account only a small window of the keys at a time. The window has a fixed size, and the attention can be better focused on a precise location by combining the softmax distribution with a Gaussian distribution. The mean of the Gaussian distribution is dynamic, while its variance can either be fixed, as done by global-local, or dynamic, as done by modeling-localness.

Selective attention [DRAW] follows the same idea: using a grid of Gaussian filters, only a patch of the keys is considered, with its position, size, and resolution depending on dynamic parameters.

reinforced-self-att combine soft and hard attention, by applying the former only to the elements filtered by the latter. More precisely, softmax is applied only to a subset of selected energy scores, while the weights of the others are set to zero. The subset is determined according to a set of random variables, with each variable corresponding to a key. The probability associated with each variable is determined through soft attention applied to the same set of keys.

The proper “softness” of the distribution could depend not only on the task but also on the query. temperature-control define a model whose distribution is controlled by a learnable “temperature” parameter tuned using a self-adaptive algorithm. When softer attention is required, the temperature increases, producing a smoother distribution of weights; the opposite happens when harder attention is needed.

Another noteworthy possibility for modeling a local approach is to adopt a representation of the keys that highlights elements around a certain position. Some such possibilities are explored by D16-1021.

Finally, the concept of locality can also be defined according to semantic rules, rather than the temporal position. This possibility will be further discussed in Section 4.

3.4 Multiplicity

We shall now present variations of the general unified model where the attention mechanism is extended to accommodate multiple, possibly heterogeneous, inputs or outputs.

3.4.1 Multiple Outputs

Some applications suggest that the data could, and should, be interpreted in multiple ways. This can be the case when there is ambiguity in the data, stemming, for example, from words having multiple meanings, or when addressing a multi-task problem. For this reason, models have been defined that jointly compute not only one, but multiple attention distributions over the same data.

One possibility, presented by self-attentive-embedding, is to use additive attention (seen in Section 3.2) with an importance matrix instead of a vector, yielding an energy-score matrix where multiple scores are associated with each key. Such scores can be regarded as different models of relevance for the same values and can be used to create a context matrix. Such embeddings can be concatenated together, creating a richer and more expressive representation of the values. In multi-dimensional attention [multi-dimensional], where the importance matrix is a square matrix, attention can be computed feature-wise. To that end, each weight is paired with a single feature of a single value, and a feature-wise product yields the new value.
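A sketch of the feature-wise idea, with a random (untrained) square importance matrix standing in for the learnable one: each feature gets its own attention distribution over the values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
V = rng.normal(size=(n, d))              # values, one row per element
K = V                                    # self-attention: keys are the values
W = rng.normal(size=(d, d))              # square importance matrix (random here)
E = K @ W                                # (n, d): one score per key and feature
A = np.exp(E - E.max(axis=0, keepdims=True))
A /= A.sum(axis=0, keepdims=True)        # feature-wise softmax: columns sum to 1
Z = A * V                                # feature-wise product: weighted values
c = Z.sum(axis=0)                        # context vector, one entry per feature
```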

Another possibility, explored by NIPS2017_7181, is multi-head attention, whereby multiple linear projections of all the inputs (keys, values, and query) are performed according to learnable parameters, and multiple attention functions are computed in parallel. The processed context vectors are then merged together into a single embedding. A suitable regularization term is sometimes imposed so as to guarantee sufficient dissimilarity between attention elements. multi-head-disagreement propose three possibilities: regularization on the subspaces (the linear projections of the values), on the attended positions (the sets of weights), or on the outputs (the context vectors). Multi-head attention can be especially helpful when combined with a non-soft attention distribution, since different heads can capture local and global context at the same time [modeling-localness].
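A minimal sketch of multi-head attention with random stand-ins for the learned projection matrices; each head runs scaled dot-product soft attention in its own subspace, and the per-head context vectors are concatenated.

```python
import numpy as np

def multi_head_attention(K, V, q, n_heads, rng):
    """Project keys, values, and query into n_heads smaller subspaces,
    attend in each, and concatenate the resulting context vectors."""
    d = K.shape[1]
    d_h = d // n_heads                   # per-head subspace size
    heads = []
    for _ in range(n_heads):
        Wk = rng.normal(size=(d, d_h)); Wv = rng.normal(size=(d, d_h))
        Wq = rng.normal(size=(d, d_h))
        Kh, Vh, qh = K @ Wk, V @ Wv, q @ Wq
        e = Kh @ qh / np.sqrt(d_h)       # scaled dot-product energies
        a = np.exp(e - e.max()); a /= a.sum()
        heads.append(a @ Vh)             # per-head context vector
    return np.concatenate(heads)

rng = np.random.default_rng(0)
K = rng.normal(size=(6, 8)); V = rng.normal(size=(6, 8))
q = rng.normal(size=8)
c = multi_head_attention(K, V, q, n_heads=2, rng=rng)   # shape (8,)
```

In the actual model the projections are learned, and a further linear layer merges the concatenated heads.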

Finally, label-wise attention [emoji] computes a separate attention distribution for each class. This may improve the performance as well as lead to a better interpretation of the data, because it could help isolate data points that better describe each class.

3.4.2 Multiple Inputs: Co-Attention

Some architectures consider the query to be a matrix rather than a plain vector. In that case, it could be useful to find the most relevant query elements according to the task and the keys. A straightforward way of doing that would be to apply the attention mechanism to the query elements as well, yielding two independent representations of the two inputs. However, in that way we would miss the information contained in the interactions between the elements of the two inputs. Alternatively, one could apply attention jointly on both, which then become the “inputs” of a co-attention architecture [co-attention].

Co-attention models can be coarse-grained or fine-grained [multi-grained]. Coarse-grained models compute attention on each input, using an embedding of the other input as a query. Fine-grained models consider each element of an input with respect to each element of the other input. Furthermore, co-attention can be performed sequentially or in parallel. In parallel models, the procedures to compute attention on the two inputs are symmetric, thus the inputs are treated identically.

Coarse-grained co-attention

Coarse-grained models use a compact representation of one input to compute attention on the other. In such models, the roles of the inputs as keys and queries are no longer fixed: a compact representation of either input may play the role of query in parts of the architecture, and vice versa.

A sequential coarse-grained model proposed by co-attention is alternating co-attention, illustrated in Figure 7 (left), whereby attention is computed three times to obtain embeddings of the two inputs. First, self-attention is computed on the first input. The resulting context vector is then used as a query to perform attention on the second input. The result is another context vector, which is in turn used as a query as attention is again applied to the first input, producing a final context vector. The architecture proposed by iterative-alternating can also be described using this model with a few adaptations. In particular, iterative-alternating omit the last step, and factor an additional query element into the first two attention steps. An almost identical sequential architecture is used by DBLP:conf/aaai/0001FLH18, who use the additional query element only in the first attention step.

A parallel coarse-grained model is illustrated in Figure 7 (right). In such a model, proposed by interactive-attention-network, an average is initially computed on each input, and then used as query in the application of attention to generate the embedding of the other input.
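A parameter-free sketch of this parallel scheme (the cited model additionally uses learnable projections): each input is averaged, and the average serves as the query for dot-product soft attention over the other input.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def coarse_grained_parallel(A, B):
    """Average one input and use it as query to attend over the other,
    symmetrically, yielding one context vector per input."""
    c_A = _softmax(A @ B.mean(axis=0)) @ A   # embedding of A, guided by B
    c_B = _softmax(B @ A.mean(axis=0)) @ B   # embedding of B, guided by A
    return c_A, c_B

rng = np.random.default_rng(0)
c_A, c_B = coarse_grained_parallel(rng.normal(size=(5, 3)),
                                   rng.normal(size=(4, 3)))
```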

Figure 7: Coarse-grained co-attention by co-attention (left) and interactive-attention-network (right).
Name                               Reference

pooling                            dos-santos-attentive-pooling
perceptron                         co-attention

linear transformation              co-attention-emotion-cause

attention over attention           P17-1055-attention-over-attention

perceptron with nested attention   DBLP:conf/aaai/NieCWLP18

Table 4: Aggregation functions found in the literature. In most cases, the final key and query weight vectors are obtained by applying a distribution function, such as those seen in Section 3.3, to the aggregated scores; such functions are thus omitted from this table in the interest of brevity.
Fine-grained co-attention

In fine-grained co-attention models, the relevance (energy scores) associated with each key/query element pair is represented by the elements of a co-attention matrix computed by a co-compatibility function.

Co-compatibility functions can be straightforward adaptations of any of the compatibility functions listed in Table 3. Alternatively, new functions can be defined. For example, multi-grained define co-compatibility as a linear transformation of the concatenation between the elements and their product (Eq. 10). In the decomposable attention of decomposable, the inputs are fed into neural networks, whose outputs are then multiplied (Eq. 11). Delaying the product until after the processing by the neural networks reduces the number of inputs of such networks, yielding a reduction in the computational footprint. Another possibility, proposed by hermitian, is to exploit the Hermitian inner product. The elements of the two inputs are thus projected into a complex domain, then the Hermitian product between elements is computed, and finally only the real part of the result is kept. As the Hermitian product is noncommutative, the result will depend on the roles played by the inputs as keys and queries.
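A sketch of a decomposable-style co-compatibility function, with small random (untrained) feed-forward transforms standing in for the learned networks: each input is transformed separately, and the product is taken afterwards, so the networks never process the quadratic number of element pairs.

```python
import numpy as np

def decomposable_scores(K, Q, Wk, Wq):
    """Transform keys and query elements independently (one ReLU layer
    each), then multiply, producing an (n_k, n_q) co-attention matrix."""
    fK = np.maximum(K @ Wk, 0.0)         # ReLU feed-forward on keys
    fQ = np.maximum(Q @ Wq, 0.0)         # ReLU feed-forward on query elements
    return fK @ fQ.T                     # co-attention matrix

rng = np.random.default_rng(0)
E = decomposable_scores(rng.normal(size=(4, 6)), rng.normal(size=(3, 6)),
                        rng.normal(size=(6, 5)), rng.normal(size=(6, 5)))
```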


Because the co-attention matrix represents energy scores associated with key/query pairs, computing the relevance of the keys with respect to specific query elements, or, similarly, the relevance of the query elements with respect to specific keys, requires extracting information from the matrix using what we call an aggregation function. The output of such a function is a pair of weight vectors, one for the keys and one for the query elements.

The commonest aggregation functions are listed in Table 4. A simple idea is adopted by the attention pooling parallel model of dos-santos-attentive-pooling, and it amounts to considering the highest score in each row or column of the co-attention matrix. By attention pooling, a key will be attributed a high attention weight if and only if it has a high co-attention score with respect to at least one query element. Key attention scores are obtained through row-wise max-pooling, whereas query attention scores are obtained through column-wise max-pooling, as Figure 8 (left) illustrates.
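Attention pooling reduces to two max operations over the co-attention matrix (keys on rows, query elements on columns); a distribution function from Section 3.3 is then applied to the resulting scores.

```python
import numpy as np

def attention_pooling(E):
    """Row-wise max scores the keys (best match over query elements);
    column-wise max scores the query elements (best match over keys)."""
    return E.max(axis=1), E.max(axis=0)

key_scores, query_scores = attention_pooling(np.array([[1., 5.],
                                                       [3., 2.]]))
# key_scores = [5, 3], query_scores = [3, 5]
```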

Another possibility is offered by co-attention, who use a multi-layer perceptron to learn the mappings from the co-attention matrix to the two weight vectors. In the architecture of co-attention-emotion-cause, the computation is even simpler, since the final energy scores are a linear transformation of the co-attention matrix.

P17-1055-attention-over-attention instead apply the nested model depicted in Figure 8 (right). First, two matrices are computed by separately applying a row-wise and a column-wise softmax on the co-attention matrix. The idea is that each row of the first matrix represents the attention distribution over the document according to a specific query element, and could already be used as such. Then a row-wise average over the second matrix is computed so as to produce an attention distribution over the query elements. Finally, a weighted sum of the first matrix according to the relevance of the query elements is computed through a dot product, obtaining the document’s attention distribution over the keys. An alternative nested attention model is proposed by DBLP:conf/aaai/NieCWLP18, whereby the two matrices are fed to a multi-layer perceptron, as done by co-attention.
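The nested aggregation can be sketched as follows, assuming a co-attention matrix E with keys on rows and query elements on columns (the orientation is a convention of this sketch): a column-wise softmax gives one distribution over the keys per query element, a row-wise softmax averaged over rows weights the query elements, and their product is the final distribution.

```python
import numpy as np

def attention_over_attention(E):
    """Nested aggregation: per-query-element distributions over the keys,
    reweighted by an averaged distribution over the query elements."""
    A = np.exp(E - E.max(axis=0, keepdims=True))
    A /= A.sum(axis=0, keepdims=True)     # column-wise softmax: columns sum to 1
    B = np.exp(E - E.max(axis=1, keepdims=True))
    B /= B.sum(axis=1, keepdims=True)     # row-wise softmax: rows sum to 1
    beta = B.mean(axis=0)                 # weights over query elements
    return A @ beta                       # final distribution over the keys

rng = np.random.default_rng(0)
s = attention_over_attention(rng.normal(size=(6, 3)))
# s is a valid probability distribution over the six keys
```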

Further improvements can be obtained by combining the results of multiple co-attention models. multi-grained, for instance, compute coarse-grained and fine-grained attention in parallel, and combine the results into a single embedding.

Figure 8: Fine-grained co-attention models presented by dos-santos-attentive-pooling (left) and by P17-1055-attention-over-attention (right). Dashed lines show how max pooling/distribution functions are applied (column-wise or row-wise).

4 Combining Attention and Knowledge

According to lecun2015deep, a major open challenge in AI is combining connectionist (or sub-symbolic) models, such as deep networks, with approaches based on symbolic knowledge representation, in order to perform complex reasoning tasks. Throughout the last decade, filling the gap between these two families of AI methodologies has represented a major research avenue. Popular approaches include statistical relational learning [getoor2007introduction], neural-symbolic learning [garcez2012neural], and the application of various deep learning architectures [lippi2017reasoning], such as memory networks [NIPS2015_5846], neural Turing machines [DBLP:journals/corr/GravesWD14], and several others.

From this perspective, attention can be seen both as an attempt to improve the interpretability of neural networks, and as an opportunity to plug external knowledge into them. As a matter of fact, since the weights assigned by attention represent the relevance of the input with respect to the given task, in some contexts it could be possible to exploit this information to isolate the most significant features that allow the deep network to make its predictions. On the other hand, any background knowledge regarding the data, the domain, or the specific task, whenever available, could be exploited to generate information about the desired attention distribution, which could be encoded within the neural architecture.

In this section, we overview different techniques that can be used to inject this kind of knowledge in a neural network. We leave to Section 5 further discussions on the open challenges regarding the combination of knowledge and attention.

4.1 Supervised Attention

In most of the works we surveyed, the attention model is trained with the rest of the neural architecture to perform a specific task. Although trained within a supervised procedure, the attention model per se is trained in an unsupervised fashion (meaning that there is no target distribution for the attention model) to select useful information for the rest of the architecture. Nevertheless, in some cases knowledge about the desired weight distribution could be available. Whether it is present in the data as a label, or it is obtained as additional information through external tools, it can be exploited to perform a supervised training of the attention model.

4.1.1 Preliminary training

One possibility is to use an external classifier. The weights learned by such a classifier are subsequently plugged into the attention model of a different architecture. We call this procedure preliminary training. For example, rationales-supervised first train an attention model to represent the probability that a sentence contains relevant information. The relevance of a sentence is given by rationales [N07-1033], which are snippets of text that support the corresponding document categorizations.

4.1.2 Auxiliary training

Another possibility is to train the attention model without preliminary training, but by treating attention learning as an auxiliary task that is performed jointly with the main task. This procedure has led to good results in many scenarios, including machine translation [supervised-translation-liu, supervised-translation-mi], visual question answering [AAAI1816485], and domain classification for natural language understanding [domain-enablement].

In some cases, this mechanism can also be exploited to have the attention model represent specific features. For example, since linguistic information is useful for semantic role labelling, attention can be trained in a multi-task setting to represent the syntactic structure of a sentence. Indeed, in LISA [linguistically-informed-attention], a multi-layer multi-headed architecture for semantic role labelling, one of the attention heads is trained to perform dependency parsing as an auxiliary task.

4.1.3 Transfer learning

Furthermore, it is possible to perform transfer learning across different domains [attention-rationales] or tasks [dual-attention]. By preliminarily training an attentive architecture on a source domain to perform a source task, a mapping between the inputs and the distribution of weights is learned. Then, when another attentive architecture is trained on the target domain to perform the target task, the pre-trained model can be exploited: the desired distribution can be obtained through the first architecture. Attention learning can therefore be treated as an auxiliary task as in the previously mentioned cases, the difference being that the distribution produced by the pre-trained model is used as ground truth, instead of data labels.

4.2 Attention tracking

When attention is applied multiple times to the same data, as in sequence-to-sequence models, a useful piece of information is how much relevance has been given to each input across the model’s iterations. Indeed, one may need to keep track of the weights that the attention model assigns to each input. For example, in machine translation it is desirable to ensure that all the words of the original phrase are taken into account. One possibility for maintaining this information is to use a suitable structure and provide it as an additional input to the attention model. coverage exploit a piece of symbolic information called coverage to keep track of the weights associated with the inputs. Every time attention is computed, such information is fed to the attention model as a query element, and it is updated according to the output of the attention itself. In the work of coverage-embeddings, this representation is enhanced by making use of a sub-symbolic representation of the coverage.
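A simplified, hypothetical sketch of the coverage idea (our own formulation, not the cited models’ exact equations): accumulated attention weights are subtracted from the energies, so that keys already attended to are discouraged at later decoding steps.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_with_coverage(K, q, coverage, penalty=1.0):
    """Dot-product attention whose energies are penalised by the
    coverage vector; returns the weights and the updated coverage."""
    e = K @ q - penalty * coverage       # discourage re-attending covered keys
    a = _softmax(e)
    return a, coverage + a               # coverage accumulates the weights

rng = np.random.default_rng(0)
K = rng.normal(size=(4, 3))
coverage = np.zeros(4)
for _ in range(3):                       # three decoding steps
    q = rng.normal(size=3)
    a, coverage = attend_with_coverage(K, q, coverage)
```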

4.3 Modelling the distribution function according to background knowledge

Another component of the attention model where background knowledge can be exploited is the distribution function. For example, constraints can be applied to the computation of the weights, enforcing boundaries on the amount of attention assigned to the inputs. In the works of constrained-softmax and sparse-constrained, the coverage information is exploited by a constrained distribution function, regulating the amount of attention that the same word can receive over time.

Background knowledge could also be exploited to define or to infer a distance between the elements in the domain. Such a domain-specific distance could then be considered in any position-based distribution function, instead of the positional distance. One example is a distance derived from syntactic information: effective-attention-sentiment and DBLP:conf/aaai/ChenWUSZ18 use distribution functions that take into account the distance between two words along the dependency graph of a sentence.

5 Challenges and Future Directions

In this section, we discuss open challenges and possible applications of attention in the analysis of neural networks, and as a support of the training process.

5.1 Attention for deep networks investigation

In the context of a multi-layer neural architecture it is fair to assume that the deepest levels will compute the most abstract features [le2013building, lecun2015deep]. Therefore, the application of attention to deep networks could enable the selection of higher-level features, thus providing hints to understand which complex features are relevant for a given task.

Following this line of inquiry in the computer vision domain, self-a-gan showed that the application of attention to middle-to-high-level feature sets leads to better performance in image generation. The visualization of the self-attention weights revealed that higher weights are attributed not to proximate image regions, but rather to those regions whose color or texture is most similar to that of the query image point. Moreover, the spatial distribution does not follow a specific pattern; rather, it changes so as to model a region corresponding to the object depicted in the picture. Identifying abstract features in an input text may be less immediate than in an image, where the analysis is greatly aided by visual intuition. Yet, it may be interesting to test the effects of applying attention at different levels, and to assess whether its weights correspond to specific high-level features. For example, NIPS2017_7181 analyze the possible relation between attention weights and syntactic predictions.

modeling-localness seem to confirm that the deeper levels of neural architectures capture non-local aspects of the textual input. They studied the application of locality at different depths of an attentive deep architecture, and showed that its introduction is especially beneficial when applied to the layers closer to the input. Moreover, when the application of locality is based on a variable-size window, higher layers tend to have a broader window.

A popular way of investigating whether an architecture has learned high-level features amounts to using the same architecture to perform other tasks, as it happens with transfer learning. This setting has been adopted outside the context of attention, for example by D16-1159, who perform syntactic predictions by using the hidden representations learned with machine translation. In a similar way, attention weights could be used as input features in a different model, so as to assess whether they can select relevant information for a different learning task.

5.2 Attention for outlier detection and sample weighing

Another possible use of attention may be for outlier detection. In tasks such as classification, or the creation of a representative embedding of a specific class, attention could be applied over all the samples belonging to that task. In doing so, the samples associated with small weights could be regarded as outliers with respect to their class. The same principle could potentially be applied to each data point in a training set, independently of its class. The computation of a weight for each sample could be interpreted as assessing the relevance of that specific data point for a specific task. In principle, assigning such samples a low weight and excluding them from the learning could improve a model’s robustness to noisy input. Moreover, a dynamic computation of these weights during training would result in a dynamic selection of different training data in different training phases. Adaptive data selection strategies have proven to be useful for efficiently obtaining more effective models.


5.3 Attention analysis for model evaluation

The impact of attention is greatest when all the irrelevant elements are excluded from the input sequence, and the importance of the relevant elements is properly balanced. A seemingly uniform distribution of the attention weights could be interpreted as a sign that the attention model has been unable to identify the more useful elements. That in turn may be due to the data not containing useful information for the task at hand, or it may be ascribed to the poor ability of the model to discriminate information. Either way, the attention model would be unable to find relevant information in the specific input sequence, which may lead to errors. The analysis of the distribution of the attention weights may therefore be a tool for measuring an architecture’s confidence in performing a task on a given input. We speculate that the entropy of the distribution, or the presence of weights above a certain threshold, may be correlated with the probability of success of the neural model. These may therefore be used as indicators to assess the uncertainty of the architecture, as well as to improve its interpretability. Clearly, this information would be useful to the user, who could thus better understand the model and the data, but it may also be exploited by more complex systems.
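One concrete indicator along these lines (our own sketch, not a published metric) is the normalised Shannon entropy of the attention distribution: values near 1 mean near-uniform weights, suggesting the model found nothing to focus on, while values near 0 indicate sharply peaked attention.

```python
import numpy as np

def attention_entropy(a, eps=1e-12):
    """Shannon entropy of an attention distribution, normalised by the
    maximum log(n), so the result lies in [0, 1]."""
    a = np.asarray(a, dtype=float)
    h = -(a * np.log(a + eps)).sum()
    return h / np.log(a.size)

u = attention_entropy([0.25, 0.25, 0.25, 0.25])   # near-uniform: close to 1
p = attention_entropy([0.97, 0.01, 0.01, 0.01])   # peaked: close to 0
```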

In the context of an architecture that relies on multiple strategies to perform its task, such as a hybrid model that relies on both symbolic and sub-symbolic information, the uncertainty of the neural model can be used as a parameter in the merging strategy. Other contexts in which this information may be relevant are multi-task learning and reinforcement learning. Examples of exploitation of the uncertainty of the model, although in contexts other than attention and NLP, can be found in works by poggi, Kendall, and Blundell.

5.4 Unsupervised learning with attention

Properly exploiting unsupervised learning is widely recognized as one of the most important long-term challenges of AI [lecun2015deep]. As already mentioned in Section 4, attention is typically trained within a supervised architecture, although without direct supervision on the attention weights. Nevertheless, a few works have recently attempted to exploit attention within purely unsupervised models. We believe this to be a promising research direction, as the learning process of humans is indeed largely unsupervised.

For example, in the work of he2017unsupervised, attention is exploited in a model for aspect extraction in sentiment analysis, with the aim of removing words that are irrelevant for the sentiment, and of ensuring more coherence among the predicted aspects. In the work of zhang2018unsupervised, attention is used within autoencoders in a question-retrieval task. The main idea is to generate semantic representations of questions, and self-attention is exploited during the encoding and decoding phases, with the objective of reconstructing the input sequences, as in traditional autoencoders. Following a similar idea, zhang2017battrae exploit bidimensional attention-based recursive autoencoders for bilingual phrase embeddings.

6 Conclusion

Attention models have nowadays become widespread in NLP applications. By integrating attention in neural architectures, two positive effects are jointly obtained: a performance gain, and a means of investigating the network’s behaviour.

We have shown how attention can be applied to different input parts, different representations of the same data, or different features. The attention mechanism makes it possible to obtain a compact representation of the data as well as to highlight relevant information. The selection is performed through a distribution function, which may take into account locality in different dimensions, such as space, time, or even semantics. Attention can also be modeled so as to compare the input data with a given element (a query) based on similarity or significance. But it can also learn the concept of a relevant element by itself, thus creating a representation to which important data should be similar.

We have also discussed the possible role of attention in addressing fundamental AI challenges. In particular, we have shown how attention can be a means of injecting knowledge into the neural model, so as to represent specific features, or to exploit knowledge acquired previously, as in transfer learning settings. We speculate that this could pave the way to new challenging research avenues, where attention could be exploited to enforce the combination of sub-symbolic models with symbolic knowledge representations, especially to perform reasoning tasks, or to address natural language understanding. In a similar vein, attention could be a key ingredient of unsupervised learning architectures, as recent works suggest, by guiding and focusing the training process where no supervision is given in advance.