Attention is a behavioral and cognitive process of focusing selectively on a discrete aspect of information, whether subjective or objective, while ignoring other perceptible information [colombini2014attentional], playing an essential role in human cognition and the survival of living beings in general. In animals lower on the evolutionary scale, it provides perceptual resource allocation, allowing these beings to respond correctly to the environment’s stimuli to escape predators and capture prey efficiently. In human beings, attention acts on practically all mental processes, from reactive responses to unexpected stimuli in the environment - guaranteeing our survival in the presence of danger - to complex mental processes, such as planning, reasoning, and emotions. Attention is necessary because, at any moment, the environment presents much more perceptual information than can be effectively processed, the memory contains more competing traces than can be remembered, and the choices, tasks, or motor responses available are much greater than can be dealt with [chun2011taxonomy].
At early sensory processing stages, data is separated between sight, hearing, touch, smell, and taste. At this level, attention selects and modulates processing within each of the five modalities and directly impacts processing in the relevant cortical regions. For example, attention to visual stimuli increases discrimination and activates the relevant topographic areas in the retinotopic visual cortex [tootell1998retinotopy], allowing observers to detect contrasting stimuli or make more precise discriminations. In hearing, attention allows listeners to detect weaker sounds or extremely subtle differences in tone that are essential for recognizing emotions and feelings [woldorff1993modulation]. Similar effects of attention operate on the somatosensory cortex [johansen2000physiology], olfactory cortex [zelano2005attentional], and gustatory cortex [veldhuizen2007trying]. In addition to sensory perception, our cognitive control is intrinsically attentional. Our brain has severe cognitive limitations - the number of items that can be kept in working memory, the number of choices that can be selected, and the number of responses that can be generated at any time are limited. Hence, evolution has favored selective attention mechanisms, as the brain has to prioritize.
Long before contemporary psychologists entered the discussion on attention, William James [James1890JAMPOP] offered us a precise definition that has been, at least partially, corroborated more than a century later by neurophysiological studies. According to James, “Attention implies withdrawal from some things in order to deal effectively with others… Millions of items of the outward order are present to my senses which never properly enter into my experience. Why? Because they have no interest for me. My experience is what I agree to attend to. Only those items which I notice shape my mind — without selective interest, experience is an utter chaos.” Indeed, the first scientific studies of attention were reported by Hermann von Helmholtz (1821-1894) and William James (1842-1910) in the nineteenth century. They both conducted experiments to understand the role of attention.
For the past decades, the concept of attention has permeated most aspects of research in perception and cognition, being considered as a property of multiple and different perceptual and cognitive operations [colombini2014attentional]. Thus, to the extent that these mechanisms are specialized and decentralized, attention reflects this organization. These mechanisms are in wide communication, and the executive control processes help set priorities for the system. Selection mechanisms operate throughout the brain and are involved in almost every stage, from sensory processing to decision making and awareness. Attention has become a broad term to define how the brain controls its information processing, and its effects can be measured through conscious introspection, electrophysiology, and brain imaging. Attention has been studied from different perspectives for a long time.
1.1 Pre-Deep Learning Models of Attention
Computational attention systems based on psychophysical models, supported by neurobiological evidence, have existed for at least three decades [FrintropSurvey]. Treisman’s Feature Integration Theory (FIT) [treisman1980feature], Wolfe’s Guided Search [wolfe1989guided], the Triadic architecture [rensink2000dynamic], Broadbent’s Model [broadbent2013perception], Norman’s Attentional Model [norman1968toward] [kahneman1973attention], the Closed-loop Attention Model [van2004clam], and the SeLective Attention Model [phaf1990slam], among several other models, introduced the theoretical basis of computational attention systems.
Initially, attention was mainly studied with visual experiments where a subject looks at a scene that changes in time [frintrop2010computational]. In these models, the attentional system was restricted only to the selective attention component in visual search tasks, focusing on the extraction of multiple features through a sensor. Therefore, most of the attentional computational models were developed in computer vision to select important image regions. Koch and Ullman [koch1987shifts] introduced the area’s first visual attention architecture based on FIT [treisman1980feature]. The idea behind it is that several features are computed in parallel, and their conspicuities are collected on a saliency map. Winner-Take-All (WTA) determines the most prominent region on the map, which is finally routed to the central representation. From then on, only the region of interest proceeds to more specific processing. The Neuromorphic Vision Toolkit (NVT) [itti1998model], derived from the Koch-Ullman model, was the basis for developing research in computational visual attention for several years. Navalpakkam and Itti introduced a derivative of NVT that can deal with top-down cues [navalpakkam2006integrated]. The idea is to learn the target’s feature values from a training image in which a binary mask indicates the target. The attention system of Hamker [hamker2005emergence] [hamker2006modeling] calculates various features and contrast maps and turns them into perceptual maps. With target information influencing processing, it combines detection units to determine whether a region on the perceptual map is a candidate for eye movement. VOCUS [frintrop2006vocus] introduced a way to combine bottom-up and top-down attention, overcoming the limitations of the time. Several other models have emerged in the literature, each with peculiarities according to the task. Many computational attention systems focus on the computation of mainly three features: intensity, orientation, and color. These models employed neural networks or filter models that use classical linear filters to compute features.
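The saliency-map and Winner-Take-All pipeline described above can be illustrated with a minimal sketch. It is not the Koch-Ullman or NVT implementation: the box-filter center-surround contrast, the window sizes, and the helper names (`box_blur`, `conspicuity`, `saliency_map`, `winner_take_all`) are assumptions chosen only to keep the example short.

```python
import numpy as np

def box_blur(img, k):
    """Mean filter with a (2k+1) x (2k+1) window and edge padding."""
    padded = np.pad(img, k, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (2 * k + 1) ** 2

def conspicuity(feature, center=1, surround=4):
    """Center-surround contrast of one feature map, scaled to [0, 1]."""
    cs = np.abs(box_blur(feature, center) - box_blur(feature, surround))
    peak = cs.max()
    return cs / peak if peak > 0 else cs

def saliency_map(features):
    """Average the conspicuity maps of all feature channels (assumed fusion rule)."""
    return sum(conspicuity(f) for f in features) / len(features)

def winner_take_all(saliency):
    """Return the (row, col) of the most salient location."""
    row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
    return int(row), int(col)
```

In a full system the feature list would hold intensity, orientation, and color maps at several scales; here any list of equally sized 2-D arrays can be passed to `saliency_map`.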
Computational attention systems were used successfully before Deep Learning (DL) in object recognition [salah2002selective], image compression [ouerhani2004visual], image matching [walther2006interactions], image segmentation [ouerhani2004visual], object tracking [walther2004detection], active vision [clark1988modal], human-robot interaction [breazeal1999context], object manipulation in robotics [rotenstein2007towards], robotic navigation [clark1992attentive], and SLAM [frintrop2008attentional]. In mid-1997, Scheier and Egner [scheier1997visual] presented a mobile robot that uses attention for navigation. Still in the 90s, Baluja and Pomerleau [baluja1997expectation] used an attention system to navigate an autonomous car, which followed relevant regions of a projection map. Walther [walther2006interactions] combined an attentional system with an object recognizer based on SIFT features and demonstrated that the attentional front-end enhanced the recognition results. Salah et al. [salah2002selective] [ouerhani2004visual] proposed focused image compression, which determines the number of bits to be allocated for encoding regions of an image according to their salience. High-saliency regions are reconstructed with higher quality than the rest of the image.
1.2 Deep Learning Models of Attention: the beginning
By 2014, the DL community noticed attention as a fundamental concept for advancing deep neural networks. Currently, the state of the art in the field uses neural attention models. As shown in Figure 1, the number of published works grows significantly each year in the leading repositories. In neural networks, attention mechanisms dynamically manage the flow of information, the features, and the resources available, improving learning. These mechanisms filter out stimuli irrelevant to the task and help the network deal with long-term dependencies in a simple way. Many neural attentional models are simple, scalable, and flexible, and show promising results in several application domains [draw] [vaswani_attention_2017] [weston_2014_memory]. Given the current extent of research, interesting questions related to neural attention models arise in the literature: how these mechanisms help improve neural networks’ performance, which classes of problems benefit from this approach, and how these benefits arise.
To the best of our knowledge, most surveys available in the literature do not address all of these questions or are specific to some domain. Wang et al. [wang2016survey] review recurrent networks and applications in computer vision, while Hu [hu2019introductory] and Galassi et al. [galassi2020attention] offer surveys on attention in natural language processing (NLP). Lee et al. [lee_attention_2018] present a review on attention in graph neural networks, and Chaudhari et al. [chaudhari2019attentive] present a more general, yet short, review.
To assess the breadth of attention applications in deep neural networks, we present a systematic review of the field in this survey. Throughout our review, we critically analyzed 650 papers while quantitatively analyzing 6,567.
As the main contributions of our work, we highlight:
A replicable research methodology. We provide, in the Appendix, the detailed process conducted to collect our data and we make available the scripts to collect the papers and create the graphs we use;
An in-depth overview of the field. We critically analyzed 650 papers and extracted different metrics from 6,567, employing various visualization techniques to highlight overall trends in the area;
We describe the main attentional mechanisms;
We present the main neural architectures that employ attention mechanisms, describing how they have contributed to the NN field;
We introduce how attentional modules or interfaces have been used in classic DL architectures extending the Neural Network Zoo diagrams;
Finally, we present a broad description of application domains, trends, and research opportunities.
This survey is structured as follows. In Section 2 we present the field overview, reporting the main events from 2014 to the present. Section 3 contains a description of the main attention mechanisms. In Section 4 we analyze how attentional modules are used in classic DL architectures. Section 5 explains the main classes of problems and applications of attention. Finally, in Section 6 we discuss limitations, open challenges, current trends, and future directions in the area, concluding our work in Section 7 with directions for further improvements.
Historically, research in computational attention systems has existed since the 1980s. Only in mid-2014 did Neural Attentional Networks (NANs) emerge in Natural Language Processing (NLP), where attention provided significant advances, bringing promising results through scalable and straightforward networks. Attention allowed us to move towards complex, previously challenging tasks such as conversational machine comprehension, sentiment analysis, machine translation, question-answering, and transfer learning. Subsequently, NANs appeared in other fields equally important for artificial intelligence, such as computer vision, reinforcement learning, and robotics. There are currently numerous attentional architectures, but few of them have a significantly higher impact, as shown in Figure 2. In this image, we depict the most relevant group of works organized according to citation levels and innovations, where RNNSearch [bahdanau_neural_2014], Transformer [vaswani_attention_2017], Memory Networks [weston_2014_memory], “show, attend and tell” [xu_show_2015], and RAM [mnih_recurrent_2014] stand out as key developments.
The bottleneck problem in the classic encoder-decoder framework worked as the initial motivation for attention research in Deep Learning. In this framework, the encoder encodes a source sentence into a fixed-length vector from which a decoder generates the translation. The main issue is that a neural network needs to compress all the necessary information from a source sentence into a fixed-length vector. Cho et al. [cho2014properties] showed that the performance of the classic encoder-decoder deteriorates rapidly as the size of the input sentence increases. To minimize this bottleneck, Bahdanau et al. [bahdanau_neural_2014] proposed RNNSearch, an extension to the encoder-decoder model that learns to align and translate jointly. RNNSearch generates a translated word at each time step, searching for a set of positions in the source sentence with the most relevant words. The model predicts a target word based on the context vectors associated with those source positions and all previously generated target words. The main advantage is that RNNSearch does not encode an entire input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors, choosing a subset of these vectors adaptively while generating the translation. The attention mechanism allows extra information to be propagated through the network, eliminating the fixed-size context vector’s information bottleneck. This approach demonstrated for the first time that an attentive model outperforms classic encoder-decoder frameworks on long sentences.
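As an illustration of how a decoder can weight the sequence of encoder vectors instead of relying on one fixed-length vector, the following is a minimal NumPy sketch of additive soft attention in the spirit of RNNSearch. The parameter names (`W_s`, `W_h`, `v`) and shapes are assumptions for the example, and the parameters are taken as given, whereas in the model they are learned jointly with the translation network.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(s_prev, H, W_s, W_h, v):
    """
    s_prev : (d_s,)     previous decoder hidden state
    H      : (T, d_h)   encoder vectors, one per source position
    W_s    : (d_s, d_a) projection of the decoder state (assumed shape)
    W_h    : (d_h, d_a) projection of the encoder vectors (assumed shape)
    v      : (d_a,)     scoring vector
    Returns the context vector (d_h,) and the attention weights (T,).
    """
    # One alignment score per source position.
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v
    # Soft selection: a probability distribution over source positions.
    alpha = softmax(scores)
    # Context vector: expected encoder vector under that distribution.
    context = alpha @ H
    return context, alpha
```

Because the context vector is recomputed at every decoding step, its content changes with the decoder state, which is what removes the single fixed-length bottleneck.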
RNNSearch was instrumental in introducing the first attention mechanism, soft attention (Section 3). The main characteristic of this mechanism is that it smoothly selects the network’s most relevant elements. Based on RNNSearch, there have been numerous attempts to augment neural networks with new properties. Two research directions stand out as particularly interesting - attentional interfaces and end-to-end attention. Attentional interfaces treat attention as a module or set of elective modules, easily plugged into classic Deep Learning neural networks, just like RNNSearch. So far, this is the most explored research direction in the area, mainly because of its simplicity, general applicability, and the good generalization results that attentional interfaces bring. End-to-end attention is a younger research direction, where the attention block covers the entire neural network. In these models, high- and low-level attentional layers act recursively or in cascade at all network abstraction levels to produce the desired output. End-to-end attention models introduce a new class of neural networks in Deep Learning. End-to-end attention research makes sense since no isolated attention center exists in the human brain, and its mechanisms are used in different cognitive processes.
2.1 Attentional interfaces
RNNSearch is the basis for research on attentional interfaces. The attentional module of this architecture is widely used in several other applications. In speech recognition [chan2015listen], it allows one RNN to process the audio while another examines it, focusing on the relevant parts as it generates a transcription. In text analysis [vinyals2015grammar], it allows a model to look at the words as it generates a parse tree. In conversational modeling [vinyals_neural_nodate], it allows the model to focus on the last parts of the conversation as it generates its response. There are also important extensions to deal with other information bottlenecks in addition to the classic encoder-decoder problem. BiDAF [seo_bidirectional_2016] proposes a multi-stage hierarchical process for question answering. It uses bidirectional attention flow to build a multi-stage hierarchical network with context paragraph representations at different granularity levels. The attention layer does not summarize the context paragraph in a fixed-length vector. Instead, attention is calculated for each time step, and the attended vector at each step, along with representations of previous layers, can flow to the subsequent modeling layer. This reduces the loss of information caused by early summarization. At each time step, attention is only a function of the query and the context paragraph at the current step and does not depend directly on the previous step’s attention. The hypothesis is that this simplification leads to a division of labor between the attention layer and the modeling layer, forcing the attention layer to focus on learning attention between the query and the context.
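The idea of computing attention at every context position and passing the attended vectors onward, rather than summarizing the paragraph into one vector, can be sketched as follows. This is a simplified illustration and not the BiDAF implementation: the similarity here is a plain dot product (an assumption; the original model uses a trainable similarity function), and the function name `bidaf_attention` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_attention(H, U):
    """
    H : (T, d) context representations, one per context word
    U : (J, d) query representations, one per query word
    Returns G : (T, 4d), a query-aware representation for each context word.
    """
    S = H @ U.T                                # (T, J) similarity (dot product assumed)
    # Context-to-query: which query words matter for each context word.
    c2q = softmax(S, axis=1) @ U               # (T, d)
    # Query-to-context: which context words are closest to any query word.
    b = softmax(S.max(axis=1), axis=0)         # (T,)
    q2c = np.tile(b @ H, (H.shape[0], 1))      # (T, d), same vector at every step
    # No summarization: every context position keeps its own attended vector.
    return np.concatenate([H, c2q, H * c2q, H * q2c], axis=1)
```

Note that the weights at position t depend only on the query and the context, never on the attention computed at position t-1, which mirrors the memory-less property described above.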
Yang et al. [yang2016hierarchical] proposed the Hierarchical Attention Network (HAN) to capture two essential insights about document structure. Documents have a hierarchical structure: words form sentences, sentences form a document. Humans, likewise, construct a document representation by first building representations of sentences and then aggregating them into a document representation. Different words and sentences in a document are differentially informative. Moreover, the importance of words and sentences is highly context-dependent, i.e., the same word or sentence may have different importance in different contexts. To include sensitivity to this fact, HAN consists of two levels of attention mechanisms - one at the word level and one at the sentence level - that let the model pay more or less attention to individual words and sentences when constructing the document’s representation. Xiong et al. [xiong2016dynamic] created a coattentive encoder that captures the interactions between the question and the document with a dynamic pointing decoder that alternates between estimating the start and end of the answer span. To learn approximate solutions to computationally intractable problems, Ptr-Net [vinyals2015pointer] modifies the RNNSearch’s attentional mechanism to represent variable-length dictionaries. It uses the attention mechanism as a pointer.
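A minimal sketch of attention used as a pointer: instead of blending encoder vectors into a context vector, the softmax over input positions is itself the output distribution, so the output "dictionary" has as many entries as the input has elements. The parameter names and shapes are assumptions for the example, and the parameters are taken as given rather than trained.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def pointer_distribution(d, E, W_e, W_d, v):
    """
    d   : (d_s,)     current decoder state
    E   : (n, d_h)   encoder states, one per input element (n may vary)
    W_e : (d_h, d_a) projection of the encoder states (assumed shape)
    W_d : (d_s, d_a) projection of the decoder state (assumed shape)
    v   : (d_a,)     scoring vector
    Returns a probability distribution over the n input positions.
    """
    scores = np.tanh(E @ W_e + d @ W_d) @ v
    return softmax(scores)

def point(d, E, W_e, W_d, v):
    """Greedy choice: index of the input element being pointed at."""
    return int(np.argmax(pointer_distribution(d, E, W_e, W_d, v)))
```

The same parameters work for inputs of any length, which is what allows the output vocabulary to grow and shrink with the input.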
See et al. [see_get_2017] used a hybrid between classic sequence-to-sequence attentional models and a Ptr-Net [vinyals2015pointer] for abstractive text summarization. The hybrid pointer-generator [see_get_2017] copies words from the source text via pointing, which aids accurate reproduction of information while retaining the ability to produce novel words through the generator. Finally, it uses a mechanism to keep track of what has been summarized, which discourages repetition. FusionNet [huang_fusionnet:_2018] presents a novel concept of "history-of-word" to characterize attention information from the lowest word-embedding level up to the highest semantic-level representation. This concept considers that data input is gradually transformed into a more abstract representation, forming each word’s history in human mental flow. FusionNet employs a fully-aware multi-level attention mechanism and an attention score function that takes advantage of the history-of-word. Rocktäschel et al. [rocktaschel_reasoning_2015] introduce two-way attention for recognizing textual entailment (RTE). The mechanism allows the model to attend over past output vectors, solving the LSTM’s cell state bottleneck. The LSTM with attention does not need to capture the premise’s whole semantics in the LSTM cell state. Instead, it is sufficient to generate output vectors while reading the premise and to accumulate a representation in the cell state that informs the second LSTM which of the premise’s output vectors to attend over to determine the RTE class. Luong et al. [luong_effective_2015] proposed global and local attention in machine translation. Global attention is similar to soft attention, while local attention is an improvement that makes hard attention differentiable - the model first predicts a single position aligned to the current target word, and a window centered around that position is used to calculate a context vector.
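The local attention idea can be sketched as follows: predict an aligned source position, restrict attention to a window around it, and favor positions near the center. This is an illustrative sketch only. The dot-product score, the Gaussian weighting with standard deviation D/2, and the parameter names (`W_p`, `v_p`, `D`) are assumptions of the example, and the parameters are taken as given rather than learned.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_attention(h_t, H_src, W_p, v_p, D=2):
    """
    h_t   : (d,)     current target hidden state
    H_src : (S, d)   source hidden states
    W_p   : (d_a, d) and v_p : (d_a,) predict the aligned position (assumed shapes)
    D     : half-width of the attention window (D >= 1)
    Returns context (d,), window positions, their weights, and the predicted position.
    """
    S = H_src.shape[0]
    # Predicted aligned position, a real number between 0 and S.
    p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))
    center = int(np.floor(p_t))
    lo, hi = max(0, center - D), min(S, center + D + 1)
    positions = np.arange(lo, hi)
    # Alignment restricted to the window (dot-product score assumed).
    align = softmax(H_src[lo:hi] @ h_t)
    # Gaussian centered at p_t keeps the whole computation differentiable in p_t.
    sigma = D / 2.0
    weights = align * np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
    context = weights @ H_src[lo:hi]
    return context, positions, weights, p_t
```

Only at most 2D+1 source states are touched per target word, in contrast with global attention, which scores every source position.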
Attentional interfaces have also emerged in architectures for computer vision tasks. Initially, they were based on human saccadic movements and robustness to change. The human visual attention mechanism can explore local differences in an image while highlighting the relevant parts. A person focuses attention on parts of the image while glimpsing to quickly scan the entire image and find the main areas during the recognition process. In this process, the internal relationship between the different regions guides the eyes’ movement to find the next area to focus on. Ignoring the irrelevant parts makes it easier to learn in the presence of clutter. Another advantage of glimpses and visual attention is robustness. Our eyes can see an object in a real-world scene while ignoring irrelevant parts. Convolutional neural networks (CNNs) are extremely different. CNNs are rigid, and the number of parameters grows linearly with the size of the image. Also, for the network to capture long-distance dependencies between pixels, the architecture needs to have many layers, compromising the model’s convergence. Besides, the network treats all pixels in the same way. This process does not resemble the human visual system, which contains visual attention mechanisms and a glimpse structure that provides unmatched performance in object recognition.
RAM [mnih_recurrent_2014] and STN [jaderberg_spatial_2015] are pioneering architectures with attentional interfaces based on human visual attention. RAM [mnih_recurrent_2014] can extract information from an image or video by adaptively selecting a sequence of regions, or glimpses, processing only the selected areas at high resolution. The model is a Recurrent Neural Network that processes different parts of the images (or video frames) at each time step t, building a dynamic internal representation of the scene via Reinforcement Learning training. The main advantages of the model are the reduced number of parameters and the architecture’s independence from the input image size, which is not the case in convolutional neural networks. This approach is generic. It can use static images, videos, or a perceptual module of an agent that interacts with the environment. STN [jaderberg_spatial_2015] is a module robust to spatial transformation changes. In STN, if the input is transformed, the model must generate the correct classification label, even if the input is distorted in unusual ways. STN works as an attentional module attachable - with few modifications - to any neural network to actively spatially transform feature maps. STN learns the transformation during the training process. Unlike pooling layers, where receptive fields are fixed and local, a Spatial Transformer is a dynamic mechanism that can spatially transform an image, or feature map, producing the appropriate transformation for each input sample. The transformation is performed across the whole map and may include scaling, cropping, rotations, and non-rigid deformations. This approach allows the network to select the most relevant image regions (attention) and transform them into a desired canonical position, simplifying recognition in the following layers.
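The sampling step of a spatial transformer can be illustrated with a small NumPy sketch: an affine matrix maps each output pixel to a source location, and the source image is sampled there with bilinear interpolation. In STN the matrix is predicted per input by a small localization network; here `theta` is simply passed in, and the function names, the normalized [-1, 1] coordinate convention, and the zero padding outside the image are assumptions of the example.

```python
import numpy as np

def affine_grid(theta, H, W):
    """
    theta : (2, 3) affine matrix mapping output coords to source coords,
            both normalized to [-1, 1].
    Returns an (H, W, 2) array of source (x, y) coordinates.
    """
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    target = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # (H, W, 3)
    return target @ theta.T

def bilinear_sample(img, grid):
    """Sample a 2-D image at the normalized grid locations (zeros outside)."""
    H, W = img.shape
    x = (grid[..., 0] + 1) * (W - 1) / 2
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0

    def at(yy, xx):
        valid = (yy >= 0) & (yy < H) & (xx >= 0) & (xx < W)
        vals = img[np.clip(yy, 0, H - 1), np.clip(xx, 0, W - 1)]
        return np.where(valid, vals, 0.0)

    return ((1 - wy) * (1 - wx) * at(y0, x0) + (1 - wy) * wx * at(y0, x1)
            + wy * (1 - wx) * at(y1, x0) + wy * wx * at(y1, x1))

def spatial_transform(img, theta):
    """Warp img with the affine transform theta, keeping the output size."""
    return bilinear_sample(img, affine_grid(theta, *img.shape))
```

With `theta = [[0.5, 0, 0], [0, 0.5, 0]]` the output shows the central half of the image stretched to full size, i.e. the module "attends" to the center; the identity matrix leaves the image unchanged.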
Following the RAM approach, the Deep Recurrent Attentive Writer (DRAW) [draw] represents a shift to a more natural way of constructing an image, in which parts of a scene are created independently of the others. This is how human beings draw: they recreate a visual scene sequentially, refining all parts of the drawing over several iterations and reevaluating their work after each modification. Although natural to humans, most approaches to automatic image generation aim to generate complete scenes at once. This means that all pixels are conditioned on a single latent distribution, making it challenging to scale these approaches to large images. DRAW belongs to the family of variational autoencoders. It has an encoder that compresses the images presented during training and a decoder that reconstructs the images. Unlike other generative models, DRAW iteratively constructs the scenes by accumulating modifications emitted by the decoder, each observed by the encoder. DRAW uses RAM-style attention mechanisms to selectively attend to parts of the scene while ignoring others. The main challenge of this mechanism is learning where to look, which is usually addressed by reinforcement learning techniques. However, in DRAW, the attention mechanism is differentiable, making it possible to use backpropagation.
The first uses of attentional interfaces in DL were limited to the NLP and computer vision domains to solve isolated tasks. Currently, attentional interfaces are studied in multimodal learning. Sensory multimodality in neural networks is a historical problem widely discussed by the scientific community [ramachandram2017deep] [gao2020survey]. Multimodal data improves the robustness of perception through complementarity and redundancy. The human brain continually deals with multimodal data and integrates it into a coherent representation of the world. However, employing different sensors presents a series of computational challenges, such as incomplete or spurious data, different properties (e.g., dimensionality or range of values), and the need for data alignment and association. The integration of multiple sensors depends on a reasoning structure over the data to build a common representation, which does not exist in classical neural networks. Attentional interfaces adapted for multimodal perception are an efficient alternative for reasoning about misaligned data from different sensory sources.
The first widespread use of attention for multimodality occurred with the attentional interface between a convolutional neural network and an LSTM in image captioning [xu_show_2015]. In this model, a CNN processes the image, extracting high-level features, whereas the LSTM consumes the features to produce descriptive words, one by one. The attention mechanism guides the LSTM to the relevant image information for each word’s generation, analogous to the human visual attention mechanism. The visualization of attention weights in multimodal tasks improved the understanding of how the architecture works. This approach gave rise to countless other works with attentional interfaces that deal with video-text data [yao_describing_2015] [wu_hierarchical_2018] [fakoor2016memory], image-text data [tian2018diagnostic] [pu_adaptive_2018], monocular/RGB-D images [liu2017global] [zhang2018attention] [zhang2018adding], RADAR [zhang2018attention], remote sensing data [xiangrong_zhang;xin_wang;xu_tang;huiyu_zhou;chen_li_description_2019] [bei_fang;ying_li;haokui_zhang;jonathan_cheung-wai_chan_hyperspectral_2019] [qi_wang;shaoteng_liu;jocelyn_chanussot;xuelong_li_scene_2019] [xiaoguang_mei;erting_pan;yong_ma;xiaobing_dai;jun_huang;fan_fan;qinglei_du;hong_zheng;jiayi_ma_spectral-spatial_2019], audio-video [hori_attention-based_2017] [zhang2019deep], and diverse sensors [zadeh2018memory] [zadeh2018multi] [santoro2018relational], as shown in Figure 3.
Zhang et al. [zheng_zhang;lizi_liao;minlie_huang;xiaoyan_zhu;tat-seng_chua_neural_2019] used an adaptive attention mechanism to learn to emphasize different visual and textual sources in dialogue systems for fashion retail. An adaptive attention scheme automatically decides the evidence source for tracking dialogue states based on visual and textual context. Dual Attention Networks [nam_dual_2017] presented attention mechanisms to capture the fine-grained interplay between images and textual information. The mechanism allows visual and textual attention to guide each other during collaborative inference. HATT [wu_hierarchical_2018] presented a new attention-based hierarchical fusion to progressively explore the complementarity of multimodal features, fusing temporal, motion, audio, and semantic label features for video representation. The model consists of three attention layers. First, the low-level attention layer deals with temporal, motion, and audio features inside each modality and across modalities. Second, high-level attention selectively focuses on semantic label features. Finally, the sequential attention layer incorporates hidden information generated by the encoded low-level and high-level attention. Hori et al. [hori_attention-based_2017] extended simple multimodal fusion with attention. Unlike the simple multimodal fusion method, the feature-level attention weights can change according to the decoder state and the context vectors, enabling the decoder network to pay attention to a different set of features or modalities when predicting each subsequent word in the description. Memory Fusion Network [zadeh2018memory] presented the Delta-memory Attention module for multi-view sequential learning. First, a system of LSTMs, one for each modality, encodes the modality-specific dynamics and interactions. Delta-memory attention discovers both cross-modality and temporal interactions in different memory dimensions of the LSTMs. Finally, the Multi-view Gated Memory (unifying memory) stores the cross-modality interactions over time.
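The idea of modality-level attention weights that change with the decoder state can be sketched as below. This is a simplified illustration of attention-based multimodal fusion, not the exact formulation of any of the cited models: biases are omitted, and all parameter names (`W_s`, `V`, `w`, `P`) and shapes are assumptions of the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def multimodal_attention_fusion(s_prev, contexts, W_s, V, w, P):
    """
    s_prev   : (d_s,)        decoder state before predicting the next word
    contexts : list of K context vectors, one per modality, sizes d_k may differ
    W_s      : (d_a, d_s)    projection of the decoder state
    V[k]     : (d_a, d_k)    projection of modality k into the scoring space
    w        : (d_a,)        scoring vector
    P[k]     : (d_out, d_k)  projection of modality k into a common output space
    Returns the fused vector (d_out,) and the modality weights (K,).
    """
    # One score per modality, conditioned on the current decoder state.
    scores = np.array([w @ np.tanh(W_s @ s_prev + V[k] @ c)
                       for k, c in enumerate(contexts)])
    beta = softmax(scores)
    # Weighted sum of the modalities after projecting them to a common size.
    fused = sum(b * (P[k] @ c) for k, (b, c) in enumerate(zip(beta, contexts)))
    return fused, beta
```

Calling the function with a different `s_prev` yields different weights `beta`, so the decoder can rely mostly on, say, audio for one word and on motion features for the next.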
Huang et al. [noauthor_bi-directional_nodate] investigated the problem of image-text matching by exploiting bi-directional attention with fine-grained correlations between visual regions and textual words. Bi-directional attention connects words to regions and objects to words for learning image-text matching. Li et al. [yehao_li;ting_yao;yingwei_pan;hongyang_chao;tao_mei_pointing_2019] introduced Long Short-Term Memory with Pointing (LSTM-P), inspired by human pointing behavior [matthews2012origins] and Pointer Networks [vinyals2015pointer]. The pointing mechanism encapsulates dynamic contextual information (current input word and LSTM cell output) to deal with novel objects in the image captioning scenario. Liu et al. [noauthor_improving_nodate] proposed a cross-modal attention-guided erasing approach for referring expressions. Previous attention models focus on only the most dominant features of both modalities and neglect textual-visual correspondences between images and referring expressions. To tackle this issue, cross-modal attention discards the most dominant information from either the textual or the visual domain to generate difficult training samples and drive the model to discover complementary textual-visual correspondences. Abolghasemi et al. [pay_attention] demonstrated an approach for augmenting a deep visuomotor policy trained through demonstrations with Task Focused Visual Attention (TFA). The attention module receives as input a manipulation task specified in natural language text and an image of the environment, and returns as output the area containing the object that the robot needs to manipulate. TFA makes the policy significantly more robust than the baseline policy without visual attention. Pu et al. [pu_adaptive_2018] adaptively select features from multiple CNN layers for video captioning. Previous models often use the output from a specific layer of a CNN as video features. However, this attention model adaptively and sequentially focuses on different layers of CNN features.
2.3 Attention-augmented memory
Attentional interfaces also allow the neural network to interact with other cognitive elements (e.g., memories, working memory). Memory control and logic flow are essential for learning. However, they are elements that do not exist in classical architectures. The memory of classic RNNs, encoded by hidden states and weights, is usually minimal and is not sufficient to remember facts from the past accurately. Most Deep Learning models do not have a simple way to read and write data to an external memory component. The Neural Turing Machine (NTM) [graves_neural_2014] and Memory Networks (MemNN) [weston_2014_memory] - a new class of neural networks - introduced the possibility for a neural network to deal with addressable memory. NTM is a differentiable approach that can be trained with gradient descent algorithms, yielding a practical mechanism for learning programs. NTM memory is a short-term storage space for information together with its rule-based manipulation. Computationally, these rules are simple programs, and the data are those programs’ arguments. Therefore, an NTM resembles a working memory designed to solve tasks that require rules, where variables are quickly bound to memory slots. NTMs use an attentive process to read and write elements to memory selectively. This attentional mechanism makes the network learn to use working memory instead of implementing a fixed set of symbolic data rules.
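A minimal sketch of attentive reading and writing over an addressable memory is given below. It covers only content-based addressing with a soft read and an erase-then-add write; the location-based addressing steps of the NTM and the controller network that emits the key, sharpness, erase, and add vectors are left out, and the function names are assumptions of the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def content_addressing(M, key, beta):
    """
    M    : (N, d) memory with N slots
    key  : (d,)   key vector (emitted by the controller in a full model)
    beta : float  sharpness; larger values focus on the best-matching slot
    Returns attention weights over the N slots.
    """
    sim = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(beta * sim)

def read(M, w):
    """Soft read: weighted average of the memory slots."""
    return w @ M

def write(M, w, erase, add):
    """Soft write: erase then add, both scaled per slot by the weights w."""
    M = M * (1 - np.outer(w, erase))
    return M + np.outer(w, add)
```

Because every slot is read and written to a degree given by the weights, the whole memory access is differentiable and can be trained with gradient descent, as stated above.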
Memory Networks [weston_2014_memory] are a relatively new framework of models designed to alleviate the problem of learning long-term dependencies in sequential data by providing an explicit memory representation for each token in the sequence. Instead of forgetting the past, Memory Networks explicitly consider the input history, with a dedicated vector representation for each history element, effectively removing the chance to forget. The limit on memory size becomes a hyper-parameter to tune, rather than an intrinsic limitation of the model itself. This model was used in question-answering tasks where the long-term memory effectively acts as a (dynamic) knowledge base, and the output is a textual response. Large-scale question-answering tests demonstrated the reasoning power of memory networks on questions that require an in-depth analysis of verb intent. Mainly due to the success of MemNN, networks with external memory are a growing research direction in DL, with several branches under development, as shown in Figure 4.
End-to-end Memory Networks [sukhbaatar2015end] are the first version of MemNN applicable to realistic scenarios: they are trainable end-to-end and require little supervision during training. Oh et al. [oh2019video] extend Memory Networks to suit the task of semi-supervised segmentation of video objects. Frames with object masks are placed in memory, and a frame to be segmented acts as a query. The memory is updated with the newly provided masks and addresses challenges such as changes, occlusions, and accumulated errors without online learning. The algorithm acts as an attentional space-time system, computing when and where to attend for each query pixel to decide whether the pixel belongs to a foreground object or not. Kumar et al. [kumar_ask_2015] propose the first network with episodic memory - a type of memory extremely relevant to humans - to iterate over representations emitted by the input module, updating its internal state through an attentional interface. In [lu2020video], an episodic memory with a key-value retrieval mechanism chooses which parts of the input to focus on through attention. The module then produces a summary representation of the memory, taking into account the query and the stored memory. Finally, the latest research has invested in Graph Memory Networks (GMN), which are memories in GNNs [wu2020comprehensive], to better handle unstructured data using key-value structured memories [miller2016key] [khasahmadi2020memory] [moon2019memory].
2.4 End-to-end attention models
In mid-2017, research aiming at end-to-end attention models appeared in the area. The Neural Transformer (NT) [vaswani_attention_2017] and Graph Attention Networks [velickovic_graph_2018] - purely attentional architectures - demonstrated to the scientific community that attention is a key element for the future development of Deep Learning. The Transformer’s goal is to use self-attention (Section 3) to minimize the difficulties of traditional recurrent neural networks. The Neural Transformer is the first neural architecture that uses only attentional modules and fully-connected neural networks to process sequential data successfully. It dispenses with recurrences and convolutions, capturing the relationship between the sequence elements regardless of their distance. Attention allows the Transformer to be simple, parallelizable, and cheap to train [vaswani_attention_2017]. Graph Attention Networks (GATs) are an end-to-end attention version of GNNs [wu2020comprehensive]. They have stacks of attentional layers that help the model focus on the most relevant parts of the unstructured data to make decisions. The main purpose of attention is to avoid noisy parts of the graph by improving the signal-to-noise ratio (SNR) while also reducing the structure’s complexity. Furthermore, they provide a more interpretable structure for solving the problem. For example, by analyzing a model’s attention over different components of a graph, it is possible to identify the main factors contributing to a particular response.
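As a rough illustration of how an attentional layer weighs a node's neighborhood, the sketch below scores each neighbor with a shared attention vector applied to the concatenated node features, passes the score through a LeakyReLU, and normalizes with a softmax over the neighborhood. It is a simplified, single-head sketch: the learned linear projection and output non-linearity of a full GAT layer are omitted, and the function name, graph, and attention vector are hypothetical.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def graph_attention(features, neighbors, a):
    """features: node -> vector; neighbors: node -> list of nodes (self included);
    a: attention vector whose length is twice the feature size."""
    coeffs, out = {}, {}
    for i, h_i in features.items():
        # Score every neighbor j from the concatenation [h_i, h_j].
        scores = [leaky_relu(sum(w * x for w, x in zip(a, h_i + features[j])))
                  for j in neighbors[i]]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        coeffs[i] = [e / total for e in exps]
        # New node representation: attention-weighted sum of neighbor features.
        out[i] = [sum(c * features[j][d] for c, j in zip(coeffs[i], neighbors[i]))
                  for d in range(len(h_i))]
    return coeffs, out

features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [0, 1, 2], 1: [1, 0], 2: [2, 0]}
coeffs, out = graph_attention(features, neighbors, a=[0.5, -0.5, 1.0, 0.0])
```

Inspecting `coeffs` for a node shows which neighbors dominated its update, which is the kind of analysis the interpretability remark above refers to.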
There is a growing interest in NT and GATs, and some extensions have been proposed [wang2019heterogeneous] [wang2019graph] [abu2018watch] [li2019relation], with numerous Transformer-based architectures, as shown in Figure 5. These architectures, and all others that use self-attention, belong to a new category of neural networks called Self-Attentive Neural Networks. They aim to explore self-attention in various tasks and to mitigate the following drawbacks: 1) a large number of parameters and training iterations to converge; 2) high memory cost per layer and quadratic growth of memory with sequence length; 3) auto-regressive modeling; 4) low parallelization in the decoder layers. Specifically, Weighted Transformer [weighted_transformer] proposes modifications in the attention layers, achieving 40% faster convergence. The multi-head attention modules are replaced by modules called branched attention, which the model learns to combine during the training process. The Star-Transformer [qipeng_guo;xipeng_qiu;pengfei_liu;yunfan_shao;xiangyang_xue;zheng_zhang_star-transformer_2019] proposes a lightweight alternative that reduces the model’s complexity with a star-shaped topology. To reduce the memory cost, Music Transformer [music_transformer] and Sparse Transformer [rewon_child;scott_gray;alec_radford;ilya_sutskever_generating_2019] introduce relative self-attention and factored self-attention, respectively. Lee et al. [lee2018set] also feature an attention mechanism that reduces the cost of self-attention from quadratic to linear, allowing scaling to large inputs and data sets.
Some approaches adapt the Transformer to new applications and areas. In natural language processing, several new architectures have emerged, mainly in multimodal learning. Doubly Attentive Transformer [doubly_attentive_transformer] proposes a multimodal machine-translation method that incorporates visual information. It modifies the attentional decoder so that it attends to both textual features and visual features from a pre-trained CNN encoder. The Multi-source Transformer [multi_source_transformer] explores four different strategies for combining inputs in the multi-head attention decoder layer for multimodal translation. Style Transformer [style_transformer], Hierarchical Transformer [hierarchical_transformer], HighWay Recurrent Transformer [highway_transformer], Lattice-Based Transformer [lattice_transformer], Transformer TTS Network [li2019neural], and Phrase-Based Attention [phrase_attention] are some important architectures in style transfer, document summarization, and machine translation. Transfer Learning in NLP is one of the Transformer’s major contribution areas. BERT [devlin_bert:_2018], GPT-2 [radford2019language], and GPT-3 [brown2020language] build on the NT architecture to address Transfer Learning in NLP, because earlier techniques restrict the power of pre-trained representations. In computer vision, image generation is one of the Transformer’s most notable new applications. Image Transformer [parmar2018image], SAGAN [zhang2018self], and Image GPT [chen2020generative] use self-attention to attend to local neighborhoods. The size of the images that the model can process in practice increases significantly, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks. Recently, at the beginning of 2021, OpenAI introduced the scientific community to DALL·E [unpublished2021dalle], the newest language model based on the Transformer and GPT-3, capable of generating images from text, extending the knowledge of GPT-3 to vision with only 12 billion parameters.
2.5 Attention today
Currently, hybrid models that employ the main key developments in attention’s use in Deep Learning (Figure 6) have aroused the scientific community’s interest. Mainly, hybrid models based on Transformer, GATs, and Memory Networks have emerged for multimodal learning and several other application domains. Hyperbolic Attention Networks (HAN) [gulcehre_hyperbolic_2018], Hyperbolic Graph Attention Networks (GHN) [zhang2019hyperbolic], Temporal Graph Networks (TGN) [rossi2020temporal], and Memory-based Graph Networks (MGN) [khasahmadi2020memory] are some of the most promising developments. Hyperbolic networks are a new class of architecture that combines the benefits of self-attention, memory, graphs, and hyperbolic geometry in the activations of neural networks to reason with high capacity over embeddings produced by deep neural networks. Since 2019, these networks have stood out as a new research branch because they achieve state-of-the-art generalization on neural machine translation, learning on graphs, and visual question answering tasks while keeping the neural representations compact. Since 2019, GATs have also received much attention due to their ability to learn complex relationships or interactions in a wide spectrum of problems, ranging from biology, particle physics, and social networks to recommendation systems. To improve the representation of nodes and expand the capacity of GATs to deal with data of a dynamic nature (i.e., features or connectivity evolving over time), architectures that combine memory modules and the temporal dimension, like MGNs and TGNs, were proposed.
At the end of 2020, two research branches still little explored in the literature were strengthened: 1) explicit combination of bottom-up and top-down stimuli in bidirectional recurrent neural networks and 2) adaptive computation time. Classic recurrent neural networks iterate recurrently within a particular level of representation instead of using top-down interaction, in which higher levels act on lower levels. However, Mittal et al. [mittal2020learning] revisited bidirectional recurrent layers with attentional mechanisms to explicitly route the flow of bottom-up and top-down information, promoting selective interaction between the two levels of stimuli. The approach separates the hidden state into several modules so that upward interactions between bottom-up and top-down signals can be appropriately focused. The layer structure has concurrent modules so that each hierarchical layer can send information in both the bottom-up and top-down directions.
Adaptive computation time (ACT) is an interesting, little-explored topic in the literature that began to expand only in 2020, despite initial studies emerging in 2017. ACT applies to different neural networks (e.g., RNNs, CNNs, LSTMs, Transformers). The general idea is that complex data might require more computation to produce a final result, while some unimportant or straightforward data might require less. The attention mechanism dynamically decides how long the network processes each input. The seminal approach by Graves et al. [graves_adaptive_2016] made minor modifications to an RNN, allowing the network to perform a variable number of state transitions and a variable number of outputs at each step of the input. The resulting output is a weighted sum of the intermediate outputs, i.e., soft attention. A halting unit decides when the network should stop or continue. To limit computation time, a time penalty is added to the cost function, preventing the network from processing data for unnecessary amounts of time. This approach has recently been updated and expanded to other architectures. Spatially Adaptive Computation Time (SACT) [figurnov2017spatially] adapts ACT to adjust the amount of computation at each spatial position of the block in convolutional layers, learning to focus computation on the regions of interest and to stop when the feature maps are "good enough". Finally, Differentiable Adaptive Computation Time (DACT) [eyzaguirre2020differentiable] introduced the first differentiable end-to-end approach to computation time on recurrent networks.
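The halting behaviour can be sketched for a single input step as follows. The sketch assumes the halting unit has already produced one probability per intermediate computation step; computation stops once the accumulated probability reaches 1 - eps, the final step receives the remaining probability mass, and the output is the weighted sum of the intermediate outputs. The function name, the threshold eps, and the example numbers are illustrative assumptions, and the inputs are assumed to be non-empty.

```python
def act_combine(halting_probs, outputs, eps=0.01):
    """Combine intermediate outputs with halting weights for one input step."""
    weights, cumulative = [], 0.0
    last = len(halting_probs) - 1
    for step, h in enumerate(halting_probs):
        if cumulative + h >= 1.0 - eps or step == last:
            # Final step: assign the remainder so that the weights sum to one.
            weights.append(1.0 - cumulative)
            break
        weights.append(h)
        cumulative += h
    result = sum(w * y for w, y in zip(weights, outputs))
    # Time penalty: number of steps taken plus the remainder.
    ponder_cost = len(weights) + weights[-1]
    return result, weights, ponder_cost

result, weights, ponder_cost = act_combine(
    halting_probs=[0.1, 0.3, 0.7, 0.9],
    outputs=[1.0, 2.0, 3.0, 4.0],
)
```

In this example the third step pushes the accumulated halting probability past the threshold, so the fourth intermediate output is never used. Adding `ponder_cost` to the loss is what discourages the network from taking more steps than it needs.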
3 Attention Mechanisms
Deep attention mechanisms can be categorized into soft attention (global attention), hard attention (local attention), and self-attention (intra-attention).
Soft Attention. Soft attention assigns a weight between 0 and 1 to each input element. It decides how much attention should be focused on each element, considering the interdependence between the mechanism’s input and the deep neural network’s target. It uses softmax functions in the attention layers to calculate the weights, so the entire attentional model is deterministic and differentiable. Soft attention can act in the spatial and temporal contexts. In the spatial context, it operates mainly to extract or weight the most relevant features. In the temporal context, it adjusts the weights of all samples in sliding time windows, as samples at different times contribute differently. Despite being deterministic and differentiable, soft mechanisms have a high computational cost for large inputs. Figure 7 shows an intuitive example of a soft attention mechanism.
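A minimal sketch of this weighting, assuming a dot product as the alignment score between a target-side query vector and each input element (other scoring functions are equally possible); the function name and the values are illustrative:

```python
import math

def soft_attention(query, inputs):
    """Return one weight in (0, 1) per input element and the weighted context."""
    scores = [sum(q * x for q, x in zip(query, item)) for item in inputs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax: deterministic, differentiable
    context = [sum(w * item[d] for w, item in zip(weights, inputs))
               for d in range(len(inputs[0]))]
    return weights, context

weights, context = soft_attention(
    query=[1.0, 0.0],
    inputs=[[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]],
)
```

Every input element contributes to `context`, which is why the cost grows with the input size.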
Hard Attention. Hard attention determines whether a part of the mechanism’s input should be considered or not, reflecting the interdependence between the input of the mechanism and the target of the deep neural network. The weight assigned to an input part is either 0 or 1. Hence, as input elements are either seen or not, the objective is non-differentiable. The process involves making a sequence of selections on which part to attend to. In the temporal context, for example, the model attends to a part of the input to obtain information, deciding where to attend in the next step based on the known information. A neural network can make a selection based on this information. However, as there is no ground truth to indicate the correct selection policy, hard-attention mechanisms are represented by stochastic processes. As the model is not differentiable, reinforcement learning techniques are necessary to train models with hard attention. Inference time and computational costs are reduced compared to soft mechanisms, since the entire input is not stored or processed. Figure 8 shows an intuitive example of a hard attention mechanism.
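The stochastic selection and its reinforcement-learning update can be sketched with a score-function (REINFORCE-style) estimator. This is an illustrative toy, not a specific published model: the policy is a softmax over three logits, a hypothetical reward of 1 is given whenever part 2 is selected, and the learning rate and seed are arbitrary.

```python
import math
import random

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, rng):
    """Draw one input part: its weight becomes 1, all others 0."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def reinforce_step(logits, reward_fn, rng, lr=0.5):
    probs = softmax(logits)
    choice = sample_index(probs, rng)
    reward = reward_fn(choice)
    # Gradient of log p(choice) w.r.t. logit i is 1[i == choice] - p_i.
    return [l + lr * reward * ((1.0 if i == choice else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))]

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = reinforce_step(logits, lambda i: 1.0 if i == 2 else 0.0, rng)
probs = softmax(logits)
```

Only the sampled part is ever processed at each step, which is where the savings in computation come from; the price is a noisy gradient estimate.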
Self-Attention. Self-attention quantifies the interdependence between the input elements of the mechanism. This mechanism allows the inputs to interact with each other (hence "self") and determine what they should pay more attention to. The self-attention layer’s main advantage over soft and hard mechanisms is its ability to compute in parallel over a long input. The layer computes attention among all elements of the same input using simple and easily parallelizable matrix calculations. Figure 9 shows an intuitive example of a self-attention mechanism.
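One common instance is scaled dot-product self-attention, sketched below. For brevity the sketch assumes the queries, keys, and values are the input vectors themselves, with no learned projections; the function name and inputs are illustrative.

```python
import math

def self_attention(x):
    """Every element of x attends to every element of x, itself included."""
    d = len(x[0])
    all_weights, out = [], []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        all_weights.append(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return all_weights, out

attn, out = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The loop over `q` has no dependency between iterations, so in practice it is replaced by a single matrix product, which is the source of the parallelism mentioned above.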
4 Attention-based Classic Deep Learning Architectures
This section introduces details about attentional interfaces in classic DL architectures. Specifically, we present the uses of attention in convolutional, recurrent networks and generative models.
4.1 Attention-based Convolutional Neural Networks (CNNs)
Attention emerges in CNNs to filter information and allocate resources to the neural network efficiently. There are numerous ways to use attention on CNNs, which makes it very difficult to summarize how this occurs and the impacts of each use. We divided the uses of attention into six distinct groups (Figure 10): 1) DCN attention pool – attention replaces the classic CNN pooling mechanism; 2) DCN attention input – the attentional modules are filter masks for the input data. This mask assigns low weights to regions irrelevant to neural network processing and high weights to relevant areas; 3) DCN attention layer – attention is between the convolutional layers; 4) DCN attention prediction – attentional mechanisms assist the model directly in the prediction process; 5) DCN residual attention – extracts information from the feature maps and presents a residual input connection to the next layer; 6) DCN attention out – attention captures important stimuli of feature maps for other architectures, or other instances of the same architecture. To maintain consistency with the Deep Neural Network’s area, we extend The Neural Network Zoo schematics (https://www.asimovinstitute.org/neural-network-zoo/) to accommodate attention elements.
DCN attention input mainly uses attention to filter input data - a structure similar to the multi-glimpse mechanism and visual attention of human beings. Multi-glimpse refers to the ability to quickly scan the entire image and find the main areas relevant to the recognition process, while visual attention focuses on a critical area by extracting key features to understand the scene. When a person focuses on one part of the image, the internal relationship between the different regions is captured, guiding eye movement to find the next relevant area. Ignoring the irrelevant parts eases learning in the presence of clutter. For this reason, human vision has an incomparable performance in object recognition. The main contribution of attention at the CNNs’ input is robustness. If our eyes see an object in a real-world scene, parts far from the object are ignored. Therefore, the distant background of the fixated object does not interfere with recognition. However, CNNs treat all parts of the image equally. The irrelevant regions confuse the classification and make it sensitive to visual disturbances, including background, changes in camera views, and lighting conditions. Attention at the CNNs’ input contributes to increasing robustness in several ways: 1) it makes architectures more scalable, as the number of parameters does not grow linearly with the size of the input image; 2) it eliminates distractors; 3) it minimizes the effects of changes in lighting, scale, and camera views; 4) it allows the extension of models to more complex tasks, e.g., fine-grained classification or segmentation; 5) it simplifies CNN encoding; 6) it facilitates learning by including relevant priors for the architecture.
Zhao et al. [zhao_deep_2017] used visual attention-based image processing to generate a focused image. The focused image is then fed into a CNN to be classified. According to the classification, the information entropy guides reinforcement learning agents to achieve a better image classification policy. Wang et al. [x._wang;_l._gao;_j._song;_h._shen_beyond_2017] used attention to create representations rich in motion information for action recognition. The attention extracts saliency maps using both motion and appearance information to calculate objectness scores. For a video, attention processes it frame by frame to generate a saliency-aware map for each frame. The classic pipeline uses only CNN sequence features as input for LSTMs, failing to capture the motion information of adjacent frames. The saliency-aware maps capture only regions with relevant movements, making the CNN encoding simple and representative for the task. Liu et al. [ning_liu;yongchao_long;changqing_zou;qun_niu;li_pan;hefeng_wu_adcrowdnet:_2019] used attention at the input of a CNN to provide important priors in crowd counting tasks. An attention map generator first provides two types of priors for the system: candidate crowd regions and the congestion degree of crowd regions. The priors guide subsequent CNNs to pay more attention to regions with crowds, improving their resistance to noise. Specifically, the congestion degree prior provides fine-grained density estimation for the system.
In classic CNNs, the size of the receptive fields is relatively small. Most of them extract features locally with convolutional operations, which fail to capture long-range dependencies between pixels throughout the image. Larger receptive fields allow for better use of training inputs and make much more context information available, but at the expense of instability or even non-convergence in training. Also, traditional CNNs treat channel features equally. This naive treatment lacks the flexibility to deal with low and high-frequency information. Some frequencies may contain more relevant information for a task than others, but equal treatment by the network makes it difficult for the models to converge. To mitigate such problems, most literature approaches use attention between convolutional layers (i.e., DCN attention layer and DCN residual attention), as shown in Figure 11. Between layers, attention acts mainly for feature recalibration, capturing long-term dependencies, internalizing, and correctly using past experiences.
The pioneering approach to adopting attention between convolutional layers is the Squeeze-and-Excitation Networks [hu_squeeze-and-excitation_2017] created in 2016 and winner of the ILSVRC in 2017. It is also the first architecture to model channel interdependencies to recalibrate filter responses in two steps, squeeze and excitation, i.e., SE blocks. To explore local dependencies, the squeeze module encodes spatial information into a channel descriptor. The output is a collection of local descriptors with expressive characteristics for the entire image. To make use of the information aggregated by the squeeze operation, excitation captures channel-wise dependencies by learning a non-linear and non-mutually exclusive relationship between channels, ensuring that multiple channels can be emphasized. In this sense, SE blocks intrinsically introduce attentional dynamics to boost feature discrimination between convolutional layers.
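The two steps can be sketched on plain nested lists. The sketch assumes the usual formulation: squeeze is a global average over each channel, and excitation is a small two-layer bottleneck (ReLU, then sigmoid) whose outputs rescale the channels. The weight matrices w1 and w2 and the feature values are hypothetical; in a real network they are learned.

```python
import math

def se_block(feature_maps, w1, w2):
    """feature_maps: list of channels, each a 2D list; w1, w2: weight matrices."""
    # Squeeze: summarize each channel's spatial content in one descriptor.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excitation: bottleneck layer with ReLU, then one sigmoid gate per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Recalibration: rescale every channel by its gate.
    scaled = [[[g * v for v in row] for row in ch]
              for g, ch in zip(gates, feature_maps)]
    return gates, scaled

feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
gates, scaled = se_block(feature_maps, w1=[[0.5, -0.25]], w2=[[2.0], [-2.0]])
```

Using independent sigmoid gates rather than a softmax is what keeps the relationship non-mutually exclusive: several channels can be emphasized at once.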
Inter-channel and intra-channel attention for capturing long-term dependencies while simultaneously taking advantage of high and low-level stimuli is widely explored in the literature. Zhang et al. [zhang2019residual] proposed residual local and non-local attention blocks consisting of trunk and mask branches. Their attention mechanism helps to learn local and non-local information from the hierarchical features, further preserving low-level features while maintaining the representational quality of high-level features. CBAM [sanghyun_woo;jongchan_park;joon-young_lee;in_so_kweon_cbam:_2018] infers attentional maps in two separate dimensions, channel and spatial, for adaptive feature refinement. The double attention block in [chen_^2-nets:_2018] aggregates and propagates informative global features considering the entire spatio-temporal context of images and videos, allowing subsequent convolution layers to access features from across the entire space efficiently. In the first stage, attention gathers features from all space into a compact set through attentional pooling. In the second stage, it selects and adaptively distributes the features to each location. Following similar exploration proposals, several attentional modules can be easily plugged into classic CNNs [fukui2019attention] [han20193dviewgraph] [ji2017distant] [yang2017neural].
Hackel et al. [hackel_inference_2018] explored attention to preserve sparsity in convolutional operations. Convolutions with kernels greater than generate fill-in, reducing feature maps’ sparse nature. Generally, the change in data sparsity has little influence on the network output, but memory consumption and execution time increase considerably when it occurs in many layers. To guarantee low memory consumption, attention acts as a filter, which has two different versions of selection: 1) it acts on the output of the convolution, preferring the largest positive responses, similar to a rectified linear unit; 2) it chooses the highest absolute values, expressing a preference for responses of great magnitude. The parameter controls the level of sparse data and, consequently, computational resources during training and inference. Results point out that training with attentional control of data sparsity can reduce in more than the forward pass runtime in one layer.
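The two selection rules can be illustrated on a flat list of convolution responses. The sketch assumes the controlling parameter is a budget of responses to keep, here called k; that name, the function name, and the example values are assumptions for illustration and are not taken from the original approach.

```python
def keep_top_k(responses, k, by_magnitude=False):
    """Zero out all but k responses, restoring sparsity after a convolution."""
    if by_magnitude:
        # Version 2: prefer responses with the largest absolute value.
        candidates = list(range(len(responses)))
        ranked = sorted(candidates, key=lambda i: abs(responses[i]), reverse=True)
    else:
        # Version 1: prefer the largest positive responses (ReLU-like).
        candidates = [i for i, r in enumerate(responses) if r > 0]
        ranked = sorted(candidates, key=lambda i: responses[i], reverse=True)
    kept = set(ranked[:k])
    return [r if i in kept else 0.0 for i, r in enumerate(responses)]

responses = [0.2, -3.0, 1.5, -0.1, 0.7]
positive = keep_top_k(responses, k=2)
magnitude = keep_top_k(responses, k=2, by_magnitude=True)
```

A smaller k yields sparser feature maps and therefore less memory and computation in the following layers, at the risk of discarding useful responses.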
To aggregate previous information and dynamically point to past experiences, SNAIL [mishra_simple_2017] - a pioneering class of attention-based meta-learner architectures - proposed combining temporal convolutions with soft attention. This approach demonstrates that attention complements the disadvantages of convolution. Attention allows precise access in an infinitely large context, while convolutions provide high-bandwidth access at the expense of a finite context. By merging convolutional layers with attentional layers, SNAIL can effectively access an unrestricted number of previous experiences, and the model can learn a more efficient representation of features. As an additional benefit, SNAIL architectures are simpler to train than classic RNNs.
The DCN attention out group uses attention to share relevant feature maps with other architectures or even with instances of the current architecture. Usually, the main objective is to facilitate the fusion of features, multimodality, and external knowledge. In some cases, attention works by turning classic CNNs into recurrent convolutional neural networks - a new trend in Deep Learning for dealing with challenging image problems. RA-CNN [fu_look_2017] is a pioneering framework for recurrent convolutional networks. In this framework, attention proceeds along two dimensions, i.e., discriminative feature learning and sophisticated part localization. Given an input image, a classic CNN extracts feature maps, and the attention proposal network maps convolutional features to a feature vector that can be matched with the category entries. Then, attention estimates the focus region for the next CNN instance, i.e., the next finer scale. Once the focus region is located, the system crops and enlarges the region to a finer scale with higher resolution to extract more refined features. Thus, each CNN in the stack generates a prediction, with the deepest layers of the stack generating the most accurate predictions.
For merging features, Chen et al. [xueying_chen;rong_zhang;pingkun_yan_feature_2019] presented the Feature-fusion Encoder-Decoder Network (FED-net) for image segmentation. Their model uses attention to fuse features from different levels of an encoder. At each encoder level, the attention module merges features from its current level with features from later levels. After the merger, the decoder performs convolutional upsampling with the information from each attention level, which contributes by modulating the most relevant stimuli for segmentation. Tian et al. [tian2018learning] used feature pyramid-based attention to combine meaningful semantic features with semantically weak but visually strong features in a face detection task. Their goal is to learn more discriminative hierarchical features with enriched semantics and details at all levels to detect hard-to-detect faces, such as tiny or partially occluded faces. Their attention mechanism can fuse different feature maps from top to bottom recursively by combining transposed convolutions and element-wise multiplication, maximizing mutual information between the lower and upper-level representations.
framework presented an efficient strategy for transfer learning. Their attention system acts as a behavior regulator between the source model and the target model. The attention identifies the source model’s completely transferable channels, preserving their responses and identifying the non-transferable channels to dynamically modulate their signals, increasing the target model’s generalization capacity. Specifically, the attentional system characterizes the distance between the source/target model through the feature maps’ outputs and incorporates that distance to regularize the loss function. Optimization normally affects the weights of the neural network and assigns generalization capacity to the target model. Regularization modulated by attention on high and low semantic stimuli manages to take important steps in the semantic problem to plug in external knowledge.
The DCN attention prediction group uses attention directly in the prediction process. Various attentional systems capture features from different convolutional layers as input and generate a prediction as output. Voting between the different predictors generates the final prediction. Reusing activations of CNN feature maps to find the most informative parts of the image at different depths makes prediction tasks more discriminative. Each attentional system learns to relate stimuli and part-based fine-grained features, which, although correlated, are not explored together in classical approaches. Zheng et al. [zheng2017learning] proposed a multi-attention mechanism to group channels, creating part classification sub-networks. The mechanism takes feature maps from convolutional layers as input and generates multiple clusters of spatially-correlated subtle patterns as a compact representation. Each sub-network classifies the image by an individual part. The attention mechanism proposed in [rodriguez_painless_2018] uses a similar approach. However, instead of grouping features into clusters, the attention heads select the most relevant feature map regions. The output heads generate a hypothesis given the attended information, and the confidence gates generate a confidence score for each attention head.
Finally, the DCN attention pool group replaces classic pooling strategies with attention-based pooling. The objective is to create a non-linear encoding that selects only stimuli relevant to the task, given that classical strategies select only the most contrasting stimuli. To modulate the resulting stimuli, attentional pooling layers generally capture different relationships between feature maps or between different layers. For example, Wang et al. [linlin_wang_zhu_cao_gerard_de_melo_zhiyuan_liu:_relation_nodate] created an attentional mechanism that captures pertinent relationships between convolved context windows and the relation class embedding through a correlation matrix learned during training. The correlation matrix modulates the convolved windows, and finally, the mechanism selects only the most salient stimuli. A similar approach is also followed in [yin2016abcnn] for modeling sentence pairs.
4.2 Attention-based Recurrent Neural Networks (RNNs)
Attention in RNNs is mainly responsible for capturing long-distance dependencies. Currently, there are not many ways to use attention in RNNs. RNNSearch’s mechanism for encoder-decoder frameworks inspires most approaches [bahdanau_neural_2014]. We divided the uses of attention into three distinct groups (Figure 12): 1) recurrent attention input – the first stage of attention, which selects elementary input stimuli, i.e., elementary features; 2) recurrent memory attention – the first stage of attention over historical weight components; 3) recurrent hidden attention – the second stage of attention, which selects categorical information for the decoding stage.
The main uses of the recurrent attention input group are item-wise hard, location-wise hard, item-wise soft, and location-wise soft selection. Item-wise hard discretely selects relevant input data for further processing, whereas location-wise hard discretely focuses only on the most relevant features for the task. Item-wise soft assigns a continuous weight to each input item given a sequence of items as input, and location-wise soft assigns a continuous weight to each input feature. Location-wise soft estimates high weights for features more correlated with the global context of the task. Hard selection of input elements is applied more frequently in computer vision approaches [mnih_recurrent_2014] [marcus_edel;joscha_lausch_capacity_2016]. On the other hand, soft mechanisms are often applied in other fields, mainly in Natural Language Processing. Soft selection normally weighs relevant parts of the series or input features, and the attention layer is a differentiable feed-forward network with a low computational cost. Soft approaches are useful for filtering noise from time series and for dynamically learning the correlation between input features and output [qin2017dual] [geoman_2018] [du2017rpan]. Besides, this approach is useful for addressing graph-to-sequence learning problems, which learn a mapping from graph-structured inputs to sequence outputs and which current Seq2Seq and Tree2Seq models may be inadequate to handle [xu2018graph2seq].
Hard mechanisms take inspiration from how humans perform visual sequence recognition tasks, such as reading, by continually moving the fovea to the next relevant object or character, recognizing the individual entity, and adding the knowledge to an internal representation. At each step, a deep recurrent neural network processes a multi-resolution crop of the input image, called a glimpse. The network uses information from the glimpse to update its internal representation and outputs the next glimpse location. The glimpse network captures salient information about the input image at a specific position and region size. The internal state is formed by the hidden units of the recurrent neural network and is updated over time by the core network. At each step, the location network estimates the next focus location, and the action network’s output depends on the task (e.g., for classification, the action network outputs a prediction for the class label). Because hard attention is not entirely differentiable, these models are trained with reinforcement learning.
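A multi-resolution glimpse can be sketched as a set of concentric crops of doubling size, each pooled down to the same resolution. The patch sizes, the zero padding at the borders, and the average pooling below are illustrative assumptions, not details taken from the cited architectures.

```python
def crop(image, cy, cx, size):
    """Square crop of side `size` around (cy, cx); pixels outside the image are zero."""
    half = size // 2
    h, w = len(image), len(image[0])
    return [
        [image[y][x] if 0 <= y < h and 0 <= x < w else 0.0
         for x in range(cx - half, cx - half + size)]
        for y in range(cy - half, cy - half + size)
    ]

def downsample(patch, out_size):
    """Average-pool a square patch to out_size x out_size (side must be a multiple)."""
    factor = len(patch) // out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            block = [patch[i * factor + a][j * factor + b]
                     for a in range(factor) for b in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def glimpse(image, cy, cx, base_size=2, num_scales=3):
    """Return num_scales patches of equal resolution covering regions of doubling size."""
    return [downsample(crop(image, cy, cx, base_size * 2 ** k), base_size)
            for k in range(num_scales)]

image = [[float(8 * r + c) for c in range(8)] for r in range(8)]
patches = glimpse(image, 4, 4)  # sharp centre plus two coarser surrounding views
```

Each patch has the same number of pixels, so the region near the focus point is seen in detail while the periphery is seen only coarsely.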
RAM [mnih_recurrent_2014] was the first architecture to use a recurrent network implementing hard selection for image classification tasks. While this model has learned successful strategies on various image data sets, it uses only a few static glimpse sizes. CRAM [marcus_edel;joscha_lausch_capacity_2016] uses an additional sub-network to dynamically change the glimpse size, with the aim of increasing performance, and Ba et al. [ba_multiple_2014] explore modifications to RAM for real-world image tasks and multiple-object classification. CRAM is similar to RAM except for two key differences. Firstly, a dynamically updated attention mechanism restrains the input region observed by the glimpse network and the next output region predicted by the emission network, a network that incorporates the location and capacity information as well as past information. Put more simply, the sub-network decides at each time step what the focus region’s capacity should be. Secondly, the capacity sub-network outputs are successively added to the emission network’s input, which ultimately generates the information for the next focus region, allowing the emission network to combine the information from the location and capacity networks.
Nearly all important works in the field belong to the recurrent hidden attention group, as shown in Figure 13. In this category, the attention mechanism selects elements in the RNN’s hidden layers for inter-alignment, contextual embedding, multiple-input processing, memory management, and capturing long-term dependencies, a typical problem with recurrent neural networks. Inter-alignment involves the encoder-decoder framework, and placing the attention module between these two networks is the most common approach. This mechanism dynamically builds a context vector from all encoder hidden states, weighted according to the previous decoder hidden state. Attention in inter-alignment helps minimize the bottleneck problem, with RNNSearch [bahdanau_neural_2014] for machine translation as its first representative. Several other architectures later implemented the same approach in other tasks [cheng2016long] [yang2016hierarchical] [seo_bidirectional_2016]. For example, Yang et al. [yang2016hierarchical] extended soft selection to a hierarchical attention structure, which computes soft attention at the word level and at the sentence level in a GRU-based encoder for document classification.
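As an illustration of inter-alignment, the sketch below computes an additive alignment in the style of RNNSearch: each encoder hidden state is scored against the previous decoder state, the scores are normalized with a softmax, and the context vector is the weighted sum of the encoder states. The weight matrices `W_s`, `W_h` and the vector `v` are hypothetical toy values standing in for learned parameters.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def additive_attention(prev_decoder_state, encoder_states, W_s, W_h, v):
    """Score each encoder state against the previous decoder state, then average."""
    projected_state = matvec(W_s, prev_decoder_state)
    scores = []
    for h in encoder_states:
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, matvec(W_h, h))]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return context, weights

# Toy parameters (assumptions, not learned values).
W_s = [[0.5, 0.0], [0.0, 0.5]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
v = [1.0, 1.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = additive_attention([0.2, -0.1], encoder_states, W_s, W_h, v)
```

Because the context vector is recomputed at every decoding step, the decoder is no longer forced to rely on a single fixed-length summary of the source sentence.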
Co-attention is highly effective for creating contextual embeddings and manipulating multimodal inputs, particularly in text matching applications. Co-attention enables the learning of pairwise attention, i.e., learning to attend based on word-level affinity scores computed between two documents. Such a mechanism is designed for architectures composed of queries and context, such as question answering and emotion analysis. Co-attention models can be fine-grained or coarse-grained. Fine-grained models consider each element of one input with respect to each element of the other input. Coarse-grained models calculate attention for each input, using an embedding of the other input as a query. Although efficient, co-attention suffers from information loss from the target and the context due to early summarization. Attention flow emerges as an alternative that avoids this problem. Unlike co-attention, attention flow links and merges context and query information at each time step, allowing embeddings from previous layers to flow to subsequent modeling layers. The attention flow layer does not summarize the query and the context into single feature vectors, which reduces information loss. Attention is calculated in two directions, from the context to the query and from the query to the context. The output is a set of query-aware representations of the context words. Attention flow allows a hierarchical multi-stage process to represent the context at different levels of granularity without early summarization.
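The two attention directions can be sketched from a single word-level affinity matrix. As a simplifying assumption, the affinity here is a plain dot product between context and query vectors; published models typically learn a richer similarity function, and the function names are ours.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]

def bidirectional_attention(context, query):
    """Return context-to-query vectors (one per context word) and one query-to-context vector."""
    affinity = [[dot(c, q) for q in query] for c in context]
    # Context-to-query: each context word attends over all query words.
    c2q = [weighted_sum(softmax(row), query) for row in affinity]
    # Query-to-context: weight context words by their best match with any query word.
    q2c = weighted_sum(softmax([max(row) for row in affinity]), context)
    return c2q, q2c

context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [[1.0, 0.0], [0.0, 2.0]]
c2q, q2c = bidirectional_attention(context, query)
```

Note that `c2q` keeps one vector per context word, so nothing is collapsed into a single summary before the later modeling layers see it.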
Hard attention mechanisms do not often occur in recurrent hidden attention networks. However, Ke et al. [ke_sparse_2018] demonstrate that hard selection to retrieve past hidden states based on the current state mimics an ability of the brain. Humans use a very sparse subset of past experiences and can access them directly and establish their relevance to the present, unlike classic RNNs and self-attentive networks. Hard attention is an efficient mechanism for RNNs to recover sparse memories. It determines which memories will be selected on the forward pass, and therefore which will receive gradient updates. At each time step, the RNN receives the previous hidden state vector, the previous cell state vector, and the current input, and computes a new cell state and a provisional hidden state vector that also serves as a provisional output. First, the provisional hidden state vector is concatenated to each memory vector in the memory. An MLP then maps each concatenated vector to an attention weight representing the relevance of that memory at the current time step. From these attention weights, sparse attention computes a hard decision. The attention mechanism is differentiable but implements a hard selection to forget memories with no prominence over others. This is quite different from typical approaches, as the mechanism does not allow the gradient to flow directly to a previous step in the training process. Instead, it propagates to some local timesteps as a type of local credit given to a memory.
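The hard decision over memories can be illustrated with a simple top-k sparsification: only the k most relevant memories keep a nonzero weight, and the rest are dropped. This is a simplified stand-in for the sparsifier of the cited work, and the scores are hypothetical numbers in place of MLP outputs.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_topk_attention(scores, k):
    """Keep the k largest attention weights, zero the others, and renormalize."""
    weights = softmax(scores)
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    kept = sum(weights[i] for i in top)
    return [weights[i] / kept if i in top else 0.0 for i in range(len(weights))]

memory_scores = [0.1, 2.0, -1.0, 1.5]  # hypothetical relevance of four stored memories
sparse_weights = sparse_topk_attention(memory_scores, k=2)
```

Memories with zero weight contribute nothing to the forward pass and therefore receive no gradient, which is the property the paragraph above describes.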
Finally, the recurrent memory attention group implements attention within the memory cell. As far as our research goes, there are not many architectures in this category. Zhang et al. [zhang2018adding] proposed an approach that modulates the input adaptively within the memory cell by assigning different levels of importance to each element/dimension of the input, as shown in Figure 14a. Perera et al. [perera2020lstm] proposed attention mechanisms within the memory cell to improve the encoding of past history in the cell’s state vector, since not all parts of the data history are equally relevant to the current prediction. As shown in Figure 14b, the mechanism uses additional gates to update the LSTM’s current cell.
4.3 Attention-based Generative Models
Attention emerges in generative models essentially to augment memory. Currently, there are not many ways to use attention in generative models. Since GANs are a framework rather than a neural network architecture, we do not discuss the use of attention in GANs, only in autoencoders. We divide the uses of attention into three distinct groups (Figure 15): 1) autoencoder input attention: attention provides spatial masks corresponding to all the parts of a given input, while a component autoencoder (e.g., AE, VAE, SAE) independently models each of the parts indicated by the masks; 2) autoencoder memory attention: the attention module acts as a layer between the encoder and decoder to augment memory; and 3) autoencoder attention encoder-decoder: a fully attentive architecture acts on the encoder, the decoder, or both.
MONet [burgess2019monet] is one of the few architectures to implement attention at the VAE input. A VAE is a neural network with a parameterized encoder and a parameterized decoder. The encoder parameterizes a distribution over the component latent, conditioned on both the input data x and an attention mask. The mask indicates which regions of the input the VAE should focus on representing via its latent posterior distribution. During training, the VAE’s decoder likelihood term in the loss is weighted according to the mask, such that it is unconstrained outside of the masked regions. In [li2016learning], the authors use soft attention with learned memory contents to augment models so that they have more parameters in the autoencoder. In [bartunov2016fast], Generative Matching Networks use attention to access the exemplar memory, with the address weights computed from a learned similarity function between an observation at the address and a function of the latent state of the generative model. In [rezende2016one], external memory and attention implement one-shot generalization by treating the conditioning exemplars as memory entries accessed through a soft attention mechanism at each step of the incremental generative process, similar to DRAW [draw]. Although most approaches use soft attention to address the memory, in [bornschein2017variational] the authors use a stochastic, hard attention approach, which allows variational inference over the memory addressing in a few-shot learning context.
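The effect of weighting the reconstruction term by the attention mask can be sketched with a squared error on a flattened image. This is a deliberate simplification for illustration: the actual likelihood used by MONet is more elaborate, and the function name is ours.

```python
def masked_reconstruction_loss(x, reconstruction, mask):
    """Squared error weighted per pixel by the attention mask (values in [0, 1])."""
    return sum(m * (xi - ri) ** 2 for xi, ri, m in zip(x, reconstruction, mask))

x = [1.0, 2.0, 3.0, 4.0]               # flattened input (toy values)
reconstruction = [1.0, 0.0, 3.0, 9.0]  # wrong at positions 1 and 3
inside_mask = [1.0, 0.0, 1.0, 0.0]     # component is responsible for positions 0 and 2
loss = masked_reconstruction_loss(x, reconstruction, inside_mask)
```

Here the loss is zero even though two pixels are reconstructed badly, because those pixels lie outside the mask and the component is left unconstrained there.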
In [escolano2018self], self-attentive networks increase the autoencoder’s ability to generalize. The advantage of this model over alternatives such as recurrent or convolutional encoders is that it is based only on self-attention and traditional attention over the whole representation created by the encoder. This approach makes it easy to employ the different components of the network (encoder and decoder) as modules that, during inference, can be used with other parts of the network without the need for previous step information.
In a few years, neural attention networks have been used in numerous domains due to their versatility, interpretability, and significance of results. These networks have been explored mainly in computer vision, natural language processing, and multi-modal tasks, as shown in figure 16. In some applications, these models transformed the area entirely (e.g., question answering, machine translation, document representations/embeddings, graph embeddings), mainly due to significant performance gains on the task in question. In others, they helped learn better representations and deal with temporal dependencies over long distances. This section explores a list of application domains and subareas, mainly discussing each domain’s main models and how it benefits from attention. We also present the most representative instances within each area and list them with reference approaches in a wide range of applications.
5.1 Natural Language Processing (NLP)
In the NLP domain, attention plays a vital role in many sub-areas, as shown in figure 16. There are several state-of-the-art approaches, mainly in language modeling, machine translation, natural language inference, question answering, sentiment analysis, semantic analysis, speech recognition, and text summarization. Table 1 groups works developed in each of these areas. Several other applications, such as emotion recognition, speech classification, sequence prediction, semantic matching, and grammatical correction, are expanding but still have few representative works, as shown in Table 2.
|Natural Language Inference||
For machine translation (MT), question answering (QA), and automatic speech recognition (ASR), attention mainly aligns input and output sequences, capturing long-range dependencies. For example, in ASR tasks, attention aligns acoustic frames by extracting information from anchor words to recognize the main speaker while ignoring background noise and interfering speech. Hence, only information on the desired speech is passed to the decoder, which provides a straightforward way to align each output symbol with different input frames with selective noise decoding. In MT, automatic alignment translates long sentences more efficiently. It is a powerful tool for multilingual neural machine translation (NMT), efficiently capturing subjects, verbs, and nouns in sentences of languages that differ significantly in their syntactic structure and semantics.
In QA, alignment usually occurs between a query and the content, looking for key terms to answer the question. The classic QA approaches do not support very long sequences and fail to correctly model the meaning of context-dependent words. Different words can have different meanings, which increases the difficulty of extracting the essential semantic logical flow of each sentence in different paragraphs of context. These models are unable to address uncertain situations that require additional information to answer a particular question. In contrast, attention networks allow rich dialogues through addressing mechanisms for explicit memories or alignment structures in the query-context and context-query directions.
Attention also contributes to summarizing and classifying texts/documents. It mainly helps build more effective embeddings that consider contextual, semantic, and hierarchical information between words, phrases, and paragraphs. Specifically, in summarization tasks, attention minimizes critical problems involving: 1) modeling of keywords; 2) summary of abstract sentences; 3) capture of the sentence’s hierarchical structure; 4) repetitions of inconsistent phrases; and 5) generation of short sentences preserving their meaning.
Figure 17 illustrates two models working in NLP tasks: RNNSearch [bahdanau_neural_2014], in machine translation, and End-to-End Memory Networks [sukhbaatar2015end], in question answering. In RNNSearch, the attention guided by the decoder’s previous state dynamically searches for important source words for the next time step. It consists of an encoder followed by a decoder. The encoder is a bidirectional RNN (BiRNN) [schuster1997bidirectional] that consists of forward and backward RNNs. The forward RNN reads the input sequence in order and calculates the forward hidden state sequence. The backward RNN reads the sequence in reverse order, resulting in the backward hidden state sequence. The decoder has an RNN and an attention system that calculates a probability distribution over all possible output symbols from a context vector.
In End-to-End Memory Networks, attention looks for the memory elements most related to the query using an alignment function that dispenses with the RNNs’ complex structure. It consists of a memory and a stack of identical attentional systems. Each layer takes as input a set of items to store in the memory. The input set is converted into two sets of memory vectors, input memories and output memories, in the simplest case using one embedding matrix for each set. In the first layer, the query is also embedded, via a third embedding matrix, to obtain an internal state. From the second layer onward, the internal state is the sum of the previous layer’s output and the previous layer’s internal state. Finally, the last layer generates the predicted answer.
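A single attentional layer (hop) of such a memory network can be sketched as follows. The sketch assumes that the input memories, the output memories, and the query state have already been embedded, and it reuses the same memories at every hop, which is only one of several possible weight-tying choices. All names are ours.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def memory_hop(state, input_memories, output_memories):
    """Match the state against input memories, read from output memories, update the state."""
    attention = softmax([dot(state, m) for m in input_memories])
    dim = len(output_memories[0])
    read = [sum(p * c[d] for p, c in zip(attention, output_memories)) for d in range(dim)]
    return [s + r for s, r in zip(state, read)], attention

def run_hops(state, input_memories, output_memories, hops):
    """Stack several hops; each new state is the previous state plus the layer output."""
    for _ in range(hops):
        state, _ = memory_hop(state, input_memories, output_memories)
    return state

input_memories = [[1.0, 0.0], [0.0, 1.0]]   # toy embedded memories (assumption)
output_memories = [[0.5, 0.5], [2.0, 2.0]]
query_state = [1.0, 0.0]                     # toy embedded query (assumption)
next_state, attention = memory_hop(query_state, input_memories, output_memories)
final_state = run_hops(query_state, input_memories, output_memories, hops=3)
```

In a full model, a final linear layer followed by a softmax would map the last state to the predicted answer.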
The Neural Transformer [vaswani_attention_2017], illustrated in figure 18, is the base model for state-of-the-art results in NLP. The architecture consists of an arbitrary number of stacked encoders and decoders. Each encoder has linear layers, an attention system, feed-forward neural networks, and normalization layers. The attention system has several parallel heads. Each head has attentional subsystems that perform the same task but have different contextual inputs. The encoder receives a word embedding matrix as input. As the architecture does not use recurrence, the input tokens’ position information is not explicit, but it is necessary. To represent this position information, the Transformer adds a positional encoding to each embedding vector. The positional encoding is fixed and uses sinusoidal functions.
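The fixed sinusoidal positional encoding can be written compactly: even dimensions use a sine and odd dimensions a cosine, with wavelengths that grow geometrically across dimensions. The sketch follows the formulation of the original Transformer paper; the function name and the toy sizes are ours.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position codes."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the positional code to each embedding vector, element by element."""
    codes = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + c for e, c in zip(emb, code)] for emb, code in zip(embeddings, codes)]

encoding = positional_encoding(num_positions=3, d_model=4)
```

Because the table depends only on the position and the dimension, it is computed once and never updated during training.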
The input goes through linear layers that generate, for each word, a query vector (q), a key vector (k), and a value vector (v). The attentional system receives the Q, K, and V arrays as input and uses several parallel attention heads. The motivation for using a multi-head structure is to explore multiple subspaces, since each head gets a different projection of the data. Each head learns a different aspect of attention to the input, calculating different attentional distributions. Having multiple heads in the Transformer is similar to having multiple feature extraction filters in CNNs. The head outputs an attentional mask that relates all queries to a certain key. In a simplified way, the operation performed by a head is a matrix multiplication between a matrix of queries and a matrix of keys.
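The operation performed by a single head can be sketched as scaled dot-product attention: the query-key products are scaled by the square root of the key dimension, normalized row by row with a softmax, and used to average the value vectors. The linear projections that produce Q, K, and V are omitted, and the toy matrices are assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights for one head."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    dim = len(V[0])
    output = [[sum(w * v[d] for w, v in zip(row, V)) for d in range(dim)]
              for row in weights]
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0]]  # one query per token (toy values)
K = [[1.0, 0.0], [0.0, 1.0]]  # one key per token
V = [[1.0, 2.0], [3.0, 4.0]]  # one value per token
output, weights = scaled_dot_product_attention(Q, K, V)
```

A multi-head layer would run several such heads on different learned projections of the same input and concatenate their outputs.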
Finally, the data is added to the residual output from the previous layer and normalized, forming the encoder output, which is the input to the next encoder. The last encoder’s output is transformed into the attention matrices K and V, which are input to all decoder layers. These matrices help the decoder focus on the appropriate locations in the input sequence. The decoder has two attention layers, feed-forward layers, and normalization layers. The attentional layers are the masked multi-head attention and the decoder multi-head attention.
The masked multi-head attention is very similar to the encoder multi-head attention, with the difference that the attention matrices Q, K, and V are created only from the previous words, masking future positions with negative-infinity values before the softmax step. The decoder multi-head attention is equal to the encoder multi-head attention, except that it creates the Q matrix from the data of the previous layer and uses the K and V matrices of the encoder output. The K and V matrices are the memory structure of the network, storing context information of the input sequence; given the previous words in the decoder output, the relevant information is selected from memory for the prediction of the next word. Finally, a linear layer followed by a softmax function projects the vector produced by the last decoder into a probability vector in which each position defines the probability of the output word being a given vocabulary word. At each time step, the position with the highest probability value is chosen, and the word associated with it is the output.
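Two steps of the decoder lend themselves to a short sketch: masking future positions before the softmax, and choosing the most probable vocabulary word at the output. The score matrix, logits, and vocabulary below are toy assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Set scores of future positions (j > i) to -inf, then apply a row-wise softmax."""
    n = len(scores)
    masked = [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
              for i in range(n)]
    return [softmax(row) for row in masked]

def greedy_token(logits, vocabulary):
    """Project logits to probabilities and return the most probable word."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return vocabulary[best], probabilities[best]

scores = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 1.0, 0.0]]
weights = causal_attention_weights(scores)
word, probability = greedy_token([0.1, 2.0, -1.0], ["a", "b", "c"])
```

After masking, the first position can attend only to itself, so a word is never predicted from words that come after it.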
|Text Classification||[liu2019bidirectional] [qipeng_guo;xipeng_qiu;pengfei_liu;yunfan_shao;xiangyang_xue;zheng_zhang_star-transformer_2019]|
|Speech Classification||[norouzian2019exploring] [li2019multi]|
|Document Classification||[choi2019aila] [yang2016hierarchical]|
|Transfer Learning||[devlin2018bert] [alt2019improving]|
|Text-to-Speech||[yasuda2019investigation] [zhang2019joint] [li2019neural]|
|Reading Comprehension||[tao_shen_tianyi_zhou_guodong_long_jing_jiang_chengqi_zhang:_bi-directional_nodate] [wei_wang_chen_wu_ming_yan:_multi-granularity_nodate] [yiming_cui_zhipeng_chen_si_wei_shijin_wang_ting_liu_guoping_hu:_attention-over-attention_nodate] [s._liu;_s._zhang;_x._zhang;_h._wang_r-trans:_2019] [yiming_cui;ting_liu;zhipeng_chen;shijin_wang;guoping_hu_consensus_2018]|
|Natural Language Understanding||[kim2018efficient]|
|Natural Language Transduction||[grefenstette2015learning]|
|Natural Language Generation||[xu2018graph2seq]|
|Entity Resolution||[das2016chains] [ganea2017deep]|
|Embedding||[lin2017structured] [schick2019attentive] [zhu2018self]|
|Dependency Parsing||[dozat2016deep] [strubell2018linguistically]|
|Conversation Model||[zhou2018commonsense] [zhang2019sequence]|
|Automatic Question Tagging||[sun2018automatic]|
5.2 Computer Vision (CV)
Visual attention has become popular in many CV tasks. Action recognition, counting crowds, image classification, image generation, object detection, person recognition, segmentation, saliency detection, text recognition, and tracking targets are the most explored sub-areas, as shown in Table 3. Applications in other sub-areas still have few representative works, such as clustering, compression, deblurring, depth estimation, image restoration, among others, as shown in Table 4.
Visual attention in image classification tasks was first addressed by Mnih et al. [mnih_recurrent_2014]. In this domain, there are sequential approaches inspired by human saccadic movements [mnih_recurrent_2014] and feedforward CNN structures augmented with attention (Section 4). The general goal is usually to improve fine-grained recognition and to improve classification in the presence of occlusions and sudden variations in viewpoint, lighting, and rotation. Some approaches aim to learn to look at the most relevant parts of the input image, while others try to discern between discriminating regions through feature recalibration and ensemble predictors via attention. For fine-grained recognition, important advances have been achieved through recurrent convolutional networks in the classification of bird subspecies [fu_look_2017] and architectures trained via RL to classify vehicle subtypes [zhao_deep_2017].
|Saliency Detection||[j._kuen;_z._wang;_g._wang_recurrent_nodate] [nian_liu;junwei_han;ming-hsuan_yang_picanet:_2018] [marcella_cornia;lorenzo_baraldi;giuseppe_serra;rita_cucchiara_predicting_2018] [xiaowei_hu;chi-wing_fu;lei_zhu;pheng-ann_heng_sac-net:_2019]|
|Text Recognition||[he_end--end_2018] [cheng_focusing_2017] [canjie_luo;lianwen_jin;zenghui_sun_moran:_2019] [hongtao_xie;shancheng_fang;zheng-jun_zha;yating_yang;yan_li;yongdong_zhang_convolutional_2019] [hui_li;peng_wang;chunhua_shen;guyu_zhang_show_2019] [noauthor_focusing_nodate]|
Visual attention also provides significant benefits for action recognition tasks by capturing spatio-temporal relationships. The biggest challenge for classical approaches is capturing discriminative movement features in sequences of images or videos. Attention allows the network to focus its processing only on the relevant joints or movement features. Generally, the main approaches use the following strategies: 1) saliency maps: spatiotemporal attention models learn where to look in video directly from human fixation data. These models express the probability of saliency for each pixel. Deep 3D CNNs extract features only from high-saliency regions to represent spatial and short-term relations at the clip level, and LSTMs expand the temporal domain from a few frames to seconds [bazzani2016recurrent]; 2) self-attention: modeling context dependencies. The person being classified is the Query (Q), and the clip around the person is the memory, represented by key (K) and value (V) vectors. The network processes the query and memory to generate an updated query vector. Intuitively, self-attention adds context from other people and objects in the clip to assist in subsequent classification [girdhar2019video]; 3) recurrent attention mechanisms: these capture relevant positions of joints or movement features and, through a recurrent structure, refine the attentional focus at each time step [liu2017global] [du2017rpan]; and 4) temporal attention: this captures relevant spatial-temporal locations [li2018videolstm] [song_end--end_2016] [xin2016recurrent] [zang2018attention] [pei2017temporal].
The model of Liu et al. [liu2017global] is a recurrent attention approach to capturing the person’s relevant positions. This model presented a pioneering approach using two layers of LSTMs and a context memory cell that recurrently interact with each other, as shown in figure 19a. First, a layer of LSTMs generates an encoding of a skeleton sequence, initializing the context memory cell. The memory representation is input to the second layer of LSTMs and helps the network selectively focus on each frame’s informative joints. Finally, the attentional representation feeds back into the context memory cell to refine the focus again, so that attention is paid more reliably. Similarly, Du et al. [du2017rpan] proposed RPAN, a recurrent attention approach that combines sequential modeling by LSTMs with convolutional feature extractors. First, CNNs extract features from the current frame, and the attentional mechanism, guided by the LSTM’s previous hidden state, estimates a series of features for human joints related to the semantics of the movements of interest. Then, these highly discriminative features feed the LSTM time sequences.
In image generation, there were also notable benefits. DRAW [draw] introduced visual attention with an innovative approach: image patches are generated sequentially and gradually refined, rather than generating the entire image in a single pass (figure 19b). Subsequently, attentional mechanisms emerged in generative adversarial networks (GANs) to minimize the challenges in modeling images with structural constraints. Naturally, GANs efficiently synthesize elements differentiated by texture (e.g., oceans, sky, natural landscapes) but struggle to generate geometric patterns (e.g., faces, animals, people, fine details). The central problem is that convolutions fail to model dependencies between distant regions. Besides, the statistical and computational efficiency of the model suffers from the stacking of many layers. Attentional mechanisms, especially self-attention, offered a computationally inexpensive alternative for modeling long-range dependencies. Self-attention as a complement to convolution contributes significantly to the advancement of the area, with approaches capable of generating fine details [zhang2018self] and high-resolution images with intricate geometric patterns [chen2020generative].
In expression recognition, attention optimizes the entire segmentation process by scanning the input as a whole sequence, choosing the most relevant region to describe a segmented symbol or implicit space operator [zhang2017gru]. In information retrieval, attention helps obtain appropriate semantic features, using individual class semantic features to progressively guide the visual features in generating an attention map that weights the importance of different local regions [ji2018stacked].
In medical image analysis, attention helps implicitly learn to suppress irrelevant areas in an input image while highlighting useful features for a specific task. This eliminates the need for explicit external tissue/organ localization modules based on convolutional neural networks (CNNs). Besides, it allows generating both images and attention maps in unsupervised learning, which is useful for data annotation. For this, there are the ATA-GANs [kastaniotis_attention-aware_2018] and attention gate [schlemper_attention_2018] modules, which work with unsupervised and supervised learning, respectively.
|Depth Estimation||[xu2018structured] [liu2019end]|
|Image Restoration||[zhang2019residual] [suganuma2019attention] [qian2018attentive]|
|Image-to-Image Translation||[mejjati2018unsupervised] [tang2019attention]|
|Information Retrieval||[ji2018stacked] [jin2018deep] [yang2019deep]|
|Medical Image Analysis||[kastaniotis_attention-aware_2018] [schlemper_attention_2018] [huiyan_jiang;tianyu_shi;zhiqi_bai;liangliang_huang_ahcnet:_2019]|
|Multiple Instance Learning||[ilse2018attention]|
|Transfer Learning||[zagoruyko_paying_2016] [xiao-yu_zhang;haichao_shi;changsheng_li;kai_zheng;xiaobin_zhu;lixin_duan_learning_2019] [xingjian_li;haoyi_xiong;hanchao_wang;yuxuan_rao;liping_liu;jun_huan_delta:_2019]|
|Video Classification||[bielski2018pay] [long2018attention]|
|Facial Detection||[tian2018learning] [yundong_zhang;xiang_xu;xiaotao_liu_robust_2019] [shengtao_xiao;jiashi_feng;junliang_xing;hanjiang_lai;shuicheng_yan;ashraf_a._kassim_robust_2016]|
|Person Detection||[zhang2019cross] [zhang2018occluded]|
|Text Detection||[he_end--end_2018] [wojna_attention-based_2017] [he_single_2017] [bhunia_script_2019]|
|Facial Expression Recognition||[siyue_xie;haifeng_hu;yongbo_wu_deep_2019] [shervin_minaee;amirali_abdolrashidi_deep-emotion:_2019] [yong_li;jiabei_zeng;shiguang_shan;xilin_chen_occlusion_2019]|
For person recognition, attention has become essential in person re-identification (re-id) [han_attribute-aware_2019] [meng_zheng;srikrishna_karanam;ziyan_wu;richard_j._radke_re-identification_2019] [li_harmonious_2018]. Re-id aims to search for people seen by surveillance cameras installed at different locations. In classical approaches, the bounding boxes of detected people were not optimized for re-identification and suffered from misalignment, background clutter, occlusion, and missing body parts. Misalignment is one of the biggest challenges, as people are often captured in various poses, and the system needs to compare different images. In this sense, neural attention models started to lead the developments, mainly with multiple attentional mechanisms of alignment between different bounding boxes.
There are also less popular applications in which attention nevertheless plays an essential role. Self-attention models interactions among the elements of the input set for clustering tasks [lee2018set]. Attention refines and merges multi-scale feature maps in depth estimation and edge detection [xu2017learning] [xu2018structured]. In video classification, attention helps capture global and local features, generating a comprehensive representation [xie2019semantic]. It also measures each time interval’s relevance in a sequence [pei2017temporal], promoting a more intuitive interpretation of the impact of content on the video’s popularity by providing the regions that contribute the most to the prediction [bielski2018pay]. In face detection, attention dynamically selects the main reference points of the face [shengtao_xiao;jiashi_feng;junliang_xing;hanjiang_lai;shuicheng_yan;ashraf_a._kassim_robust_2016]. In deblurring, it improves the result at each convolutional layer, preserving fine details [park2019down]. Finally, in emotion recognition, it captures complex relationships between audio and video data by obtaining regions where both signals relate to emotion [zhang2019deep].
5.3 Multimodal Tasks (CV/NLP)
Attention has been used extensively in multimodal learning, mainly for mapping complex relationships between different sensory modalities. In this domain, the importance of attention is quite intuitive, given that communication and human sensory processing are completely multimodal. The first approaches emerged in 2015, inspired by an attentive encoder-decoder framework entitled “Show, attend and tell: Neural image caption generation with visual attention” by Xu et al. [xu_show_2015]. In this framework, depicted in figure 20a, at each time step attention generates a vector with a dynamic context of visual features based on the words previously generated, a principle very similar to that presented in RNNSearch [bahdanau_neural_2014]. Later, more elaborate methods using visual and textual sources were developed, mainly in image captioning, video captioning, and visual question answering, as shown in Table 5.
|Emotion Recognition||[tan2019multimodal] [zadeh2018memory] [zadeh2018multi]|
|Visual Question Answering||
For image captioning, Yang et al. [yang_review_2016] extended the seminal framework by Xu et al. [xu_show_2015] with review attention, a sequence of modules that capture global information in various stages of reviewing hidden states and generate more compact, abstract, and global context vectors. Zhu et al. [zhu_image_2018] presented a triple attention model that enhances object information at the text generation stage. Two attention mechanisms capture semantic visual information in the input, and a mechanism in the prediction stage better integrates word and image information. Lu et al. [lu_knowing_2017] presented an adaptive attention encoder-decoder framework that decides when to trust visual signals and when to trust only the language model. Specifically, their mechanism has two complementary elements: the visual sentinel vector decides when to look at the image, and the sentinel gate decides how much new information the decoder wants from the image. Recently, Pan et al. [pan2020x] created attentional mechanisms based on bilinear pooling capable of capturing high-order interactions between multi-modal features, unlike the classic mechanisms that capture only first-order feature interactions.
Similarly, in visual question-answering tasks, methods seek to align salient textual features with visual features via feedforward or recurrent soft attention methods [noauthor_bottom-up_nodate] [osman_dual_2018] [lu_hierarchical_2016] [yang_stacked_2016]. More recent approaches aim to generate complex inter-modal representations. In this line, Kim et al. [kim2020hypergraph] proposed Hypergraph Attention Networks (HANs), a solution to minimize the disparity between different levels of abstraction from different sensory sources. So far, HAN is the first approach to define a common semantic space with symbolic graphs of each modality and extract an inter-modal representation based on co-attention maps in the constructed semantic space, as shown in figure 20b. Liang et al. [liang_focal_2018] used attention to capture hierarchical relationships between sequences of image-text pairs that are not directly related. The objective is to answer questions and to justify which evidence the system based its answers on.
For video captioning, most approaches align textual features and spatio-temporal representations of visual features via simple soft attention mechanisms [cho_describing_2015] [yu_video_2015] [yao_describing_2015] [hori_attention-based_2017]. For example, Pu et al. [pu_adaptive_2018] design soft attention to adaptively emphasize different CNN layers while also imposing attention within local spatio-temporal regions of the feature maps at particular layers. These mechanisms define the importance of regions and layers to produce a word based on word-history information. Recently, self-attention mechanisms have also been used to capture more complex and explicit relationships between different modalities. Zhu et al. [zhu2020actbert] introduced ActBERT, a transformer-based approach trained via self-supervised learning to encode complex relations between global actions, local regional objects, and linguistic descriptions. Zhou et al. [zhou2018end] proposed a multimodal transformer trained via supervised learning, which employs a masking network to restrict its attention to the proposed event over the encoding feature.
Other applications also benefit from attention. In the emotion recognition domain, the main approaches use memory fusion structures inspired by the human brain’s communication understanding mechanisms. Biologically, different brain regions process and understand different modalities and are connected via neural links to integrate multimodal information over time. Similarly, in existing approaches, an attentional component models view-specific dynamics within each modality via recurrent neural networks, and a second component simultaneously finds multiple cross-view dynamics at each recurrence timestep and stores them in hybrid memories. Memory updates occur based on all the sequential data seen. Finally, to generate the output, the predictor integrates the two levels of information: view-specific and multiple cross-view memory information [zadeh2018memory] [zadeh2018multi].
There are still few multimodal methods for classification. Wang et al. [wang2018tienet] presented a pioneering framework for classifying and describing image regions simultaneously from textual and visual sources. Their framework detects, classifies, and generates explanatory reports regarding abnormalities observed in chest X-ray images through multi-level attentional modules trained end-to-end in LSTMs and CNNs. In the LSTMs, attention combines all hidden states and generates a dynamic context vector; then a spatial mechanism guided by a textual mechanism highlights the regions of the image with more meaningful information. Intuitively, the salient features of the image are extracted based on high-relevance textual regions.
5.4 Recommender Systems (RS)
Attention has also been used in recommender systems for behavioral modeling of users. Capturing user interests is a challenging problem for neural networks, as some interactions are transient, some clicks are unintentional, and interests can change quickly within the same session. Classical approaches (e.g., Markov Chains and RNNs) have limited performance in predicting the user’s next actions, perform differently on sparse and dense datasets, and suffer from long-term memory problems. In this sense, attention has been used mainly to assign weights to a user’s interacted items, capturing long- and short-term interests more effectively than traditional approaches. Self-attention and memory approaches have also been explored. The STAMP [liu2018stamp] model, based on attention and memory, manages users’ general interests in long-term memories and current interests in short-term memories, resulting in more coherent behavioral representations. The Collaborative Filtering [chen2017attentive] framework and SASRec [kang_deep_2018] explored self-attention to capture long-term semantics and find the most relevant items in a user’s history.
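As a rough illustration of assigning weights to a user’s interacted items, the sketch below pools item embeddings into a single user representation. It is not the STAMP or SASRec architecture: using the most recent item as the query, the scaled dot-product scorer, and the name `user_representation` are assumptions chosen to keep the example minimal.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def user_representation(item_embs, query):
    """Attention-weighted sum of the embeddings of interacted items.

    item_embs: list of embeddings, oldest interaction first
    query:     vector expressing the current interest (assumption: last item)
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(e, query) / scale for e in item_embs])
    pooled = [
        sum(w * e[d] for w, e in zip(weights, item_embs))
        for d in range(len(query))
    ]
    return pooled, weights


# Hypothetical session: the user clicked item A, then B, then A again.
items = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
pooled, weights = user_representation(items, query=items[-1])
```

In this toy session, the two clicks that match the current interest receive equal and larger weights than the unrelated click, so an unintentional interaction contributes less to the pooled representation.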
5.5 Reinforcement Learning (RL)
Attention has been gradually introduced in reinforcement learning to deal with unstructured environments in which rewards and actions depend on past states and where guaranteeing the Markov property is challenging. Specifically, the goals are to increase the agent’s generalizability and minimize long-term memory problems. Currently, the main attentional reinforcement learning approaches address computer vision, graph reasoning, natural language processing, and virtual navigation, as shown in Table 6.
|Natural Language Processing||[santoro2018relational]|
|Navigation||[mishra_simple_2017] [parisotto2017neural] [zambaldi2018relational] [santoro2018relational] [baker2019emergent]|
To increase the ability to generalize in partially observable environments, some approaches use attention in the policy network. Mishra et al. [mishra_simple_2017] combined attention with convolutions to capture long-term temporal dependencies in an agent’s visual navigation task in random mazes. At each time step, the model receives as input the current observation and previous sequences of observations, rewards, and actions, so that attention allows the policy to maintain a long memory of past episodes. Other approaches apply attention directly to the representation of the state. State representation is a classic and critical problem in RL, given that the state space is one of the major bottlenecks for the speed, efficiency, and generalization of training techniques. In this sense, the importance of attention on this topic is quite intuitive.
However, there are still few approaches exploring the representation of states. The neural map [parisotto2017neural] maintains an internal memory in the agent controlled via attention mechanisms. While the agent navigates the environment, an attentional mechanism alters the internal memory, dynamically constructing a history summary. At the same time, another mechanism generates a representation based on the contextual information of the memory and the current observation of the state. Then, the policy network receives this representation as input and generates the distribution over actions. Some more recent approaches affect the representation of the current observation of the state via self-attention in iterative reasoning between entities in the scene [zambaldi2018relational] [baker2019emergent], or between the current observation and memory units [santoro2018relational], to guide model-free policies.
The most discussed topic is the use of the policy network to guide the attentional focus of the agent’s glimpse sensors on the environment so that the representation of the state refers to only a small portion of the entire operating environment. This approach was first proposed by Mnih et al. [mnih_recurrent_2014] using policy gradient methods (i.e., the REINFORCE algorithm) in the hybrid training of recurrent networks in image classification tasks. Their model consists of a glimpse sensor that captures only a portion of the input image, a core network that maintains a summary of the history of patches seen by the agent, an action network that estimates the class of the image seen, and a location network trained via RL which estimates the focus of the glimpse on the next time step, as shown in figure 21a. This structure considers the network as the agent and the image as the environment, and the reward is the number of correct classifications made by the network in an episode. Stollenga et al. [stollenga_deep_2014] proposed a similar approach, but focused directly on CNNs, as shown in figure 21b. The structure allows each layer to influence all the others through attentional bottom-up and top-down connections that modulate the activity of the convolutional filters. After supervised training, the attentional connections’ weights implement a control policy learned via RL and SNES [schaul2011high]. The policy learns to suppress or enhance features at various levels, improving the classification of difficult cases not captured by the initial supervised training. Subsequently, variants similar to these approaches appeared in multiple image classification [ba_multiple_2014] [marcus_edel;joscha_lausch_capacity_2016] [zhao_deep_2017], action recognition [yeung_end--end_2016], and face hallucination [cao_attention-aware_2017].
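To make the role of the glimpse sensor concrete, the following sketch crops a square patch around a chosen location, so that only a small portion of the image is ever passed on. It is a simplified assumption rather than the sensor of [mnih_recurrent_2014]: the function name `glimpse`, the zero padding at the borders, and the single-resolution patch are illustrative choices, and in the full model the location would come from the location network.

```python
def glimpse(image, center, size):
    """Crop a size x size patch around `center` = (row, col).

    image: 2-D list of pixel intensities
    Pixels that fall outside the image are filled with 0.0 (assumption).
    """
    height, width = len(image), len(image[0])
    half = size // 2
    row0, col0 = center[0] - half, center[1] - half
    patch = []
    for r in range(row0, row0 + size):
        row = []
        for c in range(col0, col0 + size):
            inside = 0 <= r < height and 0 <= c < width
            row.append(image[r][c] if inside else 0.0)
        patch.append(row)
    return patch


# Toy 4 x 4 image whose pixel value encodes its position (hypothetical data).
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
corner_patch = glimpse(image, center=(0, 0), size=3)
inner_patch = glimpse(image, center=(2, 2), size=3)
```

Because the cost of processing a patch depends on `size` and not on the image dimensions, a sequence of such glimpses lets the agent handle large inputs while the core network accumulates what has been seen so far.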
In robotics, there are still few applications with neural attentional models. A small portion of current work is focused on control, visual odometry, navigation, and human-robot interaction, as shown in Table 7.
|Visual odometry||[xue2020deep] [johnston2020self] [kuo2020dynamic] [damirchi2020exploring] [gao2020attentional] [li2021transformer]|
|Navigation||[sophie] [social_attention] [crowd] [scene_memory]|
Navigation and visual odometry are the most explored domains, although still with very few published works. For classic DL approaches, navigation tasks in real or complex environments are still very challenging. These approaches have limited performance in dynamic and unstructured environments and over long-horizon tasks. In real environments, the robot must deal with dynamic and unexpected changes in humans and other obstacles around it. Also, decision-making depends on the information received in the past and the ability to infer the future state of the environment. Some seminal approaches in the literature have demonstrated the potential of attention to minimize these problems without compromising the techniques’ computational cost. Sadeghian et al. [sophie] proposed Sophie, an interpretable framework based on GANs for robotic agents in environments with human crowds. Via attention, their framework extracts two levels of information: a physical extractor learns spatial and physical constraints, generating a context vector that focuses on viable paths for each agent, while a social extractor learns the interactions between agents and their influence on each agent’s future path. Finally, GAN-based LSTMs generate realistic samples capturing the nature of future paths, and the attentional mechanisms allow the framework to predict physically and socially feasible paths for agents, achieving cutting-edge performance on several different trajectories.
Vemula et al. [social_attention] proposed a trajectory prediction model that uses spatio-temporal graphs to capture each person’s relative importance when navigating in the crowd, regardless of their proximity. Chen et al. [crowd] proposed crowd-aware robot navigation with attention-based deep reinforcement learning. Specifically, a self-attention mechanism models interactions between human-robot and human-human pairs, improving the robot’s capacity to infer future environment states. It also captures how human-human interactions can indirectly affect decision-making, as shown in figure 22a. Fang et al. [scene_memory] proposed a novel memory-based policy (i.e., the scene memory transformer, SMT) for embodied agents in long-horizon tasks. The SMT policy consists of two modules: 1) a scene memory, which stores all past observations in an embedded form, and 2) an attention-based policy network that uses the updated scene memory to compute a distribution over actions. The SMT model is based on an encoder-decoder Transformer and showed strong performance even as the agent moves through a large environment and the number of observations grows rapidly.
In visual odometry (VO), the classic learning-based methods treat the VO task as a pure tracking problem, recovering camera poses from fragments of the image, which leads to the accumulation of errors. Such approaches often disregard global information that is crucial for alleviating accumulated errors. However, it is challenging to preserve this information effectively in end-to-end systems. Attention represents an alternative, still little explored in this area, to alleviate such disadvantages. Xue et al. [xue2020deep] proposed an adaptive memory approach to avoid the network’s catastrophic forgetting. Their framework consists mainly of a memory, a remembering module, and a refining module, as shown in figure 22b. First, the remembering module selects the main hidden states based on camera movement and preserves the selected hidden states in the memory slot to build a global map. The memory stores the global information of the entire sequence, allowing refinements on previous results. Finally, the refining module estimates each view’s absolute pose, allowing previously refined outputs to pass through recurrent units, thus improving the next estimate.
Another common problem in classical VO approaches is selecting the features used to derive ego-motion between consecutive frames. Scenes contain dynamic objects and non-textured surfaces that generate inconsistencies in the estimation of movement. Recently, self-attention mechanisms have been successfully employed in the dynamic reweighting of features and in the semantic selection of image regions to extract more refined ego-motion [kuo2020dynamic] [damirchi2020exploring] [gao2020attentional]. Additionally, self-attentive neural networks have been used to replace traditional recurrent networks, which are slow to train and inaccurate in the temporal integration of long sequences [li2021transformer].
In human-robot interaction, Zang et al. [translating_navigation] proposed a framework that interprets navigation instructions in natural language and maps them to an executable navigation plan. The attentional mechanisms correlate navigation instructions very efficiently with the commands to be executed by the robot in a single trainable end-to-end model, unlike the classic approaches that use decoupled training and external interference during the system’s operation. In control, existing applications mainly use manipulator robots in visuomotor tasks. Duan et al. [one_shot] used attention to improve the model’s generalization capacity in imitation learning approaches with a complex manipulator arm. The objective is to build a one-shot learning system capable of successfully performing instances of tasks not seen during training. Thus, it employs soft attention mechanisms to process a long sequence of (state, action) demonstration pairs. Finally, Abolghasemi et al. [pay_attention] proposed a deep visuomotor policy with task-focused visual attention to make the policy more robust and to prevent the robot from releasing manipulated objects even under physical attacks.
A long-standing criticism of neural network models is their lack of interpretability [li2016understanding]. Academia and industry have a great interest in the development of interpretable models, mainly for the following reasons: 1) critical decisions: when critical decisions need to be made (e.g., medical analysis, stock market, autonomous cars), it is essential to provide explanations to increase human specialists’ confidence in the results; 2) failure analysis: an interpretable model can be inspected retrospectively to find where bad decisions were made and to understand how to improve the system; 3) verification: there is no evidence of the models’ robustness and convergence even with small errors in the test set. It is difficult to explain the influence of spurious correlations on performance and why the models are sometimes excellent in some test cases and flawed in others; and 4) model improvements: interpretability can guide improvements in the model’s structure if the results are not acceptable.
Attention as an interpretability tool is still an open discussion. For some researchers, it allows inspecting the models’ internal dynamics: the hypothesis is that the magnitude of the attentional weights is correlated with the data’s relevance for predicting the output. Li et al. [li2016understanding] proposed a general methodology to analyze the effect of erasing particular representations of neural networks’ input. When analyzing the effects of erasure, they found that attentional focuses are essential to understand networks’ internal functioning. The results in [serrano2019attention] showed that higher attentional weights generally have a more significant impact on the model’s decision, but that multiple weights taken together generally do not fully identify the most relevant representations for the final decision. The researchers concluded that attention is an ideal tool to identify which elements are responsible for the output but does not yet fully explain the model’s decisions.
Some studies have also shown that attention encodes linguistic notions relevant to understanding NLP models [vig2019analyzing] [tenney2019bert] [clark2019does]. However, Jain et al. [attention_is_not_explanation] showed that although attention improves NLP results, its ability to provide transparency or significant explanations for the model’s predictions is questionable. Specifically, the researchers investigated the relationship between attentional weights and model results by answering the following questions: (i) to what extent do attention weights correlate with metrics of feature importance, specifically gradient-based ones? and (ii) do different attentional maps produce different predictions? The results showed that the correlation between intuitive metrics of feature importance (e.g., gradient approaches, erasure of features) and attentional weights is low in recurrent encoders. Besides, selecting features according to distributions other than the learned attentional distribution did not significantly impact the output, as attentional weights permuted at random also induced minimal output changes. The researchers also concluded that such results depend significantly on the type of architecture, given that feedforward encoders obtained more coherent relationships between attentional weights and output than other models.
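The random permutation experiment mentioned above can be approximated with a small sketch. It is only an assumption-laden toy, not the protocol of [attention_is_not_explanation]: the model is a single attention-pooling layer followed by a logistic output, and the names `predict` and `permutation_effect` as well as the toy inputs are hypothetical. For each shuffle of the attention weights, it records how far the predicted probability moves from the original prediction.

```python
import math
import random
import statistics


def predict(alpha, hidden, w):
    """Binary prediction from attention-pooled hidden states."""
    context = [
        sum(a * h[d] for a, h in zip(alpha, hidden)) for d in range(len(w))
    ]
    logit = sum(c * wi for c, wi in zip(context, w))
    return 1.0 / (1.0 + math.exp(-logit))


def permutation_effect(alpha, hidden, w, n_perm=100, seed=0):
    """Median and maximum change in output when attention weights are shuffled."""
    rng = random.Random(seed)
    base = predict(alpha, hidden, w)
    diffs = []
    for _ in range(n_perm):
        shuffled = alpha[:]
        rng.shuffle(shuffled)
        diffs.append(abs(predict(shuffled, hidden, w) - base))
    return statistics.median(diffs), max(diffs)


# Case 1: identical hidden states, so the weights cannot matter.
same_median, same_max = permutation_effect(
    [0.7, 0.2, 0.1], [[1.0, 2.0]] * 3, w=[0.5, -0.25]
)

# Case 2: opposing hidden states, so swapping the weights flips the output.
diff_median, diff_max = permutation_effect(
    [0.9, 0.1], [[5.0], [-5.0]], w=[1.0]
)
```

The two cases show why such a test is architecture and data dependent: when the encoder produces nearly interchangeable hidden states, shuffled weights barely move the output, whereas distinct hidden states make the prediction sensitive to where attention is placed.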
Vashishth et al. [vashishth2019attention] systematically investigated explanations for the researchers’ distinct views through experiments on NLP tasks with single-sequence models, pair-sequence models, and self-attentive neural networks. The experiments showed that attentional weights in single-sequence tasks work like gates and do not reflect the reasoning behind the model’s prediction, justifying the observations made by Jain et al. [attention_is_not_explanation]. However, for pair-sequence tasks, attentional weights were essential to explaining the model’s reasoning. Manual tests have also shown that attentional weights are highly similar to human observers’ manual assessment of attention. Recently, Wiegreffe et al. [wiegreffe2019attention] also investigated these issues in depth through an extensive protocol of experiments. The authors observed that attention as an explanation depends on the definition of explainability considered. If the focus is on plausible explainability, the authors concluded that attention could help interpret model insights. However, if the focus is a faithful and accurate interpretation of the link that the model establishes between inputs and outputs, results are not always positive. These authors confirmed that good alternative distributions could be found in LSTMs and classification tasks, as hypothesized by Jain et al. [attention_is_not_explanation]. However, in some experiments, the alternative distributions obtained through adversarial training performed poorly compared with traditional attention mechanisms. These results indicate that attention mechanisms, trained mainly in RNNs, learn something significant about the relationship between tokens and prediction, which cannot be easily hacked. In the end, they showed that the effectiveness of attention as an explanation depends on the dataset and the model’s properties.
6 Trends and Opportunities
Attention has been one of the most influential ideas in the Deep Learning community in recent years, with several profound advances, mainly in computer vision and natural language processing. However, there is much space to grow, and many contributions are still to appear. In this section, we highlight some gaps and opportunities in this scenario.
6.1 End-To-End Attention models
Over the past eight years, most of the papers published in the DL literature have involved attentional mechanisms, and many state-of-the-art DL models use attention. Specifically, we note that end-to-end attention networks, such as Transformers [vaswani_attention_2017] and Graph Attention Networks [velickovic_graph_2018], have been expanding significantly and have been used successfully in tasks across multiple domains (Section 2). In particular, the Transformer has introduced a new form of computing in which the neural network’s core is fully attentional. Transformer-based language models like BERT [devlin2018bert], GPT2 [radford2019language], and GPT3 [brown2020language] are the most advanced language models in NLP. Image GPT [chen2020generative] has recently revolutionized the results of unsupervised learning in imaging. It is already a trend to propose Transformer-based models with sparse attentional mechanisms to reduce the Transformer’s complexity from quadratic to linear and to use attentional mechanisms to deal with multimodality in GATs. However, the Transformer is still an autoregressive architecture in the decoder and does not use other cognitive mechanisms such as memory. As research in attention and DL is still at an early stage, there is still plenty of space in the literature for new attentional mechanisms, and we believe that end-to-end attention architectures might be very influential in Deep Learning’s future models.
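The quadratic complexity mentioned above comes from the score matrix of self-attention, which has one entry per pair of tokens. The sketch below is a minimal single-head version of scaled dot-product self-attention under simplifying assumptions: queries, keys, and values are all the raw inputs (no learned projections), and the function name `self_attention` is illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens (assumption).

    tokens: list of n vectors of dimension d
    Returns the n output vectors and the n x n attention matrix.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    # One row of scores per query token: n rows of n entries, hence O(n^2).
    weights = [
        softmax([sum(q[i] * k[i] for i in range(d)) / scale for k in tokens])
        for q in tokens
    ]
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Three toy tokens of dimension two (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Doubling the number of tokens quadruples the number of entries in `weights`, which is the cost that sparse attention variants try to avoid by computing only a subset of the pairs.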
6.2 Learning Multimodality
Attention has played a crucial role in the growth of learning from multimodal data. Multimodality is extremely important for learning complex tasks. Human beings use different sensory signals all the time to interpret situations and decide which action to take. For example, while recognizing emotions, humans use visual data, gestures, and voice tones to analyze feelings. Attention has allowed models to learn the synergistic relationship between different sensory data, even when they are not synchronized, enabling the development of increasingly complex applications, mainly in emotion recognition [zhang2019deep], feelings [tan2019multimodal], and language-based image generation [unpublished2021dalle]. We note that multimodal applications have grown continually in recent years. However, most research efforts are still focused on relating a pair of sensory data, mostly visual and textual data. Architectures that can scale easily to handle more than one pair of sensors are not yet widely explored. Multimodal learning approaches exploring voice data, RGBD images, images from monocular cameras, and data from various sensors, such as accelerometers, gyroscopes, GPS, RADAR, and biomedical sensors, are still scarce in the literature.
6.3 Cognitive Elements
Attention introduced a new way of thinking about the architecture of neural networks. For many years, the scientific community neglected the use of other cognitive elements in neural network architectures, such as memory and logic flow control. Attention has made it possible to include in neural networks other elements that are widely important in human cognition. Memory Networks [weston_2014_memory] and the Neural Turing Machine (NTM) [graves_neural_2014] are essential approaches in which attention performs updates and retrievals in external memory. However, research on this topic is at an early stage. The Neural Turing Machine has not yet been explored in several application domains, being used only on simple datasets for algorithmic tasks, with slow and unstable convergence. We believe that there is plenty of room to explore the advantages of NTM in a wide range of problems and to develop more stable and efficient models. Memory Networks [weston_2014_memory] have seen some developments (Section 2), but few studies explore the use of attention to manage complex and hierarchical memory structures. Attention for managing different memory types simultaneously (i.e., working memory, declarative, non-declarative, semantic, and long- and short-term) is still absent in the literature. To the best of our knowledge, the most significant advances have been made in Dynamic Memory Networks [kumar_ask_2015] with the use of episodic memory. Another open challenge is how to use attention to plug external knowledge into memory and make training faster. Finally, one of the biggest challenges undoubtedly still lies in including other elements of human cognition, such as imagination, reasoning, creativity, and consciousness, working in harmony with attentional structures.
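As an illustration of how attention can retrieve content from an external memory, the sketch below implements a content-based read in the spirit of the addressing used by the Neural Turing Machine: memory rows are compared with a key by cosine similarity, the similarities are sharpened by a strength parameter, and a softmax turns them into read weights. It is a simplified sketch under assumptions: the function names, the small constant that guards against division by zero, and the toy memory are illustrative, and the location-based addressing and write operations of the full model are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b, eps=1e-8):
    """Cosine similarity; eps avoids division by zero (assumption)."""
    num = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return num / (norm_a * norm_b + eps)


def content_addressing(memory, key, strength):
    """Read weights over memory rows from similarity to the key."""
    return softmax([strength * cosine(key, row) for row in memory])


def read(memory, weights):
    """Weighted sum of memory rows (a soft, differentiable read)."""
    width = len(memory[0])
    return [sum(w * row[d] for w, row in zip(weights, memory)) for d in range(width)]


# Toy memory with two slots and a key that matches the first slot.
memory = [[1.0, 0.0], [0.0, 1.0]]
read_weights = content_addressing(memory, key=[1.0, 0.0], strength=10.0)
read_vector = read(memory, read_weights)
```

A larger `strength` concentrates the read on the best-matching slot, while a smaller value blends several slots, which is one way a controller can trade precision for smoother gradients.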
6.4 Computer Vision
Recurrent Attention Models (RAM) [mnih_recurrent_2014] introduced a new form of image computing using glimpses and hard attention. The architecture is simple, scalable, and flexible. The Spatial Transformer (STN) [jaderberg_spatial_2015] presented a simple module for learning image transformations that can be easily plugged into different architectures. We note that RAM has a high potential for many tasks in which convolutional neural networks have difficulties, such as processing large, high-resolution images. However, RAM has so far been explored only with simple datasets. We believe that it would be interesting to validate RAM in complex classification and regression tasks. Another proposal is to add new modules to the architecture, such as memory, multimodal glimpses, and scaling. It would also be interesting to explore STN in conjunction with RAM in classification tasks or to use STN to predict transformations between sets of images. RAM combined with STN can help address robustness to spatial transformations, learn the system dynamics in visual odometry tasks, enhance multiple-instance learning, and address multiple viewpoints.
6.5 Capsule Neural Network
Capsule networks (CapsNets), a new class of deep neural network architectures proposed recently by Sabour et al. [sabour2017dynamic], have shown excellent performance in many fields, particularly in image recognition and natural language processing. However, few studies in the literature implement attention in capsule networks. AR CapsNet [choi2019attention] implements a dynamic routing algorithm in which routing between capsules is performed through an attention module. The attention routing is a fast forward pass that preserves spatial information. DA-CapsNet [huang2020capsnet] proposes a dual attention mechanism: the first attention layer is added after the convolution layer, and the second after the primary capsules. SACN [hoogi2019self] is the first model that incorporates the self-attention mechanism as an integral layer. Recently, Tsai et al. [tsai2020capsules] introduced a new attentional routing mechanism in which a child capsule is routed to a parent capsule based on the agreement between the parent’s state and the child’s vote. We believe that attention is essential to improve the relational and hierarchical nature that CapsNets propose. Work on dynamic attentional routing of capsules and on incorporating capsules with self-attention, soft attention, and hard attention can bring significant improvements to current models.
6.6 Neural-Symbolic Learning and Reasoning
According to LeCun [lecun2015deep], one of the great challenges of artificial intelligence is to combine the robustness of connectionist systems (i.e., neural networks) with symbolic representation to perform complex reasoning tasks. While symbolic representation is highly recursive and declarative, neural networks encode knowledge implicitly by adjusting weights. For many decades, the fusion of connectionist and symbolic systems was overlooked by the scientific community. Only over the past decade has research on hybrid approaches using the two families of AI methodologies grown again. Approaches such as statistical relational learning (SRL) [khosravi2010survey] and neural-symbolic learning [besold2017neural] were proposed. Recently, attention mechanisms have been integrated into some neural-symbolic models, the development of which is still at an early stage. Memory Networks [weston_2014_memory] (Section 2) and the Neural Turing Machine [graves_neural_2014] (Section 2) were the first initiatives to include reasoning in deep connectionist models.
In the context of neural logic programming, attention has been exploited to reason about knowledge graphs or memory structures in order to combine the learning of the parameters and structures of logical rules. Neural Logic Programming [yang2017differentiable] uses attention in a neural controller that learns to select a subset of operations and memory content to execute first-order rules. Logic Attention Networks [wang2019logic] facilitate inductive KG embedding and use attention to aggregate information coming from graph neighbors with rules and attention weights. pGAT [harsha2020probabilistic] uses attention for knowledge base completion, which involves the prediction of missing relations between entities in a knowledge graph. While producing remarkable advances, recent approaches to reasoning with deep networks do not adequately address the task of symbolic reasoning. Current efforts are only about using attention to ensure efficient memory management. We believe that attention can be better explored to understand which pieces of knowledge are relevant to formulate a hypothesis and provide a correct answer, a capability that is rarely present in current neural reasoning systems.
6.7 Incremental Learning
Incremental learning is one of the challenges for the DL community in the coming years. Machine learning classifiers are trained to recognize a fixed set of classes. However, it is desirable to have the flexibility to learn additional classes with limited data without re-training on the complete training set. Attention can contribute significantly to advances in the area and has been little explored. Ren et al. [mengye_ren;renjie_liao;ethan_fetaya;richard_s._zemel_incremental_2019] introduced the seminal work in the area. They use Attention Attractor Networks to regularize the learning of new classes. In each episode, a set of new weights is trained to recognize new classes until they converge. Attention Attractor Networks help recognize new classes while remembering previously learned classes without revisiting the original training set.
6.8 Credit Assignment Problem (CAP)
In Reinforcement Learning (RL), an action that leads to a higher final cumulative reward should have more value. Therefore, more "credit" should be assigned to it than to an action that leads to a lower final reward. However, measuring the individual contribution of actions to future rewards is not simple and has been studied by the RL community for years. At least three variations of the CAP have been explored. The temporal CAP refers to identifying which actions were useful or useless in obtaining the final feedback. The structural CAP seeks to find the set of sensory situations in which a given sequence of actions will produce the same result. The transfer CAP refers to learning how to generalize a sequence of actions across tasks. Few works in the literature explore attention for the CAP. We believe that attention will be fundamental to advancing credit assignment research. Recently, Ferret et al. [ferret2019self] started the first research in the area by proposing a seminal work that uses attention to learn how to assign credit through a separate supervised problem and to transfer credit assignment capabilities to new environments.
6.9 Attention and Interpretability
Several investigations examine attention as an interpretability tool. Some recent studies suggest that attention can be considered reliable for this purpose. However, other researchers criticize the use of attention weights as an analytical tool. Jain and Wallace [attention_is_not_explanation] showed that attention is not consistent with other explainability metrics and that it is easy to construct attention distributions that differ from those of the trained model yet produce the same result. Their conclusion is that changing attention weights does not significantly affect the model’s prediction, contrary to research by Rudin [rudin2018please] and Riedl [riedl2019human] (Section 5.7). On the other hand, some studies have found that attention in neural models captures various notions of syntax and co-reference [vig2019analyzing] [clark2019does] [tenney2019bert]. Amid such confusion, Vashishth et al. [vashishth2019attention] investigated attention more systematically. They attempted to justify the two types of observation (that is, when attention is interpretable and when it is not) through experiments on several NLP tasks. The conclusion was that attention weights are interpretable and are correlated with feature importance metrics. However, this is only valid for cases where the weights are essential for the model's predictions and cannot simply be reduced to a gating unit. Despite the existing studies, there are numerous research opportunities to develop systematic methodologies to analyze attention as an interpretability tool. The current conclusions are based on experiments with few architectures in a specific set of NLP applications.
6.10 Unsupervised Learning
In the last decade, unsupervised learning has also been recognized as one of the most critical challenges of machine learning since, in fact, human learning is mainly unsupervised [lecun2015deep]. Some recent works have successfully explored attention within purely unsupervised models. In GANs, attention has been used to improve the global perception of a model (i.e., the model learns how much attention each part of the image should give to the others). SAGAN [zhang2018self] was one of the pioneering efforts to incorporate self-attention into convolutional GANs to improve the quality of the generated images. Image Transformer is an end-to-end attention network created to generate high-resolution images that significantly surpassed the state of the art on ImageNet in 2018. AttGAN [he2019attgan] uses attention to take advantage of multimodality to improve image generation. Combining a region of the image with the corresponding part of the word-context vector helps to generate new features with more details at each stage.
Attention has still been little explored as a means to make generative models simpler, more scalable, and more stable. Perhaps the only approach in the literature to explore these aspects more deeply is DRAW [draw], which presents a simple sequential way to generate images, refining image patches as more information is captured. However, the architecture was tested only on simple datasets, leaving room for new developments. Attention has also been little explored in autoencoders. Using VAEs, Bornschein et al. [bornschein2017variational] augmented generative models with external memory and used an attentional system to address and retrieve the corresponding memory content.
In Natural Language Processing, attention is explored in unsupervised models mainly to extract aspects for sentiment analysis. It is also used within autoencoders to generate semantic representations of phrases [zhang2017battrae][tian2019attention]. However, most studies still use attention with supervised learning, and the few existing unsupervised approaches still focus on computer vision and NLP. Therefore, we believe that there is still a long path for research and exploration of attention in the unsupervised context. In particular, we note that the construction of purely bottom-up attentional systems is not explored in the literature. Especially in the context of unsupervised learning, these systems can be of great value when accompanied by inhibition-of-return mechanisms.
6.11 New Tasks and Robotics
Although attention has been used in several domains, there are still potential applications that can benefit from it. Time series prediction, medical applications, and robotics are little-explored areas in the literature. Predicting time series becomes challenging as the size of the series increases, and attentional neural networks can contribute significantly to improving results. Specifically, we believe that exploring RAM [mnih_recurrent_2014] with multiple glimpses looking at different parts of the series or at different frequency ranges can introduce a new way of processing time series. In medical applications, there are still few works that explore biomedical signals in attentional architectures. There are opportunities to apply attention across applications, ranging from image segmentation and classification to support for disease diagnosis and for the treatment of conditions such as Parkinson’s, Alzheimer’s, and other chronic diseases.
For robotics, there are countless opportunities. For years, the robotics community has been striving for robots that perform tasks safely and with behaviors closer to those of humans. However, for this purpose DL techniques need to cope well with multimodality, active learning, incremental learning, identification of unknowns, uncertainty estimation, object and scene semantics, reasoning, awareness, and planning. Architectures like RAM [mnih_recurrent_2014], DRAW [draw], and Transformer [vaswani_attention_2017] can contribute substantially when applied to visual odometry, SLAM, and mapping tasks.
In this survey, we presented a systematic review of the literature on attention in Deep Learning to provide an overview of the area in terms of its main approaches, historical landmarks, uses of attention, applications, and research opportunities. In total, we critically analyzed more than 600 relevant papers published from 2014 to the present. To the best of our knowledge, this is the broadest survey in the literature, given that most of the existing reviews cover only particular domains with a smaller number of reviewed works. Throughout the paper, we identified and discussed the relationship between attention mechanisms and established deep neural network models, emphasizing CNNs, RNNs, and generative models. We discussed how attention led to performance gains, improvements in computational efficiency, and a better understanding of networks’ knowledge. We presented an exhaustive list of application domains, discussing the main benefits of attention and highlighting each domain’s most representative instances. We also covered recent discussions about attention in the explanation and interpretability of models, a branch of research that is widely discussed today. Finally, we presented what we consider trends and opportunities for new developments around attentive models. We hope that this survey will help the audience understand the different existing research directions and provide the scientific community with significant background for generating future research.
It is worth mentioning that our survey results from an extensive and exhaustive process of searching, filtering, and critical analysis of papers published between 01/01/2014 and 15/02/2021 in the central publication repositories for machine learning and related areas. In total, we collected more than 20,000 papers. After successive automatic and manual filtering, we selected approximately 650 papers for critical analysis and more than 6,000 for quantitative analyses, which correspond mainly to identifying the main application domains, places of publication, and main architectures. For automatic filtering, we used keywords from the area and set up different combinations of filters to eliminate noise from psychology and classic computational visual attention techniques (e.g., saliency maps). In manual filtering, we separated the papers by year and defined the originality and number of citations of the work as the main selection criteria. In the appendix, we provide our complete methodology and links to our search codes to facilitate future reviews on any topic in the area.
We are currently complementing this survey with a theoretical analysis of the main neural attention models. This complementary survey will help to address an urgent need for an attentional framework supported by taxonomies based on theoretical aspects of attention, which predate the era of Deep Learning. The few existing taxonomies in the area do not yet use theoretical concepts and are challenging to extend to various architectures and application domains. Taxonomies inspired by classical concepts are essential to understand how attention has acted in deep neural networks and whether the roles played corroborate the theoretical foundations studied for more than 40 years in psychology and neuroscience. This study is already in the final stages of development by our team and will hopefully help researchers develop new attentional structures with functions still little explored in the literature. We hope to make it available to the scientific community as soon as possible.
This survey employs a systematic review (SR) approach aiming to collect, critically evaluate, and synthesize the results of multiple primary studies concerning Attention in Deep Learning. The selection and evaluation of the works should be meticulous and easily reproducible. Also, an SR should be objective, systematic, transparent, and replicable. Although recent, the use of attention in Deep Learning is extensive. Therefore, we systematically reviewed the literature, collecting works from a variety of sources. The SR consists of the following steps: defining the scientific questions, identifying the databases, establishing the criteria for selecting papers, searching the databases, performing a critical analysis to choose the most relevant works, and preparing a critical summary of the most relevant papers, as shown in Figure 23.
This survey covers the following aspects: 1) The uses of attention in Deep Learning; 2) Attention mechanisms; 3) Uses of attention; 4) Attention applications; 5) Attention and interpretability; 6) Trends and challenges. These aspects provide the main topics regarding attention in Deep Learning, which can help understand the field’s fundamentals. The second step identifies the main databases in the machine learning area, such as arXiv, DeepMind, Google AI, OpenAI, Facebook AI research, Microsoft research, Amazon research, Google Scholar, IEEE Xplore, DBLP, ACM, NIPS, ICML, ICLR, AAAI, CVPR, ICCV, CoRR, IJCNN, Neurocomputing, and Google general search (including blogs, distill, and Quora). Our search period comprises 01/01/2014 to 06/30/2019 (first stage) and 07/01/2019 to 02/15/2021 (second stage), and the search was performed via a Python script (https://github.com/larocs/attention_dl). The papers’ title, abstract, year, DOI, and source publication were downloaded and stored in a JSON file. The most appropriate set of keywords to perform the searches was defined by partially searching the field with expert knowledge from our research group. The final set of keywords was: attention, attentional, and attentive.
However, these keywords are also relevant in psychology and visual attention. Hence, we performed a second selection to eliminate these papers and to remove duplicates and papers unrelated to the DL field. After removing duplicates, 18,257 different papers remained. In the next selection step, we performed a sequential combination of three types of filters: 1) Filter I: Selecting the works with general terms of attention (i.e., attention, attentive, attentional, saliency, top-down, bottom-up, memory, focus, and mechanism); 2) Filter II: Selecting the works with terms related to DL (i.e., deep learning, neural network, ann, dnn, deep neural, encoder, decoder, recurrent neural network, recurrent network, rnn, long short term memory, long short-term memory, lstm, gated recurrent unit, gru, autoencoder, ae, variational autoencoder, vae, denoising ae, dae, sparse ae, sae, markov chain, mc, hopfield network, boltzmann machine, em, restricted boltzmann machine, rbm, deep belief network, dbn, deep convolutional network, dcn, deconvolution network, dn, deep convolutional inverse graphics network, dcign, generative adversarial network, gan, liquid state machine, lsm, extreme learning machine, elm, echo state network, esn, deep residual network, drn, kohonen network, kn, turing machine, ntm, convolutional network, cnn, and capsule network); 3) Filter III: Selecting the works with specific words of attention in Deep Learning (i.e., attention network, soft attention, hard attention, self-attention, self attention, deep attention, hierarchical attention, transformer, local attention, global attention, coattention, co-attention, flow attention, attention-over-attention, way attention, intra-attention, self-attentive, and self attentive).
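The three keyword filters can be sketched as a term-counting step over each paper's title and abstract. The snippet below is an illustrative sketch only, not the authors' script: the term lists are abbreviated subsets of those listed above, the function names are hypothetical, and whole-word matching is an assumption made here so that short acronyms such as "ae" or "sae" do not match inside unrelated words.

```python
import re

# Abbreviated, illustrative subsets of the three term lists given in the text.
FILTER_I = ("attention", "attentive", "attentional", "saliency", "memory")
FILTER_II = ("deep learning", "neural network", "rnn", "lstm", "cnn",
             "encoder", "decoder")
FILTER_III = ("self-attention", "soft attention", "hard attention",
              "transformer")


def count_terms(text, terms):
    """Count how many distinct terms occur in `text` as whole words or phrases."""
    lowered = text.lower()
    return sum(
        1 for term in terms
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


def filter_counts(title, abstract=""):
    """Return the number of matched terms per filter (hypothetical helper)."""
    text = f"{title} {abstract}"
    return {
        "general": count_terms(text, FILTER_I),
        "deep_learning": count_terms(text, FILTER_II),
        "specific": count_terms(text, FILTER_III),
    }


counts = filter_counts(
    "Self-attention in a recurrent neural network",
    "We use an LSTM encoder with soft attention.",
)
print(counts)
```

Keeping the per-filter counts, rather than a single yes/no flag, makes it possible to apply count-based selection rules afterwards.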
The decision tree for the second selection is shown in Figure 24. The third filter selects works with at least one specific term of attention in deep learning. In the next step, for papers without an abstract, the filters verify that there is at least one specific Deep Learning term and remove the works containing the following keywords: visual attention, saliency, and eye-tracking. For papers with an abstract, the selection is more complex, requiring three cascade conditions: 1) First condition: selecting the works that have more than five terms from Filter II; 2) Second condition: selecting the works that have between three and five terms from Filter II and at least one of the following: attention model, attention mechanism, attention, or attentive; 3) Third condition: selecting the works that have one or two terms from Filter II and none of the terms salience, visual attention, attentive, and attentional mechanism. A total of 6,338 works remained for manual selection. We manually excluded the papers without a title, abstract, or introduction related to the DL field. After manual selection, 3,567 works were stored in Zotero. Given the number of papers, we grouped them by year and chose those above a threshold (the average number of citations in the group). Only works above the average were read and classified as relevant or not for critical analysis. To find the number of citations, we automated the process with a Python script. In total, 650 papers were considered relevant for this survey’s critical analysis, and 6,567 were used to perform quantitative analyses.
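The cascade conditions for papers with an abstract and the per-year citation threshold can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the `year` and `citations` fields are hypothetical, the terms are matched as case-insensitive substrings of the abstract, and the count of Filter II terms is assumed to be computed beforehand.

```python
from collections import defaultdict
from statistics import mean

# Term lists taken from the second and third cascade conditions in the text.
REQUIRED_ANY = ("attention model", "attention mechanism", "attention", "attentive")
EXCLUDED = ("salience", "visual attention", "attentive", "attentional mechanism")


def passes_cascade(abstract, n_filter_ii_terms):
    """Apply the three cascade conditions to a paper that has an abstract.

    `n_filter_ii_terms` is the number of Filter II (DL) terms found in the paper.
    """
    text = abstract.lower()
    if n_filter_ii_terms > 5:                      # first condition
        return True
    if 3 <= n_filter_ii_terms <= 5:                # second condition
        return any(term in text for term in REQUIRED_ANY)
    if 1 <= n_filter_ii_terms <= 2:                # third condition
        return not any(term in text for term in EXCLUDED)
    return False                                   # no DL term at all


def above_yearly_average(papers):
    """Keep papers whose citation count exceeds the mean of their publication year."""
    by_year = defaultdict(list)
    for paper in papers:
        by_year[paper["year"]].append(paper["citations"])
    threshold = {year: mean(counts) for year, counts in by_year.items()}
    return [p for p in papers if p["citations"] > threshold[p["year"]]]


papers = [
    {"year": 2019, "citations": 10},
    {"year": 2019, "citations": 2},
    {"year": 2020, "citations": 5},
]
selected = above_yearly_average(papers)
print(selected)
```

Note that with a strict "above average" rule, a year containing a single paper selects nothing, since that paper's citation count equals the group mean.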