In this section, we discuss the nature of attention based recurrent neural network (RNN) model. What does ”attention” mean? What are the applications of the attention based model in computer vision area? What is the attention based RNN model? What are the attention mechanisms introduced in this survey? What are the advantages of using attention based RNN model? These are the questions we would like to answer in this section.
1.1 What does ”attention” mean?
In psychology, limited by the processing bottlenecks, humans tend to selectively concentrate on a part of the information, and at the same time ignore other perceivable information. The above mechanism is usually called attention [anderson1990cognitive]
. For example, in human visual processing, although human’s eye has the ability to receive a large visual field, usually only a small part is fixated on. The reason is different areas of retina have different magnitude of processing ability, which is usually referred as acuity. And only a small area of the retina, fovea, has the greatest acuity. To allocate the limited visual processing resources, one needs to firstly choose a particular part of the visual field, and then focuses on it. For example, when humans are reading, usually the words to be read at the particular moment are attended and processed. As a result, there are two main aspects of attention:
Decide which part of the input needs to be focused on.
Allocate the limited processing resources to the important part.
The definition of attention introduced in psychology is very abstract and intuitive, so the idea is borrowed and widely used in computer science area, although some techniques do not contain exactly the word ”attention”. For example, a CPU will only load the data needed for computing instead of loading all data available, which can be seen as a naïve application of the attention. Furthermore, to make the data loading more effective, the multilevel storage design in computer architecture is employed, i.e., the allocation of computing resources is aided by the structure of multi caches, main memory, and storage device as shown in Figure 1
. In a big picture, all of them make up an attention model.
1.2 What are the applications of the attention based model in computer vision?
As mentioned above, the attention mechanism plays an important role in human visual processing. Some researchers also bring the attention into computer vision area. As in the human perception, a computer vision system should also focus on the important part of the input image, instead of giving all pixels the same weights. A simple and widely used method is extracting local image features from the input. Usually the local image features can be points, lines, corners, or small image patches, and then some measurements are taken from the patches and converted into descriptors. Obviously, the distribution of local features’ positions in a natural image is not uniform, which makes the system give different weights to different parts of the input image, and satisfies the definition of attention.
Saliency detection is another typical example directly motivated by human perception. As mentioned above, humans have ability to detect salient objects/regions rapidly and then attend on by moving the line of sight. This capability is also studied by the computer vision community, and many saliency object/region detection algorithms are proposed, e.g. [achanta2008salient, itti1998model, liu2011learning]. However, most of the saliency detection methods only use the low level image features, e.g., contrast, edge, intensity, etc., which makes them bottom-up, stimulus-driven, and fixed. So they cannot capture the task specific semantic information. Furthermore, the detections of different saliency objects/regions in an image are individually and independent, which is obviously unlike the humans, where the next attended region of an image usually is influenced by the previous perceptions.
The sliding window paradigm [dalal2005histograms, viola2001rapid] is another model which matches the essence of attention. It is common that the interested object only occupies a small region of the input image, and sometimes there are some limitations in the computational capabilities, so the sliding window is widely used in many computer vision tasks, e.g., object detection, object tracking, etc. However, since the size, shape and location of a window can be assigned arbitrarily, in theory there are infinite possible windows for an input image. Lots of work is devoted to reducing the number of windows to be evaluated, and some of them make notable speedups compared to the naïve method. But most of the window reducing algorithms are specifically designed for object detection or object tracking tasks, and the lack of universality makes them difficult to be applied in other tasks. Besides, most of the sliding window methods do not have the ability to take full advantage of the past processed data in future prediction.
1.3 What is the attention based RNN model?
Note that the attention is just a mechanism, or a methodology, so there is no strict definition in mathematics what it is. For example, the local image features, saliency detection, sliding window methods introduced above, or some recently proposed object detection methods like [gonzalez2015active], all employ the attention mechanism in different mathematical forms. On the other hand, as illustrated later, the RNN is a specific type of neural network with a specific mathematical description. The attention based RNN models in this survey refers in particular to the RNN models which are designed for sequence to sequence problems with the attention mechanism. Besides, all the systems in this survey are end-to-end trainable, where the parameters for both the attention module and the common RNN model should be learned simultaneously. Here we will give a general explanation of what the RNN is and what the attention based RNN model is, and the mathematical details are left to be described later.
1.3.1 Recurrent neural network (RNN)
Recently, the convolutional neural network (CNN) is very popular in the computer vision community. But the biggest problem for the CNN is that it only accepts a fixed length input vector and gives a fixed-length output vector. So it cannot handle the data with rich structures, especially the sequences. That is the reason why RNN is interesting, which can not only operate over sequences of input vectors, but also generate sequences of output vectors. Figure2 gives an abstract structure of a RNN unit.
The blue rectangles are the input vectors in a sequence, and the yellow rectangles are the output vectors in a sequence. By holding a hidden state (green rectangles in Figure 2), the RNN is able to process the sequence data. The dashed arrow indicates sometimes the output vectors do not have to be generated. The details of the structure of the neural network, the recurrent neural network, how the hidden state works, and how the parameters are learned will be described later.
1.3.2 The RNN for sequence to sequence problem
188.8.131.52 Sequence to sequence problem
As mentioned above, the RNN unit holds a hidden state which empowers it to process sequential data with variable sizes. In this survey, we focus on the RNN models for sequence to sequence problem. The sequence to sequence problem is defined as to map the input sequence to the output sequence, which is a very general definition. Here the input sequence does not refer strictly to a sequence of items, otherwise for different sequence to sequence problems, it could be in different forms. For example:
For the machine translation task, one usually needs to translate some sentences in the source language to the target language. In this case, the input sequence is a natural language sentence, where each item in the sequence is the word in the sentence. The order of the items in the sequence is critical.
In the video classification task, a video clip usually should be assigned a label. In this case, the input sequence is a list of frames of the video, where each frame is an image. And the order is critical.
The object detection task requires the model to detect some objects in an input image. In this case, the input sequence is only an image which can be seen as a list of objects. But obviously, it is not trivial to extract those objects from the image without additional data or information. So for the object detection task, the input of the model is just a single feature map (an image).
Similarly, the output sequence also does not always contain explicit items, but in this survey, we only focus on the problems where the output sequence has explicit items, although sometimes the number of items in the sequence is one. Besides, no matter the input is a feature map or it contains some explicit items, we always use ”input sequence” to represent them. And the term ”item” is used to describe the item in the sequence no matter it is contained by the sequence explicitly or implicitly.
184.108.40.206 The structure of the RNN model for sequence to sequence problem
Under the condition that the lengths of the input and output sequences can vary, it is not possible for a common RNN to directly construct the corresponding relationships between the items in the input and output sequences without any additional information. As a result, the RNN model needs to firstly learn and absorb all (or a part of) useful information from all of the input items, and then make the predictions. So it is natural to divide the whole system into two parts: the encoder, and the decoder [cho2015describing, SutskeverVL14]. The encoder encodes all (or a part of) the information of the input data to an intermediate code, and the decoder employs this code in generating the output sequence. In some sense, all neural networks can be cut into two parts in any position, the first half is called the encoder, and the second half is treated as the decoder, so here the encoder and decoder are both neural networks.
Figure 3 gives a visual illustration of a RNN model for sequence to sequence problem. The blue rectangles represent the input items, the dark green rectangles are the hidden states of the encoder, the shallow green rectangles represent the hidden states of the decoder, and the yellow rectangles are the output items. Although there are three input items and three output items in Figure 3, actually the number of input and output items can be arbitrary number and do not have to be the same. And the rectangle surrounded by the red box in Figure 3 serves as the ”middleman” between the input and the output which is the intermediate code generated by the encoder. In Figure 3 both the encoder and the decoder are recurrent networks, while with different types of the input sequence mentioned above, the encoder does not have to be recurrent all the time. For example, when the input is a feature map, a neural network without recurrence usually is applied as the encoder.
When dealing with a supervised learning problem, during training, usually the ground truth output sequence corresponds to an input sequence is available. With the predicted output sequence, one can calculate the log-likelihood between the items which have the same indices in the predicted and the ground truth output sequences. The sum of the log-likelihood of all items usually is set as the objective function of the supervised learning problem. And then usually the gradient decent/ascent is used to optimize the objective in order to learn the value of the parameters of the model.
1.3.3 Attention based RNN model
For sequence to sequence problems, usually there are some corresponding relations between the items in the output and input sequences. However for a common RNN model introduced above, the length of the intermediate code is fixed, which prevents the model giving different weights to different items in an input sequence explicitly, so all items in the input sequence has the same importance no matter which output item is attempted to be predicted. This inspires the researchers to add the attention mechanism into the common RNN model.
Besides, the encoding-decoding process can also be regarded as a compression process: firstly the input sequence is compressed into an intermediate code, and then the decoder decompresses the code to get the output. The biggest problem is that no matter what kind of input the model gets, it always compresses the information into a fixed length code, and then the decoder uses this code to decompress all items of a sequence. From an information theory perspective, this is not a good design since the length of the compressed code should have a linear relationship to the amount of information the inputs contain. As in the real file compression, the length of the compressed file is proportional to the amount of information the source files contain. There are some straightforward solutions to solve the problem mentioned above:
Build an extremely non-linear and complicated encoder model such that it can store large amount of information into the intermediate code.
Just use a long enough code.
Stay with the fixed length code, but let the encoder only encode a part of the input items which are needed for the current decoding.
Because the input sequence can be infinite long (or contain infinite amount of information), the first solution does not solve the problem essentially. Needless to say, it is even a more difficult task to estimate the amount of information some data contains. Based on the same reason, solution 2 is not a good choice either. In some extreme cases, even if the amount of information is finite, the code should still be large enough to store all possible input-output pairs, which is obviously not practical.
Solution 3 also cannot solve the problem when the amount of the input information needed for the current decoding is infinite. But usually a specific item in the output sequence only corresponds to a part of the input items. In this case, solution 3 is more practical and can prevent the problem in some degrees compared to other solutions if the needed input items can be ”located” correctly. Furthermore, it essentially has the same insights as the multilevel storage design pattern as shown in Figure 1. The RNN model which has the addressing ability is called the attention based RNN model.
With the attention mechanism, the RNN model is able to assign different weights to different parts of items in the input sequence, and consequently, the inherent corresponding relations between items in input and output sequences can be captured and exploited explicitly. Usually, the attention module is an additional neural network which is connected to the original RNN model whose details will be introduced later.
1.4 What are the attention mechanisms introduced in this survey?
In this survey, we introduce four types of attention mechanisms, which are: item-wise soft attention, item-wise hard attention, location-wise hard attention, and location-wise soft attention. Here we only give a brief and high-level illustration of them, and leave the mathematical descriptions in the next chapter.
For the item-wise and the location-wise, as mentioned above, the forms of the input sequence are different for different tasks. The item-wise mechanism requires that the input is a sequence containing explicit items, or one has to add an additional pre-processing step to generate a sequence of items from the input. However the location-wise mechanism is designed for the input which is a single feature map. The term ”location” is used because in most cases this kind of input is an image, and when treating it as a sequence of objects, all the objects can be pointed by their locations.
As described above, the item-wise method works on ”item-level”. After being fed into the encoder, each item in the input sequence has an individual code, and all of them make up a code set. During decoding, at each step the item-wise soft attention just calculates a weight for each code in the code set, and then makes a linear combination of them. The combined code is treated as the input of the decoder to make the current prediction. The only difference lies in the item-wise hard attention is that it instead makes a hard decision and stochastically picks some (usually one) codes from the code set according to their weights as the intermediate code fed into the decoder.
The location-wise attention mechanism directly operates on an entire feature map. At each decoding step, the location-wise hard attention mechanism discretely picks a sub-region from the input feature map and feeds it to the encoder to generate the intermediate code. And the location of the sub-region to be picked is calculated by the attention module. The location-wise soft attention still accepts the entire feature map as input at each step while otherwise makes a transformation on it in order to highlight the interesting parts instead of discretely picking a sub-region.
As mentioned above, usually the attention mechanism is implemented as an additional neural network connected to the raw RNN. The whole model should still be end-to-end, where both the raw RNN and the attention module are learned simultaneously. When it comes to soft and hard
, for the soft attention, the attention module is differentiable with respect to the inputs, so the whole system can still be updated by gradient ascent/decent. However, since the hard attention mechanism makes hard decisions and gives discrete selections of the intermediate code, the whole system is not differentiable with respect to its inputs anymore. Then some techniques from the reinforcement learning are used to solve learning problem.
In summary, the attention mechanisms are designed to help the model select better intermediate codes for the decoder. Figure 4 gives a visual illustration of the four attention mechanisms which are about to be introduced in this survey.
1.5 What are the advantages of using attention based RNN model?
Here we give some general advantages of the attention based RNN model. The detailed comparison between the attention based RNN model and the common RNN model will be illustrated later. Advantages:
First of all, as its name indicates, the attention based RNN model is able to learn to assign weights to different parts of the input instead of treating all input items equally, which can provide the inherent relations between the items the in input and output sequence. This is not only a way to boost the performance of some tasks, but it is also a powerful tool of visualization compared to the common RNN model.
The hard attention model does not need to process all items in the input sequence, instead, sometimes only processes the interested ones, so it is very useful for some tasks with only partially observable environments, like game playing, which usually cannot be handled by the common RNN model.
Not limited in computer vision area, the attention based RNN model is also suitable for all kinds of sequence related problems. For example:
Applications in natural language processing (NLP), e.g., machine translation[BahdanauCB14, MengLWLJL15], machine comprehension [HermannKGEKSB15], sentence summarization [RushCW15], word representation [LingDBT15, LingTAFDBTL15].
Speech recognition [ChanJLV15, MeiBW15].
Game play [mnih-atari-2013].
1.6 Organization of this survey
In Section 2, we describe the general mathematical form of the attention based RNN model, especially four types of the attention mechanisms: item-wise soft attention, item-wise hard attention, location-wise hard attention, and location-wise soft attention. Next, we introduce some typical applications of the attention based RNN model in computer vision area in Section 3. At last, we summarize the insights of the attention based RNN model, and discuss some potential challenges and future research directions.
2 The attention based RNN model
One can have a general idea of what the attention based RNN is from the last section. In this section, we firstly give a short introduction of the neural network, followed by an abstract mathematical description of the RNN. And then four attention mechanisms on RNN model are analyzed in detail.
2.1 Neural network
A neural network is a data processing and computational model consists of multiple neurons which connect to each other. The basic component of all neural network is the neuron whose name indicates its inspiration from the human neuron cell. Ignoring the biological implication, a neuron in the neural network accepts one or more inputs, reweighs the inputs and sums them, and finally gives one output by passing the sum through an activation function. Let the input to the neuron be, the corresponding weights be , and the output be . A visual example of a neuron is given in Figure 5, where is the activation function. And we have:
where is the bias parameter. If we add into and rewrite the previous equation in vector form, then
In summary, a neuron just applies an activation function on the linear transformation of the input parameterized by the weights. Usually the activation functions are designed to be non-linear in order to import non-linearity into the model. There are many different forms of activation functions, i.e., sigmoid, tanh, ReLU[NairH10], PReLU [HeZR015].
A neural network is usually constructed by the connections of many neurons, and the most popular structure is layer by layer connection as shown in Figure 6. In Figure 6, each circle represents a neuron, and each arrow represents a connection between neurons. The outputs of all neurons in a layer in Figure 6 are connected to the next layer, which can be categorized as fully-connected layer. In summary, if we treat the neural network as a black box and see it from outside, a simple mathematical form can be obtained:
where is the input, is the output, and is a particular model consists of many neurons parameterized by . One property of the model is that it is usually differentiable with respect to and . For more details about the neural network, one can refer to [Schmidhuber15].
The convolutional neural network (CNN) is a specific type of neural network with structures designed for image inputs [lecun1998gradient]
, which usually consists of multiple convolutional layers followed by a few fully-connected layers. The convolutional layer can take a region of a 2D feature map as input, and this is the reason why it is suitable for image data. Recently the CNN is one of the most popular model as candidate for image related problems. A deep CNN model trained on some large image classification dataset (e.g., ImageNet[DengDSLL009]) has good generalization power/ability and can be easily transferred to other tasks. For more details about CNN, one can refer to [Schmidhuber15].
2.2 A general RNN unit
As its name shows, the RNN is also a type of neural network, which therefore consists of a certain number of neurons as all kinds of neural network. The ”recurrent” in the RNN indicates that it performs the same operations for each item in the input sequence, and at the same time, keeps a memory of the processed items for future operations. Let the input sequence be . At each step , the operation performed by a general RNN unit can be described by:
where is the predicted output vector at time , and is the hidden state at time . is a neural network parameterized by which takes the -th input item (), and the previous hidden state as inputs. Here it is clear that the hidden state performs as a memory to store the previous processed information. From Equation (4), we can see the definition of a RNN unit is quite general, because there is no specific requirements for the structure of the network
. People have already proposed hundreds or even thousands of different structures which are out of the scope of this paper. Now two widely used structures are LSTM (long-short term memory)[HochreiterS97]
and GRU (gated recurrent units)[ChoMBB14].
2.3 The model for sequence to sequence problem
As shown in Figure 3, the model for sequence to sequence problem can be generally divided into two parts: the encoder, and the decoder, where both of them are neural networks. Let the input sequence be , and the output sequence be . As mentioned above, in some tasks the input sequence does not consists of explicit items, for which we use to represent the input sequence, and otherwise . Besides, in this survey, only the output sequence containing explicit items is considered, i.e., is the output sequence. The lengths of the input and the output sequences do not have to be the same.
As mentioned above, the core task of an encoder is to encode all (or a part of) the input items to an intermediate code to be decoded by the decoder. In the common RNN for sequence to sequence problem, the model does not try to figure out the corresponding relations between the items in the input and output sequences, and usually all items in the input sequence are compressed into a single intermediate code. So for the encoder,
where represents the encoder neural network parameterized by , is the intermediate code to be used by the decoder. Here the encoder can be any neural network, but its type usually depends on the input data. For example, when the input is treated as a single feature map, usually a neural network without recurrence is used as the encoder, like a CNN for the image input. But it is also common to set encoder as a recurrent network when the input is a sequence of data, e.g., a video clip, a natural language sentence, a human speech clip.
If the length of the output sequence is larger than 1, in most cases the decoder is recurrent, because the decoder at least needs to have the knowledge what it has already predicted to prevent the repeated prediction. Especially in the RNN model with the attention mechanism which will be introduced later, the weights of the input items are assigned with the guidance of the past predictions, so in this case, the decoder should be able to store some history information. As a result, in this survey we only consider the cases where the decoder is an RNN.
As mentioned above, in a common RNN, the decoder accepts the intermediate code as input, and at each step generates the predicted output , and the hidden state by
where represents the decoder recurrent neural network parameterized by , which accepts the code , the previous hidden state , and the previous predicted output as input. After steps of decoding, a sequence of predicted outputs is obtained.
Like many other supervised learning problems, the supervised sequence to sequence problem is also optimized by maximizing the log-likelihood. With the input sequence , the ground truth output sequence , and the predicted output sequence , the log-likelihood for the -th item in the output sequence is
where represents all learnable parameters in the model, and calculates the likelihood of based on . For example, in a classification problem where indicates the index of the label, the . The objective function is then set as the sum of the log-likelihood:
For the common encoder-decoder system, both the encoder and decoder network are differentiable with respect to their parameters, , and , respectively, so the objective function is differentiable with respect to . As a result, one can update the parameters by gradient ascent to maximize the sum of the log-likelihood.
2.4 Attention based RNN model
In this section, we give a comprehensive mathematical description of the four attention mechanisms. As mentioned above, essentially the attention module helps the encoder calculate a better intermediate code for the decoder. So in this section, the decoder part is identical as shown in Section 2.3.2, but the encoder network may not always be the same as in Equation (5). In order to make the illustration clear, we still keep the term to represent the encoder.
2.4.1 Item-wise soft attention
The item-wise soft attention requires the input sequence contains some explicit items . Instead of extracting only one code from the input , the encoder of the item-wise soft attention RNN model extracts a set of codes from :
where the size of does not have to be the same as , which means one can use multiple input items to calculate a single code or extract multiple codes from one input item. For simplicity, we just set . So
where is the encoder neural network parameterized by . Note here that the encoder is different from the encoder in Equation (5), because the encoder in the item-wise soft attention model only takes one item of the sequence as input.
During the decoding, the intermediate code fed into the decoder is calculated by the attention module, which accepts all individual codes , and the decoder’s previous hidden state as input. At the decoding step , the intermediate code is:
|(11)222For simplicity, in the following sections, we just use to represents the intermediate code fed into the decoder, and ignore the superscript .|
where represents the attention module parameterized by . Actually, as long as the attention module is differentiable with respect to , it satisfies the requirements of the soft attention. Here we only introduce the first proposed item-wise soft attention model in [BahdanauCB14] for natural language processing. In this item-wise soft attention model, at decoding step , for each input code , a weight is calculated by:
where usually is also a neural network within the attention module and calculates the unnormalized weight . Here the normalized weight
is explained as the probability that how the codeis relevant to the output , or the importance should be assigned to the -th input item when making the -th prediction. Note here that Equation (13) is a simple softmax function transforming the scores to the scale from 0 to 1, which makes the can be interpreted as a probability.
Since now we have the probabilities, then the code is just calculated by taken the expectation of all s with their probabilities s:
2.4.2 Item-wise hard attention
The item-wise hard attention is very similar to the item-wise soft attention. It still needs to calculate the weights for each code as shown from Equation (9) to Equation (13). As mentioned above, the can be interpreted as the probability that how the code is relevant to the output , so instead of a linear combination of all codes in , the item-wise hard attention stochastically picks one code based on their probabilities. In detail, an indicator is generated from a categorical distribution at decoding step to indicate which code should be picked:
where is a categorical distribution parameterized by the probabilities of the codes (). And works as an index in this case:
When the size of is 2, the above categorical distribution in Equation (15
) turns into a Bernoulli distribution:
2.4.3 Location-wise hard attention
As mentioned above, for some types of input , like an image, it is not trivial to directly extract items from them. So the location-wise hard attention model is developed which accepts the whole feature map as input, stochastically picks a sub-region from it, and uses this sub-region to calculate the intermediate code at each decoding step. This kind of location-wise hard attention is analyzed in many previous works [DenilBLF12, larochelle2010learning], while here we only focus on two recent proposed mechanisms in [MnihHGK14] and [ba-attention-2015], because the attention models in those works are more general and can be trained end-to-end.
Since the location-wise hard attention model only picks a sub-region of the input at the decoding step, it makes sense to estimate that the picked region may not correspond to the output item to be predicted. There are two potential solutions for this problem:
Pick only one input vector at each decoding step.
Make a few glimpses for each prediction, which means at each decoding step, pick some sub-regions, process them and update the model one by one, where the subsequent region (glimpse) is acquired based on the previous processed regions (glimpses). And the last prediction is treated as the ”real” prediction in this decoding step. The number of glimpses
either can be an arbitrary number (a hyperparameter) or automatically be decided by the model.
With a large enough training dataset, it is reasonable to estimate that method 1 can also find the optimal while the convergence may take longer time because of the limitation mentioned above. In testing, obviously method 1 is faster than method 2, but there may be a sacrifice on the performance. For method 2, it may have better performance in testing, while it also needs longer calculating time. Till now there are not so many works done on how to select an appropriate number of glimpses which leaves it an open problem. However, it is clear that method 1 is just an extreme case of method 2, and in the following sections, we treat the hard attention models as taking glimpses at each step of prediction.
To keep the notation consistent, here we still use term as the indicator generated by the attention model to show which sub-region should be extracted, i.e., in the case of location-wise hard attention, the indicates the location of the center of the sub-region is about to be picked. In detail, at the -th glimpse in the decoding step , the attention module accepts the input , the previous hidden states of the decoder (), and calculates by:
is a normal distribution with a mean
and a standard deviation. And usually is a neural network similar to the in Equation (12). Then the attention module picks the sub-region centered at location :
where indicates a sub-region of centered at , and represents the output sub-region at glimpse . Note here that the shape, size of the sub-region, and the standard deviation in Equation (18) are all hyperparameters. One can also let the attention network generate these parameters as the generation of in Equation (18) [yoo2015attentionnet]. When is generated, it is fed into the encoder in calculating the code for the decoder, i.e.,
and then we have
where is the -th hidden state of the decoder, and is the -th prediction.
2.4.4 Location-wise soft attention
As mentioned above, the location-wise soft attention also accepts a feature map as input, and at each decoding step, a transformed version of the input feature map is generated to calculate the intermediate code. It is firstly proposed in [jaderberg2015spatial]
called the spatial transformation network (STN). Instead of the RNN model, the STN in[jaderberg2015spatial] is initially applied on a CNN model, but it can be easily transferred to a RNN model with tiny modifications. To make the explanations more clear, we use to represent the input of the STN (where usually ), and to represent the output of the STN. Figure 8 gives a visual example of how the STN is embedded into the whole framework, and how it works at decoding step , in which the purple ellipse is the STN module. From Figure 8, we can see that after being processed by the STN, the is transformed to , and it is used to calculate the intermediate code which is then fed to the decoder. The details of how the STN works will be illustrated in the following part of this section.
Let the shape of be and the shape of be , where , , represent the height, width, and number of channels respectively, so when is a color image, its shape should be . The STN works identically on each channel to keep the consistency between the channels, so the number of channels of the input and output are the same (). A typical STN network consists of three parts shown in Figure 9: localization network, grid generator, and the sampler.
At each decoding time , the localization network takes the input , the previous hidden state of the decoder as input, and generates the transformation parameter as output:
|(25)444Since the original STN in [jaderberg2015spatial] is applied on a CNN, there is no hidden state. As a result, in [jaderberg2015spatial] the location network only takes as input, and Equation (25) changes to .|
Then the transformation parameterized by is applied on a mesh grid generated by the grid generator to produce a feature map which indicates how to select pixels555Here the pixel means the element of the feature map , which does not have to be an image. from and map them to . In detail, the grid consists of pixels of the output feature map , i.e.,
where and represent the coordinates of the pixel, and is the pixel of which locates at coordinate . And we have
where each shows the coordinates in the input feature map that defines a sampling position. Figure 10 gives a visual example illustrating how the transformation and the grid generator work. In Figure 10, the grid consists of the red dots in the right part of the figure. Then a transformation (the dashed green lines in the figure) is applied on to generate , which is the red (and blue) dots in the left part of the figure. Then indicates the positions to perform the sampling.
Actually, can be any transformations, for example, the affine transformation, the plane projective transformation, or the thin plate spline. In order to make the descriptions above more clear, we assume is a 2D affine transformation, so is a matrix consists of 6 elements:
Then Equation (27) can be rewritten as:
The last step is the sampling which is operated by the sampler. Because is calculated by a transformation, does not always correspond exactly to the pixels in , hence a sampling kernel is applied to map the pixels in to :
where can be any kernel as long as it is partial differentiable with respect to both and
, for example, one of the most widely used sampling kernel for images is the bilinear interpolation:
when the coordinates of and are normalized, i.e., and . A visual example of the bilinear interpolation is shown in Figure 11, where the blue dot represents a sampling position, and its value is calculated by a weighted sum of its surrounding pixels as shown in Equation (32). In Figure 11, the surrounding pixels of the sampling position are represented by four red intersection symbols.
Now if we review the whole process described above, the STN:
Accepts a feature map, and the previous hidden state of the decoder as input.
Generates parameters for a pre-defined transformation.
Generates the sampling grid and calculates the corresponding sampling positions in the input feature map based on the transformation.
Calculates the output pixel values by a pre-defined sampling kernel.
Therefore, for an input feature map , an output feature map is generated which has the ability to only focus on some interesting parts in . For instance, the affine transformation defined in Equation (30
) allows translation, rotation, scale, skew, and cropping, which is enough for most of the image related tasks. The whole process introduced above, i.e., feature map transformation, grid generation, and sampling, is inspired by the standard texture mapping in computer graphics.
At last the is fed into the encoder to generate the intermediate code :
2.5 Learning of the attention based RNN models
As mentioned in the introduction, the difference between the soft and hard attention mechanisms in optimization is that the soft attention module is differentiable with respect to its parameters so the standard gradient ascent/decent can be used for optimization. However, for the hard attention, the model is non-differentiable and the techniques from the reinforcement learning are applied for optimization.
2.5.1 Soft attention
For the item-wise soft attention mechanism, when in Equation (12) is differentiable with respect to its inputs, it is clear that the whole attention model is differentiable with respect to , as well as the whole RNN model is differentiable with respect to . As a result, the same sum of log-likelihood in Equation (8) is used as the objective function for the learning, and the gradient ascent is used to maximize the objective.
The location-wise soft attention module (STN) is also differentiable with respect to its parameters if the location network , the transformation and the sampling kernel are carefully selected to ensure their gradients with respect to their inputs can be defined. For example, when using a differentiable location network , the affine transformation (Equation (29), Equation (30)), and the bilinear interpolation (Equation (32)) as the sampling kernel, we have:
Now the gradients of () can be easily derived from Equation (25). As a result, the entire STN is differentiable with respect to its parameters. As in the item-wise soft attention, the objective function is the sum of the log-likelihood, and the gradient ascent is used for optimization.
2.5.2 Hard attention
As shown in Section 2.4.2 and Section 2.4.3, the hard attention mechanism stochastically picks an item or a sub-region from the input. In this case, the gradients with respect to the picked item/region are zero because they are discrete. The zero gradients cannot be used to maximize the sum of the log-likelihood by the standard gradient ascent. In general, this is a problem of training a neural network with discrete values/units, and here we just introduced the method applied in [MnihHGK14, ba-attention-2015, XuBKCCSZB15] which employs the techniques from the reinforcement learning. The learning methods for the item-wise and location-wise hard attention are essentially the same, but the item-wise hard attention implicitly sets the number of glimpse as 1. In this section, we just make the number of glimpses as .
Instead of the raw log-likelihood in Equation (7), a new objective is defined which is a variational lower bound on the log-likelihood 666For notational simplicity, we ignore the in the log-likelihood later, so in Equation (7) parameterized by (). And we have
Then the derivative of is:
The whole learning process described above for the hard attention is equivalent to the REINFORCE learning rule in [Williams92], and from the reinforcement learning perspective, after each step, the model can get a reward from the environment. In Equation (45), the is used as the reward . But in the reinforcement learning, the reward can be assigned to an arbitrary value. Depending on the tasks to be solved, here are some widely used schemes:
Set the reward to be exactly .
Set the reward to be a real value proportional to , which means , and is a hyperparameter.
Set the reward as a zero/one discrete value:
Then Equation (45) can be written as:
Now as shown in [MnihHGK14], although Equation (47
) is an unbiased estimation of the real gradient of Equation (44
), it may have high variance because of the unbounded. As a result, usually a variance reduction item is added to the equation:
where can be calculated in different ways which are discussed in [ba-attention-2015, BaGSF15, MnihHGK14, WeaverT01, XuBKCCSZB15]. A common approach is to set . By using the variance reduction variable , the training becomes faster and more robust. At last, a hyperparameter can be added to balance the two gradient components:
With the gradients in Equation (49), now the hard attention based RNN model can also be trained by gradient ascent. For the whole sequence , we have the new objective function :
In this section, we firstly give a brief introduction of the neural network, the CNN, and the RNN. Then we discuss the common RNN model for sequence to sequence problem. And the detailed descriptions of four types of attention mechanisms are given: item-wise soft attention, item-wise hard attention, location-wise hard attention, and location-wise soft attention, followed by how to optimize the attention based RNN models.
3 Applications of the attention based RNN model in computer vision
In this section, we firstly give a brief overview of the differences when one uses different attention based RNN models. And then two applications in computer vision which apply the attention based RNN model are introduced.
As mentioned above, the item-wise attention model requires that the input sequence consists of explicit items, which is not trivial for some types of input, like an image input. A simple solution is to manually extract some patches from the input images and treat them as a sequence of item. For the extraction method, one can either extract some fixed size patches from some pre-defined locations, or apply other object proposal methods, like [UijlingsSGS13].
The location-wise attention model can directly accept a feature map as input which avoids the ”items extraction” step mentioned above. And if compared to human perception, the location-wise attention is more appealing. Because when human being sees an image, he will not manually divide the image into some patches and calculate the weights for them before recognizing the objects in the image. Instead, he will directly focus on an object and its surroundings. That is exactly what the location-wise attention does.
The item-wise attention model has to calculate the codes for all items in the input sequence, which is less efficient than the location-wise attention model.
The soft attention is differentiable with respect to its inputs, so the model can be trained by optimizing the sum of the log-likelihood using gradient ascent. However, the hard attention is non-differentiable, usually the techniques from the reinforcement learning are applied to maximize a variational lower bound of the log-likelihood, and it makes sense to estimate there will be a sacrifice on the performance.
The STN keeps both the advantages of the location-wise attention and the soft attention, where it can directly accept a feature map as input and optimize the raw sum of log-likelihood. However, it is just recently proposed, and originally applied on CNN. Currently, there are not enough applications of the STN based RNN model.
3.2 Image classification and object detection
Image classification is one of the most classical and fundamental applications in computer vision area, where each image has a (some) label(s), and the model needs to learn how to predict the label given a new input image. Now the most popular and successful model for image classification task is the CNN. While as mentioned above, the CNN takes one fixed size vector as input, which has some disadvantages:
When the input image is larger than the input size accepted by the CNN, either the image needs to be rescaled to a smaller size to meet the requirements of the model, or the image needs to be cut into some subparts to be processed one by one. If the image is rescaled smaller, there are some sacrifices on the details of the image, which may harm the classification performance. On the other hand, if the image is cut into some patches, the amount of computation will scale linearly with the size of the image.
The CNN has a degree of spatial transformation invariance build-in, while when the image is noisy or the object indicated by the label only occupies a small region of the input image, a CNN model may not still keep satisfied performance.
As a result, [MnihHGK14] and [ba-attention-2015] propose the RNN based image classification models with the location-wise hard attention mechanism. As explained above, the location-wise hard attention based model stochastically picks a sub-region from the input image to generate the intermediate code. So the models in [MnihHGK14] and [ba-attention-2015] can not only make prediction of the image label, but also localize the position of the object. This means the attention based RNN model integrates the image classification and object detection into a single end-to-end trainable model, which is another advantage compared to the CNN based object detection model. If one wants to apply a convolution network to solve the object detection problem, he has to use a separate model to propose the potential locations of the objects, like [UijlingsSGS13], which is expensive.
In this section, we will briefly introduce and analyze the models proposed in [MnihHGK14], [ba-attention-2015], and their extensions. A brief structure of the RNN model in [MnihHGK14] is shown in Figure 12. In this model, is the encoder neural network taking a patch of the input image to generate the intermediate code, where is cut from the raw image based on a location value generated by the attention network. In detail, is generated by Equation (18), and is extracted by Equation (19). Then the decoder accepts the intermediate code as input to make the potential prediction. Here in Figure 12 the decoder and the attention network are put in the same network .
The experiments in [MnihHGK14] compare the performances of the proposed model with some non-recurrent neural networks, and the results show the superiority of the attention based RNN model to the non-recurrent neural networks with similar number of parameters, especially on the datasets with noise. For the details of the experiments, one can refer to Section 4.1 in [MnihHGK14].
However, the original model shown in Figure 12 is very simple, where is a two-layer neural network, and is a three-layer neural network. Besides, the first glimpse ( in Figure 12) is assigned manually. And all the experiments are only conducted on some toy datasets. There is no evidence to prove the generalization power of the model in Figure 12 to some real-world problems. [ba-attention-2015] proposes an extended version of model as shown in Figure 13 by firstly making the network deeper, and secondly using a context vector to obtain a better first glimpse.
The encoder () in [ba-attention-2015] is deeper, i.e., it consists of three convolutional layers and one fully connected layer. Another big difference between the model in Figure 13 and the model in Figure 12 is that the model in Figure 13 adds an independent layer as the attention network in order to make the first glimpse as accurate as possible, i.e., the model extracts a context vector from the whole image ( in the figure), and feeds it to the attention model to generate the first potential location . The same context vector of the whole image is not fed into the decoder network , because [ba-attention-2015] observes that if so, the predicted label are highly influenced by the information of the whole image. Besides, both and are recurrent, while in Figure 12, only one hidden state exists.
Both two models introduced above are evaluated on a dataset called ”MNIST pairs” which is generated from the MNIST dataset. MNIST [MNIST] is a handwritten digits dataset, which contains 70000 28×28 binary images (Figure 13(a)). The MNIST pairs dataset randomly puts two digits into a 100×100 image with some additional noise in the background [MnihHGK14] (Figure 13(b)). The results are shown in Table 1. It is clear that the performance of [ba-attention-2015] is better, and the context vector indeed makes some improvements by suggesting a more accurate first glimpse.
|[ba-attention-2015] without the context vector||7%|
The model in [ba-attention-2015] are also tested on multi-digit street view house number (SVHN) [netzer2011reading] sequence recognition task, and the results show that with the same level of error rates, the number of parameters to be learned, and the training time of the attention based RNN model are much less than the state-of-the-art CNN methods.
However nowadays both MNIST and SVHN are toy datasets with constraints environments, i.e., both of them are about digits classification and detection. Besides, their sizes are relatively small compared to other popular image classification/detection datasets, like the ImageNet. [SermanetFR14] further extends the model in [ba-attention-2015] with only a few modifications and applies it to a real-world image classification task: Stanford Dogs fine-grained categorization task [khosla2011novel], in which the images have larger clutter, occlusion, and variations in pose. The biggest change made in [SermanetFR14] is that a part of the GoogLeNet [SzegedyLJSRAEVR15] is used as the encoder, which is a pre-trained CNN on ImageNet dataset. The performance shown in Table 2 indicates that the attention based RNN model indeed can be applied to real-world datasets while still keep a competitive performance.
|GoogLeNet 96×96 (the encoder in [SermanetFR14])||0.42|
|RNN with location-wise hard attention [SermanetFR14]||0.68|
3.3 Image caption
Image caption is a very challenging problem: by given an input image, the system needs to generate a natural language sentence to describe the content of the image. The classical way of solving this problem is by dividing the problem into some sub-problems, like object detection, objects-words alignment, sentence generation by the template, etc., and solving them independently. This process will obviously make some sacrifices on the performance. With the development and success of the recurrent network in machine translation area, the image caption problem can also be treated as a machine translation problem, i.e., the image can also be seen as a language, and the system just translate it to another language (natural language, like English). The RNN based image caption system is recently proposed in [VinyalsTBE15] (Neural Image Caption, NIC) which perfectly fits our encoder-decoder RNN framework without the attention mechanism.
The general structure of the NIC is shown in Figure 15. With an input image, the same trick mentioned above is applied: a pre-trained convolutional network is used as the encoder, and the output of a fully-connected layer is the intermediate code. Then it is followed by a recurrent network serving as the decoder to generate the natural language caption, and in NIC, the LSTM unit is used. Compared to the pervious methods which decompose the image caption problem into some sub-problems, the NIC is end-to-end, and is directly trained on the raw data, so the whole system is much simpler and can keep all information of the input data. The experimental results show the superiority of the NIC compared to the traditional methods.
However, one problem for the NIC is that only a single image representation (the intermediate code) is obtained by the encoder from the whole input image, which is counterintuitive, i.e., when human beings describe an image by natural language, usually some objects or some salient areas are focused on one by one. So it is natural to import the ability of ”focusing” by applying the attention based RNN model.
[XuBKCCSZB15] adds the item-wise attention mechanisms into the NIC system. The encoder is still a pre-trained CNN, while instead of the fully-connected layer, the output of a convolutional layer is used to calculate the code set in Equation (9). In detail, at each decoding step the whole image is fed into the pre-trained CNN, and the output of the CNN is a 2D feature map. Then some pre-defined 14×14 sub-regions of the feature map are extracted as the items in . [XuBKCCSZB15] implements both the item-wise soft and the item-wise hard attention mechanisms. For the item-wise soft attention, at each decoding step a weight for each code in is calculated by Equation (12) and Equation (13), and then an expectation of all codes in (Equation (14)) is used as the intermediate code. For the item-wise hard attention, the same weights for all codes in are calculated and the index of the code to be picked is generated by Equation (15).
By using the attention mechanism, [XuBKCCSZB15] reports improvements compared to the common RNN models without attention mechanism including NIC. However, both NIC and [XuBKCCSZB15] have participated in the MS COCO Captioning Challenge 2015777MS COCO Captioning Challenge 2015, http://mscoco.org/dataset/#captions-challenge2015, and the competition gives opposite results. MS COCO Captioning Challenge 2015 is an image caption competition running on one of the biggest image caption datasets: Microsoft COCO [LinMBHPRDZ14]
. Microsoft COCO contains 164k images in total where each image is labeled by at least 5 natural language captions. The performance is evaluated by human beings as well as some commonly used evaluation metrics (BLEU[PapineniRWZ02], Meteor [denkowski2014meteor], ROUGE-L [lin2004rouge] and CIDEr-D [VedantamZP15]). There are 17 teams participating the competition in total, and here we list the rankings of some top teams in Table 3 including [XuBKCCSZB15].
|Model||Human||Other evaluation metrics|
|MSR Captivator [DevlinCFGDHZM15]||5||4||1||1||1||1||2||2||2|
|Berkeley LRCN [DonahueHGRVDS15]||6||6||10||7||8||8||7||6||8|
Except for the team ”Human” and the system proposed by [XuBKCCSZB15], all other teams shown in Table 3 use models without the attention mechanism. We can notice that the attention based model performs worse than NIC on both the human evaluations and the automatic evaluation metrics, which suggests that there are still many spaces for improvements. In addition, by comparing the human and the automatic evaluation metrics, the attention based RNN model performs worse in the automatic evaluation metrics than in the human evaluations. This may suggest that, on one side, the current automatic evaluation metrics are still not perfect, and on the other side, the attention based RNN model can generate captions which are more ”natural” for human beings.
In addition to the performance, another big advantage of the attention based RNN model is that it makes the visualization easier and much more intuitive as mentioned in Section 1.5. Figure 16 gives a visual example showing where the model attends in the decoding process. It is clear the attention mechanism indeed works, for example, when generating the word ”people”, the model focuses on the people, and when generating ”water”, the lake is attended.
The model in [XuBKCCSZB15] uses fixed-size feature maps (14×14) to construct the code set , and each feature map corresponds to a fixed-size patch of the input image. [JinFCSZ15] claims that this design may harm the performance because some ”meaningful scenes” may only occupy a small part of the image patch or cannot be cover by a single patch, where the meaningful scene indicates the objects/scenes correspond to the word is about to be predicted. [JinFCSZ15] makes an improvement by firstly extract many object proposals [UijlingsSGS13] from an input image, which potentially contain the meaningful scenes, as the input sequence. And then each proposal is used to calculate the code in . However, [JinFCSZ15] neither directly compares its performance to [XuBKCCSZB15], nor attends the MS COCO Captioning Challenge 2015. So here we cannot directly measure if and how the performance will be improved by using the object proposals to construct the input sequence. When analyzing in theory, we doubt if the improvements can really be obtained, because the currently popular object proposal methods only extract regions which probably contain some ”objects”. But a verb or an adjective in the ground truth image caption may not correspond to any explicit objects. Still, the problem proposed in [JinFCSZ15] is very interesting and cannot be ignored: when the input of a RNN model is an image, it is not trivial to obtain a sequence of items from it. One straightforward solution is to use the location-wise attention models like the location-wise hard attention or the STN.
4 Conclusion and future research
In this survey, we discuss the attention based RNN model and its applications in some computer vision tasks. Attention is an intuitive methodology by giving different weights to different parts of the input which is widely used in the computer vision area. Recently with the development of the neural network, the attention mechanism is also embedded to the RNN model to further enhance its ability to solve sequence to sequence problem. We illustrate how the common RNN works, and gives detailed descriptions of four different attention mechanisms, and their pros and cons are also analyzed.
The item-wise attention based model requires that the input sequence contains explicit items. For some types of the input, like an image input, an additional step is needed to extract items from the input. Besides, the item-wise attention based RNN model needs to feed all items in the input sequence to the encoder, so on one side, it results in a slow training process, and on the other side, the testing efficiency is also not improved compared to the RNN without the attention module.
The location-wise attention based model allows the input to be an entire feature map. At each time, the location-wise attention based model only focuses on a part of the input feature map, which gives efficiency improvements in both training and testing compared to the item-wise attention model.
The soft attention module is differentiable with respect to its inputs, so the standard gradient ascent/decent can be used for optimization.
The hard attention module is non-differentiable, and the objective is optimized by some techniques from the reinforcement learning.
At last, some applications in computer vision which apply the attention based RNN models are introduced. The first application is image classification and object detection which applies the location-wise hard attention mechanism. The second one is the image caption which uses both the item-wise soft and hard attention. The experimental results demonstrate that
Considering the performance, the attention based RNN model is better than, or at least comparable to, the model without the attention mechanism.
The attention based RNN model makes the visualization more intuitive.
The location-wise soft attention model, i.e., the STN, is not introduced in the applications, because it is firstly proposed on the CNN and currently there are not enough applications of the STN based RNN model.
Since the attention based RNN model is a new topic, it has not been well addressed yet. As a result, there are many problems which either need theoretical analyses or practical solutions. Here we list a few of them:
A large majority of the current attention based RNN models are only applied on some small datasets. How to use the attention based RNN model to larger datasets, like the ImageNet dataset?
Currently, most of the experiments are performed to compare the RNN models with/without the attention mechanism. More comprehensive comparisons of different attention mechanisms for different tasks are needed.
In many applications the input sequences contain explicit items naturally, which makes them fit the item-wise attention model. For example, most of the applications in NLP treat the natural sentence as the input, where each item is a word in the sentence. Is it possible to convert these types of inputs to a feature map and apply the location-wise attention model?
The current hard attention mechanism uses techniques from the reinforcement learning in optimization by maximizing an approximation of the sum of log-likelihood, which makes a sacrifice on the performance. Can the learning method be improved? [BaGSF15] gives some insights but it is still an open question.
How to extend the four attention mechanisms introduced in this survey? For example, an intuitive way of extension is to put multiple attention modules in parallel instead of only embedding one into the RNN model. Another direction is to further generalize an RNN and make it have a similar structure as the modern computer architecture [weston2014memory, graves2014neural], e.g. a multiple cache system shown in Figure 1, where the attention module serves as a ”scheduler” or an addressing module.
In summary, the attention based RNN model is a newly proposed model, and there are still many interesting questions remaining to be answered.