Sequential visual tasks usually require a model to attend to the object of current interest conditional on its previous observations. Different from the popular soft attention mechanism, we propose a new attention framework by introducing a novel conditional global feature, which represents a weak feature descriptor of the currently focused object. Specifically, in a standard CNN (Convolutional Neural Network) pipeline, convolutional layers with different receptive fields are used to produce attention maps by measuring how the convolutional features align with the conditional global feature. The conditional global feature can be generated by different recurrent structures according to the visual task, such as a simple recurrent neural network for multiple objects recognition, or a moderately complex language model for image captioning. Experiments show that our proposed conditional attention model achieves the best performance on the SVHN (Street View House Numbers) dataset with and without extra bounding boxes; for image captioning, our attention model obtains better scores than the popular soft attention model.
Recent years have witnessed the important role of attention mechanisms. In computer vision, as in the human visual system, attention does not need to cover the whole image, but only its salient areas. For example, several works embedded attention mechanisms into image captioning, which enables the model to learn to automatically generate a caption describing the content of an image. Subsequently, attention approaches were introduced into the emerging visual question answering (VQA) task, which greatly improved the overall performance.
A recent work proposed a novel end-to-end trainable attention module for convolutional neural network architectures. The core idea lies in estimating the attention maps by measuring how the local convolutional features align with the global feature, which differs from previous attention approaches. Although this approach has experimentally proven its capability on weakly supervised object recognition and query tasks, it is only suitable for a single object: the trained global feature represents only one specific object according to the image label.
In this paper, we take a further step on that work and extend its capability to sequential visual tasks, such as multiple objects recognition and image captioning. Inspired by recent work on attention in image captioning and by this new attention mechanism, we propose a novel conditional attention framework for sequential visual tasks. In order to generate the attention map sequentially for each object, we design a conditional global feature and treat it as a feature descriptor of the currently focused object. The attention feature vector is then produced sequentially through a compatibility function between the convolutional features and the conditional global feature. Note that our conditional attention framework is different from the popular soft attention architecture. Instead of predicting a probability distribution over image locations through an attention network, we use the dot product as the compatibility function between the local features and the conditional global feature to obtain a score map that highlights the relevant areas of the image and suppresses background information. Fig. 1 shows the difference between the popular soft attention model and our proposed model.
The contribution of our work is threefold. Firstly, unlike the popular soft attention model, a very different conditional attention framework is proposed to tackle sequential visual tasks. Secondly, we introduce the novel conditional global feature, which can be regarded as a weak feature descriptor for a particular object in a sequential visual task; the attention map for each object can then be estimated simply by measuring how the local convolutional features align with the conditional global feature. Thirdly, on the popular multi-digit Street View House Numbers (SVHN) dataset, our new attention model achieves the best performance among state-of-the-art methods, with and without extra bounding boxes.
Attention can be achieved in two ways: post-hoc mechanisms and trainable attention in CNNs. One post-hoc approach computes a class map to capture the internal changes of a deep convolutional neural network for image classification. Subsequently, feedback CNNs were proposed to produce a saliency map that shows the network's attention on expected objects in the image. Notably, all of these methods produce attention maps by passing through well-trained CNN models, which is called post-hoc processing.
In recent years, many studies have demonstrated that obtaining attention maps by optimizing the weights of attention modules while training the CNN can achieve better performance. Trainable attention in CNNs is divided into two categories: hard attention and soft attention. Hard attention is a stochastic sampling method and is therefore non-differentiable, so the attention module must be trained via reinforcement learning methods, as in the recurrent attention model (RAM). Soft attention computes a weight vector as the attention probability, which is differentiable and can be trained easily. Trainable attention in CNNs, especially soft attention, has been applied to a variety of visual tasks, for example query-based tasks and visual question answering (VQA). In particular, for query-based tasks, the learn-to-pay-attention method directly uses a learned global feature to query images, unlike previous methods that perform the query with a one-hot encoding of category labels. However, the learned global feature does not vary with its context, which prevents this attention method from being applied to sequential visual tasks, such as weakly supervised multiple objects recognition and image captioning.
Recent image captioning methods embed attention into the encoder-decoder framework. For example, two popular attention methods (soft and hard) were proposed that not only generate meaningful words but also highlight the corresponding regions of interest in the image. To enhance image captioning, one approach combines top-down and bottom-up strategies to fuse the features extracted by both. Another incorporates spatial and channel-wise attention in a convolutional neural network, embedding these attentions into different layers to represent multi-scale features. In order to highlight image regions more accurately, a further approach uses bottom-up attention to obtain salient image regions with a Faster R-CNN-like technique, and combines it with top-down attention as a language model to produce more coherent sentences.
In this section, we first introduce the conditional global feature, the key idea of our proposed conditional attention framework for sequential visual tasks. For multiple objects recognition, a simple recurrent neural network can be used to generate the conditional global feature. For image captioning, we demonstrate that a language model can easily be incorporated into this attention framework.
Our conditional attention framework is mainly inspired by work that introduces a compatibility measure between local features and global features. For sequential visual tasks, processing the objects of an image usually consists of T steps. At each step t, the model needs to produce an attention map of the current object and its corresponding attention feature. The trained global feature must therefore change accordingly to represent different objects, so a recurrent structure can naturally be exploited to provide the conditional information for the variable global feature. Thus we propose the conditional global feature $g_t$, the output of the recurrent structure given the previous attention feature $g^a_{t-1}$ and the last recurrent state $h_{t-1}$. In this paper, the conditional global feature refers to the output vector of each LSTM cell:

$$g_t = h_t = \mathrm{LSTM}(g^a_{t-1}, h_{t-1})$$

Note that $g_t$ is actually the updated recurrent state of the LSTM, but it can be regarded as a weak feature descriptor of the focused object, which is the key design of our conditional attention framework.
For the multiple objects recognition task, the conditional attention model combined with a simple recurrent network is shown in Fig. 2. Once the conditional global feature is calculated, the attention maps of the currently focused object can be generated by the conditional attention submodule; see Section 3.2 for details. Notably, at the first step, the model directly estimates the attention map of the first object through the compatibility function between the local CNN features and the original global feature output by the last fc layer.
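The step-by-step control flow described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function names, the initialization of the recurrent state directly from the global feature, and the two toy stand-ins (a hard argmax in place of the attention submodule and a subtractive update in place of the LSTM cell) are all assumptions made for the example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def run_conditional_attention(local_feats, global_feat, steps, attend, recurrent_step):
    """Return the attention feature produced at each of `steps` steps."""
    g = list(global_feat)  # step 1 queries with the original global feature
    h = list(global_feat)  # assumption: recurrent state starts from the global feature
    outputs = []
    for _ in range(steps):
        att_feat = attend(local_feats, g)  # attention feature of the focused object
        outputs.append(att_feat)
        h = recurrent_step(att_feat, h)    # update the recurrent state
        g = h                              # updated state is the next conditional global feature
    return outputs


# Hypothetical stand-ins, used only to make the loop runnable.
def toy_attend(local_feats, g):
    # Hard argmax over dot-product scores (the paper uses a soft weighting).
    return max(local_feats, key=lambda l: dot(l, g))


def toy_step(att_feat, h):
    # Subtract what was just attended so the next query moves elsewhere.
    return [hi - ai for hi, ai in zip(h, att_feat)]


features = [[1.0, 0.0], [0.0, 1.0]]
result = run_conditional_attention(features, [1.0, 0.2], 2, toy_attend, toy_step)
print(result)
```

With these stand-ins the loop attends to the first location at step 1 and to the second at step 2, which mirrors how the conditional global feature shifts the focus from one object to the next.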
This section describes the details of attention generation, that is, how the conditional global feature is aligned with the local CNN features. At each step t, we denote by $L^s = \{l^s_1, l^s_2, \dots, l^s_n\}$ the set of local feature vectors extracted at a given intermediate layer s in the CNN pipeline, where $l^s_i$ is the i-th feature vector of n total spatial locations in the local feature of layer s. Since the conditional global feature vector $g_t$ may have a different dimension from $l^s_i$, a linear mapping is applied to keep the dimensions consistent. Then, the dot product is used to measure the compatibility between $l^s_i$ and $g_t$, as shown in Equation (3.2):

$$c^s_{t,i} = \langle l^s_i, g_t \rangle, \quad i \in \{1, \dots, n\} \tag{3.2}$$

By calculating the compatibility function, we get a series of compatibility scores $C^s_t = \{c^s_{t,1}, \dots, c^s_{t,n}\}$, which are then normalized by a softmax operation:

$$a^s_{t,i} = \frac{\exp(c^s_{t,i})}{\sum_{j=1}^{n} \exp(c^s_{t,j})}$$

The normalized compatibility scores are then used to produce the attention feature vector $g^{a,s}_t$ for layer s as a weighted average of the local features:

$$g^{a,s}_t = \sum_{i=1}^{n} a^s_{t,i}\, l^s_i$$

To produce multi-scale attention, we can concatenate $g^{a,s}_t$ over the S attended layers as

$$g^a_t = [g^{a,1}_t, g^{a,2}_t, \dots, g^{a,S}_t]$$

which provides a more discriminative and complementary representation of a particular object at different scales. Finally, we compute the conditional distribution over the possible outputs through a classifier:

$$p(y_t \mid g^a_t) = f(g^a_t)$$

where f is a classifier with a softmax function that outputs the probability of $y_t$, and $g^a_t$ is the concatenated attention feature vector at time t.
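The attention generation steps above (linear mapping, dot-product compatibility, softmax normalization, weighted averaging, and concatenation across layers) can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and applying the linear mapping to the global feature rather than to the local features is a choice made here for brevity.

```python
import math


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_feature(local_feats, global_feat, weight):
    """Return (attention feature, attention map) for one layer."""
    # Assumption: the global feature is mapped to the local feature dimension.
    query = matvec(weight, global_feat)
    # Dot-product compatibility score for every spatial location.
    scores = [sum(a * b for a, b in zip(l, query)) for l in local_feats]
    weights = softmax(scores)
    dim = len(local_feats[0])
    # Weighted average of the local features.
    feat = [sum(w * l[d] for w, l in zip(weights, local_feats)) for d in range(dim)]
    return feat, weights


def multi_scale_feature(layers, global_feat):
    """Concatenate the attention features of several (local_feats, weight) layers."""
    out = []
    for local_feats, weight in layers:
        feat, _ = attention_feature(local_feats, global_feat, weight)
        out.extend(feat)
    return out


local = [[1.0, 0.0], [0.0, 1.0]]       # two spatial locations, 2-d features
proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # maps a 3-d global feature to 2-d
g = [2.0, 0.0, 0.0]
feat, attn = attention_feature(local, g, proj)
combined = multi_scale_feature([(local, proj), (local, proj)], g)
```

In this toy case the first location aligns with the query, so it receives the larger attention weight and dominates the attention feature.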
The conditional attention model above for multiple objects recognition is a basic model for sequential visual tasks. For image captioning, however, the model not only needs to output a word at each step, but also needs strong context dependence between words to produce a good sentence. Here we show that our conditional attention framework can incorporate a popular language model to solve the image captioning task.
In order to adapt our proposed attention model to image captioning, the conditional global feature should not only represent the weak visual feature descriptor, as in the first conditional attention model, but also provide contextual visual information for a language model. Thus we adopt two LSTM layers to extract visual and context features separately. The first LSTM layer serves our conditional visual attention by providing the conditional global feature $g_t$, and the second LSTM layer serves as the language model, specifically a bi-directional LSTM. The adapted attention model for image captioning is shown in Fig. 3. At each time step, the attention LSTM takes the previous attention feature $g^a_{t-1}$, the previous hidden state $h^{\mathrm{lang}}_{t-1}$ of the language LSTM, the previous hidden state $h^{\mathrm{att}}_{t-1}$ of the attention LSTM, and the previous word vector $w_{t-1}$. Thus $g_t$ is given by Equation (3.6):

$$g_t = h^{\mathrm{att}}_t = \mathrm{LSTM}_{\mathrm{att}}(g^a_{t-1}, h^{\mathrm{lang}}_{t-1}, w_{t-1}, h^{\mathrm{att}}_{t-1}) \tag{3.6}$$

Then $g_t$ is used to generate the attention feature $g^a_t$ through the conditional attention submodule discussed above. The incorporated language model takes the visual attention feature $g^a_t$, the conditional global feature $g_t$, and the previous hidden state $h^{\mathrm{lang}}_{t-1}$ of the language LSTM, which together provide not only the latest visual attention but also both visual and linguistic context. Finally, $g^a_t$ and $h^{\mathrm{lang}}_t$ are concatenated and fed to the classifier to predict the conditional distribution over the possible outputs at each step t, as in Equation (3.7):

$$p(y_t \mid g^a_t, h^{\mathrm{lang}}_t) = f([g^a_t, h^{\mathrm{lang}}_t]) \tag{3.7}$$

where f is a classifier with a softmax function that outputs the probability of $y_t$, $g^a_t$ is the attention feature vector at time t and $h^{\mathrm{lang}}_t$ is the output of the language LSTM.
Our proposed conditional attention models are optimized end-to-end, which allows them to be trained directly with respect to a given task. Back-propagation is used to train the neural network components by minimizing the smooth cross-entropy loss shown in Equation (3.8):

$$L_{ce} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} y^{(n)}_{t,k} \log p^{(n)}_{t,k} \tag{3.8}$$

where N is the number of samples, T is the length of a sequence in a sequence task and K is the number of categories. $y_{t,k}$ is the ground truth of class k at step t, while $p_{t,k}$ is the predicted probability of class k at step t.
For the attention model for image captioning, since a language model is incorporated, the training process is slightly different from that of the multiple objects recognition task. Following prior work, we adopt the doubly stochastic attention loss as a regularizer to improve the overall BLEU score:

$$L_{d} = \lambda \sum_{i=1}^{n} \Bigl(1 - \sum_{t=1}^{C} a_{t,i}\Bigr)^2$$

where $\lambda$ is a constant, n is the number of spatial locations of the attention feature, C is the length of the caption, and $a_{t,i}$ is the attention weight of location i at step t. Thus the final loss function for training the attention model is:

$$L = L_{ce} + L_{d}$$
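As a hedged illustration of the training objective for captioning, the sketch below computes a per-sample cross-entropy term and the doubly stochastic attention penalty, then adds them. It is a simplification under stated assumptions: plain (unsmoothed) cross-entropy is used, averaging over the N samples is omitted, and all function names are hypothetical.

```python
import math


def cross_entropy(probs, targets):
    """Sum over steps of -log p(target); probs is [T][K], targets is [T] class indices."""
    return -sum(math.log(step_probs[k]) for step_probs, k in zip(probs, targets))


def doubly_stochastic_penalty(attn, lam=1.0):
    """attn is [C][n]: one attention map (summing to 1) per caption step.

    Penalizes locations whose attention, summed over all steps, differs from 1,
    which encourages the model to look at every location at some point.
    """
    n = len(attn[0])
    return lam * sum((1.0 - sum(row[i] for row in attn)) ** 2 for i in range(n))


def caption_loss(probs, targets, attn, lam=1.0):
    """Cross-entropy plus the attention regularizer for a single caption."""
    return cross_entropy(probs, targets) + doubly_stochastic_penalty(attn, lam)


probs = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]
balanced = [[1.0, 0.0], [0.0, 1.0]]   # every location attended exactly once
collapsed = [[1.0, 0.0], [1.0, 0.0]]  # second location never attended
loss_balanced = caption_loss(probs, targets, balanced)
loss_collapsed = caption_loss(probs, targets, collapsed)
```

The balanced attention pattern incurs no penalty, while the collapsed one is penalized for ignoring the second location.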
In the experiments, we first benchmarked the performance of our attention model for multiple objects recognition on the SVHN dataset with and without extra bounding boxes. Next, we evaluated our attention framework adapted for image captioning on the MSCOCO 2014 captions dataset. Our code for both multiple objects recognition and image captioning has been released at https://github.com/caoquanjie/ConditionalLearnToPayAttention.
Attention model for multiple objects recognition: The network configuration in Fig. 2 is the same as in prior work and is shown in Table 1. It has 16 layers: 15 convolutional layers and 1 fully connected layer. Different from the standard VGGNet, the first two max-pooling layers of the VGGNet are moved after the additional convolutional layers conv6 and conv7 respectively, so that the estimated attention maps have higher resolution. The output of the last layer is defined as the global feature G, and the convolutional features conv4_3 and conv5_3 are defined as the local features. In all subsequent experiments, our model with only one local feature is referred to as conditional-model-att1, and the model with both local features is referred to as conditional-model-att2. For comparison with our proposed models, we evaluated two versions of the soft attention model as baselines. One is standard-soft-attention, which uses the standard VGGNet for soft attention according to its original implementation; the other is modified-soft-attention, in which the first two max-pooling layers are removed from the VGGNet as in ours. Moreover, batch normalization and dropout regularization are applied to each convolutional layer, with a dropout rate of 0.3 for the first convolutional layer and 0.4 for the rest. The implementation of the LSTM closely follows prior work; the initial memory state and hidden state of the LSTM are predicted from the global feature through two fully connected layers. The size of the LSTM cell in our model was set to 512.
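Table 1: Network configuration.

| Layer | Kernel size | Stride | Output channels |
|---|---|---|---|
| conv1_1 | 3 × 3 | 1 × 1 | 64 |
| conv1_2 | 3 × 3 | 1 × 1 | 64 |
| conv2_1 | 3 × 3 | 1 × 1 | 128 |
| conv2_2 | 3 × 3 | 1 × 1 | 128 |
| conv3_1 | 3 × 3 | 1 × 1 | 256 |
| conv3_2 | 3 × 3 | 1 × 1 | 256 |
| conv3_3 | 3 × 3 | 1 × 1 | 256 |
| max_pool | 2 × 2 | 2 × 2 | 256 |
| conv4_1 | 3 × 3 | 1 × 1 | 512 |
| conv4_2 | 3 × 3 | 1 × 1 | 512 |
| conv4_3 | 3 × 3 | 1 × 1 | 512 |
| max_pool | 2 × 2 | 2 × 2 | 512 |
| conv5_1 | 3 × 3 | 1 × 1 | 512 |
| conv5_2 | 3 × 3 | 1 × 1 | 512 |
| conv5_3 | 3 × 3 | 1 × 1 | 512 |
| max_pool | 2 × 2 | 2 × 2 | 512 |
| conv6 | 3 × 3 | 1 × 1 | 512 |
| max_pool | 2 × 2 | 2 × 2 | 512 |
| conv7 | 3 × 3 | 1 × 1 | 512 |
| max_pool | 2 × 2 | 2 × 2 | 512 |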
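To illustrate why relocating the first two max-pooling layers yields higher-resolution attention maps, the sketch below tracks the spatial size of each convolutional output for the layer order in Table 1 and for the standard VGG-16 order. It rests on assumptions not stated in the table: the 3 × 3 convolutions are taken to preserve spatial size (padded), and the 64-pixel input side is a hypothetical value chosen only for the example.

```python
# Layer order from Table 1 (pooling after conv1_2 and conv2_2 relocated).
MODIFIED_VGG = [
    "conv1_1", "conv1_2", "conv2_1", "conv2_2",
    "conv3_1", "conv3_2", "conv3_3", "max_pool",
    "conv4_1", "conv4_2", "conv4_3", "max_pool",
    "conv5_1", "conv5_2", "conv5_3", "max_pool",
    "conv6", "max_pool", "conv7", "max_pool",
]

# Standard VGG-16 convolutional layer order, for comparison.
STANDARD_VGG = [
    "conv1_1", "conv1_2", "max_pool", "conv2_1", "conv2_2", "max_pool",
    "conv3_1", "conv3_2", "conv3_3", "max_pool",
    "conv4_1", "conv4_2", "conv4_3", "max_pool",
    "conv5_1", "conv5_2", "conv5_3", "max_pool",
]


def feature_map_sizes(layers, input_size):
    """Map each conv layer name to the side length of its output feature map."""
    size = input_size
    sizes = {}
    for name in layers:
        if name == "max_pool":
            size //= 2          # 2 x 2 pooling with stride 2 halves the side
        else:
            sizes[name] = size  # assumption: padded 3 x 3 conv keeps the size
    return sizes


modified = feature_map_sizes(MODIFIED_VGG, 64)
standard = feature_map_sizes(STANDARD_VGG, 64)
print(modified["conv4_3"], modified["conv5_3"])
print(standard["conv4_3"], standard["conv5_3"])
```

Under these assumptions the local features conv4_3 and conv5_3 are four times larger per side in the modified layout than in the standard one, so the attention maps computed on them are correspondingly finer.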
Attention model for image caption: We used the VGGNet pretrained on ImageNet as the convolutional layers in Fig. 3. The configuration of the LSTM layer for visual attention is the same as in our basic conditional attention model. A bi-directional LSTM is used as the LSTM layer of the language model. The dimensions of the word embedding and of the LSTM hidden state were both set to 512. At test time, we used beam search with a beam size of 3 to select the best caption among the candidates. The constant weighting the doubly stochastic attention loss was set to 1.
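For readers unfamiliar with beam search, the following is a generic minimal sketch with a default beam size of 3. It is not the decoding code of the released model: the `next_log_probs` callback, the toy word table, and the choice to rank hypotheses by unnormalized log-probability are assumptions made for illustration.

```python
import math


def beam_search(next_log_probs, start, end, beam_size=3, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    next_log_probs(seq) must return a dict mapping each candidate next token
    to its log-probability given the prefix `seq`.
    """
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, log_p in next_log_probs(seq).items():
                candidates.append((score + log_p, seq + [token]))
        # Keep only the best `beam_size` extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == end:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[0])[1]


# Hypothetical toy model: the best full sequence starts with the less likely first word.
TABLE = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"x": math.log(0.5), "y": math.log(0.5)},
    "b": {"z": math.log(1.0)},
    "x": {"</s>": 0.0}, "y": {"</s>": 0.0}, "z": {"</s>": 0.0},
}


def toy_model(seq):
    return TABLE[seq[-1]]


best = beam_search(toy_model, "<s>", "</s>", beam_size=3)
greedy = beam_search(toy_model, "<s>", "</s>", beam_size=1)
print(best, greedy)
```

In this toy example a beam of 3 recovers the sequence through "b" (probability 0.4), whereas a beam of 1, which is greedy decoding, commits to "a" and ends with a sequence of probability 0.3.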
The Street View House Numbers (SVHN) dataset consists of three parts, the training set, the testing set, and the extra set, with around 240k images in total. In the experiments, our model was trained using the extra set and the testing set. In the first experiment on SVHN, the ground truth bounding boxes were used to generate a well-conditioned dataset for multiple objects recognition. The dataset was preprocessed closely following prior work: the digits were cropped from the street number images using the bounding boxes, the cropped digits were resized, and random crops of the enlarged digits were then taken to augment our training dataset. Since there are at most 5 digits in one street number image, we used a dummy label '10' as a placeholder to guarantee that the label length of each image is 5.
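The label padding described above can be sketched as follows. The helper name is hypothetical, and two details are assumptions: digits are taken to be encoded as 0 to 9, and the placeholder is appended at the end of the sequence.

```python
PAD_LABEL = 10   # dummy class used as a placeholder
MAX_DIGITS = 5   # at most 5 digits per street number image


def pad_labels(digits, max_len=MAX_DIGITS, pad=PAD_LABEL):
    """Pad a digit sequence with the dummy label so every target has length max_len."""
    if len(digits) > max_len:
        raise ValueError("sequence longer than max_len")
    # Assumption: padding goes after the real digits.
    return list(digits) + [pad] * (max_len - len(digits))


print(pad_labels([3, 3, 2]))  # [3, 3, 2, 10, 10]
```

With fixed-length targets, the classifier predicts 11 classes at each of the 5 steps, and the dummy class signals that no further digit is present.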
Table 2: Accuracy on SVHN with ground truth bounding boxes.

| Model | Accuracy (%) |
|---|---|
| 11-layer-CNN | 96.04 |
| ST-CNN Single | 96.30 |
| ST-CNN Multi | 96.40 |
| standard-soft-attention | 96.47 |
Table 2 shows the performance of our model compared with state-of-the-art methods in this experiment. Both conditional-model-att1 and conditional-model-att2 outperform the other methods. At each step, the obtained attention feature helps to improve the performance of our model by focusing on the object of interest while suppressing the background regions. Fig. 4 visualizes the differences between the attention maps generated by our proposed attention model and by the soft attention model. Interestingly, our model focuses more accurately than the soft attention model on each object in the image, from left to right. In addition, the probability of the predicted label is shown on top of each image. Note that for the third, heavily skewed image “332” in Fig. 4, our model correctly predicts all digits, while the soft attention model wrongly predicts the last digit ‘2’. This is because our attention model directly computes the attention feature through the compatibility function between the local features and the conditional global feature at each time step, rather than relying on a simple attention network. Of the two versions of our model, conditional-model-att2 performs better than conditional-model-att1 since it exploits multi-scale attention; however, conditional-model-att2 requires more training time.
To further test the robustness of our model, we designed a challenging weakly labeled multiple objects recognition task in which no ground truth bounding boxes were used to crop the digits. Instead, the original street number images are used directly for training and testing. The first column in Fig. 5 shows the original street number images, that is, the weakly labeled images in this experiment. We used the same data augmentation technique as in the previous experiment: the original SVHN images were resized and then randomly cropped. As can be seen from Table 3, although the overall accuracy of all models is not very high, our attention models, especially conditional-model-att2, improve over modified-soft-attention, standard-soft-attention and 11-layer-CNN on this task.
Table 3: Accuracy on SVHN without bounding boxes.

| Model | Accuracy (%) |
|---|---|
| 11-layer-CNN | 70.58 |
| standard-soft-attention | 74.89 |
Fig. 5 visualizes the attention maps generated by our proposed attention model and the soft attention model. It clearly shows that our model correctly highlights the street numbers at each step despite the very noisy and cluttered large background, whereas the soft attention model can only coarsely identify the regions of the street numbers. For example, for the extremely noisy street number image “26” in the first row of Fig. 5, our attention model accurately predicts each digit with high confidence (0.7982 for ‘2’ and 0.8739 for ‘6’), while the soft attention model wrongly recognizes ‘6’ as ‘5’ with low overall prediction probabilities (0.2444 for ‘2’ and 0.1757 for ‘5’).
Note that this task is very different from classical object detection, which requires accurate bounding boxes for training; for this weakly supervised recognition task, only the sequential labels can be exploited. The experiment shows that our model identifies the objects of interest well. Thus we hypothesize that our model can be used to automatically crop regions of interest (ROI) when collecting training data for object detection, simply by sequentially memorizing the objects of interest, which would greatly reduce the burden of manual annotation. We leave this investigation to future work.
In our last experiment, we designed another conditional global feature based on a language model. The main purpose of this evaluation is to demonstrate the generality of our conditional attention framework on the image captioning task, so we do not expect it to outperform state-of-the-art methods optimized for image captioning, such as the bottom-up approach. However, we show that our attention model does outperform several popular models that, like ours, are based on the VGGNet architecture. We used the MSCOCO 2014 captions dataset to evaluate our attention model, with the same split as prior work. To deal with captions of various lengths, we truncated each caption to 20 words. The evaluation criteria are the standard automatic evaluation metrics, namely BLEU, METEOR, and CIDEr.
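Table 4: Image captioning results on MSCOCO.

| Model | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr |
|---|---|---|---|---|---|---|
| Deep VS | 62.5 | 45.0 | 32.1 | 23.0 | 19.5 | - |
| Google NIC | 66.6 | 46.1 | 32.9 | 24.6 | - | - |
| Our attention model | 70.9 | 54.1 | 40.5 | 30.3 | 23.9 | 89.5 |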
From the results in Table 4, we can see that our attention model adapted for image captioning outperforms the other models in most cases. Compared with soft attention in particular, our model only slightly outperforms it on B-1, but achieves large improvements on B-2, B-3, and B-4. Compared with the recently proposed SCA-CNN models, given the same pretrained VGGNet, our model outperforms SCA-CNN-VGG on all BLEU scores, even though SCA-CNN-VGG incorporates spatial and channel-wise attention and exploits multi-layer attention. With a more powerful CNN architecture, such as ResNet, we expect that our model could produce more accurate visual attention and, in turn, better captions, just as SCA-CNN-ResNet does. We leave this to future investigation.
Fig. 6 shows visualization results of our attention model on the image captioning task. Our attention model generates meaningful captions, and the computed attention maps highlight the relevant parts corresponding to the caption words in most cases. Note that the two images in the last row of Fig. 6 are also shown as failure cases in prior work. For the “giraffe” on the left, our attention model accurately recognizes it as “giraffe”. Moreover, it is interesting that the word “couple” corresponds to two giraffes, which suggests that our model can learn a basic counting capability.
In this paper, we proposed a new conditional attention framework to tackle sequential visual tasks, such as multiple objects recognition and image captioning. By introducing the novel conditional global feature, the attention feature for each object can be produced by measuring how the local convolutional features align with the conditional global feature. Moreover, by designing the conditional global feature appropriately, the proposed conditional attention model has demonstrated its generality on different sequential visual tasks: it achieved the best performance on the SVHN benchmark with and without extra bounding boxes, and obtained better scores on the image captioning task than the popular soft attention model. We therefore believe that the conditional attention model could also achieve promising performance on other sequential visual tasks, such as VQA. We leave this to future work.