Conditionally Learn to Pay Attention for Sequential Visual Task

11/11/2019 ∙ by Jun He, et al. ∙ NetEase, Inc

Sequential visual tasks usually require attending to the object of current interest conditional on previous observations. Unlike the popular soft attention mechanism, we propose a new attention framework that introduces a novel conditional global feature, which acts as a weak feature descriptor of the currently focused object. Specifically, in a standard CNN (Convolutional Neural Network) pipeline, convolutional layers with different receptive fields are used to produce attention maps by measuring how the convolutional features align with the conditional global feature. The conditional global feature can be generated by different recurrent structures depending on the visual task, such as a simple recurrent neural network for multiple objects recognition, or a moderately complex language model for image captioning. Experiments show that our proposed conditional attention model achieves the best performance on the SVHN (Street View House Numbers) dataset with and without extra bounding boxes; for image captioning, our attention model obtains better scores than the popular soft attention model.




1 Introduction

Recent successes in machine translation [1], speech recognition [2], and image captioning [3] have demonstrated the important role of attention mechanisms. In computer vision, as in the human visual system, attention need not cover the whole image, only its salient areas. For example, [4][5][6][7] embedded attention mechanisms into image captioning, enabling the model to learn to automatically generate a caption describing the content of an image. Subsequently, attention approaches were introduced into the emerging visual question answering (VQA) task, where they greatly improved overall performance [8][9][7].

Recently, [10] proposed a novel end-to-end trainable attention module for convolutional neural network architectures. The core idea of their work lies in estimating the attention maps by measuring how the local convolutional features align with the global feature, which differs from previous attention approaches. Although this approach has experimentally proven its capability on weakly supervised object recognition and query tasks, it is only suitable for a single object: the trained global feature represents only the specific object given by the image label.

In this paper, we take a further step beyond [10] and extend its capability to sequential visual tasks, such as multiple objects recognition and image captioning. Inspired by recent work employing attention in image captioning [4][5][6][7] and by the new attention mechanism in [10], we propose a novel conditional attention framework for sequential visual tasks. In order to generate the attention map sequentially for each object, we design a conditional global feature and treat it as a feature descriptor of the currently focused object. The attention feature vector is then produced sequentially through a compatibility function between the convolutional features and the conditional global feature. Note that our conditional attention framework differs from the popular soft attention architecture proposed by [4]. Instead of predicting a probability distribution over image pixels through an attention network, we use the dot product as a compatibility function between the local features and the conditional global feature to obtain a score map that highlights the relevant areas of the image and suppresses background information. Fig. 1 shows the difference between the popular soft attention model and our proposed model.

Figure 1: Illustration of the difference between the popular soft attention architecture [4] (left) and our proposed conditional attention framework (right). In our framework, 'G' denotes the global feature, 'CG' denotes the conditional global feature, and 'L' denotes the convolutional features.

The contribution of our work is threefold. Firstly, unlike the popular soft attention model [4], we propose a substantially different conditional attention framework to tackle sequential visual tasks. Secondly, we introduce the novel conditional global feature, which can be regarded as a weak feature descriptor for a particular object in a sequential visual task. The attention map for each object can then be estimated simply by measuring how the local convolutional features align with the conditional global feature. Thirdly, on the popular multi-digit "Street View House Numbers" (SVHN) dataset, our new attention model achieves the best performance among state-of-the-art methods, with and without extra bounding boxes.

2 Related Work

Attention can be achieved in two ways: post-hoc mechanisms and trainable attention in CNNs. [11] computes a class map to capture internal changes of a deep convolutional neural network for image classification. Subsequently, [12] proposed feedback CNNs to produce a saliency map that shows the network's attention on expected objects in the image. Notably, all of these methods produce attention maps by passing through well-trained CNN models, which is called post-hoc processing.

In recent years, many studies have demonstrated that obtaining attention maps by optimizing the weights of attention modules while training the CNN can achieve better performance. Trainable attention in CNNs is divided into two categories: hard attention and soft attention. Hard attention is a stochastic sampling method and is therefore non-differentiable, so the attention module must be trained via reinforcement learning methods [13], as in the recurrent attention model (RAM) [14]. Soft attention computes a weight vector as the attention probability, which is differentiable and can be easily trained. Trainable attention in CNNs, especially soft attention, has been applied to a variety of visual tasks. For example, attention can be applied to query-based tasks [15][4][16][8] and visual question answering (VQA) tasks [8][9][7]. In particular, for query-based tasks, [10] introduced a novel learn-to-pay-attention method that directly uses a learned global feature to query images, unlike previous methods that perform the query with a one-hot encoding of category labels [17]. However, the learned global feature does not vary conditionally on its context information, which prevents this attention method from handling sequential visual tasks, such as weakly supervised multiple objects recognition and image captioning.

Recent image captioning methods embed attention into an encoder-decoder framework [4][5][6][7]. For example, two popular attention methods (soft and hard) were proposed in [4], which not only generate meaningful words but also highlight the corresponding region of interest in the image. To enhance image captioning, [5] combines top-down and bottom-up approaches to fuse the features extracted by both. [6] incorporates spatial and channel-wise attention in a convolutional neural network, and these attentions are embedded into different layers to represent multi-scale features. In order to highlight image regions more accurately, [7] uses bottom-up attention to obtain salient image regions with a Faster R-CNN-like technique [18], and combines it with top-down attention as a language model to produce more coherent sentences.

3 Conditional Attention Model

In this section, we first introduce the conditional global feature, the key idea of our proposed conditional attention framework for sequential visual tasks. For multiple objects recognition, a simple recurrent neural network can be used to generate the conditional global feature. For image captioning, we demonstrate that a language model can be easily incorporated into this attention framework.

3.1 Conditional Global Feature

Our conditional attention framework is mainly inspired by [10], which introduces a compatibility measure between local features and global features. For sequential visual tasks, processing the objects of an image usually takes T steps. At each step t, the model needs to produce an attention map of the current object and its corresponding attention feature. The trained global feature must therefore change from step to step to represent different objects, so a recurrent structure can be naturally exploited to provide the conditional information for the variable global feature. Thus we propose the conditional global feature: the output of the recurrent structure given the previous attention feature and the last recurrent state. In this paper, the conditional global feature refers to the output vector of each LSTM cell [19].

Note that the conditional global feature is actually the updated recurrent state of the LSTM, but it can be regarded as a weak feature descriptor of the focused object, which is the key design of our conditional attention framework.
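
As a hedged sketch of the recurrence described above, the conditional global feature can be written as the hidden state of an LSTM cell driven by the previous attention feature. All symbols here are assumptions chosen for illustration rather than the original notation: $g_t$ is the conditional global feature at step $t$, $\hat{g}_{t-1}$ is the previous attention feature, $h_{t-1}$ and $c_{t-1}$ are the previous hidden and memory states, and $G$ is the global feature output by the last fully connected layer, which is used directly at the first step.

```latex
% Assumed notation, not taken from the original equations.
\begin{align}
g_1 &= G, \\
(h_t,\, c_t) &= \mathrm{LSTM}\big(\hat{g}_{t-1},\; h_{t-1},\; c_{t-1}\big), \qquad t = 2, \dots, T, \\
g_t &= h_t .
\end{align}
```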

Figure 2: Overview of the conditional attention model for multiple objects recognition, showing the conditional global feature and the CNN features. The detailed structure of the convolutional neural network is listed in the Appendix.

For the multiple objects recognition task, the conditional attention model combined with a simple recurrent network is shown in Fig. 2. Once the conditional global feature is calculated, the attention maps of the currently focused object can be generated by the conditional attention submodule (see Section 3.2 for details). Notably, at the first step, the model directly estimates the attention map of the first object through the compatibility function between the local CNN features and the original global feature output by the last fc layer.

3.2 Conditional Attention Submodule

This section describes the details of attention generation, namely how the conditional global feature is aligned with the local CNN features. At each step t, we consider the set of local feature vectors extracted at a given intermediate layer s in the CNN pipeline, with one feature vector for each of the n spatial locations of that layer. Since the conditional global feature vector may differ in dimension from the local feature vectors, a linear mapping is applied to keep the dimensions consistent. The dot product is then used to measure the compatibility between each local feature vector and the conditional global feature.

Calculating the compatibility function yields a set of compatibility scores, one per spatial location, which are then normalized by a softmax operation.

The normalized compatibility scores are then used to produce an attention feature vector for layer s as the weighted average of the local features.
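
The three steps above (dot-product compatibility, softmax normalization, weighted averaging) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function name `conditional_attention`, the argument names, and the choice to apply the linear mapping `proj` to the conditional global feature (rather than to the local features) are all hypothetical.

```python
import numpy as np


def conditional_attention(local_feats, cond_global, proj):
    """One attention step for a single layer (illustrative sketch).

    local_feats: (n, d_l) array, one feature vector per spatial location.
    cond_global: (d_g,) conditional global feature.
    proj:        (d_g, d_l) linear mapping that matches the dimensions
                 (assumed here to act on the conditional global feature).
    """
    query = cond_global @ proj            # (d_l,) mapped global feature
    scores = local_feats @ query          # (n,) dot-product compatibility
    shifted = scores - scores.max()       # subtract max for numerical stability
    weights = np.exp(shifted) / np.exp(shifted).sum()  # softmax over locations
    attended = weights @ local_feats      # (d_l,) weighted average of local features
    return weights, attended


# Toy usage: the first location aligns best with the query.
local_feats = np.array([[10.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cond_global = np.array([1.0, 0.0])
proj = np.eye(2)
weights, attended = conditional_attention(local_feats, cond_global, proj)
```

In this toy call the attention weights form a probability distribution over the three locations and concentrate on the first one, whose local feature has the largest dot product with the mapped global feature.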


To produce multi-scale attention, we can concatenate the attention feature vectors of different layers, which provides a more discriminative and complementary representation of a particular object at different scales. Finally, we compute the conditional distribution over possible outputs through a classifier f with a softmax function, which outputs the class probabilities from the concatenated attention feature vector at time t.
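
The multi-scale concatenation and the classifier can be summarized in the following hedged sketch. The notation is assumed for illustration: $\hat{g}_t^{\,s}$ is the attention feature of layer $s$ at step $t$, $S$ is the number of attended layers, $I$ is the input image, and $y_t$ is the output at step $t$. Writing $f$ as a single linear layer with parameters $W$ and $b$ followed by a softmax is also an assumption; the text only states that $f$ is a classifier with a softmax function.

```latex
% Assumed notation, not taken from the original equations.
\begin{align}
\hat{g}_t &= \big[\hat{g}_t^{\,1};\; \hat{g}_t^{\,2};\; \dots;\; \hat{g}_t^{\,S}\big], \\
p\big(y_t \mid y_{1:t-1},\, I\big) &= f(\hat{g}_t) = \mathrm{softmax}\big(W \hat{g}_t + b\big).
\end{align}
```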

3.3 Attention Model for Image Caption

The conditional attention model for multiple objects recognition described above is a basic model for sequential visual tasks. For image captioning, however, the model not only needs to output a word at each step but also needs strong contextual dependence between words to produce a better sentence. Here we show that our new conditional attention framework can incorporate a popular language model to solve the image captioning task.

Figure 3: Overall structure of the attention model for image captioning. Two separate LSTM layers are used to generate visual attention features and context information for the language model. The conditional global feature not only represents the weak visual feature descriptor but also provides visual context information for the language model.

In order to adapt our proposed attention model to image captioning, the conditional global feature should not only represent the weak visual feature descriptor, as in the first conditional attention model, but also provide visual context information for a language model. Thus we adopt two LSTM layers to extract visual and context features separately. The first LSTM layer serves our conditional visual attention by providing the conditional global feature, and the second LSTM layer, specifically a bi-directional LSTM, serves the language model. The adapted attention model for image captioning is shown in Fig. 3. At each time step, the attention LSTM takes the previous attention feature, the previous hidden state of the language LSTM, the previous hidden state of the attention LSTM, and the previous word vector. The conditional global feature is produced from these four inputs.

The conditional global feature is then used to generate the attention feature through the conditional attention submodule discussed above. The incorporated language model takes the visual attention feature, the conditional global feature, and the previous hidden state of the language LSTM, which together provide not only the latest visual attention but also both visual and linguistic context. Finally, the attention feature vector at time t and the output of the language LSTM are concatenated and fed to a classifier f with a softmax function, which predicts the conditional distribution over possible outputs at each step.
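
The data flow of the captioning model can be sketched as below. Every symbol is an assumption made for illustration: $g_t$ is the conditional global feature, $\hat{g}_t$ the attention feature, $h^{\mathrm{att}}_t$ and $h^{\mathrm{lang}}_t$ the hidden states of the attention and language LSTMs, $w_{t-1}$ the previous word vector, $L$ the local CNN features, and $\mathrm{Att}$ the conditional attention submodule of Section 3.2. Combining the inputs by concatenation is likewise an assumption, and the LSTM memory cells are omitted for brevity.

```latex
% Assumed notation, not taken from the original equations.
\begin{align}
g_t = h^{\mathrm{att}}_t &= \mathrm{LSTM}_{\mathrm{att}}\big([\hat{g}_{t-1};\; h^{\mathrm{lang}}_{t-1};\; w_{t-1}],\; h^{\mathrm{att}}_{t-1}\big), \\
\hat{g}_t &= \mathrm{Att}\big(L,\; g_t\big), \\
h^{\mathrm{lang}}_t &= \mathrm{LSTM}_{\mathrm{lang}}\big([\hat{g}_t;\; g_t],\; h^{\mathrm{lang}}_{t-1}\big), \\
p\big(y_t \mid y_{1:t-1},\, I\big) &= f\big([\hat{g}_t;\; h^{\mathrm{lang}}_t]\big).
\end{align}
```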

3.4 Optimization

Our proposed conditional attention models are optimized end-to-end, which allows the model to be trained directly with respect to a given task. Back-propagation is used to train the neural network components by minimizing the smooth cross entropy loss. The loss is computed over N samples, over the T steps of a sequence in a sequence task, and over the categories; for each class k at step t, it compares the ground truth with the predicted probability.
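
A minimal NumPy sketch of a sequence cross entropy of this form is given below. It is an illustration under assumptions: the function name `sequence_cross_entropy` is hypothetical, the loss is averaged over the N samples and summed over the T steps, and the "smooth" variant mentioned in the text is not specified, so plain cross entropy is shown.

```python
import numpy as np


def sequence_cross_entropy(probs, labels, eps=1e-12):
    """Cross entropy over a batch of label sequences (illustrative sketch).

    probs:  (N, T, K) predicted class probabilities per sample and step.
    labels: (N, T) integer ground-truth class index per sample and step.
    Returns the loss summed over steps and averaged over samples
    (this normalization is an assumption).
    """
    n, t, _ = probs.shape
    # Pick the predicted probability of the ground-truth class at each step.
    picked = probs[np.arange(n)[:, None], np.arange(t)[None, :], labels]
    return float(-np.log(picked + eps).sum(axis=1).mean())


# Toy usage: uniform predictions over K = 4 classes, T = 3 steps.
probs = np.full((2, 3, 4), 0.25)
labels = np.zeros((2, 3), dtype=int)
loss = sequence_cross_entropy(probs, labels)
```

With uniform predictions the loss reduces to T times log K, which gives a quick sanity check for an implementation.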

For the proposed attention model for image captioning, since a language model is incorporated, the training process differs slightly from that of the multiple objects recognition task. Following [4], we adopt the doubly stochastic attention loss as a regularizer to improve the overall BLEU score.

The regularizer is weighted by a constant and sums over the spatial locations of the attention feature and over the length of the caption. The final loss function for training the attention model is the sum of the cross entropy loss and this regularization term.
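
The doubly stochastic attention regularizer of [4] penalizes spatial locations whose attention weights, summed over all caption steps, deviate from one. A hedged NumPy sketch follows; the function name `doubly_stochastic_penalty` and the argument names are hypothetical, and the array layout (steps by locations) is an assumption.

```python
import numpy as np


def doubly_stochastic_penalty(alphas, lam=1.0):
    """Doubly stochastic attention regularizer in the style of [4] (sketch).

    alphas: (C, L) attention weights, one row per caption step, one column
            per spatial location; each row is assumed to sum to 1.
    lam:    regularization constant.
    Returns lam * sum_i (1 - sum_t alphas[t, i]) ** 2.
    """
    per_location_total = alphas.sum(axis=0)   # total attention each location receives
    return float(lam * np.square(1.0 - per_location_total).sum())


# Attention spread evenly over 4 locations for 4 steps: no penalty.
uniform = np.full((4, 4), 0.25)
penalty_uniform = doubly_stochastic_penalty(uniform)

# Attention stuck on the first location for all 4 steps: penalized.
stuck = np.zeros((4, 4))
stuck[:, 0] = 1.0
penalty_stuck = doubly_stochastic_penalty(stuck)
```

The term encourages the model to look at every part of the image at some point during caption generation; it would be added to the cross entropy loss to form the final training objective.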


4 Experiments

In the experiments, we first benchmarked the performance of our attention model for multiple objects recognition on the SVHN dataset [20] with and without extra bounding boxes. Next, we evaluated our attention framework adapted for the image captioning task on the MSCOCO 2014 captions dataset [21]. Our code for both multiple objects recognition and image captioning has been released.

4.1 Experimental Setup

Attention model for multiple objects recognition: The network configuration in Fig. 2 is the same as in [10] and is shown in Table 1. It has 16 layers: 15 convolutional layers and 1 fully connected layer. Different from the standard VGGNet [22], the first two max-pooling layers of the VGGNet are moved after the additional convolutional layers conv6 and conv7, respectively, so that the estimated attention maps have higher resolution. The output of the last layer is defined as the global feature G, and the convolutional features conv4_3 and conv5_3 are defined as the two local features. In all subsequent experiments, our model with only one local feature is referred to as conditional-model-att1, and the attention model with both local features is referred to as conditional-model-att2. To compare with our proposed models, we evaluated two versions of the soft attention model [4] as baselines. One is standard-soft-attention, which uses the standard VGGNet for soft attention according to its original implementation [4]; the other is modified-soft-attention, in which the first two max-pooling layers are removed from the VGGNet as in ours. Moreover, batch normalization and dropout regularization are applied to each convolutional layer, with a dropout rate of 0.3 for the first convolutional layer and 0.4 for the rest. The implementation of the LSTM closely follows the one used in [4]: the initial memory state and hidden state of the LSTM are predicted from the global feature through two fully connected layers. The size of the LSTM cell in our model was set to 512.

Layers      Filters   Stride   Depth
conv1_1     3×3       1×1      64
conv1_2     3×3       1×1      64
conv2_1     3×3       1×1      128
conv2_2     3×3       1×1      128
conv3_1     3×3       1×1      256
conv3_2     3×3       1×1      256
conv3_3     3×3       1×1      256
max_pool    2×2       2×2      256
conv4_1     3×3       1×1      512
conv4_2     3×3       1×1      512
conv4_3     3×3       1×1      512
max_pool    2×2       2×2      512
conv5_1     3×3       1×1      512
conv5_2     3×3       1×1      512
conv5_3     3×3       1×1      512
max_pool    2×2       2×2      512
conv6       3×3       1×1      512
max_pool    2×2       2×2      512
conv7       3×3       1×1      512
max_pool    2×2       2×2      512
fc          -         -        512
Table 1: The CNN network of our attention model for multiple objects recognition.

Attention model for image captioning: We used the VGGNet pretrained on ImageNet as the convolutional layers in Fig. 3. The configuration of the LSTM layer for visual attention is the same as in our basic conditional attention model. A bi-directional LSTM is used as the LSTM layer of the language model. The dimensions of the word embedding and of the LSTM hidden state were both set to 512. At test time, we used beam search with a beam size of 3 to select the best caption among the candidates. The constant in the doubly stochastic attention loss was set to 1.
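
Beam search with a beam size of 3, as used at test time, can be illustrated with the generic sketch below. It is not the authors' decoder: `beam_search`, `toy_step`, the token names, and the toy probability table are all hypothetical, and hypotheses are ranked by plain summed log-probability without length normalization.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=20):
    """Generic beam search (illustrative sketch).

    step_fn(tokens) must return a dict mapping each candidate next token
    to its probability given the prefix `tokens`.
    Returns (best_tokens, best_log_prob).
    """
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break
        # Keep only the beam_size highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len still compete
    return max(finished, key=lambda c: c[1])


def toy_step(tokens):
    """Hypothetical next-token model used only for this example."""
    table = {
        "<s>": {"a": 0.6, "the": 0.4},
        "a": {"dog": 0.5, "cat": 0.5},
        "the": {"dog": 0.9, "cat": 0.1},
        "dog": {"</s>": 1.0},
        "cat": {"</s>": 1.0},
    }
    return table[tokens[-1]]


best_tokens, best_score = beam_search(toy_step, "<s>", "</s>", beam_size=3)
greedy_tokens, greedy_score = beam_search(toy_step, "<s>", "</s>", beam_size=1)
```

In the toy model, a beam of size 1 (greedy decoding) commits to the locally most likely first word and ends with a less probable sentence than the beam of size 3, which is the motivation for keeping several candidates.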

All models were trained using the Adam optimization algorithm [23] with a mini-batch size of 32 and a learning rate of 1e-4. All experiments were implemented in TensorFlow 1.9 and trained on a workstation with an NVIDIA 2080Ti GPU and 32 GB of system RAM.

4.2 Multiple Objects Recognition on SVHN

The Street View House Numbers (SVHN) dataset [20] consists of three parts, the training set, the testing set, and the extra set, with around 240k images in total. In the experiments, our model was trained by using the extra set and the testing set. In the first experiment on SVHN, the ground truth bounding boxes were used to generate a well-conditioned dataset for multiple objects recognition. The dataset was preprocessed closely following [24]: the digits were cropped from the street number images using the bounding boxes, and the cropped digits were resized. We then took random crops of the enlarged digits to augment our training dataset. Since there are at most 5 digits in one street number image, we used a dummy label '10' as a placeholder to guarantee that the label of each image has length 5.
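
The label padding described above can be sketched as follows. This is a minimal illustration: the function name `pad_digit_labels` is hypothetical, digits are assumed to be encoded as the integers 0 to 9, and placing the dummy label at the end of the sequence is an assumption.

```python
def pad_digit_labels(digits, max_len=5, pad_label=10):
    """Pad a digit sequence to a fixed length with a dummy label (sketch).

    digits:    list of integer digit labels in 0..9 (assumed encoding).
    max_len:   fixed label length; 5 matches the longest SVHN number.
    pad_label: placeholder class; 10 as described in the text.
    """
    if len(digits) > max_len:
        raise ValueError("sequence is longer than max_len")
    return list(digits) + [pad_label] * (max_len - len(digits))


padded = pad_digit_labels([3, 3, 2])
```

With fixed-length labels, the classifier has eleven output classes per step (ten digits plus the placeholder), and every image contributes the same number of steps to the loss.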

Model                              Test acc (%)
11-layer-CNN [24]                  96.04
10-layer-CNN                       95.89
DRAM [25]                          94.9
ST-CNN Single [26]                 96.30
ST-CNN Multi [26]                  96.40
standard-soft-attention [4]        96.47
modified-soft-attention            96.08
conditional-model-att1 (ours)      96.98
conditional-model-att2 (ours)      97.15
Table 2: Whole sequence recognition accuracy on SVHN dataset with trained models.

Figure 4: Attention maps from standard-soft-attention [4] (left) and conditional-model-att1 (right) trained on the SVHN dataset with bounding boxes. Above the images are the probabilities of the label predictions. Our model learns to focus on the central part of each object.

Table 2 compares the performance of our model with state-of-the-art methods in this experiment. Both conditional-model-att1 and conditional-model-att2 outperform the other methods. Compared with the existing methods, the attention feature obtained at each step helps improve the performance of our model by focusing on the object of interest while suppressing background regions. Fig. 4 visualizes the differences between the attention maps generated by our proposed attention model and by the soft attention model [4]. Interestingly, our model focuses more accurately than the soft attention model [4] on each object in the image, from left to right. In addition, the probability of the predicted label is shown at the top of each image. Note that for the third, heavily skewed image “332” in Fig. 4, our model correctly predicts all digits, while the soft attention model wrongly predicts the last digit ‘2’. We attribute this to the fact that our attention model directly computes the attention feature through the compatibility function between the local features and the conditional global feature at each time step, rather than relying on a simple attention network. Of the two versions of our model, conditional-model-att2 performs better than conditional-model-att1 since it exploits multi-scale attention; however, conditional-model-att2 requires more training time.
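
Table 2 reports whole-sequence accuracy, where an image counts as correct only if every digit is predicted correctly. A minimal sketch of such a metric is shown below; the function name `whole_sequence_accuracy` is hypothetical, and comparing the padded placeholder positions along with the digits is an assumption.

```python
import numpy as np


def whole_sequence_accuracy(pred, target):
    """Fraction of sequences whose labels all match (illustrative sketch).

    pred, target: (N, T) integer label arrays, padded to the same length.
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    return float(np.all(pred == target, axis=1).mean())


pred = [[3, 3, 2, 10, 10], [1, 9, 10, 10, 10]]
target = [[3, 3, 2, 10, 10], [1, 8, 10, 10, 10]]
acc = whole_sequence_accuracy(pred, target)
```

This is stricter than per-digit accuracy: in the toy example nine of the ten positions match, yet only one of the two sequences is counted as correct.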

4.3 Weakly Supervised Multiple Objects Recognition

To further test the robustness of our model, we designed a challenging weakly labeled multiple objects recognition task in which no ground truth bounding boxes were used to crop the digits. Instead, the original street number images were used directly for training and testing. The first column in Fig. 5 shows the original street number images, that is, the weakly labeled images in this experiment. We used the same data augmentation technique as in the previous experiment: the original SVHN images were resized and then randomly cropped. As can be seen from Table 3, although the overall accuracy of all models is not high, our attention models compare favorably: conditional-model-att2 in particular achieves improvements of 2.84, 5.56 and 9.87 percentage points over modified-soft-attention, standard-soft-attention [4] and 11-layer-CNN [24], respectively.

Model                              Test acc (%)
11-layer-CNN [24]                  70.58
standard-soft-attention [4]        74.89
modified-soft-attention            77.61
conditional-model-att1 (ours)      79.04
conditional-model-att2 (ours)      80.45
Table 3: Weakly supervised multiple recognition accuracy on SVHN dataset.

Fig. 5 visualizes the attention maps generated by our proposed attention model and by the soft attention model [4]. It clearly shows that our model correctly highlights the street numbers at each step despite the large, noisy and cluttered background, whereas the soft attention model can only coarsely identify the regions of the street numbers. For example, for the extremely noisy street number image “26” in the first row of Fig. 5, our attention model accurately predicts each digit with high confidence (0.7982 for ‘2’ and 0.8739 for ‘6’), while the soft attention model wrongly recognizes ‘6’ as ‘5’ and its prediction probabilities are low overall (0.2444 for ‘2’ and 0.1757 for ‘5’).

Figure 5: Attention maps from standard-soft-attention [4] (left) and conditional-model-att1 (right) trained on the SVHN dataset without bounding boxes. Our model learns to focus on the central part of each object.

Note that this task is very different from classical object detection, which requires accurate bounding boxes for training. In this weakly supervised recognition task, only the sequential labels can be exploited. Experiments show that our model can consistently identify the objects of interest. We therefore hypothesize that our model could be used to automatically crop regions of interest (ROI) when collecting training data for object detection, simply by sequentially memorizing the objects of interest, which would greatly reduce the burdensome manual annotation work. We leave this investigation to future work.

4.4 Evaluation for Image Caption

In our last experiment, we designed another conditional global feature based on a language model. Note that the main purpose of this evaluation is to demonstrate the generality of our conditional attention framework on the image captioning task, and thus we do not expect it to outperform state-of-the-art methods optimized for image captioning, such as the bottom-up approach [7]. However, we show that our attention model does outperform several popular models that, like ours, are based on the VGGNet architecture. We used the MSCOCO 2014 captions dataset [21] to evaluate our attention model, with the same split as [27]. To deal with captions of varying length, we truncated each caption to 20 words. The evaluation criteria are the standard automatic evaluation metrics, namely BLEU [28], METEOR [29], and CIDEr [30].

Method                 B-1    B-2    B-3    B-4    METEOR   CIDEr
Deep VS [27]           62.5   45.0   32.1   23.0   19.5     -
m-RNN [31]             67.0   49.0   35.0   25.0   -        -
Google NIC [3]         66.6   46.1   32.9   24.6   -        -
Soft-Attention [4]     70.7   49.2   34.4   24.3   23.9     -
Hard-Attention [4]     71.8   50.4   35.7   25.0   23.04    -
SCA-CNN-VGG [6]        70.5   53.3   39.7   29.8   24.2     89.7
SCA-CNN-ResNet [6]     71.9   54.8   41.1   31.1   25.0     95.2
Our attention model    70.9   54.1   40.5   30.3   23.9     89.5
Table 4: The model performance on MSCOCO test splits. Our attention model outperforms many existing models, especially on the B-2 to B-4 scores.

From Table 4, we can see that our attention model adapted for image captioning outperforms the other models in most cases. Compared with soft attention [4] in particular, our model only slightly outperforms it on B-1 but achieves large improvements on B-2, B-3, and B-4. Compared with the recently proposed SCA-CNN models [6], given the same pretrained VGGNet, our model outperforms SCA-CNN-VGG on all BLEU scores, even though SCA-CNN-VGG incorporates spatial and channel-wise attention and exploits multi-layer attention [6]. With a more powerful CNN architecture, such as ResNet, we expect our model to produce more accurate visual attention and in turn better captions, just as SCA-CNN-ResNet does. We leave this to future investigation.

Figure 6: Examples of generated captions and corresponding visual attention maps on MSCOCO with the proposed attention model adapted to image captioning. The three rows show that the attended regions are consistent with the underlined words. Detailed captions and visualizations of the attention maps are included in the Appendix.

Fig. 6 shows the visualization results of our attention model on the image captioning task. Our attention model generates meaningful captions for the images, and the computed attention maps highlight the parts relevant to the caption words in most cases. Note that the two images in the last row of Fig. 6 are also shown in [4] as failure cases. For the “giraffe” on the left side, our attention model accurately recognizes it as “giraffe”. Moreover, it is interesting that the word “couple” does correspond to two giraffes, which suggests that our model can learn a basic counting capability.

5 Conclusions

In this paper, we proposed a new conditional attention framework to tackle sequential visual tasks, such as multiple objects recognition and image captioning. By introducing the novel conditional global feature, the attention feature for each object can be produced by measuring how the local convolutional features align with the conditional global feature. Moreover, by designing the conditional global feature appropriately, the proposed conditional attention model has demonstrated its generality on different sequential visual tasks: it achieved the best performance on the SVHN benchmark with and without extra bounding boxes, and obtained better scores on the image captioning task than the popular soft attention model. We therefore believe that the new conditional attention model could also achieve promising performance on other sequential visual tasks, such as VQA. We leave this to future work.


  • [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [2] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. IEEE, 2013.
  • [3] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
  • [4] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
  • [5] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651–4659, 2016.
  • [6] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5659–5667, 2017.
  • [7] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
  • [8] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
  • [9] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
  • [10] Saumya Jetley, Nicholas A Lord, Namhoon Lee, and Philip HS Torr. Learn to pay attention. In International Conference on Learning Representations, 2018.
  • [11] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
  • [12] Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2956–2964, 2015.
  • [13] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • [14] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014.
  • [15] Preksha Nema, Mitesh Khapra, Anirban Laha, and Balaraman Ravindran. Diversity driven attention model for query-based abstractive summarization. arXiv preprint arXiv:1704.08300, 2017.
  • [16] Linfeng Song, Zhiguo Wang, and Wael Hamza. A unified query-based generative model for question generation and question answering. arXiv preprint arXiv:1709.01058, 2017.
  • [17] Paul Hongsuck Seo, Zhe Lin, Scott Cohen, Xiaohui Shen, and Bohyung Han. Progressive attention networks for visual attribute prediction. arXiv preprint arXiv:1606.02393, 2016.
  • [18] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [19] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [20] Yuval Netzer, Wang Tao, Adam Coates, Alessandro Bissacco, Wu Bo, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • [21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [22] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations(ICLR), 2015.
  • [24] Ian J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2014.
  • [25] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. In International Conference on Learning Representations(ICLR), 2015.
  • [26] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
  • [27] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
  • [28] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
  • [29] Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380, 2014.
  • [30] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
  • [31] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632, 2014.

Appendix A Appendices

A.1 Visualization of Image Captioning

Figure 7: a young man riding a skateboard down a ramp.
Figure 8: a man holding a bunch of ripe bananas.
Figure 9: a couple of elephants are standing in a field.
Figure 10: a motorcycle parked on a dirt field next to a fence.