Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention

06/26/2017, by Marcella Cornia et al.

Image captioning has recently been gaining a lot of attention thanks to the impressive achievements shown by deep captioning architectures, which combine Convolutional Neural Networks to extract image representations and Recurrent Neural Networks to generate the corresponding captions. At the same time, a significant research effort has been dedicated to the development of saliency prediction models, which can predict human eye fixations. Even though saliency information could be useful to condition an image captioning architecture, by providing an indication of what is salient and what is not, research has struggled to effectively combine these two techniques. In this work, we propose an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image during the generation of the caption, by exploiting the conditioning given by a saliency prediction model on which parts of the image are salient and which are contextual. We show, through extensive quantitative and qualitative experiments on large-scale datasets, that our model achieves superior performance with respect to captioning baselines with and without saliency, and to different state-of-the-art approaches combining saliency and captioning.


1. Introduction

A core problem in computer vision and artificial intelligence is that of building a system that can replicate the human ability of understanding a visual stimulus and describing it in natural language. Indeed, this kind of system would have a great impact on society, opening up new progress in human-machine interaction and collaboration. Recent advancements in computer vision and machine translation, together with the availability of large datasets, have made it possible to generate natural sentences describing images. In particular, deep image captioning architectures have shown impressive results in discovering the mapping between visual descriptors and words (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015; Xu et al., 2015; You et al., 2016). They combine Convolutional Neural Networks (CNNs), to extract an image representation, and Recurrent Neural Networks (RNNs), to build the corresponding sentence.

While the progress of these techniques is encouraging, the human ability in the construction and formulation of a sentence is still far from being adequately emulated in today’s image captioning systems. When humans describe a scene, they look at an object before naming it in a sentence (Griffin and Bock, 2000), and they do not focus on each region with the same intensity, as selective mechanisms attract their gaze to salient and relevant parts of the scene (Rensink, 2000). They also perceive the context through peripheral vision, so that the description of an image refers not only to the main objects in the scene and to how they relate to each other, but also to the context in which they are placed.

An intensive research effort has been carried out in the computer vision community to predict where humans look in an image. This task, called saliency prediction, was tackled in early works by defining hand-crafted features that capture low-level cues such as color and texture, or higher-level concepts such as faces, people and text (Itti et al., 1998; Borji, 2012; Judd et al., 2009). Recently, with the advent of deep neural networks and large annotated datasets, saliency prediction techniques have obtained impressive results, generating maps that are very close to the ones computed with eye-tracking devices (Cornia et al., 2016; Jetley et al., 2016; Huang et al., 2015).

Despite the encouraging progress in image captioning and visual saliency, and their close connections, these two fields of research have remained almost separate. In fact, only a few attempts have been recently presented in this direction (Sugano and Bulling, 2016; Tavakoli et al., 2017b). In particular, Sugano et al. (Sugano and Bulling, 2016) presented a gaze-assisted attention mechanism for image captioning based on human eye fixations (i.e. the static states of gaze upon a specific location). Although this strategy confirms the importance of using eye fixations, it requires gaze information from a human operator; therefore, it cannot be applied to general visual data archives, in which this information is missing. To overcome this limitation, Tavakoli et al. (Tavakoli et al., 2017b) presented an image captioning method based on saliency maps, which can be automatically predicted from the input image.

In this paper we present an approach which incorporates saliency prediction to effectively enhance the quality of image descriptions. We propose a generative Recurrent Neural Network architecture which can focus on different regions of the input image by means of an attentive mechanism. This attentive behaviour, differently from previous works (Xu et al., 2015), is conditioned by two different attention paths: the former focused on salient spatial regions, predicted by a saliency model, and the latter focused on contextual regions, which are also computed from saliency maps. Experimental results on five public image captioning datasets (SALICON, COCO, Flickr8k, Flickr30k and PASCAL-50S) demonstrate that our solution is able to properly exploit saliency cues. Also, we show that this is done without losing the key properties of the generated captions, such as their diversity and vocabulary size. By visualizing the states of both attentive paths, we finally show that the trained model has learned to attend to both salient and contextual regions during the generation of the caption, and that the attention foci produced by the network effectively correspond, step by step, to the generated words.

To sum up, our contributions are as follows. First, we show that saliency can enhance image description, as it provides an indication of what is salient and what is context. Second, we propose a model in which the classic machine attention approach is extended to incorporate two attentive paths, one for salient regions and one for context. These two paths cooperate during the generation of the caption and produce better captions according to automatic metrics, without any loss in diversity or dictionary size. Third, we qualitatively show that the trained model has learned to attend to both salient and contextual regions in an appropriate way.

2. Related work

In this section, we review the literature related to saliency prediction and image captioning. We also report some recent works which investigate the contribution of saliency for generating natural language descriptions.

2.1. Visual saliency prediction

Saliency prediction has been extensively studied by the computer vision community and, in the last few years, has achieved considerable improvement thanks to the large spread of deep neural networks (Kümmerer et al., 2015; Huang et al., 2015; Kruthiventi et al., 2016; Jetley et al., 2016; Pan et al., 2016; Cornia et al., 2016, 2017). However, a very large variety of models had been proposed before the advent of deep learning, and almost all of them were inspired by the seminal work of Itti and Koch (Itti et al., 1998), in which multi-scale low-level features extracted from the input image were linearly combined and then processed by a dynamic neural network with a winner-takes-all strategy. The same idea of properly combining different low-level features was also explored by Harel et al. (Harel et al., 2006), who defined Markov chains over various image maps and treated the equilibrium distribution over map locations as an activation. In addition to the exploitation of low-level features, several saliency models have also incorporated high-level concepts such as faces, people, and text (Judd et al., 2009; Borji, 2012; Zhang and Sclaroff, 2013). In fact, Judd et al. (Judd et al., 2009) highlighted that, when humans look at images, their gaze is attracted not only by low-level cues typical of bottom-up attention, but also by top-down image semantics. To this end, they proposed a model in which low and medium level features were effectively combined, and exploited face and people detectors to capture important high-level concepts. Nonetheless, all these techniques failed to effectively capture the wide variety of causes that contribute to defining visual saliency in images and, with the advent of deep learning, researchers have developed data-driven architectures capable of overcoming many of the limitations of hand-crafted models.

Early attempts at computing saliency maps through a neural network suffered from the absence of sufficiently large training datasets (Vig et al., 2014; Kümmerer et al., 2015; Liu et al., 2015). Vig et al. (Vig et al., 2014) proposed the first deep architecture for saliency, which was composed of only three convolutional layers. Afterwards, Kümmerer et al. (Kümmerer et al., 2015, 2016) based their models on two popular convolutional networks (AlexNet (Krizhevsky et al., 2012) and VGG-19 (Simonyan and Zisserman, 2014)), obtaining adequate results even though the network parameters were not fine-tuned on a saliency dataset. Liu et al. (Liu et al., 2015) tried to overcome the absence of large scale datasets by training their model on image patches centered on fixation and non-fixation locations, thus increasing the amount of training data.

With the arrival of the SALICON dataset (Jiang et al., 2015), which is still the largest publicly available dataset for saliency prediction, several deep architectures have moved beyond previous approaches, bringing consistent performance advances. The starting point of all these architectures is a pre-trained Convolutional Neural Network (CNN), such as VGG-16 (Simonyan and Zisserman, 2014), GoogLeNet (Szegedy et al., 2015) or ResNet (He et al., 2016), to which different saliency-oriented components are added (Cornia et al., 2016, 2017), together with different training strategies (Huang et al., 2015; Jetley et al., 2016; Cornia et al., 2017).

In particular, Huang et al. (Huang et al., 2015) compared three standard CNNs by applying them at two different image scales. In addition, they were the first to train the network using a saliency evaluation metric as loss function. Jetley et al. (Jetley et al., 2016) introduced a model which formulates a saliency map as a generalized Bernoulli distribution. Moreover, they trained their network using different loss functions which pair the softmax activation function with measures designed to compute distances between probability distributions. Tavakoli et al. (Tavakoli et al., 2017a) investigated inter-image similarities to estimate the saliency of a given image using an ensemble of extreme learners, each trained on an image similar to the input image. Kruthiventi et al. (Kruthiventi et al., 2016), instead, presented a unified framework to predict both eye fixations and salient objects.

Another saliency prediction model was recently presented by Pan et al. (Pan et al., 2017) who, following the wide diffusion of Generative Adversarial Networks, trained their model using adversarial examples. In particular, their architecture is composed of two agents: a generator, which is responsible for generating the saliency map of a given image, and a discriminator, which performs a binary classification task between generated and real saliency maps. Liu et al. (Liu and Han, 2016), instead, proposed a model which learns long-term spatial interactions and scene contextual modulation to infer image saliency, showing promising results, also thanks to the use of the powerful ResNet-50 architecture (He et al., 2016).

In contrast to all these works, we presented two different deep saliency architectures. The first one, called ML-Net (Cornia et al., 2016), effectively combines features coming from different levels of a CNN and applies a matrix of learned weights to the predicted saliency map, thus taking into account the center bias present in human eye fixations. The second one, called SAM (Cornia et al., 2017), incorporates neural attentive mechanisms which focus on the most salient regions of the input image. The core component of this model is an Attentive Convolutional LSTM that iteratively refines the predicted saliency map. Moreover, to tackle the human center bias, the network is able to learn multiple Gaussian prior maps without predefined information. Since this model achieved state-of-the-art performance, ranking at the top of different saliency prediction benchmarks, we use it in this work.

2.2. Image captioning

Recently, the automatic description of images and videos has been addressed by computer vision researchers with recurrent neural networks which, given a vectorial description of the visual content, can naturally deal with sequences of words (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015; Baraldi et al., 2017). Before deep learning models, the generation of sentences was mainly tackled by identifying visual concepts, objects and attributes which were then combined into sentences using pre-defined templates (Yao et al., 2010; Yang et al., 2011; Kulkarni et al., 2013). Another strategy was that of posing image captioning as a retrieval problem, where the closest annotated sentence in the training set was transferred to a test image, or where training captions were split into parts and then reassembled to form new sentences (Farhadi et al., 2010; Ordonez et al., 2011; Hodosh et al., 2013; Socher et al., 2014). Obviously, all these approaches limited the variety of possible outputs and could not satisfy the richness of natural language. Recent captioning models, in fact, address the generation of sentences as a machine translation problem, in which a visual representation of the image coming from a convolutional network is translated into a language counterpart through a recurrent neural network.

One of the first models based on this idea is that proposed by Karpathy et al. (Karpathy and Fei-Fei, 2015), in which sentence snippets are aligned to the visual regions that they describe through a multimodal embedding. After that, these correspondences are treated as training data for a multimodal recurrent neural network which learns to generate the corresponding sentences. Vinyals et al. (Vinyals et al., 2015), instead, developed an end-to-end model trained to maximize the likelihood of the target sentence given the input image. Xu et al. (Xu et al., 2015) introduced an approach to image captioning which incorporates a form of machine attention, by which a generative LSTM can focus on different regions of the image while generating the corresponding caption. They proposed two different versions of their model: the first one, called “Soft Attention”, is trained in a deterministic manner using standard backpropagation techniques, while the second one, called “Hard Attention”, is trained by maximizing a variational lower bound through the reinforcement learning paradigm.

Johnson et al. (Johnson et al., 2016) addressed the task of dense captioning, which jointly localizes and describes salient image regions in natural language. This task generalizes object detection, when descriptions consist of a single word, and image captioning, when a single predicted region covers the full image. You et al. (You et al., 2016) proposed a semantic attention model in which, given an image, a convolutional neural network extracts top-down visual features and at the same time detects visual concepts such as regions, objects and attributes. The image features and the extracted visual concepts are combined through a recurrent neural network that finally generates the image caption. Differently from previous works which aim at predicting a single caption, Krause et al. (Krause et al., 2017) introduced the generation of entire paragraphs for describing images. Finally, Shetty et al. (Shetty et al., 2017) employed adversarial training to change the training objective of the caption generator from reproducing ground-truth captions to generating a set of captions that is indistinguishable from human-generated captions.

In this paper, we are interested in demonstrating the importance of using saliency along with contextual information during the generation of image descriptions. Our solution falls in the class of neural attentive captioning architectures and, in the experimental section, we compare it against a standard attentive model built upon the Soft Attention approach presented in (Xu et al., 2015).

2.3. Visual saliency and captioning

Only a few other works have investigated the contribution of human eye fixations to the generation of image descriptions. The first work exploring this idea was proposed in (Sugano and Bulling, 2016), which presented an extension of a neural attentive captioning architecture. In particular, the proposed model incorporates human fixation points (obtained with eye-tracking devices), instead of computed saliency maps, to generate image captions. This kind of strategy mainly suffers from the need for both eye fixation and caption annotations. Currently, only the SALICON dataset (Jiang et al., 2015), being a subset of the Microsoft COCO dataset (Lin et al., 2014), is available with both human descriptions and saliency maps.

Ramanishka et al. (Ramanishka et al., 2017), instead, introduced an encoder-decoder captioning model in which spatiotemporal heatmaps are produced for predicted captions and arbitrary query sentences without explicit attention layers. They refer to these heatmaps as saliency maps, even though they are internal representations of the network, not related to human attention. Experiments showed that the gain in performance with respect to a standard attentive captioning model is not consistent, even though the computational overhead is lower.

A different approach, presented in (Tavakoli et al., 2017b), explores whether image descriptions, produced by humans or by models, agree with saliency and whether saliency can benefit image captioning. To this end, the authors proposed a captioning model in which image features are boosted with the corresponding saliency map, exploiting a moving sliding window and mean pooling as aggregation strategies. Comparisons with a no-saliency baseline did not show significant improvements (especially on the Microsoft COCO dataset).

In this paper, we instead aim at enhancing image captions by directly incorporating saliency maps in a neural attentive captioning architecture. Differently from previous models that exploit human fixation points, we obtain a more general architecture which can potentially be trained on any image captioning dataset and can predict captions for any input image. In our model, the machine attentive process is split into two different and unrelated paths, one for salient regions and one for context. We demonstrate through extensive experiments that the incorporation of saliency and context can enhance image captioning on several standard datasets.

Figure 1. Ground-truth semantic segmentation and saliency predictions from our model (Cornia et al., 2017) on sample images from Pascal-Context (Mottaghi et al., 2014) (first row), Cityscapes (Cordts et al., 2016) (second row) and LIP (Gong et al., 2017) (last row).

3. What is hit by saliency?

Human gazes are attracted by both low-level cues, such as color, contrast and texture, and high-level concepts, such as faces and text (Judd et al., 2009; Bylinskii et al., 2016). Current state-of-the-art saliency prediction methods, thanks to the use of deep networks and large-scale datasets, are able to effectively incorporate all these factors and predict saliency maps which are very close to those obtained from human eye fixations (Cornia et al., 2017). In this section we qualitatively investigate which parts of an image are actually hit or ignored by saliency models, by jointly analyzing saliency and semantic segmentation maps. This motivates the need for using saliency predictions as an additional conditioning for captioning models.

To compute saliency maps, we employ the approach in (Cornia et al., 2017), which has shown good results on popular saliency benchmarks, such as the MIT Saliency Benchmark (Bylinskii et al., 2017) and the SALICON dataset (Jiang et al., 2015), and which also won the LSUN Challenge in 2017. It is worth mentioning, however, that the qualitative conclusions of this section apply to any state-of-the-art saliency model.

Since semantic segmentation algorithms are not always completely accurate, we perform the analysis on three semantic segmentation datasets in which regions have been segmented by human annotators: Pascal-Context (Mottaghi et al., 2014), Cityscapes (Cordts et al., 2016) and the Look into Person (LIP) (Gong et al., 2017) dataset. While the first one contains natural images without a specific target, the other two are focused, respectively, on urban streets and human body parts. In particular, Pascal-Context provides additional annotations for the Pascal VOC 2010 dataset (Everingham et al., 2010), which contains training, validation and testing images; it goes beyond the original Pascal semantic segmentation task by providing annotations for the whole scene, using a large set of different labels. The Cityscapes dataset, instead, is composed of a set of video sequences recorded in street scenes from different cities: it provides high quality pixel-level annotations for a subset of frames and coarse annotations for the remaining ones, and is annotated with street-specific classes such as car, road and traffic sign. Finally, the LIP dataset is focused on the semantic segmentation of people and provides a large number of images annotated with semantic human part labels; images contain person instances cropped from the Microsoft COCO dataset (Lin et al., 2014) and are split into training, validation and testing sets. For our analyses we only consider training and validation images for the Pascal-Context and LIP datasets, and the pixel-level annotated frames for the Cityscapes dataset. Figure 1 shows, for some sample images, the predicted saliency map and the corresponding semantic segmentation on the three datasets.

We first investigate which are the most and the least salient classes for each dataset. Since some semantic classes have a low number of occurrences with respect to the total number of images, we only consider relevant semantic classes, i.e. classes with at least a minimum number of occurrences; due to the different dataset sizes, this minimum is set to different values for the Pascal-Context and LIP datasets and for the Cityscapes dataset. To count the number of times that the predicted saliency hits a semantic class, we binarize each map by thresholding its pixel values. A low threshold value leads to a binarized map with dilated salient regions, while a high threshold creates small salient regions around the fixation points. For this reason, we use two different threshold values to analyze the most and the least salient classes: a low threshold to find the least salient classes for each dataset, and a high one to find the most salient classes.
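
As a concrete illustration of this counting procedure, the sketch below binarizes predicted saliency maps at a given threshold and measures how often each semantic class is hit. The function name, the hit criterion (at least one salient pixel inside the class mask) and the data layout are assumptions of ours, not details taken from the paper.

```python
import numpy as np

def class_hit_rates(saliency_maps, segmentation_maps, threshold):
    """Fraction of images in which binarized saliency hits each semantic class.

    saliency_maps:     list of 2D float arrays with values in [0, 1]
    segmentation_maps: list of 2D integer arrays of class labels
    threshold:         scalar in [0, 1]; low values dilate salient regions,
                       high values shrink them around the fixation points
    """
    hits, totals = {}, {}
    for sal, seg in zip(saliency_maps, segmentation_maps):
        binary = sal >= threshold
        for cls in np.unique(seg):
            mask = seg == cls
            totals[cls] = totals.get(cls, 0) + 1
            # a class counts as "hit" if any of its pixels is salient (assumed criterion)
            if np.logical_and(binary, mask).any():
                hits[cls] = hits.get(cls, 0) + 1
    return {cls: hits.get(cls, 0) / totals[cls] for cls in totals}
```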

Figure 5. Most salient classes on the Pascal-Context (Mottaghi et al., 2014), Cityscapes (Cordts et al., 2016) and LIP (Gong et al., 2017) datasets.

Figure 9. Least salient classes on the Pascal-Context (Mottaghi et al., 2014), Cityscapes (Cordts et al., 2016) and LIP (Gong et al., 2017) datasets.

Figures 5 and 9 show the most and the least salient classes in terms of the percentage of times that saliency hits a region belonging to a class. As can be seen, the distributions differ depending on the considered dataset. For example, on Pascal-Context the most salient classes are animals (such as cats, dogs and birds), people and vehicles (such as airplanes and cars), while the least salient ones turn out to be ceiling, floor and light. On the Cityscapes dataset, cars are by far the most salient class in terms of the percentage of times they are hit by saliency, with all other classes lagging well behind. On the LIP dataset, the most salient classes are all human body parts in the upper body, while the least salient ones are all in the lower body; as expected, faces are the parts most frequently hit by saliency. As a general pattern, the most important or visible objects in the scene are hit by saliency, while objects in the background and the context of the image itself are usually ignored. This leads to the hypothesis that both salient and non-salient regions are important to generate the description of an image, given that we generally want the context to be included in the caption, and that the distinction between salient regions and context, given by a saliency prediction model, can improve captioning results.

We also investigate the existence of a relation between the size of an object and its saliency values. In Figure 13, we plot the joint distribution of object sizes and saliency values on the three datasets, where the size of an object is simply computed as the number of its pixels normalized by the size of the image. As can be seen, most of the low-saliency instances are small; however, high saliency values concentrate on small objects as well as on large ones. In summary, the size of an object is not always proportional to its saliency, so the importance of an object cannot be assessed by simply looking at its size. In the image captioning scenario that we want to tackle, larger objects correspond to larger activations in the last layers of a convolutional architecture, while smaller objects correspond to smaller activations. Since salient and non-salient regions can have comparable activations, the supervision given by a saliency prediction model on whether a pixel belongs to a salient region can be beneficial during the generation of the caption.
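
The size-versus-saliency analysis above can be reproduced with a few lines of NumPy. The sketch below collects one (relative size, saliency) pair per annotated instance, where the per-instance saliency is taken as the mean saliency over the instance mask; this aggregation choice is ours rather than one stated in the paper.

```python
import numpy as np

def size_saliency_pairs(saliency_map, instance_map):
    """Collect (relative size, mean saliency) for each annotated object instance.

    saliency_map: 2D float array with values in [0, 1]
    instance_map: 2D integer array with one id per object instance (0 = unlabeled)
    """
    pairs = []
    num_pixels = instance_map.size
    for obj_id in np.unique(instance_map):
        if obj_id == 0:
            continue
        mask = instance_map == obj_id
        size = mask.sum() / num_pixels        # object size normalized by image size
        saliency = saliency_map[mask].mean()  # assumed aggregation over the mask
        pairs.append((size, saliency))
    return pairs
```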

Figure 13. Distribution of object sizes and saliency values on the Pascal-Context (Mottaghi et al., 2014), Cityscapes (Cordts et al., 2016) and LIP (Gong et al., 2017) datasets (best seen in color).

4. Saliency and Context Aware Attention

Following the qualitative findings of the previous section, we develop a model in which saliency is exploited to enhance image captioning. Here, a generative recurrent neural network is conditioned, step by step, on salient spatial regions, predicted by a saliency model, and on contextual features which account for the role of non-salient regions in the generation of the caption. In the following, we describe the overall model. An overview is presented in Figure 14.

Each input image is firstly encoded through a Fully Convolutional Network, which provides a stack of high-level features arranged on a spatial grid, each corresponding to a spatial location of the image. At the same time, we extract a saliency map for the input image using the model in (Cornia et al., 2017), and downscale it to fit the spatial size of the convolutional features, so as to obtain a spatial grid of saliency values s_i ∈ [0, 1], one per location. Correspondingly, we also define a spatial grid of contextual values ctx_i = 1 - s_i. Under the model, visual features at different locations will be selected or inhibited according to their saliency value.
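
The sketch below shows one way of bringing the saliency map onto the feature grid in PyTorch. The bilinear resampling, the definition of the contextual grid as the complement of the salient one, and the function name reflect our reading of the model and are not taken verbatim from the paper.

```python
import torch
import torch.nn.functional as F

def saliency_grids(saliency_map, feature_hw):
    """Downscale a saliency map to the spatial size of the CNN feature grid.

    saliency_map: tensor of shape (1, 1, H, W) with values in [0, 1]
    feature_hw:   (H', W') spatial size of the convolutional features
    Returns the grid of salient values s and the contextual grid ctx = 1 - s,
    flattened to one value per feature location.
    """
    s = F.interpolate(saliency_map, size=feature_hw, mode='bilinear',
                      align_corners=False)
    s = s.flatten(2)      # shape (1, 1, H' * W')
    ctx = 1.0 - s         # contextual grid as the complement of saliency
    return s, ctx
```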

The generation of the caption is carried out word-by-word by feeding and sampling words from an LSTM layer, which, at every timestep, is conditioned on features extracted from the input image and on the saliency map. Formally, the behaviour of the generative LSTM is driven by the following equations:

i_t = σ(W_i w_t + U_i h_{t-1} + V_i z_t + b_i)    (1)
f_t = σ(W_f w_t + U_f h_{t-1} + V_f z_t + b_f)    (2)
o_t = σ(W_o w_t + U_o h_{t-1} + V_o z_t + b_o)    (3)
g_t = φ(W_g w_t + U_g h_{t-1} + V_g z_t + b_g)    (4)
m_t = f_t ⊙ m_{t-1} + i_t ⊙ g_t    (5)
h_t = o_t ⊙ φ(m_t)    (6)

where, at each timestep t, z_t denotes the visual features extracted from the input image by considering the map of salient regions s and that of contextual regions ctx, w_t is the input word, and h_t and m_t are respectively the internal state and the memory cell of the LSTM. ⊙ denotes the element-wise Hadamard product, σ is the sigmoid function, φ is the hyperbolic tangent tanh, the W, U and V matrices are learned weights, and the b vectors are learned biases.
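
A minimal PyTorch sketch of one step of this LSTM, following Eqs. 1-6 above; the parameter dictionary and its key names are placeholders of ours rather than details from the paper.

```python
import torch

def lstm_step(w_t, h_prev, m_prev, z_t, params):
    """One timestep of the generative LSTM conditioned on the visual feature z_t.

    w_t, h_prev, m_prev, z_t: tensors of shape (batch, dim)
    params: dict of weight matrices 'W_*', 'U_*', 'V_*' and biases 'b_*'
    """
    def gate(name, act):
        return act(w_t @ params['W_' + name].T
                   + h_prev @ params['U_' + name].T
                   + z_t @ params['V_' + name].T
                   + params['b_' + name])

    i_t = gate('i', torch.sigmoid)   # input gate, Eq. 1
    f_t = gate('f', torch.sigmoid)   # forget gate, Eq. 2
    o_t = gate('o', torch.sigmoid)   # output gate, Eq. 3
    g_t = gate('g', torch.tanh)      # candidate memory, Eq. 4
    m_t = f_t * m_prev + i_t * g_t   # memory cell update, Eq. 5
    h_t = o_t * torch.tanh(m_t)      # internal state, Eq. 6
    return h_t, m_t
```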

Figure 14. Overview of the proposed model. Two different attention paths are built for salient regions and contextual regions, to help the model build captions which describe both components (best seen in color).

To provide the generative network with visual features, we draw inspiration from the machine attention literature (Xu et al., 2015) and compute the fixed-length feature vector z_t as a linear combination of the spatial features a_i with time-varying weights α_{ti}, normalized over the spatial extent via a softmax operator:

z_t = Σ_i α_{ti} a_i    (7)
α_{ti} = exp(e_{ti}) / Σ_k exp(e_{tk})    (8)

At each timestep the attention mechanism selects a region of the image, based on the previous LSTM state, and feeds it to the LSTM, so that the generation of a word is conditioned on that specific region, instead of being driven by the entire image.

Ideally, we want the weight α_{ti} to be aware of the saliency and contextual value of location i, and to be conditioned on the current status of the LSTM, which can be well encoded by its internal state h_{t-1}. In this way, the generative network can focus on different locations of the input image according to whether they belong to a salient or a contextual region, and to the current generation state. Of course, simply multiplying attention weights with saliency values would result in a loss of context, which is fundamental for caption generation. We instead split attention weights into two contributions, one for salient regions and one for context regions, and employ two different fully connected networks to learn the two contributions (Figure 14). Conceptually, this is equivalent to building two separate attention paths, one for salient regions and one for contextual regions, which are merged to produce the final attention. Overall, the model obeys the following equation:

e_{ti} = s_i · e^{sal}_{ti} + ctx_i · e^{ctx}_{ti}    (9)

where e^{sal}_{ti} and e^{ctx}_{ti} are, respectively, the attention weights for salient and context regions. Attention weights for saliency and context are computed as follows:

e^{sal}_{ti} = v_{sal}^T φ(W_{sal} a_i + U_{sal} h_{t-1} + b_{sal})    (10)
e^{ctx}_{ti} = v_{ctx}^T φ(W_{ctx} a_i + U_{ctx} h_{t-1} + b_{ctx})    (11)

Notice that our model learns different weights for saliency and contextual regions, and combines them into a final attentive map in which the contributions of salient and non-salient regions are merged together. Similarly to the classical Soft Attention approach (Xu et al., 2015), the proposed generative LSTM can focus on every region of the image, but the attentive process is aware of the saliency of each location, so that the focus on salient and contextual regions is driven by the output of the saliency predictor.
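
The following sketch illustrates the two-path attention of Eqs. 7-11 in PyTorch. The two scoring networks are passed in as generic callables, and their internal structure (a small feed-forward network over the concatenation of a_i and h_{t-1}) is an assumption of this sketch.

```python
import torch

def saliency_context_attention(a, h_prev, s, score_sal, score_ctx):
    """Compute the visual input z_t by merging the saliency and context paths.

    a:        (batch, L, d) spatial CNN features, one per location
    h_prev:   (batch, h)    previous LSTM internal state
    s:        (batch, L)    saliency values in [0, 1], one per location
    score_sal, score_ctx:   callables mapping [a_i, h_prev] to a scalar score
    """
    h_exp = h_prev.unsqueeze(1).expand(-1, a.size(1), -1)
    inputs = torch.cat([a, h_exp], dim=-1)
    e_sal = score_sal(inputs).squeeze(-1)        # saliency path scores, Eq. 10
    e_ctx = score_ctx(inputs).squeeze(-1)        # context path scores, Eq. 11
    e = s * e_sal + (1.0 - s) * e_ctx            # merge the two paths, Eq. 9
    alpha = torch.softmax(e, dim=1)              # normalized weights, Eq. 8
    z_t = (alpha.unsqueeze(-1) * a).sum(dim=1)   # attended feature, Eq. 7
    return z_t, alpha
```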

4.1. Sentence generation

Words are encoded with one-hot vectors having size equal to that of the vocabulary, and are then projected into an embedding space via a learned linear transformation. Because sentences have different lengths, they are also marked with special begin-of-string and end-of-string tokens, to keep the model aware of the beginning and end of a particular sentence.

Given an image I and a sentence S = (w_1, ..., w_N), encoded with one-hot vectors, the generative LSTM is conditioned step by step on the first t words of the caption, and is trained to produce the next word. The objective function which we optimize is the log-likelihood of correct words over the sequence:

log p(S | I; θ) = Σ_{t=1}^{N} log p(w_t | w_1, ..., w_{t-1}, I; θ)    (12)

where θ are all the parameters of the model. The probability of a word is modeled via a softmax layer applied on the output of the LSTM. To reduce the dimensionality, a linear embedding transformation is used to project one-hot word vectors into the input space of the LSTM and, vice versa, to project the output of the LSTM to the dictionary space:

p(w_t | w_1, ..., w_{t-1}, I) = softmax(W_p h_t)    (13)

where W_p is a matrix for transforming the LSTM output space to the word space and h_t is the output of the LSTM.

At test time, the LSTM is given a begin-of-string tag as input for the first timestep, then the most probable word according to the predicted distribution is sampled and given as input for the next timestep, until an end-of-string tag is predicted.
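
A sketch of this greedy decoding loop follows; the step_fn interface wrapping the attentive LSTM, the projection layer and the token indices are placeholders of ours.

```python
import torch

def greedy_decode(step_fn, embed, out_proj, bos_idx, eos_idx, max_len=20):
    """Greedily generate a caption, one word per timestep.

    step_fn:  callable (word_embedding, state) -> (h_t, new_state), wrapping
              the attentive LSTM step (hypothetical interface)
    embed:    word embedding layer
    out_proj: linear layer projecting h_t to vocabulary logits
    """
    words, state = [bos_idx], None
    for _ in range(max_len):
        w_t = embed(torch.tensor([words[-1]]))
        h_t, state = step_fn(w_t, state)
        probs = torch.softmax(out_proj(h_t), dim=-1)
        next_word = int(probs.argmax(dim=-1))   # most probable word
        if next_word == eos_idx:
            break
        words.append(next_word)
    return words[1:]   # drop the begin-of-string token
```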

5. Experimental evaluation

In this section we perform qualitative and quantitative experiments to validate the effectiveness of the proposed model with respect to different baselines and other saliency-boosted captioning methods. First, we describe datasets and metrics used to evaluate our solution and provide implementation details.

5.1. Datasets and metrics

To validate the effectiveness of the proposed Saliency and Context aware Attention, we perform experiments on five popular image captioning datasets: SALICON (Jiang et al., 2015), Microsoft COCO (Lin et al., 2014), Flickr8k (Hodosh et al., 2013), Flickr30k (Young et al., 2014), and PASCAL-50S (Vedantam et al., 2015).

Microsoft COCO is composed of a large set of images divided into training and validation sets, where each image is provided with at least five sentences generated by using Amazon Mechanical Turk. SALICON is a subset of it, created for the visual saliency prediction task. Since its images come from the Microsoft COCO dataset, at least five captions for each image are available; its training, validation and testing images come with eye fixations simulated through mouse movements. In our experiments, we only use the training and validation sets of both datasets. The Flickr8k and Flickr30k datasets are composed of roughly 8,000 and 30,000 images respectively, and both come with five annotated sentences for each image. In our experiments, we randomly choose held-out validation and test images for each of these two datasets. The PASCAL-50S dataset provides additional annotations for the UIUC PASCAL sentence dataset (Rashtchian et al., 2010): it is composed of images from the PASCAL-VOC dataset, each annotated with 50 human-written sentences instead of the 5 available in the original dataset. Due to the limited number of samples, and for a fair comparison with other captioning methods, we first pre-train the model on the Microsoft COCO dataset and then test it on the images of this dataset without any specific fine-tuning.

For evaluation, we employ four automatic metrics which are commonly used in image captioning: BLEU (Papineni et al., 2002), ROUGE_L (Lin, 2004), METEOR (Banerjee and Lavie, 2005) and CIDEr (Vedantam et al., 2015). BLEU is a modified form of n-gram precision that compares a candidate translation against multiple reference translations; we evaluate our predictions with BLEU using mono-grams, bi-grams, tri-grams and four-grams. ROUGE_L computes an F-measure based on the longest co-occurring in-sequence n-grams. METEOR, instead, is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision; it also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. CIDEr, finally, computes the average cosine similarity between n-grams found in the generated caption and those found in reference sentences, weighting them using TF-IDF. To ensure a fair evaluation, we use the Microsoft COCO evaluation toolkit¹ to compute all scores.

¹ https://github.com/tylin/coco-caption
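
For reference, a hedged usage sketch of the pycocoevalcap package bundled with that toolkit; the scorer classes and the compute_score interface are as we understand them, and gts/res are dicts mapping image ids to lists of tokenized captions.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def evaluate_captions(gts, res):
    """gts/res: dicts {image_id: [caption string, ...]} after tokenization."""
    results = {}
    for scorer, name in [(Bleu(4), 'BLEU'), (Meteor(), 'METEOR'),
                         (Rouge(), 'ROUGE_L'), (Cider(), 'CIDEr')]:
        score, _ = scorer.compute_score(gts, res)   # BLEU returns one value per n-gram order
        results[name] = score
    return results
```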

5.2. Implementation details

Each image is encoded through a convolutional network, which computes a stack of high-level features. We employ the popular ResNet-50 (He et al., 2016), trained on the ImageNet dataset (Russakovsky et al., 2015), to compute the feature maps over the input image. In particular, ResNet-50 is composed of 49 convolutional layers, divided into 5 convolutional blocks, and 1 fully connected layer. Since we want to maintain the spatial dimensions, we extract the feature maps from the last convolutional layer and ignore the fully connected layer. The output of the ResNet model is a tensor with 2048 channels. To limit the number of feature maps and the number of learned parameters, we feed this tensor into an additional convolutional layer with a smaller number of filters, followed by a ReLU activation function. Differently from the weights of the ResNet-50, which are kept fixed, the weights of this last convolutional layer are initialized according to (Glorot and Bengio, 2010) and fine-tuned on the considered datasets. In the LSTM, following the initialization proposed in (Bahdanau et al., 2015), the weight matrices applied to the inputs are initialized by sampling each element from a zero-mean Gaussian distribution, while the weight matrices applied to the internal states are initialized with the orthogonal initialization. The vectors v_{sal} and v_{ctx}, as well as all bias vectors, are instead initialized to zero.
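
A sketch of this feature extraction pipeline in PyTorch/torchvision; the number of filters and the kernel size of the added convolution, as well as the input resolution, are placeholders and not the paper's exact values.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# ResNet-50 up to its last convolutional block (average pooling and the fully
# connected layer removed), so that the spatial grid of features is preserved.
resnet = models.resnet50(pretrained=True)
backbone = nn.Sequential(*list(resnet.children())[:-2])
for p in backbone.parameters():
    p.requires_grad = False                    # ResNet weights are kept fixed

# Trainable reduction layer: 512 filters and a 1x1 kernel are assumed values.
reduce_dim = nn.Sequential(
    nn.Conv2d(2048, 512, kernel_size=1),
    nn.ReLU(inplace=True),
)
nn.init.xavier_uniform_(reduce_dim[0].weight)  # Glorot initialization

image = torch.zeros(1, 3, 480, 480)            # placeholder input resolution
features = reduce_dim(backbone(image))         # shape (1, 512, H', W')
```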

To predict the saliency map for each input image, we exploit our Saliency Attentive Model (SAM) (Cornia et al., 2017), which is able to predict accurate saliency maps according to different saliency benchmarks. We note, however, that we do not expect a significant performance variation when using other state-of-the-art saliency methods.

Dataset Model B@1 B@2 B@3 B@4 METEOR ROUGE CIDEr
SALICON Soft Attention 69.0 50.9 36.1 25.4 22.5 49.9 70.8
Saliency+Context Attention 69.2 51.4 37.2 26.9 22.9 50.4 73.3
COCO Soft Attention 70.6 53.0 38.3 27.5 24.3 51.8 87.9
Saliency+Context Attention 70.8 53.6 39.1 28.4 24.8 52.1 89.8
Flickr8k Soft Attention 59.9 41.8 27.9 18.2 19.8 45.0 47.7
(Validation) Saliency+Context Attention 62.8 44.5 30.2 19.9 20.3 46.5 50.1
Flickr8k Soft Attention 61.0 43.2 29.6 20.1 20.8 46.5 53.2
(Test) Saliency+Context Attention 63.5 45.6 31.5 21.2 21.1 47.5 54.1
Flickr30k Soft Attention 61.9 43.3 29.7 20.2 19.9 44.8 43.2
(Validation) Saliency+Context Attention 61.3 43.3 30.1 20.9 20.2 45.0 44.5
Flickr30k Soft Attention 61.9 43.4 29.9 20.5 19.8 44.5 44.2
(Test) Saliency+Context Attention 61.5 43.8 30.5 21.3 20.0 45.2 46.4
PASCAL-50S Soft Attention 82.4 70.0 57.0 45.1 32.8 65.9 70.4
Saliency+Context Attention 82.4 70.2 57.5 45.7 32.9 66.3 70.7
Table 1. Image captioning results. The conditioning of saliency and context (Saliency+Context Attention) enhances the generation of the caption with respect to the traditional machine attention mechanism. Soft Attention here indicates our reimplementation of (Xu et al., 2015), using the same visual features of our model.

As mentioned, we perform experiments on five different datasets. For the SALICON dataset, whose images all have the same size, we keep the original image size; for all other datasets, which are composed of images with different sizes, we resize all images to a fixed input size. Since saliency maps are exploited inside the proposed saliency and context attention model, we also resize them, for each dataset, to match the spatial size of the resulting convolutional feature maps.

All experiments are performed using the Adam optimizer (Kingma and Ba, 2014) with Nesterov momentum (Sutskever et al., 2013), with a fixed initial learning rate and batch size. The LSTM hidden state dimension and the word embedding size are also kept fixed. For all datasets, we choose a vocabulary size equal to the number of words which appear at least 5 times in training and validation captions.
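
A small sketch of the vocabulary construction described above; the whitespace tokenization and the special tokens are our assumptions.

```python
from collections import Counter

def build_vocabulary(captions, min_count=5):
    """Keep words appearing at least `min_count` times in the given captions."""
    counts = Counter(word for caption in captions
                     for word in caption.lower().split())   # naive tokenization
    words = sorted(w for w, c in counts.items() if c >= min_count)
    # assumed special tokens for begin-of-string, end-of-string and unknown words
    vocab = ['<bos>', '<eos>', '<unk>'] + words
    return {w: i for i, w in enumerate(vocab)}
```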

5.3. Quantitative results and comparisons with baselines

To assess the performance of our method, and to investigate the hypotheses behind it, we first compare with the classic Soft Attention approach, and we then build three baselines in which saliency is used to condition the generative process.

Soft Attention (Xu et al., 2015): The visual input to the LSTM is computed via the Soft Attention mechanism to attend to different locations of the image, without considering salient and non-salient regions. A single feed-forward network is in charge of producing the attention values, which can be obtained by replacing Eq. 9 with

e_{ti} = v_a^T φ(W_a a_i + U_a h_{t-1} + b_a)    (14)

This approach is equivalent to the one proposed in (Xu et al., 2015), although some implementation details are different. In order to achieve a fair evaluation, we use activations from the ResNet-50 model instead of the VGG-19, and we do not include the doubly stochastic regularization trick. For this reason, the numerical results that we report are not directly comparable with those in the original paper (ours are generally higher than the original ones).

Saliency pooling: Visual features from the CNN are multiplied at each location by the corresponding saliency value and then summed, without any attention mechanism. In this case the visual input of the LSTM is not time dependent, and salient regions are given more focus than non-salient ones. Compared with Eq. 7, it can be seen as a variation of the Soft Attention in which the network always focuses on salient regions:

z = Σ_i s_i a_i    (15)
Dataset Model B@1 B@2 B@3 B@4 METEOR ROUGE CIDEr
SALICON Saliency pooling 66.1 47.8 33.7 24.0 21.1 47.9 62.4
Attention on saliency 68.8 51.3 37.0 26.5 22.7 50.1 71.3
Saliency+Cont. Att. (weight sh.) 68.9 51.3 36.8 26.3 22.6 50.2 71.4
Saliency+Context Attention 69.2 51.4 37.2 26.9 22.9 50.4 73.3
COCO Saliency pooling 68.6 50.9 36.3 25.8 23.3 50.2 81.4
Attention on saliency 70.4 53.2 38.6 27.6 24.1 51.6 86.6
Saliency+Cont. Att. (weight sh.) 70.4 53.1 38.8 28.2 24.7 52.1 89.4
Saliency+Context Attention 70.8 53.6 39.1 28.4 24.8 52.1 89.8
Saliency pooling 56.1 37.7 24.3 15.6 18.3 42.8 37.0
Flickr8k Attention on saliency 58.7 40.4 26.8 17.6 19.7 45.1 44.7
(Validation) Saliency+Cont. Att. (weight sh.) 62.0 43.9 29.6 19.8 20.2 45.7 50.2
Saliency+Context Attention 62.8 44.5 30.2 19.9 20.3 46.5 50.1
Saliency pooling 56.5 37.8 24.6 16.2 18.5 42.9 37.7
Flickr8k Attention on saliency 59.6 42.2 28.7 19.5 20.7 46.1 50.1
(Test) Saliency+Cont. Att. (weight sh.) 62.4 44.2 29.9 19.7 21.1 46.7 51.7
Saliency+Context Attention 63.5 45.6 31.5 21.2 21.1 47.5 54.1
Saliency pooling 58.7 40.5 27.1 18.4 18.3 43.0 34.2
Flickr30k Attention on saliency 63.0 44.5 30.8 21.3 19.4 44.7 43.5
(Validation) Saliency+Cont. Att. (weight sh.) 62.0 43.8 30.0 20.5 19.7 44.6 43.3
Saliency+Context Attention 61.3 43.3 30.1 20.9 20.2 45.0 44.5
Saliency pooling 58.3 40.6 27.5 18.6 18.7 43.0 36.2
Flickr30k Attention on saliency 62.5 44.2 30.5 21.0 19.6 44.9 45.0
(Test) Saliency+Cont. Att. (weight sh.) 61.7 43.7 30.0 20.4 19.6 44.2 43.1
Saliency+Context Attention 61.5 43.8 30.5 21.3 20.0 45.2 46.4
PASCAL-50S Saliency pooling 79.9 67.1 53.6 41.8 31.4 64.1 65.3
Attention on saliency 82.4 70.3 57.4 45.5 32.7 66.3 70.2
Saliency+Cont. Att. (weight sh.) 82.0 69.7 56.4 44.2 32.7 65.2 70.0
Saliency+Context Attention 82.4 70.2 57.5 45.7 32.9 66.3 70.7
Table 2. Comparison with image captioning with saliency baselines. While the use of machine attention strategies is beneficial (see Saliency pooling vs. Attention on Saliency), saliency and context are both important for captioning. The use of different attention paths for saliency and context also enhances the performance (see Saliency+Context Attention (with weight sharing) vs. Saliency+Context Attention).

Attention on saliency: This is an extension of the Soft Attention approach in which saliency is used to modulate the attention values at each location. The attention mechanism, therefore, is conditioned to attend to salient regions with higher probability, and to ignore non-salient regions:

e_{ti} = s_i · v_a^T φ(W_a a_i + U_a h_{t-1} + b_a)    (16)

Attention on saliency and context (with weight sharing): The attention mechanism is aware of salient and context regions, but the weights used to compute the attentive scores of salient and context regions are shared, excluding the v vectors. Notice that, if those were shared too, this baseline would be equivalent to the Soft Attention one:

e^{sal}_{ti} = v_{sal}^T φ(W_a a_i + U_a h_{t-1} + b_a)    (17)
e^{ctx}_{ti} = v_{ctx}^T φ(W_a a_i + U_a h_{t-1} + b_a)    (18)
e_{ti} = s_i · e^{sal}_{ti} + ctx_i · e^{ctx}_{ti}    (19)

It is also straightforward to notice that our proposed approach is equivalent to this last baseline, without weight sharing.

In Table 1 we first compare the performance of our method with that of the Soft Attention approach, to assess the superiority of our proposal with respect to the published state of the art. We report results on all the datasets, both on validation and test sets, with respect to all the automatic metrics described in Section 5.1. As can be seen, the proposed approach consistently outperforms the Soft Attention approach by a significant margin, thus experimentally confirming the benefit of having two separate attention paths, one for salient and one for non-salient regions, and the role of saliency as a conditioning for captioning. In particular, on the METEOR metric, the relative improvement ranges from a smaller gain on PASCAL-50S to a larger one on the Flickr8k validation set.

In Table 2, instead, we compare our approach with the three baselines that incorporate saliency. Firstly, it can be observed that the Saliency pooling baseline usually performs worse than Soft Attention, demonstrating that always attending to salient locations is not sufficient to achieve good captioning results. When plugging in attention, as in the Attention on saliency baseline, numerical results get a bit higher thanks to a time-dependent attention, but remain far from the performance achieved by the complete model. It can also be noticed that, even though this baseline does not take the context into account, it sometimes achieves better results than the Soft Attention model (as in the case of SALICON, with respect to the METEOR metric). Finally, we notice that the baseline with attention on saliency and context, with weight sharing, is better than Attention on saliency, further confirming the benefit of including the context. Having two completely separate attention paths, as in our model, is nonetheless important, as demonstrated by the numerical results of this last baseline with respect to those of our method.

Dataset Model B@4 METEOR ROUGE CIDEr
SALICON Sugano et al. (Sugano and Bulling, 2016) 24.5 21.9 52.4 63.8
Ours 26.9 22.9 50.4 73.3
COCO Tavakoli et al. (Tavakoli et al., 2017b) (GBVS) 28.7 23.5 51.2 84.1
Tavakoli et al. (Tavakoli et al., 2017b) (iSEEL) 28.3 23.5 50.8 83.6
Ours 28.4 24.8 52.1 89.8
Flickr30k (Test) Ramanishka et al. (Ramanishka et al., 2017) 18.3
Ours 21.3 20.0 45.2 46.4
PASCAL-50S Tavakoli et al. (Tavakoli et al., 2017b) (GBVS) 40.0 30.2 63.5 61.5
Tavakoli et al. (Tavakoli et al., 2017b) (iSEEL) 39.6 30.2 63.2 61.4
Ours 45.7 32.9 66.3 70.7
Table 3. Comparison with existing saliency-boosted captioning models.

5.4. Comparisons with other saliency-boosted captioning models

We also compare to existing captioning models that incorporate saliency during the generation of image descriptions. In particular, we compare to the model proposed in (Sugano and Bulling, 2016), which exploited human fixation points, to the work by Tavakoli et al. (Tavakoli et al., 2017b) which reports experiments on Microsoft COCO and on PASCAL-50S, and to the proposal by Ramanishka et al. (Ramanishka et al., 2017) which used convolutional activations as a proxy for saliency.

Table 3 shows the results on the considered datasets in terms of BLEU@4, METEOR, ROUGE and CIDEr. We compare our solution to both versions of the model presented in (Tavakoli et al., 2017b): the GBVS version exploits saliency maps calculated using a traditional bottom-up model (Harel et al., 2006), while the iSEEL version includes saliency maps extracted from a deep convolutional network (Tavakoli et al., 2017a).

Overall, results show that the proposed Saliency and Context Attention model outperforms the other methods on most metrics, thus confirming the strategy of including two attention paths. In particular, on the METEOR metric, we obtain a relative improvement over the competitors on the SALICON, Microsoft COCO and PASCAL-50S datasets.

5.5. Analysis of generated captions

Dataset Model Div-1 Div-2 Vocabulary % novel sent. % different sent.
SALICON Soft Attention 0.0136 0.0498 658 95.22% 95.34%
Saliency+Context Attention 0.0125 0.0549 614 93.12%
COCO Soft Attention 0.0038 0.0187 1490 81.81% 93.80%
Saliency+Context Attention 0.0037 0.0182 1444 78.02%
Flickr8k Soft Attention 0.0367 0.1026 389 98.30% 97.90%
(Validation) Saliency+Context Attention 0.0400 0.1094 411 99.30%
Flickr8k Soft Attention 0.0385 0.1041 404 98.50% 97.60%
(Test) Saliency+Context Attention 0.0419 0.1119 423 99.60%
Flickr30k Soft Attention 0.0577 0.1445 699 99.90% 98.62%
(Validation) Saliency+Context Attention 0.0565 0.1439 715 99.61%
Flickr30k Soft Attention 0.0580 0.1508 682 99.90% 98.20%
(Test) Saliency+Context Attention 0.0585 0.1549 711 99.70%
PASCAL-50S Soft Attention 0.0475 0.1379 465 97.10% 94.80%
Saliency+Context Attention 0.0468 0.1359 456 96.40%
Table 4. Statistics on vocabulary size and diversity of the generated captions. Including saliency and context in two different machine attention paths (Saliency+Context attention) produces different captions with respect to the traditional machine attention approach (Soft Attention), while preserving almost the same diversity statistics.

We further collect statistics on captions generated by our method and by the Soft Attention model, to quantitatively assess the quality of generated captions. Firstly, we define three metrics which evaluate the vocabulary size and the difference between the corpus of captions generated by the two models and the ground-truth:

  • Vocabulary size: number of unique words generated in all captions;

  • Percentage of novel sentences: percentage of generated sentences which are not seen in the training set;

  • Percentage of different sentences: percentage of images which are described differently by the two models;

Then, we measure the diversity of the set of captions generated by each of the two models via the following two metrics (Shetty et al., 2017), computed as in the sketch after the list:

  • Div-1: ratio of number of unique unigrams in a set of captions to the number of words in the same set. Higher is more diverse.

  • Div-2: ratio of number of unique bigrams in a set of captions to the number of words in the same set. Higher is more diverse.
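
A minimal sketch of the Div-n computation described above (Div-1 for n=1, Div-2 for n=2); the whitespace tokenization is an assumption.

```python
def div_n(captions, n):
    """Ratio of unique n-grams to the total number of words in a caption set."""
    ngrams, num_words = set(), 0
    for caption in captions:
        tokens = caption.lower().split()
        num_words += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(num_words, 1)   # higher means more diverse
```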

In Table 4 we compare the set of captions generated by our model with that generated by the Soft Attention baseline. Although our model features a slight reduction of the vocabulary size on SALICON, COCO and PASCAL-50S, captions generated by the two models are very often different, thus confirming that the two approaches have learned different captioning models. Moreover, the diversity and the number of novel sentences of the Soft Attention approach are almost entirely preserved.

5.6. Analysis of attentive states

Figure 15. Examples of attention weights changes between saliency and context along with the generation of captions (best seen in color). Images are from the Microsoft COCO dataset (Lin et al., 2014).

The selection of a location in our model is based on the competition between the saliency attentive path and the context attentive path (see Eq. 9). To investigate how the two paths interact and contribute to the generation of a word, in Figure 15 we report, for several images from the Microsoft COCO dataset, the changes in attention weights between the two paths. Specifically, for each image we report the average of the saliency and context attention weights at each timestep, along with a visualization of its saliency map. It is interesting to see how the model is able to correctly exploit the two attention paths when generating different parts of the caption, and how the generated words correspond in most cases to the attended regions. For example, in the case of the first image (“a group of zebras graze in a grassy field”), the saliency attentive path is more active than the context path during the generation of the words corresponding to the “group of zebras”, which are captured by saliency. Instead, when the model has to describe the context (“in a grassy field”), the saliency attentive path has lower weights with respect to the context attentive path. The same can be observed for all the reported images; it can also be noticed that generated captions tend to describe both salient objects and the context, and that usually the salient part, which is also the most important, is described before the context.

  • Ground-truth: A group of people that are standing around giraffes. Saliency+Context Attention: A group of people standing around a giraffe. Without saliency: A group of people standing around a stage with a group of people.
  • Ground-truth: A group of people at the park with some flying kites. Saliency+Context Attention: A group of people flying kites in a park. Without saliency: A group of people standing on top of a lush green field.
  • Ground-truth: A man is looking into a home refrigerator. Saliency+Context Attention: A man is looking inside of a refrigerator. Without saliency: A man is making a refrigerator in a kitchen.
  • Ground-truth: A women who is jumping on the bed. Saliency+Context Attention: A woman is jumping up in a bed. Without saliency: A woman is playing with a remote control.
  • Ground-truth: A man takes a profile picture of himself in a bathroom mirror. Saliency+Context Attention: A person taking a picture of himself in a bathroom. Without saliency: A bathroom with a sink and a sink.
  • Ground-truth: A double decker bus driving down a street. Saliency+Context Attention: A double decker bus driving down a street. Without saliency: A bus is parked on the side of the road.
  • Ground-truth: A teddy bear holding a cell phone in front of a window with a view of the city. Saliency+Context Attention: A teddy bear sitting on a chair next to a window. Without saliency: A brown dog is sitting on a laptop keyboard.
  • Ground-truth: A group of people riding down a snow covered slope. Saliency+Context Attention: A group of people riding skis down a snow covered slope. Without saliency: A group of people on skis in the snow.
  • Ground-truth: A laptop computer sitting on top of a table. Saliency+Context Attention: A laptop computer sitting on a top of a desk. Without saliency: A desk with a laptop computer and a laptop.
  • Ground-truth: A person on a motorcycle riding on a mountain. Saliency+Context Attention: A person riding a motorcycle on a road. Without saliency: A man on a bike with a bike in the background.
  • Ground-truth: A car is parked next to a parking meter. Saliency+Context Attention: A car is parked in the street next to a parking meter. Without saliency: A car parked next to a white fire hydrant.
  • Ground-truth: A plate of food and a cup of coffee. Saliency+Context Attention: A plate of food with a sandwich and a cup of coffee. Without saliency: A table with a variety of food on it.
Figure 16. Example results on the Microsoft COCO dataset (Lin et al., 2014).

5.7. Qualitative results

Finally, in Figure 16 we report some sample results on images taken from the Microsoft COCO dataset. For each image we report the corresponding saliency map and the captions generated by our model and by the Soft Attention baseline, compared to the ground-truth. It can be seen that, on average, captions generated by our model are more consistent with the corresponding image and the human-generated caption, and that, as also observed in the previous section, salient parts are described as well as the context. The incorporation of saliency and context also helps the model avoid failures due to hallucination, such as in the case of the fourth image, in which the Soft Attention model predicts a remote control which is not depicted in the image. Other failure cases, which are avoided by our model, include the repetition of words (as in the fifth image) and the failure to describe the context (first image). We speculate that the presence of two separate attention paths, which the model has learned to attend during the generation of the caption, helps to avoid such failures more effectively than the classic machine attention approach.

For completeness, some failure cases of the proposed model are reported in Figure 17. The majority of failures occur when the salient regions of the image are not described in the corresponding ground-truth caption (as in the first row), thus causing a performance loss. Some problems also arise in the presence of complex scenes (such as in the fourth image). However, we observe that the Soft Attention baseline also fails to predict correct and complete captions in these cases.

  • Ground-truth: The yellow truck passes by two people on motorcycles from opposing directions. Saliency+Context Attention: A person on a motor bike in a city. Without saliency: A man in a red shirt on a horse.
  • Ground-truth: A cityscape that is seen from the other side of the river. Saliency+Context Attention: A large building with a large clock tower in the background. Without saliency: A large building with a large clock in the water.
  • Ground-truth: A large tree situated next to a large body of water. Saliency+Context Attention: A person is sitting under a red umbrella. Without saliency: A street sign with a large tree in the middle.
  • Ground-truth: A busted fire hydrant spewing water out onto a street. Saliency+Context Attention: A person standing in a front of a large cruise ship. Without saliency: A man is standing in a dock near a large truck.
  • Ground-truth: A small airplane flying over a field filled with people. Saliency+Context Attention: A group of people walking around a large jet. Without saliency: A large group of people standing on top of a lush green field.
  • Ground-truth: The view of city buildings is seen from the river. Saliency+Context Attention: A large clock tower towering over the water. Without saliency: A large building with a large clock tower in the water.
Figure 17. Failure cases on sample images of the Microsoft COCO dataset (Lin et al., 2014).

6. Conclusion

We proposed a novel image captioning architecture which extends the machine attention paradigm by creating two attentive paths conditioned on the output of a saliency prediction model. The first one focuses on salient regions, the second on contextual regions: the overall model exploits both paths during the generation of the caption, giving more importance to salient or contextual regions as needed. The role of saliency with respect to context has been investigated by collecting statistics on semantic segmentation datasets, while the captioning model has been evaluated on large-scale captioning datasets, using standard automatic metrics and by measuring the diversity and the dictionary size of the generated corpora. Finally, the activations of the two attentive paths have been investigated, and we have shown that they correspond, word by word, to a focus on salient objects or on the context in the generated caption; moreover, we qualitatively assessed the superiority of the captions generated by our method with respect to those produced by the Soft Attention approach. Although our focus has been on demonstrating the effectiveness of saliency for captioning, rather than on outperforming captioning approaches that rely on different cues, we point out that our method can easily be incorporated into such architectures.

References

  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization.
  • Baraldi et al. (2017) Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. 2017. Hierarchical Boundary-Aware Neural Encoder for Video Captioning. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Borji (2012) Ali Borji. 2012. Boosting bottom-up and top-down visual features for saliency estimation. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Bylinskii et al. (2017) Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Frédo Durand, Aude Oliva, and Antonio Torralba. 2017. MIT Saliency Benchmark. http://saliency.mit.edu/. (2017).
  • Bylinskii et al. (2016) Zoya Bylinskii, Adrià Recasens, Ali Borji, Aude Oliva, Antonio Torralba, and Frédo Durand. 2016. Where should saliency models look next?. In European Conference on Computer Vision.
  • Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The Cityscapes Dataset for Semantic Urban Scene Understanding. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Cornia et al. (2016) Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. 2016. A Deep Multi-Level Network for Saliency Prediction. In International Conference on Pattern Recognition.
  • Cornia et al. (2017) Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. 2017. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model. arXiv preprint arXiv:1611.09571 (2017).
  • Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 2 (2010), 303–338.
  • Farhadi et al. (2010) Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics.
  • Gong et al. (2017) Ke Gong, Xiaodan Liang, Xiaohui Shen, and Liang Lin. 2017. Look into Person: Self-supervised Structure-sensitive Learning and A New Benchmark for Human Parsing. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Griffin and Bock (2000) Zenzi M Griffin and Kathryn Bock. 2000. What the eyes say about speaking. Psychological science 11, 4 (2000), 274–279.
  • Harel et al. (2006) Jonathan Harel, Christof Koch, and Pietro Perona. 2006. Graph-based visual saliency. In Advances in Neural Information Processing Systems.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Hodosh et al. (2013) Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research 47 (2013), 853–899.
  • Huang et al. (2015) Xun Huang, Chengyao Shen, Xavier Boix, and Qi Zhao. 2015. SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks. In IEEE International Conference on Computer Vision.
  • Itti et al. (1998) Laurent Itti, Christof Koch, and Ernst Niebur. 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 11 (1998), 1254–1259.
  • Jetley et al. (2016) Saumya Jetley, Naila Murray, and Eleonora Vig. 2016. End-to-End Saliency Mapping via Probability Distribution Prediction. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Jiang et al. (2015) Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. 2015. SALICON: Saliency in context. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Johnson et al. (2016) Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. Densecap: Fully convolutional localization networks for dense captioning. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Judd et al. (2009) Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba. 2009. Learning to predict where humans look. In IEEE International Conference on Computer Vision.
  • Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Krause et al. (2017) Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. 2017. A Hierarchical Approach for Generating Descriptive Image Paragraphs. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems.
  • Kruthiventi et al. (2016) Srinivas SS Kruthiventi, Vennela Gudisa, Jaley H Dholakiya, and R Venkatesh Babu. 2016. Saliency Unified: A Deep Architecture for Simultaneous Eye Fixation Prediction and Salient Object Segmentation. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Kulkarni et al. (2013) Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. 2013. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 12 (2013), 2891–2903.
  • Kümmerer et al. (2015) Matthias Kümmerer, Lucas Theis, and Matthias Bethge. 2015. DeepGaze I: Boosting saliency prediction with feature maps trained on ImageNet. In International Conference on Learning Representations Workshops.
  • Kümmerer et al. (2016) Matthias Kümmerer, Thomas SA Wallis, and Matthias Bethge. 2016. DeepGaze II: Reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563 (2016).
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In ACL Workshop on Text Summarization Branches Out.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision.
  • Liu and Han (2016) Nian Liu and Junwei Han. 2016. A Deep Spatial Contextual Long-term Recurrent Convolutional Network for Saliency Detection. arXiv preprint arXiv:1610.01708 (2016).
  • Liu et al. (2015) Nian Liu, Junwei Han, Dingwen Zhang, Shifeng Wen, and Tianming Liu. 2015. Predicting eye fixations using convolutional neural networks. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Mottaghi et al. (2014) Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. 2014. The Role of Context for Object Detection and Semantic Segmentation in the Wild. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Ordonez et al. (2011) Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems. 1143–1151.
  • Pan et al. (2017) Junting Pan, Cristian Canton, Kevin McGuinness, Noel E O’Connor, Jordi Torres, Elisa Sayrol, and Xavier Giro-i Nieto. 2017. SalGAN: Visual Saliency Prediction with Generative Adversarial Networks. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops.
  • Pan et al. (2016) Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O’Connor, and Xavier Giró-i Nieto. 2016. Shallow and Deep Convolutional Networks for Saliency Prediction. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In 40th Annual Meeting on Association for Computational Linguistics.
  • Ramanishka et al. (2017) Vasili Ramanishka, Abir Das, Jianming Zhang, and Kate Saenko. 2017. Top-down Visual Saliency Guided by Captions. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Rashtchian et al. (2010) Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon’s Mechanical Turk. In NAACL HLT Workshops.
  • Rensink (2000) Ronald A. Rensink. 2000. The Dynamic Representation of Scenes. Visual Cognition 7, 1-3 (2000), 17–42.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
  • Shetty et al. (2017) Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele. 2017. Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training. arXiv preprint arXiv:1703.10476 (2017).
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Socher et al. (2014) Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2 (2014), 207–218.
  • Sugano and Bulling (2016) Yusuke Sugano and Andreas Bulling. 2016. Seeing with humans: Gaze-assisted neural image captioning. arXiv preprint arXiv:1608.05203 (2016).
  • Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Tavakoli et al. (2017a) Hamed R Tavakoli, Ali Borji, Jorma Laaksonen, and Esa Rahtu. 2017a. Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features. Neurocomputing 244 (2017), 10–18.
  • Tavakoli et al. (2017b) Hamed R Tavakoli, Rakshith Shetty, Ali Borji, and Jorma Laaksonen. 2017b. Paying Attention to Descriptions Generated by Image Captioning Models. In IEEE International Conference on Computer Vision.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Vig et al. (2014) Eleonora Vig, Michael Dorr, and David Cox. 2014. Large-scale optimization of hierarchical features for saliency prediction in natural images. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning.
  • Yang et al. (2011) Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. In Conference on Empirical Methods in Natural Language Processing.
  • Yao et al. (2010) Benjamin Z Yao, Xiong Yang, Liang Lin, Mun Wai Lee, and Song-Chun Zhu. 2010. I2t: Image parsing to text description. Proc. IEEE 98, 8 (2010), 1485–1508.
  • You et al. (2016) Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In IEEE International Conference on Computer Vision and Pattern Recognition.
  • Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014), 67–78.
  • Zhang and Sclaroff (2013) Jianming Zhang and Stan Sclaroff. 2013. Saliency detection: A boolean map approach. In IEEE International Conference on Computer Vision.