“Two elderly ladies and a young man sitting at a table with food on it.” The previous sentence is an example of a description that one could provide when shown the image in Figure 1. Describing images in a few words, extracting the gist of the scene they depict while ignoring unnecessary details, is an easy task for humans that can in some cases be achieved from only a very brief glance at the image. In stark contrast, providing a formal algorithm for the same task is an intricate challenge that has been beyond the reach of computer vision for decades. Recently, with the availability of powerful deep neural network architectures and large scale datasets, new data-driven approaches for automatic captioning of images have been proposed, which have demonstrated intriguing performance [22, 23, 4, 12]. Although there is no suggestion that such models can in truth capture the full complexity of visual scenes, they appear to be able to produce credible captions for a variety of images. This raises the question of whether such artificial systems use strategies similar to those of the human visual system to generate those captions.
Human-like image captions by machines have puzzled the public and have given rise to the delusion that artificial intelligence (AI) rivals humans in many reasoning and perception tasks. However, fundamental challenges remain, ranging from describing attributes and reasoning about the relations in captions to identifying identities for caption generation, all of which must be addressed to avoid fabulation.
One clue as to how humans perform this task comes from the study of attention by eye-tracking. Attention mechanisms have been studied from different perspectives under the umbrella terms of visual attention (bottom-up and top-down mechanisms of attention), saliency prediction (predicting fixations), and eye movement analysis. A large number of studies in computer vision and robotics have tried to replicate these capabilities for different applications such as object detection, image thumbnailing, and human-robot interaction [2, 1]. There has been a recent trend of adapting attention mechanisms for automatic image captioning, e.g., [23, 4, 12]. In such research papers, we often find appealing visualizations of feature importance over visual features, accompanied by the phrase “mimicking human attention”. One may ask, “Is this really the same as human attention?” and “How much do such mechanisms agree with human attention while describing content?”.
In this paper, we strive to answer the aforementioned questions. We establish a basis by studying how humans pay attention to scene items under the captioning task. Our contributions are summarized as follows:
We introduce a multi-modal dataset with synchronously recorded human fixations and scene descriptions (in verbal form), which provides the largest number of instances at the moment.
We compare human attention in free-viewing with human attention under the task of describing images.
We analyze the relationship between fixations and descriptions under the image captioning task.
We investigate the similarities and differences between human attention and machine attention in captioning.
2 Related Work
2.1 Bottom-up attention and saliency prediction
Predicting where humans look in an image is a long-standing problem in computer vision, a review of which is outside the scope of this manuscript. We review some of the recent works in bottom-up attention modeling in the following. Currently, the most successful saliency prediction models rely on deep neural architectures. Deepfix combines the deep architectures of VGG, GoogleNet, and dilated convolutions to predict saliency in images. SalGAN uses an encoder-decoder architecture and proposes a binary cross entropy (BCE) loss to perform pixel-wise saliency estimation. After pre-training the encoder-decoder, it uses a Generative Adversarial Network (GAN) to boost the performance. SAM uses an LSTM network, which can attend to different salient regions in the image. Deep gaze 2 uses features from different layers of a pre-trained deep model and combines them with prior knowledge (center-bias) to predict saliency. These models try to replicate the bottom-up attention mechanisms of humans during free viewing of natural scenes.
2.2 Neural image captioning
The image captioning task can be seen as a machine translation problem, e.g. translating an image to an English sentence. This breakthrough was achieved with the help of large scale databases for image captioning (e.g. Flickr30k, MSCOCO) that include a large number of images and captions (i.e. source and target instances). Neural captioning models often consist of a deep Convolutional Neural Network (CNN) and a Long Short Term Memory (LSTM) language model, where the CNN generates the feature representation of the image and the LSTM cell acts as a language model that decodes the CNN features into text. In this article, we mainly focus on models that have attention mechanisms. Xu et al. introduced a soft-attention mechanism into this encoder-decoder approach. That is, when generating a new word based on the previously generated word and the hidden state of the language model, their model learns to put spatial weights on the visual features. Instead of re-weighting features only spatially, Chen et al. exploit spatial and channel-wise weighting. Lu et al. utilized memory to prevent the model from attending mainly to the visual content and to force it to utilize textual context as well. This is referred to as adaptive attention.
2.3 Human attention and image descriptions
In the vision community, some previous works have investigated the relationship between human attention and image captioning. Yun et al. studied the relationship between human gaze and descriptions, with gaze recorded under the free-viewing condition, in which subjects were shown an image for 3s. A separate group of participants described the image content. We will refer to their data as sbugaze. Tavakoli et al. went further and investigated the relation between machine-generated and human-generated descriptions. They looked into the contribution of boosting visual features spatially using saliency models as a proxy for bottom-up attention. Contrary to previous studies, we focus on human attention under the image captioning task and investigate the attention behaviour of humans and machines.
Prior work in the natural language processing community has investigated the relation between linguistic labels and important regions in the image by utilizing eye tracking data and image descriptions. A comparison between our data and the datasets from the natural language processing community is provided in Table 1. Our dataset has a higher number of instances and images in total, making it more suitable for vision related tasks. In contrast to existing works, we are also pursuing a different goal: understanding how well current computational attention mechanisms in captioning models align with human attention behaviour during image description.
| Dataset | # images | # subjects | # instances |
3 Data Collection
We use images from the Pascal-50s dataset. Our stimulus set includes 1,000 images (the same images as in sbugaze), 50 human-written captions per image, and annotated semantic masks with 222 semantic categories in total. The same images were also used to analyze the relation between descriptions and object importance, and the consistency of human and machine captions with respect to bottom-up attention during free-viewing of scenes (e.g. [26, 19]). We can thus compare human attention behaviour during the captioning task with that during free-viewing. We can also quantify the relation between human attention while describing content and the top-down attention mechanisms in language models, such as soft-attention.
Precise recording of human subjects’ fixations in the image captioning task requires specialized, accurate eye-tracking equipment, making crowd-sourcing impractical for this purpose. We used a Tobii X2-30 eye-tracker to record eye movements under the image captioning task in a controlled laboratory condition. The eye-tracker was positioned at the bottom of the screen of a laptop, operating with a screen resolution of . The subject’s distance from the screen was about , and the subject was asked to simultaneously look at the image and describe it verbally in one sentence. The eye-tracker and an embedded voice recorder in the computer recorded the subject’s eye movements and description synchronously for each image.
Five subjects (postgraduate students, native English speakers, 3 males and 2 females) participated in the data collection. All subjects completed the data collection for all 1,000 images in the dataset. The image presentation order was randomized across subjects. For each subject, we divided the data collection into 50 sessions of 20 images. Before each session, the eye-tracker was re-calibrated. To start a session, the subject was asked to fixate on a central red cross which appeared for s. The image was then displayed on the screen and the subject viewed and described the image. After describing the image, the subject pressed a designated button to move to the next image in the session. An example of the collected data is illustrated in Fig. 1. During the experiments, the subjects often looked at the image silently for a short while to scan the scene, then started describing the content spontaneously for several seconds.
After the data collection, we manually transcribed the collected oral descriptions into text for all images and subjects. The transcriptions were double-checked and cross-checked with the images. We use off-the-shelf part of speech (POS) tagging software to extract the nouns in the transcribed sentences. We then form a mapping from the extracted nouns to the semantic categories present in the image. For example, boys and girls are both mapped into the person category.
To check the quality of captions in our collected data, we evaluate the CIDEr and METEOR scores of the collected captions against the ground truth in the Pascal-50s database (50 sentences for each image). To ensure that the eye tracking and simultaneous voice recording have not adversely affected the quality of captions, we compared our scores with the scores of the sbugaze captions, which were collected in text form and for which eye tracking and description collection were asynchronous. Table 2 summarizes the results, showing that eye-tracking does not appear to distract the subjects and that their descriptions are rated comparably to those of the subjects in sbugaze.
In this section, we provide a detailed analysis of attention during the free-viewing and captioning tasks, the relationship between fixations and generated captions, and a study of captioning models with respect to attention.
4.1 Attention in free-viewing vs. attention in image captioning
How does attention behaviour differ between free-viewing and describing images? We first analyze the differences between these two tasks by visualizing the amount of attention center-bias and the degree of inter-observer congruency (IOC). In the free-viewing condition of the sbugaze dataset, gaze was recorded for a maximum duration of 3s, while in the image captioning task people need longer on average to look at and describe each image. To ensure that any difference in gaze locations is not solely due to viewing duration, we divide the visual attention in the image captioning task into two cases: fixations during the first 3s (cap3s), and fixations during the full viewing duration (cap).
The visual attention difference between free viewing and image captioning is shown in Figs. 2 and 3. Visual attention in free viewing is more focused on the central part of the image (i.e. high centre-bias); while attention under the image captioning task has higher dispersion over the whole duration of the task.
Table 3 reports the Inter Observer Congruency (IOC). To compute IOC, we leave the fixations of one subject out and compute their congruency with the fixations of the other subjects using the AUC-Judd evaluation score. The results show that the IOC is higher under the image captioning task, indicating that the subjects tend to attend to similar regions. This provides further evidence that visual attention is task dependent. The IOC, however, decreases over time.
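As a rough illustration of the leave-one-out procedure, the sketch below scores each subject's fixations against a map built from the remaining subjects. It is a simplified, pure-Python reading of AUC-Judd and not the authors' implementation: the function names are hypothetical, the map is a raw fixation count without the Gaussian smoothing that would normally be applied, and ties between saliency values are not jittered.

```python
def auc_judd(saliency, fixations):
    """Area under the ROC curve, with thresholds taken at fixated saliency values.

    Assumes at least one pixel of the map is not fixated.
    """
    fixated = set(fixations)
    fix_vals = sorted((saliency[r][c] for r, c in fixated), reverse=True)
    all_vals = [v for row in saliency for v in row]
    n_fix, n_pix = len(fix_vals), len(all_vals)
    tp, fp = [0.0], [0.0]
    for k, threshold in enumerate(fix_vals, start=1):
        above = sum(1 for v in all_vals if v >= threshold)
        tp.append(k / n_fix)                        # fixations at or above threshold
        fp.append((above - k) / (n_pix - n_fix))    # non-fixated pixels above threshold
    tp.append(1.0)
    fp.append(1.0)
    # Trapezoidal integration of the (fp, tp) curve.
    return sum((fp[i] - fp[i - 1]) * (tp[i] + tp[i - 1]) / 2
               for i in range(1, len(tp)))


def fixation_count_map(fixation_lists, height, width):
    """Accumulate the fixations of several subjects into one 2D count map."""
    grid = [[0.0] * width for _ in range(height)]
    for fixations in fixation_lists:
        for r, c in fixations:
            grid[r][c] += 1.0
    return grid


def inter_observer_congruency(subject_fixations, height, width):
    """Mean leave-one-out AUC-Judd over subjects."""
    scores = []
    for i, held_out in enumerate(subject_fixations):
        others = subject_fixations[:i] + subject_fixations[i + 1:]
        scores.append(auc_judd(fixation_count_map(others, height, width), held_out))
    return sum(scores) / len(scores)


# Toy example: three subjects on a 3x3 image, two of whom agree.
subjects = [[(0, 0)], [(0, 0)], [(2, 2)]]
ioc = inter_observer_congruency(subjects, height=3, width=3)
```

In the toy example the two agreeing subjects score well against the others, while the third is at chance, which pulls the mean down.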
4.2 Analyzing the relationship between fixations and scene descriptions
How does task-based attention relate to image description? To answer this question, we analyze the distribution of fixations on objects in the scene, and the relation between attention allocation and the nouns used in the sentences. Given described objects, non-described objects, described background (e.g., mountain, sky, and wall are defined as background), non-described background, and fixated objects, we compare the distribution of the fixation data on objects and image background in the free-viewing, first 3s of captioning, and full captioning conditions. We compute the attention ratio for regions of interest as:
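As an illustration, one way to compute such a ratio is as the fraction of all fixations that fall inside each region class. This is a minimal sketch under that assumption; the label-map layout and all names are hypothetical and are not taken from the authors' implementation.

```python
from collections import Counter


def attention_ratios(fixations, label_map, described):
    """Share of fixations per region class.

    fixations : list of (row, col) pixel coordinates
    label_map : 2D list where label_map[r][c] is (category_name, kind) and
                kind is "object" or "background" (hypothetical layout)
    described : set of category names mentioned in the caption
    """
    counts = Counter()
    for r, c in fixations:
        name, kind = label_map[r][c]
        status = "described" if name in described else "non-described"
        counts[f"{status} {kind}"] += 1
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()} if total else {}


# Toy 2x3 annotation: a person, a dog, and some background.
label_map = [
    [("person", "object"), ("person", "object"), ("sky", "background")],
    [("dog", "object"), ("wall", "background"), ("sky", "background")],
]
fixations = [(0, 0), (0, 1), (1, 0), (0, 2)]
ratios = attention_ratios(fixations, label_map, described={"person", "sky"})
```

Here half of the fixations land on the described person, a quarter on the non-described dog, and a quarter on the described sky.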
Table 4 shows the results in terms of overall attention allocation. As shown, in all viewing conditions most of the fixations fall on objects that are described in the caption: in line with previous findings, described objects receive more fixations than background (either described or not) and non-described objects. When comparing the fixations in the free-viewing and captioning conditions, we see that in the first 3s of captioning (the viewing duration shared with free-viewing), slightly more attention is allocated to described objects. When looking at the captioning task over its full duration, we observe a decrease in the attention allocated to described objects, and an increase in the attention allocated to described background. This indicates that subjects are more likely to attend to the items that are going to be described in the first few seconds, before shifting their attention towards context-defining elements of the scene.
Table 5 shows the attention allocation to objects with respect to their order of appearance in the descriptions (noun order). In other words, are objects more likely to be fixated if they appear at the start of the description rather than at the end? We see that the nouns that are described first receive a larger share of fixations than subsequent nouns. The slightly lower number in the captioning condition is associated with the change in viewing strategy observed after the first 3s, as discussed previously.
How much time do humans spend viewing described objects? Synchronous eye tracking and description articulation enable us to investigate the duration of fixations on scene elements, specifically on described and non-described objects. As shown in Table 6, described objects attract longer fixations than non-described objects. This indicates that once an important object grabs attention, more time is allocated to scrutinizing it.
The probability of an object being described when fixated vs. fixated when described.
How likely is an object to be described if it exists in the image and is fixated? We compute this probability, p(described | fixated), and compare it with p(fixated | described), that is, the probability that an object is fixated when it is described (and it is present in the image). In other words, are you more likely to fixate what you describe, or to describe what you fixate? The results, summarized in Table 7, confirm the expectation that described objects are very likely to have been fixated, whereas some fixated objects are not described. Interestingly, in the captioning conditions more fixated objects are not described, while described objects are even more likely to have been fixated. This indicates that the task of captioning elicits fixations on other aspects of the scene besides the described objects.
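The two conditional probabilities can be estimated by simple counting over annotated object instances. The sketch below assumes each object present in an image has already been reduced to a pair of boolean flags (fixated, described); this record format and the function name are hypothetical.

```python
def conditional_probabilities(objects):
    """Estimate P(described | fixated) and P(fixated | described).

    objects : list of (fixated, described) boolean pairs, one per annotated
              object instance that is present in an image.
    """
    n_fixated = sum(1 for fixated, _ in objects if fixated)
    n_described = sum(1 for _, described in objects if described)
    n_both = sum(1 for fixated, described in objects if fixated and described)
    p_described_given_fixated = n_both / n_fixated if n_fixated else float("nan")
    p_fixated_given_described = n_both / n_described if n_described else float("nan")
    return p_described_given_fixated, p_fixated_given_described


# Toy data: four fixated objects, two of which are described,
# and one object that is neither fixated nor described.
records = [(True, True), (True, False), (True, True), (False, False), (True, False)]
p_d_given_f, p_f_given_d = conditional_probabilities(records)
```

The toy data reproduces the asymmetry discussed above: every described object was fixated, but only half of the fixated objects were described.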
How often do subjects describe something not annotated in an image (i.e., not present in the image at all)? What kinds of nouns are described more often, and which ones are less likely to be mentioned? The answer is visualized in Figure 4: most occurrences of described but non-annotated nouns are scene category and place nouns, which are not annotated as scene elements (because the annotations are local and pixel-based). One glaring exception to this, where an object not present in the scene is described, is the special case of ‘camera’. The reference to ‘camera’ is often associated with captions that mention the photographer taking the picture; an example is provided in Figure 5. Since the word camera in this case denotes a property of the scene rather than the material object, we can loosely construe such cases as a scene category.
How does the scene category affect the description of objects? We measure the probability that a fixated object is not described when the scene category is mentioned, and compare it with the same probability when the scene is not mentioned. The results, summarized in Table 8, indicate that non-described but fixated objects tend to occur when subjects describe the scene context, which suggests that the fixated non-described objects likely contribute to the perception of the scene context.
4.3 Comparing human and machine attention
How similar are human and machine attention in image captioning? This section describes two analyses performed to answer this question.
4.3.1 Attention in the visual encoder
An overlooked aspect in previous research is the amount of saliency that may have been encoded within the visual encoder. Consider the situation where a standard convolutional neural network (CNN) architecture, often used for encoding visual features, provides the features to a language model for captioning. We ask (1) to what extent does this CNN capture salient regions of the visual input? and (2) how well do the salient regions of the CNN correspond to human attended locations in the captioning task?
To answer these questions, we first transform the collected fixation data into saliency maps by convolving them with a Gaussian filter (sigma corresponding to one degree of visual angle in our experiments). Then, we threshold each saliency map at its top 5% value and extract the connected regions, as depicted in Figure 6. We then check how well the activation maps in the CNN, here layer conv5-3 of VGG-16 (comprising 512 activation maps), correspond to the connected regions. To this end, for each connected region, we check whether there is an activation map with an NSS score higher than a threshold (here T=4) within that connected region; if one exists, the connected region is considered to be attended by the CNN as well. We report how many of the regions attended by humans are also attended by the machine, as well as the mean highest NSS score over all connected regions in all images (each connected region has a highest NSS score among the 512 activation maps). We use fixation maps from free-viewing attention (free), the first 3s of fixations under the captioning task (cap3s), and the fixations over the whole duration of the image captioning task (cap). The results are shown in Table 9. It can be seen that there is a large agreement between the internal activation maps of the encoder CNN and the human attended regions (over 70%). Interestingly, despite the CNN not being fine-tuned for captioning, this agreement is higher for the task-based eye movement data than for the free-viewing eye movement data.
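The region-matching step can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the saliency map has already been smoothed, that the activation maps have been resized to the saliency map resolution, that regions are 4-connected, and that the NSS "within a region" is the mean z-scored activation over the region's pixels. All function names are hypothetical.

```python
from collections import deque


def top_fraction_mask(saliency, fraction=0.05):
    """Boolean mask of the pixels whose value is in the top `fraction`."""
    values = sorted((v for row in saliency for v in row), reverse=True)
    k = max(1, int(round(fraction * len(values))))
    cutoff = values[k - 1]
    return [[v >= cutoff for v in row] for row in saliency]


def connected_regions(mask):
    """4-connected components of a boolean mask, as lists of (row, col)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            queue, region = deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def region_nss(activation, region):
    """Mean z-scored activation over the pixels of a region."""
    flat = [v for row in activation for v in row]
    mean = sum(flat) / len(flat)
    std = (sum((v - mean) ** 2 for v in flat) / len(flat)) ** 0.5
    if std == 0:
        return 0.0
    return sum((activation[y][x] - mean) / std for y, x in region) / len(region)


def attended_regions(saliency, activations, threshold=4.0):
    """For each human-attended region, report whether some map exceeds the threshold."""
    regions = connected_regions(top_fraction_mask(saliency))
    best = [max(region_nss(a, region) for a in activations) for region in regions]
    return [score > threshold for score in best], best


# Toy 10x10 example: two human-attended regions, one matched by an activation map.
saliency = [[0.0] * 10 for _ in range(10)]
for (r, c), v in {(1, 1): 9, (1, 2): 8, (2, 1): 7, (7, 7): 6, (7, 8): 5}.items():
    saliency[r][c] = float(v)
blob = [[0.0] * 10 for _ in range(10)]
for r, c in [(1, 1), (1, 2), (2, 1)]:
    blob[r][c] = 1.0
flat_map = [[0.5] * 10 for _ in range(10)]
flags, best = attended_regions(saliency, [blob, flat_map])
```

The fraction of `True` entries in `flags`, pooled over images, corresponds to the agreement rate, and the mean of `best` corresponds to the mean highest NSS score.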
4.3.2 Attention in image captioning models
How well does the top-down attention mechanism in the automatic image captioning model agree with human attention when describing images? We study the spatial and temporal consistency of the soft-attention mechanism with human attention during image captioning.
We first assess the spatial consistency between human and machine attention. For the machine, the spatial attention is computed as the mean saliency map over all the generated words. We compute the NSS score over this saliency map using human fixations. To compare with bottom-up saliency models, we also compute the NSS score over the saliency maps obtained by SalGAN, a leading saliency model, with no centre-bias.
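The spatial comparison amounts to averaging the per-word attention maps and evaluating the result at human fixation locations. The following is a minimal sketch, assuming the per-word maps have already been upsampled to image resolution; the function names are hypothetical.

```python
def mean_map(maps):
    """Pixel-wise mean of a list of equally sized 2D maps."""
    n = len(maps)
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(w)] for r in range(h)]


def nss(saliency, fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    flat = [v for row in saliency for v in row]
    mean = sum(flat) / len(flat)
    std = (sum((v - mean) ** 2 for v in flat) / len(flat)) ** 0.5
    if std == 0 or not fixations:
        return 0.0
    return sum((saliency[r][c] - mean) / std for r, c in fixations) / len(fixations)


# Toy example: two word-level attention maps on a 2x2 grid.
word_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
]
machine_map = mean_map(word_maps)
score_on_top_row = nss(machine_map, [(0, 0), (0, 1)])
score_on_bottom_row = nss(machine_map, [(1, 0)])
```

A positive NSS means fixations fall on above-average saliency, zero corresponds to chance, and negative values mean the map is systematically low where people looked.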
Table 10 summarizes the consistency of the saliency maps generated either by a standard bottom-up saliency model (trained on free-viewing data) or by a top-down captioning system with ground truth saliency maps captured in either the free-viewing or the captioning condition (full duration). Interestingly, the bottom-up saliency model obtains a higher score on both the free-viewing and the task-based ground-truth data. In other words, a bottom-up model captures important regions of the scene better than the top-down soft-attention model, even for captioning fixations. Fig. 8 illustrates some example maps.
What is the temporal difference between human and machine attention in image captioning? Here, for the human fixation data, we split the sequence of fixations into fixed-length intervals using the sample time stamps. The fixations of each interval are then transformed into separate saliency maps, resulting in a sequence of saliency maps. For machine attention, we use the sequence of saliency maps generated during the scene description. We then employ Dynamic Time Warping (DTW) to align the sequences and compute the difference between them. Figure 9 shows this process for an example sequence. We report the distance between each frame pair as:
where is the frame in the human attention sequence, is the frame in the machine attention sequence. SIM is the similarity score between two attention maps. The final distance between two sequences is the total distance divided by the path length in DTW. Our analysis shows a mean difference of , which is large and demonstrates that the two attention patterns differ significantly over time.
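The alignment can be sketched with a textbook DTW recursion. The block below assumes, purely for illustration, that the frame-pair distance is 1 - SIM, where SIM is the histogram-intersection similarity of the two maps after each is normalized to sum to one; the function names and this choice of distance are assumptions rather than the authors' exact formulation.

```python
def sim(a, b):
    """Histogram intersection of two maps, each normalized to sum to one."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    sa, sb = sum(fa), sum(fb)
    if sa == 0 or sb == 0:
        return 0.0
    return sum(min(x / sa, y / sb) for x, y in zip(fa, fb))


def dtw_distance(human, machine):
    """Accumulated (1 - SIM) cost along the best DTW path, divided by path length."""
    n, m = len(human), len(machine)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]   # accumulated cost
    steps = [[0] * (m + 1) for _ in range(n + 1)]    # cells on the best path so far
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1.0 - sim(human[i - 1], machine[j - 1])
            # Pick the cheapest predecessor; ties go to the shorter path.
            prev_cost, prev_steps = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
            )
            cost[i][j] = prev_cost + d
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]


# Toy sequences: the human lingers on region A before moving to region B.
A = [[1.0, 0.0], [0.0, 0.0]]
B = [[0.0, 0.0], [0.0, 1.0]]
aligned = dtw_distance([A, A, B], [A, B])
mismatched = dtw_distance([A], [B])
```

Because DTW allows one machine frame to match several human frames, sequences that visit the same regions at different speeds still align with zero distance, while sequences that visit different regions do not.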
Correlation between machine captioning performance and machine-human attention congruency:
Is the consistency between the machine and human subjects’ attention patterns a predictor of the quality of the description generated by the machine? To answer this question, we compute the Spearman correlation coefficient between the machine performance on each image instance in terms of caption quality (CIDEr score) and the consistency of machine attention (spatial and temporal) with human attention (NSS score for spatial consistency, DTW distance for temporal consistency). The results are visualized in Figure 10, indicating very low coefficients: 0.01 and -0.05 for spatial and temporal attention, respectively. In other words, there appears to be no correlation between the quality of a machine’s descriptions of images and the similarity between its attention patterns and human ones.
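For reference, the Spearman coefficient is the Pearson correlation of the rank-transformed scores. A minimal sketch follows, using average ranks for ties; the per-image score lists are made-up toy values, not results from the paper.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation."""
    return pearson(ranks(x), ranks(y))


# Toy per-image scores (hypothetical values).
cider_scores = [0.2, 0.9, 0.5, 0.7]
nss_scores = [1.1, 1.4, 1.2, 1.3]
rho_same_order = spearman(cider_scores, nss_scores)
rho_reversed = spearman(cider_scores, nss_scores[::-1])
```

A coefficient near zero, as reported above, means that knowing how human-like the machine's attention is on an image says essentially nothing about how good its caption for that image will be.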
To summarize, we learn that the so-called top-down soft-attention mechanism (1) is a poor predictor of human attention, outclassed by bottom-up saliency models even in the captioning condition, and (2) has low temporal consistency with human attention sequences, and that (3) the similarity between the machine’s soft-attention and human subjects’ attention has no bearing on the quality of the description. In other words, the top-down attention mechanisms deployed by the automatic captioning system bear little relation to the attention of humans doing the same task.
5 Discussions and Conclusion
In this paper, we introduced a novel, large scale dataset consisting of synchronized multi-modal attention and caption annotations. We revisited the consistency between human attention and captioning models on this data. We learned that task-based exploration of images leads to higher overall inter-observer congruency (IOC) in comparison to free-viewing. We reconfirmed the strong relationship between described objects and attended ones similar to those that have been observed in free-viewing experiments.
Interestingly, we demonstrated that the top-down soft-attention mechanism used by automatic captioning systems captures neither the spatial locations nor the temporal properties of human attention during captioning. Also, similarity between human and machine attention has no bearing on the quality of the machine generated captions. These findings suggest that improvements should be possible through further investigation of top-down attention mechanisms.
Overall, the proposed dataset and analysis offer new perspectives for the study of top-down attention mechanisms in captioning pipelines, providing critical hitherto missing information that we believe will assist further advancements in developing and evaluating captioning models. Our dataset and code will be freely shared with the community to expedite future research in this area. Data is available at: LINK MASKED.
-  A. Borji. Saliency prediction in the deep learning era: An empirical investigation. arXiv preprint arXiv:1810.03716, 2018.
-  A. Borji and L. Itti. State-of-the-art in visual attention modeling. IEEE transactions on pattern analysis and machine intelligence, 35(1):185–207, 2013.
-  Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand. What do different evaluation metrics tell us about saliency models? IEEE transactions on pattern analysis and machine intelligence, 2018.
-  L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6298–6306. IEEE, 2017.
-  M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. SAM: Pushing the Limits of Saliency Prediction Models. Proceedings of the IEEE/CVF International Conference on Computer Vision and Pattern Recognition Workshops, 2018.
-  M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380, 2014.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  S. S. Kruthiventi, K. Ayush, and R. V. Babu. Deepfix: A fully convolutional neural network for predicting human eye fixations. IEEE Transactions on Image Processing, 26(9):4446–4456, 2017.
-  M. Kummerer, T. S. A. Wallis, L. A. Gatys, and M. Bethge. Understanding low- and high-level contributions to fixation prediction. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 6, page 2, 2017.
-  C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55–60, 2014.
-  E. Miltenburg, Á. Kádár, R. Koolen, and E. Krahmer. Didec: The dutch image description and eye-tracking corpus. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3658–3669, 2018.
-  M. Müller. Dynamic time warping. Information retrieval for music and motion, pages 69–84, 2007.
-  J. Pan, C. C. Ferrer, K. McGuinness, N. E. O’Connor, J. Torres, E. Sayrol, and X. Giro-i Nieto. Salgan: Visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081, 2017.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
-  H. R. Tavakoli, R. Shetty, A. Borji, and J. Laaksonen. Paying attention to descriptions generated by image captioning models. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2506–2515. IEEE, 2017.
-  P. Vaidyanathan, E. T. Prud’hommeaux, J. B. Pelz, and C. O. Alm. Snag: Spoken narratives and gaze dataset. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 132–137, 2018.
-  R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
-  P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
-  F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
-  K. Yun, Y. Peng, D. Samaras, G. J. Zelinsky, and T. L. Berg. Studying relationships between human gaze, description, and computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 739–746, 2013.