A Synchronized Multi-Modal Attention-Caption Dataset and Analysis

by Sen He, et al.

In this work, we present a novel multi-modal dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences between human attention in free-viewing and image captioning tasks. We look into the relationship between human attention and language constructs during perception and sentence articulation. We also compare human and machine attention in captioning tasks, in particular the top-down soft attention approach that is argued to mimic human attention. Our study reveals that (1) human attention behaviour in free-viewing differs from that in image description, as humans tend to fixate on a greater variety of regions under the latter task; (2) there is a strong relationship between the described objects and the objects attended by subjects (97% of described objects are attended); (3) a convolutional neural network used as a feature encoder captures, to a great extent (around 78%), the regions that humans attend to under image captioning; (4) soft-attention as the top-down mechanism agrees with human attention behaviour neither spatially nor temporally; and (5) soft-attention does not add strong, beneficial human-like attention behaviour to the captioning task, as the correlation between caption scores and attention consistency scores is low, indicating a large gap between human and machine top-down attention.



1 Introduction

“Two elderly ladies and a young man sitting at a table with food on it.” The previous sentence is an example of a description that one could provide when shown the image in Figure 1. Describing images in a few words, extracting the gist of the scene they depict while ignoring unnecessary details, is an easy task for humans that can in some cases be achieved from only a very brief glance at the image. In stark contrast, providing a formal algorithm for the same task is an intricate challenge that has been beyond the reach of computer vision for decades. Recently, with the availability of powerful deep neural network architectures and large-scale datasets, new data-driven approaches for automatic captioning of images have been proposed, which have demonstrated intriguing performance [22, 23, 4, 12]. Although there is no suggestion that such models can in truth capture the full complexity of visual scenes, they appear to be able to produce credible captions for a variety of images. This raises the question of whether such artificial systems use strategies similar to those of the human visual system to generate those captions.

Human-like image captions produced by machines have puzzled the public and fostered the delusion that artificial intelligence (AI) rivals humans in many reasoning and perception tasks. However, fundamental challenges still remain, ranging from describing attributes and reasoning about the relations in captions to identifying identities for caption generation, in order to avoid fabulation.

One clue as to how humans perform this task comes from the study of attention by eye-tracking. Attention mechanisms have been studied from different perspectives under the umbrella terms of visual attention (bottom-up and top-down mechanisms of attention), saliency prediction (predicting fixations), and eye movement analysis. A large number of studies in computer vision and robotics have tried to replicate these capabilities for different applications such as object detection, image thumbnailing, and human-robot interaction [2, 1]. There has been a recent trend of adapting attention mechanisms for automatic image captioning, e.g., [23, 4, 12]. In such research papers, we often find appealing visualizations of feature importance over visual features, accompanied by the phrase “mimicking human attention”. One may ask, “Is this really the same as human attention?” and “How much do such mechanisms agree with human attention while describing content?”.

Figure 1: Introduced database: an example of the collected data, including the image shown to the subjects, the subject’s audio description of the image, and the transcribed text description of the image, along with the eye-fixation sequence on the image while the subject looks at and describes the content.

In this paper, we strive to answer the aforementioned questions. We establish a basis by studying how humans pay attention to scene items under the captioning task. Our contributions are summarized as follows:

  • We introduce a multi-modal dataset with synchronously recorded human fixations and scene descriptions (in verbal form), which provides the largest number of instances at the moment.

  • We compare human attention in free-viewing with human attention under the task of describing images.

  • We analyze the relationship between fixations and descriptions under the image captioning task.

  • We investigate the similarities and differences between human attention and machine attention in captioning.

2 Related Work

2.1 Bottom-up attention and saliency prediction

Predicting where humans look in an image is a long-standing problem in computer vision, a review of which is outside the scope of this manuscript (see [2]). In the following, we review some recent works on bottom-up attention modeling. Currently, the most successful saliency prediction models rely on deep neural architectures. Deepfix [9] combines the deep architectures of VGG [17], GoogleNet [18], and dilated convolutions [25] to predict saliency in images. SalGAN [16] uses an encoder-decoder architecture and proposes a binary cross entropy (BCE) loss to perform pixel-wise saliency estimation. After pre-training the encoder-decoder, it uses a Generative Adversarial Network (GAN) [7] to boost the performance. SAM [5] uses an LSTM [8] network, which can attend to different salient regions in the image. Deep gaze 2 [10] uses features from different layers of a pre-trained deep model and combines them with prior knowledge (center-bias) to predict saliency. These models try to replicate the bottom-up attention mechanisms of humans during free viewing of natural scenes.

2.2 Neural image captioning

The image captioning task can be seen as a machine translation problem, e.g. translating an image into an English sentence. Breakthroughs in this task were achieved with the help of large-scale databases for image captioning (e.g. Flickr30k [24], MSCOCO [11]) that include a large number of images and captions (i.e. source and target instances). Neural captioning models often consist of a deep Convolutional Neural Network (CNN) and a Long Short Term Memory (LSTM) language model, where the CNN part generates the feature representation for the image, and the LSTM cell acts as a language model that decodes the features from the CNN part into text, e.g. [22]. In this article, we mainly focus on models that have attention mechanisms. Xu et al. [23] introduced a soft-attention mechanism to the approach in [22]. That is, during the generation of a new word based on the previously generated word and the hidden state of the language model, their model learns to put spatial weights on the visual features. Instead of re-weighting features only spatially, Chen et al. [4] exploit spatial and channel-wise weighting. Lu et al. [12] utilized memory to prevent the model from attending mainly to the visual content and to force it to utilize textual context as well. This is referred to as adaptive attention.

2.3 Human attention and image descriptions

In the vision community, some previous works have investigated the relationship between human attention and image captioning. Yun et al. [26] studied the relationship between human gaze and descriptions, where the human gaze was recorded under the free-viewing condition and subjects were shown each image for 3s. A separate group of participants described the image content. We will refer to their data as sbugaze. Tavakoli et al. [19] went further and investigated the relation between machine-generated and human-generated descriptions. They looked into the contribution of boosting visual features spatially using saliency models as a proxy for bottom-up attention. In contrast to previous studies, we focus on human attention under the image captioning task and investigate the attention behaviour of humans and machines.

In the natural language community, eye-tracking and image descriptions have been used to study the cause of ambiguity between languages, e.g. English vs Dutch [14]. Vaidyanathan et al. [20] investigated the relation between linguistic labels and important regions in the image by utilizing eye-tracking data and image descriptions. A comparison between our data and the datasets from the natural language processing community is provided in Table 1. Our dataset has a higher number of instances and images in total, making it more suitable for vision-related tasks. In contrast to existing works, we are also pursuing a different goal: understanding how well current computational attention mechanisms in captioning models align with human attention behaviour during image description.

Dataset # images # subjects # Instances
DIDEC [14] 305 45 4604
SNAG [20] 100 30 3000
sbugaze [26] 1000 3 3000
Ours 1000 5 5000
Table 1: Comparison between our data and similar datasets. DIDEC subjects participated in both free-viewing and task-based eye tracking experiments. SNAG only consists of task-based eye tracking data. Notwithstanding, our data complements sbugaze data which includes only free-viewing gaze.

3 Data Collection


We use images from the Pascal-50s dataset [21]. Our stimulus set includes 1,000 images (the same images as in sbugaze), 50 human-written captions for each image, and annotated semantic masks with 222 semantic categories in total. The same images were also used to analyze the relation between descriptions and object importance, and the consistency of human and machine captions with respect to bottom-up attention during scene free-viewing (e.g. [26, 19]). We can thus compare human attention behaviour during the captioning task with free-viewing. We can also quantify the relation between human attention while describing content and the top-down attention mechanisms in language models, such as soft-attention.


Precise recording of human subjects’ fixations in the image captioning task requires specialized accurate eye-tracking equipment, making crowd-sourcing impractical for this purpose. We used a Tobii X2-30 eye-tracker to record the eye movements under the image captioning task in a controlled laboratory condition. The eye-tracker was positioned at the bottom of the screen of a laptop, operating with a screen resolution of . The subject’s distance from the screen was about , and he was asked to simultaneously look at the image and describe it (in verbal form) in one sentence. The eye-tracker and an embedded voice recorder in the computer recorded the subject’s eye movements and description synchronously for each image.

Five subjects (postgraduate students, native English speakers, 3 males and 2 females) participated in the data collection. All subjects completed the data collection over all 1,000 images in the dataset. The image presentation order was randomized across subjects. For each subject, we divided the data collection into 50 sessions of 20 images. Before each session, the eye-tracker was re-calibrated. To start a session, the subject was asked to fixate on a central red cross which appeared for s. The image was then displayed on the screen and the subject viewed and described the image. After describing the image, the subject pressed a designated button to move to the next image in the session. An example of the collected data is illustrated in Fig. 1. During the experiments, the subjects often looked at the image silently for a short while to scan the scene, and then started describing the content spontaneously for several seconds.



Dataset        CIDEr          METEOR
sbugaze [26]   0.938/0.038    0.368/0.012
Ours           0.937/0.060    0.366/0.015
Table 2: Assessing the quality of the collected captions against the 50 ground-truth captions of Pascal-50S.

After the data collection, we manually transcribed the collected oral descriptions into text for all images and subjects. The transcriptions were double-checked and cross-checked against the images. We use off-the-shelf part-of-speech (POS) tagging software [13] to extract the nouns in the transcribed sentences. We then form a mapping from the extracted nouns to the semantic categories present in the image. For example, boys and girls are both mapped to the person category.

To check the quality of the captions in our collected data, we compute the CIDEr [21] and METEOR [6] scores of the collected captions against the ground truth in the Pascal-50s database (50 sentences for each image). To ensure that the eye tracking and simultaneous voice recording have not adversely affected the quality of the captions, we compare our scores with those of the sbugaze [26] captions, which were collected in text form and for which eye tracking and description collection were asynchronous. Table 2 summarizes the results, showing that eye-tracking does not appear to distract the subjects, whose descriptions are rated comparably to those of the subjects in sbugaze [26].

4 Analysis

In this section, we provide a detailed analysis of attention during free viewing and captioning tasks, relationship between fixations and generated captions, and study of captioning models with respect to attention.

4.1 Attention in free-viewing vs. attention in image captioning

(a) free
(b) cap3s
(c) cap
Figure 2: Average fixation map across the whole dataset for (a) the free viewing condition, (b) image captioning condition after 3 seconds (cap3s) and (c) captioning after the whole duration (cap).
Figure 3: An example of the difference between fixations in free viewing and image captioning tasks. From left to right: original image, free viewing fixations, first 3s fixations and all fixations in the captioning task. The captions generated by 5 subjects are shown at the bottom.
number of subjects
2 3 4 5
free 0.8299 0.8149 - -
cap(3s) 0.8391 0.8362 0.8348 0.8339
cap 0.8284 0.8255 0.8238 0.8233
Table 3: IOC under different number of subjects (free viewing, first 3s image captioning, full time image captioning).

How does attention behaviour differ between free-viewing and describing images? We first analyze the differences between these two tasks by visualizing the amount of attention center-bias and the degree of inter observer congruency (IOC). In the free-viewing condition over the sbugaze dataset, the gaze was recorded for a maximum duration of 3s, while in the image captioning task, people need on average to look and describe each image. To ensure that the difference in gaze locations is not solely due to viewing duration, we divide the visual attention in the image captioning task into two cases: fixations during the first 3s (cap3s), and fixations during the full viewing (cap).

The visual attention difference between free viewing and image captioning is shown in Figs. 2 and 3. Visual attention in free viewing is more focused on the central part of the image (i.e. high centre-bias); while attention under the image captioning task has higher dispersion over the whole duration of the task.

Table 3 reports the Inter Observer Congruency (IOC). To compute IOC, we leave fixations of one subject out and compute congruency with the fixations of other subjects using the AUC-Judd [3] evaluation score. The results show that the IOC is higher under the image captioning task, indicating the subjects tend to attend similar regions. This provides further supporting evidence that visual attention is task dependent. The IOC, however, decreases over time.
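The leave-one-out procedure can be illustrated with a minimal, pure-Python sketch. It is not the authors' code: the function names, the Gaussian width `sigma`, the choice of pooling the remaining subjects' fixations into a single blurred map, and the averaging over held-out subjects are all assumptions made for illustration. AUC-Judd is implemented in its usual form, with thresholds taken at the saliency values of the fixated pixels.

```python
import math

def fixation_density(fixations, height, width, sigma=1.0):
    """Blur (row, col) fixations into a density map with an isotropic Gaussian."""
    density = [[0.0] * width for _ in range(height)]
    for fr, fc in fixations:
        for r in range(height):
            for c in range(width):
                d2 = (r - fr) ** 2 + (c - fc) ** 2
                density[r][c] += math.exp(-d2 / (2.0 * sigma ** 2))
    return density

def auc_judd(saliency, fixations):
    """Area under the ROC curve, thresholding at saliency values of fixated pixels."""
    fixated = set(fixations)
    values = [v for row in saliency for v in row]
    n_fix, n_other = len(fixated), len(values) - len(fixated)
    thresholds = sorted({saliency[r][c] for r, c in fixated}, reverse=True)
    tpr, fpr = [0.0], [0.0]
    for t in thresholds:
        hits = sum(1 for r, c in fixated if saliency[r][c] >= t)
        above = sum(1 for v in values if v >= t)
        tpr.append(hits / n_fix)               # fixated pixels above threshold
        fpr.append((above - hits) / n_other)   # non-fixated pixels above threshold
    tpr.append(1.0)
    fpr.append(1.0)
    # Trapezoidal integration of the ROC curve.
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
               for i in range(1, len(tpr)))

def inter_observer_congruency(fixations_by_subject, height, width, sigma=1.0):
    """Mean AUC-Judd of each held-out subject against the other subjects' map."""
    scores = []
    for i, held_out in enumerate(fixations_by_subject):
        others = [f for j, subject in enumerate(fixations_by_subject)
                  if j != i for f in subject]
        saliency = fixation_density(others, height, width, sigma)
        scores.append(auc_judd(saliency, held_out))
    return sum(scores) / len(scores)

# Toy 5x5 image: three subjects agreeing vs. two subjects disagreeing.
ioc_high = inter_observer_congruency([[(2, 2)], [(2, 2)], [(2, 2)]], 5, 5)
ioc_low = inter_observer_congruency([[(0, 0)], [(4, 4)]], 5, 5)
```

In this toy setting, subjects who fixate the same location obtain a higher score than subjects who fixate opposite corners, which is the behaviour the IOC measure is meant to capture.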

4.2 Analyzing the relationship between fixations and scene descriptions

How does task-based attention relate to image description? To answer, we analyze the distribution of fixations on objects in the scene, and the relation between attention allocation and the nouns described in the sentences. Given described objects, non-described objects, described background (e.g., mountain, sky and wall are defined as background), non-described background, and fixated objects, we compare the distribution of the fixation data on objects and image background in the free-viewing, first 3s captioning and full captioning tasks. For each of these regions of interest, we compute the attention ratio, i.e. the share of fixations allocated to that region.

Figure 4: From left to right: the nouns described in the caption but not annotated in the image; the fixated objects (top 15) that have a very high likelihood to be described; the fixated objects that have a very low likelihood to be described
Condition   described objects   non-described objects   described background   non-described background
free        0.66                0.09                    0.14                   0.11
cap(3s)     0.68                0.09                    0.14                   0.09
cap         0.63                0.10                    0.16                   0.11
Table 4: Mean attention allocation on different regions (object vs background).

Table 4 shows the results in terms of overall attention allocation. As depicted, in all viewing conditions, most of the fixations correspond to objects that are described in the caption: in line with previous findings in [19], described objects receive more fixations than background (either described or not) and non-described objects. When comparing the fixations in the free-viewing and captioning conditions, we see that, in the first 3s of captioning (the common viewing duration for free-viewing), slightly more attention is allocated to described objects. When looking at the captioning task for the full duration, we observe a decrease in the attention allocated to described objects, and an increase in the attention allocated to described background. This indicates that subjects are more likely to attend to the items that are going to be described in the first few seconds, before shifting their attention towards context-defining elements of the scene.
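As an illustration of how such an allocation could be computed, the sketch below assigns each fixation to one of the four region types and reports the share of fixations per type. The definition of the attention ratio as a fraction of fixation counts is an assumption (it is consistent with the rows of Table 4 summing to one), and the names `region_type`, `attention_ratio`, and the toy categories are hypothetical.

```python
from collections import Counter

REGION_TYPES = (
    "described_object",
    "nondescribed_object",
    "described_background",
    "nondescribed_background",
)

def region_type(category, described, background):
    """Classify a semantic category into one of the four region types."""
    kind = "background" if category in background else "object"
    status = "described" if category in described else "nondescribed"
    return f"{status}_{kind}"

def attention_ratio(fixations, mask, described, background):
    """Share of fixations per region type (assumed definition of the ratio).

    fixations:  list of (row, col) positions
    mask:       2D list of semantic category names, one per pixel
    described:  set of categories mentioned in the caption
    background: set of categories treated as background
    """
    counts = Counter(
        region_type(mask[r][c], described, background) for r, c in fixations
    )
    total = len(fixations)
    return {t: counts[t] / total for t in REGION_TYPES}

# Toy 2x2 semantic mask with two objects and two background categories.
mask = [["person", "sky"], ["dog", "wall"]]
ratios = attention_ratio(
    fixations=[(0, 0), (0, 0), (0, 1), (1, 0)],
    mask=mask,
    described={"person", "sky"},
    background={"sky", "wall"},
)
```

Averaging these per-image dictionaries over all images and subjects would give one row of a table like Table 4.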

Noun Order
1 2 3 4 5
cap 0.486 0.271 0.222 0.187 0.138
free 0.552 0.244 0.158 0.117
Table 5: Attention allocation on described objects relatively to the order in which they figure in the description

Table 5 shows the attention allocation to objects with respect to their order of appearance in the descriptions (noun order). In other words, are objects more likely to be fixated if they figure at the start of the description rather than at the end? We see that the nouns that are described first receive a larger share of fixations than subsequent nouns. The slightly lower number in the captioning condition is associated with the change in viewing strategy observed after the first 3s, as discussed previously.

0.52 s 1.68s
Table 6: The mean fixation duration on described objects vs. non-described objects

How much time do humans spend viewing described objects? Synchronous eye tracking and description articulation enable us to investigate the duration of fixations on scene elements, specifically on described and non-described objects. As shown in Table 6, the described objects attract longer fixations than non-described objects. This indicates that once an important object grabs the attention, more time is allocated to scrutinizing it.

Condition   P(described | fixated)   P(fixated | described)
free        0.56                     0.87
cap (3s)    0.48                     0.95
cap         0.44                     0.96
Table 7: The probability of an object being described when fixated vs. fixated when described.

How likely is an object to be described if it exists in the image and is fixated? We compute this probability, $P(\text{described} \mid \text{fixated})$, and compare it with $P(\text{fixated} \mid \text{described})$, that is, the probability that an object is fixated when it is described (and it is present in the image). In other words, are you more likely to fixate what you describe, or to describe what you fixate? The results, summarized in Table 7, confirm the expectation that described objects are very likely to have been fixated, whereas some fixated objects are not described. Interestingly, in the captioning conditions more fixated objects are not described, while described objects are even more likely to have been fixated. This indicates that the task of captioning elicits fixations on other aspects of the scene besides described objects.
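The two conditional probabilities can be estimated from per-instance sets of fixated and described object categories, as in the following sketch. Pooling the counts over all (image, subject) instances is an assumption, since the aggregation is not specified here, and the function name and toy data are hypothetical.

```python
def description_fixation_probabilities(instances):
    """Estimate P(described | fixated) and P(fixated | described).

    instances: iterable of (fixated, described) pairs, where each element is
    the set of annotated object categories that were fixated / described for
    one image and one subject. Counts are pooled over instances (assumption).
    """
    both = fixated_total = described_total = 0
    for fixated, described in instances:
        both += len(fixated & described)
        fixated_total += len(fixated)
        described_total += len(described)
    p_described_given_fixated = both / fixated_total if fixated_total else 0.0
    p_fixated_given_described = both / described_total if described_total else 0.0
    return p_described_given_fixated, p_fixated_given_described

# Toy data: every described object was fixated, but not every fixated
# object was described.
instances = [
    ({"person", "dog", "table"}, {"person", "dog"}),
    ({"car", "tree"}, {"car"}),
]
p_d_given_f, p_f_given_d = description_fixation_probabilities(instances)
```

The toy data reproduces the qualitative pattern discussed above: the probability of fixating a described object is higher than the probability of describing a fixated one.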

Figure 5: Two examples when camera was described by subject but not in the image

How often do subjects describe something not annotated in an image (i.e., not present in the image at all)? What kind of nouns are described more often, and which ones are less likely to be mentioned? The answer is visualized in Figure 4: most occurrences of described but non-annotated nouns are scene categories and place nouns, which are not annotated as scene elements (because the annotations are local and pixel-based). One glaring exception to this, where an object not present in the scene is described, is the special case of ‘camera’. The reference to ‘camera’ is often associated with captions that reference the photographer taking the picture; an example is provided in Figure 5. Since the word camera in this case denotes a property of the scene rather than the material object, we can weakly construe such cases as a scene category.

scene category described   scene category not described
0.48                       0.21
Table 8: Probability of fixation on non-described objects when the scene category is described and not described

How does the scene category affect the description of objects? We measure the probability that non-described objects are fixated when the scene category is mentioned, and when the scene category is not mentioned. The results, summarized in Table 8, indicate that non-described but fixated objects tend to occur when subjects describe the scene context, which suggests that the fixated non-described objects likely contribute to the perception of the scene context.

4.3 Comparing human and machine attention

How similar are human and machine attention in image captioning? This section describes two analyses performed to answer this question.

4.3.1 Attention in the visual encoder

An overlooked aspect in previous research is the amount of saliency that may have been encoded within the visual encoder. Consider the situation where a standard convolutional neural network (CNN) architecture, often used for encoding visual features, is used to provide the features to a language model for captioning. We ask (1) to what extent does this CNN capture salient regions of the visual input? and (2) how well do the salient regions of the CNN correspond to human attended locations in the captioning task?

Figure 6: Example of obtaining connected regions from human attention map.

To answer these questions, we first transform the collected fixation data into saliency maps by convolving them with a Gaussian filter (sigma corresponding to one degree of visual angle in our experiments). Then, we threshold the saliency map by its top 5% value and extract the connected regions, as depicted in Figure 6. We then check how well the activation maps in the CNN, here layer conv5-3 of VGG-16 [17] (comprising 512 activation maps), correspond to the connected regions. To this end, for each connected region, we check whether there is an activation map with an NSS score [3] higher than a threshold (here T=4) within that connected region; if one exists, that connected region is considered to be attended by the CNN as well. We report how many of the regions attended by humans are also attended by the machine, as well as the mean of the highest NSS scores over all connected regions in all images (each connected region has a highest NSS score among the 512 activation maps). We use fixation maps from free-viewing attention (free), the first 3s of fixations under the captioning task (cap3s), and the fixations over the whole duration of the image captioning task (cap). The results are shown in Table 9. It can be seen that there is a large agreement between the internal activation maps of the encoder CNN and the human attended regions (over 70%). Interestingly, despite the CNN not being fine-tuned for captioning, this agreement is higher for the task-based eye movement data than for the free-viewing eye movement data.
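The region extraction and the agreement summary can be sketched as follows. This is an illustrative reading, not the authors' implementation: "top 5%" is interpreted as keeping the 5% most salient pixels, regions use 4-connectivity, and the per-region NSS scores of the activation maps are assumed to be precomputed. All function names are hypothetical.

```python
from collections import deque

def top_fraction_mask(saliency, fraction=0.05):
    """Binary mask keeping the most salient `fraction` of pixels (assumed reading)."""
    values = sorted((v for row in saliency for v in row), reverse=True)
    k = max(1, int(round(fraction * len(values))))
    cutoff = values[k - 1]
    return [[v >= cutoff for v in row] for row in saliency]

def connected_regions(mask):
    """4-connected components of a binary mask, each a list of (row, col)."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                region.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions

def summarize_agreement(region_scores, threshold=4.0):
    """Fraction of regions whose best NSS exceeds the threshold, and mean best NSS.

    region_scores: one list per connected region, holding the NSS score of
    every activation map within that region (assumed precomputed).
    """
    best = [max(scores) for scores in region_scores]
    attended = sum(1 for b in best if b > threshold)
    return attended / len(best), sum(best) / len(best)

# Toy 10x10 saliency map with two separate hot spots.
saliency = [[0.0] * 10 for _ in range(10)]
saliency[0][0], saliency[0][1], saliency[1][0] = 9.0, 8.0, 7.0
saliency[9][9], saliency[9][8] = 6.0, 5.0
regions = connected_regions(top_fraction_mask(saliency, 0.05))
```

With 512 activation maps, `summarize_agreement` would receive 512 scores per region; the two returned values correspond to the "percentage" and "mean value" columns of a table like Table 9.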

percentage mean value
free 72.5% 5.43
cap3s 78.1% 5.62
cap 77.9% 5.61
Table 9: Attention agreement between human and the visual encoder (pre-trained CNN).
Figure 7: Example of human attention under captioning task and VGG-16’s attention. From left to right: image, human attended regions, and VGG-16 attended regions that best correlated with human attended regions.

4.3.2 Attention in image captioning models

How well does the top-down attention mechanism in the automatic image captioning model agree with human attention when describing images? We study the spatial and temporal consistency of the soft-attention mechanism in [23] with human attention in image captioning.

Ground truth
Model free-viewing image captioning
SalGAN 1.929 1.618
Soft-attention 1.149 1.128
Table 10: Spatial attention consistency (NSS) of a bottom-up saliency prediction model (SalGAN) and of the top-down attention in an image captioning model (soft-attention), each compared against human free-viewing attention and human attention in image captioning.
Spatial consistency:

We assess the spatial consistency between human and machine attention. For the machine, the spatial attention is computed as the mean saliency map over all the generated words. We compute the NSS score over this saliency map using human fixations. To compare with bottom-up saliency models, we also compute the NSS score over the saliency maps obtained by SalGAN [16], a leading saliency model, with no centre-bias.
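A minimal sketch of this computation is given below: the per-word attention maps are averaged, and NSS is the mean of the standardized map values at the human fixation locations. The helper names are hypothetical, the maps are assumed to be already resized to image resolution, and the population standard deviation is used for standardization, which is a choice made here rather than something stated in the text.

```python
import math

def mean_map(maps):
    """Pixel-wise mean of a list of equally sized 2D attention maps."""
    n = len(maps)
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(width)]
            for r in range(height)]

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    values = [v for row in saliency for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return 0.0  # a constant map carries no information about fixations
    return sum((saliency[r][c] - mean) / std for r, c in fixations) / len(fixations)

# Toy example: two per-word attention maps on a 2x2 grid.
word_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.0]],
]
machine_map = mean_map(word_maps)
score_on_peak = nss(machine_map, [(0, 0)])   # fixation on the attended pixel
score_off_peak = nss(machine_map, [(1, 1)])  # fixation elsewhere
```

A positive NSS means the fixated pixels are more salient than the map average; a value near zero or below means the map does not predict where people looked.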

Figure 8: Example of spatial attention difference. From left to right: original image, attention in free viewing, attention in image captioning, saliency map predicted by SalGAN, and saliency map from top-down image captioning model.

Table 10 summarizes the consistency of the saliency maps generated either by a standard bottom-up saliency model (trained on free-viewing data) [16] or by a top-down captioning system [23] with ground-truth saliency maps captured either in the free-viewing or the captioning condition (full duration). Interestingly, the bottom-up saliency model obtains a higher score on both free-viewing and task-based ground-truth data. In other words, a bottom-up model captures important regions of the scene better than the top-down soft-attention model, even for captioning fixations. Fig. 8 illustrates some example maps.

Figure 9: Example of Dynamic Time Warping between a human fixation sequence and a machine attention sequence. The image is on the left; the top row is the human attention sequence on the image when describing it, and the bottom row is the top-down model’s attention sequence when generating the caption for the image. The number beside each blue arrow is the distance for each warping step.
Figure 10: Machine-human attention congruency (spatial and temporal) and machine performance on image captioning (CIDEr score)
Temporal consistency:

What is the temporal difference between human and machine attention in image captioning? Here, for the human fixation data, we split the sequence of fixations by intervals of using the sample time stamps. The fixations of each interval are then transformed into separate saliency maps, resulting in a sequence of saliency maps. For machine attention, we use the sequence of saliency maps generated during the scene description. We then employ Dynamic Time Warping (DTW) [15] to align the sequences and compute the difference between them. Figure 9 shows this process for an example sequence. The distance between each frame pair, i.e. a frame in the human attention sequence and a frame in the machine attention sequence, is computed from the similarity score (SIM) [3] between the two attention maps. The final distance between two sequences is the total distance divided by the path length in DTW. Our analysis shows a mean difference of , which is significantly large and demonstrates that the two attention patterns differ significantly over time.
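The alignment can be illustrated with a small DTW sketch over sequences of flattened attention maps. SIM is implemented in its usual form (sum of pixel-wise minima of the two maps after normalizing each to sum to one). The per-frame distance `1 - SIM` is an assumption made for this sketch, since the exact formula is not reproduced here, and the function names are hypothetical. Each map is assumed to have positive total mass.

```python
def sim(p, q):
    """Similarity (histogram intersection) between two flattened attention maps."""
    sum_p, sum_q = sum(p), sum(q)
    return sum(min(a / sum_p, b / sum_q) for a, b in zip(p, q))

def dtw_distance(human, machine):
    """DTW cost between two map sequences, divided by the warping path length.

    The frame distance is taken as 1 - SIM (assumption for illustration).
    """
    n, m = len(human), len(machine)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]   # accumulated distance
    steps = [[0] * (m + 1) for _ in range(n + 1)]    # cells on the best path
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1.0 - sim(human[i - 1], machine[j - 1])
            # Pick the cheapest predecessor: match, or stretch either sequence.
            prev_cost, prev_steps = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
            )
            cost[i][j] = prev_cost + d
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]

# Toy sequences of flattened 1x2 attention maps.
left, right = [1.0, 0.0], [0.0, 1.0]
same_order = dtw_distance([left, left, right], [left, right])
opposite = dtw_distance([left], [right])
```

Because DTW may stretch either sequence, a human sequence that dwells longer on the same region before moving on still aligns perfectly with a shorter machine sequence that visits the regions in the same order.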

Correlation between machine captioning performance and machine-human attention congruency:

Is the consistency between the machine and human subjects’ attention patterns a predictor of the quality of the description generated by the machine? To answer this question, we compute the Spearman correlation coefficient between the machine performance on each image instance in terms of caption quality (CIDEr score) and the consistency of machine attention (spatial and temporal) with human attention (NSS score for spatial consistency, DTW distance for temporal consistency). The results are visualized in Figure 10, indicating very low coefficients, 0.01 and -0.05 for spatial and temporal attention, respectively. In other words, there seems to be no correlation between the quality of a machine’s descriptions of images and the similarity between its attention patterns and human ones.
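For reference, the Spearman coefficient is the Pearson correlation of the rank-transformed values. A self-contained sketch is shown below, with tied values receiving their average rank; the per-image scores in the example are made-up toy numbers, not results from the study.

```python
import math

def ranks(values):
    """1-based ranks, with ties assigned the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy per-image scores (hypothetical values for illustration only).
caption_scores = [0.2, 0.5, 0.9, 1.4]
attention_scores = [1.0, 1.1, 1.3, 2.0]
rho = spearman(caption_scores, attention_scores)
```

A coefficient near zero, as reported above, means that ranking images by attention consistency says essentially nothing about their ranking by caption quality.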

To summarize, we learn that the so-called top-down soft-attention mechanism (1) is a poor predictor of human attention, outclassed by bottom-up saliency models even in the captioning condition, and (2) has low temporal consistency with human attention sequences, and (3) that similarity between the machine’s soft-attention and human subjects’ attention has no bearing on the quality of the description. In other words, the top-down attention mechanisms deployed by the automatic captioning system bear little relation to the attention of humans doing the same task.

5 Discussions and Conclusion

In this paper, we introduced a novel, large scale dataset consisting of synchronized multi-modal attention and caption annotations. We revisited the consistency between human attention and captioning models on this data. We learned that task-based exploration of images leads to higher overall inter-observer congruency (IOC) in comparison to free-viewing. We reconfirmed the strong relationship between described objects and attended ones similar to those that have been observed in free-viewing experiments.

Interestingly, we demonstrated that the top-down soft-attention mechanism used by automatic captioning systems captures neither the spatial locations nor the temporal properties of human attention during captioning. Also, similarity between human and machine attention has no bearing on the quality of the machine-generated captions. These findings suggest that improvements should be possible through further investigation of top-down attention mechanisms.

Overall, the proposed dataset and analysis offer new perspectives for the study of top-down attention mechanisms in captioning pipelines, providing critical hitherto missing information that we believe will assist further advancements in developing and evaluating captioning models. Our dataset and code will be freely shared with the community to expedite future research in this area. Data is available at: LINK MASKED.


  • [1] A. Borji. Saliency prediction in the deep learning era: An empirical investigation. arXiv preprint arXiv:1810.03716, 2018.
  • [2] A. Borji and L. Itti. State-of-the-art in visual attention modeling. IEEE transactions on pattern analysis and machine intelligence, 35(1):185–207, 2013.
  • [3] Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand. What do different evaluation metrics tell us about saliency models? IEEE transactions on pattern analysis and machine intelligence, 2018.
  • [4] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6298–6306. IEEE, 2017.
  • [5] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. SAM: Pushing the Limits of Saliency Prediction Models. Proceedings of the IEEE/CVF International Conference on Computer Vision and Pattern Recognition Workshops, 2018.
  • [6] M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380, 2014.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [8] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [9] S. S. Kruthiventi, K. Ayush, and R. V. Babu. Deepfix: A fully convolutional neural network for predicting human eye fixations. IEEE Transactions on Image Processing, 26(9):4446–4456, 2017.
  • [10] M. Kummerer, T. S. A. Wallis, L. A. Gatys, and M. Bethge. Understanding low- and high-level contributions to fixation prediction. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [12] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 6, page 2, 2017.
  • [13] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55–60, 2014.
  • [14] E. Miltenburg, Á. Kádár, R. Koolen, and E. Krahmer. Didec: The dutch image description and eye-tracking corpus. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3658–3669, 2018.
  • [15] M. Müller. Dynamic time warping. Information retrieval for music and motion, pages 69–84, 2007.
  • [16] J. Pan, C. C. Ferrer, K. McGuinness, N. E. O’Connor, J. Torres, E. Sayrol, and X. Giro-i Nieto. Salgan: Visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081, 2017.
  • [17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [19] H. R. Tavakoli, R. Shetty, A. Borji, and J. Laaksonen. Paying attention to descriptions generated by image captioning models. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2506–2515. IEEE, 2017.
  • [20] P. Vaidyanathan, E. T. Prud'hommeaux, J. B. Pelz, and C. O. Alm. Snag: Spoken narratives and gaze dataset. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 132–137, 2018.
  • [21] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
  • [22] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
  • [23] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
  • [24] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
  • [25] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  • [26] K. Yun, Y. Peng, D. Samaras, G. J. Zelinsky, and T. L. Berg. Studying relationships between human gaze, description, and computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 739–746, 2013.