A huge amount of visual information constantly reaches our eyes during daily activities. A visual scene typically contains many more items than the human visual system can process. Visual attention refers to a series of cognitive operations that allow us to focus on salient elements and filter out irrelevant information. The study of this process is at the crossroads of different disciplines, such as neuroscience, cognitive science, computer vision, and psychology. Many computational models of human attention have been developed in the last three decades (see [4, 2] for an extensive analysis of the state of the art), and the increasing interest in this topic is also due to a wide range of possible applications, including object detection, video compression, advertising, and visual tracking, among others.
Nevertheless, we are still far from formalising a mechanism of attention that approximates human capabilities. Inspired by the idea of [37, 24], and following the path traced out by the seminal works of [20, 19, 7], state-of-the-art models focus on learning saliency from human data. This trend tacitly assumes a central role for the saliency map, and that fixations may eventually be generated according to the Winner-Take-All algorithm. For this reason, these models are commonly evaluated with saliency metrics that take into account only the spatial component of the phenomenon, i.e., the spatial distribution of the fixations, while the temporal dynamics of attention are not considered. Models of scanpaths that take into account the temporal order of fixations have been proposed as well, but they are often task-specific (exploration of shapes or action recognition) and not easily exploitable in a free-viewing scenario. Recently, a general-purpose computational description of attention as a dynamic process has been presented, in which the laws of eye movements are described in the framework of mechanics. The authors propose a mathematical formulation based on a few fundamental principles connected with human attention, such as the boundedness of the retina, the curiosity towards differences in brightness, and the property of brightness invariance. Despite being oriented to scanpath modeling, this approach leads to impressive results in unsupervised saliency prediction (as reported in a large comparison), while the quality of the predicted scanpaths has not been evaluated. Moreover, the fundamental principles mentioned above, although very general, are too local: they do not provide a way to aggregate information from the periphery of the visual field, and they lack a mechanism that avoids revisiting recently visited locations, which might generate unnatural trajectories when exploring the input stream.
A recent approach proposes an explanation of visual attention through gravitational models. This results in an unsupervised scanpath-oriented model in which attention emerges as a dynamic process. Attention is modeled as a unitary mass subject to gravitational attraction, where the gravitational field is induced by masses associated with visual features, such as image details, motion, and, if needed, task-related information. The output of the model is a continuous function that describes the trajectory of the focus of attention. Similarly to , saliency can be obtained as a by-product, summing up the most visited locations.
With the aim of improving the evaluation methodology for models of human visual attention, we underline the limits of the current metrics for scanpath similarity, and we introduce a statistical measure for evaluating the dynamics of the simulated eye movements. All the different approaches are tested on both saliency and scanpath prediction. Despite their simplicity, gravitational models, which are oriented to capturing the dynamics of the phenomenon rather than estimating the saliency map, outperform the other approaches in scanpath prediction. Finally, with emphasis on gravitational models, we present a study of the opinions of human evaluators, collected through a crowd-sourcing platform. To the best of our knowledge, this is the first time that this type of analysis has been conducted to evaluate computational models of visual attention.
This paper is organized as follows. We review gravitational models of visual attention, together with their mathematical formulation, in Section 2. An in-depth discussion of the problem of evaluating models of visual attention is presented in Section 3. An experimental evaluation and comparisons with state-of-the-art models are presented in Section 4, together with the results of the crowd-sourcing evaluation.
2 Gravitational models of visual attention
Most of the analysis in this paper is based on gravitational models of visual attention, recent models that have been shown to yield state-of-the-art performance in unsupervised scanpath prediction. These models are able to generate a dynamic scanpath trajectory without needing to produce a saliency map first, relying entirely on a differential equation that drives the focus of attention.
In order to describe the gravitational model of , we consider a generic stream of visual input, that is defined on the domain
where the subset represents the retina coordinates while is the temporal domain. The visual attention scanpath is the trajectory , being the time index. Attention is driven by the attraction triggered by relevant visual features of the visual input. Let be the function associated to the activation of a visual feature, modeling the presence of a certain property in a pixel of the input stream, i.e.,
Larger values of correspond to a more evident presence of the visual feature in , being the pixel coordinates. Let us assume that a number of ’s are available, each associated with a different property of the input stream.
Inspired by the behaviour of gravitational fields, the visual attention scanpath can be modeled as the motion of a unitary mass subject to the gravitational attraction of a distribution of masses , associated with the visual features,
In particular, is defined as , being the mass associated to feature , that is
where the norm measures the strength of the activation of , and is a customizable scaling factor. The gravitational field is such that the attraction toward the distributional mass is inversely proportional to the squared distance from the focus of attention , and it is given by
where is the convolution operator and . A sketch of this idea is reported in Fig. 1.
Once we are given the gravitational field, the Newtonian differential equations of attention are
where the damping term , with , prevents the oscillations typical of gravitational systems and helps to produce precise ballistic movements toward the salient target. Integrating Eq. 2 allows us to compute the visual attention trajectory at each time instant. (Footnote 1: We converted the equation to a first-order system of differential equations, as commonly done, by introducing auxiliary variables. We then used the odeint function of the Python SciPy library, in the setting in which it automatically determines whether the problem is stiff and chooses the appropriate integration method.)
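To make the role of the damping term concrete, the following minimal sketch integrates a damped, gravitationally driven focus of attention in plain Python. It is an illustration under stated assumptions, not the implementation used in the experiments: the masses are a few hypothetical point masses rather than a feature map, the field follows an inverse-square law with a softening constant `eps` (an assumption introduced here to avoid the singularity at zero distance), the sign convention is chosen so that the field points toward the masses, and a semi-implicit Euler scheme is used in place of SciPy's odeint to keep the example dependency-free. All function and parameter names are hypothetical.

```python
import math


def gravitational_field(pos, masses, eps=1.0):
    """Attraction at `pos` produced by point masses [(x, y, m), ...].

    Each mass pulls toward itself with a softened inverse-square law:
    m * (dx, dy) / (d^2 + eps)^(3/2). The softening `eps` is an assumption
    of this sketch; for d much larger than sqrt(eps) the magnitude is ~ m / d^2.
    """
    fx = fy = 0.0
    for mx, my, m in masses:
        dx, dy = mx - pos[0], my - pos[1]
        denom = (dx * dx + dy * dy + eps) ** 1.5
        fx += m * dx / denom
        fy += m * dy / denom
    return fx, fy


def simulate(masses, start, damping=1.0, dt=0.01, steps=5000):
    """Integrate a'' + damping * a' = E(a) with a semi-implicit Euler scheme.

    Returns the list of visited positions (the simulated trajectory).
    """
    x, y = start
    vx = vy = 0.0
    trajectory = [(x, y)]
    for _ in range(steps):
        fx, fy = gravitational_field((x, y), masses)
        # Update the velocity first, then the position with the new velocity.
        vx += dt * (fx - damping * vx)
        vy += dt * (fy - damping * vy)
        x += dt * vx
        y += dt * vy
        trajectory.append((x, y))
    return trajectory


if __name__ == "__main__":
    # One hypothetical salient target at (10, 0) with mass 50.
    path = simulate([(10.0, 0.0, 50.0)], start=(0.0, 0.0))
    print("final focus position:", path[-1])
```

With a positive damping coefficient the focus overshoots the target slightly and then settles on it; setting `damping=0.0` in the same sketch leaves the focus oscillating around the mass, which is the behaviour the damping term is meant to suppress.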
The choice of the visual features that induce the corresponding masses is decisive in modeling the behaviour of the attention system. A key property of this model is that there are no restrictions on the categories of features one could consider. While some of the features can be fairly generic and not associated with high-level semantics of the observed input stream (e.g., variations of brightness, motion, etc.), other features could be associated with semantic categories (faces, objects, actions, etc.) that might be relevant in specific visual exploration tasks. The features we consider in this paper are described as follows.
Let be the brightness of the video, which yields the feature associated with the spatial gradient of the brightness, . This feature carries information about edges and, generally speaking, it reveals the presence of details in the input data (be it a fixed image or a video).
Let be the optical flow, that is, the velocity field at any . The feature characterizes moving areas in the retina. This feature only applies in the case of video streams, and we computed it using off-the-shelf implementations of optical flow.
Let be the probability of the presence of a human face at any . The feature is active in those areas of the retina characterized by the presence of human faces.
More features could be considered as well, by simply introducing new visual feature functions. While is what we consistently used in all our experiments (Section 4), and were only used in the human evaluations, where video streams are considered too (thus enabling ) and where we also added the contribution of , since faces are known to attract human attention in a task-independent way.
In humans, after a reflexive shift of attention towards the source of stimulation, there is an inhibition to remain in the same location. This mechanism is called Inhibition Of Return (IOR). A similar mechanism is defined in the gravitational model, to prevent the trajectory from getting trapped in regions of equilibrium and to favour a complete exploration of the scene. The dynamics of an inhibition function can be modeled as
where and . This is directly applied to the feature masses, in order to decrease the gravitational contribution from already-visited spatial locations. As a result, the distribution of masses becomes
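The idea of inhibition of return can be illustrated with a small sketch. Since the exact equations are not recoverable from the text here, the form below is purely an assumption: the inhibition at each location relaxes toward a Gaussian bump centred on the current focus, and each mass is scaled by one minus its inhibition. The names `update_inhibition`, `inhibited_masses`, `beta`, and `sigma` are hypothetical.

```python
import math


def update_inhibition(inhibition, locations, focus, beta=0.1, sigma=1.0, dt=1.0):
    """One explicit Euler step of the assumed dynamics dI/dt = beta * (g - I).

    `g` is a Gaussian bump centred on the current focus of attention, so that
    inhibition builds up where the focus dwells and decays elsewhere.
    """
    updated = []
    for value, (x, y) in zip(inhibition, locations):
        d2 = (x - focus[0]) ** 2 + (y - focus[1]) ** 2
        g = math.exp(-d2 / (2.0 * sigma ** 2))
        updated.append(value + dt * beta * (g - value))
    return updated


def inhibited_masses(masses, inhibition):
    """Scale each mass by (1 - I): fully inhibited locations stop attracting."""
    return [m * (1.0 - i) for m, i in zip(masses, inhibition)]


if __name__ == "__main__":
    locations = [(0.0, 0.0), (10.0, 0.0)]
    inhibition = [0.0, 0.0]
    # The focus dwells on the first location for 50 steps.
    for _ in range(50):
        inhibition = update_inhibition(inhibition, locations, focus=(0.0, 0.0))
    print(inhibited_masses([1.0, 1.0], inhibition))
```

In this toy setting the mass at the fixated location is almost switched off while the distant one is left essentially unchanged, so the focus is free to move on to unexplored regions.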
3 Evaluating visual attention dynamics
and saliency metrics, such as the distribution-based Kullback-Leibler divergence (KL), the location-based Area Under the Curve (AUC), and the Normalized Scanpath Saliency (NSS). Different metrics give different importance to the presence of false positives and false negatives in the predicted saliency map, when compared to ground-truth human fixations. Moreover, they can be affected differently by systematic viewing biases, such as the center bias. The problem of evaluating saliency models has been studied in depth, and a set of qualitative and quantitative properties of saliency metrics has been investigated over the years [39, 4, 9].
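As an example of a location-based saliency metric, the sketch below computes NSS in plain Python: the saliency map is normalized to zero mean and unit standard deviation, and the normalized values are averaged at the fixated pixels. The function name and the data layout (a list of rows, fixations as (row, column) pairs) are assumptions made for illustration, as is the use of the population standard deviation.

```python
import math


def nss(saliency, fixations):
    """Normalized Scanpath Saliency.

    saliency  : 2D list of floats (list of rows)
    fixations : list of (row, col) pixel indices of human fixations
    """
    values = [v for row in saliency for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        raise ValueError("NSS is undefined for a constant saliency map")
    scores = [(saliency[r][c] - mean) / std for r, c in fixations]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    toy_map = [[0.0, 0.0], [0.0, 1.0]]
    print(nss(toy_map, [(1, 1)]))  # fixation on the salient pixel: positive score
    print(nss(toy_map, [(0, 0)]))  # fixation on a non-salient pixel: negative score
```

Note that the score depends only on where the fixations fall, not on the order in which they occur, which is exactly the limitation discussed in this section.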
Figure 2. Panels, in order: Human; Synthetic scanpath; Synthetic scanpath.
In the computer vision literature, it is less common to find studies on the problem of evaluating computational models of visual attention that take into account the temporal order of the fixations, in addition to the widely considered spatial distribution of such fixations, i.e., the saliency map. There exist a number of tools for measuring the similarity between human and simulated visual scanpaths (Footnote 2: a visual scanpath is defined as an ordered sequence of fixations). Some authors use the string-edit (Levenshtein) distance (SE) [22, 6, 16], where the visual input is divided into regions, each uniquely labeled with a character. Each scanpath can then be associated with a string, obtained by taking the ordered sequence of labels of the regions in which the fixations fall. The distance between strings is an indicator of the distance between the corresponding scanpaths. The string-edit distance has been shown to be robust with respect to changes in the number of considered regions. It has also been used, in a slightly modified version, to evaluate the scanpaths generated by a number of saliency models. Other authors proposed a scaled time-delay embedding (STDE) [38, 43] measure of similarity, which derives from a metric that is popular in the field of physics for the quantitative comparison of stochastic and dynamic trajectories of varied lengths.
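The string-edit procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in the cited works: the 5 x 5 grid, the function names, and the use of the plain Levenshtein distance (insertions, deletions, and substitutions, each with unit cost) are assumptions.

```python
def scanpath_to_string(fixations, width, height, n_cols=5, n_rows=5):
    """Map (x, y) fixations to one character per grid region.

    The image is divided into n_cols x n_rows regions, labeled 'a', 'b', ...
    in row-major order (at most 26 regions with this labeling).
    """
    labels = []
    for x, y in fixations:
        col = min(int(x / width * n_cols), n_cols - 1)
        row = min(int(y / height * n_rows), n_rows - 1)
        labels.append(chr(ord("a") + row * n_cols + col))
    return "".join(labels)


def levenshtein(s, t):
    """Plain Levenshtein distance computed with a rolling row."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_edit_distance(scanpath_a, scanpath_b, width, height):
    """String-edit distance between two scanpaths on the same image."""
    return levenshtein(scanpath_to_string(scanpath_a, width, height),
                       scanpath_to_string(scanpath_b, width, height))


if __name__ == "__main__":
    human = [(10, 10), (50, 50), (90, 90)]
    model = [(10, 10), (90, 90), (50, 50)]
    print(string_edit_distance(human, model, width=100, height=100))
```

The distance depends on the grid resolution and on the cost assigned to each edit operation, so scores are comparable only when these choices are kept fixed across models.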
However, the widely used saliency and scanpath metrics do not evaluate some important properties of the dynamics of the exploration, which we emphasize in the following example. Let be a true (human) scanpath across three spatial locations , , , and let and be two synthetic (simulated) scanpaths generated with two different models of visual attention, as shown in Fig. 2. Both models visit exactly the same three spatial locations that are visited by the human scanpath, but the three scanpaths differ in the order in which these locations are visited. Since the spatial distribution of the fixations is identical, a saliency metric will indicate a perfect saliency prediction in both synthetic cases. In contrast, visual-scanpath-oriented metrics, such as SE, will capture some differences. As a matter of fact, the string-edit distance between each of the two synthetic scanpaths and the human scanpath is equal to (only an exchange operation in the string is needed). However, we would have reason to say that the synthetic scanpath of Fig. 2 is better than the synthetic scanpath since it yields an initial short saccade, similarly to what happens in the human case. The synthetic scanpath , instead, is only based on long saccades, making it less close to the human scanpath.
In this specific case, it may be useful to study statistical quantities related to the dynamics of the phenomenon under examination. In particular, the distribution of saccade amplitudes provides statistical information that is not captured by the aforementioned popular metrics. This statistical quantity has previously been used in evaluating the quality of computational models of attention [29, 1], in a context in which human exploration biases were added to the model. We propose to evaluate artificially generated scanpaths not only with classic metrics, but also with the KL divergence between the distributions of amplitudes of human saccades and of artificially generated ones.
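A minimal sketch of the proposed measure is given below. It computes saccade amplitudes as Euclidean distances between consecutive fixations, builds smoothed histograms, and compares them with the KL divergence. The bin edges, the smoothing constant `eps`, the direction of the divergence (human relative to model), and the toy coordinates are all assumptions made for illustration.

```python
import math


def saccade_amplitudes(scanpath):
    """Euclidean distance between each pair of consecutive fixations."""
    return [math.dist(p, q) for p, q in zip(scanpath, scanpath[1:])]


def histogram(values, bin_edges, eps=1e-9):
    """Smoothed, normalized histogram over half-open bins [e_k, e_{k+1}).

    Values outside the bin edges are ignored; `eps` avoids empty bins,
    which would make the KL divergence infinite.
    """
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for k in range(len(counts)):
            if bin_edges[k] <= v < bin_edges[k + 1]:
                counts[k] += 1
                break
    total = sum(counts)
    return [(c + eps) / (total + eps * len(counts)) for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Three hypothetical locations: A and B are close, C is far away.
    A, B, C = (0.0, 0.0), (1.0, 0.0), (10.0, 0.0)
    human = [A, B, C]        # short saccade, then a long one
    synthetic_1 = [B, A, C]  # also starts with a short saccade
    synthetic_2 = [A, C, B]  # only long saccades
    edges = [0.0, 5.0, 15.0]
    h = histogram(saccade_amplitudes(human), edges)
    for name, path in [("synthetic_1", synthetic_1), ("synthetic_2", synthetic_2)]:
        print(name, kl_divergence(h, histogram(saccade_amplitudes(path), edges)))
```

In this toy case both synthetic scanpaths visit the same locations as the human one and each differs from it by a single exchange of labels, yet only the amplitude statistics separate the scanpath that reproduces the short initial saccade from the one made of long saccades only.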
Despite introducing some valuable information, the proposed evaluation methodology is still not sufficient. A number of dynamic patterns of visual exploration can characterize the human scanpath. Some may concern the mechanics of the eyes, others the visual patterns of the scene or other high-level semantics. Furthermore, there is wide variability among human subjects. While the definition of an all-inclusive metric is probably not possible, we can evaluate how plausible (i.e., "human-like" or "natural") a synthetic scanpath is by collecting feedback from uninformed observers, who may be sensitive to uncommon behaviours, unnatural vibrations, and meaningless explorations. For this reason, we propose to complement the experimental analysis based on metrics with a crowd-sourcing-based evaluation, in which human evaluators are asked to tag scanpaths as "human-like" or "artificial". A statistical study of the collected evaluator opinions provides an indication of the qualitative plausibility of the output of a computational model.
4 Experimental evaluation and analysis
In what follows, we evaluate a number of different visual attention models following all the strategies of Section 3. Since a huge number of models are present in the literature, we selected those that are most representative of their typology. In Section 4.1 we briefly describe each of the selected models of visual attention. In Section 4.2 we evaluate the models in the tasks of saliency and scanpath prediction. Saccade amplitude statistics are compared to human statistics in Section 4.3. A crowd-sourcing evaluation is performed for the case of gravitational models in Section 4.4.
4.1 State-of-the-art models of human visual attention
Itti  is an unsupervised saliency model. None of the original papers evaluate the model in the task of scanpath prediction. For all experiments, we used the code provided by the authors in their public repositories.
4.2 Saliency and scanpath prediction
Our first analysis consists in benchmarking the selected models on commonly used image datasets, focusing on the tasks of (i) scanpath prediction and (ii) saliency prediction. In particular, the datasets used for scanpath prediction are MIT1003, SIENA12, TORONTO, and KOOTSRA, while we used the well-established CAT2000 dataset for the saliency prediction task. The first datasets contain a total of images, belonging to a wide range of different semantic categories. The resolution of the images varies from to px. The CAT2000 test dataset contains images from different categories and the resolution of the images is px. Table 1 shows the results of an extensive quantitative analysis on a merged collection of the aforementioned datasets of human fixations, comparing state-of-the-art approaches to visual attention.
|Saliency prediction||Scanpath prediction|
|Deep Gaze II||Yes||0.77||1.16||8.17||0.72|
|Expert evaluators||0.55 (0.11)|
|Naive evaluators||0.50 (0.09)|
|Human videos labeled as human||0.53 (0.17)|
|Synthetic videos labeled as human||0.46 (0.18)|
We report the average fraction of videos that were correctly labeled (either as human or non-human). Standard deviation is in brackets.
The results clearly show that supervised deep learning models yield better results than scanpath-oriented models in the task of saliency prediction (Footnote 3: we calculated saliency scores for the model Deep Gaze II on the training set of CAT2000, since the authors did not submit their model to the MIT Saliency Team for the test evaluation), but they fall short in capturing the temporal dynamics, and gravitational models have the best score in the scanpath prediction task.
This discrepancy was anticipated by the analysis of the metrics made in the previous section. While models based on deep learning show a surprising ability to learn associations between visual features and salience, they fail to capture the dynamics of the process. In other words, the two alternatives excel in modeling two different aspects: one related to "where" humans look, the other related to "when" or in what order they do it.
4.3 Saccade amplitude analysis
This analysis, instead, aims to assess how good the models are at predicting "how" people shift attention from one location to another. Saliency and scanpath metrics alone cannot provide a comprehensive tool for the evaluation of visual attention models, since some aspects related to the dynamics are still not captured by those metrics. Here we compare the distribution of human saccade amplitudes with the distributions generated by the simulations of the models under examination. Results are summarized in Fig. 3. The plot of gravitational models is the closest to the human one, and this is further confirmed by the results in Table 3, which shows the KL divergence between the distribution of the saccade amplitudes of each artificial attention model and that of the human scanpaths. The Eymol model also produces competitive results. One explanation for these results is that scanpath-oriented models favour short saccades, incorporating a principle of proximity preference which is also observed in humans [20, 24, 23].
|Grav. models||Eymol||Sam||Deep Gaze II||Itti|
4.4 Crowd-sourcing evaluation
We set up a crowd-sourcing evaluation procedure for testing the best performing models in scanpath prediction, i.e., the gravitational models. To this end, we used a collection of 60 videos from the COUTROT Dataset 1 and 60 static images randomly sampled from MIT1003, which are publicly available datasets of human fixations. Videos include one or several moving objects, landscapes, and scenes of people having a conversation (see supplementary material). The resolution of the video frames is px, and the average duration of each clip is seconds. The size of the static images varies from to px, and they include landscape and portrait images. The duration of the scanpaths in the case of static images was set to seconds.
The participants in the crowd-sourcing evaluation are presented with 20 random videos of scanpaths from the aforementioned collection, in which the gaze position is marked by a red circle, as shown in Fig. 4. Of these, 10 videos show human scanpaths, while the other 10 show synthetic scanpaths generated with the model of Section 2. Subjects are asked to evaluate each scanpath, classifying it as human or synthetic, and they provide their feedback by means of a web platform that we developed for the purpose of this evaluation. Before starting the test, subjects are asked for some personal information about their level of education and their level of knowledge of eye movements (from 1 to 5). We invited 35 different subjects to participate in the crowd-sourcing evaluation, almost evenly distributed between experts on eye movements and non-experts (“naive”).
The statistics we collected are reported in Table 2. Results show that the accuracy in recognizing synthetic scanpaths is close to the accuracy in recognizing human scanpaths. It is important to remark that, since subjects were explicitly asked to distinguish human videos from simulated ones, they had a natural tendency to assign the label “human” only to a portion of the videos, which we found to be 49.4% (+/- 13.7%) of the observed videos. The overall accuracy of the subjects (53%) is very close to that of a random policy (50%). This means that there are few elements that allow the observers to distinguish the human scanpaths from the synthetic ones. The expert evaluators (self-evaluated level of knowledge about eye movements between 3 and 5) reached a score that is slightly larger than that of the naive observers (eye movement knowledge between 1 and 2). In this sense, we conclude that many aspects of the motion dynamics have been captured by the gravitational model (Section 2), as motion artefacts are normally easily perceived by experts in the field. The last two columns of Table 2 confirm that the evaluators had great difficulty in discriminating human scanpaths from artificial ones.
In order to evaluate the agreement between annotators, we used Fleiss' kappa,
$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}, \qquad \bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad P_i = \frac{1}{n(n-1)} \sum_{j} n_{ij} (n_{ij} - 1), \qquad \bar{P}_e = \sum_{j} p_j^2, \qquad p_j = \frac{1}{N n} \sum_{i=1}^{N} n_{ij},$$
where $N$ is the number of videos, $n_{ij}$ is the number of annotators who assigned the $i$-th clip to the $j$-th category (Human or Synthetic), and $n$ is the total number of annotators. The term $\bar{P}_e$ gives the degree of agreement that is attainable by chance. The quantity $P_i$ corresponds to the extent to which annotators agree on the $i$-th clip, that is, the number of pairs of evaluators that are in agreement, relative to the number of all possible evaluator pairs. Values of $\kappa$ close to 1 express complete agreement among annotators, while values of $\kappa$ lower than 0 indicate poor agreement. The analysis shows slight agreement among annotators , while there is fair agreement in the case of expert annotators (, against of the naive annotators). Fleiss' kappa values are very similar in the case of human () and synthetic () scanpath annotations.
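For reference, the standard definition of Fleiss' kappa can be computed with the short sketch below. The input layout (one row per clip, one column per category, each entry counting the annotators who chose that category) and the function name are assumptions; the sketch also assumes that every clip is rated by the same number of annotators, as the standard definition requires.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[i][j] = number of annotators who
    assigned item i to category j. Every row must sum to the same n >= 2."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Proportion of all assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_categories)]
    # Per-item agreement: agreeing rater pairs over all possible rater pairs.
    p_i = [sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
           for row in counts]

    p_bar = sum(p_i) / n_items         # mean observed agreement
    p_e = sum(p * p for p in p_j)      # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Two clips, three annotators, categories (Human, Synthetic).
    print(fleiss_kappa([[3, 0], [0, 3]]))  # annotators always agree
    print(fleiss_kappa([[2, 1], [1, 2]]))  # annotators split on every clip
```

The first toy table yields complete agreement, while the second yields a negative value, i.e., less agreement than would be expected by chance.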
5 Conclusions and Future Work
In this paper we presented a comparison between a selection of state-of-the-art saliency-oriented and scanpath-oriented models of human visual attention. Experimental results show that the approaches that postulate a central role for saliency maps are not effective as a computational description of human visual attention as a dynamic process. Scanpath-oriented models outperform saliency-based approaches in this respect, despite their simplicity. In particular, gravitational models show the best results. Great attention has been directed to the problem of correctly evaluating attention models, taking into account all the fundamental components: the spatial distribution of fixations (saliency), the temporal order of fixations (scanpath prediction), and the movement dynamics. We have shown how certain dynamics can be captured by other statistics, such as the distribution of saccade amplitudes. Gravitational models generated saccade statistics very similar to the human ones, even though they were not explicitly designed to do so. For this reason, we further investigated this approach with a study of the data collected with a crowd-sourcing platform. The analysis of the participants' opinions shows that the scanpaths generated by gravitational models appear plausible and are not easily distinguishable from the human ones, particularly in the case of naive annotators. We hope that this evaluation methodology will be widely applied to attention models from now on, making results more readable, fair, and reliable compared with the well-established saliency benchmarks.
-  (2004) Modelling gaze shift as a constrained random walk. Physica A: Statistical Mechanics and its Applications 331 (1-2), pp. 207–218. Cited by: §3.
-  (2013) State-of-the-art in visual attention modeling. IEEE transactions on pattern analysis and machine intelligence 35 (1), pp. 185–207. Cited by: §1.
-  (2015) Cat2000: a large scale fixation dataset for boosting saliency research. ArXiv preprint, arXiv:1505.03581. Cited by: §3, §4.2.
-  (2013) Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study. IEEE Transactions on Image Processing 22 (1), pp. 55–69. Cited by: §1, §3, §3.
-  (2018) Saliency prediction in the deep learning era: an empirical investigation. arXiv preprint arXiv:1810.03716. Cited by: §1, 2nd item.
-  (1997) Spontaneous eye movements during visual imagery reflect the content of the visual scene. Journal of cognitive neuroscience 9 (1), pp. 27–38. Cited by: §3.
-  (2007) Attention based on information maximization. Journal of Vision 7 (9), pp. 950–950. Cited by: §1, §3, Figure 3, §4.2, Table 3.
-  MIT saliency benchmark. Cited by: 1st item, footnote 3.
-  What do different evaluation metrics tell us about saliency models? IEEE transactions on pattern analysis and machine intelligence 41 (3), pp. 740–757. Cited by: §3.
-  (2009) Faces and text attract gaze independent of the task: experimental data and computer model. Journal of vision 9 (12), pp. 10–10. Cited by: §2.
-  (1995) String editing analysis of human visual search.. Optometry and vision science: official publication of the American Academy of Optometry 72 (7), pp. 439–451. Cited by: §3.
-  (2016) A deep multi-level network for saliency prediction. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pp. 3488–3493. Cited by: 1st item, §4.1.
-  (2013) Toward the introduction of auditory information in dynamic visual attention models. In Image Analysis for Multimedia Interactive Services (WIAMIS), 2013 14th International Workshop on, pp. 1–4. Cited by: §4.4.
-  (1965) The feynman lectures on physics; vol. i. American Journal of Physics 33 (9), pp. 750–752. Cited by: §2.
-  (1971) Measuring nominal scale agreement among many raters.. Psychological bulletin 76 (5), pp. 378. Cited by: §4.4.
-  (2008) What can saliency models predict about eye movements? spatial and sequential aspects of fixations during encoding and recognition. Journal of vision 8 (2), pp. 6–6. Cited by: §3.
-  (2006) VOCUS: a visual attention system for object detection and goal-directed search. Vol. 3899, Springer. Cited by: §1.
-  (2013) Saliency-aware video compression. IEEE Transactions on Image Processing 23 (1), pp. 19–33. Cited by: §1.
-  (2007) Graph-based visual saliency. In Advances in neural information processing systems, pp. 545–552. Cited by: §1.
-  (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis & Machine Intelligence (11), pp. 1254–1259. Cited by: §1, 4th item, §4.1, §4.3.
-  (2009) Learning to predict where humans look. pp. 2106–2113. Cited by: §3, Figure 3, §4.2, §4.4, Table 3.
-  (2014) Speech and language processing. Vol. 3, Pearson London. Cited by: §3.
-  Selecting one among the many: a simple network implementing shifts in selective visual attention. Massachusetts Institute of Technology, Cambridge, Artificial Intelligence Laboratory. Cited by: §4.3.
-  (1987) Shifts in selective visual attention: towards the underlying neural circuitry. In Matters of intelligence, pp. 115–141. Cited by: §1, §4.1, §4.3.
-  (2006) How much the eye tells the brain. Current Biology 16 (14), pp. 1428–1434. Cited by: §1.
-  (2011) Predicting eye fixations on complex visual stimuli using local symmetry. Cognitive computation 3 (1), pp. 223–240. Cited by: §3, Figure 3, §4.2, Table 3.
-  (1951) On information and sufficiency. The annals of mathematical statistics 22 (1), pp. 79–86. Cited by: §3.
-  DeepGaze II: reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563. Cited by: 1st item, §4.1.
-  (2015) Saccadic model of eye movements for free-viewing condition. Vision research 116, pp. 152–164. Cited by: §3.
-  (2009) Saliency-based discriminant tracking. In 2009 IEEE conference on computer vision and pattern recognition, pp. 1007–1013. Cited by: §1.
-  (2013) Action from still image dataset and inverse optimal control to learn task specific visual scanpaths. In Advances in neural information processing systems, pp. 1923–1931. Cited by: §1.
-  (2009) Visual attention. In Encyclopedia of Neuroscience, M. D. Binder, N. Hirokawa, and U. Windhorst (Eds.), pp. 4296–4302. External Links: Cited by: §1.
-  (2005) Components of bottom-up gaze allocation in natural images. Vision research 45 (18), pp. 2397–2416. Cited by: §3.
-  (1985) Inhibition of return: neural basis and function. Cognitive neuropsychology 2 (3), pp. 211–228. Cited by: §2.
-  (2005) An information maximization model of eye movements. In Advances in neural information processing systems, pp. 1121–1128. Cited by: §1.
-  (2013) Saliency and human fixations: state-of-the-art and study of comparison metrics. In Proceedings of the IEEE international conference on computer vision, pp. 1153–1160. Cited by: §3.
-  (1980) A feature-integration theory of attention. Cognitive psychology 12 (1), pp. 97–136. Cited by: §1.
-  (2011) Simulating human saccadic scanpaths on natural images. In CVPR 2011, pp. 441–448. Cited by: §3.
-  (2011) Measures and limits of models of fixation selection. PloS one 6 (9), pp. e24038. Cited by: §3.
-  (2011) Visual search for arbitrary objects in real scenes. Attention, Perception, & Psychophysics 73 (6), pp. 1650. Cited by: §1.
-  (2017) Variational laws of visual attention for dynamic scenes. In Advances in Neural Information Processing Systems, pp. 3823–3832. Cited by: §1, 2nd item, §4.3.
-  (2019) Gravitational laws of focus of attention. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §2, §2, 3rd item.
-  (2018) FixaTons: a collection of human fixations datasets and metrics for scanpath similarity. ArXiv preprint, arXiv:1802.02534. Cited by: §3, Figure 3, §4.2, Table 3.