Wave Propagation of Visual Stimuli in Focus of Attention

06/19/2020 · by Lapo Faggi, et al. · Università di Siena · UNIFI

Fast reactions to changes in the surrounding visual environment require efficient attention mechanisms to reallocate computational resources to the most relevant locations in the visual field. While current computational models keep improving their predictive ability thanks to the increasing availability of data, they still struggle to approximate the effectiveness and efficiency exhibited by foveated animals. In this paper, we present a biologically-plausible computational model of focus of attention that exhibits spatiotemporal locality and that is very well-suited for parallel and distributed implementations. Attention emerges as a wave propagation process originated by visual stimuli corresponding to details and motion information. The resulting field obeys the principle of "inhibition of return", so as not to get stuck in potential holes. An accurate experimentation of the model shows that it achieves top-level performance in scanpath prediction tasks. This can easily be understood in light of a theoretical result that we establish in the paper, where we prove that, as the velocity of wave propagation goes to infinity, the proposed model reduces to recently proposed state-of-the-art gravitational models of focus of attention.


1 Introduction

Visual attention plays a central role in our daily activities. While we are playing sports, teaching a class or driving a vehicle, the amount of information our eyes collect is dramatically greater than what we are able to process Allport (1989); Koch et al. (2006). To work properly, we need a mechanism that, at each time instant, locates only the most relevant objects, thus optimizing the computational resources Kastner and Ungerleider (2000). Human visual attention performs this task so efficiently that, at the conscious level, it goes unnoticed.

Attention mechanisms have been the subject of massive investigation also in machines, especially whenever they are asked to solve tasks related to human perception, such as video compression, where loss of quality is not perceivable by viewers Itti (2004); Hadizadeh and Bajić (2013), or caption generation Liu et al. (2017); Chen and Zhao (2018). Following the seminal works by Treisman et al. Treisman and Gelade (1980); Treisman (1969) and Koch and Ullman Koch and Ullman (1987), as well as the first computational implementations Itti et al. (1998), over the last three decades scientists have presented numerous attempts to model focus of attention Borji and Itti (2012). The notion of saliency map has been introduced, which consists of a spatial map that indicates the probability of focusing on each pixel. Under the centralized saliency map hypothesis, it has been claimed that shifts in visual attention are then generated by a winner-take-all mechanism Koch and Ullman (1987) that selects, at each time step, the most relevant location in space.

Some authors have tried to improve the estimation of scanpaths by taking into account the dynamics of the human visual selection process. Preserving the centrality of the saliency map, an attempt has been made to introduce hand-crafted human biases to choose subsequent fixations Le Meur and Liu (2015); similarly, others have tried to formalize the fact that top-level information cues continue to increase in importance during visual exploration at the expense of more perceptive low-level information Jiang et al. (2016). In Khosla et al. (2007) the authors propose a bio-inspired visual attention module that is based on the pragmatic choice of identifying proto-objects and learning a ranking to determine the order in which these proto-objects will be attended. All of these approaches still assume the centrality of a saliency map, so that a long stack of global computations over the entire visual field has to be performed before establishing the next fixation point. This is hardly compatible with what is done by the human visual system, which most likely begins computing attention in the early stages of vision Briggs and Usrey (2007); McAlonan et al. (2008) and restricts computation to a small portion of the available visual information Treisman and Gelade (1980); Schlingensiepen et al. (1986).

More recently, Zanca et al. proposed approaches Zanca and Gori (2017); Zanca et al. (2019) that are inspired by physics to model the process of visual attention as a continuous dynamic phenomenon. The focus of attention is regarded as a particle that is gravitationally attracted by virtual masses originated from details and motion in the visual scene. Details are associated with masses proportional to the magnitude of the gradient of the brightness, while masses due to motion are proportional to the magnitude of the optical flow. This framework is applicable to both images and videos, as long as one considers a static image as a video whose frames are repeated at each time step. Moreover, the model proposed in Zanca et al. (2019) also implements the inhibition of return mechanism, by monotonically decreasing the saliency of a given area of the retina that has already been explored in previous moments. Unlike the previous approaches, the prediction of the focus does not rely on a centralized saliency map, but acts directly on early representations of basic features organized in spatial maps. Besides the advantage in real-time applications, these models make it possible to characterize patterns of eye movements (such as fixations, saccades and smooth pursuit) and, despite their simplicity, they reach the state of the art in scanpath prediction. However, when looking at these gravitational models from the biological and computational perspective, one promptly realizes that finding the focus of attention at a certain time requires access to all the visual information of the retina, in order to sum up the attraction arising from every virtual mass. Basically, those models are not local in space.

This paper proposes a paradigm shift in the computation of the attraction proposed in Zanca and Gori (2017); Zanca et al. (2019), which is inspired by the classic link between global gravitational or electrostatic forces and the associated Poisson equation on the corresponding potential, which can be regarded as a spatially local computational model. Interestingly, Newtonian gravity yields an instantaneous propagation of signals, so that a sudden change in the mass density of a given pixel immediately affects the focus of attention, regardless of its location on the retina. While the link is intriguing, modeling the focus of attention by the force emerging from the static nature of the Poisson potential does not give rise to a truly local computational process, since one needs to solve the Poisson equation for each frame. This means that such a static model is still missing the temporal propagation that takes place in peripheral vision mechanisms. We show that the temporal dynamics which arise from diffusion and wave-based mechanisms are effective to naturally implement local computation in both time and space. The intuition is that attention is also driven by virtual masses that are far away from the current focus, by means of wave-based and/or diffusion propagation. We discuss the two different mechanisms of propagation and prove their reduction to gravitational forces as the velocity goes to infinity. The experimental results confirm the intuition that wave-based propagation is more effective in transmitting the information coming from virtual masses than diffusion propagation. However, better results are obtained when properly combining these two propagation mechanisms. Our experimental analysis on scanpaths leads to state-of-the-art results, which can clearly be interpreted when considering the reduction to the gravitational model for infinite propagation velocity. The bottom line is that we can reach state-of-the-art results with a computational model that is truly local in space and time and that is clearly very well-suited for SIMD hardware implementations.

The paper is organized as follows. In Section 2, we give a brief review of gravitational models of attention Zanca et al. (2019) and provide their interpretation in terms of the classic Poisson equation. In Section 3 we propose our model and show that, as the velocity of propagation goes to infinity, it reduces to Zanca et al. (2019). In Section 4, we discuss algorithmic issues along with the experimental setup used to test our model, and illustrate the results on saliency and scanpath prediction. Finally, some conclusions are given in Section 5.

2 Gravitational models of FOA and Poisson’s Equation

According to Zanca et al. (2019), the trajectory $t \mapsto a(t)$ of the focus of attention, starting at $a(0) = a_0$ with velocity $\dot a(0) = v_0$, is the solution of the following Cauchy problem:

  \ddot a(t) = \nabla\varphi(a(t), t),  a(0) = a_0,  \dot a(0) = v_0,    (1)

where $a_0, v_0 \in \mathbb{R}^2$ and the scalar function $\varphi$ is defined as follows:

  \varphi(x, t) := -\frac{1}{2\pi} \int_{\mathbb{R}^2} \log(|x - y|)\, \mu(y, t)\, dy.    (2)

Here $|\cdot|$ is the Euclidean norm in $\mathbb{R}^2$ and $\mu(\cdot, t)$ is the mass distribution at a certain temporal instant $t$ that is present on the retina and is determined by details and motion:

  \mu(x, t) = I(x, t)\,\bigl(\alpha_b\, \mu_b(x, t) + \alpha_p\, \mu_p(x, t)\bigr).    (3)

In particular, $\mu_b = |\nabla b|$, where $b$ is the brightness, while $\mu_p = |p|$, where $p$ is the optical flow, and $\alpha_b$ and $\alpha_p$ are positive parameters. The term $I$ implements the inhibition of return mechanism, and satisfies

  \partial_t I(x, t) = -\beta\, I(x, t)\, e^{-|x - a(t)|^2/(2\sigma^2)},  I(x, 0) = 1,    (4)

with $\beta > 0$. The crucial observation that will be exploited in the next sections to develop a fully local theory of focus of attention is the fact that the potential $\varphi$ satisfies the Poisson equation on $\mathbb{R}^2$:

  -\Delta\varphi(x, t) = \mu(x, t),    (5)

where $\Delta$ is the Laplacian in two dimensions. Such a result, which is the two-dimensional analogue of the Poisson equation for the classical gravitational potential, can be checked by direct calculation. More generally, it follows from these two facts (see for example Evans (2010)):

  1. The function $\Phi(x) := -\frac{1}{2\pi}\log|x|$, defined for $x \in \mathbb{R}^2$, $x \neq 0$, is the fundamental solution of Laplace's equation (i.e. $-\Delta\Phi = \delta_0$);

  2. to get a solution $u$ of the Poisson equation $-\Delta u = f$ in $\mathbb{R}^2$, when $f$ is regular and compactly supported, it is sufficient to choose $u$ as the convolution of $f$ with $\Phi$.

Because the "mass" density $\mu$ is time dependent, and its temporal dynamics is synced with the temporal variations of the video, Eq. (5) should in principle be solved for every $t$. In the next section we will discuss instead how the values of the potential in a spatial neighborhood of a point $x$ at time $t$ are exploited to estimate the value of the potential at $x$ at time $t + \Delta t$, by interpreting Eq. (5) as the elliptic limit of a parabolic or a hyperbolic equation.
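To make Eqs. (2) and (5) concrete, the following sketch (our own illustrative code, not the authors' implementation; the kernel size, pixel spacing and the regularization of the logarithmic singularity are assumptions) computes the potential of a discrete mass map as a convolution with the fundamental solution $\Phi$, together with the attraction field $\nabla\varphi$ driving Eq. (1):

```python
import numpy as np
from scipy.signal import fftconvolve

def potential_2d(mu, h=1.0):
    """Potential phi = Phi * mu, with Phi(x) = -log|x|/(2*pi) (Eq. 2).

    mu : 2D array of virtual masses (e.g. brightness-gradient magnitude
         plus optical-flow magnitude, weighted by inhibition of return)
    h  : pixel spacing of the retina grid
    """
    H, W = mu.shape
    # Kernel centered on the grid; the singularity at the origin is replaced
    # by the value at half a pixel (an arbitrary regularization choice).
    y, x = np.mgrid[-H + 1:H, -W + 1:W].astype(float) * h
    r = np.hypot(x, y)
    r[r == 0.0] = 0.5 * h
    Phi = -np.log(r) / (2.0 * np.pi)
    # Discrete convolution approximating the integral in Eq. (2)
    return fftconvolve(mu, Phi, mode="valid") * h * h

def attraction_field(phi, h=1.0):
    """Gradient of the potential, which drives the focus as in Eq. (1)."""
    gy, gx = np.gradient(phi, h)
    return gx, gy

# toy usage: a single bright "detail" attracts the focus from everywhere
mu = np.zeros((65, 65)); mu[32, 32] = 1.0
phi = potential_2d(mu)
gx, gy = attraction_field(phi)
```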

3 Visual diffusion and wave propagation

The reformulation of FOA gravitational attraction based on the solution of Poisson's equation requires discovering the potential due to the virtual masses at each frame, thus ignoring any temporal relation. This remark clearly underlines also the strong limitation of the solution proposed in Zanca and Gori (2017); Zanca et al. (2019), where the gravitational force is re-computed at each frame from scratch. The main idea behind the reformulation presented in this paper is that, since we expect small temporal changes in the source $\mu$ to cause small changes in the solution $\varphi$, it is natural to model the potential by dynamical equations which prescribe, for each spatial point $x$, how the solution must be updated depending on the spatial neighborhood of $x$ at time $t$. There is in fact an intriguing connection with electrostatics and electrodynamics. We can introduce an explicit temporal dynamics in Eq. (5) by introducing the following two "regularizations"

  (D)  \partial_t u_D - \gamma \Delta u_D = \gamma\mu,  u_D(x, 0) = 0;
  (W)  \partial_{tt} u_W - c^2 \Delta u_W = c^2\mu,  u_W(x, 0) = 0,  \partial_t u_W(x, 0) = 0,    (6)

where $\gamma, c > 0$. Problem (D) is a Cauchy problem for the heat equation with source $\gamma\mu$, whereas problem (W) is a Cauchy problem for a wave equation. The term $\gamma$ in (D) represents the diffusivity constant, whereas the constant $c$ in problem (W) can be regarded as the speed of propagation of the wave. The reason why we can consider problems (D) and (W) as temporal regularizations of Eq. (5) is the following fundamental result.

Theorem 1.

Let $\varphi$ be the solution, described in Eq. (2), of Eq. (5); then the gradients $\nabla u_D$ and $\nabla u_W$ of the solutions $u_D$ and $u_W$ to problems (D) and (W) in Eq. (6) converge (at least pointwise) to $\nabla\varphi$ as $\gamma \to \infty$ and $c \to \infty$, respectively.

Proof.

See Appendix B. ∎

Notice that the convergence result expressed by Theorem 1 is given on the gradients of the potentials and not on the potentials themselves. The interpretation of this result is actually quite straightforward. For problem (D) it means that the solution of the heat equation, in a substance with high diffusivity $\gamma$, instantly converges to its stationary value, which is given by the Poisson equation (5). For problem (W), Theorem 1 turns out to be the two-dimensional analogue of the infinite-speed-of-light limit in electrodynamics; in particular, it expresses the fact that the retarded potentials (see Jackson (2007)), which in three spatial dimensions are the solutions of problem (W), converge to the electrostatic potential as the speed of propagation of the wave goes to infinity ($c \to \infty$). It is worth mentioning that this regularization result had not previously been established in two dimensions. Although both temporal regularizations (D) and (W) achieve the goal of transforming the Poisson equation into an initial value problem in time, from which all subsequent states can be evolved, the different nature of the two PDEs determines, for finite $\gamma$ and $c$, qualitative differences in the FOA trajectories computed using Eq. (1). Indeed, since problem (D) models a diffusion process, it tends to smooth out details, which are instead preserved by the wave-based propagation mechanism (W). For this reason, this paper is mostly concerned with FOA trajectories obtained from potentials that solve a wave equation instead of a diffusion equation. Hence, in the remainder of the paper, we consider the following generalized version of Eq. (6):

  \partial_{tt} u + \lambda\, \partial_t u - c^2 \Delta u = c^2 \mu,    (7)

where $\lambda \ge 0$ is the drag coefficient and $c > 0$. Such an equation in one spatial dimension (and without the source term $\mu$) is known as the telegraph equation (see Evans (2010)). More generally, it describes the propagation of a damped wave. The FOA model proposed in this paper is based on these equations, along with the inhibition of return mechanism expressed by Eq. (4).

Clearly, Eq. (7) is local in both space and time, which is a fundamental ingredient of biological plausibility. In addition, it is very well-suited for SIMD hardware implementations. At first sight, Eq. (4) does not possess spatial locality. While this holds true in any computer-based retina, in nature, moving eyes rely on the principle that the inhibition profile can simply be pre-computed by an appropriate foveal structure. Interestingly, the implementation of moving eyes is the subject of remarkable interest in robotics for different reasons (see e.g. Kuang et al. (2012)).
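As an illustration of this locality (a sketch under our own assumptions; the experiments in Section 4 actually use an implicit scheme), an explicit discretization of Eq. (7) updates every pixel of the potential from its four neighbors at the two previous instants only. Grid spacing, time step and parameter values below are illustrative, and an explicit scheme of this kind is stable only under a CFL-type condition (roughly $c\,\Delta t / h \le 1/\sqrt{2}$):

```python
import numpy as np

def damped_wave_step(u, u_prev, mu, c=1.0, lam=1.0, dt=0.05, h=1.0):
    """One explicit step of  u_tt + lam*u_t - c^2*Lap(u) = c^2*mu  (Eq. 7).

    Each pixel is updated using only its 4 neighbors at the two previous
    time steps: this is the spatiotemporal locality discussed above.
    Dirichlet boundary conditions (u = 0) are enforced by leaving the
    one-pixel frame of the grid untouched.
    """
    lap = np.zeros_like(u)
    lap[1:-1, 1:-1] = (u[1:-1, 2:] + u[1:-1, :-2] +
                       u[2:, 1:-1] + u[:-2, 1:-1] -
                       4.0 * u[1:-1, 1:-1]) / h**2
    # central difference for u_tt, backward difference for u_t
    u_next = np.zeros_like(u)
    u_next[1:-1, 1:-1] = (
        2.0 * u[1:-1, 1:-1] - u_prev[1:-1, 1:-1]
        - dt * lam * (u[1:-1, 1:-1] - u_prev[1:-1, 1:-1])
        + (c * dt) ** 2 * (lap[1:-1, 1:-1] + mu[1:-1, 1:-1])
    )
    return u_next
```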

4 Algorithmic issues and Experimental Results

The proposed model is evaluated in saliency and scanpath prediction tasks and compared with state-of-the-art models. Human eye-tracking datasets exist to provide well-established benchmarks for the evaluation of computational models of visual attention. In this work we make use of CAT2000 Kümmerer et al.; Borji and Itti (2015), provided by the MIT Saliency Team, to evaluate the model in saliency prediction. The CAT2000 training set includes 2000 input stimuli, grouped into 20 different semantic categories. Since CAT2000 does not provide temporal information on human visual explorations, to evaluate the proposed model in the task of scanpath prediction we use a collection of four smaller datasets (MIT1003 Judd et al. (2009), TORONTO Bruce and Tsotsos (2007), KOOTSTRA Kootstra et al. (2011), SIENA12 Zanca et al. (2018)) of eye-tracking data, for a total of 1234 input stimuli. All eye-tracking data have been collected in free-viewing conditions and subjects were exposed to the stimulus from 3 up to 5 sec. For each stimulus, we simulate different scanpaths for each of the two models D and W, corresponding to Eq. (6) and Eq. (7), respectively.

Numerical implementation

From a computational point of view, we must take into account that our computations are limited to a finite region of space, the retina $R \subset \mathbb{R}^2$. Thus, to determine the potential and its time evolution on $R$, we have to impose additional boundary conditions on $\partial R$. We adopt Dirichlet boundary conditions, requiring the vanishing of the potential on the boundary: $u = 0$ on $\partial R$, for all $t \ge 0$. Thus Eq. (7) becomes

  \vartheta\, \partial_{tt} u + \lambda\, \partial_t u - c^2 \Delta u = c^2 \mu  in $R \times (0, +\infty)$,
  u = 0  on $\partial R \times (0, +\infty)$,
  u(x, 0) = 0,  \partial_t u(x, 0) = 0,    (8)

where the coefficient $\vartheta \ge 0$ weighs the second-order (wave) term. The pure wave equation corresponds to $\vartheta = 1$ and $\lambda = 0$, while with $\vartheta = 1$ and $\lambda > 0$ we get a damped wave model. Spurious reflections originating from the boundary are avoided through this damping term since, through an appropriate choice of the parameters, out-going waves are suppressed before they can reach the boundary. The case of the heat equation is recovered with $\vartheta = 0$, with a diffusion coefficient equal to $c^2/\lambda$.

The first step to numerically solve the update equation for the potential is to discretize both the retina, considering a mesh of points (pixels), and the time interval, with steps of length $\Delta t$. In the case where we process visual streams, $\Delta t$ is synced with the temporal resolution of the video by choosing it to be the inverse of the frame rate of the stream. For static images, $\Delta t$ is a priori fixed to an arbitrary value. Then, we adopt the so-called finite difference method, approximating spatial and temporal derivatives through finite differences. Considering an arbitrary pixel of the retina at a certain time $t$, the evaluation of the spatial (temporal) derivatives of the potential at this point just requires the knowledge of the potential at its adjacent points in space (time). According to the chosen approximations for the derivatives, the unknown value of the potential at the following time step is determined by a set of algebraic equations (explicit methods) or by a set of coupled equations (implicit methods). Generally speaking, implicit methods are less afflicted by numerical instabilities, even though they are much slower than the explicit ones Langtangen and Linge (2017). It should be noticed that in our implementation the potential is rescaled by a constant multiplicative factor $k$. We performed a grid search to select the hyper-parameter $k$ that maximized the performance in both saliency and scanpath prediction tasks.

In our experiments, we have chosen backward finite difference approximations for the time derivatives and a central one for the spatial derivatives, resulting in an implicit scheme (a sketch of such a scheme is given after Tab. 1). To test our model, we have chosen two different sets of parameters, summarized in Tab. 1. Model D corresponds to the pure diffusion case, while model W also considers a non-vanishing $\partial_{tt} u$ term.

Model | ϑ | λ | c | β | σ
  D   | 0 | 1 | 1 | 1 | 5
  W   | 1 | 1 | 1 | 1 | 5
Table 1: Parameters of the D and W models.
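As anticipated, here is a minimal sketch of one implicit step for Eq. (8) (our own reconstruction, not the authors' code; the grid size, time step and matrix layout are assumptions). Backward differences in time and a central 5-point Laplacian in space lead to one sparse linear system per frame, whose matrix can be factorized once and reused:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def make_stepper(n, h=1.0, dt=0.04, theta=1.0, lam=1.0, c=1.0):
    """Implicit step for  theta*u_tt + lam*u_t - c^2*Lap(u) = c^2*mu  (Eq. 8).

    Backward differences in time, evaluated at t_{k+1}:
      u_tt ~ (u_next - 2u + u_prev)/dt^2,   u_t ~ (u_next - u)/dt,
    central 5-point Laplacian applied to u_next (Dirichlet: u = 0 outside).
    """
    I = sp.identity(n * n, format="csr")
    L1 = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
    lap = sp.kron(L1, sp.identity(n)) + sp.kron(sp.identity(n), L1)
    A = (theta / dt**2 + lam / dt) * I - c**2 * lap
    lu = splu(A.tocsc())  # factorize once, reuse at every frame

    def step(u, u_prev, mu):
        rhs = (c**2 * mu
               + (2.0 * theta / dt**2 + lam / dt) * u
               - (theta / dt**2) * u_prev).ravel()
        return lu.solve(rhs).reshape(n, n)

    return step

# usage: evolve the potential while feeding per-frame virtual masses mu
step = make_stepper(n=64)
u_prev = np.zeros((64, 64)); u = np.zeros((64, 64))
mu = np.zeros((64, 64)); mu[20, 40] = 1.0
u, u_prev = step(u, u_prev, mu), u
```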

Qualitative results

In Fig. 1 and Fig. 2 some qualitative results are reported. Fig. 1 shows some sequences of fixations obtained through our model. The input static images come from the CAT2000 training set, with an exposure time of 5 sec. In addition, Fig. 2 shows the dynamical evolution of the potential during the exploration of a static image. From this qualitative analysis we can observe that model D seems to have a stronger bias towards the centre of the visual scene than model W. Moreover, model D seems to heavily smooth out the microscopic structure of the mass density distribution, making model W probably preferable despite its slightly worse results in the measures reported in the next sections.

Figure 1: Fixations for model D (top) and model W (bottom) obtained in five seconds of exploration. Images from the CAT2000 training set.
Figure 2: Dynamical evolution of the potential for model D (left) and model W (right) during a few seconds of exploration of a static image.

4.1 Saliency prediction

The saliency prediction task consists of generating saliency maps that predict the probability of each pixel to be attended by a human subject during free viewing Borji et al. (2013). We exploit our model to generate simulated scanpaths by means of numerical integration, as described in the previous section. Different visual explorations are obtained by initializing the system with different, arbitrary initial conditions. Since human subjects during free viewing are usually asked to look at a target point in the center of the screen before a visual stimulus is presented, we choose to initialize the model within a small range of visual angle around the center of the image and with a focus of attention velocity close to zero. Fixations extracted from the simulated scanpaths were accumulated in the so-called fixation map. Finally, a saliency map was obtained by applying a Gaussian smoothing to the fixation map (a sketch of this pipeline is given after the following list). It is well known Bylinskii et al. (2016); Borji and Itti (2012); Kümmerer et al. (2015) that applying a certain amount of smoothing and center bias can improve the performance of models in saliency prediction. Therefore, we performed a grid search on the blurring and center bias parameters to determine the best estimate associated with the model. Metrics exist to compare the generated saliency map with the human one. Here we compare the performance of our model with respect to two different saliency metrics:

  • Area Under the ROC Curve Judd et al. (2009) (AUC). The saliency map is treated as a binary classifier to separate positive from negative samples at various thresholds. The true positive rate is the proportion of saliency map values above threshold at fixation locations. The false positive rate is the proportion of saliency map values above threshold at all pixels.

  • Normalized Scanpath Saliency Peters et al. (2005) (NSS). It measures the mean value, at fixated locations, of the saliency map normalized to have zero mean and unit variance.
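The following sketch (our own hypothetical code; the smoothing width is arbitrary and not the value selected by the grid search) illustrates the pipeline described above, from simulated fixations to a smoothed saliency map, together with the NSS computation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_from_fixations(fixations, shape, sigma=25.0):
    """Accumulate (row, col) fixations and blur them into a saliency map."""
    fmap = np.zeros(shape)
    for r, c in fixations:
        fmap[int(r), int(c)] += 1.0
    return gaussian_filter(fmap, sigma=sigma)

def nss(saliency, human_fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixations."""
    z = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    return float(np.mean([z[int(r), int(c)] for r, c in human_fixations]))
```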

Results are summarized in Tab. 2. The model is compared with five different models. Two of these Kümmerer et al. (2016); Cornia et al. (2016) are state-of-the-art supervised saliency models (we indicate as supervised those approaches in which saliency is learned directly from data by machine learning techniques). Two others Zanca et al. (2019); Zanca and Gori (2017) are state-of-the-art unsupervised scanpath models. The last one is the classic Itti's model Itti et al. (1998). All models were used in their original software implementation provided by the respective authors. We notice that supervised models maintain the state of the art in the estimation of saliency. This may be due to the fact that they can learn semantic properties of the image, which highly correlate with human visual attention Theeuwes et al. (2000); Connor et al. (2004). However, the proposed models achieve state-of-the-art results among unsupervised models and compete very well with supervised models.

4.2 Scanpath prediction

The scanpath prediction task consists of predicting the sequence of fixations that a human subject performs in free-viewing conditions when presented with a stimulus. Unlike the saliency prediction task, here the temporal dynamics of the attention mechanism are evaluated. The models were asked to predict where and in which order the subject will perform fixations on the scene. We used two different evaluation metrics:

  • String-edit distance Foulsham and Underwood (2008) (SED). The input stimulus is divided into regions, each labeled with a character. Scanpaths are turned into strings by associating each fixation with the corresponding region. Finally, the string-edit algorithm is used to provide a measure of the distance between the two generated strings (a sketch follows this list).

  • Scaled time-delay embeddings Zanca et al. (2020) (STDE). This measure derives from quantitative methods in physics to compare stochastic and dynamic scanpaths of varied lengths. It is defined as the average of the minimum Euclidean distances of each sub-sequence of a given length from a target trajectory. Coordinates are normalized between zero and one to obtain comparable measures for images of different sizes.
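As a reference for the first metric, here is a minimal sketch of SED (our own code; the 5x5 grid of regions is an assumption, since the paper does not specify the partition used):

```python
import numpy as np

def scanpath_to_string(fixations, shape, grid=(5, 5)):
    """Label each (row, col) fixation with the character of its grid region."""
    rows, cols = shape
    chars = []
    for r, c in fixations:
        gr = min(int(r / rows * grid[0]), grid[0] - 1)
        gc = min(int(c / cols * grid[1]), grid[1] - 1)
        chars.append(chr(ord('a') + gr * grid[1] + gc))
    return ''.join(chars)

def string_edit_distance(s, t):
    """Levenshtein distance between two scanpath strings (dynamic programming)."""
    d = np.arange(len(t) + 1)[None, :].repeat(len(s) + 1, axis=0)
    d[:, 0] = np.arange(len(s) + 1)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (s[i - 1] != t[j - 1]))
    return int(d[len(s), len(t)])
```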

Difficulties arise when evaluating visual attention models in tasks of scanpath prediction Zanca et al. (2020). We show the results in terms of mean and best prediction scores. In the case of mean, scores are averaged over all subjects in the dataset; in the case of best, we consider, for each of the simulated scanpaths, only the subject that best matches that simulated scanpath.

The proposed models are compared with the same five models considered in the previous experiment. Whenever possible, we used the authors' original implementation to generate fixation sequences. For strictly saliency-oriented models, we applied the winner-take-all algorithm Koch and Ullman (1987): we selected the location with the highest saliency value, then inhibited a radius of 4 degrees of visual angle Hooge and Frens (2000) around it and selected the next fixation. The results are summarized in Tab. 2. They show that the proposed model reaches the state of the art in scanpath prediction with respect to the STDE metric. Better results in terms of STDE indicate a greater adherence to the shape of the target human trajectories, while the discretized spatial grid makes the SED less precise in evaluating spatial properties. The advantage of our model over the other approaches is likely due to the fact that the local implementation induces a proximity preference Koch and Ullman (1987) in the choice of the following target. Supervised models, which benefit from the possibility of learning semantic characteristics of the scenes in a data-driven fashion, are instead outclassed by unsupervised models in this scanpath prediction task.
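For clarity, this is a minimal sketch of the winner-take-all procedure with inhibition of return used for the saliency-oriented baselines (our own code; the pixels-per-degree conversion factor and the number of fixations are assumptions):

```python
import numpy as np

def winner_take_all_scanpath(saliency, n_fix=10, ppd=30.0, radius_deg=4.0):
    """Greedy scanpath from a saliency map with inhibition of return.

    ppd: assumed pixels-per-degree of the display (hypothetical value).
    """
    sal = saliency.astype(float).copy()
    rr, cc = np.mgrid[:sal.shape[0], :sal.shape[1]]
    radius = radius_deg * ppd
    path = []
    for _ in range(n_fix):
        r, c = np.unravel_index(np.argmax(sal), sal.shape)
        path.append((r, c))
        # suppress a disk of 4 degrees of visual angle around the winner
        sal[(rr - r) ** 2 + (cc - c) ** 2 <= radius ** 2] = -np.inf
    return path
```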

Model                               | Supervised | AUC  | NSS   | SED mean | SED best | STDE mean | STDE best
GRAV Zanca et al. (2019)            | No         | 0.84 | 1.57  | 7.34     | 3.72     | 0.81      | 0.85
Eymol Zanca and Gori (2017)         | No         | 0.83 | 1.784 | 7.94     | 4.10     | 0.74      | 0.81
SAM Cornia et al. (2016)            | Yes        | 0.88 | 2.38  | 8.02     | 4.25     | 0.77      | 0.83
Deep Gaze II Kümmerer et al. (2016) | Yes        | 0.77 | 1.16  | 8.17     | 4.34     | 0.72      | 0.79
Itti Itti et al. (1998)             | No         | 0.77 | 1.06  | 8.15     | 4.36     | 0.70      | 0.76
Our model D                         | No         | 0.84 | 1.69  | 7.73     | 3.85     | 0.87      | 0.90
Our model W                         | No         | 0.84 | 1.56  | 7.69     | 3.88     | 0.86      | 0.90
Table 2: Saliency prediction scores (AUC, NSS) and scanpath prediction scores (SED, STDE; for the latter two we report both the mean and the best score). AUC, NSS and STDE are similarity metrics, so larger values are preferable, while SED is a distance, so smaller values correspond to better results.

5 Conclusions

In this paper, we have presented a computational model of focus of attention that is inspired by classic wave and diffusion propagation mechanisms, joined with the principle of inhibition of return. It is proven that the resulting field gets arbitrarily close to that of recently proposed gravitational models of focus of attention, which explains the state-of-the-art experimental results achieved in scanpath prediction tasks. The connection with the gravitational model arises because of a fundamental regularization property that is proven to reduce the proposed model to Poisson's equation. The proposed theory clearly explains the emergence of reactions to peripheral visual stimuli. In particular, a distinctive feature of the proposed computational model is its spatiotemporal locality, which makes it very well-suited for SIMD parallel implementations, and which also represents a fundamental property typically required for biological plausibility.

While the current numerical discretization is based on the traditional 2D gridding of the retina, one can think of approximating the continuous differential equations by higher-order approximations of the spatial derivatives, which would improve the accuracy and offer additional insights on the biological plausibility of the model.
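As an example of such a higher-order scheme (our own illustration, not part of the original formulation), the classical fourth-order accurate central approximation of the Laplacian on a grid with spacing $h$ reads

  \Delta u_{i,j} \approx \frac{-u_{i-2,j} + 16\,u_{i-1,j} - 30\,u_{i,j} + 16\,u_{i+1,j} - u_{i+2,j}}{12 h^2} + \frac{-u_{i,j-2} + 16\,u_{i,j-1} - 30\,u_{i,j} + 16\,u_{i,j+1} - u_{i,j+2}}{12 h^2},

which remains local, trading a slightly wider neighborhood (five points per axis instead of three) for higher accuracy.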

Broader Impact

Our work is a foundational study. We believe that there are neither ethical aspects nor future societal consequences that should be discussed.

References

  • A. Allport (1989) Visual attention. In Foundations of cognitive science, pp. 631–682. Cited by: §1.
  • A. Borji and L. Itti (2012) State-of-the-art in visual attention modeling. IEEE transactions on pattern analysis and machine intelligence 35 (1), pp. 185–207. Cited by: §1, §4.1.
  • A. Borji and L. Itti (2015) Cat2000: a large scale fixation dataset for boosting saliency research. arXiv preprint arXiv:1505.03581. Cited by: §4.
  • A. Borji, H. R. Tavakoli, D. N. Sihite, and L. Itti (2013) Analysis of scores, datasets, and models in visual saliency prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 921–928. Cited by: §4.1.
  • F. Briggs and W. M. Usrey (2007) A fast, reciprocal pathway between the lateral geniculate nucleus and visual cortex in the macaque monkey. Journal of Neuroscience 27 (20), pp. 5431–5436. Cited by: §1.
  • N. Bruce and J. Tsotsos (2007) Attention based on information maximization. Journal of Vision 7 (9), pp. 950–950. Cited by: §4.
  • Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand (2016) What do different evaluation metrics tell us about saliency models?. arXiv preprint arXiv:1604.03605. Cited by: §4.1.
  • S. Chen and Q. Zhao (2018) Boosted attention: leveraging human attention for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 68–84. Cited by: §1.
  • C. E. Connor, H. E. Egeth, and S. Yantis (2004) Visual attention: bottom-up versus top-down. Current biology 14 (19), pp. R850–R852. Cited by: §4.1.
  • M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara (2016) A deep multi-level network for saliency prediction. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pp. 3488–3493. Cited by: §4.1, Table 2.
  • L. C. Evans (2010) Partial differential equations. Vol. 19, American Mathematical Society. Cited by: Appendix B, §2, §3, Wave Propagation of Visual Stimuli in Focus of Attention.
  • T. Foulsham and G. Underwood (2008) What can saliency models predict about eye movements? spatial and sequential aspects of fixations during encoding and recognition. Journal of vision 8 (2), pp. 6–6. Cited by: 1st item.
  • H. Hadizadeh and I. V. Bajić (2013) Saliency-aware video compression. IEEE Transactions on Image Processing 23 (1), pp. 19–33. Cited by: §1.
  • I. T. C. Hooge and M. A. Frens (2000) Inhibition of saccade return (isr): spatio-temporal properties of saccade programming. Vision research 40 (24), pp. 3415–3426. Cited by: §4.2.
  • L. Itti, C. Koch, and E. Niebur (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11), pp. 1254–1259. Cited by: §1, §4.1, Table 2.
  • L. Itti (2004) Automatic foveation for video compression using a neurobiological model of visual attention. IEEE transactions on image processing 13 (10), pp. 1304–1318. Cited by: §1.
  • J. D. Jackson (2007) Classical electrodynamics. John Wiley & Sons. Cited by: §3.
  • M. Jiang, X. Boix, G. Roig, J. Xu, L. Van Gool, and Q. Zhao (2016) Learning to predict sequences of human visual fixations. IEEE Transactions on Neural Networks and Learning Systems 27 (6), pp. 1241–1252. Cited by: §1.
  • T. Judd, K. Ehinger, F. Durand, and A. Torralba (2009) Learning to predict where humans look. In IEEE International Conference on Computer Vision (ICCV), pp. 2106–2113. Cited by: 1st item, §4.
  • D. Khosla, C. K. Moore, D. Huber, and S. Chelian (2007) Bio-inspired visual attention and object recognition. In Intelligent Computing: Theory and Applications V, Vol. 6560, pp. 656003. Cited by: §1.
  • C. Koch and S. Ullman (1987) Shifts in selective visual attention: towards the underlying neural circuitry. In Matters of intelligence, pp. 115–141. Cited by: §1, §4.2.
  • K. Koch, J. McLean, R. Segev, M. A. Freed, M. J. Berry II, V. Balasubramanian, and P. Sterling (2006) How much the eye tells the brain. Current Biology 16 (14), pp. 1428–1434. Cited by: §1.
  • G. Kootstra, B. de Boer, and L. R. Schomaker (2011) Predicting eye fixations on complex visual stimuli using local symmetry. Cognitive computation 3 (1), pp. 223–240. Cited by: §4.
  • X. Kuang, M. Gibson, B. E. Shi, and M. Rucci (2012) Active vision during coordinated head/eye movements in a humanoid robot. IEEE Transactions on Robotics 28 (6). Cited by: §3.
  • M. Kümmerer, Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva, and A. Torralba. MIT/Tübingen saliency benchmark. Note: https://saliency.tuebingen.ai/ Cited by: §4.
  • M. Kümmerer, T. S. Wallis, and M. Bethge (2015) Information-theoretic model comparison unifies saliency metrics. Proceedings of the National Academy of Sciences 112 (52), pp. 16054–16059. Cited by: §4.1.
  • M. Kümmerer, T. S. Wallis, and M. Bethge (2016) DeepGaze II: reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563. Cited by: §4.1, Table 2.
  • H. P. Langtangen and S. Linge (2017) Finite difference computing with PDEs. Springer International Publishing. Cited by: §4.
  • O. Le Meur and Z. Liu (2015) Saccadic model of eye movements for free-viewing condition. Vision research 116, pp. 152–164. Cited by: §1.
  • C. Liu, J. Mao, F. Sha, and A. Yuille (2017) Attention correctness in neural image captioning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §1.
  • K. McAlonan, J. Cavanaugh, and R. H. Wurtz (2008) Guarding the gateway to cortex with attention in visual thalamus. Nature 456 (7220), pp. 391–394. Cited by: §1.
  • R. J. Peters, A. Iyer, L. Itti, and C. Koch (2005) Components of bottom-up gaze allocation in natural images. Vision research 45 (18), pp. 2397–2416. Cited by: 2nd item.
  • K. Schlingensiepen, F. Campbell, G. E. Legge, and T. Walker (1986) The importance of eye movements in the analysis of simple patterns. Vision Research 26 (7), pp. 1111–1117. Cited by: §1.
  • J. Theeuwes, P. Atchley, and A. F. Kramer (2000) On the time course of top-down and bottom-up control of visual attention. Control of cognitive processes: Attention and performance XVIII, pp. 105–124. Cited by: §4.1.
  • A. M. Treisman and G. Gelade (1980) A feature-integration theory of attention. Cognitive psychology 12 (1), pp. 97–136. Cited by: §1, §1.
  • A. M. Treisman (1969) Strategies and models of selective attention.. Psychological review 76 (3), pp. 282. Cited by: §1.
  • S. Kastner and L. G. Ungerleider (2000) Mechanisms of visual attention in the human cortex. Annual Review of Neuroscience 23 (1), pp. 315–341. Cited by: §1.
  • D. Zanca and M. Gori (2017) Variational laws of visual attention for dynamic scenes. In Advances in Neural Information Processing Systems, pp. 3823–3832. Cited by: §1, §1, §3, §4.1, Table 2.
  • D. Zanca, S. Melacci, and M. Gori (2019) Gravitational laws of focus of attention. IEEE transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §1, §1, §2, §3, §4.1, Table 2.
  • D. Zanca, S. Melacci, and M. Gori (2020) Toward improving the evaluation of visual attention models: a crowdsourcing approach. arXiv preprint arXiv:2002.04407. Cited by: 2nd item, §4.2.
  • D. Zanca, V. Serchi, P. Piu, F. Rosini, and A. Rufa (2018) FixaTons: a collection of human fixations datasets and metrics for scanpath similarity. arXiv preprint arXiv:1802.02534. Cited by: §4.

Appendix A Duhamel’s Principle

Consider the following abstract form of the Cauchy problem for an evolution equation of the first order:

where the spatial dependece of is not explicitly written and is a differential operator with respect to the spatial variables. Now consider for every fixed the additional homogeneous problem

and let us indicate with the solution to this problem. Then we can directly check that the function

solves problem . Indeed we have and

In a similar way we can treat the second order problem

  (Q)  \partial_{tt} u = A u + f(t),  t > 0;   u(0) = 0,  \partial_t u(0) = 0.

This time, for any $s \ge 0$, we consider the solution $w(t; s)$ to the problem

  (Q_s)  \partial_{tt} w = A w,  t > s;   w(s) = 0,  \partial_t w(s) = f(s).

Then one can verify that the function

  u(t) = \int_0^t w(t; s)\, ds

is a solution to problem (Q).
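For completeness, the verification in the second-order case (our own spelling-out of the computation, mirroring the first-order one) reads

  \partial_t u(t) = w(t; t) + \int_0^t \partial_t w(t; s)\, ds = \int_0^t \partial_t w(t; s)\, ds,   since  w(t; t) = 0,
  \partial_{tt} u(t) = \partial_t w(t; s)\big|_{s = t} + \int_0^t \partial_{tt} w(t; s)\, ds = f(t) + A u(t),

while the initial conditions $u(0) = 0$ and $\partial_t u(0) = 0$ hold by construction.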

Appendix B Proof of Theorem 1

We prove the claim first for the solution $u_D$ to the heat equation (D) and then for the solution $u_W$ to the wave equation (W).

Heat equation

In order to find an explicit formula for the solution to the problem

  \partial_t u - \gamma \Delta u = \gamma\mu  in $\mathbb{R}^2 \times (0, \infty)$;   u(\cdot, 0) = 0,    (1)

we begin considering the problem

  \partial_t u - \gamma \Delta u = 0;   u(\cdot, 0) = g.    (2)

For this particular PDE the constant $\gamma$ can be absorbed entirely by a time rescaling $s = \gamma t$, so that the solution to Eq. (2) can be written as $u(x, t) = w(x, \gamma t)$, where $w$ solves

  \partial_s w - \Delta w = 0;   w(\cdot, 0) = g.    (3)

Finally the solution to this problem can be found in terms of the well known heat kernel

  H(x, s) = \frac{1}{4\pi s}\, e^{-|x|^2/(4s)},   x \in \mathbb{R}^2,  s > 0,    (4)

as $w(\cdot, s) = H(\cdot, s) * g$. Then the solution to (2) is

  u(x, t) = \int_{\mathbb{R}^2} H(x - y, \gamma t)\, g(y)\, dy.

Applying Duhamel's principle to Eq. (1) yields

  u_D(x, t) = \gamma \int_0^t \int_{\mathbb{R}^2} H(x - y, \gamma(t - s))\, \mu(y)\, dy\, ds.

Now let $\tau = \gamma(t - s)$; then

  u_D(x, t) = \int_0^{\gamma t} \int_{\mathbb{R}^2} H(x - y, \tau)\, \mu(y)\, dy\, d\tau.

The gradient of $u_D$ can now be directly calculated as

  \nabla u_D(x, t) = \int_0^{\gamma t} \int_{\mathbb{R}^2} \nabla_x H(x - y, \tau)\, \mu(y)\, dy\, d\tau.

Taking the formal limit $\gamma \to \infty$ we obtain

  \nabla u_D(x, t) \to \int_0^{\infty} \int_{\mathbb{R}^2} \nabla_x H(x - y, \tau)\, \mu(y)\, dy\, d\tau.

Because $\int_0^\infty \nabla H(x, \tau)\, d\tau = -\frac{1}{2\pi}\frac{x}{|x|^2} = \nabla\Phi(x)$, we have

  \nabla u_D(x, t) \to \int_{\mathbb{R}^2} \nabla\Phi(x - y)\, \mu(y)\, dy = \nabla\varphi(x),

which is indeed the gradient of the potential that solves the Poisson equation with source $\mu$. Notice however that performing the formal limit directly in the expression for the potential would lead to a divergent limit, due to the divergent quantity $\int_0^\infty H(x, \tau)\, d\tau$.
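As a quick numerical sanity check of the key identity used above (our own script, not part of the proof), one can verify that the $\tau$-integral of the heat-kernel gradient reproduces $\nabla\Phi(x) = -x/(2\pi|x|^2)$ at an arbitrary point:

```python
import numpy as np
from scipy.integrate import quad

# Check: integral over tau of grad_x H(x, tau) equals grad Phi(x)
# in two dimensions, with H the heat kernel of Eq. (4) above.
x = np.array([0.7, -0.3])
r2 = float(x @ x)

def grad_H_component(tau, xi):
    # d/dxi of H(x, tau) = -(xi / (2*tau)) * exp(-r2/(4*tau)) / (4*pi*tau)
    return -(xi / (2.0 * tau)) * np.exp(-r2 / (4.0 * tau)) / (4.0 * np.pi * tau)

num = np.array([quad(grad_H_component, 0.0, np.inf, args=(xi,))[0] for xi in x])
ref = -x / (2.0 * np.pi * r2)
assert np.allclose(num, ref, atol=1e-6)
```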

Wave equation

Consider now the nonhomogeneous wave equation

  \partial_{tt} u - c^2 \Delta u = c^2 \mu  in $\mathbb{R}^2 \times (0, \infty)$;   u(\cdot, 0) = 0,  \partial_t u(\cdot, 0) = 0.    (5)

In order to find an explicit solution to this problem, as usual (see Evans [2010]), we start by considering the related problem

  \partial_{tt} u - c^2 \Delta u = 0;   u(\cdot, 0) = 0,  \partial_t u(\cdot, 0) = h.    (6)

The solution of such a problem is $u(x, t) = c^{-1} w(x, ct)$, where $w$ solves

  \partial_{ss} w - \Delta w = 0;   w(\cdot, 0) = 0,  \partial_s w(\cdot, 0) = h.    (7)

Since the solution of (7) is given by the Poisson's formula in two dimensions

  w(x, s) = \frac{1}{2\pi} \int_{B(x, s)} \frac{h(y)}{\sqrt{s^2 - |x - y|^2}}\, dy,

we have

  u(x, t) = \frac{1}{2\pi c} \int_{B(x, ct)} \frac{h(y)}{\sqrt{c^2 t^2 - |x - y|^2}}\, dy.    (8)

The solution to Eq. (5) can be obtained from the solution of Eq. (7) via Duhamel's principle (see Section A). In this case we have that

  u_W(x, t) = \int_0^t v(x, t; s)\, ds,    (9)

where $v(\cdot, \cdot\,; s)$ solves

  \partial_{tt} v - c^2 \Delta v = 0,  t > s;   v(\cdot, s; s) = 0,  \partial_t v(\cdot, s; s) = c^2 \mu.

Thus from Eq. (8) and (9) we immediately have:

  u_W(x, t) = \frac{c}{2\pi} \int_0^t \int_{B(x, c(t - s))} \frac{\mu(y)}{\sqrt{c^2 (t - s)^2 - |x - y|^2}}\, dy\, ds.

Now let us make the change of variables $\tau = c(t - s)$ in the integral over $s$; we thus obtain:

  u_W(x, t) = \frac{1}{2\pi} \int_0^{ct} \int_{B(x, \tau)} \frac{\mu(y)}{\sqrt{\tau^2 - |x - y|^2}}\, dy\, d\tau.    (10)
Example 1.

In order to understand Eq. (10), let us consider the case of a unit mass fixed at the origin: $\mu = \delta_0$. In this case

  u_W(x, t) = \frac{1}{2\pi} \int_{|x|}^{ct} \frac{d\tau}{\sqrt{\tau^2 - |x|^2}}  for $ct > |x|$ (and zero otherwise).    (11)

Then

  u_W(x, t) = \frac{1}{2\pi} \log\!\left(ct + \sqrt{c^2 t^2 - |x|^2}\right) - \frac{1}{2\pi}\log|x|.    (12)

Notice that in the last formula, as $c \to \infty$, we have a divergent part $\frac{1}{2\pi}\log(ct + \sqrt{c^2 t^2 - |x|^2}) \approx \frac{1}{2\pi}\log(2ct)$ plus a finite part $-\frac{1}{2\pi}\log|x|$, which is indeed our initial guess for this limit; moreover, the divergent part has a vanishing spatial gradient, meaning that it does not affect the force, which is entirely given by the gradient of $-\frac{1}{2\pi}\log|x|$; this is indeed the force that one derives from the Poisson equation on $\mathbb{R}^2$.

This example suggests to look for the convergence of $\nabla u_W$ rather than that of $u_W$ itself, since the latter quantity can give rise to divergences.

In general, let us now come back to Eq. (10). This integral is performed over a cone in the space $(y, \tau)$. A little thinking shows that such integration can be rearranged as follows:

  u_W(x, t) = \frac{1}{2\pi} \int_{B(x, ct)} \mu(y) \log\!\left(\frac{ct + \sqrt{c^2 t^2 - |x - y|^2}}{|x - y|}\right) dy.    (13)

Or, equivalently, performing the change of variables $z = (y - x)/(ct)$,

  u_W(x, t) = \frac{(ct)^2}{2\pi} \int_{B(0, 1)} \mu(x + ctz) \log\!\left(\frac{1 + \sqrt{1 - |z|^2}}{|z|}\right) dz.    (14)

Since we are interested in the limit $c \to \infty$, we can expand the logarithm in Eq. (13) in powers of $|x - y|/(ct)$ around zero:

  \log\!\left(\frac{ct + \sqrt{c^2 t^2 - |x - y|^2}}{|x - y|}\right) = \log\frac{2ct}{|x - y|} + \sum_{n \ge 1} a_n \left(\frac{|x - y|}{ct}\right)^{2n}.    (15)

At order zero in $1/(ct)$ we have

  u_W^0(x, t) = \frac{1}{2\pi} \int_{B(x, ct)} \mu(y) \log\frac{2ct}{|x - y|}\, dy.    (16)

The gradient of such an expression is

  \nabla u_W^0(x, t) = \nabla_x \left( \frac{1}{2\pi} \int_{B(x, ct)} \mu(y) \log\frac{2ct}{|x - y|}\, dy \right).    (17)

Now we can use the following version of the divergence theorem

  \nabla_x \int_{B(x, R)} g(x - y)\, \mu(y)\, dy = \int_{B(x, R)} \nabla_x g(x - y)\, \mu(y)\, dy + \int_{\partial B(x, R)} g(x - y)\, \mu(y)\, n(y)\, d\sigma(y),    (18)

where $n$ is the outward normal to $\partial B(x, R)$. In order to prove this, start from the divergence theorem for vector fields:

  \int_{B} \operatorname{div} F\, dy = \int_{\partial B} F \cdot n\, d\sigma;

then choose $F(y) = g(x - y)\, \mu(y)\, e_i$ for each fixed coordinate direction $e_i$, therefore

  \int_{B(x, R)} \partial_{y_i}\bigl(g(x - y)\, \mu(y)\bigr)\, dy = \int_{\partial B(x, R)} g(x - y)\, \mu(y)\, n_i(y)\, d\sigma(y),

which, together with $\partial_{x_i} g(x - y) = -\partial_{y_i} g(x - y)$, gives the wanted formula. If we apply such an expression to Eq. (17) we get

  \nabla u_W^0(x, t) = \frac{1}{2\pi} \int_{B(x, ct)} \nabla_x\!\left(\log\frac{2ct}{|x - y|}\right) \mu(y)\, dy + \frac{1}{2\pi} \int_{\partial B(x, ct)} \log\frac{2ct}{|x - y|}\, \mu(y)\, n(y)\, d\sigma(y).    (19)

Here we found by direct calculation that the boundary term is null (since $\mu$ is compactly supported, it vanishes on $\partial B(x, ct)$ for $ct$ large enough); notice that indeed this term must be zero also because it comes from the inner integral of Eq. (10), which is evaluated between $|x - y|$ and $ct$, so that when $|x - y| = ct$ the whole term is vanishing.

Expanding the gradient in the remaining term we get

  \nabla u_W^0(x, t) = -\frac{1}{2\pi} \int_{B(x, ct)} \frac{x - y}{|x - y|^2}\, \mu(y)\, dy.    (20)

As we take the formal limit $c \to \infty$, we obtain

  \nabla u_W(x, t) \to -\frac{1}{2\pi} \int_{\mathbb{R}^2} \frac{x - y}{|x - y|^2}\, \mu(y)\, dy = \nabla\varphi(x),    (21)

which is the wanted result.

We will now consider the generic term in (15), to show that we correctly performed the limit when, in the approximation of $u_W$, we disregarded all the higher order terms in $1/(ct)$.

Let us define

  I_n(x, t; c) := \frac{1}{(ct)^{2n}} \int_{B(x, ct)} |x - y|^{2n}\, \mu(y)\, dy.

Changing the integration variable to $z = (y - x)/(ct)$, this quantity can be written as

  I_n(x, t; c) = (ct)^2 \int_{B(0, 1)} |z|^{2n}\, \mu(x + ctz)\, dz.

And its derivative (with respect to the parameter $x$) is

  \nabla_x I_n(x, t; c) = (ct)^2 \int_{B(0, 1)} |z|^{2n}\, \nabla\mu(x + ctz)\, dz.

In the last integral let us perform the additional change of variable $y = x + ctz$:

  \nabla_x I_n(x, t; c) = \frac{1}{(ct)^{2n}} \int_{B(x, ct)} |x - y|^{2n}\, \nabla\mu(y)\, dy.

As we formally let $c \to \infty$, we have that the integral $\int_{B(x, ct)} |x - y|^{2n}\, \nabla\mu(y)\, dy$ converges to $\int_{\mathbb{R}^2} |x - y|^{2n}\, \nabla\mu(y)\, dy$, which is finite since $\mu$ is compactly supported. For $n \ge 1$ it is then immediate to check that $\nabla_x I_n(x, t; c) \to 0$ as $c \to \infty$; for $n = 1$ this property can also be checked by direct calculations; indeed

  \nabla_x I_1(x, t; c) = \frac{1}{(ct)^2} \int_{B(x, ct)} |x - y|^2\, \nabla\mu(y)\, dy = O\bigl((ct)^{-2}\bigr).

The quantity $I_n$ has been defined precisely so that the $n$-th term in the expansion in powers of $1/(ct)$ of the gradient of the potential would be

  \frac{a_n}{2\pi}\, \nabla_x I_n(x, t; c).    (22)

Since, using (18),

  \nabla_x I_n(x, t; c) = \frac{1}{(ct)^{2n}} \left( \int_{B(x, ct)} 2n\, |x - y|^{2n - 2} (x - y)\, \mu(y)\, dy + \int_{\partial B(x, ct)} |x - y|^{2n}\, \mu(y)\, n(y)\, d\sigma(y) \right),

Eq. (22) is a quantity of order $(ct)^{-2n}$, and this quantity, for $n \ge 1$, goes to $0$ as $c \to \infty$. ∎