Are all the frames equally important?

05/20/2019 ∙ Oleksii Sidorov et al. ∙ Adobe, NTNU, Harvard University

In this work, we address the problem of measuring and predicting temporal video saliency -- a measure which defines the importance of a video frame for human attention. Unlike conventional spatial saliency, which defines the location of salient regions within a frame (as is done for still images), temporal saliency considers the importance of a frame as a whole and does not exist apart from its context. We propose an interactive cursor-based interface for collecting experimental data about temporal saliency, gather the first human responses, and analyze them. We show that, qualitatively, the produced scores correspond to explicit semantic changes in the video, while, quantitatively, they are highly correlated across observers. In addition, the proposed tool can simultaneously collect fixations similar to those produced by an eye-tracker, in a much more affordable way. Further, this approach may be used to create the first temporal saliency datasets, which would allow training computational predictive algorithms. The proposed interface does not rely on any special equipment, so it can be run remotely and cover a wide audience.

1. Introduction

Figure 2. The proposed interface. A more representative video demonstration is available online: [link].


It seems obvious that some fragments of a video are more important than others. Such fragments concentrate most of the viewer's attention while others remain of no interest. Naïve examples include a climactic scene in a movie, a jump scare in a horror film, the moment of an explosion, or even a slight motion in otherwise calm footage. We denote such fragments as groups of frames with high temporal saliency. Information about temporal saliency is an essential part of video characterization and gives valuable insight into video structure. Such information is directly applicable in video compression (frames which do not attract attention may be compressed more), video summarization (salient frames contain most of the perceived video content), indexing, memorability prediction, and other tasks. One might therefore expect a large number of algorithms and techniques aimed at measuring and predicting temporal saliency. However, this is not the case. Most, if not all, of the well-known works on video saliency address spatial saliency, i.e., the prediction of the spatial distribution of the observer's attention across the frame (in a similar way as if it were an individual image). We hypothesize that this is due to the absence of an established methodology for measuring temporal saliency experimentally, which is crucial for obtaining ground-truth data. Conventionally, ground-truth saliency data is collected using eye-tracking, a technique that produces a continuous temporal signal. In other words, it does not differentiate between frames as a whole, because each frame yields the same kind of output: a pair of gaze-fixation coordinates at a rate defined by the hardware.
In this work, we propose a new methodology for measuring temporal video saliency experimentally – the first, to the best of our knowledge, method of this kind. To this end, we develop a special interface based on the mouse-contingent moving-window approach originally used for measuring saliency maps of static images. We also show that it can simultaneously gather meaningful spatial information which can serve as an approximation of gaze fixations.
During the experiment, observers are presented with repeated blurry video sequences which they can partially deblur using mouse clicks (Fig. 2). A "bubble" in this context denotes a circular region centered at the cursor location which is deblurred by clicking. "Bubbles" are intended to approximate the confined area of focus of the human eye fovea surrounded by a blurred periphery (Gosselin and Schyns, 2001). Since the number of clicks is limited, observers are forced to spend them only on the most "interesting" frames which attract their attention. Statistical analysis of the collected clicks allows us to assign a corresponding level of importance to each frame. This information can be applied directly in numerous video-processing tasks.
To summarize, unlike the conventional approaches which only try to understand where the observer looks, we also study when the observer pays the most attention.

2. Related works

The human visual system has an inherent ability to quickly select visually important regions instead of monotonically processing all the available visual information. One straightforward method of retrieving information about human gaze relies on commercial eye-trackers (e.g., EyeLink, Tobii). Hardware-based eye-tracking has been used widely in various studies on human-computer interaction (Jacob and Karn, 2003)(Nielsen and Pernice, 2010) and particularly in studies of multimedia content (Hayhoe, 2004)(Yarbus, 1967)(Gitman et al., 2014)(Mantiuk et al., 2013). Currently, eye-tracking is considered the most accurate method for approximating gaze fixations and studying the cognitive processes involved in processing visual information. However, it relies on expensive equipment (cameras and infrared sensors) as well as accurate calibration, which limits its use.
A less accurate but much more affordable way of measuring human gaze is based on tracking the mouse cursor position, which has been shown to correlate strongly with gaze fixations (Guo and Agichtein, 2010)(Huang et al., 2012)(Rodden et al., 2008). The most successful algorithms of this type utilize a moving-window paradigm, which masks information outside of the area adjacent to the cursor and requires a user to move the cursor (together with the window around it) to make other regions visible. Such algorithms include the Restricted Focus Viewer (Jansen et al., 2003) and the more recent SALICON (Jiang et al., 2015) and BubbleView (Kim et al., 2017)(Kim et al., 2015). These algorithms have also been used in large online crowdsourcing experiments due to the native scalability of cursor-based approaches. However, they have been studied only in the context of the spatial saliency of static images.
The most affordable way to predict regions which are important for human attention is the use of computational predictive algorithms. The classic ones are based on image statistics and features such as contrast which make objects stand out and capture the user's attention (Le Meur et al., 2006)(Zhang et al., 2008)(Gao and Vasconcelos, 2005)(Bruce and Tsotsos, 2006). Later, a significant improvement in prediction accuracy was achieved with deep-learning methods (Vig et al., 2014)(Kruthiventi et al., 2017)(Huang et al., 2015)(Pan et al., 2016). There are also a number of works aimed at predicting video saliency (Guo and Zhang, 2010)(Mahadevan and Vasconcelos, 2009)(Rudoy et al., 2013)(Seo and Milanfar, 2009), including a few deep-learning-based ones (Wang et al., 2018)(Bak et al., 2018)(Jiang et al., 2017).
However, all the discussed approaches only try to answer the question of where the observer directs their attention, while ignoring the question of when. This is fair for static images, but for video sequences, temporal information and the characterization of a frame as a whole are often even more important than spatial regions. Furthermore, there are no well-known experimental datasets which provide this kind of information (a comprehensive list of saliency datasets: http://saliency.mit.edu/datasets.html) and could be used for training computational algorithms. For example, the common video saliency datasets Hollywood-2 (Vig et al., 2014), UCF sports (Mathe and Sminchisescu, 2015), SAVAM (Gitman et al., 2014), and DHF1K (Wang et al., 2018) only provide eye-tracking results, which are constant in the temporal domain.

3. Methodology

Our approach is inspired by moving-window gaze-approximation methods for still images. Not only do we extend this approach to the video domain and show that it can be used for measuring spatial video saliency, but we also modify it for measuring temporal video saliency, which cannot be measured using eye-tracking or any other existing method (except, evidently, asking the observer directly). The number of fixations registered by an eye-tracker is defined by its hardware and, excluding errors, decreases only when the observer blinks or moves their gaze away from the screen. Temporal saliency might be approximated by measuring the sparsity (spatial variance) of a frame's saliency map, but this relies on the assumption that observers look at random locations when they are not interested and concentrate on one point otherwise. Our approach avoids these assumptions.

3.1. General approach

We propose three different setups which may be considered extensions of static-image cursor-based methods to the video domain:

  • Type A, "temporal information only" (Fig. 1a): all frames are blurred; clicking the mouse deblurs the whole frame until the button is released. Observers deblur the frames they are most interested in, and the total number of clicks defines the saliency score of a frame.

  • Type B, "spatial information only" (Fig. 1b): all frames are blurred; the circular region around the cursor is constantly clear, and the observer only moves the window, without clicking. This is the most direct approximation of eye-tracking: it provides a continuous temporal signal and cursor coordinates (proxy gaze fixations) for each frame.

  • Type C, "combination of temporal and spatial information" (Fig. 1c): all frames are blurred; clicking the mouse deblurs a round window around the cursor. The total number of clicks on a frame defines its temporal saliency score, while the cursor location when the button is pressed approximates the gaze-fixation location and hints at what caused the interest (see the sketch after this list).
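To make the mechanics of Type C concrete, the sketch below composes a single displayed frame from its blurred and clear versions, revealing a circular "bubble" around the cursor only while the mouse button is pressed. This is a minimal Python/OpenCV illustration rather than the authors' MATLAB/Psychtoolbox implementation; the function name is ours, and the radius and blur values are the ones listed in Sec. 4.

```python
import cv2
import numpy as np

RADIUS = 200        # bubble radius in pixels (value from Sec. 4)
BLUR_SIGMA = 15     # std of the Gaussian blur kernel (value from Sec. 4)

def compose_frame(frame, cursor_xy, button_down):
    """Return the frame shown to the observer in a Type C trial.

    frame       : H x W x 3 uint8 image (the clear video frame)
    cursor_xy   : (x, y) current mouse position, integer pixel coordinates
    button_down : True while the mouse button is pressed
    """
    blurred = cv2.GaussianBlur(frame, (0, 0), BLUR_SIGMA)
    if not button_down:
        return blurred                      # Type C: nothing is clear without a click
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    cv2.circle(mask, cursor_xy, RADIUS, 255, thickness=-1)
    out = blurred.copy()
    out[mask > 0] = frame[mask > 0]         # deblur the bubble around the cursor
    return out
```

In the same sketch, Type A would deblur the whole frame while the button is pressed, and Type B would show the bubble continuously regardless of clicks.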

3.2. Discretization

Considering that a short fragment of a video is more likely to attract the user's attention than a single frame, we let users keep the mouse button pressed instead of clicking on each frame they find interesting. This also makes the user experience more pleasant and the interaction smoother. However, we empirically observed an expected tendency: when not forced explicitly, observers tend to keep the mouse button pressed all the time. This is natural, since releasing the button makes the frame blurry again and has no benefit for the user. It is a critical aspect of the proposed method, because the absence of discrete clicks leads to the same saliency score for every frame and, consequently, to the absence of temporal information.
Thus, it is crucial to restrict users artificially. Many schemes could be used: for example, introducing a reward which decreases with clicks and motivates the user to "play safer"; accumulating clicks and defining a limit which the observer should not exceed; decreasing the radius of the circle in inverse proportion to the click rate; etc. However, such a cost function should be simple and should not occupy the observer's attention, which is the main object of the study. Our solution is to simply limit the number of deblurred frames, after which clicking the mouse button stops working, and to additionally limit the number of deblurred frames per continuous click. These limits can also be defined in seconds. The users cannot see the limits; instead, they learn them during a test trial and then follow them intuitively. For example, a 10-second video may have up to 4 seconds of clear frames, but no more than 1 second at once. As a result, a user can make four long clicks of 1 second each or a larger number of shorter clicks, while we are guaranteed to obtain at least four discrete responses per run.
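A minimal sketch of this budgeting logic, assuming a 25 fps video and the per-round and per-click limits quoted above; the class and its frame-based bookkeeping are illustrative, not the released implementation.

```python
class ClickBudget:
    """Tracks how many frames an observer may still deblur within one round."""

    def __init__(self, fps=25, round_limit_s=4.0, click_limit_s=1.0):
        self.round_budget = int(round_limit_s * fps)   # e.g. 100 frames per round
        self.click_budget = int(click_limit_s * fps)   # e.g. 25 frames per continuous click
        self.round_used = 0
        self.click_used = 0

    def allow_deblur(self, button_down):
        """Call once per displayed frame; returns True if the frame may be shown clear."""
        if not button_down:
            self.click_used = 0     # the continuous click ended: reset the per-click counter
            return False
        if self.round_used >= self.round_budget or self.click_used >= self.click_budget:
            return False            # a limit is reached: clicking silently stops working
        self.round_used += 1
        self.click_used += 1
        return True
```

With these defaults, an observer can deblur at most 100 frames per round and at most 25 frames per continuous click, which guarantees at least four discrete responses per run.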

Figure 3. Experimental setup (the lights are off during the session).


                         Pearson correlation coefficient (mean)   Kolmogorov–Smirnov test (mean p-value)
"The underwater world"   0.663   0.694   0.740   0.770            0.119   0.048   0.011   0.036
"Cinematic scene"        0.615   0.711   0.803   0.789            0.164   0.107   0.033   0.067
"Leaves in the wind"     0.694   0.563   0.545   0.647            0.081   0.073   0.044   0.057
"Basketball game"        0.741   0.766   0.863   0.845            0.164   0.099   0.055   0.063
"Diver suffocating"      0.789   0.788   0.820   0.834            0.134   0.092   0.043   0.068
"Meeting of the two"     0.660   0.701   0.740   0.753            0.121   0.112   0.061   0.053
Table 1. Inter-observer consistency of the measured temporal saliency

3.3. Repetition

Initially, the idea of repeating the videos was introduced to gather more responses from each observer and obtain richer statistics. Moreover, if a salient event happens at the end, the observer may reach the limit before seeing it, so a second round is necessary. Also, eye motion and cognitive processing are faster than clicking the mouse, so giving the user an opportunity to predict when an event will happen helps create more accurate saliency maps with a shorter delay.
However, we observed that in the majority of cases the first run is the most informative one, and the user is able to detect the most salient information without preparation. Subsequent repeats shift the user's attention to smaller details.
Eventually, we kept repetition in our experiments because it supports several strategies at the same time: using data only from the first round and discarding the others; using all rounds; or assigning a different weight to each round and computing their linear combination.

3.4. Other parameters

Other important parameters which have not been discussed yet are the blur radius, the radius of the round window, and the task. Each of them requires an additional detailed study. The blur should model the low level of detail in peripheral vision; we selected its value heuristically so that it hides details but still allows the user to understand whether anything important is happening. The same applies to the window radius: on one hand, it should model the 1 degree of visual angle which corresponds to the foveal area, but on the other, it defines the balance between the Type A and Type C experiments and depends on the precision of spatial information needed.
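For reference, the relation between a radius in pixels and visual angle follows from simple viewing geometry. The sketch below computes pixels per degree under assumed screen dimensions: the 50 cm viewing distance is taken from Sec. 4, while the physical width of roughly 518 mm for a 24.1" 1920×1200 panel is our assumption, not a value stated in the paper.

```python
import math

def pixels_per_degree(screen_width_mm, screen_width_px, viewing_distance_mm):
    """Approximate number of pixels subtended by 1 degree of visual angle."""
    mm_per_px = screen_width_mm / screen_width_px
    # size of 1 degree projected onto the screen plane at the given viewing distance
    mm_per_deg = 2 * viewing_distance_mm * math.tan(math.radians(0.5))
    return mm_per_deg / mm_per_px

# Assumed geometry: ~518 mm wide panel, 1920 px across, viewed from 50 cm.
print(pixels_per_degree(518, 1920, 500))   # roughly 32 px per degree
```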


The task given to an observer influences where they look (Yarbus, 1967)(Kim et al., 2017), so this parameter depends on the particular context in which the experiment is performed. In our case, we are interested in ordinary watching of a video without a particular task, so we worked in a "free-view" setup.

4. Experimental setup

The experiments were performed offline using a dedicated setup in the laboratory (Fig. 3) for the sake of fully controlled conditions (in the future, we also plan to run the experiment on Amazon Mechanical Turk to gather a larger database). The display was a 24.1" EIZO ColorEdge CG241W, color-calibrated with an X-Rite Eye-One Pro. The distance between the display and the observer was 50 cm. The code is written in MATLAB with Psychtoolbox-3 (Kleiner et al., [n. d.]) and can be downloaded from https://github.com/acecreamu/temporal-saliency.
Videos with ground-truth eye-tracking data were taken from the SAVAM dataset (Gitman et al., 2014) due to their remarkably high quality, duration, and diverse content. We used eight 10-second HD videos, including two test videos. The content is diverse and includes a basketball game with a scoring moment, a calm shot of leaves in the wind, marine animals underwater, a cinematic scene of a child coming home, surveillance-camera footage of two men meeting, and a suffocating diver emerging from the water.
Interface parameters: radius of the circular window – 200 px; blur kernel – Gaussian with a standard deviation of 15; video duration – 10 s; limit of deblurred frames per round – 4 s (100 frames); limit of deblurred frames per continuous click – 1 s (25 frames); number of repetitions – 5; frame rate of the videos – 25 fps; video resolution – 1280 × 720 px; videos are silent.
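For readability, the same settings can be grouped into a single configuration structure. The following Python dictionary merely restates the values listed above; the field names are ours and do not come from the released MATLAB code.

```python
INTERFACE_PARAMS = {
    "bubble_radius_px": 200,        # radius of the circular deblurred window
    "blur_sigma": 15,               # std of the Gaussian blur kernel
    "video_duration_s": 10,
    "fps": 25,
    "resolution": (1280, 720),      # px; videos are silent
    "round_limit_frames": 100,      # max deblurred frames per round (4 s)
    "click_limit_frames": 25,       # max deblurred frames per continuous click (1 s)
    "n_repetitions": 5,
}
```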
The observers were recruited from university staff and students: 30 subjects in total (15 women and 15 men), aged 21–42 (mean 25.6).

5. Results and discussion

A Type C interface was used in our experiments as the most comprehensive one: it measures both temporal and spatial saliency at the same time, so we evaluate the accuracy of both outputs.

Figure 4. The produced temporal saliency graphs. Thick black line: S_5 (sum over all five rounds); red line: S_1 (first round only); thin black line: S_w (weighted sum of the rounds). Videos from top to bottom: "The underwater world", "Cinematic scene", "Leaves in the wind", "Basketball game". Zoom in for details.


Figure 5. The comparison of spatial saliency maps. Top row in each pair – eye-tracking results, bottom – our results. Videos from top to bottom: ”Cinematic scene”, ”Basketball game”, ”Diver suffocating”.


5.1. Temporal saliency results

Considering that there is no ground-truth temporal saliency data against which accuracy could be estimated, we evaluate the output of the algorithm by analyzing the produced temporal saliency "maps" and estimating inter-observer consistency. Examples of the obtained temporal saliency maps are illustrated in Fig. 4. A demonstration of the videos with saliency scores encoded as a color map is available online: [click to access]. Figure 4 shows three plots for each video, which correspond to different averaging approaches: the sum of all clicks from all five video repeats, S_5 (thick black line); the sum of clicks only from the first round, S_1 (red line); and the weighted sum of clicks from all five rounds, S_w (thin black line). The weighted sum was calculated as S_w = Σ_{i=1..5} w_i · c_i, where c_i is the vector of clicks from round i and w is the vector of weights. All scores are normalized by the maximum number of clicks a frame can receive, that is, the number of repeats M multiplied by the number of observers N. In our experiments N = 30, whereas M depends on the averaging technique: 1 for S_1, 5 for S_5, and Σ_i w_i for S_w.
Qualitative analysis shows that most of the peaks on the temporal saliency graphs correspond to semantically meaningful salient events in the video. This is the main goal and the main achievement of the proposed interface. It can also be seen that an intentionally chosen monotonic video without salient events ("Leaves in the wind") has a relatively flat saliency graph without strongly pronounced peaks (which may become even flatter with richer response statistics). For the other videos, the output of the first round (red line) is very similar to the total output of all five rounds. This means that even when observers start exploring smaller, less salient details in later rounds, they still return to the "main" events and follow a similar pattern of clicks as in the first round. Adding weights to the sum (thin black line) does not influence the results significantly either, which again indicates the similarity of clicks across rounds. Nevertheless, using all five rounds allows gathering five times more responses, making the graphs smoother and, as we show next, producing more consistent responses from each observer.
To estimate consistency between different groups of observers, we randomly split the observers into two groups of 15 people each, compute temporal saliency maps for each group independently, and compare the results. The comparison uses the Pearson correlation coefficient (PCC) between the saliency maps of the two groups, as well as the Kolmogorov-Smirnov test between the two distributions, reporting its p-value. Results are averaged over 100 random splits (the standard deviation is also reported for the PCC). Table 1 shows that the correlation between responses from different observers is very high, up to 0.86. Increasing the number of rounds considered increases the correlation significantly, with maximum values achieved when all five rounds are included. The difference between the weighted and non-weighted sums of the five rounds is small and depends on the particular video. Interestingly, the only video which violates this trend is the monotonic video of leaves, where the observers correlate the most in the first round.
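The split-half consistency analysis can be reproduced with a short script. The sketch below is a minimal illustration using scipy.stats and assumes the same (observers × rounds × frames) click array as in the previous snippet.

```python
import numpy as np
from scipy.stats import pearsonr, ks_2samp

def split_half_consistency(clicks, n_splits=100, seed=0):
    """clicks: (n_observers, n_rounds, n_frames). Returns mean PCC, PCC std, mean KS p-value."""
    rng = np.random.default_rng(seed)
    n_obs = clicks.shape[0]
    pccs, pvals = [], []
    for _ in range(n_splits):
        perm = rng.permutation(n_obs)
        group_a, group_b = perm[: n_obs // 2], perm[n_obs // 2:]
        sal_a = clicks[group_a].sum(axis=(0, 1))   # per-frame click counts, group A
        sal_b = clicks[group_b].sum(axis=(0, 1))   # per-frame click counts, group B
        pccs.append(pearsonr(sal_a, sal_b)[0])
        pvals.append(ks_2samp(sal_a, sal_b)[1])
    return np.mean(pccs), np.std(pccs), np.mean(pvals)
```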

5.2. Spatial saliency results

The spatial saliency maps produced from eye-tracking data and from our interface can be compared visually in Fig. 5 (fixation points are blurred with a Gaussian whose sigma corresponds to 33 px on the screen). As may be seen, the results are very similar, even though we did not use any special equipment and collected the spatial data in addition to the main temporal output.
The saliency maps are evaluated quantitatively using standard saliency metrics: the area under the ROC curve (AUC) (Judd et al., 2009)(Borji et al., 2013) and the Normalized Scanpath Saliency (NSS) (Peters et al., 2005)(Bylinskii et al., 2018). Table 2 presents statistics of the scores computed per frame. The results demonstrate both good and poor performance and differ significantly from video to video.
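For reference, NSS is obtained by z-scoring the predicted saliency map and averaging its values at the ground-truth fixation locations. The minimal sketch below follows that standard definition; the AUC variants cited above involve additional design choices and are omitted here.

```python
import numpy as np

def nss(saliency_map, fixation_map):
    """Normalized Scanpath Saliency.

    saliency_map : 2-D array of predicted per-pixel saliency values
    fixation_map : 2-D binary array, 1 at ground-truth fixation pixels
    """
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-12)
    return s[fixation_map.astype(bool)].mean()
```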
In addition, the quality of the spatial saliency can be assessed visually via the rendered videos with a map overlay [link], as well as the videos showing both the eye-tracking and our results simultaneously [link], presented as blue and red dots respectively.

                         AUC (mean)   NSS (mean)
”The underwater world” 0.617 0.73
”Cinematic scene” 0.712 1.59
”Leaves in the wind” 0.548 0.18
”Basketball game” 0.727 1.52
”Diver suffocating” 0.794 2.66
”Meeting of the two” 0.625 0.95
Table 2. Comparison of the measured spatial saliency maps and gaze-fixations obtained using eye-tracker

5.3. Limitations of the method

Despite the good overall performance demonstrated, the proposed methodology is not flawless. One limitation is that the human reaction and the subsequent mouse movement are slower than eye movements, so the produced results inevitably have a delay. This may not be critical when long video sequences are studied; besides, repeating the videos helps shorten the delay. Another aspect is the influence of discretization and the corresponding strategy of the observer. The experimenter expects the observers to use the mouse only when a salient event appears, because in this case the measured signal will be the clearest. In practice, however, most observers follow a similar pattern of clicks for any video: *long click until the limit* – pause – *long click until the limit* – pause – and so on. Nevertheless, the obtained maps are diverse and meaningfully correspond to the video content. Another debatable observation is that observers tend to make their first click right at the beginning of the video. However, we consider the corresponding results to be accurate, because any frame at the beginning has a naturally high saliency score for an observer who sees the video for the first time and tries to quickly capture its content. A less natural issue is that, due to the seamless repetition of a video, the "tail" of a click at the end of a round continues into the beginning of the next round. We discovered this only during the analysis, so our experiment was not modified accordingly, although it can be corrected very easily by nullifying the click flag at the beginning of each round. Other limitations include the large number of parameters to define and the difficulty of analyzing complex scenes.

6. Conclusions

In this work, we presented a novel mouse-contingent interface designed for measuring temporal and spatial video saliency. Temporal saliency is a novel concept which has been studied disproportionately little in comparison to spatial saliency. Temporal video saliency identifies the important fragments of a video by assigning a score to each frame as a whole. The analysis of the experimental study shows that the proposed interface accurately approximates the temporal saliency "map" and, at the same time, the gaze fixations of the observers. We believe that the most promising use of this approach is gathering large response databases which can then be used for training computational predictive algorithms; we define this as the further direction of our work.
The answer our methodology gives to the question in the title is: "No, they are not."

References

  • Bak et al. (2018) Cagdas Bak, Aysun Kocak, Erkut Erdem, and Aykut Erdem. 2018. Spatio-temporal saliency networks for dynamic saliency prediction. IEEE Transactions on Multimedia 20, 7 (2018), 1688–1698.
  • Borji et al. (2013) Ali Borji, Hamed R Tavakoli, Dicky N Sihite, and Laurent Itti. 2013. Analysis of scores, datasets, and models in visual saliency prediction. In Proceedings of the IEEE International Conference on Computer Vision. 921–928.
  • Bruce and Tsotsos (2006) Neil Bruce and John Tsotsos. 2006. Saliency based on information maximization. In Advances in neural information processing systems. 155–162.
  • Bylinskii et al. (2018) Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and Frédo Durand. 2018. What Do Different Evaluation Metrics Tell Us About Saliency Models? IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2018), 740–757.
  • Gao and Vasconcelos (2005) Dashan Gao and Nuno Vasconcelos. 2005. Discriminant saliency for visual recognition from cluttered scenes. In Advances in neural information processing systems. 481–488.
  • Gitman et al. (2014) Yury Gitman, Mikhail Erofeev, Dmitriy Vatolin, Andrey Bolshakov, and Alexey Fedorov. 2014. Semiautomatic Visual-Attention Modeling and Its Application to Video Compression. In 2014 IEEE International Conference on Image Processing (ICIP 2014). Paris, France, 1105–1109.
  • Gosselin and Schyns (2001) Frédéric Gosselin and Philippe G Schyns. 2001. Bubbles: a technique to reveal the use of information in recognition tasks. Vision research 41, 17 (2001), 2261–2271.
  • Guo and Zhang (2010) Chenlei Guo and Liming Zhang. 2010. A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression. IEEE transactions on image processing 19, 1 (2010), 185–198.
  • Guo and Agichtein (2010) Qi Guo and Eugene Agichtein. 2010. Towards predicting web searcher gaze position from mouse movements. In CHI’10 Extended Abstracts on Human Factors in Computing Systems. ACM, 3601–3606.
  • Hayhoe (2004) Mary M Hayhoe. 2004. Advances in relating eye movements and cognition. Infancy 6, 2 (2004), 267–274.
  • Huang et al. (2012) Jeff Huang, Ryen White, and Georg Buscher. 2012. User see, user point: gaze and cursor alignment in web search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1341–1350.
  • Huang et al. (2015) Xun Huang, Chengyao Shen, Xavier Boix, and Qi Zhao. 2015. SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 262–270.
  • Jacob and Karn (2003) Robert J. K. Jacob and Keith S. Karn. 2003. Eye Tracking in Human-Computer Interaction and Usability Research: Ready to Deliver the Promises. Mind 2, 3 (2003), 4.
  • Jansen et al. (2003) Anthony R Jansen, Alan F Blackwell, and KIM Marriott. 2003. A tool for tracking visual attention: The restricted focus viewer. Behavior research methods, instruments, & computers 35, 1 (2003), 57–69.
  • Jiang et al. (2017) Lai Jiang, Mai Xu, and Zulin Wang. 2017. Predicting video saliency with object-to-motion CNN and two-layer convolutional LSTM. arXiv preprint arXiv:1709.06316 (2017).
  • Jiang et al. (2015) Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. 2015. SALICON: Saliency in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1072–1080.
  • Judd et al. (2009) Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba. 2009. Learning to predict where humans look. In 2009 IEEE 12th international conference on computer vision. IEEE, 2106–2113.
  • Kim et al. (2017) Nam Wook Kim, Zoya Bylinskii, Michelle A Borkin, Krzysztof Z Gajos, Aude Oliva, Fredo Durand, and Hanspeter Pfister. 2017. BubbleView: an interface for crowdsourcing image importance maps and tracking visual attention. ACM Transactions on Computer-Human Interaction (TOCHI) 24, 5 (2017), 36. https://doi.org/10.1145/3131275
  • Kim et al. (2015) Nam Wook Kim, Zoya Bylinskii, Michelle A Borkin, Aude Oliva, Krzysztof Z Gajos, and Hanspeter Pfister. 2015. A crowdsourced alternative to eye-tracking for visualization understanding. In Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems. ACM, 1349–1354.
  • Kleiner et al. ([n. d.]) Mario Kleiner, David Brainard, Denis Pelli, Allen Ingling, Richard Murray, Christopher Broussard, et al. [n. d.]. What’s new in Psychtoolbox-3. ([n. d.]).
  • Kruthiventi et al. (2017) Srinivas SS Kruthiventi, Kumar Ayush, and R Venkatesh Babu. 2017. DeepFix: A fully convolutional neural network for predicting human eye fixations. IEEE Transactions on Image Processing 26, 9 (2017), 4446–4456.
  • Le Meur et al. (2006) Olivier Le Meur, Patrick Le Callet, Dominique Barba, and Dominique Thoreau. 2006. A coherent computational approach to model bottom-up visual attention. IEEE transactions on pattern analysis and machine intelligence 28, 5 (2006), 802–817.
  • Mahadevan and Vasconcelos (2009) Vijay Mahadevan and Nuno Vasconcelos. 2009. Spatiotemporal saliency in dynamic scenes. IEEE transactions on pattern analysis and machine intelligence 32, 1 (2009), 171–177.
  • Mantiuk et al. (2013) Radoslaw Mantiuk, Bartosz Bazyluk, and Rafal K Mantiuk. 2013. Gaze-driven Object Tracking for Real Time Rendering. In Computer Graphics Forum, Vol. 32. Wiley Online Library, 163–173.
  • Mathe and Sminchisescu (2015) Stefan Mathe and Cristian Sminchisescu. 2015. Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37, 7 (2015), 1408–1424.
  • Nielsen and Pernice (2010) Jakob Nielsen and Kara Pernice. 2010. Eyetracking web usability. New Riders.
  • Pan et al. (2016) Junting Pan, Elisa Sayrol, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O’Connor. 2016. Shallow and deep convolutional networks for saliency prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 598–606.
  • Peters et al. (2005) Robert J Peters, Asha Iyer, Laurent Itti, and Christof Koch. 2005. Components of bottom-up gaze allocation in natural images. Vision research 45, 18 (2005), 2397–2416.
  • Rodden et al. (2008) Kerry Rodden, Xin Fu, Anne Aula, and Ian Spiro. 2008. Eye-mouse Coordination Patterns on Web Search Results Pages. In CHI ’08 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’08). ACM, New York, NY, USA, 2997–3002. https://doi.org/10.1145/1358628.1358797
  • Rudoy et al. (2013) Dmitry Rudoy, Dan B Goldman, Eli Shechtman, and Lihi Zelnik-Manor. 2013. Learning video saliency from human gaze using candidate selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1147–1154.
  • Seo and Milanfar (2009) Hae Jong Seo and Peyman Milanfar. 2009. Static and space-time visual saliency detection by self-resemblance. Journal of vision 9, 12 (2009), 15–15.
  • Vig et al. (2014) Eleonora Vig, Michael Dorr, and David Cox. 2014. Large-scale optimization of hierarchical features for saliency prediction in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2798–2805. https://doi.org/10.1109/CVPR.2014.358
  • Wang et al. (2018) Wenguan Wang, Jianbing Shen, Fang Guo, Ming-Ming Cheng, and Ali Borji. 2018. Revisiting video saliency: A large-scale benchmark and a new model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4894–4903.
  • Yarbus (1967) Alfred L Yarbus. 1967. Eye movements and vision. Plenum, New York, NY, USA.
  • Zhang et al. (2008) Lingyun Zhang, Matthew H Tong, Tim K Marks, Honghao Shan, and Garrison W Cottrell. 2008. SUN: A Bayesian framework for saliency using natural statistics. Journal of vision 8, 7 (2008), 32–32.