Bottom-up Attention, Models of

by Ali Borji, et al.

In this review, we examine the recent progress in saliency prediction and propose several avenues for future research. In spite of tremendous efforts and huge progress, there is still room for improvement in terms of finer-grained analysis of deep saliency models, evaluation measures, datasets, annotation methods, cognitive studies, and new applications. This chapter will appear in Encyclopedia of Computational Neuroscience.



1 Definition

Attention - a general concept covering all factors that influence selection mechanisms, whether they are scene-driven and bottom-up, or expectation-driven and top-down.

Salience - parts of a stimulus (e.g. spatial regions, temporal regions, objects) that appear to an observer to stand out relative to their neighboring parts.

Gaze - a coordinated motion of the eyes and head that offers a key property of attention in natural behavior.

Scene free viewing - A task in which participants are asked to look at an image, without any specific instruction. As a default scene analysis task, the free viewing paradigm offers a wealth of insights regarding the cues that attract attention.

2 Detailed Description

2.1 History, scope, and organization

Deciphering the computational mechanisms by which the brain deals with the computational complexity of an overwhelmingly high volume of incoming data (at a rate of bits/s) and how the brain programs eye movements continue to be important problems in neuroscience. Where humans look in images and videos provides important clues regarding how they perceive static (still images) and dynamic scenes (videos), locate the main focus of the image, recognize actions or events, and identify the main participants.

The significant amount of behavioral and computational research on attention has revealed that attention is deployed in two ways: bottom-up (BU) and top-down (TD). The bottom-up component of attention, a.k.a. exogenous or stimulus-driven, processes sensory information primarily in a feed-forward manner. Typically, a series of successive transformations is applied to the entire visual field to highlight the most interesting, important, conspicuous, or so-called salient regions [43, 34]. In contrast, in top-down attention, a.k.a. context-driven or goal-driven, information related to the ongoing behavior, task, or goal is selected (e.g. staying within the road lanes while driving [49]). The reader is referred to [48, 2, 26, 54, 10] for reviews of top-down attention studies.

Conventionally, bottom-up attention models generate a 2D topographic saliency map, where a value at every location determines how salient that location is, relative to its neighbors. The goal in saliency modeling is then to transform an image into its spatially corresponding saliency map (static saliency), possibly also taking into account temporal relations between successive video frames of a movie (dynamic saliency) [36]. Early computational saliency models were primarily concerned with identifying conspicuous regions due to low-level feature contrast. Gradually, however, saliency models have shifted from locating low-level conspicuous image regions to predicting eye movements. This has been driven by the popularity of eye movement datasets captured with the free-viewing task for evaluating attention models [55].

To frame the concepts and models of bottom-up attention so far into a broader picture, we refer to Figure 1 as a possible anchor to help organize this review. The computational modeling effort will be reviewed in four phases. Phase I regards the early computational works (e.g. Koch and Ullman in 1985; Figure 1.a) closely built on top of behavioral hypotheses (e.g. the Feature Integration Theory by Treisman & Gelade). Models in Phase II extended and operationalized Phase I models to be applicable to an unconstrained variety of stimuli (e.g. the Itti et al. model in 1998; Figure 1.b). The Itti et al. model spurred many ideas, resulting in a myriad of saliency models in Phase III. The spectral residual model by Hou and Zhang in 2007 is an example of models in this phase (see Figure 1.c for details on this model). It is a simple model and can be implemented in a few lines of code. Recently, the resurgence and success of neural networks (NN) in computer vision and other areas has brought along a new wave of highly predictive saliency models (Phase IV; Figure 1.d). A notable example here is the SALICON model [31], proposed in 2015. It is the first model to be trained on a large scale attention dataset.
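To illustrate how compact the spectral residual model is, the following is a minimal NumPy sketch of the idea, not the authors' reference implementation. It assumes a grayscale image that has already been down-sampled, and it uses a simple box filter both for the local average of the log spectrum and for the final smoothing (the original work smooths with a Gaussian). The function names and the filter sizes `avg_size` and `smooth_size` are illustrative choices and must be odd.

```python
import numpy as np

def box_filter(x, size=3):
    """Local average over a size x size window (size must be odd), edge-padded."""
    pad = size // 2
    padded = np.pad(x, pad, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (size * size)

def spectral_residual_saliency(image, avg_size=3, smooth_size=9):
    """Return a saliency map in [0, 1] for a 2D grayscale float array."""
    spectrum = np.fft.fft2(image)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    log_amplitude = np.log(amplitude + 1e-8)  # small constant avoids log(0)
    # Spectral residual: log spectrum minus its local average.
    residual = log_amplitude - box_filter(log_amplitude, avg_size)
    # Back to the image domain, keeping the original phase.
    recon = np.fft.ifft2(np.exp(residual + 1j * phase))
    saliency = box_filter(np.abs(recon) ** 2, smooth_size)
    # Rescale to [0, 1] for display or comparison.
    saliency -= saliency.min()
    if saliency.max() > 0:
        saliency /= saliency.max()
    return saliency

# Usage on a synthetic image (random texture), purely to show the call.
rng = np.random.default_rng(0)
example_image = rng.random((64, 64))
example_map = spectral_residual_saliency(example_image)
```

The returned map has the same size as the input, which matches the kind of prediction map this chapter focuses on.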

The principal emphasis in this chapter will be on computational models that can process any visual stimulus in the form of a still image and return a prediction map, the same size as the image, that can be compared to human or animal behavioral or physiological responses (typically fixations during a task such as free viewing in the context of bottom-up attention) [61]. In addition to these models, some other types of models including abstract models, phenomenological models, or models specifically designed for a single task or for a restricted class of stimuli also exist, but will not be covered here (see [34, 61, 23, 65]).

Figure 1: Four models from four different eras of saliency modeling. (a) Phase I: Koch & Ullman (1985) introduced the concept of a saliency map receiving bottom-up inputs from all feature maps, where a winner-take-all (WTA) network selects the most salient location for further processing. (b) Phase II: Itti et al. (1998) proposed a complete computational implementation of a purely bottom-up and task-independent model based on Koch & Ullman’s theory, including multiscale feature maps, saliency map, winner-take-all, and inhibition of return. (c) Phase III: A myriad of saliency models appeared between 1998 and 2013. Schematic representation of the spectral residual model [30] by Hou & Zhang (2007). The log spectrum $\mathcal{L}(f)$ is computed from the down-sampled image (with amplitude $\mathcal{A}(f)$ and phase $\mathcal{P}(f)$). From $\mathcal{L}(f)$, the spectral residual $\mathcal{R}(f)$ is obtained by convolving $\mathcal{L}(f)$ with a local average filter $h_n(f)$ and subtracting the result from $\mathcal{L}(f)$ itself. The saliency map is then the inverse Fourier transform of the exponential of the spectral residual plus phase (i.e. $\mathrm{IFT}(\exp(\mathcal{R}(f) + i\mathcal{P}(f)))$). (d) Phase IV: A new wave of saliency models has emerged with the resurgence of convolutional neural networks (CNNs). Huang et al. [31] proposed a deep saliency model, known as SALICON, that combines information from two pre-trained CNNs, each on a different image scale (fine and coarse). The outputs of the two CNNs are then concatenated to produce the final saliency map.
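The winner-take-all selection and inhibition of return mentioned in panels (a) and (b) can be sketched as a simple loop over a saliency map. This is a simplified illustration under stated assumptions, not the Koch & Ullman or Itti et al. implementation: the winner is taken as the global maximum, and inhibition of return is modeled as hard suppression of a disc around each selected location. The function name and the `ior_radius` parameter are hypothetical.

```python
import numpy as np

def scanpath_wta_ior(saliency, n_fixations=3, ior_radius=2):
    """Pick successive attended locations from a 2D saliency map.

    Winner-take-all: choose the current maximum.
    Inhibition of return: suppress a disc around it so the next
    winner is found elsewhere.
    """
    sal = np.array(saliency, dtype=float)  # work on a copy
    rows, cols = np.indices(sal.shape)
    fixations = []
    for _ in range(n_fixations):
        r, c = np.unravel_index(np.argmax(sal), sal.shape)
        fixations.append((int(r), int(c)))
        inhibited = (rows - r) ** 2 + (cols - c) ** 2 <= ior_radius ** 2
        sal[inhibited] = -np.inf
    return fixations

# Toy map with three peaks of decreasing strength.
toy_map = np.zeros((9, 9))
toy_map[4, 4] = 1.0
toy_map[7, 7] = 0.8
toy_map[1, 1] = 0.5
toy_path = scanpath_wta_ior(toy_map)  # visits the peaks from strongest to weakest
```

Hard suppression keeps the sketch short; a decaying inhibition would be a closer analogue of a biological mechanism.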

Bottom-up attention models carry value for at least two main purposes. First, they present testable predictions that can be utilized for understanding human attention mechanisms at computational, behavioral, and neural levels. Indeed, a large number of cognitive studies have utilized saliency models for model-based hypothesis testing (e.g.  [55, 9]). Second, predicting where people look in images and videos is useful in a wide variety of applications across several domains (e.g. computer vision, robotics, neuroscience, medicine, assistive systems, healthcare, and human-computer interaction). Some example applications include gaze-aware compression and summarization, image enhancement, activity recognition, object segmentation, recognition and detection, image captioning, visual question answering, advertisement design, novice training, patient diagnosis, and surveillance. See [6] for a review.

In what follows, we first examine key concepts of early bottom-up attention models (Section 2.2), followed by a brief overview of deep saliency models (Section 2.3), and a discussion of biological plausibility of deep and non-deep saliency models (Section 2.4). In Section 2.5, current saliency benchmarks, datasets and new methodologies for collecting large scale data are explained. In Section 2.6, we provide a quantitative comparison of a large number of deep and non-deep saliency models in their ability to predict eye movements during free viewing of natural scenes. Section 2.7 explores what current models are missing. Finally, in Section 2.8, we discuss the remaining challenges that need to be addressed in order to build better saliency models.

2.2 Bottom-up attention modeling: pre deep learning era

Computational modeling of bottom-up attention dates back to the seminal theoretical works by Treisman and Gelade [67], the computational architecture by Koch and Ullman [43], and the bottom-up model of Itti et al.  [36]. Itti et al.’s model was able to predict human behavior in visual search tasks (e.g. pop-out versus conjunctive search [33]), demonstrate robustness to image noise [36], detect traffic signs and other salient objects in natural environments [35], detect pedestrians in natural scenes [53], locate military vehicles in overhead imagery [32], and — most importantly — predict where humans look during passive viewing of images and videos [56, 58]. Note that visual salience does not only depend on the physical property of a visual stimulus. It is a consequence of the interaction of a stimulus with other stimuli, as well as with a visual system (biological or artificial). For example, a color-blind person may have a dramatically different experience of visual salience than a person with normal color vision.

Following initial success, many research groups started exploring the notions of bottom-up attention and visual salience, which gave rise to many computational models from 1998 to 2013. In 2013, we summarized 53 bottom-up models along 13 different factors [6, 8]. These models fall into different categories (e.g. Bayesian, learning-based, spectral, cognitive). The early models mainly computed visual salience from bottom-up features in several feature maps, including luminance contrast, red-green and blue-yellow color opponency, and oriented edges [36]. Subsequent models incorporated mid- and higher level features (e.g. face and text [16], gaze direction [57]) to better predict gaze. A thorough examination of all models in this period is certainly not feasible in this limited space. Instead, we list a number of highly influential static and dynamic saliency models as follows. These include both models that are strongly inspired by biological vision, as well as other implementations of saliency that are based on more abstract mathematical definitions.

  1. Static saliency models: Attention for Information Maximization (AIM) [11], Graph-based Visual Saliency (GBVS) [25], Saliency Using Natural statistics (SUN) [77], Spectral Residual saliency (SR) [30], Adaptive Whitening Saliency (AWS) [20], Boolean Map based Saliency (BMS) [76], and the Judd et al. model [40].

  2. Dynamic saliency models: AWS-D [50], OBDL [29], Xu et al.  [72], PQFT [24], and Rudoy et al.  [62].

2.3 Bottom-up attention modeling: deep learning era

Deep learning has emerged as a very successful solution to a variety of problems across several computational domains [51]. A deep-learning architecture is a cascade of simple modules that compute non-linear input-output mappings, and all (or most) of which are subject to learning. A certain type of deep architecture, known as convolutional neural networks (CNN, a.k.a. ConvNets), has been very popular. A typical CNN is composed of a series of convolutional layers and pooling layers, followed by one or more fully connected layers. The parameters of the entire network are learned via backpropagation over a large scale labeled dataset for a certain task (e.g. object recognition). The overall architecture of CNNs resembles the LGN-V1-V2-V4-IT hierarchy of the visual ventral stream, and the convolutional and pooling layers are directly inspired by the classic notions of simple cells and complex cells in the cortex (e.g. [74]).

The success of CNNs on large scale object recognition corpora [19] has brought along a new wave of saliency models that perform markedly better than traditional saliency models based on hand-crafted features. To model bottom-up attention, researchers leverage existing deep architectures that are trained for scene or object recognition and re-purpose them to predict saliency. Often some architectural novelties are also introduced. These models are trained in an end-to-end manner, effectively formulating saliency as a regression problem. To remedy the lack of sufficiently large scale fixation datasets, deep saliency models are often pre-trained on large image datasets and are then fine-tuned on small scale eye movement or click datasets. This procedure allows models to re-use the object-level visual knowledge already learned in CNNs and successfully transfer it to the task of saliency prediction. A large number of deep saliency models have appeared in a relatively short period of time (2014-2018). A detailed discussion of these models goes beyond the scope of this chapter. Instead, we include a number of landmark static and dynamic deep saliency models and refer the reader to [5] for a comprehensive review. These models differ in their architectures and the way they are trained.

  1. Static saliency models: eDN [68], DeepGaze I & II [46], Mr-CNN [52], SALICON [31], DeepFix [45], SAM-ResNet [18], and EML-Net [37].

  2. Dynamic saliency models: Two-stream network [1], Chaabouni et al.  [17], Bazzani et al.  [3], OM-CNN [38], Gorji & Clark [22], ACLNet [69], and SG-FCN [63].

To account for differences in viewing static and dynamic stimuli by human observers (see footnote 5), traditional video saliency models pair bottom-up feature extraction with an ad-hoc motion estimation method that can be performed either by means of optical flow or feature tracking. In contrast, deep video saliency models learn the entire process end-to-end, either by adding temporal information to CNNs, or developing a dynamic structure using recurrent neural networks [28] (as in [3]).

Footnote 5: Observers view images and videos differently. They have much less time to view each video frame (about 1/30 of a second) compared to 3 to 5 seconds over still images. Further, motion is a key component that is missing in still images but strongly attracts human attention over videos (see [62]).

2.4 Biological plausibility of classic and deep saliency models

How well do classic and deep saliency models agree with biological findings on visual attention mechanisms? To answer this question, we would like to highlight two key points. First, as explained above, previous research has shown that CNNs can explain the feed-forward mechanisms involved in rapid object recognition [74]. Second, since deep saliency models are built on top of CNNs, they inherit biologically-plausible properties of CNNs (e.g. the convolution operation). While it is not entirely clear how saliency is computed in these models and whether they work similarly to traditional saliency models (e.g. by implementing center-surround operations, normalization, etc.), there is evidence that they may generalize classic models. To get an idea, consider a CNN with a single convolutional layer followed by a fully connected layer trained to predict fixations. This model generalizes the classic Itti model and also models built upon it that learn to combine feature maps (e.g. [41, 4, 71]). The learned features in the CNN will correspond to orientation, color, intensity, etc., which can be combined linearly by a fully connected layer (or 1x1 convolutions in a fully convolutional neural network). To handle the scale dependency of saliency computation, classic models often utilize multiple image resolutions. In addition to this technique (as in the SALICON model), deep saliency models concatenate maps from several convolutional layers (as in ML-Net), or combine input from earlier layers in the network with later layers (e.g. using skip connections) to preserve fine details.
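The single-layer argument above can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not a trained model: the three filters (a center-surround kernel and two oriented edge kernels) are hand-set stand-ins for features a CNN would learn, absolute value is used as the rectification step, and the combination weights are arbitrary. All names are hypothetical.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Cross-correlation with zero padding; output has the same size as x."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out

# Hand-set filters standing in for learned first-layer features.
center_surround = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]) / 8.0
vertical_edge = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]) / 4.0
horizontal_edge = vertical_edge.T
FILTERS = [center_surround, vertical_edge, horizontal_edge]

def one_layer_saliency(image, weights):
    """One convolutional layer followed by a per-pixel linear combination.

    The weighted sum over feature maps is what a 1x1 convolution computes.
    """
    feature_maps = np.stack([np.abs(conv2d_same(image, k)) for k in FILTERS])
    return np.tensordot(weights, feature_maps, axes=1)

# A single bright pixel on a dark background should be the most salient point.
probe = np.zeros((11, 11))
probe[5, 5] = 1.0
probe_saliency = one_layer_saliency(probe, np.array([1.0, 0.5, 0.5]))
peak = np.unravel_index(np.argmax(probe_saliency), probe_saliency.shape)
```

In a trained network both the filters and the combination weights would be fitted to fixation data, which is the sense in which such a model generalizes a fixed feature-map architecture.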

Despite the above resemblances, the most evident shortcoming of classic models (e.g. the Itti model) with respect to today’s deep architectures is the lack of ability to extract higher level features, objects, or parts of objects. Some classic models remedied this shortcoming by explicitly incorporating object detectors such as face or text detectors. The hierarchical deep structure of CNNs (e.g. 152 layers in the ResNet [27]) allows capturing complex cues that attract gaze automatically. This is perhaps the main reason behind the big performance gap between the two types of models. In practice, however, research has shown that there are cases where classic saliency models win over the deep models [31], indicating that current deep models still fall short in fully explaining low-level saliency (See Figure 9). Further, deep models still fail in capturing some high-level attention cues (e.g. gaze direction, objects of action, relative importance of objects; See Figure 8).

Our understanding of how saliency computation emerges inside deep saliency architectures, and of how the mechanisms involved in these models differ from those implemented in deep models for object recognition, is still limited. In this regard, recent work on understanding the representations learned by CNNs for scene and object recognition can offer new insights to understand deep saliency models (e.g. [78]).

2.5 Benchmarks, datasets, and new data collection methodologies

Benchmarks have been instrumental for advances in computer vision. In the saliency domain, they have sparked considerable interest and spurred many new ideas over the past several years. Two of the most influential image-based benchmarks are MIT and SALICON, shown in Figure 2.

The MIT benchmark is currently the gold standard for evaluating and comparing image-based saliency models. It supports eight evaluation measures for comparison and reports results over two eye movement datasets: 1) MIT300 and 2) CAT2000 [7]. As of October 2018, 85 models are evaluated over the MIT300 dataset, out of which 26 are NN-based models (30% of all submissions). The CAT2000 dataset has 30 models evaluated to date (9 are NN-based). In addition, 5 baselines are computed on both datasets. The SALICON benchmark is relatively new and is primarily based on the SALICON dataset [31]. It offers results over 7 scores and uses the same evaluation tools as the MIT benchmark. These two benchmarks are complementary to each other. The former evaluates models with respect to actual fixations but suffers from small scale data. The latter fixes the scale problem but considers noisy click data as a proxy of attention. In this review, we provide an overview of the SALICON dataset, since it has been very useful for constructing saliency models, but focus on providing results over the MIT benchmark since fixations provide a closer link to visual attention mechanisms than mouse clicks [66].

Figure 2: Two major saliency benchmarks: MIT (left) and SALICON (right). The former compares models to fixations over two datasets (MIT300 and CAT2000), whereas the latter considers mouse trajectories.

Traditionally, saliency models have been validated by comparing their outputs on small scale datasets composed of eye movements of humans or monkeys watching complex image or video stimuli (e.g.  [55, 11]). New large scale databases have emerged by following two trends, 1) increasing the number of images, and 2) introducing new measurements to saliency by providing contextual annotations (e.g. image categories, regional properties, etc.). To annotate these large scale datasets, researchers have resorted to crowd-sourcing schemes such as gaze tracking using webcams [73] or mouse movements [39, 42] as alternatives to lab-based eye trackers (Figure 3). Deep supervised saliency models rely heavily on these sufficiently large and well-labeled datasets. Here, we provide an overview of the most recent and popular image datasets for training and testing saliency models. For a review of fixation datasets pre-deep learning era please consult [70].

  • MIT300: This dataset is a collection of 300 natural images from the Flickr Creative Commons and personal collections [13]. It contains eye movement data of 39 observers which results in a fairly robust ground-truth to test models against. It is a challenging dataset for saliency models, as images are highly varied and natural. Fixation maps of all images are held out and used by the MIT Saliency Benchmark for evaluating models.

  • CAT2000: Released in 2015, this is a relatively large dataset consisting of 2000 training images and 2000 test images spanning 20 different categories such as Cartoons, Art, Satellite, Low resolution images, Indoor, Outdoor, Line drawings, etc. [13]. Images in this dataset come from search engines and computer vision datasets. The training set contains 100 images per category and has fixation annotations from 18 different observers. The test set, used for evaluation, contains the fixations of 24 observers. Both MIT300 and CAT2000 datasets are collected using the EyeLink1000 eye-tracker.

  • SALICON: It is currently the largest crowd-sourced saliency dataset. Images of this dataset come from the Microsoft COCO dataset and contain pixelwise semantic annotations. The SALICON dataset contains 10,000 training images, 5,000 validation images and 5,000 test images. Mouse movements are collected using Amazon Mechanical Turk via a psychophysical paradigm known as mouse-contingent saliency annotation (Figure 3). Eye movements sometimes do not match mouse movements [66]. Nevertheless, this dataset introduces an acceptable and scalable method for the collection of additional data for saliency modeling. Currently, many deep saliency models are first trained on the SALICON dataset and are then fine-tuned on the MIT1003 or CAT2000 datasets for predicting fixations. A similar paradigm, known as BubbleView [42], has also been proposed, where a subject has to successively click on a blurred image to reveal the story of the scene.

Figure 3: Left: New crowd-sourcing methodologies to collect attention data including TurkerGaze [73], SALICON [39], and BubbleView [42]. Right: (a) An illustration of the BubbleView [42] paradigm. (b) The NSS score obtained by comparing mouse clicks and mouse movements to ground truth fixations on natural images in the OSIE dataset [71]. Each point represents the score obtained at a given number of participants, averaged over 10 random splits of participants and all 51 images used. It shows that BubbleView clicks better approximate fixations than SALICON mouse movements for all feasible numbers of participants. It also shows that clicks of 10 participants explain 90% of fixations.

2.6 State of the art performance

A quantitative comparison of static saliency models over the MIT benchmark is presented here. The MIT benchmark has the most comprehensive set of traditional and deep saliency models evaluated over eight scores. We mainly focus on performances using the AUC-Judd and NSS scores since they provide a better model assessment than others [14]. The following 5 baselines are also considered.

  1. Infinite humans: How well a fixation map of infinite observers predicts fixations from a different set of infinite observers, computed as a limit (see [14] for details). Prediction is still not perfect due to observer differences.

  2. One human: How well a fixation map of one observer (taken as a saliency map) predicts the fixations of the other observers. This is computed for each observer in turn, and averaged over all observers. Different individuals are more or less predictive of the rest of the population, and so a range of prediction scores is obtained.

  3. Center: This saliency model is computed by stretching a symmetric Gaussian to fit the aspect ratio of a given image, under the assumption that the center of the image is most salient [64].

  4. Permutation control: For each image, instead of randomly sampling fixations, fixations from a randomly-sampled image are chosen as the saliency map. This process is repeated 5 times per image, and the average performance is computed. This method allows capturing observer and center biases that are independent of the image [44].

  5. Chance: A random uniform value is assigned to each image pixel to build a saliency map. Average performance is computed over 5 such chance saliency maps per image.
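Two of the baselines above, together with the NSS score, can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the benchmark's own evaluation code. It assumes the common definition of NSS as the mean of the z-scored saliency map at fixated pixels; the Gaussian width `sigma_frac`, the function names, and the toy fixation coordinates are all assumptions made for the example.

```python
import numpy as np

def center_baseline(height, width, sigma_frac=0.25):
    """Symmetric Gaussian stretched to the image aspect ratio."""
    ys = (np.arange(height) - (height - 1) / 2) / (sigma_frac * height)
    xs = (np.arange(width) - (width - 1) / 2) / (sigma_frac * width)
    return np.exp(-0.5 * (ys[:, None] ** 2 + xs[None, :] ** 2))

def chance_baseline(height, width, rng):
    """A random uniform value at every pixel."""
    return rng.random((height, width))

def nss(saliency, fixations):
    """Mean z-scored saliency at fixated (row, col) locations."""
    z = (saliency - saliency.mean()) / saliency.std()
    rows, cols = zip(*fixations)
    return float(z[list(rows), list(cols)].mean())

# Toy example: three made-up fixations near the center of a 60 x 80 image.
height, width = 60, 80
toy_fixations = [(30, 40), (28, 42), (31, 38)]
rng = np.random.default_rng(0)

center_score = nss(center_baseline(height, width), toy_fixations)
# Chance is averaged over 5 random maps, as described in the list above.
chance_score = float(np.mean(
    [nss(chance_baseline(height, width, rng), toy_fixations) for _ in range(5)]
))
```

Because the toy fixations cluster near the image center, the center baseline scores well above chance here, which mirrors why the center prior is a non-trivial baseline on real fixation data.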

2.6.1 Model comparison over the MIT300 dataset

Figure 4 shows the results over the MIT300 dataset. According to the AUC-Judd and NSS scores, the top 5 models are all NN-based. Using the AUC-Judd measure, DeepGaze II and EML-Net models hold the top 2 spots with a score of 0.88. The “infinite humans” baseline obtains 0.92 using this score. While models perform close to each other according to the AUC-Judd score, switching to NSS widens the differences. EML-Net has the highest score with an NSS of 2.47. The “infinite humans” baseline achieves an NSS of 3.29. The second and third ranks here belong to the CEDNS and DPNSal models. BMS is the best-performing non NN-based model (AUC-Judd = 0.83 and NSS = 1.41) and ranks better than the deep eDN model. The majority of models significantly outperform all the baselines, except the “infinite humans” baseline. Among baselines, one human and center predict fixations better than permutation and chance.

Comparing the best results before and during the deep learning era shows about 43% improvement in terms of NSS (EML-Net vs. BMS) and about 5.7% improvement in terms of AUC-Judd (DeepGaze II vs. BMS). At the same time, the gap between the best model and the “infinite humans” baseline shrinks from 57% to 25% using NSS, and from 9.8% to 4.4% in terms of AUC-Judd. Using the NSS score, about 73% of the top 30 models are NN-based (about 67% using AUC-Judd).
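The text does not state how these percentages are computed. As a worked check, the sketch below assumes each difference is expressed relative to the higher of the two scores, which reproduces the reported MIT300 figures to within rounding. The helper name `relative_gap` is hypothetical; the scores are the ones quoted above.

```python
def relative_gap(higher, lower):
    """Difference between two scores as a percentage of the higher score.

    Assumed convention: the text does not give the formula explicitly.
    """
    return 100.0 * (higher - lower) / higher

# MIT300 scores quoted in the text.
NSS = {"infinite_humans": 3.29, "best_deep": 2.47, "best_classic": 1.41}
AUC = {"infinite_humans": 0.92, "best_deep": 0.88, "best_classic": 0.83}

# Improvement from the best classic model (BMS) to the best deep model.
nss_improvement = relative_gap(NSS["best_deep"], NSS["best_classic"])
auc_improvement = relative_gap(AUC["best_deep"], AUC["best_classic"])

# Remaining gap to the "infinite humans" baseline, before and after deep models.
nss_gap_classic = relative_gap(NSS["infinite_humans"], NSS["best_classic"])
nss_gap_deep = relative_gap(NSS["infinite_humans"], NSS["best_deep"])
auc_gap_classic = relative_gap(AUC["infinite_humans"], AUC["best_classic"])
auc_gap_deep = relative_gap(AUC["infinite_humans"], AUC["best_deep"])
```

Under this convention the NSS improvement is (2.47 - 1.41) / 2.47, and the remaining NSS gap for the best deep model is (3.29 - 2.47) / 3.29. Normalizing by the lower score instead would give noticeably larger improvement figures.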

Figure 4: Performance of 85 saliency models and 5 baselines (marked with dashed red lines) for fixation prediction over the MIT300 dataset. Models are sorted in terms of the AUC-Judd (top row) and NSS (bottom row) scores. Infinite h. stands for the “infinite humans” baseline.

2.6.2 Model comparison over the CAT2000 dataset

Figure 5 shows the results over the CAT2000 dataset. According to the AUC-Judd measure, SAM-ResNet, SAM-VGG, and CEDNS are tied at the top with a score of 0.88 (the “infinite humans” baseline scores 0.90). EML-Net, the winner on the MIT300 dataset, is ranked second with a score of 0.87 (tied with DeepFix). Notice that as of now (Oct. 2018), there is no submission from DeepGaze models on this dataset. Switching to NSS, CEDNS wins with a score of 2.39, slightly above SAM-ResNet, SAM-VGG, and EML-Net (all with an NSS of 2.38). Among classic models, BMS [76] and EYMOL [75] perform better than the others. Overall, models that do well over the MIT300 dataset also perform well here.

There is a 3.5% improvement from the best non-deep model to the best deep model in terms of AUC-Judd. Improvement in terms of the NSS score is 30%. The performance gap between the best non-deep model and the “infinite humans” upper-bound is 5.55% and shrinks to 2.22% using the best deep learning model (based on the AUC-Judd measure). The corresponding values are 41.5% and 16.2% using the NSS score.

Figure 5: Performance of 30 saliency models and 5 baselines (marked with dashed red lines) for fixation prediction over the CAT2000 dataset. Models are sorted in terms of the AUC-Judd (top row) and NSS (bottom row) scores. Infinite h. stands for the “infinite humans” baseline.

A qualitative comparison of model predictions over a sample image from the CAT2000 dataset is presented in Figure 6. It shows large differences in the appearance of the maps generated by different models. As can be seen, recent deep models generate saliency maps that are very similar to the ground truth fixation map on this image, and better than those of their non-deep predecessors.

Figure 6: Saliency prediction maps of 35 models (including 5 baselines) on a sample image from the CAT2000 dataset (See the MIT benchmark webpage). Red boxes illustrate baselines.

Overall, results show that new neural network based saliency models have created a large gap in performance relative to traditional saliency models from the pre deep learning era. However, they still fall short in performing at the level of humans on this task as is demonstrated in Figure 7. Further, they suffer from several shortcomings that need to be addressed, and will be discussed next.

Figure 7: Performance of five baselines and the best saliency model (can be different for each score) for predicting fixations over two eye movement datasets (MIT300 and CAT2000) using 8 scores. For EMD and KL, the lower the better (downward arrows). As can be seen, the best model often wins over the baselines except the “infinite humans” baseline, indicating a gap between the best models and humans in saliency prediction. The gap is wider for some scores (e.g. NSS, EMD) and narrower using some other scores (e.g. AUC-Borji, sAUC). This holds over both datasets.

2.7 Where do models fail?

As we saw above, deep learning models have shown impressive performance in saliency prediction. A deeper look, however, reveals that they continue to miss key elements in images (Figure 8).

In [15], we investigated the state-of-the-art image saliency models using a fine-grained analysis on image types, image regions, etc. In a behavioral study, conducted via Amazon Mechanical Turk, workers were asked to label (1 out of 15 choices) image regions that fall on the top 5% of the fixation heatmap. Analyzing the failures of models on those regions shows that the majority of the errors made by models are due to failures in accurately detecting parts of a person, faces, animals, text, objects of action, and gaze direction. These regions carry the greatest semantic importance in images (Figure 8.A & B). One way to ameliorate such errors is to train models on more instances of faces (e.g. partial, blurry, small, or occluded faces, non-frontal views), more instances of text (different sizes and types), and animals. Saliency models may also need to be trained on different tasks, to learn to detect gaze and action and leverage this information for saliency. Moreover, saliency models need to reason about the relative importance of image regions, such as focusing on the most important person in the room or the most informative sign on the road (Figure 8.C, D & E). Interestingly, when we added the missing regions to models, performance improved drastically [15]. This has been corroborated by several other studies that showed augmented deep saliency models perform better than original models (e.g.  [60, 21, 22]).

Figure 8: While deep learning models have shown impressive performance for saliency prediction, a finer-grained analysis shows that they miss key elements in images. Some example stimuli for which models under- or overestimate fixation locations due to gaze direction (A), or locations of implied action or motion (B) are shown. Also, models fail to detect small faces or profile ones, and fail to assign correct relative importance to them (C). Cases for which models do not correctly assign relative importance to people (D), or text regions (E) in the scene are also shown. Please see text and [15] for further details.

Previous research has also identified cases where deep saliency models produce counterintuitive results relative to models based on feature contrast. For example, Rahman and Bruce [59] and Huang et al. [31] showed that, unlike the classic Itti et al. model, new saliency models fail to highlight odd items in psychophysical pop-out patterns (Figure 9).


Figure 9: Example stimuli where deep saliency models produce counterintuitive results relative to models based on feature contrast. Sometimes deep models neglect low-level image features (local contrast) and overweight the contribution of high-level features (e.g. faces or text). See [59] and [31] for more details.

2.8 Discussion and outlook

Saliency prediction performance has improved dramatically in the last few years, in large part due to deep supervised learning and large scale mouse click datasets. We also have a much better understanding of the challenges pertaining to model evaluation than before. The new NN-based models are trained end-to-end, combining feature extraction, feature integration, and saliency value prediction, and have created a large gap in performance relative to traditional saliency models. The success of these saliency prediction models suggests that the high-level image features encoded by deep networks (e.g. sensitivity to faces, objects and text), as well as the ability of CNNs to capture global context, are extremely useful for predicting fixation locations. Despite the immense recent progress, however, saliency prediction is far from solved and there is still considerable room for improvement. Some areas in which improvement can be made are discussed below.

  • Saliency models based on deep learning are good face and text detectors, much better than their non-deep predecessors. How well these models perform at face and text detection, compared with state-of-the-art face and text detectors, remains to be determined.

  • Even the best saliency models tend to place a disproportionate amount of importance on face regions, humans, and text, even when these are not necessarily the most semantically interesting parts of the image. Saliency predictors will need to reason about the relative importance of image regions, such as focusing on the most important person in the room or the most informative sign on the road (which is image- and context-dependent). Similarly, when an image contains several text regions, some high-level understanding of meaning is necessary to prioritize among them (e.g. what is the warning sign about?).

  • There has been a lot of progress in understanding the representations learned by CNNs for scene and object recognition in recent years (e.g. [78]). Our understanding of what is learned by deep saliency models, however, is limited. The main questions here are how saliency computation emerges inside deep saliency architectures, and how the patterns learned in different network layers for saliency prediction differ from those learned for object recognition.

  • Fair comparison of saliency models remains an unsolved research problem. Active research is ongoing to understand the pros and cons of the saliency measures (e.g. [47]). Many current saliency methods compete closely with one another at the top of the existing benchmarks, and their performance varies within a narrow band (see Figures 4 and 5). Also, as we saw in Section 2.6, some evaluation measures have begun to saturate, and the model rankings they produce are often inconsistent with each other. Thus, as the number of saliency models grows and score differences between models shrink, evaluation measures should be adjusted to a) elucidate differences between models and fixations (e.g. by taking into account the relative importance of spatial and temporal regions), and b) mitigate sensitivity to map smoothing and center-bias. Complementary to measures, finer-grained stimuli can be used to further differentiate among models: image regions in a collection or in a panel (as in Figure 10, to measure how well models predict the relative importance of image content), psychophysical patterns (pop-out search arrays and natural oddball scenes), and transformed images.

  • Collecting large scale data for constraining, training and evaluating attention models is crucial to progress. Bruce et al. [12] pointed out that the manner in which data is selected, ground truth is created, and prediction error is measured (e.g. the loss function) is critical to model performance. Large scale click datasets [31, 42] have been highly useful for training deep saliency models and achieving high accuracy. However, clicks occasionally disagree with fixations. Thus, separating good clicks from noisy ones can improve model training. Moreover, studying discrepancies between mouse movements and eye movements, collecting new types of image and video data, and refining available datasets can be rewarding.
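To make the center-bias concern raised in the evaluation bullet above concrete, the sketch below computes the Normalized Scanpath Saliency (NSS) in its commonly used form (the z-scored saliency map averaged at fixated pixels) and builds an image-independent Gaussian center prior that can serve as a baseline. The helper names and the `sigma_frac` parameter are assumptions made for this illustration; they are not taken from the benchmarks discussed in the text.

```python
import numpy as np

def nss(saliency, fixations):
    """NSS: mean of the z-scored saliency map at fixated pixels.
    `fixations` is a boolean (or 0/1) map with the same shape as `saliency`."""
    s = np.asarray(saliency, dtype=float)
    f = np.asarray(fixations, dtype=bool)
    z = (s - s.mean()) / (s.std() + 1e-12)  # small epsilon guards constant maps
    return float(z[f].mean())

def center_prior(shape, sigma_frac=0.25):
    """Image-independent Gaussian centered on the image.
    `sigma_frac` (std as a fraction of height/width) is an assumed setting."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sy, sx = sigma_frac * h, sigma_frac * w
    return np.exp(-(((ys - cy) / sy) ** 2 + ((xs - cx) / sx) ** 2) / 2.0)

# Toy example: one fixation at the image center, one in a corner.
prior = center_prior((21, 21))
central = np.zeros((21, 21), dtype=bool)
central[10, 10] = True
corner = np.zeros((21, 21), dtype=bool)
corner[0, 0] = True
score_central = nss(prior, central)  # positive: the prior peaks at the center
score_corner = nss(prior, corner)    # negative: the prior is lowest in corners
```

If a model's score is only marginally higher than that of such a prior on the same fixations, much of its apparent accuracy may come from center bias rather than from image content.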

Figure 10: A finer-grained test proposed in [15] for determining how saliency models prioritize different sub-images in a panel, relative to each other. (left) A panel image from the MIT300 dataset. (right) The saliency map predictions given the panel as an input image. The maximum response of each saliency model on each subimage is visualized (as an importance matrix). AUC and NSS scores are also computed for these saliency maps. The panel ranking is a measure of the correlation of values in the ground truth and predicted importance matrices.
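The panel test described in the caption of Figure 10 can be sketched as follows: take the maximum saliency response inside each sub-image to form an importance matrix, then correlate the predicted and ground-truth matrices. This sketch assumes a uniform `rows` x `cols` grid of sub-images and uses Pearson correlation; the caption specifies neither, so both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def importance_matrix(sal_map, rows, cols):
    """Maximum response in each cell of a uniform rows x cols grid
    (the uniform grid is an assumption about the panel layout)."""
    s = np.asarray(sal_map, dtype=float)
    h, w = s.shape
    row_edges = np.linspace(0, h, rows + 1).astype(int)
    col_edges = np.linspace(0, w, cols + 1).astype(int)
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            cell = s[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            out[i, j] = cell.max()
    return out

def panel_score(pred_map, gt_map, rows, cols):
    """Pearson correlation between predicted and ground-truth importance
    matrices (the choice of Pearson is an assumption)."""
    p = importance_matrix(pred_map, rows, cols).ravel()
    g = importance_matrix(gt_map, rows, cols).ravel()
    return float(np.corrcoef(p, g)[0, 1])

# Toy panel: 2 x 3 sub-images of 2 x 2 pixels each, with importance 1..6.
gt = np.kron(np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]), np.ones((2, 2)))
matrix = importance_matrix(gt, 2, 3)
same = panel_score(gt, gt, 2, 3)       # identical ordering of sub-images
reversed_ = panel_score(-gt, gt, 2, 3)  # opposite ordering of sub-images
```

A rank correlation could be substituted in `panel_score` if only the ordering of sub-images, rather than the magnitude of the responses, is of interest.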

In this review, we examined the recent progress in saliency prediction and proposed several avenues for future research. In spite of tremendous efforts and huge progress, there continues to be room for improvement in terms of finer-grained analysis of deep saliency models, evaluation measures, datasets, annotation methods, cognitive studies, and new applications.

3 Cross-References

Hierarchical Models of the Visual System
Saliency in the Visual Cortex
Working Memory, Models of
Attentional Top-Down Modulation, Models of


  • [1] C. Bak, A. Kocak, E. Erdem, and A. Erdem. Spatio-temporal saliency networks for dynamic saliency prediction. IEEE Transactions on Multimedia, 20(7):1688–1698, 2018.
  • [2] D. Ballard, M. Hayhoe, and J. Pelz. Memory representations in natural tasks. Journal of Cognitive Neuroscience., 7(1):66–80, 1995.
  • [3] L. Bazzani, H. Larochelle, and L. Torresani. Recurrent mixture density network for spatiotemporal visual attention. ICLR, 2016.
  • [4] A. Borji. Boosting bottom-up and top-down visual features for saliency estimation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 438–445. IEEE, 2012.
  • [5] A. Borji. Saliency prediction in the deep learning era: An empirical investigation. arXiv preprint arXiv:1804.09626, 2018.
  • [6] A. Borji and L. Itti. State-of-the-art in visual attention modeling. IEEE transactions on pattern analysis and machine intelligence, 35(1):185–207, 2013.
  • [7] A. Borji and L. Itti. Cat2000: A large scale fixation dataset for boosting saliency research. arXiv:1505.03581, pages 1–4, May 2015.
  • [8] A. Borji, D. N. Sihite, and L. Itti. Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, 22(1):55–69, 2013.
  • [9] A. Borji, D. N. Sihite, and L. Itti. What stands out in a scene? a study of human explicit saliency judgment. Vision Research, 91:62–77, Aug 2013.
  • [10] A. Borji, D. N. Sihite, and L. Itti. What/where to look next? modeling top-down visual attention in complex interactive environments. IEEE Transactions on Systems, Man, and Cybernetics, Part A - Systems and Humans, 44(5):523–538, 2014.
  • [11] N. Bruce and J. Tsotsos. Saliency based on information maximization. In Advances in neural information processing systems, pages 155–162, 2005.
  • [12] N. D. Bruce, C. Catton, and S. Janjic. A deeper look at saliency: feature contrast, semantics, and beyond. In IEEE Conference on Computer Vision and Pattern Recognition, pages 516–524, 2016.
  • [13] Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva, and A. Torralba. MIT saliency benchmark.
  • [14] Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand. What do different evaluation metrics tell us about saliency models? IEEE transactions on pattern analysis and machine intelligence, 2018.
  • [15] Z. Bylinskii, A. Recasens, A. Borji, A. Oliva, A. Torralba, and F. Durand. Where should saliency models look next? In ECCV (5), pages 809–824, 2016.
  • [16] M. Cerf, E. P. Frady, and C. Koch. Faces and text attract gaze independent of the task: Experimental data and computer model. Journal of Vision, 9(12), November 18 2009.
  • [17] S. Chaabouni, J. Benois-Pineau, and C. B. Amar. Transfer learning with deep networks for saliency prediction in natural video. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 1604–1608. IEEE, 2016.
  • [18] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. Predicting human eye fixations via an lstm-based saliency attentive model. arXiv preprint arXiv:1611.09571, 2016.
  • [19] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
  • [20] A. Garcia-Diaz, X. R. Fdez-Vidal, X. M. Pardo, and R. Dosil. Saliency from hierarchical adaptation through decorrelation and variance normalization. Image and Vision Computing, 30(1):51–64, 2012.
  • [21] S. Gorji and J. J. Clark. Attentional push: A deep convolutional network for augmenting image salience with shared attention modeling in social scenes. In Computer Vision and Pattern Recognition (CVPR), volume 2, page 5. IEEE, 2017.
  • [22] S. Gorji and J. J. Clark. Going from image to video saliency: Augmenting image salience with dynamic attentional push. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7501–7511, 2018.
  • [23] J. Gottlieb and P. Balan. Attention as a decision in information space. Trends in cognitive sciences, 14(6):240–248, 2010.
  • [24] C. Guo, Q. Ma, and L. Zhang. Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform. In CVPR, pages 1–8. IEEE, 2008.
  • [25] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In NIPS, pages 545–552, 2006.
  • [26] M. M. Hayhoe and D. H. Ballard. Eye movements in natural behavior. Trends in cognitive sciences, 9(4):188–194, 2005.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [28] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput, 9(8):1735–80, 1997.
  • [29] S. Hossein Khatoonabadi, N. Vasconcelos, I. V. Bajic, and Y. Shan. How many bits does it take for a stimulus to be salient? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2015.
  • [30] X. Hou and L. Zhang. Saliency detection: A spectral residual approach. In CVPR, pages 1–8. IEEE, 2007.
  • [31] X. Huang, C. Shen, X. Boix, and Q. Zhao. Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In IEEE International Conference on Computer Vision, pages 262–270, 2015.
  • [32] L. Itti, C. Gold, and C. Koch. Visual attention and target detection in cluttered natural scenes. Optical Engineering, 40(9):1784–1793, Sep 2001.
  • [33] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(10-12):1489–1506, May 2000.
  • [34] L. Itti and C. Koch. Computational modelling of visual attention. Nature Reviews Neuroscience, 2(3):194–203, Mar 2001.
  • [35] L. Itti and C. Koch. Feature combination strategies for saliency-based visual attention systems. Journal of Electronic Imaging, 10(1):161–169, Jan 2001.
  • [36] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, Nov 1998.
  • [37] S. Jia. Eml-net: An expandable multi-layer network for saliency prediction. arXiv preprint arXiv:1805.01047, 2018.
  • [38] L. Jiang, M. Xu, and Z. Wang. Predicting video saliency with object-to-motion cnn and two-layer convolutional lstm. arXiv preprint arXiv:1709.06316, 2017.
  • [39] M. Jiang, S. Huang, J. Duan, and Q. Zhao. Salicon: Saliency in context. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1072–1080. IEEE, 2015.
  • [40] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In Computer Vision, 2009 IEEE 12th international conference on, pages 2106–2113. IEEE, 2009.
  • [41] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In International Conference on Computer Vision (ICCV), 2009.
  • [42] N. W. Kim, Z. Bylinskii, M. A. Borkin, K. Z. Gajos, A. Oliva, F. Durand, and H. Pfister. Bubbleview: an alternative to eye-tracking for crowdsourcing image importance. arXiv preprint arXiv:1702.05150, 2017.
  • [43] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. Hum Neurobiol, 4(4):219–27, 1985.
  • [44] K. Koehler, F. Guo, S. Zhang, and M. P. Eckstein. What do saliency models predict? Journal of vision, 14(3):14–14, 2014.
  • [45] S. S. Kruthiventi, K. Ayush, and R. V. Babu. Deepfix: A fully convolutional neural network for predicting human eye fixations. IEEE Transactions on Image Processing, 2017.
  • [46] M. Kümmerer, L. Theis, and M. Bethge. Deep gaze i: Boosting saliency prediction with feature maps trained on imagenet. arXiv preprint arXiv:1411.1045, 2014.
  • [47] M. Kummerer, T. S. Wallis, and M. Bethge. Saliency benchmarking made easy: Separating models, maps and metrics. In Proceedings of the European Conference on Computer Vision (ECCV), pages 770–787, 2018.
  • [48] M. F. Land and M. M. Hayhoe. In what ways do eye movements contribute to everyday activities? Vision research, 41(25):3559–3565, 2001.
  • [49] M. F. Land and D. N. Lee. Where do we look when we steer. Nature, 1994.
  • [50] V. Leboran, A. Garcia-Diaz, X. R. Fdez-Vidal, and X. M. Pardo. Dynamic whitening saliency. IEEE Transactions on pattern analysis and machine intelligence, 39(5):893–907, 2017.
  • [51] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, pages 2278–2324, 1998.
  • [52] N. Liu, J. Han, D. Zhang, S. Wen, and T. Liu. Predicting eye fixations using convolutional neural networks. In CVPR, pages 362–370, 2015.
  • [53] F. Miau, C. Papageorgiou, and L. Itti. Neuromorphic algorithms for computer vision and attention. In B. Bosacchi, D. B. Fogel, and J. C. Bezdek, editors, Proc. SPIE 46 Annual International Symposium on Optical Science and Technology, volume 4479, pages 12–23, Bellingham, WA, Nov 2001. SPIE Press.
  • [54] V. Navalpakkam and L. Itti. Modeling the influence of task on attention. Vision Research, 45(2):205–231, Jan 2005.
  • [55] D. Parkhurst, K. Law, and E. Niebur. Modeling the role of salience in the allocation of overt visual attention. Vision research, 42(1):107–123, 2002.
  • [56] D. Parkhurst, K. Law, and E. Niebur. Modeling the role of salience in the allocation of overt visual attention. Vision Res, 42(1):107–123, Jan 2002.
  • [57] D. Parks, A. Borji, and L. Itti. Augmented saliency model using automatic 3d head pose detection and learned gaze following in natural scenes. Vision Research, 116B:113–126, 2015.
  • [58] R. J. Peters, A. Iyer, L. Itti, and C. Koch. Components of bottom-up gaze allocation in natural images. Vision Research, 45(8):2397–2416, Aug 2005.
  • [59] S. Rahman and N. Bruce. Saliency, scale and information: Towards a unifying theory. In Advances in Neural Information Processing Systems, pages 2188–2196, 2015.
  • [60] A. Recasens, A. Khosla, C. Vondrick, and A. Torralba. Where are they looking? In Advances in Neural Information Processing Systems (NIPS), 2015.
  • [61] A. L. Rothenstein and J. K. Tsotsos. Attention links sensing to recognition. Image and Vision Computing, 26(1):114–126, 2008.
  • [62] D. Rudoy, D. B. Goldman, E. Shechtman, and L. Zelnik-Manor. Learning video saliency from human gaze using candidate selection. In CVPR, pages 1147–1154, 2013.
  • [63] M. Sun, Z. Zhou, Q. Hu, Z. Wang, and J. Jiang. Sg-fcn: A motion and memory-based deep learning model for video saliency detection. IEEE Transactions on Cybernetics, 2018.
  • [64] B. W. Tatler. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of vision, 7(14):4–4, 2007.
  • [65] B. W. Tatler, M. M. Hayhoe, M. F. Land, and D. H. Ballard. Eye guidance in natural vision: Reinterpreting salience. Journal of vision, 11(5):5–5, 2011.
  • [66] H. R. Tavakoli, F. Ahmed, A. Borji, and J. Laaksonen. Saliency revisited: Analysis of mouse movements versus fixations. CVPR 2017, 2017.
  • [67] A. M. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive psychology, 12(1):97–136, 1980.
  • [68] E. Vig, M. Dorr, and D. Cox. Large-scale optimization of hierarchical features for saliency prediction in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2798–2805, 2014.
  • [69] W. Wang, J. Shen, F. Guo, M.-M. Cheng, and A. Borji. Revisiting video saliency: A large-scale benchmark and a new model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4894–4903, 2018.
  • [70] S. Winkler and R. Subramanian. Overview of eye tracking datasets. In Quality of Multimedia Experience (QoMEX), 2013 Fifth International Workshop on, pages 212–217. IEEE, 2013.
  • [71] J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, and Q. Zhao. Predicting human gaze beyond pixels. Journal of vision, 14(1):28–28, 2014.
  • [72] M. Xu, L. Jiang, X. Sun, Z. Ye, and Z. Wang. Learning to detect video saliency with hevc features. IEEE Transactions on Image Processing, 26(1):369–385, 2017.
  • [73] P. Xu, K. A. Ehinger, Y. Zhang, A. Finkelstein, S. R. Kulkarni, and J. Xiao. Turkergaze: Crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755, 2015.
  • [74] D. L. Yamins and J. J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature neuroscience, 19(3):356–365, 2016.
  • [75] D. Zanca and M. Gori. Variational laws of visual attention for dynamic scenes. In Advances in Neural Information Processing Systems, pages 3823–3832, 2017.
  • [76] J. Zhang and S. Sclaroff. Saliency detection: A boolean map approach. In IEEE International Conference on Computer Vision, pages 153–160, 2013.
  • [77] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell. Sun: A bayesian framework for saliency using natural statistics. Journal of vision, 8(7):32–32, 2008.
  • [78] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. arXiv, 2014.