Contextual Encoder-Decoder Network for Visual Saliency Prediction

02/18/2019 ∙ by Alexander Kroner, et al.

Predicting salient regions in natural images requires the detection of objects that are present in a scene. To develop robust representations for this challenging task, high-level visual features at multiple spatial scales must be extracted and augmented with contextual information. However, existing models aimed at explaining human fixation maps do not incorporate such a mechanism explicitly. Here we propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task. The architecture forms an encoder-decoder structure and includes a module with multiple convolutional layers at different dilation rates to capture multi-scale features in parallel. Moreover, we combine the resulting representations with global scene information for accurately predicting visual saliency. Our model achieves competitive results on two public saliency benchmarks and we demonstrate the effectiveness of the suggested approach on selected examples. The network is based on a lightweight image classification backbone and hence presents a suitable choice for applications with limited computational resources to estimate human fixations across complex natural scenes.




1 Introduction

Humans demonstrate a remarkable ability to obtain relevant information from complex visual scenes Jonides et al. (1982); Irwin (1991). Overt attention is the mechanism that governs the processing of stimuli by directing gaze towards a spatial location within the visual field Posner (1980). This sequential selection ensures that the eyes sample prioritized aspects from all available information to reduce the cost of cortical computation Lennie (2003). In addition, only a small central region of the retina, known as the fovea, transforms incoming light into neural responses with high spatial resolution, whereas acuity decreases rapidly towards the periphery Cowey and Rolls (1974); Berkley et al. (1975). Given the limited number of photoreceptors in the eye, this arrangement allows visual signals from the environment to be processed optimally Cheung et al. (2016). The function of fixations is thus to resolve the trade-off between coverage and sampling resolution of the whole visual field Gegenfurtner (2016).

Figure 1: A visualization of four natural images with the corresponding empirical fixation maps, our model predictions, and estimated maps based on the work by Itti et al. (1998). The network proposed in this study was not trained on the stimuli shown here and thus exhibits its generalization ability to unseen instances. All image examples demonstrate a qualitative agreement of our model with the ground truth data, assigning high saliency to regions that contain semantic information, such as a door (a), flower (b), face (c), or text (d). On the contrary, the approach by Itti et al. (1998) detected low-level feature contrasts and wrongly predicted high values at object boundaries rather than their center.

The spatial allocation of attention when viewing natural images is commonly represented in the form of topographic saliency maps that depict which parts of a scene attract fixations reliably. Identifying the underlying properties of these regions would allow us to predict human fixation patterns and gain a deeper understanding of the processes that lead to the observed behavior. In computer vision, this challenging problem has originally been approached using models rooted in Feature Integration Theory Treisman and Gelade (1980). The theory suggests that early visual features must first be registered in parallel before serial shifts of overt attention combine them into unitary object-based representations. This two-stage account of visual processing has emphasized the role of stimulus properties for explaining human gaze. In consequence, the development of feature-driven models has been considered sufficient to enable the prediction of fixation patterns under task-free viewing conditions. Koch and Ullman (1985) have introduced the notion of a central saliency map which integrates low-level information and serves as the basis for eye movements. This has resulted in a first model implementation by Itti et al. (1998) that influenced later work on biologically-inspired architectures.

With the advent of deep neural network solutions for visual tasks such as image classification Krizhevsky et al. (2012), saliency modelling has also undergone a paradigm shift from manual feature engineering towards automatic representation learning. In this work, we leveraged the capability of convolutional neural networks (CNNs) to extract relevant features from raw images and decode them towards a distribution of saliency across arbitrary scenes. Compared to the early model by Itti et al. (1998), this approach allows predictions to be based on semantic information instead of low-level feature contrasts (see Figure 1). Furthermore, it is likely that complex representations at multiple spatial scales are necessary for accurate predictions of human fixation patterns. We therefore incorporated a contextual module that samples multi-scale information and augments it with global scene features. The contribution of the contextual module to the overall performance was assessed and final results were compared to previous work on two public saliency benchmarks. We achieved predictive accuracy on unseen test instances at the level of current state-of-the-art approaches, while utilizing a comparatively inexpensive network backbone.

2 Related Work

Early approaches towards computational models of visual attention were defined in terms of different theoretical frameworks, including Bayesian Zhang et al. (2008) and graph-based formulations Harel et al. (2006). A mechanism inspired more by biological than mathematical principles was first implemented and described in the seminal work by Itti et al. (1998). Their model captures center-surround differences at multiple spatial scales with respect to three basic feature channels: color, intensity, and orientation. After normalization of activity levels, the output is fed into a common saliency map depicting local conspicuity in static scenes. This standard cognitive architecture has since been augmented with additional feature channels that capture semantic image content, such as faces and text Cerf et al. (2009).

With the large-scale acquisition of eye tracking measurements under natural viewing conditions, data-driven machine learning techniques became more practicable. Judd et al. (2009) introduced a model based on support vector machines to estimate fixation densities from a set of low-, mid-, and high-level visual features. While this approach still relied on a hypothesis specifying which image properties would successfully contribute to the prediction of saliency, it marked the beginning of a progression from manual engineering to automatic learning of features. This development has ultimately led to applying deep neural networks with emergent representations for the estimation of human fixation patterns.

Vig et al. (2014) were the first to train an ensemble of shallow CNNs to derive saliency maps from natural images in an end-to-end fashion, but failed to capture object information due to limited network depth.

Later attempts addressed that shortcoming by taking advantage of classification architectures pre-trained on the ImageNet database Deng et al. (2009). This choice was motivated by the finding that features extracted from CNNs generalize well to other visual tasks Donahue et al. (2014). Consequently, DeepGaze I Kümmerer et al. (2014) and II Kümmerer et al. (2016) employed a pre-trained classification model to read out salient image locations from a small subset of encoding layers. This is similar to the network by Cornia et al. (2016) which utilizes the output at three stages of the hierarchy. Related approaches also focused on the potential benefits of incorporating activation from both coarse and fine image resolutions Huang et al. (2015), and recurrent connections to capture long-range spatial dependencies in convolutional feature maps Cornia et al. (2018); Liu and Han (2018). Our model explicitly combines semantic representations at multiple spatial scales to include contextual information in the predictive process. For a more complete account of existing saliency architectures, we refer the interested reader to a comprehensive review by Borji (2018).

3 Methods

We propose a new CNN architecture with modules adapted from the semantic segmentation literature to predict fixation density maps of the same image resolution as the input. Our approach is based on a large body of research regarding saliency models that leverage object-specific features and functionally replicate human behavior under free-viewing conditions. In the following sections, we describe our contributions to this challenging task.

3.1 Architecture

Image-to-image learning problems require the preservation of spatial features throughout the whole processing stream. As a consequence, our network does not include any fully-connected layers and reduces the number of downsampling operations inherent to classification models. We adapted the popular VGG16 architecture Simonyan and Zisserman (2014) as an image encoder by reusing the pre-trained convolutional layers to extract increasingly complex features along its hierarchy. Striding in the last two pooling layers was removed, which yields spatial representations at 1/8 of their original input size. All subsequent convolutional encoding layers were then dilated at a rate of 2 by expanding their kernel, and thus increased the receptive field to compensate for the higher resolution Yu and Koltun (2015). This modification still allowed us to initialize the model with pre-trained weights since the number of trainable parameters remained unchanged. Prior work has shown the effectiveness of this approach in the context of saliency prediction problems Cornia et al. (2018); Liu and Han (2018).
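The effect of removing strides and dilating kernels can be checked with simple arithmetic. The following is a minimal sketch, assuming the standard VGG16 layout of five stride-2 max-pooling layers and 3×3 kernels; the helper names `output_stride` and `effective_kernel` are introduced here for illustration only.

```python
def output_stride(pool_strides):
    """Total downsampling factor: the product of all pooling strides."""
    stride = 1
    for s in pool_strides:
        stride *= s
    return stride


def effective_kernel(kernel_size, dilation):
    """Extent (in pixels, along one axis) spanned by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Standard VGG16: five max-pooling layers, each with stride 2.
standard = [2, 2, 2, 2, 2]
# Modified encoder: striding removed in the last two pooling layers.
modified = [2, 2, 2, 1, 1]

print(output_stride(standard))   # 32
print(output_stride(modified))   # 8
# A 3x3 kernel dilated at rate 2 spans 5 pixels but keeps its 9 weights.
print(effective_kernel(3, 2))    # 5
```

Because dilation only spaces out the existing kernel taps, the weight tensors keep their shape, which is why the pre-trained parameters can still be loaded.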

Figure 2: An illustration of the modules that constitute our encoder-decoder architecture. The VGG16 backbone was modified to account for the requirements of dense prediction tasks by omitting feature downsampling in the last two max-pooling layers. Multi-level activations were then forwarded to the ASPP module, which captured information at different spatial scales in parallel. Finally, the input image dimensions were restored via the decoder network. Subscripts beneath convolutional layers denote the corresponding number of feature maps.

For related visual tasks such as semantic segmentation, information distributed over convolutional layers at different levels of the hierarchy can aid the preservation of fine spatial details Hariharan et al. (2015); Long et al. (2015). The prediction of fixation density maps does not require accurate class boundaries but still benefits from combined mid- to high-level feature responses Cornia et al. (2016); Kümmerer et al. (2014, 2016). Hence, we adapted the multi-level design proposed by Cornia et al. (2016) and concatenated the output from layers 10, 14, and 18 into a common tensor with 1280 activation maps.

This representation constitutes the input to an Atrous Spatial Pyramid Pooling (ASPP) module Chen et al. (2018). It utilizes several convolutional layers with different dilation factors in parallel to capture multi-scale image information. Additionally, we incorporated scene content via global average pooling over the final encoder output, as motivated by the study of Torralba et al. (2006) who stated that contextual information plays an important role for the allocation of attention. Our implementation of the ASPP architecture thus closely follows the modifications proposed by Chen et al. (2017). These authors augmented multi-scale information with global context and demonstrated performance improvements on semantic segmentation tasks.

In this work, we laid out three convolutional layers with kernel sizes of 3×3 and dilation rates of 4, 8, and 12 in parallel, together with a 1×1 convolutional layer that could not learn new spatial dependencies but nonlinearly combined existing feature maps. Image-level context as the result of global average pooling was then brought to the same resolution as all other representations via bilinear upsampling, followed by another point-wise convolutional operation. Each of the five branches in the module contains 256 filters, which resulted in an aggregated tensor of 1280 feature maps. Finally, the combined output was forwarded to a 1×1 convolutional layer with 256 channels that contained the resulting multi-scale responses.
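To make the parallel branches concrete, the sketch below lists which pixel offsets each branch samples along one axis and how the branch outputs add up. It assumes 3×3 kernels for the dilated branches, as is standard for ASPP, and the branch names are hypothetical labels rather than identifiers from the authors' code.

```python
def tap_offsets(kernel_size, dilation):
    """Offsets (in pixels) sampled by a dilated kernel along one axis."""
    half = kernel_size // 2
    return [dilation * k for k in range(-half, half + 1)]


BRANCH_FILTERS = 256  # filters per branch, as stated in the text

branches = {
    "conv1x1": tap_offsets(1, 1),
    "conv3x3_rate4": tap_offsets(3, 4),
    "conv3x3_rate8": tap_offsets(3, 8),
    "conv3x3_rate12": tap_offsets(3, 12),
    # Global average pooling, bilinear upsampling, then a 1x1 convolution;
    # it has no local sampling pattern, hence no offsets.
    "image_level": None,
}

for name, offsets in branches.items():
    print(name, offsets)

# Concatenating all branch outputs along the channel axis.
aggregated_channels = BRANCH_FILTERS * len(branches)
print(aggregated_channels)  # 1280
```

Larger dilation rates spread the same nine weights over a wider neighborhood, so each branch responds to structure at a different spatial scale without adding parameters.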

To restore the original image resolution, extracted features were processed by a series of convolutional and upsampling layers. Previous work on saliency prediction has commonly utilized bilinear interpolation for that task Cornia et al. (2018); Liu and Han (2018), but we argue that a carefully chosen decoder architecture results in better approximations. Here we employed three upsampling blocks consisting of a bilinear scaling operation, which doubled the number of rows and columns, and a subsequent convolutional layer with kernel size 3×3. This setup has previously been shown to prevent checkerboard artifacts in the upsampled image space in contrast to deconvolution Odena et al. (2016). Besides an increase of resolution throughout the decoder, the number of channels was halved in each block to yield 32 feature maps. Our last network layer transformed activations into a continuous saliency distribution by applying a final 3×3 convolution. The outputs of all but the last linear layer were modified via rectified linear units. Figure 2 visualizes the overall architecture design as described in this section.
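The decoder described above can be traced as a sequence of tensor shapes. This is a bookkeeping sketch only: the function name `decoder_shapes` and the 240×320 input size are illustrative assumptions, and no actual interpolation or convolution is performed.

```python
def decoder_shapes(height, width, channels=256, blocks=3):
    """Trace (height, width, channels) through the decoder stages."""
    shapes = [(height, width, channels)]
    for _ in range(blocks):
        height, width = 2 * height, 2 * width  # bilinear upsampling, factor 2
        channels //= 2                         # following convolution halves channels
        shapes.append((height, width, channels))
    shapes.append((height, width, 1))          # final convolution: one saliency map
    return shapes


# Hypothetical input of 240 x 320 pixels, encoded at 1/8 resolution.
for shape in decoder_shapes(240 // 8, 320 // 8):
    print(shape)
```

Three doublings undo the factor-of-8 downsampling of the encoder, so the predicted map has the same resolution as the input image.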

3.2 Training

Weight values from the ASPP module and decoder were initialized according to the Xavier method by Glorot and Bengio (2010). It specifies parameter values as samples drawn from a uniform distribution with zero mean and a variance depending on the total number of incoming and outgoing connections. Such initialization schemes are demonstrably important for training deep neural networks successfully from scratch Sutskever et al. (2013). The encoding layers were based on the VGG16 architecture pre-trained on both ImageNet Deng et al. (2009) and Places2 Zhou et al. (2017) data towards object and scene classification respectively.
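As a minimal sketch of the Xavier scheme, weights can be drawn from U(−a, a) with a = sqrt(6 / (fan_in + fan_out)), which gives zero mean and a variance of 2 / (fan_in + fan_out). The layer size and the seed below are arbitrary choices for illustration, not values from the paper.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Draw a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def conv_fans(kernel_size, in_channels, out_channels):
    """Fan-in and fan-out of a square convolution kernel."""
    receptive = kernel_size * kernel_size
    return receptive * in_channels, receptive * out_channels


rng = random.Random(0)  # fixed seed, only for reproducibility of the sketch
fan_in, fan_out = conv_fans(kernel_size=1, in_channels=64, out_channels=32)
weights = xavier_uniform(fan_in, fan_out, rng)

flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
# Empirical mean and variance next to the theoretical variance.
print(mean, variance, 2.0 / (fan_in + fan_out))
```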

We normalized the model output such that all values are non-negative with unit sum. The estimation of saliency maps can hence be regarded as a probability distribution prediction task as formulated by Jetley et al. (2016). To determine the difference between an estimated and a target distribution, the Kullback-Leibler (KL) divergence is an appropriate measure rooted in information theory to quantify the statistical distance $D_{KL}$. This can be defined as follows:

$$D_{KL}(P \parallel Q) = \sum_i Q_i \ln\left(\epsilon + \frac{Q_i}{\epsilon + P_i}\right) \qquad (1)$$

Here, $Q$ represents the target distribution, $P$ its approximation, $i$ each pixel index, and $\epsilon$ a regularization constant. Equation (1) served as the loss function, which was gradually minimized via the Adam optimization algorithm Kingma and Ba (2014). We defined an upper bound on the learning rate and modified the weights in an online fashion due to a general inefficiency of batch training according to Wilson and Martinez (2003). Based on this general setup, we trained our network for 10 epochs and used the best-performing checkpoint for inference.
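A regularized KL divergence between a target and a predicted saliency distribution can be sketched as follows. The maps are flattened to plain lists, the formula follows the form commonly used by saliency benchmarks, and the value of `eps` is an assumption since the text does not state the regularization constant.

```python
import math


def normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def kld_loss(target, prediction, eps=2.2204e-16):
    """Sum over pixels of q * ln(eps + q / (eps + p)); eps is an assumed value."""
    return sum(q * math.log(eps + q / (eps + p))
               for q, p in zip(target, prediction))


target = normalize([1.0, 2.0, 6.0, 1.0])      # toy fixation density
good = normalize([1.0, 2.0, 5.0, 2.0])        # close to the target
poor = normalize([6.0, 2.0, 1.0, 1.0])        # mass in the wrong place

print(kld_loss(target, target))  # approximately zero
print(kld_loss(target, good))
print(kld_loss(target, poor))
```

The loss grows when the prediction assigns little probability to pixels where the target has a lot, which is the source of the sensitivity to false negatives discussed in the metrics section.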

4 Experiments

The proposed encoder-decoder model was evaluated on three publicly available datasets that yielded qualitative and quantitative results. First, we provide a brief description of the images and empirical measurements utilized in this study. Second, the different metrics commonly used to assess the predictive performance of saliency models are summarized. Finally, we report the contribution of our architecture design choices and benchmark the overall results against baselines and related work in computer vision.

4.1 Datasets

A prerequisite for the successful application of deep learning techniques is a wealth of annotated data. Fortunately, the growing interest in developing and evaluating fixation models has led to the release of large-scale eye tracking datasets such as MIT1003 Judd et al. (2009) and CAT2000 Borji and Itti (2015). The costly acquisition of measurements, however, is a limiting factor for the number of stimuli. New data collection methodologies have emerged that instead leverage webcam-based eye movements Xu et al. (2015) or mouse movements Jiang et al. (2015) via crowdsourcing platforms. The latter approach resulted in the SALICON dataset, which consists of 10,000 training and 5,000 validation instances serving as a proxy for empirical gaze measurements. Due to its large size, we first trained our model on SALICON before fine-tuning the learned weights towards fixation predictions on either MIT1003 or CAT2000 with the same optimization parameters. This widely adopted procedure has been shown to improve the accuracy of eye movement estimations despite some disagreement between data originating from gaze and mouse tracking experiments Tavakoli et al. (2017).

The images presented during the acquisition of saliency maps in all three datasets are largely based on natural scenes. Stimuli of CAT2000 additionally fall into predefined categories such as Action, Fractal, Object, or Social. Together with the corresponding fixation patterns, they constituted the input and desired output to our network architecture. In detail, we rescaled and padded all images from the SALICON, MIT1003, and CAT2000 datasets to a fixed resolution per dataset, such that the original aspect ratios were preserved. For the latter two sets we defined 80% of the samples as training data and the remainder as validation examples. The correct saliency distributions on test set images are held out and predictions must hence be submitted online for evaluation.
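Rescaling with padding while preserving the aspect ratio can be sketched as below. The function name `fit_and_pad`, the symmetric placement of the padding, and the example sizes are assumptions for illustration; the text does not specify where the padding is placed.

```python
def fit_and_pad(height, width, target_height, target_width):
    """Return the resized shape and the (before, after) padding per axis."""
    # The smaller of the two ratios keeps the whole image inside the target.
    scale = min(target_height / height, target_width / width)
    new_h = min(target_height, max(1, round(height * scale)))
    new_w = min(target_width, max(1, round(width * scale)))
    pad_h = target_height - new_h
    pad_w = target_width - new_w
    # Assumed here: padding is split evenly between both sides.
    return ((new_h, new_w),
            (pad_h // 2, pad_h - pad_h // 2),
            (pad_w // 2, pad_w - pad_w // 2))


# Hypothetical example: a 480 x 640 image fitted into a 360 x 360 canvas.
resized, pad_rows, pad_cols = fit_and_pad(480, 640, 360, 360)
print(resized, pad_rows, pad_cols)
```

The same transformation has to be applied to the fixation maps so that stimuli and targets stay aligned.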

4.2 Metrics

Various measures are used in the literature and by benchmarks to evaluate the performance of fixation models. In practice, results are typically reported for all of them to include different notions about saliency and allow a fair comparison of model predictions Kümmerer et al. (2018); Riche et al. (2013). A set of nine metrics is commonly selected: Kullback-Leibler divergence (KLD), Pearson’s correlation coefficient (CC), histogram intersection (SIM), Earth Mover’s distance (EMD), information gain (IG), normalized scanpath saliency (NSS), and three variants of area under ROC curve (AUC-Judd, AUC-Borji, shuffled AUC). The latter five are location-based metrics, which require ground truth maps as binary fixation matrices. By contrast, the former four are distribution-based and quantify saliency approximations after convolving gaze locations with a Gaussian kernel and representing the target output as a probability distribution. We refer readers to an overview by Bylinskii et al. (2018) for more information regarding the implementation details and properties of the stated measures.
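Three of the listed metrics are compact enough to sketch directly. The versions below operate on flattened maps given as plain lists and follow the usual textbook definitions (NSS is location-based, CC and SIM are distribution-based); they are simplified illustrations, not the benchmark's reference implementation.

```python
import math


def nss(saliency, fixations):
    """Mean of the standardized saliency values at fixated pixels."""
    mean = sum(saliency) / len(saliency)
    std = math.sqrt(sum((s - mean) ** 2 for s in saliency) / len(saliency))
    fixated = [(s - mean) / std for s, f in zip(saliency, fixations) if f]
    return sum(fixated) / len(fixated)


def cc(a, b):
    """Pearson's correlation coefficient between two maps."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def sim(a, b):
    """Histogram intersection of two maps normalized to unit sum."""
    total_a, total_b = sum(a), sum(b)
    return sum(min(x / total_a, y / total_b) for x, y in zip(a, b))


prediction = [0.1, 0.2, 0.6, 0.1]   # toy saliency map
fixation_mask = [0, 0, 1, 0]        # binary fixation matrix, flattened
uniform = [0.25, 0.25, 0.25, 0.25]  # uninformative baseline map

print(nss(prediction, fixation_mask))
print(cc(prediction, prediction))
print(sim(prediction, uniform))
```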

In this work, we adopted KLD as an objective function and produced fixation density maps as output from our proposed network. This training setup is particularly sensitive to false negative predictions and thus the appropriate choice for applications aimed at salient target detection Bylinskii et al. (2018). Defining the problem of saliency prediction in a probabilistic framework also enables fair model ranking on public benchmarks for the MIT1003, CAT2000, and SALICON datasets Kümmerer et al. (2018). As a consequence, we evaluated our estimated gaze distributions without applying any metric-specific postprocessing methods.

4.3 Results

A quantitative comparison of results on independent test datasets was carried out to characterize how well our proposed network generalizes to unseen images. Here, we were mainly interested in estimating human eye movements and regarded mouse tracking measurements merely as a substitute for attention. The final outcome for the 2017 release of the SALICON dataset is therefore not reported in this work but our model results can be viewed on the public leaderboard under the user name akroner.

To assess the predictive performance for eye tracking measurements, the MIT saliency benchmark Bylinskii et al. (2015) is commonly used to compare model results on two test datasets with respect to prior work. Final scores can then be submitted on a public leaderboard to allow fair model ranking on eight evaluation metrics. Table 1 summarizes our results on the test dataset of MIT1003, namely MIT300 Judd et al. (2012), in the context of previous approaches. The evaluation shows that our model only marginally failed to achieve state-of-the-art performance on any of the individual metrics. When computing the cumulative rank (i.e. the sum of ranks) on a subset of weakly correlated measures (sAUC, NSS, KLD) Riche et al. (2013); Bylinskii et al. (2018), we ranked third behind the two architectures DPNSal and DenseSal from Oyama and Yamanaka (2018). However, their approaches were based on a pre-trained Dual Path Network with 131 layers Chen et al. (2017) and Densely Connected Convolutional Network with 161 layers Huang et al. (2017) respectively, both of which are computationally far more expensive than the VGG16 model used in this work. Among all entries with a VGG16 backbone Kümmerer et al. (2014); Cornia et al. (2016); Huang et al. (2015); Cornia et al. (2018); Kruthiventi et al. (2017), our network clearly achieved the highest performance.
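The cumulative rank can be illustrated with a small helper that sums per-metric ranks. The sketch uses only three rows of Table 1 at their rounded precision and lets tied scores share the better rank, which is an assumption; the leaderboard ranking covers all entries, so the totals here do not reproduce the ranking reported in the text.

```python
def cumulative_rank(scores, higher_is_better):
    """Sum of per-metric ranks (1 = best); tied scores share the better rank."""
    totals = {model: 0 for model in scores}
    for metric, higher in higher_is_better.items():
        for model, values in scores.items():
            own = values[metric]
            others = [v[metric] for v in scores.values()]
            if higher:
                better = sum(1 for o in others if o > own)
            else:
                better = sum(1 for o in others if o < own)
            totals[model] += better + 1
    return totals


# Rounded values for three entries, taken from Table 1 (MIT300).
scores = {
    "Ours": {"sAUC": 0.72, "NSS": 2.27, "KLD": 0.66},
    "DPNSal": {"sAUC": 0.74, "NSS": 2.41, "KLD": 0.91},
    "DenseSal": {"sAUC": 0.72, "NSS": 2.25, "KLD": 0.48},
}
# KLD measures dissimilarity, so lower values rank better.
directions = {"sAUC": True, "NSS": True, "KLD": False}

print(cumulative_rank(scores, directions))
```

A lower total indicates a model that performs consistently well across the selected metrics rather than excelling on a single one.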

Table 2 demonstrates that we obtained state-of-the-art results for the CAT2000 test dataset regarding the AUC-J, sAUC, and KLD evaluation metrics, and competitive results on the remaining measures. The cumulative rank (as computed above) suggests that our model outperformed all previous approaches, including the ones based on a pre-trained VGG16 classification network Cornia et al. (2018); Kruthiventi et al. (2017). Our final evaluation results for both the MIT300 and CAT2000 datasets can be viewed on the MIT saliency benchmark under the model name MSI-Net, representing our multi-scale information network. Qualitatively, the proposed architecture successfully captures semantically meaningful image features such as faces and text towards the prediction of saliency, as can be seen in Figure 1.

To quantify the contribution of multi-scale contextual information to the overall performance, we conducted a model ablation analysis. A baseline architecture without the ASPP module was constructed by replacing the five parallel convolutional layers with a single convolutional operation that resulted in 1280 activation maps. This representation was then forwarded to a convolutional layer with 256 channels. While the total number of feature maps stayed constant, the number of trainable parameters increased in this ablation setting. Table 3 summarizes the results according to validation instances of the MIT1003 and CAT2000 datasets for the model with and without an ASPP module. It can be seen that our multi-scale architecture clearly reached a higher performance on most metrics and is therefore able to leverage the information captured by convolutional layers with different receptive field sizes.
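A rough parameter count shows why the ablation baseline is larger despite producing the same number of feature maps. The sketch assumes that every ASPP branch receives the 1280-channel encoder tensor, that the dilated branches and the single baseline convolution use 3×3 kernels, and that each layer has a bias term; these choices are assumptions, not figures given in the text.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a square 2D convolution."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


IN_CHANNELS = 1280  # concatenated encoder output
BRANCH = 256        # filters per ASPP branch

aspp = (
    conv_params(1, IN_CHANNELS, BRANCH)        # 1x1 branch
    + 3 * conv_params(3, IN_CHANNELS, BRANCH)  # dilated 3x3 branches (assumed size)
    + conv_params(1, IN_CHANNELS, BRANCH)      # image-level branch (assumed input)
)
# Ablation baseline: one convolution producing all 1280 maps (3x3 assumed).
baseline = conv_params(3, IN_CHANNELS, 5 * BRANCH)

print(aspp, baseline, baseline > aspp)
```

Dilation does not add weights, so the comparison holds for any choice of dilation rates in the ASPP branches.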









| Model | AUC-J ↑ | SIM ↑ | EMD ↓ | AUC-B ↑ | sAUC ↑ | CC ↑ | NSS ↑ | KLD ↓ |
|---|---|---|---|---|---|---|---|---|
| Ours | 0.87 | 0.68 | 1.99 | 0.82 | 0.72 | 0.79 | 2.27 | 0.66 |
| DPNSal Oyama and Yamanaka (2018) | 0.87 | 0.69 | 2.05 | 0.80 | 0.74 | 0.82 | 2.41 | 0.91 |
| DenseSal Oyama and Yamanaka (2018) | 0.87 | 0.67 | 1.99 | 0.81 | 0.72 | 0.79 | 2.25 | 0.48 |
| EML-NET Jia (2018) | 0.88 | 0.68 | 1.84 | 0.77 | 0.70 | 0.79 | 2.47 | 0.84 |
| DeepGaze II Kümmerer et al. (2016) | 0.88 | 0.46 | 3.98 | 0.86 | 0.72 | 0.52 | 1.29 | 0.96 |
| DeepGaze I Kümmerer et al. (2014) | 0.84 | 0.39 | 4.97 | 0.83 | 0.66 | 0.48 | 1.22 | 1.23 |
| DSCLRCN Liu and Han (2018) | 0.87 | 0.68 | 2.17 | 0.79 | 0.72 | 0.80 | 2.35 | 0.95 |
| SAM-ResNet Cornia et al. (2018) | 0.87 | 0.68 | 2.15 | 0.78 | 0.70 | 0.78 | 2.34 | 1.27 |
| SAM-VGG Cornia et al. (2018) | 0.87 | 0.67 | 2.14 | 0.78 | 0.71 | 0.77 | 2.30 | 1.13 |
| DeepFix Kruthiventi et al. (2017) | 0.87 | 0.67 | 2.04 | 0.80 | 0.71 | 0.78 | 2.26 | 0.63 |
| SALICON Huang et al. (2015) | 0.87 | 0.60 | 2.62 | 0.85 | 0.74 | 0.74 | 2.12 | 0.54 |
| ML-Net Cornia et al. (2016) | 0.85 | 0.59 | 2.63 | 0.75 | 0.70 | 0.67 | 2.05 | 1.10 |
| eDN Vig et al. (2014) | 0.82 | 0.41 | 4.56 | 0.81 | 0.62 | 0.45 | 1.14 | 1.14 |
| Judd Judd et al. (2009) | 0.81 | 0.42 | 4.45 | 0.80 | 0.60 | 0.47 | 1.18 | 1.12 |
| GBVS Harel et al. (2006) | 0.81 | 0.48 | 3.51 | 0.80 | 0.63 | 0.48 | 1.24 | 0.87 |
| Itti Itti et al. (1998) | 0.75 | 0.44 | 4.26 | 0.74 | 0.63 | 0.37 | 0.97 | 1.03 |
| SUN Zhang et al. (2008) | 0.67 | 0.38 | 5.10 | 0.66 | 0.61 | 0.25 | 0.68 | 1.27 |
Table 1: Quantitative results of our model for the MIT300 test set in the context of prior work. The first line separates deep learning approaches with architectures pre-trained on image classification from shallow networks and other machine learning methods. Entries beneath the second line are models based on theoretical considerations and define a baseline rather than competitive performance. Arrows indicate whether the metrics assess similarity or dissimilarity between predictions and targets. The best results are marked in bold.









| Model | AUC-J ↑ | SIM ↑ | EMD ↓ | AUC-B ↑ | sAUC ↑ | CC ↑ | NSS ↑ | KLD ↓ |
|---|---|---|---|---|---|---|---|---|
| Ours | 0.88 | 0.75 | 1.07 | 0.82 | 0.59 | 0.87 | 2.30 | 0.36 |
| SAM-ResNet Cornia et al. (2018) | 0.88 | 0.77 | 1.04 | 0.80 | 0.58 | 0.89 | 2.38 | 0.56 |
| SAM-VGG Cornia et al. (2018) | 0.88 | 0.76 | 1.07 | 0.79 | 0.58 | 0.89 | 2.38 | 0.54 |
| EML-NET Jia (2018) | 0.87 | 0.75 | 1.05 | 0.79 | 0.59 | 0.88 | 2.38 | 0.96 |
| DeepFix Kruthiventi et al. (2017) | 0.87 | 0.74 | 1.15 | 0.81 | 0.58 | 0.87 | 2.28 | 0.37 |
| eDN Vig et al. (2014) | 0.85 | 0.52 | 2.64 | 0.84 | 0.55 | 0.54 | 1.30 | 0.97 |
| Judd Judd et al. (2009) | 0.84 | 0.46 | 3.60 | 0.84 | 0.56 | 0.54 | 1.30 | 0.94 |
| GBVS Harel et al. (2006) | 0.80 | 0.51 | 2.99 | 0.79 | 0.58 | 0.50 | 1.23 | 0.80 |
| Itti Itti et al. (1998) | 0.77 | 0.48 | 3.44 | 0.76 | 0.59 | 0.42 | 1.06 | 0.92 |
| SUN Zhang et al. (2008) | 0.70 | 0.43 | 3.42 | 0.69 | 0.57 | 0.30 | 0.77 | 2.22 |
Table 2: Quantitative results of our model for the CAT2000 test set in the context of prior work. The first line separates deep learning approaches with architectures pre-trained on image classification from shallow networks and other machine learning methods. Entries beneath the second line are models based on theoretical considerations and define a baseline rather than competitive performance. Arrows indicate whether the metrics assess similarity or dissimilarity between predictions and targets. The best results are marked in bold.

The categorical organization of the CAT2000 database also allowed us to quantify the improvements by the ASPP module with respect to individual image classes. Table 4 lists the four categories that benefited the most from multi-scale information across all evaluation metrics on the validation set: Noisy, Satellite, Cartoon, Pattern. To understand the measured changes in predictive performance, it is instructive to inspect qualitative results of one representative example for each image category (see Figure 3). The visualizations demonstrate that large receptive fields allow the reweighting of relative importance assigned to image locations (Noisy, Satellite, Cartoon), detection of a central fixation bias (Noisy, Satellite, Cartoon), and allocation of saliency to low-level features that pop out from an array of distractors (Pattern).

Figure 3: A visualization of four example images from the CAT2000 validation set with the corresponding fixation heat maps, our best model predictions, and estimated maps based on the ablation network. The qualitative results indicate that multi-scale information augmented with global context enables a more accurate estimation of salient image regions.









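| Dataset | Model | AUC-J ↑ | SIM ↑ | EMD ↓ | AUC-B ↑ | sAUC ↑ | CC ↑ | NSS ↑ | KLD ↓ |
|---|---|---|---|---|---|---|---|---|---|
| MIT1003 | with ASPP | 0.90 | 0.59 | 2.49 | 0.85 | 0.72 | 0.74 | 2.62 | 0.79 |
| MIT1003 | without ASPP | 0.89 | 0.57 | 2.63 | 0.82 | 0.72 | 0.70 | 2.55 | 0.82 |
| CAT2000 | with ASPP | 0.88 | 0.73 | 1.29 | 0.82 | 0.59 | 0.85 | 2.35 | 0.41 |
| CAT2000 | without ASPP | 0.87 | 0.68 | 1.49 | 0.83 | 0.60 | 0.77 | 2.09 | 0.49 |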
Table 3: A summary of the quantitative results for the models with and without an ASPP module. Here, evaluation was performed on the validation set of MIT1003 and CAT2000 respectively. Arrows indicate whether the metrics assess similarity or dissimilarity between predictions and targets. The best results are marked in bold.









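| Category | AUC-J ↑ | SIM ↑ | EMD ↓ | AUC-B ↑ | sAUC ↑ | CC ↑ | NSS ↑ | KLD ↓ |
|---|---|---|---|---|---|---|---|---|
| Noisy | 0.01 | 0.07 | 0.25 | 0.01 | 0.00 | 0.13 | 0.41 | 0.11 |
| Satellite | 0.02 | 0.06 | 0.28 | 0.01 | 0.01 | 0.14 | 0.37 | 0.11 |
| Cartoon | 0.01 | 0.06 | 0.29 | 0.01 | 0.01 | 0.12 | 0.33 | 0.11 |
| Pattern | 0.01 | 0.05 | 0.28 | 0.00 | 0.00 | 0.08 | 0.27 | 0.09 |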
Table 4: A list of the four image categories from the CAT2000 validation set that showed the largest improvement by the ASPP architecture. Entries are sorted in decreasing order of the cumulative rank across all evaluation measures. Arrows indicate whether the metrics assess similarity or dissimilarity between predictions and targets. Results that improved on the respective metric are marked in green, whereas results that impaired performance are marked in red.

Figure 4: A visualization of four example images from the CAT2000 validation set with the corresponding eye movement patterns and our model predictions. The stimuli demonstrate cases with a qualitative disagreement between the estimated saliency maps and ground truth data. Here, our model failed to capture an occluded face (a), small text (b), direction of gaze (c), and low-level feature contrast (d).

5 Discussion

Our proposed encoder-decoder model demonstrated competitive performance on two datasets towards visual saliency prediction. The ASPP module incorporated multi-scale information and global context based on semantic feature representations, which improved the results both qualitatively and quantitatively. This suggests that convolutional layers with large receptive fields at different dilation factors can enable a more holistic estimation of salient image regions in complex scenes. Moreover, our architecture is computationally lightweight compared to prior state-of-the-art approaches and outperformed all other networks defined with a pre-trained VGG16 backbone. For this performance assessment, we calculated the cumulative rank on a subset of evaluation metrics to resolve some of the inconsistencies in ranking models by a single measure or a set of correlated ones Riche et al. (2013); Bylinskii et al. (2018).

Further gains on benchmark results could potentially be achieved by a number of additions to the processing pipeline. Our model demonstrates a learned preference for predicting fixations in central regions of images, but we expect performance benefits from modelling the central bias in scene viewing explicitly Kümmerer et al. (2014, 2016); Cornia et al. (2016, 2018); Kruthiventi et al. (2017). Additionally, Bylinskii et al. (2015) summarized open problems for correctly assigning saliency in natural images, such as robustness in detecting semantic features, implied gaze and motion, and importance weighting of multiple salient regions. While the latter was addressed in this study, Figure 4 indicates that the remaining obstacles still persist for our proposed model.

Overcoming these issues requires a higher-level scene understanding that models object interactions and predicts implicit gaze and motion cues from static images. Robust object recognition could, however, be achieved through more recent classification networks as feature extractors Oyama and Yamanaka (2018). To detect interesting items in search array stimuli (see Figure 4d), a mechanism that additionally captures low-level feature contrasts might explain the empirical data better. Besides architectural changes, data augmentation in the context of saliency prediction tasks has been shown to improve the robustness of deep neural networks according to Che et al. (2018). These authors stated that visual transformations such as mirroring or inversion have a low impact on human gaze during scene viewing and could hence form an addition to future work on saliency models.


This study received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement Nos. 7202070 (HBP SGA1) and 737691 (HBP SGA2). Moreover, we gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan X Pascal GPU used for this research.


  • Jonides et al. (1982) J. Jonides, D. E. Irwin, S. Yantis, Integrating visual information from successive fixations, Science 215 (1982) 192–194.
  • Irwin (1991) D. E. Irwin, Information integration across saccadic eye movements, Cognitive Psychology 23 (1991) 420–456.
  • Posner (1980) M. I. Posner, Orienting of attention, Quarterly Journal of Experimental Psychology 32 (1980) 3–25.
  • Lennie (2003) P. Lennie, The cost of cortical computation, Current Biology 13 (2003) 493–497.
  • Cowey and Rolls (1974) A. Cowey, E. Rolls, Human cortical magnification factor and its relation to visual acuity, Experimental Brain Research 21 (1974) 447–454.
  • Berkley et al. (1975) M. A. Berkley, F. Kitterle, D. W. Watkins, Grating visibility as a function of orientation and retinal eccentricity, Vision Research 15 (1975) 239–244.
  • Cheung et al. (2016) B. Cheung, E. Weiss, B. Olshausen, Emergence of foveal image sampling from learning to attend in visual scenes, arXiv preprint arXiv:1611.09430 (2016).
  • Gegenfurtner (2016) K. R. Gegenfurtner, The interaction between vision and eye movements, Perception 45 (2016) 1333–1357.
  • Itti et al. (1998) L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 1254–1259.
  • Treisman and Gelade (1980) A. M. Treisman, G. Gelade, A feature-integration theory of attention, Cognitive Psychology 12 (1980) 97–136.
  • Koch and Ullman (1985) C. Koch, S. Ullman, Shifts in selective visual attention: Towards the underlying neural circuitry, Human Neurobiology 4 (1985) 219–227.
  • Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems 25 (2012) 1097–1105.
  • Zhang et al. (2008) L. Zhang, M. H. Tong, T. K. Marks, H. Shan, G. W. Cottrell, SUN: A Bayesian framework for saliency using natural statistics, Journal of Vision 8 (2008) 32.
  • Harel et al. (2006) J. Harel, C. Koch, P. Perona, Graph-based visual saliency, Advances in Neural Information Processing Systems 19 (2006) 545–552.
  • Cerf et al. (2009) M. Cerf, E. P. Frady, C. Koch, Faces and text attract gaze independent of the task: Experimental data and computer model, Journal of Vision 9 (2009) 10.
  • Judd et al. (2009) T. Judd, K. Ehinger, F. Durand, A. Torralba, Learning to predict where humans look, Proceedings of the International Conference on Computer Vision (2009) 2106–2113.
  • Vig et al. (2014) E. Vig, M. Dorr, D. Cox, Large-scale optimization of hierarchical features for saliency prediction in natural images, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014) 2798–2805.
  • Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2009) 248–255.
  • Donahue et al. (2014) J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, DeCAF: A deep convolutional activation feature for generic visual recognition, Proceedings of the International Conference on Machine Learning (2014) 647–655.
  • Kümmerer et al. (2014) M. Kümmerer, L. Theis, M. Bethge, DeepGaze I: Boosting saliency prediction with feature maps trained on ImageNet, arXiv preprint arXiv:1411.1045 (2014).
  • Kümmerer et al. (2016) M. Kümmerer, T. S. Wallis, M. Bethge, DeepGaze II: Reading fixations from deep features trained on object recognition, arXiv preprint arXiv:1610.01563 (2016).
  • Cornia et al. (2016) M. Cornia, L. Baraldi, G. Serra, R. Cucchiara, A deep multi-level network for saliency prediction, Proceedings of the International Conference on Pattern Recognition (2016) 3488–3493.
  • Huang et al. (2015) X. Huang, C. Shen, X. Boix, Q. Zhao, SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks, Proceedings of the International Conference on Computer Vision (2015) 262–270.
  • Cornia et al. (2018) M. Cornia, L. Baraldi, G. Serra, R. Cucchiara, Predicting human eye fixations via an LSTM-based saliency attentive model, IEEE Transactions on Image Processing 27 (2018) 5142–5154.
  • Liu and Han (2018) N. Liu, J. Han, A deep spatial contextual long-term recurrent convolutional network for saliency detection, IEEE Transactions on Image Processing 27 (2018) 3264–3274.
  • Borji (2018) A. Borji, Saliency prediction in the deep learning era: An empirical investigation, arXiv preprint arXiv:1810.03716 (2018).
  • Simonyan and Zisserman (2014) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
  • Yu and Koltun (2015) F. Yu, V. Koltun, Multi-scale context aggregation by dilated convolutions, arXiv preprint arXiv:1511.07122 (2015).
  • Hariharan et al. (2015) B. Hariharan, P. Arbeláez, R. Girshick, J. Malik, Hypercolumns for object segmentation and fine-grained localization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 447–456.
  • Long et al. (2015) J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 3431–3440.
  • Chen et al. (2018) L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille, DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2018) 834–848.
  • Torralba et al. (2006) A. Torralba, A. Oliva, M. S. Castelhano, J. M. Henderson, Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search, Psychological Review 113 (2006) 766.
  • Chen et al. (2017) L.-C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic image segmentation, arXiv preprint arXiv:1706.05587 (2017).
  • Odena et al. (2016) A. Odena, V. Dumoulin, C. Olah, Deconvolution and checkerboard artifacts, Distill 1 (2016) e3.
  • Glorot and Bengio (2010) X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the International Conference on Artificial Intelligence and Statistics (2010) 249–256.
  • Sutskever et al. (2013) I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, Proceedings of the International Conference on Machine Learning (2013) 1139–1147.
  • Zhou et al. (2017) B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 million image database for scene recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2017) 1452–1464.
  • Jetley et al. (2016) S. Jetley, N. Murray, E. Vig, End-to-end saliency mapping via probability distribution prediction, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) 5753–5761.
  • Kingma and Ba (2014) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
  • Wilson and Martinez (2003) D. R. Wilson, T. R. Martinez, The general inefficiency of batch training for gradient descent learning, Neural Networks 16 (2003) 1429–1451.
  • Borji and Itti (2015) A. Borji, L. Itti, CAT2000: A large scale fixation dataset for boosting saliency research, arXiv preprint arXiv:1505.03581 (2015).
  • Xu et al. (2015) P. Xu, K. A. Ehinger, Y. Zhang, A. Finkelstein, S. R. Kulkarni, J. Xiao, TurkerGaze: Crowdsourcing saliency with webcam based eye tracking, arXiv preprint arXiv:1504.06755 (2015).
  • Jiang et al. (2015) M. Jiang, S. Huang, J. Duan, Q. Zhao, SALICON: Saliency in context, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 1072–1080.
  • Tavakoli et al. (2017) H. R. Tavakoli, F. Ahmed, A. Borji, J. Laaksonen, Saliency revisited: Analysis of mouse movements versus fixations, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 6354–6362.
  • Kümmerer et al. (2018) M. Kümmerer, T. Wallis, M. Bethge, Saliency benchmarking made easy: Separating models, maps and metrics, Proceedings of the European Conference on Computer Vision (2018) 770–787.
  • Riche et al. (2013) N. Riche, M. Duvinage, M. Mancas, B. Gosselin, T. Dutoit, Saliency and human fixations: State-of-the-art and study of comparison metrics, Proceedings of the International Conference on Computer Vision (2013) 1153–1160.
  • Bylinskii et al. (2018) Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, F. Durand, What do different evaluation metrics tell us about saliency models?, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2018) 740–757.
  • Bylinskii et al. (2015) Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva, A. Torralba, MIT saliency benchmark, 2015.
  • Judd et al. (2012) T. Judd, F. Durand, A. Torralba, A benchmark of computational models of saliency to predict human fixations, 2012.
  • Oyama and Yamanaka (2018) T. Oyama, T. Yamanaka, Influence of image classification accuracy on saliency map estimation, arXiv preprint arXiv:1807.10657 (2018).
  • Chen et al. (2017) Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, J. Feng, Dual path networks, Advances in Neural Information Processing Systems 30 (2017) 4467–4475.
  • Huang et al. (2017) G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 2261–2269.
  • Kruthiventi et al. (2017) S. S. Kruthiventi, K. Ayush, R. V. Babu, DeepFix: A fully convolutional neural network for predicting human eye fixations, IEEE Transactions on Image Processing 26 (2017) 4446–4456.
  • Jia (2018) S. Jia, EML-NET: An expandable multi-layer network for saliency prediction, arXiv preprint arXiv:1805.01047 (2018).
  • Che et al. (2018) Z. Che, A. Borji, G. Zhai, X. Min, Invariance analysis of saliency models versus human gaze during scene free viewing, arXiv preprint arXiv:1810.04456 (2018).