The human visual system receives a vast amount of information every second. An essential mechanism that allows it to process this information in real time is its capacity to selectively focus attention on parts of the scene. This process has been extensively studied by psychologists to discover which visual patterns capture human attention. Desimone & Duncan  found that parts of an image that differ from their surroundings stand out; this paradigm is called the center-surround difference in early computational modeling of visual attention. Based on the center-surround difference and the feature integration theory proposed by Treisman & Gelade , many computational models of visual attention have been proposed [3, 4, 5].
In recent years, with the availability of large-scale datasets recording mouse movements of human subjects as a proxy for gaze (e.g., ) and of powerful parallel hardware, data-driven approaches based on deep learning have demonstrated significantly higher performance than previous models on all benchmarks. Currently, almost all deep saliency models treat the gaze map as a small-scale map recording the density of fixations at every image location (downsampled from the ground truth [8, 9]). Such models are almost invariably trained by minimizing the distance between the predicted saliency maps and the ground truth. At inference time, the saliency map is then upsampled to the input image's size. Such deep saliency models have achieved much better performance than models based on hand-crafted features or psychological assumptions, but unlike for the task of image recognition, where the representations learned by deep neurons have been studied and visualized [10, 11], it remains unclear why deep saliency models perform so well or what salient patterns deep neurons have become attuned to in the process. The complexity of some of the proposed architectures makes them even more inscrutable.
In this paper, we use a simple yet powerful residual-like decoder with a new loss function for pixel-wise gaze prediction. The architecture is similar to that in , but we dispense with the GAN training and instead propose a simpler, residual decoder. We demonstrate that the model, although simpler, achieves better performance on most metrics and datasets. Additionally, we propose a novel method to visualize and analyze the representations learnt by deep saliency models. To the best of our knowledge, this is the first work that looks inside deep saliency models.
The rest of the paper is organized as follows. Section 2 reviews the state-of-the-art gaze prediction models as well as visualization methods for deep convolutional networks. Section 3 introduces the proposed deep saliency model and the model visualization method. Experimental results and benchmarks are presented in Sections 4 and 5.
2 Related work
In this section, we first review the state-of-the-art deep gaze prediction models before introducing visualization methods for deep convolutional networks.
2.1 Deep saliency models
The release of the SALICON dataset  offered, for the first time, a large-scale dataset for saliency, which spurred the development of a number of saliency models. For example, Deepnet  learns saliency using 8 convolutional layers, where only the first 3 layers were initialized from a pre-trained image classification model. PDP  treats the gaze map as a small-scale probability map; its authors investigated different loss functions for training their gaze prediction model and found that the Bhattacharyya distance is the best loss function when the gaze map is treated as a small-scale probability map. The Salicon model uses multi-resolution inputs and combines feature representations in the deep layers for gaze prediction. Deepfix  combined the deep architectures of VGG , Googlenet , and dilated convolutions  in their network, as well as adding a central bias, to achieve a higher performance than previous models. SalGAN  uses an encoder-decoder architecture and proposes the binary cross-entropy (BCE) loss function to perform pixel-wise (rather than image-wise) saliency estimation. After pre-training the encoder-decoder, they use a Generative Adversarial Network (GAN) to boost their model's performance. DVA  uses multiple deep layer representations, builds a decoder for each layer, and fuses them at the final stage for pixel-wise gaze prediction. DSCLRCN  also uses multiple inputs by adding a contextual information stream, and concatenates the original representation and the contextual representation into an LSTM network for the final prediction.
Table 1 provides a comparison of state-of-the-art deep saliency models. Complex architectures [8, 14, 18, 20, 19] are intrinsically inscrutable and difficult to interpret, hence in this article we propose to use a simple fully convolutional encoder with a residual decoder, using the exponential absolute distance (EAD) to do pixel-wise gaze prediction. We demonstrate that despite its simplicity, this architecture can compete, or even outperform, more complex state-of-the-art architectures.
|Model|Inputs|Architecture|LSTM|CB|Loss|Prediction|
|DSCLRCN |multi inputs|Resnet , Places |yes|no|NSS|pixel|
|Deepfix |single input|MA (VGG, Googlenet, Dilated)|no|yes||pixel|
|Salicon |MR inputs|VGG|no|no|K-L|PD|
|SalGAN |single input|VGG, GAN|no|no|BCE|pixel|
|PDP |single input|VGG|no|no|Bha|PD|
|DVA |single input|VGG, MD|no|no|BCE|pixel|
|Deepnet |single input|custom 8 layers|no|no||pixel|

Table 1: Comparison of saliency prediction models. MA: multi-architecture, MR: multi-resolution, PD: probability distribution based, Bha: Bhattacharyya distance, MD: multiple decoders, CB: central bias, K-L: Kullback-Leibler divergence.
2.2 Visualizing deep neural networks
The success of deep convolutional neural networks has raised the question of what representations are learned by neurons located in deep layers. One approach towards understanding how CNNs work and learn is to visualize individual neurons' activations and receptive fields. For example, Zeiler & Fergus  proposed a deconvolution network to visualize the original patterns that activate the corresponding activation maps. In the forward pass of a convolutional neural network, the main operations are convolution, ReLU (or another nonlinearity), and pooling. Conversely, a deconvolution network consists of three steps: unpooling, transposed convolution (using the pre-trained weights of the forward pass, transposed for convolution), and the ReLU operation. Yosinski et al.  developed two tools for understanding deep convolutional neural networks. The first tool visualizes the activation maps at different layers for a given input image. The second tool estimates the input pattern to which a network is maximally attuned for a given object class. In practice, the last layer of a classification network typically features one neuron per object class. Yosinski et al. propose to use gradient ascent (with regularization) to find the input image that maximizes the output of a specific neuron (i.e., for a specific object class), thereby deriving the optimal input for that class.
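The regularized gradient-ascent procedure can be sketched in a few lines. The snippet below is a minimal illustration, assuming a hypothetical linear "neuron" whose score is w.x so that the gradient is available in closed form; in a real deep network the gradient with respect to the input would be obtained via backpropagation instead.

```python
import numpy as np

# Minimal activation-maximization sketch: gradient ascent with an L2 regularizer.
# Assumption: the "neuron" is a toy linear score f(x) = w.x, so the gradient is
# analytic; a real model would supply df/dx through backpropagation.
def maximize_activation(w, steps=200, lr=5.0, l2=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=0.01, size=w.shape)  # start from small random noise
    for _ in range(steps):
        grad = w - 2 * l2 * x                 # d/dx [ w.x - l2 * ||x||^2 ]
        x += lr * grad                        # ascend towards higher activation
    return x

w = np.array([1.0, -2.0, 0.5])
x_opt = maximize_activation(w)
# The regularized optimum is w / (2 * l2); the iterate should converge to it.
print(np.allclose(x_opt, w / 0.02, atol=1e-3))  # True
```

The L2 term plays the role of the regularization mentioned above: without it, the optimal input for a linear score would grow without bound.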
Both visualization methods discussed above are essentially qualitative. In contrast, Bau et al.  proposed a quantitative method that assigns each activation map a semantic meaning. For network dissection, they introduced a dataset of 63,305 images across 6 image categories, where each image is labeled with pixel-wise semantic meaning. First, they forward all images in the dataset through a pre-trained deep model; each activation map inside the model responds with a different pattern for each input. They then compute the distribution of each unit's activation map over the whole dataset and determine a per-unit threshold based on this distribution. Using that threshold, the activation map for each input image is quantized into a binary map. Finally, they compute the intersection over union (IOU) between the quantized activation map and the labeled ground truth to determine what objects or object parts that unit is detecting.
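The threshold-and-score step of network dissection can be sketched as follows. This is a simplified, single-image version under stated assumptions: `act` stands in for one unit's (rescaled) activation map, `gt` for a pixel-wise binary label mask, and the quantile value is illustrative rather than the one used by Bau et al.

```python
import numpy as np

# Simplified sketch of the network-dissection scoring step: threshold a unit's
# activation map at a high quantile of its own values, quantize it to a binary
# mask, and score it by IOU against a pixel-wise label mask.
def dissection_iou(act, gt, quantile=0.95):
    thresh = np.quantile(act, quantile)       # per-unit activation threshold
    binary = act > thresh                     # quantized binary map
    inter = np.logical_and(binary, gt).sum()
    union = np.logical_or(binary, gt).sum()
    return inter / union if union else 0.0

act = np.zeros((10, 10))
act[:2, :2] = 1.0                             # unit fires on a 2x2 region
print(dissection_iou(act, act.astype(bool)))  # 1.0: mask matches exactly
```

In the full method, the threshold is computed from the activation distribution over the whole dataset rather than a single image, and the IOU is aggregated per semantic class.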
Although these approaches provide useful insight into the workings of deep neural networks, they are ill-suited for understanding deep saliency networks: while it is reasonable to expect that neurons in a dog/cat classifier will encode patterns characteristic of dogs and cats, a saliency model is expected to encode both salient patterns and non-salient ones. For this reason, we propose to use the normalized scan-path saliency (NSS)  score to determine whether individual neurons act as negative or positive predictors of gaze in the network. Moreover, in order to interpret what the model has learnt as salient, we use the network dissection approach of Bau et al.  to highlight what objects or object parts neurons in our saliency models are implicitly attuned to.
3 Method

This section first introduces the proposed simplified architecture for saliency estimation. We then describe how to visualize and analyze deep saliency models.
3.1 Gaze prediction
The overall architecture of our network is illustrated in Figure 1. The input is first processed by an encoder network and represented by a feature tensor, whose shape is given by the number of spatial locations and the feature dimension at each location. In our model, we use the first 5 convolutional blocks of a pre-trained VGG16  network (we removed the last pooling layer and kept the first 4 pooling layers in the encoder) to initialize the feature extraction part, and fine-tune it during training. The input is resized to a fixed size; since each of the 4 pooling layers halves the spatial resolution, the feature tensor's spatial resolution is 1/16 of the input's.
After feature extraction, the feature tensor is fed into the residual decoder. The decoder consists of four blocks, where each block upsamples the feature tensor once to recover the resolution lost in the encoding stage. Each block performs three similar operations: a convolution for dimension reduction, a regular convolution, and a deconvolution that recovers the resolution lost in the encoder due to pooling. In each block, the feature tensor from the previous block is first processed by a dimension-reduction convolutional layer that reduces the number of feature maps; in our model, we halve the number of feature maps in each block of the decoder. The reduced feature tensor is then processed by a conventional convolutional layer. Finally, the two processed tensors are added together and sent to a deconvolutional layer, which increases the tensor resolution and generates the block's output tensor.
The kernel sizes were fixed for the convolutional and deconvolutional layers, and zero-padding was used to preserve the input's scale. The last layer of the decoder is a convolutional layer, which transforms the activation maps (the output of the last deconvolutional layer) into the saliency map. No further processing was applied.
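The shape behaviour of one decoder block can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: channel mixing via matrix products stands in for the real convolution kernels, and nearest-neighbour repetition stands in for the learned deconvolution; only the channel halving, the residual addition, and the 2x upsampling are faithful to the description above.

```python
import numpy as np

# Shape-level sketch of one residual decoder block. Assumptions: 1x1-style
# channel mixing replaces the real convolutions, and nearest-neighbour repeat
# replaces the learned deconvolution.
def decoder_block(x, seed=0):
    c = x.shape[-1]
    rng = np.random.default_rng(seed)
    w_reduce = rng.normal(size=(c, c // 2))     # dimension-reduction "conv"
    x_r = np.maximum(x @ w_reduce, 0)           # halve the channels + ReLU
    w_conv = rng.normal(size=(c // 2, c // 2))  # further "conv" on reduced tensor
    x_c = np.maximum(x_r @ w_conv, 0)
    x_sum = x_r + x_c                           # residual addition
    return x_sum.repeat(2, axis=0).repeat(2, axis=1)  # 2x spatial upsampling

x = np.ones((14, 14, 512))
print(decoder_block(x).shape)  # (28, 28, 256)
```

Chaining four such blocks recovers the factor-16 resolution lost to the encoder's four pooling layers while progressively reducing the channel count.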
To train our model, we propose a new pixel-wise loss function, the exponential absolute distance (EAD). Writing $p_i$ and $g_i$ for the prediction and the ground truth at pixel $i$, and $N$ for the number of pixels in the gaze map, the loss is

$$\mathcal{L}_{\mathrm{EAD}} = \frac{1}{N} \sum_{i=1}^{N} \left( e^{|p_i - g_i|} - 1 \right).$$
Compared to the squared (L2) distance, the EAD has a better gradient when the absolute difference is small. Compared to the absolute (L1) distance, which is linear in the absolute difference, the EAD gives a larger penalty when the difference is large. In contrast to EAD, the BCE loss proposed in  yields a non-zero loss even for perfect predictions (as illustrated in Figure 2). Our model was implemented in Tensorflow [25] and trained with the Adam  optimizer. We set an initial learning rate and decayed it by a fixed factor after each training epoch.
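A minimal sketch of a pixel-wise exponential loss of this kind, assuming the form mean(exp(|p - g|) - 1), which is consistent with the properties stated in the text: zero loss for a perfect prediction, a non-vanishing gradient for small errors, and a super-linear penalty for large errors.

```python
import numpy as np

# Sketch of the EAD loss under the assumed form mean(exp(|p - g|) - 1).
# The "- 1" shift guarantees zero loss when prediction equals ground truth.
def ead_loss(pred, gt):
    return np.mean(np.exp(np.abs(pred - gt)) - 1.0)

print(ead_loss(np.array([0.3, 0.7]), np.array([0.3, 0.7])))  # 0.0 (perfect prediction)
```

For a pixel error of 2, this loss is exp(2) - 1, roughly 6.39, versus 2 under L1, illustrating the heavier penalty on large deviations.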
3.2 Model visualization
As discussed before, an important question is: what is learnt by a deep saliency model that allows it to outperform hand-crafted shallow models based on psychological theories? In other words, what specific salient patterns are learnt by the model? One hypothesis is that such deep networks encode semantic information about saliency that goes beyond classical center-surround assumptions.
Here, we propose to use the actual saliency evaluation metric, the normalized scan-path saliency (NSS) score, to visualize and understand the inner representations of deep saliency models. First, we feed all images of a dataset with recorded fixations (the MIT1003 dataset ) to the pre-trained deep saliency model. For each image, the feature extraction part (in our model, the output of the encoder) produces a set of activation maps, one per neuron, and each activation map has a unique pattern for a given input image. We rescale each activation map to the input's scale and use it to compute an NSS score for each neuron over the whole dataset. For each unit's activation map, we take the top 5 NSS scores and compute their mean as that unit's mean NSS score (since a convolutional feature channel can only correspond to a certain type of visual pattern [28, 29]). Each neuron's activation map thus has a mean NSS score, which indicates its correlation with human gaze locations. We then normalize the mean NSS scores across all activation maps and set a threshold; neurons with a mean NSS score above the threshold are identified as positive fixation detectors.
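The NSS computation for a single activation map can be sketched as follows: normalize the map to zero mean and unit standard deviation, then average its values at fixated pixels. Here `sal` is assumed already rescaled to the input size and `fix` is a binary fixation map of the same shape.

```python
import numpy as np

# NSS score for one (rescaled) activation map against a binary fixation map:
# standardize the map, then take the mean value at the fixation locations.
def nss(sal, fix):
    sal = (sal - sal.mean()) / (sal.std() + 1e-8)  # zero mean, unit std
    return sal[fix.astype(bool)].mean()            # mean value at fixations

sal = np.zeros((4, 4))
sal[0, 0] = 1.0                  # map fires exactly at the fixated pixel
fix = np.zeros((4, 4))
fix[0, 0] = 1.0
print(nss(sal, fix) > 3.0)       # True: a strong positive predictor of gaze
```

A map that fires away from fixations yields a negative score under the same computation, which is exactly what separates positive from negative fixation detectors.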
After selecting positive fixation detectors, we use network dissection  (with the same method and dataset as the original authors) to reveal what kind of objects or object parts those positive fixation detectors are attuned to. We proceed as follows. For every image in the Broden dataset , each unit's activation map exhibits a unique pattern. For each neuron's activation map, we compute the distribution of its values over the whole dataset and find the threshold that its values exceed with a fixed, small probability. All activation maps for all images are then rescaled to the input size and quantized into binary maps using this threshold. Finally, the IOU  is computed for each activation map to determine what sort of objects or object parts it is attuned to detecting. In our work, we only report the objects or object parts for the positive fixation detectors.
4 Saliency prediction performance
In this section, we first introduce the datasets used in the experiments and then show the performance of different pixel-wise loss functions as well as the comparison between our model and other state-of-the-art models.
4.1 Datasets

SALICON : The SALICON dataset is the largest dataset in the field of visual saliency. Saliency maps are estimated from human observers' mouse movements gathered over 20,000 images, with 10,000 images in the training set, 5,000 in the validation set, and another 5,000 in the test set. We use the SALICON training and validation sets to train and validate our model.
MIT1003 : This dataset includes gaze data from 15 subjects, recorded with an eye tracker over 1,003 images. It is used in the visualization part of this work. To compare our model with other state-of-the-art models on the MIT300 benchmark, we also randomly chose 900 images from this dataset to fine-tune our model, and used the remaining 103 images to test the performance of different loss functions.
MIT300 : This dataset is the standard benchmark dataset for human gaze prediction. It includes the gaze data of 39 subjects over 300 images.
Broden : This dataset contains 63,305 images drawn from four subsets, ADE20K (22,211 images), Opensurfaces (25,351 images), DTD (5,639 images), and PASCAL (10,104 images), all with pixel-wise semantic labels. It is used for network dissection.
4.2 Model performance
Table 2 records the accuracy of predicted saliency maps according to a range of standard error measures: Normalized Scan-path Saliency (NSS), Cross-Correlation (CC), Area Under the ROC Curve (AUC), and Similarity (Sim); we refer to  for a discussion of saliency metrics. The accuracy is recorded for different pixel-wise loss functions. Our proposed EAD loss achieves the best performance among all pixel-wise loss functions. Furthermore, the loss function used as a baseline in many deep gaze prediction models [9, 12] also shows good performance, which demonstrates that the proposed architecture is competitive regardless of the loss function and despite its comparative simplicity. This experiment was performed on the MIT1003 dataset.
Tables 3 and 4 compare our model's performance with state-of-the-art models. Our model performs on par with or better than all single-architecture models (Table 1), especially on the NSS score, which is the metric of choice for ranking saliency models . In Table 4, we see that our model's performance comes close to that of considerably more complex, multi-architecture approaches such as Deepfix  and DSCLRCN . Figure 4 provides a qualitative comparison of the saliency maps predicted by different models on example images.
5 Visualizing salient patterns
This section analyzes what is learnt by state-of-the-art deep saliency models, using the visualization tools discussed in Section 3.2. The model proposed in Section 3.1 is trained on the SALICON training dataset as before, but without fine-tuning on the MIT1003 dataset, to avoid overfitting when computing the NSS score for each activation map. In addition to the proposed model, we apply a similar analysis to three deep saliency models for which the code is publicly available: Deepnet, SalGAN, and OpenSalicon. For Deepnet , the first five convolutional layers were taken as the feature extraction part. For SalGAN , the encoder was treated as the feature extraction part. For OpenSalicon , both the coarse scale (Saliconc) and the fine scale (Saliconf) were visualized.
Figures 5 to 9 show example patterns for activation maps with high mean NSS scores (well above the model's overall performance) in different models. The patterns in these figures are generated as the product of an input image with the activation map, cropped to the active areas for legibility. From these figures, we can see that most activation maps with a high mean NSS score focus on a single object or part of an object (head, face, etc.).
Figure 10 shows example patterns for activation maps with medium mean NSS scores. In these examples, we can see that the active regions are less clearly focused on a single object or part. More importantly, for activation maps with low mean NSS score, shown in Figure 11, the patterns show a negative central bias, clearly inhibiting the central part of the image. Note that the models analyzed in this figure do not include an explicit central bias constraint, therefore this bias has been learnt solely from the training data and appears to be encoded by low NSS neurons.
Since some activation maps (positive fixation detectors) have very high mean NSS scores, we investigate the relationship between model performance (measured as the NSS score on the MIT1003 dataset) and the proportion of positive fixation detectors. Table 5 records this ratio for all analyzed models. Models with better overall performance have higher proportions of fixation detectors (the two quantities are positively correlated), and our model has the highest ratio of positive fixation detectors among all analyzed models. Note that in all cases, the ratio of positive fixation detectors remains small.
|Model|# activation maps|# positive detectors|Ratio|NSS|
|OpenSalicon [8, 32]|1024|19|1.8%|1.92|
After determining which activation maps are positive fixation detectors inside the deep models, the question remains of what those detectors are attuned to (i.e., which objects and object parts). For this purpose, we measure the normalized mean detection frequency of a class $c$ as the ratio

$$f_c = \frac{n_c}{N_c},$$

where $N_c$ is the total number of occurrences of class $c$ in the dataset and $n_c$ is the number of detected instances.
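This per-class ratio can be computed in one line; the snippet below uses hypothetical counts purely for illustration (`detected[c]` instances of class c found by positive fixation detectors, out of `total[c]` occurrences in the dataset).

```python
# Normalized mean detection frequency per class, from hypothetical counts:
# the number of detected instances divided by the class's total occurrences.
def detection_frequency(detected, total):
    return {c: detected.get(c, 0) / total[c] for c in total}

total = {"dog": 200, "cat": 100, "bus": 50}
detected = {"dog": 150, "cat": 90}
print(detection_frequency(detected, total))  # {'dog': 0.75, 'cat': 0.9, 'bus': 0.0}
```

Normalizing by each class's total frequency avoids rewarding detectors merely because a class happens to be common in the dataset.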
Figure 12 records the normalized mean detection frequency for the analyzed models, for object classes (left) and parts (right) in the Broden dataset. The left-hand side of the figure shows the results for object labels: most positive fixation detectors are attuned to common animals and people (dog, cat, cow, sheep, and person). A likely reason is that the models are fine-tuned from image recognition models, which have already learnt rich object classes. This assumption is supported by the results on Deepnet, which was trained from scratch without pre-training, and whose first four object classes (motorbike, ball, bus, airplane) are not those animals. Interestingly, the detectors in Saliconc and Saliconf are attuned to different visual classes; the coarse model (Saliconc) appears to capture more common object classes than the fine model, as evidenced by higher scores.
The right-hand side shows similar results using part labels instead of object labels. In these graphs, we see that almost all positive fixation detectors focus on the head or on body parts (i.e., head, hair, torso, ear, and neck).
6 Conclusion

This paper set out to investigate the reasons behind the high performance achieved by deep saliency models, compared to shallow models that use hand-crafted features based on theoretical considerations about saliency (e.g., center-surround difference). To this end, we proposed a simple residual-like decoder combined with a pixel-wise exponential absolute distance loss function. The proposed loss function achieves the best results among all pixel-wise loss functions, and the model's performance is on par with or better than state-of-the-art saliency models, despite its simple architecture. Furthermore, we proposed a visualization method for deep gaze prediction models and conducted a comprehensive study to reveal the inner representations of those models.
Our analyses allow us to draw three conclusions about what is learned by deep saliency models. First, better performing models have developed higher proportions of deep neurons highly predictive of human gaze, and those neurons are attuned to very specific visual patterns. Second, another category of neurons, which are not predictive of human gaze, appear to encode a form of negative central bias into the model. Third, we have demonstrated that the predictive neurons are attuned to clear semantic categories such as animals (dogs, cats), objects (motorbike, ball) and parts (head, hair). These results provide evidence that the higher prediction performance achieved by deep saliency models is likely caused by the additional semantic content encoded by such networks, allowing the models to capture the fact that specific visual classes are salient in their own right, in contrast to shallow saliency models that rely on low level perceptual patterns (such as center-surround difference). This hints that saliency, as experienced by humans, is a process that likely involves high-level world knowledge in addition to low-level perceptual cues.
We believe that our results can be useful to measure the gap between current saliency models and the human inter-observer model and to build new models to close this gap. We will share our code to facilitate future research in this direction.
-  Desimone, R., Duncan, J.: Neural mechanisms of selective visual attention. Annual review of neuroscience 18(1) (1995) 193–222
-  Treisman, A.M., Gelade, G.: A feature-integration theory of attention. Cognitive psychology 12(1) (1980) 97–136
-  Itti, L., Koch, C.: A saliency-based search mechanism for overt and covert shifts of visual attention. Vision research 40(10-12) (2000) 1489–1506
-  Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: Advances in neural information processing systems. (2007) 545–552
-  Garcia-Diaz, A., Fdez-Vidal, X.R., Pardo, X.M., Dosil, R.: Saliency from hierarchical adaptation through decorrelation and variance normalization. Image and Vision Computing 30(1) (2012) 51–64
-  Jiang, M., Huang, S., Duan, J., Zhao, Q.: Salicon: Saliency in context. In: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE (2015) 1072–1080
-  Bylinskii, Z., Judd, T., Borji, A., Itti, L., Durand, F., Oliva, A., Torralba, A.: Mit saliency benchmark
-  Huang, X., Shen, C., Boix, X., Zhao, Q.: Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 262–270
-  Jetley, S., Murray, N., Vig, E.: End-to-end saliency mapping via probability distribution prediction. Proceedings of Computer Vision and Pattern Recognition 2016 (2016) 5753–5761
-  Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European conference on computer vision, Springer (2014) 818–833
-  Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quantifying interpretability of deep visual representations. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE (2017) 3319–3327
-  Pan, J., Canton, C., McGuinness, K., O’Connor, N.E., Torres, J., Sayrol, E., Giro-i Nieto, X.: Salgan: Visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081 (2017)
-  Pan, J., Sayrol, E., Giro-i Nieto, X., McGuinness, K., O’Connor, N.E.: Shallow and deep convolutional networks for saliency prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 598–606
-  Kruthiventi, S.S., Ayush, K., Babu, R.V.: Deepfix: A fully convolutional neural network for predicting human eye fixations. IEEE Transactions on Image Processing 26(9) (2017) 4446–4456
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
-  Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., et al.: Going deeper with convolutions, Cvpr (2015)
-  Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
-  Wang, W., Shen, J.: Deep visual attention prediction. arXiv preprint arXiv:1705.02544 (2017)
-  Liu, N., Han, J.: A deep spatial contextual long-term recurrent convolutional network for saliency detection. arXiv preprint arXiv:1610.01708 (2016)
-  Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: Predicting human eye fixations via an lstm-based saliency attentive model. arXiv preprint arXiv:1611.09571 (2016)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
-  Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Advances in neural information processing systems. (2014) 487–495
-  Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., Lipson, H.: Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579 (2015)
-  Bylinskii, Z., Judd, T., Oliva, A., Torralba, A., Durand, F.: What do different evaluation metrics tell us about saliency models? arXiv preprint arXiv:1604.03605 (2016)
-  Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: Computer Vision, 2009 IEEE 12th international conference on, IEEE (2009) 2106–2113
-  Simon, M., Rodner, E.: Neural activation constellations: Unsupervised part model discovery with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1143–1151
-  Zhang, X., Xiong, H., Zhou, W., Lin, W., Tian, Q.: Picking deep filter responses for fine-grained image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 1134–1142
-  Judd, T., Durand, F., Torralba, A.: A benchmark of computational models of saliency to predict human fixations. Technical report, MIT (2012)
-  https://competitions.codalab.org/competitions/17136 Accessed: 2018-03-08.
-  Thomas, C.: Opensalicon: An open source implementation of the salicon saliency model. arXiv preprint arXiv:1606.00110 (2016)