What Catches the Eye? Visualizing and Understanding Deep Saliency Models

03/15/2018 ∙ by Sen He, et al. ∙ 0

Deep convolutional neural networks have demonstrated high performance for fixation prediction in recent years. How they achieve this, however, is less explored, and they remain black-box models. Here, we attempt to shed light on the internal structure of deep saliency models and study what features they extract for fixation prediction. Specifically, we use a simple yet powerful architecture, consisting of only one CNN and a single-resolution input, combined with a new loss function for pixel-wise fixation prediction during free viewing of natural scenes. We show that our simple method is on par with or better than complicated state-of-the-art saliency models. Furthermore, we propose a method, related to saliency model evaluation metrics, to visualize deep models for fixation prediction. Our method reveals the inner representations of deep models for fixation prediction and provides evidence that saliency, as experienced by humans, is likely to involve high-level semantic knowledge in addition to low-level perceptual cues. Our results can be useful to measure the gap between current saliency models and the human inter-observer model and to build new models to close this gap.




1 Introduction

The human visual system receives a large amount of information every second (about to bits). An essential mechanism that allows the human visual system to process such a vast amount of information in real time is its capacity to selectively focus attention on parts of the scene. This process has been extensively studied by Psychologists to discover which visual patterns capture human attention. Desimone & Duncan [1] found that parts of an image that differ from their surroundings stand out. This paradigm is called center-surround difference in early computational modeling of visual attention. Based on the center-surround difference and the feature integration theory proposed by Treisman & Gelade [2], many computational models of visual attention have been proposed [3, 4, 5].

In recent years, with the availability of large-scale datasets recording mouse movements of human subjects as a proxy for gaze (e.g., [6]) and of powerful parallel hardware, data-driven approaches based on deep learning have demonstrated significantly higher performance than previous models on all benchmarks [7]. Currently, almost all deep saliency models treat the gaze map as a small-scale map recording the density of fixations at every image location (downsampled from the ground truth [8, 9]). Such models are almost invariably trained by minimizing the distance between the predicted saliency maps and the ground truth. At inference time, the saliency map is then upsampled to the input image's size. Such deep saliency models have achieved much better performance than models based on hand-crafted features or psychological assumptions. However, unlike for the task of image recognition, where the representations learned by deep neurons have been studied and visualized [10, 11], it remains unclear why deep saliency models perform so well or what salient patterns their deep neurons have become attuned to in the process. The complexity of some of the proposed architectures makes them even more inscrutable.

In this paper, we use a simple yet powerful residual-like decoder with a new loss function for pixel-wise gaze prediction. The architecture is similar to the architecture in [12], but we dispense with the GAN training and instead propose a simpler, residual decoder. We demonstrate that the model, although simpler, achieves better performance on most metrics and datasets. Additionally, we propose a novel method to visualize and analyze the representations learnt by deep saliency models. To the best of our knowledge, this is the first work that looks inside deep saliency models.

The rest of the paper is organized as follows. Section 2 reviews the state-of-the-art gaze prediction models as well as visualization methods for deep convolutional networks. Section 3 introduces the proposed deep saliency model and the model visualization method. Experimental results and benchmarks are presented in Sections 4 and 5.

2 Related work

In this section, we first review the state-of-the-art deep gaze prediction models before introducing visualization methods for deep convolutional networks.

2.1 Deep saliency models

The release of the SALICON dataset [6] offered, for the first time, a large-scale dataset for saliency, which spurred the development of a number of saliency models. For example, Deepnet [13] learns saliency using 8 convolutional layers, where only the first 3 layers were initialized from a pre-trained image classification model. PDP [9] treats the gaze map as a small-scale probability map. Its authors investigated different loss functions for training their gaze prediction model and found that the Bhattacharyya distance is the best loss function when the gaze map is treated as a small-scale probability map. The Salicon [8] model uses multi-resolution inputs and combines feature representations in the deep layers for gaze prediction. Deepfix [14] combines the deep architectures of VGG [15], Googlenet [16], and dilated convolutions [17] in one network and adds a central bias to achieve a higher performance than previous models. SalGAN [12] uses an encoder-decoder architecture and proposes the binary cross entropy (BCE) loss function to perform pixel-wise (rather than image-wise) saliency estimation. After pre-training the encoder-decoder, its authors use a Generative Adversarial Network (GAN) to boost the model's performance. DVA [18] uses multiple deep layer representations, builds a decoder for each layer, and fuses them at the final stage for pixel-wise gaze prediction. DSCLRCN [19] also uses multiple inputs by adding a contextual information stream, and feeds the concatenation of the original representation and the contextual representation into an LSTM network for the final prediction.

Table 1 provides a comparison of state-of-the-art deep saliency models. Complex architectures [8, 14, 18, 20, 19] are intrinsically inscrutable and difficult to interpret. Hence, in this article we propose to use a simple fully convolutional encoder with a residual decoder, trained with the exponential absolute distance (EAD) for pixel-wise gaze prediction. We demonstrate that despite its simplicity, this architecture can compete with, or even outperform, more complex state-of-the-art architectures.

| Model | Input | CNN | LSTM | CB | Loss | pixel/PD |
|---|---|---|---|---|---|---|
| DSCLRCN [19] | multi inputs | Resnet [21], Places [22] | yes | no | NSS | pixel |
| Deepfix [14] | single input | MA(VGG, Googlenet, Dilated) | no | yes | | pixel |
| Salicon [8] | MR inputs | VGG | no | no | K-L | PD |
| SalGAN [12] | single input | VGG, GAN | no | no | BCE | pixel |
| PDP [9] | single input | VGG | no | no | Bha | PD |
| DVA [18] | single input | VGG, MD | no | no | BCE | pixel |
| Deepnet [13] | single input | Custom 8-layers | no | no | | pixel |
| Ours | single input | VGG | no | no | EAD | pixel |

Table 1: Comparison of saliency prediction models. MA: multi-architecture, MR: multi-resolution, PD: probability distribution based, Bha: Bhattacharyya distance, MD: multiple decoders, CB: central bias, K-L: Kullback-Leibler divergence.

2.2 Visualizing deep neural networks

The success of deep convolutional neural networks has raised the question of what representations are learned by neurons located in deep layers. One approach towards understanding how CNNs work and learn is to visualize individual neurons' activations and receptive fields. For example, Zeiler & Fergus [10] proposed a deconvolution network in order to visualize the original patterns that activate the corresponding activation maps. In the forward pass of a convolutional neural network, the main operations are convolution, ReLU (or another nonlinearity) and pooling. Conversely, a deconvolution network consists of the three steps of unpooling, transposed convolution (using the pre-trained weights of the forward pass, transposed for convolution), and the ReLU operation. Yosinski et al. [23] developed two tools for understanding deep convolutional neural networks. The first of these tools is designed to visualize the activation maps at different layers for a given input image. The second tool aims to estimate the input pattern to which a network is maximally attuned for a given object class. In practice, the last layer of a deep classification network typically features one neuron per object class. Yosinski et al. propose to use gradient ascent (with regularization) to find the input image that maximizes the output of a specific neuron (i.e., for a specific object class). Hence, the method derives the optimum input that appeals to the network for a specific class.

Both visualization methods discussed above are essentially qualitative. In contrast, Bau et al. [11] proposed a quantitative method to give each activation map a semantic meaning. In their work, they proposed a dataset with 6 image categories and 63,305 images for network dissection, where each image is labeled with pixel-wise semantic meaning. First, they forward all images in the dataset through a pre-trained deep model. For each activation map inside the model, different inputs produce different patterns. Then, they compute the distribution of each unit's activation map over the whole dataset, and determine a threshold for each unit based on its activation distribution. With the threshold for each unit, the activation map for each input image is quantized into a binary map. Finally, they compute the intersection over union (IOU) between the quantized activation map and the labeled ground truth to determine what objects or object parts that unit is detecting.

Although these approaches provide useful insight into the workings of deep neural networks, they are ill-suited for understanding deep saliency networks: while it is reasonable to expect that neurons in a dog/cat classifier will encode patterns characteristic of dogs and cats, a saliency model is expected to encode not only salient patterns but also non-salient ones. For this reason, we propose to use the normalized scan-path saliency (NSS) [24] score to determine whether individual neurons act as negative or positive predictors of gaze in the network. Moreover, in order to interpret what has been learnt as salient by the model, we use the network dissection approach of Bau et al. [11] to highlight what objects or object parts neurons in our saliency models are implicitly attuned to.

3 Methodology

This section will first introduce the proposed simplified architecture for saliency estimation. In a second part, we then describe how to visualize and analyze deep saliency models.

3.1 Gaze prediction

The whole architecture of our network is illustrated in Figure 1. The input is first processed by the encoder network and represented by a feature tensor whose shape is given by the number of locations in the feature tensor and the dimension of each location. In our model, we use the first 5 convolutional blocks of a pre-trained VGG16 [15] network (we removed the last pooling layer and kept the first 4 pooling layers in the encoder) to initialize the feature extraction part, and fine-tune it during training. The input is resized to a fixed resolution, which in turn fixes the shape of the feature tensor.

Figure 1: The encoder and residual-decoder architecture of our network.

After feature extraction, the feature tensor is fed into the residual decoder. The decoder consists of four blocks, where each block upsamples the feature tensor once to recover the resolution lost in the encoding stage. Each block follows the same three steps: a convolution for dimension reduction, a conventional convolution, and a deconvolution to recover the resolution lost in the encoder due to pooling.
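As a rough illustration of the bookkeeping described above, the following Python sketch traces the tensor shape through four decoder blocks. It assumes that each block halves the number of feature maps and that its deconvolution doubles the spatial resolution. The function name `decoder_shapes` and the 240×320 input size are hypothetical and chosen only for illustration; the 512 channels correspond to the output of the last convolutional block of VGG16.

```python
def decoder_shapes(enc_shape, num_blocks=4):
    """Trace (height, width, channels) through the decoder blocks.

    Assumes each block halves the number of feature maps and that its
    deconvolution doubles the spatial resolution.
    """
    h, w, c = enc_shape
    shapes = [(h, w, c)]
    for _ in range(num_blocks):
        c //= 2               # dimension-reduction convolution
        h, w = 2 * h, 2 * w   # deconvolution undoes one pooling step
        shapes.append((h, w, c))
    return shapes


# Hypothetical 240x320 input: four 2x2 poolings give a 15x20x512 encoder output.
shapes = decoder_shapes((240 // 16, 320 // 16, 512))
for shape in shapes:
    print(shape)
```

Under these assumptions, four blocks are exactly what is needed to undo the four pooling layers kept in the encoder.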

In each block, the feature tensor from the previous block is first processed by a dimension-reduction convolutional layer to reduce the number of feature maps. In our model, we halve the number of feature maps in each block of the decoder.

Then, the reduced feature tensor is passed through a conventional convolutional layer for further processing.

Finally, the two processed tensors (the output of the dimension-reduction layer and the output of the conventional convolutional layer) are added together and sent to a deconvolutional layer, which increases the tensor resolution and generates the block's output tensor.
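The three steps of a decoder block can be summarized compactly. The notation below is assumed for illustration only ($F_{k-1}$ for the block input, $F^{r}_{k}$ and $F^{c}_{k}$ for the two intermediate tensors, $F_{k}$ for the block output, and $C_k$ for the number of feature maps) and is not necessarily the notation of the original equations:

```latex
\begin{align}
F^{r}_{k} &= \mathrm{conv}_{\mathrm{reduce}}(F_{k-1}), \qquad C_k = C_{k-1}/2, \\
F^{c}_{k} &= \mathrm{conv}(F^{r}_{k}), \\
F_{k} &= \mathrm{deconv}(F^{r}_{k} + F^{c}_{k}).
\end{align}
```

The addition in the last line is what gives the block its residual-like character: the conventional convolution only has to learn a correction to the reduced tensor.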


The kernel size was set to for convolutional layers and

for deconvolutional layers. Zero-padding was used to preserve the input’s scale. The last layer of the decoder is a

convolutional layer, which transforms the (output of last deconvolutional layer) activation maps into the saliency map. No further processing was implemented.

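To train our model, we propose a new pixel-wise loss function, the exponential absolute distance (EAD), formulated as follows:

$$\mathrm{EAD} = \frac{1}{N} \sum_{i=1}^{N} \left( \exp\left(|p_i - g_i|\right) - 1 \right)$$

where $N$ is the number of pixels in the gaze map, and $p_i$ and $g_i$ are the prediction and the ground truth at the $i$-th pixel.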

Figure 2: Properties of different loss functions. (a) 3 loss functions. (b) EAD loss map. (c) BCE loss map.

Compared to the L2 distance, the EAD has a better gradient when the absolute difference is small. Compared to the L1 distance, which is linear in the absolute difference, the EAD gives a larger penalty when the difference is large. In contrast to the EAD, the BCE loss proposed in [12] yields a non-zero loss even for perfect predictions (as illustrated in Figure 2). The unbounded nature of the BCE also requires the application of an additional sigmoid function to produce pixel-wise saliency values in the range [0,1]. The model is trained using Tensorflow [25] with the Adam [26] optimizer. The initial learning rate is decayed by a constant factor after each training epoch.
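The properties claimed for the different pixel-wise losses can be checked numerically. The NumPy sketch below is illustrative only: it assumes that the EAD is the mean of exp(|p - g|) - 1 over pixels, a form inferred from the name of the loss and from the properties described in the text, and the function names are hypothetical. It is not the training code of the model.

```python
import numpy as np


def l1_loss(p, g):
    return np.mean(np.abs(p - g))


def l2_loss(p, g):
    return np.mean((p - g) ** 2)


def ead_loss(p, g):
    # Assumed form: mean of exp(|p - g|) - 1, which is zero for a perfect prediction.
    return np.mean(np.exp(np.abs(p - g)) - 1.0)


def bce_loss(p, g, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return np.mean(-(g * np.log(p) + (1.0 - g) * np.log(1.0 - p)))


g = np.array([0.1, 0.5, 0.9])  # toy ground-truth saliency values in (0, 1)
perfect = g.copy()             # a prediction identical to the ground truth

print("EAD, perfect prediction:", ead_loss(perfect, g))
print("BCE, perfect prediction:", bce_loss(perfect, g))
```

With a non-binary ground truth, the BCE of a perfect prediction equals the entropy of the target values and is therefore strictly positive, whereas the assumed EAD vanishes.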

3.2 Model visualization

As discussed before, one important question is what a deep saliency model learns that allows it to outperform hand-crafted shallow models based on psychological theories. In other words, what specific salient patterns are learnt by the model? One hypothesis is that such deep networks encode semantic information about saliency that goes beyond classical centre-surround assumptions.

Figure 3: The visualization method to compute the NSS score for each unit activation map.

Here, we propose to use the actual saliency evaluation metric, the normalized scan-path saliency (NSS) score, to visualize and understand the inner representations of deep saliency models. First, we feed all images in a dataset with fixations (the MIT1003 dataset [27]) to the pre-trained deep saliency model. For each image, the model produces a set of activation maps as the output of the feature extraction part (in our model, this is the output of the encoder), one per neuron. Each activation map has a unique pattern for a given input image. We rescale each activation map to the input's scale and use it as a saliency map to compute the NSS score of the corresponding neuron for every image in the dataset. Since a convolutional feature channel can only correspond to a certain type of visual pattern [28, 29], we take the top 5 NSS scores of each unit's activation map over the dataset and use their mean as the mean NSS score of that unit. Each neuron's activation map therefore has a mean NSS score, which indicates its correlation with human gaze locations. We then normalize the mean NSS scores across all activation maps to a common range and set a threshold on the normalized score. Neurons whose mean NSS score is above the threshold are identified as positive fixation detectors.
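The scoring procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the activation maps are assumed to be already resized to the image resolution, the normalization across units is assumed to be min-max scaling, and the threshold is a free parameter.

```python
import numpy as np


def nss(saliency, fixations):
    """Normalized scan-path saliency: mean of the z-scored map at fixated pixels.

    saliency: 2-D float array; fixations: 2-D boolean array of the same shape.
    """
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return s[fixations].mean()


def mean_top_k_nss(per_image_scores, k=5):
    """Mean of the k highest per-image NSS scores of one activation map."""
    top = np.sort(np.asarray(per_image_scores, dtype=float))[-k:]
    return top.mean()


def positive_detectors(mean_scores, threshold):
    """Min-max normalize the mean scores and return the indices above threshold."""
    m = np.asarray(mean_scores, dtype=float)
    norm = (m - m.min()) / (m.max() - m.min())
    return np.flatnonzero(norm > threshold)


# Toy example: an activation map that peaks exactly at the single fixation.
act = np.zeros((4, 4))
act[1, 2] = 1.0
fix = np.zeros((4, 4), dtype=bool)
fix[1, 2] = True
score = nss(act, fix)
print("NSS of the toy activation map:", score)
```

Taking the mean of only the top few scores rewards units that respond very selectively, which matches the remark that a channel corresponds to a single type of visual pattern and is therefore expected to fire on only a few images.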

After selecting the positive fixation detectors, we use network dissection [11] (using the same method and dataset as the authors) to reveal what kinds of objects or object parts those positive fixation detectors are attuned to. We proceed as follows. For every image in the Broden dataset [11], each unit's activation map has a unique pattern. For each neuron's activation map, we compute the distribution of its values over the whole dataset and find the threshold that the activation values exceed with a fixed probability. Then, all activation maps for all images are scaled to the input size and quantized into binary maps using this threshold. Finally, the IOU [11] is computed for each activation map to determine what sort of objects or object parts it is attuned to (more details are given in [11]). In our work, we only show the objects or object parts for the positive fixation detectors.
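The thresholding and IOU steps of the dissection procedure can be sketched as follows. The sketch is a simplification under stated assumptions: the function names are hypothetical, the default `top_fraction` of 0.005 follows the value reported by Bau et al. [11] (whether the present work uses the same value is an assumption), and the IOU is computed for a single image, whereas network dissection accumulates intersections and unions over the whole dataset.

```python
import numpy as np


def unit_threshold(activations, top_fraction=0.005):
    """Value exceeded by only `top_fraction` of a unit's activations.

    activations: array holding the activation values of one unit over the dataset.
    """
    return np.quantile(activations, 1.0 - top_fraction)


def iou(binary_map, label_mask):
    """Intersection over union of two boolean masks of the same shape."""
    inter = np.logical_and(binary_map, label_mask).sum()
    union = np.logical_or(binary_map, label_mask).sum()
    return inter / union if union > 0 else 0.0


# Toy example with made-up data: one 4x4 activation map and one concept mask.
act = np.arange(16, dtype=float).reshape(4, 4)
t = unit_threshold(act, top_fraction=0.25)  # coarse fraction for the toy case
binary = act > t
mask = np.zeros((4, 4), dtype=bool)
mask[3, :] = True                           # the concept covers the last row
score = iou(binary, mask)
print("IOU on the toy example:", score)
```

A unit is then associated with the concept for which this overlap is highest, which is how a label such as "head" or "dog" is attached to a positive fixation detector.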

4 Saliency prediction performance

In this section, we first introduce the datasets used in the experiments and then show the performance of different pixel-wise loss functions as well as the comparison between our model and other state-of-the-art models.

4.1 Datasets

SALICON [6]: The SALICON dataset is the largest dataset in the field of visual saliency. Saliency maps are estimated from human observers’ mouse clicks gathered over 20,000 images, with 10,000 images in the training set, 5,000 images in the validation dataset, and another 5,000 images in the testing dataset. We use the SALICON training and validation datasets to train and validate our model.

MIT1003 [27]: This dataset includes gaze data of 15 subjects using an eye tracker over 1,003 images. It is used in the visualization part. To compare our model with other state-of-the-art models on MIT300 benchmark, we also randomly choose 900 images from this dataset to fine-tune our model and another 103 images for testing the performance of different loss functions.

MIT300 [30]: This dataset is the standard benchmark dataset for human gaze prediction. It includes the gaze data of 39 subjects over 300 images.

Broden [11]: This dataset contains 63,305 images drawn from four subsets, ADE20K (22,211 images), Opensurfaces (25,351 images), DTD (5,639 images), and PASCAL (10,104 images), all with pixel-wise semantic labels. It is used for network dissection.

4.2 Model performance

Table 2 records the accuracy of the predicted saliency maps according to a range of standard evaluation metrics: Normalized Scan-path Saliency (NSS), Cross Correlation (CC), Area Under the ROC Curve (AUC) and Similarity (Sim) (we refer to [24] for a discussion of saliency metrics). The accuracy is recorded for different pixel-wise loss functions. We can see that our proposed EAD loss achieves the best performance among all pixel-wise loss functions on three of the four metrics (NSS, CC and Sim), while the L2 loss obtains the highest AUC. Furthermore, the L2 loss function, which is used as a baseline loss function in many deep gaze prediction models [9, 12], also shows good performance, which demonstrates that the proposed architecture is competitive regardless of the loss function and despite its comparative simplicity. This experiment was performed on the MIT1003 dataset.

| Loss function | NSS | CC | AUC | Sim |
|---|---|---|---|---|
| L1 | 2.388 | 0.684 | 0.855 | 0.556 |
| L2 | 2.389 | 0.686 | 0.881 | 0.532 |
| BCE | 2.083 | 0.614 | 0.851 | 0.488 |
| EAD (proposed) | 2.404 | 0.701 | 0.869 | 0.570 |

Table 2: The performance of different loss functions on MIT1003 testing dataset.
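For readers unfamiliar with the metrics reported here, two of them are simple enough to sketch directly. The following NumPy code follows the definitions commonly used in saliency benchmarks (CC as the Pearson correlation between two maps, Sim as the histogram intersection of the two maps normalized to sum to one). The function names are hypothetical, NSS and AUC are omitted, and this is not the evaluation code used to produce the tables.

```python
import numpy as np


def cc(pred, gt):
    """Pearson linear correlation coefficient between two saliency maps."""
    return np.corrcoef(pred.ravel(), gt.ravel())[0, 1]


def sim(pred, gt):
    """Similarity: histogram intersection of the maps, each normalized to sum to 1."""
    p = pred / pred.sum()
    g = gt / gt.sum()
    return np.minimum(p, g).sum()


rng = np.random.default_rng(0)
gt = rng.random((8, 8))  # made-up ground-truth map for illustration

# Identical maps reach the maximum of both metrics.
print(cc(gt, gt), sim(gt, gt))
```

Both metrics compare the predicted map with a continuous ground-truth density, in contrast to NSS and AUC, which are evaluated at discrete fixation locations.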

Tables 3 and 4 compare our model's performance with state-of-the-art models. One can see that our model performs on par with or better than all single-architecture models (Table 1), especially when considering the NSS score, which is the metric of choice for ranking saliency models [7]. In Table 4, we see that our model's performance comes close to that of considerably more complex, multi-architecture approaches such as Deepfix [14] and DSCLRCN [19]. Figure 4 is a qualitative comparison of the saliency predicted by different models on some example images.

| Model | NSS | CC | AUC | Sim |
|---|---|---|---|---|
| Salicon* [8] | 1.557 | 0.659 | 0.808 | 0.600 |
| SalGAN [12] | 1.816 | 0.844 | 0.857 | 0.728 |
| Deepnet [13] | 1.555 | 0.763 | 0.840 | 0.639 |
| proposed | 1.896 | 0.871 | 0.852 | 0.760 |

Table 3: Comparison of different models on LSUN 2017 saliency prediction challenge [31] (SALICON testing dataset). *As the code for Salicon is not available, we use the open source implementation [32].

| Model | NSS | CC | AUC | Sim |
|---|---|---|---|---|
| DSCLRCN [19] | 2.35 | 0.80 | 0.87 | 0.68 |
| Deepfix [14] | 2.26 | 0.78 | 0.87 | 0.67 |
| proposed | 2.17 | 0.74 | 0.83 | 0.60 |
| Salicon [8] | 2.12 | 0.74 | 0.87 | 0.67 |
| SalGAN [12] | 2.04 | 0.73 | 0.86 | 0.63 |
| PDP [9] | 2.05 | 0.70 | 0.85 | 0.60 |
| DVA [18] | 1.98 | 0.68 | 0.85 | 0.58 |

Table 4: The comparison of different models on MIT300 dataset.
Figure 4: Qualitative comparisons of different models (GT stands for Ground Truth). *As the code for Salicon is not available, we use the open source implementation [32].

5 Visualizing salient patterns

This section analyses what is learnt by state-of-the-art deep saliency models using the visualization tools discussed in section 3.2. The model proposed in section 3.1 is trained on the SALICON training dataset as before, but without fine-tuning it on the MIT1003 dataset, to avoid overfitting when computing the NSS score for each activation map. In addition to the proposed model, we apply a similar analysis to three deep saliency models for which the code is publicly available: Deepnet, SalGAN and OpenSalicon. In Deepnet [13], the first five convolutional layers were taken as the feature extraction part. In SalGAN [12], the encoder was treated as the feature extraction part. In OpenSalicon [32], both the coarse scale (Saliconc) and the fine scale (Saliconf) were visualized.

Figure 5: Examples of patterns produced for the activation map 115 of the proposed model, with a mean NSS score of 4.5808.
Figure 6: Example of patterns produced for the activation map 221 of SalGAN, with a mean NSS score of 4.6019.
Figure 7: Example of patterns produced for the activation map 434 of Salicon at fine resolution (Saliconf), with a mean NSS score of 5.3637.
Figure 8: Example of patterns produced for the activation map 232 of Salicon at coarse resolution (Saliconc), with a mean NSS score of 5.0027.
Figure 9: Example of patterns produced for the activation map 162 of Deepnet, with a mean NSS score of 4.0101.

Figures 5 to 9 show example patterns for activation maps with high mean NSS scores (far beyond the model’s performance) in different models. The patterns in these figures are generated as the product of an input image with the activation map, cropped to the active areas for legibility. From those figures, we can see that most activation maps with high mean NSS score focus on a unique object or part of an object (head or face, etc).

Figure 10 shows example patterns for activation maps with medium mean NSS scores. In these examples, we can see that the active regions are less clearly focused on a single object or part. More importantly, for activation maps with low mean NSS score, shown in Figure 11, the patterns show a negative central bias, clearly inhibiting the central part of the image. Note that the models analyzed in this figure do not include an explicit central bias constraint, therefore this bias has been learnt solely from the training data and appears to be encoded by low NSS neurons.

Figure 10: Example patterns for activation maps with medium mean NSS score, drawn for different models. (a) Deepnet. (b) SalGAN. (c) Saliconc. (d) Saliconf. (e) ours.

Figure 11: Example patterns for activation maps with low mean NSS score, drawn for different models. (a) Deepnet. (b) SalGAN. (c) Saliconc. (d) Saliconf. (e) ours.

Since some activation maps (positive fixation detectors) have very high mean NSS scores, we investigate the relationship between the model performance (here we use the NSS score on the MIT1003 dataset as the model performance) and the proportion of positive fixation detectors. Table 5 records this ratio for all analyzed models. We can see that models with better overall performance have higher proportions of fixation detectors (with a correlation coefficient of 0.94), and our model has the highest ratio of positive fixation detectors amongst all analyzed models. Note that in all cases, the ratio of positive fixation detectors remains small.

| Model | # activation maps | # positive detectors | ratio | NSS |
|---|---|---|---|---|
| Deepnet [13] | 512 | 5 | 1% | 1.68 |
| SalGAN [12] | 512 | 14 | 2.7% | 2.15 |
| OpenSalicon [8, 32] | 1024 | 19 | 1.8% | 1.92 |
| proposed | 512 | 21 | 4.1% | 2.21 |

Table 5: The relationship between the model performance and the number of positive fixation detectors inside the model. The correlation between NSS and ratio is 0.94.
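The ratios and the correlation reported in Table 5 can be recomputed from the counts in the table. The short NumPy sketch below assumes that the reported coefficient is the Pearson correlation between the detector ratio and the NSS score, since the text does not name the coefficient explicitly.

```python
import numpy as np

# (number of activation maps, number of positive detectors, NSS), taken from Table 5.
models = {
    "Deepnet": (512, 5, 1.68),
    "SalGAN": (512, 14, 2.15),
    "OpenSalicon": (1024, 19, 1.92),
    "proposed": (512, 21, 2.21),
}

ratios = np.array([pos / total for total, pos, _ in models.values()])
nss_scores = np.array([score for _, _, score in models.values()])

# Pearson correlation between the ratio of positive detectors and the NSS score.
r = np.corrcoef(ratios, nss_scores)[0, 1]

for name, ratio in zip(models, ratios):
    print(f"{name}: {100 * ratio:.1f}% positive detectors")
print(f"Pearson correlation: {r:.2f}")
```

With only four models the estimate is necessarily coarse, so the coefficient is best read as an indication of a trend rather than a precise measurement.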

After determining which activation maps are positive fixation detectors inside the deep models, the question remains of what those detectors are attuned to (i.e., which objects and object parts). For this purpose, we measure the normalized mean detection frequency of a class as the ratio:

$$f_c = \frac{n_c}{N_c}$$

where $N_c$ is the total number of occurrences of the class $c$ in the dataset and $n_c$ is the number of detected instances.

Figure 12: The top 15 objects (left) and parts (right) statistics for different saliency models' positive fixation detectors. (a) Deepnet. (b) SalGAN. (c) Saliconc. (d) Saliconf. (e) ours.

Figure 12 records the normalized mean detection frequency for the analyzed models, for object classes (left) and parts (right) in the Broden dataset. The left-hand side of the figure shows the results of this analysis for object labels: most positive fixation detectors are attuned to common animals (dog, cat, cow, sheep, and person). The reason might be that these models are fine-tuned from image recognition models, which have already learnt rich object classes. This assumption is supported by the results on Deepnet, which has been trained from scratch without pre-training, and for which the first four object classes (motorbike, ball, bus, airplane) are not those animals. Interestingly, the detectors in Saliconc and Saliconf are attuned to different visual classes; the coarse model (Saliconc) appears to capture more common object classes than the fine model, as evidenced by higher scores.

The right-hand side shows similar results, but using part labels instead of object labels. In these graphs, we see that almost all positive fixation detectors focus on the head or head parts (i.e., head, hair, torso, ear, and neck).

6 Conclusion

This paper set out to investigate the reason behind the high performance achieved by deep saliency models, compared to shallow models using hand-crafted features based on theoretical considerations about saliency (e.g., center-surround difference). To this end, we proposed a simple residual-like decoder combined with a pixel-wise exponential absolute distance loss function. The proposed loss function achieves the best results among all pixel-wise loss functions on most metrics, and the model's performance is on par with or better than that of state-of-the-art saliency models, despite being based on a simple architecture. Furthermore, we proposed a visualization method for deep gaze prediction models, and conducted a comprehensive study to reveal the inner representations of those models.

Our analyses allow us to draw three conclusions about what is learned by deep saliency models. First, better performing models have developed higher proportions of deep neurons highly predictive of human gaze, and those neurons are attuned to very specific visual patterns. Second, another category of neurons, which are not predictive of human gaze, appear to encode a form of negative central bias into the model. Third, we have demonstrated that the predictive neurons are attuned to clear semantic categories such as animals (dogs, cats), objects (motorbike, ball) and parts (head, hair). These results provide evidence that the higher prediction performance achieved by deep saliency models is likely caused by the additional semantic content encoded by such networks, allowing the models to capture the fact that specific visual classes are salient in their own right, in contrast to shallow saliency models that rely on low level perceptual patterns (such as center-surround difference). This hints that saliency, as experienced by humans, is a process that likely involves high-level world knowledge in addition to low-level perceptual cues.

We believe that our results can be useful to measure the gap between current saliency models and the human inter-observer model and to build new models to close this gap. We will share our code to facilitate future research in this direction.