Top-Down Saliency Detection Driven by Visual Classification

09/15/2017 ∙ by Francesca Murabito, et al. ∙ University of Catania

This paper presents an approach for top-down saliency detection guided by visual classification tasks. We first learn how to compute visual saliency when a specific visual task has to be accomplished, as opposed to most state-of-the-art methods, which assess saliency merely through bottom-up principles. Afterwards, we investigate if and to what extent visual saliency can support visual classification in nontrivial cases. To achieve this, we propose SalClassNet, a CNN framework consisting of two jointly trained networks: a) the first one computing top-down saliency maps from input images, and b) the second one exploiting the computed saliency maps for visual classification. To test our approach, we collected a dataset of eye-gaze maps, using a Tobii T60 eye tracker, by asking several subjects to look at images from the Stanford Dogs dataset, with the objective of distinguishing dog breeds. Performance analysis on our dataset and other saliency benchmarking datasets, such as POET, showed that SalClassNet outperforms state-of-the-art saliency detectors, such as SalNet and SALICON. Finally, we analyzed the performance of SalClassNet in a fine-grained recognition task and found that it generalizes better than existing visual classifiers. The achieved results thus demonstrate that 1) conditioning saliency detectors with object classes reaches state-of-the-art performance, and 2) explicitly providing top-down saliency maps to visual classifiers enhances classification accuracy.




1 Introduction

Computer vision and machine learning methods have long attempted to emulate humans while performing visual tasks. Despite these ambitions, the majority of existing automated methods rely on a common schema, i.e., learning low- and mid-level visual features for a given task, often without taking into account the peculiarities of the task itself. One of the most relevant examples of a task-driven human process is visual attention, i.e., gating the visual information to be processed by the brain according to the intrinsic visual characteristics of scenes (bottom-up process) and to the task to be performed (top-down process). Saliency detection building only on the bottom-up process mainly employs low-level visual cues, modeling unconscious vision mechanisms, and shows severe limitations in task-oriented computer vision methods. For example, traditional saliency methods (Itti and Koch, 2000) miss objects of interest in highly cluttered backgrounds, since they detect visual stimuli that are often unrelated to the task to be accomplished, as shown in Fig. 1. Analogously, image classifiers fail on cluttered images, as they tend to extract low- and mid-level visual descriptors and match them with learned data distributions without focusing on the most salient image parts.

Figure 1: First column: eye fixations in free-viewing experiments on images with multiple objects. Some of the salient regions cannot be used for dog breed classification. Second column: shift of eye fixations when subjects are asked to guess dog breeds.

Under this scenario, the contribution of this paper is twofold: a) we present a method for saliency detection guided by a classification task; and b) we demonstrate that exploiting task-based saliency maps improves classification performance. More specifically, we propose and train, in an end-to-end fashion, a convolutional neural network, SalClassNet, consisting of two parts: the first one generates top-down (classification-guided) saliency maps from input images, while the second one takes images and the learned maps as input to perform visual categorization.

We tested the saliency detector of SalClassNet over saliency benchmarks, where it significantly outperformed existing methods such as SalNet (Pan et al., 2016) and SALICON (Huang et al., 2015). In particular, we demonstrate how the propagation of a mixed saliency/classification loss throughout the upstream SalClassNet saliency detector is key to learning task-guided saliency maps that better detect the most discriminative features in the categorization process.

As for evaluating the performance of SalClassNet for visual categorization, we tested it on fine-grained classification tasks over the Stanford Dogs (Khosla et al., 2011), the CUB-200-2011 (Wah et al., 2011), and the Oxford Flower 102 (Nilsback and Zisserman, 2008) datasets, showing that explicitly providing visual classifiers with saliency leads to improved performance.

As an additional contribution, we release our saliency dataset, containing about 10,000 maps recorded from multiple users while performing visual classification on the 120 Stanford Dogs classes, together with the SalClassNet Torch code and all trained models.

2 Related work

Visual attention in humans can be seen as the integration between a) an early bottom-up unconscious process where attention is principally guided by coarse visual stimuli, which can be local (e.g., center-surround mechanisms) or global (dependent on the context); and b) a late top-down process, which biases the observation towards those regions that consciously attract users’ attention according to a specific visual task. While the former has been extensively researched in the computer vision field, with a significant number of proposed saliency detection methods (Li and Yu, 2016b; Kümmerer et al., 2015; Huang et al., 2015; Pan et al., 2016; He et al., 2015; Liu et al., 2015; Li and Yu, 2015; Zhao et al., 2015; Han et al., 2016), top-down processes have received much less attention (Peters and Itti, 2007; Judd et al., 2009; Itti, 2012; Zhu et al., 2014), mainly because high-level cognitive processes are harder to emulate than low-level cues based on orientation, intensity and color (Itti and Koch, 2000). However, understanding the processes behind task-controlled visual attention may be of crucial importance to make machines see and understand the visual world as humans do and to solve complex vision tasks, such as recognition of multiple objects in cluttered scenes (Walther et al., 2005).

Recently, the rediscovery of convolutional neural networks and their high performance on visual tasks have led to the development of deep saliency detection networks that either adopt multi-scale patches for global/coarser and local/finer feature extraction for further saliency assessment (He et al., 2015; Liu et al., 2015; Zhao et al., 2015; Li and Yu, 2015; Lin et al., 2014; Han et al., 2016; Li and Yu, 2016b; Huang et al., 2015; Liu et al., 2015; Shen and Zhao, 2014; Wang et al., 2015; Zhao et al., 2015; Tang and Wu, 2016; Chen et al., 2016) or learn, in an end-to-end fashion, saliency maps as in Huang et al. (2015); Pan et al. (2016); Li and Yu (2016a). In particular, the recent work by Pan et al. (2016) presents a fully-convolutional CNN (partly trained from scratch and partly re-using low-level layers from existing models) for saliency prediction; another fully-convolutional architecture is the one presented in Huang et al. (2015), which processes images at two different scales and is based on deep neural networks trained for object recognition; the latter was used as the basis for our work, as described later.

Lately, the idea of using saliency for improving classification performance has gained significant attention from the computer vision community, resulting in saliency detection models that have been integrated into visual classification methods. In Ren et al. (2015), saliency maps are employed to weigh features both in the learning and in the representation steps of a sparse coding process, whereas in Zhang et al. (2016b) CNN-based part detections are encoded via Fisher Vectors and the importance of each descriptor is assigned through a saliency map. Ba et al. (2015) extended the recurrent attention model (RAM) presented in Mnih et al. (2014) (a model based on a combination of recurrent neural networks and reinforcement learning to identify glimpse locations) by training it to detect and classify objects after identifying a fixed number of glimpses.

Similarly, recent saliency detection methods have been fed with high-level information in order to include top-down attention processes. In Cao et al. (2015), given the class label as prior, the parameters of a new feedback layer are learned to optimize the target neuron output by filtering out noisy signals; in Zhang et al. (2016a) a new backpropagation scheme, “Excitation Backprop”, based on a probabilistic version of the Winner-Take-All principle, is introduced to identify task-relevant neurons for weakly-supervised localization. Our saliency maps differ from the ones computed by those methods since the only top-down signal introduced in our training is a class-agnostic classification loss; hence, our maps are able to highlight those areas which are relevant for classifying generic images.

A work similar in spirit to ours is Almahairi et al. (2016), where a low-capacity network initially scans the input image to locate salient regions using a gradient entropy with respect to feature vectors; then, a high-capacity network is applied to the most salient regions and, finally, the two networks are combined through their top layers in order to classify the input image. Our objective, however, is to perform end-to-end training, so that the classification error gradient can directly affect the saliency generation process. Given these premises, the most interesting saliency network architectures for our purpose are the fully-convolutional ones, whose output can be seamlessly integrated into a larger framework with a cascading classification module.

Tab. 1 summarizes the results of state-of-the-art fully-convolutional saliency networks on a set of commonly-employed datasets for saliency detection benchmarking, namely SALICON validation and test sets (Huang et al., 2015), iSUN validation and test sets (Xu et al., 2015) and MIT300 (Bylinskii et al., ). In this work, we focus our attention, both as building blocks and evaluation baselines, on the SALICON (Huang et al., 2015) and SalNet (Pan et al., 2016) models, thanks to code availability and their fully-convolutional nature.

Method N. Layers Framework Training Dataset SALICON Test SALICON Val iSUN Test iSUN Val MIT300
JuntingNet 5 Lasagne SALICON
  SALICON Test: CC = 0.60, Shuffled-AUC = 0.67, AUC Borji = 0.83
  SALICON Val: CC = 0.58, Shuffled-AUC = 0.67, AUC Borji = 0.83
  iSUN Test: CC = 0.82, Shuffled-AUC = 0.67, AUC Borji = 0.85
  iSUN Val: CC = 0.59, Shuffled-AUC = 0.64, AUC Borji = 0.79
  MIT300: CC = 0.53, Shuffled-AUC = 0.64, AUC Borji = 0.78
SalNet 10 Caffe SALICON
  SALICON Test: CC = 0.62, Shuffled-AUC = 0.72, AUC Borji = 0.86
  SALICON Val: CC = 0.61, Shuffled-AUC = 0.73, AUC Borji = 0.86
  iSUN Test: CC = 0.62, Shuffled-AUC = 0.72, AUC Borji = 0.86
  iSUN Val: CC = 0.53, Shuffled-AUC = 0.63, AUC Borji = 0.80
  MIT300: CC = 0.58, Shuffled-AUC = 0.69, AUC Borji = 0.82
SALICON
  CC = 0.74, Shuffled-AUC = 0.74, AUC Borji = 0.85
DeepGaze 5 Not Available MIT1003
  CC = 0.48, Shuffled-AUC = 0.66, AUC Borji = 0.83
DeepGaze 2 19 Web Service SALICON - MIT1003
  CC = 0.51, Shuffled-AUC = 0.77, AUC Borji = 0.86
ML-NET 19 + 2 Theano SALICON
  CC = 0.76, Shuffled-AUC = 0.78
  CC = 0.69, Shuffled-AUC = 0.70, AUC Borji = 0.77
DeepFix 20 Not Available SALICON
  CC = 0.78, Shuffled-AUC = 0.71, AUC Borji = 0.80
eDN Ensemble Sthor MIT1003
  CC = 0.45, Shuffled-AUC = 0.62, AUC Borji = 0.81
PDP 16 + 3 Not Available SALICON
  CC = 0.77, Shuffled-AUC = 0.78, AUC Borji = 0.88
  CC = 0.74, Shuffled-AUC = 0.78
  CC = 0.70, Shuffled-AUC = 0.73, AUC Borji = 0.80
Table 1: A summary of state-of-the-art fully-convolutional methods and their results, according to the most common metrics, on several saliency datasets. Dataset references: SALICON Test and Val: Jiang et al. (2015); iSUN Test and Val: Xu et al. (2015); MIT300: Bylinskii et al. Method references: JuntingNet and SalNet: Pan et al. (2016); SALICON: Huang et al. (2015); DeepGaze: Kümmerer et al. (2015); DeepGaze2: Kümmerer et al. (2016); ML-NET: Cornia et al. (2016); DeepFix: Kruthiventi et al. (2017); eDN: Vig et al. (2014); PDP: Jetley et al. (2016).


3 SalClassNet: A CNN model for top-down saliency detection

Figure 2: Architecture of the proposed model, SalClassNet, for saliency detection guided by a visual classification task. Input images are processed by a saliency detector, whose output, together with the input images, is fed to a classification network with 4-channel first-layer kernels that process image color and saliency and provide image classes as output.

The general architecture of our network is shown in Fig. 2 and is made up of two cascaded modules: a saliency detector and a visual classifier, which are jointly trained in a multi-loss framework.

3.1 Top-down saliency detection network

Although we will discuss the details of the employed saliency dataset and its generation process in Sect. 4, it is necessary to introduce some related information at this stage, which is important to understand the overall model.

In the dataset generation protocol, human subjects were explicitly asked to look at images and to guess their visual classes (e.g., dog breeds). Therefore, our experiments aimed to enforce top-down saliency driven by a specific classification task, rather than bottom-up saliency. In other words, instead of emphasizing the location of image regions which are visually interesting per se (which, of course, may include the target object), our visual attention maps focus on the location of features needed for identifying the target classes, ignoring anything else that may be salient but not relevant to the classification task. Hence, our saliency detector has to be able, given an input image, to produce a map of the most salient image locations useful for classification purposes.

To accomplish that, we propose a CNN-based saliency detector composed of thirteen convolutional and five max pooling layers taken from VGG-19 (Simonyan and Zisserman, 2014). The output of the last pooling layer, i.e., 512×10×10 feature maps (for a 3×299×299 input image), is then processed by a 1×1 convolution to compute a saliency score for each “pixel” in the feature maps of the previous layer, producing a single-channel map. Finally, in order to generate the input for the subsequent classification network, the 10×10 saliency maps are upsampled to 299×299 (which is the default input size of the next classification module) through bilinear interpolation.
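
The last two stages of the detector (per-location scoring and upsampling) can be illustrated with a small standard-library sketch. It is not the released Torch code: the feature values and weights are random stand-ins, only 8 channels are used instead of 512 to keep it fast, and the corner-aligned sampling convention in `bilinear_upsample` is an assumption, since the text does not state which convention was used.

```python
import random

def conv1x1(features, weights, bias=0.0):
    """Collapse a C x H x W feature volume into one H x W score map."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [[bias + sum(weights[c] * features[c][y][x] for c in range(channels))
             for x in range(width)]
            for y in range(height)]

def bilinear_upsample(grid, out_h, out_w):
    """Resize a 2-D grid with bilinear interpolation (corner-aligned, assumed)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row back to a fractional source row.
        sy = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        fy = sy - y0
        row = []
        for j in range(out_w):
            sx = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            fx = sx - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

# Toy stand-in for the output of the last pooling layer (random values).
random.seed(0)
C, H, W = 8, 10, 10
features = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
weights = [random.uniform(-1, 1) for _ in range(C)]

coarse = conv1x1(features, weights)          # 10 x 10 saliency scores
full = bilinear_upsample(coarse, 299, 299)   # 299 x 299 map for the classifier
```

Because bilinear interpolation only blends neighbouring scores, the upsampled map carries no more information than the 10×10 one; it merely matches the spatial size expected by the classifier.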

As for the size of the output maps, it has to be noted that saliency is a primitive mechanism, employed by humans to drive attention towards objects of interest, which is evoked by coarse visual stimuli (Itti and Koch, 2000). Thus, increasing the resolution of saliency maps to identify finer image details in a visual scene is not necessary, and it may introduce noisy information that negatively affects classification performance (indeed, when we increased the saliency map size, the saliency accuracy did not improve). Therefore, in spite of the low spatial resolution of saliency maps, our experiments (see Sect. 5) show that the 10×10 feature maps are able to encode the information needed to detect salient areas and to drive a classifier with them.

3.2 Saliency-based classification network

Our visual classifier is a convolutional neural network which receives as input a 4-channel RGBS image, combining the RGB image with the corresponding saliency (S) map, and provides as output the corresponding class. The underlying idea is that the network should employ those salient regions (as indicated by the input saliency map S) which are more meaningful for classification purposes.

This network is based on the Inception network (Szegedy et al., 2015), which comprises sixteen convolutional layers and one fully connected layer followed by a final softmax layer, with the first-layer convolutional kernels modified to support the 4-channel input. In particular, the 32 3×3×3 kernels in the first layer are converted into 32 4×3×3 kernels, whose weights corresponding to the RGB channels are taken from a pre-trained version of the Inception network (see Sect. 5), whereas the new weights, corresponding to the saliency input, are randomly initialized. Since the model includes a combination of trained weights (the ones from the original Inception) and untrained weights (the ones related to the saliency channel), we set different learning rates in order to speed up the convergence of the untrained weights while not destabilizing the already learned ones.
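
The kernel conversion described above can be sketched as follows. This is an illustration under assumptions: kernels are nested Python lists indexed as [kernel][channel][row][column], the stand-in "pre-trained" values are synthetic, and the Gaussian initialization with `init_std=0.01` is a hypothetical choice, since the text only says the saliency weights are randomly initialized.

```python
import random

def extend_first_layer(rgb_kernels, init_std=0.01, seed=0):
    """Turn K x 3 x kh x kw kernels into K x 4 x kh x kw kernels.

    The three RGB slices are copied unchanged; the fourth (saliency) slice
    is drawn from a zero-mean Gaussian. init_std is an assumed value.
    """
    rng = random.Random(seed)
    extended = []
    for kernel in rgb_kernels:
        kh, kw = len(kernel[0]), len(kernel[0][0])
        saliency_slice = [[rng.gauss(0.0, init_std) for _ in range(kw)]
                          for _ in range(kh)]
        rgb_copy = [[row[:] for row in channel] for channel in kernel]
        extended.append(rgb_copy + [saliency_slice])
    return extended

# Synthetic stand-in for the 32 pre-trained 3x3x3 first-layer kernels.
pretrained = [[[[0.1 * (k + c + y + x) for x in range(3)] for y in range(3)]
               for c in range(3)] for k in range(32)]
rgbs_kernels = extend_first_layer(pretrained)
```

Keeping the RGB slices untouched means that, right after the conversion, the classifier's response to color is unchanged; only the saliency channel contributes new, small, random terms.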

3.3 Multi-loss saliency-classification training

The networks described in the previous sections are joined together into a single sequential model and trained using RGB images as input and the corresponding class labels as output. We introduced a batch normalization module between the saliency detector and the classifier, to enforce a zero-mean and unit-standard-deviation distribution at the classifier’s input. During training, we minimize a multi-loss objective function given by a linear combination of the cross-entropy classification loss $\mathcal{L}_C$ and the saliency detection loss $\mathcal{L}_S$, computed as the mean square error (MSE) of the intermediate saliency detector’s output (obtained after the last upsampling layer) with respect to the ground-truth saliency map for the corresponding input image:

$$\mathcal{L}(\mathbf{y}, \mathbf{Y}, c, \mathbf{T}) = \alpha\, \mathcal{L}_C(\mathbf{y}, c) + \mathcal{L}_S(\mathbf{Y}, \mathbf{T}) \qquad (1)$$

$$\mathcal{L}_C(\mathbf{y}, c) = -\sum_{i=1}^{n} I(i = c)\, \log y_i \qquad (2)$$

$$\mathcal{L}_S(\mathbf{Y}, \mathbf{T}) = \frac{1}{h\,w} \sum_{i=1}^{h} \sum_{j=1}^{w} \left(Y_{ij} - T_{ij}\right)^2 \qquad (3)$$

where $\mathcal{L}_C$ is the cross-entropy loss computed for the softmax output vector $\mathbf{y}$ and the correct class $c$, $n$ indicates the number of classes in the dataset, $\mathcal{L}_S$ is the mean square error loss computed on the saliency detector’s output map $\mathbf{Y}$ and the ground-truth heatmap $\mathbf{T}$, $h$ and $w$ are the height and width of the heatmap, $\alpha$ weighs the classification loss with respect to the saliency loss, and $I(p)$ is the indicator function, which returns 1 if $p$ is true; bold symbols denote vectors (lower case) and matrices (upper case).
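
A minimal numeric sketch of this multi-loss may help fix ideas. It assumes that the weight multiplies only the classification term (the form α·L_C + L_S) and uses the value 0.2 reported in the training details as a default; the function names and the toy inputs are illustrative and not taken from the released code.

```python
import math

def cross_entropy(probs, target):
    """L_C: negative log-probability assigned to the correct class."""
    return -math.log(probs[target])

def mse(pred_map, gt_map):
    """L_S: mean squared error over all h x w heatmap locations."""
    h, w = len(gt_map), len(gt_map[0])
    total = sum((pred_map[i][j] - gt_map[i][j]) ** 2
                for i in range(h) for j in range(w))
    return total / (h * w)

def multi_loss(probs, target, pred_map, gt_map, alpha=0.2):
    """Assumed combination: alpha * L_C + L_S."""
    return alpha * cross_entropy(probs, target) + mse(pred_map, gt_map)

probs = [0.7, 0.2, 0.1]              # softmax output for a 3-class toy problem
pred = [[0.0, 0.5], [1.0, 0.5]]      # predicted 2 x 2 saliency map
gt = [[0.0, 1.0], [1.0, 0.0]]        # ground-truth heatmap
loss = multi_loss(probs, 0, pred, gt)
```

In a real training loop both terms are differentiated with respect to the saliency detector's parameters, which is how the classification error reaches the upstream module.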

The adopted multi-loss affects the model in several ways. First of all, backpropagating the classification loss to the saliency detector forces it to learn saliency features useful for classification. Secondly, backpropagating the mean square error on the saliency maps ensures that the saliency detector does not degenerate into identifying generic image features, becoming a convolutional layer like any other.

Fig. 3 shows two examples of how saliency changes between training the saliency detector with the saliency loss $\mathcal{L}_S$ only and driving it also with the classification loss $\mathcal{L}_C$: the saliency is shifted from generic scene elements to more discriminative features.

Figure 3: From saliency maps including only sensory information (bottom-up attention processes) to maps integrating task-related information (top-down processes). (Top row) Two example images. (Middle row) Bottom-up saliency maps generated by our CNN-based saliency detector fine-tuned over the Stanford Dogs dataset using ground-truth heatmaps. (Bottom row) Shift of saliency guided by the classification task, as resulting from training SalClassNet.

4 Top-down Saliency Dataset

To test our saliency detector, we built a top-down saliency dataset, SalDogs, consisting of eye-gaze data recorded from multiple human subjects while observing dog images taken from the Stanford Dogs dataset (Khosla et al., 2011), a collection of 20,580 images of dogs from 120 breeds (about 170 images per class). From the whole Stanford Dogs dataset, we used a subset of 9,861 images, keeping the original class distribution. The eye-gaze acquisition protocol involved 12 users, who underwent breed-classification training sessions (randomly showing dog images with the related classes), and then were asked to identify the learned breeds from images. To guide the top-down visual attention of participants, according to psychology research (Enns and MacDonald, 2013), images were blurred with a Gaussian filter whose variance was initially set to 10 and then gradually reduced by 1 every half second until subjects were able to recognize their classes or the images were completely de-blurred. Users took, on average, 2.6 seconds to identify dog breeds, and 2,763 images were not identified by the end of the de-blurring process. Eye-gaze data were recorded through a 60-Hz Tobii T60 eye tracker. Tab. 2 provides an overview of the SalDogs dataset. To the best of our knowledge, this is one of the first publicly-available datasets with saliency maps driven by visual classification tasks, and the first one dealing with a large number of fine-grained object classes.
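
The de-blurring schedule of the acquisition protocol can be written down as a short sketch. It assumes that the variance drops in discrete steps exactly at half-second boundaries and that a variance of 0 corresponds to the fully de-blurred image; neither detail is stated explicitly above, and the function names are illustrative.

```python
import math

def blur_variance(elapsed_s, start=10, step=1, interval_s=0.5):
    """Gaussian-blur variance shown to the viewer after elapsed_s seconds."""
    reductions = math.floor(elapsed_s / interval_s)
    return max(0, start - step * reductions)

def seconds_until_sharp(start=10, step=1, interval_s=0.5):
    """Time needed for the variance to reach zero (fully de-blurred image)."""
    return math.ceil(start / step) * interval_s

# Variance still applied at the reported average recognition time of 2.6 s.
variance_at_avg = blur_variance(2.6)
```

Under these assumptions the image becomes fully sharp after 5 seconds, and at the average recognition time of 2.6 seconds the variance has been reduced from 10 to 5.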

Our Dataset
Number of images 9,861
Number of classes 120
Avg. number of images per class 82.2
Avg. number of fixation points per image 6.2
Table 2: Information on the generated saliency dataset.

A dataset similar to ours is POET (Papadopoulos et al., 2014), which, however, does not deal with fine-grained classification tasks, but with classification at the basic level and with far fewer classes (10 Pascal VOC classes vs 120 in our case). Tab. 3 reports a comparison, in terms of enforced attention mechanism (e.g., tasks accomplished by participants), number of viewers, collected images and acquisition devices, between our dataset and recent saliency benchmarking datasets. Finally, to test the generalization capabilities of our saliency detector, we also collected eye-gaze data from the same 12 subjects, employing the same data acquisition protocol described above, on: a) bird images (referred to in the following as SalBirds), using a subset of 400 images taken from the CUB-200-2011 dataset (Wah et al., 2011), an image dataset containing 11,788 images from 200 classes representing different bird species; and b) flower images (referred to as SalFlowers), by selecting 400 images from Oxford Flowers-102 (Nilsback and Zisserman, 2008), which contains over 8,000 images from 102 different flower varieties.

Dataset Capture method Task Viewers Train Validation Test Tot
SALICON (Jiang et al. (2015)) Mouse clicks Free-viewing Crowd 10,000 5,000 5,000 20,000
iSUN (Xu et al. (2015)) Camera-based eye tracker Free-viewing Crowd 6,000 926 2,000 8,926
MIT300 (Bylinskii et al. ) ISCAN video-based eye tracker Free-viewing 39 - - - 300
CAT2000 (Borji and Itti (2015)) Eyelink 1000 eye tracker Free-viewing 24 2,000 - 2,000 4,000
FIGRIM (Bylinskii et al. (2015)) Eyelink 1000 eye tracker Memory 15 - - - 2,787
EyeCrowd (Jiang et al. (2014)) Eyelink 1000 eye tracker Free-viewing 16 450 - 50 500
OSIE (Xu et al. (2014)) Eyelink 1000 eye tracker Free-viewing 15 500 200 700
PASCAL-S (Li et al. (2014)) Eyelink 1000 eye tracker Free-viewing 8 - - - 850
ImgSal (Li et al. (2013)) Tobii T60 eye tracker Free-viewing 21 - - - 235
POET (Papadopoulos et al. (2014)) Eyelink 2000 eye tracker Basic classification 28 441 - 5,829 6,270
SalDogs Tobii T60 eye tracker Fine-grained classification 12 8,005 928 928 9,861
Table 3: Comparison between our dataset and others from the state of the art.

5 Performance analysis

The performance analysis focuses on assessing the quality of our model and comparing it to state-of-the-art approaches on two tasks: a) generating task-driven saliency maps from images; and b) performing fine-grained visual recognition.

5.1 Datasets

The main benchmarking dataset used for the evaluation of both saliency detection and classification models was SalDogs (9,861 images with heatmaps), which was split into training set (80%, 8,005 images – SalDogs-train), validation set (10%, 928 images – SalDogs-val) and test set (10%, 928 images – SalDogs-test).

Specifically for saliency detection, we also employed the POET, SalBirds and SalFlowers datasets (described in Sect. 4) to assess the generalization capabilities of the models trained on SalDogs.

For visual classification evaluation, we first carried out a comparison of different models on SalDogs, aimed at investigating the contribution of visual saliency to classification. Then, we assessed the generalization capabilities of SalClassNet on the CUB-200-2011 and Oxford Flower 102 fine-grained datasets.

All classification networks (SalClassNet and baseline) were first pre-trained on a de-duped version of ImageNet, obtained by removing from ImageNet the 120 classes present in the Stanford Dogs Dataset. This guarantees fairness between models regardless of pre-training: indeed, since the whole Stanford Dogs is included in ImageNet, publicly-available pre-trained VGG-19 and Inception models would have the advantage of having been trained on images included in


5.2 Training details

The saliency detector in SalClassNet consists of a cascade of convolutional feature extractors initialized from a pre-trained VGG-19, followed by a layer (trained from scratch) which maps each location of the final feature map into a saliency score. An initial pre-training stage was carried out on OSIE (Xu et al., 2014), as done also in SALICON. This pre-training employed mini-batch SGD optimization (learning rate: 0.00001, momentum: 0.9, weight decay: 0.0005, batch size: 16) of the MSE loss between the output and target saliency maps; data augmentation was performed by rescaling each image (and the corresponding ground-truth heatmap) to 340 pixels on the short side, while keeping the aspect ratio, and randomly extracting five 299×299 crops, plus the corresponding horizontal flips. After this initial pre-training, the resulting model was fine-tuned on SalDogs-train: the learning rate was initialized to 0.001 and gradually reduced through a decay rule that, at each iteration, computes the current rate from the initial learning rate. During this fine-tuning stage, the same data augmentation approach described above and the same values for the other hyperparameters were used.
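
The geometry of the data augmentation step (short side rescaled to 340 pixels, five random 299×299 crops and their horizontal flips) can be sketched as below. Only coordinates are computed, no pixels are touched; the 500×375 example size, the rounding rule and the helper names are assumptions made for illustration.

```python
import random

def rescale_short_side(width, height, short=340):
    """New (width, height) with the short side at `short`, aspect ratio kept."""
    scale = short / min(width, height)
    return max(short, round(width * scale)), max(short, round(height * scale))

def random_crops(width, height, size=299, n=5, rng=None):
    """Top-left corners and flip flags for n random crops plus their mirrors."""
    rng = rng or random.Random(0)
    views = []
    for _ in range(n):
        x = rng.randint(0, width - size)
        y = rng.randint(0, height - size)
        views.append((x, y, False))   # original crop
        views.append((x, y, True))    # same crop, horizontally flipped
    return views

w, h = rescale_short_side(500, 375)   # example image size (assumed)
views = random_crops(w, h)            # 10 training views per image
```

The same crop coordinates and flip flag have to be applied to the image and to its ground-truth heatmap, otherwise the saliency target no longer lines up with the input.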

The saliency-based classifier module of SalClassNet was initially pre-trained as a regular Inception network. Due to the inclusion of Stanford Dogs in ImageNet, we did not employ a publicly-available pre-trained network, and instead trained an Inception architecture from scratch on the de-duped version of ImageNet described in the previous section. We trained the model for 70 epochs, using mini-batch SGD for optimization, with a learning rate schedule going from 0.01 to 0.0001 over the first 53 epochs, weight decay 0.0005 up to the 30th epoch (and 0 afterwards), momentum 0.9 and batch size 32. Data augmentation on the input images was performed as described above. After this pre-training was completed, we modified the first-layer kernels to support RGB color plus saliency input, by adding a dimension with randomly-initialized weights to the relevant kernel tensors, and we fine-tuned the model on SalDogs-train for classification, passing as input each image together with the corresponding ground-truth saliency map. Since some weights in the model had already been pre-trained and others had to be trained from scratch, the learning rate was initially set to 0.05 for the untrained parameters, and to 0.001 for the others. We used the same procedures for learning rate decay and data augmentation as in the fine-tuning of the saliency detector, and a batch size of 16.
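
The use of two learning rates can be illustrated with a single SGD-with-momentum update over two parameter groups. This is a sketch of one common momentum formulation (velocity = momentum × velocity − lr × gradient), which may differ from the variant implemented in the optimizer actually used; the group names, parameter values and gradients are made up for the example.

```python
def sgd_momentum_step(params, grads, velocities, lrs, momentum=0.9):
    """One SGD-with-momentum update with a separate learning rate per group."""
    for name in params:
        lr = lrs[name]
        velocities[name] = [momentum * v - lr * g
                            for v, g in zip(velocities[name], grads[name])]
        params[name] = [p + v for p, v in zip(params[name], velocities[name])]
    return params, velocities

# Two toy parameter groups: pre-trained RGB weights and new saliency weights.
params = {"rgb_pretrained": [0.5, -0.3], "saliency_new": [0.0, 0.0]}
grads = {"rgb_pretrained": [1.0, 1.0], "saliency_new": [1.0, 1.0]}
velocities = {name: [0.0] * len(values) for name, values in params.items()}
lrs = {"rgb_pretrained": 0.001, "saliency_new": 0.05}   # rates from the text

params, velocities = sgd_momentum_step(params, grads, velocities, lrs)
```

With identical gradients, the untrained saliency weights move fifty times further per step than the pre-trained ones, which is the intended effect of the two rates.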

The final version of the SalClassNet model, which is the one employed in the following experiments, was obtained by concatenating the saliency detector and the saliency-based classifier and fine-tuning the result, in an end-to-end fashion, on SalDogs-train. Indeed, up to this point, the saliency detector had never been provided with an error signal related to a classification loss, and the saliency-based classifier had never been provided with input maps computed by an automated method. Again, the previous procedures for data augmentation and learning rate decay were employed, with a single initial learning rate of 0.001. The value of α in Eq. 1, weighing the classification loss with respect to the saliency MSE loss, was set to 0.2, since it provided the best accuracy trade-off (see Fig. 4).

Figure 4: Classification accuracy and MSE w.r.t. α values: 0.2 was chosen as the best trade-off between the two performance metrics.

During the fine-tuning stages of the individual modules and of the end-to-end model, at the end of each epoch we monitored the classification accuracy and the saliency MSE loss over SalDogs-val (evaluating only the central crop of each rescaled image), and stopped training when both had not improved for 10 consecutive epochs: in practice, all models converged in 70-120 epochs. Model selection was performed by choosing the model for which the best relevant accuracy measure (MSE loss for the saliency detector, classification accuracy for the saliency-based classifier and the full SalClassNet model) had been obtained.
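
The stopping criterion can be sketched as a small tracker. It reads "both had not improved for 10 consecutive epochs" as requiring each metric to have gone at least 10 epochs without a new best value; this interpretation, the class name and the strict-improvement rule are assumptions.

```python
class EarlyStopper:
    """Stop when neither validation metric has improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_acc = float("-inf")
        self.best_mse = float("inf")
        self.stale_acc = 0   # epochs since accuracy last improved
        self.stale_mse = 0   # epochs since MSE last improved

    def update(self, val_acc, val_mse):
        """Record one epoch; return True when training should stop."""
        if val_acc > self.best_acc:
            self.best_acc, self.stale_acc = val_acc, 0
        else:
            self.stale_acc += 1
        if val_mse < self.best_mse:
            self.best_mse, self.stale_mse = val_mse, 0
        else:
            self.stale_mse += 1
        return (self.stale_acc >= self.patience
                and self.stale_mse >= self.patience)
```

A training loop would call `update` once per epoch with the SalDogs-val accuracy and MSE, and separately keep the checkpoint with the best value of the metric relevant to the model being trained.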

5.3 Saliency detection performance

To evaluate the capabilities of SalClassNet for saliency detection, we employed the metrics defined by Borji et al. (2013), namely shuffled area under curve (s-AUC), normalized scanpath saliency (NSS) and correlation coefficient (CC) scores, and compared its performance to those achieved by the SALICON and SalNet models, in their original versions (i.e., as released, pre-trained on the datasets in Tab. 1) and after fine-tuning on SalDogs-train.

Tab. 4 reports a quantitative comparison between these approaches over the SalDogs-test, POET, SalBirds and SalFlowers datasets. Among the compared models, SalClassNet obtains the best score on every metric and dataset, with the single exception of s-AUC on POET, where the original SALICON model scores slightly higher (0.723 vs 0.715). This suggests that driving the generation of saliency maps with a specific goal leads to better performance than fine-tuning already-trained models. Fig. 5 and 6 report some output examples of the tested methods on different input images from, respectively, SalDogs-test and from POET, SalBirds and SalFlowers. Quantitative and qualitative results show SalClassNet’s ability to generalize the top-down visual attention process across different datasets.
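
Two of the three metrics are simple enough to sketch with the standard library, using their usual definitions: CC as the Pearson correlation between two maps, and NSS as the mean of the standardized prediction at fixated locations. The s-AUC is not covered here, and the toy map and fixation below are illustrative only.

```python
import math

def _flatten(grid):
    return [v for row in grid for v in row]

def _mean_std(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def correlation_coefficient(pred, gt):
    """Pearson correlation between predicted and ground-truth saliency maps."""
    p, g = _flatten(pred), _flatten(gt)
    mp, sp = _mean_std(p)
    mg, sg = _mean_std(g)
    cov = sum((a - mp) * (b - mg) for a, b in zip(p, g)) / len(p)
    return cov / (sp * sg)

def nss(pred, fixations):
    """Mean of the standardized prediction at fixated (row, col) locations."""
    mean, std = _mean_std(_flatten(pred))
    return sum((pred[r][c] - mean) / std for r, c in fixations) / len(fixations)

pred = [[0.1, 0.2, 0.1],
        [0.2, 0.9, 0.3],
        [0.1, 0.3, 0.1]]
cc_self = correlation_coefficient(pred, pred)   # a map compared with itself
nss_peak = nss(pred, [(1, 1)])                  # fixation on the predicted peak
```

Comparing a map with itself gives a correlation of 1 up to rounding, which is why the human baseline rows in Tab. 4 report CC = 1. Both functions divide by a standard deviation, so constant maps would need special handling.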

Method s-AUC NSS CC
Dataset SalDogs
Human Baseline 0.984 11.195 1
SalNet 0.720 1.839 0.231
SALICON 0.805 2.056 0.261
Fine-tuned SalNet 0.817 4.174 0.432
Fine-tuned SALICON 0.837 3.899 0.428
SalClassNet 0.862 4.239 0.461
Dataset POET
Human Baseline 0.975 5.189 1
SalNet 0.646 1.274 0.342
SALICON 0.723 1.270 0.355
Fine-tuned SalNet 0.660 1.378 0.300
Fine-tuned SALICON 0.695 1.669 0.356
SalClassNet 0.715 1.908 0.387
Dataset SalBirds
Human Baseline 0.743 9.323 1
SalNet 0.642 2.252 0.330
SALICON 0.680 2.247 0.346
Fine-tuned SalNet 0.644 3.504 0.403
Fine-tuned SALICON 0.686 4.252 0.507
SalClassNet 0.708 4.404 0.529
Dataset SalFlowers
Human Baseline 0.975 9.787 1
SalNet 0.606 1.311 0.1973
SALICON 0.653 1.081 0.1803
Fine-tuned SalNet 0.576 0.916 0.136
Fine-tuned SALICON 0.661 1.599 0.234
SalClassNet 0.683 1.675 0.245
Table 4: Comparison in terms of shuffled area under curve (s-AUC), normalized scanpath saliency (NSS) and correlation coefficient (CC) between the proposed SalClassNet and the baseline models. For each dataset we report the human baseline, i.e., the scores computed using the ground truth maps.

Figure 5: Comparison of saliency output maps of different methods. Each row, from left to right, shows an example image, the corresponding ground-truth saliency map, and the output maps computed, in order, by SalNet and SALICON, both as released and fine-tuned over SalDogs-train, and by the proposed end-to-end SalClassNet model. Besides being able to identify those areas which can be useful for recognition (see first three rows), our method can highlight multiple salient objects (both dogs in the fourth row), or suppress those objects which are not salient for the task (see fifth row).

Figure 6: Examples of output saliency maps generated by different methods on CUB-200-2011 (first two rows), Oxford Flower 102 (third and fourth rows) and POET (last two rows). Each row, from left to right, shows an example image, the corresponding ground-truth saliency map, and the output maps computed, in order, by SalNet and SALICON, both as released and fine-tuned over SalDogs-train, and by the proposed end-to-end SalClassNet model. SalClassNet, when compared to SALICON (the second best model in Table 4), shows better capabilities to filter out image parts which are salient in general but not necessary for classification.

5.4 Effect of saliency maps on visual classification performance

In this section, we investigate if, and to what extent, explicitly providing saliency maps can help improve classification performance. To this end, we first assessed the performance of VGG-19 and Inception over SalDogs when using as input a) only color images (3-channel models) and b) ground-truth saliency maps plus color images (4-channel models). In both cases, as mentioned earlier, we re-trained Inception and VGG-19 from scratch on the de-duped version of ImageNet (ImageNetDD, i.e., ImageNet without the dog image classes) and then fine-tuned them on SalDogs-train, to force the 4-channel versions to use saliency information coming from the upstream module. Indeed, the publicly-available versions of Inception and VGG had already learned dog breed distributions (having been trained over 150,000 ImageNet dog images), so they tended to ignore additional inputs such as saliency. Furthermore, a comparison with Inception and VGG-19 pre-trained on the whole ImageNet would have been unfair, also because SalDogs contains only about 9,000 images (versus 150,000).
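As an illustration of the 4-channel (RGBS) input, the following minimal sketch stacks a saliency map onto a color image. The channel-last layout, the function name `make_rgbs` and the random toy data are assumptions made for this example; the source does not specify these details here.

```python
import numpy as np

def make_rgbs(image, saliency):
    """Stack an H x W x 3 color image and an H x W saliency map into H x W x 4."""
    image = np.asarray(image, dtype=np.float32)
    saliency = np.asarray(saliency, dtype=np.float32)
    if image.shape[:2] != saliency.shape:
        raise ValueError("image and saliency map must share height and width")
    # The saliency map becomes the fourth channel, after R, G and B.
    return np.concatenate([image, saliency[..., np.newaxis]], axis=-1)

rng = np.random.default_rng(0)
rgb = rng.random((8, 8, 3))   # hypothetical color image
sal = rng.random((8, 8))      # hypothetical saliency map
rgbs = make_rgbs(rgb, sal)
```

A classifier consuming such an input needs a first convolutional layer with four input channels instead of three.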

We compared the above methods to our SalClassNet, which automatically generates saliency maps and uses them for classification. Besides the version of SalClassNet described in Sect. 3 (which is also used in all the next experiments), we tested a variant of SalClassNet which employs VGG-19 (suitably modified to account for the saliency input) as classifier: this model is indicated in the results as “SalClassNet (VGG)”.

Tab. 5 shows the mean classification accuracies achieved by all the tested methods. Explicitly providing saliency information (either as ground-truth saliency maps or as maps generated by SalClassNet) to traditional visual classifiers yields improved performance. Indeed, both the VGG and Inception models extended to make use of saliency information and the SalClassNet variants outperformed the traditional Inception and VGG-19. The lower classification accuracies of the RGBS versions of Inception and VGG (trained with ground-truth saliency maps) w.r.t. the SalClassNet variants are likely due to the end-to-end training of both saliency and classification networks, which extracts and combines saliency information with visual cues more effectively for the final classification.

Method MCA
VGG (3 channels) 43.4%
VGG (4 channels) + ground truth saliency maps 47.2%
SalClassNet (VGG) 49.0%
Inception (3 channels) 67.1%
Inception (4 channels) + ground truth saliency maps 68.4%
SalClassNet 70.5%
Table 5: Comparison in terms of mean classification accuracy on SalDogs-test between the original Inception and VGG models, pre-trained on ImageNetDD (ImageNet without the dog image classes) and fine-tuned on SalDogs-train, their RGBS variants trained on ground-truth saliency heatmaps and the respective two variants of SalClassNet.
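The MCA column can be illustrated with a short sketch, assuming that mean classification accuracy denotes the average of per-class accuracies (the text above does not spell out the definition). The function name and the toy labels are hypothetical.

```python
from collections import defaultdict

def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy example: 2 of 3 beagles and 1 of 1 pug classified correctly.
y_true = ["beagle", "beagle", "beagle", "pug"]
y_pred = ["beagle", "beagle", "pug", "pug"]
mca = mean_class_accuracy(y_true, y_pred)
```

In this toy case the per-class average is (2/3 + 1)/2, whereas the plain fraction of correct samples is 3/4; the two measures differ whenever classes have different sizes.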

Furthermore, SalClassNet showed good generalization capabilities over different datasets, namely, CUB-200-2011 and Oxford Flower 102. In particular, we employed SalClassNet as a feature extractor for a subsequent softmax classifier and compared its performance to that achieved, on the same datasets, by Inception and VGG-19 (fine-tuned on SalDogs-train and also employed as feature extractors followed by a softmax classifier). Results are shown in Tab. 6 and confirm our previous claim. The better generalization performance of our method can be explained by two factors: a) the features learned by the classifiers are not strictly dog-specific but, more likely, belong to a wider pattern of fine details that can be generally interpreted as significant features for classification (e.g., eyes, ears, mouth, tail, etc.) and are thus applicable to a variety of domains; b) SalClassNet, building and improving on the features learned by Inception, exploits saliency to better weigh the most distinctive features for classification. Hence, although SalClassNet has not been trained on the flower and bird datasets, the generic nature of the learned features and the improved feature filtering gained through saliency led to high accuracy on those as well. In order to demonstrate the effectiveness of SalClassNet’s kernels on different datasets, we computed the features learned by SalClassNet for classification over Stanford Dogs, CUB-200-2011 and Oxford Flower 102. Table 7 shows some of these features, extracted at different SalClassNet depths and visualized by feeding the whole datasets to the network and identifying the image regions which maximally activate the neurons of certain feature maps. It can be seen that features which are meaningful for dogs turn out to be meaningful for birds and flowers as well.

Method CUB-200-2011 Oxford Flower 102
VGG 47.6% 59.2%
Inception 61.8% 77.8%
SalClassNet 63.2% 79.4%
Table 6: Performance obtained by VGG, Inception and SalClassNet over, respectively, CUB-200-2011 and Oxford Flower 102
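The protocol behind Table 6 (a frozen network used as feature extractor, followed by a softmax classifier) can be sketched as follows. This is only an illustration under stated assumptions: the features are synthetic 2-D points standing in for network activations, and the function names, learning rate and number of epochs are hypothetical choices, not the authors' settings.

```python
import numpy as np

def train_softmax(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a softmax (multinomial logistic) classifier with gradient descent."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # gradient of cross-entropy
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    return np.argmax(features @ W + b, axis=1)

# Synthetic "extracted features": three well-separated clusters.
rng = np.random.default_rng(0)
centers = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 3.0]])
labels = np.repeat(np.arange(3), 30)
features = centers[labels] + 0.3 * rng.standard_normal((90, 2))

W, b = train_softmax(features, labels, num_classes=3)
accuracy = float((predict(features, W, b) == labels).mean())
```

Only the softmax parameters are learned here; the feature extractor itself stays fixed, which is what makes the comparison a test of how well the learned features transfer.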
Layer Stanford Dogs CUB-200-2011 Oxford Flower 102
Table 7: Examples of features employed by SalClassNet for classification over three different datasets. Each row of images in the table shows samples which provide high activations for a certain feature map. For each of the tested datasets, we show a 3×4 block of images, where the first column represents the average image computed over the highest 50 activations for that dataset; the last three columns show the three top activations.
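The selection behind each row of Table 7 can be sketched as below, assuming that every image (or image region) has already been summarized by a single scalar activation for the feature map of interest. How that scalar is obtained is not specified here, so the function name and the random toy data are purely illustrative.

```python
import numpy as np

def top_activations(images, activations, k=3, n_avg=50):
    """Return the mean of the n_avg most activating images and the top-k indices."""
    order = np.argsort(activations)[::-1]      # indices by decreasing activation
    n_avg = min(n_avg, len(order))
    mean_image = images[order[:n_avg]].mean(axis=0)
    return mean_image, order[:k]

rng = np.random.default_rng(0)
images = rng.random((100, 4, 4, 3))            # hypothetical image crops
activations = rng.random(100)                  # hypothetical scalar activations
activations[[7, 42, 13]] = [3.0, 2.0, 1.5]     # plant three clear maxima
mean_image, top3 = top_activations(images, activations)
```

The returned average image corresponds to the first column of each block, and the three indices to the remaining columns.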

6 Concluding remarks

In this work, we proposed a deep architecture, SalClassNet, which generates top-down saliency maps by conditioning the saliency detection process through object class supervision and, at the same time, exploits such saliency maps for visual classification. Performance analysis, both in terms of saliency detection and classification, showed that SalClassNet identifies regions corresponding to class-discriminative features, hence emulating top-down saliency, unlike most of the existing saliency detection methods, which produce bottom-up maps of generic salient visual features. Although we tested our framework using two specific networks for saliency detection and visual classification, its architecture and our software implementation are general and can be used with any fully-convolutional saliency detector or classification network by simply replacing one of the two subnetworks, respectively, before or after the connecting batch normalization module. As a further contribution of this paper, we built a dataset of saliency maps (by means of eye-gaze tracking experiments on 12 subjects who were asked to guess dog breeds) for a subset of the Stanford Dogs dataset, creating what is, to the best of our knowledge, the first publicly-available top-down saliency dataset driven by a fine-grained visual classification task. We hope that our flexible deep network architecture (all source code is available) together with our eye-gaze dataset will push research in the direction of emulating human visual processing through a deeper understanding of the higher-level processes behind it, such as top-down visual attention.
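The modularity described above can be illustrated with a minimal sketch in which the two subnetworks are plain functions joined by a normalization step. It assumes that the normalized saliency map is appended to the image as a fourth channel, as in the 4-channel baselines, and it uses batch normalization without learned scale and shift; all function names and the toy subnetworks are hypothetical stand-ins, not the released implementation.

```python
import numpy as np

def salclass_forward(images, saliency_net, classifier, eps=1e-5):
    """Compose two interchangeable subnetworks around a normalization step.

    images: (N, H, W, 3) array; saliency_net maps it to (N, H, W);
    classifier maps an (N, H, W, 4) array to (N, num_classes) scores.
    """
    saliency = saliency_net(images)
    # Normalize over the whole batch (no learned scale or shift in this sketch).
    saliency = (saliency - saliency.mean()) / np.sqrt(saliency.var() + eps)
    rgbs = np.concatenate([images, saliency[..., np.newaxis]], axis=-1)
    return classifier(rgbs)

# Hypothetical stand-ins for the two subnetworks.
def toy_saliency_net(images):
    return images.mean(axis=-1)                # brightness used as "saliency"

def other_saliency_net(images):
    return images.max(axis=-1)                 # a drop-in replacement

def toy_classifier(rgbs):
    pooled = rgbs.mean(axis=(1, 2))            # (N, 4) global average pooling
    weights = np.arange(8, dtype=float).reshape(4, 2)  # fixed, untrained
    return pooled @ weights                    # (N, 2) class scores

batch = np.random.default_rng(0).random((5, 8, 8, 3))
scores_a = salclass_forward(batch, toy_saliency_net, toy_classifier)
scores_b = salclass_forward(batch, other_saliency_net, toy_classifier)
```

Swapping `toy_saliency_net` for `other_saliency_net` requires no change to the classifier, which mirrors the idea of replacing either subnetwork independently.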


References

  • Almahairi et al. (2016) Almahairi, A., Ballas, N., Cooijmans, T., Zheng, Y., Larochelle, H., Courville, A., 2016. Dynamic capacity networks, in: ICML 2016.
  • Ba et al. (2015) Ba, J., Mnih, V., Kavukcuoglu, K., 2015. Multiple object recognition with visual attention, in: ICLR 2015.
  • Borji and Itti (2015) Borji, A., Itti, L., 2015. Cat2000: A large scale fixation dataset for boosting saliency research. CVPRW 2015.
  • Borji et al. (2013) Borji, A., Sihite, D.N., Itti, L., 2013. Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. TIP 2013.
  • Bylinskii et al. (2015) Bylinskii, Z., Isola, P., Bainbridge, C., Torralba, A., Oliva, A., 2015. Intrinsic and extrinsic effects on image memorability. Vision Research.
  • (6) Bylinskii, Z., Judd, T., Borji, A., Itti, L., Durand, F., Oliva, A., Torralba, A. MIT saliency benchmark.
  • Cao et al. (2015) Cao, C., Liu, X., Yang, Y., Yu, Y., Wang, J., Wang, Z., Huang, Y., Wang, L., Huang, C., Xu, W., et al., 2015. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks, in: CVPR 2015, pp. 2956–2964.
  • Chen et al. (2016) Chen, T., Lin, L., Liu, L., Luo, X., Li, X., 2016. Disc: Deep image saliency computing via progressive representation learning. Transactions on NNLS 2016, 1135–1149.
  • Cornia et al. (2016) Cornia, M., Baraldi, L., Serra, G., Cucchiara, R., 2016. Multi-level net: A visual saliency prediction model, in: ECCVW 2016, pp. 302–315.
  • Enns and MacDonald (2013) Enns, J.T., MacDonald, S.C., 2013. The role of clarity and blur in guiding visual attention in photographs. J Exp Psychol Hum Percept Perform 39, 568–578.
  • Han et al. (2016) Han, J., Zhang, D., Wen, S., Guo, L., Liu, T., Li, X., 2016. Two-stage learning to predict human eye fixations via sdaes. Transactions on Cybernetics 2016, 487–498.
  • He et al. (2015) He, S., Lau, R.W.H., Liu, W., Huang, Z., Yang, Q., 2015. SuperCNN: A superpixelwise convolutional neural network for salient object detection. IJCV 2015, 330–344.
  • Huang et al. (2015) Huang, X., Shen, C., Boix, X., Zhao, Q., 2015. Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks, in: ICCV 2015, pp. 262–270.
  • Itti (2012) Itti, L., 2012. Probabilistic learning of task-specific visual attention, in: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 470–477.
  • Itti and Koch (2000) Itti, L., Koch, C., 2000. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research 2000, 1489–1506.
  • Jetley et al. (2016) Jetley, S., Murray, N., Vig, E., 2016. End-to-end saliency mapping via probability distribution prediction, in: CVPR 2016, pp. 5753–5761.
  • Jiang et al. (2015) Jiang, M., Huang, S., Duan, J., Zhao, Q., 2015. Salicon: Saliency in context, in: CVPR 2015, pp. 1072–1080.
  • Jiang et al. (2014) Jiang, M., Xu, J., Zhao, Q., 2014. Saliency in crowd, in: ECCV, Springer. pp. 17–32.
  • Judd et al. (2009) Judd, T., Ehinger, K., Durand, F., Torralba, A., 2009. Learning to predict where humans look, in: 2009 IEEE 12th International Conference on Computer Vision, pp. 2106–2113.
  • Khosla et al. (2011) Khosla, A., Jayadevaprakash, N., Yao, B., Fei-Fei, L., 2011. Novel dataset for fine-grained image categorization, in: CVPRW 2011.
  • Kruthiventi et al. (2017) Kruthiventi, S.S., Ayush, K., Babu, R.V., 2017. Deepfix: A fully convolutional neural network for predicting human eye fixations. IP 2017.
  • Kümmerer et al. (2015) Kümmerer, M., Theis, L., Bethge, M., 2015. Deep Gaze I: Boosting saliency prediction with feature maps trained on imagenet, in: ICLRW 2015.
  • Kümmerer et al. (2016) Kümmerer, M., Wallis, T.S., Bethge, M., 2016. Deepgaze ii: Reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563.
  • Li and Yu (2015) Li, G., Yu, Y., 2015. Visual saliency based on multiscale deep features, in: CVPR 2015.
  • Li and Yu (2016a) Li, G., Yu, Y., 2016a. Deep contrast learning for salient object detection, in: CVPR 2016.
  • Li and Yu (2016b) Li, G., Yu, Y., 2016b. Visual saliency detection based on multiscale deep CNN features. Transactions on Image Processing 2016, 5012–5024.
  • Li et al. (2013) Li, J., Levine, M.D., An, X., Xu, X., He, H., 2013. Visual saliency based on scale-space analysis in the frequency domain. PAMI 35, 996–1010.
  • Li et al. (2014) Li, Y., Hou, X., Koch, C., Rehg, J.M., Yuille, A.L., 2014. The secrets of salient object segmentation, in: CVPR 2014, pp. 280–287.
  • Lin et al. (2014) Lin, Y., Kong, S., Wang, D., Zhuang, Y., 2014. Saliency detection within a deep convolutional architecture, in: AAAIW.
  • Liu et al. (2015) Liu, N., Han, J., Zhang, D., Wen, S., Liu, T., 2015. Predicting eye fixations using convolutional neural networks, in: CVPR 2015.
  • Mnih et al. (2014) Mnih, V., Heess, N., Graves, A., Kavukcuoglu, K., 2014. Recurrent models of visual attention, in: NIPS 2014, pp. 2204–2212.
  • Nilsback and Zisserman (2008) Nilsback, M.E., Zisserman, A., 2008. Automated flower classification over a large number of classes, in: ICVGIP 2008.
  • Pan et al. (2016) Pan, J., Sayrol, E., Giro-i Nieto, X., McGuinness, K., O’Connor, N.E., 2016. Shallow and deep convolutional networks for saliency prediction, in: CVPR 2016.
  • Papadopoulos et al. (2014) Papadopoulos, D.P., Clarke, A.D., Keller, F., Ferrari, V., 2014. Training object class detectors from eye tracking data, in: ECCV 2014, pp. 361–376.
  • Peters and Itti (2007) Peters, R.J., Itti, L., 2007. Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. doi:10.1109/CVPR.2007.383337.
  • Ren et al. (2015) Ren, Z., Gao, S., Chia, L.T., Tsang, I.W.H., 2015. Region-based saliency detection and its application in object recognition, in: TCSVT 2015.
  • Shen and Zhao (2014) Shen, C., Zhao, Q., 2014. Learning to predict eye fixations for semantic contents using multi-layer sparse network. Neurocomputing 2014, 61–68.
  • Simonyan and Zisserman (2014) Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. ICLR 2015 abs/1409.1556.
  • Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: CVPR 2015.
  • Tang and Wu (2016) Tang, Y., Wu, X., 2016. Saliency detection via combining region-level and pixel-level predictions with cnns, in: ECCV 2016, pp. 809–825.
  • Vig et al. (2014) Vig, E., Dorr, M., Cox, D., 2014. Large-scale optimization of hierarchical features for saliency prediction in natural images, in: CVPR 2014, pp. 2798–2805.
  • Wah et al. (2011) Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S., 2011. The Caltech-UCSD Birds-200-2011 Dataset.
  • Walther et al. (2005) Walther, D., Rutishauser, U., Koch, C., Perona, P., 2005. Selective visual attention enables learning and recognition of multiple objects in cluttered scenes. Computer Vision and Image Understanding 100, 41–63.
  • Wang et al. (2015) Wang, L., Lu, H., Ruan, X., Yang, M.H., 2015. Deep networks for saliency detection via local estimation and global search, in: CVPR 2015, pp. 3183–3192.
  • Xu et al. (2014) Xu, J., Jiang, M., Wang, S., Kankanhalli, M.S., Zhao, Q., 2014. Predicting human gaze beyond pixels. JoV 2014, 1–20.
  • Xu et al. (2015) Xu, P., Ehinger, K.A., Zhang, Y., Finkelstein, A., Kulkarni, S.R., Xiao, J., 2015. Turkergaze: crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755.
  • Zhang et al. (2016a) Zhang, J., Lin, Z., Brandt, J., Shen, X., Sclaroff, S., 2016a. Top-down neural attention by excitation backprop, in: ECCV 2016, pp. 543–559.
  • Zhang et al. (2016b) Zhang, X., Xiong, H., Zhou, W., Lin, W., Tian, Q., 2016b. Picking deep filter responses for fine-grained image recognition, in: CVPR 2016.
  • Zhao et al. (2015) Zhao, R., Ouyang, W., Li, H., Wang, X., 2015. Saliency detection by multi-context deep learning, in: CVPR 2015, pp. 1265–1274.
  • Zhu et al. (2014) Zhu, G., Wang, Q., Yuan, Y., 2014. Tag-saliency: Combining bottom-up and top-down information for saliency detection. Computer Vision and Image Understanding 118, 40–49.