Saliency prediction aims to extract a mask map, termed a "saliency map", that gives, for each location in a given image, the probability that it attracts human attention. Such a saliency map can be utilized in multiple applications, such as image thumbnailing and content-aware image resizing. By reducing the field to be perceived to a specific salient region, it can also be used as a preprocessing method to speed up other visual tasks, e.g. [3, 4]. This wide range of uses has made saliency prediction popular, and many approaches have recently been proposed for it.
Essentially, the task is to map the implicit saliency information in high-dimensional image data to a low-dimensional saliency map. This mapping usually takes the form of extracting, from the original image, a gray-scale map in which salient areas are indicated by bright Gaussian blobs. Traditional approaches hierarchically model handcrafted features to extract saliency in an unsupervised style. More recently, methods based on deep neural networks (DNNs), such as Deep Fixation and SALICON, have improved saliency prediction performance by a large margin, especially by transferring pretrained models with rich semantic features learned from large-scale image classification to boost learning. Saliency prediction datasets usually have two types of labels: the fixation pixel map, which records human eye movements as discrete individual pixels, and the fixation blob map, generated by applying Gaussian blurring to the corresponding fixation pixel map. Current approaches tend to learn saliency prediction by regressing the input image to a gray-scale map with the fixation blob map as label. As the field continues to develop, many new approaches have been proposed in this fixation blob regression style, yet learning saliency prediction from the raw fixation pixels has not been explored.
In this work, we propose a novel approach that learns saliency prediction from the fixation pixel map instead of the fixation blob map. We use clustering to construct a set of sparse fixation pixel map labels from the raw fixation pixel maps. When naive regression loss functions such as Kullback-Leibler Divergence and Mean Squared Error are applied to such sparse outputs and labels, nearby but non-overlapping saliency pixels cause false penalties and lead to undesirable results. We therefore propose a novel loss function that applies a max-pooling transform to the output in order to learn from such sparse fixation pixels.
We summarise the contributions of our work as follows:
A first-of-its-kind approach to learning saliency prediction from sparse fixation pixels instead of fixation blob maps.
A novel loss function for training on such sparse fixation map labels.
II Related Work
Current saliency prediction approaches that learn directly from fixation blob maps can be organized into two categories: traditional approaches, which model handcrafted visual cues, and DNN-based approaches, which model automatically learned visual features.
Traditional saliency prediction approaches are mainly driven by biological and psychological studies of the human attention mechanism. Following the early "Feature Integration Theory" study of the human visual attention mechanism, Koch et al. proposed a biologically plausible visual attention model that combines low-level visual features such as color and contrast to produce a saliency map. Later, Itti et al. proposed a saliency prediction approach based on the behaviour and neural architecture of the primate visual system. They extract low-level visual cues into multiple conspicuity maps and use a dynamic neural network to integrate them into a single saliency map.
, which uses three convolutional layers for feature extraction followed by an SVM for saliency classification. Later, Deep Gaze adopted a deeper architecture, using the convolutional layers of AlexNet to extract feature maps at different levels. The feature maps are then integrated into a saliency map by a learned linear model. Since Deep Gaze introduced transfer learning into saliency prediction, later approaches have extensively used models trained for image classification to boost saliency prediction performance. Kruthiventi et al. proposed a fully convolutional neural network named DeepFix, which uses inception modules and kernels with holes to extract multi-scale features and initializes its early feature extraction layers from a pretrained VGG-16 model. In SALICON, Huang et al. explored multiple models pretrained on image classification for feature extraction in saliency prediction, including AlexNet, VGG-16 and GoogLeNet.
Task-oriented loss functions have also been explored to improve saliency prediction performance. A normalized saliency map can be understood as a spatial probability distribution, and Jetley et al. showed with their PDP model that loss functions based on this distributional perspective outperform standard regression loss functions. They explored multiple probability distribution distances as loss functions, namely $\chi^2$ Divergence, Total-variation Distance, Cosine Distance, Bhattacharyya Distance and Kullback-Leibler Divergence, all of which achieved better performance than the Euclidean and Huber distances. Evaluation metrics for saliency prediction were also explored as loss functions in SALICON: Normalized Scanpath Saliency (NSS), Similarity (Sim), Linear Correlation Coefficient (CC) and Kullback-Leibler Divergence (KLD) were used, with the best performance achieved by the KLD loss.
Different from previous works, we propose a saliency prediction approach that learns from sparse fixation pixel maps rather than fixation blob maps. We perform clustering on the original fixation pixel map to extract a sparse representation for each map, and fine-tune an Inception-V3 model to learn pixel-level saliency information from the new labels. Inspired by task-oriented losses, we use KLD as the loss function and apply max-pooling to the output sparse saliency map to avoid false penalties on nearby but non-overlapping saliency pixels between the output and label maps.
III Proposed Approach
In this section, we introduce our approach for learning saliency prediction from sparse fixation pixel maps. It consists of two steps: constructing a new type of ground truth fixation map, and designing a DNN-based saliency prediction model.
Human attention is mostly drawn by certain objects, and saliency prediction likewise activates mostly on certain salient objects. CNNs hierarchically integrate visual features to perceive different objects, each of which is eventually represented by a strong activation at the center of the object region in the corresponding feature map. Thus, besides regressing a probability distribution map, saliency can also be learned by extracting the saliency level of certain objects in a given image. In most saliency prediction datasets, the final ground truth fixation pixels are constructed by aggregating the fixation pixels from multiple observers. We examined the raw fixation pixels as samples of a discrete distribution and found that they roughly follow a Gaussian mixture distribution, so it is reasonable to assume that the center of a salient object is located at the center of the corresponding fixation pixel cluster. Following this assumption, we can sparsely represent a saliency map by the activation centers of all salient objects in the stimulus, and construct such a sparse activation ground truth by clustering the raw fixation pixel map.
After constructing the sparse fixation label set, we use a model based on a deep convolutional neural network to predict sparse saliency pixels from a given image. Since saliency prediction datasets are usually too small to train large-scale DNNs from scratch, transfer learning is commonly applied to initialize network parameters from well-trained classification models. We also build our model on a pretrained model, and the saliency prediction pipeline of our approach is shown in Figure 2.
III-A Sparse Fixation
The ground truth fixation pixels of each dataset are usually distributed unevenly and in clusters, and each object is represented by multiple fixation pixels. The number of fixation pixels per image ranges from tens to thousands across datasets, due to different recording equipment and strategies. If we take pixels with a gray-scale value greater than 250 as fixation points, the average number of fixation pixels is 66 for MIT1003, 334 for CAT2000, and 4609 for SALICON. For a salient object, the corresponding salient region in the fixation pixel map is filled with randomly sampled fixation and non-fixation pixels. When learning from such fixation pixel ground truth, the non-fixation pixels in the salient area cause false penalties when the loss is calculated during training, making the model harder to train. To learn saliency prediction from fixation points, we need to sparsify the fixation pixels to a level at which roughly one fixation pixel represents one object, while retaining the representative saliency information.
We cluster the raw fixation pixel map to extract sparse fixation pixels. The fixation points are grouped into a fixed number of clusters, and each cluster is represented by its center. The cluster center is calculated as the average pixel location in the cluster, and the centers together form the sparse representation of the fixation information.
To find a clustering method that maximizes fixation sparsity while preserving most of the saliency information, we explored multiple clustering methods with various cluster numbers and parameter settings for fixation pixel sparsification. We visualize several results in Fig. 3 to give an intuitive comparison between clustering methods on a raw fixation map. In Fig. 3, a raw fixation map from the SALICON dataset is clustered into 24 clusters by KMeans, Hierarchical Clustering and Gaussian Mixture. From the raw fixation map we can see two salient objects represented by two spots of fixation pixels, so the ideal clustering would have two main clusters, one on each object, and other clusters with fewer fixation pixels. As shown, the fixation pixels in the two salient spots are grouped into 4 clusters by Hierarchical Clustering, but into 8 clusters by KMeans and Gaussian Mixture. We therefore choose Hierarchical Clustering to sparsify the fixation information.
After selecting the clustering method for sparsification, we explored several cluster numbers to find a setting that maximizes sparsity while preserving as much saliency information as possible. The level of saliency information preservation is evaluated with common saliency prediction metrics, namely AUC-Judd, NSS, Similarity and KL-Divergence, calculated between the Gaussian-blurred raw fixation map and the sparse fixation map. The results and visualizations for different cluster numbers are shown in Fig. 4. We set the cluster number to 24 for clustering the fixation pixel maps.
Finally, we construct a new ground truth dataset of 10000 sparse fixation map labels from the SALICON dataset. We use Hierarchical Clustering with the cluster number set to 24, the affinity set to Euclidean distance and the linkage set to Ward linkage.
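The sparsification step described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `sparsify_fixations` is hypothetical, storing each cluster's sample count at its rounded center is an assumed encoding of the saliency level, and Scikit-Learn's `AgglomerativeClustering` is called with Ward linkage only, which implies Euclidean distance.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def sparsify_fixations(fixation_map, n_clusters=24, threshold=250):
    """Reduce a raw fixation pixel map to at most `n_clusters` weighted pixels."""
    # Pixels brighter than the threshold are treated as fixation points.
    ys, xs = np.nonzero(fixation_map > threshold)
    points = np.stack([ys, xs], axis=1).astype(np.float64)
    sparse = np.zeros(fixation_map.shape, dtype=np.float64)
    if len(points) == 0:
        return sparse

    if len(points) <= n_clusters:
        # Assumption: with too few points, every point is its own cluster.
        labels = np.arange(len(points))
    else:
        model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
        labels = model.fit_predict(points)

    for k in np.unique(labels):
        members = points[labels == k]
        # Cluster center = average pixel location of the cluster members.
        cy, cx = np.rint(members.mean(axis=0)).astype(int)
        # Assumption: the cluster size is stored as the saliency level.
        sparse[cy, cx] += len(members)
    return sparse
```

Calling `sparsify_fixations(raw_map)` on a 480×640 gray-scale fixation map would return a map of the same shape with at most 24 non-zero pixels.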
III-B Network Architecture
We model our approach using deep CNNs in a fully convolutional fashion. We apply the convolutional layers of the Inception-V3 model to extract visual features from input images. The Inception-V3 model is trained on the ImageNet classification dataset with one million images, so its kernels have strong representational power. On top of the Inception-V3 convolutional layers, we add a simple 1×1 kernel to reduce the final feature maps to a single output saliency map with the detected sparse activated pixels.
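To make the role of the final 1×1 kernel concrete, the sketch below applies one in plain NumPy. A 1×1 convolution with a single output channel is a per-pixel weighted sum over the feature channels. The function name `conv1x1`, the channel-first layout and the toy shapes are assumptions for illustration only; in the trained model this layer would be a learned convolution inside the deep learning framework.

```python
import numpy as np


def conv1x1(features, weights, bias=0.0):
    """Collapse a (C, H, W) feature stack into a single (H, W) map."""
    # Each output pixel is the dot product of the kernel weights with the
    # C feature values at that spatial location.
    return np.tensordot(weights, features, axes=([0], [0])) + bias


# Toy example: 8 feature channels on a 30x40 grid (hypothetical sizes).
rng = np.random.default_rng(0)
features = rng.random((8, 30, 40))
weights = rng.random(8)
saliency = conv1x1(features, weights)
```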
The convolutional part of the Inception-V3 model consists of 23 convolution and pooling layers, 5 of which downsample by a factor of 2: 1 convolutional layer with a stride of 2 and 4 pooling layers with a stride of 2. With this structure the input image is downsampled to 1/32 of its original size, that is, 20×15 for the 640×480 images of the selected SALICON dataset. To find an appropriate downsampling scale, we evaluate the accuracy loss between downsampled-then-upsampled fixation maps and the original fixation maps. We evaluate the loss on the SALICON and MIT1003 datasets and plot it in Fig. 5. We can see that the accuracy drops quickly beyond a downsampling factor of 20. Since the 32-times downsampling of the Inception-V3 model may be too large for saliency prediction, we choose a downsampling factor of 16 and modify the original network structure accordingly. Higher layers of a DNN depend strongly on the preceding layers, so we modify the downsampling at the higher layers to minimize the effect on the pretrained features. We replace the stride of 2 in the last inception module with a stride of 1, which keeps the downsampling scale at 1/16.
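The output resolutions quoted above follow from simple arithmetic, sketched below. The helper `feature_map_size` is hypothetical and assumes that every stride-2 layer halves the resolution exactly, ignoring border effects from unpadded convolutions.

```python
def feature_map_size(width, height, num_stride2_layers):
    """Spatial size after a given number of exact 2x downsampling steps."""
    factor = 2 ** num_stride2_layers
    return width // factor, height // factor


# All five stride-2 layers kept: 1/32 of the 640x480 input.
original = feature_map_size(640, 480, 5)
# Last stride of 2 replaced by 1: four stride-2 layers remain, 1/16 of the input.
modified = feature_map_size(640, 480, 4)
```

Under this assumption the modified network produces a 40×30 output map for a 640×480 SALICON image, four times as many output pixels as the unmodified 20×15 map.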
III-C Pooling KLD Loss
Since the saliency map resembles a discrete probability distribution, we use Kullback-Leibler Divergence (KLD) as the loss function. Let $G$ denote the ground truth sparse fixation map and $S$ denote the output sparse saliency map. The original KLD loss takes the form
$$\mathrm{KLD}(G, S) = \sum_{i=1}^{N} G_i \log \frac{G_i}{S_i},$$
where $G$ and $S$ are normalized $N$-dimensional distributions. The original form of KLD performs well for pixel-wise regression, but poorly for the cluster-center approach. KLD calculates the distance between $G$ and $S$ by accumulating the pixel-wise differences between the two distribution matrices, so two closely neighboring activated points, one from each matrix, contribute nothing to the closeness. This causes a false penalty between two spatially close but non-overlapping activated points from the two matrices, and in turn a wrong direction in gradient descent. To tackle this issue, we propose an alternative to the original KLD loss, termed "pooling KLD".
Pooling KLD first derives a new output $S'$ by
$$S' = \mathrm{maxpool}(S),$$
that is, by performing a padded max-pooling on the original $S$, and then calculates the KLD between $G$ and $S'$. Note that we do not normalize $S'$ again: since $G$ in our case is sparse, it acts as a gate for the accumulation, and the false positives caused by max-pooling are ignored wherever the corresponding gate is not activated. The sparse label matrix and the max-pooled prediction matrix allow us to take neighboring information into account and learn the saliency points in a more representative and robust fashion.
We show two sparse fixation maps with nearby but non-overlapping activations in Fig. 6, and calculate the original and pooling KLD between them. As shown, the original KLD between $G$ and $S$ is 5.50577, while the KLD between $G$ and $S'$ is 0.74061. Since pooling KLD efficiently suppresses the false penalty, we use it for training saliency prediction on such sparse fixation maps.
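The effect of pooling KLD can be reproduced on a toy example with the NumPy sketch below. It is an illustration under assumptions rather than the training implementation: the 3×3 window, stride of 1 and zero padding are assumed (the pooling size is not stated here), the small constant `EPS` is added only for numerical stability, and the function names are hypothetical. Inputs are assumed to be non-negative.

```python
import numpy as np

EPS = 1e-7  # assumed stabiliser, avoids log(0) and division by zero


def kld(label, output, eps=EPS):
    """KL divergence of `output` from `label`, gated by the label values."""
    return float(np.sum(label * np.log(eps + label / (output + eps))))


def max_pool_same(x, k=3):
    """Zero-padded k x k max-pooling with stride 1 (output keeps the input shape)."""
    pad = k // 2
    padded = np.pad(x, pad, mode="constant", constant_values=0.0)
    h, w = x.shape
    windows = [padded[dy:dy + h, dx:dx + w] for dy in range(k) for dx in range(k)]
    return np.max(windows, axis=0)


def pooling_kld(label, output, k=3):
    """KLD between the label and the max-pooled output (not re-normalized)."""
    return kld(label, max_pool_same(output, k))


# Two maps whose single activations are neighbours but do not overlap.
label = np.zeros((5, 5))
label[2, 2] = 1.0
output = np.zeros((5, 5))
output[2, 3] = 1.0

plain = kld(label, output)           # large: the activations never coincide
pooled = pooling_kld(label, output)  # close to zero: pooling covers the neighbour
```

Because the label is zero almost everywhere, the extra mass that max-pooling spreads around each predicted point only matters at labelled pixels, which is the gating behaviour described above.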
IV Experimental Results
We construct the sparse fixation training set from the SALIency in CONtext (SALICON) dataset by performing clustering with the Scikit-Learn package. SALICON is the largest open-access dataset in the area of saliency prediction; it consists of 10000 training samples, 5000 validation samples and 5000 testing samples from the MS COCO dataset. The fixation information is gathered with an alternative eye tracking paradigm. Multiple observers use a mouse to direct their fixation on the image stimuli during 5 seconds of free viewing followed by a 2-second waiting interval, and the mouse trajectories are recorded and aggregated to indicate what people find most interesting in each stimulus. We apply the clustering described previously to group the fixation pixels of each sample into 24 clusters, and use the center of each cluster to represent the corresponding salient object.
During training, we use mini-batch training with a batch size of 16 to accelerate convergence and improve generalization. We use the Adam optimizer to fine-tune the Inception-V3 model. Since saliency prediction and image classification can share the same low-level features, we only fine-tune the final 6 inception blocks with higher-level features, by blocking the gradient at layer 6 from back-propagating further. The pretrained Inception-V3 model is downloaded from the model zoo of the MXNet framework. The learning rate for fine-tuning and for learning the final 1×1 kernel is set to 0.00001. The entire training takes about 24 hours on an NVIDIA Tesla K40m GPU with 12 GB of memory, using the MXNet deep learning framework on the Ubuntu 16.04 operating system.
IV-B Evaluation Metrics
At evaluation, multiple metrics are used, since the previous study by Riche et al. shows that no single metric guarantees a fair comparison. We briefly describe the metrics used for a better understanding of the results. In the following description, we denote by $S$ the output saliency map, by $B$ the ground truth fixation blob map and by $F$ the ground truth fixation pixel map.
AUC: Area Under ROC Curve (AUC) measures the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate under varying binary classification thresholds. Three different AUC implementations are mainly used in saliency prediction, namely AUC-Judd, AUC-Borji and shuffled-AUC, which differ in how the true and false positive rates are calculated; we mainly adopt AUC-Judd in the evaluation. The higher the true positive rate and the lower the false positive rate, the larger the AUC and thus the better the performance.
NSS: Normalized Scanpath Saliency (NSS) is the mean value of $S$, normalized to zero mean and unit standard deviation, at the fixation pixel locations in $F$. A larger NSS score represents better performance.
CC: Correlation Coefficient (CC) measures the linear relationship between the saliency matrices $S$ and $B$. A CC score of 1 means $S$ and $B$ are perfectly correlated, while 0 means $S$ and $B$ are uncorrelated. Thus a larger CC score represents better performance.
Sim: Similarity (Sim) first normalizes $S$ and $B$ to $\hat{S}$ and $\hat{B}$, then calculates the sum of the element-wise minimum between $\hat{S}$ and $\hat{B}$. Thus a larger Similarity represents better performance.
KLD: Kullback-Leibler Divergence (KLD) is a non-symmetric metric. It measures the information lost when using $S$ to encode $B$. A smaller KLD score represents better saliency prediction performance.
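The NSS, CC, Sim and KLD descriptions above translate into a few lines of NumPy, sketched below. These are simplified illustrations, not the benchmark's reference implementations: the function and argument names are hypothetical, the constant `EPS` is an assumed stabiliser, sum-normalization is assumed for Sim and KLD, and AUC is omitted because its variants differ in how thresholds and negatives are sampled.

```python
import numpy as np

EPS = 1e-7  # assumed stabiliser for divisions and logarithms


def nss(saliency, fixations):
    """Mean of the standardised saliency map at fixation pixel locations."""
    s = (saliency - saliency.mean()) / (saliency.std() + EPS)
    return float(s[fixations > 0].mean())


def cc(saliency, blob):
    """Pearson linear correlation between the two maps."""
    return float(np.corrcoef(saliency.ravel(), blob.ravel())[0, 1])


def sim(saliency, blob):
    """Sum of element-wise minima after normalising both maps to sum to 1."""
    s = saliency / (saliency.sum() + EPS)
    b = blob / (blob.sum() + EPS)
    return float(np.minimum(s, b).sum())


def kld(saliency, blob):
    """Information lost when the saliency map is used to encode the blob map."""
    s = saliency / (saliency.sum() + EPS)
    b = blob / (blob.sum() + EPS)
    return float(np.sum(b * np.log(EPS + b / (s + EPS))))
```

For a prediction identical to the blob map, `cc` and `sim` are approximately 1 and `kld` is approximately 0, which matches the "larger is better" and "smaller is better" readings given above.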
We evaluate our approach on multiple benchmark datasets, and the results are as follows:
MIT300 We mainly evaluate our model on the test set of the MIT300 benchmark dataset. The MIT300 benchmark dataset is composed of 300 samples with various indoor and outdoor scenes and objects. The fixation information is obtained by directly recording the eye movements of 39 observers during 3 seconds of free viewing of each sample. To avoid overfitting to the dataset, the ground truth fixation maps are held out on the benchmark server for remote evaluation, and submissions are limited to 2 per month. The MIT300 samples range in width from 679 to 1024 pixels and in height from 457 to 1024 pixels, which is larger than the SALICON samples that we train our model on. Thus, when evaluating on the MIT300 dataset, we first resize each sample so that its short side is 480 pixels, scaling the long side accordingly. The evaluation results are shown in Table 1.
CAT2000 We also evaluate our approach on the CAT2000 benchmark dataset. The CAT2000 dataset consists of a training set with accessible ground truth and a testing set with held-out ground truth fixation maps. The training and testing sets each contain 20 different categories (100 images per category), from Action to Line Drawing. The fixations are aggregated from 5 seconds of free viewing by 24 observers. Since the sample size of the CAT2000 dataset is 1920×1080, we resize the samples to 854×480 for evaluation.
The evaluation on both datasets shows the practicability of learning saliency prediction from fixation pixels.
In this work, we propose a first-of-its-kind method for learning saliency prediction from sparse fixation pixel maps instead of Gaussian-blurred fixation maps. A sparse fixation pixel map is extracted by hierarchically clustering the raw fixation ground truth and using the center and sample count of each cluster to represent the location and saliency level of the corresponding object. To tackle the problem of false penalties in sparse fixation regression, we propose a novel loss function with max-pooling on the output. The proposed approach achieves state-of-the-art performance on multiple benchmark datasets and provides a novel perspective on how saliency prediction can be learned.
-  Marchesotti, Luca, C. Cifarelli, and G. Csurka. ”A framework for visual saliency prediction with applications to image thumbnailing.” IEEE, International Conference on Computer Vision IEEE, 2009:2232-2239.
-  Achanta, Radhakrishna, and S. Susstrunk. ”Saliency prediction for content-aware image resizing.” IEEE International Conference on Image Processing 2009:1005-1008.
-  Borji, Ali, et al. ”Online learning of task-driven object-based visual attention control.” Image and Vision Computing 28.7 (2010): 1130-1145.
-  Dankers, Andrew, N. Barnes, and A. Zelinsky. ”A Reactive Vision System: Active-Dynamic Saliency.” 2007.
-  Koch, C, and S. Ullman. ”Shifts in selective visual attention: towards the underlying neural circuitry.” Hum Neurobiol 4.4(1987):219-227.
-  Kruthiventi, Srinivas S. S., K. Ayush, and R. V. Babu. ”DeepFix: A Fully Convolutional Neural Network for Predicting Human Eye Fixations.” IEEE Transactions on Image Processing 26.9(2017):4446-4456.
-  Huang, Xun, et al. ”SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks.” IEEE International Conference on Computer Vision IEEE Computer Society, 2015:262-270.
-  Treisman, Anne, and Garry Gelade. ”A feature-integration theory of attention.” Cognitive Psychology 12.1 (1980): 97-136.
-  Koch, C, and S. Ullman. ”Shifts in selective visual attention: towards the underlying neural circuitry.” Hum Neurobiol 4.4(1987):219-227.
-  Itti, Laurent, C. Koch, and E. Niebur. A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Computer Society, 1998.
-  Simonyan, Karen, and A. Zisserman. ”Very Deep Convolutional Networks for Large-Scale Image Recognition.” Computer Science (2014).
-  He, Kaiming, et al. ”Deep Residual Learning for Image Recognition.” Computer Vision and Pattern Recognition IEEE, 2016:770-778.
-  Wang, Naiyan, and D. Y. Yeung. ”Learning a deep compact image representation for visual tracking.” International Conference on Neural Information Processing Systems Curran Associates Inc. 2013:809-817.
-  Ma, Chao, et al. ”Hierarchical Convolutional Features for Visual Tracking.” IEEE International Conference on Computer Vision IEEE, 2016:3074-3082.
-  Vig, Eleonora, M. Dorr, and D. Cox. ”Large-Scale Optimization of Hierarchical Features for Saliency Prediction in Natural Images.” Computer Vision and Pattern Recognition IEEE, 2014:2798-2805.
-  Kummerer, Matthias, L. Theis, and M. Bethge. ”Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet.” Computer Science (2014).
-  Krizhevsky, Alex, I. Sutskever, and G. E. Hinton. ”ImageNet classification with deep convolutional neural networks.” International Conference on Neural Information Processing Systems Curran Associates Inc. 2012:1097-1105.
-  Kummerer, Matthias, T. S. A. Wallis, and M. Bethge. ”DeepGaze II: Reading fixations from deep features trained on object recognition.” (2016).
-  Jiang, Ming, et al. ”SALICON: Saliency in Context.” Computer Vision and Pattern Recognition IEEE, 2015:1072-1080.
-  Szegedy, Christian, et al. ”Going deeper with convolutions.” IEEE Conference on Computer Vision and Pattern Recognition IEEE Computer Society, 2015:1-9.
-  Jetley, Saumya, N. Murray, and E. Vig. ”End-to-End Saliency Mapping via Probability Distribution Prediction.” Computer Vision and Pattern Recognition IEEE, 2016:5753-5761.
-  Szegedy, Christian, et al. ”Rethinking the Inception Architecture for Computer Vision.” computer vision and pattern recognition (2016): 2818-2826.
-  Judd, T, et al. ”Learning to predict where humans look.” IEEE, International Conference on Computer Vision IEEE, 2010:2106-2113.
-  Borji, Ali, and L. Itti. ”CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research.” Computer Science (2015).
-  Jiang, Ming, et al. ”SALICON: Saliency in Context.” Computer Vision and Pattern Recognition IEEE, 2015:1072-1080.
-  Rokach, Lior, and Oded Maimon. ”Clustering methods.” Data mining and knowledge discovery handbook. Springer US, 2005. 321-352.
-  Lindsay, Bruce, et al. ”Mixture Models: Inference and Applications to Clustering.” Journal of the American Statistical Association 84.405(1989):337.
-  Deng, Jia, et al. ”ImageNet: A large-scale hierarchical image database.” Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on IEEE, 2009:248-255.
-  Lin, Tsung Yi, et al. ”Microsoft COCO: Common Objects in Context.” 8693(2014):740-755.
-  Kingma, Diederik, and J. Ba. ”Adam: A Method for Stochastic Optimization.” Computer Science (2014).
-  Chen, Tianqi, et al. ”MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems.” Statistics (2015).
-  Riche, Nicolas, et al. ”Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics.” IEEE International Conference on Computer Vision IEEE, 2014:1153-1160.
-  Borji, Ali, et al. ”Analysis of Scores, Datasets, and Models in Visual Saliency Prediction.” IEEE International Conference on Computer Vision IEEE Computer Society, 2013:921-928.
-  Zhang, L., et al. ”SUN: A Bayesian framework for saliency using natural statistics. ” J Vis 8.7(2008):32.1.
-  Peters, R. J., et al. ”Components of bottom-up gaze allocation in natural images.” Vision Research 45.18(2005):2397-2416.
-  Judd, Tilke, F. Durand, and A. Torralba. ”A Benchmark of Computational Models of Saliency to Predict Human Fixations.” (2012).
-  Zhang, Jianming, and S. Sclaroff. ”Saliency prediction: A Boolean Map Approach.” IEEE International Conference on Computer Vision IEEE Computer Society, 2013:153-160.
-  Liu, Nian, et al. ”Predicting eye fixations using convolutional neural networks.” Computer Vision and Pattern Recognition IEEE, 2015:362-370.
-  Schölkopf, Bernhard, J. Platt, and T. Hofmann. ”Graph-Based Visual Saliency.” International Conference on Neural Information Processing Systems MIT Press, 2006:545-552.
-  Fang, Shu, et al. ”Learning Discriminative Subspaces on Random Contrasts for Image Saliency Analysis.” IEEE Transactions on Neural Networks & Learning Systems 28.5(2017):1095-1108.
-  Goferman, S, L. Zelnikmanor, and A. Tal. ”Context-aware saliency prediction. ” IEEE Transactions on Pattern Analysis & Machine Intelligence 34.10(2012):1915-1926.
-  Cornia, Marcella, et al. ”Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model.” (2016).
-  Pedregosa, Fabian, et al. ”Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12(2011):2825-2830.
-  Walther, D, and C. Koch. ”Modeling attention to salient proto-objects. ” Neural Networks the Official Journal of the International Neural Network Society 19.9(2006):1395.
-  Cornia, Marcella, et al. ”A Deep Multi-Level Network for Saliency Prediction.” (2016).