Saliency detection aims to extract a mask, termed a "saliency map", that gives the probability of each part of a given image being the most attractive to a viewer. Such a saliency map can be utilized in multiple applications, such as image thumbnailing and content-aware image resizing. By reducing the field to be perceived to a specific salient region, it can also be used as a preprocessing step to speed up other visual tasks, e.g. [3, 4]. These extensive uses have made saliency detection popular, and many approaches have recently been proposed for it.
Recent approaches adopt deep convolutional neural networks (DCNNs) to boost saliency detection. Multiple DCNN models such as [10, 16] have revolutionized large-scale image classification, and have since been extensively used in other visual tasks through transfer learning, such as target tracking, image segmentation and saliency detection. Among previous DCNN-based saliency detection approaches, for example, DeepFix utilizes VGG16 to initialize the parameters of its lower layers, and DeepGaze II utilizes VGG19 to initialize all of its feature extraction layers. Though the representative power of the transferred models helps these methods achieve high performance, the computational cost also rises as the transferred models become increasingly computation and storage intensive.
The computational cost of DCNNs can mainly be described by the network workload, which is the theoretical number of basic operations needed in the DCNN computation from the algorithmic aspect. It can also be expressed as the size of the temporary intermediate data generated by the layer-wise forward calculation, and as the model size determined by the number of parameters. From the hardware perspective, the computational cost can be described by CPU, RAM and disk consumption. Recently, deploying DNNs on mobile devices has received increasing attention, while the computational power of mobile devices is limited compared with PCs and servers. To close the gap between the high computational complexity of DNNs and the low computational capacity of mobile devices, multiple methods have been proposed. SqueezeNet proposed a bottleneck structure that reduces the computational cost by reducing the number of channels before feature extraction, and achieved AlexNet-level accuracy with far fewer parameters. MobileNet built its network on depthwise separable convolution to achieve much lower computational cost, and MobileNetV2 extended it with the inverted residual block.
As a simulation of the early-stage visual attention task, saliency detection naturally calls for low computational cost, yet few efforts have been made in this direction. In this work, we propose a novel saliency detection approach with distinctively lower computational cost compared with previous methods. To achieve a light-weighted network structure, we utilize depthwise separable convolutional blocks to reduce the number of network parameters, and we replace the processing of the whole image with the processing of a series of regions to reduce the spatial scale of the intermediate feature maps. We also build our approach from scratch with a simpler model structure instead of transferring from trained image classification networks.
We first evaluate the saliency detection performance of our approach on multiple benchmark datasets, and achieve competitive results compared with state-of-the-art methods. We then evaluate the computational cost by recording the runtime RAM cost, the time consumption for processing a single image, and the model size, which shows that our model is distinctively lighter.
II Related Works
In this section, we introduce related works on improving saliency detection performance and on designing light-weighted neural networks.
Traditional saliency detection approaches are mainly driven by biological and psychological studies of the human attention mechanism. Following the early "Feature Integration Theory" of human visual attention, Koch et al. propose a biologically plausible visual attention model which combines low-level visual features such as color and contrast to produce a saliency-indicating map. Later, Itti et al. propose a saliency detection approach based on the behaviour and neural architecture of the primate visual system. They extract low-level visual cues into multiple conspicuity maps, and use a dynamic neural network to integrate them into a single saliency map.
More recently, DCNN-based approaches have improved the state of the art of saliency detection by a great margin. An early work applying CNNs to saliency detection is ensembles of Deep Networks (eDN), which uses three convolutional layers for feature extraction followed by an SVM for saliency classification. Later, DeepGaze adopts a deeper architecture with convolutional layers from AlexNet to extract feature maps at different levels. The feature maps are then integrated into a saliency map by a learned linear model. As DeepGaze introduced transfer learning into saliency detection, later approaches extensively utilize trained models from image classification to boost saliency detection performance. Kruthiventi et al. propose a fully convolutional neural network model named DeepFix, which utilizes inception modules and kernels with holes to extract multi-scale features, and applies a pretrained VGG-16 model to initialize its early feature extraction layers. In the work of SALICON, Huang et al. explore multiple pretrained image classification models for feature extraction in saliency detection, including AlexNet, VGG-16 and GoogLeNet. As the performance improves, the computational cost of DCNN-based saliency detection approaches keeps increasing.
As DNN deployment on mobile and embedded hardware grows in popularity, the gap between the limited computational capacity of such hardware and the computational complexity of popular DNNs also widens. Deep Compression proposed multiple techniques, namely pruning, trained quantization and Huffman coding, to reduce the storage requirement of DNNs. Though Deep Compression successfully compresses the size of the network model file, the runtime memory cost is not reduced, since the original network structure needs to be recovered from the compressed network data when performing forward computation.
Recently, efforts on designing efficient models from the bottom up have drawn more attention. SqueezeNet proposes a paradigm for designing more light-weighted neural networks, including replacing 3×3 kernels with 1×1 kernels and decreasing the number of input channels to the 3×3 kernels. MobileNet adopts depth-wise separable convolution to design a highly light-weighted neural network feasible for mobile devices. ShuffleNet utilizes group convolution and channel shuffle to reduce the model size while maintaining high performance.
In this work, by utilizing these efficient model design paradigms, we propose a DCNN-based saliency detection approach with distinctively lower computational cost compared with previous methods.
III Proposed Approach
In this section, we introduce the details of our proposed approach, whose main objective is light-weighted saliency detection. To achieve this, we adopt three strategies when designing the network architecture:
Depthwise Separable Convolution: Replacing the normal convolution layers with depth-wise separable convolutional blocks to reduce the model size from the bottom up;
Regional Input: Processing a series of multi-resolution region pairs instead of the whole image to reduce the size of the feature map data blobs;
Simplified Networks: Constructing a network with less depth and width from scratch, instead of transferring from large-scale image classification models, to reduce the model size. We use two fully convolutional neural networks with independent parameters for feature extraction.
The pipeline of our approach is visualized in Fig.3, and can be briefly described as follows: 1) The raw input image and the border-padded input image are cropped into series of regions at the same center locations and resized to the same size; 2) The two regions of each pair are fed into two fully convolutional networks with independent parameters for hierarchical feature extraction, which produce two saliency regions respectively; 3) The two saliency regions are merged by element-wise multiplication; 4) The series of merged saliency regions are resized and concatenated to produce the final saliency map.
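Steps 3 and 4 of the pipeline can be illustrated with a minimal sketch. It assumes that saliency regions are plain nested lists of floats and that the regions are ordered row by row over the image grid. The helper names `merge_regions` and `stitch_regions` are hypothetical, and the resizing step is omitted.

```python
def merge_regions(fine, coarse):
    """Element-wise product of two equally sized 2-D saliency regions."""
    return [[f * c for f, c in zip(fine_row, coarse_row)]
            for fine_row, coarse_row in zip(fine, coarse)]


def stitch_regions(regions, cols):
    """Concatenate a row-major list of equally sized regions into one map.

    `cols` is the number of regions in each row of the grid.
    """
    saliency_map = []
    for start in range(0, len(regions), cols):
        row_of_regions = regions[start:start + cols]
        region_height = len(row_of_regions[0])
        for y in range(region_height):
            line = []
            for region in row_of_regions:
                line.extend(region[y])
            saliency_map.append(line)
    return saliency_map
```

Multiplication keeps a location salient only when both the fine and the coarse network respond to it, which is one way to read the choice of an element-wise product for merging.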
We describe the details of the three adopted strategies in the following.
III-A Network Architecture
Depthwise Separable Convolution: The fully convolutional feature extraction networks are built on stacked depth-wise separable convolutional blocks with the inverted residual and linear bottleneck introduced in MobileNetV2. Recent works on designing more efficient neural networks extensively adopt depth-wise separable convolution as the key building block, e.g. [34, 35, 36]. The basic idea by which depth-wise separable convolution achieves efficiency is to separate the channel-wise and spatial-wise computation, splitting one traditional convolutional layer into one depth-wise and one point-wise convolutional layer.
Assume a convolutional layer with $c_{in}$ input channels, $c_{out}$ output channels, kernel size $k \times k$ and feature map size $h \times w$. The computation needed to produce the output feature map by traditional convolution (assuming zero padding keeps the input and output feature map sizes consistent) is $k^2 c_{in} c_{out} h w$, while the computation of depth-wise separable convolution with the same setting is $k^2 c_{in} h w + c_{in} c_{out} h w$. Thus the ratio of the computation of depth-wise separable convolution to that of traditional convolution can be represented as

$$\frac{k^2 c_{in} h w + c_{in} c_{out} h w}{k^2 c_{in} c_{out} h w} = \frac{1}{c_{out}} + \frac{1}{k^2} \qquad (1)$$
Thus according to Equation.1, the bigger the kernel size is, the larger the reduction. For a common choice of 128 output channels and a 3×3 kernel, the depth-wise separable convolution reduces the cost by approximately 8 times. Based on this, the adopted depth-wise separable convolutional block adds one expansion layer with a 1×1 kernel before the depth-wise convolution, which expands the input channels to support more sufficient feature extraction. As shown in Figure.2, the depth-wise separable convolutional block consists of three convolutional layers: an expansion convolutional layer, a single-channel (depth-wise) convolution layer, and a bottleneck convolution layer.
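The reduction quoted for 128 output channels can be checked numerically. The sketch below assumes the usual multiply-accumulate cost model for a stride-1, zero-padded convolution; the function names are illustrative and not taken from the original work.

```python
def standard_conv_cost(c_in, c_out, k, h, w):
    """Multiply-accumulate count of a traditional k x k convolution."""
    return k * k * c_in * c_out * h * w


def separable_conv_cost(c_in, c_out, k, h, w):
    """Cost of a depth-wise k x k convolution followed by a 1 x 1 point-wise one."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


def reduction_factor(c_in, c_out, k, h, w):
    """How many times cheaper the separable form is than the traditional one."""
    return (standard_conv_cost(c_in, c_out, k, h, w)
            / separable_conv_cost(c_in, c_out, k, h, w))


# 128 output channels and a 3 x 3 kernel; the input channel number and the
# feature map size cancel out of the ratio, so their values are arbitrary here.
factor = reduction_factor(c_in=64, c_out=128, k=3, h=80, w=80)
```

Under this cost model the factor equals 1 / (1/128 + 1/9), which is about 8.4, in line with the "approximately 8 times" stated above.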
We also adopt the linear bottleneck and inverted residual in the depth-wise separable convolutional block, which use a linear activation instead of a ReLU activation after the bottleneck convolution layer, and add an element-wise addition of the block input to its output if their channel numbers are consistent. By stacking the bottleneck blocks and various convolutional layers, we construct the feature extraction network for our saliency detection approach. The detailed settings of the network are shown in Table.1.
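The data flow of such a block can be sketched in plain Python. This is an illustrative forward pass only, under several assumptions: tensors are nested lists indexed as [channel][row][column], the stride is 1 with zero padding, biases and normalization layers are left out, and all function names are hypothetical. The actual layer settings are those of Table.1.

```python
def relu(value):
    return value if value > 0.0 else 0.0


def pointwise_conv(x, weights, activation=None):
    """1 x 1 convolution. x: [C_in][H][W]; weights: [C_out][C_in]."""
    height, width = len(x[0]), len(x[0][0])
    out = []
    for w_row in weights:
        channel = [[sum(w_row[c] * x[c][i][j] for c in range(len(x)))
                    for j in range(width)] for i in range(height)]
        if activation is not None:
            channel = [[activation(v) for v in row] for row in channel]
        out.append(channel)
    return out


def depthwise_conv(x, kernels, activation=None):
    """Per-channel k x k convolution with zero padding; kernels: [C][k][k], odd k."""
    height, width = len(x[0]), len(x[0][0])
    k = len(kernels[0])
    radius = k // 2
    out = []
    for channel_in, kernel in zip(x, kernels):
        channel = []
        for i in range(height):
            row = []
            for j in range(width):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        y, xx = i + di - radius, j + dj - radius
                        if 0 <= y < height and 0 <= xx < width:
                            acc += kernel[di][dj] * channel_in[y][xx]
                row.append(activation(acc) if activation else acc)
            channel.append(row)
        out.append(channel)
    return out


def inverted_residual_block(x, w_expand, k_depthwise, w_project):
    """Expansion (ReLU) -> depth-wise (ReLU) -> linear bottleneck, plus skip."""
    expanded = pointwise_conv(x, w_expand, activation=relu)
    filtered = depthwise_conv(expanded, k_depthwise, activation=relu)
    projected = pointwise_conv(filtered, w_project)  # linear: no activation
    if len(projected) == len(x):  # residual only when channel numbers match
        projected = [[[a + b for a, b in zip(row_in, row_out)]
                      for row_in, row_out in zip(ch_in, ch_out)]
                     for ch_in, ch_out in zip(x, projected)]
    return projected
```

Leaving the last 1×1 layer without an activation is what "linear bottleneck" refers to, and the skip connection is applied between the narrow ends of the block rather than the expanded middle.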
Regional Input: To reduce the memory cost of temporarily storing feature maps with large spatial sizes, we replace the processing of the whole image with the processing of a series of regions. For example, with the same network structure, the memory cost of processing a 640×480 image is 48 times larger than that of processing an 80×80 image. Thus we crop the input image into regions and sequentially produce the saliency regions, then concatenate the output regions into the final saliency map.
Since we process sample images as cropped non-overlapping regions, the features are extracted from a more local and object-oriented perspective, resulting in limited representative power from a more global perspective. Thus we apply multi-scale feature extraction to learn richer semantics, with two networks extracting features from relatively coarse and fine resolutions simultaneously. The two networks share the same structure shown in Table.1, and are trained independently on fine and coarse resolution regions of size 80×80. We obtain the two input regions as follows:
Resizing the input image so that its short axis is 480 pixels, with the long axis scaled accordingly, to get the fine input image, then padding the fine input image by 80 pixels at each end of each axis by copying the border to get the coarse input image;
Cropping the fine-resolution input image into a series of non-overlapping 80×80 regions;
Cropping a series of larger regions from the coarse-resolution input image at the same center locations as the fine-resolution regions, then resizing them to 80×80.
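The cropping geometry of these steps can be sketched as follows. The 80-pixel tile is inferred from 48 non-overlapping regions in a 640×480 image. The coarse crop extent used here, the tile extended by the 80-pixel padding on every side (240×240), is an assumption that the text does not state explicitly, and `region_boxes` is a hypothetical helper name.

```python
def region_boxes(width, height, tile=80, pad=80):
    """Return (fine_box, coarse_box) pairs as (left, top, right, bottom).

    Fine boxes index the resized image. Coarse boxes index the border-padded
    image, whose origin is shifted by `pad` pixels along both axes.
    """
    boxes = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            fine = (left, top, left + tile, top + tile)
            # In padded coordinates the tile starts at (left + pad, top + pad);
            # extending it by `pad` on every side keeps the same centre.
            coarse = (left, top, left + tile + 2 * pad, top + tile + 2 * pad)
            boxes.append((fine, coarse))
    return boxes


boxes = region_boxes(640, 480)  # 8 columns x 6 rows of region pairs
```

With these assumed sizes the last coarse box ends exactly at the corner of the 800×640 padded image, so no coarse crop falls outside the padding.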
Simplified Networks: Before designing the network structure, we visualized the feature activations at the top layers of image classification methods and of saliency detection methods transferred from image classification networks. We find that the activations of the saliency detection methods are denser and contain multiple nearly identical feature maps, while the activations of the image classification methods are sparse and contain no obviously identical feature maps. This indicates that, with the same structure, the classification network is redundant for the saliency detection task.
To avoid the extra computational complexity caused by this redundancy, we build our network from scratch instead of transferring from an existing model such as VGGNet. The network includes 1 convolutional layer and 12 depth-wise separable convolutional block layers for feature extraction, with fewer channels at the bottom layers.
III-B Model Training
After the model construction, we train our approach using gradient descent. We use the mean absolute error as the loss function to measure the distance between the output saliency region and the label region, and use the Adam optimizer to update the parameters with the initial learning rate set to 0.001.
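As a minimal sketch of this training objective, the code below computes the mean absolute error over a flattened region and applies one Adam update. Only the learning rate of 0.001 comes from the text. The values `beta1=0.9`, `beta2=0.999` and `eps=1e-8` are the commonly used Adam defaults and are assumptions here, as are the function names. For simplicity the update is applied directly to the predicted values rather than to network weights.

```python
import math


def mean_absolute_error(pred, label):
    """Average absolute difference between prediction and label."""
    return sum(abs(p - t) for p, t in zip(pred, label)) / len(pred)


def mae_gradient(pred, label):
    """(Sub)gradient of the mean absolute error with respect to `pred`."""
    n = len(pred)
    return [(1.0 if p > t else -1.0 if p < t else 0.0) / n
            for p, t in zip(pred, label)]


def adam_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the step count and both moment estimates."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - beta2 ** t)  # bias-corrected second moment
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


pred, label = [0.0, 1.0], [1.0, 1.0]
loss = mean_absolute_error(pred, label)
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
updated = adam_step(pred, mae_gradient(pred, label), state)
```

On the first step Adam moves each value with a non-zero gradient by roughly the learning rate, regardless of the gradient magnitude.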
We train our approach on the SALICON saliency detection dataset. The SALICON dataset is currently the biggest dataset for the saliency detection task, with 10000 training samples, 5000 validation samples and 5000 testing samples from the MS COCO dataset. The stimuli sample set consists of various indoor and outdoor scenes and objects with rich semantics. The ground truth fixation information is obtained by recording the mouse trajectories of multiple observers who use the mouse to direct their fixation on the image stimuli during 5 seconds of free viewing followed by a 2-second waiting interval. To use SALICON to train our approach, each 640×480 sample image is cropped into 48 fine and coarse resolution regions, forming a new training set with 480000 samples.
IV Experimental Results
In this section, we run experiments to evaluate both performance and computational cost. The performance results describe the saliency detection performance of our approach, and the computational cost results show how far we achieve the light-weight objective. We evaluate the performance on two benchmark datasets.
MIT300: We mainly evaluate our model on the testing set of the MIT300 benchmark dataset. The MIT300 benchmark dataset is composed of 300 samples with various indoor and outdoor scenes and objects. The fixation information is extracted by directly recording the eye movements of 39 observers during 3 seconds of free viewing of each sample. To avoid overfitting the dataset, the ground truth fixation maps are held out at the benchmark server for remote evaluation, and submissions are limited to 2 per month. The sample sizes of MIT300 range from 679 to 1024 on the x-axis and from 457 to 1024 on the y-axis, which is larger than the SALICON samples that we train our model on. Thus when evaluating on the MIT300 dataset, we first resize each sample so that its short axis is 480, with the long axis scaled accordingly. The evaluation results are shown in Table.2.
CAT2000: We also evaluate our approach on the CAT2000 benchmark dataset. The CAT2000 dataset consists of one training set with accessible ground truth and one testing set with held-out ground truth fixation maps. The training and testing sets each contain 20 different categories (100 images for each one), from Action to Line Drawing. The fixations are integrated from 5 seconds of free viewing by 24 observers. Since the sample size of the CAT2000 dataset is 1920×1080, we resize the samples to 854×480 for evaluation.
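The resizing rule used for both benchmarks can be sketched as below. Rounding the long axis up is an assumption chosen because it reproduces the 854×480 size quoted for CAT2000; the text does not state the rounding rule, and `resize_to_short_axis` is a hypothetical helper name.

```python
def resize_to_short_axis(width, height, short=480):
    """Target (width, height) with the short axis set to `short`.

    The long axis is scaled by the same factor and rounded up (an assumption).
    """
    if width >= height:
        return (width * short + height - 1) // height, short
    return short, (height * short + width - 1) // width


cat2000_size = resize_to_short_axis(1920, 1080)
```

A 1024×768 MIT300 sample maps to 640×480 under the same rule, matching the size of the SALICON training images.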
IV-A Performance Metrics and Result
In the performance evaluation, multiple metrics are used, since the previous study by Riche et al. shows that no single metric can guarantee a fair comparison. We briefly describe the metrics used for better understanding of the results. The descriptions below refer to three maps: the output saliency map, the ground-truth fixation map with Gaussian blur, and the ground-truth fixation pixel map.
AUC: Area Under ROC Curve (AUC) measures the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate under different binary classifier thresholds applied to the saliency map. Three AUC implementations are mainly used in saliency detection, namely AUC-Judd, AUC-Borji and shuffled-AUC. They differ in how the true and false positive rates are calculated. The higher the true positive rate and the lower the false positive rate, the larger the AUC, and thus the better the performance.
EMD: Earth Mover's Distance (EMD) normalizes the output saliency map and the blurred ground-truth fixation map into two 2-dimensional distributions, and calculates the minimal cost of transforming one into the other. Thus a smaller EMD score represents better performance.
NSS: Normalized Scanpath Saliency is the mean value, at the ground-truth fixation pixel locations, of the output saliency map normalized to zero mean and unit standard deviation. A larger NSS score represents better performance.
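A minimal sketch of NSS follows, assuming the saliency map is a 2-D list of floats, the fixation pixel map is a 2-D list of 0/1 values of the same shape, and the population standard deviation is used for normalization (some implementations use the sample standard deviation instead). The function name is illustrative.

```python
import math


def nss(saliency, fixations):
    """Mean normalized saliency value at the fixated pixels."""
    values = [v for row in saliency for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    scores = [(s - mean) / std
              for s_row, f_row in zip(saliency, fixations)
              for s, f in zip(s_row, f_row) if f]
    return sum(scores) / len(scores)
```

A positive score means the fixated pixels received above-average saliency, and a negative score means the model placed its saliency elsewhere.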
Table 4 columns: model; speed (seconds per sample); total memory (MB); net memory (MB); model size (MB); parameter; computation.
CC: Correlation Coefficient (CC) measures the linear relationship between the output saliency map and the blurred ground-truth fixation map. A CC score of 1 means the two maps are perfectly linearly correlated, while 0 means they are uncorrelated. Thus a larger CC score represents better performance.
Sim: Similarity (Sim) first normalizes the output saliency map and the blurred ground-truth fixation map so that each sums to 1, then calculates the sum of the element-wise minimum between the two. Thus a larger Similarity score represents better performance.
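The CC and Sim metrics can be sketched in a few lines, assuming both maps are non-constant, non-negative 2-D lists of the same shape. The function names are illustrative.

```python
import math


def _flatten(matrix):
    return [v for row in matrix for v in row]


def correlation_coefficient(saliency, fixation_map):
    """Pearson correlation between the two maps."""
    s, g = _flatten(saliency), _flatten(fixation_map)
    mean_s, mean_g = sum(s) / len(s), sum(g) / len(g)
    cov = sum((a - mean_s) * (b - mean_g) for a, b in zip(s, g))
    var_s = sum((a - mean_s) ** 2 for a in s)
    var_g = sum((b - mean_g) ** 2 for b in g)
    return cov / math.sqrt(var_s * var_g)


def similarity(saliency, fixation_map):
    """Sum of element-wise minima after scaling each map to sum to 1."""
    s, g = _flatten(saliency), _flatten(fixation_map)
    total_s, total_g = sum(s), sum(g)
    return sum(min(a / total_s, b / total_g) for a, b in zip(s, g))
```

Sim ranges from 0 for maps with no overlap to 1 for identical distributions, while CC is unaffected by a positive rescaling of either map.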
KLD: Kullback-Leibler Divergence (KLD) is a non-symmetric metric. It measures the information lost when the output saliency map is used to encode the ground-truth fixation map. A smaller KLD score represents better saliency detection performance.
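A minimal sketch of KLD is given below. It assumes both maps are non-negative 2-D lists that are first scaled to sum to 1, and it adds a small constant `eps` for numerical stability, which is a common convention rather than something specified in the text. The function name is illustrative.

```python
import math


def kl_divergence(saliency, fixation_map, eps=1e-12):
    """KL divergence of the ground-truth distribution from the predicted one."""
    s = [v for row in saliency for v in row]
    g = [v for row in fixation_map for v in row]
    total_s, total_g = sum(s), sum(g)
    predicted = [a / total_s for a in s]
    truth = [b / total_g for b in g]
    return sum(p * math.log(eps + p / (q + eps))
               for p, q in zip(truth, predicted))
```

Swapping the two arguments generally changes the result, which is what "non-symmetric" means here.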
We evaluate the saliency detection performance of our approach on the test sets of the MIT300 and CAT2000 benchmark datasets; the results are shown in Table.2 and Table.3 respectively. From the tables we can see that our approach achieves competitive results compared with previous methods transferred from large-scale image classification models, and outperforms traditional and shallow network based methods.
IV-B Computational Cost Measurements and Result
As the goal of this work is to propose a light-weighted DNN-based architecture for saliency prediction, we evaluate the computational complexity of our approach and of some previous works. We briefly describe the measurements we use for the computational complexity evaluation. We run all the tests on an Intel Core i7-4710MQ CPU with the MXNet deep learning framework.
Speed: speed is the time cost of processing one 640×480 image. Since our approach processes a 640×480 image by processing the 48 cropped 80×80 regions, we evaluate the speed of our approach by measuring the average time cost of processing a whole image, including the cropping and concatenating procedures.
Memory Cost: memory cost is the average RAM consumption when processing one 640×480 image. Since the deep learning framework has extra memory requirements besides the RAM actually consumed by the model, we give two results for memory cost: one for the total memory cost when running the model, and one for the net consumption, which is the total cost minus the minimal RAM requirement of the framework.
Model Size: we evaluate the model size by measuring the bytes of the parameter file for each model. The parameter file is a cross platform .params file.
Parameter Size: parameter size is the total number of learnable parameters of each model, such as the convolution kernels and fully connected weights. It is calculated from the kernel size and the channel number of each layer, where the input channel number of the first layer is that of the input image, which is 3 for an RGB image.
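As an illustration of how such a count can be obtained, the sketch below sums the kernel weights of a stack of traditional convolutional layers. The per-layer formula (kernel size squared times input channels times output channels, without biases) is the standard count for a traditional convolution and is an assumption about the form of the measurement; depth-wise separable layers would need their depth-wise and point-wise parts counted separately. The function name is illustrative.

```python
def conv_parameter_count(kernel_sizes, channels):
    """Number of kernel weights in a stack of traditional convolutions.

    channels[0] is the input channel number (3 for an RGB image) and
    channels[i] is the output channel number of layer i, whose kernel
    size is kernel_sizes[i - 1].
    """
    return sum(k * k * c_prev * c_next
               for k, c_prev, c_next in zip(kernel_sizes, channels, channels[1:]))


# Two 3 x 3 layers on an RGB input: 3 -> 16 -> 32 channels.
example_parameters = conv_parameter_count([3, 3], [3, 16, 32])
```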
Computation Size: computation size is the total number of computations needed to process one 640×480 image. For our approach, we measure the total computation by multiplying the computation of processing two regions (the coarse and fine resolution regions) by 48. It is calculated from the kernel size and channel numbers of each layer together with the width and height of its feature map.
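The whole-image computation can be sketched in the same way. The per-layer cost used here (kernel size squared times input channels times output channels times feature map width and height) is the standard count for a traditional convolution and is an assumption about the form of the measurement. The factor of 48 and the two regions per location come from the text. The function name and the single-layer example are purely illustrative.

```python
def conv_computation_count(kernel_sizes, channels, feature_sizes):
    """Multiply-accumulate count of a stack of traditional convolutions.

    feature_sizes[i - 1] is the (width, height) of the output of layer i.
    """
    return sum(k * k * c_prev * c_next * w * h
               for k, c_prev, c_next, (w, h)
               in zip(kernel_sizes, channels, channels[1:], feature_sizes))


# One 3 x 3 layer, 3 -> 16 channels, on an 80 x 80 region.
per_region = conv_computation_count([3], [3, 16], [(80, 80)])

# A fine and a coarse region per location, 48 locations per image.
per_image = 48 * 2 * per_region
```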
The results are shown in Table.4. We compare the computational cost of our approach with that of some previous deep learning based works on the described measurements. Notice that when running neural networks, the framework (in our case MXNet) has a minimum resource consumption. Thus we run an empty network with no operation as the baseline for the minimum computational cost, and use it to calculate the actual computational cost of the networks.
We can see that the net memory cost of our approach is 42 (PDP) to 99 (DeepGazeII) times lower than that of the previous works, and the model size is 63 (ML-Net) to 129 (SalGAN) times smaller. The parameter size of our approach is 63 (ML-Net) to 117 (DeepFix) times smaller, and the computation size is 27 (PDP) to 56 (SalGAN) times smaller. While the computational cost is distinctively reduced, the time cost of processing one 640×480 image is also lower than that of the previous methods.
V Conclusion
In this work, we propose a light-weighted neural network architecture for saliency detection with distinctively lower memory cost and model size, while maintaining competitive performance compared with previous approaches. We mainly adopt three strategies to achieve the light-weight goal: 1) depth-wise separable convolutional blocks to reduce the number of parameters; 2) regional input to reduce the intermediate feature map size; 3) a simplified network structure to reduce the model size. Experimental results show that our approach reduces the runtime memory cost by 42 to 99 times, and the model storage size by 63 to 129 times, compared with previous approaches.
References
[1] Marchesotti, Luca, C. Cifarelli, and G. Csurka. "A framework for visual saliency detection with applications to image thumbnailing." IEEE International Conference on Computer Vision, 2009:2232-2239.
[2] Achanta, Radhakrishna, and S. Susstrunk. "Saliency detection for content-aware image resizing." IEEE International Conference on Image Processing, 2009:1005-1008.
[3] Borji, Ali, et al. "Online learning of task-driven object-based visual attention control." Image and Vision Computing 28.7 (2010): 1130-1145.
[4] Dankers, Andrew, N. Barnes, and A. Zelinsky. "A Reactive Vision System: Active-Dynamic Saliency." 2007.
[5] Kruthiventi, Srinivas S. S., K. Ayush, and R. V. Babu. "DeepFix: A Fully Convolutional Neural Network for Predicting Human Eye Fixations." IEEE Transactions on Image Processing 26.9 (2017): 4446-4456.
[6] Huang, Xun, et al. "SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks." IEEE International Conference on Computer Vision, IEEE Computer Society, 2015:262-270.
[7] Treisman, Anne, and Garry Gelade. "A feature-integration theory of attention." Cognitive Psychology 12.1 (1980): 97-136.
[8] Koch, C, and S. Ullman. "Shifts in selective visual attention: towards the underlying neural circuitry." Hum Neurobiol 4.4 (1987): 219-227.
[9] Itti, Laurent, C. Koch, and E. Niebur. "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis." IEEE Computer Society, 1998.
[10] Simonyan, Karen, and A. Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition." Computer Science (2014).
[11] Vig, Eleonora, M. Dorr, and D. Cox. "Large-Scale Optimization of Hierarchical Features for Saliency Prediction in Natural Images." Computer Vision and Pattern Recognition, IEEE, 2014:2798-2805.
[12] Kummerer, Matthias, L. Theis, and M. Bethge. "Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet." Computer Science (2014).
[13] Krizhevsky, Alex, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks." International Conference on Neural Information Processing Systems, Curran Associates Inc., 2012:1097-1105.
[14] Kummerer, Matthias, T. S. A. Wallis, and M. Bethge. "DeepGaze II: Reading fixations from deep features trained on object recognition." (2016).
[15] Jiang, Ming, et al. "SALICON: Saliency in Context." Computer Vision and Pattern Recognition, IEEE, 2015:1072-1080.
[16] Szegedy, Christian, et al. "Going deeper with convolutions." IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 2015:1-9.
[17] Jetley, Saumya, N. Murray, and E. Vig. "End-to-End Saliency Mapping via Probability Distribution Prediction." Computer Vision and Pattern Recognition, IEEE, 2016:5753-5761.
[18] Judd, T, et al. "Learning to predict where humans look." IEEE International Conference on Computer Vision, 2010:2106-2113.
[19] Borji, Ali, and L. Itti. "CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research." Computer Science (2015).
[20] Lin, Tsung Yi, et al. "Microsoft COCO: Common Objects in Context." 8693 (2014): 740-755.
[21] Kingma, Diederik, and J. Ba. "Adam: A Method for Stochastic Optimization." Computer Science (2014).
[22] Chen, Tianqi, et al. "MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems." Statistics (2015).
[23] Riche, Nicolas, et al. "Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics." IEEE International Conference on Computer Vision, IEEE, 2014:1153-1160.
[24] Borji, Ali, et al. "Analysis of Scores, Datasets, and Models in Visual Saliency Prediction." IEEE International Conference on Computer Vision, IEEE Computer Society, 2013:921-928.
[25] Zhang, L., et al. "SUN: A Bayesian framework for saliency using natural statistics." J Vis 8.7 (2008): 32.1.
[26] Peters, R. J., et al. "Components of bottom-up gaze allocation in natural images." Vision Research 45.18 (2005): 2397-2416.
[27] Judd, Tilke, F. Durand, and A. Torralba. "A Benchmark of Computational Models of Saliency to Predict Human Fixations." (2012).
[28] Zhang, Jianming, and S. Sclaroff. "Saliency Detection: A Boolean Map Approach." IEEE International Conference on Computer Vision, IEEE Computer Society, 2013:153-160.
[29] Liu, Nian, et al. "Predicting eye fixations using convolutional neural networks." Computer Vision and Pattern Recognition, IEEE, 2015:362-370.
[30] Schölkopf, Bernhard, J. Platt, and T. Hofmann. "Graph-Based Visual Saliency." International Conference on Neural Information Processing Systems, MIT Press, 2006:545-552.
[31] Fang, Shu, et al. "Learning Discriminative Subspaces on Random Contrasts for Image Saliency Analysis." IEEE Transactions on Neural Networks & Learning Systems 28.5 (2017): 1095-1108.
[32] Goferman, S, L. Zelnikmanor, and A. Tal. "Context-aware saliency detection." IEEE Transactions on Pattern Analysis & Machine Intelligence 34.10 (2012): 1915-1926.
[33] Cornia, Marcella, et al. "Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model." (2016).
[34] Zhang, Xiangyu, et al. "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices." (2017).
[35] Howard, Andrew G, et al. "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications." (2017).
[36] Chollet, Francois. "Xception: Deep Learning with Depthwise Separable Convolutions." (2016): 1800-1807.
[37] Sandler, Mark, et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks." (2018).
[38] Cong, Jason, and B. Xiao. "Minimizing Computation in Convolutional Neural Networks." Artificial Neural Networks and Machine Learning - ICANN 2014. Springer International Publishing, 2014:281-290.
[39] Iandola, Forrest N, et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size." (2016).
[40] Han, Song, H. Mao, and W. J. Dally. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." Fiber 56.4 (2015): 3-7.
[41] Pan, Junting, et al. "SalGAN: Visual Saliency Prediction with Generative Adversarial Networks." (2017).
[42] Cornia, Marcella, et al. "A deep multi-level network for saliency prediction." International Conference on Pattern Recognition (2016): 3488-3493.