[Figure 1 panels: (a) Feature Visualization; (b) Generative Adversarial Example; (c) Saliency Detection; (d) Semantic Segmentation]
Saliency detection aims to identify the most important and conspicuous objects or regions in an image. As a pre-processing procedure in computer vision, saliency detection has greatly benefited many practical applications such as object retargeting [6, 40], semantic segmentation [37] and visual tracking [29, 14]. Although significant progress has been made [15, 1, 30, 21, 33, 46], saliency detection remains very challenging due to complex factors in real-world scenarios. In this work, we focus on improving the robustness of saliency detection models, an issue that has been largely ignored in the literature.
Previous saliency detection methods rely on hand-crafted visual features and heuristic priors. Recently, deep learning based methods have become increasingly popular and have set the benchmark on many datasets [27, 49, 3]. Their superior performance is partly attributed to their strong representation power in modeling object appearances and varied scenarios. However, existing methods fail to provide a probabilistic interpretation of the "black-box" learning in deep neural networks, and mainly enjoy the models' exceptional performance. A reasonable probabilistic interpretation can provide confidence estimates alongside predictions and make the prediction system more robust. In addition, since uncertainty is a natural part of any predictive system, modeling the uncertainty is of crucial importance. For instance, since the object boundary strongly affects the prediction accuracy of a saliency model, it is desirable that the model provide meaningful uncertainties about where the boundaries of distinct objects lie. To the best of our knowledge, no prior work models and analyzes the uncertainty of deep learning based saliency detection methods.
Another important issue is the checkerboard artifact in pixel-wise vision tasks, which aim to generate images or feature maps from low to high resolution. Several typical examples are shown in Fig. 1. These odd artifacts can be fatal for deep CNN based approaches. For example, when the artifacts appear in the output of a fully convolutional network (FCN), training may fail and the prediction can be completely wrong. We find that the actual cause of these artifacts is the upsampling mechanism, which generally relies on the deconvolution operation. Thus, it is of great interest to explore new upsampling methods that better suppress the artifacts in pixel-wise vision tasks. Meanwhile, the artifacts are also closely related to the uncertainty learning of deep CNNs.
All of the issues discussed above motivate us to learn uncertain features (probabilistic learning) through deep networks to achieve accurate saliency detection. Our model has several unique features, as outlined below.
Different from existing saliency detection methods, our model is extremely simplified. It consists of an encoder FCN, a corresponding decoder FCN followed by a pixel-wise classification layer. The encoder FCN hierarchically learns visual features from raw images while the decoder FCN progressively upsamples the encoded feature maps to the input size for the pixel-wise classification.
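As a rough sketch of the resolution bookkeeping in such an encoder-decoder FCN (the function name and the five-stage depth, matching the VGG-16 front-end described later, are our illustrative assumptions, not part of the original):

```python
def encoder_decoder_sizes(input_size, num_stages):
    """Trace the spatial resolution of feature maps through a symmetric
    encoder-decoder FCN: the encoder halves the resolution at each
    non-overlapping 2x2 max-pooling stage, and the decoder doubles it
    back with an upsampling stage until the input size is restored for
    pixel-wise classification."""
    sizes = [input_size]
    for _ in range(num_stages):      # encoder: conv block + 2x2 pooling
        sizes.append(sizes[-1] // 2)
    for _ in range(num_stages):      # decoder: upsampling + conv block
        sizes.append(sizes[-1] * 2)
    return sizes
```

For a 448 × 448 input and five pooling stages this traces 448 → 224 → 112 → 56 → 28 → 14 and back up to 448, so the pixel-wise classifier operates at the original resolution.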
Our model can learn deep uncertain convolutional features (UCF) for more accurate saliency detection. The key ingredient is inspired by dropout. We propose a reformulated dropout (R-dropout), which leads to an adaptive ensemble of the internal feature units in specific convolutional layers. Uncertain features are achieved with no additional parameterization.
We propose a new upsampling method to reduce the checkerboard artifacts of deconvolution operations. The new method has two obvious advantages: on the one hand, it separates upsampling (generating higher-resolution feature maps) from convolution (extracting convolutional features); on the other hand, it is compatible with the regular deconvolution.
The uncertain feature extraction and saliency detection are unified in an encoder-decoder network architecture. The parameters of the proposed model (i.e., the weights and biases in all the layers) are jointly trained by end-to-end gradient learning.
Our methods show good generalization on saliency detection and other pixel-wise vision tasks. Without any post-processing steps, our model yields comparable or even better performance on public saliency detection, semantic segmentation and eye fixation datasets.
2 Related Work
Recently, deep learning has delivered superior performance in saliency detection. For instance, Wang et al. propose two deep neural networks to integrate local estimation and global search for saliency detection. Li et al. train fully connected layers of multiple CNNs to predict the saliency degree of each superpixel. To deal with salient objects that appear against a low-contrast background, Zhao et al. take global and local context into account and model saliency prediction in a multi-context deep CNN framework. These methods achieve excellent performance; however, all of them include fully connected layers, which are computationally expensive. Moreover, fully connected layers drop the spatial information of the input images. To address these issues, Li et al. propose an FCN trained under a multi-task learning framework for saliency detection. Wang et al. design a recurrent FCN to leverage saliency priors and refine the coarse predictions.
Although motivated by a similar spirit, our method significantly differs from [26, 45] in three aspects. First, the network architecture is very different. Our FCN follows the encoder-decoder style, designed from the viewpoint of information reconstruction; in [26, 45], the FCN originates from FCN-8s, designed with both long and short skip connections for the segmentation task. Second, instead of simply using FCNs as predictors as in [26, 45], our model learns uncertain convolutional features by using multiple reformulated dropouts, which improves the robustness and accuracy of saliency detection. Third, our model is equipped with a new upsampling method that naturally handles the checkerboard artifacts of deconvolution operations; the artifacts are reduced by training the entire network. In contrast, the artifacts are handled by hand-crafted methods in [26, 45]: one uses superpixel segmentation to smooth the prediction, while the other applies an edge-aware erosion procedure.
Our work is also related to model uncertainty in deep learning. Gal et al. mathematically prove that a multilayer perceptron (MLP) with dropout applied before every weight layer is equivalent to an approximation of the probabilistic deep Gaussian process. Although the provided theory is solid, a full verification on deep CNNs remains underexplored. Based on this fact, we take a further step in this direction and show that a reformulated dropout can be used in convolutional layers for learning uncertain feature ensembles. Another representative work on model uncertainty is the Bayesian SegNet, which predicts pixel-wise scene segmentation with a measure of model uncertainty. The model uncertainty is obtained by Monte Carlo sampling: dropout is kept active at test time to generate a posterior distribution of pixel class labels. In contrast, our model focuses on learning uncertain convolutional features during training.
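The Monte Carlo sampling used by Bayesian SegNet can be sketched as follows (the function name and the number of samples T are our illustrative choices; `forward_stochastic` stands in for any network whose dropout stays active at test time):

```python
import numpy as np

def mc_dropout_predict(forward_stochastic, x, T=20):
    """Monte Carlo dropout at test time: run T stochastic forward passes
    of a network with dropout kept active, then use the sample mean as
    the prediction and the sample variance as a per-pixel uncertainty."""
    samples = np.stack([forward_stochastic(x) for _ in range(T)])
    return samples.mean(axis=0), samples.var(axis=0)
```

A deterministic network trivially yields zero variance, i.e., no uncertainty; the spread across stochastic passes is what carries the uncertainty signal.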
3 The Proposed Model
3.1 Network Architecture Overview
Our architecture is partly inspired by the stacked denoising auto-encoder. We generalize the auto-encoder to a deep fully convolutional encoder-decoder architecture. The resulting network forms a novel hybrid FCN which consists of an encoder FCN for high-level feature extraction, a corresponding decoder FCN for low-level information reconstruction and a pixel-wise classifier for saliency prediction. The overall architecture is illustrated in Fig. 2. More specifically, the encoder FCN consists of multiple convolutional layers with batch normalization (BN) and rectified linear units (ReLU), followed by non-overlapping max pooling. The corresponding decoder FCN additionally introduces upsampling operations to build feature maps up from low to high resolution. We use the softmax classifier for pixel-wise saliency prediction. To achieve uncertainty in the learned convolutional features, we apply the reformulated dropout (dubbed R-Dropout) after several convolutional layers. The detailed network configuration is included in the supplementary materials. We fully elaborate the R-Dropout, our new upsampling method and the training strategy in the following subsections.
3.2 R-Dropout for Deep Uncertain Convolutional Features
Dropout is typically interpreted as bagging a large number of individual models [13, 39]. Although plenty of experiments show that dropout in fully connected layers improves the generalization ability of deep networks, there is little research on using dropout for other types of layers, such as convolutional layers. In this subsection, we show that using a modified dropout after convolutional layers can be interpreted as a kind of probabilistic feature ensemble. In light of this fact, we provide a strategy for learning uncertain convolutional features.
R-Dropout in Convolution:
Suppose $X$ is a 3D tensor and $F(\cdot)$ is a convolution operation in CNNs, projecting $X$ to the feature space with parameters $W$ and $b$:
$$Z = F(X; W, b). \qquad (1)$$
Let $f(\cdot)$ be a non-linear activation function. At test time the activation is
$$A = f(Z), \qquad (2)$$
and when the original dropout is applied to the outputs of $F(\cdot)$, we get its disturbed version during training by
$$\tilde{A} = M \odot f(Z), \qquad (3)$$
where $\odot$ denotes the element-wise product and $M$ is a binary mask matrix of the same size as $Z$, with each element drawn independently from a Bernoulli distribution with retain probability $p$. Eq.(3) denotes the activation with dropout during training, and Eq.(2) denotes the activation at test time. In addition, Srivastava et al. suggest scaling the activations by $p$ at test time to obtain an approximate average of the unit activations.
In our R-Dropout, the binary mask is generalized to a stochastic tensor $S$ drawn from an arbitrary distribution, yielding
$$\hat{A} = S \circledast f(Z), \qquad (7)$$
where $\circledast$ denotes the cross-channel element-wise product.
From the above equations, we can derive that when $S$ is still binary, Eq.(7) implies that a kind of stochastic property¹ is applied to the inputs of the activation function. Under suitable constraints on $S$, the above equations strictly construct an ensemble of the internal feature units of $X$. However, in practice there is no evidence that these constraints hold. Even so, we note that: (1) the stochastic mask matrix $S$ mainly depends on the mask generator²; (2) when training deep convolutional networks, the R-Dropout after convolutional layers acts as an uncertain ensemble of convolutional features; (3) this kind of feature ensemble is element-wise probabilistic, and thus brings robustness to the prediction of dense labeling vision tasks such as saliency detection and semantic segmentation.

¹ Stochastic property means that one can use a specific probability distribution to generate the learnable tensor $S$ during each training iteration; the update of $S$ forms a stochastic process rather than a deterministic decision.

² In R-Dropout, the generator can be any probability distribution. The original dropout is a special case of R-Dropout in which the generator is the Bernoulli distribution.
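A minimal NumPy sketch of this idea (the function names and the retain probability are our illustrative choices): the mask generator can be any distribution, and a Bernoulli generator recovers the original dropout.

```python
import numpy as np

def r_dropout(features, generator, rng):
    """Multiply convolutional features element-wise by a stochastic mask S
    drawn from an arbitrary probability distribution (the mask generator),
    forming a probabilistic ensemble of the internal feature units."""
    S = generator(rng, features.shape)
    return features * S

# Bernoulli generator: reduces R-Dropout to the original dropout.
def bernoulli_mask(rng, shape, p=0.5):
    return (rng.random(shape) < p).astype(float)

# Uniform generator: a continuous stochastic mask instead of a binary one.
def uniform_mask(rng, shape):
    return rng.random(shape)
```

Swapping the generator changes the character of the ensemble without adding any learnable parameters, consistent with the "no additional parameterization" property above.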
Uncertain Convolutional Feature Extraction:
Motivated by the above insights, we apply the R-Dropout after convolutional layers of our model, thereby generating deep uncertain convolutional feature maps. Since our model consists of alternating convolutional and pooling layers, two typical cases arise. For notational simplicity, we omit batch normalization (BN) in the following.
1) Conv+R-Dropout+Conv: If the proposed R-Dropout is followed by a convolutional layer, the forward propagation of the input is formulated as
$$X^{l+1} = \mathrm{Conv}\big(S^{l} \circledast f(Z^{l});\, W^{l+1}, b^{l+1}\big), \qquad (9)$$
where $l$ is the layer number and Conv is the convolution operation. As Eq.(9) shows, the disturbed activation is convolved with the next filter to produce the convolved features $X^{l+1}$. In this way, the network focuses on learning the weight and bias parameters, i.e., $W$ and $b$, and the uncertainty introduced by the R-Dropout dissipates during training deep networks.
2) Conv+R-Dropout+Pooling: In this case, the forward propagation of the input becomes
$$X^{l+1} = \max\big(S^{l} \circledast f(Z^{l})\big).$$
Here $\max(\cdot)$ denotes the max-pooling function. Let $R^{l}$ be a pooling region at layer $l$, let $a_i$ be the activation of each neuron within $R^{l}$, and let $n$ be the number of units in $R^{l}$. To formulate the uncertainty, without loss of generality, we suppose the activations in each pooling region are ordered non-decreasingly, i.e., $a_1 \le a_2 \le \dots \le a_n$. As a result, $a_i$ will be selected as the pooled activation on the conditions that (1) $a_{i+1}, \dots, a_n$ are dropped out, and (2) $a_i$ is retained. According to probability theory, with retain probability $p$ this event occurs with probability
$$p_i = p\,(1-p)^{n-i}.$$
Therefore, performing R-Dropout before the max-pooling operation is exactly sampling an index $i$ from the following multinomial distribution, after which the pooled activation is simply $a_i$:
$$i \sim \mathrm{Mult}\big(p_0, p_1, \dots, p_n\big), \quad p_0 = (1-p)^{n},$$
where $i = 0$ is the special event that all the units in a pooling region are dropped out.
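These selection probabilities can be checked empirically. The sketch below (pure Python; the function names are ours) drops each activation independently with retain probability p and max-pools the survivors, matching the closed form p(1-p)^{n-i} for index i and (1-p)^n for the all-dropped event:

```python
import random

def dropout_then_maxpool(acts, p, rng):
    """Drop each activation independently (retain probability p), then
    max-pool the survivors.  Returns the 1-based index of the pooled
    activation, or 0 if the whole pooling region was dropped."""
    kept = [(i + 1, a) for i, a in enumerate(acts) if rng.random() < p]
    if not kept:
        return 0
    return max(kept, key=lambda t: t[1])[0]

def selection_prob(i, n, p):
    """Closed-form probability that a_i (activations sorted non-decreasingly)
    is pooled: a_{i+1}..a_n dropped and a_i retained."""
    return (1 - p) ** n if i == 0 else p * (1 - p) ** (n - i)
```

With distinct ascending activations the pooled index is unambiguous, and the empirical frequencies converge to `selection_prob` as the number of trials grows.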
The latter strategy exhibits the effectiveness of building uncertainty by applying the R-Dropout to convolutional layers. We adopt it to build up our network architecture (see Fig. 2). We will experimentally demonstrate in Section 4 that the R-Dropout based FCN yields excellent results on the saliency detection datasets.
3.3 Hybrid Upsampling for Prediction Smoothing
In this subsection, we first explicate the cause of checkerboard artifacts through the deconvolution arithmetic. Then we derive a new upsampling method that reduces the artifacts as much as possible during network training and inference.
Without loss of generality, we focus on a square input (size $i \times i$), a square kernel (size $k \times k$), the same stride $s$ and the same zero padding $p$ (if used) along both axes. Since we aim to implement upsampling, we consider strides $s > 1$. In general, the convolution operation can be described by
$$O = I * W,$$
where $I$ is the input, $W$ is the filter with stride $s$, $*$ is the discrete convolution and $O$ is the output whose dimension is $o = \lfloor (i + 2p - k)/s \rfloor + 1$. The convolution has an associated deconvolution described by $k' = k$, $s' = 1$ and $p' = k - p - 1$, operating on a stretched input of size $\tilde{i} = i + (i-1)(s-1)$ obtained by adding $s-1$ zeros between each pair of input units; the output size of the deconvolution is $o' = s(i-1) + k - 2p$.³ This indicates that the regular deconvolution operator is equivalent to performing convolution on a new input with inserted zeros. A toy example is shown in Fig. 3. In addition, when the filter size $k$ cannot be divided by the stride $s$, the deconvolution causes an overlapping issue. If the stretched input is high-frequency or near periodic, i.e., its values undulate strongly once zeros are inserted, the outputs of deconvolution operations naturally exhibit numerical artifacts like a checkerboard.

³ The constraint on the size of the input can be relaxed by introducing another parameter $a \in \{0, \dots, s-1\}$ that distinguishes between the $s$ different cases that all lead to the same $o'$.
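The size relations above can be sanity-checked numerically (the helper names are ours; the formulas follow the deconvolution arithmetic cited in the text):

```python
def conv_out(i, k, s, p):
    """Output size of a convolution: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

def stretched_size(i, s):
    """Input size after inserting s-1 zeros between adjacent input units."""
    return i + (i - 1) * (s - 1)

def deconv_out(i, k, s, p):
    """Output size of the associated deconvolution: o' = s(i-1) + k - 2p."""
    return s * (i - 1) + k - 2 * p
```

For any valid i, k, s, p, `deconv_out(i, k, s, p)` equals `conv_out(stretched_size(i, s), k, 1, k - p - 1)`, i.e., the deconvolution is exactly a stride-1 convolution over the zero-stretched input with padding k - p - 1; the overlapping issue (and thus checkerboard risk) appears whenever `k % s != 0`.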
Based on the above observations, we propose two strategies to avoid the artifacts produced by regular deconvolution. The first one restricts the filter size: we simply ensure that the filter size is a multiple of the stride, avoiding the overlapping issue, i.e.,
$$k \bmod s = 0.$$
Then the deconvolution processes the zero-inserted input with the equivalent convolution, yielding a smooth output. However, because this method only changes the receptive fields of the output and cannot change the frequency distribution of the zero-inserted input, the artifacts can still leak through in several extreme cases. We therefore propose an alternative strategy that separates upsampling from the equivalent convolution: we first resize the original input to the desired size by interpolation, and then perform equivalent convolutions. Although this strategy may perturb the learned features in deep CNNs, we find that high-resolution maps built by iteratively stacking this kind of upsampling reduce the artifacts remarkably. To combine the strengths of both strategies, we introduce a hybrid upsampling method that sums the outputs of the two strategies. Fig. 4 illustrates the proposed upsampling method. In our model, we use bilinear (or nearest-neighbor) operations for the interpolation. These interpolation methods are linear operations and can be embedded into deep CNNs as efficient matrix multiplications.
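A 1-D NumPy sketch of the two branches (the names and the nearest-neighbor choice are our illustrative assumptions): a deconvolution whose kernel length is a multiple of the stride, and a resize-then-convolve branch; the hybrid output sums the two.

```python
import numpy as np

def deconv1d(x, w, s):
    """Transposed convolution: insert s-1 zeros between input units, then
    convolve in 'full' mode.  When len(w) is a multiple of s, the kernel
    tiles the stretched input evenly and avoids checkerboard overlap."""
    stretched = np.zeros(len(x) + (len(x) - 1) * (s - 1))
    stretched[::s] = x
    return np.convolve(stretched, w, mode="full")

def resize_conv1d(x, w, s):
    """Upsample by nearest-neighbor repetition first, then convolve: the
    interpolation fixes the frequency content before any filtering."""
    return np.convolve(np.repeat(x, s), w, mode="same")

def hybrid_upsample1d(x, w_deconv, w_conv, s):
    """Hybrid upsampling: sum the two branches, cropped to s * len(x)."""
    target = s * len(x)
    return deconv1d(x, w_deconv, s)[:target] + resize_conv1d(x, w_conv, s)
```

With a constant input and stride 2, a length-2 kernel (a multiple of the stride) yields a flat deconvolution output, whereas a length-3 kernel produces the oscillating checkerboard pattern the text describes.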
3.4 Training the Entire Network
Since there is not enough saliency detection data for training our model from scratch, we use the front-end of the VGG-16 model as our encoder FCN (13 convolutional layers and 5 pooling layers pre-trained on ILSVRC 2014 for the image classification task). Our decoder FCN is a mirrored version of the encoder FCN, with multiple series of upsampling, convolution and rectification layers. Batch normalization (BN) is added to the output of every convolutional layer. We add the R-Dropout with an equal sampling rate after specific convolutional layers, as shown in Fig. 2. For saliency detection, we randomly initialize the weights of the decoder FCN and fine-tune the entire network on the MSRA10K dataset, which is widely used in the salient object detection community (more details are given in Section 4). We convert the ground-truth saliency map of each image in that dataset into a 0-1 binary map. This transform matches the two-channel output of the FCN when we use the softmax cross-entropy loss function, given by
$$L = -\sum_{j} \big[ y_j \log \hat{y}_j + (1 - y_j) \log(1 - \hat{y}_j) \big], \qquad (17)$$
for separating the salient foreground from the general background, where $y_j$ is the label of pixel $j$ in the image and $\hat{y}_j$ is the probability that the pixel is salient foreground. The value of $\hat{y}_j$ is obtained from the output of the network. Before feeding the training images into our model, each image is subtracted with the ImageNet mean and rescaled to the same size (448 × 448). Correspondingly, we also rescale the 0-1 binary maps to the same size. The model is trained end-to-end using mini-batch stochastic gradient descent (SGD) with momentum and a learning rate decay schedule. The detailed parameter settings are included in the supplementary material.
3.5 Saliency Inference
Because our model is a fully convolutional network, it can take images of arbitrary size as input at test time. After the feed-forward process, the output of the network is composed of a foreground excitation map ($M_f$) and a background excitation map ($M_b$). We use the difference between $M_f$ and $M_b$, clipping the negative values, to obtain the resulting saliency map, i.e.,
$$S = \max(M_f - M_b,\, 0).$$
This subtraction strategy not only increases the pixel-level discrimination but also captures context contrast information. Optionally, we can take the ensemble of multi-scale predicted maps to further improve performance.
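The subtraction-and-clip step is a one-liner (the function name is ours, matching the foreground/background excitation maps described above):

```python
import numpy as np

def saliency_from_excitations(fg, bg):
    """Resulting saliency map: difference between the foreground and
    background excitation maps, with negative values clipped to zero."""
    return np.clip(fg - bg, 0.0, None)
```

For example, `saliency_from_excitations(np.array([0.9, 0.2]), np.array([0.1, 0.6]))` gives `[0.8, 0.0]`: the second pixel, where the background excitation dominates, is suppressed to zero rather than left negative.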
4 Experiments
In this section, we first describe the experimental setup for saliency detection. Then, we thoroughly evaluate and analyze our proposed model on public saliency detection datasets. Finally, we provide additional experiments to verify the generalization of our methods on other pixel-wise vision tasks, i.e., semantic segmentation and eye fixation.
4.1 Experimental Setup
Saliency Datasets: For training the proposed network, we simply augment the MSRA10K dataset by mirror reflection and rotation, producing 80,000 training images in total.
For the detection performance evaluation, we adopt six widely used saliency detection datasets as follows,
DUT-OMRON. This dataset consists of 5,168 high-quality images. Images in this dataset have one or more salient objects and relatively complex backgrounds, which makes it challenging for saliency detection.
ECSSD . This dataset contains 1,000 natural images, including many semantically meaningful and complex structures in the ground truth segmentations.
HKU-IS . This dataset contains 4,447 images with high quality pixel-wise annotations. Images in this dataset are well chosen to include multiple disconnected objects or objects touching the image boundary.
SED . This dataset contains two different subsets: SED1 and SED2. The SED1 has 100 images each containing only one salient object, while the SED2 has 100 images each containing two salient objects.
We implement our approach on the MATLAB R2014b platform with a modified Caffe toolbox. We run our approach on a quad-core PC with an i7-4790 CPU (16 GB memory) and an NVIDIA Titan X GPU (12 GB memory). Training our model takes almost 23 hours and converges after 200k iterations of mini-batch SGD. The proposed saliency detection algorithm runs at about 7 fps at full resolution (23 fps at a lower resolution). The source code can be found at http://ice.dlut.edu.cn/lu/.
Evaluation Metrics: We evaluate the performance with the precision-recall (PR) curve, the F-measure and the mean absolute error (MAE). The precision and recall are computed by thresholding the predicted saliency map and comparing the binary map with the ground truth. The PR curve of a dataset shows the mean precision and recall of saliency maps at different thresholds. The F-measure is a balanced mean of average precision and average recall, calculated by
$$F_{\beta} = \frac{(1+\beta^{2})\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^{2}\,\mathrm{Precision} + \mathrm{Recall}},$$
where $\beta^{2}$ is set to 0.3 to weigh precision more than recall. We report the performance when each saliency map is adaptively binarized with an image-dependent threshold, determined as twice the mean saliency of the image:
$$T = \frac{2}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} S(x, y),$$
where $W$ and $H$ are the width and height of an image, and $S(x, y)$ is the saliency value of the pixel at $(x, y)$.
We also calculate the mean absolute error (MAE) for fair comparisons, as suggested by prior work. The MAE evaluates the saliency detection accuracy by
$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \big| S(x, y) - G(x, y) \big|,$$
where $G$ is the binary ground truth mask.
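The three metrics can be sketched in NumPy (the function names are ours):

```python
import numpy as np

def adaptive_threshold(S):
    """Image-dependent threshold: twice the mean saliency of the map."""
    return 2.0 * S.mean()

def f_measure(precision, recall, beta2=0.3):
    """F-measure with beta^2 = 0.3 to weigh precision more than recall."""
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def mae(S, G):
    """Mean absolute error between saliency map S and binary ground truth G."""
    return np.abs(S.astype(float) - G.astype(float)).mean()
```

Note that with beta^2 = 0.3 the F-measure still equals 1 only when both precision and recall are 1, but low precision pulls the score down faster than low recall.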
[Figure panels: (a) ECSSD; (b) SED1; (c) SED2]
4.2 Performance Comparison with State-of-the-art
We compare the proposed UCF algorithm with 10 state-of-the-art methods, including 6 deep learning based algorithms (DCL, DS, ELD, LEGS, MDF, RFCN) and 4 conventional counterparts (BL, BSCA, DRFI, DSR). For a fair comparison, we use the source codes with the recommended parameters or the saliency maps provided by the authors of the competing methods.
As shown in Fig. 5 and Tab. 1, our proposed UCF model can consistently outperform existing methods across almost all the datasets in terms of all evaluation metrics, which convincingly indicates the effectiveness of the proposed methods. Refer to the supplemental material for more results on DUT-OMRON, HKU-IS, PASCAL-S and SOD datasets.
From these results, we make several fundamental observations: (1) Our UCF model outperforms the other algorithms on the ECSSD and SED datasets by a large margin in terms of F-measure and MAE. More specifically, our model improves the F-measure of the best-performing existing algorithm by 3.9% and 6.15% on the ECSSD and SED datasets, respectively. The MAE is consistently improved. (2) Although our proposed UCF is not the best on the HKU-IS and PASCAL-S datasets, it is still very competitive (our model ranks second on these datasets). Note that only the augmented MSRA10K dataset is used for training our model, whereas the RFCN, DS and DCL methods are pre-trained on the additional PASCAL VOC segmentation dataset, which overlaps with the PASCAL-S and HKU-IS datasets. This may explain their success on these two datasets; however, their performance on the other datasets is clearly inferior. (3) Compared with the other methods, our proposed UCF achieves lower MAE on most of the datasets, which suggests that our model is more confident about the predicted regions thanks to the uncertain feature learning.
A visual comparison of different methods on typical images is shown in Fig. 7. Our saliency maps reliably highlight the salient objects in various challenging scenarios, e.g., low contrast between objects and backgrounds (rows 1-2), multiple disconnected salient objects (rows 3-4) and objects near the image boundary (rows 5-6). In addition, our saliency maps provide more accurate boundaries of salient objects (rows 1, 3 and 6-8).
To verify the contribution of each component, we also evaluate several variants of the proposed UCF model with different settings, as illustrated in Tab. 2; the corresponding performances are reported in Tab. 1. The V-A model is an approximation of DeconvNet. The comparison between V-A and V-B demonstrates that our uncertain learning mechanism indeed helps learn more robust features for accurate saliency inference. The comparison between V-B and V-C shows the effects of the two upsampling strategies; the results imply that the interpolation strategy performs much better in saliency detection. The joint comparison of V-B, V-C and UCF confirms that our hybrid upsampling method is capable of better refining the output saliency maps. An example of the visual effects is illustrated in Fig. 6. In addition, the V-D and V-E models verify the usefulness of the deconvolution and interpolation upsampling, respectively. The V-B and V-C models achieve competitive or even better results than the other saliency methods, which further confirms the strength of our methods.
4.3 Generalization Evaluation
To verify the generalization of our methods, we perform additional experiments on other pixel-wise vision tasks.
Following existing works [31, 20], we simply change the classifier to 21 classes and perform the semantic segmentation task on the PASCAL VOC 2012 dataset. Our UCF model is trained with the PASCAL VOC 2011 training and validation data, using Berkeley's extended annotations. We achieve impressive results (mean IoU: 68.25, mean pixel accuracy: 92.19, pixel accuracy: 77.28), which are comparable with other state-of-the-art segmentation methods. In addition, though the segmentation performance gaps are not as large as in saliency detection, our new upsampling method indeed performs better than regular deconvolution (mean IoU: 67.45 vs 65.173, mean pixel accuracy: 91.21 vs 90.84, pixel accuracy: 76.18 vs 75.73).
The task of eye fixation prediction is essentially different from our classification task. We use the Euclidean loss for gaze prediction. We submit our results to the servers of the MIT300, iSUN and SALICON benchmarks with standard setups. Our model also achieves comparable results, as shown in Tab. 3. All of the above results on semantic segmentation and eye fixation indicate that our model generalizes well to other pixel-wise tasks.
5 Conclusion
In this paper, we propose a novel fully convolutional network for saliency detection. A reformulated dropout is utilized to facilitate probabilistic training and inference. This uncertain learning mechanism enables our method to learn uncertain convolutional features and yield more accurate saliency predictions. A new upsampling method is also proposed to reduce the artifacts of deconvolution operations and explicitly encourage the network to learn accurate boundaries for saliency detection. Extensive evaluations demonstrate that our methods significantly improve saliency detection performance and generalize well to other pixel-wise vision tasks.
Acknowledgments. We thank Alex Kendall for sharing the SegNet code. This work is supported by the Natural Science Foundation of China #61472060, #61502070, #61528101 and #61632006.
-  B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, pages 73–80, 2010.
-  A. Borji. What is a salient object? a dataset and a baseline model for salient object detection. IEEE TIP, 24(2):742–756, 2015.
-  A. Borji, M.-M. Cheng, H. Jiang, and J. Li. Salient object detection: A benchmark. IEEE TIP, 24(12):5706–5722, 2015.
-  M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu. Global contrast based salient region detection. IEEE TPAMI, 37(3):569–582, 2015.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In CVPR, 2009.
-  Y. Ding, J. Xiao, and J. Yu. Importance filtering for image retargeting. In CVPR, pages 89–96, 2011.
-  A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. arXiv:1506.02753, 2015.
-  M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88:303–338, 2010.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
-  Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Insights and applications. In Deep Learning Workshop, ICML, 2015.
-  B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
-  K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In CVPR, pages 1026–1034, 2015.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
-  S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. arXiv:1502.06796, 2015.
-  X. Hou and L. Zhang. Saliency detection: A spectral residual approach. In CVPR, pages 1–8, 2007.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
-  M. Jiang, S. Huang, J. Duan, and Q. Zhao. Salicon: Saliency in context. In CVPR, 2015.
-  P. Jiang, H. Ling, J. Yu, and J. Peng. Salient region detection by ufo: Uniqueness, focusness and objectness. In ICCV, pages 1976–1983, 2013.
-  T. Judd, F. Durand, and A. Torralba. A benchmark of computational models of saliency to predict human fixations. In MIT Technical Report, 2012.
-  A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv:1511.02680, 2015.
-  D. A. Klein and S. Frintrop. Center-surround divergence of feature statistics for salient object detection. In ICCV, pages 2214–2219, 2011.
-  G. Lee, Y.-W. Tai, and J. Kim. Deep saliency with encoded low level distance map and high level features. In CVPR, June 2016.
-  G. Li and Y. Yu. Visual saliency based on multiscale deep features. In CVPR, pages 5455–5463, 2015.
-  G. Li and Y. Yu. Deep contrast learning for salient object detection. In CVPR, pages 478–487, 2016.
-  X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang. Saliency detection via dense and sparse reconstruction. In ICCV, pages 2976–2983, 2013.
-  X. Li, L. Zhao, L. Wei, M.-H. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang. Deepsaliency: Multi-task deep neural network model for salient object detection. IEEE TIP, 25(8):3919–3930, 2016.
-  Y. Li, X. Hou, C. Koch, J. Rehg, and A. Yuille. The secrets of salient object segmentation. In CVPR, pages 280–287, 2014.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
-  V. Mahadevan and N. Vasconcelos. Saliency-based discriminant tracking. In CVPR, pages 1007–1013, 2009.
-  L. Marchesotti, C. Cifarelli, and G. Csurka. A framework for visual saliency detection with applications to image thumbnailing. In ICCV, pages 2232–2239, 2009.
-  H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, pages 1520–1528, 2015.
-  A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. http://distill.pub/2016/deconv-checkerboard/, 2016.
-  Y. Qin, H. Lu, Y. Xu, and H. Wang. Saliency detection via cellular automata. In CVPR, pages 110–119, 2015.
-  C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. In ACM TOG, volume 23, pages 309–314, 2004.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. arXiv:1606.03498, 2016.
-  W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
-  C. Siagian and L. Itti. Rapid biologically-inspired scene classification using features shared with visual attention. IEEE TPAMI, 29(2):300–312, 2007.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
-  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
-  J. Sun and H. Ling. Scale and object aware image retargeting for thumbnail browsing. In ICCV, pages 1511–1518, 2011.
-  N. Tong, H. Lu, X. Ruan, and M.-H. Yang. Salient object detection via bootstrap learning. In CVPR, pages 1884–1892, 2015.
-  V. Dumoulin and F. Visin. A guide to convolution arithmetic for deep learning. arXiv:1603.07285, 2016.
-  P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11(Dec):3371–3408, 2010.
-  L. Wang, H. Lu, X. Ruan, and M.-H. Yang. Deep networks for saliency detection via local estimation and global search. In CVPR, pages 3183–3192, 2015.
-  L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan. Saliency detection with recurrent fully convolutional networks. In ECCV, pages 825–841, 2016.
-  T. Wang, L. Zhang, H. Lu, C. Sun, and J. Qi. Kernelized subspace ranking for saliency detection. In ECCV, pages 450–466, 2016.
-  P. Xu, K. A. Ehinger, Y. Zhang, A. Finkelstein, S. R. Kulkarni, and J. Xiao. Turkergaze: Crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755, 2015.
-  Q. Yan, L. Xu, J. Shi, and J. Jia. Hierarchical saliency detection. In CVPR, pages 1155–1162, 2013.
-  C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang. Saliency detection via graph-based manifold ranking. In CVPR, pages 3166–3173, 2013.
-  R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multi-context deep learning. In CVPR, pages 1265–1274, 2015.