1 Introduction
Figure 1: Examples of checkerboard artifacts in pixel-wise vision tasks. (a) Feature Visualization [7]. (b) Generative Adversarial Example [35]. (c) Saliency Detection [45]. (d) Semantic Segmentation [31].
Saliency detection aims to identify the most important and conspicuous objects or regions in an image. As a preprocessing procedure in computer vision, saliency detection has greatly benefited many practical applications such as object retargeting [6, 40, 37], semantic segmentation [34] and visual tracking [29, 14]. Although significant progress has been made [15, 1, 30, 21, 33, 46], saliency detection remains very challenging due to complex factors in real-world scenarios. In this work we focus on improving the robustness of saliency detection models, which has been ignored in the literature.

Previous saliency detection methods utilize several hand-crafted visual features and heuristic priors. Recently, deep learning based methods have become increasingly popular and have set the benchmark on many datasets [27, 49, 3]. Their superior performance is partly attributed to their strong representational power in modeling object appearances and varied scenarios. However, existing methods mainly enjoy the models' exceptional performance and fail to provide a probabilistic interpretation of the "black-box" learning in deep neural networks. A reasonable probabilistic interpretation can provide reliable confidences alongside predictions and make the prediction system more robust [10]. In addition, since uncertainty is a natural part of any predictive system, modeling it is of crucial importance. For instance, because the object boundary strongly affects the prediction accuracy of a saliency model, it is desirable that the model provide meaningful uncertainties about where the boundary between distinct objects lies. As far as we know, no prior work models and analyzes the uncertainty of deep learning based saliency detection methods.

Another important issue is the checkerboard artifact in pixel-wise vision tasks, which aim to generate images or feature maps from low to high resolution. Several typical examples are shown in Fig. 1 (see [32] for more details). These artifacts can be fatal for deep CNN based approaches. For example, when artifacts appear in the output of a fully convolutional network (FCN), the network training may fail and the prediction can be completely wrong [36]. We find that the actual cause of these artifacts is the upsampling mechanism, which generally utilizes the deconvolution operation. Thus, it is of great interest to explore new upsampling methods that better reduce the artifacts in pixel-wise vision tasks. Meanwhile, the artifacts are also closely related to the uncertainty learning of deep CNNs.

All of the issues discussed above motivate us to learn uncertain features (probabilistic learning) through deep networks to achieve accurate saliency detection. Our model has several unique features, as outlined below.

Different from existing saliency detection methods, our model is highly simplified. It consists of an encoder FCN and a corresponding decoder FCN, followed by a pixel-wise classification layer. The encoder FCN hierarchically learns visual features from raw images, while the decoder FCN progressively upsamples the encoded feature maps to the input size for the pixel-wise classification.

Our model can learn deep uncertain convolutional features (UCF) for more accurate saliency detection. The key ingredient is inspired by dropout [13]. We propose a reformulated dropout (R-dropout), leading to an adaptive ensemble of the internal feature units in specific convolutional layers. Uncertain features are obtained with no additional parameterization.

We propose a new upsampling method to reduce the checkerboard artifacts of deconvolution operations. The new upsampling method has two obvious advantages: on the one hand, it separates upsampling (to generate higher-resolution feature maps) from convolution (to extract convolutional features); on the other hand, it is compatible with the regular deconvolution.

The uncertain feature extraction and saliency detection are unified in an encoder-decoder network architecture. The parameters of the proposed model (i.e., the weights and biases in all layers) are jointly trained by end-to-end gradient learning.

Our methods show good generalization on saliency detection and other pixel-wise vision tasks. Without any post-processing steps, our model yields comparable or even better performance on public saliency detection, semantic segmentation and eye fixation datasets.
2 Related Work
Recently, deep learning has delivered superior performance in saliency detection. For instance, Wang et al. [44] propose two deep neural networks to integrate local estimation and global search for saliency detection. Li et al. [23] train the fully connected layers of multiple CNNs to predict the saliency degree of each superpixel. To deal with salient objects appearing against low-contrast backgrounds, Zhao et al. [50] take global and local context into account and model the saliency prediction in a multi-context deep CNN framework. These methods perform excellently; however, all of them include fully connected layers, which are computationally expensive. What is more, fully connected layers discard the spatial information of input images. To address these issues, Li et al. [26] propose an FCN trained under a multi-task learning framework for saliency detection. Wang et al. [45] design a recurrent FCN to leverage saliency priors and refine the coarse predictions.

Although motivated by a similar spirit, our method differs significantly from [26, 45] in three aspects. First, the network architecture is very different. Our FCN follows the encoder-decoder style, designed from the viewpoint of information reconstruction, whereas the FCNs in [26, 45] originate from the FCN-8s [28], designed with both long and short skip connections for the segmentation task. Second, instead of simply using FCNs as predictors as in [26, 45], our model learns uncertain convolutional features via multiple reformulated dropouts, which improves the robustness and accuracy of saliency detection. Third, our model is equipped with a new upsampling method that naturally handles the checkerboard artifacts of deconvolution operations; the artifacts are reduced by training the entire neural network. In contrast, the artifacts are handled by hand-crafted methods in [26, 45]: [26] uses superpixel segmentation to smooth the prediction, while [45] applies an edge-aware erosion procedure.
Our work is also related to model uncertainty in deep learning. Gal and Ghahramani [10] mathematically prove that a multilayer perceptron (MLP) with dropout applied before every weight layer is equivalent to an approximation of a probabilistic deep Gaussian process. Though the provided theory is solid, a full verification on deep CNNs remains underexplored. Based on this fact, we take a further step in this direction and show that a reformulated dropout can be used in convolutional layers for learning uncertain feature ensembles. Another representative work on model uncertainty is the Bayesian SegNet [20], which predicts pixel-wise scene segmentation with a measure of model uncertainty, achieved by Monte Carlo sampling: dropout is activated at test time to generate a posterior distribution of pixel class labels. Different from [20], our model focuses on learning uncertain convolutional features during training.

3 The Proposed Model
3.1 Network Architecture Overview
Our architecture is partly inspired by the stacked denoising autoencoder [43]. We generalize the autoencoder to a deep fully convolutional encoder-decoder architecture. The resulting network forms a novel hybrid FCN consisting of an encoder FCN for high-level feature extraction, a corresponding decoder FCN for low-level information reconstruction, and a pixel-wise classifier for saliency prediction. The overall architecture is illustrated in Fig. 2. More specifically, the encoder FCN consists of multiple convolutional layers with batch normalization (BN) [16] and rectified linear units (ReLU), followed by non-overlapping max pooling. The corresponding decoder FCN additionally introduces upsampling operations to build feature maps up from low to high resolution. We use the softmax classifier for the pixel-wise saliency prediction. To obtain the uncertainty of the learned convolutional features, we apply the reformulated dropout (dubbed R-dropout) after several convolutional layers. The detailed network configuration is included in the supplementary material. We fully elaborate on the R-dropout, our new upsampling method and the training strategy in the following subsections.
3.2 R-Dropout for Deep Uncertain Convolutional Feature Ensembles
Dropout is typically interpreted as bagging a large number of individual models [13, 39]. Although plenty of experiments show that dropout in fully connected layers improves the generalization ability of deep networks, there is a lack of research on using dropout in other layer types, such as convolutional layers. In this subsection, we show that using a modified dropout after convolutional layers can be interpreted as a kind of probabilistic feature ensemble. In light of this fact, we provide a strategy for learning uncertain convolutional features.
R-Dropout in Convolution: Assume $X$ is a 3D tensor and $C$ is a convolution operation in CNNs, projecting $X$ into the feature space with parameters $W$ and $b$:

$C(X) = W * X + b.$  (1)
Let $f$ be a nonlinear activation function. When the original dropout [13] is applied to the outputs of $C$, we can get the disturbed version of the activation by

$A = f(C(X)),$  (2)
$\hat{A} = f(C(X)) \odot M,$  (3)
where $\odot$ denotes the element-wise product and $M$ is a binary mask matrix of the same size as $C(X)$, with each element drawn independently from a Bernoulli distribution with retain probability $p$. Eq. (3) denotes the activation with dropout during training, and Eq. (2) denotes the activation at test time. In addition, Srivastava et al. [39] suggest scaling the activations by $p$ at test time to obtain an approximate average of the unit activations.
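As a concrete illustration of Eqs. (2)-(3), here is a minimal NumPy sketch of the original dropout (the function names and the retain probability of 0.5 are our choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p):
    # Training time, Eq.(3): each unit is retained independently
    # with probability p (the mask M is Bernoulli), otherwise zeroed.
    M = (rng.random(a.shape) < p).astype(a.dtype)
    return a * M

def dropout_test(a, p):
    # Test time, Eq.(2) with the scaling suggested by Srivastava et al.:
    # keep every unit and scale by p to approximate the ensemble average.
    return a * p

a = np.ones((4, 4))
out = dropout_train(a, 0.5)   # entries are either 0 or 1
avg = dropout_test(a, 0.5)    # every entry equals 0.5
```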
Many commonly used activation functions, such as Tanh, ReLU and LReLU [12], have the property that $f(0) = 0$. Thus, Eq. (3) can be rewritten as the R-dropout formula,

$\hat{A} = f(C(X)) \odot M$  (4)
$\hat{A} = f(C(X) \odot M)$  (5)
$\hat{A} = f((W * X + b) \odot M)$  (6)
$\hat{A} = f(C(X) \otimes S),$  (7)
where $\otimes$ denotes the cross-channel element-wise product and $S$ generalizes the binary mask $M$. From the above equations we can see that, even when $S$ is still binary, Eq. (7) implies that a kind of stochastic property^1 is applied at the inputs to the activation function. Under suitable constraints on $S$, the above equations strictly construct an ensemble of internal feature units of $X$. In practice, however, there is no guarantee that such constraints hold. Even so, we note that: (1) the stochastic mask matrix $S$ mainly depends on the mask generator^2; (2) when training deep convolutional networks, the R-dropout after convolutional layers acts as an uncertain ensemble of convolutional features; (3) this kind of feature ensemble is element-wise probabilistic, and thus brings robustness to the prediction of dense-labeling vision tasks such as saliency detection and semantic segmentation.

^1 The stochastic property means that one can use a specific probability distribution to generate the learnable tensor $S$ during each training iteration. The update of $S$ forms a stochastic process rather than a deterministic decision.
^2 In R-dropout, the generator can be any probability distribution. The original dropout is a special case of R-dropout in which the generator is the Bernoulli distribution.

Uncertain Convolutional Feature Extraction:
Motivated by the above insights, we employ R-dropout in the convolutional layers of our model, thereby generating deep uncertain convolutional feature maps. Since our model consists of alternating convolutional and pooling layers, there are two typical cases. For notational simplicity, we subsequently drop the batch normalization (BN).
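To make the R-dropout idea concrete, here is a small NumPy sketch (our own illustration): for a binary mask and any activation with $f(0) = 0$, masking after the activation (original dropout, Eq. (4)) equals masking before it (Eq. (5)), and the Bernoulli generator can be swapped for any other distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def r_dropout(pre_act, generator, f=relu):
    # Eq.(7): apply a mask S drawn from an arbitrary generator *inside*
    # the activation; with a Bernoulli generator this reduces to the
    # original dropout because f(0) = 0.
    S = generator(pre_act.shape)
    return f(pre_act * S)

bernoulli = lambda shape: (rng.random(shape) < 0.5).astype(float)
uniform = rng.random   # one alternative mask generator (our choice)

x = rng.standard_normal((3, 3))
M = bernoulli(x.shape)
# Identity behind Eqs.(4)-(5): f(C(X)) * M == f(C(X) * M) for binary M.
same = np.allclose(relu(x) * M, relu(x * M))
```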
1) Conv+R-Dropout+Conv: If the proposed R-dropout is followed by a convolutional layer, the forward propagation of the input $X^l$ is formulated as

$\hat{A}^l = f(C^l(X^l)) \otimes S^l,$  (8)
$X^{l+1} = C^{l+1}(\hat{A}^l)$  (9)
$X^{l+1} = W^{l+1} * \hat{A}^l + b^{l+1},$  (10)

where $l$ is the layer number and $C$ is the convolution operation. As we can see from Eq. (9), the disturbed activation $\hat{A}^l$ is convolved with the filter $W^{l+1}$ to produce the convolved features $X^{l+1}$. In this way, the network focuses on learning the weight and bias parameters, i.e., $W$ and $b$, and the uncertainty introduced by the R-dropout dissipates while training deep networks.
2) Conv+R-Dropout+Pooling: In this case, the forward propagation of the input $X^l$ becomes

$\hat{A}^l = f(C^l(X^l)) \otimes S^l,$  (11)
$X^{l+1} = \mathrm{Pool}(\hat{A}^l).$  (12)

Here $\mathrm{Pool}$ denotes the max-pooling function. Let $R$ be a pooling region at layer $l$, let $a_i$ be the activation of each neuron within $R$, and let $n$ be the number of units in $R$. To formulate the uncertainty, without loss of generality, we suppose the activations in each pooling region are ordered non-decreasingly, i.e., $a_1 \le a_2 \le \dots \le a_n$. As a result, $a_i$ will be selected as the pooled activation on the conditions that (1) $a_{i+1}, \dots, a_n$ are all dropped out, and (2) $a_i$ is retained. According to elementary probability theory, this event occurs with probability

$p_i = p\, q^{\,n-i}, \quad q = 1 - p,$  (13)

where $p$ is the retain probability.
Therefore, performing R-dropout before the max-pooling operation amounts exactly to sampling an index $i$ from the following multinomial distribution and taking the pooled activation to be $a_i$,

$i \sim \mathrm{Multinomial}(p_0, p_1, \dots, p_n),$  (14)

where $p_0 = q^n$ is the probability of the special event that all the units in a pooling region are dropped out.
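The distribution in Eqs. (13)-(14) is easy to verify by simulation. The following sketch (pure Python; the retain probability, region size and activation values are our choices) drops each unit of a pooling region independently and counts which ordered activation survives the max:

```python
import random

random.seed(0)
p, n = 0.6, 4                  # retain probability, units per region
a = [0.1, 0.5, 0.9, 1.3]       # a_1 <= ... <= a_n
q = 1 - p

trials = 200_000
counts = [0] * (n + 1)         # counts[0]: all units dropped
for _ in range(trials):
    kept = [v for v in a if random.random() < p]
    if kept:
        # index of the surviving maximum in the sorted list, 1-based
        counts[a.index(max(kept)) + 1] += 1
    else:
        counts[0] += 1

empirical = [c / trials for c in counts]
# Eq.(13)/(14): p_i = p * q**(n - i) for i >= 1, and p_0 = q**n.
predicted = [q ** n] + [p * q ** (n - i) for i in range(1, n + 1)]
```

The empirical frequencies match the closed form to well within Monte Carlo error, and the predicted probabilities sum to one.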
The latter case demonstrates the effectiveness of building uncertainty by employing the R-dropout in convolutional layers. We adopt it to build up our network architecture (see Fig. 2). In Section 4 we experimentally demonstrate that the R-dropout based FCN yields excellent results on saliency detection datasets.
3.3 Hybrid Upsampling for Prediction Smoothing
In this subsection, we first explicate the cause of checkerboard artifacts using the deconvolution arithmetic of [42]. Then we derive a new upsampling method that reduces the artifacts as much as possible during network training and inference.
Without loss of generality, we focus on a square input ($i \times i$), a square kernel ($k \times k$), the same stride ($s$) and the same zero padding ($p$) (if used) along both axes. Since we aim to implement upsampling, the output is larger than the input. In general, the convolution operation can be described by

$O = I *_s K,$  (15)

where $I$ is the input, $K$ is the filter with stride $s$, $*_s$ is the discrete convolution and $O$ is the output, whose dimension is $o = \lfloor (i + 2p - k)/s \rfloor + 1$. The convolution has an associated deconvolution described by $k' = k$, $s' = 1$ and $p' = k - p - 1$, where $\tilde{i}$ is the size of the stretched input obtained by adding $s - 1$ zeros between each pair of input units, and the output size of the deconvolution is $o' = s(i - 1) + k - 2p$.^3 This indicates that the regular deconvolution operator is equivalent to performing convolution on a new input with inserted zeros ($\tilde{i} = i + (i - 1)(s - 1)$). A toy example is shown in Fig. 3. In addition, when the filter size $k$ is not divisible by the stride $s$, the deconvolution causes an overlapping issue. If the stretched input is high-frequency or nearly periodic, i.e., its values undulate strongly once zeros are inserted, the outputs of deconvolution operations naturally contain checkerboard-like numerical artifacts.

^3 The constraint on the size of the input can be relaxed by introducing another parameter that distinguishes between the different cases all leading to the same $\tilde{i}$.
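The size arithmetic above can be written down directly. The following sketch (function names ours) checks that the zero-stretched, stride-1 view of deconvolution reproduces the well-known closed-form output size $o' = s(i - 1) + k - 2p$:

```python
def conv_out(i, k, s, p):
    # Output size of a convolution (cf. Eq. 15).
    return (i + 2 * p - k) // s + 1

def deconv_out(i, k, s, p):
    # Equivalent view: stretch the input with s - 1 zeros between
    # units, then convolve with stride 1 and padding k - p - 1.
    stretched = i + (i - 1) * (s - 1)
    return conv_out(stretched, k, 1, k - p - 1)

# e.g. a stride-2, kernel-3 deconvolution of a 4x4 input gives 9x9:
size = deconv_out(4, 3, 2, 0)
```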
Based on the above observations, we propose two strategies to avoid the artifacts produced by the regular deconvolution. The first one restricts the filter size: we simply ensure that the filter size is a multiple of the stride, avoiding the overlapping issue, i.e.,

$k \bmod s = 0.$  (16)

Then the deconvolution disposes of the zero-inserted input with the equivalent convolution, deriving a smooth output. However, because this method only changes the receptive fields of the output and cannot change the frequency distribution of the zero-inserted input, the artifacts can still leak through in several extreme cases. We therefore propose an alternative strategy that separates upsampling from the equivalent convolution: we first resize the original input to the desired size by interpolation, and then perform equivalent convolutions. Although this strategy may perturb the learned features in deep CNNs, we find that high-resolution maps built by iteratively stacking this kind of upsampling reduce the artifacts remarkably well. To take advantage of both strategies, we introduce a hybrid upsampling method that sums the outputs of the two branches. Fig. 4 illustrates the proposed upsampling method. In our model, we use bilinear (or nearest-neighbor) interpolation. These interpolation methods are linear operations and can be embedded into deep CNNs as efficient matrix multiplications.
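A 1-D NumPy sketch of the hybrid upsampling may help fix ideas. This is our simplification, not the paper's implementation: toy (untrained) kernels, nearest-neighbour interpolation, and the stretched input padded to length $s \cdot i$ so the two branches align before summation:

```python
import numpy as np

def conv1d_same(x, w):
    # Same-size convolution with zero padding (output length == input).
    pad = len(w) // 2
    xp = np.pad(x, pad)
    return np.array([xp[i:i + len(w)] @ w for i in range(len(x))])

def stretch(x, s):
    # Deconvolution view: interleave s - 1 zeros (padded to length s*i).
    out = np.zeros(s * len(x))
    out[::s] = x
    return out

def hybrid_upsample(x, w_deconv, w_interp, s=2):
    # Branch (a): restricted deconvolution; the kernel size equals the
    # stride (a multiple of s), so there is no overlap.
    a = conv1d_same(stretch(x, s), w_deconv)
    # Branch (b): interpolate first (nearest neighbour), then convolve.
    b = conv1d_same(np.repeat(x, s), w_interp)
    return a + b

y = hybrid_upsample(np.array([1.0, 2.0]),
                    np.array([1.0, 1.0]),      # toy deconvolution kernel
                    np.array([0.5, 0.5]))      # toy smoothing kernel
# y == [1.5, 2.0, 3.5, 4.0]
```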
3.4 Training the Entire Network
Since there is not enough saliency detection data to train our model from scratch, we utilize the front-end of the VGG-16 model [38] as our encoder FCN (13 convolutional layers and 5 pooling layers pre-trained on ILSVRC 2014 for the image classification task). Our decoder FCN is a mirrored version of the encoder FCN, with multiple series of upsampling, convolution and rectification layers. Batch normalization (BN) is added to the output of every convolutional layer. We add the R-dropout with an equal sampling rate after specific convolutional layers, as shown in Fig. 2. For saliency detection, we randomly initialize the weights of the decoder FCN and fine-tune the entire network on the MSRA10K dataset [4], which is widely used in the salient object detection community (more details are described in Section 4). We convert the ground-truth saliency map of each image in that dataset to a 0-1 binary map. This transform matches the channel output of the FCN when we use the softmax cross-entropy loss for separating the salient foreground from the general background:

$L = -\dfrac{1}{N} \sum_{i=1}^{N} \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big],$  (17)

where $y_i$ is the label of pixel $i$ and $p_i$ is the probability that the pixel belongs to the salient foreground, obtained from the output of the network. Before feeding the training images into our model, each image is subtracted by the ImageNet mean [5] and rescaled to the same size (448×448). Correspondingly, we also rescale the 0-1 binary maps to the same size. The model is trained end-to-end using mini-batch stochastic gradient descent (SGD) with momentum and a learning-rate decay schedule. The detailed parameter settings are included in the supplementary material.
3.5 Saliency Inference
Because our model is a fully convolutional network, it can take images of arbitrary size as inputs at test time. After the feed-forward pass, the output of the network consists of a foreground excitation map ($M_F$) and a background excitation map ($M_B$). We take the difference between $M_F$ and $M_B$ and clip the negative values to obtain the resulting saliency map, i.e.,

$\mathrm{Sal} = \max(M_F - M_B,\, 0).$  (18)

This subtraction strategy not only increases the pixel-level discrimination but also captures contextual contrast information. Optionally, we can ensemble multi-scale predicted maps to further improve performance.
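The inference step of Eq. (18) is a one-liner in NumPy (the function name and the toy excitation values are ours):

```python
import numpy as np

def saliency(fg, bg):
    # Eq.(18): clipped difference of the foreground and background
    # excitation maps; negative responses are set to zero.
    return np.clip(fg - bg, 0.0, None)

s = saliency(np.array([0.8, 0.2]), np.array([0.3, 0.5]))
# s == [0.5, 0.0]
```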
4 Experiments
In this section, we start by describing the experimental setup for saliency detection. Then, we thoroughly evaluate and analyze our proposed model on public saliency detection datasets. Finally, we provide additional experiments to verify the generalization of our methods on other pixelwise vision tasks, i.e., semantic segmentation and eye fixation.
4.1 Experimental Setup
Saliency Datasets: To train the proposed network, we augment the MSRA10K dataset [4] by mirror reflection and rotation, producing 80,000 training images in total. For the performance evaluation, we adopt six widely used saliency detection datasets as follows.
DUT-OMRON [49]. This dataset consists of 5,168 high-quality images. Images in this dataset have one or more salient objects and relatively complex backgrounds, making it difficult and challenging for saliency detection.
ECSSD [48]. This dataset contains 1,000 natural images, including many semantically meaningful and complex structures in the ground truth segmentations.
HKU-IS [50]. This dataset contains 4,447 images with high-quality pixel-wise annotations. Images in this dataset are well chosen to include multiple disconnected objects or objects touching the image boundary.
PASCAL-S [27]. This dataset is carefully selected from the PASCAL VOC dataset [8] and contains 850 images.
SED [2]. This dataset contains two different subsets: SED1 and SED2. The SED1 has 100 images each containing only one salient object, while the SED2 has 100 images each containing two salient objects.
SOD [48]. This dataset has 300 images, and it was originally designed for image segmentation. Pixelwise annotation of salient objects was generated by [18].
Implementation Details: We implement our approach on the MATLAB R2014b platform with a modified Caffe toolbox [20]. We run our approach on a quad-core PC with an i7-4790 CPU (16 GB memory) and an NVIDIA Titan X GPU (12 GB memory). The training process takes almost 23 hours and converges after 200k iterations of mini-batch SGD. The proposed saliency detection algorithm runs at about 7 fps (23 fps at a smaller input resolution). The source code can be found at http://ice.dlut.edu.cn/lu/.

Saliency Evaluation Metrics:
We adopt three widely used metrics to measure the performance of all algorithms: the Precision-Recall (PR) curve, the F-measure and the Mean Absolute Error (MAE) [3]. Precision and recall are computed by thresholding the predicted saliency map and comparing the binary map with the ground truth. The PR curve of a dataset indicates the mean precision and recall of saliency maps at different thresholds. The F-measure is a weighted harmonic mean of average precision and average recall, calculated as

$F_\beta = \dfrac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}.$  (19)

Following existing works [48, 44, 3, 49], we set $\beta^2$ to 0.3 to weigh precision more than recall. We report the performance when each saliency map is adaptively binarized with an image-dependent threshold, determined as twice the mean saliency of the image:

$T = \dfrac{2}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} S(x, y),$  (20)

where $W$ and $H$ are the width and height of an image, and $S(x, y)$ is the saliency value of the pixel at $(x, y)$.
We also calculate the mean absolute error (MAE) for fair comparisons, as suggested by [3]. The MAE evaluates the saliency detection accuracy by

$\mathrm{MAE} = \dfrac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|,$  (21)

where $G$ is the binary ground-truth mask.
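The three metrics above can be sketched together in a few lines of NumPy (function names and the toy ground truth are ours; the adaptive threshold follows Eq. (20)):

```python
import numpy as np

def adaptive_threshold(sal):
    # Eq.(20): twice the mean saliency value of the map.
    return 2.0 * sal.mean()

def f_measure(sal, gt, beta2=0.3):
    # Eq.(19) after binarising with the adaptive threshold.
    pred = sal >= adaptive_threshold(sal)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0

def mae(sal, gt):
    # Eq.(21): mean absolute difference to the binary ground truth.
    return np.abs(sal - gt).mean()

gt = np.array([[1, 1], [0, 0]])
sal = gt.astype(float)            # a perfect prediction
fb, err = f_measure(sal, gt), mae(sal, gt)
```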
Figure 5: Comparison results on (a) ECSSD, (b) SED1 and (c) SED2.
Table 1: Quantitative comparison in terms of F-measure (Fβ, larger is better) and MAE (smaller is better). In the original, red, green and blue mark the best, second-best and third-best results.

Methods | DUT-OMRON Fβ/MAE | ECSSD Fβ/MAE | HKU-IS Fβ/MAE | PASCAL-S Fβ/MAE | SED1 Fβ/MAE | SED2 Fβ/MAE
UCF | 0.6283/0.1203 | 0.8517/0.0689 | 0.8232/0.0620 | 0.7413/0.1160 | 0.8647/0.0631 | 0.8102/0.0680
VE | 0.6135/0.1224 | 0.7857/0.0795 | 0.7716/0.0785 | 0.6303/0.1284 | 0.8128/0.0732 | 0.7576/0.0851
VD | 0.5072/0.1345 | 0.6942/0.1195 | 0.6851/0.0967 | 0.5695/0.1624 | 0.7754/0.0844 | 0.6930/0.0954
VC | 0.6165/0.1210 | 0.8426/0.0711 | 0.8156/0.0670 | 0.7201/0.1203 | 0.8665/0.0653 | 0.8014/0.0795
VB | 0.6168/0.1305 | 0.8356/0.0781 | 0.8060/0.0651 | 0.6845/0.1254 | 0.8547/0.0685 | 0.7905/0.0709
VA | 0.6128/0.1409 | 0.8166/0.0811 | 0.7346/0.0988 | 0.6172/0.1367 | 0.7641/0.1023 | 0.6536/0.1044
DCL [24] | 0.6842/0.1573 | 0.8293/0.1495 | 0.8533/0.1359 | 0.7141/0.1807 | 0.8546/0.1513 | 0.7946/0.1565
DS [26] | 0.6028/0.1204 | 0.8255/0.1216 | 0.7851/0.0780 | 0.6590/0.1760 | 0.8445/0.0931 | 0.7541/0.1233
ELD [22] | 0.6109/0.0924 | 0.8102/0.0796 | 0.7694/0.0741 | 0.7180/0.1232 | 0.8715/0.0670 | 0.7591/0.1028
LEGS [44] | 0.5915/0.1334 | 0.7853/0.1180 | 0.7228/0.1193 | –/– | 0.8542/0.1034 | 0.7358/0.1236
MDF [50] | 0.6442/0.0916 | 0.8070/0.1049 | 0.8006/0.0957 | 0.7087/0.1458 | 0.8419/0.0989 | 0.8003/0.1014
RFCN [45] | 0.6265/0.1105 | 0.8340/0.1069 | 0.8349/0.0889 | 0.7512/0.1324 | 0.8502/0.1166 | 0.7667/0.1131
BL [41] | 0.4988/0.2388 | 0.6841/0.2159 | 0.6597/0.2071 | 0.5742/0.2487 | 0.7675/0.1849 | 0.7047/0.1856
BSCA [33] | 0.5091/0.1902 | 0.7048/0.1821 | 0.6544/0.1748 | 0.6006/0.2229 | 0.8048/0.1535 | 0.7062/0.1578
DRFI [18] | 0.5504/0.1378 | 0.7331/0.1642 | 0.7218/0.1445 | 0.6182/0.2065 | 0.8068/0.1480 | 0.7341/0.1334
DSR [25] | 0.5242/0.1389 | 0.6621/0.1784 | 0.6772/0.1422 | 0.5575/0.2149 | 0.7909/0.1579 | 0.7116/0.1406
4.2 Performance Comparison with the State-of-the-art
We compare the proposed UCF algorithm with 10 state-of-the-art methods, including 6 deep learning based algorithms (DCL [24], DS [26], ELD [22], LEGS [44], MDF [50], RFCN [45]) and 4 conventional counterparts (BL [41], BSCA [33], DRFI [18], DSR [25]). For a fair comparison, we use the source codes with the recommended parameters or the saliency maps provided by the authors.
As shown in Fig. 5 and Tab. 1, our proposed UCF model can consistently outperform existing methods across almost all the datasets in terms of all evaluation metrics, which convincingly indicates the effectiveness of the proposed methods. Refer to the supplemental material for more results on DUTOMRON, HKUIS, PASCALS and SOD datasets.
From these results, we make several fundamental observations. (1) Our UCF model outperforms other algorithms on the ECSSD and SED datasets by a large margin in terms of F-measure and MAE. More specifically, our model improves the F-measure achieved by the best-performing existing algorithm by 3.9% and 6.15% on the ECSSD and SED datasets, respectively. The MAE is consistently improved. (2) Although our proposed UCF is not the best on the HKU-IS and PASCAL-S datasets, it is still very competitive (our model ranks second on these datasets). Note that only the augmented MSRA10K dataset is used to train our model, whereas the RFCN, DS and DCL methods are pre-trained on the additional PASCAL VOC segmentation dataset [9], which overlaps with the PASCAL-S and HKU-IS datasets. This fact may explain their success on these two datasets; however, their performance on the other datasets is clearly inferior. (3) Compared with other methods, our proposed UCF achieves lower MAE on most datasets, which indicates that, thanks to the uncertain feature learning, our model is more confident about the predicted regions.

The visual comparison of different methods on typical images is shown in Fig. 7. Our saliency maps can reliably highlight the salient objects in various challenging scenarios, e.g., low contrast between objects and backgrounds (the first two rows), multiple disconnected salient objects (the 3rd-4th rows) and objects near the image boundary (the 5th-6th rows). In addition, our saliency maps provide more accurate boundaries of the salient objects (the 1st, 3rd and 6th-8th rows).
Table 2: Settings of the UCF variants (VA, VB, VC, VD, VE, UCF), combining the components +Dropout, +R-Dropout, +Restricted Deconvolution and +Interpolation.
Ablation Studies:
To verify the contribution of each component, we evaluate several variants of the proposed UCF model with different settings, as illustrated in Tab. 2. The corresponding performances are reported in Tab. 1. The VA model is an approximation of DeconvNet [31]. The comparison between VA and VB demonstrates that our uncertainty learning mechanism indeed helps learn more robust features for accurate saliency inference. The comparison between VB and VC shows the effects of the two upsampling strategies; the results imply that the interpolation strategy performs much better in saliency detection. The joint comparison of VB, VC and UCF confirms that our hybrid upsampling method is capable of better refining the output saliency maps. An example of the visual effects is illustrated in Fig. 6. In addition, the VD and VE models verify the usefulness of the deconvolution and interpolation upsampling, respectively. The VB and VC models achieve competitive or even better results than other saliency methods, which further confirms the strength of our methods.
4.3 Generalization Evaluation
To verify the generalization of our methods, we perform additional experiments on other pixelwise vision tasks.
Semantic Segmentation:
Following existing works [31, 20], we simply change the classifier to 21 classes and perform the semantic segmentation task on the PASCAL VOC 2012 dataset [9]. Our UCF model is trained on the PASCAL VOC 2011 training and validation data, using Berkeley's extended annotations [11]. We achieve impressive results (mean IoU: 68.25, mean pixel accuracy: 92.19, pixel accuracy: 77.28), which are comparable with other state-of-the-art segmentation methods. In addition, though the performance gaps are not as large as in saliency detection, our new upsampling method indeed performs better than the regular deconvolution (mean IoU: 67.45 vs 65.173, mean pixel accuracy: 91.21 vs 90.84, pixel accuracy: 76.18 vs 75.73).
Eye Fixation:
The task of eye fixation prediction is essentially different from our classification task; we therefore use the Euclidean loss for gaze prediction. We submit our results to the servers of the MIT300 [19], iSUN [47] and SALICON [17] benchmarks with the standard setups. Our model achieves comparable results, as shown in Tab. 3. All of the above results on semantic segmentation and eye fixation indicate that our model generalizes well to other pixel-wise tasks.
5 Conclusion
In this paper, we propose a novel fully convolutional network for saliency detection. A reformulated dropout is utilized to facilitate probabilistic training and inference. This uncertainty learning mechanism enables our method to learn uncertain convolutional features and yield more accurate saliency predictions. A new upsampling method is also proposed to reduce the artifacts of deconvolution operations and explicitly encourage the network to learn accurate boundaries for saliency detection. Extensive evaluations demonstrate that our methods significantly improve saliency detection performance and generalize well to other pixel-wise vision tasks.
Acknowledgments. We thank Alex Kendall for sharing the SegNet code. This work is supported by the Natural Science Foundation of China, #61472060, #61502070, #61528101 and #61632006.
References
 [1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, pages 73–80, 2010.
 [2] A. Borji. What is a salient object? a dataset and a baseline model for salient object detection. IEEE TIP, 24(2):742–756, 2015.
 [3] A. Borji, M.M. Cheng, H. Jiang, and J. Li. Salient object detection: A benchmark. IEEE TIP, 24(12):5706–5722, 2015.
 [4] M.M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.M. Hu. Global contrast based salient region detection. IEEE TPAMI, 37(3):569–582, 2015.
 [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In CVPR, 2009.
 [6] Y. Ding, J. Xiao, and J. Yu. Importance filtering for image retargeting. In CVPR, pages 89–96, 2011.
 [7] A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. arXiv:1506.02753, 2015.
 [8] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88:303–338, 2010.
 [9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
 [10] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Insights and applications. In Deep Learning Workshop, ICML, 2015.
 [11] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In CVPR, pages 1026–1034, 2015.
 [13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing coadaptation of feature detectors. arXiv:1207.0580, 2012.
 [14] S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. arXiv:1502.06796, 2015.
 [15] X. Hou and L. Zhang. Saliency detection: A spectral residual approach. In CVPR, pages 1–8, 2007.
 [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
 [17] M. Jiang, S. Huang, J. Duan, and Q. Zhao. SALICON: Saliency in context. In CVPR, 2015.
 [18] P. Jiang, H. Ling, J. Yu, and J. Peng. Salient region detection by ufo: Uniqueness, focusness and objectness. In ICCV, pages 1976–1983, 2013.
 [19] T. Judd, F. Durand, and A. Torralba. A benchmark of computational models of saliency to predict human fixations. In MIT Technical Report, 2012.
 [20] A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv:1511.02680, 2015.
 [21] D. A. Klein and S. Frintrop. Center-surround divergence of feature statistics for salient object detection. In ICCV, pages 2214–2219, 2011.
 [22] G. Lee, Y.-W. Tai, and J. Kim. Deep saliency with encoded low level distance map and high level features. In CVPR, June 2016.

 [23] G. Li and Y. Yu. Visual saliency based on multi-scale deep features. In CVPR, pages 5455–5463, 2015.
 [24] G. Li and Y. Yu. Deep contrast learning for salient object detection. In CVPR, pages 478–487, 2016.
 [25] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang. Saliency detection via dense and sparse reconstruction. In ICCV, pages 2976–2983, 2013.
 [26] X. Li, L. Zhao, L. Wei, M.-H. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang. DeepSaliency: Multi-task deep neural network model for salient object detection. IEEE TIP, 25(8):3919–3930, 2016.
 [27] Y. Li, X. Hou, C. Koch, J. Rehg, and A. Yuille. The secrets of salient object segmentation. In CVPR, pages 280–287, 2014.
 [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
 [29] V. Mahadevan and N. Vasconcelos. Saliency-based discriminant tracking. In CVPR, pages 1007–1013, 2009.
 [30] L. Marchesotti, C. Cifarelli, and G. Csurka. A framework for visual saliency detection with applications to image thumbnailing. In ICCV, pages 2232–2239, 2009.
 [31] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, pages 1520–1528, 2015.
 [32] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. http://distill.pub/2016/deconv-checkerboard/, 2016.
 [33] Y. Qin, H. Lu, Y. Xu, and H. Wang. Saliency detection via cellular automata. In CVPR, pages 110–119, 2015.
 [34] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM TOG, volume 23, pages 309–314, 2004.
 [35] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. arXiv:1606.03498, 2016.

 [36] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
 [37] C. Siagian and L. Itti. Rapid biologically-inspired scene classification using features shared with visual attention. IEEE TPAMI, 29(2):300–312, 2007.
 [38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
 [39] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
 [40] J. Sun and H. Ling. Scale and object aware image retargeting for thumbnail browsing. In ICCV, pages 1511–1518, 2011.
 [41] N. Tong, H. Lu, X. Ruan, and M.-H. Yang. Salient object detection via bootstrap learning. In CVPR, pages 1884–1892, 2015.
 [42] V. Dumoulin and F. Visin. A guide to convolution arithmetic for deep learning. arXiv:1603.07285, 2016.

 [43] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11(Dec):3371–3408, 2010.
 [44] L. Wang, H. Lu, X. Ruan, and M.-H. Yang. Deep networks for saliency detection via local estimation and global search. In CVPR, pages 3183–3192, 2015.
 [45] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan. Saliency detection with recurrent fully convolutional networks. In ECCV, pages 825–841, 2016.
 [46] T. Wang, L. Zhang, H. Lu, C. Sun, and J. Qi. Kernelized subspace ranking for saliency detection. In ECCV, pages 450–466, 2016.
 [47] P. Xu, K. A. Ehinger, Y. Zhang, A. Finkelstein, S. R. Kulkarni, and J. Xiao. TurkerGaze: Crowdsourcing saliency with webcam based eye tracking. arXiv:1504.06755, 2015.
 [48] Q. Yan, L. Xu, J. Shi, and J. Jia. Hierarchical saliency detection. In CVPR, pages 1155–1162, 2013.
 [49] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang. Saliency detection via graph-based manifold ranking. In CVPR, pages 3166–3173, 2013.
 [50] R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multi-context deep learning. In CVPR, pages 1265–1274, 2015.