MEnet: A Metric Expression Network for Salient Object Segmentation

05/15/2018 ∙ by Shulian Cai, et al. ∙ Xiamen University ∙ South China University of Technology ∙ International Student Union ∙ Columbia University

Recent CNN-based saliency models have achieved great performance on public datasets; however, most of them are sensitive to distortion (e.g., noise, compression). In this paper, an end-to-end generic salient object segmentation model called Metric Expression Network (MEnet) is proposed to overcome this drawback. Within this architecture, we construct a new topological metric space, with the implicit metric being determined by the deep network. In this way, we succeed in semantically grouping all the pixels of the observed image within this latent space into two regions: a salient region and a non-salient region. With this method, all feature extraction is carried out at the pixel level, which makes the output boundaries of salient objects fine-grained. Experimental results show that the proposed metric can generate robust saliency maps that allow for object segmentation. Tests on several public benchmarks show that MEnet achieves good results. Furthermore, the proposed method outperforms previous CNN-based methods on distorted images.




I Introduction

Image saliency detection and segmentation is of significant interest in the fields of computer vision and pattern recognition. Recent saliency detection studies can be divided into two categories: approaches based on hand-crafted features and learning-based approaches. In previous literature, the majority of saliency detection methods use hand-crafted features. Traditional low-level features for such saliency detection models mainly consist of color, intensity, texture and structure [1, 2, 3]. Though hand-crafted features with heuristic priors perform well in simple scenes, they are not robust to more challenging cases, such as when salient regions have a similar color to the background.

On the other hand, learning-based methods, in particular convolutional neural networks (CNNs) [4], have been proposed to address the shortcomings of hand-crafted features for saliency detection and have achieved remarkable performance. [5] uses a multistage refinement mechanism to effectively combine high-level object-level semantics with low-level image features to produce high-resolution saliency maps. [6, 7, 8] exploit multi-level and multi-scale convolutional features for object segmentation. Even with this performance, CNN-based approaches still have room to improve: few previous works pay attention to robustness in distorted scenes, while the performance of neural networks is susceptible to typical distortions such as noise [9].

Fig. 1: Saliency segmentation by our algorithm. Panels: (a) Image, (b) GT, (c) MEnet.
Fig. 2: The proposed framework for saliency segmentation.

Recently, metric learning has received much attention in computer vision tasks such as image segmentation [10], face recognition [11] and human identification [12], for measuring similarity between objects. Inspired by the metric learning framework, we propose a saliency model that works in a learned metric space.

We propose a deep metric learning architecture for saliency segmentation with potentially distorted images. We use semantic features extracted from a deep CNN to learn a homogeneous metric space. The features are at the pixel level and allow for distinguishing between salient regions and background through a distance measure. In addition, we introduce a novel metric loss function which is based on metric learning and cross entropy. We also use multi-level information for feature extraction, similar to approaches such as Hypercolumns [13] and U-net [14]. We experiment with several benchmark data sets and achieve state-of-the-art level results. For instance, Figure 1 shows an example of a natural image, its ground truth salient region in white, and our result. Moreover, the proposed model is robust to distorted images.

II Metric Expression Network (MEnet)

We illustrate our model architecture in Figure 2. An encoder-decoder CNN first generates feature maps at different scales (blocks), which through convolution and up-sampling give a feature vector for each pixel of an image according to how it maps through the layers. These extracted features are then passed through further convolutions and used in a metric loss and a cross entropy loss for saliency detection, as described below.

II-A Encoder-decoder CNN for feature extraction

In SegNet [15] and U-net, an encoder-decoder is used to extract multi-scale features. We use a similar structure in the proposed model. Since global information plays an important role in saliency segmentation [16], we use convolutions and pooling layers to increase the receptive field of the model, and compress all feature information into the feature maps shown as the white box in Figure 2.

Through the decoder module, we up-sample these feature maps and the feature map at each scale represents information at one semantic level. We therefore propose a symmetric encoder-decoder CNN architecture.

Fig. 3: Basic encoder (left) and decoder (right) blocks.

The encoder-decoder network of Figure 2 uses a deep symmetric CNN architecture with short connections, as indicated by black arrows. It consists of an encoder half (left) and a decoder half (right), each block of which applies one of the two basic blocks shown in Figure 3. For encoding, at each down-sampling step we double the number of feature channels using a convolution with stride 2. For decoding, each step in the decoder path concatenates its input with the short connection and then up-samples the feature map by a deconvolution, also with stride 2.

In the decoder path, we concatenate the corresponding feature maps from the encoder path. This part is similar to U-Net. The difference is that U-net is designed for edge detection, and it works well even though it crops feature maps from the encoder path (see Fig. 1 of the U-net paper), as cropping does not impact edge detection. For saliency segmentation, in contrast, we maintain the size of the feature maps to make full use of all the information, as the receptive field needs to be much larger. We believe that the full-size feature maps contain rich global information about the original image that can be used by later layers for better prediction.

Our goal in using a symmetric CNN is to generate feature maps at different scales, which are concatenated to give, for each pixel in the input image, a feature vector that contains multi-scale information across its dimensions. Furthermore, rather than doing direct classification on these vectors, we apply further convolutions to balance the unevenness across dimensions, as described in the following paragraphs.

We ultimately want to distinguish salient objects from background, and so want to map image pixels into a feature space where the distance across salient and background regions is large, but the distance within regions is small.

Previous work in this direction showed that deep CNNs can learn a feature representation that captures local and global context information for saliency segmentation [17].

Therefore, as shown in Figure 2, we convert the blocks from the 13 different scales of the encoder-decoder network into a bundle of feature maps, as indicated by the green dashed lines. That is, in the feature extraction part, each scale generates one output feature map of the same size via a single convolution and up-sampling, while the first “feature map” is simply obtained by convolving the original image across its RGB channels.

The proposed algorithm is partially similar to the Hypercolumns model: during testing, Hypercolumns takes the outputs of all these layers, up-samples them using bilinear interpolation and sums them for the final prediction. The difference is that, during training, Hypercolumns predicts heatmaps from feature maps of different scales by stacking on additional convolutional layers. In this respect Hypercolumns is more like DHSNet [7], which utilizes multi-scale saliency labels for segmentation. In contrast, MEnet up-samples the feature map of every scale to the same size during training. As the components of these features from the 13 scales are uneven with respect to each other, they cannot be applied directly to classification, e.g., by attaching a loss function to them. The components from different scales should be balanced, and one possible way is to filter these 13-dimensional features by convolution. Consequently, after concatenating the feature maps at each level, we further use a convolution with 16 kernels to generate the final feature map, balancing the dimensional characteristics under the constraint of minimizing the cross entropy (see the following section). In this case, the final feature vector of each pixel has 16 dimensions.
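The assembly of per-pixel feature vectors described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nearest-neighbour up-sampling, the 1x1 kernel size, and the function names `upsample_nearest`, `stack_scales` and `pointwise_conv` are all assumptions made for the example. Only the counts (13 single-channel maps in, 16 kernels out) come from the text.

```python
def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour up-sampling of a 2-D map (list of rows) to out_h x out_w.
    The up-sampling operator is an assumption; the text does not specify it."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def stack_scales(maps, out_h, out_w):
    """Bring every single-channel map to the same size and stack them per pixel,
    so each pixel gets one value per scale."""
    resized = [upsample_nearest(m, out_h, out_w) for m in maps]
    return [[[m[r][c] for m in resized] for c in range(out_w)]
            for r in range(out_h)]


def pointwise_conv(features, kernels, biases):
    """Apply a 1x1 convolution (assumed kernel size): each kernel is a weight
    vector over the input channels, producing one output channel."""
    return [[[sum(w * x for w, x in zip(k, px)) + b
              for k, b in zip(kernels, biases)]
             for px in row]
            for row in features]
```

With 13 input maps and 16 kernels of 13 weights each, `pointwise_conv(stack_scales(maps, H, W), kernels, biases)` returns an H x W grid whose entries are 16-dimensional vectors, one per pixel.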

II-B Loss function

Most previous work on saliency detection based on deep learning used cross entropy (CE) to optimize the network [18, 6]. The CE loss (Equation 1) is defined for each image in the training set as a function of the learnable network parameters that influence it. It is accumulated over the pixel domain of the image, using an indicator function on the label of each pixel, which takes one value for salient pixels and the other for non-salient pixels, together with the label probability of each pixel predicted by the network. In MEnet, we generate this probability via a convolution with 2 kernels from the feature extraction part, as shown in Figure 2.
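As a sketch of the cross entropy loss described above, one conventional pixel-wise form is given below. The notation is assumed for illustration and may differ from the original Equation 1: $\theta$ denotes the learnable parameters, $\Omega$ the pixel domain, $y_j^{(i)} \in \{0, 1\}$ the label of the $j$-th pixel of the $i$-th image (1 for salient, 0 for non-salient), $\mathbb{1}\{\cdot\}$ the indicator function, and $P$ the label probability predicted by the network. The normalization by $|\Omega|$ is also an assumption.

```latex
L_{CE}^{(i)}(\theta) = -\frac{1}{|\Omega|} \sum_{j \in \Omega} \sum_{c \in \{0, 1\}}
  \mathbb{1}\{ y_j^{(i)} = c \} \, \log P\big( y_j^{(i)} = c \mid \theta \big)
```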

Inspired by metric learning, we also introduce our metric loss function (ML), defined as Equation 2. In our network, the input is an RGB image, and all images are resized to a common size. The output is a feature metric space generated by the 16-kernel convolution in Figure 2, so that each pixel in the image corresponds to a C-dimensional vector in the salient feature maps, where C is the number of kernels.

The metric loss depends on the set of learnable network parameters that influence it and on the feature vectors corresponding to the pixels of each image in the training set. For the feature vector of a given pixel, a positive feature vector is one from the same region (salient or non-salient), and a negative feature vector is one from the other region. We use Euclidean distance to calculate the distance between two feature vectors.

This loss function (2) seeks an encoder-decoder network that enlarges the distance between any pair of feature vectors from different regions, and reduces the distance between pairs from the same region. In this way, each of the two regions is expected to be homogeneous. By a simple deduction, it is equivalent to a form (Equation 3) written in terms of two region means, obtained by averaging the feature vectors: one is the mean of all positive pixels from a single image, while the other corresponds to all negative pixels. Intuitively, Equation 3 enforces that the feature vectors extracted from the same region be close to the center of that region while keeping away from the center of the other region in the salient feature space. In this case, we can obtain a more robust distance evaluation between the salient object and background. We also add a second, cross entropy loss function to the objective as a constraint, which shares the same network architecture, and have empirically noticed that the combined results were significantly better than using either the metric or cross entropy loss alone. Therefore, our final loss function (Equation 4) combines the metric loss and the cross entropy loss.
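The center-based view of the metric loss can be illustrated with a small sketch. The exact form of Equations 2 and 3 is not reproduced here; the function below uses one plausible form, assumed purely for illustration: the mean squared Euclidean distance of each feature vector to the center of its own region, minus its squared distance to the center of the other region. The names `mean_vector`, `sq_dist` and `center_metric_loss` are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def center_metric_loss(features, labels):
    """Illustrative center-based loss (an assumed form, not the paper's Equation 3).

    features: list of per-pixel feature vectors from one image.
    labels:   list of 0/1 labels (1 = salient, 0 = non-salient).
    Lower values mean pixels sit close to their own region center and far
    from the center of the other region.
    """
    pos = [f for f, y in zip(features, labels) if y == 1]
    neg = [f for f, y in zip(features, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both regions must contain at least one pixel")
    c_pos, c_neg = mean_vector(pos), mean_vector(neg)
    total = 0.0
    for f, y in zip(features, labels):
        own, other = (c_pos, c_neg) if y == 1 else (c_neg, c_pos)
        total += sq_dist(f, own) - sq_dist(f, other)
    return total / len(features)
```

Under this assumed form, features that cluster by region give a much lower loss than the same features with labels that mix the two clusters, which is the behaviour the paragraph above describes.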


Here the weight that balances the two loss terms is set to 1 in our experiments.

II-C Semantic distance expression

If we train MEnet to minimize the final loss function, we obtain a converged network, whose parameters are in their converged state. Given an observed input image for testing, we usually describe each pixel of its pixel domain by its intensities across the channels. However, it is difficult to define a semantic distance on these intensities, e.g., by Euclidean distance. Through the transformation of the converged network, we instead obtain the corresponding feature vectors to represent the input. The distance can then be expressed between these feature vectors, and the saliency map for saliency segmentation is finally obtained from them (Equation 5).

Equation 5 uses the probability distribution function of the feature vector together with a background region and a salient region, both computed from only one component of the loss function (4) within the whole converged network. Note that these two regions are not an accurate segmentation; they are further investigated in the experiment part. To conclude, through the network transformation we succeed in expressing the semantic distance between pixels by the distance between their feature vectors. As illustrated in Figure 4, we anticipate that through a space transformation, the intra-class distance will be smaller than the inter-class distance.
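To make the idea of segmenting by distance in the learned feature space concrete, here is a minimal sketch. It does not reproduce Equation 5. It assumes a simple nearest-center rule: coarse salient and background regions (assumed here to come from a thresholded coarse prediction) give two region centers, and each pixel is assigned to the region whose center is closer in feature space. The names `region_center` and `saliency_from_centers` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def region_center(features, mask, label):
    """Mean feature vector of all pixels whose coarse mask equals `label`."""
    members = [f for f_row, m_row in zip(features, mask)
               for f, m in zip(f_row, m_row) if m == label]
    if not members:
        raise ValueError("coarse region is empty")
    dim = len(members[0])
    return [sum(v[d] for v in members) / len(members) for d in range(dim)]


def saliency_from_centers(features, coarse_mask):
    """Assign each pixel to the nearer region center (assumed decision rule).

    features:    H x W grid of per-pixel feature vectors.
    coarse_mask: H x W grid of 0/1 values, a rough estimate of the two regions.
    Returns an H x W grid of 0/1 saliency labels.
    """
    c_sal = region_center(features, coarse_mask, 1)
    c_bg = region_center(features, coarse_mask, 0)
    return [[1 if sq_dist(f, c_sal) < sq_dist(f, c_bg) else 0 for f in row]
            for row in features]
```

Because the centers are averages over many pixels, a few mislabeled pixels in the coarse regions have limited influence, so the final assignment can correct them when the feature space separates the two regions well.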

Fig. 4: Semantic Distance expression with the network.

III Experiments

We test on several public saliency datasets and on distorted images, and compare with state-of-the-art saliency detection methods. We use the Caffe software package [19] to train our model.


III-A Datasets

The datasets we consider are: MSRA10K [2], DUT-OMRON (DUT-O) [20], HKU-IS [21], ECSSD [22], MSRA1000 (MSRA1K) [23] and SOD [24]. MSRA10K contains 10000 images; it is the largest dataset and covers a large variety of content. HKU-IS contains 4447 images, most of which contain two or more salient objects. ECSSD contains 1000 images. DUT-OMRON contains 5168 images and was originally designed for image segmentation. This dataset is very challenging since most of the images contain complex scenes; existing saliency detection models have yet to achieve high accuracy on it. MSRA1K includes 1000 images, all of which belong to MSRA10K. SOD contains 300 images.

III-B Training

We use stochastic gradient descent (SGD) for optimization, and MSRA10K and HKU-IS are selected for training. For MSRA10K, we use 8500 images for training, 500 images for validation and MSRA1K for testing; HKU-IS is divided into approximately 80/5/15 training-validation-testing splits. To prevent overfitting, all of our models use random cropping and flipping of images as data augmentation. We utilize batch normalization [25] to speed up the convergence of MEnet.

All experiments are performed on a PC with Intel(R) Xeon(R) CPU I7-6900k, 96GB RAM and GTX TITAN X Pascal. We use a block of 4 convolutional layers in each up-sampling and down-sampling operation, so the depth of MEnet is 52 layers. The parameter sizes are shown in Figure 2 and Figure 3. We set the learning rate to 0.1, with weight decay, a momentum of 0.9 and a mini-batch size of 5. We train for 110,000 iterations. Since salient pixels and non-salient pixels are very imbalanced, network convergence to a good local optimum is challenging. Inspired by object detection methods such as SSD [26], we adopt hard negative mining to address this problem. This sampling scheme keeps the ratio of salient to non-salient samples equal to 1, eliminating label bias.
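The balanced sampling described above can be sketched as follows. The text only states that hard negative mining is used and that the resulting ratio is 1; ranking pixels by their per-pixel loss, and the function name `balanced_hard_mining`, are assumptions made for this illustration.

```python
def balanced_hard_mining(losses, labels):
    """Select an equal number of salient and non-salient pixels.

    losses: per-pixel loss values (higher = harder example).
    labels: per-pixel 0/1 labels (1 = salient, 0 = non-salient).
    Keeps every pixel of the smaller class and only the highest-loss pixels
    of the larger class, so the two classes contribute equally.
    Returns the sorted indices of the selected pixels.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(pos), len(neg))

    def hardest(indices):
        # Highest-loss examples first, then keep the top k.
        return sorted(indices, key=lambda i: losses[i], reverse=True)[:k]

    return sorted(hardest(pos) + hardest(neg))
```

Only the selected indices would then contribute to the loss for that image, which removes the bias toward the majority (usually non-salient) class.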

Method   DUT-O           HKU-IS          ECSSD           MSRA1K          SOD
         F       MAE     F       MAE     F       MAE     F       MAE     F       MAE
Ours     0.732   0.074   0.879   0.044   0.880   0.060   0.928   0.028   0.594   0.139
SRM      0.718   0.071   0.877   0.046   0.892   0.056   0.894   0.045   0.617   0.120
NLDF     0.691   0.080   0.873   0.048   0.880   0.063   -       -       0.591   0.130
Amulet   0.654   0.098   0.841   0.052   0.873   0.060   -       -       0.550   0.160
UCF      0.645   0.132   0.820   0.072   0.854   0.078   -       -       0.557   0.186
DCL      0.660   0.095   0.844   0.063   0.857   0.078   0.922   0.035   0.573   0.147
DS       0.646   0.084   0.790   0.079   0.834   0.079   0.858   0.059   0.552   0.141
DHSNet   -       -       0.859   0.053   0.877   0.060   -       -       0.595   0.124
ELD      0.618   0.092   0.779   0.072   0.810   0.080   0.882   0.037   0.540   0.150
MC       0.622   0.094   0.733   0.099   0.779   0.106   0.885   0.044   0.497   0.160
TABLE I: Comparison of quantitative results including F-measure (F, larger is better) and MAE (smaller is better). The top two results are shown in red and blue, respectively. DHSNet is trained on MSRA-B and DUT-O, MSRNet is trained on HKU-IS and MSRA-B, and UCF, Amulet and NLDF are all trained on the MSRA-B dataset, which contains MSRA1K; therefore, we do not compare our model with these four models on this dataset. A dash marks a result that is not reported.
Fig. 5: Visual comparisons with nine methods. MEnet can obtain detailed and accurate saliency maps. Panels: (a) Images, (b) GT, (c) Ours, (d) SRM, (e) NLDF, (f) Amulet, (g) UCF, (h) DHSNet, (i) DCL, (j) DS, (k) ELD, (l) MC.
Fig. 6: Comparison of precision-recall curves with other CNN-based methods on four datasets corrupted by AWGN (with random strengths).
Method   DUT-O           HKU-IS          ECSSD           MSRA1K          SOD
         AWGN    JPEG    AWGN    JPEG    AWGN    JPEG    AWGN    JPEG    AWGN    JPEG
Ours     0.586   0.649   0.710   0.801   0.716   0.792   0.867   0.910   0.466   0.485
SRM      0.200   0.543   0.221   0.658   0.215   0.663   0.504   0.819   0.136   0.415
NLDF     0.402   0.561   0.531   0.700   0.565   0.693   -       -       0.352   0.433
Amulet   0.534   0.529   0.677   0.686   0.695   0.708   -       -       0.420   0.420
UCF      0.519   0.524   0.656   0.682   0.668   0.698   -       -       0.381   0.418
DCL      0.374   0.523   0.477   0.677   0.505   0.657   0.664   0.832   0.286   0.386
DS       0.368   0.497   0.477   0.611   0.532   0.649   0.619   0.771   0.313   0.405
DHSNet   -       -       0.605   0.735   0.622   0.753   -       -       0.394   0.461
ELD      0.454   0.548   0.531   0.686   0.603   0.730   0.737   0.841   0.376   0.444
MC       0.415   0.496   0.475   0.539   0.509   0.648   0.747   0.787   0.305   0.392
TABLE II: Quantitative comparison with recent deep learning methods in different distorted scenes via F-measure (larger is better). The top two results are shown in red and blue, respectively. DHSNet is trained on MSRA-B and DUT-O, MSRNet is trained on HKU-IS and MSRA-B, and UCF, Amulet and NLDF are all trained on the MSRA-B dataset, which contains MSRA1K; therefore, we do not compare our model with these four models on this dataset. JPEG denotes the JPEG compression method. A dash marks a result that is not reported.
Fig. 7: Feature map visualization, where (d) is the feature of the original data, and (e)-(j) and (k)-(p) denote the learned feature maps of the encoder and decoder, respectively. Panels: (a) Image, (b) GT, (c) MEnet, (d) scale0-or, (e) scale0-en, (f) scale1-en, (g) scale2-en, (h) scale3-en, (i) scale4-en, (j) scale5-en, (k) scale5-de, (l) scale4-de, (m) scale3-de, (n) scale2-de, (o) scale1-de, (p) scale0-de.
Fig. 8: The curves of different methods on the DUT-O dataset under various noise variances.

Data     Index       CE-plain   CE-only   MEnet
DUT-O    F-measure   0.631      0.678
         MAE         0.098      0.084
HKU-IS   F-measure   0.803      0.872
         MAE         0.064      0.056
ECSSD    F-measure   0.794      0.855
         MAE         0.093      0.072
MSRA1K   F-measure   0.884      0.915
         MAE         0.037      0.034
SOD      F-measure   0.525      0.555
         MAE         0.156      0.159
TABLE III: The performance of different strategies.

III-C Performance Comparison

We compare MEnet with 9 state-of-the-art models for saliency detection: MC [27], ELD [16], DCL [18], DHSNet [7], DS [28], UCF [29], Amulet [8], SRM [5], NLDF [6] and 2 traditional metric learning methods: AML [30] and Lu’s method [31].

A visual comparison with other state-of-the-art methods is shown in Figure 5. MEnet performs better in these challenging scenes, e.g., when the salient region is similar to the background. F-measure scores and MAE are shown in Table I. Note that the better-performing models (e.g., DHSNet, NLDF, Amulet, SRM and UCF) need a pre-trained model, and a conditional random field (CRF) method [32] is used as post-processing in DCL. Although MEnet is trained from scratch, it is still competitive with state-of-the-art models, particularly on the challenging datasets DUT-O and HKU-IS. Our model generates each saliency map on a GPU.
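For readers unfamiliar with the two scores, the sketch below shows how F-measure and MAE are commonly computed for a single saliency map. The fixed binarization threshold of 0.5 and the weighting `beta_sq=0.3` are assumptions taken from common practice in the saliency literature; the text above does not state which settings were used.

```python
def mae(saliency, gt):
    """Mean absolute error between a saliency map in [0, 1] and a 0/1 ground truth."""
    diffs = [abs(s - g) for s_row, g_row in zip(saliency, gt)
             for s, g in zip(s_row, g_row)]
    return sum(diffs) / len(diffs)


def f_measure(saliency, gt, threshold=0.5, beta_sq=0.3):
    """Weighted F-measure of a thresholded saliency map.

    threshold and beta_sq are assumed defaults, not values from the paper.
    """
    tp = fp = fn = 0
    for s_row, g_row in zip(saliency, gt):
        for s, g in zip(s_row, g_row):
            predicted = s >= threshold
            if predicted and g == 1:
                tp += 1          # salient pixel correctly detected
            elif predicted:
                fp += 1          # background pixel marked salient
            elif g == 1:
                fn += 1          # salient pixel missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

Dataset-level scores are then typically obtained by averaging the per-image values.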

III-C1 Evaluation on distorted images

Note that, as in previous works, MEnet is not trained on distorted images. To show the robustness of MEnet in the distorted setting, we work with public datasets corrupted by Additive White Gaussian Noise (AWGN) and JPEG compression (with random strengths). We compare F-measure scores in Table II, where MEnet clearly outperforms the other methods. Additionally, we show PR curves of our approach in Figure 6. Since the saliency maps generated by metric loss prediction tend to be binary, they are unsuitable for drawing PR curves, which need continuous saliency values. Therefore, we select the saliency maps generated by CE prediction to draw PR curves. From Figure 6, we observe that the performance of the proposed method is a little better than that of the others on distorted datasets. As shown in Figure 8, as the noise variance grows, the performance of other methods degrades rapidly, while MEnet still achieves robust performance. We attribute the robustness of MEnet to the fact that multi-scale features and a metric loss are integrated into this structure: abundant features from both low and high levels are fully utilized, and the metric loss correlates every pixel with the remaining pixels during optimization. For example, a similar metric loss is shown to be robust in human re-identification [12], where it is insensitive to light, deformation and angle, which can be regarded as “noise”. In real-world scenes, images are easily affected by noise and compression. Therefore, we consider the proposed work beneficial for constructing a robust model.
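The AWGN corruption used for these tests can be sketched as below. The range of noise strengths (`sigma_range`) and the function names are hypothetical, since the text only says the strengths are random. JPEG compression is omitted because it needs an image codec.

```python
import random


def add_awgn(image, sigma, rng=None):
    """Add zero-mean Gaussian noise with standard deviation sigma.

    image: H x W grid of intensities in [0, 255].
    Returns a noisy copy with values clipped back to [0, 255].
    """
    rng = rng or random.Random()
    return [[min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


def corrupt_with_random_strength(image, sigma_range=(5.0, 50.0), seed=None):
    """Draw a random noise strength, then corrupt the image with it.

    sigma_range is an assumed range chosen only for illustration.
    Returns the noisy image and the sigma that was used.
    """
    rng = random.Random(seed)
    sigma = rng.uniform(*sigma_range)
    return add_awgn(image, sigma, rng), sigma
```

Passing a fixed `seed` makes the corrupted test set reproducible, so every method is evaluated on exactly the same noisy images.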

III-C2 Advantages of MEnet

To intuitively illustrate the advantage of MEnet, we visualize several feature maps for analysis. As the layers of each scale go deeper, the receptive field of each neuron becomes larger. As shown in Figure 7, we observe that each convolutional layer contains different semantic information, and going deeper allows the model to capture richer structures. Within the decoding part, scale2-de, 3-de and 4-de are sensitive to the salient region, while scale1-de has a higher response to the background region. Other layers, like scale0-de, can distinguish the boundary of salient objects.

To show the effectiveness of our proposed multi-scale feature extraction and loss function, we use different strategies for semantic saliency detection/segmentation, as shown in Table III. The difference between CE-only and CE-plain is that CE-plain does not utilize multi-scale information, which causes performance degradation. We also note that the performance of MEnet is improved after introducing the metric loss.

We compare MEnet with two other traditional metric learning methods for saliency segmentation, AML [30] and Lu [31]. The results in Table IV demonstrate the potential superiority of deep metric learning method over traditional metric learning for semantic saliency detection.

Data     Index       AML     Lu’s    Ours
ECSSD    F-measure   0.667   0.715   0.880
         MAE         0.165   0.136   0.060
MSRA1K   F-measure   0.794   0.806   0.928
         MAE         0.089   0.080   0.028
TABLE IV: Comparison with two traditional methods based on metric learning with F-measure and MAE scores.

IV Conclusion

In this paper, we present an end-to-end deep metric learning architecture (called MEnet) for salient object segmentation. To this end, we use multi-scale feature extraction to obtain abundant semantic information and combine it with deep metric learning to map pixels into a “saliency space” where Euclidean distances can be measured. The resulting mapping distinguishes salient image elements (pixels) from background efficiently. Note that the proposed model is trained from scratch and does not require pre- or post-processing. Experiments on benchmark datasets clearly demonstrate the effectiveness of our model and its robustness to distorted images.


  • [1] C. Yang, L. Zhang et al., “Saliency detection via graph-based manifold ranking,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 3166–3173.
  • [2] M.-M. Cheng, N. J. Mitra et al., “Global contrast based salient region detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 569–582, 2015.
  • [3] A. Borji and L. Itti, “Exploiting local and global patch rarities for saliency detection,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 478–485.
  • [4] Y. LeCun, L. Bottou et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [5] T. Wang, A. Borji et al., “A stagewise refinement model for detecting salient objects in images,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [6] Z. Luo, A. Mishra et al., “Non-local deep features for salient object detection,” in IEEE CVPR, 2017.
  • [7] N. Liu and J. Han, “Dhsnet: Deep hierarchical saliency network for salient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 678–686.
  • [8] P. Zhang, D. Wang et al., “Amulet: Aggregating multi-level convolutional features for salient object detection,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [9] Z. Chen, W. Lin et al., “Image quality assessment guided deep neural networks training,” arXiv preprint arXiv:1708.03880, 2017.
  • [10] A. Fathi, Z. Wojna et al., “Semantic instance segmentation via deep metric learning,” arXiv preprint arXiv:1703.10277, 2017.
  • [11] J. Hu, J. Lu et al., “Discriminative deep metric learning for face verification in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1875–1882.
  • [12] D. Yi, Z. Lei et al., “Deep metric learning for person re-identification,” in Pattern Recognition (ICPR), 2014 22nd International Conference on.   IEEE, 2014, pp. 34–39.
  • [13] B. Hariharan, P. Arbeláez et al., “Hypercolumns for object segmentation and fine-grained localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 447–456.
  • [14] O. Ronneberger, P. Fischer et al., “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [15] V. Badrinarayanan, A. Kendall et al., “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
  • [16] L. Wang, H. Lu et al., “Deep networks for saliency detection via local estimation and global search,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3183–3192.
  • [17] R. Zhao, W. Ouyang et al., “Saliency detection by multi-context deep learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1265–1274.
  • [18] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 478–487.
  • [19] Y. Jia, E. Shelhamer et al., “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia.   ACM, 2014, pp. 675–678.
  • [20] C. Yang, L. Zhang et al., “Saliency detection via graph-based manifold ranking,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 3166–3173.
  • [21] G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5455–5463.
  • [22] Q. Yan, L. Xu et al., “Hierarchical saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1155–1162.
  • [23] T. Liu, Z. Yuan et al., “Learning to detect a salient object,” IEEE Transactions on Pattern analysis and machine intelligence, vol. 33, no. 2, pp. 353–367, 2011.
  • [24] D. Martin, C. Fowlkes et al., “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 2.   IEEE, 2001, pp. 416–423.
  • [25] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning, 2015, pp. 448–456.
  • [26] W. Liu, D. Anguelov et al., “Ssd: Single shot multibox detector,” in European Conference on Computer Vision.   Springer, 2016, pp. 21–37.
  • [27] R. Zhao, W. Ouyang et al., “Saliency detection by multi-context deep learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1265–1274.
  • [28] X. Li, L. Zhao et al., “Deepsaliency: Multi-task deep neural network model for salient object detection,” IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3919–3930, 2016.
  • [29] P. Zhang, D. Wang et al., “Learning uncertain convolutional features for accurate saliency detection,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [30] S. Li, H. Lu et al., “Adaptive metric learning for saliency detection,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3321–3331, 2015.
  • [31] J. You, L. Zhang et al., “Salient object detection via point-to-set metric learning,” Pattern Recognition Letters, vol. 84, pp. 85–90, 2016.
  • [32] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Advances in neural information processing systems, 2011, pp. 109–117.