Activation to Saliency: Forming High-Quality Labels for Unsupervised Salient Object Detection

by   Huajun Zhou, et al.

Unsupervised Salient Object Detection (USOD) is important for both industrial applications and downstream tasks. Existing deep-learning (DL) based USOD methods use low-quality saliency predictions produced by several traditional SOD methods as saliency cues, which mainly capture conspicuous regions in images. They then refine these saliency cues with the assistance of semantic information obtained from models trained by supervised learning on other related vision tasks. In this work, we propose a two-stage Activation-to-Saliency (A2S) framework that generates high-quality saliency cues and uses these cues to train a robust saliency detector. More importantly, no human annotations are involved in our framework during the whole training process. In the first stage, we transform a pretrained network (MoCo v2) to aggregate multi-level features into a single activation map, and propose an Adaptive Decision Boundary (ADB) to assist the training of the transformed network. To facilitate the generation of high-quality pseudo labels, we propose a loss function that enlarges the feature distances between pixels and their means. In the second stage, an Online Label Rectifying (OLR) strategy updates the pseudo labels during training to reduce the negative impact of distractors. In addition, we construct a lightweight saliency detector using two Residual Attention Modules (RAMs), which refine the high-level features using the complementary information in low-level features, such as edges and colors. Extensive experiments on several SOD benchmarks show that our framework performs favorably against existing USOD methods. Moreover, training our framework on 3000 images takes about 1 hour, which is over 30x faster than previous state-of-the-art methods.




I Introduction

Research on supervised Salient Object Detection (SOD) has achieved impressive results [f3net, inv, rfcn, ucf, page, gl, dcl, dscn, scrn] owing to the development of Convolutional Neural Networks (CNNs) [ghost, res2net, efficientnet, mobilev2]. An essential prerequisite for these advances is the availability of large-scale, high-quality human-labeled datasets. However, annotating salient objects at the pixel level is laborious. Therefore, Unsupervised Salient Object Detection (USOD) is receiving increasing attention because it requires no extra annotation effort. The main challenge of USOD is how to model image saliency with prior knowledge and generate high-quality pseudo labels for training a saliency detector.

Fig. 1: Saliency information extracted from the activation map using the proposed adaptive decision boundary (ADB). Green and red blocks indicate correct and inverted saliency maps, respectively.

Since hand-crafted features lack semantic information, traditional SOD methods can easily segment regions with conspicuous colors but struggle to capture more complex salient objects. Existing deep-learning (DL) based USOD methods [sbf, mnl, usps] collect the saliency predictions of several traditional SOD methods as saliency cues and refine them with the assistance of semantic information. This semantic information is obtained from models trained by supervised learning on other vision tasks, such as object recognition [vgg, resnet] and semantic segmentation [deeplab, cityscape]. The methods in [sbf, mnl] directly use these saliency cues as pseudo labels to train a saliency detector with the help of auxiliary models, such as fusion and noise models. More recently, USPS [usps] used these saliency cues to train multiple networks and refined the cues using intermediate predictions. However, methods based on hand-crafted features usually generate low-confidence regions, which greatly limits the effectiveness of existing USOD methods. Different from the above methods, all components in our framework are trained without human annotations, including the encoder (MoCo v2 [mocov2]), the decoder, and the saliency detector. Moreover, our framework extracts more semantically reasonable saliency cues to train a saliency detector.

Fig. 2: Pipelines for deep-learning based USOD methods. Existing methods contain three stages, which are denoted as different colors. Yellow: extracting saliency cues; Blue: constructing an auxiliary model; Green: training a saliency detector. Since our framework extracts high-quality saliency cues, it only contains two stages to tackle the USOD task.

Unlike hand-crafted features, which are intuitive to the human vision system, the features extracted by deep networks are harder to interpret, and multiple researchers [yosinski, zhaodiversified, xiaoapp, mahendran] have made efforts to understand them. A few works [cam, gradcam, rcnn, vis] have shown that CNNs pretrained on large-scale data usually produce high activations on some primary objects, yet rarely activate on other non-salient objects and the background within an image. This means that pretrained networks are capable of differentiating salient objects from the background in images. Moreover, the multi-level features extracted by pretrained networks are rich in semantic information. Therefore, they can help us generate more semantically reasonable pseudo labels than hand-crafted features, which only contain low-level information such as colors and edges. To support this point, we transform a ResNet-50 [resnet] pretrained by MoCo v2 [mocov2] to aggregate multi-level feature maps into a single activation map. As shown in Fig. 1, some regions in this activation map are distinctive from other pixels. By refining these activations, we extract high-quality saliency cues that serve as pseudo labels for subsequent steps.

In this work, we propose an efficient two-stage Activation-to-Saliency (A2S) framework for the Unsupervised Salient Object Detection (USOD) task. The proposed A2S framework uses the learned features of pretrained networks, instead of hand-crafted features, to better localize salient objects. In the first stage, multi-level features in a pretrained network are integrated using four auxiliary SE blocks [senet] to produce an activation map for each image. Based on the observation illustrated in Fig. 1, we propose to employ a pixel-wise linear classifier to find suitable regions as saliency cues. Due to the diverse contents of different images, learning a single classifier for all images is sub-optimal. Instead, we design an image-specific classifier, which forms our Adaptive Decision Boundary (ADB). Based on the ADB, we develop a loss function that enlarges the distances between features and their means. Note that our linear classifier is trained by the proposed loss function without manual annotations. In the second stage, we propose a self-rectification learning method to refine these pseudo labels. To achieve rectification, an Online Label Rectifying (OLR) strategy is proposed to update the pseudo labels.

To further reduce the risk of overfitting, we construct a lightweight saliency detector using two novel Residual Attention Modules (RAMs). The proposed RAM enhances the topmost encoder feature using low-level features. Extensive experiments show that the proposed framework achieves state-of-the-art performance among existing USOD methods and is competitive with some recent supervised SOD methods. In addition, our framework takes about 1 hour to train on 3000 images, which is about 30x faster than previous state-of-the-art USOD methods.

In summary, our main contributions are:

  • We propose an efficient Activation-to-Saliency (A2S) framework to address the Unsupervised Salient Object Detection (USOD) problem.

  • We propose an Adaptive Decision Boundary (ADB) to extract high-quality saliency cues from images.

  • We propose an Online Label Rectifying (OLR) strategy to reduce the negative effect of distractors by updating the pseudo labels during the training process.

  • We propose a lightweight saliency detector, which only has an encoder and two additional RAMs. Our detector has fewer parameters than state-of-the-art detectors, largely reducing the risk of overfitting.

Fig. 3: Our framework contains two stages. Two encoders are initialized by the same weights. Blocks with the same colors use the same structure, but their weights are independent.

II Related Works

II-A SOD by Supervised Deep-Learning

In recent research, most supervised DL-based SOD methods [amulet, dss, cag, contour, fcn, poolnet] improved their performance by enhancing a U-shape structure [unet]. Liu et al. [picanet] proposed a pixel-wise contextual attention network to enhance the learned features using context information. Luo et al. [nldf] developed contrast features that subtract each feature from its local average to enhance the features in skip connections. Liu et al. [dhsnet] used a Recurrent Convolution Layer (RCL) to hierarchically and progressively render image details. Zhao et al. [pfa] divided five encoder features into two branches and aggregated the fused features in these branches to produce the final predictions. Wang et al. [srm] fused high-level semantic knowledge and spatially rich information of low-level features from two different encoders to produce more robust predictions. Zhao et al. [gate] designed a novel gated dual-branch structure to build cooperation among different levels of features. In addition, [pagrn, sac, bmp, mlm, PFPN] improved performance by introducing various feature fusion modules.

Many recent works noticed that edge information can help SOD methods produce more accurate boundaries. Feng et al. [afnet] employed contour maps to supervise the edges of saliency predictions. Li et al. [ckt] proposed an alternating structure in which saliency and contour maps are utilized as supervision signals for each other. Zhao et al. [egnet] supervised the shallowest encoder feature using edges generated from images and integrated it with other higher-level encoder features. Zhou et al. [itsd] constructed a two-stream decoder in which contour and saliency information are integrated into each other alternately. Wei et al. [ldf] decoupled salient objects into body and contour maps, and employed three decoders to output these maps as well as the saliency maps.

Although these methods achieved impressive results on SOD benchmarks, they require a significant amount of human-labeled training data, which is expensive to collect.

II-B SOD by Unsupervised Hand-crafted Methods

Conventional SOD methods [dsr, mc, rbd, cssd] extracted saliency cues from images by using different priors and hand-crafted features. Li et al. [dsr] computed dense and sparse reconstruction based on the background templates for each image region, and predicted the pixel-level saliency with the integration of multi-scale reconstruction errors. Jiang et al. [mc] predicted the saliency maps by calculating the distances between boundary superpixels and non-boundary superpixels. Zhu et al. [rbd] proposed a robust background measure to characterize the spatial layout of image regions with respect to image boundaries, and employed an optimization framework to integrate multiple low-level information. Yan et al. [cssd] computed multiple saliency cues from three over-segmented maps and fed them into a tree-structure graphical model to get the final results.

These methods can segment conspicuous regions in images but often fail to capture whole salient objects because hand-crafted features cannot model global semantic information well.

II-C SOD by Unsupervised Deep-Learning

A few works [sbf, mnl, usps] attempted to tackle the USOD task via deep-learning techniques. They learned saliency from multiple noisy saliency cues produced by traditional SOD methods, as shown in Fig. 2, and then refined these saliency cues using semantic information from supervised methods in other related vision tasks, such as object recognition [vgg, resnet] and semantic segmentation [deeplab, cityscape]. Zhang et al. [sbf] learned saliency by using an intra-image fusion stream and an inter-image fusion stream to produce multi-level weights for the noisy saliency cues. Zhang et al. [mnl] modeled the noise of each pixel as a zero-mean Gaussian distribution and reconstructed noisy saliency cues by integrating saliency predictions and randomly sampled noise. Recently, Nguyen et al. [usps] claimed that directly using these noisy saliency cues to train a saliency detector is sub-optimal. Therefore, they first trained multiple refinement networks to improve the quality of these saliency cues. Subsequently, they generated multiple homologous saliency maps by exploiting inter-image consistency between these cues to train a saliency detector. Although this method reported impressive results, training a series of refinement networks greatly reduces its efficiency.

Unlike previous DL-based USOD methods, which still benefit from supervised methods, all components in our framework are trained in a fully unsupervised manner, including the encoder (MoCo v2 [mocov2]), the decoder, and the saliency detector. Moreover, instead of extracting noisy saliency cues using traditional SOD methods, we present a novel perspective: extracting high-quality saliency cues from the learned features of a pretrained network. Our saliency cues are likely to capture salient objects because the high-level features in the encoder encode more semantic information than hand-crafted features. Using our high-quality saliency cues as pseudo labels, we can train robust saliency detectors without the assistance of any auxiliary model.

III Our Approach

In this section, we describe the proposed A2S framework in detail, as shown in Fig. 3. Our framework contains two stages: a stage for saliency cue extraction and a stage for self-rectification learning.

(a) Image (b) GT (c) Activation (d) Ours (e) DSR [dsr] (f) MC [mc] (g) RBD [rbd] (h) HS [cssd]
Fig. 12: Examples of the extracted saliency cues.

III-A Stage 1: Saliency Cue Extraction

Inconsistency between Appearance and Semantics. The goal of the USOD task is to segment whole salient objects in images rather than merely conspicuous regions. However, traditional SOD methods are prone to segmenting regions based on their appearance. Due to the inconsistency between region appearance and object semantics, these methods are unlikely to capture salient objects in some challenging images. For example, in Fig. 12, DSR [dsr] (e) focuses on parts of the salient objects. MC [mc] (f) captures the salient objects but fails to differentiate them from their surroundings. RBD [rbd] (g) and HS [cssd] (h) detect regions with distinctive colors. In the activation map (c), we observe high activations around the regions containing people. This activation map can be utilized to generate high-quality saliency cues (d).

Activation Synthesis. We base our method on a pretrained deep network to generate high-quality pseudo labels. Given a pretrained network (e.g., MoCo v2 [mocov2]), high-level features often contain more semantic information but lose many details due to feature subsampling. Low-level features usually have more activations on texture details but fail to capture global statistics. Therefore, we leverage multi-level features in the proposed framework. Due to the lack of manual annotations, training too many parameters greatly increases the risk of overfitting. Moreover, it may destroy the semantic information learned in the pretrained network and make the network hard to converge, as demonstrated in our experiments. Thus, we add several auxiliary blocks to the pretrained network and train only those blocks. Specifically, we take the outputs of stages 3, 4, and 5 of the pretrained network. Each of these features is processed by a Squeeze-and-Excitation (SE) block to enhance the learned representations, and another SE block integrates the three enhanced features into a fused feature map. We refer to this procedure, which maps an input image to the fused feature map, as our transformed network.

(a) Image (b) GT (c) Activation (d) Otsu [otsu] (e) Mean (f) Median
Fig. 19: Saliency cues generated by different thresholds before training our transformed network.

Adaptive Decision Boundary. The fused feature produced by the transformed network contains multiple feature maps that are stimulated by different conspicuous regions. To gather these regions in each image, we sum the feature maps to generate a single activation map, so that the feature vector at each spatial position is converted to a scalar activation. As shown in Fig. 1, the activation map reveals the potential salient regions in each image. A coarse saliency prediction can be generated by binarizing the activation map using a threshold. There are various strategies for producing a threshold for each image, such as the Otsu algorithm [otsu] and the mean or median of the activation values. For each image, we split all pixels into two groups using each of these strategies. As recommended by [otsu], we consider a larger inter-group variance to indicate a better threshold. The inter-group variance is computed from the probability and average value of each group. The final score of each strategy is obtained by averaging the corresponding inter-group variance across the train and validation subsets of MSRA-B. As shown in Tab. I, the Otsu algorithm reports the largest value among the strategies, which means that it computes a better threshold for binarizing the activation map. However, this strategy is relatively slow because it uses an iterative process to find the optimal threshold for each image. Interestingly, we find that the mean of the activation values reaches a result similar to that of the Otsu algorithm. As shown in Fig. 19, the mean threshold is slightly inferior to the Otsu algorithm but much more efficient. Thus, we propose to use image-specific mean values to binarize the activation map.

Strategy Otsu Mean Median
Variance 0.1408 0.1405 0.1365
TABLE I: Inter-class variance for different strategies.
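The comparison behind Tab. I can be sketched in a few lines. The equation for the inter-group variance did not survive in this copy, so the sketch assumes the standard two-group form from Otsu's method, p0 * p1 * (mu0 - mu1)^2, where p and mu are the probability and average value of each group. The function names, the exhaustive search used for Otsu, and the toy activation values are illustrative assumptions, not the authors' code.

```python
from statistics import mean, median

def inter_group_variance(values, threshold):
    """Assumed form: p0 * p1 * (mu0 - mu1)^2 for a two-group split."""
    low = [v for v in values if v <= threshold]
    high = [v for v in values if v > threshold]
    if not low or not high:
        return 0.0  # a one-sided split separates nothing
    p_low = len(low) / len(values)
    p_high = len(high) / len(values)
    return p_low * p_high * (mean(low) - mean(high)) ** 2

def otsu_threshold(values):
    """Exhaustive search: try every distinct value as the threshold."""
    candidates = sorted(set(values))
    return max(candidates, key=lambda t: inter_group_variance(values, t))

# Toy activation values for one image (hypothetical numbers).
activation = [0.1, 0.2, 0.15, 0.3, 0.9, 0.8, 0.25, 0.85]
scores = {
    "otsu": inter_group_variance(activation, otsu_threshold(activation)),
    "mean": inter_group_variance(activation, mean(activation)),
    "median": inter_group_variance(activation, median(activation)),
}
```

On this toy input the mean threshold happens to produce the same split as the exhaustive search, while the median produces a worse one. The mean needs a single pass over the pixels, whereas the search evaluates one split per candidate threshold, which is the efficiency argument made above.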

In the activation map, the image-specific mean is selected as the threshold that splits all pixels into two groups. Meanwhile, a pixel whose activation lies far from this mean is easy to distinguish. This process is equivalent to finding a decision value for each pixel on a linear decision boundary (Eqn. 3). Since the mean is adaptive to each image, Eqn. 3 is an Adaptive Decision Boundary (ADB) for each image.
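A minimal sketch of the decision value described here, under stated assumptions: Eqn. 3 is not recoverable from this copy, so the code simply takes the channel-summed activation of each pixel minus the image mean, and ignores any normalization the original equation may apply. `decision_values` and the toy feature values are hypothetical.

```python
def decision_values(features):
    """features: list of C channels, each a flat list of per-pixel values."""
    # Sum over channels: one scalar activation per pixel.
    activation = [sum(pixel) for pixel in zip(*features)]
    # Image-specific threshold: the mean activation of this image.
    mu = sum(activation) / len(activation)
    # Signed distance to the boundary; the sign selects the group.
    return [a - mu for a in activation]

# Two channels, four pixels (hypothetical values).
features = [
    [0.0, 1.0, 2.0, 1.0],
    [1.0, 1.0, 4.0, 0.0],
]
theta = decision_values(features)
pseudo = [1 if t > 0 else 0 for t in theta]
```

Because the threshold is recomputed per image, the decision values of every image sum to zero, and the magnitude of each value indicates how easily that pixel is separated from the other group.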

Instead of adjusting the decision boundary to fit fixed features, the parameters of our decision boundary are adaptive to each image. The simplest method is to directly use the decision values of Eqn. 3 to generate pseudo labels. However, we expect features to lie farther from the decision boundary so that they become more distinctive. Thus, we propose a loss function (Eqn. 4) that is built from a sigmoid function, a hyperparameter, and the L1 and L2 distances between the decision values and a reference value. The reference value is set to 0 because it is the decision value of pixels on the decision boundary. By maximizing these distances, the network is trained to extract more distinctive features based on our adaptive decision boundary. The L2 distance encourages the network to focus on distinctive samples far away from the ADB, while the L1 distance prevents the gradients of hard examples from vanishing. Both the fused feature and the auxiliary features are supervised by this loss to encourage the network to learn more robust representations. In each iteration, half of the pixels are randomly dropped in the loss function to alleviate overfitting.

Although the above method can locate salient regions based on the activation map, its predictions have some issues. First, as shown in Fig. 1, our network may output inverted saliency maps for some images. In other words, given the two sets of features with positive and negative decision values, the positive set may be either foreground or background. Settling this issue requires some extra prior knowledge. Specifically, we count the features in each set to compare the areas that the two sets cover. The larger area usually includes the image boundary, scene information, or inconspicuous objects and is thus treated as background. Accordingly, we invert the decision values if the area of the positive set is larger than that of the negative set. After this step, foreground pixels correspond to positive decision values. Second, the boundaries of salient regions are coarse because no pixel-level supervision is available for training. Therefore, we employ denseCRF [densecrf] to refine the boundaries of foreground regions and a median filter to remove outliers. Applying this post-processing to the decision value map of each image yields the final pseudo label of the first stage, which is used to train our saliency detector in the next stage.
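The post-processing can be sketched as follows. This is a simplified illustration rather than the authors' pipeline: the inversion rule follows the area comparison described above, the 3x3 window size of the median filter is an assumption, the denseCRF step is left out entirely, and all function names are hypothetical.

```python
from statistics import median

def invert_if_needed(theta):
    """Flip signs when the positive group covers more pixels than the negative one."""
    flat = [v for row in theta for v in row]
    positives = sum(1 for v in flat if v > 0)
    negatives = sum(1 for v in flat if v < 0)
    if positives > negatives:
        return [[-v for v in row] for row in theta]
    return [row[:] for row in theta]

def median_filter3(mask):
    """3x3 median filter on a binary mask; borders use the neighbours that exist."""
    h, w = len(mask), len(mask[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [mask[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(int(median(window) > 0.5))
        out.append(row)
    return out

def pseudo_label(theta):
    """Decision value map -> binary pseudo label (CRF refinement omitted)."""
    theta = invert_if_needed(theta)
    mask = [[1 if v > 0 else 0 for v in row] for row in theta]
    # A denseCRF boundary refinement would be applied around here.
    return median_filter3(mask)

# The larger (background) group is positive here, so the map gets inverted;
# the single remaining foreground pixel is then removed as an outlier.
label = pseudo_label([[1, 1, 1], [1, -1, 1], [1, 1, 1]])
```

The area prior only encodes the assumption that the background is the larger group, so an image dominated by a very large salient object would be inverted incorrectly under this rule.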

III-B Stage 2: Self-rectification Learning

Saliency Detector. The predicted salient objects in pseudo labels may be inconsistent with manual annotations because they are produced without human interference. Training on such labels is prone to degrading the generalization ability of networks. Therefore, we develop a simple yet effective saliency detector to better integrate hierarchical representations and refine the learned saliency information.

Fig. 20: Internal structure of our RAM. and indicate subtraction, sigmoid, dot product and concatenation, respectively. and can be replaced by and .

In our saliency detector, we also employ the same pretrained network as encoder. In the first stage, we freeze the encoder to preserve the learned semantic information. However, in the second stage, we have pixel-level pseudo labels to train our saliency detector. Therefore, we fine-tune the encoder to produce more precise predictions.

To reduce the memory footprint, we reduce all encoder features to 64 channels using several convolutional blocks. After that, the topmost feature map is selected as the basic feature map because its global information can distinguish coarse semantic regions. The remaining four feature maps are employed as supplemental information to refine the basic feature map at different scales because they contain low-level cues for other important details. Specifically, these four feature maps are evenly divided into a local group and a regional group according to their receptive fields. Each group is combined with the basic feature map via our Residual Attention Module (RAM), as shown in Fig. 20. Our RAM produces an enhanced feature by combining the basic feature map with the two features of a group. Note that all features are upsampled to the size of the largest input. To learn complementary features, we extract the low-level cues by subtraction between the supplemental features and the basic feature map. After that, a convolutional layer and a sigmoid function are used to fuse the extracted low-level cues. Using these cues as attention maps, we strengthen the low-level information through a Hadamard product. We then integrate the basic feature map with the enhanced features. Applying the RAM to the local and regional groups gives two outputs, from which we generate the final saliency map by aggregating values along the channel dimension. Finally, the saliency map is trained with our pseudo labels using the loss function of [basnet], which combines BCE, SSIM, and IoU losses.

Fig. 21: Visualization of the learned saliency maps. Training with our OLR removes some distractors in pseudo labels.

Online Label Rectifying. We observe that some distractors in the pseudo labels may have a negative impact on training robust saliency detectors. A visual illustration of this observation is shown in Fig. 21. In the generated pseudo label, the shadow is considered a salient region because its color is similar to that of the wheels. The detector trained on such labels can distinguish the wheels from the shadow in early epochs (see 1st image in 2nd row). However, it produces many low-confidence predictions for boundary pixels, resulting in low ave-F scores. Moreover, the detector fails to segment both wheels and shadow after more training epochs (see 4th image in 2nd row). To address this issue, an Online Label Rectifying (OLR) strategy is developed to update the pseudo labels using the current saliency predictions (Eqn. 14). The update is indexed by the current training epoch, and the prediction of each epoch, produced by our saliency detector, is supervised by the corresponding pseudo label over the training process. The pseudo labels are initialized with the pseudo labels from the first stage. The hyperparameter of Eqn. 14 is set to 1 for the first two epochs to prevent the pseudo labels from being contaminated by low-quality predictions, and to 0.4 for the remaining training epochs.
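To make the schedule concrete, here is a small sketch of one way such an update could work. The update equation is missing from this copy, so the convex combination below is an assumption, chosen only because a weight of 1 then leaves the labels unchanged, as the text describes for the first two epochs. The function names, the zero-based epoch counter, and the toy values are hypothetical.

```python
def rectify_labels(label, prediction, lam):
    """Assumed update: blend the current pseudo label with the prediction."""
    return [lam * g + (1.0 - lam) * p for g, p in zip(label, prediction)]

def olr_weight(epoch):
    """Weight 1 for the first two epochs (labels stay fixed), 0.4 afterwards."""
    return 1.0 if epoch < 2 else 0.4

label = [1.0, 1.0, 0.0]       # stage-1 pseudo label; 2nd pixel is a distractor
prediction = [0.9, 0.2, 0.1]  # hypothetical detector output, kept fixed here
history = []
for epoch in range(4):
    label = rectify_labels(label, prediction, olr_weight(epoch))
    history.append(label)
```

In this toy run the distractor pixel keeps its label of 1 during the first two epochs and then decays towards the detector's prediction, falling below 0.5 after two rectifying epochs. In real training the prediction would also change from epoch to epoch.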

Method ave-F MAE ave-F MAE ave-F MAE ave-F MAE ave-F MAE ave-F MAE
PiCANet [picanet] 0.885 0.046 0.874 0.055 0.710 0.068 0.804 0.076 0.759 0.041 0.870 0.044
CPD [cpd] 0.917 0.037 0.899 0.038 0.747 0.056 0.831 0.072 0.805 0.043 0.891 0.034
BASNet [basnet] 0.879 0.037 0.896 0.036 0.756 0.056 0.781 0.077 0.791 0.048 0.898 0.033
ITSD [itsd] 0.906 0.036 0.902 0.038 0.758 0.058 0.817 0.066 0.794 0.040 0.899 0.030
MINet [minet] 0.924 0.033 0.903 0.038 0.756 0.055 0.842 0.064 0.828 0.037 0.908 0.028
DSR [dsr] 0.639 0.174 0.723 0.121 0.558 0.137 0.579 0.260 0.512 0.148 0.675 0.143
MC [mc] 0.611 0.204 0.717 0.144 0.529 0.186 0.574 0.272
RBD [rbd] 0.686 0.189 0.751 0.117 0.510 0.201 0.609 0.223 0.508 0.194 0.657 0.178
HS [cssd] 0.623 0.228 0.713 0.161 0.521 0.227 0.595 0.286 0.460 0.258 0.623 0.223
Stage 1 (Ours) 0.840 0.072 0.857 0.053 0.634 0.096 0.750 0.114 0.693 0.080 0.822 0.056
Stage 1 (Ours) 0.851 0.081 0.881 0.050 0.690 0.080 0.753 0.124 0.733 0.073 0.859 0.054
SBF [sbf] 0.787 0.085 0.867 0.058 0.583 0.135 0.680 0.141 0.627 0.105 0.805 0.074
MNL [mnl] 0.878 0.070 0.877 0.056 0.716 0.086 0.842 0.139
USPS [usps] 0.882 0.064 0.901 0.042 0.696 0.077
Stage 2 (Ours) 0.880 0.060 0.886 0.043 0.683 0.082 0.775 0.106 0.714 0.076 0.856 0.048
Stage 2 (Ours) 0.888 0.064 0.902 0.040 0.719 0.069 0.790 0.106 0.750 0.065 0.887 0.042
TABLE II: Comparison with state-of-the-art SOD methods. For a fair comparison with SBF [sbf], MNL [mnl] and USPS [usps], we also report results obtained by the network pretrained with human annotations (original ImageNet [imagenet] labels), denoted as . results are from their papers. Best scores for both supervised and unsupervised methods are in bold.

IV Experiment

IV-A Experiment Settings


All experiments are implemented on a single GTX 1080 Ti GPU using PyTorch [Pytorch]. The unsupervised ResNet-50 [mocov2] is employed as the encoder in both stages. Note that no annotated labels are involved here. The batch size is 8, and all images are resized to a fixed size. For data augmentation, horizontal flipping is employed during training. SGD is used as the optimizer in both stages. The first stage runs for 20 epochs with an initial learning rate of 1, which is decayed by a factor of 0.1 at the 10th and 16th epochs. The second stage runs for 25 epochs with an initial learning rate of 0.005, which is decayed by a factor of 0.1 at the 15th and 20th epochs. Our framework takes about 1 hour to complete all training on 3000 images.
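The two learning-rate schedules described above amount to a step decay, which the following sketch writes out. It assumes epochs are counted from 1 and that the decay takes effect at the start of the listed epoch; the text does not say which convention is used, and `step_lr` is a hypothetical helper rather than a library call.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate for a 1-indexed epoch under step decay."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Stage 1: 20 epochs, initial rate 1, decayed at epochs 10 and 16.
stage1 = [step_lr(e, 1.0, (10, 16)) for e in range(1, 21)]
# Stage 2: 25 epochs, initial rate 0.005, decayed at epochs 15 and 20.
stage2 = [step_lr(e, 0.005, (15, 20)) for e in range(1, 26)]
```

Under these assumptions stage 1 ends at a rate of 0.01 and stage 2 at 0.00005.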

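Metrics. We utilize the ave-F score and the mean absolute error (MAE) as criteria. The F-measure is defined as:

$$F_\beta = \frac{(1+\beta^2)\cdot \text{Precision}\cdot \text{Recall}}{\beta^2\cdot \text{Precision} + \text{Recall}},$$

where $\beta^2$ is set to 0.3 [THUR15K] in general. The ave-F is the average over a set of F-measure scores calculated by changing the positive threshold from 0 to 255. In addition, MAE is obtained by:

$$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| P_i - G_i \right|,$$

where $P_i$ and $G_i$ are the $i$-th pixel in the prediction and ground truth, respectively, and $N$ is the number of pixels.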
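Both metrics are short enough to spell out. The sketch below uses the standard definitions with the weight set to 0.3; the names, the use of `>=` for binarization, and the mapping of thresholds 0 to 255 onto predictions in [0, 1] are assumptions, and published evaluation toolkits may differ in such details.

```python
def f_beta(pred, gt, threshold, beta2=0.3):
    """F-measure of `pred` binarized at `threshold`; `gt` holds 0/1 values."""
    binary = [1 if p >= threshold else 0 for p in pred]
    tp = sum(b * g for b, g in zip(binary, gt))
    precision = tp / sum(binary) if sum(binary) else 0.0
    recall = tp / sum(gt) if sum(gt) else 0.0
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom else 0.0

def ave_f(pred, gt, levels=256):
    """Average F-measure over thresholds 0..255, with pred scaled to [0, 1]."""
    return sum(f_beta(pred, gt, t / 255) for t in range(levels)) / levels

def mae(pred, gt):
    """Mean absolute error between prediction and ground truth."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)

gt = [1, 0, 1, 0]
perfect = [1.0, 0.0, 1.0, 0.0]
```

Even a perfect prediction scores slightly below 1 on `ave_f` in this sketch, because the threshold 0 marks every pixel as salient. Averaging over all thresholds also means that low-confidence predictions lower the ave-F score even when a single well-chosen threshold would segment the object correctly.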

Datasets. Following previous DL-based USOD works [usps, mnl], 3000 images in the train and val subsets of MSRA-B [msra] dataset are employed to train our framework. ECSSD [ecssd], PASCAL-S [pascal-S], HKU-IS [hku-is], DUTS-TE [duts], DUT-O [DUT-OMRON] as well as the test subset of MSRA-B are employed as the test sets. They have 1000, 850, 4447, 5019, 5168, and 2000 images, respectively.

IV-B Main Results

We compare our framework with existing state-of-the-art DL-based USOD methods (SBF [sbf], MNL [mnl] and USPS [usps]) and several traditional methods (DSR [dsr], MC [mc], RBD [rbd] and HS [cssd]). Some supervised SOD methods are also included, namely PiCANet [picanet], CPD [cpd], BASNet [basnet], ITSD [itsd] and MINet [minet]. All results are listed in Tab. II.

Based on this table, we draw four conclusions. First, the saliency cues extracted in the first stage are significantly more accurate than those of traditional SOD methods. Second, our A2S framework achieves competitive results against existing DL-based USOD methods. It is noteworthy that our framework uses an unsupervised encoder, while existing methods take advantage of supervised encoders. To make a fair comparison, we also report results obtained by the network pretrained with human annotations (i.e., original ImageNet labels). As shown in Tab. II, our framework achieves state-of-the-art performance compared to all existing DL-based USOD methods. Third, MNL [mnl] reports a higher ave-F score on the PASCAL-S dataset, but its MAE is significantly worse (higher) than that of our method. Moreover, our framework achieves better performance than MNL on all other test sets. Last, our framework is competitive with many recent supervised methods on some datasets, such as MSRA-B.

Method Input Backbone Saliency cues Train time
SBF [sbf] VGG-16 [mb+, bms, cssd] 3h
MNL [mnl] ResNet-101 [rbd, dsr, mc, cssd] 4h
USPS [usps] ResNet-101 [rbd, dsr, mc, cssd] 30h
Ours ResNet-50 1h
TABLE III: A detailed comparison among DL-based USOD methods. We collect the train time for each method from its paper. Note that all existing methods exclude the time for extracting saliency cues.

The implementation details in Tab. III further demonstrate the effectiveness and efficiency of our framework. Our framework uses a ResNet-50 backbone, while MNL [mnl] and USPS [usps] use a ResNet-101 backbone with a larger input. Moreover, our A2S framework proposes a new method to extract saliency cues based on CNN features, while existing DL-based USOD methods rely on multiple saliency cues from traditional SOD methods. To compare the efficiency of these methods, we collect the training times from their papers. SBF [sbf], MNL [mnl] and USPS [usps] take 3, 4 and 30 hours to finish training, respectively. Note that all of these figures exclude the time for extracting saliency cues. Our framework takes only about 1 hour for the whole process, including about 40 and 20 minutes for the first and second stage, respectively. This comparison shows that our framework is effective and much more efficient than previous USOD methods.

As shown in Fig. 22, compared to existing DL-based USOD methods, our method can better distinguish between salient objects and their surrounding backgrounds. For example, our predictions of the first and fourth images have more distinctive boundaries than previous methods [sbf, usps]. Compared to traditional SOD methods, our method has significant improvements on the quality of saliency predictions. Moreover, our method shows competitive performance on all examples compared to supervised DL-based SOD methods.

Fig. 22: Visual comparison with state-of-the-art SOD methods. Predictions of MNL [mnl] have not been published.
Variants | Description | ave-F | MAE
A0 | Unfrozen encoder | 0.028 | 0.350
A1 | No auxiliary supervision | 0.814 | 0.092
A2 | No dropped pixels | 0.837 | 0.081
A3 | Ours | 0.866 | 0.050
TABLE IV: Ablation study on stage 1. We list the results for different variants of the network in the first stage. All scores are tested on train split of MSRA-B dataset, which is adopted as train set in our experiment.

IV-C Ablation Study on Stage 1

We conduct experiments to validate the design of our stage 1 by comparing it against three variants: 1) A0: unfrozen encoder; 2) A1: no auxiliary supervision; 3) A2: no pixel dropout in the loss function. We denote our full method as A3. Results are listed in Tab. IV.

In summary, A3 reports the best results among these variants. A0 fails to converge because the unsupervised training process destroys the semantic information learned by the encoder. A1 supervises only the fused feature, so the network struggles to learn more distinctive representations from images. Since we do not have precise labels for training, A2, which uses all pixels for training, suffers much more from overfitting.

IV-D Ablation Study on Hyperparameters

In this section, we conduct a series of experiments to study the effect of the hyperparameters in Eqn. 4 and Eqn. 14.

Metric | 0 | 0.1 | 0.2 | 0.3
ave-F | 0.845 | 0.866 | 0.862 | 0.859
MAE | 0.078 | 0.050 | 0.052 | 0.055
TABLE V: Ablation study on the hyperparameter in Eqn. 4. We list scores of the generated pseudo labels on the train split of the MSRA-B dataset.

Dataset | Metric | 0.2 | 0.3 | 0.4 | 0.5 | 0.6
ECSSD | ave-F | 0.870 | 0.876 | 0.880 | 0.879 | 0.873
ECSSD | MAE | 0.066 | 0.063 | 0.060 | 0.062 | 0.065
MSRA-B | ave-F | 0.878 | 0.882 | 0.886 | 0.884 | 0.880
MSRA-B | MAE | 0.048 | 0.046 | 0.043 | 0.045 | 0.048
TABLE VI: Ablation study on the hyperparameter in Eqn. 14. We list scores on ECSSD and the test split of the MSRA-B dataset.
Fig. 23: MAE curves of different decision boundaries.

The hyperparameter in Eqn. 4 is introduced to adjust the gradients for samples that are close to the decision boundary. However, a large value may cause the network to pay too much attention to these pixels and ignore other easy samples. We train the network in the first stage with this hyperparameter set to 0, 0.1, 0.2 and 0.3. As shown in Tab. V, a value of 0 gives the worst results (ave-F 0.845, MAE 0.078) because the gradients for hard samples vanish. A value of 0.1 obtains the best results among all settings (ave-F 0.866, MAE 0.050). With larger values, the network pays more attention to hard samples, which weakens performance: at 0.3, ave-F on the train set drops to 0.859 and MAE rises to 0.055.

The hyperparameter in Eqn. 14 is designed to control the update process of the pseudo labels. The results for different values are shown in Tab. VI. A large value means a relatively slow updating rate for the pseudo labels: distractors in the pseudo labels cannot be eliminated quickly, which causes inferior results. Meanwhile, a small value means a fast updating rate for the pseudo labels: the network is then supervised mostly by its own previous estimations and fails to capture more saliency information from the original pseudo labels.
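The trade-off described above can be illustrated with a simple moving-average update. This is only a sketch: the exact form of Eqn. 14 is not reproduced in this section, so the blending rule, the function name `rectify_labels` and the parameter name `rate` are assumptions chosen to match the described behavior (a large value keeps the stored labels longer, a small value follows the network predictions quickly).

```python
import numpy as np


def rectify_labels(labels, preds, rate):
    """Blend stored pseudo labels with the current network predictions.

    ASSUMPTION: a moving-average rule, not necessarily the paper's Eqn. 14.
    `rate` in [0, 1] is the weight kept on the stored labels, so a large
    `rate` means slow updates and a small `rate` means fast updates.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    labels = np.asarray(labels, dtype=np.float64)
    preds = np.asarray(preds, dtype=np.float64)
    return rate * labels + (1.0 - rate) * preds


# A distractor pixel wrongly labelled 1.0 that the detector keeps predicting as 0.0.
for rate in (0.2, 0.4, 0.6):
    label = np.array([1.0])
    for _ in range(3):
        label = rectify_labels(label, [0.0], rate)
    print(f"rate={rate}: label after 3 updates = {label[0]:.3f}")
```

Under this assumed rule, the wrong label decays as `rate` raised to the number of updates, which shows why a large value removes distractors slowly, while a very small value discards the original pseudo labels almost immediately.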

IV-E Design of ADB

In this section, we analyze different designs of our decision boundaries. We first review the quantities used in these decision boundaries: 1) the mean feature over the train set; 2) the mean feature of each image, which usually contains information about salient objects; 3) learnable weights (e.g., a fully-connected layer).

Based on these definitions, we define these decision boundaries as: 1) DB1: . This variant uses a fixed bias term to divide all pixels into two groups; 2) DB2: . This variant calculates the similarity between and each feature. Saliency regions are expected to have higher similarities; 3) DB3: . The dot product between and its difference with . 4) DB4: . This variant uses the learned weights for the difference between and ; 5) DB5: . This variant is the proposed ADB in our framework. Their MAE curves during the training process are shown in Fig. 23.

Overall, our ADB (DB5) reports the best and most stable performance among all competitors. DB1 and DB2 fail to converge due to the unbalanced ratio between the two groups: they are prone to assigning all pixels to the same class. DB3 is unstable because background pixels greatly disturb the semantic cues in the mean feature. Since the network in the first stage is trained by unsupervised learning, our supervision signals are significantly coarser than human annotations. Therefore, to improve the performance of DB4, more constraints need to be imposed on the learning process of the learnable weights.

Setting | Network | # Params (M) | ECSSD ave-F | ECSSD MAE | MSRA-B ave-F | MSRA-B MAE
w/o OLR | MINet | 164.43 | 0.859 | 0.067 | 0.867 | 0.050
w/o OLR | Ours* | 28.25 | 0.860 | 0.068 | 0.865 | 0.051
w/o OLR | Ours | 28.40 | 0.865 | 0.066 | 0.872 | 0.047
w/ OLR | MINet | 164.43 | 0.863 | 0.065 | 0.872 | 0.048
w/ OLR | Ours* | 28.25 | 0.872 | 0.063 | 0.876 | 0.047
w/ OLR | Ours | 28.40 | 0.880 | 0.060 | 0.886 | 0.043
TABLE VII: Results of training MINet and our saliency detector with the generated pseudo labels. * means the proposed RAM is replaced by a convolutional layer.
Fig. 24: The learned saliency on training samples (columns, left to right: Image, GT, Label, MINet, Ours, Ours). “Label” indicates the generated pseudo labels in our first stage. MINet [minet] overfits some distractors in pseudo labels, while our detectors eliminate these regions.

IV-F Our Detector vs. State-of-the-art

We conduct an experiment to compare our detector with MINet [minet], which has reported state-of-the-art performance on various SOD benchmarks. We train these networks with our pseudo labels and report results in Tab. VII.

Regardless of whether OLR is used, our detector with RAMs outperforms MINet on both datasets. MINet is likely to overfit some distractors in the pseudo labels due to its large number of trainable parameters (164.43M versus 28.40M for our detector), as shown in Fig. 24. The lightweight structure of our detector alleviates the overfitting problem and thus results in better performance. In addition, RAMs further improve the performance of our saliency detector.

Fig. 25: Visualization of the learned features in RAMs. As a reference, we show image, ground truth and final prediction in the first column (from top to bottom).

IV-G Visualization of RAM

We visualize the learned features in the proposed RAMs in Fig. 25. Activations from the low-level features contain many low-level cues, such as points and plane edges, while responses from the high-level features show more region-wise patterns. After multiplication, the resulting feature maps integrate this multi-level information well and activate some regions of salient objects. By summing all of these feature maps, the network aggregates these regions and composes the final prediction.
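The multiply-then-sum behavior described above can be sketched in a few lines. This is a simplified illustration, not the full RAM: the learned layers and the residual connection are omitted, both inputs are assumed to already share the same shape `(C, H, W)`, and the function name `fuse_and_aggregate` is hypothetical.

```python
import numpy as np


def fuse_and_aggregate(low, high):
    """Multiply low- and high-level feature maps, then sum over channels.

    ASSUMPTION: `low` and `high` are already projected and resized to the
    same (C, H, W) shape; the real module contains learned layers that are
    left out of this sketch.
    """
    low = np.asarray(low, dtype=np.float64)
    high = np.asarray(high, dtype=np.float64)
    if low.shape != high.shape or low.ndim != 3:
        raise ValueError("expected two arrays of identical shape (C, H, W)")
    fused = low * high            # element-wise gating of one level by the other
    saliency = fused.sum(axis=0)  # aggregate all channels into a single map
    return fused, saliency
```

A channel only contributes to the final map where both its low-level and high-level responses are active, which is one way to read the region-wise activations shown in Fig. 25.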

Method | Tr. time (h) | ECSSD ave-F | ECSSD MAE | MSRA-B ave-F | MSRA-B MAE
HMA [usps] | 5.33 | 0.843 | 0.078 | 0.876 | 0.049
OLR (Ours) | 0.36 | 0.880 | 0.060 | 0.886 | 0.043
TABLE VIII: Comparison between HMA and our OLR.

IV-H OLR vs. HMA

Similar to the proposed OLR, the Historical Moving Averages (HMA) strategy was proposed in previous work [usps]. Our OLR differs from HMA in several respects. First, HMA generates pseudo labels from low-quality saliency cues, while our OLR aims at rectifying the pseudo labels online. Second, HMA uses CRF to refine the network predictions before updating, while our OLR uses these predictions to eliminate distractors introduced by CRF. Last, HMA is much slower than our OLR due to the CRF operation.

To support these points, we use HMA and the proposed OLR to train our detector. As shown in Tab. VIII, the proposed OLR achieves significant improvements compared with HMA. Moreover, training with the proposed OLR is about 15× faster than with HMA (0.36 versus 5.33 hours).
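The speed-up follows directly from the training times in Tab. VIII. The short check below makes the arithmetic explicit; the helper name `speedup` is hypothetical.

```python
def speedup(baseline_hours, ours_hours):
    """Ratio of training times: how many times faster `ours` is than the baseline."""
    if ours_hours <= 0:
        raise ValueError("training time must be positive")
    return baseline_hours / ours_hours


# Training times in hours, taken from Tab. VIII.
hma_hours, olr_hours = 5.33, 0.36
print(f"OLR is {speedup(hma_hours, olr_hours):.1f}x faster than HMA")
```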

V Conclusion

In this work, we propose a two-stage Activation-to-Saliency (A2S) framework for the Unsupervised Salient Object Detection (USOD) task. In the first stage, we transform a pretrained network to generate a single activation map from each image. An Adaptive Decision Boundary (ADB) is proposed to enlarge the feature distances, enabling the generation of high-quality pseudo labels. In the second stage, we construct a saliency detector using an encoder and two Residual Attention Modules (RAMs) to alleviate the overfitting problem. In addition, an Online Label Rectifying (OLR) strategy updates the pseudo labels to reduce the negative impact of distractors. Extensive experiments on several SOD benchmarks demonstrate the effectiveness and efficiency of the proposed framework.