Self-Supervised Difference Detection for Weakly-Supervised Semantic Segmentation

11/04/2019, by Wataru Shimoda, et al.

To minimize the annotation costs associated with training semantic segmentation models, researchers have extensively investigated weakly-supervised segmentation approaches. In current weakly-supervised segmentation methods, the most widely adopted approach is based on visualization. However, visualization results are not generally equal to semantic segmentation. Therefore, to perform accurate semantic segmentation under the weakly supervised condition, it is necessary to consider mapping functions that convert visualization results into semantic segmentation. As such mapping functions, the conditional random field and iterative re-training using the outputs of a segmentation model are usually used. However, these methods do not always guarantee improvements in accuracy; if they are applied iteratively, the accuracy eventually stops improving or even decreases. In this paper, to make the most of such mapping functions, we assume that their results contain noise, and we improve accuracy by removing that noise. To achieve our aim, we propose a self-supervised difference detection module, which estimates noise in the results of the mapping functions by predicting the difference between the segmentation masks before and after the mapping. We verified the effectiveness of the proposed method on the PASCAL Visual Object Classes 2012 dataset, achieving 64.9% mIoU on the val set and 65.5% on the test set. Both results set a new state-of-the-art under the same weakly supervised setting.

1 Introduction

Semantic segmentation is a promising image recognition technology that enables the detailed analysis of images for various practical applications. However, semantic segmentation methods require training data with pixel-level annotation, which is costly to create. On the other hand, image-level annotation is much easier to obtain than pixel-level annotation. In recent years, various weakly-supervised semantic segmentation (hereinafter WSS) methods that require only image-level annotation have been proposed to reduce annotation costs. However, there is still a large performance gap between fully-supervised and weakly-supervised methods.

In weakly-supervised segmentation methods, visualization-based approaches [45, 38, 48] have been widely adopted. The visualization results highlight the regions that contributed to the classification, and we can roughly estimate the regions of the target objects from them. Class Activation Map (CAM) [48] is a standard method for visualizing classification results. However, visualization results do not always match actual segmentation results; therefore, it is usually necessary to consider a mapping from the visualization results to the semantic segmentation in weakly-supervised segmentation. The Conditional Random Field (CRF) [19] is widely used as such a mapping function: it optimizes a probability distribution so that it fits region boundaries, using color and position information as features. The iterative re-training approach proposed by Wei et al. [43] is another versatile way to improve weakly supervised segmentation results. In this method, pseudo pixel-level labels are generated under weakly supervised conditions and used to train a segmentation model; new pseudo pixel-level labels are then generated from the outputs of the trained segmentation model, and a new segmentation model is re-trained with them. Wei et al. [43] showed that repeating this process absorbs outliers and gradually improves the accuracy. These methods can be regarded as mapping functions that bring their inputs closer to the true segmentation. However, the mapping functions of these methods [19, 43] do not guarantee any improvement in the accuracy of the semantic segmentation; therefore, the mapping results contain noise. In this paper, we treat the outputs of such mapping functions as supervision containing noise, and we propose a learning method that is robust to this noise.

In this paper, we denote the information used as the input of the mapping functions as knowledge, and the supervision containing noise as advice. The supervision used in fully supervised learning, which allows a one-to-one mapping, is a teacher. We assume that advice provides supervision that includes both correct and incorrect information. To make effective use of the information obtained from advice, it is necessary to select the useful parts. In this paper, we regard the regions where knowledge and advice disagree as difference. Since the difference between two segmentation masks can be obtained by simple processing without annotation, training a model that predicts the difference is a kind of self-supervised learning. Self-supervised learning uses a pretext task as a form of indirect supervision; notable examples include colorization [5] and predicting patch orderings [6].

Inferring the difference between knowledge and advice from knowledge alone amounts to predicting the advisor's advice in advance. Some advice is predictable and some is not: certain advice can be inferred easily because many similar samples appear during training. We assume that advice contains a sufficient amount of correct information, so predictable advice can be considered useful. Based on this idea, we propose a method for selecting information by accepting the parts of the advice that can be predicted from the inference results of difference detection. Fig. 1 shows the concept of the proposed approach.

In this paper, we demonstrate that the proposed Self-Supervised Difference Detection (SSDD) module can be used in both the seed generation stage and the training stage of fully supervised segmentation. In the seed generation stage, we refine the CRF results of pixel-level semantic affinity (PSA) [2] by using the SSDD module. In the training stage, we introduce two SSDD modules inside the training loop of a fully supervised segmentation network. In the experiments, we demonstrate the effectiveness of the SSDD modules in both stages. In particular, the SSDD modules greatly boosted the performance of WSS on the PASCAL Visual Object Classes (VOC) 2012 dataset and achieved a new state-of-the-art. To summarize, our contributions are as follows:

  • We propose an SSDD module, which estimates the noise in the mapping functions of weakly supervised segmentation and selects useful information.

  • We show that the SSDD modules can be effectively applied to both the seed generation stage and the training stage of a fully supervised segmentation model.

  • We obtained the best results on the PASCAL VOC 2012 dataset with 64.9% mean IoU on the val set and 65.5% on the test set.

Figure 1: The concept of the proposed approach. (a) We denote the inputs of the mapping functions as knowledge and the outputs as advice. (b) The proposed difference detection network (DD-Net) estimates the difference between knowledge and advice. (c) Within the difference regions, the advice is divided into true advice and false advice. We assume that if the amount of true advice is larger than the amount of false advice, that is, if the false advice consists of outliers, then the predictable advice has a strong correlation with the true advice.

2 Related Works

In this section, we review related research on CNN-based WSS methods by classifying them into several types.

Visualization   In early works on CNN-based WSS, visualization-based methods were studied. The pixels that contribute to the classification are correlated with the regions of the target objects; therefore, visualization methods can serve as segmentation methods under weakly supervised settings. Zeiler et al. [46] showed that the responses obtained by back-propagating through CNN models trained for classification tasks highlight the region of a target object in an image. Simonyan et al. [38] used such derivatives as GrabCut seeds and extended the visualization method to a WSS method. It was also demonstrated that the regions of multi-class objects can be captured by differences in class-specific derivatives [15, 37]. Oquab et al. [26] visualized attention regions from the activations of the forward pass and trained a classification model on large input images using global max pooling. Following this approach, several derived methods employing global pooling were proposed [30, 48, 18]. In particular, CAM [48] has been widely adopted in recent weakly supervised segmentation methods.
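As a concrete reference point, CAM computes a class-specific heatmap by weighting the last convolutional feature map with the classification weights of the target class. The sketch below, in PyTorch, is only a minimal illustration of that computation; the tensor shapes, the ReLU, and the normalization are our assumptions rather than the exact pipeline used later in this paper.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx):
    """CAM-style heatmap from the feature map before global average pooling.

    features : (C, H, W) last convolutional feature map.
    fc_weight: (num_classes, C) weights of the final classification layer.
    class_idx: index of the target class.
    """
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], features)
    cam = F.relu(cam)                      # keep positive evidence only
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)        # normalize to [0, 1]
```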

Region refinement for WSS results using CRF   In general, segmentation results based on a fully convolutional network (FCN) [24] tend to have ambiguous outlines. A CRF [19] can refine these ambiguous outlines using low-level features such as pixel colors. Papandreou et al. [27] and Pathak et al. [28] adopted the CRF as a post-processing method for region refinement and demonstrated its effectiveness for WSS. Kolesnikov et al. [18] proposed using the CRF during the training of a semantic segmentation model. Ahn et al. [2] proposed a method that learns pixel-level similarity from CRF results and applies random-walk-based region refinement, which achieved the best results on the PASCAL VOC 2012 dataset. The CRF plays an important role in improving the accuracy of weakly supervised segmentation, and various studies have employed it for refining coarse segmentation masks [37, 34, 33, 17, 43, 42, 11, 36]. However, the CRF does not guarantee any improvement in the mean intersection over union (IoU) score, and it often degrades the segmentation masks and the scores. Therefore, we focus on preventing a segmentation mask from being degraded by applying the CRF. We estimate the confidence maps of both the initial mask and the mask after CRF post-processing, and we integrate both masks based on the estimated confidence maps.
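For reference, dense CRF post-processing of the kind used throughout this paper is commonly run with the pydensecrf implementation of [19]. The sketch below shows one typical invocation; the kernel weights are illustrative defaults, not the settings used in our experiments, and the function name is ours.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, n_iters=10):
    """Refine class probability maps with a fully connected CRF.

    image: (H, W, 3) uint8 RGB image; probs: (num_classes, H, W) softmax
    probabilities. The kernel weights below are illustrative, not the paper's.
    """
    h, w = image.shape[:2]
    d = dcrf.DenseCRF2D(w, h, probs.shape[0])
    d.setUnaryEnergy(unary_from_softmax(probs))
    d.addPairwiseGaussian(sxy=3, compat=3)                # smoothness kernel
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = np.array(d.inference(n_iters)).reshape(probs.shape[0], h, w)
    return q.argmax(axis=0)                               # refined label map
```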

Training a fully supervised segmentation model under a weakly supervised setting   Several researchers have trained a fully supervised semantic segmentation (hereinafter FSS) model under a weakly supervised setting. First, Pathak et al. [29] proposed MIL-FCN, which trains a fully supervised semantic segmentation model with a global max-pooling loss using only image-level labels. Wei et al. [43] proposed a novel approach that trains an FSS model using pixel-level labels obtained from saliency maps [13]. This method is simple, and the obtained results are impressive. Wei et al. [43] also demonstrated that the outputs of the trained semantic segmentation model can be used as new pixel-level annotation for re-training, and the re-trained FSS model achieved better results than the original model.

Generating pixel-level labels during training of an FSS model   The constrained convolutional neural network (CCNN) [28] and EM-Adapt [27] generate pixel-level labels during training using class labels and the outputs of the segmentation model. In both studies, similar constraints were imposed when generating pixel-level labels to obtain better results: the ratios of foreground and background in an image are set, and pixel-level labels are generated within those ratios. Wei et al. [42] proposed online prohibitive segmentation learning (PSL). They generate pixel-level seed labels of training samples before the first training of an FSS model and re-generate pixel-level labels using the outputs of the segmentation model and the classification results. The semantic segmentation model is trained with both sets of pixel-level labels, and they achieved good performance without costly manual pixel-level annotation; the pixel-level seed labels are expected to play the role of a constraint. Huang et al. [12] proposed deep seeded region growing (DSRG), a method that expands the seed regions during training. Before training, the authors prepare pixel-level seed labels that leave unconsidered pixels unlabeled. In this work, we propose a new constraint for generating pixel-level labels during the training of the FSS model. We train an FSS model and the difference detection model in an end-to-end manner, and we interpolate the pixel-level seed labels only in regions that differ in the newly generated pixel-level labels and that can also be predicted by the difference detection model.

WSS methods using additional information   A few recent weakly supervised approaches achieve high accuracy by using annotations beyond image-level labels. Bounding box annotation has been proposed for WSS [27] and was shown to boost performance substantially. As weaker additional annotation, point annotation and scribble annotation have also been proposed [3]. Saleh et al. [34] proposed an approach in which the generated initial masks are checked with minimal additional human supervision. Motion segmentation of videos has also been used as additional training information for weakly supervised segmentation [39, 10]. There are also reports that web images help improve weakly supervised segmentation accuracy [30, 43, 16, 36]. Recently, fully supervised saliency methods have been widely used for detecting background regions, and several researchers have reported that this approach can substantially boost performance [35, 42, 44, 12, 11, 41, 4]. Region proposal methods trained with fully supervised foreground masks, such as MCG [31], have also been used in [30, 32]. Fan et al. [7] used instance-level saliency maps for WSS. Saliency is useful in many situations; however, a fully supervised saliency model is affected by the domain of its training data, which may have negative effects in some applications, so WSS methods without saliency maps are also beneficial. In this paper, we do not use any additional information: we use only PASCAL VOC images with image-level labels and CNN models pre-trained on ImageNet images and their image-level labels.

3 Method

In the weakly supervised setting, there is no supervision for the mapping functions of segmentation; therefore, it is necessary to design a mapping that brings the input closer to better segmentation results using methods that incorporate human knowledge. In this paper, we propose a method for selecting useful information from the results of such mapping functions by treating the results as supervision containing noise. We define the inputs of the mapping functions as knowledge and the mapped results as advice. We predict the regions of difference between knowledge and advice, and we call this the difference detection task. Using the inference results, we select which parts of the advice to adopt.

Figure 2: Difference Detection Network (DD-Net).

3.1 Difference detection network

In this section, we formulate the difference detection task. In the proposed method, we predict the difference between knowledge and advice. Here, we define the segmentation mask of knowledge as $m^K$, the segmentation mask of advice as $m^A$, and their difference as $d$:

$$d_u = \begin{cases} 0 & (m^K_u = m^A_u) \\ 1 & (m^K_u \neq m^A_u) \end{cases} \qquad (1)$$

where $u$ indexes the $N$ pixel locations of the mask. Next, we define a network for deducing the difference. We use feature maps extracted from a trained CNN to assist the difference detection; in particular, we use high-level features $e^h(x)$ and low-level features $e^l(x)$ extracted from a backbone network such as ResNet, where $x$ is an input image and $e(\cdot;\theta_e)$ is an embedding function parameterized by $\theta_e$. As shown in Fig. 3, the difference map of an input mask $m$ is predicted by the difference detection network (DD-Net) as $\hat{d} = D(e^l(x), e^h(x), m; \theta_d)$, where $m$ is given as a one-hot mask with the same number of channels as the number of target classes, $\theta_d$ denotes the parameters of the DD-Net, and each $\hat{d}_u$ is a per-pixel difference probability. The architecture of DD-Net is shown in Fig. 2; it consists of three convolutional layers and one residual block, with three inputs and one output. DD-Net takes either a raw mask or a processed mask as input and outputs a difference mask. The network is trained with the following loss:

$$L_{\mathrm{diff}}(m) = \frac{1}{|S|} \sum_{u \in S} J_{\mathrm{bce}}\!\left(\hat{d}_u(m),\, d_u\right) \qquad (2)$$

where $S$ is the set of pixel locations of the input space and $J_{\mathrm{bce}}$ is a function that returns the binary cross-entropy loss.

Note that the parameters of the embedding function $\theta_e$ are not updated by the optimization of this loss. The training of DD-Net is self-supervised; therefore, neither special annotation nor additional data are needed.
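Because the supervision for DD-Net comes directly from the two masks, no annotation is required. The following PyTorch sketch shows how the difference target of Eq. (1) and the loss of Eq. (2) can be formed, assuming masks stored as integer label maps and a DD-Net output already squashed to [0, 1]; the function names are ours, not those of the released code.

```python
import torch
import torch.nn.functional as F

def difference_target(mask_k, mask_a):
    """Binary difference map between knowledge and advice masks (Eq. (1)).

    mask_k, mask_a: (H, W) integer label maps. Returns a (H, W) float map
    that is 1 where the two masks disagree and 0 elsewhere.
    """
    return (mask_k != mask_a).float()

def dd_loss(pred_diff, mask_k, mask_a):
    """Self-supervised DD-Net loss: per-pixel binary cross entropy (Eq. (2)).

    pred_diff: (H, W) predicted difference probabilities in [0, 1].
    """
    target = difference_target(mask_k, mask_a)
    return F.binary_cross_entropy(pred_diff, target)
```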

3.2 Self-supervised difference detection module

In this section, we describe the details of the SSDD module shown in Fig. 3, which integrates two masks adaptively according to confidence maps. We denote the set of advice that is true within the difference regions as $A_T$ and the set of advice that is false as $A_F$. The purpose of the method is to extract as many samples of $A_T$ as possible from the entire set of advice. Let $\hat{d}(m^K)$ be the inference result of the difference, that is, the prediction of the advice from the given knowledge. The inference results are probability values between 0 and 1, and these values vary; the variation is caused by differences in the difficulty of inference, which is strongly influenced by the presence of similar patterns during training. If there is a sufficient amount of advice that is true rather than false, that is, if $|A_T| > |A_F|$, then larger values indicate that the corresponding advice most likely belongs to $A_T$. However, for values near the decision boundary, it is not clear whether the advice belongs to $A_T$ or not, and the appropriate boundary probably differs from sample to sample. Therefore, it is difficult to select good advice directly from the magnitude of $\hat{d}(m^K)$. To alleviate this problem, we also use the inference result for the advice mask, $\hat{d}(m^A)$. Although advice varies considerably in its distribution, this variation is generally smaller than that of knowledge; therefore, inferring the difference from the advice is assumed to be easier than inferring it from the knowledge. In this paper, we use the inference result computed from the advice to evaluate the difficulty of inference for each sample, and we use it as a sample-wise threshold. Specifically, we calculate the confidence score of the advice from the viewpoint of how close $\hat{d}_u(m^K)$ is to $\hat{d}_u(m^A)$. The confidence score is defined by the following expression:

(3)

Here, the bias term is a hyperparameter that acts as a threshold for the selection based on difference detection, and it also includes an enhancement for the categories present in the image-level labels of the input image (see Appendix A.2). The refined mask obtained from $m^K$ and $m^A$ is defined by the following expression:

(4)

In the following, we denote this processing flow for generating a new segmentation mask as an SSDD module:

(5)
Figure 3: Overview of the SSDD module. The figure on the left shows the training of the DD-Net, and the figure on the right shows the integration processing that uses the results of difference detection.
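To make the integration step concrete, the sketch below shows one way an SSDD module can combine the two masks from the two DD-Net predictions and a bias term. It is a minimal sketch: the confidence score here is simply the gap between the two predicted difference maps plus the bias, and this form and its sign convention are assumptions based on the description above, not a verbatim transcription of Eq. (3) and Eq. (4).

```python
import torch

def ssdd_integrate(mask_k, mask_a, diff_from_k, diff_from_a, bias=0.0):
    """Integrate a knowledge mask and an advice mask with DD-Net outputs.

    mask_k, mask_a : (H, W) label maps (knowledge / advice).
    diff_from_k    : (H, W) difference predicted by DD-Net from the knowledge mask.
    diff_from_a    : (H, W) difference predicted by DD-Net from the advice mask.
    bias           : scalar or (H, W) offset; in the paper it also boosts the
                     classes present in the image-level labels.

    The confidence and its sign convention are assumptions (see the text).
    """
    confidence = diff_from_k - diff_from_a + bias
    accept_advice = confidence > 0          # pixels where the advice is trusted
    return torch.where(accept_advice, mask_a, mask_k)
```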

4 Introducing SSDD modules into the processing flow of WSS

In this section, we explain how to use SSDD modules in the processing flow of WSS. The proposed method can be adapted to various cases by treating the inputs of a mapping function as knowledge and its outputs as advice. The processing flow adopted in this paper consists of two stages: a seed generation stage with static region refinement and a training stage of a segmentation model with dynamic region refinement. In the first stage, we apply the proposed method with the results of PSA as knowledge and their CRF results as advice (Sec. 4.1). In the second stage, we apply the proposed method with the results of the first stage (Sec. 4.1) as knowledge and the outputs of the segmentation models trained with those masks as advice (Sec. 4.2).

4.1 Seed mask generation stage with static region refinement

PSA [2] is a method that propagates label responses to nearby areas that belong to the same semantic entity. Although PSA employs a CRF for refining the segmentation masks, the CRF often fails to improve them and in fact sometimes degrades them. In this section, we refine the CRF outputs of PSA by using the proposed SSDD module. We illustrate the processing flow of the first seed generation stage in Fig. 4. Note that, for simplicity, the figure omits the input image given to the SSDD module.

For an input image, PSA produces class probability maps, and applying the CRF to them produces refined probability maps. We obtain a segmentation mask from each set of probability maps by taking the argmax over the present labels, including the background category. The PSA mask serves as knowledge and the CRF mask as advice, and we compute the loss of the DD-Net for this mask pair as follows:

(6)

The proposed method is not effective when either or both of the segmentation masks lack the correct labels. Such cases are not only meaningless for the proposed refinement approach, but they may also harm the training of the DD-Net. We identify these bad training samples by a simple check based on the difference in the number of class-specific pixels (Appendix A.1), and we exclude them from training.

In this work, we also train the embedding function by training a segmentation branch with the generated masks, so as to obtain good representations for the high-level and low-level feature inputs:

(7)
(8)

where the first sum runs over the set of pixel locations assigned to each class in the mask, the probability term is the conditional probability of observing a label at a location, and the outer sum runs over the set of class labels. The parameters of the embedding functions and the parameters of the segmentation branch are trained with this loss, independently of the parameters of the DD-Net. The final loss function for the static region refinement using difference detection is as follows:

(9)

After training, we integrate the two masks and obtain the integrated seed masks using the SSDD module with the trained parameters as follows:

(10)
Figure 4: Processing flow at the seed mask generation stage with static region refinement.
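A compact way to picture the training objective of this stage is as the sum of the self-supervised DD-Net loss on the (PSA, CRF) mask pair and the cross-entropy loss of the segmentation branch. The sketch below is written under those assumptions; the choice of targets for the segmentation branch and the equal weighting of the terms are our choices, not necessarily those of Eqs. (6)-(9).

```python
import torch.nn.functional as F

def static_refinement_loss(dd_net, feats_low, feats_high,
                           mask_psa, mask_crf, seg_logits):
    """Combined objective for the seed-generation stage (a sketch).

    mask_psa, mask_crf: (B, H, W) integer label maps (knowledge / advice).
    seg_logits        : (B, C, H, W) outputs of the segmentation branch.
    dd_net            : callable taking (low feats, high feats, one-hot mask)
                        and returning a (B, H, W) difference probability map.
    """
    num_classes = seg_logits.shape[1]
    onehot = lambda m: F.one_hot(m, num_classes).permute(0, 3, 1, 2).float()
    target = (mask_psa != mask_crf).float()                      # Eq. (1)

    # DD-Net is trained with both masks given as input in turn (cf. Eq. (6)).
    loss_dd = F.binary_cross_entropy(dd_net(feats_low, feats_high, onehot(mask_psa)), target) \
            + F.binary_cross_entropy(dd_net(feats_low, feats_high, onehot(mask_crf)), target)

    # The segmentation branch is trained against both masks (cf. Eqs. (7)-(8)).
    loss_seg = F.cross_entropy(seg_logits, mask_psa) + F.cross_entropy(seg_logits, mask_crf)
    return loss_dd + loss_seg
```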

4.2 Training stage of a fully supervised segmentation model with a dynamic region refinement

When we train a fully supervised semantic segmentation model with pixel-level seed labels, the accuracy of the seed labels directly affects the segmentation performance. A performance gain can therefore be expected if the seed labels are replaced with better pixel-level labels during training. In this study, we propose a novel approach that constrains the interpolation of the seed labels during the training of a segmentation model. The idea of the constraint is to limit the interpolation of the seed labels to regions where the difference between the newly generated pixel-level labels and the seed labels can be predicted by difference detection.

Figure 5: Illustration of the processing flow for the dynamic region refinement. (“SegNet” does not represent any specific network but represents any kind of network for fully supervised semantic segmentation.)

In practice, we interpolate the pixel-level seed labels in two steps at each iteration, as shown in Fig. 5. Note that “SegNet” in the figure does not represent a specific segmentation network; it represents any fully supervised segmentation network. In the first step, for an input image, we obtain the outputs of the segmentation model and their CRF results, and we obtain a segmentation mask from each set of probability maps by taking the argmax over the present labels, including the background category. Then, we obtain refined pixel-level labels by applying the proposed refinement method to this pair, with the segmentation outputs as knowledge and the CRF results as advice. In the second step, we apply the proposed method again, this time to the seed labels as knowledge and the mask obtained in the first step as advice, which yields a further refined mask. We generate this mask at each iteration and train the semantic segmentation model with it as follows:

(11)

The loss of the DD-Net for this mask pair is as follows:

(12)

In the second stage, we also exclude bad samples (as done in Sec. 4.1) based on the change ratio of pixels, because the proposed method is not effective if the input segmentation masks do not contain correct regions.

We now explain how to train the DD-Net for the second mask pair. The masks in this pair depend on the outputs of the segmentation model; therefore, if the learning of the segmentation model falls into a local minimum, the masks become meaningless, with all pixels assigned to the background or to a single foreground class. In this case, the inference results of the difference detection are also constant, and the confidence score of Eq. (3) reduces to the bias term alone. To escape from this local minimum, we create a new branch of the segmentation model and use it for learning the difference detection. Suppose an auxiliary mask is obtained from the outputs of this new segmentation branch. In the training of difference detection, we train the network on the differences of the two mask pairs formed with the auxiliary mask, as follows:

(13)

If the auxiliary branch outputs masks that are halfway between the seed labels and the current refined masks, this replacement of the training samples lets the difference detection escape from the degenerate situation, and its inference results predict regions that correlate with the difference between the seed labels and the refined masks. We train the parameters of the auxiliary branch with the following loss so that its outputs lie halfway between the two:

(14)

where a hyperparameter controls the mixing ratio of the two targets.
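One plausible realization of this "halfway" objective is a convex combination of two cross-entropy terms, one against the seed labels and one against the current refined labels; the sketch below assumes that form, which may differ from the exact Eq. (14).

```python
import torch.nn.functional as F

def halfway_branch_loss(aux_logits, seed_mask, refined_mask, lam=0.5):
    """Loss for the auxiliary segmentation branch (cf. Eq. (14), assumed form).

    aux_logits  : (B, C, H, W) outputs of the auxiliary branch.
    seed_mask   : (B, H, W) seed labels from the first stage.
    refined_mask: (B, H, W) refined labels generated at the current iteration.
    """
    return lam * F.cross_entropy(aux_logits, seed_mask) \
        + (1.0 - lam) * F.cross_entropy(aux_logits, refined_mask)
```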

The final loss function of the proposed dynamic region refinement method is calculated as follows:

(15)
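Putting the pieces of this section together, the control flow of the dynamic region refinement at each iteration can be sketched as follows. Here `ssdd` stands for the mask integration of Sec. 3.2, `apply_crf` for dense-CRF post-processing, and `seg_model` for any fully supervised segmentation network ("SegNet" in Fig. 5); all names and the exact interfaces are ours.

```python
def dynamic_refinement_step(image, seed_mask, seg_model, ssdd, apply_crf):
    """Two-step mask refinement performed at each training iteration (Sec. 4.2).

    A sketch of the control flow only; the callables are placeholders for the
    components described in the text.
    """
    probs = seg_model(image).softmax(dim=1)           # current predictions
    mask_seg = probs.argmax(dim=1)                    # knowledge for step 1
    mask_crf = apply_crf(image, probs)                # advice for step 1

    mask_step1 = ssdd(mask_seg, mask_crf)             # first refinement
    mask_final = ssdd(seed_mask, mask_step1)          # second refinement vs. seed

    return mask_final                                 # used as the training target
```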

5 Experiments

We evaluated the proposed methods on the PASCAL VOC 2012 dataset. The PASCAL VOC 2012 segmentation dataset has 1,464 training images, 1,449 validation images, and 1,456 test images with pixel-level and image-level labels for 20 object classes. Following [30, 27, 18], we also used the augmented PASCAL VOC training data provided by [9], which increases the number of training images to 10,582. For evaluation, we used the mean intersection over union (mIoU), the official evaluation metric of the PASCAL VOC segmentation task. For the mIoU on the val and test sets, we used the official evaluation server. We compare the best performance of our method with the state-of-the-art methods on both the val and test sets.
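For completeness, mIoU is computed per class from a confusion matrix accumulated over all pixels and then averaged over the 21 classes (20 objects plus background). The sketch below shows the standard computation; the helper names are ours, and the official test-set scores are of course produced by the evaluation server, not by this code.

```python
import numpy as np

def update_confusion(conf_matrix, gt, pred, num_classes=21):
    """Accumulate one image (flattened integer label arrays) into the confusion matrix."""
    valid = (gt >= 0) & (gt < num_classes)            # ignore void pixels (label 255)
    idx = num_classes * gt[valid] + pred[valid]
    conf_matrix += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return conf_matrix

def mean_iou(conf_matrix):
    """Mean IoU from a (num_classes, num_classes) confusion matrix.

    conf_matrix[i, j] counts pixels of ground-truth class i predicted as class j.
    """
    tp = np.diag(conf_matrix)
    union = conf_matrix.sum(axis=0) + conf_matrix.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1)
    return iou.mean()
```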

5.1 Implementation details

Our experiments are heavily based on the previous research of [2]. To generate the PSA results, we used the implementation and trained parameters made publicly available by the authors, followed the methodology of [2], and set the hyperparameters that gave the best performance. For the CRF parameters, we used the default settings provided by [19]. For the semantic segmentation model, we used a ResNet-38 model with almost the same architecture as that in [2]. The only difference is the last upsampling rate: in the PSA paper the authors set the upsampling rate to 8, while we set it to 2 to reduce the computational cost of the CRF. The input image size was 448 for the training and test images, and the output feature map size before upsampling was 56. In the DD-Net, we used the features obtained from the segmentation model before the last layer as the high-level features and the features obtained before the second pooling layer as the low-level features; these feature maps were resized to 112 × 112 by simple linear interpolation. We initialized the segmentation models with parameters trained on the PASCAL VOC images and their image-level labels, starting from an ImageNet pre-trained model, both also provided in [2]. The code provided by [2] did not include training and test code for the segmentation models, so we implemented our own. In the original PSA paper, the authors optimized the segmentation models with Adam; however, the performance was unstable in our re-implementation and several settings were unclear. Therefore, we used SGD for training all the networks: we set the initial learning rate to 1e-3 (1e-2 when initializing without the pre-trained model) and decreased it with a cosine LR ramp-down [25]. For the static region refinement, we trained the network with a batch size of 16 for 10 epochs; for the dynamic region refinement, with a batch size of 8 for 30 epochs. For the data augmentation and inference techniques, we carefully followed the methodology of [2]. We implemented the proposed method in PyTorch, and all the networks were trained on four NVIDIA Titan X Pascal GPUs. We will release the results of the proposed method and the training code at https://github.com/shimoda-uec/ssdd.
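As an illustration of the optimizer setting described above, the sketch below pairs SGD with a cosine learning-rate ramp-down in PyTorch. The exact schedule of [25] (SGDR, with warm restarts) may differ from a single cosine decay, and the momentum and weight-decay values here are illustrative, not the values used in our experiments.

```python
import torch

def train_with_cosine_sgd(model, loader, criterion, epochs, lr=1e-3):
    """SGD with a cosine learning-rate ramp-down (a sketch).

    `model`, `loader`, and `criterion` are placeholders for the segmentation
    network, the pseudo-label data loader, and the loss described in the text.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for images, pseudo_labels in loader:      # pseudo labels from the SSDD pipeline
            optimizer.zero_grad()
            loss = criterion(model(images), pseudo_labels)
            loss.backward()
            optimizer.step()
        scheduler.step()                          # decay once per epoch
    return model
```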

5.2 Analysis of static region refinement

In the proposed method, we used the fully connected CRF [19] with the same kernel parameter settings as those used for PSA [2]. To examine the relationship between the CRF parameters and the results, we varied the weights of the kernel potentials and evaluated the accuracy. Fig. 6 compares the proposed static region refinement with PSA [2] and its CRF results on the training set. As the kernel weights are weakened, the gap between the CRF-only and the SSDD+CRF results shrinks, and so does the benefit of the proposed method; nevertheless, the proposed method consistently yields higher accuracy. The optimal weights differ from image to image, and searching for them per image would be difficult. We consider that the proposed method improves upon the CRF by correcting its partial failures.

Fig. 7 shows the difference detection results and the corresponding refined segmentation masks. The fourth and fifth rows of Fig. 7 show typical failure cases of the proposed method. The regions of small objects tend to vanish under the CRF, and the DD-Net also learns this tendency, which causes the proposed refinement to fail. In the fifth row, both input segmentation masks fail to provide a usable segmentation; in such cases, the proposed method is also not effective.

Figure 6: mIoU of the seed masks of the training images for different CRF parameter values, with CRF only and with SSDD and CRF.
Figure 7: Each row shows (a) input images, (b) raw PSA segmentation masks, (c) difference detection maps of (b), (d) CRF masks of (b), (e) difference detection maps of (d), (f) refined segmentation masks by the proposed method, and (g) ground truth masks.
Methods Bg Aero Bike Bird Boat Bottle Bus Car Cat Chair Cow Table Dog Horse Motor Person Plant Sheep Sofa Train Tv mIoU
PSA [2] 88.2 68.2 30.6 81.1 49.6 61.0 77.8 66.1 75.1 29.0 66.0 40.2 80.4 62.0 70.4 73.7 42.5 70.7 42.6 68.1 51.6 61.7
SSDD 89.0 62.5 28.9 83.7 52.9 59.5 77.6 73.7 87.0 34.0 83.7 47.6 84.1 77.0 73.9 69.6 29.8 84.0 43.2 68.0 53.4 64.9
Gain +0.8 -5.7 -1.7 +2.6 +3.3 -1.5 -0.2 +7.6 +11.9 +5.0 +17.7 +7.4 +3.7 +15.0 +3.5 -4.1 -12.7 +13.3 +0.6 -0.1 +1.8 +3.2
Table 1: Per-class results on the PASCAL VOC 2012 val set.

5.3 Analysis of the whole proposed method

We denote the dynamic region refinement as “SSDD” in all the tables. The SSDD scores are reported with CRF post-processing, using the default CRF parameter values from the authors' public implementation; we also used these parameters for the CRF during training.

Comparison with PSA   Table 1 compares the dynamic region refinement method with PSA. We observe that the proposed method outperforms PSA by a 3.2-point margin. This clearly demonstrates the effectiveness of interpolating the seed labels under the proposed difference-detection constraint. The accuracy is greatly improved compared with the static region refinement because end-to-end learning of the segmentation model increases the amount of good advice.

Table 1 also shows the per-class gains of the proposed method over PSA. We obtain gains of over 10 points on the cat, cow, horse, and sheep classes; interestingly, all the classes with large gains belong to the animal category. In contrast, for the potted plant, airplane, and person classes, it was hard to improve the segmentation masks with the proposed method. The proposed method relies on the precondition that true advice outnumbers false advice. When this precondition is satisfied, the accuracy of a class improves; when it is not, the accuracy does not improve or even decreases.

Fig. 8 shows example results of our re-implementation of PSA, the static region refinement, and the dynamic region refinement. The dynamic region refinement gives more accurate predictions of object location and boundaries. The static region refinement results are the outputs of a segmentation model re-trained with the masks obtained under the default CRF parameters in Fig. 6. Note that we show results before the CRF for a detailed comparison.

Method Val Test
FCN-MIL [29] (ICLR 2015) 25.7 24.9
CCNN [28] (ICCV 2015) 35.3 35.6
EM-Adapt [27] (ICCV 2015) 38.2 39.6
DCSM [37] (ECCV 2016) 44.1 45.1
BFBP [34] (ECCV 2016) 46.6 48.0
SEC [18] (ECCV 2016) 50.7 51.7
CBTS [33] (CVPR 2017) 52.8 53.7
TPL [17] (ICCV 2017) 53.1 53.8
MEFF [8] (CVPR 2018) - 55.6
PSA [2] (CVPR 2018) 61.7 63.7
IRN [1] (CVPR 2019) 63.5 64.8
SSDD (ours, ICCV 2019) 64.9 65.5
Table 2: Comparison with the WSS methods without additional supervision (mIoU, %).
Method Additional supervision Val Test
MIL-seg [30] (CVPR 2015) Saliency mask + ImageNet images 42.0 40.6
MCNN [39] (ICCV 2015) Web videos 38.1 39.8
AFF [32] (ECCV 2016) Saliency mask 54.3 55.5
STC [43] (PAMI 2017) Saliency mask + Web images 49.8 51.2
Oh et al. [35] (CVPR 2017) Saliency mask 55.7 56.7
AE-PSL [42] (CVPR 2017) Saliency mask 55.0 55.7
Hong et al. [10] (CVPR 2017) Web videos 58.1 58.7
WebS-i2 [16] (CVPR 2017) Web images 53.4 55.3
DCSP [4] (BMVC 2017) Saliency mask 60.8 61.9
GAIN [22] (CVPR 2018) Saliency mask 55.3 56.8
MDC [44] (CVPR 2018) Saliency mask 60.4 60.8
MCOF [41] (CVPR 2018) Saliency mask 60.3 61.2
DSRG [12] (CVPR 2018) Saliency mask 61.4 63.2
Shen et al. [36] (CVPR 2018) Web images 63.0 63.9
SeeNet [11] (NIPS 2018) Saliency mask 63.1 62.8
AISI [7] (ECCV 2018) Instance saliency mask 63.6 64.5
FickleNet [20] (CVPR 2019) Saliency mask 64.9 65.3
DSRG+EP [40] (ICCV 2019) Saliency mask 61.5 62.7
AttnBN [23] (ICCV 2019) Saliency mask + Single-label images 62.1 63.0
Zeng et al. [47] (ICCV 2019) Saliency mask 63.3 64.3
OAA+ [14] (ICCV 2019) Saliency mask 65.2 66.4
Lee et al. [21] (ICCV 2019) Web videos 66.5 67.4
SSDD (ours, ICCV 2019) - 64.9 65.5
Table 3: Comparison of the WSS methods with additional supervision (mIoU, %).

Comparison with the state-of-the-art methods   Table 2 shows the results of the proposed method and recent weakly supervised segmentation methods that do not use additional supervision, on the PASCAL VOC 2012 validation and test data. Our method achieves the highest score among all existing methods that use the same type of supervision [28, 27, 37, 34, 18, 17, 33, 8, 2]. The proposed method outperforms the recent works MEFF and TPL by large margins, and it also outperforms the current state-of-the-art method [2]. These results clearly indicate the effectiveness of the proposed method.

Table 3 compares the proposed method with weakly supervised segmentation methods that employ relatively cheap additional information. The proposed method also outperforms nearly all of the listed methods, including SeeNet [11], DSRG [12], MDC [44], GAIN [22], and MCOF [41], which employ fully supervised saliency methods. In addition, the score of the proposed method is better than that of AISI [7], which uses instance-level saliency maps. Note that AISI achieved 64.5% on the val set and 65.6% on the test set when using an additional 24,000 ImageNet images for training. The score of the proposed method is also higher than that of Shen et al. [36], which uses 76.7k web images for training. A completely fair comparison is not possible because of differences in the network models, augmentation techniques, number of training epochs, and so on. Nevertheless, the proposed method demonstrates comparable or better performance without any additional training information.

6 Conclusions

In this paper, we proposed a novel method that refines a segmentation mask from a pair of segmentation masks taken before and after a refinement process such as the CRF, by using the proposed SSDD module. We demonstrated that the proposed method can be used effectively in two stages: static region refinement in the seed generation stage and dynamic region refinement in the training stage. In the first stage, we refined the CRF results of PSA [2] by using an SSDD module. In the second stage, we refined the generated semantic segmentation masks by using a fully supervised segmentation model and the CRF during training. We demonstrated that the three SSDD modules greatly boost the performance of WSS and achieve new state-of-the-art results on the PASCAL VOC 2012 dataset among weakly supervised methods under the same supervision setting.

Acknowledgements   This work was supported by JSPS KAKENHI Grant Numbers 17J10261, 15H05915, 17H01745, 17H06100 and 19H04929.

Figure 8: Segmentation examples of results on PASCAL VOC 2012.

References

  • [1] J. Ahn, S. Cho, and S. Kwak (2019) Weakly supervised learning of instance segmentation with inter-pixel relations. In CVPR.
  • [2] J. Ahn and S. Kwak (2018) Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In CVPR.
  • [3] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei (2016) What’s the point: semantic segmentation with point supervision. In ECCV.
  • [4] A. Chaudhry, P. K. Dokania, and P. H. S. Torr (2017) Discovering class-specific pixels for weakly-supervised semantic segmentation. In BMVC.
  • [5] Z. Cheng, Q. Yang, and B. Sheng (2015) Deep colorization. In ICCV.
  • [6] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In ICCV.
  • [7] R. Fan, Q. Hou, M. Cheng, G. Yu, R. R. Martin, and S. Hu (2018) Associating inter-image salient instances for weakly supervised semantic segmentation. In ECCV.
  • [8] W. Ge, S. Yang, and Y. Yu (2018) Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In CVPR.
  • [9] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik (2014) Simultaneous detection and segmentation. In ECCV.
  • [10] S. Hong, D. Yeo, S. Kwak, H. Lee, and B. Han (2017) Weakly supervised semantic segmentation using web-crawled videos. In CVPR.
  • [11] Q. Hou, P. Jiang, Y. Wei, and M. Cheng (2018) Self-erasing network for integral object attention. In NIPS.
  • [12] Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang (2018) Weakly-supervised semantic segmentation network with deep seeded region growing. In CVPR.
  • [13] H. Jiang, Z. Yuan, M. Cheng, Y. Gong, N. Zheng, and J. Wang (2013) Salient object detection: a discriminative regional feature integration approach. In CVPR.
  • [14] P. Jiang, Q. Hou, Y. Cao, M. Cheng, Y. Wei, and H. Xiong (2019) Integral object mining via online attention accumulation. In ICCV.
  • [15] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff (2016) Top-down neural attention by excitation backprop. In ECCV.
  • [16] B. Jin, M. V. O. Segovia, and S. Susstrunk (2017) Webly supervised semantic segmentation. In CVPR.
  • [17] D. Kim, D. Cho, D. Yoo, and I. S. Kweon (2017) Two-phase learning for weakly supervised object localization. In ICCV.
  • [18] A. Kolesnikov and C. H. Lampert (2016) Seed, expand and constrain: three principles for weakly-supervised image segmentation. In ECCV.
  • [19] P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS.
  • [20] J. Lee, E. Kim, S. Lee, J. Lee, and S. Yoon (2019) FickleNet: weakly and semi-supervised semantic image segmentation. In CVPR.
  • [21] J. Lee, E. Kim, S. Lee, J. Lee, and S. Yoon (2019) Frame-to-frame aggregation of active regions in web videos for weakly supervised semantic segmentation. In ICCV.
  • [22] K. Li, Z. Wu, K. Peng, J. Ernst, and Y. Fu (2018) Tell me where to look: guided attention inference network. In CVPR.
  • [23] K. Li, Y. Zhang, K. Li, Y. Li, and Y. Fu (2019) Attention bridging network for knowledge transfer. In ICCV.
  • [24] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR.
  • [25] I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In ICLR.
  • [26] M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2014) Learning and transferring mid-level image representations using convolutional neural networks. In CVPR.
  • [27] G. Papandreou, L. Chen, K. Murphy, and A. L. Yuille (2015) Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV.
  • [28] D. Pathak, P. Krähenbühl, and T. Darrell (2015) Constrained convolutional neural networks for weakly supervised segmentation. In ICCV.
  • [29] D. Pathak, E. Shelhamer, J. Long, and T. Darrell (2015) Fully convolutional multi-class multiple instance learning. In ICLR.
  • [30] P. O. Pinheiro and R. Collobert (2015) From image-level to pixel-level labeling with convolutional networks. In CVPR.
  • [31] J. Pont-Tuset, P. Arbeláez, J. T. Barron, F. Marques, and J. Malik (2014) Multiscale combinatorial grouping. In CVPR.
  • [32] X. Qi, Z. Liu, J. Shi, H. Zhao, and J. Jia (2016) Augmented feedback in semantic segmentation under image level supervision. In ECCV.
  • [33] A. Roy and S. Todorovic (2017) Combining bottom-up, top-down, and smoothness cues for weakly supervised image segmentation. In CVPR.
  • [34] F. Saleh, M. S. A. Akbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez (2016) Built-in foreground/background prior for weakly-supervised semantic segmentation. In ECCV.
  • [35] S. J. Oh, R. Benenson, A. Khoreva, Z. Akata, and M. Fritz (2017) Exploiting saliency for object segmentation from image level labels. In CVPR.
  • [36] T. Shen, G. Lin, C. Shen, and I. Reid (2018) Bootstrapping the performance of webly supervised semantic segmentation. In CVPR.
  • [37] W. Shimoda and K. Yanai (2016) Distinct class saliency maps for weakly supervised semantic segmentation. In ECCV.
  • [38] K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. In ICLR Workshop.
  • [39] P. Tokmakov, K. Alahari, and C. Schmid (2016) Weakly-supervised semantic segmentation using motion cues. In ECCV.
  • [40] W. Wan, J. Chen, T. Li, Y. Huang, J. Tian, C. Yu, and Y. Xue (2019) Information entropy based feature pooling for convolutional neural networks. In ICCV.
  • [41] X. Wang, S. You, X. Li, and H. Ma (2018) Weakly-supervised semantic segmentation by iteratively mining common object features. In CVPR.
  • [42] Y. Wei, J. Feng, X. Liang, M. Cheng, Y. Zhao, and S. Yan (2017) Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In CVPR.
  • [43] Y. Wei, X. Liang, Y. Chen, X. Shen, M. Cheng, J. Feng, Y. Zhao, and S. Yan (2017) STC: a simple to complex framework for weakly-supervised semantic segmentation. IEEE Trans. on PAMI.
  • [44] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang (2018) Revisiting dilated convolution: a simple approach for weakly- and semi-supervised semantic segmentation. In CVPR.
  • [45] M. D. Zeiler and R. Fergus (2011) Adaptive deconvolutional networks for mid and high level feature learning. In ICCV.
  • [46] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In ECCV.
  • [47] Y. Zeng, Y. Zhuge, H. Lu, and L. Zhang (2019) Joint learning of saliency detection and weakly supervised semantic segmentation. In ICCV.
  • [48] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In CVPR.

A.1 Details of the simple decision

In the proposed method, we select advice based on the inference results of difference detection: the confidence score compares the difference predicted from the knowledge mask with the difference predicted from the advice mask, and when this gap is large enough we ignore the advice. Therefore, if the difference detection task is too easy, the predicted values for false advice become close to those for true advice, and the proposed method does not work effectively. In particular, if the inference results of the difference detection are constant, we cannot distinguish whether the advice belongs to the set of true values or the set of false values. Therefore, we identify typical failure cases of the advice and exclude them from the training samples, so that the two predictions differ strongly for bad advice. Concretely, when the number of changed pixels in any class of the mask is obviously large, we assume that the advice has failed. We define the bad training samples as the mask pairs for difference detection that satisfy the following condition:

(16)

where the class set is the set of image-level labels of the input image. We set the threshold to 0.5 empirically.
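The rejection rule above can be sketched as follows. For each class present in the image-level labels, the class pixel counts before and after the mapping are compared, and the pair is rejected when the relative change exceeds the threshold (0.5). The exact form of the ratio in Eq. (16) is an assumption, and the function name is ours.

```python
import numpy as np

def is_bad_pair(mask_before, mask_after, image_labels, thresh=0.5):
    """Reject a mask pair when the mapping changed a class region too drastically.

    mask_before, mask_after: integer label maps; image_labels: iterable of
    class indices present in the image-level labels (cf. Eq. (16), assumed form).
    """
    for c in image_labels:
        before = np.sum(mask_before == c)
        after = np.sum(mask_after == c)
        if before == 0 and after == 0:
            continue
        change = abs(int(after) - int(before)) / max(before, after)
        if change > thresh:
            return True
    return False
```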

A.2 Details of the bias in Eq.(3)

In Eq. (3), we use a bias term, which is a kind of hyperparameter. In this section, we discuss this bias. We define it as follows:

(17)

where the bias is composed of two terms: one is a bias for the difference between knowledge and advice, and the other is a bias for the class category. When the number of changed pixels in a class of the mask is obviously large, the advice for that class is assumed to have failed, and the class-category bias prioritizes the label of that class over the results of the difference detection. We determined the values of both terms by grid search.

A.3 Values of hyperparameters

We explore good hyperparameters by grid search and verify their effect by changing their values and measuring the mean IoU scores. Table A-1 shows the hyperparameter values and the corresponding mean IoU scores. The first two hyperparameters are the bias values used in Eq. (17); the mean IoU is maximized at 0.4 for the first bias value and at 1.0 for the second. We also set the bias for the missing categories, and the same setting achieved the maximum mean IoU; the class biases for the missing categories are expected to make the training more robust. In addition, we verify the effect of the hyperparameter that balances the loss terms in Eq. (11). Although we had expected this coefficient to affect the performance, it was not critical for the mean IoU; the balanced setting of 0.5 gave the best score.

Bias value 1 (Eq. (17)) 0.0 0.1 0.2 0.3 0.4 0.5
mIoU 62.2 63.9 64.6 64.2 64.9 62.7
Bias value 2 (Eq. (17)) 0.0 0.5 1.0 1.5 2.0
mIoU 64.3 63.0 64.9 64.5 63.7
Loss coefficient (Eq. (11)) 1.0 0.75 0.5 0.25 0.0
mIoU 63.1 64.4 64.9 64.3 63.2
Table A-1: Experimental results with different hyperparameter values.

A.4 Detailed comparison with existing works on the PASCAL VOC 2012 val and test sets

methods bg aero bike bird boat bottle bus car cat chair cow table dog horse motor person plant sheep sofa train tv mIoU

MIL-FCN [29] - - - - - - - - - - - - - - - - - - - - - 24.9
CCNN [28] 68.5 25.5 18.0 25.4 20.2 36.3 46.8 47.1 48.0 15.8 37.9 21.0 44.5 34.5 46.2 40.7 30.4 36.3 22.2 38.8 36.9 35.3
EM-Adapt [27] - - - - - - - - - - - - - - - - - - - - - 38.2
DCSM [37] 76.7 45.1 24.6 40.8 23.0 34.8 61.0 51.9 52.4 15.5 45.9 32.7 54.9 48.6 57.4 51.8 38.2 55.4 32.2 42.6 39.6 44.1
BFBP [34] 79.2 60.1 20.4 50.7 41.2 46.3 62.6 49.2 62.3 13.3 49.7 38.1 58.4 49.0 57.0 48.2 27.8 55.1 29.6 54.6 26.6 46.6
SEC [18] 82.4 62.9 26.4 61.6 27.6 38.1 66.6 62.7 75.2 22.1 53.5 28.3 65.8 57.8 62.3 52.5 32.5 62.6 32.1 45.4 45.3 50.7
CBTS [33] 85.8 65.2 29.4 63.8 31.2 37.2 69.6 64.3 76.2 21.4 56.3 29.8 68.2 60.6 66.2 55.8 30.8 66.1 34.9 48.8 47.1 52.8
TPL [17] 82.8 62.2 23.1 65.8 21.1 43.1 71.1 66.2 76.1 21.3 59.6 35.1 70.2 58.8 62.3 66.1 35.8 69.9 33.4 45.9 45.6 53.1
MEFF [8] - - - - - - - - - - - - - - - - - - - - - -
PSA [2] 88.2 68.2 30.6 81.1 49.6 61.0 77.8 66.1 75.1 29.0 66.0 40.2 80.4 62.0 70.4 73.7 42.5 70.7 42.6 68.1 51.6 61.7
SSDD (ours) 89.0 62.5 28.9 83.7 52.9 59.5 77.6 73.7 87.0 34.0 83.7 47.6 84.1 77.0 73.9 69.6 29.8 84.0 43.2 68.0 53.4 64.9
Table A-2: Results on PASCAL VOC 2012 val set without additional supervision.

Table A-3: Results on PASCAL VOC 2012 test set without additional supervision.
methods bg aero bike bird boat bottle bus car cat chair cow table dog horse motor person plant sheep sofa train tv mIoU

MIL-FCN [29] - - - - - - - - - - - - - - - - - - - - - 25.7
CCNN [28] 68.5 25.5 18.0 25.4 20.2 36.3 46.8 47.1 48.0 15.8 37.9 21.0 44.5 34.5 46.2 40.7 30.4 36.3 22.2 38.8 36.9 35.3
EM-Adapt [27] - - - - - - - - - - - - - - - - - - - - - 39.6
DCSM [37] 78.1 43.8 26.3 49.8 19.5 40.3 61.6 53.9 52.7 13.7 47.3 34.8 50.3 48.9 69.0 49.7 38.4 57.1 34.0 38.0 40.0 45.1
BFBP [34] 80.3 57.5 24.1 66.9 31.7 43.0 67.5 48.6 56.7 12.6 50.9 42.6 59.4 52.9 65.0 44.8 41.3 51.1 33.7 44.4 33.2 48.0
SEC [18] 83.5 56.4 28.5 64.1 23.6 46.5 70.6 58.5 71.3 23.2 54.0 28.0 68.1 62.1 70.0 55.0 38.4 58.0 39.9 38.4 48.3 51.7
CBTS [33] 85.7 58.8 30.5 67.6 24.7 44.7 74.8 61.8 73.7 22.9 57.4 27.5 71.3 64.8 72.4 57.3 37.0 60.4 42.8 42.2 50.6 53.7
TPL [17] 83.4 62.2 26.4 71.8 18.2 49.5 66.5 63.8 73.4 19.0 56.6 35.7 69.3 61.3 71.7 69.2 39.1 66.3 44.8 35.9 45.5 53.8
MEFF [8] 86.6 72.0 30.6 68.0 44.8 46.2 73.4 56.6 73.0 18.9 63.3 32.0 70.1 72.2 68.2 56.1 34.5 67.5 29.6 60.2 43.6 55.6
PSA [2] 89.1 70.6 31.6 77.2 42.2 68.9 79.1 66.5 74.9 29.6 68.7 56.1 82.1 64.8 78.6 73.5 50.8 70.7 47.7 63.9 51.1 63.7
SSDD (ours) 89.5 71.8 31.4 79.3 47.3 64.2 79.9 74.6 84.9 30.8 73.5 58.2 82.7 73.4 76.4 69.9 37.4 80.5 54.5 65.7 50.3 65.5
Table A-4: Results on PASCAL VOC 2012 val set with additional supervision.
methods info type bg aero bike bird boat bottle bus car cat chair cow table dog horse motor person plant sheep sofa train tv mIoU

MIL-seg [30] S 79.6 50.2 21.6 40.6 34.9 40.5 45.9 51.5 60.6 12.6 51.2 11.6 56.8 52.9 44.8 42.7 31.2 55.4 21.5 38.8 36.9 42.0
MCNN [39] WV 77.5 47.9 17.2 39.4 28.0 25.6 52.7 47.0 57.8 10.4 38.0 24.3 49.9 40.8 48.2 42.0 21.6 35.2 19.6 52.5 24.7 38.1
AFF [32] S - - - - - - - - - - - - - - - - - - - - - 54.3
STC [43] S 84.5 68.0 19.5 60.5 42.5 44.8 68.4 64.0 64.8 14.5 52.0 22.8 58.0 55.3 57.8 60.5 40.6 56.7 23.0 57.1 31.2 49.8
Oh et al. [35] S - - - - - - - - - - - - - - - - - - - - - 55.7
AE-PSL [42] S 83.4 71.1 30.5 72.9 41.6 55.9 63.1 60.2 74.0 18.0 66.5 32.4 71.7 56.3 64.8 52.4 37.4 69.1 31.4 58.9 43.9 55.0
Hong et al. [10] WV 87.0 69.3 32.2 70.2 31.2 58.4 73.6 68.5 76.5 26.8 63.8 29.1 73.5 69.5 66.5 70.4 46.8 72.1 27.3 57.4 50.2 58.1
WebS-i2 [16] WI 84.3 65.3 27.4 65.4 53.9 46.3 70.1 69.8 79.4 13.8 61.1 17.4 73.8 58.1 57.8 56.2 35.7 66.5 22.0 50.1 46.2 53.4
DCSP [4] S 88.9 77.7 31.3 73.2 59.8 71.0 79.2 74.5 80.0 15.1 73.3 10.2 76.1 72.2 69.1 72.1 39.9 73.9 14.6 70.3 53.1 60.8
GAIN [22] S - - - - - - - - - - - - - - - - - - - - - 56.8
MDC [44] S 89.5 85.6 34.6 75.8 61.9 65.8 67.1 73.3 80.2 15.1 69.9 8.1 75.0 68.4 70.9 71.5 32.6 74.9 24.8 73.2 50.8 60.4
MCOF [41] S 87.0 78.4 29.4 68.0 44.0 67.3 80.3 74.1 82.2 21.1 70.7 28.2 73.2 71.5 67.2 53.0 47.7 74.5 32.4 71.0 45.8 60.3
DSRG [12] S - - - - - - - - - - - - - - - - - - - - - 61.4
Shen et al. [36] WI 86.8 71.2 32.4 77.0 24.4 69.8 85.3 71.9 86.5 27.6 78.9 40.7 78.5 79.1 72.7 73.1 49.6 74.8 36.1 48.1 59.2 63.0
SeeNet [11] S - - - - - - - - - - - - - - - - - - - - - 63.1
AISI [7] IS - - - - - - - - - - - - - - - - - - - - - 64.5
SSDD (ours) - 89.0 62.5 28.9 83.7 52.9 59.5 77.6 73.7 87.0 34.0 83.7 47.6 84.1 77.0 73.9 69.6 29.8 84.0 43.2 68.0 53.4 64.9

(S: saliency mask, WV: web videos, WI: web images, IS: instance saliency mask.)
