Saliency Guided Self-attention Network for Weakly-supervised Semantic Segmentation

10/12/2019 ∙ by Qi Yao, et al.

Weakly supervised semantic segmentation (WSSS) using only image-level labels greatly reduces the annotation cost and has therefore attracted considerable research interest. However, its performance is still inferior to that of fully supervised counterparts. To narrow this gap, we propose a saliency guided self-attention network (SGAN) to address the WSSS problem. The self-attention mechanism can capture rich and extensive contextual information, but it may also mis-spread attention to unexpected regions. To make this mechanism work effectively under weak supervision, we integrate class-agnostic saliency priors into the self-attention mechanism, which prevents attention on discriminative parts from mis-spreading to the background. Meanwhile, we utilize class-specific attention cues as additional supervision for SGAN, which reduces the mis-spread of attention across regions belonging to different foreground categories. The proposed approach produces dense and accurate localization cues, which in turn boost segmentation performance. Experiments on the PASCAL VOC 2012 dataset show that the proposed approach outperforms all other state-of-the-art methods.






Code repository: Saliency Guided Self-attention Network for Weakly and Semi-supervised Semantic Segmentation (IEEE Access)

I Introduction

Semantic segmentation aims to predict a semantic label for each pixel in an image. Based upon the fundamental Fully Convolutional Networks (FCNs) [22], various techniques such as dilated convolution [3], spatial pyramid pooling [36], and encoder-decoders [26] have been developed over the last decade. These techniques gradually improve segmentation accuracy by exploiting increasingly extensive contextual information. Recently, the self-attention mechanism [37, 7, 12] has been successfully employed to capture richer contextual information and further boost segmentation performance. Although the above-mentioned methods achieve high performance in semantic segmentation, they all work under full supervision. This form of supervision requires a large amount of pixel-wise annotations for training, which are very expensive and time-consuming to obtain.

To reduce the annotation burden, different supervision forms such as bounding boxes [35], scribbles [19], and image-level tags [16] have been considered for semantic segmentation. Among them, image-level tags have attracted the most attention because of their minimal annotation cost as well as their great challenge. Recent work [39] has shown that convolutional neural networks (CNNs) have localization ability even when only image-level tags are used. This observation has inspired much weakly-supervised semantic segmentation (WSSS) research. However, attentions in the class activation maps (CAMs) [39] inferred from image classification networks tend to focus on small discriminative parts of objects. The object location cues (also referred to as seeds) retrieved from these CAMs are too sparse to effectively train a segmentation model. Therefore, many efforts have been devoted to recovering dense and reliable seeds [31, 13, 33, 17, 15].

Fig. 1: Illustrations of context attention maps learned by the self-attention scheme. (a) shows images where pixels of interest are marked by ’+’. (b) presents the attention maps learned in a fully-supervised segmentation network, in which the pixels belonging to the same category with the selected pixels are highlighted. (c) shows the results learned in a weakly-supervised scenario, in which the information of the selected pixels is mis-spread to unexpected regions.

In this paper, we aim to take advantage of the self-attention mechanism to mine high-quality seeds. As validated in [7, 37], this mechanism successfully captures long-range contextual dependencies in fully-supervised semantic segmentation. However, it encounters the following challenges when applied to WSSS: (1) some foreground objects always co-occur with the same background, like 'boat' and 'water', leading to a pathological bias [18]; (2) the global average pooling (GAP) commonly used in classification networks to aggregate pixel-wise responses into image-level label scores encourages all responses to be high; (3) in the self-attention scheme, each pixel directly contributes to all other pixels and vice versa. These factors may cause attention to mis-spread from discriminative parts to unexpected regions. Typical examples are illustrated in Figure 1. In the fully-supervised setting, the information of the selected discriminative pixels is correctly propagated to the pixels belonging to the same category. In contrast, under weak supervision the discriminative information is diffused to the background and to regions of other categories.

To address the above-mentioned problems and enable the self-attention mechanism to work effectively under weak supervision, we construct a self-attention network that leverages the class-agnostic saliency as a guidance. A saliency map provides a rough detection of foreground objects so that it can prevent attentions from spreading to unexpected background regions. To further reduce the information diffusion among foreground categories, we integrate the class-specific attention cues as additional supervision. By this means, our network is able to generate dense and accurate seeds.

Our work distinguishes itself from the others as follows:

  • We propose a saliency-guided self-attention network (SGAN) for weakly supervised semantic segmentation. It integrates class-agnostic saliency maps and class-specific attention cues to enable the self-attention mechanism to work effectively under weak supervision.

  • In contrast to existing WSSS methods [2, 23, 29] that directly combine class-agnostic saliency maps with class-specific attention maps in user-defined ways, our approach fuses these two cues adaptively via the learning of the proposed self-attention network.

  • Our approach achieves mIoU scores of 66.5% and 66.4% on the PASCAL VOC 2012 val and test sets respectively, setting a new state of the art.

II Related Work

II-A Weakly-supervised Semantic Segmentation

Various supervision forms have been exploited for weakly-supervised semantic segmentation (WSSS). Here, we focus on works using image-level tags. Most recent methods solve the WSSS problem by first mining reliable seeds and then taking them as proxy ground-truth to train segmentation models. Thus, many efforts have been devoted to generating high-quality seeds.

A group of approaches takes the class activation maps (CAMs) [39] generated from classification networks as initial seeds. Since CAMs only focus on small discriminative regions, which are too sparse to effectively supervise a segmentation model, various techniques such as adversarial erasing [31, 18, 9, 2] and region growing [13, 28] have been developed to expand the sparse object seeds. Another research line introduces dilated convolutions of different rates [33, 15, 17, 5] to enlarge receptive fields in classification networks and aggregates multiple attention maps to achieve dense localization cues. In this work, we adopt the self-attention scheme to capture richer and more extensive contextual information for mining integral object seeds, and meanwhile leverage both class-agnostic saliency cues and class-specific attention cues to ensure that the seeds are accurate.

II-B Self-attention Mechanism

The self-attention mechanism [20] computes the context at each position as a weighted sum of all positions. Its superiority in capturing long-range dependencies has recently been validated on various vision tasks [30, 10, 37, 7]. In particular, for semantic segmentation, [37] integrated this mechanism into pyramid structures to capture multi-scale contextual information; [7] constructed a dual attention network to capture dependencies in both spatial and channel dimensions; [11] proposed an interlaced sparse approach to improve the efficiency of the self-attention mechanism; and [12] designed a recurrent criss-cross attention module to efficiently harvest contextual information. These methods significantly boost segmentation performance, but all of them operate under full supervision. Although [28] utilized the self-attention scheme for WSSS, they only used it to learn a saliency detector that is itself trained in a fully-supervised manner. In this work, we apply the self-attention scheme to the more challenging weakly-supervised scenario.

Fig. 2: An overview of the proposed saliency guided self-attention network.

II-C Saliency Guidance for WSSS

Salient object detection (SOD) [34] produces class-agnostic saliency maps that distinguish foreground objects from the background. The results of SOD have been extensively used in weakly-supervised semantic segmentation. For instance, many methods [31, 13, 17, 5, 18, 33] exploited saliency maps to generate background seeds. Moreover, [32] adopted a self-paced learning strategy to learn a segmentation model that was initialized under the full supervision of saliency maps of simple images. [28] utilized saliency maps to guide a CAM-seeded region growing process to expand object regions. [6] used instance-level saliency maps to construct and partition similarity graphs for WSSS. [2, 23, 29] combined class-agnostic saliency maps with class-specific attention maps in user-defined ways to obtain dense seeds. In our work, saliency maps and attention cues are adaptively fused within the self-attention network.

III The Proposed Approach

The proposed approach for weakly-supervised semantic segmentation consists of two parts: (1) learning a saliency guided self-attention network to generate dense and accurate seeds, and (2) utilizing the high-quality seeds as proxy ground-truth to train a semantic segmentation model. The details are introduced below.

III-A Saliency Guided Self-attention Network

III-A1 Network architecture

The overview of our proposed saliency guided self-attention network (SGAN) is illustrated in Figure 2. It consists of three components: (1) a CNN backbone to learn deep feature representations; (2) a saliency guided self-attention module that propagates attentions from small discriminative parts to non-discriminative regions by capturing long-range contextual dependencies; (3) an image classification branch together with a seed segmentation branch to supervise the training of the entire network.

We adopt a slightly modified VGG-16 [16] network as the backbone for feature extraction. The last two pooling layers are removed to increase the resolution of the output feature maps. Note that, unlike previous works [33, 5, 17] that enlarge the dilation rate of convolution kernels in the conv5 block, we avoid dilated convolutions and instead use the self-attention module to capture more extensive context.

III-A2 Saliency guided self-attention module

This module takes advantage of the self-attention mechanism to capture rich contextual information, which is essential for discovering the integral extent of objects and retrieving high-quality seeds. The self-attention mechanism has demonstrated its effectiveness in capturing long-range dependencies under full supervision [7, 37]. However, simply integrating it into a weakly-supervised network may suffer from the severe mis-spread problem introduced previously. Thus, we propose to incorporate class-agnostic saliency priors to prohibit the spread of attentions from discriminative object regions to the background.

We formally describe the saliency guided self-attention module as follows. This module takes as inputs the feature map output by the VGG's conv5 block, denoted as $F \in \mathbb{R}^{C \times H \times W}$, together with a saliency map. From the input feature map, a sequence of spatial matrix operations generates a spatial attention map $A \in \mathbb{R}^{N \times N}$, where $N = H \times W$ is the number of positions. More specifically, $F$ is first fed into two $1 \times 1$ convolutions for linear embedding, generating a key feature map $K$ and a query feature map $Q$ respectively. These two feature maps are further reshaped to $\mathbb{R}^{C' \times N}$. Then, the spatial attention map is generated by computing the inner product of the channel-wise features from any two positions of $K$ and $Q$. That is,

$$A_{ij} = k_i^{\top} q_j, \qquad (1)$$

where $i, j$ are the indexes of positions, and $k_i, q_j \in \mathbb{R}^{C'}$ are the channel-wise features. $A_{ij}$ measures the relationship between the $i$-th position and the $j$-th position based upon the learned features. Note that different pairwise functions [30] can be used for this measurement; we take the inner product because it is simple yet effective.

For the input saliency map, we first threshold it to obtain a binary mask and reshape it to $m \in \{0, 1\}^{N}$. After that, a saliency attention map $S \in \mathbb{R}^{N \times N}$ is computed by

$$S_{ij} = \mathbb{1}[m_i = m_j], \qquad (2)$$

where $\mathbb{1}[\cdot]$ is an indicator function. It equals one if positions $i$ and $j$ are both salient or both non-salient.

The final context attention map $\bar{A}$ is generated via an element-wise product between the spatial attention map $A$ and the saliency attention map $S$, followed by a linear normalization:

$$\bar{A}_{ij} = \frac{A_{ij} S_{ij}}{\sum_{j=1}^{N} A_{ij} S_{ij}}. \qquad (3)$$
Once the context attention map is obtained, we use it to enhance the original feature map $F$. Specifically, we reshape $F$ to $\mathbb{R}^{C \times N}$ and conduct a matrix multiplication between $F$ and the transpose of $\bar{A}$. Then we reshape the result back to $\mathbb{R}^{C \times H \times W}$ and perform an element-wise summation with $F$ to obtain the enhanced features $E$. That is,

$$E_i = \gamma \sum_{j=1}^{N} \bar{A}_{ij} F_j + F_i, \qquad (4)$$

where $\gamma$ is a parameter initialized as 0 [7] and gradually learned during training. Equation (4) indicates that each position of $E$ is the sum of the weighted features at all positions and the original features. Therefore, this module captures contextual information from the whole image. By this means, attentions on small discriminative parts of objects can be propagated to non-discriminative object regions, but not to the background, thanks to the guidance of saliency.
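To make the data flow of Equations (1)-(4) concrete, the module can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the dense matrices `Wk` and `Wq` stand in for the two 1×1 convolutions, `gamma` is shown as a fixed value rather than a learned parameter, and all maps are flattened over the N = H×W positions.

```python
import numpy as np

def saliency_guided_attention(F, mask, Wk, Wq, gamma=0.5):
    """Sketch of the saliency guided self-attention module.

    F    : (C, N) feature map flattened over the N = H*W positions
    mask : (N,)  binary saliency mask (1 = salient, 0 = background)
    Wk, Wq : (C', C) linear embeddings standing in for the 1x1 convolutions
    """
    K, Q = Wk @ F, Wq @ F                        # key / query maps, shape (C', N)
    A = K.T @ Q                                  # Eq. (1): pairwise inner products, (N, N)
    S = (mask[:, None] == mask[None, :]).astype(F.dtype)  # Eq. (2): same-saliency indicator
    G = A * S                                    # suppress foreground<->background links
    Abar = G / np.maximum(G.sum(axis=1, keepdims=True), 1e-8)  # Eq. (3): linear normalization
    E = gamma * (F @ Abar.T) + F                 # Eq. (4): context-enhanced features
    return E, Abar
```

Because the indicator map zeroes out every foreground-background pair, a salient pixel can only aggregate context from other salient pixels, which is exactly the mis-spread protection described above.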

III-A3 Integrating class-specific attention cues

The class-agnostic saliency maps introduced above can only roughly separate foreground objects from the background, but provide no information about semantic categories. In order to prevent our SGAN from mis-spreading attentions among objects of different categories, we propose to integrate the class-specific attention cues that are obtained by the CAM method [39]. As introduced in Section I, initial CAMs are widely used in WSSS to provide reliable localization cues, but these cues are very sparse.

Specifically, we construct a segmentation branch in our SGAN. It takes the enhanced feature $E$ as the input and goes through a convolutional layer to produce segmentation maps, each of which corresponds to a foreground category. Meanwhile, we retrieve reliable but sparse foreground object seeds by thresholding the class activation maps obtained from the VGG-16 classification network with a high confidence value (empirically set to 0.3 in this work) and use them to supervise the segmentation maps. The corresponding seed loss is defined by

$$L_{seed} = -\frac{1}{\sum_{c \in \mathcal{C}} |S_c|} \sum_{c \in \mathcal{C}} \sum_{u \in S_c} \log H_{u,c}. \qquad (5)$$

Here, $\mathcal{C}$ denotes the set of foreground classes present in an image, $S_c$ is the set of seed locations corresponding to class $c$, and $|S_c|$ is the cardinality of that set. $H_{u,c}$ denotes the probability of class $c$ at position $u$ of the segmentation maps. Note that, in contrast to the seeding loss defined in [16, 13], which considers both foreground and background categories, our loss only takes the foreground classes into account.
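The foreground-only seed loss can be sketched directly from its definition; the names `H` (segmentation-branch probabilities) and `seeds` (per-class seed positions) are ours, chosen for illustration.

```python
import numpy as np

def seed_loss(H, seeds):
    """Foreground-only seed loss: average negative log-probability of the
    correct class over all foreground seed positions.

    H     : (num_classes, N) per-class probabilities from the segmentation branch
    seeds : dict mapping each present foreground class c to an array of
            seed positions S_c retrieved from the thresholded CAMs
    """
    total = sum(len(pos) for pos in seeds.values())   # sum_c |S_c|
    loss = 0.0
    for c, pos in seeds.items():
        loss -= np.log(H[c, pos]).sum()               # -sum_{u in S_c} log H_{u,c}
    return loss / total
```

With perfect predictions at the seed positions the loss is zero, and only seeded positions contribute gradients, so the sparse cues never penalize unseeded regions.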

III-A4 Training the saliency guided self-attention network

The network also has an image classification branch that is supervised by image-level labels. Let us denote the classification probability of class $c$ as $p_c$ and the corresponding image-level label as $y_c \in \{0, 1\}$, which indicates the presence or absence of a foreground object category. The classification loss is then defined by the sigmoid cross entropy. That is,

$$L_{cls} = -\frac{1}{C} \sum_{c=1}^{C} \big[ y_c \log p_c + (1 - y_c) \log (1 - p_c) \big], \qquad (6)$$

where $C$ is the number of foreground categories. The overall loss for training our saliency guided self-attention network is defined by

$$L = L_{cls} + \alpha L_{seed}, \qquad (7)$$

where $\alpha$ is a weighting factor to balance the two terms.

III-A5 Generating high-quality seeds

Once the proposed SGAN is trained, we employ the CAM method [39] to infer class activation maps from SGAN and retrieve high-quality seeds. More specifically, for each foreground class, we retrieve object seeds by thresholding the corresponding class activation map with a high value. In addition, we retrieve background seeds by thresholding the input saliency map with a low value. The threshold settings in our experiments follow [16, 13, 33].
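The seed-generation step can be sketched as follows. The threshold arguments `fg_thresh` and `bg_thresh` are illustrative placeholders (the paper follows the settings of prior work), and the label encoding is a convention we chose for the sketch.

```python
import numpy as np

def generate_seeds(cams, saliency, fg_thresh, bg_thresh):
    """Retrieve proxy ground-truth seeds from class activation maps and a saliency map.

    cams     : (num_fg_classes, H, W) class activation maps in [0, 1]
    saliency : (H, W) saliency map in [0, 1]
    Returns a (H, W) label map: class index for foreground seeds,
    num_fg_classes for background seeds, and -1 for unlabeled pixels.
    """
    h, w = saliency.shape
    labels = np.full((h, w), -1, dtype=int)
    labels[saliency < bg_thresh] = cams.shape[0]   # low saliency -> background seed
    best = cams.argmax(axis=0)                     # strongest class per pixel
    confident = cams.max(axis=0) > fg_thresh       # high activation -> foreground seed
    labels[confident] = best[confident]
    return labels
```

Pixels left at -1 are simply ignored by the seed losses, which is what makes sparse but reliable seeds usable as proxy ground-truth.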

III-B Training the Segmentation Network

After obtaining the high-quality seeds, we can use them as proxy ground-truth labels to train an arbitrary semantic segmentation network. In this work, we adopt the balanced seed loss proposed in DSRG [13] for the seed supervision. It is

$$L_{b\text{-}seed} = -\frac{1}{\sum_{c \in \mathcal{C}} |S_c|} \sum_{c \in \mathcal{C}} \sum_{u \in S_c} \log H_{u,c} - \frac{1}{|S_{bg}|} \sum_{u \in S_{bg}} \log H_{u,bg}, \qquad (8)$$

where $bg$ denotes the background; $\mathcal{C}$, $S_c$, and $H_{u,c}$ hold the same definitions as before.

We further exploit the boundary constraint loss used in both DSRG [13] and SEC [16] to encourage segmentation results to match object boundaries. Let us denote the input image as $X$ and the output probability map of the fully-connected CRF as $Q$. The boundary constraint loss is then defined as the mean KL-divergence between the segmentation map and the output of the CRF:

$$L_{boundary} = \frac{1}{N} \sum_{u=1}^{N} \sum_{c} Q_{u,c}(X, H) \log \frac{Q_{u,c}(X, H)}{H_{u,c}}. \qquad (9)$$

Thus, the total loss for training the segmentation model is $L_{seg} = L_{b\text{-}seed} + L_{boundary}$.
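The two segmentation losses can be sketched as below. The CRF inference itself is omitted; `Q` is assumed to be a precomputed CRF output, and the convention that the last row of `H` is the background class is ours.

```python
import numpy as np

def balanced_seed_loss(H, fg_seeds, bg_seeds):
    """Balanced seed loss in the spirit of DSRG: foreground and background
    seed terms are normalized separately so neither dominates.

    H        : (num_classes, N) class probabilities, last row = background
    fg_seeds : dict {class c: seed positions S_c}
    bg_seeds : background seed positions S_bg
    """
    n_fg = sum(len(p) for p in fg_seeds.values())
    fg = -sum(np.log(H[c, p]).sum() for c, p in fg_seeds.items()) / n_fg
    bg = -np.log(H[-1, bg_seeds]).sum() / len(bg_seeds)
    return fg + bg

def boundary_loss(H, Q):
    """Boundary constraint: mean KL divergence between the CRF output Q and
    the segmentation map H, both of shape (num_classes, N)."""
    return np.mean(np.sum(Q * np.log(Q / H), axis=0))
```

The boundary term is zero when the network output already agrees with the CRF, so it only pushes predictions toward CRF-refined, boundary-aligned maps.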

IV Experiments

IV-A Experimental Setup

IV-A1 Dataset and evaluation metric

The proposed approach is evaluated on the PASCAL VOC 2012 segmentation benchmark [4], which provides pixel-wise annotations for 20 object classes and one background class. The dataset contains 1464 images for training, 1449 images for validation and 1456 images for testing. Following the common practice [16, 13, 33], we augment the training set to 10,582 images. Our network is trained on the augmented training set only using image-level annotations and evaluated on the validation set in terms of the mean intersection-over-union (mIoU) criterion. Evaluation results of the test set are obtained by submitting our prediction results to the official PASCAL VOC evaluation server.
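For reference, the mIoU criterion can be sketched as follows. This is a simplified version that ignores the official VOC handling of the void/ignore label.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes (simplified PASCAL VOC style).

    pred, gt : integer label arrays of the same shape
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:                      # skip classes absent from both maps
            ious.append(inter / union)
    return np.mean(ious)
```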

IV-A2 Training details

The saliency guided self-attention network is built on the VGG-16 network pre-trained on ImageNet. The remaining parameters of our SGAN are randomly initialized. Following [33], we use S-Net [34] to obtain a class-agnostic saliency map for each image. SGD with mini-batches is used for training. The batch size is set to 15, the momentum is 0.9, and the weight decay is 0.0005. Input images are resized to a fixed resolution, and no data augmentation except random horizontal flipping is adopted. We train the SGAN for 8,000 iterations. The initial learning rate is 0.001 and it is decreased by a factor of 10 every 2,000 iterations.

We choose the Deeplab-ASPP [3] network as the semantic segmentation model for comparison with other WSSS works. Both VGG-16 and ResNet-101 backbones are investigated. The batch size is set to 15, the momentum is 0.9, and the weight decay is 0.0005. Input images are resized and randomly cropped for training. Horizontal flipping and color jittering are employed for data augmentation. We train the segmentation model for 12,000 iterations. The initial learning rate is 0.001 and it is decreased by a factor of 0.33 every 2,000 iterations.

Methods Publication val test
Backbone: VGG-16 network
SEC [16] ECCV’2016 50.7 51.1
AF-SS [25] ECCV’2016 52.6 52.7
CBTS [27] CVPR’2017 52.8 53.7
AE_PSL [31] CVPR’2017 55.0 55.7
DCSP [2] BMVC’2017 58.6 59.2
GAIN [18] CVPR’2018 55.3 56.8
MCOF [29] CVPR’2018 56.2 57.6
AffinityNet [1] CVPR’2018 58.4 60.5
DSRG [13] CVPR’2018 59.0 60.4
MDC [33] CVPR’2018 60.4 60.8
SeeNet [9] NIPS’2018 61.1 60.7
AISI [6] ECCV’2018 61.3 62.1
SGDN [28] PRL’2019 50.5 51.3
DSNA [38] TMM’2019 55.4 56.4
FickleNet [17] CVPR’2019 61.2 61.8
SGAN(Ours) - 63.7 63.6
Backbone: ResNet-101 network
DCSP [2] BMVC’2017 60.8 61.8
MCOF [29] CVPR’2018 60.3 61.2
DSRG [13] CVPR’2018 61.4 63.2
SeeNet [9] NIPS’2018 63.1 62.8
AISI [6] ECCV’2018 63.6 64.5
CIAN [5] Arxiv’2019 64.1 64.7
DFPN [15] TIP’2019 61.9 62.8
DSNA [38] TMM’2019 58.2 60.1
FickleNet [17] CVPR’2019 64.9 65.3
SGAN(Ours) - 66.5 66.4
TABLE I: Comparison of weakly-supervised semantic segmentation methods on PASCAL VOC 2012 validation and test sets in terms of mIoU (%).

IV-A3 Reproducibility

We implement our SGAN on PyTorch [24] for training and producing high-quality seeds. We use the official Deeplab-ASPP code implemented on Caffe [14] for semantic segmentation. All experiments are conducted on a GTX 1080Ti GPU. We will make our code publicly available soon.

IV-B Comparison to the State of the Art

We compare our approach with other state-of-the-art WSSS methods that are also supervised only by image-level labels. For fair comparison, we separate the methods into two groups according to the backbones upon which their segmentation models are built, as listed in Table I. Many of these methods use saliency detectors [21, 8, 34] to retrieve background seeds and utilize CAMs obtained from classification networks as initial seeds.

Table I shows that our method outperforms all previous methods with both VGG-16 and ResNet-101 backbones. In particular, AE_PSL [31], GAIN [18], and SeeNet [9] use erasing techniques to obtain dense localization cues, which tend to mistakenly include true negative regions. AffinityNet [1], DSRG [13], and SGDN [28] adopt region growing techniques to expand seeds; it may be hard for them to reach non-discriminative regions if the initial seeds are concentrated on extremely small discriminative parts. MDC [33], DFPN [15], and FickleNet [17] use dilated convolutions to retrieve dense seeds, but their receptive fields are not adaptive to image content and may result in over-expansion. In contrast, our method achieves dense and accurate seeds, benefitting from the self-attention mechanism as well as the introduced saliency and attention cues. In addition, our approach outperforms DSNA [38], which uses a spatial attention scheme, by a large margin. Our approach also performs better than AISI [6], which leverages strong instance-level saliency information, and CIAN [5], which utilizes cross-image affinities.

IV-C Qualitative Results

Figure 3 shows qualitative segmentation results obtained by the proposed approach. As we can see, our approach produces accurate segmentation results and recovers fine details of object boundaries for images containing scale variation, multiple objects, and complex background. A typical failure case is also presented in the last row, in which the dining table is indistinguishable from the background and thus misidentified as background.

Fig. 3: Examples of segmentation results obtained by the proposed approach.

Method bkg aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mIoU
Baseline 86.5 68.1 29.8 71.8 56.2 56.3 47.6 69.7 75.4 18.6 60.6 18.3 62.6 62.1 67.1 59.3 34.4 69.7 27.3 58.4 55.0 55.0
SGAN-SAL-SEED 78.7 51.4 22.1 23.5 21.4 62.5 73.8 60.2 80.6 6.6 58.1 4.3 69.5 45.8 65.3 66.1 31.4 35.4 23.7 48.7 35.3 45.8
SGAN-SEED 89.5 75.4 31.0 75.1 60.0 66.3 68.3 73.8 82.3 23.0 74.8 25.1 76.2 69.0 69.1 72.8 40.3 71.5 32.8 73.2 60.6 62.4
SGAN 89.6 75.0 31.8 73.1 61.1 67.4 79.1 75.4 82.3 26.3 75.0 28.5 75.7 67.8 70.1 73.1 45.7 72.5 35.6 73.2 58.6 63.7
TABLE II: Per-class IoU and mIoU (%) comparison of the proposed model under different settings on the PASCAL VOC 2012 val set.
Fig. 4: Visualization of the context attention maps learned by different variants of our saliency guided self-attention network.

IV-D Ablation Studies

IV-D1 Effectiveness of the components in SGAN

To investigate the effectiveness of each component in the saliency guided self-attention network, we conduct a series of experiments in different settings while keeping the VGG-16-based segmentation model the same throughout these experiments. Particularly, the following four configurations are investigated: (1) the full model, which is referred to as SGAN; (2) the model without the segmentation branch and the seed loss, which is denoted as SGAN-SEED; (3) the model without the segmentation branch, the seed loss, and the saliency guidance, which in essence is directly integrating the self-attention mechanism into the modified VGG-16 classification network. We denote this variant as SGAN-SAL-SEED; and (4) the baseline model without our proposed saliency guided self-attention module, which is actually the modified VGG-16 classification network.

The comparison results are listed in Table II, from which we make the following observations: (1) The SGAN-SAL-SEED model, which applies the self-attention mechanism directly in a weakly-supervised network, degrades the segmentation performance, especially for categories that always co-occur with the same background, for instance, 'airplane' with 'sky', 'boat' with 'water', and 'horse' with 'grass'. In such cases, SGAN-SAL-SEED tends to propagate attentions from foreground objects to the co-occurrent background and generates inaccurate seeds. (2) The SGAN-SEED model, which uses the proposed saliency guided self-attention module, outperforms the baseline model over all categories. In addition, by integrating class-specific attention cues via the seed loss, our full model boosts the performance further. (3) Compared to the baseline model, our full model boosts the performance significantly for categories containing large objects, like 'bus' (+31.5%) and 'train' (+14.8%), and categories with large scale variation, such as 'person' (+13.8%). For these categories, the initial localization cues are usually too sparse to delineate the integral object extent. Our model can effectively propagate attentions from small discriminative parts to non-discriminative regions of objects and generate more complete object seeds, leading to much better segmentation performance.

IV-D2 Influence of the weighting factor

The weighting factor $\alpha$ in the total loss of SGAN determines the impact of the seed loss. Without the seed loss, no class-specific attention cues are included and our SGAN cannot handle the problem of mis-spreading class-specific attentions among foreground categories. However, putting too much weight on this term may cause inefficient training due to the sparsity of the seeds. Table III shows the influence of $\alpha$ on the final semantic segmentation performance. We find that $\alpha = 0.15$ leads to the best performance. In addition, when $\alpha$ is larger than 0.5, the training procedure becomes unstable.

$\alpha$ 0 0.05 0.1 0.15 0.2 0.25 0.3
mIoU (%) 62.4 63.4 63.5 63.7 63.5 63.5 63.1
TABLE III: Influence of the weighting factor $\alpha$ on the segmentation performance on the PASCAL VOC 2012 validation set.

IV-D3 Visualization of context attention maps

To better understand how the proposed method behaves, Figure 4 visualizes the context attention maps learned by different variants of our SGAN. Specifically, we select one discriminative pixel in each image and mark it with a yellow '+'. The attentions propagated from the selected pixel to all other pixels are indicated in the corresponding column of the learned context attention map. We reshape the column into the image size and overlay it on the color image for visualization. As shown in Figure 4(b), simply integrating the self-attention mechanism in a weakly-supervised network tends to mess up the attentions. Saliency priors prohibit the attentions from spreading to the background. By further integrating the class-specific attention cues, our full model restricts the attentions to propagate mostly to pixels belonging to the same category as the selected pixel.
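The visualization step described above amounts to extracting one column of the context attention map and reshaping it to the spatial size; a minimal sketch (function name and arguments are ours):

```python
import numpy as np

def attention_overlay(Abar, pixel, hw):
    """Attentions propagated from one selected pixel to all others.

    Abar  : (N, N) context attention map, N = h * w
    pixel : index of the selected discriminative pixel
    hw    : (h, w) spatial size to reshape the column back to
    """
    h, w = hw
    return Abar[:, pixel].reshape(h, w)   # column `pixel`, back to image layout
```

The returned map can then be overlaid on the color image, as done for Figure 4.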

V Conclusion

In this paper, we have presented a saliency guided self-attention network to address the semantic segmentation problem supervised by image-level labels only. To generate dense and accurate object seeds, we introduced the self-attention mechanism into the weakly-supervised scenario and utilized both class-agnostic saliency maps and class-specific attention cues to enable the mechanism to work effectively. Extensive experiments on the PASCAL VOC 2012 dataset show that the proposed method outperforms the baseline model by a large margin and performs better than all other state-of-the-art methods.


  • [1] J. Ahn and S. Kwak (2018) Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In CVPR, pp. 4981–4990. Cited by: §IV-B, TABLE I.
  • [2] A. Chaudhry, P. K. Dokania, and P. H. Torr (2017) Discovering class-specific pixels for weakly-supervised semantic segmentation. In BMVC, Cited by: 2nd item, §II-A, §II-C, TABLE I.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: §I, §IV-A2.
  • [4] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §IV-A1.
  • [5] J. Fan, Z. Zhang, and T. Tan (2018) CIAN: cross-image affinity net for weakly supervised semantic segmentation. arXiv preprint arXiv:1811.10842. Cited by: §II-A, §II-C, §III-A1, §IV-B, TABLE I.
  • [6] R. Fan, Q. Hou, M. Cheng, G. Yu, R. R. Martin, and S. Hu (2018) Associating inter-image salient instances for weakly supervised semantic segmentation. In ECCV, pp. 367–383. Cited by: §II-C, §IV-B, TABLE I.
  • [7] J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In CVPR, Cited by: §I, §I, §II-B, §III-A2, §III-A2.
  • [8] Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr (2017) Deeply supervised salient object detection with short connections. In CVPR, pp. 3203–3212. Cited by: §IV-B.
  • [9] Q. Hou, P. Jiang, Y. Wei, and M. Cheng (2018) Self-erasing network for integral object attention. In NIPS, pp. 547–557. Cited by: §II-A, §IV-B, TABLE I.
  • [10] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei (2018) Relation networks for object detection. In CVPR, pp. 3588–3597. Cited by: §II-B.
  • [11] L. Huang, Y. Yuan, J. Guo, C. Zhang, X. Chen, and J. Wang (2019) Interlaced sparse self-attention for semantic segmentation. arXiv preprint arXiv:1907.12273. Cited by: §II-B.
  • [12] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019) CCNet: criss-cross attention for semantic segmentation. In ICCV, Cited by: §I, §II-B.
  • [13] Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang (2018) Weakly-supervised semantic segmentation network with deep seeded region growing. In CVPR, pp. 7014–7023. Cited by: §I, §II-A, §II-C, §III-A3, §III-A5, §III-B, §III-B, §IV-A1, §IV-B, TABLE I.
  • [14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In ACM MM, pp. 675–678. Cited by: §IV-A3.
  • [15] L. Jing, Y. Chen, and Y. Tian (2019) Coarse-to-fine semantic segmentation from image-level labels. IEEE Transactions on Image Processing. Cited by: §I, §II-A, §IV-B, TABLE I.
  • [16] A. Kolesnikov and C. H. Lampert (2016) Seed, expand and constrain: three principles for weakly-supervised image segmentation. In ECCV, pp. 695–711. Cited by: §I, §III-A1, §III-A3, §III-A5, §III-B, §IV-A1, TABLE I.
  • [17] J. Lee, E. Kim, S. Lee, J. Lee, and S. Yoon (2019) FickleNet: weakly and semi-supervised semantic image segmentation using stochastic inference. In CVPR, Cited by: §I, §II-A, §II-C, §III-A1, §IV-B, TABLE I.
  • [18] K. Li, Z. Wu, K. Peng, J. Ernst, and Y. Fu (2018) Tell me where to look: guided attention inference network. In CVPR, pp. 9215–9223. Cited by: §I, §II-A, §II-C, §IV-B, TABLE I.
  • [19] D. Lin, J. Dai, J. Jia, K. He, and J. Sun (2016) ScribbleSup: scribble-supervised convolutional networks for semantic segmentation. In CVPR, pp. 3159–3167. Cited by: §I.
  • [20] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. In ICLR, pp. 1–15. Cited by: §II-B.
  • [21] N. Liu and J. Han (2016) Dhsnet: deep hierarchical saliency network for salient object detection. In CVPR, pp. 678–686. Cited by: §IV-B.
  • [22] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440. Cited by: §I.
  • [23] S. J. Oh, R. Benenson, A. Khoreva, Z. Akata, M. Fritz, and B. Schiele (2017) Exploiting saliency for object segmentation from image level labels. In CVPR, pp. 5038–5047. Cited by: 2nd item, §II-C.
  • [24] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §IV-A3.
  • [25] X. Qi, Z. Liu, J. Shi, H. Zhao, and J. Jia (2016) Augmented feedback in semantic segmentation under image level supervision. In ECCV, pp. 90–105. Cited by: TABLE I.
  • [26] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §I.
  • [27] A. Roy and S. Todorovic (2017) Combining bottom-up, top-down, and smoothness cues for weakly supervised image segmentation. In CVPR, pp. 3529–3538. Cited by: TABLE I.
  • [28] F. Sun and W. Li (2019) Saliency guided deep network for weakly-supervised image segmentation. Pattern Recognition Letters 120, pp. 62–68. Cited by: §II-A, §II-B, §II-C, §IV-B, TABLE I.
  • [29] X. Wang, S. You, X. Li, and H. Ma (2018) Weakly-supervised semantic segmentation by iteratively mining common object features. In CVPR, pp. 1354–1362. Cited by: 2nd item, §II-C, TABLE I.
  • [30] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, pp. 7794–7803. Cited by: §II-B, §III-A2.
  • [31] Y. Wei, J. Feng, X. Liang, M. Cheng, Y. Zhao, and S. Yan (2017) Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In CVPR, pp. 1568–1576. Cited by: §I, §II-A, §II-C, §IV-B, TABLE I.
  • [32] Y. Wei, X. Liang, Y. Chen, X. Shen, M. Cheng, J. Feng, Y. Zhao, and S. Yan (2017) Stc: a simple to complex framework for weakly-supervised semantic segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (11), pp. 2314–2320. Cited by: §II-C.
  • [33] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang (2018) Revisiting dilated convolution: a simple approach for weakly-and semi-supervised semantic segmentation. In CVPR, pp. 7268–7277. Cited by: §I, §II-A, §II-C, §III-A1, §III-A5, §IV-A1, §IV-A2, §IV-B, TABLE I.
  • [34] H. Xiao, J. Feng, Y. Wei, M. Zhang, and S. Yan (2018) Deep salient object detection with dense connections and distraction diagnosis. IEEE Transactions on Multimedia 20 (12), pp. 3239–3251. Cited by: §II-C, §IV-A2, §IV-B.
  • [35] J. Xu, A. G. Schwing, and R. Urtasun (2015) Learning to segment under various forms of weak supervision. In CVPR, pp. 3781–3790. Cited by: §I.
  • [36] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang (2018) DenseASPP for semantic segmentation in street scenes. In CVPR, pp. 3684–3692. Cited by: §I.
  • [37] Y. Yuan and J. Wang (2018) Ocnet: object context network for scene parsing. arXiv preprint arXiv:1809.00916. Cited by: §I, §I, §II-B, §III-A2.
  • [38] T. Zhang, G. Lin, J. Cai, T. Shen, C. Shen, and A. C. Kot (2019) Decoupled spatial neural attention for weakly supervised semantic segmentation. IEEE Transactions on Multimedia. Cited by: §IV-B, TABLE I.
  • [39] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In CVPR, pp. 2921–2929. Cited by: §I, §II-A, §III-A3, §III-A5.