Semantic segmentation is the task of assigning a semantic label to every pixel within an image. In recent years, deep convolutional neural networks (DCNNs) [1, 2, 3] have brought great improvements in semantic segmentation performance. Training DCNNs in a fully-supervised setting with pixel-wise ground-truth annotations achieves state-of-the-art semantic segmentation accuracy. However, the main limitation of the fully-supervised setting is that obtaining a large number of accurate pixel-level annotations for training images is labor-intensive. Datasets with only image-level annotations, on the other hand, are much easier to obtain. Therefore, weakly-supervised semantic segmentation supervised only with image labels has received much attention.
The most widely used pipeline in weakly supervised semantic segmentation first estimates pseudo-annotations for the training images based on localization cues and then uses the pseudo-annotations as ground truth to train the segmentation DCNNs. Clearly, the quality of the pseudo-annotations directly affects the final segmentation performance. In our work, we follow the same pipeline and mainly focus on the first step, which is to generate high-quality pseudo-annotations for the training images with only image-level labels. In recent years, top-down neural saliency [7, 8, 9] has performed well in weakly-supervised localization tasks and has consequently been widely applied to generating pseudo-annotations for semantic segmentation supervised with image-level labels. However, as pointed out by previous works, such top-down neural saliency is good at identifying the most discriminative regions of the objects rather than their whole extent. Thus the pseudo-annotations generated by these methods are far from the ground-truth annotations. To alleviate this problem, some works rely on multiple ad-hoc processing steps (e.g., iterative training), which are difficult to implement. Other works introduce external information (e.g., web data) to guide the supervision, which greatly increases the data and computation load. In contrast, our work proposes a principled pipeline that is simple and effective to implement.
Our aim is to generate pseudo-annotations for weakly supervised semantic segmentation efficiently and effectively. Inspired by the spatial neural attention mechanism, which has been widely used in VQA and image captioning, we introduce spatial neural attention into our pseudo-annotation generation pipeline and propose a decoupled spatial neural attention structure which simultaneously localizes the discriminative parts and estimates object regions in one end-to-end framework. This structure generates effective pseudo-annotations in one forward pass. Our decoupled attention structure is illustrated in Fig. 1.
Our major contributions can be summarized as follows:
We introduce spatial neural attention and propose a decoupled attention structure for generating pseudo-annotations for weakly supervised semantic segmentation.
Our decoupled attention model outputs two attention maps which focus on identifying object regions and mining the discriminative parts respectively. These two attention maps are complementary to each other and together generate high-quality pseudo-annotations.
We employ a simple and effective pipeline without heuristic multi-step iterative training steps, which is different from most existing methods of weakly-supervised semantic segmentation.
We perform detailed ablation experiments to verify the effectiveness of our decoupled attention structure. We achieve state-of-the-art weakly supervised semantic segmentation results on the PASCAL VOC 2012 image segmentation benchmark.
II Related Work
II-A Weakly Supervised Semantic Segmentation
In recent years, the performance of semantic segmentation has been greatly improved with the help of deep convolutional neural networks (DCNNs) [1, 12, 3, 13, 14, 15, 16, 17, 18]. Training DCNNs for semantic segmentation in a fully-supervised pipeline requires pixel-wise ground-truth annotations, which are very time-consuming to obtain.
Weakly-supervised semantic segmentation has therefore received research attention as a way to reduce the workload of pixel-wise annotation of the training data. Among the weakly-supervised settings, image-level labels are the easiest annotations to obtain. For semantic segmentation with image-level labels, some early works [19, 20] treat the task as a multiple instance learning (MIL) problem, which views an image as positive if at least one pixel/superpixel within it is positive, and negative if all of its pixels are negative. Other early works apply an Expectation-Maximization (EM) procedure, which alternates between predicting pixel labels and optimizing the DCNN parameters. However, due to the lack of effective location cues, the performance of these early works is not satisfactory.
The performance of semantic segmentation with image-level labels improved significantly after location information was introduced to generate localization seeds/pseudo-annotations for segmentation DCNNs. The quality of the pseudo-annotations directly influences the segmentation results. There are several categories of methods for estimating pseudo-annotations. The first category is the Simple-to-Complex (STC) strategy [22, 23, 24]. The methods in this category assume that the pseudo-annotations of simple images (e.g., web images) can be accurately estimated by saliency detection or co-segmentation. The segmentation models trained on the simple images are then used to generate pseudo-annotations for the complex images. These methods usually require a large amount of external data, which increases the data and computation load. The second category consists of region-mining based methods, which rely on region-mining techniques [7, 9, 8] to generate discriminative regions as localization seeds. Since such localization seeds lie sparsely in the discriminative parts rather than covering the whole extent of the objects, they are far from the ground-truth annotation, and many works try to alleviate this problem by expanding the localization seeds to the size of the objects. Kolesnikov et al. [4] expand the seeds by aggregating the pixel scores with global weighted rank-pooling. Wei et al. [6] apply an adversarial-erasing approach which alternates between suppressing the most discriminative image region and training the region-mining model. It gradually localizes the next most discriminative regions over multiple iterations and merges all the mined discriminative regions into the final pseudo-annotations. Similarly, Two-phase captures the full extent of the objects by suppressing and mining in two phases. Some works [5, 23] utilize external dependencies, such as a fully-supervised saliency method trained on additional saliency datasets, to facilitate estimating object scales.
To generate high-quality pseudo-annotations, the first category focuses on the quality of the training data, while the second category focuses on post-processing the localization seeds, which is independent of the region-mining model structure. Different from previous methods, we focus on designing a region-mining model structure which is likely to highlight the object region. For efficiency and simplicity, we aim to generate pseudo-annotations for weakly-supervised semantic segmentation in a single forward pass without external data or external priors.
II-B Mining Discriminative Regions
In this section we introduce some region-mining methods, which have been widely used for generating pseudo-annotations for semantic segmentation with image-level labels. Recent works on top-down neural saliency [7, 9, 8] perform well in weakly supervised localization tasks. Such works identify the discriminative regions with respect to each individual class based on image classification DCNNs. Zhang et al. [8] propose Excitation Backprop, which back-propagates through the network hierarchy to identify the discriminative regions. Zhou et al. [7] propose a technique called Class Activation Mapping (CAM), which identifies discriminative regions by replacing the fully-connected layers in image classification CNNs with convolutional layers and global average pooling. Grad-CAM [9] is a strict generalization of CAM [7] that does not require modifying the DCNN structure. Among the methods listed above, CAM [7] is the most widely used in weakly-supervised semantic segmentation [4, 25, 6] for generating pseudo-annotations.
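As a minimal sketch of the CAM idea described above, the map for one class can be computed as a weighted sum of the last convolutional feature maps, using the classifier weights of that class. The function name, array shapes and the min-max normalization step are illustrative assumptions, not details taken from this paper.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """features: (K, H, W) last conv feature maps; fc_weights: (C, K) classifier weights.
    Both names and shapes are assumptions made for this sketch."""
    # Weighted sum of the K feature maps with the weights of the chosen class -> (H, W).
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    # Min-max normalise to [0, 1] so the map can be visualised or thresholded.
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)

rng = np.random.default_rng(0)
features = rng.random((8, 5, 5))
fc_weights = rng.standard_normal((3, 8))
cam = class_activation_map(features, fc_weights, class_idx=1)
```

High values in `cam` mark the positions that contribute most to the score of the chosen class, which is why such maps tend to concentrate on the most discriminative parts.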
II-C Spatial Attention Mechanism
Spatial neural attention is a mechanism that assigns different weights to different spatial regions of a feature map depending on their feature content. During training for a specific task, it automatically predicts a weighted heat map that enhances the relevant features and blocks the irrelevant ones. Intuitively, such a weighted heat map can be applied to our purpose of pseudo-annotation generation.
The spatial neural attention mechanism has proven beneficial in many tasks such as image captioning [27, 11], machine translation, multi-label classification, human pose estimation and saliency detection. To the best of our knowledge, we are the first to apply the attention mechanism to weakly supervised semantic segmentation.
We first give a brief review of the conventional spatial neural attention model in Sec. III-A. We then introduce our decoupled attention structure in Sec. III-B and describe how to generate pseudo-annotations based on this decoupled attention model in Sec. III-C. Finally, the generated pseudo-annotations are used as ground-truth annotations to train segmentation DCNNs.
III-A Conventional Spatial Neural Attention
In this section, we give a brief introduction to the conventional spatial neural attention mechanism. The conventional spatial neural attention structure is illustrated in Fig. 2. The attention structure consists of two modules: the main branch module and the attention detector module. The attention detector module is trained jointly with the main branch module end-to-end.
Formally, we denote the output features of some convolutional/pooling layers by $X \in \mathbb{R}^{K \times H \times W}$. The attention detector takes the feature map $X$ as input and outputs a spatially normalized attention weight map $A \in \mathbb{R}^{H \times W}$. $A$ is applied to $X$ to obtain the attended feature $\hat{X}$. $\hat{X}$ is fed into the classification module to output the image classification score vector $p \in \mathbb{R}^{C}$, where $C$ is the number of classes.
For notational simplicity, we denote the 3-D feature output of DCNNs with upper-case letters and the feature at one spatial location with the corresponding lower-case letter. For example, $x_{i,j} \in \mathbb{R}^{K}$ is the feature at position $(i,j)$ of $X$. The attention detector outputs the attention weight map $A$, which acts as a spatial regularizer that enhances the relevant regions and suppresses the non-relevant regions of the feature $X$. This motivates us to use the output of the attention detector to generate pseudo-annotations for semantic segmentation with image-level labels. The attention detector module consists of a convolutional layer, a non-linear activation layer (Eq. (1)) and a spatial normalization (Eq. (2)), as shown in Fig. 2:
$$z_{i,j} = F\left(w^{\top} x_{i,j} + b\right), \tag{1}$$
$$a_{i,j} = \frac{z_{i,j}}{\sum_{i',j'} z_{i',j'}}, \tag{2}$$
where $F$ is a non-linear activation function, such as the exponential function in [11, 30], and $w$ and $b$ are the parameters of the attention detector, which is a convolutional layer. The attended feature is calculated as
$$\hat{x}_{i,j} = a_{i,j}\, x_{i,j}. \tag{3}$$
The classification module consists of a spatial average pooling layer and a convolutional layer as the image classifier. We denote by $w^{c}$ and $b^{c}$ the parameters of the classifier for class $c$. Since the attention weights sum to one over all spatial positions, summing the attended features yields an attention-weighted spatial average, and the classification score for class $c$ is calculated as
$$p^{c} = {w^{c}}^{\top} \sum_{i,j} \hat{x}_{i,j} + b^{c}, \tag{4}$$
where $p^{c}$ is the score for the $c$-th class.
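The conventional attention module described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the parameter names (`w`, `b`, `Wc`, `bc`), the use of the exponential as the activation, and the attention-weighted sum as the pooling step are choices made for this sketch rather than confirmed details of the implementation.

```python
import numpy as np

def conventional_attention(X, w, b, Wc, bc, F=np.exp):
    """X: (K, H, W) feature map; w: (K,), b: scalar attention parameters;
    Wc: (C, K), bc: (C,) classifier parameters. All names are illustrative."""
    K, H, W = X.shape
    x = X.reshape(K, H * W)            # one column per spatial position
    z = F(w @ x + b)                   # per-position linear map plus activation, (H*W,)
    a = z / z.sum()                    # spatial normalisation: weights sum to 1
    x_hat = x * a                      # attended feature at every position
    p = Wc @ x_hat.sum(axis=1) + bc    # pooled attended feature -> class scores, (C,)
    return a.reshape(H, W), p

rng = np.random.default_rng(0)
X = rng.random((6, 4, 5))
A, p = conventional_attention(X, rng.standard_normal(6), 0.1,
                              rng.standard_normal((3, 6)), np.zeros(3))
```

Note that the attention map `A` returned here is a single map shared by all classes, which is the class-agnostic property discussed in the following section.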
III-B Decoupled Attention Structure
As described in Sec. III-A, the output of the conventional spatial neural attention detector is class-agnostic. However, in semantic segmentation each image may contain objects of multiple classes. The conventional class-agnostic attention map is therefore not applicable to generating pseudo-annotations in the multi-class case, since we need to predict a pixel-wise label for each semantic class rather than just foreground/background. In addition, the conventional spatial neural attention detector is designed to assist image classification, so its output does not necessarily yield the desired pseudo-annotations for weakly-supervised semantic segmentation. Inspired by prior work, we propose a decoupled attention structure specifically for generating pseudo-annotations for weakly supervised semantic segmentation to alleviate these problems.
We illustrate our structure in Fig. 3. This structure extends the conventional attention structure to the multi-class case. Moreover, it generates two different types of attention maps, which identify object regions and predict the discriminative parts respectively. In Fig. 3, the attention map generated by the Expansive attention detector in the top branch is named the Expansive attention map and aims at identifying object regions. The attention map generated by the Discriminative attention detector in the bottom branch is named the Discriminative attention map and aims at mining the discriminative parts. These two attention maps have different properties and are complementary to each other in generating pseudo-annotations for weakly supervised semantic segmentation.
The details of the structure in Fig. 3 are described as follows:
where the superscript/subscript $c$ denotes the value of the $c$-th channel/class.
As we mentioned in Sec. I, for the task of pseudo-annotation generation we aim to estimate the whole range of objects instead of only obtaining the most discriminative parts. Thus we design the E-A detector details as follows: we set
where , similar to . Besides the convolutional layer in the attention detector, we add one drop-out layer right before and after it respectively. Such drop-out layers randomly zero-out the elements in the training features and consequently the attention detector will highlight more relevant features instead of just the most relevant one for successful classification.
The Discriminative attention detector (D-A detector) consists of a convolutional layer whose parameters are denoted in the same way as those of the classifier in Sec. III-A. The D-A detector takes the feature map X as input and outputs a class-specific attention map.
The attended feature is then fed into a spatial average pooling layer to generate the image classification score p.
Compared with Eq. (4), the D-A detector in our decoupled model predicts the class score at each dense pixel position instead of predicting an image-level label score. Thus our model retains the spatial information in the classification map, which is more suitable for semantic segmentation tasks. The multi-label classification loss for each image is formulated as:
where is the binary image label corresponding to -th class.
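To make the data flow of the decoupled structure concrete, the following NumPy sketch runs one forward pass through both branches. Several points are assumptions made only for illustration: the exponential activation in the Expansive branch, the omission of the drop-out layers, the attention-weighted sum used as pooling, the sigmoid cross-entropy form of the multi-label loss, and all parameter names (`We`, `be`, `Wd`, `bd`).

```python
import numpy as np

def decoupled_forward(X, We, be, Wd, bd, y):
    """X: (K, H, W); We, Wd: (C, K); be, bd: (C,); y: (C,) binary image labels.
    Hypothetical parameter names; drop-out layers are left out of this sketch."""
    K, H, W = X.shape
    x = X.reshape(K, H * W)
    # Expansive branch: one spatially normalised attention map per class.
    z = np.exp(We @ x + be[:, None])                   # (C, H*W)
    A_e = z / z.sum(axis=1, keepdims=True)
    # Discriminative branch: dense class scores at every position.
    S = Wd @ x + bd[:, None]                           # (C, H*W)
    # Attention-weighted pooling of the dense scores gives image-level scores.
    p = (A_e * S).sum(axis=1)                          # (C,)
    # Assumed loss: sigmoid cross-entropy on the logits, averaged over classes.
    loss = np.mean(np.logaddexp(0.0, p) - y * p)
    return A_e.reshape(-1, H, W), S.reshape(-1, H, W), p, loss

rng = np.random.default_rng(1)
X = rng.random((6, 4, 4))
y = np.array([1.0, 0.0, 1.0])
A_e, S, p, loss = decoupled_forward(X, rng.standard_normal((3, 6)), np.zeros(3),
                                    rng.standard_normal((3, 6)), np.zeros(3), y)
```

Because each class score is a convex combination of dense per-position scores, the spatial maps `A_e` and `S` remain available after training, which is what the pseudo-annotation step relies on.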
We show examples of the Expansive attention map and the Discriminative attention map in Fig. 4, which illustrates that Expansive attention maps perform well at identifying large object regions while Discriminative attention maps perform well at mining small discriminative parts. The two attention maps thus show different and complementary properties, and we merge them using Eq. (11).
In this merge, the Expansive and Discriminative attention maps corresponding to the $c$-th class are both normalized, and their combination is weighted by the softmax normalization of the image classification score p. This weighted combination is intuitive: a small prediction score usually corresponds to a difficult object of small size, so more weight should be put on mining the discriminative regions.
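One plausible reading of this score-weighted merge is sketched below. The exact form of Eq. (11) is not reproduced here, so the convex combination, the min-max normalization of each map, and the softmax taken over all classes are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def minmax(maps):
    """Min-max normalise each (H, W) map of a (C, H, W) stack to [0, 1]."""
    maps = maps - maps.min(axis=(1, 2), keepdims=True)
    return maps / (maps.max(axis=(1, 2), keepdims=True) + 1e-8)

def merge_attention(A_e, A_d, p):
    """A_e, A_d: (C, H, W) Expansive / Discriminative maps; p: (C,) class scores."""
    w = np.exp(p - p.max())
    w = (w / w.sum())[:, None, None]       # softmax of the scores, one weight per class
    # Assumed form: confident classes lean on the Expansive map,
    # low-scoring classes lean on the Discriminative map.
    return w * minmax(A_e) + (1.0 - w) * minmax(A_d)

A_e = np.arange(8, dtype=float).reshape(2, 2, 2)
A_d = A_e[:, ::-1, ::-1].copy()
M = merge_attention(A_e, A_d, np.array([20.0, -20.0]))
```

In this toy call the first class has a dominant score, so its merged map follows the Expansive map, while the second class falls back almost entirely on the Discriminative map.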
III-C Pseudo-annotation Generation
In this section we describe how to generate pseudo-annotations from the attention maps. We generate pseudo-annotations by simple thresholding, following a practice similar to that of [4, 6, 7]. Given the attention maps of an image and its image label set (excluding the background), we generate the foreground region of each class in the label set as follows: we first perform min-max spatial normalization on the attention map of the class to the [0,1] range and then threshold the normalized attention map at 0.2. Following prior work, we sum the feature map X over the channel dimension, perform min-max spatial normalization on the result, and threshold this normalized map at 0.3 to generate background regions. Since the regions are generated independently for each class, label conflicts may occur at some pixel positions. Following prior practice, we assign the foreground label with the smallest region size to the conflicting regions. Unclassified pixels are assigned a void label, which indicates that the label is unclear at this position; such pixels are not considered in the training process.
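The thresholding procedure above can be sketched as follows. The thresholds 0.2 and 0.3 come from the text; the void label id 255, the choice that low summed feature response means background, the rule that foreground overrides background, and the function names are assumptions made for this sketch.

```python
import numpy as np

VOID = 255  # assumed id for the void label

def minmax2d(m):
    m = m - m.min()
    return m / (m.max() + 1e-8)

def pseudo_annotation(att, X, labels, fg_thr=0.2, bg_thr=0.3):
    """att: (C, H, W) attention maps indexed by class id (index 0 unused);
    X: (K, H, W) feature map; labels: foreground class ids present in the image."""
    H, W = att.shape[1:]
    mask = np.full((H, W), VOID, dtype=np.int64)
    # Background: positions with weak summed feature response (assumed direction).
    mask[minmax2d(X.sum(axis=0)) < bg_thr] = 0
    # Foreground region of every present class by thresholding its normalised map.
    fg = {c: minmax2d(att[c]) > fg_thr for c in labels}
    # Paint larger regions first so that the smallest region wins at conflicts;
    # foreground overrides background here, which is an assumption.
    for c in sorted(labels, key=lambda c: fg[c].sum(), reverse=True):
        mask[fg[c]] = c
    return mask

att = np.zeros((3, 4, 4))
att[1][:, :2] = 1.0          # class 1 covers the left half
att[2][0, 0] = 1.0           # class 2 covers one pixel inside class 1's region
X = np.ones((1, 4, 4))
X[0, 3, 3] = 0.0             # weak response in one corner
mask = pseudo_annotation(att, X, labels=[1, 2])
```

In the toy example the conflicting pixel goes to class 2 because its region is smaller, the weak-response corner becomes background, and the remaining pixels stay void.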
The generated hard annotations are coarse and contain many unclear areas. We can further apply denseCRF to them to generate refined annotations. We first describe how to generate the class probability vector for each spatial location. For an image, we consider the labels that are present in the image, all the class labels in the target dataset, and the mask label of each pixel. The class probability for each class at each pixel is calculated from these quantities together with a manually fixed parameter and the number of labels present in the image. The unary potential is calculated from the probability vector. We apply denseCRF with this unary potential and take the resulting mask as the refined annotation.
IV-A Experimental Set-up
Dataset and Evaluation Metric
We evaluate our method on the PASCAL VOC 2012 image segmentation benchmark, which has 21 semantic classes including background. The original dataset has 1464 training images (train), 1449 validation images (val) and 1456 testing images (test). Following common practice, the training set is augmented to 10582 images. In our experiments, we only utilize the image-level class labels of the training images. We use the val images to evaluate our method. As the evaluation measure for segmentation performance, we use the standard PASCAL VOC 2012 segmentation metric: mean intersection-over-union (mIoU).
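For reference, mIoU can be computed from a confusion matrix as in the sketch below. The function name is hypothetical, the ignore value 255 follows the usual PASCAL VOC void convention, and skipping classes that never occur is a simplification chosen for this example.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21, ignore=255):
    """pred, gt: integer label arrays of the same shape."""
    valid = gt != ignore                                   # drop void pixels
    idx = num_classes * gt[valid].astype(np.int64) + pred[valid].astype(np.int64)
    conf = np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)                                  # correctly labelled pixels per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter    # predicted + ground truth - overlap
    present = union > 0                                    # skip classes that never occur
    return (inter[present] / union[present]).mean()

pred = np.array([[0, 1], [1, 1]])
gt = np.array([[0, 1], [0, 255]])
score = mean_iou(pred, gt, num_classes=2)
```

In a benchmark evaluation the confusion matrix would be accumulated over all images before the per-class ratios are taken.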
Training/Testing Settings We train the proposed decoupled attention network for pseudo-annotation estimation. Based on the generated pseudo-annotations we train the state-of-the-art semantic segmentation network to predict the final segmentation result.
We build the proposed decoupled attention model on a base network and transfer its layers, from the first layer up to an intermediate layer, as the starting convolutional layers (as shown in Fig. 3), which output the feature map X. We use a mini-batch size of 15 images with data augmentation such as random flipping and random scaling. We set the initial learning rate to 0.01 for the transferred layers and to 0.1 for the attention detector layers. We decrease the learning rate by a factor of 10 after 10 epochs. Training terminates after 20 epochs.
We train the VGG16-based DeepLab-LargeFOV as our segmentation model. The input image crops for the network are of size 321×321 and the output segmentation masks are of size 41×41. The initial base learning rate is 0.001 and it is decreased by a factor of 10 after 8 epochs. Training terminates after 12 epochs. We use the publicly available PyTorch implementation of DeepLab-LargeFOV (https://github.com/BardOfCodes/pytorchdeeplablargefov). In the inference phase of segmentation, we use multi-scale inference to combine the output scores at different scales, which is common practice as in [37, 3]. The final outputs are post-processed by denseCRF.
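The two step schedules described above (a tenfold decrease after 10 of 20 epochs for the attention model, and after 8 of 12 epochs for the segmentation model) can be written as one small helper. The function name and the zero-based epoch index are assumptions of this sketch.

```python
def learning_rate(epoch, base_lr, step=10, gamma=0.1):
    """Step schedule: multiply the base rate by gamma once every `step` epochs.
    `epoch` is zero-based."""
    return base_lr * gamma ** (epoch // step)

# Attention model: transferred layers (0.01) and attention detector layers (0.1), 20 epochs.
attention_schedule = [(learning_rate(e, 0.01), learning_rate(e, 0.1)) for e in range(20)]
# Segmentation model: base rate 0.001, decreased after 8 epochs, 12 epochs in total.
segmentation_schedule = [learning_rate(e, 0.001, step=8) for e in range(12)]
```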
IV-B Ablation Study of Decoupled Attention Model
IV-B1 Properties of attention maps
As described in Sec. III-B, our decoupled attention model outputs Expansive attention map to identify object regions and Discriminative attention map to localize the discriminative parts. We qualitatively and quantitatively compare these two attention maps generated by our decoupled attention structure and demonstrate their differences.
First we provide visual examples of the Expansive attention maps and Discriminative attention maps of train images in Fig. 5 (columns 2 and 3). The examples include images with single/multiple objects and single/multiple class labels. We observe that for simple cases (e.g., large objects), the Expansive attention map is likely to cover the whole region of the objects, while the Discriminative attention map is likely to locate the most discriminative part. For difficult cases (e.g., small objects), the Expansive attention map is likely to identify a broad region, while the Discriminative attention map is likely to localize the object precisely. We conclude that the two attention maps have different properties, are suitable for different situations, and are potentially complementary to each other. Combining the two maps therefore yields attention maps that are applicable to different cases and lead to object annotations of high quality.
We also quantitatively compare these two attention maps. We apply our decoupled attention model to the val images and generate annotation masks without denseCRF refinement from the Expansive attention map and the Discriminative attention map. We evaluate the estimated masks against the ground truth of the val images.
We use three evaluation measures for the estimated masks to demonstrate the different properties of the two attention maps. For class $c$, let $P^{c}$ denote the set of pixels assigned to $c$ in the estimated mask and $G^{c}$ the set of pixels labeled $c$ in the ground truth.
1) We use the common IoU to evaluate the performance of identifying the whole objects:
$$\mathrm{IoU}^{c} = \frac{|P^{c} \cap G^{c}|}{|P^{c} \cup G^{c}|}$$
2) We propose a criterion to evaluate the localization precision within the object region:
$$\mathrm{Precision}^{c} = \frac{|P^{c} \cap G^{c}|}{|P^{c}|}$$
3) We propose a criterion to evaluate the localization recall over the object region:
$$\mathrm{Recall}^{c} = \frac{|P^{c} \cap G^{c}|}{|G^{c}|}$$
Precision emphasizes localizing concentrated, small regions within the objects, which do not necessarily cover the whole object regions. Recall emphasizes the expansion of the highlighted regions by measuring whether they include the whole range of the objects, without requiring them to stay within the object regions. IoU emphasizes the accuracy of identifying the whole object regions and is the final criterion for the quality of an attention map for generating pseudo-annotations. We use the average scores over all classes, i.e., the mean IoU, mean precision and mean recall, as our criteria.
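The three measures discussed above can be computed for one class from a pair of binary masks, as in this short sketch. The function name and the convention of returning 0 for empty denominators are assumptions made for the example.

```python
import numpy as np

def mask_scores(pred, gt):
    """pred, gt: binary masks of one class. Returns (IoU, precision, recall)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 0.0
    precision = inter / pred.sum() if pred.sum() else 0.0   # how much of the estimate is on the object
    recall = inter / gt.sum() if gt.sum() else 0.0          # how much of the object is covered
    return iou, precision, recall

gt = np.zeros((4, 4), dtype=int)
gt[:2, :] = 1                       # object covers the top half (8 pixels)
pred = np.zeros((4, 4), dtype=int)
pred[0, :2] = 1                     # estimate covers 2 pixels inside the object
iou, precision, recall = mask_scores(pred, gt)
```

The toy estimate lies entirely inside the object, so it has perfect precision but low recall and IoU, which is the pattern expected from a map that only highlights a discriminative part.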
The evaluation results are shown in the first two columns of Table I. We observe that the Expansive attention map obtains higher IoU and recall scores, while the Discriminative attention map obtains higher precision. This verifies the different properties of the attention maps: the Discriminative attention map tends to localize partial interior object regions, while the Expansive attention map highlights regions of larger extent and is likely to identify the whole object region.
As described in Sec. III-B, we can merge these two attention maps to improve the quality of the generated pseudo-annotations. We merge the two attention maps as in Eq. (11) and evaluate the result with the same criteria. The results are shown in the last column of Table I. We observe that the IoU of the merged attention map is higher than that of either single type of attention map, which demonstrates that the Expansive attention map and the Discriminative attention map are complementary to each other in localizing the whole object regions. The pseudo-annotations generated from the merged attention map are relatively close to the ground truth, as shown in Fig. 5.
IV-B2 Evaluation of dropout layers
As described in Sec. III-B, we add drop-out layers in the Expansive attention detector module to obtain an expansion effect. In this section we evaluate and discuss the effect of the drop-out layers.
In our experiments we use the default drop-out rate of 0.5. We also evaluate other drop-out rates on the merged attention maps, following the same criteria as in Table I. The results are reported in Table II. They show that as the drop-out rate increases, recall consistently increases while precision consistently decreases. This demonstrates that increasing the drop-out rate helps expand the identified region from the interior parts to the entire object scale. We also visualize the attention maps generated with different drop-out rates in Fig. 6, which demonstrates the expansion effect that results from larger drop-out rates.
The effects of drop-out layers could be explained as follows: Drop-out layers randomly set the feature grids to zeros by a rate which consequently suppress the feature space by noise. Such zero-out process is operated directly on the CNN feature space at random spatial locations and feature dimensions. If the discriminative features are suppressed by drop-out process, the training process will mine other discriminative features for classification purpose. Thus the attention model will adapt to highlight more relevant features instead of the most discriminative ones.
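The zero-out operation discussed above can be illustrated with a minimal NumPy drop-out. The rescaling of the surviving entries (so-called inverted drop-out) is the usual convention in deep learning libraries and is an assumption here, as is the function name.

```python
import numpy as np

def dropout(features, rate, rng):
    """Zero each element of `features` independently with probability `rate`.
    Surviving elements are rescaled by 1 / (1 - rate) (inverted drop-out, assumed)."""
    keep = rng.random(features.shape) >= rate
    return features * keep / (1.0 - rate)

rng = np.random.default_rng(0)
features = np.ones((64, 16, 16))          # (channels, height, width)
dropped = dropout(features, rate=0.5, rng=rng)
zero_fraction = float((dropped == 0).mean())
```

Because zeros fall at random channels and spatial positions, a strongly discriminative response is sometimes missing during training, which pushes the classifier to rely on other relevant features as well.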
IV-B3 Evaluation of decoupled structure
In this section, we aim to show that our decoupled attention structure is effective. We remove the Expansive attention detector module and train only the remaining Discriminative attention detector module. We add one drop-out layer before and one after the convolutional layer of the Discriminative attention module. We then evaluate the Discriminative attention map following the same criteria as in Table I. The resulting scores on the three measures are 48.3, 49.7 and 31.5 respectively. These results are far lower than those of our decoupled attention model in Table I, which verifies that our decoupled attention structure is more effective than single-stream attention.
|AE w/o PSL |
Expansive attention map
|Discriminative attention map||52.2|
|merged attention map||55.4|
IV-C Comparisons with Region-Mining Based Methods
Our method is related to region-mining based weakly supervised semantic segmentation approaches. In this section, we compare with two recently proposed region-mining based approaches, i.e., SEC [4] and Adversarial-Erasing (AE) [6]. Both SEC and AE use the region-mining method CAM [7] to generate pseudo-annotations. CAM [7] mines the discriminative object regions from image-level labels, which relates it to our attention based method.
SEC [4] optimizes three losses: the seeding loss is the segmentation loss based on pseudo-annotations, the expansion loss expands the localization seeds to object scale, and the constrain-to-boundary loss makes the segmentation results agree with the denseCRF result. AE [6] iteratively mines the object regions over multiple iteration steps by erasing the most discriminative regions from the original images and re-training the CAM localization network to mine the next most discriminative regions. The mined regions of all the iterations are combined as the final pseudo-annotations. We list the segmentation results of the intermediate steps of SEC and AE in Table III. For SEC, we show the segmentation results with different combinations of losses. For AE, we list the segmentation results using different numbers of erasing steps without prohibitive segmentation learning (PSL).
For our method, we list the segmentation results using the pseudo-annotations generated from the Expansive attention map, the Discriminative attention map and the merged attention map in Table III. Since the merged attention map achieves better performance than the other two attention maps, we use the segmentation results generated from the merged attention map in further experiments by default. We outperform SEC and AE w/o PSL under various settings, which indicates that our attention based region-mining approach is significantly more effective than the CAM based mining methods. Moreover, our method employs a simple pipeline without complicated iterative training, in contrast with AE.
|What’s the point ||46.1||point annotation|
|BoxSup ||62.0||Box annotation|
|with external source|
|Co-segmentation ||56.4||Web images|
|Webly-supervised ||53.4||Web images|
|Crawled-Video ||58.1||Web videos|
|TransferNet ||52.1||MSCOCO  pixelwise label|
|AF-MCG ||54.3||MCG proposal|
|Joon et al. ||55.7||Supervised Saliency|
|DCSP-VGG16 ||58.6||Supervised Saliency|
|DCSP-ResNet-101 ||60.8||Supervised Saliency|
|Mining-pixels ||58.7||Supervised Saliency|
AE w/o PSL 
|AE-PSL ||55.0||Supervised Saliency|
|w/o external source|
|Anirban et al. ||52.8|
|BoxSup ||64.2||Box annotation|
|with external source|
|Co-segmentation ||56.9||Web images|
|Webly-supervised ||55.3||Web images|
|Crawled-Video ||58.7||Web videos|
|TransferNet ||51.2||MSCOCO  pixelwise label|
|AF-MCG ||55.5||MCG proposal|
|Joon et al. ||56.7||Supervised Saliency|
|DCSP-VGG16 ||59.2||Supervised Saliency|
|DCSP-ResNet-101 ||61.9||Supervised Saliency|
|Mining-pixels ||59.6||Supervised Saliency|
|AE w/o PSL ||Supervised Saliency|
|AE-PSL ||55.7||Supervised Saliency|
|w/o external source|
|Anirban et al. ||53.7|
IV-D Comparisons with the State of the Art
In this section, we compare our segmentation results with other weakly supervised semantic segmentation methods. The comparison results are listed in Table IV and Table V for val and test images respectively. We divide weakly supervised semantic segmentation into three categories based on different levels of weak supervisions and whether additional information (data/supervision) is implicitly used in their pipeline:
In the first category, the methods utilize interactive input such as scribbles or points as supervision, which are relatively more precise indicators of object location and scale. The results of this category are listed in the first block of Table IV and Table V.
In the second category, the methods only use image-level labels as supervision but introduce information from external sources to improve segmentation results. The segmentation performance of this category is listed in the second block of Table IV and Table V. Some works use additional data to assist training. For example, STC crawled 50K web images for the initial training step and Crawled-Video crawled a large number of online videos, which significantly increases the amount of training data. Some works implicitly utilize pixel-wise annotations in their training. For example, TransferNet transfers the pixel-wise annotations from the MSCOCO dataset to the PASCAL dataset, and AF-MCG utilizes the MCG proposal method, which is trained with PASCAL VOC pixel-wise ground truth. Other works utilize fully-supervised saliency detection methods for localization seed expansion or foreground/background detection [22, 6], which implicitly uses external saliency ground-truth data to train the saliency detector.
In the third category, the methods only use image-level labels as supervision without information from external sources. Our method belongs to this category. The segmentation performance of this category is listed in the third block of Table IV and Table V.
We mainly focus on comparing with the methods in the third category. We achieve mIoU scores of 55.4 and 56.4 on the val and test images respectively, which significantly outperforms other methods using the same supervision setting. To further improve our segmentation results, we use a Deeplab-like model based on ResNet-101 as the segmentation model. We achieve an mIoU score of 58.2 on the val images and 60.1 on the test images, which outperforms all existing weakly supervised semantic segmentation methods using the same level of supervision to date. We also list our segmentation results for each class in Table VI for reference and show qualitative examples of the segmentation results in Fig. 7.
Moreover, we emphasize that we do not iteratively train the models in multiple steps. The whole pipeline only needs to train the decoupled attention structure and the segmentation structure once on the PASCAL VOC train data. To our knowledge, our approach has the simplest pipeline for weakly-supervised semantic segmentation.
In this work, we have presented a novel decoupled spatial neural attention network to generate pseudo-annotations for weakly-supervised semantic segmentation with image-level labels. This decoupled attention model simultaneously outputs two class-specific attention maps with different properties, which are effective for estimating pseudo-annotations. We perform detailed ablation experiments to verify the effectiveness of our decoupled attention structure. Finally, we achieve state-of-the-art weakly-supervised semantic segmentation results on the PASCAL VOC 2012 image segmentation benchmark.
This research was carried out at the Rapid-Rich Object Search (ROSE) Lab at the Nanyang Technological University, Singapore. The ROSE Lab is supported by the National Research Foundation, Singapore, under its IDM Futures Funding Initiative and administered by the Interactive and Digital Media Programme. We also thank NVIDIA for their donation of GPU.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
-  L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” in ICLR, 2015.
-  G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path refinement networks for high-resolution semantic segmentation,” in CVPR, 2017.
-  A. Kolesnikov and C. H. Lampert, “Seed, expand and constrain: Three principles for weakly-supervised image segmentation,” in ECCV, 2016.
-  S. J. Oh, R. Benenson, A. Khoreva, Z. Akata, M. Fritz, and B. Schiele, “Exploiting saliency for object segmentation from image level labels,” in CVPR, 2017.
-  Y. Wei, J. Feng, X. Liang, M. Cheng, Y. Zhao, and S. Yan, “Object region mining with adversarial erasing: A simple classification to semantic segmentation approach,” in CVPR, 2017.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in CVPR, 2016.
-  J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top-down neural attention by excitation backprop,” in ECCV, 2016.
-  R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra, “Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization,” in ICCV, 2017.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in CVPR, 2016.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015.
-  H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in ICCV, 2015.
-  T. Shen, G. Lin, C. Shen, and I. Reid, “Learning multi-level region consistency with dense multi-label networks for semantic segmentation,” in IJCAI, 2017.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR, 2017.
-  L. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image segmentation,” in CVPR, 2016.
-  S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” in ICCV, 2015.
-  X. Li, Z. Liu, P. Luo, C. C. Loy, and X. Tang, “Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade,” in CVPR, 2017.
-  G. Lin, C. Shen, A. Van Den Hengel, and I. Reid, “Efficient piecewise training of deep structured models for semantic segmentation,” in CVPR, 2016.
-  D. Pathak, E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional multi-class multiple instance learning,” in ICLR Workshop, 2015.
-  P. O. Pinheiro and R. Collobert, “From image-level to pixel-level labeling with convolutional networks,” in CVPR, 2015.
-  G. Papandreou, L. Chen, K. Murphy, and A. L. Yuille, “Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation,” in ICCV, 2015.
-  Y. Wei, X. Liang, Y. Chen, X. Shen, M. Cheng, Y. Zhao, and S. Yan, “Stc: A simple to complex framework for weakly-supervised semantic segmentation,” TPAMI, 2016.
-  Q. Hou, P. K. Dokania, D. Massiceti, Y. Wei, M. Cheng, and P. Torr, “Mining pixels: Weakly supervised semantic segmentation using image labels,” arXiv preprint arXiv:1612.02101, 2016.
-  T. Shen, G. Lin, L. Liu, C. Shen, and I. Reid, “Weakly supervised semantic segmentation based on co-segmentation,” 2017.
-  D. Kim, D. Yoo, I. S. Kweon et al., “Two-phase learning for weakly supervised object localization,” in ICCV, 2017.
-  N. Liu and J. Han, “Dhsnet: Deep hierarchical saliency network for salient object detection,” in CVPR, 2016.
-  Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in CVPR, 2016.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
-  F. Zhu, H. Li, W. Ouyang, N. Yu, and X. Wang, “Learning spatial regularization with image-level supervisions for multi-label image classification,” in CVPR, 2017.
-  X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, “Multi-context attention for human pose estimation,” in CVPR, 2017.
-  J. Kuen, Z. Wang, and G. Wang, “Recurrent attentional networks for saliency detection,” in CVPR, 2016.
-  B. Zhuang, L. Liu, Y. Li, C. Shen, and I. Reid, “Attend in groups: a weakly-supervised deep learning framework for learning from web data,” in CVPR, 2017.
-  F. Saleh, M. S. A. Akbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez, “Built-in foreground/background prior for weakly-supervised semantic segmentation,” in ECCV, 2016.
-  P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in NIPS, 2011.
-  M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” IJCV, vol. 88, no. 2, 2010.
-  B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in ICCV, 2011.
-  C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters–improve semantic segmentation by global convolutional network,” in CVPR, 2017.
-  D. Lin, J. Dai, J. Jia, K. He, and J. Sun, “Scribblesup: Scribble-supervised convolutional networks for semantic segmentation,” in CVPR, 2016.
-  A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei, “What’s the point: Semantic segmentation with point supervision,” in ECCV, 2016.
-  J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in ICCV, 2015.
-  B. Jin, M. V. Ortiz Segovia, and S. Susstrunk, “Webly supervised semantic segmentation,” in CVPR, 2017.
-  S. Hong, D. Yeo, S. Kwak, H. Lee, and B. Han, “Weakly supervised semantic segmentation using web-crawled videos,” 2017.
-  S. Hong, J. Oh, B. Han, and H. Lee, “Learning transferrable knowledge for semantic segmentation with deep convolutional neural network,” in CVPR, 2016.
-  T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
-  X. Qi, Z. Liu, J. Shi, H. Zhao, and J. Jia, “Augmented feedback in semantic segmentation under image level supervision,” in ECCV, 2016.
-  A. Chaudhry, P. K. Dokania, and P. H. Torr, “Discovering class-specific pixels for weakly-supervised semantic segmentation,” in BMVC, 2017.
-  W. Shimoda and K. Yanai, “Distinct class-specific saliency maps for weakly supervised semantic segmentation,” in ECCV, 2016.
-  A. Roy and S. Todorovic, “Combining bottom-up, top-down, and smoothness cues for weakly supervised image segmentation,” in CVPR, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.