Learning Pixel-wise Labeling from the Internet without Human Interaction

05/19/2018 ∙ by Yun Liu, et al. ∙ Nankai University

Deep learning stands at the forefront of many computer vision tasks. However, deep neural networks are usually data-hungry and require a huge amount of well-annotated training samples. Collecting sufficient annotated data is very expensive in many applications, especially for pixel-level prediction tasks such as semantic segmentation. To address this fundamental issue, we consider a new challenging vision task, Internetly supervised semantic segmentation, which uses only Internet data, with the noisy image-level supervision of the corresponding query keywords, for segmentation model training. We address this task with the following solution. A class-specific attention model unifying multiscale forward and backward convolutional features is proposed to provide initial segmentation "ground truth". The model trained with such noisy annotations is then improved by an online fine-tuning procedure. It achieves state-of-the-art performance under the weakly-supervised setting on the PASCAL VOC2012 dataset. The proposed framework also paves a new way towards learning from the Internet without human interaction and can serve as a strong baseline. Code and data will be released upon paper acceptance.







1 Introduction

Deep neural networks (DNNs) have been shown useful simonyan2014very ; ren2015faster ; long2015fully for many computer vision tasks, but they are still limited by the need for large-scale well-annotated datasets for network training. Manual labeling is costly, time-consuming, and requires massive human intervention for every new task, which is often impractical, especially for pixel-wise prediction tasks such as semantic segmentation. On the other hand, multimedia (e.g., images with user tags) on the Internet is growing rapidly, so it is natural to consider training deep networks with data from the Internet. While some progress in this direction has been achieved by using Internet data as an extra training set together with well-annotated datasets wang2017learning ; hong2017weakly ; jin2017webly , how to automatically learn semantic pixel-wise labeling from the Internet without any human interaction has not been explored.

To address this data shortage problem, we propose a principled learning framework for semantic segmentation, which aims at assigning a semantic category label to each pixel in an image. We are particularly interested in utilizing the Internet as the only supervision source for training DNNs to segment arbitrary target semantic categories, without requiring any additional human annotation effort. To this end, we first present this new problem, called Internetly supervised semantic segmentation, in Section 2. Specifically, unlike previous weakly supervised semantic segmentation, our new task forbids the supervision of human-cleaned image tags papandreou2015weakly ; pathak2015constrained ; pathak2014fully ; pinheiro2015image , bounding boxes dai2015boxsup ; hu2017learning , and auxiliary cues (e.g. saliency maps wei2017stc ; wei2017object , edges qi2016augmented ; pinheiro2015image , attention hou2017bottom ; saleh2016built ) that are trained with strong supervision.

Compared with previous weakly-supervised semantic segmentation kolesnikov2016seed ; qi2016augmented ; saleh2016built ; hou2017bottom ; shimoda2016distinct ; wei2017object ; wei2016learning ; wei2017stc ; jin2017webly , which is limited to pre-defined categories due to the limitation of human-annotated training data, Internetly supervised semantic segmentation can learn to segment arbitrary semantic categories. Moreover, the accuracy of previous weakly supervised semantic segmentation heavily depends on auxiliary cues (e.g. saliency, edges, and attention) that are trained with strong supervision such as pixel-accurate label maps or human-annotated tags, whereas our Internetly supervised segmentation method requires none of them. On the other hand, the Internetly supervised task is partially similar to unsupervised learning because both aim at learning to describe hidden structures from freely available data. Unsupervised learning usually uses unlabeled videos or raw images to learn edges li2016unsupervised , foreground masks pathak2017learning , video representations srivastava2015unsupervised , etc., and it cannot learn semantic information with multiple categories. Since more free information (i.e. the Internet tags/texts) is used in our Internetly supervised task, it can learn pixel-wise semantic labeling.

We search and download images from Flickr (https://www.flickr.com/) using the tags of target categories. Thus each target category can correspond to a large number of noisy images that may contain target objects. We use a simple filtering strategy to clean the crawled Internet images, and an image classification network szegedy2015going is subsequently trained using the de-noised data. We also propose a new class-specific attention model that unifies multiscale forward and backward convolutional feature maps of the classification network to obtain high-quality attention maps. These attention maps are converted to “ground truth” using a trimap strategy. The generated “ground truth” is used as the supervision to train the semantic segmentation network and obtain the initial segmentation model. Then, an online fine-tuning procedure is proposed to improve the initial model.

In summary, our contributions include:

  • We introduce a new challenging vision task, Internetly supervised semantic segmentation, to learn pixel-wise labeling from the Internet without any human interaction.

  • We propose a robust attention model that generates class-specific attention maps by unifying multiscale forward and backward feature maps of the image classification networks. Those maps are further refined by a trimap strategy to provide initial “ground truth” for the training of semantic segmentation.

  • We propose an online fine-tuning method to improve the initially trained model, so that the final model can perform well although trained from noisy image-level supervision.

We numerically compare our proposed method with weakly supervised methods that only use image-level supervision on the PASCAL VOC2012 dataset pascal-voc-2012 . Our method achieves state-of-the-art performance and even outperforms these competitors when they are trained with human-annotated image tags and PASCAL VOC images.

2 Problem Setup

We first define the new Internetly supervised semantic segmentation task.

Definition 1.

Internetly supervised semantic segmentation only uses Internet data with noisy image-level supervision to learn to perform semantic segmentation, without any human interaction. Internet data can be collected using various search engines (e.g. Google, Bing, Baidu, and Flickr) or web crawling techniques, but only category information can be used in the search process.

Remark 1.

In this task, no human interaction is allowed, which makes it more challenging than existing segmentation tasks. For example, one cannot manually clean the collected noisy data wei2017stc , and cannot use other human-annotated datasets wang2017learning ; hong2017weakly ; jin2017webly , e.g. to train auxiliary cues such as saliency, edge, object proposal, and attention models kolesnikov2016seed ; qi2016augmented ; saleh2016built ; hou2017bottom ; shimoda2016distinct ; wei2017object ; wei2016learning ; wei2017stc ; jin2017webly , or to obtain ImageNet pre-trained models papandreou2015weakly ; pathak2015constrained ; pathak2014fully ; pinheiro2015image ; kolesnikov2016seed ; qi2016augmented ; saleh2016built ; hou2017bottom ; shimoda2016distinct ; wei2017object ; wei2016learning ; wei2017stc ; jin2017webly . The only available information is the noisy Internet data.

The goal of unsupervised learning is to learn knowledge from free data such as unannotated videos and raw images. Similar to unsupervised learning, the proposed new task also aims at learning from free data (i.e. Internet images). This new task is also related to weakly supervised semantic segmentation. Weakly supervised methods can be roughly divided into three categories according to their supervision levels. Methods in the first category papandreou2015weakly ; pathak2015constrained ; pathak2014fully ; pinheiro2015image only use image-level labels, which is the simplest supervision. Methods in the second category kolesnikov2016seed ; qi2016augmented ; saleh2016built ; hou2017bottom ; shimoda2016distinct ; wei2017object ; wei2016learning ; wei2017stc ; jin2017webly not only use image-level labels but also strongly-supervised auxiliary cues, e.g. saliency, edges/boundaries, attention, or object proposals. Methods in the third category dai2015boxsup ; lin2016scribblesup ; bearman2016s ; vernaza2017learning ; hu2017learning use coarser annotations, such as scribbles lin2016scribblesup , bounding boxes dai2015boxsup ; hu2017learning , and instance points bearman2016s .

Compared to the aforementioned weakly-supervised semantic segmentation, Internetly supervised semantic segmentation uses the least supervision: only noisy Internet data. Besides, papandreou2015weakly ; pathak2015constrained ; pathak2014fully ; pinheiro2015image use carefully annotated datasets with image tags, so they are limited to a few pre-defined semantic categories. Internetly supervised semantic segmentation, by contrast, searches for images with target category tags on the Internet, and thus can learn to segment objects of arbitrary categories. Although these image-level supervision based methods can be directly applied to Internet data, our experiments in Section 4 show that they struggle on the noisy data.

kolesnikov2016seed ; qi2016augmented ; saleh2016built ; hou2017bottom ; shimoda2016distinct ; wei2017object ; wei2016learning ; wei2017stc ; jin2017webly have significantly improved segmentation performance using various auxiliary cues. However, they are limited to only a few semantic categories. Moreover, their results heavily rely on the accuracy of the adopted auxiliary cues. Besides, since these methods usually use different strategies to generate ground truth with different auxiliary cues and datasets, it is unclear how each component (i.e. auxiliary cues, ground-truth generation methods, learning approaches, and adopted datasets) contributes to the final performance. For example, Wei et al. wei2017stc use saliency maps generated by a non-deep algorithm jiang2013salient , while Hou et al. hou2017bottom use both saliency maps generated by a deep learning based method hou2017deeply and attention maps generated by zhang2016top . Internetly supervised semantic segmentation is advantageous in this case, because it not only advocates more intelligent systems that utilize massive Internet data with minimal human effort, but also provides a uniform testbed to re-gauge the state of the art.

dai2015boxsup ; lin2016scribblesup ; bearman2016s ; vernaza2017learning ; hu2017learning use sparse annotations as the supervision to reduce the cost of human interaction. But Internetly supervised semantic segmentation aims at fully automatic learning systems, not semi-supervised ones. With target categories as the only inputs, automatically learning pixel-wise knowledge from the Internet is more useful in many practical applications and also more consistent with the long-term goal of artificial intelligence. According to the definition of Internetly supervised semantic segmentation, two open problems exist: (i) how to de-noise the Internet data; (ii) how to learn pixel-wise knowledge with only noisy image-level supervision. Hence it is a more challenging task than the previous weakly-supervised task.

3 Our Approach

Above we establish the new task of Internetly supervised semantic segmentation and analyze its differences from previous weakly-supervised semantic segmentation. In this section, we propose our approach to this new task. Note that the key point of a possible solution is how to learn from noisy data. To this end, we first introduce our class-specific attention model that can generate attention maps with noisy image labels. Then, an online fine-tuning process is further employed to improve the segmentation model. The whole system and implementation details are provided at the end.

3.1 Class-specific Attention Model

Some class-specific attention models zeiler2014visualizing ; simonyan2014deep ; zhou2016learning ; zhang2016top have been proposed to find neural attention regions using image classification networks. Neural attention regions usually cover discriminative objects or object parts in an input image and thus can be viewed as coarse masks for a specific category. For Internetly supervised semantic segmentation, it is more challenging to find discriminative regions because the associated tags are highly noisy. For example, an image obtained with the search tag “dog” may not contain a dog at all, and an image obtained with the search tag “bicycle” may contain not only bicycles but also riders. The attention model should be robust to such cases. Besides, we expect the computed attention maps to cover complete objects instead of only the most discriminative object parts wei2017object , e.g. a complete person rather than only his/her face. We address these two challenges by developing the following new model.

It is widely accepted that the bottom layers of DNNs contain fine details of an image but less discriminative representations, while the top layers have abstract discriminative representations but fewer fine details. Many network architectures xie2015holistically ; liu2017richer ; lin2017feature have been proposed to fuse bottom and top features for various vision tasks. This idea is consistent with our goal of estimating the attention maps of complete objects. However, it is non-trivial to locate objects directly using top/bottom features. The features generated by bottom layers usually capture representations of textures and edges, so it is highly challenging to find discriminative objects using these features alone. We therefore unify multiscale information from the forward and backward passes of a classification network and propose a new attention model that works well for localizing objects in our Internet learning system.

Formally, suppose we have a dataset of pairs (I, y), where I represents an Internet image and y ∈ {1, 2, …, K} is its corresponding noisy category label coming from the set of K classes. Our goal is to estimate the semantic segmentation mask for image I. With Internet data, we can learn a function f representing a ConvNet, i.e. a non-linear composition function that consists of multiple levels of a hierarchy indexed by l, where each level of the hierarchy consists of commonly used operators such as convolution and pooling. More formally, given an image I, f(I) is defined as

f(I) = f_L(⋯ f_2(f_1(I; θ_1); θ_2) ⋯ ; θ_L),    (1)

where f_l is a network layer with learnable parameters θ_l (for some layers that do not have learnable parameters, θ_l = ∅), and a_l^c represents the c-th channel or a neuron at the l-th layer. At the lowest level, the input to f_1 consists of images from the Internet. Our proposed framework is generic, and in this study f is embodied by a non-pretrained GoogLeNet szegedy2015going whose last three layers are global average pooling, fully connected, and softmax. Suppose the input feature tensor of the pooling layer is F ∈ R^{C×W×H}, where C is the number of channels and W/H represents the width/height. The top features F are used to compute the attention scores at a coarse resolution as in zhou2016learning . Hence the output of global pooling is

g_c = (1 / (W·H)) Σ_{i=1}^{W} Σ_{j=1}^{H} F_c(i, j),  c = 1, 2, …, C.    (2)
At the fully connected layer, if we ignore the bias term, the output can be formulated as

s_k = Σ_{c=1}^{C} w_{k,c} · g_c,  k = 1, 2, …, K,    (3)

in which w_{k,c} represents the weights of the fully connected layer and K is the number of target categories. So the probability of predicting the current input image as category k is

p_k = exp(s_k) / Σ_{k'=1}^{K} exp(s_{k'}).    (4)

From Eqn. (2) to Eqn. (4), one can find that the importance of the activation at position (i, j) when classifying the input image into category k is proportional to

A_k^f(i, j) = Σ_{c=1}^{C} w_{k,c} · F_c(i, j).    (5)

We use A_k^f to represent the forward attention scores, which aim at finding coarse locations that have large activations for class k.
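The forward attention above is essentially the class activation mapping of zhou2016learning . The following is a minimal NumPy sketch; the function names and array layout are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def forward_attention(features, fc_weights, k):
    """Class-specific forward attention: weight each channel of the last
    conv feature map by the fully-connected weight tying it to class k.

    features:   (C, W, H) feature tensor before global average pooling
    fc_weights: (K, C) fully-connected weights (bias ignored)
    k:          target class index
    Returns a (W, H) coarse attention map A_k^f.
    """
    # A_k^f(i, j) = sum_c w_{k,c} * F_c(i, j)
    return np.tensordot(fc_weights[k], features, axes=([0], [0]))

def class_scores(features, fc_weights):
    """Global average pooling, fully-connected layer, then softmax."""
    g = features.mean(axis=(1, 2))   # (C,) pooled features
    s = fc_weights @ g               # (K,) class scores
    p = np.exp(s - s.max())          # numerically stable softmax
    return p / p.sum()
```

A useful sanity check of this computation: the spatial mean of the attention map for class k equals the class score s_k, since global average pooling followed by a linear layer commutes with the per-pixel weighting.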

To obtain bottom attention features, a backward operation is performed to explore the importance of the activations at a high resolution zhang2016top . The computation of common neurons can be written as a_i = Σ_j w_{ji} · a_j (the bias term can be absorbed into w_{ji}), in which a_j is an input of a_i and w_{ji} is the weight. If the child node set of a_i (in top-down order) is C_i and w_{ji} ≥ 0, the probability P(a_j | a_i) is defined as

P(a_j | a_i) = Z_i · a_j · w_{ji},    (6)

where Z_i is the normalization term so that we have Σ_{a_j ∈ C_i} P(a_j | a_i) = 1. According to the full probability formula, we have

P(a_j) = Σ_{a_i ∈ P_j} P(a_j | a_i) · P(a_i),    (7)

in which we assume P_j is the parent node set of a_j (in top-down order). At the output layer of the image classification network, we set the probabilities to a one-hot vector. Specifically, since we have a noisy label for each image, P(a_k) = 1 if an image is considered to belong to category k, and P(a_k) = 0 otherwise. Thus we can obtain a backward feature map A^b, each neuron of which is computed using Eqn. (6) and Eqn. (7) in a top-down order.
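One top-down step of this probabilistic backward pass can be sketched for a single fully-connected layer; restricting to one layer type and the function name are illustrative assumptions, since the rule is applied layer by layer through the whole network:

```python
import numpy as np

def excitation_backprop_step(prob_top, activations, weights):
    """One top-down step of excitation backprop for a fully-connected
    layer a_top = weights @ a_bottom (bias absorbed into weights).

    prob_top:    (M,) probabilities P(a_i) of the top (parent) neurons
    activations: (N,) bottom activations a_j (non-negative, e.g. post-ReLU)
    weights:     (M, N) connection weights w_{ji}
    Returns (N,) probabilities P(a_j) of the bottom (child) neurons.
    """
    # Only excitatory connections (w >= 0) propagate probability.
    w_pos = np.clip(weights, 0.0, None)
    contrib = w_pos * activations[None, :]      # a_j * w_{ji}, shape (M, N)
    z = contrib.sum(axis=1, keepdims=True)      # normalization Z_i per parent
    cond = np.divide(contrib, z, out=np.zeros_like(contrib), where=z > 0)
    # P(a_j) = sum_i P(a_j | a_i) * P(a_i)   (full probability formula)
    return cond.T @ prob_top
```

Because each parent distributes its probability mass over its children and the conditionals are normalized, the total probability is conserved at each step, which keeps the backward map comparable across layers.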

Since the bottom layers usually contain fine details but also more noise, we use backward attention features from different layers. Specifically, we use the layers conv2/norm2 and inception_3b/output of GoogLeNet szegedy2015going , and the corresponding backward attention features are denoted as A_1^b and A_2^b, respectively. The top features from the forward pass and the bottom features from the backward pass are fused as follows:

A_k = α · Up(A_k^f) + β · Up(A_1^b) ∘ Up(A_2^b),    (8)

in which Up(·) upsamples a feature map to the input image resolution using bilinear interpolation, and ∘ denotes element-wise multiplication. α and β are factors to balance the forward and backward features, and are both set to 1 here. In Eqn. (8), we multiply A_1^b and A_2^b to provide the bottom features of the backward pass, which are then added to the forward features. The rationale is that A_1^b and A_2^b contain a lot of noise, and the multiplication in this case is likely to reduce false alarms, as illustrated in Figure 1. On the other hand, A_k^f has a low resolution and usually has large activations on the most discriminative object parts, so the add operation is employed to emphasize the discriminative object parts.
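The fusion of the forward and backward maps can be sketched as follows; nearest-neighbor upsampling stands in for the bilinear interpolation used in the paper, and the function names and default balance factors are our own:

```python
import numpy as np

def fuse_attention(a_fwd, a_b1, a_b2, out_shape, alpha=1.0, beta=1.0):
    """Fuse forward and backward attention maps: multiply the two backward
    maps to suppress their individual false alarms, then add the upsampled
    forward map to emphasize the most discriminative object parts."""
    def upsample(m, shape):
        # Nearest-neighbor upsampling for brevity (bilinear in the paper).
        ri = np.arange(shape[0]) * m.shape[0] // shape[0]
        ci = np.arange(shape[1]) * m.shape[1] // shape[1]
        return m[np.ix_(ri, ci)]
    return (alpha * upsample(a_fwd, out_shape)
            + beta * upsample(a_b1, out_shape) * upsample(a_b2, out_shape))
```

The multiplication acts as a soft logical AND between the two backward maps: a false alarm must appear in both layers to survive, which is unlikely when their noise is uncorrelated.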

Figure 1: An example of our attention model. From left to right of the top row: original image I, forward feature map A_k^f, and backward feature maps A_1^b and A_2^b. From left to right of the bottom row: Up(A_1^b) ∘ Up(A_2^b), attention map A_k, attention map smoothed with image segments, and proxy “ground truth”. The noisy label here is “horse”. In the bottom right figure, white pixels represent the ignored region, and purple pixels belong to the horse.

After obtaining the fused attention map A_k for class k, a segment-based smoothing is performed. For the image segmentation, Li et al. li2016unsupervised recently introduced an unsupervised edge detector. We use pont2017multiscale to convert the unsupervised edges into unsupervised image segments. Note this does not violate our “without human interaction” assumption because the edge-segment converter of pont2017multiscale is unsupervised. Given an image, suppose the set of all segments is 𝒮. The smoothing operation can be formulated as

Ā_k(p) = Σ_{S ∈ 𝒮} 1(p ∈ S) · (1 / |S|) Σ_{q ∈ S} A_k(q),    (9)

where 1(·) is the indicator function, i.e. each pixel p takes the average attention value of the segment it falls in. A trimap strategy is then applied to convert Ā into the estimation of ground truth G for image I:

G(p) = y,   if Ā_y(p) ≥ τ_h;
G(p) = 0,   if Ā_y(p) ≤ τ_l;
G(p) = 255, otherwise,    (10)

where τ_l < τ_h are two fixed thresholds. Since image I has the noisy label y, only the category y of Ā is considered in Eqn. (10). In the training process, pixels of G with the value 255 are ignored. Thus Eqn. (10) uses pixels with confident labels for training, but ignores pixels with uncertain labels. An example of our attention model is shown in Figure 1.
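The segment smoothing and trimap steps can be sketched together, assuming precomputed integer segment ids; the threshold defaults mirror the values reported in Section 4.1, and the function name is ours:

```python
import numpy as np

def smooth_and_trimap(attn, segments, label, t_low=0.5, t_high=0.65):
    """Segment smoothing followed by the trimap strategy.

    attn:     (H, W) fused attention map for the image's noisy label
    segments: (H, W) integer segment ids from unsupervised segmentation
    label:    noisy category label of the image
    Returns the proxy ground truth: `label` where smoothed attention is
    confidently high, 0 (background) where it is confidently low, and
    255 (ignored during training) in between.
    """
    smoothed = np.empty_like(attn)
    for s in np.unique(segments):
        mask = segments == s
        smoothed[mask] = attn[mask].mean()   # average attention per segment
    gt = np.full(attn.shape, 255, dtype=np.uint8)   # uncertain -> ignored
    gt[smoothed >= t_high] = label                  # confident foreground
    gt[smoothed <= t_low] = 0                       # confident background
    return gt
```

Averaging inside segments snaps the attention map to object boundaries, so a whole segment is either kept, discarded, or ignored, rather than being cut arbitrarily by a pixel-wise threshold.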

3.2 Online Fine-tuning Algorithm

Using the aforementioned estimation of ground truth, we can train an initial model for semantic segmentation. To further improve performance, we propose an online fine-tuning algorithm. For the training of the initial model, we use a subset of the downloaded Internet data; the rest of the data is used for fine-tuning. Our motivation is that the attention maps and the initial model may not perform well on specific images, but will generate complementary information across different images. Besides, the massive Internet data provides a nearly unlimited space in which to search for complementary information and a better solution.

Suppose the semantic segmentation network has weights W_s and the image classifier of GoogLeNet szegedy2015going has weights W_c. Given a new image set, for each image I with corresponding image label y, we input the image into the semantic segmentation network. We compute a mask M from the segmentation results as follows:

M(i, j) = 1(argmax_k p_k(i, j) = y),    (11)

in which p_k(i, j) is the probability of the k-th category at position (i, j). Then we compute the element-wise multiplication I ∘ M and feed I ∘ M into the image classification network. If

P(y | I ∘ M; W_c) > τ,    (12)

where τ is a fixed threshold, M is converted to a proxy ground truth G using the aforementioned segment smoothing and trimap strategy. We add (I, G) to the new training set D′. D′ is used to fine-tune the semantic segmentation network and obtain better weights W_s′.
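One mining step of this online fine-tuning can be sketched as below, assuming the mask keeps the pixels whose predicted class equals the noisy label; all names and interfaces are illustrative, not from the paper's implementation:

```python
import numpy as np

def mine_new_sample(seg_probs, image, label, classify, tau=0.4):
    """One online mining step: mask the image by the segmentation output
    and keep the sample only if a classifier still recognizes the noisy
    label in the masked image.

    seg_probs: (K, H, W) per-class probabilities from the segmentation net
    image:     (H, W) image (single channel for brevity)
    classify:  callable returning (K,) class probabilities for an image
    Returns the masked image if accepted, else None.
    """
    # Keep pixels predicted as the noisy label.
    mask = (seg_probs.argmax(axis=0) == label).astype(image.dtype)
    masked = image * mask                       # element-wise multiplication
    if classify(masked)[label] > tau:           # classifier confidence check
        return masked   # accepted: convert to proxy GT, add to training set
    return None
```

The classifier acts as a gatekeeper: if masking out the predicted foreground destroys the evidence for the label, the segmentation was probably wrong and the sample is discarded rather than fed back into training.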

3.3 The Whole System

In this part, we introduce the whole system. We first download images from Flickr using each of the target category tags. The searched images are associated with the corresponding category tags as the image labels. Since Internet data is very noisy, we filter out images with obviously wrong labels using the following three criteria:

P_m(y | I; W_m) ≥ t_m,  m = 1, 2, 3,    (13)–(15)

in which I is an image, y is its corresponding label, P_m is the probability predicted by the m-th classification model with weights W_m, and t_m is a fixed threshold. Specifically, we train the first classification model on the initial data and filter out noisy images using Eqn. (13). The second model is then trained using the remaining data, and Eqn. (14) is used to further remove wrongly labeled data. Finally, the third model is trained, and Eqn. (15) is applied to obtain the final data that we use in this paper. This simple de-noising procedure works well in our system. Moreover, the proposed framework is generic, and other advanced de-noising methods are readily applicable.
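The three-round de-noising can be sketched as a train-filter loop; the threshold values below are placeholders (the paper's settings are not stated here), and the train/predict interfaces are illustrative assumptions:

```python
import numpy as np

def iterative_filter(images, labels, train, predict,
                     thresholds=(0.1, 0.2, 0.3)):
    """Iterative de-noising: train a classifier on the current data, drop
    images whose predicted probability for their search-tag label falls
    below the round's threshold, then retrain on the survivors.

    train(images, labels)  -> model
    predict(model, image)  -> (K,) class probabilities
    Returns the filtered (image, label) pairs and the last model.
    """
    data = list(zip(images, labels))
    model = None
    for t in thresholds:                     # one round per criterion
        model = train([x for x, _ in data], [y for _, y in data])
        data = [(x, y) for x, y in data
                if predict(model, x)[y] >= t]  # keep confident samples
    return data, model
```

Each round trains on progressively cleaner data, so later models can afford stricter thresholds without discarding too many correctly labeled images.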

We train an image classification model using the remaining data, and compute the class-specific attention maps of this model using the algorithm in Section 3.1. The trimap strategy is applied to convert these attention maps into “ground truth” that is used to train a semantic segmentation network chen2015semantic . The resulting initial model is then fine-tuned using the online optimization algorithm introduced in Section 3.2. As in other semantic segmentation methods, we also consider CRF chen2015semantic as a post-processing step.

4 Experiments

4.1 Implementation Details

In total, we download 970k images for the 20 PASCAL VOC categories pascal-voc-2012 , about 48k images per category. After the filtering procedure, 680k images remain. We randomly select a subset (290k) of the remaining images to train the image classification network (GoogLeNet szegedy2015going ). Then, we compute the attention maps and train the initial segmentation model using the same image set. When training the classification network, we use the recommended settings: a base learning rate of 0.02 that is multiplied by 0.96 after every 6.4k steps, momentum of 0.9, weight decay of 0.0004, batch size of 512, and 32k SGD iterations in total. When training the semantic segmentation network chen2015semantic , we use 50k SGD iterations in total and a batch size of 12. Default settings are used for other hyper-parameters. The trimap thresholds τ_l and τ_h are set to 0.5 and 0.65, respectively. The fine-tuning step uses 20k new Internet images as input, and the classification threshold τ is set to 0.4. All of the data, code, and models used in this paper will be released upon paper acceptance.

We test our system on the val and test sets of the PASCAL VOC2012 dataset pascal-voc-2012 , which consist of 1449 validation images and 1456 test images with carefully annotated segmentation ground truth. In the following sections, we first conduct ablation studies to evaluate different choices in our system, and then compare with other competitors. For a fair comparison, we compare with weakly supervised semantic segmentation methods papandreou2015weakly ; pathak2015constrained ; pathak2014fully ; pinheiro2015image that only use image-level supervision. The predicted results are evaluated using the standard mean intersection-over-union (mIoU) across all classes.

Table 1: Effects of various design choices of our framework on the PASCAL VOC2012 val set (columns: forward, backward, segment, trimap, online fine-tuning, CRF, mIoU %).
| Method | val set (mIoU %) | test set (mIoU %) |
| --- | --- | --- |
| With annotated image labels: | | |
| MIL-FCN pathak2014fully | 24.9 | 25.7 |
| MIL-Base pinheiro2015image | 17.8 | - |
| MIL-Base w/ ILP pinheiro2015image | 32.6 | - |
| EM-Adapt w/o CRF papandreou2015weakly | 32.0 | - |
| EM-Adapt papandreou2015weakly | 33.8 | - |
| CCNN w/o CRF pathak2015constrained | 33.3 | - |
| CCNN pathak2015constrained | 35.3 | 35.6 |
| With Internet data: | | |
| EM-Adapt w/o CRF papandreou2015weakly | 15.4 | 16.1 |
| EM-Adapt papandreou2015weakly | 15.9 | 16.7 |
| CCNN w/o CRF pathak2015constrained | 13.7 | 14.2 |
| CCNN pathak2015constrained | 14.1 | 14.6 |
| Ours w/o CRF | 38.7 | 39.5 |
| Ours | 39.6 | 40.4 |

Table 2: Comparison with methods that only use image-level supervision.

4.2 Ablation Study

In this section, we evaluate the effectiveness of various design choices of our method on the VOC2012 val set. Results are summarized in Table 1. Note that CRF indicates whether CRF chen2015semantic is used as a post-processing step. The improvement from single forward/backward attention maps to fused attention maps supports our observation that the forward top attention features and the multiscale backward bottom features carry useful complementary information. Besides, segment smoothing on the attention maps is very helpful for the training process, improving mIoU from 28.1 to 36.1: smoothing appears critical for providing a reliable estimation of segmentation ground truth. The effectiveness of segment smoothing can also be seen in Figure 1. Online fine-tuning further improves the initial model by 1.9% mIoU.

Figure 2: A qualitative comparison between our method and competitors. The original images and ground truth are from the PASCAL VOC2012 val set pascal-voc-2012 . From left to right: original images, ground truth, EM-Adapt papandreou2015weakly , CCNN pathak2015constrained , and our method.

4.3 Comparison With Other Competitors

Here, we compare with papandreou2015weakly ; pathak2015constrained ; pathak2014fully ; pinheiro2015image , which only use image-level supervision. We report not only the evaluation results of these methods trained with carefully annotated datasets, e.g. PASCAL VOC2012 pascal-voc-2012 , SBD hariharan2011semantic , and ImageNet deng2009imagenet , but also the results of these methods trained with the same noisy Internet data that our method uses. Since only the code of papandreou2015weakly ; pathak2015constrained is publicly available, we report results on our Internet data only for these two methods. Experimental results are displayed in Table 2. We can see that the performance of papandreou2015weakly ; pathak2015constrained decreases dramatically from annotated data to Internet data, showing that these methods are not robust to the noise. Our proposed method achieves state-of-the-art performance and is even better than papandreou2015weakly ; pathak2015constrained ; pathak2014fully ; pinheiro2015image trained with annotated image labels. Specifically, with Internet data, the mIoU of our method is 23.7% higher than that of the second best method on both the VOC2012 val and test sets. This demonstrates the effectiveness of our method on noisy data. Our results can be viewed as a baseline for future algorithms for Internetly supervised semantic segmentation. A qualitative comparison is displayed in Figure 2.

5 Conclusion

Considering the data shortage problem of deep learning, we propose a possible way to learn from the Internet. Specifically, because annotating pixel-wise labels for semantic segmentation is very expensive and time-consuming, we set up a new problem, Internetly supervised semantic segmentation, which aims at automatically learning pixel-wise labeling from the Internet without human interaction. As an example solution for this task, we propose a unified attention model to train an initial model that is then improved by a subsequent online fine-tuning algorithm. Our method achieves state-of-the-art performance on the VOC2012 dataset pascal-voc-2012 . Moreover, the new task has the potential to obtain semantic segmentation for arbitrary categories for free. Both how to filter out noisy images and how to learn pixel-wise labeling from the noisy data remain open problems; more solutions for this task are expected in the future.


  • (1) A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. What’s the point: Semantic segmentation with point supervision. In ECCV, pages 549–565, 2016.
  • (2) L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
  • (3) J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In IEEE ICCV, pages 1635–1643, 2015.
  • (4) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE CVPR, pages 248–255, 2009.
  • (5) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
  • (6) B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In IEEE ICCV, pages 991–998, 2011.
  • (7) S. Hong, D. Yeo, S. Kwak, H. Lee, and B. Han. Weakly supervised semantic segmentation using web-crawled videos. In IEEE CVPR, pages 7322–7330, 2017.
  • (8) Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr. Deeply supervised salient object detection with short connections. In IEEE CVPR, pages 5300–5309, 2017.
  • (9) Q. Hou, D. Massiceti, P. K. Dokania, Y. Wei, M.-M. Cheng, and P. H. Torr. Bottom-up top-down cues for weakly-supervised semantic segmentation. In EMMCVPR, pages 263–277, 2017.
  • (10) R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick. Learning to segment every thing. In IEEE CVPR, 2018.
  • (11) H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li. Salient object detection: A discriminative regional feature integration approach. In IEEE CVPR, pages 2083–2090, 2013.
  • (12) B. Jin, M. V. Ortiz-Segovia, and S. Süsstrunk. Webly supervised semantic segmentation. In IEEE CVPR, pages 3626–3635, 2017.
  • (13) A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV, pages 695–711, 2016.
  • (14) Y. Li, M. Paluri, J. M. Rehg, and P. Dollár. Unsupervised learning of edges. In IEEE CVPR, pages 1619–1627, 2016.
  • (15) D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In IEEE CVPR, pages 3159–3167, 2016.
  • (16) T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In IEEE CVPR, pages 2117–2125, 2017.
  • (17) Y. Liu, M.-M. Cheng, X. Hu, K. Wang, and X. Bai. Richer convolutional features for edge detection. In IEEE CVPR, pages 5872–5881, 2017.
  • (18) J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE CVPR, pages 3431–3440, 2015.
  • (19) G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In IEEE ICCV, pages 1742–1750, 2015.
  • (20) D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. In IEEE CVPR, pages 2701–2710, 2017.
  • (21) D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In IEEE ICCV, pages 1796–1804, 2015.
  • (22) D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully convolutional multi-class multiple instance learning. In ICLR Workshop, 2014.
  • (23) P. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In IEEE CVPR, pages 1713–1721, 2015.
  • (24) J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE TPAMI, 39(1):128–140, 2017.
  • (25) X. Qi, Z. Liu, J. Shi, H. Zhao, and J. Jia. Augmented feedback in semantic segmentation under image level supervision. In ECCV, pages 90–105, 2016.
  • (26) S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
  • (27) F. Saleh, M. S. Aliakbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez. Built-in foreground/background prior for weakly-supervised semantic segmentation. In ECCV, pages 413–432, 2016.
  • (28) W. Shimoda and K. Yanai. Distinct class-specific saliency maps for weakly supervised semantic segmentation. In ECCV, pages 218–234, 2016.
  • (29) K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR Workshop, 2014.
  • (30) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • (31) N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using lstms. In ICML, pages 843–852, 2015.
  • (32) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In IEEE CVPR, pages 1–9, 2015.
  • (33) P. Vernaza and M. Chandraker. Learning random-walk label propagation for weakly-supervised semantic segmentation. In IEEE CVPR, pages 7158–7166, 2017.
  • (34) G. Wang, P. Luo, L. Lin, and X. Wang. Learning object interactions and descriptions for semantic image segmentation. In IEEE CVPR, pages 5859–5867, 2017.
  • (35) Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In IEEE CVPR, pages 1568–1576, 2017.
  • (36) Y. Wei, X. Liang, Y. Chen, Z. Jie, Y. Xiao, Y. Zhao, and S. Yan. Learning to segment with image-level annotations. Pattern Recognition, 59:234–244, 2016.
  • (37) Y. Wei, X. Liang, Y. Chen, X. Shen, M.-M. Cheng, J. Feng, Y. Zhao, and S. Yan. STC: A simple to complex framework for weakly-supervised semantic segmentation. IEEE TPAMI, 39(11):2314–2320, 2017.
  • (38) S. Xie and Z. Tu. Holistically-nested edge detection. In IEEE ICCV, pages 1395–1403, 2015.
  • (39) M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833, 2014.
  • (40) J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. In ECCV, pages 543–559, 2016.
  • (41) B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In IEEE CVPR, pages 2921–2929, 2016.