LID 2020: The Learning from Imperfect Data Challenge Results

10/17/2020 · by Yunchao Wei, et al.

Learning from imperfect data has become a pressing issue in many industrial applications, now that the research community has made profound progress in supervised learning from perfectly annotated datasets. The purpose of the Learning from Imperfect Data (LID) workshop is to inspire and facilitate research on novel approaches that harness imperfect data and improve data efficiency during training. A massive amount of user-generated data is nowadays available on many internet services, and how to leverage it to improve machine learning models is a high-impact problem. We organize the challenges in conjunction with the workshop. The goal of these challenges is to find the state-of-the-art approaches in the weakly supervised learning setting for object detection, semantic segmentation, and scene parsing. There are three tracks in the challenge, i.e., weakly supervised semantic segmentation (Track 1), weakly supervised scene parsing (Track 2), and weakly supervised object localization (Track 3). In Track 1, based on ILSVRC DET, we provide pixel-level annotations of 15K images from 200 categories for evaluation. In Track 2, we provide point-based annotations for the training set of ADE20K. In Track 3, based on ILSVRC CLS-LOC, we provide pixel-level annotations of 44,271 images for evaluation. Besides, we introduce a new evaluation metric proposed by [71], i.e., the IoU-Threshold curve, to measure the quality of the generated object localization maps. This technical report summarizes the highlights of the challenge. The challenge submission server and the leaderboard will remain open for interested researchers. More details regarding the challenge and the benchmarks are available at https://lidchallenge.github.io


1 Introduction

Weakly supervised learning refers to various studies that attempt to address challenging image recognition tasks by learning from weak or imperfect supervision. Supervised learning methods, including Deep Convolutional Neural Networks (DCNNs), have significantly improved the performance on many problems in the field of computer vision, thanks to the rise of large-scale annotated datasets and advances in computing hardware. However, these supervised learning approaches are notoriously "data-hungry", which sometimes makes them impractical in many real-world industrial applications. We often face the problem of not being able to acquire a sufficient amount of perfect annotations (e.g., object bounding boxes and pixel-wise masks) for training reliable models. To address this problem, many efforts in so-called weakly supervised learning approaches have been made to improve DCNN training so that it deviates from the traditional path of fully supervised learning and can exploit imperfect data. For instance, various approaches have proposed new loss functions or novel training schemes. Weakly supervised learning is a popular research direction in the Computer Vision and Machine Learning communities. Many research works have been devoted to related topics, leading to the rapid growth of related publications in top-tier conferences and journals such as CVPR, ICCV, ECCV, NeurIPS, TIP, IJCV, and TPAMI.

This year, we provide additional annotations for existing benchmarks to enable weakly supervised training or evaluation, and introduce three challenge tracks to advance the research of weakly supervised semantic segmentation [35, 37, 61, 36, 41, 57, 22, 42, 38, 45, 30, 56, 6, 20, 55, 24, 60, 51, 66, 74, 1, 16, 27, 50, 59, 15, 46, 63, 44, 48, 28, 25, 53, 17, 13, 54, 49, 32, 14, 67, 40, 7, 23, 2, 52], weakly supervised scene parsing using point supervision [39], and weakly supervised object localization [58, 34, 29, 72, 4, 65, 11, 26, 19, 18, 75, 21, 47, 43, 68, 69, 5, 62, 10, 9, 71, 33, 64, 3, 70, 31], respectively. More details are given in the next section.

We organize this workshop to investigate principled ways of building industry-level AI systems that rely on learning from imperfect data. We hope this workshop will attract attention and discussion from both industry and academia and facilitate future research on weakly supervised learning for computer vision.

2 The LID Challenge

2.1 Datasets

ILSVRC-LID-200 is used in Track 1, aiming to perform object semantic segmentation using image-level annotations as supervision. This dataset is built upon the object detection track of the ImageNet Large Scale Visual Recognition Competition (ILSVRC) [12], which includes a total of 456,567 training images from 200 categories. By parsing the XML files provided by [12], 349,310 images containing object(s) from the 200 categories are retained. To facilitate pixel-level evaluation, we provide pixel-level annotations for 15K images, including 5,000 for validation and 10,000 for testing. Following previous practices, the mean Intersection-over-Union (IoU) score over the 200 categories is employed as the key evaluation metric.
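As a reference for participants, a minimal sketch of the mean IoU computation is given below. It assumes integer label maps in which 0 denotes background, 1–200 the object categories, and 255 an ignore label; the exact label convention and averaging rule of the evaluation server may differ.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=200, ignore_index=255):
    """Mean Intersection-over-Union over the object categories.

    Assumes `pred` and `gt` are integer arrays of the same shape, with 0 as
    background, 1..num_classes as categories, and `ignore_index` marking
    pixels excluded from evaluation (illustrative convention).
    """
    valid = gt != ignore_index
    ious = []
    for c in range(1, num_classes + 1):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # category absent from this image; skip it
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```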

ADE20K-LID is used in Track 2, aiming to learn to perform scene parsing using point-based annotations as supervision. This dataset is built upon the ADE20K dataset [73]. There are 20,210 images in the training set, 2,000 images in the validation set, and 3,000 images in the testing set. We provide additional point-based annotations on the training set. In particular, this point-based weakly supervised setting was first introduced by [39]. Following [39], we consider 150 meaningful categories and generate a point-level annotation for each independent instance (or region) in each training image. Performance is evaluated by pixel-wise accuracy and mean IoU, consistent with the evaluation metrics of the standard ADE20K benchmark [73].

ILSVRC is used in Track 3, aiming to equip classification networks with the ability of object localization. This dataset is built upon the image classification/localization track of the ImageNet Large Scale Visual Recognition Competition (ILSVRC), which includes 1.2 million training images from 1,000 categories. Different from previous works that evaluate the performance in an indirect way, i.e., via bounding boxes, we annotate pixel-level masks for 44,271 images (validation/testing: 23,151/21,120) so that the evaluation can be performed in a direct way. These annotations are provided by [71], where a new evaluation metric, i.e., the IoU-Threshold curve, is also introduced. In particular, the IoU-Threshold curve is obtained by calculating IoU scores with the localization maps binarized at a wide range of thresholds from 0 to 255. The best IoU score, i.e., the Peak-IoU, is used as the key evaluation metric for comparison. Please refer to [71] for more details on the annotation masks and the new evaluation metric.
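The metric can be summarized with the short sketch below: a continuous localization map with values in [0, 255] is binarized at every threshold, the foreground IoU against the ground-truth mask is recorded for each threshold, and the maximum value is reported as the Peak-IoU. Whether the comparison is `>=` or `>` and how per-image scores are aggregated are assumptions here; see [71] for the authoritative definition.

```python
import numpy as np

def iou_threshold_curve(loc_map, gt_mask):
    """Compute the IoU-Threshold curve of a localization map (uint8, 0-255)
    against a binary ground-truth mask, plus the Peak-IoU and the threshold
    at which it is reached."""
    gt = gt_mask.astype(bool)
    curve = []
    for t in range(256):
        fg = loc_map >= t            # assumed binarization rule
        union = np.logical_or(fg, gt).sum()
        inter = np.logical_and(fg, gt).sum()
        curve.append(inter / union if union > 0 else 0.0)
    peak_threshold = int(np.argmax(curve))
    return curve, curve[peak_threshold], peak_threshold
```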

2.2 Rules and Descriptions

Rules This year, we issue two strict rules for all the teams:

  • For training, only the images provided in the training set are permitted. Competitors can use the classification models pre-trained on the training set of ILSVRC CLS-LOC to initialize the parameters. However, they CANNOT leverage any datasets with pixel-level annotations. In particular, for Track 1 and Track 3, only the image-level annotations of training images can be leveraged for supervision, and the bounding-box annotations are NOT permitted.

  • We encourage competitors to design elegant and effective models for all the tracks rather than ensembling multiple models. Therefore, we require that the parameter size of the inference model(s) be LESS than 150M (slightly more than two DeepLab V3+ [8] models using ResNet-101 as the backbone); a simple way to check this budget is sketched after this list. The competitors ranked in the Top 3 are required to submit their inference code for verification.
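For reference, the parameter budget can be checked with a few lines of PyTorch; the helper below is an illustrative sketch, not an official verification script.

```python
import torch
import torchvision

def count_parameters(model: torch.nn.Module) -> float:
    """Return the total number of parameters of an inference model, in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

# Example: a ResNet-101 classifier alone is roughly 44.5M parameters, so
# stacking several full segmentation networks quickly approaches the 150M limit.
backbone = torchvision.models.resnet101()
print(f"{count_parameters(backbone):.1f}M parameters")
```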

Timeline The challenge started on Mar 22, 2020, and ended on June 8, 2020. Each participant was allowed a maximum of 5 submissions for the testing split of each track.

Participating Teams We received submissions from 15 teams in total. In particular, the teams from the Computer Vision Lab at ETH Zurich, Tsinghua University, and the Vision & Learning Lab at Seoul National University achieved first place in Track 1, Track 2, and Track 3, respectively.

2.3 Results and Methods

Team | Username | Mean IoU | Mean Accuracy | Pixel Accuracy
ETH Zurich | cvl | 45.18 | 59.62 | 80.46
Seoul National University | VL-task1 | 37.73 | 60.15 | 82.98
Ukrainian Catholic University | UCU & SoftServe | 37.34 | 54.87 | 83.64
- | IOnlyHaveSevenDays | 36.24 | 68.27 | 84.10
- | play-njupt | 31.90 | 46.07 | 82.63
- | xingxiao | 29.48 | 48.66 | 80.82
- | hagenbreaker | 22.50 | 39.92 | 77.38
- | go-g0 | 19.80 | 38.30 | 76.21
- | lasthours-try | 12.56 | 24.65 | 64.35
- | WH-ljs | 7.79 | 16.59 | 62.52
Table 1: Results and rankings of methods submitted to Track 1.
Team | Username | Mean IoU | Mean Accuracy | Pixel Accuracy
Tsinghua&Intel | fromandto | 25.33 | 36.88 | 64.95
Table 2: Results and rankings of methods submitted to Track 2.
Team | Username | Peak-IoU | Peak-Threshold
Seoul National University | VL-task3 | 63.08 | 24
Mepro-MIC of Beijing Jiaotong U | BJTU-Mepro-MIC | 61.98 | 35
NUST | LEAP Group of PCA Lab | 61.48 | 7
- | chohk (wsol-aug) | 52.89 | 36
Beijing Normal University | TEN | 48.17 | 42
Table 3: Results and rankings of methods submitted to Track 3.
Figure 1: The comparison of IoU-Threshold curves of different teams for Track 3.

Tables 1, 2, and 3 show the leaderboard results of Track 1, Track 2, and Track 3, respectively. In particular, the team from ETH Zurich outperforms the others by a large margin in Track 1. In Track 3, the top three teams achieve similar Peak-IoU scores, ranging from 61.48 to 63.08. Furthermore, we compare the IoU-Threshold curves of five teams for Track 3 in Figure 1.

2.3.1 CVL at ETH Zurich Team
(Won the 1st place of Track 1)

The proposed approach adopts cross-image semantic relations for comprehensive object pattern mining. Two neural co-attentions are incorporated into the classifier to complementarily capture cross-image semantic similarities and differences. In particular, the classifier is equipped with a differentiable co-attention mechanism that models semantic homogeneity and difference across training image pairs. More specifically, two kinds of co-attention are learned in the classifier. The former aims to capture cross-image common semantics, which enables the classifier to better ground the common semantic labels over the co-attentive regions. The latter, called contrastive co-attention, focuses on the remaining, unshared semantics, which helps the classifier better separate semantic patterns of different objects. These two co-attentions work in a cooperative and complementary manner, together making the classifier understand object patterns more comprehensively. Another advantage is that the co-attention-based classifier learning paradigm brings an efficient data augmentation strategy due to the use of training image pairs. An overview of the proposed approach is shown in Figure 2, and a minimal sketch of the core co-attention operation is given after the figure.

Figure 2: An overview of the proposed approach by the team of CVL at ETH Zurich.
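The core co-attention operation can be illustrated with the hedged sketch below. It computes an affinity matrix between the feature maps of an image pair and uses it to aggregate the common semantics for one image; the learnable projection `proj`, the normalization choices, and the contrastive branch are simplified assumptions rather than the team's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def co_attention(feat_a, feat_b, proj: nn.Linear):
    """Minimal co-attention sketch for two feature maps of shape B x C x H x W.

    Returns co-attentive features for image A that aggregate the semantics
    shared with image B (the "common semantics" branch described above).
    """
    B, C, H, W = feat_a.shape
    fa = feat_a.flatten(2)                               # B x C x HW_a
    fb = feat_b.flatten(2)                               # B x C x HW_b
    affinity = torch.bmm(proj(fa.transpose(1, 2)), fb)   # B x HW_a x HW_b
    attn = F.softmax(affinity, dim=2)                    # A's locations attend to B
    common_a = torch.bmm(fb, attn.transpose(1, 2))       # B x C x HW_a
    return common_a.view(B, C, H, W)

# A contrastive co-attention would instead emphasize the unshared semantics,
# e.g. by subtracting or masking out the co-attentive response before
# classification.
```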

2.3.2 The Machine Learning Lab at Ukrainian Catholic University Team
(Won the 3rd place of Track 1)

The approach proposed by this team consists of three consecutive steps. The first two steps extract high-quality pseudo masks from image-level annotated data, which are then used to train a segmentation model in the third step. The presented approach also addresses two problems in the data: class imbalance and missing labels. Together, these three steps make the proposed approach capable of segmenting various classes and complex objects using only image-level annotations as supervision. The results produced at each consecutive step are shown in Figure 3; a hedged sketch of the final training step is given after the figure.

Figure 3: The produced localization maps of each consecutive step by the team of Machine Learning Lab at Ukrainian Catholic University.
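A hedged sketch of the third step is given below: a standard segmentation network is trained on the pseudo masks produced by the first two steps, with class weighting as one plausible way to handle the reported class imbalance and an ignore label for pixels with missing labels. The loss, optimizer, and weighting scheme are illustrative assumptions, not the team's exact recipe.

```python
import torch
import torch.nn as nn

def train_on_pseudo_masks(seg_model, loader, class_weights, epochs=1, ignore_index=255):
    """Train a segmentation model on (image, pseudo_mask) pairs.

    `class_weights` counteracts class imbalance; `ignore_index` marks pixels
    without a pseudo label (both are illustrative choices).
    """
    criterion = nn.CrossEntropyLoss(weight=class_weights, ignore_index=ignore_index)
    optimizer = torch.optim.SGD(seg_model.parameters(), lr=0.01, momentum=0.9)
    seg_model.train()
    for _ in range(epochs):
        for images, pseudo_masks in loader:
            logits = seg_model(images)               # B x num_classes x H x W
            loss = criterion(logits, pseudo_masks)   # pseudo masks as targets
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```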

2.3.3 Tsinghua&Intel Team
(Won the 1st place of Track 2)

The team reveals two critical issues in the current state-of-the-art method [39]: 1) it relies upon softmax outputs, i.e., logits, and logits are known to be over-confident on wrong predictions; 2) harvesting pseudo labels from logits introduces thresholds, and it is very time-consuming to tune thresholds for modern deep networks. Some observations are shown in Figure 4.

To tackle these issues, the proposed approach builds upon uncertainty measures instead of logits and is free of threshold tuning, motivated by a large-scale analysis of the distribution of uncertainty measures using strong models and challenging databases. This analysis leads to the discovery of a statistical phenomenon called uncertainty mixture. Inspired by this discovery, the team proposes to decompose the distribution of uncertainty measures with a Gamma mixture model, leading to a principled method to harvest reliable pseudo labels (a simplified sketch is given after Figure 4). Beyond that, the team assumes that the uncertainty measures for labeled points are always drawn from a particular component, which amounts to a regularized Gamma mixture model. They provide a thorough theoretical analysis of this model, showing that it can be solved with an EM-style algorithm with a convergence guarantee.

Figure 4: Some observations made by the Tsinghua&Intel team. Pa: a wrong pseudo label with high uncertainty. Pb: a correct pseudo label with low uncertainty.
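The pseudo-label harvesting idea can be sketched as follows: fit a two-component Gamma mixture to the per-point uncertainty values with an EM-style loop and keep the points assigned to the low-uncertainty component. The moment-matching M-step is a simplification (exact maximum-likelihood shape updates have no closed form), and the team's regularization of the labeled-point component and convergence analysis are not reproduced.

```python
import numpy as np
from scipy.stats import gamma

def harvest_pseudo_labels(u, iters=50):
    """Fit a 2-component Gamma mixture to uncertainty values `u` (> 0) and
    return a boolean mask of points assigned to the low-uncertainty component."""
    pi, shape, scale = np.full(2, 0.5), np.ones(2), np.ones(2)
    comp = (u > np.median(u)).astype(int)          # crude initialization
    for k in range(2):
        m, v = u[comp == k].mean(), u[comp == k].var() + 1e-8
        shape[k], scale[k] = m * m / v, v / m
    for _ in range(iters):
        # E-step: responsibilities under the current mixture
        pdf = np.stack([pi[k] * gamma.pdf(u, a=shape[k], scale=scale[k]) for k in range(2)])
        resp = pdf / (pdf.sum(0, keepdims=True) + 1e-12)
        # M-step: mixture weights and moment-matched Gamma parameters
        for k in range(2):
            w = resp[k]
            pi[k] = w.mean()
            m = (w * u).sum() / (w.sum() + 1e-12)
            v = (w * (u - m) ** 2).sum() / (w.sum() + 1e-12) + 1e-8
            shape[k], scale[k] = m * m / v, v / m
    low = int(np.argmin(shape * scale))            # component with the smaller mean
    return resp[low] > 0.5
```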

2.3.4 Vision&Learning Lab at Seoul National University Team
(Won the 1st and 2nd places for Track 3 and Track 1)

This team demonstrates that the popular class activation maps [72] suffer from three fundamental issues: (i) the bias of GAP to assign a higher weight to a channel with a small activation area, (ii) negatively weighted activations inside the object regions, and (iii) instability from using the maximum value of a class activation map as a thresholding reference. Collectively, these cause the localization prediction to be highly limited to a small region of an object. The proposed approach incorporates three simple but robust techniques that alleviate the corresponding problems: thresholded average pooling, negative weight clamping, and a percentile-based thresholding standard (sketched after Figure 5). More details can be found in Figure 5.

Figure 5: An overview of the proposed approach by the team of Vision&Learning Lab at Seoul National University.
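A hedged sketch of the three techniques, applied on top of the standard CAM formulation of [72], is given below. The threshold `tau` and percentile `pct` are illustrative hyper-parameters, and the training-time integration of thresholded average pooling is simplified to a standalone pooling function.

```python
import numpy as np

def thresholded_average_pooling(feature_maps, tau=0.1):
    """(i) Average each channel only over its activated locations (> tau),
    instead of over the whole map as GAP does, to counter the bias toward
    channels with small activation areas."""
    flat = feature_maps.reshape(feature_maps.shape[0], -1)      # C x HW
    active = flat > tau
    return (flat * active).sum(1) / np.maximum(active.sum(1), 1)

def localize(feature_maps, fc_weights, class_idx, pct=80):
    """Build a localization mask from a class activation map using
    (ii) negative weight clamping and (iii) a percentile threshold."""
    w = np.clip(fc_weights[class_idx], 0, None)                  # (ii)
    cam = np.tensordot(w, feature_maps, axes=1)                  # H x W
    return cam >= np.percentile(cam, pct)                        # (iii)
```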

2.3.5 Mepro at Beijing Jiaotong University Team
(Won the 2nd place of Track 3)

The proposed method achieves localization on any convolutional layer of a classification model by exploiting two kinds of gradients, and is called the Dual-Gradients Localization (DGL) framework. The DGL framework is built on two branches: 1) Pixel-level Class Selection, which leverages gradients of the target class to identify the correlation of pixels with the target class within any convolutional feature map, and 2) Class-aware Enhanced Map, which utilizes gradients of the classification loss function to mine entire target object regions without damaging classification performance. The proposed architecture is shown in Figure 6.

To acquire integral object regions, gradients of the classification loss function are used to enhance the information of the specific class at any convolutional layer. The DGL framework demonstrates that any convolutional layer of a classification model has localization ability, and that some layers perform better than the last convolutional layer in an offline manner. A Grad-CAM-style sketch in this spirit is given after Figure 6.

Figure 6: An overview of the Dual-Gradients Localization (DGL) framework proposed by the team of Mepro at Beijing Jiaotong University.
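The gradient-based localization idea can be illustrated with the sketch below, which localizes from an arbitrary convolutional layer using gradients of the target-class score. It covers only the spirit of the Pixel-level Class Selection branch in a Grad-CAM-like form; the exact DGL formulation and the Class-aware Enhanced Map branch are not reproduced.

```python
import torch
import torch.nn.functional as F

def gradient_localization(model, image, target_class, layer):
    """Localization map from an arbitrary conv layer of a classification model,
    weighted by gradients of the target-class score (Grad-CAM-style sketch)."""
    feats = {}
    hook = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    logits = model(image.unsqueeze(0))                 # 1 x num_classes
    hook.remove()
    activations = feats["a"]                           # 1 x C x H x W
    grads = torch.autograd.grad(logits[0, target_class], activations)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)     # per-channel importance
    cam = F.relu((weights * activations).sum(dim=1))   # 1 x H x W
    return cam / (cam.max() + 1e-8)
```

In practice one would iterate `layer` over several convolutional stages and compare the resulting maps, reflecting the observation above that intermediate layers can localize better than the last one.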

2.3.6 PCA Lab at Nanjing University of Science and Technology Team
(Won the 3rd place of Track 3)

Figure 7: An overview of the framework proposed by the team of PCA Lab at Nanjing University of Science and Technology.

The proposed model is composed of two auto-encoders, as shown in Figure 7. In the first auto-encoder, a classifier is trained with global average pooling using image-level annotations as supervision. The learned classifier is then applied to obtain Class Activation Maps (CAMs) following [72]. The team uses the binary images generated from the CAMs as pseudo pixel-level annotations to conduct binary classification. After the first decoder, a bilinear upsampling operation is applied to obtain a binary image of the same size as the raw image.

In the second auto-encoder, the team aims to reconstruct the raw image from the binary image in order to obtain a refined binary output image. The output binary image of the first decoder is used as the input of the second encoder. The binary image is combined with the raw image and then divided into foreground and background parts. The foreground and background parts are processed independently in different channels of a layer and then fused into the final prediction image. To be specific, they convert the foreground and background images in all layers into binary images with a threshold, add the foreground and background binary images together, and then fuse the binary images generated by all layers. As a result, they obtain refined binary output images, which are used as pseudo annotations for the next iteration. The above process iterates until convergence.

2.3.7 Beijing Normal University
(Won the 5th place of Track 3)

This team simply applies the approach proposed in [10] to compete in Track 3.

Acknowledgments

We thank our sponsor, the PaddlePaddle Team from Baidu.

References

  • [1] Jiwoon Ahn and Suha Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In CVPR, pages 4981–4990, 2018.
  • [2] Aditya Arun, CV Jawahar, and M Pawan Kumar. Weakly supervised instance segmentation by learning annotation consistent instances. In ECCV, 2020.
  • [3] Wonho Bae, Junhyug Noh, and Gunhee Kim. Rethinking class activation mapping for weakly supervised object localization. In ECCV, 2020.
  • [4] Archith John Bency, Heesung Kwon, Hyungtae Lee, S Karthikeyan, and BS Manjunath. Weakly supervised localization using deep feature maps. In ECCV, pages 714–731, 2016.
  • [5] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In WACV, pages 839–847, 2018.
  • [6] Arslan Chaudhry, Puneet K Dokania, and Philip HS Torr. Discovering class-specific pixels for weakly-supervised semantic segmentation. arXiv preprint arXiv:1707.05821, 2017.
  • [7] Liyi Chen, Weiwei Wu, Chenchen Fu, Xiao Han, and Yuntao Zhang. Weakly supervised semantic segmentation with boundary exploration. In ECCV, 2020.
  • [8] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801–818, 2018.
  • [9] Junsuk Choe, Seong Joon Oh, Seungho Lee, Sanghyuk Chun, Zeynep Akata, and Hyunjung Shim. Evaluating weakly supervised object localization methods right. In CVPR, 2020.
  • [10] Junsuk Choe and Hyunjung Shim. Attention-based dropout layer for weakly supervised object localization. In CVPR, June 2019.
  • [11] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Weakly supervised object localization with multi-fold multiple instance learning. TPAMI, 39(1):189–203, 2016.
  • [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • [13] Junsong Fan, Zhaoxiang Zhang, Chunfeng Song, and Tieniu Tan. Learning integral objects with intra-class discriminator for weakly-supervised semantic segmentation. In CVPR, 2020.
  • [14] Junsong Fan, Zhaoxiang Zhang, and Tieniu Tan. Employing multi-estimations for weakly-supervised semantic segmentation. In ECCV, 2020.
  • [15] Qibin Hou, PengTao Jiang, Yunchao Wei, and Ming-Ming Cheng. Self-erasing network for integral object attention. In NIPS, pages 549–559, 2018.
  • [16] Zilong Huang, Xinggang Wang, Jiasi Wang, Wenyu Liu, and Jingdong Wang. Weakly-supervised semantic segmentation network with deep seeded region growing. In CVPR, pages 7014–7023, 2018.
  • [17] Peng-Tao Jiang, Qibin Hou, Yang Cao, Ming-Ming Cheng, Yunchao Wei, and Hong-Kai Xiong. Integral object mining via online attention accumulation. In ICCV, pages 2070–2079, 2019.
  • [18] Zequn Jie, Yunchao Wei, Xiaojie Jin, Jiashi Feng, and Wei Liu. Deep self-taught learning for weakly supervised object localization. In CVPR, 2017.
  • [19] Vadim Kantorov, Maxime Oquab, Minsu Cho, and Ivan Laptev. Contextlocnet: Context-aware deep network models for weakly supervised localization. In ECCV, pages 350–365, 2016.
  • [20] Anna Khoreva, Rodrigo Benenson, Jan Hendrik Hosang, Matthias Hein, and Bernt Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, volume 1, page 3, 2017.
  • [21] Dahun Kim, Donghyeon Cho, Donggeun Yoo, and In So Kweon. Two-phase learning for weakly supervised object localization. In ICCV, pages 3534–3543, 2017.
  • [22] Alexander Kolesnikov and Christoph H Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV, pages 695–711, 2016.
  • [23] Viveka Kulharia, Siddhartha Chandra, Amit Agrawal, Philip Torr, and Ambrish Tyagi. Box2seg: Attention weighted loss and discriminative feature learning for weakly supervised segmentation. In ECCV, 2020.
  • [24] Suha Kwak, Seunghoon Hong, Bohyung Han, et al. Weakly supervised semantic segmentation using superpixel pooling network. In AAAI, pages 4111–4117, 2017.
  • [25] Jungbeom Lee, Eunji Kim, Sungmin Lee, Jangho Lee, and Sungroh Yoon. Ficklenet: Weakly and semi-supervised semantic image segmentation using stochastic inference. In CVPR, pages 5267–5276, 2019.
  • [26] Dong Li, Jia-Bin Huang, Yali Li, Shengjin Wang, and Ming-Hsuan Yang. Weakly supervised object localization with progressive domain adaptation. In CVPR, pages 3512–3520, 2016.
  • [27] Kunpeng Li, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst, and Yun Fu. Tell me where to look: Guided attention inference network. In CVPR, pages 9215–9223, 2018.
  • [28] Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. Attention bridging network for knowledge transfer. In CVPR, pages 5198–5207, 2019.
  • [29] Xiaodan Liang, Si Liu, Yunchao Wei, Luoqi Liu, Liang Lin, and Shuicheng Yan. Towards computational baby learning: A weakly-supervised approach for object detection. In ICCV, pages 999–1007, 2015.
  • [30] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. CVPR, 2016.
  • [31] Weizeng Lu, Xi Jia, Weicheng Xie, Linlin Shen, Yicong Zhou, and Jinming Duan. Geometry constrained weakly supervised object localization. In ECCV, 2020.
  • [32] Wenfeng Luo and Meng Yang. Semi-supervised semantic segmentation via strong-weak dual-branch network. In ECCV, 2020.
  • [33] Jinjie Mai, Meng Yang, and Wenfeng Luo. Erasing integrated learning: A simple yet effective approach for weakly supervised object localization. In CVPR, 2020.
  • [34] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, pages 685–694, 2015.
  • [35] George Papandreou, Liang-Chieh Chen, Kevin Murphy, and Alan L Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV, 2015.
  • [36] Deepak Pathak, Philipp Krahenbuhl, and Trevor Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, pages 1796–1804, 2015.
  • [37] Pedro O Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.
  • [38] Xiaojuan Qi, Zhengzhe Liu, Jianping Shi, Hengshuang Zhao, and Jiaya Jia. Augmented feedback in semantic segmentation under image level supervision. In ECCV, pages 90–105, 2016.
  • [39] Rui Qian, Yunchao Wei, Honghui Shi, Jiachen Li, Jiaying Liu, and Thomas Huang. Weakly supervised scene parsing with point-based distance metric learning. In AAAI, volume 33, pages 8843–8850, 2019.
  • [40] Amir Rahimi, Amirreza Shaban, Thalaiyasingam Ajanthan, Richard Hartley, and Byron Boots. Pairwise similarity knowledge transfer for weakly supervised object localization. In ECCV, 2020.
  • [41] O Russakovsky, AL Bearman, V Ferrari, and L Fei-Fei. What’s the point: Semantic segmentation with point supervision. In ECCV, pages 549–565, 2016.
  • [42] Fatemehsadat Saleh, Mohammad Sadegh Ali Akbarian, Mathieu Salzmann, Lars Petersson, Stephen Gould, and Jose M Alvarez. Built-in foreground/background prior for weakly-supervised semantic segmentation. In ECCV, pages 413–432, 2016.
  • [43] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
  • [44] Yunhang Shen, Rongrong Ji, Yan Wang, Yongjian Wu, and Liujuan Cao. Cyclic guidance for weakly supervised joint detection and segmentation. In CVPR, 2019.
  • [45] Wataru Shimoda and Keiji Yanai. Distinct class-specific saliency maps for weakly supervised semantic segmentation. In ECCV, pages 218–234, 2016.
  • [46] Wataru Shimoda and Keiji Yanai. Self-supervised difference detection for weakly-supervised semantic segmentation. In ICCV, 2019.
  • [47] Krishna Kumar Singh and Yong Jae Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In ICCV, pages 3544–3553, 2017.
  • [48] Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. Box-driven class-wise region masking and filling rate guided loss for weakly supervised semantic segmentation. In CVPR, 2019.
  • [49] Guolei Sun, Wenguan Wang, Jifeng Dai, and Luc Van Gool. Mining cross-image semantics for weakly supervised semantic segmentation. In ECCV, 2020.
  • [50] Meng Tang, Abdelaziz Djelouah, Federico Perazzi, Yuri Boykov, and Christopher Schroers. Normalized cut loss for weakly-supervised cnn segmentation. In CVPR, 2018.
  • [51] Meng Tang, Federico Perazzi, Abdelaziz Djelouah, Ismail Ben Ayed, Christopher Schroers, and Yuri Boykov. On regularized losses for weakly-supervised cnn segmentation. In ECCV, pages 507–522, 2018.
  • [52] Olga Veksler. Regularized loss for weakly supervised single class semantic segmentation. In ECCV, 2020.
  • [53] Bin Wang, Guojun Qi, Sheng Tang, Tianzhu Zhang, Yunchao Wei, Linghui Li, and Yongdong Zhang. Boundary perception guidance: a scribble-supervised semantic segmentation approach. In IJCAI, pages 3663–3669, 2019.
  • [54] Yude Wang, Jie Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In CVPR, 2020.
  • [55] Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming Cheng, Yao Zhao, and Shuicheng Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In CVPR, 2017.
  • [56] Yunchao Wei, Xiaodan Liang, Yunpeng Chen, Zequn Jie, Yanhui Xiao, Yao Zhao, and Shuicheng Yan. Learning to segment with image-level annotations. Pattern Recognition, 2016.
  • [57] Yunchao Wei, Xiaodan Liang, Yunpeng Chen, Xiaohui Shen, Ming-Ming Cheng, Jiashi Feng, Yao Zhao, and Shuicheng Yan. Stc: A simple to complex framework for weakly-supervised semantic segmentation. TPAMI, 2016.
  • [58] Yunchao Wei, Wei Xia, Min Lin, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. Hcp: A flexible cnn framework for multi-label image classification. TPAMI, (9):1901–1907, 2015.
  • [59] Yunchao Wei, Huaxin Xiao, Honghui Shi, Zequn Jie, Jiashi Feng, and Thomas S Huang. Revisiting dilated convolution: A simple approach for weakly-and semi-supervised semantic segmentation. In CVPR, pages 7268–7277, 2018.
  • [60] Huaxin Xiao, Yunchao Wei, Yu Liu, Maojun Zhang, and Jiashi Feng. Transferable semi-supervised semantic segmentation. In AAAI, 2018.
  • [61] Jia Xu, Alexander G Schwing, and Raquel Urtasun. Learning to segment under various forms of weak supervision. In CVPR, 2015.
  • [62] Haolan Xue, Chang Liu, Fang Wan, Jianbin Jiao, Xiangyang Ji, and Qixiang Ye. Danet: Divergent activation for weakly supervised object localization. In ICCV, October 2019.
  • [63] Yu Zeng, Yunzhi Zhuge, Huchuan Lu, and Lihe Zhang. Joint learning of saliency detection and weakly supervised semantic segmentation. In ICCV, October 2019.
  • [64] Chen-Lin Zhang, Yun-Hao Cao, and Jianxin Wu. Rethinking the route towards weakly supervised object localization. In CVPR, 2020.
  • [65] Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In ECCV, pages 543–559, 2016.
  • [66] Tianyi Zhang, Guosheng Lin, Jianfei Cai, Tong Shen, Chunhua Shen, and Alex C Kot. Decoupled spatial neural attention for weakly supervised semantic segmentation. 21(11):2930–2941, 2019.
  • [67] Tianyi Zhang, Guosheng Lin, Weide Liu, Jianfei Cai, and Alex Kot. Splitting vs. merging: Mining object regions with discrepancy and intersection loss for weakly supervised semantic segmentation. In ECCV, 2020.
  • [68] Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas Huang. Adversarial complementary learning for weakly supervised object localization. In CVPR, 2018.
  • [69] Xiaolin Zhang, Yunchao Wei, Guoliang Kang, Yi Yang, and Thomas Huang. Self-produced guidance for weakly-supervised object localization. In ECCV, 2018.
  • [70] Xiaolin Zhang, Yunchao Wei, and Yi Yang. Inter-image communication for weakly supervised localization. In ECCV, 2020.
  • [71] Xiaolin Zhang, Yunchao Wei, Yi Yang, and Fei Wu. Rethinking localization map: Towards accurate object perception with self-enhancement maps. arXiv preprint arXiv:2006.05220, 2020.
  • [72] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, pages 2921–2929, 2016.
  • [73] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, pages 633–641, 2017.
  • [74] Yanzhao Zhou, Yi Zhu, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Weakly supervised instance segmentation using class peak response. In CVPR, pages 3791–3800, 2018.
  • [75] Yi Zhu, Yanzhao Zhou, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Soft proposal networks for weakly supervised object localization. arXiv preprint arXiv:1709.01829, 2017.