
Weakly Supervised Foreground Learning for Weakly Supervised Localization and Detection

Modern deep learning models require large amounts of accurately annotated data, a requirement that is often difficult to satisfy. Hence, weakly supervised tasks, including weakly supervised object localization (WSOL) and detection (WSOD), have recently received attention in the computer vision community. In this paper, we motivate and propose the weakly supervised foreground learning (WSFL) task by showing that both WSOL and WSOD can be greatly improved if groundtruth foreground masks are available. More importantly, we propose a complete WSFL pipeline with low computational cost, which generates pseudo boxes, learns foreground masks, and does not need any localization annotations. With the help of foreground masks predicted by our WSFL model, we achieve 72.97% Top-1 localization accuracy on CUB for WSOL and 55.7 mAP on VOC07 for WSOD, thereby establishing a new state of the art for both tasks. Our WSFL model also shows excellent transfer ability.





1 Introduction

Deep learning has been the de facto standard for many tasks in the computer vision community. However, the thirst for large-scale labeled data has grown with the development of such models. Since accurate annotations are expensive and sometimes even unavailable, weakly supervised methods, including both weakly supervised object localization (WSOL) and weakly supervised object detection (WSOD), are popular these days. WSOL aims at predicting objects’ locations in test images when only the class-level labels of training images are given. Moving beyond WSOL, WSOD seeks to detect and classify multiple objects, which is considerably more challenging.

A recent work [evaluatecvpr2020] argues that previous WSOL methods in fact do not outperform the pioneering class activation map (CAM) [camcvpr2016] method, and further claims that WSOL is an ill-posed problem when it is not given any location annotations. [evaluatecvpr2020] also shows that using only a few images with groundtruth pixel-level annotation (i.e., a few-shot setting) can outperform existing WSOL methods. This few-shot learning setting, however, is not weakly supervised anymore.

Figure 1: WSOL performance with different levels of supervision on ImageNet.

Figure 2: WSOD performance with different levels of supervision on VOC2007.

Although WSOL might be ill-posed inside the CAM framework, we propose a new task of weakly supervised foreground learning (WSFL) and argue that it is feasible and well-posed. That is, it is viable to determine whether a pixel belongs to foreground (object) or background with only weak supervision. As the method PSOL [psolcvpr2020] illustrated, in WSOL we need to separate two subtasks (localization of a foreground object and recognition of its label), although existing methods mostly mix both subtasks together. The separated localization subtask in fact hinges on the WSFL task we propose, and the success of PSOL [psolcvpr2020] in turn gives us some confidence in the feasibility of WSFL.

The value of foreground/background dichotomy (and thus the proposed WSFL task) is beyond WSOL. Figures 1 and 2 illustrate that it is also very valuable in the more difficult WSOD setup. The dichotomy can appear in different forms with the supervision signal getting weaker and weaker (only provided for training images, more details in Sec. 4):

  • High-resolution (HR) GT: Every pixel has a groundtruth (GT) foreground mask;

  • Low-resolution (LR) GT: GT masks exist, but are low resolution (e.g., a coarse mask for a full-size image);

  • High-resolution few-shot [evaluatecvpr2020]: Like HR GT, but only a few training images per category are labeled;

  • WSFL: The proposed method, where masks are learned in a weakly supervised manner (i.e., no additional supervision);

  • No GT (Baseline): No masks are used at all.

Both figures show that extra supervision (masks) is extremely useful, regardless of whether it is high- or low-resolution, many- or few-shot. A natural question is: can we perform weakly supervised foreground learning (WSFL)? If the answer is “yes”, will WSFL be on par with relatively weak masks (such as few-shot GT or low-resolution GT)?

We believe that the answers to both questions are “yes” and propose a WSFL pipeline, whose effectiveness is verified on various WSOL and WSOD datasets. The pipeline is illustrated in Fig. 3. WSFL first generates pseudo boxes with a co-localization method, deep descriptor transformation [ddtpr2019]. Then, it generates low resolution pseudo foreground masks using the pseudo bounding boxes. A low resolution pixel mask classification model will be learned using the generated foreground masks. Compared to existing saliency methods [saliencyref1iccv2019, saliencyref2iccv2019, saliencyref3cvpr2019, saliencylataaai2018] that will also output a binary pixel mask using segmentation-based backbone models, our WSFL pipeline is computationally more efficient with classification backbone models.

Figure 3: Overall pipeline of our WSFL framework. The training pipeline of WSFL lies on the left side while the testing pipeline of WSFL, including downstream applications WSOL and WSOD, lies on the right side. In WSOL, foreground masks predicted by WSFL are post-processed by CAM to generate bounding boxes. In WSOD, the foreground masks are used to filter away proposals with low foreground (objectness) scores. Both applications can be significantly improved by WSFL.

During testing, WSFL directly predicts low resolution foreground masks using the learned low resolution pixel mask classification model. These masks can be combined with class activation map (CAM) to predict bounding boxes in WSOL tasks. WSOD methods can also make use of these masks as extra supervision for objectness scores during training. With the proposed WSFL models, we achieve new state-of-the-art results for both WSOL and WSOD.

In short, our contributions are:

  • We find that groundtruth foreground masks can greatly benefit tasks such as WSOL and WSOD.

  • We propose WSFL (weakly supervised foreground learning), which learns foreground masks in a weakly supervised fashion (i.e., no extra supervision).

  • Applying WSFL to WSOL and WSOD, our method establishes new state-of-the-art results for both tasks.

2 Related Works

We briefly review recent works on weakly supervised object localization/detection, and saliency related methods.

WSOL: Weakly supervised object localization (WSOL) aims at locating an object when given only image-level labels during training. Researchers have tried to adapt deep learning models to WSOL. The pioneering work [camcvpr2016] proposes the class activation map (CAM), which uses a combination of classification weights and the feature map of a convolutional neural network (CNN) to conduct localization.

Some methods improve the CAM pipeline [gradcamiccv2017, rethinkcameccv2020]. [evaluatecvpr2020] builds a new evaluation pipeline to deal with unfair comparisons. In addition, [evaluatecvpr2020] also proposes a simple few-shot baseline, which uses a fully convolutional network (FCN) model [fcncvpr2015], and beats previous WSOL models. PSOL [psolcvpr2020] shows that the localization subtask should be separated from the recognition subtask, and proposes an independent localization model.

Another line of WSOL work improves the localization ability of classification models. ACoL [acolcvpr2018] conducts adversarial learning over baseline models. SPG [spgeccv2018] adopts self-produced localization masks into the base network. ADL [adlcvpr2019] adds an attention dropout layer to the classification model. HaS [hideandseekiccv2017] and EIL [eilcvpr2020] use erasing to boost the localization performance. I²C [i2ceccv2020] explores the inter-image information for localization in WSOL.

WSOD: Weakly supervised object detection (WSOD) seeks to detect multiple objects in a test image given only the image-level labels during training. Compared to WSOL, WSOD is more challenging. WSOD methods often use object proposals as extra inputs to the detection model.

Most WSOD methods use the multi-instance learning framework to build a detector. Some works use external information like object size [wsodsizeeccv2016] or context information [contextlocneteccv2016]. The pioneering WSDDN method [wsddncvpr2016] uses spatial and class regularization over different proposals. OICR [oicrcvpr2017] introduces online instance refinement and pseudo groundtruth mining. The OICR framework is popular among WSOD works. One trend is to improve the pseudo groundtruth mining part [pcltpami2018, cmilcvpr2019]. Recently, MIST [wetectroncvpr2020] improves the mining rule of pseudo groundtruth boxes and proposes the Concrete DropBlock module. [enableresneteccv2020] makes ResNet [resnetcvpr2016] backbones work in WSOD tasks.

Extra information can guide the detector’s learning process. WS-JDS [wsjdscvpr2019], SDCN [sdcniccv2019] and OCRepr [ocreprmm2020] build joint detection-segmentation frameworks and improve WSOD. WSOD² [wsod2iccv2019] uses low-level vision information to help classify foreground proposals in the image.

Saliency Methods: Saliency aims at locating visually salient objects [saliencytpami2021]. Similar to our WSFL, saliency methods will also produce a binary mask for an input image, and weakly supervised saliency (WSS) methods are closely related to this paper. [wsscvpr2017] first uses deep CNNs in WSS. LICNN [saliencylataaai2018] formulates lateral inhibition to conduct WSS detection. SuperVAE [saliencyvaeaaai2019] introduces the variational autoencoder into the field. However, WSS methods have high computational costs since they are based on segmentation models, and can only be trained on small datasets of around 10,000 images. In contrast, our WSFL model can be applied to large-scale datasets like ImageNet-1k.

3 Weakly Supervised Foreground Learning

We now present our WSFL framework, and describe how WSFL can be used in WSOL and WSOD.

3.1 Notation

Suppose we have a dataset comprising $N$ images $\{I_i\}_{i=1}^{N}$, where each $I_i$ is an image with height $H$ and width $W$. Each image $I_i$ has an object label $c_i$. In the fully-supervised setting, it also has a set of bounding box annotations $B_i = \{b_{i,j}\}$, where each $b_{i,j}$ is in the format of $(x_1, y_1, x_2, y_2)$, with $(x_1, y_1)$ and $(x_2, y_2)$ being the top-left and bottom-right corners of the bounding box. These bounding boxes are, however, not available in our weakly supervised setting. Feeding an image $I_i$ into a CNN model $f$, we will obtain a final feature map before proceeding to the classification head: $F_i = f(I_i) \in \mathbb{R}^{h \times w \times d}$, with $h$, $w$ and $d$ being the height, width and depth of the feature map.

3.2 Motivating Weakly Supervised Foreground Learning

For WSOL, an FCN (fully convolutional network [fcncvpr2015]) style network is used in [evaluatecvpr2020] to train a binary segmentation network that predicts per-pixel foreground masks. A convolution layer first outputs a binary foreground mask. Then, the standard FCN training pipeline is applied to train the network. A few (10) images per category were manually provided with groundtruth pixel-level foreground masks during training. During prediction, high-resolution foreground masks predicted by the network are fed to a CAM [camcvpr2016] procedure, which generates bounding boxes as localization results. [evaluatecvpr2020] shows that even with only a few groundtruth foreground masks as supervision, this few-shot supervised method can beat state-of-the-art WSOL methods.

For WSOD, the method WSOD² [wsod2iccv2019] uses external low-level objectness scores (including edge density and superpixel straddling) to inject extra objectness information into current WSOD methods. It calculates the low-level objectness score for each proposal, and filters away proposals with low scores during training. The performance gain of WSOD² shows that pixel-level objectness (foreground) scores, even when obtained from low-level pixel information, can significantly boost the performance of WSOD methods.

Thus, pixel-level groundtruth foreground masks must lead to even higher improvements for both WSOL and WSOD. This conjecture is verified by Figures 1 and 2. For example, using the OICR method [oicrcvpr2017] with the VGG16 backbone, this HR GT information brings 3.3 mAP gains (from 46.9 of “Weakly” to 50.2 of “HR GT” in Fig. 2).

Pixel-level groundtruth information (even few-shot), however, is not available in WSOL. Low-level objectness scores for WSOD may contain a high percentage of errors because they disregard image-level labels and high-level semantics. Hence, we propose weakly supervised foreground learning (WSFL), which learns foreground masks utilizing high-level deep learning techniques and image-level labels in a weakly supervised fashion, and can be utilized in both WSOL and WSOD as its downstream applications.

We propose a novel WSFL framework to learn how to estimate the unavailable groundtruth foreground masks, as shown in Algorithm 1. In this algorithm, we do not need any groundtruth annotations, including bounding box annotations or pixel-level masks.

First, we generate pseudo low resolution foreground masks. Then, using these (possibly highly noisy) masks as supervision signal, we train a low resolution foreground classification model, which predicts more accurate foreground masks for testing images. We will introduce these components step by step in the coming sections.

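1: Input: Training images, testing images
2: Output: Foreground masks in testing images
3: Generate pseudo bounding boxes on the training images
4: Generate low resolution foreground masks on the training images with the pseudo bounding boxes
5: Train a low resolution foreground classification CNN on the training images using the pseudo masks as supervision
6: Use the trained CNN to predict low resolution foreground masks for the testing images
7: Upsample the predicted masks to the original size of the input images
Algorithm 1 Weakly Supervised Foreground Learning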

3.3 Pseudo Foreground Mask Generation

We now introduce how to generate pseudo foreground masks using only image-level labels.

3.3.1 High Resolution Pseudo Foreground Masks

When groundtruth pixel annotations (e.g., for fully-supervised semantic segmentation) or bounding box annotations (e.g., for fully-supervised object detection) are available, it is straightforward to obtain high-resolution groundtruth foreground masks: A pixel is a foreground pixel (i.e., its mask value is 1) if this pixel is inside any one of the object regions or bounding boxes; otherwise, the mask value is 0 (i.e., it is a background pixel).

Since we are in a weakly supervised setting, we replace groundtruth bounding boxes with DDT [ddtpr2019] generated boxes. According to previous literature [psolcvpr2020], we find that DDT [ddtpr2019], an object co-localization method that only requires image-level labels, provides good pseudo bounding boxes that roughly separate objects from the background. We then generate high resolution pseudo foreground masks by setting a pixel’s foreground mask value to 1 if and only if it is within any one of these pseudo bounding boxes.
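The box-to-mask rule above can be illustrated with a minimal, dependency-free sketch. The function name `boxes_to_mask` and the inclusive pixel-coordinate convention for `(x1, y1, x2, y2)` are assumptions made for illustration; the boxes would come from DDT in our setting, but any box source works here.

```python
from typing import List, Sequence, Tuple

# (x1, y1, x2, y2) with inclusive pixel coordinates (an assumed convention).
Box = Tuple[int, int, int, int]


def boxes_to_mask(height: int, width: int, boxes: Sequence[Box]) -> List[List[int]]:
    """Return a height x width binary mask: 1 inside any box, 0 elsewhere."""
    mask = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Clip the box to the image so out-of-range pseudo boxes do not fail.
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(width - 1, x2), min(height - 1, y2)
        for y in range(y1, y2 + 1):
            for x in range(x1, x2 + 1):
                mask[y][x] = 1
    return mask


# Toy example: one pseudo box on a 4 x 6 image.
mask = boxes_to_mask(4, 6, [(1, 1, 3, 2)])
for row in mask:
    print(row)
```

Because a pixel is marked as foreground when it lies inside any box, the same function also covers the multi-box groundtruth case described earlier.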

3.3.2 Convert Pseudo Masks to Low Resolution

Groundtruth high resolution masks have shown excellent utility for WSOL in [evaluatecvpr2020]. Similarly, using high resolution masks in object detection is equivalent to fully-supervised detection, which outperforms WSOD significantly. However, the FCN-style network in [evaluatecvpr2020] incurs high computational costs, which prevents it from being applied to large-scale datasets. In addition, DDT (and other co-localization methods) only generates one bounding box per image, which means our pseudo high resolution masks cannot be directly applied to WSOD.

To solve these difficulties, we turn the high resolution pseudo masks into low resolution ones, whose size equals the spatial size of the final activation maps in modern CNN models. Specifically, we rescale the bounding box coordinates from the original input image to the low resolution grid. In turn, these bounding boxes are transformed into low resolution foreground masks. Low resolution maps will inevitably contain errors due to quantization. However, compared to high resolution semantic segmentation learning, the computational cost is significantly lower. Hence, this approach can be applied to large-scale datasets like the full training set of ImageNet-1k [imagenetijcv2015].
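As a rough sketch of this conversion, the snippet below rescales one box to a coarse grid and fills the covered cells. The 224 x 224 input size, the 14 x 14 grid, and the rounding rule (floor the top-left corner, ceil the bottom-right corner, so a cell counts as foreground if the box touches it) are all assumptions for illustration, not values taken from the text.

```python
import math
from typing import List, Tuple


def box_to_lowres_mask(
    box: Tuple[float, float, float, float],
    img_h: int,
    img_w: int,
    out_h: int,
    out_w: int,
) -> List[List[int]]:
    """Rescale a box from image coordinates to an out_h x out_w binary grid."""
    x1, y1, x2, y2 = box
    sx, sy = out_w / img_w, out_h / img_h
    # Assumed quantization rule: any grid cell touched by the box is foreground.
    cx1 = max(0, math.floor(x1 * sx))
    cy1 = max(0, math.floor(y1 * sy))
    cx2 = min(out_w, math.ceil(x2 * sx))
    cy2 = min(out_h, math.ceil(y2 * sy))
    mask = [[0] * out_w for _ in range(out_h)]
    for r in range(cy1, cy2):
        for c in range(cx1, cx2):
            mask[r][c] = 1
    return mask


# Hypothetical sizes: a 224 x 224 image mapped to a 14 x 14 grid.
lowres = box_to_lowres_mask((32, 48, 160, 200), 224, 224, 14, 14)
print(sum(map(sum, lowres)), "foreground cells")
```

The bottom edge of the example box falls in the middle of a grid cell, which is exactly the kind of quantization error mentioned above: the whole cell is labeled foreground even though the box covers only half of it.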

3.4 Low Resolution Pixel Classification

An FCN-style semantic segmentation task using the low resolution pseudo masks as training labels is still time-consuming and difficult to optimize. Hence, we propose to use these pseudo masks as binary classification labels, and learn a low resolution pixel classification model instead.

In detail, we simply take a widely used image classification model (such as VGG16 [vggiclr2014] or ResNet50 [resnetcvpr2016]) and replace the final global average pooling and/or fully connected layers with a single convolutional layer with only one output channel, leading to an output activation map with the same spatial size as the low resolution pseudo mask. We add a sigmoid layer to convert the output values to the range (0, 1), and minimize the binary cross entropy loss between these output values and the low resolution pseudo masks. Given a test image, the learned classifier predicts a foreground map of the same low resolution. Note that the predicted foreground map can have more than one disjoint foreground connected region, and is thus suitable for application in WSOD.
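The training objective can be written out explicitly. The following is a plain-Python sketch of the sigmoid plus per-cell binary cross entropy computation on one low resolution map; the helper names and the toy 2 x 2 values are hypothetical, and a real implementation would use a deep learning framework's built-in loss instead.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_loss(logits: List[List[float]], targets: List[List[int]], eps: float = 1e-12) -> float:
    """Mean binary cross entropy between sigmoid(logits) and a 0/1 pseudo mask."""
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for z, t in zip(logit_row, target_row):
            # Clamp the probability to avoid log(0).
            p = min(max(sigmoid(z), eps), 1.0 - eps)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


# Toy 2 x 2 map: outputs of the one-channel convolution and the pseudo mask.
pseudo_mask = [[1, 0], [1, 1]]
good_logits = [[4.0, -4.0], [3.0, 2.0]]
bad_logits = [[-4.0, 4.0], [-3.0, -2.0]]
print(bce_loss(good_logits, pseudo_mask) < bce_loss(bad_logits, pseudo_mask))
```

Each grid cell is treated as an independent binary classification example, which is what distinguishes this formulation from a full segmentation pipeline.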

This mask can be easily resized to match the input image’s size using bilinear interpolation, which is what WSFL uses to replace the pixel-level groundtruth mask in a weakly supervised fashion.

Combining our low resolution mask and pixel classification model, we perform an oracle study in the WSOL task on ImageNet-1k with groundtruth foreground masks. Note that bounding box annotations on ImageNet-1k are incomplete, and we only use those images with groundtruth bounding box annotations to train the models. Experiment details can be found in Sec. 4. Results in Table 1 show that our “low resolution mask + pixel classification model” outperforms the “few-shot FCN-style network + high resolution mask” in [evaluatecvpr2020]. These results show the effectiveness of our proposed pipeline: converting high resolution masks to low resolution for all images (followed by a pixel classification network) is not only efficient, but also more effective than using high resolution masks on only a few images.

| Model | Supervision | CorLoc |
|---|---|---|
| ResNet50 | HR GT, few-shot [evaluatecvpr2020] | 67.5 |
| ResNet50 | LR GT | 72.8 |

Table 1: Correct localization (CorLoc) accuracy of models in the WSOL task on ImageNet-1k. HR means high resolution foreground mask and LR means low resolution mask.

3.5 Downstream Applications of WSFL

The foreground masks predicted by WSFL can be directly applied in WSOL tasks, since we can use CAM [camcvpr2016] to generate bounding box predictions. The utility of WSFL, however, is much wider than the simple WSOL.

In Sec. 3.2, we show that groundtruth masks are useful for both WSOL and WSOD tasks. In Algorithm 1, we train and predict foreground masks on the same dataset. However, since our WSFL model does not need extra label inputs (unlike traditional CAM), we can directly transfer our model trained on a large-scale dataset (like ImageNet [imagenetijcv2015]) to different datasets (e.g., CUB-200 [cubtech2011]) without modifications.

Another natural application is WSOD. It is very difficult to find precise pseudo bounding boxes for multi-object datasets like PASCAL VOC [vocijcv2010], when only image-level labels are available. Although there are weakly supervised semantic segmentation methods which can generate class-aware masks, their performances are still low and cannot meet our demand. Since we have shown that various groundtruth foreground (class-agnostic) masks can improve the performance of WSOD methods, and our WSFL model can transfer without any fine-tuning or additional labels, we can directly predict foreground masks in WSOD tasks using a pretrained WSFL model.

After generating low resolution masks, we upsample them to the original size using bilinear interpolation, and use the following process to calculate the foreground score for every proposal in the image. We treat the average value in the foreground mask of all pixels inside one object proposal as the objectness score for this proposal. Then, we add these scores into the WSOD training pipeline as extra inputs. State-of-the-art WSOD models will classify some input proposals as foreground proposals and use these foreground proposals to retrain the online classifier. In this paper, we follow the pipeline of WSOD: We classify some proposals as background proposals at the online instance refinement stage if they have objectness scores which are lower than a threshold. The detailed threshold value setting will be discussed in Sec 4.
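The scoring and filtering rule described above can be sketched as follows. The function names, the inclusive box convention, and the toy mask are hypothetical; the default threshold of 0.2 is the value used for WSFL masks in Sec. 4.2.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), inclusive (assumed convention)


def objectness_score(mask: Sequence[Sequence[float]], box: Box) -> float:
    """Average foreground value of all mask pixels inside the proposal."""
    x1, y1, x2, y2 = box
    values = [mask[y][x] for y in range(y1, y2 + 1) for x in range(x1, x2 + 1)]
    return sum(values) / len(values) if values else 0.0


def split_proposals(
    mask: Sequence[Sequence[float]], proposals: Sequence[Box], threshold: float = 0.2
) -> Tuple[List[Box], List[Box]]:
    """Split proposals into (kept, background) by their objectness score."""
    kept: List[Box] = []
    background: List[Box] = []
    for box in proposals:
        # Proposals scoring below the threshold are treated as background.
        (kept if objectness_score(mask, box) >= threshold else background).append(box)
    return kept, background


# Toy upsampled foreground mask with one bright 2 x 2 object.
fg_mask = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
proposals = [(1, 1, 2, 2), (0, 0, 0, 3), (0, 0, 3, 3)]
kept, background = split_proposals(fg_mask, proposals)
print("kept:", kept)
print("background:", background)
```

Note that a proposal covering the entire toy image is rejected here: averaging over a box penalizes proposals that include large background areas, not only those that miss the object.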

One final note: Since all components in our WSFL (generating pseudo boxes using DDT, converting pseudo masks to low resolution, and pixel classification) have no interaction with the test set at all, the reasons described in [evaluatecvpr2020] for leading to an ill-posed problem do not apply to WSFL.

4 Experimental Setups and Details

Next, we present our experimental setup for evaluating our WSFL framework, including the datasets and the implementation details.

4.1 Datasets

We evaluate our WSFL framework on two weakly supervised tasks: WSOL and WSOD. For WSOL, we use two standard datasets: ImageNet-1k [imagenetijcv2015] and CUB-200 [cubtech2011]. ImageNet-1k is a single-object image classification dataset with around 1.3 million images. Bounding box annotations are incomplete for training images and complete for validation images. CUB-200 contains 200 bird categories, with 11,788 images that have groundtruth box annotations. Except for the table rows labeled “GT”, we do not use any groundtruth annotation to train our WSFL model. We use two metrics to evaluate our models: Top-1 localization accuracy (Top-1 Loc) and correct localization accuracy (CorLoc). A CorLoc prediction is correct when, given the groundtruth class label, the intersection over union (IoU) between the groundtruth box and the output box is at least 50%. A Top-1 Loc prediction is correct when both the Top-1 classification and the CorLoc (GT-Known Loc) prediction are correct.
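For concreteness, the two WSOL metrics can be sketched per image as below. The function names are hypothetical, boxes are treated as continuous `(x1, y1, x2, y2)` rectangles, and the 0.5 threshold is applied as "at least 0.5", which is our reading of the definition above.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc_hit(pred: Box, gt_boxes: Sequence[Box], thr: float = 0.5) -> bool:
    """Localization is correct if the prediction matches any groundtruth box."""
    return any(iou(pred, gt) >= thr for gt in gt_boxes)


def top1_loc_hit(pred_label: int, gt_label: int, pred: Box, gt_boxes: Sequence[Box]) -> bool:
    """Top-1 Loc additionally requires the Top-1 class prediction to be correct."""
    return pred_label == gt_label and corloc_hit(pred, gt_boxes)


print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
print(corloc_hit((0, 0, 10, 10), [(0, 0, 10, 8)]))
print(top1_loc_hit(3, 7, (0, 0, 10, 10), [(0, 0, 10, 8)]))
```

Dataset-level CorLoc and Top-1 Loc are then the fractions of evaluation images for which the respective function returns true.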

For WSOD, we will evaluate on a standard detection benchmark dataset VOC2007 [vocijcv2010]. VOC2007 is an object detection dataset with 2,501 training images, 2,511 validation images and 4,952 testing images. We use training and validation images to train our models and test their performance on the testing images, following previous WSOD protocols [wsddncvpr2016, wsodsizeeccv2016, oicrcvpr2017]. During evaluation, we use the common mean average precision (mAP) metric on test images.

4.2 Implementation Details

We use the PyTorch framework with 2080Ti GPUs to conduct our experiments.

Base Models. For backbone models in WSOL, we use the same models as previous methods [spgeccv2018, adlcvpr2019, evaluatecvpr2020]: VGG16 [vggiclr2014], InceptionV3 [inceptionv3cvpr2016] and ResNet50 [resnetcvpr2016]. For VGG16, we follow previous WSOL practice [camcvpr2016, spgeccv2018, adlcvpr2019]: we remove the last max pooling layer to enlarge the receptive field. Also, we remove all fully connected layers in VGG16. For ResNet50, we remove the downsample stride of the last residual block. For InceptionV3, we follow the structural changes in [spgeccv2018, adlcvpr2019]. We use pretrained weights on ImageNet-1k to initialize our WSFL models.

For backbone models in WSOD, we simply take the previous methods OICR and MIST [oicrcvpr2017, wetectroncvpr2020] as baseline models, then rerun these models with the extra supervision of our objectness scores.

For all experiments on WSOL, we first use DDT [ddtpr2019] to generate pseudo boxes on ImageNet-1k and CUB-200. Then, we generate pseudo foreground masks according to the pseudo boxes. We use the following hyperparameters on ImageNet: batch size 256 and weight decay 0.0001. We use the SGD optimizer with 0.9 momentum. The learning rate starts at 0.001 and decays every 4 epochs. The total number of training epochs on ImageNet is 12 for all models. For WSOL experiments on CUB-200, we keep other hyperparameters the same and change the batch size from 256 to 64. The learning rate decays every 10 epochs and the total number of training epochs is 30 on CUB-200. We directly resize all input images to a fixed size, then perform random horizontal flipping for training. For testing, we directly resize the input image to a fixed size, then feed it into the model.
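The learning rate schedule can be summarized with a short sketch. The text specifies the initial rate (0.001), the decay interval (4 epochs on ImageNet, 10 on CUB-200) and the total epochs (12 and 30), but not the decay factor; the factor `gamma=0.1` below is therefore an assumption, as is the function name `step_lr`.

```python
def step_lr(epoch: int, base_lr: float = 0.001, step_size: int = 4, gamma: float = 0.1) -> float:
    """Step decay: multiply the learning rate by gamma every step_size epochs.

    gamma=0.1 is an assumed value; the decay factor is not stated in the text.
    """
    return base_lr * gamma ** (epoch // step_size)


# ImageNet setting: 12 epochs, decay every 4 epochs.
imagenet_schedule = [step_lr(e, step_size=4) for e in range(12)]
# CUB-200 setting: 30 epochs, decay every 10 epochs.
cub_schedule = [step_lr(e, step_size=10) for e in range(30)]

print(imagenet_schedule[0], imagenet_schedule[4], imagenet_schedule[8])
```

With these settings, both datasets go through exactly three learning rate stages of equal length.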

After getting the low resolution output, we will use bilinear interpolation to upsample the low resolution output to the size of the original input image. Then we will use CAM [camcvpr2016] to generate the output bounding box. Since our models can directly output localization results, we follow the instructions in [psolcvpr2020] to combine localization results with classification results. For results on CUB-200, we find that fine-tuning WSFL models trained on ImageNet-1k will have better performance. Ablation studies on different initialization weights on CUB-200 will be presented in Sec. 5.
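The box generation step can be sketched in the style of the CAM procedure: threshold the upsampled map, find the largest connected foreground region, and return its tight bounding box. This is a simplified illustration rather than the exact implementation; the function name, the 4-connectivity choice, and the default threshold of 0.5 are assumptions.

```python
from collections import deque
from typing import Optional, Sequence, Tuple


def mask_to_box(
    mask: Sequence[Sequence[float]], threshold: float = 0.5
) -> Optional[Tuple[int, int, int, int]]:
    """Tight (x1, y1, x2, y2) box around the largest 4-connected foreground region."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = None  # (region size, x1, y1, x2, y2)
    for sy in range(h):
        for sx in range(w):
            if seen[sy][sx] or mask[sy][sx] < threshold:
                continue
            # Breadth-first search over one connected region.
            queue = deque([(sy, sx)])
            seen[sy][sx] = True
            size, x1, y1, x2, y2 = 0, sx, sy, sx, sy
            while queue:
                y, x = queue.popleft()
                size += 1
                x1, x2 = min(x1, x), max(x2, x)
                y1, y2 = min(y1, y), max(y2, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and mask[ny][nx] >= threshold:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if best is None or size > best[0]:
                best = (size, x1, y1, x2, y2)
    return None if best is None else best[1:]


toy_mask = [
    [0.9, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.9, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
print(mask_to_box(toy_mask))
```

Keeping only the largest region discards small spurious activations, such as the isolated corner pixel in the toy mask.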

For WSOD experiments, we directly use a ResNet50 WSFL model pre-trained on the ImageNet dataset to generate pseudo foreground masks on VOC2007 without any fine-tuning. We use the same hyperparameters and evaluation pipeline of OICR and MIST, and do not make any further changes. For proposal filtering, we set threshold as 0.2 to filter proposals with our WSFL masks. For groundtruth masks, we set the threshold as 0.5 when we apply our WSFL. Visual inspections find that our WSFL model performs poorly on the “person” and “plant” categories on VOC2007, possibly because these categories do not appear in ImageNet. Thus, we do not filter proposals which are classified as foreground of person and plant in the online instance refinement stage.

5 Results and Analyses

In this section, we provide empirical results and analyses of the proposed WSFL model and applications.

5.1 WSOL Results

First, we show results on WSOL benchmark datasets and compare with state-of-the-art WSOL methods. Top-1 Loc and CorLoc results are shown in Table 2. From the table, we have the following observations and findings.

| Model | Backbone | CUB-200 Top-1 Loc | CUB-200 CorLoc | ImageNet-1k Top-1 Loc | ImageNet-1k CorLoc |
|---|---|---|---|---|---|
| VGG16-CAM [camcvpr2016] | VGG-GAP | 37.05 | 53.68 | 42.80 | 59.00 |
| VGG16-ACoL [acolcvpr2018] | VGG-GAP | 45.92 | - | 45.83 | 62.96 |
| ADL [adlcvpr2019] | VGG-GAP | 53.40 | 73.96 | 42.96 | 59.24 |
| CutMix [cutmixiccv2019] | VGG-GAP | 52.53 | - | 43.45 | - |
| DDT [ddtpr2019] | VGG16 | 62.30 | 84.55 | 47.31 | 61.41 |
| PSOL-Sep [psolcvpr2020] | VGG-GAP | 59.29 | 80.45 | 48.36 | 63.72 |
| I²C [i2ceccv2020] | VGG-GAP | 55.99 | 72.60 | 47.41 | 63.90 |
| SEM [semarxiv2020] | VGG-GAP | - | - | 47.53 | 63.47 |
| CAM w/ [rethinkcameccv2020] | VGG-GAP | 61.30 | 80.72 | 45.40 | 62.68 |
| WSFL | VGG-GAP | 68.33 | 92.92 | 51.47 | 66.95 |
| Few-Shot GT FCN [evaluatecvpr2020] | VGG16 | - | 86.30 | - | 62.80 |
| SPG [spgeccv2018] | InceptionV3 | 46.64 | - | 48.60 | 64.69 |
| ADL [adlcvpr2019] | InceptionV3 | 53.04 | - | 48.71 | - |
| ADL w/ [rethinkcameccv2020] | InceptionV3 | 53.04 | 69.95 | 50.56 | 64.44 |
| PSOL-Sep [psolcvpr2020] | InceptionV3 | 65.51 | 83.44 | 54.82 | 65.21 |
| I²C [i2ceccv2020] | InceptionV3 | 55.99 | 72.60 | 53.17 | 68.50 |
| SEM [semarxiv2020] | InceptionV3 | - | - | 53.04 | 69.04 |
| WSFL | InceptionV3 | 69.04 | 93.96 | 57.12 | 69.59 |
| Few-Shot GT FCN [evaluatecvpr2020] | InceptionV3 | - | 94.00 | - | 68.70 |
| ADL [adlcvpr2019] | ResNet50-SE | 62.29 | - | 48.53 | - |
| ADL w/ [rethinkcameccv2020] | ResNet50 | 59.53 | 77.58 | 49.42 | 62.20 |
| CutMix [cutmixiccv2019] | ResNet50 | 54.81 | - | 47.25 | - |
| I²C [i2ceccv2020] | ResNet50 | - | - | 54.83 | 68.50 |
| PSOL [psolcvpr2020] | ResNet50 | 69.87 | 86.56 | 53.98 | 65.44 |
| WSFL | ResNet50 | 72.97 | 94.75 | 56.56 | 69.53 |
| Few-Shot GT FCN [evaluatecvpr2020] | ResNet50 | - | 95.80 | - | 67.50 |
| LR GT | ResNet50 | 75.87 | 98.45 | 59.30 | 72.80 |

Table 2: Top-1 localization and CorLoc results on CUB-200 and ImageNet-1k. VGG-GAP means we replace three fully connected layers in VGG16 with global average pooling and one fully connected layer, while VGG16 keeps the original VGG16 structure. For compared methods, we report numbers in previous papers. “-” means results were not reported in the respective paper. “LR GT” means the model is trained with low resolution groundtruth foreground masks. Best results are shown in boldface. An important note: For few-shot GT FCN [evaluatecvpr2020] and LR GT, since they rely on extra supervision signals, their results are listed here to provide context, and they are not counted in the comparison.

  • WSFL performs better than baseline methods with the same inputs, including DDT [ddtpr2019] and PSOL [psolcvpr2020]. From the performance on the VGG-GAP backbone, we can see that WSFL performs significantly better than DDT and PSOL on both datasets. Compared to PSOL [psolcvpr2020], WSFL turns the pseudo box annotations into low resolution foreground masks and obtains better results. This phenomenon shows we should utilize more information than bounding box coordinates in WSOL tasks.

  • WSFL significantly outperforms other WSOL methods with the same level of supervision. With the same post-processing step (CAM) and same backbone [camcvpr2016], WSFL outperforms previous WSOL methods [adlcvpr2019, spgeccv2018, cutmixiccv2019, i2ceccv2020] by a large gap. With the same pseudo supervision, we have better performance than DDT [ddtpr2019] and PSOL [psolcvpr2020]. Some recent methods try to modify CAM [camcvpr2016] to achieve better performance [semarxiv2020, rethinkcameccv2020]. Our WSFL models, even without any modification to CAM, still achieve significantly better accuracy than these methods. Moreover, WSFL can be combined with these post-processing methods to further achieve better performance.

  • WSFL can achieve comparable or better performance than few-shot baselines, which challenges the claim in [evaluatecvpr2020]. Our WSFL models have comparable or better performance than few-shot FCNs. [evaluatecvpr2020] argues that WSOL is an ill-posed problem, and that researchers should use few-shot learning to achieve better performance. However, our WSFL framework shows that, so long as we utilize proper inputs in the current WSOL setting, we can achieve comparable or better performance than few-shot FCN. Considering the scalability issues of FCN-style networks on large-scale datasets like ImageNet-1k, WSFL with its low resolution classification model can predict and scale better.

5.2 WSOD Results

Now we move on to introduce results of WSFL on WSOD, summarized in Table 3. Our results suggest that:

| Baseline method | WS | Extra information | mAP |
|---|---|---|---|
| WS-JDS | yes | WS-JDS [wsjdscvpr2019] | 45.6 |
| OICR | yes | None | 46.9 |
| | yes | SDCN [sdcniccv2019] | 46.0 |
| | yes | OCRepr [ocreprmm2020] | 46.3 |
| | yes | WSOD² [wsod2iccv2019] | 48.1 |
| | yes | WSFL | 48.3 |
| | no | LR GT | 48.6 |
| | no | HR GT | 50.2 |
| MIST | yes | None | 55.2 |
| | yes | WSFL | 55.7 |
| | no | LR GT | 56.5 |
| | no | HR GT | 56.8 |

Table 3: mAP results of different WSOD methods on VOC2007 with extra information given by WSFL and/or other models. Please note that we re-implemented the OICR [oicrcvpr2017] and MIST [wetectroncvpr2020] baseline methods from their respective papers, and our results are higher than results reported in the original papers. The column “WS” denotes whether a method is weakly supervised or not. Both “HR GT” and “LR GT” use groundtruth pixel supervision, and should not be directly compared with WSFL.

  • With groundtruth foreground pixel masks, various WSOD methods get consistent improvements over their baselines. For two recent WSOD methods, OICR [oicrcvpr2017] and MIST [wetectroncvpr2020], high resolution groundtruth mask scores bring gains of 3.3 mAP and 1.4 mAP, respectively. With low resolution groundtruth masks, we can still have 1.7 mAP and 1.2 mAP gains, respectively, which shows that even low resolution masks can still boost the performance of various WSOD methods.

  • Our WSFL model can provide a substitute for groundtruth foreground masks. Masks predicted by WSFL lead to gains of 1.3 mAP and 0.5 mAP on these WSOD methods, respectively, which are lower than, but close to, the gains from low resolution groundtruth masks.

  • Our WSFL achieves higher mAP than WSOD methods with extra segmentation branches [sdcniccv2019, ocreprmm2020, wsjdscvpr2019] and slightly higher mAP than the method using low-level vision scores [wsod2iccv2019]. Meanwhile, computing the foreground mask in WSFL is more efficient than computing a segmentation mask. In the future, the synergy between WSFL masks and other scores may further improve WSOD.

5.3 Good Transfer Ability

In this section, we analyze the transfer ability of our WSFL models.

Previous WSOL methods require class labels to generate bounding boxes, so they cannot be transferred between different datasets. PSOL [psolcvpr2020] shows that bounding box prediction models trained on ImageNet can achieve good performance on a different dataset without any further fine-tuning. We want to verify whether our WSFL model has good transfer ability, too. In fact, since we used an ImageNet pretrained WSFL model in WSOD tasks (on Pascal VOC), the transfer ability of WSFL has already been indirectly validated.

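| Model | Initialization weights | Fine-tuned | CorLoc |
|---|---|---|---|
| ResNet50 | WSFL on ImageNet | no | 91.25 |
| ResNet50 | Classification | yes | 93.55 |
| ResNet50 | WSFL on ImageNet | yes | 94.75 |
| VGG-GAP | WSFL on ImageNet | no | 82.59 |
| VGG-GAP | Classification | yes | 91.18 |
| VGG-GAP | WSFL on ImageNet | yes | 92.92 |
| InceptionV3 | WSFL on ImageNet | no | 90.92 |
| InceptionV3 | Classification | yes | 91.25 |
| InceptionV3 | WSFL on ImageNet | yes | 93.96 |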
Table 4: Correct localization (CorLoc) accuracy of different WSFL models with different initial weights and training targets on CUB-200.
Figure 4: Visualization of our WSFL model’s output on images from different datasets. On the left, we randomly choose 4 images from the ImageNet-1k validation dataset; on the right, since we do not perform any fine-tuning, we randomly choose 4 images from the VOC 2007 trainval dataset. For every image, we show the foreground mask predicted by our WSFL model, along with the groundtruth pixel-level foreground mask provided by (or transferred from) the annotations. The ResNet50 WSFL model was trained on ImageNet. Pictures with red edges indicate that our WSFL model outputs an incorrect mask for the input image. For example, “person” instances in VOC images are often missing from the predicted masks.

We take all baseline models in Table 2 and use CUB-200 as the target dataset to explore WSFL’s transfer ability between different datasets. There are two weight initialization choices for WSFL models: classification weights on ImageNet, and WSFL weights on ImageNet. Furthermore, for WSFL weights on ImageNet, we can choose whether to fine-tune them on the pseudo CUB foreground masks generated by DDT. The results are in Table 4, from which we draw the following conclusions.

  • Without any fine-tuning, WSFL models perform well on CUB-200, except for the VGG-GAP model, which has a large gap (about 10%) to its fine-tuned counterpart. ResNet50 and InceptionV3 only have small gaps, i.e., about 3% CorLoc accuracy. These results show that our WSFL models transfer well across WSOL datasets. This also explains why we can apply our WSFL models to WSOD methods to boost their accuracy.

  • WSFL models provide better initialization than classification weights. Compared to classification weights, all three models show 1-2% performance gain when WSFL weights are fine-tuned. These results show that there are indeed some characteristic differences between the classification and the localization tasks, as also shown in [psolcvpr2020]. Hence, separate models should be used for classification and localization.
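To make the CorLoc numbers in Table 4 concrete, the following is a minimal sketch of how such a score is typically computed. It assumes the usual convention that a prediction counts as correct when its IoU with the groundtruth box is at least 0.5, and one groundtruth box per image (as in CUB-200). The function names and the toy boxes are illustrative and are not taken from our implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(pred_boxes, gt_boxes, threshold=0.5):
    """Percentage of images whose predicted box reaches IoU >= threshold
    with the groundtruth box (threshold=0.5 is an assumed convention)."""
    if not pred_boxes or len(pred_boxes) != len(gt_boxes):
        raise ValueError("need one prediction per groundtruth box")
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(pred_boxes)


# Toy example with three images (hypothetical boxes).
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (5, 5, 15, 15)]
gts = [(0, 0, 10, 10), (0, 0, 10, 20), (0, 0, 10, 10)]
score = corloc(preds, gts)  # two of the three predictions are hits
print(round(score, 2))
```

The first prediction matches exactly, the second reaches an IoU of exactly 0.5, and the third overlaps too little, so two of three images are counted as correctly localized.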

5.4 Visualization and Failure Cases

In this section, we show some visualization results to evaluate the output of our WSFL models. The visualizations are in Fig. 4. Moreover, we provide the output of our WSFL models and weakly supervised saliency detection methods in the supplementary material.

Fig. 4 clearly shows that WSFL models output precise masks for object localization tasks, even with multiple separate objects (the bottom-left image). The masks are also of high quality when transferred to a different dataset without fine-tuning (top 2 images in the right half).

However, our WSFL models fail to predict the bounding boxes for nearby objects, even when the foreground masks are correct. The problem comes from the CAM post-processing part. Also, since ImageNet-1k does not have the “person” category, people are often labeled as background in WSFL ImageNet learning. Thus, the learned WSFL model will not label people in an image as foreground. We need to deal with these biases in the future.
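One plausible reading of the nearby-object failure can be sketched as follows. The sketch assumes the common CAM-style post-processing (threshold the map, keep the largest connected component, take its tight box). The threshold value, the helper names, and the toy masks are illustrative assumptions and are not taken from our code.

```python
from collections import deque


def largest_component_box(mask, threshold=0.5):
    """Tight box (x1, y1, x2, y2), inclusive, around the largest 4-connected
    region of cells with mask >= threshold. Returns None if no cell passes."""
    height, width = len(mask), len(mask[0])
    fg = [[value >= threshold for value in row] for row in mask]
    seen = [[False] * width for _ in range(height)]
    best = []
    for y in range(height):
        for x in range(width):
            if not fg[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill collects one connected component.
            seen[y][x] = True
            queue, component = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width and fg[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component
    if not best:
        return None
    ys, xs = zip(*best)
    return (min(xs), min(ys), max(xs), max(ys))


def blank(height=6, width=10):
    """Empty toy mask (all background)."""
    return [[0.0] * width for _ in range(height)]


def paint(mask, x1, y1, x2, y2):
    """Set cells with x1 <= x <= x2 and y1 <= y <= y2 to foreground."""
    for y in range(y1, y2 + 1):
        for x in range(x1, x2 + 1):
            mask[y][x] = 1.0
    return mask


# Two objects whose masks touch: they merge into a single box.
touching = paint(paint(blank(), 1, 1, 4, 4), 5, 1, 8, 4)
print(largest_component_box(touching))   # (1, 1, 8, 4)

# Two objects separated by a background column: only the larger one is boxed.
separated = paint(paint(blank(), 1, 1, 4, 4), 6, 1, 8, 4)
print(largest_component_box(separated))  # (1, 1, 4, 4)
```

In both toy cases the mask itself is correct, yet the box step returns a single box, which is consistent with the failure described above.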

6 Conclusion and Future Work

In this paper, we established the motivation and necessity for weakly supervised foreground learning (WSFL), and proposed the WSFL task. We also showed that a successful WSFL model is very valuable for various downstream applications, such as weakly supervised object localization and detection (WSOL and WSOD). We then proposed a computationally efficient WSFL pipeline. Our WSFL model significantly improves the performance of WSOL and WSOD, and has demonstrated excellent transfer ability. We believe that our work provides a solid step towards weakly supervised tasks in computer vision.

In the future, we will explore better backbones for WSFL (e.g., removing biases), and will explore more applications of WSFL in different tasks, including better proposal scoring for WSOD tasks and applications in other new tasks (such as weakly supervised semantic segmentation).


Supplementary Materials

This is the supplementary material for our paper, titled “Weakly Supervised Foreground Learning for Weakly Supervised Localization and Detection”. In this document, we provide additional visualization and comparison with weakly supervised saliency methods.

Comparison with Saliency Methods

We now present the comparison between our method and weakly supervised saliency methods.

For weakly supervised saliency methods, we choose WSS [wsscvpr2017], which trains a saliency detection model with image-level supervision. The main difference between our WSFL model and WSS is as follows: WSFL aims to learn low-resolution foreground masks with a classification backbone, while WSS learns high-resolution saliency masks based on a segmentation backbone. Thus, WSFL can be trained on a large-scale dataset like the full ImageNet-1k [imagenetijcv2015], which contains more than one million images. However, WSS can only be trained on a small dataset. The WSS paper used a subset of ImageNet-DET, which contains only around 10,000 images.

For a detailed comparison, we replace the proposal scores generated by WSFL models with scores from WSS. Then we use these proposals to conduct weakly supervised object detection on VOC 2007. The results are in Table 5. From Table 5, we can see that our WSFL model achieves a better result than WSS, which indicates the effectiveness of our WSFL model.

| Baseline method | Extra information | mAP |
|---|---|---|
| OICR | None | 46.9 |
| OICR | WSS [wsscvpr2017] | 47.5 |
| OICR | WSFL | 48.3 |
Table 5: mAP results of different WSOD methods on VOC2007 with extra information given by WSFL and WSS models. Please note that we re-implemented the OICR [oicrcvpr2017] baseline, and our result is higher than the result reported in the original paper.
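The score replacement described above can be pictured with the following minimal sketch. It assumes, purely for illustration, that a proposal is scored by the mean value of a per-pixel map inside the box; the actual proposal scoring used in our experiments is the one defined in the main paper and may differ. All names and toy values here are hypothetical.

```python
def box_score(score_map, box):
    """Mean of score_map inside box = (x1, y1, x2, y2), with x2 and y2
    exclusive. This mean-inside-box rule is an illustrative assumption."""
    x1, y1, x2, y2 = box
    values = [v for row in score_map[y1:y2] for v in row[x1:x2]]
    return sum(values) / len(values) if values else 0.0


def score_proposals(score_map, boxes):
    """Score every proposal against one per-pixel map."""
    return [box_score(score_map, box) for box in boxes]


# Toy 4x4 maps standing in for two different mask sources (made-up values).
wsfl_like_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
other_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
proposals = [(1, 1, 3, 3), (0, 0, 2, 2), (0, 0, 4, 4)]

# Swapping the mask source only changes the first argument.
wsfl_like_scores = score_proposals(wsfl_like_map, proposals)  # [1.0, 0.25, 0.25]
other_scores = score_proposals(other_map, proposals)          # [0.25, 0.25, 0.0625]
print(wsfl_like_scores, other_scores)
```

Keeping the scoring function independent of the map's origin is what makes the comparison in Table 5 a like-for-like swap: the proposals and the detector stay fixed, and only the map changes.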

Moreover, we present the visualization of WSFL and WSS on the PASCAL VOC2007 [vocijcv2010] dataset. For WSS, we directly take pre-trained models from the official website. For WSFL, we directly take a ResNet50 WSFL model pre-trained on the ImageNet dataset.

Figure 5: Visualization of our WSFL model’s output and WSS’s output on randomly chosen images from the VOC2007 dataset. Please note that WSFL and WSS are neither trained nor fine-tuned on the VOC2007 dataset; we directly evaluate these models. The first to third columns show the outputs of the first 5 images and the fourth to sixth columns show the outputs of another 5 images.

The visualization results are in Figure 5, from which we have the following findings:

  • WSS can respond to parts of the background, e.g., in the first image on the left and the second image on the right. In these two images, WSS labels part of the tree as foreground, while WSFL does not. Also, WSS labels parts of the shadow as salient objects in the first image on the right.

  • WSS labels only the discriminative parts of the object in many images, like the second image on the left and the third image on the right. In these images, WSS only labels some parts of the objects (like the head of the bird and the cat). In contrast, WSFL can label the whole object in the image.

  • For multiple objects of the same category in a single image, WSS will only label the most salient object and ignore the others, e.g., in the fourth row of the figure. In contrast, WSFL can label all objects of the same category in a single image.

  • WSS can ignore small objects, as in the third image on the left, which contains an airplane. WSFL can recognize the whole plane while WSS nearly misses it. Taken together, the above issues show that weakly supervised saliency methods like WSS cannot predict foreground objects well. There may be several reasons. One reason is the small amount of training data: with a small training set, WSS cannot learn the object concept very well. Also, directly learning from classification labels has its own issues, such as learning only the discriminative parts of objects. In contrast, our WSFL is learned from pseudo generated boxes, which can overcome these issues.

  • Since WSS is based on high-resolution segmentation models, while WSFL is based on classification models and is trained on masks generated from boxes, WSS can have better boundary predictions than WSFL. Also, when the background is not cluttered or the input image is rather simple, like the last row of Figure 5, WSS can have better results than WSFL.
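The boundary limitation in the last point follows from the training targets. The sketch below shows one simple way a pseudo box could be turned into a low-resolution foreground mask. The grid size, the cell-centre rule, and the function name are assumptions made for illustration only; the mask generation actually used for WSFL is the one described in the main paper.

```python
def box_to_lowres_mask(box, image_size, grid=7):
    """Binary grid x grid mask in which a cell is 1 if its centre lies
    inside box. box = (x1, y1, x2, y2) in pixels, image_size = (width,
    height). The grid size and the centre rule are illustrative choices."""
    x1, y1, x2, y2 = box
    width, height = image_size
    cell_w, cell_h = width / grid, height / grid
    mask = []
    for row in range(grid):
        cy = (row + 0.5) * cell_h  # vertical centre of this row of cells
        mask.append([
            1 if (x1 <= (col + 0.5) * cell_w <= x2 and y1 <= cy <= y2) else 0
            for col in range(grid)
        ])
    return mask


# A hypothetical pseudo box in a 224 x 224 image, mapped to a 7 x 7 grid.
mask = box_to_lowres_mask((56, 56, 168, 168), (224, 224), grid=7)
for row in mask:
    print(row)
```

Whatever the object's true outline is, such a target is a coarse rectangle of cells, so a model trained on it has little signal about precise boundaries, whereas a high-resolution segmentation model like WSS can follow them more closely.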