Deep convolutional neural networks have achieved enormous success in various computer vision tasks, such as classification, localization and detection. However, current deep learning models need a large number of accurate annotations, including image-level labels, location-level labels (bounding boxes and key points) and pixel-level labels (per-pixel class labels for semantic segmentation). Many large-scale datasets have been proposed to alleviate this problem [15, 10, 3]. However, models pre-trained on these large-scale datasets cannot be directly applied to a different task due to the differences between source and target domains.
To relax these restrictions, weakly supervised methods have been proposed. Weakly supervised methods try to perform detection, localization and segmentation tasks with only image-level labels, which are relatively easy and cheap to obtain. Among these tasks, weakly supervised object localization (WSOL) is the most practical one, since it only needs to locate the object for a given class label. Most WSOL methods try to enhance the localization ability of classification models to conduct WSOL tasks using the class activation map (CAM) [19, 28, 29, 2, 27].
However, in this paper, through ablation studies and experiments, we demonstrate that the localization part of WSOL should be class-agnostic, i.e., not related to classification labels. Based on these observations, we advocate a paradigm shift which divides WSOL into two independent sub-tasks: class-agnostic object localization and object classification. The overall pipeline of our method is in Fig. 1. We name this pipeline Pseudo Supervised Object Localization (PSOL). We first generate pseudo ground-truth bounding boxes based on a class-agnostic method: deep descriptor transformation (DDT). By performing bounding box regression on these generated bounding boxes, our method removes restrictions on most WSOL models, including the restriction of only one fully connected layer as classification weights and the dilemma between classification and localization [19, 2].
We achieve state-of-the-art performance on ImageNet-1k and CUB-200 by combining the results of these two independent sub-tasks, obtaining a large edge over previous WSOL models (especially on CUB-200). With classification results of the recent EfficientNet model, we achieve 58.00% Top-1 localization accuracy on ImageNet-1k, which significantly outperforms previous methods.
We summarize our contributions as follows.
We show that weakly supervised object localization should be divided into two independent sub-tasks: class-agnostic object localization and object classification. We propose PSOL to solve the drawbacks and problems of previous WSOL methods.
Though the generated bounding boxes are noisy, we argue that we should directly optimize on them without using class labels. With the proposed PSOL, we achieve 58.00% Top-1 localization accuracy on ImageNet-1k and 74.74% Top-1 localization accuracy on CUB-200, which is far beyond the previous state-of-the-art.
Our PSOL method has good localization transferability across different datasets without any fine-tuning, which is significantly better than previous WSOL models.
2 Related Works
Convolutional neural networks (CNNs), since the success of AlexNet, have been widely applied in many areas of computer vision, including object localization and object detection tasks. We briefly review detection and localization with full supervision and weak supervision in this section.
2.1 Fully Supervised Methods
After the success of AlexNet, researchers tried to adopt CNNs for object localization and detection. The pioneering work OverFeat used sliding window and multi-scale techniques to conduct classification, localization and detection within a single network. VGG-Net added per-class regression and model ensembling to enhance the localization predictions.
R-CNN first uses region proposal methods to generate candidate regions and then uses a CNN to classify them. Faster-RCNN proposes a two-stage network: a region proposal network (RPN) for generating regions of interest (ROI), followed by an R-CNN module to classify each region and localize the object in it. These popular two-stage detectors are widely used in detection tasks. YOLO and SSD are one-stage detectors with carefully designed network structures and anchors. Recently, some anchor-free detectors, such as CornerNet and CenterNet, have been proposed to mitigate the anchor problem in common detectors.
However, all these methods need massive, detailed and accurate annotations. Annotations in real-world tasks are expensive and sometimes even hard to obtain. So we need other methods to perform object localization without requiring many exact labels.
2.2 Weakly Supervised Methods
Weakly supervised object localization (WSOL) learns to localize the object with only image-level labels. It is more attractive since image-level labels are much easier and cheaper to obtain than object level labels. Weakly supervised detection (WSOD) tries to give the object location and category simultaneously when training images only have image-level labels.
WSOL assumes that there is only one object of the specified category in the whole image. Based on this assumption, many methods have been proposed to push the limit of WSOL. CAM first generates class activation maps with the global average pooling layer and the final fully connected layer (the weights of the classifier). Grad-CAM uses gradients rather than output features to generate more accurate class response maps. Besides these methods which focus on improving class response maps, some other methods try to make the classification model more suitable for localization tasks. HaS randomly erases some regions in the input image to force the network to be meticulous for WSOL. ACoL uses two parallel classifiers with dynamic erasing and adversarial learning to discover complementary object regions more effectively. SPG generates self-produced guidance masks to localize the whole object. ADL proposes the importance map and the drop mask, with a random selection mechanism to achieve a balance between classification and localization.
WSOD does not have the one-object-per-class restriction. However, WSOD often needs methods to generate region proposals, such as selective search and edge boxes, which cost considerable computational resources and time. Furthermore, current WSOD detectors use high-resolution inputs to output the bounding box, which also results in a heavy computational burden. Thus, most WSOD methods are difficult to apply to large-scale datasets.
In this section, we discuss the drawbacks of the current WSOL pipeline and propose our pseudo supervised object localization (PSOL) method.
3.1 A paradigm shift from WSOL to PSOL
Current WSOL methods can generate the bounding box for a given class label. However, the community has identified serious drawbacks of this pipeline.
The learning objective is indirect, which hurts the model's performance on localization tasks. HaS and ADL show that localization is not compatible with classification within a single CNN model. Localization tries to localize the whole object while classification tries to classify it. The classification model often localizes only the most discriminative part of the object in an image.
Offline CAM needs a thresholding parameter and has to store the three-dimensional feature map for further computation. The threshold value is tricky and hard to determine.
Those drawbacks make WSOL hard to apply to real-world applications.
Encouraged by the class-agnostic process of generating regions of interest (ROI) in selective search and Faster-RCNN, we divide WSOL into two sub-tasks: class-agnostic object localization and object classification. Based on these two sub-tasks, we propose our PSOL method. PSOL directly optimizes the model on generated pseudo ground-truth bounding boxes, and does not need to generate bounding boxes implicitly. Hence, it removes the restrictions and drawbacks of previous WSOL methods, and it is a paradigm shift for WSOL.
3.2 The PSOL Method
The general framework of our PSOL is in Algorithm 1, which we will introduce step by step. We discuss the details of generating pseudo ground-truth bounding boxes in Sec 3.2.1, then the localization method used in our model in Sec 3.2.2. For the classification method, we directly use pre-trained models from the computer vision community.
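The three-step framework can be sketched at a high level as follows. This is an illustrative Python sketch, not the paper's code; all function names are hypothetical placeholders for the components described above (a class-agnostic pseudo-box generator such as DDT, a box regressor, and an independent classifier).

```python
# A high-level sketch of the PSOL pipeline: pseudo boxes are generated
# without class labels, then localization and classification are
# trained as two independent sub-tasks and combined only at inference.

def psol_pipeline(train_images, train_labels, test_images,
                  generate_pseudo_boxes, train_regressor, train_classifier):
    # Step 1: generate pseudo ground-truth boxes (class-agnostic, e.g. DDT).
    pseudo_boxes = generate_pseudo_boxes(train_images)
    # Step 2: train a class-agnostic localization model on the noisy boxes.
    localizer = train_regressor(train_images, pseudo_boxes)
    # Step 3: train the classifier independently on image-level labels.
    classifier = train_classifier(train_images, train_labels)
    # Inference: combine the two independent outputs per test image.
    return [(localizer(img), classifier(img)) for img in test_images]
```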
3.2.1 Bounding Box Generation
The critical difference between WSOL and our PSOL is the generation of pseudo bounding boxes for training images without box annotations. Detection is a natural choice for this task since detection models can provide bounding boxes and classes directly. However, the largest detection dataset only has 80 classes, and it cannot provide a general object localizer for datasets with many classes such as ImageNet-1k. Furthermore, current detectors like Faster-RCNN need substantial computational resources and large input image sizes (e.g., shorter side = 600 when testing). These issues prevent detection models from being applied to generate bounding boxes on large-scale datasets.
Without detection models, we can try some localization methods to output bounding boxes for training images directly. Some weakly and co-supervised methods can generate noisy bounding boxes, and we will give a brief introduction to them.
WSOL methods often follow this pipeline to generate the bounding box for an image. First, the image $I$ is fed into the network $f$, and the final feature map (often the output of the last convolutional layer) $F \in \mathbb{R}^{H \times W \times D}$ is generated, where $H$, $W$ and $D$ are the height, width and depth of the final feature map. Then, after global average pooling and the final fully connected layer, the predicted label $\hat{y}$ is produced. According to the predicted label $\hat{y}$ or the ground-truth label $y$, we can get the weight vector $w_c \in \mathbb{R}^{D}$ of the specific class $c$ in the final fully connected layer. Then, each spatial location of $F$ is channel-wise weighted and summed to get the final heat map for the specific class: $M_{i,j} = \sum_{d=1}^{D} w_c^{d} F_{i,j,d}$. Finally, $M$ is upsampled to the original input size, and thresholding is applied to generate the final bounding box.
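The CAM-style pipeline above can be sketched in a few lines of NumPy. The feature map, class weights and threshold ratio below are toy stand-ins, not values from the paper; the upsampling is nearest-neighbor for brevity.

```python
# Minimal sketch of CAM-style bounding box generation: weight the final
# feature map by the class weights, upsample, threshold, take the tight
# box around above-threshold pixels.
import numpy as np

def cam_to_bbox(F, w_c, out_size, thresh_ratio=0.2):
    """F: (H, W, D) final feature map; w_c: (D,) class weight vector."""
    M = F @ w_c                                   # (H, W) class heat map
    H_out, W_out = out_size
    # Nearest-neighbor upsampling to the input resolution.
    ys = np.arange(H_out) * M.shape[0] // H_out
    xs = np.arange(W_out) * M.shape[1] // W_out
    M_up = M[ys][:, xs]
    # Threshold relative to the max activation, then take the tight box.
    mask = M_up >= thresh_ratio * M_up.max()
    rows, cols = np.where(mask)
    return cols.min(), rows.min(), cols.max(), rows.max()

F = np.random.rand(7, 7, 512)                     # toy activations
w = np.random.rand(512)                           # toy class weights
box = cam_to_bbox(F, w, (224, 224))
```

As the paper notes, the threshold ratio is a sensitive hyperparameter, and the full 3D feature map must be kept around for this offline computation.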
Some co-supervised methods can also achieve good performance on localization tasks. Among them, DDT has good performance with little computational resource requirement, so we use DDT as an example. Here is a brief recap of DDT. Given an image set $S = \{I_1, I_2, \dots, I_N\}$ with $N$ images, where each image has the same label, or contains an object of the same category, the final feature map $F_n \in \mathbb{R}^{H \times W \times D}$ of each image $I_n$ is generated with a pre-trained model $f$. Then these feature maps are gathered into a large descriptor set $X \in \mathbb{R}^{NHW \times D}$. Principal component analysis (PCA) is applied along the depth dimension. After the PCA process, we get the eigenvector $v \in \mathbb{R}^{D}$ with the largest eigenvalue. Then, each spatial location of $F_n$ is channel-wise weighted and summed to get the final heat map $M_n$: $M_{n,i,j} = \sum_{d=1}^{D} v^{d} F_{n,i,j,d}$. Then $M_n$ is upsampled to the original input size. Zero thresholding and max connected component analysis are applied to generate the final bounding box.
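The core of DDT, PCA over the deep descriptors of a same-category image set followed by a zero-thresholded projection, can be sketched as below. Random features stand in for real CNN activations; the paper's max-connected-component step and upsampling are omitted, and the sign of the principal eigenvector is left unresolved in this sketch.

```python
# Minimal sketch of the DDT heat-map computation: stack all spatial
# descriptors of the image set, take the first principal direction of
# their covariance, and project each location onto it (zero threshold).
import numpy as np

def ddt_heatmaps(feats):
    """feats: list of (H, W, D) feature maps from same-category images."""
    D = feats[0].shape[-1]
    X = np.concatenate([f.reshape(-1, D) for f in feats], axis=0)
    mean = X.mean(axis=0)
    cov = (X - mean).T @ (X - mean) / X.shape[0]   # (D, D) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    v = eigvecs[:, -1]                             # largest-eigenvalue vector
    # Positive responses indicate the common object (zero thresholding).
    return [np.maximum((f - mean) @ v, 0) for f in feats]

feats = [np.random.rand(7, 7, 64) for _ in range(8)]   # toy descriptors
maps = ddt_heatmaps(feats)
```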
We will generate pseudo bounding boxes using both WSOL methods and DDT method, and evaluate their suitability.
3.2.2 Localization Methods
After generating bounding boxes, we have bounding box annotations for each training image. Then it is natural to perform object localization with these generated boxes. As shown before, detection models are too heavy for this task. Thus, it is natural to perform bounding box regression. Previous fully supervised works [18, 17] suggest two methods of bounding box regression: single-class regression (SCR) and per-class regression (PCR). PCR is strongly tied to the class label. Since we advocate that localization is a class-agnostic rather than a class-aware task, we choose SCR for all our experiments.
We follow previous work to perform bounding box regression. Suppose the bounding box is in $(x, y, w, h)$ format, where $(x, y)$ is the top-left coordinate of the bounding box and $(w, h)$ are its width and height. We first transform $(x, y, w, h)$ into $(t_x, t_y, t_w, t_h)$, where $t_x = x/W$, $t_y = y/H$, $t_w = w/W$ and $t_h = h/H$, and $H$ and $W$ are the height and width of the input image. We use a sub-network with two fully connected layers and corresponding ReLU layers for regression. Finally, the outputs are sigmoid activated. We use the mean squared error ($\ell_2$) loss for the regression task.
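The coordinate encoding and loss above can be sketched as follows; normalizing by the input size keeps all four targets in $(0, 1)$, matching the sigmoid outputs. The sample box and prediction below are made-up values for illustration.

```python
# Sketch of the SCR target encoding and MSE loss: pixel coordinates
# (x, y, w, h) are normalized by image width/height so a sigmoid head
# can regress them directly.
import numpy as np

def encode_box(box, img_h, img_w):
    x, y, w, h = box
    return np.array([x / img_w, y / img_h, w / img_w, h / img_h])

def mse_loss(pred, target):
    return float(np.mean((pred - target) ** 2))

target = encode_box((20, 30, 100, 150), img_h=224, img_w=224)
pred = np.array([0.1, 0.15, 0.45, 0.65])   # hypothetical sigmoid output
loss = mse_loss(pred, target)
```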
Step 2 and step 3 in Algorithm 1 may be combined, i.e., the localization model and the classification model can be integrated into a single model, which is jointly trained with classification labels and generated bounding boxes. However, we will show empirically that the localization and classification models should be separated.
4.1 Experimental Setups
We evaluate our proposed method on two common WSOL datasets: ImageNet-1k and CUB-200. ImageNet-1k is a large dataset with 1000 classes, containing 1,281,167 training images and 50,000 validation images. Bounding box annotations are incomplete for training images and complete for validation images. In this paper, we do not use any accurate training bounding box annotations. In our experiments, we generate pseudo bounding boxes on training images by previous methods; detailed ablation studies are in Sec 5.1. We train all models on generated bounding box annotations and classification labels, and test them on the validation set.
The CUB-200 dataset contains 200 categories of birds with 5,994 training images and 5,794 testing images. Each image in the dataset has an accurate bounding box annotation. We follow the same strategies as on ImageNet-1k to train and test models.
We use three metrics to evaluate our models, following previous state-of-the-art methods [30, 2]: Top-1/Top-5 localization accuracy (Top-1/Top-5 Loc) and localization accuracy with known ground-truth class (GT-Known Loc). GT-Known Loc is correct when, given the ground-truth class, the intersection over union (IoU) between the ground-truth bounding box and the predicted box is 50% or more. Top-1 Loc is correct when the Top-1 classification result and GT-Known Loc are both correct. Top-5 Loc is correct when, among the Top-5 predictions of labels and bounding boxes, there is at least one prediction for which both the classification and localization results are correct.
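The IoU criterion underlying all three metrics can be sketched as follows, using toy boxes in $(x_0, y_0, x_1, y_1)$ corner format:

```python
# Intersection-over-union between two corner-format boxes, and the
# >= 0.5 correctness test used by GT-Known / Top-1 / Top-5 Loc.
def iou(a, b):
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def loc_correct(pred_box, gt_box, thresh=0.5):
    return iou(pred_box, gt_box) >= thresh
```

For Top-1 Loc, `loc_correct` would additionally be conjoined with the Top-1 classification correctness.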
We prepare several baseline models for evaluating our method on localization tasks: VGG16, InceptionV3, ResNet50 and DenseNet161. Previous methods try to enlarge the spatial resolution of the feature map [28, 29, 2]; we do not use this technique in our PSOL models. Previous WSOL methods need the classification weights to turn a 3D feature map into a 2D spatial heat map. In PSOL, however, we do not need the feature map for localization: our model directly outputs the bounding box. For a fair comparison, we modified VGG16 into two versions: VGG-GAP and VGG16. VGG-GAP replaces all fully connected layers in VGG16 with GAP and a single fully connected layer, while VGG16 keeps the original structure. For other models, we keep the original structure of each model. For regression, we use a two-layer fully connected network with corresponding ReLU layers to replace the last layer of the original networks, as illustrated in Sec 3.2.2.
Joint and Separate Optimization
In the previous section, we discussed the problem of joint optimization of classification and localization tasks. To ablate this issue, we prepare several models for each base model. For joint optimization, we add a new bounding box regression branch to the model (-Joint models) and train it with both generated bounding boxes and class labels simultaneously. For separate optimization, we replace the classification part with the regression part (-Sep models), then train the two models separately, i.e., localization models are trained with only generated bounding boxes while classification models are trained with only class labels. The hyperparameters are kept the same for all models.
4.2 Implementation Details
We use the PyTorch framework with TitanX Pascal GPUs support. For all models, we use pre-trained classification weights on ImageNet-1k and fine-tune on target localization and classification tasks.
For experiments on ImageNet-1k, the hyperparameters are set the same for all models: batch size 256, 0.0005 weight decay and 0.9 momentum. We fine-tune all models with an initial learning rate of 0.001. Added components (like the regression sub-network) have a larger learning rate due to their random initialization. We train 6 epochs on ImageNet and 30 epochs on CUB-200. For localization-only tasks, we keep the learning rate fixed for all epochs, because the DDT-generated bounding boxes are noisy and contain many inaccurate or even totally wrong boxes; prior work on learning with noisy labels suggests retaining large learning rates for noisy data. For classification-related tasks (including single classification, and joint classification and localization), we divide the learning rate by 10 every 2/10 epochs on ImageNet/CUB-200.
For testing, we use ten-crop augmentation to produce the final classification results on ImageNet, following prior work, and single-crop classification results on CUB-200; we use single image inputs for all our localization results. We use the center crop technique to get the image input, e.g., resize to 256×256 then center crop to 224×224 for most models, except InceptionV3 (resize to 320×320 then center crop to 299×299), following the setup in [2, 27]. For state-of-the-art classification models, we also follow the input size in their papers, e.g., 600 for EfficientNet-B7.
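The resize-then-center-crop preprocessing above (e.g., 256 → 224) can be sketched as below. Resizing is nearest-neighbor here for brevity; real pipelines use bilinear interpolation, and the image is a zero array rather than a real input.

```python
# Sketch of resize-to-256 then center-crop-to-224 test-time preprocessing.
import numpy as np

def resize_nn(img, size):
    h, w = img.shape[:2]
    ys = np.arange(size) * h // size      # nearest-neighbor row indices
    xs = np.arange(size) * w // size      # nearest-neighbor col indices
    return img[ys][:, xs]

def center_crop(img, size):
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

img = np.zeros((480, 640, 3), dtype=np.uint8)   # toy input image
out = center_crop(resize_nn(img, 256), 224)
```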
Previous WSOL methods can provide multiple boxes for a single image with different labels. However, our SCR model provides only one bounding box per image. Thus, we combine the output bounding box with the Top-1/Top-5 classification outputs of baseline models (-Sep models) or with the outputs of the classification branch (-Joint models) to get the final predictions on test images.
For experiments on CUB-200, we change the batch size from 256 to 64, and keep other hyperparameters the same as ImageNet-1k.
5 Results and Analyses
In this section, we will provide empirical results, and perform detailed analyses on them.
5.1 Ablation Studies on How to Generate Pseudo Bounding Boxes
(Table 1: Top-1 Loc, Top-5 Loc and GT-Known Loc of different pseudo bounding box generation methods on ImageNet-1k and CUB-200.)
Previous WSOL methods can generate bounding boxes given ground-truth labels. Some co-localization methods can also provide bounding boxes for a given class label. Since some annotations are missing in ImageNet-1k training images, we test these methods on the validation/test sets of ImageNet-1k and CUB-200 to choose a better method to generate pseudo bounding boxes for PSOL. For the DDT method, we first resize the training images to a fixed resolution, then perform DDT on them. According to the statistics collected on training images, we generate bounding boxes on test images with the correct class label. For other WSOL methods, we follow the original instructions in their papers and use pre-trained models to generate bounding boxes on validation/test images with the correct class label.
We list the GT-Known Loc of DDT and weakly supervised localization methods in Table 1. As shown in Table 1, DDT achieves results comparable to WSOL methods on ImageNet-1k, but better than all WSOL methods on CUB-200. The DDT results on CUB-200 indicate that object localization should not be tied to classification labels. Furthermore, these WSOL methods need large computational resources, e.g., storing the feature map of each image and then performing the offline CAM operation to get the final bounding boxes. Compared to these methods, DDT has low computational requirements and achieves comparable results. For the base model of DDT, although DDT-DenseNet161 has higher accuracy than DDT-VGG16 on ImageNet-1k, it runs much slower due to the dense connections and has lower accuracy than DDT-VGG16 on CUB-200. Based on these observations, we choose DDT with VGG16 to generate bounding boxes on training images in PSOL.
5.2 Comparison with State-of-the-art Methods
We list experimental results in Table 2. Furthermore, we visualize bounding boxes generated by CAM , DDT  and our methods in Fig. 2. According to these results, we have the following findings.
Without any training, DDT already performs well on both CUB-200 and ImageNet. DDT-VGG16 achieves 47.31% Top-1 Loc accuracy, which has a 23% edge over WSOL models based on VGG16. Since DDT is a class-agnostic method, it suggests that WSOL should be divided into two independent sub-tasks: class-agnostic object localization and object classification.
All PSOL models with separate training perform better than PSOL models with joint training. In all five baseline models, -Sep models consistently perform better than -Joint models by large margins. These results indicate that learning with joint classification and localization is not suitable.
All our PSOL models enjoy a large edge on CUB-200 compared with state-of-the-art WSOL methods, including the DDT-VGG16 method. CUB-200 is a fine-grained dataset containing many categories of birds. Within-class variation is much larger than between-class variation in most fine-grained datasets. The exact label may not help the localization process. Hence, the co-localization method DDT performs better than previous WSOL methods.
CNNs can tolerate some incorrect annotations and retain high accuracy on validation sets. For all separate localization models, the GT-Known Loc is higher than that of DDT-VGG16. This phenomenon indicates that CNNs can tolerate some annotation errors and learn robust patterns from noisy data.
Some restrictions and rules-of-thumb in WSOL do not carry over to PSOL. In previous WSOL papers, only one final fully connected layer is allowed, and a large spatial size of the output feature map is recommended. Many methods remove the stride of the last downsampling convolution layer, which results in large FLOPs (such as SPG and VGG16-ACoL). Besides this, all three fully connected layers in VGG16 are removed, which directly hurts accuracy. However, in our experiments, VGG-Full performs significantly better than VGG-GAP. Since CAM requires GAP and only one FC layer, removing this restriction lets VGG16 achieve better performance. Another restriction is the inference path of the network. WSOL needs the output of the last convolutional layer, and thus often uses simple feed-forward networks (VGG16, GoogLeNet and InceptionV3). Complex network structures like DenseNet are not recommended and do not perform well in the WSOL problem. As we show in Table 2, CAM achieves poor performance with DenseNet161. DenseNet uses features of every block, not just the last one, to conduct classification. Thus, the semantic meaning of its last feature map may not be as clear as that of sequential networks like ResNet and VGG. However, PSOL-DenseNet models are directly trained on noisy bounding boxes, which avoids this problem. Moreover, DenseNet161 achieves the best performance.
5.3 Transfer Ability on Localization
In this section, we will discuss the transferability of different localization models.
Previous weakly supervised localization models need the exact label to generate bounding boxes, regardless of the correctness of the label. However, our proposed method does not need the label and directly generates the bounding box. So we are interested in the following questions: Is the single-object localization task transferable? Does a model trained directly on object localization have good generalization ability, like models trained on image recognition?
We perform the following experiment. We take object localization models trained on ImageNet-1k, then predict on CUB-200 test images directly, i.e., without any training or fine-tuning. We add previous WSOL methods for comparison; since they need exact labels, we fine-tune all these models. For all models marked with *, only the classification part (the last fully connected layer) is fine-tuned, i.e., features learned on ImageNet-1k are directly transferred to CUB-200. For models without *, all layers are fine-tuned on CUB-200. We take our VGG-GAP-Sep model for a fair comparison and the DenseNet161-Sep model for better results. Results are in Table 3.
| Method | Loc. training set | Test set | Localization accuracy (%) |
|---|---|---|---|
| VGG-GAP + CAM | CUB-200 | CUB-200 | 57.96 |
| VGG-GAP* + CAM | ImageNet | CUB-200 | 57.53 |
| VGG16-ACoL + CAM | CUB-200 | CUB-200 | 59.30 |
| VGG16-ACoL* + CAM | ImageNet | CUB-200 | 58.70 |
| SPG + CAM | CUB-200 | CUB-200 | 60.50 |
| SPG* + CAM | ImageNet | CUB-200 | 59.70 |
It is surprising that without any fine-tuning, PSOL object localization models transfer very well from ImageNet-1k to CUB-200, performing significantly better than previous WSOL methods, including both models which only fine-tune the classification weights (marked with *) and models which fine-tune all weights. This further indicates that object localization does not depend on classification, and that it is inessential to perform object localization with the class label. Furthermore, it demonstrates the advantage of our PSOL method.
5.4 Combining with State-of-the-art Classification
Previous methods combine localization outputs with state-of-the-art classification outputs to achieve better localization results. SPG and ACoL combine with DPN networks, including DPN-98, DPN-131 and DPN-ensemble. For a fair comparison, we also combine other models' (InceptionV3 and DenseNet161) results with DPN-131. Moreover, EfficientNet has recently achieved better results on ImageNet-1k, so we also combine our localization outputs with EfficientNet-B7's classification outputs. Results are in Table 4.
| Model | Top-1 Loc | Top-5 Loc |
|---|---|---|
| SPG + DPN131 | 55.19 | 62.76 |
| SPG + DPN-ensemble | 56.17 | 63.22 |
| PSOL-InceptionV3-Sep + DPN131 | 55.72 | 63.64 |
| PSOL-DenseNet161-Sep + DPN131 | 56.59 | 64.63 |
| PSOL-InceptionV3-Sep + EfficientNet-B7 | 57.25 | 64.04 |
| PSOL-DenseNet161-Sep + EfficientNet-B7 | 58.00 | 65.02 |
From the table we can see that our model achieves better localization accuracy on ImageNet-1k compared with SPG  and ACoL  when combining the same classification results from DPN131 . Furthermore, when combining with EfficientNet-B7 , we can achieve 58.00% Top-1 localization accuracy.
5.5 Comparison with fully supervised methods
We also compare our PSOL with fully supervised localization methods on ImageNet-1k. Fully supervised methods use training images with accurate bounding box annotations in ImageNet-1k to train their models. Results are in Table 5.
| Method | Supervision | Top-5 Loc |
|---|---|---|
| ResNet + Faster-RCNN-ensemble | full | 90.0 |
With the bounding box regression sub-network, our DenseNet161-Sep model roughly matches fully supervised AlexNet in Top-5 Loc accuracy. However, our performance is still worse than fully supervised OverFeat, GoogLeNet and VGGNet. Notably, ResNet + Faster-RCNN-ensemble achieves the best Top-5 Loc accuracy. They transfer region proposal networks trained on the ILSVRC detection track, which has 200 classes of fully labeled images, to the 1000-class localization task directly. The region proposal network shows good generalization ability across different classes without fine-tuning, which again indicates that localization is separable from classification.
6 Discussions and Conclusions
In this paper, we proposed the pseudo supervised object localization (PSOL) to solve the drawbacks in previous weakly supervised object localization methods. Various experiments show that our methods obtain a significant edge over previous methods. Furthermore, our PSOL methods have good transfer ability across different datasets without any training or fine-tuning.
For future work, we will dive deeper into the joint classification and localization problem: we will try to integrate both tasks into a single CNN model with less localization accuracy drop. Another direction is to improve the quality of the generated bounding boxes with class-agnostic methods. Finally, novel network structures or algorithms for localization should be explored, avoiding the high input resolutions and heavy computational cost that prevent current detection frameworks from being applied to large-scale datasets.
- (2017) Dual path networks. In NIPS, pp. 4467–4475.
- (2019) Attention-based dropout layer for weakly supervised object localization. In CVPR, pp. 2219–2228.
- (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223.
- (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587.
- (2015) Fast R-CNN. In ICCV, pp. 1440–1448.
- (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- (2017) Densely connected convolutional networks. In CVPR, pp. 4700–4708.
- (2012) ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105.
- (2018) CornerNet: detecting objects as paired keypoints. In ECCV, LNCS, Vol. 11218, pp. 734–750.
- (2014) Microsoft COCO: common objects in context. In ECCV, LNCS, Vol. 8693, pp. 740–755.
- (2016) SSD: single shot multibox detector. In ECCV, LNCS, Vol. 9905, pp. 21–37.
- (1901) On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11), pp. 559–572.
- (2016) You only look once: unified, real-time object detection. In CVPR, pp. 779–788.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99.
- (2015) ImageNet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252.
- (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In ICCV, pp. 618–626.
- (2014) OverFeat: integrated recognition, localization and detection using convolutional networks. In ICLR, pp. 1–15.
- (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, pp. 1–14.
- (2017) Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization. In ICCV, pp. 3544–3553.
- (2015) Going deeper with convolutions. In CVPR, pp. 1–9.
- (2016) Rethinking the Inception architecture for computer vision. In CVPR, pp. 2818–2826.
- (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML, pp. 6105–6114.
- (2018) Joint optimization framework for learning with noisy labels. In CVPR, pp. 5552–5560.
- (2013) Selective search for object recognition. IJCV 104 (2), pp. 154–171.
- (2011) The Caltech-UCSD birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
- (2019) Unsupervised object discovery and co-localization by deep descriptor transformation. Pattern Recognition 88, pp. 113–126.
- (2019) CutMix: regularization strategy to train strong classifiers with localizable features. In ICCV, pp. in press.
- (2016) Learning deep features for discriminative localization. In CVPR, pp. 2921–2929.
- (2019) Objects as points. arXiv preprint arXiv:1904.07850.
- (2014) Edge boxes: locating object proposals from edges. In ECCV, LNCS, Vol. 8693, pp. 391–405.