Weakly-Supervised Object Localization (WSOL) has attracted extensive research efforts in recent years [1, 3, 12, 11, 20, 21, 23, 27, 2, 35, 38]. It aims to infer object locations by training with only image-level labels rather than pixel-level annotations, which can greatly reduce the human labor cost of annotating images. This task is challenging since no guidance on the target object's position is provided.
The most popular line of work tries to find cues from existing classification models. For example, Zhou et al. introduce a Global Average Pooling (GAP) layer to generate Class Activation Maps (CAMs) in the top layers, which highlight high-probability positions of target objects. However, CAMs usually detect only the most discriminative part of an object, which is far from enough to cover the entire object for precise localization.
Therefore, various methods have been proposed to improve the power of CAMs. For example, to enlarge the area localized by CAMs, several adversarial erasing approaches have been proposed [40, 43, 35]. These methods usually build new CAMs on top of the original ones to search for additional valuable areas. To encourage the new CAMs to focus on different regions, the common approach is to erase part of the original image or of an internal feature map by directly manipulating the corresponding values. Despite the appealing idea, these methods add artifacts to image features and cannot guarantee a better result if background noise is introduced. Unlike the previous methods, Zhang et al. propose a self-supervised model, called SPG, that adds extra convolutional layers to convert internal features into pixel-level supervision and achieves the best performance so far. However, the model requires a very deep structure and a costly training process to extract a large number of targeted pixels from several feature maps. Worse, throughout training, at least four predefined thresholds are needed to segment different feature maps in order to extract foreground and background pixels. These thresholds, selected as hyper-parameters, may produce ambiguous supervision for many samples and ultimately prevent the network from achieving its best performance.
The progress and issues in existing WSOL methods have inspired us to design a more effective method that combines the advantages of previous approaches. In fact, we notice that most methods generate new CAMs or feature maps from internal layers of the backbone network. However, there is no general rule about which strategy should be chosen to make these feature maps interact with each other. Some methods encourage them to differ in order to find more related areas, while others treat them as alternate supervision for each other so that they are enlarged together. Therefore, in this work, we propose a dynamically adapted strategy for multiple CAMs to interact with each other, letting them determine automatically which strategy each pixel should apply: becoming more similar or more different.
Our method is inspired by the distribution of values in CAMs. As shown in Figure 1, a CAM is generated for the raw image on the left; its pixels with the highest and lowest values appear in relatively yellow and black colors, respectively. These pixels show the strong confidence of the CAM that they belong to the target object or to the background, which makes them good signals for other CAMs to follow and resemble. On the other hand, pixels lying between these two kinds have intermediate values, which are untrustworthy and reflect the ambiguity of the CAM at the corresponding positions. Therefore, these pixels can encourage other CAMs to stay suspicious and explore more. Figure 1(c) is a binary mask that assigns one value to high-confidence pixels and another to ambiguous ones. Such a method may look similar to earlier threshold-based self-supervision, but it learns segmentation thresholds during training instead of using predefined values. Besides, we also utilize the ambiguous pixels that are not highlighted in the mask to encourage other CAMs to be adversarial, whereas earlier work simply discards these pixels. We discuss the mathematical details and demonstrate the method's generality over the current state-of-the-art method in Section 3.
Prior co-segmentation methods regard different parts of samples in the same category as one group, like the hands of one person and the legs of another person in different images. However, they only apply this strategy to single-category datasets, while we further expand it, as an inter-sample criterion function, to WSOL tasks. Our module first generates feature vectors representing object and background regions from raw samples according to CAMs, then pulls object vectors belonging to the same category closer while pushing background regions far away. Besides, we also apply background regulation for each category separately in our task. With such a metric learning strategy, we force the network to consider fragments of foreground object regions in different images as a more complete object and also prevent background noise from getting involved.
In summary, our main contributions are three-fold: (1) We propose a novel strategy that encourages multiple CAMs to interact with each other dynamically. The strategy can be shown mathematically to be a generalization of previous methods. (2) We further introduce an inter-sample criterion module that integrates different discriminative regions of objects from different samples and reduces the influence of background noise. (3) With only image-level supervision for training, our method greatly outperforms state-of-the-art methods on two standard benchmarks, the ILSVRC validation set and CUB-200-2011, in weakly supervised localization performance.
2 Related Work
Fully supervised detection methods have been intensively studied and have achieved extraordinary success [17, 9, 24, 14, 15, 30, 28]. R-CNN and Fast R-CNN extract region proposals and feed them into classifiers. Afterwards, Faster R-CNN, one of the most effective methods for object detection, combines a region proposal network with a convolutional neural network to localize objects in images with bounding boxes. Moreover, SSD and YOLO are designed specifically to speed up the detection process and reach high performance in real-time detection. Despite the success of the approaches above, they all require a vast number of bounding box annotations during training, which are very expensive to create manually. Besides, pixel-level annotations are ambiguous since there is no common rule for annotating, especially when it comes to pixels around object edges.
Class Activation Maps (CAMs) therefore serve as a weakly-supervised approach to object localization. Zhou et al. add a Global Average Pooling (GAP) layer to deep neural networks to generate CAMs that are used to localize objects. Building on this, Wei et al. propose an adversarial erasing approach that expands CAMs with additional object regions. With a similar purpose, Zhang et al. propose the ACoL network, which adopts a cut-and-search strategy on the feature map and further proves that CAMs can be obtained end-to-end. Moreover, Zhang et al. propose the SPG network, which adds pixel-level self-supervision to feature maps at different levels and is the current state-of-the-art method for this task.
There are also other methods that are more related to model interpretability but can also be applied to object localization. Selvaraju et al. combine gradient values with the original feature maps to produce gradient-based CAMs (Grad-CAM) without changing the network structure. Chattopadhyay et al. further refine Grad-CAM by using a weighted combination of the positive partial derivatives of the last convolutional layer's feature maps. These methods mainly aim to propose new CAMs that can interpret models on various tasks. In our work, by contrast, we focus on the object localization task and try to improve performance through more reasonable interaction among multiple CAMs; although both lines of work utilize CAMs, their purposes are entirely different.
In this section, we first review the seminal Class Activation Maps (CAMs), then introduce our dynamic adapted CAMs along with the inter-sample criterion module. The overview of our proposed method for the training phase is shown in Fig. 2.
We first describe the weakly supervised object localization problem and some common modules in the basic network for generating CAMs. Given a set of images covering objects of multiple categories, our goal is to classify each image into a category and locate the corresponding objects with bounding boxes. Taking an existing CAM-based method as an example, for an input image, the Fully Convolutional Network (FCN) produces feature maps with different channel numbers and spatial sizes at different layers. To calculate CAMs, the network usually applies a classification block, consisting of multiple fully convolutional layers, to the last convolutional feature map from the backbone network, transforming its number of channels to the number of categories.
Following that, a Global Average Pooling (GAP) layer is applied to each channel to generate a class logit, which is then used to compute the cross-entropy loss. This process can be written as:
Here, the classification block maps the last feature map to one channel per category, and the average is taken over every pixel position of each channel. Each channel of the resulting map has relatively high values at positions that may correspond to the target object.
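The pooling step described above can be illustrated with a minimal sketch. It assumes the CAMs are given as plain nested lists with one H x W map per category; the function names `global_average_pool` and `cross_entropy` are hypothetical and are not taken from the paper.

```python
import math


def global_average_pool(cams):
    """Average every channel (an H x W map) into one class logit."""
    logits = []
    for channel in cams:
        values = [v for row in channel for v in row]
        logits.append(sum(values) / len(values))
    return logits


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (numerically stable form)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Toy CAMs for two categories on a 2 x 2 grid (illustrative values only).
cams = [
    [[0.0, 1.0], [1.0, 2.0]],  # category 0
    [[4.0, 4.0], [4.0, 4.0]],  # category 1
]
logits = global_average_pool(cams)
loss = cross_entropy(logits, label=1)
```

Because each logit is just the mean of its channel, a channel can only raise its logit by producing high values somewhere on the map, which is why those positions tend to align with the object.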
However, in this basic framework, the classification block and the GAP layer are only applied after the last convolutional feature map of the backbone network, which only captures the most discriminative part within the largest receptive field. To address this, most previous methods generate new CAMs or feature maps from additional convolutional layers. These new feature maps can search for extra areas of target objects beyond those already highlighted by the original CAMs, or they can be regulated under the supervision of the original CAMs and in turn refine the backbone network. These two approaches, though seemingly different and even contradictory, can be merged in our method and thereby improve every pixel in the final CAMs.
3.2 Interactive Class Activation Maps
We assume there are two CAMs generated by different classifiers built on the same backbone network. The first classifier is appended after the final convolutional block, as described in the previous subsection. The other classifier, with the same structure and output dimensions, is inserted inside the backbone network. We can denote these two CAMs as:
We first obtain the CAM generated by the classifier that is inserted at one of the backbone convolutional layers. The pixel values in this CAM vary over a large range, from the lowest values on background pixels to the highest values on the target object, as partly shown in Figure 1(b). If a value is much higher or lower than the others, we consider the CAM to be highly confident that the corresponding pixel is correct. This idea is similar to earlier work that extracts pixels with extreme values in CAMs to supervise internal feature maps. However, instead of setting many thresholds to separate CAMs as in that work, we calculate the distance between each pixel and the average value. This distance, after normalization, reveals how confident the CAM is about each of its pixels. The process can be denoted as:
Here we use the average value of the CAM and the distance of each pixel to that average, computed over the total number of pixels in the CAM. We then calculate the average of these per-pixel distances. Finally, we determine the strategy of each pixel, i.e., adversarial or cooperative, by checking whether its mask value is larger than this average distance. If it is larger, the CAM is strongly confident that the corresponding pixel is correct, and the mask value is kept unchanged. Otherwise, the mask value is changed to a negative value, denoting great ambiguity at that pixel.
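A rough sketch of this mask construction follows. Several details are assumptions made for illustration rather than the paper's exact formulation: the distance is taken as an absolute difference, it is normalized by the largest distance in the map, and an ambiguous pixel gets the negated distance as its mask value. The name `confidence_mask` is hypothetical.

```python
def confidence_mask(cam):
    """Return (mask, mean_value, mean_distance) for one H x W CAM.

    Assumed rule: mask = +d where the normalized distance d to the mean
    exceeds the average distance (confident pixel), and -d otherwise
    (ambiguous pixel).
    """
    values = [v for row in cam for v in row]
    n = len(values)
    mean_value = sum(values) / n

    # Assumption: absolute difference, normalized by the largest distance.
    raw = [[abs(v - mean_value) for v in row] for row in cam]
    largest = max(max(row) for row in raw) or 1.0  # guard for a constant map
    dist = [[d / largest for d in row] for row in raw]

    mean_distance = sum(d for row in dist for d in row) / n
    mask = [[d if d > mean_distance else -d for d in row] for row in dist]
    return mask, mean_value, mean_distance


# Two extreme pixels (0.0 and 1.0) and two mid-range pixels (0.4 and 0.6).
mask, mean_value, mean_distance = confidence_mask([[0.0, 0.4], [0.6, 1.0]])
```

In this toy map the extreme pixels end up with positive mask values and the mid-range pixels with negative ones, and no threshold had to be chosen by hand: the cut-off is the map's own average distance.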
Once we have constructed the mask, we generate another CAM and make the two interact with each other. We first calculate the per-pixel distance between the two CAMs and then multiply it by the corresponding mask value calculated before. A positive final value means we would like the two pixels in the different CAMs to become more similar, while a negative value means we would like them to become more different. This encourages the second CAM to explore more on ambiguous values while remaining unchanged on the confident areas of the first. The objective function is denoted as:
In this way, the confident values of the first CAM serve as supervision without introducing any predefined threshold. In fact, the average value and the average distance already give us two thresholds after transforming Equation 3, which can be denoted as:
These two thresholds are used to extract foreground and background pixels from the CAMs, respectively. Earlier work sets such values as predefined percentage parameters, whereas in our method they are determined independently for each sample from the CAM itself, which makes our formulation a more general way of separating CAMs. Besides, rather than simply discarding the remaining uncertain values, we use them to encourage the second CAM to explore different areas, which is the main idea of previous adversarial methods. Therefore, all pixels in one CAM contribute to the generation of the other, and the method merges two different strategies for refining CAMs into a single one.
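The masked interaction between the two CAMs can be sketched as below. The squared difference used as the per-pixel distance and the averaging over pixels are assumptions for illustration; `interaction_loss` is a hypothetical name.

```python
def interaction_loss(mask, cam_a, cam_b):
    """Mean over pixels of mask * (cam_a - cam_b)^2.

    Minimizing a term with a positive mask value pulls the two pixels
    together (cooperative); a negative mask value rewards a larger gap
    (adversarial).
    """
    total, count = 0.0, 0
    for mask_row, row_a, row_b in zip(mask, cam_a, cam_b):
        for m, a, b in zip(mask_row, row_a, row_b):
            total += m * (a - b) ** 2  # assumed distance: squared difference
            count += 1
    return total / count


# One confident pixel (mask 1.0) and one ambiguous pixel (mask -0.5).
example_mask = [[1.0, -0.5]]
first_cam = [[1.0, 0.5]]
second_cam = [[0.0, 0.0]]
value = interaction_loss(example_mask, first_cam, second_cam)
```

In practice the mask would be treated as a constant when back-propagating, so that gradients only move the CAM being refined; that choice is also an assumption here.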
Both classifiers are trained with image-level labels and Cross Entropy loss. However, since the classifier appended after the final convolutional layer achieves better classification performance than the inserted one, we also use its classification result as additional supervision for the inserted classifier, which is denoted as:
Therefore, the total loss for each training sample can be shown as:
Here, the classification terms use Cross Entropy loss, and the last two losses are scaled by a linearly increasing weight that prevents them from influencing the network at the beginning of training.
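A minimal sketch of such a warm-up weight and the combined loss is given below. The ramp bounds (0.0 to 1.0), the step-based schedule, and the names `ramp_weight` and `total_loss` are illustrative assumptions, since the exact values are not given in the text.

```python
def ramp_weight(step, total_steps, start=0.0, end=1.0):
    """Linearly increase a weight from `start` to `end` over `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def total_loss(ce_final, ce_inserted, interaction, extra_supervision, weight):
    """Two cross-entropy terms plus the two weighted auxiliary losses."""
    return ce_final + ce_inserted + weight * (interaction + extra_supervision)


# Early in training the auxiliary terms barely contribute.
early = total_loss(1.0, 2.0, 0.5, 0.5, ramp_weight(0, 100))
# Halfway through the ramp they contribute at half strength.
halfway = total_loss(1.0, 2.0, 0.5, 0.5, ramp_weight(50, 100))
```

Keeping the weight small at first lets both classifiers produce meaningful CAMs before those CAMs are used to constrain each other.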
At inference time, we obtain two CAMs from the different classifiers and combine them. The feature map corresponding to the class with the highest predicted score is extracted and upsampled to the size of the raw test image. We follow the strategy of prior work to calculate the resulting bounding boxes. In detail, we first segment the foreground with a fixed threshold and then seek the tight bounding box covering the largest connected area of foreground pixels.
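The box extraction step can be sketched as follows, assuming the combined CAM is already upsampled and given as a nested list, and assuming 4-connectivity for the connected regions. The function name `largest_component_bbox` is hypothetical.

```python
from collections import deque


def largest_component_bbox(cam, threshold):
    """Tight box (row_min, col_min, row_max, col_max) around the largest
    4-connected region whose values exceed `threshold`, or None."""
    h, w = len(cam), len(cam[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or cam[r][c] <= threshold:
                continue
            # Breadth-first search over one foreground region.
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and cam[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    if not best:
        return None
    rows = [y for y, _ in best]
    cols = [x for _, x in best]
    return min(rows), min(cols), max(rows), max(cols)


cam = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.8, 0.7],
    [0.0, 0.0, 0.9, 0.6],
]
box = largest_component_bbox(cam, threshold=0.5)
```

Here the isolated high value in the top-left corner is ignored because the four-pixel region on the right is larger.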
3.3 Inter-sample Criterion Module
Besides applying interactive CAMs to each sample, we further develop an inter-sample criterion function, inspired by co-segmentation tasks [18, 39], to regulate the CAMs in our method. The key idea of previous methods is that if object regions belonging to the same category are extracted from CAMs in multiple samples, they can be considered similar features in a high-dimensional space. Building on that idea, we not only consider foreground object parts to be similar but also assume that background pixels surrounding objects of the same category share some characteristics. Besides, we extend the idea to regulate CAMs across multiple categories rather than a single one, as shown in Figure 2.
In detail, for the CAMs defined above, we first extract the feature map that corresponds to the ground-truth label index, upsample it to the spatial size of the raw input image, and multiply the two element-wise. The resulting combination can be considered a weighted image that focuses more on the corresponding objects. We apply a similar strategy to obtain background-focused images, except that the feature map is computed by taking the maximum value at each position across channels and deriving the background probability from it by subtraction. The whole process can be denoted as:
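As an illustration only, the weighting step might look like the sketch below. It assumes the CAMs are already normalized to the range 0 to 1 and upsampled to the image size, that the image is a single-channel grid, and that the background probability is one minus the per-position maximum across channels; the constant is not given in the text, so this last point in particular is an assumption. The function names are hypothetical.

```python
def foreground_background_weights(cams, label):
    """Return (foreground_map, background_map) from per-category CAMs.

    Assumption: background = 1 - max over categories at each position,
    with CAM values normalized to [0, 1].
    """
    foreground = cams[label]
    h, w = len(foreground), len(foreground[0])
    background = [
        [1.0 - max(channel[y][x] for channel in cams) for x in range(w)]
        for y in range(h)
    ]
    return foreground, background


def weight_image(image, weights):
    """Element-wise product of a single-channel image and a weight map."""
    return [
        [pixel * wt for pixel, wt in zip(img_row, w_row)]
        for img_row, w_row in zip(image, weights)
    ]


cams = [[[0.75, 0.25]], [[0.5, 0.0]]]  # two categories, 1 x 2 grid
image = [[2.0, 4.0]]
fg_map, bg_map = foreground_background_weights(cams, label=0)
fg_image = weight_image(image, fg_map)
bg_image = weight_image(image, bg_map)
```

The two weighted images would then be passed to the feature extractor that produces the foreground and background vectors.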
These two kinds of combinations are then transformed into feature vectors by another convolutional network. We then apply two metric learning strategies to constrain the distances among them. That is, we would like the foreground object features, and likewise the background features, of the same category to be closer, while pushing foreground and background pairs further apart in a crossing way. The whole process for one category can be denoted as:
Here, the terms range over pairs of different samples and involve an identification score between two different metrics. The method is similar to earlier co-segmentation work, but instead of applying feature maps from generators, we use multiple CAMs and further add a metric strategy that pulls the background features of same-category samples closer in order to expand the distance among different categories.
Finally, we aggregate the per-category losses over all categories and obtain the loss function as:
where C refers to the number of categories. During training, we optimize both losses together, and we remove this module in the testing phase.
3.4 Implementation Details
We build our model on VGG16, which is commonly used in classification tasks. We first build a convolutional classifier by combining two convolutional layers and one GAP layer to generate CAMs and classification results, respectively. Then we add two identical classifiers after different layers of the backbone network and also modify the backbone to maintain the same spatial size for the different feature maps. For our inter-sample criterion module, we apply AlexNet to process the combination images into feature vectors. The network is fine-tuned from weights pre-trained on ImageNet for both the ILSVRC and CUB datasets. We train the model with an initial learning rate of 0.001, decayed each epoch. The optimizer is SGD with momentum and weight decay. For the classification result, we follow prior work and further average the scores from the softmax layer over multiple crops.
4.1 Experiment Setup
Dataset and Evaluation. To draw a fair comparison, we test our model on the ILSVRC2016 validation set and the CUB-200-2011 test set, the two most widely used benchmarks for WSOL. The ILSVRC dataset has a training set containing more than 1.2 million images of 1,000 categories and a validation set of 50,000 images. CUB-200-2011 contains 11,788 bird images of 200 classes in total, of which 5,994 are for training and 5,794 for testing. We adopt the localization metric used in prior work. Specifically, the bounding box of an image is correctly predicted if: 1) the model predicts the right image label; 2) the Intersection-over-Union (IoU) between the predicted bounding box and a ground-truth box is more than 50%.
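The two-part criterion can be written down as a short sketch. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples in continuous coordinates, and a prediction is assumed to count if it matches any one of an image's ground-truth boxes; `iou` and `is_correct_localization` are hypothetical names.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_localization(pred_label, true_label, pred_box, true_boxes,
                            iou_threshold=0.5):
    """Correct only if the label matches AND some ground-truth box overlaps
    the prediction with IoU above the threshold."""
    if pred_label != true_label:
        return False
    return any(iou(pred_box, box) > iou_threshold for box in true_boxes)


hit = is_correct_localization(3, 3, (0, 0, 4, 4), [(0, 0, 4, 3)])
wrong_label = is_correct_localization(2, 3, (0, 0, 4, 4), [(0, 0, 4, 3)])
poor_overlap = is_correct_localization(3, 3, (0, 0, 2, 2), [(1, 0, 3, 2)])
```

Passing the ground-truth label as `pred_label` turns the same check into a localization-only evaluation that ignores classification mistakes.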
4.2 Experiment Result
ILSVRC: Table 1 and Table 2 show the classification and localization results on the ILSVRC validation set, respectively. To compare with our proposed method, we first build a baseline model with a VGG16 backbone that has only one classification branch at the final top layer. For the classification task, although our proposed method still lags behind traditional classification networks, it achieves better results than most state-of-the-art WSOL models. We attribute this to the classifier inserted in the backbone network, which not only provides an additional classification result but also refines the weights in the bottom layers.
|Methods|Top-1 Clas. Err|Top-5 Clas. Err|
|Grad-CAM on VGG16|30.4|10.9|

|Methods|Top-1 Loc. Err|Top-5 Loc. Err|
|Backprop on GoogLeNet|61.31|50.55|
|Backprop on VGGnet|61.12|51.46|
|Grad-CAM on VGG16|56.51|46.41|

|Methods|GT-Known Top-1 Loc. Err|

|Methods|Top-1 Clas. Err|
|Inception V3|10.4|
For the localization task, our CAMs2i outperforms the state-of-the-art on both top-1 and top-5 localization error. This demonstrates that the interactive CAMs help to explore more object-related regions, starting from the existing confident area, than a single CAM does. Besides, the inter-sample loss function further improves the CAMs by regulating CAMs from different parts of the targeted objects and also suppresses noise from background pixels.
|Dataset|Threshold|Top-1 Loc. Err.|
In the above comparison, our localization result is restricted by classification performance, since a localization only counts when the classification label is also correct. To demonstrate the localization ability of our proposed model on its own, we use ground-truth labels for the ILSVRC validation set and evaluate only localization performance, which serves as an “upper bound.” Table 3 shows the results for our method and many previous WSOL methods. Our model with the inter-intra regulation module outperforms all other methods. That means that, regardless of how the network classifies images, the CAMs locate the corresponding objects with higher probability.
CUB-200-2011: We further demonstrate the power of our proposed method on CUB-200-2011 in Table 4 and Table 5. For the classification task, our VGG16-CAMs2i outperforms all other WSOL methods, including methods that rely on bounding box crops. For localization, our method also leads on both top-1 and top-5 localization error, which marks a large improvement.
Figure 3 visualizes the CAMs from each classification branch as well as the final bounding box results of our proposed method on both ILSVRC and CUB-200-2011. For each input image, our proposed model first generates a CAM for each classification branch and then combines them after normalization. Finally, the binary region is obtained by segmenting the foreground part of the CAMs and extracting the largest connected one. The areas in dashed lines on each CAM indicate the segmented regions that would result if that CAM were used as the final result. We can observe that in most cases, the combination of the two CAMs has a more stable foreground segmentation than any single CAM. It collects the pixels that both CAMs are confident in and removes the ambiguous pixels of each single CAM. Besides, we compare the localization results of our method and SPG in Fig. 4. In most situations, our method generates more precise bounding boxes than SPG by discovering more representative object components.
4.3 Ablation Study
In our proposed model, we use one threshold to segment the foreground region from the combined CAMs at inference time. To inspect the influence of this threshold on the final localization result, we test different thresholds for our model in Table 6, which identifies the values giving the best localization performance on ILSVRC and CUB-200-2011, respectively.
We propose a new method for the Weakly Supervised Object Localization task, which generates multiple CAMs from the network and lets them interact with each other to explore more related regions. We also propose an inter-sample module to further regulate CAMs at the category level. The two modules improve the CAMs in both intra- and inter-sample ways for localization, achieving a new state-of-the-art result on the WSOL task.
References
- Self-taught object localization with deep networks. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9.
- (2016) What’s the point: semantic segmentation with point supervision. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 549–565.
- Weakly supervised localization using deep feature maps. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 714–731.
- (2017) STNet: selective tuning of convolutional networks for object localization. CoRR abs/1708.06418.
- (2015) Look and think twice: capturing top-down visual attention with feedback convolutional neural networks. In The IEEE International Conference on Computer Vision (ICCV).
- (2017) Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. CoRR abs/1710.11063.
- (2019) RandAugment: practical data augmentation with no separate search.
- Large scale fine-grained categorization and domain-specific transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4109–4118.
- (2016) R-FCN: object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 379–387.
- DeCAF: a deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pp. 647–655.
- (2018) Few-example object detection with model communication. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.
- (2017) A dual-network progressive approach to weakly supervised object detection. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, New York, NY, USA, pp. 279–287.
- (2013) Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR abs/1311.2524.
- (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2015) Fast R-CNN. In The IEEE International Conference on Computer Vision (ICCV).
- (2018) Fine-grained visual categorization using PAIRS: pose and appearance integration for recognizing subcategories. CoRR abs/1801.09057.
- (2017) Mask R-CNN. In The IEEE International Conference on Computer Vision (ICCV).
- (2018) Co-attention CNNs for unsupervised object co-segmentation. In IJCAI, pp. 748–756.
- (2019) See better before looking closer: weakly supervised data augmentation network for fine-grained visual classification. CoRR abs/1901.09891.
- (2017) Deep self-taught learning for weakly supervised object localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Two-phase learning for weakly supervised object localization. In The IEEE International Conference on Computer Vision (ICCV).
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105.
- (2015) Towards computational baby learning: a weakly-supervised approach for object detection. In The IEEE International Conference on Computer Vision (ICCV).
- (2017) Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2016) SSD: single shot multibox detector. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 21–37.
- (2018) Exploring the limits of weakly supervised pretraining. In The European Conference on Computer Vision (ECCV).
- Is object localization for free? - weakly-supervised learning with convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2016) You only look once: unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 91–99.
- (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
- (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV).
- (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2017) Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3544–3553.
- (2019) Fixing the train-test resolution discrepancy. CoRR abs/1906.06423.
- (2011) The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
- (2017) Video object discovery and co-segmentation with extremely weak supervision. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (10), pp. 2074–2088.
- (2018) Weakly-supervised semantic segmentation by iteratively mining common object features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1354–1362.
- (2017) Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) Top-down neural attention by excitation backprop. International Journal of Computer Vision 126 (10), pp. 1084–1102.
- (2013) Deformable part descriptors for fine-grained recognition and attribute prediction. In The IEEE International Conference on Computer Vision (ICCV).
- (2018) Adversarial complementary learning for weakly supervised object localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) Self-produced guidance for weakly-supervised object localization. In The European Conference on Computer Vision (ECCV).
- (2016) Learning deep features for discriminative localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).