Entropy Guided Adversarial Model for Weakly Supervised Object Localization

08/04/2020 · by Sabrina Narimene Benassou, et al. · Harbin Institute of Technology Shenzhen University

Weakly Supervised Object Localization is challenging because of the lack of bounding box annotations. Previous works tend to generate a class activation map, i.e., CAM, to localize the object. Unfortunately, the network activates only the features that discriminate the object and does not activate the whole object. Some methods remove parts of the object to force the CNN to detect other features, whereas others change the network structure to generate multiple CAMs from different levels of the model. In this article, we propose to take advantage of the generalization ability of the network and train the model using both clean examples and adversarial examples to localize the whole object. Adversarial examples, typically used to train robust models, are images to which a perturbation is added. To reach a good classification accuracy, the CNN trained with adversarial examples is forced to detect more features that discriminate the object. We further propose to apply the Shannon entropy to the CAMs generated by the network to guide it during training. Our method neither erases any part of the image nor changes the network architecture, and extensive experiments show that our Entropy Guided Adversarial model (EGA model) improves performance on state-of-the-art benchmarks for both localization and classification accuracy.


1 Introduction

Weakly supervised learning has gained a lot of popularity over the past few years, especially since Zhou et al. (2016) proposed the Class Activation Map (CAM) to localize objects without bounding box annotations. Since then, CAM has been extensively used for object localization Zhang et al. (2018); Kumar Singh and Jae Lee (2017); Choe and Shim (2019); Zhang et al. (2018); Xue et al. (2019); Yang et al. (2020); Yun et al. (2019), object detection Wan et al. (2018); Hakan and Andrea (2016); Tang et al. (2017); Wei et al. (2018); Wan et al. (2019); Zhang et al. (2018), image segmentation Lee et al. (2019); Huang et al. (2018); Wang et al. (2018), etc.

However, the CAM does not highlight the whole object, because the network learns only the features that best discriminate the object. Previous works addressing this problem fall into two approaches. The first approach Zhang et al. (2018); Kumar Singh and Jae Lee (2017); Choe and Shim (2019); Yun et al. (2019) hides one or several parts of the image during training to force the CNN to detect the full object. This approach has a drawback: when the most discriminative part is removed, the CNN tends to learn regions of the image that do not belong to the object (such as water or tree branches in the CUB dataset) because they appear frequently in the training samples Choe and Shim (2019); Choe et al. (2020). Furthermore, removing parts of the image results in information loss for the network and hence decreases its recognition ability Yun et al. (2019). The second approach Zhang et al. (2018); Xue et al. (2019); Yang et al. (2020) activates different parts of the object by generating multiple CAMs from different levels of the network. These methods, however, require a modification of the network structure by plugging additional layers or blocks into the network, which is not always intuitive to design or to generalize to different architectures. In this paper, we propose to take advantage of the generalization ability of the network and use Adversarial Learning (AL) Madry et al. (2018); Goodfellow et al. (2015) and entropy Vu et al. (2019); Wan et al. (2019) to tackle the limitations of both approaches. Our method does not modify the network backbone, which makes the implementation easier, and because whole images are used for training, there is no information loss for the model.

Figure 1: The difference between the features that affect the classification prediction when a model is trained with clean examples and when it is trained with adversarial (noisy) examples Tsipras et al. (2019).

Adversarial learning was originally designed to prevent adversarial attacks by training the model with adversarial examples, which are generated by adding small perturbations to clean images. In Tsipras et al. (2019), Tsipras et al. visualize the gradient of the loss on the ImageNet dataset and show which features affect the classification prediction of a CNN trained with clean examples versus a CNN trained with adversarial (noisy) examples. As we can see in Fig. 1, the adversarially trained CNN detects more relevant and clearer features such as edges and borders (line 3) than the standard one, which detects noisy features (line 2). This leads us to the following conclusion: to get a good classification accuracy, a CNN trained only with clean images activates the features that best discriminate the object, e.g., the head of a bird. But if we further train the CNN with images to which a small perturbation is added, the network is forced to look for more relevant features to better recognize the object, e.g., the body of the bird. By training a model with both clean and adversarial examples, the generated CAM activates more discriminative patterns to recognize the object without erasing any part of the image or changing the network architecture. Furthermore, Goodfellow et al. (2015) advise the use of a mixture of clean and adversarial examples to train the CNN. Unlike data augmentation such as rotation or translation, adversarial perturbations do not occur naturally; treating adversarial examples as data augmentation in fact regularizes the network.

As adversarial learning detects more discriminative features of the object, some remaining pixels that belong to the object are still not activated by the CAM. To this end, we propose to further use the concept of entropy to guide the CNN during training. Entropy is a measure of uncertainty; it represents how certain a CNN is about its predictions Vu et al. (2019); Wan et al. (2018, 2019). The CAM generated by the CNN only highlights the most discriminative features of the object. The pixels that constitute this part have a high prediction probability, i.e., low entropy, while the other pixels, not highlighted by the CAM, have a low prediction probability, i.e., high entropy. By minimizing the entropy of the CAM, the CNN can extend the localization of the object features.

We can summarize our work in the following contributions:

  1. An entropy-guided adversarial model (dubbed the EGA model) that uses both clean examples and adversarial examples to activate more features of the object.

  2. An entropy loss function that guides the CNN to detect the pixels that belong to the object.

  3. Extensive experiments showing that our EGA model obtains state-of-the-art performance in weakly supervised object localization.

2 Related work

2.1 Weakly Supervised Object Localization

Weakly supervised object localization has received a lot of attention since Zhou et al. (2016), who used a Global Average Pooling (GAP) layer to generate a Class Activation Map that localizes the object with only label annotations. This map highlights only the most discriminative part of the object, resulting in a tight bounding box. Many works have attempted to solve this problem, and the proposed solutions can be divided into two classes. The first class removes one or several parts of the image to force the network to detect more features. In Zhang et al. (2018), two branches are used: the first branch generates a map with the discriminative part highlighted and then removes the pixels of this part before passing the image to the second branch, which generates another map where other features of the object are highlighted. The two maps are then fused by taking the maximum value. Another method based on a hiding process is Kumar Singh and Jae Lee (2017), in which the images are divided into patches and a probability is assigned to each patch. At each epoch, some patches are hidden and the image is given as input to the CNN, which hence learns to detect the whole object. During testing, the images are given to the network without removing any patches.

Choe and Shim (2019) use either an attention mechanism or a dropout mask to help the network detect the whole object. Yun et al. (2019) removes the discriminative region and replaces it with a patch from another image, so there are no non-informative pixels in the image and the network can generalize well for both classification and localization. Other methods do not remove any part of the image but modify the network architecture to generate different maps from different hierarchies. In Zhang et al. (2018), an attention map generated from the high-level feature map is used to guide the network to distinguish between the foreground and background pixels of the map generated from the low-level feature map. Xue et al. (2019) combines two child class labels to create a parent class and trains the CNN with hierarchical class labels to detect common visual patterns, and Yang et al. (2020) generates many CAMs from low- and high-level feature maps and combines them using a polynomial function.

In the first approach, when some parts of the object are removed, non-discriminative features are activated by the network because they appear too frequently in the dataset. The second approach, moreover, is not intuitive to apply to complex network architectures, as additional blocks must be plugged in to generate multiple CAMs from different levels of feature maps. In this article, we propose to employ adversarial examples as data augmentation to detect more discriminative features, while minimizing the entropy loss on the CAM further guides the model during training.

2.2 Adversarial learning

Adversarial learning consists of constructing robust models by training the classifier with adversarial examples Goodfellow et al. (2015). Some works use it to improve the robustness of the model, while others apply it to different problems. Madry et al. (2018) construct a robust classifier using a min-max formulation to protect the network against adversarial attacks. Tsipras et al. (2019) discuss the benefits of adversarial learning and show that a robust classifier learns different features than a standard one. Xie et al. (2020) train a network with both clean and adversarial examples to improve classification accuracy and achieve state-of-the-art accuracy on the ImageNet dataset. Miyato et al. (2017) employ adversarial learning to improve classification in semi-supervised learning. In Park et al. (2018), an adversarial dropout mask is selected using adversarial learning and applied to the network to improve classification accuracy for both supervised and semi-supervised learning. Shafahi et al. (2019) propose fast adversarial training by updating the network parameters and creating an adversarial example in one backward pass. Lee et al. (2019) applies adversarial learning and adversarial dropout to learn discriminative features for unsupervised domain adaptation. In this article, we take advantage of adversarial attacks and use adversarial examples as data augmentation to improve the performance of weakly supervised object localization.

2.3 Entropy

In information theory, entropy is a measure of uncertainty; it tells how uncertain a network is about its prediction. When the prediction of the network has a high probability, the entropy is low, and when the prediction probability is low, the entropy is high. Wan et al. (2018) use entropy for weakly supervised object detection as an optimization method associated with multiple instance learning to select the object proposals that belong to the object. Vu et al. (2019) show state-of-the-art performance in domain adaptation for semantic segmentation using an entropy minimization loss on the target dataset. Wan et al. (2019) consider the entropy as a weighting coefficient to improve weakly supervised learning on the Pascal VOC dataset Everingham et al. (2015). Entropy has also been used for semi-supervised domain adaptation Saito et al. (2019) and semi-supervised learning Grandvalet and Bengio (2004).

Figure 2: The clean example and the adversarial example are fed through the network. The clean example passes through the main batch normalization and the adversarial example through the auxiliary batch normalization. After the forward pass, we obtain $CAM_{clean}$ and $CAM_{adv}$ and calculate the Shannon entropy loss.

3 Method

In this section, we introduce our Entropy Guided Adversarial model, i.e., the EGA model. In Section 3.1, we review the baseline for WSOL. In Section 3.2, we briefly present adversarial learning and how to apply it to the WSOL problem. In Section 3.3, we discuss the application of the Shannon entropy to the CAM, and finally, in Section 3.4, we present the loss function used by our network. Our model is illustrated in Fig. 2.

3.1 Revisiting the Class Activation Map

Weakly Supervised Object Localization aims at detecting objects without bounding box annotations. One method widely used to localize objects using only label annotations is Zhou et al. (2016), who propose to add a global average pooling (GAP) layer before the softmax layer of a network composed of convolutional layers. We denote by $f_k(x, y)$ the activation of unit $k$ of the last convolutional layer at spatial location $(x, y)$. The GAP layer is applied to the last convolutional layer and outputs a vector where each unit is the average of the corresponding feature map:

$F_k = \frac{1}{Z} \sum_{x, y} f_k(x, y)$ (1)

where $Z$ is the number of spatial locations. The units of this vector are then weighted by the classification weights and summed to form the input of the softmax layer for class $c$:

$S_c = \sum_k w_k^c F_k$ (2)

where $w_k^c$ is the weight corresponding to class $c$ for unit $k$, which indicates the importance of $F_k$ for class $c$.

To generate a map that indicates the importance of each pixel for the corresponding class, we sum up the weighted feature maps:

$M_c(x, y) = \sum_k w_k^c f_k(x, y)$ (3)

where $M_c$ is the class activation map generated for class $c$. The CAM activates the features that discriminate the object the most.
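To make Eq. (3) concrete, the following is a minimal sketch of the CAM computation, assuming PyTorch; the tensor names features and fc_weights are ours and not taken from the paper's code.

```python
import torch

def class_activation_map(features, fc_weights, class_idx):
    """Sketch of Eq. (3): M_c(x, y) = sum_k w_k^c * f_k(x, y).

    features:   tensor of shape (K, H, W), activations f_k of the last conv layer
    fc_weights: tensor of shape (C, K), weights w_k^c learned after the GAP layer
    class_idx:  index of the target class c
    """
    cam = torch.einsum('k,khw->hw', fc_weights[class_idx], features)
    # Min-max normalize to [0, 1] for visualization or thresholding
    cam = cam - cam.min()
    cam = cam / (cam.max() + 1e-8)
    return cam
```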

Figure 3: The clean example is given as input to the network to create a perturbation that maximizes the loss function. We drop the main batch norm and use the auxiliary batch norm to generate the adversarial example.

3.2 Adversarial learning

To train a CNN, we use the stochastic gradient descent method to minimize an objective function with respect to the parameters of the network:

$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ L(\theta, x, y) \right]$ (4)

where $x$ is an input sample associated with its ground-truth label $y$, $\theta$ denotes the network parameters, and $L$ is the objective function, i.e., the loss function.

Figure 4: The difference between the distribution of clean examples and adversarial examples.

To construct a model that is robust to adversarial attacks, we train the classifier with adversarial examples obtained by adding a small perturbation to the image to fool the network Goodfellow et al. (2015); Madry et al. (2018). The resulting sample should be close to the original image and should maximize the loss function instead of minimizing it, as is usually done in standard training. The objective function hence becomes:

$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\delta \in S} L(\theta, x + \delta, y) \right]$ (5)

where $\delta$ is the adversarial perturbation, i.e., the noise, and $S$ is the allowable set of perturbations, which ensures that the adversarial image stays close to the clean one.
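In practice, the inner maximization of Eq. (5) is solved approximately. The following is a hedged sketch of one such attack, projected gradient ascent under an l-infinity bound, in PyTorch; Sec. 4.3 reports that PGD Madry et al. (2018) is the attacker used, but the function name and default values below are ours.

```python
import torch
import torch.nn.functional as F

def pgd_perturbation(model, x, y, eps=1/255, step_size=1/255, steps=1):
    """Approximate the inner max of Eq. (5): find delta in S (an l_inf ball
    of radius eps) that maximizes the loss, without updating the network."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()   # gradient ascent on the loss
            delta.clamp_(-eps, eps)            # project back into the set S
    return (x + delta).detach()                # adversarial example x + delta
```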

Goodfellow et al. (2015) advise the use of a mixture of clean examples and adversarial examples to train the model. Adversarial examples act as data augmentation and, combined with clean examples, regularize the network. Following Goodfellow et al., the objective function becomes:

$L_{cls} = \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ L(\theta, x, y) + \max_{\delta \in S} L(\theta, x + \delta, y) \right]$ (6)

Clean samples and adversarial samples can be considered as two different datasets, because they have two different distributions; hence, the network cannot generalize well on clean data during inference. This is due to the distribution mismatch between the two types of data Xie et al. (2020). As shown in Fig. 4, the mean and the standard deviation of the distribution of clean data differ from those of adversarial data.

To solve this problem, we follow the work of Xie et al. (2020) and add an auxiliary batch normalization to the network. Batch normalization (BN) is a technique used to normalize the input data Ioffe and Szegedy (2015). As we have two differently distributed datasets, we use one batch norm for each type of data, i.e., we feed each sample to its corresponding batch norm. Specifically, for each clean mini-batch, we generate its corresponding adversarial mini-batch using the auxiliary batch norm layers. The process of generating the adversarial example is shown in Fig. 3. We feed a clean sample to the network to compute the perturbation that maximizes the loss function using Eq. (5); the network parameters are not updated during this process. We then pass the clean mini-batch and the adversarial mini-batch through the network to compute the loss: the clean mini-batch passes through the main batch normalization and the adversarial mini-batch passes through the auxiliary batch normalization. After the forward pass, the model generates two maps, $CAM_{clean}$ and $CAM_{adv}$, for the clean sample and the adversarial sample, respectively. We then calculate the loss function and update the network parameters. During inference, we test our model only on clean data; the auxiliary batch norm is dropped and only the main batch norm is used.
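As an illustration of this routing, here is a minimal sketch assuming PyTorch; the DualBatchNorm2d module and the set_aux_bn helper are our own names, not the paper's code.

```python
import torch.nn as nn

class DualBatchNorm2d(nn.Module):
    """BN layer with separate main (clean) and auxiliary (adversarial)
    statistics, following the auxiliary-BN idea of Xie et al. (2020)."""

    def __init__(self, num_features):
        super().__init__()
        self.main_bn = nn.BatchNorm2d(num_features)
        self.aux_bn = nn.BatchNorm2d(num_features)
        self.use_aux = False  # main BN by default (and at inference)

    def forward(self, x):
        return self.aux_bn(x) if self.use_aux else self.main_bn(x)


def set_aux_bn(model, use_aux):
    """Route the next forward pass through the auxiliary BNs (adversarial
    mini-batch) or the main BNs (clean mini-batch / inference)."""
    for m in model.modules():
        if isinstance(m, DualBatchNorm2d):
            m.use_aux = use_aux
```

At inference, set_aux_bn(model, False) is called once, so only the main statistics are used, matching the description above.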

3.3 Entropy minimization

Unlike previous works Zhang et al. (2018); Zhou et al. (2016), where the CAM is generated during the test stage, we generate our CAMs after each forward pass to guide the CNN for localization during training. Our adversarial model, using the main and auxiliary batch norm layers, generates two different CAMs, $CAM_{clean}$ and $CAM_{adv}$, for the clean example and the adversarial example, respectively. These CAMs activate different parts of the object but still do not activate the whole object. We propose to use the Shannon entropy to remedy this problem.

The Shannon entropy measures the amount of uncertainty Vu et al. (2019). When applied to the CAM, a pixel with a low entropy has a high prediction probability and a pixel with a high entropy has a low prediction probability. The maps generated by the network highlight the most discriminative part, i.e., these pixels have a low entropy, whereas the other pixels that are not highlighted by the maps but belong to the object have a high entropy. By minimizing the entropy of the CAM, we force the CNN to activate the pixels that belong to the object but are not highlighted by the map.

Given the generated CAMs, $CAM_{clean}$ and $CAM_{adv}$, we calculate the entropy loss for one map as:

$L_{ent} = -\sum_{i, j} p_{ij} \log(p_{ij})$ (7)

where $p_{ij}$ represents the probability of pixel $(i, j)$. $L_{ent}$ is the sum of the entropies of all pixels, i.e., minimizing it increases the prediction confidence on the pixels that are not activated by the CAM.
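A minimal sketch of this loss in PyTorch follows; min-max normalizing the CAM and reading each pixel value as the probability $p_{ij}$ is our assumption, not a detail stated above.

```python
import torch

def cam_entropy_loss(cam, eps=1e-8):
    """Sketch of Eq. (7): sum of per-pixel Shannon entropies of a CAM."""
    # Assumption: min-max normalize the CAM and treat each pixel as p_ij
    p = (cam - cam.min()) / (cam.max() - cam.min() + eps)
    p = p.clamp(eps, 1.0)
    return -(p * p.log()).sum()
```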

3.4 Loss function

After feeding the network with the clean and adversarial samples, we generate $CAM_{clean}$ and $CAM_{adv}$ and calculate the entropy loss from the two maps. We jointly optimize the adversarial learning loss function and the entropy loss function by summing Eq. (6) and Eq. (7). The final loss function is defined as:

$L = L_{cls} + \lambda_1 L_{ent}^{clean} + \lambda_2 L_{ent}^{adv}$ (8)

where $\lambda_1$ and $\lambda_2$ are the weighting factors controlling the importance of $L_{ent}^{clean}$ and $L_{ent}^{adv}$, respectively.

Because of the perturbation on the adversarial example, $CAM_{adv}$ has more activated pixels than $CAM_{clean}$. To this end, we set $\lambda_1 > \lambda_2$; this setting pushes the classifier to be stricter with clean examples by forcing $CAM_{clean}$ to activate more pixels.
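Putting the pieces together, here is a hedged sketch of one EGA training step that reuses the pgd_perturbation, set_aux_bn, and cam_entropy_loss sketches above; the return_cam interface and the default values of lambda1 and lambda2 are hypothetical, not the paper's settings.

```python
import torch.nn.functional as F

def ega_training_step(model, optimizer, x, y, lambda1=1.0, lambda2=0.5):
    """One training step combining Eq. (6) and Eq. (7); lambda1 > lambda2."""
    set_aux_bn(model, True)
    x_adv = pgd_perturbation(model, x, y)          # adversarial mini-batch (Fig. 3)

    set_aux_bn(model, False)
    logits_clean, cam_clean = model(x, return_cam=True)
    set_aux_bn(model, True)
    logits_adv, cam_adv = model(x_adv, return_cam=True)

    loss = (F.cross_entropy(logits_clean, y)       # clean term of Eq. (6)
            + F.cross_entropy(logits_adv, y)       # adversarial term of Eq. (6)
            + lambda1 * cam_entropy_loss(cam_clean)
            + lambda2 * cam_entropy_loss(cam_adv)) # entropy terms, Eq. (7)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```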

4 Experiments

4.1 Experimental Setup

Datasets

We evaluate our method on two datasets commonly used for WSOL, i.e., CUB-200-2011 Wah et al. (2011) and ILSVRC Deng et al. (2009); Russakovsky et al. (2015). We further evaluate the EGA model on OpenImages, a new dataset proposed by Choe et al. (2020) for WSOL. CUB-200-2011 consists of 200 bird species and contains 11,788 images, with 5,994 images for training and 5,794 for testing. ILSVRC 2016 is a large-scale dataset of 1,000 classes that comprises 1.2 million images for training and 50,000 images for the validation set, which we use for testing. OpenImages has 100 classes and contains 29,819 images for training, 2,500 for validation, and 5,000 for testing.

Evaluation metrics

For classification, we use the Top-1 classification accuracy, which counts a prediction as correct when the class predicted by the model is equal to the ground-truth class. For localization, as the CUB and ILSVRC datasets have bounding box annotations, we use three evaluation metrics: Top-1 localization accuracy Deng et al. (2009), the Correct Localization (CorLoc) rate Deselaers et al. (2012), and MaxBoxAccV2 Choe et al. (2020). Top-1 localization accuracy counts a localization as correct when the predicted class is correct and the predicted bounding box has an Intersection over Union (IoU) with the ground-truth bounding box greater than 0.5. Correct Localization (CorLoc) measures the localization performance regardless of whether the predicted class is correct. MaxBoxAccV2 is a new metric proposed by Choe et al. (2020), an improved version of CorLoc, which averages the box accuracy at IoU thresholds of 0.3, 0.5, and 0.7 between the predicted box and the ground-truth box. OpenImages provides pixel-wise annotations; hence, we use the pixel average precision (PxAP) metric Choe et al. (2020) for localization, which measures the pixel-wise precision-recall trade-off.
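For reference, a small sketch of the Top-1 localization criterion described above, in plain Python; the (x1, y1, x2, y2) box format is an assumption.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def top1_loc_correct(pred_class, gt_class, pred_box, gt_box, iou_thr=0.5):
    """Top-1 localization: the class must be correct and IoU must exceed 0.5."""
    return pred_class == gt_class and box_iou(pred_box, gt_box) > iou_thr
```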

Experimental details

For training, we use VGGnet Simonyan and Zisserman (2015) and GoogLeNet Mahajan et al. (2018). Following the same setting as Zhou et al. (2016), we remove the layers after conv5-3 (from pool5 to prob) of the VGG-16 network and the last inception block of GoogLeNet. We then add two convolutional layers with kernel size 3×3, stride 1, pad 1 and 1024 units, followed by a convolutional layer of kernel size 1×1, stride 1 with 1000 units for ILSVRC, 200 units for CUB-200-2011, and 100 units for OpenImages. For training, input images are resized to 256×256 and then randomly cropped to 224×224. Both backbone networks are fine-tuned from weights pre-trained on ILSVRC. During inference, we resize images to 224×224 to find the whole objects, and for classification on the CUB and ILSVRC datasets, we average the scores from the softmax layer over 10 crops.
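As an illustration of this backbone modification, here is a hedged sketch assuming PyTorch and torchvision; the layer indexing and head configuration are our reading of the setting above, not the authors' code.

```python
import torch.nn as nn
from torchvision import models

def build_vgg_cam(num_classes=200):
    """VGG-16 with the layers after conv5-3 removed and a CAM-style head:
    two 3x3 conv layers with 1024 units, a 1x1 classification conv, and GAP."""
    vgg = models.vgg16(pretrained=True)                              # ILSVRC pre-trained weights
    backbone = nn.Sequential(*list(vgg.features.children())[:-1])   # drop pool5, keep up to conv5-3
    head = nn.Sequential(
        nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(1024, 1024, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(1024, num_classes, kernel_size=1, stride=1),      # class-wise maps
        nn.AdaptiveAvgPool2d(1),                                     # GAP -> class scores
        nn.Flatten(),
    )
    return nn.Sequential(backbone, head)
```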

Method top1 cls-err top1 loc-err
GoogLeNet-CAM Zhou et al. (2016) 26.2 58.94
GoogLeNet-SPG Zhang et al. (2018) - 53.36
GoogLeNet-ADL Choe and Shim (2019) 25.45 46.94
GoogLeNet-DANet Xue et al. (2019) 28.8 50.55
GoogLeNet-EGA (ours) 27.89 54.26
VGGnet-CAM Zhou et al. (2016) 23.4 55.85
VGGnet-ACoL Zhang et al. (2018) 28.1 54.08
VGGnet-SPG Zhang et al. (2018) 24.5 51.07
VGGnet-CutMix Yun et al. (2019) - 47.47
VGGnet-ADL Choe and Shim (2019) 34.73 47.64
VGGnet-DANet Xue et al. (2019) 24.6 47.48
VGGnet-CCAM Yang et al. (2020) 26.8 49.93
VGGnet-EGA (ours) 21.87 40.84
Table 1: Comparison to the state-of-the-art performance on the CUB dataset.
Method top1 cls-err top1 loc-err
GoogLeNet-Backprop Simonyan et al. (2014) - 61.31
GoogLeNet-CAM Zhou et al. (2016) 35.0 56.40
GoogLeNet-HaS Kumar Singh and Jae Lee (2017) - 54.53
GoogLeNet-ACoL Zhang et al. (2018) 29.0 53.28
GoogLeNet-SPG Zhang et al. (2018) - 51.40
GoogLeNet-ADL Choe and Shim (2019) 27.17 51.29
GoogLeNet-DANet Xue et al. (2019) 27.5 52.47
GoogLeNet-EGA (ours) 27.42 50.17
VGGnet-Backprop Simonyan et al. (2014) - 61.12
VGGnet-CAM Zhou et al. (2016) 33.4 57.20
VGGnet-ACoL Zhang et al. (2018) 32.5 54.17
VGGnet-CutMix Yun et al. (2019) - 56.45
VGGnet-ADL Choe and Shim (2019) 30.52 55.08
NL-CCAM Yang et al. (2020) 27.7 49.83
VGGnet-EGA (ours) 29.36 52.69
Table 2: Comparison to the state-of-the-art performance on the ILSVRC validation set.

4.2 Comparison with the state-of-the-arts

We compare our EGA model to the state of the art on CUB, ILSVRC, and the newly proposed OpenImages dataset. The results are shown in Tab. 1, Tab. 2, and Tab. 3, respectively.

CUB

As shown in Tab. 1, with the VGG model, our method outperforms all previous state-of-the-art methods by far in both classification and localization, with 21.87% classification error and 40.84% localization error. We improve over our baseline VGGnet-CAM by 1.53% and 15.01% for classification and localization, respectively. We also surpass VGGnet-CCAM, the current state of the art for localization, with a large margin of 9.09% for localization and a margin of 4.93% for classification.

With a GoogLeNet backbone, our EGA model achieves 27.89% and 54.26% classification and localization error, respectively. We did not achieve a new state of the art with the GoogLeNet architecture, but we surpass our baseline GoogLeNet-CAM for localization by 4.68%. We argue that GoogLeNet, compared to the VGG backbone, is deeper and hence needs a larger dataset such as ILSVRC to achieve good performance; as we add perturbations to the samples, the network needs more data to improve its classification and localization accuracy.

ILSVRC

In Tab. 2, with the VGG backbone, we achieve 29.36% and 52.69% classification and localization error, respectively, on the ILSVRC dataset, surpassing the baseline by 4.04% and 4.51%. We also surpass all state-of-the-art methods for both classification and localization, except for NL-CCAM, which outperforms our method by a margin of 1.66% and 2.86% for classification and localization, respectively.

With the GoogLeNet architecture, the EGA model achieves 27.42% and 50.17% for classification and localization, respectively. As argued earlier, our method with the GoogLeNet backbone and a larger dataset outperforms all state-of-the-art methods for localization: we improve over our baseline by a margin of 7.58% and 6.23% for classification and localization, respectively. We also surpass the current state of the art, ADL, by 1.12% for localization while keeping a good classification accuracy.

In Fig. 5, we compare the bounding boxes generated by the CAM method Zhou et al. (2016) and the bounding boxes generated by our EGA model. Our method activates more of the object's features than the baseline Zhou et al. (2016).

Figure 5: Comparison of our EGA method to the CAM method. Our method activates more of the object's features than CAM and generates bounding boxes that better fit the object. Ground-truth bounding boxes are in red and predicted boxes are in blue.
Method top1 cls-err loc-err (PxAP)
GoogLeNet-CAM Zhou et al. (2016) 63.4 36.8
GoogLeNet-HaS Kumar Singh and Jae Lee (2017) 31.6 41.9
GoogLeNet-ACoL Zhang et al. (2018) 59.3 42.8
GoogLeNet-SPG Zhang et al. (2018) 53.4 37.7
GoogLeNet-ADL Choe and Shim (2019) 53.4 43.2
GoogLeNet-CutMix Yun et al. (2019) 46.9 37.5
GoogLeNet-EGA (ours) 33.4 37.35
VGGnet-CAM Zhou et al. (2016) 32.7 41.7
VGGnet-HaS Kumar Singh and Jae Lee (2017) 40.0 41.9
VGGnet-ACoL Zhang et al. (2018) 31.8 45.7
VGGnet-SPG Zhang et al. (2018) 28.3 41.7
VGGnet-ADL Choe and Shim (2019) 33.9 41.3
VGGnet-CutMix Yun et al. (2019) 31.9 41.9
VGGnet-EGA (ours) 30.0 38.21
Table 3: Comparison to the state-of-the-art performance on OpenImages dataset with PxAP metric.

Figure 6: Comparison of our EGA method to CAM method on OpenImages dataset. The ground truth mask is in blue and the predicted mask is in red.

OpenImages

As OpenImages is a new dataset proposed by Choe et al. (2020), we compare our method to the results reported in Choe et al. (2020). In Tab. 3, with the VGG backbone, our method achieves 30.0% for classification and 38.21% for localization, still improving the baseline CAM by 2.7% for classification and 3.49% for localization. We further surpass ADL, the current state of the art on the OpenImages dataset, with a margin of 3.9% for classification and 3.09% for localization. Hence, we achieve a new state of the art for localization on the OpenImages dataset with a small drop in classification. In Fig. 6, our EGA model detects more foreground features and fewer background features compared to VGGnet-CAM Zhou et al. (2016).

With GoogLeNet, our method achieves 33.4% for classification and 37.35% for localization; we have a slight decrease of 0.55% for localization compared to the baseline CAM, but we outperform HaS, ACoL, SPG, ADL, and CutMix by margins of 4.55%, 5.45%, 0.35%, 5.85%, and 0.15%, respectively.

MaxBoxAccV2

We further evaluate our method with the new evaluation metric MaxBoxAccV2 proposed by Choe et al. (2020). The results are shown in Tab. 4. For the CUB dataset with a VGG backbone, we surpass the state-of-the-art methods CAM, HaS, ACoL, SPG, and CutMix by 0.23%, 0.53%, 6.53%, 7.63%, and 1.63%, respectively, except for ADL, which outperforms our method by 2.37%. For the ILSVRC dataset, we achieve a new state of the art for localization by surpassing all previous works with a localization error of 38.22%.

CorLoc

We also evaluate our method with the Correct Localization metric, a widely used evaluation metric in WSOL Deselaers et al. (2012). As shown in Tab. 5, the EGA method improves over CAM, HaS, ACoL, and SPG by 6.17%, 3.63%, 1.87%, and 0.14%, respectively. We also outperform the other state-of-the-art methods, except for NL-CCAM, which is better than ours by a slight margin of 0.4%. NL-CCAM applies a Non-Local module to the VGG backbone and hence changes the VGG architecture rather than using the original backbone. Our method does not change the backbone of VGG or GoogLeNet and still obtains results competitive with NL-CCAM. When we compare with the CCAM method applied to the original baseline, i.e., VGGnet-CCAM, we outperform it by 1.25%.

Method ILSVRC CUB
VGGnet-CAM Zhou et al. (2016) 40.0 36.3
VGGnet-HaS Kumar Singh and Jae Lee (2017) 39.4 36.6
VGGnet-ACoL Zhang et al. (2018) 42.6 42.6
VGGnet-SPG Zhang et al. (2018) 40.1 43.7
VGGnet-ADL Choe and Shim (2019) 40.2 33.7
VGGnet-CutMix Yun et al. (2019) 40.6 37.7
VGGnet-EGA (ours) 38.22 36.07
Table 4: Comparison to the state-of-the-art performance on the ILSVRC, CUB datasets with MaxBoxAccV2 metric.
Method ILSVRC
AlexNet-GAP Zhou et al. (2016) 45.01
AlexNet-HaS Kumar Singh and Jae Lee (2017) 41.25
AlexNet-GAP-ensemble Zhou et al. (2016) 42.98
AlexNet-HaS-ensemble Kumar Singh and Jae Lee (2017) 39.67
GoogLeNet-GAP Zhou et al. (2016) 41.34
GoogLeNet-HaS Kumar Singh and Jae Lee (2017) 38.8
GoogLeNet-ACoL Zhang et al. (2018) 37.04
GoogLeNet-SPG Zhang et al. (2018) 35.31
VGGnet-CCAM Yang et al. (2020) 36.42
NL-CCAM Yang et al. (2020) 34.77
GoogLeNet-EGA (ours) 35.17
Table 5: Comparison to the state-of-the-art performance on the ILSVRC validation set with CorLoc metric.

4.3 Ablation Study

In this section, we perform an ablation study on the CUB dataset with the VGGnet and GoogLeNet networks. We first evaluate the effect of each contribution over the baseline, then the effect of different perturbation values on localization and classification, and finally, how $\lambda_1$ and $\lambda_2$ influence the results of our method. The results are reported in Tab. 6, Tab. 7, and Fig. 7.

Effect of AL and Entropy

As shown in Tab. 6, with the VGGnet backbone, using clean examples and adversarial examples greatly improves the baseline, i.e., by 1.55% for classification and 14.92% for localization. When we further apply the entropy loss, we improve the localization accuracy by 0.09% with a small drop of 0.02% in classification. With the GoogLeNet backbone, using adversarial learning improves the baseline by a margin of 4.12% for localization but drops the classification accuracy by 1.73%. When the entropy loss is further applied, we improve both classification and localization by 0.04% and 0.56%, respectively.

Adversarial Attacker Strength

We show the effect of Projected Gradient Descent (PGD) Madry et al. (2018) attackers with different perturbation values. We train both the VGG and GoogLeNet networks on the CUB dataset and, as in Xie et al. (2020), we use perturbations ranging from 1 to 4 with a fixed number of iterations, except in one setting where the number of iterations is set to 1. As shown in Fig. 7, the bigger the perturbation, the higher the error for both classification and localization. We get the best results with the smallest perturbation. This is expected, as our goal is not to build a robust model but to use adversarial learning as a way to activate more relevant features in the image.

Regularization factors

We further show the effect of the regularization factors $\lambda_1$ and $\lambda_2$ on the method. As shown in Tab. 7, with the VGGnet backbone, we obtain the best results for one particular setting of $\lambda_1$ and $\lambda_2$ (21.87% classification and 40.84% localization error). By selecting the right values for $\lambda_1$ and $\lambda_2$, the entropy loss improves the localization accuracy.

Method top1 cls-err top1 loc-err
VGGnet-CAM Zhou et al. (2016) 23.4 55.85
VGG + Adversarial learning 21.85 40.93
VGG + Adversarial learning + Entropy 21.87 40.84
GoogLeNet-CAM Zhou et al. (2016) 26.2 58.94
GoogLeNet + Adversarial learning 27.93 54.82
GoogLeNet + Adversarial learning + Entropy 27.89 54.26
Table 6: The effect of adding adversarial training and entropy loss to the baseline CAM.
Figure 7: The effect of different perturbation strength on the classification and localization accuracy.
$\lambda_1$, $\lambda_2$ top1 cls-err top1 loc-err
, 22.59 41.94
, 21.87 40.84
, 22.14 41.54
, 21.88 41.23
, 22.11 41.09
Table 7: The effect of $\lambda_1$ and $\lambda_2$ on the results.

5 Conclusion

In this paper, we proposed to take advantage of adversarial learning and entropy to improve WSOL performance. To do so, we train the model with both clean examples and adversarial examples. By introducing small perturbations to the images, adversarial examples act as data augmentation and regularize the network, resulting in the activation of more relevant features. Furthermore, applying entropy minimization to the CAMs generated by the network guides it during training by forcing the pixels considered not relevant by the model to have a low entropy, and hence a higher prediction probability. Extensive experiments demonstrate that our EGA model obtains state-of-the-art results on the three most used benchmarks: the CUB, ILSVRC, and OpenImages datasets.

References

  • J. Choe, S. J. Oh, S. Lee, S. Chun, Z. Akata, and H. Shim (2020) Evaluating weakly supervised object localization methods right. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §4.1, §4.1, §4.2, §4.2.
  • J. Choe and H. Shim (2019) Attention-based dropout layer for weakly supervised object localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1, Table 1, Table 2, Table 3, Table 4.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §4.1, §4.1.
  • T. Deselaers, B. Alexe, and V. Ferrari (2012) Weakly supervised localization and learning with generic knowledge. International Journal of Computer Vision 100, pp. . External Links: Document Cited by: §4.1, §4.2.
  • M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: a retrospective. International Journal of Computer Vision 111 (1), pp. 98–136. Cited by: §2.3.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. CoRR abs/1412.6572. Cited by: §1, §1, §2.2, §3.2, §3.2.
  • Y. Grandvalet and Y. Bengio (2004) Semi-supervised learning by entropy minimization. Vol. 17, pp. . Cited by: §2.3.
  • B. Hakan and V. Andrea (2016) Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang (2018) Weakly-supervised semantic segmentation network with deep seeded region growing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. ArXiv abs/1502.03167. Cited by: §3.2.
  • K. Kumar Singh and Y. Jae Lee (2017) Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.1, Table 2, Table 3, Table 4, Table 5.
  • J. Lee, E. Kim, S. Lee, J. Lee, and S. Yoon (2019) FickleNet: weakly and semi-supervised semantic image segmentation using stochastic inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • S. Lee, D. Kim, N. Kim, and S. Jeong (2019) Drop to adapt: learning discriminative features for unsupervised domain adaptation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.2, §3.2, §4.3.
  • D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the limits of weakly supervised pretraining. In The European Conference on Computer Vision (ECCV), Cited by: §4.1.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2017) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence PP, pp. . External Links: Document Cited by: §2.2.
  • S. Park, J. Park, S. Shin, and I. Moon (2018) Adversarial dropout for supervised and semi-supervised learning. In AAAI Conference on Artificial Intelligence, Cited by: §2.2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §4.1.
  • K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko (2019) Semi-supervised domain adaptation via minimax entropy. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.3.
  • A. Shafahi, M. Najibi, M. A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein (2019) Adversarial training for free!. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 3358–3369. External Links: Link Cited by: §2.2.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. CoRR abs/1312.6034. Cited by: Table 2.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §4.1.
  • P. Tang, X. Wang, X. Bai, and W. Liu (2017) Multiple instance detection network with online instance classifier refinement. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2019) Robustness may be at odds with accuracy. In International Conference on Learning Representations, External Links: Link Cited by: Figure 1, §1, §2.2.
  • T. Vu, H. Jain, M. Bucher, M. Cord, and P. Perez (2019) ADVENT: adversarial entropy minimization for domain adaptation in semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §2.3, §3.3.
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §4.1.
  • F. Wan, C. Liu, W. Ke, X. Ji, J. Jiao, and Q. Ye (2019) C-mil: continuation multiple instance learning for weakly supervised object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • F. Wan, P. Wei, J. Jiao, Z. Han, and Q. Ye (2018) Min-entropy latent model for weakly supervised object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §2.3.
  • W. Wan, J. Chen, T. Li, Y. Huang, J. Tian, C. Yu, and Y. Xue (2019) Information entropy based feature pooling for convolutional neural networks. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §1, §2.3.
  • X. Wang, S. You, X. Li, and H. Ma (2018) Weakly-supervised semantic segmentation by iteratively mining common object features. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • Y. Wei, Z. Shen, B. Cheng, H. Shi, J. Xiong, J. Feng, and T. Huang (2018) TS2C: tight box mining with surrounding segmentation context for weakly supervised object detection. In The European Conference on Computer Vision (ECCV), Cited by: §1.
  • C. Xie, M. Tan, B. Gong, J. Wang, A. L. Yuille, and Q. V. Le (2020) Adversarial examples improve image recognition. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2, §3.2, §3.2, §4.3.
  • H. Xue, C. Liu, F. Wan, J. Jiao, X. Ji, and Q. Ye (2019) DANet: divergent activation for weakly supervised object localization. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.1, Table 1, Table 2.
  • S. Yang, Y. Kim, Y. Kim, and C. Kim (2020) Combinational class activation maps for weakly supervised object localization. In The IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §1, §2.1, Table 1, Table 2, Table 5.
  • S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) CutMix: regularization strategy to train strong classifiers with localizable features. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.1, Table 1, Table 2, Table 3, Table 4.
  • X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang (2018) Adversarial complementary learning for weakly supervised object localization. In IEEE CVPR, Cited by: §1, §2.1, §3.3, Table 1, Table 2, Table 3, Table 4, Table 5.
  • X. Zhang, Y. Wei, G. Kang, Y. Yang, and T. Huang (2018) Self-produced guidance for weakly-supervised object localization. In The European Conference on Computer Vision (ECCV), Cited by: §1, §2.1, Table 1, Table 2, Table 3, Table 4, Table 5.
  • Y. Zhang, Y. Bai, M. Ding, Y. Li, and B. Ghanem (2018) W2F: a weakly-supervised to fully-supervised framework for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1, §3.1, §3.3, §4.1, §4.2, §4.2, Table 1, Table 2, Table 3, Table 4, Table 5, Table 6.