During the last decade, researchers have developed various forms of deep learning-based models and achieved remarkable performance in object localization, i.e., inferring the bounding box of objects in natural images [felzenszwalb2009object, girshick2014rich, ren2016faster]. However, from the learning and data efficiency perspectives, the major limitation of those works is the use of a fully-labeled dataset for supervision. Undoubtedly, it is labor-intensive and time-consuming to construct such a fully-labeled dataset, thus limiting their applicability in practice.
Meanwhile, Weakly Supervised Object Localization (WSOL) methods, which employ only class labels but no bounding-box annotation of the target objects [Zhou_2016_CVPR, wei2017object, zhu2017soft, singh2017hide, zhang2018adversarial, zhang2018self, choe2019attention, xue2019danet, mai2020erasing, babar2021look], have attracted attention by showing great potential for the same task while being trained in a more data-efficient manner. The main idea of WSOL methods is to detect the class-discriminative regions via an object recognition task and to utilize those regions for the localization of the identified object. For example, Class Activation Map (CAM) [Zhou_2016_CVPR], which estimates the class-specific discriminative regions based on the inferred class scores, is one of the representative methods in WSOL. In the meantime, various studies [Zhou_2016_CVPR, zhu2017soft, wei2017object, singh2017hide, zhang2018adversarial, zhang2018self, choe2019attention, xue2019danet, mai2020erasing, babar2021look, choe2020evaluating] pointed out that CAM-based methods are not capable of capturing the overall object regions in a fine manner, because they focus only on the class-discriminative regions, disregarding non-discriminative ones. Hence, many of the output bounding boxes are not tight enough around the target object, resulting in either over-sized or under-sized boxes. There have been efforts to tackle these challenges via diverse network architectures or learning strategies [wei2016stc, wei2017object, singh2017hide, zhu2017soft, kim2017two, zhang2018adversarial, choe2019attention, mai2020erasing, yun2019cutmix, babar2021look].
In principle, those methods devised different kinds of mechanisms to mitigate the major problem of focusing only on the discriminative regions in localization, by intentionally corrupting (i.e., erasing) an input image [yun2019cutmix, singh2017hide, wei2017object] or a feature [zhang2018adversarial, mai2020erasing], or by generating an attention map. For the image corruption methods, two different strategies were exploited, namely, random corruption and network-guided corruption. First, the random corruption approach removes a small patch within an image at random and uses the corrupted image to learn richer feature representations [singh2017hide, yun2019cutmix]. This helps the trained network discover diverse discriminative representations and thus detect more object-related regions. The network-guided corruption approach adaptively corrupts an image or feature by dropping out the most discriminative regions based on the integrated activation maps [wei2017object, zhang2018adversarial, choe2019attention, mai2020erasing]. As for the attention-based methods [choe2019attention, zhang2018self], they use a specially designed module to generate an attention map, based on which the most discriminative regions are hidden to capture the integral extent of an object.
While those methods helped improve the performance, they have limitations and issues that should be considered further. First, the random-corruption approach [singh2017hide, yun2019cutmix] potentially disrupts network learning due to unexpected information loss [zhang2018adversarial, choe2019attention]. For example, if the object-characteristic parts are removed from an input image, the network is forced to discover other parts from the remaining regions. Obviously, when no further discriminative region exists, the network would be trained incorrectly. Second, the network-guided corruption approach introduces additional hyperparameters to determine the most discriminative regions and their sizes in activation maps. Third, the attention-based methods [wei2017object, zhang2018adversarial, choe2019attention, mai2020erasing] mostly exploit coarse information in the form of channel or spatial attention and apply the same attention value to all units in a feature map.
In this paper, we propose a novel fine-grained attention method that efficiently and accurately localizes object-related regions in an image. Specifically, we propose a new mechanism to generate a fine-level attention map that allows us to utilize the rich information distributed over channels and locations within feature maps. The fine-level attention map is of the same size as the input feature maps to the attention module; thus, an attention value is assigned to each unit across spatial locations and channels. Compared to the corruption-based approaches, our proposed method neither needs to mask patches in an image nor requires additional hyperparameters for selecting the most discriminative regions.
The main contributions of our work are three-fold:
We propose a novel mechanism to represent fine-grained attention that allows us to utilize feature representations globally at high resolution, thus localizing an object accurately.
In combination with a residual connection, our attention module autonomously concentrates on the less activated regions. Accordingly, it is more likely to focus on other informative regions of an object in an image.
In the experiments, our proposed method, Residual Fine-Grained Attention (RFGA), achieved state-of-the-art object localization performance in the metrics of mIoU and MaxBoxAcc [choe2020evaluating] on three datasets, i.e., CUB-200-2011 [wah2011caltech], FGVC Aircraft [maji2013fine], and Stanford Dogs [KhoslaYaoJayadevaprakashFeiFei_FGVC2011].
2 Related Work
Weakly supervised object localization. WSOL methods can be mainly categorized into two approaches depending on how the regions to erase (i.e., corrupt) are selected: (1) random corruption [singh2017hide, yun2019cutmix] and (2) network-guided corruption [zhang2018adversarial, zhang2018self, choe2019attention, mai2020erasing]. With regard to random corruption, Singh and Lee [singh2017hide] devised Hide-and-Seek (HaS), which randomly drops patches of input images in order to encourage the network to find other relevant regions rather than focusing only on the most discriminative parts of an object. Yun et al. [yun2019cutmix] introduced CutMix, where the randomly erased (i.e., cut) patches are filled with patches from another class and the corresponding labels are also mixed. Though these methods have been considered efficient data augmentation techniques since they require no additional parameters, the random corruption can negatively affect localization performance due to its brute-force elimination of input image regions [babar2021look, zhang2018adversarial, choe2019attention].
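A minimal sketch of HaS-style random patch hiding (NumPy; the grid size, hiding probability, and fill value here are illustrative assumptions, not the settings of [singh2017hide]):

```python
import numpy as np

def hide_patches(img, grid=4, p_hide=0.5, fill=0.0, rng=None):
    """Randomly hide grid patches of a (C, H, W) image, HaS-style."""
    rng = np.random.default_rng(0) if rng is None else rng
    out = img.copy()
    _, H, W = img.shape
    ph, pw = H // grid, W // grid
    for i in range(grid):
        for j in range(grid):
            if rng.random() < p_hide:  # hide this patch
                out[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = fill
    return out
```

Because the hidden patches are chosen independently of the network, the discriminative parts may or may not be removed in any given epoch, which is exactly what pushes the classifier toward a wider set of object regions.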
For the network-guided corruption methods [zhang2018adversarial, choe2019attention, zhang2018self, mai2020erasing], the most discriminative regions of the original image or feature map are dropped according to a threshold (i.e., a drop rate). Zhang et al. [zhang2018adversarial] proposed Adversarial Complementary Learning (ACoL) to find complementary regions through adversarial learning between two parallel classifiers: one erases the discriminative regions, and the other learns discriminative regions other than the erased ones. Similar to ACoL, Choe et al. [choe2019attention] introduced an Attention-based Dropout Layer (ADL), which generates a drop mask and an importance map by utilizing a self-attention mechanism and then randomly selects one of them to be applied to the thresholded feature maps. In addition, [zhang2018adversarial, choe2019attention, zhang2018self, mai2020erasing] also exploited a self-attention mechanism to identify discriminative regions. However, they all require a drop rate as a criterion for the masking. In this regard, our proposed RFGA is capable of discovering both the less and the more discriminative regions by using a novel self-attention module, without setting up a drop rate.
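The thresholding pattern shared by these network-guided methods can be sketched as follows (a simplified NumPy sketch, not the exact ACoL/ADL modules; the channel-mean importance map and the fraction-of-maximum threshold are common simplifications):

```python
import numpy as np

def drop_most_discriminative(feat, drop_rate=0.8):
    """Zero out spatial positions whose channel-mean activation exceeds
    drop_rate * max, i.e., the most discriminative regions. feat: (C, H, W)."""
    importance = feat.mean(axis=0)                 # (H, W) importance map
    thr = drop_rate * importance.max()
    keep = (importance < thr).astype(feat.dtype)   # drop mask
    return feat * keep[None, :, :]                 # broadcast over channels
```

The `drop_rate` argument is precisely the kind of extra hyperparameter that our RFGA is designed to avoid.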
Attention-based Deep Neural Networks.
Attention mechanisms have been widely used to enhance the representational power of features. Among various attention mechanisms [yue2018compact, gao2020channel, wang2020axial, zheng2019looking, gao2020kronecker, huang2019ccnet, wang2018non, zhao2020exploring, cao2019gcnet], here we focus on context fusion based mechanisms [hu2018squeeze, woo2018cbam, wang2020eca, lee2019srm, zhuang2020learning, liu2020improving, kim2020spatially, hu2018gather, kim2020learning, yang2021sa] that strengthen the feature maps by aggregating information from every pixel. For instance, Hu et al. [hu2018squeeze] proposed the Squeeze-and-Excitation Network (SENet), a simple and efficient gating mechanism that considers the channel-wise relationships among feature maps of the base architectures. Likewise, Woo et al. [woo2018cbam] devised the Convolutional Block Attention Module (CBAM), which sequentially combines two separate attention maps for channel and space. Different from SENet [hu2018squeeze], CBAM [woo2018cbam] additionally considers spatial attention, which involves “where” to focus. Moreover, to alleviate a limitation of SENet [hu2018squeeze], which utilizes fully-connected layers to reduce the computational cost at the expense of the association between channels and weights, Wang et al. [wang2020eca] introduced the Efficient Channel Attention Network (ECA-Net), which deploys a 1D convolutional layer to obtain cross-channel attention while maintaining lower model complexity.
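For illustration, the SENet gating can be sketched as follows (NumPy; `w1` and `w2` stand in for the learned bottleneck fully-connected layers and are assumptions of this sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(feat, w1, w2):
    """feat: (C, H, W); w1: (C//r, C) and w2: (C, C//r) are the learned
    bottleneck weights for reduction ratio r."""
    z = feat.mean(axis=(1, 2))                  # squeeze: global average pool -> (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # excite: FC-ReLU-FC-sigmoid -> (C,)
    return feat * s[:, None, None]              # channel-wise rescaling
```

Note that every unit within a given channel is rescaled by the same scalar `s[c]`.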
However, since [hu2018squeeze, woo2018cbam, wang2020eca] emphasize meaningful features by multiplying the same attention value over entire dimensions, ignoring the different information along the spatial (i.e., height and width) or channel dimensions, they can be unsuitable for WSOL, where fine location information is demanded. Meanwhile, our RFGA generates a detailed attention map that has different attention values across all regions by inferring the intersection of the triple-view (i.e., height, width, and channel) attentions.
3 Residual Fine-Grained Attention
In this section, we present the details of our proposed residual fine-grained attention (RFGA) module. The RFGA is applied to the output feature maps before they are fed into a classifier (Fig. 1) to induce the model to learn the entire region of an object. Hereafter, we regard the output feature maps as a 3D feature tensor without loss of generality.
Our RFGA generates a self-attention tensor from three types of view-dependent attention maps obtained by projecting the input feature tensor onto the channel, width, and height dimensions, respectively. In contrast, the existing works [hu2018squeeze, wang2020eca] primarily consider channel-wise attention, ignoring the spatial characteristics distributed over the different maps in a feature tensor. The RFGA-generated attention tensor presents a fine-grained characteristic in the sense of assigning a different attention value to each element in the tensor. Notably, the residual connection in RFGA leads the attention tensor to focus on less discriminative areas of an object as well. As a result, the final output feature tensor presents an enriched representation, yielding better object localization even for features that are relatively less discriminative for classification. The overall architecture of the proposed RFGA is illustrated in Fig. 2 and the detailed descriptions are given below.
3.1 Triple-view Attentions
Let $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$ be an input feature tensor, where $C$, $H$, and $W$ denote the dimensions of the channel, height, and width, respectively. To condense the global distribution of the input feature tensor in the triple views, we apply average pooling along each dimension of the tensor, i.e., channel, height, and width, as follows:

$$\mathbf{z}^{c} = \mathcal{P}_{HW}(\mathbf{X}) \in \mathbb{R}^{C}, \quad (1)$$

$$\mathbf{z}^{h} = \mathcal{P}_{CW}(\mathbf{X}) \in \mathbb{R}^{H}, \quad (2)$$

$$\mathbf{z}^{w} = \mathcal{P}_{CH}(\mathbf{X}) \in \mathbb{R}^{W}, \quad (3)$$

where $\mathcal{P}_{d}$ is an average pooling operator with respect to the dimensions in $d$.
After the computations in Eq. (1)-(3), the three pooled features $\mathbf{z}^{c}$, $\mathbf{z}^{h}$, and $\mathbf{z}^{w}$ can be regarded as summaries of the features extracted in $\mathbf{X}$ from different viewpoints. Clearly, the three views carry different information distributed in the input feature tensor $\mathbf{X}$: $\mathbf{z}^{c}$ captures which feature representations are highly activated, while $\mathbf{z}^{h}$ and $\mathbf{z}^{w}$ reflect, respectively, where the discriminative features are distributed vertically and horizontally across channels.
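Concretely, the triple-view pooling of Eq. (1)-(3) amounts to averaging over the two complementary axes of the tensor; a minimal NumPy sketch (the function and symbol names are ours, not the authors'):

```python
import numpy as np

def triple_view_pool(x):
    """x: (C, H, W) feature tensor -> three pooled views."""
    z_c = x.mean(axis=(1, 2))  # channel view: which channels are active, (C,)
    z_h = x.mean(axis=(0, 2))  # height view: vertical distribution, (H,)
    z_w = x.mean(axis=(0, 1))  # width view: horizontal distribution, (W,)
    return z_c, z_h, z_w
```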
Subsequently, in order to utilize the local interaction among units in each pooled feature [wang2020eca], we apply a 1D convolution with a kernel size of $k$, followed by batch normalization [ioffe2015batch] and a non-linear activation function, as follows:

$$\mathbf{a}^{c} = \sigma\big(\mathrm{BN}(f^{c}(\mathbf{z}^{c}))\big), \quad (4)$$

$$\mathbf{a}^{h} = \sigma\big(\mathrm{BN}(f^{h}(\mathbf{z}^{h}))\big), \quad (5)$$

$$\mathbf{a}^{w} = \sigma\big(\mathrm{BN}(f^{w}(\mathbf{z}^{w}))\big), \quad (6)$$

where $\sigma$ is a sigmoid function and $f^{c}$, $f^{h}$, and $f^{w}$ indicate the 1D convolutional layers for the respective pooled features. Here, $\mathbf{a}^{c}$, $\mathbf{a}^{h}$, and $\mathbf{a}^{w}$ correspond to the resulting triple-view attentions.
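Per view, the attention vector is obtained by a small 1D convolution and a sigmoid; a simplified NumPy sketch that stands in for the learned layer (`kernel` is an assumed stand-in for the learned weights, and batch normalization is omitted for brevity):

```python
import numpy as np

def view_attention(z, kernel):
    """Same-padding 1D convolution followed by a sigmoid, applied to a
    pooled view vector z."""
    s = np.convolve(z, kernel, mode='same')  # local cross-unit interaction
    return 1.0 / (1.0 + np.exp(-s))          # squash to (0, 1)
```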
3.2 Attentions Expansion
We propose to expand the triple-view attentions $\mathbf{a}^{c}$, $\mathbf{a}^{h}$, and $\mathbf{a}^{w}$ to the size of the input feature tensor $\mathbf{X}$, which makes it possible to reflect the attention information back into the input feature tensor in a fine-grained manner. Therefore, we create an attention tensor $\mathbf{A} \in \mathbb{R}^{C \times H \times W}$ of the same size as the input feature tensor by means of an expansion function defined by an outer sum as follows:

$$\mathbf{A}[c, h, w] = \mathbf{a}^{c}[c] + \mathbf{a}^{h}[h] + \mathbf{a}^{w}[w]. \quad (7)$$
It should be noted that the values in the attention tensor are likely to differ from each other, resulting in a fine-grained attention map. This fine-grained attention representation is in major contrast to the previous attention-based methods that learn a coarse attention map, for example, one having the same value across all elements within the same channel.
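Under the usual broadcasting rules, the outer-sum expansion of Eq. (7) is a one-liner; a NumPy sketch (argument names are ours):

```python
import numpy as np

def expand_attention(a_c, a_h, a_w):
    """Outer sum of the triple-view attentions:
    A[c, h, w] = a_c[c] + a_h[h] + a_w[w]."""
    return a_c[:, None, None] + a_h[None, :, None] + a_w[None, None, :]
```

Because each of the three terms varies along its own axis, every element of the resulting tensor can take a different value, which is what makes the attention fine-grained.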
3.3 Feature Calibration
With the attention tensor $\mathbf{A}$ estimated in Eq. (7), we then apply it to the input feature tensor. In particular, we consider two computational approaches as follows:

$$\mathbf{X}' = \mathbf{A} \odot \mathbf{X} \;\;\text{(non-residual)}, \qquad \mathbf{X}' = \mathbf{X} \oplus (\mathbf{A} \odot \mathbf{X}) \;\;\text{(residual)}, \quad (8)$$

where $\odot$ and $\oplus$ denote the Hadamard product and element-wise summation, respectively.
Fundamentally, although those two approaches both employ fine-level attention maps, enabling detailed feature calibration at element-level units, they work in different ways in terms of feature representation learning. The non-residual approach mines and calibrates as many discriminative features as possible by multiplying the input feature tensor with the corresponding attention tensor. Meanwhile, in the residual approach, because of the element-wise sum operation (i.e., a residual operation), the element-wise multiplication between the input feature tensor and the attention tensor is likely to excite the locations where the input feature tensor has weaker activations. Hence, the attention module described in Section 3.2 is trained to give more attention to locations where the input feature tensor presents relatively small activations. Accordingly, the term $\mathbf{A} \odot \mathbf{X}$ plays the role of exciting the less activated regions in $\mathbf{X}$ while inhibiting the more activated regions. This interpretable phenomenon is clearly observed in our experimental results in Fig. 5 and Fig. 6.
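The two calibration variants of Eq. (8) can be sketched as follows (NumPy, with `a` the expanded attention tensor):

```python
import numpy as np

def calibrate(x, a, residual=True):
    """Non-residual: X' = A * X.  Residual: X' = X + A * X."""
    out = a * x                       # Hadamard (element-wise) product
    return x + out if residual else out
```

In the residual case $\mathbf{X}' = (1 + \mathbf{A}) \odot \mathbf{X}$, so the multiplicative branch is free to specialize on what $\mathbf{X}$ under-represents, which matches the excitation/inhibition behavior described above.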
| Method | Dataset | MaxBoxAcc (%) at increasing IoU thresholds | Mean | mIoU (optimal τ) |
| --- | --- | --- | --- | --- |
| RFGA w/o R | CUB-200-2011 | 71.85 / 42.04 / 17.24 / 4.09 / 0.59 | 27.16 | 0.6635 |
| RFGA w/o R | FGVC Aircraft | 96.22 / 87.22 / 53.29 / 15.78 / 2.37 | 50.97 | 0.7600 |
| RFGA w/o R | Stanford Dogs | 84.81 / 74.62 / 59.22 / 36.49 / 12.10 | 53.44 | 0.7613 |
| RFGA w/ R (Ours) | CUB-200-2011 | 75.99 / 51.19 / 26.30 / 8.08 / 1.19 | 32.55 | 0.6970 |
| RFGA w/ R (Ours) | FGVC Aircraft | 89.11 / 79.15 / 62.59 / 35.70 / 6.48 | 54.60 | 0.8101 |
| RFGA w/ R (Ours) | Stanford Dogs | 84.15 / 74.69 / 60.55 / 39.14 / 13.61 | 54.42 | 0.8135 |
4 Experiments
4.1 Experimental Setup
Datasets. We validated our RFGA on three public datasets for WSOL, i.e., CUB-200-2011 [wah2011caltech], FGVC Aircraft [maji2013fine], and Stanford Dogs [KhoslaYaoJayadevaprakashFeiFei_FGVC2011]. First, CUB-200-2011 includes images of 200 bird categories, divided into training and evaluation sets. FGVC Aircraft consists of images of aircraft categories with training, validation, and test splits. Stanford Dogs contains dog images in various categories, divided into training and test sets.
Competing methods. We compared our RFGA with five existing state-of-the-art WSOL methods, i.e., CAM [Zhou_2016_CVPR], HaS [singh2017hide], ACoL [zhang2018adversarial], ADL [choe2019attention], and CutMix [yun2019cutmix]. Further, in order to see the effectiveness of attention methods in WSOL, we also compared against three other context fusion based attention methods, i.e., SENet [hu2018squeeze], CBAM [woo2018cbam], and ECA-Net [wang2020eca].
Evaluation metric. For quantitative evaluation, we used MaxBoxAcc [choe2020evaluating] over the IoU thresholds at the optimal activation map threshold. The activation map threshold $\tau$ is set between 0 and 1 at 0.01 intervals. Therefore, we measured the localization performance over the activation map threshold at various levels of $\tau$.
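A simplified sketch of this protocol (we take the tightest box around all pixels above $\tau$, whereas the official MaxBoxAcc implementation boxes connected components; the helper names are ours):

```python
import numpy as np

def box_from_map(act, tau):
    """Tightest box (x0, y0, x1, y1) around pixels with activation >= tau."""
    ys, xs = np.nonzero(act >= tau)
    if xs.size == 0:
        return None
    return (xs.min(), ys.min(), xs.max(), ys.max())

def iou(b1, b2):
    """Intersection over union of two inclusive (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix1, iy1 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0, ix1 - ix0 + 1) * max(0, iy1 - iy0 + 1)
    area = lambda b: (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area(b1) + area(b2) - inter)

def best_iou_over_taus(act, gt_box, taus=np.arange(0.0, 1.0, 0.01)):
    """Sweep the activation threshold and keep the best IoU."""
    best = 0.0
    for tau in taus:
        box = box_from_map(act, tau)
        if box is not None:
            best = max(best, iou(box, gt_box))
    return best
```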
4.2 Implementation Details
We used ResNet-50 [he2016deep] pre-trained on ImageNet as a backbone network. In order to obtain localization maps, we used the feature maps of the convolutional layers, similar to ACoL [zhang2018adversarial]. For the kernel size in the triple-view attentions, we followed the work of [wang2020eca]. For training, the input images were resized and randomly cropped, and then flipped horizontally at random; the test images were only resized. We trained our RFGA with the stochastic gradient descent optimizer with momentum, decaying the initial learning rate step-wise during training. More details on the training settings and those of the comparative methods can be found in the Supplementary. We implemented all methods in PyTorch (https://pytorch.org) and trained on a Titan X GPU.
4.3 Experimental Results
Quantitative evaluation. Table 1 summarizes the performance of the competing methods at the optimal activation map threshold $\tau$. We observed that our RFGA outperformed the other competing methods in terms of MaxBoxAcc and mean Intersection over Union (mIoU), which is the average IoU over all images at the optimal $\tau$. Further, it is noteworthy that our RFGA showed its superiority to all competing methods in all cases, including various IoU thresholds, over the three datasets.
Classification Top-1 Accuracy (%):

| Method | CUB-200-2011 | FGVC Aircraft | Stanford Dogs |
| --- | --- | --- | --- |
| RFGA w/o R | 79.75 | 65.83 | 76.13 |
| RFGA w/ R (Ours) | 76.65 | 52.99 | 68.17 |
Qualitative visualization. We visualized the predicted localization bounding boxes and activation maps for all methods in Fig. 3. Each image presents the localization result at the optimal threshold, where the IoU of the bounding box from the activation map achieves its maximum value. We observed that RFGA elaborately localized the entire extent of an object on the CUB-200-2011, FGVC Aircraft, and Stanford Dogs datasets. While most competing methods focused on parts of objects or covered areas well beyond the exact object region, our RFGA tightly bounded the entire object in an image, thereby achieving the best localization performance.
Classification. We additionally report the classification accuracy in Table 2 to explore the relation between localization and classification. Following the work of [choe2020evaluating], we selected the model at the last epoch regardless of the validation results. However, we observed that most WSOL methods tend to achieve their best localization performance in the early stage of training in spite of low classification performance. Consequently, we believe that the classification performance is not correlated with the localization performance, consistent with the observation in [choe2020evaluating].
Hyperparameter analysis. We plotted the change of localization performance (MaxBoxAcc [choe2020evaluating]) of all methods by varying the value of $\tau$ in Fig. 4. Our proposed RFGA showed the best performance at a smaller $\tau$ than the other comparative methods. From these results, we can infer that most of the high activation values in our RFGA-calibrated feature tensor were well aligned with the object-related region.
Visualization of attention maps. In order to gain insight into the working of our RFGA, we visualized all triple-view attention maps as well as the final combined attention tensor $\mathbf{A}$ (top), and the input feature map $\mathbf{X}$, the element-wise product $\mathbf{A} \odot \mathbf{X}$, the resulting output feature map $\mathbf{X}'$ via the residual approach, and the difference between $\mathbf{X}'$ and $\mathbf{X}$ (bottom) in Fig. 5. We took the average of each attention map along the channel axis and expanded the averaged vectors to matrices by repetition for visualization. It should be noted that we normalized each matrix to the range $[0, 1]$. Contrary to the activation map of CAM, which focuses only on the body of a bird, our RFGA pays additional attention to the wings, resulting in attention over the entire object. In accordance with Fig. 3, this validates the effectiveness of our fine-grained calibration of features in WSOL.
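The per-matrix normalization used for these visualizations is a standard min-max rescaling; a trivial sketch (the epsilon guard against constant maps is our addition):

```python
import numpy as np

def minmax_norm(m, eps=1e-8):
    """Rescale a 2D map to the range [0, 1] for visualization."""
    return (m - m.min()) / (m.max() - m.min() + eps)
```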
4.5 Ablation Study
Effect of triple-view attention. We assumed that our RFGA could localize an object based on a variety of fine-grained information. To validate the effectiveness of the fine-grained attention, we conducted ablation studies with respect to each of the attention maps generated from the different views, i.e., the channel, vertical, and horizontal views. We employed only one view attention out of the three when training the same architecture, and report the results in Table 3. Our triple-view attention method clearly outperformed all the ablation cases. Based on those results, we believe that our fine-grained attention map from the triple-view attentions is capable of calibrating features through the complementary relations inherent in the input feature tensor.
Effect of residual connection. In order to investigate the effect of the residual connection, we compared the residual and non-residual approaches in Eq. (8) in terms of the localization task in Table 1 and the classification task in Table 2. We also visualized their respective attention maps in Fig. 6. The residual approach generated attention maps that focused on both the most and the less discriminative regions of an object, whereas the attention maps of the non-residual approach showed the opposite pattern. Based on the understanding of a residual operation, note that the residual connection leads the attention term $\mathbf{A} \odot \mathbf{X}$ to learn information that the input feature tensor may have missed or under-emphasized. From the viewpoint of attention map generation, the role of $\mathbf{A} \odot \mathbf{X}$ can thus be interpreted as inhibiting the regions of high activation values (as those are already well represented in $\mathbf{X}$) and exciting the less activated regions where the target task-related information is inherent. Here, the inhibition effect can be related to those of the specially designed modules in ACoL [zhang2018adversarial] and ADL [choe2019attention] that erase the discriminative features.
5 Conclusion
In this paper, we proposed a novel residual fine-grained attention module to localize an object accurately. Our proposed RFGA consists of three components: (i) the triple-view attentions, (ii) expansion of the attentions to a high resolution, and (iii) calibration of the features. Notably, our proposed method does not require a hyperparameter such as a drop rate for masking discriminative regions. Based on the evaluation with the metrics of mIoU and MaxBoxAcc [choe2020evaluating] over three datasets, our proposed method achieved the highest performance. In our exhaustive ablation studies, we demonstrated the validity of all three components and also interpreted the inner working of the feature calibration formulated by a residual operation. It is noteworthy that because our proposed RFGA is plugged in between the last convolutional layer and a classifier, it is applicable to other CNN architectures without modifying the original network architecture. In that sense, generalizing its application to multi-object localization will be a forthcoming research direction.
Acknowledgement. This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program (Korea University)).