Few-shot object detection (FSOD) has been an active research area in recent years. The task aims to recognize and locate novel objects in an image based on a few given object samples. Unlike standard object detectors, FSOD models can quickly adapt to new tasks, saving much of the effort otherwise needed to collect training images and annotations for novel objects.
Previous FSOD works borrow several concepts and operations from few-shot classification (FSC). For instance, prototypes [snell2017prototypical], similarity ranking between inputs [koch2015siamese], and feature map concatenation [sung2018learning] are all widely used in FSOD. However, while recent progress has boosted FSC accuracy, state-of-the-art FSOD models such as [fan2020few] still show a considerable performance gap compared to standard object detection models (11.1 mAP on the COCO dataset). Furthermore, previous methods suffer a 17% to 40% performance drop even on base classes (training categories) (see Fig. 1). Motivated by these observations, we analyze the deficiencies of current FSOD approaches.
FSOD is a challenging task due to the noisy supports (the few examples) and the complex scenes in the query (the image to be tested). Generally, both the support and query images in FSC are close-ups of objects with simple photographic composition. However, the query images in FSOD contain multiple objects belonging to various categories, and the support instances are randomly cropped from other query images. Furthermore, FSOD not only performs classification but also estimates object positions and sizes. Thus, compared with FSC, the spatial variability of objects and the noise in images are far more critical in FSOD.
Fig. 1 illustrates the potential problems of current FSOD methods. First, (a) global pooling is widely employed in previous attempts [karlinsky2019repmet, kang2019few, yan2019meta, fan2020fgn, perez2020incremental, liu2020crnet] to enable element-wise product or concatenation, but it leads to a serious loss of spatial information. Second, (b) the convolution-based attention [fan2020fgn], which is used to establish the correlation between the support and query, encounters the spatial misalignment problem. Additionally, (c) modern FSOD models [karlinsky2019repmet, kang2019few, yan2019meta, fan2020fgn, fan2020few, perez2020incremental] follow FSC in mean-pooling multiple support samples, which yields an ambiguous representation since the aggregated feature mixes multiple semantics. Such an operation is workable in FSC because of its relatively simple problem setting, whereas FSOD is a much more challenging task that requires specific and accurate contextual information to guide the detection process.
To support our intuition, we conduct a pilot experiment with the following hypothesis: if spatial variability were not crucial in the FSOD task, the choice of support set images would not severely affect FSOD performance. In the experiment, well-trained models are tested on the same query set for 100 iterations and the resulting average precision (AP) is recorded. In each iteration, we randomly select 3 images from the support image pool to construct the support set. For a fair comparison, the selected support sets are the same across different methods. As illustrated in Fig. 2, the gaps between the highest and lowest AP for the three methods are large (up to 4.4 AP). The results may indicate that background noise and spatial variability in the support set seriously influence performance and need to be handled well. Otherwise, the models would be unstable and unreliable in real-world applications.
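The pilot protocol above can be sketched as a small sampling loop. This is a minimal sketch under stated assumptions, not the evaluation code used in the paper: `support_sensitivity`, `evaluate_ap`, and `toy_evaluate_ap` are hypothetical names, and the toy scoring function merely stands in for running a trained detector on the fixed query set.

```python
import random
import statistics


def support_sensitivity(evaluate_ap, support_pool, shots=3, iterations=100, seed=0):
    """Re-evaluate a fixed query set with randomly drawn support sets.

    evaluate_ap is a hypothetical callable that maps a support set to the
    AP obtained on the (fixed) query set. Reusing the same seed for every
    method gives every method the same sequence of support sets.
    """
    rng = random.Random(seed)
    aps = []
    for _ in range(iterations):
        support_set = rng.sample(support_pool, shots)  # sample without replacement
        aps.append(evaluate_ap(support_set))
    return {
        "min": min(aps),
        "max": max(aps),
        "gap": max(aps) - min(aps),  # spread between best and worst support set
        "mean": statistics.mean(aps),
    }


def toy_evaluate_ap(support_set):
    """Toy stand-in for a detector: AP depends on which supports were drawn."""
    return 10.0 + sum(support_set) / len(support_set)


stats = support_sensitivity(toy_evaluate_ap, list(range(20)))
print(stats)
```

A large `gap` for a real detector would be the kind of support-set sensitivity that Fig. 2 reports.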
To tackle the problems illustrated in Fig. 1, we present a novel Dual-awareness Attention (DAnA) mechanism which comprises the Background Attenuation (BA) Block and the Cross-image Spatial Attention (CISA) Block. The proposed BA block learns the semantic representation of the target in the support image and attenuates noise by feature addition. Additionally, inspired by the fact that humans can determine which fraction of the visual field should be processed at the highest resolution while ignoring irrelevant details, we propose the CISA block, which captures the pairwise spatial relationship between the support and query features. We suggest that each local region of the query should have its own preference towards the support features so that it can retrieve the information it individually requires. Therefore, we provide customized support attention maps to different query regions. Specifically, we perform a pairwise product instead of an element-wise product between the support and query features to measure the correlation between the two, breaking the physical restriction of spatial alignment. By leveraging the cross-image spatial attention, we address problems (a) and (b) in Fig. 1.
It is also worth noting that the features after global pooling may represent multiple different semantics, and thus the information in their mean may be entangled (Fig. 1 (c)). By peeking at the query, our CISA block generates query-position-aware (QPA) features from the support images, which represent the information related to each local area of the query image. Consequently, we can aggregate several QPA support features of different support images without serious concern about information entanglement. Notably, our proposed method can be easily adopted by different object detection models, such as Faster R-CNN [ren2015faster] and RetinaNet [lin2017focal].
We improve the standard FSOD training and evaluation protocols so that they can be easily reproduced by the community. Additionally, we report the model size and inference speed of each method, which previous works have not evaluated. The results demonstrate that the proposed two-stage model can bring a relative improvement of in mAP (Tab. 1). The performance of our two-stage model can approach that of the oracle under the -shot settings. For the one-stage version, our quantitative results show up to relative improvement in mAP compared with previous one-stage few-shot object detectors (Tab. 2). A comprehensive ablation study is also provided to demonstrate the impact of each proposed component.
Our main contributions can be summarized as follows:
We point out critical deficiencies of previous FSOD approaches which may lead to a lack of robustness and considerably hurt performance.
We present the novel dual-awareness attention mechanism to attenuate undesired information in supports and handle the cross-image spatial relationship between the support and query. We also tackle the issue of feature aggregations over different support images to alleviate the concern of information entanglement.
We establish simple yet reliable FSOD training and evaluation protocols to fairly assess each approach. Extensive experimental results demonstrate the effectiveness and robustness of our proposed methodology.
2 Related Works
In general, few-shot learning methods can be divided into optimization-based and metric-based methods. The optimization-based methods [ravi2016optimization, finn2017model, li2017meta, nichol2018first, lee2019meta] aim to learn a good set of initial parameters that can quickly adapt to new tasks within a few gradient steps in the fine-tuning stage. On the other hand, metric-based methods [koch2015siamese, vinyals2016matching, snell2017prototypical, sung2018learning, liu2019prototype, tian2020rethinking] compute the distance between learned representations in the embedding space. Some recent works [liu2019prototype, tian2020rethinking] suggest that such representations should be further rectified since their quality is important. Although some attempts have successfully boosted the performance of few-shot classification (FSC), the progress of few-shot object detection (FSOD) is still at a very early stage.
Though convolutional neural networks (CNNs) have been the dominant framework in computer vision (CV) for years, recent studies keep improving their performance with attention mechanisms [hu2018squeeze, bello2019attention, ramachandran2019stand]. Recently, inspired by the great success of the Transformer [vaswani2017attention], researchers have started to explore the potential of the self-attention mechanism on existing CV problems [wang2018non, zhao2018psanet, fu2019dual, cao2019gcnet, hu2019local, zhu2019empirical]. Non-local (NL) Neural Networks [wang2018non] present a pioneering approach that leverages self-attention to capture long-range dependencies. [zhao2018psanet] describes the features in CNNs as bidirectional information flows, and [cao2019gcnet] points out that better performance can be achieved by simplifying the NL block. In a recent study, [yin2020disentangled] succeeds in capturing better visual clues by proposing a disentangled NL block. One critical difference between these methods and our proposed attention mechanism is that we measure the dependencies in a dual-awareness manner. Compared to self-attention, our approach can capture the pairwise spatial relationship across different images.
Few-Shot Object Detection.
Current deep-learning-based object detectors have reached remarkable performance. Two-stage detectors [girshick2014rich, girshick2015fast, ren2015faster, lin2017feature] usually dominate in detection performance, while one-stage detectors [lin2017focal, law2018cornernet, tian2019fcos, zhou2019objects] are superior in run-time efficiency. However, most modern object detectors are category-specific. To explore generalized detectors, LSTD [chen2018lstd] and RepMet [karlinsky2019repmet] employ transfer learning and distance metric learning respectively. Recent works [kang2019few, yan2019meta] encode supports into global vectors, so that the similarity between supports and queries can be measured by aggregation. Following this spirit, [fan2020fgn] perceives the problem as a guided process, and [fan2020few] leverages a convolution-based attention mechanism to capture the correlation between supports and queries. In this work, we analyze the potential deficiencies of previous approaches and propose novel attention mechanisms to address those issues.
3.1 Problem Definition
Let be a support image and be a support set comprising -shot support images with the same category. Given a support set and a query image , the task of few-shot object detection (FSOD) is to recognize all the objects belonging to the category of in . By collecting number of pairs , we have one-way -shot object detection tasks .
The object categories of the dataset are divided into two disjoint parts. The base classes denote the categories used for training, while denotes the novel categories, which are used to test the generalization ability. Let be the tasks in which all the S and bounding box annotations belong to . We can leverage to learn a conditional model which performs few-shot object detection on a query image conditioned on a support set S. The learned model can then be applied to novel tasks where all the support images and bounding box annotations belong to . The objective of FSOD is to leverage rich source-domain knowledge in to learn a model which can perform object detection in an arbitrary target domain.
3.2 Dual-awareness Attention Mechanism
FSOD relies on limited support information to detect unseen objects in the query image. Therefore, we consider two important aspects: 1) the quality of the support feature maps, and 2) how to better construct the correlation between the multiple support features and the query. Previous works [karlinsky2019repmet, kang2019few, yan2019meta, fan2020fgn, fan2020few, perez2020incremental] condense the support feature map into a one-dimensional feature vector by global pooling so that it can be aggregated with query feature maps of various sizes. This may cause a serious loss of spatial information (Fig. 1 (a)). Additionally, these methods directly take the mean feature of the support set as the class-specific representation, which ignores the potential information entanglement of the global support features (Fig. 1 (c)). In contrast, we propose two novel modules which can 1) enhance the semantic context of support feature maps and 2) adaptively aggregate multiple support features and the query feature map.
3.2.2 Background Attenuation Block
The complex backgrounds and irrelevant foregrounds in a support image may introduce serious noise that disturbs the matching process. We propose a novel mechanism, the Background Attenuation (BA) Block, which can suppress the undesired information and enhance the target semantic features.
We consider the high-level feature vector at each pixel as a semantic signal response, and we aim to construct a target semantic signal representation for enhancement. Inspired by signal addition, we add the target signal to the feature of each pixel: features sharing a similar context with the target are enhanced, while the others are blurred. In this way, we pass much more distinguishable features to the subsequent processes. This operation can be considered a softer attention mechanism.
The detailed structure of the BA block is illustrated in Fig. 4 (a). The support image in a support set S (containing support images) and the query image are encoded by a shared CNN backbone into feature maps and respectively. We transform the reshaped support feature map with a learnable linear matrix and then apply a softmax function. This process can be formulated as the background attenuation attention function
where denotes the feature vector in pixel of the support feature map , is the set of all pixel indices, and is the softmax function along dimension. can be regarded as the importance score of position . Based on this, the enhanced support feature is defined as
where is a constant hyper-parameter. In the expression, can be seen as the weighted sum of the support feature map along the spatial dimension. The output vector is tiled and added to the feature of each pixel in the original feature map in order to enhance the semantic context at each location. Notably, [hu2018squeeze, fu2019dual] leverage channel-wise attention to reweight the original feature map along the channel dimension, which is very different from our proposed method. Besides, [hu2018squeeze, fu2019dual] construct heavy linear transformation matrices to reweight semantic dependencies, whereas our proposed module is genuinely lightweight since the linear matrix is the only learnable weight in the BA block.
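The BA computation described above (a linear projection, a softmax over spatial positions, a weighted sum, then tiling and addition) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the authors' implementation: the names `background_attenuation`, `w_ba`, and `gamma`, the `(C, 1)` shape of the linear matrix, and the value chosen for the constant are all illustrative.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def background_attenuation(support_feat, w_ba, gamma=0.5):
    """Sketch of a BA-style block.

    support_feat: reshaped support feature map, shape (H*W, C).
    w_ba: assumed learnable linear matrix, shape (C, 1).
    gamma: assumed constant hyper-parameter scaling the added signal.
    """
    # Importance score of every spatial position; sums to 1 over positions.
    scores = softmax(support_feat @ w_ba, axis=0)  # (H*W, 1)
    # Weighted sum of the feature map along the spatial dimension.
    target_signal = scores.T @ support_feat  # (1, C)
    # Broadcasting tiles the target signal and adds it to every pixel.
    return support_feat + gamma * target_signal  # (H*W, C)


rng = np.random.default_rng(0)
feat = rng.normal(size=(6, 4))  # toy 2x3 feature map with 4 channels
w = rng.normal(size=(4, 1))
enhanced = background_attenuation(feat, w)
print(enhanced.shape)
```

Because the same vector is added at every position, the block changes no spatial dimensions and carries only the single projection matrix as a learnable parameter in this sketch.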
|Method||Novel Task||Base Task||# parameters||FPS|
|Faster R-CNN [ren2015faster]||N/A||N/A||N/A||34.3||58.3||35.6||31|
|Meta R-CNN [yan2019meta]||8.7||11.1||11.2||19.9||25.3||25.9||6.8||8.5||8.6||27.3||28.6||28.5||50.4||52.5||52.3||27.3||28.4||28.2||28|
3.2.3 Cross-image Spatial Attention Block
To describe the rich relationships between the supports and the query, we introduce the Cross-image Spatial Attention (CISA) Block. The core idea of CISA is to keep the spatial information (Fig. 1 (a)) and address the misalignment issue (Fig. 1 (b)) by capturing the pairwise spatial relationship between support and query features. We generate the query-position-aware (QPA) support features through attention on the supports conditioned on different local regions of a query image. Averaging the QPA support features does not result in information entanglement (Fig. 1 (c)), since each entry of a QPA support feature represents the information related to a specific location in the query image.
As illustrated in Fig. 4 (b), given a query feature map and support feature map , we obtain the query embedding and key embedding with linear weighting matrices . The query-support correlation can be computed by matrix multiplication between the two embeddings:
where has the shape of and denotes the attention matrix between the query and support image; and are the averaged query and key embedding over all pixels; the softmax function is performed over the dimension. The cross-image spatial attention function is defined as
where is a learnable weight matrix and is a constant coefficient. Since has the shape , we expand to the same shape and apply addition. The operation can be considered a combination of cross-attention and self-attention. The dual term measures the correlation, and the unary term represents the importance of each location in the image.
Next, we construct the query-position-aware (QPA) support feature by
where denotes the vector in position of , and is all the indices of support feature map. The resulting vector is a weighted sum of the based on the correlation between and . In the -shots settings, we can directly aggregate multiple QPA features of different support images by averaging
We perform average pooling on multiple QPA features without the entanglement concern of Fig. 1 (c) because , , …, all represent similar semantic contexts based on .
Finally, the output of CISA block will be the concatenation of and . The problem of spatial misalignment is addressed since the correlation between and will not be restricted by the original physical spatial alignment.
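The CISA pipeline described in this subsection (mean-centered pairwise attention, an added unary term, QPA support features, averaging over shots, and concatenation with the query) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: all names (`cross_image_spatial_attention`, `w_q`, `w_k`, `w_r`, `beta`), the embedding size, the `(D, 1)` shape of the unary weight, and the choice to normalize both softmaxes over support positions are illustrative.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_image_spatial_attention(query_feat, support_feats, w_q, w_k, w_r, beta=0.1):
    """Sketch of a CISA-style block.

    query_feat: (Nq, C) reshaped query feature map.
    support_feats: list of K arrays, each (Ns_k, C); spatial sizes may differ.
    w_q, w_k: assumed embedding matrices, shape (C, D).
    w_r: assumed unary weight, shape (D, 1).
    beta: assumed constant coefficient on the unary term.
    """
    q = query_feat @ w_q  # query embedding, (Nq, D)
    qpa_features = []
    for s in support_feats:
        k = s @ w_k  # key embedding, (Ns, D)
        # Pairwise term: correlation of mean-centered embeddings,
        # normalized over support positions (assumption).
        pairwise = softmax((q - q.mean(axis=0)) @ (k - k.mean(axis=0)).T, axis=1)
        # Unary term: importance of each support position, expanded by
        # broadcasting to the (Nq, Ns) shape of the pairwise term.
        unary = softmax(k @ w_r, axis=0).T  # (1, Ns)
        attention = pairwise + beta * unary  # (Nq, Ns)
        # One support summary per query position (QPA feature).
        qpa_features.append(attention @ s)  # (Nq, C)
    # K-shot aggregation: plain average of the QPA features.
    aggregated = np.mean(qpa_features, axis=0)  # (Nq, C)
    # Output: concatenation of the query feature and the aggregated QPA feature.
    return np.concatenate([query_feat, aggregated], axis=1)  # (Nq, 2C)


rng = np.random.default_rng(0)
C, D = 4, 3
query = rng.normal(size=(5, C))
supports = [rng.normal(size=(6, C)), rng.normal(size=(9, C))]  # different sizes
out = cross_image_spatial_attention(
    query, supports,
    rng.normal(size=(C, D)), rng.normal(size=(C, D)), rng.normal(size=(D, 1)),
)
print(out.shape)
```

The two toy supports have different numbers of positions, yet both yield `(Nq, C)` QPA features. This is why no spatial alignment between support and query is required, and why the features can be averaged directly.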
3.3 FSOD with Proposed Modules
Our proposed modules can be easily combined with existing object detection networks, including two-stage and one-stage methods. We modify Faster R-CNN [ren2015faster] and RetinaNet [lin2017focal] with our novel modules to enable few-shot capability. The overview is illustrated in Fig. 3. For DAnA-FasterRCNN, the BA block is applied to the support feature maps. Then, we apply one CISA block before the RPN, generating the correlation feature. Another CISA block is applied in the second stage, taking the RoI feature and support features as input. For DAnA-RetinaNet, unlike DAnA-FasterRCNN, we apply the BA and CISA blocks on each level of the feature pyramid, and no second stage is needed.
Following [fan2020few], we replace the multi-class classification output with a binary one to better simulate real-world applications. The binary classification indicates whether a bounding box is background or a target object specified by the support set. The remaining details are the same as in the original Faster R-CNN and RetinaNet.
|Method||Novel Task||Base Task|
4.1 FSOD Experimental Settings
Generally, previous FSC and FSOD approaches are evaluated under the N-way K-shot setting. In this paper, both training and evaluation adopt the 1-way K-shot setting, where a support set contains only one category. It can be seen as retrieving the objects in the query image that belong to the class of the support set. We suggest such a setting is much closer to real-world applications.
Following the paradigm of meta-learning, the training data is organized into episodes, each of which contains a support set and a query image. To perform 1-way K-shot FSOD tasks, we randomly select one of the categories contained in the query image as the target class. Notably, we remove all the annotations belonging to novel classes in the training stage. Instances of novel classes may still appear in training query images, but the model is never trained to recognize them. In addition, we find that the two-way contrastive training strategy proposed in [fan2020few] is suitable for our setup as well. Note that all the two-stage implementations in this paper adopt this training strategy to alleviate the proposal imbalance problem. Therefore, some of them may achieve higher performance than the results reported in the original papers.
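The episode construction described above can be sketched as follows. This is a minimal sketch under stated assumptions and not the authors' data pipeline: `build_training_episode`, the `(category, box)` annotation format, and the `support_pool` dictionary of pre-cropped support instances are hypothetical.

```python
import random


def build_training_episode(query_annotations, support_pool, base_classes, shots, rng):
    """Build one 1-way K-shot training episode for a single query image.

    query_annotations: list of (category, box) pairs for the query image.
    support_pool: dict mapping category -> list of support crops (assumed format).
    base_classes: set of categories allowed during training.
    """
    # Remove every annotation that belongs to a novel class.
    base_annotations = [(c, b) for c, b in query_annotations if c in base_classes]
    candidates = sorted({c for c, _ in base_annotations})
    if not candidates:
        return None  # the query image contains no base-class object
    # Randomly pick one category present in the query image as the target class.
    target = rng.choice(candidates)
    # Draw K support instances of the target class.
    support_set = rng.sample(support_pool[target], shots)
    # Only boxes of the target class serve as ground truth for this episode.
    boxes = [b for c, b in base_annotations if c == target]
    return {"category": target, "support_set": support_set, "boxes": boxes}


annotations = [
    ("dog", (0, 0, 10, 10)),
    ("cat", (5, 5, 20, 20)),  # treated as a novel class in this toy example
    ("car", (1, 1, 4, 4)),
    ("dog", (2, 2, 8, 8)),
]
pool = {"dog": ["d1", "d2", "d3", "d4"], "car": ["c1", "c2", "c3"]}
episode = build_training_episode(annotations, pool, {"dog", "car"}, shots=3, rng=random.Random(0))
print(episode)
```

In this toy example the "cat" box is dropped before the target class is chosen, so a novel class never becomes a training target even though its instance remains visible in the query image.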
The test data is also organized into episodes, and we ensure the selected support sets are the same across different methods for a fair comparison. Although previous works [yan2019meta, kang2019few, fan2020fgn, fan2020few, perez2020incremental] tend to fine-tune their models on the , we do not leverage fine-tuning in this paper. One of the reasons is the ambiguity in the fine-tuning protocol. To build the fine-tuning dataset, [kang2019few] specifies that there are only annotated bounding boxes for each category, which leads to an arbitrary number of images in the fine-tuning dataset. On the other hand, [chen2018lstd] collects images for each class, and consequently the fine-tuning dataset may contain an arbitrary number of instances. Additionally, the ability of a model to directly recognize unseen objects from given samples is more suitable for real-world applications. For these reasons, all the few-shot object detectors in this paper are trained on the source domain and directly tested on the target domain.
In this paper, we use Microsoft COCO 2014 [lin2014microsoft] to evaluate the performance. We define 20 categories as the novel classes , and the other 60 classes belong to the base classes . The training split of COCO 2014 has 82,783 images. We extract 13,846 images in which all the objects belong to as the validation set and keep the remaining 68,937 images for training. The validation split of COCO serves as the test data.
4.2 Experiment Result
|Method||Novel Task||Base Task|
|Faster R-CNN [ren2015faster]||N/A||N/A||N/A||47.8||74.4||52.2|
|Meta R-CNN [yan2019meta]||14.7||17.0||17.4||30.3||34.9||36.2||13.1||15.0||15.2||38.9||40.8||41.0||66.2||69.1||69.3||40.9||43.2||43.0|
4.2.1 COCO FSOD Result
The novel/base tasks are those in which all targets belong to novel/base classes. In Tab. 1, our proposed DAnA-FasterRCNN significantly outperforms the other SOTA two-stage few-shot object detectors in both novel and base tasks. In the novel tasks (unseen categories), our DAnA-FasterRCNN outperforms Attention RPN [fan2020few] and FGN [fan2020fgn] by and respectively on metrics under the 5-shot setting. The qualitative results can be seen in Fig. 5. The relative mAP improvement under the -shot setting is up to , which demonstrates that our method can effectively recognize objects with limited support information. For base tasks, our DAnA-FasterRCNN also achieves the highest performance on both and metrics.
We report the base task performance of the standard Faster R-CNN [ren2015faster] and RetinaNet [lin2017focal] trained under the fully supervised learning setting. These oracle methods can be regarded as upper bounds for FSOD methods and roughly show the room for improvement in FSOD. To the best of our knowledge, we are the first to do so. We assume that the base task performance of a well-designed few-shot object detector should be close to that of an oracle. In the 5-shot setting, our DAnA-FasterRCNN is able to reduce the gap between FSOD and standard object detection methods in the base tasks.
We notice that the base task performance of Meta R-CNN [yan2019meta] is also close to that of the standard Faster R-CNN. We suppose this is because Meta R-CNN makes the fewest modifications to the Faster R-CNN architecture and does not combine support and query information in the region proposal stage. Such experimental results may imply that previous attention mechanisms cannot handle the pairwise spatial relationships well, and consequently the RPN has difficulty generating good region proposals.
The comparison of one-stage FSOD methods is shown in Tab. 2. The results of [kang2019few, perez2020incremental] are taken from their original papers. Note that both of their models are trained on COCO 2017 with 118,287 images, which is more than our training data. Also, unlike them, we do not include fine-tuning in our settings. Nevertheless, our DAnA-RetinaNet still outperforms previous approaches by a large margin. The relative mAP improvements in the novel tasks compared to [kang2019few] and [perez2020incremental] are 105% and 125% respectively.
We also report the model size and inference speed of each method in Tab. 1. Since our proposed method leverages cross-image spatial attention to generate QPA support features, it raises a concern about the additional computational cost. However, we show that our proposed model not only achieves remarkable performance but also maintains an acceptable model size and inference speed compared to other baselines.
4.2.2 Class-balanced FSOD Protocol Result
We also consider the FSOD protocol proposed by [karlinsky2019repmet]. The long-tail distribution of object categories [li2020overcoming] in COCO benchmark may have considerable impact on evaluations since the AP on those many-shot categories will dominate the final performance. To address the issue, we generate the class-balanced test data comprised of 500 episodes. An episode in the class-balanced FSOD protocol contains 10 query images and support sets, where denotes the number of categories in the novel/base task. Each support set comprises images when evaluated under the shot setting. Images with annotation bounding boxes smaller than will not be selected as the query images.
We notice that our DAnA-RetinaNet outperforms the two-stage baselines in the novel tasks, yet it does not perform well in the base tasks. On the other hand, our DAnA-FasterRCNN shows improvements consistent with the COCO FSOD results above and achieves state-of-the-art performance in both novel and base tasks. Comparing the performance of DAnA-RetinaNet and DAnA-FasterRCNN, we suppose that the RPN plays a crucial role in FSOD.
4.2.3 Ablation Study
We provide an ablation study in Tab. 4 to verify the impact of each proposed module. Note that the CISA block cannot be removed from our proposed models. Otherwise, the models would lose the ability to measure the correlation between supports and queries and consequently become incapable of performing FSOD.
Does the BA block offer better feature map quality than a naive attention method? The Mask entry in the second column of Tab. 4 denotes that we replace the BA block with a naive soft attention process. To remove the noise in the support images, it is intuitive to learn a soft attention mask that reweights the importance of each location. Therefore, we perform an element-wise product between such a learned mask and each channel of a support feature map. However, we empirically found that such an operation leads to significant performance drops. By visualizing these learned attention masks, we discover that the values tend to concentrate in small areas. Such degeneration may introduce difficulties for the subsequent detection processes. In contrast, the BA block improves the performance since it suppresses the background noise rather than directly filtering it out.
Should we apply concatenation or element-wise product? Since the query feature and the QPA support feature share the same feature size, we can combine them by either concatenation or element-wise product. As can be observed in Tab. 4, the element-wise product implementation suffers a performance drop. We conclude that concatenation is the better choice in our CISA.
We visualize our cross-image spatial attention (CISA) block in Fig. 6. Each support attention map is generated based on a particular region of the query image. The result suggests that CISA learns the semantic correspondence between the support image and the query image. Given the head (colored in red) or the feet (colored in blue) of a human, the attention maps of each support image highlight the head or the feet areas respectively. This shows that the spatial misalignment issue (Fig. 1 (b)) can be addressed by our CISA block.
|Network Alchemy||Fusion Strategy||Novel Task|
In this work, we observed the problems of spatial misalignment and support information entanglement in the challenging few-shot object detection (FSOD) task. We propose a novel yet effective Dual-Awareness Attention (DAnA) Network to tackle these problems. Our method is adaptable to one-stage or two-stage object detection backbones. DAnA boosts few-shot object detection performance by 48% to 125% relatively on the COCO benchmark and reaches state-of-the-art performance. We are excited to point out a new direction for solving few-shot tasks, and we encourage future work to extend our method to more challenging tasks such as few-shot instance segmentation.