Recently, deep neural networks have achieved state-of-the-art results on a variety of visual tasks, e.g., image classification [dai2017deformable, he2016deep, huang2017densely] and object detection [cai2018cascade, dai2016r, girshick2015fast, girshick2014rich, lin2017feature, redmon2016you, redmon2017yolo9000, ren2015faster]. However, these leaps in performance arrive only when a large amount of annotated data is available. Since it is often labor-intensive to obtain adequate labelled data, the number of available samples severely limits the applications of current vision systems. Besides, compared to the ability of humans to quickly extract novel concepts from few examples, these deep models are still far from satisfactory.
Few-shot learning [chen2019closer, koch2015siamese, li2017meta, qiao2019transductive, Snell2017, Sung2018, vinyals2016matching], which aims to learn novel concepts rapidly and generalize well in data-scarce scenarios, has therefore attracted major research interest. As one of its research branches, few-shot object detection (FSOD) is a much more challenging task than both few-shot classification and object detection [chen2018lstd, kang2019few, wang2020frustratingly, Xiao2020FSDetView, yan2019meta]. At present, most FSOD approaches follow the meta-learning paradigm to acquire more task-level knowledge and generalize better to novel classes. However, these methods usually suffer from a complicated training process and data organization, which limits their application scenarios. In contrast, the finetune-based methods, which form another research branch of FSOD, are very simple and efficient [wang2020frustratingly]. By adopting a two-stage fine-tuning scheme, this line of work is comparable to meta-learning methods. Yet, because most parameters are pre-trained on the base domain and then frozen on the novel set, these methods may suffer from a severe shift in data distribution and under-utilization of novel data.
Regardless of whether the method is meta-based or finetune-based, Faster R-CNN [ren2015faster] has been widely used as the basic detector and has achieved good performance. However, its original architecture is designed for conventional detection and lacks tailored consideration for the few-shot scenario, which limits the upper bound of existing approaches. Concretely, on the one hand, as a classic two-stage stacking architecture (i.e., backbone, RPN and RCNN, see Fig.2), Faster R-CNN may encounter an intractable conflict when it jointly optimizes the class-agnostic RPN and the class-relevant RCNN end-to-end through the shared backbone. On the other hand, as a multi-task learning paradigm (i.e., classification and localization), RCNN needs translation-invariant features for the box classifier but translation-covariant features for the box regressor. These mismatched goals potentially generate many low-quality scores and in turn reduce classification power. Moreover, since only a few samples are available during learning, the above contradictions are further exacerbated.
Motivated by the above observations, we extend Faster R-CNN for the few-shot scenario from two orthogonal perspectives: (1) Multi-stage view. As shown in Fig.2, Faster R-CNN contains three components, i.e., backbone, RPN and RCNN, which interact with each other through feature-forward and gradient-backward propagation. Due to the contradiction between RPN and RCNN mentioned above, we propose to prevent the entire model from being dominated by one of them by tailoring the degree of decoupling among the three modules through the gradient. (2) Multi-task view. The task conflict between classification and regression affects the quality of features, which in turn damages the box head outputs, i.e., category scores and box coordinates. We employ an efficient score calibration module only on the classification branch to decouple these two tasks.
This paper proposes a simple yet effective approach, named Decoupled Faster R-CNN (DeFRCN), to perform both multi-stage decoupling and multi-task decoupling for few-shot object detection. The overall architecture is very straightforward, as demonstrated in Fig.3. Compared to the standard Faster R-CNN [ren2015faster], DeFRCN additionally contains two Gradient Decoupled Layers (GDL) and an offline Prototypical Calibration Block (PCB). The former are inserted between the shared backbone and RPN, and between the backbone and RCNN, to adjust the degree of decoupling among the three modules, and the latter runs in parallel with the box classifier for further score calibration. Specifically, during forward-backward propagation, GDL performs a learnable affine transformation on the forward feature maps and simply multiplies the backward gradient by a constant, which efficiently decouples the subsequent module from the preceding one. Moreover, PCB is initially equipped with a well pre-trained classification model (pre-trained on ImageNet) and a set of novel support prototypes. It then takes the region proposals from the few-shot detector as input and boosts the original softmax scores with additional prototype-based pairwise scores. As an interesting by-product, we find that adopting PCB only in the inference phase can greatly improve the performance of few-shot detectors with no extra training effort, which makes the PCB data-efficient and plug-and-play.
The main contributions of our approach are threefold:
We look closely into the conventional Faster R-CNN and propose a simple yet effective architecture for few-shot detection, named Decoupled Faster R-CNN, which can be learned end-to-end via straightforward fine-tuning.
To deal with the data-scarce scenario, we further present two novel modules, GDL and PCB, to perform decoupling among multiple components of Faster R-CNN and boost classification performance respectively.
DeFRCN is remarkably superior to SOTAs on various benchmarks, revealing the effectiveness of our approach.
2 Related Work
2.1 General Object Detection
General object detection based on deep neural networks is currently divided into two main branches, i.e., the two-stage proposal-based paradigm [cai2018cascade, dai2016r, girshick2015fast, girshick2014rich, he2017mask, lin2017feature, ren2015faster] and the one-stage proposal-free one [bochkovskiy2020yolov4, liu2016ssd, redmon2016you, redmon2017yolo9000, redmon2018yolov3], both of which have witnessed great progress on numerous large-scale benchmarks. The R-CNN series falls into the former line of work, which first generates a set of potential objects with a region proposal network (RPN) [ren2015faster] and then performs category classification and box localization for end-to-end detection. In contrast, one-stage detectors directly produce final predictions from the feature map without an RPN module; they usually have an advantage in inference speed, but their detection performance is often not as good as that of two-stage approaches. However, all these frameworks uniformly assume that a large amount of annotated data from the seen domain can be accessed, so they may struggle in data-scarce scenarios or novel unseen domains.
2.2 Few-Shot Learning
Few-shot learning, which aims at learning general knowledge slowly from abundant base data and extracting novel concepts rapidly from extremely few examples of newly arriving classes, has recently been divided into the meta-learning based [vilalta2002perspective] and fine-tuning based [pan2009survey] paradigms. As a recognition case of few-shot learning, few-shot classification has been widely investigated. In the literature, a large number of studies that follow the idea of meta-learning to alleviate severe over-fitting can be divided into two streams, namely, optimization approaches [andrychowicz2016learning, finn2017model, li2017meta, nichol2018first, qiao2018few, ravi2016optimization] and metric approaches [koch2015siamese, qiao2019transductive, Snell2017, Sung2018, vinyals2016matching]. The former intends to learn efficient parameter updating rules [ravi2016optimization] or good parameter initialization strategies [finn2017model], and the latter focuses on obtaining a generalizable embedding metric space to measure pairwise similarity between inputs. In addition to meta-based approaches, some simple fine-tuning based methods [chen2019closer, tian2020rethinking] are attracting more and more attention in the few-shot community. These methods show that just fine-tuning a linear classifier on top of a pre-trained model surprisingly achieves performance competitive with the meta-based approaches. Compared to classification, the solutions for other tasks, such as object detection and segmentation, are still underdeveloped.
2.3 Few-Shot Object Detection
Since previous detectors usually require a large amount of annotated data, few-shot detection has attracted more and more interest recently [bansal2018zero, dong2018few, fan2020fgn, perez2020incremental, rahman2020any, wang2019few, wang2019meta, xiao2021few, yang2020context]. Similar to the classification task [Snell2017, sun2019meta], most current few-shot detectors focus on the meta-learning paradigm. FSRW [kang2019few] is a light-weight meta-model based on YOLOv2 [redmon2017yolo9000] that re-weights the importance of features with channel-wise attention and then adapts these features to promote novel detection. Instead of employing attention on the whole feature map, Meta R-CNN [yan2019meta] focuses on the attention of each RoI feature. Furthermore, FSDetView [Xiao2020FSDetView] puts forward a novel feature aggregation scheme, which leverages feature information from base classes to improve the performance on novel classes. From the perspective of attention on RPN, FSOD [fan2020few] utilizes support information to filter out most background boxes and those in non-matching categories. Although meta-based approaches have been extensively studied recently, there are also some meta-free methods. RepMet [karlinsky2019repmet] incorporates a modified prototypical network as the classification head into a standard object detector. TFA [wang2020frustratingly] proposes a simple approach based on transfer learning that only fine-tunes the last layer of existing detectors on rare classes, and it is comparable to the previous meta-based methods. Instead, our approach, which also follows the idea of fine-tuning, jointly trains almost the entire detector with the novel gradient decoupled layer and prototypical calibration block, outperforming all the above meta-based and finetune-based approaches.
In this section, we first introduce the setup of few-shot object detection in Section 3.1. Then we revisit conventional Faster R-CNN in Section 3.2 and elaborate our Decoupled Faster R-CNN (DeFRCN) in Section 3.3.
3.1 Problem Setting
As in various previous work [fan2020few, kang2019few, wang2020frustratingly, Xiao2020FSDetView], we follow the standard problem setting of few-shot object detection in our paper. Specifically, the whole learning procedure is organized as a two-stage fine-tuning paradigm, which gradually collects transferable knowledge across a large base set $\mathcal{D}_{base}$ with abundant annotated instances and performs adaptation quickly on a novel support set $\mathcal{D}_{novel}$ with only a few samples per category. Note that the base classes $\mathcal{C}_{base}$ in $\mathcal{D}_{base}$ and the novel classes $\mathcal{C}_{novel}$ in $\mathcal{D}_{novel}$ are non-overlapping, namely, $\mathcal{C}_{base} \cap \mathcal{C}_{novel} = \varnothing$. Each sample is a pair $(x, y)$, where $x$ is the input image with objects and $y$ denotes the categories and the structured box annotations of those objects. Under this setting, the ultimate goal of our algorithm is to optimize a robust detector based on $\mathcal{D}_{base}$ and $\mathcal{D}_{novel}$, then classify and localize unlabelled objects of a novel query set $\mathcal{D}_{query}$ with classes $\mathcal{C}_{query}$, where $\mathcal{C}_{query} \subseteq \mathcal{C}_{base} \cup \mathcal{C}_{novel}$. The overall procedure, which follows standard transfer learning, can be summarized as follows,
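$$\mathcal{M}_{init} \xrightarrow{\;\mathcal{D}_{base}\;} \mathcal{M}_{base} \xrightarrow{\;\mathcal{D}_{novel}\;} \mathcal{M}_{novel} \tag{1}$$
where $\mathcal{M}_{init}$, $\mathcal{M}_{base}$ and $\mathcal{M}_{novel}$ denote the learned detectors in the initialization, base training and novel fine-tuning stages respectively. The symbol $\rightarrow$ indicates model training, with the base set $\mathcal{D}_{base}$ or the novel set $\mathcal{D}_{novel}$ written above it as the training data.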
3.2 Revisiting Faster R-CNN
As a two-stage stacking architecture, Faster R-CNN [ren2015faster] consists of three functionally separate modules trained end-to-end, i.e., a shared convolutional backbone for extracting generalized features, an efficient Region Proposal Network (RPN) for generating class-agnostic proposals and a task-specific RCNN head [girshick2015fast] for performing class-relevant classification and localization. The whole learning procedure is illustrated in Fig.2 (a). Concretely, the input image is first fed into the backbone to generate a high-level feature map, which is then provided in parallel to the next two modules, i.e., RPN and RCNN. Second, by simultaneously classifying and regressing a group of scale-varying anchors on the feature map, RPN generates a sparse set of high-quality region proposals. Finally, on top of the shared feature map and proposals, RCNN pools each region-of-interest into a fixed-size feature map with RoI pooling [he2017mask], and then applies the box classifier and regressor to compute the object category probabilities and fine-tune the box boundaries respectively. All these modules are jointly optimized end-to-end by minimizing a unified objective function, which follows the multi-task learning paradigm as:
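$$\mathcal{L}_{total} = \mathcal{L}_{rpn} + \eta \cdot \mathcal{L}_{rcnn} \tag{2}$$
where $\mathcal{L}_{rpn}$ and $\mathcal{L}_{rcnn}$ are the losses of the RPN and RCNN modules, and $\eta$ is a balancing hyper-parameter for the different tasks.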
Problem of multi-task learning. The three modules of Faster R-CNN described above constitute a unified multi-task learning (MTL) framework, yet there is a certain inconsistency among the optimization goals of these sub-networks. Specifically, using the feature maps extracted from the hard-parameter shared [vandenhende2021multi] backbone, RPN aims at generating class-agnostic region proposals to tell the network where to look, while RCNN aims to perform region-based detection category by category to identify what to look at. Furthermore, the classification head needs translation-invariant features, whereas the localization head needs translation-covariant features. Although multi-task learning generally helps to improve the end-to-end performance of object detection, as shown in Faster R-CNN [ren2015faster], the joint optimization with Eq.2 may lead to suboptimal solutions on individual tasks in order to balance their mismatched goals [cheng2018revisiting, wu2020rethinking].
Problem of shared backbone. According to the arguments in [ren2015faster], the ultimate goal of the shared backbone is to extract general features that are as suitable as possible for all downstream tasks. In fact, from the perspective of gradient flow in Fig.2 (a), RPN and RCNN exchange optimization information with each other through the shared backbone. However, due to the potential contradictions between RPN and RCNN, we notice that the current architecture may reduce the few-shot detection power of the entire framework. Moreover, following the setting of Eq.1, the shared backbone of the few-shot novel detector is usually fine-tuned from a base domain detector $\mathcal{M}_{base}$. During this two-stage cross-domain procedure, RPN may suffer from foreground-background confusion, which means that a proposal that belongs to the background in the base training phase is likely to be foreground in the novel fine-tuning phase. Through the gradient from RPN, the shared convolutional layers propagate the tendency of over-fitting on base classes to the backbone and RCNN. Although this is one way to converge to good behaviour on the base domain, it potentially damages the ability to transfer to the novel set quickly and efficiently, especially in the data-scarce scenario.
3.3 Decoupled Faster R-CNN
Motivated by the above arguments, we propose a simple yet effective approach, named Decoupled Faster R-CNN (DeFRCN), to tap into more of the potential of Faster R-CNN styled detectors in the few-shot literature. Based on the idea of decoupling three functional modules (i.e., backbone, RPN and RCNN) and two kinds of tasks (i.e., classification and localization), the overall architecture of our method is very straightforward, as demonstrated in Fig.3: it has two Gradient Decoupled Layers (GDL) to adjust the degree of decoupling among the three modules and an offline Prototypical Calibration Block (PCB) to improve the classification power of RCNN during the inference phase.
3.3.1 Gradient Decoupled Layer
In this section, we look into a different aspect of network design: how to customize the relationship between the upstream and downstream modules of the model. From the perspective of feature-forward and gradient-backward propagation, we introduce a novel architectural unit, denoted as the Gradient Decoupled Layer (GDL). During forward propagation, GDL employs an affine transformation layer $\mathcal{A}$, which is parameterized by learnable channel-wise weights $w$ and bias $b$, to simply enhance feature representations and perform forward-decoupling. During backward propagation, GDL takes the gradient from the subsequent layer, multiplies it by a constant $\lambda$ and passes it to the preceding layer, as illustrated in Fig.3. Concretely, when back-propagation passes through the GDL, the partial derivatives of the loss downstream of the GDL with respect to the layer parameters upstream of the GDL are multiplied by $\lambda$; that is, the gradient that would normally reach the upstream parameters is simply replaced with its $\lambda$-scaled version (see Eq.4). Mathematically, we can formally treat GDL as a pseudo-function defined by two equations describing its forward- and backward-propagation behaviour as follows:
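$$\mathcal{G}_{(\mathcal{A},\lambda)}(x) = \mathcal{A}(x) \tag{3}$$
$$\frac{d\,\mathcal{G}_{(\mathcal{A},\lambda)}}{dx} = \lambda \nabla_{\mathcal{A}} \tag{4}$$
where $\mathcal{A}$ is an affine transformation layer, $\lambda$ is a decoupling coefficient and $\nabla_{\mathcal{A}}$ is the Jacobian matrix from the affine layer. In general, implementing such a layer with existing deep learning frameworks is extremely simple, as defining procedures for forward propagation (affine transformation) and back-propagation (multiplying by a constant) is trivial. We provide the pseudo-code of GDL in Algorithm 1.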
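The forward and backward behaviour described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the paper's Algorithm 1: the class name `GradientDecoupledLayer`, the attribute `decouple_coeff`, and the simplification to one scalar activation per channel are assumptions made for the example. It also assumes that the gradient scaling is applied to the gradient passed upstream, while the affine parameters themselves receive unscaled gradients.

```python
class GradientDecoupledLayer:
    """Toy GDL: channel-wise affine forward, constant-scaled gradient backward."""

    def __init__(self, num_channels, decouple_coeff):
        self.weight = [1.0] * num_channels  # learnable channel-wise scale
        self.bias = [0.0] * num_channels    # learnable channel-wise shift
        self.decouple_coeff = decouple_coeff  # hypothetical name for the constant
        self._last_input = None

    def forward(self, x):
        # x holds one scalar activation per channel (simplification).
        self._last_input = list(x)
        return [w * v + b for w, v, b in zip(self.weight, x, self.bias)]

    def backward(self, grad_output):
        # Gradients of the affine parameters are left unscaled (assumption).
        grad_weight = [g * v for g, v in zip(grad_output, self._last_input)]
        grad_bias = list(grad_output)
        # Gradient sent to the preceding module: affine Jacobian times the constant.
        grad_input = [self.decouple_coeff * g * w
                      for g, w in zip(grad_output, self.weight)]
        return grad_input, grad_weight, grad_bias


gdl = GradientDecoupledLayer(num_channels=2, decouple_coeff=0.75)
gdl.weight = [2.0, 0.5]
features = gdl.forward([1.0, 4.0])
grad_input, grad_weight, grad_bias = gdl.backward([1.0, 1.0])
```

Setting `decouple_coeff` to 0 makes `grad_input` all zeros, which corresponds to stopping the gradient to the preceding module, while a value of 1 leaves an ordinary affine layer.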
Perform Decoupling with GDL. Given a standard Faster R-CNN [ren2015faster], two GDLs are inserted between the shared backbone and RPN (i.e., $\mathcal{G}_{rpn}$) and between the shared backbone and RCNN (i.e., $\mathcal{G}_{rcnn}$) respectively, which yields the part of the DeFRCN architecture depicted in Fig.3. Specifically, during forward propagation, the feature from the shared backbone is transformed into different feature spaces through $\mathcal{A}_{rpn}$ and $\mathcal{A}_{rcnn}$. Moreover, during backward propagation, we adjust the decoupling degree of the three modules (i.e., backbone, RPN and RCNN) by applying different $\lambda_{rpn}$ and $\lambda_{rcnn}$ to the gradients. More formally, we consider the following loss function with two separate GDLs as:
Here, $\mathcal{G}$ is the Gradient Decoupled Layer proposed in this section, and $\theta_b$, $\theta_{rpn}$ and $\theta_{rcnn}$ are the learnable parameters of the backbone, RPN and RCNN respectively. Moreover, $\eta$ is a hyper-parameter that controls the trade-off between $\mathcal{L}_{rpn}$ and $\mathcal{L}_{rcnn}$.
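Optimization with GDL. Consistent with the optimization goal of Faster R-CNN, we seek the optimal parameters $\theta_b$, $\theta_{rpn}$ and $\theta_{rcnn}$, denoted as $\theta_b^{*}$, $\theta_{rpn}^{*}$ and $\theta_{rcnn}^{*}$, for the function in Eq.5 as: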
where $N$ is the number of training samples, and $\mathcal{L}_{total}$ is from Eq.5. Concretely, a gradient descent step can be described as:
where $\mu$ is the learning rate, and $\lambda_{rpn}$ and $\lambda_{rcnn}$ are the decoupling coefficients for RPN and RCNN respectively. It can be seen from Eq.8 and Eq.9 that adding GDL does not affect the optimization of RPN and RCNN. However, the parameter update of the shared backbone is deeply affected by GDL in Eq.7. We mainly analyze three important situations: (1) $\lambda_{rpn} = 0$ (or $\lambda_{rcnn} = 0$), which is equivalent to stopping the gradient from RPN (or RCNN), so the update of $\theta_b$ is dominated only by RCNN (or RPN); (2) $\lambda_{rpn} \in (0, 1]$ (or $\lambda_{rcnn} \in (0, 1]$), which is equivalent to scaling the gradient from RPN (or RCNN), so that RPN (or RCNN) makes its own individually weighted contribution to the update of the shared backbone; (3) $\lambda_{rpn} = \lambda_{rcnn} = \lambda$, which is equivalent to multiplying the learning rate of the backbone by a small coefficient, i.e., $\lambda\mu$, and ensures that $\theta_b$ is updated more slowly than $\theta_{rpn}$ and $\theta_{rcnn}$. Note that $\lambda < 0$ is meaningless for detection; more discussion about $\lambda$ is given in the supplementary material.
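To make the three situations concrete, the following sketch applies one gradient descent step to a toy backbone parameter vector. The function `backbone_step`, its argument names and the toy gradient values are all hypothetical; the update rule is an assumed form in which the RPN and RCNN gradients reaching the backbone are each multiplied by their decoupling coefficient before the usual step, with `eta` standing in for the loss trade-off weight.

```python
def backbone_step(theta_b, grad_rpn, grad_rcnn, lr, lambda_rpn, lambda_rcnn, eta=1.0):
    """One SGD step on backbone parameters with per-branch gradient scaling."""
    return [
        t - lr * (lambda_rpn * g_rpn + eta * lambda_rcnn * g_rcnn)
        for t, g_rpn, g_rcnn in zip(theta_b, grad_rpn, grad_rcnn)
    ]


theta_b = [1.0, 2.0]     # toy backbone parameters
grad_rpn = [0.5, -1.0]   # toy gradient of the RPN loss w.r.t. theta_b
grad_rcnn = [2.0, 4.0]   # toy gradient of the RCNN loss w.r.t. theta_b

# Situation (1): stop the RPN gradient, scale the RCNN gradient.
stop_rpn = backbone_step(theta_b, grad_rpn, grad_rcnn,
                         lr=0.1, lambda_rpn=0.0, lambda_rcnn=0.75)

# Situation (3): equal coefficients behave like a smaller backbone learning rate.
shared = backbone_step(theta_b, grad_rpn, grad_rcnn,
                       lr=0.1, lambda_rpn=0.5, lambda_rcnn=0.5)
slow_lr = backbone_step(theta_b, grad_rpn, grad_rcnn,
                        lr=0.05, lambda_rpn=1.0, lambda_rcnn=1.0)
```

In this toy setting `shared` and `slow_lr` coincide up to floating-point rounding, which mirrors the claim that equal coefficients simply rescale the backbone learning rate, whereas `stop_rpn` depends on the RCNN gradient alone.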
3.3.2 Prototypical Calibration Block
In this section, we introduce a novel metric-based score refinement module, termed the Prototypical Calibration Block (PCB), to effectively decouple the classification and localization tasks at inference time. In general, most detectors deploy a classifier and a regressor in parallel on top of the shared network. However, classification needs translation-invariant features whereas localization needs translation-covariant features. Thus the localization branch may force the backbone to gradually learn the translation-covariant property, which potentially degrades the performance of the classifier. Given the model complexity, the extreme lack of annotated samples further exacerbates this contradiction.
We notice that the under-explored few-shot classification branch generates a large number of low-quality scores, which motivates us to eliminate high-scored false positives and remedy low-scored missing samples by introducing a Prototypical Calibration Block (PCB) for score refinement. The overall pipeline is illustrated in Fig.3 (c). Concretely, our PCB consists of a strong classifier from an ImageNet pre-trained model, a RoIAlign layer and a prototype bank. Given an $M$-way $K$-shot task with support set $S$, the PCB first extracts the original image feature map and then employs RoIAlign with ground-truth boxes to produce instance representations. Based on these features, we shrink the support set $S$ to a prototype bank $P = \{p_c\}_{c=1}^{M}$ with Eq.10:
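$$p_c = \frac{1}{|S_c|} \sum_{x \in S_c} x \tag{10}$$
where $S_c$ is the subset of the support set $S$ that contains the samples with label $c$, and $p_c$ is the prototype of class $c$. Given an object proposal $\hat{y}_i = (c_i, s_i, b_i)$ produced by the fine-tuned few-shot detector, where $b_i$ is the box boundaries, $c_i$ is the predicted category and $s_i$ is the corresponding score, PCB first performs RoIAlign on the predicted box $b_i$ to generate the object feature $x_i$, and then calculates the cosine similarity $s_i^{cos}$ between $x_i$ and $p_{c_i}$ as:
$$s_i^{cos} = \frac{x_i^{\top} p_{c_i}}{\lVert x_i \rVert \, \lVert p_{c_i} \rVert} \tag{11}$$
In the end, we perform a weighted aggregation between $s_i^{cos}$ from PCB and $s_i$ from the few-shot detector to obtain the final classification score $s_i^{\dagger}$ as follows:
$$s_i^{\dagger} = \alpha \cdot s_i + (1 - \alpha) \cdot s_i^{cos} \tag{12}$$
where $\alpha$ is the trade-off hyper-parameter.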
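The score refinement pipeline described above (class prototypes, cosine similarity, weighted fusion) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the helper names `build_prototypes`, `cosine_similarity` and `calibrate_score` are hypothetical, prototypes are assumed to be per-class feature means, the features are short hand-written vectors standing in for RoIAlign outputs, and the fusion is assumed to weight the detector score by `alpha` and the similarity by `1 - alpha`.

```python
import math


def build_prototypes(features, labels):
    """Average the support features of each class into one prototype (assumed mean)."""
    grouped = {}
    for feat, label in zip(features, labels):
        grouped.setdefault(label, []).append(feat)
    return {
        label: [sum(dim) / len(feats) for dim in zip(*feats)]
        for label, feats in grouped.items()
    }


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def calibrate_score(det_score, feature, pred_label, prototypes, alpha=0.5):
    """Fuse the detector score with the similarity to the predicted class prototype."""
    cos_score = cosine_similarity(feature, prototypes[pred_label])
    return alpha * det_score + (1.0 - alpha) * cos_score


# Toy support set: two "cat" instances and one "dog" instance.
support_features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
support_labels = ["cat", "cat", "dog"]
prototypes = build_prototypes(support_features, support_labels)

# A low-scored proposal that matches its class prototype is boosted,
# and a high-scored proposal that does not match is suppressed.
boosted = calibrate_score(0.4, [1.0, 0.0], "cat", prototypes)
suppressed = calibrate_score(0.9, [1.0, 0.0], "dog", prototypes)
```

Because the prototypes are computed once from the support set and the calibration only touches scores at inference time, a block of this shape needs no training of its own.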
Moreover, we do not share any parameters between the few-shot detector and the PCB module, so that the PCB can not only preserve the quality of the classification-oriented translation-invariant features, but also better decouple the classification task and the regression task within the RCNN. Furthermore, since the PCB module is offline and requires no further training, it is plug-and-play and can easily be attached to other architectures to build stronger few-shot detectors.
In this section, we first introduce the experimental settings in Sec.4.1 and then compare our approach with previous SOTAs on multiple benchmarks in Sec.4.2. Finally, we provide comprehensive ablation studies in Sec.4.3.
4.1 Experimental Setting
Existing benchmarks. We follow the previous work [kang2019few, wang2020frustratingly, Xiao2020FSDetView] and use the same data splits as [wang2020frustratingly] to evaluate our approach for a fair comparison. For PASCAL VOC, we have three random split groups, each of which covers 20 categories that are randomly divided into 15 base classes and 5 novel classes. Each novel category has $K = 1, 2, 3, 5, 10$ objects sampled from the combination of the VOC07 and VOC12 train/val sets for few-shot training, and the VOC07 test set is used for evaluation. For COCO, the 60 categories disjoint with VOC are denoted as base classes, while the remaining 20 classes are used as novel classes with $K = 1, 2, 3, 5, 10, 30$ shots. We use 5k images from the validation set for evaluation and the rest for training.
Evaluation setting. We take two popular evaluation protocols into consideration to assess the effectiveness of our approach, namely few-shot object detection (FSOD) and generalized few-shot object detection (G-FSOD). The former protocol is widely adopted by most previous methods [chen2018lstd, kang2019few, Xiao2020FSDetView, yan2019meta] and only focuses on the performance on novel classes. The latter observes not only the performance on novel classes, but also the base and overall performance of the few-shot detector, which is more comprehensive and monitors the occurrence of catastrophic forgetting [wang2020frustratingly]. For evaluation metrics, we report AP50 for VOC and the COCO-style AP for COCO. Moreover, all results are averaged over multiple repeated runs.
Implementation details. Our approach employs Faster R-CNN [ren2015faster] (termed FRCN) as the basic detection framework and ResNet-101 [he2016deep] pre-trained on ImageNet [russakovsky2015imagenet] as the backbone. We adopt SGD with a momentum of 0.9 and weight decay to optimize our network end-to-end, using a mini-batch size of 16. The learning rate is set to 0.02 during base training and 0.01 during few-shot fine-tuning. Moreover, the $\lambda$ in the GDL of RPN is set to 0 to stop the gradient, and the $\lambda$ in the GDL of RCNN is set to 0.75 during base training and 0.01 during novel fine-tuning to scale the gradient. The trade-off hyper-parameter $\alpha$ in PCB is uniformly set to 0.5 in all settings.
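For reference, the hyper-parameters stated above can be collected into a small configuration sketch. The dictionary layout, the key names and the helper `stage_settings` are hypothetical and do not correspond to any particular framework's config schema; the sketch also assumes that the RPN coefficient is 0 in both training stages. The weight decay value is not restated here and is therefore left out.

```python
# Hypothetical layout; only the numeric values come from the text above.
TRAINING_CONFIG = {
    "detector": "Faster R-CNN",
    "backbone": "ResNet-101 (ImageNet pre-trained)",
    "optimizer": {"type": "SGD", "batch_size": 16, "momentum": 0.9},
    "base_training": {"lr": 0.02, "gdl_coeff_rpn": 0.0, "gdl_coeff_rcnn": 0.75},
    "novel_finetuning": {"lr": 0.01, "gdl_coeff_rpn": 0.0, "gdl_coeff_rcnn": 0.01},
    "pcb_tradeoff": 0.5,  # weight used when fusing detector and prototype scores
}


def stage_settings(stage):
    """Merge the shared optimizer settings with the settings of one training stage."""
    if stage not in ("base_training", "novel_finetuning"):
        raise ValueError(f"unknown stage: {stage}")
    settings = dict(TRAINING_CONFIG["optimizer"])
    settings.update(TRAINING_CONFIG[stage])
    return settings


base = stage_settings("base_training")
novel = stage_settings("novel_finetuning")
```

Keeping the two stages side by side makes the main difference visible at a glance: the RCNN coefficient drops from 0.75 to 0.01 when moving from base training to novel fine-tuning.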
|Novel Set 1||Novel Set 2||Novel Set 3|
|Method / Shots||w/G||1||2||3||5||10||1||2||3||5||10||1||2||3||5||10|
|Meta R-CNN [yan2019meta]||✗||19.9||25.5||35.0||45.7||51.5||10.4||19.4||29.6||34.8||45.4||14.3||18.2||27.5||41.2||48.1|
|Method / Shots||w/G||1||2||3||5||10||30|
|Meta R-CNN [yan2019meta]||✗||-||-||-||-||8.7||12.4|
4.2 Comparison Results
PASCAL VOC. We present our evaluation results on VOC for three different data splits in Table 1. Under both the FSOD and the G-FSOD setting, our DeFRCN outperforms the recent state-of-the-art approaches by a large margin (up to 21.4), which demonstrates the effectiveness of our approach. Based on the results in Table 1, we further notice two interesting phenomena in few-shot detection: (1) For the FSOD setting, increasing the number of novel shots does not necessarily improve the final performance. Taking Novel Set 1 as an example, the AP50 of 5-shot is 64.1 but that of 10-shot is 60.8 (-3.3). There is a similar case in TFA. We conjecture that sample quality is vital in the data-scarce scenario and that adding low-quality samples may be harmful to the detector. (2) Comparing FSOD and G-FSOD, we find that as the number of shots increases, the final performance of G-FSOD grows faster than that of FSOD (40.2 → 66.5 vs. 53.6 → 60.8), which is due to the addition of more negative samples under the G-FSOD setting.
COCO. Table 2 shows all evaluation results on the COCO dataset with the standard COCO-style averaged AP. Our approach consistently outperforms recent SOTAs in all setups, including FSOD and G-FSOD for $K = 1, 2, 3, 5, 10, 30$. For FSOD, we achieve improvements of around 6.0 and 7.9 points over the best method in 10-shot and 30-shot respectively, which demonstrates the strong robustness and generalization ability of our method in the few-shot scenario. Furthermore, compared to the fine-tuning based methods, the number of learnable parameters of DeFRCN is almost the same as that of FRCN-ft and much larger than that of TFA. The results in Table 2 reveal that our method not only guarantees sufficient learning of these parameters, but also avoids severe over-fitting. All base/overall results of G-FSOD are presented in the supplementary materials.
COCO to VOC. We conduct the cross-domain FSOD experiments on the standard VOC 2007 test set following the same setting as [kang2019few, wu2020multi], which uses the base dataset with 60 classes as in the previous COCO within-domain setting and a novel dataset with 10-shot objects for each of the 20 classes from VOC. As shown in Table 3, our approach achieves the best performance of 55.9, a 13.6-point improvement over MPSR [wu2020multi]. This large gain demonstrates that our proposed DeFRCN has better generalization ability in cross-domain situations.
4.3 Ablation Study
Effectiveness of different modules. We conduct ablations in the 10/30-shot scenarios on the COCO dataset to carefully analyze how much each module contributes to the final performance of DeFRCN. All results are shown in detail in Table 4. Specifically, the first row shows the results of plain FRCN, which only achieves 7.9/12.2 for 10/30-shot respectively, indicating that the original model without any few-shot techniques over-fits severely due to the lack of training data. Next, we take four progressive steps to complete the exploration of our DeFRCN: (1) Add GDL in the base training phase (GDL-B). From the results of rows 1-4 and 5-8, we find that GDL-B improves performance by 0.6 on base classes and also brings a certain improvement (0.3 to 2.1) on novel classes. This indicates that a better base model is beneficial to the performance of the few-shot detector. (2) Add GDL in the novel fine-tuning phase (GDL-N). The results of the first and third rows show that GDL-N brings a large boost of 7.3/6.8 for 10/30-shot, which comes mainly from two aspects: i) more learnable parameters guarantee sufficient ability to transfer to the novel domain, and ii) GDL greatly reduces the risk of over-fitting. (3) Add PCB in the inference phase. As a plug-and-play module, PCB is orthogonal to GDL, so no matter which setting PCB is added to, our model gains a further 1.4 to 2.6 points. (4) Finally, we integrate the above three modules into the original FRCN, and the last row shows the final performance of DeFRCN. Compared to the plain results in the first row, we obtain a gain of 10.6/10.4 for 10/30-shot, which demonstrates the effectiveness of our approach.
Effectiveness of the degree of decoupling. We carefully explore the influence of decoupling by setting different $\lambda_{rpn}$ and $\lambda_{rcnn}$ in GDL during both base training and novel fine-tuning, and all results are illustrated in Fig.4. In both the base training and the novel fine-tuning stage, the model tends to achieve higher performance when $\lambda_{rpn}$ is set to a smaller value (close to 0), while $\lambda_{rcnn}$ needs an appropriate value to ensure that the backbone can be optimized better. This observation prompts us to perform stop-gradient for RPN and scale-gradient for RCNN in DeFRCN. In addition, we draw a very interesting conclusion from the four corners in Fig.4(a): in terms of FRCN backbone optimization, RPN plays a negative role in this procedure (39.01 vs. 38.39), while RCNN has a positive effect (31.56 vs. 38.39).
Can GDL boost conventional detection? The above analysis shows that GDL brings a significant improvement on FSOD. Since the problem it solves (that is, the contradiction in Faster R-CNN) also potentially exists in conventional detection, we conjecture that our GDL is also effective in data-sufficient scenarios. Thus we further conduct experiments on the COCO 2017 dataset with the standard setup [ren2015faster], and the results are shown in Table 5. It can be seen that the proposed GDL outperforms the baselines on all evaluation metrics. Specifically, adding GDL to the original FRCN gains 1.5 and 0.9 AP50 for Res-50 and Res-101 respectively.
In this paper, we look closely into the visual task of few-shot object detection and propose a simple yet effective fine-tuning based framework, named Decoupled Faster R-CNN, which remarkably alleviates the potential contradictions of conventional Faster R-CNN in the data-scarce scenario by introducing the novel GDL and PCB. Despite its simplicity, our method achieves new state-of-the-art results on various benchmarks, which demonstrates its effectiveness and versatility.
Acknowledgement. This paper is supported by the National Key R&D Plan of the Ministry of Science and Technology (Project No. 2020AAA0104400).
In this supplementary material, we provide additional details which we could not include in the main paper due to space limitations, including more experimental analysis and visualization details that help us develop further insights to the proposed approach. We discuss:
More results of generalized few-shot object detection.
Additional analysis on Prototypical Calibration Block.
Related extensions of Gradient Decoupled Layer.
Qualitative visualization results of our approach.
Appendix A Generalized Few-Shot Object Detection
A.1 Implementation Details
As mentioned in the main paper, we take two popular evaluation protocols into consideration to assess the effectiveness of our approach, namely few-shot object detection (FSOD) and generalized few-shot object detection (G-FSOD). The difference between these two protocols is whether the performance on base classes is still required after the fine-tuning stage. Following the G-FSOD setting in TFA [wang2020frustratingly], we fine-tune our DeFRCN on a small balanced training set consisting of both base and novel classes, where each class has the same number of annotated objects (i.e., $K$-shot). Apart from using more training iterations, the other experimental settings for G-FSOD are exactly the same as those for FSOD in our paper.
A.2 Experimental Results of G-FSOD Setting
In this section, we show the full benchmark results of the G-FSOD setting. For each evaluation metric, we report the average results of random splits ( for VOC and
PASCAL VOC. We present the complete G-FSOD results on VOC in Table 6 and analyze them from the following three aspects: (1) Novel AP. The novel AP of our model is usually over 7 points higher than that of TFA across the three data splits, which indicates that the proposed DeFRCN has a clear advantage in novel performance. (2) Base AP. Our approach outperforms TFA on split 2 (+1.9 to +3.7); however, it is slightly worse on data splits 1 and 3 (-0.3 to -1.0). We notice that the base performance advantage of TFA comes from its strategy of fine-tuning only the last layer of the detector, which does help ensure that the base performance does not decrease too much, but also prevents the novel performance from being further improved. (3) Overall AP. As shown in Table 6, the proposed DeFRCN achieves the best overall performance across all settings (+1.4 to +4.0), including all data splits and shots.
|Split||# shots||Method||Overall #20||Base #15||Novel #5|
|Split 1||1||FSRW [kang2019few]||27.6±0.5||50.8±0.9||26.5±0.6||34.1±0.5||8.0±1.0|
|DeFRCN||42.0±0.6 (+1.4)||66.7±0.8 (+2.2)||45.5±0.7 (+0.8)||48.4±0.4 (-1.0)||22.5±1.7 (+8.3)|
|DeFRCN||44.3±0.4 (+1.7)||70.2±0.5 (+3.1)||48.0±0.6 (+1.0)||49.1±0.3 (-0.5)||30.6±1.2 (+8.9)|
|DeFRCN||45.3±0.3 (+1.6)||71.5±0.4 (+3.0)||49.0±0.5 (+0.7)||49.3±0.3 (-0.5)||33.7±0.8 (+8.3)|
|DeFRCN||46.4±0.3 (+1.6)||73.1±0.3 (+3.0)||50.4±0.4 (+1.0)||49.6±0.3 (-0.5)||37.3±0.8 (+8.4)|
|DeFRCN||47.2±0.2 (+1.4)||74.0±0.3 (+2.7)||51.3±0.3 (+0.9)||49.9±0.2 (-0.5)||39.8±0.7 (+7.8)|
|Split 2||1||FSRW [kang2019few]||28.4±0.5||51.7±0.9||27.3±0.6||35.7±0.5||6.3±0.9|
|DeFRCN||40.7±0.5 (+4.0)||64.8±0.7 (+4.9)||43.8±0.6 (+4.5)||49.6±0.4 (+3.7)||14.6±1.5 (+5.6)|
|DeFRCN||42.7±0.3 (+3.7)||67.7±0.5 (+4.7)||45.7±0.5 (+3.6)||50.3±0.2 (+3.0)||20.5±1.0 (+6.4)|
|DeFRCN||43.5±0.3 (+3.4)||68.9±0.4 (+4.4)||46.6±0.4 (+3.3)||50.6±0.3 (+2.5)||22.9±1.0 (+6.9)|
|DeFRCN||44.6±0.3 (+3.7)||70.2±0.5 (+4.5)||47.8±0.4 (+3.7)||51.0±0.2 (+2.4)||25.8±0.9 (+8.0)|
|DeFRCN||45.6±0.2 (+3.3)||71.5±0.3 (+3.9)||49.0±0.3 (+3.3)||51.3±0.2 (+1.9)||29.3±0.7 (+8.5)|
|Split 3||1||FSRW [kang2019few]||27.5±0.6||50.0±1.0||26.8±0.7||34.5±0.7||6.7±1.0|
|DeFRCN||41.6±0.5 (+1.5)||66.0±0.9 (+2.5)||44.9±0.6 (+1.3)||49.4±0.4 (-0.8)||17.9±1.6 (+8.3)|
|DeFRCN||44.0±0.4 (+2.2)||69.5±0.7 (+3.9)||47.7±0.5 (+2.4)||50.2±0.2 (-0.5)||26.0±1.3 (+10.9)|
|DeFRCN||45.1±0.3 (+2.0)||70.9±0.5 (+3.4)||48.8±0.4 (+2.1)||50.5±0.2 (-0.6)||29.2±1.0 (+10.3)|
|DeFRCN||46.2±0.3 (+2.1)||72.4±0.4 (+3.3)||50.0±0.5 (+2.2)||51.0±0.2 (-0.3)||32.3±0.9 (+9.5)|
|DeFRCN||47.0±0.3 (+2.0)||73.3±0.3 (+3.0)||51.0±0.4 (+2.1)||51.3±0.2 (-0.3)||34.7±0.7 (+9.3)|
COCO. Table 7 shows the G-FSOD results on the COCO dataset for different numbers of shots. Although COCO is much more complicated than VOC, similar observations can be drawn about accuracy on both base and novel classes. Concretely, the performance on base classes is comparable to TFA, while our approach is far superior to TFA in terms of both novel and overall results. In addition, we notice that the improvement brought by our approach grows as the number of support shots increases.
|# shots||Method||Overall #80||Base #60||Novel #20|
|DeFRCN (Ours)||24.0±0.4 (-0.4)||36.9±0.6 (-2.9)||26.2±0.4 (+0.1)||30.4±0.4 (-1.5)||4.8±0.6 (+2.9)|
|DeFRCN (Ours)||25.7±0.5 (+0.8)||39.6±0.8 (-0.5)||28.0±0.5 (+1.0)||31.4±0.4 (-0.5)||8.5±0.8 (+4.6)|
|DeFRCN (Ours)||26.6±0.4 (+1.3)||41.1±0.7 (+0.7)||28.9±0.4 (+1.3)||32.1±0.3 (+0.1)||10.7±0.8 (+5.6)|
|DeFRCN (Ours)||27.8±0.3 (+1.9)||43.0±0.6 (+1.8)||30.2±0.3 (+1.8)||32.6±0.3 (+0.3)||13.6±0.7 (+6.6)|
|DeFRCN (Ours)||29.7±0.2 (+3.1)||46.0±0.5 (+3.8)||32.1±0.2 (+3.1)||34.0±0.2 (+1.6)||16.8±0.6 (+7.7)|
|DeFRCN (Ours)||31.4±0.1 (+2.7)||48.8±0.2 (+4.1)||33.9±0.1 (+2.4)||34.8±0.1 (+0.6)||21.2±0.4 (+9.1)|
Appendix B Additional Analysis on PCB
B.1 Boost Other Approaches with PCB
As a plug-and-play module, the proposed PCB can easily be attached to other architectures to build stronger few-shot detectors. Here, we verify this claim by introducing PCB into previous approaches, including FRCN-ft [yan2019meta], TFA [wang2020frustratingly] and MPSR [wu2020multi]; all experimental results on the COCO dataset are shown in Table 8. Regardless of the method or the number of shots, we observe that using PCB consistently achieves higher performance (+1.0 to +3.0 points) on novel classes, which demonstrates the effectiveness and flexibility of our PCB module.
|Method||w / PCB||1||2||3||5||10||30|
|FRCN-ft [yan2019meta]||✓||2.4 (+1.4)||4.1 (+2.3)||5.2 (+2.4)||6.6 (+2.6)||9.9 (+3.0)||14.0 (+3.0)|
|TFA [wang2020frustratingly]||✓||6.7 (+2.3)||7.6 (+2.2)||9.0 (+3.0)||10.4 (+2.7)||11.8 (+2.8)||15.5 (+2.1)|
|MPSR [wu2020multi]||✓||6.7 (+1.6)||8.9 (+2.2)||9.7 (+2.3)||10.9 (+2.2)||11.9 (+2.1)||15.5 (+1.0)|
|DeFRCN (Ours)||✓||9.3 (+1.4)||12.9 (+2.0)||14.8 (+1.4)||16.1 (+1.5)||18.5 (+1.6)||22.6 (+1.6)|
B.2 Employ Other Pre-trained Models
In the main paper, we utilize the standard ImageNet pre-trained model (IN-1K), which is widely adopted in most few-shot object detection frameworks, to initialize both Faster R-CNN and PCB. Since the core module of PCB is the generalizable feature extractor, which determines the final performance of the score calibration, we further explore other pre-trained models (see Table 9) in this section. SwAV [caron2020unsupervised] is an efficient method for pre-training without using annotations, i.e., self-supervised learning. IN-SwAV indicates that the model is pre-trained by SwAV on ImageNet. IG-WSL [wslimageseccv2018] employs the ResNeXt [xie2017aggregated] architecture and is pre-trained on a much larger social media image dataset (Instagram) with a weakly-supervised learning paradigm. Table 10 shows the performance on VOC when using the above three pre-trained models. No matter which one is used, the final performance is better with PCB. Moreover, we notice that a stronger pre-trained model yields a larger improvement in FSOD performance.
|Method||Backbone||Paradigm||# Images||# Classes|
|Novel Set 1||Novel Set 2||Novel Set 3|
B.3 Why PCB Works?
PCB can be reinterpreted as a non-parametric few-shot classification model that draws on the idea of Prototypical Networks [Snell2017]. Based on the COCO 10-shot task, we calculate the channel-wise cosine similarity between different few-shot RoI prototypes and the feature map of the test image, and then visualize the similarity map in Fig. 5. We find that the prototypes from different categories do activate distinct areas of the feature map, which indicates that the metric-based pairwise score is very effective in data-scarce scenarios. In addition, we notice that even if the category label of a novel prototype has never been seen by the pre-trained classification model, a good similarity map can still be obtained; e.g., the novel label 'Person' does not exist in the ImageNet-1K synsets (see the first three rows in Fig. 5). Moreover, the results of IN-SwAV (i.e., the self-supervised paradigm) in Table 10 further support this argument. Based on the visualization and the above analysis, we believe it is reasonable for PCB to use the pairwise score from a classification model to calibrate the softmax score of the original classification branch of the few-shot detector.
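The similarity map and score calibration discussed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the actual implementation: features are small hand-written vectors rather than CNN outputs, the helper names (`class_prototype`, `similarity_map`, `calibrate`) are hypothetical, and the weighted-average fusion with weight `alpha` is an assumed form of the calibration step.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def class_prototype(roi_features):
    """Average the few-shot RoI feature vectors of one class."""
    n = len(roi_features)
    return [sum(channel) / n for channel in zip(*roi_features)]


def similarity_map(prototype, feature_map):
    """feature_map[y][x] is a C-dim vector; returns an H x W similarity map."""
    return [[cosine(prototype, vec) for vec in row] for row in feature_map]


def calibrate(det_score, cos_score, alpha=0.5):
    """Blend the detector's softmax score with the pairwise cosine score.

    The weighted average and the value of `alpha` are assumptions made
    for this sketch.
    """
    return alpha * cos_score + (1.0 - alpha) * det_score


# Two support RoIs of one class, and a 1 x 2 "feature map" of a test image.
proto = class_prototype([[1.0, 0.0], [0.8, 0.2]])
fmap = [[[1.0, 0.0], [0.0, 1.0]]]
sim = similarity_map(proto, fmap)

# The location that resembles the prototype gets a high similarity, which
# can then raise a low softmax score from the detector.
fused = calibrate(det_score=0.2, cos_score=sim[0][0], alpha=0.5)
```

Because the prototype is only an average of support features and the comparison is a cosine similarity, no additional parameters are learned, which matches the non-parametric interpretation given above.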
Appendix C Related Extensions of GDL
C.1 Conventional Cross-Domain Object Detection
In the experimental section of the main paper, we have verified that the proposed GDL is not only remarkably effective for few-shot object detection (i.e., FSOD, G-FSOD and cross-domain FSOD), but also plays a positive role in conventional object detection. In this section, we further explore conventional cross-domain object detection; all experimental results are shown in Table 11. We use Cityscapes [cordts2016cityscapes] and FoggyCityscapes [sakaridis2018semantic] (Normal-to-Foggy) as our benchmarks and follow the same evaluation protocol as in [zheng2020cross]. Comparing the second and third rows of Table 11, we find that adding GDL achieves 32.8% mAP on the weather transfer task, which is 2.8% higher than the plain Faster R-CNN.
|+ GDL||ResNet-101||32.9 (+1.4)||38.4 (-0.9)||47.3 (+2.1)||26.6 (+1.9)||34.3 (-1.0)||41.4 (+0.2)||17.3 (+8.5)||24.3 (+7.6)||32.8 (+2.8)|
C.2 The value range of λ
We divide the value range of λ into three situations.
λ_rpn ∈ [0, 1] and λ_rcnn ∈ [0, 1]. This setting has been explored in our paper and achieves the best results.
λ_rpn < 0 or λ_rcnn < 0. λ < 0 means that the downstream module has a negative effect on the optimization direction of the backbone. Without any adversarial strategy, this setup is meaningless for object detection.
λ_rpn > 1 or λ_rcnn > 1. λ > 1 means that the gradient from the downstream module magnifies its effect on the backbone. We notice that slightly increasing λ (between 1 and about 5) does not affect the stability of the detector but causes performance degradation, because the backbone is updated faster than before and over-fits. When λ is relatively large (above about 5), the coupling between the module and the backbone is over-emphasized, and the model usually converges to an unreasonable saddle point and collapses. The threshold of 5 is an approximate value obtained from experiments.
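The three situations above can be made concrete with a toy sketch of gradient scaling. It assumes that the layer acts as the identity in the forward pass and multiplies the incoming gradient by a scalar coefficient (called `lam` here) in the backward pass; any learnable transformation inside GDL is omitted. The class `GradientScale`, the function `backbone_step`, and the one-parameter "backbone" are hypothetical and serve only to show how the coefficient changes the backbone update.

```python
class GradientScale:
    """Identity in the forward pass; scales the gradient by `lam` on the way back."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_out):
        return self.lam * grad_out


def backbone_step(w, x, target, lam, lr=0.1):
    """One SGD step on a one-parameter 'backbone' whose feature is w * x.

    The downstream head has loss 0.5 * (feature - target) ** 2, and its
    gradient reaches the backbone only through the scaling layer.
    """
    layer = GradientScale(lam)
    feat = layer.forward(w * x)
    grad_feat = feat - target               # dLoss / dfeat from the head
    grad_w = layer.backward(grad_feat) * x  # chain rule into the backbone
    return w - lr * grad_w


# Same starting point, different coefficients.
for lam in (0.0, 0.5, 1.0, -1.0, 5.0):
    print(lam, backbone_step(w=1.0, x=1.0, target=0.0, lam=lam))
```

With a coefficient of 0 the backbone is not updated by this head at all, with 1 the update is the ordinary one, a negative value pushes the backbone away from the head's objective, and a large value takes a much bigger step, which mirrors the qualitative behaviour described in the three cases.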
Appendix D More Visualization of Our Approach
We provide qualitative visualizations of the detected novel objects on the COCO dataset in Fig. 6. For each image, we show both success cases (green boxes) and failure cases (red boxes) to help analyze the possible error types, including misclassified novel objects, mislocalized objects and missed detections.