MINI: Mining Implicit Novel Instances for Few-Shot Object Detection

Learning from a few training samples is a desirable ability of an object detector, inspiring the explorations of Few-Shot Object Detection (FSOD). Most existing approaches employ a pretrain-transfer paradigm. The model is first pre-trained on base classes with abundant data and then transferred to novel classes with a few annotated samples. Despite the substantial progress, the FSOD performance is still far behind satisfactory. During pre-training, due to the co-occurrence between base and novel classes, the model is learned to treat the co-occurred novel classes as backgrounds. During transferring, given scarce samples of novel classes, the model suffers from learning discriminative features to distinguish novel instances from backgrounds and base classes. To overcome the obstacles, we propose a novel framework, Mining Implicit Novel Instances (MINI), to mine the implicit novel instances as auxiliary training samples, which widely exist in abundant base data but are not annotated. MINI comprises an offline mining mechanism and an online mining mechanism. The offline mining mechanism leverages a self-supervised discriminative model to collaboratively mine implicit novel instances with a trained FSOD network. Taking the mined novel instances as auxiliary training samples, the online mining mechanism takes a teacher-student framework to simultaneously update the FSOD network and the mined implicit novel instances on the fly. Extensive experiments on PASCAL VOC and MS-COCO datasets show MINI achieves new state-of-the-art performance on any shot and split. The significant performance improvements demonstrate the superiority of our method.


1 Introduction

Figure 1: (a) Left figure demonstrates annotated instances of base classes (in green) and implicit instances of novel classes (in red) in FSOD datasets, where co-occurrence, e.g., a “cat” lying on a “sofa” or a “person” riding a “motor”, widely exists. (b) Right figure compares the performance of different FSOD methods on the PASCAL VOC dataset. TFA [42], pre-trained on base classes, learns to treat implicit novel instances as backgrounds, resulting in unsatisfactory performance. Simply applying TFA to mine the implicit novel instances (TFA Mining) and re-training a detector with them is a straightforward solution. However, its performance is limited by the inaccurate initial detection results of TFA on low shots. The proposed MINI mines these instances better and significantly boosts performance.

Object detection aims to classify and localize objects, and has seen remarkable progress in recent years [36, 16, 3]. However, the strong performance heavily relies on a large number of labeled training samples, which require extensive annotation for each object and are expensive to acquire. In contrast, humans can recognize novel classes with the aid of only a few annotated samples, a desirable ability for object detectors. This has invoked great interest in Few-Shot Object Detection (FSOD), which aims to train an object detector for novel classes with the help of abundant data on base classes and a few annotated samples on novel classes.

Current FSOD methods mostly follow a pretrain-transfer paradigm. Specifically, the object detector is first pre-trained on the base classes with abundant data to attain general representation ability, and the pre-trained model is then transferred with a few training samples to detect novel classes. Due to the limited number of novel samples, most parameters of the FSOD model are frozen during transfer to preserve the pre-trained knowledge and prevent overfitting. Although various methods have been proposed following this paradigm, including meta-learning [48, 45, 11], metric learning [21, 26], and fine-tuning [42, 4, 39], their performance is still far from satisfactory on benchmark datasets.

In this paper, we reveal that the performance of current FSOD methods is heavily hindered by two factors. First, the scarce novel samples fail to provide sufficient diversity of novel classes, so FSOD models tend to overfit to these few samples. Second, due to the co-occurrence between base and novel classes on benchmark datasets, the object detector pre-trained on base classes learns to treat the co-occurring novel instances as backgrounds. This classification bias is hard to eliminate during transfer, since most parameters of the pre-trained model are frozen.

Motivated by these observations, this paper proposes to tackle the problem by mining the implicit novel instances, which widely exist in the abundant base-class data but are not annotated in FSOD datasets, as in Fig. 1(a). By discovering these implicit novel instances and taking them as extra training samples, we can optimize all parameters of the FSOD model and address both obstacles at once. On the one hand, the enriched training samples of novel classes enhance the representation ability to discriminate novel classes from other classes. On the other hand, they effectively mitigate the classification confusion between backgrounds and novel instances.

To achieve this goal, a straightforward solution is to directly adopt an FSOD model to discover these implicit novel instances. However, this simple design heavily relies on the initial performance of the FSOD model, leading to unsatisfactory results, especially in low-shot scenarios. Moreover, it lacks a mechanism to upgrade the discovered novel instances as the FSOD model improves, which hinders further performance gains.

To address these drawbacks, this paper proposes a framework called Mining Implicit Novel Instances (MINI), which mines the implicit novel instances with an offline mining mechanism and an online mining mechanism. Specifically, an FSOD model, e.g., TFA [42], is first trained to discover initial implicit novel instances, as in Fig. 1(b). The offline mining mechanism leverages a self-supervised discriminative model to calibrate the classification confidences of these discovered novel instances. During training, taking the offline mined implicit novel instances as auxiliary training samples, the online mining mechanism adopts a teacher-student framework to simultaneously update the parameters of the FSOD network and the mined implicit novel instances on the fly.

We conduct extensive experiments on the PASCAL VOC [10] and MS COCO [30] benchmarks, and achieve new SOTA performance in all settings. Concretely, we improve the current SOTA performance (novel AP50) by 18.4, 16.7, 10.9, 10.6, 12.8 and 19.3, 15.5, 15.3, 8.8, 13.5 and 16.6, 15.6, 11.7, 11.9, 10.8 for K = 1, 2, 3, 5, 10 shots on novel splits 1, 2 and 3, respectively. Even on the challenging COCO split, we improve the best reported performance (novel mAP) by 3.3 and 4.7 for K = 10 and 30 shots, respectively. The significant performance gains demonstrate the effectiveness of the proposed MINI.

2 Related Work

2.1 Few-Shot Object Detection

Few-Shot Object Detection (FSOD) aims to detect novel concepts given abundant base data and limited novel data. One main line of FSOD methods is meta-learning based approaches [21, 20, 48, 45, 11, 26, 49, 18]. FSRW [20] and Meta R-CNN [48] introduce feature re-weighting to one-stage and two-stage detection methods, respectively. MetaDet [43] disentangles the learning of category-specific and category-agnostic components. FSIW [45] improves FSRW [20] with a more complex feature aggregation module and unifies few-shot object detection and viewpoint estimation. Another line is fine-tuning based approaches [42, 39, 27, 12, 4, 35]. TFA [42] first introduces a simple base-training and few-shot fine-tuning paradigm. FSCE [39] improves the TFA baseline by fine-tuning more layers and brings batch contrastive learning to FSOD. FADI [4] divides the fine-tuning stage into association and discrimination to promote the discriminative power of the classifier. DeFRCN [35] devises GDL and PCB to alleviate the potential contradictions of Faster R-CNN [36] in FSOD.

2.2 Semi-Supervised Object Detection

Semi-Supervised Object Detection (SSOD) aims to train a detector with limited labeled data and abundant unlabeled data. There are two lines of methods: consistency methods [19, 40] and pseudo-labeling methods [38, 41, 32, 51, 47]. CSD [19] enforces a consistency loss between the original image and its horizontally flipped counterpart. STAC [38] proposes a simple pseudo-labeling framework that trains the model with highly confident pseudo labels from the unlabeled dataset under strong augmentations. Unbiased Teacher [32] identifies the bias in pseudo labels caused by over-fitting and class imbalance, and introduces EMA and Focal Loss [29] to resolve it; many subsequent variants follow [41, 51, 47]. Our work shares similar ideas with pseudo-labeling methods, but it is not feasible to directly apply SSOD methods: due to the severe data scarcity and extreme class imbalance, the poorly-learned teacher model cannot reliably discover potential novel instances. Moreover, SSOD methods usually rely on a heuristic confidence threshold that fails to indicate the quality of novel instances in the FSOD scenario. Hence we propose MINI to better tackle these issues.

2.3 Self-Supervised Learning

Self-supervised learning (SSL), also named representation learning, aims to learn general visual representations for downstream tasks. Early works rely on ad-hoc heuristics to design pretext tasks [8, 50, 34, 24], which limits the generality of the learned representations. Recent approaches can be categorized as discriminative [9, 1, 15, 6, 13] or generative [23, 46, 2, 14]. Contrastive methods [9, 1, 15, 6] are representative of discriminative methods: they enforce a consistency loss between different views of the same image by contrasting positive pairs against negative pairs, and have shown promising results recently. We notice that the visual representations learned by SSL pre-training have strong discriminative power, and we exploit this for better instance mining.

3 Our Approach

In this section, we first revisit the problem setting of conventional few-shot object detection and discuss the limitations of the widely adopted pretrain-transfer paradigm. Then we elaborate on our Mining Implicit Novel Instances (MINI) framework.

Figure 2: Method overview. MINI mines implicit novel instances with an offline mining mechanism and an online mining mechanism. The pipeline of MINI is as follows: 1) An FSOD detector is used to discover initial implicit novel instances. The offline mining mechanism leverages a self-supervised discriminative model to calibrate the classification confidences of these discovered novel instances. 2) In the online mining mechanism, the teacher model discovers implicit novel instances in each iteration during training. 3) The offline and online discovered novel instances are combined with an adaptive mingling design. 4) The student model takes the implicit novel instances as ground truths and updates the parameters of the teacher model via EMA.

3.1 Revisiting of Few-Shot Object Detection (FSOD)

In conventional few-shot object detection (FSOD), there exist two non-overlapping training sets: a base dataset with exhaustively annotated instances for each base class, and a novel dataset with K annotated instances for each novel class, where each sample consists of an input image and its ground truth. The ultimate goal of FSOD is to optimize a robust detector to detect objects in a test set that comprises both base and novel classes.

To leverage the abundant base dataset, most FSOD works follow a pretrain-transfer paradigm, where the model is first pre-trained on the base dataset to attain general representation ability, and then transferred to the novel classes with the few-shot novel samples. Due to the limited number of novel samples, most parameters of the FSOD model are frozen during transfer to preserve the pre-trained knowledge and prevent over-fitting.

Despite the substantial progress in the FSOD area, due to the co-occurrence between base and novel classes in the base dataset, the model learns to treat the co-occurring novel instances as backgrounds. During transfer, given only K novel samples, this classification bias is hard to eliminate since most parameters of the pre-trained model are frozen.

To overcome these obstacles, we propose Mining Implicit Novel Instances (MINI) to mine implicit novel instances with an offline mining mechanism and an online mining mechanism. As shown in Fig. 2, we first train an FSOD detector to discover initial implicit novel instances (Sec. 3.2). The offline mining mechanism leverages a self-supervised discriminative model to calibrate the classification confidences of these discovered novel instances (Sec. 3.3). During training, the online mining mechanism adopts a teacher-student framework to simultaneously update the FSOD network and the mined implicit novel instances on the fly (Sec. 3.4). Specifically, the teacher model discovers implicit novel instances in each iteration during training. The offline and online discovered novel instances are combined with an adaptive mingling design. The student model takes the updated implicit novel instances as ground truths and updates the parameters of the teacher model via Exponential Moving Average (EMA) [22, 15, 32].

3.2 FSOD as Initial Miner

In this section, we aim to obtain an object detector that has some basic ability to recognize novel classes. The initial FSOD network can be readily instantiated with different FSOD algorithms. For simplicity, we adopt the widely used TFA [42] in this work, which divides the whole training pipeline into two independent stages as follows,

Base Model Training Stage

In the first base training stage, the whole model, including the box predictors, i.e., the classifier and regressor, and the feature extractor, i.e., the rest of the network, is jointly trained on the base dataset with abundant annotations of base classes. To this end, the base model learns a general feature representation and is ready to be transferred to novel classes.

Few-Shot Fine-tuning Stage

In the second few-shot fine-tuning stage, only the box predictor is fine-tuned on a small balanced training set that comprises both base and novel classes. The feature extractor is frozen to preserve the pre-trained general knowledge and prevent potential over-fitting on the scarce novel set.
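As a sketch, the two-stage TFA recipe amounts to switching which parameter groups are optimized in each stage; the group names below are our illustrative choices, not TFA's actual module names.

```python
def trainable_parameters(stage, params):
    """Select which parameter groups are optimized in each TFA stage.

    `params` maps illustrative group names ('feature_extractor',
    'box_predictor') to lists of parameters.
    """
    if stage == "base_training":
        # Base training: the whole model is jointly optimized.
        return params["feature_extractor"] + params["box_predictor"]
    if stage == "few_shot_finetuning":
        # Fine-tuning: the feature extractor is frozen; only the box
        # predictor (classifier and regressor) is updated.
        return params["box_predictor"]
    raise ValueError(f"unknown stage: {stage}")
```

Keeping the selection in one place makes the freezing policy explicit rather than scattered across optimizer setup code.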

3.3 Offline Mining Mechanism

After initializing an FSOD model that can detect novel categories, in this section we aim to discover implicit novel instances from the base dataset with the trained detector in an offline manner. Specifically, we perform inference of the detector over each base image. The mining process can be formulated as follows,

{(s_i, b_i, c_i)}_{i=1}^{N} = NMS(R-CNN(RPN(I)))    (1)

The RPN first predicts a set of proposals, the R-CNN classifies and regresses each proposal, and post-processing procedures, e.g., NMS, are applied to yield the inference results, where s_i, b_i and c_i denote the predicted score, bounding box and label of the i-th candidate instance on image I, respectively. A fixed high confidence threshold σ, e.g., 0.9, is set to filter boxes of low quality. The remaining instances with s_i ≥ σ are added to the offline novel instance pool.
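The fixed-threshold filtering described above can be sketched as follows, representing each candidate instance as a (score, box, label) tuple; the function name and data layout are our illustrative choices.

```python
def filter_by_confidence(detections, sigma=0.9):
    """Keep only candidate instances whose predicted score passes the
    fixed high confidence threshold (0.9 in the text).

    `detections` is a list of (score, box, label) tuples.
    """
    return [det for det in detections if det[0] >= sigma]
```

The same helper with a lower threshold would keep far more (and noisier) candidates, which motivates the adaptive thresholding introduced next.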

Although the fixed-threshold method has achieved remarkable success in semi-supervised object detection (SSOD) [38, 32], it is not sufficient in the FSOD scenario. The severe data scarcity and extreme class imbalance make the predicted novel scores exhibit a large variance and tend to be generally low, so a fixed high confidence threshold fails to handle different novel classes. On the other hand, the pervasive misclassification of the novel classifier results in massive false positives. To address these drawbacks, we introduce co-mining with a self-supervised discriminator to promote the discriminative ability of the classifier, and propose adaptive thresholding to find a proper threshold for each novel class.

Figure 3: Pipeline of the offline mining mechanism. The SSL model first extracts class prototypes from the few-shot novel samples. The FSOD detector performs inference on the base dataset, and the SSL model calibrates its scores by computing cosine similarities between the class prototypes and the box features. Adaptive thresholding then computes class-wise statistics from the calibrated boxes to determine a proper threshold that filters out mined instances of low quality.

Co-Mining with Self-Supervised Discriminator

Given only K training samples per novel class, it is challenging to acquire a discriminative classifier. Inspired by the latest developments in self-supervised learning (SSL) [7], whose visual representations carry strong discriminative power, we propose a novel co-mining scheme that leverages an SSL model to collaboratively mine implicit novel instances with the FSOD detector.

As shown in Fig. 3, given the K annotated instances for each novel class c, we first forward the image through the SSL model F, then employ RoIAlign [16] to extract the area bounded by each ground-truth box as follows,

f_i = RoIAlign(F(x_i), b_i),    p_c = (1/K) Σ_{i=1}^{K} f_i    (2)

where f_i denotes the feature embedding of the i-th instance, and the class prototype p_c is the mean feature over all K instances. During inference in Eq. 1, the RPN first predicts a set of proposals. We then compute the feature embedding f_j of each proposal similarly to Eq. 2, and a cosine similarity score is computed with the class prototype p_c for each novel class c,

s_{j,c}^{cos} = cos(f_j, p_c) / τ    (3)

where τ is the temperature factor. We concatenate the cosine similarities of all novel classes and apply the calibration as follows,

s̃_{j,c} = ( s_{j,c} · softmax([s_{j,1}^{cos} ∥ ⋯ ∥ s_{j,C_n}^{cos}])_c )^{1/2}    (4)

where ∥ denotes the concatenation operation and C_n is the number of novel classes. Note that we only apply the calibration to the novel-class part of the predictions. All the calibrated inference results are then collected for filtering in the next step.
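A minimal sketch of the prototype construction and cosine scoring (cf. Eqs. 2 and 3), assuming the RoIAlign step has already produced feature embeddings as plain Python lists; the temperature value and helper names are illustrative assumptions.

```python
import math

def l2_normalize(v):
    # Normalize a feature vector to unit length (zero vectors pass through).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def class_prototype(embeddings):
    # Mean feature over the K annotated instances of one novel class.
    k, dim = len(embeddings), len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / k for d in range(dim)]
    return l2_normalize(mean)

def cosine_score(feat, prototype, tau=0.1):
    # Temperature-scaled cosine similarity between a proposal feature
    # and a class prototype, used to calibrate the novel-class score.
    f = l2_normalize(feat)
    return sum(a * b for a, b in zip(f, prototype)) / tau
```

In practice the embeddings would come from the SSL backbone plus RoIAlign; here they are stand-in vectors so the arithmetic is easy to check.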

Adaptive Thresholding

To filter candidate instances of low quality, we propose an adaptive thresholding scheme that determines a proper threshold according to the class-wise score distributions. As shown in Fig. 3, for each novel class c, we first extract its candidate instance set. The mean μ_c and standard deviation δ_c are then computed over the calibrated classification scores. We compute the final confidence threshold σ_c and filter low-quality predictions as follows,

σ_c = μ_c + α · δ_c    (5)

where α is a coefficient that controls the magnitude of the deviation offset and thus the number of kept instances. Note that we further clamp the maximum number of kept instances per class. Intuitively, the score mean μ_c measures the transfer hardness of novel class c, and δ_c indicates the compactness of the intra-class score distribution. The threshold σ_c leverages both to adaptively distinguish reliable implicit novel instances without introducing extra computational cost.
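The adaptive threshold can be sketched as below, assuming the mean-plus-scaled-deviation form described above; the coefficient value, the cap, and the use of the population standard deviation are illustrative assumptions.

```python
import statistics

def adaptive_threshold(scores, alpha=0.5, max_keep=100):
    """Per-class adaptive threshold: mean of the calibrated scores plus
    alpha times their standard deviation.  Scores passing the threshold
    are kept, capped at max_keep highest-scoring instances."""
    mu = statistics.fmean(scores)
    dev = statistics.pstdev(scores)
    sigma_c = mu + alpha * dev
    kept = sorted((s for s in scores if s >= sigma_c), reverse=True)[:max_keep]
    return sigma_c, kept
```

Because the threshold is derived from each class's own score distribution, an easy class with generally high scores gets a stricter cutoff than a hard class whose scores are uniformly low.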

3.4 Online Mining Mechanism

With the offline mined novel instances, we are ready to re-train a new detector with satisfactory performance. However, these instances come from a static offline teacher of limited precision and cannot be updated as the model improves, which hinders further performance gains. Hence we introduce an online mining mechanism to update them on the fly. Specifically, we adopt a teacher-student learning paradigm as shown in Fig. 2. During training, the student is supervised by the mined novel instances. The teacher shares the same network architecture with the student, and its parameters are updated as an exponential moving average (EMA) of the student's parameters. The slowly updated teacher can be considered a temporal ensemble of the student at different iterations, and hence detects implicit novel instances more accurately.
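The EMA update of the teacher can be sketched as follows, with parameters represented as a plain name-to-value mapping; the momentum value is a typical choice, not one given in the text.

```python
def ema_update(teacher, student, momentum=0.999):
    # teacher <- momentum * teacher + (1 - momentum) * student,
    # applied parameter-wise; a high momentum makes the teacher a
    # slowly-moving temporal ensemble of past student states.
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}
```

In a real detector each value would be a weight tensor rather than a scalar, but the update rule is identical.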

After mining implicit novel instances with the teacher model in each iteration on the fly, the next question is how to update the offline mined novel instances with the online ones. It is noteworthy that the poorly-learned teacher fails to discover valuable novel instances at the beginning of training. Thus, we devise a concise adaptive mingling scheme, where the offline and online mined instances are adaptively balanced as training proceeds. We further introduce an IoU branching mechanism to improve the quality of the online mined novel instances.

Adaptive Mingling

During training, given a training image, the teacher first mines novel instances online with a procedure similar to Eq. 1, and we mingle the online mined novel instances G_on with the offline mined novel instances G_off as follows,

G̃ = NMS(G_off ∥ G_on)    (6)

where ∥ denotes concatenation. We additionally concatenate the offline mined boxes with the RPN proposals predicted by the teacher model. We argue these two concatenations play an important role in two aspects. 1) At the beginning of training, due to the poorly-learned RPN and R-CNN, the high confidence threshold filters out almost all novel instances, so that G_on degrades to an empty set; hence only G_off remains to provide a training signal to warm up the student. 2) As training proceeds, the online teacher becomes more and more discriminative. By presenting the offline boxes as extra proposals, the teacher can calibrate some misclassifications in G_off; moreover, the teacher can also discover instances missed in G_off. The mingled instances G̃ serve as ground truths of the novel classes during the training of the student model.
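The concatenate-then-NMS step of adaptive mingling might look like the sketch below, with mined instances as (score, box) pairs; the greedy class-agnostic NMS and the IoU threshold of 0.5 are standard choices assumed here, not details given in the text.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def mingle(offline, online, iou_thr=0.5):
    """Concatenate offline and online mined instances (as (score, box)
    pairs), then apply greedy NMS so overlapping duplicates are merged
    in favor of the higher-scoring one."""
    cand = sorted(offline + online, key=lambda d: d[0], reverse=True)
    kept = []
    for det in cand:
        if all(iou(det[1], k[1]) < iou_thr for k in kept):
            kept.append(det)
    return kept
```

Early in training the online list is empty and the offline pool passes through unchanged; later, higher-scoring online instances naturally displace overlapping offline ones.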

IoU Branching Correction

To further improve the quality of online mined novel instances, we notice that a model trained in the low-data regime cannot reliably recognize precisely-localized boxes; hence we introduce an IoU branching mechanism to mine high-quality novel instances. Specifically, we add an extra IoU branch parallel to the original R-CNN head that learns to predict the IoU between predicted boxes and ground truths. Its structure is the same as the original R-CNN branch, i.e., two fully-connected (FC) layers followed by an IoU predictor (a single FC layer). During mining, we combine the classification scores with the IoU scores in Eq. 1 as follows,

s̃_i = s_i · u_i    (7)

where u_i denotes the predicted IoU score of the i-th proposal. A standard MSE loss is adopted to optimize the IoU branch. All modules of the R-CNN head are jointly optimized by the following loss in an end-to-end manner:

L = L_cls + L_reg + λ_iou · L_iou    (8)

where λ_iou denotes the loss weight of the IoU branch.
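A sketch of how the IoU branch could enter mining and training; the multiplicative score combination and the default loss weight are our assumptions, since the text gives the forms only symbolically.

```python
def mining_score(cls_score, iou_score):
    # Combine classification confidence with the predicted IoU so that
    # poorly-localized boxes are down-weighted during mining.
    return cls_score * iou_score

def rcnn_loss(loss_cls, loss_reg, loss_iou, iou_weight=1.0):
    # Joint R-CNN objective: classification and regression losses plus
    # a weighted MSE term for the IoU branch.
    return loss_cls + loss_reg + iou_weight * loss_iou
```

A confidently classified but badly localized box (high cls score, low predicted IoU) thus scores lower than a box that is good on both counts.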

4 Experiments

In this section, we first outline the datasets and benchmark protocols in Sec. 4.1 and the implementation details of our method in Sec. 4.2. Then, we compare our approach with the latest FSOD and SSOD methods in Sec. 4.3. Finally, we conduct an extensive ablation study of the different components in Sec. 4.4.

4.1 Datasets and Evaluation Protocols

We follow the same data split construction and evaluation protocols used in  [42] for fair comparisons. All experiments are evaluated on both PASCAL VOC [10] and MS COCO [30] datasets.

PASCAL VOC

has 20 classes, which are randomly split into 15 base classes and 5 novel classes. There are 3 different class splits, which we refer to as Novel Splits 1, 2 and 3, respectively. For each split, base instances are exhaustively annotated, but only K instances are annotated for each novel class. All instances are sampled from the union of the VOC07 and VOC12 train/val sets for training, and the model is tested on the VOC07 test set. The standard PASCAL VOC metric, i.e., Average Precision (IoU=0.5) for novel classes (nAP50), is reported.

MS COCO

has 80 classes; the 20 classes that overlap with PASCAL VOC are regarded as novel classes, and the remaining 60 classes are considered base classes. We evaluate our method for K = 10 and 30 shots. The standard COCO-style metric is adopted, which averages the mAP over IoU thresholds from 0.5 to 0.95 with an interval of 0.05. We also report nAP50 and nAP75, respectively.
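The COCO-style averaging over IoU thresholds can be expressed as a one-liner; the function name is ours.

```python
def coco_style_map(ap_at_iou):
    """Average AP over the 10 IoU thresholds 0.50, 0.55, ..., 0.95
    (the COCO-style metric described in the text)."""
    thresholds = [0.50 + 0.05 * i for i in range(10)]
    assert len(ap_at_iou) == len(thresholds), "one AP value per IoU threshold"
    return sum(ap_at_iou) / len(thresholds)
```

nAP50 and nAP75 are simply the entries of this list at IoU 0.50 and 0.75 rather than the average.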

4.2 Implementation Details

We implement our method based on MMDetection [5] and MMFewShot [33]. We employ the Faster R-CNN [36] with Feature Pyramid Network [28] and ResNet-101 [17] as base model. Please refer to the Appendix for the detailed settings.

4.3 Main Results

Method/Shot                    Novel Split 1                  Novel Split 2                  Novel Split 3
                          1     2     3     5    10      1     2     3     5    10      1     2     3     5    10
FSRW [20] ICCV 19       14.8  15.5  26.7  33.9  47.2   15.7  15.3  22.7  30.1  40.5   21.3  25.6  28.4  42.8  45.9
MetaDet [43] ICCV 19    18.9  20.6  30.2  36.8  49.6   21.8  23.1  27.8  31.7  43.0   20.6  23.9  29.4  43.9  44.1
Meta R-CNN [48] ICCV 19 19.9  25.5  35.0  45.7  51.5   10.4  19.4  29.6  34.8  45.4   14.3  18.2  27.5  41.2  48.1
TFA w/ cos [42] ICML 20 39.8  36.1  44.7  55.7  56.0   23.5  26.9  34.1  35.1  39.1   30.8  34.8  42.8  49.5  49.8
MPSR [44] ECCV 20       41.7   -    51.4  55.2  61.8   24.4   -    39.2  39.9  47.8   35.6   -    42.3  48.0  49.7
FSCE [39] CVPR 21       44.2  43.8  51.4  61.9  63.4   27.3  29.5  43.5  44.2  50.2   37.2  41.9  47.5  54.6  58.5
SRR-FSD [52] CVPR 21    47.8  50.5  51.3  55.2  56.8   32.5  35.3  39.1  40.8  43.8   40.1  41.5  44.3  46.9  46.4
CME [26] CVPR 21        41.5  47.5  50.4  58.2  60.9   27.2  30.2  41.4  42.5  46.8   34.3  39.6  45.1  48.3  51.5
TIP [25] CVPR 21        27.7  36.5  43.3  50.2  59.6   22.7  30.1  33.8  40.9  46.9   21.7  30.6  38.1  44.5  50.9
FADI [4] NeurIPS 21     50.3  54.8  54.2  59.3  63.2   30.6  35.0  40.3  42.8  48.0   45.7  49.7  49.1  55.0  59.6
DeFRCN [35] ICCV 21     53.6  57.5  61.5  64.1  60.8   30.1  38.1  47.0  53.3  47.9   48.4  50.9  52.3  54.9  57.4
MINI (Ours)             72.0  74.2  72.4  74.7  76.2   51.8  53.6  62.3  62.1  63.7   65.0  66.5  64.0  66.9  70.4

Table 1: Performance (novel AP50) across three splits on PASCAL VOC dataset. Red/Blue denote best and second-best results, respectively

Method                  nAP           nAP50         nAP75
                      10    30      10    30      10    30
FSRW [20]            5.6   9.1    12.3  19.0     4.6   7.6
MetaDet [43]         7.1  11.3    14.6  21.7     6.1   8.1
Meta R-CNN [48]      8.7  12.4    19.1  25.3     6.6  10.8
TFA w/ cos [42]     10.0  13.7    19.1  24.9     9.3  13.4
MPSR [44]            9.8  14.1    17.9  25.4     9.7  14.2
FSCE [39]           11.9  16.4     -     -      10.5  16.2
SRR-FSD [52]        11.3  14.7    23.0  29.2     9.8  13.5
CME [26]            15.1  16.9    24.6  28.0    16.4  17.8
TIP [25]            16.3  18.3    33.2  35.9    14.1  16.9
FADI [4]            12.2  16.1    22.7  29.1    11.9  15.8
DeFRCN [35]         18.5  22.6     -     -       -     -
MINI (Ours)         21.8  27.3    38.0  44.9    21.5  28.5

Table 2: Performance on MS COCO dataset. Red/Blue denote best and second-best results, respectively

Method              nAP50
                  1     2     3     5    10
STAC, σ=0.5     38.8  59.2  60.2  64.8  66.4
STAC, σ=0.7     19.3  48.5  57.1  65.2  66.9
STAC, σ=0.9      0.0   7.1  16.6  22.6  57.5
UB-T, σ=0.5      3.9  13.2  15.1  11.7   6.2
UB-T, σ=0.7      1.8   3.6  22.3  52.0  60.6
UB-T, σ=0.9      0.0   3.6   8.9  25.4  47.6
MINI (Ours)     72.0  74.2  72.4  74.7  76.2

Table 3: Performance comparison with SSOD methods on PASCAL VOC dataset

Comparison with FSOD Methods

Table 1 presents performance comparisons between our method and the latest FSOD methods across three novel splits on the PASCAL VOC benchmark. In all splits and shots, MINI achieves new SOTA performance and outperforms the second-best by a large margin. Specifically, MINI boosts the current SOTA by 18.4, 16.7, 10.9, 10.6, 12.8 and 19.3, 15.5, 15.3, 8.8, 13.5 and 16.6, 15.6, 11.7, 11.9, 10.8 for K = 1, 2, 3, 5, 10 shots on novel splits 1, 2 and 3, respectively. The significant performance improvements are consistent across shots and splits, but are more pronounced in low-shot scenarios, since there the data scarcity is more severe and the mined implicit instances alleviate it well. Similar performance gains can be observed on the MS COCO benchmark. As shown in Table 2, MINI outperforms all FSOD methods by a large margin under the COCO-style AP metric. Concretely, our method achieves 21.8 and 27.3 nAP and boosts the SOTA performance by 3.3 and 4.7 for K = 10 and 30 shots, respectively. The superior performance on both datasets suggests MINI generalizes well across datasets.

Comparison with SSOD Methods

In this section, we explore whether it is feasible to directly apply methods from semi-supervised object detection (SSOD) in the FSOD scenario. We compare MINI with two widely used frameworks, STAC [38] and Unbiased Teacher (UB-T) [32], as representatives of the offline and online paradigms, respectively. As shown in Table 3, the performance of both STAC and Unbiased Teacher is far behind MINI. We adopt the same hyper-parameter settings as the official papers except for the confidence threshold σ. The original STAC adopts σ = 0.9. We notice such a high threshold can filter out all novel instances; decreasing σ from 0.9 to 0.5 significantly boosts performance in lower shots, e.g., from 0.0 and 7.1 to 38.8 and 59.2 for K = 1 and 2, respectively. But it can harm performance in higher shots, e.g., nAP50 drops by 0.4 and 0.5 when decreasing σ from 0.7 to 0.5 for K = 5 and 10, respectively, since it results in more false positives. For Unbiased Teacher, we initialize both teacher and student with TFA [42] in the burn-in stage [32]. Although Unbiased Teacher adopts Focal Loss [29], we notice it is not sufficient to resolve the severe data scarcity and extreme class imbalance in FSOD. The proposed MINI significantly outperforms these SSOD methods, demonstrating its superiority.

4.4 Ablation Study

In this section, we conduct thorough ablation studies on each component of our approach. We first demonstrate each component can contribute to the overall performance, then we analyze the effect of different hyper-parameters. Finally, we explore how and why each component works. Unless otherwise specified, all experiments are conducted on novel split 1 of PASCAL VOC benchmark.

Method nAP50
1 2 3 5 10
TFA Base 41.9 49.1 49.9 58.0 58.4
+TFA Mining 19.3 48.5 57.1 65.2 66.9
+Adaptive Thresholding 58.1 63.3 62.5 67.7 67.5
+SSL Co-Mining 63.5 67.7 66.8 70.3 68.8
+Adaptive Mingling 68.3 70.0 68.7 71.3 70.8
+IoU Branching 69.9 72.5 71.7 72.7 73.8
+Fine-Tuning 72.0 74.2 72.4 74.7 76.2
Table 4: Effectiveness of each component

Component Analysis

Table 4 shows the overall performance contribution of each component. The first row is our re-implemented TFA baseline, whose performance on all shots is higher than the original implementation [42]. Directly applying TFA to offline mine implicit instances and re-training a detector with these instances (+TFA Mining) leads to limited gains on higher shots but worse performance on lower shots. Our adaptive thresholding rescues this performance degradation and also improves performance on all shots, which suggests it is vital to set a proper threshold. SSL co-mining yields decent gains in lower shots but smaller gains in higher shots, e.g., +5.4 and +1.3 for K = 1 and 10, which demonstrates SSL is a good enhancement to TFA in low shots, while TFA trained with more shots has discriminative power similar to the SSL model. The online mining mechanism employs a teacher to mine diverse novel instances and combines them with the offline mined novel instances via the adaptive mingling design; the better training samples lead to decent gains in all shots. The IoU branching mechanism is orthogonal to all other modules and further improves performance. Finally, we fine-tune the re-trained model on the novel set to mitigate the side effects of inaccurate supervision from the mined implicit instances, especially box errors, which are harmful to the regressor. Compared with the TFA baseline, our method yields total gains of +30.1, +25.1, +22.5, +16.7 and +17.8 for K = 1, 2, 3, 5 and 10, respectively.

Ablation Study for Hyper-parameters

Four hyper-parameters are introduced: two for adaptive thresholding, one for online mining, and one for IoU branching. The detailed hyper-parameter study is described in Appendix C.

Flexibility of Adaptive Thresholding

To understand how adaptive thresholding works, we study how the threshold varies across shots and classes in Fig. 4. We can see the adaptive threshold well characterizes the transfer hardness among different classes and shots. On the one hand, as the shot count grows, the classification scores should become higher since the classifier learns better; adaptive thresholding accordingly raises the threshold steadily to make the mined novel instances more rigorous and suppress false positives. On the other hand, the classifier tends to predict higher scores for novel classes that are similar to base classes [4]: e.g., “bus” is an easy class since it is similar to “car”, while “bird” is a hard class since no base class resembles it. Therefore, adaptive thresholding assigns a higher threshold for “bus” and a lower one for “bird”. Such flexibility gives our adaptive thresholding strong robustness across different scenarios.
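The per-class behavior described above can be pictured with a simple statistic-based rule: classes whose mined candidates score high overall (easy, base-similar classes) receive a stricter threshold than classes whose candidates score low. This is only a hypothetical sketch, not the paper's exact Equ. 5; the function name and the mean/std form are assumptions.

```python
import numpy as np

def adaptive_thresholds(scores_per_class, coeff=1.5, lo=0.05, hi=0.95):
    """Hypothetical sketch of per-class adaptive thresholding: derive each
    class's confidence threshold from the score distribution of its mined
    candidates, so easy-to-transfer classes (high scores, e.g. "bus") get
    a higher threshold than hard ones (e.g. "bird")."""
    thresholds = {}
    for cls, scores in scores_per_class.items():
        s = np.asarray(scores, dtype=float)
        # Center the threshold on the mean score, relaxed by the spread.
        thresholds[cls] = float(np.clip(s.mean() - coeff * s.std(), lo, hi))
    return thresholds
```

As the shot count grows and the classifier predicts higher scores, such a rule automatically raises the threshold, mirroring the trend shown in Fig. 4.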

Unsupervised SSL Pre-training is a Powerful Discriminator

Fig. 5 compares the number of true positives (TP) among offline mined novel instances with and without SSL co-mining. Positive instances are those that overlap a GT bounding box above an IoU threshold. We can see SSL co-mining significantly boosts the number of TPs at every shot, especially in low-shot scenarios, e.g., +30.0 and +43.4 at 1 and 2 shots, while the increase shrinks as the shot count grows. This aligns with the observation in Tab. 4 that SSL co-mining brings more gains in low shot: the FSOD miner learned on scarce novel samples can discover only limited implicit novel instances, and the SSL model greatly enriches their diversity with these extra true positives.
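The co-mining step can be viewed as a nearest-exemplar check in the SSL feature space. The sketch below assumes box features have already been extracted (e.g., MoCo v2 C4 features of cropped regions); the function name and similarity threshold are illustrative, and the real pipeline combines this check with the FSOD detector's own scores.

```python
import numpy as np

def ssl_co_mine(candidate_feats, exemplar_feats, sim_thresh=0.6):
    """Keep candidate boxes whose SSL embedding is cosine-similar to at
    least one few-shot exemplar of the novel class (illustrative sketch).
    candidate_feats: (N, D) array; exemplar_feats: (K, D) array."""
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    e = exemplar_feats / np.linalg.norm(exemplar_feats, axis=1, keepdims=True)
    sims = c @ e.T                      # (N, K) cosine similarities
    keep = sims.max(axis=1) >= sim_thresh
    return np.flatnonzero(keep)         # indices of kept candidates
```

Because the SSL features are learned without class labels, such a check can rescue true positives that the few-shot classifier scores too low, which matches the TP gains observed in low-shot settings.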

Complementarity between Offline and Online Mining

To understand how adaptive mingling balances online and offline mined instances, we record the number of each type of instance kept after the NMS in Equ. 6 at different iterations in Fig. 6. At the beginning of training, the online teacher mines no instances, so mainly offline instances are kept for training. This explains the first and second rows of Tab. 6: it is necessary to enhance the R-CNN, because the online teacher cannot discover enough novel instances at the beginning. As training proceeds, online instances gradually dominate the kept set, which demonstrates that a better online teacher can discover more diverse novel instances than the initial FSOD detector. The last row of Tab. 6 shows enhancing the RPN also brings a slight gain.
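The mingling step can be sketched as a score-ranked NMS over the union of the two pseudo-label pools, where whichever source provides the higher-confidence box survives. A minimal illustration (the box format, tuple layout, and threshold are assumptions; the paper's Equ. 6 may differ in detail):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def mingle(offline, online, iou_thresh=0.5):
    """Merge offline and online mined instances by greedy NMS.
    Each instance is (box, score, source); higher-scoring boxes win."""
    kept = []
    for inst in sorted(offline + online, key=lambda t: -t[1]):
        if all(iou(inst[0], k[0]) < iou_thresh for k in kept):
            kept.append(inst)
    return kept
```

Counting the `source` tags among the survivors over training reproduces the qualitative trend in Fig. 6: early on almost all kept boxes come from the offline pool, and later the online teacher's boxes dominate.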

Figure 4: Confidence threshold of adaptive thresholding across different shots and novel classes on PASCAL VOC Novel Split 1
Figure 5: Comparison of the number of true positives (TP) of mined novel instances with and without SSL co-mining

Generalizing to External Datasets

So far, we have only mined implicit novel instances from the base dataset; can we generalize MINI to external unlabeled

Base Set     Extra Set     nAP50
(PASCAL VOC) (COCO)        1     2     3     5     10
                           41.9  49.1  49.9  58.0  58.4
✓                          72.0  74.2  72.4  74.7  76.2
             ✓             62.9  69.3  68.2  72.9  72.4
✓            ✓             73.7  75.9  76.5  78.1  77.1
(a) PASCAL VOC
Base Set     Extra Set     nAP
(COCO)       (Objects365)  10    30
                           10.4  14.7
✓                          21.8  27.3
             ✓             21.2  26.4
✓            ✓             23.6  29.3
(b) MS COCO
Table 5: Generalizing MINI to mine novel instances from other unlabeled datasets
Figure 6: Number of offline and online mined instances kept for training at different iterations

RPN   R-CNN   nAP50
              1     2     3     5     10
              0.0   0.0   0.0   0.0   0.0
✓             0.0   0.0   0.0   0.0   0.0
      ✓       69.5  72.1  70.9  71.5  72.6
✓     ✓       69.9  72.5  71.7  72.7  73.8
Table 6: Ablation study on enhancing the RPN and/or R-CNN of the online teacher with offline mined novel instances

datasets in a cross-domain manner? In this section, we explore two such settings, adopting MS COCO [30] and Objects365 [37] as external datasets for the original base sets PASCAL VOC [10] and MS COCO, respectively. The results are shown in Tab. 5. We adopt the same hyper-parameters and mine 100 and 2000 instances per novel class from the extra datasets in Tab. 5(a) and Tab. 5(b), respectively. Mining from either the base set or the extra set alone significantly improves performance, though the extra set is inferior to the base set due to the domain gap between datasets. Moreover, mining from both sets brings further considerable gains, which demonstrates that MINI generalizes well to external datasets and discovers valuable instances to enhance the original model.
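The per-class caps used above (100 instances with COCO as the extra set, 2000 with Objects365) amount to a top-K filter over the mined pool. A minimal sketch, with the dict keys as assumptions:

```python
from collections import defaultdict

def cap_per_class(mined, max_per_class):
    """Keep only the top-scoring mined instances for each novel class
    (sketch of the per-class cap applied when mining external datasets)."""
    by_cls = defaultdict(list)
    for inst in mined:            # inst: {"cls": ..., "score": ..., "box": ...}
        by_cls[inst["cls"]].append(inst)
    kept = []
    for insts in by_cls.values():
        insts.sort(key=lambda d: -d["score"])   # highest confidence first
        kept.extend(insts[:max_per_class])
    return kept
```

Capping per class keeps the cross-domain pseudo labels from being dominated by a few easy classes while bounding the amount of domain-shifted supervision.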

5 Conclusion

In this paper, we propose Mining Implicit Novel Instances (MINI) to better tackle FSOD. MINI comprises an offline mining mechanism and an online mining mechanism. The offline mining mechanism leverages a self-supervised discriminator to collaboratively mine implicit novel instances with a trained FSOD model. Taking the mined novel instances as auxiliary training samples, the online mining mechanism adopts a teacher-student framework to simultaneously update the FSOD model and the mined implicit novel instances on the fly. MINI achieves new SOTA performance on various benchmarks, which demonstrates its effectiveness.

Appendix A: Implementation Details

We implement our method based on MMDetection [5] and MMFewShot [33]. We employ Faster R-CNN [36] with a Feature Pyramid Network [28] and ResNet-101 [17] as the base model. All models are trained on 8 Titan-XP GPUs with batch size 16 (2 images per GPU) and optimized by a standard SGD optimizer with learning rate 0.02, momentum 0.9, and weight decay. We strictly follow the protocol introduced by TFA [42] without any modifications to initialize the model. MoCo v2 [7] with ResNet-50 [17] is employed to co-mine novel instances with the FSOD model, and we take the C4 feature, i.e., the feature of the last layer of the ResNet, to compute the cosine similarity. The threshold coefficient in Equ. 5 is set to 1.5 for all experiments, and we limit the maximum number of mined novel instances to 300 and 3000 for PASCAL VOC and MS COCO, respectively. For the online learning stage, we follow Unbiased Teacher [31] and apply weak and strong augmentations to the teacher and student models, respectively. For PASCAL VOC, all models are trained for 18k iterations with learning-rate decays at 12k and 16k iterations, and the confidence threshold is set to 0.7. For MS COCO, all models are trained for 160k iterations with decays at 110k and 145k iterations, and the confidence threshold is set to 0.8. In the final fine-tuning stage, for PASCAL VOC we only fine-tune the box classifier, box predictor, and IoU predictor for 4k, 8k, 8k, 8k, and 12k iterations for 1-, 2-, 3-, 5-, and 10-shot, respectively. For MS COCO, we fine-tune the whole R-CNN head for 4k and 8k iterations for 10- and 30-shot, respectively. The learning rate is set to 0.001 for both datasets.
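In teacher-student frameworks of this kind, the teacher is typically maintained as an exponential moving average (EMA) of the student, as in Unbiased Teacher: the slowly-moving teacher scores weakly augmented images to produce stable pseudo labels, while the student trains on strongly augmented ones. Below is a framework-agnostic sketch with the weights represented as plain NumPy arrays; the momentum value is illustrative.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """Update teacher weights toward the student by exponential moving
    average. `teacher` and `student` are dicts of parameter arrays;
    a high momentum keeps the teacher's pseudo labels stable."""
    for name, w in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * w
    return teacher
```

Calling this once per training iteration lets the teacher track the student with a long lag, which is why its mined instances improve steadily over training, as reflected in Fig. 6.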

Excluded  nAP  Bird  Bus  Cow  Motor  Sofa
Bird 37.2 27.1 63.8 33.0 45.0 17.2
Bus 37.9 24.0 39.4 39.0 59.7 27.5
Cow 38.5 22.6 54.9 43.0 54.9 16.9
Motor 36.6 18.4 57.7 41.2 42.7 23.0
Sofa 41.8 27.3 70.0 37.1 49.3 25.1
(a) TFA
Excluded  nAP  Bird  Bus  Cow  Motor  Sofa
Bird 65.6 33.0 84.5 70.7 77.9 61.6
Bus 64.1 56.9 49.1 77.1 76.5 61.1
Cow 65.4 57.8 84.1 46.1 77.0 62.2
Motor 67.5 60.6 83.9 73.9 52.3 66.8
Sofa 65.7 62.4 83.5 75.8 74.9 32.0
(b) MINI
Table 7: Performance comparison between TFA and MINI. We use the same base and novel sets as PASCAL VOC novel split 1 with the 1-shot setting, but we exclude images that contain a selected novel class from the base dataset to simulate the case where no co-occurred novel instances exist; the excluded class is marked in red. For example, the “Bird” row indicates we exclude all images that contain “bird” instances from the base dataset
Figure 7: Examples of mined similar instances from the excluded-class dataset, e.g., “wheel” is mined for the novel class “bus”, and “bicycle” for the novel class “motorbike”

Appendix B: Robustness of MINI

Although co-occurrence widely exists in benchmark datasets, there may be cases where a novel class does not co-occur with base classes. In this section, we test the robustness of MINI in such a case. Specifically, we manually remove images that contain a selected novel class from the original base dataset of PASCAL VOC Novel Split 1 with the 1-shot setting, and keep the novel dataset unchanged. We then train TFA and MINI on this processed dataset; the results are shown in Table 7. Surprisingly, even though the base dataset does not contain the removed novel class, MINI still significantly improves the performance for the excluded class, e.g., boosting nAP50 by 5.9 (from 27.1 to 33.0) for “bird” and by 9.7 (from 39.4 to 49.1) for “bus”. So what instances are mined by MINI for these excluded novel classes? We draw some examples in Fig. 7. These mined novel instances share a strong texture or shape similarity with the excluded class: e.g., the wheel of the base class “aeroplane” is also a part of the novel class “bus”, the shape of the base class “bicycle” is similar to the novel class “motorbike”, and the texture of the base class “horse” is similar to the novel class “cow”. We conjecture that learning from these similar instances of base classes can also promote the feature representation ability of the corresponding novel classes.

Threshold coefficient (offline mining)        nAP50
       1     2     3     5     10
0.0    63.6  68.4  66.7  69.6  67.7
1.5    63.5  67.7  66.8  70.3  68.8
3.0    59.8  59.4  65.9  69.6  67.2

Maximum mined instances (offline mining)
150    62.4  65.9  66.0  68.9  67.7
300    63.5  67.7  66.8  70.3  68.8
450    63.9  68.3  66.3  68.8  67.3

Confidence threshold (online mining)
0.5    19.0  23.7  13.4  14.2  18.4
0.7    69.9  72.5  71.7  72.7  73.8
0.9    65.2  70.5  69.8  71.2  71.4

IoU branching parameter
0.5    69.9  72.5  71.7  72.7  73.8
1.0    69.4  72.7  69.8  34.1  31.6
2.0    69.1  72.0  30.4  33.0  72.3

Table 8: Ablation study for hyper-parameters of different components: the threshold coefficient and the maximum number of mined instances for adaptive thresholding in offline mining, the confidence threshold for online mining, and the IoU branching parameter

Appendix C: Hyper-parameters Ablation

In MINI, four hyper-parameters are introduced: the threshold coefficient and the maximum number of mined instances for adaptive thresholding, the confidence threshold for online mining, and one parameter for IoU branching. Table 8 analyzes the effect of different choices. When studying the two offline hyper-parameters, we do not involve the online mining mechanism or the fine-tuning; when studying the two online-stage hyper-parameters, we do not involve the fine-tuning. A smaller threshold coefficient and a larger maximum number both keep more mined novel instances, which is beneficial in lower shots, e.g., 1- and 2-shot, but can be harmful in higher shots since it may produce more false positives. We observe the performance is not very sensitive to these two hyper-parameters, and we finally adopt a coefficient of 1.5 and a maximum of 300 for offline mining. During online mining, it is necessary to set a relatively high confidence threshold: a too-small threshold, e.g., 0.5, severely degrades performance, as it induces too many false positives that distract the learning of the student model. We also found a large IoU branching parameter disturbs the training process, especially in higher shots. Through a coarse study, we adopt a confidence threshold of 0.7 and an IoU branching parameter of 0.5 for all experiments.

Discriminative Model    nAP50
                        1     2     3     5     10
ImageNet Pre-train      62.3  67.2  66.4  70.3  69.4
MoCo v2                 63.5  67.7  66.8  70.3  68.8
Table 9: Performance comparison between a self-supervised and a supervised discriminative model in offline mining. For the self-supervised model, we adopt MoCo v2 [7] with ResNet-50; for the supervised model, we adopt a ResNet-50 [17] supervised-trained on ImageNet as the counterpart

Appendix D: Self-supervised Discriminative Model vs. Supervised Discriminative Model in Offline Mining

The offline mining mechanism leverages a self-supervised discriminative model to collaboratively mine implicit novel instances with the trained FSOD network; what if we replace the SSL model with a supervised pre-trained model? Table 9 compares the SSL model MoCo v2 against an ImageNet supervised pre-trained ResNet-50. Overall, the SSL model compares favorably against its supervised counterpart, especially on lower shots, e.g., 1- to 3-shot, while being slightly inferior on higher shots, e.g., 10-shot. This demonstrates the solid discriminative ability of the SSL model for offline mining in MINI.

References

  • [1] P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. Advances in Neural Information Processing Systems. Cited by: §2.3.
  • [2] H. Bao, L. Dong, and F. Wei (2021) Beit: bert pre-training of image transformers. arXiv preprint arXiv:2106.08254. Cited by: §2.3.
  • [3] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [4] Y. Cao, J. Wang, Y. Jin, T. Wu, K. Chen, Z. Liu, and D. Lin (2021) Few-shot object detection via association and discrimination. In Advances in Neural Information Processing Systems, Cited by: §1, §2.1, §4.4, Table 1, Table 2.
  • [5] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §4.2, Appendix A: Implementation Details.
  • [6] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, Cited by: §2.3.
  • [7] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §3.3, Appendix A: Implementation Details, Table 9.
  • [8] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In IEEE International Conference on Computer Vision, Cited by: §2.3.
  • [9] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. Advances in Neural Information Processing Systems. Cited by: §2.3.
  • [10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2010-06) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §1, §4.1, §4.4.
  • [11] Q. Fan, W. Zhuo, C. Tang, and Y. Tai (2020) Few-shot object detection with attention-rpn and multi-relation detector. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.1.
  • [12] Z. Fan, Y. Ma, Z. Li, and J. Sun (2021) Generalized few-shot object detection without forgetting. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [13] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. In Advances in Neural Information Processing Systems, Cited by: §2.3.
  • [14] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2021) Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377. Cited by: §2.3.
  • [15] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.3, §3.1.
  • [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In IEEE International Conference on Computer Vision, Cited by: §1, §3.3.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.2, Appendix A: Implementation Details, Table 9.
  • [18] H. Hu, S. Bai, A. Li, J. Cui, and L. Wang (2021) Dense relation distillation with context-aware aggregation for few-shot object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [19] J. Jeong, S. Lee, J. Kim, and N. Kwak (2019) Consistency-based semi-supervised learning for object detection. Advances in Neural Information Processing Systems. Cited by: §2.2.
  • [20] B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell (2019) Few-shot object detection via feature reweighting. In IEEE International Conference on Computer Vision, Cited by: §2.1, Table 1, Table 2.
  • [21] L. Karlinsky, J. Shtok, S. Harary, E. Schwartz, A. Aides, R. Feris, R. Giryes, and A. M. Bronstein (2019) Repmet: representative-based metric learning for classification and few-shot object detection. In IEEE International Conference on Computer Vision, Cited by: §1, §2.1.
  • [22] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.1.
  • [23] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.3.
  • [24] N. Komodakis and S. Gidaris (2018) Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, Cited by: §2.3.
  • [25] A. Li and Z. Li (2021) Transformation invariant few-shot object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 1, Table 2.
  • [26] B. Li, B. Yang, C. Liu, F. Liu, R. Ji, and Q. Ye (2021) Beyond max-margin: class margin equilibrium for few-shot object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.1, Table 1, Table 2.
  • [27] Y. Li, H. Zhu, Y. Cheng, W. Wang, C. S. Teo, C. Xiang, P. Vadakkepat, and T. H. Lee (2021) Few-shot object detection via classification refinement and distractor retreatment. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [28] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection.. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2, Appendix A: Implementation Details.
  • [29] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision, Cited by: §2.2, §4.3.
  • [30] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, Cited by: §1, §4.1, §4.4.
  • [31] W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning, Cited by: Appendix A: Implementation Details.
  • [32] Y. Liu, C. Ma, Z. He, C. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, and P. Vajda (2021) Unbiased teacher for semi-supervised object detection. In International Conference on Learning Representations, Cited by: §2.2, §3.1, §3.3, §4.3.
  • [33] mmfewshot Contributors (2021) OpenMMLab few shot learning toolbox and benchmark. Note: https://github.com/open-mmlab/mmfewshot Cited by: §4.2, Appendix A: Implementation Details.
  • [34] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, Cited by: §2.3.
  • [35] L. Qiao, Y. Zhao, Z. Li, X. Qiu, J. Wu, and C. Zhang (2021) DeFRCN: decoupled faster r-cnn for few-shot object detection. In IEEE International Conference on Computer Vision, Cited by: §2.1, Table 1, Table 2.
  • [36] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Cited by: §1, §2.1, §4.2, Appendix A: Implementation Details.
  • [37] S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun (2019) Objects365: a large-scale, high-quality dataset for object detection. In IEEE International Conference on Computer Vision, Cited by: §4.4.
  • [38] K. Sohn, Z. Zhang, C. Li, H. Zhang, C. Lee, and T. Pfister (2020) A simple semi-supervised learning framework for object detection. In arXiv:2005.04757, Cited by: §2.2, §3.3, §4.3.
  • [39] B. Sun, B. Li, S. Cai, Y. Yuan, and C. Zhang (2021) FSCE: few-shot object detection via contrastive proposal encoding. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.1, Table 1, Table 2.
  • [40] P. Tang, C. Ramaiah, Y. Wang, R. Xu, and C. Xiong (2021) Proposal learning for semi-supervised object detection. In IEEE Winter Conference on Applications of Computer Vision, Cited by: §2.2.
  • [41] Y. Tang, W. Chen, Y. Luo, and Y. Zhang (2021) Humble teachers teach better students for semi-supervised object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
  • [42] X. Wang, T. E. Huang, T. Darrell, J. E. Gonzalez, and F. Yu (2020) Frustratingly simple few-shot object detection. In International Conference on Machine Learning, Cited by: Figure 1, §1, §1, §2.1, §3.2, §4.1, §4.3, §4.4, Table 1, Table 2, Appendix A: Implementation Details.
  • [43] Y. Wang, D. Ramanan, and M. Hebert (2019) Meta-learning to detect rare objects. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1, Table 1, Table 2.
  • [44] J. Wu, S. Liu, D. Huang, and Y. Wang (2020) Multi-scale positive sample refinement for few-shot object detection. In European Conference on Computer Vision, Cited by: Table 1, Table 2.
  • [45] Y. Xiao and R. Marlet (2020) Few-shot object detection and viewpoint estimation for objects in the wild. In European Conference on Computer Vision, Cited by: §1, §2.1.
  • [46] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu (2021) Simmim: a simple framework for masked image modeling. arXiv preprint arXiv:2111.09886. Cited by: §2.3.
  • [47] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu (2021) End-to-end semi-supervised object detection with soft teacher. In IEEE International Conference on Computer Vision, Cited by: §2.2.
  • [48] X. Yan, Z. Chen, A. Xu, X. Wang, X. Liang, and L. Lin (2019) Meta r-cnn: towards general solver for instance-level low-shot learning. In IEEE International Conference on Computer Vision, Cited by: §1, §2.1, Table 1, Table 2.
  • [49] L. Zhang, S. Zhou, J. Guan, and J. Zhang (2021) Accurate few-shot object detection with support-query mutual guidance and hybrid loss. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [50] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European Conference on Computer Vision, Cited by: §2.3.
  • [51] Q. Zhou, C. Yu, Z. Wang, Q. Qian, and H. Li (2021) Instant-teaching: an end-to-end semi-supervised object detection framework. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
  • [52] C. Zhu, F. Chen, U. Ahmed, and M. Savvides (2021) Semantic relation reasoning for shot-stable few-shot object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 1, Table 2.