Semantic Relation Reasoning for Shot-Stable Few-Shot Object Detection

03/02/2021 ∙ by Chenchen Zhu, et al. ∙ Carnegie Mellon University

Few-shot object detection is an imperative and long-lasting problem due to the inherent long-tail distribution of real-world data. Its performance is largely affected by the data scarcity of novel classes. But the semantic relation between the novel classes and the base classes is constant regardless of the data availability. In this work, we investigate utilizing this semantic relation together with the visual information and introduce explicit relation reasoning into the learning of novel object detection. Specifically, we represent each class concept by a semantic embedding learned from a large corpus of text. The detector is trained to project the image representations of objects into this embedding space. We also identify the problems of trivially using the raw embeddings with a heuristic knowledge graph and propose to augment the embeddings with a dynamic relation graph. As a result, our few-shot detector, termed SRR-FSD, is robust and stable to the variation of shots of novel objects. Experiments show that SRR-FSD can achieve competitive results at higher shots, and more importantly, a significantly better performance given both lower explicit and implicit shots. The proposed benchmark protocol with implicit shots removed from the pretrained classification dataset can serve as a more realistic setting for future research.




1 Introduction

Deep learning algorithms usually require a large amount of annotated data to achieve superior performance. To acquire enough annotated data, one common way is to collect abundant samples from the real world and pay annotators to generate ground-truth labels. However, even if all the data samples are well annotated according to our requirements, we still face the problem of few-shot learning. Because long-tail distribution is an inherent characteristic of the real world, there always exist some rare cases that have just a few samples available, such as rare animals and uncommon road conditions. In other words, we are unable to alleviate the scarcity of such cases by simply spending more money on annotation, even if big data is accessible. Therefore, the study of few-shot learning is an imperative and long-lasting task.

Figure 1: FSOD performance (mAP50) on VOC [voc] Novel Set 1 at different shot numbers. Solid line (original) means the pretrained model used for initializing the detector backbone is trained on the original ImageNet. Dashed line (rm-nov) means classes in Novel Set 1 are removed from the ImageNet for the pretrained backbone model. Our SRR-FSD is more stable to the variation of explicit shots (x-axis) and implicit shots (original vs. rm-nov).

Recently, efforts have been put into the study of few-shot object detection (FSOD) [lstd, repmet, fsod-mc, yolo-fewshot, meta-rcnn, metadet, attention-rpn, context-transformer, tfa, mpsr, fsdetview]. In FSOD, there are base classes in which sufficient objects are annotated with bounding boxes and novel classes in which very few labeled objects are available. The novel class set does not share common classes with the base class set. The few-shot detectors are expected to learn from limited data in novel classes with the aid of abundant data in base classes and to be able to detect all novel objects in a held-out testing set. To achieve this, most recent few-shot detection methods adopt the ideas from meta-learning and metric learning for few-shot recognition and apply them to conventional detection frameworks, e.g. Faster R-CNN [faster-rcnn], YOLO [yolov2].

Although recent FSOD methods have improved the baseline considerably, data scarcity is still a bottleneck that hurts the detector’s generalization from a few samples. In other words, the performance is very sensitive to the number of both explicit and implicit shots and drops drastically as data becomes limited. The explicit shots refer to the available labeled objects from the novel classes. For example, the 1-shot performance of some FSOD methods is less than half of the 5-shot or 10-shot performance, as shown in Figure 1. In terms of implicit shots, initializing the backbone network with a model pretrained on a large-scale image classification dataset is a common practice for training an object detector. However, the classification dataset contains many implicit shots of object classes overlapped with the novel classes. So the detector can have early access to novel classes and encode their knowledge in the parameters of the backbone. Removing those implicit shots from the pretrained dataset also has a negative impact on the performance as shown in Figure 1. The variation of explicit and implicit shots could potentially lead to system failure when dealing with extreme cases in the real world.

We believe shot sensitivity is caused by exclusive dependence on visual information. Novel objects are learned through images only and the learning is independent between classes. As a result, visual information becomes limited as image data becomes scarce. However, one thing remains constant regardless of the availability of visual information, i.e. the semantic relation between base and novel classes. For example, in Figure 2, if we have the prior knowledge that the novel class “bicycle” looks similar to “motorbike”, can interact with “person”, and can carry a “bottle”, it would be easier to learn the concept “bicycle” than by solely using a few images. Such explicit relation reasoning is even more crucial when visual information is hard to access [zsr-gnn].

Figure 2: Key insight: the semantic relation between base and novel classes is constant regardless of the data availability of novel classes, which can aid the learning together with visual information.

So how can we introduce semantic relation to few-shot detection? In natural language processing, semantic concepts are represented by word embeddings [word2vec, glove] from language models, which have been used in zero-shot learning methods [zsr-gnn, zsod]. And explicit relationships are represented by knowledge graphs [wordnet, nell], which are adopted by some zero-shot or few-shot recognition algorithms [zsr-gnn, fsr-kt]. However, these techniques are rarely explored in the FSOD task. Also, directly applying them to few-shot detectors leads to non-trivial practical problems, i.e. the domain gap between vision and language, and the heuristic definition of knowledge graph for classes in FSOD datasets (see Section 3.2 and 3.3 for details).

In this work, we explore the semantic relation for FSOD. We propose a Semantic Relation Reasoning Few-Shot Detector (SRR-FSD), which learns novel objects from both the visual information and the semantic relation in an end-to-end style. Specifically, we construct a semantic space using the word embeddings. Guided by the word embeddings of the classes, the detector is trained to project the objects from the visual space to the semantic space and to align their image representations with the corresponding class embeddings. To address the aforementioned problems, we propose to learn a dynamic relation graph driven by the image data instead of pre-defining one based on heuristics. Then the learned graph is used to perform relation reasoning and augment the raw embeddings for reduced domain gap.

With the help of the semantic relation reasoning, our SRR-FSD demonstrates the shot-stable property in two aspects, see the red solid and dashed lines in Figure 1. In the common few-shot settings (solid lines), SRR-FSD achieves competitive performance at higher shots and significantly better performance at lower shots compared to state-of-the-art few-shot detectors. In a more realistic setting (dashed lines) where implicit shots of novel concepts are removed from the classification dataset for the pretrained model, SRR-FSD steadily maintains the performance while some previous methods have results degraded by a large margin due to the loss of implicit shots. We hope the proposed realistic setting can serve as a new benchmark protocol for future research. Additionally, SRR-FSD generalizes to novel classes without catastrophically forgetting the objects in the base classes, so the performance on the base set holds firmly.

We summarize our contributions as follows:

  • To our knowledge, our work is the first to investigate semantic relation reasoning for the few-shot detection task and show its potential to improve a strong baseline.

  • Our SRR-FSD achieves stable performance with respect to the shot variation, outperforming state-of-the-art FSOD methods under several existing settings, especially when the novel class data is extremely limited.

  • We propose a more realistic FSOD setting in which implicit shots of novel classes are removed from the classification dataset for the pretrained model, and show that our SRR-FSD can maintain a more steady performance compared to previous methods if using the new pretrained model.

2 Related Work

Object Detection

Object detection is a fundamental computer vision task, serving as a necessary step for various downstream instance-based understanding tasks. Modern CNN-based detectors can be roughly divided into two categories. One is single-stage detectors such as YOLO [yolov2], SSD [ssd], RetinaNet [retinanet], and FreeAnchor [freeanchor], which directly predict the class confidence scores and the bounding box coordinates over a dense grid. The other is multi-stage detectors such as Faster R-CNN [faster-rcnn], R-FCN [r-fcn], FPN [fpn], Cascade R-CNN [cascade-rcnn], and Libra R-CNN [libra-rcnn], which predict class-agnostic regions of interest and refine those region proposals one or multiple times. All these methods rely on pre-defined anchor boxes to have an initial estimation of the size and aspect ratio of the objects. Recently, anchor-free detectors eliminate the performance-sensitive hyperparameters for the anchor design. Some of them detect the key points of bounding boxes [cornernet, extremenet, centernet]. Some of them encode and decode the bounding boxes as anchor points and point-to-boundary distances [guidedanchor, fsaf, fcos, rpdet, sapd]. DETR [detr] reformulates object detection as a direct set prediction problem and solves it with transformers. However, these detectors are trained with full supervision where each class has abundant annotated object instances.

Few-Shot Detection Recently, there have been works focusing on solving the detection problem in the limited data scenario. LSTD [lstd] proposes the transfer knowledge regularization and background depression regularization to promote the knowledge transfer from the source domain to the target domain. [fsod-mc] proposes to iterate between model training and high-confidence sample selection. RepMet [repmet] adopts a distance metric learning classifier into the RoI classification head. FSRW [yolo-fewshot] and Meta R-CNN [meta-rcnn] predict per-class attentive vectors to reweight the feature maps of the corresponding classes. MetaDet [metadet] leverages meta-level knowledge about model parameter generation for category-specific components of novel classes. In [attention-rpn], the similarity between the few-shot support set and query set is explored to detect novel objects. Context-Transformer [context-transformer] relies on discriminative context clues to reduce object confusion. TFA [tfa] only fine-tunes the last few layers of the detector. Two very recent papers are MPSR [mpsr] and FSDetView [fsdetview]. MPSR develops an auxiliary branch to generate multi-scale positive samples as object pyramids and to refine the prediction at various scales. FSDetView proposes a joint feature embedding module to share the feature from base classes. However, all these methods depend purely on visual information and suffer from shot variation.

Semantic Reasoning in Vision Tasks Semantic word embeddings have been used in zero-shot learning tasks to learn a mapping from the visual feature space to the semantic space, such as zero-shot recognition [zsr-gnn] and zero-shot object detection [zsod, polarity]. In [chen_etal], semantic embeddings are used as the ground-truth of the encoder TriNet to guide the feature augmentation. In [lu_etal], semantic embeddings guide the feature synthesis for unseen classes by perturbing the seen feature with the projected difference between a seen class embedding and an unseen class embedding. In zero-shot or few-shot recognition [zsr-gnn, fsr-kt], word embeddings are often combined with knowledge graphs to perform relation reasoning via the graph convolution operation [gcn]. Knowledge graphs are usually defined based on heuristics from databases of common sense knowledge rules [wordnet, nell]. [multi-label-gcn] proposed a knowledge graph based on object co-occurrence for the multi-label recognition task. To our knowledge, the use of word embeddings and knowledge graphs is rarely explored in the FSOD task. Any-Shot Detector (ASD) [anyshot] is the only work that uses word embeddings for the FSOD task. But ASD focuses more on zero-shot detection and does not consider explicit relation reasoning between classes, because each word embedding is treated independently.

3 Semantic Relation Reasoning Few-Shot Detector

In this section, we first briefly introduce the preliminaries for few-shot object detection including the problem setup and the general training pipelines. Then based on Faster R-CNN [faster-rcnn], we build our SRR-FSD by integrating semantic relation with the visual information and allowing it to perform relation reasoning in the semantic space. We also discuss the problems of trivially using the raw word embeddings and the predefined knowledge graphs. Finally, we introduce the two-phase training processes. An overview of our SRR-FSD is illustrated in Figure 3.

Figure 3: Overview of the SRR-FSD. A semantic space is built from the word embeddings of all corresponding classes in the dataset and is augmented through a relation reasoning module. Visual features are learned to be projected into the augmented space. “”: dot product.

3.1 FSOD Preliminaries

The conventional object detection problem has a base class set $\mathcal{C}_b$ in which there are many instances, and a base dataset $\mathcal{D}_b$ with abundant images. $\mathcal{D}_b$ consists of a set of annotated images $\{(x_i, y_i)\}$ where $x_i$ is the image and $y_i$ is the annotation of labels from $\mathcal{C}_b$ and bounding boxes for objects in $x_i$. The few-shot object detection (FSOD) problem, in addition to $\mathcal{C}_b$ and $\mathcal{D}_b$, also has a novel class set $\mathcal{C}_n$ and a novel dataset $\mathcal{D}_n$, with $\mathcal{C}_b \cap \mathcal{C}_n = \emptyset$. In $\mathcal{D}_n$, objects have labels belonging to $\mathcal{C}_n$ and the number of objects for each class is $k$ for $k$-shot detection. A few-shot detector is expected to learn from $\mathcal{D}_b$ and to quickly generalize to $\mathcal{D}_n$ with a small $k$ such that it can detect all objects in a held-out testing set with object classes in $\mathcal{C}_b \cup \mathcal{C}_n$. All classes in $\mathcal{C}_b \cup \mathcal{C}_n$ have semantically meaningful names.

A typical few-shot detector has two training phases. The first one is the base training phase where the detector is trained on the base dataset $\mathcal{D}_b$ similarly to conventional object detectors. Then in the second phase, it is further fine-tuned on the union of $\mathcal{D}_b$ and the novel dataset $\mathcal{D}_n$. To avoid the dominance of objects from $\mathcal{D}_b$, a small subset is sampled from $\mathcal{D}_b$ such that the training set is balanced with respect to the number of objects per class. As the total number of classes is increased by the size of the novel class set $\mathcal{C}_n$ in the second phase, more class-specific parameters are inserted in the detector and trained to be responsible for the detection of novel objects. The class-specific parameters are usually in the box classification and localization layers at the very end of the network.
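The balanced sampling used in the fine-tuning phase can be illustrated with a minimal sketch. It assumes a hypothetical flat annotation format (one `(image_id, class_name)` pair per object) and samples individual objects; a real detection pipeline samples whole images and must handle images that contain several classes, which this sketch ignores.

```python
import random
from collections import defaultdict

def sample_balanced_subset(annotations, k, seed=0):
    """Keep at most k annotated objects per class.

    annotations: list of (image_id, class_name) pairs, one per object.
    This flat format is an assumption made for the sketch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, class_name in annotations:
        by_class[class_name].append((image_id, class_name))
    subset = []
    for class_name in sorted(by_class):
        objects = by_class[class_name]
        rng.shuffle(objects)
        subset.extend(objects[:k])
    return subset

# Toy base set with many objects per class, and a 3-shot novel set.
base = [(f"img{i}", "person") for i in range(50)]
base += [(f"img{i}", "car") for i in range(30)]
novel = [("n0", "bicycle"), ("n1", "bicycle"), ("n2", "bicycle")]

finetune_set = sample_balanced_subset(base, k=3) + novel
counts = defaultdict(int)
for _, class_name in finetune_set:
    counts[class_name] += 1
```

After sampling, every class contributes the same number of objects, so the abundant base classes no longer dominate the fine-tuning set.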

3.2 Semantic Space Projection

Our few-shot detector is built on top of Faster R-CNN [faster-rcnn], a popular two-stage general object detector. In the second stage of Faster R-CNN, a feature vector is extracted for each region proposal and forwarded to a classification subnet and a regression subnet. In the classification subnet, the feature vector is transformed into a $d$-dimensional vector $\mathbf{v} \in \mathbb{R}^{d}$ through fully-connected layers. Then $\mathbf{v}$ is multiplied by a learnable weight matrix $\mathbf{W} \in \mathbb{R}^{N \times d}$ to output a probability distribution $\mathbf{p} \in \mathbb{R}^{N}$ as in Eq. (1),

$$\mathbf{p} = \mathrm{softmax}(\mathbf{W}\mathbf{v} + \mathbf{b}) \quad (1)$$

where $N$ is the number of classes and $\mathbf{b} \in \mathbb{R}^{N}$ is a learnable bias vector. Cross-entropy loss is used during training.

To learn objects from both the visual information and the semantic relation, we first construct a semantic space and project the visual feature into this semantic space. Specifically, we represent the semantic space using a set of $d_e$-dimensional word embeddings $\mathbf{W}_e \in \mathbb{R}^{N \times d_e}$ [word2vec] corresponding to the $N$ object classes (including the background class). The detector is trained to learn a linear projection $\mathbf{P} \in \mathbb{R}^{d_e \times d}$ in the classification subnet (see Figure 3) such that the region feature $\mathbf{v}$ is expected to align with its class’s word embedding after projection. Mathematically, the prediction of the probability distribution turns from Eq. (1) into Eq. (2):

$$\mathbf{p} = \mathrm{softmax}(\mathbf{W}_e \mathbf{P} \mathbf{v} + \mathbf{b}) \quad (2)$$

During training, $\mathbf{W}_e$ is fixed and the learnable variable is $\mathbf{P}$. A benefit is that generalization to novel objects involves no new parameters in $\mathbf{P}$: we can simply expand $\mathbf{W}_e$ with embeddings of novel classes. We still keep the bias $\mathbf{b}$ to model the category imbalance in the detection dataset.
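The semantic space projection can be sketched in a few lines of plain Python. The names `W_e` (fixed class word embeddings), `P` (learned projection), `v` (region feature) and `b` (bias), as well as the toy values, are assumptions chosen for illustration; this is not the authors' implementation.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def classify(W_e, P, v, b):
    """p = softmax(W_e P v + b), with W_e fixed and P learned."""
    projected = matvec(P, v)            # visual feature mapped into the semantic space
    scores = matvec(W_e, projected)     # similarity to each class embedding
    return softmax([s + bi for s, bi in zip(scores, b)])

# Toy setup: N = 3 classes, embedding size d_e = 2, feature size d = 3.
W_e = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
P = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
v = [2.0, 0.5, -1.0]
b = [0.0, 0.0, 0.0]
p = classify(W_e, P, v, b)

# Adding a novel class only appends one embedding row and one bias entry;
# the projection P keeps its shape.
W_e_novel = W_e + [[0.7, 0.7]]
b_novel = b + [0.0]
p_novel = classify(W_e_novel, P, v, b_novel)
```

The last lines show the property stated above: extending the detector to a novel class changes the embedding matrix and the bias, while the projection keeps the same parameters.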

Domain gap between vision and language. The word embedding matrix $\mathbf{W}_e$ encodes the knowledge of semantic concepts from natural language. While it is applicable in zero-shot learning, it introduces the bias of the domain gap between vision and language to the FSOD task. Unlike zero-shot learning, where unseen classes have no support from images, the few-shot detector can rely on both the images and the embeddings to learn the concept of novel objects. When there are very few images to rely on, the knowledge from embeddings can guide the detector towards a decent solution. But when more images are available, the knowledge from embeddings may be misleading due to the domain gap, resulting in a suboptimal solution. Therefore, we need to augment the semantic embeddings to reduce the domain gap. Some previous works like ASD [anyshot] apply a trainable transformation to each word embedding independently. But we believe leveraging the explicit relationship between classes is more effective for embedding augmentation, leading to the proposal of the dynamic relation graph in Section 3.3.

3.3 Relation Reasoning

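The semantic space projection learns to align the concepts from the visual space with the semantic space. But it still treats each class independently and there is no knowledge propagation among classes. Therefore, we further introduce a knowledge graph to model their relationships. The knowledge graph $\mathbf{G}$ is an $N \times N$ adjacency matrix representing the connection strength for every pair of neighboring classes, where $N$ is the number of classes. $\mathbf{G}$ is involved in classification via the graph convolution operation [gcn]. Mathematically, the updated probability prediction is shown in Eq. (3):

$$\mathbf{p} = \mathrm{softmax}(\mathbf{G} \mathbf{W}_e \mathbf{P} \mathbf{v} + \mathbf{b}) \quad (3)$$

where $\mathbf{W}_e$ denotes the class word embeddings, $\mathbf{P}$ the learned projection, $\mathbf{v}$ the region feature, and $\mathbf{b}$ the bias from Section 3.2.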
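The effect of a knowledge graph on the class embeddings can be illustrated with a minimal sketch of one graph convolution step. The adjacency values, the row normalization, and the name `W_aug` are assumptions for illustration; the text only specifies that the graph is an adjacency matrix applied via graph convolution.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def row_normalize(G):
    # Assumption: each class averages over its neighbours.
    return [[g / sum(row) for g in row] for row in G]

# Three classes with 2-dimensional embeddings (toy values).
W_e = [[1.0, 0.0],
       [0.0, 1.0],
       [0.8, 0.2]]

# Classes 0 and 2 are connected; class 1 is only connected to itself.
G = row_normalize([[1.0, 0.0, 0.5],
                   [0.0, 1.0, 0.0],
                   [0.5, 0.0, 1.0]])

# Each augmented embedding is a weighted mix of its neighbours' embeddings.
W_aug = matmul(G, W_e)
```

Here the embedding of class 0 moves towards that of class 2 and vice versa, while the isolated class 1 is left unchanged. This is the knowledge propagation that independent per-class embeddings lack.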

Figure 4: Network architecture of the relation reasoning module for learning the relation graph. “”: dot product. “”: element-wise plus.

The heuristic definition of the knowledge graph. In zero-shot or few-shot recognition algorithms, the knowledge graph is predefined based on heuristics. It is usually constructed from a database of common sense knowledge rules by sampling a sub-graph through the rule paths such that semantically related classes have strong connections. For example, classes from the ImageNet dataset [imagenet] have a knowledge graph sampled from WordNet [wordnet]. However, classes in FSOD datasets are not highly semantically related, nor do they form a hierarchical structure like the ImageNet classes. The only applicable heuristics we found are based on object co-occurrence from [multi-label-gcn]. Although the statistics of the co-occurrence are straightforward to compute, co-occurrence is not necessarily equivalent to semantic relation.

Instead of predefining a knowledge graph based on heuristics, we propose to learn a dynamic relation graph driven by the data to model the relation reasoning between classes. The data-driven graph is also responsible for reducing the domain gap between vision and language because it is trained with image inputs. Inspired by the concept of the transformer, we implement the dynamic graph with the self-attention architecture [attention] as shown in Figure 4. The original word embeddings $\mathbf{W}_e$ are transformed by three linear layers $f$, $g$, $h$, and a self-attention matrix is computed from the outputs of $f$ and $g$. The self-attention matrix is multiplied with the output of $h$, followed by another linear layer $l$. A residual connection [resnet] adds the output of $l$ to the original $\mathbf{W}_e$. Another advantage of learning a dynamic graph is that it can easily adapt to newly added classes: because the graph is not fixed and is generated on the fly from the word embeddings, we do not need to redefine a new graph and retrain the detector from the beginning. We can simply insert the corresponding embeddings of new classes and fine-tune the detector.
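The relation reasoning module described above can be sketched as a single self-attention step over the class embeddings. The weight names (`Wf`, `Wg`, `Wh`, `Wl`), the bias-free linear layers, the unscaled dot product, and the toy values are all assumptions made for illustration; the paper's exact layer configuration may differ.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(A):
    out = []
    for row in A:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def relation_reasoning(W_e, Wf, Wg, Wh, Wl):
    """Return residual-augmented embeddings and the dynamic relation graph."""
    F = matmul(W_e, Wf)                          # first linear layer
    K = matmul(W_e, Wg)                          # second linear layer
    H = matmul(W_e, Wh)                          # third linear layer
    A = softmax_rows(matmul(F, transpose(K)))    # N x N self-attention matrix
    update = matmul(matmul(A, H), Wl)            # back to the embedding size
    W_aug = [[w + u for w, u in zip(w_row, u_row)]
             for w_row, u_row in zip(W_e, update)]
    return W_aug, A

# Toy example: 3 classes, embedding size 3, reduced size 2.
W_e = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
Wf = Wg = Wh = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Wl = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]
W_aug, A = relation_reasoning(W_e, Wf, Wg, Wh, Wl)

# With a zero output layer, the residual connection returns W_e unchanged.
W_same, _ = relation_reasoning(W_e, Wf, Wg, Wh, [[0.0] * 3, [0.0] * 3])
```

Because the attention matrix is computed from the embeddings themselves, appending a row to `W_e` for a new class yields a larger graph without any redefinition, which matches the adaptability argument in the text.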

3.4 Decoupled Fine-tuning

In the second fine-tuning phase, we only unfreeze the last few layers of our SRR-FSD, similar to TFA [tfa]. For the classification subnet, we fine-tune the parameters in the relation reasoning module and the projection matrix $\mathbf{P}$. The localization subnet does not depend on the word embeddings, but it shares features with the classification subnet. We find that the learning of localization on novel objects can interfere with the classification subnet via the shared features, leading to many false positives. Decoupling the shared fully-connected layers between the two subnets can effectively make each subnet learn better features for its task. In other words, the classification subnet and the localization subnet have individual fully-connected layers and they are fine-tuned independently.

4 Experiments

4.1 Implementation Details

Our SRR-FSD is implemented based on Faster R-CNN [faster-rcnn] with ResNet-101 [resnet] and Feature Pyramid Network [fpn] as the backbone using the MMDetection [mmdet] framework. All models are trained with Stochastic Gradient Descent (SGD) and a batch size of 16. For the word embeddings, we use the L2-normalized 300-dimensional Word2Vec [word2vec] vectors from the language model trained on large unannotated texts like Wikipedia. In the relation reasoning module, we reduce the dimension of word embeddings to 32 which is empirically selected. In the first base training phase, we set the learning rate, the momentum, and the weight decay to 0.02, 0.9, and 0.0001, respectively. In the second fine-tuning phase, we reduce the learning rate to 0.001 unless otherwise mentioned. The input image is sampled by first randomly choosing between the base set and the novel set with a 50% probability and then randomly selecting an image from the chosen set.
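The two-step image sampling rule can be sketched as follows. The image identifiers and set sizes are hypothetical; only the 50% choice between the two sets comes from the text.

```python
import random

def sample_image(base_images, novel_images, rng):
    """Pick the base or novel set with equal probability, then an image from it."""
    pool = base_images if rng.random() < 0.5 else novel_images
    return rng.choice(pool)

rng = random.Random(0)
base_images = [f"base_{i}" for i in range(1000)]   # large base set
novel_images = [f"novel_{i}" for i in range(10)]   # tiny novel set

draws = [sample_image(base_images, novel_images, rng) for _ in range(10000)]
novel_fraction = sum(d.startswith("novel_") for d in draws) / len(draws)
```

Although the novel set is 100 times smaller in this toy setup, about half of the sampled images come from it, so the novel classes are not drowned out during fine-tuning.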

4.2 Existing Settings

We follow the existing settings in previous FSOD methods [yolo-fewshot, metadet, meta-rcnn, tfa] to evaluate our SRR-FSD on the VOC [voc] and COCO [coco] datasets. For fair comparison and reduced randomness, we use the same data splits and a fixed list of novel samples provided by [yolo-fewshot].

                        | Novel Set 1                   | Novel Set 2                   | Novel Set 3
Method / shot           | 1     2     3     5     10    | 1     2     3     5     10    | 1     2     3     5     10
FSRW [yolo-fewshot]     | 14.8  15.5  26.7  33.9  47.2  | 15.7  15.3  22.7  30.1  40.5  | 21.3  25.6  28.4  42.8  45.9
MetaDet [metadet]       | 18.9  20.6  30.2  36.8  49.6  | 21.8  23.1  27.8  31.7  43.0  | 20.6  23.9  29.4  43.9  44.1
Meta R-CNN [meta-rcnn]  | 19.9  25.5  35.0  45.7  51.5  | 10.4  19.4  29.6  34.8  45.4  | 14.3  18.2  27.5  41.2  48.1
TFA [tfa]               | 39.8  36.1  44.7  55.7  56.0  | 23.5  26.9  34.1  35.1  39.1  | 30.8  34.8  42.8  49.5  49.8
SRR-FSD (Ours)          | 47.8  50.5  51.3  55.2  56.8  | 32.5  35.3  39.1  40.8  43.8  | 40.1  41.5  44.3  46.9  46.4
Table 1: FSOD evaluation on VOC. We report the mAP with IoU threshold 0.5 (mAP50) under 3 different sets of 5 novel classes with a small number of shots.

VOC The 07 and 12 train/val sets are used for training and the 07 test set is for testing. Out of its 20 object classes, 5 classes are selected as novel and the remaining 15 are base classes, with 3 different base/novel splits. The novel classes each have $k$ annotated objects, where $k$ equals 1, 2, 3, 5, 10. In the first base training phase, our SRR-FSD is trained for 18 epochs with the learning rate multiplied by 0.1 at the 12th and 15th epoch. In the second fine-tuning phase, the number of training steps is set according to $|\mathcal{D}_n|$, the number of images in the $k$-shot novel dataset $\mathcal{D}_n$.
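The base-training learning rate schedule on VOC can be written as a small step function. The base rate 0.02, the decay factor 0.1, and the milestone epochs 12 and 15 come from the text; treating epochs as 1-indexed and applying each decay from the milestone epoch onward is an assumption, since the exact convention depends on the training framework.

```python
def learning_rate(epoch, base_lr=0.02, milestones=(12, 15), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone epoch.

    Assumption: epoch is 1-indexed and a decay takes effect from the
    milestone epoch itself.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

# Learning rate for each of the 18 base-training epochs.
schedule = [learning_rate(epoch) for epoch in range(1, 19)]
```

Under this convention the rate is 0.02 for epochs 1 to 11, 0.002 for epochs 12 to 14, and 0.0002 for epochs 15 to 18.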

We report the mAP50 of the novel classes on VOC with 3 splits in Table 1. In all different base/novel splits, our SRR-FSD achieves a more shot-stable performance. At higher shots like 5-shot and 10-shot, our performance is competitive compared to previous state-of-the-art methods. At more challenging conditions with shots less than 5, our approach can outperform the second-best by a large margin (up to 10+ mAP). Compared to ASD [anyshot] which only reports results of 3-shot and 5-shot in the Novel Set 1, ours is 24.2 and 6.0 better respectively in mAP. We do not include ASD in Table 1 because its paper does not provide the complete results on VOC.

Learning without forgetting is another merit of our SRR-FSD. After generalization to novel objects, the performance on the base objects does not drop at all as shown in Table 2. Both base AP and novel AP of our SRR-FSD compare favorably to previous methods based on the same Faster R-CNN with ResNet-101. The base AP even increases a bit probably due to the semantic relation reasoning from limited novel objects to base objects.

Shot  Method                  Base AP50  Novel AP50
3     Meta R-CNN [meta-rcnn]  64.8       35.0
      TFA [tfa]               79.1       44.7
      Ours base only          77.7       n/a
      SRR-FSD (Ours)          78.2       51.3
10    Meta R-CNN [meta-rcnn]  67.9       51.5
      TFA [tfa]               78.4       56.0
      Ours base only          77.7       n/a
      SRR-FSD (Ours)          78.2       56.8
Table 2: FSOD performance for the base and novel classes on Novel Set 1 of VOC. Our SRR-FSD has the merit of learning without forgetting.

COCO The minival set with 5000 images is used for testing and the remaining images in the train/val sets are used for training. Out of the 80 classes, the 20 that overlap with VOC are the novel classes with $k = 10, 30$ shots per class, and the remaining 60 classes are base. We train the SRR-FSD on the base dataset for 12 epochs using the same setting as MMDetection [mmdet] and fine-tune it for a fixed number of steps set according to $|\mathcal{D}_b|$, the number of images in the base dataset $\mathcal{D}_b$. Unlike VOC, the base dataset in COCO contains unlabeled novel objects, so the region proposal network (RPN) treats them as the background. To avoid omitting novel objects in the fine-tuning phase, we unfreeze the RPN and the following layers. Table 3 presents the COCO-style averaged AP. Again, we outperform previous methods including FSRW [yolo-fewshot], MetaDet [metadet], Meta R-CNN [meta-rcnn], TFA [tfa], and MPSR [mpsr] on AP and AP50.

Shot  Method                  AP    AP50  AP75
10    FSRW [yolo-fewshot]     5.6   12.3  4.6
      MetaDet [metadet]       7.1   14.6  6.1
      Meta R-CNN [meta-rcnn]  8.7   19.1  6.6
      TFA [tfa]               10.0  -     9.3
      MPSR [mpsr]             9.8   17.9  9.7
      SRR-FSD (Ours)          11.3  23.0  9.8
30    FSRW [yolo-fewshot]     9.1   19.0  7.6
      MetaDet [metadet]       11.3  21.7  8.1
      Meta R-CNN [meta-rcnn]  12.4  25.3  10.8
      TFA [tfa]               13.7  -     13.4
      MPSR [mpsr]             14.1  25.4  14.2
      SRR-FSD (Ours)          14.7  29.2  13.5
Table 3: FSOD performance of the novel classes on COCO.

COCO to VOC For the cross-domain FSOD setting, we follow [yolo-fewshot, metadet] to use the same base dataset with 60 classes as in the previous COCO within-domain setting. The novel dataset consists of 10 samples for each of the 20 classes from the VOC dataset. The learning schedule is the same as the previous COCO within-domain setting except the learning rate is 0.005. Figure 5 shows that our SRR-FSD achieves the best performance with a healthy 44.5 mAP, indicating better generalization ability in cross-domain situations.

Figure 5: 10-shot cross domain performance on the 20 novel classes under COCO to VOC.

4.3 A More Realistic Setting

                        | Novel Set 1                   | Novel Set 2                   | Novel Set 3
Method / shot           | 1     2     3     5     10    | 1     2     3     5     10    | 1     2     3     5     10
FSRW [yolo-fewshot]     | 13.9  21.1  20.0  29.9  40.8  | 13.5  14.2  20.6  20.7  36.8  | 16.2  22.2  26.8  37.0  41.5
Meta R-CNN [meta-rcnn]  | 11.5  22.2  24.7  36.4  45.2  | 10.1  16.9  22.7  29.6  40.1  | 10.0  21.7  27.1  32.8  41.6
TFA [tfa]               | 35.8  39.5  44.2  50.8  55.3  | 18.8  26.0  33.2  31.3  39.2  | 25.6  32.6  36.4  43.7  48.5
SRR-FSD (Ours)          | 46.3  51.1  52.6  56.2  57.3  | 31.0  29.9  34.7  37.3  41.7  | 39.2  40.5  39.7  42.2  45.2
Table 4: FSOD performance (mAP50) on VOC under a more realistic setting where novel classes are removed from the pretrained classification dataset to guarantee $\mathcal{C}_0 \cap \mathcal{C}_n = \emptyset$, where $\mathcal{C}_0$ is the class set of the classification dataset and $\mathcal{C}_n$ is the novel class set. Our SRR-FSD is more robust to the loss of implicit shots compared with Table 1.

The training of the few-shot detector usually involves initializing the backbone network with a model pretrained on large-scale object classification datasets such as ImageNet [imagenet]. The set of object classes in ImageNet, i.e. $\mathcal{C}_0$, overlaps heavily with the novel class set $\mathcal{C}_n$ in the existing settings. This means that the pretrained model can get early access to large amounts of object samples, i.e. implicit shots, from novel classes and encode their knowledge in the parameters before it is further trained for the detection task. Even though the pretrained model is optimized for the recognition task, the extracted features still have a big impact on the detection of novel objects (see Figure 1). However, some rare classes may have such limited or valuable data in the real world that pretraining a classification network on them is not realistic.

Therefore, we propose a more realistic setting for FSOD, which extends the existing settings. In addition to requiring that the base and novel class sets are disjoint, $\mathcal{C}_b \cap \mathcal{C}_n = \emptyset$, we also require that $\mathcal{C}_0 \cap \mathcal{C}_n = \emptyset$, where $\mathcal{C}_0$ is the set of ImageNet classes. To achieve this, we systematically and hierarchically remove novel classes from $\mathcal{C}_0$. For each class in $\mathcal{C}_n$, we find its corresponding synset in ImageNet and obtain its full hyponym (the synset of the whole subtree starting from that synset) using the ImageNet API. The images of this synset and its full hyponym are removed from the pretraining dataset, and the classification model is trained on a dataset with no novel objects. We provide the list of WordNet IDs for each novel class to be removed in Appendix A.
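The hierarchical removal can be illustrated with a small traversal over a class hierarchy. The hierarchy below is a hypothetical toy example with readable names instead of real WordNet IDs, and `full_hyponym` is a helper written for this sketch, not a function of the ImageNet API.

```python
def full_hyponym(synset, children):
    """Return the synset together with every descendant in the hierarchy."""
    removed, stack = set(), [synset]
    while stack:
        node = stack.pop()
        if node not in removed:
            removed.add(node)
            stack.extend(children.get(node, []))
    return removed

# Hypothetical toy hierarchy; real WordNet synsets and IDs differ.
children = {
    "vehicle": ["bicycle", "motor_vehicle"],
    "bicycle": ["mountain_bike", "tandem"],
    "motor_vehicle": ["car", "motorcycle"],
}
pretrain_classes = {"bicycle", "mountain_bike", "tandem", "car", "motorcycle"}
novel_synsets = ["bicycle"]

to_remove = set()
for synset in novel_synsets:
    to_remove |= full_hyponym(synset, children)

# Classes that remain available for pretraining the classification model.
kept = pretrain_classes - to_remove
```

Removing only the exact synset would leave fine-grained descendants such as "mountain_bike" in the pretraining data, which is why the whole subtree is dropped.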

We notice that CoAE [coae] also proposed to remove all COCO-related ImageNet classes to ensure the model does not “foresee” the unseen classes. As a result, a total of 275 classes are removed from ImageNet including both the base and novel classes in VOC [voc], which correspond to more than 300k images. We think the loss of this much data may lead to a worse pretrained model in general, so the pretrained model may not be able to extract features strong enough for downstream vision tasks compared with the model trained on full ImageNet. Our setting, on the other hand, tries to alleviate this effect as much as possible by only removing the novel classes in VOC Novel Set 1, 2, and 3 respectively, which correspond to an average of 50 classes from ImageNet.

Under the new realistic setting, we re-evaluate previous methods using their official source code and report the performance on the VOC dataset in Table 4. Our SRR-FSD demonstrates superior performance to other methods under most conditions, especially at challenging lower shot scenarios. More importantly, our SRR-FSD is less affected by the loss of implicit shots. Compared with results in Table 1, our performance is more stably maintained when novel objects are only available in the novel dataset.
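The effect of removing implicit shots can be quantified directly from the reported numbers. The sketch below copies the Novel Set 1 rows of Table 1 (original pretraining) and Table 4 (novel classes removed) and computes the mean mAP50 change over the five shot settings; the averaging is an illustrative summary, not a metric used in the paper.

```python
# Novel Set 1 mAP50 at 1, 2, 3, 5, 10 shots, copied from Table 1 and Table 4.
original = {
    "FSRW":       [14.8, 15.5, 26.7, 33.9, 47.2],
    "Meta R-CNN": [19.9, 25.5, 35.0, 45.7, 51.5],
    "TFA":        [39.8, 36.1, 44.7, 55.7, 56.0],
    "SRR-FSD":    [47.8, 50.5, 51.3, 55.2, 56.8],
}
rm_nov = {
    "FSRW":       [13.9, 21.1, 20.0, 29.9, 40.8],
    "Meta R-CNN": [11.5, 22.2, 24.7, 36.4, 45.2],
    "TFA":        [35.8, 39.5, 44.2, 50.8, 55.3],
    "SRR-FSD":    [46.3, 51.1, 52.6, 56.2, 57.3],
}

# Mean change in mAP50 when implicit shots are removed (negative = worse).
mean_change = {
    method: sum(r - o for o, r in zip(original[method], rm_nov[method])) / 5
    for method in original
}
for method, delta in mean_change.items():
    print(f"{method:12s} {delta:+.2f}")
```

On this split, the average change is clearly negative for the three previous methods and close to zero for SRR-FSD, which is the stability claim made above.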

4.4 Ablation Study

In this section, we study the contribution of each component. Experiments are conducted on the VOC dataset. Our baseline is the Faster R-CNN [faster-rcnn] with ResNet-101 [resnet] and FPN [fpn]. We gradually apply the Semantic Space Projection (SSP, Section 3.2), Relation Reasoning (RR, Section 3.3) and Decoupled Fine-tuning (DF, Section 3.4) to the baseline and report the performance in Table 5. We also compare three different ways of augmenting the raw word embeddings in Table 6, including the trainable transformation from ASD [anyshot], the heuristic knowledge graph from [multi-label-gcn], and the dynamic graph from our proposed relation reasoning module.

| Method | SSP | RR | DF | 1-shot | 2-shot | 3-shot | 5-shot | 10-shot |
|---|---|---|---|---|---|---|---|---|
| Faster R-CNN [faster-rcnn] | | | | 32.6 | 44.4 | 46.3 | 49.6 | 55.6 |
| | ✓ | | | 40.5 | 46.8 | 46.5 | 47.1 | 52.2 |
| | ✓ | ✓ | | 44.1 | 46.0 | 47.8 | 51.7 | 54.7 |
| SRR-FSD | ✓ | ✓ | ✓ | 47.8 | 50.5 | 51.3 | 55.2 | 56.8 |

Table 5: Ablative performance (mAP50) on the VOC Novel Set 1 by gradually applying the proposed components to the baseline Faster R-CNN. SSP: semantic space projection. RR: relation reasoning. DF: decoupled fine-tuning.

Semantic space projection guides shot-stable learning. The baseline Faster R-CNN can already achieve satisfactory results at 5-shot and 10-shot. But at 1-shot and 2-shot, performance starts to fall apart due to its exclusive dependence on images. The semantic space projection, on the other hand, makes the learning more stable to the variation of shot numbers (see 1st and 2nd entries in Table 5). The space projection guided by the semantic embeddings is learned well enough in the base training phase that it can be quickly adapted to novel classes with a few instances. We observe a major boost at lower shot conditions compared to the baseline, i.e. gains of 7.9 mAP and 2.4 mAP at 1-shot and 2-shot respectively. However, the raw semantic embeddings limit the performance at higher shot conditions: the performance at 5-shot and 10-shot drops below the baseline. This verifies our argument about the domain gap between vision and language. At lower shots, there is not much visual information to rely on, so the language information can guide the detector to a decent solution. But when more images are available, the visual information becomes more precise and the language information starts to be misleading. Therefore, we propose to refine the word embeddings to reduce the domain gap.
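The gains quoted above follow directly from the first two rows of Table 5. As a quick check, the short Python snippet below recomputes the per-shot difference between the SSP row and the Faster R-CNN baseline. The variable names are ours and purely illustrative.

```python
# mAP50 values copied from Table 5 (VOC Novel Set 1).
shots = [1, 2, 3, 5, 10]
baseline = [32.6, 44.4, 46.3, 49.6, 55.6]   # Faster R-CNN
with_ssp = [40.5, 46.8, 46.5, 47.1, 52.2]   # + semantic space projection

# Per-shot gain of SSP over the baseline, rounded to one decimal place.
gains = {s: round(p - b, 1) for s, b, p in zip(shots, baseline, with_ssp)}

for s, g in gains.items():
    print(f"{s:>2}-shot: {g:+.1f} mAP")
```

The sign of the difference flips between 3-shot and 5-shot, which is the crossover the paragraph describes: the raw embeddings help when images are scarce and hurt once more images are available.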

| Method | 1-shot | 2-shot | 3-shot | 5-shot | 10-shot |
|---|---|---|---|---|---|
| +SSP | 40.5 | 46.8 | 46.5 | 47.1 | 52.2 |
| +SSP +TT [anyshot] | 39.3 | 45.7 | 43.9 | 49.4 | 52.4 |
| +SSP +HKG [multi-label-gcn] | 41.6 | 45.5 | 47.8 | 49.7 | 52.5 |
| +SSP +RR | 44.1 | 46.0 | 47.8 | 51.7 | 54.7 |

Table 6: Comparison of three ways of refining the word embeddings on the VOC Novel Set 1, including the trainable transformation from ASD [anyshot], the heuristic knowledge graph from [multi-label-gcn], and the dynamic relation graph from our relation reasoning module. SSP: semantic space projection. RR: relation reasoning. TT: trainable transformation. HKG: heuristic knowledge graph.

Relation reasoning promotes adaptive knowledge propagation. The relation reasoning module explicitly learns a relation graph that builds direct connections between base classes and novel classes, so the detector can learn the novel objects using the knowledge of base objects in addition to the visual information. The relation reasoning module also functions as a refinement of the raw word embeddings with a data-driven relation graph. Since the relation graph is updated with image inputs, the refinement tends to adapt the word embeddings to the vision domain. Results in Table 5 (2nd and 3rd entries) confirm that applying relation reasoning improves the detection accuracy of novel objects under different shot conditions. We also compare it with two other ways of refining the raw word embeddings in Table 6. One is the trainable transformation (TT) from ASD [anyshot], where word embeddings are updated with a trainable metric and a word vocabulary. Note that this transformation is applied to each embedding independently, so it does not consider the explicit relationships between them. The other is the heuristic knowledge graph (HKG) from [multi-label-gcn], defined based on the co-occurrence of objects. It turns out that neither the trainable transformation nor the predefined heuristic knowledge graph is as effective as the dynamic relation graph in the relation reasoning module. The effect of the trainable transformation is similar to unfreezing more parameters of the last few layers during fine-tuning, as shown in Appendix E, which leads to overfitting when the shot is low. The predefined knowledge graph is fixed during training and thus cannot adapt to the inputs. In other words, the dynamic relation graph is better because it can not only perform explicit relation reasoning but also augment the raw embeddings to reduce the domain gap between vision and language.

Decoupled fine-tuning reduces false positives. We analyze the false positives generated by our SRR-FSD with and without decoupled fine-tuning (DF) using the detector diagnosing tool [det-analysis]. The effect of DF on reducing the false positives in novel classes is visualized in Figure 6. It shows that most of the false positives are due to misclassification into similar categories. With DF, the classification subnet can be trained independently from the localization subnet to learn better features specifically for classification.

Figure 6: Error analysis of false positives in VOC Novel Set 1 with and without decoupled fine-tuning (DF). Detectors are trained with 3 shots. Pie charts indicate the fraction of correct detections (Cor) and top-ranked false positives that are due to poor localization (Loc), confusion with similar objects (Sim), confusion with other VOC objects (Oth), or confusion with background or unlabeled objects (BG).

5 Conclusion

In this work, we propose semantic relation reasoning for few-shot object detection. The key insight is to explicitly integrate the semantic relation between base and novel classes with the available visual information, which helps to learn the novel concepts better, especially when the novel class data is extremely limited. We apply semantic relation reasoning to the standard two-stage Faster R-CNN and demonstrate few-shot performance that is robust to the variation of shot numbers. Compared to previous methods, our approach achieves state-of-the-art results on several few-shot detection settings, as well as on a proposed realistic setting where novel concepts encoded in the pretrained backbone model are eliminated. We hope this realistic setting can be a better evaluation protocol for future few-shot detectors. Last but not least, the key components of our approach, i.e. semantic space projection and relation reasoning, can be directly applied to the classification subnet of other few-shot detectors.

Appendix A Removing Novel Classes from ImageNet

We propose a realistic setting for evaluating few-shot object detection methods, where novel classes are completely removed from the classification dataset used for training the model that initializes the backbone network in the detector. This guarantees that the object concepts of novel classes are not encoded in the pretrained model before the few-shot detector is trained. The motivation is that novel class data is so rare in the real world that pretraining a classifier on it is not realistic.

ImageNet [imagenet] is widely used for pretraining the classification model. It has 1000 classes organized according to the WordNet hierarchy. Each class has over 1000 images for training. We systematically and hierarchically remove novel classes by finding each synset and its corresponding full hyponym (the synset of the whole sub-tree starting from that synset) using the ImageNet API. Each novel class may therefore correspond to multiple ImageNet classes.
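The removal procedure can be sketched as a subtree traversal. The snippet below is a minimal illustration, not the ImageNet API: the hierarchy is a hand-written toy dictionary, the class names are made up rather than real WordNet IDs, and the helper names `full_hyponym` and `remove_novel_classes` are hypothetical.

```python
from collections import deque


def full_hyponym(root, children):
    """Return the root synset plus every synset in the subtree below it."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:  # a synset can be reachable via several parents
                seen.add(child)
                queue.append(child)
    return seen


def remove_novel_classes(pretrain_classes, novel_roots, children):
    """Drop every pretraining class that falls under any novel-class synset."""
    banned = set()
    for root in novel_roots:
        banned |= full_hyponym(root, children)
    return [c for c in pretrain_classes if c not in banned]


# Toy hierarchy with made-up names (NOT real WordNet IDs).
toy_children = {
    "bird": ["songbird", "raptor"],
    "songbird": ["finch", "robin"],
    "raptor": ["eagle"],
    "vehicle": ["car", "bus"],
}
toy_pretrain_classes = ["finch", "robin", "eagle", "car", "bus", "chair"]

kept = remove_novel_classes(toy_pretrain_classes, ["bird", "bus"], toy_children)
print(kept)
```

In this toy example, removing the novel classes "bird" and "bus" discards four of the six pretraining classes, because three of them sit in the subtree under "bird". This is why a single novel class can map to many ImageNet classes.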

For the novel classes in the VOC dataset [voc], their corresponding WordNet IDs to be removed are as follows.

  • aeroplane: n02690373, n02692877, n04552348

  • bird: n01514668, n01514859, n01518878, n01530575, n01531178, n01532829, n01534433, n01537544, n01558993, n01560419, n01580077, n01582220, n01592084, n01601694, n01608432, n01614925, n01616318, n01622779, n01795545, n01796340, n01797886, n01798484, n01806143, n01806567, n01807496, n01817953, n01818515, n01819313, n01820546, n01824575, n01828970, n01829413, n01833805, n01843065, n01843383, n01847000, n01855032, n01855672, n01860187, n02002556, n02002724, n02006656, n02007558, n02009229, n02009912, n02011460, n02012849, n02013706, n02017213, n02018207, n02018795, n02025239, n02027492, n02028035, n02033041, n02037110, n02051845, n02056570, n02058221

  • boat: n02687172, n02951358, n03095699, n03344393, n03447447, n03662601, n03673027, n03873416, n03947888, n04147183, n04273569, n04347754, n04606251, n04612504

  • bottle: n02823428, n03062245, n03937543, n03983396, n04522168, n04557648, n04560804, n04579145, n04591713

  • bus: n03769881, n04065272, n04146614, n04487081

  • cat: n02123045, n02123159, n02123394, n02123597, n02124075, n02125311, n02127052

  • cow: n02403003, n02408429, n02410509

  • horse: n02389026, n02391049

  • motorbike: n03785016, n03791053

  • sheep: n02412080, n02415577, n02417914, n02422106, n02422699, n02423022

  • sofa: n04344873

The novel classes in the COCO dataset [coco] are very common in the real world, so removing them from ImageNet makes less sense than removing data-scarce classes. We therefore suggest that for large-scale datasets like COCO, one should follow the long-tail distribution of class frequency and select the data-scarce classes on the tail of the distribution as the novel classes.

Appendix B Visualization of Relation Reasoning

Figure 7 visualizes the correlation maps between the semantic embeddings of novel and base classes before and after the relation reasoning, as well as the difference between the two maps. Nearly all the correlations are increased slightly, indicating better knowledge propagation between the two groups of classes. Additionally, it is interesting to see that some novel classes get more correlated than others, e.g. “sofa” with “bottle” and “sofa” with “table”, probably because “sofa” can often be seen together with “bottle” and “table” in the living room but the original semantic embeddings cannot capture these relationships.

Figure 7: Correlation of the semantic embeddings between the base classes and the novel classes on the VOC dataset: (a) before relation reasoning, (b) after relation reasoning, (c) difference between the two correlation maps. The novel classes are from Novel Set 1. Panel (c) shows how the correlation changes subtly. Some novel classes become more correlated with base classes after relation reasoning, e.g. “sofa” with “bottle” and “table”. Best viewed in color.

Appendix C Using Other Word Embeddings

In the semantic space projection, we represent the semantic space using word embeddings from Word2Vec [word2vec]. As a reference point, the embeddings could simply be set to random vectors. There are also other language models for obtaining vector representations of words, such as GloVe [glove]. GloVe is trained on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. We explored using GloVe word embeddings with different dimensions in the semantic space projection step and compared the results with Word2Vec. Performance on the VOC Novel Set 1 is reported in Table 7. Word2Vec provides better representations than GloVe with either 300 or 200 dimensions at 1 to 5 shots, although both GloVe variants score slightly higher at 10-shot. The performance of random embeddings is significantly worse than that of the meaningful Word2Vec and GloVe embeddings, which again verifies the importance of semantic information for shot-stable FSOD.

| Word embeddings | 1-shot | 2-shot | 3-shot | 5-shot | 10-shot |
|---|---|---|---|---|---|
| Random-300d | 33.2 | 37.5 | 43.0 | 47.0 | 51.5 |
| Word2Vec-300d [word2vec] | 42.8 | 47.1 | 49.0 | 50.8 | 52.8 |
| GloVe-300d [glove] | 38.8 | 44.8 | 46.6 | 49.0 | 54.3 |
| GloVe-200d [glove] | 39.7 | 44.6 | 45.8 | 49.4 | 53.0 |

Table 7: FSOD performance (mAP50) on the VOC Novel Set 1 under different word embeddings in the semantic space projection. All models use the ResNet-50 network. 300d and 200d mean that the embedding dimension is 300 and 200 respectively. The Word2Vec provides better representations than the GloVe.

Appendix D Reduced Dimension in Relation Reasoning

In the relation reasoning module, the dimension of the word embeddings is reduced by linear layers before computing the attention map, which saves computational time. We empirically test different dimensions and select the one with the best performance, i.e. when the dimension is 32. Other choices are only slightly worse. Table 8 reports the results on the VOC dataset under different dimensions. All experiments follow the same setting as in the main paper, except that we use ResNet-50 [resnet] to reduce the computational cost of tuning hyperparameters.

| Dimension | 1-shot | 2-shot | 3-shot | 5-shot | 10-shot |
|---|---|---|---|---|---|
| 128 | 40.9 | 44.6 | 44.3 | 48.1 | 54.1 |
| 64 | 42.0 | 47.4 | 48.9 | 51.7 | 54.1 |
| 32 | 42.4 | 46.8 | 48.1 | 51.9 | 54.7 |
| 16 | 44.1 | 46.0 | 47.8 | 51.7 | 54.7 |

Table 8: FSOD performance (mAP50) on the VOC Novel Set 1 under different reduced feature dimensions in the relation reasoning module. Bold font indicates best or second best results. All models use the ResNet-50 network.

| Tunable Parameters | 1-shot | 2-shot | 3-shot | 5-shot | 10-shot |
|---|---|---|---|---|---|
| Last layer (TFA [tfa]) | 39.8 | 36.1 | 44.7 | 55.7 | 56.0 |
| +FCs | 36.9 | 34.9 | 45.3 | 53.0 | 55.9 |
| +FCs +RPN | 37.2 | 39.8 | 44.3 | 52.7 | 56.2 |
| +FCs +RPN +Backbone | 16.2 | 19.5 | 24.8 | 39.2 | 44.6 |

Table 9: FSOD results (mAP50) on the VOC Novel Set 1 with progressively more tunable parameters in the finetuning stage. The baseline is TFA [tfa], which only finetunes the last classification layer in the Faster R-CNN. We gradually unfreeze more of the preceding layers, including the two fully-connected layers (FCs) after the RoI-pooling, the layers in the region proposal network (RPN), and the layers in the Backbone. This shows that finetuning more parameters does not guarantee better performance in few-shot detection.

| Shot | Method | S1 bird | S1 bus | S1 cow | S1 mbike | S1 sofa | S1 mean | S2 aero | S2 bottle | S2 cow | S2 horse | S2 sofa | S2 mean | S3 boat | S3 cat | S3 mbike | S3 sheep | S3 sofa | S3 mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FSRW | 13.5 | 10.6 | 31.5 | 13.8 | 4.3 | 14.8 | 11.8 | 9.1 | 15.6 | 23.7 | 18.2 | 15.7 | 10.8 | 44.0 | 17.8 | 18.1 | 5.3 | 19.2 |
| | Meta R-CNN | 6.1 | 32.8 | 15.0 | 35.4 | 0.2 | 19.9 | 23.9 | 0.8 | 23.6 | 3.1 | 0.7 | 10.4 | 0.6 | 31.1 | 28.9 | 11.0 | 0.1 | 14.3 |
| | MPSR | 33.5 | 41.2 | 57.6 | 54.5 | 21.6 | 41.7 | 21.2 | 9.1 | 36.0 | 30.9 | 25.1 | 24.4 | 14.9 | 47.8 | 57.7 | 34.7 | 22.8 | 35.6 |
| | SRR-FSD (Ours) | 38.1 | 53.8 | 58.7 | 64.1 | 24.4 | 47.8 | 27.9 | 4.6 | 50.5 | 53.9 | 25.5 | 32.5 | 16.2 | 57.2 | 62.9 | 48.3 | 16.0 | 40.1 |
| 2 | FSRW | 21.2 | 12.0 | 16.8 | 17.9 | 9.6 | 15.5 | 28.6 | 0.9 | 27.6 | 0.0 | 19.5 | 15.3 | 5.3 | 46.4 | 18.4 | 26.1 | 12.4 | 21.7 |
| | Meta R-CNN | 17.2 | 34.4 | 43.8 | 31.8 | 0.4 | 25.5 | 12.4 | 0.1 | 44.4 | 50.1 | 0.1 | 19.4 | 10.6 | 24.0 | 36.2 | 19.2 | 0.8 | 18.2 |
| | MPSR | 38.2 | 28.6 | 56.5 | 57.3 | 32.0 | 42.5 | 36.5 | 9.1 | 45.1 | 21.6 | 34.2 | 29.3 | 17.9 | 49.6 | 59.2 | 49.2 | 32.9 | 41.8 |
| | SRR-FSD (Ours) | 35.8 | 57.7 | 59.3 | 61.8 | 38.0 | 50.5 | 34.4 | 5.7 | 57.1 | 44.0 | 35.5 | 35.3 | 15.5 | 51.4 | 62.6 | 44.4 | 33.7 | 41.5 |
| 3 | FSRW | 26.1 | 19.1 | 40.7 | 20.4 | 27.1 | 26.7 | 29.4 | 4.6 | 34.9 | 6.8 | 37.9 | 22.7 | 11.2 | 39.8 | 20.9 | 23.7 | 33.0 | 25.7 |
| | Meta R-CNN | 30.1 | 44.6 | 50.8 | 38.8 | 10.7 | 35.0 | 25.2 | 0.1 | 50.7 | 53.2 | 18.8 | 29.6 | 16.3 | 39.7 | 32.6 | 38.8 | 10.3 | 27.5 |
| | MPSR | 35.1 | 60.6 | 56.6 | 61.5 | 43.4 | 51.4 | 49.2 | 9.1 | 47.1 | 46.3 | 44.3 | 39.2 | 14.4 | 60.6 | 57.1 | 37.2 | 42.3 | 42.3 |
| | SRR-FSD (Ours) | 35.2 | 55.6 | 61.3 | 62.9 | 41.5 | 51.3 | 42.3 | 11.5 | 57.0 | 43.6 | 41.2 | 39.1 | 23.1 | 50.6 | 60.0 | 49.3 | 38.6 | 44.3 |
| 5 | FSRW | 31.5 | 21.1 | 39.8 | 40.0 | 37.0 | 33.9 | 33.1 | 9.4 | 38.4 | 25.4 | 44.0 | 30.1 | 14.2 | 57.3 | 50.8 | 38.9 | 41.6 | 40.6 |
| | Meta R-CNN | 35.8 | 47.9 | 54.9 | 55.8 | 34.0 | 45.7 | 28.5 | 0.3 | 50.4 | 56.7 | 38.0 | 34.8 | 16.6 | 45.8 | 53.9 | 41.5 | 48.1 | 41.2 |
| | MPSR | 39.7 | 65.5 | 55.1 | 68.5 | 47.4 | 55.2 | 47.8 | 10.4 | 45.2 | 47.5 | 48.8 | 39.9 | 20.9 | 56.6 | 68.1 | 48.4 | 45.8 | 48.0 |
| | SRR-FSD (Ours) | 46.1 | 58.6 | 64.6 | 63.5 | 43.2 | 55.2 | 44.2 | 12.3 | 56.5 | 51.3 | 39.8 | 40.8 | 20.4 | 55.5 | 65.4 | 51.9 | 41.3 | 46.9 |
| 10 | FSRW | 30.0 | 62.7 | 43.2 | 60.6 | 39.6 | 47.2 | 43.2 | 13.9 | 41.5 | 58.1 | 39.2 | 39.2 | 20.1 | 51.8 | 55.6 | 42.4 | 36.6 | 41.3 |
| | Meta R-CNN | 52.5 | 55.9 | 52.7 | 54.6 | 41.6 | 51.5 | 52.8 | 3.0 | 52.1 | 70.0 | 49.2 | 45.4 | 13.9 | 72.6 | 58.3 | 47.8 | 47.6 | 48.1 |
| | MPSR | 48.3 | 73.7 | 68.2 | 70.8 | 48.2 | 61.8 | 51.8 | 16.7 | 53.1 | 66.4 | 51.2 | 47.8 | 24.4 | 55.8 | 67.5 | 50.4 | 50.5 | 49.7 |
| | SRR-FSD (Ours) | 45.0 | 67.4 | 63.1 | 65.2 | 43.3 | 56.8 | 46.2 | 18.4 | 54.0 | 59.1 | 41.4 | 43.8 | 17.1 | 55.1 | 67.4 | 47.5 | 44.7 | 46.4 |

Table 10: AP50 performance of each novel class on the few-shot VOC dataset. S1, S2, and S3 denote Novel Set 1, Novel Set 2, and Novel Set 3. Bold font indicates the best result in the group. Our SRR-FSD trained with visual information and semantic relation demonstrates shot-stable performance.

Appendix E Finetuning More Parameters

Similar to TFA [tfa], we have a finetuning stage to generalize the detector to novel classes. For the classification subnet, we finetune the parameters in the relation reasoning module and the projection matrix while all the parameters in the preceding layers are frozen. Some may argue that the improvement of our SRR-FSD over the baseline is due to more parameters being finetuned in the relation reasoning module compared to the Faster R-CNN [faster-rcnn] baseline. But Table 9 shows that finetuning more parameters does not necessarily lead to better results. We take the TFA model, which is essentially a Faster R-CNN finetuned with only the last layer trainable, and gradually unfreeze the preceding layers. It turns out that involving more parameters in finetuning does not change the results substantially, and that too many parameters lead to severe overfitting.

Appendix F Complete Results on VOC

In Table 10, we present the complete results on the VOC [voc] dataset as in FSRW [yolo-fewshot] and Meta R-CNN [meta-rcnn]. We also include the very recent MPSR [mpsr] for comparison. MPSR develops an auxiliary branch to generate multi-scale positive samples as object pyramids and to refine the prediction at various scales. Note that MPSR improves its baseline by a considerable margin, but its research direction is orthogonal and complementary to ours because it is still exclusively dependent on visual information. By combining visual information with semantic relation reasoning, our approach can therefore achieve superior performance at extremely low shot (e.g. 1, 2) conditions.

Appendix G Interpretation of the Dynamic Relation Graph

In the relation reasoning module, we propose to learn a dynamic relation graph driven by the data, which is conceptually different from the predefined fixed knowledge graphs used in [zsr-gnn, multi-label-gcn, fsr-kt]. We implement the dynamic graph with the self-attention architecture [attention]. Although it is in the form of a feedforward network, it can also be interpreted as a computation related to the knowledge graph. If we denote the transformations in the linear layers , , , as , , , respectively, we can formulate the relation reasoning in Eq. (4)


where is the matrix of augmented word embeddings after the relation reasoning which will be used as the weights to compute classification scores and is the softmax function operated on the last dimension of the input matrix. The item can be interpreted as a dynamic knowledge graph in which the learnable parameters are and . And it is involved in the computation of the classification scores via the graph convolution operation [gcn], which connects the word embeddings in to allow knowledge propagation among them. The item can be viewed as a learnable transformation applied to each embedding independently.
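A minimal sketch of this interpretation, in pure Python, may help. The names are placeholders chosen for illustration: `W` for the word-embedding matrix and `Tf`, `Tg`, `Th`, `Tl` for the four linear transformations. The shapes, the pairing of `Tf` and `Tg` inside the softmax, and the residual connection are assumptions of this sketch rather than a statement of the exact form of Eq. (4).

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Softmax over the last dimension, so each row sums to 1."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def relation_reasoning(W, Tf, Tg, Th, Tl):
    """Augment embeddings W (N x d) using a data-dependent relation graph."""
    # Dynamic graph: N x N attention between dimension-reduced embeddings.
    graph = softmax_rows(matmul(matmul(W, Tf), transpose(matmul(W, Tg))))
    # Transformation applied to each embedding independently (d -> d_r -> d).
    transformed = matmul(matmul(W, Th), Tl)
    # Graph-convolution-style propagation across classes.
    propagated = matmul(graph, transformed)
    # Residual connection (an assumption of this sketch).
    augmented = [[w + p for w, p in zip(wr, pr)] for wr, pr in zip(W, propagated)]
    return graph, augmented


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N, d, d_r = 4, 6, 2  # toy sizes: 4 classes, 6-d embeddings reduced to 2-d
W = rand_matrix(N, d, rng)
Tf = rand_matrix(d, d_r, rng)
Tg = rand_matrix(d, d_r, rng)
Th = rand_matrix(d, d_r, rng)
Tl = rand_matrix(d_r, d, rng)

graph, W_aug = relation_reasoning(W, Tf, Tg, Th, Tl)
print(len(graph), len(graph[0]), len(W_aug), len(W_aug[0]))
```

Each row of `graph` is a probability distribution over classes, which is what lets it be read as a (soft) adjacency matrix. Because it is recomputed from the embeddings and the learnable `Tf` and `Tg`, it changes during training, unlike a predefined knowledge graph.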