Universal-Prototype Augmentation for Few-Shot Object Detection

Few-shot object detection (FSOD) aims to strengthen the performance of novel object detection with only a few labeled samples. To alleviate the constraint of few samples, enhancing the generalization ability of features learned for novel objects plays a key role. Thus, the feature learning process of FSOD should focus more on intrinsic object characteristics, which are invariant under different visual changes and therefore helpful for feature generalization. Unlike previous attempts of the meta-learning paradigm, in this paper, we explore how to augment object features with intrinsic characteristics that are universal across different object categories. We propose a new prototype, namely the universal prototype, that is learned from all object categories. Besides the advantage of characterizing invariant characteristics, the universal prototypes alleviate the impact of unbalanced object categories. After augmenting object features with the universal prototypes, we impose a consistency loss to maximize the agreement between the augmented features and the original ones, which is beneficial for learning invariant object characteristics. Thus, we develop a new framework of few-shot object detection with universal prototypes (FSOD^up) that has the merit of feature generalization towards novel objects. Experimental results on PASCAL VOC and MS COCO demonstrate the effectiveness of FSOD^up. Particularly, for the 1-shot case of VOC Split2, FSOD^up outperforms the baseline by 6.8% in terms of mAP. Moreover, we further verify FSOD^up on a long-tail detection dataset, i.e., LVIS, where FSOD^up outperforms the state-of-the-art method.


1 Introduction

Figure 1: Universal prototypes (colorful stars) are learned from all object categories and are not specific to any particular category. Universal prototypes capture different intrinsic object characteristics via latent projection, e.g., one prototype incorporates object characteristics of 'car' and 'motorbike'.

Recently, owing to the success of deep convolutional neural networks (CNNs), great progress has been made on object detection

[ren2015faster, girshick2015fast, he2017mask, girshick2014rich]. However, the outstanding performance [redmon2017yolo9000, liu2016ssd, carion2020end, lin2017feature] depends on abundant annotated objects in training images for each object category. As a challenging task, few-shot object detection (FSOD) [kang2019few, wang2019meta] mainly aims to improve the detection performance for novel objects that belong to certain categories but appear rarely in the annotated training images.

The main challenge of FSOD lies in how to learn generalized object features from both the abundant samples of base object categories and the few samples of novel categories, such that the features simultaneously describe invariant object characteristics and alleviate the impact of unbalanced categories. Recently, the meta-learning strategy [snell2017prototypical, tian2020rethinking, finn2017model] has been utilized in [yan2019meta, xiao2020few, wang2019meta, fan2020few] to adapt the representation ability from base object categories to novel categories. However, the weak performance compared to basic fine-tuning methods [wang2020frustratingly, wu2020multi, chen2019a, dhillon2020a] shows that the meta-learning technique fails to improve the generalization ability of object feature learning. One possible reason is that the adaptation process in the meta-learning mechanism cannot sufficiently capture the invariant characteristics across categories. This invariance, i.e., being invariant under different visual changes such as texture variances or environmental noises, is closely associated with intrinsic object characteristics. As demonstrated in [lyle2019analysis], models that extract invariant representations often generalize better than their non-invariant counterparts. Therefore, in this paper, we explore how to enhance the generalization ability of object feature learning with invariant object characteristics.

We devise universal prototypes (as shown in Fig. 1) to learn the invariant object characteristics. Different from prototypes that are separately learned for each category [snell2017prototypical, liu2019prototype, wang2019panet], the proposed universal prototypes are learned from all object categories. The benefits are two-fold. On the one hand, prototypes learned from all categories capture rich information not only from different object categories but also from the contexts of images. On the other hand, the universal prototypes reduce the impact of data imbalance across different categories. Moreover, via fine-tuning, the universal prototypes can be effectively adapted to data-scarce novel categories. To this end, we develop a new framework of few-shot object detection with universal prototypes (FSOD^up). Particularly, we utilize soft-attention over the learned universal prototypes to augment the object features. Such a universal-prototype augmentation (i.e., each element of the augmented features is a combination of prototypes) aims to simultaneously enhance invariance and retain the semantic information of the original object features. Here, we employ a consistency loss to maximize the agreement between the augmented and original object features. During training, we first train the model on data-abundant base categories. Then, the model is fine-tuned on a re-constructed training set that contains a small number of balanced training samples from both base and novel object categories. Experimental results on multiple benchmarks demonstrate the effectiveness of the proposed method.

The contributions are summarized as follows:

(1) Towards the task of FSOD, we devise a dedicated prototype, i.e., the universal prototype, and a new framework of FSOD with universal-prototype augmentation.

(2) We successfully demonstrate that, after fine-tuning with universal-prototype augmented features, object detectors effectively adapt to novel categories.

(3) We obtain strong performance for FSOD on PASCAL VOC [everingham2010pascal, everingham2015pascal] and MS COCO [lin2014microsoft]; e.g., for the 1-shot case of VOC Split2, our method outperforms the baseline [wang2020frustratingly] by about 6.8% mAP. Moreover, on the long-tail benchmark LVIS [gupta2019lvis], embedding our method into a state-of-the-art method [li2020overcoming] further boosts its performance. These results empirically verify that universal-prototype augmentation enhances invariance and generalization.

2 Related Work

Few-shot image classification. Few-shot image classification [vinyals2016matching, ravi2017optimization, sung2018learning, hariharan2017low, gidaris2018dynamic] aims to recognize novel categories with only a few samples per category. Meta-learning is a widely used approach to few-shot classification [lu2020learning], which leverages task-level meta knowledge to help the model adapt to new tasks with few labeled samples. Vinyals et al. [vinyals2016matching] and Snell et al. [snell2017prototypical] employed the meta-learning policy to learn a similarity metric that is transferable across different tasks. Particularly, based on the meta-learning policy, the prototypical network [snell2017prototypical] takes the mean of the embeddings of support samples from the same category as the prototype of that category. Classification is then performed by computing distances between sample representations and the prototype of each category. However, when the data is unbalanced or scarce, the learned prototypes cannot represent the information of each category accurately, which affects the classification performance. Besides, during meta-learning, Gidaris et al. [gidaris2018dynamic] and Wang et al. [wang2019tafe] introduced new parameters to promote adaptation to novel tasks. However, these meta-learning methods for few-shot image classification cannot be directly applied to object detection, which requires localizing as well as recognizing objects.

Figure 2: The architecture of few-shot object detection with universal-prototype augmentation. 'Conv' and 'fc layer' separately indicate a convolutional layer and a fully-connected layer. The colorful stars are the learned universal prototypes. '⊖' and '[,]' denote the residual operation and the concatenation operation, respectively. We focus on improving the generalization of detectors via learning invariant object characteristics. Firstly, universal prototypes are learned from all object categories. With the output of the RPN (Region Proposal Network), we obtain the conditional prototypes via a conditional transformation of the universal prototypes. Next, the augmented object features are calculated based on the conditional prototypes. Finally, a consistency loss is computed between the augmented and original features.

Few-shot object detection. Most existing methods employ meta-learning [fan2020few, Karlinsky_2019_CVPR] or fine-tuning [yang2020context, wu2020multi] strategies to solve FSOD. Specifically, Wang et al. [wang2019meta] developed a meta-learning based framework that leverages meta-level knowledge from data-abundant base categories to learn a detector for novel categories. Yan et al. [yan2019meta] further extended Faster R-CNN [ren2015faster] by performing meta-learning over RoI (Region-of-Interest) features. However, the weak performance compared to basic fine-tuning methods shows that meta-learning based methods fail to improve the generalization ability of object detectors. Among fine-tuning based methods, starting from a model pre-trained on the base categories, Wang et al. [wang2020frustratingly] employed a two-stage fine-tuning process, i.e., fine-tuning only the last layers of the detector while freezing the other parameters, to make the object predictor adapt to novel categories. Wu et al. [wu2020multi] proposed a multi-scale positive sample refinement method to handle scale variations in object detection, which is similar to data augmentation [zoph2019learning].

Different from previous methods for FSOD, in this paper, we propose to learn universal prototypes from all object categories. Via fine-tuning, the universal prototypes can be adapted to novel categories effectively. To this end, we develop a new framework of few-shot object detection with universal-prototype augmentation. Experimental results and visualization analysis demonstrate the effectiveness of universal-prototype augmentation.

3 FSOD with Universal Prototypes

In this paper, we follow the same FSOD settings introduced in Kang et al. [kang2019few]. Annotated detection data are divided into a set of base categories that have abundant instances and a set of novel categories that have only a few (usually less than 30) instances per category. The main purpose is to improve the detection performance on novel categories without degrading that on base categories.
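To make this setting concrete, the following sketch (a hypothetical helper, not the authors' released data pipeline) builds the balanced set used for few-shot fine-tuning by keeping at most K annotated instances per category, for base and novel categories alike. The annotation format is an assumption of this sketch.

```python
import random
from collections import defaultdict


def build_balanced_kshot_set(annotations, k, seed=0):
    """Keep at most k annotated instances per category (base and novel alike).

    `annotations` is assumed to be a list of dicts with 'image_id' and
    'category' keys; the exact format differs per dataset loader.
    """
    random.seed(seed)
    per_category = defaultdict(list)
    for ann in annotations:
        per_category[ann["category"]].append(ann)

    kept = []
    for category, anns in per_category.items():
        random.shuffle(anns)
        kept.extend(anns[:k])  # k shots for every category
    return kept
```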

3.1 Learning of Universal Prototypes

Recently, many methods [snell2017prototypical, liu2019prototype, wang2019panet] construct a prototype for each category to solve few-shot image classification. Though prototypes reflecting category information have been demonstrated to be effective for image classification, they cannot be directly applied to FSOD. The reason may be that these category-specific prototypes represent image-level information and fail to capture object characteristics that are helpful for localizing and recognizing objects. Different from category-specific prototypes, we attempt to learn universal prototypes from all object categories, which are beneficial for capturing intrinsic object characteristics that are invariant under different visual changes.

Concretely, the left part of Fig. 2 shows the learning process of universal prototypes. We adopt the widely used Faster R-CNN [ren2015faster], a two-stage object detector, as the base detection model. Given an input image, we first employ the feature extractor, e.g., ResNet [he2016deep], to extract the corresponding features $F \in \mathbb{R}^{W \times H \times C}$, where $W$, $H$, and $C$ separately denote the width, height, and number of channels. Then, the universal prototypes are defined as $\mathcal{U} = \{u_k\}_{k=1}^{K}$, where each $u_k \in \mathbb{R}^{C}$ and $K$ is the number of prototypes. Next, based on the prototypical set $\mathcal{U}$, we calculate descriptors that represent image-level information:

$$v_k = W_2 * \Big(\sum_{i=1}^{WH} a_{i,k}\,(f_i \ominus u_k)\Big), \qquad a_{i,k} = \frac{\exp\big((W_1 * F)_{i,k}\big)}{\sum_{k'=1}^{K}\exp\big((W_1 * F)_{i,k'}\big)}, \qquad (1)$$

where $W_1$ and $W_2$ are convolutional parameters, $f_i$ is the feature at the $i$-th spatial location of $F$, and $a_{i,k}$ is the soft-assignment of $f_i$ to the $k$-th prototype. $v = \{v_k\}_{k=1}^{K}$ represents the output descriptors, and '$\ominus$' indicates the residual operation. Finally, we take the concatenated result of $F$ and the descriptors $v$ as the input of the RPN module. This process is shown as follows:

$$\hat{F} = h\big(\,[\,F,\; W_3\,\tilde{v} + b_3\,]\,\big), \qquad (2)$$

where $\tilde{v}$ is the reshaped result of $v$, and $W_3$ and $b_3$ are the parameters of a fully-connected layer. '[,]' is the concatenation operation. $h(\cdot)$ consists of two convolutional layers with ReLU activation and is used to transform the concatenated result. Finally, $O \in \mathbb{R}^{N \times S \times S \times C}$ is the output of the RPN with RoI Pooling [ren2015faster, he2017mask] applied to $\hat{F}$, where $N$ and $S$ separately indicate the number of proposals and the spatial size of each proposal. The feature dimension of $O$ is the same as that of $F$.
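To make the prototype-learning step more concrete, below is a minimal PyTorch-style sketch of one plausible reading of Eqs. (1)-(2): spatial features are softly assigned to the $K$ universal prototypes, the residuals are aggregated into descriptors, and the descriptors are fused back into the feature map before the RPN. The class name, layer choices (a 1x1 assignment convolution, a linear projection in place of the second convolution, a mean-pooled descriptor broadcast), and the shapes are assumptions of this sketch rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UniversalPrototypeDescriptors(nn.Module):
    """Soft-assign every spatial feature to K universal prototypes, aggregate
    residuals into image-level descriptors (one reading of Eq. (1)), and fuse
    the descriptors back into the feature map fed to the RPN (Eq. (2))."""

    def __init__(self, channels: int, num_prototypes: int = 24):
        super().__init__()
        # K randomly initialized universal prototypes, shared by all categories.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, channels))
        self.assign = nn.Conv2d(channels, num_prototypes, kernel_size=1)  # ~W1
        self.project = nn.Linear(channels, channels)                      # ~W2
        self.fuse = nn.Sequential(                                        # ~h(.)
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) backbone features.
        b, c, h, w = feat.shape
        assign = F.softmax(self.assign(feat), dim=1)      # (B, K, H, W) soft-assignment
        flat = feat.flatten(2)                            # (B, C, HW)
        a = assign.flatten(2)                             # (B, K, HW)
        # Residual aggregation: sum_i a_{i,k} * (f_i - u_k) for every prototype k.
        weighted = torch.einsum("bkn,bcn->bkc", a, flat)  # sum_i a_{i,k} f_i
        counts = a.sum(dim=2, keepdim=True)               # (B, K, 1)
        residual = weighted - counts * self.prototypes.unsqueeze(0)
        descriptors = self.project(residual)              # (B, K, C)
        # Fuse a pooled, broadcast copy of the descriptors with the original map.
        pooled = descriptors.mean(dim=1)                  # (B, C)
        tiled = pooled[:, :, None, None].expand(b, c, h, w)
        return self.fuse(torch.cat([feat, tiled], dim=1))  # input to the RPN
```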

3.2 Augmentation of Object Features

As shown in the right part of Fig. 2, we first compute the conditional prototypes based on the universal prototypes $\mathcal{U}$. Then, we conduct the augmentation of object features with the conditional prototypes.

3.2.1 The Computation of Conditional Prototypes

Since the computation of Eq. (1) is based on the extracted features $F$ that represent the whole input image, the universal prototypes mainly reflect image-level information. Here, image-level information includes object-level information as well as other associated information about the image content. In contrast, after the RPN, the proposal features $O$ mainly contain object-level information. Thus, directly using the universal prototypes may not accurately represent object-level information. To this end, we apply an affine transformation to promote the universal prototypes to move towards the space of object-level features. The affine transformation is shown as follows:

$$u_k' = \gamma \odot u_k + \beta, \qquad (3)$$

where $\gamma$ and $\beta$ are the transformation parameters and $\odot$ is the element-wise product. Finally, $\mathcal{U}' = \{u_k'\}_{k=1}^{K}$ represents the conditional prototypes. Next, we employ the same process as Eq. (1) to generate object-level descriptors:

$$z_k = W_5 * \Big(\sum_{j} a'_{j,k}\,(o_j \ominus u_k')\Big), \qquad a'_{j,k} = \frac{\exp\big((W_4 * O)_{j,k}\big)}{\sum_{k'=1}^{K}\exp\big((W_4 * O)_{j,k'}\big)}, \qquad (4)$$

where $o_j$ is the $j$-th element of the proposal features $O$, $W_4$ and $W_5$ are convolutional parameters, $u_k'$ is the $k$-th conditional prototype of $\mathcal{U}'$, and $z = \{z_k\}_{k=1}^{K}$ indicates the output descriptors. Finally, we take the concatenated result of $O$ and the descriptors $z$ as the input of the classifier:

$$p = \mathrm{cls}\Big(g\big(\,[\,O,\; W_6\,\tilde{z} + b_6\,]\,\big)\Big), \qquad (5)$$

where $\tilde{z}$ is the reshaped result of $z$, and $W_6$ and $b_6$ are the parameters of a fully-connected layer. $g(\cdot)$ consists of two fully-connected layers and outputs a matrix whose first dimension equals the number of proposals $N$. Finally, $p$ is the predicted classification probability. In the experiments, we find that employing the descriptors $z$ generated from the conditional prototypes improves the performance of FSOD, which demonstrates the effectiveness of the conditional prototypes.
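As an illustration, the following sketch implements the conditional transformation of Eq. (3) as a learned channel-wise affine map. Whether the scale and shift are further conditioned on the proposal features is left open in the text, so here they are plain learned parameters; this is an assumption of the sketch, not the definitive implementation.

```python
import torch
import torch.nn as nn


class ConditionalPrototypes(nn.Module):
    """One plausible reading of Eq. (3): a learned channel-wise affine
    transformation that shifts the universal prototypes towards the space
    of object-level (post-RPN) features."""

    def __init__(self, channels: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels))   # scale
        self.beta = nn.Parameter(torch.zeros(1, channels))   # shift

    def forward(self, prototypes: torch.Tensor) -> torch.Tensor:
        # prototypes: (K, C) universal prototypes -> (K, C) conditional prototypes.
        return self.gamma * prototypes + self.beta
```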

3.2.2 Augmentation with Conditional Prototypes

In Fig. 3, we show the details of the augmentation of object features. Specifically, for the proposal features $O$ and the conditional prototypes $\mathcal{U}'$, we separately employ a convolutional layer and a fully-connected layer to project $O$ and $\mathcal{U}'$ into a common embedding space, yielding $\tilde{O}$ and $\tilde{\mathcal{U}}'$. Then, based on each element of $\tilde{O}$, we calculate soft-attention over $\tilde{\mathcal{U}}'$ to obtain the augmented object features:

$$o^{a}_{j} = \sum_{k=1}^{K} \alpha_{j,k}\,\tilde{u}_k', \qquad \alpha_{j} = \mathrm{softmax}\big(\varphi(\tilde{O})_{j}\big), \qquad (6)$$

where $\varphi(\tilde{O})_{j}$ is computed from the $j$-th element of $\tilde{O}$ and $\alpha_{j}$ denotes the corresponding attention weights over the $K$ prototypes. $\varphi(\cdot)$ consists of two convolutional layers with ReLU activation, and its output dimension is $K$. $\tilde{u}_k'$ is the $k$-th element of $\tilde{\mathcal{U}}'$. Finally, $O^{a} = \{o^{a}_{j}\}$ is the augmented object features. Next, $O^{a}$ is taken as the input of the classifier to output the predicted probability:

$$\hat{p} = \mathrm{cls}\big(O^{a}\big), \qquad (7)$$

where $\hat{p}$ is the predicted probability from the augmented features. Besides, Eq. (5) and Eq. (7) share the same classifier.
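A minimal sketch of the augmentation step of Eqs. (6)-(7) is given below: each spatial element of a proposal feature attends over the projected conditional prototypes, and the augmented feature is the attention-weighted combination of prototype embeddings. Keeping the embedding dimension equal to the channel dimension (so the classifier shared by Eqs. (5) and (7) can be reused) and the exact form of $\varphi$ are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeAugmentation(nn.Module):
    """Sketch of universal-prototype augmentation (Eqs. (6)-(7)): every spatial
    element of an RoI feature becomes an attention-weighted combination of the
    (projected) conditional prototypes."""

    def __init__(self, channels: int, num_prototypes: int):
        super().__init__()
        self.embed_feat = nn.Conv2d(channels, channels, kernel_size=1)
        self.embed_proto = nn.Linear(channels, channels)
        self.attend = nn.Sequential(          # ~phi(.): attention logits over K prototypes
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_prototypes, kernel_size=1),
        )

    def forward(self, proposal_feat: torch.Tensor, cond_protos: torch.Tensor):
        # proposal_feat: (N, C, S, S) RoI features; cond_protos: (K, C).
        feat = self.embed_feat(proposal_feat)                    # (N, C, S, S)
        protos = self.embed_proto(cond_protos)                   # (K, C)
        attn = F.softmax(self.attend(feat), dim=1)               # (N, K, S, S)
        # Each spatial element becomes a convex combination of prototype embeddings.
        augmented = torch.einsum("nkhw,kc->nchw", attn, protos)  # (N, C, S, S)
        return augmented  # fed to the same classifier as the original features
```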

Figure 3: Augmentation of object features. Based on each element of the RPN output features $O$, we calculate soft-attention over the conditional prototypes to generate the augmented object features. Each element of the augmented features is a combination of conditional prototypes, which retains the semantic information of the original output $O$.
Figure 4: Illustration of the two-stage fine-tuning approach for FSOD^up. In the base training stage, the entire detector, including the feature extractor, the module for learning universal prototypes, and the module for augmentation based on conditional prototypes, is jointly trained on the data-abundant base categories. In the few-shot fine-tuning stage, the entire detector is fine-tuned on a balanced training set consisting of a few samples from both base and novel categories.

3.3 Two-stage Fine-tuning Approach

Many semi-supervised learning methods [berthelot2019mixmatch, berthelot2019remixmatch] rely on a consistency loss to enforce that the model output remains unchanged when the input is perturbed. Inspired by this idea, to learn invariant object characteristics, we compute a consistency loss between the prediction from the original features (Eq. (5)) and the prediction from the augmented features (Eq. (7)). Particularly, the KL-Divergence loss is employed to enforce consistent predictions, i.e., the consistency loss $\mathcal{L}_{con}$ is the KL-Divergence between $p$ and $\hat{p}$. The joint training loss is defined as follows:

$$\mathcal{L} = \mathcal{L}_{rpn} + \mathcal{L}_{cls} + \mathcal{L}_{reg} + \lambda\,\mathcal{L}_{con}, \qquad (8)$$

where $\mathcal{L}_{rpn}$ is the loss of the RPN, used to distinguish foreground from background and refine bounding-box anchors. $\mathcal{L}_{cls}$ and $\mathcal{L}_{reg}$ separately indicate the classification loss and the box regression loss, and $\lambda$ is a hyper-parameter.

During training, we employ a two-stage fine-tuning approach (as shown in Fig. 4) to optimize the model. Concretely, in the base training stage, we employ the joint loss $\mathcal{L}$ to optimize the entire model on the data-abundant base classes. After the base training stage, only the last fully-connected layer (for classification) of the detection head is replaced, and the new classification layer is randomly initialized. Besides, during the few-shot fine-tuning stage, different from the work [wang2020frustratingly], none of the network layers are frozen, and we still employ the loss $\mathcal{L}$ to fine-tune the entire model on a balanced training set consisting of a few samples from both base and novel categories.
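The sketch below wires up the consistency term and the joint objective of Eq. (8). The direction of the KL term and the use of raw logits are assumptions of this sketch, since the text only states that a KL-Divergence is computed between the two predictions.

```python
import torch
import torch.nn.functional as F


def consistency_loss(logits_orig: torch.Tensor, logits_aug: torch.Tensor) -> torch.Tensor:
    """KL-divergence between the prediction from the original features (Eq. (5))
    and the prediction from the augmented features (Eq. (7))."""
    p_orig = F.softmax(logits_orig, dim=-1)
    log_p_aug = F.log_softmax(logits_aug, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_p_aug, p_orig, reduction="batchmean")


def joint_loss(loss_rpn, loss_cls, loss_reg, loss_con, lam: float = 1.0):
    """Joint objective of Eq. (8): detection losses plus the weighted consistency term."""
    return loss_rpn + loss_cls + loss_reg + lam * loss_con
```

In both stages the same joint loss is used; only the classification layer is re-initialized before fine-tuning, and no layers are frozen.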

3.4 Further Discussion

In this section, we further discuss universal prototypes for few-shot object detection.

Though prototypes have been demonstrated to be effective for few-shot image classification [snell2017prototypical, vinyals2016matching], it is unclear how to build prototypes for FSOD [kang2019few]. (1) If we follow few-shot image classification and construct a prototype for each category, the computational cost increases in the case of a large number of object categories. Meanwhile, due to the unbalanced object categories, the constructed prototypes may not accurately reflect category information. (2) Related to the above, the prototype of a certain object category can be affected by co-appearing objects of other categories in the same image, which may degrade its quality. (3) More importantly, since the number of object categories in the base training stage differs from that in the few-shot fine-tuning stage, constructing a prototype for each object category makes it impossible to align the prototypes between base training and few-shot fine-tuning. That is to say, prototypes pre-trained on base categories cannot be directly utilized in the fine-tuning stage. Therefore, for fine-tuning based methods, it is difficult to build a prototype for each category.

To solve FSOD, we propose to learn universal prototypes from all object categories. The universal prototypes are not specific to certain object categories and can be effectively adapted to novel categories via fine-tuning. In the experiments, we find that the universal prototypes are helpful for characterizing the regional information of different object categories. Meanwhile, with the help of universal-prototype augmentation, the performance of few-shot detection can be significantly improved.

Novel Set 1 Novel Set 2 Novel Set 3
Method / Shot 1 2 3 5 10 1 2 3 5 10 1 2 3 5 10
LSTD [chen2018lstd] 8.2 1.0 12.4 29.1 38.5 11.4 3.8 5.0 15.7 31.0 12.6 8.5 15.0 27.3 36.3
YOLO-FS [kang2019few] 14.8 15.5 26.7 33.9 47.2 15.7 15.3 22.7 30.1 39.2 19.2 21.7 25.7 40.6 41.3
Meta R-CNN [yan2019meta] 19.9 25.5 35.0 45.7 51.5 10.4 19.4 29.6 34.8 45.4 14.3 18.2 27.5 41.2 48.1
MetaDet [wang2019meta] 18.9 20.6 30.2 36.8 49.6 21.8 23.1 27.8 31.7 43.0 20.6 23.9 29.4 43.9 44.1
RepMet [Karlinsky_2019_CVPR] 26.1 32.9 34.4 38.6 41.3 17.2 22.1 23.4 28.3 35.8 27.5 31.1 31.5 34.4 37.2
FSOD-VE [xiao2020few] 24.2 35.3 42.2 49.1 57.4 21.6 24.6 31.9 37.0 45.7 21.2 30.0 37.2 43.8 49.6
TFA w/fc [wang2020frustratingly] 36.8 29.1 43.6 55.7 57.0 18.2 29.0 33.4 35.5 39.0 27.7 33.6 42.5 48.7 50.2
TFA w/cos [wang2020frustratingly] 39.8 36.1 44.7 55.7 56.0 23.5 26.9 34.1 35.1 39.1 30.8 34.8 42.8 49.5 49.8
w/fc [xiao2020few, wang2020frustratingly] 22.9 34.5 40.4 46.7 52.0 16.9 26.4 30.5 34.6 39.7 15.7 27.2 34.7 40.8 44.6
w/cos [xiao2020few, wang2020frustratingly] 25.3 36.4 42.1 47.9 52.8 18.3 27.5 30.9 34.1 39.5 17.9 27.2 34.3 40.8 45.6
MPSR [wu2020multi] 40.7 41.2 48.9 53.6 60.3 24.4 29.3 39.2 39.9 47.8 32.9 34.4 42.3 48.0 49.2
Ours (FSOD^up) 43.8 47.8 50.3 55.4 61.7 31.2 30.5 41.2 42.2 48.3 35.5 39.7 43.9 50.6 53.5
Table 1: Few-shot detection performance (mAP (%)) on the PASCAL VOC dataset. We evaluate the performance on three different sets of novel categories. The IoU threshold is set to 0.5. For some of the compared methods, we directly run the released code to obtain the results.

4 Experiments

We first evaluate our method on PASCAL VOC [everingham2010pascal, everingham2015pascal] and MS COCO [lin2014microsoft]. For a fair comparison, we use the settings in [kang2019few, yan2019meta] to construct the few-shot detection datasets. Concretely, for PASCAL VOC, the 20 classes are randomly divided into 5 novel classes and 15 base classes. Here, we follow the work [kang2019few] and use the same three class splits, where only $K$ object instances are available for each novel category and $K$ is set to 1, 2, 3, 5, 10. For MS COCO, the 20 categories overlapping with PASCAL VOC are used as novel categories with $K$ = 10, 30, and the remaining 60 categories are taken as base categories.

Next, to further verify the effectiveness of our method, we conduct experiments on the task of long-tail object detection. We evaluate on the recently released benchmark of Large Vocabulary Instance Segmentation (LVIS) [gupta2019lvis], which contains 1,230 categories with both bounding box and instance mask annotations. The number of images per category in LVIS follows a long-tail distribution.

Implementation Details. Faster R-CNN [ren2015faster] is used as the base detector. Our backbone is ResNet-101 [he2016deep] with the RoI Align [he2017mask] layer. We use weights pre-trained on ImageNet [russakovsky2015imagenet] for initialization. For FSOD, the number of universal prototypes $K$ (see Eq. (1)) is set to 24. For long-tail object detection, since the number of categories in LVIS [gupta2019lvis] is much larger than that of MS COCO [lin2014microsoft], the number of universal prototypes is set to 28. All prototypes are randomly initialized. Next, for FSOD, the model is trained with a batch size of 2 on 2 GPUs, 1 image per GPU. Meanwhile, to alleviate the impact of scale variations, we employ the positive sample refinement of [wu2020multi]. For long-tail object detection, we train our model with a batch size of 2 on 1 GPU. The hyper-parameter $\lambda$ (see Eq. (8)) is set to 1.0. All models are trained using the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001. Finally, during inference, we take the output of Eq. (5) as the classification result.
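For reference, the stated optimization hyper-parameters can be wired up as follows; the learning rate and schedule are not given in the text and are placeholders, and the model here is only a stand-in for the full detector.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the full FSOD^up detector
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # assumption: not specified in the paper text
    momentum=0.9,       # stated
    weight_decay=1e-4,  # stated
)
LAMBDA = 1.0            # weight of the consistency loss in Eq. (8)
NUM_PROTOTYPES = 24     # for FSOD; 28 for LVIS
```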

4.1 Performance Analysis of Few-Shot Detection

We compare with two baseline methods, i.e., TFA [wang2020frustratingly] and MPSR [wu2020multi]. Both approaches use the two-stage fine-tuning strategy to solve FSOD.

Results on PASCAL VOC. Table 1 shows the results on PASCAL VOC. As the number of samples of the novel categories decreases, the performance degrades significantly, which indicates that addressing the few-shot problem is crucial for improving the generalization of detectors. We can see that the proposed method consistently outperforms the two baseline methods. Particularly, for the 1-shot case of Novel Set 2, the 2-shot case of Novel Set 1, and the 10-shot case of Novel Set 3, our method is 6.8%, 6.6%, and 4.3% higher than MPSR, respectively. This shows that employing universal-prototype augmentation is helpful for learning invariant object characteristics and thus improves performance. Meanwhile, this also indicates that focusing on invariance plays a key role in solving FSOD.

Figure 5: Detection results for the 5-shot case. The first row shows the results of MPSR [wu2020multi]. The second row shows our detection results. Our method detects the objects accurately.

In Fig. 5, we show the detection results of MPSR [wu2020multi] and our method. 'bird' and 'bus' belong to the novel categories. We can see that our method successfully detects the objects present in the images. Particularly, for these three images, the detection results of MPSR are not accurate, e.g., two bounding boxes appear in the bird image. This further shows that the proposed universal-prototype augmentation is helpful for capturing invariant object characteristics, which improves detection accuracy.

Results on MS COCO. Table 2 shows the few-shot detection performance on the MS COCO dataset. Our method consistently outperforms the two baseline methods, i.e., TFA [wang2020frustratingly] and MPSR [wu2020multi]. Particularly, for the 10-shot and 30-shot cases, our method is 1.5% and 1.8% higher than MPSR, respectively. This further demonstrates the effectiveness of the proposed universal-prototype augmentation. Besides, FSOD-VE [xiao2020few] is a recently proposed meta-learning based method, which combines FSOD with few-shot viewpoint estimation and follows Meta R-CNN [yan2019meta] to optimize detectors. Though FSOD-VE performs better than our method in the 10-shot case, our method outperforms FSOD-VE on small objects. Meanwhile, compared with FSOD-VE, the training of our method is much simpler, and we do not use viewpoint information. These results further demonstrate that exploiting universal-prototype augmentation is helpful for improving the generalization of detectors.

Shots Method AP AP75 APS APM APL
10 LSTD [chen2018lstd] 3.2 2.1 0.9 2.0 6.5
YOLO-FS [kang2019few] 5.6 4.6 0.9 3.5 10.5
Meta R-CNN [yan2019meta] 8.7 6.6 2.3 7.7 14.0
MetaDet [wang2019meta] 7.1 6.1 1.0 4.1 12.2
FSOD-VE [xiao2020few] 12.5 9.8 2.5 13.8 19.9
TFA w/fc [wang2020frustratingly] 10.0 9.2
TFA w/cos [wang2020frustratingly] 10.0 9.3
w/fc [xiao2020few, wang2020frustratingly] 9.1 8.5
w/cos [xiao2020few, wang2020frustratingly] 9.1 8.8
MPSR [wu2020multi] 9.5 9.5 3.3 8.2 15.9
Ours (FSOD^up) 11.0 10.7 4.5 11.2 17.3

30
LSTD [chen2018lstd] 6.7 5.1 0.4 2.9 12.3
YOLO-FS [kang2019few] 9.1 7.6 0.8 4.9 16.8
Meta R-CNN [yan2019meta] 12.4 10.8 2.8 11.6 19.0
MetaDet [wang2019meta] 11.3 8.1 1.1 6.2 17.3
FSOD-VE [xiao2020few] 14.7 12.2 3.2 15.2 23.8
TFA w/fc [wang2020frustratingly] 13.4 13.2
TFA w/cos [wang2020frustratingly] 13.7 13.4
w/fc [xiao2020few, wang2020frustratingly] 12.0 11.8
w/cos [xiao2020few, wang2020frustratingly] 12.1 12.0
MPSR [wu2020multi] 13.8 13.5 4.0 12.9 22.9
Ours (FSOD^up) 15.6 15.7 4.7 15.1 25.1
Table 2: Few-shot detection performance (%) on the MS COCO dataset. Here, APS, APM, and APL separately indicate the mAP performance on small, medium, and large objects.
method/shot 1 2 3 5 10
No Condition 38.1 43.8 48.9 55.6 60.6
New Prototype 42.1 44.6 48.8 56.1 60.1
Ours 43.8 47.8 50.3 55.4 61.7
Table 3: Analysis of conditional prototypes. Here, 'No Condition' indicates that we do not use the conditional operation of Eq. (3) and directly use the universal prototypes for augmentation. 'New Prototype' indicates that we define a new set of prototypes to replace the conditional prototypes.

4.2 Ablation Analysis

In this section, based on Novel Set 1 of PASCAL VOC, we conduct an ablation analysis of our method.

Conditional prototypes. In order to sufficiently represent object-level information, we apply an affine transformation (see Eq. (3)) to the universal prototypes (see Eq. (1)) to obtain the conditional prototypes. Next, we conduct an ablation analysis of the conditional prototypes.

Table 3 shows the comparison results. We can see that utilizing the conditional operation improves detection performance significantly. Particularly, for the 2-shot case, our method outperforms 'No Condition' and 'New Prototype' by 4.0% and 3.2%, respectively. This shows that, based on the universal prototypes, the conditional prototypes represent object-level information effectively, which improves detection performance.

The number of universal prototypes. For our method, the number of universal prototypes $K$ (see Eq. (1)) is an important hyper-parameter. If the number is too small, the prototypes cannot sufficiently represent invariant object characteristics. On the contrary, a large number of prototypes increases the number of parameters and the computational cost.

number/shot 1 2 3 5 10
16 41.2 42.7 48.3 54.2 60.1
20 42.5 44.1 50.1 56.0 60.5
24 43.8 47.8 50.3 55.4 61.7
28 42.6 44.6 49.6 56.7 60.6
32 41.4 42.1 49.6 53.9 60.0
Table 4: The impact of the number of universal prototypes. Here, we only vary the number of prototypes and keep the other components unchanged.

Table 4 shows the performance when employing different numbers of prototypes. We can see that the performance with 24 prototypes is the best; when the number is larger or smaller than 24, the performance degrades. This shows that the number of prototypes affects FSOD performance. In general, for the case of a large number of categories, employing more prototypes is helpful for improving the performance. For example, for the LVIS dataset [gupta2019lvis] with 1,230 categories, utilizing 28 prototypes is superior to utilizing 24 prototypes.

Novel Classes Mean
Shot Method bird bus cow mbike sofa Novel Base
2 MPSR [wu2020multi] 36.8 24.8 56.9 59.1 28.4 41.2 65.4
Ours (FSOD^up) 40.7 41.3 58.9 62.2 35.9 47.8 66.3
5 MPSR [wu2020multi] 44.1 60.7 54.3 66.8 42.1 53.6 69.5
Ours (FSOD^up) 47.0 60.5 57.3 66.4 46.1 55.4 69.7
Table 5: AP (%) of each novel category for the 2-/5-shot cases. We also present the mAP (%) of novel and base categories.

In Fig. 6, we analyze the distribution of prototypes for different numbers of shots. Concretely, as the number of novel objects increases, the universal prototypes (see Eq. (1)) become more scattered in order to capture more image-level information and thus improve detection performance. After the RPN, the conditional prototypes are computed to represent object-level information, and the features calculated based on the conditional prototypes are used for classification. Thus, as the number of novel objects increases, the distribution of the conditional prototypes becomes more concentrated to focus on specific categories, which improves detection accuracy.

method mAP APr APc APf
Faster R-CNN 21.0 4.1 19.7 29.3
Finetune tail 22.3 5.7 23.5 27.3
RFS [gupta2019lvis] 23.4 14.6 22.7 27.8
Focal loss-cls [lin2017focal] 19.3 6.6 19.8 23.7
NCM-fc [kang2019decoupling] 16.0 10.3 13.9 20.9
-norm-select [kang2019decoupling] 21.6 6.2 21.0 28.5
BAGS [li2020overcoming] 25.5 16.3 25.2 29.5
Ours (FSOD^up) 26.0 17.3 25.9 29.6
Table 6: Performance (%) analysis on the long-tail detection dataset, i.e., the LVIS [gupta2019lvis] set. APr, APc, and APf separately indicate the performance on rare, common, and frequent categories.

In Fig. 7, we visualize the assignment maps of the universal prototypes, i.e., the soft-assignment in Eq. (1). For each image, we can see that different object regions are assigned to the same universal prototype. Particularly, for the second image of the first row, the object regions of 'horse head', 'horse tail', and 'person' are all assigned to the same prototype. This indicates that the universal prototypes are not specific to certain object categories. Moreover, the universal prototypes are helpful for characterizing the region information of different objects and can be effectively adapted to novel categories via fine-tuning.

Figure 6: The t-SNE plot of prototypes for different numbers of shots. Here, the number of prototypes is 24. $\mathcal{U}$ and $\mathcal{U}'$ separately denote the universal prototypes (see Eq. (1)) and the conditional prototypes (see Eq. (3)). For the novel categories, using different numbers of samples affects the distribution of the universal and conditional prototypes. As the number of novel objects increases, the universal prototypes become more scattered, whereas the conditional ones become more concentrated.
Figure 7: Assignment of image regions to universal prototypes for the 5-shot case. The highlighted regions in each image are assigned to the same prototype, respectively.

The performance on base categories. Table 5 shows the performance on each novel category as well as the mean performance on novel and base categories. We can see that our method outperforms MPSR [wu2020multi] on both novel and base categories. This indicates that our method improves the performance on data-scarce novel categories without degrading the performance on base categories.

Analysis of the output descriptors. In Eq. (2) and Eq. (5), the output descriptors are fused into the input of the RPN and the classifier, respectively. Next, we analyze the impact of these descriptors. Concretely, for Eq. (2), we only take $F$ as the input of the RPN and keep the other components unchanged; for the 1-shot and 5-shot cases, fusing the descriptors improves the performance by 2.7% and 1.8%, respectively. For Eq. (5), we only take $O$ as the input of the classifier and keep the other components unchanged; for the 1-shot and 5-shot cases, fusing the descriptors improves the performance by 2.1% and 1.2%, respectively. This shows that fusing the descriptors into the current features is helpful for improving their representation ability.

In Fig. 8, we show the visualization results of $F$ and the output of $h$ (see Eq. (2)). Here, we separately take $F$ and the output of $h$ as the input of the RPN. We can see that, for both the base ('Person' and 'Dog') and novel ('Cow') categories, the output of $h$ contains more object-related information than $F$. This further indicates that fusing the descriptors is helpful for enhancing object-level information.

(a) Input Image
(b) Original Feature
(c) Ours
Figure 8: Visualization of the feature maps fed into the RPN for the 5-shot case. 'Original Feature' and 'Ours' separately indicate $F$ and the output of $h$ (see Eq. (2)). For each feature map, the channels corresponding to the maximum values are selected for visualization.

4.3 Performance Analysis of Long-Tail Detection

To further demonstrate the effectiveness of our method, we evaluate it on long-tail object detection [gupta2019lvis]. Recently, to solve long-tail detection, BAGS [li2020overcoming] proposed a balanced group softmax operation to overcome classifier imbalance, which achieves superior performance. We directly embed our method into BAGS and take Faster R-CNN with a ResNet-50 backbone as the base detector.

Table 6 shows the long-tail detection results on the LVIS set. Similar to FSOD, as the number of samples per category decreases, the detection performance degrades significantly. We can see that embedding our method into BAGS improves its performance. This shows that enhancing the invariance and generalization of detectors is helpful for addressing long-tail detection, and that our method is effective for solving long-tail object detection.

5 Conclusion

To solve FSOD, we propose to learn universal prototypes from all object categories. Based on them, we develop a new approach of few-shot object detection with universal prototypes (FSOD^up). Concretely, after obtaining the universal and conditional prototypes, the augmented object features are computed based on the conditional prototypes. Next, through a consistency loss, FSOD^up enhances the invariance and generalization of the learned object features. Experimental results on two few-shot detection datasets and a long-tail detection dataset demonstrate the effectiveness of the proposed method.

References