Top-Related Meta-Learning Method for Few-Shot Detection

07/14/2020 ∙ by Qian Li, et al. ∙ Institute of Computing Technology, Chinese Academy of Sciences

Many meta-learning methods for few-shot detection depend on large amounts of data and extra parameters, which makes them costly. Moreover, because of category imbalance and scarce features, previous methods suffer from two obvious problems in few-shot detection: a strong bias and poor classification. Therefore, for meta-learning-based few-shot detection, we propose a TCL, which exploits the true-label example together with the example whose semantics are most similar to it, and a category-based grouping mechanism, which groups categories by appearance and environment to enhance the semantic features shared between similar categories. Training consists of a base-classes phase and a fine-tuning phase. During training, the category-related meta-features are used as the weights of the prediction layer of the detection model, and meta-features with a shared distribution between categories within a group are exploited to improve detection performance. According to group and category, we split the category-related meta-features into different groups, so that the distribution difference between groups is large while the difference within each group is small. Experimental results on the Pascal VOC dataset demonstrate that our method, which combines TCL with category-based grouping, significantly outperforms previous state-of-the-art methods for 1- and 2-shot detection, and obtains detection APs of almost 30% for 3-shot detection. In particular, for 1-shot detection, experiments show that our method achieves a detection AP of almost 20%.


1 Introduction

Recently, neural networks have progressed quickly in computer vision. Various efficient methods Redmon et al. (2015) Redmon and Farhadi (2017) Tian et al. (2019) Kong et al. (2019) depend on large labeled datasets. However, when data are insufficient, these methods tend to overfit and their generalization performance suffers. In contrast, there is a large gap between the human vision system and computer vision systems: from only a few labeled examples, humans can classify, locate, and describe objects, whereas computer systems cannot. Few-shot learning ability is therefore very important for computer vision systems. Moreover, most existing methods require more expensive datasets labeled with auxiliary descriptions, such as shape, scene, or color.

Previous work proposes few-shot learning methods Hariharan Bursuc Shah to address these issues; few-shot learning includes classification, detection, and segmentation. Few-shot detection Divakaran Porikli Saligrama is one of the most challenging tasks. This paper identifies two main challenges. First, with only a few examples, features extracted from standard CNNs are not directly suitable for few-shot learning. In most previous state-of-the-art few-shot learning methods, classification is treated as the standard task; in each training iteration, classification in YOLOv2 Redmon and Farhadi (2017) is a binary classification task, which leads to a bias problem and hurts performance on the other classes. Second, many methods Pinheiro Xie Hebert rely on auxiliary features derived from descriptions, videos, or attributes. However, it is difficult to ensure whether such external datasets are beneficial and to tell which parts are noise. Therefore, many methods Pinheiro Zemel Xie use a sub-module to learn auxiliary features to improve performance, which requires the cost of labeling datasets and additional parameters.

To address these problems, building on Kang et al. (2019), we propose a new top classification loss (TCL) for few-shot learning that improves the detection APs on the novel classes. The Cross-Entropy loss Rubinstein (1999) and the Focal loss Lin et al. (2017) can, to a certain extent, reduce the trust placed in the original label and increase the trust placed in the other labels. In YOLOv2 Redmon and Farhadi (2017), however, classification is a binary classification task that ignores the other examples, and there is no guarantee that all examples other than the true label are harmful for learning features; with only a few examples, this degrades detection performance. We therefore assume that the most similar example promotes useful information about the other category related to the true label, alleviating the strong bias problem.

(a) The visualization of category grouping
(b) Category-based grouping table on the Pascal VOC dataset: Group 1: aero, bird; Group 2: cow, horse, cat, sheep, dog; Group 3: sofa, chair; Group 4: tv, plant, table; Group 5: boat, bicycle, train, car, bus, mbike; Group 6: bottle, person.
Figure 1: Figure (a) and Table (b). In Figure (a), all categories are divided into groups; each row contains categories that are similar in appearance and in the environments in which they appear, and each row forms one group. When a "bird" is flying, it looks like an "aero"; "cow, horse, cat, sheep, and dog" all have four legs and often appear in similar environments. In Table (b), the 20 categories of the Pascal VOC dataset are divided into 6 groups; within a group, the appearance of the categories and the environments in which they appear are very similar.
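For concreteness, the grouping in Table (b) can be written as a plain mapping from group index to Pascal VOC class names. The expanded class names (e.g. "aeroplane" for "aero", "motorbike" for "mbike") and the dictionary layout below are our own illustrative choices, not code from the paper.

```python
# Hypothetical encoding of the category-based grouping from Figure 1(b):
# group index -> Pascal VOC class names (full names are our expansion of the
# abbreviations used in the table).
VOC_CATEGORY_GROUPS = {
    1: ["aeroplane", "bird"],
    2: ["cow", "horse", "cat", "sheep", "dog"],
    3: ["sofa", "chair"],
    4: ["tvmonitor", "pottedplant", "diningtable"],
    5: ["boat", "bicycle", "train", "car", "bus", "motorbike"],
    6: ["bottle", "person"],
}

# Inverse lookup (class name -> group index), handy when gathering the
# meta-feature vectors that belong to the same group.
CLASS_TO_GROUP = {c: g for g, names in VOC_CATEGORY_GROUPS.items() for c in names}
```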

Then, without additional datasets, we propose a category-based grouping method that relies only on labels for few-shot detection. As shown in Figure 1 (a), the leftmost object in the first row is "aero" and the last one in the same row is "bird"; they are similar in appearance and, in most conditions, appear in the same environment. Each pair of examples in the second row is one class, and these classes are also very similar in appearance. The scenes are likewise similar among the class objects in the 1st, 2nd, and 4th columns, and among those in the 3rd and 5th columns. Based on this characteristic, that the appearance or the environment is similar between certain categories, we split all classes into different groups, with each row of the figure forming one group. Therefore, this work proposes a category-based grouping method to assist few-shot detection. Very few methods have exploited this characteristic without additional datasets and applied it to few-shot detection. Without additional parameters or datasets, the category-based grouping improves few-shot detection performance and further reduces the dispersion of the detection results over all categories. For few-shot detection, our contributions are as follows:

  • We design a top classification loss (TCL), which lets the model learn from the true-label example and from the most similar example with the highest classification confidence, improving detection APs on the novel classes and alleviating the strong bias problem.

  • Based on the categories, we identify the similarity of semantic appearance and scene between different categories and group all classes into a few mutually disjoint sub-groups. We then construct a category-based grouping loss on the meta-features, which improves few-shot detection performance and further alleviates the scattering of the detection results over all classes.

  • We compare the impact of different classification losses on few-shot detection and show that our TCL improves the performance.

  • Combining the proposed TCL with category-based grouping, the detection APs for 1-, 2-, and 3-shot detection reach almost 20%, 25%, and 30%, respectively. Grouping is also beneficial for concentrating the results over all classes.

2 Related Work

This paper focuses on few-shot learning. Based on meta-learning, Girshick, Fei-Fei, and Hariharan regard classes with only a few examples as novel classes. Our work studies the classification loss, meta-learning methods, and detection for few-shot learning.

Classification Loss. Different classification losses have been proposed, such as BCEwithLogits and the Cross-Entropy loss with SoftMax Rubinstein (1999) Liu et al. (2016) Liu et al. (2017). Most computer vision tasks use Cross-Entropy for training. Lin et al. (2017) then propose the Focal loss to alleviate the imbalance between positive and negative examples. However, many tasks based on YOLOv2 Redmon and Farhadi (2017) only exploit a binary classification loss, which causes imbalance and ignores the correlation between categories. In this paper, we assume that for novel classes too much noise hurts detection performance, while using only the true label may fail to learn relations with the other categories. Therefore, we propose the TCL for classification, which focuses only on the true-label example and the most similar example. Compared with Barnes, which increases the distance between classes, our TCL only exploits semantic information between different categories to improve performance.

Meta-Learning. In recent years, different meta-learning algorithms have been proposed, including metric-based methods Li et al. (2019) Wertheimer and Hariharan (2019) Lifchitz et al. (2019) Kim et al. (2019), memory networks Santoro et al. (2016) Oreshkin et al. (2018) Abbeel, and optimization-based methods Grant et al. (2018) Lee and Choi (2018) Finn and Levine (2017) Finn et al. (2017). The first family learns a metric space from the few given samples and scores the label of a target image by similarity. The second is cross-task learning, and most memory networks are model-agnostic adaptation methods Finn et al. (2017): a model is learned on a variety of different tasks so that new learning tasks can be solved with only a few examples. Following Finn et al. (2017), researchers have proposed many variants Nichol et al. (2018); Sun et al. (2019) Antoniou et al. (2018) Rusu et al. (2019). The last family performs parameter prediction: from one example of each category per iteration, the class-related features are used as the weights of a prediction layer, so that the parameters of that layer are learned dynamically. At inference time, no further training is needed to adapt the learned per-category features to a new category. Most works apply this idea to classification. Kang et al. (2019) detects objects with YOLOv2 Redmon and Farhadi (2017); based on that, we further improve performance and alleviate several problems in few-shot detection.

Few-shot Detection. Most previous detection methods in this setting focus on limited labeled data. Weakly-supervised methods Bilen and Vedaldi (2016) Van Gool (2017) Song et al. (2014) only consider learning object detectors from image-level labels. Some works Hebert Wang and Hebert (2015) Dong et al. (2019) use only a few examples with bounding-box annotations per class and generate pseudo labels on many images to detect objects. Many zero-shot methods Bansal et al. (2018) Porikli use a sub-module to detect. Chen et al. (2018) transfers the base domain to the novel domain. In contrast, this paper only splits all categories into disjoint groups to improve detection performance without additional sub-modules, and captures the correlation between different groups and different categories from the semantic meta-features.

3 Our Approach

Our method is based on Kang et al. (2019). As shown in Figure 2, we propose the TCL for classification and a correlation method with category grouping that helps the meta-learning model of Kang et al. (2019) learn the related features between different categories.

Figure 2: Overall structure of our method. The detection model is composed of a feature extractor D and a meta-model M. The input of the meta-model is an example together with a mask of a single, randomly selected object; the mask value is 1 within the object and 0 otherwise. The number of meta-model inputs is a multiple of the number of training categories: on the Pascal VOC dataset, the meta-model takes one example per base category when training the base model and one example per category when fine-tuning on the novel categories, replicated across the GPUs. The meta-model extracts per-class meta-feature vectors that serve as the weights of the prediction layer of the detection model. According to the category grouping, we group the meta-feature vectors and use the proposed correlation method to obtain meta-features shared among related categories.
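As a rough illustration of the meta-model input described in the caption, the snippet below builds the support input as an RGB image concatenated with a binary mask of one selected object. This is a minimal sketch assuming a PyTorch tensor layout; the function name and box format are our own assumptions.

```python
import torch

def build_meta_input(image, box):
    """Sketch of the meta-model input (Figure 2): an RGB support image plus a
    binary mask that is 1 inside one randomly chosen object box and 0
    elsewhere. image: (3, H, W) float tensor; box: (x1, y1, x2, y2) pixels."""
    _, h, w = image.shape
    mask = torch.zeros(1, h, w, dtype=image.dtype, device=image.device)
    x1, y1, x2, y2 = box
    mask[:, y1:y2, x1:x2] = 1.0             # foreground of the selected object
    return torch.cat([image, mask], dim=0)  # (4, H, W) input to the meta-model
```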

3.1 Feature Reweighting for Detection

Different categories may share a common semantic distribution. The authors exploit category-based meta-features and ignore unrelated features to improve detection performance on novel categories. As shown in Figure 2, based on YOLOv2 Redmon and Farhadi (2017), the method uses a meta-model to obtain per-category meta-features that serve as dynamic weights of the prediction layer for detection. The meta-learning model takes an annotated sample (I_i, M_i) for each category i, i = 1, ..., N, where N is the number of categories; I_i and M_i denote the reference image and the mask derived from its bounding-box annotation, and the annotated example indicates the category to be detected. The meta-model M learns to predict N sets of correlation coefficients w_i = M(I_i, M_i), where w_i is the dynamic weight vector of category i. Based on Darknet-19, the authors build a feature extractor D that extracts base features from the query image I: F = D(I). Then, for class i, the reweighted feature is obtained from F and w_i as F_i = F ⊗ w_i, where the correlation coefficients of category i and the base features are multiplied channel-wise. This adapts the features to the novel classes. Finally, a prediction layer on top of F_i performs classification and box regression.
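The channel-wise reweighting step can be sketched as follows; this only illustrates the operation F_i = F ⊗ w_i under assumed tensor shapes and is not the authors' implementation.

```python
import torch

def reweight_features(base_features, class_weights):
    """Channel-wise feature reweighting (sketch of Section 3.1).
    base_features: (B, C, H, W) base features F = D(image)
    class_weights: (N, C) per-category meta-feature vectors w_i
    returns:       (B, N, C, H, W) class-specific features F_i = F * w_i"""
    B, C, H, W = base_features.shape
    N = class_weights.shape[0]
    w = class_weights.view(1, N, C, 1, 1)   # broadcast over batch and space
    f = base_features.view(B, 1, C, H, W)
    return f * w                            # channel-wise product per class
```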

3.2 TCL-K

For few-shot detection methods, and meta-learning methods in particular, we propose a TCL to encourage the model to learn similar semantic distributions and to alleviate the strong bias problem. In this loss, the features are driven to fit the true-label example, improving performance on the novel classes; two weights affect the convergence rate, and two thresholds denote the expected scores of the true label and of the sample with the most similar semantics, respectively. At the same time, exploiting the semantics most similar to the true label urges the model to learn the most similar distribution, reducing the bias problem, as shown in Equation 1 below.

(1)

where the two loss terms correspond to the true-label sample and to the most similar example, respectively, and together weaken the effect of relying on the true label alone; the two predictions involved are the classification score of the true label and the classification score of the most similar sample, respectively.
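Since Equation 1 is not reproduced here, the following is only a minimal sketch of the TCL-2 idea: the true-label score is pulled toward 1.0 while the highest-scoring non-true class (the "most similar" example) is pulled toward the threshold 0.5 rather than toward 0. The squared-error form and the weights alpha and beta are our assumptions, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def tcl2_loss(logits, target, m=0.5, alpha=1.0, beta=1.0):
    """Hedged sketch of TCL-2 (Section 3.2). logits: (B, N) class scores,
    target: (B,) true-label indices, m: threshold for the top non-true class."""
    probs = torch.sigmoid(logits)
    true_p = probs.gather(1, target.unsqueeze(1)).squeeze(1)       # p(true label)
    one_hot = F.one_hot(target, probs.shape[1]).bool()
    top_p = probs.masked_fill(one_hot, -1.0).max(dim=1).values     # most similar class
    loss_true = F.mse_loss(true_p, torch.ones_like(true_p))        # pull toward 1.0
    loss_top = F.mse_loss(top_p, torch.full_like(top_p, m))        # pull toward m
    return alpha * loss_true + beta * loss_top
```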

3.3 Category-Based Grouping

Following Kang et al. (2019), meta-learning uses the correlation between different categories for few-shot detection. In this setting, without additional datasets, our method focuses first on appearance and then on environment, splitting all categories into mutually disjoint groups. We mainly analyze the mean and the variance of the per-category meta-feature distributions. As shown in Equation 2, the principle is that the intra-group distance should be small and the inter-group distance large. We divide the 20 classes of the Pascal VOC dataset into 6 groups and propose a group-related loss. Our method encourages the variance of the mean feature vectors within every group to be small, making the semantic feature distribution more compact between categories within each group while keeping the feature distributions of different groups well separated, which improves the detection APs and reduces the dispersion of detection results over all categories.

(2)

Here the first term represents the dispersion of the mean values of the meta-feature space between categories in group g, which we expect to be as small as possible; the second represents the variance of the feature distributions of the different categories within the group, and the third represents the mean of the feature distributions over the different categories within the group, quantifying the dispersion and the concentration of the two distributions, respectively. According to Equation 2, we expect the distribution of the different categories within a group to be compact and the different groups to be far from each other: the intra-group dispersion becomes smaller while the inter-group terms become larger. Each notation is defined as follows.

(3)

Here the quantity denotes the dispersion of the mean levels of the meta-feature vectors between categories in group g, which we expect to be as small as possible, and the group mean is the mean of all features of the g-th group.

(4)

Here the two quantities describe the dispersion and the mean level of all category-related meta-feature vectors of the g-th group; according to Equation 2, this makes the distance between groups more pronounced. When there is more than one category in group g, we compute the dispersion between the categories within the group and the mean of all category-related features of the group; otherwise, the dispersion is computed from the features of the single category alone.

(5)

Here the category-related meta-feature vectors are fixed-dimension vectors (1024-dimensional in our setting). In this paper, all 20 categories are divided into 6 groups, each containing the number of categories listed in Figure 1 (b). Because of the correlation loss of each group, the value of the function is less than 0; therefore, a parameter is used to ensure that the loss is a positive value, and this parameter must be greater than or equal to 1. By constraining the feature distributions between groups, the method alleviates the phenomenon that the performance of different categories varies greatly in few-shot detection.
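Because Equations 2-5 are not reproduced here, the snippet below is only an assumed reading of the category-based grouping objective: per-category meta-features should be compact around their group mean, while the group means should be pushed apart. The margin-based inter-group term and all names are our own simplifications; the group assignment could come from a mapping such as the one sketched after Figure 1.

```python
import torch

def grouping_loss(meta_features, group_ids, margin=1.0):
    """Sketch of the category-based grouping idea (Section 3.3).
    meta_features: (N, C) one meta-feature vector per category
    group_ids:     (N,) group index of each category"""
    means, intra = [], []
    for g in group_ids.unique():
        feats = meta_features[group_ids == g]           # categories of group g
        mu = feats.mean(dim=0)
        means.append(mu)
        if feats.shape[0] > 1:
            intra.append(((feats - mu) ** 2).mean())    # within-group dispersion
    intra_loss = torch.stack(intra).mean() if intra else meta_features.sum() * 0.0
    means = torch.stack(means)                          # (G, C) group means
    dists = torch.cdist(means, means)                   # pairwise group distances
    off_diag = dists[~torch.eye(len(means), dtype=torch.bool, device=means.device)]
    inter_loss = torch.clamp(margin - off_diag, min=0).mean()  # push groups apart
    return intra_loss + inter_loss
```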

3.4 Loss Details

Category-Based Grouping. In this work, considering that different categories in different environments may have a similar appearance, or that different categories may appear in similar environments, and in order to keep the setting simple, we mainly focus on appearance similarity, with the environment as a secondary criterion, and put category objects with similar appearance and scenes into the same group. As shown in Figure 1 (b), we divide the 20 categories of the Pascal VOC dataset into 6 groups. As shown in Equation 2, we set the corresponding parameter to 1.

Loss Functions. To train the meta-model and ensure that the shared features of semantically similar category objects are more compact in the meta-feature space, we jointly train classification, category-based grouping, and detection, as shown in Equation 6. Compared with state-of-the-art classification losses, our TCL-2 is more suitable for few-shot detection.

(6)

As detailed in Equation 1, the threshold of the true label is 1.0 and the threshold of the negative with the highest score is set to 0.5. The value of this threshold cannot be too large or too small: if it is too large, the meta-model drives the other examples toward the true label; if it is too small, the model trusts only the true label, which violates the similarity principle and reduces performance on the novel classes. The category-based grouping loss is the one defined in Equation 2. The detection loss includes the center-location loss and the scale loss. In this experiment, the classification, similarity (grouping), and detection balance parameters are set to 1, 6, and 1, respectively.
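A compact view of how the three terms of Equation 6 might be combined with the reported balance parameters (classification 1, grouping 6, detection 1) is given below; the helper names are assumptions, and the detection term itself (center, scale, and the remaining YOLOv2 terms) is not expanded.

```python
def total_loss(loss_cls, loss_group, loss_det,
               w_cls=1.0, w_group=6.0, w_det=1.0):
    """Sketch of the joint objective of Equation 6 with the balance weights
    reported in Section 3.4 (assumed composition, not the paper's code)."""
    return w_cls * loss_cls + w_group * loss_group + w_det * loss_det
```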

4 Experiments and Results

This experiment consists of base-classes training and few-shot fine-tuning. The output of the meta-model is related to the number of categories, and each category-based meta-feature is a 1024-dimensional vector. This vector is used as the weight of the prediction layer of the detection network, which classifies and detects dynamically according to the semantic similarity between categories. In our work, we experiment with different classification losses, BCEwithLogits, Focal Lin et al. (2017), Cross-Entropy Rubinstein (1999), and our proposed TCL-2, each combined with our proposed category-based grouping. The methods that combine these classification losses with the category-based grouping are denoted Re-BCEwithLogits, Re-Focal, Re-Cross-Entropy, and Ours, respectively. Details follow.

4.1 DataSets and Setting

The VOC dataset contains 20 categories. We randomly select 5 categories as novel categories for fine-tuning and use the remaining 15 categories as base classes for the base model. The 20 categories are randomly divided into 6 novel groups, and we experiment with 3 of these groups as the novel classes for fine-tuning k-shot detection (k = 1, 2, 3, 5). Our setting (see Appendix A) is the same as Kang et al. (2019).
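The base/novel split described above can be illustrated as follows; the novel set 1 classes are taken from Table 3, and the helper itself is only a sketch, not the paper's data pipeline.

```python
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

def make_split(novel_classes):
    """Hold out 5 novel classes; the remaining 15 are base classes."""
    novel = set(novel_classes)
    base = [c for c in VOC_CLASSES if c not in novel]
    assert len(novel) == 5 and len(base) == 15
    return base, sorted(novel)

# Example: novel set 1 ("boat, cat, mbike, sheep, sofa" in Table 3).
base_set1, novel_set1 = make_split(["boat", "cat", "motorbike", "sheep", "sofa"])
```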

4.2 Ablation Studies

This experiment mainly targets 5-way k-shot detection. We analyze the detection performance on Pascal VOC from the shared meta-feature distributions of different categories and from the different classification losses. The details are as follows.

The Importance of the TCL. In this experiment, as shown in Equation 1, the threshold plays an important role in our TCL-2. As illustrated in Curve 3(b), when the threshold is greater than 0.5, the AP distribution for fine-tuning on the novel classes is consistent, but the best APs, on both the novel classes and the mean over all categories, are lower than with our setting of 0.5. When the threshold is less than 0.5, the APs on the novel classes tend to saturate, because the semantics of the true-label example and the most similar example are maximally separated by Equation 1, making the model trust only the true label and hurting the generalization of detection to similar semantics. Therefore, when the threshold is set to 0.5, our method can best exploit the similar semantic distribution between different categories to improve performance on the novel classes. As shown in Table 1, compared with the state-of-the-art BCEwithLogits, Focal Lin et al. (2017), and Cross-Entropy Rubinstein (1999) losses, our TCL-2 improves few-shot detection performance. For novel set 1, the 1-shot detection AP of TCL-2 is 2.73%, 2.88%, and 3.78% higher than these classification losses, respectively, and TCL-2 is better than the other classification methods by 1.23%, 0.73%, and 5.53% for 3-shot, respectively.

Novel Set1 Novel Set2 Novel Set3
Method/shot 1 2 3 5 1 2 3 5 1 2 3 5
BCEwithLogits 16.42 18.51 27.41 36.07 13.59 14.71 26.3 35.2 15.1 15.62 26.14 31.6
Re-BCEwithLogits 13.26 17.46 24.31 33.76 18.29 19.71 26.99 35.3 11.0 13.0 20.74 31.95
Focal 16.27 21.63 27.91 37.43 10.39 15.23 18.36 34.09 9.6 8.87 20.16 27.54
Re-Focal 18.22 20.05 20.45 36.15 14.16 15.88 23.13 27.2 7.3 8.67 16.35 28.6
Cross-Entropy 15.37 19.11 23.11 35.18 16.04 19.2 25.46 35.84 12.19 15.3 20.31 31.91
Re-Cross-Entropy 18.55 21.02 22.25 36.5 15.15 20.81 26.07 33.45 13.07 14.53 23.93 35.58
TCL-2 19.15 21.23 28.64 36.94 17.56 22.25 25.57 38.45 12.27 17.33 30.81 35.08
Ours 20.08 26.75 29.76 36.28 18.07 24.66 30.94 39.04 19.42 17.43 23.24 37.66
Table 1: The results of detection APs on novel classes. For few-shot detection on Pascal VOC, our method significantly outperforms others.
(a) Curves of the normalized APs for all methods.
(b) Curves of the normalized APs for different threshold values.
Figure 3: For Pascal VOC, Curve (a) shows, for novel set 1, 2-shot, the detection APs on the novel classes over all fine-tuning epochs; our method clearly outperforms the others. In Curve (b), solid lines show the normalized APs on the novel classes and dashed lines show the normalized detection APs on all categories. For the TCL-2 loss, when the threshold is set to 0.5, our method performs best when fine-tuning on novel set 1, 2-shot.
Shot/Method BCEwithLogits Re-BCEwithLogits Focal Re-Focal Cross-Entropy Re-Cross-Entropy TCL-2 Ours
1 55.04 58.58 57.0 54.77 57.15 53.86 53.0 52.63
2 52.84 54.46 57.0 51.99 53.36 50.63 50.93 46.45
3 45.81 48.28 45.67 51.22 49.63 48.6 45.46 44.37
5 41.1 42.07 40.74 32.57 41.95 40.64 41.82 41.36
Table 2: Dispersion of the detection APs on all categories. For novel set1, our method obviously alleviates the strong bias, reducing dispersion of detection performance.

As can be seen from Curve 3(a), compared with the other classification losses, TCL-2 balances the detection APs over all categories better. As shown in Table 2, for novel set 1, our TCL-2 alleviates the strong bias problem. In particular, for the dispersion of 1-shot detection over all categories, TCL-2 is 2.04%, 4.0%, and 4.15% lower than BCEwithLogits, Focal, and Cross-Entropy, respectively. A lower dispersion value indicates a weaker strong bias and a method better suited to few-shot detection. Thus, through Equation 1, our TCL-2 alleviates the imbalance of detection APs across all categories by better exploiting the common category-related semantic features between the true label and the most similar example.

The Importance of the Category-based Grouping. Without additional data, as detailed in Equation 2, we mainly focus on the similar appearance of different categories, supplemented by similar scenes, exploiting the relationships between categories to improve few-shot detection performance. Based on Equation 2, we analyze the category-based grouping and compare it with each of its parts, as detailed in Appendix B. As shown in Table 1, compared with the classification loss alone, splitting the 20 categories into disjoint groups improves few-shot detection performance. As shown in Table 2, Re-BCEwithLogits, Re-Focal, and Re-Cross-Entropy are compared with BCEwithLogits, Focal, and Cross-Entropy, respectively. We find that a better shared meta-feature distribution between different categories further reduces the dispersion of detection performance over all categories. In particular, for novel set 1, 2-shot, the dispersion of Re-Focal and Re-Cross-Entropy is reduced by 5.01% and 2.73%, respectively. For novel set 1, 1-shot, 2-shot, and 3-shot, ours, which combines TCL-2 with the category-based grouping, yields the lowest dispersion over all categories.

Figure 4: Histograms of the meta-feature distributions of the 20 categories. Each sub-figure shows the feature vectors of one group, and different colors represent different categories. The meta-features within a sub-figure are very similar, while the distribution differences between sub-figures are obvious. Our method extracts the shared meta-features within a group better.

As shown in Figure 4, in each subgraph every category is represented by a histogram of a different color, and each subgraph corresponds to one group of categories. We find that the meta-feature distributions of different categories are very similar within a group, while the differences between groups are obvious. For the meta-feature-based dynamic weighting method, and without additional data, the grouping method obtains the shared meta-feature space of the different category objects within each group, improving few-shot detection performance.

4.3 Evaluation of the Proposed Method

Our TCL-K and category-based grouping improve detection performance on the novel classes, as seen in Appendix C. First, as detailed in Equation 1 and Curve 3(a), our TCL-2 detects similar semantics better, improving the detection APs on the novel classes and alleviating the strong bias problem. Then, according to the similar appearance and the environments in which different categories appear, as detailed in Equation 2, we split all categories into mutually disjoint groups so that the meta-feature distribution is compact between categories within a group while the distributions of different groups are far apart. This category-based grouping exploits the similar distribution of meta-features, improves performance, and further reduces the detection dispersion over all categories. As can be seen from Curve 3(a), Figure 4, and Table 2, the category-based grouping helps the meta-model extract the shared meta-features between different categories, and our method improves the detection APs through the similar semantics between categories while reducing the dispersion of the detection APs over all classes. Combining TCL-2 with category-based grouping is therefore more beneficial for few-shot detection. As shown in Table 1, for 1-, 2-, and 3-shot detection our method is better, and the detection APs are close to 20.0%, 25%, and 30%, respectively. For novel set 3, the novel classes are "aero, bottle, cow, horse, sofa" and the remaining classes are the base classes. Although no novel class is closely associated with the base categories, our method is 4.32%, 9.82%, and 6.35% better than BCEwithLogits, Focal, and Re-Cross-Entropy for novel set 3, 1-shot, respectively.

5 Conclusions

In this work, for few-shot detection, we present TCL-K, which exploits the most similar example to reduce the bias over all classes and improve detection performance on the novel classes, and a category-based grouping method that helps the meta-model extract category-related features better, further improving performance and alleviating the strong bias problem. Based on similar appearance or the environments in which they appear, we divide the categories into disjoint groups. This helps the meta-model extract meta-feature vectors such that the meta-features are similar between categories within a group while the distributions of different groups differ clearly. Our method improves few-shot detection performance for the meta-learning approach: for 1-shot, 2-shot, and 3-shot detection, it obtains detection APs of almost 20%, 25%, and 30%, respectively. In the future, we will combine an attention mechanism with our method to further improve few-shot detection performance on different datasets.

References

  • [1] P. Abbeel A simple neural attentive meta-learner. Cited by: §2.
  • A. Antoniou, H. Edwards, and A. Storkey (2018) How to train your maml.. arXiv: Learning. Cited by: §2.
  • A. Bansal, K. Sikka, G. Sharma, R. Chellappa, and A. Divakaran (2018) Zero-shot object detection. pp. 397–414. Cited by: §2.
  • [4] N. Barnes Polarity loss for zero-shot object detection. Cited by: §2.
  • H. Bilen and A. Vedaldi (2016) Weakly supervised deep detection networks. pp. 2846–2854. Cited by: §2.
  • [6] A. Bursuc Dense classification and implanting for few-shot learning. Cited by: §1.
  • H. Chen, Y. Wang, G. Wang, and Y. Qiao (2018) LSTD: a low-shot transfer detector for object detection. pp. 2836–2843. Cited by: §2.
  • [8] A. Divakaran Zero-shot object detection. Cited by: §1.
  • X. Dong, L. Zheng, F. Ma, Y. Yang, and D. Meng (2019) Few-example object detection with model communication. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (7), pp. 1641–1654. Cited by: §2.
  • [10] L. Fei-Fei Label efficient learning of transferable representations across domains and tasks. Cited by: §2.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. pp. 1126–1135. Cited by: §2.
  • C. Finn and S. Levine (2017) Meta-learning and universality: deep representations and gradient descent can approximate any learning algorithm. arXiv: Learning. Cited by: §2.
  • [13] R. Girshick Low-shot visual recognition by shrinking and hallucinating features. Cited by: §2.
  • E. Grant, C. Finn, S. Levine, T. Darrell, and T. L. Griffiths (2018) Recasting gradient-based meta-learning as hierarchical bayes. arXiv: Learning. Cited by: §2.
  • [15] B. Hariharan Few-shot learning with localization in realistic settings. Cited by: §1.
  • [16] B. Hariharan Low-shot learning from imaginary data. Cited by: §2.
  • [17] M. Hebert Watch and learn: semi-supervised learning of object detectors from videos. Cited by: §1, §2.
  • B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell (2019) Few-shot object detection via feature reweighting. pp. 8420–8429. Cited by: §1, §2, §3.3, §3, §4.1.
  • J. Kim, T. Oh, S. Lee, F. Pan, and I. S. Kweon (2019) Variational prototyping-encoder: one-shot learning with prototypical images. arXiv: Computer Vision and Pattern Recognition. Cited by: §2.
  • T. Kong, F. Sun, H. Liu, Y. Jiang, and J. Shi (2019) FoveaBox: beyond anchor-based object detector. arXiv preprint arXiv:1904.03797. Cited by: §1.
  • Y. Lee and S. Choi (2018) Gradient-based meta-learning with learned layerwise metric and subspace. pp. 2927–2936. Cited by: §2.
  • W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, and J. Luo (2019) Revisiting local descriptor based image-to-class measure for few-shot learning. pp. 7260–7268. Cited by: §2.
  • Y. Lifchitz, Y. Avrithis, S. Picard, and A. Bursuc (2019) Dense classification and implanting for few-shot learning. pp. 9258–9267. Cited by: §2.
  • T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar (2017) Focal loss for dense object detection. IEEE Transactions on Pattern Analysis & Machine Intelligence PP (99), pp. 2999–3007. Cited by: §1, §2, §4.2, §4.
  • W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. pp. 6738–6746. Cited by: §2.
  • W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. arXiv: Machine Learning. Cited by: §2.
  • A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms.. arXiv: Learning. Cited by: §2.
  • B. N. Oreshkin, P. R. Lopez, and A. Lacoste (2018) TADAM: task dependent adaptive metric for improved few-shot learning. pp. 721–731. Cited by: §2.
  • [29] P. O. Pinheiro Adaptive cross-modal few-shot learning. Cited by: §1.
  • [30] F. Porikli Zero-shot object detection: learning to simultaneously recognize and localize novel concepts. Cited by: §1, §2.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2015) You only look once: unified, real-time object detection. Cited by: §1.
  • J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In IEEE Conference on Computer Vision & Pattern Recognition, Cited by: §1, §1, §1, §2, §2, §3.1.
  • R. Rubinstein (1999) The cross-entropy method for combinatorial and continuous optimization. Methodology & Computing in Applied Probability 2 (2), pp. 127–190. Cited by: §1, §2, §4.2, §4.
  • A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2019) Meta-learning with latent embedding optimization. Cited by: §2.
  • [35] V. Saligrama Zero-shot detection. IEEE Transactions on Circuits & Systems for Video Technology. Cited by: §1.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. pp. 1842–1850. Cited by: §2.
  • [37] M. Shah Task-agnostic meta-learning for few-shot learning. Cited by: §1.
  • H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell (2014) Weakly-supervised discovery of visual pattern configurations. pp. 1637–1645. Cited by: §2.
  • Q. Sun, Y. Liu, T. Chua, and B. Schiele (2019) Meta-transfer learning for few-shot learning. pp. 403–412. Cited by: §2.
  • Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. In Proc. Int. Conf. Computer Vision (ICCV), Cited by: §1.
  • L. Van Gool (2017) Weakly supervised cascaded convolutional networks. Cited by: §2.
  • Y. Wang and M. Hebert (2015) Model recommendation: generating object detectors from few samples. pp. 1619–1628. Cited by: §2.
  • D. Wertheimer and B. Hariharan (2019) Few-shot learning with localization in realistic settings. pp. 6558–6567. Cited by: §2.
  • [44] H. Xie Dual adversarial semantics-consistent network for generalized zero-shot learning. Cited by: §1.
  • [45] R. S. Zemel Incremental few-shot learning with attention attractor networks. Cited by: §1.

Appendix A Implementation Details

During base-class training, only examples of the 15 base categories are used; the remaining 5 categories are treated as novel classes and fine-tuned, with only k examples per novel class (5-way k-shot). The input of the meta-model is one example per trained category together with an object mask: the foreground is 1 and the background is 0, and when an example contains multiple objects, one object mask is selected at random. All models are trained on 4 GPUs with a batch size of 64, and the base model is trained for 80,000 iterations. We use the VOC2007 test set as our test set and the train/validation sets of VOC2007 and VOC2012 as our training set. We use SGD with momentum 0.9 and L2 weight decay 0.0005 for both the detector and the meta-model.
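The optimizer settings reported above can be expressed in a few lines; the learning rate is not stated in this appendix, so the value below is only a placeholder assumption.

```python
import torch

def build_optimizer(params, lr=1e-3):
    """SGD with momentum 0.9 and L2 weight decay 0.0005, as reported in
    Appendix A, for both the detector and the meta-model. lr is a placeholder
    assumption (not given in the text)."""
    return torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=5e-4)
```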

Appendix B Details of Category-Based Grouping

We compare three alternative grouping methods, corresponding to three cases, detailed as follows.

Novel Set 1 APs
Shot Method boat cat mbike sheep sofa novel mean base mean all mean
1 Re-TCL 7 9.39 37.6 28.63 17.93 12.84 21.27 65.72 54.61
Re-TCL 8 4.55 32.78 29.89 18.28 10.87 19.27 66.51 54.7
Re-TCL 9 9.09 34.75 21.63 17.11 19.67 20.45 65.14 53.97
Ours 9.53 33.58 32.28 19.66 5.34 20.08 65.44 54.1
2 Re-TCL 7 7.22 41.24 20.34 32.36 13.66 22.96 64.83 55.61
Re-TCL 8 5.22 39.38 33.79 33.46 11.9 24.75 65.05 54.97
Re-TCL 9 2.19 45.05 25.01 27.84 17.8 23.56 64.83 54.51
Ours 10.61 35.11 33.75 35.89 18.38 26.75 65.18 55.58
3 Re-TCL 7 6.32 47.77 22.45 27.92 29.99 26.89 65.55 55.89
Re-TCL 8 10.46 47.35 27.08 26.12 37.72 29.75 65.11 56.27
Re-TCL 9 10.29 39.55 18.76 28.67 33.84 26.22 65.18 55.44
Ours 10.29 46.05 28.11 29.81 34.52 29.76 65.64 56.67
Table 3: The results of detection APs on Pascal VOC for methods that combine our TCL with different category-based grouping variants. For k-shot detection (k = 1, 2, 3), ours is better than the others on novel set 1. The table lists the AP of every novel class, the mean AP on the novel classes, the mean AP on the base classes, and the mean AP on all categories.
(7)

In this first variant, the grouping loss only concerns the meta-feature distribution between groups, so the method fails to learn the mean distribution of all meta-feature vectors across groups.

(8)

To solve this problem, the second variant also learns the mean distribution of all meta-feature vectors between groups; however, it cannot learn the distribution of meta-features within a group. Our category-based grouping therefore makes the distribution of meta-features more compact within a group while making the differences between groups more obvious.

On the other hand, if the category-based grouping only attends to the dispersion of meta-features between groups and the similarity between categories within a group, the method cannot learn the difference between the average meta-features of different groups, resulting in lower mean APs, as detailed below.

(9)

As shown in Table 3, following Equations 7, 8, 9 and our Equation 2, we experiment with different Re-TCL methods that combine our TCL with these category-based grouping variants. Ours, which combines TCL with our Equation 2, is better for few-shot detection. The two balance parameters are set to 1 and 0.00005, respectively.

Appendix C Results on Pascal VOC

For novel sets 1, 2, and 3, this section shows all of our results for k-shot detection (k = 1, 2, 3, 5). We visualize all results on all novel categories for novel set 1, 2-shot.

Novel Base
Shot Method boat cat mbike sheep sofa mean aero bike bird bottle bus car chair cow table dog horse person plant train tv mean
1 YOLO-joint 0.0 9.1 0.0 0.0 0.0 1.8 78.7 76.8 73.4 48.8 79.0 82.3 50.2 68.4 71.4 76.7 80.7 75.0 46.8 83.8 71.7 70.9
YOLO-ft 0.1 25.8 10.7 3.6 0.1 8.1 77.2 74.9 69.1 47.4 78.7 79.7 47.9 68.3 69.6 74.7 79.4 74.2 42.2 82.7 71.1 69.1
YOLO-ft-full 0.1 30.9 26.0 8.0 0.1 13.0 75.1 70.7 65.9 43.6 78.4 79.5 47.8 68.7 68.0 72.8 79.5 72.3 40.1 80.5 68.6 67.4
Baseline 10.8 44.0 17.8 18.1 5.3 19.2 77.1 71.8 66.3 40.4 75.2 77.8 50.1 54.6 66.8 69.1 78.3 68.1 41.9 80.6 70.3 65.9
BCEwithLogits 9.09 22.3 25.47 15.56 9.67 16.42 71.91 73.19 62.09 42.19 72.81 76.81 46.55 51.06 63.8 66.6 77.99 67.24 40 80.42 69.46 64.11
Re-BCEwithLogits 2.6 23.81 19.89 19.76 0.22 13.26 71.61 72.1 66.8 41.64 74.47 75.84 46.48 57.61 66.5 68.8 79.13 70.27 40.78 78.62 69.68 65.36
Focal 1.55 42.88 15.23 16.72 4.98 16.27 75.89 72.62 71.47 43.01 78.7 76.87 48.79 53.16 62.29 71.2 79.33 71.7 38.22 77.16 68.65 65.94
Re-Focal 9.79 37.99 20.64 20.14 2.54 18.22 75.27 71.74 66.44 41.01 75.73 76.07 46.39 52.88 63.02 69.68 81.9 70.64 38.75 80.53 68.17 65.22
Cross-Entropy 3.52 29.01 25.6 16.43 2.27 15.37 71.6 72.08 65.53 39.42 75.42 76.06 43.93 57.07 64.87 68.09 78.69 69.24 37.81 78.67 68.35 64.45
Re-Cross-Entropy 9.52 33.67 27.33 12.35 9.85 18.55 73.86 74.4 66.14 40.42 73.87 76.45 47.09 53.91 68.13 70.78 78.2 67.53 35.64 76.24 67.65 64.69
TCL-2 6.43 33.8 31.21 11.51 12.8 19.15 72.03 71 67.24 42.16 74.23 76.77 46.57 54.34 65.69 71.61 81.02 69.77 41.35 77.03 69.46 65.35
Ours 9.53 33.58 32.28 19.66 5.34 20.08 71.98 72.65 65.45 41.96 75.14 78.44 43.75 52.03 65.04 73.27 80.29 68.78 43.46 80.36 68.97 65.44
2 YOLO-joint 0.0 9.1 0.0 0.0 0.0 1.8 77.6 77.1 74.0 49.4 79.8 79.9 50.5 71.0 72.7 76.3 81.0 75.0 48.4 84.9 72.7 71.4
YOLO-ft 0.0 24.4 2.5 9.8 0.1 7.4 78.2 76.0 72.2 47.2 79.3 79.8 47.3 72.1 70.0 74.9 80.3 74.3 45.2 84.9 72.0 70.2
YOLO-ft-full 0.0 35.2 28.7 15.4 0.1 15.9 75.3 72.0 69.8 44.0 79.1 78.8 42.1 70.0 64.9 73.8 81.7 71.4 40.9 80.9 69.4 67.6
Baseline 5.3 46.4 18.4 26.1 12.4 21.7 71.4 72.4 64.5 37.9 75.3 77.1 42.9 55.0 57.4 73.7 78.9 68.0 41.5 75.9 69.0 64.1
BCEwithLogits 4.57 30.21 23.74 17.66 16.36 18.51 69.8 71.62 62.79 40.77 73.2 77.57 46.19 55.03 64.48 67.55 78.61 68.79 39.37 74.09 66.16 63.74
Re-BCEwithLogits 2.93 28.84 19.64 27.16 8.74 17.46 71.5 72.77 68.1 40.92 74.89 75.99 42.6 58.7 65.78 67.99 79.35 69.94 40.27 73.78 66.97 64.64
Focal 2.58 43.85 11.29 33.16 17.29 21.63 72.16 71.4 69.1 42.31 75.8 73.5 44.85 61.52 64.11 70.09 79.28 70 39.02 78.23 69.14 65.37
Re-Focal 6.15 33.6 18.95 29.43 12.1 20.05 74.88 62.18 66.6 41.13 76.91 76.51 45.4 60.88 62.23 70.71 82.32 71.68 42.19 74.26 67.51 65.03
Cross-Entropy 10.25 36.46 16.2 24.97 7.66 19.11 72.31 75.59 67.87 40.79 74.26 76.95 42.75 59.03 66.24 72.86 79.03 71.4 39.73 75.37 67.32 65.43
Re-Cross-Entropy 10.06 38.83 19.41 17.65 19.15 21.02 70.92 73.33 66.49 38.73 74.36 75.59 46.96 59.51 66.33 70.34 78.13 70.29 40.36 69.74 67.04 64.54
TCL-2 4.95 32.49 28.31 27.06 13.34 21.23 60.79 70.66 67.51 39.37 74.61 76.4 39.26 54.21 66.68 70.81 79.91 69.22 41.68 67.71 67.89 63.71
Ours 10.61 35.11 33.75 35.89 18.38 26.75 70.53 73.53 66.6 43.18 75.52 77.71 41.29 53.9 63.63 72.5 80.85 69.77 43.08 77.27 68.42 65.18
3 YOLO-joint 0.0 9.1 0.0 0.0 0.0 1.8 77.1 77.0 70.6 46.3 77.5 79.7 49.7 68.8 73.4 74.5 79.4 75.6 48.1 83.6 72.1 70.2
YOLO-ft 0.0 27.0 1.8 9.1 0.1 7.6 77.7 76.6 71.4 47.5 78.0 79.9 47.6 70.0 70.5 74.4 80.0 73.7 44.1 83.0 70.9 69.7
YOLO-ft-full 0.0 39.0 18.1 17.9 0.0 15.0 73.2 71.1 68.8 43.7 78.9 79.3 43.1 67.8 62.2 76.3 79.4 70.8 40.5 81.6 69.6 67.1
Baseline 11.2 39.8 20.9 23.7 33.0 25.7 73.2 68.0 65.9 39.8 77.3 77.5 43.5 57.7 60.7 64.5 77.5 68.4 42.0 80.6 70.2 64.4
BCEwithLogits 10.92 46.62 24.57 23.23 31.71 27.41 73.06 71.22 62.71 40.1 73.9 77.79 46.02 55.06 62.75 67.95 79.51 65.49 39.81 76.69 67.36 63.96
Re-BCEwithLogits 9.77 45.18 22.77 22.37 21.47 24.31 71.95 70.14 63.95 38.09 74.65 74.86 41.02 57.12 58.25 63.08 79.29 65.36 40.36 73.38 67.43 62.6
Focal 10.36 43.77 19.45 28.41 37.56 27.91 72.71 71.66 66.84 42.79 76.13 76.71 44.92 57.42 66.61 68 81.97 68.8 40.25 77.19 67.44 65.3
Re-Focal 11.67 27.17 16.83 30.07 16.52 20.45 73.75 72.52 69.66 43.83 76.44 77.94 48.14 64.6 66.78 71.62 83.64 70.66 43.86 79.86 68.55 67.32
Cross-Entropy 7.42 32.57 14.75 31.04 29.76 23.11 70.8 73.39 69.21 40.52 75.33 78.98 46.07 57.42 66.21 69.44 79.53 69.03 38.8 79.15 65.43 65.29
Re-Cross-Entropy 10.47 32.89 16.06 24.1 28.75 22.25 74.1 74.44 68 41.23 77.62 76.84 48.19 58.56 67.15 69.26 78.71 69.84 38.12 80.31 67.1 65.96
TCL-2 9.67 47.2 25.11 28.97 32.26 28.64 72.5 67.82 67.63 39.08 75.28 77.06 43.79 53.3 65.71 72.9 80.65 69.11 41.24 76.85 69.16 64.8
Ours 10.29 46.05 28.11 29.81 34.52 29.76 72.45 73.14 66.32 41.18 75.07 78.55 45.04 56.81 64.46 72.73 81.23 68.26 42.61 79.87 66.89 65.64
5 YOLO-joint 0.0 9.1 0.0 0.0 9.1 3.6 78.2 78.5 72.1 47.8 76.6 82.1 50.7 70.1 71.8 77.6 80.4 75.4 46.0 84.8 72.5 71.0
YOLO-ft 0.0 33.8 2.6 7.8 3.2 9.5 77.2 77.1 71.9 47.3 78.8 79.8 47.1 69.8 71.8 77.0 80.2 74.3 44.2 82.5 70.6 70.0
YOLO-ft-full 7.9 48.0 39.1 29.4 36.6 32.2 75.5 73.6 69.1 43.3 78.4 78.9 42.3 70.2 66.1 77.4 79.8 72.2 41.9 82.8 69.3 68.1
Baseline 14.2 57.3 50.8 38.9 41.6 40.6 70.1 66.3 66.5 40.0 78.1 77.0 40.4 61.2 61.5 71.2 79.1 70.4 38.5 80.0 68.0 64.6
BCEwithLogits 8.91 49.65 49.11 31.72 40.95 36.07 70.68 73.05 65.16 38.42 75 77.82 41.25 61.02 62.72 71.88 79.94 68.33 40.99 77.75 67.68 64.78
Re-BCEwithLogits 12.61 44.16 42.49 36.93 32.59 33.76 70.83 73.53 67.28 38.7 76.79 76.65 39.31 64.76 63.29 69.56 81.21 66.78 39.42 78.28 68.1 64.97
Focal 8.04 52.42 48.07 39.9 38.76 37.43 72.2 72.54 66.31 40.65 78.69 76.74 43.54 64.5 68.53 69.94 81.07 69.84 37.68 80.03 69.13 66.09
Re-Focal 12.66 42.89 43.28 40.7 43.44 36.15 73.2 66.09 66.03 41.02 77.06 76.78 44.6 65.74 63.48 67.93 82.62 70.46 40.7 76.88 68.1 65.38
Cross-Entropy 7.37 45.65 44.55 39.62 38.7 35.18 72.32 69.03 67.67 40.15 77.41 77.16 40.29 60.94 64.86 73.51 81.17 69.68 38.94 80.75 67.45 65.42
Re-Cross-Entropy 10.66 48.92 44.25 36.05 42.65 36.5 75.5 75.17 69.9 40.72 77.28 76.96 47.54 62.26 65.63 73.91 79.66 70.86 38.87 80.68 68.61 66.9
TCL-2 5.94 55.34 49.25 34.84 39.33 36.94 70.44 66.41 64.71 35.75 75.56 75.54 35.72 59.05 57.86 73.47 77.87 66.64 37 75.44 65.73 62.48
Ours 9.02 47.13 49.78 40.11 35.37 36.28 73.28 74.77 67.25 39.4 76.51 78.53 44.52 57.32 66.44 75.54 81.33 58.92 39.8 80.15 68.98 66.18
Table 4: The results of detection APs. For few-shot detection on the Pascal VOC dataset, our method significantly outperforms the others for novel set 1.
Novel Base
Shot Method bird bus cow mbike sofa mean aero bike boat bottle car cat chair table dog horse person plant sheep train tv mean
1 YOLO-joint 0.0 0.0 0.0 0.0 0.0 0.0 78.4 76.9 61.5 48.7 79.8 84.5 51.0 72.7 79.0 77.6 74.9 48.2 62.8 84.8 73.1 70.2
YOLO-ft 6.8 0.0 9.1 0.0 0.0 3.2 77.1 78.2 61.7 46.7 79.4 82.7 51.0 69.0 78.3 79.5 74.2 42.7 68.3 84.1 72.9 69.7
YOLO-ft-full 11.4 17.6 3.8 0.0 0.0 6.6 75.8 77.3 63.1 45.9 78.7 84.1 52.3 66.5 79.3 77.2 73.7 44.0 66.0 84.2 72.2 69.4
Baseline 13.5 10.6 31.5 13.8 4.3 14.8 75.1 70.7 57.0 41.6 76.6 81.7 46.6 72.4 73.8 76.9 68.8 43.1 63.0 78.8 69.9 66.4
BCEwithLogits 10.56 17.23 5.64 33.62 0.91 13.59 74.13 69.35 54.19 38.89 75.62 80.05 48.07 61.85 72.03 74.75 64.59 38.88 58.63 77.72 68.27 63.8
Re-BCEwithLogits 5.89 38.39 14.65 32.00 0.51 18.29 73.1 69.84 55.76 40.42 76.81 78.51 46.1 63.73 75.78 77.92 68.69 40.41 59.85 79.28 70.79 65.13
Focal 11.18 1.17 13.66 25.83 0.1 10.39 74.61 73.96 53.91 41.82 53.8 78.28 48.87 24.36 78.16 76.08 71.31 42.12 64.13 81.33 70.97 62.25
Re-Focal 11.41 6.89 22.09 28.55 1.89 14.16 72.78 69.65 55.88 40.02 77.14 82.69 47.32 67.15 78.6 77.33 68.6 40.36 62.98 80.14 70.69 66.09
Cross-Entropy 12.49 11.16 24.72 28.79 3.03 16.04 72.85 69.98 55.87 43.31 76.53 77.72 45.37 69.87 75.76 75.98 69.72 40.07 59.27 78.15 70.13 65.37
Re-Cross-Entropy 12.35 16.28 13.23 33.41 0.47 15.15 73.02 70.19 60.02 41.78 77.59 80.05 44.86 66.17 78.08 75.95 70 40 59.47 79.68 68.27 65.68
TCL-2 18.21 18.02 15.38 24.2 11.99 17.56 69.95 72.15 55.51 37.15 76.74 81.57 45.66 68.99 72.95 75.79 68.26 41.47 62.86 78.64 70.91 65.24
Ours 19.18 22.58 7.62 30.81 10.16 18.07 72.43 76.21 56.25 40.45 77.07 81.02 47.89 67.63 75.06 74.2 69.72 39.38 61.51 76.96 67.46 65.55
2 YOLO-joint 0.0 0.0 0.0 0.0 0.0 0.0 77.6 77.6 60.4 48.1 81.5 82.6 51.5 72.0 79.2 78.8 75.2 47.0 65.2 86.0 72.7 70.4
YOLO-ft 11.5 5.8 7.6 0.1 7.5 6.5 77.9 75.0 58.5 45.7 77.6 84.0 50.4 68.5 79.2 79.7 73.8 44.0 66.0 77.5 72.9 68.7
YOLO-ft-full 16.6 9.7 12.4 0.1 14.5 10.7 76.4 70.2 56.9 43.3 77.5 83.8 47.8 70.7 79.1 77.6 71.7 39.6 61.4 77.0 70.3 66.9
Baseline 21.2 12.0 16.8 17.9 9.6 15.5 74.6 74.9 56.3 38.5 75.5 68.0 43.2 69.3 66.2 42.4 68.1 41.8 59.4 76.4 70.3 61.7
BCEwithLogits 19.22 12.11 21.55 14.98 5.7 14.71 71.83 70.38 54.28 37.96 75.46 81.24 44.12 71.37 77.53 76.8 69.95 39.44 54.8 72.74 66.68 64.3
Re-BCEwithLogits 15.18 6.49 27.69 26.89 22.27 19.71 72.21 72.97 55.33 39.56 74.71 77.94 43.68 65.32 77.86 77.43 68.3 39.71 57.31 76.16 68.11 64.44
Focal 19.13 11.25 28.13 15.09 2.56 15.23 74.46 77.02 54.96 43.77 77.86 82.33 45.03 68.47 79.01 75.34 72.02 43.38 62.98 79.66 68.55 66.99
Re-Focal 12.45 3.88 37.62 19.14 6.28 15.88 72.29 72.52 59.1 44.79 78.15 82.47 47.14 68.25 79.49 76.44 72.27 41.79 67.79 75.44 69.9 67.19
Cross-Entropy 24.64 7.22 27.93 27.00 9.19 19.2 74.52 70.24 55.12 42.94 77.14 78.71 44.82 69.75 76.92 69.73 69.15 41.45 58.8 73.04 68.76 64.74
Re-Cross-Entropy 23.43 11.61 19.13 37.85 12.04 20.81 71.79 70.63 59.38 41.65 77.42 81.35 44.56 66.39 78.41 56.23 70.88 40.77 52.71 73.85 68.52 63.64
TCL-2 23.02 8.3 36.51 27.34 16.09 22.25 72.65 70.28 54.02 36.6 76.41 80.82 42.55 67.26 75.97 77.13 69.15 41.8 60.92 73.06 68.71 64.49
Ours 25.18 12.29 31.87 32.71 21.23 24.66 72.31 73.62 55.07 38.23 74.73 81.85 44.37 65.41 75.94 75.85 69.93 39.81 56.93 73.31 65.72 64.21
3 YOLO-joint 0 0 0 0 9.1 1.8 78.0 77.2 61.2 45.6 81.6 83.7 51.7 73.4 80.7 79.6 75.0 45.5 65.6 83.1 72.7 70.3
YOLO-ft 10.9 5.5 15.3 0.2 0.1 6.4 76.7 77.0 60.4 46.9 78.8 84.9 51.0 68.3 79.6 78.7 73.1 44.5 67.6 83.6 72.4 69.6
YOLO-ft-full 21.0 22.0 19.1 0.5 0.0 12.5 73.4 67.5 56.8 41.2 77.1 81.6 45.5 62.1 74.6 78.9 67.9 37.8 54.1 76.4 71.9 64.4
Baseline 26.1 19.1 40.7 20.4 27.1 26.7 73.6 73.1 56.7 41.6 76.1 78.7 42.6 66.8 72.0 77.7 68.5 42.0 57.1 74.7 70.7 64.8
BCEwithLogits 24.71 22.63 29.19 29.22 25.75 26.3 71.62 70.05 54.66 37.51 75.44 79.7 46.09 63.79 75.58 77.37 67.16 40.1 49.63 73.3 66.12 63.21
Re-BCEwithLogits 23.03 29.91 22.74 19.06 40.2 26.99 70.37 72.92 56.98 37.24 75.72 79.38 45.12 66.07 74.99 78.59 65.85 40.35 49.45 77.29 68.68 63.93
Focal 10.3 16.38 33.76 17.11 14.24 18.36 73.33 74.27 55.78 43.21 79.32 82.48 47.69 70.4 76.97 79.0 70.32 42.39 61.53 81.29 70.48 67.23
Re-Focal 25.98 11.44 37.02 20.43 20.79 23.13 72.58 70.06 58.43 42.27 78.28 80.97 47.52 66.97 78.47 76.03 68.9 41.7 60.5 75.51 69.02 65.81
Cross-Entropy 28.17 14.07 32.98 24.2 27.86 25.46 73.07 70.39 56.28 41.02 73.88 80.17 45.97 70.5 75.79 77.58 68.22 40.2 55.15 76.83 69.07 64.94
Re-Cross-Entropy 23.25 18.64 37.54 24.33 26.62 26.07 70.03 71.94 56.92 40.92 77.53 81.16 47.0 68.24 76.68 77.33 69.29 41.12 51.47 77.53 69.22 65.09
TCL-2 31.15 23.29 23.97 19.96 29.49 25.57 67.69 66.26 51.65 35.66 76.84 80.62 45.79 66.01 74.73 77.15 65.55 40.55 57.96 77.71 69.72 63.59
Ours 32.1 20.53 30.64 28.52 42.9 30.94 71.19 69.39 51.81 37.66 76.41 81.63 45.39 64.51 75.41 76.11 69.45 39.09 54.88 75.38 67.9 63.75
5 YOLO-joint 0.0 0.0 0.0 0.0 9.1 1.8 77.8 76.4 65.7 45.9 79.5 82.3 50.4 72.5 79.1 79.0 75.5 47.9 67.2 83.0 72.5 70.3
YOLO-ft 11.6 7.1 10.7 2.1 6.0 7.5 76.5 76.4 61.0 45.5 78.7 84.5 49.2 68.7 78.5 78.1 73.7 45.4 66.8 85.3 70.0 69.2
YOLO-ft-full 20.2 20.0 22.4 36.4 24.8 24.8 72.0 70.6 60.7 42.0 76.8 84.2 47.7 63.7 76.9 78.8 72.1 42.2 61.1 80.8 69.9 66.6
Baseline 31.5 21.1 39.8 40.0 37.0 33.9 69.3 57.5 56.8 37.8 74.8 82.8 41.2 67.3 74.0 77.4 70.9 40.9 57.3 73.5 69.3 63.4
BCEwithLogits 30.15 24.25 32.32 52.43 36.83 35.2 70.62 71.46 54.45 38.34 75.34 80.7 47.78 65.24 73.84 77.03 65.86 36.92 48.75 73.89 65.78 63.07
Re-BCEwithLogits 25.41 36.34 25.49 46.96 42.29 35.3 69.14 74.33 55.5 37.18 76.77 79.12 45.9 64.79 76.47 77.61 67.74 38.64 53.98 78.21 68.97 64.29
Focal 28.07 20.89 40.79 49.65 31.03 34.09 73.52 68.13 56.93 39.23 76.85 80.54 44.15 67.99 77.37 75.21 69.25 40.05 59.0 78.25 70.47 65.13
Re-Focal 18.82 21.87 38.92 33.81 22.6 27.2 73.12 73.52 60.79 42.24 78.96 83.06 49.59 68.32 79.09 77.7 71.54 41.5 66.99 80.1 70.96 67.83
Cross-Entropy 33.26 25.01 40.64 38.19 42.11 35.84 71.91 70.27 58.51 40.01 76.6 80.33 45.54 69.93 78.19 77.44 70.96 40.05 58.74 80.47 69.78 65.92
Re-Cross-Entropy 29.4 14.81 37.89 49.28 35.85 33.45 70.98 70.31 56.94 41.35 77.53 83.88 45.11 67.13 76.83 78.14 70.17 41.45 56.7 76.74 67.41 65.38
TCL-2 29.84 42.46 30.11 48.26 41.58 38.45 67.27 60.54 52.2 32.59 74.41 81.03 33.61 62.9 67.29 75.62 64.23 32.44 56.2 71.65 66.99 59.93
Ours 30.35 37.51 30.7 55.16 41.49 39.04 67.82 67.17 48.47 33.66 73.42 78.18 39.45 61.11 69.54 74.0 68.07 36.5 55.67 71.31 65.04 60.63
Table 5: The results of detection APs. For few-shot detection on the Pascal VOC dataset, ours significantly outperforms the others for novel set 2.
Novel Base
Shot Method aero bottle cow horse sofa mean bike bird boat bus car cat chair table dog mbike person plant sheep train tv mean
1 YOLO-joint 0.0 0.0 0.0 0.0 0.0 0.0 78.8 73.2 63.6 79.0 79.7 87.2 51.5 71.2 81.1 78.1 75.4 47.7 65.9 84.0 73.7 72.7
YOLO-ft 0.4 0.2 10.3 29.8 0.0 8.2 77.9 70.2 62.2 79.8 79.4 86.6 51.9 72.3 77.1 78.1 73.9 44.1 66.6 83.4 74.0 71.8
YOLO-ft-full 0.6 9.1 11.2 41.6 0.0 12.5 74.9 67.2 60.1 78.8 79.0 83.8 50.6 72.7 75.5 74.8 71.7 43.9 62.5 81.8 72.6 70.0
Baseline 11.8 9.1 15.6 23.7 18.2 15.7 77.6 62.7 54.2 75.3 79.0 80.0 49.6 70.3 78.3 78.2 68.5 42.2 58.2 78.5 70.4 68.2
BCEwithLogits 10.12 0.7 7.86 39.72 17.1 15.1 74.81 64.83 52.22 73.61 76.19 78.31 46.72 66.34 78.4 77.34 67.81 37.57 58.53 74.82 70.73 66.55
Re-BCEwithLogits 10.74 0.51 7.39 31.33 5.05 11 73.86 62.79 49.72 72.46 77.28 80.37 44.56 66.54 76.61 75.87 65.98 38.26 57.72 71.58 68.98 65.5
Focal 10.76 0.06 16.13 19.98 1.06 9.6 75.75 65.05 57.9 77.11 77.29 81.48 46.04 62.73 75.47 74.19 69.38 40.55 54.75 78.08 69.18 67
Re-Focal 9.48 0.03 19.36 7.48 0.13 7.3 78.53 68.12 59.02 77.91 78.58 83.09 50.56 57.57 72.67 76.19 69.51 42.75 59.86 72.39 70.87 67.8
Cross-Entropy 14.01 0.34 10.89 25.89 9.88 12.19 73.46 62.63 52.58 74.92 76.8 81.12 46.33 66.23 79.3 74.31 68.15 38.79 56.06 76.86 69.38 66.46
Re-Cross-Entropy 9.28 0.22 21.57 25.2 9.09 13.07 75.33 64.84 54.94 75.81 77.84 82.48 47.64 63.8 78.3 75.53 70.4 38.38 57.96 76.01 67.7 67.13
TCL-2 0.76 0.07 16.96 29.59 13.99 12.27 75.82 60.6 54.09 75.65 77.95 84.08 48.81 64.8 73.36 77.1 71.26 40.53 58.01 78.17 69.27 67.3
Ours 16.33 9.09 8.7 46.9 16.08 19.42 72.71 62.94 45 74.97 75.29 77.39 44.73 64.71 67.84 74.99 68.65 38.36 54.3 76.09 68.95 64.46
2 YOLO-joint 0.0 0.6 0.0 0.0 0.0 0.1 78.4 69.7 64.5 78.3 79.7 86.1 52.2 72.6 81.2 78.6 75.2 50.3 66.1 85.3 74.0 72.8
YOLO-ft 0.2 0.2 17.2 1.2 0.0 3.8 78.1 70.0 60.6 79.8 79.4 87.1 49.7 70.3 80.4 78.8 73.7 44.2 62.2 82.4 74.9 71.4
YOLO-ft-full 1.8 1.8 15.5 1.9 0.0 4.2 76.4 69.7 58.0 80.0 79.0 86.9 44.8 68.2 75.2 77.4 72.2 40.3 59.1 81.6 73.4 69.5
Baseline 28.6 0.9 27.6 0.0 19.5 15.3 75.8 67.4 52.4 74.8 76.6 82.5 44.5 66.0 79.4 76.2 68.2 42.3 53.8 76.6 71.0 67.2
BCEwithLogits 28.39 0.61 28.6 0.53 19.97 15.62 74.24 67.99 51.14 75.32 76.18 79.04 45.98 66.83 77.66 77.48 68.75 38.91 55.13 75.69 70.05 66.69
Re-BCEwithLogits 21.59 0.22 25.02 0.19 17.96 13 75.95 64.26 48.5 76.34 77.12 81.84 43.76 67.99 78.27 76.22 68.65 42.37 54.22 76.54 68.03 66.67
Focal 26.74 1.3 13.04 0.66 2.59 8.87 73.85 67.33 54.91 77.18 78.56 82.54 44.52 63.84 80.36 77.87 69.33 42.3 58.35 77.97 69.95 67.92
Re-Focal 14.44 0.29 24.55 0.13 3.94 8.67 77.32 70.66 56.19 78.09 78.07 83.41 48.05 65.54 79.71 78.13 68.42 42.57 59.41 76.83 70.55 68.86
Cross-Entropy 28.6 0.9 27.6 0 19.5 15.3 75.8 67.4 52.4 74.8 76.6 82.5 44.5 66 79.4 76.2 68.2 42.3 53.8 76.6 71 67.2
Re-Cross-Entropy 28.4 0.13 28.42 0.57 15.15 14.53 75.4 67.05 57.57 75.18 77.82 82.17 49.28 71.61 79.59 76.72 70.17 42.48 59.89 76.29 68.35 68.64
TCL-2 34.66 0.83 31.32 1.01 18.82 17.33 69.88 61.47 50.09 72.15 73.96 77.26 38.45 61.21 73.48 74.39 66.28 37 50.94 71.34 66.07 62.93
Ours 34.14 0.49 34.6 2.27 15.65 17.43 77.56 65.59 51.74 72.95 76.14 79.49 42.88 64.67 76.65 75.98 70.37 41.08 55.8 75.15 68.4 66.3
3 YOLO-joint 0.0 0.0 0.0 0.0 0.0 0.0 77.6 72.2 61.2 77.9 79.8 85.8 49.9 73.2 80.0 77.9 75.3 50.8 64.3 84.2 72.6 72.2
YOLO-ft 4.9 0.0 11.2 1.2 0.0 3.5 78.7 71.6 62.4 77.4 80.4 87.5 49.5 70.8 79.7 79.5 72.6 44.3 60.0 83.0 75.2 71.5
YOLO-ft-full 10.7 4.6 12.9 29.7 0.0 11.6 74.9 69.2 60.4 79.4 79.1 87.3 43.4 69.7 75.8 75.2 70.5 39.4 52.9 80.8 73.4 68.8
Baseline 29.4 4.6 34.9 6.8 37.9 22.7 62.6 64.7 55.2 76.6 77.1 82.7 46.7 65.4 75.4 78.3 69.2 42.8 45.2 77.9 69.6 66.0
BCEwithLogits 32.19 9.09 29.2 18.99 41.25 26.14 71.29 64.92 50.96 75.89 76.87 80.89 47.81 67.03 75.62 75.87 65.67 37.94 40.8 78.44 70.66 65.38
Re-BCEwithLogits 28.57 1.59 30.45 12.91 30.16 20.74 69.18 62.13 50.28 75.55 77.51 81.86 44.31 67.74 76.39 76.9 65.58 40.88 48.33 77.22 66.98 65.37
Focal 41.55 0.14 30.81 5.24 23.06 20.16 68.15 65.43 53.92 78.67 77.53 81.51 46.51 62.56 79.72 76.13 70.46 41.11 53.42 76.39 68.93 66.7
Re-Focal 23.68 0.14 30.79 11.23 15.91 16.35 71.85 69.68 52.25 78.67 78.78 83.68 50.77 66.81 78.45 76.73 70.02 40.98 56.54 76.72 71.39 68.22
Cross-Entropy 36.47 0.83 35.42 4.65 24.2 20.31 73.66 64.98 51.99 75.61 77.96 84.05 46.85 66.4 75.29 79.23 70.37 41.69 52.23 78.26 69.17 67.18
Re-Cross-Entropy 35.59 1.3 38.54 12.54 31.71 23.93 70.71 67.85 48.91 76.36 78.06 82.38 49.33 68.88 78.77 78.7 69.41 40.87 56.92 77.58 68.21 67.53
TCL-2 44.05 9.09 29.74 30.28 40.9 30.81 64.75 60.22 50.98 73.25 74.53 75.68 42.18 59.55 69.3 73.94 62.4 37.19 42.58 71.69 67.77 61.73
Ours 43.57 3.03 26.28 8.77 34.58 23.24 74.94 65.33 53.83 73.41 77.16 80.42 46.47 62.36 74.49 77.22 68.61 42.38 56.93 78.05 69.73 66.76
5 YOLO-joint 0.0 0.0 0.0 0.0 9.1 1.8 78.0 71.5 62.9 81.7 79.7 86.8 50.0 72.3 81.7 77.9 75.6 48.4 65.4 83.2 73.6 72.6
YOLO-ft 0.8 0.2 11.3 5.2 0.0 3.5 78.6 72.4 61.5 79.4 81.0 87.8 48.6 72.1 81.0 79.6 73.6 44.9 61.4 83.9 74.7 72.0
YOLO-ft-full 10.3 9.1 17.4 43.5 0.0 16.0 76.4 69.6 59.1 80.3 78.5 87.8 42.1 72.1 76.6 77.1 70.7 43.1 58.0 82.4 72.6 69.8
Baseline 33.1 9.4 38.4 25.4 44.0 30.1 73.2 65.6 52.9 75.9 77.5 80.0 43.7 65.0 73.8 78.4 68.9 39.2 56.4 78.0 70.8 66.6
BCEwithLogits 34.95 9.6 39.77 38.42 35.25 31.6 62.67 60.27 47.94 73.84 76.02 81.24 36.22 58.34 70.32 76.77 64.08 34.21 44.07 67.14 68.85 61.46
Re-BCEwithLogits 39.54 10.18 29.8 39.16 41.09 31.95 68.47 62.8 50.71 75.66 75.9 82.76 38.52 64.22 66.51 76.59 65.42 36.26 47.94 77.55 66.9 63.75
Focal 36.72 9.25 38.45 19.4 33.85 27.54 71.35 63.31 50.51 78.66 77.23 82.49 43.95 60.1 76.13 77.55 70 37.36 53.92 77.12 69.18 65.92
Re-Focal 39.04 9.27 37.96 28.32 28.4 28.6 68.26 66.08 51.29 78.9 77.25 83.35 44.7 65.53 76.78 74.94 68.89 38.34 54.3 77.37 70.51 66.43
Cross-Entropy 37.34 9.86 34.27 38.2 39.86 31.91 70.22 59.53 47.57 74.56 75.95 82.2 43.24 65.92 68.43 76.71 68.36 35.22 50.16 77.82 67.55 64.23
Re-Cross-Entropy 38.61 9.65 42.49 39.82 47.31 35.58 67.01 62.41 52.4 74.6 74.65 80.34 42.02 67.56 73.28 75.78 67 38.57 50.48 74.02 67.88 64.53
TCL-2 47.64 9.41 36.71 41.67 39.97 35.08 64.22 60.49 52.15 75.61 73.65 78.92 36.39 58.68 70.4 75.1 62.92 32.91 45.41 75.02 67.69 61.97
Ours 44.66 9.88 37.51 55.49 40.75 37.66 64.54 63.48 50.68 75.2 74.96 77.02 38.67 59.23 71.05 75.06 64.4 35.82 52.06 74.45 68.15 62.98
Table 6: The results of detection APs. For few-shot detection on the Pascal VOC dataset, our method significantly outperforms the others for novel set 3.
Figure 5: Qualitative 2-shot boat detection results on our test set for novel set1. We visualize the bounding boxes of all methods.
Figure 6: Qualitative 2-shot cat detection results on our test set for novel set1. We visualize the bounding boxes of all methods.
Figure 7: Qualitative 2-shot mbike detection results on our test set for novel set1. We visualize the bounding boxes of all methods.
Figure 8: Qualitative 2-shot sheep detection results on our test set for novel set1. We visualize the bounding boxes of all methods.
Figure 9: Qualitative 2-shot sofa detection results on our test set for novel set1. We visualize the bounding boxes of all methods.