Conventional methods for object detection usually require a substantial amount of training data, and preparing such high-quality training data is labor intensive. In this paper, we propose few-shot object detection, which aims to detect objects of unseen classes with only a few training examples. Central to our method are the Attention-RPN and the multi-relation module, which fully exploit the similarity between the few-shot training examples and the test set to detect novel objects while suppressing false detections in the background. To train our network, we have prepared a new dataset which contains 1000 categories of various objects with high-quality annotations. To the best of our knowledge, this is also the first dataset specifically designed for few-shot object detection. Once our network is trained, we can apply object detection to unseen classes without further training or fine-tuning. This is also the major advantage of few-shot object detection. Our method is general and has a wide range of applications. We demonstrate the effectiveness of our method quantitatively and qualitatively on different datasets. The dataset link is: https://github.com/fanq15/Few-Shot-Object-Detection-Dataset.
Object detection has a wide range of applications. Existing object detection methods usually rely heavily on a huge amount of annotated data and a long training time. They are also hard to extend to unseen objects that were not annotated in the training data. In contrast, the human visual system is excellent at recognizing new objects even with little supervision. This inspires us to develop a few-shot object detection algorithm.
Few-shot learning is challenging due to the large variance of objects in illumination, shape, and texture in the real world. In recent years, it has achieved great progress [1, 2, 3, 4, 5, 6, 7, 8]. These methods, however, all focus on image classification, and few-shot object detection has rarely been explored. This is because transferring experience from few-shot classification to few-shot object detection is a non-trivial task. Object detection from a few shots usually confronts one crucial problem: how to localize an unseen object in a cluttered background, which is a generalization problem of object localization from a few training examples of novel categories. The candidate bounding boxes may miss the unseen objects entirely or produce many false detections on the background. This could be due to inappropriately low scores being assigned to good bounding boxes in a region proposal network (RPN), which makes a novel object hard to detect. This problem makes few-shot object detection intrinsically different from few-shot classification.
In this work, we aim to solve the problem of few-shot object detection. Given a few support images of a target object, our goal is to detect all foreground objects in the test set that belong to the target object category, as shown in Fig. 1. To this end, we make two major contributions. First, we propose a general few-shot detection model that can be applied to detect novel objects without re-training or fine-tuning. Our method fully exploits the matching relationship between object pairs in a siamese network at multiple network stages. Experiments show that our model benefits from the attention module at an early stage, which enhances the proposal quality, and from the multi-relation module at a late stage, which suppresses and filters out false detections on confusing backgrounds. Second, to train the model, we build a large, well-annotated dataset with 1000 categories and a few examples for each category. This dataset promotes the general learning of object detection. Our method achieves much better performance when trained on this dataset than when trained on COCO, a benchmark dataset of larger scale. To the best of our knowledge, this is the first attempt to construct a few-shot object detection dataset with so many classes.
In summary, we propose a novel model for few-shot object detection with a carefully designed attention module on the RPN and the detector. Thanks to the matching strategy, our model has the important property that its training stage is exactly the same as its testing stage. This enables our model to be tested online on objects of novel categories, with no pre-training or further network adaptation required.
General Object Detection.
Object detection is a classical problem in computer vision. In early years, object detection was usually formulated as a sliding-window classification problem using handcrafted features [11, 12, 13]. With the rise of deep learning, CNN-based methods have become the dominant object detection solution. Most of these methods can be divided into two general approaches: proposal-free detectors and proposal-based detectors. The first line of work follows a one-stage training strategy and does not generate proposal boxes explicitly [15, 16, 17, 18, 19]. The second line, pioneered by R-CNN, first extracts class-agnostic region proposals of the potential objects in an image; these boxes are then further refined and classified into different categories by a specific module [21, 22, 23, 24]. This strategy filters out the majority of negative locations with the RPN module and eases the work of the subsequent detector. For this reason, RPN-based methods usually perform better than proposal-free ones and lead the state of the art in the detection task. The methods mentioned above, however, work in an intensively supervised manner and are hard to extend to novel categories with only several examples. This fact has aroused research interest in few-shot learning.
Few-shot learning. Few-shot learning is a classic problem in which traditional machine learning algorithms are challenged to learn from just a few training examples. Earlier works attempted to learn a general prior [26, 27, 28, 29, 30], such as hand-designed strokes or parts, that can be shared across categories. Some other works [31, 32, 1, 33] focus on metric learning, which aims at manually designing the distance formulation among different categories. A recent trend is to design a general agent or strategy that guides the supervised learning within each task; by accumulating knowledge, the network gains the ability to capture how structure varies across different tasks. This research direction is generally called meta-learning [10, 5, 34, 2, 35]. In this area, a siamese network was proposed that consists of twin networks sharing weights, where each network is fed a support image and a query image respectively. The distance between the query and its support is naturally learned by a logistic regression. This matching strategy intrinsically captures the variation between support and query regardless of their categories. Within the matching framework, subsequent works [4, 36, 8, 37, 3, 6] focus on enhancing the feature embedding, where one direction is to build memory modules that capture global context among the supports. Some works [38, 39] exploit local descriptors to reap additional knowledge from the limited data. Others [40, 41] introduce Graph Neural Networks (GNNs) to model the relationship between different categories. Another approach traverses the entire support set and identifies task-relevant features to make metric learning in high-dimensional space more effective. There are also other works, such as [2, 43], which are dedicated to learning a general agent to guide the parameter optimization.
Until now, few-shot learning has achieved much progress. It has, however, mostly focused on the classification task and rarely on other computer vision tasks, such as semantic segmentation [44, 45, 46], human motion prediction, and object detection. As mentioned before, few-shot object detection is a task intrinsically different from general few-shot learning on classification. One existing approach harnesses unlabeled data and alternately optimizes multiple modules on images without boxes. It may be misled by incorrect detections under weak supervision and requires re-training for each new category, so it is out of our scope. LSTD proposed a novel few-shot object detection framework that can transfer knowledge from one large dataset to another small one by minimizing the gap in classification posterior probability between the source domain and the target domain. This method, however, strongly depends on the source domain and is hard to extend to a scenario very different from it.
Our work is mainly motivated by the research line pioneered by the matching network. We propose a general few-shot object detection deep network that learns the matching on image pairs based on the Faster R-CNN framework, which is equipped with our multi-scale and shaped attentions.
The key to few-shot learning lies in the generalization ability of the model on novel categories. Thus, a high-diversity dataset with a large number of object categories is necessary, both to train a model that is general enough to detect unseen objects and to provide a convincing evaluation. However, existing datasets [9, 50, 51, 52, 53] contain very limited categories and do not follow the few-shot evaluation setting. To this end, we build a new few-shot object detection dataset. In this section, we introduce our dataset and give a full analysis of it.
|Statistic||Train||Test|
|Avg No. Box / Img||2.82||2.48|
|Min No. Img / Cls||22||30|
|Max No. Img / Cls||208||199|
|Avg No. Img / Cls||75.65||74.31|
|Box Size||[6, 6828]||[13, 4605]|
|Box Area Ratio||[0.0009, 1]||[0.0009, 1]|
|Box W/H Ratio||[0.0216, 89]||[0.0199, 51.5]|
Dataset Construction We build our dataset from existing large-scale supervised object detection datasets [54, 52]. These datasets, however, cannot be used directly because 1) the label systems of different datasets are inconsistent, with some objects of the same semantics labeled with different words; 2) a large portion of the annotations is less than satisfactory due to inaccurate and missing labels, duplicate boxes, overly large objects, and similar problems; and 3) their train/test splits contain the same categories, while for a few-shot dataset we want the train and test sets to contain different categories in order to evaluate generality on unseen objects. To build the dataset, we first summarize a label system from [54, 52]: we merge the leaf labels in their original label trees, grouping those with the same semantics, such as ice bear and polar bear, into one category, and removing semantics that do not belong to any leaf category. Then, we remove the images with bad labeling quality and those with boxes of improper size. We remove boxes smaller than 0.05% of the image size, which are usually of poor visual quality and unsuitable as support examples. Next, we follow the few-shot learning principle and split our data into a training set and a test set whose categories do not overlap. We construct the training set with the categories in the MS COCO dataset and the ImageNet dataset in case researchers need a pretraining stage. We then form the test set, which contains 200 categories, by choosing those with the largest distance from the existing training categories, where the distance is the length of the shortest path connecting the senses of two phrases in the is-a taxonomy. The remaining categories are merged into the training set, which contains 800 categories in total. In all, we construct a dataset of 1000 categories with a very clear category split for training and testing, where 531 categories come from the ImageNet dataset and 469 from the Open Image dataset.
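The test-split selection described above can be illustrated with a small sketch. It assumes a toy is-a taxonomy stored as child-to-parent links and scores each candidate by its shortest-path distance to the nearest training category; the taxonomy, the category names, and the "nearest training category" rule are hypothetical choices made for illustration, not the label tree or the exact criterion used for the dataset.

```python
from collections import deque

# Hypothetical toy is-a taxonomy (child -> parent); not the real label tree.
IS_A = {
    "polar bear": "bear",
    "brown bear": "bear",
    "bear": "mammal",
    "dog": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
    "turban": "clothing",
    "clothing": "artifact",
}


def build_graph(is_a):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in is_a.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_length(graph, src, dst):
    """Number of edges on the shortest path, or None if disconnected."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def distance_to_training(graph, candidate, train_categories):
    """Distance from a candidate to its closest training category."""
    dists = [path_length(graph, candidate, t) for t in train_categories]
    dists = [d for d in dists if d is not None]
    return min(dists) if dists else float("inf")


def pick_test_categories(graph, candidates, train_categories, k):
    """Pick the k candidates that lie farthest from the training categories."""
    ranked = sorted(
        candidates,
        key=lambda c: distance_to_training(graph, c, train_categories),
        reverse=True,
    )
    return ranked[:k]
```

With training categories `dog` and `brown bear`, the sketch prefers `turban` and `sparrow` over `polar bear` as test categories, because `polar bear` sits only two edges away from a training category.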
Dataset Analysis Our dataset is specifically designed for few-shot learning and for evaluating the generality of a model on novel categories. It contains 1000 categories with an 800/200 split for the training and test sets respectively, and around 66,000 images and 182,000 bounding boxes in total. Detailed statistics are shown in Table 1 and Fig. 3. Our dataset has the following attributes.
High diversity of categories: Our dataset contains 83 parent semantics, such as mammal, clothing, and weapon, which are further split into 1000 leaf categories. Our label tree is shown in Fig. 2. Due to our strict dataset split, our train and test sets have very different semantic categories, which makes the evaluation highly challenging for models.
Challenging setting: Our dataset is challenging for evaluation. It contains objects with large variance in box size and aspect ratio, and 26.5% of the images in the test set contain no fewer than three objects. It is worth noting that our test set contains a large number of boxes of other categories which are not included in our label system. This also creates a great challenge for a few-shot model.
Even though our dataset has a large number of categories, it has far fewer training images and boxes than a benchmark dataset such as MS COCO, which contains 123,287 images and around 886,000 bounding boxes. Thus, our dataset is compact but efficient for few-shot learning.
In this section, we will introduce our novel few-shot object detection network. Before that, we first introduce the task of few-shot detection.
We define the task of few-shot object detection as follows. Given a support image $s_c$ with a close-up of one target object and a separate query image $q_c$ which potentially contains objects of the support category $c$, the task is to find all the objects of the support category in the query and label them with tight bounding boxes. If the support set contains $N$ categories and $K$ examples for each category, we call it $N$-way $K$-shot detection.
As is well known, few-shot recognition is challenging because it is hard to capture what a category has in common from just a few examples. To this end, we propose an intensive-attention network that learns a general matching relationship between the support set and queries on both the RPN module and the detector. Fig. 4 shows the overall architecture of our network.
In particular, we build a siamese framework with two weight-sharing branches, one for the support set and the other for the query set. The query branch of the siamese framework is a Faster R-CNN network, which contains two stages: the RPN and the detector. We use this framework to train the matching relationship between support and query features, so that the network learns general knowledge and the ability to capture what objects of the same category have in common. Based on this framework, we introduce a novel attention RPN and a detector with multi-relation modules to produce an accurate parsing between the support and the potential boxes in the query.
In few-shot object detection, the RPN is a useful tool that can filter out a large number of potential boxes and reduce the pressure on the subsequent detector. The RPN not only distinguishes between objects and non-objects, but also filters out unseen objects that were not annotated in the training data. However, without any support image information, the RPN will blindly activate every potential object with a high objectness score, even if it does not belong to the support category. A large number of irrelevant objects will burden the subsequent classification in the detector. For this reason, we propose the Attention RPN, which uses support information to filter out the majority of background boxes and those of non-matching categories. Based on that, we can generate a smaller set of candidate proposals with a high potential to contain the target objects. The framework is shown in Fig. 5.
We introduce support information to the RPN through an attention mechanism, which guides the RPN to produce more relevant proposals and to suppress proposals of other categories. We calculate the similarity between the feature map of the support and that of the query in a depth-wise manner. The similarity map is then used to build the proposal generation. In particular, let $X \in \mathbb{R}^{S \times S \times C}$ denote the support features and $Y \in \mathbb{R}^{H \times W \times C}$ the feature map of the query. We formulate the similarity calculation as

$$G_{h,w,c} = \sum_{i,j} X_{i,j,c} \cdot Y_{h+i-1,\,w+j-1,\,c}, \qquad i, j \in \{1, \dots, S\}, \tag{1}$$

where $G$ is the resultant attention feature map. Here the support features $X$ are used as the kernel that slides over the query feature map, and a depth-wise convolution between them is then calculated. This procedure amounts to drawing the attention of the support onto the query. In our work, we adopt the features of the top layers of the RPN model, the res4-6 in ResNet50. We find that a kernel size of $S = 1$ performs well in our case. This is consistent with the earlier observation that the global feature can provide a good object prior for objectness classification. In our case, the kernel is calculated by averaging over the support feature map. Our attention map is followed by a convolution and then by the ROI pooling and objectness classification layers.
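The depth-wise similarity described above can be sketched in a few lines. This is a minimal, dependency-free illustration that stores feature maps as nested lists indexed `[channel][row][column]`; the function names are hypothetical, and an actual implementation would more likely use a grouped convolution in a deep learning framework.

```python
def average_pool_support(support):
    """Average each channel of a support map [C][H][W] into a 1x1 kernel."""
    kernel = []
    for channel in support:
        values = [v for row in channel for v in row]
        kernel.append([[sum(values) / len(values)]])
    return kernel


def depthwise_cross_correlation(kernel, query):
    """Slide a per-channel SxS kernel over the query map, one channel at a time.

    Only valid positions are evaluated (no padding), so each output channel
    has size (H - S + 1) x (W - S + 1).
    """
    attention = []
    for k_c, q_c in zip(kernel, query):
        s = len(k_c)
        h_out = len(q_c) - s + 1
        w_out = len(q_c[0]) - s + 1
        channel = [
            [
                sum(k_c[i][j] * q_c[h + i][w + j] for i in range(s) for j in range(s))
                for w in range(w_out)
            ]
            for h in range(h_out)
        ]
        attention.append(channel)
    return attention
```

With a 1x1 averaged kernel, the operation reduces to scaling every query channel by the mean activation of the matching support channel, which is why the attention map keeps the spatial size of the query feature map.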
In an R-CNN framework, the RPN module is followed by a detector, which plays the important role of re-scoring the proposals and recognizing their classes. Therefore, we want the detector to have a strong discriminative ability to distinguish different categories. To this end, we propose a novel multi-relation detector to effectively measure the similarity between proposal boxes from the query and the support objects. The detector includes three attention modules: the patch-relation head, which learns a deep non-linear metric for patch matching; the global-relation head, which learns a deep embedding for global matching; and the local-correlation head, which learns the pixel-wise and depth-wise cross correlation between support and query proposals. We show experimentally that the three matching modules complement each other and that performance increases as they are added one by one. We describe the details of our multi-relation detector below.
In the patch-relation head, we first concatenate the support and query proposal feature maps in depth. The combined feature map is then fed into the patch-relation module, whose structure is shown in Table 2. All the convolution and pooling layers in this module use zero padding, so the spatial size of the feature map is reduced before it is used as input for the binary classification and regression heads. This module is compact and efficient. We briefly explored the structure of the model and found that replacing the two average pooling layers with convolutions does not improve our model further.
The global-relation head extends the patch relation to model the global-embedding relation between the support and query proposals. Given the concatenated feature of a support and its query proposal, we average-pool the feature into a vector. We then use an MLP with two fully connected (fc) layers followed by ReLU and a final fc layer to generate matching scores.
The local-correlation head computes the pixel-wise and depth-wise similarity between the object ROI feature and the proposal feature, similar to Equ. 1. Different from Equ. 1, we perform the dot product on feature pairs at the same depth. In particular, we first use a weight-shared convolution to process the support and query features individually. We then calculate the depth-wise similarity feature. Finally, a successive fc layer is used to generate matching scores.
We use only the patch-relation head to generate bounding box predictions, that is, the regression on box coordinates, and use the sum of the matching scores from all three heads as the final matching score. Intra-class variance and imperfect proposals make the relation between proposals and support objects complex. Our three relation heads have different attributes and handle this complexity well: the patch-relation head generates a flexible embedding that is able to match across intra-class variance, the global-relation head provides a stable and general matching, and the local-correlation head requires matching on parts.
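The way the three heads are combined can be sketched as follows. The dictionary keys and function names are hypothetical; the sketch only mirrors the rule stated above, namely that the final matching score is the sum of the three head scores and that the box comes from the patch-relation head alone.

```python
def fuse_heads(outputs):
    """Combine the outputs of the three relation heads for one proposal.

    `outputs` is a dict with hypothetical keys:
      patch_score, global_score, local_score : matching scores per head
      patch_box : (x1, y1, x2, y2) regressed by the patch-relation head
    """
    score = outputs["patch_score"] + outputs["global_score"] + outputs["local_score"]
    # Only the patch-relation head contributes a box prediction.
    return score, outputs["patch_box"]


def rank_by_matching_score(all_outputs):
    """Fuse every proposal and sort by the summed matching score, best first."""
    fused = [fuse_heads(o) for o in all_outputs]
    return sorted(fused, key=lambda item: item[0], reverse=True)
```

Summing the scores lets a proposal that one head rates poorly still rank highly when the other two heads agree on it.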
Few-shot object detection differs from classification because of the massive number of background proposals. The model needs not only to distinguish different categories, but also to distinguish the background from the foreground for a target category, and the background proposals usually dominate the training. For this reason, besides the basic single-way training, we propose a novel multi-way training with ranking proposal to solve the problem.
In order to learn the siamese-based detection model, we construct the training episode in the following steps. We randomly choose a query image $q_c$ and one separate support image $s_c$ containing an object of the same $c$-th category to construct an image pair ($q_c$, $s_c$). In this pair, only the $c$-th category object in the query image is labeled as foreground, and all other objects are treated as background. During training, we learn to match every proposal generated by the Attention-RPN in the query image with the object in the support image.
We propose multi-way training to distinguish different categories. In the single-way training strategy, the training episode consists of an ($q_c$, $s_c$) image pair, and the model learns to match objects of the same category, which mainly distinguishes the foreground and background of one category at different IoUs. In multi-way training, we construct an image triplet ($q_c$, $s_c$, $s_n$), where $c \neq n$. With this image triplet, the model needs not only to distinguish the foreground and background of the target category at different IoUs in ($q_c$, $s_c$), but also to distinguish objects of different categories in ($q_c$, $s_n$). However, the pair ($q_c$, $s_n$) contains only background proposals, which leads to an imbalance between foreground and background proposals and harms the matching training procedure. We propose ranking proposal to balance the foreground and background losses. First, we pick all foreground proposals and calculate the foreground loss. Then we calculate the matching scores of all background proposals and rank them according to their matching scores. We select the top-ranked background proposals for the ($q_c$, $s_c$) pair and the top-ranked background proposals for the ($q_c$, $s_n$) pair, and calculate the background loss. The final matching loss is the sum of the foreground loss and the background loss.
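The ranking-proposal step can be sketched as a form of hard-negative selection. The sketch below assumes that matching scores are probabilities in (0, 1), that the per-proposal loss is binary cross-entropy, and that the number of background proposals kept for each pair is given by two hypothetical parameters `k_same` and `k_other`; none of these details is fixed by the text above.

```python
import math


def bce(score, label):
    """Binary cross-entropy for one matching score in (0, 1)."""
    eps = 1e-7
    p = min(max(score, eps), 1.0 - eps)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def top_background(scores, k):
    """Keep the k background proposals with the highest matching scores."""
    return sorted(scores, reverse=True)[:k]


def matching_loss(fg_scores, bg_scores_same, bg_scores_other, k_same, k_other):
    """Foreground loss over all foregrounds plus loss over selected backgrounds.

    fg_scores       : scores of foreground proposals in the same-category pair
    bg_scores_same  : scores of background proposals in the same-category pair
    bg_scores_other : scores of proposals in the other-category pair (all background)
    k_same, k_other : hypothetical numbers of background proposals to keep
    """
    fg_loss = sum(bce(s, 1) for s in fg_scores)
    kept = top_background(bg_scores_same, k_same) + top_background(bg_scores_other, k_other)
    bg_loss = sum(bce(s, 0) for s in kept)
    return fg_loss + bg_loss
```

Keeping only the highest-scoring backgrounds means the loss is driven by the negatives the model currently confuses with the target, rather than by the large mass of easy background proposals.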
We train our model on the FSOD Dataset training set (800 categories) and evaluate on the test set (200 categories) using mean average precision (mAP) with thresholds of 0.5 and 0.75. We apply our approach to in-the-wild car detection on the KITTI dataset to demonstrate its generalization. We further apply our approach to in-the-wild penguin detection; because this dataset has no bounding box annotations, we show some qualitative 5-shot detection results in Fig. 10. All the experiments are implemented in PyTorch.
The model is trained end-to-end on 4 Tesla P40 GPUs using SGD with a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.002 for the first 56,000 iterations and 0.0002 for the final 4,000 iterations. We take advantage of a pretrained model whose backbone, ResNet50, is trained on [14, 9]. As our test set has no overlap with these datasets, it is safe to use it. During training, we find that more training iterations damage performance. We suppose that too many training iterations make the model over-fit the training set. We fix the Res1-3 blocks and train only the high-level layers, which makes use of the low-level basic features and avoids over-fitting. The query image is resized so that its shorter edge is 600 pixels, with the longer edge capped at 1000 pixels. The support image is cropped around the target object with 16 pixels of image context, and is resized and zero-padded to a square image of 320x320. For few-shot training and testing, we fuse features by averaging the object features and then feed them to the RPN attention module and the multi-relation detector.
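The schedule and preprocessing rules stated above can be written down as small helper functions. The function names are hypothetical, iterations are assumed to be counted from zero, and rounding the resized dimensions to the nearest integer is an assumption; only the numeric settings come from the text.

```python
def learning_rate(iteration):
    """0.002 for the first 56,000 iterations, then 0.0002 for the last 4,000.

    Assumes iterations are counted from 0.
    """
    return 0.002 if iteration < 56000 else 0.0002


def query_resize_shape(height, width, short_edge=600, max_long_edge=1000):
    """Scale so the shorter edge is 600 px, unless the longer edge would exceed 1000 px."""
    scale = short_edge / min(height, width)
    if max(height, width) * scale > max_long_edge:
        scale = max_long_edge / max(height, width)
    return round(height * scale), round(width * scale)


def average_support_features(features):
    """Element-wise mean of K support feature vectors (few-shot feature fusion)."""
    k = len(features)
    return [sum(values) / k for values in zip(*features)]
```

For example, a 480x640 query becomes 600x800, while a very elongated 500x2000 query is limited by the longer-edge cap and becomes 250x1000.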
We propose an $N$-way $K$-shot evaluation to match training, and evaluate our approach using standard object detection evaluation tools. For each image $q_c$ in the test set, a test episode is constructed from the test image, $K$ random support images containing its category $c$, and $K$ support images for each of the other $N-1$ categories, where the categories are randomly selected. By testing in one episode, we in fact perform an $N$-way $K$-shot evaluation; in the 5-way 5-shot setting, each query is detected with 5 supports of its own category and of four other non-matching categories.
We provide another evaluation protocol driven by a common real-world application scenario: a massive number of images are collected from photo albums or drama series, and their label set is absent. We want to annotate one novel target object in these images, but we do not know which images contain the target object, nor the size and location of the target object in an image. In order to reduce the workload, one practicable solution is to find some images containing the target object, annotate them, and then apply our method to automatically annotate the rest. Following this setting, we perform the evaluation as follows. We mix all test images of the FSOD dataset, and for each object category, we pick 5 good images that contain the target object to perform 1-way 5-shot testing of that category. Note that, for each object category, we evaluate all the images in the test set regardless of what each image contains, which is in effect an all-way 5-shot testing for each image. This evaluation is very challenging, as the target images account on average for only 1.4% of the test set.
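The construction of one test episode can be sketched as follows, using the 5-way 5-shot setting as the default. The function name, the dictionary layout, and the choice to exclude the query image from its own support pool are assumptions made for this illustration.

```python
import random


def build_test_episode(query_image, query_category, supports_by_category,
                       n_way=5, k_shot=5, rng=None):
    """Assemble one episode: the query plus k_shot supports for n_way categories.

    supports_by_category maps a category name to a list of candidate
    support images (hypothetical identifiers such as file names).
    """
    rng = rng or random.Random()
    others = [c for c in supports_by_category if c != query_category]
    # Randomly choose the non-matching categories for this episode.
    negatives = rng.sample(others, n_way - 1)
    episode = {"query": query_image, "supports": {}}
    for category in [query_category] + negatives:
        # Assumption: the query image is never used as its own support.
        pool = [img for img in supports_by_category[category] if img != query_image]
        episode["supports"][category] = rng.sample(pool, k_shot)
    return episode
```

Passing a seeded `random.Random` instance makes the sampled episodes reproducible across evaluation runs.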
We evaluate our model and baselines on FSOD Dataset. In addition, we do the ablation study by incrementally adding different proposed modules of our model. Experimental results in Table 3 show that our model beats baselines and all the proposed modules contribute to the final performance improvement.
We compare with related works in Table 3, where setting I is an application of Faster R-CNN that compares the detector fc feature between query and support proposals (II is its siamese training version), and III mimics Relation Network with implementation details adjusted for detection. Our method surpasses these methods by a large margin, especially in the 5-way evaluation.
LSTD is different from ours in framework and application, and needs to be trained on new categories by transferring knowledge from a source domain to the target domain. Our method, however, can be applied to new categories without any further re-training or fine-tuning. This is a fundamental difference between our method and LSTD. To compare empirically, we adapt LSTD to be based on Faster R-CNN and re-train it on 5 fixed supports for each test category separately, in a fair configuration. Results are shown in Figure 6. We achieve 0.480 on 50 categories, which again demonstrates the effectiveness of our approach. Our method beats LSTD by 2.93% (27.15% vs. 24.22%) and its backbone Faster R-CNN by 4.11% (27.15% vs. 23.04%) on all 200 testing categories. More specifically, without pre-training on our dataset, the performance of Faster R-CNN decreases dramatically, which demonstrates the effectiveness of our dataset.
Our proposed multi-relation detectors all perform well in the 5-shot evaluation. Specifically, viewing each detector individually, the patch-relation head performs better in the 1-way evaluation, while the local-correlation head performs better in the 5-way evaluation. This indicates that the patch-relation head is good at distinguishing an object from the background, while the local-correlation head has the advantage in distinguishing non-matching objects from the target. By adding the global-relation head, the combination of all three relation heads achieves the best results. This demonstrates that the three detectors are complementary to each other, and that together they can better differentiate targets from the background and from non-matching categories. Qualitative 5-shot object detection results on our test set are shown in Fig. 9.
Comparing the models with the attention RPN and those with the regular RPN (Models VI and VII, Models VIII and IX) in Table 3, the attention RPN achieves better performance, especially in the 1-way evaluation. To evaluate the proposal quality, we calculate the average recall over different IoUs on the top 100 proposals. From Fig. 7 we can see that the attention RPN generates more high-quality proposals (higher IoUs) and benefits from few-shot supports. We then evaluate the average best overlap ratio (ABO) across ground truth boxes for these two RPNs: the ABO of the attention RPN is 0.7184/0.7242 (1-shot/5-shot), while that of the regular RPN is only 0.7076. We obtain the best performance with the model combining the multi-relation detector and the attention RPN. All these results illustrate that our proposed attention RPN generates better proposals and benefits the final detection predictions.
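For reference, the average best overlap metric used above can be computed as in the following sketch. It assumes boxes in `(x1, y1, x2, y2)` format and takes, for every ground truth box, the highest IoU reached by any proposal before averaging; the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_best_overlap(gt_boxes, proposals):
    """Mean over ground truth boxes of the best IoU achieved by any proposal."""
    if not gt_boxes:
        return 0.0
    best = [max((iou(g, p) for p in proposals), default=0.0) for g in gt_boxes]
    return sum(best) / len(best)
```

A ground truth box that no proposal overlaps contributes zero to the average, so the metric penalizes missed objects as well as loosely fitting proposals.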
We train our model with the multi-way strategy with ranking proposal, and obtain an 8.8% improvement on the 5-way 5-shot evaluation compared with the single-way strategy. This means that it is important to learn to distinguish different categories during training. With 5-shot training, we achieve further improvements, which is consistent with prior findings that few-shot training is beneficial for few-shot testing.
Our proposed dataset has a massive number of object categories with few images each, which is a beneficial property for few-shot object detection. To confirm this view, we train our model on the MS COCO dataset, which has more than 120,000 images but only 80 categories, and further train our model on different training splits of the FSOD Dataset ranging from 200 to 600 categories. When we split the FSOD Dataset, we guarantee that all the splits have a similar number of images, so that the results are affected only by the number of categories. We present the experimental results in Fig. 8. From this analysis, we find that although MS COCO has the most training images, its model performs worse, while models trained on the FSOD Dataset perform better as the number of categories increases. This indicates that few categories with too many images may impede few-shot object detection, while a larger number of categories consistently benefits this task. In all, our experiments show that category diversity is essential for few-shot object detection, and that performance improves as the number of categories increases.
Our approach is general and can be easily extended to other applications. We evaluate our approach on in-the-wild car detection on KITTI, an urban scene dataset for driving scenarios whose images are captured by a car-mounted video camera. We evaluate on the KITTI training set with 7481 images. Because the car category is in the COCO dataset, we discard the COCO pretrained model and use only the ImageNet pretrained model. A prior work uses massive annotated data from the Cityscapes dataset to train Faster R-CNN and evaluates on KITTI. Without any further re-training or fine-tuning, our model obtains similar performance (49.7% vs. 53.5%) on in-the-wild car detection.
In this paper, we have introduced a novel few-shot object detection network with an Attention RPN and a multi-relation module, which fully exploit the similarity between the few-shot set and the test set for novel object detection. To train our model, we have also prepared a new dataset which contains 1000 categories of various objects with high-quality annotations. Thanks to the matching strategy, our model can detect objects of novel categories online, with no pre-training or further network adaptation required. Our model is validated by quantitative and qualitative results on different datasets. In all, we introduce a general model and establish a few-shot object detection dataset, both of which can greatly benefit research and applications in this field.
Here we provide more experimental details. In our experiments, we use the top 100 proposals generated by the RPN to predict the final object detection results. In Section 5.2 of the main paper, we present the effect of the number of categories on model performance. In this experiment, we adjust the number of training iterations according to the number of training images. Specifically, the total numbers of training iterations on the FSOD dataset and the MS COCO dataset are 40,000 and 120,000, respectively. In Section 5.3, we fine-tune our model on the FSOD training set with a few shots and obtain better performance. Here we freeze all other layers and fine-tune only the RPN and our proposed multi-relation detector.
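The two settings above, keeping the top 100 proposals and fine-tuning only the RPN and the multi-relation detector, can be sketched in plain Python as follows. The proposal dictionaries, the function names, and the parameter-name prefixes `rpn.` and `relation_head.` are hypothetical and chosen only for illustration; they do not describe the layout of our actual implementation.

```python
import heapq


def top_k_proposals(proposals, k=100):
    """Return the k highest-scoring proposals, best first.

    Each proposal is assumed to be a dict with a "score" entry.
    """
    return heapq.nlargest(k, proposals, key=lambda p: p["score"])


def trainable_parameter_names(names, trainable_prefixes=("rpn.", "relation_head.")):
    """Keep only parameters that would be updated during fine-tuning.

    The prefixes are hypothetical; every other parameter stays frozen.
    """
    return [name for name in names if name.startswith(trainable_prefixes)]


# Toy proposals with increasing scores.
proposals = [{"box": (0, 0, 10, 10), "score": i / 1000} for i in range(300)]
kept = top_k_proposals(proposals)
print(len(kept))

# Toy parameter names for a detector with a backbone, an RPN and a relation head.
names = ["backbone.conv1.weight", "rpn.conv.weight", "relation_head.fc.bias"]
print(trainable_parameter_names(names))
```

In a deep learning framework, the second helper would correspond to disabling gradient updates for every parameter outside the selected modules.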
Here we show qualitative results for the ablation study in Fig. 11. The visual results show the advantage of our model over the baselines and the advantage of our dataset over the benchmark dataset MS COCO. We present results on three categories, blackboard, segway and turban, which are very different in appearance from the training samples. Fig. 11 shows that both the novel Attention-RPN and the multi-relation detector play an important role in accurate detection. For example, in (1-2)(b), baseline 2 (Model III) fails to detect some targets, while our model with only the multi-relation detector detects more targets and suppresses the background. In (2-3)(d), our full model further corrects the detections and raises the scores on true targets by adding the Attention-RPN, especially in (3)(d). Comparing the results of our model trained on our dataset with those of the model trained on MS COCO, shown in Fig. 11 (1-3)(a), our model can hardly learn a matching between the support and the query target when trained on MS COCO, but it captures the similarity between the image pair when trained on our dataset. More detections from our full model in the 5-shot setting are shown in Fig. 12 and Fig. 13. Our model performs well even when the query image contains objects of non-support categories.
Here we describe the training/testing class split of our proposed FSOD Dataset, which was used in the experiments described in Section 5.3. The training classes are listed first (ending with "pigeon"), followed by the testing classes (starting with "beer").
lipstick, sandal, crocodile, football helmet, umbrella, houseplant, antelope, woodpecker, palm tree, box, swan, miniskirt, monkey, cookie, scissors, snowboard, hedgehog, penguin, barrel, wall clock, strawberry, window blind, butterfly, television, cake, punching bag, picture frame, face powder, jaguar, tomato, isopod, balloon, vase, shirt, waffle, carrot, candle, flute, bagel, orange, wheelchair, golf ball, unicycle, surfboard, cattle, parachute, candy, turkey, pillow, jacket, dumbbell, dagger, wine glass, guitar, shrimp, worm, hamburger, cucumber, radish, alpaca, bicycle wheel, shelf, pancake, helicopter, perfume, sword, ipod, goose, pretzel, coin, broccoli, mule, cabbage, sheep, apple, flag, horse, duck, salad, lemon, handgun, backpack, printer, mug, snowmobile, boot, bowl, book, tin can, football, human leg, countertop, elephant, ladybug, curtain, wine, van, envelope, pen, doll, bus, flying disc, microwave oven, stethoscope, burrito, mushroom, teddy bear, nail, bottle, raccoon, rifle, peach, laptop, centipede, tiger, watch, cat, ladder, sparrow, coffee table, plastic bag, brown bear, frog, jeans, harp, accordion, pig, porcupine, dolphin, owl, flowerpot, motorcycle, calculator, tap, kangaroo, lavender, tennis ball, jellyfish, bust, dice, wok, roller skates, mango, bread, computer monitor, sombrero, desk, cheetah, ice cream, tart, doughnut, grapefruit, paddle, pear, kite, eagle, towel, coffee, deer, whale, cello, lion, taxi, shark, human arm, trumpet, french fries, syringe, lobster, rose, human hand, lamp, bat, ostrich, trombone, swim cap, human beard, hot dog, chicken, leopard, alarm clock, drum, taco, digital clock, starfish, train, belt, refrigerator, dog bed, bell pepper, loveseat, infant bed, training bench, milk, mixing bowl, knife, cutting board, ring binder, studio couch, filing cabinet, bee, caterpillar, sofa bed, violin, traffic light, airplane, closet, canary, toilet paper, canoe, spoon, fox, tennis racket, red panda, cannon, stool, zucchini, rugby 
ball, polar bear, bench, pizza, fork, barge, bow and arrow, kettle, goldfish, mirror, snail, poster, drill, tie, gondola, scale, falcon, bull, remote control, horn, hamster, volleyball, stationary bicycle, dishwasher, limousine, shorts, toothbrush, bookcase, baseball glove, computer mouse, otter, computer keyboard, shower, teapot, human foot, parking meter, ski, beaker, castle, mobile phone, suitcase, sock, cupboard, crab, common fig, missile, swimwear, saucer, popcorn, coat, plate, stairs, pineapple, parrot, fountain, binoculars, tent, pencil case, mouse, sewing machine, magpie, handbag, saxophone, panda, flashlight, baseball bat, golf cart, banana, billiard table, tower, washing machine, lizard, brassiere, ant, crown, oven, sea lion, pitcher, chest of drawers, crutch, hippopotamus, artichoke, seat belt, microphone, lynx, camel, rabbit, rocket, toilet, spider, camera, pomegranate, bathtub, jug, goat, cowboy hat, wrench, stretcher, balance beam, necklace, scoreboard, horizontal bar, stop sign, sushi, gas stove, tank, armadillo, snake, tripod, cocktail, zebra, toaster, frying pan, pasta, truck, blue jay, sink, lighthouse, skateboard, cricket ball, dragonfly, snowplow, screwdriver, organ, giraffe, submarine, scorpion, honeycomb, cream, cart, koala, guacamole, raven, drawer, diaper, fire hydrant, potato, porch, banjo, hammer, paper towel, wardrobe, soap dispenser, asparagus, skunk, chainsaw, spatula, ambulance, submarine sandwich, axe, ruler, measuring cup, scarf, squirrel, tea, whisk, food processor, tick, stapler, oboe, hartebeest, modem, shower cap, mask, handkerchief, falafel, clipper, croquette, house finch, butterfly fish, lesser scaup, barbell, hair slide, arabian camel, pill bottle, springbok, camper, basketball player, bumper car, wisent, hip, wicket, medicine ball, sweet orange, snowshoe, column, king charles spaniel, crane, scoter, slide rule, steel drum, sports car, go kart, gearing, tostada, french loaf, granny smith, sorrel, ibex, rain barrel, quail, 
rhodesian ridgeback, mongoose, red backed sandpiper, penlight, samoyed, pay phone, barber chair, wool, ballplayer, malamute, reel, mountain goat, tusker, longwool, shopping cart, marble, shuttlecock, red breasted merganser, shutter, stamp, letter opener, canopic jar, warthog, oil filter, petri dish, bubble, african crocodile, bikini, brambling, siamang, bison, snorkel, loafer, kite balloon, wallet, laundry cart, sausage dog, king penguin, diver, rake, drake, bald eagle, retriever, slot, switchblade, orangutan, chacma, guenon, car wheel, dandie dinmont, guanaco, corn, hen, african hunting dog, pajama, hay, dingo, meat loaf, kid, whistle, tank car, dungeness crab, pop bottle, oar, yellow lady’s slipper, mountain sheep, zebu, crossword puzzle, daisy, kimono, basenji, solar dish, bell, gazelle, agaric, meatball, patas, swing, dutch oven, military uniform, vestment, cavy, mustang, standard poodle, chesapeake bay retriever, coffee mug, gorilla, bearskin, safety pin, sulphur crested cockatoo, flamingo, eider, picket fence, dhole, spaghetti squash, african elephant, coral fungus, pelican, anchovy pear, oystercatcher, gyromitra, african grey, knee pad, hatchet, elk, squash racket, mallet, greyhound, ram, racer, morel, drumstick, bovine, bullet train, bernese mountain dog, motor scooter, vervet, quince, blenheim spaniel, snipe, marmoset, dodo, cowboy boot, buckeye, prairie chicken, siberian husky, ballpoint, mountain tent, jockey, border collie, ice skate, button, stuffed tomato, lovebird, jinrikisha, pony, killer whale, indian elephant, acorn squash, macaw, bolete, fiddler crab, mobile home, dressing table, chimpanzee, jack o’ lantern, toast, nipple, entlebucher, groom, sarong, cauliflower, apiary, english foxhound, deck chair, car door, labrador retriever, wallaby, acorn, short pants, standard schnauzer, lampshade, hog, male horse, martin, loudspeaker, plum, bale, partridge, water jug, shoji, shield, american lobster, nailfile, poodle, jackfruit, heifer, whippet, mitten, 
eggnog, weimaraner, twin bed, english springer, dowitcher, rhesus, norwich terrier, sail, custard apple, wassail, bib, bullet, bartlett, brace, pick, carthorse, ruminant, clog, screw, burro, mountain bike, sunscreen, packet, madagascar cat, radio telescope, wild sheep, stuffed peppers, okapi, bighorn, grizzly, jar, rambutan, mortarboard, raspberry, gar, andiron, paintbrush, running shoe, turnstile, leonberg, red wine, open face sandwich, metal screw, west highland white terrier, boxer, lorikeet, interceptor, ruddy turnstone, colobus, pan, white stork, stinkhorn, american coot, trailer truck, bride, afghan hound, motorboat, bassoon, quesadilla, goblet, llama, folding chair, spoonbill, workhorse, pimento, anemone fish, ewe, megalith, pool ball, macaque, kit fox, oryx, sleeve, plug, battery, black stork, saluki, bath towel, bee eater, baboon, dairy cattle, sleeping bag, panpipe, gemsbok, albatross, comb, snow goose, cetacean, bucket, packhorse, palm, vending machine, butternut squash, loupe, ox, celandine, appenzeller, vulture, crampon, backboard, european gallinule, parsnip, jersey, slide, guava, cardoon, scuba diver, broom, giant schnauzer, gordon setter, staffordshire bullterrier, conch, cherry, jam, salmon, matchstick, black swan, sailboat, assault rifle, thatch, hook, wild boar, ski pole, armchair, lab coat, goldfinch, guinea pig, pinwheel, water buffalo, chain, ocarina, impala, swallow, mailbox, langur, cock, hyena, marimba, hound, knot, saw, eskimo dog, pembroke, sealyham terrier, italian greyhound, shih tzu, scotch terrier, yawl, lighter, dung beetle, dugong, academic gown, blanket, timber wolf, minibus, joystick, speedboat, flagpole, honey, chessman, club sandwich, gown, crate, peg, aquarium, whooping crane, headboard, okra, trench coat, avocado, cayuse, large yellow lady’s slipper, ski mask, dough, bassarisk, bridal gown, terrapin, yacht, saddle, redbone, shower curtain, jennet, school bus, otterhound, irish terrier, carton, abaya, window shade, wooden 
spoon, yurt, flat coated retriever, bull mastiff, cardigan, river boat, irish wolfhound, oxygen mask, propeller, earthstar, black footed ferret, rocking chair, beach wagon, litchi, pigeon.
beer, musical keyboard, maple, christmas tree, hiking equipment, bicycle helmet, goggles, tortoise, whiteboard, lantern, convenience store, lifejacket, squid, watermelon, sunflower, muffin, mixer, bronze sculpture, skyscraper, drinking straw, segway, sun hat, harbor seal, cat furniture, fedora, kitchen knife, hand dryer, tree house, earrings, power plugs and sockets, waste container, blender, briefcase, street light, shotgun, sports uniform, wood burning stove, billboard, vehicle registration plate, ceiling fan, cassette deck, table tennis racket, bidet, pumpkin, tablet computer, rhinoceros, cheese, jacuzzi, door handle, swimming pool, rays and skates, chopsticks, oyster, office building, ratchet, salt and pepper shakers, juice, bowling equipment, skull, nightstand, light bulb, high heels, picnic basket, platter, cantaloupe, croissant, dinosaur, adhesive tape, mechanical fan, winter melon, egg, beehive, lily, cake stand, treadmill, kitchen & dining room table, headphones, wine rack, harpsichord, corded phone, snowman, jet ski, fireplace, spice rack, coconut, coffeemaker, seahorse, tiara, light switch, serving tray, bathroom cabinet, slow cooker, jalapeno, cartwheel, laelia, cattleya, bran muffin, caribou, buskin, turban, chalk, cider vinegar, bannock, persimmon, wing tip, shin guard, baby shoe, euphonium, popover, pulley, walking shoe, fancy dress, clam, mozzarella, peccary, spinning rod, khimar, soap dish, hot air balloon, windmill, manometer, gnu, earphone, double hung window, conserve, claymore, scone, bouquet, ski boot, welsh poppy, puffball, sambuca, truffle, calla lily, hard hat, elephant seal, peanut, hind, jelly fungus, pirogi, recycling bin, in line skate, bialy, shelf bracket, bowling shoe, ferris wheel, stanhopea, cowrie, adjustable wrench, date bread, o ring, caryatid, leaf spring, french bread, sergeant major, daiquiri, sweet roll, polypore, face veil, support hose, chinese lantern, triangle, mulberry, quick bread, optical disk, egg yolk, shallot, 
strawflower, cue, blue columbine, silo, mascara, cherry tomato, box wrench, flipper, bathrobe, gill fungus, blackboard, thumbtack, longhorn, pacific walrus, streptocarpus, addax, fly orchid, blackberry, kob, car tire, sassaby, fishing rod, baguet, trowel, cornbread, disa, tuning fork, virginia spring beauty, samosa, chigetai, blue poppy, scimitar, shirt button.
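Since the testing classes are meant to be unseen during training, one may want to check that the two class lists above do not overlap. The following is a minimal sketch of such a check; the helper names `parse_class_list` and `overlapping_classes` are hypothetical, and the short strings stand in for the full lists.

```python
def parse_class_list(text):
    """Split a comma-separated class list into clean class names.

    Handles line breaks inside a name, full-width commas and a trailing period.
    """
    normalized = text.replace("\uff0c", ",").replace("\n", " ")
    names = []
    for part in normalized.split(","):
        # Collapse repeated whitespace and drop the final period of the list.
        name = " ".join(part.split()).rstrip(".")
        if name:
            names.append(name)
    return names


def overlapping_classes(train_names, test_names):
    """Return the sorted class names that appear in both lists."""
    return sorted(set(train_names) & set(test_names))


# Short excerpts standing in for the full training and testing lists.
train_text = "lipstick, sandal, rugby \nball, oboe\uff0chartebeest, pigeon."
test_text = "beer, slow cooker\uff0c jalapeno, segway, shirt button."

train_names = parse_class_list(train_text)
test_names = parse_class_list(test_text)
print(train_names)
print(overlapping_classes(train_names, test_names))
```

An empty result means that no class name is shared between the two lists.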