Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector

08/06/2019, by Qi Fan et al. (Tencent)

Conventional methods for object detection usually require a substantial amount of training data, and preparing such high-quality training data is labor-intensive. In this paper, we propose few-shot object detection, which aims to detect objects of unseen classes with only a few training examples. Central to our method are the Attention-RPN and the multi-relation module, which fully exploit the similarity between the few-shot training examples and the test set to detect novel objects while suppressing false detections in the background. To train our network, we have prepared a new dataset that contains 1000 categories of various objects with high-quality annotations. To the best of our knowledge, this is also the first dataset specifically designed for few-shot object detection. Once our network is trained, we can detect objects of unseen classes without further training or fine-tuning, which is the major advantage of few-shot object detection. Our method is general and has a wide range of applications. We demonstrate the effectiveness of our method quantitatively and qualitatively on different datasets. The dataset link is: https://github.com/fanq15/Few-Shot-Object-Detection-Dataset.


1 Introduction

Figure 1: Illustration of our approach. Given different objects as supports, our approach can detect all objects of the same categories in the given query image.

Object detection has a wide range of applications. Existing object detection methods usually rely heavily on a huge amount of annotated data and a long training time, and they are hard to extend to unseen objects that were not annotated in the training data. In contrast, the human vision system is excellent at recognizing new objects with only a little supervision. This inspires us to develop a few-shot object detection algorithm.

Few-shot learning is challenging due to the large variance of objects in illumination, shape, texture, etc., in the real world. In recent years, it has achieved great progress [1, 2, 3, 4, 5, 6, 7, 8]. These methods, however, all focus on image classification, and few-shot object detection is rarely explored. This is because transferring the experience from few-shot classification to few-shot object detection is a non-trivial task. Object detection from few shots confronts one crucial problem: how to localize an unseen object in a cluttered background, namely the generalization of object localization from a few training examples of novel categories. Candidate bounding boxes may entirely miss the unseen objects or produce many false detections on the background. This can be caused by improperly low scores of good bounding boxes in a region proposal network (RPN), which makes a novel object hard to detect. This problem makes few-shot object detection intrinsically different from few-shot classification.

In this work, we aim to solve the problem of few-shot object detection. Given a few support images of a target object, our goal is to detect all foreground objects in the test set that belong to the target object category, as shown in Fig. 1. To this end, we make two major contributions. First, we propose a general few-shot detection model that can detect novel objects without re-training or fine-tuning. Our method fully exploits the matching relationship between object pairs in a siamese network at multiple network stages. Experiments show that our model benefits from the attention module at an early stage, which enhances proposal quality, and from the multi-relation module at a late stage, which suppresses and filters out false detections on confusing backgrounds. Second, to train the model, we build a large well-annotated dataset with 1000 categories and a few examples per category. This dataset promotes the general learning of object detection. Our method achieves much better performance by utilizing this dataset than by using a benchmark dataset of larger scale, COCO [9]. To the best of our knowledge, this is the first attempt to construct a few-shot object detection dataset with so many classes.

In summary, we propose a novel model for few-shot object detection with carefully designed attention modules on the RPN and the detector. Owing to the matching strategy [10], our model has the important property that its training stage is exactly the same as its testing stage. This essentially enables our model to test online on objects of novel categories, requiring no pre-training or further network adaptation.

2 Related Works

General Object Detection. Object detection is a classical problem in computer vision. In early years, object detection was usually formulated as a sliding-window classification problem using handcrafted features [11, 12, 13]. With the rise of deep learning [14], CNN-based methods have become the dominant object detection solution. Most methods can be divided into two general approaches: proposal-free detectors and proposal-based detectors. The first line of work follows a one-stage training strategy and does not generate proposal boxes explicitly [15, 16, 17, 18, 19]. The second line, pioneered by R-CNN [20], first extracts class-agnostic region proposals of potential objects in an image; these boxes are then further refined and classified into different categories by a dedicated module [21, 22, 23, 24]. This strategy filters out the majority of negative locations with the RPN module and thus facilitates the subsequent detector. For this reason, RPN-based methods usually perform better than proposal-free ones and lead the state of the art [24] in detection. The methods mentioned above, however, rely on intensive supervision and are hard to extend to novel categories with only a few examples. This has raised research interest in few-shot learning.

Few-shot learning. Few-shot learning is a classic problem [25]: it is challenging for traditional machine learning algorithms to learn from just a few training examples. Earlier works attempt to learn a general prior [26, 27, 28, 29, 30], such as hand-designed strokes or parts, that can be shared across categories. Some other works [31, 32, 1, 33] focus on metric learning, which aims at manually designing the distance formulation among different categories. A recent trend is to design a general agent/strategy that can guide the supervised learning within each task; by accumulating knowledge across tasks, the network gains the ability to capture the structure that varies across tasks. This research direction is generally named meta-learning [10, 5, 34, 2, 35]. In this area, [10] proposed a siamese network that consists of twin networks sharing weights, where one network is fed a support image and the other a query image. The distance between the query and its support is then learned by logistic regression. This matching strategy intrinsically captures the variation between support and query regardless of their categories. Within this matching framework, subsequent works [4, 36, 8, 37, 3, 6] focus on enhancing the feature embedding, where one direction is to build memory modules that capture global contexts among the supports. Some works [38, 39] exploit local descriptors to reap additional knowledge from the limited data. [40, 41] introduce Graph Neural Networks (GNNs) to model the relationship between different categories. [42] traverses the entire support set and identifies task-relevant features to make metric learning in high-dimensional space more effective. There are also other works, such as [2, 43], which are dedicated to learning a general agent that guides parameter optimization.

Until now, few-shot learning has achieved much progress. It, however, mostly focuses on the classification task and rarely on other computer vision tasks, such as semantic segmentation [44, 45, 46], human motion prediction [47] and object detection [48]. As mentioned before, few-shot object detection is intrinsically different from general few-shot classification. [49] harnesses unlabeled data and optimizes multiple modules alternately on images without box annotations. It may be misled by incorrect detections under the weak supervision and requires re-training for every new category, so it is out of our scope. LSTD [48] proposed a few-shot object detection framework that transfers knowledge from a large source dataset to a small target one by minimizing the gap between the classification posteriors of the source and target domains. This method, however, strongly depends on the source domain and is hard to extend to scenarios very different from it.

Our work is mainly motivated by the research line pioneered by the matching network [10]. We propose a general few-shot object detection network that learns matching on image pairs based on the Faster R-CNN framework, equipped with our attention modules at multiple network stages.

3 FSOD Dataset: A Highly Diverse Few-Shot Object Detection Dataset

Figure 2: Dataset label tree. The ImageNet categories (red circles) are merged with the Open Image categories (green circles). The superclasses are borrowed from Open Image.

The key to few-shot learning lies in the generalization ability of the model to novel categories. Thus, a high-diversity dataset with a large number of object categories is necessary to train a model that is general enough to detect unseen objects and to provide a convincing evaluation. However, existing datasets [9, 50, 51, 52, 53] contain very limited categories and do not follow the few-shot evaluation setting. To this end, we build a new few-shot object detection dataset. In this section, we introduce our dataset and give a full analysis of it.

Train Test
No. Class 800 200
No. Image 52350 14152
No. Box 147489 35102
Avg No. Box / Img 2.82 2.48
Min No. Img / Cls 22 30
Max No. Img / Cls 208 199
Avg No. Img / Cls 75.65 74.31
Box Size [6, 6828] [13, 4605]
Box Area Ratio [0.0009, 1] [0.0009, 1]
Box W/H Ratio [0.0216, 89] [0.0199, 51.5]
Table 1: Dataset Summary. We construct a challenging dataset with large variance of box size and aspect ratios.
Figure 3: Dataset histogram for image distribution per class.

Dataset Construction. We build our dataset from existing large-scale supervised object detection datasets [54, 52]. These datasets, however, cannot be used directly because 1) the label systems of different datasets are inconsistent, where objects with the same semantics are labeled with different words; 2) a large amount of annotations are less than satisfactory due to inaccurate or missing labels, duplicate boxes, overly large objects, etc.; and 3) their train/test splits contain the same categories, while for a few-shot dataset we want the train/test sets to contain different categories in order to evaluate generality on unseen objects. To build the dataset, we first summarize a label system from [54, 52]: we merge the leaf labels of their original label trees, group those with the same semantics (such as ice bear and polar bear) into one category, and remove semantics that do not belong to any leaf category. Then, we remove images with bad labeling quality and boxes of improper size; boxes smaller than 0.05% of the image size are removed, as they are usually of bad visual quality and unsuitable to serve as support examples. Next, we follow the few-shot learning principle to split our data into a training set and a test set whose categories have no overlap. We construct the training set from categories in the MS COCO Dataset [9] and the ImageNet Dataset [54] in case researchers need a pretraining stage. We then build the test set, which contains 200 categories, by choosing those with the largest distance from the existing training categories, where the distance is the shortest path that connects the senses of two phrases in the is-a taxonomy [55]. The remaining categories are merged into the training set, which in total contains 800 categories. In all, we construct a dataset of 1000 categories with a very clear category split for training and testing, where 531 categories come from the ImageNet Dataset and 469 from the Open Image Dataset [52].
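As a concrete illustration of the category-distance criterion above, the following is a minimal sketch (not the authors' tooling) of how the shortest is-a path between two category names can be computed with NLTK's WordNet interface; the mapping from category names to WordNet synsets, and which sense is chosen, are assumptions.

```python
# Minimal sketch (not the authors' code): estimating the semantic distance between
# two category names via the shortest is-a path in WordNet, using NLTK.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def category_distance(name_a, name_b):
    """Shortest is-a path length between the first noun senses of two category names.
    Returns None if either name has no noun synset or no path connects them."""
    syns_a = wn.synsets(name_a, pos=wn.NOUN)
    syns_b = wn.synsets(name_b, pos=wn.NOUN)
    if not syns_a or not syns_b:
        return None
    # We simply take the first (most common) sense of each phrase;
    # which sense the authors used is an assumption.
    return syns_a[0].shortest_path_distance(syns_b[0])

# Categories far from all training categories would be assigned to the test split.
print(category_distance('polar_bear', 'ice_bear'))   # same synset -> 0
print(category_distance('polar_bear', 'lipstick'))   # a much larger distance
```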

Dataset Analysis. Our dataset is specifically designed for few-shot learning and for evaluating the generality of a model on novel categories. It contains 1000 categories with an 800/200 split for the training and test sets respectively, with around 66,000 images and 182,000 bounding boxes in total. Detailed statistics are shown in Table 1 and Fig. 3. Our dataset has the following attributes.

  • High diversity of categories: Our dataset contains 83 parent semantics, such as mammal, clothing and weapon, which are further split into 1000 leaf categories. Our label tree is shown in Fig. 2. Due to our strict dataset split, the train/test sets have very different semantic categories, which makes model evaluation challenging.

  • Challenging setting: Our dataset is challenging for evaluation. It contains objects with large variance in box size and aspect ratio, and 26.5% of the images in the test set contain no less than three objects. It is also worth noting that the test set contains a large number of boxes of categories not included in our label system, which poses an additional challenge for a few-shot model.

Even though our dataset has a large number of categories, its training images and boxes are far fewer than those of a benchmark dataset such as MS COCO, which contains 123,287 images and around 886,000 bounding boxes. Thus, our dataset is compact yet efficient for few-shot learning.

4 Our Methodology

In this section, we will introduce our novel few-shot object detection network. Before that, we first introduce the task of few-shot detection.

Figure 4: Our network architecture, which uses ResNet-50 as the backbone. The support image (in green) and query image (in blue) are fed into the weight-shared backbone. The RPN uses the attention feature generated by the depth-wise cross correlation between the compact support feature and the query feature. The class scores generated by the patch-relation head (top), the global-relation head (middle) and the local-correlation head (bottom) are added together as the final matching score, and the bounding box predictions are generated by the patch-relation head.

4.1 Problem Definition

We define the task of few-shot object detection as follows. Given a support image with a close-up of one target object and a separate query image which potentially contains objects of the support category, the task is to find all objects of the support category in the query image and label them with tight bounding boxes. If the support set contains K categories with N examples for each category, we call the task K-way N-shot detection.

4.2 Deep Attentioned Few-Shot Detection

As is well known, few-shot recognition is challenging because it is hard to capture commonality from just a few examples. To this end, we propose an attention network that learns a general matching relationship between the support set and queries at both the RPN module and the detector. Fig. 4 shows the overall architecture of our network.

In particular, we build a siamese framework with two weight-shared branches, one for the support set and the other for the query set, where the query branch is a Faster R-CNN network containing the RPN and the detector. We use this framework to train the matching relationship between support and query features, in order to enforce the network to learn general knowledge and the ability to capture commonality among objects of the same category. On top of this framework, we introduce a novel attention RPN and a detector with multi-relation modules to promote accurate matching between the support and the potential boxes in the query.

4.2.1 Attention-Based Region Proposal Network

In few-shot object detection, the RPN is a useful tool to filter out a large number of candidate boxes and reduce the burden on the subsequent detector. The RPN should not only distinguish between objects and non-objects, but also filter out objects that do not belong to the support category. However, without any support image information, the RPN aimlessly activates every potential object with a high objectness score, even if it does not belong to the support category. A large number of irrelevant objects then burden the subsequent classification in the detector. On account of this, we propose the Attention RPN, which incorporates support information to filter out the majority of background boxes and those of non-matching categories. Based on it, we can generate a smaller set of candidate proposals with high potential to contain the target objects. The framework is shown in Fig. 5.

We introduce support information into the RPN through an attention mechanism to guide the RPN to produce relevant proposals and suppress proposals of other categories. We compute the similarity between the support feature map and the query feature map in a depth-wise manner, and the resulting similarity map is used for proposal generation. In particular, denoting the support feature map as X \in \mathbb{R}^{S \times S \times C} and the query feature map as Y, we formulate the similarity calculation as

G_{h,w,c} = \sum_{i,j} X_{i,j,c} \cdot Y_{h+i-1,\, w+j-1,\, c},     (1)

where G is the resultant attention feature map and i, j index the spatial positions of the support feature. Here the support feature X is used as the kernel to slide over the query feature map in a depth-wise convolution [56]. This procedure acts as the support calling for attention on the query. In our work, we adopt the features of the top layer of the RPN backbone (res4-6 in ResNet-50). We find that a kernel size of 1 performs well in our case, which is consistent with the observation in [22] that a global feature provides a good object prior for objectness classification. In this case, the kernel is obtained by averaging over the support feature map. The attention feature map is followed by a convolution and then by the objectness classification layer and ROI pooling.

Figure 5: Attention RPN. The support feature is average-pooled to a vector, and a depth-wise cross correlation with the query feature is then calculated; the output is used as the attention feature and fed into the RPN to generate proposals.
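To make the attention mechanism of Eq. 1 concrete, below is a minimal PyTorch sketch (not the released implementation) of the depth-wise cross correlation with the support feature average-pooled to a 1x1 kernel; the feature shapes and channel count are assumptions.

```python
# Minimal PyTorch sketch (not the released implementation) of the attention feature
# in Eq. 1: the support feature is average-pooled to a 1x1xC kernel and slid over the
# query feature map as a depth-wise (per-channel) cross correlation.
import torch
import torch.nn.functional as F

def attention_feature(support_feat, query_feat):
    """support_feat: (C, S, S) support feature map from the shared backbone.
       query_feat:   (1, C, H, W) query feature map.
       Returns the (1, C, H, W) attention feature fed to the RPN."""
    C = support_feat.shape[0]
    # Average-pool the support feature to a 1x1 kernel per channel.
    kernel = support_feat.mean(dim=(1, 2)).view(C, 1, 1, 1)   # (C, 1, 1, 1)
    # groups=C makes the convolution depth-wise: each channel of the query is
    # correlated only with the corresponding channel of the support kernel.
    return F.conv2d(query_feat, kernel, groups=C)

# Toy shapes: res4-level features with C=1024 channels (an assumption).
support = torch.randn(1024, 20, 20)
query = torch.randn(1, 1024, 38, 50)
att = attention_feature(support, query)
print(att.shape)  # torch.Size([1, 1024, 38, 50])
```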

4.2.2 Multi-Relation Detector

Type Filter Shape Stride/Padding
Avg Pool 3x3x4096 s1/p0
Conv 1x1x512 s1/p0
Conv 3x3x512 s1/p0
Conv 1x1x2048 s1/p0
Avg Pool 3x3x2048 s1/p0
Table 2: Architecture of patch-relation CNN.

In an R-CNN framework, the RPN is followed by a detector that plays the important role of re-scoring proposals and recognizing classes. Therefore, we want the detector to have a strong discriminative ability to distinguish different categories. To this end, we propose a novel multi-relation detector to effectively measure the similarity between proposal boxes from the query and the support objects. The detector includes three attention modules: the patch-relation head, which learns a deep non-linear metric for patch matching; the global-relation head, which learns a deep embedding for global matching; and the local-correlation head, which learns the pixel-wise and depth-wise cross correlation between support and query proposals. We experimentally show that the three matching modules complement each other and that performance improves incrementally as they are added one by one. We introduce the details of our multi-relation detector below.

  • In the patch-relation head, we first concatenate the support and query proposal feature maps along the depth dimension. The combined feature map is then fed into the patch-relation module, whose structure is shown in Table 2. All the convolution and pooling layers in this module have zero padding, which progressively reduces the feature map to a 1x1 feature used as input to the binary classification and regression heads. This module is compact and efficient. We explored the structure of this module a bit and found that replacing the two average pooling layers with convolutions does not further improve our model.

  • The global-relation head extends the patch relation to model the global embedding relation between the support and query proposals. Given the concatenated feature of a support and its query proposal, we average-pool the feature to a vector. We then use an MLP with two fully connected (fc) layers followed by ReLU and a final fc layer to generate matching scores.

  • The local-correlation head computes the pixel-wise and depth-wise similarity between the support object ROI feature and the proposal feature, similar to Eq. 1. Different from Eq. 1, we perform the dot product on feature pairs at the same depth. In particular, we first use a weight-shared convolution to process the support and query features individually, then calculate their depth-wise similarity feature, and finally use a subsequent fc layer to generate matching scores.

We only use the patch-relation head to generate bounding box predictions (regression on box coordinates), and use the sum of the matching scores from the three heads as the final matching score. Intra-class variance and imperfect proposals make the relation between proposals and support objects complex. Our three relation heads have different attributes and together handle this complexity well: the patch-relation head generates flexible embeddings able to match across intra-class variation, the global-relation head provides stable and general matching, and the local-correlation head matches on parts.
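The following is a minimal PyTorch sketch of the three relation heads described above; the ReLU placement, MLP hidden sizes and the pooling of the local correlation are assumptions where the text leaves them implicit, while the patch-relation layers follow Table 2.

```python
# Minimal sketch (shapes and layer sizes are assumptions where the paper leaves
# them implicit) of the three relation heads acting on 2048-channel, 7x7 ROI
# features of a support object and a query proposal.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiRelationHeads(nn.Module):
    def __init__(self, c=2048):
        super().__init__()
        # Patch-relation head (Table 2): operates on the depth-concatenated pair.
        self.patch = nn.Sequential(
            nn.AvgPool2d(3, stride=1), nn.Conv2d(2 * c, 512, 1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3), nn.ReLU(inplace=True),
            nn.Conv2d(512, c, 1), nn.ReLU(inplace=True), nn.AvgPool2d(3, stride=1))
        self.patch_cls = nn.Linear(c, 2)     # binary matching score
        self.patch_reg = nn.Linear(c, 4)     # box regression (patch head only)
        # Global-relation head: MLP on the pooled, concatenated pair (hidden sizes assumed).
        self.global_mlp = nn.Sequential(
            nn.Linear(2 * c, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True), nn.Linear(1024, 2))
        # Local-correlation head: weight-shared 1x1 conv, then depth-wise dot product.
        self.local_conv = nn.Conv2d(c, c, 1)
        self.local_fc = nn.Linear(c, 2)

    def forward(self, support, query):           # both (N, 2048, 7, 7)
        pair = torch.cat([support, query], dim=1)
        p = self.patch(pair).flatten(1)           # zero-padding layers reduce 7x7 to 1x1
        patch_score, box_delta = self.patch_cls(p), self.patch_reg(p)
        g = F.adaptive_avg_pool2d(pair, 1).flatten(1)
        global_score = self.global_mlp(g)
        s, q = self.local_conv(support), self.local_conv(query)
        local = (s * q).mean(dim=(2, 3))          # pixel- and depth-wise correlation
        local_score = self.local_fc(local)
        # The final matching score is the sum of the three heads; only the
        # patch-relation head predicts box coordinates.
        return patch_score + global_score + local_score, box_delta
```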

4.3 Training Strategy

Few-shot object detection differs from classification because of the massive number of background proposals. The model not only needs to distinguish different categories, but also needs to distinguish background from foreground for a target category, and background proposals usually dominate during training. For this reason, besides the basic single-way training, we propose a novel multi-way training with ranking proposal to address the problem.

Single-way Training

To learn the siamese-based detection model, we construct a training episode as follows: we randomly choose a query image q_c and one separate support image s_c containing an object of the same c-th category to construct an image pair (q_c, s_c). In this pair, only the objects of the c-th category in the query image are labeled as foreground and all other objects are treated as background. During training, we learn to match every proposal generated by the Attention-RPN in the query image with the object in the support image.

Multi-way Training with Ranking Proposal

We propose multi-way training to distinguish different categories. In the single-way training strategy, the training episode consists of the (q_c, s_c) image pair, and the model learns to match objects of the same category; it mainly distinguishes the foreground and background for one certain category across different IoUs. We instead construct an image triplet (q_c, s_c, s_n), where n ≠ c. In this triplet, the model not only needs to distinguish the foreground from the background for the target category with different IoUs in the (q_c, s_c) pair, but also needs to distinguish objects of different categories in the (q_c, s_n) pair. However, there are only background proposals in the (q_c, s_n) pair, which causes an imbalance between foreground and background proposals and harms the matching training. We propose ranking proposals to balance the foreground and background losses. First, we pick all foreground proposals and compute the foreground loss. Then we compute the matching scores of all background proposals and rank them by their matching scores. We select the top-ranked background proposals for the (q_c, s_c) pair and the top-ranked background proposals for the (q_c, s_n) pair and compute the background loss. The final matching loss is the sum of the foreground loss and the background loss.
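A minimal sketch of the ranking-proposal loss balancing described above is given below; the per-proposal logits and the numbers of background proposals kept per pair are assumptions, since the exact values are not restated here.

```python
# Minimal sketch (not the authors' code) of the ranking-proposal loss balancing:
# all foreground proposals are kept, while background proposals from the matching
# pair (q_c, s_c) and the non-matching pair (q_c, s_n) are ranked by matching score
# and only the hardest (highest-scoring) ones contribute to the background loss.
import torch
import torch.nn.functional as F

def matching_loss(fg_logits, bg_logits_match, bg_logits_nonmatch,
                  bg_keep_match, bg_keep_nonmatch):
    """fg_logits:          (F, 2) logits of foreground proposals from (q_c, s_c)
       bg_logits_match:    (Bm, 2) logits of background proposals from (q_c, s_c)
       bg_logits_nonmatch: (Bn, 2) logits of proposals from (q_c, s_n), all background
       bg_keep_*:          how many top-scoring background proposals to keep (assumed)"""
    fg_loss = F.cross_entropy(fg_logits, torch.ones(len(fg_logits), dtype=torch.long))

    def top_bg_loss(logits, keep):
        # Rank background proposals by their softmax matching score and keep the hardest.
        scores = logits.softmax(dim=1)[:, 1]
        idx = scores.topk(min(keep, len(scores))).indices
        kept = logits[idx]
        return F.cross_entropy(kept, torch.zeros(len(kept), dtype=torch.long))

    bg_loss = top_bg_loss(bg_logits_match, bg_keep_match) \
            + top_bg_loss(bg_logits_nonmatch, bg_keep_nonmatch)
    return fg_loss + bg_loss
```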

5 Experiments

No.        1-way 5-shot (AP50 / AP75)    5-way 5-shot (AP50 / AP75)
I [22]     0.414 / 0.267                 0.273 / 0.179
II         0.619 / 0.438                 0.333 / 0.244
III [8]    0.686 / 0.487                 0.429 / 0.316
IV         0.689 / 0.484                 0.465 / 0.340
V          0.681 / 0.490                 0.457 / 0.339
VI         0.682 / 0.478                 0.489 / 0.353
VII        0.693 / 0.480                 0.493 / 0.351
VIII       0.690 / 0.488                 0.504 / 0.367
IX         0.698 / 0.491                 0.506 / 0.367
X          0.684 / 0.457                 0.594 / 0.399
XI         0.686 / 0.452                 0.616 / 0.408

Table 3: Our few-shot results on the FSOD Dataset. Each row corresponds to a different combination of the following components. ST: Siamese Training, PR: Patch-Relation Head, LC: Local-Correlation Head, GR: Global-Relation Head, AR: Attention RPN, MC: Multi-class training, MS: Multi-shot training.

We train our model on the FSOD Dataset training set (800 categories) and evaluate it on the test set (200 categories) using mean average precision (mAP) at IoU thresholds of 0.5 and 0.75. We apply our approach to wild car detection on the KITTI [50] dataset to demonstrate the generalization of our approach. We further apply our approach to wild penguin detection [57]; because this dataset has no bounding box annotations, we show some qualitative 5-shot detection results in Fig. 10. All experiments are implemented in PyTorch.

5.1 Training Details

The model is trained end-to-end on 4 Tesla P40 GPUs using SGD with a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.002 for the first 56,000 iterations and 0.0002 for the last 4,000 iterations. We take advantage of a model whose backbone, ResNet-50, is pretrained on [14, 9]; as our test set has no category overlap with these datasets, it is safe to use it. During training, we find that more training iterations damage performance; we suppose that too many training iterations make the model over-fit the training set. We fix the Res1-3 blocks and only train the higher-level layers, which utilizes the low-level basic features and avoids over-fitting. The query image is resized so that its shorter edge is 600 pixels, with the longer edge capped at 1000 pixels. The support image is cropped around the target object with 16 pixels of image context, then resized and zero-padded to a 320x320 square image. For few-shot training and testing, we fuse features by averaging the object features and then feed them to the RPN attention module and the multi-relation detector.
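The support-image preprocessing described above (crop with 16 pixels of context, resize and zero-pad to 320x320) can be sketched as follows; the interpolation mode and tensor layout are assumptions.

```python
# Minimal sketch (details such as the interpolation mode are assumptions) of the
# support preprocessing: crop the target box with 16 pixels of context, resize the
# longer side to 320 pixels, and zero-pad to a 320x320 square.
import torch
import torch.nn.functional as F

def crop_support(image, box, context=16, out_size=320):
    """image: (3, H, W) float tensor; box: (x1, y1, x2, y2) in pixels."""
    _, H, W = image.shape
    x1, y1, x2, y2 = box
    x1, y1 = max(0, int(x1) - context), max(0, int(y1) - context)
    x2, y2 = min(W, int(x2) + context), min(H, int(y2) + context)
    crop = image[:, y1:y2, x1:x2]
    # Resize so the longer side equals out_size, keeping the aspect ratio.
    c, h, w = crop.shape
    scale = out_size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    crop = F.interpolate(crop[None], size=(nh, nw), mode='bilinear',
                         align_corners=False)[0]
    # Zero-pad to a square out_size x out_size canvas.
    canvas = torch.zeros(c, out_size, out_size)
    canvas[:, :nh, :nw] = crop
    return canvas
```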

5.2 Evaluation Protocol

K-way N-shot Evaluation

We propose a K-way N-shot evaluation protocol to match training, and evaluate our approach using standard object detection evaluation tools. For each image in the test set, a test episode is constructed from the test image, 5 random support images containing its category, and 5 support images for each of 4 other randomly selected categories. By testing in one episode, we in fact perform a 5-way 5-shot evaluation, where each query is detected against 5 supports of its own category and of the four non-matching categories.
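A minimal sketch of assembling one such test episode is shown below; `support_pool`, which maps each category to its available support images, is a hypothetical data structure rather than part of the released tooling.

```python
# Minimal sketch of assembling one 5-way 5-shot test episode for a query image.
import random

def build_episode(query_image, query_category, support_pool, n_way=5, k_shot=5, seed=None):
    rng = random.Random(seed)
    # The matching category plus (n_way - 1) randomly chosen non-matching categories.
    others = [c for c in support_pool if c != query_category]
    categories = [query_category] + rng.sample(others, n_way - 1)
    # k_shot randomly chosen supports per category.
    supports = {c: rng.sample(support_pool[c], k_shot) for c in categories}
    return query_image, supports
```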

Realistic Evaluation

We provide another evaluation protocol driven by a common real-world application scenario: massive images are collected from photo albums or drama series without any label set. We want to annotate a novel target object in these images, but we do not know which images contain the target object, nor its size and location in an image. To reduce the workload, one practical solution is to find a few images containing the target object, annotate them, and then apply our method to automatically annotate the remaining images. Following this setting, we perform the evaluation as follows: we mix all test images of the FSOD dataset, and for each object category we pick 5 good images containing the target object to perform 1-way 5-shot testing of that category. Note that, for each object category, we evaluate all images in the test set regardless of what each image contains, which is in effect an all-way 5-shot test for each image. This evaluation is very challenging, where the target images account for only 1.4% of the test set on average.

Figure 6: Comparisons with related work evaluated under the realistic protocol. The metric is cumulatively averaged over the top-k categories, where good supports are used in testing.

5.3 Results on FSOD Dataset

We evaluate our model and baselines on the FSOD Dataset. In addition, we conduct an ablation study by incrementally adding the proposed modules to our model. Experimental results in Table 3 show that our model beats the baselines and that all the proposed modules contribute to the final performance improvement.

5.3.1 Comparison with benchmarks

We compare with related works in Table 3, where setting I is an application of Faster R-CNN [22] that compares the detector fc features between query and support proposals (II is its siamese-training version), and III mimics the Relation Network [8] with implementation details adjusted for detection. Our method surpasses these methods by a large margin, especially in the 5-way evaluation.

LSTD [58] differs from ours in framework and application: it needs to be trained on new categories by transferring knowledge from a source domain to the target domain, whereas our method can be applied to new categories without any further re-training or fine-tuning. This is a fundamental difference between our method and LSTD. To compare empirically, we adapt LSTD to build on Faster R-CNN and re-train it on 5 fixed supports for each test category separately in a fair configuration. Results are shown in Fig. 6. We achieve 0.480 on 50 categories, which demonstrates the effectiveness of our approach again. Our method beats LSTD by 2.93% (27.15% vs. 24.22%) and its backbone Faster R-CNN by 4.11% (27.15% vs. 23.04%) on all 200 test categories. More specifically, without pre-training on our dataset, the performance of Faster R-CNN decreases dramatically, which demonstrates the effectiveness of our dataset.

5.3.2 Ablation study

Figure 7: The average recall over different IoUs on top 100 proposals generated by RPN.
Multi-Relation Detector

Our proposed multi-relation heads all perform well in the 5-shot evaluation. Specifically, viewing each head individually, the patch-relation head performs better in the 1-way evaluation, while the local-correlation head performs better in the 5-way evaluation. This indicates that the patch-relation head is good at distinguishing an object from the background, while the local-correlation head has the advantage of distinguishing non-matching objects from the target. By adding the global-relation head, the combination of all three relation heads achieves the best results. This demonstrates that the three heads are complementary and together better differentiate targets from the background and from non-matching categories. Qualitative 5-shot object detection results on our test set are shown in Fig. 9.

Attention RPN

Comparing models with the attention RPN and those with the regular RPN (Models VI and VII, Models VIII and IX) in Table 3, the attention RPN obtains better performance, especially in the 1-way evaluation. To evaluate proposal quality, we calculate the average recall over different IoUs on the top 100 proposals. From Fig. 7 we can see that the attention RPN generates more high-quality proposals (higher IoUs) and benefits from few-shot supports. We then evaluate the average best overlap ratio (ABO) across ground truth boxes for the two RPNs: the ABO of the attention RPN is 0.7184/0.7242 (1-shot/5-shot), while the same metric for the regular RPN is only 0.7076. The model combining the multi-relation detector with the attention RPN achieves the best performance. All these results illustrate that our proposed attention RPN generates better proposals and benefits the final detection predictions.
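For reference, the average best overlap (ABO) reported above can be computed as sketched below; the (x1, y1, x2, y2) box format is an assumption, and this is not the authors' evaluation code.

```python
# Minimal sketch of the average best overlap (ABO): for each ground-truth box, take
# the best IoU among the RPN proposals, then average over all ground truths.
def iou(a, b):
    # a, b: (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_best_overlap(gt_boxes, proposals):
    best = [max((iou(g, p) for p in proposals), default=0.0) for g in gt_boxes]
    return sum(best) / max(len(best), 1)
```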

Multi-way training strategy

We train our model with the multi-way strategy with ranking proposal and obtain an 8.8% improvement on the 5-way 5-shot evaluation compared with the single-way strategy. This means that learning to distinguish different categories during training is important. With 5-shot training, we achieve further improvements, which also verifies the observation in [1] that few-shot training is beneficial for few-shot testing.

Figure 8: The effect of the number of categories on model performance. The circle area is proportional to the number of training images.
Figure 9: Qualitative 5-shot detection results of our approach on the FSOD test set. We recommend viewing the figure zoomed in.
Figure 10: Our application results on the penguin dataset [57]. Given 5 penguin images as support, our approach can detect all penguins in the wild in the given query image.

5.3.3 Dataset with more categories or more samples?

Our proposed dataset has many object categories but few images per category, which is a beneficial property for few-shot object detection. To confirm this, we train our model on the MS COCO dataset, which has more than 120,000 images but only 80 categories, and on different training splits of the FSOD Dataset ranging from 200 to 600 categories. When splitting the FSOD Dataset, we ensure that all the splits have a similar number of images, so the results are affected only by the number of categories. The results are presented in Fig. 8. From this analysis, we find that although MS COCO has the most training images, the model trained on it performs worse, while models trained on the FSOD Dataset perform better as the number of categories increases. This indicates that few categories with too many images may impede few-shot object detection, while a large number of categories consistently brings benefits. In all, category diversity is essential for few-shot object detection, and performance improves as the number of categories increases.

5.4 Results on KITTI

Our approach is general and can be easily extended to other applications. We evaluate our approach on wild car detection on KITTI [50], an urban scene dataset for driving scenarios whose images are captured by a car-mounted video camera. We evaluate on the KITTI training set with 7,481 images. Because the car category appears in the COCO dataset, we discard the COCO-pretrained model and only use the ImageNet-pretrained model. [59] uses massive annotated data from the Cityscapes [60] dataset to train Faster R-CNN and evaluates on KITTI. Without any further re-training or fine-tuning, our model obtains comparable performance (49.7% vs. 53.5%) on wild car detection.

6 Conclusion

In this paper, we have introduced a novel few-shot object detection network with an Attention RPN and a multi-relation module that fully exploit the similarity between the few-shot support set and the test set for novel object detection. To train our model, we have also prepared a new dataset containing 1000 categories of various objects with high-quality annotations. Owing to the matching strategy, our model can detect objects of novel categories online, requiring no pre-training or further network adaptation. Our model is validated with quantitative and qualitative results on different datasets. In all, we introduce a general model and establish a few-shot object detection dataset, which can greatly benefit research and applications in this field.

References

  • [1] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
  • [2] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In In International Conference on Learning Representations (ICLR), 2017.
  • [3] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850, 2016.
  • [4] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
  • [5] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [6] Qi Cai, Yingwei Pan, Ting Yao, Chenggang Yan, and Tao Mei. Memory matching networks for one-shot image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4080–4088, 2018.
  • [7] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
  • [8] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018.
  • [9] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [10] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
  • [11] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In international Conference on computer vision & Pattern Recognition (CVPR’05), volume 1, pages 886–893. IEEE Computer Society, 2005.
  • [12] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2010.
  • [13] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE CVPR 2001, pages 905–910, 2001.
  • [14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [15] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [16] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
  • [17] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [18] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • [19] Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 385–400, 2018.
  • [20] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [21] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [23] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • [24] Bharat Singh, Mahyar Najibi, and Larry S Davis. Sniper: Efficient multi-scale training. In Advances in Neural Information Processing Systems, pages 9333–9343, 2018.
  • [25] Sebastian Thrun. Is learning the n-th thing any easier than learning the first? In Advances in neural information processing systems, pages 640–646, 1996.
  • [26] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
  • [27] Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33, 2011.
  • [28] Brenden M Lake, Ruslan R Salakhutdinov, and Josh Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in neural information processing systems, pages 2526–2534, 2013.
  • [29] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • [30] Alex Wong and Alan L Yuille. One shot learning via compositions of meaningful patches. In Proceedings of the IEEE International Conference on Computer Vision, pages 1197–1205, 2015.
  • [31] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pages 719–729, 2018.
  • [32] Eleni Triantafillou, Richard Zemel, and Raquel Urtasun. Few-shot learning through an information retrieval lens. In Advances in Neural Information Processing Systems, pages 2255–2265, 2017.
  • [33] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3018–3027, 2017.
  • [34] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2554–2563. JMLR. org, 2017.
  • [35] Tsendsuren Munkhdalai, Xingdi Yuan, Soroush Mehri, and Adam Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, 2018.
  • [36] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. arXiv preprint arXiv:1812.01866, 2018.
  • [37] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7278–7286, 2018.
  • [38] Wenbin Li, Lei Wang, Jinglin Xu, Jing Huo, Gao Yang, and Jiebo Luo. Revisiting local descriptor based image-to-class measure for few-shot learning. In CVPR, 2019.
  • [39] Yann Lifchitz, Yannis Avrithis, Sylvaine Picard, and Andrei Bursuc. Dense classification and implanting for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9258–9267, 2019.
  • [40] Jongmin Kim, Taesup Kim, Sungwoong Kim, and Chang D. Yoo. Edge-labeling graph neural network for few-shot learning. In Computer Vision and Pattern Recognition (CVPR), 2019.
  • [41] Spyros Gidaris and Nikos Komodakis. Generating classification weights with gnn denoising autoencoders for few-shot learning. In CVPR, 2019.
  • [42] Hongyang Li, David Eigen, Samuel Dodge, Matthew Zeiler, and Xiaogang Wang. Finding task-relevant features for few-shot learning by category traversal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–10, 2019.
  • [43] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
  • [44] Nanqing Dong and Eric P Xing. Few-shot semantic segmentation with prototype learning. In BMVC, volume 3, page 4, 2018.
  • [45] Claudio Michaelis, Matthias Bethge, and Alexander S. Ecker. One-shot segmentation in clutter. In ICML, 2018.
  • [46] Tao Hu, Pengwan, Chiliang Zhang, Gang Yu, Yadong Mu, and Cees G. M. Snoek. Attention-based multi-context guiding for few-shot semantic segmentation. In AAAI, 2019.
  • [47] Liang-Yan Gui, Yu-Xiong Wang, Deva Ramanan, and José M. F. Moura. Few-shot human motion prediction via meta-learning. In ECCV, 2018.
  • [48] Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. Lstd: A low-shot transfer detector for object detection. In AAAI, 2018.
  • [49] Xuanyi Dong, Liang Zheng, Fan Ma, Yi Yang, and Deyu Meng. Few-example object detection with model communication. IEEE transactions on PAMI, 2018.
  • [50] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [51] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [52] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982, 2018.
  • [53] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [54] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • [55] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
  • [56] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. arXiv preprint arXiv:1812.11703, 2018.
  • [57] C. Arteta, V. Lempitsky, and A. Zisserman. Counting in the wild. In European Conference on Computer Vision, 2016.
  • [58] Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. Lstd: A low-shot transfer detector for object detection. arXiv preprint arXiv:1803.01529, 2018.
  • [59] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3339–3348, 2018.
  • [60] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Appendix A: More Implementation Details

Here we provide more experimental details. In our experiments, we use the top 100 proposals generated by the RPN to predict the final object detection results. In section 5.2 of the main paper, we present the effect of the number of categories on model performance; in that experiment, we adjust the number of training iterations according to the number of training images. Specifically, the total number of training iterations on the FSOD dataset and the MS COCO dataset is 40,000 and 120,000 respectively. In section 5.3, we finetune our model on the FSOD training set with few shots and obtain better performance; here we fix all layers and only finetune the RPN and our proposed multi-relation detector.

Figure 11: Qualitative results for the ablation study. We visualize boxes with scores larger than 0.8, or the box with the highest score in (3)(c).

Appendix B: Qualitative Visualization Results

Here we show qualitative results for the ablation study in Fig. 11. From the visual results, we can see the advantage of our model over the baselines and the advantage of our dataset over the benchmark MS COCO dataset. We present results on three categories, blackboard, segway and turban, which are very distinct in appearance from the training samples. From Fig. 11, we can see that both the novel attention RPN and the multi-relation detector play an important role in accurate detection. For example, in (1-2)(b) baseline 2 (Model III) fails to detect some targets, while our model with only the multi-relation detector detects more targets and suppresses backgrounds. In (2-3)(d), our full model further corrects the detections and raises the scores on true targets by adding the Attention RPN, especially in the instance of (3)(d). Comparing results of our model trained on our dataset with those trained on MS COCO, shown in Fig. 11 (1-3)(a), our model can hardly learn a matching between the support and the query target when trained on MS COCO, but it captures the similarity between the image pair when trained on our dataset. More detections of our full model in the 5-shot setting are shown in Fig. 12 and Fig. 13. Our model performs well even when there are objects of non-support categories in the query image.

Figure 12: Qualitative 5-shot object detection results on our test set. We visualize the bounding boxes with scores larger than 0.8.
Figure 13: Additional qualitative 5-shot object detection results on our test set. We visualize the bounding boxes with scores larger than 0.8.

Appendix C: FSOD Dataset Class Split

Here we describe the training/testing class split in our proposed FSOD Dataset. This split was used in our experiments described in section 5.3.

lipstick, sandal, crocodile, football helmet, umbrella, houseplant, antelope, woodpecker, palm tree, box, swan, miniskirt, monkey, cookie, scissors, snowboard, hedgehog, penguin, barrel, wall clock, strawberry, window blind, butterfly, television, cake, punching bag, picture frame, face powder, jaguar, tomato, isopod, balloon, vase, shirt, waffle, carrot, candle, flute, bagel, orange, wheelchair, golf ball, unicycle, surfboard, cattle, parachute, candy, turkey, pillow, jacket, dumbbell, dagger, wine glass, guitar, shrimp, worm, hamburger, cucumber, radish, alpaca, bicycle wheel, shelf, pancake, helicopter, perfume, sword, ipod, goose, pretzel, coin, broccoli, mule, cabbage, sheep, apple, flag, horse, duck, salad, lemon, handgun, backpack, printer, mug, snowmobile, boot, bowl, book, tin can, football, human leg, countertop, elephant, ladybug, curtain, wine, van, envelope, pen, doll, bus, flying disc, microwave oven, stethoscope, burrito, mushroom, teddy bear, nail, bottle, raccoon, rifle, peach, laptop, centipede, tiger, watch, cat, ladder, sparrow, coffee table, plastic bag, brown bear, frog, jeans, harp, accordion, pig, porcupine, dolphin, owl, flowerpot, motorcycle, calculator, tap, kangaroo, lavender, tennis ball, jellyfish, bust, dice, wok, roller skates, mango, bread, computer monitor, sombrero, desk, cheetah, ice cream, tart, doughnut, grapefruit, paddle, pear, kite, eagle, towel, coffee, deer, whale, cello, lion, taxi, shark, human arm, trumpet, french fries, syringe, lobster, rose, human hand, lamp, bat, ostrich, trombone, swim cap, human beard, hot dog, chicken, leopard, alarm clock, drum, taco, digital clock, starfish, train, belt, refrigerator, dog bed, bell pepper, loveseat, infant bed, training bench, milk, mixing bowl, knife, cutting board, ring binder, studio couch, filing cabinet, bee, caterpillar, sofa bed, violin, traffic light, airplane, closet, canary, toilet paper, canoe, spoon, fox, tennis racket, red panda, cannon, stool, zucchini, rugby ball, polar bear, bench, pizza, fork, barge, bow and arrow, kettle, goldfish, mirror, snail, poster, drill, tie, gondola, scale, falcon, bull, remote control, horn, hamster, volleyball, stationary bicycle, dishwasher, limousine, shorts, toothbrush, bookcase, baseball glove, computer mouse, otter, computer keyboard, shower, teapot, human foot, parking meter, ski, beaker, castle, mobile phone, suitcase, sock, cupboard, crab, common fig, missile, swimwear, saucer, popcorn, coat, plate, stairs, pineapple, parrot, fountain, binoculars, tent, pencil case, mouse, sewing machine, magpie, handbag, saxophone, panda, flashlight, baseball bat, golf cart, banana, billiard table, tower, washing machine, lizard, brassiere, ant, crown, oven, sea lion, pitcher, chest of drawers, crutch, hippopotamus, artichoke, seat belt, microphone, lynx, camel, rabbit, rocket, toilet, spider, camera, pomegranate, bathtub, jug, goat, cowboy hat, wrench, stretcher, balance beam, necklace, scoreboard, horizontal bar, stop sign, sushi, gas stove, tank, armadillo, snake, tripod, cocktail, zebra, toaster, frying pan, pasta, truck, blue jay, sink, lighthouse, skateboard, cricket ball, dragonfly, snowplow, screwdriver, organ, giraffe, submarine, scorpion, honeycomb, cream, cart, koala, guacamole, raven, drawer, diaper, fire hydrant, potato, porch, banjo, hammer, paper towel, wardrobe, soap dispenser, asparagus, skunk, chainsaw, spatula, ambulance, submarine sandwich, axe, ruler, measuring cup, scarf, squirrel, tea, whisk, food processor, tick, stapler, oboe,hartebeest, 
modem, shower cap, mask, handkerchief, falafel, clipper, croquette, house finch, butterfly fish, lesser scaup, barbell, hair slide, arabian camel, pill bottle, springbok, camper, basketball player, bumper car, wisent, hip, wicket, medicine ball, sweet orange, snowshoe, column, king charles spaniel, crane, scoter, slide rule, steel drum, sports car, go kart, gearing, tostada, french loaf, granny smith, sorrel, ibex, rain barrel, quail, rhodesian ridgeback, mongoose, red backed sandpiper, penlight, samoyed, pay phone, barber chair, wool, ballplayer, malamute, reel, mountain goat, tusker, longwool, shopping cart, marble, shuttlecock, red breasted merganser, shutter, stamp, letter opener, canopic jar, warthog, oil filter, petri dish, bubble, african crocodile, bikini, brambling, siamang, bison, snorkel, loafer, kite balloon, wallet, laundry cart, sausage dog, king penguin, diver, rake, drake, bald eagle, retriever, slot, switchblade, orangutan, chacma, guenon, car wheel, dandie dinmont, guanaco, corn, hen, african hunting dog, pajama, hay, dingo, meat loaf, kid, whistle, tank car, dungeness crab, pop bottle, oar, yellow lady’s slipper, mountain sheep, zebu, crossword puzzle, daisy, kimono, basenji, solar dish, bell, gazelle, agaric, meatball, patas, swing, dutch oven, military uniform, vestment, cavy, mustang, standard poodle, chesapeake bay retriever, coffee mug, gorilla, bearskin, safety pin, sulphur crested cockatoo, flamingo, eider, picket fence, dhole, spaghetti squash, african elephant, coral fungus, pelican, anchovy pear, oystercatcher, gyromitra, african grey, knee pad, hatchet, elk, squash racket, mallet, greyhound, ram, racer, morel, drumstick, bovine, bullet train, bernese mountain dog, motor scooter, vervet, quince, blenheim spaniel, snipe, marmoset, dodo, cowboy boot, buckeye, prairie chicken, siberian husky, ballpoint, mountain tent, jockey, border collie, ice skate, button, stuffed tomato, lovebird, jinrikisha, pony, killer whale, indian elephant, acorn squash, macaw, bolete, fiddler crab, mobile home, dressing table, chimpanzee, jack o’ lantern, toast, nipple, entlebucher, groom, sarong, cauliflower, apiary, english foxhound, deck chair, car door, labrador retriever, wallaby, acorn, short pants, standard schnauzer, lampshade, hog, male horse, martin, loudspeaker, plum, bale, partridge, water jug, shoji, shield, american lobster, nailfile, poodle, jackfruit, heifer, whippet, mitten, eggnog, weimaraner, twin bed, english springer, dowitcher, rhesus, norwich terrier, sail, custard apple, wassail, bib, bullet, bartlett, brace, pick, carthorse, ruminant, clog, screw, burro, mountain bike, sunscreen, packet, madagascar cat, radio telescope, wild sheep, stuffed peppers, okapi, bighorn, grizzly, jar, rambutan, mortarboard, raspberry, gar, andiron, paintbrush, running shoe, turnstile, leonberg, red wine, open face sandwich, metal screw, west highland white terrier, boxer, lorikeet, interceptor, ruddy turnstone, colobus, pan, white stork, stinkhorn, american coot, trailer truck, bride, afghan hound, motorboat, bassoon, quesadilla, goblet, llama, folding chair, spoonbill, workhorse, pimento, anemone fish, ewe, megalith, pool ball, macaque, kit fox, oryx, sleeve, plug, battery, black stork, saluki, bath towel, bee eater, baboon, dairy cattle, sleeping bag, panpipe, gemsbok, albatross, comb, snow goose, cetacean, bucket, packhorse, palm, vending machine, butternut squash, loupe, ox, celandine, appenzeller, vulture, crampon, backboard, european gallinule, parsnip, jersey, slide, guava, 
cardoon, scuba diver, broom, giant schnauzer, gordon setter, staffordshire bullterrier, conch, cherry, jam, salmon, matchstick, black swan, sailboat, assault rifle, thatch, hook, wild boar, ski pole, armchair, lab coat, goldfinch, guinea pig, pinwheel, water buffalo, chain, ocarina, impala, swallow, mailbox, langur, cock, hyena, marimba, hound, knot, saw, eskimo dog, pembroke, sealyham terrier, italian greyhound, shih tzu, scotch terrier, yawl, lighter, dung beetle, dugong, academic gown, blanket, timber wolf, minibus, joystick, speedboat, flagpole, honey, chessman, club sandwich, gown, crate, peg, aquarium, whooping crane, headboard, okra, trench coat, avocado, cayuse, large yellow lady’s slipper, ski mask, dough, bassarisk, bridal gown, terrapin, yacht, saddle, redbone, shower curtain, jennet, school bus, otterhound, irish terrier, carton, abaya, window shade, wooden spoon, yurt, flat coated retriever, bull mastiff, cardigan, river boat, irish wolfhound, oxygen mask, propeller, earthstar, black footed ferret, rocking chair, beach wagon, litchi, pigeon.

beer, musical keyboard, maple, christmas tree, hiking equipment, bicycle helmet, goggles, tortoise, whiteboard, lantern, convenience store, lifejacket, squid, watermelon, sunflower, muffin, mixer, bronze sculpture, skyscraper, drinking straw, segway, sun hat, harbor seal, cat furniture, fedora, kitchen knife, hand dryer, tree house, earrings, power plugs and sockets, waste container, blender, briefcase, street light, shotgun, sports uniform, wood burning stove, billboard, vehicle registration plate, ceiling fan, cassette deck, table tennis racket, bidet, pumpkin, tablet computer, rhinoceros, cheese, jacuzzi, door handle, swimming pool, rays and skates, chopsticks, oyster, office building, ratchet, salt and pepper shakers, juice, bowling equipment, skull, nightstand, light bulb, high heels, picnic basket, platter, cantaloupe, croissant, dinosaur, adhesive tape, mechanical fan, winter melon, egg, beehive, lily, cake stand, treadmill, kitchen & dining room table, headphones, wine rack, harpsichord, corded phone, snowman, jet ski, fireplace, spice rack, coconut, coffeemaker, seahorse, tiara, light switch, serving tray, bathroom cabinet, slow cooker, jalapeno, cartwheel, laelia, cattleya, bran muffin, caribou, buskin, turban, chalk, cider vinegar, bannock, persimmon, wing tip, shin guard, baby shoe, euphonium, popover, pulley, walking shoe, fancy dress, clam, mozzarella, peccary, spinning rod, khimar, soap dish, hot air balloon, windmill, manometer, gnu, earphone, double hung window, conserve, claymore, scone, bouquet, ski boot, welsh poppy, puffball, sambuca, truffle, calla lily, hard hat, elephant seal, peanut, hind, jelly fungus, pirogi, recycling bin, in line skate, bialy, shelf bracket, bowling shoe, ferris wheel, stanhopea, cowrie, adjustable wrench, date bread, o ring, caryatid, leaf spring, french bread, sergeant major, daiquiri, sweet roll, polypore, face veil, support hose, chinese lantern, triangle, mulberry, quick bread, optical disk, egg yolk, shallot, strawflower, cue, blue columbine, silo, mascara, cherry tomato, box wrench, flipper, bathrobe, gill fungus, blackboard, thumbtack, longhorn, pacific walrus, streptocarpus, addax, fly orchid, blackberry, kob, car tire, sassaby, fishing rod, baguet, trowel, cornbread, disa, tuning fork, virginia spring beauty, samosa, chigetai, blue poppy, scimitar, shirt button.