When labeled data are scarce, neural networks can severely overfit and fail to generalize well. In contrast, human vision systems tend to exhibit strong performance in such tasks: children can learn a visual concept quickly from very few given examples. Such an ability to learn from few examples is also important for computer vision systems, since for some object categories labels are naturally scarce (e.g., California Firetrucks, or endangered animals). More crucially, accurate labels are extremely expensive to obtain in some situations, for example in certain medical problems.
Few-shot learning [22, 20, 19] aims to solve this problem. Various methods have been explored for this research topic in prior works. For example, metric learning-based methods [19, 43] learn to compare a target image with a set of few-shot labeled images and output the class that gives the highest similarity score; meta-optimization based methods [32, 13] learn a model that can quickly adapt and converge for a new few-shot task. However, most prior works only consider the image classification task on small-scale datasets such as Omniglot and MiniImageNet, and the few-shot tasks are often very simple, e.g., 5-way classification [43, 13].
In this work, we consider a more challenging computer vision problem for real-world images, namely few-shot object detection, where only a few examples with annotated bounding boxes are provided. Object detection is much more difficult as it involves not only class predictions but also localization of the objects. Therefore, conventional methods for few-shot learning are not directly applicable. Other settings where the labels are limited have been explored for object detection, for example, the weakly-supervised setting [3, 8, 40] (only image-level labels), or the semi-supervised setting [28, 45, 9] (given a large amount of unlabeled images). However, these problem settings are not as challenging as the few-shot learning one, and sometimes even collecting that many labels can be unrealistic.
To solve the few-shot object detection problem, we propose a novel meta-learning based framework as shown in Figure 1. The framework is motivated by fully exploiting knowledge learned from some base objects to help detect objects from novel categories with a few examples. We find that when training a preliminary deep CNN based object detection model on a few base classes with abundant examples, the model can learn intermediate features at its top layers that are specific to certain object attributes. These features can implicitly compose high-level representations for different objects. Therefore, instead of learning the representation for novel objects from the lowest level, which requires a large amount of data, our framework aims to learn how to adjust the intermediate features from base categories and detect novel objects accordingly.
Our proposed framework thus consists of a detection model that offers basic features and a light-weight meta-model that learns to adapt these features to novel objects with reference to a few examples (Figure 2). To make the adaptation fast, we only require the meta-model to learn to predict reweighting coefficients over the basic features and adjust their contributions to the final detection. Through end-to-end training, the meta-model takes in examples from a set of classes and predicts one feature reweighting per class, each responsible for detecting the corresponding class. With different feature reweightings, the confidence score and bounding box coordinates are sufficiently adjusted. A final softmax layer is applied on the confidence scores for calibration. As illustrated in Figure 1, the whole few-shot detection model is trained in two phases: first learning representations on base classes and then fine-tuning to adapt to novel classes. The feature reweighting module is trained in both phases to meta-learn how to reweight the features for both base and novel classes. Our proposed few-shot detector outperforms competitive baseline methods on multiple datasets and in various settings. In addition, our model demonstrates good transferability from one dataset to another through the feature reweighting.
Our contributions can be summarized as follows:
1. We study the problem of few-shot object detection, which is of great practical value but is less explored than image classification in the few-shot learning literature.
2. We design a meta-learning based model, where the feature importance in the detector is predicted by a meta-model based on the object category to detect. The model is simple and highly data-efficient.
3. We demonstrate through experiments that our model can outperform baseline methods by a large margin, especially when the number of labels is extremely low. Our model can also adapt to novel classes significantly faster than baselines.
4. Through ablation studies we analyze the effects of certain components of our proposed model and reveal some good practices towards solving the few-shot detection problem.
2 Related Work
General Object Detection
RCNN uses pre-trained convolutional networks to classify the region proposals generated by selective search. SPP-Net and Fast-RCNN improve RCNN with a Region of Interest (RoI) pooling layer to extract features on the convolutional feature maps. Faster-RCNN introduces a region-proposal-network (RPN) to improve the efficiency of generating proposals. In contrast, YOLO is a proposal-free method, which uses a single fully-convolutional network to directly perform class and bounding box predictions. SSD improves YOLO by using default boxes (anchors) to adjust to various object shapes. YOLOv2 improves YOLO with a series of techniques, e.g., multi-scale training and a new network architecture (DarkNet-19). Compared with proposal-based methods, proposal-free methods do not require a per-region classifier, and are thus conceptually simpler and significantly faster. Our few-shot detector is built on the YOLOv2 architecture.
Few-shot learning refers to learning from just a few training examples per class, a task at which humans usually perform better than traditional machine learning algorithms. One early approach uses Bayesian inference to generalize knowledge from a pre-trained model to tackle the one-shot learning problem. Another proposes a Hierarchical Bayesian one-shot learning system that exploits compositionality and causality. Image hallucination techniques [16, 44] augment the training data and improve generalization from base categories to few-shot novel categories. Other works consider the problem of adapting to novel classes in a new domain, or assume abundant unlabeled images and adopt label propagation in a semi-supervised setting.
An increasingly popular solution for few-shot learning is meta-learning (also referred to as learning to learn), which can be further divided into three categories: a) Metric learning. Siamese Networks rank feature similarities between inputs to make predictions on unknown classes. Matching Networks learn the task of finding the most similar class for the target image among a small set of labeled images. Prototypical Networks extend Matching Networks by producing a linear classifier instead of a weighted nearest neighbor for each class. Relation Networks learn a distance metric to compare the target image to a few labeled images. b) Optimization for fast adaptation. One approach proposes an LSTM meta-learner that is trained to quickly converge a learner classifier in new few-shot tasks. Model-Agnostic Meta-Learning (MAML) optimizes a task-agnostic network so that a few gradient updates on its parameters lead to good performance on new few-shot tasks. c) Parameter prediction. Learnet dynamically learns the parameters of factorized weight layers based on a single example of each class to realize one-shot learning. Weight imprinting sets weights for a new category using a scaled embedding of labeled examples. Another approach predicts weights from activations to adapt to novel categories without training. These previous works only tackle the image classification task, while our work focuses on object detection, a more challenging and realistic problem in computer vision.
Object Detection with Limited Labels
There are a number of prior works on detection focusing on settings with limited labels. The weakly-supervised setting [3, 8, 40] considers the problem of training object detectors with only image-level labels, without bounding box annotations, which are more expensive to obtain. Few-example object detection [28, 45, 9] assumes only a few labeled bounding boxes per class, but relies on abundant unlabeled images to generate trustworthy pseudo annotations for training. Zero-shot object detection [1, 31, 48] aims to detect previously unseen object categories, and thus usually requires external information such as relations between classes. Different from those settings, our few-shot detector uses very few bounding box annotations (1-10) for each novel class, without the need for unlabeled images or external knowledge. LSTD [4] studies a similar setting but only in a transfer learning context, where the target domain images contain only novel classes without base classes.
3 Few-shot Detection via Feature Reweighting
In this section we describe the problem setup, our model design and training phases in detail.
3.1 Few-shot Detection Setting
In this work, we define a novel and realistic setting for few-shot object detection, in which there are two kinds of data available for training, i.e., the base classes and the novel classes. For the base classes, abundant annotated data are available, while only a few labeled samples are given for the novel classes. This setting is worth exploring since it aligns well with a practical situation: one may expect to deploy a pre-trained detector for new classes with only a few labeled samples. More specifically, large-scale object detection datasets (e.g., VOC, COCO) are available to pre-train a detection model. However, the number of object categories therein is quite limited, especially compared to the vast number of object categories in the real world. Thus, solving this few-shot object detection problem is highly desirable.
To instantiate such a problem setting and effectively evaluate different solutions, we propose to split the objects provided by a detection dataset into base and novel classes. For base classes, we retain all the bounding box labels. For novel classes, we sample a set of images so that each class has exactly $k$ bounding box annotations, where $k$ is the number of shots. The usage of the sampled data will become clear when we introduce the model training.
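The sampling step can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes annotations are given as `(image_id, class_name)` pairs, one per bounding box, and it samples boxes directly, ignoring the complication that a sampled image may also contain boxes of other classes. The function name `sample_k_shot` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def sample_k_shot(annotations, novel_classes, k, seed=0):
    """Pick exactly k box annotations per novel class.

    annotations: list of (image_id, class_name) pairs, one per bounding box.
    Returns a dict mapping class_name -> list of k sampled annotations.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ann in annotations:
        if ann[1] in novel_classes:
            by_class[ann[1]].append(ann)
    shots = {}
    for cls in novel_classes:
        if len(by_class[cls]) < k:
            raise ValueError(f"class {cls!r} has fewer than {k} boxes")
        shots[cls] = rng.sample(by_class[cls], k)
    return shots


# Toy data: 6 boxes for "bird", 4 for "sofa", 5 for the base class "car".
toy = [(i, "bird") for i in range(6)] + [(i, "sofa") for i in range(4)]
toy += [(i, "car") for i in range(5)]
few_shot = sample_k_shot(toy, ["bird", "sofa"], k=3)
```

Base-class annotations such as "car" are left untouched by this step, matching the setting in which all base-class labels are retained.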
3.2 Feature Reweighting for Detection
Our proposed few-shot detection model is based on the proposal-free detection framework YOLOv2. A typical proposal-free detector consists of a feature extractor that is shared among all classes and a final prediction layer that outputs object classification and bounding box coordinates. However, it is known that certain features are more important for certain classes [47, 46]. Moreover, some classes tend to share common features (e.g., cat and dog). For example, if our goal is to detect cats, the features for detecting dogs are probably more useful than those for detecting aeroplanes. This motivates us to enhance the reusability of features by utilizing the feature representation learned on base classes to help better detect the novel classes. Therefore, we propose to assign a weight to each of the extracted features when detecting a specific class of objects, such that the model can pay more attention to features related to this class and ignore features that are irrelevant.
To this end, as shown in Figure 2, we carefully design a light-weight convolutional neural network, namely a meta-model, to generate such vectors of reweighting coefficients. Formally, we denote the meta-model as $\mathcal{M}$, and the reference images and their associated object bounding box annotations as $I_i$ and $M_i$ respectively, for class $i$, $i = 1, \ldots, N$, where $N$ is the number of classes. The meta-model takes in one annotated sample from each of the $N$ classes. Such annotated samples are used as references indicating the target class to detect. It then learns to predict $N$ sets of coefficients $\mathbf{w}_i$, each responsible for adjusting the features to detect the corresponding class $i$. The reweighting vector is predicted as $\mathbf{w}_i = \mathcal{M}(I_i, M_i)$. A DarkNet-19 based feature extractor $\mathcal{D}$ extracts the basis features $F$ from the input image $I$: $F = \mathcal{D}(I)$. The number of feature maps in $F$ is equal to the number of weights in $\mathbf{w}_i$. Our proposed model obtains the class-specific feature $F_i$ for novel class $i$ by the following feature reweighting:

$$F_i = F \otimes \mathbf{w}_i, \quad i = 1, \ldots, N, \tag{1}$$

where $\otimes$ denotes channel-wise multiplication. The reweighting operation can be easily implemented using a $1 \times 1$ depth-wise convolution.
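The channel-wise reweighting can be illustrated with a small framework-free sketch. It assumes the basis features are stored as a list of feature maps (one nested list per channel) and that one reweighting vector per class is already available; the function name `reweight_features` and all numeric values are hypothetical. Scaling each channel by a single coefficient is what a 1×1 depth-wise convolution without bias computes.

```python
def reweight_features(features, weights):
    """Scale each feature map (channel) by its class-specific coefficient.

    features: list of m feature maps, each an h x w nested list of floats.
    weights:  list of m reweighting coefficients for one class.
    """
    if len(features) != len(weights):
        raise ValueError("need one weight per feature map")
    return [
        [[value * w for value in row] for row in fmap]
        for fmap, w in zip(features, weights)
    ]


# Two 2x2 feature maps shared by all classes.
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[1.0, 1.0], [1.0, 1.0]]]
# Hypothetical reweighting vectors for two classes.
class_weights = {"cat": [2.0, 0.0], "bus": [0.5, 3.0]}
class_features = {c: reweight_features(F, w) for c, w in class_weights.items()}
```

In this toy case the "cat" vector switches the second feature map off entirely, while the "bus" vector emphasizes it, so the shared prediction layer would see different inputs for the two classes.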
After acquiring the class-specific features $F_i$, we put a prediction layer $\mathcal{P}$ on top of $F_i$ to directly regress the detection-relevant outputs, following the common practice of current proposal-free detectors [33, 35, 34]. The output corresponds to the objectness score $o_i$, bounding box location offsets $(x_i, y_i, h_i, w_i)$, and classification score $c_i$ for each of a predefined set of anchors. The prediction layer is shared across different classes. The above feature reweighting (Equation 1) has adapted the features to novel classes. These features are then fed into the prediction layer to generate detections for the target classes:

$$\{o_i, x_i, y_i, h_i, w_i, c_i\} = \mathcal{P}(F_i),$$

where $c_i$ is a one-versus-all classification score indicating the probability that the corresponding object belongs to class $i$.
3.3 Training Details
The input of the meta-model should be the object of interest. However, in the object detection task, one image usually contains multiple classes of objects. To let the meta-model know what the target class is, in addition to the three RGB channels, we include an additional "mask" channel that has only binary values: at the position of the object of interest the value is 1, otherwise it is 0 (see bottom left of Figure 2). If multiple target objects are present in the image, only one object is used. This additional mask channel tells the meta-model which part of the image's information it should use, and which part should be considered as "background".
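A minimal sketch of building such a four-channel input is given below. It assumes the image is a list of three channels stored as nested lists and that the box is given in pixel indices with an exclusive maximum; both conventions, and the name `build_meta_input`, are assumptions made for illustration.

```python
def build_meta_input(rgb, box):
    """Append a binary mask channel to an RGB image.

    rgb: list of 3 channels, each an h x w nested list.
    box: (x_min, y_min, x_max, y_max) in pixel indices, max-exclusive.
    Returns 4 channels: R, G, B and the mask.
    """
    height, width = len(rgb[0]), len(rgb[0][0])
    x_min, y_min, x_max, y_max = box
    mask = [
        [1 if (x_min <= x < x_max and y_min <= y < y_max) else 0
         for x in range(width)]
        for y in range(height)
    ]
    return list(rgb) + [mask]


# A 3-channel image with 3 rows and 4 columns, and one object box.
image = [[[0.5] * 4 for _ in range(3)] for _ in range(3)]
meta_input = build_meta_input(image, box=(1, 0, 3, 2))
```

Only a single box is passed in, which mirrors the rule that one object is used even when several target objects are present.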
To train the meta-model for feature reweighting, we need to carefully choose the loss functions, in particular for the class prediction, considering that the number of samples is very small. Given that the predictions are now made class-wise, it seems natural to use a binary cross-entropy loss, regressing 1 if the object is the target class and 0 otherwise. However, we found that this loss function yields a model that outputs redundant detections (e.g., detecting a train as a train, a bus and a car). This may be because, for a specific region, only one of the classes is truly positive, whereas the binary loss strives to produce balanced positive and negative predictions. Non-maximum suppression cannot help, since it only operates on predictions within each class.
To resolve this issue, we attach a softmax layer that calibrates the classification scores among classes and suppresses detections of wrong classes. The actual classification score for class $i$ is therefore given by $\hat{c}_i = e^{c_i} / \sum_{j=1}^{N} e^{c_j}$, where $c_i$ is the raw classification score for class $i$ and $N$ is the number of classes. Then, to better align training and prediction, a cross-entropy loss over the calibrated scores is adopted for optimization:

$$L_c = -\sum_{i=1}^{N} \mathbb{1}(\cdot, i) \log \hat{c}_i,$$

where $\mathbb{1}(\cdot, i)$ is an indicator function for whether the current anchor box belongs to class $i$ or not. After introducing the softmax, the classification scores for a specific anchor sum to 1, and less probable class predictions are suppressed. This softmax loss will be shown to be superior to the binary loss in the following experiments. For bounding box and objectness regression, we adopt a loss function similar to that of YOLOv2, but we balance positives and negatives by not computing some of the loss from negative samples for the objectness scores.
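The difference between independent binary scores and softmax-calibrated scores can be seen in a small numeric sketch. The raw scores below are made up, and the helper names `calibrated_scores` and `classification_loss` are hypothetical; the sketch only covers the classification term for a single anchor, not the full detection loss.

```python
import math


def calibrated_scores(raw_scores):
    """Softmax over the per-class raw scores of one anchor."""
    shift = max(raw_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(raw_scores, target_class):
    """Cross-entropy of the calibrated scores against the true class index."""
    return -math.log(calibrated_scores(raw_scores)[target_class])


# Raw scores for one anchor from three class-specific reweightings,
# e.g. (train, bus, car). The values are made up.
raw = [4.0, 3.0, 1.0]
probs = calibrated_scores(raw)
loss = classification_loss(raw, target_class=0)

# Independent sigmoids, as a binary loss would use, for comparison.
sigmoid = [1.0 / (1.0 + math.exp(-s)) for s in raw]
```

With these values every sigmoid score exceeds 0.5, so all three classes would be reported for the same region, whereas only one calibrated score does, which is the suppression effect described above.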
The training procedure consists of two phases. (1) Base training: given that we have a lot of labeled data on base classes, we first train our model on images with only base class annotations to learn the representation. In this phase, although abundant labels are available for each class so that a conventional detector would also learn good representations, we still jointly train the detector together with the meta-model. This is to make them coordinate in the desired way: the meta-model needs to identify the class to detect and predict the reweighting accordingly. In each iteration, the meta-model takes $N$ (the number of base classes) images and masks as input and produces $N$ sets of reweighting coefficients, each used for detecting one class of objects. (2) Few-shot fine-tuning: in this phase, we train the model on both base and novel classes. As only $k$-shot labels are available for novel classes, to balance between classes we also include $k$ boxes for each base class. The training process is the same as in the first phase, except that it takes significantly fewer iterations for the model to converge.
In both training phases, the reweighting coefficients depend on the meta-model's input, which is randomly sampled from the available data in each iteration. After few-shot fine-tuning, we would like to obtain a model that does not depend on any meta-model input. This is achieved by setting the weights for a target class to the average of the weights predicted by the meta-model when it takes the $k$-shot samples as input. After setting the weights, the meta-model can be completely detached, so at inference time our method adds negligible overhead to the original detector.
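The averaging step that lets the meta-model be detached can be sketched as below. The predicted vectors are made-up values standing in for meta-model outputs, and `average_reweighting` is a hypothetical helper name.

```python
def average_reweighting(vectors):
    """Element-wise mean of the reweighting vectors predicted for one class."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical meta-model outputs for the k = 3 reference samples of a class.
predicted = [[1.0, 0.0, 2.0],
             [3.0, 0.0, 2.0],
             [2.0, 3.0, 2.0]]
fixed_weights = average_reweighting(predicted)
```

Once `fixed_weights` is stored for every class, detection only needs the feature extractor, the stored vectors, and the prediction layer.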
4 Experimental Evaluations
In this section, we present our main experimental results. We also compare with various baseline methods and show our method can adapt to novel object detection significantly faster and more accurately. We use PyTorch for implementation and YOLOv2  as the base detector. More implementation details are given in the supplementary material. The code to reproduce the results will be made publicly available.
4.1 Datasets and Settings
VOC 07 and 12 each consist of training, validation, and test image sets. We follow the common practice [34, 36, 38, 6] of evaluating on the 07 test set while using the 07 and 12 train/val images for training. The images in VOC contain 20 object categories, from which we randomly select 5 categories as novel classes, with the remaining 15 serving as base classes. We evaluate with 3 different random seeds for the class set split. During base training, only annotations of the base classes are given. For few-shot fine-tuning, we use a very small set of training images to ensure that each category of objects has only $k$ annotated bounding boxes, where $k$ is set to 1, 2, 3, 5 and 10.
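A base/novel split of this kind can be generated as in the sketch below. The seeds 0, 1, 2 and the helper `split_classes` are assumptions for illustration, so the resulting splits are not the ones used in the reported experiments.

```python
import random

VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]


def split_classes(classes, num_novel, seed):
    """Randomly choose num_novel novel classes; the rest become base classes."""
    rng = random.Random(seed)
    novel = sorted(rng.sample(classes, num_novel))
    base = [c for c in classes if c not in novel]
    return base, novel


# Three splits, one per (hypothetical) random seed.
splits = [split_classes(VOC_CLASSES, num_novel=5, seed=s) for s in range(3)]
```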
COCO is a larger-scale dataset containing 80k training images, 40k validation images and 20k test images. We use 5000 images from the validation set for evaluation, and the remaining images in the train/val sets for training. COCO has a more diverse set of 80 object classes, from which we select the 20 classes that are also in PASCAL VOC as the set of novel classes, and the remaining 60 classes as base classes. In this way we also consider learning base representations on the 60 classes of COCO and detecting the 20 novel objects in PASCAL VOC. This is the cross-dataset setting denoted as COCO to PASCAL.
We compare our method with three baselines, all based on the original YOLOv2 detector. The first is to jointly train the original YOLOv2 detector on images with abundantly-labeled base classes and rarely-labeled novel classes. We term this baseline YOLO-joint, and train it for the same total number of iterations as our proposed model. The other two baselines use the same two training phases as our model (base training + few-shot fine-tuning), but with the original YOLOv2 model. The base training phase is the same for these two baselines; for few-shot fine-tuning, we train the second baseline, YOLO-ft, for the same number of iterations as our method, and train the third baseline, YOLO-ft-full, to full convergence.
Before we present the results on target novel classes, we first analyze the first-stage representation learning on base classes. Ideally, a low-shot detection system should perform equally well when data are abundant. We compare the mAP on base classes for models obtained after the first-stage base training, between our meta-learning model and the original YOLO detector (used in the latter two baselines). The results are shown in Table 1. Although our detector is designed for the few-shot scenario, it also has strong representation power and reaches performance comparable to the original YOLOv2 detector trained on a lot of samples. This indicates that using class-specific weights predicted by the meta-model, instead of sharing the weights across all classes, does not introduce additional difficulty for the optimization process. This lays a basis for solving the few-shot object detection problem.
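Table 1: mAP on base classes after first-stage base training, reported for Base Set 1, Base Set 2, and Base Set 3.
Table 2: Results on novel classes, reported per method for Novel Set 1, Novel Set 2, and Novel Set 3, each with 1, 2, 3, 5, and 10 shots.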
We present our main results on novel classes in Table 2. First, we note that our model significantly outperforms the baselines, especially when the labels are extremely scarce (1-3 shot). The improvements are also consistent across different base/novel class splits and numbers of shots. Second, we note that YOLO-ft and YOLO-ft-full also perform significantly better than YOLO-joint. This demonstrates the necessity of the two training phases employed in our model: it is better to first learn a good knowledge representation on base classes and then fine-tune with few-shot data; otherwise, joint training will bias the detector towards base classes so that it learns nearly nothing about novel classes.
More detailed per-class results are available in Table 3 (novel set 1). Compared with the results in Table 1, it can be seen that after few-shot fine-tuning, the accuracy on base classes is hurt. This phenomenon holds for both our method (from 69.7% to 63.6%) and YOLO-ft/YOLO-ft-full (from 70.33% to 68.9%/65.7%). During few-shot fine-tuning the labels for base classes are also scarce, which might explain why the accuracy on base classes is hurt. However, this is a tradeoff we have to make. In practice, one can use the few-shot fine-tuned detector to detect novel objects and the original detector that is trained on large data to detect base class objects.
Table 4: Results on the COCO dataset, reported as Average Precision and Average Recall.
The results for the COCO dataset are shown in Table 4. We evaluate two few-shot settings. In both cases, our model outperforms the YOLO baselines. When the baseline is trained for the same number of iterations as our model, it achieves an AP of less than 1%. We also observe that there is much room to improve the results obtained in the few-shot scenario. This is possibly because of the large amount of data in COCO, which makes few-shot detection on it quite challenging.
COCO to PASCAL
We evaluate this setting using 10-shot data of each class from PASCAL. The mAPs of YOLO-ft and YOLO-ft-full are 11.24% and 28.29% respectively, while our method achieves 32.29%. The performance on PASCAL novel classes is worse than when the base classes come from the PASCAL dataset (usually around 40%), which could be explained by the domain shift, as images in COCO and PASCAL are of different complexities.
4.3 Adaptation Speed
Some previous works on meta-learning [32, 13] explicitly optimize the network model so that the system can adapt to new few-shot classification tasks quickly. Here we show that although our few-shot detection model does not consider adaptation speed explicitly in the optimization process, it still exhibits surprisingly fast adaptation. Note that in the experiments of Table 2 and Table 3, YOLO-ft-full requires 25,000 iterations to fully converge, while our meta-learning model requires only 1200 iterations to converge to a higher accuracy. When the baseline YOLO is trained for the same number of iterations as our method (YOLO-ft), the performance is far worse. In this section, we compare the full convergence behavior of YOLO-joint, YOLO-ft-full and our method in Figure 3. The AP values are normalized by the maximum value reached during the training of our method and the baseline together. This experiment is conducted on PASCAL VOC base/novel split 1, with 10-shot bounding box labels on novel classes.
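The normalization used for the convergence curves can be written as a short sketch. The AP values per checkpoint are made up, and `normalize_curves` is a hypothetical helper; the only point illustrated is that both curves are divided by the single largest AP reached by either method.

```python
def normalize_curves(ours, baseline):
    """Divide both AP curves by the largest AP reached by either method."""
    peak = max(max(ours), max(baseline))
    if peak <= 0:
        raise ValueError("AP values must contain a positive entry")
    return [v / peak for v in ours], [v / peak for v in baseline]


ours_ap = [10.0, 30.0, 40.0]      # made-up AP per checkpoint
baseline_ap = [2.0, 10.0, 50.0]   # made-up AP per checkpoint
ours_norm, base_norm = normalize_curves(ours_ap, baseline_ap)
```

Because the peak is shared, a curve that ends below 1.0 belongs to the method that was eventually outperformed, which keeps the two curves directly comparable.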
From Figure 3, our method (solid lines) converges significantly faster than the baseline YOLO detector (dashed lines), for each novel class as well as on average. For the class Sofa (orange line), although the baseline YOLO detector eventually slightly outperforms our method, it takes a large number of training iterations to catch up. This behavior makes our model a good few-shot detector in practice, where scarcely labeled novel classes may arrive at any time and a short adaptation time is desired to put the system into real use quickly. This also opens up our model's potential in a life-long learning setting, where the model accumulates the knowledge learned from the past and uses or adapts it for future prediction.
4.4 Analysis of Predicted Reweightings
(a) Visualization of the predicted weights (in row vectors) from the meta-model for each class. Columns correspond to feature maps, ranked by variance among classes. Due to space limits, we only plot 256 randomly sampled features. (b) t-SNE embedding of the reweighting vectors. More visually similar classes tend to have closer embeddings.
The feature reweighting for a class is predicted by the meta-model, and averaged over multiple inputs during testing. Therefore, in the final model, each class (both base and novel) corresponds to a unique weight vector that decides which features are important for detecting that class. A natural question is to see whether there exist patterns on those predicted weights. We present a detailed analysis of the predicted weights, and discuss some interesting observations.
We first plot the reweighting coefficients for each class in (a). In our architecture, the weights form a 1024 dimensional vector. In the figure, each row corresponds to a class and each column corresponds to a feature. The features are ranked by variance among 20 classes from left to right. We observe that roughly half of the features (columns) have notable variance among different classes (multiple colors in a column), while the other half are insensitive to classes (roughly the same color in a column). This suggests that indeed only a portion of features are used differently when detecting different classes, while the remaining features are shared across different classes.
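The ranking of features by their variance across classes can be sketched as follows. The weight matrix is made up (three classes, four features), the helper name is hypothetical, and sorting from highest to lowest variance is an assumption about the plotting order.

```python
from statistics import pvariance


def rank_features_by_variance(weights_per_class):
    """Return feature indices sorted by variance across classes, highest first.

    weights_per_class: list of reweighting vectors, one row per class.
    """
    columns = list(zip(*weights_per_class))
    variances = [pvariance(col) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: variances[j], reverse=True)
    return order, variances


# Three classes, four features; the values are made up.
W = [[0.9, 0.5, 0.1, 0.5],
     [0.1, 0.5, 0.2, 0.5],
     [0.5, 0.5, 0.3, 0.5]]
order, variances = rank_features_by_variance(W)
```

In this toy matrix two features vary across classes and two are constant, mimicking the observed split between class-sensitive and shared features.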
We further visualize the predicted weight vectors by t-SNE in (b). In this figure, we also plot the weight vector generated by each input of the meta-model, along with their average for each class. We use the weights of the 10-shot trained model, so each class has 10 points with one mean. The model is trained on base/novel split 1, and the novel classes are shown in bold. We observe that not only do weights of the same class tend to form clusters, but visually similar classes also tend to be close. The classes Cow, Horse, Sheep, Cat and Dog are all around the bottom-right corner, and they are all animals. Classes of transportation tools are at the top of the figure. Person and Bird are more visually different from the mentioned animals, but are still closer to them than to the transportation tools.
4.5 Ablation Studies
We analyze the effects of various components in our system, by comparing the performance on both base classes and novel classes. The experiments are on PASCAL VOC base/novel split 1, using 10-shot data on novel classes.
Which Layer Output to Reweight
In our experiments, we apply the reweighting technique to the output of the second last layer (layer 21). This is the highest level of intermediate features we could use. However, other options could be considered as well. We experiment with reweighting layer 20 and 13, while also considering reweighting only half of features in layer 21. The results are shown in Table 5. We can see that the reweighting technique is more suited to be applied at deeper layers, as using earlier layers gives us worse performance. Moreover, reweighting only half of the features does not hurt the performance much, which demonstrates that a significant portion of features can be shared among classes, as we analyzed in subsection 4.4.
Table 5: Results when reweighting the output of Layer 13, Layer 20, Layer 21, and Layer 21 (half of the features).
As we mentioned in subsection 3.3, for classification scores produced by different predicted reweightings, there are several options for the classification loss. Among them the binary loss is the most straightforward one: if the inputs to the meta-model and the detector are from the same class, the model predicts 1, and otherwise 0. This binary loss can be defined in the following two ways. With the single-binary loss, in each iteration the meta-model takes only one class as input, and the detector regresses 0 or 1. With the multi-binary loss, in each iteration the meta-model takes one example from each of the classes and computes one binary loss per class. Prior works on Siamese Networks and Learnet use the single-binary loss. Instead, our model uses the softmax loss to calibrate the classification scores across classes. To investigate the effects of using different loss functions, we compare the performance of models trained with the single-binary loss, the multi-binary loss and our softmax loss in Table 6. We observe that the softmax loss significantly outperforms the binary losses. This is likely due to its effect in suppressing redundant detections, since the classification scores must sum to 1.
Input Form of the Meta-model
In our experiments, we use an image of the target class with a binary mask channel indicating position of the object as input to the meta-model. We examine the case where we only feed the image. From Table 7 we see that this gives lower performance especially on novel classes. An apparently reasonable alternative is to feed the cropped target object together with the image. From Table 7, this solution is also slightly worse. The necessity of the mask may lie in that it provides the precise information about the object location and its context.
We also analyze the meta-model’s input sampling scheme for testing and the effect of sharing weights between feature extractor and meta-model. Due to space limit, we defer the results to the supplementary material.
This work is among the first to explore the practical and challenging few-shot detection problem. It introduces a meta-model that learns to quickly adjust the contributions of the basic features to detect novel classes with a few examples. Experiments on realistic benchmark datasets clearly demonstrate the effectiveness of the proposed model. This work also compares the model adaptation speed, and analyzes the predicted feature weights and the contribution of each design component to provide an in-depth understanding of the proposed model. Few-shot detection is challenging and we will further explore how to improve its performance for complex scenes.
-  A. Bansal, K. Sikka, G. Sharma, R. Chellappa, and A. Divakaran. Zero-shot object detection. arXiv preprint arXiv:1804.04340, 2018.
-  L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, pages 523–531, 2016.
-  H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2846–2854, 2016.
-  H. Chen, Y. Wang, G. Wang, and Y. Qiao. Lstd: A low-shot transfer detector for object detection. arXiv preprint arXiv:1803.01529, 2018.
-  Z. Chen and B. Liu. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3):1–207, 2018.
-  J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
-  A. Diba, V. Sharma, A. M. Pazandeh, H. Pirsiavash, and L. Van Gool. Weakly supervised cascaded convolutional networks. In CVPR, 2017.
-  X. Dong, L. Zheng, F. Ma, Y. Yang, and D. Meng. Few-example object detection with model communication. arXiv preprint arXiv:1706.08249, 2017.
-  M. Douze, A. Szlam, B. Hariharan, and H. Jégou. Low-shot learning with large-scale diffusion. In Computer Vision and Pattern Recognition (CVPR), 2018.
-  M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.
-  M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
-  C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. ICML, 2017.
-  R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
-  B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 3037–3046. IEEE, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision, pages 346–361. Springer, 2014.
-  G. Koch. Siamese neural networks for one-shot image recognition. In ICML Workshop, 2015.
-  B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33, 2011.
-  B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in neural information processing systems, pages 2526–2534, 2013.
-  F.-F. Li, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
-  Z. Luo, Y. Zou, J. Hoffman, and L. F. Fei-Fei. Label efficient learning of transferable representations acrosss domains and tasks. In Advances in Neural Information Processing Systems, pages 165–177, 2017.
-  L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
-  I. Misra, A. Shrivastava, and M. Hebert. Watch and learn: Semi-supervised learning for object detectors from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3593–3602, 2015.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
-  H. Qi, M. Brown, and D. G. Lowe. Low-shot learning with imprinted weights. arXiv preprint arXiv:1712.07136, 2017.
-  S. Rahman, S. Khan, and F. Porikli. Zero-shot object detection: Learning to simultaneously recognize and localize novel concepts. arXiv preprint arXiv:1803.06049, 2018.
-  S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
-  J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525. IEEE, 2017.
-  J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  D. Shen, G. Wu, and H.-I. Suk. Deep learning in medical image analysis. Annual review of biomedical engineering, 19:221–248, 2017.
-  Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue. Dsod: Learning deeply supervised object detectors from scratch. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1937–1945. IEEE, 2017.
-  J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
-  H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell. Weakly-supervised discovery of visual pattern configurations. In Advances in Neural Information Processing Systems, pages 1637–1645, 2014.
-  F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. CVPR, 2018.
-  J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
-  O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
-  Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-shot learning from imaginary data. In CVPR, 2018.
-  Y.-X. Wang and M. Hebert. Model recommendation: Generating object detectors from few samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1619–1628, 2015.
-  F. Yu, Z. Qin, and X. Chen. Distilling critical paths in convolutional neural networks. arXiv preprint arXiv:1811.02643, 2018.
-  B. Zhou, Y. Sun, D. Bau, and A. Torralba. Revisiting the importance of individual units in cnns via ablation. arXiv preprint arXiv:1806.02891, 2018.
-  P. Zhu, H. Wang, T. Bolukbasi, and V. Saligrama. Zero-shot detection. arXiv preprint arXiv:1803.07113, 2018.
A. Implementation Details
All our models are trained using SGD with momentum 0.9 and weight decay 0.0005 (on both the detector and the meta-model). The batch size is set to 64. For base training, we train for 80,000 iterations with a step-wise learning rate decay strategy: the learning rate takes the values 10, 10, 10, 10, with changes happening at iterations 500, 40,000, and 60,000. For few-shot fine-tuning, we use a constant learning rate of 0.001 and train for 1,500 iterations. We use multi-scale training and evaluate the model at 416 × 416 resolution, as with the original YOLOv2.
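A step-wise schedule of this kind can be sketched as a lookup over iteration boundaries. The following is a minimal illustration, not the authors' training code: the function name `stepwise_lr` is hypothetical, and the rates in `EXAMPLE_RATES` are placeholder values chosen for this sketch (a short warm-up followed by two tenfold decays), not values taken from the text. Only the change points 500, 40,000, and 60,000 come from the description above.

```python
from bisect import bisect_right


def stepwise_lr(iteration, boundaries, rates):
    """Return the learning rate in effect at `iteration`.

    boundaries: sorted iterations at which the rate changes.
    rates: one rate per segment, so len(rates) == len(boundaries) + 1.
    """
    if len(rates) != len(boundaries) + 1:
        raise ValueError("need exactly one more rate than boundaries")
    # bisect_right counts how many boundaries are <= iteration,
    # which is the index of the current segment.
    return rates[bisect_right(boundaries, iteration)]


# Change points stated in the text.
BOUNDARIES = [500, 40_000, 60_000]
# ASSUMPTION: placeholder rates for illustration only.
EXAMPLE_RATES = [1e-4, 1e-3, 1e-4, 1e-5]

if __name__ == "__main__":
    for it in (0, 500, 40_000, 60_000, 79_999):
        print(it, stepwise_lr(it, BOUNDARIES, EXAMPLE_RATES))
```

With this convention a new rate takes effect at the boundary iteration itself; shifting the change by one iteration would only require swapping `bisect_right` for `bisect_left`.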
B. Additional Ablation Studies
Sampling of Examples for Testing
During training, the meta-model takes a random input from the k-shot data of each class. In testing, we take all k-shot examples as the meta-model’s input and use the average of their predicted weights for detecting the corresponding class. If we replace the averaging by randomly selecting the meta-model’s input (as during training), the performance on base/novel classes drops significantly, from 69.7%/47.2% to 63.9%/45.1%. This is similar to an ensembling effect, except that averaging over reweighting coefficients does not need additional inference time as normal ensembling does.
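The difference between the two test-time schemes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, reweighting vectors are represented as plain lists of floats, and `reweight` assumes the coefficients are applied as channel-wise scaling of the features.

```python
import random


def average_reweighting(vectors):
    """Element-wise mean of the reweighting vectors predicted for one class."""
    if not vectors:
        raise ValueError("need at least one reweighting vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all reweighting vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def random_reweighting(vectors, rng=random):
    """Training-style alternative: use the vector of one random example."""
    return list(rng.choice(vectors))


def reweight(features, weights):
    """ASSUMPTION: coefficients scale the features channel by channel."""
    return [f * w for f, w in zip(features, weights)]


if __name__ == "__main__":
    # Two hypothetical per-example vectors for one class (k = 2).
    per_example = [[1.0, 2.0], [3.0, 4.0]]
    class_weights = average_reweighting(per_example)
    print(reweight([10.0, 10.0], class_weights))
```

Because the averaged vector is computed once per class, the detector still runs a single forward pass per class at inference, which is why this form of averaging adds no inference cost compared with ensembling several predictions.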
Sharing Weights Between Feature Extractor and Meta-model
The first few layers of the meta-model and the backbone feature extractor share the same architecture, so some weights can be shared between them. We evaluated this alternative and found that the performance on base/novel classes decreases from 69.7%/47.2% to 68.3%/44.8%. A possible reason is that weight sharing imposes more constraints on the optimization process.
C. Complete Results on PASCAL VOC
Here we present the complete results for each class and number of shots on the PASCAL VOC dataset. The results for base/novel splits 1, 2, and 3 are shown in Tables 1, 2, and 3, respectively.