Few-Shot Learning for Video Object Detection in a Transfer-Learning Scheme

by Zhongjie Yu, et al.

Different from static images, videos contain additional temporal and spatial information for better object detection. However, it is costly to obtain a large number of videos with bounding box annotations that are required for supervised deep learning. Although humans can easily learn to recognize new objects by watching only a few video clips, deep learning usually suffers from overfitting. This leads to an important question: how to effectively learn a video object detector from only a few labeled video clips? In this paper, we study the new problem of few-shot learning for video object detection. We first define the few-shot setting and create a new benchmark dataset for few-shot video object detection derived from the widely used ImageNet VID dataset. We employ a transfer-learning framework to effectively train the video object detector on a large number of base-class objects and a few video clips of novel-class objects. By analyzing the results of two methods under this framework (Joint and Freeze) on our designed weak and strong base datasets, we reveal insufficiency and overfitting problems. A simple but effective method, called Thaw, is naturally developed to trade off the two problems and validate our analysis. Extensive experiments on our proposed benchmark datasets with different scenarios demonstrate the effectiveness of our novel analysis in this new few-shot video object detection problem.




1 Introduction

With the popularity of cameras in surveillance systems and mobile phones, as well as the mass adoption of social media content sharing platforms, more and more video content is generated every day. Therefore, the need for algorithms that detect objects in videos is growing rapidly in computer vision. Although it is possible to train a robust video object detector with sufficient labeled videos, powerful deep neural networks, and abundant computational resources, collecting such a large number of videos with bounding box annotations is costly.

Humans can learn new concepts easily with only a few examples. Despite that deep learning has been successfully applied to many real-world applications, it usually suffers from the overfitting problem when there are only a few samples for new concepts. Few-shot learning, which tries to learn a robust model from only a few samples of a new concept, has thus attracted great attention recently [20, 31, 7, 28, 26, 17, 15, 18, 12, 13, 6, 19, 33, 2, 39, 36].

Most existing few-shot learning methods focus on either image classification [20, 7, 21, 31, 28, 26, 17, 15, 18] or video classification [2, 39]. While some recent few-shot learning methods [12, 13, 6, 19, 33] have investigated object detection, all of them focus on object detection in static images instead of videos. Different from static images, videos contain abundant spatial and temporal information of objects. Therefore, it becomes more imperative to design a model for video object detection given a few videos of novel-class objects. This poses a new problem of few-shot learning for video object detection.

Since videos contain more information than static images, using the spatial-temporal information in videos is critical to achieving good performance. Because it is computationally prohibitive to build episodes by representing a video by its frames, the meta-learning based few-shot learning methods designed for image classification cannot be directly applied to the few-shot video object detection problem. While some techniques target few-shot video classification [2, 39], that task differs from video object detection: video classification only assigns the entire video to one class, whereas video object detection must determine both the presence and the spatial-temporal location of objects in every frame of a video. This substantially increases the difficulty in terms of both computation and implementation. The recent finding [33] that transfer-learning based methods can achieve good results on image object detection opens a possible path for solving the problem of few-shot video object detection.

In this paper, we study the new problem of few-shot learning for video object detection. In particular, given a few video clips of novel-class objects, we would like to build a robust object detector for novel-class objects. Realizing there is no prior work studying this, we curate a new benchmark dataset derived from the popular ImageNet-VID dataset [25] for this problem. We design two types of base datasets: a strong dataset and a weak dataset to investigate the influence of the strength of the feature extractor for the video object detector. We employ the transfer-learning framework. Specifically, we first train a video object detector on the whole base dataset, which can aggregate the temporal information from other frames in the entire video based on the state-of-the-art method MEGA [4]. After that, we fine-tune the cosine classifier and bounding box regressor in the RPN head. Based on different fine-tuning strategies, we consider two methods: Joint and Freeze. By analyzing the performance of the two methods on the curated benchmark datasets, we reveal the insufficiency and overfitting problems. In order to solve this issue, a simple but effective method called Thaw is naturally developed by our analysis to balance the insufficiency and overfitting problems effectively. Based on the evaluation of the curated benchmark dataset, we demonstrate the effectiveness of our proposed method.

Our contributions are summarized as follows:

1) We propose a new paradigm of few-shot learning for video object detection. Specifically, we study how an object detector can be learned from a few videos of new concepts where abundant temporal information is available while maintaining good performance for existing classes.

2) We curate a dataset derived from the popular ImageNet-VID dataset [25] for investigating this new problem. A strong base dataset and a weak base dataset are designed for further scenario analysis.

3) We propose a transfer-learning framework for solving this problem and investigate two methods under the framework: Joint and Freeze. We reveal two issues: insufficiency and overfitting based on a novel quantitative analysis.

4) We propose a simple method called Thaw to trade off the insufficiency and overfitting problems revealed by our analysis. Extensive experiments demonstrate that our proposed Thaw naturally motivated by our novel analysis can help the video object detector efficiently learn new concepts from a few novel-class videos and achieve promising performance.

2 Related Work

2.1 Few-Shot Learning

Few-shot learning methods can be categorized into meta-learning based methods and transfer-learning based methods.

Meta-learning aims to learn a paradigm based on the base-class data with an episodic training strategy so that it can generalize to new tasks with only a few novel examples. Metric-based meta-learning methods learn a good distance metric from the few-shot examples of base classes [31, 28, 29, 15]. For example, Prototypical Network [28] measures the distance from queries and the prototypes of each class. DN4 [15] explores the local descriptors to measure similarities. Optimization-based meta-learning methods like MAML [7], LEO [26] and Reptile [18] aim to find a good optimization direction that converges faster to the optimal solution with fewer gradient steps. However, most meta-learning based methods are designed for image classification. It is challenging to directly extend them to the few-shot video object detection scenario in this paper.

Transfer-learning based methods focus on how to train a good base model from the large amount of base-class data and then adapt the model to novel classes with only a few samples [20, 21, 3, 36]. Unlike meta-learning based methods, the base model is trained with standard supervised learning, and a new classifier is built on top of a frozen feature extractor. Cosine similarity is usually used to build the new classifier given novel-class samples [20, 3, 21]. Chen et al. [3] systematically analyzed the performance of building a cosine classifier on a frozen feature extractor and compared it with popular meta-learning methods, demonstrating the effectiveness of transfer-learning based methods for few-shot learning. Again, most transfer-learning based methods target image classification and have not been applied to the few-shot video object detection problem.

2.2 Object Detection

Object detection is a fundamental problem in computer vision and has been studied for decades, with significant progress made in recent years due to the advancement of deep learning. Nowadays, CNN-based object detection methods have become the mainstream and can be divided into two main categories: 1) one-stage detection; and 2) two-stage detection. One-stage detection methods, such as YOLO [22] and SSD [16], do not require region proposals and can predict the bounding box directly. Two-stage detection methods require region proposals and generally show better performance than one-stage detection methods. Representative methods include RCNN [9], Fast-RCNN [8], Faster-RCNN [24], and Mask R-CNN [10]. Recently, anchor-free methods have also been proposed, which directly regress a bounding box from each location in the feature map [38, 14, 30]. However, their application to few-shot learning has not been sufficiently explored yet.

2.3 Few-Shot Image Object Detection

Few-shot image object detection is still a developing area of research, with only a handful of notable works [35, 12, 13, 6, 19, 33]. Most of them are based on meta-learning. Meta-RCNN [35] proposed a meta-learning process for RoI features in Faster-RCNN [24]. Feature Reweighting [12] developed a reweighting module that maps features to corresponding classes in YOLOv2 [23]. Qi et al. [6] proposed Attention RPN and a Multi-Relation Head for matching classes. Recently, Wang et al. [33] investigated the transfer-learning based approach by first training a Faster-RCNN model on the base-class data, and then freezing the feature extractor and fine-tuning a cosine classifier and a regressor on a balanced dataset of base-class and novel-class data. This method has achieved promising performance compared to previous meta-learning methods. However, none of the existing works considered few-shot video object detection, which is the focus of this paper.

2.4 Video Object Detection

Object detection in videos is a challenging task due to the high variation across the videos. The objects in videos may be blurry and change pose and status. Meanwhile, the moving background and unstable lighting conditions pose challenges to object detection. On the other hand, the temporal information in videos can provide more information than static images. These aspects have remained under-explored in the past several years but have recently started to attract more attention after the ImageNet-VID dataset [25] was released. Most recent works (e.g., RDN [5], FGFA [40] and STSN [1]) focus on utilizing nearby frames in a short range to gather local information. Meanwhile, other works aim to use global information from frames in a wider range like SELSA [34]. MEGA [4] was recently proposed to effectively combine the local and global information with a memory-enhanced module so that one frame can access more content from nearby and far-away frames, resulting in state-of-the-art performance.

3 Few-Shot Video Object Detection

In this section, we introduce the definition of few-shot video object detection and dataset construction process.

3.1 Problem Definition

Let us define two sets of classes: base classes C_b and novel classes C_n, where C_b ∩ C_n = ∅. Assume there is a large base dataset consisting of many videos per class in C_b, where V_i is the i-th video in the base dataset, f_{i,t} is the t-th frame in the i-th video, and each annotation (c_{i,t,m}, b_{i,t,m}) gives the class and the bounding box coordinates of the m-th object in the t-th frame of the i-th video, respectively. Note that the number of annotated objects can vary from frame to frame. A video object detection model can be built on this base dataset.

After building the model from the base dataset, we would like to adapt the base model to novel classes given only a few videos per novel class. It is natural to define shot for image classification and image object detection because one image or one bounding box has only one class label. In order to have an appropriate definition of shot for video object detection, we define two types of videos as follows.

Clean videos: Videos contain only one class of objects. Each frame may contain more than one object of that class.

Perfect videos: Videos contain only one class of objects and each frame contains only one object from that class. It is a subset of clean videos.

We denote one perfect video as one shot. Also, we expect a good detector to perform well on both base and novel classes. Motivated by these aspects, we construct an N-way K-shot balanced dataset with both novel and base classes drawn from perfect videos, containing exactly K perfect videos for each of the N classes. The model will be further trained on this balanced dataset.
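As a concrete illustration, the balanced dataset can be assembled by sampling K perfect videos for each class. The sketch below assumes a hypothetical record format (dicts with "id", "cls", and "perfect" fields); it is not the authors' released code.

```python
import random

def build_balanced_few_shot(videos, classes, k, seed=0):
    """Sample k perfect videos per class to form an N-way k-shot dataset.

    videos: list of dicts with keys "id", "cls", "perfect" (hypothetical format).
    Returns a dict mapping each class to its k sampled perfect videos.
    """
    rng = random.Random(seed)
    dataset = {}
    for c in classes:
        perfect = [v for v in videos if v["cls"] == c and v["perfect"]]
        if len(perfect) < k:
            raise ValueError(f"class {c} has fewer than {k} perfect videos")
        dataset[c] = rng.sample(perfect, k)
    return dataset

# Example: a 2-way 1-shot dataset from toy records.
toy = [
    {"id": 0, "cls": "panda", "perfect": True},
    {"id": 1, "cls": "panda", "perfect": False},
    {"id": 2, "cls": "car", "perfect": True},
]
fewshot = build_balanced_few_shot(toy, ["panda", "car"], k=1)
```

Fixing the random seed makes the sampled few-shot dataset reproducible across the compared methods, mirroring the fair-comparison protocol described in Section 5.1.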

Figure 1: The proposed framework of few-shot learning for video object detection. A video object detector is pretrained on the base dataset by aggregating local and global information from different frames in the videos, and then adapted to novel classes based on few-shot novel-class video dataset. During the adaptation, the cosine classifier is used in the detection head and the model is fine-tuned by three methods: Joint, Freeze, and our developed Thaw.

3.2 Dataset Construction

Dataset Type Total Number of Frames
All training videos 57,834 key frames
Perfect training videos 31,530 key frames
All validation videos 176,126 frames
Balanced validation videos 23,624 frames
Table 1: The statistics of different types of datasets.

Since we are studying a new few-shot video object detection problem, we construct a new dataset from the ImageNet VID dataset [25], which is widely used for video object detection. It consists of 30 categories of training and validation videos. We split them into 25 base classes and 5 novel classes in each novel-base split. Specifically, we create four different types of datasets as follows:

Strong base dataset: It consists of all the videos from base-class objects. Meanwhile, like existing work for video object detection [4, 27], we additionally include images from the image object detection dataset DET [25], leading to a strong base dataset for learning strong feature extractors of the base classes.

Weak base dataset: It contains only the perfect videos from base classes, and thus is a subset of strong base dataset.

Remarks: It is important to investigate the impact of different types of base datasets for real-life applications, since the available amount of base-class data can vary across scenarios. This has been overlooked in few-shot learning for image classification. Recently, Yue et al. [37] investigated the performance difference between strong and weak backbones in few-shot image classification, which shares similarities with our design of strong and weak base datasets.

Balanced few-shot dataset: It is used for few-shot adaptation, and is randomly sampled from the perfect videos in the ImageNet VID training set for both novel and base class objects. We sample K perfect videos per class, with K being the shot number.

Balanced validation dataset: It is a subset of the original clean validation videos and is used to evaluate the few-shot video object detector. Noting that the number of clean validation videos differs across classes, we randomly sample an equal number of clean videos per class (the minimum count among all classes) to construct this dataset. In this way, we alleviate the impact of different numbers of videos in different classes.

4 The Proposed Framework

In this section, we introduce our proposed framework for few-shot video object detection, which is illustrated in Figure 1. First, a video object detector is pretrained on base dataset. Second, a cosine classifier is used to replace the classifier in the detection head and fine-tuned on the balanced few-shot video dataset. More details are provided in the subsequent sections.

4.1 Part I: Pretraining the Video Object Detector

In the first stage, we train a video object detector on the base dataset, where many videos per class are provided. In particular, given the base dataset, we first construct a video object detector that fully utilizes the abundant video information by employing the state-of-the-art method MEGA [4]. Any video detector could be used here; we use MEGA because it can efficiently combine both local and global information throughout the video with its memory-enhanced module.

We rewrite a video as a set of consecutive frames {f_1, …, f_T}, and let P_t denote the set of proposals generated by the RPN in frame f_t. Following [4], the local pool of proposals for a key frame f_k consists of the proposals in the nearby frames, i.e.,

L_k = { P_t : k − T_l ≤ t ≤ k + T_l },

where we omit the video index for simplicity. For the global pool, the ordered frames are randomly reordered such that the index set {1, …, T} is mapped to a new shuffled set {s_1, …, s_T}. The global pool is then constructed from T_g of the shuffled frames:

G_k = { P_{s_t} : k ≤ t ≤ k + T_g − 1 }.
A function g is used to aggregate the features of the global pool into the local pool, producing a new aggregated local pool, where g consists of stacked location-free relation modules. (For simplicity, we do not elaborate on these modules in this paper.)

After obtaining the globally aggregated pool, it is further aggregated with a long-range memory pool M by another function h, composed of stacked location-based relation modules proposed by Chen et al. [4], to generate an enhanced pool.
The memory pool M is initialized as an empty set and is updated throughout all frames in one video. When detection on the current key frame finishes, its enhanced features are added to M, which will then be used for the next key frame. This recurrent process increases the efficiency of combining features for each key frame. Moreover, one frame can benefit from its subsequent frames without forgetting.

We denote the feature extractor, implemented with RoI-Align and a fully-connected layer [10], as F. For each key frame, the final features for all proposals are obtained by applying F to the enhanced pool, and are used in the RPN head for classification and bounding box regression. This process is applied to all the key frames in a video.
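The recurrent memory update described above can be sketched as follows. The relation-module aggregation is replaced by a trivial stand-in (list concatenation), and the memory size is illustrative, so this only shows the control flow, not MEGA's actual computation.

```python
from collections import deque

def detect_with_memory(frame_features, memory_size=3):
    """Sketch of a MEGA-style recurrent memory: after processing each key
    frame, its features are pushed into a bounded memory pool that the next
    key frame can attend to. The real relation modules are replaced by a
    trivial aggregation that concatenates feature lists."""
    memory = deque(maxlen=memory_size)  # long-range memory pool M
    outputs = []
    for feats in frame_features:        # feats: proposal features of one key frame
        # enhanced pool = current features aggregated with everything in memory
        enhanced = list(feats)
        for past in memory:
            enhanced.extend(past)
        outputs.append(enhanced)
        memory.append(list(feats))      # update M for the next key frame
    return outputs

outs = detect_with_memory([[1], [2], [3], [4]], memory_size=2)
```

Because the memory is bounded (`maxlen`), the per-frame cost stays constant while each key frame still sees content beyond its local window.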

4.2 Part II: Adaptation on Few-Shot Videos

After obtaining the pretrained video object detector, we adapt the model based on the balanced few-shot video dataset including novel-class and base-class objects.

4.2.1 Modification on the RPN Head

Since the cosine classifier has been shown to be effective for the few-shot image classification problem in [3], and is more suitable for de-correlating the feature space of different classes [32], we adopt a cosine classifier for the few-shot fine-tuning stage. Specifically, a weight matrix W = [w_1, …, w_C] is fine-tuned, where w_j represents the prototype of the j-th class. The cosine similarity between a given proposal feature x and the j-th class is then

s_j = (w_j · x) / (‖w_j‖ ‖x‖).
Because cosine similarity ranges between -1 and 1, the softmax output cannot reach the one-hot encoding of the class labels, causing a discrepancy between the one-hot target and the predicted distribution. To address this, a scaling factor is usually applied before the softmax for better convergence, as in [20, 3, 32]. With that, the probability for the j-th class can be represented as

p_j = exp(α s_j) / Σ_{j'} exp(α s_{j'}),

where α is the scale factor and s_j is the cosine similarity with the j-th class.

The structure of the regressor remains the same as in the pretraining stage, but extra output dimensions (four bounding box coordinates per class) are appended to account for the N-way novel classes. The weights of the cosine classifier and the regressor for the novel classes are randomly initialized.
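A minimal version of the scaled cosine classifier can be written directly from the formulas above. The scale value 20 below is illustrative, not the paper's setting.

```python
import math

def cosine_scores(x, prototypes):
    """Cosine similarity between a proposal feature x and each class prototype."""
    def cos(a, b):
        num = sum(ai * bi for ai, bi in zip(a, b))
        den = math.sqrt(sum(ai * ai for ai in a)) * math.sqrt(sum(bi * bi for bi in b))
        return num / den
    return [cos(x, w) for w in prototypes]

def scaled_softmax(scores, alpha=20.0):
    """Softmax over alpha-scaled cosine similarities (alpha is illustrative)."""
    exps = [math.exp(alpha * s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Two class prototypes; the feature lies close to the first one.
protos = [[1.0, 0.0], [0.0, 1.0]]
probs = scaled_softmax(cosine_scores([0.9, 0.1], protos))
```

Without the scale factor, the largest achievable logit gap is 2 (from -1 to 1), so the softmax output stays far from one-hot; multiplying by α sharpens the distribution and lets the cross-entropy loss converge.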

4.2.2 Adaptation Strategies

With the new RPN head, we use multiple strategies (including the newly proposed Thaw to be described in Section 6) to adapt the pretrained model when given few-shot videos.

Joint: All the weights of the feature extractor and detection head are fine-tuned jointly on the balanced few-shot dataset. This joint fine-tuning is not designed for few-shot learning and usually suffers from overfitting, because the feature extractor can be easily impacted by the few-shot samples. Therefore, it is seldom used in few-shot image classification and usually serves as a low-performing baseline for few-shot image object detection [35, 12].

Freeze: In this method, the feature extractor is frozen and only the detection head is fine-tuned. Freezing the feature extractor is particularly suitable for few-shot learning and is widely used in image classification [3]. The recent work in [33] also shows its superiority in overcoming overfitting, achieving new state-of-the-art performance in few-shot image object detection.
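In a typical implementation, the only difference between Joint and Freeze is which parameter groups receive gradient updates. The helper below sketches that selection on plain name lists; the "backbone."/"head." prefixes are hypothetical, not taken from the authors' code.

```python
def trainable_params(param_names, strategy):
    """Return the parameter names updated during few-shot fine-tuning.

    strategy: "joint"  -> everything is fine-tuned;
              "freeze" -> only the detection-head parameters are fine-tuned.
    The name prefixes ("backbone.", "head.") are hypothetical."""
    if strategy == "joint":
        return list(param_names)
    if strategy == "freeze":
        return [n for n in param_names if n.startswith("head.")]
    raise ValueError(f"unknown strategy: {strategy}")

names = ["backbone.conv1.weight", "head.cls.weight", "head.reg.weight"]
```

In a framework such as PyTorch, the same effect is achieved by setting `requires_grad = False` on the frozen parameters before building the optimizer.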

5 Preliminary Experiments

In this section, we conduct experiments on the designed dataset under our proposed framework. We evaluate the different adaptation strategies of Joint and Freeze in different settings. We analyze the results and discover the insufficiency and overfitting problems in few-shot video object detection.

5.1 Implementation Details

Video object detector network: ResNet-101 [11] is used as the backbone. The RPN head is applied to the conv4 block of ResNet, with anchors at multiple scales and aspect ratios; bounding box proposals are generated per frame during training and validation with the default IoU threshold. Next, RoI-Align and a fully-connected layer are employed after the conv5 block to extract RoI-pooled features, followed by the classifier and regressor. For the video detector, we set the local pool range T_l and the global pool range T_g to … and 10, respectively. The hyperparameters of the aggregation modules are the same as in [4]. For the cosine classifier, a fixed scale factor is used.

Base Class Method / Split 1-shot 2-shot 3-shot
Dataset A B C A B C A B C
Weak Novel Joint (better) 22.11 21.88 9.76 32.57 32.29 23.28 36.5 39.84 29.86
Freeze 18.93 24.05 8.09 25.07 31.60 13.88 28.89 40.53 13.73
Base Joint 56.41 59.23 56.30 58.41 61.77 58.86 60.12 63.75 60.26
Freeze (better) 60.98 63.97 60.95 61.15 64.12 60.83 61.32 65.02 61.10
Strong Novel Joint 21.69 31.10 20.06 39.37 45.94 34.74 44.56 51.43 43.33
Freeze (better) 41.85 40.69 31.71 50.14 47.81 42.66 53.15 52.41 43.08
Base Joint 72.08 76.09 75.80 73.87 77.06 76.97 75.96 78.66 77.25
Freeze (better) 82.79 84.75 86.45 83.00 84.69 86.53 83.14 85.00 86.43
Table 2: Novel-class and base-class mAP50 (in %) on the validation videos when the base dataset is weak or strong. Better results are in bold. For novel-class performance, Freeze is better than Joint on strong base dataset but worse than Joint on weak base dataset, opposite to the research findings on images. For base-class performance, Freeze performs consistently better than Joint.

Training: All models are trained on Tesla V100 GPUs, with each GPU holding one set of frames. During pretraining, the learning rate is dropped once after a fixed number of iterations. During fine-tuning on the balanced few-shot dataset, the classifier and regressor are fine-tuned for the same number of iterations in all settings, with a smaller learning rate. We create three random novel-base splits (denoted as A, B, C), which is a common practice in few-shot image object detection. During training the detector on the base dataset and fine-tuning it on the balanced few-shot dataset, 15 frames are selected evenly spaced from each video; for videos with fewer than 15 frames, all frames are selected. The experiments for each few-shot video setting are repeated multiple times. Each time the balanced few-shot dataset is sampled randomly but kept the same across methods to ensure a fair comparison. The dataset statistics are shown in Table 1. The code for the algorithm and dataset will be released.

Inference: During inference on the validation set, we apply NMS with a fixed IoU threshold and use mean average precision at 0.5 IoU (mAP50) as the evaluation metric. The balanced validation dataset remains the same across different few-shot settings to reduce randomness. The reported mAP50 in each setting is the average over the repeated random experiments.
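For reference, mAP50 counts a detection as correct when its IoU with a ground-truth box is at least 0.5. A minimal IoU computation, with boxes given as (x1, y1, x2, y2):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A detection matching a ground-truth box at IoU >= 0.5 counts toward mAP50.
match = iou((0, 0, 10, 10), (0, 0, 10, 5)) >= 0.5
```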

Figure 2: Illustration of the insufficiency and overfitting problems. Strong and weak base datasets indicate the amount of base-class data; Freeze and Joint indicate the flexibility of the feature extractor. Green down arrows indicate that a problem is alleviated, and red up arrows indicate that it is aggravated.

5.2 Preliminary Results

The experiment results for weak and strong base datasets are shown in Table 2.

Base-class performance: From the results on both base datasets, Freeze is consistently better than Joint on the base-class objects. This is not surprising as freezing the feature extractor can alleviate the overfitting problem.

Novel-class performance: When the base dataset is strong, Freeze is better than Joint on novel-class objects in all cases but one. This is consistent with prior work on few-shot image classification and image object detection, as freezing the feature extractor can prevent overfitting. However, the pattern on the weak base dataset is the opposite (see Table 2): there, Joint generally performs better than Freeze, contrary to the research findings on images. An explanation is that, for static images, it is usually easy to obtain enough information from the base dataset to build a sufficient feature extractor; the overfitting problem then dominates for novel-class images, and Freeze works well in that situation. However, the complicated structure and abundant information in videos may not be sufficiently learned from the base dataset, so learning good features for novel-class objects cannot be guaranteed.

5.3 Insufficiency vs. Overfitting

The preliminary results from the previous section reveal the insufficiency and overfitting problems for the novel and base classes.

Insufficiency problem corresponds to the situation that the features learned from the base dataset may not be sufficient for building detectors on novel-class objects.

Overfitting problem corresponds to the situation that some good features learned from the base dataset may be distorted by the few-shot videos during fine-tuning when the feature extractor is unfrozen.

We summarize the analysis in Figure 2. It is clear that a strong base dataset can alleviate the base-class insufficiency problem. Meanwhile, a strong base dataset can provide a better feature extractor, which improves the capability to extract better features for novel classes. Therefore, it can alleviate the insufficiency problem for novel classes. On the other hand, freezing the feature extractor (Freeze) could largely solve the overfitting problem for both base and novel classes when fine-tuning on the few-shot videos. However, it does not allow the feature extractor to encode possible novel information from novel classes. Therefore, unfreezing the feature extractor (Joint) would alleviate the insufficiency problem for novel classes, since the features extracted for base classes in the base training stage may not be sufficient to describe novel-class objects. Since the base-class objects in the few-shot videos do not provide any further useful information for the base classes (they come from the base-class objects used for pretraining the feature extractor), unfreezing the feature extractor (Joint) does not help reduce the base-class insufficiency problem.

Therefore, in terms of novel-class performance, when the base dataset is weak, there is a significant novel-class insufficiency problem. Although Joint aggravates the novel-class overfitting problem, it largely alleviates novel-class insufficiency problem, such that its performance exceeds Freeze. When the base dataset is strong, the novel-class insufficiency problem becomes negligible, such that Freeze’s performance is better than Joint’s in this case.

On the other hand, in terms of base-class performance, unfreezing the feature extractor can only increase the base-class overfitting problem and cannot reduce the base-class insufficiency problem, such that Joint's performance is always worse than Freeze's, no matter whether the base dataset is strong or weak.

6 Improved Method and Experiment

In this section, we further propose a simple but effective method called Thaw to balance the tradeoff of Joint and Freeze during adaptation and demonstrate the rationality of our analysis. We conduct further experiments to illustrate the insufficiency and overfitting problems.

Class Method / Shot Weak Base Dataset Strong Base Dataset
1-shot 2-shot 3-shot Rank 1-shot 2-shot 3-shot Rank
Novel Joint 17.92 29.38 35.40 2 24.28 40.02 46.44 3
Freeze 17.02 23.52 27.72 3 38.08 46.87 49.55 1
Thaw (Ours) 20.05 32.13 37.32 1 36.73 48.71 51.38 1
Base Joint 57.31 59.68 61.37 3 74.66 75.97 77.29 3
Freeze 61.97 62.03 62.48 1 84.66 84.74 84.86 1
Thaw (Ours) 60.13 60.33 61.52 2 80.79 78.81 78.94 2
Table 3: Novel-class and base-class mAP50 (in %) on the validation videos, averaged over all novel-base splits. Best results are in bold; a shared rank indicates similar results.

6.1 Improved Method: Thaw

As discussed in the previous section, there is a tradeoff between the insufficiency and overfitting problems caused by unfreezing the feature extractor. To balance the two, we first freeze the feature extractor and fine-tune only the detection head. This gives us a good detection head while preventing overfitting. After convergence, we unfreeze the feature extractor and fine-tune all the weights jointly, letting the feature extractor learn extra information to reduce the novel-class insufficiency problem. Compared with Joint, the detection head can be regarded as being initialized with better weights. We call this improved two-stage process Thaw: first Freeze, then Joint.

During the experiments, after the detection head is fine-tuned by Freeze, the entire model is further trained for a number of additional iterations that depends on the shot count.
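The two-stage schedule can be sketched as a thin wrapper around the two strategies; the step counts and the returned phase log are stand-ins for an actual optimization loop.

```python
def thaw_schedule(head_steps, joint_steps):
    """Sketch of Thaw: fine-tune with the feature extractor frozen first,
    then unfreeze everything and fine-tune jointly. Returns a log of which
    parameter groups were updated at each step (a stand-in for real training)."""
    log = []
    # Phase 1 (Freeze): only the detection head is updated.
    for _ in range(head_steps):
        log.append("head-only")
    # Phase 2 (Joint): the feature extractor is thawed and trained too.
    for _ in range(joint_steps):
        log.append("all-weights")
    return log

log = thaw_schedule(head_steps=3, joint_steps=2)
```

The ordering is the essential design choice: the head is first fitted against stable, frozen features, so that when the extractor is thawed, the gradients flowing back through it come from an already reasonable head rather than from randomly initialized novel-class weights.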

6.2 Results

Figure 3: Novel-class mAP50 (in %) improvement of Joint and Thaw over Freeze. The gain is from the component of unfreezing the feature extractor. Clearly, the gain is increasing with more shots, and our Thaw improves over Joint by a large margin.

The experimental results are shown in Table 3. When the base dataset is weak, it is apparent that our proposed Thaw outperforms Freeze and Joint by a large margin for novel classes. As described in the previous section, Joint outperforms Freeze substantially in this case (see Table 2). Our proposed Thaw is even better, which demonstrates that it can balance the overfitting and insufficiency problems.

When the base dataset is strong, Thaw performs comparably to Freeze. Note that Joint performs poorly in this case, as shown in Table 2. Our simple method improves over Joint by a large margin.

For the base-class performance, our Thaw also significantly outperforms Joint, demonstrating that this simple technique substantially alleviates the base-class overfitting problem. Note that the unfreezing step in Thaw can still aggravate base-class overfitting, so it is expected that Freeze performs best on the base classes (Section 5.3).

Number of shots vs. insufficiency/overfitting: When the feature extractor is unfrozen, more shots during few-shot adaptation can simultaneously alleviate the novel-class insufficiency and overfitting problems. To better illustrate this phenomenon, we compare the improvement of Joint and Thaw over Freeze on novel classes. The novel-class gain of Joint or Thaw over Freeze is calculated as

Gain(method, k) = mAP50_method(k) − mAP50_Freeze(k), where method ∈ {Joint, Thaw} and k is the number of shots.
The gain can be seen as the contribution of unfreezing the feature extractor. The results are shown in Figure 3. We can clearly see that the novel-class gain grows as the number of shots increases.
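As a concrete illustration, the gain is simply the per-shot mAP50 difference against Freeze; the numbers below are the weak-base novel-class entries from Table 3.

```python
# Novel-class mAP50 (%) on the weak base dataset, from Table 3.
mAP50 = {
    "Freeze": {1: 17.02, 2: 23.52, 3: 27.72},
    "Joint":  {1: 17.92, 2: 29.38, 3: 35.40},
    "Thaw":   {1: 20.05, 2: 32.13, 3: 37.32},
}

def gain_over_freeze(method, shots):
    """Contribution of unfreezing the feature extractor, in mAP50 points."""
    return round(mAP50[method][shots] - mAP50["Freeze"][shots], 2)

joint_gains = [gain_over_freeze("Joint", k) for k in (1, 2, 3)]
thaw_gains = [gain_over_freeze("Thaw", k) for k in (1, 2, 3)]
```

Both gain sequences are increasing in the number of shots, and Thaw's gain exceeds Joint's at every shot count, which is the trend plotted in Figure 3.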

Figure 4: Examples of few-shot learning for video object detection of giant panda (weak base dataset). Blue (resp. red) bounding boxes denote correct (resp. incorrect) detection. Note that red panda is a different animal. The 1st row shows the 1-shot novel-class video used in few-shot adaptation. The 2nd, 3rd, and 4th rows show the detection results in the validation videos by Joint, Freeze and Thaw, respectively.
Method   Classifier        Novel-class mAP50 (%)
                           1-shot  2-shot  3-shot
Thaw     Fully-connected    36.91   44.62   52.86
Thaw     Cosine             37.81   50.55   56.15
Table 4: Ablation study on two types of classifiers on the strong base dataset and novel-base split A.
Method   Format        Novel-class mAP50 (%)
                       1-shot  2-shot  3-shot
Freeze   Image-based    19.73   32.24   37.56
Freeze   Video-based    41.85   50.14   56.00
Table 5: Ablation study on the influence of temporal information from the video, on the strong base dataset and novel-base split A. Note that a 1-shot video is similar to 15-shot images. The image-based Freeze (the first row) can be regarded as applying the state-of-the-art few-shot image object detection method [33] to the 15-shot video images.

6.3 Ablation Study

Influence of classifier: We conduct an ablation study on the choice of classifier in the detection head on novel-base split A. Table 4 shows the results of the two classifiers on the 1-, 2-, and 3-shot settings with Thaw. The cosine classifier consistently outperforms the fully-connected classifier in our few-shot video object detection.
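A cosine classifier replaces the fully-connected layer's dot product with a scaled cosine similarity between L2-normalized features and class weight vectors [20, 32]. A minimal NumPy sketch, where the scale value 20.0 is illustrative rather than the paper's setting:

```python
import numpy as np

def cosine_logits(features, weights, scale=20.0):
    """Cosine classifier: scaled cosine similarity between L2-normalized
    features of shape (N, D) and class weight vectors of shape (C, D)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    return scale * f @ w.T  # (N, C), each entry in [-scale, scale]

def fc_logits(features, weights, bias):
    """Ordinary fully-connected classifier, for comparison."""
    return features @ weights.T + bias
```

Because the logits are invariant to feature magnitude and bounded by the scale, the cosine classifier reduces the bias toward classes with large-norm weight vectors, which tends to help the few-shot novel classes.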

Influence of video temporal information: Without using the temporal information in the video, the video object detector degenerates to an image object detector. In this case, given few-shot videos, all the key frames are used separately for fine-tuning the image object detector. Thus 1-shot in our video object detection is roughly equivalent to 15-shot in image object detection. To study this further, we conduct an ablation study on novel-base split A. We use Freeze here since the image-based Freeze can be considered as the state-of-the-art transfer-learning-based few-shot image object detection method [33]. The results in Table 5 clearly demonstrate the importance of using video information rather than single-image information, and thus the value of few-shot learning for video object detection as its own problem.

Visualization of few-shot video object detection: Examples of 1-shot video object detection results for the different methods are shown in Figure 4. The blue and red bounding boxes denote correct and incorrect detections, respectively. From the results, we can see that Joint suffers from the overfitting problem, as indicated by the red boxes. For Freeze, there is a mixture of correct and incorrect detections. Thaw produces more correct bounding boxes than Joint and Freeze, highlighting its effectiveness.

7 Conclusion

We study a new problem of few-shot learning for video object detection. Specifically, we define the problem, construct a new dataset, and propose a transfer-learning framework for solving this problem. Insufficiency and overfitting problems are revealed from extensive experiments on our designed weak and strong base datasets by two methods (Joint and Freeze) under the proposed framework. Finally, a simple yet effective method called Thaw is naturally developed to validate our analysis and trade off the observed insufficiency and overfitting problems. Our work leads to significantly improved novel-class performance on the weak base dataset and competitive novel-class performance on the strong base dataset, while maintaining high base-class performance in few-shot video object detection.


  • [1] Gedas Bertasius, Lorenzo Torresani, and Jianbo Shi. Object detection in video with spatiotemporal sampling networks. In ECCV, pages 331–346, 2018.
  • [2] Kaidi Cao, Jingwei Ji, Zhangjie Cao, Chien-Yi Chang, and Juan Carlos Niebles. Few-shot video classification via temporal alignment. In CVPR, 2020.
  • [3] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Wang, and Jia-Bin Huang. A closer look at few-shot classification. In ICLR, 2019.
  • [4] Yihong Chen, Yue Cao, Han Hu, and Liwei Wang. Memory enhanced global-local aggregation for video object detection. In CVPR, pages 10337–10346, 2020.
  • [5] Jiajun Deng, Yingwei Pan, Ting Yao, Wengang Zhou, Houqiang Li, and Tao Mei. Relation distillation networks for video object detection. In ICCV, pages 7023–7032, 2019.
  • [6] Qi Fan, Wei Zhuo, Chi-Keung Tang, and Yu-Wing Tai. Few-shot object detection with attention-rpn and multi-relation detector. In CVPR, pages 4013–4022, 2020.
  • [7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [8] Ross Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
  • [9] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
  • [10] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [12] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. In ICCV, 2019.
  • [13] Leonid Karlinsky, Joseph Shtok, Sivan Harary, Eli Schwartz, Amit Aides, Rogerio Feris, Raja Giryes, and Alex M Bronstein. RepMet: Representative-based metric learning for classification and few-shot object detection. In CVPR, pages 5197–5206, 2019.
  • [14] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.
  • [15] Wenbin Li, Lei Wang, Jinglin Xu, Jing Huo, Yang Gao, and Jiebo Luo. Revisiting local descriptor based image-to-class measure for few-shot learning. In CVPR, pages 7260–7268, 2019.
  • [16] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In ECCV, pages 21–37, 2016.
  • [17] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. Learning to propagate labels: Transductive propagation network for few-shot learning. arXiv preprint arXiv:1805.10002, 2018.
  • [18] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
  • [19] Juan-Manuel Perez-Rua, Xiatian Zhu, Timothy M Hospedales, and Tao Xiang. Incremental few-shot object detection. In CVPR, pages 13846–13855, 2020.
  • [20] Hang Qi, Matthew Brown, and David G. Lowe. Low-shot learning with imprinted weights. In CVPR, 2018.
  • [21] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. Few-shot image recognition by predicting parameters from activations. In CVPR, 2018.
  • [22] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
  • [23] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In CVPR, pages 7263–7271, 2017.
  • [24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
  • [26] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
  • [27] Mykhailo Shvets, Wei Liu, and Alexander C. Berg. Leveraging long-range temporal relationships between proposals for video object detection. In ICCV, 2019.
  • [28] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in neural information processing systems, pages 4077–4087, 2017.
  • [29] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, pages 1199–1208, 2018.
  • [30] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.
  • [31] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
  • [32] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In ACM International Conference on Multimedia, 2017.
  • [33] Xin Wang, Thomas E. Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. In ICML, 2020.
  • [34] Haiping Wu, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Sequence level semantics aggregation for video object detection. In ICCV, pages 9217–9225, 2019.
  • [35] Xiaopeng Yan, Ziliang Chen, Anni Xu, Xiaoxi Wang, Xiaodan Liang, and Liang Lin. Meta R-CNN: Towards general solver for instance-level low-shot learning. In ICCV, pages 9577–9586, 2019.
  • [36] Zhongjie Yu, Lin Chen, Zhongwei Cheng, and Jiebo Luo. TransMatch: A transfer-learning scheme for semi-supervised few-shot learning. In CVPR, 2020.
  • [37] Zhongqi Yue, Hanwang Zhang, Qianru Sun, and Xian-Sheng Hua. Interventional few-shot learning, 2020.
  • [38] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [39] Linchao Zhu and Yi Yang. Compound memory networks for few-shot video classification. In ECCV, pages 751–766, 2018.
  • [40] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In ICCV, pages 408–417, 2017.