Few-shot learning methods operate in low data regimes. The aim is to learn with few training examples per class. Although significant progress has been made in few-shot image classification, few-shot video recognition is relatively unexplored and methods based on 2D CNNs are unable to learn temporal information. In this work we thus develop a simple 3D CNN baseline, surpassing existing methods by a large margin. To circumvent the need of labeled examples, we propose to leverage weakly-labeled videos from a large dataset using tag retrieval followed by selecting the best clips with visual similarities, yielding further improvement. Our results saturate current 5-way benchmarks for few-shot video classification and therefore we propose a new challenging benchmark involving more classes and a mixture of classes with varying supervision.
In the video domain, annotating data is very time-consuming due to the additional time dimension. The lack of labeled training data is more prominent for some fine-grained action classes at the “tail” of the skewed long-tail distribution (see Figure 1), e.g., “arabesque in ballet”. It is thus important to learn to classify videos in the limited labeled training data regime. Visual recognition methods that operate in the few-shot learning setting aim to generalize a classifier trained on base classes with enough training data to novel classes with only a few labeled training examples. While considerable attention has been devoted to this scenario in the image domain [41, 28, 29, 4], few-shot video classification is relatively unexplored.
Existing few-shot video classification methods are mostly based on frame-level features extracted from a 2D CNN, which essentially ignores the important temporal information. Although additional temporal modules have been added at the top of a pre-trained 2D CNN, necessary temporal cues may be lost when temporal information is learned on top of static image features. We argue that under-representing temporal cues may negatively impact the robustness of the classifier. In fact, in the few-shot scenario it may be risky for the model to rely exclusively on appearance and context cues extrapolated from the few available examples. In order to make temporal information available we propose to represent the videos by means of a 3D CNN.
While obtaining labeled videos for target classes is time-consuming and challenging, there are many videos tagged by users available on the internet. For example, there are 400,000 tag-labeled videos in the YFCC100M  dataset. Our second goal is thus to leverage such tag-labeled videos (Figure 1) to alleviate the lack of labeled training data.
Existing experimental settings for few-shot video classification [48, 2] are limited. Predicting a label among just 5 novel classes in each testing episode is in fact relatively easy. Moreover, restricting the label space to only novel classes at test time, and ignoring the base classes is unrealistic. In real-world applications test videos are expected to belong to any class.
In this work, our goal is to push the progress of few-shot video classification in three ways: 1) To learn the temporal information, we revisit spatiotemporal CNNs in the few-shot video classification regime. We develop a 3D CNN baseline that maintains significant temporal information within short clips; 2) We propose to retrieve relevant videos annotated with tags from a large video dataset (YFCC100M) to circumvent the need for labeled videos of novel classes; 3) We extend current few-shot video classification evaluation settings by introducing two challenges. In our generalized few-shot video classification task, the label space has no restriction in terms of classes. In many-way few-shot video classification, the number of classes goes well beyond five, and towards all available classes. Our extensive experimental results demonstrate that on existing settings spatiotemporal CNNs outperform the state-of-the-art by a large margin, and that on our proposed settings weakly-labeled videos retrieved using tags successfully tackle both of our new few-shot video classification tasks.
Low-shot learning setup. The low-shot image classification [26, 29, 15] setting uses a large-scale fully labeled dataset for pre-training a DNN on the base classes, and a low-shot dataset with a small number of examples from a disjoint set of novel classes. The terminology “$k$-shot $n$-way classification” means that in the low-shot dataset there are $n$ distinct classes and $k$ examples per class for training. Evaluating with few examples ($k$ small) is bound to be noisy. Therefore, the training examples are often sampled several times and accuracy results are averaged [15, 6]. Many authors focus on cases where the number of classes $n$ is small as well, which amplifies the measurement noise. For that case, the notion of “episodes” has been introduced. An episode is one sampling of $n$ classes and $k$ examples per class, and the accuracy measure is averaged over episodes.
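The episodic protocol described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' evaluation code: the names `sample_episode`, `mean_episode_accuracy`, `videos_by_class` and the `classify` callback are hypothetical, and a single query video per class is assumed by default.

```python
import random


def sample_episode(videos_by_class, n_way, k_shot, n_query=1, rng=None):
    """Draw one n-way k-shot episode: a support set and a disjoint query set."""
    if rng is None:
        rng = random.Random()
    # Pick n_way distinct classes, then k_shot + n_query distinct videos per class.
    classes = rng.sample(sorted(videos_by_class), n_way)
    support, query = [], []
    for label in classes:
        picked = rng.sample(videos_by_class[label], k_shot + n_query)
        support += [(video, label) for video in picked[:k_shot]]
        query += [(video, label) for video in picked[k_shot:]]
    return support, query


def mean_episode_accuracy(videos_by_class, classify, n_way, k_shot,
                          n_episodes, seed=0):
    """Average top-1 query accuracy over many randomly sampled episodes.

    `classify(support, video)` is a hypothetical callback that adapts to the
    support set and returns a predicted label for one query video.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        support, query = sample_episode(videos_by_class, n_way, k_shot, rng=rng)
        correct = sum(classify(support, video) == label for video, label in query)
        total += correct / len(query)
    return total / n_episodes
```

Averaging over many episodes is what reduces the measurement noise that comes from using very few examples and very few classes per episode.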
It is feasible to use distinct datasets for pre-training and low-shot evaluation. However, to avoid dataset bias  it is easier to split a large supervised dataset into disjoint sets of “base” and “novel” classes. The evaluation is often performed only on novel classes, except [15, 45, 32] who evaluate on the combination of base+novel classes.
Recently, a low-shot video classification setup has been proposed [48, 7]. They use the same type of decomposition of the dataset as , with learning episodes and random sampling of low-shot classes. In this work, we follow and extend the evaluation protocol of .
Tackling low-shot learning. The simplest low-shot learning approach is to extract embeddings from the images using the pre-trained DNN and train a linear classifier [15] on these embeddings using the available training examples. Another approach is to cast low-shot learning as a nearest-neighbor classifier. The “imprinting” approach consists in building a linear classifier from the embeddings of training examples, then fine-tuning it. Note that this is close to a nearest-neighbor classifier, since before fine-tuning it is equivalent to doing class-mean similarity search with a cosine distance. As a complementary approach, prior work has looked into exploiting noisy labels to aid classification. By leveraging tags of 100M images from the YFCC100M dataset, they show improvements over ImageNet pre-training. In this work, we use videos from YFCC100M retrieved by tags to augment and improve training of our few-shot classifier.
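To make the link between imprinting and class-mean cosine search concrete, the following is a minimal sketch of the initialization step only (no fine-tuning). The function names are hypothetical and the embeddings are plain Python lists, so this is an illustration of the idea rather than any published implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def imprint_weights(embeddings, labels):
    """Build one unit-length weight vector per class from its training embeddings.

    Each weight is the normalized sum (equivalently, the normalized mean) of the
    L2-normalized embeddings of that class.
    """
    sums = {}
    for emb, label in zip(embeddings, labels):
        unit = l2_normalize(emb)
        acc = sums.setdefault(label, [0.0] * len(unit))
        for i, x in enumerate(unit):
            acc[i] += x
    return {label: l2_normalize(acc) for label, acc in sums.items()}


def predict(weights, embedding):
    """Return the class whose weight has the highest cosine similarity."""
    unit = l2_normalize(embedding)
    return max(weights,
               key=lambda c: sum(w * x for w, x in zip(weights[c], unit)))
```

Because both the weights and the query are unit-length, the linear score is exactly the cosine similarity to the class mean direction.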
In a meta-learning setup, the low-shot classifier is assumed to have hyper-parameters or parameters that must be adjusted before training. Thus, there is a preliminary meta-learning step that consists in training those parameters on simulated episodes sampled from the base classes. Both Matching networks  and Prototypical Networks employ metric learning to “meta-learn” deep neural features and adopt a nearest neighbor classifier.  meta-learns the optimization algorithm via an LSTM that maps the low-shot training examples into a classifier. Feature hallucination  meta-learns how to generate additional training data for novel classes, directly in the feature space. In MAML , the embedding classifier is meta-learned to adapt quickly and without overfitting to fine-tuning. Ren et al.  introduce a semi-supervised meta-learning approach that includes unlabeled examples in each training episode. While that method holds out a subset from the same target dataset as the unlabeled images, our retrieval-enhanced approach leverages weakly-labeled videos from another heterogeneous dataset which may have domain shift issues and a huge amount of distracting videos.
Recent works [4, 44] suggest that state-of-the-art performance can be obtained by methods that do not need meta learning. In particular,  show that meta-learning methods are less useful when the image descriptors are expressive enough, which is the case when they are from high-capacity networks trained on large datasets. Therefore, we focus on techniques that do not require a meta-learning stage.
Deep descriptors for videos. Moving from hand-designed descriptors [5, 24, 31, 42] to learned deep network based descriptors [9, 10, 21, 33, 43, 38] has been enabled by labeled large-scale datasets [22, 21], and parallel computing hardware. Deep descriptors are sometimes based on 2D-CNN models operating on a frame-by-frame basis with temporal aggregation [13, 47]. More commonly they are 3D-CNN models that operate on short sequences of images that we refer to as video-clips [38, 40]. Recently, ever-more-powerful descriptors have been developed leveraging two-stream architectures using additional modalities [10, 33], factorized 3D convolutions [40, 39], or multi-scale approaches . While most of these descriptors are trained in a fully supervised way, advances in learning deep descriptors in either weakly supervised [46, 12, 25] or self supervised fashion have been explored as well [23, 27].
In the few-shot learning setting, classes are split into two disjoint label sets, i.e., base classes (denoted as $C_b$) that have a large number of training examples, and novel classes (denoted as $C_n$) that have only a small set of training examples. Let $X_b$ denote the training videos with labels from the base classes and $X_n$ be the training videos with labels from the novel classes ($|X_n| \ll |X_b|$). Given the training data $X_b$ and $X_n$, the goal of the conventional few-shot video classification task (FSV) [48, 2] is to learn a classifier which predicts labels among novel classes at test time. As the test-time label space is restricted to a few novel classes, the FSV setting is unrealistic. Thus, in this paper, we additionally study generalized few-shot video classification (GFSV), which allows videos at test time to belong to any base or novel class.
In this section, we introduce our spatiotemporal CNN baseline for few-shot video classification (3DFSV). Our approach in Figure 2 consists of 1) a representation learning stage which trains a spatiotemporal CNN on the base classes, 2) a few-shot learning stage that trains a linear classifier for novel classes with few labeled videos, and 3) a testing stage which evaluates the model on unseen test videos. The details of each of these stages are given below.
Representation learning. Our model adopts a 3D CNN $\phi$, encoding a short, fixed-length video clip of $F$ RGB frames with spatial resolution $H \times W$ to a feature vector in a $d_v$-dimensional embedding space. On top of the feature extractor $\phi$, we define a linear classifier parameterized by a weight matrix $W_b \in \mathbb{R}^{d_v \times |C_b|}$, producing a probability distribution over the base classes $C_b$. The objective is to jointly learn the network $\phi$ and the classifier $W_b$ by minimizing the cross-entropy classification loss on video clips randomly sampled from training videos of base classes. More specifically, given a training video $x$ with a label $y$, the loss for a video clip $x_i$ sampled from video $x$ is defined as
$$\mathcal{L}(x_i, y) = -\log \sigma\big(W_b^\top \phi(x_i)\big)_y$$
where $\sigma$ denotes the softmax function that produces a probability distribution and $\sigma(\cdot)_y$ is the probability at class $y$. Following prior work, we do not do meta-learning, so we can use all the base classes to learn the network $\phi$.
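The clip-level objective is a standard softmax cross-entropy on top of a linear classifier. As a hedged illustration, the sketch below computes that loss for a single clip feature; the names `softmax`, `linear_scores` and `clip_loss` are hypothetical, the classifier weights are stored as one row per class, and the clip feature is assumed to have already been produced by the 3D CNN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_scores(weights, feature):
    """Linear classifier: one dot product per class (weights has one row per class)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def clip_loss(weights, feature, label):
    """Cross-entropy loss of one clip: minus the log-probability of the true class."""
    probs = softmax(linear_scores(weights, feature))
    return -math.log(probs[label])
```

With all-zero weights the predicted distribution is uniform, so the loss equals the logarithm of the number of classes, which is a convenient sanity check when wiring up training.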
Few-shot learning. This stage aims to adapt the learned network to recognize novel classes $C_n$ with a few training videos $X_n$. To reduce overfitting, we fix the network $\phi$ and learn a linear classifier by minimizing the cross-entropy loss on video clips randomly sampled from videos in $X_n$, where $W_n \in \mathbb{R}^{d_v \times |C_n|}$ is the weight matrix of the linear classifier. Similarly, we define the loss for a video clip $x_i$ sampled from a video $x \in X_n$ with a label $y$ as
$$\mathcal{L}(x_i, y) = -\log \sigma\big(W_n^\top \phi(x_i)\big)_y$$
Testing. The spatiotemporal CNN operates on fixed-length video clips of RGB frames and the classifiers make clip-level predictions. At test time, the model must predict the label of a test video $x$ with arbitrary time length $T$. We achieve this by randomly drawing a set of $L$ clips $\{x_i\}_{i=1}^{L}$ from video $x$, where each $x_i$ is a fixed-length clip. The video-level prediction is then obtained by averaging the prediction scores after the softmax function $\sigma$ over those clips. For few-shot video classification (FSV), this is:
$$\frac{1}{L} \sum_{i=1}^{L} \sigma\big(W_n^\top \phi(x_i)\big)$$
where $\phi$ is the spatiotemporal CNN and $W_n$ is the novel-class weight matrix.
For generalized few-shot video classification (GFSV), both base and novel classes are taken into account and we concatenate the base class weight $W_b$ learned in the representation stage with the novel class weight $W_n$ learned in the few-shot learning stage:
$$\frac{1}{L} \sum_{i=1}^{L} \sigma\big([W_b, W_n]^\top \phi(x_i)\big)$$
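The test-time procedure can be sketched as follows: clip-level softmax scores are averaged into a video-level score, and the generalized setting simply stacks the base-class and novel-class classifier weights. This is an illustrative sketch under the assumption that clip features are precomputed lists and that each weight matrix is stored as one row per class; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def video_scores(clip_features, weights):
    """Average the per-clip softmax distributions into one video-level distribution."""
    n_classes = len(weights)
    avg = [0.0] * n_classes
    for feature in clip_features:
        logits = [sum(w * x for w, x in zip(row, feature)) for row in weights]
        for c, p in enumerate(softmax(logits)):
            avg[c] += p / len(clip_features)
    return avg


def predict_fsv(clip_features, novel_weights):
    """FSV: the label space contains novel classes only."""
    scores = video_scores(clip_features, novel_weights)
    return max(range(len(scores)), key=scores.__getitem__)


def predict_gfsv(clip_features, base_weights, novel_weights):
    """GFSV: concatenate base and novel weights, so indices cover base classes first."""
    scores = video_scores(clip_features, base_weights + novel_weights)
    return max(range(len(scores)), key=scores.__getitem__)
```

In the generalized case, a returned index below the number of base classes denotes a base class and any larger index denotes a novel class.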
During few-shot learning, fine-tuning the network or learning the classifier alone is prone to overfitting. Moreover, class-labeled videos to be used for fine-tuning are scarce. Instead, the hypothesis is that leveraging a massive collection of weakly-labeled real-world videos would improve our novel-class classifier. To this end, for each novel class, we propose to retrieve a subset of weakly-labeled videos, associate pseudo-labels to these retrieved videos and use them to expand the training set of novel classes. It is worth noting that those retrieved videos may be assigned with wrong labels and have domain shift issues as they belong to another heterogeneous dataset, making this idea challenging to implement. For efficiency and to reduce the label noise, we adopt the following two-step retrieval approach.
Tag-based video retrieval. The YFCC100M dataset  includes around 800K videos collected from Flickr, with a total length of over 8000 hours. Processing a large collection of videos has a high computational demand and a large portion of them are irrelevant to our target classes. Thus, we restrict ourselves to videos with tags related to those of the target class names and leverage information that is complementary to the actual video content to increase the visual diversity.
Given a video with user tags $\{t_i\}_{i=1}^{m}$, where $t_i$ is a word or phrase and $m$ is the number of tags, we represent it with an average tag embedding $\frac{1}{m}\sum_{i=1}^{m}\varphi(t_i)$. The tag embedding $\varphi(\cdot)$ maps each tag to a $d_t$-dimensional embedding space, e.g., Fasttext. Similarly, we can represent each class by the text embedding of its class name, and then for each novel class $c$ we compute its cosine similarity to all the video tags and retrieve the $N$ most similar videos according to this distance.
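A minimal sketch of the tag-based retrieval step is given below. It assumes the tag and class-name embeddings have already been looked up (for instance from a pre-trained word-embedding table) and are passed in as plain lists; `retrieve_by_tags` and its arguments are hypothetical names used only for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def retrieve_by_tags(class_embedding, video_tag_embeddings, top_n):
    """Rank videos by cosine similarity between the class-name embedding and
    the average embedding of each video's tags, and keep the top_n videos.

    video_tag_embeddings maps a video id to a list of tag embedding vectors;
    videos without any tag are skipped.
    """
    scored = [
        (cosine(class_embedding, mean_vector(tags)), video_id)
        for video_id, tags in video_tag_embeddings.items()
        if tags
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored[:top_n]]
```

Because only text embeddings are compared, this step never touches the video content, which keeps it cheap enough to run over a very large collection.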
Selecting best clips. The video tag retrieval selects a list of candidate videos for each novel class. However, those videos are not yet suitable for training because the annotation may be erroneous, which can harm the performance. Besides, some weakly-labeled videos can last as long as an hour. We thus propose to select the best short, fixed-length clips from those candidate videos using the few-shot videos of novel classes.
Given a set of few-shot videos from novel class $c$, we randomly sample $L$ video clips from each video. We then extract features from those clips with the spatiotemporal CNN $\phi$ and compute the class prototype by averaging over clip features. Similarly, for each retrieved candidate video of novel class $c$, we also randomly draw $L$ video clips and extract clip features with $\phi$. Finally, we perform a nearest neighbour search with cosine distance to find the $M$ best matching clips of the class prototype:
$$\max_{x_j} \cos\big(p_c, \phi(x_j)\big)$$
where $p_c$ denotes the class prototype of class $c$, and $x_j$ is a clip belonging to the retrieved weakly-labeled videos. After repeating this process for each novel class, we obtain a collection of pseudo-labeled video clips $X_p = \{X_p^c\}$, where $X_p^c$ indicates the best $M$ video clips from YFCC100M for novel class $c$.
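The clip selection step amounts to a prototype-based nearest neighbour search. The sketch below illustrates it under the assumption that clip features have already been extracted by the spatiotemporal CNN; `class_prototype`, `select_best_clips` and the `(clip_id, feature)` pair format are hypothetical choices made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def class_prototype(few_shot_clip_features):
    """Class prototype: the mean of the clip features of the few-shot videos."""
    count = len(few_shot_clip_features)
    return [sum(column) / count for column in zip(*few_shot_clip_features)]


def select_best_clips(few_shot_clip_features, candidate_clips, top_k):
    """Keep the top_k candidate clips closest (in cosine similarity) to the prototype.

    candidate_clips is a list of (clip_id, feature) pairs drawn from the
    tag-retrieved videos of one novel class.
    """
    prototype = class_prototype(few_shot_clip_features)
    ranked = sorted(candidate_clips,
                    key=lambda item: cosine(prototype, item[1]),
                    reverse=True)
    return [clip_id for clip_id, _ in ranked[:top_k]]
```

Selecting at the clip level rather than the video level matters here, since a long weakly-labeled video may contain only a few seconds that actually show the target action.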
Batch denoising. The retrieved video clips contribute to learning a better novel-class classifier in the few-shot learning stage by expanding the training set of novel classes from the few-shot set $X_n$ to $X_n \cup X_p$, where $X_p$ denotes the retrieved pseudo-labeled clips. $X_p$ may include video clips with wrong labels. During the optimization, we adopt a simple strategy to alleviate the noise: at each iteration, half of the video clips in a batch come from $X_n$ and the other half from $X_p$. The purpose is to reduce the gradient noise in each mini-batch by enforcing that half of the samples are trustworthy.
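The balanced batching strategy can be sketched as a simple sampler. This is an illustration only: `mixed_batches` is a hypothetical helper, and sampling with replacement is an assumption made here because the few-shot set is usually much smaller than half a batch.

```python
import random


def mixed_batches(clean_clips, retrieved_clips, batch_size, n_iterations, seed=0):
    """Yield batches in which exactly half of the clips come from the clean
    few-shot set and the other half from the noisy retrieved set.

    Both halves are sampled with replacement (an assumption of this sketch).
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even to split it into two halves")
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(n_iterations):
        clean = [rng.choice(clean_clips) for _ in range(half)]
        noisy = [rng.choice(retrieved_clips) for _ in range(half)]
        batch = clean + noisy
        rng.shuffle(batch)  # mix the two sources within the batch
        yield batch
```

For example, `list(mixed_batches(few_shot, retrieved, batch_size=8, n_iterations=100))` would produce 100 batches of 8 clips, each containing 4 trusted and 4 pseudo-labeled clips regardless of how many clips were retrieved.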
In this section, we first describe the existing experimental settings and our proposed setting for few-shot video recognition. We then present the results comparing our approaches with the state-of-the-art methods in the existing setting on two datasets, the results of our approach in our proposed settings, model analysis and qualitative results.
Here we describe the four datasets we use, previous few-shot video classification protocols and our settings.
Datasets. Kinetics  is a large-scale video classification dataset which covers 400 human action classes including human-object and human-human interactions. Its videos are collected from Youtube and trimmed to include only one action class. The UCF101  dataset is also collected from Youtube videos, consisting of 101 realistic human action classes, with one action label in each video. SomethingV2  is a fine-grained human action recognition dataset, containing 174 action classes, in which each video shows a human performing a predefined basic action, such as “picking something up” and “pulling something from left to right”. We use the second release of the dataset. YFCC100M  is the largest publicly available multimedia collection with about 99.2 million images and 800k videos from Flickr. Although none of these videos are annotated with a class label, half of them (400k) have at least one user tag. We use the tag-labeled videos of YFCC100M to improve the few-shot video classification.
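Table 1: Number of classes and number of videos in the train, validation and test splits of the Kinetics, UCF101 and SomethingV2 datasets.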
Prior setup. The existing practice [48, 2] is to randomly select 100 classes on the Kinetics and SomethingV2 datasets respectively. Those 100 classes are then randomly divided into 64, 12, and 24 non-overlapping classes to construct the meta-training, meta-validation and meta-testing sets. The meta-training and meta-validation sets are used for training models and tuning hyperparameters. In the testing phase of this meta-learning setting [48, 2], each episode simulates an $n$-way, $k$-shot classification problem by randomly sampling a support set consisting of $k$ samples from each of the $n$ classes, and a query set consisting of one sample from each of the $n$ classes. While the support set is used to adapt the model to recognize novel classes, the classification accuracy is computed at each episode on the query set, and the mean top-1 accuracy over 20,000 episodes constitutes the final accuracy.
Proposed setup. The prior experimental setup is limited to $n = 5$ classes in each episode, even though there are 24 novel classes in the test set. As the performance saturates quickly in this setting, we extend it to 10-way, 15-way and 24-way settings. Similarly, the previous meta-learning setup assumes that test videos all come from novel classes. On the other hand, it is important in many real-world scenarios that the classifier does not forget about previously learned classes while learning novel classes. Thus, we propose the more challenging generalized few-shot video classification (GFSV) setting where the model needs to predict both base and novel classes.
To evaluate an $n$-way $k$-shot problem in GFSV, in addition to a support and a query set of novel classes, at each test episode we randomly draw an additional query set of 5 samples from each of the 64 base classes. We do not sample a support set for base classes because base class classifiers have been learned during the representation learning phase. We report the mean top-1 accuracy of both base and novel classes over 500 episodes.
Kinetics, UCF101 and SomethingV2 datasets are used as our few-shot video classification datasets with disjoint sets of train, validation and test classes (see Table 1 for details). Here we refer to base classes as train classes. Test classes include the classes we sample novel classes from in each testing episode. For Kinetics and SomethingV2, we follow the splits proposed by  and  respectively for a fair comparison. It is worth noting that 3 out of 24 test classes in Kinetics appear in Sports1M, which is used for pretraining our 3D ConvNet. But the performance drop is negligible if we replace those 3 classes with other 3 random kinetics classes that are not present in Sports1M (more details can be found in the supplementary material). Following the same convention, we randomly select 64, 12 and 24 non-overlapping classes as train, validation and test classes from UCF101 dataset, which is widely used for video action recognition. We ensure that in our splits the novel classes do not overlap with the classes of Sports1M. For the GFSV setting, in each dataset the test set includes samples from base classes coming from the validation split of the original dataset.
Implementation details. Unless otherwise stated, our backbone is a 34-layer R(2+1)D pretrained on Sports1M, which takes as input fixed-length video clips of RGB frames at a fixed spatial resolution. We extract clip features from the top pooling units of the R(2+1)D.
In the representation learning stage, we fine-tune the R(2+1)D with a constant learning rate on all datasets and stop training when the validation accuracy of base classes saturates. We perform standard spatial data augmentation including random cropping and horizontal flipping. We also apply temporal data augmentation by randomly drawing 8 clips from a video in one epoch. In the few-shot learning stage, the same data augmentation is applied and the novel class classifier is learned with a constant learning rate for a fixed number of epochs on all the datasets. At test time, we randomly draw clips from each video and average their predictions for a video-level prediction.
As for the retrieval approach, we use the 300-dimensional fasttext embedding trained with GoogleNews. We first retrieve candidate videos for each class with video tag retrieval and then select the best clips among those videos using visual similarity.
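Table 2: Top-1 accuracy (%) for 5-way few-shot video classification (FSV) on novel classes.
| Method | Kinetics 1-shot | Kinetics 5-shot | SomethingV2 1-shot | SomethingV2 5-shot |
|---|---|---|---|---|
| 3DFSV (ours, scratch) | 48.9 | 67.8 | 57.9 | 75.0 |
| 3DFSV (ours, pretrained) | 92.5 | 97.8 | 59.1 | 80.1 |
| R-3DFSV (ours, pretrained) | 95.3 | 97.8 | - | - |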
In this section, we compare our model with the state-of-the-art in existing evaluation settings which mainly consider 1-shot, 5-way and 5-shot, 5-way problems and evaluate only on novel classes, i.e., FSV. The baselines CMN  and TAM  are considered as the state-of-the-art in few-shot video classification. CMN  proposes a multi-saliency embedding function to extract video descriptor, and few-shot classification is then done by the compound memory network . TAM  proposes to leverage the long-range temporal ordering information in video data through temporal alignment. They additionally build a stronger CMN, namely CMN++, by using the few-shot learning practices from . We use their reported numbers for fair comparison. The results are shown in Table 2. As the code from CMN  and TAM  is not available at the time of submission we do not include UCF101 results.
On Kinetics, we observe that our 3DFSV (pretrain) approach, i.e. without retrieval, outperforms the previous best results of TAM in both the 1-shot case (92.5 for ours) and the 5-shot case (97.8 for ours). On the SomethingV2 dataset, we would like to first highlight that our 3DFSV (scratch) significantly improves over TAM in 1-shot (57.9 for ours) and, surprisingly, also in 5-shot (75.0 for ours). This is encouraging because the 2D CNN backbone of TAM is pretrained on ImageNet, while our R(2+1)D backbone is trained from random initialization.
Our 3DFSV (pretrain) yields further improvement after using the Sports1M-pretrained R(2+1)D. We observe that the effect of the Sports1M-pretrained model on SomethingV2 is not as significant as on Kinetics because there is a large domain gap between the Sports1M and SomethingV2 datasets. Those results show that a simple linear classifier on top of a pretrained 3D CNN, e.g. R(2+1)D, performs better than sophisticated methods with a pretrained 2D ConvNet as a backbone.
Although as shown in C3D , I3D , R(2+1)D , spatiotemporal CNNs have an edge over 2D spatial ConvNet  in the fully supervised video classification with enough annotated training data, we are the first to apply R(2+1)D in the few-shot video classification with limited labeled data. It is worth noting that our R(2+1)D is pretrained on the Sports1M while the 2D ResNet backbone of CMN  and TAM  is pretrained on ImageNet. A direct comparison between 3D CNNs and 2D CNNs is hard because they are designed for different input data. While it is standard to use an ImageNet-pretrained 2D CNN in image domains, it is common to apply a Sports1M-pretrained 3D CNN in video domains. One of our goals is to establish a strong few-shot video classification baseline with 3D CNNs. Intuitively, the temporal cue of the video is better preserved when clips are processed directly by a spatiotemporal CNN as opposed to processing them as images via a 2D ConvNet. Indeed, even though we train our 3DFSV from the random initialization on SomethingV2 dataset which requires strong temporal information, our results still remain promising. This confirms the importance of 3D CNNs for few-shot video classification.
Our R-3DFSV (pretrain) approach, i.e. with retrieved weakly-labeled video clips, leads to further improvements in the 1-shot case (3DFSV (pretrain) 92.5 vs R-3DFSV (pretrain) 95.3) on the Kinetics dataset. This implies that weakly-labeled videos retrieved from the YFCC100M dataset include discriminative cues for Kinetics tasks. In 5-shot, our R-3DFSV (pretrain) approach achieves similar performance to our 3DFSV (pretrain) approach; however, with an accuracy of 97.8 this task is almost saturated. We do not retrieve any weakly-labeled videos for the SomethingV2 dataset because it is a fine-grained dataset of basic actions and it is unlikely that YFCC100M includes any relevant video for that dataset. In summary, although the 5-way classification setting is still challenging for methods with a 2D ConvNet backbone, the results saturate with the stronger spatiotemporal CNN backbone.
Although prior works evaluated few-shot video classification on 5-way, i.e. the number of novel classes at test time is 5, our 5-way results are already saturated. Hence, in this section, we go beyond 5-way classification and extensively evaluate our approach in the more challenging 10-way, 15-way and 24-way few-shot video classification (FSV) settings. Note that we use one training sample per class, i.e. one-shot video classification.
As shown in Figure 3, our R-3DFSV method exceeds 95% accuracy on both the Kinetics and UCF101 datasets for 5-way classification. As expected, the performance degrades as the number of novel classes increases, e.g. to 10, 15 and 24. Note that our R-3DFSV approach with retrieval consistently outperforms our 3DFSV approach without retrieval, and the more challenging the task becomes, e.g. from 5-way to 24-way, the larger the improvement the retrieval approach achieves on Kinetics, i.e. the gap between R-3DFSV and 3DFSV is larger in 24-way than in 5-way.
The trend of decreasing accuracy when going from 5-way to 24-way indicates that the more realistic few-shot video classification task has not yet been solved, even with a spatiotemporal CNN. We hope that these results will encourage more progress in this challenging many-way few-shot video classification setting.
The FSV setting has a strong assumption that test videos all come from novel classes. In contrast to the FSV, GFSV is more realistic and requires models to predict both base and novel classes in each testing episode. In other words, 64 base classes become distracting classes when predicting novel classes which makes the task more challenging. Intuitively, distinguishing novel and base classes is a challenging task because there are severe imbalance issues between the base classes with a large number of training examples and the novel classes with only few-shot examples. In this section, we evaluate our methods in the more realistic and challenging generalized few-shot video classification (GFSV) setting.
In Table 3, on the Kinetics dataset, we observe a large performance gap between base and novel classes in both 1-shot and 5-shot cases, i.e., 3DFSV achieves a much lower accuracy on novel classes than on base classes. The reason is that predictions of novel classes are dominated by the base classes. Interestingly, our R-3DFSV improves over 3DFSV on novel classes in both 1-shot and 5-shot cases. A similar trend can be observed on the UCF101 dataset. Those results demonstrate that our retrieval-based approach can alleviate the imbalance issues to some extent. At the same time, we find that the generalized few-shot video classification (GFSV) setting, i.e. not restricting the test-time search space only to novel classes but considering all of the classes even though base classes are distracting, is still a challenging task, and we hope that this setting will attract the interest of a wider community for future research.
In this section, we perform an ablation study to understand the importance of each component of our approach. After the ablation study, we evaluate the importance of the number of retrieved clips to the few-shot video classification (FSV) performance.
Ablation study. We ablate our model in the 1-shot, 5-way video classification task on Kinetics dataset with respect to six critical parts including pretraining R(2+1)D on Sports1M (PR), self-supervised model of  as the backbone (SS), representation learning on base classes (RL), video retrieval with tags (VR), batch denoising (BD) and best clip selection (BC). Table 4 shows the results.
We start from a model with only a few-shot learning stage on novel classes. If the PR component is added to the model (first result row in Table 4), the newly obtained model achieves an accuracy that is only slightly better than random guessing (20% in the 5-way setting). This demonstrates that a pretrained 3D CNN alone is not sufficient for good performance. Besides, it also indicates that there exists a domain shift between the pretraining dataset, i.e. Sports1M, and our target Kinetics dataset.
Adding the RL component to the model (the second result row) means training the representation on base classes from scratch, which results in a worse accuracy compared to our full model. The primary reason for the worse results is that optimizing the massive number of parameters of R(2+1)D is difficult on a train set consisting of only 6400 videos. Interestingly, if we adopt the self-supervised pretrained 3D CNN (MC3 pretrained on Kinetics without using any label), i.e., SS, we immediately get performance gains (the third result row) over training from random initialization. Adding both PR and RL components (the fourth row) significantly improves over adding either the PR or the RL component alone.
Next, we study two critical components proposed in our retrieval approach. Compared to our approach without retrieval (the fourth row), directly appending retrieved videos from YFCC100M (VR) to the few-shot training set of novel classes (the fifth result row) leads to a performance drop, while performing batch denoising (the sixth row) in addition to VR yields a gain. This implies that noisy labels from retrieved videos may hurt the performance but our batch denoising technique handles the noise well. Finally, adding the best clip selection (BC, the last row) after VR and BD gives a big boost in accuracy. In summary, those ablation studies demonstrate the effectiveness of the six different critical parts in our approach.
Influence of the number of retrieved clips. Intuitively, when the number of retrieved clips increases, the retrieved videos become more diverse, but at the same time, the risk of obtaining negative videos becomes higher. We show the effectiveness of our R-3DFSV with the increasing number of retrieved clips in Figure 4.
On the Kinetics dataset (left of Figure 4), as we increase the number of retrieved video clips for each novel class, the performance keeps improving over the no-retrieval baseline and then saturates. On the UCF101 dataset (right of Figure 4), retrieving 1 clip already gives a gain. Retrieving more clips does not further improve the results, indicating that more negative videos are retrieved. On the other hand, our batch denoising strategy is able to tolerate the noise to some extent. We observe a slight performance drop when retrieving 10 clips because the noise level becomes too high, i.e. there are 10 times more noisy labels than clean labels.
In Figure 5, we visualize the top-5 video clips we retrieve from the YFCC100M dataset with video tag retrieval followed by the best clip selection. Here we only show 8 novel classes of the Kinetics dataset due to space limitations; visualizations of the other classes are in the supplementary material.
We observe that the retrieved video clips of some classes are of high quality, meaning that those videos truly reveal the target novel classes. For instance, retrieved clips of class “Busking” are all correct because user tags of those videos consist of words like “buskers”, “busking” that are close to the class name, and the best clip selection can effectively filter out the irrelevant clips. It is intuitive those clips can potentially help to learn better novel class classifiers by supplementing the limited training videos.
Failure cases are also common. For example, no positive videos are retrieved for the class “Cutting watermelon”. The reasons can be that there are no user tags about cutting watermelon or that our tag embeddings are not good enough. Those negative videos might hurt the performance if we treat them equally, which is why the batch denoising is critical to reduce the effect of negative videos.
In this work, we point out that a spatiotemporal CNN trained on a large-scale video dataset saturates existing few-shot video classification benchmarks. Hence, we propose new more challenging experimental settings, namely generalized few-shot video classification (GFSV) and few-shot video classification with more ways than the classical 5-way setting. We further improve spatiotemporal CNNs by leveraging the weakly-labeled videos from YFCC100M using weak-labels such as tags for text-supported and video-based retrieval. Our results show that generalized more-way few-shot video classification is challenging and we encourage future research in this setting.
Proceedings of the IEEE International Conference on Computer Vision, pp. 6202–6211.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941.
Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135.
Large-scale video classification with convolutional neural networks. In CVPR.
Generalized zero- and few-shot learning via aligned variational autoencoders. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546.