In recent years, many researchers have introduced few-shot learning methods, mainly for the task of image classification. Most methods are based on the idea of meta-learning (i.e. learning to learn). One popular approach to few-shot meta-learning can be described as metric-based [20, 36, 30, 31]. In these methods, data points are embedded into a feature space (e.g. via a neural network) and then predictions are made using the distances between the embeddings of query images and those of a small support set of labeled training points. Classification loss can be back-propagated to the embedding network over many small tasks (“episodes”) in order to create an embedding function that is effective across a range of distinct classes.
While relatively little attention has been given to the problem of few-shot video classification, deep learning approaches to video classification have been studied extensively over the past decade [17, 29, 8, 34, 38, 11, 37, 12, 4]. Most often, video classification is accomplished by first extracting frame-level features of a video using convolutional neural networks (CNN), and then aggregating the features over time to yield a video-level representation. Many variants of both the frame-level feature extractor and the aggregation technique have been developed. The video encoders we present in this paper draw from the findings in this body of traditional video classification work. Rather than optimizing for direct classification, we instead aim to produce discriminative embeddings for few-shot classification.
Recently, researchers have begun to address the task of few-shot learning for video action recognition. Thaker and Krishnakumar made use of dynamic images and memory-augmented neural networks (MANN) to perform few-shot action recognition on the Kinetics 400 dataset. Zhu and Yang propose another MANN-based architecture they call the “compound memory network” (CMN). Their method obtains 78.9% accuracy on a 5-way 5-shot task on the Kinetics 400 dataset. Bishay et al. propose a video-specific version of the Relation Network and leverage a large 3D CNN backbone network to generate embeddings. They manage to outperform the CMN, but only when using a larger backbone. Hu et al. propose a method of explicitly modeling the composition of semantic features for few-shot recognition. They achieve 83.1% accuracy on a 5-way 5-shot task on Kinetics 400. Cao et al. propose a temporal alignment method in order to better model the temporal dynamics of video. They achieve a state-of-the-art 85.8% accuracy on Kinetics 400 using a ResNet50 backbone. Kim et al. propose a variation of an existing few-shot method called Matching Networks, with the addition of a memory-based embedding. Zou et al. propose a complex memory-based model. Both methods evaluate on cleaner, smaller datasets [23, 26, 22].
In this work, we introduce a set of approaches to few-shot video action recognition. We propose and evaluate several video encoder architectures across three few-shot methods. We also investigate the importance of including an optical flow feature stream. We train and evaluate our models using a “few-shot” split of the Kinetics 600 [4, 3] dataset. In addition to training and validation sets, the splits include one general test set (with randomly drawn classes) and one “challenge” test set, whose classes are highly similar (all aggressive or violent actions, e.g. kicking, boxing). Prior work in few-shot video action recognition has leveraged complex methods and large backbone CNNs. We find video encoder architectures that allow “simple” few-shot methods (e.g. Prototypical Networks) to yield performance comparable to state-of-the-art models, while utilizing far fewer parameters.
II Few-shot Background
In the few-shot setting, rather than aiming to generalize to previously unseen examples, one aims to generalize to previously unseen classes, given only a small set (e.g. 1 or 5) of labeled examples per class. Often few-shot learning is posed as an instance of meta-learning (i.e. learning to learn). Formally, let $C$ denote a set of classes. Assume inputs come from some set $\mathcal{X}$. Let $S$ be a “support” set, containing $k$ input-label pairs for each of the classes in $C$. We seek a meta-model $M_\theta$, parameterized by $\theta$, that takes any support set $S$ and produces a new classifier $f_S = M_\theta(S)$, with $f_S : \mathcal{X} \to C$. That is, with only the small amount of data contained in support set $S$, the meta-model allows us to construct a classifier capable of classifying a novel $x \in \mathcal{X}$ as one of the classes in $C$.
The challenge of meta-learning lies in training the meta-model; namely, in estimating its parameters, $\theta$. In order to match the evaluation environment, meta-models are often trained using episodic training. In each episode during episodic training, a set of classes $C_e$ is sampled from a larger set of classes called meta-train. Given $C_e$, a support set $S_e$ with $k$ input-output pairs per class in $C_e$ is sampled. A disjoint query set $Q_e$ is also sampled, with distinct input-output pairs drawn from the same set of classes $C_e$. Together, $S_e$ and $Q_e$ are called an episode. Meta-model $M_\theta$ takes $S_e$ and produces $f_{S_e}$, which is used to make predictions on the query set $Q_e$. The loss on the query set is back-propagated to adjust the meta-model parameters, $\theta$. Formally, training the meta-model involves solving the following problem, where we drop the subscripts to simplify notation:
$$\theta^* = \arg\min_\theta \; \mathbb{E}_{(S, Q)} \left[ \sum_{(x, y) \in Q} \ell\big(f_S(x), y\big) \right], \qquad f_S = M_\theta(S)$$
Here $\ell$ denotes cross-entropy loss. Once trained, the meta-model can be evaluated by averaging classification accuracy over many episodes drawn from a set of unseen classes called meta-test.
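To make the episodic sampling procedure concrete, the following is a minimal sketch in plain Python. The function name `sample_episode`, the `items_by_class` dictionary, and the toy class pool are hypothetical names introduced only for illustration; they are not taken from the implementation described in this paper.

```python
import random


def sample_episode(items_by_class, n_way=5, k_shot=5, n_query=5, rng=None):
    """Sample one few-shot episode: a support set and a disjoint query set.

    items_by_class is a hypothetical dict mapping class name -> list of item ids.
    """
    rng = rng or random.Random()
    # Sample the episode's classes from the larger (meta-train) class pool.
    classes = rng.sample(sorted(items_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw k_shot + n_query distinct items so support and query never overlap.
        picked = rng.sample(items_by_class[label], k_shot + n_query)
        support += [(item, label) for item in picked[:k_shot]]
        query += [(item, label) for item in picked[k_shot:]]
    return support, query


# Toy meta-train pool: 8 classes with 20 placeholder video ids each.
pool = {f"class_{c}": [f"video_{c}_{i}" for i in range(20)] for c in range(8)}
support, query = sample_episode(pool, rng=random.Random(0))
print(len(support), len(query))  # 25 25
```

Drawing the support and query items for a class in a single call is one simple way to guarantee the two sets are disjoint while sharing the same classes.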
In this paper, we focus on metric-based few-shot methods. This means the parameters of the meta-learner belong to the encoder network that embeds videos into a metric space. The model produced by the meta-learner may itself be non-parametric; for example, it may make its decision using only the distances between the embedding of a novel query video and the embeddings of the support set videos. If successful, the meta-learner defines a general embedding space that is agnostic to the particular classes provided; it embeds examples to capture notions of similarity spanning all classes within the video action recognition domain.
Our few-shot architecture is illustrated in Fig. 1. First, CNNs are applied frame-wise to the RGB and optical flow streams. The resulting (flat or spatial) feature streams are aggregated into a fixed-length representation per video, which is then fed into a few-shot algorithm. Details are provided below.
III-A Flow Feature Extraction
While all of the information in the video is contained in the sequence of RGB frames, $x^{\mathrm{rgb}}_1, \dots, x^{\mathrm{rgb}}_T$ (where $T$ is the number of frames), researchers have shown that deep learning models perform significantly better at action recognition when explicit optical flow features are extracted and fed into the model [11, 29]. In the few-shot scenario, where data is by definition limited, we seek the richest video representation possible. We therefore extract an optical flow feature stream, denoted $x^{\mathrm{flow}}_t$, and feed this into our models, as described below.
III-B Frame-level Encoders
We use two frame-level encoders: a ResNet-18 for the RGB stream and an AlexNet for the optical flow stream. Both are distributed with the PyTorch library and are pre-trained on ImageNet. We find the pre-trained weights provide a good initialization for both RGB and optical flow. Both architectures were selected to maintain a light memory footprint, with the space utilization balance tilted in favor of the more informative feature stream, RGB. Both models are fine-tuned during meta-training.
The ResNet-18 model is used to encode each RGB frame, $x^{\mathrm{rgb}}_t$. To represent each frame we extract 512-dimensional features immediately before the ResNet-18’s final linear layer. For our third few-shot method (learned distance metric), the encoder’s output is fed into a CNN, and thus must have spatial dimensions. In this case, feature maps are extracted from the ResNet before the last residual block. These feature maps are further processed using a convolution with a stride of 2, followed by average pooling, which generates the final spatial feature maps. The RGB features extracted from the ResNet for the $t$th frame are denoted $z^{\mathrm{rgb}}_t$.
The AlexNet model is used to encode each optical flow frame, $x^{\mathrm{flow}}_t$. As with the ResNet, we remove the final linear layers, so the AlexNet produces spatial feature maps. When flat embeddings are needed, the feature maps are flattened into a vector and passed through a linear layer to produce a flat embedding. The features generated by the AlexNet for the $t$th frame are denoted $z^{\mathrm{flow}}_t$.
III-C Aggregation Techniques
The embeddings generated by the ResNet-18 and AlexNet need to be aggregated over time to produce a fixed-length embedding, $v$, for each video. We consider four aggregation strategies.
We simply average the vectors across time. In preliminary experiments we found average pooling to outperform max pooling, so we use only the former in this work.
$$v = \mathrm{Pool}_t\big(z^{\mathrm{rgb}}_t\big) : \mathrm{Pool}_t\big(z^{\mathrm{flow}}_t\big)$$
Here $v$ is the video-level embedding, $z^{\mathrm{rgb}}_t$ and $z^{\mathrm{flow}}_t$ are the frame-level RGB and flow features, the Pool operation refers to average pooling over time, and “:” denotes concatenation.
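As an illustration of pooled aggregation, the sketch below averages each feature stream over time and concatenates the two results. It uses plain Python lists in place of tensors, and the function names `average_pool` and `pool_and_concat` are illustrative assumptions rather than names from our implementation.

```python
def average_pool(frame_features):
    """Average a list of equal-length per-frame feature vectors over time."""
    n_frames = len(frame_features)
    return [sum(column) / n_frames for column in zip(*frame_features)]


def pool_and_concat(rgb_features, flow_features):
    """Video embedding: Pool(RGB features) : Pool(flow features)."""
    return average_pool(rgb_features) + average_pool(flow_features)


# Toy example: 3 frames, 2-d RGB features and 1-d flow features per frame.
rgb = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flow = [[0.0], [1.0], [2.0]]
print(pool_and_concat(rgb, flow))  # [3.0, 4.0, 1.0]
```

Because each stream is pooled separately, the two streams may contain different numbers of frames (for example, when they are sampled at different framerates) and the output length stays fixed.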
To capture the temporal dynamics of each video, it is intuitive to utilize a recurrent model. Here we use the Long Short-Term Memory (LSTM) network to yield a temporally-aware video-level representation.
The LSTM operation consists of a single-layer LSTM network. We find that averaging the LSTM hidden representation over all timesteps yields faster training and better results.
When using the learned distance metric method, a vanilla LSTM cannot be used as it expects flat features. We solve this by using a Convolutional LSTM (ConvLSTM), in which the linear sublayers of the LSTM are replaced by convolutional layers.
III-C4 3D Convolutions
Rather than combining flat embeddings, the spatio-temporal aspect of each video can be captured via 3D convolutions. We propose to combine spatial feature maps of each frame over time as follows:
Here the Conv3d operation consists of two 3D convolutional layers with a ReLU activation in between. If flat features are required, adaptive average pooling is applied to create a flat vector. If spatial features are required, adaptive average pooling is instead applied to create spatial feature maps.
III-D Few-shot Methods
Once video-level embeddings are generated, various few-shot methods can be used to classify each video using only the support set. We consider three popular metric-based few-shot methods.
III-D1 Matching Networks
Matching Networks are a few-shot method consisting of an image embedding function parameterized by a CNN, along with a metric-based classification in the embedding space. This method works by embedding both the query set and support set. To classify an example from the query set, the distance to each instance in the support set is calculated. For a given query input, $x$, the classification score for each class $c$ is calculated by averaging the negative squared Euclidean distance from the query embedding to the support set embeddings of that class:
$$s_c(x) = -\frac{1}{|S_c|} \sum_{x_i \in S_c} \left\| g_\theta(x) - g_\theta(x_i) \right\|_2^2$$
where $g_\theta$ denotes the video encoder and $S_c$ the support examples of class $c$.
The scores are then softmaxed to yield a probability distribution over possible classes. For simplicity, we did not incorporate fully-contextual embeddings in our implementation.
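The scoring rule described above can be sketched in a few lines of plain Python. This is an illustrative sketch under the assumption that embeddings are already computed and stored as lists of floats; the names `matching_scores` and `softmax` are hypothetical helpers, not part of our implementation.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def matching_scores(query, support):
    """Average negative squared distance from the query to each class.

    support is a list of (embedding, label) pairs.
    """
    per_class = {}
    for embedding, label in support:
        per_class.setdefault(label, []).append(-squared_distance(query, embedding))
    return {label: sum(vals) / len(vals) for label, vals in per_class.items()}


def softmax(scores):
    """Turn a {label: score} dict into a probability distribution."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


support = [([0.0, 0.0], "a"), ([0.0, 2.0], "a"),
           ([4.0, 4.0], "b"), ([4.0, 6.0], "b")]
scores = matching_scores([0.0, 1.0], support)
probs = softmax(scores)
print(scores)                      # {'a': -1.0, 'b': -33.0}
print(max(probs, key=probs.get))   # a
```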
III-D2 Prototypical Networks
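Prototypical Networks are a few-shot method similar to Matching Networks (equivalent in the one-shot case). Instead of computing the distance from an embedded query to each example in the support set, a single embedding (“prototype”) is created to represent each class. The prototype, $p_c$, is calculated by averaging the support set embeddings within each class $c$:
$$p_c = \frac{1}{|S_c|} \sum_{x_i \in S_c} g_\theta(x_i)$$
The negative distances to each prototype are then used as classification logits:
$$s_c(x) = -\left\| g_\theta(x) - p_c \right\|_2^2$$
where $g_\theta$ denotes the video encoder and $S_c$ the support examples of class $c$. We write the distance as the squared Euclidean distance, which is the choice that makes the method coincide with Matching Networks in the one-shot case.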
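As an illustration of the prototype computation, the sketch below builds one prototype per class and scores a query against each. It assumes precomputed embeddings stored as lists of floats and a squared Euclidean distance; the function names are hypothetical and chosen only for this example.

```python
def prototypes(support):
    """Average the support embeddings of each class into one prototype.

    support is a list of (embedding, label) pairs.
    """
    grouped = {}
    for embedding, label in support:
        grouped.setdefault(label, []).append(embedding)
    return {
        label: [sum(column) / len(embeddings) for column in zip(*embeddings)]
        for label, embeddings in grouped.items()
    }


def prototypical_logits(query, protos):
    """Negative squared Euclidean distance from the query to each prototype."""
    return {
        label: -sum((q - p) ** 2 for q, p in zip(query, proto))
        for label, proto in protos.items()
    }


support = [([0.0, 0.0], "a"), ([0.0, 2.0], "a"),
           ([4.0, 4.0], "b"), ([4.0, 6.0], "b")]
protos = prototypes(support)
print(protos)                                   # {'a': [0.0, 1.0], 'b': [4.0, 5.0]}
print(prototypical_logits([1.0, 1.0], protos))  # {'a': -1.0, 'b': -25.0}
```

With a single support example per class, each prototype is just that example's embedding, so the logits reduce to the per-example scores used by Matching Networks.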
III-D3 Learned Distance Metric
We also propose a variant of Prototypical Networks that is inspired by the Relation Network few-shot method. This method works by computing prototypes for each class, but instead of using a fixed distance metric to compute classification scores, a small CNN with its own learned parameters is used to calculate the “distance” between a query and each of the class prototypes.
This CNN consists of two convolutional layers, each followed by a ReLU. The feature maps are then average pooled, flattened, and mapped to a scalar score via a linear layer.
IV-A Experimental Setup
IV-A1 Kinetics 600
Kinetics 600 [4, 3] is a collection of YouTube videos categorized by action. The traditional splits are not suitable for the few-shot scenario, so we propose custom splits for training, validation and testing. We design our splits to mirror the structure of the popular image few-shot benchmark dataset, miniImageNet. The training and validation sets consist of 64 and 16 classes, respectively. We define two test sets to evaluate the effectiveness of models in different conditions. The first test set consists of 20 randomly sampled classes, as is typically used to quantify the generalization of the model to unseen classes. The second, a “challenge” test set, consists of closely related classes (all aggressive actions, e.g. kicking, punching), and is used to determine the model’s ability to discriminate between very similar classes. Using Kinetics 600 instead of Kinetics 400 (as used in prior few-shot video work) offers two key advantages: 1) like miniImageNet, each class has at least 600 instances (the existing Kinetics 400 few-shot splits are limited to 100 instances per class), and 2) the larger number of classes facilitates the creation of our “challenge” test set.
For each video, all RGB frames are extracted and downsampled to one frame per second. The frames are resized to match the input size expected by the pre-trained ResNet-18, and the RGB features are standardized using the same mean and standard deviation used by that model. Pairs of frames are also sampled from each video in order to compute optical flow features. The frame pairs are resized and optical flow is computed using the Farnebäck algorithm implemented in the OpenCV library. The flow is thresholded to a fixed range, rescaled, and then standardized. The pre-trained AlexNet expects a 3-channel input, but the optical flow has only two channels (vertical and horizontal components), so we append an all-zero third channel before feeding it into the AlexNet. To ensure each video has a sufficient number of frames, very short videos (fewer than five frames) are discarded from the episode.
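The flow post-processing steps can be sketched as follows. This is only an illustration: the clipping bound of 20 and the target range of [0, 1] are assumed values chosen for the example (they are not the values used in our experiments), the final standardization step is omitted, and the function names are hypothetical.

```python
def preprocess_flow(flow, bound=20.0):
    """Clip, rescale, and zero-pad a two-channel optical flow field.

    flow is an H x W grid (list of rows) of (dx, dy) displacement pairs.
    bound is an assumed clipping threshold, used here only for illustration.
    Returns an H x W grid of 3-channel pixels with values in [0, 1].
    """
    def scale(value):
        clipped = max(-bound, min(bound, value))   # threshold to [-bound, bound]
        return (clipped + bound) / (2 * bound)     # rescale linearly to [0, 1]

    # Append an all-zero third channel so the result matches a 3-channel input.
    return [[(scale(dx), scale(dy), 0.0) for dx, dy in row] for row in flow]


def has_enough_frames(frames, min_frames=5):
    """Videos with fewer than min_frames frames are discarded."""
    return len(frames) >= min_frames


toy_flow = [[(0.0, 20.0), (-40.0, 10.0)]]
print(preprocess_flow(toy_flow))  # [[(0.5, 1.0, 0.0), (0.0, 0.75, 0.0)]]
```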
IV-A3 Few-shot Setup
All experiments are 5-way 5-shot; that is, meta-training episodes consist of five examples from each of five classes for the support set and five examples from each of the same five classes for the query set. We train for 25,000 episodes, checking an early stopping criterion on the validation set every 500 episodes. Very little hyperparameter tuning was performed. We use a fixed learning rate and optimize using the Adam optimizer. After training, test results are obtained by randomly selecting and evaluating on 1000 episodes from the test set, each with 10 queries per class.
IV-B Results and Analysis
We first evaluate combinations of four aggregation strategies and three few-shot methods; these results are reported in Table I. The most notable trend is the gap between the general and challenge test sets, with the general test set obtaining roughly 24 to 27 percentage points higher accuracy. This suggests that in many realistic use cases, where the distinction between classes is fine-grained, there is still room for significant improvement in video few-shot learning. Another clear trend is the consistently superior performance of Prototypical Networks; regardless of the aggregation method, Prototypical Networks give the highest accuracy. Comparing aggregation strategies, we observe that the LSTM gives the highest performance, although simple averaging is a very competitive alternative.
|Method||Pooling||LSTM||ConvLSTM||3D Conv|
|General Test Set|
|Prototypical||83.5 ± 0.46||84.2 ± 0.44||77.9 ± 0.53||78.8 ± 0.51|
|Matching||79.1 ± 0.55||81.1 ± 0.50||75.7 ± 0.56||75.7 ± 0.56|
|Learned||77.9 ± 0.51||-||78.1 ± 0.51||74.1 ± 0.55|
|Challenge Test Set|
|Prototypical||58.5 ± 0.58||59.4 ± 0.59||53.6 ± 0.60||54.4 ± 0.60|
|Matching||52.3 ± 0.57||54.6 ± 0.60||51.0 ± 0.60||49.2 ± 0.58|
|Learned||51.5 ± 0.61||-||52.3 ± 0.61||50.3 ± 0.61|
Table I: Test set accuracy (%) and 95% confidence interval, contrasting aggregation approaches (columns) and few-shot methods (rows), averaged over 1000 test episodes.
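For readers who want to reproduce this style of reporting, the sketch below computes a mean accuracy and a 95% confidence half-width from per-episode accuracies. It assumes a normal approximation (1.96 standard errors), which is a common convention and an assumption on our part here; the function name `mean_and_ci95` is hypothetical.

```python
import math
import statistics


def mean_and_ci95(episode_accuracies):
    """Mean accuracy and 95% confidence half-width over test episodes.

    Uses the normal approximation: half-width = 1.96 * standard error.
    """
    n = len(episode_accuracies)
    mean = statistics.fmean(episode_accuracies)
    std_err = statistics.stdev(episode_accuracies) / math.sqrt(n)
    return mean, 1.96 * std_err


# Toy example with four per-episode accuracies (fractions of correct queries).
mean, half_width = mean_and_ci95([0.8, 0.9, 0.7, 0.8])
print(f"{100 * mean:.1f} ± {100 * half_width:.2f}")  # 80.0 ± 8.00
```

With 1000 test episodes instead of four, the standard error shrinks by the square root of the episode count, which is why the intervals in the table are well under one percentage point.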
Table II lists the results of an ablation study, designed to assess the importance of input features when using our best performing model (LSTM aggregation paired with Prototypical Networks). First, we compare the contributions of the RGB and flow streams. Using only the RGB stream without flow, there is a small decrease in performance (about 1.5 points on both test sets). In contrast, using only flow without RGB yields a significant drop in performance (19.6 points on the general test set and 13.5 points on the challenge test set). Next, we compare the effect of framerate. We report a “single frame” result, in which RGB and flow for a single randomly selected frame are used; the performance is significantly worse than when using 1 fps RGB and flow (8.6 and 8.4 points lower, respectively). This suggests that a naive approach of reducing the video few-shot task to an image few-shot task is ineffective. We then consider the effect of increasing the RGB and/or flow framerates to 2 fps, and find that it yields little-to-no improvement in performance. Further increasing the framerate provided no improvements (not shown in table for brevity).
|Input||General Test||Challenge Test|
|RGB & Flow (1 fps)||84.2 ± 0.44||59.4 ± 0.59|
|RGB only (1 fps)||82.7 ± 0.47||58.0 ± 0.59|
|Flow only (1 fps)||64.6 ± 0.56||45.9 ± 0.53|
|RGB & Flow (Single frame)||75.6 ± 0.52||51.0 ± 0.56|
|RGB (1 fps) & Flow (2 fps)||84.4 ± 0.44||59.6 ± 0.59|
|RGB (2 fps) & Flow (1 fps)||84.1 ± 0.45||58.7 ± 0.59|
|RGB (2 fps) & Flow (2 fps)||83.7 ± 0.44||59.1 ± 0.59|
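The size of each ablation effect can be read off the table with simple arithmetic. The short sketch below computes the drop in accuracy, in percentage points, of three ablated inputs relative to the full 1 fps RGB and flow model; the accuracies are copied from the table above, while the variable and function names are illustrative.

```python
# Accuracies (%) copied from the ablation table: (general test, challenge test).
baseline = (84.2, 59.4)  # RGB & Flow (1 fps)
ablations = {
    "RGB only (1 fps)": (82.7, 58.0),
    "Flow only (1 fps)": (64.6, 45.9),
    "RGB & Flow (single frame)": (75.6, 51.0),
}


def drops(reference, row):
    """Accuracy drop in percentage points, rounded to one decimal place."""
    return tuple(round(ref - value, 1) for ref, value in zip(reference, row))


for name, row in ablations.items():
    print(name, drops(baseline, row))
# RGB only (1 fps) (1.5, 1.4)
# Flow only (1 fps) (19.6, 13.5)
# RGB & Flow (single frame) (8.6, 8.4)
```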
The field of few-shot video action recognition is still quite young. In this work, we propose a set of approaches, using different video encoder architectures and metric-based few-shot methods. Among the proposed approaches, a two-stream pooled LSTM-CNN video encoder used with a Prototypical Network gives the best performance: 84.2% accuracy on our general test set for the 5-way 5-shot setup on a few-shot split of the Kinetics 600 dataset. Given the inherent computational challenges of processing video, we find it encouraging that high accuracy can be obtained from a computationally efficient few-shot algorithm and a low framerate.
This work was funded by the U.S. Government.
- [1] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
- [2] Kaidi Cao, Jingwei Ji, Zhangjie Cao, Chien-Yi Chang, and Juan Carlos Niebles. Few-shot video classification via temporal alignment. CoRR, abs/1906.11415, 2019.
- [3] João Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. CoRR, abs/1808.01340, 2018.
- [4] João Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. CVPR, pages 4724–4733, 2017.
- [5] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pages 4960–4964, 2016.
- [6] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, Jan 2012.
- [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In Proc. CVPR, 2009.
- [8] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proc. CVPR, 2015.
- [9] Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Sandy Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, David Jones, David Silver, Koray Kavukcuoglu, Demis Hassabis, and Andrew Senior. De novo structure prediction with deep-learning based scoring. In Proc. CASP, 12 2018.
- [10] Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Proceedings of the 13th Scandinavian Conference on Image Analysis, Proc. SCIA, pages 363–370, Berlin, Heidelberg, 2003. Springer-Verlag.
- [11] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In Proc. CVPR, 2016.
- [12] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. In Proc. CVPR, 2017.
- [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, pages 770–778, 2016.
- [14] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, Nov 2012.
- [15] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997.
- [16] Ping Hu, Ximeng Sun, Kate Saenko, and Stan Sclaroff. Weakly-supervised compositional feature aggregation for few-shot recognition. CoRR, abs/1906.04833, 2019.
- [17] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proc. CVPR, pages 1725–1732, June 2014.
- [18] Daesik Kim, Myunggi Lee, and Nojun Kwak. Matching video net: Memory-based embedding for video action recognition. In Proc. IJCNN, pages 432–438, 05 2017.
- [19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. ICLR, 2015.
- [20] Gregory R. Koch. Siamese neural networks for one-shot image recognition. In Proc. ICML, 2015.
- [21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proc. NIPS, pages 1097–1105. 2012.
- [22] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In Proc. ICCV, pages 2556–2563, November 2011.
- [23] J. Liu, Jiebo Luo, and M. Shah. Recognizing realistic actions from videos “in the wild”. In Proc. CVPR, pages 1996–2003, June 2009.
- [24] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Proc. Autodiff Workshop at NIPS, 2017.
- [25] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In Proc. ICLR, 2017.
- [26] Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah. Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition. In Proc. CVPR, pages 1–8. IEEE, June 2008.
- [27] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-Learning with Memory-Augmented Neural Networks. In Proc. ICML, pages 1842–1850, New York, New York, USA, June 2016. PMLR.
- [28] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proc. NIPS, pages 802–810, 2015.
- [29] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Proc. NIPS, pages 568–576. 2014.
- [30] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Proc. NIPS, pages 4077–4087. 2017.
- [31] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In Proc. CVPR, pages 1199–1208, 2018.
- [32] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proc. NIPS, Montreal, CA, 2014.
- [33] Darshan Thaker and Kapil Krishnakumar. k-shot learning for action recognition. unpublished.
- [34] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proc. ICCV, pages 4489–4497, Dec 2015.
- [35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. NIPS, pages 5998–6008. 2017.
- [36] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Proc. NIPS, pages 3630–3638. 2016.
- [37] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In Proc. ECCV, 10 2016.
- [38] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Joseph Pal, Hugo Larochelle, and Aaron C. Courville. Describing videos by exploiting temporal structure. In Proc. ICCV, pages 4507–4515, 2015.
- [39] Linchao Zhu and Yi Yang. Compound memory networks for few-shot video classification. In Proc. ECCV, September 2018.
- [40] Y. Zou, Y. Shi, Y. Wang, Y. Shu, Q. Yuan, and Y. Tian. Hierarchical temporal memory enhanced one-shot distance learning for action recognition. In Proc. ICME, pages 1–6, July 2018.