The universal adoption of smartphones has led to large quantities of videos being produced and shared on social media platforms, which are in urgent need of automated analysis. As a result, video action recognition [Tran2015C3D, carreira2017quo, S3D_G_ECCV18, lin2019tsm] has been studied intensively with a recent focus on fine-grained action recognition [ssv1]. Most recent action recognition models employ deep convolutional neural networks (CNNs) which are known to be data hungry. In particular, a large number (at least hundreds) of annotated training samples need to be collected for each action class. Collecting and annotating such a large amount of data, however, is expensive, tedious and sometimes even infeasible for some rare fine-grained action classes. Therefore, few-shot action recognition, which aims to construct a video action classifier with few training samples (e.g., 1-5) per class, has started to receive increasing interest [zhu2018compound, zhang2020few, cao2020few].
Few-shot action recognition is a special case of few-shot learning (FSL). Most FSL methods follow a meta-learning (or learning-to-learn) paradigm [snell2017prototypical, vinyals2016matching]. It is characterized by episodic training that aims to learn a model or optimizer from a set of base/seen tasks, in order to generalize well to new tasks with few labeled training samples/shots. Specifically, a meta-training set containing abundant training samples per seen/base class is used to sample a large number of training episodes. In each episode, the training data is split into a support set with $N$ classes and $K$ samples per class to mimic the setting of target meta-test tasks, and a query set from the same classes. A classifier is built using the support set; it is then evaluated on each query set sample with a classification loss for model updating. Many state-of-the-art FSL methods [ye2020few, zhang2020deepemd] are based on the popular prototypical network (ProtoNet) [snell2017prototypical] for its simplicity and effectiveness. With ProtoNet, each support set class mean is computed as a class prototype for classifying each query sample (see Figure 1(a)).
There are, however, two major limitations in existing few-shot action recognition methods [zhu2018compound, zhang2020few, cao2020few]. The first limitation is the lack of data efficiency due to the query-centered learning objective adopted by existing methods. This is evident in Figure 1(a): with only $K$ samples for each of the $N$ classes in the support set, these $N \times K$ samples are first reduced to $N$ prototypes; a loss term (e.g., cross-entropy) is then computed for each query sample individually based on its distances to the prototypes, without any consideration of how the query-set samples as a whole should be distributed. This purely query-centered learning objective thus does not make full use of the limited training data in each episode.
The second limitation is their inability to address two fundamental challenges in FSL, i.e., outlying samples and inter-class distribution overlapping in the support set. As illustrated in Figure 2, outliers are caused by unusual viewpoint, background, occlusion, etc. They are problematic for any learning task, but particularly so in few-shot action recognition: with few samples per class, a single outlier could have an immense effect on the class distribution. The problem of inter-class overlapping is also commonplace when the training samples of different classes have very similar backgrounds, or are simply visually similar. This problem is especially acute for fine-grained action classes. Neither problem is addressed when a class is simply represented as its class mean, without considering the intra-class relationships to identify outliers and the inter-class relationships to avoid class overlapping.
To overcome the aforementioned limitations, we propose a novel Prototype-centered Attentive Learning (PAL) model with two key components. First, a prototype-centered contrastive learning loss is introduced. This loss is computed by comparing a given prototype against all the query-set samples for discriminative learning (see Figure 1(b)). This new learning objective is clearly complementary to conventional query-centered learning objective, and combining the two enables PAL to make full use of the limited training data and hence improve the data efficiency in training. Second, a hybrid attentive learning component is developed consisting of self-attention on support-set samples and cross-attention from query-set to support-set samples. This design aims to mitigate the negative effect of outlying samples and promote class separation. Importantly, it can be seamlessly integrated with the query- and prototype-centered learning objectives in a unified meta-learning pipeline.
We make the following contributions in this paper: (1) We propose a novel Prototype-centered Attentive Learning (PAL) model for few-shot action recognition, specifically designed to address the data efficiency, outlier and class overlapping problems that existing methods suffer from. (2) A novel prototype-centered contrastive learning objective is introduced to complement the existing query-centered objective, in order to make full use of the limited few-shot training data. (3) We further introduce a hybrid attentive learning strategy which is dedicated to solving the under-studied outlier sample and class overlapping problems. (4) Extensive experiments show that our model achieves new state-of-the-art results on four few-shot action datasets. Its performance is particularly compelling on the more challenging fine-grained benchmark, yielding around 10% improvement.
2 Related work
Previous efforts have focused on training action recognition models on large-scale video datasets (e.g., the coarse-grained Kinetics [carreira2017quo] and fine-grained Something-Something [ssv1]). Computationally, utilizing 2D CNNs [simonyan2014two, wang2016temporal, lin2019tsm] for action recognition is more efficient than the 3D counterparts [Tran2015C3D, carreira2017quo, S3D_G_ECCV18, lin2019tsm]. As one of the most popular action classification models, Temporal Segment Networks (TSN) [wang2016temporal] extract features from temporally sparsely sampled frames with a 2D CNN, and average pooling is then used to obtain the video-level feature representation and prediction. We use TSN as the feature extractor/embedding network in our few-shot action recognition model, as we found that other more complex feature extractors [Tran2015C3D, carreira2017quo, S3D_G_ECCV18, lin2019tsm] do not perform better but are computationally more demanding.
Few-shot action recognition
As the research focus in action recognition has recently shifted towards fine-grained action classes, the problem of collecting sufficient training samples per class has become an obstacle. Few-shot action recognition has emerged as a potential solution to this problem [zhu2018compound, zhang2020few, cao2020few]. CMN [zhu2018compound] employs a compound memory network to store the representation and classifies videos by matching and ranking. OTAM [cao2020few] measures the query-centered distance with respect to support set samples, by explicitly leveraging the temporal ordering information in the query video via ordered temporal alignment. ARN [zhang2020few] learns the query-centered similarity between query and support video clips with a pipeline following RelationNet [sung2018learning]. It exploits augmentation-guided spatial and temporal attention with auxiliary self-supervision training losses. Despite their differences in model design, all deploy a meta-learning framework with a query-centered learning objective, and are thus unable to make full use of the limited training data in each episode. Further, none is designed to address the outlying sample and class overlapping problems, as our model does.
Few-shot action recognition is a special case of few-shot learning (FSL). Most existing FSL models focus on recognizing static images. They can be roughly categorized into two groups including metric-based methods [vinyals2016matching, sung2018learning, snell2017prototypical, doersch2020crosstransformers, ye2020few, perez2020incremental, zhang2020deepemd] and optimization-based methods [finn2017model, ravi2016optimization]. All existing few-shot action recognition methods and our PAL belong to the first group, where what is meta-learned is a feature embedding network for video. Note that the idea of using self-attention for few-shot learning has previously been explored in [ye2020few]. The model in [ye2020few] only applies self-attention to class prototypes. In contrast, our model applies it to the support set as well as query samples, therefore being capable of dealing with sample outliers and improving data efficiency. Following the practice of most recent FSL methods [ye2020few, zhang2020deepemd], feature embedding network pretraining is incorporated into our model.
Our attentive learning is based on self-attention across data instances, which was first introduced in transformers as a global self-attention module for machine translation tasks [vaswani2017attention]. Non-local neural networks [wang2018non] applied the core self-attention block from transformers for context modeling and feature learning in computer vision tasks. They learn an affinity map among all pixels in the image and allow the neural network to effectively increase the receptive field to the global context. State-of-the-art performance has been shown in classification [dosovitskiy2020image], self-supervised learning [chen2020generative], semantic segmentation [fu2019dual, zhang2020dynamic] and object detection [yin2020disentangled, carion2020end, zhu2020deformable] by using this transformer-based attention model. Different from these works, in this paper we propose a hybrid attentive learning mechanism that aims to exploit support-set self-attention and query-to-support cross-attention in a unified meta-learning manner.
3.1 Problem definition
We consider the standard few-shot video action classification problem definition [cao2020few, zhu2018compound]. Given a meta-test dataset $\mathcal{D}_{test}$, we sample an $N$-way $K$-shot classification task to test a learned FSL action model $f_\theta$. To train the model so that it performs well on those sampled classification tasks, episodic training is adopted to meta-learn the model. Concretely, a large number of $N$-way $K$-shot tasks are randomly sampled from a meta-training set $\mathcal{D}_{train}$, and then used to train the model in an episodic manner. In each episode, we start by sampling $N$ classes from $\mathcal{D}_{train}$ at random, from which labeled training samples are then randomly drawn to create a support set $\mathcal{S}$ and a query set $\mathcal{Q}$ consisting of $K$ and $M$ samples per class, respectively. Formally, the support and query sets are defined as:
$$\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{N \times K}, \qquad \mathcal{Q} = \{(x_j, y_j)\}_{j=1}^{N \times M}. \tag{1}$$
Note that $\mathcal{S}$ and $\mathcal{Q}$ are sample-wise non-overlapping.
Typically, episodic training is conducted in a two-loop manner [snell2017prototypical]: the support set is used in the inner loop to construct a classifier for the $N$ classes, and the query set is then used in the outer loop to evaluate this classifier and update the model parameters $\theta$. It is noteworthy that, as the objective is to obtain a learner able to recognize novel classes each with only a few labeled examples, the meta-training set $\mathcal{D}_{train}$ and the meta-test set $\mathcal{D}_{test}$ are set to be disjoint in the class space, i.e., they share no action classes. Unlike the sparsely annotated meta-test classes, each meta-training class comes with abundant labeled training data, which allows sufficient episodes to be sampled for meta-training.
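The episode construction described above can be sketched as follows. This is a minimal illustration rather than the authors' data pipeline: the function name `sample_episode`, its arguments and the toy label list are hypothetical, and class labels are re-indexed to 0..N-1 within each episode by assumption.

```python
import random

def sample_episode(labels, n_way, k_shot, m_query, rng):
    """Sample one N-way K-shot episode from a list of per-video class labels.

    Returns (support, query): lists of (video_index, episode_label) pairs,
    where episode_label is in 0..n_way-1. Names are illustrative only.
    """
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    # Only classes with enough videos can supply K support + M query samples.
    eligible = sorted(c for c, idxs in by_class.items()
                      if len(idxs) >= k_shot + m_query)
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for ep_label, c in enumerate(classes):
        # Sampling without replacement keeps support and query disjoint.
        picked = rng.sample(by_class[c], k_shot + m_query)
        support += [(i, ep_label) for i in picked[:k_shot]]
        query += [(i, ep_label) for i in picked[k_shot:]]
    return support, query

# Toy meta-training set: 8 classes with 10 videos each.
toy_labels = [c for c in range(8) for _ in range(10)]
support, query = sample_episode(toy_labels, n_way=5, k_shot=1, m_query=3,
                                rng=random.Random(0))
```

Drawing the support and query samples of a class in a single call without replacement is one simple way to guarantee that the two sets never share a video.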
As in all existing few-shot action classification models [zhu2018compound, zhang2020few, cao2020few], one of the key objectives is to learn a feature embedding network through meta-learning, so that it can generalize to any unseen action classes. The embedding networks adopted are based on those used by existing video action recognition models, with TSN [wang2016temporal] as the most popular choice. For example, in the state-of-the-art OTAM [cao2020few] model, a ResNet-50 based TSN model, pre-trained on ImageNet, was meta-trained together with a time-warping based video distance. However, in most recent FSL methods for static image classification [ye2020few, zhang2020deepemd], pretraining the embedding network on the whole meta-training set before episodic training starts has become a must-have step. In this work, we also adopt such a pretraining step and show in our experiments that this step is vital (see Sec. 4).
Specifically, we use as our feature embedding network a TSN action model [wang2016temporal]. It is pretrained on the whole meta-training set $\mathcal{D}_{train}$ with a cosine similarity based cross-entropy loss. Given a video sample $x$ with a varying length and redundant temporal information, we first sample $T$ video frames. We adopt the same sparse sampling strategy as introduced in [wang2016temporal]: splitting the video into $T$ equal-sized segments and randomly selecting one frame from each segment. This way, each video can be represented using a fixed number of frames $\{x^1, \dots, x^T\}$, where $x^t$ is a random frame from the $t$-th segment. As the sampled frames cover the majority of the original time span, long-term temporal information is kept whereas the spatio-temporal redundancy is reduced significantly.
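The sparse sampling strategy can be illustrated with a few lines of code. The sketch below only picks frame indices; the function name `sparse_sample` and the toy video length are assumptions made for illustration, not part of the TSN codebase.

```python
import random

def sparse_sample(num_frames, num_segments, rng):
    """Pick one random frame index from each of `num_segments` equal-sized segments."""
    if num_frames < num_segments:
        raise ValueError("video is shorter than the number of segments")
    indices = []
    for t in range(num_segments):
        # Integer arithmetic gives segment boundaries [start, end) that tile the video.
        start = t * num_frames // num_segments
        end = (t + 1) * num_frames // num_segments
        indices.append(rng.randrange(start, end))
    return indices

# A 120-frame video represented by 8 frames, one per 15-frame segment.
frame_ids = sparse_sample(num_frames=120, num_segments=8, rng=random.Random(0))
```

Because exactly one index is drawn per segment, the selected frames are always in temporal order and span the whole video, regardless of its length.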
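Next, each sampled video frame $x^t$ ($t = 1, \dots, T$) is individually encoded by a feature embedding network $f_\theta$ and classified by a cosine-distance based classifier, giving us frame-level classification score vectors $s^t \in \mathbb{R}^{C}$, with $C$ the total number of classes in the meta-training set. In particular, for each frame,
$$s^t_c = \cos\left(f_\theta(x^t), w_c\right), \tag{2}$$
where $\cos(\cdot, \cdot)$ denotes the cosine similarity function and $w_c$ ($c = 1, \dots, C$) the classifier weights of each class. Then, these per-frame score vectors are averaged to get video-level scores $s$:
$$s = \frac{1}{T} \sum_{t=1}^{T} s^t. \tag{3}$$
Finally, a softmax is applied to the video-level scores to get video-level action probabilities $p$:
$$p_c = \frac{\exp(s_c)}{\sum_{c'=1}^{C} \exp(s_{c'})}. \tag{4}$$
In training, $p$ and its ground-truth class label $y$ are then utilized to optimize the model using a cross-entropy loss:
$$\mathcal{L}_{ce} = -\sum_{c=1}^{C} \mathbb{1}(y = c) \log p_c, \tag{5}$$
where $\mathbb{1}(\cdot)$ is an indicator function which returns 1 when the argument is true, and 0 otherwise.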
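The pretraining objective for a single video can be sketched as below, assuming the frame features have already been extracted. The helper names (`cosine`, `video_loss`) and the toy 2-D features are hypothetical, and no scaling or temperature is applied to the cosine scores since the text does not mention one.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def video_loss(frame_feats, class_weights, label):
    """Return (probs, loss) for one video given its per-frame features."""
    num_frames = len(frame_feats)
    num_classes = len(class_weights)
    # Per-frame cosine scores against every class weight vector.
    frame_scores = [[cosine(f, w) for w in class_weights] for f in frame_feats]
    # Average over frames to obtain video-level scores.
    scores = [sum(fs[c] for fs in frame_scores) / num_frames
              for c in range(num_classes)]
    # Softmax, shifted by the maximum score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Cross-entropy against the ground-truth class.
    return probs, -math.log(probs[label])

frame_feats = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]  # 3 frames, toy 2-D features
class_weights = [[1.0, 0.0], [0.0, 1.0]]            # 2 classes
probs, loss = video_loss(frame_feats, class_weights, label=0)
```

Since cosine scores lie in [-1, 1], the resulting softmax is fairly flat; practical cosine classifiers often multiply the scores by a scale factor, which is left out here to stay close to the description.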
After the pretraining stage, the feature embedding network together with a cosine distance can be used directly for meta-test without going through the episodic training stage. Our experiments show that this turns out to be a surprisingly strong baseline that even achieves better results than current state-of-the-art OTAM [cao2020few] (see Table 1). This verifies for the first time that a good feature embedding (or feature reuse) is also critical for few-shot action modeling – a finding that has been reported in recent static image FSL works [wei2019Close, tian2020rethinking, liu2020negative]. Nonetheless, this baseline is still limited for FSL since it lacks a “learning to learn” or task adaptation capability to better deal with unseen new tasks.
To this end, we next introduce our Prototype-centered Attentive Learning (PAL) method. The overview of PAL is depicted in Figure 3. PAL is built upon the ProtoNet [snell2017prototypical] but consists of two new components: (1) Hybrid attentive learning including self-attention on support-set samples and cross-attention from query-set samples to support-set samples (Sec. 3.3). (2) Prototype-centered contrastive learning (Sec. 3.4).
3.3 Hybrid attentive learning
Hybrid attentive learning (HAL) is designed to mitigate both inter-class ambiguity and the intra-class outlying sample problem by allowing each support-set sample to examine all other samples in order to identify and fix both problems. It relies on a transformer self-attention module whose parameters are meta-learned together with those of the embedding network $f_\theta$. Once learned, during meta-test, it is used to exploit task-specific contextual information for superior task adaptation. Concretely, given an episode, we first extract a $D$-dimensional feature representation for each frame using the TSN $f_\theta$. The video-level feature vectors are then obtained for both the support set and the query set by average pooling along frames.
Support set self-attention
Formally, let $\mathbf{F}_s$ and $\mathbf{F}_q$ be the support-set and query-set feature matrices respectively, with one $D$-dimensional video-level feature per row. The input to HAL is in the triplet form of (Query, Key, Value). To learn discriminative contextual information per episode, the input is designed based on the support-set samples as:
$$Q_s = \mathbf{F}_s W_Q, \quad K_s = \mathbf{F}_s W_K, \quad V_s = \mathbf{F}_s W_V, \tag{6}$$
where $W_Q$/$W_K$/$W_V$ are the learnable parameters (each represented by a fully connected layer) that project the TSN feature to a $D$-dimensional latent space. As Query, Key and Value share the same input source (i.e., support-set data), self-attention can be formulated as:
$$\tilde{\mathbf{F}}_s = \mathbf{F}_s + \sigma\left(Q_s K_s^\top\right) V_s, \tag{7}$$
where $\sigma$ is a row-wise softmax function for attention normalization. Residual learning is adopted for more stable model convergence. As written in Eq. (7), pairwise similarity defines the attentive scores between any two support-set samples, and is further used for weighted aggregation in the Value space. The intuition is that, statistically, sample pairs from the same class are more similar than those from different classes, except for a few outlier instances. As a result, this attentive learning reinforces the importance of class-sensitive information, subject to the context of the current task’s classes. Consequently, the effect of class-irrelevant information from outlier samples is well controlled during feature transformation. Finally, intra-class variation shrinks and inter-class ambiguity is reduced accordingly.
Query set cross-attention
For consistency, the query-set samples should also be contextualized in a task-specific manner, because they will be classified into a label space formed by support-set samples. To that end, we introduce a cross-attention process from query-set samples to support-set samples, in which the Query $Q_q$ is projected from the query-set features $\mathbf{F}_q$ while the Key $K_s$ and Value $V_s$ are projected from the support-set features, formulated as:
$$\tilde{\mathbf{F}}_q = \mathbf{F}_q + \sigma\left(Q_q K_s^\top\right) V_s, \tag{8}$$
with $\sigma$ a row-wise softmax function. By doing this, in the same spirit as support-set self-attention, query-set samples are also enriched by contextual information. Note that each query-set sample is transformed independently of the other query samples, as the Keys and Values are provided by the support set only. Our model thus remains inductive, like existing FSL action models.
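Both attention steps of this section can be expressed with one residual attention routine, applied once with the support set attending to itself and once with the query set attending to the support set. The sketch below is written under explicit assumptions: plain dot-product similarity with no scaling factor, a single attention head, identity matrices standing in for the meta-learned projections, and toy 2-D features. All function and variable names are illustrative.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def row_softmax(A):
    """Softmax applied to each row independently (shifted for stability)."""
    out = []
    for row in A:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        out.append([x / z for x in e])
    return out

def attend(X, context, Wq, Wk, Wv):
    """Residual attention of the rows of X over the rows of `context`."""
    Q, K, V = matmul(X, Wq), matmul(context, Wk), matmul(context, Wv)
    K_t = [list(col) for col in zip(*K)]
    A = row_softmax(matmul(Q, K_t))  # one row of attention weights per row of X
    update = matmul(A, V)            # weighted aggregation in Value space
    out = [[x + u for x, u in zip(xr, ur)] for xr, ur in zip(X, update)]
    return out, A

identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for the learned projections
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # 2-way 2-shot
queries = [[0.8, 0.2], [0.2, 0.8]]

# Support-set self-attention: Query, Key and Value all come from the support set.
support_att, A_ss = attend(support, support, identity, identity, identity)
# Query-to-support cross-attention: Keys and Values come from the support set only.
query_att, A_qs = attend(queries, support, identity, identity, identity)
```

Because the Keys and Values never include query samples, replacing one query leaves the transformed features of every other query unchanged, which is the inductive property noted above.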
To obtain the supervision for our HAL module, we adopt the popular prototype based objective loss [snell2017prototypical]. With the attentive support-set feature matrix $\tilde{\mathbf{F}}_s$, whose $i$-th row is denoted $\tilde{f}^s_i$, we first form the prototype for each class $n$ (with $K$ support samples per class) as the mean feature:
$$c_n = \frac{1}{K} \sum_{i=1}^{N \times K} \mathbb{1}(y_i = n)\, \tilde{f}^s_i, \tag{9}$$
where $y_i$ denotes the class label of the $i$-th support-set sample. We then compute the cosine similarity between any query-set feature $\tilde{f}^q_j$ and all prototypes and obtain the classification probability vector $p_j$ by the softmax function (Eq. (4)). With the query-set class labels $y_j$, a cross-entropy loss can be derived over the $N$ classes of the current episode as the meta-training objective:
$$\mathcal{L}_{meta} = -\sum_{n=1}^{N} \mathbb{1}(y_j = n) \log p_{j,n}, \tag{10}$$
which is used to update the parameters of both the HAL module and the TSN embedding network.
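A minimal sketch of the prototype construction and the query-centered loss is given below. It assumes the features have already been transformed by the attention module, averages the loss over the query samples (the normalization is an assumption), and uses hypothetical helper names and toy 2-D features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def prototypes(support_feats, support_labels, n_way):
    """Class prototype = mean of the support features belonging to that class."""
    dim = len(support_feats[0])
    protos = []
    for n in range(n_way):
        members = [f for f, y in zip(support_feats, support_labels) if y == n]
        protos.append([sum(f[d] for f in members) / len(members) for d in range(dim)])
    return protos

def query_centered_loss(query_feats, query_labels, protos):
    """Mean cross-entropy of each query against a softmax over its prototype similarities."""
    total = 0.0
    for q, y in zip(query_feats, query_labels):
        sims = [cosine(q, c) for c in protos]
        m = max(sims)
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_z - sims[y]  # equals -log softmax(sims)[y]
    return total / len(query_feats)

support_feats = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
support_labels = [0, 0, 1, 1]
query_feats = [[1.0, 0.1], [0.1, 1.0]]

protos = prototypes(support_feats, support_labels, n_way=2)
loss_good = query_centered_loss(query_feats, [0, 1], protos)  # correct labels
loss_bad = query_centered_loss(query_feats, [1, 0], protos)   # swapped labels
```

Each query is compared with all prototypes but with no other query, which is why this objective is called query-centered.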
3.4 Prototype-centered contrastive learning
The meta-loss in Eq. (10) is still the conventional query-centered loss (see Figure 1(a)). That is, only query-to-prototype discrimination is considered, whilst the other way around is ignored. We hypothesize that this design fails to make full use of already-limited training data in each episode and would lead to sub-optimal task adaptation.
To overcome this limitation, we further introduce a complementary prototype-centered contrastive learning objective. More specifically, for a prototype $c_n$, we define the query-set samples from the same class as the positive matches and all the others as the negatives. We then compute the cosine similarity between $c_n$ and all the query-set samples, and devise the following prototype-centered contrastive loss function:
where $N$ is the number of class prototypes. This design attempts to pull positive query-set samples closer to their corresponding prototype, whilst pushing away the negative ones. This helps further reduce intra-class variation and simultaneously better separate different classes. Eq. (11) is designed similarly in spirit to the supervised contrastive learning objective [khosla2020supervised] and neighborhood component analysis [goldberger2004neighbourhood]. However, our loss function fundamentally differs from them because (1) it acts uniquely as a meta-learning loss; and (2) it is specially designed for posing prototype-centered discrimination in an FSL context with prototypes as anchors.
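One plausible instantiation of a prototype-centered contrastive loss is sketched below. It is an assumption rather than the exact form of Eq. (11): for each prototype, the summed exponentiated similarity of its positive queries is divided by the sum over all queries (a neighborhood-component-analysis style ratio), no temperature is used, and the result is averaged over the prototypes. All names and toy values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def prototype_centered_loss(protos, query_feats, query_labels):
    """Contrast each prototype's positive queries against all queries in the episode.

    Assumes every class has at least one query sample, as in a standard episode.
    """
    total = 0.0
    for n, c in enumerate(protos):
        sims = [cosine(c, q) for q in query_feats]
        m = max(sims)
        exps = [math.exp(s - m) for s in sims]
        positives = sum(e for e, y in zip(exps, query_labels) if y == n)
        total += -math.log(positives / sum(exps))
    return total / len(protos)

protos = [[0.9, 0.1], [0.1, 0.9]]
query_feats = [[1.0, 0.1], [0.9, 0.3], [0.1, 1.0], [0.2, 0.8]]

loss_good = prototype_centered_loss(protos, query_feats, [0, 0, 1, 1])
loss_bad = prototype_centered_loss(protos, query_feats, [1, 1, 0, 0])
```

Here the prototype is the anchor and the whole query set forms its contrast set, so the loss depends on how the queries are distributed around each prototype rather than on one query at a time.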
3.5 Objective loss, model training and inference
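The overall objective loss function of our PAL model for meta-training is defined as:
$$\mathcal{L} = \mathcal{L}_{meta} + \lambda\, \mathcal{L}_{pcl}, \tag{12}$$
where $\mathcal{L}_{meta}$ is the query-centered meta-loss of Eq. (10), $\mathcal{L}_{pcl}$ is the prototype-centered contrastive loss of Eq. (11), and $\lambda$ is a weight hyper-parameter which we set to 1.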
In summary, the proposed method is trained in two sequential stages. First, the baseline TSN is trained on the meta-training set in a standard supervised learning way (Sec. 3.2). Second, the feature embedding network (TSN) and our PAL model are further optimized end-to-end in the episodic training stage. At test time, the model is fixed and is used in the same way as in training: prototypes are formed from the labeled support-set samples, and each unlabeled query-set sample of the meta-test set is classified with the cosine similarity based classifier.
|Matching Net [vinyals2016matching]||53.3||74.6||-||-||-||-||-||-|
We used four few-shot action benchmarks in our evaluations. (1) Kinetics-100 is a 100-class subset of Kinetics-400 [Kinetics], and was first introduced in [zhu2018compound] for few-shot action classification. We followed the same protocol: 64/12/24 classes and 13063/2210/4472 videos for meta-training, meta-validation and meta-testing respectively. Whilst Kinetics is one of the most commonly evaluated datasets, visual appearance and background, rather than motion patterns, encapsulate most of its class-related information [sevilla2019only]. With less need for temporal modeling and involving coarse-grained action classes, it presents a relatively easy action classification task. We hence further evaluated on (2) Sth-Sth-100, created in a similar way based on the Something-Something-V2 dataset [ssv1]. For this dataset we adopted the same protocol as [cao2020few] where 64/12/24 classes and 66939/1925/2854 videos are included for train/val/test respectively. By considering fine-grained actions involving human-object interactions with subtle differences between different classes, it presents a significantly more challenging action recognition task. To compare with very recent models [zhang2020few], we also tested two more YouTube video datasets under the same setting. (3) HMDB-51 [kuehne2011hmdb] contains 51 action classes with 6,849 videos. We used 31/10/10 action classes with 4280/1194/1292 videos for train/val/test, respectively. (4) UCF101 [soomro2012ucf101] consists of 101 action categories with 13,320 video clips. We took 70/10/21 classes with 9154/1421/2745 videos for train/val/test, respectively.
We adopted the standard 5-way 1/5-shot FSL evaluation setting [zhu2018compound, cao2020few]. We randomly sampled 150,000 episodes from the meta-test set and reported the mean classification accuracy.
We used the ImageNet pretrained ResNet-50 [he2016deep] as the backbone network. We replaced the original fully-connected layer with a new layer that performs cosine similarity based classification. For the first training stage, we optimized the TSN feature embedding [wang2016temporal] (Sec. 3.2) with SGD for a total of 70 epochs, starting from a learning rate of 0.001 and decaying it by a factor of 0.1 every 30 epochs. For the second stage, we conducted meta-training of both the TSN feature backbone and our PAL model end-to-end. On Sth-Sth-100, starting from an initial learning rate of 0.0001, we trained for a total of 35 epochs with decay at epochs 15 and 30, where each epoch consists of 200 episodes. For the other datasets, we found that training for 10 epochs sufficed, with decay points at epochs 5, 7 and 9. For both stages, we used a cosine similarity based classifier. During training, we resized each video frame to a fixed size, from which a random region was cropped to form the input. For the three coarse-grained datasets, we applied random horizontal flip in training. However, because many classes in the fine-grained Sth-Sth-100 are direction-sensitive (e.g., pulling something from left to right and pulling something from right to left), horizontal flip was not applied on this dataset. During test, we used only the center crop to obtain the predictions on test data.
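The step-wise learning rate schedules described above can be written as a small helper. This is a sketch under stated assumptions: epochs are counted from 0, the decay factor of 0.1 given for the first stage is also assumed for the second stage, and the function name `multistep_lr` is hypothetical.

```python
def multistep_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate at a 0-indexed epoch: scaled by `gamma` once per milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Stage 1 (pretraining): 70 epochs, decay every 30 epochs.
stage1 = [multistep_lr(e, base_lr=0.001, milestones=[30, 60]) for e in range(70)]
# Stage 2 on Sth-Sth-100 (meta-training): 35 epochs, decay at epochs 15 and 30.
stage2 = [multistep_lr(e, base_lr=0.0001, milestones=[15, 30]) for e in range(35)]
```

Expressing "decay every 30 epochs" as the milestone list [30, 60] lets the same function cover both training stages.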
Three groups of methods are compared in performance evaluation: (1) Classical FSL models originally proposed for image classification including Matching Net [vinyals2016matching], MAML [finn2017model] and ProtoNet [snell2017prototypical]. All these FSL methods used the same TSN feature embedding for fair comparison. (2) Stronger action recognition models including TRN [zhou2018temporal] and TARN [chang2019d3tw]. (3) State-of-the-art FSL action models including CMN [zhu2018compound], OTAM [cao2020few] and ARN [zhang2020few].
4.1 Main results
Comparison to state-of-the-art
The comparative results are reported in Table 1.
We make the following observations:
(1) In comparison to all classical FSL methods, the proposed PAL is clearly superior under both settings and on all datasets.
(2) Whilst stronger action recognition models (e.g., TRN++ and TARN) improve the results, they still lag behind our model by a large margin.
(3) Interestingly, the first FSL action model CMN is shown to be inferior to both TRN++ and TARN, suggesting that its memory network’s merit is less critical than better temporal structure modeling. Directly combining CMN with TRN helps little (CMN++).
(4) By taking a pairwise temporal alignment strategy, the state-of-the-art OTAM further improves the performance. Nonetheless, it is still outperformed by our PAL, particularly on the more challenging Sth-Sth-100 dataset (10.3% improvement under 5-shot). This is not surprising because OTAM’s temporal warping is susceptible to outlier (less consistent) support-set videos, as typically encountered in fine-grained actions with human-object interactions. Besides, compared with our model, OTAM is inferior in leveraging the few shots per class during meta-learning because it lacks prototype-centered discrimination. On the easier Kinetics-100 dataset, where class overlapping is less of a problem, our PAL model remains superior to OTAM by a smaller margin, indicating that tackling outlying samples is consistently a more effective strategy.
(5) On the two smallest datasets HMDB51 and UCF, the proposed method also sets new state-of-the-art, due to its superior ability to deal with sparse training samples.
Why does PAL work?
As described in Introduction and Figure 1, our PAL is specially designed to overcome the class overlap challenge caused by inter-class boundary ambiguity and outlier support-set samples intrinsic to new tasks. To understand the internal mechanism of our model, we visualized the change of support samples’ feature representations and per-class prototypes in a new 5-way 5-shot task. Concretely, this contrasted the feature distributions with and without our PAL model to reveal how the feature space is improved. It is evident in Figure 4 that PAL does improve the separation of different classes’ decision boundary by posing two effects: (1) reducing intra-class variation by mitigating the distracting effect of outlier samples in forming class decision boundary, and (2) minimizing inter-class overlap by conducting query-centered and prototype-centered discriminative learning concurrently.
4.2 Ablation study
Contributions of model components
To understand the benefits of the two components in our PAL, namely Hybrid Attentive Learning (HAL) and Prototype-centered Contrastive Learning (PCL), we conducted detailed component analysis by comparing the performances with and without them. From Table 2 the following observations can be made. (1) Each component is useful and the two components are complementary to each other in improving the classification accuracy on both datasets. (2) More performance gain is achieved on the more difficult Sth-Sth-100 than on Kinetics-100. This is not surprising considering that Kinetics’ classes are coarse-grained with less distribution overlap and more distinctive pattern between classes.
Importance of pretraining
In our model, the feature embedding TSN model is pretrained on the whole training set before the episodic training stage. In contrast, the current state-of-the-art OTAM [cao2020few] skipped this pretraining step. Table 3 shows, surprisingly, that this pretraining step is vital. Our PAL model with pretraining only is already significantly more effective than the state-of-the-art OTAM. This suggests for the first time that a strong feature embedding is important for action classification in the FSL context, confirming the similar conclusion drawn for the image counterparts [wei2019Close, tian2020rethinking, liu2020negative].
Similar to PAL, the recent FEAT model [ye2020few] also exploits transformer [vaswani2017attention] based self-attention for task adaptation in the context of few-shot image classification. However, unlike our model, it directly adapts the class prototypes (i.e., feature means) by self-attention alone; it is thus unable to deal with outlier samples in the support set. Moreover, there is no cross-attention and only conventional query-centered learning is performed during meta-training. Table 4 demonstrates that our PAL is consistently superior to FEAT.
In this work we have proposed a novel Prototype-centered Attentive Learning (PAL) method for few-shot action recognition. It is designed specifically to address two weaknesses of existing methods: their limited data efficiency and their inability to deal with outliers and class overlapping. To that end, two complementary components are developed, namely prototype-centered contrastive learning, which makes better use of the few shots per class, and hybrid attentive learning, which aims to mitigate the negative effect of outlier support samples as well as class overlapping. They can be integrated in a single framework and trained end-to-end to maximize their complementarity. Extensive experiments validate that the proposed PAL yields new state-of-the-art performance on four action benchmarks, with the improvement on the more challenging fine-grained action recognition benchmark being the most compelling.