Localizing actions in videos is a challenging task that has received increasing attention in recent years (Oikonomopoulos et al., 2009; Shou et al., 2016; Gao et al., 2017b, 2018; Lin et al., 2019; Xu et al., 2020b; Paul et al., 2018; Shou et al., 2018; Liu et al., 2019; Shi et al., 2020; Yang et al., 2018; Feng et al., 2018). A central challenge in this field is the difficulty of obtaining large-scale, fully annotated data, where the temporal extent of each action is given as ground truth. To address this issue, several recent works have appeared on topics such as weakly supervised localization (Wang et al., 2017; Paul et al., 2018; Shou et al., 2018; Narayan et al., 2019; Liu et al., 2019; Shi et al., 2020), few-shot action detection (Yang et al., 2018; Xu et al., 2020a) and video re-localization (Feng et al., 2018; Huang et al., 2020; Yang et al., 2020).
Few-shot learning (Fei-Fei et al., 2006) has been used in several domains, including action recognition. Such methods (Bishay et al., 2019; Cao et al., 2020; Zhang et al., 2020) typically rely on learning a similarity function between pairs of videos on a training set and use it to compare videos in the test set with videos in a support set that contains one or a few examples of novel classes (i.e., classes that have not been seen during training). In the domain of video action localization, the few few-shot learning approaches published so far (e.g., Yang et al., 2018; Xu et al., 2020a) assume fully annotated training examples, i.e., known temporal borders of the classes on the query set during training.
In these works, this information is used to train a class-agnostic first stage proposal generator and/or as a supervision signal to the similarity function that is learned between pairs of snippets in the support and the query videos and/or to select the snippets in the untrimmed query videos on which the similarity is learned. However, manual annotation of the borders of actions is time-consuming and sometimes ambiguous.
To address the problem that temporal annotation of action borders is a time-consuming task, several weakly-supervised learning methods (Paul et al., 2018; Narayan et al., 2019; Shi et al., 2020) have been proposed. These methods split the video into snippets (e.g., 16 frames) and perform classification at snippet level to obtain temporal class activation maps (TCAMs) (Zhou et al., 2016). Those maps are used during training as attention mechanisms to refine the classifiers, and during testing to localize the actions. However, such methods (Paul et al., 2018; Narayan et al., 2019; Shi et al., 2020) rely on classifiers that are learned for the classes present in the training set, each of which typically has several samples. This is very different from the one/few-shot learning framework, where only one/few samples are available for the classes in the test set; in such cases, training a classifier is impractical and prone to over-fitting.
To address these problems, we propose a weakly supervised method for one/few-shot action localization, that is, a method that localizes actions in untrimmed videos when a) only one/few trimmed examples of the target action are available at test time, and b) a large collection of videos with class labels (some trimmed and some weakly annotated untrimmed) is available for training, with no overlap between the classes used during training and testing. We do so by designing a network that during training learns a similarity function that estimates Temporal Similarity Matrices (TSMs), that is, fine-grained snippet-to-snippet similarity patterns between pairs of videos (trimmed or untrimmed). These are subsequently used to generate Temporal Class Activation Maps (TCAMs) for seen or unseen classes. The TCAMs serve as temporal attention mechanisms to extract video-level representations of untrimmed videos at training time, and to temporally localize actions in them at test time.
Our TCAMs are similar in functionality to those in other weakly supervised works (Nguyen et al., 2018, 2019; Shi et al., 2020); however, a crucial difference is that, in our case, TCAMs are calculated based on similarities with reference videos, as in (Kordopatis-Zilos et al., 2019), and not from class-based classifiers, which can hardly be trained from one/few examples. During training, we optimize a classification loss at video level in order to ensure the inter-class separability of the learned features. This is in contrast to other works on few-shot action localisation, which have fine-grained action labels at snippet level during training and can therefore supervise their similarity function at the level of action proposals (Yang et al., 2018; Xu et al., 2020a) whose overlap with the ground truth is known. We show that the proposed method obtains similar or better performance, even though those methods are trained in a fully-supervised manner, i.e., with the annotation of action boundaries in the untrimmed videos in the training set.
Our main contributions are summarized as follows:
We address a novel and challenging task, namely weakly-supervised few-shot video action localization, which attempts to locate instances of unseen actions using one/few examples by learning from videos (trimmed and untrimmed) with only video-level labels. To the best of our knowledge, we are the first to address this problem.
In contrast to other weakly supervised methods, we propose an end-to-end single-stage method that generates Temporal Class Activation Maps (TCAMs) from Temporal Similarity Matrices (TSMs) and not from class-based classifiers. This allows the generation of TCAMs using sample-query video pairs from both seen and unseen classes, and avoids an additional proposal generation stage.
Our results are comparable to or better than those of fully-supervised few-shot action localization methods.
2. Related Work
Traditional fully-supervised deep learning methods typically require large amounts of annotated data, introducing a significant, ambiguity-prone annotation workload (Zhao et al., 2019; Xie et al., 2020; Shao et al., 2020a, b). For this reason, learning with scarce data (i.e., few-shot learning) has received increasing attention in domains like object detection (Fei-Fei et al., 2006; Vinyals et al., 2016; Sung et al., 2018; Sun et al., 2019; Hou et al., 2019; Michaelis et al., 2020), action recognition (Zhu et al., 2018; Hahn et al., 2019; Bishay et al., 2019; Cao et al., 2020; Brattoli et al., 2020; Zhang et al., 2020), and action localization (Yang et al., 2018; Feng et al., 2018; Huang et al., 2020; Yang et al., 2020). Current works in this domain either learn using trimmed (Kordopatis-Zilos et al., 2017; Zhu and Yang, 2018; Bishay et al., 2019; Cao et al., 2020; Zou et al., 2020; Brattoli et al., 2020) or well-annotated untrimmed videos (Yang et al., 2018), or address class-agnostic localization tasks (Feng et al., 2018; Huang et al., 2020; Yang et al., 2020). Learning with both scarce data and limited annotation for both action recognition and localization is still an under-explored area.
2.1. Temporal action localization
Video action localization has been extensively studied under the fully-supervised paradigm (Shou et al., 2016; Zhao et al., 2017; Gao et al., 2017b; Lin et al., 2018; Long et al., 2019; Lin et al., 2019; Xu et al., 2020b). However, due to the challenging, time-consuming, and prone-to-ambiguity task of data collection and annotation, weakly-supervised approaches have received increasing attention by the research community (Wang et al., 2017; Shou et al., 2018; Paul et al., 2018; Nguyen et al., 2018; Narayan et al., 2019; Shi et al., 2020; Min and Corso, 2020). In this case, video annotation is given only with respect to the video-level action class, while the exact boundaries of the class instances are not available during training.
More specifically, driven by the effectiveness of fully-supervised two-stage temporal action localization methods (Zhao et al., 2017; Gao et al., 2017a; Lin et al., 2018, 2019), recent works (Wang et al., 2017; Shou et al., 2018) propose to classify a set of candidate proposals by training a video-level classifier. For instance, UntrimmedNet (Wang et al., 2017) generates proposals by uniform or shot-based sampling that are subsequently fed to a classification module trained on video-level labels. AutoLoc (Shou et al., 2018) generates temporal class activation maps (TCAMs) by performing video-level classification and arrives at TCAM-based proposals using an appropriate loss function when training the localization model.
In contrast to the above, some works have directed their efforts towards improving TCAMs for weakly-supervised temporal action localization. For instance, (Paul et al., 2018; Narayan et al., 2019) propose to exploit the correlations between similar actions, and (Nguyen et al., 2018) imposes background suppression. (Min and Corso, 2020) proposes the optimization of a two-branch network in an adversarial manner, so that one branch localizes the most salient activities of a video while the other discovers supplementary ones from the non-localized parts of the video. (Shi et al., 2020) propose to discriminate between action and context frames with a conditional VAE (Kingma and Welling, 2013) by maximizing the likelihood of each frame with respect to the attention values.
2.2. Few-shot learning
The few-shot learning paradigm has been extensively studied for video-related tasks, such as action recognition (Zhu and Yang, 2018; Bishay et al., 2019; Cao et al., 2020; Zou et al., 2020; Brattoli et al., 2020). CMN (Zhu and Yang, 2018) utilizes the key-value memory network paradigm to obtain an optimal video representation in a large space, then classifies videos by matching and ranking. TARN (Bishay et al., 2019) and OTAM (Cao et al., 2020) exploit the temporal information missed by previous few-shot learning methods (Zhu and Yang, 2018; Careaga et al., 2019) by imposing temporal alignment before measuring distances. Zou et al. (Zou et al., 2020) propose a soft composition mechanism to investigate the compositional recognition that humans can perform, which has been well studied in cognitive science but not well explored in the few-shot learning setting. Brattoli et al. (Brattoli et al., 2020) conduct an in-depth analysis of end-to-end training and pre-trained backbones for zero-shot learning.
Recently, few-shot learning has also been adopted for the problem of video action localization (Yang et al., 2018; Xu et al., 2020a) under the fully-supervised paradigm. (Yang et al., 2018; Xu et al., 2020a) use a two-stage approach where, in the first stage, a proposal generator produces class-agnostic action proposals and, in the second stage, these are fed to a network that learns to compare them (using some similarity metric) to the categorical samples for classification. The difference between (Yang et al., 2018; Xu et al., 2020a) and our work is that their supervision signal is much stronger: knowing the overlap of each proposal with the ground-truth segments during training, they are able to distinguish actions from background explicitly in the loss function. By contrast, we adopt a weakly-supervised setting, extract class-specific video-level representations of the untrimmed videos using the TCAMs as attention masks, and learn using a video-level classification cost. Moreover, (Yang et al., 2018; Xu et al., 2020a) rely on a proposal generation stage, learned or not, while we do not.
In a recently proposed line of research, video re-localization, Feng et al. (Feng et al., 2018) propose to localize, in a query video, segments that correspond semantically to a given reference video. Huang et al. (Huang et al., 2020) extend the original formulation so as to learn without temporal boundary information in the training set by utilizing a multi-scale attention module. Yang et al. (Yang et al., 2020) assume only one class in each query video and more than one support video. Unlike (Feng et al., 2018; Huang et al., 2020; Yang et al., 2020), which assume that only one action, of the same class as the reference video, is to be located in a given query video, we work on multi-class video localization and focus on both localization and classification, which is more challenging than single-class example-based re-localization.
3. Few shot, weakly supervised localization
In this paper, we address the problem of weakly-supervised few-shot action localization in videos. In this framework, the training set contains both trimmed and untrimmed videos that are annotated only with class labels (possibly more than one label per video), and typically contains a large number of examples of each class. During testing, we are given a support set that contains one/few examples of novel classes, and a test set that contains untrimmed videos in which we seek to localize the actions of those novel classes. Adopting the protocol followed by (Vinyals et al., 2016; Sung et al., 2018; Yang et al., 2018), we consider $N$-way $K$-shot episode training/testing. More specifically, in each episode, we randomly select $N$ classes from the training set and $K$ trimmed action instances for each class to serve as the sample set, and untrimmed videos with video-level annotations as the query set, each of which contains at least one action instance of the $N$ classes. During a test episode, given a query video, the task is to generate snippet-level attention masks and to categorize each snippet into one of the $N$ classes or as background.
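As a rough illustration of the episodic protocol described above, the following sketch samples one episode from a toy index of videos. The data layout (dictionaries `trimmed` and `untrimmed` keyed by class), the function name `sample_episode`, and the number of query videos are assumptions made for the example, not details taken from the paper.

```python
import random

def sample_episode(trimmed, untrimmed, n_way, k_shot, n_query, rng=None):
    """Sample one N-way K-shot episode (hypothetical data layout).

    trimmed:   dict mapping class -> list of trimmed action instance ids
    untrimmed: dict mapping class -> list of untrimmed video ids that carry
               that class as a video-level label
    """
    rng = rng or random.Random()
    # Pick the N classes of this episode.
    classes = rng.sample(sorted(trimmed), n_way)
    # Sample set: K trimmed instances per class.
    sample_set = {c: rng.sample(trimmed[c], k_shot) for c in classes}
    # Query set: untrimmed videos with only a video-level label.
    query_set = [(v, c) for c in classes
                 for v in rng.sample(untrimmed[c], n_query)]
    return classes, sample_set, query_set

# Toy index: 8 classes, 4 trimmed clips and 6 untrimmed videos per class.
trimmed = {f"class_{i}": [f"t{i}_{j}" for j in range(4)] for i in range(8)}
untrimmed = {f"class_{i}": [f"u{i}_{j}" for j in range(6)] for i in range(8)}
classes, sample_set, query_set = sample_episode(
    trimmed, untrimmed, n_way=5, k_shot=1, n_query=2, rng=random.Random(0))
```

Seeding the generator only makes the toy episode reproducible; in training, a fresh episode would be drawn at every step.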
3.1. Proposed method
Our method consists of two learnable modules, namely the video encoder and the attention generator. The video encoder is used to generate meaningful embeddings in order to calculate a small set of temporal similarity matrices (TSMs) between query videos (from the query set) and reference videos (from the sample set) using different similarity metrics, which are subsequently used to learn attention masks. An overview of the proposed method is given in Fig. 1.
Let a query video and a reference video from the sample set each be represented as a sequence of snippet-level features, extracted from either the RGB or the optical flow stream (Carreira and Zisserman, 2017); every snippet feature has the same dimension, while the number of snippets depends on the video. We first use the video encoder to transform both sequences into embeddings. Note that we train a separate video encoder for each feature representation scheme (RGB and optical flow).
Subsequently, we obtain the TSMs by calculating pair-wise embedding similarities for all snippet pairs between the query and the reference video, using various similarity metrics (i.e., we compute one TSM for each similarity metric choice and each class). With a max-pooling operation along the time dimension of the reference video in each TSM, we obtain the similarity of each snippet of the query video with the reference video. We arrive at the attention masks by learning the attention generator module, which takes as input four similarity vectors, one for each combination of features (RGB and optical flow) and similarity metrics (dot product and cosine distance). By setting a threshold on these attention masks, we assign action/background labels to each snippet; this way, we localize actions at snippet level (localization block in Fig. 1).
For classification, we compare the reference videos, transformed by the video encoder and pooled to a fixed dimension, with the product of the normalized attention masks and the query features transformed by the same video encoder, in order to decide on the class of the action that we have previously localized (classification block in Fig. 1). Below, we discuss each part of the proposed method in detail.
3.2. Video Encoder
As described above, the video encoder is used to refine pre-trained features and arrive at representations that are more meaningful for the task at hand. More specifically, we use I3D (Carreira and Zisserman, 2017) as a pre-trained feature extractor, similarly to (Nguyen et al., 2018; Paul et al., 2018). I3D incorporates both spatial and temporal information by using two streams, RGB and TV-L1 optical flow (Zach et al., 2007); this has been shown to benefit activity detection (Xie et al., 2019; Chao et al., 2018). We give non-overlapping, fixed-length two-stream snippets as input and pass the output through a 3D pooling layer in order to obtain a fixed-dimensional feature vector per snippet in each stream. The video encoder consists of two Fully-Connected (FC) layers.
3.3. Temporal similarity and attention generation
Given the embeddings of the query video and of a reference video of a class $c$, we calculate a snippet-to-snippet similarity matrix $\mathbf{M}_c$, which we call the Temporal Similarity Matrix (TSM). More specifically, the $(i,j)$-th entry of $\mathbf{M}_c$ is the similarity, under a chosen similarity metric, between the $i$-th snippet of the query and the $j$-th snippet of the reference video. Given $\mathbf{M}_c$, we then assign a single similarity score to each snippet of the query video that expresses how well the snippet matches the reference video. We do so by max-pooling along the rows of $\mathbf{M}_c$ (see Fig. 2), that is,
$$s_c(i) = \max_{j} \mathbf{M}_c(i, j). \quad (1)$$
In practice, we calculate four TSMs for each class: one for each combination of two similarity metrics (cosine and dot product) and two types of features (RGB and optical flow). By doing so, we arrive at four similarity vectors for each class $c$ using (1). We found this beneficial in comparison to using only a single type of similarity metric and/or feature. The similarities are then concatenated and given as input to an attention generator module consisting of a batch normalization and an FC layer (Fig. 3), whose output is the Temporal Class Attention Mask (TCAM) $\mathbf{a}_c$ of class $c$. Finally, by normalizing each $\mathbf{a}_c$ using the softmax operator, we arrive at the normalized Temporal Class Attention Masks, as
$$\bar{\mathbf{a}}_c = \operatorname{softmax}(\mathbf{a}_c). \quad (2)$$
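The pipeline of this subsection can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the FC layer is reduced to a weight vector `w` and bias `b` applied per snippet, batch normalization is replaced by a per-feature standardization over the query snippets, and the softmax is taken over the temporal axis.

```python
import numpy as np

def tsm(query, ref, metric="dot"):
    """Temporal similarity matrix between query (Tq, d) and reference (Tr, d)."""
    if metric == "cosine":
        query = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
        ref = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + 1e-8)
    return query @ ref.T  # shape (Tq, Tr)

def similarity_vector(query, ref, metric):
    """Max-pool the TSM over the reference axis: one score per query snippet."""
    return tsm(query, ref, metric).max(axis=1)

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def tcam(q_rgb, r_rgb, q_flow, r_flow, w, b):
    """Raw and normalized attention masks for one class.

    w (4,) and b stand in for the learned FC layer (assumption).
    """
    s = np.stack([
        similarity_vector(q_rgb, r_rgb, "dot"),
        similarity_vector(q_rgb, r_rgb, "cosine"),
        similarity_vector(q_flow, r_flow, "dot"),
        similarity_vector(q_flow, r_flow, "cosine"),
    ], axis=1)                                         # (Tq, 4)
    # Stand-in for batch normalization: standardize each of the 4 features.
    s = (s - s.mean(axis=0)) / (s.std(axis=0) + 1e-5)
    a = s @ w + b                                      # raw TCAM, shape (Tq,)
    return a, softmax(a)

# Toy embeddings: 12 query snippets, 5 reference snippets, dimension 8.
rng = np.random.default_rng(0)
q_rgb, q_flow = rng.normal(size=(12, 8)), rng.normal(size=(12, 8))
r_rgb, r_flow = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
a, a_norm = tcam(q_rgb, r_rgb, q_flow, r_flow, w=np.full(4, 0.25), b=0.0)
```

Because the mask is built purely from query-reference similarities, the same code applies unchanged to classes never seen during training, which is the property the method relies on.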
3.4. Localization and Classification
Training and testing of our method are done in $N$-way, $K$-shot episodes where, at each episode, an untrimmed query video is compared to the videos in the sample set; the latter contains $K$ examples of each of $N$ randomly sampled classes. In this section, we show how we obtain action localisation maps and scores, and how we obtain video-level scores for the query video. To simplify the notation, we first present the 1-shot scenario and then show how it can be trivially extended to the $K$-shot case.
Localization After obtaining the TCAM for a query video and a class, we threshold it and group together consecutive snippets that are above a given threshold. Then, following standard practice (Paul et al., 2018; Narayan et al., 2019; Shi et al., 2020), we arrive at a set of action predictions, each consisting of a start, an end, and a prediction score. We set the prediction score of a prediction as the average of the TCAM values of its individual snippets. In the $K$-shot case, for a specific class, we average the TCAMs calculated from the $K$ samples to obtain the final TCAM. We use the unnormalized TCAM here because of its higher discriminative ability among snippets compared with the normalised TCAM.
Classification We use the normalized TCAMs in order to obtain class-specific vector representations for the query video. More specifically, for each class, we multiply (element-wise) the normalized TCAM (2), broadcast along the feature dimension into a matrix of the same size as the transformed query features, with those features. By summing over the video length (i.e., weighted temporal average pooling (Nguyen et al., 2018)) for each class, we arrive at a set of vectors, one per class, each of which is the class-wise representation of the query video. These are depicted by the blue-toned vectors in the classification block of Fig. 1.
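The thresholding-and-grouping step can be illustrated with the sketch below. The function name `localize`, the use of snippet indices (with an exclusive end) as boundaries, and the toy values are assumptions made for the example.

```python
import numpy as np

def localize(tcam, threshold):
    """Group consecutive above-threshold snippets into predictions.

    Returns a list of (start, end, score) tuples, where start/end are snippet
    indices (end exclusive) and score is the mean TCAM value of the segment.
    """
    tcam = np.asarray(tcam, dtype=float)
    above = tcam > threshold
    preds, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                      # a new segment begins
        elif not flag and start is not None:
            preds.append((start, i, float(tcam[start:i].mean())))
            start = None                   # the segment has ended
    if start is not None:                  # segment running until the last snippet
        preds.append((start, len(tcam), float(tcam[start:].mean())))
    return preds

# K-shot case: average the per-sample TCAMs before thresholding.
tcams = np.array([[0.1, 0.8, 0.9, 0.2, 0.1, 0.7, 0.6, 0.75],
                  [0.1, 0.8, 0.9, 0.2, 0.1, 0.7, 0.6, 0.75]])
preds = localize(tcams.mean(axis=0), threshold=0.5)
```

Snippet indices can be converted to seconds by multiplying with the snippet duration, which depends on the frame rate and snippet length used for feature extraction.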
At the same time, we transform each of the videos in the sample set using the video encoder and apply a temporal average pooling operation in order to fix their dimensionality (note that videos from the sample set are typically of different lengths). Thus, we arrive at one fixed-size representation for each reference video in the sample set (see Fig. 1). The final score of the query for a class is then computed from the distance between the representation of the query for that class and the representation of the reference video of that class. In the $K$-shot case, we calculate $K$ distances for each class, which we average and proceed as above.
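To make the classification branch concrete, the sketch below pools the query embeddings with the normalized attention masks and compares the result with average-pooled reference embeddings. The choice of Euclidean distance and of a softmax over negative distances is an assumption made for illustration, since the exact scoring function is not recoverable from the text; the function name `class_scores` is likewise hypothetical.

```python
import numpy as np

def class_scores(query_emb, norm_tcams, ref_embs):
    """Video-level class scores for one query video (1-shot case).

    query_emb:  (T, d) encoded query snippets
    norm_tcams: (N, T) normalized attention masks, one row per class
    ref_embs:   list of N arrays (T_c, d), one encoded reference per class
    """
    # Attention-weighted temporal pooling: one d-dim vector per class.
    query_repr = norm_tcams @ query_emb                      # (N, d)
    # Temporal average pooling fixes the size of each reference.
    ref_repr = np.stack([r.mean(axis=0) for r in ref_embs])  # (N, d)
    # Row-wise distance between class-wise query and reference vectors
    # (Euclidean distance is an assumption).
    dist = np.linalg.norm(query_repr - ref_repr, axis=1)     # (N,)
    logits = -dist
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy example: the first two query snippets look like class 0.
query_emb = np.array([[1., 0., 0.], [1., 0., 0.], [0., 0., 1.], [0., 0., 1.]])
norm_tcams = np.array([[0.5, 0.5, 0.0, 0.0],      # class 0 attends to snippets 0-1
                       [0.25, 0.25, 0.25, 0.25]])  # class 1 attends uniformly
ref_embs = [np.array([[1., 0., 0.]] * 3), np.array([[0., 1., 0.]] * 2)]
scores = class_scores(query_emb, norm_tcams, ref_embs)
```

In the few-shot case, the distances to the several references of a class would be averaged before the final normalization, as described above.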
Since we adopt a weakly-supervised setting, we only use video-level class labels. Given the ground-truth label of the query video with respect to each class and the corresponding predicted score obtained as described above, we optimize a cross-entropy loss between the two.
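For reference, a standard video-level cross-entropy that is consistent with this description can be written as below. The symbols are assumptions introduced here: $N$ is the number of classes in the episode, $y_c$ the ground-truth label of the query video for class $c$ (normalized to sum to one when a video carries several labels), and $\hat{y}_c$ the predicted score for class $c$. The exact form used by the authors may differ.

```latex
\mathcal{L}_{\mathrm{cls}} = -\sum_{c=1}^{N} y_c \log \hat{y}_c
```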
Datasets We evaluate the proposed method on two popular datasets for video action localization, namely THUMOS14 (Jiang et al., 2014) and ActivityNet1.2 (Caba Heilbron et al., 2015). THUMOS14 provides annotations for 101 classes and consists of 1010 validation (THUMOS14-Val) and 1574 testing videos (THUMOS14-Test). However, temporal annotations (for 20 classes) are provided only for 200 validation and 213 testing videos. This is typically referred to as the THUMOS14-Val-20/Test-20 split. Following the standard practice (Zhao et al., 2017; Yang et al., 2018; Shi et al., 2020), we train our models on the validation set and evaluate them on the testing set for the boundary localization task. ActivityNet1.2 provides annotations (both in terms of video-level class and temporal boundaries) for 100 classes on 4819 training (ANET-Train) and 2383 validation videos (ANET-Val). Following the standard protocol (e.g., (Wang et al., 2017; Shou et al., 2017; Xu et al., 2017; Shi et al., 2020)), we train our models on the training set and evaluate on validation set. Note that we do not use any temporal boundaries information during training, but only during testing for evaluating our models. Similarly to (Yang et al., 2018), we use trimmed videos in the support set, but in contrast to them we do not utilize the temporal boundary annotations in the query videos.
Few-shot training/evaluation protocol The few-shot learning paradigm requires that the classes used for testing are not present during training. Following (Yang et al., 2018), for THUMOS14, we use a part of the THUMOS14 validation set (6 classes from THUMOS14-Val-20) to train both the video encoder and the attention generator (see Fig. 1). We use the remaining 14 classes from the test set (THUMOS14-Test-14) in order to evaluate our one-shot localization network. For ActivityNet1.2, we split the 100 classes into 80/20 splits. We train our localization network on 80 classes in the training set, denoted as ANET-Train-80, and evaluate on the other 20 classes in the validation set, denoted as ANET-Test-20, following (Yang et al., 2018).
Training In each training episode, we randomly choose 5 classes from the set of training classes, and train our network under the standard $N$-way $K$-shot setting (Sung et al., 2018) using 8 query videos for each class. In THUMOS14, we use 5 query videos due to the limited amount of data. In each training episode ($N$-way, $K$-shot), our model is trained on a mini-dataset of $N$ classes, which we split into two non-overlapping subsets. One of them consists of $K$ videos from each class, and we use one trimmed action instance from each to form the sample set; the remaining untrimmed videos serve as query videos.
Testing Our (meta-)testing setting is similar to that of (meta-)training, except for the support and testing sets. More specifically, we pair a randomly chosen video in the testing set with 5 examples. Due to the large number of different combinations of 5 examples (random classes/samples from each class), and since the localization performance relies on them, similarly to (Yang et al., 2018), we randomly sample 1000 different examples from each of the test classes and calculate mAP across all these examples. In the experiments, we report the median of 10 repetitions.
Evaluation metrics Following the literature in temporal action localization, we evaluate our models using mean Average Precision (mAP) at different temporal Intersection over Union (tIoU) thresholds (mAP@tIoU). In ActivityNet1.2, we also report the average mAP at 10 evenly distributed tIoU thresholds between 0.5 and 0.95 (Zhao et al., 2017; Shou et al., 2018; Shi et al., 2020). We also report the numerical video-level action recognition accuracy of top-1 and top-3 predicted classes.
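As a small worked example of the metric, the sketch below computes the temporal IoU between two segments and a simplified (non-interpolated) average precision for one class at a given tIoU threshold. It is not the official THUMOS14 or ActivityNet evaluation code; the function names and the greedy matching rule are simplifying assumptions.

```python
import numpy as np

def temporal_iou(pred, gt):
    """tIoU between two segments given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, tiou_thr):
    """Simplified AP for one class.

    preds: list of (start, end, score); gts: list of (start, end).
    A prediction is a true positive if its best-overlapping ground truth
    reaches the threshold and has not been matched by a higher-scored one.
    """
    if not gts or not preds:
        return 0.0
    matched, tp = set(), []
    for s, e, _ in sorted(preds, key=lambda p: -p[2]):
        ious = [temporal_iou((s, e), g) for g in gts]
        best = int(np.argmax(ious))
        hit = ious[best] >= tiou_thr and best not in matched
        if hit:
            matched.add(best)
        tp.append(1.0 if hit else 0.0)
    tp = np.array(tp)
    precision = np.cumsum(tp) / (np.arange(len(tp)) + 1)
    return float((precision * tp).sum() / len(gts))

# The 10 evenly spaced thresholds used for the average mAP on ActivityNet1.2.
thresholds = np.linspace(0.5, 0.95, 10)

preds = [(0, 10, 0.9), (20, 30, 0.8), (50, 60, 0.3)]
gts = [(0, 10), (22, 30)]
ap_loose = average_precision(preds, gts, 0.5)
ap_strict = average_precision(preds, gts, 0.9)
```

mAP at a threshold is the mean of such per-class AP values, and the average mAP is the mean of the mAP over all values in `thresholds`.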
Implementation details We train our network using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of , which we decrease by a factor of 2 after 1000 episodes, a weight-decay factor of , and a dropout rate of . For both datasets, we train for 10000 episodes.
4.1. Main results
Table 2. THUMOS14, 5-way setting (method@shots).
|Supervision||Method||mAP@0.5|
|Full||CDC@1 (Shou et al., 2017)||6.4|
|Full||CDC@5 (Shou et al., 2017)||6.5|
|Full||Sl. window@1 (Yang et al., 2018)||13.6|
|Full||Sl. window@5 (Yang et al., 2018)||14.0|
|Full||F-PAD@1 (Xu et al., 2020a)|| |
|Full||F-PAD@5 (Xu et al., 2020a)||28.1|

Table 3. ActivityNet1.2, 5-way setting (method@shots).
|Supervision||Method||mAP@0.5||Average mAP|
|Full||CDC@1 (Shou et al., 2017)||8.2||2.4|
|Full||CDC@5 (Shou et al., 2017)||8.6||2.5|
|Full||Sl. window@1 (Yang et al., 2018)||22.3||9.8|
|Full||Sl. window@5 (Yang et al., 2018)||23.1||10.0|
|Full||F-PAD@1 (Xu et al., 2020a)||41.5||28.5|
|Full||F-PAD@5 (Xu et al., 2020a)||50.8||34.2|
We evaluate our method on THUMOS14 and ActivityNet1.2 and, given the lack of other weakly-supervised few-shot methods, compare with state-of-the-art fully-supervised few-shot methods. We report results on the THUMOS14 and ActivityNet1.2 datasets in Tables 2 and 3, respectively. More specifically, on THUMOS14, we surpass (Shou et al., 2017) by a large margin in both the 1-shot and 5-shot 5-way settings, while we achieve results very similar to those of (Yang et al., 2018). Table 2 also shows that our model lags behind F-PAD (Xu et al., 2020a), which is most likely due to the proposal generation stage that they train using the boundary information. On ActivityNet1.2, we outperform both fully-supervised works (Shou et al., 2017; Yang et al., 2018) by a large margin; in particular, we outperform the state of the art (Yang et al., 2018; Xu et al., 2020a) in both the 1- and 5-shot 5-way settings (e.g., in the 5-shot case, we achieve a mAP@0.5 of 52.6% compared to 23.1% of (Yang et al., 2018) and 50.8% of (Xu et al., 2020a)).
It is worth noting that the difference in performance between the two datasets is due to their different relative difficulty in the context of temporal action localization (Shi et al., 2020). That is, THUMOS14 contains more, and more fine-grained, action instances per video (15.5 on average) than ActivityNet1.2 (1.5 on average). Moreover, action instances in THUMOS14 typically range from a few seconds to minutes, making them, in practice, sparsely distributed in a clutter of background, whereas in ActivityNet1.2 the actions are long and typically of only one class in each video. This is also reflected in Table 1, where we report localization results in terms of mAP for different tIoU thresholds. We see that we achieve much higher localization performance for small tIoUs in ActivityNet1.2 compared to THUMOS14. This is also observed in the top-1 classification accuracy.
The above have informed our choice of the threshold that we set on the TCAMs. For THUMOS14, we use the middle of the range. For ActivityNet1.2, we use different thresholds for the different classes in the sample set. More specifically, we set the threshold for each class such that the average length of the predictions in the query video is similar to the (average) length of the action of that class in the support video. Finally, as in (Shi et al., 2020), we note that THUMOS14 has fewer weakly annotated videos for training.
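The per-class threshold choice described above for ActivityNet1.2 can be sketched as a small search over candidate thresholds. The candidate grid, the function names, and the use of snippet counts as the unit of length are assumptions made for this illustration.

```python
import numpy as np

def segment_lengths(mask):
    """Lengths of the runs of consecutive True values in a boolean array."""
    padded = np.concatenate(([0], mask.astype(int), [0]))
    edges = np.diff(padded)
    starts = np.where(edges == 1)[0]   # index where a run begins
    ends = np.where(edges == -1)[0]    # index one past where a run ends
    return ends - starts

def pick_threshold(tcam, support_len, candidates):
    """Pick the candidate whose mean predicted segment length (in snippets)
    is closest to the length of the support action."""
    best_thr, best_gap = None, np.inf
    for thr in candidates:
        lengths = segment_lengths(tcam > thr)
        if len(lengths) == 0:          # nothing survives this threshold
            continue
        gap = abs(lengths.mean() - support_len)
        if gap < best_gap:
            best_thr, best_gap = thr, gap
    return best_thr

tcam = np.array([0.1, 0.6, 0.9, 0.8, 0.55, 0.1, 0.2, 0.7, 0.95, 0.3])
thr = pick_threshold(tcam, support_len=2, candidates=[0.25, 0.5, 0.75])
```

In this toy case the three candidates give mean segment lengths of 3.5, 3 and 1.5 snippets, so the highest candidate is the closest match to a support action of 2 snippets.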
4.2. Ablation studies
We conduct a number of ablation studies in order to demonstrate the effectiveness of a) the two main learnable components of our method, namely the video encoder and the attention generator, and b) various secondary design choices, such as the similarity metric. We choose to evaluate on THUMOS14, since it is more challenging than ActivityNet1.2, under the 5-way, 1-shot setting.
Without learning We begin by evaluating our architecture without learning the video encoder or the attention generator (see Fig. 1). More specifically, we directly use the pre-trained I3D features (Carreira and Zisserman, 2017) to calculate the temporal similarity matrices (TSMs) and attention masks (TCAMs), as described in Sect. 3. In Table 4 we report the performance of our network, in terms of action recognition accuracy and localization mAP@0.5, when no learning is conducted, for various combinations of similarity metrics and temporal pooling operations. We note that using the dot product or the cosine distance for calculating TSMs outperforms the Euclidean distance by large margins with respect to both classification and localization. Moreover, in order to investigate the effectiveness of weighted temporal average pooling (used to calculate the video-level representations as in Fig. 1), we compare it with average pooling. We see that using TCAMs improves the top-1 and top-3 action recognition accuracy by 6.46% and 8.48%, respectively.
Learning the video encoder and the attention generator Next, we investigate the effectiveness of training the video encoder and the attention generator modules. In Table 5, we report the localization performance, in terms of mAP@0.5, and the classification performance, in terms of the top-1 and top-3 accuracy, on THUMOS14 under the 5-way, 1-shot setting. We note that training the video encoder alone improves the action recognition ability of our network (e.g., top-1 accuracy is improved from 47.75% to 51.80%). Moreover, training the attention generator alone (Fig. 3) improves the localization performance by 1.21%. Finally, training both the video encoder and the attention generator leads to better performance both in terms of localization (+3.69%) and recognition accuracy (top-1: +3.95%, top-3: +1.45%).
$N$-way, $K$-shot To investigate the generalization ability of our method, we test with different $N$ and $K$ parameters ($N$-way, $K$-shot) using the model that we trained in the 5-way, 1-shot setting (Table 6). Compared to the results in Table 1, as expected, the localization performance in the 1-way setting is increased, since under this setting the problem boils down to class-agnostic action localization. In the other settings, even though the classification task is more challenging, which leads to an anticipated drop in classification performance, we note that our method achieves slightly worse or comparable localization performance.
Visualization We conclude our ablation studies by illustrating how training the learnable modules, i.e., the video encoder and the attention generator, affects the attention masks used for temporal action localization. In Fig. 4 we show an indicative example of the attention masks (multiplied by the video class scores) at which our method arrives when we learn the video encoder and the attention generator (in blue), and the corresponding masks when we do not learn either of them (in brown). We note that when we optimize both modules, we arrive at more meaningful attention masks, which subsequently lead to a better segmentation of the query video with respect to the ground truth. It is also worth noting that background snippets are suppressed in the case of learnt attention masks.
4.3. 1-shot results on novel splits.
Usually few-shot learning refers to the study of generalizing to unseen categories in image/action classification. This paper, following (Larochelle et al., 2008; Radford et al., ), instead uses “
few-shot learning” in a broader sense and mainly studies generalization to unseen datasets/tasks, which is closer to “
”. More precisely, in the experiments reported in the previous section, we utilized the pre-trained features on Kinetics-400(Kay et al., 2017) from (Carreira and Zisserman, 2017), a dataset which has overlapped actions with THUMOS14 and ActivityNet1.2.
To further demonstrate the effectiveness of our method, this section reports results on a “novel split” that uses In-Domain classes (overlapped with kinetics-400) as training set and Out-of-Domain classes (non-overlapped with kinetics-400) as testing set. We show that with the help of the representations trained in a large dataset, our model yields comparable performance on both In-/Out-of-Domain classes in other datasets.
For Table 7, we train on 28 In-Domain classes and evaluate on 72 Out-of-Domain classes in ActivityNet1.2. The results show a recognition ability similar to that in Table 3 and an expected drop in localization mAP due to the decrease in the size of the training set. Still, the result outperforms (Yang et al., 2018) and (Xu et al., 2020a) by a large margin (39.48% vs 22.3% (Yang et al., 2018) vs 31.7% (Xu et al., 2020a)). Fig. 5 shows the Kinetics-400 pretrained features on ActivityNet1.2, indicating that the features themselves are useful for discriminating actions, regardless of whether the actions belong to the in-domain or out-of-domain subsets with respect to Kinetics-400. Of the 20 THUMOS14 classes, only “Billiards” is not present in Kinetics-400, so we remove “Billiards” from the training set and add it to the evaluation set. We see that there is only a marginal drop in the performance of this class (Table 8). As seen in Fig. 6, where the t-SNE plot of the classes is presented, even though “Billiards” is not present in Kinetics-400, it is still distinguishable, which is consistent with the visualization of the ActivityNet1.2 classes. Taking the analysis above and the results without learning (Table 4) into consideration, we believe that a good feature representation (such as one obtained by pretraining on Kinetics) is the key to localization performance with few examples.
|F-PAD (Xu et al., 2020a)*||80/20||31.7||19.4||-||-|
is pretrained on a larger dataset – Sports-1M (Karpathy et al., 2014), * is the controlled split testing on Out-of-Domain classes, which is different from ‘origin’.)
In this paper, we proposed a weakly-supervised few-shot method for the problem of temporal action localization in videos. To the best of our knowledge, this is the first method to address this problem under the assumptions of few-shot learning using video-level annotation only. We do so by learning to estimate meaningful temporal similarity matrices that model fine-grained similarity patterns between pairs of videos (trimmed or untrimmed), and by using them to generate attention masks for seen or unseen classes. Experimental results on two datasets show that our method achieves performance comparable to state-of-the-art fully-supervised few-shot learning methods.
Acknowledgements. Funding for this research is provided by the joint QMUL-CSC Scholarship, and by the EPSRC Programme Grant EP/R025290/1.
- TARN: temporal attentive relation network for few-shot and zero-shot action recognition. Proceedings of the British Machine Vision Conference. Cited by: §1, §2.2, §2.
- Rethinking zero-shot video classification: end-to-end training for realistic applications. In , Cited by: §2.2, §2.
- Activitynet: a large-scale video benchmark for human activity understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970. Cited by: §4.
- Few-shot video classification via temporal alignment. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.2, §2.
- Metric-based few-shot learning for video action recognition. arXiv preprint arXiv:1909.09602. Cited by: §2.2.
- Quo vadis, action recognition? a new model and the kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4733. Cited by: §3.1, §3.2, Figure 4, §4.2, §4.3.
- Rethinking the faster r-cnn architecture for temporal action localization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1130–1139. Cited by: §3.2.
- One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4), pp. 594–611. Cited by: §1, §2.
- Video re-localization. In European Conference on Computer Vision, pp. 51–66. Cited by: §1, §2.2, §2.
- Ctap: complementary temporal action proposal generation. In European Conference on Computer Vision, pp. 68–83. Cited by: §1.
- Cascaded boundary regression for temporal action detection. Proceedings of the British Machine Vision Conference. Cited by: §2.1.
- Turn tap: temporal unit regression network for temporal action proposals. IEEE International Conference on Computer Vision. Cited by: §1, §2.1.
- Action2vec: a crossmodal embedding approach to action learning. IEEE Conference on Computer Vision and Pattern Recognition, Workshop. Cited by: §2.
- Cross attention network for few-shot classification. In Advances in Neural Information Processing Systems, pp. 4003–4014. Cited by: §2.
- Weakly-supervised video re-localization with multiscale attention model. In National Conference on Artificial Intelligence, pp. 11077–11084. Cited by: §1, §2.2, §2.
- THUMOS challenge: action recognition with a large number of classes. Note: http://crcv.ucf.edu/THUMOS14/ Cited by: §4.
- Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: Table 7.
- The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §4.3.
- Adam: a method for stochastic optimization. International Conference on Learning Representations. Cited by: §4.
- Auto-encoding variational bayes. International Conference on Learning Representations. Cited by: §2.1.
- Visil: fine-grained spatio-temporal video similarity learning. In IEEE International Conference on Computer Vision, pp. 6351–6360. Cited by: §1.
- Near-duplicate video retrieval by aggregating intermediate cnn layers. In International conference on multimedia modeling, pp. 251–263. Cited by: §2.
- Zero-data learning of new tasks. In National Conference on Artificial Intelligence, Cited by: §4.3.
- BMN: boundary-matching network for temporal action proposal generation. IEEE International Conference on Computer Vision. Cited by: §1, §2.1, §2.1.
- BSN: boundary sensitive network for temporal action proposal generation. European Conference on Computer Vision. Cited by: §2.1, §2.1.
- Completeness modeling and context separation for weakly supervised temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1298–1307. Cited by: §1.
- Gaussian temporal awareness networks for action localization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 344–353. Cited by: §2.1.
- Closing the generalization gap in one-shot object detection. arXiv preprint arXiv:2011.04267. Cited by: §2.
- Adversarial background-aware loss for weakly-supervised temporal activity localization. European Conference on Computer Vision. Cited by: §2.1, §2.1.
- 3c-net: category count and center loss for weakly-supervised action localization. In IEEE International Conference on Computer Vision, pp. 8679–8687. Cited by: §1, §1, §2.1, §2.1, §3.4.
- Weakly supervised action localization by sparse temporal pooling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6752–6761. Cited by: §1, §2.1, §2.1, §3.2, §3.4.
- Weakly-supervised action localization with background modeling. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5502–5511. Cited by: §1.
- An implicit spatiotemporal shape model for human activity localization and recognition. In 2009 IEEE Computer Society conference on computer vision and pattern recognition workshops, pp. 27–33. Cited by: §1.
- W-talc: weakly-supervised temporal activity localization and classification. In European Conference on Computer Vision, pp. 563–579. Cited by: §1, §1, §2.1, §2.1, §3.2, §3.4.
- Learning transferable visual models from natural language supervision. Image. Cited by: §4.3.
- Finegym: a hierarchical video dataset for fine-grained action understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625. Cited by: §2.
- Intra-and inter-action understanding via temporal action parsing. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 730–739. Cited by: §2.
- Weakly-supervised action localization by generative attention modeling. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1009–1019. Cited by: §1, §1, §1, §2.1, §2.1, §3.4, §4.1, §4.1, §4, §4.
- CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1417–1426. Cited by: §4.1, Table 2, Table 3, §4.
- Autoloc: weakly-supervised temporal action localization in untrimmed videos. In European Conference on Computer Vision, pp. 154–171. Cited by: §1, §2.1, §2.1, §4.
- Temporal action localization in untrimmed videos via multi-stage cnns. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1049–1058. Cited by: §1, §2.1.
- Meta-transfer learning for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 403–412. Cited by: §2.
- Learning to compare: relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §2, §3, §4.
- Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638. Cited by: §2, §3.
- Untrimmednets for weakly supervised action recognition and detection. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2. Cited by: §1, §2.1, §2.1, §4.
- Temporal action localization with variance-aware networks. arXiv preprint arXiv:2008.11254. Cited by: §2.
- Exploring feature representation and training strategies in temporal action localization. In International Conference on Image Processing, pp. 1605–1609. Cited by: §3.2.
- R-c3d: region convolutional 3d network for temporal activity detection. In IEEE International Conference on Computer Vision, Vol. 6, pp. 8. Cited by: §4.
- Revisiting few-shot activity detection with class similarity control. arXiv preprint arXiv:2004.00137. Cited by: §1, §1, §1, §2.2, §4.1, §4.3, Table 2, Table 3, Table 7.
- G-tad: sub-graph localization for temporal action detection. IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §1, §2.1.
- One-shot action localization by learning sequence matching network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1450–1459. Cited by: §1, §1, §1, §2.2, §2, §3, §4.1, §4.3, Table 2, Table 3, Table 7, §4, §4, §4.
- Localizing the common action among a few videos. European Conference on Computer Vision. Cited by: §1, §2.2, §2.
- A duality based approach for realtime tv-l 1 optical flow. In Joint pattern recognition symposium, pp. 214–223. Cited by: §3.2.
- Few-shot action recognition with permutation-invariant attention. In European Conference on Computer Vision, Cited by: §1, §2.
- Hacs: human action clips and segments dataset for recognition and temporal localization. In IEEE International Conference on Computer Vision, pp. 8668–8678. Cited by: §2.
- Temporal action detection with structured segment networks. In IEEE International Conference on Computer Vision, Vol. 8. Cited by: §2.1, §2.1, §4, §4.
- Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929. Cited by: §1.
- Compound memory networks for few-shot video classification. In European Conference on Computer Vision, pp. 751–766. Cited by: §2.2, §2.
- Towards universal representation for unseen action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9436–9445. Cited by: §2.
- Compositional few-shot recognition with primitive discovery and enhancing. In ACM on Multimedia Conference, pp. 156–164. Cited by: §2.2, §2.