With the rapid growth of video media, video understanding has become an important area of computer vision. As a fundamental task in video understanding, Temporal Action Localization (TAL) aims to localize temporal boundaries and classify the actions for each action instance in a long untrimmed video.
Because many actions span a long temporal extent (e.g., 50-100s), most prior approaches in TAL Lin et al. (2018, 2019b); Xu et al. (2020); Chao et al. (2018); Bai et al. (2020); Zhao et al. (2020); Long et al. (2019); Lin et al. (2021); Wang et al. (2021) divide this problem into two parts: (i) short-term feature extraction and (ii) long-range temporal boundary localization. As shown in Fig. 1, the first part operates on many consecutive short clips (e.g., each spanning 1-2 seconds) sampled from a long untrimmed video for short-term feature extraction. In the second part, the model uses the extracted features of all short-term clips (i.e., spanning the entire duration of an untrimmed video) to predict action boundaries and categories. Based on these observations, it is natural to conclude that an ideal TAL model should consist of (i) a powerful short-term feature extractor and (ii) a precise temporal boundary localization module.
However, due to the high GPU memory cost needed to process long untrimmed videos, the majority of existing methods sacrifice the representational power of the short-term feature extractor, by either freezing the backbone Lin et al. (2018, 2019b); Xu et al. (2020) or using a very small spatial video resolution Lin et al. (2021); Wang et al. (2021). While both of these techniques are highly effective at reducing GPU memory consumption, they also degrade the quality of the extracted short-term features, which leads to significantly lower TAL accuracy. For example, as shown in Fig. 1, reducing the spatial resolution to save memory leads to a 2.3% mAP drop, using a smaller backbone reduces the mAP by 2.1%, and freezing the backbone leads to a severe mAP drop of 8-9%.
In parallel, we note that recently introduced video transformer models Bertasius et al. (2021); Liu et al. (2021b) have achieved impressive results on various video understanding problems such as action recognition. However, these models have made the above-described GPU memory issues even worse. Due to the quadratic memory complexity of self-attention, video transformers require even more GPU memory than traditional CNNs. As a result, it is challenging to adapt these models to the TAL task, which generally requires a lot of GPU memory even when using CNN-based models. Commonly used GPU memory saving techniques such as Checkpointing Chen et al. (2016) and Mixed Precision Micikevicius et al. (2017) can alleviate these computational issues. However, as shown in Table 1, even when using these techniques, the GPU memory cost of applying video transformers (e.g., VideoSwin Liu et al. (2021b)) to TAL is very large.
Thus, with these computational issues in mind, we propose TallFormer, a memory-efficient and end-to-end trainable Temporal Action Localization Transformer with a Long-memory mechanism. Our key observation is that most videos are highly redundant, i.e., their content changes little in most neighboring frames. This raises the question of whether every single frame from a long untrimmed video needs to be processed during each training iteration. Motivated by this observation, we design TallFormer to process only a fraction of randomly selected frames at each training iteration, which significantly reduces the training time and GPU memory requirements. For the remaining (i.e., not selected) video frames, the video features are sampled from long-term memory, which stores the features of all previously processed frames for that particular video. Note that the features from long-term memory do not have to be re-computed online, and they also do not require backpropagating the gradients, which makes long video processing much more efficient.
As the short-term feature extractor evolves throughout training, the video features in long-term memory also evolve, i.e., the newly computed features for a given video are used to replace the old features in long-term memory. Compared to previous TAL approaches, TallFormer has several main advantages. First, our model can be trained end-to-end on long, high spatial resolution videos beyond the constraints of finite GPU memory. Second, our framework is flexible as we can incorporate any state-of-the-art short-term video transformer model into TallFormer, thus benefiting from future improvements in video transformer design. Lastly, unlike many previous TAL methods Lin et al. (2018, 2019b); Xu et al. (2020); Bai et al. (2020); Zhao et al. (2020); Long et al. (2019) that rely on external action recognition classifiers, TallFormer is a unified framework that predicts action boundaries and categories with a single model. Despite being simpler and only operating on RGB inputs, TallFormer achieves an average mAP of 59.1% on THUMOS14 and 35.6% on ActivityNet-1.3, thus outperforming the current state of the art by 7.1% and 1.2% respectively.
2 Related Work
Action recognition is a fundamental short-term modeling task in video understanding. With the success of deep learning, a vast array of methods Tran et al. (2015); Wang et al. (2016); Tran et al. (2018); Wang et al. (2018); Carreira and Zisserman (2017); Feichtenhofer (2020); Feichtenhofer et al. (2019); Jiang et al. (2019); Lin et al. (2019a); Kwon et al. (2020) utilize 2D and 3D CNNs to achieve impressive performance on standard action recognition benchmarks Carreira and Zisserman (2017). Recently, Vision Transformer-based methods Bertasius et al. (2021); Liu et al. (2021b); Fan et al. (2021) have been shown to outperform previous CNN-based methods by a large margin. Due to the large-scale pretraining on action recognition datasets, the pretrained models from this domain are widely used in temporal action localization as a short-term feature extractor. One limitation of modern video transformer models is that due to the quadratic memory complexity of self-attention Vaswani et al. (2017), these models are slow to train and they require a lot of GPU memory. As a result, it is difficult to apply them to long-term modeling tasks such as temporal action localization.
Temporal Action Localization (TAL)
Due to finite GPU memory constraints, most existing methods Lin et al. (2018, 2019b); Xu et al. (2020); Chao et al. (2018); Bai et al. (2020); Zhao et al. (2020); Long et al. (2019); Tan et al. (2021); Zhao et al. (2021) use pre-extracted action recognition features as inputs to the TAL model. However, since those features are extracted using models He et al. (2016); Carreira and Zisserman (2017) that are pretrained on different datasets, using these features for TAL often leads to suboptimal performance. To address these issues, recent methods AFSD Lin et al. (2021) and DaoTAD Wang et al. (2021) proposed end-to-end trainable frameworks. However, to fit into finite GPU memory, these models operate on very low spatial video resolutions, which leads to a significant drop in TAL accuracy. To the best of our knowledge, none of the existing methods are capable of end-to-end training with both high spatial resolution and long temporal extent. We aim to address this issue by proposing a simple, end-to-end trainable, transformer-based TAL method that can operate on long high-resolution video inputs.
Besides end-to-end training ability, we also note that most TAL methods can be categorized into two groups: (i) single-stage detectors, and (ii) two-stage detectors that require external action recognition classifiers. One-stage detectors Lin et al. (2017a); Liu and Wang (2020); Zhang et al. (2018); Wang et al. (2020b) perform action localization and classification at the same time. In comparison, the two-stage methods Lin et al. (2018, 2019b); Xu et al. (2020); Lin et al. (2021); Gao et al. (2018); Zeng et al. (2019); Liu et al. (2019); Gao et al. (2020); Bai et al. (2020); Su et al. (2020); Qing et al. (2021); Shou et al. (2017); Zhao et al. (2017) only predict action boundaries and then use the predictions of an external action recognition classifier to assign an action class to a given video segment. Despite the elegance and simplicity of one-stage methods, the two-stage methods typically have a much higher detection accuracy. In this work, we will show that even without relying on the external action recognition classifier, our TallFormer still achieves state-of-the-art results on several major TAL benchmarks.
Applying transformer-based methods to TAL poses many GPU memory challenges due to the quadratic memory complexity of self-attention. There are several general memory-saving techniques, including Gradient Checkpointing Chen et al. (2016) and Mixed Precision Micikevicius et al. (2017), which reduce the GPU memory usage by about 50%. We note that our proposed approach is complementary to these techniques. In fact, we use Gradient Checkpointing Chen et al. (2016) in many of our experiments, thus, demonstrating that our proposed method works well in conjunction with these prior memory-saving techniques.
Furthermore, we note that several methods from Natural Language Processing (NLP) such as LinFormer Wang et al. (2020a) and Performer Choromanski et al. (2020) propose to reduce the memory complexity of standard self-attention by approximating the attention using low-rank matrix decomposition. While effective in NLP, those approximation methods work poorly when applied to video recognition Patrick et al. (2021). Furthermore, these approximate attention-based models are difficult to incorporate into pretrained action recognition models (i.e., due to the differing network architectures), which is essential for the TAL task.
Given an untrimmed video of RGB frames, our TallFormer model aims to predict the set of action instances in that video. Each action instance is a four-element tuple that represents the start timestamp of the action, the end timestamp of the action, the action class, and the probability of this instance, respectively.
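As an illustration of the output format described above, the following minimal sketch represents one action instance as a named tuple. The field names, the use of seconds as the time unit, and the example values are assumptions made for this sketch and are not taken from the original formulation.

```python
from typing import List, NamedTuple


class ActionInstance(NamedTuple):
    """One predicted action: (start, end, class, probability)."""
    start: float   # start timestamp of the action (seconds assumed)
    end: float     # end timestamp of the action (seconds assumed)
    label: int     # action class index
    score: float   # probability of this instance


def duration(instance: ActionInstance) -> float:
    """Length of the predicted segment."""
    return instance.end - instance.start


# Hypothetical predictions for a single untrimmed video.
predictions: List[ActionInstance] = [
    ActionInstance(start=3.0, end=9.5, label=4, score=0.91),
    ActionInstance(start=20.0, end=27.0, label=11, score=0.63),
]
```

The number of predicted instances is simply the length of the list, and it can differ from video to video.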
As shown in Fig. 2, TallFormer consists of four components: (i) a short-term Transformer encoder, (ii) a long memory module, (iii) a temporal consistency module and (iv) a temporal boundary localization module. First, we randomly sample a subset of short video clips, and process them using the short-term Transformer encoder. The remaining features are directly sampled from long-term memory, which stores previously computed features of all frames for that particular video input. Afterward, all of these features (i.e., from the short-term Transformer encoder and long-term memory) are fed into a temporal consistency module that effectively fuses them in order to map them to a similar feature space, i.e., to alleviate potential issues caused by differing feature distributions from the feature extractor and long-term memory. Lastly, the temporal boundary localization module processes these features and produces temporal boundaries and action categories for each detected action instance. We also provide the pseudocode of this whole process in Alg. 1. We now describe each of these components in more detail.
3.1 Short-term Transformer Encoder
Our Short-term Transformer Encoder considers many consecutive short clips (i.e., spanning 1-2 seconds) from a long untrimmed video. In order to avoid computing dense features for every single clip, we randomly sample a fixed number of such clips and feed them into our encoder.
Formally, we divide each input video into non-overlapping clips. The input video is shifted by a bounded number of frames to ensure that the clip division changes at each epoch. A uniform sampler first samples the indices of the clips that will be processed by the encoder; the remaining indices identify the clips that are not sampled. The encoder then processes each sampled clip to produce low-dimensional features.
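The clip division and uniform sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the clip length of 32 frames, the sampling rate of 0.4, and the choice to drop frames that do not fill a whole clip are all hypothetical and not specified in the text.

```python
import random
from typing import List, Tuple


def split_into_clips(num_frames: int, clip_len: int, shift: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges of non-overlapping clips.

    `shift` moves the clip grid by a few frames so that the division differs
    between epochs. Frames that do not fill a whole clip are dropped
    (an assumption of this sketch).
    """
    return [(s, s + clip_len) for s in range(shift, num_frames - clip_len + 1, clip_len)]


def sample_clip_indices(num_clips: int, rate: float,
                        rng: random.Random) -> Tuple[List[int], List[int]]:
    """Uniformly pick clips for the encoder; the rest are served from memory."""
    num_sampled = min(num_clips, max(1, round(num_clips * rate)))
    sampled = sorted(rng.sample(range(num_clips), num_sampled))
    sampled_set = set(sampled)
    remaining = [i for i in range(num_clips) if i not in sampled_set]
    return sampled, remaining


rng = random.Random(0)
clip_len = 32                      # hypothetical clip length in frames
shift = rng.randrange(clip_len)    # hypothetical per-epoch shift
clips = split_into_clips(num_frames=480, clip_len=clip_len, shift=shift)
sampled, remaining = sample_clip_indices(len(clips), rate=0.4, rng=rng)
```

The two index lists are disjoint and together cover every clip, so each clip receives its feature either from the encoder or from the memory.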
Note that during each training iteration, the Transformer encoder only processes a small fraction of clips from the whole input video. The remaining clips are sampled from the long memory module (described in Sec. 3.2). This enables TallFormer to be trained end-to-end on long high spatial resolution videos without (i) reducing the spatial video resolution, (ii) freezing the backbone, or (iii) resorting to a weak short-term feature extraction backbone. We use the recent VideoSwin Liu et al. (2021b) as our short-term Transformer encoder, which achieved impressive results on several popular action recognition benchmarks Carreira and Zisserman (2017); Goyal et al. (2017).
3.2 Long Memory Module
Our proposed Long Memory Module (LMM) enables TallFormer to be trained on long and high-resolution videos. Inspired by Zhang et al. (2021); He et al. (2020), we propose LMM to cache the features computed by our short-term Transformer encoder for all short-term video clips. For the remaining clips that are not processed by the short-term Transformer encoder, LMM samples the features from the long-term memory. Following this step, we update the long-term memory with the features extracted by the short-term Transformer encoder. Note that before training, we initialize the LMM with the features extracted by our short-term Transformer encoder.
Such a scheme works well in the TAL setting because the short-term Transformer encoder is already pretrained on a large-scale external action recognition dataset (e.g., Kinetics) and thus, it evolves more slowly than the other modules in the network (i.e., it uses a smaller learning rate than the other parts of the network). Thus, "approximating" short-term features with the features from LMM provides large efficiency gains (both in terms of training time and GPU memory), while still achieving excellent TAL accuracy, which we demonstrate in our experimental section.
Overall, compared to standard end-to-end training, TallFormer only needs to process a fraction of the input clips, which reduces the memory and computational cost in proportion to the clip sampling rate. This then allows us to (i) use a powerful transformer-based feature extractor without freezing its backbone or reducing the spatial video resolution and (ii) still maintain the ability to precisely localize long-range temporal boundaries of actions. Note that during inference, we extract all features using the short-term Transformer encoder (i.e., without using LMM).
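To make the read-then-update behaviour of the memory concrete, the sketch below implements a per-video feature cache in plain Python. It is a simplified stand-in rather than the actual LMM: the class name, method names, and the use of Python lists for features are assumptions of this sketch, and gradient handling is omitted entirely.

```python
from typing import Dict, List, Sequence

Feature = List[float]


class LongMemory:
    """Per-video cache of clip features (simplified stand-in for the LMM)."""

    def __init__(self) -> None:
        self._store: Dict[str, List[Feature]] = {}

    def initialize(self, video_id: str, features: Sequence[Feature]) -> None:
        """Fill the memory once before training with features of every clip."""
        self._store[video_id] = [list(f) for f in features]

    def read(self, video_id: str, indices: Sequence[int]) -> List[Feature]:
        """Fetch cached features; no encoder forward pass is needed for these."""
        return [list(self._store[video_id][i]) for i in indices]

    def write(self, video_id: str, indices: Sequence[int],
              features: Sequence[Feature]) -> None:
        """Overwrite stale entries with features from the current encoder."""
        for i, f in zip(indices, features):
            self._store[video_id][i] = list(f)


def assemble_features(memory: LongMemory, video_id: str, num_clips: int,
                      sampled: Sequence[int], fresh: Sequence[Feature]) -> List[Feature]:
    """Merge fresh encoder features with cached ones in temporal order,
    then update the memory with the fresh features."""
    sampled_set = set(sampled)
    remaining = [i for i in range(num_clips) if i not in sampled_set]
    merged: List[Feature] = [[] for _ in range(num_clips)]
    for i, f in zip(sampled, fresh):
        merged[i] = list(f)
    for i, f in zip(remaining, memory.read(video_id, remaining)):
        merged[i] = f
    memory.write(video_id, sampled, fresh)
    return merged
```

In this sketch the memory is read before it is written, so the cached features used in an iteration always come from earlier iterations, which matches the order of steps described above.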
3.3 Temporal Consistency Module
Due to the different feature distributions of (i) the online extracted Transformer features and (ii) the LMM-cached offline features, we need to reduce temporal inconsistency among clip-level features across the whole input video. To be more precise, the features that are processed online (i.e., using our short-term Transformer encoder) are extracted using the latest short-term encoder. In contrast, most clip-level features stored in the LMM are extracted using previous versions of that same short-term Transformer encoder (i.e., the model from previous iterations). Thus, the short-term features associated with different clips might have different feature distributions, which can potentially degrade TAL performance. To address this issue, we propose a simple yet effective Temporal Consistency Module (TCM).
The idea is to make the features from both sources more consistent by allowing them to interact with and learn from each other. Because standard self-attention is effective at capturing global long-range dependencies, we design TCM as an attention-layer subnetwork. Formally, given the video features of all clips, the TCM refines them using three Transformer layers applied in sequence, where each layer takes the output of the previous layer as input and the output of the last layer is the refined feature of the TCM. The TransformerLayer uses relative positional encoding as in Swin Liu et al. (2021a), GELU Hendrycks and Gimpel (2016) activation, and Droppath Huang et al. (2016).
Conceptually, our self-attention-based TCM subnetwork allows our model to refine potentially inconsistent features by incorporating temporal information from the entire untrimmed input video into feature vectors associated with individual video clips. In our experimental section, we demonstrate the effectiveness of this long-range TCM.
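The mixing of information across clips can be illustrated with a deliberately simplified sketch of stacked self-attention over the clip axis. It is not the TCM itself: queries, keys and values are the raw features with no learned projections, and relative positional encoding, the feed-forward block with GELU, normalization, and Droppath are all left out. Only the number of layers (three) is taken from the text.

```python
import math
from typing import List

Matrix = List[List[float]]  # one feature vector per clip


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(features: Matrix) -> Matrix:
    """Residual single-head self-attention over the clip (time) axis."""
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    refined = []
    for query in features:
        # Similarity of this clip to every clip in the video.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in features]
        weights = softmax(scores)
        # Weighted mix of all clip features, whether fresh or cached.
        mixed = [sum(w * value[d] for w, value in zip(weights, features))
                 for d in range(dim)]
        refined.append([x + m for x, m in zip(query, mixed)])  # residual connection
    return refined


def temporal_consistency(features: Matrix, num_layers: int = 3) -> Matrix:
    """Apply the layers one after another."""
    for _ in range(num_layers):
        features = attention_layer(features)
    return features
```

Each output vector depends on every clip in the video, which is the property that lets online and cached features influence one another.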
3.4 Temporal Boundary Localization Module
The Temporal Boundary Localization Module (TBLM) utilizes the features of all clips produced by the TCM to predict the action boundaries and categories. The TBLMs in most existing methods Lin et al. (2018, 2019b); Xu et al. (2020); Bai et al. (2020); Zhao et al. (2020); Long et al. (2019); Lin et al. (2021); Tan et al. (2021); Zhao et al. (2021), especially for difficult datasets (e.g., ActivityNet Caba Heilbron et al. (2015)), are two-stage detectors that require external action recognition classifiers, which is costly and cumbersome. Our analysis of this problem reveals that many prior methods rely on external classifiers because of a weak short-term encoder. Specifically, as discussed above, many prior methods have to either freeze the backbone or use a small spatial video resolution in order to save GPU memory. This then leads to poor action classification performance, which requires these methods to adopt an external action recognition classifier. In contrast, TallFormer utilizes a strong short-term encoder and achieves strong performance on both action classification and localization using a one-stage TBLM.
To implement our TBLM, we build upon the existing methods Lin et al. (2021); Wang et al. (2021) by simply adding a shared linear action recognition classifier to each action proposal. For datasets where each video only contains one action category such as ActivityNet Caba Heilbron et al. (2015), we add the action classifier on the averaged features of the TCM with 50% dropout to predict the video-level action classes. Due to the strong representational power of TallFormer, this simple modification achieves a high action classification accuracy and thus eliminates the necessity for an external action recognition classifier. We use the same loss functions as the previous methods Lin et al. (2021); Wang et al. (2021). Additionally, we attach Focal Loss Lin et al. (2017b) to the added linear action recognition layer.
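For reference, a sigmoid focal loss over per-class logits can be sketched as below. The per-class sigmoid formulation and the defaults alpha=0.25 and gamma=2.0 follow common practice for focal loss and are assumptions of this sketch; the text does not report the values used for the added classifier.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def focal_loss(logits: Sequence[float], targets: Sequence[int],
               alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Mean sigmoid focal loss; `targets` holds one 0/1 label per class."""
    total = 0.0
    for logit, target in zip(logits, targets):
        p = sigmoid(logit)
        # Probability assigned to the correct outcome for this class.
        p_t = p if target == 1 else 1.0 - p
        p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
        alpha_t = alpha if target == 1 else 1.0 - alpha
        # The (1 - p_t)^gamma factor down-weights well-classified examples.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(logits)
```

With gamma set to 0 and alpha set to 0.5, the expression reduces to half of the usual binary cross-entropy, which is a convenient sanity check.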
4.1 Datasets and Evaluation Metrics
Datasets. We conduct our evaluations on two commonly used benchmark datasets, THUMOS14 Idrees et al. (2017) and ActivityNet-1.3 Caba Heilbron et al. (2015). THUMOS14 contains 200 untrimmed validation videos and 213 untrimmed testing videos with temporal annotations from 20 categories. ActivityNet-1.3 provides separate sets of videos for training and validation. Following previous works Lin et al. (2019b, 2021); Wang et al. (2021), on THUMOS14, we train on the validation set and evaluate on the test set. On ActivityNet-1.3 we train on the training set and evaluate on the validation set.
Evaluation Metrics. As is standard, in all of our experiments, we use mean Average Precision (mAP) to report our results. The Intersection over Union (IoU) thresholds are set separately for THUMOS14 and ActivityNet-1.3.
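The metric can be illustrated with a short sketch of temporal IoU and of average precision for a single class at a single IoU threshold. This is a simplified version for illustration: it uses non-interpolated AP and a greedy one-to-one matching, which are assumptions of this sketch and may differ from the official evaluation toolkits. mAP is then obtained by averaging AP over classes, and the average mAP by averaging over the IoU thresholds.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over Union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds: List[Tuple[Segment, float]], gts: List[Segment],
                      iou_threshold: float) -> float:
    """AP for one class at one IoU threshold.

    Predictions are visited in order of decreasing score, and each
    ground-truth segment can be matched at most once.
    """
    if not gts:
        return 0.0
    matched = [False] * len(gts)
    true_positives = 0
    precision_sum = 0.0
    ranked = sorted(preds, key=lambda p: -p[1])
    for rank, (segment, _score) in enumerate(ranked, start=1):
        best_iou, best_gt = 0.0, -1
        for j, gt in enumerate(gts):
            iou = temporal_iou(segment, gt)
            if not matched[j] and iou > best_iou:
                best_iou, best_gt = iou, j
        if best_gt >= 0 and best_iou >= iou_threshold:
            matched[best_gt] = True
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(gts)
```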
4.2 Implementation Details
The flexibility of our framework allows us to consider any transformer-based model as our short-term feature extractor. Due to its superior accuracy, we adopt Video Swin Transformer Liu et al. (2021b) pretrained on Kinetics-400 Carreira and Zisserman (2017). The number of layers in the Temporal Consistency Module (TCM) is set to 3 and the Droppath rate is 0.1. Our temporal boundary localization module (TBLM) is designed using the techniques from DaoTAD Wang et al. (2021) and AFSD Lin et al. (2021) for THUMOS14 and ActivityNet-1.3 respectively. Unlike previous methods that operate on both RGB and optical flow frame inputs, our TallFormer only uses RGB frames.
During training, we apply common data augmentations on both datasets, including random cropping, random horizontal flipping, random rotation and other photometric distortions such as random brightness and contrast. For the other training and inference details, we follow DaoTAD for THUMOS14 and AFSD for ActivityNet-1.3, with the minor modifications described below. Gradient Checkpointing Chen et al. (2016) is applied to all our models. Our models are trained on 4 RTX A6000 GPUs.
THUMOS14. We extract RGB frames at 15fps. Because 99.5% of action instances in the validation set span less than 32 seconds, we use 480-frame inputs of spatially cropped frames. The batch size is set to 4 on each GPU. Since clip features are temporally downsampled by 8 in DaoTAD, whereas the temporal downsampling rate of Swin Transformer is only 2, we add two convolutional layers with stride 2 before the Temporal Consistency Module to match the temporal downsampling rate of DaoTAD.
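The arithmetic behind the THUMOS14 input length and the temporal downsampling rate can be checked in a few lines. The values are taken from the text; the final quantity, the number of temporal positions reaching the Temporal Consistency Module, is derived here under the assumption that no padding or cropping changes the length.

```python
fps = 15                  # frame rate used for THUMOS14
temporal_support_s = 32   # seconds covered by one training input
num_frames = fps * temporal_support_s

backbone_stride = 2       # temporal downsampling of the Swin backbone
conv_stride = 2           # stride of each added convolutional layer
num_conv_layers = 2
total_stride = backbone_stride * conv_stride ** num_conv_layers

# Temporal positions seen by the Temporal Consistency Module (derived value).
num_positions = num_frames // total_stride

print(num_frames, total_stride, num_positions)
```

This confirms that 32 seconds at 15fps gives 480 frames and that the two added stride-2 layers bring the overall temporal downsampling rate to 8.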
ActivityNet-1.3. As is done in prior work Lin et al. (2021), we resize all videos to 768 RGB frames. We then obtain spatial crops from the resized frames and use them as our inputs. The batch size is set to 1 on each GPU. For simplicity, we remove the boundary consistency learning module in AFSD. Our model is trained for 10 epochs instead of the 16 used in the original AFSD.
4.3 Comparison with Short-term and Long-term Baselines
We next conduct a thorough empirical study investigating the importance of short-term vs. long-term modeling for the TAL task. We focus our comparisons on three of our baselines, which are compared under the same finite GPU memory constraints, i.e., either 12GB (RTX 3080) or 32GB (Tesla V100). We describe each of these baselines in more detail below:
LT-Frozen: For this Long-Term modeling baseline, we use a powerful yet Frozen Video Swin-B feature extractor that operates on video frames. A similar strategy of freezing the feature extractor backbone is commonly used in many prior methods Lin et al. (2018, 2019b); Xu et al. (2020) as the GPU memory savings from freezing the backbone enable very long-range temporal modeling needed for TAL. All models are trained under a finite GPU memory constraint (i.e., either 12GB or 32GB).
ST-E2E: Unlike the LT-Frozen baseline, the Short-Term End-to-End trainable baseline uses a Swin-B feature extractor (not frozen) that operates on video frame inputs. While benefiting from end-to-end trainability, due to the GPU memory limitation, this baseline can only span either (i) a short temporal extent with dense video frame sampling or (ii) a long temporal extent with sparse video frame sampling. We study both of these ST-E2E variants.
TallFormer: Compared to the previous two baselines, we believe that our approach achieves the best trade-off between short-term and long-term modeling. In other words, the short-term feature extractor in our framework can be trained end-to-end on high spatial resolution videos. Furthermore, our long-term memory module enables the model to maintain strong long-term modeling capability for precise temporal boundary localization.
|Mem Cap||Model Type||Short-term||Long-term||mAP(%)|
Based on the results in Table 2, we can make the following observations. First, long-term modeling is important: reducing the temporal support in ST-E2E leads to sub-optimal performance. For example, with a 32GB GPU memory limit, ST-E2E with 12-second temporal support achieves 4.1% lower average mAP than TallFormer with 32-second temporal support. Note that in this case, due to the GPU memory constraints, the ST-E2E variant cannot span a longer temporal extent with densely sampled frames. We also point out that the ST-E2E variant that covers a longer temporal extent using sparsely sampled frames (i.e., a lower frame rate) also performs worse than TallFormer under the 32GB GPU memory constraint. We observe similar trends for the models trained under the 12GB GPU memory constraint.
Additionally, our results indicate that the end-to-end training of a short-term feature extractor is also important, as the LT-Frozen baseline achieves 6.5% lower accuracy than TallFormer. We observe this trend in both 12GB and 32GB GPU memory settings.
Based on our results, we can conclude that TallFormer achieves the best accuracy-memory trade-offs under both 12GB and 32GB GPU memory constraints. Specifically, TallFormer outperforms the LT-Frozen and ST-E2E baselines by a large margin of 6.5% and 2.9% respectively under the 32GB memory constraint.
4.4 Comparison to the State-of-the-Art
Next, we compare TallFormer to the state-of-the-art methods as shown in Tab. 3. The upper part of Tab. 3 includes methods that operate on pre-extracted action recognition features. The middle section of Tab. 3 includes recently proposed end-to-end trainable methods, AFSD and DaoTAD, that operate on small spatial video resolutions to fit into GPU memory. Lastly, in the bottom part of the table, we include our TallFormer, which can be trained end-to-end on long videos.
We experiment with three variants of our method. The first two, named TallFormer-12, are cheap enough to fit in a GPU with 12GB of memory and use an I3D and a Swin-B backbone respectively, which allows a fair comparison with prior methods. The clip sampling rate is chosen separately for each backbone and for each dataset (THUMOS14 and ActivityNet-1.3) such that both of these variants fit in a 12GB GPU. Lastly, our best performing variant is TallFormer-32 with Swin-B as its backbone. For this last variant, we set the clip sampling rates for THUMOS14 and ActivityNet-1.3 so that our model fits in a 32GB Tesla V100 GPU.
|BSN Lin et al. (2018)||TS||✗||✓||✓||53.5||36.9||20.0||36.8||-||46.5||30.0||8.0||30.0||-|
|BMN Lin et al. (2019b)||TS||✗||✓||✓||56.0||38.8||20.5||38.5||-||50.1||34.8||8.3||33.9||-|
|BC-GNN Bai et al. (2020)||TS||✗||✓||✓||57.1||40.4||23.1||40.2||-||50.6||34.8||9.4||34.3||-|
|BU-TAL Zhao et al. (2020)||I3D||✗||✓||✓||53.9||45.4||28.5||43.3||-||43.5||33.9||9.2||30.1||-|
|GTAN Long et al. (2019)||P3D||✗||✓||✓||57.8||38.8||-||-||-||52.6||34.1||8.9||34.3||-|
|G-TAD Xu et al. (2020)||TS||✗||✓||✓||54.5||40.2||23.4||39.3||-||50.4||34.6||9.0||34.1||-|
|TAL Chao et al. (2018)||I3D||✗||✓||✗||53.2||42.8||20.8||39.8||-||38.2||18.3||1.3||20.2||-|
|RTD-Action Tan et al. (2021)||TS||✗||✓||✓||68.3||51.9||23.7||49.0||-||47.2||30.7||8.6||30.8||-|
|VSGN Zhao et al. (2021)||TS||✗||✓||✓||66.7||52.4||30.4||50.2||-||52.4||36.0||8.4||35.1||-|
|AFSD Lin et al. (2021)||I3D||✓||✓||-||67.3||55.5||31.1||52.0||8||52.4||35.3||6.5||34.4||12|
|DaoTAD Wang et al. (2021)||I3D||✓||✗||✗||62.8||53.8||30.1||50.0||11||-||-||-||-||-|
|DaoTAD Wang et al. (2021)||SW||✓||✗||✗||72.7||59.8||33.3||56.3||30||-||-||-||-||-|
THUMOS14. The results in Tab. 3 (the left part of the table) indicate several interesting trends. First, we notice that despite using a small spatial resolution, end-to-end trainable methods such as AFSD and DaoTAD outperform methods that operate on pre-extracted action recognition features by a large margin. Second, our results indicate that the memory-constrained TallFormer-12 with an I3D backbone outperforms a strong AFSD baseline by a substantial margin according to all evaluation metrics. Moreover, when the GPU memory constraint is relaxed, TallFormer-32 achieves 7.2% higher accuracy on average than AFSD. We note that the GPU memory consumption of this variant is 29GB, which is still within the capacity of the mainstream Tesla V100 GPUs. We also point out that even when using the same amount of GPU memory as prior methods, our method still largely outperforms the previous state of the art, i.e., TallFormer-12 with an I3D backbone outperforms AFSD by 1.9%, and TallFormer-12 with a Swin-B backbone outperforms AFSD by 7.0%.
ActivityNet. In the right part of Tab. 3, we also present our results on ActivityNet-1.3. First, we point out that all the previous methods achieve strong TAL results while relying on an external action recognition classifier Xiong et al. (2016), which ensembles the predictions from ResNet200 and Inception-V2 models operating on RGB and optical flow inputs. Instead, we simplify this pipeline by predicting action boundaries and categories using a single model. Not only is our proposed framework simpler and more efficient, but it also outperforms all previous approaches by 1.2% using RGB inputs alone. One interesting observation is that TallFormer with I3D backbone is 6.5% lower than TallFormer with a Swin-B backbone. Our analysis of this result reveals that TallFormer-12 variant with an I3D backbone achieves a low video-level action recognition accuracy (78.2%) while the accuracy of TallFormer-12 with a Swin-B backbone is 90.1%. We also note that the accuracy of an external action recognition classifier Xiong et al. (2016) used by AFSD is 88.9%. This empirical finding also explains why all previous methods require an external action recognition classifier.
HACS. HACS-Segment Zhao et al. (2019) is a large-scale dataset for the temporal action localization task. It contains 38K untrimmed videos for training and 6K for validation. We train TallFormer with Swin-B as the backbone using the same network structure and hyperparameters as in ActivityNet-1.3. We report these results in Tab. 4. These results suggest that, similar to our previously considered datasets, TallFormer also achieves state-of-the-art results on the HACS-Segment dataset. Specifically, it outperforms the GTAD Xu et al. (2020) and BMN Lin et al. (2019b); Qing et al. (2021) baselines in average mAP without an external classifier.
|GTAD Xu et al. (2020)||-||-||-||27.5|
|BMN Lin et al. (2019b); Qing et al. (2021)||52.5||36.4||10.4||35.8|
Inference Speed Discussion. Compared with previous methods, we use a larger backbone and higher spatial resolution. On the other hand, our proposed framework is much simpler than that of many prior TAL methods. First, we note that we only use RGB frames as our input while most previous methods adopt a two-stream approach, which requires averaging predictions from the RGB and optical flow prediction streams. This is costly, because (i) optical flow extraction is slow and (ii) processing optical flow inputs with an additional backbone can roughly double the cost relative to single-stream RGB-only approaches.
Second, we point out that unlike most methods on ActivityNet, our framework does not require a separate ensembled action recognition classifier. More specifically, the most widely used external classifier Xiong et al. (2016) ensembles the results of multiple models (ResNet200, Inception-v2) from several modalities (e.g., RGB, Flow) using multiple sampled clips to achieve a high video-level action recognition accuracy. Due to the ensembling, this operation is very costly. In contrast, our method eliminates the costly reliance on an external action recognition classifier, and solves the TAL task using a single model.
Due to the complexity of the existing systems, and the lack of publicly available implementations, it is difficult to quantitatively measure the inference speed of many prior methods. However, we note that in general, TallFormer provides a much simpler, more elegant and more efficient framework for the TAL problem. For example, consider performing inference on ActivityNet-1.3 where each video is resized to 768 frames. The recent state-of-the-art method AFSD Lin et al. (2021) needs to (i) extract optical flow, (ii) process video frames for both RGB and optical flow modalities, and (iii) perform video-level action classification using Xiong et al. (2016). On an RTX A6000 GPU, extracting the optical flow of 768 frames takes about 11s using the TV-L1 algorithm (https://github.com/open-mmlab/denseflow), which is what the authors use in their paper. Furthermore, inferring action boundaries from both RGB and flow modalities costs 0.14s. Lastly, the video-level classification step costs 0.6s. Thus, the overall inference time of AFSD is 11.74s/video. On the other hand, TallFormer only takes 1.58s/video while outperforming AFSD by 1.2% mAP.
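The timing comparison above amounts to a short calculation, reproduced here for convenience. The per-step timings are the ones quoted in the text; the dictionary keys are descriptive labels chosen for this sketch, and the ratio is a derived value rather than a reported result.

```python
# Per-video inference cost of the AFSD pipeline, in seconds (from the text).
afsd_steps_s = {
    "optical_flow_extraction": 11.0,       # TV-L1 on 768 frames
    "boundary_inference_rgb_flow": 0.14,
    "video_level_classification": 0.6,
}
afsd_total_s = round(sum(afsd_steps_s.values()), 2)
tallformer_total_s = 1.58

# Derived ratio between the two per-video inference times.
ratio = afsd_total_s / tallformer_total_s
print(f"AFSD: {afsd_total_s:.2f} s/video, "
      f"TallFormer: {tallformer_total_s:.2f} s/video, ratio: {ratio:.1f}x")
```

Optical flow extraction accounts for most of the AFSD total, which is why an RGB-only pipeline is considerably cheaper at inference time.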
4.5 Ablation Study
Lastly, we study various design choices of our TallFormer model. Specifically, we investigate (i) TAL performance as a function of our clip sampling rate hyperparameter, (ii) the importance of the temporal consistency module (TCM), and (iii) TAL performance as a function of temporal support. We present these results below.
Accuracy vs. Clip Sampling Rate. During training, our model samples a fraction of total short-term clips from an untrimmed video input. We study the performance as a function of clip sampling rate and present these results in Tab. 1(a). Based on these results, we can make the following observations. First, we note that the GPU memory usage is proportional to the sampling rate. Second, we point out that standard end-to-end training () causes an out-of-memory (OOM) error. Lastly, we show that TallFormer performs quite well even with a very small sampling rate, i.e., the TAL accuracy in mAP drops by only 1.0% while reducing the GPU memory usage to only GB (compared to GB using ).
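To make the role of the sampling rate concrete, the following is a minimal sketch of how a fraction of clips might be selected for online processing in one training iteration. It is not the authors' implementation: the function name `split_clips`, the parameter name `sampling_rate`, the use of uniform random sampling, and the idea that the remaining clips are served from stored features are all assumptions made for illustration.

```python
import math
import random


def split_clips(num_clips, sampling_rate, rng):
    """Split clip indices into an online subset and a remainder.

    Hypothetical helper: `online` clips would go through the trainable
    backbone, while `remaining` clips would reuse previously stored
    features (an assumption, not confirmed by the text above).
    """
    if not 0.0 < sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in (0, 1]")
    # Always process at least one clip online.
    num_online = max(1, math.ceil(sampling_rate * num_clips))
    online = sorted(rng.sample(range(num_clips), num_online))
    online_set = set(online)
    remaining = [i for i in range(num_clips) if i not in online_set]
    return online, remaining


rng = random.Random(0)  # fixed seed so the split is reproducible
online, remaining = split_clips(num_clips=32, sampling_rate=0.25, rng=rng)
print(f"{len(online)} clips processed online, {len(remaining)} not")
```

Because only the online subset requires backbone activations to be kept for backpropagation, the number of such clips, and hence the backbone's activation memory, scales with the sampling rate, which matches the proportionality noted above.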
Importance of Temporal Consistency Module. As shown in Tab. 5(b), our proposed Temporal Consistency Module (TCM) increases the average mAP by 1.5%, which indicates its importance to our overall framework. More conceptually, these results suggest that encouraging long-range interactions between the features from long-term memory and the online-processed features can alleviate the feature distribution inconsistency issue.
Analysis of Temporal Support. Next, we evaluate TallFormer when using different temporal supports (measured in seconds). Based on the results in Tab. 5(c), we observe that longer temporal support leads to consistently higher average mAP.
We present TallFormer, a long-memory Transformer for temporal action localization. Our method is simple, flexible, and it can be efficiently trained on long high-resolution videos for TAL. Furthermore, we demonstrate that TallFormer significantly outperforms previous TAL approaches on the THUMOS14 and ActivityNet-1.3 benchmarks.
Some readers might wonder whether optimizing the GPU memory usage for long-video processing is a valuable contribution, since modern GPUs can accommodate larger and larger GPU memory requirements. Furthermore, there exist many prior memory-saving techniques such as Gradient Checkpointing Chen et al. (2016) and Mixed Precision Micikevicius et al. (2017). Despite the advances in GPU hardware and new developments in memory-saving techniques, we believe that TallFormer is still a valuable contribution to the research community. With the new developments in GPU hardware, the demands for higher-resolution video analysis and larger models also grow. Thus, such demands pose new GPU memory-related challenges, especially for long-term video understanding tasks such as temporal action localization. We also note that TallFormer can be easily combined with the existing memory-saving techniques, which we demonstrated in our experiments. Our future work involves extending our framework to various multimodal settings that involve processing both visual inputs and language.
- Boundary content graph neural network for temporal action proposal generation. In European Conference on Computer Vision, pp. 121–137. Cited by: §1, §1, §2, §2, §3.4, Table 3.
- Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095. Cited by: §1, §2.
- Activitynet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970. Cited by: §3.4, §3.4, §4.1.
- Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §2, §2, §3.1, §4.2, Table 3.
- Rethinking the faster r-cnn architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1130–1139. Cited by: §1, §2, Table 3.
- Training deep nets with sublinear memory cost. arXiv:1604.06174. Cited by: Table 1, §1, §2, §4.2, §5.
- Rethinking attention with performers. arXiv preprint arXiv:2009.14794. Cited by: §2.
- Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6824–6835. Cited by: §2.
- Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 6202–6211. Cited by: §2.
- X3d: expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 203–213. Cited by: §2.
- Accurate temporal action proposal generation with relation-aware pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 10810–10817. Cited by: §2.
- Ctap: complementary temporal action proposal generation. In Proceedings of the European conference on computer vision (ECCV), pp. 68–83. Cited by: §2.
- The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842–5850. Cited by: §3.1.
- Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738. Cited by: §3.2.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.
- Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415. Cited by: §3.3.
- Deep networks with stochastic depth. In European conference on computer vision, pp. 646–661. Cited by: §3.3.
- The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding 155, pp. 1–23. Cited by: §4.1.
- Stm: spatiotemporal and motion encoding for action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2000–2009. Cited by: §2.
- Motionsqueeze: neural motion feature learning for video understanding. In European Conference on Computer Vision, pp. 345–362. Cited by: §2.
- Learning salient boundary feature for anchor-free temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3320–3329. Cited by: §1, §1, §2, §2, §3.4, §3.4, §4.1, §4.2, §4.2, §4.4, Table 3.
- Tsm: temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093. Cited by: §2.
- Bmn: boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3889–3898. Cited by: §1, §1, §1, §2, §2, §3.4, 1st item, §4.1, §4.4, Table 3, Table 4.
- Single shot temporal action detection. In Proceedings of the 25th ACM international conference on Multimedia, pp. 988–996. Cited by: §2.
- Bsn: boundary sensitive network for temporal action proposal generation. In Proceedings of the European conference on computer vision (ECCV), pp. 3–19. Cited by: §1, §1, §1, §2, §2, §3.4, 1st item, Table 3.
- Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §3.4.
- Progressive boundary refinement network for temporal action detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11612–11619. Cited by: §2.
- Multi-granularity generator for temporal action proposal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3604–3613. Cited by: §2.
- Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022. Cited by: §3.3.
- Video swin transformer. arXiv preprint arXiv:2106.13230. Cited by: Table 1, §1, §2, §3.1, §4.2, Table 3.
- Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 344–353. Cited by: §1, §1, §2, §3.4, Table 3.
- Mixed precision training. arXiv preprint arXiv:1710.03740. Cited by: §1, §2, §5.
- Keeping your eye on the ball: trajectory attention in video transformers. Advances in Neural Information Processing Systems 34. Cited by: §2.
- Temporal context aggregation network for temporal action proposal refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 485–494. Cited by: §2, §4.4, Table 4.
- Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541. Cited by: Table 3.
- Cdc: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5734–5743. Cited by: §2.
- Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems 27. Cited by: Table 3.
- Bsn++: complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation. arXiv preprint arXiv:2009.07641. Cited by: §2.
- Relaxed transformer decoders for direct action proposal generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13526–13535. Cited by: §2, §3.4, Table 3.
- Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: §2.
- A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §2.
- Attention is all you need. Advances in neural information processing systems 30. Cited by: §2.
- Rgb stream is enough for temporal action detection. arXiv preprint arXiv:2107.04362. Cited by: Table 1, §1, §1, §2, §3.4, §4.1, §4.2, Table 3.
- Appearance-and-relation networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1430–1439. Cited by: §2.
- Temporal segment networks: towards good practices for deep action recognition. In European conference on computer vision, pp. 20–36. Cited by: §2.
- Linformer: self-attention with linear complexity. arXiv preprint arXiv:2006.04768. Cited by: §2.
- Multi-level temporal pyramid network for action detection. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pp. 41–54. Cited by: §2.
- Cuhk & ethz & siat submission to activitynet challenge 2016. arXiv preprint arXiv:1608.00797. Cited by: §4.4, §4.4, §4.4.
- G-tad: sub-graph localization for temporal action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10156–10165. Cited by: §1, §1, §1, §2, §2, §3.4, 1st item, §4.4, Table 3, Table 4.
- Graph convolutional networks for temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7094–7103. Cited by: §2.
- Temporal query networks for fine-grained video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4486–4496. Cited by: §3.2.
- S3D: single shot multi-span detector via fully 3d convolutional networks. arXiv preprint arXiv:1807.08069. Cited by: §2.
- Video self-stitching graph network for temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13658–13667. Cited by: §2, §3.4, Table 3.
- Hacs: human action clips and segments dataset for recognition and temporal localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8668–8678. Cited by: §4.4.
- Bottom-up temporal action localization with mutual regularization. In European Conference on Computer Vision, pp. 539–555. Cited by: §1, §1, §2, §3.4, Table 3.
- Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2914–2923. Cited by: §2.