TALLFormer: Temporal Action Localization with Long-memory Transformer

04/04/2022
by   Feng Cheng, et al.

Most modern approaches to temporal action localization divide the problem into two parts: (i) short-term feature extraction and (ii) long-range temporal boundary localization. Due to the high GPU memory cost of processing long untrimmed videos, many methods sacrifice the representational power of the short-term feature extractor by either freezing the backbone or using a very small spatial video resolution. This issue is exacerbated by recent video transformer models, many of which have quadratic memory complexity. To address these issues, we propose TALLFormer, a memory-efficient and end-to-end trainable Temporal Action Localization transformer with Long-term memory. Our long-term memory mechanism eliminates the need to process hundreds of redundant video frames during each training iteration, thus significantly reducing GPU memory consumption and training time. These efficiency savings allow us (i) to use a powerful video transformer-based feature extractor without freezing the backbone or reducing the spatial video resolution, while (ii) also maintaining long-range temporal boundary localization capability. With only RGB frames as input and no external action recognition classifier, TALLFormer outperforms previous state-of-the-art methods by a large margin, achieving an average mAP of 59.1%. Code will be available at https://github.com/klauscc/TALLFormer.
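The core idea of the long-term memory mechanism can be illustrated with a minimal sketch. The class and function names below are hypothetical, not the paper's actual API: each iteration, only a small random subset of a video's frames is re-encoded by the trainable backbone, while features for the remaining frames are read from a cache filled on earlier iterations, so the full-length feature sequence is available for boundary localization at a fraction of the compute and memory cost.

```python
import numpy as np

class LongTermFeatureMemory:
    """Sketch of a TALLFormer-style long-term feature memory (hypothetical API).

    Instead of running the heavy backbone on every frame of a long video each
    training iteration, only a small sampled subset of frames is recomputed;
    features for the remaining frames come from a cache updated across
    iterations.
    """

    def __init__(self, num_frames, feat_dim):
        # One cached feature vector per frame of the (long) video.
        self.memory = np.zeros((num_frames, feat_dim), dtype=np.float32)

    def gather(self, backbone, video, sample_ratio=0.25, rng=None):
        rng = rng or np.random.default_rng(0)
        n = len(video)
        k = max(1, int(n * sample_ratio))
        # Choose a small subset of frames to re-encode this iteration.
        fresh_idx = rng.choice(n, size=k, replace=False)
        # Run the trainable backbone only on the sampled frames...
        fresh_feats = backbone(video[fresh_idx])
        # ...and refresh the cache with their up-to-date features.
        self.memory[fresh_idx] = fresh_feats
        # The returned sequence mixes fresh and cached features, one per frame.
        return self.memory.copy()

# Toy usage: a stand-in "backbone" that averages pixels per frame.
backbone = lambda clips: clips.mean(axis=(1, 2))

mem = LongTermFeatureMemory(num_frames=8, feat_dim=3)
video = np.ones((8, 4, 4, 3), dtype=np.float32)  # 8 frames of 4x4 RGB
feats = mem.gather(backbone, video, sample_ratio=0.5)
```

In this toy run, only half the frames pass through the backbone, yet `feats` still contains one feature vector for every frame; over many iterations the cache converges as different subsets get refreshed.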


Related research:

- InstanceFormer: An Online Video Instance Segmentation Framework (08/22/2022)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories (04/02/2021)
- MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition (01/20/2022)
- E2E-LOAD: End-to-End Long-form Online Action Detection (06/13/2023)
- LocATe: End-to-end Localization of Actions in 3D with Transformers (03/21/2022)
- Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization (11/25/2022)
- Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding (07/30/2022)
