An Efficient Spatio-Temporal Pyramid Transformer for Action Detection

07/21/2022
by Yuetian Weng, et al.

The task of action detection aims at deducing both the action category and the start and end moments of each action instance in a long, untrimmed video. While vision Transformers have driven the recent advances in video understanding, it is non-trivial to design an efficient architecture for action detection due to the prohibitively expensive self-attention over a long sequence of video clips. To this end, we present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) for action detection, building upon the fact that the early self-attention layers in Transformers still focus on local patterns. Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages while applying global attention modules to capture long-term space-time dependencies in the later stages. In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency. For example, with only RGB input, the proposed STPT achieves 53.6% mAP on THUMOS14, surpassing the I3D+AFSD RGB model by over 10% and performing favorably against state-of-the-art AFSD that uses additional flow features with 31% fewer GFLOPs, which serves as an effective and efficient end-to-end Transformer-based framework for action detection.
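To make the efficiency argument concrete, the following is a minimal sketch that counts query-key pairs for global attention versus non-overlapping local window attention over a spatio-temporal token grid. The grid and window sizes, as well as the function names, are hypothetical choices for illustration and are not taken from the paper; the count also ignores heads, channels, and any overlap or shifting between windows.

```python
from math import prod


def global_attention_pairs(grid):
    """Query-key pairs when every token attends to every other token."""
    n = prod(grid)
    return n * n


def local_window_attention_pairs(grid, window):
    """Query-key pairs when each token attends only to tokens in its own
    non-overlapping (T, H, W) window."""
    if any(g % w != 0 for g, w in zip(grid, window)):
        raise ValueError("window must evenly divide the grid in this sketch")
    # Each of the n tokens attends to prod(window) keys.
    return prod(grid) * prod(window)


# Hypothetical sizes for illustration only (not from the paper):
# 8 frames of 14x14 tokens, with windows spanning 2 frames of 7x7 tokens.
grid = (8, 14, 14)
window = (2, 7, 7)

n_global = global_attention_pairs(grid)
n_local = local_window_attention_pairs(grid, window)
print(f"global: {n_global} pairs, local: {n_local} pairs, "
      f"ratio: {n_global // n_local}x")
```

Under these assumed sizes the local variant touches a small fraction of the pairs that global attention does, which is the kind of saving that motivates restricting attention to local windows in the early, high-resolution stages and reserving global attention for the later, shorter token sequences.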
