Adaptive Perception Transformer for Temporal Action Localization

08/25/2022
by Yizheng Ouyang, et al.

Temporal action localization aims to predict the boundary and category of each action instance in long, untrimmed videos. Most previous methods, which rely on anchors or proposals, neglect the interaction between global and local context across the entire video sequence. In addition, their multi-stage designs cannot produce action boundaries and categories directly. To address these issues, this paper proposes a novel end-to-end model called the adaptive perception transformer (AdaPerFormer for short). Specifically, AdaPerFormer explores a dual-branch multi-head self-attention mechanism. One branch handles global perception attention, which models the entire video sequence and aggregates globally relevant context, while the other branch concentrates on the local convolutional shift, aggregating intra-frame and inter-frame information through our bidirectional shift operation. Because the model is end-to-end, it produces the boundaries and categories of video actions without extra steps. Extensive experiments and ablation studies demonstrate the effectiveness of our design. Our method achieves state-of-the-art accuracy on the THUMOS14 dataset (65.8% mAP@0.5, 42.6% mAP@0.7, and 62.7% mAP@Avg) and obtains competitive performance on the ActivityNet-1.3 dataset with an average mAP of 36.1%. The code and models are available at https://github.com/SouperO/AdaPerFormer.
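
The mAP@0.5 and mAP@0.7 figures above are thresholds on temporal intersection-over-union (tIoU) between a predicted segment and a ground-truth segment, which is the standard convention for this task. The following is a minimal sketch of that matching criterion only, not the authors' evaluation code; the names `temporal_iou` and `is_true_positive`, and the `(start, end)` tuple layout in seconds, are illustrative assumptions.

```python
from typing import Tuple

# Assumed representation: a segment is (start_sec, end_sec) with start <= end.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Return the temporal IoU of two 1-D segments."""
    # Overlap is the distance between the later start and the earlier end,
    # clamped at zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Segment, gt: Segment,
                     pred_label: str, gt_label: str,
                     threshold: float) -> bool:
    """Hypothetical helper: a prediction counts as correct at a given
    threshold when the class matches and the tIoU reaches that threshold."""
    return pred_label == gt_label and temporal_iou(pred, gt) >= threshold
```

Under this criterion, a prediction spanning 0 to 10 seconds against a ground-truth action spanning 4 to 10 seconds has a tIoU of 0.6, so it would count as correct at the 0.5 threshold but not at 0.7. This is why mAP falls as the threshold tightens.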


research
06/18/2021

End-to-end Temporal Action Detection with Transformer

Temporal action detection (TAD) aims to determine the semantic label and...
research
02/16/2022

ActionFormer: Localizing Moments of Actions with Transformers

Self-attention based Transformer models have demonstrated impressive res...
research
08/12/2022

Class-attention Video Transformer for Engagement Intensity Prediction

In order to deal with variant-length long videos, prior works extract mu...
research
03/13/2023

TriDet: Temporal Action Detection with Relative Boundary Modeling

In this paper, we present a one-stage framework TriDet for temporal acti...
research
01/02/2022

TVNet: Temporal Voting Network for Action Localization

We propose a Temporal Voting Network (TVNet) for action localization in ...
research
09/18/2021

Towards High-Quality Temporal Action Detection with Sparse Proposals

Temporal Action Detection (TAD) is an essential and challenging topic in...
research
07/13/2022

Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers

Humans have an innate ability to sense their surroundings, as they can e...
