STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition

10/14/2022
by   Dasom Ahn, et al.
0

In action recognition, although the combination of spatio-temporal videos and skeleton features can improve the recognition performance, a separate model and balancing feature representation for cross-modal data are required. To solve these problems, we propose Spatio-TemporAl cRoss (STAR)-transformer, which can effectively represent two cross-modal features as a recognizable vector. First, from the input video and skeleton sequence, video frames are output as global grid tokens and skeletons are output as joint map tokens, respectively. These tokens are then aggregated into multi-class tokens and input into STAR-transformer. The STAR-transformer encoder layer consists of a full self-attention (FAttn) module and a proposed zigzag spatio-temporal attention (ZAttn) module. Similarly, the continuous decoder consists of a FAttn module and a proposed binary spatio-temporal attention (BAttn) module. STAR-transformer learns an efficient multi-feature representation of the spatio-temporal features by properly arranging pairings of the FAttn, ZAttn, and BAttn modules. Experimental results on the Penn-Action, NTU RGB+D 60, and 120 datasets show that the proposed method achieves a promising improvement in performance in comparison to previous state-of-the-art methods.

READ FULL TEXT
research
01/08/2022

Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition

Capturing the dependencies between joints is critical in skeleton-based ...
research
12/12/2022

Cross-Modal Learning with 3D Deformable Attention for Action Recognition

An important challenge in vision-based action recognition is the embeddi...
research
06/20/2023

How can objects help action recognition?

Current state-of-the-art video models process a video clip as a long seq...
research
02/26/2019

STAR-Net: Action Recognition using Spatio-Temporal Activation Reprojection

While depth cameras and inertial sensors have been frequently leveraged ...
research
09/04/2022

Hierarchical Transformer with Spatio-Temporal Context Aggregation for Next Point-of-Interest Recommendation

Next point-of-interest (POI) recommendation is a critical task in locati...
research
03/16/2023

PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with Progressive Video Transformers

Existing methods of multi-person video 3D human Pose and Shape Estimatio...
research
05/13/2023

Lightweight Delivery Detection on Doorbell Cameras

Despite recent advances in video-based action recognition and robust spa...

Please sign up or login with your details

Forgot password? Click here to reset