Video Action Transformer Network

12/06/2018
by Rohit Girdhar, et al.

We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial for discriminating an action - all without explicit supervision beyond boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin - a more than 7.5% relative improvement - using only raw RGB frames as input.
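The core mechanism the abstract describes - a person-specific query attending over spatiotemporal context features - can be sketched as a single scaled dot-product attention unit. This is a minimal illustration, not the paper's implementation: the function name, random projection matrices (stand-ins for learned weights), and feature dimensions are all assumptions for the sake of the example.

```python
import numpy as np

def action_transformer_unit(person_query, context_feats, d_k=64, seed=0):
    """Hypothetical sketch of one Action-Transformer-style attention unit.

    A person-specific query vector (shape [d]) attends over flattened
    spatiotemporal context features (shape [N, d], e.g. N = T*H*W
    locations from a video clip) and returns an aggregated feature for
    action classification plus the attention weights. The projection
    matrices here are random placeholders for learned parameters.
    """
    rng = np.random.default_rng(seed)
    d = person_query.shape[-1]
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)  # query projection
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)  # key projection
    Wv = rng.standard_normal((d, d_k)) / np.sqrt(d)  # value projection

    q = person_query @ Wq        # (d_k,)
    K = context_feats @ Wk       # (N, d_k)
    V = context_feats @ Wv       # (N, d_k)

    scores = K @ q / np.sqrt(d_k)            # scaled dot-product scores, (N,)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V, weights              # aggregated feature, attention map

# Toy usage: one person query attending over 10 context locations.
person = np.ones(128)
context = np.random.default_rng(1).standard_normal((10, 128))
feat, attn = action_transformer_unit(person, context)
```

In the full model, many such units (multi-head, stacked) would refine the person feature; the attention weights are what, per the abstract, end up highlighting regions such as hands and faces.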


Related research

04/05/2018 · Guess Where? Actor-Supervision for Spatiotemporal Action Localization
This paper addresses the problem of spatiotemporal localization of actio...

07/08/2023 · VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation
Egocentric action anticipation is a challenging task that aims to make a...

05/23/2017 · AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
This paper introduces a video dataset of spatio-temporally localized Ato...

03/07/2020 · TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation
Video action anticipation aims to predict future action categories from ...

12/14/2015 · Watch-Bot: Unsupervised Learning for Reminding Humans of Forgotten Actions
We present a robotic system that watches a human using a Kinect v2 RGB-D...

07/21/2019 · Attention Filtering for Multi-person Spatiotemporal Action Detection on Deep Two-Stream CNN Architectures
Action detection and recognition tasks have been the target of much focu...

03/21/2022 · LocATe: End-to-end Localization of Actions in 3D with Transformers
Understanding a person's behavior from their 3D motion is a fundamental ...
