Revisiting spatio-temporal layouts for compositional action recognition

11/02/2021
by Gorjan Radevski, et al.

Recognizing human actions is fundamentally a spatio-temporal reasoning problem, and should be, at least to some extent, invariant to the appearance of the human and the objects involved. Motivated by this hypothesis, we take an object-centric approach to action recognition. Multiple works have studied this setting before, yet it remains unclear (i) how well a carefully crafted, spatio-temporal layout-based method can recognize human actions, and (ii) how, and when, to fuse the information from layout- and appearance-based models. The main focus of this paper is compositional/few-shot action recognition, where we advocate the use of multi-head attention (proven to be effective for spatial reasoning) over spatio-temporal layouts, i.e., configurations of object bounding boxes. We evaluate different schemes for injecting video appearance information into the system, and benchmark our approach on background-cluttered action recognition. On the Something-Else and Action Genome datasets, we demonstrate (i) how to extend multi-head attention for spatio-temporal layout-based action recognition, (ii) how to improve the performance of appearance-based models by fusion with layout-based models, and (iii) that even on non-compositional, background-cluttered video datasets, a fusion between layout- and appearance-based models improves performance.
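
To make the idea of fusing layout- and appearance-based models concrete, here is a minimal sketch of one possible scheme: late fusion, where the class probabilities of the two models are averaged. This is an illustrative assumption, not necessarily one of the schemes the paper evaluates. The function names (`softmax`, `late_fusion`), the `layout_weight` parameter, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(layout_logits, appearance_logits, layout_weight=0.5):
    """Weighted average of the two models' class probabilities.

    Hypothetical helper: `layout_weight` is an assumed hyperparameter,
    not a value taken from the paper.
    """
    p_layout = softmax(layout_logits)
    p_appearance = softmax(appearance_logits)
    return [
        layout_weight * pl + (1.0 - layout_weight) * pa
        for pl, pa in zip(p_layout, p_appearance)
    ]


# Made-up scores for three action classes from each model.
layout_logits = [2.0, 0.5, 0.1]
appearance_logits = [0.3, 1.8, 0.2]

fused = late_fusion(layout_logits, appearance_logits)
# Index of the class with the highest fused probability.
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Averaging probabilities keeps the two models independent, so each can be trained separately. Schemes that combine intermediate features instead would require joint training.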


