Modelling Spatio-Temporal Interactions for Compositional Action Recognition

05/04/2023
by Ramanathan Rajendiran, et al.

Humans have the natural ability to recognize actions even if the objects involved in the action or the background are changed. Humans can abstract away the action from the appearance of the objects and their context, which is referred to as the compositionality of actions. Compositional action recognition deals with imparting human-like compositional generalization abilities to action-recognition models. In this regard, extracting the interactions between humans and objects forms the basis of compositional understanding. These interactions are not affected by the appearance biases of the objects or the context. However, the context provides additional cues about the interactions between things and stuff. Hence, we need to infuse context into the human-object interactions for compositional action recognition. To this end, we first design a spatio-temporal interaction encoder that captures the human-object (things) interactions. The encoder learns spatio-temporal interaction tokens disentangled from the background context. The interaction tokens are then infused with contextual information from the video tokens to model the interactions between things and stuff. The final context-infused spatio-temporal interaction tokens are used for compositional action recognition. We show the effectiveness of our interaction-centric approach on the compositional Something-Else dataset, where we obtain a new state-of-the-art result of 83.8% and outperform existing methods by a significant margin. Our approach of explicit human-object-stuff interaction modeling is effective even for standard action recognition datasets such as Something-Something-V2 and Epic-Kitchens-100, where we obtain performance comparable to or better than the state of the art.
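The abstract describes a three-stage pipeline: an encoder that turns hand/object ("things") features into spatio-temporal interaction tokens, a context-infusion step that lets those tokens draw on video ("stuff") tokens, and a classifier over the resulting tokens. The NumPy sketch below is a minimal, hypothetical illustration of that data flow, not the authors' implementation. It assumes single-head dot-product attention, random untrained weights shared across stages for brevity, a within-frame (spatial) then per-entity across-frame (temporal) factorization of the encoder, cross-attention for context infusion, and mean pooling before the classifier. The class `InteractionSketch`, its methods, and all shapes are illustrative names chosen here.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Single-head scaled dot-product attention; q: (..., Lq, D), k/v: (..., Lk, D).
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v


class InteractionSketch:
    """Hypothetical pipeline: interaction encoder -> context infusion -> classifier."""

    def __init__(self, dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # Random projections stand in for learned weights (assumption);
        # they are shared by every attention step to keep the sketch short.
        self.w_q, self.w_k, self.w_v = (rng.normal(0, scale, (dim, dim)) for _ in range(3))
        self.w_cls = rng.normal(0, scale, (dim, num_classes))

    def self_attend(self, x):
        # Residual self-attention over the second-to-last axis.
        return x + attention(x @ self.w_q, x @ self.w_k, x @ self.w_v)

    def encode_interactions(self, things):
        # things: (T, N, D) hand/object features per frame; background is excluded.
        t, n, d = things.shape
        spatial = self.self_attend(things)  # entities attend to each other within a frame
        temporal = self.self_attend(spatial.transpose(1, 0, 2))  # each entity attends across frames
        return temporal.transpose(1, 0, 2).reshape(t * n, d)  # (T*N, D) interaction tokens

    def infuse_context(self, interaction, video):
        # Cross-attention: interaction tokens (queries) read from video/context tokens.
        ctx = attention(interaction @ self.w_q, video @ self.w_k, video @ self.w_v)
        return interaction + ctx

    def __call__(self, things, video):
        tokens = self.infuse_context(self.encode_interactions(things), video)
        logits = tokens.mean(axis=0) @ self.w_cls  # pool tokens, then linear classifier
        return softmax(logits)


# Usage with made-up sizes: 8 frames, 4 hands/objects, 64-dim features, 196 video tokens.
rng = np.random.default_rng(1)
things = rng.normal(size=(8, 4, 64))
video = rng.normal(size=(196, 64))
model = InteractionSketch(dim=64, num_classes=174)  # hypothetical class count
probs = model(things, video)
print(probs.shape)  # (174,)
```

Keeping the "things" tokens separate from the video tokens until the cross-attention step mirrors the abstract's idea of learning interactions disentangled from the background and only then adding context; the specific attention layout and pooling here are assumptions.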


research
12/20/2019

Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks

Human action is naturally compositional: humans can easily recognize and...
research
11/02/2021

Revisiting spatio-temporal layouts for compositional action recognition

Recognizing human actions is fundamentally a spatio-temporal reasoning p...
research
10/13/2021

Object-Region Video Transformers

Evidence from cognitive psychology suggests that understanding spatio-te...
research
01/13/2022

Hand-Object Interaction Reasoning

This paper proposes an interaction reasoning network for modelling spati...
research
11/25/2022

Interaction Visual Transformer for Egocentric Action Anticipation

Human-object interaction is one of the most important visual cues that h...
research
12/21/2015

Harnessing the Deep Net Object Models for Enhancing Human Action Recognition

In this study, the influence of objects is investigated in the scenario ...
research
12/03/2020

SAFCAR: Structured Attention Fusion for Compositional Action Recognition

We present a general framework for compositional action recognition – i....
