Interaction Visual Transformer for Egocentric Action Anticipation

11/25/2022
by   Debaditya Roy, et al.

Human-object interaction is one of the most important visual cues, yet it has not been explored for egocentric action anticipation. We propose a novel Transformer variant that models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. Our model, InAViT, achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual transformer-based methods, including methods based on object-centric video representations. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), where it outperforms the second-best model by 3.3 recall.
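
To make the cross-attention idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (cross_attention, refine), the residual update, and the toy token values are assumptions made for illustration, and learned projections, multiple heads, and the trajectory-based context step are all omitted. It only shows how one set of tokens (for example, hand tokens) could be refined by attending over another set (for example, object tokens) with standard scaled dot-product attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: list of d-dim vectors (here: hypothetical hand tokens)
    keys, values: lists of d-dim vectors (here: hypothetical object tokens)
    Returns (outputs, weights), where weights[i][j] is the attention
    that query i places on key j.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        # Attention-weighted sum of the value vectors.
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def refine(tokens, context):
    """Assumed residual update: each token plus what it attends to in context."""
    attended, weights = cross_attention(tokens, context, context)
    refined = [[t + a for t, a in zip(tok, att)]
               for tok, att in zip(tokens, attended)]
    return refined, weights


# Toy example: one hand token and two object tokens (made-up values).
hand_tokens = [[1.0, 0.0, 0.0, 0.0]]
object_tokens = [[4.0, 0.0, 0.0, 0.0],
                 [0.0, 4.0, 0.0, 0.0]]

refined, weights = refine(hand_tokens, object_tokens)
print(weights[0])  # the first object, aligned with the hand token, gets more weight
print(refined[0])
```

In this toy case the hand token attends mostly to the object token it is aligned with, and the refined token is shifted toward that object. The same pattern could be applied in the other direction (object tokens attending over hand tokens), though how the actual model combines the two is described in the full text, not here.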

research
05/04/2023

Modelling Spatio-Temporal Interactions for Compositional Action Recognition

Humans have the natural ability to recognize actions even if the objects...
research
10/13/2021

Object-Region Video Transformers

Evidence from cognitive psychology suggests that understanding spatio-te...
research
07/20/2022

Is an Object-Centric Video Representation Beneficial for Transfer?

The objective of this work is to learn an object-centric video represent...
research
06/20/2023

How can objects help action recognition?

Current state-of-the-art video models process a video clip as a long seq...
research
08/10/2023

Interaction-aware Joint Attention Estimation Using People Attributes

This paper proposes joint attention estimation in a single image. Differ...
research
12/09/2020

Proactive Interaction Framework for Intelligent Social Receptionist Robots

Proactive human-robot interaction (HRI) allows the receptionist robots t...
research
10/17/2022

A Saccaded Visual Transformer for General Object Spotting

This paper presents the novel combination of a visual transformer style ...
