Object-Region Video Transformers

10/13/2021
by Roei Herzig et al.

Evidence from cognitive psychology suggests that understanding spatio-temporal object interactions and dynamics can be essential for recognizing actions in complex videos. Action recognition models are therefore expected to benefit from explicit modeling of objects, including their appearance, interactions, and dynamics. Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric spatio-temporal representations throughout multiple transformer layers. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" element applies self-attention over the patches and object regions. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate "Object-Dynamics Module", which captures trajectory interactions, and show how to integrate the two streams. We evaluate our model on standard and compositional action recognition on Something-Something V2, standard action recognition on EPIC-KITCHENS-100 and Diving48, and spatio-temporal action detection on AVA. We show strong performance improvements across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at https://roeiherz.github.io/ORViT/.
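To make the "Object-Region Attention" idea concrete, here is a minimal NumPy sketch (not the authors' implementation): object tokens are formed by pooling patch features inside each object's box, appended to the patch tokens, and passed through joint self-attention, so patches are enriched with contextualized object information. The pooling by simple box-region averaging, the single attention head, and the random projection matrices are simplifying assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, rng):
    # Single-head attention; random projections stand in for learned weights.
    d = tokens.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v

def object_region_attention(patch_tokens, boxes, grid, rng):
    """Append object tokens (here: mean-pooled patch features inside each
    box, a stand-in for learned region pooling) to the patch tokens and run
    joint self-attention; return only the enriched patch tokens."""
    H, W = grid
    feat = patch_tokens.reshape(H, W, -1)
    obj_tokens = []
    for (y0, x0, y1, x1) in boxes:           # boxes in patch-grid coordinates
        obj_tokens.append(feat[y0:y1, x0:x1].mean(axis=(0, 1)))
    tokens = np.concatenate([patch_tokens, np.stack(obj_tokens)], axis=0)
    out = self_attention(tokens, rng)
    return out[: patch_tokens.shape[0]]      # keep patch tokens only

rng = np.random.default_rng(0)
H = W = 4
d = 8
patches = rng.standard_normal((H * W, d))    # 4x4 patch grid, 8-dim features
boxes = [(0, 0, 2, 2), (2, 1, 4, 4)]         # two hypothetical object regions
out = object_region_attention(patches, boxes, (H, W), rng)
print(out.shape)  # (16, 8): patch tokens, now attended jointly with objects
```

In the full model this block operates per frame inside each transformer layer, and a separate trajectory stream (the Object-Dynamics Module) attends over box coordinates across time; the sketch above covers only the appearance stream.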

