How can objects help action recognition?

06/20/2023
by Xingyi Zhou, et al.

Current state-of-the-art video models process a video clip as a long sequence of spatio-temporal tokens. However, they do not explicitly model objects or their interactions across the video, and instead process all tokens uniformly. In this paper, we investigate how knowledge of objects can be used to design better video models, namely to process fewer tokens and to improve recognition accuracy. This is in contrast to prior works, which either drop tokens at the cost of accuracy or increase accuracy at the cost of additional computation. First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens with minimal impact on accuracy. Second, we propose an object-aware attention module that enriches our feature representation with object information and improves overall accuracy. Our resulting framework achieves better performance than strong baselines while using fewer tokens. In particular, we match our baseline while using only about 30% of the input tokens on Something-Something v2 and Epic-Kitchens. When our model processes the same number of tokens as the baseline, we improve by 0.6 to 4.2 points on these datasets.
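The object-guided token sampling idea can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual method: the scoring rule (a patch counts as an "object token" if its center falls inside a detected bounding box) and the function names are assumptions for the example.

```python
import numpy as np

def object_guided_token_scores(patch_centers, boxes):
    """Score each patch token: 1.0 if its center lies inside any detected
    object box, else 0.0. (Hypothetical scoring rule for illustration.)"""
    scores = np.zeros(len(patch_centers))
    for i, (x, y) in enumerate(patch_centers):
        for (x0, y0, x1, y1) in boxes:
            if x0 <= x <= x1 and y0 <= y <= y1:
                scores[i] = 1.0
                break
    return scores

def sample_tokens(tokens, scores, keep_ratio=0.3):
    """Keep only the top keep_ratio fraction of tokens, prioritising
    object tokens, and preserve the original token order."""
    k = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(-scores)   # object tokens first
    keep = np.sort(order[:k])     # restore spatial order of kept tokens
    return np.asarray(tokens)[keep], keep

# Example: a 4x4 grid of 16x16 patches over a 64x64 frame, with one
# detected object covering the top-left quadrant.
centers = [(8 + 16 * c, 8 + 16 * r) for r in range(4) for c in range(4)]
scores = object_guided_token_scores(centers, [(0, 0, 32, 32)])
kept, idx = sample_tokens(np.arange(16), scores, keep_ratio=0.25)
# Only the four patches overlapping the object survive.
```

The key design point the abstract describes is that the sampling is guided by objects rather than learned saliency, so the retained tokens cover the regions most relevant to actions even at aggressive keep ratios.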


Related research

10/14/2022
STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition
In action recognition, although the combination of spatio-temporal video...

08/08/2023
Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation
Transformers have become the primary backbone of the computer vision com...

11/25/2022
Interaction Visual Transformer for Egocentric Action Anticipation
Human-object interaction is one of the most important visual cues that h...

10/05/2022
Phenaki: Variable Length Video Generation From Open Domain Textual Description
We present Phenaki, a model capable of realistic video synthesis, given ...

05/07/2023
Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens
Transformer models are foundational to natural language processing (NLP)...

04/17/2023
Efficient Video Action Detection with Token Dropout and Context Refinement
Streaming video clips with large-scale video tokens impede vision transf...

06/13/2022
Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
Recent action recognition models have achieved impressive results by int...
