Video Action Recognition with Attentive Semantic Units

03/17/2023
by   Yifei Chen, et al.

Visual-Language Models (VLMs) have significantly advanced video action recognition. Supervised by the semantics of action labels, recent works adapt the visual branch of VLMs to learn video representations. Despite the effectiveness demonstrated by these works, we believe that the potential of VLMs has yet to be fully harnessed. In light of this, we exploit the semantic units (SUs) hidden behind the action labels and leverage their correlations with fine-grained items in frames for more accurate action recognition. SUs are entities extracted from the language descriptions of the entire action set, including body parts, objects, scenes, and motions. To further enhance the alignment between visual content and the SUs, we introduce a multi-region module (MRA) to the visual branch of the VLM. The MRA allows the perception of region-aware visual features beyond the original global feature. Our method uses the visual features of frames to adaptively attend to and select relevant SUs. With a cross-modal decoder, the selected SUs serve to decode spatiotemporal video representations. In summary, SUs as the medium can boost both discriminative ability and transferability. Specifically, in fully-supervised learning, our method achieves 87.8% top-1 accuracy on Kinetics-400. In K=2 few-shot experiments, our method surpasses the previous state-of-the-art by +7.1
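The step in which frame features attend to and select semantic units can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes plain scaled dot-product attention between one frame feature and a small set of SU embeddings, and every name and value in it (attend_semantic_units, su_names, the 3-dimensional vectors) is hypothetical. The MRA module and the cross-modal decoder are not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_semantic_units(frame_feature, su_embeddings):
    """Weight each semantic unit by its similarity to one frame feature.

    Assumption: scaled dot-product attention stands in for whatever
    attention the paper actually uses.
    Returns (weights, attended), where attended is the weighted sum of
    the SU embeddings.
    """
    dim = len(frame_feature)
    scores = [
        sum(f * s for f, s in zip(frame_feature, su)) / math.sqrt(dim)
        for su in su_embeddings
    ]
    weights = softmax(scores)
    attended = [
        sum(w * su[i] for w, su in zip(weights, su_embeddings))
        for i in range(dim)
    ]
    return weights, attended


# Hypothetical semantic units: a body part, an object, and a scene.
su_names = ["hand", "guitar", "pool"]
su_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical frame feature that lies closest to the "guitar" embedding.
frame_feature = [0.2, 3.0, 0.1]

weights, attended = attend_semantic_units(frame_feature, su_embeddings)
top_su = su_names[weights.index(max(weights))]
print(top_su, [round(w, 3) for w in weights])
```

In this sketch the attention weights act as a soft selection: semantic units that match the frame contribute most to the attended vector, which a decoder could then consume as a query or context.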


