Cross-modal Representation Learning for Zero-shot Action Recognition

05/03/2022
by   Chung-Ching Lin, et al.

We present a cross-modal Transformer-based framework that jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline in which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner. The model design provides a natural mechanism for visual and semantic representations to be learned in a shared knowledge space, which encourages the learned visual embedding to be discriminative and more semantically consistent. For zero-shot inference, we devise a simple semantic transfer scheme that embeds semantic relatedness between seen and unseen classes to composite unseen visual prototypes. The discriminative features in the visual structure can thereby be preserved and exploited to alleviate the typical zero-shot issues of information loss, the semantic gap, and the hubness problem. Under a rigorous zero-shot setting with no pre-training on additional datasets, the experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets. Code will be made available.
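To make the semantic transfer scheme concrete, below is a minimal sketch of one plausible instantiation: unseen-class visual prototypes are composited as a similarity-weighted mixture of seen-class prototypes, with weights derived from the cosine similarity of their label embeddings. The function names, the temperature parameter tau, and the softmax weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def composite_unseen_prototypes(seen_protos, seen_label_emb, unseen_label_emb, tau=0.1):
    # seen_protos:      [S, D] visual prototypes of seen classes
    # seen_label_emb:   [S, E] text embeddings of seen class labels
    # unseen_label_emb: [U, E] text embeddings of unseen class labels
    # Semantic relatedness between unseen and seen labels (cosine similarity): [U, S]
    sim = F.normalize(unseen_label_emb, dim=-1) @ F.normalize(seen_label_emb, dim=-1).T
    # A softmax with temperature tau (an assumed choice) turns relatedness
    # into mixture weights over the seen classes.
    weights = F.softmax(sim / tau, dim=-1)
    # Composite unseen prototypes in the same visual space as the seen ones: [U, D]
    return weights @ seen_protos

def zero_shot_classify(video_emb, unseen_protos):
    # Nearest-prototype classification by cosine similarity; returns [N] class indices.
    scores = F.normalize(video_emb, dim=-1) @ F.normalize(unseen_protos, dim=-1).T
    return scores.argmax(dim=-1)

# Toy usage with random tensors (shapes only; no real data).
S, U, D, E, N = 51, 50, 512, 300, 4
unseen_protos = composite_unseen_prototypes(
    torch.randn(S, D), torch.randn(S, E), torch.randn(U, E))
predictions = zero_shot_classify(torch.randn(N, D), unseen_protos)
```

Because the unseen prototypes are built directly from seen visual prototypes, the discriminative structure of the learned visual space carries over to the unseen classes, which is how the scheme aims to mitigate information loss and hubness.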


Related research

07/27/2021 · A New Split for Evaluating True Zero-Shot Action Recognition
Zero-shot action recognition is the task of classifying action categorie...

07/26/2021 · Towards the Unseen: Iterative Text Recognition by Distilling from Errors
Visual text recognition is undoubtedly one of the most extensively resea...

07/20/2022 · Temporal and cross-modal attention for audio-visual zero-shot learning
Audio-visual generalised zero-shot learning for video classification req...

03/29/2022 · Alignment-Uniformity aware Representation Learning for Zero-shot Video Classification
Most methods tackle zero-shot video classification by aligning visual-se...

10/16/2018 · Cross-Modal and Hierarchical Modeling of Video and Text
Visual data and text data are composed of information at multiple granul...

04/10/2017 · Semantically Consistent Regularization for Zero-Shot Recognition
The role of semantics in zero-shot learning is considered. The effective...

11/13/2020 · Transductive Zero-Shot Learning using Cross-Modal CycleGAN
In Computer Vision, Zero-Shot Learning (ZSL) aims at classifying unseen ...
