Temporal and cross-modal attention for audio-visual zero-shot learning

07/20/2022
by Otniel-Bogdan Mercea et al.

Audio-visual generalised zero-shot learning for video classification requires understanding the relations between audio and visual information in order to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual streams in videos can be exploited to learn powerful representations that generalise to unseen classes at test time. We propose a multi-modal and Temporal Cross-Attention Framework (TCAF) for audio-visual generalised zero-shot learning. Its inputs are temporally aligned audio and visual features obtained from pre-trained networks. Encouraging the framework to focus on cross-modal correspondence across time, instead of self-attention within each modality, boosts performance significantly. We show that our proposed framework, which ingests temporal features, yields state-of-the-art performance on three benchmarks for (generalised) zero-shot learning. Code for reproducing all results is available at <https://github.com/ExplainableML/TCAF-GZSL>.
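The core idea of cross-modal attention across time can be sketched as follows. This is a minimal illustrative PyTorch block, not the authors' TCAF implementation: audio tokens act as queries over visual keys/values and vice versa, rather than each modality attending to itself. All module names, dimensions, and hyperparameters here are assumptions for illustration.

```python
# Illustrative sketch of cross-modal (audio<->visual) attention across time.
# NOT the authors' TCAF code; names and dimensions are assumed for the example.
import torch
import torch.nn as nn

class CrossModalAttentionBlock(nn.Module):
    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        # One attention module per direction: audio->visual and visual->audio.
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, time, dim) temporally aligned feature sequences
        # from pre-trained backbones. Audio queries attend to visual keys/values
        # (cross-modal correspondence), and vice versa -- no self-attention
        # within a single modality.
        a_out, _ = self.a2v(query=audio, key=visual, value=visual)
        v_out, _ = self.v2a(query=visual, key=audio, value=audio)
        # Residual connection + layer norm, as is standard in Transformer blocks.
        return self.norm_a(audio + a_out), self.norm_v(visual + v_out)

# Usage with dummy temporally aligned features (2 clips, 10 time steps, 64 dims).
block = CrossModalAttentionBlock(dim=64, num_heads=4)
audio = torch.randn(2, 10, 64)
visual = torch.randn(2, 10, 64)
a, v = block(audio, visual)
print(a.shape, v.shape)  # torch.Size([2, 10, 64]) torch.Size([2, 10, 64])
```

Swapping the `query`/`key` arguments is what distinguishes this from per-modality self-attention; stacking several such blocks would let information flow back and forth between the modalities over time.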


Related research

03/07/2022
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
Learning to classify video data from classes not included in the trainin...

11/22/2022
On the Transferability of Visual Features in Generalized Zero-Shot Learning
Generalized Zero-Shot Learning (GZSL) aims to train a classifier that ca...

03/27/2023
Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning
Recent compositional zero-shot learning (CZSL) methods adapt pre-trained...

05/19/2023
Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment
Pre-trained vision-language models have inspired much research on few-sh...

03/03/2020
Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications
Trained on large datasets, deep learning (DL) can accurately classify vi...

05/03/2022
Cross-modal Representation Learning for Zero-shot Action Recognition
We present a cross-modal Transformer-based framework, which jointly enco...

08/24/2023
Hyperbolic Audio-visual Zero-shot Learning
Audio-visual zero-shot learning aims to classify samples consisting of a...
