Text-to-feature diffusion for audio-visual few-shot learning

09/07/2023, by Otniel-Bogdan Mercea et al.

Training deep learning models for video classification from audio-visual data typically requires vast amounts of labeled training data collected via a costly process. Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup. In particular, the inherently multi-modal nature of video data, with both sound and visual information, has not been leveraged extensively for few-shot video classification. We therefore introduce a unified audio-visual few-shot video classification benchmark on three datasets (VGGSound-FSL, UCF-FSL, and ActivityNet-FSL), on which we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework that first fuses temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at https://github.com/ExplainableML/AVDIFF-GFSL.
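The two-stage pipeline the abstract describes (cross-modal attention fusion of temporal audio and visual features, then text-conditioned generation of multi-modal features for novel classes) can be sketched in simplified form. The code below is an illustration only, not the authors' implementation: all function names, feature shapes, and the toy denoising loop are assumptions, and a real diffusion model would learn a noise-prediction network rather than interpolate toward the conditioning signal.

```python
import numpy as np

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: one modality attends to the other."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (T_q, T_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ values                          # (T_q, d)

def fuse_audio_visual(audio, visual):
    """Fuse temporal audio (T_a, d) and visual (T_v, d) features with
    bidirectional cross-modal attention, then mean-pool over time."""
    a2v = cross_modal_attention(audio, visual, visual)  # audio attends to visual
    v2a = cross_modal_attention(visual, audio, audio)   # visual attends to audio
    return np.concatenate([a2v.mean(axis=0), v2a.mean(axis=0)])  # (2*d,)

def generate_features(text_emb, fused_ctx, steps=10, rng=None):
    """Toy text-conditioned denoising loop: start from Gaussian noise and
    iteratively move toward the text + audio-visual conditioning signal.
    (A trained diffusion model would predict and subtract noise instead.)"""
    rng = np.random.default_rng(0) if rng is None else rng
    target = np.concatenate([text_emb, fused_ctx])
    x = rng.standard_normal(target.shape)
    for t in range(steps):
        alpha = (t + 1) / steps
        x = (1 - alpha) * x + alpha * target         # crude denoising step
    return x
```

Generated features for a novel class could then be used to train a lightweight classifier alongside the few labeled real examples, which is the usual role of feature generation in (generalised) few-shot learning.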


Related research

- 03/07/2022: Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
  Learning to classify video data from classes not included in the trainin...
- 10/18/2021: Who calls the shots? Rethinking Few-Shot Learning for Audio
  Few-shot learning aims to train models that can recognize novel classes ...
- 06/17/2017: Truly Multi-modal YouTube-8M Video Classification with Video, Audio, and Text
  The YouTube-8M video classification challenge requires teams to classify...
- 11/16/2022: Few-shot Learning for Multi-modal Social Media Event Filtering
  Social media has become an important data source for event analysis. Whe...
- 10/11/2022: Match Cutting: Finding Cuts with Smooth Visual Transitions
  A match cut is a transition between a pair of shots that uses similar fr...
- 05/19/2023: Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment
  Pre-trained vision-language models have inspired much research on few-sh...
- 08/24/2023: Hyperbolic Audio-visual Zero-shot Learning
  Audio-visual zero-shot learning aims to classify samples consisting of a...
