Language-Guided Audio-Visual Source Separation via Trimodal Consistency

03/28/2023
by   Reuben Tan, et al.

We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object with its visual features and the corresponding components of the audio waveform, all without access to annotations during training. To overcome this challenge, we adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions, which encourage a stronger alignment between the audio, visual and natural language modalities. During inference, our approach can separate sounds given text, video and audio input, or given text and audio input alone. We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets (MUSIC, SOLOS and AudioSet), where we outperform state-of-the-art strongly supervised approaches despite not using object detectors or text labels during training.

research
04/05/2018

Learning to Separate Object Sounds by Watching Unlabeled Video

Perceiving a scene most fully requires all the senses. Yet modeling how ...
research
08/10/2020

Self-Supervised Learning of Audio-Visual Objects from Video

Our objective is to transform a video into a set of discrete audio-visua...
research
07/21/2020

SLNSpeech: solving extended speech separation problem by the help of sign language

A speech separation task can be roughly divided into audio-only separati...
research
03/28/2022

Separate What You Describe: Language-Queried Audio Source Separation

In this paper, we introduce the task of language-queried audio source se...
research
05/15/2021

Move2Hear: Active Audio-Visual Source Separation

We introduce the active audio-visual source separation problem, where an...
research
07/14/2022

Audio-guided Album Cover Art Generation with Genetic Algorithms

Over 60,000 songs are released on Spotify every day, and the competition...
research
05/12/2023

Hear to Segment: Unmixing the Audio to Guide the Semantic Segmentation

In this paper, we focus on a recently proposed novel task called Audio-V...
