CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

12/14/2022
by   Hao-Wen Dong, et al.

Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.
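
The conditioning mechanism described in the abstract can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration, not the authors' implementation: the `QueriedSeparator` module, its layer sizes, and the feature-modulation scheme are hypothetical stand-ins for the paper's audio separation network, while the CLIP calls follow the openai/clip package (ViT-B/32, 512-dimensional embeddings). Training would embed a video frame with `encode_image`; the zero-shot text-query path is shown.

```python
# Minimal sketch of CLIP-queried sound separation (assumptions noted above).
import torch
import clip  # openai/clip package

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()  # CLIP stays frozen; only the separator would be trained


class QueriedSeparator(torch.nn.Module):
    """Hypothetical mask-based separator conditioned on a CLIP query vector."""

    def __init__(self, query_dim=512, channels=64):
        super().__init__()
        # Stand-in for the audio U-Net: extract features from the mixture
        # spectrogram, modulate them with the projected query, predict a mask.
        self.audio_net = torch.nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.query_proj = torch.nn.Linear(query_dim, channels)
        self.mask_head = torch.nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, mix_spec, query_vec):
        # mix_spec: (B, 1, F, T) magnitude spectrogram of the mixture
        feats = self.audio_net(mix_spec)                  # (B, C, F, T)
        q = self.query_proj(query_vec)[:, :, None, None]  # (B, C, 1, 1)
        mask = torch.sigmoid(self.mask_head(feats * q))   # (B, 1, F, T)
        return mask * mix_spec                            # estimated target


separator = QueriedSeparator().to(device)

with torch.no_grad():
    # Training-time query (image): q = clip_model.encode_image(preprocess(frame)...)
    # Test-time query (text), zero-shot via the joint language-image embedding:
    tokens = clip.tokenize(["a dog barking"]).to(device)
    q_text = clip_model.encode_text(tokens).float()       # (1, 512)

mix_spec = torch.rand(1, 1, 512, 256, device=device)      # dummy mixture spectrogram
est_spec = separator(mix_spec, q_text)
```

Because CLIP maps images and text into a shared embedding space, swapping the image query used during training for a text query at test time requires no retraining, which is what enables the zero-shot text-queried separation described above.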


Related research

06/16/2023 · CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models
Recent work has studied text-to-audio synthesis using large amounts of p...

04/12/2022 · Text-Driven Separation of Arbitrary Sounds
We propose a method of separating a desired sound source from a single-c...

11/02/2020 · Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
Recent progress in deep learning has enabled many advances in sound sepa...

12/15/2021 · Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data
Deep learning techniques for separating audio into different sound sourc...

06/17/2021 · Improving On-Screen Sound Separation for Open Domain Videos with Audio-Visual Self-attention
We introduce a state-of-the-art audio-visual on-screen sound separation ...

07/20/2022 · AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-...

03/20/2021 · Overprotective Training Environments Fall Short at Testing Time: Let Models Contribute to Their Own Training
Despite important progress, conversational systems often generate dialog...
