CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

06/16/2023
by   Hao-Wen Dong, et al.

Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled videos and pretrained language-vision models. We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model. At test time, we first explore performing a zero-shot modality transfer and condition the diffusion model with a CLIP-encoded text query. However, we observe a noticeable performance drop with respect to image queries. To close this gap, we further adopt a pretrained diffusion prior model to generate a CLIP image embedding given a CLIP text embedding. Our results show the effectiveness of the proposed method, and that the pretrained diffusion prior can reduce the modality transfer gap. While we focus on text-to-audio synthesis, the proposed model can also generate audio from image queries, and it shows competitive performance against a state-of-the-art image-to-audio synthesis model in a subjective listening test. This study offers a new direction for approaching text-to-audio synthesis that leverages the naturally occurring audio-visual correspondence in videos and the power of pretrained language-vision models.
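The two test-time conditioning paths described above can be sketched in code. Everything below is a toy stand-in, not the actual CLIPSonic implementation: the encoder, diffusion prior, and audio diffusion model are hypothetical placeholder functions that only illustrate how a CLIP text embedding is either (1) fed directly to the audio diffusion model (zero-shot modality transfer) or (2) first mapped to a CLIP image embedding by a pretrained diffusion prior to reduce the modality gap.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8  # toy embedding size; real CLIP embeddings are e.g. 512-dimensional


def clip_text_encode(text: str) -> np.ndarray:
    """Hypothetical stand-in for the CLIP text encoder.

    Returns a unit-norm embedding, mimicking CLIP's normalized output.
    """
    vec = rng.standard_normal(EMB_DIM)
    return vec / np.linalg.norm(vec)


def diffusion_prior(text_emb: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the pretrained diffusion prior that maps a
    CLIP text embedding to a CLIP image embedding (the gap-bridging path).
    """
    img_emb = text_emb + 0.1 * rng.standard_normal(EMB_DIM)
    return img_emb / np.linalg.norm(img_emb)


def audio_diffusion_sample(cond_emb: np.ndarray, n_samples: int = 16) -> np.ndarray:
    """Hypothetical stand-in for the conditional audio diffusion model, which
    during training is conditioned on CLIP embeddings of video frames.
    """
    return rng.standard_normal(n_samples) * float(np.abs(cond_emb).mean())


text_emb = clip_text_encode("dog barking")

# Path 1: zero-shot modality transfer -- condition directly on the text embedding.
audio_zero_shot = audio_diffusion_sample(text_emb)

# Path 2: bridge the modality gap -- map text embedding to an image embedding first.
audio_with_prior = audio_diffusion_sample(diffusion_prior(text_emb))
```

The key design point is that the audio diffusion model only ever sees CLIP image embeddings during training; at inference, path 2 keeps the conditioning input in-distribution, which the paper reports narrows the gap observed with path 1.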


Related research

12/14/2022  CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
  Recent years have seen progress beyond domain-specific sound separation ...

01/30/2023  Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
  Large-scale multimodal generative modeling has created milestones in tex...

05/06/2023  AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion
  Recent advances in diffusion models have showcased promising results in ...

04/25/2022  Retrieval-Augmented Diffusion Models
  Generative image synthesis with diffusion models has recently achieved e...

06/29/2023  Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
  The Video-to-Audio (V2A) model has recently gained attention for its pra...

06/05/2022  Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval
  Multi-channel video-language retrieval requires models to understand info...

09/15/2023  AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder
  Learning high-quality video representation has shown significant applica...
