AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

05/22/2023
by Guy Yariv, et al.

In recent years, image generation has made a great leap in performance, with diffusion models playing a central role. Although they generate high-quality images, such models are mainly conditioned on textual descriptions. This raises the question: "How can we adapt such models to be conditioned on other modalities?" In this paper, we propose a novel method that uses latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered an adaptation layer between the audio and text representations. Such a modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest the proposed method is superior to the evaluated baseline methods in terms of both objective and subjective metrics. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
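
The idea of encoding audio into a single token that sits alongside text embeddings can be illustrated with a toy sketch. This is not the authors' implementation: the mean pooling, the single linear layer, and the names `mean_pool`, `AudioTokenAdapter`, and `inject_audio_token` are all assumptions made for illustration, and the dimensions are arbitrary. It only shows why such an adapter can have few trainable parameters when the audio encoder and the text-to-image model are kept frozen.

```python
import random


def mean_pool(frames):
    """Average a list of per-frame feature vectors into a single vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


class AudioTokenAdapter:
    """Hypothetical adapter: one linear layer from audio-feature space to
    text-embedding space. Only these weights would be trained; the audio
    encoder and the diffusion model are assumed to stay frozen."""

    def __init__(self, audio_dim, text_dim, seed=0):
        rng = random.Random(seed)
        # weight has shape (text_dim, audio_dim); bias has shape (text_dim,)
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(audio_dim)]
                       for _ in range(text_dim)]
        self.bias = [0.0] * text_dim

    def num_parameters(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)

    def __call__(self, frames):
        pooled = mean_pool(frames)
        return [sum(w * x for w, x in zip(row, pooled)) + b
                for row, b in zip(self.weight, self.bias)]


def inject_audio_token(prompt_embeddings, placeholder_index, audio_token):
    """Return a copy of the prompt embeddings with the placeholder slot
    replaced by the audio-derived token embedding."""
    out = [list(vec) for vec in prompt_embeddings]
    out[placeholder_index] = list(audio_token)
    return out


# Toy stand-ins: 5 audio frames of dimension 8, a 3-token prompt of dimension 4.
audio_dim, text_dim = 8, 4
rng = random.Random(1)
frames = [[rng.random() for _ in range(audio_dim)] for _ in range(5)]
prompt = [[0.0] * text_dim for _ in range(3)]

adapter = AudioTokenAdapter(audio_dim, text_dim)
token = adapter(frames)
conditioned = inject_audio_token(prompt, 2, token)
```

In this sketch the only trainable quantities are the adapter's weight and bias (`audio_dim * text_dim + text_dim` values), which is the sense in which such a design is lightweight. The conditioned embedding sequence would then be passed to a frozen text-conditioned generator in place of the plain text embeddings.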

research
05/18/2023

Discriminative Diffusion Models as Few-shot Vision and Language Learners

Diffusion models, such as Stable Diffusion, have shown incredible perfor...
research
06/20/2023

Align, Adapt and Inject: Sound-guided Unified Image Generation

Text-guided image generation has witnessed unprecedented progress due to...
research
01/30/2023

ArchiSound: Audio Generation with Diffusion

The recent surge in popularity of diffusion models for image generation ...
research
02/08/2023

Noise2Music: Text-conditioned Music Generation with Diffusion Models

We introduce Noise2Music, where a series of diffusion models is trained ...
research
08/31/2023

Generate Your Own Scotland: Satellite Image Generation Conditioned on Maps

Despite recent advancements in image generation, diffusion models still ...
research
09/14/2023

DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

Generating realistic talking faces is a complex and widely discussed tas...
research
07/24/2023

Interpolating between Images with Diffusion Models

One little-explored frontier of image generation and editing is the task...
