Align, Adapt and Inject: Sound-guided Unified Image Generation

06/20/2023
by   Yue Yang, et al.

Text-guided image generation has witnessed unprecedented progress due to the development of diffusion models. Beyond text and image, sound is a vital element within the sphere of human perception, offering vivid representations and naturally coinciding with corresponding scenes. Taking advantage of sound therefore presents a promising avenue for image generation research. However, the relationship between audio and image supervision remains significantly underexplored, and the scarcity of related, high-quality datasets poses further obstacles. In this paper, we propose a unified framework, 'Align, Adapt, and Inject' (AAI), for sound-guided image generation, editing, and stylization. In particular, our method adapts an input sound into a sound token that, like an ordinary word, can be plugged into existing powerful diffusion-based Text-to-Image (T2I) models. Specifically, we first train a multi-modal encoder to align the audio representation with the pre-trained textual and visual manifolds, respectively. We then propose an audio adapter that converts the audio representation into an audio token enriched with specific semantics, which can be flexibly injected into a frozen T2I model. In this way, we extract the dynamic information of varied sounds while exploiting the capability of existing T2I models, enabling sound-guided image generation, editing, and stylization in a convenient and cost-effective manner. Experimental results confirm that our proposed AAI outperforms other text- and sound-guided state-of-the-art methods, and our aligned multi-modal encoder is also competitive with other approaches on audio-visual and audio-text retrieval tasks.
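The abstract describes the "adapt and inject" step only at a high level. The sketch below illustrates one plausible reading of it in PyTorch: an adapter maps an aligned audio embedding to a single pseudo-word token in the text-embedding space of a frozen T2I model, and that token is spliced into the frozen text encoder's output sequence at a placeholder slot. The module names, dimensions (512-d audio embedding, 77x768 CLIP text sequence), and the placeholder mechanism are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Hypothetical adapter: maps an aligned audio embedding to a
    pseudo-word ("sound token") in the T2I text-embedding space."""
    def __init__(self, audio_dim: int = 512, token_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        # audio_emb: (batch, audio_dim) -> sound token: (batch, 1, token_dim)
        return self.proj(audio_emb).unsqueeze(1)

def inject_sound_token(text_embs: torch.Tensor,
                       sound_token: torch.Tensor,
                       position: int) -> torch.Tensor:
    """Replace the embedding at a reserved placeholder slot (e.g. "*")
    with the sound token, keeping the sequence length unchanged."""
    return torch.cat(
        [text_embs[:, :position], sound_token, text_embs[:, position + 1:]],
        dim=1,
    )

if __name__ == "__main__":
    adapter = AudioAdapter()
    audio_emb = torch.randn(1, 512)      # output of the aligned audio encoder (assumed)
    text_embs = torch.randn(1, 77, 768)  # frozen CLIP text-encoder output (assumed)
    conditioned = inject_sound_token(text_embs, adapter(audio_emb), position=5)
    print(conditioned.shape)             # torch.Size([1, 77, 768])

In such a setup only the audio adapter (and possibly the audio encoder) would be trained, while the T2I diffusion model stays frozen, which matches the "convenient and cost-effective" claim in the abstract.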


