Generating Realistic Images from In-the-wild Sounds

09/05/2023
by Taegyeong Lee, et al.

Representing in-the-wild sounds as images is an important but challenging task because paired sound-image datasets are scarce and the two modalities differ significantly in their characteristics. Previous studies have focused on generating images from sounds in limited categories or from music. In this paper, we propose a novel approach to generating images from in-the-wild sounds. First, we convert sound into text using audio captioning. Second, we propose audio attention and sentence attention to represent the rich characteristics of sound and to visualize it. Lastly, we propose direct sound optimization with CLIPscore and AudioCLIP, and we generate images with a diffusion-based model. Experiments show that our model generates high-quality images from wild sounds and outperforms baselines in both quantitative and qualitative evaluations on wild audio datasets.
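The abstract names CLIPscore as one of the optimization signals but does not say how it is computed. As a rough illustration only, the sketch below follows the commonly cited definition of CLIPScore (a rescaled, clipped cosine similarity between an image embedding and a text embedding, with weight w = 2.5). The function names and the toy embedding vectors are hypothetical, and nothing here is taken from the authors' implementation; in practice the embeddings would come from a CLIP encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style value: w * max(cos(image, text), 0).

    Assumption: w = 2.5 follows the usual CLIPScore definition;
    the abstract itself does not specify it.
    """
    return w * max(cosine(image_emb, text_emb), 0.0)


# Toy embeddings standing in for CLIP outputs (hypothetical values).
image_emb = [0.6, 0.8, 0.0]
caption_emb = [0.6, 0.8, 0.0]
unrelated_emb = [0.0, 0.0, 1.0]

print(clip_score(image_emb, caption_emb))
print(clip_score(image_emb, unrelated_emb))
```

A higher value means the generated image and the caption derived from the audio agree more closely in the embedding space, which is why a score of this kind can serve as an optimization signal.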

research
05/11/2023

V2Meow: Meowing to the Visual Beat via Music Generation

Generating high quality music that complements the visual content of a v...
research
02/13/2022

Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

The task of audio-visual sound source localization has been well studied...
research
05/10/2022

Learning Visual Styles from Audio-Visual Associations

From the patter of rain to the crunch of snow, the sounds we hear often ...
research
09/24/2021

From images in the wild to video-informed image classification

Image classifiers work effectively when applied on structured images, ye...
research
08/17/2023

Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

Finding the right sound effects (SFX) to match moments in a video is a d...
research
08/18/2023

Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization

The task of lip synchronization (lip-sync) seeks to match the lips of hu...
research
05/03/2023

Diverse and Vivid Sound Generation from Text Descriptions

Previous audio generation mainly focuses on specified sound classes such...
