Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

03/30/2023
by   Kim Sung-Bin, et al.
0

How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities despite their information gaps. The key idea is to enrich the audio features with visual information by learning to align audio to visual latent space. We translate the input audio to visual features, then use a pre-trained generator to produce an image. To further improve the quality of our generated images, we use sound source localization to select the audio-visual pairs that have strong cross-modal correlations. We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches. We also show that we can control our model's predictions by applying simple manipulations to the input waveform, or to the latent space.

READ FULL TEXT

page 5

page 7

page 8

page 13

page 14

page 15

page 16

page 17

research
09/19/2023

Sound Source Localization is All about Cross-Modal Alignment

Humans can easily perceive the direction of sound sources in a visual sc...
research
05/03/2023

AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation

Segment Anything Model (SAM) has recently shown its powerful effectivene...
research
05/10/2022

Learning Visual Styles from Audio-Visual Associations

From the patter of rain to the crunch of snow, the sounds we hear often ...
research
01/26/2020

Curriculum Audiovisual Learning

Associating sound and its producer in complex audiovisual scene is a cha...
research
08/18/2023

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Building artificial intelligence (AI) systems on top of a set of foundat...
research
12/22/2020

AudioViewer: Learning to Visualize Sound

Sensory substitution can help persons with perceptual deficits. In this ...
research
11/21/2022

LISA: Localized Image Stylization with Audio via Implicit Neural Representation

We present a novel framework, Localized Image Stylization with Audio (LI...

Please sign up or login with your details

Forgot password? Click here to reset