AudioViewer: Learning to Visualize Sound

by Yuchi Zhang, et al.

Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound perception for hearing-impaired people, for instance, to provide feedback during speech training for deaf people. Unlike existing models that translate between speech and text or between text and images, we target an immediate, low-level translation that applies to generic environment sounds and human speech without delay. No canonical mapping is known for this artificial translation task. Our design translates from audio to video by compressing both into a common latent space with shared structure. Our core contribution is the development and evaluation of learned mappings that respect the limits of human perception and maximize user comfort by enforcing priors and combining strategies from unpaired image translation and disentanglement. We demonstrate qualitatively and quantitatively that our AudioViewer model maintains important audio features in the generated video, and that generated videos of faces and numbers are well suited for visualizing high-dimensional audio features, since humans can easily parse them to match and distinguish between sounds, words, and speakers.




Dynamic Temporal Alignment of Speech to Lips

Many speech segments in movies are re-recorded in a studio during postpr...

A subjective study of the perceptual acceptability of audio-video desynchronization in sports videos

This paper presents the results of a study conducted on the perceptual a...

Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

How does audio describe the world around us? In this paper, we propose a...

Vocoder-Based Speech Synthesis from Silent Videos

Both acoustic and visual information influence human perception of speec...

BWSNet: Automatic Perceptual Assessment of Audio Signals

This paper introduces BWSNet, a model that can be trained from raw human...

Earballs: Neural Transmodal Translation

As is expressed in the adage "a picture is worth a thousand words", when...

Direct Speech-to-image Translation

Direct speech-to-image translation without text is an interesting and us...
