AudioViewer: Learning to Visualize Sound

by Yuchi Zhang, et al.

Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound perception for hearing-impaired people, for instance to provide feedback during speech training for deaf people. Unlike existing models that translate between speech and text or between text and images, we target an immediate, low-level translation that applies to generic environment sounds and human speech without delay. No canonical mapping is known for this artificial translation task. Our design translates from audio to video by compressing both into a common latent space with shared structure. Our core contribution is the development and evaluation of learned mappings that respect human perception limits and maximize user comfort by enforcing priors and combining strategies from unpaired image translation and disentanglement. We demonstrate qualitatively and quantitatively that our AudioViewer model maintains important audio features in the generated video, and that generated videos of faces and numbers are well suited for visualizing high-dimensional audio features, since humans can easily parse them to match and distinguish between sounds, words, and speakers.


Dynamic Temporal Alignment of Speech to Lips

Many speech segments in movies are re-recorded in a studio during postpr...

Vocoder-Based Speech Synthesis from Silent Videos

Both acoustic and visual information influence human perception of speec...

Video-to-Video Translation for Visual Speech Synthesis

Despite remarkable success in image-to-image translation that celebrates...

Earballs: Neural Transmodal Translation

As is expressed in the adage "a picture is worth a thousand words", when...

Large-scale multilingual audio visual dubbing

We describe a system for large-scale audiovisual translation and dubbing...

Direct Speech-to-image Translation

Direct speech-to-image translation without text is an interesting and us...