AudioViewer: Learning to Visualize Sound

12/22/2020
by Yuchi Zhang, et al.

Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound perception for hearing-impaired people, for instance, to facilitate feedback for training deaf speech. Unlike existing models that translate between speech and text or text and images, we target an immediate, low-level translation that applies to generic environmental sounds and human speech without delay. No canonical mapping is known for this artificial translation task. Our design translates from audio to video by compressing both into a common latent space with shared structure. Our core contribution is the development and evaluation of learned mappings that respect human perception limits and maximize user comfort by enforcing priors and combining strategies from unpaired image translation and disentanglement. We demonstrate qualitatively and quantitatively that our AudioViewer model preserves important audio features in the generated video, and that generated videos of faces and numbers are well suited for visualizing high-dimensional audio features: humans can easily parse them to match and distinguish between sounds, words, and speakers.
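The core design, mapping each audio frame through a shared latent space to a video frame, can be sketched minimally. The snippet below uses random linear maps as stand-ins for the learned encoder and decoder; all names, dimensions, and the linear architecture are illustrative assumptions, not the paper's actual trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 16          # size of the shared latent space (hypothetical)
AUDIO_DIM = 128          # one audio feature frame, e.g. a spectrogram column
FRAME_SHAPE = (32, 32)   # one generated video frame (hypothetical resolution)

# Random linear stand-ins for the learned audio encoder and video decoder.
W_enc = rng.standard_normal((LATENT_DIM, AUDIO_DIM)) * 0.1
W_dec = rng.standard_normal((FRAME_SHAPE[0] * FRAME_SHAPE[1], LATENT_DIM)) * 0.1

def encode_audio(x):
    """Compress one audio feature frame into the shared latent space."""
    return np.tanh(W_enc @ x)

def decode_video(z):
    """Decode a shared latent code into one video frame with values in [0, 1]."""
    flat = 1.0 / (1.0 + np.exp(-(W_dec @ z)))  # sigmoid keeps pixels in [0, 1]
    return flat.reshape(FRAME_SHAPE)

# One audio frame in, one video frame out, repeated per time step for low-delay translation.
audio_frame = rng.standard_normal(AUDIO_DIM)
frame = decode_video(encode_audio(audio_frame))
print(frame.shape)  # (32, 32)
```

Because each frame is translated independently through the shared latent code, the mapping stays immediate; the paper's contribution is in how that latent space is structured and regularized, which this sketch does not model.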


Related research

08/19/2018: Dynamic Temporal Alignment of Speech to Lips
Many speech segments in movies are re-recorded in a studio during postpr...

12/03/2022: A subjective study of the perceptual acceptability of audio-video desynchronization in sports videos
This paper presents the results of a study conducted on the perceptual a...

03/30/2023: Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
How does audio describe the world around us? In this paper, we propose a...

04/06/2020: Vocoder-Based Speech Synthesis from Silent Videos
Both acoustic and visual information influence human perception of speec...

09/05/2023: BWSNet: Automatic Perceptual Assessment of Audio Signals
This paper introduces BWSNet, a model that can be trained from raw human...

05/27/2020: Earballs: Neural Transmodal Translation
As is expressed in the adage "a picture is worth a thousand words", when...

04/07/2020: Direct Speech-to-image Translation
Direct speech-to-image translation without text is an interesting and us...
