Estimating Visual Information From Audio Through Manifold Learning

08/03/2022
by Fabrizio Pedersoli, et al.

We propose a new framework for extracting visual information about a scene using only audio signals. Audio-based methods can overcome some of the limitations of vision-based methods: they do not require "line-of-sight", are robust to occlusions and changes in illumination, and can function as a backup in case vision/lidar sensors fail. Therefore, audio-based methods can be useful even for applications in which only visual information is of interest. Our framework is based on Manifold Learning and consists of two steps. First, we train a Vector-Quantized Variational Auto-Encoder to learn the data manifold of the particular visual modality we are interested in. Second, we train an Audio Transformation network to map multi-channel audio signals to the latent representation of the corresponding visual sample. Using a publicly available audio/visual dataset, we show that our method is able to produce meaningful images from audio. In particular, we consider the prediction of two visual modalities from audio: depth and semantic segmentation. We hope the findings of our work can facilitate further research in visual information extraction from audio. Code is available at: https://github.com/ubc-vision/audio_manifold.
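To make the two-step idea concrete, the sketch below illustrates only the vector-quantization lookup at the core of a Vector-Quantized Variational Auto-Encoder: each continuous latent vector is snapped to its nearest codebook entry. This is a toy illustration and not the authors' implementation. The codebook values, the `predicted_latents` list (a stand-in for what an audio network might output), and the helper names `squared_distance`, `quantize`, and `lookup` are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """For each latent vector, return the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def lookup(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Replace each index with a copy of the codebook vector it refers to."""
    return [list(codebook[i]) for i in indices]


# Hypothetical toy codebook: three 2-D entries. In a trained model the
# codebook would be learned from the visual modality (assumption).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Hypothetical stand-in for latents predicted from audio; a real system
# would produce these with a trained network (assumption).
predicted_latents = [[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]]

indices = quantize(predicted_latents, codebook)
quantized = lookup(indices, codebook)

print(indices)
print(quantized)
```

In a full pipeline, the quantized latents would then be passed to the decoder of the auto-encoder to produce the image; that part is omitted here because the abstract does not describe its details.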

research
08/21/2023

Audio-Visual Class-Incremental Learning

In this paper, we introduce audio-visual class-incremental learning, a c...
research
12/15/2022

Vision Transformers are Parameter-Efficient Audio-Visual Learners

Vision transformers (ViTs) have achieved impressive results on various c...
research
06/12/2021

Multi-level Attention Fusion Network for Audio-visual Event Recognition

Event classification is inherently sequential and multimodal. Therefore,...
research
11/29/2021

AVA-AVD: Audio-visual Speaker Diarization in the Wild

Audio-visual speaker diarization aims at detecting “who spoke when” usi...
research
10/24/2019

Vision-Infused Deep Audio Inpainting

Multi-modality perception is essential to develop interactive intelligen...
research
04/06/2023

A Closer Look at Audio-Visual Semantic Segmentation

Audio-visual segmentation (AVS) is a complex task that involves accurate...
research
11/15/2021

Beyond Mono to Binaural: Generating Binaural Audio from Mono Audio with Depth and Cross Modal Attention

Binaural audio gives the listener an immersive experience and can enhanc...
