Beyond Mono to Binaural: Generating Binaural Audio from Mono Audio with Depth and Cross Modal Attention

11/15/2021
by Kranti Kumar Parida, et al.

Binaural audio gives the listener an immersive experience and can enhance augmented and virtual reality. However, recording binaural audio requires a specialized setup with a dummy human head that has microphones in the left and right ears. Such a recording setup is difficult to build and set up, so mono audio has become the preferred choice in common devices. To obtain the same impact as binaural audio, recent efforts have been directed towards lifting mono audio to binaural audio conditioned on the visual input from the scene. Such approaches have not used an important cue for the task: the distance of different sound-producing objects from the microphones. In this work, we argue that the depth map of the scene can act as a proxy for inducing distance information about the different objects in the scene, for the task of audio binauralization. We propose a novel encoder-decoder architecture with a hierarchical attention mechanism to encode image, depth and audio features jointly. We design the network on top of state-of-the-art transformer networks for image and depth representation. We show empirically that the proposed method comfortably outperforms state-of-the-art methods on two challenging public datasets, FAIR-Play and MUSIC-Stereo. We also demonstrate with qualitative results that the method is able to focus on the right information required for the task. The project details are available at <https://krantiparida.github.io/projects/bmonobinaural.html>
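
To make the idea of attending across modalities concrete, the following is a minimal sketch of scaled dot-product cross-attention in plain Python. It is not the authors' architecture: the abstract does not specify the layers, so the two-stage arrangement (audio attends to image and depth tokens separately, then to the two resulting summaries) is only one assumed reading of "hierarchical", and all names and toy values (`cross_attention`, `audio`, `image_tokens`, `depth_tokens`) are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries: list of d-dim vectors (here, hypothetical audio features)
    keys, values: lists of d-dim vectors (hypothetical image or depth tokens)
    Returns (outputs, weights), one entry per query.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [sum(wj * v[i] for wj, v in zip(w, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy inputs, made up for illustration: one audio frame, two tokens per modality.
audio = [[1.0, 0.0]]
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
depth_tokens = [[0.5, 0.5], [1.0, 0.0]]

# Stage 1 (assumed): audio attends to each visual modality separately.
img_ctx, img_w = cross_attention(audio, image_tokens, image_tokens)
dep_ctx, dep_w = cross_attention(audio, depth_tokens, depth_tokens)

# Stage 2 (assumed): audio attends over the two modality summaries to fuse them.
summaries = [img_ctx[0], dep_ctx[0]]
fused, fuse_w = cross_attention(audio, summaries, summaries)
```

In this sketch the attention weights indicate which image or depth tokens the audio frame draws on most, which is the kind of signal a qualitative visualization of "focusing on the right information" would inspect.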


