Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation

05/03/2021
by   Yan-Bo Lin, et al.

Humans perceive a rich auditory experience through the distinct sounds heard by each ear. Videos recorded with binaural audio closely simulate how humans receive ambient sound. However, a large number of videos contain only monaural audio, which degrades the user experience due to the lack of spatial information. To address this issue, we propose an audio spatialization framework that converts a monaural video into a binaural one by exploiting the relationship between the audio and visual components. By preserving left-right consistency in both the audio and visual modalities, our learning strategy can be viewed as a self-supervised technique, alleviating the dependency on large amounts of video data with ground-truth binaural audio during training. Experiments on benchmark datasets confirm the effectiveness of the proposed framework in both semi-supervised and fully supervised scenarios, and ablation studies and visualizations further support the use of our model for audio spatialization.
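The left-right consistency idea in the abstract can be illustrated with a small sketch. The function names and array layouts below are assumptions for illustration, not the authors' implementation: the premise is that horizontally mirroring the video frames should correspond to swapping the left and right audio channels, which yields an extra training pair without any additional ground-truth binaural recordings.

```python
import numpy as np

def left_right_flip(frames, binaural):
    """Create a mirrored training pair for left-right consistency.

    frames:   (T, H, W, C) video frames.
    binaural: (2, N) stereo waveform, row 0 = left, row 1 = right.

    A horizontally mirrored scene should emit mirrored audio, so
    (flipped frames, channel-swapped audio) is treated as another
    valid video/binaural pair -- a free supervisory signal when
    ground-truth binaural data is scarce. (Hypothetical helper,
    not the paper's exact augmentation code.)
    """
    flipped = frames[:, :, ::-1, :]   # reverse the width axis
    swapped = binaural[::-1, :]       # swap left/right channels
    return flipped, swapped
```

During training, a consistency loss would then encourage the model's prediction on the flipped frames to match the channel-swapped version of its prediction on the original frames.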


Related research

05/14/2019
Self-supervised Audio Spatialization with Correspondence Classifier
Spatial audio is an essential medium to audiences for 3D visual and audi...

09/07/2018
Self-Supervised Generation of Spatial Audio for 360 Video
We introduce an approach to convert mono audio recorded by a 360 video c...

05/21/2021
Semi-Supervised Audio Representation Learning for Modeling Beehive Strengths
Honey bees are critical to our ecosystem and food security as a pollinat...

06/11/2020
Telling Left from Right: Learning Spatial Correspondence between Sight and Sound
Self-supervised audio-visual learning aims to capture useful representat...

01/13/2020
Two Channel Audio Zooming System For Smartphone
In this paper, two microphone based systems for audio zooming is propose...

07/10/2023
Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos
We propose a self-supervised method for learning representations based o...

03/11/2023
CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective
Incorporating the audio stream enables Video Saliency Prediction (VSP) t...
