Telling Left from Right: Learning Spatial Correspondence between Sight and Sound

06/11/2020
by   Karren Yang, et al.

Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the sensory streams. We propose a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream. Our approach is simple yet effective. We train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams. To train and evaluate our method, we introduce a large-scale video dataset, YouTube-ASMR-300K, with spatial audio comprising over 900 hours of footage. We demonstrate that understanding spatial correspondence enables models to perform better on three audio-visual tasks, achieving quantitative gains over supervised and self-supervised baselines that do not leverage spatial audio cues. We also show how to extend our self-supervised approach to 360-degree videos with ambisonic audio.
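The pretext task described above can be illustrated with a minimal sketch of the data-generation step: randomly swapping the left and right channels of a stereo clip and recording whether a swap occurred. The helper name and array layout below are assumptions for illustration; in the paper, a network predicts this flip label jointly from the video frames and the audio.

```python
import numpy as np

def make_flip_example(stereo_audio, rng):
    """Randomly swap left/right channels; return (audio, label).

    stereo_audio: array of shape (2, n_samples), row 0 = left, row 1 = right.
    label: 1 if the channels were flipped, else 0.
    (Hypothetical helper sketching the self-supervised labeling step.)
    """
    flip = rng.random() < 0.5          # flip with probability 0.5
    audio = stereo_audio[::-1] if flip else stereo_audio  # reverse channel axis
    return audio, int(flip)

# Toy stereo clip: left channel all ones, right channel all zeros.
rng = np.random.default_rng(0)
x = np.stack([np.ones(4), np.zeros(4)])
audio, label = make_flip_example(x, rng)
```

A model trained on such pairs must localize the on-screen sound source to decide which channel layout is consistent with the visuals, which is the spatial cue the paper exploits.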


Related research:

- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos (07/10/2023): "We propose a self-supervised method for learning representations based o..."
- That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation (10/03/2022): "Learning to produce contact-rich, dynamic behaviors from raw sensory dat..."
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning (01/26/2021): "Large-scale datasets are the cornerstone of self-supervised representati..."
- Binaural Audio Generation via Multi-task Learning (09/02/2021): "We present a learning-based approach for generating binaural audio from ..."
- Self-Supervised Generation of Spatial Audio for 360 Video (09/07/2018): "We introduce an approach to convert mono audio recorded by a 360 video c..."
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation (05/03/2021): "Human perceives rich auditory experience with distinct sound heard by ea..."
