Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

07/10/2023
by   Sagnik Majumder, et al.

We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. In particular, our method leverages a masked auto-encoding framework to synthesize masked binaural audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. We show through extensive experiments that our features are generic enough to improve over multiple state-of-the-art baselines on two challenging public egocentric video datasets, EgoCom and EasyCom. Project: http://vision.cs.utexas.edu/projects/ego_av_corr.
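
To make the masked auto-encoding idea in the abstract concrete, the following is a minimal sketch of the masking step and a reconstruction loss restricted to the hidden regions. It is not the authors' implementation: the function names (mask_binaural_patches, masked_l1), the patch size, the 50% mask ratio, the independent per-channel masking, and the L1 loss are all illustrative assumptions, and the audio-visual model that would fill in the hidden bins is omitted.

```python
import numpy as np


def mask_binaural_patches(spec, patch=(4, 4), mask_ratio=0.5, seed=0):
    """Hide a random subset of time-frequency patches of a binaural spectrogram.

    spec: array of shape (2, F, T) holding left/right channel magnitudes.
    Returns (masked_spec, mask), where mask is True at every hidden bin.
    Patch size and mask ratio are assumed values, not taken from the paper.
    """
    channels, n_freq, n_time = spec.shape
    pf, pt = patch
    if n_freq % pf or n_time % pt:
        raise ValueError("spectrogram dimensions must be divisible by the patch size")

    # One hide/keep decision per (channel, frequency-patch, time-patch) cell.
    grid = (channels, n_freq // pf, n_time // pt)
    n_patches = int(np.prod(grid))
    n_hidden = int(round(mask_ratio * n_patches))

    rng = np.random.default_rng(seed)
    hidden = np.zeros(n_patches, dtype=bool)
    hidden[rng.choice(n_patches, size=n_hidden, replace=False)] = True

    # Expand the patch-level decisions to individual spectrogram bins.
    mask = hidden.reshape(grid).repeat(pf, axis=1).repeat(pt, axis=2)
    masked_spec = np.where(mask, 0.0, spec)
    return masked_spec, mask


def masked_l1(pred, target, mask):
    """Mean absolute error computed over the hidden bins only."""
    return float(np.abs(pred - target)[mask].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    spec = rng.random((2, 8, 16))  # stand-in for a real binaural spectrogram
    masked_spec, mask = mask_binaural_patches(spec)
    # A model would predict the hidden bins from masked_spec plus video features;
    # here the masked input itself serves as a trivial (poor) prediction.
    print("hidden fraction:", mask.mean())
    print("baseline masked L1:", masked_l1(masked_spec, spec, mask))
```

Computing the loss only over hidden bins follows the usual masked auto-encoder recipe, in which the model gets no credit for copying the visible input; whether the paper uses exactly this objective is not stated in the abstract.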


research
06/11/2020

Telling Left from Right: Learning Spatial Correspondence between Sight and Sound

Self-supervised audio-visual learning aims to capture useful representat...
research
06/02/2022

Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

Learning from audio-visual data offers many possibilities to express cor...
research
09/14/2020

Themes Informed Audio-visual Correspondence Learning

The applications of short-term user-generated video (UGV), such as Snapc...
research
01/04/2023

Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations

Can conversational videos captured from multiple egocentric viewpoints r...
research
10/11/2021

Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos

360° videos convey holistic views for the surroundings of a scene. It p...
research
05/03/2021

Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation

Human perceives rich auditory experience with distinct sound heard by ea...
