Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

06/02/2022
by   Shanshan Wang, et al.

Learning from audio-visual data offers many ways to express the correspondence between audio and visual content, much as human perception relates aural and visual information. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than audio-visual correspondence (AVC). In addition to the correspondence, AVSA also learns from the spatial location of acoustic and visual content. Based on 360° video and Ambisonics audio, we propose selecting visual objects using object detection and beamforming the audio signal towards the detected objects, in an attempt to learn the spatial alignment between objects and the sound they produce. We investigate the use of spatial audio features to represent the audio input, and different audio formats: Ambisonics, mono, and stereo. Experimental results show a 10% improvement on AVSA for the first-order Ambisonics intensity vector (FOA-IV) in comparison with log-mel spectrogram features; the addition of object-oriented crops also brings significant performance increases for the human action recognition downstream task. A number of audio-only downstream tasks are devised to test the effectiveness of the learnt audio feature representation, obtaining performance comparable to state-of-the-art methods on acoustic scene classification from Ambisonic and binaural audio.
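As a rough illustration of the two spatial-audio ideas named in the abstract, the sketch below steers a first-order beam towards a detected object and computes an intensity vector from first-order Ambisonics (FOA) channels. It is not the authors' implementation. Every function name is hypothetical, and the conventions are assumptions: an equirectangular 360° frame with azimuth 0 at the image centre and increasing to the left, FOA channels in W, X, Y, Z order with SN3D-style gains, a simple cardioid-like beam pattern, and a single FFT frame in place of a full time-frequency transform.

```python
import numpy as np


def pixel_to_direction(x, y, width, height):
    """Map an equirectangular pixel to (azimuth, elevation) in radians.

    Assumed convention: image centre is azimuth 0, azimuth increases to the
    left, and the top row corresponds to elevation +pi/2.
    """
    azimuth = (0.5 - x / width) * 2.0 * np.pi
    elevation = (0.5 - y / height) * np.pi
    return azimuth, elevation


def unit_vector(azimuth, elevation):
    """Cartesian unit vector for a direction given in radians."""
    return np.array([
        np.cos(azimuth) * np.cos(elevation),
        np.sin(azimuth) * np.cos(elevation),
        np.sin(elevation),
    ])


def encode_foa(signal, azimuth, elevation):
    """Encode a mono plane wave as FOA channels (W, X, Y, Z), SN3D-style gains (assumed)."""
    u = unit_vector(azimuth, elevation)
    return np.stack([signal, u[0] * signal, u[1] * signal, u[2] * signal])


def beamform_foa(foa, azimuth, elevation, pattern=0.5):
    """Steer a first-order virtual microphone towards a direction.

    pattern=0.5 gives a cardioid: unit gain on-axis, zero gain directly behind.
    """
    u = unit_vector(azimuth, elevation)
    w, xyz = foa[0], foa[1:]
    return pattern * w + (1.0 - pattern) * (u @ xyz)


def intensity_vector(foa):
    """Per-frequency intensity vector Re(conj(W) * [X, Y, Z]) from one FFT frame.

    Under the encoding assumed above, it points towards the source.
    """
    spec = np.fft.rfft(foa, axis=-1)             # shape (4, n_bins)
    return np.real(np.conj(spec[0]) * spec[1:])  # shape (3, n_bins)


# Usage: pretend an object detector returned a box centred at pixel (480, 270)
# in a 1920x960 frame, and a source emits a 440 Hz tone from that direction.
t = np.arange(1024) / 16000.0
tone = np.sin(2.0 * np.pi * 440.0 * t)
az, el = pixel_to_direction(480, 270, 1920, 960)
foa = encode_foa(tone, az, el)
towards_object = beamform_foa(foa, az, el)             # recovers the tone
away_from_object = beamform_foa(foa, az + np.pi, -el)  # cancels the tone
```

With a cardioid pattern the beam passes a source in the look direction unchanged and rejects one from the opposite direction, which is the property that lets a beamformed signal be paired with the visual crop of the object it points at.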


