Self-Supervised Learning of Audio-Visual Objects from Video

08/10/2020
by   Triantafyllos Afouras, et al.
7

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by applying it to non-human speakers, including cartoons and puppets.Our model significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection.

READ FULL TEXT

page 6

page 8

page 10

page 11

04/10/2018

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

The thud of a bouncing ball, the onset of speech as lips open -- when vi...
02/13/2020

Self-supervised learning for audio-visual speaker diarization

Speaker diarization, which is to find the speech segments of specific sp...
07/28/2020

Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling

Detecting sound source objects within visual observation is important fo...
06/02/2022

Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

Learning from audio-visual data offers many possibilities to express cor...
11/24/2017

Self-Supervised Vision-Based Detection of the Active Speaker as a Prerequisite for Socially-Aware Language Acquisition

This paper presents a self-supervised method for detecting the active sp...
09/07/2018

Self-Supervised Generation of Spatial Audio for 360 Video

We introduce an approach to convert mono audio recorded by a 360 video c...

Code Repositories

avobjects

Implementation for ECCV20 paper "Self-Supervised Learning of audio-visual objects from video"


view repo