Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

04/10/2018
by Andrew Owens, et al.

The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g. removing the off-screen translator's voice from a foreign official's speech. Code, models, and video results are available on our webpage: http://andrewowens.com/multisensory
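To make the pretext task concrete, the sketch below shows one way such an alignment classifier could be trained: each video clip is paired with either its own audio track (aligned, label 1) or a temporally shifted version of that track (misaligned, label 0), and a fused audio-visual network is trained with a binary cross-entropy loss. This is a minimal illustration assuming a PyTorch-style setup; the `MultisensoryNet` architecture, tensor shapes, shift amount, and optimizer settings are placeholder assumptions, not the authors' released implementation.

```python
# Minimal sketch of the alignment-prediction pretext task described above
# (illustrative only, not the authors' released code). Assumes PyTorch; the
# architecture, tensor shapes, and shift amount are placeholder assumptions.
import torch
import torch.nn as nn

class MultisensoryNet(nn.Module):
    """Toy fused audio-visual encoder that outputs an 'aligned vs. shifted' logit."""
    def __init__(self):
        super().__init__()
        # 3D convolution over the stack of RGB video frames (time, height, width).
        self.video_enc = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=(5, 7, 7), stride=(2, 4, 4), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        # 1D convolution over the raw mono waveform.
        self.audio_enc = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=65, stride=32, padding=32),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Fused head: concatenated audio and video features -> one logit.
        self.head = nn.Linear(32, 1)

    def forward(self, frames, waveform):
        v = self.video_enc(frames).flatten(1)       # (B, 16) video features
        a = self.audio_enc(waveform).flatten(1)     # (B, 16) audio features
        return self.head(torch.cat([v, a], dim=1))  # (B, 1) alignment logit

def make_pretext_batch(frames, waveform, shift_samples=22050):
    """Self-supervised labels: keep half the audio tracks aligned (label 1)
    and temporally shift the other half (label 0)."""
    b = frames.size(0)
    labels = torch.zeros(b)
    labels[: b // 2] = 1.0
    audio = waveform.clone()
    audio[b // 2 :] = torch.roll(waveform[b // 2 :], shifts=shift_samples, dims=-1)
    return frames, audio, labels

# One training step on random stand-in data (4 clips of 16 frames + ~1 s of audio).
model = MultisensoryNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
frames = torch.randn(4, 3, 16, 64, 64)   # (B, C, T, H, W)
waveform = torch.randn(4, 1, 44100)      # (B, 1, samples)
frames, audio, labels = make_pretext_batch(frames, waveform)
logits = model(frames, audio).squeeze(1)
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
opt.zero_grad()
loss.backward()
opt.step()
```

Creating misaligned examples by shifting the same video's audio (rather than borrowing audio from a different clip) keeps negatives within the same scene, which matches the spirit of the paper's alignment task; the fused features learned this way are what the paper then reuses for localization, recognition, and separation.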

Related research:

- 2.5D Visual Sound (12/11/2018): Binaural audio provides a listener with 3D sound sensation, allowing a r...
- AENet: Learning Deep Audio Features for Video Analysis (01/03/2017): We propose a new deep network for audio event recognition, called AENet....
- Self-Supervised Learning from Automatically Separated Sound Scenes (05/05/2021): Real-world sound scenes consist of time-varying collections of sound sou...
- A dataset for Audio-Visual Sound Event Detection in Movies (02/14/2023): Audio event detection is a widely studied audio processing task, with ap...
- Generating Visually Aligned Sound from Videos (07/14/2020): We focus on the task of generating sound from natural videos, and the so...
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding (06/19/2023): The framework of visually-guided sound source separation generally consi...
- Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation (06/26/2022): We present a simple yet effective self-supervised framework for audio-vi...
