Audio-Visual Synchronisation in the wild

12/08/2021
by Honglie Chen et al.

In this paper, we consider the problem of audio-visual synchronisation applied to videos 'in-the-wild' (i.e., of general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length, while significantly reducing memory requirements during training. We further conduct an in-depth analysis of the curated dataset and define an evaluation metric for open-domain audio-visual synchronisation. We apply our method to the standard lip-reading speech benchmarks, LRS2 and LRS3, with ablations on various aspects. Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset. In all cases, our proposed model outperforms the previous state-of-the-art by a significant margin.
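The abstract mentions defining an evaluation metric for open-domain synchronisation. A common way to score this task is to treat synchronisation as predicting the temporal offset between the audio and visual streams, and to count a prediction as correct when it lands within a small frame tolerance of the ground truth. The sketch below illustrates that general idea only; the function name, signature, and tolerance are assumptions, not the paper's actual metric.

```python
def sync_accuracy(pred_offsets, true_offsets, tolerance=1):
    """Fraction of clips whose predicted audio-visual offset falls
    within `tolerance` frames of the ground-truth offset.

    Illustrative sketch only: the paper defines its own evaluation
    protocol, which may differ from this tolerance-based accuracy.
    """
    hits = sum(1 for p, t in zip(pred_offsets, true_offsets)
               if abs(p - t) <= tolerance)
    return hits / len(true_offsets)


# Example: three of four predictions are within one frame of the truth.
acc = sync_accuracy([0, 2, -1, 5], [0, 1, -1, 0])
print(acc)  # 0.75
```

With a tolerance of one frame, a prediction off by a single frame still counts as synchronised, which reflects that small offsets are usually imperceptible to viewers.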


