Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors

10/13/2022
by Vladimir Iashin, et al.

The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For such videos, the events that can be harnessed as synchronisation cues may be spatially small and may occur only infrequently during a video clip that is many seconds long, i.e. the synchronisation signal is 'sparse in space and time'. This contrasts with synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space. We make four contributions: (i) to handle the longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors' to distil the long audio and visual streams into small sequences, which are then used to predict the temporal offset between the streams; (ii) we identify artefacts that can arise from the compression codecs used for audio and video and that audio-visual models can exploit during training to solve the synchronisation task artificially; (iii) we curate a dataset whose synchronisation signals are sparse in both time and space; and (iv) we demonstrate the effectiveness of the proposed model on both dense and sparse datasets, quantitatively and qualitatively. Project page: v-iashin.github.io/SparseSync
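
The 'selector' idea in contribution (i) can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: that each selector behaves like a trainable query vector that cross-attends over a long feature stream, and that the offset is predicted as a class over a small set of candidate offsets. All names (`select`, `predict_offset_class`, `class_weights`), the dimensions, and the random stand-in features are hypothetical and are not taken from the paper's code.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select(selectors, stream):
    """Cross-attend each selector (query) over the whole stream.

    Returns one vector per selector, so a stream of any length is
    distilled into len(selectors) tokens. Assumption: single-head,
    scaled dot-product attention without learned projections.
    """
    dim = len(stream[0])
    scale = 1.0 / math.sqrt(dim)
    distilled = []
    for query in selectors:
        weights = softmax([dot(query, token) * scale for token in stream])
        distilled.append(
            [sum(w * token[d] for w, token in zip(weights, stream)) for d in range(dim)]
        )
    return distilled


def predict_offset_class(audio_tokens, visual_tokens, class_weights):
    """Hypothetical linear head: mean-pool the distilled tokens of both
    modalities and score each candidate offset class."""
    tokens = audio_tokens + visual_tokens
    dim = len(tokens[0])
    pooled = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return softmax([dot(w, pooled) for w in class_weights])


random.seed(0)
DIM, N_SELECTORS, N_OFFSET_CLASSES = 8, 4, 5


def rand_vecs(n):
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]


# Random stand-ins for per-frame features; a real model would learn these.
audio_stream = rand_vecs(200)   # long audio feature sequence
visual_stream = rand_vecs(50)   # long visual feature sequence
audio_selectors = rand_vecs(N_SELECTORS)
visual_selectors = rand_vecs(N_SELECTORS)
class_weights = rand_vecs(N_OFFSET_CLASSES)

audio_sel = select(audio_selectors, audio_stream)
visual_sel = select(visual_selectors, visual_stream)
probs = predict_offset_class(audio_sel, visual_sel, class_weights)
print(len(audio_sel), len(visual_sel), len(probs))
```

In this sketch the cost of the downstream head depends on the number of selectors rather than on the length of the input streams, which is one plausible reason to distil long sequences before predicting the offset.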

research
05/04/2021

Where and When: Space-Time Attention for Audio-Visual Explanations

Explaining the decision of a multi-modal decision-maker requires to dete...
research
11/10/2021

Space-Time Memory Network for Sounding Object Localization in Videos

Leveraging temporal synchronization and association within sight and sou...
research
01/04/2023

Self-Supervised Video Forensics by Audio-Visual Anomaly Detection

Manipulated videos often contain subtle inconsistencies between their vi...
research
12/08/2021

Audio-Visual Synchronisation in the wild

In this paper, we consider the problem of audio-visual synchronisation a...
research
03/30/2022

Forensic Analysis and Localization of Multiply Compressed MP3 Audio Using Transformers

Audio signals are often stored and transmitted in compressed formats. Am...
research
05/25/2023

SoundSieve: Seconds-Long Audio Event Recognition on Intermittently-Powered Systems

A fundamental problem of every intermittently-powered sensing system is ...
research
09/25/2017

Dense scale selection over space, time and space-time

Scale selection methods based on local extrema over scale of scale-norma...