Rethinking Audio-visual Synchronization for Active Speaker Detection

06/21/2022
by Abudukelimu Wuerkaixi, et al.

Active speaker detection (ASD) systems are important modules for analyzing multi-talker conversations. They aim to detect which speakers, if any, are talking in a visual scene at any given time. Existing research on ASD does not agree on the definition of an active speaker. We clarify the definition in this work and require synchronization between the audio and visual speaking activities. This clarification is motivated by our extensive experiments, through which we discover that existing ASD methods fail to model audio-visual synchronization and often classify unsynchronized videos as active speaking. To address this problem, we propose a cross-modal contrastive learning strategy and apply positional encoding in the attention modules of supervised ASD models to leverage the synchronization cue. Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
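
To make the synchronization cue concrete, here is a minimal sketch of a generic InfoNCE-style cross-modal contrastive loss. The abstract does not specify the paper's actual loss, so everything here is an assumption for illustration: the function names `cosine` and `sync_contrastive_loss`, the `temperature` value, and the choice to treat time-aligned audio and visual frame embeddings as positives and all other frames of the same clip as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def sync_contrastive_loss(audio, visual, temperature=0.1):
    """InfoNCE-style loss over T frames (hypothetical formulation).

    audio[t] and visual[t] are embedding vectors for frame t.
    The aligned pair (t, t) is the positive; every pair (t, s) with
    s != t acts as a negative, i.e. temporally shifted audio.
    """
    total = 0.0
    for t, v in enumerate(visual):
        logits = [cosine(v, a) / temperature for a in audio]
        # Numerically stable log-sum-exp over all audio frames.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the aligned audio frame.
        total += log_denom - logits[t]
    return total / len(visual)


# Toy embeddings: each frame has a distinct one-hot vector.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_audio = [row[:] for row in visual]
shifted_audio = aligned_audio[1:] + aligned_audio[:1]  # one-frame offset

loss_aligned = sync_contrastive_loss(aligned_audio, visual)
loss_shifted = sync_contrastive_loss(shifted_audio, visual)
```

In this toy setup the aligned audio yields a much lower loss than the shifted audio, which is the behavior a synchronization-aware objective is meant to encourage: a model trained this way is pushed to score unsynchronized audio-visual pairs as mismatched.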

research
12/01/2022

Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection

Active speaker detection in videos addresses associating a source face, ...
research
05/22/2023

Target Active Speaker Detection with Audio-visual Cues

In active speaker detection (ASD), we would like to detect whether an on...
research
06/07/2021

How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the Wild

Successful active speaker detection requires a three-stage pipeline: (i)...
research
03/08/2023

A Light Weight Model for Active Speaker Detection

Active speaker detection is a challenging task in audio-visual scenario ...
research
01/11/2021

MAAS: Multi-modal Assignation for Active Speaker Detection

Active speaker detection requires a solid integration of multi-modal cue...
research
04/26/2022

Robust Audio-Visual Instance Discrimination via Active Contrastive Set Mining

The recent success of audio-visual representation learning can be largel...
research
12/14/2018

On Attention Modules for Audio-Visual Synchronization

With the development of media and networking technologies, multimedia ap...
