Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection

12/01/2022
by Rahul Sharma, et al.

Active speaker detection in videos is the task of associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information for deriving such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. Each approach has its limitations: audio-visual activity models are confused by other frequently occurring vocal activities, such as laughing and chewing, while speaker-identity-based methods are limited to videos that contain enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker (movies) and Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances the active speaker detection performance.
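
The late fusion mentioned above can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the weighted average below is an assumption, not the authors' method; the function names, the `weight` and `threshold` parameters, and the per-face score dictionaries are all hypothetical.

```python
from typing import Dict, Optional


def late_fusion(
    activity_scores: Dict[str, float],
    identity_scores: Dict[str, float],
    weight: float = 0.5,
) -> Dict[str, float]:
    """Fuse two per-face score sets with a weighted average.

    Assumption: each dictionary maps a face identifier to a score in [0, 1]
    produced independently by one of the two approaches. The weighted
    average is an illustrative choice, not the rule used in the paper.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    fused = {}
    # Only faces scored by both approaches can be fused.
    for face_id in activity_scores.keys() & identity_scores.keys():
        fused[face_id] = (
            weight * activity_scores[face_id]
            + (1.0 - weight) * identity_scores[face_id]
        )
    return fused


def pick_active_speaker(
    fused: Dict[str, float], threshold: float = 0.5
) -> Optional[str]:
    """Return the highest-scoring face, or None if no face reaches the threshold."""
    if not fused:
        return None
    best = max(fused, key=fused.get)
    return best if fused[best] >= threshold else None


# Hypothetical scores for two visible faces in one speech segment.
activity = {"face_a": 0.9, "face_b": 0.4}   # audio-visual activity model
identity = {"face_a": 0.7, "face_b": 0.2}   # cross-modal identity association
fused = late_fusion(activity, identity)
speaker = pick_active_speaker(fused)
```

Because the two score sets are computed independently and combined only at the decision stage, either model could be replaced without retraining the other, which is the usual motivation for late fusion over joint modeling.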

research
09/24/2022

Unsupervised active speaker detection in media content using cross-modal information

We present a cross-modal unsupervised framework for active speaker detec...
research
03/09/2020

Crossmodal learning for audio-visual speech event localization

An objective understanding of media depictions, such as about inclusive ...
research
06/21/2022

Rethinking Audio-visual Synchronization for Active Speaker Detection

Active speaker detection (ASD) systems are important modules for analyzi...
research
03/29/2016

Cross-modal Supervision for Learning Active Speaker Detection in Video

In this paper, we show how to use audio to supervise the learning of act...
research
02/28/2020

Bio-Inspired Modality Fusion for Active Speaker Detection

Human beings have developed fantastic abilities to integrate information...
research
06/09/2022

Audio-video fusion strategies for active speaker detection in meetings

Meetings are a common activity in professional contexts, and it remains ...
research
07/10/2021

Speech2Video: Cross-Modal Distillation for Speech to Video Generation

This paper investigates a novel task of talking face video generation so...
