Cross-modal Supervision for Learning Active Speaker Detection in Video

03/29/2016
by Punarjay Chakravarty, et al.

In this paper, we show how audio can be used to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion, capturing the facial expressions and gesticulations associated with speaking. We further improve on a generic model for active speaker detection by learning person-specific models. Finally, we demonstrate the online adaptation of generic models, learnt on one dataset, to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows a model to be learnt without manual supervision, by transferring knowledge from one modality to another.
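The supervision signal described above can be illustrated with a small sketch: an off-the-shelf VAD labels each audio segment as speech or silence, and those labels are then used to train a classifier that only ever sees the visual features. The snippet below is a minimal illustration of that idea, not the paper's implementation; extract_visual_features and run_vad are hypothetical stand-ins (an energy threshold for the VAD, a mean-pooled descriptor for the upper-body motion features), and a logistic regression replaces the actual classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical placeholders: in the paper, the visual features are spatio-temporal
# descriptors of upper-body motion, and the VAD runs on the synchronized audio track.
def extract_visual_features(video_segment: np.ndarray) -> np.ndarray:
    """Return a fixed-length descriptor of upper-body motion for one segment."""
    return video_segment.mean(axis=0)  # stand-in for real spatio-temporal features

def run_vad(audio_segment: np.ndarray, energy_threshold: float = 0.1) -> int:
    """Weak label from audio alone: 1 = voice activity detected, 0 = silence."""
    return int(np.mean(audio_segment ** 2) > energy_threshold)

# Synthetic stand-in data: N temporally aligned audio/video segments of one person.
rng = np.random.default_rng(0)
video_segments = rng.normal(size=(200, 16, 64))            # (segments, frames, feature dim)
loudness = rng.uniform(0.1, 0.6, size=(200, 1))             # varying speech energy
audio_segments = rng.normal(scale=loudness, size=(200, 1600))  # raw audio samples

# Cross-modal weak supervision: audio provides the labels, video provides the input.
X = np.stack([extract_visual_features(v) for v in video_segments])
y = np.array([run_vad(a) for a in audio_segments])

# The vision-only classifier never sees audio at test time.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy against VAD labels:", clf.score(X, y))
```

In the same spirit, adapting to a new dataset only requires running the VAD on the new audio and continuing training on the new visual features, with temporal continuity used to smooth over noisy VAD labels.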

Related research

12/01/2022 · Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
Active speaker detection in videos addresses associating a source face, ...

03/27/2022 · End-to-End Active Speaker Detection
Recent advances in the Active Speaker Detection (ASD) problem build upon...

09/21/2023 · TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning
The goal of this work is Active Speaker Detection (ASD), a task to deter...

12/02/2021 · Learning Spatial-Temporal Graphs for Active Speaker Detection
We address the problem of active speaker detection through a new framewo...

02/10/2020 · Multimodal active speaker detection and virtual cinematography for video conferencing
Active speaker detection (ASD) and virtual cinematography (VC) can signi...

01/11/2021 · MAAS: Multi-modal Assignation for Active Speaker Detection
Active speaker detection requires a solid integration of multi-modal cue...

11/24/2017 · Self-Supervised Vision-Based Detection of the Active Speaker as a Prerequisite for Socially-Aware Language Acquisition
This paper presents a self-supervised method for detecting the active sp...
