Unsupervised active speaker detection in media content using cross-modal information

09/24/2022
by Rahul Sharma, et al.

We present a cross-modal unsupervised framework for active speaker detection in media content such as TV shows and movies. Machine learning advances have enabled impressive performance in identifying individuals from speech and facial images. We leverage speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task such that the active speaker's face and the underlying speech identify the same person (character). We represent each speech segment by its speaker identity distances to all other speech segments, which captures a relative identity structure for the video. We then assign an active speaker's face to each speech segment, chosen from the concurrently appearing faces, such that the resulting set of active speaker faces displays a similar relative identity structure. Furthermore, we propose a simple and effective approach to address speech segments where speakers are present off-screen. We evaluate the proposed system on three benchmark datasets – Visual Person Clustering dataset, AVA-active speaker dataset, and Columbia dataset – consisting of videos from entertainment and broadcast media, and show performance competitive with state-of-the-art fully supervised methods.
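The speech-face assignment idea can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which embeddings, distance measure, or optimization procedure the paper uses. The sketch assumes cosine distance, a squared-difference mismatch between the two distance matrices, and a brute-force search that is only feasible for toy inputs. The names `cosine_distance_matrix` and `assign_active_faces` and all embeddings are hypothetical.

```python
from itertools import product

import numpy as np


def cosine_distance_matrix(emb):
    """Pairwise cosine distances between the rows of emb."""
    emb = np.asarray(emb, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def assign_active_faces(speech_emb, face_candidates):
    """Pick one candidate face per speech segment (hypothetical helper).

    speech_emb: one identity embedding per speech segment.
    face_candidates: for each segment, the embeddings of the faces that
        appear on screen during that segment.
    Returns the chosen candidate index per segment and the mismatch cost.
    Assumption: the cost is the squared difference between the speech and
    face distance matrices, minimized by exhaustive search.
    """
    d_speech = cosine_distance_matrix(speech_emb)
    best_choice, best_cost = None, np.inf
    # Enumerate every way of choosing one face per segment.
    for choice in product(*(range(len(c)) for c in face_candidates)):
        faces = np.stack([face_candidates[i][k] for i, k in enumerate(choice)])
        cost = np.sum((cosine_distance_matrix(faces) - d_speech) ** 2)
        if cost < best_cost:
            best_choice, best_cost = list(choice), cost
    return best_choice, float(best_cost)


# Toy example: two characters, four speech segments (made-up embeddings).
# Speech and face embeddings live in different spaces; only the pairwise
# distance structure within each modality is compared.
face_a = np.array([1.0, 0.0, 0.0])
face_b = np.array([0.0, 1.0, 0.0])
speech = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
candidates = [
    [face_a, face_b],  # segment 0: both faces on screen
    [face_b, face_a],  # segment 1: both faces on screen
    [face_a, face_b],  # segment 2: both faces on screen
    [face_b],          # segment 3: only one face on screen
]
choice, cost = assign_active_faces(speech, candidates)
print(choice, cost)
```

In this toy case the segment with a single visible face anchors the assignment: the distance structure alone cannot tell which character is which, so segments with few candidate faces are what resolve the ambiguity. The search space grows exponentially with the number of segments, so a real video would need a different optimization strategy.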

research
12/01/2022

Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection

Active speaker detection in videos addresses associating a source face, ...
research
03/30/2022

Using Active Speaker Faces for Diarization in TV shows

Speaker diarization is one of the critical components of computational m...
research
09/21/2023

TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning

The goal of this work is Active Speaker Detection (ASD), a task to deter...
research
02/20/2020

Disentangled Speech Embeddings using Cross-modal Self-supervision

The objective of this paper is to learn representations of speaker ident...
research
07/05/2018

Detection and Analysis of Content Creator Collaborations in YouTube Videos using Face- and Speaker-Recognition

This work discusses and implements the application of speaker recognitio...
research
08/04/2023

Speaker Diarization of Scripted Audiovisual Content

The media localization industry usually requires a verbatim script of th...
research
09/01/2021

FaVoA: Face-Voice Association Favours Ambiguous Speaker Detection

The strong relation between face and voice can aid active speaker detect...
