UniCon: Unified Context Network for Robust Active Speaker Detection

08/05/2021
by Yuanhang Zhang, et al.

We introduce a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD). Traditional methods for ASD usually operate on each candidate's pre-cropped face track separately and do not sufficiently consider the relationships among the candidates. This potentially limits performance, especially in challenging scenarios with low-resolution faces, multiple candidates, etc. Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information: spatial context to indicate the position and scale of each candidate's face, relational context to capture the visual relationships among the candidates and contrast audio-visual affinities with each other, and temporal context to aggregate long-term information and smooth out local uncertainties. Based on such information, our model optimizes all candidates in a unified process for robust and reliable ASD. A thorough ablation study is performed on several challenging ASD benchmarks under different settings. In particular, our method outperforms the state-of-the-art by a large margin of about 15% mean Average Precision (mAP) absolute on two challenging subsets: one with three candidate speakers, and the other with faces smaller than 64 pixels. Together, our UniCon achieves 92.0% mAP on the AVA-ActiveSpeaker validation set, surpassing 90% for the first time on this challenging dataset at the time of submission. Project website: https://unicon-asd.github.io/.
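To make the three kinds of context more concrete, the following is a minimal illustrative sketch and not the authors' implementation. UniCon learns these components with neural networks; here each one is replaced by a simple hand-written stand-in. All function names, the normalized box encoding, the softmax contrast, and the moving-average window are assumptions introduced only for illustration.

```python
import math

# Illustrative stand-ins only: these are NOT the UniCon modules.


def spatial_context(box, frame_w, frame_h):
    """Encode a face box (x1, y1, x2, y2), in pixels, as its normalized
    center position and size, so position and scale are explicit."""
    x1, y1, x2, y2 = box
    return (
        (x1 + x2) / 2 / frame_w,  # center x in [0, 1]
        (y1 + y2) / 2 / frame_h,  # center y in [0, 1]
        (x2 - x1) / frame_w,      # relative width
        (y2 - y1) / frame_h,      # relative height
    )


def contrast_affinities(affinities):
    """Softmax over the candidates in one frame, so each candidate's
    audio-visual affinity is expressed relative to the others."""
    m = max(affinities)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in affinities]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_smooth(scores, radius=2):
    """Centered moving average over one candidate's per-frame scores.
    The window is clipped at the start and end of the sequence."""
    smoothed = []
    for t in range(len(scores)):
        start = max(0, t - radius)
        stop = min(len(scores), t + radius + 1)
        smoothed.append(sum(scores[start:stop]) / (stop - start))
    return smoothed
```

For example, `contrast_affinities([2.0, 0.0, 0.0])` assigns most of the probability mass to the first candidate, and `temporal_smooth` spreads an isolated one-frame spike across its neighbors, which loosely mirrors the idea of smoothing out local uncertainties.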


