LoCoNet: Long-Short Context Network for Active Speaker Detection

01/19/2023
by Xizi Wang, et al.

Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. ASD reasons over audio and visual information from two contexts: long-term intra-speaker context and short-term inter-speaker context. Long-term intra-speaker context models the temporal dependencies of the same speaker, while short-term inter-speaker context models the interactions of speakers in the same scene. The two contexts are complementary and together help infer the active speaker. Motivated by these observations, we propose LoCoNet, a simple yet effective Long-Short Context Network that models both. We use self-attention to model long-term intra-speaker context because it is effective at modeling long-range dependencies, and convolutional blocks, which capture local patterns, to model short-term inter-speaker context. Extensive experiments show that LoCoNet achieves state-of-the-art performance on multiple datasets, achieving an mAP of 95.2 [...], 68.1 [...], and 59.7 [...]. When multiple speakers are present, or when the face of the active speaker is much smaller than other faces in the same scene, LoCoNet outperforms previous state-of-the-art methods by 3.4 [...]. Code: https://github.com/SJTUwxz/LoCoNet_ASD.
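
The two context types described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The feature layout `(speakers, frames, channels)`, the function names, the kernel size, the residual connections, and the order of the two steps are all assumptions made for illustration. It shows only the core idea: attention runs along the time axis of each speaker separately, while a small convolution mixes neighbouring speakers and nearby frames.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def long_term_intra_speaker(x, wq, wk, wv):
    """Single-head self-attention over time, per speaker (hypothetical helper).

    x: (S, T, D) features for S speakers, T frames, D channels.
    Each speaker attends over all T of its own frames and never sees other speakers.
    """
    q, k, v = x @ wq, x @ wk, x @ wv                           # each (S, T, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])   # (S, T, T)
    return softmax(scores, axis=-1) @ v                        # (S, T, D)

def short_term_inter_speaker(x, kernel):
    """Zero-padded 2D convolution over the speaker and time axes (hypothetical helper).

    x: (S, T, D); kernel: (ks, kt, D, D_out) with odd ks and kt.
    Each output mixes up to ks speakers across a window of kt frames.
    """
    S, T, _ = x.shape
    ks, kt = kernel.shape[:2]
    ps, pt = ks // 2, kt // 2
    padded = np.pad(x, ((ps, ps), (pt, pt), (0, 0)))
    out = np.zeros((S, T, kernel.shape[3]))
    for i in range(ks):
        for j in range(kt):
            out += padded[i:i + S, j:j + T] @ kernel[i, j]
    return out

def long_short_block(x, wq, wk, wv, kernel):
    # Assumed composition: local inter-speaker mixing, then long-range attention,
    # each wrapped in a residual connection.
    x = x + short_term_inter_speaker(x, kernel)
    x = x + long_term_intra_speaker(x, wq, wk, wv)
    return x

# Toy usage with random weights: 3 speakers, 50 frames, 16 channels.
rng = np.random.default_rng(0)
S, T, D = 3, 50, 16
x = rng.normal(size=(S, T, D))
wq, wk, wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
kernel = rng.normal(size=(3, 7, D, D)) * 0.1
y = long_short_block(x, wq, wk, wv, kernel)
print(y.shape)  # (3, 50, 16)
```

In this sketch the attention step has a receptive field covering the whole clip but only one speaker, whereas the convolution sees several speakers but only a 7-frame window, which mirrors the long-term/short-term split described above.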

research
05/20/2020

Active Speakers in Context

Current methods for active speaker detection focus on modeling short-te...
research
07/14/2021

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

Active speaker detection (ASD) seeks to detect who is speaking in a visu...
research
01/20/2018

The effects of anger on automated long-term-spectra based speaker-identification

Forensic speaker identification has traditionally considered approaches ...
research
07/15/2022

Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection

Active speaker detection (ASD) in videos with multiple speakers is a cha...
research
08/05/2021

UniCon: Unified Context Network for Robust Active Speaker Detection

We introduce a new efficient framework, the Unified Context Network (Uni...
research
12/02/2021

Learning Spatial-Temporal Graphs for Active Speaker Detection

We address the problem of active speaker detection through a new framewo...
research
03/24/2018

Multi-range Reasoning for Machine Comprehension

We propose MRU (Multi-Range Reasoning Units), a new fast compositional e...
