AVA-AVD: Audio-visual Speaker Diarization in the Wild

11/29/2021
by Eric Zhongcong Xu, et al.

Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets focus mainly on indoor environments such as meeting rooms or news studios, which differ considerably from in-the-wild videos such as movies, documentaries, and audience sitcoms. To create a testbed that can effectively compare diarization methods on videos in the wild, we annotate speaker diarization labels on the AVA movie dataset and create a new benchmark called AVA-AVD. This benchmark is challenging due to its diverse scenes, complicated acoustic conditions, and completely off-screen speakers. How to handle off-screen and on-screen speakers together remains a critical challenge. To address it, we propose a novel Audio-Visual Relation Network (AVR-Net), which introduces an effective modality mask to capture discriminative information based on visibility. Experiments show that our method not only outperforms state-of-the-art methods but is also more robust as the ratio of off-screen speakers varies. Ablation studies demonstrate the advantages of the proposed AVR-Net, and especially of the modality mask, for diarization. Our data and code will be made publicly available at https://github.com/zcxu-eric/AVA-AVD.
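The abstract does not specify how the modality mask is built, so the following is only a minimal sketch of the general idea, not the AVR-Net implementation. It assumes each speech segment has a voice embedding that is always available and a face embedding that is missing when the speaker is off-screen. The names `Segment` and `masked_features` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One speech segment (hypothetical container, not from the paper)."""
    audio: List[float]           # voice embedding, always available
    face: Optional[List[float]]  # face embedding, None when the speaker is off-screen


def masked_features(seg: Segment, face_dim: int) -> List[float]:
    """Concatenate audio and face embeddings plus a visibility flag.

    Assumption: a missing face is replaced by zeros, and a trailing
    1.0/0.0 flag tells a downstream model whether the visual part
    carries real information or is only padding.
    """
    visible = seg.face is not None
    face = seg.face if visible else [0.0] * face_dim
    mask = [1.0 if visible else 0.0]
    return seg.audio + face + mask


on_screen = Segment(audio=[0.1, 0.2], face=[1.0, 2.0, 3.0])
off_screen = Segment(audio=[0.1, 0.2], face=None)

print(masked_features(on_screen, face_dim=3))
print(masked_features(off_screen, face_dim=3))
```

With an explicit visibility flag, a model comparing two segments can in principle learn to rely on the audio alone whenever one of the faces is missing, instead of treating zero padding as a real face.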


