Who said that?: Audio-visual speaker diarisation of real-world meetings

06/24/2019
by Joon Son Chung, et al.

The goal of this work is to determine 'who spoke when' in real-world meetings. The method takes surround-view video and single or multi-channel audio as inputs, and generates robust diarisation outputs. To achieve this, we propose a novel iterative approach that first enrolls speaker models using audio-visual correspondence, then uses the enrolled models together with the visual information to determine the active speaker. We show strong quantitative and qualitative performance on a dataset of real-world meetings. The method is also evaluated on the public AMI meeting corpus, on which we demonstrate results that exceed all comparable methods. We also show that beamforming can be used together with the video to further improve the performance when multi-channel audio is available.
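
The following is a minimal sketch of the two-stage idea described in the abstract, not the authors' implementation. It assumes hypothetical data structures and helpers (per-segment audio embeddings, per-track audio-visual sync scores and lip-motion scores, and a cosine similarity): stage 1 enrols a speaker model for each visible face track from segments where audio-visual correspondence is high; stage 2 assigns every speech segment to the enrolled speaker whose model best matches, fusing the audio similarity with a visual activity cue.

```python
# Hypothetical sketch of iterative audio-visual diarisation (not the paper's code).
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def diarise(segments, face_tracks, sync_threshold=0.7):
    """segments: list of dicts with 'audio_embedding' (np.ndarray) plus
    per-track 'sync_scores' and 'lip_motion' dicts; face_tracks: track ids.
    All field names are illustrative assumptions."""
    # Stage 1: enrol a speaker model per face track using audio-visual correspondence.
    enrolled = {}
    for track in face_tracks:
        confident = [s['audio_embedding'] for s in segments
                     if s['sync_scores'][track] > sync_threshold]
        if confident:
            # Speaker model = mean embedding over high-sync segments.
            enrolled[track] = np.mean(confident, axis=0)

    # Stage 2: assign every segment with the enrolled models plus visual cues.
    labels = []
    for s in segments:
        scores = {}
        for track, model in enrolled.items():
            audio_sim = cosine(s['audio_embedding'], model)
            visual = s['lip_motion'][track]        # visual speaking-activity cue
            scores[track] = audio_sim + visual     # simple late fusion
        labels.append(max(scores, key=scores.get) if scores else None)
    return labels
```

In a multi-channel setting, a beamformed audio signal could replace the single-channel input before the embeddings are extracted, which is one way to read the paper's observation that beamforming and video together further improve performance.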

Related research

04/07/2023 · Margin-Mixup: A Method for Robust Speaker Verification in Multi-Speaker Audio
This paper is concerned with the task of speaker verification on audio w...

05/22/2023 · Target Active Speaker Detection with Audio-visual Cues
In active speaker detection (ASD), we would like to detect whether an on...

05/31/2017 · Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers
In this paper, we present a system that associates faces with voices in ...

11/16/2022 · Exploring Detection-based Method For Speaker Diarization @ Ego4D Audio-only Diarization Challenge 2022
We provide the technical report for Ego4D audio-only diarization challen...

10/26/2022 · Deep Learning Based Audio-Visual Multi-Speaker DOA Estimation Using Permutation-Free Loss Function
In this paper, we propose a deep learning based multi-speaker direction ...

06/09/2022 · Audio-video fusion strategies for active speaker detection in meetings
Meetings are a common activity in professional contexts, and it remains ...

09/13/2023 · PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network
It is common in everyday spoken communication that we look at the turnin...
