Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion

03/31/2016
by Israel D. Gebru, et al.

Speaker diarization consists of assigning speech signals to the people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios in which several participants engaged in multi-party interaction move around and turn their heads towards the other participants rather than facing the cameras and the microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within a novel audio-visual fusion method on the following grounds: binaural spectral features are first extracted from a microphone pair, then a supervised audio-visual alignment technique maps these features onto the image, and finally a semi-supervised clustering method assigns the binaural spectral features to visible persons. The main advantage of this method over previous work is that it handles, in a principled way, speech signals uttered simultaneously by multiple persons. The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process executed at each time slice and on the dynamics of the diarization variable itself. The proposed formulation yields an efficient exact inference procedure. A novel dataset is introduced, which contains audio-visual training data as well as a number of scenarios involving several participants engaged in formal and informal dialogue. The proposed method is thoroughly tested and benchmarked against several state-of-the-art diarization algorithms.
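The exact inference the abstract refers to can be pictured as decoding over a discrete diarization variable: the audio-visual association step produces, at each time slice, a likelihood that the current speech belongs to each visible person, and a transition matrix models speech-turn dynamics. The sketch below is a hypothetical illustration of such a decoder (Viterbi over speaker states), not the paper's actual implementation; the `assoc` and `trans` inputs stand in for the fusion output and turn dynamics.

```python
import numpy as np

def diarize_exact(assoc, trans):
    """MAP decoding of per-frame speaker identities (illustrative sketch).

    assoc : (T, S) array of per-frame speech-to-person association
            likelihoods (hypothetical output of the audio-visual fusion step)
    trans : (S, S) speaker-turn transition matrix (rows sum to 1)
    Returns the most likely speaker index for each of the T frames.
    """
    T, S = assoc.shape
    log_a = np.log(assoc + 1e-12)   # work in log space for stability
    log_t = np.log(trans + 1e-12)

    delta = np.zeros((T, S))        # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_a[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_t   # (prev_state, cur_state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_a[t]

    # Backtrack the optimal speaker sequence.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

With sticky turn dynamics, noisy per-frame associations are smoothed into coherent speech turns, which is the role the temporal model plays on top of the per-slice audio-visual association.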

Related research

- 04/10/2018: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
  We present a joint audio-visual model for isolating a single speech sign...
- 08/06/2021: The Right to Talk: An Audio-Visual Transformer Approach
  Turn-taking has played an essential role in structuring the regulation o...
- 09/28/2018: Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers
  In this paper we address the problem of tracking multiple speakers via t...
- 08/17/2020: Deep Variational Generative Models for Audio-visual Speech Separation
  In this paper, we are interested in audio-visual speech separation given...
- 05/11/2022: A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection
  Audio-visual automatic speech recognition is a promising approach to rob...
- 02/23/2021: Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain
  Estimating the positions of multiple speakers can be helpful for tasks l...
- 07/27/2022: End-To-End Audiovisual Feature Fusion for Active Speaker Detection
  Active speaker detection plays a vital role in human-machine interaction...
