Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

02/23/2021
by Julio Wissing, et al.

Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique speaker identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the acoustic and the visual modality may be corrupted in specific spatial regions, for instance due to poor lighting conditions or to the presence of background noise. This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions in the localization space. This fusion is achieved via a neural network, which combines the predictions of individual audio and video trackers based on their time- and location-dependent reliability. A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
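The idea of assigning a separate stream weight to each region of the localization space can be illustrated with a small sketch. It assumes the common log-linear form of stream weighting, in which the audio estimate is raised to a weight λ and the video estimate to 1 − λ, here with one λ per grid cell. The function name `fuse_spatial_stream_weights`, the five-cell grid, and the hand-set weights are hypothetical and only for illustration; in the paper the weights are predicted by a neural network, which is not reproduced here.

```python
import numpy as np


def fuse_spatial_stream_weights(p_audio, p_video, weights, eps=1e-12):
    """Log-linear fusion of two spatial probability maps.

    p_audio, p_video: non-negative arrays of the same shape, one value per
        grid cell of the localization space (hypothetical discretization).
    weights: array of the same shape with values in [0, 1]; a value near 1
        trusts the audio stream in that cell, a value near 0 trusts video.
    Returns a fused map normalized to sum to 1.
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    weights = np.clip(np.asarray(weights, dtype=float), 0.0, 1.0)

    # Weighted sum in the log domain equals a weighted product of the maps.
    log_fused = (weights * np.log(p_audio + eps)
                 + (1.0 - weights) * np.log(p_video + eps))

    # Subtract the maximum before exponentiating for numerical stability.
    log_fused -= log_fused.max()
    fused = np.exp(log_fused)
    return fused / fused.sum()


# Toy example: a 1-D azimuth grid with five cells (assumed values).
p_audio = np.array([0.05, 0.10, 0.60, 0.20, 0.05])
p_video = np.array([0.05, 0.05, 0.10, 0.70, 0.10])
# Hand-set weights standing in for the network output.
weights = np.array([0.5, 0.5, 0.9, 0.1, 0.5])

fused = fuse_spatial_stream_weights(p_audio, p_video, weights)
print(fused)
```

Setting every weight to 1 recovers the audio map and setting every weight to 0 recovers the video map, so a single global stream weight is the special case in which all cells share the same value.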

research
03/14/2019

Audiovisual Speaker Tracking using Nonlinear Dynamical Systems with Dynamic Stream Weights

Data fusion plays an important role in many technical applications that ...
research
06/13/2019

Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments

Speech recognition in cocktail-party environments remains a significant ...
research
02/28/2020

Bio-Inspired Modality Fusion for Active Speaker Detection

Human beings have developed fantastic abilities to integrate information...
research
05/13/2021

Multi-target DoA Estimation with an Audio-visual Fusion Mechanism

Most of the prior studies in the spatial DoA domain focus on a single mo...
research
05/10/2022

Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection

Under noisy conditions, automatic speech recognition (ASR) can greatly b...
research
01/29/2020

Audio-Visual Decision Fusion for WFST-based and seq2seq Models

Under noisy conditions, speech recognition systems suffer from high Word...
research
03/31/2016

Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion

Speaker diarization consists of assigning speech signals to people engag...
