Deep Learning Based Audio-Visual Multi-Speaker DOA Estimation Using Permutation-Free Loss Function

10/26/2022
by Qing Wang, et al.

In this paper, we propose a deep learning based method for multi-speaker direction of arrival (DOA) estimation that uses audio and visual signals together with a permutation-free loss function. We first collect a dataset for multi-modal sound source localization (SSL) in which both audio and visual signals are recorded in real-life home TV scenarios. We then propose a novel spatial annotation method that produces the ground-truth DOA of each speaker from the video data by transforming between camera coordinates and pixel coordinates according to the pinhole camera model. With spatial location information serving as an additional input alongside the acoustic features, multi-speaker DOA estimation can be solved as an active speaker detection classification task. Because the location of each speaker is supplied as input, the label permutation problem that arises in multi-speaker tasks is avoided. Experiments on both simulated and real data show that the proposed audio-visual DOA estimation model outperforms the audio-only DOA estimation model by a large margin.
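The pixel-to-camera transformation mentioned in the abstract can be illustrated with a minimal sketch of the standard pinhole camera model. This is not the authors' annotation code: the function names and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are hypothetical, lens distortion is ignored, and the camera is assumed to share its origin and orientation with the microphone array.

```python
import math


def pixel_to_direction(u, v, fx, fy, cx, cy):
    """Unit direction vector, in camera coordinates, of the ray through pixel (u, v).

    Assumes an ideal pinhole camera with focal lengths (fx, fy) in pixels and
    principal point (cx, cy); x points right, y points down, z points forward.
    """
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


def pixel_to_azimuth_deg(u, fx, cx):
    """Horizontal angle in degrees of pixel column u relative to the optical axis.

    Positive angles lie to the right of the image centre. Treating this angle
    as the speaker's DOA is an assumption that holds only if the camera and
    the microphone array are co-located and aligned.
    """
    return math.degrees(math.atan2(u - cx, fx))
```

For example, with a hypothetical `fx = 1000.0` and `cx = 960.0`, a speaker whose face is centred at pixel column 960 maps to an azimuth of 0 degrees, and one at column 1960 maps to 45 degrees.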

research
02/04/2023

AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis

Human perception of the complex world relies on a comprehensive analysis...
research
06/24/2019

Who said that?: Audio-visual speaker diarisation of real-world meetings

The goal of this work is to determine 'who spoke when' in real-world mee...
research
12/14/2021

Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking

Multi-modal fusion is proven to be an effective method to improve the ac...
research
05/13/2021

Multi-target DoA Estimation with an Audio-visual Fusion Mechanism

Most of the prior studies in the spatial DoA domain focus on a single mo...
research
01/25/2021

Using Angle of Arrival for Improving Indoor Localization

In this paper, we primarily explore the improvement of single stream aud...
research
09/15/2023

A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism

We introduce a distinctive real-time, causal, neural network-based activ...
research
12/18/2018

Audiovisual speaker diarization of TV series

Speaker diarization may be difficult to achieve when applied to narrativ...
