Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-based Multimodal Fusion

06/07/2021
by Baptiste Pouthier, et al.

A variety of studies have established that combining video and audio data significantly benefits active speaker detection. However, either modality can mislead audiovisual fusion by introducing unreliable or deceptive information. This paper frames active speaker detection as a multi-objective learning problem that leverages the best of each modality through a novel self-attention, uncertainty-based multimodal fusion scheme. Results show that the proposed multi-objective learning architecture outperforms traditional approaches, improving both mAP and AUC scores. We further demonstrate that, for active speaker detection, our fusion strategy surpasses other modality fusion methods reported in various disciplines. Finally, we show that the proposed method significantly improves the state of the art on the AVA-ActiveSpeaker dataset.

