Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking

12/14/2021
by Yidi Li, et al.

Multi-modal fusion has proven to be an effective way to improve the accuracy and robustness of speaker tracking, especially in complex scenarios. However, how to combine heterogeneous information and exploit the complementarity of multi-modal signals remains a challenging issue. In this paper, we propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities. Specifically, a novel acoustic map based on the spatial-temporal Global Coherence Field (stGCF) is first constructed for heterogeneous signal fusion, which employs a camera model to map audio cues into a localization space consistent with the visual cues. A multi-modal perception attention network is then introduced to derive perception weights that measure the reliability and effectiveness of intermittent audio and video streams disturbed by noise. Moreover, a unique cross-modal self-supervised learning method is presented to model the confidence of audio and visual observations by leveraging the complementarity and consistency between different modalities. Experimental results show that the proposed MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively, which demonstrates its robustness under adverse conditions and outperforms the current state-of-the-art methods.
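The abstract describes two fusion steps: projecting stGCF audio cues into the same localization space as the visual cues, and weighting each modality's cue map by a learned perception (reliability) weight before fusion. The sketch below illustrates only the second step, and only as a minimal stand-in: the module name PerceptionAttention, the pooling-plus-MLP scoring, and the softmax-normalized weighting are assumptions for illustration, not the authors' architecture.

```python
# Illustrative sketch only: per-modality "perception attention" fusion, assuming
# the audio cue is an stGCF-style localization map and the visual cue is a
# confidence map of the same spatial size. All names and design choices here
# are hypothetical and do not reproduce the MPT implementation.
import torch
import torch.nn as nn


class PerceptionAttention(nn.Module):
    """Predicts a reliability weight for each modality's cue map, then fuses them."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        # Each cue map is summarized by its global mean and max; a tiny MLP
        # scores how trustworthy that modality currently appears to be.
        self.score = nn.Sequential(
            nn.Linear(2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, audio_map: torch.Tensor, visual_map: torch.Tensor):
        # audio_map, visual_map: (B, H, W) localization/confidence maps.
        weights = []
        for cue in (audio_map, visual_map):
            stats = torch.stack(
                [cue.mean(dim=(1, 2)), cue.amax(dim=(1, 2))], dim=-1
            )  # (B, 2) summary statistics per modality
            weights.append(self.score(stats))  # (B, 1) raw reliability score
        w = torch.softmax(torch.cat(weights, dim=-1), dim=-1)  # (B, 2), sums to 1
        fused = w[:, 0, None, None] * audio_map + w[:, 1, None, None] * visual_map
        return fused, w


if __name__ == "__main__":
    att = PerceptionAttention()
    audio = torch.rand(1, 36, 48)   # e.g. an stGCF map projected onto the image plane
    visual = torch.rand(1, 36, 48)  # e.g. a visual tracker confidence map
    fused, w = att(audio, visual)
    print(fused.shape, w)           # torch.Size([1, 36, 48]) plus per-modality weights
```

In a noise-robust tracker, the appeal of such weighting is that a corrupted modality (e.g. audio during silence, or video under occlusion) receives a low weight and contributes less to the fused localization map.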

Related research

08/16/2023
SCANet: A Self- and Cross-Attention Network for Audio-Visual Speech Separation
The integration of different modalities, such as audio and visual inform...

05/04/2021
Where and When: Space-Time Attention for Audio-Visual Explanations
Explaining the decision of a multi-modal decision-maker requires to dete...

08/06/2021
The Right to Talk: An Audio-Visual Transformer Approach
Turn-taking has played an essential role in structuring the regulation o...

10/26/2022
Deep Learning Based Audio-Visual Multi-Speaker DOA Estimation Using Permutation-Free Loss Function
In this paper, we propose a deep learning based multi-speaker direction ...

09/18/2023
Concurrent Haptic, Audio, and Visual Data Set During Bare Finger Interaction with Textured Surfaces
Perceptual processes are frequently multi-modal. This is the case of hap...

05/29/2020
Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data
Recognizing sounds is a key aspect of computational audio scene analysis...

05/13/2021
Multi-target DoA Estimation with an Audio-visual Fusion Mechanism
Most of the prior studies in the spatial DoA domain focus on a single mo...
