Target speech separation aims to extract the speech of interest from an observed speech mixture. In the speech processing literature, target speech separation has attracted tremendous interest for decades. In the deep learning era, most existing supervised approaches are based on spectrogram masking [3, 4, 5, 6, 7], where the weight (mask) of the target speaker at each time-frequency (T-F) bin of the mixture spectrogram is estimated. The element-wise product between the mixture spectrogram and the predicted mask then serves as the target speech spectrogram. However, these approaches use only audio information (termed audio-only approaches) and often suffer from intense interference in complex acoustic environments with noise and reverberation.
Recently, incorporating visual information into the speech separation system has become an emerging research direction for improving robustness and separation accuracy [8, 9, 10, 11]. The rationale is mainly twofold: 1) the visual information (e.g., lip movements, face embeddings) is usually not affected by the acoustic environment; 2) visual information has been proven able to provide additional speech- and speaker-related cues. For example, speech content can be interpreted from lip movements [12, 13], which helps to improve the speech reconstruction quality. Moreover, the face indicates the speaker identity.
Besides the visual information, the feature representation vector of the speaker, termed the speaker embedding, has also proved effective for extracting the target speaker's speech from the mixture signal [16, 17, 18, 19]. Therefore, it is a promising direction to leverage the correlation and complementarity between different kinds of target speaker information to enhance the performance of target speech separation.
The majority of previous multi-modal methods are established for monaural speech separation [8, 9, 10, 11, 21] and achieve state-of-the-art results on close-talk audio-visual speech separation datasets. In this work, aiming to enhance the robustness and separation accuracy of far-field target speech separation, we present a general multi-modal framework. The framework integrates multi-modal separation cues extracted from the multi-channel speech mixture, the target speaker's lip movements and an enrollment utterance. The idea is that the acoustic target information can be blurry in challenging acoustic environments, while the other modalities can provide complementary and steady information to increase robustness. We also investigate efficient multi-modality aggregation methods under this framework. A factorized attention-based aggregation method is proposed for fusing the high-level semantic information of multiple modalities at the embedding level. Finally, we address the modality robustness problem that arises when one of the modalities is temporarily noisy or unavailable.
In summary, this work makes three main contributions: 1) we introduce a multi-modal target speech separation framework that fully exploits the target information, including directional information, lip movements and voice characteristics; to the best of our knowledge, this work is the first to integrate multiple modalities for far-field target speech separation; 2) under the proposed framework, we investigate and propose several multi-modality fusion methods for the target speech separation task; 3) experiments demonstrate the robustness of the proposed framework to interference caused by modality absence or noise.
II Related Work
In this section, we review related work in two areas: audio-only speech separation and audio-visual speech separation.
II-A Audio-only speech separation
Audio-only speech separation is extremely challenging in the single-microphone speaker-independent scenario, where no prior speaker information is available during evaluation. The majority of audio-only methods are based on spectrogram masking. Deep clustering [6] designs a permutation invariant loss to reasonably assign the estimated masks to the reference speech during training. Lately, Luo et al. proposed the fully convolutional time-domain audio separation network (Conv-TasNet) to separate the speech mixture in the time domain. It avoids the phase reconstruction problem of spectrogram masking based methods and achieves state-of-the-art performance. When a multi-channel speech signal is available, microphone array based signal processing techniques can be leveraged to further enhance the separation performance. Well-established spatial features, e.g., the inter-channel phase difference (IPD), have proven especially useful when combined at the input level with spectrogram masking based methods [23, 24, 25].
Moreover, elaborately designed directional features that indicate a directional source's dominance in each T-F bin further improve the separation performance [26, 27]. Also, the separated speech can be associated with its corresponding directional feature, which enables target speech separation. However, these spatial cues extracted from the multi-channel signal suffer from the spatial ambiguity issue, which occurs when simultaneous speech comes from close directions and renders the directional features less discriminative. In this case, if the target speaker separation network is conditioned only on directional information, it becomes uncertain about which speaker needs to be separated.
Apart from the directional information, target speech separation can also benefit from prior knowledge of the speakers [18, 17, 16]. The speaker embedding represents the speaker's voice characteristics and is usually extracted from an enrollment audio clip with a pre-trained neural network. With the aid of the speaker embedding (or a speaker one-hot vector or speaker posterior), the separation network learns to extract and follow the target speaker over different frames. Furthermore, in addition to the speaker embedding of the target speaker, the embeddings of possible interfering speakers have also been utilized to promote discrimination between speakers. However, these methods have only been proven effective on close-talk corpora.
II-B Audio-visual target speech separation
Multi-sensory integration using neural networks for acoustic scene perception has gained increasing interest in recent years. The studied areas include speech recognition, lip reading (predicting speech from silent video), and acoustic event detection and localization. In particular, the audio-visual speech separation task and lip reading are closely linked. Gabbay et al. explore the correlation between a speaker's lip movements and the speech spectrogram and propose a video-to-sound method. However, it is a speaker-dependent approach, since the video-to-sound model is trained separately for each speaker. It is also purely visually driven and does not employ the speech mixture signal. Later, a large-scale audio-visual English dataset, AVSpeech, was introduced for training speaker-independent models, whose authors propose to jointly model the acoustic and visual components by making use of the speech mixture and the speaker's face embedding; complex masks serve as the separation target for improving phase reconstruction. A similar audio-visual framework has been designed in which lip movements serve as the visual information. These two approaches generalize well to real-world samples and unseen languages given consistent video and audio input. Recently, Afouras et al. address the video obstruction problem that arises when a speaker's lips are occluded by, e.g., a microphone. To solve this problem, they combine the visual input with the speaker embedding of the target speaker. Therefore, when the speaker's mouth is occluded, the voice characteristics of the target speaker can be relied on to compensate for the missing target information. This approach is robust to partial video occlusions, and hence promising for practical applications. Wu et al.
develop a time-domain audio-visual speech separation system in which the short-time Fourier transform (STFT) and inverse STFT (iSTFT) are replaced with a linear encoder and decoder. The encoded audio representation is thus formulated in the real-valued domain, avoiding the complex phase estimation problem.
III Multi-Modal Multi-Channel Separation
In this work, we address the task of separating the target speaker from a multi-channel speech mixture by making use of target information from the target speaker's direction, lip movements and speaker embedding. Previous works [27, 26, 11, 9, 10, 18] have proposed to leverage part of this target information to perform the separation. As discussed in Section II, each kind of target-related information has benefits and limitations. The directional information is quite effective for separating spatially separated sources; however, it becomes invalid or even noisy when speakers are closely located. Although the visual information is not affected by the complex acoustic environment, the lack of visual access to the speaker's face (e.g., due to head turning or obstructions) may cause the target information to be absent. The speaker embedding works especially well for separating speakers of opposite genders; however, the discriminability of speaker embeddings needs to be ensured via pre-training on a large-scale dataset.
In this work, we integrate all the target information into one framework in order to achieve superior and robust separation performance under challenging scenarios. As illustrated in Figure 1, the proposed system is a multi-stream architecture that takes four inputs: (i) noisy multi-channel mixture waveforms, (ii) the target speaker's direction calculated by face detection, (iii) video frames of cropped lip regions, and (iv) enrollment audio(s) of the target speaker. The system directly outputs the estimated monaural target speech, while all other interfering signals are suppressed.
III-B Audio Stream
The detailed paradigm of audio stream processing is illustrated as the top stream in Figure 2. An STFT conv1d layer is used to map the multi-channel mixture waveforms to complex spectrograms. Based on the complex spectrograms, a single-channel spectral feature and a multi-channel spatial feature are extracted. Apart from these target-speaker-independent spectral and spatial features, a directional feature is extracted according to the spatial direction of the target speaker. All of the features are then concatenated and fed into the audio blocks, which consist of stacked dilated convolutional layers with exponentially growing dilation factors. This design supports a long receptive field to capture sufficient contextual information. The output of the audio blocks is the acoustic embedding, whose dimension equals the number of output channels of the conv1d layers. On the system output side, an iSTFT conv1d layer is used to convert the estimated target speaker complex spectrogram back to the waveform. Next, we give a detailed description of the acoustic features, including the spectral, spatial and directional features.
III-B1 Spectral feature
To obtain the spectral feature from the C-channel raw mixture waveform y, a standard STFT module is used for spectrum analysis. The STFT transforms the signal to the complex domain, where it can be decomposed into magnitude and phase components. Given a window function w(n) of length N, the multi-channel complex spectrogram calculated by the standard STFT can be written as:

Y_c(t, f) = ∑_{n=0}^{N−1} y_c(tH + n) w(n) e^{−j2πfn/N},

where c indexes the microphone channel, t and f index the frame and frequency bin, and H is the hop size.
The logarithm power spectrum (LPS) of the reference channel (the first channel in this work) serves as the spectral feature, calculated as LPS = log|Y_1|^2, where Y_1 is the first channel of the multi-channel complex spectrogram, with T frames and F frequency bands. In our implementation, the STFT operation is reformulated as a convolution kernel to enable on-the-fly computation [32, 33, 27] and speed up the separation process.
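To make the convolution reformulation concrete, the sketch below builds cosine/sine DFT kernels so that each STFT frame is obtained as a matrix product with windowed signal frames; the function names, default sizes and the sqrt-hann window are illustrative assumptions matching the settings later in this paper, not the authors' exact implementation.

```python
import numpy as np

def stft_conv_kernels(n_fft=512):
    """Build real/imag DFT kernels so the STFT becomes a strided 'convolution'.

    Each frequency bin is the dot product of a windowed frame with a cosine
    (real part) or negative-sine (imaginary part) basis vector.
    """
    window = np.sqrt(np.hanning(n_fft))                   # sqrt-hann analysis window
    freqs = np.arange(n_fft // 2 + 1)                     # 257 bins for n_fft = 512
    n = np.arange(n_fft)
    real_k = np.cos(-2 * np.pi * np.outer(freqs, n) / n_fft) * window
    imag_k = np.sin(-2 * np.pi * np.outer(freqs, n) / n_fft) * window
    return real_k, imag_k

def stft_via_matmul(signal, n_fft=512, hop=256):
    """Single-channel STFT computed with the kernels above; returns (T, F)."""
    real_k, imag_k = stft_conv_kernels(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop: t * hop + n_fft] for t in range(n_frames)])
    return frames @ real_k.T + 1j * (frames @ imag_k.T)
```

In a deep learning framework, `real_k` and `imag_k` would simply be loaded as fixed conv1d weights with stride equal to the hop size, which is what enables on-the-fly computation on GPU.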
III-B2 Spatial features
As discussed in Section II-A, well-established spatial cues like IPDs have shown great benefit for spectrogram masking based multi-channel speech separation methods [25, 24, 34, 35]. The standard IPD is computed as the phase difference between two channels of the complex spectrogram:

IPD^(m)(t, f) = ∠Y_{m1}(t, f) − ∠Y_{m2}(t, f),

where m1 and m2 are the two microphones of the m-th microphone pair and M is the number of selected microphone pairs. Note that in our experiments we do not have to use all pairs of microphones. To reduce the dimension of the spatial features, we select microphone pairs with different spacings. The M pairs of IPDs are concatenated to form the IPD feature. The IPD extracts spatial information of all speakers in the mixture, so we refer to it as a speaker-independent spatial feature.
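The IPD extraction above can be sketched in a few lines of NumPy; the function name and the (channels, time, frequency) layout are illustrative assumptions.

```python
import numpy as np

def ipd_features(spec, pairs):
    """Inter-channel phase differences for selected microphone pairs.

    spec:  (C, T, F) complex multi-channel spectrogram
    pairs: list of (m1, m2) microphone index pairs
    Returns the IPDs concatenated along the frequency axis, shape (T, M*F).
    """
    ipds = [np.angle(spec[m1]) - np.angle(spec[m2]) for (m1, m2) in pairs]
    return np.concatenate(ipds, axis=-1)
```

With the 5 pairs and 257 frequency bins used later in this paper, this yields a 5×257-dimensional spatial feature per frame.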
III-B3 Directional feature
Given the direction of the target speaker, a target-dependent directional feature can be extracted to provide explicit target information. A location-guided directional feature (DF) for speech separation has been introduced in prior work [26, 27]. The design principle is that if a T-F bin is dominated by the source from direction θ, then the DF at that bin will be close to 1, and otherwise close to 0. The DF is formed according to the direction of the target speaker and measures the cosine distance between the steering vector and the IPD:

DF_θ(t, f) = ∑_{m=1}^{M} cos( TPD_θ^(m)(t, f) − IPD^(m)(t, f) ),    (3)

where TPD_θ^(m)(t, f) = 2πf Δ_m cos θ(t) / c (the target-dependent phase difference) is the phase delay of a plane wave with frequency f, evaluated at the m-th pair of microphones, travelling from angle θ(t) (the target speaker's direction at time t); Δ_m is the distance between the m-th microphone pair and c is the sound velocity. In Eq. 3, we assume that the speakers do not change their locations while speaking, i.e., θ(t) = θ. A pre-masking step is also applied to the DF to increase the discriminability between speakers. Note that Eq. 3 is reformulated so that it can be applied to a general microphone array topology rather than the special seven-element microphone array used in previous work.
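A minimal sketch of this directional feature is given below, assuming a fixed target direction θ over the utterance and omitting the pre-masking step; the function name, argument layout and the bin-to-Hz conversion are illustrative assumptions.

```python
import numpy as np

def directional_feature(spec, theta, pairs, spacings, fs=16000, n_fft=512, c=343.0):
    """Location-guided directional feature, averaged over microphone pairs.

    spec:     (C, T, F) complex spectrogram, F = n_fft // 2 + 1
    theta:    target direction in radians (assumed constant over time)
    pairs:    list of (m1, m2) microphone index pairs
    spacings: distance in metres between the mics of each pair
    Returns a (T, F) map whose values are close to 1 in target-dominated bins.
    """
    F = spec.shape[-1]
    freqs = np.arange(F) * fs / n_fft                      # bin centre frequencies (Hz)
    df = 0.0
    for (m1, m2), d in zip(pairs, spacings):
        ipd = np.angle(spec[m1]) - np.angle(spec[m2])      # observed phase difference
        tpd = 2 * np.pi * freqs * d * np.cos(theta) / c    # plane-wave phase delay
        df = df + np.cos(tpd[None, :] - ipd)               # cosine similarity per bin
    return df / len(pairs)
```

For a noiseless plane wave arriving exactly from θ, the observed IPD equals the TPD and the feature saturates at 1, matching the design principle stated above.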
How to obtain the target speaker's direction. During training, the direction of the target speaker is known, because the multi-channel audio for training is generated by simulation (see Algorithm LABEL:alg:mm). In practice, the direction of the target speaker can be estimated by a face detection and tracking system. Alternatively, audio-based localization methods can be used to estimate the directions of multiple sound sources (with less than 10 degrees of mean absolute error). However, it remains uncertain which sound direction corresponds to which speaker. To address this issue, an additional speaker recognition system would be required. The drawbacks of introducing such a system are: 1) an extra enrollment process is required; 2) compared to the performance of face recognition systems, the performance of state-of-the-art speaker verification systems still lags far behind. Thus, for real-recorded samples, we use the face detection method to identify and track the target speaker in the video and estimate his/her direction based on the camera position. Since visual information is not affected by the acoustic environment, face detection based speaker localization is more robust for our task. The details of face detection, recognition, tracking and speaker diarization are beyond the scope of this paper.
III-C Video Stream
For the video stream, the majority of previous audio-visual speech separation approaches [10, 9, 11, 21] adopt a pre-training strategy. Before joint training with the audio stream, they first train the video stream, called the lip reading network, with a lip reading objective. The input of the lip reading network can be either a sequence of images of cropped lip regions or the face embedding of the target speaker. The network is trained to estimate word-level or phone-level posteriors [12, 13]. The supervision information is formed from the speech transcriptions.
In this work, we aim to separate the speech of Mandarin speakers. Since few Mandarin lipreading datasets are available to train a lip reading network, we investigate the effect of jointly training both the video and audio streams from scratch, using only the speech separation objective function (see Section III-F). As shown in Figure 2 (the middle stream), we follow the work in [21, 10] and take grayscale frames as the input to the lip reading network. The structure of our lip reading network consists of a spatio-temporal convolution layer and an 18-layer ResNet, to capture the spatio-temporal dynamics of the lip movements. The lip reading network is followed by several video blocks, each of which contains several dilated temporal convolutional layers with residual connections. ReLU and batch normalization are also included in each block. The output of the video blocks is the lip embedding sequence, with one embedding per video frame. Since the time resolution of the video and audio streams differs, we upsample the lip embeddings to synchronize the two streams using nearest neighbor interpolation, where the interpolated value at a query point is the value at the nearest sample point.
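The nearest-neighbor upsampling used to synchronize the two streams can be sketched as follows; the floor-based index mapping is one plausible implementation choice, not necessarily the authors' exact one.

```python
import numpy as np

def upsample_nearest(lip_emb, t_audio):
    """Repeat each video-frame embedding to match the audio time steps.

    lip_emb: (T_v, D_v) lip embeddings at video rate
    t_audio: target number of audio frames (t_audio >= T_v)
    Returns a (t_audio, D_v) array.
    """
    t_v = lip_emb.shape[0]
    # index of the nearest video frame for each audio frame
    idx = np.minimum((np.arange(t_audio) * t_v / t_audio).astype(int), t_v - 1)
    return lip_emb[idx]
```

With 25 fps video and 16 ms audio frames, each video frame is repeated roughly 2.5 times on average.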
Since the supervision information comes from the audio domain, the video stream is driven to discover cross-domain correlations between the target speech and the lip movements. One evident correlation is between the opening/closing of the mouth and voice activity: when a person's mouth is continuously moving, there is a strong likelihood that he/she is speaking. Another, less evident, correlation is between specific patterns of mouth movements and phones. Since there is no supervision for a lipreading objective, the learned lip embeddings may not discriminate all phones well enough. However, the network may still have the potential to learn phone clusters with distinct inter-cluster differences.
To intuitively observe the patterns of lip embeddings learned through joint training of the audio and video streams, Figure 3 visualizes the lip embeddings obtained from a sample of the Mandarin-mix dataset. Comparing the LPS with the extracted lip embeddings of the target speaker, it is clear that the beginning and ending points of speech content in continuous speech can be inferred from the lip embeddings. Also, the lip embeddings of the target speech and those of the interfering speech exhibit different selection and emphasis along the embedding dimensions.
Furthermore, Figure 4 visualizes the t-distributed stochastic neighbor embedding (t-SNE) of lip embeddings collected from 40 lip videos. These lip embeddings clearly form natural clusters, which indicates the existence of mutual information across video frames.
III-D Speaker Embedding
As discussed in Section II-A, the speaker embedding is a kind of bias signal that informs the separation network of the target information and enables target speaker separation. Here, we introduce a pre-trained speaker model and utilize its produced embedding to characterize the target speaker. The speaker model was pre-trained on a speaker verification task, with four convolution layers followed by a fully connected layer. To achieve more discriminative speaker embeddings, self-attention is adopted as the frame-level feature aggregation strategy. The input to the speaker model is an enrollment utterance of the target speaker. The speaker model outputs a single utterance-level speaker embedding. To match the time steps of the audio stream, the speaker embedding is tiled along the time axis.
III-E Multi-modality Fusion
As described in the sections above, three kinds of target information are derived from a set of media sources: acoustic embeddings from the multi-channel speech, lip embeddings from the video, and the speaker embedding from the target speaker's enrollment utterance. To learn effective target speech extraction from multi-modal information, in this section we describe and discuss the investigated methods for fusing these modalities.
III-E1 Concatenation
The most common approach to integrating the multi-modal embeddings is to simply concatenate them along the feature axis. This fusion method has been widely used in previous audio-visual speech separation works [9, 10, 11]. The subsequent network is expected to automatically learn the interaction between the cross-domain embeddings. In this way, all modalities are treated equally, and the potential correlations between modalities may not be effectively exploited.
III-E2 Factorized Attention
In recent speech recognition work, a factorized layer has been proposed for fast adaptation to the acoustic context. In the speech recognition literature, a factor characterizes a set of speakers or a specific acoustic environment. The factorized layer uses a separate set of parameters to process each acoustic class, and these parameters depend on external factors that represent the acoustic conditions.
Inspired by this, we propose to factorize the acoustic embeddings into a set of acoustic subspaces (e.g., phone subspaces, speaker subspaces) and utilize information from other modalities to aggregate them with selective attention. The other modalities can also provide information related to the acoustic condition, such as voice activity interpreted from the opening and closing of the mouth, and voice characteristics contained in the speaker embedding.
Specifically, we take audio-visual fusion as an example, illustrated in Figure 5. First, the acoustic embeddings are factorized into K different acoustic subspaces with parallel linear transformations, where K is the number of subspaces; the acoustic representation in the k-th subspace at the t-th time step is denoted as a_t^k. Then, the lip embeddings are mapped from the lip embedding space to a K-dimensional space, where the k-th dimension is expected to contain bias information corresponding to the k-th acoustic subspace. Next, these mapped lip embeddings are passed to a softmax layer, producing the estimated posterior w_t^k for each subspace at each time step. Finally, the fused audio-visual embedding (AVE) is obtained by summing up the weighted contributions of the different acoustic subspaces:

AVE_t = ∑_{k=1}^{K} w_t^k σ(a_t^k),

where σ is the sigmoid activation function. As for using factorized attention for acoustic and speaker embedding fusion, the audio-speaker embedding (ASE) can be computed analogously, with the subspace posteriors obtained as softmax(W_s e), where W_s is the weight matrix that converts the speaker embedding e from the speaker space to the acoustic subspaces.
Compared to direct concatenation, factorized attention sums over all possible speakers or acoustic contexts, guided by cross-modal information. The interaction of embeddings of different modalities in various subspaces enables deep semantic information to be captured and selected.
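The factorized attention fusion described above can be sketched as follows, assuming the equation form given earlier (softmax subspace posteriors weighting sigmoid-activated subspace projections); the weight shapes and function names are illustrative, not the authors' exact parameterisation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def factorized_attention(a, v, W_a, W_v):
    """Fuse acoustic and lip embeddings via factorized attention.

    a:   (T, D_a) acoustic embeddings
    v:   (T, D_v) lip embeddings, already upsampled to T audio steps
    W_a: (K, D_a, D_k) parallel projections into K acoustic subspaces
    W_v: (D_v, K) projection of lip embeddings to subspace logits
    Returns the fused (T, D_k) audio-visual embedding.
    """
    subspaces = np.einsum('td,kde->tke', a, W_a)   # (T, K, D_k) subspace reps
    weights = softmax(v @ W_v, axis=-1)            # (T, K) subspace posteriors
    return np.einsum('tk,tke->te', weights, sigmoid(subspaces))
```

Replacing `v @ W_v` with a time-tiled `W_s @ e` of the speaker embedding gives the corresponding audio-speaker fusion under the same assumptions.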
III-E3 Rule-based Attention
The motivation for fusing multiple modalities with attention is that the effectiveness and significance of each modality depend on the scenario. For example, when the speakers come from close directions, the discriminability of the spatial and directional features may be weaker. In general, our strategy is to foster strengths and circumvent weaknesses among features of different modalities; the network should selectively attend to discriminative modalities and ignore the others. Following our previous work, we compute the attention using prior knowledge of the angle difference between speakers. Specifically, when the angle difference between speakers is small, the weight score applied to the spatial and directional features is relatively low, calculated as:

s = σ(α · Δθ + β),

where s is a sigmoid score denoting how much emphasis should be put on the spatial features and the directional feature, Δθ is the angle difference between the target and interfering speakers, and α and β are trainable parameters. Note that the rules can take other factors into consideration, such as whether the face is sufficiently frontal-facing.
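A minimal sketch of such a rule-based gate is shown below. The parameter values here are hypothetical, chosen only so that a small angle difference yields a low weight; in the paper these parameters are trainable and their exact parameterisation and initial values may differ.

```python
import numpy as np

def spatial_feature_weight(angle_diff_deg, alpha=0.5, beta=-10.0):
    """Sigmoid score in (0, 1) gating the spatial/directional features.

    alpha and beta are trainable in the paper; the defaults here are
    hypothetical, set so that closely located speakers (small angle
    difference) de-emphasise the less discriminative spatial cues.
    """
    return 1.0 / (1.0 + np.exp(-(alpha * angle_diff_deg + beta)))
```

The resulting scalar would multiply the spatial and directional feature maps before they enter the separation network.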
III-E4 Fusion of three modalities
To reduce the learning difficulty, we adopt a hierarchical fusion strategy for the fusion of three modalities. Specifically, the three-modality fusion is divided into two stages, proceeding from unimodal to bimodal embeddings and then from bimodal to trimodal embeddings. Different fusion methods can be adopted at each stage. For example, the acoustic and speaker embeddings are first fused using the factorized attention method; the fused audio-speaker embedding (ASE) is then concatenated with the lip embeddings to form the trimodal embedding. The details are described in Section V.
III-F End-to-End Training
The fusion blocks are followed by a conv1d layer and a nonlinear activation function (the rectified linear unit (ReLU) in this work), which produces the estimated magnitude mask for the target speech. The estimated target speech complex spectrogram is then obtained by multiplying the reference channel of the mixture complex spectrogram by the estimated mask. Finally, the iSTFT operation converts the estimated target speech spectrogram back to the waveform.
To optimize the network end to end, instead of using a time-domain mean squared error (MSE) loss, the speech separation metric scale-invariant signal-to-distortion ratio (SI-SDR) is used to directly optimize the separation performance, since it has been proven better suited for speech separation. The SI-SDR is defined as:

SI-SDR = 10 log10( ‖αs‖^2 / ‖αs − ŝ‖^2 ),  with  α = (ŝᵀs) / ‖s‖^2,

where s and ŝ are the reverberant clean and estimated target speech waveforms, respectively. Zero-mean normalization is applied to s and ŝ to guarantee scale invariance.
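The SI-SDR loss above can be implemented in a few lines; this sketch assumes the standard scale-invariant definition (optimal projection of the estimate onto the reference after zero-mean normalization).

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between an estimate and a reference signal.

    Both signals are zero-mean normalised; the reference is then scaled by the
    optimal projection coefficient so the metric ignores overall gain.
    """
    est = estimate - estimate.mean()
    ref = reference - reference.mean()
    alpha = np.dot(est, ref) / np.dot(ref, ref)   # optimal scaling of the target
    target = alpha * ref
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```

For training, the negative of this quantity (averaged over the batch) would be minimised; note that scaling the estimate by any nonzero constant leaves the value unchanged.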
IV Experimental Procedures
The audio-visual corpus used for the experiments is collected from YouTube, in which Mandarin accounts for the vast majority. To select relatively high-quality videos, a signal-to-noise ratio (SNR) estimator is used to filter out videos with low-SNR speech, and a face detection model is used to further remove videos without the speaker's face. After selection, there are about 1,000 speakers and 53,000 clean utterances in total. A mouth region detection program is run on the target speaker's video to capture the lip movements. The sampling rates for audio and video are 16 kHz and 25 fps, respectively.
The multi-talker multi-channel mixtures are simulated following the steps in Algorithm LABEL:alg:mm. The simulated dataset contains 160,000, 15,000 and 1,200 multi-channel noisy and reverberant mixtures for training, validation and testing, respectively. The speakers in the training set and test set do not overlap, which means our approach is evaluated in a speaker-independent scenario. The duration of each utterance ranges from 1.0 to 15 seconds, and the average duration is about 4.5 s. We use a 9-element non-uniform linear array with spacings of 4-3-2-1-1-2-3-4 cm, as shown in Figure 6. The multi-channel audio signals are generated by convolving single-channel signals with room impulse responses (RIRs) simulated by the image-source method. The room size ranges from 4m-4m-2.5m to 10m-8m-6m (length-width-height). The speakers and the microphone array are randomly located in the room, at least 0.3 m away from the walls. The distance between a speaker and the microphones ranges from 1 m to 5 m. The reverberation time T60 is sampled in the range of 0.05 s to 0.7 s. The signal-to-interference ratio (SIR) ranges from -6 to 6 dB. Also, noise with an SNR of 18-30 dB is added to all the multi-channel speech mixtures.
To evaluate system performance on both non-overlapped and overlapped speech, we consider three scenarios for synthetic example generation: 1 speaker, 2 speakers and 3 speakers, which respectively account for 49%, 30% and 21% of the test dataset. For the overlapped speech of the 2- and 3-speaker cases, samples with angle differences of 0-15, 15-45, 45-90 and 90-180 degrees respectively account for 16%, 19%, 11% and 5% of the test dataset, where the angle difference is defined as the smallest angular difference between the target speaker and any interfering speaker.
The data will be released, and more details will be described in future work.
Audio. For the short-time Fourier transform (STFT) setting, we use a 32 ms sqrt-hann window and 16 ms hop size. Therefore, the frame size and shift are 512 and 256 points, respectively. A 512-point FFT is used to extract the 257-dimensional LPS. The LPS is computed from the first-channel waveform of the speech mixture. IPDs are extracted between 5 microphone pairs: (1, 9), (1, 5), (2, 5), (5, 7) and (5, 6). These pairs are selected so that different spacings between microphones are sampled. For calculating the DF, we use the same microphone pairs for the TPDs. During both training and evaluation, the ground-truth target speaker direction is used for computing the DF. The total dimension of the acoustic features is 7×257 = 1799. The dimension of the acoustic embedding is 256 in all experiments.
Lip video. Each input frame of the video is grayscale with a size of 112×112×1 (height×width×channel). The dimension of the lip embeddings is the same in all experiments, i.e., 256.
Enrollment. For each speaker, there are about 10 utterances for enrollment on average (about 30-40 seconds). The overall speaker embedding is obtained by averaging all the utterance-level speaker embeddings. The dimension of the speaker embedding is 128.
IV-C Network structure
Audio processing. After concatenation, the spectral and spatial features are fed into the subsequent audio blocks. The design of these blocks follows the second version of Conv-TasNet, as illustrated in Figure 7. The number of channels in the 1×1-conv layers is set to 256. For the depthwise convolution layers, the kernel size is 3 with 512 channels. Batch normalization is adopted instead of global layer normalization, considering the processing speed. Every 8 convolutional blocks are packed as a repeat, with exponentially increasing dilation factors 1, 2, 4, ..., 128.
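The receptive field produced by these exponentially dilated blocks can be computed directly; the sketch below uses the hyperparameters above (kernel size 3, 8 blocks per repeat, dilations 1 to 128) and assumes, for illustration, 3 repeats as used for the fusion blocks.

```python
def receptive_field(repeats=3, blocks_per_repeat=8, kernel_size=3):
    """Receptive field (in frames) of stacked dilated convolution blocks.

    Dilation doubles within each repeat (1, 2, 4, ..., 2**(blocks-1)) and then
    resets; each kernel of size k and dilation d adds (k - 1) * d frames of
    context on top of the running total.
    """
    rf = 1
    for _ in range(repeats):
        for i in range(blocks_per_repeat):
            rf += (kernel_size - 1) * (2 ** i)
    return rf
```

With these defaults the receptive field comes to 1531 frames, i.e., roughly 24.5 s at a 16 ms hop, which is why the design "supports a long receptive field".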
Video processing. The structure of the lipnet is the same as in prior work. The extracted lip embeddings are then passed to the video blocks, comprising 5 convolutional blocks. The block design is similar to that of the audio blocks, including a depthwise separable convolution layer, ReLU, normalization and a residual connection.
Fusion methods. For factorized attention, the factor number K is empirically set to 10. Each audio linear layer maps the acoustic embedding into one subspace, while the video linear layer maps the lip embedding to the K subspace logits. A softmax layer follows the video linear layer to compute the posterior of each subspace. For rule-based attention, the two trainable parameters are initialized to -0.5 and 10, respectively. After the fusion of the multi-modal embeddings, the fused embeddings are passed to the fusion blocks. The fusion blocks consist of repeated groups of convolutional blocks; the number of repeats is set to 3 in our experiments. The number of convolution channels is 256.
IV-D Training procedure
The training of the model includes two stages. First, the speaker model is pre-trained on the speaker verification task using a Mandarin dataset. It is then frozen and utilized to extract speaker embeddings from all the enrollment audios. Second, the audio and video streams are jointly trained from scratch. The multi-modal network is trained with utterance-level mixtures, using the Adam optimizer with early stopping. The initial learning rate is set to 1e-3; if there is no improvement in the validation loss for 4 consecutive epochs, the learning rate is halved.
IV-E Evaluation metrics
Following recent advances in speech separation metrics, the average SI-SDR is adopted as the main evaluation metric. Following common practice, the perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI) and average SDR are also used to measure speech quality. To further assess the intelligibility of the estimated speech, we use the Yitu automatic speech recognition (ASR) system to compute the speaker-attributed word error rate (WER) between the separated speech and the ground-truth target speech. Speaker-attributed WER refers to the sum of transcription errors attributed to the target speaker divided by the number of reference words. Since we do not perform speech dereverberation, we use the reverberant clean speech as the reference for all metric computation.
All the trained models are evaluated without knowing the number of sources in the mixture, since the models perform target speech separation. Apart from the overall performance, we also evaluate the performance under different ranges of angle difference between speakers and under different speaker mixing conditions. The relative performance differences across scenarios provide a more comprehensive assessment of the model.
V Results and Analysis
V-A Fusion approaches
In this subsection, we investigate different multi-modality fusion approaches, including the fusion of audio and speaker embedding (audio-speaker), audio and video (audio-visual), and audio, video and speaker embedding (multi-modal). The baseline is the DF-only model, trained with only spectral, spatial and directional features (LPS+IPDs+DF).
Table I compares the performance of the audio-speaker models using different fusion methods, direct concatenation and factorized attention, trained with all data. Neither concatenation nor factorized attention improves the overall performance, possibly due to the limited discrimination between speaker embeddings. However, factorized attention boosts the performance from 7.1dB to 7.7dB in the small angle-difference range. Since the discriminability of DF decreases significantly when the angle difference is small, the speaker embedding may play an important role in providing target-related information there.
Table II compares the performance of the audio-visual models using different fusion methods. These models are trained only on overlapped data to save training time. Neither direct concatenation nor rule-based attention shows a clear performance gain over the DF-only model. Among the three audio-visual fusion methods, factorized attention exhibits the best overall performance, owing to the benefits of subspace factorization and learnable attention.
|Fusion method||#Param||SI-SDR (dB)|
|fac. att. + fac. att.||22.7M||8.2 / 9.4 / 10.8 / 11.7 / 9.5|
Table III lists the performance of multi-modal models adopting different fusion methods, trained on the overlapped data. Specifically, the experimental setup of each multi-modal fusion method is as follows:
concat. + concat.: The fusion of the acoustic, lip and speaker embeddings is performed after all the audio blocks. The embeddings are concatenated along the feature axis at each time step, with equal weight assigned to each modality. The fused embeddings are then passed to 3 repeats of fusion blocks.
fac. att. + concat.: First, the acoustic and speaker embeddings are fused by factorized attention after all the audio blocks. The fused embeddings are then concatenated with the lip embeddings and passed to 3 repeats of fusion blocks.
fac. att. + fac. att.: The acoustic embeddings are first fused with the speaker embedding after the audio blocks using factorized attention. The resulting embeddings are fed into the succeeding 2 repeats of fusion blocks, producing more abstract, higher-level embeddings. These embeddings are then fused with the lip embeddings by factorized attention, and the fused multi-modal embeddings are passed through an extra repeat of fusion blocks. The motivation for deferring the fusion with the lip embeddings is that, at deeper layers, phonemic information may be better abstracted from the audio, which may make the fusion with the lip embeddings more effective.
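The three wirings above differ only in where each cue is injected. The simplest, concat. + concat., amounts to a per-frame feature-axis concatenation; a toy sketch with embeddings represented as plain Python lists (names are illustrative, not the paper's code) is:

```python
def concat_fusion(acoustic_frames, lip_frames, spk_emb):
    """'concat. + concat.' fusion: at each time step, concatenate the
    acoustic and lip embeddings with the (time-invariant) speaker
    embedding along the feature axis, all with equal weight.

    acoustic_frames, lip_frames: per-frame embedding lists, assumed
    already aligned to the same frame rate.
    """
    return [a + l + spk_emb for a, l in zip(acoustic_frames, lip_frames)]
```

The fac. att. variants replace one or both concatenations with the attention-weighted subspace fusion described in the fusion-method setup, leaving the overall frame-wise pipeline unchanged.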
As shown in Table III, the best result is achieved by combining factorized attention (audio-speaker) with concatenation (audio-visual). Using factorized attention for both the audio-speaker and audio-visual fusion outperforms using concatenation for both, but falls short of the expected performance, possibly because the lip embeddings are fused with the acoustic embeddings too late.
|Features||SI-SDR (dB)||SDR (dB)||PESQ||STOI||WER (%)||RTF|
V-B Impact of different modalities
After investigating the modality fusion approaches, in this subsection we further analyze the impact of each modality. The aim is to verify that each modality, combined through a suitable multi-modality fusion, is effective for multi-channel target speech separation.
Table IV reports the performance of target separation models with different modalities as input. All models include LPS and IPDs in their input and are trained on the whole training dataset. The fusion method for models with more than one modality is chosen according to the best result achieved in Section V-A: factorized attention for audio-speaker, factorized attention for audio-visual, and factorized attention + concatenation for multi-modal. The real-time factor (RTF) is also reported as a measure of computation, defined as the GPU processing time (s) divided by the audio duration (s). The RTF is evaluated on the whole test set and indicates whether the model is fast enough for real-time processing.
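The RTF definition above is a simple ratio; a trivial helper makes the convention explicit (the function name is ours):

```python
def real_time_factor(gpu_seconds, audio_seconds):
    """RTF = GPU processing time (s) / audio duration (s).

    RTF < 1 means the model processes audio faster than real time;
    e.g. the DF-only model's RTF of 0.4% corresponds to 0.4 s of GPU
    time per 100 s of audio.
    """
    return gpu_seconds / audio_seconds
```
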
From Table IV, it is clear that DF makes a significant contribution to the overall performance compared with the speaker embedding and lip information. The computational complexity of the DF-only model is also relatively low, achieving a real-time factor of 0.4% on the GPU. However, the DF-only model performs worse than the lip-only model under small angle differences. Aided by the lip movement information or the speaker embedding, both the audio-speaker and audio-visual models improve the overall performance, especially in the small angle-difference range. The multi-modal model exhibits the best performance: 3.7dB, 11.1dB and 10.4dB SI-SDR improvement under the 1-speaker, 2-speaker and 3-speaker cases, respectively. It also achieves the lowest WER (10%) among all the models. This confirms the effectiveness of our proposed multi-modality exploitation and integration approach. Although an increased RTF is observed, processing can still be done in real time (i.e., RTF < 1).
To intuitively verify the benefits brought by multi-modal integration, Figure 8 presents an example of the separation results estimated by the DF-only, audio-visual and multi-modal models. From Figure 8(d) we can see that the DF-only model loses the target speech in the yellow box. This can happen when the target speaker temporarily turns his face away, so that the direction estimated by face detection deviates from the ground truth. The result estimated by the audio-visual model (Figure 8(e)) does not filter out the interfering sound in the green box, probably because the target speaker opens his mouth while not actually speaking. With all the target information available, the multi-modal model produces the estimate (Figure 8(f)) closest to the target speech spectrogram (Figure 8(c)).
V-C Modality Robustness
In practice, there are many cases in which one of the modalities is unavailable or unreliable. To demonstrate the robustness of our multi-modal model in real-world scenarios, we test it under two particular conditions: temporarily missing lip information and estimation errors in the target speaker's direction.
V-C1 Impact of missing lip information
In practice, the lip information may be invalid in many cases. For example, the transmission of high-resolution video may be unstable, causing frames to drop randomly. The target speaker may also temporarily turn away from the camera, or the lips may be occluded by a microphone. We regard these scenarios as missing lip information. In our experiments, each missing frame is filled with the latest available previous frame. We compare the multi-modal model with the lip-only and audio-visual models when randomly dropping 0%, 10%, 20% and 50% of the frames.
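The fill-forward strategy for dropped frames can be sketched as follows; frames are arbitrary objects with `None` marking a drop, and the handling of drops before the first valid frame is an assumption of this sketch rather than something the text specifies.

```python
def fill_dropped_frames(frames):
    """Replace dropped video frames (None) with the most recent
    valid frame, as done for missing lip frames in our experiments.

    Drops that occur before any valid frame are left as None here.
    """
    filled, last = [], None
    for f in frames:
        if f is not None:
            last = f  # remember the latest valid frame
        filled.append(last)
    return filled
```

Under heavy drop rates this simply holds the last lip pose for longer, which is why the lip stream degrades gracefully rather than disappearing.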
Results are presented in Table V. For the lip-only model, dropping frames has an obvious negative effect on the overall performance. For models that integrate other complementary modalities, the negative influence is alleviated. In particular, for the multi-modal model, the performance decrease is less than 2% even with 50% of the frames dropped. This confirms the robustness of our multi-modal model to missing visual information.
V-C2 Impact of sound direction estimation error
For the audio-only model, which depends heavily on the directional features, even a small direction estimation error may cause large separation inaccuracy. When other modalities are available, the deviation can be remedied to some extent. We compare the multi-modal model with the DF-only and audio-speaker models when there is an estimation error in the direction derived from the speaker's face. The ground-truth direction is deviated by 1°-10° when computing the target speaker's DF. The performance is examined in two cases: the closest angle difference between the target and interfering speakers is either smaller than 15° or larger than 15°. Figure 9 plots the performance curves versus the direction estimation error for the three models: DF-only, audio-speaker and multi-modal.
As observed from Figure 9(b), the performance of all models is fortunately robust to the direction estimation error when the angle difference exceeds 15°. However, as the error increases, the performance of the DF-only model degrades dramatically when the target and interfering speaker(s) are close (Figure 9(a)). This is due to spatial ambiguity: when the directional information is not discriminative enough and it is the only target cue, the network cannot identify which speaker should be separated. When the speaker embedding is integrated into the model (audio-speaker), the performance drop is noticeably slower, because the voice characteristics of the target speaker complement the target information. Furthermore, when all the target information is aggregated in a single model (multi-modal), the overall performance degradation is less than 1.5dB over the tested range of direction estimation errors.
The experimental results suggest that our proposed multi-modal model maintains stable performance under degradation of either the video or the audio modality.
In this work, we propose the first deep multi-modal framework for multi-channel target speech separation. The framework exploits all sorts of target-related information, including the target speaker's spatial location, lip movements and voice characteristics. Efficient and robust multi-modal fusion approaches are proposed and investigated within the framework. Evaluation on a to-be-released large-scale audio-visual dataset demonstrates the effectiveness and stability of the proposed multi-modal system.
This work still has some limitations that need to be addressed in future work. First, the joint training of the video and audio streams may not produce sufficiently discriminative lip embeddings; we will follow prior work on lip reading to pretrain the lipnet with phonetically transcribed data. Second, although the proposed multi-modal system has demonstrated robustness to errors in, or the absence of, some input modalities, data augmentation schemes could further improve this robustness. Third, the fusion methods investigated in this work are useful, but we believe there is still room for improvement.
This research is partly supported by Shenzhen Science & Technology Fundamental Research Programs (No: JCYJ20170817160058246 & No: JCYJ20180507182908274).
-  C. Cherry and J. A. Bowles, “Contribution to a study of the cocktail party problem,” Journal of the Acoustical Society of America, vol. 32, no. 7, pp. 884–884, 1960.
-  J. Du, Y. Tu, Y. Xu, L. Dai, and C.-H. Lee, “Speech separation of a target speaker based on deep neural networks,” in International Conference on Signal Processing (ICSP). IEEE, 2014, pp. 473–477.
-  Y. Wang, A. Narayanan, and D. Wang, “On training targets for supervised speech separation,” IEEE/ACM transactions on audio, speech, and language processing, vol. 22, no. 12, pp. 1849–1858, 2014.
-  D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
-  J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31–35.
-  D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 241–245.
-  Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Alternative objective functions for deep clustering,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 686–690.
-  A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, “Seeing through noise: Visually driven speaker separation and enhancement,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 3051–3055.
-  A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation,” ACM Transactions on Graphics (TOG), vol. 37, no. 4, p. 112, 2018.
-  T. Afouras, J. S. Chung, and A. Zisserman, “The conversation: Deep audio-visual speech enhancement,” Proc. Interspeech, pp. 3244–3248, 2018.
-  T. Afouras, J. S. Chung, and A. Zisserman, “My lips are concealed: Audio-visual speech enhancement through obstructions,” Proc. Interspeech, pp. 4295–4299, 2019.
-  J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Asian Conference on Computer Vision. Springer, 2016, pp. 87–103.
-  A. Ephrat, T. Halperin, and S. Peleg, “Improved speech reconstruction from silent video,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 455–462.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2015, pp. 815–823.
-  J. Wang, J. Chen, D. Su, L. Chen, M. Yu, Y. Qian, and D. Yu, “Deep extractor network for target speaker recovery from single channel speech mixtures,” Proc. Interspeech, 2018.
-  K. Žmolíková, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, “Learning speaker representation for neural network based multichannel speaker extraction,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 8–15.
-  Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking,” arXiv preprint arXiv:1810.04826, 2018.
-  K. Žmolíková, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Černockỳ, “Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800–814, 2019.
-  L. Rui, Z. Duan, and C. Zhang, “Audio-visual deep clustering for speech separation,” IEEE/ACM transactions on audio, speech, and language processing, vol. 27, no. 11, pp. 1697–1712, 2019.
-  J. Wu, Y. Xu, S.-X. Zhang, L.-W. Chen, M. Yu, L. Xie, and D. Yu, “Time domain audio visual speech separation,” IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
-  Y. Luo and N. Mesgarani, “Conv-tasnet: Surpassing ideal time-frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, Aug 2019.
-  Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, “Multi-channel overlapped speech recognition with location guided speech extraction network,” in IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 558–565.
-  Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 1–5.
-  L. Chen, M. Yu, D. Su, and D. Yu, “Multi-band pit and model integration for improved multi-channel speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
-  Z. Wang and D. Wang, “Combining spectral and spatial features for deep learning based blind speaker separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 2, pp. 457–468, 2019.
-  R. Gu, L. Chen, S.-X. Zhang, J. Zheng, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, “Neural spatial filter: Target speaker speech separation assisted with directional information,” in Proc. Interspeech, 2019.
-  K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, “Speaker-aware neural network based beamformer for speaker extraction in speech mixtures,” in Proc. Interspeech, 2017, pp. 2655–2659.
-  X. Xiao, Z. Chen, T. Yoshioka, H. Erdogan, C. Liu, D. Dimitriadis, J. Droppo, and Y. Gong, “Single-channel speech extraction using speaker inventory and attention network,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 86–90.
-  T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE transactions on pattern analysis and machine intelligence, 2018.
-  A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” in European Conference on Computer Vision. Springer, 2018, pp. 639–658.
-  Z.-Q. Wang, J. L. Roux, D. Wang, and J. R. Hershey, “End-to-end speech separation with unfolded iterative phase reconstruction,” arXiv preprint arXiv:1804.10204, 2018.
-  G. Wichern and J. Le Roux, “Phase reconstruction with learned time-frequency representations for single-channel speech separation,” in 16th International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2018, pp. 396–400.
-  Z. Chen, T. Yoshioka, X. Xiao, L. Li, M. L. Seltzer, and Y. Gong, “Efficient integration of fixed beamformers and speech separation networks for multi-channel far-field speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5384–5388.
-  Z.-Q. Wang and D. Wang, “On spatial features for supervised speech separation and its application to beamforming and robust asr,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5709–5713.
-  D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009.
-  S. Chakrabarty and E. A. Habets, “Multi-speaker doa estimation using deep convolutional networks trained with noise signals,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 8–21, 2019.
-  S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-end attention based text-dependent speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 171–178.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016, pp. 770–778.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
-  C. Zhang, K. Koishida, and J. H. Hansen, “Text-independent speaker verification based on triplet convolutional neural network embeddings,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 9, pp. 1633–1644, 2018.
-  M. Delcroix, K. Kinoshita, T. Hori, and T. Nakatani, “Context adaptive deep neural networks for fast acoustic model adaptation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4535–4539.
-  D. Yu, X. Chen, and L. Deng, “Factorized deep neural networks for adaptive speech recognition,” in International workshop on statistical machine learning for speech processing, 2012.
-  N. Majumder, D. Hazarika, A. Gelbukh, E. Cambria, and S. Poria, “Multimodal sentiment analysis using hierarchical fusion with context modeling,” Knowledge-Based Systems, vol. 161, pp. 124–133, 2018.
-  F. Bahmaninezhad, J. Wu, R. Gu, S.-X. Zhang, Y. Xu, M. Yu, and D. Yu, “A comprehensive study of speech separation: spectrogram vs waveform separation,” Proc. Interspeech, 2019.
-  E. Lehmann and A. Johansson, “Prediction of energy decay in room impulse responses simulated with an image-source model,” The Journal of the Acoustical Society of America, vol. 124, no. 1, pp. 269–277, 2008.
-  “A large-scale audio-visual corpus for multimodal speaker diarization, speech separation and recognition,” in preparation, 2020.
-  Y. Luo and N. Mesgarani, “Tasnet: Surpassing ideal time-frequency masking for speech separation,” arXiv preprint arXiv:1809.07454, 2018.
-  J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “Sdr–half-baked or well done?” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630.
-  E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language processing, vol. 14, no. 4, pp. 1462–1469, 2006.
-  O. Galibert, “Methodologies for the evaluation of speaker diarization and automatic speech recognition in the presence of overlapping speech.” in INTERSPEECH, 2013, pp. 1131–1134.