Audio-visual speech separation based on joint feature representation with cross-modal attention

03/05/2022
by Junwen Xiong et al.

Multi-modal speech separation has shown a clear advantage in isolating a target speaker in multi-talker noisy environments. Unfortunately, most current separation strategies rely on a straightforward fusion of features learned independently from each modality, which falls short of capturing the inter-relationships between modalities. Inspired by learning joint feature representations from audio and visual streams with an attention mechanism, this study proposes a novel cross-modal fusion strategy that lets the whole framework benefit from semantic correlations between the modalities. To further improve audio-visual speech separation, the dense optical flow of lip motion is incorporated to strengthen the robustness of the visual representation. The proposed method is evaluated on two public audio-visual speech separation benchmark datasets. The overall performance improvement demonstrates that the additional motion network effectively enhances the visual representation formed from lip images and the audio signal, and that the proposed cross-modal fusion outperforms the baseline on all metrics.
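To make the fusion idea concrete, below is a minimal sketch of bidirectional cross-modal attention between audio and visual feature sequences. It assumes PyTorch; the module name `CrossModalFusion`, the feature dimensions, and the use of `nn.MultiheadAttention` with residual connections are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch: each stream attends to the other, then the two are joined."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio:  (batch, T_a, dim) spectrogram-frame embeddings
        # visual: (batch, T_v, dim) visual embeddings, e.g. lip-image
        #         features concatenated with optical-flow motion features
        #         and projected to `dim`
        a_att, _ = self.a2v(audio, visual, visual)   # audio queries visual
        v_att, _ = self.v2a(visual, audio, audio)    # visual queries audio
        audio = self.norm_a(audio + a_att)           # residual + norm
        visual = self.norm_v(visual + v_att)
        # Upsample the (slower) visual stream to the audio frame rate,
        # then concatenate into a joint representation for the separator.
        visual = F.interpolate(visual.transpose(1, 2),
                               size=audio.shape[1]).transpose(1, 2)
        return torch.cat([audio, visual], dim=-1)

fusion = CrossModalFusion(dim=256)
audio = torch.randn(2, 100, 256)   # e.g. 100 STFT frames
visual = torch.randn(2, 25, 256)   # e.g. 25 video frames
joint = fusion(audio, visual)      # -> (2, 100, 512)
```

For the motion cue, a dense optical flow field over consecutive lip crops can be computed, for instance, with OpenCV's Farneback estimator; the abstract does not name a specific flow algorithm, so this choice is an assumption.

```python
import cv2
import numpy as np

# Two consecutive grayscale lip crops (dummy data for illustration).
prev_gray = np.random.randint(0, 255, (96, 96), dtype=np.uint8)
next_gray = np.random.randint(0, 255, (96, 96), dtype=np.uint8)

flow = cv2.calcOpticalFlowFarneback(
    prev_gray, next_gray, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
# flow.shape == (96, 96, 2): a per-pixel (dx, dy) motion vector field
```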

Related research

08/16/2023
SCANet: A Self- and Cross-Attention Network for Audio-Visual Speech Separation
The integration of different modalities, such as audio and visual inform...

10/26/2021
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation
The recent success of transformer models in language, such as BERT, has ...

06/21/2021
Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams
Many previous audio-visual voice-related works focus on speech, ignoring...

05/08/2023
AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment
The speech-to-singing (STS) voice conversion task aims to generate singi...

07/04/2022
Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation
In this paper we propose a multi-modal multi-correlation learning framew...

03/25/2021
Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation
In this paper, we address the problem of separating individual speech si...

10/27/2020
Rule-embedded network for audio-visual voice activity detection in live musical video streams
Detecting anchor's voice in live musical streams is an important preproc...
