SCANet: A Self- and Cross-Attention Network for Audio-Visual Speech Separation

08/16/2023
by Kai Li, et al.

The integration of different modalities, such as audio and visual information, plays a crucial role in human perception of the surrounding environment. Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, existing approaches predominantly place multi-modal fusion at either the top or the bottom of the network, rather than comprehensively considering fusion at multiple hierarchical positions. In this paper, we propose a novel model called the self- and cross-attention network (SCANet), which leverages the attention mechanism for efficient audio-visual feature fusion. SCANet consists of two types of attention blocks: self-attention (SA) and cross-attention (CA) blocks, where the CA blocks are distributed at the top (TCA), middle (MCA) and bottom (BCA) of SCANet. These blocks preserve the ability to learn modality-specific features while enabling the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of SCANet, which outperforms existing state-of-the-art (SOTA) methods while maintaining comparable inference time.
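The paper's exact CA block design is not reproduced here, but the core idea of cross-attention fusion, letting features from one modality attend to the other, can be sketched in a minimal single-head form. The shapes, variable names, and the NumPy implementation below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats):
    """Single-head scaled dot-product cross-attention.

    query_feats:   (T_q, d) features of one modality (e.g. audio frames)
    context_feats: (T_k, d) features of the other modality (e.g. visual frames)
    Returns a (T_q, d) fusion in which each query frame is a weighted
    combination of the context frames.
    """
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)  # (T_q, T_k)
    weights = softmax(scores, axis=-1)                   # rows sum to 1
    return weights @ context_feats                       # (T_q, d)

# Hypothetical feature streams: 100 audio frames and 25 video frames, 64-dim each.
rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 64))
video = rng.standard_normal((25, 64))
fused = cross_attention(audio, video)
print(fused.shape)  # (100, 64)
```

A self-attention (SA) block is the special case where query and context come from the same modality, i.e. `cross_attention(audio, audio)`; distributing such blocks at several depths of the network is what the abstract's TCA/MCA/BCA placement refers to.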


Related research:

- 03/05/2022: Audio-visual speech separation based on joint feature representation with cross-modal attention
  Multi-modal based speech separation has exhibited a specific advantage o...
- 06/25/2023: AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction
  Visual information can serve as an effective cue for target speaker extr...
- 12/14/2021: Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
  Multi-modal fusion is proven to be an effective method to improve the ac...
- 09/12/2023: DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention
  With the rise in manipulated media, deepfake detection has become an imp...
- 07/21/2020: SLNSpeech: solving extended speech separation problem by the help of sign language
  A speech separation task can be roughly divided into audio-only separati...
- 08/14/2020: Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
  The major challenge in audio-visual event localization task lies in how ...
- 10/04/2022: Pay Self-Attention to Audio-Visual Navigation
  Audio-visual embodied navigation, as a hot research topic, aims training...
