Dual-path Attention is All You Need for Audio-Visual Speech Extraction

07/09/2022
by Zhongweiyang Xu, et al.

Audio-visual target speech extraction, which aims to extract a target speaker's speech from a noisy mixture by looking at lip movements, has made significant progress by combining time-domain speech separation models with CNN-based visual feature extractors. One problem in fusing audio and video information is that the two modalities have different time resolutions. Most current research upsamples the visual features along the time dimension so that audio and video features can be aligned in time. However, we believe that lip movements should mostly carry long-term, phone-level information. Based on this assumption, we propose a new way to fuse audio-visual features. We observe that for DPRNN <cit.>, the time resolution of the inter-chunk dimension can be very close to that of the video frames. As in <cit.>, the LSTMs in DPRNN are replaced by intra-chunk and inter-chunk self-attention, but in the proposed algorithm the inter-chunk attention incorporates the visual features as an additional feature stream. This avoids upsampling the visual cues and yields more efficient audio-visual fusion. Results show that the proposed model achieves superior performance compared with other time-domain audio-visual fusion models.
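To make the fusion scheme concrete, below is a minimal sketch of one dual-path attention block in which the intra-chunk path applies self-attention within each chunk and the inter-chunk path attends over the audio chunks together with the video frames. The module name DualPathAVBlock, the tensor shapes, and the specific fusion mechanism (appending the video frames to the keys and values of the inter-chunk attention) are illustrative assumptions based only on the abstract, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class DualPathAVBlock(nn.Module):
    """Sketch of one dual-path attention block with audio-visual fusion on
    the inter-chunk path. Because the inter-chunk hop is chosen so that the
    number of chunks is close to the number of video frames, the video
    stream can be fused without temporal upsampling (assumption drawn from
    the abstract; details are illustrative)."""

    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        self.intra_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.inter_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_intra = nn.LayerNorm(dim)
        self.norm_inter = nn.LayerNorm(dim)

    def forward(self, audio, video):
        # audio: (batch, n_chunks, chunk_len, dim) -- chunked audio features
        # video: (batch, n_frames, dim) -- lip-movement embeddings, with
        #        n_frames comparable to n_chunks
        B, C, K, D = audio.shape

        # Intra-chunk self-attention: short-term structure within each chunk.
        x = audio.reshape(B * C, K, D)
        x = self.norm_intra(x + self.intra_attn(x, x, x, need_weights=False)[0])
        x = x.reshape(B, C, K, D)

        # Inter-chunk attention: each within-chunk position attends across
        # chunks, with the video frames appended as extra keys/values.
        y = x.permute(0, 2, 1, 3).reshape(B * K, C, D)        # (B*K, n_chunks, D)
        v = video.unsqueeze(1).expand(B, K, -1, D).reshape(B * K, -1, D)
        kv = torch.cat([y, v], dim=1)                         # audio chunks + video frames
        y = self.norm_inter(y + self.inter_attn(y, kv, kv, need_weights=False)[0])
        return y.reshape(B, K, C, D).permute(0, 2, 1, 3)      # back to (B, C, K, D)

# Hypothetical usage: 50 chunks paired with 50 video frames at a similar rate.
audio = torch.randn(2, 50, 100, 64)
video = torch.randn(2, 50, 64)
out = DualPathAVBlock()(audio, video)   # (2, 50, 100, 64)
```

The key design point this sketch illustrates is that fusion happens only on the inter-chunk path, where the audio's effective frame rate already matches the video, so the visual stream never needs to be upsampled.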

