Time Domain Audio Visual Speech Separation

04/07/2019
by Jian Wu, et al.

Audio-visual multi-modal modeling has been demonstrated to be effective in many speech-related tasks, such as speech recognition and speech enhancement. This paper introduces a new time-domain audio-visual architecture for target speaker extraction from monaural mixtures. The architecture generalizes the previous TasNet (time-domain speech separation network) to enable multi-modal learning and, at the same time, extends classical audio-visual speech separation from the frequency domain to the time domain. The main components of the proposed architecture are an audio encoder, a video encoder that extracts lip embeddings from video streams, a multi-modal separation network, and an audio decoder. Experiments on simulated mixtures based on the recently released LRS2 dataset show that our method brings Si-SNR improvements of more than 3 dB and 4 dB in the 2-speaker and 3-speaker cases, respectively, compared to audio-only TasNet and frequency-domain audio-visual networks.
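
The results above are reported as Si-SNR (scale-invariant signal-to-noise ratio) improvements. As a rough illustration of how such a number is typically computed, the following is a minimal NumPy sketch of the commonly used Si-SNR definition. It is not taken from the paper: the function names `si_snr` and `si_snr_improvement`, the `eps` stabilizer, and the zero-mean normalization step are assumptions made for this example.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between a 1-D estimate and a 1-D target.

    Hypothetical helper for illustration; `eps` is an assumed stabilizer.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()

    # Project the estimate onto the target; this makes the metric
    # insensitive to the overall gain of the estimate.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection

    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, mixture, target):
    """Si-SNR gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Under this definition, an improvement of 3 dB means that the separated signal scores 3 dB higher against the clean target than the raw mixture does; rescaling the estimate by a constant leaves the score essentially unchanged.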

research
03/16/2020

Multi-modal Multi-channel Target Speech Separation

Target speech separation refers to extracting a target speaker's voice f...
research
03/04/2022

Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Active speaker detection and speech enhancement have become two increasi...
research
07/04/2022

Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation

In this paper we propose a multi-modal multi-correlation learning framew...
research
05/31/2023

Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model

We propose Audio-Visual Lightweight ITerative model (AVLIT), an effectiv...
research
04/05/2022

Audio-visual multi-channel speech separation, dereverberation and recognition

Despite the rapid advance of automatic speech recognition (ASR) technolo...
research
03/13/2021

EgoCom: A Multi-person Multi-modal Egocentric Communications Dataset

Multi-modal datasets in artificial intelligence (AI) often capture a thi...
research
06/22/2021

Multi-accent Speech Separation with One Shot Learning

Speech separation is a problem in the field of speech processing that ha...