Deep Variational Generative Models for Audio-visual Speech Separation

08/17/2020
by   Viet-Nhat Nguyen, et al.

In this paper, we are interested in audio-visual speech separation given a single-channel audio recording as well as visual information (lip movements) associated with each speaker. We propose an unsupervised technique based on audio-visual generative modeling of clean speech. More specifically, during training, a latent-variable generative model is learned from clean speech spectrograms using a variational auto-encoder (VAE). To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech (instead of clean speech) as well as the visual data. The visual modality also serves as a prior on the latent variables, through a visual network. At test time, the learned generative model (for both speaker-independent and speaker-dependent scenarios) is combined with an unsupervised non-negative matrix factorization (NMF) variance model for the background noise. All the latent variables and noise parameters are then estimated with a Monte Carlo expectation-maximization algorithm. Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches, as well as a supervised deep learning-based technique.
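To make the training setup concrete, below is a minimal, hypothetical PyTorch sketch of such an audio-visual VAE: an encoder that infers the latent posterior from the mixture spectrogram and a lip-movement embedding, a visual network that provides the prior, and a decoder that outputs the variance of the clean-speech spectrogram. All module names, layer sizes, and the Itakura-Saito-style reconstruction term are illustrative assumptions, not the authors' exact implementation; the NMF noise model and the Monte Carlo EM inference used at test time are not shown.

```python
# Hypothetical sketch (not the paper's code): audio-visual VAE whose posterior is
# conditioned on the mixture and the visual data, with a visual prior network.
import torch
import torch.nn as nn


class VisualPrior(nn.Module):
    """Prior network p(z | v) = N(mu_v, diag(exp(logvar_v))) from a lip embedding v."""
    def __init__(self, visual_dim=128, latent_dim=32, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(visual_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, v):
        h = self.trunk(v)
        return self.mu(h), self.logvar(h)


class AVEncoder(nn.Module):
    """Posterior q(z | x_mix, v): conditioned on the mixture spectrogram frame
    x_mix (not on clean speech) and on the visual embedding v."""
    def __init__(self, freq_bins=257, visual_dim=128, latent_dim=32, hidden=512):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(freq_bins + visual_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, x_mix, v):
        h = self.trunk(torch.cat([x_mix, v], dim=-1))
        return self.mu(h), self.logvar(h)


class Decoder(nn.Module):
    """Generative model p(s | z): outputs a per-frequency variance of the
    clean-speech spectrogram frame (zero-mean complex Gaussian model)."""
    def __init__(self, freq_bins=257, latent_dim=32, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, freq_bins))

    def forward(self, z):
        return torch.exp(self.net(z))  # exponentiate so the variance is positive


def negative_elbo(clean_pow, x_mix, v, enc, dec, prior):
    """Training loss for a batch of frames: Itakura-Saito-like reconstruction of the
    clean power spectrogram plus KL(q(z | x_mix, v) || p(z | v))."""
    mu_q, logvar_q = enc(x_mix, v)
    z = mu_q + torch.exp(0.5 * logvar_q) * torch.randn_like(mu_q)  # reparameterization trick
    var_s = dec(z)
    recon = (clean_pow / var_s + torch.log(var_s)).sum(dim=-1)     # -log-likelihood up to constants
    mu_p, logvar_p = prior(v)
    kl = 0.5 * ((logvar_p - logvar_q)
                + (torch.exp(logvar_q) + (mu_q - mu_p) ** 2) / torch.exp(logvar_p)
                - 1.0).sum(dim=-1)
    return (recon + kl).mean()
```

At test time, as the abstract describes, the learned speech model would be kept fixed and combined with an NMF variance model for the background noise, with the latent variables and noise parameters estimated by Monte Carlo expectation-maximization; that inference procedure is beyond this sketch.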

