Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement
In this paper, we are interested in unsupervised speech enhancement using latent variable generative models. We propose to learn a generative model of the clean speech spectrogram based on a variational autoencoder (VAE), where a mixture of audio and visual networks is used to infer the posterior of the latent variables. This is motivated by the fact that visual data, i.e., lip images of the speaker, provide helpful and complementary information about speech. As such, they can help train a richer inference network. Moreover, during speech enhancement, the visual data are used to initialize the latent variables, providing a more robust initialization than the noisy speech spectrogram. A variational inference approach is derived to train the proposed VAE. Thanks to the novel inference procedure and the robust initialization, the proposed audio-visual mixture VAE achieves better speech enhancement performance than its standard audio-only counterpart.
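To make the mixture-of-inference-networks idea concrete, here is a minimal NumPy sketch of sampling from a posterior of the form q(z|a,v) = π_a q_a(z|a) + π_v q_v(z|v), where each component is a Gaussian produced by its own encoder. All names, the one-layer linear encoders, and the fixed mixture weights are illustrative assumptions, not the paper's actual architecture (which uses trained neural networks and learned responsibilities).

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_params(x, W_mu, W_logvar):
    # Hypothetical one-layer encoder: maps input features to the mean and
    # log-variance of a diagonal Gaussian over the latent variables.
    return W_mu @ x, W_logvar @ x

def mixture_posterior_sample(audio_feat, vis_feat, params):
    # q(z|a,v) = pi_a * q_a(z|a) + pi_v * q_v(z|v):
    # pick a mixture component, then reparameterize within it.
    mu_a, lv_a = gaussian_params(audio_feat, *params["audio"])
    mu_v, lv_v = gaussian_params(vis_feat, *params["visual"])
    pi = params["weights"]                      # [pi_audio, pi_visual]
    comp = rng.choice(2, p=pi)                  # sample which network to use
    mu, lv = (mu_a, lv_a) if comp == 0 else (mu_v, lv_v)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * lv) * eps          # reparameterization trick

# Toy dimensions: audio/visual feature sizes and latent size (assumptions).
dim_a, dim_v, dim_z = 8, 4, 2
params = {
    "audio":  (rng.standard_normal((dim_z, dim_a)) * 0.1,
               rng.standard_normal((dim_z, dim_a)) * 0.1),
    "visual": (rng.standard_normal((dim_z, dim_v)) * 0.1,
               rng.standard_normal((dim_z, dim_v)) * 0.1),
    "weights": np.array([0.5, 0.5]),
}
z = mixture_posterior_sample(rng.standard_normal(dim_a),
                             rng.standard_normal(dim_v), params)
print(z.shape)  # (2,)
```

In this sketch the visual component can also serve as the initializer described in the abstract: at enhancement time, one would draw the first latent estimate from q_v(z|v) alone, since the visual stream is unaffected by acoustic noise.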