Learning and controlling the source-filter representation of speech with a variational autoencoder

04/14/2022
by   Samir Sadok, et al.

Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, drawing inspiration from the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency f_0 and the formants are of primary importance. In this work, we show that the source-filter model of speech production naturally arises in the latent space of a variational autoencoder (VAE) trained in an unsupervised manner on a dataset of natural speech signals. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we experimentally illustrate that f_0 and the formant frequencies are encoded in orthogonal subspaces of the VAE latent space, and we develop a weakly supervised method to accurately and independently control these speech factors of variation within the learned latent subspaces. Without requiring additional information such as text or human-labeled data, this yields a deep generative model of speech spectrograms that is conditioned on f_0 and the formant frequencies, and which we apply to the transformation of speech signals.
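The abstract only outlines the weakly supervised control idea, so the following is a minimal illustrative sketch, not the authors' method: assuming a pre-trained VAE whose encoder maps spectrogram frames to latent codes, and a few f_0-labeled frames from a synthesizer, one could identify a low-dimensional latent subspace that co-varies with f_0 (here via PCA), learn a simple map from f_0 to coordinates in that subspace, and edit a latent code by replacing only its component in that subspace before decoding. All names, dimensions and the synthetic data below are placeholders standing in for real VAE latent codes.

```python
# Illustrative sketch of subspace-based latent control (assumptions labeled).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

latent_dim = 16      # assumed VAE latent dimension
n_frames = 500       # assumed number of labeled synthetic frames

# Stand-in for encoder outputs z = enc(spectrogram) on labeled synthetic speech.
z_labeled = rng.normal(size=(n_frames, latent_dim))
f0_labels = rng.uniform(100.0, 300.0, size=n_frames)  # Hz, from the synthesizer

# 1) Identify latent directions that co-vary with f_0. With real data one would
#    analyze codes of utterances where only f_0 varies; here PCA on the labeled
#    codes serves as a placeholder for that step.
subspace_dim = 1
pca = PCA(n_components=subspace_dim).fit(z_labeled)
U = pca.components_            # (subspace_dim, latent_dim), orthonormal basis

# 2) Weak supervision: learn a simple map from f_0 to subspace coordinates.
coords = (z_labeled - pca.mean_) @ U.T
reg = LinearRegression().fit(f0_labels.reshape(-1, 1), coords)

def shift_f0(z, target_f0):
    """Keep the component of z orthogonal to the f_0 subspace (ideally carrying
    the formants) and set the subspace coordinates to those predicted for the
    target f_0."""
    z_centered = z - pca.mean_
    z_orth = z_centered - (z_centered @ U.T) @ U        # orthogonal complement
    new_coords = reg.predict(np.array([[target_f0]]))[0]
    return pca.mean_ + z_orth + new_coords @ U

# Example: transform one latent frame to a 220 Hz target before decoding.
z_frame = rng.normal(size=latent_dim)
z_edited = shift_f0(z_frame, target_f0=220.0)
print(z_edited.shape)  # (16,) -> would be passed to the VAE decoder
```

Because the edit only moves the code within the identified subspace and leaves its orthogonal complement untouched, the other factors of variation are, in principle, preserved; how well this holds depends on how cleanly the subspaces separate in the learned latent space.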


