Learning Joint Articulatory-Acoustic Representations with Normalizing Flows

05/16/2020
by Pramit Saha, et al.

The articulatory geometric configurations of the vocal tract and the acoustic properties of the resultant speech sound are considered to have a strong causal relationship. This paper aims to find a joint latent representation between the articulatory and acoustic domains for vowel sounds via invertible neural network models, while preserving the respective domain-specific features. Our model uses a convolutional autoencoder architecture and normalizing flow-based models to allow both forward and inverse mappings, in a semi-supervised manner, between the mid-sagittal vocal tract geometry of a two degrees-of-freedom articulatory synthesizer with a 1D acoustic wave model and the Mel-spectrogram representation of the synthesized speech sounds. Our approach achieves satisfactory performance on both articulatory-to-acoustic and acoustic-to-articulatory mapping, thereby demonstrating a joint encoding of both domains.
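
To illustrate what "invertible" means for a normalizing flow, the following is a minimal sketch of a single RealNVP-style affine coupling layer in NumPy. It is not the architecture used in the paper, which is not specified in this abstract; the class name `AffineCoupling`, the layer sizes, and the fixed random weights are all assumptions made for illustration. The point is only that one set of parameters gives both a forward map and an exact inverse, which is the property that allows mapping in both directions between two domains.

```python
import numpy as np


class AffineCoupling:
    """One affine coupling layer (illustrative; weights are random, not trained)."""

    def __init__(self, dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.d = dim // 2  # size of the half that passes through unchanged
        # Tiny two-layer network that outputs a log-scale and a shift
        # for the other half of the vector (hypothetical sizes).
        self.W1 = rng.normal(scale=0.5, size=(self.d, hidden))
        self.W2 = rng.normal(scale=0.5, size=(hidden, 2 * (dim - self.d)))

    def _scale_shift(self, x_a):
        h = np.tanh(x_a @ self.W1) @ self.W2
        log_s, t = np.split(h, 2, axis=-1)
        return np.tanh(log_s), t  # bound the log-scale for numerical stability

    def forward(self, x):
        x_a, x_b = x[..., :self.d], x[..., self.d:]
        log_s, t = self._scale_shift(x_a)
        z_b = x_b * np.exp(log_s) + t
        log_det = log_s.sum(axis=-1)  # log |det Jacobian| of the transform
        return np.concatenate([x_a, z_b], axis=-1), log_det

    def inverse(self, z):
        z_a, z_b = z[..., :self.d], z[..., self.d:]
        log_s, t = self._scale_shift(z_a)  # z_a equals x_a, so same scale/shift
        x_b = (z_b - t) * np.exp(-log_s)
        return np.concatenate([z_a, x_b], axis=-1)


# Round trip on random data: inverse(forward(x)) recovers x.
layer = AffineCoupling(dim=8)
x = np.random.default_rng(1).normal(size=(5, 8))
z, log_det = layer.forward(x)
x_rec = layer.inverse(z)
print(np.allclose(x_rec, x))
```

Because the first half of the vector is passed through unchanged, the inverse can recompute the same scale and shift and undo the transform exactly, without inverting the inner network.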


research
08/06/2020

Ultrasound-based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis

For articulatory-to-acoustic mapping using deep neural networks, typical...
research
02/16/2020

Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings

Learned joint representations of images and text form the backbone of se...
research
09/25/2019

Speech Recognition with Augmented Synthesized Speech

Recent success of the Tacotron speech synthesis architecture and its var...
research
09/08/2015

Unsupervised Domain Discovery using Latent Dirichlet Allocation for Acoustic Modelling in Speech Recognition

Speech recognition systems are often highly domain dependent, a fact wid...
research
08/05/2023

A Systematic Exploration of Joint-training for Singing Voice Synthesis

There has been a growing interest in using end-to-end acoustic models fo...
research
09/25/2017

Predicting interviewee attitude and body language from speech descriptors

This present research investigated the relationship between personal imp...
research
04/05/2022

Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation

We propose a computational model of speech production combining a pre-tr...
