Ultrasound-based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis

08/06/2020
by Tamás Gábor Csapó, et al.

For articulatory-to-acoustic mapping using deep neural networks, the spectral and excitation parameters of vocoders have typically been used as training targets. However, vocoding often results in buzzy and muffled final speech quality. Therefore, in this paper on ultrasound-based articulatory-to-acoustic conversion, we use a flow-based neural vocoder (WaveGlow) pre-trained on a large amount of English and Hungarian speech data. The inputs of the convolutional neural network are ultrasound tongue images. The training target is the 80-dimensional mel-spectrogram, which is a more finely detailed spectral representation than the previously used 25-dimensional Mel-Generalized Cepstrum. From the output of the ultrasound-to-mel-spectrogram prediction, WaveGlow inference produces the synthesized speech. We compare the proposed WaveGlow-based system with a continuous vocoder, which does not use a strict voiced/unvoiced decision when predicting F0. The results demonstrate that in the articulatory-to-acoustic mapping experiments, the WaveGlow neural vocoder produces significantly more natural synthesized speech than the baseline system. A further advantage of WaveGlow is that F0 is implicitly contained in the mel-spectrogram representation, so the excitation does not have to be predicted separately.
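To make the pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the ultrasound-to-mel-spectrogram regression described above. Only the 80-dimensional mel-spectrogram target comes from the paper; the network architecture, image resolution (64x128), and training details are illustrative assumptions, not the authors' actual configuration.

import torch
import torch.nn as nn

N_MELS = 80  # training target: 80-dimensional mel-spectrogram frames (from the paper)

class UltrasoundToMel(nn.Module):
    """Illustrative CNN mapping one grayscale ultrasound tongue image to one mel frame."""
    def __init__(self, n_mels: int = N_MELS):
        super().__init__()
        # Assumed feature extractor; layer sizes are placeholders, not the paper's.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, n_mels),  # regression target, so no output nonlinearity
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.regressor(self.features(x))

model = UltrasoundToMel()
frames = torch.rand(8, 1, 64, 128)   # dummy batch of ultrasound frames (assumed size)
targets = torch.rand(8, N_MELS)      # time-aligned mel-spectrogram frames (dummy data)
loss = nn.functional.mse_loss(model(frames), targets)
loss.backward()                      # gradients for one regression training step

At synthesis time, the per-frame predictions would be stacked into a (1, 80, T) mel-spectrogram and passed to a pre-trained WaveGlow for waveform generation, for example via the model NVIDIA publishes through torch.hub, whose infer method maps a mel-spectrogram to audio; whether any such public checkpoint matches the English- and Hungarian-trained model used in the paper is not established here.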

Related research

06/24/2019 · Ultrasound-based Silent Speech Interface Built on a Continuous Vocoder
Recently it was shown that within the Silent Speech Interface (SSI) fiel...

07/26/2021 · Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging
For articulatory-to-acoustic mapping, typically only limited parallel tr...

04/29/2021 · Towards a practical lip-to-speech conversion system using deep neural networks and mobile application frontend
Articulatory-to-acoustic (forward) mapping is a technique to predict spe...

04/10/2019 · Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces
When using ultrasound video as input, Deep Neural Network-based Silent S...

05/16/2020 · Learning Joint Articulatory-Acoustic Representations with Normalizing Flows
The articulatory geometric configurations of the vocal tract and the aco...

06/29/2020 · Ultra2Speech – A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images
Thousands of individuals need surgical removal of their larynx due to cr...

03/04/2019 · Analysing Deep Learning-Spectral Envelope Prediction Methods for Singing Synthesis
We conduct an investigation on various hyper-parameters regarding neural...
