Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces

04/10/2019
by Gábor Gosztolya, et al.

When using ultrasound video as input, Deep Neural Network-based Silent Speech Interfaces usually rely on the whole image to estimate the spectral parameters required for the speech synthesis step. Although this approach is quite straightforward and permits the synthesis of intelligible speech, it has several disadvantages. Besides failing to capture the relations between neighbouring regions (i.e. pixels) of the image, this pixel-by-pixel representation is also quite uneconomical: a significant part of the image is irrelevant for the spectral parameter estimation task, the information stored by neighbouring pixels is redundant, and the neural network becomes quite large due to the large number of input features. To resolve these issues, in this study we train an autoencoder neural network on the ultrasound images; the spectral speech parameters are then estimated by a second DNN, using the activations of the autoencoder's bottleneck layer as features. In our experiments, the proposed method proved more effective than the standard approach: the normalized mean squared error scores were lower, while the correlation values were higher in every case. Based on the results of a listening test, the synthesized utterances also sounded more natural to native listeners. A further advantage of the proposed approach is that, owing to the relatively small size of the bottleneck layer, we can utilize several consecutive ultrasound images during estimation without a significant increase in network size, while markedly improving the accuracy of parameter estimation.
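The pipeline described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual implementation: the frame size, bottleneck width, number of spectral parameters, and context window are assumed, illustrative values, and training (backpropagation on the reconstruction and regression losses) is omitted. It shows only the data flow: flattened ultrasound frames are compressed by the encoder, the bottleneck codes of several consecutive frames are stacked, and the stacked codes feed the second, spectral-estimation DNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative sizes (not taken from the paper):
N_PIXELS = 64 * 128   # one flattened ultrasound frame
BOTTLENECK = 128      # autoencoder bottleneck width
N_SPECTRAL = 25       # spectral parameters per frame


def relu(x):
    return np.maximum(x, 0.0)


class Autoencoder:
    """Fully connected autoencoder; only its bottleneck activations are
    passed on as compact features for the spectral-parameter estimator."""

    def __init__(self, n_in, n_bottleneck):
        self.W_enc = rng.normal(0, 0.01, (n_in, n_bottleneck))
        self.b_enc = np.zeros(n_bottleneck)
        self.W_dec = rng.normal(0, 0.01, (n_bottleneck, n_in))
        self.b_dec = np.zeros(n_in)

    def encode(self, x):
        # Bottleneck activations: the low-dimensional image features
        return relu(x @ self.W_enc + self.b_enc)

    def reconstruct(self, x):
        # Full forward pass, used when training on reconstruction error
        return self.encode(x) @ self.W_dec + self.b_dec


def stack_frames(codes, context=2):
    """Concatenate each frame's bottleneck code with its +/- `context`
    neighbours (edges are clipped). Because each code is small, the
    second DNN's input stays modest even with several frames stacked."""
    n = len(codes)
    out = []
    for t in range(n):
        idx = np.clip(np.arange(t - context, t + context + 1), 0, n - 1)
        out.append(np.concatenate([codes[i] for i in idx]))
    return np.stack(out)


# 10 consecutive ultrasound frames, flattened pixel-by-pixel
frames = rng.random((10, N_PIXELS))

ae = Autoencoder(N_PIXELS, BOTTLENECK)
codes = ae.encode(frames)        # (10, 128): compact per-frame features
stacked = stack_frames(codes)    # (10, 640): 5-frame context window

# Input layer of the second (spectral-estimation) DNN, shown as one
# matrix multiply for brevity
W_out = rng.normal(0, 0.01, (stacked.shape[1], N_SPECTRAL))
spectral = relu(stacked @ W_out)  # (10, 25) spectral parameter estimates
```

Note how stacking five frames raises the second network's input from 128 to 640 features, whereas stacking five raw frames would have required 5 × 8192 = 40960 inputs, which is the size argument the abstract makes for the bottleneck representation.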


Related research

08/06/2020 · Ultrasound-based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis
For articulatory-to-acoustic mapping using deep neural networks, typical...

07/05/2021 · Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input
Articulatory information has been shown to be effective in improving the...

06/24/2019 · Ultrasound-based Silent Speech Interface Built on a Continuous Vocoder
Recently it was shown that within the Silent Speech Interface (SSI) fiel...

07/12/2021 · Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging
In this paper, we present our first experiments in text-to-articulation ...

04/12/2019 · DNN-based Acoustic-to-Articulatory Inversion using Ultrasound Tongue Imaging
Speech sounds are produced as the coordinated movement of the speaking o...

11/20/2018 · Weakly Supervised Estimation of Shadow Confidence Maps in Ultrasound Imaging
Detecting acoustic shadows in ultrasound images is important in many cli...

06/26/2022 · Improved Processing of Ultrasound Tongue Videos by Combining ConvLSTM and 3D Convolutional Networks
Silent Speech Interfaces aim to reconstruct the acoustic signal from a s...
