Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

06/08/2021
by   Amin Honarmandi Shandiz, et al.

Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings not only represent the linguistic content but are also highly specific to the individual speaker. Because multi-speaker data sets have been lacking, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adjusted the x-vector framework popular in speech processing to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In the experiments we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize nicely to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework brought about a marginal reduction in the error rate of the spectral estimation step.
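
The general idea behind an x-vector style embedding, and one common way of feeding it to a conversion network, can be illustrated with a minimal NumPy sketch. This is not the authors' network: the function names, the feature dimensions, and the random weights standing in for trained layers are all hypothetical, and concatenating the embedding to every frame is only one plausible conditioning scheme, assumed here for illustration.

```python
import numpy as np


def stats_pooling(frame_features):
    """Collapse (num_frames, feat_dim) features into one (2 * feat_dim,) vector.

    Statistics pooling (mean and standard deviation over time) is the step
    that lets x-vector style models map a variable-length recording to a
    fixed-length representation.
    """
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])


def speaker_embedding(frame_features, weights, bias):
    """Project the pooled statistics to a low-dimensional embedding.

    `weights` and `bias` are placeholders for a trained layer (assumption).
    """
    pooled = stats_pooling(frame_features)
    return np.tanh(weights @ pooled + bias)


def condition_frames(frame_features, embedding):
    """Append the same speaker embedding to every frame-level feature vector."""
    tiled = np.tile(embedding, (frame_features.shape[0], 1))
    return np.concatenate([frame_features, tiled], axis=1)


rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 64))   # 120 frames of hypothetical 64-dim features
W = rng.normal(size=(16, 128)) * 0.1  # untrained stand-in for a learned projection
b = np.zeros(16)

emb = speaker_embedding(frames, W, b)
conditioned = condition_frames(frames, emb)
print(emb.shape, conditioned.shape)
```

In a real system the frame-level features would come from a network trained on the ultrasound frames for speaker classification, and the embedding would be read from a layer after the pooling step.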

research
07/01/2019

Speaker-independent classification of phonetic segments from raw ultrasound in child speech

Ultrasound tongue imaging (UTI) provides a convenient way to visualize t...
research
10/26/2022

Speaker Diarization Based on Multi-channel Microphone Array in Small-scale Meeting

In the task of speaker diarization, the number of small-scale meetings a...
research
05/30/2023

Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using Spatial Transformer Networks

Thanks to the latest deep learning algorithms, silent speech interfaces ...
research
08/03/2020

Speaker dependent articulatory-to-acoustic mapping using real-time MRI of the vocal tract

Articulatory-to-acoustic (forward) mapping is a technique to predict spe...
research
06/22/2016

A segmental framework for fully-unsupervised large-vocabulary speech recognition

Zero-resource speech technology is a growing research area that aims to ...
research
02/06/2023

Residual Information in Deep Speaker Embedding Architectures

Speaker embeddings represent a means to extract representative vectorial...
research
07/01/2019

Synchronising audio and ultrasound by learning cross-modal embeddings

Audiovisual synchronisation is the task of determining the time offset b...
