Silent versus modal multi-speaker speech recognition from ultrasound and video

02/27/2021

by Manuel Sam Ribeiro, et al.

We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines, extracted from ultrasound tongue images. Overall, we observe that the duration of silent speech is longer than that of modal speech, and that silent speech covers a smaller articulatory space than modal speech. Although these two properties are statistically significant across speaking modes, they do not directly correlate with word error rates from speech recognition.
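The articulatory-space measure described above (the convex hull of pooled tongue-spline points) can be sketched in a few lines. This is an illustrative reimplementation, not the authors' actual pipeline: it assumes tongue splines have already been extracted as 2D point sets, builds the convex hull with Andrew's monotone chain, and returns its area via the shoelace formula.

```python
def convex_hull_area(points):
    """Area of the 2D convex hull of a point set.

    points: iterable of (x, y) pairs, e.g. tongue-spline coordinates
    pooled over all frames of one speaking mode (hypothetical input;
    spline extraction itself is outside this sketch).
    """
    pts = sorted(set((float(x), float(y)) for x, y in points))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    # Andrew's monotone chain: build lower and upper hull chains
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    hull = lower[:-1] + upper[:-1]  # concatenate, dropping duplicated endpoints

    # Shoelace formula for the polygon area
    area = 0.0
    n = len(hull)
    for i in range(n):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

# Toy example: a unit square with an interior point has hull area 1.0
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]
print(convex_hull_area(square))  # 1.0
```

Comparing this area for silent versus modal speech gives a single scalar summary of how much of the articulatory space each mode covers, which is how the paper quantifies its finding that silent speech occupies a smaller space.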


