Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling

05/30/2023
by   Theodoros Kouzelis, et al.

The study of speech disorders can benefit greatly from time-aligned data. However, audio-text mismatches in disfluent speech cause rapid performance degradation for modern speech aligners, hindering the use of automatic approaches. In this work, we propose a simple and effective modification of the alignment graph construction for CTC-based models using Weighted Finite State Transducers. The proposed weakly-supervised approach alleviates the need for verbatim transcription of speech disfluencies for forced alignment. During graph construction, we allow the modeling of common speech disfluencies, namely repetitions and omissions. Further, we show that by assessing the degree of audio-text mismatch through the use of Oracle Error Rate, our method can be effectively used in the wild. Our evaluation on a corrupted version of the TIMIT test set and the UCLASS dataset shows significant improvements, particularly for recall, achieving a 23-25% relative improvement over baselines.
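The idea of letting the alignment graph tolerate omissions and repetitions can be illustrated with a small sketch. This is not the authors' WFST construction: it is a plain Viterbi search over a reference token sequence, it leaves out the CTC blank symbol, and the names `align`, `frame`, `skip_penalty` and `repeat_penalty` (and their values) are hypothetical choices made only for illustration. Besides the usual "stay" and "advance" arcs, each state gets a penalized arc that skips one reference token (omission) and a penalized arc that jumps back one token (repetition).

```python
import math

def frame(token, vocab=("a", "b", "c"), peak=0.9):
    """Toy per-frame posterior: `peak` mass on `token`, the rest spread evenly."""
    rest = (1.0 - peak) / (len(vocab) - 1)
    return {v: math.log(peak if v == token else rest) for v in vocab}

def align(log_probs, tokens, skip_penalty=-2.0, repeat_penalty=-2.0):
    """Viterbi alignment of frames to `tokens` with omission/repetition arcs.

    log_probs: one dict per frame, mapping token -> log-probability.
    Returns one reference-token index per frame.
    """
    n, T = len(tokens), len(log_probs)
    neg = -math.inf
    score = [[neg] * n for _ in range(T)]
    back = [[-1] * n for _ in range(T)]

    # Start on the first token, or on the second one if the first was omitted.
    for s in range(min(2, n)):
        penalty = skip_penalty if s == 1 else 0.0
        score[0][s] = log_probs[0].get(tokens[s], neg) + penalty

    for t in range(1, T):
        for s in range(n):
            emit = log_probs[t].get(tokens[s], neg)
            arcs = [
                (s, 0.0),                 # stay on the same token
                (s - 1, 0.0),             # advance to the next token
                (s - 2, skip_penalty),    # omission: skip one reference token
                (s + 1, repeat_penalty),  # repetition: jump back one token
            ]
            for prev, weight in arcs:
                if 0 <= prev < n:
                    cand = score[t - 1][prev] + weight + emit
                    if cand > score[t][s]:
                        score[t][s] = cand
                        back[t][s] = prev

    # The path must end on the last reference token.
    if score[T - 1][n - 1] == neg:
        raise ValueError("no valid alignment path")
    path = [n - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

reference = ["a", "b", "c"]
print(align([frame(x) for x in "aabc"], reference))   # fluent speech
print(align([frame(x) for x in "acc"], reference))    # "b" omitted
print(align([frame(x) for x in "ababc"], reference))  # "a b" repeated
```

With the extra arcs, the mismatched frames are explained by a skip or a back-jump instead of being forced onto the wrong reference token, which is the behaviour the weakly-supervised graph is meant to provide. The penalties trade off how readily the search assumes a disfluency rather than an acoustic mismatch.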

research
09/20/2023

SpeechAlign: a Framework for Speech Translation Alignment Evaluation

Speech-to-Speech and Speech-to-Text translation are currently dynamic ar...
research
12/31/2022

Sample-Efficient Unsupervised Domain Adaptation of Speech Recognition Systems: A case study for Modern Greek

Modern speech recognition systems exhibit rapid performance degradation...
research
03/01/2023

On the Audio-visual Synchronization for Lip-to-Speech Synthesis

Most lip-to-speech (LTS) synthesis models are trained and evaluated unde...
research
11/05/2018

Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation

End-to-end Speech Translation (ST) models have many potential advantages...
research
04/11/2021

NeMo Toolbox for Speech Dataset Construction

In this paper, we introduce a new toolbox for constructing speech datase...
research
08/11/2023

Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping

Visual Speech Recognition (VSR) differs from the common perception tasks...
