Computer-Assisted Pronunciation Training (CAPT) is an important technology for offering a flexible education service to second language (L2) learners. In the pronunciation training process, the learner is asked to read a pre-defined text, and the CAPT system should detect mispronunciations in the speech and give proper feedback. The common approach for CAPT is to compare the pronounced speech with the standard pronunciation distribution. If the deviation is too large, the pronunciation is judged as an error.
One class of methods aligns the student utterance with a teacher utterance via Dynamic Time Warping (DTW). Misalignment extracted from the DTW path is used to detect the mispronunciation. However, such methods have two main shortcomings. First, teacher utterances are expensive to prepare. Admittedly, with the rapid development of deep learning, text-to-speech (TTS) technologies [11, 16, 3, 15] can be applied to generate teacher utterances; nevertheless, how to make the generated speech more suitable for alignment and mispronunciation detection remains to be explored. Second, the feature used for alignment needs to be invariant to the speaking style; otherwise, the alignment may fail. To extract content-related and style-invariant features, other methods instead use features extracted by an automatic speech recognition (ASR) model. Note that the ASR model is trained on utterances collected from native (L1) speakers; thus, these methods in fact compare the input speech to learned knowledge of standard pronunciation. For example, the goodness-of-pronunciation (GOP)
uses the posterior probability of pre-defined phonemes to judge correctness. In recent years, ASR performance has improved greatly, so recognizing the pronounced phonemes and comparing them with the pre-defined phonemes is also a valid approach; this method is used by [27, 25, 28] to simplify the workflow. However, it is hard for these ASR-based methods to be trained on L2 utterances. First, to utilize L2 utterances, the text annotations must be the phonemes actually pronounced by the speaker rather than the phonemes of the pre-defined text, and such annotations are expensive to obtain. Second, ASR-based methods generally use only the standard phoneme set (for example, the 44 phonemes of English) for recognition and comparison. However, as analyzed in [2, 19], English learners from different language backgrounds may show acoustic characteristics similar to their mother tongues. Thus, L2 utterances may include undefined phonemes (which we call L2 phonemes in this paper) that cannot be properly classified into the standard phonemes.
To address the aforementioned limitations, we propose to encode both L1 and L2 utterances into discrete acoustic units (referred to as "codes" in the following discussion for simplicity) using the vector-quantized variational autoencoder (VQ-VAE) [20, 21]. This training process is self-supervised, so the L2 acoustic features can be properly modeled without expensive annotations. Next, we utilize the encoded L1 code sequence and other distracting code sequences to simulate mispronunciations, and train a correction model that can discriminate the error codes and revise them to the correct ones given the pre-defined text. By decoding the corrected code sequence, we can also obtain the corrected speech while keeping the style of the speaker. As analyzed in [26, 14, 4], the more similar the teacher's speech is to the speaking style of the learner, the more positive the impact on pronunciation training. Thus, the corrected speech can be a powerful aid for education.
Our contributions can be summarized as follows.
We propose to encode the speech into discrete acoustic units via VQ-VAE for mispronunciation detection and correction. The proposed self-supervised encoding method can better model L2 features without expensive annotations.
By discriminating error codes and generating the correct ones, the proposed method not only detects the mispronunciation but also generates the correct pronunciation while keeping the speaking style. To the best of our knowledge, this is the first approach to perform mispronunciation detection and correction simultaneously. Such feedback is valuable for learners improving their speaking skills.
We conduct experiments on the L2-Arctic dataset. Experiments show that the detection F1 score is improved by 9.58% relatively compared with ASR-based methods. The proposed method also achieves a comparable word error rate (WER) and the best style preservation for mispronunciation correction compared with TTS-based methods. (Audio samples: https://zju-zhan-zhang.github.io/mispronunciation-d-c/)
II. Proposed Method
II-A. Acoustic Unit Encoding via VQ-VAE
We adopt VQ-VAE for spectrum encoding as illustrated in Fig. 1. The VQ encoder is constructed by Conformer encoder layers, a time-domain down-sampling convolutional (Conv) layer, and a Gumbel-Softmax VQ layer [6, 12]. The raw waveform is first converted to the log-Mel spectrum and goes through the Conformer encoder layers and the down-sampling layer for feature extraction.
Then, the VQ layer transforms the encoded feature $h \in \mathbb{R}^{T \times d_a}$ ($d_a$ is the attention dim of the Conformer layer) to logits $l \in \mathbb{R}^{T \times G}$ ($G$ is the number of codebooks) using a linear layer. The probability for choosing the $j$-th codebook embedding is

$$p_j = \frac{\exp\big((l_j + v_j)/\tau\big)}{\sum_{k=1}^{G}\exp\big((l_k + v_k)/\tau\big)}, \qquad v_j = -\log(-\log(u_j)),$$

where $\tau$ is the Gumbel-Softmax temperature, $d_c$ is the dim of the codebook embeddings, and $u_j$ is uniformly sampled from $\mathcal{U}(0, 1)$. During the forward pass, the index of the chosen codebook is $j^{*} = \arg\max_{j} p_j$.
In the backward pass, the true gradient of the Gumbel-Softmax output is used.
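The sampling step above can be sketched in plain Python. This is an illustrative toy with hypothetical function names, not the paper's implementation; a real model would operate on batched tensors and keep the soft probabilities for the straight-through backward pass.

```python
import math
import random

def gumbel_softmax_probs(logits, tau=1.0, rng=random):
    """Perturb logits with Gumbel noise and apply a softmax with temperature tau.

    Returns the soft probabilities. In training, the forward pass takes the
    argmax (hard one-hot) while the backward pass uses the gradient of these
    soft probabilities (straight-through estimator).
    """
    # Gumbel noise: v = -log(-log(u)), u ~ Uniform(0, 1)
    noised = [l - math.log(-math.log(rng.random())) for l in logits]
    m = max(n / tau for n in noised)  # subtract the max for numerical stability
    exps = [math.exp(n / tau - m) for n in noised]
    z = sum(exps)
    return [e / z for e in exps]

def choose_code(logits, tau=1.0, rng=random):
    """Forward-pass selection: index of the most probable codebook entry."""
    probs = gumbel_softmax_probs(logits, tau, rng)
    return max(range(len(probs)), key=lambda j: probs[j])
```

Lower temperatures push the soft distribution toward a hard one-hot choice, which is why the temperature is typically annealed during training.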
The code decoder in Fig. 1 uses a transposed convolutional (TConv) layer for time-domain up-sampling and Conformer decoder layers to recover the original spectrum from the encoded code embedding sequence. The code sequence is merged with the speaker information; we use the X-vector extracted from the input spectrum as the speaker information.
On the one hand, the whole VQ-VAE must preserve information from the encoder to reconstruct the original spectrum. On the other hand, the VQ layer imposes an information bottleneck to force the encoder to discard non-essential details. As the style-related speaker information is directly offered to the decoder, the encoder will focus on extracting acoustic features that correlate more with the content.
We apply two loss functions to train the proposed VQ-VAE. First, to reconstruct the original spectrum, we apply the mean square error (MSE) loss between the reconstructed spectrum and the original one. Second, to encourage codebook usage, inspired by wav2vec 2.0, the diversity loss is also applied to increase the information entropy of the code distribution. The final loss is defined as

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda \mathcal{L}_{\mathrm{div}},$$

where $\lambda$ is the weight of the diversity loss; we fix $\lambda$ in our experiments.
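As a sketch of the diversity term, the formulation below follows the wav2vec 2.0 style of penalizing low entropy of the batch-averaged code distribution; the exact form used here is an assumption and may differ from the paper's.

```python
import math

def diversity_loss(prob_batch):
    """Diversity loss encouraging uniform codebook usage.

    prob_batch: list of per-frame probability vectors over the V codebook
    entries. The vectors are averaged into pbar and low entropy is penalized,
    following the wav2vec 2.0 formulation L_div = (V - exp(H(pbar))) / V,
    which is 0 for perfectly uniform usage and approaches 1 on collapse.
    """
    v = len(prob_batch[0])
    pbar = [sum(p[j] for p in prob_batch) / len(prob_batch) for j in range(v)]
    entropy = -sum(p * math.log(p) for p in pbar if p > 0)
    return (v - math.exp(entropy)) / v
```

If every frame puts all its mass on one codebook entry, `exp(H)` collapses to 1 and the loss saturates near `(V - 1) / V`, giving the encoder a strong incentive to spread usage across the codebook.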
II-B. Non-Autoregressive Code Correction
After the training of the proposed VQ-VAE converges, we freeze the model parameters and encode both L1 and L2 utterances into discrete codes. As illustrated in Fig. 2, to simulate mispronunciations, the ground-truth L1 codes are mixed with other distracting codes. Formally, for the L1 code sequence $c$, we randomly replace segments of its original codes with distracting codes $c'$. In practice, we use a mask sequence $m$ (its initial values are 0) and set the replaced segments to 1. Thus, the corrupted code sequence can be denoted as

$$\tilde{c} = (1 - m) \odot c + m \odot c',$$

where $\odot$ is the element-wise product. We sample $c'$ from L2 utterances and from other utterances read by the same speaker as $c$. We set a fixed number of distracting replacements for each code sequence; the replacement length and the start position are uniformly sampled.
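The corruption scheme can be illustrated as follows (toy Python; `n_repl` and `max_len` stand in for the paper's sampling ranges, whose exact values are not reproduced here).

```python
import random

def corrupt_codes(codes, distractors, n_repl, max_len, rng=random):
    """Replace n_repl random segments of `codes` with distractor codes.

    Returns the corrupted sequence and the 0/1 error mask (1 = replaced),
    mirroring c_tilde = (1 - m) * c + m * c'.
    """
    corrupted = list(codes)
    mask = [0] * len(codes)
    for _ in range(n_repl):
        length = rng.randint(1, max_len)              # segment length, sampled uniformly
        start = rng.randint(0, len(codes) - length)   # start position, sampled uniformly
        for t in range(start, start + length):
            corrupted[t] = distractors[t % len(distractors)]
            mask[t] = 1
    return corrupted, mask
```

The mask produced here is exactly the training target for the error-mask prediction head, so the corrector learns which positions were replaced as well as what the original codes were.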
We adopt the Transformer structure to predict both the original code sequence and the error mask from the phonemes and the corrupted code sequence. The text encoder (TextEnc) uses a front Conv layer and Transformer encoder layers to encode the phoneme sequence $p$. The code corrector is constructed by a front Conv layer, Transformer decoder layers, and two output layers for the code prediction $\hat{c}$ and the error-mask prediction $\hat{m}$:

$$(\hat{c}, \hat{m}) = \mathrm{Corrector}\big(\tilde{c}, \mathrm{TextEnc}(p)\big).$$

The loss function is defined as the classification loss between $\hat{c}$ and $c$, and between $\hat{m}$ and $m$, using the cross-entropy (CE) loss and the binary cross-entropy (BCE) loss, respectively:

$$\mathcal{L}_{\mathrm{corr}} = \mathrm{CE}(\hat{c}, c) + \mathrm{BCE}(\hat{m}, m).$$
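A minimal sketch of this combined objective on per-frame predictions follows (pure Python over lists; a real implementation would use framework loss functions over batched tensors, and the equal weighting of the two terms is an assumption).

```python
import math

def correction_loss(code_logprobs, target_codes, mask_probs, target_mask):
    """Cross-entropy over code predictions plus binary cross-entropy over
    the error-mask predictions, each averaged over time.

    code_logprobs: per-frame log-probability vectors over the codebook
    target_codes:  per-frame ground-truth code indices
    mask_probs:    per-frame predicted probability that the code is an error
    target_mask:   per-frame 0/1 ground-truth error mask
    """
    ce = -sum(lp[c] for lp, c in zip(code_logprobs, target_codes)) / len(target_codes)
    bce = -sum(m * math.log(p) + (1 - m) * math.log(1 - p)
               for p, m in zip(mask_probs, target_mask)) / len(target_mask)
    return ce + bce
```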
II-C. Mispronunciation Correction and Detection
During inference, the input speech is first encoded into the discrete code sequence. Then, the code corrector decides whether each code matches the standard L1 distribution conditioned on the input phonemes. Since we use the L1 code sequence as the training target in Eq. (8), codes that only appear in L2 utterances or that deviate far from the corresponding phoneme will be found and replaced with the correct ones. The predicted code sequence is further passed to the code decoder to obtain the spectrum. Finally, the spectrum is converted to the corrected speech waveform by a vocoder.
Note that $\hat{m}$ is the error prediction for each code. We use the attention map of the last decoder layer to align the code sequence to the phoneme sequence. Formally, if we define the attention weight between phoneme $i$ and the predicted error mask at frame $t$ as $a_{i,t}$, the error prediction for phoneme $i$ is

$$e_i = \sum_{t} a_{i,t}\, \hat{m}_t.$$

A sample result is shown in Fig. 3. When $e_i$ is larger than the threshold $\beta$, we mark this phoneme as a mispronunciation.
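The attention-weighted aggregation and thresholding can be sketched as below (illustrative names; each attention row is assumed normalized over frames).

```python
def phoneme_errors(attn, frame_errors, threshold=0.5):
    """Aggregate per-code error predictions to phoneme level.

    attn[i][t]: attention weight between phoneme i and code frame t
    frame_errors[t]: predicted probability that the code at frame t is wrong

    The phoneme-level score is the attention-weighted average of the frame
    error predictions; scores above `threshold` are flagged as
    mispronunciations.
    """
    scores = [sum(a * e for a, e in zip(row, frame_errors)) for row in attn]
    return scores, [s > threshold for s in scores]
```

Because the attention rows sum to one, a phoneme attending mostly to clean frames keeps a low score even if a distant frame is flagged, which makes the aggregation robust to isolated frame-level false alarms.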
III. Experiments

III-A. Model Details
We use Librispeech as the L1 dataset and L2-Arctic as the L2 dataset. Annotated L2-Arctic utterances are kept as the test set, and the others are combined with Librispeech utterances to train the proposed VQ-VAE. We convert the 24 kHz raw waveform into the 80-dim log-Mel spectrum for experiments. The detailed parameters are listed in Table I, where $d_a$ is the attention dim, $d_{ff}$ is the feed-forward dim, $h$ is the number of attention heads, and $k$ is the kernel size.
We train both the VQ-VAE and the code corrector until the training loss converges. We use the Adam optimizer with a warm-up learning rate scheduler.
| Module | Hyper-parameters |
|---|---|
| Conformer Enc ×3 | attention dim, feed-forward dim, heads, kernel size |
| VQ Layer | number of codebooks, codebook dim, temperature |
| Conformer Dec ×3 | attention dim, feed-forward dim, heads, kernel size |
| Transformer Enc ×6 | attention dim, feed-forward dim, heads |
| Transformer Dec ×6 | attention dim, feed-forward dim, heads |
III-B. Mispronunciation Detection Results
For mispronunciation detection, the model should balance detecting mispronunciations and accepting correct pronunciations. Thus, we use the F1 score as the final metric for this task. In addition to accuracy (ACC), precision (PRE), and recall (REC), we also report the false rejection rate (FRR) and the false acceptance rate (FAR).
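For concreteness, these metrics can be computed from phoneme-level counts as below. Treating a mispronunciation as the positive class is the conventional choice and an assumption about the paper's exact bookkeeping.

```python
def detection_metrics(tp, fp, tn, fn):
    """Detection metrics from phoneme-level counts.

    tp: mispronunciations correctly detected
    fp: correct phonemes wrongly rejected as mispronounced
    tn: correct phonemes accepted
    fn: mispronunciations missed (accepted as correct)
    """
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    acc = (tp + tn) / (tp + fp + tn + fn)
    frr = fp / (fp + tn)   # fraction of correct phonemes wrongly flagged
    far = fn / (fn + tp)   # fraction of mispronunciations wrongly accepted
    return {"ACC": acc, "PRE": pre, "REC": rec, "F1": f1, "FRR": frr, "FAR": far}
```

Note that FRR and FAR are complementary to the precision/recall view: lowering the detection threshold trades FAR for FRR, which is why F1 is used as the single summary metric.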
Following prior work, a phoneme-level ASR system trained on Librispeech is set as the baseline. This baseline recognizes the spoken phonemes and then aligns them to the target phonemes using the Needleman-Wunsch algorithm to determine mispronunciations. We show the results in Table II, where the results of [29] are also displayed.
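A minimal sketch of this baseline's alignment step follows, assuming unit match/mismatch/gap scores (the actual scoring used by the baseline may differ).

```python
def needleman_wunsch(ref, hyp, match=1, mismatch=-1, gap=-1):
    """Global alignment of two phoneme sequences; returns aligned pairs,
    with None marking an insertion or deletion gap."""
    n, m = len(ref), len(hyp)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if ref[i - 1] == hyp[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback from the bottom-right corner
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1]
                + (match if ref[i - 1] == hyp[j - 1] else mismatch)):
            pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], None)); i -= 1      # deletion
        else:
            pairs.append((None, hyp[j - 1])); j -= 1      # insertion
    pairs.reverse()
    return pairs

def detect_mispronunciations(pairs):
    """A target phoneme is flagged when the aligned spoken phoneme differs
    or is missing; insertions (None on the reference side) are ignored."""
    return [(r, h, r != h) for r, h in pairs if r is not None]
```

For example, aligning target `["AH", "B", "K"]` against recognized `["AH", "P", "K"]` flags only the middle phoneme as a substitution error.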
As shown in Table II, with the help of deep learning, the baseline ASR-based model achieves a great improvement in F1 score compared with the GOP-based GMM-HMM model. However, as the ASR-based model can only utilize L1 utterances, when faced with unseen L2 utterances it may fail to classify L2 phonemes and shows a relatively high FRR. In contrast, the proposed model encodes both L1 and L2 utterances in a self-supervised manner and further uses both the code sequence and the conditioning phonemes for judgment. With the selected threshold, the F1 score is increased to 0.423, a 9.58% relative improvement over the baseline.
III-C. Mispronunciation Correction
To test whether the generated speech is corrected to the standard pronunciation, we use a word-level ASR system trained on Librispeech to measure WER. A higher WER suggests that the speech contains more mispronunciations that cannot be recognized by this L1 ASR system. To evaluate the speaking style, we ask 20 volunteers to rate the style similarity between the raw speech and the corrected speech.
We use the speech generated by an L1 Fastspeech2 TTS system (conditioned on the text phonemes and the speaker X-vector) as the WER topline, denoted as Fastspeech2 (w/o Style). For comparison, to preserve the other style attributes of the original speaker, the energy, pitch, and phoneme duration are used as extra conditions for another TTS generation, denoted as Fastspeech2 (w/ Style). For a fair WER comparison, the original pronunciation is also reconstructed with the same Parallel-WaveGAN vocoder.
As shown in Table III, adding styles from L2 utterances leads to a WER degradation, but the style similarity increases. As our method only modifies the wrong codes and keeps the correct ones (a sample spectrum is shown in Fig. 4), it achieves comparable correction performance and the best style preservation.
In this paper, we propose to use VQ-VAE for discrete acoustic unit encoding. Further, we integrate discriminative and generative modeling to detect the error codes and generate the correct ones, so the proposed method can perform mispronunciation detection and correction at the same time. Experiments on the L2-Arctic dataset show that the proposed method is a promising approach for CAPT.
References

- Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33.
- (2010) First language phonetic drift during second language acquisition. ProQuest LLC.
- (2020) MultiSpeech: multi-speaker text to speech with transformer. In INTERSPEECH.
- (2009) Foreign accent conversion in computer assisted pronunciation training. Speech Communication 51(10), pp. 920–932.
- (2020) Conformer: convolution-augmented transformer for speech recognition. In INTERSPEECH.
- (2016) Categorical reparameterization with Gumbel-Softmax.
- (2012) A comparison-based approach to mispronunciation detection. In IEEE Spoken Language Technology Workshop (SLT), pp. 382–387.
- (2013) Pronunciation assessment via a comparison-based system. In Speech and Language Technology in Education.
- Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams. In IEEE ICASSP, pp. 8227–8231.
- (2019) CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis. In IEEE ICASSP.
- Neural speech synthesis with transformer network. In 34th AAAI Conference on Artificial Intelligence, pp. 6706–6713.
- (2014) A* sampling.
- (2015) Librispeech: an ASR corpus based on public domain audio books. In IEEE ICASSP, pp. 5206–5210.
- (2002) Enhancing foreign language tutors – in search of the golden speaker. Speech Communication 37(3-4), pp. 161–173.
- (2020) FastSpeech 2: fast and high-quality end-to-end text to speech.
- (2019) FastSpeech: fast, robust and controllable text to speech. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3171–3180.
- (2018) X-vectors: robust DNN embeddings for speaker recognition. In IEEE ICASSP, pp. 5329–5333.
- (2008) The Needleman-Wunsch algorithm for sequence alignment.
- (2018) Investigating the role of L1 in automatic pronunciation evaluation of L2 speech. In INTERSPEECH, pp. 1636–1640.
- (2017) Neural discrete representation learning.
- Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge. In INTERSPEECH, pp. 4836–4840.
- (2017) Attention is all you need. Advances in Neural Information Processing Systems, pp. 5999–6009.
- (2000) Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication 30(2), pp. 95–108.
- Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram.
- (2020) An end-to-end mispronunciation detection system for L2 English speech leveraging novel anti-phone modeling. In INTERSPEECH, pp. 3032–3036.
- (2019) Self-imitating feedback generation using GAN for computer-assisted pronunciation training. In INTERSPEECH, pp. 1881–1885.
- (2020) End-to-end automatic pronunciation error detection based on improved hybrid CTC/attention architecture. Sensors 20(7), pp. 1–24.
- (2021) Text-conditioned transformer for automatic pronunciation error detection. Speech Communication 130, pp. 55–63.
- (2018) L2-ARCTIC: a non-native English speech corpus. In INTERSPEECH.