Text-Aware End-to-end Mispronunciation Detection and Diagnosis

06/15/2022
by   Linkai Peng, et al.
0

Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training system (CAPT). In the field of assessing the pronunciation quality of constrained speech, the given transcriptions can play the role of a teacher. Conventional methods have fully utilized the prior texts for the model construction or improving the system performance, e.g. forced-alignment and extended recognition networks. Recently, some end-to-end based methods attempt to incorporate the prior texts into model training and preliminarily show the effectiveness. However, previous studies mostly consider applying raw attention mechanism to fuse audio representations with text representations, without taking possible text-pronunciation mismatch into account. In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information. Moreover, given the transcriptions, we design an extra contrastive loss to reduce the gap between the learning objective of phoneme recognition and MDD. We conducted experiments using two publicly available datasets (TIMIT and L2-Arctic) and our best model improved the F1 score from 57.51% to 61.75% compared to the baselines. Besides, we provide a detailed analysis to shed light on the effectiveness of gating mechanism and contrastive learning on MDD.

READ FULL TEXT
research
04/17/2021

A Full Text-Dependent End to End Mispronunciation Detection and Diagnosis with Easy Data Augmentation Techniques

Recently, end-to-end mispronunciation detection and diagnosis (MD D) s...
research
06/13/2023

Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages

Lyrics alignment gained considerable attention in recent years. State-of...
research
02/19/2019

A spelling correction model for end-to-end speech recognition

Attention-based sequence-to-sequence models for speech recognition joint...
research
07/02/2019

Attention model for articulatory features detection

Articulatory distinctive features, as well as phonetic transcription, pl...
research
04/11/2021

NeMo Toolbox for Speech Dataset Construction

In this paper, we introduce a new toolbox for constructing speech datase...
research
03/10/2023

Improving Text-Audio Retrieval by Text-aware Attention Pooling and Prior Matrix Revised Loss

In text-audio retrieval (TAR) tasks, due to the heterogeneity of content...
research
08/26/2021

Towards Robust Mispronunciation Detection and Diagnosis for L2 English Learners with Accent-Modulating Methods

With the acceleration of globalization, more and more people are willing...

Please sign up or login with your details

Forgot password? Click here to reset