Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning

04/08/2022
by Eesung Kim, et al.

Self-supervised learning (SSL) models such as wav2vec 2.0 and HuBERT have shown promising results on various downstream tasks in the speech community. In particular, speech representations learned by SSL models have been shown to encode various speech-related characteristics effectively. In this context, we propose a novel automatic pronunciation assessment method based on SSL models. First, the proposed method fine-tunes the pre-trained SSL models with connectionist temporal classification (CTC) to adapt them to the English pronunciation of English-as-a-second-language (ESL) learners in the given data environment. Then, layer-wise contextual representations are extracted from all transformer layers of the SSL models. Finally, the automatic pronunciation score is estimated using a bidirectional long short-term memory (BLSTM) network that takes the layer-wise contextual representations and the corresponding text as input. We show that the proposed SSL model-based methods outperform the baselines, in terms of the Pearson correlation coefficient, on datasets of Korean ESL learner children and Speechocean762. Furthermore, we analyze how the representations from different transformer layers of the SSL model affect performance on the pronunciation assessment task.
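The abstract reports results in terms of the Pearson correlation coefficient between predicted and reference pronunciation scores. The following is a minimal sketch of how that metric can be computed, not the authors' evaluation code. The function name `pearson_correlation` and the score lists are hypothetical and chosen only for illustration; the numbers are not data from the paper.

```python
import math


def pearson_correlation(predicted, reference):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    # Sums of squared deviations (unnormalized variances).
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical scores, for illustration only (not data from the paper).
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
model_scores = [1.5, 1.8, 3.2, 3.9, 4.6]
pcc = pearson_correlation(model_scores, human_scores)
print(f"PCC = {pcc:.3f}")
```

A value close to 1 means the predicted scores rise and fall with the human ratings; the metric is insensitive to a constant offset or positive rescaling of the predictions.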


research
07/30/2023

Mispronunciation detection using self-supervised speech representations

In recent years, self-supervised learning (SSL) models have produced pro...
research
06/25/2023

Addressing Cold Start Problem for End-to-end Automatic Speech Scoring

Integrating automatic speech scoring/assessment systems has become a cri...
research
05/31/2023

Zero-Shot Automatic Pronunciation Assessment

Automatic Pronunciation Assessment (APA) is vital for computer-assisted ...
research
07/12/2020

TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech

We introduce a self-supervised speech pre-training method called TERA, w...
research
12/05/2019

Self-Supervised Contextual Language Representation of Radiology Reports to Improve the Identification of Communication Urgency

Machine learning methods have recently achieved high-performance in biom...
research
08/31/2023

RAMP: Retrieval-Augmented MOS Prediction via Confidence-based Dynamic Weighting

Automatic Mean Opinion Score (MOS) prediction is crucial to evaluate the...
research
05/29/2023

A Hierarchical Context-aware Modeling Approach for Multi-aspect and Multi-granular Pronunciation Assessment

Automatic Pronunciation Assessment (APA) plays a vital role in Computer-...
