Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring

05/19/2023
by   Kaiqi Fu, et al.

Speech fluency/disfluency can be evaluated by analyzing a range of phonetic and prosodic features. Deep neural networks are commonly trained to map fluency-related features to human scores. However, the effectiveness of deep learning-based models is constrained by the limited number of labeled training samples. To address this, we introduce a self-supervised learning (SSL) approach for fluency scoring that is aware of both phonetic and prosodic information. Specifically, we first pre-train the model with a reconstruction loss, jointly masking phones and their durations on a large amount of unlabeled speech and text prompts. We then fine-tune the pre-trained model on human-annotated scoring data. Experiments on datasets including Speechocean762 and our own non-native datasets show that the proposed method outperforms the baseline systems in terms of the Pearson correlation coefficient (PCC). We also conduct an ablation study to better understand the contribution of the phonetic and prosodic factors during the pre-training stage.
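
As a rough illustration of two ideas named in the abstract, the sketch below masks the same positions in a phone sequence and its aligned duration sequence, and computes the Pearson correlation coefficient used for evaluation. It is not the authors' implementation: the mask token, the duration placeholder, the mask ratio, the use of frame counts as durations, and the function names `mask_jointly` and `pearson` are all assumptions made for this example.

```python
import math
import random

MASK_PHONE = "<mask>"   # hypothetical mask token, not taken from the paper
MASK_DURATION = -1      # hypothetical placeholder for a masked duration


def mask_jointly(phones, durations, mask_ratio=0.15, rng=None):
    """Mask the same positions in the phone and duration sequences.

    Returns the masked phones, the masked durations, and the sorted masked
    positions, which are where a reconstruction loss would be computed.
    The default mask_ratio is an assumption, not a value from the paper.
    """
    if len(phones) != len(durations):
        raise ValueError("phones and durations must align one-to-one")
    rng = rng or random.Random()
    # Mask at least one position whenever the sequence is non-empty.
    n_mask = max(1, round(mask_ratio * len(phones))) if phones else 0
    positions = sorted(rng.sample(range(len(phones)), n_mask))
    masked_phones = list(phones)        # copies, so the inputs stay intact
    masked_durations = list(durations)
    for i in positions:
        masked_phones[i] = MASK_PHONE
        masked_durations[i] = MASK_DURATION
    return masked_phones, masked_durations, positions


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy input: phones with durations in frames (illustrative values only).
    phones = ["HH", "AH", "L", "OW"]
    durations = [5, 7, 6, 12]
    print(mask_jointly(phones, durations, mask_ratio=0.5, rng=random.Random(0)))
    # Toy predicted vs. human fluency scores (illustrative values only).
    print(pearson([3.0, 4.5, 2.0, 5.0], [3.5, 4.0, 2.5, 4.5]))
```

In an actual system, the masked positions would be the reconstruction targets for a neural encoder during pre-training, and `pearson` would compare model-predicted fluency scores with human ratings on a held-out set.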
