Improving Non-native Word-level Pronunciation Scoring with Phone-level Mixup Data Augmentation and Multi-source Information

by Kaiqi Fu, et al.
ByteDance Inc.

Deep learning-based pronunciation scoring models rely heavily on annotated non-native data, which is costly to collect and difficult to scale. To deal with the data scarcity problem, data augmentation is commonly used for model pretraining. In this paper, we propose phone-level mixup, a simple yet effective data augmentation method, to improve the performance of word-level pronunciation scoring. Specifically, given a phoneme sequence from the lexicon, an artificial augmented word sample is generated by randomly sampling from the corresponding phone-level features in the training data, while the word score is the average of their GOP scores. Benefiting from arbitrary phone-level combination, the mixup is able to generate any word with various pronunciation scores. Moreover, we utilize multi-source information (e.g., MFCC and deep features) to further improve the scoring system performance. Experiments conducted on Speechocean762 show that the proposed system outperforms the baseline when mixup data is added for pretraining, with the Pearson correlation coefficient (PCC) increasing from 0.567 to 0.610. The results also indicate that the proposed method achieves similar performance using 1/10 of the unlabeled data required by the baseline. In addition, the experimental results demonstrate the effectiveness of the proposed multi-source approach.




1 Introduction

It is well known that the L2 learning process is heavily affected by a well-established habitual perception of phonemes and articulatory motions in the learners’ primary language (L1) [ellis1994study], which often causes mistakes and imprecise articulation in L2 learners’ speech production, e.g., negative language transfer [ellis1994study, meng2009developing]. As a feasible tool, computer assisted pronunciation training (CAPT) is often employed to automatically assess L2 learners’ pronunciation quality at different levels, e.g., phone-level [franco1997automatic, witt2000phone, zheng2007generalized, hu2015improved, shi2020context, li2017improving, leung2019cnn, yan2020end, feng2020sed, wei2009new], word-level [qian2012use, kyriakopoulos2020automatic, lee2013pronunciation, lee2016language, korzekwa2021weakly] and sentence-level [cincarek2009automatic, yu2015using, chen2018end, lin2020automatic, lin2021deep].

Over the last few decades, many methods have been proposed to evaluate non-native learners’ pronunciation quality at the aforementioned levels. Most of these methods can be divided into two major groups, namely one-step and two-step approaches. Methods in the former group directly utilize confidence scores [franco1997automatic, witt2000phone, zheng2007generalized, hu2015improved, shi2020context] derived from automatic speech recognition (ASR) systems to assess phone-level and sentence-level pronunciation quality, for example, the HMM-based phone log-likelihood [franco1997automatic] or log-posterior [witt2000phone, zheng2007generalized, hu2015improved, shi2020context]. Among these methods, Goodness of Pronunciation (GOP) [witt2000phone] is widely used in pronunciation scoring tasks. Given an acoustic segment of a phoneme with multiple frames, the corresponding GOP is defined as the normalized frame-level posterior probability over the segment.

In addition, the output phone sequence from ASR, e.g., end-to-end models [yan2020end, leung2019cnn, feng2020sed] or an extended recognition network [qian2012use, kyriakopoulos2020automatic], can be aligned with the given prompts, and a phone or word is marked as mispronounced if there is a mismatch between the two sequences. Alternatively, the two-step approaches treat pronunciation scoring or mispronunciation detection as a regression or classification task. Specifically, phone, word and sentence boundaries are first generated by forced alignment, and then either frame-level or segment-level pronunciation features within each boundary are fed into task-dependent classifiers or regressors (e.g., [hu2015improved, wei2009new, li2017improving, lee2013pronunciation, lee2016language, korzekwa2021weakly, cincarek2009automatic, lin2021deep, lin2020automatic, yu2015using, chen2018end]). Finally, the posterior probabilities or predicted values obtained from those models are used as pronunciation scores. Taking advantage of extra supervised training data, the two-step approach armed with deep learning techniques was reported to achieve better scoring performance than its counterparts.


Although the aforementioned two-step approaches have achieved satisfactory pronunciation scoring and mispronunciation detection results, their performance relies heavily on the number of labeled scoring samples, which is not always practical in real-world applications. Meanwhile, non-native data collection and labeling are costly and hard to scale [lee2016language]. Take the recently released, freely available dataset Speechocean762 [zhang2021speechocean762] as an example: only 5,000 sentences, 31,816 words and 95,078 phones have been assigned human scores. Another public dataset, L2-ARCTIC [zhao2018l2], contains 11,026 utterances including 97,000 words and 349,000 phone segments. The comparison in [nancy] shows that the largest non-native corpus contains 90,841 utterances, but it is not publicly available. To tackle the limited non-native data problem, data augmentation techniques have been investigated in previous work [lee2016language, korzekwa2020detection, lin2021deep]. In [korzekwa2020detection], Text-To-Speech (TTS) was used to synthesize “incorrect stress” samples on top of modified lexical stress. In [lee2016language], the authors created negative training samples by modifying the canonical text, so that the modified text forms a mismatched pair with its original word-level speech. More recently, the authors of [lin2021deep] collected a large amount of unlabeled non-native data for pretraining a sentence-level scorer with GOP scores. This avoids time-consuming labeling, but non-native data collection is still a challenging task.

Figure 1: The mixup data generation module.

Inspired by the recent success of mixup in image classification [mixup], this paper proposes a phone-level mixup to deal with limited word-level training data. Unlike an image, a word pronunciation can be explicitly defined as a sequence of phones. Hence, we are able to build a phone-level dataset from the limited word-level training samples. For each phone, a pool of phone-level features and corresponding GOP scores is first constructed from the training data. Then artificial word-level augmented features are formed by concatenating randomly sampled entries from the corresponding phone-level pools, where the word score is the average of the phone GOP scores. Leveraging arbitrary combinations of phone-level information, we can generate any word with various pronunciation scores to improve the model’s generalization ability. As reported in [li2017improving], deep features extracted from a non-native acoustic model are not discriminative enough due to the inconsistency of non-native phone labels. Therefore, we use both MFCC and deep features to further improve the scoring system.

2 Pronunciation Scoring Framework with Mixup

Our proposed scoring framework contains two modules: the mixup data generation module, which creates arbitrary words with various pronunciation scores, and the pronunciation scorer module, which maps a sequence of frame-level acoustic features into a word-level pronunciation score.

2.1 Mixup Data Generation

Mixup was first proposed in [mixup], where linear interpolation between original feature-target pairs in the pixel-based feature space was used to generate augmented data for image classification. Unlike the image data used in classification tasks, which have the same size and an inherent ordering, speech signals in word-level pronunciation scoring consist of different phoneme sequences with various durations, which makes it hard to interpolate word-level features directly. Instead, we apply mixup on a lower-level feature space, namely phone-level features. Given a phone sequence from the lexicon, it is easy to generate augmented word-level features by concatenating the corresponding phone-level data, while the word score is the average of the phone scores.

To perform mixup, we need to collect a pool of phone-level features and their corresponding GOP scores. Specifically, an acoustic model, trained with paired speech signals and transcriptions, is utilized for deep feature and GOP score extraction. As shown in Figure 1 (a), given a collection of non-native audio with canonical transcriptions, we first extract MFCC features from the speech signal. The acoustic model is used to force-align all non-native audio (pairs of MFCC features and canonical transcription) to get the start and end time boundaries of each phoneme. The MFCCs of each phoneme are then fed into the acoustic model to extract the corresponding deep features and frame-level posterior probabilities from the bottleneck layer [lin2021deep] and the output layer, respectively. In essence, the GOP score of a specific phoneme can be interpreted as the posterior probability normalized over the phoneme duration obtained by forced alignment. Finally, a pool of phone-level features and corresponding GOP scores is collected for each phoneme.
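The pool construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the GOP is taken here as the plain mean of the frame-level posteriors of the canonical phone (the classic GOP of [witt2000phone] is defined in the log domain, and the exact normalization used in this work may differ), and the names `phone_gop`, `build_pools` and the dictionary layout of an utterance are hypothetical.

```python
from collections import defaultdict


def phone_gop(frame_posteriors):
    """Duration-normalized posterior: mean over the phone's aligned frames."""
    if not frame_posteriors:
        raise ValueError("a phone segment needs at least one frame")
    return sum(frame_posteriors) / len(frame_posteriors)


def build_pools(utterances):
    """Group phone-level features and GOP scores by phoneme label."""
    pools = defaultdict(list)
    for utt in utterances:
        # Each alignment item is (phone, start_frame, end_frame), end exclusive.
        for phone, start, end in utt["alignment"]:
            pools[phone].append({
                "mfcc": utt["mfcc"][start:end],
                "deep": utt["deep"][start:end],
                "gop": phone_gop(utt["posterior"][start:end]),
            })
    return pools


# Toy utterance (assumed data): 5 frames with 1-dim features,
# force-aligned to /D/ (2 frames) and /EY/ (3 frames).
toy_utt = {
    "mfcc": [[0.1], [0.2], [0.3], [0.4], [0.5]],
    "deep": [[1.0], [1.1], [1.2], [1.3], [1.4]],
    "posterior": [0.2, 0.4, 0.9, 0.9, 0.9],
    "alignment": [("D", 0, 2), ("EY", 2, 5)],
}
pools = build_pools([toy_utt])
print(round(pools["D"][0]["gop"], 3), round(pools["EY"][0]["gop"], 3))
```

Keeping the MFCC frames, deep-feature frames and GOP score of a segment together in one pool entry means that a later sampling step always draws features and score from the same real phone realization.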

The constructed pools are then used for mixup as shown in Figure 1 (b). First, a word and its phoneme sequence are sampled from the lexicon based on the word frequency in the training set. For each phoneme in the sequence, we randomly sample one quadruplet from the pool of the corresponding phoneme. In the example shown in Figure 1, given a word sampled from the lexicon, we first obtain its phoneme sequence /D EY/. Then, for the phonemes /D/ and /EY/, the GOP scores and corresponding features are randomly sampled from the pools of /D/ and /EY/, respectively. Finally, the features of the generated word are formed by concatenating the features of /D/ and /EY/. The word score of 0.6 is obtained by averaging the corresponding GOP scores, 0.3 and 0.9: (0.3 + 0.9) / 2 = 0.6.
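The sampling and concatenation steps can be illustrated with the following Python sketch, which reproduces the /D EY/ example with scores 0.3 and 0.9. It is a simplified illustration under assumptions: the function names, the pool entry layout, the word "day" and its count are hypothetical and chosen only to make the example runnable.

```python
import random


def sample_word(lexicon, word_counts, rng):
    """Draw a word with probability proportional to its training-set count."""
    words = list(lexicon)
    weights = [word_counts[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def mixup_word(phones, pools, rng):
    """Build one artificial word sample from randomly drawn phone entries."""
    picks = [rng.choice(pools[p]) for p in phones]
    return {
        "phones": list(phones),
        # Concatenate the frame sequences of the drawn phone segments.
        "mfcc": [frame for entry in picks for frame in entry["mfcc"]],
        "deep": [frame for entry in picks for frame in entry["deep"]],
        # Word score = average of the sampled phone GOP scores.
        "score": sum(entry["gop"] for entry in picks) / len(picks),
    }


# Assumed toy data: one pool entry per phoneme, 1-dim features.
pools = {
    "D": [{"mfcc": [[0.1]], "deep": [[0.2]], "gop": 0.3}],
    "EY": [{"mfcc": [[0.3], [0.4]], "deep": [[0.5], [0.6]], "gop": 0.9}],
}
lexicon = {"day": ["D", "EY"]}   # hypothetical lexicon entry
word_counts = {"day": 3}         # hypothetical training-set frequency

rng = random.Random(0)
word = sample_word(lexicon, word_counts, rng)
sample = mixup_word(lexicon[word], pools, rng)
print(word, len(sample["mfcc"]), round(sample["score"], 3))
```

With larger pools, repeated calls to `mixup_word` for the same word yield different feature sequences and different scores, which is what allows a small amount of real data to produce many distinct training samples.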

Compared to [lin2021deep], the proposed mixup is able to generate as many artificial word samples as needed from a small amount of task-related unlabeled data. Unlike [lee2016language, korzekwa2020detection], where the labels of generated training samples are binary (e.g., correct or mispronounced), the proposed mixup can generate continuous scores for the augmented data.

2.2 Pronunciation Scorer

The word pronunciation scorer is designed to map acoustic features to a word score. It is trained in two steps: (1) pretraining with unlabeled and mixup-augmented data and (2) fine-tuning with human-labeled data.

During pretraining, two sets of data are used for model training: task-related unlabeled data and augmented data generated by mixup (as described in Section 2.1). Given a word with a sequence of frame-level deep features $\{x^{d}_{1}, x^{d}_{2}, \ldots, x^{d}_{T}\}$, we first apply a nonlinear transformation to each of them, and then add the corresponding canonical phone embeddings to obtain a sequence of frame-level phonetic features, which is subsequently fed into the 1D-CNN [cnn] module. Then a mean-over-time pooling operation is applied over the feature map obtained by the 1D-CNN module to get the word-level representation $h^{d}$. Similarly, $h^{m}$ can be obtained by applying the same operations to the MFCC features. Finally, their concatenation is fed into Eq. (1) to predict the word pronunciation score $\hat{s}$:

$$\hat{s} = \mathrm{sigmoid}\big(\mathrm{MLP}([h^{d}; h^{m}])\big) \tag{1}$$

where $[\cdot\,;\cdot]$ denotes the concatenation of two vectors. The mean square error (MSE) is used as the training objective for model updating, which is defined as:

$$\mathcal{L}_{pre} = \frac{1}{N}\sum_{i=1}^{N}\big(\hat{s}_{i} - g_{i}\big)^{2} \tag{2}$$

where $N$ denotes the total number of word samples in the pretraining data, $\hat{s}_{i}$ is the predicted score of the $i$-th word, and $g_{i}$ is the word GOP score of the $i$-th word.

At the fine-tuning stage, we adjust the pretrained pronunciation scorer with a small amount of human-labeled data. Similar to the process in pretraining, the scorer input consists of MFCC, deep features and the corresponding phoneme sequence, and the word pronunciation score is obtained by Eq. (1).

Unlike pretraining, at the fine-tuning stage the score $y_{i}$ assigned to the $i$-th word by human raters is used to supervise the scorer updating. Eq. (2) can be rewritten as:

$$\mathcal{L}_{ft} = \frac{1}{M}\sum_{i=1}^{M}\big(\hat{s}_{i} - y_{i}\big)^{2} \tag{3}$$

where $M$ denotes the total number of word samples in the fine-tuning data.
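The pooling, concatenation and MSE objective described in this section can be sketched in plain Python as follows. This is an illustrative simplification, not the paper's model: a single linear layer followed by a sigmoid stands in for the MLP, the 1D-CNN feature maps are given as toy lists, and all function names and values are assumptions. The same `mse` function serves both stages; only the targets change (word GOP scores in pretraining, human scores in fine-tuning).

```python
import math


def mean_over_time(feature_map):
    """Average a T x C feature map over time, giving a length-C vector."""
    n_frames = len(feature_map)
    return [sum(channel) / n_frames for channel in zip(*feature_map)]


def predict_score(deep_map, mfcc_map, weights, bias):
    """Concatenate the two pooled vectors and map them to a score in (0, 1)."""
    h = mean_over_time(deep_map) + mean_over_time(mfcc_map)  # list concatenation
    z = sum(w * x for w, x in zip(weights, h)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def mse(predicted, targets):
    """Mean squared error between predicted and target word scores."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)


# Toy feature maps (assumed values): 2 frames x 2 channels per stream.
deep_map = [[1.0, 2.0], [3.0, 4.0]]
mfcc_map = [[0.0, 1.0], [2.0, 3.0]]
score = predict_score(deep_map, mfcc_map, weights=[0.0, 0.0, 0.0, 0.0], bias=0.0)

pretrain_loss = mse([score], [0.6])   # target: word GOP score
finetune_loss = mse([score], [0.7])   # target: human-assigned score
print(score, pretrain_loss, finetune_loss)
```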

3 Experimental setup

3.1 Speech Corpus

The ASR training corpus consists of 5,000 hours of speech data, including 960 hours of native speech from the LibriSpeech corpus [librispeech] and 4,000 hours of non-native private recordings from ByteDance. The pronunciation scoring data consists of 5,000 English utterances spoken by 250 non-native learners from Speechocean762. The pronunciation score of each word has been labeled by five experts; following the score files that come with the database, the median scores were adopted, ranging from 0 to 10. Following the calculation method of a previous study [nancypcc], the averaged inter-rater agreement is 0.726. In addition, linguistic experts at ByteDance collected a small amount of task-related unlabeled data (e.g., a group of Chinese adults were asked to read aloud given English prompts). It is used in the proposed mixup for word-level augmented data generation. The statistics of the pronunciation scoring data are detailed in Table 1.

| | Task-related unlabeled data | Speechocean762 Train | Speechocean762 Test |
|---|---|---|---|
| # of words | 50,000 | 15,849 | 15,967 |
| # of phones | 168,472 | 47,390 | 47,688 |

Table 1: Data statistics for pronunciation scorer.

3.2 Experimental configuration

A DFSMN-HMM [dfsmn] based ASR model was adopted, which consists of 2 convolutional layers, 24 FSMN layers, a bottleneck layer and a feedforward layer. The model input is a 39-dimensional MFCC feature extracted using a 25 ms Hamming window with a 10 ms shift. The softmax output layer has 5,432 units, representing the senone labels derived from forced alignment with a GMM-HMM system. The 512-dimensional deep features are extracted from the output of the bottleneck layer, and the phoneme GOP scores are calculated from frame-level senones and their corresponding posteriors. The frame accuracy of the ASR system is 73%.

| # | Layer | Description | Output size |
|---|---|---|---|
| 1 | Input | a sequence of phonetic features | T × 32 |
| 2 | 1D-Convolution | kernel_size=3, stride=1, 32 filters | T × 32 |
| 3 | Batch normalization | | T × 32 |
| 4 | ReLU | | T × 32 |
| 5 | Dropout | probability: 0.1 | T × 32 |
| 6 | 1D-Convolution | kernel_size=3, stride=1, 32 filters | T × 32 |
| 7 | Batch normalization | | T × 32 |
| 8 | ReLU | | T × 32 |
| 9 | Dropout | probability: 0.1 | T × 32 |
| 10 | Max pooling | kernel_size=2, stride=2 | T' × 32 |
| 11 | 1D-Convolution | kernel_size=1, stride=1, 32 filters | T' × 32 |
| 12 | Batch normalization | | T' × 32 |
| 13 | ReLU | | T' × 32 |
| 14 | Dropout | probability: 0.1 | T' × 32 |
| 15 | Mean pooling | average over time | 32 × 1 |
| 16 | Feature concatenation | mfcc + deep feature | 64 × 1 |
| 17 | Fully connected | output size: 32 | 32 × 1 |
| 18 | Fully connected | output size: 1 | 1 × 1 |
| 19 | Sigmoid | predicted word score | 1 × 1 |

Table 2: The architecture of the proposed 1D-CNN-MLP, where T' = T/2, and T is the frame length of one word.

The multi-source scoring model takes sequences of frame-level 39-dim MFCC and 512-dim deep features as its input. Nonlinear transformations are applied to these features to output frame-level 32-dim feature vectors, which are added to 32-dim reference phone embeddings to get phonetic features. Subsequently, we use the 1D-CNN and MLP modules to map the phonetic features into a word score. The detailed configuration is listed in Table 2. The Adam optimizer is used to minimize the MSE, with the learning rate set to 0.002. The word scores of both augmented and human-labeled data are scaled to the interval from 0 to 1 for model training and evaluation.
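As a small illustration of how a human-labeled training target could be prepared, the sketch below takes the median of five expert scores on the 0 to 10 scale and rescales it to the interval from 0 to 1. Dividing by the maximum score is an assumption about the scaling, and the function name and rater scores are hypothetical.

```python
from statistics import median


def word_label(rater_scores, max_score=10.0):
    """Median of the expert scores, rescaled from [0, max_score] to [0, 1]."""
    return median(rater_scores) / max_score


# Five hypothetical expert scores for one word.
label = word_label([6, 8, 7, 9, 7])
print(label)
```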

3.3 System Setup

To validate the proposed ideas, we compared the following systems, treating the state-of-the-art method as our baseline.

  • No-pretrain: the scoring model is trained on the human-labeled data directly. No pretraining process is applied;

  • Real-pretrain: we calculate the GOP score of each word in the task-related unlabeled data and use it to pretrain the word-level scorer, which is then fine-tuned with the labeled data in the training set shown in Table 1. This pretrain and fine-tune strategy achieved the highest PCC for sentence scoring in the recent study [lin2021deep], hence we apply it to word-level scoring and treat it as a strong baseline;

  • Mixup-pretrain: the pretraining is first performed on the task-related unlabeled data and the artificial data generated by mixup. The model is then fine-tuned with the labeled data in the training set.

For a fair comparison, the baselines and the proposed system share the same ASR and scoring model structures.

4 Results and analyses

To assess the system performance, the Pearson correlation coefficient (PCC) between machine predicted scores and human scores was used in our experiments. A higher PCC indicates a better system performance.
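For reference, the PCC between two score lists can be computed as in the following Python sketch. The function name and the toy score lists are illustrative assumptions; in practice a library routine such as `scipy.stats.pearsonr` would typically be used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy scores (assumed values): machine predictions vs. human labels.
machine = [0.2, 0.5, 0.9, 0.4]
human = [0.3, 0.5, 0.8, 0.6]
pcc = pearson(machine, human)
print(round(pcc, 3))
```

Because the PCC is invariant to shifting and positive rescaling of either list, it measures how well the machine scores track the human scores rather than whether the two share the same numeric range.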

4.1 Validate the Configurations of Proposed Method

In this section, we examine two configurations of the proposed method: the effect of the mixup data size and the effect of multi-source information on predicting word-level pronunciation quality.

Figure 2: Performance with different amounts of augmented data. The augmented data consists of both real unlabeled data (50k words) and artificial data generated with mixup. For example, an augmented data size of 500k consists of 50k real unlabeled words and 450k mixup words. An augmented data size of 50k indicates no mixup data.

| Methods | Real unlabeled data | PCC |
|---|---|---|
| No-pretrain | - | 0.507 |
| Real-pretrain | 50k | 0.567 |
| | 100k | 0.570 |
| | 200k | 0.586 |
| | 500k | 0.612 |
| Mixup-pretrain | 50k | 0.610 |

Table 3: Comparison of different methods. Mixup-pretrain is pretrained with 50k real unlabeled words and 450k artificial augmented words created by mixup.

4.1.1 Effect of Mixup Data Size

We first compare the performance of word-level scoring over different sizes of augmented data, which consists of real unlabeled data (50k words) and artificial data generated with mixup, giving total sizes of 50k, 100k, 200k, 500k and 1,000k words. Figure 2 presents the PCC results for the different amounts of augmented data. The system performance improves as the augmented data size increases for all three feature types (MFCC, deep and multi-source). When the augmented data size reaches 500k, the best performance is achieved, with a PCC of 0.610. Beyond that, the PCC plateaus for the multi-source and deep features. Therefore, we set the augmented data size to 500k words (50k unlabeled data + 450k mixup data) in the rest of the experiments.

4.1.2 Effect of Multi-Source Information for Scoring

Then, we examined word-level pronunciation scoring performance with different inputs, i.e., MFCC, deep and multi-source features. As shown in Figure 2, the deep feature consistently outperforms MFCC for pronunciation scoring in all tests. This observation confirms that the deep feature is more discriminative than MFCC because its feature extractor is trained on a large amount of additional non-native data. The performance is further improved by the multi-source feature, which utilizes both MFCC and deep features as system input. The results imply that MFCC and deep features are complementary to each other, and that their combination in the hidden space is helpful for pronunciation scoring.

4.2 Comparison with Baselines

In this section, we validate the proposed mixup by comparing it against the baselines. To focus on the comparison of different pretraining methods, all systems take the multi-source feature as their input. Table 3 presents the PCC results of the different methods. Benefiting from the real unlabeled non-native data, all the real- and mixup-pretrain methods outperform the no-pretrain counterpart (PCC of 0.507). This confirms the effectiveness of data augmentation and pretraining for pronunciation scoring.

Then we compare the proposed mixup-pretrain against the real-pretrain baseline. Since the task is defined in a low-resource scenario, we fix the size of real unlabeled data at 50k words for the proposed method. The proposed mixup-pretrain method performs better than the real-pretrain baseline trained with the same 50k real unlabeled words (PCC of 0.610 vs. 0.567). To find out how much real data the real-pretrain baseline needs to match the proposed mixup-pretrain, we increase the amount of real unlabeled data from 50k to 500k. Table 3 shows that the system performance improves as the unlabeled data size increases. The real-pretrain baseline achieves similar performance to the proposed method (0.612 vs. 0.610) only when using 500k real unlabeled words, which is 10 times the amount used by the proposed method. This suggests that the proposed mixup method offers a clear advantage over the baseline under limited data conditions.

5 Conclusion

In this paper, phone-level mixup is proposed to make full use of smaller units (e.g., phones) to create arbitrary bigger units (e.g., words) with various pronunciation scores. Experiments conducted on Speechocean762 show that the proposed augmented data is complementary to real non-native data for training a word-level pronunciation scorer. Moreover, feature combination under the proposed training framework brings extra improvement. In the future, we will extend the current phone-level mixup to word-level mixup for improving sentence-level scoring. We will also investigate whether similar improvement can be achieved when the proposed data augmentation method is applied to attention-based pronunciation scorers [korzekwa2021weakly, lin2020automatic].