Mispronunciation Detection and Correction via Discrete Acoustic Units

08/12/2021 ∙ by Zhan Zhang, et al. ∙ Zhejiang University

Computer-Assisted Pronunciation Training (CAPT) plays an important role in language learning. However, conventional CAPT methods cannot effectively use non-native utterances for supervised training because the ground truth pronunciation needs expensive annotation. Meanwhile, certain undefined non-native phonemes cannot be correctly classified into standard phonemes. To solve these problems, we use the vector-quantized variational autoencoder (VQ-VAE) to encode the speech into discrete acoustic units in a self-supervised manner. Based on these units, we propose a novel method that integrates both discriminative and generative models. The proposed method can detect mispronunciation and generate the correct pronunciation at the same time. Experiments on the L2-Arctic dataset show that the detection F1 score is improved by 9.58% relatively compared with ASR-based methods. The proposed method also achieves a comparable word error rate (WER) and the best style preservation for mispronunciation correction compared with text-to-speech (TTS) methods.







I Introduction

Computer-Assisted Pronunciation Training (CAPT) is an important technology that offers a flexible education service for second language (L2) learners. In the pronunciation training process, the learner is asked to read a pre-defined text, and the CAPT system should detect mispronunciations in the speech and give proper feedback. The common approach for CAPT is to compare the pronounced speech with the standard pronunciation distribution; if the deviation is too large, the pronunciation is judged as an error.

On the one hand, directly comparing with standard speech is an intuitive approach to mispronunciation detection. For example, [8, 9, 7] align the student utterance with the teacher utterance via Dynamic Time Warping (DTW), and misalignment extracted from the DTW path is used to detect mispronunciation. However, such methods have two main shortcomings. First, teacher utterances are expensive to prepare, although with the rapid development of deep learning, text-to-speech (TTS) technologies [11, 16, 3, 15] can be applied to generate them; nevertheless, how to make the generated speech more suitable for alignment and mispronunciation detection remains to be explored. Second, the feature used for alignment needs to be invariant to speaking style; otherwise, the alignment may fail.

On the other hand, to extract content-related and style-invariant features, other methods use features extracted by an automatic speech recognition (ASR) model. Note that the ASR model is trained on utterances collected from native (L1) speakers, so these methods in fact compare the input speech to learned knowledge of standard pronunciation. For example, the goodness-of-pronunciation (GOP) [23] uses the posterior probability of pre-defined phonemes to judge correctness. In recent years, the performance of ASR has improved greatly, so recognizing the pronounced phonemes and comparing them to the pre-defined phonemes is also a valid approach; this method is used by [27, 25, 28] to simplify the workflow. However, it is hard for these ASR-based methods to train on L2 utterances. First, to utilize L2 utterances, the text annotations must be the phonemes actually pronounced by the speaker rather than the phonemes of the pre-defined text, and such annotations are expensive to obtain. Second, ASR-based methods generally use only the standard phonemes (for example, the 44 phonemes in English) for recognition and comparison. However, as analyzed in [2, 19], English learners from different language backgrounds may show acoustic characteristics similar to their mother tongues. Thus, L2 utterances may include undefined phonemes (we call them L2 phonemes in this paper) that cannot be properly classified into the standard phonemes.

To solve the aforementioned limitations, we propose to encode both L1 and L2 utterances into discrete acoustic units (referred to as "codes" in the following discussion for simplicity) using the vector-quantized variational autoencoder (VQ-VAE) [20, 21]. This training process is self-supervised, so L2 acoustic features can be properly modeled without expensive annotations. Next, we utilize the encoded L1 code sequence and other distracting code sequences to simulate mispronunciations and train a correction model that can discriminate error codes and revise them to the correct ones given the pre-defined text. By decoding the corrected code sequence, we can also obtain the corrected speech while keeping the style of the speaker. As analyzed by [26, 14, 4], the more similar the teacher's speech is to the speaking style of the learner, the more positive the impact on pronunciation training. Thus, the corrected speech can be powerful for education.

Our contributions can be summarized as follows.

  • We propose to encode the speech into discrete acoustic units via VQ-VAE for mispronunciation detection and correction. The proposed self-supervised encoding method can better model L2 features without expensive annotations.

  • By discriminating error codes and generating the correct ones, the proposed method not only detects mispronunciations but also generates the correct pronunciation while keeping the speaking style. To the best of our knowledge, this is the first approach to perform mispronunciation detection and correction simultaneously. Such feedback is powerful for learners seeking to improve their speaking skills.

  • We conduct experiments on the L2-Arctic dataset [29]. Experiments show that the detection F1 score is improved by 9.58% relatively compared with ASR-based methods. The proposed method also achieves a comparable word error rate (WER) and the best style preservation for mispronunciation correction compared with TTS-based methods. (Audio samples: https://zju-zhan-zhang.github.io/mispronunciation-d-c/)

II Proposed Method

Fig. 1: Structure of the proposed VQ-VAE. The spectrum is encoded into codes by the VQ encoder. Then, the spectrum is reconstructed using the encoded codes and the speaker information (X-vector).
Fig. 2: Workflow of the proposed method. Spectrums are encoded into codes for further processing. The ground truth L1 codes are corrupted by the distracting codes. The code corrector is trained to predict the original L1 codes and the error mask based on the corrupted codes and the phonemes. For inference, the corrected codes are converted back to spectrums using the code decoder.

II-A Acoustic unit encoding via VQ-VAE

We adopt VQ-VAE [21] for spectrum encoding as illustrated in Fig. 1. The VQ encoder is constructed by Conformer [5] encoder layers, a time-domain down-sampling convolutional (Conv) layer, and a Gumbel-Softmax VQ layer [6, 12]. The raw waveform is first converted to the log-Mel spectrum X, which goes through the Conformer encoder layers and the down-sampling layer for feature extraction:

h = \mathrm{Conv}(\mathrm{ConformerEnc}(X)).

Then, the VQ layer transforms the encoded feature h \in \mathbb{R}^{T \times d_{att}} (d_{att} is the attention dim of the Conformer layer) to the logit l \in \mathbb{R}^{T \times V} (V is the number of codebook entries) using a linear layer. The probability of choosing the v-th codebook embedding is

p_v = \frac{\exp((l_v + n_v)/\tau)}{\sum_{k=1}^{V} \exp((l_k + n_k)/\tau)},

where \tau is the Gumbel-Softmax temperature and n_v = -\log(-\log(u_v)) is Gumbel noise, with u_v uniformly sampled from \mathcal{U}(0, 1). During the forward pass, the index of the chosen codebook entry is

c = \arg\max_v p_v.

In the backward pass, the true gradient of the Gumbel-Softmax output is used.
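The selection step above can be sketched numerically. Below is a minimal NumPy sketch of Gumbel-Softmax sampling for a single frame; the logits, temperature, and codebook size are illustrative, not taken from the paper, and the straight-through backward pass is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_probs(logits, tau=1.0):
    # Sample Gumbel noise n_v = -log(-log(u_v)), u_v ~ U(0, 1)
    u = np.clip(rng.uniform(size=logits.shape), 1e-12, 1 - 1e-12)
    n = -np.log(-np.log(u))
    y = (logits + n) / tau
    y = y - y.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    p = np.exp(y)
    return p / p.sum(axis=-1, keepdims=True)

# One frame with V = 4 codebook entries (logit values are illustrative)
logits = np.array([[2.0, 0.5, -1.0, 0.0]])
p = gumbel_softmax_probs(logits, tau=0.5)
code_index = int(p.argmax(axis=-1)[0])      # forward pass: hard index c
```

Lower temperatures make the sampled distribution closer to a hard one-hot choice, which is why τ is typically annealed during training.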

The code decoder in Fig. 1 uses a transposed convolutional (TConv) layer for time-domain up-sampling and Conformer decoder layers to recover the original spectrum from the encoded code embedding sequence. The code sequence is merged with the speaker information; we use the X-vector [17] extracted from the input spectrum as the speaker information:

\hat{X} = \mathrm{ConformerDec}(\mathrm{TConv}(e) + \mathrm{XVector}(X)),

where e is the embedding sequence of the chosen codes.
On the one hand, the whole VQ-VAE must preserve information from the encoder to reconstruct the original spectrum. On the other hand, the VQ layer imposes an information bottleneck to force the encoder to discard non-essential details. As the style-related speaker information is directly offered to the decoder, the encoder will focus on extracting acoustic features that correlate more with the content.

We apply two loss functions to train the proposed VQ-VAE. First, to reconstruct the original spectrum X, we apply the mean square error (MSE) loss between X and \hat{X}. Second, to encourage codebook usage, as inspired by [1], a diversity loss L_{div} is also applied to increase the information entropy of the code probabilities p. The final loss is defined as

L = L_{MSE} + \alpha L_{div},

where \alpha is the weight of the diversity loss, which we set empirically in our experiments.
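One common form of the diversity loss, used in wav2vec 2.0 [1], maximizes the entropy of the codebook distribution averaged over frames. A minimal sketch, assuming this batch-averaged-entropy form (the toy distributions are illustrative):

```python
import numpy as np

def diversity_loss(p):
    """p: (N, V) soft codebook probabilities for N frames.

    Encourages uniform codebook usage by maximizing the entropy of the
    batch-averaged distribution; returned negated so it can be minimized.
    """
    p_bar = p.mean(axis=0)                          # average usage per entry
    entropy = -np.sum(p_bar * np.log(p_bar + 1e-9))
    return -entropy                                  # minimize => maximize entropy

uniform = np.full((8, 4), 0.25)                      # perfectly spread usage
peaky = np.tile(np.array([0.97, 0.01, 0.01, 0.01]), (8, 1))  # codebook collapse
```

A collapsed codebook (all frames choosing one entry) gets a higher loss than uniform usage, pushing the encoder to spread codes across the codebook.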

II-B Non-autoregressive Code Correction

After the training of the proposed VQ-VAE converges, we freeze the model parameters and encode both L1 and L2 utterances into discrete codes. As illustrated in Fig. 2, to simulate mispronunciations, the ground truth L1 codes are mixed with other distracting codes. Formally, for the L1 code sequence c, we randomly replace segments of its original codes with distracting codes c_d. In practice, we use a mask sequence m (whose initial values are 0) and set the replaced segments to 1. Thus, the corrupted code sequence can be denoted as

\tilde{c} = c \odot (1 - m) + c_d \odot m,

where \odot is the element-wise product. We sample c_d from L2 utterances and from other utterances read by the same speaker of c. The number of distracting replacements per code sequence is fixed, and the replacement length and start position are uniformly sampled.
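The corruption procedure can be sketched as follows; the segment count and length bounds here are illustrative placeholders, since the paper's exact sampling ranges are not recoverable from this text:

```python
import numpy as np

rng = np.random.default_rng(1)

def corrupt(codes, distractors, n_segments=2, max_len=3):
    """Replace random segments of `codes` with `distractors`;
    return the corrupted sequence and the 0/1 error mask m."""
    codes = codes.copy()
    mask = np.zeros_like(codes)
    T = len(codes)
    for _ in range(n_segments):
        seg_len = rng.integers(1, max_len + 1)          # uniform segment length
        start = rng.integers(0, T - seg_len + 1)        # uniform start position
        codes[start:start + seg_len] = distractors[start:start + seg_len]
        mask[start:start + seg_len] = 1
    return codes, mask

c = np.arange(10)            # toy L1 code sequence
c_d = np.full(10, 99)        # toy distracting codes
c_tilde, m = corrupt(c, c_d)
```

The corrector is then trained on (c_tilde, m) pairs to recover c and predict m, so no human error annotations are required.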

We adopt the Transformer [22] structure to predict both the original code sequence and the error mask from the phonemes and the corrupted code sequence. The text encoder (TextEnc) uses a front Conv layer and a Transformer encoder to encode the phoneme sequence y. The code corrector is constructed by a front Conv layer, a Transformer decoder, and two output layers for the code and error mask predictions:

(\hat{c}, \hat{m}) = \mathrm{Corrector}(\tilde{c}, \mathrm{TextEnc}(y)).

The loss function is defined as the classification loss between \hat{c} and c, and between \hat{m} and m, using the cross-entropy (CE) loss and the binary cross-entropy (BCE) loss, respectively:

L = \mathrm{CE}(\hat{c}, c) + \mathrm{BCE}(\hat{m}, m).

Fig. 3: Alignment between the error mask prediction \hat{m} and the phoneme-level error prediction e. This sample reads "A maddening joy pounded in his brain". The mispronounced phonemes are marked in red.

II-C Mispronunciation Correction and Detection

During inference, the input speech is first encoded into the discrete code sequence. Then, the code corrector decides whether each code matches the standard L1 distribution conditioned on the input phonemes. Since we use the L1 code sequence as the training target in Eq. (8), codes that only appear in L2 utterances or deviate far from the corresponding phoneme are detected and replaced with the correct ones. The predicted code sequence is further passed to the code decoder to obtain the spectrum. Finally, the spectrum is converted to the corrected speech waveform by a vocoder.

Note that \hat{m} is the error prediction for each code. We use the attention map of the last decoder layer to align the code sequence to the phoneme sequence. Formally, if we define the attention weight between the j-th phoneme and the i-th predicted error mask \hat{m}_i as a_{i,j}, the error prediction for the j-th phoneme is

e_j = \sum_i a_{i,j} \hat{m}_i.

A sample result is shown in Fig. 3. When e_j is larger than the threshold \delta, we mark this phoneme as a mispronunciation.
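The attention-weighted aggregation above reduces to a matrix product. A small NumPy sketch, with an idealized hard alignment and illustrative mask probabilities and threshold:

```python
import numpy as np

def phoneme_errors(attn, mask_pred, threshold=0.5):
    """attn: (T_codes, N_phonemes) attention weights a_{i,j},
    each row summing to 1; mask_pred: (T_codes,) per-code error
    probabilities m_hat. Returns per-phoneme scores e_j and decisions."""
    e = attn.T @ mask_pred            # e_j = sum_i a_{i,j} * m_hat_i
    return e, e > threshold

# 4 codes aligned to 2 phonemes: codes 0-1 -> phoneme 0, codes 2-3 -> phoneme 1
attn = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 1.0]])
m_hat = np.array([0.0, 0.1, 0.9, 0.8])   # codes of phoneme 1 look wrong
e, decisions = phoneme_errors(attn, m_hat)
```

Here only the second phoneme crosses the threshold and would be flagged as mispronounced.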

III Experiments

III-A Model Details

We use Librispeech [13] as the L1 dataset and L2-Arctic [29] as the L2 dataset. Annotated L2-Arctic utterances are kept as the test set, and the others are combined with Librispeech utterances to train the proposed VQ-VAE. We convert the 24 kHz raw waveform into the 80-dim log-Mel spectrum for experiments. The detailed parameters are listed in Table I, where d_{att} is the attention dim, d_{ff} is the feed-forward dim, n_{head} is the number of attention heads, and k is the kernel size.

We train both VQ-VAE and the code corrector until the training loss converges. We use the Adam optimizer with the warm-up learning scheduler.

Model Description
Spectrum , ,
VQ Encoder
  Conformer Enc*3 , , ,
  Conv Layer ,
  VQ Layer , ,
Code Decoder
  TConv Layer ,
  Conformer Dec*3 , , ,
Text Encoder
  Conv Layer*2 ,
  Transformer Enc*6 , ,
Code Corrector
  Conv Layer*2 ,
  Transformer Dec*6 , ,
TABLE I: Model Details
Model              | FAR   | FRR   | ACC   | PRE   | REC   | F1
GMM-HMM [29]       | -     | -     | -     | 0.290 | 0.290 | 0.290
CTC-Attention [10] | 0.475 | 0.204 | 0.757 | 0.305 | 0.525 | 0.386
Proposed           | 0.468 | 0.184 | 0.774 | 0.330 | 0.532 | 0.407
Proposed           | 0.428 | 0.193 | 0.773 | 0.336 | 0.572 | 0.423
Proposed           | 0.419 | 0.224 | 0.748 | 0.307 | 0.581 | 0.401
TABLE II: Mispronunciation Detection Results
Model                   | WER (%) | Similarity
Raw                     | 17.63   | 1.00 ± 0.00
Fastspeech2 (w/o Style) | 9.08    | 0.53 ± 0.02
Fastspeech2 (w/ Style)  | 12.11   | 0.69 ± 0.01
Proposed                | 12.98   | 0.86 ± 0.02
TABLE III: Mispronunciation Correction Results

III-B Mispronunciation Detection Results

For mispronunciation detection, the model should strike a balance between detecting mispronunciations and accepting correct pronunciations. Thus, we use the F1 score as the final metric for this task. In addition to accuracy (ACC), precision (PRE), and recall (REC), we also report the false rejection rate (FRR) and the false acceptance rate (FAR).
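These metrics all derive from the phoneme-level confusion counts. A sketch, treating "mispronounced" as the positive class (the counts below are purely illustrative):

```python
def detection_metrics(tp, fp, tn, fn):
    """Mispronunciation detection metrics from confusion counts.
    FRR: share of correct phonemes wrongly flagged as errors;
    FAR: share of true mispronunciations wrongly accepted (= 1 - REC)."""
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * pre * rec / (pre + rec)
    frr = fp / (fp + tn)   # false rejections among correct pronunciations
    far = fn / (fn + tp)   # false acceptances among mispronunciations
    return {"ACC": acc, "PRE": pre, "REC": rec, "F1": f1, "FRR": frr, "FAR": far}

m = detection_metrics(tp=40, fp=60, tn=240, fn=20)
```

Under this convention FAR and REC always sum to one, which is why lowering the detection threshold trades a lower FAR for a higher FRR.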

As suggested by [10], a phoneme-level ASR system trained on Librispeech is set as the baseline. This baseline recognizes the spoken phonemes and then aligns them to the target phonemes using the Needleman-Wunsch algorithm [18] to determine mispronunciations. We show the results in Table II. Results of the GOP-based Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) from [29] are also displayed.
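To make the baseline's alignment step concrete, here is a compact sketch of Needleman-Wunsch global alignment between recognized and target phonemes; the scoring values and ARPAbet symbols are illustrative, not taken from the baseline's configuration:

```python
def needleman_wunsch(ref, hyp, gap=-1, match=1, mismatch=-1):
    """Globally align recognized phonemes (hyp) to target phonemes (ref).
    A mismatch or gap against ref suggests a mispronunciation there."""
    n, m = len(ref), len(hyp)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = score[i-1][j-1] + (match if ref[i-1] == hyp[j-1] else mismatch)
            score[i][j] = max(d, score[i-1][j] + gap, score[i][j-1] + gap)
    # Traceback from the bottom-right corner to recover aligned pairs
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + \
                (match if ref[i-1] == hyp[j-1] else mismatch):
            pairs.append((ref[i-1], hyp[j-1])); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            pairs.append((ref[i-1], None)); i -= 1
        else:
            pairs.append((None, hyp[j-1])); j -= 1
    return pairs[::-1]

ref = ["DH", "AH", "K", "AE", "T"]   # target phonemes for "the cat"
hyp = ["DH", "AH", "K", "AA", "T"]   # speaker substituted AA for AE
errors = [r for r, h in needleman_wunsch(ref, hyp) if r is not None and r != h]
```

Each target phoneme paired with a different (or missing) recognized phoneme is then flagged as a mispronunciation.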

As we can see from Table II, with the help of deep learning, the baseline ASR-based model achieves a great improvement in F1 score over the GOP-based GMM-HMM model. However, as the ASR-based model can only utilize L1 utterances, it may fail to classify L2 phonemes when faced with unseen L2 utterances, and thus shows a relatively high FRR. In contrast, the proposed model encodes both L1 and L2 utterances in a self-supervised manner and further uses both the code sequence and the conditioning phonemes for judgment. The F1 score is increased to 0.423, a 9.58% relative improvement compared to the baseline.

Fig. 4: Spectrum comparison of the sample from Fig.3. Our method can modify the mispronounced areas while keeping the other correct areas compared with the TTS-based method. Two obviously modified areas of our method are marked in blue boxes.

III-C Mispronunciation Correction

To test whether the generated speech is corrected toward the standard pronunciation, we use a word-level ASR system trained on Librispeech to measure WER. A higher WER suggests that the speech contains more mispronunciations that cannot be recognized by this L1 ASR system. To evaluate the speaking style, we ask 20 volunteers to rate the similarity of style between the raw speech and the corrected speech.

We use the speech generated by the L1 Fastspeech2 [15] TTS system (conditioned on the text phonemes and the speaker X-vector) as the WER topline, denoted as Fastspeech2 (w/o Style). For comparison, to preserve other styles of the original speaker, the energy, pitch, and phoneme duration attributes are used as extra conditions for another TTS generation, denoted as Fastspeech2 (w/ Style). For a fair WER comparison, the original pronunciation is also reconstructed using the same Parallel WaveGAN vocoder [24].

As we can see from Table III, adding styles from L2 utterances leads to a WER degradation, but the style similarity increases. For our method, as we only modify the wrong codes and keep the correct codes (we show a sample spectrum in Fig.4), the proposed method can achieve a comparable correction performance and the best style preservation.

IV Conclusion

In this paper, we propose to use VQ-VAE for discrete acoustic unit encoding. Further, we integrate discriminative and generative models to detect error codes and generate the correct ones. Thus, the proposed method can perform mispronunciation detection and correction at the same time. Experiments on the L2-Arctic dataset show that the proposed method is a promising approach for CAPT.


  • [1] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33. Cited by: §II-A.
  • [2] C. B. Chang (2010) First language phonetic drift during second language acquisition. ProQuest LLC. External Links: Link, ISSN 1245-5128 Cited by: §I.
  • [3] M. Chen, X. Tan, Y. Ren, J. Xu, H. Sun, S. Zhao, and T. Qin (2020) MultiSpeech: multi-speaker text to speech with transformer. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, External Links: Document Cited by: §I.
  • [4] D. Felps, H. Bortfeld, and R. Gutierrez-Osuna (2009) Foreign accent conversion in computer assisted pronunciation training. Speech Communication 51 (10), pp. 920–932. External Links: ISSN 01676393, Document Cited by: §I.
  • [5] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang (2020) Conformer: convolution-augmented transformer for speech recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, External Links: Document Cited by: §II-A.
  • [6] E. Jang, S. Gu, and B. Poole (2016/11/4) Categorical reparameterization with gumbel-softmax. External Links: Link Cited by: §II-A.
  • [7] A. Lee and J. Glass (2012) A comparison-based approach to mispronunciation detection. In IEEE Spoken Language Technology Workshop, SLT, pp. 382–387. External Links: ISBN 978-1-4673-5126-3, Document Cited by: §I.
  • [8] A. Lee and J. Glass (2013) Pronunciation assessment via a comparison-based system. In Speech and Language Technology in Education, Cited by: §I.
  • [9] A. Lee, Y. Zhang, and J. Glass (2013) Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 8227–8231. External Links: ISSN 15206149, Document Cited by: §I.
  • [10] W. Leung, X. Liu, and H. Meng (2019) CNN-rnn-ctc based end-to-end mispronunciation detection and diagnosis. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, External Links: ISBN 9781479981311, Document Cited by: §III-B, TABLE II.
  • [11] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu (2019) Neural speech synthesis with transformer network. In 34th AAAI Conference on Artificial Intelligence, pp. 6706–6713. External Links: ISSN 2159-5399, Document Cited by: §I.
  • [12] C. J. Maddison, D. Tarlow, and T. Minka (2014/11/1) A* sampling. External Links: Link Cited by: §II-A.
  • [13] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 5206–5210. External Links: ISBN 978-1-4673-6997-8, Document Cited by: §III-A.
  • [14] K. Probst, Y. Ke, and M. Eskenazi (2002) Enhancing foreign language tutors – in search of the golden speaker. Speech Communication 37 (3-4), pp. 161–173. External Links: ISSN 01676393, Document Cited by: §I.
  • [15] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu (2020/6/8) FastSpeech 2: fast and high-quality end-to-end text to speech. External Links: Link Cited by: §I, §III-C.
  • [16] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu (2019) FastSpeech: fast, robust and controllable text to speech. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3171–3180. Cited by: §I.
  • [17] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust dnn embeddings for speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 5329–5333. External Links: ISBN 978-1-5386-4658-8, Document Cited by: §II-A.
  • [18] (2008) The needleman-wunsch algorithm for sequence alignment. External Links: Link Cited by: §III-B.
  • [19] M. Tu, A. Grabek, J. Liss, and V. Berisha (2018) Investigating the role of l1 in automatic pronunciation evaluation of l2 speech. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 1636–1640. External Links: ISSN 19909772, Document Cited by: §I.
  • [20] A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017/11/3) Neural discrete representation learning. External Links: Link Cited by: §I.
  • [21] B. van Niekerk, L. Nortje, and H. Kamper (2020) Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 4836–4840. External Links: ISSN 19909772, Document Cited by: §I, §II-A.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems 2017-Decem, pp. 5999–6009. External Links: ISSN 10495258 Cited by: §II-B.
  • [23] S. M. Witt and S. J. Young (2000) Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication 30 (2), pp. 95–108. External Links: ISSN 01676393, Document Cited by: §I.
  • [24] R. Yamamoto, E. Song, and J. Kim (2019/10/25) Parallel wavegan: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. External Links: Link Cited by: §III-C.
  • [25] B. C. Yan, M. C. Wu, H. T. Hung, and B. Chen (2020) An end-to-end mispronunciation detection system for l2 english speech leveraging novel anti-phone modeling. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 3032–3036. External Links: ISSN 19909772, Document Cited by: §I.
  • [26] S. H. Yang and M. Chung (2019) Self-imitating feedback generation using gan for computer-assisted pronunciation training. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 1881–1885. External Links: Document Cited by: §I.
  • [27] L. Zhang, Z. Zhao, C. Ma, L. Shan, H. Sun, L. Jiang, S. Deng, and C. Gao (2020) End-to-end automatic pronunciation error detection based on improved hybrid ctc/attention architecture. Sensors (Switzerland) 20 (7), pp. 1–24. External Links: ISSN 14248220, Document Cited by: §I.
  • [28] Z. Zhang, Y. Wang, and J. Yang (2021) Text-conditioned transformer for automatic pronunciation error detection. Speech Communication 130, pp. 55–63. External Links: ISSN 01676393, Document Cited by: §I.
  • [29] G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna (2018) L2-arctic: a non-native english speech corpus. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, External Links: Document Cited by: 3rd item, §III-A, §III-B, TABLE II.