Computer-Assisted Pronunciation Training (CAPT) is an important technology for offering a flexible education service to second language (L2) learners. In the pronunciation training process, the learner is asked to read a pre-defined text, and the CAPT system should detect mispronunciations in the speech and give proper feedback. The common approach for CAPT is to compare the pronounced speech with the standard pronunciation distribution. If the deviation is too large, the pronunciation is judged as an error.
One class of methods aligns the student utterance with a teacher utterance via Dynamic Time Warping (DTW). Misalignment extracted from the DTW path is used to detect the mispronunciation. However, such methods have two main shortcomings. First, teacher utterances are expensive to prepare. Admittedly, with the rapid development of deep learning, text-to-speech (TTS) technologies [11, 16, 3, 15] can be applied to generate teacher utterances; nevertheless, how to make the generated speech more suitable for alignment and mispronunciation detection remains to be explored. Second, the feature used for alignment needs to be invariant to the speaking style; otherwise, the alignment may fail. To extract content-related and style-invariant features, other methods instead use features extracted by an automatic speech recognition (ASR) model. Note that the ASR model is trained on utterances collected from native (L1) speakers; thus, these methods in fact compare the input speech to learned knowledge of standard pronunciation. For example, the goodness-of-pronunciation (GOP)
uses the posterior probability of pre-defined phonemes to judge correctness. In recent years, ASR performance has improved greatly, so recognizing the pronounced phonemes and comparing them with the pre-defined phonemes is also a valid approach; this method is used by [27, 25, 28] to simplify the workflow. However, it is hard for these ASR-based methods to be trained on L2 utterances. First, to utilize L2 utterances, the text annotations must be the phonemes actually pronounced by the speaker rather than the phonemes of the pre-defined text, and such annotations are expensive to obtain. Second, ASR-based methods generally use only the standard phoneme set (for example, the 44 phonemes of English) for recognition and comparison. However, as analyzed in [2, 19], English learners from different language backgrounds may show acoustic characteristics similar to their mother tongues. Thus, L2 utterances may include undefined phonemes (which we call L2 phonemes in this paper) that cannot be properly classified into the standard phonemes.
To address the aforementioned limitations, we propose to encode both L1 and L2 utterances into discrete acoustic units (referred to as "codes" in the following discussion for simplicity) using the vector-quantized variational autoencoder (VQ-VAE) [20, 21]. This training process is self-supervised, so the L2 acoustic features can be properly modeled without expensive annotations. Next, we utilize the encoded L1 code sequence and other distracting code sequences to simulate mispronunciations, and train a correction model that can discriminate the error codes and revise them to the correct ones given the pre-defined text. By decoding the corrected code sequence, we can also obtain the corrected speech while keeping the style of the speaker. As analyzed in [26, 14, 4], the more similar the teacher's speech is to the speaking style of the learner, the more positive the impact on pronunciation training. Thus, the corrected speech can be a powerful aid for education.
Our contributions can be summarized as follows.
We propose to encode the speech into discrete acoustic units via VQ-VAE for mispronunciation detection and correction. The proposed self-supervised encoding method can better model L2 features without expensive annotations.
By discriminating error codes and generating the correct ones, the proposed method not only detects the mispronunciation but also generates the correct pronunciation while keeping the speaking style. To the best of our knowledge, this is the first approach to perform mispronunciation detection and correction simultaneously. Such feedback is valuable for learners improving their speaking skills.
We conduct experiments on the L2-Arctic dataset. Experiments show that the detection F1 score is improved by 9.58% relatively compared with ASR-based methods. The proposed method also achieves a comparable word error rate (WER) and the best style preservation for mispronunciation correction compared with TTS-based methods. (Audio samples: https://zju-zhan-zhang.github.io/mispronunciation-d-c/)
II. Proposed Method
II-A. Acoustic Unit Encoding via VQ-VAE
We adopt VQ-VAE for spectrum encoding as illustrated in Fig. 1. The VQ encoder is constructed by Conformer encoder layers, a time-domain down-sampling convolutional (Conv) layer, and a Gumbel-Softmax VQ layer [6, 12]. The raw waveform is first converted to the log-Mel spectrum and goes through the Conformer encoder layers and the down-sampling layer for feature extraction.
Then, the VQ layer transforms the encoded feature $h \in \mathbb{R}^{T \times d_a}$ ($d_a$ is the attention dim of the Conformer layer) to logits $l \in \mathbb{R}^{T \times G}$ ($G$ is the number of codebooks) using a linear layer. The probability for choosing the $j$-th codebook embedding is

$$p_j = \frac{\exp\big((l_j + v_j)/\tau\big)}{\sum_{k=1}^{G}\exp\big((l_k + v_k)/\tau\big)}, \qquad v_j = -\log(-\log(u_j)),$$

where $\tau$ is the Gumbel-Softmax temperature, $d_c$ is the dim of the codebook embeddings, and $u_j$ is uniformly sampled from $\mathcal{U}(0, 1)$. During the forward pass, the index of the chosen codebook is $j^{*} = \arg\max_{j} p_j$.
In the backward pass, the true gradient of the Gumbel-Softmax output is used.
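The sampling step above can be sketched in plain Python. This is an illustrative toy with hypothetical function names, not the paper's implementation; a real model would operate on batched tensors and keep the soft probabilities for the straight-through backward pass.

```python
import math
import random

def gumbel_softmax_probs(logits, tau=1.0, rng=random):
    """Perturb logits with Gumbel noise and apply a softmax with temperature tau.

    Returns the soft probabilities. In training, the forward pass takes the
    argmax (hard one-hot) while the backward pass uses the gradient of these
    soft probabilities (straight-through estimator).
    """
    # Gumbel noise: v = -log(-log(u)), u ~ Uniform(0, 1)
    noised = [l - math.log(-math.log(rng.random())) for l in logits]
    m = max(n / tau for n in noised)  # subtract the max for numerical stability
    exps = [math.exp(n / tau - m) for n in noised]
    z = sum(exps)
    return [e / z for e in exps]

def choose_code(logits, tau=1.0, rng=random):
    """Forward-pass selection: index of the most probable codebook entry."""
    probs = gumbel_softmax_probs(logits, tau, rng)
    return max(range(len(probs)), key=lambda j: probs[j])
```

Lower temperatures push the soft distribution toward a hard one-hot choice, which is why the temperature is typically annealed during training.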
The code decoder in Fig. 1 uses a transposed convolutional (TConv) layer for time-domain up-sampling and Conformer decoder layers to recover the original spectrum from the encoded code embedding sequence. The code sequence is merged with the speaker information; we use the X-vector extracted from the input spectrum as the speaker information.
On the one hand, the whole VQ-VAE must preserve information from the encoder to reconstruct the original spectrum. On the other hand, the VQ layer imposes an information bottleneck to force the encoder to discard non-essential details. As the style-related speaker information is directly offered to the decoder, the encoder will focus on extracting acoustic features that correlate more with the content.
We apply two loss functions to train the proposed VQ-VAE. First, to reconstruct the original spectrum, we apply the mean square error (MSE) loss between the reconstructed spectrum and the original one. Second, to encourage codebook usage, inspired by wav2vec 2.0, the diversity loss is also applied to increase the information entropy of the code distribution. The final loss is defined as

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda \mathcal{L}_{\mathrm{div}},$$

where $\lambda$ is the weight of the diversity loss; we fix $\lambda$ in our experiments.
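As a sketch of the diversity term, the formulation below follows the wav2vec 2.0 style of penalizing low entropy of the batch-averaged code distribution; the exact form used here is an assumption and may differ from the paper's.

```python
import math

def diversity_loss(prob_batch):
    """Diversity loss encouraging uniform codebook usage.

    prob_batch: list of per-frame probability vectors over the V codebook
    entries. The vectors are averaged into pbar and low entropy is penalized,
    following the wav2vec 2.0 formulation L_div = (V - exp(H(pbar))) / V,
    which is 0 for perfectly uniform usage and approaches 1 on collapse.
    """
    v = len(prob_batch[0])
    pbar = [sum(p[j] for p in prob_batch) / len(prob_batch) for j in range(v)]
    entropy = -sum(p * math.log(p) for p in pbar if p > 0)
    return (v - math.exp(entropy)) / v
```

If every frame puts all its mass on one codebook entry, `exp(H)` collapses to 1 and the loss saturates near `(V - 1) / V`, giving the encoder a strong incentive to spread usage across the codebook.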
II-B. Non-Autoregressive Code Correction
After the training of the proposed VQ-VAE converges, we freeze the model parameters and encode both L1 and L2 utterances into discrete codes. As illustrated in Fig. 2, to simulate mispronunciations, the ground-truth L1 codes are mixed with other distracting codes. Formally, for the L1 code sequence $c$, we randomly replace segments of its original codes with distracting codes $c'$. In practice, we use a mask sequence $m$ (its initial values are 0) and set the replaced segments to 1. Thus, the corrupted code sequence can be denoted as

$$\tilde{c} = (1 - m) \odot c + m \odot c',$$

where $\odot$ is the element-wise product. We sample $c'$ from L2 utterances and from other utterances read by the same speaker as $c$. We set a fixed number of distracting replacements for each code sequence; the replacement length and the start position are uniformly sampled.
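The corruption scheme can be illustrated as follows (toy Python; `n_repl` and `max_len` stand in for the paper's sampling ranges, whose exact values are not reproduced here).

```python
import random

def corrupt_codes(codes, distractors, n_repl, max_len, rng=random):
    """Replace n_repl random segments of `codes` with distractor codes.

    Returns the corrupted sequence and the 0/1 error mask (1 = replaced),
    mirroring c_tilde = (1 - m) * c + m * c'.
    """
    corrupted = list(codes)
    mask = [0] * len(codes)
    for _ in range(n_repl):
        length = rng.randint(1, max_len)              # segment length, sampled uniformly
        start = rng.randint(0, len(codes) - length)   # start position, sampled uniformly
        for t in range(start, start + length):
            corrupted[t] = distractors[t % len(distractors)]
            mask[t] = 1
    return corrupted, mask
```

The mask produced here is exactly the training target for the error-mask prediction head, so the corrector learns which positions were replaced as well as what the original codes were.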
We adopt the Transformer structure to predict both the original code sequence and the error mask from the phonemes and the corrupted code sequence. The text encoder (TextEnc) uses a front Conv layer and Transformer encoder layers to encode the phoneme sequence $p$. The code corrector is constructed by a front Conv layer, Transformer decoder layers, and two output layers for the code prediction $\hat{c}$ and the error-mask prediction $\hat{m}$:

$$(\hat{c}, \hat{m}) = \mathrm{Corrector}\big(\tilde{c}, \mathrm{TextEnc}(p)\big).$$

The loss function is defined as the classification loss between $\hat{c}$ and $c$, and between $\hat{m}$ and $m$, using the cross-entropy (CE) loss and the binary cross-entropy (BCE) loss, respectively:

$$\mathcal{L}_{\mathrm{corr}} = \mathrm{CE}(\hat{c}, c) + \mathrm{BCE}(\hat{m}, m).$$
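A minimal sketch of this combined objective on per-frame predictions follows (pure Python over lists; a real implementation would use framework loss functions over batched tensors, and the equal weighting of the two terms is an assumption).

```python
import math

def correction_loss(code_logprobs, target_codes, mask_probs, target_mask):
    """Cross-entropy over code predictions plus binary cross-entropy over
    the error-mask predictions, each averaged over time.

    code_logprobs: per-frame log-probability vectors over the codebook
    target_codes:  per-frame ground-truth code indices
    mask_probs:    per-frame predicted probability that the code is an error
    target_mask:   per-frame 0/1 ground-truth error mask
    """
    ce = -sum(lp[c] for lp, c in zip(code_logprobs, target_codes)) / len(target_codes)
    bce = -sum(m * math.log(p) + (1 - m) * math.log(1 - p)
               for p, m in zip(mask_probs, target_mask)) / len(target_mask)
    return ce + bce
```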
II-C. Mispronunciation Correction and Detection
During inference, the input speech is first encoded into the discrete code sequence. Then, the code corrector decides whether each code matches the standard L1 distribution conditioned on the input phonemes. Since we use the L1 code sequence as the training target in Eq. (8), codes that only appear in L2 utterances or that deviate far from the corresponding phoneme will be found and replaced with the correct ones. The predicted code sequence is further passed to the code decoder to obtain the spectrum. Finally, the spectrum is converted to the corrected speech waveform by a vocoder.
Note that $\hat{m}$ is the error prediction for each code. We use the attention map of the last decoder layer to align the code sequence to the phoneme sequence. Formally, if we define the attention weight between phoneme $i$ and the predicted error mask at frame $t$ as $a_{i,t}$, the error prediction for phoneme $i$ is

$$e_i = \sum_{t} a_{i,t}\, \hat{m}_t.$$

A sample result is shown in Fig. 3. When $e_i$ is larger than the threshold $\beta$, we mark this phoneme as a mispronunciation.
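The attention-weighted aggregation and thresholding can be sketched as below (illustrative names; each attention row is assumed normalized over frames).

```python
def phoneme_errors(attn, frame_errors, threshold=0.5):
    """Aggregate per-code error predictions to phoneme level.

    attn[i][t]: attention weight between phoneme i and code frame t
    frame_errors[t]: predicted probability that the code at frame t is wrong

    The phoneme-level score is the attention-weighted average of the frame
    error predictions; scores above `threshold` are flagged as
    mispronunciations.
    """
    scores = [sum(a * e for a, e in zip(row, frame_errors)) for row in attn]
    return scores, [s > threshold for s in scores]
```

Because the attention rows sum to one, a phoneme attending mostly to clean frames keeps a low score even if a distant frame is flagged, which makes the aggregation robust to isolated frame-level false alarms.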
III. Experiments

III-A. Model Details
We use Librispeech as the L1 dataset and L2-Arctic as the L2 dataset. Annotated L2-Arctic utterances are kept as the test set, and the others are combined with Librispeech utterances to train the proposed VQ-VAE. We convert the 24 kHz raw waveform into the 80-dim log-Mel spectrum for experiments. The detailed parameters are listed in Table I, where $d_a$ is the attention dim, $d_{ff}$ is the feed-forward dim, $h$ is the number of attention heads, and $k$ is the kernel size.
We train both the VQ-VAE and the code corrector until the training loss converges. We use the Adam optimizer with a warm-up learning rate scheduler.
| Module | Hyper-parameters |
|---|---|
| Conformer Enc ×3 | attention dim, feed-forward dim, heads, kernel size |
| VQ Layer | number of codebooks, codebook dim, temperature |
| Conformer Dec ×3 | attention dim, feed-forward dim, heads, kernel size |
| Transformer Enc ×6 | attention dim, feed-forward dim, heads |
| Transformer Dec ×6 | attention dim, feed-forward dim, heads |
III-B. Mispronunciation Detection Results
For mispronunciation detection, the model should balance detecting mispronunciations and accepting correct pronunciations. Thus, we use the F1 score as the final metric for this task. In addition to accuracy (ACC), precision (PRE), and recall (REC), we also report the false rejection rate (FRR) and the false acceptance rate (FAR).
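For concreteness, these metrics can be computed from phoneme-level counts as below. Treating a mispronunciation as the positive class is the conventional choice and an assumption about the paper's exact bookkeeping.

```python
def detection_metrics(tp, fp, tn, fn):
    """Detection metrics from phoneme-level counts.

    tp: mispronunciations correctly detected
    fp: correct phonemes wrongly rejected as mispronounced
    tn: correct phonemes accepted
    fn: mispronunciations missed (accepted as correct)
    """
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    acc = (tp + tn) / (tp + fp + tn + fn)
    frr = fp / (fp + tn)   # fraction of correct phonemes wrongly flagged
    far = fn / (fn + tp)   # fraction of mispronunciations wrongly accepted
    return {"ACC": acc, "PRE": pre, "REC": rec, "F1": f1, "FRR": frr, "FAR": far}
```

Note that FRR and FAR are complementary to the precision/recall view: lowering the detection threshold trades FAR for FRR, which is why F1 is used as the single summary metric.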
Following prior work, a phoneme-level ASR system trained on Librispeech is set as the baseline. This baseline recognizes the spoken phonemes and then aligns them to the target phonemes using the Needleman-Wunsch algorithm to determine mispronunciations. We show the results in Table II, where the results of [29] are also displayed.
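A minimal sketch of this baseline's alignment step follows, assuming unit match/mismatch/gap scores (the actual scoring used by the baseline may differ).

```python
def needleman_wunsch(ref, hyp, match=1, mismatch=-1, gap=-1):
    """Global alignment of two phoneme sequences; returns aligned pairs,
    with None marking an insertion or deletion gap."""
    n, m = len(ref), len(hyp)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if ref[i - 1] == hyp[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback from the bottom-right corner
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1]
                + (match if ref[i - 1] == hyp[j - 1] else mismatch)):
            pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], None)); i -= 1      # deletion
        else:
            pairs.append((None, hyp[j - 1])); j -= 1      # insertion
    pairs.reverse()
    return pairs

def detect_mispronunciations(pairs):
    """A target phoneme is flagged when the aligned spoken phoneme differs
    or is missing; insertions (None on the reference side) are ignored."""
    return [(r, h, r != h) for r, h in pairs if r is not None]
```

For example, aligning target `["AH", "B", "K"]` against recognized `["AH", "P", "K"]` flags only the middle phoneme as a substitution error.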
As shown in Table II, with the help of deep learning, the baseline ASR-based model achieves a great improvement in F1 score compared with the GOP-based GMM-HMM model. However, as the ASR-based model can only utilize L1 utterances, when faced with unseen L2 utterances it may fail to classify L2 phonemes and shows a relatively high FRR. In contrast, the proposed model encodes both L1 and L2 utterances in a self-supervised manner and further uses both the code sequence and the conditioning phonemes for judgment. With the selected threshold, the F1 score is increased to 0.423, a 9.58% relative improvement over the baseline.
III-C. Mispronunciation Correction
To test whether the generated speech is corrected to the standard pronunciation, we use a word-level ASR system trained on Librispeech to measure WER. A higher WER suggests that the speech contains more mispronunciations that cannot be recognized by this L1 ASR system. To evaluate the speaking style, we ask 20 volunteers to rate the style similarity between the raw speech and the corrected speech.
We use the speech generated by an L1 Fastspeech2 TTS system (conditioned on the text phonemes and the speaker X-vector) as the WER topline, denoted as Fastspeech2 (w/o Style). For comparison, to preserve the other style attributes of the original speaker, the energy, pitch, and phoneme duration are used as extra conditions for another TTS generation, denoted as Fastspeech2 (w/ Style). For a fair WER comparison, the original pronunciation is also reconstructed with the same Parallel-WaveGAN vocoder.
As shown in Table III, adding styles from L2 utterances leads to a WER degradation, but the style similarity increases. As our method only modifies the wrong codes and keeps the correct ones (a sample spectrum is shown in Fig. 4), it achieves comparable correction performance and the best style preservation.
In this paper, we propose to use VQ-VAE for discrete acoustic unit encoding. Further, we integrate discriminative and generative modeling to detect the error codes and generate the correct ones, so the proposed method can perform mispronunciation detection and correction at the same time. Experiments on the L2-Arctic dataset show that the proposed method is a promising approach for CAPT.
References

- Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33.
- (2010) First language phonetic drift during second language acquisition. ProQuest LLC.
- (2020) MultiSpeech: multi-speaker text to speech with transformer. In INTERSPEECH.
- (2009) Foreign accent conversion in computer assisted pronunciation training. Speech Communication 51(10), pp. 920–932.
- (2020) Conformer: convolution-augmented transformer for speech recognition. In INTERSPEECH.
- (2016) Categorical reparameterization with Gumbel-Softmax.
- (2012) A comparison-based approach to mispronunciation detection. In IEEE Spoken Language Technology Workshop (SLT), pp. 382–387.
- (2013) Pronunciation assessment via a comparison-based system. In Speech and Language Technology in Education.
- Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams. In IEEE ICASSP, pp. 8227–8231.
- (2019) CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis. In IEEE ICASSP.
- Neural speech synthesis with transformer network. In 34th AAAI Conference on Artificial Intelligence, pp. 6706–6713.
- (2014) A* sampling.
- (2015) Librispeech: an ASR corpus based on public domain audio books. In IEEE ICASSP, pp. 5206–5210.
- (2002) Enhancing foreign language tutors – in search of the golden speaker. Speech Communication 37(3-4), pp. 161–173.
- (2020) FastSpeech 2: fast and high-quality end-to-end text to speech.
- (2019) FastSpeech: fast, robust and controllable text to speech. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3171–3180.
- (2018) X-vectors: robust DNN embeddings for speaker recognition. In IEEE ICASSP, pp. 5329–5333.
- (2008) The Needleman-Wunsch algorithm for sequence alignment.
- (2018) Investigating the role of L1 in automatic pronunciation evaluation of L2 speech. In INTERSPEECH, pp. 1636–1640.
- (2017) Neural discrete representation learning.
- Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge. In INTERSPEECH, pp. 4836–4840.
- (2017) Attention is all you need. Advances in Neural Information Processing Systems, pp. 5999–6009.
- (2000) Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication 30(2), pp. 95–108.
- Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram.
- (2020) An end-to-end mispronunciation detection system for L2 English speech leveraging novel anti-phone modeling. In INTERSPEECH, pp. 3032–3036.
- (2019) Self-imitating feedback generation using GAN for computer-assisted pronunciation training. In INTERSPEECH, pp. 1881–1885.
- (2020) End-to-end automatic pronunciation error detection based on improved hybrid CTC/attention architecture. Sensors 20(7), pp. 1–24.
- (2021) Text-conditioned transformer for automatic pronunciation error detection. Speech Communication 130, pp. 55–63.
- (2018) L2-ARCTIC: a non-native English speech corpus. In INTERSPEECH.