Dysarthria arises from various neurological disorders, including Parkinson's disease and amyotrophic lateral sclerosis, which lead to weak control of articulators such as the jaw, tongue, and lips. The resulting dysarthric speech may be perceived as harsh or breathy, with abnormal prosody and inaccurate pronunciation, which degrades the efficiency of vocal communication for dysarthric patients. Attempts have been made to improve the quality of dysarthric speech by using various reconstruction approaches, among which voice conversion (VC) serves as a promising candidate.
The goal of VC is to convert non-linguistic or para-linguistic factors such as speaker identity, prosody, emotion and accent. VC has also been widely applied to reconstructing different kinds of impaired speech, including esophageal speech [10, 29], electrolaryngeal speech [25, 17], hearing-impaired speech and dysarthric speech, where both rule-based and statistical VC approaches have been investigated for dysarthric speech reconstruction (DSR). Rule-based VC tends to apply manually designed, speaker-dependent rules to correct phoneme errors or to modify temporal and frequency features for improved intelligibility [28, 18]. Statistical VC automatically maps the features of dysarthric speech to those of normal speech; typical approaches include Gaussian mixture models, non-negative matrix factorization [1, 4], partial least squares, and deep learning methods such as sequence-to-sequence (seq2seq) models [33, 11, 7] and gated convolutional networks. Though significant progress has been made, previous work generally ignores speaker identity preservation, depriving patients of the ability to convey their individuality through their acoustic characteristics. Preserving the identities of dysarthric speakers is very challenging, since their normal speech utterances are difficult to collect. A few studies [12, 32] use a speaker representation to control the speaker identity of reconstructed speech, where the speaker encoder (SE) proposed in our previous work is trained on a speaker verification (SV) task by using large-scale normal speech. However, the SE may fail to effectively extract speaker representations from previously unseen dysarthric speech, which lowers the speaker similarity of the reconstructed speech.
This paper proposes an improved DSR system that extends our previous work by using adversarial speaker adaptation (ASA). The previous DSR system contains four modules: (1) a speech encoder that extracts accurate phoneme embeddings from dysarthric speech to restore the linguistic content; (2) a prosody corrector that infers normal prosody features, which are treated as canonical values for correction; (3) a speaker encoder that produces a single vector as the speaker representation used to preserve the speaker identity; and (4) a speech generator that maps phoneme embeddings, prosody features and the speaker representation to reconstructed mel-spectrograms. The speaker encoder and speech generator are independently trained on large-scale normal speech data. We term the resulting integrated DSR system with the SV-based speaker encoder SV-DSR; it can generate reconstructed speech with high intelligibility and naturalness. To better preserve the identity of the target dysarthric speaker during speech generation, speaker adaptation can be used to fine-tune the speaker encoder on the dysarthric speech data. However, this approach inevitably incorporates dysarthric speaking patterns into the reconstructed speech. Hence, we propose to use ASA to alleviate this issue, and we term the resulting DSR system ASA-DSR. For each dysarthric speaker, ASA-DSR is first cloned from SV-DSR and then adapted in a multi-task learning manner: (1) the primary task performs speaker adaptation, fine-tuning the speaker encoder on the dysarthric speech data to enhance speaker similarity; (2) the secondary task performs adversarial training, alternately optimizing the speaker encoder and a system discriminator by min-maximizing a discrimination loss that classifies whether the mel-spectrograms are reconstructed by ASA-DSR or by SV-DSR. This forces the speech reconstructed by ASA-DSR to follow a distribution close to that of SV-DSR, free of dysarthric speaking patterns, so that the reconstructed speech from ASA-DSR maintains stable prosody and improved intelligibility.
The main contribution of this paper is the proposed ASA approach, which effectively preserves the speaker identities of dysarthric patients after reconstruction without requiring the patients' normal speech, which is nearly impossible to collect. Note that our work differs from prior work that applies adversarial speaker adaptation to achieve robust speech recognition: the proposed ASA here is used to obtain regularized mel-spectrograms for generating high-quality speech with enhanced speaker similarity.
2 Baseline approach: SV-DSR
As shown in Fig. 1, our previously proposed SV-DSR system contains four modules: a speech encoder, a prosody corrector, a speaker encoder and a speech generator. The first three modules respectively produce phoneme embeddings, prosody values and a speaker representation; the fourth module, the speech generator, maps these features to reconstructed mel-spectrograms.
Speech encoder: To recover the linguistic content, a seq2seq-based speech encoder is optimized with two-stage training to infer the phoneme sequence: (1) pre-training on large-scale normal speech data; (2) fine-tuning on the speech of a given dysarthric speaker to achieve accurate phoneme prediction. The outputs of the pre-trained or fine-tuned speech encoder are used as phoneme embeddings that denote phoneme probability distributions.
Prosody corrector: As abnormal duration and pitch are two essential prosody factors that characterize dysarthric speech, a prosody corrector is used to amend the abnormal prosody to a normal form. It contains two predictors that respectively infer normal phoneme duration and pitch (i.e., fundamental frequency (F0)). The prosody corrector is trained on a healthy speaker's speech with normal prosodic patterns: (1) given the phoneme embeddings extracted by the speech encoder as inputs, the phoneme duration predictor is trained to infer the normal phoneme durations obtained from forced alignment via the Montreal Forced Aligner toolkit; (2) the ground-truth phoneme durations are used to align the phoneme embeddings, and as shown in Fig. 1, the expanded phoneme embeddings, denoted as p, are fed into the pitch predictor to infer the normal F0, denoted as v. The prosody corrector is expected to take in phoneme embeddings extracted from dysarthric speech and infer normal values of phoneme duration and F0, which can be used as canonical values to replace their abnormal counterparts for generating speech with normal prosodic patterns.
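As a concrete illustration of the duration-based expansion step described above, the following numpy sketch repeats each phoneme embedding according to its duration in frames; this is our own toy implementation, and `length_regulate` is a hypothetical name, not the paper's code.

```python
import numpy as np

def length_regulate(phoneme_embeddings, durations):
    # Repeat each phoneme embedding for its duration (in frames), yielding
    # the frame-level expanded embeddings fed to the pitch predictor.
    expanded = [np.repeat(phoneme_embeddings[i][None, :], d, axis=0)
                for i, d in enumerate(durations)]
    return np.concatenate(expanded, axis=0)

# Three phonemes with 4-dim embeddings and durations of 2, 3 and 1 frames.
emb = np.arange(12, dtype=float).reshape(3, 4)
p = length_regulate(emb, [2, 3, 1])
print(p.shape)  # (6, 4): one row per output frame
```

At inference, the same expansion is applied with the predicted normal durations instead of the ground-truth ones.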
Speaker encoder: The speaker encoder E_s is trained on an SV task to capture speaker characteristics. E_s takes in the mel-spectrograms m of one utterance of arbitrary length and produces a single vector as the speaker representation: s = E_s(m). Following the training scheme of our previous work, E_s is optimized to minimize a generalized end-to-end (GE2E) loss by using normal speech data, which is easily acquired from thousands of healthy speakers.
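The generalized end-to-end loss used for speaker-encoder training can be illustrated with the following numpy sketch of its softmax variant; this is a toy re-implementation for exposition, not the actual training code, which optimizes an LSTM encoder on large batches.

```python
import numpy as np

def ge2e_softmax_loss(emb, w=10.0, b=-5.0):
    # emb: (n_spk, n_utt, dim) utterance embeddings.
    # Softmax variant of the generalized end-to-end (GE2E) loss.
    n_spk, n_utt, _ = emb.shape
    emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
    centroids = emb.mean(axis=1)
    centroids = centroids / np.linalg.norm(centroids, axis=-1, keepdims=True)
    # Leave-one-out centroids for each utterance's own speaker.
    loo = (emb.sum(axis=1, keepdims=True) - emb) / (n_utt - 1)
    loo = loo / np.linalg.norm(loo, axis=-1, keepdims=True)
    loss = 0.0
    for j in range(n_spk):
        for i in range(n_utt):
            sim = w * emb[j, i] @ centroids.T + b       # cosine to all centroids
            sim[j] = w * emb[j, i] @ loo[j, i] + b      # own speaker: leave-one-out
            loss += -sim[j] + np.log(np.exp(sim).sum())
    return loss / (n_spk * n_utt)

# Usage: 4 speakers, 5 utterances each, 16-dim embeddings.
loss = ge2e_softmax_loss(np.random.default_rng(0).normal(size=(4, 5, 16)))
print(float(loss))
```

The loss is small when utterances cluster tightly around their own speaker's centroid and far from other centroids, which is exactly the property exploited when the embedding is reused as a speaker representation.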
Speech generator: The speech generator G predicts mel-spectrograms as m̂ = G(p, v, s). To generate normal speech, the speech generator is trained on normal speech data from a set of healthy speakers. Each speaker has a training data set in which each sample corresponds to one utterance and contains mel-spectrograms m, expanded phoneme embeddings p and pitch features v. The speech generator is then optimized by minimizing the generation loss L_gen, i.e., the L2-norm between the predicted mel-spectrograms and the ground truth: L_gen = ||m̂ − m||_2.
During the reconstruction phase, the SV-DSR system takes in the dysarthric speech of a speaker and generates reconstructed mel-spectrograms from three inputs: the phoneme embeddings extracted by the fine-tuned speech encoder and expanded with the predicted normal durations, the predicted normal pitch, and the speaker representation. Parallel WaveGAN (PWG) is then adopted as the neural vocoder to transform the mel-spectrograms into the speech waveform. SV-DSR is a strong baseline, as it can generate speech with high intelligibility and naturalness. However, its speaker encoder is trained on normal speech only, which limits its generalization to previously unseen dysarthric speech. Therefore, it cannot effectively capture identity-related information of dysarthric speakers. Our experiments found that SV-DSR may even change the perceived gender of the speech after reconstruction.
3 Proposed approach: ASA-DSR
The proposed approach of adversarial speaker adaptation (ASA), as illustrated in Fig. 2, aims to enhance speaker similarity, resulting in the proposed ASA-DSR system that shares the same modules as SV-DSR except for the speaker encoder. First, ASA-DSR is cloned from SV-DSR; then a system discriminator D is introduced to determine whether its input mel-spectrograms are reconstructed by SV-DSR or by ASA-DSR. Consider a dysarthric speaker with an adaptation data set whose elements (m, p, v) each correspond to one dysarthric utterance, where m are mel-spectrograms, p are phoneme embeddings extracted by the fine-tuned speech encoder and expanded with the dysarthric durations, and v is the dysarthric pitch; their normal counterparts obtained via the prosody corrector are denoted p̂ and v̂, respectively. SV-DSR and ASA-DSR generate reconstructed mel-spectrograms m̂_SV and m̂_ASA, respectively:

m̂_SV = G(p̂, v̂, s_SV),  m̂_ASA = G(p̂, v̂, s_ASA),

where G is the shared speech generator, and s_SV and s_ASA are the speaker representations produced by the speaker encoders of SV-DSR and ASA-DSR, respectively, to control the speaker identity. Besides, ASA-DSR predicts dysarthric mel-spectrograms used for adaptation:

m̃ = G(p, v, s_ASA).

Then the speaker encoder of ASA-DSR and the discriminator D are alternately optimized while the remaining networks are frozen. On one hand, D is optimized to minimize the discrimination loss L_disc:

L_disc = −log D(m̂_SV) − log(1 − D(m̂_ASA)),

where D(·) is the posterior probability that the input mel-spectrograms are reconstructed by SV-DSR. On the other hand, the speaker encoder of ASA-DSR is optimized to minimize the multi-task learning (MTL) loss L_MTL:

L_MTL = L_adapt − λ L_disc,  with  L_adapt = ||m̃ − m||_2,

where λ is set to 1 empirically. The primary task minimizes the adaptation loss L_adapt to force the speaker encoder to effectively capture speaker characteristics from the dysarthric speech, so that enhanced speaker similarity can be achieved in the reconstructed mel-spectrograms m̂_ASA. The secondary task maximizes the discrimination loss L_disc to force m̂_ASA to follow a distribution similar to that of m̂_SV, which has high intelligibility and naturalness, and thus helps m̂_ASA maintain normal pronunciation patterns. As a result, the proposed ASA-DSR preserves the capacity of SV-DSR to reconstruct high-quality speech, while achieving improved capacity for preserving the identity of the target dysarthric speaker.
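To make the alternating objectives concrete, the sketch below computes the discrimination loss and the multi-task learning loss for a single utterance in numpy; the scalar discriminator outputs and the function names are our own simplifications rather than the paper's implementation.

```python
import numpy as np

def bce(posterior, target):
    # Binary cross-entropy for a scalar posterior probability.
    return -(target * np.log(posterior) + (1 - target) * np.log(1 - posterior))

def discrimination_loss(d_sv, d_asa):
    # D is trained to assign 1 to SV-DSR spectrograms and 0 to ASA-DSR ones.
    return bce(d_sv, 1.0) + bce(d_asa, 0.0)

def mtl_loss(m_tilde, m, d_asa, lam=1.0):
    # Speaker-encoder objective: adaptation loss minus lam times the
    # discrimination term on the ASA output, i.e. the encoder tries to fool D.
    l_adapt = np.linalg.norm(m_tilde - m)
    return l_adapt - lam * bce(d_asa, 0.0)
```

In the alternating scheme, the discriminator takes a gradient step on `discrimination_loss` and the speaker encoder takes a step on `mtl_loss`, with all other networks frozen.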
4.1 Experimental Settings
The datasets used in our experiments include LibriSpeech, VCTK, VoxCeleb1, VoxCeleb2, LJSpeech and UASPEECH. The speech encoder is pre-trained on the 960-hour training data of LibriSpeech; the prosody corrector is trained on the data of a healthy female speaker from LJSpeech; the speaker encoder is trained on LibriSpeech, VoxCeleb1 and VoxCeleb2 with around 8.5K healthy speakers; and the speech generator and PWG vocoder are trained on VCTK. For dysarthric speech, two male speakers (M05, M07) and two female speakers (F04, F02) are selected from UASPEECH, where M05/F04 and M07/F02 have moderate and moderate-severe dysarthria, respectively. We use the speech data of blocks 1 and 3 of each dysarthric speaker for fine-tuning the speech encoder and for ASA, and block 2 for testing. The inputs of the speech encoder are 40-dim mel-spectrograms appended with deltas and delta-deltas, resulting in 120-dim vectors; the targets of the speech generator are 80-dim mel-spectrograms; all mel-spectrograms are computed with a 400-point Fourier transform, a 25ms Hanning window and a 10ms hop length. F0 is extracted by the PyWorld toolkit (https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder) with the same 10ms hop length. To stabilize the training and inference of the F0 predictor, we adopt the logarithmic scale of F0. All acoustic features are normalized to have zero mean and unit variance.
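The 120-dim speech-encoder inputs can be assembled as in the following numpy sketch, which appends regression-based deltas and delta-deltas and applies mean-variance normalization; the exact delta formula is our assumption, as the paper does not specify it.

```python
import numpy as np

def delta(feat, n=2):
    # Regression-based dynamic features over a +/- n frame window
    # (our assumption; the paper does not specify the delta computation).
    padded = np.pad(feat, ((n, n), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, n + 1))
    t = len(feat)
    return sum(k * (padded[n + k:n + k + t] - padded[n - k:n - k + t])
               for k in range(1, n + 1)) / denom

def encoder_input(mel40):
    # Stack statics, deltas and delta-deltas into 120-dim frames, then
    # apply per-dimension zero-mean, unit-variance normalization.
    x = np.concatenate([mel40, delta(mel40), delta(delta(mel40))], axis=1)
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

mel = np.random.default_rng(1).normal(size=(50, 40))
print(encoder_input(mel).shape)  # (50, 120)
```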
The speech encoder, prosody corrector, speaker encoder and speech generator adopt the same architectures as in our previous work, where the speaker encoder contains a 3-layer 256-dim LSTM followed by one fully-connected layer that yields a 256-dim vector, which is L2-normalized to obtain the speaker representation. The pre-training and fine-tuning of the speech encoder are performed with the Adadelta optimizer for 1M and 2K steps, respectively, using a learning rate of 1 and a batch size of 8. Both the duration and F0 predictors are trained with the Adam optimizer for 30K steps, using a learning rate of 1e-3 and a batch size of 16; the speech generator is optimized in the same way, except that the number of training steps is set to 50K. The training of the speaker encoder on normal speech follows the scheme in our previous work. The convolution-based discriminator of StarGAN is used as the system discriminator and is alternately trained with the speaker encoder during ASA for 5K steps. Four DSR systems are compared: (1) SV-DSR; (2) ASA-DSR; (3) SA-DSR, an ablation system that performs speaker adaptation similar to ASA-DSR but without adversarial training; and (4) E2E-VC, an end-to-end DSR model based on cross-modal knowledge distillation, where the speaker encoder used in SV-DSR is added to control the speaker identity.
Table 1: Comparison results of MOS with 95% confidence intervals for speaker similarity.
4.2 Experimental Results and Analysis
4.2.1 Comparison Based on Speaker Similarity
Subjective tests are conducted to evaluate the speaker similarity of the reconstructed speech in terms of the 5-point mean opinion score (MOS; 1: bad, 2: poor, 3: fair, 4: good, 5: excellent), rated by 20 subjects on 20 utterances randomly selected from each of the four dysarthric speakers; the averaged scores are shown in Table 1. E2E-VC and SV-DSR, which use the SV-based speaker encoder to control the speaker identity, achieve lower speaker similarity. In our listening tests, the gender of the speech reconstructed by E2E-VC and SV-DSR may even be changed, especially for female speakers; this shows the limited ability of the SV-based speaker encoder to generalize and extract effective speaker representations from dysarthric speech. In contrast, with speaker adaptation to fine-tune the speaker encoder, both SA-DSR and ASA-DSR accurately preserve the gender with improved speaker similarity, showing the necessity of using dysarthric speech data to fine-tune the speaker encoder so that it effectively captures identity-related information of dysarthric speech.
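For reference, a MOS value with its 95% confidence interval can be computed from raw listener ratings as in the sketch below (standard normal approximation; the ratings shown are invented for illustration).

```python
import numpy as np

def mos_with_ci(scores, z=1.96):
    # Mean opinion score with a normal-approximation 95% confidence interval.
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half = z * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, half

# Ten invented listener ratings on the 5-point scale.
m, h = mos_with_ci([4, 5, 3, 4, 4, 5, 4, 3, 4, 4])
print(f"MOS = {m:.2f} +/- {h:.2f}")
```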
4.2.2 Comparison Based on Speech Naturalness
Table 2 gives the MOS results for the naturalness of the original and reconstructed speech from the different systems. All DSR systems improve the naturalness of the original dysarthric speech, and SV-DSR achieves the highest naturalness scores for all speakers, which shows the effectiveness of explicit prosody correction in generating speech with stable and accurate prosody. With speaker adaptation but without adversarial training, SA-DSR achieves smaller naturalness improvements, as partial dysarthric pronunciation patterns are incorporated into the reconstructed speech. This issue is effectively alleviated by the proposed ASA, which aligns the statistical distributions of the speech reconstructed by ASA-DSR and SV-DSR, enabling ASA-DSR to generate high-quality speech with naturalness comparable to that of SV-DSR.
4.2.3 Comparison Based on Speech Intelligibility
Objective evaluation of speech intelligibility is conducted with a publicly released speech recognition model, Jasper, by measuring the word error rate (WER) with greedy decoding; the results are shown in Table 3. Compared with the original dysarthric speech, SV-DSR achieves the largest WER reduction for all dysarthric speakers, showing the effectiveness of prosody correction in improving speech intelligibility. Compared with SV-DSR, its adapted version without adversarial training, i.e., SA-DSR, yields a smaller WER reduction, which is caused by the incorporation of dysarthric speaking characteristics into the reconstructed speech. With the proposed ASA alleviating this issue, ASA-DSR outperforms E2E-VC and SA-DSR and matches the performance of SV-DSR, leading to 22.3% and 31.5% absolute WER reductions on average for speakers M05/F04 and M07/F02, who have moderate and moderate-severe dysarthria, respectively.
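The WER metric is the word-level Levenshtein distance normalized by the reference length; a minimal pure-Python sketch (not the actual Jasper scoring pipeline) is:

```python
def wer(ref, hyp):
    # Word error rate: word-level Levenshtein distance / reference length.
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + sub)    # substitution / match
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat down", "the bat sat"))  # 2 errors / 4 words = 0.5
```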
4.2.4 Influence of Phoneme Duration and F0
We also conduct an ablation study to investigate how phoneme duration and F0 influence the quality of the speech reconstructed by the proposed ASA-DSR system. Three combinations of phoneme duration and F0 are used to generate the speech. We perform AB preference tests, where listeners are required to select, from two utterances generated with two different combinations, the one that sounds more normal, i.e., with more stable prosody and more precise articulation. The results are illustrated in Fig. 3. For the comparison 'GG vs. GP' (i.e., Ground-truth duration and F0 versus Ground-truth duration and Predicted F0) across the different speakers, more reconstructed speech samples generated with the predicted normal F0 are favored (p-values < 0.05). For the comparison 'GP vs. PP' (i.e., Ground-truth duration and Predicted F0 versus Predicted duration and F0), using the predicted normal duration significantly improves speech quality, especially for speakers M05, M07 and F02, who have abnormally slow speaking rates. This shows that both phoneme duration and F0 affect speech normality, and the prosody corrector in ASA-DSR derives normal values of both, which facilitates the reconstruction of speech with normal prosodic patterns.
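The significance of such AB preference results is typically checked with a two-sided binomial sign test on the preference counts (ties discarded); the sketch below uses an invented 16-to-4 vote split for illustration.

```python
from math import comb

def sign_test_p(n_a, n_b):
    # Two-sided binomial sign test under H0: both systems equally preferred.
    n = n_a + n_b
    k = max(n_a, n_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# e.g. 16 of 20 listeners prefer the predicted-F0 version (invented numbers).
print(sign_test_p(16, 4))  # p < 0.05
```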
This paper presents a DSR system based on a novel multi-task learning strategy, ASA, to simultaneously preserve the speaker identity and maintain high speech quality. This is achieved by a primary task (speaker adaptation) that helps the speaker encoder capture speaker characteristics from the dysarthric speech, and a secondary task (adversarial training) that prevents dysarthric speaking patterns from being incorporated into the reconstructed speech. Experiments show that the proposed ASA-DSR effectively reduces dysarthria with improved naturalness and intelligibility, while the speaker identity is effectively maintained, with 0.73 and 0.85 absolute MOS improvements in speaker similarity over the strong SV-DSR baseline for speakers with moderate and moderate-severe dysarthria, respectively.
This research is supported partially by the HKSAR Research Grants Council’s General Research Fund (Ref Number 14208817) and also partially by the Centre for Perceptual and Interactive Intelligence, a CUHK InnoCentre.
-  (2012) Consonant enhancement for articulation disorders based on non-negative matrix factorization. In Proceedings of the 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, pp. 1–4. Cited by: §1.
-  (2012) GMM-based emotional voice conversion using spectrum and prosody features. American Journal of Signal Processing 2 (5), pp. 134–138. Cited by: §1.
-  (2017) Phoneme-discriminative features for dysarthric speech conversion. In INTERSPEECH, pp. 3374–3378. Cited by: §1.
-  (2013) Individuality-preserving voice conversion for articulation disorders based on non-negative matrix factorization. In ICASSP, pp. 8037–8040. Cited by: §1.
-  (2019) Parrotron: an end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation. arXiv preprint arXiv:1904.04169. Cited by: §1.
-  (2020) Enhancing intelligibility of dysarthric speech using gated convolutional-based voice conversion system. INTERSPEECH. Cited by: §1.
-  (2021) Conformer Parrotron: a faster and stronger end-to-end speech conversion and recognition model for atypical speech. In INTERSPEECH, pp. 4828–4832. Cited by: §1.
-  (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, pp. 8789–8797. Cited by: §4.1.
-  (2018) VoxCeleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622. Cited by: §4.1.
-  (2010) Esophageal speech enhancement based on statistical voice conversion with Gaussian mixture models. IEICE Transactions on Information and Systems 93 (9), pp. 2472–2482. Cited by: §1.
-  (2021) Extending parrotron: an end-to-end, speech conversion and speech recognition model for atypical speech. In ICASSP, pp. 6988–6992. Cited by: §1.
-  (2021) A preliminary study of a two-stage paradigm for preserving speaker identity in dysarthric voice conversion. arXiv preprint arXiv:2106.01415. Cited by: §1.
-  (2017) The LJ Speech dataset. Cited by: §4.1.
-  (2007) Improving the intelligibility of dysarthric speech. Speech Communication 49 (9), pp. 743–759. Cited by: §1, §2.
-  (2008) Dysarthric speech database for universal access research. In Ninth Annual Conference of the International Speech Communication Association. Cited by: §4.1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
-  (2018) Electrolaryngeal speech enhancement with statistical voice conversion based on CLDNN. In 2018 26th European Signal Processing Conference (EUSIPCO), pp. 2115–2119. Cited by: §1.
-  (2016) Improving the intelligibility of dysarthric speech towards enhancing the effectiveness of speech therapy. In 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 1000–1005. Cited by: §1.
-  (2019) Jasper: an end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288. Cited by: §4.2.3.
-  (2020) End-to-end accent conversion without using native utterances. In ICASSP, pp. 6289–6293. Cited by: §1, §2, §4.1.
-  (2017) Montreal Forced Aligner: trainable text-speech alignment using Kaldi. In INTERSPEECH, Vol. 2017, pp. 498–502. Cited by: §2.
-  (2019) Adversarial speaker adaptation. In ICASSP, pp. 5721–5725. Cited by: §1.
-  (2017) An overview of voice conversion systems. Speech Communication 88, pp. 65–82. Cited by: §1.
-  (2017) VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612. Cited by: §4.1.
-  (2012) Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech. Speech Communication 54 (1), pp. 134–146. Cited by: §1.
-  (2015) LibriSpeech: an ASR corpus based on public domain audio books. In ICASSP, pp. 5206–5210. Cited by: §4.1.
-  (2003) Transformation of speaker characteristics for voice conversion. In 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No. 03EX721), pp. 706–711. Cited by: §1.
-  (2011) Acoustic transformations to improve the intelligibility of dysarthric speech. In Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, pp. 11–21. Cited by: §1.
-  (2019) Parallel vs. non-parallel voice conversion for esophageal speech. In INTERSPEECH, pp. 4549–4553. Cited by: §1.
-  (2016) Superseded: CSTR VCTK Corpus, English multi-speaker corpus for CSTR voice cloning toolkit. Cited by: §4.1.
-  (2018) Generalized end-to-end loss for speaker verification. In ICASSP, pp. 4879–4883. Cited by: §2.
-  (2021) Learning explicit prosody models and deep speaker embeddings for atypical voice conversion. arXiv preprint arXiv:2011.01678. Cited by: §1, §1, §2, §4.1.
-  (2020) End-to-end voice conversion via cross-modal knowledge distillation for dysarthric speech reconstruction. In ICASSP, pp. 7744–7748. Cited by: §1, §4.1.
-  (2012) Speech synthesis technologies for individuals with vocal disabilities: voice banking and reconstruction. Acoustical Science and Technology 33 (1), pp. 1–5. Cited by: §1, §1.
-  (2020) Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP, pp. 6199–6203. Cited by: §2.
-  (2008) Articulatory movements during vowels in speakers with dysarthria and healthy controls. Journal of Speech, Language, and Hearing Research (JSLHR) 51 (3), pp. 596–611. Cited by: §1.
-  (2012) Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §4.1.