Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation

by   Disong Wang, et al.
The Chinese University of Hong Kong

Dysarthric speech reconstruction (DSR), which aims to improve the quality of dysarthric speech, remains a challenge, not only because the speech must be restored to normal, but also because the speaker's identity must be preserved. The speaker representation extracted by the speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity. However, the SE may not be able to fully capture the characteristics of dysarthric speakers that are previously unseen. To address this research problem, we propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA). The primary task of ASA fine-tunes the SE with the speech of the target dysarthric speaker to effectively capture identity-related information, and the secondary task applies adversarial training to avoid the incorporation of abnormal speaking patterns into the reconstructed speech, by regularizing the distribution of reconstructed speech to be close to that of reference speech with high quality. Experiments show that the proposed approach achieves enhanced speaker similarity and speech naturalness comparable to that of a strong baseline approach. Compared with dysarthric speech, the reconstructed speech achieves 22.3% and 31.5% absolute word error rate reduction for speakers with moderate and moderate-severe dysarthria respectively. Our demo page is released here: https://wendison.github.io/ASA-DSR-demo/




1 Introduction

Dysarthria arises from various neurological disorders including Parkinson’s disease or amyotrophic lateral sclerosis, leading to weak regulation of articulators such as jaw, tongue, and lips [36]. Therefore, the resulting dysarthric speech may be perceived as harsh or breathy with abnormal prosody and inaccurate pronunciation, which degrades the efficiency of vocal communication for dysarthric patients. Attempts have been made to improve the quality of dysarthric speech by using various reconstruction approaches, where voice conversion (VC) serves as a promising candidate [34].

The goal of VC is to convert non-linguistic or para-linguistic factors such as speaker identity [23], prosody [27], emotion [2] and accent [20]. VC has also been widely applied in reconstructing different kinds of impaired speech including esophageal speech [10, 29], electrolaryngeal speech [25, 17], hearing-impaired speech [5] and dysarthric speech [34], where rule-based and statistical VC approaches have been investigated for dysarthric speech reconstruction (DSR). Rule-based VC tends to apply manually designed, speaker-dependent rules to correct phoneme errors or modify temporal and frequency features to improve intelligibility [28, 18]. Statistical VC automatically maps the features of dysarthric speech to those of normal speech, where typical approaches include Gaussian mixture models [14], non-negative matrix factorization [1, 4], partial least squares [3], and deep learning methods including sequence-to-sequence (seq2seq) models [33, 11, 7] and gated convolutional networks [6]. Though significant progress has been made, previous work generally ignores speaker identity preservation, so patients lose the ability to express their personality through the acoustic characteristics of their voice. Preserving the identities of dysarthric speakers is very challenging since their normal speech utterances are difficult to collect. A few studies [12, 32] use a speaker representation to control the speaker identity of reconstructed speech, where the speaker encoder (SE) proposed in our previous work [32] is trained on a speaker verification (SV) task by using large-scale normal speech. However, the SE may fail to effectively extract speaker representations from previously unseen dysarthric speech, which lowers the speaker similarity of reconstructed speech.

This paper proposes an improved DSR system based on [32] by using adversarial speaker adaptation (ASA). The DSR system in [32] contains four modules: (1) A speech encoder extracting accurate phoneme embeddings from dysarthric speech to restore the linguistic content; (2) A prosody corrector inferring normal prosody features that are treated as canonical values for correction; (3) A speaker encoder producing a single vector as speaker representation used to preserve the speaker identity; and (4) A speech generator mapping phoneme embeddings, prosody features and speaker representation to reconstructed mel-spectrograms. The speaker encoder and speech generator are independently trained by using large-scale normal speech data. We term the resulting integrated DSR system using the SV-based speaker encoder as SV-DSR, which can generate reconstructed speech with high intelligibility and naturalness. To better preserve the identity of the target dysarthric speaker during speech generation, speaker adaptation can be used to fine-tune the speaker encoder by using the dysarthric speech data. However, this approach inevitably incorporates dysarthric speaking patterns into the reconstructed speech. Hence, we propose to use ASA to alleviate this issue, and the resulting DSR system is termed ASA-DSR. For each dysarthric speaker, ASA-DSR is first cloned from SV-DSR and then adapted in a multi-task learning manner: (1) The primary task performs speaker adaptation to fine-tune the speaker encoder by using the dysarthric speech data to enhance the speaker similarity; (2) The secondary task performs adversarial training to alternately optimize the speaker encoder and a system discriminator through a min-max game on a discrimination loss that classifies whether the mel-spectrograms are reconstructed by ASA-DSR or SV-DSR. This forces the reconstructed speech from ASA-DSR to have a distribution close to that of SV-DSR, which is free of dysarthric speaking patterns, so that the reconstructed speech from ASA-DSR maintains stable prosody and improved intelligibility.

The main contribution of this paper is the proposed ASA approach, which effectively preserves the speaker identities of dysarthric patients after the reconstruction without using the patients' normal speech, which is nearly impossible to collect. Note that our work differs from [22], which aims to achieve robust speech recognition: the proposed ASA is used here to obtain regularized mel-spectrograms for generating high-quality speech with enhanced speaker similarity.

2 Baseline approach: SV-DSR

As shown in Fig. 1, our previously proposed SV-DSR system [32] contains four modules: speech encoder, prosody corrector, speaker encoder and speech generator. The first three modules respectively produce phoneme embeddings, prosody values and speaker representation; and the fourth module, the speech generator, maps these features to reconstructed mel-spectrograms.

Speech encoder: To recover the content, a seq2seq-based speech encoder is optimized by two-stage training to infer the phoneme sequence: (1) Pre-training on large-scale normal speech data; (2) Fine-tuning on the speech of a certain dysarthric speaker to achieve accurate phoneme prediction. The outputs of the pre-trained or fine-tuned speech encoder are used as phoneme embeddings that denote phoneme probability distributions.

Figure 1: The architecture of the SV-DSR system. ASA-DSR has the same architecture, except that the speaker encoder of SV-DSR is trained for an SV task on normal speech, while that of ASA-DSR is first initialized from the SV-based speaker encoder and then fine-tuned on the dysarthric speech via the proposed ASA.

Prosody corrector: As abnormal duration and pitch are two essential prosody factors that contribute to dysarthric speech [14], a prosody corrector is used to amend the abnormal prosody to a normal form. It contains two predictors that respectively infer normal phoneme duration and pitch (i.e., fundamental frequency ($F_0$)). The prosody corrector is trained on a healthy speaker's speech with normal prosodic patterns: (1) Given the phoneme embeddings extracted by the speech encoder as inputs, the phoneme duration predictor is trained to infer the normal phoneme durations that are obtained from forced alignment via the Montreal Forced Aligner toolkit [21]; (2) The ground-truth phoneme durations are used to align phoneme embeddings and $F_0$. As shown in Fig. 1, the expanded phoneme embeddings are denoted as p and fed into the pitch predictor to infer normal $F_0$, which is denoted by v. The prosody corrector is expected to take in phoneme embeddings extracted from dysarthric speech to infer normal values of phoneme duration and $F_0$, which can be used as canonical values to replace their abnormal counterparts for generating speech with normal prosodic patterns.
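The alignment in step (2) can be pictured as repeating each phoneme embedding for as many frames as its duration. The following is a minimal sketch, assuming durations are given as integer frame counts; the function name `expand_by_duration` and all toy values are hypothetical and are not taken from [32].

```python
from typing import List, Sequence


def expand_by_duration(embeddings: Sequence[Sequence[float]],
                       durations: Sequence[int]) -> List[List[float]]:
    """Repeat each phoneme embedding durations[i] times (frame-level expansion)."""
    if len(embeddings) != len(durations):
        raise ValueError("one duration per phoneme embedding is required")
    expanded: List[List[float]] = []
    for emb, dur in zip(embeddings, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        expanded.extend(list(emb) for _ in range(dur))
    return expanded


# Toy example: three phonemes with 2-dim embeddings (hypothetical values).
phoneme_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
dysarthric_durations = [6, 9, 4]  # slow, abnormal durations in frames
normal_durations = [3, 4, 2]      # e.g. what a duration predictor might return

p_dysarthric = expand_by_duration(phoneme_embeddings, dysarthric_durations)
p_normal = expand_by_duration(phoneme_embeddings, normal_durations)
print(len(p_dysarthric), len(p_normal))  # 19 9
```

Swapping the dysarthric durations for predicted normal ones changes only the number of repeated frames, which is how the corrected duration reaches the speech generator in this sketch.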

Speaker encoder: The speaker encoder, $f_{se}$, is trained on an SV task to capture speaker characteristics. $f_{se}$ takes in mel-spectrograms $m$ of one utterance with arbitrary length to produce a single vector as the speaker representation: $e = f_{se}(m)$. Following the training scheme in [20], $f_{se}$ is optimized to minimize a generalized end-to-end loss [31] by using normal speech data that is easily acquired from thousands of healthy speakers.

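Speech generator: The speech generator $f_g$ with parameters $\theta_g$ predicts mel-spectrograms as $\hat{m} = f_g(p, v, e; \theta_g)$, where $e$ is the speaker representation produced by the speaker encoder $f_{se}$. To generate normal speech, the speech generator is trained by using normal speech data from a set of healthy speakers $S$. Each speaker $s \in S$ has the training data set $T_s = \{(m_i, p_i, v_i)\}$, where each sample corresponds to one utterance and contains mel-spectrograms $m_i$, expanded phoneme embeddings $p_i$ and pitch features $v_i$. The speech generator is then optimized by minimizing the generation loss $L_g$, i.e., the L2-norm between the predicted mel-spectrograms $\hat{m}_i$ and the targets $m_i$:

$$L_g(\theta_g) = \sum_{s \in S} \sum_{(m_i, p_i, v_i) \in T_s} \left\| f_g\big(p_i, v_i, f_{se}(m_i); \theta_g\big) - m_i \right\|_2$$
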
During the reconstruction phase, the SV-DSR system takes in the dysarthric speech of speaker $d$, with mel-spectrograms $m^d$, to generate reconstructed mel-spectrograms as $\hat{m}^{sv} = f_g(p^n, v^n, e^{sv})$, where $p^n$ are phoneme embeddings extracted by the fine-tuned speech encoder and expanded with predicted normal duration, $v^n$ is the predicted normal pitch, and $e^{sv} = f_{se}(m^d)$ is the speaker representation. Then Parallel WaveGAN (PWG) [35] is adopted as the neural vocoder to transform $\hat{m}^{sv}$ to a speech waveform. SV-DSR is a strong baseline as it can generate speech with high intelligibility and naturalness. However, the speaker encoder is trained on normal speech, which limits its generalization ability to previously unseen dysarthric speech. Therefore, $f_{se}$ cannot effectively capture identity-related information of dysarthric speakers. In our experiments, we found that SV-DSR may even change the perceived gender of the speech after the reconstruction.
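As a small numeric illustration of the generation loss described for the speech generator, the sketch below computes the L2-norm between a predicted and a target mel-spectrogram for one utterance. It assumes the norm is taken over all frames and mel bins of the utterance; the function name `l2_norm_loss` and the toy matrices are hypothetical.

```python
import math
from typing import Sequence


def l2_norm_loss(predicted: Sequence[Sequence[float]],
                 target: Sequence[Sequence[float]]) -> float:
    """L2-norm of the difference between two (frames x mel bins) matrices."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of frames")
    squared_error = 0.0
    for pred_frame, tgt_frame in zip(predicted, target):
        if len(pred_frame) != len(tgt_frame):
            raise ValueError("frames must have the same number of mel bins")
        squared_error += sum((a - b) ** 2 for a, b in zip(pred_frame, tgt_frame))
    return math.sqrt(squared_error)


# Toy utterance with 2 frames and 2 mel bins (hypothetical values).
predicted_mel = [[1.0, 2.0], [3.0, 4.0]]
target_mel = [[1.0, 5.0], [7.0, 4.0]]
print(l2_norm_loss(predicted_mel, target_mel))  # 5.0
```

In training, this per-utterance value would be accumulated over the utterances of all healthy speakers before the generator parameters are updated.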

3 Proposed approach: ASA-DSR

The proposed approach of adversarial speaker adaptation (ASA), as illustrated in Fig. 2, aims to enhance speaker similarity, resulting in the proposed ASA-DSR system that shares the same modules as SV-DSR except for the speaker encoder. First, ASA-DSR is cloned from SV-DSR; then a system discriminator $D$ is introduced to determine whether its input mel-spectrograms are reconstructed by the SV-DSR or the ASA-DSR system. Consider a dysarthric speaker $d$ with the adaptation data set $T_d = \{(m_i^d, p_i^d, v_i^d)\}$, where each element corresponds to one dysarthric utterance with mel-spectrograms $m_i^d$, $p_i^d$ are phoneme embeddings extracted by the fine-tuned speech encoder and expanded with dysarthric duration, and $v_i^d$ is the dysarthric pitch. Their normal counterparts can be obtained via the prosody corrector as $p_i^n$ and $v_i^n$, respectively. SV-DSR and ASA-DSR generate reconstructed mel-spectrograms as $\hat{m}_i^{sv}$ and $\hat{m}_i^{asa}$ respectively:

$$\hat{m}_i^{sv} = f_g(p_i^n, v_i^n, e_i^{sv}), \qquad \hat{m}_i^{asa} = f_g(p_i^n, v_i^n, e_i^{asa})$$

where $e_i^{sv} = f_{se}(m_i^d)$ and $e_i^{asa} = f_{se}^{asa}(m_i^d)$ are respectively produced from the speaker encoders $f_{se}$ (from SV-DSR) and $f_{se}^{asa}$ (from ASA-DSR) to control the speaker identity. Besides, ASA-DSR predicts dysarthric mel-spectrograms as $\hat{m}_i^{d}$ used for adaptation:

$$\hat{m}_i^{d} = f_g(p_i^d, v_i^d, e_i^{asa})$$

Then the speaker encoder $f_{se}^{asa}$ of ASA-DSR and the discriminator $D$ are alternately optimized with the remaining networks frozen. On one hand, $D$ is optimized to minimize the discrimination loss $L_d$:

$$L_d = -\sum_{i} \left[ \log D(\hat{m}_i^{sv}) + \log\big(1 - D(\hat{m}_i^{asa})\big) \right]$$

where $D(\cdot)$ is the posterior probability of mel-spectrograms being reconstructed by SV-DSR. On the other hand, $f_{se}^{asa}$ is optimized to minimize the multi-task learning (MTL) loss $L_{mtl}$:

$$L_{mtl} = L_a - \lambda L_d, \qquad L_a = \sum_{i} \left\| \hat{m}_i^{d} - m_i^d \right\|_2$$

where $\lambda$ is set to 1 empirically. The primary task minimizes the adaptation loss $L_a$ to force the speaker encoder $f_{se}^{asa}$ to effectively capture speaker characteristics from the dysarthric speech, so that enhanced speaker similarity can be achieved in the reconstructed mel-spectrograms $\hat{m}^{asa}$. The secondary task maximizes the discrimination loss $L_d$ to force $\hat{m}^{asa}$ to have a distribution similar to that of $\hat{m}^{sv}$, which has high intelligibility and naturalness; this helps $\hat{m}^{asa}$ maintain normal pronunciation patterns as $\hat{m}^{sv}$ does. As a result, the proposed ASA-DSR preserves the capacity of SV-DSR to reconstruct high-quality speech, while achieving improved capacity for preserving the speaker identity of the target dysarthric speaker $d$.
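The alternating objectives described above can be made concrete with a small numeric sketch. It assumes the discrimination loss takes the usual binary cross-entropy form over the discriminator's posterior that a mel-spectrogram was reconstructed by SV-DSR, averaged over a batch; the function names and the toy probabilities are hypothetical.

```python
import math
from typing import Sequence


def discrimination_loss(d_on_sv: Sequence[float],
                        d_on_asa: Sequence[float],
                        eps: float = 1e-12) -> float:
    """Binary cross-entropy: D should output ~1 for SV-DSR mels and ~0 for ASA-DSR mels."""
    loss_sv = -sum(math.log(max(p, eps)) for p in d_on_sv) / len(d_on_sv)
    loss_asa = -sum(math.log(max(1.0 - p, eps)) for p in d_on_asa) / len(d_on_asa)
    return loss_sv + loss_asa


def mtl_loss(adaptation_loss: float, disc_loss: float, lam: float = 1.0) -> float:
    """Speaker-encoder objective: minimize adaptation loss, maximize discrimination loss."""
    return adaptation_loss - lam * disc_loss


# Discriminator posteriors (hypothetical). "Confident": D separates the two systems.
confident = discrimination_loss(d_on_sv=[0.9, 0.8], d_on_asa=[0.1, 0.2])
# "Confused": D cannot tell ASA-DSR output from SV-DSR output.
confused = discrimination_loss(d_on_sv=[0.5, 0.5], d_on_asa=[0.5, 0.5])

adaptation = 2.0  # hypothetical adaptation loss value
print(confident < confused)                                        # True
print(mtl_loss(adaptation, confused) < mtl_loss(adaptation, confident))  # True
```

With the adaptation loss held fixed, the speaker encoder's objective is lower when the discriminator is confused, which is the pressure that pulls the ASA-DSR output distribution towards that of SV-DSR.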

Figure 2: Diagram of ASA. $m^d$ is the mel-spectrogram of dysarthric speech, $p^d$ is the phoneme embedding expanded with dysarthric duration, and $v^d$ is the pitch of dysarthric speech; their normal counterparts $p^n$ and $v^n$ are obtained via the prosody corrector. GRL is a gradient reversal layer that passes the data unchanged during forward propagation and inverts the sign of the gradient during backward propagation. Only the parameters of the ASA-DSR speaker encoder and the system discriminator are updated during the ASA process.
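The behaviour of the gradient reversal layer in Fig. 2 can be sketched without any deep learning framework. The class below is a hypothetical, hand-written forward/backward pair for illustration only; a real system would implement the same idea as a custom autograd operation in its framework of choice.

```python
from typing import List, Sequence


class GradientReversal:
    """Identity in the forward pass; flips (and optionally scales) gradients in the backward pass."""

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale

    def forward(self, x: Sequence[float]) -> List[float]:
        # Data passes through unchanged.
        return list(x)

    def backward(self, grad_output: Sequence[float]) -> List[float]:
        # The sign of the incoming gradient is inverted.
        return [-self.scale * g for g in grad_output]


grl = GradientReversal(scale=1.0)
features = [1.0, -2.0, 0.5]       # hypothetical mel-spectrogram values
upstream_grad = [0.5, -0.25, 0.0]  # hypothetical gradient of the discrimination loss

print(grl.forward(features))        # [1.0, -2.0, 0.5]
print(grl.backward(upstream_grad))  # [-0.5, 0.25, -0.0]
```

Because the gradient changes sign at this layer, one backward pass lets the discriminator descend on the discrimination loss while the networks before the layer ascend on it.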

4 Experiments

4.1 Experimental Settings

The datasets used in our experiments include LibriSpeech [26], VCTK [30], VoxCeleb1 [24], VoxCeleb2 [9], LJSpeech [13] and UASPEECH [15]. The speech encoder is pre-trained on the 960h training data of LibriSpeech, the prosody corrector is trained on the data of a healthy female speaker from LJSpeech, the speaker encoder is trained on LibriSpeech, VoxCeleb1 and VoxCeleb2 with around 8.5K healthy speakers, and the speech generator and PWG vocoder are trained on VCTK. For dysarthric speech, two male speakers (M05, M07) and two female speakers (F04, F02) are selected from UASPEECH, where M05/F04 and M07/F02 have moderate and moderate-severe dysarthria respectively. We use the speech data of blocks 1 and 3 of each dysarthric speaker for fine-tuning the speech encoder and for ASA, and block 2 for testing. The inputs of the speech encoder are 40-dim mel-spectrograms appended with deltas and delta-deltas, which results in 120-dim vectors; the targets of the speech generator are 80-dim mel-spectrograms. All mel-spectrograms are computed with a 400-point Fourier transform, 25ms Hanning window and 10ms hop length. $F_0$ is extracted by the Pyworld toolkit (https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder) with the 10ms hop length. To stabilize the training and inference of the $F_0$ predictor, we adopt the logarithmic scale of $F_0$. All acoustic features are normalized to have zero mean and unit variance.
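The framing and normalization settings above can be checked with a few lines of arithmetic. This is a minimal sketch under stated assumptions: a 16 kHz sampling rate (not given in the text, but it makes a 25ms window equal to 400 samples), no padding when counting frames, and log-scaling applied to voiced frames only. All names and the toy $F_0$ track are hypothetical.

```python
import math
import statistics
from typing import List

SAMPLE_RATE = 16000                    # assumption: not stated in the text
WIN_LENGTH = SAMPLE_RATE * 25 // 1000  # 25 ms window -> 400 samples
HOP_LENGTH = SAMPLE_RATE * 10 // 1000  # 10 ms hop -> 160 samples
ENCODER_INPUT_DIM = 40 * 3             # 40 mel bins + deltas + delta-deltas


def num_frames(num_samples: int) -> int:
    """Number of analysis frames without padding (one common convention)."""
    if num_samples < WIN_LENGTH:
        return 0
    return 1 + (num_samples - WIN_LENGTH) // HOP_LENGTH


def standardize(values: List[float]) -> List[float]:
    """Normalize to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


f0_track_hz = [0.0, 110.0, 120.0, 0.0, 130.0]             # 0.0 marks unvoiced frames
voiced_log_f0 = [math.log(f) for f in f0_track_hz if f > 0]  # log-scale F0, voiced only
normalized_log_f0 = standardize(voiced_log_f0)

print(WIN_LENGTH, HOP_LENGTH, ENCODER_INPUT_DIM)  # 400 160 120
print(num_frames(SAMPLE_RATE))                    # 98
```

In practice the mean and standard deviation would be estimated on the training data and reused at inference time rather than recomputed per utterance.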

The speech encoder, prosody corrector, speaker encoder and speech generator adopt the same architectures as in [32], where the speaker encoder contains a 3-layer 256-dim LSTM followed by one fully-connected layer to obtain the 256-dim vector that is L2-normalized as the speaker representation [20]. The pre-training and fine-tuning of the speech encoder are performed by the Adadelta optimizer [37] with 1M and 2K steps respectively, using a learning rate of 1 and a batch size of 8. Both the duration and $F_0$ predictors are trained by the Adam optimizer [16] with 30K steps, using a learning rate of 1e-3 and a batch size of 16; the speech generator is optimized in a similar way except that the training steps are set to 50K. The training of the speaker encoder on normal speech follows the scheme in [20]. The convolution-based discriminator of StarGAN [8] is used as the system discriminator and alternately trained with the speaker encoder during ASA for 5K steps. Four DSR systems are compared: (1) SV-DSR; (2) ASA-DSR; (3) SA-DSR, which is an ablation system that performs speaker adaptation similar to ASA-DSR but without adversarial training; and (4) E2E-VC [33], which is an end-to-end DSR model via cross-modal knowledge distillation, where the speaker encoder used in SV-DSR is added to control the speaker identity.

Approaches   M05          F04          M07          F02
Original     4.93 ± 0.01  4.89 ± 0.02  4.95 ± 0.01  4.96 ± 0.01
E2E-VC       2.66 ± 0.12  2.50 ± 0.13  2.47 ± 0.16  2.27 ± 0.14
SV-DSR       2.70 ± 0.14  2.27 ± 0.10  2.55 ± 0.14  1.88 ± 0.13
SA-DSR       3.26 ± 0.09  3.04 ± 0.12  3.25 ± 0.15  2.99 ± 0.15
ASA-DSR      3.27 ± 0.10  3.16 ± 0.15  3.20 ± 0.13  2.93 ± 0.15
Table 1: Comparison Results of MOS with 95% Confidence Intervals for Speaker Similarity.

4.2 Experimental Results and Analysis

4.2.1 Comparison Based on Speaker Similarity

Subjective tests are conducted to evaluate the speaker similarity of reconstructed speech, in terms of 5-point mean opinion score (MOS, 1-bad, 2-poor, 3-fair, 4-good, 5-excellent) rated by 20 subjects for 20 utterances randomly selected from each of the four dysarthric speakers; the scores are averaged and shown in Table 1. E2E-VC and SV-DSR, which use the SV-based speaker encoder to control the speaker identity, achieve lower speaker similarity. In our listening tests, the gender of the speech reconstructed by E2E-VC and SV-DSR may be changed, especially for female speakers. This shows the limited generalization ability of the SV-based speaker encoder to extract effective speaker representations from dysarthric speech. However, with speaker adaptation to fine-tune the speaker encoder, both SA-DSR and ASA-DSR can accurately preserve the gender with improved speaker similarity, showing the necessity of using dysarthric speech data to fine-tune the speaker encoder to effectively capture identity-related information of dysarthric speech.
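To make the scoring procedure concrete, the sketch below first computes a MOS with a 95% confidence interval from a list of ratings, and then derives the average speaker similarity gain of ASA-DSR over SV-DSR from the means in Table 1. The confidence interval uses a normal approximation (z = 1.96), which is an assumption since the exact method is not stated; the ratings list is hypothetical, while the table values are copied from Table 1.

```python
import math
import statistics
from typing import Dict, List, Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean opinion score, half-width of the approximate 95% CI)."""
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


ratings = [3, 4, 3, 2, 4, 3, 3, 4, 3, 3]  # hypothetical listener scores
mos, ci_half_width = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} ± {ci_half_width:.2f}")

# Mean speaker similarity scores copied from Table 1.
table1: Dict[str, Dict[str, float]] = {
    "SV-DSR": {"M05": 2.70, "F04": 2.27, "M07": 2.55, "F02": 1.88},
    "ASA-DSR": {"M05": 3.27, "F04": 3.16, "M07": 3.20, "F02": 2.93},
}


def mean_gain(speakers: List[str]) -> float:
    """Average MOS gain of ASA-DSR over SV-DSR for a group of speakers."""
    gains = [table1["ASA-DSR"][s] - table1["SV-DSR"][s] for s in speakers]
    return statistics.fmean(gains)


print(round(mean_gain(["M05", "F04"]), 2))  # 0.73 (moderate dysarthria)
print(round(mean_gain(["M07", "F02"]), 2))  # 0.85 (moderate-severe dysarthria)
```

The two group averages match the 0.73 and 0.85 absolute MOS improvements quoted in the conclusions.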

Approaches   M05          F04          M07          F02
Original     2.37 ± 0.08  2.49 ± 0.09  1.95 ± 0.10  1.79 ± 0.09
E2E-VC       3.64 ± 0.11  3.40 ± 0.13  3.58 ± 0.12  3.35 ± 0.12
SV-DSR       3.88 ± 0.11  3.92 ± 0.10  3.80 ± 0.10  3.79 ± 0.09
SA-DSR       3.56 ± 0.09  3.22 ± 0.14  3.67 ± 0.11  3.38 ± 0.12
ASA-DSR      3.84 ± 0.09  3.86 ± 0.12  3.79 ± 0.09  3.75 ± 0.11
Table 2: Comparison Results of MOS with 95% Confidence Intervals for Speech Naturalness.

4.2.2 Comparison Based on Speech Naturalness

Table 2 gives the MOS results for the naturalness of original and reconstructed speech from different systems. We can see that all DSR systems improve the naturalness of the original dysarthric speech, and SV-DSR achieves the highest speech naturalness scores for all speakers, which shows the effectiveness of explicit prosody correction in generating speech with stable and accurate prosody. With speaker adaptation but without adversarial training, SA-DSR achieves smaller naturalness improvements, because dysarthric pronunciation patterns are partially incorporated into the reconstructed speech. This issue can be effectively alleviated by using the proposed ASA to align the statistical distributions of reconstructed speech from ASA-DSR and SV-DSR, which enables ASA-DSR to generate high-quality speech with naturalness comparable to that of SV-DSR.

Approaches   M05          F04          M07          F02
Original     91.0         81.7         95.6         95.9
E2E-VC       69.8 (21.2)  69.3 (12.4)  73.1 (22.5)  72.0 (23.9)
SV-DSR       61.7 (29.3)  64.6 (17.1)  62.7 (32.9)  65.3 (30.6)
SA-DSR       69.6 (21.4)  70.0 (11.7)  67.8 (27.8)  67.2 (28.7)
ASA-DSR      62.5 (28.5)  65.6 (16.1)  62.7 (32.9)  65.8 (30.1)
Table 3: WER (Δ) (%) Results Comparison, Where 'Δ' Denotes the WER Reduction of Different Approaches Compared with Original Dysarthric Speech.

4.2.3 Comparison Based on Speech Intelligibility

Objective evaluation of speech intelligibility is conducted by using a publicly released speech recognition model, i.e., Jasper [19], to measure the word error rate (WER) with greedy decoding, and the results are shown in Table 3. Compared with the original dysarthric speech, SV-DSR achieves the largest WER reduction for all dysarthric speakers, showing the effectiveness of prosody correction in improving speech intelligibility. Compared with SV-DSR, its adapted version without adversarial training, i.e., SA-DSR, has a smaller WER reduction, which is caused by the incorporation of dysarthric speaking characteristics into the reconstructed speech. However, with the proposed ASA to alleviate this issue, ASA-DSR outperforms E2E-VC and SA-DSR and matches the performance of SV-DSR, leading to 22.3% and 31.5% absolute WER reduction on average for speakers M05/F04 and M07/F02, who have moderate and moderate-severe dysarthria respectively.
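For readers unfamiliar with the metric, the sketch below computes WER as the word-level edit distance divided by the reference length, and then reproduces the averaged absolute WER reductions of ASA-DSR from the values in Table 3. The `wer` function is a generic textbook implementation, not the scoring script used with Jasper, and the example sentences are hypothetical.

```python
from typing import Dict, List


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: (substitutions + deletions + insertions) / reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic programming over edit distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution and one deletion against a 6-word reference.
example_wer = wer("the cat sat on the mat", "the cat sit on mat")

# WER values (%) copied from Table 3.
original: Dict[str, float] = {"M05": 91.0, "F04": 81.7, "M07": 95.6, "F02": 95.9}
asa_dsr: Dict[str, float] = {"M05": 62.5, "F04": 65.6, "M07": 62.7, "F02": 65.8}


def mean_reduction(speakers: List[str]) -> float:
    """Average absolute WER reduction of ASA-DSR over the original speech."""
    return sum(original[s] - asa_dsr[s] for s in speakers) / len(speakers)


print(round(mean_reduction(["M05", "F04"]), 1))  # 22.3
print(round(mean_reduction(["M07", "F02"]), 1))  # 31.5
```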

Figure 3: AB preference test results with 95% confidence intervals for different combinations of phoneme duration and $F_0$, where 'GG' denotes Ground-truth duration and Ground-truth $F_0$, 'GP' denotes Ground-truth duration and Predicted $F_0$, and 'PP' denotes Predicted duration and Predicted $F_0$.

4.2.4 Influence of Phoneme Duration and $F_0$

We also conduct an ablation study to investigate how the phoneme duration and $F_0$ influence the quality of speech reconstructed by the proposed ASA-DSR system. Three combinations of phoneme duration and $F_0$ are used to generate the speech. We perform AB preference tests, where listeners are required to select the utterance that sounds more normal, i.e., with more stable prosody and more precise articulation, from two utterances generated by two different combinations. The results are illustrated in Fig. 3. For the comparison 'GG vs. GP' (i.e., Ground-truth duration and $F_0$ versus Ground-truth duration and Predicted $F_0$), for all speakers more reconstructed speech samples are favored when using the predicted normal $F_0$ (p-values < 0.05). For the comparison 'GP vs. PP' (i.e., Ground-truth duration and Predicted $F_0$ versus Predicted duration and $F_0$), using the predicted normal duration can significantly improve speech quality, especially for speakers M05, M07 and F02, who have abnormally slow speaking speed. This shows that both phoneme duration and $F_0$ affect speech normality, and that the prosody corrector in ASA-DSR derives normal values of phoneme duration and $F_0$, which facilitate the reconstruction of speech with normal prosodic patterns.
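One way to obtain a p-value for an AB preference comparison is an exact two-sided binomial test against the null hypothesis that both options are equally likely to be chosen. The text does not state which test was used, so the sketch below is an assumption; the vote counts are hypothetical and "no preference" answers are ignored.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided p-value for k of n votes under H0: P(prefer A) = 0.5."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    probs = [comb(n, i) * 0.5 ** n for i in range(n + 1)]
    observed = probs[k]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one.
    tail = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


votes_for_predicted_f0 = 16  # hypothetical count out of 20 comparisons
total_votes = 20
p_value = binomial_two_sided_p(votes_for_predicted_f0, total_votes)
print(p_value < 0.05)  # True
```

A 16-to-4 split in 20 comparisons would be significant at the 0.05 level under this test, whereas a 10-to-10 split gives a p-value of 1.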

5 Conclusions

This paper presents a DSR system based on a novel multi-task learning strategy, i.e., ASA, to simultaneously preserve the speaker identity and maintain high speech quality. This is achieved by a primary task (i.e., speaker adaptation) to facilitate the speaker encoder to capture speaker characteristics from the dysarthric speech, and a secondary task (i.e., adversarial training) to avoid the incorporation of dysarthric speaking patterns into reconstructed speech. Experiments show that the proposed ASA-DSR can effectively achieve dysarthria reductions with improved naturalness and intelligibility, while speaker identity can be effectively maintained with 0.73 and 0.85 absolute MOS improvements of speaker similarity over the strong baseline SV-DSR, for speakers with moderate and moderate-severe dysarthria respectively.

6 Acknowledgements

This research is supported partially by the HKSAR Research Grants Council’s General Research Fund (Ref Number 14208817) and also partially by the Centre for Perceptual and Interactive Intelligence, a CUHK InnoCentre.


  • [1] R. Aihara, R. Takashima, T. Takiguchi, and Y. Ariki (2012) Consonant enhancement for articulation disorders based on non-negative matrix factorization. In Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, pp. 1–4. Cited by: §1.
  • [2] R. Aihara, R. Takashima, T. Takiguchi, and Y. Ariki (2012) GMM-based emotional voice conversion using spectrum and prosody features. American Journal of Signal Processing 2 (5), pp. 134–138. Cited by: §1.
  • [3] R. Aihara, T. Takiguchi, and Y. Ariki (2017) Phoneme-discriminative features for dysarthric speech conversion. In INTERSPEECH, pp. 3374–3378. Cited by: §1.
  • [4] R. Aihara, R. Takashima, T. Takiguchi, and Y. Ariki (2013) Individuality-preserving voice conversion for articulation disorders based on non-negative matrix factorization. In ICASSP, pp. 8037–8040. Cited by: §1.
  • [5] F. Biadsy, R. J. Weiss, P. J. Moreno, D. Kanevsky, and Y. Jia (2019) Parrotron: an end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation. arXiv preprint arXiv:1904.04169. Cited by: §1.
  • [6] C. Chen, W. Zheng, S. Wang, Y. Tsao, P. Li, and Y. Li (2020) Enhancing intelligibility of dysarthric speech using gated convolutional-based voice conversion system. INTERSPEECH. Cited by: §1.
  • [7] Z. Chen, B. Ramabhadran, F. Biadsy, X. Zhang, Y. Chen, L. Jiang, F. Chu, R. Doshi, and P. J. Moreno (2021) Conformer Parrotron: A Faster and Stronger End-to-End Speech Conversion and Recognition Model for Atypical Speech. In INTERSPEECH, pp. 4828–4832. Cited by: §1.
  • [8] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797. Cited by: §4.1.
  • [9] J. S. Chung, A. Nagrani, and A. Zisserman (2018) Voxceleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622. Cited by: §4.1.
  • [10] H. Doi, K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano (2010) Esophageal speech enhancement based on statistical voice conversion with gaussian mixture models. IEICE TRANSACTIONS on Information and Systems 93 (9), pp. 2472–2482. Cited by: §1.
  • [11] R. Doshi, Y. Chen, L. Jiang, X. Zhang, F. Biadsy, B. Ramabhadran, F. Chu, A. Rosenberg, and P. J. Moreno (2021) Extending parrotron: an end-to-end, speech conversion and speech recognition model for atypical speech. In ICASSP, pp. 6988–6992. Cited by: §1.
  • [12] W. Huang, K. Kobayashi, Y. Peng, C. Liu, Y. Tsao, H. Wang, and T. Toda (2021) A preliminary study of a two-stage paradigm for preserving speaker identity in dysarthric voice conversion. arXiv preprint arXiv:2106.01415. Cited by: §1.
  • [13] K. Ito et al. (2017) The lj speech dataset. Cited by: §4.1.
  • [14] A. B. Kain, J. Hosom, X. Niu, J. P. Van Santen, M. Fried-Oken, and J. Staehely (2007) Improving the intelligibility of dysarthric speech. Speech communication 49 (9), pp. 743–759. Cited by: §1, §2.
  • [15] H. Kim, M. Hasegawa-Johnson, A. Perlman, J. Gunderson, T. S. Huang, K. Watkin, and S. Frame (2008) Dysarthric speech database for universal access research. In Ninth Annual Conference of the International Speech Communication Association, Cited by: §4.1.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [17] K. Kobayashi and T. Toda (2018) Electrolaryngeal speech enhancement with statistical voice conversion based on cldnn. In 2018 26th European Signal Processing Conference (EUSIPCO), pp. 2115–2119. Cited by: §1.
  • [18] S. A. Kumar and C. S. Kumar (2016) Improving the intelligibility of dysarthric speech towards enhancing the effectiveness of speech therapy. In 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 1000–1005. Cited by: §1.
  • [19] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde (2019) Jasper: an end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288. Cited by: §4.2.3.
  • [20] S. Liu, D. Wang, Y. Cao, L. Sun, X. Wu, S. Kang, Z. Wu, X. Liu, D. Su, D. Yu, et al. (2020) End-to-end accent conversion without using native utterances. In ICASSP, pp. 6289–6293. Cited by: §1, §2, §4.1.
  • [21] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger (2017) Montreal forced aligner: trainable text-speech alignment using kaldi. In INTERSPEECH, Vol. 2017, pp. 498–502. Cited by: §2.
  • [22] Z. Meng, J. Li, and Y. Gong (2019) Adversarial speaker adaptation. In ICASSP, pp. 5721–5725. Cited by: §1.
  • [23] S. H. Mohammadi and A. Kain (2017) An overview of voice conversion systems. Speech Communication 88, pp. 65–82. Cited by: §1.
  • [24] A. Nagrani, J. S. Chung, and A. Zisserman (2017) Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612. Cited by: §4.1.
  • [25] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano (2012) Speaking-aid systems using gmm-based voice conversion for electrolaryngeal speech. Speech Communication 54 (1), pp. 134–146. Cited by: §1.
  • [26] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In ICASSP, pp. 5206–5210. Cited by: §4.1.
  • [27] D. Rentzos, S. Vaseghi, E. Turajlic, Q. Yan, and C. Ho (2003) Transformation of speaker characteristics for voice conversion. In 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No. 03EX721), pp. 706–711. Cited by: §1.
  • [28] F. Rudzicz (2011) Acoustic transformations to improve the intelligibility of dysarthric speech. In Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, pp. 11–21. Cited by: §1.
  • [29] L. Serrano, S. Raman, D. Tavarez, E. Navas, and I. Hernaez (2019) Parallel vs. non-parallel voice conversion for esophageal speech. In INTERSPEECH, pp. 4549–4553. Cited by: §1.
  • [30] C. Veaux, J. Yamagishi, K. MacDonald, et al. (2016) Superseded-cstr vctk corpus: english multi-speaker corpus for cstr voice cloning toolkit. Cited by: §4.1.
  • [31] L. Wan, Q. Wang, A. Papir, and I. L. Moreno (2018) Generalized end-to-end loss for speaker verification. In ICASSP, pp. 4879–4883. Cited by: §2.
  • [32] D. Wang, S. Liu, L. Sun, X. Wu, X. Liu, and H. Meng (2021) Learning explicit prosody models and deep speaker embeddings for atypical voice conversion. arXiv preprint arXiv:2011.01678. Cited by: §1, §1, §2, §4.1.
  • [33] D. Wang, J. Yu, X. Wu, S. Liu, L. Sun, X. Liu, and H. Meng (2020) End-to-end voice conversion via cross-modal knowledge distillation for dysarthric speech reconstruction. In ICASSP, pp. 7744–7748. Cited by: §1, §4.1.
  • [34] J. Yamagishi, C. Veaux, S. King, and S. Renals (2012) Speech synthesis technologies for individuals with vocal disabilities: voice banking and reconstruction. Acoustical Science and Technology 33 (1), pp. 1–5. Cited by: §1, §1.
  • [35] R. Yamamoto, E. Song, and J. Kim (2020) Parallel wavegan: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP, pp. 6199–6203. Cited by: §2.
  • [36] Y. Yunusova, G. Weismer, J. Westbury, and M. Lindstrom (2008) Articulatory movements during vowels in speakers with dysarthria and healthy controls. Journal of Speech, Language, and Hearing Research: JSLHR 51 (3), pp. 596–611. Cited by: §1.
  • [37] M. D. Zeiler (2012) Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §4.1.