Significance of Data Augmentation for Improving Cleft Lip and Palate Speech Recognition

10/02/2021 ∙ by Protima Nomo Sudro, et al. ∙ IIT Dharwad

The automatic recognition of pathological speech, particularly from children with any articulatory impairment, is a challenging task for various reasons. The lack of available domain-specific data is one such obstacle that hinders its usage in different speech-based applications targeting pathological speakers. In line with this challenge, in this work we investigate a few data augmentation techniques that simulate training data for improving children's speech recognition, considering the case of cleft lip and palate (CLP) speech. The augmentation techniques explored in this study include vocal tract length perturbation (VTLP), reverberation, speaking rate modification, pitch modification, and speech feature modification using cycle-consistent adversarial networks (CycleGAN). Our study finds that the data augmentation methods significantly improve CLP speech recognition performance, which is most evident for the CycleGAN-based feature modification, VTLP, and reverberation-based methods. More specifically, the results show that our systems produce an improved phone error rate compared to systems without data augmentation.







1 Introduction

Disordered speech collectively refers to speech that deviates in both intelligibility and quality [19, 28]. The deviations may be caused by structural and functional deformation of the articulatory system [11]. Such deviations reduce speech understandability for unfamiliar listeners, creating a communication gap between impaired and normal speakers [38]. This mismatch in acoustic characteristics between normal and impaired speech makes the automatic recognition of disordered speech a challenging task. Besides the acoustic mismatch, the types of speech distortion, such as hypernasality and articulation errors, and the varying severity levels of these distortions among pathological speakers impose additional challenges for disordered speech recognition [22, 37, 41, 33].

Various studies [43, 6, 16] in the automatic speech recognition (ASR) literature highlight that a model built using a large amount of speech data often achieves high recognition accuracy and robust performance. However, when it comes to pathological speech, collecting a large amount of data is very challenging: the speakers find it difficult to speak for long periods, and sufficient numbers of speakers for a particular pathology are hard to find. In addition, the time alignment of the collected data is painstaking. On account of data scarcity and acoustic variation, disordered speech recognition systems yield relatively inferior performance compared with normal speech recognition systems.

In the context of normal ASR systems, data augmentation approaches have been found effective for dealing with data scarcity as well as for increasing robustness under adverse noisy conditions [27]. ASR systems have played an important role in various speech-based applications in recent years, particularly in smart-home devices operated by voice commands and dialogues [1, 23]. However, as most of these applications are trained with speech data from normal speakers, their practical usability for pathological speakers is limited. To deal with the data scarcity and domain mismatch issues, data augmentation approaches have been exploited for improving the performance of disordered speech recognition [10, 14]. Among these, dysarthric speech has gained relatively more attention in recent years. Explorations of cleft lip and palate (CLP) speech in the context of improving ASR remain limited, which motivates this study.

CLP is a congenital disorder affecting the craniofacial region, including the articulatory system [19, 11]. Due to the impaired articulatory system, different speech disorders are noted, broadly categorized into hypernasality, hyponasality, articulation errors (misarticulation), and voice disorders [19, 28]. These speech disorders are caused by velopharyngeal dysfunction, oro-nasal fistulae, and mislearning [24]. They impact the intelligibility and quality of CLP speech depending on the severity of the disorder [19, 31, 22].

The impairment in the articulatory system affects the acoustic characteristics of the speech. Individuals with CLP exhibit deviant burst evidence, formant transitions, and spectral attributes compared to normal speech. The acoustic characteristics of speech sounds are also distorted by the unintentional production of nasalized vowels and nasal cognates, as certain voiced stops share the same place of articulation with nasal consonants [12]. When such distorted speech is input to a speech system, the recognition accuracy decreases significantly. Speech systems trained on normal speech typically ignore the unnatural variations of pathological speech [42]. However, such features are paramount in pathological speech and are not encoded in the statistical frameworks of these systems [30].

The distortions in CLP speech can be corrected using clinical interventions, namely surgery, prosthetics, and speech therapy. Despite the benefits of interdisciplinary clinical teams, and owing to the adverse circumstances of speakers with CLP, many speakers still produce deviant speech compared to normal speech. Speakers whose speech distortions cannot be improved may face difficulty in using various speech-based applications effectively. Hence, besides clinical intervention, approaches from signal processing research can be explored for improving the usability of speech-based applications. Accordingly, this study is devoted to generating variations of CLP speech samples using various data augmentation techniques and then identifying the methods with the highest impact on ASR performance.

The rest of the paper is organized as follows. Section 2 discusses the literature on data augmentation studies carried out for improving the performance of normal and disordered speech recognition. In Section 3, we explain the details of the various data augmentation methods explored in this work. Section 4 presents the experimental setup of the studies. The observations and results of the conducted studies are reported in Section 5. Finally, Section 6 concludes this work.

2 Related Work

Data augmentation refers to the process of generating variations of the input data to increase the number of examples in a given dataset. It has proved effective in both the speech and image processing domains [4, 18, 20]. Several studies report the use of data augmentation to deal with data scarcity in low-resource speech tasks. It has also been exploited in useful speech processing applications such as ASR, transcription of multi-genre media archives, children's speech recognition, anti-spoofing, and the training of large-scale complex models [2, 32, 13, 7, 8]. The literature covers a wide range of data augmentation techniques in the context of speech processing, namely vocal tract length perturbation, spectral distortion, voice conversion based on adversarial training, speaking rate modification, pitch modification, stochastic feature mapping, spectrogram deformations, text-to-speech data augmentation, and pseudo-label augmentation [13, 32, 14, 16, 6, 27, 35].

Motivated by the capability of data augmentation methods in dealing with data scarcity and attaining improved accuracy, researchers have explored the same for disordered speech recognition systems [10, 14]. Most of these studies addressed dysarthric speech recognition, where variations of the distorted speech are generated using some of the augmentation techniques stated above. In those studies, the speech characteristics of normal speakers are transformed to those of disordered speakers. In addition, the augmentation techniques were tailored to the disordered speech phenomena; for example, the moderate speaking rate of normal speakers is matched to the slow rate of disordered speakers. In some cases, the normal speakers' pitch is linearly transformed to the reduced pitch variation of the disordered speakers. Similarly, other characteristics of disordered speakers are also considered while transforming normal speech into disordered speech [39]. All these studies suggest that data augmentation approaches result in improved recognition performance for disordered speech.

Along a similar direction, this study intends to exploit data augmentation approaches and transform the characteristics of normal children's speech based on the deviant acoustic characteristics of CLP speech. We note that prominent distortions such as nasalization and articulation errors are found in CLP speech. Therefore, the spectral deviations caused by nasalization or articulation errors can be mapped onto normal speakers' spectral characteristics to generate variations of CLP speech from normal speech data. A few known data augmentation approaches from the literature are also investigated in this study to quantify their relative impact on recognition performance.

3 Data Augmentation for CLP Speech

The study aims to improve CLP speech recognition so that speakers with CLP can use various speech-based applications effectively. The following subsections describe the data augmentation techniques explored in this study, followed by observations.

3.1 Vocal tract length perturbation

The vocal tract length perturbation (VTLP) method is one of the commonly used data augmentation approaches. Speakers with CLP exhibit disordered speech by changing the place of articulation (PoA), the manner of articulation (MoA), or both. The speakers change the PoA in response to the impaired structure of their articulatory system; as a result, the vocal tract parameters are distorted. Hence, in this work, the VTLP approach is used as one of the augmentation approaches to investigate its effect on CLP speech recognition.

In this method, a random warp factor α is used to add variations to the input data. For each utterance in the training set, a random α is generated, and each original frequency f is warped into a new frequency f′ using the piecewise-linear technique reported in [21], where

f′ = f·α for f ≤ f_hi·min(α, 1)/α, and f is otherwise mapped linearly so that the Nyquist frequency is preserved,

with f_hi a boundary frequency covering the significant formants. The α values are chosen from a discrete set of values in a narrow range around 1. The VTLP approach modifies the spectral characteristics of the input speech signal while preserving the fundamental frequency and duration of the signal.
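As a rough illustration, the piecewise-linear warping can be sketched as follows (a minimal sketch: the boundary frequency, sampling rate, and the discrete α set are assumed values for illustration, not those used in the paper):

```python
import numpy as np

def vtlp_warp(freqs, alpha, f_hi=4800.0, f_s=16000.0):
    """Piecewise-linear VTLP frequency warping (sketch after [21]).

    freqs : frequencies in Hz to warp
    alpha : warp factor, drawn per utterance from a set around 1
    f_hi  : boundary frequency covering the significant formants (assumed)
    f_s   : sampling frequency (assumed 16 kHz)
    """
    nyq = f_s / 2.0
    boundary = f_hi * min(alpha, 1.0) / alpha
    # Below the boundary: linear scaling by alpha; above it: a linear map
    # that pins the Nyquist frequency so the warped axis stays in range.
    return np.where(
        freqs <= boundary,
        freqs * alpha,
        nyq - (nyq - f_hi * min(alpha, 1.0)) / (nyq - boundary) * (nyq - freqs),
    )

# One random warp factor per utterance, from a discrete set around 1.0
rng = np.random.default_rng(0)
alpha = rng.choice(np.arange(0.9, 1.11, 0.05))
```

Note that with α = 1 the warp reduces to the identity, and the Nyquist frequency is mapped to itself for any α, so the two linear segments join continuously at the boundary.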

3.2 CycleGAN based speech feature modification

Another way of data augmentation is accomplished by transforming the acoustic characteristics of normal speech towards CLP speech characteristics and vice versa. The articulatory impairment distorts the acoustic characteristics of CLP speech. Therefore, mapping CLP speech onto normal speech generates a different variation of CLP speech that resembles normal speech characteristics. Hence, we explore cycle-consistent adversarial network (CycleGAN) training for transforming the acoustic characteristics, as described in [17]. The CycleGAN is widely used for non-parallel voice conversion (VC) and has shown its effectiveness in various other applications [17, 40, 9, 34]. A CycleGAN is based on a combination of two generators, G_X→Y and G_Y→X, and two discriminators, D_Y and D_X. The generator G_X→Y is a function that transforms the distribution X into the distribution Y, whereas G_Y→X transforms Y into X. On the other hand, the discriminator D_Y distinguishes the generated distribution G_X→Y(X) from the real distribution Y; in turn, D_X distinguishes G_Y→X(Y) from X. The CycleGAN model learns the mapping functions from training samples comprising source and target data. Both models are trained simultaneously using an objective function that consists of two losses: an adversarial loss and a cycle-consistency loss. With adversarial training of the generator and discriminator models, the generated normal speech samples are expected to become indistinguishable from CLP speech and vice versa.
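The two-loss objective can be made concrete with a toy numpy sketch. Here placeholder linear maps stand in for the neural generators and discriminators; all functions and values are illustrative assumptions, not the paper's models:

```python
import numpy as np

def g_xy(x):  # hypothetical generator: normal (X) -> CLP (Y) features
    return 1.1 * x + 0.2

def g_yx(y):  # hypothetical generator: CLP (Y) -> normal (X) features
    return (y - 0.2) / 1.1

def d_y(y):   # hypothetical discriminator: P(y is a real Y sample)
    return 1.0 / (1.0 + np.exp(-y.mean()))

def cyclegan_loss(x, y, lam=10.0):
    # Adversarial term: G_xy tries to make D_y score its outputs as real.
    adv = -np.log(d_y(g_xy(x)) + 1e-12)
    # Cycle-consistency term: X -> Y -> X (and Y -> X -> Y) should
    # reconstruct the input, keeping the mapping content-preserving.
    cyc = np.abs(g_yx(g_xy(x)) - x).mean() + np.abs(g_xy(g_yx(y)) - y).mean()
    return adv + lam * cyc

x = np.zeros(3)  # toy "normal" feature vector
y = np.ones(3)   # toy "CLP" feature vector
loss = cyclegan_loss(x, y)
```

Because the toy generators here are exact inverses, the cycle term is near zero and the loss is dominated by the adversarial term; in real training both generators and both discriminators are updated jointly against this combined objective.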

3.3 Reverberation

As the present work is motivated towards improving CLP speech recognition for effective use in speech-based applications, scenarios such as room reverberation and acoustics may impact the intelligibility and quality of CLP speech. Therefore, we consider room reverberation as one of the strategies for data augmentation. We generate the reverberant signal by convolving the input source signal with the room's acoustic impulse response using the Roomsimove toolkit [36].

3.4 Pitch modification

The pitch, or fundamental frequency (F0), is one of the important acoustic cues of the speech signal. Along with various speech disorders, voice disorder is also observed in some individuals with CLP [19]. In response to velopharyngeal dysfunction, hyperadduction can occur, resulting in a high pitch. Therefore, pitch variation is also considered as another augmentation technique in this study. The pitch is modified using a factor defined as the ratio of the standard deviations of the average pitch of the CLP speakers and the normal speakers. The pitch modification is performed using the open-source Praat Vocal Toolkit [5], which uses the pitch-synchronous overlap and add (PSOLA) method to synthesize the transformed signal [25]. The duration of the original signal is preserved while performing pitch modification.
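A sketch of how such a modification factor might be computed (the F0 values below are made up for illustration; the paper's exact estimator and the PSOLA resynthesis step are not reproduced here):

```python
import numpy as np

# Hypothetical per-speaker average F0 values (Hz) for the two groups.
clp_f0 = np.array([265.0, 280.0, 310.0, 250.0])
normal_f0 = np.array([240.0, 245.0, 255.0, 250.0])

# Modification factor: ratio of the pitch standard deviations of the
# CLP and normal speaker groups, as described in the text.
beta = clp_f0.std() / normal_f0.std()

# Scale a normal speaker's pitch contour by the factor; the actual
# resynthesis would be done with PSOLA (e.g. via the Praat Vocal Toolkit).
target_f0 = normal_f0 * beta
```

Since CLP speakers in this toy example show larger pitch variability, the factor is greater than one and the transformed contour exaggerates the pitch excursions of the normal speaker.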

3.5 Speaking rate transformation

In CLP speech, speech rate variation is also observed due to articulatory impairment [15]. Hence, we explore the impact of speaking rate transformation on CLP speech recognition. Here, each normal speech sample is matched to the corresponding CLP speech in terms of speaking rate. The normal speech rate is mapped to the CLP speech rate using dynamic time warping [26]. The time-aligned signals are then synthesized using the overlap and add (OLA) method. Speaking rate transformation changes the duration of the signal while preserving the spectral characteristics and pitch of the normal speaker.
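A toy sketch of the DTW alignment step on 1-D placeholder sequences (the actual system aligns speech feature sequences and resynthesizes the warped signal with OLA; everything here is an illustrative assumption):

```python
import numpy as np

def dtw_path(a, b):
    """Return the optimal DTW alignment path between sequences a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # deletion
                                 cost[i, j - 1],      # insertion
                                 cost[i - 1, j - 1])  # match
    # Backtrack from (n, m) to (0, 0) to recover the warping path.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda p: cost[p])
    return path[::-1]

# Align a shorter "normal-rate" sequence to a slower "CLP-rate" sequence;
# the path tells which frames to repeat before OLA resynthesis.
path = dtw_path([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

The recovered path maps each frame of the faster sequence to one or more frames of the slower one, which is exactly the frame repetition/deletion schedule a rate transformation needs.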

3.6 Observations from data augmentation on CLP speech

We now study the impact of the various data augmentation methods explored in this work. For this analysis, we consider an example of CLP speech and then apply the different data augmentation methods to observe the corresponding speech waveforms and spectrograms.

Figure 1: Waveforms and spectrograms of various transformed speech for a CVCV word /sasa/. (a)-(b) original normal speech signal (c)-(d) reverberated version, (e)-(f) vocal tract length warped version, (g)-(h) CycleGAN generated version, (i)-(j) speaking rate modified version, and (k)-(l) pitch modified version.

Fig. 1 compares the speech waveforms and spectrograms of the original speech and those generated using the various data augmentation approaches. From Fig. 1 (c) and (d), we observe that additional interfering acoustic components are imposed on the original speech signal due to reverberation. With the frequency warping applied in Fig. 1 (e) and (f), the high-energy spectral components are sampled more sparsely compared to the original normal speech in Fig. 1 (a) and (b). Further, with the CycleGAN approach shown in Fig. 1 (g) and (h), the speech characteristics vary significantly because the normal speech features are mapped to those of CLP speech. The perturbations appear as indistinguishable formants in the lower frequency region for voiced sounds and low energy concentration in the high-frequency components, such as the fricative /s/ in the figure.

Considering the speaking rate modification shown in Fig. 1 (i) and (j), the duration of the speech signal is reduced to match the speaking rate of a gender- and age-matched CLP speaker; other speech parameters are preserved while performing speaking rate modification. Further, from the pitch modification shown in Fig. 1 (k) and (l), it is observed that the shift in F0 preserves the spectral characteristics and the duration of the original signal. Overall, from Fig. 1, the various data augmentation methods alter speech parameters such as pitch, duration, intelligibility, quality, and the location of high-energy spectral content in the input signal. On the other hand, certain other aspects of speech, such as speaking style, are preserved when the pitch and speaking rate are modified. Similarly, the F0 is kept unaltered while transforming the speech features using the CycleGAN-based approach.

3.7 Perceptual evaluation

The database for this work is acquired from Kannada-speaking children. The speech data were collected in a sound-treated room using a Brüel & Kjær speech level meter [3]. All the speech data were collected at the All India Institute of Speech and Hearing (AIISH), Mysuru, India. The database consists of individuals with CLP and individuals with normal speech, with both male and female speakers in each group.

The ages of the CLP and normal participants are years (mean ± SD) and years (mean ± SD), respectively. The CLP speech characteristics exhibit a combination of speech disorders, namely hypernasality, articulation errors, and nasal air emission. The transcription and the rating of severity levels of the CLP speech samples were obtained from expert speech-language pathologists (SLPs) from AIISH. The SLPs have more than five years of experience and routinely work with speakers with CLP. The SLPs were asked to rate and transcribe the speech samples based on the instructions given in [12]. According to the rating strategy reported in [12], the SLPs provide deviation scores on a scale of 0 to 3, where 0 corresponds to close to normal, 1 to mild deviation, 2 to moderate deviation, and 3 to severe deviation.

The augmented speech signals are analyzed using subjective evaluation to determine the impact of the different data augmentation methods. For this purpose, the perturbed normal and CLP speech samples are mixed and presented to the listeners in random order. Along with the samples for perceptual similarity judgement, reference samples of both normal and CLP speech are provided. A group of listeners performed the perceptual judgement task. For each type of augmentation, 5 normal reference, 5 augmented normal, 5 CLP reference, and 5 augmented CLP speech samples are presented, i.e., a total of 20 words per augmentation approach. We note that for the CycleGAN method, 5 normal-transformed samples (normal to CLP) and 5 CLP-transformed samples (CLP to normal), along with their reference signals, are presented for evaluation. Overall, 100 speech samples (original reference and augmented) are presented for perceptual evaluation. Note, however, that Table 1 shows the evaluations for the augmented speech samples only, i.e., 50 words.

Method            # Samples     Similarity (%)           MOS
                                Normal   CLP    NP
VTLP              5 N + 5 C     40       32     28        1.85
CycleGAN          5 N + 5 C     43       37     20        2.03
Reverberation     5 N + 5 C     42       38     20        2.00
Speaking rate     5 N + 5 C     46       48     6         2.68
Pitch             5 N + 5 C     47       44     9         2.91
Total             25 N + 25 C   43.6     39.8   16.6      2.30
No augmentation   25 N          98       -      2         4.73
No augmentation   25 C          8        88     4         2.65
Table 1: Perceptual similarity (%) and MOS values for different data augmentation methods calculated over the normal (N) and CLP (C) speech signals. NP stands for no preference.

In the evaluation process, two tasks were assigned to the listeners. The first task was to indicate, for each augmented speech sample, whether it is perceptually more similar to the speech of a normal speaker or of a CLP speaker; if a listener found it difficult to make a judgement, they could select the no-preference option. The second task examined whether the augmentation approaches had significantly altered the overall quality of the speech. For this, mean opinion score (MOS) values are computed on a rating scale from 1 to 5 (1: bad, 2: fair, 3: good, 4: very good, 5: excellent). The perceptual evaluation results are shown in Table 1.

4 Experimental Setup

Although the database consists of non-meaningful CVCV words and meaningful short phrases, most of the augmentation experiments were carried out using isolated CVCV words only. The exception is the CycleGAN-based approach, where the transformation model is trained using the mel-cepstral coefficients extracted from each frame of the short phrases of the normal and CLP speakers. Using the trained model, the features of the CVCV words from normal speech are transformed into CLP speech features; similarly, another model is trained for transforming the CLP speech characteristics into normal speech characteristics. For the sake of creating more input variations, the fundamental frequency of the normal and CLP speech is preserved while the spectral characteristics of the signal are modified with the CycleGAN method.

Table 1 shows that approximately 44% of the normal augmented speech signals are perceived as normal and approximately 40% of the CLP augmented signals are perceived as CLP speech. In addition, we notice that a noticeable fraction of the samples received no preference (i.e., no judgement was made for these samples) relative to the normal and CLP references. Considering the MOS, an average score of 2.3 is observed. The lower MOS value reflects that the perturbations have altered the perceptual quality of the speech signals to a certain extent.

Another factor in the lower overall MOS value is that equal amounts of normal and CLP speech samples were presented for the perceptual evaluation: despite the high MOS of the unperturbed normal speech, the perturbations and the lower MOS of the CLP speech reduce the overall average.

Method          Augmented data   # Training utterances   PER (%)
Not applied     -                5,296                   50.68
VTLP            2×               10,592                  42.95
CycleGAN        2×               10,592                  41.09
Reverberation   2×               10,592                  42.10
Speaking rate   2×               10,592                  48.67
Pitch           2×               10,592                  43.91
Table 2: Impact of different data augmentation methods in PER (%) on the test set over CLP speech signals
Method          Augmented data   # Training utterances   PER (%)
Not applied     -                5,297                   21.59
VTLP            2×               10,594                  16.81
CycleGAN        2×               10,594                  14.84
Reverberation   2×               10,594                  15.68
Speaking rate   2×               10,594                  17.46
Pitch           2×               10,594                  16.95
Table 3: Impact of different data augmentation methods in PER (%) on the test set over normal speech signals
Method          Augmented data   # Training utterances   PER (%)
Not applied     -                10,593                  58.49
VTLP            2×               21,186                  49.21
CycleGAN        2×               21,186                  44.52
Reverberation   2×               21,186                  48.98
Speaking rate   2×               21,186                  56.02
Pitch           2×               21,186                  50.34
Table 4: Impact of different data augmentation methods in PER (%) on the test set over normal and CLP speech signals

5 Results and Discussion

This study focuses on investigating the ASR performance for CLP speech using various augmentation approaches. Hence, the deformed speech samples are augmented with the original speech samples, and the ASR results are then analyzed. In the current work, the training data size is doubled through augmentation so that the original and augmented data contribute equally to the acoustic characteristics. As the speech data for this study are in the Kannada language, a Kannada ASR system is developed using the Kaldi speech recognition toolkit [29]. In the experiments, a hybrid deep neural network (DNN) acoustic model was implemented. All the speech samples are downsampled to a common sampling rate before processing. The ASR system performance for the various data augmentation methods is measured using the phone error rate (PER) metric. The PER is calculated based on the general form of error rate (ER) computation: ER = (substitutions + deletions + insertions) / (total phones).
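The ER formula above corresponds to the Levenshtein edit distance between the reference and hypothesized phone sequences, normalized by the reference length. A minimal sketch, with phone sequences shown as character strings purely for illustration:

```python
def phone_error_rate(ref, hyp):
    """PER = (substitutions + deletions + insertions) / total reference phones."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # i deletions to match an empty hypothesis
    for j in range(m + 1):
        d[0][j] = j          # j insertions to match an empty reference
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

# One substitution over four phones -> PER = 0.25
per = phone_error_rate("sasa", "saza")
```

The same edit-distance computation underlies the PER scoring in standard toolkits; this sketch only makes the (S + D + I) / N definition explicit.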

Tables 2-4 report the PER for the different studies on the test set, along with the number of training utterances, which is larger when data augmentation is applied. In Table 2, the PER values obtained for CLP speech are reported: CLP speakers' data are used for training and held-out CLP speakers' data for testing. Three-fold cross validation is performed, and the final measure is obtained by averaging across all folds. During the three-fold cross validation, the test sets comprise female-female, male-male, and female-male combinations, respectively. Table 3 reports the PER values obtained for normal speech; in this case, normal speakers' data are used for training and held-out normal speakers' data for testing, with the same three-fold cross-validation combinations as for the CLP speech recognition. In Table 4, the PER values are obtained for the combination of normal and CLP speech: both normal and CLP speakers' data are used for training the system, and CLP speakers' data are used for testing. Three-fold cross validation with the aforementioned combinations is used to obtain the final PER value.

For comparing the augmentation approaches, the PER values obtained for the original speech signals form the baseline measures. From Tables 2-4, it is observed that all the explored data augmentation approaches yield a significant improvement in PER. In Table 4, a PER improvement of around 14% (absolute) over the baseline is observed when the speech samples are augmented with CycleGAN-based transformed data. This is attributed to the increase in speech samples whose characteristics are very close to CLP speech. From Tables 2-4, it is also observed that CycleGAN-based augmentation yields the lowest PER among the augmentation methods. In addition, the VTLP- and reverberation-based data augmentation methods also improve the PER significantly. On the other hand, the improvements from pitch and speaking rate modification are relatively small compared to the other techniques investigated in this work.

As discussed, this paper reports a CLP speech recognition study using CVCV words. In contrast to these nonsensical CVCV words, the recognition performance on meaningful sentences, words, and phrases is very relevant for realistic scenarios, which we intend to explore in the future. Additionally, in this study a combination of CLP speech disorders is considered, which may limit the recognition accuracy. Hence, future studies focusing on generalizing the variability of the speech disorders in this population, or on a specific type of speech disorder, and then studying its impact on recognition performance deserve attention. In this work, we study CLP children's speech, which exhibits highly varying characteristics. In this regard, text-to-speech based data augmentation could be explored to generate various speech styles and observe its impact on recognition performance. Pseudo-label based data augmentation could also be investigated to study the impact of label errors and biases in distorted CLP speech.

6 Conclusion

This work investigates a few data augmentation approaches for improving CLP speech recognition: VTLP, reverberation, speaking rate modification, pitch modification, and speech feature modification using CycleGAN. We study them using CLP and normal speech data collected in the Kannada language. The studies show that the CycleGAN-based speech feature modification method improves the recognition performance significantly compared to the other data augmentation techniques considered in this study. In addition, the VTLP- and reverberation-based approaches show relatively better results than the pitch and speaking rate modification methods.

7 Acknowledgement

The authors would like to thank Dr. M. Pushpavathi and Dr. Ajish Abraham, AIISH Mysore, for providing insights about CLP speech disorder. The authors would also like to acknowledge the research scholars of IIT Guwahati for their participation in the perceptual test. This work is in part supported by a project entitled “NASOSPEECH: Development of Diagnostic system for Severity Assessment of the Disordered Speech” funded by the Department of Biotechnology (DBT), Govt. of India.


  • [1] S. Bajpai and D. Radha (2019) Smart phone as a controlling device for smart home using speech recognition. In Proceedings of International Conference on Communication and Signal Processing (ICCSP), pp. 0701–0705. Cited by: §1.
  • [2] P. Bell, M. Gales, P. Lanchantin, X. Liu, Y. Long, S. Renals, P. Swietojanski, and P. C. Woodland (2012) Transcription of multi-genre media archives using out-of-domain data. In Proceedings of IEEE Spoken Language Technology Workshop (SLT), pp. 324–329. Cited by: §2.
  • [3] Brüel & Kjær (1942). Cited by: §3.7.
  • [4] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber (2011) High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183. Cited by: §2.
  • [5] R. Corretge (2012) Praat Vocal Toolkit: a Praat plugin with automated scripts for voice processing. Computer software. Retrieved from http://www.praatvocaltoolkit.com. Cited by: §3.4.
  • [6] X. Cui, V. Goel, and B. Kingsbury (2015) Data augmentation for deep neural network acoustic modeling. IEEE Transactions on Audio, Speech, and Language Processing 23 (9), pp. 1469–1477. Cited by: §1, §2.
  • [7] R. K. Das, J. Yang, and H. Li (2021) Data augmentation with signal companding for detection of logical access attacks. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 6349–6353. Cited by: §2.
  • [8] R. K. Das (2021) Known-unknown data augmentation strategies for detection of logical access, physical access and speech deepfake attacks: ASVspoof 2021. In Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, pp. 29–36. Cited by: §2.
  • [9] F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba (2018) High-quality nonparallel voice conversion based on cycle-consistent adversarial network. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5279–5283. Cited by: §3.2.
  • [10] M. Geng, X. Xie, S. Liu, J. Yu, S. Hu, X. Liu, and H. Meng (2020) Investigation of data augmentation techniques for disordered speech recognition. In Proceedings of Interspeech, pp. 696–700. Cited by: §1, §2.
  • [11] P. Grunwell and D. Sell (2001) Speech and cleft palate/velopharyngeal anomalies. Management of Cleft Lip and Palate. London: Whurr. Cited by: §1, §1.
  • [12] G. Henningsson, D. P. Kuehn, D. Sell, T. Sweeney, J. E. Trost-Cardamone, and T. L. Whitehill (2008) Universal parameters for reporting speech outcomes in individuals with cleft palate. The Cleft Palate-Craniofacial Journal 45 (1), pp. 1–17. Cited by: §1, §3.7.
  • [13] N. Jaitly and G. E. Hinton (2013) Vocal tract length perturbation (VTLP) improves speech recognition. In Proceedings of ICML Workshop on Deep Learning for Audio, Speech and Language, Vol. 117. Cited by: §2.
  • [14] Y. Jiao, M. Tu, V. Berisha, and J. Liss (2018) Simulating dysarthric speech for training data augmentation in clinical speech applications. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6009–6013. Cited by: §1, §2, §2.
  • [15] D. L. Jones and J. W. Folkins (1985) Effect of speaking rate on judgments of disordered speech in children with cleft palate. The Cleft Palate Journal 22 (4), pp. 246–252. Cited by: §3.5.
  • [16] N. Kanda, R. Takeda, and Y. Obuchi (2013) Elastic spectral distortion for low resource speech recognition with deep neural networks. In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 309–314. Cited by: §1, §2.
  • [17] T. Kaneko and H. Kameoka (2018) CycleGAN-VC: non-parallel voice conversion using cycle-consistent adversarial networks. In Proceedings of 26th European Signal Processing Conference (EUSIPCO), pp. 2100–2104. Cited by: §3.2.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105. Cited by: §2.
  • [19] A. W. Kummer (2013) Cleft palate & craniofacial anomalies: effects on speech and resonance. Nelson Education. Cited by: §1, §1, §3.4.
  • [20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. In Proceedings of the IEEE, Vol. 86, pp. 2278–2324. Cited by: §2.
  • [21] L. Lee and R. Rose (1998) A frequency warping approach to speaker normalization. IEEE Transactions on Speech and Audio Processing 6 (1), pp. 49–60. Cited by: §3.1.
  • [22] A. Maier, C. Hacker, E. Noth, E. Nkenke, T. Haderlein, F. Rosanowski, and M. Schuster (2006) Intelligibility of children with cleft lip and palate: evaluation by speech recognition techniques. In Proceedings of 18th International Conference on Pattern Recognition (ICPR), Vol. 4, pp. 274–277. Cited by: §1, §1.
  • [23] S. Möller, J. Krebber, and P. Smeele (2006) Evaluating the speech output component of a smart-home system. Speech Communication 48 (1), pp. 1–27. Cited by: §1.
  • [24] W. Moore and R. K. Sommers (1975) Phonetic contexts: their effects on perceived intelligibility in cleft-palate speakers. Folia Phoniatrica et Logopaedica 27 (6), pp. 410–422. Cited by: §1.
  • [25] E. Moulines and F. Charpentier (1990) Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication 9 (5-6), pp. 453–467. Cited by: §3.4.
  • [26] C. Myers, L. Rabiner, and A. Rosenberg (1980) Performance tradeoffs in dynamic time warping algorithms for isolated word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (6), pp. 623–635. Cited by: §3.5.
  • [27] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In Proceedings of Interspeech, pp. 2613–2617. Cited by: §1, §2.
  • [28] S. J. Peterson-Falzone, M. A. Hardin-Jones, and M. P. Karnell (2001) Cleft palate speech. Mosby St. Louis. Cited by: §1, §1.
  • [29] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. (2011) The Kaldi speech recognition toolkit. In Proceedings of Workshop on Automatic Speech Recognition and Understanding. Cited by: §5.
  • [30] F. Rudzicz (2013) Adjusting dysarthric speech signals to be more intelligible. Computer Speech & Language 27 (6), pp. 1163–1177. Cited by: §1.
  • [31] M. Scipioni, M. Gerosa, D. Giuliani, E. Nöth, and A. Maier (2009) Intelligibility assessment in children with cleft lip and palate in Italian and German. In Proceedings of Interspeech, pp. 967–970. Cited by: §1.
  • [32] S. Shahnawazuddin, N. Adiga, K. Kumar, A. Poddar, and W. Ahmad (2020) Voice conversion based data augmentation to improve children’s speech recognition in limited data scenario. In Proceedings of Interspeech, pp. 4382–4386. Cited by: §2.
  • [33] J. Shor, D. Emanuel, O. Lang, O. Tuval, M. Brenner, J. Cattiau, F. Vieira, M. McNally, T. Charbonneau, M. Nollstadt, et al. (2019) Personalizing ASR for dysarthric and accented speech with limited data. In Proceedings of Interspeech, pp. 784–788. Cited by: §1.
  • [34] P. N. Sudro, R. K. Das, R. Sinha, and S. R. M. Prasanna (2021) Enhancing the intelligibility of cleft lip and palate speech using cycle-consistent adversarial networks. In Proceedings of IEEE Spoken Language Technology Workshop (SLT), pp. 720–727. Cited by: §3.2.
  • [35] E. Tsunoo, K. Shibata, C. Narisetty, Y. Kashiwagi, and S. Watanabe (2021) Data augmentation methods for end-to-end speech recognition on distant-talk scenarios. In Proceedings of Interspeech, pp. 301–305. External Links: Document Cited by: §2.
  • [36] E. Vincent and D. R. Campbell (2008) Roomsimove. GNU Public License, http://homepages.loria.fr/evincent/software/Roomsimove_1_4. Cited by: §3.3.
  • [37] M. Vucovich, R. R. Hallac, A. A. Kane, J. Cook, C. V. Slot, and J. R. Seaward (2017) Automated cleft speech evaluation using speech recognition. Journal of Cranio-Maxillofacial Surgery 45 (8), pp. 1268–1271. Cited by: §1.
  • [38] T. L. Whitehill (2002) Assessing intelligibility in speakers with cleft palate: a critical review of the literature. The Cleft Palate-Craniofacial Journal 39 (1), pp. 50–58. Cited by: §1.
  • [39] F. Xiong, J. Barker, and H. Christensen (2019) Phonetic analysis of dysarthric speech tempo and applications to robust personalised dysarthric speech recognition. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5836–5840. Cited by: §2.
  • [40] C. Yeh, P. Hsu, J. Chou, H. Lee, and L. Lee (2018) Rhythm-flexible voice conversion without parallel data using Cycle-GAN over phoneme posteriorgram sequences. In Proceedings of IEEE Spoken Language Technology Workshop (SLT), pp. 274–281. Cited by: §3.2.
  • [41] E. Yılmaz, V. Mitra, C. Bartels, and H. Franco (2018) Articulatory features for ASR of pathological speech. In Proceedings of Interspeech, pp. 2958–2962. Cited by: §1.
  • [42] V. Young and A. Mihailidis (2010) Difficulties in automatic speech recognition of dysarthric speakers and implications for speech-based applications used by the elderly: a literature review. Assistive Technology 22 (2), pp. 99–112. Cited by: §1.
  • [43] D. Yu and J. Li (2017) Recent progresses in deep learning based acoustic models. IEEE/CAA Journal of Automatica Sinica 4 (3), pp. 396–409. Cited by: §1.