Speech intelligibility is important for communication, whether in human-human or human-machine interaction [10, 20, 15]. Due to articulatory impairment, the intelligibility of pathological speakers is degraded, which hinders them from communicating as effectively as speakers without speech pathology. Hence, researchers have sought to improve the intelligibility of pathological speech from a signal processing point of view [13, 3, 6, 1, 32, 2]. In this work, cleft lip and palate (CLP) speech enhancement is addressed.
CLP is a birth disorder that affects the speech production system. Speech disorders can persist even after clinical intervention due to velopharyngeal dysfunction, oronasal fistula, and mislearning [12, 17, 8]. CLP speech distortions are categorized into hypernasality, hyponasality, articulation errors, and voice disorders [12, 21, 14]. Among the many speech disorders associated with CLP speech, nasalization and articulation errors are the primary factors that affect speech intelligibility. Hypernasality is a resonance disorder in which the presence of nasal resonances during speech production gives the speech an excessively perceptible nasal quality. Mostly, the voiced sounds are nasalized, and in severe hypernasality the nasal consonants tend to replace the obstruents. Misarticulations are produced due to structural disorders, functional disorders, or both.
In the pathological speech enhancement literature, some of the reported works focus on segmenting the disordered phoneme and then modifying it [30, 28, 29, 20]. In several other studies, specific enhancement strategies are developed based on the disordered phenomena. All of these studies attempt phoneme-specific modifications and were effective for isolated phoneme analysis and modification. In a more realistic situation, however, speech intelligibility and quality must be analyzed for words and phrases. In clinical settings, when CLP speech is analyzed for enhancement tasks, the speech language pathologists (SLPs) first work on isolated phonemes. Once a CLP speaker has mastered the correct phoneme production, the SLPs embed the phoneme in a word and analyze the intelligibility of the entire word. This process is then repeated for phrases and conversational speech. It has been reported that a speaker may produce a phoneme correctly in isolation but produce it in error in words and short phrases due to the influence of phonetic contexts. Hence, the SLPs evaluate and attempt to enhance CLP speech by minimizing the influence of phonetic contexts. In a similar direction, the present work focuses on word-level intelligibility by combining the phoneme-specific modification techniques reported in our earlier works [28, 29, 27]. In those works, we demonstrated the ability to modify phoneme-specific distortions for the fricative /s/, the stop consonants /k/, /t/, and /T/, and the vowels /a/, /i/, and /u/. In the present work, the relevant phoneme-specific enhancement techniques are combined to improve word-level intelligibility.
2 Transforming word-level speech intelligibility
In this work, nonsensical word modification is attempted by combining the phoneme-specific modification techniques discussed in our earlier works. Several words are formed from the possible combinations of the phonemes studied in those works. When a single transformation method is used to modify the entire word, the speech is observed to sound muffled. Thus, further processing is required to achieve good-quality enhanced speech. Various issues arise in enhancing word-level intelligibility because phonemes are substituted by other phonemes or nasal consonants, or are nasalized. Other types of misarticulation also affect CLP speech intelligibility and quality. It is challenging to detect such misarticulations in an unsupervised manner. Therefore, certain assumptions are made prior to the enhancement task.
In CLP speech, the production of obstruents is mainly affected by the loss of adequate intra-oral pressure or by mislearned compensatory articulation. Due to the production error, important acoustic-phonetic cues related to obstruents, such as the transient burst, frication noise, and formant dynamics in the adjacent sonorant’s transition region, are degraded. Therefore, deviations in their production that are reflected in the acoustic signal are expected to correlate with the perceived CLP speech intelligibility. On the other hand, nasalization mostly affects the voiced sounds. In a word, even when the obstruents are modified, distorted voiced sounds could still affect the overall intelligibility. This makes modification of the voiced sounds necessary as well. Hence, characterizing all the deviations using relevant acoustic features and then modifying them is expected to provide intelligible speech. The knowledge of deviated acoustic characteristics explored for the analysis of CLP speech in the previous works [27, 29, 28] can be used to enhance the intelligibility of distorted words.
2.1 Database description
The database was created in collaboration with the speech language pathologists (SLPs) of the All India Institute of Speech and Hearing (AIISH), Mysuru, India. The database consists of speech from age- and gender-matched CLP and non-CLP speakers. The non-CLP speakers served as controls for the study. Both the CLP and non-CLP children are native Kannada speakers. Prior to the recording, ethical consent was obtained from the parents/caregivers of each participant. An overview of the study was provided to the parents/caregivers. The study was conducted with clearance from the AIISH Bio-behavioral ethical committee. In this work, nonsensical words are used for analysis and modification. All the speech samples were recorded in a sound-proof room using a speech level meter (Bruel and Kjær) at a sampling frequency of 48 kHz and 16-bit resolution. The recorded speech samples were saved in .WAV format. The microphone was placed at a distance of 15 cm from each speaker while recording. During recording, the instructor first uttered the target word, and then the response of the child was recorded. Each of the expert SLPs had around five years of experience in the field of CLP speech evaluation. The speech samples were recorded over 2-3 sessions for each speaker. The total number of speakers, the number of tokens, and the assessment of the distorted samples for fricatives, stops, and vowels can be found in [28], [29], and [27], respectively.
2.2 Modification of fricative-vowel-fricative-vowel words
Considering the combinations of the misarticulated fricative /s/ with the vowels /a/, /i/, and /u/ as /sasa/, /sisi/, and /susu/, the enhancement task must address all the distorted phonemes to improve the overall speech intelligibility and quality [12, 17]. The enhancement techniques exploited in the previous works, namely spectral energy compression, insertion, spectral conversion, and temporal processing, are briefly discussed here with respect to word-level intelligibility. The experimental details of these techniques can be found in [27, 29, 28].
If the spectral compression technique is applied to the entire /sasa/, /sisi/, or /susu/ word, compression of the lower frequency region will enhance the fricatives but will simultaneously deteriorate the vowels, because low-frequency energy is important for vowel perception [25, 16, 23, 22]. In certain cases, if the vowels are nasalized, the insertion method may be applicable, but its perceptual quality will be far from natural speech because it may not preserve the individuality information. In the temporal processing method, a fine weight function is applied around the glottal closure instant (GCI) locations to emphasize the significant GCI events and suppress any interfering signal components around them. If the fricative-vowel-fricative-vowel (FVFV) words are processed using the temporal processing method, it may not result in an effective transformation, because the excitation source for a fricative is a noise source and temporal processing might suppress important signal components. As each phoneme has different spectral and temporal characteristics, the distortions caused by the articulatory impairment affect each phoneme differently. In contrast, the spectral conversion method, which maps the distorted CLP spectra to the desired speech spectra, is a promising strategy for CLP speech enhancement. Hence, an attempt is made to enhance the /FVFV/ words using a Gaussian mixture model (GMM) based spectral conversion method.
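As a rough illustration of the GMM based spectral conversion idea, the sketch below applies the standard joint-density conversion function, a posterior-weighted sum of per-component linear regressions, to a single scalar feature. The function name `gmm_convert`, the dictionary keys, the parameter values, and the scalar simplification are all assumptions made for illustration; they are not the configuration used in this work.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmm_convert(x, components):
    """Map a scalar source feature x to the target space.

    Each component is a dict with hypothetical keys:
      w       mixture weight
      mu_x    source mean
      mu_y    target mean
      var_x   source variance
      cov_yx  cross-covariance between target and source
    """
    # Posterior probability of each component given the source feature.
    likelihoods = [c["w"] * gaussian_pdf(x, c["mu_x"], c["var_x"]) for c in components]
    total = sum(likelihoods)
    posteriors = [lk / total for lk in likelihoods]
    # Posterior-weighted sum of per-component linear regressions.
    return sum(
        p * (c["mu_y"] + c["cov_yx"] / c["var_x"] * (x - c["mu_x"]))
        for p, c in zip(posteriors, components)
    )

# Toy two-component model (all values are made up).
toy_gmm = [
    {"w": 0.5, "mu_x": 0.0, "mu_y": 1.0, "var_x": 1.0, "cov_yx": 0.5},
    {"w": 0.5, "mu_x": 4.0, "mu_y": 6.0, "var_x": 1.0, "cov_yx": 0.5},
]
converted = gmm_convert(2.0, toy_gmm)
```

Because the converted value is an average over mixture components, fine spectral detail tends to be smoothed out, which is one commonly cited reason for the muffled quality of GMM converted speech.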
The /sasa/ word enhanced using GMM based spectral conversion is depicted in Fig. 1 (e)-(f). After enhancement, it is observed that GMM based spectral conversion of hypernasal speech results in muffled speech. Further analysis showed that the spectrally modified speech has lower nasalization than the original /FVFV/ word. In Fig. 1 (c)-(d) the fricatives are ambiguous and the overall speech quality is still degraded.
The speech degradation may be attributed to the oversmoothing effect. Because of oversmoothing, the vowel formants are observed to have larger bandwidths and smaller peak-to-valley ratios, and the fricative spectra are observed to have deviant acoustic characteristics and spectral tilt. Several techniques have been proposed in the literature to overcome the oversmoothing effect; one such approach is the parametric voice conversion (VC) method, which is reported to alleviate it. Therefore, nonnegative matrix factorization (NMF) based speech enhancement is used to modify the /FVFV/ word, as shown in Fig. 1 (g)-(h). The modified /FVFV/ word shows some improvement compared to the original /FVFV/ word, but its acoustic characteristics are still far from those of the non-CLP /FVFV/ word shown in Fig. 1 (a)-(b). Some improvements are observed, namely a reduction in nasalization and spectral prominence in the high-frequency region for the fricatives. However, further observation showed that low-frequency energy persists in the fricatives and that the vowel formants are not distinct. This highlights the importance of processing different classes of sound units separately.
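The NMF based conversion can be sketched in an exemplar style: activations are estimated so that a source (CLP) dictionary reconstructs the input spectrum, and the same activations are then applied to a paired target (non-CLP) dictionary. The tiny dictionaries, the Euclidean multiplicative update, and all names below are assumptions for illustration; the dictionaries and cost function actually used in this work are not specified here.

```python
def estimate_activations(v, w_src, n_iter=50, eps=1e-12):
    """Estimate nonnegative activations h such that v is close to w_src @ h.

    v     : list of length F (one magnitude spectrum)
    w_src : F x K dictionary given as a list of rows
    Uses the multiplicative update for the Euclidean cost with the
    dictionary held fixed.
    """
    n_bins, n_atoms = len(w_src), len(w_src[0])
    h = [1.0 / n_atoms] * n_atoms
    for _ in range(n_iter):
        # Reconstruction with the current activations, computed once per
        # iteration so that all atoms are updated simultaneously.
        recon = [sum(w_src[f][k] * h[k] for k in range(n_atoms)) for f in range(n_bins)]
        for k in range(n_atoms):
            num = sum(w_src[f][k] * v[f] for f in range(n_bins))      # (W^T v)_k
            den = sum(w_src[f][k] * recon[f] for f in range(n_bins))  # (W^T W h)_k
            h[k] *= num / (den + eps)
    return h

def nmf_convert(v, w_src, w_tgt, n_iter=50):
    """Rebuild the spectrum from the target dictionary using source activations."""
    h = estimate_activations(v, w_src, n_iter)
    return [sum(row[k] * h[k] for k in range(len(h))) for row in w_tgt]

# Toy paired dictionaries: 3 frequency bins, 2 atoms (made-up values).
W_CLP = [[1.0, 0.0],
         [0.0, 1.0],
         [0.0, 1.0]]
W_REF = [[0.2, 0.0],
         [0.8, 0.1],
         [0.0, 0.9]]
converted = nmf_convert([1.0, 0.0, 0.0], W_CLP, W_REF)
```

In this toy case the input matches the first CLP atom exactly, so the output is approximately the first reference atom. With real spectra the activations mix several atoms, and the result depends on how well the two dictionaries are paired.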
2.3 Modification of consonant-vowel-consonant-vowel words
The consonant-vowel-consonant-vowel (CVCV) words consisting of the stops /k/, /t/, and /T/ in the vowel contexts /a/, /i/, and /u/ are analyzed in the same way as the /FVFV/ words described in the previous section. The possible words are /kaka/, /kiki/, /kuku/, /tata/, /titi/, /tutu/, /TaTa/, /TiTi/, and /TuTu/. The spectral compression technique may not effectively transform the /CVCV/ words because the spectral prominence of the consonants is not confined to a specific frequency band. For example, the velar stop /k/ is characterized by prominent spectral energy in the low-frequency region, whereas the alveolar stop /t/ shows spectral prominence around the mid-frequency region and the retroflex /T/ shows spectral prominence above 2 kHz. Additionally, the vowel formants are observed in the low-frequency region. Therefore, spectral energy compression in the low-frequency region will further deteriorate the /CVCV/ words. Further, temporal processing will modify only the vowels. With the insertion method, an artificially synthesized phoneme can be used for speech modification, but this method does not preserve the individuality information. Hence, spectral compression, temporal processing, and the insertion method might not be effective in transforming the entire /CVCV/ word.
Consonants are very short phonemes with dynamic characteristics that differ in their spectral and temporal importance [19, 5]. The spectral prominence of consonants ranges from the low-frequency region to the high-frequency region, whereas in vowels the lower frequency region is mostly considered to carry the important perceptual information. Hence, a modification technique designed for a specific category of phoneme may or may not be effective in transforming the disordered nature of another phoneme. As stated in the previous section, mapping the disordered speech spectra to the non-CLP speech spectra may improve speech intelligibility and quality. Hence, the modification of the /CVCV/ words is also attempted using GMM and NMF based spectral conversion.
For illustration, the transformed signals for the word /kaka/ are shown in Fig. 2. From the figure, it is observed that the vowel formants and the consonants in Fig. 2 (e)-(f) and Fig. 2 (g)-(h) are not similar to the non-CLP speech characteristics shown in Fig. 2 (a)-(b). The enhanced speech signal does not show significant improvement relative to the original unprocessed signal. This may be attributed to the complex relationship between misarticulation and nasality in CLP speech: hypernasality and articulation errors both occur in the same word, reducing speech intelligibility and quality. Only /kaka/ is depicted here for illustration; however, similar observations are made for the misarticulations of the stops /t/ and /T/. Both the NMF and GMM based spectral conversions show some improvement in the speech characteristics, but they are not able to enhance the speech signal as effectively as desired. This necessitates further analysis of the signal characteristics before modification. Hence, phoneme-specific enhancement can be attempted to observe its impact on the overall speech intelligibility and quality of a word.
3 Experimental evaluation
In this section, the impact of the combined phoneme-specific modification techniques is analyzed. From Fig. 1 and Fig. 2, it is observed that when the entire word is modified, the phoneme distortions (fricative or consonant misarticulation, or nasalization) are not all reduced as desired.
Table 1: P-STOI values (columns: Obstruent, Vowel, Obstruent + vowel).
When each phoneme in a word is modified independently, as shown in Fig. 3 and Fig. 4, the impact of the speech distortions is reduced significantly. Due to severity, both articulation errors and nasalization are observed in CLP speech. For an enhancement system to perform the desired task, it is essential to improve the intelligibility of the entire word.
To analyze the effectiveness of the transformation methods exploited in our earlier works, the evaluation is carried out for entire word modification and compared with isolated phoneme modifications. In the first step, the impact on word-level intelligibility is evaluated for the enhanced speech obtained by transforming only the fricatives/consonants.
In the second step, the enhanced speech obtained using only vowel modification is evaluated. Finally, in the third step, the enhanced speech obtained by applying both the consonant and vowel modifications is evaluated.
3.1 Objective evaluation
The modified CLP speech is assessed using the pathological short-time objective intelligibility (P-STOI) and pathological extended short-time objective intelligibility (P-ESTOI) measures. The mathematical descriptions of P-STOI and P-ESTOI can be found in [9].
The P-STOI and P-ESTOI values are reported in Table 1 and Table 2, respectively. The values are computed between time-aligned CLP speech signals and non-CLP (reference) signals. Before computing the measures, word-specific reference templates are created from the non-CLP speakers’ speech. The reported values are averaged across all the listeners corresponding to the specific errors in all the vowel contexts. The P-STOI values in Table 1 indicate that, compared to the original misarticulated CLP speech, modification of either error (obstruent misarticulation or vowel nasalization) results in improved intelligibility. However, the P-STOI values also show that modifying both the distorted obstruents and the vowels provides higher P-STOI values than standalone modification. Similar observations are noted for the P-ESTOI values tabulated in Table 2.
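The objective measures above assume that each CLP utterance is time-aligned with a non-CLP reference template. One common way to obtain such an alignment is dynamic time warping (DTW). The sketch below is a minimal DTW on scalar per-frame features and is an assumption for illustration, since the alignment procedure used in this work is not detailed here.

```python
def dtw_align(test, ref):
    """Return (total_cost, path) aligning two sequences of scalar features.

    path is a list of (test_index, ref_index) pairs from start to end.
    """
    n, m = len(test), len(ref)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning test[:i] with ref[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(test[i - 1] - ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (cost[i - 1][j], i - 1, j),          # advance test only
            (cost[i][j - 1], i, j - 1),          # advance reference only
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path

# Toy example: the test sequence holds its second value one frame longer.
total_cost, warp_path = dtw_align([1.0, 2.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

In practice the scalar distance would be replaced by a distance between per-frame feature vectors, and the resulting path would be used to warp the CLP frames onto the reference time axis before computing the intelligibility measure.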
Table 2: P-ESTOI values (columns: Obstruent, Vowel, Obstruent + vowel).
For illustration, the original and modified CLP words are also analyzed in terms of automatic speech recognition (ASR) performance. As the speech data for this study are in the Kannada language, a Kannada ASR system is developed using the KALDI speech recognition toolkit. The ASR performance for the various phoneme modification categories is measured using the phone error rate (PER) metric. The PER for the original and modified CLP speech is shown in Table 3.
Table 3: PER values (columns: Obstruent, Vowel, Obstruent + vowel).
Compared to the original unprocessed CLP speech, the recognition performance of the modified speech is relatively higher. Modification of a single phoneme class also yields a reduced PER compared to the original distorted CLP speech. However, better performance is observed when both the phonemes in the word are modified. In certain instances, the PER of the combined modification is comparable to that of the standalone modification of a phoneme. This implies that, in those utterances, the distortion caused by that specific phoneme is dominant.
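For reference, PER is conventionally computed as the edit distance (substitutions, deletions, and insertions) between the recognized and reference phone sequences, divided by the reference length. The sketch below shows this computation; the recognizer outputs are hypothetical and do not come from the Kannada ASR system described above.

```python
def phone_error_rate(reference, hypothesis):
    """PER = (substitutions + deletions + insertions) / len(reference).

    The reference sequence must be non-empty.
    """
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = edit distance between reference[:i] and hypothesis[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub,  # match or substitution
                dist[i - 1][j] + 1,        # deletion
                dist[i][j - 1] + 1,        # insertion
            )
    return dist[n][m] / n

# Hypothetical recognizer outputs for the target word /sasa/.
ref_phones = ["s", "a", "s", "a"]
per_original = phone_error_rate(ref_phones, ["n", "a", "n", "a"])  # both fricatives replaced
per_modified = phone_error_rate(ref_phones, ["s", "a", "n", "a"])  # one fricative still wrong
```

Under these made-up outputs, correcting one of the two substituted fricatives halves the PER, which mirrors the kind of comparison made between the original and modified speech.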
Table 4: MCD values (columns: Obstruent, Vowel, Obstruent + vowel).
Considering the mel-cepstral distortion (MCD) values shown in Table 4, it is observed that the combined modification of the obstruents and vowels results in lower MCD values than those of the original distorted CLP words. The MCD values reported in Table 4 are averaged across all the listeners corresponding to the specific errors in all the vowel contexts. The lower MCD values of the modified words indicate that the spectral difference between the non-CLP (reference) and misarticulated words is reduced significantly for all the words evaluated in the study.
In certain cases, the objective intelligibility values of the combined modification are comparable to those of standalone modification. This is probably because either the obstruent or the vowel in the word is less distorted.
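Assuming MCD here denotes the mel-cepstral distortion commonly used in voice conversion, it can be computed per aligned frame pair as (10 / ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²) and then averaged over frames. The sketch below follows that common definition; the cepstral values are made up, and the exact variant used in this work (number of coefficients, handling of the 0th coefficient) is not stated here.

```python
import math

def frame_mcd(ref_cep, test_cep):
    """MCD in dB for one pair of aligned frames.

    The caller is expected to exclude the 0th (energy) coefficient.
    """
    sq = sum((r - t) ** 2 for r, t in zip(ref_cep, test_cep))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * sq)

def mean_mcd(ref_frames, test_frames):
    """Average frame-level MCD over time-aligned frame pairs."""
    if len(ref_frames) != len(test_frames):
        raise ValueError("frames must be time-aligned and equal in number")
    values = [frame_mcd(r, t) for r, t in zip(ref_frames, test_frames)]
    return sum(values) / len(values)

# Toy mel-cepstra: 2 frames x 2 coefficients (made-up values).
ref = [[1.0, 0.5], [0.8, 0.4]]
clp = [[1.5, 0.5], [0.8, 0.9]]
mcd_value = mean_mcd(ref, clp)
```

A lower value means the test cepstra are closer to the reference, so identical inputs give 0 dB.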
3.2 Subjective evaluation
A listening experiment is also carried out to assess the word-level intelligibility obtained with the phoneme-specific modification techniques reported in the earlier works. The speech quality of the distorted and modified CLP speech is measured using a 5-point mean opinion score (MOS) rating scale, where 1 = bad, 2 = fair, 3 = good, 4 = very good, and 5 = excellent.
Table 5: MOS values (columns: Obstruent, Vowel, Obstruent + vowel).
A total of 10 naive listeners participated in the study, and the speech samples were randomly numbered to avoid any bias towards the original or modified speech. The listeners have knowledge of speech science. Each listener evaluated 120 speech samples (3 vowel contexts × 3 fricative /s/ errors × 4 variations = 36; 3 vowel contexts × 1 stop /k/ error × 4 variations = 12; 3 vowel contexts × 3 stop /t/ errors × 4 variations = 36; 3 vowel contexts × 3 stop /T/ errors × 4 variations = 36). The MOS values are derived by averaging the ratings from all the listeners across each vowel context of the specific word. The averaged MOS values shown in Table 5 indicate that the modification of all phonemes provides a significant improvement compared to the original and standalone modified CLP speech.
This study was performed on speech data collected in clinical settings from children with mild to moderate CLP speech disorder. The scope of the work is limited to studying specific CLP speech distortions in isolated phonemes in the context of /CVCV/ words with identical /CV/ pairs. The study primarily focuses on enhancing a few select obstruents and vowels, and is then extended to some nonsensical /CVCV/ words that can be formed by combining the studied obstruents and vowels. The proposed system is developed based on certain assumptions in a controlled environment. Therefore, real-time realization of the proposed techniques deserves future exploration.
Word-level intelligibility enhancement is attempted by combining the phoneme-specific modifications discussed in our previous works. A comparison study is also carried out to observe whether the transformation methods exploited in our earlier works can improve the intelligibility of the entire word. When a transformation method is used to modify the word as a whole, the speech is observed to sound muffled. Hence, phoneme-specific modifications are exploited to observe their impact on word intelligibility. The evaluation results show that a significant improvement in word-level intelligibility can be achieved when all phonemes in the words are modified independently.
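The per-listener sample count and the MOS averaging can be checked with a few lines of arithmetic. In the sketch below the design counts are taken from the text, while the ratings and the helper name `mean_opinion_score` are hypothetical and only illustrate the pooling over listeners and vowel contexts.

```python
from statistics import mean

# (vowel contexts, error types, variations) per obstruent, as listed in the text.
design = {
    "fricative /s/": (3, 3, 4),
    "stop /k/": (3, 1, 4),
    "stop /t/": (3, 3, 4),
    "stop /T/": (3, 3, 4),
}
samples_per_group = {name: v * e * n for name, (v, e, n) in design.items()}
total_samples = sum(samples_per_group.values())

def mean_opinion_score(ratings):
    """Average 1-5 ratings pooled over listeners and vowel contexts.

    ratings is a list with one inner list of ratings per listener.
    """
    flat = [r for listener in ratings for r in listener]
    if any(r < 1 or r > 5 for r in flat):
        raise ValueError("ratings must lie on the 1-5 scale")
    return mean(flat)

# Hypothetical ratings: 2 listeners x 3 vowel contexts for one condition.
example_mos = mean_opinion_score([[3, 4, 3], [4, 4, 3]])
```

The group counts sum to the 120 samples per listener stated above.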
The authors would like to thank Dr. M. Pushpavathi and Dr. Ajish Abraham, AIISH Mysore, for providing insights about CLP speech disorder. The authors would also like to acknowledge the research scholars of IIT Guwahati for their participation in the perceptual test. This work is in part supported by a project entitled “NASOSPEECH: Development of Diagnostic system for Severity Assessment of the Disordered Speech” funded by the Department of Biotechnology (DBT), Govt. of India.
[1] (2013) Individuality-preserving voice conversion for articulation disorders based on non-negative matrix factorization. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8037–8040.
[2] (1997) Application of speech conversion to alaryngeal speech enhancement. IEEE Transactions on Speech and Audio Processing 5 (2), pp. 97–105.
[3] (2019) Parrotron: an end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation. In Proceedings of Interspeech, pp. 4115–4119.
[4] (1942) Bruel and Kjaer. https://www.bksv.com/en
[5] (1955) Acoustic loci and transitional cues for consonants. The Journal of the Acoustical Society of America 27 (4), pp. 769–773.
[6] (2017) Joint dictionary learning-based non-negative matrix factorization for voice conversion to improve speech intelligibility after oral surgery. IEEE Transactions on Biomedical Engineering 64 (11), pp. 2584–2594.
[7] (2001) Speech and cleft palate/velopharyngeal anomalies. Management of Cleft Lip and Palate. London: Whurr.
[8] (2008) Universal parameters for reporting speech outcomes in individuals with cleft palate. The Cleft Palate-Craniofacial Journal 45 (1), pp. 1–17.
[9] (2019) Pathological speech intelligibility assessment based on the short-time objective intelligibility measure. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6405–6409.
[10] (2007) Improving the intelligibility of dysarthric speech. Speech Communication 49 (9), pp. 743–759.
[11] (2011) Enhancement of noisy speech by temporal and spectral processing. Speech Communication 53 (2), pp. 154–174.
[12] (2013) Cleft palate & craniofacial anomalies: effects on speech and resonance. Nelson Education.
[13] (2019) Multi-objective learning based speech enhancement method to increase speech quality and intelligibility for hearing aid device users. Biomedical Signal Processing and Control 48, pp. 35–45.
[14] Intelligibility of children with cleft lip and palate: Evaluation by speech recognition techniques. 18th International Conference on Pattern Recognition (ICPR), Vol. 4, pp. 274–277.
[15] (2012) Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech. Speech Communication 54 (1), pp. 134–146.
[16] (2000) Noise source models for fricative consonants. IEEE Transactions on Speech and Audio Processing 8 (3), pp. 328–344.
[17] (2001) Cleft palate speech. Mosby, St. Louis.
[18] (2011) The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
[19] (2014) Estimation of voice-onset time in continuous speech using temporal measures. The Journal of the Acoustical Society of America 136 (2), pp. EL122–EL128.
[20] (2013) Adjusting dysarthric speech signals to be more intelligible. Computer Speech & Language 27 (6), pp. 1163–1177.
[21] (2009) Intelligibility assessment in children with cleft lip and palate in Italian and German. In Tenth Annual Conference of the International Speech Communication Association.
[22] (1996) Quantifying spectral characteristics of fricatives. In Proceedings of 4th International Conference on Spoken Language Processing (ICSLP), Vol. 3, pp. 1521–1524.
[23] (1985) The acoustics of fricative consonants.
[24] (2016) The structure of Hindi stop consonants. The Journal of the Acoustical Society of America 140 (5), pp. 3633–3642.
[25] (2000) Acoustic phonetics. Vol. 30, MIT Press.
[26] (1998) Continuous probabilistic transform for voice conversion. IEEE Transactions on Speech and Audio Processing 6 (2), pp. 131–142.
[27] (2020) Enhancement of cleft palate speech using temporal and spectral processing. Speech Communication 123, pp. 70–82.
[28] (2021) Modification of misarticulated fricative /s/ in cleft lip and palate speech. Biomedical Signal Processing and Control 67, pp. 102088.
[29] (2021) Event-based transformation of misarticulated stops in cleft lip and palate speech. Circuits, Systems, and Signal Processing 40 (8), pp. 4064–4088.
[30] (2016) Spectral enhancement of cleft lip and palate speech. In Proceedings of Interspeech, pp. 117–121.
[31] (2002) Assessing intelligibility in speakers with cleft palate: a critical review of the literature. The Cleft Palate-Craniofacial Journal 39 (1), pp. 50–58.
[32] (2019) Reconstruction of Mandarin electrolaryngeal fricatives with hybrid noise source. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 27 (2), pp. 383–391.