Hypernasality refers to the perception of excessive nasal resonance in speech, caused by velopharyngeal dysfunction (VPD), an inability to achieve proper closure of the velum, the soft palate regulating airflow between the oral and nasal cavities. It is a common symptom in motor-speech disorders such as Parkinson’s Disease (PD), Huntington’s Disease (HD), amyotrophic lateral sclerosis (ALS), and cerebellar ataxia, as rapid movement of the velum requires very precise motor control. It is also the defining perceptual trait of cleft palate speech. Reliable detection of hypernasality is useful in both rehabilitative (e.g. tracking the progress of speech therapy) and diagnostic (e.g. early detection of neurological diseases) settings. Because of the promise hypernasality tracking shows for assessing neurological disease, there is interest in developing strategies that overcome the limitations of existing work. In this work we focus on developing automated metrics for hypernasality scoring that are robust to disease- and speaker-specific confounders.
Clinician perceptual assessment is the gold-standard technique for assessing hypernasality . However, this method has been shown to be susceptible to a wide variety of error sources, including stimulus type, phonetic context, vocal quality, articulation patterns, and previous listener experience and expectations . Additionally, these perceptual metrics have been shown to erroneously overestimate severity on high vowels when compared with low vowels , and vary based on broader phonetic context . Although these difficulties may be mitigated by averaging multiple clinician ratings, this further drives up costs associated with hypernasality assessment and makes its use as a trackable metric over time less feasible.
Automated hypernasality assessment systems have been proposed as an objective alternative to perceptual assessment. Instrumentation-based direct assessment techniques visualize the velopharyngeal closing mechanism using videofluoroscopy or magnetic resonance imaging (MRI) and provide information about velopharyngeal port size and shape. These methods are invasive and may cause pain and discomfort to the patients. As an alternative, nasometry seeks to measure nasalance, the modulation of the velopharyngeal opening area, by estimating the acoustic energy from the nasal cavity relative to the oral cavity. This is done by measuring the acoustic energy from two microphones separated by a plate that isolates the mouth from the nose. In some cases, nasalance scores yield a modest correlation with perceptual judgment of hypernasality [3, 69]; however, there is considerable evidence that this relationship depends on the person and the reading passages used during assessment. Because of this, the clinician’s perception of hypernasality is often the de facto gold standard in clinical practice. Furthermore, properly administering the evaluation requires significant training, and it cannot be used to evaluate hypernasality from existing speech recordings.
An appealing alternative to instrumentation-based techniques is the direct estimation of hypernasality from recorded audio. This family of methods aims to measure the atypical acoustic resonance resulting from VPD as an objective proxy for hypernasal speech. Previous work in this area can be categorized broadly in two groups: engineered features based on statistical signal processing  and supervised methods based on machine learning . The simple acoustic features fail to capture the complex manifestation of hypernasality in speech, as there is a great deal of person-to-person variability . The more complex machine learning-based metrics are prone to overfitting to the necessarily small disease-specific speech datasets on which they are trained, making it difficult to evaluate how effectively they generalize.
In this paper, we propose an approach that falls between these two extremes. We know that for voiced sounds hypernasal speech results in additional resonances at the lower frequencies . The acoustic manifestation of the additional resonance is difficult to characterize with simple features as it is dependent on several factors including the physio-anatomy of the speaker, the context, etc. For unvoiced sounds, hypernasal speech results in imprecise consonant production—the characteristic insufficient closure of the velopharyngeal port renders the speaker unable to build sufficient pressure in the oral cavity to properly form plosives, causing the air to instead leak out through the nose .
I-A Related work
Spectral analysis of speech is a potentially effective method to analyze hypernasality. Acoustic cues based on formant F1 and F2 amplitudes, bandwidths, and pole/zero pairs, as well as changes in the voice low tone/high tone ratio, have been proposed to detect or evaluate hypernasal speech. These spectral modifications in hypernasal speech will have an impact on articulatory dynamics, thereby affecting speech intelligibility. Statistical signal processing methods that seek to reverse these cues, such as suppressing the nasal formant peaks and then performing peak-valley enhancement, have demonstrated improvement in the perceptual qualities of hypernasal speech caused by cleft lip and palate, further demonstrating the connection between these cues and intelligibility. The large variability of speech degradation patterns across neurological disease or injury challenges simple features that are based on domain expertise. Overall, these simple features are not robust to the complicated acoustic patterns that emerge in hypernasality, and are prone to high false positive and false negative error rates in out-of-domain test cases.
In response, data-derived representations of hypernasality that combine more elemental speech features and supervised learning have been proposed. Mel-frequency cepstral coefficients (MFCCs) and other spectral transformations, glottal source related features (jitter and shimmer), the difference between the low-pass and band-pass profiles of the Teager Energy Operator (TEO), and non-linear features have all been proposed as model input features. Gaussian mixture models (GMMs), support vector machines, and deep neural networks have been used in conjunction with these features for hypernasality evaluation from word- and sentence-level data. Recently, end-to-end neural networks taking MFCC frames as input and producing hypernasality assessments as output have also been proposed.
These methods rely on supervised learning and are trained on small data sets. For our application they run the risk of overfitting to the data by focusing on associated disease-specific symptoms rather than the perceptual acoustic cues of hypernasality itself.
Features based on automatic speech recognition (ASR) acoustic models targeting articulatory precision have been used in nasality assessment systems. The nasal cognate distinctiveness measure uses a similar approach but assesses the degree to which specific stops sound like their co-located nasal sonorants .
We propose the Nasalization-Articulation Precision (NAP) features, addressing the limitations of the current methods with a hybrid approach that brings together domain expertise and machine learning. Our approach relies on a combination of two minimally-supervised acoustic models trained using only healthy speech data, with dysarthric speech data only used for training simple linear classifiers on top of the features. The first acoustic model learns a distribution of acoustic patterns associated with nasalization of voiced phonemes and the second acoustic model learns a distribution of acoustic patterns of precise articulation for unvoiced phonemes. For a dysarthric sample, the model produces phoneme-specific measures of nasalization and precise articulation. As a result, the features are intuitive and interpretable, and focus on hypernasality rather than other co-modulating factors.
In contrast to other approaches that rely on machine learning, the features do not rely on any clinical data for training. We show that these features can be combined using simple linear regression to develop a robust estimator of hypernasality that generalizes across different dysarthria corpora, outperforming both neural network-based and engineered feature-based approaches. Furthermore, we demonstrate that this estimator robustly captures the perceptual attributes of hypernasality by training on our full dysarthria corpus and evaluating on a cleft lip and palate (CLP) dataset, whose speakers are otherwise healthy apart from the cleft palate-induced hypernasality.
To evaluate the efficacy of this new feature set, we train and validate a linear model that estimates clinician-rated hypernasality scores across different neurological diseases to ensure that it is focusing on hypernasality and not other disease-specific dysarthria symptoms. We show that this representation correlates strongly with clinical perception of hypernasality across several different neurological disorders, even when trained on one disorder and evaluated on a different disorder. Such assessment has a potential advantage as an inexpensive, non-invasive and simple-to-administer method, scalable to large and diverse populations  .
The proposed hypernasality evaluation algorithm is based on the intuition that as the severity of hypernasality increases, two broad perceptible changes take place: the unvoiced phonemes become less precise and the voiced phonemes become nasalized. To that end, we model these perceptual changes at the phone level. In Fig. 1, we provide a high-level overview of the proposed hypernasality score estimation scheme.
After forced alignment and preprocessing, the input speech is routed phoneme-by-phoneme to one of two acoustic models. The voiced phonemes are analyzed using the first acoustic model, yielding an objective estimate of acoustic nasalization based on a likelihood ratio. The unvoiced phonemes are analyzed with the second model, which captures the production quality of unvoiced phonemes, objectively estimating articulatory precision as another likelihood ratio. Both the acoustic nasalization model and the articulatory precision model are trained using healthy speech from the LibriSpeech Dataset . The features from these models are averaged by phoneme, and then used as input to a simple linear model to predict clinician-assessed hypernasality ratings across several different neurological disorders. We describe the data and each of these processing steps in detail below.
II-A1 Healthy speech corpus
LibriSpeech is a public domain corpus of transcript-labelled healthy English utterances. It contains roughly 1000 hours of speech sampled at 16kHz. The speech consists of 1,128 female and 1,210 male speakers reading book passages aloud. It contains “clean” samples, which have been carefully segmented and aligned, as well as “other” samples which are more challenging to use . It is freely available for download at openslr.org. We use this corpus to train both acoustic models shown in Fig. 1.
II-A2 Dysarthric speech corpus
The database consists of recordings from 75 speakers (40 male and 35 female) of varying levels of hypernasality. The corpus contains data from speakers diagnosed with several different neurological disorders: 38 patients have Parkinson’s disease (PD), 6 patients have Huntington’s disease (HD), 16 patients have cerebellar Ataxia (A), and 15 patients have amyotrophic lateral sclerosis (ALS).
All individuals read the same set of five sentences, capturing a range of phonemes. Reading is an ideal stimulus for this task because it controls for the phonetic distributional variations that would be present in more spontaneous speech, and it enables consistency between speakers and between assessments over time, both ideal qualities for a clinical measure.
The perceptual evaluation of hypernasality from recorded samples was carried out by 14 different speech language pathologists on a scale of 1 to 7. The average hypernasality score for each speaker was used as the ground truth. The inter-rater reliability of the SLPs was moderate, with a Pearson Correlation Coefficient of 0.66 and an average inter-clinician mean absolute error of 1.44 on the 7-point scale. The sentences spoken were:
The supermarket chain shut down because of poor management.
Much more money must be donated to make this department succeed.
In this famous coffee shop they serve the best doughnuts in town.
The chairman decided to pave over the shopping center garden.
The standards committee met this afternoon in an open meeting.
The speech recordings were carried out in a sound-treated room using a microphone. Table I shows the breakdown of clinical characteristics of the subjects and the statistics of the nasality score (NS) subsets; S.D. denotes standard deviation. Figure 2 contains the clinician hypernasality score histograms for each disorder population.
|Disease||Male||Female||Mean Age||S.D. Age||Mean NS||S.D. NS|
II-A3 Cleft palate speech corpus
While we are chiefly concerned with evaluating hypernasality in dysarthric speakers exhibiting neuromuscular diseases, cleft lip and palate (CLP) speech is useful for evaluation purposes, as CLP speech often exhibits hypernasality without the other kinds of perceptual changes (slurring, generalized articulatory imprecision) that also arise in dysarthria. We use a corpus of 6 child and 12 adult CLP speakers with different levels of hypernasality severity that span the hypernasality range (from normal to extreme) in equal intervals, to demonstrate that our model chiefly captures hypernasality rather than any associated symptoms of neurologically disordered speech. These CLP speakers are otherwise healthy and exhibit no other co-modulating symptoms such as imprecise articulation resulting from other motor impairments. Because the hypernasality assessments for these speakers were performed by different clinicians than those for our dysarthric data, we focus on correlation alone to evaluate the performance of our hypernasality evaluation system on this speech.
II-B Data pre-processing
In this section, we formalize the notation to be used in the ensuing analysis. Consider an utterance $x$ with sampling rate $f_s$ and a corresponding transcript of phonemes $p_1, \dots, p_M$. We analyze $x$ in frames with a fixed frame length and overlap. For a frame indexed by $t$, $t = 1, \dots, T$, we extract a set of features, $\mathbf{x}_t$. The utterance is force-aligned at the phoneme level using the Montreal Forced Aligner (see Section III-D for a discussion of forced alignment performance on these dysarthric speech samples). We denote the data feature matrix for all frames that are aligned to phoneme $p$ by $\mathbf{X}_p$.
After alignment, we use different features for the acoustic nasalization model than for the articulatory precision model. For the nasalization model we use perceptual linear prediction (PLP) features because they better preserve acoustic cues that have been previously used to model hypernasality, including formant frequencies, bandwidths, and spectral tilt . For the articulatory precision model for unvoiced phonemes, we use mel-frequency cepstral coefficients (MFCCs), a common representation for automatic speech recognition applications . We also use a lower sampling rate for the voiced nasalization model (8kHz) compared to the unvoiced articulatory precision model (16kHz). This is motivated by the difference in spectral energy distribution between voiced and unvoiced sounds.
II-C Nasalization model
Our acoustic nasalization model is trained using healthy speech data from the LibriSpeech dataset. We model the distributions of two classes of voiced phonemes. The “oral”, non-nasal class ($\bar{\mathcal{N}}$) consists of all voiced oral consonants and all vowels from syllables where nasal consonants are not present. Similarly, we define the “nasal” class ($\mathcal{N}$) to contain the nasal consonants as well as half of each adjacent vowel surrounding them. These rules were implemented after alignment; an illustrative example of the two classes is shown in the third tier of the aligned example in Fig. 1.
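The class assignment rules can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the ARPAbet phone sets, the function name `label_segments`, and the choice to label only the vowel half nearest to a nasal consonant (leaving the far half unlabeled) are assumptions, and simple adjacency is used as a stand-in for syllable membership.

```python
# Assumed ARPAbet phone inventories (hypothetical, for illustration only).
NASALS = {"M", "N", "NG"}
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}
VOICED_ORAL_CONSONANTS = {"B", "D", "DH", "G", "JH", "L",
                          "R", "V", "W", "Y", "Z", "ZH"}


def label_segments(alignment):
    """Assign aligned phones to the 'nasal' or 'oral' class.

    alignment: list of (phone, start_sec, end_sec) tuples in time order.
    Returns a list of (class_label, start_sec, end_sec) tuples.
    Unvoiced phones are skipped; the vowel half farther from a nasal
    consonant is left unlabeled (an assumption of this sketch).
    """
    segments = []
    for i, (phone, start, end) in enumerate(alignment):
        prev_nasal = i > 0 and alignment[i - 1][0] in NASALS
        next_nasal = i + 1 < len(alignment) and alignment[i + 1][0] in NASALS
        midpoint = (start + end) / 2.0
        if phone in NASALS:
            segments.append(("nasal", start, end))
        elif phone in VOWELS:
            if prev_nasal and next_nasal:
                segments.append(("nasal", start, end))
            elif prev_nasal:
                segments.append(("nasal", start, midpoint))
            elif next_nasal:
                segments.append(("nasal", midpoint, end))
            else:
                segments.append(("oral", start, end))
        elif phone in VOICED_ORAL_CONSONANTS:
            segments.append(("oral", start, end))
    return segments
```

For the word "ban" aligned as B, AE, N, this sketch would place B in the oral class, the second half of AE in the nasal class, and N in the nasal class.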
For this task, we use 100 hours of clean-labeled speech from the LibriSpeech dataset. We first perform forced phone-alignment to the transcript as shown in Figure 1. We partition the phonemes into the nasal ($\mathcal{N}$) and non-nasal ($\bar{\mathcal{N}}$) classes. For each frame in each phoneme, we extract 13 PLP coefficients, giving two feature matrices, $\mathbf{X}_{\mathcal{N}}$ and $\mathbf{X}_{\bar{\mathcal{N}}}$, containing all frames of nasal PLPs in one and non-nasal PLPs in the other. To model the probability density functions, we use a 16-mixture Gaussian Mixture Model (GMM). The weight, mean, and covariance matrix for each of the GMM components are learned using the expectation maximization (EM) algorithm. The GMM for the nasal class is represented by $\{\mu_k^{\mathcal{N}}, \Sigma_k^{\mathcal{N}}, w_k^{\mathcal{N}}\}_{k=1}^{16}$. Here, $\mu_k$, $\Sigma_k$, and $w_k$ represent the mean, covariance matrix, and weight of the $k$-th Gaussian, respectively. Similarly, for the non-nasal class the GMM components are given by $\{\mu_k^{\bar{\mathcal{N}}}, \Sigma_k^{\bar{\mathcal{N}}}, w_k^{\bar{\mathcal{N}}}\}_{k=1}^{16}$.
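After training on healthy speech, we provide a segmented dysarthric utterance to evaluate the likelihood from each of the two learned probability density functions. For an out-of-sample input, we estimate the likelihood voiced phoneme by voiced phoneme. That is, for the data feature matrix $\mathbf{X}_p$ of phoneme $p$, the likelihood that this phoneme is nasalized is
$$P(\mathbf{X}_p \mid \mathcal{N}) = \prod_{t \in p} \sum_{k=1}^{16} w_k^{\mathcal{N}} \, \mathcal{G}\left(\mathbf{x}_t; \mu_k^{\mathcal{N}}, \Sigma_k^{\mathcal{N}}\right), \tag{1}$$
where $\mathbf{x}_t$ is the feature vector of frame $t$, $\mathcal{G}$ denotes the Gaussian density, and the notation $t \in p$ is shorthand for all frames aligned to phoneme $p$. Similarly, for the non-nasal class $\bar{\mathcal{N}}$, we have
$$P(\mathbf{X}_p \mid \bar{\mathcal{N}}) = \prod_{t \in p} \sum_{k=1}^{16} w_k^{\bar{\mathcal{N}}} \, \mathcal{G}\left(\mathbf{x}_t; \mu_k^{\bar{\mathcal{N}}}, \Sigma_k^{\bar{\mathcal{N}}}\right). \tag{2}$$
We use the log-likelihood ratio test statistic as a continuous measure of nasalization. In particular, we define
$$N(p) = \frac{1}{|p|} \log \frac{P(\mathbf{X}_p \mid \mathcal{N})}{P(\mathbf{X}_p \mid \bar{\mathcal{N}})}, \tag{3}$$
where $|p|$ represents the number of acoustic frames aligned to phoneme $p$. This statistic is calculated for every voiced, non-nasal phoneme in the input utterance. Thus, for a given speaker, a nasalization feature vector is computed containing the log-likelihood ratios of nasalization of the voiced phonemes (AA, AE, AH, AO, AW, AY, B, D, DH, EH, ER, EY, G, IY, JH, V, Z).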
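As a rough illustration of the frame-normalized log-likelihood ratio, the following sketch scores a phoneme's frames under two GMMs and returns the nasalization statistic. It is a minimal sketch under stated assumptions, not the paper's implementation: the GMMs here use diagonal covariances (the text describes full covariance matrices), and the names `gmm_frame_loglik` and `nasalization_score` as well as the `(weight, mean, variance)` tuple layout are hypothetical.

```python
import math


def gmm_frame_loglik(frame, gmm):
    """Log-density of one feature frame under a diagonal-covariance GMM.

    gmm: list of (weight, mean_vector, variance_vector) components.
    Uses the log-sum-exp trick for numerical stability.
    """
    component_logs = []
    for weight, mean, var in gmm:
        log_p = math.log(weight)
        for x, m, v in zip(frame, mean, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        component_logs.append(log_p)
    top = max(component_logs)
    return top + math.log(sum(math.exp(c - top) for c in component_logs))


def nasalization_score(frames, nasal_gmm, oral_gmm):
    """Average per-frame log-likelihood ratio (nasal vs. non-nasal).

    frames: list of feature vectors aligned to a single phoneme.
    Higher values mean the frames look more like the nasal class.
    """
    total = 0.0
    for frame in frames:
        total += gmm_frame_loglik(frame, nasal_gmm) - gmm_frame_loglik(frame, oral_gmm)
    return total / len(frames)
```

Summing per-frame log-likelihoods corresponds to treating frames as independent given the class, and dividing by the frame count keeps long and short phonemes on a comparable scale.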
For non-nasalized speech, we expect the value of $N(p)$ to be low, whereas for nasalized speech, we expect it to be high. Figure 3 shows a comparison of the value of the nasalization likelihood feature between a group of speakers with high hypernasality and a group with low hypernasality, as determined by perceptual rating. We average the nasalization feature over the 4 phonemes most relevant for predicting hypernasality (see Section III-B for details). As expected, the nasality feature value increases with the severity of hypernasality.
II-D Articulation model
For our implementation we used a triphone model trained with a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) on 960 hours of healthy native English speech data from the LibriSpeech corpus. The input features to the ASR model are second-order Mel-frequency cepstral coefficient (MFCC) features with utterance-level cepstral mean and variance normalization and a Linear Discriminant Analysis transformation. We use the Kaldi toolkit training scripts for training the model. In contrast to the nasalization model, here we use a sampling rate of 16 kHz to capture the wideband nature of unvoiced phonemes. The nasalization model required a much smaller training set (100 hours) since there were only two classes modeled by the GMM.
After training, the acoustic model can be queried using the Viterbi decoding algorithm for the posterior probability $P(q \mid \mathbf{X}_p)$ of a phoneme $q$ given a set of acoustic feature frames $\mathbf{X}_p$ representing a realization of some phoneme $p$. For a “well-articulated” phoneme, no phoneme apart from the one intended by the speaker should maximize this posterior.
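We use the acoustic model to assess articulatory precision as follows. Considering the set $Q$ of phonemes in the language, we assess the log-likelihood ratio of the frames from a given phoneme $p$ to the maximum log-likelihood across all phonemes,
$$AP(p) = \frac{1}{|p|} \log \frac{P(p \mid \mathbf{X}_p)}{\max_{q \in Q} P(q \mid \mathbf{X}_p)}, \tag{4}$$
where $\mathbf{X}_p$ denotes the feature frames aligned to phoneme $p$, $P(q \mid \mathbf{X}_p)$ is the acoustic model score for phoneme $q$ given those frames, and $|p|$ represents the number of acoustic frames aligned to phoneme $p$.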
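The articulatory precision ratio can be illustrated with a small helper. This is only a sketch: it assumes the per-phoneme log-scores for a segment have already been obtained from the acoustic model (for example, exported from a decoder) and are passed in as a plain dictionary; the function name `articulatory_precision` and this input format are hypothetical.

```python
def articulatory_precision(log_scores, intended, n_frames):
    """Frame-normalized log-ratio of the intended phoneme to the best phoneme.

    log_scores: dict mapping each candidate phoneme to its log-score for
                the frames of one aligned segment (assumed precomputed).
    intended:   the phoneme label given by the transcript alignment.
    n_frames:   number of acoustic frames aligned to the segment.
    Returns 0.0 when the intended phoneme is the best match, and a more
    negative value as some other phoneme explains the frames better.
    """
    best = max(log_scores.values())
    return (log_scores[intended] - best) / n_frames
```

Because the intended phoneme is itself one of the candidates, the value is never positive; a well-articulated phoneme scores near zero.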
This processing is performed after forced alignment to the transcript labels, and assessed for each unvoiced phoneme to permit by-phoneme analysis of precise articulation.
For speakers who exhibit little hypernasality, we expect the value of $AP(p)$ for unvoiced phonemes to be high, whereas for hypernasal speakers, we expect it to be lower. Figure 4 shows a comparison of the average value of the articulation precision feature between a group of subjects with high hypernasality and a group with low hypernasality, as determined by perceptual rating. We average the articulation scores for the most relevant phonemes for predicting hypernasality (see Section III-B). As expected, the articulation precision feature value decreases as the severity of hypernasality increases. Furthermore, we expect hypernasality to exhibit unique patterns of affected and unaffected unvoiced phonemes that are not general to dysarthria, making phoneme-level AP classification a valuable signal in quantifying hypernasality.
|Train on||Ataxia, HD, PD, ALS||HD, PD, ALS||Ataxia, PD, ALS||Ataxia, HD, ALS||Ataxia, PD, HD|
|Test on||Left-out speaker||Ataxia||HD||PD||ALS|
II-E Linear regression model
In the interest of generalization and clinical interpretability, simple linear ridge regression models are used to estimate the nasality score using the phoneme-averaged nasalization and articulatory precision features as input.
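A closed-form ridge regression of this kind can be sketched as follows. The paper does not state the regularization strength or any feature scaling, so `alpha`, the unpenalized intercept, and the function names `fit_ridge` and `predict` are assumptions made for illustration; NumPy is assumed to be available.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Fit ridge regression with an unpenalized intercept.

    X: (n_speakers, n_features) phoneme-averaged feature matrix.
    y: (n_speakers,) clinician hypernasality scores.
    alpha: L2 penalty strength (hypothetical default).
    Returns (weights, intercept).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean          # centering removes the intercept from the penalty
    yc = y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    weights = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ weights
    return weights, intercept


def predict(X, weights, intercept):
    """Predict hypernasality scores for new feature rows."""
    return np.asarray(X, dtype=float) @ weights + intercept
```

The learned weights attach one coefficient to each phoneme-level feature, which is what makes the model straightforward to inspect.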
Two different cross-validation strategies are used to evaluate model performance and the quality of the input features. First, we use leave-one-speaker-out (LOSO) cross-validation, as is typically done. To evaluate the generalization across out-of-domain diseases, we also perform leave-one-disease-out (LODO) cross-validation. In LODO, data for three of the neurological conditions is used for training and the fourth is used for testing.
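The two cross-validation schemes differ only in how speakers are held out, which the following sketch makes explicit. The `(speaker_id, disease)` tuple format and the function names `loso_splits` and `lodo_splits` are hypothetical.

```python
def loso_splits(speakers):
    """Leave-one-speaker-out: each speaker is the test set exactly once.

    speakers: list of (speaker_id, disease) tuples.
    Yields (held_out_speaker, train_ids, test_ids).
    """
    for i, (speaker_id, _) in enumerate(speakers):
        train = [s for j, (s, _) in enumerate(speakers) if j != i]
        yield speaker_id, train, [speaker_id]


def lodo_splits(speakers):
    """Leave-one-disease-out: all speakers of one disease form the test set.

    Yields (held_out_disease, train_ids, test_ids), one split per disease.
    """
    diseases = sorted({disease for _, disease in speakers})
    for held_out in diseases:
        train = [s for s, d in speakers if d != held_out]
        test = [s for s, d in speakers if d == held_out]
        yield held_out, train, test
```

With four disease groups, `lodo_splits` produces four splits, each training on three groups and testing on the remaining one.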
We evaluate the efficacy of the features in two ways: by analyzing the correlation between individual per-feature averages and speaker hypernasality rating, and by observing the performance of a simple linear regression model directly calculating the hypernasality score for a speaker from their features. For the hypernasality score computation task, we evaluate the performance of our model against two baselines. Models are compared using mean absolute error (MAE) and Pearson correlation coefficient (PCC) between the clinical perceptual hypernasality score and the predicted hypernasality scores.
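For reference, the two evaluation metrics can be computed as below. This is a plain-Python sketch with hypothetical function names; it assumes equal-length score lists with non-zero variance.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between clinical and predicted scores."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

The two metrics are complementary: a predictor that is offset by a constant can have a perfect PCC and a non-zero MAE, while a predictor that always outputs the mean can have a modest MAE with no meaningful correlation.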
III-A Hypernasality evaluation
In Table II, we show the results of the evaluations (LOSO: leave one speaker out; LODO: leave one disease out, for each of the four diseases) for five different models. We evaluated our NAP features against two comparison approaches representing the state of the art in engineered features and in supervised learning. The most predictive acoustic features for hypernasality presented in prior work were extracted using Praat source code provided by the authors. These formant features (FF) included formant amplitude, nasality peak amplitude, and normalized and raw difference. All features were extracted for each vowel and used in a linear and non-linear model to estimate the clinician-assessed hypernasality labels. The linear model is based on simple multiple regression, whereas the non-linear models are based on additive regression and $k$-nearest neighbor regression. The results of these models are labeled FF-Linear, FF-Additive, and FF-KNN in Table II. In addition, we implemented a previously proposed neural network consisting of three feed-forward layers with sigmoid activations. The inputs to the network are 39-dimensional Mel-frequency cepstral coefficients (MFCCs) computed with a 20 ms window length and no overlap. The hidden layer is of size 100, and the output layer of size 1. The output value is averaged across all frames to provide a single nasality score estimate per speaker. The model is trained using L1 loss and the Adam optimizer for 50 epochs with a learning rate of 0.001. This model’s results are reported in Table II as MFCC-NN.
The results show that the linear model based on NAP features consistently outperforms the baseline models, especially under the LODO conditions. The differences are also apparent when we analyze the individual LOSO correlation plots in Fig. 5. These scatter plots show the estimated hypernasality score for each speaker against the actual hypernasality score. As is clear from the figures, the correlation of the baseline methods is largely driven by the samples with very high nasality scores. The NAP model exhibits a linear trend between the predicted and actual values throughout the hypernasality range.
III-B Individual feature contributions
We use a simple forward selection algorithm for the LOSO model to identify the most predictive NAP features. The algorithm identifies the subset of features that minimizes the cross-validation mean squared error between the predicted hypernasality rating and the clinical hypernasality rating. Features are iteratively added until the cross-validation loss no longer decreases. This procedure results in 6 non-redundant features selected for prediction: the articulatory precision for T and F and the nasalization for D, B, IY, and AA. We plot the top features against the clinical perceptual nasality ratings in Fig. 6; Fig. 7 depicts the marginal improvement in LOSO correlation as features are added in order of decreasing feature prominence.
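The greedy procedure described here can be sketched generically. The sketch below is an illustration under assumptions: `cv_loss` is a hypothetical callback that returns the cross-validation loss for a given list of feature names (in the setting above, it would fit the linear model under LOSO and return the mean squared error), and ties are resolved by candidate order.

```python
def forward_select(candidates, cv_loss):
    """Greedy forward feature selection.

    candidates: iterable of feature names.
    cv_loss:    callable taking a list of feature names and returning
                the cross-validation loss for a model using them.
    Adds the feature that lowers the loss most at each step and stops
    as soon as no remaining feature lowers the loss further.
    Returns (selected_features, best_loss).
    """
    selected = []
    remaining = list(candidates)
    best_loss = float("inf")
    while remaining:
        trial_losses = {f: cv_loss(selected + [f]) for f in remaining}
        best_feature = min(trial_losses, key=trial_losses.get)
        if trial_losses[best_feature] >= best_loss:
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_loss = trial_losses[best_feature]
    return selected, best_loss
```

The order in which features enter `selected` gives a natural ranking of feature prominence.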
III-C Relationship between articulatory precision and hypernasality
Articulatory precision and hypernasality are tightly linked. Hypernasal speech results in impaired articulatory precision. However, articulatory impairments can occur in motor-speech disorders for a variety of reasons. The neurological conditions we study herein impact several aspects of speech production, including respiration, voicing, resonance, and articulation. This brings up two important questions:
Do our features capture changes related to hypernasality that go beyond changes in articulatory precision?
Are our features sensitive to changes in articulatory precision that result from only hypernasality (and not other articulatory impairments resulting from dysarthria)?
In an attempt to decouple articulatory precision from hypernasality, we collect clinical articulatory precision ratings (in addition to the hypernasality ratings) from the same clinicians. The inter-rater reliability of the ratings was robust, with a Pearson correlation coefficient of 0.75 and a mean absolute error of 1.01 on a 7-point scale.
To answer the first question above, and demonstrate that our features capture information beyond changes in articulatory precision, we use a multiple linear regression model with clinician-rated articulatory precision alongside our six most predictive features (N(AA), N(IY), N(B), N(D), AP(T), AP(F)) as independent variables. The dependent variable is the clinical hypernasality rating. We once again use the forward selection algorithm on Pearson correlation coefficient to cumulatively select the most predictive features. The results are depicted in Figure 8. As expected, the subjective AP rating is most predictive as there is significant overlap with hypernasality, and it is selected first. In the presence of this generalized measure of articulatory precision, it makes sense that AP(T, F), features that are themselves estimating AP, would not be selected. This reinforces the rationale for their inclusion in the model. Three nasalization features, N(IY, AA, D), are able to further improve the correlation of the linear model predictions.
To answer the second question, and demonstrate that our features are sensitive to hypernasality alone, we evaluate a linear model trained on our full dataset of dysarthric speech using the six most predictive features predicting hypernasality scores for the 18 speech samples from individuals with cleft lip and palate in our CLP dataset. The linear hypernasality model trained on our dysarthric speech corpus achieves a PCC of 0.89 for predicting the adult hypernasality severity, and 0.82 for predicting the hypernasality level of the children. This provides additional evidence that our features capture the perceptual quality of hypernasality and not other co-modulating symptoms.
III-D Effectiveness of forced alignment
The features we have proposed herein rely on force aligning known transcripts to dysarthric speech . This can be problematic as coarticulation, blending, missed targets, distorted vowels, and poor articulation present in severely disordered speech  may interfere with the appropriate matching of dictionary phoneme-word pairs to the realized sounds .
We directly evaluate the prevalence of alignment errors generated by our forced alignment methodology using manually aligned transcripts. Two annotators produced word- and syllable-level aligned transcripts using the same spelling and phoneme-word conventions employed in the acoustic model dictionary for all utterances in the dataset. For each speaker in the dataset, we count word- and phone-level alignment errors based on the position of the center point $c$ of a word or phoneme as assessed by the forced aligner and the beginning and end of the corresponding word or syllable, $[t_s, t_e]$, as assessed by the human transcriber. For each word or phoneme, the error is counted as $e = 0$ if $t_s \le c \le t_e$, and $e = \max(|c - t_s|, |c - t_e|)$ otherwise. That is, this error measure returns 0 if the center of the phoneme falls within the syllable; otherwise it returns the maximum error between the center of the automatically aligned phoneme and the start and end of the manually aligned syllable. In Figure 9 we show the alignment error (in seconds) against the hypernasality rating to show how alignment error rates progress as hypernasality increases. The results clearly show that forced alignment works effectively for all but the most severely hypernasal speakers.
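The alignment error measure is simple enough to state directly in code. The following sketch follows the verbal definition above; the function name `alignment_error` and the use of seconds for all times are assumptions.

```python
def alignment_error(center, start, end):
    """Alignment error for one automatically aligned word or phoneme.

    center: center time of the unit from the forced aligner (seconds).
    start, end: boundaries of the matching manually aligned word or
                syllable (seconds).
    Returns 0.0 when the center lies inside the manual interval, and
    otherwise the larger of its distances to the two boundaries.
    """
    if start <= center <= end:
        return 0.0
    return max(abs(center - start), abs(center - end))
```

Per-speaker error can then be summarized by averaging this value over all aligned units for that speaker.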
These results also indicate that our objective hypernasality ratings for the most imprecise speakers are not reliable. While this is a limitation of the approach, it is not severely limiting. In most cases, clinicians are more concerned with evaluating speakers in the mild-moderate end of the scale where they can monitor disease progress early or evaluate the effects of an intervention. This is less common for later stages of disease.
It is interesting to note that, while the alignment is poor, the model still yields high hypernasality scores for imprecise speakers. Precise alignment for speakers in this range is simply not possible, manually or otherwise. It is likely that the poor hypernasality ratings predicted by the model are driven by the poor alignment itself.
The model based on NAP features outperforms both baselines across all settings. In the LOSO condition, the MFCC-NN approach outperforms the simpler formant features in PCC. While the formant feature model does achieve a lower MAE, this seems to be a result of largely predicting the mean, with only a very modest upward-sloping trend in Figure 5(a), as opposed to Figures 5(b) and 5(c), which clearly show upward-sloping trends.
In the LODO conditions, the formant features and MFCC-NN models perform unpredictably. On some disease classes, MFCC-NN outperforms FF, while the opposite is true for others. By comparison, the NAP achieves consistent performance across all LODO classes. This suggests that these features are a robust measure of hypernasality, relatively invariant to the disease-specific co-modulating variables that hinder the performance of the baselines on the same task. The nasalization features in the NAP, by virtue of being trained on a large corpus of healthy speech, and targeting a specific perceptual quality are simultaneously more robust to both the disease-specific overfitting expected from NN methods such as  and speaker-to-speaker variances discussed in the design of the formant-based A1P0 and related features in , . Articulatory precision features are robust in a similar way.
One of the added benefits of the proposed approach over the baseline methods is the direct interpretability of the individual NAP features. While it is not immediately clear how MFCC features or formant-based features are expected to change with different hypernasality levels, the proposed features are easy to interpret.
The nasalization and articulatory precision features behave as expected in the feature-level analyses: the nasalization log-likelihood of the phonemes increases as hypernasality increases, while the articulatory precision decreases (Figure 6). Analysis of Eqns (3) and (4) shows why. As hypernasality increases, the voiced phonemes become more and more like the nasalized class in the acoustic model in Section II, so the nasalization likelihood rises. Similarly, as hypernasality increases, the acoustics of the unvoiced phonemes become less and less like the intended target, and the ratio in Eqn (4) therefore decreases.
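The direction of these trends can be illustrated with a small sketch of likelihood-ratio scoring. Eqns (3) and (4) are defined earlier in the paper and are not reproduced here, so the two functions below are assumptions: a generic per-frame log-likelihood ratio between a nasal and a non-nasal class, and a goodness-of-pronunciation-style score that compares the intended phoneme against the best-scoring phoneme. The function names and all log-likelihood values are hypothetical.

```python
def nasalization_score(ll_nasal, ll_oral):
    """Mean per-frame log-likelihood ratio of a nasal class vs. a non-nasal class.

    Assumed form for illustration; higher values mean the frames look more nasal.
    """
    return sum(n - o for n, o in zip(ll_nasal, ll_oral)) / len(ll_nasal)

def precision_score(frame_lls, target):
    """Mean per-frame log-likelihood of the target minus that of the best phoneme.

    Assumed GOP-style form for illustration. The score is at most 0, and values
    closer to 0 mean the frames match the intended phoneme more closely.
    Each element of frame_lls maps a phoneme label to its log-likelihood.
    """
    total = 0.0
    for lls in frame_lls:
        total += lls[target] - max(lls.values())
    return total / len(frame_lls)

# Hypothetical frames of a voiced stop: nasal vs. non-nasal log-likelihoods.
healthy_b = nasalization_score(ll_nasal=[-5.0, -4.5], ll_oral=[-2.0, -2.5])
hypernasal_b = nasalization_score(ll_nasal=[-2.0, -2.2], ll_oral=[-3.5, -3.0])

# Hypothetical frames of an intended unvoiced "T".
precise_t = [{"T": -1.0, "N": -4.0, "D": -2.5}, {"T": -1.2, "N": -3.8, "D": -2.0}]
imprecise_t = [{"T": -3.0, "N": -1.5, "D": -2.0}, {"T": -2.8, "N": -1.0, "D": -2.2}]

print("nasalization:", healthy_b, "->", hypernasal_b)
print("precision:", precision_score(precise_t, "T"), "->",
      precision_score(imprecise_t, "T"))
```

Under these assumptions, the nasalization score rises and the precision score falls as the toy frames move away from the healthy targets, matching the directions reported in Figure 6.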
During the feature selection analysis in Section III-B, certain consonants appeared prominently. In particular, the nasalization features for phonemes D and B, as well as the articulatory precision features for T and F, were prominent. T, B, and D are referred to as “nasal cognates” in , as the bilabial consonant B shares a place of articulation with the bilabial nasal M, and the lingua-alveolar consonants T and D share a place of articulation with the lingua-alveolar nasal N. Leakage through the nasal cavity will interfere with the production of all of these phonemes, and in the voiced case, they will sound like their corresponding nasal phonemes. It is not surprising that the nasalization model is most sensitive to these phonemes, since that model is trained on healthy speech, where the nasalized class consists mostly of instances of M and N and their surrounding vowels.
Through the same analysis, the most prominent vowels selected were AA and IY. AA is the most open and back vowel in English, whereas IY is the most close and front. It may be the case that these extreme ends of the vowel chart exhibit more noticeable patterns of nasalization, either on a perceptual level or just in their PLP-nasalization feature realization.
In spite of its robustness, the NAP model has limitations. Most limiting is its reliance on aligned transcripts to perform the estimation. The results shown in this paper were based on forced alignment, which is always possible when the ground-truth transcript is known but is not feasible for spontaneous speech. The robustness of the model comes from the fact that it is trained on a large corpus of healthy speech; however, this training also induces a bias in the model. As the feature selection results show, the model is adept at detecting hypernasal speech from phonemes that look similar to nasals in healthy speech. However, it is impossible to capture nasalization acoustic patterns for unvoiced speech, since these sounds never occur in healthy speech and hence cannot be captured in our model. As a result, we use articulatory precision as a proxy for nasalization for these sounds. Increased hypernasality typically implies reduced articulatory precision, but the converse is not necessarily true: speakers may exhibit reduced precision for reasons other than hypernasality. As we showed with the CLP speech experiments, when the reduction in articulatory precision is due to hypernasality, the model generalizes out-of-disease quite well.
We have presented and demonstrated the Nasalization-Articulation Precision (NAP) features for objective estimation of hypernasality. This method leverages a data-driven approach to learning expert-designed features on healthy speech that capture perceptible elements in hypernasal speech. We demonstrated that these features, when evaluated on disordered speech, track the expected trends in perceptual hypernasality ratings, and can be used with ridge regression to estimate a clinician-rated hypernasality score more accurately than several representative baseline methods. Additionally, we demonstrated that the NAP predictions of hypernasality rating generalize across diseases with significantly less loss in accuracy than existing approaches. This implies that the NAP features are a robust basis for estimating hypernasality in dysarthria.
The chief limitation of this approach, and of articulatory precision estimation techniques more generally, is a reliance on known transcripts with which alignment may be performed. Neural models for directly assessing articulatory precision from raw speech audio are a promising future research direction. Such models could provide simultaneous identification and precision assessment of phonemes on the fly, and could provide downstream representations that drive characterization of hypernasality without relying on reading as a stimulus or on known transcripts for assessment.
We plan to expand on this work by collecting a larger dataset of nasality-scored dysarthric speech, representing more diseases, and designing stimuli better tailored for this task. Furthermore, we will work to apply insights from this work to improve the robustness of neural models for the estimation of articulatory precision, nasality, and other objective speech biomarkers.
-  (2009) A comparative study of two acoustic measures of hypernasality. Journal of Speech, Language, and Hearing Research. Cited by: §I-A.
-  (2014) Instrumental assessment of velopharyngeal function and resonance: A review. Journal of Communication Disorders. Cited by: §I.
-  (2010) The relationship between nasalance scores and nasality ratings obtained with equal appearing interval and direct magnitude estimation scaling methods. The Cleft Palate-Craniofacial Journal 47 (6), pp. 631–637. Cited by: §I.
-  (1996) A noninvasive technique for detecting hypernasal speech using a nonlinear operator. IEEE Transactions on Biomedical Engineering. Cited by: §I-A.
-  (1996) A noninvasive technique for detecting hypernasal speech using a nonlinear operator. IEEE transactions on biomedical engineering 43 (1), pp. 35. Cited by: §I-A.
-  Deviant Speech Characteristics in Motor Neuron Disease. JAMA Otolaryngology–Head & Neck Surgery 100 (3), pp. 212–218. Cited by: §I.
-  (2006) Acoustic speech analysis for hypernasality detection in children. In 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 5507–5510. Cited by: §I-A.
-  (2016) The Americleft Speech Project: a training and reliability study. The Cleft Palate-Craniofacial Journal 53 (1), pp. 93–108. Cited by: §I.
-  (1997) Acoustic correlates of English and French nasalized vowels. The Journal of the Acoustical Society of America 102 (4), pp. 2360–2370. Cited by: §III-A, §IV.
-  (2016) Zero time windowing based severity analysis of hypernasal speech. In 2016 IEEE Region 10 Conference (TENCON), pp. 970–974. Cited by: §I-A.
-  (2018) Pitch-adaptive front-end feature for hypernasality detection. Proc. Interspeech 2018, pp. 372–376. Cited by: §I-A.
-  (2018) Detection of hypernasality based on vowel space area. Journal of Acoustical Society of America. Cited by: §I-A.
-  (1995) Motor speech disorders: substrates, differential diagnosis, and management. Mosby. Cited by: §I.
-  (2017) The role of the speech pathologist in the care of the patient with cleft palate. In Maxillofacial Surgery (Third Edition), P. A. Brennan, H. Schliephake, G.E. Ghali, and L. Cascarini (Eds.), pp. 1014–1023. Cited by: §I.
-  (1985) Detection of nasalized vowels in American English. Proceedings of ICASSP, Vol. 4, pp. 1569–1572. Cited by: §I-A.
-  (2017) Automatic identification of hypernasality in normal and cleft lip and palate patients with acoustic analysis of speech. Journal of Acoustical Society of America. Cited by: §I-A.
-  (2003) Automatic speech recognition with sparse training data for dysarthric speakers. In Eighth European Conference on Speech Communication and Technology, Cited by: §III-D.
-  (2004) Revisiting dysarthria assessment intelligibility metrics. In Eighth International Conference on Spoken Language Processing, Cited by: §III-D.
-  (1985) Acoustic and perceptual correlates of the non-nasal–nasal distinction for vowels. Journal of Acoustical Society of America. Cited by: §I-A.
-  (2014) Automatic evaluation of hypernasality and consonant misarticulation in cleft palate speech. IEEE Signal Processing Letters. Cited by: §I-A.
-  (2018) A survey on machine learning approaches for automatic detection of voice disorders. Journal of Voice. Cited by: §I.
-  (1991) Comparison between multiview videofluoroscopy and nasendoscopy of velopharyngeal movements. The Cleft Palate-Craniofacial Journal 28 (4), pp. 413–418. Note: PMID: 1742312. Cited by: §I.
-  (1990) Perceptual linear predictive (PLP) analysis of speech. the Journal of the Acoustical Society of America 87 (4), pp. 1738–1752. Cited by: §II-B.
-  (2008) Magnetic resonance imaging as an aid in the dynamic assessment of the velopharyngeal mechanism in children. Plastic and reconstructive surgery 122 (2), pp. 572. Cited by: §I.
-  (1996) Spectral properties and quantitative evaluation of hypernasality in vowels. The Cleft palate-craniofacial journal 33 (1), pp. 43–50. Cited by: §I-A.
-  (2001) The relationship between spectral characteristics and perceived hypernasality in children. The Journal of the Acoustical Society of America 109 (5), pp. 2181–2189. Cited by: §I-A.
-  (1996) Some limits to the auditory-perceptual assessment of speech and voice disorders. American Journal of Speech Language Pathology. Cited by: §I.
-  (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §III-A.
-  (2018) Examining factors influencing the viability of automatic acoustic analysis of child speech. Journal of Speech, Language, and Hearing Research 61 (10), pp. 2487–2501. Cited by: §III-D.
-  (2005) Perception of hypernasality and its physical correlates. Oral Science International 2 (1), pp. 21–35. Cited by: §I-A.
-  (2002) Efficacy of continuous positive airway pressure for treatment of hypernasality. The Cleft palate-craniofacial journal 39 (3), pp. 267–276. Cited by: §II-A3.
-  (2000) Speech and language issues in the cleft palate population: the state of the art. The Cleft palate-craniofacial journal 37 (4), pp. 1–35. Cited by: §I.
-  Velopharyngeal closure force and levator veli palatini activation levels in varying phonetic contexts. Journal of Speech Language and Hearing Research. Cited by: §I.
-  Evaluation of hypernasality in vowels using voice low tone to high tone ratio. The Cleft Palate-Craniofacial Journal 46 (1), pp. 47–52. Cited by: §I-A.
-  (2006) Voice low tone to high tone ratio: a potential quantitative index for vowel [a:] and its nasalization. IEEE transactions on biomedical engineering 53 (7), pp. 1437–1439. Cited by: §I-A.
-  (1996-07) Evaluation and Treatment of Resonance Disorders. Language, Speech, and Hearing in Schools 27, pp. 271–281. Cited by: §I.
-  (1961) Phonetic elements and perception of nasality. Journal of Speech, Language, and Hearing Research. Cited by: §I.
-  (2004) Methodology for perceptual assessment of speech in patients with cleft palate: A critical review of the literature. The Cleft Palate-Craniofacial Journal 41 (1), pp. 64–70. Note: PMID: 14697067. Cited by: §I.
-  (2008) Analysis of hypernasal speech in children with cleft lip and palate. In International Conference on Text, Speech and Dialogue, pp. 389–396. Cited by: §I-A, §I-A.
-  (2019-04) MontrealCorpusTools/Montreal-Forced-Aligner: Version 1.0.1. Cited by: §II-B.
-  (2010) Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083. Cited by: §II-B.
-  (2014) Pattern recognition of hypernasality in voice of patients with cleft and lip palate. In 2014 XIX Symposium on Image, Signal Processing and Artificial Vision, pp. 1–5. Cited by: §I-A.
-  (2017) Hypernasality severity analysis in cleft lip and palate speech using vowel space area. In INTERSPEECH, pp. 1829–1833. Cited by: §I-A.
-  (2012) Automatic detection of hypernasal speech signals using nonlinear and entropy measurements. In Thirteenth Annual Conference of the International Speech Communication Association, Cited by: §I-A.
-  (2015) Characterization methods for the detection of multiple voice disorders: neurological, functional, and laryngeal diseases. IEEE journal of biomedical and health informatics 19 (6), pp. 1820–1828. Cited by: §I-A, §I-A.
-  (2013) Nonlinear dynamics for hypernasality detection in Spanish vowels and words. Cognitive Computation 5 (4), pp. 448–457. Cited by: §I-A.
-  (2015-04) Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §II-A1, §II-C, §II-D, §II.
-  (2016) Nasometer II: Model 6450. Cited by: §I.
-  (2019-05) Objective assessment of vocal tremor. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6386–6390. Cited by: §I-B.
-  (2012) Acoustic analysis and non linear dynamics applied to voice pathology detection: A review. Recent Patents on Signal Processing 2 (2), pp. 96–107. Cited by: §I.
-  (2001) A noninvasive estimation of hypernasality using a linear predictive model. Annals of biomedical Engineering 29 (7), pp. 587–594. Cited by: §I-A.
-  (2011) Automatic detection of hypernasality in children. In International Work-Conference on the Interplay Between Natural and Artificial Computation, pp. 167–174. Cited by: §I-A.
-  ALS longitudinal studies with frequent data collection at home: study design and baseline data. Amyotroph Lateral Scler Frontotemporal Degener 20 (1-2), pp. 61–67. Cited by: §I-B.
-  (2016) Hypernasality associated with basal ganglia dysfunction: evidence from Parkinson’s disease and Huntington’s disease. PeerJ 4, pp. e2530. Cited by: §I.
-  (2019-05) Objective measures of plosive nasalization in hypernasal speech. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6520–6524. Cited by: §I-A, §II-D, §IV.
-  (2017) Assessment of nasalance and nasality in patients with a repaired cleft palate. European Archives of Oto-Rhino-Laryngology 274 (7), pp. 2845–2854. Cited by: §I.
-  (2013) Using Praat for linguistic research. University of Colorado at Boulder Phonetics Lab. Cited by: §III-A.
-  (2015) On the acoustical and perceptual features of vowel nasality. Cited by: §III-A, §IV.
-  (2007) Simulation and analysis of nasalized vowels based on magnetic resonance imaging data. Journal of Acoustical Society of America. Cited by: §I-A.
-  (1993) Hypernasality in dysarthric speakers following severe closed head injury: a perceptual and instrumental analysis. Brain Injury 7 (1), pp. 59–69. Cited by: §I.
-  (1995) Hypernasality in Parkinson’s disease: A perceptual and physiological analysis. J Med Speech-Lang Pathol 3 (2), pp. 73–84. Cited by: §I.
-  (2018-01-01) Investigating the role of L1 in automatic pronunciation evaluation of L2 speech. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2018-September, pp. 1636–1640. Cited by: §II-D.
-  (2007) Acoustic analysis and detection of hypernasality using a group delay function. IEEE Transactions on Biomedical Engineering. Cited by: §I-A.
-  (2009) Selective pole modification-based technique for the analysis and detection of hypernasality. In TENCON 2009-2009 IEEE Region 10 Conference, pp. 1–5. Cited by: §I-A.
-  (2018) Estimation of hypernasality scores from cleft lip and palate speech. In Proc. Interspeech 2018, pp. 1701–1705. Cited by: §I-A, §III-A, §IV.
-  (2018) Estimation of hypernasality scores from cleft lip and palate speech. Proc. Interspeech 2018, pp. 1701–1705. DOI: 10.21437/Interspeech.2018-1631. Cited by: §I-A.
-  (2016) Spectral enhancement of cleft lip and palate speech. In INTERSPEECH, Cited by: §I-A.
-  (2015-01) Nasality in Friedreich ataxia. Clin Linguist Phon 29 (1), pp. 46–58. Cited by: §I.
-  (1993) The relationship between nasalance and nasality in children with cleft palate. Journal of Communication Disorders 26 (1), pp. 13–28. Cited by: §I.
-  (2000-02) Phone-level pronunciation scoring and assessment for interactive language learning. Speech Commun. 30 (2-3), pp. 95–108. Cited by: §II-D.
-  (2012-11) Velopharyngeal dysfunction. Semin Plast Surg 26 (4), pp. 170–177. Cited by: §I.
-  (2015) Improving automatic forced alignment for dysarthric speech transcription. In Sixteenth Annual Conference of the International Speech Communication Association, Cited by: §III-D.
-  (2009) Classifying hypernasality using the pitch and formants. Proceedings of the 6th International Conference on Information Technology: New Generations. Cited by: §I-A.