
Robustness against the channel effect in pathological voice detection

Many people are suffering from voice disorders, which can adversely affect the quality of their lives. In response, some researchers have proposed algorithms for automatic assessment of these disorders, based on voice signals. However, these signals can be sensitive to the recording devices. Indeed, the channel effect is a pervasive problem in machine learning for healthcare. In this study, we propose a detection system for pathological voice, which is robust against the channel effect. This system is based on a bidirectional LSTM network. To increase the performance robustness against channel mismatch, we integrate domain adversarial training (DAT) to eliminate the differences between the devices. When we train on data recorded on a high-quality microphone and evaluate on smartphone data without labels, our robust detection system increases the PR-AUC from 0.8448 to 0.9455 (and 0.9522 with target sample labels). To the best of our knowledge, this is the first study applying unsupervised domain adaptation to pathological voice detection. Notably, our system does not need target device sample labels, which allows for generalization to many new devices.



1 Introduction

Bhattacharyya (2014) reported that around people in the United States suffer from voice pathology. They have difficulties communicating with others, which can substantially reduce their quality of life. Most importantly, such diseases usually get worse if they are not treated at an early stage, resulting in substantial costs for individuals and, depending on jurisdiction, for public insurers. Unfortunately, two main factors make detecting voice pathology difficult. First, it can be hard for individuals to notice their own degenerative voice problems at early, subtle stages, although these are more evident to speech-language pathologists. Second, pathologists may have limited availability due to ongoing pressures on the profession. Consequently, assessment is currently not optimal.

To enable early automatic diagnosis of pathological speech, some studies used traditional speech methods, such as time-frequency approaches (Umapathy et al., 2005), Mel frequency cepstral coefficients (MFCCs) (Fraile et al., 2009), Gaussian mixture models (GMMs) (Godino-Llorente et al., 2006), hidden Markov models (Costa et al., 2008), and wavelet coefficients (Salhi et al., 2010), to design models for differentiating between normal and pathological voice samples. Lee et al. (2016) confirmed the feasibility of using an automatic speech recognition system for voice and speech disorders analysis and assessment. Wu et al. (2018) proposed using a convolutional deep belief network (CDBN) to pre-train the weights of a convolutional neural network (CNN) for pathological voice detection; with this pre-training, the system can yield satisfactory performance even when only a small amount of training data is available.

Arias-Londoño et al. (2011) reached 98% accuracy on the Massachusetts Eye and Ear Infirmary (MEEI) voice pathology database (Umapathy et al., 2005) by combining two classifiers. Fang et al. (2018) were the first to use deep multilayer perceptrons (MLPs) to identify pathological voice, achieving 99% accuracy on the same dataset. These studies have shown that pathological voice can be discriminated through speech processing techniques.

However, these methods have yet to be made widely accessible to the public. We wish to balance the acuity of high-quality recordings with the availability of modern mobile devices, and to use models trained on either one. There are two main challenges – the first is the relative sparsity of device-specific data, and the second is the wide array of potential devices.

Along similar lines, Alhussein and Muhammad (2018) used a fine-tuned CNN model to transfer voice samples from one device to a target device. However, this approach relies on labeled samples from the target device, which are usually costly to acquire in practice. Therefore, it is necessary to design an unsupervised architecture that can adapt to a target device using unlabeled target samples. In this work, we design a system that minimizes the channel effects between recording devices. Our framework addresses the problem of few and unlabeled target samples with unsupervised learning.

2 Methodology

Figure 1: The proposed robust pathological voice detection system. Data preprocessing transforms the voice samples to input features $x$ (MFCCs or filter banks). The bidirectional long short-term memory (BLSTM) network encodes input features $x$ to an embedding $z$. The label predictor takes the encoded features $z$ and estimates the probability $P(y \mid x)$ of pathological voice. The gradient reversal layer (GRL) is inserted before the device classifier. The GRL multiplies the gradient by a negative factor ($-\lambda$ in the notation of Ganin et al. (2016)) during backward propagation. With the GRL, the BLSTM encoder learns features without any device information.

2.1 Front-end feature extraction

We extract 26-dimensional MFCCs and 40-dimensional filter bank features, following conventions in the acoustic processing literature (Ravikumar and S., 2011; Ganchev et al., 2005). To avoid losing information to boundary effects, we set the frame shift to half of the window length.
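The half-window frame shift can be illustrated with a short NumPy sketch. The function name `frame_signal` and the synthetic 220 Hz tone are hypothetical; the 32 ms window and 44,100 Hz sampling rate are taken from Sections 3.1 and 5.1. This is a minimal sketch, not the authors' preprocessing code.

```python
import numpy as np

def frame_signal(signal, sample_rate, window_ms=32.0):
    """Split a 1-D signal into frames whose shift is half the window length."""
    window = int(round(sample_rate * window_ms / 1000.0))  # samples per frame
    shift = window // 2                                     # half-window frame shift
    if len(signal) < window:
        raise ValueError("signal is shorter than one window")
    n_frames = 1 + (len(signal) - window) // shift
    # Row i holds samples [i * shift, i * shift + window)
    idx = np.arange(window)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]

# One second of a synthetic tone (illustrative input, not study data)
sr = 44100
t = np.arange(sr) / sr
frames = frame_signal(np.sin(2 * np.pi * 220 * t), sr)
print(frames.shape)
```

Because consecutive frames overlap by half a window, samples near the edge of one frame sit near the middle of a neighboring frame.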

2.2 Back-end robust pathological voice detection system

In our formulation, each data point is represented by the triplet {input utterance, device, label}, denoted by $(x, d, y)$. Our full model consists of a bidirectional long short-term memory (BLSTM) encoder, a dense layer label predictor, and a device classifier, whose parameters are denoted by $\theta_e$, $\theta_y$, and $\theta_d$ respectively. The BLSTM maps the input utterance to an embedding, $z$.

2.2.1 Detection model

Our detection model contains a BLSTM (Hochreiter and Schmidhuber, 1997) that encodes a latent vector $z$ for each input audio clip $x$. On top of it, a fully-connected dense layer classifier estimates the probability of the voice being pathological: $P(y \mid x; \theta_e, \theta_y)$.

2.2.2 Unsupervised domain adaptation

To generalize the detection model across devices, it is necessary to eliminate the effects of channel mismatch caused by different devices. To address the two main difficulties (i.e., the small number of target samples and the lack of labels), we propose an unsupervised domain adaptation system as follows. Similar to Ganin et al. (2016), our model is encouraged to learn a domain-invariant embedding without using target domain labels. The training scheme can be written as:

$$\min_{\theta_e, \theta_y} \; \max_{\theta_d} \sum_{(x, d, y) \in S \cup T} \Big[ -\log P(y \mid x; \theta_e, \theta_y) + \lambda \log P(d \mid x; \theta_e, \theta_d) \Big] \tag{1}$$

where $S$ and $T$ denote source and target domain data respectively, $\theta_e$, $\theta_y$, and $\theta_d$ are the parameters of the encoder, label predictor, and device classifier, and $\lambda$ weights the device term. The encoder and label predictor minimize the negative log-probability of the label. The device classifier maximizes the log-probability of the device as usual, while the encoder minimizes it. Intuitively, the BLSTM encoder tries to learn an embedding that makes label prediction easy but device classification difficult (i.e., by wiping out information related to devices).

As shown in Fig. 1, the BLSTM encodes each input sample into an embedding $z$. We implement the simultaneous maximization and minimization of the device loss by inserting a gradient reversal layer (GRL). The GRL has no effect during the forward pass, but it multiplies the gradient by a negative factor ($-\lambda$ in the notation of Ganin et al. (2016)) in back propagation.
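The behavior of the GRL can be sketched without any deep learning framework. The class below is a hypothetical, hand-written forward/backward pair in NumPy, with `lam` standing for the magnitude of the negative factor; it only illustrates the mechanism and is not the TensorFlow implementation used in this work.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales the gradient by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # assumed adversarial weight; its value is not given in the text

    def forward(self, z):
        # The embedding reaches the device classifier unchanged.
        return z

    def backward(self, grad_output):
        # The gradient flowing back to the encoder is reversed and scaled.
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
z = np.array([0.2, -1.0, 3.0])                          # toy embedding
out = grl.forward(z)
grad_from_device_classifier = np.array([1.0, -2.0, 0.5])  # toy upstream gradient
grad_to_encoder = grl.backward(grad_from_device_classifier)
print(out, grad_to_encoder)
```

The device classifier itself is updated with the ordinary gradient, so it keeps learning to identify the device, while the encoder receives the reversed gradient and is pushed to remove device information.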

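In this work, we focus on unsupervised domain adaptation. We modify the formulation, assuming no knowledge about the labels for the voice samples from the target device (the target domain $T$), which means no $y$ for samples in $T$, as follows:

$$\min_{\theta_e, \theta_y} \; \max_{\theta_d} \Big[ \sum_{(x, d, y) \in S} -\log P(y \mid x; \theta_e, \theta_y) + \lambda \sum_{(x, d) \in S \cup T} \log P(d \mid x; \theta_e, \theta_d) \Big] \tag{4}$$

As before, $\theta_e$, $\theta_y$, and $\theta_d$ parameterize the encoder, label predictor, and device classifier. The label term is summed over the source domain $S$ only, while the device term still covers both $S$ and $T$.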
In Experiments (§3), we show that our robust pathological voice detection system significantly increases the area under the precision-recall curve (PR-AUC), mitigating the channel mismatch between different devices.
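The unsupervised scheme described above, where labels are used only for source samples and device identities for all samples, can be sketched on a toy batch. All names (`binary_nll`, `dat_losses`, `lam`) and all numbers are hypothetical, the losses are averaged rather than summed, and the predicted probabilities are fixed arrays instead of network outputs.

```python
import numpy as np

def binary_nll(p, target):
    """Mean negative log-probability of binary targets under predicted probabilities p."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(np.mean(-(target * np.log(p) + (1 - target) * np.log(1 - p))))

def dat_losses(p_label, y, p_device, d, has_label):
    """Label loss over labeled samples only; device loss over every sample."""
    label_loss = binary_nll(p_label[has_label], y[has_label])
    device_loss = binary_nll(p_device, d)
    return label_loss, device_loss

# Toy batch: three source samples (labeled) followed by two target samples (unlabeled)
p_label = np.array([0.9, 0.2, 0.7, 0.6, 0.4])   # predicted P(pathological)
y = np.array([1, 0, 1, 0, 0])                   # last two entries are placeholders, never read
p_device = np.array([0.1, 0.3, 0.2, 0.8, 0.6])  # predicted P(device = target)
d = np.array([0, 0, 0, 1, 1])                   # 0 = source device, 1 = target device
has_label = d == 0

label_loss, device_loss = dat_losses(p_label, y, p_device, d, has_label)

lam = 1.0  # assumed adversarial weight
# The device classifier minimizes device_loss; the encoder minimizes the quantity
# below, i.e. it tries to make the device classifier's task harder.
encoder_objective = label_loss - lam * device_loss
print(label_loss, device_loss, encoder_objective)
```

Masking with `has_label` is what makes the setup unsupervised with respect to the target device: changing the placeholder labels of the target samples leaves `label_loss` unchanged.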

3 Experiments and results

We first compare two deep learning models, the BLSTM and a multilayer perceptron (MLP), and examine their performance under several feature extraction settings. The details of the model design are in Appendix 5.3. Based on these experiments, we select the best features and model to apply to our unsupervised domain adaptation system. We also compare our unsupervised domain adaptation system with three baselines. We evaluate our results using the area under the precision-recall curve (PR-AUC), which is particularly suited to classifiers evaluated on imbalanced datasets.
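As an illustration of the metric, the sketch below computes PR-AUC with the average-precision estimator (the mean of the precision values at the rank of each positive sample). The text does not state which estimator was used, so this choice, the function name `average_precision`, and the toy scores are assumptions; tied scores are not treated specially.

```python
import numpy as np

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (assumes distinct scores)."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(y_true)[order]                  # labels sorted by decreasing score
    true_positives = np.cumsum(y)                  # positives found up to each rank
    precision = true_positives / np.arange(1, len(y) + 1)
    return float(np.sum(precision * y) / np.sum(y))

# Toy example: 1 = pathological, 0 = control
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
ap = average_precision(y_true, scores)
print(ap)
```

Unlike accuracy, this score is not inflated by always predicting the majority class, which matters here because pathological samples outnumber controls in both domains.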


Each audio sample was collected at the Far Eastern Memorial Hospital (FEMH), as detailed in Appendix 5.1. Voices were recorded with two devices, a microphone and a smartphone. A vowel sound /:a/ was uttered in each recording. The pathological samples are voices affected by vocal fold pathologies. The source domain (microphone dataset) contains 183 voice samples (133 pathological samples and 50 control samples); we randomly choose 146 samples as the training set and the other 37 samples as the test set. The target domain (smartphone dataset) includes 52 pathological and 20 control samples; here, we use 26 pathological and 10 control samples as the test set. Note that the two sets differ in both recording devices and speakers.

3.1 MLP versus Bidirectional LSTM

We use 26-dimension MFCCs and 40-dimension filter banks with 32 ms windows. We also compare the effect of normalizing features over time to select the right model.

Model    MFCCs                         Filter banks
         Normalized   Non-normalized   Normalized   Non-normalized
BLSTM    0.94150      0.9051           0.8765       0.9478
MLP      0.91541      0.8693           0.8109       0.8279

Table 1: PR-AUC scores of BLSTM and MLP frameworks with features: MFCCs, normalized MFCCs, filter banks and normalized filter banks.

3.2 Analysis of MFCCs and filter banks under various settings

The experiment compares MFCCs and filter banks across different settings, including window length and normalization. We use the BLSTM because Table 1 shows that it outperforms the MLP in this task.

Feature        32 ms (window length)         100 ms (window length)
               Normalized   Non-normalized   Normalized   Non-normalized
MFCCs          0.94150      0.9051           0.9350       0.9111
Filter banks   0.8765       0.9478           0.8444       0.9378

Table 2: PR-AUC scores of MFCCs and filter banks with different settings. Each score is the mean of three different runs with distinct random seeds.

As shown in Table 2, for MFCCs, normalization improves performance; by contrast, for filter banks, non-normalized features perform better.
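The text does not specify how features are normalized over time. One common reading, assumed here, is per-utterance mean and variance normalization of each feature dimension along the time axis. The function name `normalize_over_time` and the random input are hypothetical.

```python
import numpy as np

def normalize_over_time(features, eps=1e-8):
    """features: (n_frames, n_dims). Zero mean and unit variance per dimension over time."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)  # eps guards against constant dimensions

# Illustrative input: 61 frames of 26-dimensional features (random, not study data)
rng = np.random.default_rng(0)
feats = rng.normal(loc=5.0, scale=3.0, size=(61, 26))
normed = normalize_over_time(feats)
print(normed.mean(axis=0)[:3], normed.std(axis=0)[:3])
```

Under this reading, normalization removes per-recording offsets and scale, which would also discard absolute energy information carried by the filter bank features.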

3.3 Domain adaptation

Previous experiments showed that non-normalized filter banks with 32 ms window length best capture the characteristics of data samples, so we use these as features to evaluate our domain adaptation model. We set up the following three baselines.

Baseline 1: No adaptation

Here, we train on the source device and test on the target device. The resulting poor performance demonstrates the channel mismatch effect (i.e., the discrepancy between data recorded from different devices).

Baseline 2: Target domain only

Here, we train and test on different subsets of the small target device dataset. The low model performance reflects the impact of the small dataset size.

Baseline 3: Frozen layers

After training the BLSTM classifier on the source device data, we freeze the BLSTM part and fine-tune the dense classifier layer using a subset of the target domain data. The frozen layer model performs better than the other two baselines, but worse than our proposed detection system, which incorporates the domain adversarial training (DAT) technique. Training on the source domain apparently overfits, and the resulting loss of generalizability cannot be easily recovered by fine-tuning on limited target domain data.

               Method                                                   PR-AUC
Supervised     Baseline 1: No domain adaptation                         0.8448
               Baseline 2: Target domain only                           0.8509
               Baseline 3: Frozen layer fine-tuning                     0.9021
               Proposed system: DAT with target labels (Equation 1)*    0.9522
Unsupervised   Proposed system: DAT (Equation 4)*                       0.9455

Table 3: PR-AUC results of domain adaptation. * Both proposed detection systems significantly outperform the three baseline methods (p-value < 0.05 by t-test), while there is no significant difference between the two proposed systems.

4 Conclusion and discussion

In this work, we propose a robust pathological voice detection system, evaluate its performance using PR-AUC, and examine the performance of MFCCs and filter banks with different settings. To the best of our knowledge, this is the first study that investigates this task based on the unsupervised adaptation approach. Our BLSTM model achieves a PR-AUC score of 0.9478 with non-normalized filter banks (Table 1). Furthermore, to facilitate the deployment of the algorithm to mobile phones, we integrate an unsupervised domain adaptation method, which increases the PR-AUC score from 0.8448 to 0.9455 without labeled target device samples.


References

  • Abadi et al. [2016] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In Proc. OSDI 16, pages 265–283, 2016.
  • Alhussein and Muhammad [2018] M. Alhussein and G. Muhammad. Voice pathology detection using deep learning on mobile healthcare framework. IEEE Access, 6:41034–41041, 2018. ISSN 2169-3536. doi: 10.1109/ACCESS.2018.2856238.
  • Arias-Londoño et al. [2011] J. D. Arias-Londoño, J. I. Godino-Llorente, N. Sáenz-Lechón, V. Osma-Ruiz, and G. Castellanos-Domínguez. Automatic detection of pathological voices using complexity measures, noise parameters, and mel-cepstral coefficients. IEEE Transactions on Biomedical Engineering, 58(2):370–379, Feb 2011. ISSN 0018-9294.
  • Bhattacharyya [2014] Neil Bhattacharyya. The prevalence of voice problems among adults in the united states. Laryngoscope, 124(10):2359–2362, Oct 2014.
  • Costa et al. [2008] S. C. Costa, B. G. Aguiar Neto, and J. M. Fechine. Pathological voice discrimination using cepstral analysis, vector quantization and hidden markov models. In Proc. BIBE, pages 1–5, Oct 2008. doi: 10.1109/BIBE.2008.4696783.
  • Fang et al. [2018] Shih-Hau Fang, Yu Tsao, Min-Jing Hsiao, Ji-Ying Chen, Ying-Hui Lai, Feng-Chuan Lin, and Chi-Te Wang. Detection of pathological voice using cepstrum vectors: A deep learning approach. Journal of voice : official journal of the Voice Foundation, 2018.
  • Fraile et al. [2009] Rubén Fraile, Nicolas Saenz-Lechon, Juan Ignacio Godino-Llorente, Víctor Osma-Ruiz, and Corinne Fredouille. Automatic detection of laryngeal pathologies in records of sustained vowels by means of mel-frequency cepstral coefficient parameters and differentiation of patients by sex. Folia phoniatrica et logopaedica : official organ of the International Association of Logopedics and Phoniatrics (IALP), pages 146–152, 2009.
  • Ganchev et al. [2005] Todor Ganchev, Nikos Fakotakis, and George Kokkinakis. Comparative Evaluation of Various MFCC Implementations on the Speaker Verification Task. In Proc. SPECOM, 2005.
  • Ganin et al. [2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
  • Godino-Llorente et al. [2006] J. I. Godino-Llorente, P. Gomez-Vilda, and M. Blanco-Velasco. Dimensionality reduction of a pathological voice quality assessment system based on gaussian mixture models and short-term cepstral parameters. IEEE Transactions on Biomedical Engineering, 53(10):1943–1953, Oct 2006. ISSN 0018-9294. doi: 10.1109/TBME.2006.871883.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Lee et al. [2016] T. Lee, Y. Liu, P. Huang, J. Chien, W. K. Lam, Y. T. Yeung, T. K. T. Law, K. Y. S. Lee, A. P. Kong, and S. Law. Automatic speech recognition for acoustical analysis and assessment of cantonese pathological voice and speech. In Proc. ICASSP, pages 6475–6479, March 2016. doi: 10.1109/ICASSP.2016.7472924.
  • Ravikumar and S. [2011] K.M. Ravikumar and Ganesan S. Comparison of Multidimensional MFCC Feature Vectors for Objective Assessment of Stuttered Disfluencies. Technical report, 2011.
  • Salhi et al. [2010] Lotfi Salhi, Talbi Mourad, and Adnene Cherif. Voice disorders identification using multilayer neural network. Int. Arab J. Inf. Technol., 7:177–185, 2010.
  • Umapathy et al. [2005] K. Umapathy, S. Krishnan, V. Parsa, and D. G. Jamieson. Discrimination of pathological voices using a time-frequency approach. IEEE Transactions on Biomedical Engineering, 52(3):421–430, March 2005. ISSN 0018-9294. doi: 10.1109/TBME.2004.842962.
  • Wu et al. [2018] Huiyi Wu, John J. Soraghan, Anja Lowit, and Gaetano Di Caterina. A deep learning method for pathological voice detection using convolutional deep belief networks. In Proc. Interspeech, 2018.

5 Appendix

5.1 Data description

There are two collections: the larger dataset, recorded with a high-quality microphone, and the smaller dataset, recorded with a smartphone (iPhone). The sampling rate was 44,100 Hz with 16-bit resolution, and data were saved in an uncompressed wave format. In each voice sample, a vowel sound /:a/ was recorded. The pathological samples are voices affected by vocal fold pathologies.

The source domain (microphone) contains 183 voice samples (133 pathological samples and 50 control samples); we randomly choose 146 samples as the training set and the other 37 samples as the test set. The target domain (smartphone dataset) includes 52 pathological and 20 control samples; here, we use 26 pathological and 10 control samples as the test set.

5.2 Feature extraction procedures

Filter banks can be derived through a series of steps: pre-emphasis, framing, windowing, Fourier transform, and Mel filtering. These procedures aim to mimic the non-linear human perception of sound. Filter banks are powerful feature extractors. Therefore, we introduce this approach as one of our feature extraction methods.
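The steps listed above can be sketched in NumPy as follows. The pre-emphasis coefficient (0.97), the Hamming window, the power-of-two FFT size, the Mel formula constants, and the final log compression are common defaults assumed for illustration; the text does not state the exact values used. Function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters equally spaced on the Mel scale; shape (n_filters, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising slope
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling slope, equal to 1 at the center bin
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

def log_mel_filterbanks(signal, sample_rate, n_filters=40, window_ms=32.0, pre_emph=0.97):
    # 1) Pre-emphasis: boost high frequencies
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # 2) Framing with a frame shift of half the window length
    window = int(round(sample_rate * window_ms / 1000.0))
    shift = window // 2
    n_frames = 1 + (len(emphasized) - window) // shift
    idx = np.arange(window)[None, :] + shift * np.arange(n_frames)[:, None]
    # 3) Windowing
    frames = emphasized[idx] * np.hamming(window)
    # 4) Fourier transform -> power spectrum (FFT size: next power of two)
    n_fft = 1 << (window - 1).bit_length()
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 5) Mel filtering, followed by log compression
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    return np.log(energies + 1e-10)

# Synthetic one-second 220 Hz tone (illustrative input, not study data)
sr = 44100
t = np.arange(sr) / sr
fbanks = log_mel_filterbanks(np.sin(2 * np.pi * 220 * t), sr)
print(fbanks.shape)
```

Each row of `fbanks` is one frame and each column one Mel band, so a low-pitched tone concentrates its energy in the first few columns.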

Mel-Frequency Cepstral Coefficients (MFCCs) are speech features derived from filter banks by applying a discrete cosine transform (DCT) to remove the correlation between filter bank coefficients. They can thus be regarded as a compressed representation of the filter banks. MFCCs are widely used in speech recognition, speaker recognition, and other speech processing tasks.
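The DCT step can be sketched with a hand-written orthonormal DCT-II in NumPy. Keeping the first 26 coefficients of 40 log filter bank energies is an illustrative assumption: the text gives both dimensionalities but does not say how the 26 MFCC dimensions are composed. The function name `dct_ii` and the random input are hypothetical.

```python
import numpy as np

def dct_ii(x, n_coeffs):
    """Orthonormal DCT-II along the last axis, keeping the first n_coeffs coefficients."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))   # shape (n_coeffs, n)
    scale = np.full(n_coeffs, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return (x @ basis.T) * scale

# Illustrative input: 61 frames of 40 log filter bank energies (random, not study data)
rng = np.random.default_rng(0)
log_energies = rng.normal(size=(61, 40))
mfcc = dct_ii(log_energies, n_coeffs=26)
print(mfcc.shape)
```

Truncating the transform keeps the slowly varying spectral envelope and drops fine detail, which is the sense in which MFCCs compress the filter bank representation.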

5.3 Model

For the DNN-based multilayer perceptron (MLP) model, we follow the setting in Fang et al. (2018). The MLP model has three dense layers, each with 300 nodes, and predicts the result for each frame. For the BLSTM classification model, at each step, the model takes as input the concatenation of eleven neighboring frames. The BLSTM model has a fully connected dense layer with 512 nodes, two bidirectional LSTM layers, each with 512 nodes, and a fully connected output layer to classify two classes. Both models use ReLU as the activation function. For the domain classifier in Fig. 1, the DNN-based device classifier consists of three dense layers, each with 300 nodes, with ReLU as the activation function. We implement all models in TensorFlow (Abadi et al., 2016).
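The concatenation of eleven neighboring frames can be sketched as follows. The text does not say how the neighborhood is centered or how utterance boundaries are handled; this sketch assumes the current frame plus five frames on each side, with the first and last frames repeated at the edges. The function name `stack_context` and the random input are hypothetical.

```python
import numpy as np

def stack_context(features, context=5):
    """features: (n_frames, n_dims) -> (n_frames, (2 * context + 1) * n_dims)."""
    n_frames, _ = features.shape
    # Repeat the edge frames so every frame has a full neighborhood (assumed policy).
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Slice i holds, for each frame t, the frame at offset (i - context) from t.
    windows = [padded[i : i + n_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)

# Illustrative input: 61 frames of 40-dimensional filter bank features (random, not study data)
rng = np.random.default_rng(0)
feats = rng.normal(size=(61, 40))
stacked = stack_context(feats)
print(stacked.shape)
```

With 40-dimensional features, each BLSTM input step then has 440 dimensions, and the middle block of each row is the current frame itself.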