Machine Learning based COVID-19 Detection from Smartphone Recordings: Cough, Breath and Speech

04/02/2021 ∙ by Madhurananda Pahar, et al. ∙ Stellenbosch University

We present an experimental investigation into the automatic detection of COVID-19 from smartphone recordings of coughs, breaths and speech. This type of screening is attractive because it is non-contact, does not require specialist medical expertise or laboratory facilities and can easily be deployed on inexpensive consumer hardware. We base our experiments on two datasets, Coswara and ComParE, containing recordings of coughing, breathing and speech from subjects around the globe. We consider seven machine learning classifiers, all of which are trained and evaluated using leave-p-out cross-validation. For the Coswara data, the highest AUC of 0.92 was achieved using a Resnet50 architecture on breaths. For the ComParE data, the highest AUC of 0.93 was achieved using a k-nearest neighbours (KNN) classifier on cough recordings after selecting the best 12 features using sequential forward selection (SFS), and an AUC of 0.91 was achieved on speech by a multilayer perceptron (MLP) when using SFS to select the best 23 features. We conclude that, among all vocal audio, coughs carry the strongest COVID-19 signature, followed by breath and speech. Although these signatures are not perceivable by the human ear, machine learning based COVID-19 detection is possible from vocal audio recorded via smartphone.




1 Introduction

COVID-19 (COronaVIrus Disease of 2019) was declared a global pandemic on March 11, 2020 by the World Health Organisation (WHO). Caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), this disease affects the respiratory system and includes symptoms such as fatigue, dry cough, shortness of breath, joint pain, muscle pain, gastrointestinal symptoms and loss of smell or taste [carfi2020persistent, wang2020clinical]. Due to its effect on the vascular endothelium, the acute respiratory distress syndrome can originate from either the gas or vascular side of the alveolus, which becomes visible in a chest x-ray or CT scan for COVID-19 patients [marini2020management, aguiar2020inside]. Among patients infected with SARS-CoV-2, between 5% and 20% are admitted to ICU and their mortality rate varies between 26% and 62% [ziehr2020respiratory]. Medical lab tests are available to diagnose COVID-19 by analysis of exhaled breaths [davis2021breath]. This technique is reported to achieve an accuracy of 93% when considering a group of 28 COVID-19 positive and 12 COVID-19 negative patients [grassin2021metabolomics]. Related work using a group of 25 COVID-19 positive and 65 negative patients achieved an area under the ROC curve (AUC) of 0.87 [ruszkiewicz2020diagnosis].

Machine learning algorithms have been applied to detect COVID-19 by using image analysis. COVID-19 was detected from computed tomography (CT) images using a Resnet50 architecture with 96.23% accuracy in [walvekar2020detection]. The same architecture was shown to detect pneumonia due to COVID-19 with an accuracy of 96.7% [sotoudeh2020artificial] and to detect COVID-19 from x-ray images with an accuracy of 96.30% [yildirim2020deep].

The automatic analysis of cough audio for COVID-19 detection has also received attention. Coughing is a predominant symptom of many lung ailments and its effect on the respiratory system varies [higenbottam2002chronic, chang2008chronic]. Lung disease can cause the glottis to behave differently and the airway to be either restricted or obstructed and this can influence the acoustics of the vocal audio such as cough, breath and speech [chung2008prevalence, knocikova2008wavelet]. These differences may make it possible to identify the coughing sound associated with a particular respiratory disease such as COVID-19 [imran2020ai4covid, laguarta2020covid].

Previously, we have found that automatic COVID-19 detection is possible on the basis of the acoustic cough signal [pahar2020covid]. Here we extend this work by considering whether breathing and speech audio can also be used effectively for COVID-19 detection. To do this, we draw data from the Coswara dataset [sharma2020coswara] as well as the Interspeech Computational Paralinguistics ChallengE (ComParE) dataset [Schuller21-TI2]. To our knowledge, we are the first to report evidence of accurate discrimination across all three audio types. We conclude that vocal audio such as coughing, breathing and speech is affected by the condition of the lungs to an extent that it carries acoustic features which allow machine learning classifiers to detect COVID-19 signatures.

Section 2 briefly summarises the two datasets used for experimentation while Section 3 describes the feature extraction and cross-validated hyperparameter optimisation process. Experimental results are presented and discussed in Section 4 and Section 5 concludes by summarising the findings.

2 Data

| Dataset | Type | Label | Subjects | Total audio | Average per subject | Standard deviation |
|---|---|---|---|---|---|---|
| Coswara | Breath | COVID-19 Positive | 88 | 8.58 mins | 5.85 sec | 5.05 sec |
| Coswara | Breath | Healthy | 1062 | 2.77 hours | 9.39 sec | 5.23 sec |
| Coswara | Breath | Total | 1150 | 2.92 hours | 9.126 sec | 5.29 sec |
| Coswara | Normal Count | COVID-19 Positive | 88 | 12.42 mins | 8.47 sec | 4.27 sec |
| Coswara | Normal Count | Healthy | 1077 | 2.99 hours | 9.99 sec | 3.09 sec |
| Coswara | Normal Count | Total | 1165 | 3.19 hours | 9.88 sec | 3.22 sec |
| Coswara | Fast Count | COVID-19 Positive | 85 | 7.62 mins | 5.38 sec | 2.76 sec |
| Coswara | Fast Count | Healthy | 1074 | 1.91 hours | 6.39 sec | 1.77 sec |
| Coswara | Fast Count | Total | 1159 | 2.03 hours | 6.31 sec | 1.88 sec |
| ComParE | Cough | COVID-19 Positive | 119 | 13.43 mins | 6.77 sec | 2.11 sec |
| ComParE | Cough | Healthy | 398 | 40.89 mins | 6.16 sec | 2.26 sec |
| ComParE | Cough | Total | 517 | 54.32 mins | 6.31 sec | 2.24 sec |
| ComParE | Speech | COVID-19 Positive | 214 | 44.02 mins | 12.34 sec | 5.35 sec |
| ComParE | Speech | Healthy | 396 | 1.46 hours | 13.25 sec | 4.67 sec |
| ComParE | Speech | Total | 610 | 2.19 hours | 12.93 sec | 4.93 sec |

Table 1: Summary of the Coswara and ComParE datasets. COVID-19 positive subjects are underrepresented in both datasets. The average length of COVID-19 positive breaths is approximately 30% shorter than that of healthy breaths.

2.1 Coswara dataset

The Coswara dataset has been specifically developed with the testing of classification algorithms for COVID-19 detection in mind. Data collection is web-based, and participants contribute by using their smartphones to record their coughing, breathing and speech (counting from one to twenty at a normal and a fast pace, and uttering the English vowels). The Coswara dataset includes participants from five different continents [sharma2020coswara, pahar2020covid]. The audio recordings of ‘deep breath’, ‘normal count’ and ‘fast count’, sampled at 44.1 kHz [muguli2021dicova], are used in this study.

Figure 1 shows breaths and Figure 2 shows counting at a normal pace, each recorded from COVID-19 positive and negative subjects. Breaths clearly have more high-frequency content than speech, and COVID-19 breaths are on average approximately 30% shorter than non-COVID-19 breaths (Table 1). All audio recordings are pre-processed to remove periods of silence to within a margin of 50 ms using a simple energy detector.
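The silence removal step can be illustrated with a minimal sketch of a short-time energy detector. This is not the authors' implementation: the function name `trim_silence`, the 10 ms analysis frame and the relative threshold of 5% of the peak frame energy are all assumptions made for illustration. Only the 50 ms margin comes from the text.

```python
import math

def trim_silence(samples, sample_rate, frame_ms=10, margin_ms=50, threshold_ratio=0.05):
    """Keep frames whose energy exceeds a threshold, plus a margin around them.

    frame_ms and threshold_ratio are illustrative assumptions; only the
    50 ms margin is taken from the text.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    margin_frames = math.ceil(margin_ms / frame_ms)
    # Short-time energy of each non-overlapping frame.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    energies = [sum(x * x for x in f) / len(f) for f in frames]
    if not energies or max(energies) == 0:
        return []
    threshold = threshold_ratio * max(energies)
    # Mark every active frame and its neighbours within the margin.
    keep = [False] * len(frames)
    for i, energy in enumerate(energies):
        if energy > threshold:
            lo = max(0, i - margin_frames)
            hi = min(len(frames), i + margin_frames + 1)
            for j in range(lo, hi):
                keep[j] = True
    return [x for f, k in zip(frames, keep) if k for x in f]
```

A fixed relative threshold is the simplest choice; noisier recordings would probably need an adaptive one.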

Figure 1: Pre-processed breath signals from both COVID-19 positive and COVID-19 negative subjects show no visual differences at all. Breaths corresponding to inhalation are marked by arrows, and are followed by an exhalation.

Figure 2: Pre-processed speech (counting from 1 to 20 at a normal pace) from COVID-19 positive and COVID-19 negative subjects shows no obvious visual differences. It contains little spectral energy above 1 kHz compared to the breath signals in Figure 1.

2.2 ComParE dataset

This dataset has been provided with train, test and development sets as a part of the 2021 Interspeech Computational Paralinguistics ChallengE (ComParE) [Schuller21-TI2]. Since we employ leave-p-out nested cross-validation, we have combined the development and training sets to train and optimise our classifiers.

The ComParE dataset contains recordings, sampled at 16 kHz, of both coughs and speech. The speech is the utterance ‘I hope my data can help to manage the virus pandemic’, spoken in the speaker’s own language, and more than three different languages are represented in the dataset.

2.3 Corpus comparison

A summary of the two datasets used in our experiments is presented in Table 1. Here, we see that the COVID-19 positive class is underrepresented in both datasets, but especially in Coswara. Since this can detrimentally affect performance, especially of neural networks [van2007experimental, krawczyk2016learning], we have employed SMOTE data balancing during training [chawla2002smote] for the Coswara dataset. SMOTE oversamples the minority class by creating synthetic samples rather than by randomly duplicating existing ones. We have in the past successfully applied SMOTE to cough detection [pahar2021deep] and classification based on audio recordings [pahar2020covid]. The Coswara dataset contains more subjects than the ComParE dataset.
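The idea behind SMOTE can be sketched in a few lines: each synthetic sample lies on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The following is a simplified illustration and not the implementation used in the experiments. The function name `smote`, its parameters and the default of five neighbours are assumptions.

```python
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating towards near neighbours.

    minority is a list of feature vectors (lists of floats) and must contain
    at least two samples. All names and defaults here are illustrative.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base sample, excluding itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

To avoid leaking synthetic samples into the evaluation, oversampling of this kind is applied to the training folds only, in line with the text's statement that balancing is done during training.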

3 Feature Extraction and Classifiers

We have extracted mel frequency cepstral coefficients (MFCCs) and log energies of linearly spaced filters, along with velocity and acceleration, as well as zero-crossing rate (ZCR) and kurtosis as features from audio signals.

MFCCs have proved highly effective in automatic speech recognition [pahar_coding_2020] and also for differentiating dry and wet coughs [chatrzarrin2011feature]. Log energies computed from linearly spaced filters have also proved useful in biomedical applications, including cough audio classification [aydin2009log, botha2018detection, pahar2021tb]. The ZCR [bachu2010voiced] is the number of times the time-domain signal changes sign within a frame, and is an indicator of variability. Finally, kurtosis [pahar2021deep] indicates the tailedness of a distribution and therefore the prevalence of higher amplitudes. The feature extraction process is illustrated in Figure 3.
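The two simplest features described above can be computed directly from the samples of a frame. The sketch below is illustrative only. It assumes the Pearson definition of kurtosis (the fourth standardised moment, equal to 3 for a Gaussian), since the text does not say whether excess kurtosis is used instead.

```python
def zero_crossing_rate(frame):
    """Number of sign changes between consecutive samples in a frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def kurtosis(frame):
    """Fourth standardised moment of the samples (Pearson definition, assumed)."""
    n = len(frame)
    mean = sum(frame) / n
    var = sum((x - mean) ** 2 for x in frame) / n
    if var == 0:
        return 0.0  # constant frame: kurtosis is undefined, return 0 by convention
    return sum((x - mean) ** 4 for x in frame) / (n * var ** 2)
```

A frame dominated by a few large-amplitude samples yields a high kurtosis, which is the "prevalence of higher amplitudes" the text refers to.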

Figure 3: Feature extraction process. The overlapping frame length (Equation 1) is adjusted so that the entire recording is divided into a fixed number of segments. The dimensions of the final feature matrix are determined by the number of MFCCs and the number of segments.

The frame length, the number of segments, the number of lower-order MFCCs and the number of linearly spaced filters are the feature extraction hyperparameters (Tables 2 and 4). Equation 1 defines the length of the overlapping window in terms of the length of the audio in samples and the number of segments.

The dimensions of the input feature matrix to the classifiers are set by the number of MFCCs, together with the same number of velocity and acceleration coefficients, and by the number of segments (Figure 3). In contrast with traditional fixed frame rates, this way of extracting features ensures that the entire recording is captured within a fixed number of frames, which allows the deep neural networks (DNNs) in particular to discover more useful temporal patterns and provide better classification performance.
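The effect of using a fixed number of segments can be sketched as follows. This is a hypothetical illustration and does not reproduce Equation 1: it assumes that frame starts are spaced by the recording length divided by the number of segments, rounded up, and that the final frames are zero-padded. The function name `split_into_segments` is likewise an assumption.

```python
import math

def split_into_segments(samples, frame_len, n_segments):
    """Return exactly n_segments frames of frame_len samples spanning the recording.

    The hop between frame starts is an assumed stand-in for Equation 1.
    """
    hop = math.ceil(len(samples) / n_segments)
    frames = []
    for s in range(n_segments):
        start = s * hop
        frame = list(samples[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))  # zero-pad past the end
        frames.append(frame)
    return frames
```

Frames overlap whenever `frame_len` exceeds the hop. Computing one feature vector per frame then gives a matrix with the same number of columns for every recording, whatever its duration, which is what lets recordings of different lengths share one classifier input shape.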

| Hyperparameter | Description | Range |
|---|---|---|
| Frame length | Length of the frames into which audio is segmented | |
| Segments | Number of segments into which frames were grouped | |
| MFCCs | Number of lower-order MFCCs to keep | 13 to 65 |
| Linearly spaced filters | Number of filters used to extract log energies | 40 to 200 in steps of 20 |

Table 2: Feature extraction hyperparameters. We have used between 13 and 65 MFCCs and between 40 and 200 linearly spaced filters to extract log energies in the audio signal.

| Hyperparameter | Classifier | Range |
|---|---|---|
| Regularisation strength | LR, SVM | |
| Penalty | LR | 0 to 1 in steps of 0.05 |
| Penalty | LR, MLP | 0 to 1 in steps of 0.05 |
| Kernel coefficient | SVM | |
| No. of neighbours | KNN | 10 to 100 in steps of 10 |
| Leaf size | KNN | 5 to 30 in steps of 5 |
| No. of neurons | MLP | 10 to 100 in steps of 10 |
| No. of conv filters | CNN | |
| Kernel size | CNN | 2 and 3 |
| Dropout rate | CNN, LSTM | 0.1 to 0.5 in steps of 0.2 |
| Dense layer size | CNN, LSTM | |
| LSTM units | LSTM | |
| Learning rate | LSTM, MLP | |
| Batch size | CNN, LSTM | |
| Epochs | CNN, LSTM | 10 to 250 in steps of 20 |

Table 3: Classifier hyperparameters, optimised in the inner loop of the leave-p-out nested cross-validation.

We have evaluated seven classifiers: logistic regression (LR), support vector machines (SVM), multilayer perceptrons (MLP), k-nearest neighbours (KNN), convolutional neural networks (CNN), long short-term memory (LSTM) recurrent neural networks and a residual network (Resnet50). LR was applied to both the Coswara and ComParE datasets to provide a baseline. For the ComParE dataset, results are otherwise only reported for the SVM, KNN and MLP classifiers since the DNNs performed poorly. We ascribe this to the smaller number of subjects in this dataset. For the Coswara dataset, on the other hand, only the performance of the DNNs is reported in addition to the baseline, since they performed substantially better than the simpler classifiers.

Hyperparameter optimisation and performance evaluation were performed using a leave-p-out nested cross-validation scheme [liu2019leave] for all classifiers on both datasets, except for Resnet50, which was excluded because of its extreme computational load. The hyperparameters are listed in Table 3, and a five-fold split, similar to that employed in [pahar2020covid], was used for cross-validation.
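The structure of nested cross-validation can be sketched as below: the inner loop selects hyperparameters using only the development portion of each outer split, and the outer loop measures performance on subjects never seen during that selection. This is a simplified illustration that uses shuffled k-fold splits as a stand-in for the leave-p-out scheme. The names `k_folds`, `nested_cv` and `fit_and_score`, and the inner fold count, are assumptions.

```python
import random

def k_folds(items, k, seed=0):
    """Shuffle items and split them into k roughly equal folds."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]

def nested_cv(subjects, hyperparams, fit_and_score, outer_k=5, inner_k=4):
    """Return one held-out score per outer fold.

    fit_and_score(train, test, hp) is a caller-supplied (hypothetical) function
    that trains with hyperparameters hp and returns a score such as the AUC.
    """
    outer = k_folds(subjects, outer_k)
    outer_scores = []
    for i, test in enumerate(outer):
        dev = [s for j, fold in enumerate(outer) if j != i for s in fold]
        inner = k_folds(dev, inner_k, seed=i + 1)

        def inner_score(hp):
            # Mean validation score of hp over the inner folds.
            scores = []
            for m, val in enumerate(inner):
                train = [s for n, fold in enumerate(inner) if n != m for s in fold]
                scores.append(fit_and_score(train, val, hp))
            return sum(scores) / len(scores)

        best_hp = max(hyperparams, key=inner_score)
        outer_scores.append(fit_and_score(dev, test, best_hp))
    return outer_scores
```

Splitting by subject rather than by recording keeps audio from one person out of both the training and test sets at once.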

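| Dataset | Type | Classifier | Best feature hyperparameters | Best classifier hyperparameters (optimised inside nested cross-validation) | Spec | Sens | Acc | AUC | AUC std. dev. |
|---|---|---|---|---|---|---|---|---|---|
| Coswara | Breath | Resnet50 | | Default Resnet50 (Table 1 in [he2016deep]) | 92% | 90% | 91% | 0.92 | 0.03 |
| Coswara | Breath | LSTM | | dropout rate = 0.1, dense layer size = 32, LSTM units = 128, learning rate = 0.001, batch size = 256, epochs = 170 | 90% | 86% | 88% | 0.91 | 0.04 |
| Coswara | Breath | CNN | | conv filters = 48, kernel size = 2, dropout rate = 0.3, dense layer size = 32, batch size = 256, epochs = 210 | 87% | 85% | 86% | 0.89 | 0.04 |
| Coswara | Breath | LR | | | 69% | 72% | 71% | 0.74 | 0.05 |
| Coswara | Normal Count | Resnet50 | | Default Resnet50 (Table 1 in [he2016deep]) | 83% | 80% | 82% | 0.86 | 0.05 |
| Coswara | Normal Count | LSTM | | dropout rate = 0.3, dense layer size = 32, LSTM units = 128, learning rate = 0.001, batch size = 256, epochs = 90 | 83% | 79% | 81% | 0.84 | 0.05 |
| Coswara | Normal Count | CNN | | conv filters = 96, kernel size = 2, dropout rate = 0.1, dense layer size = 32, batch size = 128, epochs = 130 | 81% | 78% | 79% | 0.83 | 0.05 |
| Coswara | Normal Count | LR | | | 66% | 70% | 68% | 0.71 | 0.05 |
| Coswara | Fast Count | LSTM | | dropout rate = 0.3, dense layer size = 32, LSTM units = 64, learning rate = 0.01, batch size = 128, epochs = 150 | 84% | 78% | 81% | 0.85 | 0.04 |
| Coswara | Fast Count | Resnet50 | | Default Resnet50 (Table 1 in [he2016deep]) | 83% | 78% | 81% | 0.82 | 0.04 |
| Coswara | Fast Count | CNN | | conv filters = 48, kernel size = 2, dropout rate = 0.3, dense layer size = 16, batch size = 128, epochs = 170 | 82% | 76% | 79% | 0.81 | 0.04 |
| Coswara | Fast Count | LR | | | 66% | 69% | 67% | 0.69 | 0.04 |
| ComParE | Cough | KNN+SFS | | neighbours = 60, leaf size = 25 | 84% | 90% | 87% | 0.93 | 0.01 |
| ComParE | Cough | KNN | | neighbours = 60, leaf size = 25 | 78% | 80% | 80% | 0.85 | 0.01 |
| ComParE | Cough | MLP | | penalty = 0.65, neurons = 40 | 76% | 80% | 78% | 0.83 | 0.01 |
| ComParE | Cough | SVM | | | 75% | 78% | 77% | 0.81 | 0.01 |
| ComParE | Cough | LR | | | 69% | 73% | 71% | 0.78 | 0.01 |
| ComParE | Speech | MLP+SFS | | penalty = 0.35, neurons = 70 | 82% | 88% | 85% | 0.91 | 0.01 |
| ComParE | Speech | MLP | | penalty = 0.35, neurons = 70 | 81% | 85% | 83% | 0.89 | 0.01 |
| ComParE | Speech | KNN | | neighbours = 70, leaf size = 15 | 80% | 84% | 82% | 0.84 | 0.01 |
| ComParE | Speech | SVM | | | 79% | 81% | 80% | 0.83 | 0.01 |
| ComParE | Speech | LR | | | 69% | 72% | 71% | 0.77 | 0.01 |

Table 4: Classifier performance as evaluated in the outer loop of nested cross-validation. The average specificity, sensitivity, accuracy and AUC are shown together with the standard deviation of the AUC, as calculated over the outer folds during nested cross-validation. Hyperparameters producing the highest AUC in the outer loop are listed as the ‘best classifier hyperparameters’.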

4 Results

Figure 4: Coswara breath and speech classification: A Resnet50 classifier achieved the highest AUC of 0.92 in classifying COVID-19 breath. Speech (normal and fast count) can also be used to classify COVID-19, with AUCs of 0.86 and 0.85 using the Resnet50 and LSTM classifiers respectively.

The mean AUC calculated across the nested cross-validation outer folds, along with its standard deviation, is summarised in Table 4 for the best classifiers on both datasets. We see that LR achieves similar performance for breath and speech in the Coswara dataset (AUC between 0.69 and 0.74) and for cough and speech in the ComParE dataset (AUC between 0.77 and 0.78). We also see that the classification performance achieved for speech in the Coswara dataset (AUCs of 0.86 and 0.85 for normal and fast speech by a Resnet50 and an LSTM respectively) is similar to that achieved for speech in the ComParE dataset (AUC of 0.89 using an MLP).
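For readers less familiar with the metrics reported in Table 4, the sketch below shows one way they can be computed from classifier scores and binary labels (1 for COVID-19 positive, 0 for healthy). It is illustrative only: the AUC is computed through its rank interpretation, namely the probability that a randomly chosen positive scores higher than a randomly chosen negative, and the function names are assumptions.

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return tp / labels.count(1), tn / labels.count(0)

def mean_and_std(values):
    """Mean and population standard deviation, e.g. of per-fold AUCs."""
    mean = sum(values) / len(values)
    return mean, (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
```

Unlike accuracy, the AUC does not depend on a decision threshold or on the class proportions, which makes it a natural headline metric for imbalanced datasets such as these.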

For breath, the best performance is achieved with a Resnet50 architecture, with an AUC of 0.92. The ROC curve for this classifier is shown in Figure 4. The same figure also shows the ROC curves of the Resnet50 and LSTM architectures when classifying Coswara normal and fast speech respectively. These results indicate that, for the Coswara data, COVID-19 classification is most accurate when performed on the basis of breath signals but that it is also possible using speech. Furthermore, the speech rate has very little influence on performance (AUCs of 0.86 and 0.85 for normal and fast count respectively).

Figure 5: ComParE cough and speech classification: The highest AUCs of 0.93 and 0.91 are achieved using the best 12 and 23 features when classifying COVID-19 cough and speech with a KNN and an MLP classifier respectively. Using the entire feature set, these classifiers achieved AUCs of 0.85 and 0.89 respectively, as shown in Table 4.

Table 4 also shows that the COVID-19 coughs in the ComParE dataset can be distinguished from healthy coughs with a mean AUC of 0.85 using a KNN classifier with log energies extracted from linearly spaced filters. When sequential forward selection (SFS) [devijver1982pattern] is applied to this classifier in order to find the individual features that contribute the most towards performance, the AUC improves to 0.93 for the best 12 features. For the ComParE speech signal, the best performance (an AUC of 0.89) was achieved by an MLP classifier using MFCC features. When SFS is applied to this classifier, the AUC improves to 0.91 for the best 23 features. The corresponding ROC curves are shown in Figure 5. We note that, for the ComParE dataset, cough-based classification with SFS outperforms speech-based classification with SFS at all operating points of the ROC curve, but that the performance is close. Although all other classifiers performed best on log-filterbank energies, the MLP always performed best on MFCCs and proved to be the best classifier for COVID-19 speech spoken in different languages.
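Sequential forward selection is a greedy search: starting from an empty set, it repeatedly adds the single feature that gives the largest score when combined with those already chosen, and finally keeps the subset size that scored best. The sketch below is a generic illustration rather than the authors' code. The function name and the caller-supplied `score` function (for example, a cross-validated AUC for a given feature subset) are assumptions.

```python
def sequential_forward_selection(n_features, score, n_select):
    """Greedily grow a feature subset and return the best (subset, score) seen.

    score(subset) is a caller-supplied (hypothetical) function that evaluates
    a list of feature indices, for example by cross-validated AUC.
    """
    selected, history = [], []
    remaining = set(range(n_features))
    while remaining and len(selected) < n_select:
        # Try each remaining feature alongside those already selected.
        best = max(sorted(remaining), key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score(selected)))
    return max(history, key=lambda h: h[1])
```

Because each step only looks one feature ahead, the result is not guaranteed to be the globally best subset, but the search needs far fewer evaluations than testing every combination.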

Informal listening assessment of the Coswara and ComParE datasets indicates that the former has greater variance and more noise than the latter, which is also reflected in the higher standard deviation of the AUC in Table 4. It is interesting to note that MFCCs are always the features of choice for this noisier dataset, while the log energies of linear filters are often preferred for the less noisy data. A similar conclusion was also drawn in [botha2018detection], where coughs were recorded in a controlled environment with little environmental noise. A higher number of segments also generally leads to better performance, as it allows the classifier to find patterns in smaller stretches of the audio signal.

5 Conclusions

We have used two datasets that include participants from around the world to determine whether COVID-19 signatures can be detected in smartphone vocal audio recordings of cough, breath and speech. We have found that successful classification is possible for all three audio types and across both datasets. However, cough audio carries the strongest COVID-19 signatures, followed by breath and speech: the highest AUCs obtained from cough-based, breath-based and speech-based COVID-19 classification are 0.93, 0.92 and 0.91 respectively. COVID-19 cough and breath audio has more high-frequency content than speech and is also much shorter, and our feature extraction technique is designed to capture both properties. We also note that hyperparameter optimisation has selected a higher number of MFCCs than is required to match the resolution of the human auditory system, and also a densely populated log-filterbank. We therefore postulate that the information used by the classifiers to detect COVID-19 signatures is at least to some extent not perceivable by the human ear. This study shows that machine learning based COVID-19 detection with high accuracy is possible on all three types of vocal audio (cough, breath and speech) recorded via smartphone. Such an easily deployable, non-contact COVID-19 screening tool has the potential to decrease the load on health care systems around the globe.

As future work, we are continuing to extend the classifier and feature extraction hyperparameter ranges to improve performance, and we are applying voice activity detection to automatically detect vocal activity in smartphone audio.