The emergence and spread of the novel coronavirus disease (COVID-19) is deemed a major public health threat for almost all countries around the world. Moreover, explicitly or implicitly, the coronavirus pandemic has had an unprecedented impact on every single person across the world. To combat the COVID-19 pandemic and its consequences, clinicians, nurses, and other care providers are battling on the front line. Beyond that, scientists and researchers from a range of research domains are also stepping up in response to the challenges raised by this pandemic. For instance, several different kinds of drugs and vaccines are being developed and trialled to treat the virus or to protect against it [Cynthia20-RAD, ren2020traditional, cao2020covid, Peeples20-NRA, Nicole20-DCV], while methods and technologies are being designed and investigated to accelerate diagnostic testing [DURNER2020, Chio20-IST].
Within the data science community in particular, massive efforts have been, and are still being, made to mine information from data. A number of works have been proposed to promote automatic screening by analysing chest CT images [li2020artificial, afshar2020covid, wang2020covid, farooq2020covid]. For instance, in [li2020artificial] a deep model, COVNet, was developed to extract visual features to detect COVID-19. However, no research has yet explored sound-based COVID-19 assessment.
From the perspective of sound analysis, as COVID-19 is a respiratory illness, abnormal breathing patterns of patients might intuitively be a potential indicator for diagnosis [wang2020abnormal]. According to the latest clinical research, the severity of the COVID-19 disease can be categorised into three levels, namely, mild, moderate, and severe illness [cascella2020features]. For each level, typical respiratory symptoms can be observed, from dry cough in mild illness, to shortness of breath in moderate illness, and further to severe dyspnea, respiratory distress, or tachypnea (respiratory frequency 30 breaths/min) in severe illness [cascella2020features]. All these breathing disorders lead to abnormal variations of articulation. Consequently, it is of great interest to use automatic speech and voice analysis, which is non-invasive and low-cost, to aid COVID-19 diagnosis.
In addition, there could be many meaningful and powerful audio-based tools and applications that have so far been underestimated and hence underutilised. Very recently, researchers outlined a number of potential use cases for intelligent speech and sound analysis in the fight against the spread of COVID-19 [schuller2020covid]. Specifically, these envisioned directions are grouped into three categories, i. e., audio-based risk assessment, audio-based diagnosis, and audio-based monitoring, such as monitoring of spread, social distancing, treatment and recovery, and patient wellbeing [schuller2020covid].
Despite the importance of analysing voice or speech signals to battle this pandemic, no empirical research has been done to date. To fill this gap, we present an early study on the intelligent analysis of speech under COVID-19. To the best of our knowledge, this is the first work in this direction. In particular, we take a data-driven approach to automatically detect the patients’ symptom severity, as well as their physical and mental states. We hope that it can help develop a rapid, cheap, and easy way to diagnose COVID-19, and assist doctors in taking care of their patients.
2 Data Collection
For this work, obtaining collected and annotated data is the first step. At present, data collection is underway worldwide, both from infected patients at various stages of the disease and from healthy individuals as the control group. For instance, the “COVID-19 Sounds App” (https://www.covid-19-sounds.org/en/) has been released by the University of Cambridge to help researchers detect whether a person is suffering from COVID-19, by collecting recordings of participants’ voice, breathing, and coughing. Likewise, researchers from Carnegie Mellon University launched a new app, “Corona Voice Detect” (https://cvd.lti.cmu.edu/), to gather voice samples, such as coughs, several vowel sounds, counting up to twenty, and the alphabet. Nonetheless, none of these data are currently publicly available for research purposes.
We collected in-the-wild individual-case data of 52 COVID-19 infected and hospitalised patients (20 females and 32 males) from two hospitals in Wuhan, China. Data were collected between March 20th and March 26th, 2020. For each patient, five sentences were recorded one after another via the WeChat app while doctors and nurses were making their daily rounds to check on the patients. As a result, five recordings per patient were acquired during data collection. Note that all five sentences have a neutral meaning; one example is given below:
(when translated into English)
Today is 2020 March 26th.
I agree to use my voice for coronavirus-related research purposes.
Today is the twelfth day since I stayed in the hospital.
I wish I could rehabilitate and leave hospital soon.
The weather today is sunny.
Moreover, each patient answered three self-report questions regarding her or his sleep quality, fatigue, and anxiety. Specifically, they rated their sleep quality, fatigue, and anxiety by choosing from three different levels (i. e., low, mid, and high). Furthermore, regarding demographic information, another four characteristics of the patients were collected, namely age, gender, height, and weight. Note that the height and weight information of 13 patients was not provided. A statistical overview of the data can be seen in Table 1.
| mean ± std dev | female | male | all genders |
Furthermore, a distribution of the self-reported sleep quality, fatigue, and anxiety, grouped by gender, is illustrated in Fig. 1.
3 Data Preprocessing
Once the COVID-19 audio data were collected, a series of preprocessing steps was carried out. Specifically, we performed the following four steps: data cleansing, manual annotation of voice activity, speaker diarisation, and speech transcription. Details are provided below.
Data Cleansing: as the recordings were collected in the wild, there were a few unsuccessful recordings in which the patient did not provide any speech and only background noise was captured. Such recordings were excluded from further analysis. As a result, the recordings from one female patient were discarded, leaving data from 51 subjects for further processing.
Voice Activity Detection: for each recording, the presence or absence of human voice was then annotated manually in Audacity (https://www.audacityteam.org/). This was necessary because some recordings contained silence during the first and/or the last few seconds. Note that we only removed the unvoiced parts at the beginning and the end of each recording where no audible breathing took place. Hence, only voiced segments (e. g., speech, breathing, and coughing) from the recordings were retained.
Speaker Diarisation: among the remaining voiced segments, there is speech from people other than the targeted patient. For this reason, we manually checked and annotated the speaker identity of each voiced segment, indicating whether the voice was produced by the patient or by someone else.
Speech Transcription: after annotating the speaker identities, the voiced segments from the targeted patients were transcribed into text. Note that, as the collection took place in the wild, the patients also produced some spontaneous recordings with impromptu and unscripted content beyond the aforementioned five sentences.
After data preprocessing, we obtained a total of 378 segments. For this preliminary study, we focus only on the scripted segments from patients, leading to 260 recordings for further analysis. The distribution of the five sentences is given in Table 2. It can be seen that the distribution is imbalanced. The reason is that some patients recorded the same content more than once, while others failed to supply all five recordings.
These 260 audio segments from 51 COVID-19 infected patients were then converted to mono signals with a sampling rate of 16 kHz for further analysis.
4 Experiments and Results
In this section, we detail the experiments that were carried out to verify the feasibility of audio-only-based COVID-19 diagnosis. More specifically, we first describe the experimental setup, including the applied acoustic feature sets and the evaluation strategy. Afterwards, we report the performance for COVID-19 severity estimation, as well as the prediction performance for three self-reported patient states, namely, sleep quality, fatigue, and anxiety. Last, we discuss the limitations of the current study, and provide future work plans and directions.
4.1 Feature Extraction
Two established acoustic feature sets were considered in this study, namely, the Computational Paralinguistics Challenge (ComParE) set and the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS). Both feature sets were extracted with the openSMILE toolkit [Eyben10-openSMILE].
The former, the ComParE feature set, is a large-scale brute-force feature set utilised in the series of INTERSPEECH Computational Paralinguistics Challenges since 2013 [Schuller13-TI2, Schuller19-TI2]. It contains 6 373 static features obtained by computing various statistical functionals over 65 low-level descriptor (LLD) contours [Schuller13-TI2]. These LLDs consist of spectral (relative spectra auditory bands 1-26, spectral energy, spectral slope, spectral sharpness, spectral centroid, etc.), cepstral (Mel frequency cepstral coefficients 1-14), prosodic (loudness, root mean square energy, zero-crossing rate, fundamental frequency (F0) via subharmonic summation, etc.), and voice quality (probability of voicing, jitter, shimmer, and harmonics-to-noise ratio) descriptors. For more details, the reader is referred to [Schuller13-TI2].
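To make the notion of static features more concrete, the following minimal sketch shows how frame-level LLD contours can be summarised by statistical functionals into a fixed-length vector. It is not the openSMILE configuration used in this study: the two toy LLDs, the four functionals, the frame length and hop size, and all function names are assumptions chosen purely for illustration.

```python
import math
from statistics import mean, pstdev


def frame_signal(signal, frame_len=400, hop=160):
    """Split samples into overlapping frames (25 ms frames, 10 ms hop at 16 kHz)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def rms_energy(frame):
    """Root mean square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# Hypothetical, tiny sets; the real ComParE set uses 65 LLDs and many more functionals.
LLDS = {"rms": rms_energy, "zcr": zero_crossing_rate}
FUNCTIONALS = {"mean": mean, "std": pstdev, "min": min, "max": max}


def extract_static_features(signal):
    """Map a variable-length signal (at least one frame long) to fixed-length features."""
    frames = frame_signal(signal)
    features = {}
    for lld_name, lld in LLDS.items():
        contour = [lld(frame) for frame in frames]  # one LLD value per frame
        for fn_name, fn in FUNCTIONALS.items():
            features[f"{lld_name}_{fn_name}"] = fn(contour)
    return features


# Usage: one second of a 440 Hz sine tone sampled at 16 kHz.
sine = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
feats = extract_static_features(sine)
print(len(feats))  # 2 LLDs x 4 functionals = 8 static features
```

Whatever the duration of the recording, the output always has the same dimensionality, which is what allows segments of different lengths to be fed to a standard classifier.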
In contrast to the large-scale ComParE set, the other feature set applied in this work, eGeMAPS, is much smaller. It consists of only 88 features derived from 25 LLDs. These features were chosen for their capability to describe affective physiological changes in voice production. For more details about these features, please refer to [Eyben16-TGM].
4.2 Evaluation Strategy
In this work, we carried out four audio-based classification tasks. First, we performed COVID-19 severity estimation based on the number of days of hospitalisation. The hypothesis is that a COVID-19 patient is generally very sick at the early stage of hospitalisation, and then recovers step by step. Consequently, the patients were approximately grouped into three categories, i. e., the high-severity stage for the first 25 days, the mid-severity stage between 25 and 50 days, and the low-severity stage after 50 days. In addition, the other three classification tasks considered in this study are to predict the self-reported sleep quality, fatigue, and anxiety levels of COVID-19 patients, the potential of which has been pointed out in [schuller2020covid].
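The grouping by days of hospitalisation can be sketched as a simple threshold rule. The treatment of the boundary days (exactly 25 and exactly 50 days) is not specified in the text, so the choice made below is an assumption, as is the function name.

```python
def severity_from_days(days_in_hospital):
    """Map days of hospitalisation to a coarse severity label.

    Assumption: day 25 and day 50 are both assigned to the 'mid' group; the
    text only states 'first 25 days', 'between 25 and 50 days', 'after 50 days'.
    """
    if days_in_hospital < 0:
        raise ValueError("days_in_hospital must be non-negative")
    if days_in_hospital < 25:
        return "high"
    if days_in_hospital <= 50:
        return "mid"
    return "low"


labels = [severity_from_days(d) for d in (12, 25, 40, 50, 63)]
print(labels)  # ['high', 'mid', 'mid', 'mid', 'low']
```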
For these classification tasks, we used Support Vector Machines (SVMs) with a linear kernel as the classifiers in all experiments, owing to their widespread usage and appealing performance in intelligent speech analysis [schmitt2016border, Han17-Prediction]. Specifically, a series of complexity constants C (five values in total) was evaluated. Further, to deal with the imbalanced data during training, a class weighting strategy was employed to automatically adjust the C values in proportion to the importance of each class. The SVMs were implemented in Python based on the scikit-learn library.
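One common way to realise such class weighting is the "balanced" heuristic documented for scikit-learn's `class_weight="balanced"` option, where each class receives the weight n_samples / (n_classes * class_count) and the penalty for that class becomes C multiplied by this weight. Whether exactly this scheme was used in the present study is an assumption; the sketch below reproduces the arithmetic in plain Python with a hypothetical label distribution and a hypothetical base value of C.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def effective_c(base_c, weights):
    """Per-class penalty: the base complexity constant scaled by the class weight."""
    return {cls: base_c * w for cls, w in weights.items()}


# Hypothetical, imbalanced training labels (not the study's data).
y_train = ["low"] * 6 + ["mid"] * 3 + ["high"] * 1
weights = balanced_class_weights(y_train)
print(weights)
print(effective_c(0.1, weights))  # 0.1 is an arbitrary illustrative base C
```

Rare classes thus receive a larger penalty for misclassification, so that the decision boundary is not dominated by the majority class.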
Moreover, for all experiments in this study, Leave-One-Subject-Out (LOSO) cross-validation was carried out to satisfy the speaker-independent evaluation constraint. In this context, all 260 instances were divided into 51 speaker-independent folds, with each fold containing only the instances of one patient. In each round of the LOSO scheme, one of the 51 folds was used as the test set, and the remaining folds were combined into a training set to train an SVM model. This process was repeated 51 times until every fold had been used as the test set. Note that, for each fold, an on-line standardisation was applied to the test set by using the means and standard deviations of the training partition.
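The LOSO procedure with training-partition standardisation can be sketched as follows. The toy data, subject identifiers, and function names are hypothetical, and the classifier itself is left out; the point is only that the normalisation statistics are estimated on the training folds and then reused unchanged on the held-out subject.

```python
from statistics import mean, pstdev


def loso_folds(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for subject in sorted(set(subject_ids)):
        test = [i for i, s in enumerate(subject_ids) if s == subject]
        train = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train, test


def fit_standardiser(rows):
    """Per-feature mean and standard deviation, estimated on training rows only."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # avoid dividing by zero
    return means, stds


def standardise(rows, means, stds):
    """Apply previously fitted statistics to any partition."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical toy data: five instances from three subjects, two features each.
X = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [4.0, 15.0], [5.0, 14.0]]
subjects = ["p01", "p01", "p02", "p03", "p03"]

for subject, train_idx, test_idx in loso_folds(subjects):
    means, stds = fit_standardiser([X[i] for i in train_idx])
    X_train = standardise([X[i] for i in train_idx], means, stds)
    X_test = standardise([X[i] for i in test_idx], means, stds)
    # A classifier would be trained on X_train and evaluated on X_test here.
    print(subject, len(X_train), len(X_test))
```

Because no instance of the held-out subject contributes to either the model or the normalisation statistics, the resulting estimate is speaker independent.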
Then, the performance was computed over the pooled predictions of all instances. In this work, we utilised three of the most frequently used measures, i. e., Unweighted Average Recall (UAR), the overall accuracy (also known as Weighted Average Recall or WAR), and the F1 score (also known as F-score or F-measure), which is the harmonic mean of precision and recall.
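For reference, the three measures can be computed as in the sketch below. UAR is the mean of the per-class recalls, WAR is the fraction of correctly classified instances, and the F1 score of a class equals 2TP / (2TP + FP + FN). How the per-class F1 scores were averaged in this study is not stated, so the macro average used here is an assumption, and the example labels are hypothetical.

```python
def per_class_counts(y_true, y_pred):
    """Return {class: (tp, fp, fn)} for every class present in y_true."""
    stats = {}
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        stats[cls] = (tp, fp, fn)
    return stats


def uar(y_true, y_pred):
    """Unweighted Average Recall: mean of the per-class recalls."""
    stats = per_class_counts(y_true, y_pred)
    return sum(tp / (tp + fn) for tp, _, fn in stats.values()) / len(stats)


def war(y_true, y_pred):
    """Weighted Average Recall, identical to the overall accuracy."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def macro_f1(y_true, y_pred):
    """Macro-averaged F1 score (the averaging mode is an assumption)."""
    scores = [2 * tp / (2 * tp + fp + fn)
              for tp, fp, fn in per_class_counts(y_true, y_pred).values()]
    return sum(scores) / len(scores)


# Hypothetical predictions in which the 'mid' class is never recognised.
y_true = ["low"] * 4 + ["mid"] * 2 + ["high"] * 2
y_pred = ["low"] * 6 + ["high"] * 2
print(uar(y_true, y_pred), war(y_true, y_pred), macro_f1(y_true, y_pred))
```

In this example the accuracy is 0.75 while the UAR is only about 0.67, which illustrates why UAR is preferred for imbalanced data: a classifier that always predicts the majority class obtains a UAR of exactly one third in a three-class task, regardless of the class distribution.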
4.3 Severity Estimation
Table 3 reports the performance of the best SVM models for the two selected feature sets. In particular, the best model was chosen, based on UAR, from the SVMs trained with different C values. It can be seen that the large feature set, ComParE, performs slightly better than eGeMAPS for severity estimation; the corresponding UAR, accuracy, and F1-score are listed in Table 3.
Moreover, we further inspected the audio recordings from patients with different severity levels. An illustration is given in Fig. 2. In particular, three recordings were taken from three different patients, who were asked to say the same content. It can be seen that the first patient failed to produce the sentence due to his severe symptoms. The second sample is from a female patient. She successfully said the whole content following the given template, but had to pause several times to take a heavy breath before carrying on with the remaining content. In contrast, the third patient managed to produce the same recording more clearly and fluently.
4.4 Sleep Quality, Fatigue, and Anxiety Prediction
For audio-based sleep quality, fatigue, and anxiety estimation, we further trained SVM models for each task separately. The corresponding results are shown in Table 4, where the performance of the best model for each task and each feature set is provided. As before, the best model was selected according to the highest UAR, and its performance in terms of accuracy and F1-score is given as well.
When comparing the three tasks, the best performance is achieved for sleep quality classification, followed by anxiety prediction (see Table 4 for the UAR values). When it comes to fatigue prediction, the best UAR is just slightly better than chance level (one third for three-class classification).
Further, when comparing the two selected feature sets, the compact eGeMAPS set consistently outperforms the large-scale ComParE feature set. On the one hand, these results reveal the effectiveness of the eGeMAPS set for audio-based sleep quality, fatigue, and anxiety detection. On the other hand, the inferior performance of ComParE might be due to the small number of training samples relative to its high dimensionality.
In this preliminary study, experiments were carried out based on speech recordings from COVID-19 infected and hospitalised patients. The results have demonstrated the feasibility and effectiveness of audio-only-based COVID-19 analysis, specifically in estimating the severity level of the disease, and in predicting the health status of patients including sleep quality, fatigue, and anxiety. Nonetheless, there are still many ways to extend the present study for further development.
First, owing to time constraints, the collected data set is relatively small, and lacks control group data from both healthy subjects and patients with other respiratory diseases. The collection of these data is still in progress to enable a more comprehensive analysis in the future. In addition, AI techniques can be considered to tackle the data scarcity issue, such as data augmentation via generative adversarial networks [Goodfellow2014, Zhang19-Snore, Han19-Adversarial]. Given more data, the performance of our models is expected to improve further and become more robust.
Second, only functional features computed over whole segments were investigated. However, abnormal respiratory symptoms might be instantaneous and occur only in a short period of time. In this context, analysing low-level features in successive frames with sequential modelling might bring further performance improvement. Moreover, in addition to conventional handcrafted features, deep representation learning algorithms might be explored to learn representative and salient data-driven features for COVID-19 related tasks. These include deep latent representation learning [Han18-Emotion], self-supervised learning [pascual2019learning], and transfer learning [Ren18-LIR], to name but a few.
Further, in this study, the severity estimation based on the days of hospitalisation is rather coarse, as we lack other information regarding the patients’ health states. Similarly, the self-reported questionnaires about the patients’ conditions are rather subjective, as different patients may apply different criteria. If clinical examinations and reports of the patients were provided, such as CT scans of their lungs [bernheim2020chest, pan2020time, zhao2020relation], more objective and accurate labels could be obtained.
Last but not least, in this paper SVMs were trained separately for each of the four tasks. Considering the potential correlation between the severity of the disease and the patient’s sleep quality (or mood), a multi-task learning model might help to effectively exploit the complementary information from these tasks [zhang2016cross, parthasarathy2017jointly].
At the time of writing this paper, the world has reported a total of 3,020,117 confirmed COVID-19 cases and 209,799 fatalities, according to a dashboard developed and maintained by the Johns Hopkins University (https://coronavirus.jhu.edu/map.html). To leverage the potential of computer audition in the fight against this global health crisis, experiments have been performed, for the first time, on data from 51 COVID-19 infected and hospitalised patients from China. In particular, audio-based models have been constructed and assessed to predict the severity of the disease, as well as health-relevant states of the patients including sleep quality, fatigue, and anxiety. The experimental results have shown the great potential of exploiting audio analysis in the fight against the spread of COVID-19.
In the future, we will continue the data collection process and also collect relevant clinical reports for a comprehensive understanding of the patient state. In addition, we will attempt to introduce interpretable models and techniques to make the predictions more traceable, transparent, and trustworthy.
We express our deepest sorrow for those who left us due to COVID-19; they are lives, not numbers. We further express our highest gratitude and respect to the clinicians and scientists, and to everyone else who is currently helping to fight against COVID-19 while at the same time helping us maintain our daily lives. This work was partially supported by the Zhejiang Lab’s International Talent Fund for Young Professionals (Project HANAMI), P. R. China, the JSPS Postdoctoral Fellowship for Research in Japan (ID No. P19081) from the Japan Society for the Promotion of Science (JSPS), Japan, and the Grants-in-Aid for Scientific Research (No. 19F19081 and No. 17H00878) from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan.