There is growing interest in using voice as a biomarker to estimate health conditions. Some conditions, such as Parkinson’s disease [19, 10] and amyotrophic lateral sclerosis (ALS), are directly related to the speech production mechanism, while others, such as schizophrenia [12, 22], depression [5, 6], and bipolar disorder [2, 13], are indirectly related through neurological processes that can modulate voice.
targeting the classification of emotional status from a given voice segment. This work expands the scope to personal well-being; in particular, we study measurements such as anxiety, sleep quality, and mood.
Like the related applications mentioned above, we assume that a person’s condition is embedded in their voice through voluntary or involuntary modulation of the articulatory organs. Based on this assumption, we propose an approach that extracts salient features from voice with respect to the target measurements and trains machine learning models to estimate those measurements automatically.
In this work, we use subjects’ self-assessed measurements as the ground truth of their well-being status. Dealing with subjective matters like well-being, however, is difficult partly because it involves personal perspectives, feelings, and opinions that may vary across subjects. Therefore, we use a questionnaire-based approach [9, 14] to consolidate responses from multiple questions that reflect various aspects of the target measurements.
2 Data Collection
We recruited 219 participants from Canada for data collection. Apart from 13 participants who did not specify their demographic information, there are 114 males and 92 females, with ages ranging from 25 to 55 (33.6 ± 8.1) years. All participants are native English speakers, including 12 French–English bilinguals. They were also asked to report their medical conditions. With multiple choices allowed, 159 participants reported no previous medical condition, while others reported having been treated for conditions such as depression (22 participants) and migraine or headache (13 participants).
The data collection was designed as five sessions per participant, with intervals of several days in between. During the campaign, 202 participants completed all five sessions, and the intervals between sessions ranged from 1.2 to 15.3 (3.7 ± 1.9) days. At each session, the participants were asked to answer a set of questionnaires on mobile devices: four written surveys and one questionnaire requiring voice responses.
After removing sessions that are incomplete or have missing fields, a total of 1,048 sessions were collected (approximately 4.8 sessions per subject).
2.2.1 Written surveys
As discussed earlier, we use a questionnaire-based approach to collect self-assessed measurements representing subjects’ well-being status. To this end, we adopt clinically validated questionnaires designed to measure anxiety, sleep quality, and mood, as follows.
Anxiety: The state-trait anxiety inventory (STAI) and the generalized anxiety disorder scale (GAD7) are used to measure the level of anxiety. STAI presents six emotional states for subjects to score based on how they feel at the moment, while GAD7 presents seven emotional states to score how often they have felt that way over the past two weeks. Both have consolidation rules that generate an anxiety level from the answers; a higher value means higher anxiety.
Sleep quality: We use the Pittsburgh sleep quality index (PSQI), which was designed to measure sleep quality. The questionnaire covers various aspects of sleep quality, such as length of sleep, as well as disturbing factors and their frequencies. The scoring guideline generates a level of sleep discomfort; a higher value indicates worse sleep quality.
Mood: We use the positive and negative affect schedule (PANAS). It presents ten emotional states (five positive and five negative), and subjects answer to what extent they have felt each over the past week. An aggregated score represents the subject’s status; a higher value indicates a more negative mood and a lower value a more positive mood.
Table 1 summarizes the statistics of the collected scores along with the possible score ranges. Note that the collected data covers the possible range of each measurement, except for PSQI, which covers only the lower range.
2.2.2 Voice responses
The questionnaire requiring voice responses is designed to capture vocal behaviors for use in estimating the measurements described above. We asked seven questions to elicit three types of voice responses: one spontaneous speech, four sentence readings, and two paragraph readings.
For spontaneous speech, participants were instructed to speak freely on any topic for about a minute. For sentence and paragraph readings, phonetically balanced reading materials were prompted for the subjects to read aloud (on average, 8.5 words per sentence reading and approximately 130 words per paragraph reading).
Table 2 shows the statistics of the collected voice responses in terms of recorded length. Overall, we collected approximately 54 hours of voice data.
| Measurement | Possible range | Collected range | Collected mean ± std |
| --- | --- | --- | --- |
| STAI | 20–80 | 20–80 | 38.4 ± 13.2 |
| GAD7 | 0–21 | 0–21 | 6.1 ± 4.9 |
| PSQI | 0–21 | 0–15 | 5.3 ± 2.8 |
| PANAS | 10–50 | 10–50 | 24.4 ± 6.7 |
| Voice response | Question | Length (secs., mean ± std) |
| --- | --- | --- |
| Sentence reading | Q2 | 5.4 ± 1.5 |
| Paragraph reading | Q6 | 48.6 ± 8.6 |
[Table 3 layout: results with individual features (spontaneous, sentence reading, paragraph reading) and with concatenated features; cell values not recovered.]
3.1 Feature extraction and selection
The features fall into two categories: acoustic and linguistic. Acoustic features capture signal-level modulations due to the speaker’s status, while linguistic features capture language-level patterns that may be influenced by the condition.
Acoustic features are calculated on a per-frame basis, where frames are 25 ms sliding windows created every 10 ms. A 41-dimensional supervector of features such as mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), prosody, and voice-quality-related features is generated per frame, and its delta and delta-delta are concatenated to capture frame-level context. To summarize, we apply 19 statistical functions such as mean, median, skewness, kurtosis, quartiles, percentiles, and slope to generate a response-level feature vector.
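As an illustration, this summarization step can be sketched as follows. This is a minimal sketch, not the paper's implementation: simple first-order differences stand in for the delta computation, only five of the 19 functionals are shown, and random data stands in for the real 41-dimensional supervectors.

```python
import numpy as np

def deltas(feats):
    # First-order difference as a stand-in for regression-based deltas.
    return np.diff(feats, axis=0, prepend=feats[:1])

def response_level_vector(frame_feats):
    """frame_feats: (num_frames, 41) per-frame supervectors.
    Concatenate deltas and delta-deltas, then apply statistical functionals
    over the time axis to get one fixed-size vector per response."""
    d = deltas(frame_feats)
    dd = deltas(d)
    x = np.concatenate([frame_feats, d, dd], axis=1)  # (T, 123)
    funcs = [
        np.mean, np.median, np.std,
        lambda a, axis: np.percentile(a, 25, axis=axis),  # lower quartile
        lambda a, axis: np.percentile(a, 75, axis=axis),  # upper quartile
    ]
    return np.concatenate([f(x, axis=0) for f in funcs])

# Toy example: 100 frames of a 41-dim supervector.
rng = np.random.default_rng(0)
vec = response_level_vector(rng.normal(size=(100, 41)))
print(vec.shape)  # (615,) = 123 dims x 5 functionals
```

With all 19 functionals over the 123-dimensional contextualized frames, the same structure yields the response-level acoustic vector described above.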
Linguistic features are based on the output of automatic speech recognition (ASR). We used Canary’s general English model, which is trained on publicly available datasets such as TED-LIUM and LibriSpeech using the time delay neural network (TDNN) architecture in Kaldi. On top of common features such as part-of-speech ratios, syllable duration, filler ratio, and word repetition ratio over the total number of spoken words, we extract a different feature set depending on whether the response is spontaneous or read.
For the spontaneous voice responses, where no prompted text is given, semantic features are extracted, including the word popularity percentile (footnote 1), the word frequency of depression-related terms (footnote 2), and the positive and negative sentiment likelihood (footnote 3). For the read responses, on the other hand, the ASR errors (insertions, deletions, and substitutions) are computed against the given text.

Footnote 1: Word popularity was computed from the general English language model with 130K words; the distribution of values per response was examined and its percentile values were used.
Footnote 2: We built a dictionary of depression-related words and negative expressions, as well as common words observed in depression patients’ speech, resulting in 486 terms.
Footnote 3: The sentiment likelihood score is generated by a binary classification model trained on the Stanford Sentiment Treebank using Sentiment Neuron.
The feature dimension is 2,357 for a read response and 2,364 for a spontaneous response. For feature selection, we compute Pearson’s correlation coefficient between the extracted features and the individual self-assessed measurement scores, and select the top-n correlated features for the model. Note that this is done at the response and measurement level, so that different features can be selected from different responses depending on the target measurement.
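The selection step can be sketched as follows; `select_top_n` is a hypothetical helper, and the planted column in the toy example stands in for a genuinely correlated feature. Ranking by absolute correlation is an assumption on our part.

```python
import numpy as np

def select_top_n(features, scores, n):
    """Return indices of the n feature columns most correlated (in absolute
    value) with the self-assessed scores. Run once per response type and
    per target measurement, so each pair gets its own feature subset."""
    f = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-12)
    s = (scores - scores.mean()) / (scores.std() + 1e-12)
    r = (f * s[:, None]).mean(axis=0)  # per-column Pearson correlation
    return np.argsort(-np.abs(r))[:n]

# Toy example: plant one perfectly correlated feature among noise.
scores = np.arange(50, dtype=float)
rng = np.random.default_rng(1)
feats = rng.normal(size=(50, 10))
feats[:, 3] = scores
print(select_top_n(feats, scores, 1))  # → [3]
```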
As a baseline, we use the openSMILE toolkit with the eGeMAPS configuration, which is widely used and shows promising results in affective-computing fields such as speech emotion recognition. Since it generates an 88-dimensional feature vector per speech signal, we set the proposed feature selection procedure to retain the same number of features for a fair comparison.
We use a fully connected deep neural network (FC-DNN) with 4 hidden layers of 256 units each. Each layer uses the rectified linear unit (ReLU) activation, with L2 regularization and 50% dropout to prevent overfitting. The output layer has a single unit, as we build a regression model. We use the Adam optimizer with a learning rate of 0.0001 and mean squared error (MSE) as the loss function. The batch size is 32, and training stops after 100 epochs.
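A sketch of the described regressor, here in PyTorch (the paper does not name a framework, so this is an assumption); Adam’s `weight_decay` stands in for the L2 regularizer, and the input dimension would be the number of selected features (e.g., 88 for the baseline).

```python
import torch
import torch.nn as nn

def build_regressor(input_dim):
    """FC-DNN: 4 hidden layers of 256 ReLU units with 50% dropout,
    followed by a single-unit regression output."""
    layers, dim = [], input_dim
    for _ in range(4):
        layers += [nn.Linear(dim, 256), nn.ReLU(), nn.Dropout(0.5)]
        dim = 256
    layers.append(nn.Linear(dim, 1))
    return nn.Sequential(*layers)

model = build_regressor(88)
# weight_decay approximates the L2 regularizer described in the text.
opt = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
loss_fn = nn.MSELoss()
out = model(torch.zeros(32, 88))  # one batch of 32
print(out.shape)
```

The training loop would iterate mini-batches of 32 for 100 epochs, minimizing `loss_fn` between predictions and self-assessed scores.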
4 Experimental Results
We perform 5-fold cross-validation: we randomly split the whole dataset into five exclusive folds and iteratively use one fold as test data and the rest as training data. Each fold is subject-independent in the sense that different folds do not share data from the same subject. Note that we treat all sessions as independent; a longitudinal study investigating changes over time is left for future work.
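Subject-independent splitting can be sketched as follows, assuming a per-session array of subject IDs; round-robin assignment of shuffled subjects to folds is one simple way (our assumption, not necessarily the paper's) to keep all of a subject's sessions together.

```python
import numpy as np

def subject_independent_folds(subject_ids, k=5, seed=0):
    """Assign each session to one of k folds such that all sessions
    from the same subject land in the same fold."""
    subjects = np.unique(subject_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(subjects)
    fold_of_subject = {s: i % k for i, s in enumerate(subjects)}
    return np.array([fold_of_subject[s] for s in subject_ids])

# Toy example: 9 sessions from 6 subjects.
ids = np.array([1, 1, 2, 2, 3, 3, 4, 5, 6])
folds = subject_independent_folds(ids)
# Every subject's sessions share exactly one fold.
assert all(len(set(folds[ids == s])) == 1 for s in np.unique(ids))
```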
The performance is measured in terms of the concordance correlation coefficient (CCC) between self-assessed scores and predicted scores.
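Following Lin’s definition, the CCC can be computed as:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient (Lin, 1989):
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mt, mp = y_true.mean(), y_pred.mean()
    vt, vp = y_true.var(), y_pred.var()
    cov = ((y_true - mt) * (y_pred - mp)).mean()
    return 2 * cov / (vt + vp + (mt - mp) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
print(ccc(x, x))  # → 1.0 for perfect agreement
```

Unlike Pearson’s correlation, the CCC penalizes scale and location shifts, so a systematically biased predictor scores below 1 even if it correlates perfectly.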
Table 3 shows the CCC between self-assessed scores and predicted scores using voice responses, with respect to the target measurement. The left part of the table shows results with features from individual voice responses, while the right part shows results with concatenated features from individual voice responses. Fig. 1 shows the scatter plot of self-assessed measurements versus estimated values using the concatenated features.
The results with individual voice responses show that the proposed method outperforms the conventional general-purpose feature extraction method in estimating all measurements with all voice responses. The same holds for the results with concatenated features. This indicates that the proposed feature selection strategy, which accounts for differences between individual responses, is beneficial. The proposed linguistic features also appear to contribute to the improvement.
It is notable that the correlation coefficients obtained with features from sentence readings are higher than those from spontaneous or paragraph-reading responses. The results also show that concatenating features from individual voice responses boosts performance. This suggests that concatenating features from multiple short voice responses is more beneficial for estimating self-assessed measurements than using one long voice response with a fixed feature set. One possible reason is that salient features are averaged out over a long response; an in-depth study is left for future work. We also plan to study the impact of ASR performance on linguistic feature extraction, as well as to conduct a longitudinal study of changes over time.
We showed promising results in estimating well-being status from voice in terms of various measurements, i.e., anxiety, sleep quality, and mood. The proposed method utilizes acoustic and linguistic features along with a response- and measurement-level feature selection strategy, followed by a deep neural network based regression model. Experimental results show that sleep quality (PSQI) has the strongest correlation, while anxiety (STAI) has the weakest. Although the concordance correlation coefficients between self-assessed and estimated scores are not very strong, they are statistically significant.
The authors would like to thank Nathan Fisk, Scott Ferguson, and Mark Bartlett for valuable discussions.
References

-  (2018) In an absolute state: elevated use of absolutist words is a marker specific to anxiety, depression, and suicidal ideation. 6 (4), pp. 529–542. PMID: 30886766.
-  (2019) Identifying mood episodes using dialogue features from clinical interviews. In Proc. Interspeech 2019, pp. 1926–1930.
-  (2008) IEMOCAP: interactive emotional dyadic motion capture database.
-  (1989) The Pittsburgh sleep quality index: a new instrument for psychiatric practice and research. 28 (2), pp. 193–213.
-  (2019) Predicting depression and emotions in the cross-roads of cultures, para-linguistics, and non-linguistics. In AVEC 2019.
-  (2019) Assessing neuromotor coordination in depression using inverted vocal tract variables. In Proc. Interspeech 2019, pp. 1448–1452.
-  (2015) The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. 7, pp. 1–1.
-  (2013) Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia, MM ’13, New York, NY, USA, pp. 835–838.
-  (1993) Cross-cultural adaptation of health-related quality of life measures: literature review and proposed guidelines. 46 (12), pp. 1417–1432.
-  (2019) Identifying distinctive acoustic and spectral features in Parkinson’s disease. In Proc. Interspeech 2019, pp. 2498–2502.
-  (1989) A concordance correlation coefficient to evaluate reproducibility. 45 (1), pp. 255–268.
-  (2015) Can the acoustic analysis of expressive prosody discriminate schizophrenia? The Spanish Journal of Psychology 18, pp. E86.
-  (2019) Into the wild: transitioning from recognizing mood in clinical interactions to personal conversations for individuals with bipolar disorder. In Proc. Interspeech 2019, pp. 1438–1442.
-  (2001) Toward a multidimensional health assessment questionnaire (MDHAQ): assessment of advanced activities of daily living and psychological status in the patient-friendly health assessment questionnaire format. 42, pp. 2220–2230.
-  (2011) The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
-  (2017) Learning to generate reviews and discovering sentiment. CoRR abs/1704.01444.
-  (2019) Profiling speech motor impairments in persons with amyotrophic lateral sclerosis: an acoustic-based approach. In Proc. Interspeech 2019, pp. 4509–4513.
-  (2013) The Interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism. In Interspeech 2013.
-  (2018) Vocal markers of motor, cognitive, and depressive symptoms in Parkinson’s disease. pp. 71–78.
-  (2006) A brief measure for assessing generalized anxiety disorder: the GAD-7. 166 (10), pp. 1092–1097.
-  (2009) Support for the reliability and validity of a six-item state anxiety scale derived from the state-trait anxiety inventory. Journal of Nursing Measurement 17 (1), pp. 19–28.
-  (2019) Objective assessment of social skills using automated language analysis for identification of schizophrenia and bipolar disorder. In Proc. Interspeech 2019, pp. 1433–1437.
-  (1988) Development and validation of brief measures of positive and negative affect: the PANAS scales. Vol. 54, American Psychological Association, US.