Depression is a widespread disease affecting more than 300 million people worldwide. Without adequate treatment, a person with depression suffers from multiple symptoms, including insomnia, loss of interest, and, at the extreme end, suicide. An increasing amount of research has been conducted on automatic depression detection and severity prediction, in particular from conversational speech, which embeds crucial information about one's mental state. However, the models so far are heavily restricted by the limited amount of depression data. This data sparsity hampers both accuracy improvements and reproducibility.
Many sparse scenarios in natural language processing (NLP) tasks have benefited from pretrained text embeddings such as GloVe, BERT, and ELMo. Regarding multi-modal research, pretrained audio embeddings such as SoundNet have been found to outperform traditional spectrogram-based features for acoustic environment classification. All these pretrained neural networks take advantage of a self-supervised encoder-decoder model, which does not require manual labeling and can therefore be pretrained on large datasets.
However, little research has been done on pretraining audio features. Utilizing audio-based features for depression detection has potential downsides compared to high-level text-based features: content-rich audio contains undesirable information, such as environmental sounds, interfering speech, and noise, and features are typically low-level, extracted within a short time-scale, so each carries little information about high-level structure spanning a longer sequence (e.g., a spoken word). In our view, a successful audio embedding for depression detection needs to be extracted on sequence-level (e.g., sentence), in order to capture rich, long-term spoken context as well as emotional development within an interview. Thus, this work aims to explore whether depression detection via audio can benefit from a pretrained network.
This paper proposes DEPA, a self-supervised, Word2Vec-like pretrained depression audio embedding method for automatic depression detection. Two sets of DEPA experiments are conducted. First, we investigate the use of DEPA by pretraining on depression (in-domain) data. Second, we further explore out-domain pretraining on interview datasets from other mental disorders and on general-purpose speech datasets. To our knowledge, this is the first time a pretrained audio network is applied to a depression detection task. More importantly, the approach can be generalized to other speech research with limited data resources.
2 Related Work
In this section, related work on depression detection and self-supervised learning will be discussed.
2.1 Depression detection
Various methods have been proposed for automatic depression detection. Previous speech-based detection work has experimented with various acoustic features, such as prosodic features (e.g., pitch, jitter, loudness, speaking rate, energy, pause time, intensity), spectral features (e.g., formants, energy spectrum density, spectral energy distribution, vocal tract spectrum, spectral noise), cepstral features (e.g., Mel-Frequency Cepstral Coefficients), and, more recently, feature combinations like COVAREP (CVP), which consists of a high-dimensional feature vector covering common features such as fundamental frequency and peak slope. Deep learning methods have also been employed to extract high-level feature representations [8, 1]. Despite these trials of different features and models, the F1 scores achieved by speech-based depression detection remain mediocre. Work in  indicated that pretraining text embeddings on a large, task-independent corpus can significantly enhance detection performance.
2.2 Self-supervised learning
Self-supervised learning is a technique where training data is autonomously labeled, yet the training procedure is supervised. In NLP, pretrained word embeddings are trained with self-supervised learning and have been applied to a variety of tasks with superior performance. The main philosophy is to predict the next word/sentence given a contextual history/future, without requiring any manual labeling. Self-supervised methods can thereby extract useful information about the data itself. Our main inspiration for this work stems from , where a self-supervised approach was taken to extract general-purpose audio representations. This method can thus be applied to depression detection to capture implicit information underneath each speaker's speech and make predictions on their depressed state.
We propose DEPA, an auditory feature extracted via a neural network to capture non-trivial speech details. Our proposed method consists of a self-supervised encoder-decoder network, where the encoder is later used as the DEPA embedding extractor operating on spectrograms. The input is a spectrogram S of a specific audio clip with T frames and data dimension D (e.g., frequency bins).
We proceed to slice S into N non-overlapping sub-spectrograms S_1, ..., S_N. Then, 2k sub-spectrograms are selected, with k sub-spectrograms before and k after a center one S_c, where k+1 ≤ c ≤ N−k. The self-supervised training process treats the center spectrogram S_c as the target label, given its surrounding spectrograms, and computes the embedding loss (Equation 1). The detailed pretraining process can be seen in Algorithm 1 and depicted in Figure 1.
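As an illustration, the slicing and context selection above can be sketched as follows; the sub-spectrogram length (32 frames), context size (k = 2), and feature dimension (128) are placeholder values, not the paper's settings.

```python
import numpy as np

def make_context_pairs(spec, sub_len=32, k=2):
    """Slice a (T, D) spectrogram into non-overlapping sub-spectrograms of
    sub_len frames each, then pair every center sub-spectrogram with its
    k neighbours on either side (a CBOW-style target/context split)."""
    n = spec.shape[0] // sub_len
    subs = [spec[i * sub_len:(i + 1) * sub_len] for i in range(n)]
    pairs = []
    for c in range(k, n - k):
        context = subs[c - k:c] + subs[c + 1:c + k + 1]  # 2k surrounding pieces
        pairs.append((np.stack(context), subs[c]))       # (context, center target)
    return pairs

spec = np.random.randn(320, 128)  # 320 frames, 128 frequency bins
pairs = make_context_pairs(spec)
print(len(pairs))  # 6 center positions with full context on both sides
```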
Encoder architecture: The encoder contains three downsampling blocks. Each block consists of a convolution, average pooling, batch normalization, and ReLU activation layer.
Decoder architecture: The decoder upsamples via three transposed convolutional upsampling blocks and predicts the center spectrogram. The model is then updated via the embedding loss in Equation 1. The encoder-decoder architecture is shown in Figure 2.
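A minimal PyTorch sketch of such an encoder-decoder follows; the channel counts and kernel sizes are assumptions, and the 2k context sub-spectrograms are simplified here to a single input channel (they could equally be stacked along the channel axis).

```python
import torch
import torch.nn as nn

def down_block(c_in, c_out):
    # convolution -> average pooling -> batch norm -> ReLU, as in the text
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.AvgPool2d(2),
                         nn.BatchNorm2d(c_out),
                         nn.ReLU())

def up_block(c_in, c_out):
    # transposed convolution doubles the spatial resolution
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 2, stride=2),
                         nn.BatchNorm2d(c_out),
                         nn.ReLU())

class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(down_block(1, 16),
                                     down_block(16, 32),
                                     down_block(32, 64))
        self.decoder = nn.Sequential(up_block(64, 32),
                                     up_block(32, 16),
                                     nn.ConvTranspose2d(16, 1, 2, stride=2))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = EncoderDecoder()
center_hat = model(torch.randn(8, 1, 32, 128))  # batch of context inputs
print(center_hat.shape)  # same shape as the center sub-spectrogram
```

After pretraining, only `model.encoder` would be retained as the embedding extractor.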
After pretraining the encoder-decoder network, DEPA is extracted by feeding a variable-length audio segment (here on response-level) into the encoder model and obtaining a single fixed-dimensional embedding. DEPA is then further fed into a depression detection network, which will be discussed in Section 4.1.
We aim to compare DEPA pretrained on in-domain (depression detection) and out-domain (e.g., speech recognition) datasets.
Regarding in-domain data, we utilized the publicly available DAIC dataset for pretraining in order to compare DEPA to traditional audio feature approaches. To ascertain DEPA's usability beyond depression data, we further used the mature Switchboard (SWB) dataset, containing English telephone speech. The Alzheimer's disease (AD) dataset was privately collected at a Shanghai mental health clinic and contains about 400 hours (questions and answers) of Mandarin interview material from senior patients. The three datasets are summarized in Table 1.
The most broadly used dataset within depression detection is the Distress Analysis Interview Corpus - Wizard of Oz (DAIC) [7, 4], which encompasses 50 hours of data collected from 189 clinical interviews of a total of 142 patients. Two labels are provided for each participant: a binary diagnosis of depressed/healthy and the patient's eight-item Patient Health Questionnaire (PHQ-8) score. Thirty speakers within the training set (28%) and 12 within the development set (34%) are classified as depressed (binary label set to 1). The DAIC dataset is fully transcribed, including corresponding on- and offsets within the audio. The training subset contains approximately 13 hours of responses, and the development set approximately 6 hours. This database was previously used for the Audio/Visual Emotion Challenge 2017 (AVEC2017). While the dataset contains training, development, and test subsets, our evaluation protocol is reported on the development subset, since test subset labels are only available to participants of the AVEC2017 challenge.
Two features are investigated: Mel spectrograms (MSP) and CVP. Due to different sample rates across the datasets, we resample each dataset's audio to 22050 Hz. 128-dimensional MSP features are extracted with a fixed window length; HCVP is the mean of each CVP feature dimension across an audio segment and is therefore a fixed-dimensional, response-level vector.
DEPA Pretraining Process
In this work, the encoder-decoder training utilizes MSP features, and the encoder extracts a fixed-dimensional DEPA embedding. The model is trained for 4000 epochs using Adam optimization. The pretraining process differs for in-domain and out-domain datasets. For in-domain data, all responses of a patient are concatenated, meaning that silence and interviewer speech are discarded. For out-domain data, no preprocessing is done, and the entire dataset is utilized.
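A single pretraining step might look as follows, assuming a mean-squared reconstruction loss stands in for the embedding loss of Equation 1 and using a placeholder learning rate of 1e-4 (the paper's value is not reproduced in this extraction); the one-layer model merely stands in for the full encoder-decoder.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the encoder-decoder network.
model = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is assumed
criterion = nn.MSELoss()  # assumed form of the embedding loss (Equation 1)

context = torch.randn(8, 1, 32, 128)  # context-derived input batch
center = torch.randn(8, 1, 32, 128)   # center sub-spectrogram targets

optimizer.zero_grad()
loss = criterion(model(context), center)  # reconstruction error
loss.backward()
optimizer.step()
print(float(loss))
```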
4.1 Depression Detection
The final decision about the depression state and severity is made by a multi-task model based on previous work in . This approach models a patient's depression sequentially, meaning that only the patient's responses are utilized. Due to the recent success of LSTM networks in this field [1, 6], our depression prediction structure follows a bidirectional LSTM (BLSTM) approach with four layers. Dropout is applied after each BLSTM layer to prevent overfitting. At each response (timestep), the model outputs a two-dimensional vector representing the estimated binary patient state as well as the PHQ-8 score. Finally, first-timestep pooling is applied to reduce all responses of a patient to a single vector. The architecture is shown in Figure 3.
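The detection model described above can be sketched as follows; the four BLSTM layers follow the text, while the input feature dimension, hidden size, and dropout rate are placeholders.

```python
import torch
import torch.nn as nn

class DepressionBLSTM(nn.Module):
    """Multi-task BLSTM sketch: per-response outputs, then first-timestep
    pooling to a single [binary state, PHQ-8 score] vector per patient."""
    def __init__(self, feat_dim=256, hidden=128, dropout=0.3):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=4, dropout=dropout,
                             bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, 2)  # [binary logit, PHQ-8 score]

    def forward(self, x):          # x: (patients, responses, feat_dim)
        out, _ = self.blstm(x)     # per-response BLSTM outputs
        preds = self.head(out)     # two-dimensional vector per response
        return preds[:, 0, :]      # first-timestep pooling

model = DepressionBLSTM()
y = model(torch.randn(4, 20, 256))  # 4 patients, 20 responses each
print(y.shape)
```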
Similar to , binary cross-entropy loss between the predicted and ground-truth binary labels is used for classification (Equation 2), while Huber loss between the predicted and ground-truth PHQ-8 scores is used for regression (Equation 3). σ denotes the sigmoid function.
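Since Equations 2 and 3 are not reproduced in this extraction, the following is a generic sketch of the two losses; the unweighted sum and the Huber threshold delta = 1 are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(logit, y):
    """Binary cross-entropy between sigmoid(logit) and label y (Equation 2 style)."""
    p = sigmoid(logit)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def huber(pred, target, delta=1.0):
    """Huber loss between predicted and true PHQ-8 score (Equation 3 style):
    quadratic for small residuals, linear beyond delta."""
    r = np.abs(pred - target)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

# Joint multi-task objective as an unweighted sum (an assumption).
total = bce(0.0, 1.0) + huber(10.0, 8.0)
print(round(float(total), 4))  # -log(0.5) + 1.5 = 2.1931
```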
Results are reported in terms of mean absolute error (MAE) and root mean square error (RMSE) for regression, and macro-averaged F1 score for classification.
Detection training process
The detection training process differs slightly among DEPA, HCVP, and MSP features. Even though all of them are extracted on response-level, HCVP and DEPA are fixed-sized vector representations, while MSP is a variable-length feature sequence. Data standardization is applied by calculating a global mean and variance on the training set and applying those to the development set. Adam optimization is used.
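The standardization step can be sketched as follows (the epsilon guard is an added detail to avoid division by zero):

```python
import numpy as np

def standardize(train, dev, eps=1e-8):
    """Compute a global per-dimension mean/std on the training set and apply
    the same statistics to both the training and development splits."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0) + eps
    return (train - mu) / sd, (dev - mu) / sd

train = np.random.randn(100, 8) * 3.0 + 5.0  # toy response-level features
dev = np.random.randn(20, 8) * 3.0 + 5.0
train_n, dev_n = standardize(train, dev)
print(train_n.mean(axis=0).round(6))  # ~0 per dimension on the training split
```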
Results in Table 2 are compared on two different levels:
Feature comparison: The first two rows of Table 2 indicate that fixed-sized response-level features (HCVP) indeed outperform variable-sized sequence features (MSP). Regarding in-domain training (third row), DEPA excels in comparison to both traditional features in terms of classification and regression performance.
Out-domain DEPA pretraining produced interesting results: pretraining on either out-domain dataset, SWB or AD, outperforms in-domain DAIC pretraining in terms of binary classification (F1). Further, pretraining on AD resulted in the lowest regression errors in terms of MAE and RMSE. We attribute the superior performance of AD pretraining to the close relation between cognitive impairment and depression; thus, more speech characteristics are shared between AD and DAIC (depression) data. More importantly, when jointly pretraining on all available datasets (713 h), performance drops to MSP levels, implying that while pretraining can be done on virtually any dataset, one should pay attention to coherent dataset content. It is thus of future interest to explore how well a pretrained audio embedding generalizes, given that emotion can be language-independent.
This work proposed DEPA, an audio embedding pretraining method for automatic depression detection. An encoder-decoder model is trained in self-supervised fashion to predict and reconstruct a center spectrogram given its spectrogram context. DEPA is then extracted from the trained encoder model and fed into a multi-task depression detection BLSTM. DEPA exhibits excellent performance compared to traditional spectrogram and COVAREP features. In-domain results show significantly better depression presence detection (F1 0.72, MAE 4.72) compared to traditional spectrogram features without DEPA (F1 0.61, MAE 6.07). Out-domain results imply that DEPA pretraining can be done on virtually any spoken-language dataset while still being beneficial to depression detection performance.
-  (2018) Detecting depression with audio/text sequence modeling of interviews. In Proc. Interspeech 2018, pp. 1716–1720. External Links: Cited by: §2.1, §4.1.
-  (2016) Soundnet: learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, Cited by: §1.
-  (2014-05) COVAREP - A collaborative voice analysis repository for speech technologies. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014), Florence, Italy, pp. 960–964. External Links: Cited by: §2.1.
-  (2014) SimSensei kiosk: a virtual human interviewer for healthcare decision support. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS ’14, Richland, SC, pp. 1061–1068. External Links: Cited by: §4.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
-  (2019) Text-based depression detection: what triggers an alert. arXiv preprint arXiv:1904.05154. Cited by: §2.1, §4.1, §4.1.
-  (2014-05) The Distress Analysis Interview Corpus of human and computer interviews. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland, pp. 3123–3128. External Links: Cited by: §4.
-  (2018) Measuring depression symptom severity from spoken language and 3d facial expressions. arXiv preprint arXiv:1811.08592. Cited by: item 2, §2.1.
-  (2009-04) The phq-8 as a measure of current depression in the general population. Journal of Affective Disorders 114 (1-3), pp. 163–173 (English). External Links: Cited by: §4.
-  (2014) GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. External Links: Cited by: §1.
-  (2018) Deep contextualized word representations. In Proc. of NAACL, Cited by: §1.
-  (2019) MFCC-based recurrent neural network for automatic clinical depression recognition and assessment from speech. arXiv preprint arXiv:1909.07208. Cited by: §2.1.
-  (2017) AVEC 2017: real-life depression, and affect recognition workshop and challenge. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, AVEC ’17, New York, NY, USA, pp. 3–9. External Links: Cited by: §4.
-  (2019) Self-supervised audio representation learning for mobile devices. arXiv preprint arXiv:1905.11796. Cited by: §2.2.