Depression is an illness that affects, knowingly or unknowingly, millions of people worldwide. Efficient and effective automatic depression diagnosis can be of substantial benefit. However, this is an extremely difficult task, since a variety of complicated symptoms are reported and the subjective clinical interview remains the gold standard. Recent improvements mostly derive from multi-modal fusion and deep learning methods. Similar to a clinical interview, in which a psychiatrist determines the patient’s mental state via their language and behaviour, automatic detection can draw on different signals, namely video, audio and text. Of the three modalities, audio features are the most explored individually, while text features by themselves are rarely investigated. Lately, the multi-modal trend has prompted more modality fusion studies. While one could argue that more information will likely lead to a better model, using all possible aspects of multi-modal depression detection has its practical downsides. For instance, obtaining consent for video recording might be a great obstacle in real-life situations, especially with mentally ill patients. Hence this paper is guided by the question of whether a single modality can achieve performance similar to that of multi-modal models.
Some of the modality fusion studies have suggested superior performance of text features in depression detection, indicating the importance of semantic information. Among the few attempts at text-based models, word embeddings are usually trained from scratch, which might be suboptimal due to the lack of large quantities of data. Recently, general-purpose text embeddings such as ELMo and BERT, which are pretrained on large datasets, have become popular due to their performance on many natural language processing benchmarks. Therefore, pretrained contextual sentence embeddings, namely ELMo and BERT, are investigated in the current work for their use in depression detection.
Previous automatic assessment often involves classification or regression models, depending on whether the main task is depression presence or severity prediction. Though various deep learning models have been experimented with, the assessment precision still leaves considerable room for improvement. For severity prediction models, the reported mean absolute errors and root mean squared errors are particularly high. This again emphasizes the complexity of depression symptoms and the difficulty of precise prediction. Moreover, in health-related tasks, any false positive or false negative judgement could lead to severe outcomes. However, due to the opaqueness of deep learning models, we often have no clue what went wrong when a false prediction is made. Understanding the model is thus as critical as enhancing performance in such tasks.
Therefore, this paper has two main objectives: first, we examine whether text features can achieve performance similar to multi-modal methods; second, we are interested in why the model makes certain predictions. Accordingly, our main contributions are 1) a multi-task model design that combines detecting the presence of depression with predicting its severity; 2) the substitution of data-based word embeddings with pretrained text embeddings; 3) an attention mechanism that provides interpretations of which words or sentences trigger the model to believe a person is suffering from depression.
The rest of the paper is organized as follows. Section 2 provides a task overview with reference to relevant work and an introduction to our model architecture. Section 3 describes the experiments with different text embeddings and context settings. Section 4 provides an analysis based on attention pooling as an interpretation of our model decisions. Conclusions can be found in Section 5.
2 Task Overview
Data was acquired from the publicly available Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WOZ) [6, 7] database, which encompasses 107 training and 35 development speakers. An evaluation subset was also published, yet its labels are not available; therefore all experiments were validated on the development subset. This database was previously used for the AVEC2017 challenge. 30 speakers within the training set (28 %) and 12 within the development set (34 %) are classified as having depression (PHQ8 binary value is set to 1). Two labels are provided for each participant: a binary diagnosis of depressed/healthy and the participant’s eight-item Patient Health Questionnaire (PHQ-8) score. Consequently, automatic depression detection research based on this dataset can predict either the classification result or a severity score, corresponding to the mental state label and the PHQ-8 score respectively.
Analyzing the data in Figure 1 helps to understand the challenges involved in modelling this task. The AVEC2017 challenge paper states that participants whose PHQ-8 score exceeds a fixed threshold are considered depressed. However, as presented in Figure 1, no clear causal relationship between the PHQ-8 score and the patient state can be established: though there is a tendency for depressed patients to have higher PHQ-8 scores, a score above the threshold is no guarantee that a participant is depressed. Especially in the boundary region between both classes, at a score range of 9 to 11, some participants cannot be assigned to a class according to their PHQ-8 score. This is because the PHQ-8 score serves only as a reference and the clinician makes the final decision on the diagnosis. PHQ-8 scores might be helpful in making a prediction, but they still need to be combined with the clinician’s decision. If a patient is not depressed, the PHQ-8 score does not indicate depression severity.
To sum up, two observations can be made: 1) the dataset itself is relatively small; 2) the depression state and PHQ-8 score are correlated, but one characteristic does not necessarily predict the other.
2.2 Feature Selection and Extraction
The DAIC-WOZ dataset encompasses three major media: video, audio and transcribed text data. Prior work on this dataset with better performance generally utilizes modality fusion. However, it has been suggested that the key contribution comes from the addition of semantic information, which individually achieves a mean score of 0.81. Hence, in this work we incorporate only the text data, with a view to straightforward real-world application.
On the subject of text-based depression analysis, three different modelling settings are widely used:
Context-free modelling uses each response of the participant as an independent sample, without information about the question or the time at which it was asked. This setting has the advantage of being easy to deploy in real-world applications, since predictions can be made from single sentences.
Context-dependent modelling requires the use of question-answer pairs, where each sample consists of a question asked and its corresponding answer.
Sequence modelling models only the patient’s responses in succession, without knowledge of the particular question asked.
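The three settings above differ mainly in what counts as one training sample. The following is a minimal sketch of that difference, assuming a hypothetical transcript format of (speaker, utterance) turns with the speaker names "interviewer" and "participant"; the function names and the toy transcript are illustrative and not taken from the dataset.

```python
from typing import List, Tuple

# Hypothetical transcript format: (speaker, utterance) turns in interview order.
Turn = Tuple[str, str]


def context_free(turns: List[Turn]) -> List[str]:
    """Context-free: every participant response is an independent sample."""
    return [text for speaker, text in turns if speaker == "participant"]


def context_dependent(turns: List[Turn]) -> List[Tuple[str, str]]:
    """Context-dependent: pair each response with the most recent question."""
    pairs = []
    question = None
    for speaker, text in turns:
        if speaker == "interviewer":
            question = text
        elif question is not None:
            pairs.append((question, text))
    return pairs


def sequence(turns: List[Turn]) -> List[List[str]]:
    """Sequence: one sample per interview, holding all responses in order."""
    return [context_free(turns)]


# Toy transcript, made up for illustration only.
toy_turns = [
    ("interviewer", "where are you from"),
    ("participant", "los angeles"),
    ("interviewer", "how are you doing today"),
    ("participant", "i'm okay"),
]
samples_per_setting = {
    "context_free": context_free(toy_turns),
    "context_dependent": context_dependent(toy_turns),
    "sequence": sequence(toy_turns),
}
```

In this sketch the sequence setting yields a single sample per interview, which is why the questions can be dropped without losing the order of the responses.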
In previous text-based work, word embeddings are usually trained from scratch. However, since depression data is hard to come by, using a model pretrained on larger datasets unrelated to depression detection could help alleviate this problem. In this work, we show that the use of pretrained embeddings can lead to substantial performance gains. Standard Word2Vec models are usually trained with a shallow, two-layer neural network architecture. While Word2Vec aims to capture the context of a specific sentence, it only considers the surrounding words as its training input and therefore does not capture the intrinsic meaning of a sentence. Recently, alternatives to Word2Vec have become popular, specifically context-dependent sentence embeddings such as ELMo and, more recently, BERT. ELMo generates embeddings for a word based on the context it appears in, thus producing slight variations for each word occurrence. Consequently, ELMo must be fed an entire sentence before generating an embedding. BERT similarly models sentences as vectors. BERT is currently considered to perform at a state-of-the-art level on many natural language processing (NLP) tasks.
In our current work, raw text was first preprocessed: trailing blanks were removed and every letter was set to lowercase. Meta information such as <laughter> or <sigh> is possibly helpful to the model and was thus not removed. Three different text embeddings were investigated: Word2Vec, ELMo and BERT:
Word2Vec Word2Vec embeddings were obtained with hyperparameters identical to those in previous work.
ELMo ELMo uses a three-layer bidirectional structure. We used the average of all three layer embeddings as our sentence representation.
BERT An embedding can be extracted from each of the twelve layers. Here, the penultimate layer was used to extract a sentence embedding. Instead of finetuning the BERT or ELMo models, we directly extracted embeddings from the publicly available models.
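The text preprocessing described at the start of this subsection is simple enough to state in a few lines. The sketch below is only an illustration of the two stated steps (removing trailing blanks and lowercasing); the function name is hypothetical, and the example sentence is made up.

```python
def preprocess(line: str) -> str:
    """Remove trailing blanks and lowercase the text.

    Meta tokens such as <laughter> or <sigh> are deliberately left in place.
    """
    return line.rstrip().lower()


cleaned = preprocess("I am doing OK <laughter>   ")
```

Because no token is filtered out, meta information such as <laughter> reaches the embedding model unchanged apart from casing.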
2.3 Model Description
|Gauss-Staircase ||GloVe (Fusion)||Context-Dep||-||-||-||0.84||3.34||4.46|
Table 1: Evaluation results of the proposed text-based attention models (bottom) compared to previous text-based (top) and multi-modal (middle) approaches.
As previously stated, two labels are provided for each participant. Prior work on the DAIC-WOZ dataset usually treats depression presence detection (binary classification) and severity score prediction (regression on the PHQ-8 score) as separate tasks. A few studies investigate both tasks, but still treat the two separately, reporting one classification model and one severity prediction model. However, as seen in Section 2.1, the two characteristics are correlated but one cannot necessarily predict the other. Hence, both information sources are important in order to ascertain whether the patient is ill. We thus propose a multi-task setting that combines the classification and regression tasks. Two outputs were constructed: one directly predicts the binary outcome of a participant being depressed, the other outputs the estimated PHQ-8 score.
For the multi-task loss (see Equation 3), we opt to use a combination of binary cross entropy (for classification, Equation 1) and Huber loss (for regression, Equation 2). The binary model output is passed through a sigmoid function and compared with the binary ground truth, while the regressive model output is compared with the PHQ-8 score. The Huber loss can be seen as a compromise between mean absolute error (MAE, L1) and mean squared error (MSE, L2), resulting in robust behaviour towards outliers. Both losses are summed and backpropagated during training.
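To make the combination of the two losses concrete, the following is a minimal per-sample sketch in plain Python. It assumes an unweighted sum of the two terms and a Huber threshold of delta = 1.0; neither the threshold nor the variable names are taken from the paper, and the released source code may differ.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a raw model output to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(logit: float, label: int) -> float:
    """Cross entropy between sigmoid(logit) and a 0/1 ground-truth label."""
    p = sigmoid(logit)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))


def huber(prediction: float, target: float, delta: float = 1.0) -> float:
    """Quadratic for errors up to delta, linear beyond (delta is an assumed value)."""
    error = abs(prediction - target)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)


def multi_task_loss(logit: float, phq8_pred: float, label: int, phq8: float) -> float:
    """Unweighted sum of the classification and regression losses."""
    return binary_cross_entropy(logit, label) + huber(phq8_pred, phq8)


# Hypothetical sample: undecided classifier, PHQ-8 prediction off by two points.
loss = multi_task_loss(logit=0.0, phq8_pred=12.0, label=1, phq8=10.0)
```

The linear branch of the Huber term is what limits the influence of participants whose PHQ-8 score is far from the prediction, in contrast to a purely squared error.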
Previous text-based work relied solely on the last time step as the response/query representation, further referred to as time pooling. However, it has been shown that time pooling is sub-optimal, since the network belief changes over time. We therefore exclusively use attention as our time-representation function. A simple per-time-step attention mechanism is utilized in this work (Equation 4): a learned, time-independent parameter vector scores the output of the concatenated BLSTM model at each time step of the input sequence, and the resulting attention weights are used to form a weighted average representation of the whole sequence.
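A per-time-step attention pooling of this kind can be sketched as follows. The sketch assumes that the scores are dot products between the parameter vector and each output vector and that they are normalized with a softmax over time, which is the common formulation; the names `outputs` and `w` are illustrative and the exact form of Equation 4 may differ.

```python
import math
from typing import List, Tuple


def attention_pool(outputs: List[List[float]], w: List[float]) -> Tuple[List[float], List[float]]:
    """Return (attention weights, weighted average) for a sequence of vectors.

    outputs: one output vector per time step (e.g. BLSTM outputs).
    w: time-independent scoring vector with the same dimension.
    """
    # Score every time step with the same parameter vector.
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in outputs]
    # Softmax over time; subtracting the maximum keeps exp() numerically stable.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted average of the output vectors.
    dim = len(outputs[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, outputs)) for d in range(dim)]
    return alphas, pooled


# Toy example: two time steps, two-dimensional outputs.
weights, representation = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With an all-zero scoring vector every time step receives the same weight, so the pooling reduces to mean pooling; a trained vector shifts the weight towards the time steps it scores highly.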
In addition to the novel multi-task approach and the attention pooling method stated above, the proposed architecture is a commonly used bidirectional long short-term memory (BLSTM) recurrent neural network structure (see Table 2). After each BLSTM layer we apply recurrent dropout with a probability of 10 %. In sparse data scenarios such as depression detection, gated recurrent unit (GRU) networks are generally seen as a well-performing alternative to LSTM networks. In this work, we internally ran GRU networks but did not observe a performance improvement, and therefore exclusively used LSTM networks. The source code is publicly available at www.github.com/richermans/text_based_depression.
The input data was normalized before training: the mean and variance of the training subset were calculated and subsequently applied to the development dataset. The models were trained with Adam optimization for at most 200 epochs. The learning rate was reduced by a constant factor whenever the cross-validation loss stopped improving for a fixed number of epochs. Once the learning rate fell below a minimum value, training was terminated and the model producing the lowest error on the development set was chosen for evaluation. Regarding data handling, padding was avoided by choosing a batch size of 1. Moreover, random oversampling of the minority class (depressed) was utilized in order to circumvent data sparsity problems. Furthermore, recurrent weights were initialized with the Xavier uniform method, in which samples are drawn from a uniform distribution whose range depends on the layer dimensions, and biases were set to zero.
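Two of the training details above, random oversampling and Xavier uniform initialization, can be sketched in a few lines. The sketch assumes that oversampling duplicates minority samples until both classes are equally frequent, which the text does not state, and it uses the standard Xavier bound sqrt(6 / (fan_in + fan_out)); all function names and the seed are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple


def oversample_minority(samples: Sequence[Tuple[object, int]], seed: int = 0) -> List[Tuple[object, int]]:
    """Randomly duplicate minority-class samples until both classes are balanced.

    samples: (features, label) pairs with label 0 (healthy) or 1 (depressed).
    Balancing to equal class counts is an assumption of this sketch.
    """
    rng = random.Random(seed)
    healthy = [s for s in samples if s[1] == 0]
    depressed = [s for s in samples if s[1] == 1]
    minority, majority = sorted((healthy, depressed), key=len)
    if not minority:
        return list(samples)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(samples) + extra


def xavier_uniform(fan_in: int, fan_out: int, rng: random.Random) -> List[List[float]]:
    """Draw a fan_out x fan_in weight matrix from U(-bound, bound)."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]


# Class counts of the training subset: 30 depressed and 77 healthy speakers.
train = [("d%d" % i, 1) for i in range(30)] + [("h%d" % i, 0) for i in range(77)]
balanced = oversample_minority(train)
weights = xavier_uniform(fan_in=4, fan_out=3, rng=random.Random(0))
```

Oversampling is applied to the training subset only, so the development subset keeps its original class distribution for evaluation.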
For classification, macro-averaged precision and recall scores are used to calculate the average F1 score. In terms of regression, the mean absolute error (MAE) and root mean squared error (RMSE) between the model prediction and the ground truth PHQ-8 score are used.
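The evaluation metrics can be sketched as follows, assuming the classification score is the macro-averaged F1. The sketch averages the per-class F1 scores, which is one common convention; computing F1 from the macro-averaged precision and recall instead would give slightly different numbers, and the text does not say which variant is used. The labels and scores in the example are made up.

```python
import math
from typing import List, Tuple


def macro_prf(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall and F1 over the classes 0 and 1."""
    precisions, recalls, f1s = [], [], []
    for c in (0, 1):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return sum(precisions) / 2, sum(recalls) / 2, sum(f1s) / 2


def mae(y_true: List[float], y_pred: List[float]) -> float:
    """Mean absolute error between PHQ-8 scores and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: List[float], y_pred: List[float]) -> float:
    """Root mean squared error between PHQ-8 scores and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up labels and scores for illustration.
precision, recall, f1 = macro_prf([1, 1, 0, 0], [1, 0, 0, 0])
errors = (mae([10.0, 4.0], [12.0, 1.0]), rmse([10.0, 4.0], [12.0, 1.0]))
```

Macro averaging weights the depressed and healthy classes equally, which matters here because the depressed class is the minority.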
Since the available amount of data can be considered insufficient, the results are often not directly reproducible. In order to somewhat circumvent this problem, previous work proposed a grid search over every possible hyperparameter in order to ascertain a proper configuration. However, in our experience reproducibility cannot be guaranteed even when random seeds and hyperparameters are fixed. We therefore report the best-performing model, following the tradition of many previous studies. Our proposed setting can thus be seen as the optimal configuration for our experiments.
In this work we compared our sequence modelling approach to previous context-free and context-dependent approaches. The results of our models can be seen in Table 1. Fusion scoring refers to the mean score fusion of the ELMo and BERT models. Our sequence model with pretrained text embeddings, either ELMo or BERT, achieves a mean score of 0.87, outperforming other text-based approaches and even multi-modal approaches. As our experiments indicate, Word2Vec largely underperforms compared to the ELMo and BERT approaches. A possible reason is the limited dataset size, which may prevent attention from picking up meaningful text information.
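Mean score fusion, as mentioned above, can be read as averaging the outputs of the two models per participant. The sketch below assumes exactly that, applied to whatever score is being fused (for example the predicted depression probability or the predicted PHQ-8 score); the function name and the numbers are illustrative only.

```python
from typing import List


def mean_score_fusion(elmo_scores: List[float], bert_scores: List[float]) -> List[float]:
    """Average the per-participant scores of two models."""
    if len(elmo_scores) != len(bert_scores):
        raise ValueError("both models must score the same participants")
    return [(e + b) / 2.0 for e, b in zip(elmo_scores, bert_scores)]


# Made-up per-participant scores from two hypothetical models.
fused = mean_score_fusion([0.25, 0.75], [0.75, 0.25])
```

Averaging at the score level leaves both models untouched, so no joint retraining is needed for the fused result.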
The attention mechanism was deliberately chosen, since the attention weights over time can serve as interpretations of which sentences or words trigger the model to predict whether a patient is depressed. The attention weights over time for each speaker are visualized in Figure 2.
It can be seen that attention in the context of Word2Vec training behaves similarly to mean pooling. In contrast, ELMo and BERT features exhibit robust and strong performance. For those two features, we observe that for many depressed patients attention spikes at the first or second response (see the first two rows in Figure 2). Within the first responses the participant usually states his/her heritage or current residence. This is a potential indicator that the model learned to correlate places with depression; e.g., living in a metropolitan region might exert an influence on residents’ mood and mental state. We further investigated whether the training dataset reveals a patient’s mental status given his/her heritage, but no such clue could be found.
|5||except meeting that one woman||mhm|
|6||or leaving my comfort||i’m okay|
|7||doing a little bit of socializing||so|
|8||feel a little||but um|
|9||putting away more money before i retire||uh|
|10||the hardest decision||hmm|
We deliberately chose the attention mechanism for this work in order to visualize our model’s belief by searching for the sentences most likely to trigger a depression prediction. These sentences were extracted by finding all peaks in an attention-weight sequence. Specifically, in order to remove insignificant sentences, only peaks with a height of at least 80% of the maximum attention weight were considered in this search. An overview of all important sentences can be seen in Table 3. The results show that both features, ELMo and BERT, focus on short, nondescript words such as ‘um’ as well as positive answers such as ‘yeah’ and ‘yes’. Interestingly, attention seldom focuses on sentences with meaningful content, such as previous traumatic experiences or sentences with an inherently negative connotation. Moreover, our proposed models are decisive in nature, meaning that for most depressed patients the models heavily stress single, specific sentences and neglect the majority of patient responses. The more remarkable result is that the model is trained purely on text data and thus never actually heard those words.
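The peak search described above can be sketched as a scan for local maxima followed by a relative height filter. The sketch assumes that a peak is a response whose weight is strictly larger than both neighbours and that the first and last response may also count as peaks; these details and the function name are assumptions, not taken from the released code.

```python
from typing import List


def significant_peaks(weights: List[float], rel_height: float = 0.8) -> List[int]:
    """Indices of local maxima that reach rel_height times the largest weight.

    weights: attention weights over the responses of one participant.
    Plateaus (equal neighbouring weights) are not reported by this sketch.
    """
    if not weights:
        return []
    threshold = rel_height * max(weights)
    peaks = []
    for i, w in enumerate(weights):
        left = weights[i - 1] if i > 0 else float("-inf")
        right = weights[i + 1] if i < len(weights) - 1 else float("-inf")
        if w > left and w > right and w >= threshold:
            peaks.append(i)
    return peaks


# Made-up attention weights with a spike at the first response.
toy_peaks = significant_peaks([0.5, 0.05, 0.1, 0.05, 0.3])
```

The indices returned by such a search would then be mapped back to the corresponding responses in the transcript to obtain the sentences listed in a table like Table 3.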
This work proposed the use of multi-task modelling in conjunction with pretrained sentence embeddings, namely ELMo and BERT, for text-based depression modelling. Analysis of the ELMo and BERT models revealed a correlation between short, interpersonal sounds such as ‘um’ and the model performance, possibly indicating that in order to detect depression, one should focus on behavioural aspects of text and not necessarily on content. Furthermore, the proposed models often emphasize the first couple of responses of the patient and decide the mental state according to them, rather than being indecisive.
Our proposed BLSTM model outperforms previous single-model approaches in terms of classification scores (see Table 1). In terms of regression, our best model, using ELMo features, achieves the lowest mean absolute error in its class for sequential depression modelling.
-  J. R. Williamson, E. Godoy, M. Cha, A. Schwarzentruber, P. Khorrami, Y. Gwon, H.-T. Kung, C. Dagli, and T. F. Quatieri, “Detecting depression using vocal, facial and semantic communication cues,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, ser. AVEC ’16. New York, NY, USA: ACM, 2016, pp. 11–18. [Online]. Available: http://doi.acm.org/10.1145/2988257.2988263
-  A. Haque, M. Guo, A. S. Miner, and L. Fei-Fei, “Measuring Depression Symptom Severity from Spoken Language and 3D Facial Expressions,” nov 2018. [Online]. Available: http://arxiv.org/abs/1811.08592
-  T. Alhanai, M. Ghassemi, and J. Glass, “Detecting Depression with Audio/Text Sequence Modeling of Interviews,” Tech. Rep. [Online]. Available: https://github.com/talhanai/
-  M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proc. of NAACL, 2018.
-  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
-  J. Gratch, R. Artstein, G. Lucas, G. Stratou, S. Scherer, A. Nazarian, R. Wood, J. Boberg, D. Devault, S. Marsella, D. Traum, and S. Rizzo, “The distress analysis interview corpus of human and computer interviews.”
-  D. DeVault, R. Artstein, G. Benn, T. Dey, E. Fast, A. Gainer, K. Georgila, J. Gratch, A. Hartholt, M. Lhommet, G. Lucas, S. Marsella, F. Morbini, A. Nazarian, S. Scherer, G. Stratou, A. Suri, D. Traum, R. Wood, Y. Xu, A. Rizzo, and L.-P. Morency, “Simsensei kiosk: A virtual human interviewer for healthcare decision support,” in Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, ser. AAMAS ’14. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2014, pp. 1061–1068. [Online]. Available: http://dl.acm.org/citation.cfm?id=2615731.2617415
-  F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic, “Avec 2017: Real-life depression, and affect recognition workshop and challenge,” in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, ser. AVEC ’17. New York, NY, USA: ACM, 2017, pp. 3–9. [Online]. Available: http://doi.acm.org/10.1145/3133944.3133953
-  K. Kroenke, T. Strine, R. Spitzer, J. Williams, J. Berry, and A. Mokdad, “The phq-8 as a measure of current depression in the general population,” Journal of Affective Disorders, vol. 114, no. 1-3, pp. 163–173, 4 2009.
-  T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
-  R. Řehůřek and P. Sojka, “Software Framework for Topic Modelling with Large Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA, May 2010, pp. 45–50, http://is.muni.cz/publication/884893/en.
-  L. Yang, D. Jiang, L. He, E. Pei, M. C. Oveneke, and H. Sahli, “Decision tree based depression classification from audio video and language information,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, ser. AVEC ’16. New York, NY, USA: ACM, 2016, pp. 89–96. [Online]. Available: http://doi.acm.org/10.1145/2988257.2988269
-  S. Scherer, Z. Hammal, Y. Yang, L.-P. Morency, and J. F. Cohn, “Dyadic behavior analysis in depression severity assessment interviews,” in Proceedings of the 16th International Conference on Multimodal Interaction, ser. ICMI ’14. New York, NY, USA: ACM, 2014, pp. 112–119. [Online]. Available: http://doi.acm.org/10.1145/2663204.2663238
-  S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic Speech Emotion Recognition Using Recurrent Neural Networks with Local Attention,” 2017.