The Verbal and Non Verbal Signals of Depression -- Combining Acoustics, Text and Visuals for Estimating Depression Level

04/02/2019, by Syed Arbaaz Qureshi et al.

Depression is a serious medical condition that affects a large number of people around the world. It significantly affects the way one feels, causing a persistent lowering of mood. In this paper, we propose a novel attention-based deep neural network which facilitates the fusion of various modalities. We use this network to regress the depression level. Acoustic, text and visual modalities have been used to train our proposed network. Various experiments have been carried out on the benchmark dataset, namely, the Distress Analysis Interview Corpus - a Wizard of Oz (DAIC-WOZ). From the results, we empirically show that the fusion of all three modalities gives the most accurate estimation of depression level. Our proposed approach outperforms the state-of-the-art by 7.17% on root mean squared error (RMSE) and 8.08% on mean absolute error (MAE).






1 Introduction

Depression is a common and serious medical illness that negatively affects how one feels. It is characterized by persistent sadness, loss of interest and an inability to carry out activities that one normally enjoys. It is the leading cause of ill health and disability worldwide. More than 300 million people are now living with depression, an increase of more than 18% between 2005 and 2015 (a statistic reported by the World Health Organization).

Depression lasts between 4 and 8 months on average; its symptoms and side effects include insomnia, weight loss, fatigue, feelings of worthlessness, drug or alcohol abuse, and an impaired ability to think, concentrate and make decisions. In extreme cases, it may also be characterized by thoughts of death, suicide and suicide attempts. Tragically, the annual number of deaths due to depression is on the rise (a study by Hannah Ritchie and Max Roser in 2018).

The causes of depression are not completely known and may not stem from a single source. Major depressive disorder is likely due to a complex combination of factors such as genetics, psychology, and the social surroundings of the sufferer. People who have experienced life events such as divorce or the death of a family member or friend, people who have personality issues such as an inability to deal with failure and rejection, people with previous episodes of major depression, and people with childhood trauma are at a higher risk of depression Beck and Alford (2009).

Depression detection is a challenging problem as many of its symptoms are covert. Since depressed people socialize less, detection becomes even more difficult. Today, for the correct diagnosis of depression, a patient is evaluated using standard questionnaires. In the literature, different tools for screening depression have been proposed, such as the Personal Health Questionnaire Depression Scale (PHQ), the Hamilton Depression Rating Scale (HDRS), the Beck Depression Inventory (BDI), the Center for Epidemiologic Studies Depression Scale (CES-D), the Hospital Anxiety and Depression Scale (HADS), and the Montgomery and Asberg Depression Rating Scale (MADRS), the latter recommended by the French Haute Autorité de la Santé. In particular, the eight-item PHQ-8 Kroenke et al. (2008) is established as a valid diagnostic and severity measure for depressive disorders in many clinical studies Kroenke (2012).

The steadily increasing global burden of depression and mental illness acts as an impetus for the development of more advanced, personalized and automatic technologies that aid in its detection. Affective computing is one field of research which focuses on gathering data from faces, voices and body language to measure human emotion. An important business goal of affective computing is to build human-computer interfaces that can detect and appropriately respond to an end user’s state of mind. As a consequence, techniques from affective computing have been applied for the automatic detection of depression Scherer et al. (2016); Morales et al. (2018).

In this paper, we introduce an attention-based neural network for the fusion of the acoustic, text and visual modalities. In particular, we encode seven modalities (two acoustic, one text and four visual). Different combinations of the acoustic, text and visual modality encodings are fed to the network to obtain fused vectors. These fused vectors are then passed to a deep regression network to predict the severity of depression based on a PHQ-8 scale. From our experiments, we show that:

  • the fusion of all modalities (acoustic, text, visual) helps in better estimation of depression level, compared to any other combination,

  • our approach outperforms the previous state-of-the-art by 7.17% on root mean squared error (RMSE) and 8.08% on mean absolute error (MAE), and

  • the verbal input plays a predominant role in the regression process, confirming therapists’ experience.

The remainder of this paper is organized as follows. In section 2, we present the state-of-the-art approaches for the estimation of depression level. We then describe our methodology in section 3. This is followed by a brief overview of the multimodal DAIC-WOZ dataset used to benchmark our method, in section 4. Experiments, experimental settings and results are described in section 5. Section 6 concludes the paper.

2 Related work

Over the last few years, a great deal of research in Computer Science has been devoted to mental health disorders Andersson and Titov (2014); Dewan et al. (2015). Within this context, the automatic detection of depression has received major focus. Some initial initiatives have targeted the understanding of relevant descriptors that could be used in machine learning frameworks.

Scherer et al. (2013) investigate the capabilities of automatic non-verbal behavior descriptors to identify indicators of psychological disorders such as depression. In particular, they propose four descriptors that can be automatically estimated from visual signals: downward angling of the head, eye gaze, duration and intensity of smiles, and self-touches. Chatterjee et al. (2014) study the role of multiple context-based heart-rate variability descriptors for evaluating a person's psychological health. Cummins et al. (2015) focus on how common paralinguistic speech characteristics (prosodic, source, formant and spectral features) are affected by depression and suicidality, and on the application of this information in classification and prediction systems. Morales and Levitan (2016) argue that researchers should look beyond the acoustic properties of speech by building features that capture syntactic structure and semantic content. Within this context, Wolohan et al. (2018) show that lexical models are reasonably robust and well suited for a role in diagnosing or monitoring depression. Other interesting lines of work using text features include the study of social media De Choudhury et al. (2013); Hovy et al. (2017), in some cases using corpora specifically built for such tasks Losada and Crestani (2016).

Another promising research trend aims at leveraging all modalities in one learning model and is commonly called multimodal depression detection Morales (2018). Within this context, a great deal of successful research studies have been proposed. He et al. (2015) evaluate feature fusion and model fusion strategies via local linear regression to improve accuracy on the BDI score using visual and acoustic cues.

Dibeklioğlu et al. (2015) compare facial movement dynamics, head movement dynamics, and vocal prosody, individually and in combination, and show that multimodal measures afford the most powerful detection. Yang et al. (2016) achieve satisfying results on the benchmark DAIC-WOZ dataset (Distress Analysis Interview Corpus - a Wizard of Oz), estimating the PHQ-8 score by fusing audio, visual and text features with decision-tree classification. More recently, Morales et al. (2018); Morales (2018) propose an extensive study of fusion techniques (early, late and hybrid) for depression detection, combining audio, visual and text (especially syntactic) features through SVMs. In particular, they show that the syntax-informed fusion approach is able to leverage syntactic information to target more informative aspects of the speech signal, although the overall results suggest that there is no statistical evidence for this finding.

In this paper, we propose an early fusion strategy using neural networks, to combine acoustic, visual and text modalities. For that purpose, different combinations of modalities are fed to an attention-based neural network to obtain fused vectors. These fused vectors are then passed to a deep regression network to predict the severity of depression based on a PHQ-8 scale over the benchmark dataset, DAIC-WOZ. The main motivation of our work is to automatically learn the significance of a given modality as each one may not have the same discriminative characteristics. For that purpose, attention-based neural networks are particularly suitable models, which is confirmed by the overall results obtained in this paper. To the best of our knowledge, this is the first attempt in that direction.

3 Methodology

Our proposed architecture (CombAtt) consists of three main components: (1) modality encoders, which take unimodal features as input, and output modality encodings, (2) the fusion subnetwork that fuses the individual modalities, and (3) the regression subnetwork that outputs the estimated PHQ-8 score, conditioned on the output of the fusion subnetwork.

Let TSD = {TSD1, TSD2, …, TSDm} be a set of m modalities of time series data. Each element of TSD is a two-dimensional matrix whose rows are time-stamped feature vectors. The CombAtt network takes the elements of TSD as its input. The following subsections describe the individual components of the CombAtt network in detail.

3.1 Modality encoders

There are m encoders in the CombAtt network, one per modality of time series data. Each of them encodes a modality into an encoding vector. Let MEk ∈ {ME1, ME2, …, MEm} be an encoder network, and let Wk ∈ {W1, W2, …, Wm} be the set of parameters of the network MEk. Let TSDk ∈ TSD be the input to MEk. The encoding Ek ∈ {E1, E2, …, Em} (= E) of the input TSDk, obtained from MEk, is given by equation 1:

Ek = MEk(TSDk; Wk)    (1)

To encode time series data, we use an LSTM network with a forget gate Gers et al. (1999) as the recurrent unit, because of its robustness in capturing long sequences. The formulation of our LSTM network is given in the appendix, in section 2. The output from the LSTM layer acts as the encoding vector for a given modality. These modality encoding vectors are then fed to the fusion network.
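As a concrete (and deliberately simplified) illustration of such an encoder, the sketch below runs a single numpy LSTM cell with a forget gate over a time series and keeps the final hidden state as the modality encoding. All sizes, weights and names here are illustrative placeholders, not the parameters used by CombAtt.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(X, W, U, b, hidden):
    """Run a single-layer LSTM with a forget gate over a (T, d) sequence
    and return the final hidden state as the modality encoding."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x_t in X:
        z = W @ x_t + U @ h + b                       # stacked pre-activations
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
        c = f * c + i * np.tanh(g)                    # cell state update
        h = o * np.tanh(c)                            # hidden state
    return h

rng = np.random.default_rng(0)
T, d, hidden = 50, 10, 16                   # illustrative sizes
X = rng.normal(size=(T, d))                 # one time series modality
W = rng.normal(scale=0.1, size=(4 * hidden, d))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

encoding = lstm_encode(X, W, U, b, hidden)  # fixed-length encoding E_k
```

The point of the sketch is only the shape change: a variable-length (T, d) sequence becomes one fixed-length vector that the fusion subnetwork can consume.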

3.2 Fusion subnetwork

The fusion subnetwork is composed of the tensor fusion layer, which disentangles unimodal and bimodal dynamics, and the attention fusion subnetwork, in which the attention mechanism allows to automatically weight modalities. Both components are described in the following sections.

3.2.1 Tensor fusion layer

The tensor fusion layer (TFL) disentangles unimodal and bimodal dynamics by modeling each of them explicitly (see figure 1). So, if Ei (an li-D vector) and Ej (an lj-D vector) are two input encoding vectors to the tensor fusion layer, the output TFL(Ei, Ej) is given by the following equation:

TFL(Ei, Ej) = [Ei; 1] ⊗ [Ej; 1]

where ⊗ is the outer product of two vectors. The constant 1 is appended to each of the input vectors Ei and Ej so that the individual features of Ei and Ej are retained alongside their interactions. TFL(Ei, Ej) is a (li+1)×(lj+1)-dimensional matrix, which is flattened into a ((li+1)(lj+1))-D vector. Since the tensor fusion layer is mathematically defined, it has no learnable parameters.

Figure 1: Tensor fusion layer.
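Since the tensor fusion layer has no learnable parameters, it can be sketched in a few lines of numpy. The dimensions below are the facial landmark / head pose encoding sizes from section 5.2.2, used purely as an illustration.

```python
import numpy as np

def tensor_fusion(e_i, e_j):
    """Tensor fusion layer: append a constant 1 to each encoding, take the
    outer product, and flatten. The appended 1s keep the unimodal terms
    alongside the bimodal interaction terms. No learnable parameters."""
    a = np.append(e_i, 1.0)            # (l_i + 1,)
    b = np.append(e_j, 1.0)            # (l_j + 1,)
    return np.outer(a, b).reshape(-1)  # ((l_i + 1) * (l_j + 1),)

e_i = np.random.randn(256)   # e.g. a facial landmark encoding
e_j = np.random.randn(5)     # e.g. a head pose encoding
fused = tensor_fusion(e_i, e_j)
# (256 + 1) * (5 + 1) = 1542 dimensions, matching TFO_FLHP in section 5.2.2
```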

Within CombAtt, we define a set of pairs of input encodings TFI = {(TFI11, TFI12), (TFI21, TFI22), …, (TFIr1, TFIr2)}, where TFIS = {TFI11, TFI12, TFI21, TFI22, …, TFIr1, TFIr2} is a subset of E. We feed the pairs of TFI to TFL, and obtain the output set of bimodal encodings TFO = {TFO1, TFO2, …, TFOr}, where TFOk = TFL(TFIk1, TFIk2). We define TFNS = E \ TFIS, the set difference of E and TFIS. The set of input encoding vectors to the attention fusion subnetwork is defined as MV = {MV1, MV2, …, MVs} = TFO ∪ TFNS.

3.2.2 Attention fusion subnetwork

By using an attention mechanism, the network “attends” to the most relevant parts of the input to generate the output. Networks with an attention mechanism usually perform better than their counterparts without attention. As not all modalities are equally relevant for the estimation of depression level, this motivates the introduction of an attention fusion subnetwork, as an extension of the work of Poria et al. (2017).

The attention fusion subnetwork is shown in Figure 2. The input to the attention fusion subnetwork is MV, where the dimensionality of a vector MVk ∈ MV is dk. The first step of the architecture consists in giving the same dimension d to all the elements of MV. This is done using a stack of one or more dense layers. The resultant vectors are denoted by DEMV = {DEMV1, DEMV2, …, DEMVs}. Then, all the elements of DEMV are concatenated vertically into a vector V and passed through a deep regression network, called the attention generation subnetwork and denoted Natt, which outputs a vector of attention values f ∈ R^(s×1). Our attention generation subnetwork differs from Poria et al. (2017) in that we let the deep regression network decide its parameters without any constraint for the generation of attention values. So, letting Watt be the parameters of the attention generation subnetwork, Natt is defined in equation 3:

f = Natt(V; Watt)    (3)

In parallel, let H = [DEMV1 DEMV2 … DEMVs] ∈ R^(d×s) be the matrix obtained after the horizontal concatenation of the vectors in the set DEMV. The fusion of the elements of DEMV together with the attention values is performed as in equation 4:

F = H f    (4)

F ∈ R^(d×1) is the fusion vector that we feed to the PHQ-8 score regression subnetwork for the estimation of the depression level of a given patient.

Figure 2: Attention fusion subnetwork.
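A minimal numpy sketch of this fusion step, with the attention generation subnetwork reduced to one tanh layer followed by a softmax (the shape used in section 5.2.3). The weights here are random placeholders standing in for trained parameters.

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_fuse(demv, w1, w2):
    """Attention fusion: concatenate the s dimension-matched vectors, pass
    them through a small network to get s attention values, then return
    the attention-weighted combination F = H f."""
    V = np.concatenate(demv)          # vertical concatenation, (s*d,)
    hidden = np.tanh(w1 @ V)          # attention generation subnetwork
    f = softmax(w2 @ hidden)          # one attention value per modality
    H = np.stack(demv, axis=1)        # horizontal concatenation, (d, s)
    return H @ f, f                   # fusion vector F (d,), weights f

rng = np.random.default_rng(1)
s, d, hid = 4, 450, 300               # sizes from section 5.2.3
demv = [rng.normal(size=d) for _ in range(s)]
w1 = rng.normal(scale=0.01, size=(hid, s * d))
w2 = rng.normal(scale=0.01, size=(s, hid))
F, f = attention_fuse(demv, w1, w2)   # f sums to 1 across modalities
```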

3.3 PHQ-8 Score regression subnetwork

The PHQ-8 score regression subnetwork is a deep regression network conditioned on the fusion vector F. The output F of the attention fusion subnetwork is first concatenated to the gender of the patient (gender plays an important role in the classification of depression Albert (2015)) and then fed to a few dense and dropout layers. The resultant vector is finally fed to a linear regression unit, which outputs the PHQ-8 score. So, letting the PHQ-8 score regression subnetwork be denoted by NR and its parameters by WR, the PHQ-8 score is estimated as in equation 5:

PHQ-8 score = NR([F; gender]; WR)    (5)
In the following section, we present the gold standard DAIC-WOZ dataset that we use to perform our experiments.

4 DAIC-WOZ depression dataset

The DAIC-WOZ depression dataset is part of a larger corpus, the Distress Analysis Interview Corpus Gratch et al. (2014), that contains clinical interviews designed to support the diagnosis of psychological distress conditions such as anxiety, depression, and post-traumatic stress disorder. These interviews were collected as part of a larger effort to create a computer agent that interviews people and identifies verbal and non-verbal indicators of mental illness. The data collected include audio and video recordings, and extensive questionnaire responses from the interviews conducted by an animated virtual interviewer called Ellie, controlled by a human interviewer in another room. The data has been transcribed and annotated for a variety of verbal and non-verbal features.

The dataset contains 189 interview sessions. We discarded a few interviews, as some of them were incomplete and others had interruptions. Each interview is identified by a unique ID. Each interview session contains: a raw audio file of the session; a file with the coordinates of 68 facial landmarks of the participant; a file with HoG (Histogram of Oriented Gradients) features of the face; two files with head pose and eye gaze features of the participant, recorded over the entire duration of the interview using the OpenFace framework Baltrušaitis et al. (2016); a file with the continuous facial action units of the participant's face, extracted using the facial action coding software CERT Littlewort et al. (2011); the COVAREP and formant feature files of the participant's voice, extracted using the COVAREP framework Degottex et al. (2014); and a transcript file of the interview. All the features, except the transcript, are time series data. A tabular description, showing some statistics of this dataset, is given in the appendix, in section 1.

Each row of the facial landmark file comprises the time stamp, confidence, detection success flag, X, Y and Z coordinates of each of the 68 facial landmarks. Each row of the head pose file comprises time stamp, confidence, detection success flag, Rx, Ry, Rz, Tx, Ty and Tz. Rx, Ry and Rz are the head rotation coordinates (measured in radians), and Tx, Ty and Tz are the head position coordinates (measured in millimetres). The eye gaze feature file has rows that contain time stamp, confidence, detection success flag, x0, y0, z0, x1, y1, z1, xh0, yh0, zh0, xh1, yh1 and zh1. The gaze is represented by 4 vectors. The first two vectors (x0, y0, z0 and x1, y1, z1) describe the gaze direction of both the eyes. The second two vectors (xh0, yh0, zh0 and xh1, yh1, zh1) describe gaze in head coordinate space (if the eyes are rolled up, the vectors indicate ’up’ even if the head is turned or tilted). Each row of the action units file comprises the time stamp, confidence, detection success flag, and a few real numbers indicating the facial action unit. The data in these files were recorded at a frequency of 30Hz.

The COVAREP and formant feature files contain time series data. Each row of these files comprises 74 and 5 real numbers respectively, representing various features of the participant’s voice or the virtual interviewer’s voice. Both the features were recorded at 100Hz frequency. One of the features in both the files is a flag named VUV (Voiced/Unvoiced), which denotes whether that segment is voiced or not. The DAIC-WOZ depression dataset manual advises not to use those rows whose VUV flag values are 0.

The transcript file contains the sentences spoken by Ellie and the participant. Each row of the file comprises start time, the time at which the speaker starts speaking, stop time, the time at which the speaker stops speaking, speaker, denoting whether the speaker is Ellie or the participant, and value, the exact sentence spoken by the speaker.

The training, development and test split files are provided with the dataset. The training and development split files comprise interview IDs, PHQ-8 binary labels, PHQ-8 scores, participant gender, and single responses to every question of the PHQ-8 questionnaire. The test split file comprises interview IDs and participant gender.

5 Experiments and results

This section is divided into three subsections: data preprocessing, experimental setup, and experimental results.

5.1 Data preprocessing

Different preprocessing techniques have been applied to different data modalities. We list them in the following subsections.

5.1.1 Visual modalities preprocessing

In the facial data modality, for every facial representation, we subtract the mean value of the Z-coordinate from the Z-coordinates of all the points. This removes the bias along the Z-axis. Then, we normalize the points so that the average distance to the origin is equal to 1. We calculate the distances between all possible pairs of points, and concatenate them with the normalized representation of the points. This results in a feature vector of size 2482 at each time step. In the head pose data modality, we rescale Tx, Ty and Tz by dividing them by 100. We downsample all the visual data modalities to 5Hz. We adopt a zero-tolerance strategy and discard all the time steps in all the visual modalities where the success flag is 0, to exclude the risk of introducing artefacts into the feature space. As the interviews of the participants are of different durations, we left-pad all the visual modality sequences with zero vectors along the temporal axis, to a common length of 10,000 time steps.
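The landmark feature construction above can be sketched as follows; the input frame is random, and only the dimensionality (68×3 coordinates plus C(68,2) = 2278 pairwise distances, i.e. 2482 values per time step) is meant to match the text.

```python
import numpy as np
from itertools import combinations

def landmark_features(pts):
    """Preprocess one frame of 68 3-D facial landmarks: remove the Z bias,
    scale so the mean distance to the origin is 1, then append all pairwise
    inter-point distances. 68*3 coords + C(68,2) distances = 2482 features."""
    pts = pts.copy()
    pts[:, 2] -= pts[:, 2].mean()               # remove bias along the Z-axis
    pts /= np.linalg.norm(pts, axis=1).mean()   # mean distance to origin = 1
    dists = [np.linalg.norm(pts[a] - pts[b])
             for a, b in combinations(range(len(pts)), 2)]
    return np.concatenate([pts.reshape(-1), dists])

frame = np.random.rand(68, 3)      # one frame of landmark coordinates
feat = landmark_features(frame)    # 204 + 2278 = 2482 values per time step
```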

5.1.2 Acoustic modalities preprocessing

We discard those rows whose VUV flag values are 0. As this is an interview, the interviewer and the participant take turns to speak. We separate the participant's COVAREP and formant features from the interview by using the start time and stop time given in the interview transcript file. We discard all those rows which belong to a turn where the participant has spoken for less than one second. We left-pad the COVAREP and formant features of all the participants with zeros along the time axis, to obtain a common length of 80,000 and 120,000 time steps respectively.
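A sketch of the voiced-frame filtering and left-padding described above; the column index of the VUV flag and the lengths are illustrative placeholders, not values taken from the dataset manual.

```python
import numpy as np

def preprocess_acoustic(frames, vuv_col, target_len):
    """Drop unvoiced rows (VUV flag == 0) and left-pad with zero vectors
    along the time axis to a common length, as done for the COVAREP and
    formant streams."""
    voiced = frames[frames[:, vuv_col] == 1]          # keep voiced rows only
    pad = target_len - len(voiced)
    return np.vstack([np.zeros((pad, frames.shape[1])), voiced])

frames = np.random.rand(300, 74)             # 74 COVAREP features per row
frames[:, 0] = np.random.randint(0, 2, 300)  # pretend column 0 is the VUV flag
out = preprocess_acoustic(frames, vuv_col=0, target_len=400)
```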

5.1.3 Text modality preprocessing

We collect only the participants' utterances, and sort them according to their start times. As many of the participants have spoken colloquially, we formalize the utterances by replacing contractions with the corresponding full words. Each utterance is then encoded into a 512-D vector using a pretrained Universal Sentence Encoder Cer et al. (2018). In this way, we construct time series data from the transcript file. We left-pad this data with zeros along the temporal axis, to obtain a common length of 400 time steps for all the participants.
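The contraction expansion step might look like the following sketch; the mapping table is a small illustrative sample, not the one actually used by the authors.

```python
import re

# A small illustrative contraction table; the paper does not list the exact
# mapping it used.
CONTRACTIONS = {
    "don't": "do not", "can't": "cannot", "i'm": "i am",
    "it's": "it is", "won't": "will not", "i've": "i have",
}

def formalize(utterance):
    """Replace colloquial contractions with their full forms before
    sentence encoding."""
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS),
                         re.IGNORECASE)
    # look the matched text up in the table, case-insensitively
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], utterance)

formal = formalize("I don't sleep well")   # -> "I do not sleep well"
```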

5.2 Experimental setup

The hyperparameter details of each of the components of CombAtt network are described in the following sections. These hyperparameters were set empirically, after searching over a fairly vast hyperparameter space. The training procedure is provided in section 4 of the appendix.

5.2.1 Modality encoders

In the Facial Landmark Encoder (MEFL), we use an LSTM layer with 256 memory cells. We feed the output of the LSTM layer to a dropout layer with a dropout rate of 0.3. The output of the dropout layer is passed through a 20 unit ReLU layer. The output of this hidden layer is again fed to a dropout layer of rate 0.3, before regressing the PHQ-8 score in the output layer. The facial landmark encoding, a 256-D vector obtained from MEFL, is denoted by EFL.

The Head Pose Encoder (MEHP) contains a 2-layer LSTM, with 6 and 5 memory cells. The output of the second LSTM layer is fed to a dropout layer with a dropout rate of 0.166. The resultant vector is then concatenated to the gender of the participant, and passed through a 5 unit ReLU layer and a dropout layer with a rate of 0.25, before regressing the PHQ-8 score. The head pose encoding, a 5-D vector obtained from MEHP, is denoted by EHP.

In the Eye Gaze Encoder (MEEG), we use a single LSTM layer with 64 memory cells. The output of the LSTM layer is concatenated to the gender of the participant, and passed through a couple of dense dropout layers, before regressing the PHQ-8 score. The dropout rates of the first and the second dropout layers are 0.2 and 0.125, respectively. The number of hidden units in the first and the second dense layers are 32 and 8, respectively. ReLU activation is used in the hidden dense layers. The eye gaze encoding, a 64-D vector obtained from MEEG, is denoted by EEG.

The Action Units Encoder (MEAU) contains a single LSTM layer with 15 memory cells. The output of the LSTM layer is passed through a dropout layer of rate = 0.133. The resultant vector is appended to the gender of the participant, and is fed to a 6 unit ReLU layer. The output from this dense layer is fed to a dropout layer of rate 0.166, before regressing the PHQ-8 score in the output layer. The action units encoding, a 15-D vector obtained from MEAU, is denoted by EAU.

The COVAREP and formant feature encoders (MECOV and MEFMT) contain single LSTM layers with 37 and 10 memory cells respectively. The outputs from the LSTM layers of both the networks are passed through a dropout layer whose rate is 0.2 and 0.25 in COVAREP and formant feature encoders, respectively. Gender of the participant is appended to the output of this dropout layer. The resultant vectors pass through a stack of dense-dropout layers, and the output of this stack is used for regressing the PHQ-8 score in the output layer. The rate of the second dropout layer is 0.2 and 0.25 in COVAREP and formant feature encoders, respectively. The dense layer in MECOV has 15 hidden units, and the dense layer in MEFMT has 6 units. ReLU activation is used in the hidden layers of both the encoders. The COVAREP and formant feature encodings (37-D and 10-D vectors) obtained from these networks are denoted by ECOV and EFMT respectively.

The transcript encoder (METR) contains a single LSTM layer with 200 memory cells. The sum of the outputs from all the LSTM time steps is passed to a dropout layer of rate = 0.3. The resultant vector is fed to a 60 unit ReLU layer. The output from this dense layer is fed to a dropout layer of rate 0.3, before regressing the PHQ-8 score in the output layer. The transcript encoding, a 200-D vector obtained from METR, is denoted by ETR.

5.2.2 Tensor fusion layer

We pair the visual and acoustic modalities into three groups - (EFL, EHP), (EAU, EEG), and (ECOV, EFMT), and pass them to the tensor fusion layer to obtain the outputs TFOFLHP (1542-D vector), TFOAUEG (1040-D vector), and TFOCOVFMT (418-D vector) respectively. These three outputs, along with ETR, are passed on to the attention fusion subnetwork.

5.2.3 Attention fusion subnetwork

We first shorten TFOFLHP, TFOAUEG and TFOCOVFMT, and elongate ETR, to a common length of 450. TFOFLHP is shortened by feeding it to a stack of 600 unit ReLU and 450 unit ReLU layers, to obtain DEMVFLHP. TFOAUEG is shortened by feeding it to a stack of 570 unit ReLU and 450 unit ReLU layers, to obtain DEMVAUEG. TFOCOVFMT is shortened by feeding it to a 450 unit ReLU layer, to obtain DEMVCOVFMT. ETR is elongated by feeding it to a stack of 315 unit ReLU and 450 unit ReLU layers, to obtain DEMVTR. The attention generation subnetwork takes the vertical concatenation of a subset SM of {DEMVFLHP, DEMVAUEG, DEMVCOVFMT, DEMVTR} (SM is the set of modalities being fused) as input, and feeds it to a 300 unit tanh layer. The output of this hidden layer is passed on to a softmax layer which outputs the attention values. The number of hidden units in this softmax layer is equal to the size of SM.

To see how the different combinations of acoustic, text and visual modalities help in the estimation of the PHQ-8 score, and to see the effect of using attention for the fusion of modalities, we trained and tested the following networks. In the network CombAttav, we fuse DEMVCOVFMT, DEMVFLHP and DEMVAUEG (the acoustic and visual modality encodings), and pass them on to the regression subnetwork. In the network CombAtttv, we fuse DEMVTR, DEMVFLHP and DEMVAUEG (the text and visual modality encodings), and feed them to the regression subnetwork. In the network CombAttat, we fuse DEMVCOVFMT and DEMVTR (the acoustic and text modality encodings), and feed them to the regression subnetwork. In the network CombAttatv, we fuse DEMVCOVFMT, DEMVTR, DEMVFLHP and DEMVAUEG (the acoustic, text and visual modality encodings), and pass them on to the regression subnetwork. In the network CombAttattentionless, we concatenate DEMVCOVFMT, DEMVTR, DEMVFLHP and DEMVAUEG vertically, and pass them on to a stack of dense-dropout layers. The dense layer has 300 hidden units and uses ReLU as its activation. The dropout rate of the dropout layer is 0.25. The output from this stack is appended to the gender and passed on to a regression unit, which regresses the PHQ-8 score.

5.2.4 Regression subnetwork

The input of this network is the concatenation of the fused modalities with the gender of the participant, and is fed to a couple of dense-dropout layers. The first and second dense layers have 310 and 83 hidden units, respectively, and use ReLU as the activation function. The corresponding dropout layers have rates of 0.25 and 0.2, respectively. The output of the second hidden layer is fed to a regression unit, which regresses the PHQ-8 score (an integer value between 0 and 24).

5.3 Experimental results

We first compare the results of the networks, which use different combinations of modalities, as defined in sections 5.2.1 and 5.2.3, in Table 1. In particular, we use three evaluation metrics: root mean squared error (RMSE), mean absolute error (MAE) and explained variance score (EVS).

Model                 RMSE   MAE    EVS
MEFL                  6.24   5.30   0.12
MEHP                  6.45   5.24   0.08
MEEG                  6.57   5.45   0.04
MEAU                  6.53   5.06   0.18
MECOV                 6.60   5.71   0.03
MEFMT                 6.65   5.66   0.01
METR                  4.80   3.74   0.48
CombAttav             5.25   3.89   0.39
CombAtttv             5.11   3.65   0.48
CombAttat             4.64   3.65   0.57
CombAttattentionless  4.69   3.66   0.50
CombAttatv            4.14   3.07   0.62
AWbehavioural         5.54   4.73   NA
MMD                   4.65   3.98   NA
VFSCsemantic          4.46   3.34   NA
Table 1: Results for the different modality combinations and the state-of-the-art approaches.

It is clear that the fusion of all modalities helps in better estimation of depression level compared to any other combination (CombAttatv outperforms all the other networks). Among the modality encoders, METR gives the most accurate estimate, indicating that verbal input plays a predominant role in estimating depression level, which is confirmed by therapists' experience. The results also show the benefit of the attention mechanism, which allows the weight of each modality to be taken into account in the regression process. Note that the results obtained by CombAttatv are statistically significant when compared to all other configurations.

We also compare the performance of the CombAtt network with three approaches that can be considered as the state-of-the-art for this task:

AWbehavioural: This is the winning approach Stepanov et al. (2017) of the AVEC 2017 depression sub-challenge Ringeval et al. (2017). The authors use feature extraction methods on acoustic and text features, and a recurrent neural network on visual features. It is the current state-of-the-art on the test split of DAIC-WOZ. The authors develop four models; we compare our approach with the best of these four models.
MMD: In this approach Yang et al. (2017), the authors propose a multimodal fusion framework composed of deep convolutional neural network (DCNN) and deep neural network (DNN) models. The framework considers acoustic, text and visual streams of data. For each modality, handcrafted feature descriptors are fed to a DCNN that learns high-level global features with compact dynamic information. The learned features are then fed to a DNN to predict the PHQ-8 score. For multimodal fusion, the estimated PHQ-8 scores from the three modalities are integrated in another DNN to obtain the final PHQ-8 score.

VFSCsemantic : In this approach Williamson et al. (2016), the authors derive the biomarkers from visual, acoustic and text modalities. The authors define semantic context indicators, which use the provided transcripts to infer a subject’s status with respect to four conceptual classes. The semantic context feature is the sum of points accrued from all four indicators. This approach is the state-of-the-art on the official development split of DAIC-WOZ.

The results reported in Table 1 show that CombAtt outperforms all state-of-the-art approaches. In particular, CombAtt achieves an improvement of 7.17% on RMSE and 8.08% on MAE over VFSCsemantic (the best performing state-of-the-art approach). On performing statistical significance tests, we found that the improvements are statistically significant.

6 Conclusion

In this paper, we introduce an attention-based fusion network (CombAtt) for the estimation of the PHQ-8 score. Experimental results show that (1) fusing all modalities gives the closest estimation of depression level, (2) the text modality plays an important role in the regression process, and (3) CombAtt outperforms previous state-of-the-art approaches. However, there is still considerable room for improvement: as stated in Valstar et al. (2016), a baseline that constantly predicts the mean depression score yields an RMSE of 5.73 and an MAE of 4.74. In future work, we plan to (1) study recent multi-task learning architectures such as Sanh et al. (2018) and (2) dig deeper into high-level text representations such as Devlin et al. (2018).


  • Albert (2015) Paul R. Albert. 2015. Why is depression more prevalent in women? Journal of Psychiatry and Neuroscience, 40(4):219–221.
  • Andersson and Titov (2014) Gerhard Andersson and Nickolai Titov. 2014. Advantages and limitations of internet-based interventions for common mental disorders. World Psychiatry, 13:4–11.
  • Baltrušaitis et al. (2016) Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: an open source facial behavior analysis toolkit. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pages 1–10. IEEE.
  • Beck and Alford (2009) Aaron T Beck and Brad A Alford. 2009. Depression: Causes and treatment. University of Pennsylvania Press.
  • Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.
  • Chatterjee et al. (2014) Moitreya Chatterjee, Giota Stratou, Stefan Scherer, and Louis-Philippe Morency. 2014. Context-based signal descriptors of heart-rate variability for anxiety assessment. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3631–3635.
  • Cummins et al. (2015) Nicholas Cummins, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien Epps, and Thomas F Quatieri. 2015. A review of depression and suicide risk assessment using speech analysis. Speech Communication, 71:10–49.
  • De Choudhury et al. (2013) Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. 2013. Predicting depression via social media. ICWSM, 13:1–10.
  • Degottex et al. (2014) Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP—a collaborative voice analysis repository for speech technologies. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 960–964. IEEE.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.
  • Dewan et al. (2015) Naakesh A. Dewan, John S. Luo, and Nancy M. Lorenzi. 2015. Mental Health Practice in a Digital World: A Clinician's Guide. Springer Publishing Company, Incorporated.
  • Dibeklioğlu et al. (2015) Hamdi Dibeklioğlu, Zakia Hammal, Ying Yang, and Jeffrey F. Cohn. 2015. Multimodal detection of depression in clinical interviews. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 307–310.
  • Gers et al. (1999) Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to forget: Continual prediction with lstm.
  • Gratch et al. (2014) Jonathan Gratch, Ron Artstein, Gale M Lucas, Giota Stratou, Stefan Scherer, Angela Nazarian, Rachel Wood, Jill Boberg, David DeVault, Stacy Marsella, et al. 2014. The distress analysis interview corpus of human and computer interviews. In LREC, pages 3123–3128. Citeseer.
  • He et al. (2015) Lang He, Dongmei Jiang, and Hichem Sahli. 2015. Multimodal depression recognition with dynamic visual and audio cues. In Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on, pages 260–266. IEEE.
  • Hovy et al. (2017) Dirk Hovy, Margaret Mitchell, and Adrian Benton. 2017. Multitask learning for mental health conditions with limited social media data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 152–162.
  • Kroenke (2012) Kurt Kroenke. 2012. Enhancing the clinical utility of depression screening. Canadian Medical Association Journal, 184(3):281–282.
  • Kroenke et al. (2008) Kurt Kroenke, Tara Strine, Robert L Spitzer, Janet Williams, Joyce T Berry, and Ali Mokdad. 2008. The PHQ-8 as a measure of current depression in the general population. Journal of Affective Disorders, 114:163–173.
  • Littlewort et al. (2011) Gwen Littlewort, Jacob Whitehill, Tingfan Wu, Ian Fasel, Mark Frank, Javier Movellan, and Marian Bartlett. 2011. The computer expression recognition toolbox (CERT). In 2011 IEEE International Conference on Automatic Face & Gesture Recognition and Workshops (FG), pages 298–305.
  • Losada and Crestani (2016) David E Losada and Fabio Crestani. 2016. A test collection for research on depression and language use. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 28–39.
  • Morales et al. (2018) Michelle Morales, Stefan Scherer, and Rivka Levitan. 2018. A linguistically-informed fusion approach for multimodal depression detection. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pages 13–24.
  • Morales (2018) Michelle Renee Morales. 2018. Multimodal Depression Detection: An Investigation of Features and Fusion Techniques for Automated Systems. Ph.D. thesis, City University of New York.
  • Morales and Levitan (2016) Michelle Renee Morales and Rivka Levitan. 2016. Speech vs. text: A comparative analysis of features for depression detection systems. In 2016 IEEE Workshop on Spoken Language Technology, pages 136–143.
  • Poria et al. (2017) Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Mazumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Multi-level multiple attentions for contextual multimodal sentiment analysis. In Data Mining (ICDM), 2017 IEEE International Conference on, pages 1033–1038. IEEE.
  • Ringeval et al. (2017) Fabien Ringeval, Björn Schuller, Michel Valstar, Jonathan Gratch, Roddy Cowie, Stefan Scherer, Sharon Mozgai, Nicholas Cummins, Maximilian Schmitt, and Maja Pantic. 2017. AVEC 2017: Real-life depression, and affect recognition workshop and challenge. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, pages 3–9. ACM.
  • Sanh et al. (2018) Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2018. A hierarchical multi-task approach for learning embeddings from semantic tasks.
  • Scherer et al. (2016) Stefan Scherer, Gale M. Lucas, Jonathan Gratch, Albert Skip Rizzo, and Louis-Philippe Morency. 2016. Self-reported symptoms of depression and PTSD are associated with reduced vowel space in screening interviews. IEEE Transactions of Affective Computing, 7(1):59–73.
  • Scherer et al. (2013) Stefan Scherer, Giota Stratou, Marwa Mahmoud, Jill Boberg, Jonathan Gratch, Albert Rizzo, and Louis-Philippe Morency. 2013. Automatic behavior descriptors for psychological disorder analysis. In 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pages 1–8.
  • Stepanov et al. (2017) Evgeny Stepanov, Stephane Lathuiliere, Shammur Absar Chowdhury, Arindam Ghosh, Radu-Laurentiu Vieriu, Nicu Sebe, and Giuseppe Riccardi. 2017. Depression severity estimation from multiple modalities. arXiv preprint arXiv:1711.06095.
  • Valstar et al. (2016) Michel Valstar, Jonathan Gratch, Björn Schuller, Fabien Ringeval, Denis Lalanne, Mercedes Torres Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja Pantic. 2016. AVEC 2016: Depression, mood, and emotion recognition workshop and challenge. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 3–10. ACM.
  • Williamson et al. (2016) James R Williamson, Elizabeth Godoy, Miriam Cha, Adrianne Schwarzentruber, Pooya Khorrami, Youngjune Gwon, Hsiang-Tsung Kung, Charlie Dagli, and Thomas F Quatieri. 2016. Detecting depression using vocal, facial and semantic communication cues. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 11–18. ACM.
  • Wolohan et al. (2018) JT Wolohan, Misato Hiraga, Atreyee Mukherjee, Zeeshan Ali Sayyed, and Matthew Millard. 2018. Detecting linguistic traces of depression in topic-restricted text: Attending to self-stigmatized depression with nlp. In Proceedings of the First International Workshop on Language Cognition and Computational Models, pages 11–21.
  • Yang et al. (2016) Le Yang, Dongmei Jiang, Lang He, Ercheng Pei, Meshia Cédric Oveneke, and Hichem Sahli. 2016. Decision tree based depression classification from audio video and language information. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 89–96.
  • Yang et al. (2017) Le Yang, Dongmei Jiang, Xiaohan Xia, Ercheng Pei, Meshia Cédric Oveneke, and Hichem Sahli. 2017. Multimodal measurement of depression using deep learning models. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, pages 53–59. ACM.