Major depressive disorder (MDD) is among the most common mental illnesses, affecting a substantial proportion of adults worldwide . While the standard clinical depression assessment relies on structured interviews conducted by specially trained psychiatrists , it is subjective, time-consuming and hard to access. Convergent evidence suggests that non-verbal facial behaviors provide rich and reliable cues for reflecting human depression status [15, 10] (e.g., depressed patients usually show reduced facial expressiveness), and they can easily be recorded in non-invasive ways using portable devices (mobile phones, laptops, etc.). As a result, a large number of recent studies attempt to automatically recognize depression status from subjects’ faces.
Standard face-based approaches [40, 14, 6, 7] predict depression directly from face images/videos. However, in real-world applications, face images are sometimes not accessible due to various ethical and privacy policies. Since an early study shows that mid- and low-level facial attributes (e.g., facial action units (AUs) and facial landmarks) are informative for depression status, a number of recent studies have attempted to recognize depression from automatically detected facial attributes such as facial landmarks [35, 19, 29], gaze direction [1, 2], facial action units (AUs) [11, 28, 27], and head poses . While some of them compute statistics [16, 36] (e.g., displacement, velocity, acceleration) from facial attribute time-series as the clip-level representation for depression recognition, recent advances in deep learning (e.g., 1D-CNN [28, 27], LSTM , attention-based temporal CNN , causal CNN , etc.) have also been applied to infer depression from facial attribute time-series, achieving enhanced results over most hand-crafted approaches.
However, all of these approaches rely on manually designed models (hand-crafted feature extraction or manually designed CNNs) to extract features from facial attributes, and employ simple fusion strategies to combine the depression cues extracted from all attributes, e.g., standard decision-level fusion or simply concatenating all facial attributes into a joint representation. Since each facial attribute has a unique data structure, existing approaches fail to design a task-specific architecture for each facial attribute’s feature extraction. Moreover, while each facial attribute contains both unique and common cues (also carried by other facial attributes) for depression recognition, these simple fusion strategies cannot optimally retain complementary cues while minimizing the redundancy across all facial attributes. In other words, existing approaches built on manually designed networks cannot optimally extract and combine depression-related features from multiple facial attributes, which would theoretically limit the recognition performance.
In this paper, we address the aforementioned issues by introducing the Neural Architecture Search (NAS) technique to explore an optimal model for multiple facial attributes-based depression recognition from a small depression dataset. Instead of conducting frame/short segment-level depression modelling, our approach starts by employing the spectral encoding method  to obtain a clip-level facial behavioral representation for each subject, which has frequently been claimed to be more reliable for depression recognition. Then, we propose a novel multi-stream CNN-GCN framework, where each stream is specifically customized to the unique data structure of a specific facial attribute, aiming to extract depression-related features from its clip-level representation, while its fusion module selects the best intermediate latent representation from each stream and conducts optimal operations to fuse all representations. To achieve this optimal network, we propose an end-to-end NAS strategy to jointly search task-specific architectures for all modules with a limited amount of depression data. The proposed multi-stream CNN-GCN framework is illustrated in Fig. 1. The main contributions and novelties of our approach are summarized as follows.
We propose a novel multi-stream framework for multiple facial attributes-based automatic depression recognition, where each stream can be either a CNN or a GNN, and the fusion module optimally combines the most informative latent representations produced by these streams for depression recognition. To the best of our knowledge, this is the first CNN-GCN framework for face-based depression recognition.
We propose a Neural Architecture Search (NAS) method to search an optimal multi-stream CNN-GCN framework from a depression dataset that only contains 107 training samples, where a novel motion average loss function is proposed to stabilize the searching process. To the best of our knowledge, this is the first work that extends the Neural Architecture Search (NAS) technique to automatic depression analysis.
II Related work
II-A Facial attributes-based depression recognition
Due to ethical concerns and storage issues, recent depression recognition challenges [31, 24, 25] encourage researchers to recognize depression from automatically detected facial attributes (e.g., AUs, emotions, facial landmarks, etc.). Jaiswal et al.  summarize video-level facial attribute time-series into a histogram, and feed it to an MLP to predict the target person’s depression level. Yang et al.  introduce a novel hand-crafted video descriptor that manually models the dynamics of 2D facial landmarks of each video segment, which is then fed to a CNN for segment-level depression-related feature extraction. Haque et al.  use a Causal Convolutional Neural Network (C-CNN) to learn sentence-level depression cues from 3D facial landmarks. Du et al.  propose an Atrous Residual Temporal Convolutional Network (DepArt-Net) that generates multi-scale contextual features from several low-level visual behaviors, then temporally fuses them through an attention mechanism to capture long-range depression-related cues. Song et al. [28, 27] propose to use Fourier transforms to encode the facial attribute time-series (AUs, gazes, and head poses) of a clip into a length-independent spectral representation, incorporating multi-scale temporal information. However, all of these approaches manually design networks for depression feature extraction and fusion without considering task-specific architectures.
II-B Neural architecture search
Neural Architecture Search (NAS) is an AutoML technique that automatically designs a task-specific artificial neural network architecture for the target task. Early pioneering works [41, 3, 39, 21, 30, 34, 23, 32] iteratively propose plausible network architectures, where the validation results obtained by each explored network are used as a reward signal to update the controller, enforcing it to propose better architectures. While such strategies achieved promising performance, they are time-consuming. To accelerate the searching process, ENAS  shares the network’s parameters across all candidate architectures. It treats the entire search space as a super computational graph, where each candidate neural architecture can be viewed as a directed acyclic subgraph. By sharing model weights among all the different subgraphs (candidate architectures), it avoids training each subgraph completely from scratch. Recently, Liu et al.  proposed DARTS, which replaces the discrete searching process with a continuously differentiable strategy, allowing gradient descent-based architecture optimization and resulting in exponentially faster searching. However, such a continuously differentiable strategy is less likely to explore an optimal architecture [37, 38, 4, 17] compared with RL-based methods.
In this section, we present the details of our depression recognition approach. Our approach first converts all facial attribute time-series of each clip into spectral representations, which provides a set of length-independent clip-level facial attribute representations summarizing various long-term facial behaviors of the target person (Sec. III-A). Then, we describe the proposed multi-stream CNN-GCN depression model in Sec. III-B, which aims to learn task-specific and complementary depression cues from the clip-level representation of each facial attribute while optimally combining them for depression recognition. Finally, we propose an end-to-end strategy to search for the target depression model from a small depression dataset (Sec. III-C).
III-A Clip-level facial attribute representation encoding
As discussed above, depression status is more reliably reflected by long-term facial behaviors. In this sense, the first step of our approach is to summarize a clip-level representation for the target clip. Since the length of recorded face clips varies across subjects, the facial attribute time-series of each subject would have variable length. To this end, we apply the spectral encoding algorithm  to produce a pair of fixed-size spectral representations from the facial attribute time-series of each clip, which summarize multi-scale facial behavioral dynamics. Specifically, given the facial attributes of a clip with an arbitrary length, each of them is a multi-channel time-series. Their spectral representations are the amplitude and phase spectra of the corresponding facial attribute, where a pre-defined hyperparameter specifies the number of retained frequency components (the number of columns) of the spectra. As a result, facial attributes of clips with various lengths can be summarized into clip-level spectral representations of the same size.
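This encoding can be sketched with NumPy's FFT. This is a minimal illustration, assuming a plain discrete Fourier transform with low-frequency truncation; the exact multi-scale spectral method of the cited work is not reproduced here.

```python
import numpy as np

def spectral_encode(ts, k):
    """Encode a (C, T) facial-attribute time-series into fixed-size
    (C, k) amplitude and phase spectra via the discrete Fourier
    transform. `k` is the number of retained low-frequency components,
    so clips of any length T map to representations of the same size."""
    spec = np.fft.rfft(ts, axis=-1)   # (C, T//2 + 1) complex spectrum
    amp = np.abs(spec)[:, :k]         # amplitude spectrum, truncated
    phase = np.angle(spec)[:, :k]     # phase spectrum, truncated
    return amp, phase

# Clips of different lengths yield representations of identical shape.
short_clip = np.random.randn(17, 300)   # e.g. 17 AU channels, 300 frames
long_clip = np.random.randn(17, 9000)
a1, p1 = spectral_encode(short_clip, k=64)
a2, p2 = spectral_encode(long_clip, k=64)
assert a1.shape == a2.shape == (17, 64)
assert p1.shape == p2.shape == (17, 64)
```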
III-B Multi-stream CNN-GNN depression analysis model
As illustrated in Fig. 1, the proposed model consists of multiple feature extractors and a fusion module. Each feature extractor aims to extract depression-related cues from the produced spectral representations of a specific facial attribute. In particular, we define a feature extractor as either a CNN or a GNN depending on the data structure of the facial attribute. We first categorize facial attributes into two types: (1) facial attributes whose channels do not have clear spatial correlations (e.g., AUs, gaze and head pose); and (2) facial landmarks, which have strict spatial distributions. For each facial attribute time-series of the first type, we concatenate its clip-level spectral representations into a multi-channel heatmap, and use a 1D-CNN to learn depression-related features from it. For facial landmark time-series, we represent them as a clip-level graph. Specifically, we concatenate the spectral representations of each facial landmark into a single spectral vector, which serves as a node feature in the graph. Here, nodes that belong to the same facial region (e.g., eyes or mouth) are fully connected, and all nodes are connected to the node describing the nasal root, allowing messages to be exchanged across different facial regions during the GNN processing.
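The graph construction can be sketched as follows. The region groupings and landmark indices below are illustrative toy assumptions, not the actual facial landmark layout used in the paper.

```python
import numpy as np

# Hypothetical landmark grouping: indices per facial region, plus a
# nasal-root node (index 0 here) that bridges all regions.
REGIONS = {
    "left_eye": [1, 2, 3],
    "right_eye": [4, 5, 6],
    "mouth": [7, 8, 9, 10],
}
NASAL_ROOT = 0

def build_landmark_adjacency(n_nodes):
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    # Fully connect nodes within each facial region.
    for idx in REGIONS.values():
        for i in idx:
            for j in idx:
                if i != j:
                    A[i, j] = 1
    # Connect every node to the nasal root so messages can pass
    # between regions during GNN propagation.
    for i in range(n_nodes):
        if i != NASAL_ROOT:
            A[i, NASAL_ROOT] = A[NASAL_ROOT, i] = 1
    return A

A = build_landmark_adjacency(11)
assert (A == A.T).all()                  # undirected graph
assert A[1, 2] == 1 and A[1, 4] == 0     # intra-region edge; no direct cross-region edge
assert A[7, NASAL_ROOT] == 1             # every region reaches the root
```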
Meanwhile, we also propose a fusion module to optimally combine the depression features produced by all feature extractors. It contains an input block and several fusion blocks. The input block consists of parallel input layers, each of which takes the best latent feature from a specific feature extractor. Here, we use average pooling to align all latent representations to the same size; they are then fed to several fusion blocks to optimally combine all depression cues. The employed candidate operators for each feature extractor and the fusion module are provided in the supplementary material.
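How the input block could align differently sized latent features can be sketched with a simple 1-D adaptive average pooling. This is a simplified stand-in for the actual pooling layers, not the paper's implementation.

```python
import numpy as np

def adaptive_avg_pool1d(x, out_len):
    """Average-pool a 1-D feature vector to a fixed length, mimicking
    how the input block aligns latent representations of different sizes."""
    # Split indices into `out_len` near-equal bins and average each bin.
    bins = np.array_split(x, out_len)
    return np.array([b.mean() for b in bins])

# Latent features from different streams (e.g. CNN vs. GNN) rarely share
# a size; pooling them to a common length lets the fusion blocks operate
# on a uniform stack.
latents = [np.random.randn(n) for n in (48, 96, 130)]
aligned = np.stack([adaptive_avg_pool1d(z, 32) for z in latents])
assert aligned.shape == (3, 32)
```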
III-C Neural Architecture Search for the multi-stream model
To obtain an optimal multi-stream model to predict depression from multiple facial attributes, this section proposes to automatically search for the optimal architecture of the model under the supervision of depression labels. Our searching strategy is made up of two stages: the warm-up stage and the end-to-end depression model searching stage.
III-C1 Warm-up stage
Let $S_e$ denote the search space size for each of the $N$ feature extractors, and $S_f$ the search space size for the fusion module. If we directly conducted an end-to-end search over all feature extractors and the fusion module, the search space size for the entire model would be $S_e^N \times S_f$. This is intimidating when the search space of each module is large. To this end, the warm-up stage first conducts a pre-search for each feature extractor, reducing its search space size from $S_e$ to $S_r$ ($S_r \ll S_e$). Consequently, the searching complexity of the warm-up stage is $N \times S_e$, while the complexity of the end-to-end search is reduced to $S_r^N \times S_f$. In short, the warm-up stage reduces the depression network searching complexity to $N \times S_e + S_r^N \times S_f$.
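The search-space reduction can be illustrated with toy numbers; all sizes below are made up for illustration and are not the paper's actual search-space sizes.

```python
# Toy illustration of the warm-up stage's search-space reduction.
N = 4          # number of feature extractors (one per facial attribute)
S_e = 1000     # candidate architectures per feature extractor
S_f = 500      # candidate architectures for the fusion module
S_r = 10       # reduced per-extractor space after the warm-up stage

direct = (S_e ** N) * S_f        # one-shot joint search over everything
warm_up = N * S_e                # independent pre-search per stream
end_to_end = (S_r ** N) * S_f    # joint search over the reduced spaces

# The two-stage strategy explores far fewer candidates overall.
assert warm_up + end_to_end < direct
```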
For each feature extractor, we learn a long short-term memory (LSTM) network as the controller to sample architectures. The pseudocode of the warm-up stage is provided in the supplementary material.
III-C2 End-to-end architecture searching stage
Once we have obtained the reduced search space for each feature extractor, we jointly search for the final architectures of all feature extractors and the fusion module. To achieve high depression recognition performance, this joint searching process enforces all feature extractors to learn unique and complementary depression cues rather than repeatedly learning the depression cues contained in all facial attributes. Specifically, we individually learn an LSTM as the controller to sample architectures for each feature extractor, all of which are jointly optimized with the controller that samples the architecture for the fusion module. This process is demonstrated in Algorithm 1 and Fig. 2.
III-C3 Optimization details
At each training iteration $t$, the controller (LSTM) samples an architecture $a_t$ based on its current policy $\pi(\cdot;\theta_t)$. Then, we instantiate a child network from $a_t$ and train it from scratch. The performance of the trained child network is evaluated by the Root Mean Square Error (RMSE) between its predictions and the ground truth on the validation set, which is used as the reward signal to train the controller via the PPO algorithm . Let $\theta_t$ denote the parameters of the controller at time step $t$ and $\pi(a_t;\theta)$ denote the probability of $a_t$ under policy $\pi(\cdot;\theta)$. With the probability ratio $r_t(\theta) = \pi(a_t;\theta)/\pi(a_t;\theta_{\mathrm{old}})$ and the advantage estimate $\hat{A}_t$, the PPO objective for the sampled architecture is formulated as:

$J(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$

During the joint search, the joint probability of a full architecture can be factorized as

$\pi(a;\theta) = \pi\left(a_{\mathrm{fuse}} \mid a_1, \dots, a_N; \theta\right)\prod_{i=1}^{N}\pi(a_i;\theta_i)$

where $a_{\mathrm{fuse}}$ and $a_i$ stand for the architectures of the fusion module and each feature extractor, respectively. Finally, the loss function for updating the controller is approximated by an average over $M$ sampled architectures:

$\mathcal{L}(\theta) \approx -\frac{1}{M}\sum_{m=1}^{M}J^{(m)}(\theta)$

where $a^{(1)}, \dots, a^{(M)}$ are the architectures sampled from $\pi(\cdot;\theta)$.
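The clipped PPO surrogate can be made concrete with a minimal NumPy sketch. The function name, the scalar advantage, and the clipping range $\epsilon = 0.2$ are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def ppo_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped PPO surrogate for one sampled architecture. `logp_*` are
    log-probabilities of the architecture under the new/old controller
    policies; `advantage` is the reward (e.g. negative validation RMSE)
    minus a baseline."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# With a positive advantage the objective is capped once the ratio
# exceeds 1 + eps, discouraging overly large policy updates.
assert np.isclose(ppo_objective(np.log(2.0), np.log(1.0), 1.0), 1.2)
# With a negative advantage the minimum keeps the unclipped (worse) term.
assert np.isclose(ppo_objective(np.log(2.0), np.log(1.0), -1.0), -2.0)
```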
Since the size of the depression recognition dataset is very limited, the trained models of a given sampled architecture can have variable performances due to stochastic factors in training (e.g., the shuffled training samples or different initial weights). This means that the validation error of a single trial cannot accurately represent the generalization capability of the sampled architecture. In principle, we could mitigate this problem using the Monte Carlo method, averaging the performance obtained by training and evaluating the child network multiple times each time an architecture is sampled. However, the number of trials used for averaging, denoted as $K$, is difficult to determine, and a large $K$ entails significant computational costs. In this sense, we modify the Monte Carlo method with a motion average (line 17 of Algorithm 1). This process replaces the validation error (i.e., the reward signal) of a single trial with the average of the results achieved across the whole training process so far. For an architecture $a$ that has been sampled $n_a$ times, the proposed motion average is formulated as

$\bar{e}(a) = \frac{1}{n_a}\sum_{j=1}^{n_a} e_j(a)$

where $e_j(a)$ is the validation loss obtained by the $j$-th instantiated child network of the architecture $a$. Intuitively, the more frequently an architecture is sampled, the more accurate its performance estimate is, allowing a more accurate ranking of architectures as the policy converges in the later stage.
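The motion average can be sketched as a per-architecture running mean of validation errors; the class and key names below are hypothetical.

```python
class MotionAverageReward:
    """Running average of validation errors per sampled architecture.
    The more often an architecture is sampled, the more trials its
    reward estimate averages over, reducing the variance caused by
    stochastic training on a small dataset."""
    def __init__(self):
        self.sums = {}
        self.counts = {}

    def update(self, arch_key, val_error):
        self.sums[arch_key] = self.sums.get(arch_key, 0.0) + val_error
        self.counts[arch_key] = self.counts.get(arch_key, 0) + 1
        return self.sums[arch_key] / self.counts[arch_key]

avg = MotionAverageReward()
# Three noisy trials of the same architecture: the reward converges
# toward the mean instead of tracking each single trial.
assert avg.update("arch_A", 6.0) == 6.0
assert avg.update("arch_A", 4.0) == 5.0
assert avg.update("arch_A", 5.0) == 5.0
```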
IV-A Dataset
In this paper, we evaluate the proposed approach on a widely-used facial attributes-based depression analysis dataset: the DAIC-WOZ dataset. The DAIC-WOZ dataset used in the AVEC 2016 depression recognition challenge contains a total of 189 clinical interviews. The interviews are conducted by an animated virtual interviewer called Ellie and range from 7 to 33 minutes in length. Each interview session contains multiple audio-visual and verbal recordings of the participant answering Ellie’s questions, as well as self-assessed PHQ-8 scores (0–24) as ground-truth labels.
IV-B Implementation details
For each clip, we first remove frames for which face detection failed or the detected face has low confidence. We then normalize each facial attribute time-series by subtracting its median (we retain the original values of facial landmark time-series to preserve spatial information). Meanwhile, during the spectral encoding, we retained the same number of frequency components of the spectral signals for all features, following the frequency alignment method in .
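A minimal sketch of this preprocessing, assuming a per-frame confidence score with an illustrative threshold and per-channel median subtraction:

```python
import numpy as np

def preprocess_attribute(ts, confidences, conf_threshold=0.9, is_landmarks=False):
    """Drop low-confidence frames, then centre each channel on its median.
    Landmark series keep their raw values to preserve spatial layout.
    (The threshold value is illustrative, not the paper's.)"""
    kept = ts[:, confidences >= conf_threshold]      # ts has shape (C, T)
    if is_landmarks:
        return kept
    return kept - np.median(kept, axis=1, keepdims=True)

ts = np.array([[1.0, 2.0, 3.0, 100.0]])
conf = np.array([1.0, 1.0, 1.0, 0.1])                # last frame: failed detection
out = preprocess_attribute(ts, conf)
assert out.shape == (1, 3)                           # low-confidence frame dropped
assert np.isclose(np.median(out), 0.0)               # median-centred
```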
Training details: For CNN networks and fusion blocks, in addition to the stem architecture sampled by the controller, a fixed regression layer, i.e., average pooling + dropout + linear, is used to output the final prediction in the warm-up stage. For GNN networks, only dropout + linear is added, since the pooling layer is included in the NAS search space. The loss functions of the controller and child networks are the same in both the warm-up and fusion stages. During the NAS for fusion, the cell states of the single-modal controller LSTMs are concatenated to form the cell state and the input of the first layer of the fusion controller LSTM. The controller and child networks are both optimized with Adam. The hyperparameters of the controller and child networks are optimized on a validation set and kept the same in all experiments. Finally, to obtain the best architecture, we trained the top-3 architectures proposed by the controller on the full training set and report the best-performing one on the test set.
Metrics: We adopt two metrics, root mean square error (RMSE) and mean absolute error (MAE), to evaluate the performance of our approach; both have been used in previous AVEC challenges. They are defined as:

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2}, \quad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|$

where $\hat{y}_i$ and $y_i$ denote the predictions and the ground-truth PHQ-8 depression scores.
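These two standard metrics can be implemented directly:

```python
import numpy as np

def rmse(pred, target):
    """Root mean square error between predictions and ground truth."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(target)) ** 2)))

def mae(pred, target):
    """Mean absolute error between predictions and ground truth."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))

# Toy PHQ-8 predictions vs. ground truth.
pred, target = [5.0, 10.0, 3.0], [6.0, 8.0, 3.0]
assert np.isclose(rmse(pred, target), np.sqrt(5.0 / 3.0))
assert np.isclose(mae(pred, target), 1.0)
```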
IV-C Comparison to existing approaches
Table I compares our best system with recent state-of-the-art methods. Our method achieves new state-of-the-art performance, with 27% RMSE and 30% MAE improvements over the best existing facial attributes-based approach . A more detailed analysis is shown in Table II, where we report the depression recognition results achieved by each explored single-modal network (each predicting depression from a single facial attribute). The results demonstrate that the automatically searched architectures provide promising results for all attributes, with large advantages over other methods. In particular, our approach obtained the top-2 best performance from the GNN-based facial landmark stream and the CNN-based AU stream, showing that even a single modality achieves better performance than any existing facial attribute-based approach. In other words, these results suggest that our approach can search superior network architectures compared with existing manually designed architectures for extracting depression cues from each facial attribute. More importantly, they indicate the great potential of applying NAS to automatic depression analysis. Fig. 3 visualizes the predictions of our best system.
|Method ||RMSE||MAE|
|Williamson et al. ||6.45||5.33|
|Song et al. ||6.29||5.15|
|Haque et al. ||-||5.01|
|Du et al. ||5.78||4.61|
|Yang et al. ||5.39||4.72|
|Method ||AUs||Gaze||Head pose||Landmarks||Fusion|
|Song et al. (RMSE)||6.32||6.36||6.18||-||6.29|
|Du et al. (RMSE)||5.88||6.21||6.32||6.02||5.78|
|Song et al. (MAE)||5.01||5.24||5.04||-||5.15|
|Du et al. (MAE)||4.65||4.99||5.21||5.01||4.61|
IV-D Ablation studies
According to Table II, we first notice that facial landmark time-series contain more depression-related cues than other facial attributes. This can be explained by the fact that facial landmarks comprehensively describe the behaviors of various facial regions. Although AUs can also objectively describe facial behaviors of the full face, errors in automatically detected AU intensities may limit their ability to infer depression status. Meanwhile, gaze and head pose can only reflect some specific facial behaviors and fail to capture depression-related behaviors occurring in other local facial regions.
Graph-based facial landmarks: Table III compares the depression recognition results achieved by applying the explored CNN and GNN to process clip-level facial landmark representations. The results show that the explored GNN clearly outperforms the CNN, suggesting that facial landmark time-series are more suitably represented as a graph. Moreover, we show that our fusion module is able to combine the latent representations produced by CNNs and the GNN, as the best results are achieved by the CNN-GNN model.
We also demonstrate the learning curve of the controller at each time step, i.e., the average validation error of the architectures sampled at each step. When adopting our motion average loss, the validation loss curve shows decreased variance and less fluctuation. When a single validation error is adopted as the reward signal, the variance is consistently larger and the controller eventually converges to a sub-optimal policy compared with the policy obtained using our motion average loss.
In this paper, we propose the first Neural Architecture Search approach to jointly explore optimal CNN-GCN feature extractors and a fusion module for predicting depression from multiple facial attributes. The results show that the model explored by our approach can learn superior depression-related features from all facial attributes, and that the fusion module can further enhance performance by combining complementary depression-related cues from all facial attributes, with extremely large improvements over the existing state of the art (27% RMSE and 30% MAE improvements). In summary, our study provides solid evidence and a strong baseline for applying NAS to automatic depression analysis.
-  (2015) Cross-cultural detection of depression from nonverbal behaviour. In 2015 11th IEEE International conference and workshops on automatic face and gesture recognition (FG), Vol. 1, pp. 1–8. Cited by: §I.
-  (2016) Multimodal depression detection: fusion analysis of paralinguistic, head pose and eye gaze behaviors. IEEE Transactions on Affective Computing 9 (4), pp. 478–490. Cited by: §I.
-  (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §II-B.
-  FairNAS: rethinking evaluation fairness of weight sharing neural architecture search. Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12239–12248. Cited by: §II-B.
-  (2009) Detecting depression from facial actions and vocal prosody. In 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, Vol. , pp. 1–7. External Links: Cited by: §I.
-  (2020) Encoding temporal information for automatic depression recognition from facial analysis. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1080–1084. Cited by: §I.
-  (2021) MDN: a deep maximization-differentiation network for spatio-temporal depression detection. IEEE Transactions on Affective Computing. Cited by: §I.
-  (2019) Encoding visual behaviors with attentive temporal convolution for depression prediction. Proceedings - 14th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2019 (61673033). External Links: Cited by: 3rd item, §I, §II-A, TABLE I, TABLE II.
-  (2004) The structured clinical interview for DSM-IV axis I disorders (SCID-I) and the structured clinical interview for DSM-IV axis II disorders (SCID-II). Cited by: §I.
-  (2004) Facial expressivity in the course of schizophrenia and depression. European archives of psychiatry and clinical neuroscience 254 (5), pp. 335–342. Cited by: §I.
-  (2017) Topic modeling based multi-modal depression detection. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, pp. 69–76. Cited by: §I.
-  (2014) The distress analysis interview corpus of human and computer interviews. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 3123–3128. Cited by: §IV-A.
-  (2018) Measuring depression symptom severity from spoken language and 3d facial expressions. arXiv preprint arXiv:1811.08592. Cited by: §I, §II-A, TABLE I.
-  (2021) Automatic depression recognition using cnn with attention mechanism from videos. Neurocomputing 422, pp. 165–175. Cited by: §I.
-  (1986) Facial expression of positive and negative emotions in patients with unipolar depression. Journal of affective disorders 11 (1), pp. 43–50. Cited by: §I.
-  (2019) Automatic prediction of depression and anxiety from behaviour and personality attributes. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 1–7. Cited by: §I, §II-A.
-  Improving one-shot NAS by suppressing the posterior fading. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13836–13845. Cited by: §II-B.
-  (2019) DARTS: Differentiable architecture search. 7th International Conference on Learning Representations, ICLR 2019, pp. 1–13. External Links: Cited by: §II-B.
-  (2016) Multimodal and multiresolution depression detection from speech and facial landmark features. AVEC 2016 - Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, co-located with ACM Multimedia 2016, pp. 43–50. External Links: Cited by: §I, §IV-A.
-  Global health data exchange (GHDx). Note: http://ghdx.healthdata.org/gbd-results-tool?params=gbd-api-2019-permalink/d780dffbe8a381b25e1416884959e88b Accessed: 2021-12-24. Cited by: §I.
-  Efficient neural architecture search via parameter sharing. 35th International Conference on Machine Learning, ICML 2018, pp. 6522–6531. External Links: Cited by: §II-B.
-  (2019) Multi-level attention network using text, audio and video for depression prediction. AVEC 2019 - Proceedings of the 9th International Audio/Visual Emotion Challenge and Workshop, co-located with MM 2019, pp. 81–88. External Links: Cited by: §I.
-  Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §II-B.
-  (2017) AVEC 2017 - Real-life depression, and affect recognition workshop and challenge. AVEC 2017 - Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, co-located with MM 2017, pp. 3–9. External Links: Cited by: §II-A.
-  (2019) AVEC 2019 workshop and challenge: state-of-mind, detecting depression with ai, and cross-cultural affect recognition. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pp. 3–12. Cited by: §II-A.
-  (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §III-C3.
-  (2020) Spectral Representation of Behaviour Primitives for Depression Analysis. IEEE Transactions on Affective Computing 14 (8), pp. 1–16. External Links: Cited by: §I, §I, §II-A, §III-A, §IV-B.
-  (2018) Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 158–165. Cited by: §I, §II-A, TABLE I, TABLE II.
-  (2017) Depression severity prediction based on biomarkers of psychomotor retardation. AVEC 2017 - Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, co-located with MM 2017 (2), pp. 37–43. External Links: Cited by: §I.
-  (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §II-B.
-  (2016) AVEC 2016 - Depression, mood, and emotion recognition workshop and challenge. AVEC 2016 - Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, co-located with ACM Multimedia 2016, pp. 3–10. External Links: Cited by: §II-A.
-  (2020) Npenas: neural predictor guided evolution for neural architecture search. arXiv preprint arXiv:2003.12857. Cited by: §II-B.
-  (2016) Detecting depression using vocal, facial and semantic communication cues. AVEC 2016 - Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, co-located with ACM Multimedia 2016, pp. 11–18. External Links: Cited by: TABLE I.
-  (2017) Genetic cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1379–1388. Cited by: §II-B.
-  (2016) Decision tree based depression classification from audio video and language information. In Proceedings of the 6th international workshop on audio/visual emotion challenge, pp. 89–96. Cited by: §I.
-  (2021) Integrating deep and shallow models for multi-modal depression analysis—hybrid architectures. IEEE Transactions on Affective Computing 12 (1), pp. 239–253. External Links: Cited by: 3rd item, §I, §II-A, §IV-C, TABLE I.
-  (2019) Evaluating the search phase of neural architecture search. arXiv preprint arXiv:1902.08142. Cited by: §II-B.
-  (2020) Overcoming multi-model forgetting in one-shot nas with diversity maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7809–7818. Cited by: §II-B.
-  (2017) Practical network blocks design with q-learning. arXiv preprint arXiv:1708.05552 6. Cited by: §II-B.
-  (2018) Visually interpretable representation learning for depression recognition from facial images. IEEE Transactions on Affective Computing 11 (3), pp. 542–552. Cited by: §I.
-  (2017) Neural architecture search with reinforcement learning. 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, pp. 1–16. External Links: Cited by: §II-B.