Vision based body gesture meta features for Affective Computing

02/10/2020 ∙ by Indigo J. D. Orton, et al.

Early detection of psychological distress is key to effective treatment. Automatic detection of distress, such as depression, is an active area of research. Current approaches utilise vocal, facial, and bodily modalities. Of these, the bodily modality is the least investigated, partially due to the difficulty in extracting bodily representations from videos, and partially due to the lack of viable datasets. Existing body modality approaches use automatic categorization of expressions to represent body language as a series of specific expressions, much like words within natural language. In this dissertation I present a new type of feature, within the body modality, that represents meta information of gestures, such as speed, and use it to predict a non-clinical depression label. This differs to existing work by representing overall behaviour as a small set of aggregated meta features derived from a person's movement. In my method I extract pose estimation from videos, detect gestures within body parts, extract meta information from individual gestures, and finally aggregate these features to generate a small feature vector for use in prediction tasks. I introduce a new dataset of 65 video recordings of interviews with self-evaluated distress, personality, and demographic labels. This dataset enables the development of features utilising the whole body in distress detection tasks. I evaluate my newly introduced meta-features for predicting depression, anxiety, perceived stress, somatic stress, five standard personality measures, and gender. A linear regression based classifier using these features achieves a 82.70 novel dataset.




Dissertation structure

Chapter 2 contains a literature review covering related distress detection research and relevant technology for working with the body modality specifically. Chapter 3 introduces a novel dataset for depression detection. I present my method in Chapter 4. I evaluate my features’ validity in Chapter 5. Finally, I conclude and outline potential future work in Chapter 6.

2.1 Automatic Detection of Distress

Distress is expressed through all modalities. Many approaches have been developed to automatically detect distress using behavioural cues, including both mono-modal and multi-modal approaches. For example, face analysis methods are among the most common, and powerful, techniques, enabled by the development of tools for extracting accurate positional information, such as facial landmarks, and tools for automatic interpretation based on existing psychology approaches such as FACS [21].

I review uses of the primary modalities, cross-modal considerations, multi-modal approaches, and the expanding use of deep learning within automatic distress detection. The face, head, and body modalities are the most relevant, though I briefly provide examples of text and vocal modal usage.

Text Modality

Dang et al. [13] use linguistic attributes, auxiliary speech behaviour, and word affect features to predict depression severity and emotional labels within the DAIC dataset [26]. Linguistic attribute features include total number of words, unique words, pronouns, and lexical proficiency, among others. Auxiliary speech behaviour covers parallel actions, such as laughing or sighing, and meta information such as word repeats and average phrase length. Word affect features are determined by a collection of annotated corpora that assign n-grams categorizations or ratings relating to their affective semantics. For example, assigning an emotion type (anger, disgust, joy, etc.) to a word, or rating words from 0 to 10 on affective attributes such as arousal, valence, dominance, and pleasure. These three feature types are used to predict depression measures.
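As a rough illustration of the linguistic attribute and meta information features described above, the following sketch counts total words, unique words, pronouns, and immediate word repeats in a transcript. The pronoun list, the tokenisation rule, and the function name are assumptions made for this example; they are not taken from Dang et al.

```python
import re
from collections import Counter

# Illustrative pronoun list: an assumption, not the list used by Dang et al.
PRONOUNS = {"i", "me", "my", "mine", "you", "your", "he", "she", "it", "we", "they", "them"}


def linguistic_attributes(transcript: str) -> dict:
    """Count simple linguistic attributes of a transcript."""
    # Lower-case the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(words)
    # A word repeat is the same word spoken twice in a row ("I I think").
    repeats = sum(1 for a, b in zip(words, words[1:]) if a == b)
    return {
        "total_words": len(words),
        "unique_words": len(counts),
        "pronouns": sum(counts[p] for p in PRONOUNS),
        "word_repeats": repeats,
    }
```

For instance, `linguistic_attributes("I I think my day was fine")` reports seven words, six unique words, three pronouns, and one word repeat.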

Audio Modality

The audio modality can be a strong predictive source as non-verbal features of speech can be predictive of distress irrespective of the content of a person’s speech [43].

Features from audio data commonly include prosody, jitter, intensity, loudness, fundamental frequency, energy, Harmonic-to-Noise-Ratio (HNR), among others. Ozdas et al. [43] present a method for analysing fluctuations in the fundamental frequency of a person’s voice to assess their risk of suicide. Alghowinem et al. [2] explore the use of a broad selection of vocal features, extracted using the “openSMILE” toolkit [22], to detect depression. Dibeklioglu et al. [16] use vocal prosody to detect depression.

In their investigation of psychomotor retardation caused by depressive states, Syed et al. [55] use low-level descriptors to model turbulence in subject speech patterns. By profiling the turbulence of depressed and non-depressed participants with a depression dataset they develop a model for predicting depression severity based on the level of turbulence.

Facial Modality

Joshi et al. [36] present a “bag of facial dynamics” depression detection method based on the same expression clustering as their “bag of body dynamics” method described in more depth below. For this facial method the space-time interest points (STIPs) are generated for face aligned versions of the source videos.

While Joshi et al. use a categorization approach, Dibeklioglu et al. [16] use generic features representing facial movement dynamics for depression detection. This method involves generating statistical derivations from movement features such as velocity, acceleration, and facial displacement over a time period and then modeling their effect.
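The idea of deriving statistics from movement dynamics can be sketched as follows. This is a minimal example, assuming a one-dimensional position series for a single landmark and a fixed frame rate. The function names and the chosen statistics (mean, maximum, standard deviation) are illustrative assumptions rather than the exact feature set of Dibeklioglu et al.

```python
import statistics


def differences(values, dt):
    """Finite differences between consecutive samples, divided by the time step."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


def dynamics_features(positions, fps):
    """Summarise velocity and acceleration magnitudes of a 1D position series."""
    dt = 1.0 / fps
    velocity = differences(positions, dt)
    acceleration = differences(velocity, dt)
    features = {}
    for name, series in (("velocity", velocity), ("acceleration", acceleration)):
        magnitudes = [abs(v) for v in series]
        features[name + "_mean"] = statistics.fmean(magnitudes)
        features[name + "_max"] = max(magnitudes)
        features[name + "_std"] = statistics.pstdev(magnitudes)
    return features
```

The input needs at least three samples so that both the velocity and acceleration series are non-empty.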

Whilst Dibeklioglu et al. take a generic approach to facial movement, Syed et al. [55] attempt to model behaviour discussed in psychology literature, using features representing psychomotor retardation to predict depression severity. Psychomotor retardation has been shown to be linked to depression [50]. In particular, Syed et al. generate features to capture craniofacial movements that represent psychomotor retardation, and thus indicate depression. To capture the target movements they design features that represent muscular tightening, since a depressed subject is expected to have impaired muscle movements. They model three types of movement: head movement, mouth movement, and eyelid movement. These movements are represented by temporal deltas to define the amount of movement in the region. From these localised movement deltas the authors aim to represent specific actions such as blinking or contorting of the mouth. Relatively simple features derived from these actions, such as blink rate, can be indicative of depression [3, 19], which suggests that such features can contribute useful information to distress detection models. Specific facial actions have also been modelled directly. For example, Scherer et al. [46] use smile features such as intensity and duration, along with other modalities, to detect depression.

Yang et al. [58] present a novel facial descriptor, a “Histogram of Displacement Range (HDR)”, which describes the amount of movement of facial landmarks. The histogram counts the number of occurrences of a displacement within a certain range of movement. Where Syed et al. represented the amount of movement of certain facial features to measure psychomotor retardation, Yang et al. represent the number of times the face is distorted, so to speak, by landmarks moving a certain amount.
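A displacement histogram of this kind might be sketched as below. This is only an interpretation of the description above, not the HDR definition from Yang et al.: the frame step, the bin edges, and the use of Euclidean distance are all assumptions.

```python
import math


def displacement_histogram(frames, bin_edges, step=1):
    """Count landmark displacements that fall within each displacement range.

    frames: list of frames, each a list of (x, y) landmark positions.
    bin_edges: ascending edges; bin i covers [bin_edges[i], bin_edges[i + 1]).
    step: number of frames between the two positions being compared.
    """
    counts = [0] * (len(bin_edges) - 1)
    for earlier, later in zip(frames, frames[step:]):
        for (x0, y0), (x1, y1) in zip(earlier, later):
            distance = math.hypot(x1 - x0, y1 - y0)
            for i in range(len(counts)):
                if bin_edges[i] <= distance < bin_edges[i + 1]:
                    counts[i] += 1
                    break
    return counts
```

Displacements outside the outermost edges are ignored in this sketch.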

Eye Sub-Modality

While Syed et al. [55] explored the use of eyelid features, eye gaze features have also been shown to be effective in predicting distress. This modality has become viable as eye tracking technology has progressed sufficiently to enable accurate processing of eye features.

Alghowinem et al. [1] use eye gaze/activity to perform binary classification of depression in cross-cultural datasets. They track iris and eyelid movements to extract features such as blink rate, duration of closed eyes, and statistical “functionals” (i.e. simple derivations) of the amount of activity. However, activity is not the only indicator: Scherer et al. [46] use average eye gaze vertical orientation (among other features), in the span degrees.

Head Modality

Joshi et al. [36] present a “histogram of head movements” depression detection method that models movement of a person’s head over time. They use three facial landmarks, the corner of each eye and the tip of the nose, to compute the orientation of the subject’s head. The histogram uses orientation bins of width 10 within the range degrees for windows of time within a video. These windowed histograms are then averaged over the full length of the video. The resulting average histogram is a descriptor of the amount of movement within the video by representing the variety of angles the head orients to. This method achieves comparable performance to their “bag of facial dynamics” method.
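The windowed orientation histogram can be illustrated with a short sketch. The bin width of 10 degrees follows the description above, but the angle range is not stated here, so the default bounds of -90 to 90 degrees are placeholders. Normalising each window's histogram and the function names are also assumptions of this example.

```python
def orientation_histogram(angles, low=-90, high=90, width=10):
    """Normalised histogram of head angles using fixed-width bins.

    low and high are placeholder bounds (an assumption of this sketch).
    """
    bins = [0.0] * ((high - low) // width)
    for angle in angles:
        if low <= angle < high:
            bins[int((angle - low) // width)] += 1
    total = sum(bins)
    return [b / total for b in bins] if total else bins


def average_windowed_histogram(angles, window, low=-90, high=90, width=10):
    """Build one histogram per window of frames, then average them bin by bin."""
    windows = [angles[i:i + window] for i in range(0, len(angles), window)]
    histograms = [orientation_histogram(w, low, high, width) for w in windows]
    return [sum(column) / len(histograms) for column in zip(*histograms)]
```

A head that stays in one bin within each window but changes bin between windows produces a spread-out average histogram, which is how the descriptor reflects the variety of orientations over the video.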

A number of the methods using eye activity features also incorporate head activity in their models. Alghowinem et al. [1] model head activity similarly to their modeling of eye activity. As with eye activity, they extract statistical derivations of movement and angular shift. They also include the duration of the head at different orientations, the rate of change of orientation, and the total number of orientation changes. Scherer et al. [46] use head vertical orientation, similar to their eye gaze orientation feature, as a feature for predicting depression. They use the average pitch of the head within a 3D head orientation model. Dibeklioglu et al. [16] also model head movement dynamics in a similar fashion to their facial movement dynamics features. Similar to Alghowinem et al. they extract statistical derivations of movement velocity, amplitude, and acceleration as head movement features.

The head movement statistical derivation features presented by Alghowinem et al. and Dibeklioglu et al. are similar to the features I introduce in this dissertation, in that they represent meta movement information of the modality, rather than categorizing movement. Though, of course, Alghowinem et al. also incorporate categorized movement via their orientation features.

Body Modality

The body modality is the least researched of the modalities reviewed. This is due to a number of factors including the difficulty of dataset creation and the available technology for extracting raw body data. Contrast this with the relative ease of working with the other modalities and it is no surprise they received more attention. However, much of the relevant body modality based automatic distress detection research appeared in the early 2010s as some private clinical datasets were gathered and the parallel technological ecosystem expanded to support generation of body modality features.

The Case for Further Investigation

De Gelder [14] presents the case for further research of bodily expressions within affective neuroscience. Though a parallel field, the core argument is much the same for affective computing’s investigation of bodily expressions. Specifically, at the time of writing (2009) de Gelder asserts that 95% of “social and affective neuroscience” focuses on faces and that the remaining 5% is mostly split between vocal, musical, and environmental modalities, with very few papers investigating the body modality. This was similar to the state of automatic distress detection in the early 2010s, though the vocal modality has a more prominent position in the literature and is more evenly balanced with the facial modality.

Affectation control and robust automatic detection

Non-verbal features provide discriminative information regardless of a person’s conscious communication; this is particularly important for automatic distress detection. Different modalities can be consciously controlled to varying degrees. For example, facial expressions are more easily controlled than bodily expressions [14]. By including more modalities, and representations of those modalities, automatic distress detection could become more robust to a person consciously modifying their expression in any one modality.

Further to robustness, one of the advantages of the face and body modalities is their ability to detect micro-expressions. Micro-expressions are instinctual reactions to some stimulus that can be predictive of emotional and distress state [20, 28]. They are significantly harder to control than general expressions.

Expression Categorization

Much of the body modality research has approached the problem as a transfer of the methods from the facial modality by modeling body movements as expressions in much the same way facial expressions are [33, 36, 34]. Differences in these methods have been centred around: what tracklets form the basis of the body data [36], the use of deep learning vs manual descriptor definition [46], and process for generating categories of expressions [51].

Joshi et al. [36] demonstrate the discriminative power of bodily expressions for predicting clinically based depression measures. They use STIPs from recordings of a participant’s upper body and generate a “Bag of Body Dynamics (BoB)” based on codebooks of expression representations. STIPs are generated for a video, then Histograms of Gradient (HoG) and Optical Flow (HoF) are computed spatio-temporally around the STIPs. These histograms are then clustered within a sample, and the cluster centres form a representation of the movements occurring in the video. The cluster centres from all videos in a training set are clustered again to generate the codebook of expressions. A BoB feature vector is generated for each sample by counting the number of cluster centres within the sample that fit within each codebook expression. Finally, these BoB feature vectors are used by an SVM to detect depression.
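The final counting step of a bag-of-words style descriptor can be sketched as follows. This example assumes the codebook has already been learnt by clustering, which is not shown. It assigns each of a sample's cluster centres to its nearest codebook entry by Euclidean distance, which is an assumed reading of “fit within each codebook expression”. The function names are hypothetical.

```python
import math


def nearest(codebook, descriptor):
    """Index of the codebook entry closest to the descriptor."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], descriptor))


def bag_of_dynamics(codebook, sample_centres):
    """Histogram of how many of a sample's cluster centres map to each codebook entry."""
    histogram = [0] * len(codebook)
    for centre in sample_centres:
        histogram[nearest(codebook, centre)] += 1
    return histogram
```

The resulting histogram has one entry per codebook expression, so every sample yields a feature vector of the same length regardless of video duration.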

Joshi et al. [34] extend this approach by combining a “holistic body analysis”, similar to the BoB method, with relative body part features. The holistic analysis uses STIPs for whole body motion analysis. The relative body part features represent the movement of the head and limbs relative to the participant’s trunk, represented as polar histograms.

Applying the same essential method as Joshi et al., Song et al. [51] present a method for learning a codebook of facial and bodily micro-expressions. They identify micro-expressions by extracting STIPs over very short time intervals (e.g. a few hundred milliseconds), then, as Joshi et al. do, they compute local spatio-temporal features around the STIPs and learn a codebook of expressions based on these local features. Finally, for each sample they generate a Bag-of-Words style feature vector based on the codified micro-expressions present in the sample.

Distress Behaviour Descriptors

Psychology literature describes specific behaviours that are correlated with psychological distress and disorders. For example, Fairbanks et al. [23] find self-adaptors and fidgeting behaviour to be correlated to psychological disorders. Based on this work, Scherer et al. [46] evaluate the use of these behaviours for automatic distress detection. Specifically, they manually annotate their dataset for hand self-adaptor behaviours and fidgets, including hand tapping, stroking, grooming, playing with hands or hair, and similar behaviours. To identify whether regions are relevant to these behaviours they also annotate these behaviours with categories such as head, hands, arms, and torso, and then extract statistical information such as the average duration of self-adaptors in each region. They also annotate leg fidgeting behaviour such as leg shaking and foot tapping; again, they use statistical derivations of these behaviours as features for detection.
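Given manual annotations of this kind, the per-region statistics are straightforward to compute. The sketch below assumes each annotation is a `(region, start_seconds, end_seconds)` tuple; that format and the function name are hypothetical, not the annotation scheme of Scherer et al.

```python
from collections import defaultdict


def self_adaptor_stats(annotations):
    """Count and mean duration of annotated behaviours per body region.

    annotations: iterable of (region, start_seconds, end_seconds) tuples.
    """
    durations = defaultdict(list)
    for region, start, end in annotations:
        durations[region].append(end - start)
    return {
        region: {"count": len(values), "mean_duration": sum(values) / len(values)}
        for region, values in durations.items()
    }
```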

Whilst Scherer et al. manually annotated self-adaptors and fidgets, Mahmoud et al. [40] present an automatic detector of fidgeting, and similar behaviours, based on a novel rhythmic motion descriptor. They extract SURF interest point tracklets from colour and depth data and then apply their novel rhythmic measure to check similarity among cyclic motion across tracklets. Rhythmic motion is then localised based on Kinect skeletal regions and classified as one of four classes: Non-Rhythmic, Hands, Legs, and Rocking.

Multi-Modal Fusion

Combining modalities for prediction has proven effective across a variety of modality combinations [58, 35, 16]. There are four primary types of fusion: feature fusion such as feature vector concatenation, decision fusion such as majority vote, hybrid fusion which uses both, and deep learning fusion which merges inner representations of features within a deep learning architecture. The deep learning fusion method differs from feature fusion as the features are provided to separate input layers and only merged after inner layers, but before decision layers.
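The two simplest fusion types can be sketched in a few lines. This is a minimal illustration with hypothetical function names: feature fusion concatenates per-modality vectors before a single classifier is applied, while decision fusion takes a majority vote over per-modality predictions.

```python
from collections import Counter


def feature_fusion(*feature_vectors):
    """Feature-level fusion: concatenate per-modality feature vectors."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


def decision_fusion(predictions):
    """Decision-level fusion: majority vote over per-modality predictions.

    Ties resolve to the label that appears first in the input.
    """
    return Counter(predictions).most_common(1)[0][0]
```

A hybrid scheme would pass the output of `feature_fusion` to its own classifier and include that classifier's prediction in the list given to `decision_fusion`.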

Song et al. [51] combine micro-expressions from facial and bodily modalities with sample-level audio features. They evaluate three methods: early fusion by concatenating audio features to features from each visual frame, early fusion using a CCA [30] kernel, and late fusion based on voting, where the per-frame predictions from the visual modalities are averaged over the sample and then combined with the audio prediction. Dibeklioglu et al. [16] fuse facial, head, and vocal modalities using feature concatenation. They extend on this by performing feature selection on the concatenated vector, rather than the source vectors, using the Min-Redundancy Max-Relevance algorithm.


Alghowinem et al. [1] perform hybrid modality fusion, both combining feature vectors from modalities and performing a majority vote on individual modality classification predictions. The vote fusion is based on three classifiers, two mono-modal classifiers and one feature fusion classifier.

Huang et al. [31] train long short-term memory (LSTM) models on facial, vocal, and text modalities and then use a decision level fusion, via an SVR, to predict the final regression values. This paper differs from many deep learning approaches as it uses decision level fusion, rather than having the deep learning models find patterns across feature types.

Temporal contextualisation

Gong & Poellabauer [25] present another approach to usage of multiple modalities where the text modality provides contextualisation for features from the audio-visual modalities. They apply a topic modeling method for depression detection using vocal and facial modalities, where features are grouped based on the topic being responded to within an interview. The authors suggest that without topic modeling the features are averaged over too large a time period such that all temporal information is lost. By segmenting the samples they aim to retain some of the temporal information. Arbitrary segmentation would not necessarily be useful, thus their use of logical segmentation based on topic.
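Topic-based segmentation can be illustrated by aggregating a per-frame feature within each topic rather than over the whole recording. The segment format `(topic, start_frame, end_frame)` with an exclusive end, the use of the mean as the aggregate, and the function name are assumptions for this sketch.

```python
from collections import defaultdict
from statistics import fmean


def topic_features(frame_values, topic_segments):
    """Average a per-frame feature within each interview topic.

    frame_values: one feature value per frame.
    topic_segments: (topic, start_frame, end_frame) tuples, end exclusive.
    """
    grouped = defaultdict(list)
    for topic, start, end in topic_segments:
        grouped[topic].extend(frame_values[start:end])
    return {topic: fmean(values) for topic, values in grouped.items() if values}
```

Averaging the same values over the full recording would give a single number, hiding the difference between topics that the per-topic means retain.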

Deep Learning

Much of the recent work in distress detection leverages advances in deep learning, especially advances related to recurrent architectures, such as LSTMs, which can model sequence data well. In distress detection most modalities provide sequence data: audio-visual streams, natural language, or sequence descriptors of the data (such as FACS AUs).

Chen et al. [9] present a method utilizing text, vocal, and facial modalities for emotion recognition. They explore the use of existing vocal features such as fundamental frequency analysis, auto-learnt features based on pre-trained CNNs to extract vocal and facial features, and word embedding features for the text. The auto-learnt facial features are derived from existing facial appearance data already extracted from the raw videos (i.e. they do not have their CNNs process raw frames to extract features). They then experiment with SVR and LSTM models to evaluate the temporal value of the LSTM model. They find that fused auto-learnt features from all modalities in combination with the LSTM model provides the greatest performance.

Yang et al. [59] present an interesting multi-level fusion method that incorporates text, vocal, and facial modalities. They design a Deep Convolutional Neural Network (DCNN) to Deep Neural Network (DNN) regression model that is trained, separately, for audio and video modalities. They also train their regression models separately for depressed and non-depressed participants, resulting in four separate DCNN - DNN models. They use the openSMILE toolkit for their audio features, which is common among many of the vocal modality methods (e.g. Alghowinem et al. from above), and FACS AUs for their visual features. They derive a temporally relevant feature vector from the set of all AUs by calculating the change in AUs over time. Their text modality model uses Paragraph Vectors in combination with SVMs and random forests and predicts a classification task rather than a regression task. Finally, they fuse, using DNNs, the audio and visual model predictions per training set, i.e. the depressed participant trained models are fused and the non-depressed models are fused. They then use another DNN to fuse the two fused regression predictions (i.e. depressed and non-depressed) and the classification prediction from the text modality. While they use DNNs to fuse decisions at multiple levels, this is a decision-level fusion method, not a deep learning fusion method, as they fuse the regression predictions from each model rather than the inner layer outputs.

Yang et al. [58] present a second paper which utilises the same structure of DCNNs and DNNs, with two significant changes: firstly, the video features are changed to a new global descriptor they present, the “Histogram of Displacement Range (HDR)” which describes the amount of movement of facial landmarks, and secondly, the text modality now uses the same DCNN - DNN architecture to perform regression based on text data. Having changed the output of the text model the final fusion is a regression fusion using the same method as the audio visual models were fused in the first paper.

Finally, no method, that I am aware of, uses deep learning end-to-end for automatic distress detection such that features are learnt from raw data for predicting distress. All methods apply deep learning on top of audio-visual descriptors and hand-crafted features. One of the core difficulties is the relative sparsity of existing data and thus the restricted ability to learn interesting features. Therefore, continued development of features and approaches to behavioural representation is valuable and able to contribute to methods that use deep learning.

2.2 Body Gestures

Body gestures present a number of challenges including deriving features directly from pixels, extracting positional data, detecting gestures within the data, and designing representative features based on gestures. This section focuses on previous work on body gesture and pose representation, which forms the basis of my feature definition and generation.

There are three primary approaches to body gesture representation within the affective computing literature: traditional computer vision feature detection algorithms such as STIPs [36, 34], pose estimation [8] from standard video recordings, and use of specialised 3D capture equipment such as Kinects [40].

Recording Type - 2D vs 3D

The first two approaches use standard 2D video recordings to extract the base data for calculating features. The third approach uses 3D video recordings to enable more detailed analysis of body movement via depth. The most common 3D capture equipment used within related work is Kinects. Kinects have an added benefit that they provide skeletal key points (i.e. joint locations) within a recording, thus enabling more accurate representation of body data.

Extracting Body Representations

Given a video recording, 2D or 3D, with or without skeletal information, the next phase is feature extraction. Feature extraction approaches fall into two primary categories, which apply to both 2D and 3D videos: generic video feature representations and body modality specific representations.

Generic Interest Points

The first approach does not target specific body areas or movement, instead it assumes that the only subject within the video is the subject the model is concerned with, i.e. the participant, and extracts generic video features from the recording to represent the body and gestures. Examples of these generic features are Space-Time Interest Points (STIPs), SURF, Histogram of Gradients (HOGs) and Optical Flow (HOFs), among others. These features can be extracted from colour and depth recordings such that they are applicable to both 2D and 3D recordings.

While some approaches apply these generic features (or rather, derivations of these features) directly to prediction tasks [36, 51] (examples are discussed in Section 2.1), others aim to incorporate heuristics and information based on body parts. For example, interest points can be segmented based on location within the frame to heuristically distinguish body parts (e.g. head and arms).


Another example is Mahmoud et al. [40], who use SURF keypoints from colour and depth images across a video to define tracklets for their rhythmic motion measure. Their dataset is based on Kinect captured videos to provide depth, which also provides skeletal regions such as feet and head. They use these Kinect skeletal regions to localise their keypoints’ motion.

Body Modality Specific

The second approach extracts body modality specific interest points (e.g. skeletal models) to calculate features from. There are two primary methods for extracting these interest points: joint estimations from specialised capture equipment such as Kinects and pose estimation from existing video recordings. Such skeletal models have gained popularity in the past few years for action recognition tasks [54, 17, 10, 48, 56, 18].

In this dissertation I use pose estimation to extract body interest points (e.g. joints) from each frame in a two-dimensional video. In the past three to four years there has been substantial work on these pose estimation systems [57, 7, 49, 60, 39]. The current state-of-the-art is OpenPose by Cao et al. [8]. OpenPose uses skeletal hierarchy and part affinity fields to estimate pose interest points (i.e. joints) relative to each other and determine both part and orientation based on the direction of the proposed limb. The state-of-the-art model, at the time of writing, generates 25 interest points identifying the location of all major joints and more detailed information regarding head and feet. OpenPose can also be used to extract facial landmarks and detailed hand models.
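As a sketch of working with pose estimation output, the example below parses a per-frame JSON document of the kind OpenPose writes, where each detected person has a flat `pose_keypoints_2d` list of `x, y, confidence` triples, and computes the speed of one joint between two frames. Taking only the first detected person, the pixel-per-second unit, and the function names are assumptions of this example, not a description of my method.

```python
import json
import math


def parse_keypoints(frame_json):
    """Return [(x, y, confidence), ...] for the first detected person, or []."""
    people = json.loads(frame_json)["people"]
    if not people:
        return []
    flat = people[0]["pose_keypoints_2d"]
    # The flat list is grouped into (x, y, confidence) triples, one per keypoint.
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def joint_speed(previous, current, joint, fps):
    """Speed of one joint, in pixels per second, between two consecutive frames."""
    (x0, y0, _), (x1, y1, _) = previous[joint], current[joint]
    return math.hypot(x1 - x0, y1 - y0) * fps
```

In practice, keypoints with low confidence would likely need to be filtered or interpolated before computing movement, since missed detections are typically reported at the origin.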

2.3 Summary

I have reviewed methods for automatic depression detection and the variety of modalities and models used within the field. I have also discussed, in broad terms, methods used for working with body data within related fields, from variation in capture equipment to difference in core interest points.

The first component of the review outlined the gap within the existing use of the body modality that my proposed methodology addresses, while the second component outlined methods for working with the body modality, and specifically technology that I use to implement my methodology.

3.1 Existing Related Datasets

There are a number of relevant existing datasets; however, none satisfy all the requirements of my research, namely that they are available, include psychological distress labels, and include source videos.

I provide an overview of four existing datasets: two clinical datasets, a distress focused dataset using facial, auditory, and speech modalities, and one affective computing body gesture dataset. Each of these datasets, and their related work, provide useful insight for my data collection and model development research.

3.1.1 Clinical

Joshi et al. describe a clinical dataset [33] for depression analysis that contains recordings of participants’ upper bodies and faces during a structured interview. It includes clinical assessment labels for each participant, including depression severity. This dataset was collected at the Black Dog Institute [6], a clinical research institute in Australia.

Joshi et al. [34] describe a clinical dataset collected at the University of Pittsburgh containing full body recordings of participants in interviews. It uses Major Depressive Disorder (MDD) [4] diagnosis and Hamilton Rating Scale of Depression (HRSD) [29] as labels. The authors validate the use of full body features for depression prediction within this dataset.

Given the clinical nature of these datasets they are not publicly available.

3.1.2 Audio-Visual Distress

Gratch et al. introduce the Distress Analysis Interview Corpus (DAIC) dataset [26] containing audio-visual recordings of interviews with participants with varying levels of distress. Participants are assessed for depression, post traumatic stress disorder (PTSD), and anxiety based on self-evaluation questionnaires. The dataset provides audio data (and derivations such as transcripts) and extracted facial landmark data, though it does not provide source video data or pose data. Source video recordings are rarely shared in distress focused datasets due to the sensitivity of the data.

The dataset contains four participant interview structures: face-to-face interviews with a human, teleconference interviews with a human interviewer, “Wizard-of-Oz” interviews with a virtual agent controlled by an unseen interviewer, and automated interviews with a fully autonomous virtual agent.

This dataset informs my dataset collection via the labels it contains and their interview structures. The DAIC method has participants complete a set of questionnaires, then participants are interviewed and recorded, and finally the participant completes another set of questionnaires. Within the interview the interviewer asks a few neutral questions to build rapport, then asks questions related to the participant’s distress symptoms, and finally asks some neutral questions so the participant can relax before the interview finishes.

3.1.3 Body Gestures

Palazzi et al. introduce an audio-visual dataset of dynamic conversations between different ethnicities annotated with prejudice scores. Videos in this dataset contain two participants interacting in an empty confined space. Participants move around the space throughout the session, enabling analysis of body language affected during conversation.

This dataset highlights, and focuses on, the effect attributes of a counterpart, such as race, gender, and age, have on a person’s behaviour. In this dataset counterparts are explicitly non-uniform. Whereas, in my dataset the interviewer (i.e. counterpart for a participant) is the same for all interviewees. This is useful in controlling for the counterpart variable in human behaviour, supporting the isolation of correlations between distress and behaviour.

Datasets such as this one are not useful for my current research question. However, as they provide source videos, they present opportunities for future work investigating my features’ generalisability to other domains.

3.2 Method

3.2.1 Design

This dataset is designed to enable investigation of the body modality for use in automatic detection of distress for early screening. This is a non-clinical dataset.

Its source data is audio-visual recordings of conversational interviews. The recordings capture the whole body of participants to enable features based on a whole body modality. These interviews involve questions related to distress to elicit emotive responses from participants, however the responses to these questions are irrelevant to the core data. The interviews use a conversational style to best enable naturalistic gestures from participants.

Labels are scored results from established self-evaluation questionnaires for assessing distress and personality traits, as well as demographic labels such as gender. The distress questionnaires are: the PHQ-8 [38, 37] for depression, GAD-7 [52] for anxiety, SSS-8 [24] for somatic symptoms, and the PSS [11] for perceived stress. Personality traits are measured using the Big Five Inventory [32].

3.2.2 Participants


I advertised for participants via University of Cambridge email lists, student social media groups, classified sections of websites specific to the Cambridge area, such as Gumtree [27], and paper fliers posted around the University of Cambridge. Participants were directed to a website describing the study along with a participant registration form.

The registration form captured demographic data and two self-evaluation psychological distress questionnaires. Demographic data captured includes gender, age, ethnicity, and nationality. Gender and age were required while ethnicity and nationality were not. The two psychological distress questionnaires were the PHQ-8 [38, 37] and GAD-7 [52].


In total 106 people registered to participate and 35 were invited to the face to face session. The participant population is balanced with regards to distress levels and gender. (Non-binary/other was given as an option in the registration form and a number of people registered with this option. However, none of those people met the distress level criteria and they were thus not selected for an interview.) Distress level balancing aims to include participants at the extents of the distress spectrum such that there is a distinct difference between the high and low distress populations. Participant distress level is selected based on PHQ-8 and GAD-7 questionnaire responses such that participants are balanced between high (i.e. major or severe) and low (i.e. mild) distress. Of the invited participants, there are 18 with high distress and 17 with low distress.


Participants are compensated for their time with a £15 voucher.

3.2.3 Face to Face Session

During the face to face session participants sign a research consent form outlining the interview process, complete a battery of five established psychology questionnaires evaluating distress levels and personality traits, are interviewed by a researcher, and finally sign a debrief research consent form that outlines the full purpose of the study. Participants are not aware of the focus of the research (i.e. body modality analysis) before the interview, such that their affectations are natural.

To achieve the conversational interview dynamic the interviewer asks general questions regarding the participant’s life and further encourages the participant to elaborate. For example, the interviewer asks “can you tell me about one time in your life you were particularly happy?” and then asks follow up questions regarding the example the participant provides. The interview style and structure is inspired by the DAIC dataset. In developing the interview structure and questions I also drew on training documents provided by Peer2Peer Cambridge [44], an organisation dedicated to peer support for mental health which trains students in general counseling.

So as to avoid influencing participants’ behaviour the interviewer remains as neutral as possible during the interview, while still responding naturally such that the participant is comfortable in engaging in the questions. Furthermore, to ensure neutrality the interviewer is explicitly not aware of the distress level of participants before the interview and has no prior relationships with any participant.

Technical Faults

16 interviews are interrupted due to technical faults (the camera disconnected). Recordings that are interrupted are treated as multiple individual samples within the dataset (though they remain connected by their participant ID).

3.3 Preliminary Analysis

The dataset contains a total of 35 interviewed participants with a total video duration of 7 hours 50 minutes and 8 seconds. Each participant provides responses to 5 questionnaires, including 2 responses to both the PHQ-8 and GAD-7 questionnaires as participants completed both during registration and the face-to-face session.

Though significantly more people registered for participation, I include only those interviewed in this analysis.

3.3.1 Validation Criteria

There are three primary criteria the dataset should satisfy with regards to label results:

  1. The psychological measures statistically match previous work and published norms (i.e. the distribution within the participant population is similar to that of the general population). Published norms are the standard values for questionnaire results as defined by the psychology literature. They aim to be representative of the general population and thus provide a benchmark for other work (generally within psychology) to check smaller populations’ results against.

  2. There are no confounding correlations. For example, gender correlating highly to depression would indicate a poorly balanced dataset and would be confounding for depression analysis.

  3. The labels are well balanced to enable machine learning.

As the common measure of similar distress detection research is depression I focus primarily on it for this validation.

General statistics regarding the questionnaire and demographic results within the dataset are provided in Table 3.1. Covariance is presented as normalized covariance values, also known as the correlation coefficient.

Label Possible range Max Min Mean Median Std. Depression covariance

Depression 0–24 19 0 7.43 8 5.87 -
Anxiety 0–21 19 0 7.00 8 5.53 86.15%
Perceived stress 0–40 30 1 18.17 18 8.03 84.00%
Somatic symptoms 0–32 27 1 9.06 7 6.94 74.16%
Extraversion 0–32 31 3 16.37 17 6.33 -30.49%
Agreeableness 0–36 34 12 25.67 26 5.60 -42.21%
Openness 0–40 39 7 27.29 28 6.77 4.29%
Neuroticism 0–32 31 1 16.86 18 8.60 80.00%
Conscientiousness 0–36 36 10 21.46 21 6.87 -46.41%
Gender - - - - - - 9.47%
Age - 52 18 25.40 22 9.1 -11.09%

Table 3.1: General statistics regarding questionnaire and demographic results within the dataset. The “Depression covariance” column is most important as it demonstrates the independence of variables with regards to depression (for example, it shows that age and gender are not confounding of depression).
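
As an illustration of the normalised covariance (correlation coefficient) reported in Table 3.1, the following is a minimal sketch of the calculation. The function name and the example scores are hypothetical and are not taken from the dataset.

```python
import math


def normalised_covariance(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical questionnaire scores for five participants (illustrative only).
depression = [2, 5, 8, 12, 19]
anxiety = [1, 6, 7, 13, 17]
print(f"{normalised_covariance(depression, anxiety):.2%}")
```

A value near 100% or -100% indicates a strong linear relationship with depression, while a value near 0% indicates that the two labels vary largely independently.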
Published Norms

A comparison of the mean values for distress and personality measures between my dataset and the published norms is presented in Table 3.2. While there are differences, the measures are generally in line with the published norms. The dataset has slightly higher mean depression and anxiety scores, a substantially higher mean perceived stress score, and a lower mean somatic symptoms score. Depression, extraversion, and neuroticism measures are particularly close to their published norms. The dataset means for agreeableness and openness, however, are substantially greater than the published norms (by over 10% of the possible range for those measures).

Label Dataset mean Norm mean Source

Depression 7.43 6.63 Ory et al. [42]
Anxiety 7.00 5.57 Spitzer et al. [52]
Perceived stress 18.17 12.76 Cohen et al. [11]
Somatic symptoms 9.06 12.92 Gierk et al. [24]
Extraversion 16.37 16.36 Srivastava et al. [53]
Agreeableness 25.67 18.64 Srivastava et al. [53]
Openness 27.29 19.61 Srivastava et al. [53]
Neuroticism 16.86 16.08 Srivastava et al. [53]
Conscientiousness 21.46 18.14 Srivastava et al. [53]

Table 3.2: Comparison of the mean questionnaire values within my dataset to the published norms. This shows that the population distribution, with regards to these distress and personality measures, is generally in line with the broader population.
Confounding Correlations

While the other distress measures (anxiety, perceived stress, and somatic stress) are strongly correlated with depression, the personality measures have below 50% covariance with the exception of neuroticism which has an 80% covariance. Furthermore, the demographic measures, gender and age, are negligibly correlated, with 9.47% and -11.09% covariance, respectively. This suggests that the labels are not confounding of each other.

Label Balance

There are 17 participants below the mean depression result (7.43) and 18 participants above. The mean depression score of the group below the overall mean is 2.18 while the score for those above is 12.39. Ideally for machine learning the dataset’s distribution would include more participants at the severe depression end of the spectrum, though the present distribution still places the below group firmly in the “mild” category and the above group in the “major depression” category.

There are 18 male and 17 female participants. As the gender covariance shows, the split on the depression measure and the split on gender are not the same participants (gender is balanced across the distress spectrum).

3.3.2 Difference from Registration

Participants complete the PHQ-8 and GAD-7 questionnaires during registration and during the interview process. These questionnaires are temporal; specifically, they relate to the participant’s mental state in the past two weeks. Given this, some difference between registration and interview results is expected.

With the exception of a small number of outliers, participants were generally consistent in self-evaluation between registration and interview. PHQ-8 responses have a mean difference of 0.89 while GAD-7 responses have a mean difference of 0.63. This supports the selection of participants based on temporal self-evaluation questionnaire results.

3.3.3 Interview Meta Statistics

There is a total of 7 hours 50 minutes and 8 seconds of participant interview recordings, with a mean interview duration of 13 minutes and 25 seconds. The standard deviation of interview duration is 3 minutes and 20 seconds and the median interview duration is 13 minutes and 8 seconds. Depression score and interview duration are not correlated, with a covariance of 6.95%. Furthermore, interview duration is not correlated with any questionnaire result (i.e. distress or personality measure): all absolute covariance values are below 25%, which provides confidence in the reliability of the data.

3.4 Summary

I have introduced a new audio-visual dataset containing recordings of conversational interviews between participants and a researcher, and annotated with established psychology self-evaluation questionnaires for depression, anxiety, somatic symptoms, perceived stress, and personality traits. This dataset involves 35 participants and 65 recordings (due to recording interruptions) with a total video duration of 7 hours 50 minutes and 8 seconds.

There are a number of relevant existing datasets including clinical datasets which contain body gestures but are inaccessible beyond their home institute, distress datasets that contain facial expressions and speech modalities but no body gestures or source videos, and video datasets containing body gestures but lacking distress labels. While these datasets inform my dataset design and collection, no dataset I am aware of satisfies the criteria for research on body gesture modalities for predicting distress.

An analysis of the questionnaire results in the dataset shows that they are aligned with the published norms of the psychology literature, are not confounded by factors such as gender or age, and have a useful and balanced distribution across the distress spectrum.

4.1 Pose Estimation

Figure 4.2: Example of OpenPose estimation output of the participant interview position. Subject is not a study participant. Pose points are indicated by the green dots.

Pose estimation extracts per-frame approximations of skeletal joint locations. This enables more accurate gesture analysis than direct pixel based approaches (such as STIPs per Joshi et al.’s method [36]).

I process each video to extract per-frame skeletal pose data using OpenPose [8] by Cao et al. (OpenPose is discussed in more detail in Section 2.2) as it is the current state-of-the-art in pose estimation. I use the BODY_25 pose model provided with OpenPose. As the name suggests, this model generates 25 pose points corresponding to a subject’s joints. OpenPose also extracts more detailed hand data that provides estimations for each joint in the hand. Figure 4.2 presents an example of extracted pose points.

4.2 Data Preparation

I perform three data preparation steps: filtering, recovery, and smoothing. Filtering removes dataset-level noise, recovery fixes outlier noise (where outliers are detection failures for specific pose points, but not the whole body), and smoothing reduces detection noise.

Extracted pose estimation data has two primary forms of noise: individual frames, or short segments of frames, where detection of a person, or part of a person, is lost, and detection points moving around slightly even if the person is static. Manual review of a selection of detection loss cases shows no consistent cause (for both complete detection loss and partial loss). It appears to be the deep learning model failing inexplicably on some frames.

4.2.1 Filtering

The absence of, or low representation of, gestures is relevant information for prediction tasks (simply put, if more activity is relevant then less activity must be too). However, the absence of gestures can also be due to a lack of opportunity to express gestures, such as when a sample is too short. These short samples lead to dataset-level sample noise that can hinder predictive models.

Samples shorter than 1 minute are excluded as it is difficult to provide enough opportunity for gesture dynamics in less than 1 minute of video. 12 out of 65 samples within the dataset are shorter than 1 minute.

4.2.2 Detection Recovery

Pose estimation within my dataset contains frames where certain pose points (e.g. an arm or a leg) are not detected. In these cases OpenPose returns 0 for each pose point not detected. Manual review of a number of instances shows that the joint does not move much, if at all, during the lost frames. However, the pose point “moving” to position 0 causes significant noise in feature calculation.

Therefore, I perform detection “recovery” to infer the position of the pose point in the missing frames, thus providing a smoother pose point movement. I recover the position by linearly interpolating the pose point’s position between the two closest detected frames temporally surrounding the lost frame(s).

It is worth noting that this pose point detection failure is different to full detection failure where the whole participant is not detected within a frame. I do not attempt to recover such full failure cases as the failure is more serious and cause is ambiguous. I do not want to introduce stray data by “recovering” significantly incorrect data. Partial failures suggest a simple failure of OpenPose to extend through its body hierarchy. Since other pose points are still detected I am more confident that it is not a “legitimate” failure to do with participant position. Instead, full failure cases are treated as “separators” within a sample. Gesture detection occurs on either side but not across such separators.
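
The partial-failure recovery described in this section can be sketched as follows. This is a minimal illustration rather than the implementation used in the dissertation: the function name is hypothetical, a missing detection is assumed to be represented as the position (0, 0), and gaps at the very start or end of a track (which have no surrounding detections) are simply left unchanged. Full detection failures are outside its scope, as they are treated as separators rather than recovered.

```python
def recover_track(track, missing=(0.0, 0.0)):
    """Linearly interpolate missing positions between the closest detected frames.

    `track` is a list of (x, y) positions for one pose point, one entry per frame.
    Assumption: an undetected pose point is stored as `missing`.
    """
    recovered = list(track)
    detected = [i for i, point in enumerate(track) if point != missing]
    for left, right in zip(detected, detected[1:]):
        gap = right - left
        if gap <= 1:
            continue  # Adjacent detections, nothing to recover.
        (x0, y0), (x1, y1) = track[left], track[right]
        for i in range(left + 1, right):
            weight = (i - left) / gap
            recovered[i] = (x0 + weight * (x1 - x0), y0 + weight * (y1 - y0))
    return recovered


# Three lost frames between two detections; the leading and trailing
# losses have only one neighbouring detection and are not interpolated.
track = [(0.0, 0.0), (10.0, 20.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0),
         (18.0, 28.0), (0.0, 0.0)]
print(recover_track(track))
```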

4.2.3 Detection Smoothing

To extract more accurate and relevant features I smooth the pose data by removing high frequency movement within pose points. Such high frequency movement of pose points is caused by OpenPose’s detection noise (i.e. the exact pose point might move back and forth by a few pixels each frame while its target joint is static). Thus smoothing is not smoothing human movement, but rather smoothing pose extraction noise. To smooth the data I apply a Fourier transform filter to each dimension for each pose point. The smoothing steps are:

  1. Separate the data into x and y positions and smooth them (i.e. apply the following steps) independently.

  2. Convert the position sequence data using a Fourier transform over fixed-length windows.

  3. Set the medium and high frequency values (all frequencies above the first five) to 0.

  4. Invert the Fourier transform on the updated Fourier values to reconstruct the smoothed pose data.

  5. Concatenate the smoothed windows.
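
The smoothing steps above can be sketched for a single coordinate of a single pose point as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the default window length of 64 frames is a placeholder rather than the value used in the dissertation, "the first five" frequencies are taken to include the constant (zero-frequency) term, and a naive transform is used for clarity rather than speed.

```python
import cmath
import math


def lowpass_window(values, keep=5):
    """Zero every frequency above the first `keep` and reconstruct the window."""
    n = len(values)
    # Forward discrete Fourier transform (naive O(n^2) version for clarity).
    spectrum = [
        sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]
    # For real-valued input, bins k and n - k describe the same frequency,
    # so they are kept or zeroed together.
    for k in range(n):
        if min(k, n - k) >= keep:
            spectrum[k] = 0
    # Inverse transform; any imaginary part is numerical residue.
    return [
        (sum(c * cmath.exp(2j * math.pi * k * t / n)
             for k, c in enumerate(spectrum)) / n).real
        for t in range(n)
    ]


def smooth_positions(values, window=64, keep=5):
    """Smooth one coordinate (x or y) of one pose point, window by window.

    Assumption: `window` is a placeholder value chosen for this sketch.
    """
    smoothed = []
    for start in range(0, len(values), window):
        smoothed.extend(lowpass_window(values[start:start + window], keep))
    return smoothed
```

Applied to a slow movement with frame-to-frame jitter superimposed, the jitter sits in the zeroed high-frequency bins and is removed, while the slow movement is reconstructed.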

4.3 Feature Extraction

I extract two types of features: aggregate features across the whole body and localised features for specific body parts, which I term “body localisations”. The localisations are: head, hands, legs, and feet. As the participants are seated, the body trunk does not move substantially enough for a gesture to be detected. The whole-body features include aggregations of the body localisation features as well as features incorporating the whole body.

By including localised and non-localised features I can model the information provided by individual body parts and also the overall behaviour.

Gesture definition

I define a gesture as a period of sustained movement within a body localisation. Multiple body localisations moving at the same time are treated as multiple individual, overlapping, gestures.

A gesture is represented as a range of frames in which the target body localisation has sustained movement.

For example, if a participant waves their hand while talking it would be a hand gesture. If they were to cross their legs it would register as a gesture in both the leg and feet localisations.

Whole Body Features
  • Average frame movement - the per-frame average movement of every tracked pose point (i.e. whole body). This is the only feature that is not based on detected gestures.

  • Proportion of total movement occurring during a gesture - the proportion of total movement (i.e. the whole body) that occurred while some body localisation was performing a gesture.

  • Average gesture surprise - gesture surprise is calculated per-gesture as the elapsed proportional time since the previous gesture in the same localisation, or the start of the sample (proportional to length of the sample) for the first gesture. This overall feature averages the surprise value calculated for every gesture across all tracked localisations. I use the term “surprise” as the feature targets the effect on a gesture level basis, rather than the sample level. This is not a measure of how much of a sample no gesture is occurring, as it is normalised on both the sample length and the number of gestures. To illustrate further: if 2 gestures occurred within a sample such that 80% of the sample duration had no gesture occurring, the average gesture surprise would be 40%. Whereas, if there were 100 gestures, still with 80% of the sample with no gesture occurring, the average surprise would be 0.8%, even though both samples had the same proportion without any gesture occurring. This matches the intuition that each gesture within 100 evenly spaced gestures would be unsurprising as they were regularly occurring, whereas the 2 evenly spaced gestures would be surprising because nothing was happening in between.

  • Average gesture movement standard deviation - the standard deviation of per-frame movement within a gesture is averaged across all detected gestures. This is intended to indicate the consistency of movement intensity through a gesture.

  • Number of gestures - total number of detected gestures across all tracked localisations.
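
To make the gesture surprise feature concrete, the following is a minimal sketch of the per-localisation calculation. The function name is hypothetical, and it assumes that the elapsed time is measured from the end of the previous gesture and that gestures are given as frame ranges with an exclusive end frame. The two example samples have the same proportion of frames without a gesture (80%) but a different number of gestures.

```python
def average_gesture_surprise(gestures, sample_length):
    """Average proportional gap preceding each gesture in one localisation.

    `gestures` is a list of (start_frame, end_frame) ranges sorted by start,
    with end_frame exclusive; `sample_length` is the sample's frame count.
    """
    if not gestures:
        return None  # Absent feature; the placeholder value is left to the caller.
    previous_end = 0  # The first gesture is measured from the start of the sample.
    surprises = []
    for start, end in gestures:
        surprises.append((start - previous_end) / sample_length)
        previous_end = end
    return sum(surprises) / len(surprises)


# Both samples are 1000 frames long with 800 frames outside any gesture.
two = [(400, 500), (900, 1000)]
hundred = [(10 * k + 8, 10 * k + 10) for k in range(100)]
print(average_gesture_surprise(two, 1000))
print(average_gesture_surprise(hundred, 1000))
```

The first sample yields an average surprise of approximately 0.4 and the second approximately 0.008, reflecting that regularly occurring gestures are individually unsurprising.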

Localised Features

Whole body and localised features are concatenated in the same feature vector. The following set of localised features is included once for each localisation used in the vector.

  • Average length of gestures - the average number of frames per gesture.

  • Number of gestures - the total number of gestures, irrespective of gesture length.

  • Average per-frame gesture movement - the average movement across all gestures.

  • Total movement in gestures - the total amount of movement affected by the detected gestures.

  • Average gesture surprise - the average surprise across all gestures.


All features are normalised such that the length of the sample does not affect the results.

I normalise sum based features (e.g. gesture length, gesture count, total movement, etc) against the total number of frames in the sample and against the total number of gestures for gesture average values. For example, gesture surprise is normalised against the total number of frames and normalised a second time against the total number of gestures.

Absent Features

If a feature has no inputs (such as when no gesture was detected within a body localisation) its value is set to to enable models to incorporate the absence of movement in their predictions.

4.3.1 Body Localisation

Gestures in different body localisations provide distinct information. Aggregating gestures from different localisations provides a general representation of this information; however, having features localised to specific body localisations provides further information, without significantly increasing the dimensionality.

I define four localisations, hands, head, legs, and feet, based on specific pose estimation points.


Hands

I use the finger tip points (including thumb) detected by OpenPose as the gesture detection points. This means wrist based gestures (e.g. rolling of a hand) are detected. Each hand is processed separately, that is, I detect gestures and calculate per-gesture features independently in each hand. These gestures are then aggregated into a single body localisation feature vector. This makes the final features side agnostic, ensuring that differences in dominant hand between participants will not affect the result.


Head

While OpenPose includes face detection, providing detailed facial landmarks, I use the general head position points provided by the standard pose detection component.


Legs

I represent legs using the knee pose points from OpenPose. As with hands, I process gestures in each leg independently and then aggregate to a single feature vector.


Feet

Each foot comprises four pose points within OpenPose. Aggregation is the same as for hands and legs.


I do not include a “trunk” localisation as there is minimal movement in the trunk, given the seated nature of the dataset interviews. Though some participants may lean forwards and backwards, these movements are not represented well within my data as the camera faces the participants directly such that forwards and backwards leaning would be towards the camera, thus requiring depth perception which is not included in my data. Side to side leaning is restricted by the arms of the participant’s chair. As such the localisations that are relatively free to move, those other than the trunk, are intuitively the most likely areas to provide predictive information.

4.3.2 Gestures

Gesture Detection

To detect gestures within a set of pose points (e.g. finger tips, knees, feet, head, etc) I scan the activity of the target points for sustained periods of movement. The gesture detection step takes cleaned per-frame pose estimations and outputs a collection of ranges of non-overlapping frames that contain gestures within the localisation.

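First, the per-frame absolute movement delta is calculated for each pose point. The movement is then averaged across all localisation pose points. Movement deltas are distances. Formally,

$$M_{p,t} = \lVert P_{p,t} - P_{p,t-1} \rVert, \qquad F_t = \frac{1}{N} \sum_{p=1}^{N} M_{p,t} \tag{4.1}$$

where $M_{p,t}$ is the amount of movement for pose point $p$ at time $t$, $P_{p,t}$ is the position value of pose point $p$ at time $t$, and $F_t$ is the averaged per-frame movement across all $N$ pose points of the localisation.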

Second, I average the movement of each frame within a window such that a small number of frames do not have a disproportionate effect on the detection. That is,

$$W_i = \frac{1}{l} \sum_{t = i \times l}^{(i+1) \times l - 1} F_t \tag{4.2}$$

where $W_i$ is the windowed average at window index $i$, $l$ is the length of the window, and $F_t$ is the average movement at frame $t$, from Equation 4.1. In this dissertation I use $l = 10$, i.e. a second of movement is represented by 3 windows. This value is experimentally chosen.
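
The two steps above (per-frame movement averaged over pose points, then averaged within windows) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the movement delta is assumed to be the Euclidean distance between consecutive positions, windows are assumed to be non-overlapping, and a trailing partial window is dropped.

```python
import math


def frame_movement(points_per_frame):
    """Average per-frame movement over all pose points of a localisation.

    `points_per_frame[t][p]` is the (x, y) position of pose point p at frame t.
    Returns one value per frame transition, i.e. len(points_per_frame) - 1 values.
    """
    movement = []
    for previous, current in zip(points_per_frame, points_per_frame[1:]):
        # Assumption: the movement delta is the Euclidean distance.
        deltas = [math.dist(a, b) for a, b in zip(previous, current)]
        movement.append(sum(deltas) / len(deltas))
    return movement


def window_average(movement, window_length=10):
    """Average movement within consecutive, non-overlapping windows."""
    full_windows = len(movement) // window_length  # Trailing partial window is dropped.
    return [
        sum(movement[i * window_length:(i + 1) * window_length]) / window_length
        for i in range(full_windows)
    ]


# Two pose points over two frames: one moves by 5 pixels, the other is static.
frames = [[(0.0, 0.0), (0.0, 0.0)], [(3.0, 4.0), (0.0, 0.0)]]
print(frame_movement(frames))
```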

Third, the detector iterates through the averaged windows until a window with an average movement above a threshold is found. The first frame of this initial window is considered the beginning of the gesture. The gesture continues until $n$ consecutive windows (I use $n = 3$, i.e. 30 frames, as an approximation of a second) are found below the movement threshold. The last frame of the final window above the movement threshold is considered the end of the gesture. This is provided more formally in Algorithm 1.

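for each window movement at index i do
    if the window movement is above the movement threshold then
        // Start the gesture on the first window that exceeds the movement threshold.
        if no gesture is in progress then
            start a gesture at the first frame of window i
    else if a gesture is in progress then
        // A gesture is completed after consecutive windows below the movement threshold.
        if the required number of consecutive windows are below the threshold then
            // The end of a gesture is the final window that exceeds the threshold.
            if … then
                append the gesture to the detected gestures
            // Reset to find the next gesture.
// Close the final gesture.
if a gesture is in progress then
    if … then
        append the gesture to the detected gestures
Algorithm 1 Gesture detection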

Having detected gestures in each body localisation I extract the features described above from each gesture and aggregate them to form the final feature vector.
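
The detection procedure described in this section can be sketched in Python as follows. This is a hedged illustration and not the dissertation's implementation: the function and parameter names are hypothetical, the threshold value is left to the caller, and any additional check applied before a gesture is appended in Algorithm 1 is omitted.

```python
def detect_gestures(windows, threshold, quiet_windows=3, window_length=10):
    """Return (first_frame, last_frame) ranges of sustained movement.

    `windows` holds the windowed average movement values in temporal order.
    A gesture ends once `quiet_windows` consecutive windows fall below `threshold`.
    """
    gestures = []
    start = None        # Index of the window that opened the current gesture.
    last_active = None  # Index of the latest window above the threshold.
    quiet = 0           # Consecutive windows below the threshold.

    def close():
        first_frame = start * window_length
        last_frame = (last_active + 1) * window_length - 1
        gestures.append((first_frame, last_frame))

    for i, movement in enumerate(windows):
        if movement > threshold:
            if start is None:
                start = i  # Start on the first window above the threshold.
            last_active = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= quiet_windows:
                close()
                # Reset to find the next gesture.
                start, last_active, quiet = None, None, 0
    if start is not None:
        close()  # Close a gesture still open at the end of the sample.
    return gestures


# Hypothetical windowed movement values with a threshold of 1.
windows = [0, 0, 5, 6, 0, 7, 0, 0, 0, 0, 8, 0]
print(detect_gestures(windows, threshold=1))
```

In this example the single quiet window at index 4 does not end the first gesture, as fewer than three consecutive windows fall below the threshold, so the gesture spans windows 2 to 5.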

4.4 Feature Space Search

I have described a collection of novel features whose individual and mutual predictive value is as yet unknown. Some features are potentially unhelpful to predictive models. Therefore, distinguishing useful features from unhelpful features is a key operation to enable development of accurate models, and thus validate these features. To this end, I perform an exhaustive search of the feature space to identify the combination of features with the best performance.

Alghowinem et al. [1] demonstrate the benefits of a variable feature set, achieving accuracy improvements of up to 10%. Their classification method is based on SVMs and the dimensionality reduction enabled by feature filtering is significant. They use a statistical T-threshold based approach to feature selection. However, Dibeklioglu et al. [16] argue that basing selection on optimization of mutual information will achieve better results than individual information based selection. I follow this mutual information approach and thus feature selection is based on the results achievable given a combination of features, rather than features’ individual relevance.

I define few enough features such that a brute force search of the feature combination space (i.e. testing every combination) is viable. As each feature can be included or excluded, the space has $2^n$ combinations, where $n$ is the number of features being searched.

I iterate over every combination and perform three fold cross validation using that combination of features. The combination with the greatest average cross validation F1 score is taken as the best combination, which enables testing and evaluating the effectiveness of the proposed features.
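
A brute force search of this kind can be sketched as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, `evaluate` stands in for training a model and computing its average cross validation F1 score, and the empty feature set is skipped since no model can be trained on it.

```python
from itertools import combinations


def search_feature_space(feature_names, evaluate):
    """Score every non-empty feature subset and return the best one.

    `evaluate(subset)` is a caller-supplied function returning the average
    cross validation F1 score of a model trained on that subset.
    """
    best_subset, best_score = None, float("-inf")
    for size in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, size):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


# Toy scoring function standing in for cross validation (illustrative only):
# two features are "useful" and every included feature carries a small cost.
def toy_score(subset):
    useful = ("Hn-GC" in subset) + ("He-GS" in subset)
    return useful - 0.25 * len(subset)


print(search_feature_space(["O-FM", "Hn-GC", "He-GS"], toy_score))
```

Because the number of subsets doubles with each added feature, this approach is only practical while the feature count remains small.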

4.5 Summary

In this chapter I have described my methodology for extracting gesture meta features from videos. The four core stages are: pose estimation, data preparation, feature extraction, and finally classifier training. I use OpenPose to extract pose data per-frame as it is the state-of-the-art in pose estimation. I then filter samples and perform two operations to reduce noise within the remaining samples: recovery of partial detection failures and smoothing of high frequency detection noise. Features are extracted by detecting individual gestures, calculating per-gesture features such as speed, and aggregating per-gesture features within their localisations and over the whole body. Classifier choice is an evaluation detail and discussed in the next chapter.

5.1 Implementation details

Before presenting evaluation results, I outline the details of the evaluation setup.


I perform evaluations using a three fold cross validation. The training and test samples are split in a participant-independent manner and use stratified folding to balance them with regards to labels. Cross validating with more folds leads to fewer test samples per fold. Given the small size of the dataset, this can lead to more erratic performance (i.e. more extremes in cross validation results). I assess results based on the average cross validation F1 score and the standard deviation between fold F1 results. Average F1 provides an objective measure of model quality, whilst the standard deviation provides an indicator of the consistency of the model.
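
One way to obtain participant-independent, stratified folds and to summarise fold results is sketched below. The function names are hypothetical, the round-robin assignment is just one simple stratification strategy, and the use of the population standard deviation is an assumption, as the exact variant is not stated.

```python
import statistics
from collections import defaultdict


def participant_folds(sample_participants, participant_labels, n_folds=3):
    """Assign each sample a fold so that a participant never spans two folds.

    Participants are dealt round-robin into folds within each class, which
    keeps the class balance of the folds roughly equal.
    """
    by_label = defaultdict(list)
    for participant in sorted(participant_labels):
        by_label[participant_labels[participant]].append(participant)
    fold_of = {}
    next_fold = 0
    for label in sorted(by_label):
        for participant in by_label[label]:
            fold_of[participant] = next_fold % n_folds
            next_fold += 1
    return [fold_of[p] for p in sample_participants]


def summarise(fold_f1_scores):
    """Average F1 and the standard deviation between folds."""
    return statistics.mean(fold_f1_scores), statistics.pstdev(fold_f1_scores)


# Hypothetical data: participants "a" and "d" each have two samples
# because their recordings were interrupted.
samples = ["a", "a", "b", "c", "d", "d", "e", "f"]
labels = {"a": 1, "b": 0, "c": 1, "d": 0, "e": 1, "f": 0}
print(participant_folds(samples, labels))
```

Splitting by participant rather than by sample matters here because interrupted recordings produce several samples per participant, and placing them in different folds would leak information between training and test sets.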

Classifier Models

I evaluate four types of classifiers on all tasks:

  • A linear regression based classifier (denoted as lin) using a classification threshold of 0.5.

  • A logistic regression based classifier (denoted as log) using the L-BFGS solver.

  • A linear kernel SVM (denoted as svm) with balanced class weighting (without balancing the class weightings the classifier consistently chooses to predict a single class on every fold).

  • A random forest (denoted as rf) with 40 trees, feature bootstrapping, a minimum of 3 samples per leaf, a maximum depth of 5, balanced class weighting, and exposing 80% of features per node. These parameters were chosen experimentally.


I only evaluate binary classification tasks (i.e. high vs. low depression score) within this dissertation. However, the dataset contains continuous values for distress and personality measures, thus a constant threshold is required for each label (similar to the participant selection criteria discussed in Section 3.2.2). These thresholds are chosen such that the resulting binary classes are as balanced as possible. Given the small size of the dataset, balancing the classes is important to classifier training. Per-label thresholds are reported in Table 5.1.

Label Threshold # Participants above # Participants below

Depression 7 18 17
Anxiety 7 18 17
Perceived stress 17 18 17
Somatic stress 6 19 16

Neuroticism 17 18 17
Extraversion 16 18 17
Agreeableness 25 20 15
Conscientiousness 20 20 15
Openness 27 19 16

Table 5.1: Binary classification thresholds for distress and personality labels.
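
Choosing a threshold that balances the binary classes can be sketched as follows. The function name and scores are hypothetical, and the sketch assumes that a score strictly greater than the threshold counts as "above"; ties between equally balanced thresholds are resolved in favour of the lower threshold.

```python
def balanced_threshold(scores):
    """Pick the threshold whose above/below split is closest to even.

    A score strictly greater than the threshold counts as "above".
    Returns (threshold, n_above, n_below).
    """
    best = None
    for threshold in sorted(set(scores)):
        above = sum(1 for s in scores if s > threshold)
        below = len(scores) - above
        candidate = (abs(above - below), threshold, above, below)
        if best is None or candidate < best:
            best = candidate
    _, threshold, above, below = best
    return threshold, above, below


# Hypothetical questionnaire scores (not taken from the dataset).
scores = [0, 1, 2, 3, 7, 8, 9, 12, 15]
print(balanced_threshold(scores))
```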

A larger dataset, or an expansion of this dataset, would enable regression models. Regression models would also naturally provide a multi-class classification solution as the original questionnaire scoring categories could be applied to the regressed predictions.

Though I evaluate the features’ predictive capability against multiple labels, my primary focus is on depression detection, as this is the area that most of the related work discusses. An evaluation of the features’ capability on other labels is presented in Section 5.5.


Though I evaluate four localisation types: head, hands, legs, and feet, the feet localisation has a negative effect on performance, shown in Section 5.4. As such, all other sections use the three useful localisations: head, hands, and legs, in their evaluations.

Feature Reporting Notation

To concisely report features used by different classification models I describe a brief notation for enumerating features. The notation defines tokens for localisations, feature type, and how the lin model interprets the feature. The structure is [localisation]-[feature type][linear polarity]. Localisation and feature type token mappings are provided in Table 5.2.

Localisation Token

Overall O
Hands Hn
Head He
Legs L
Feet F

Feature Token

Overall
Average frame movement FM
Proportion of total movement occurring during a gesture GM
Average gesture surprise GS
Average gesture movement standard deviation GD
Number of gestures GC

Localised
Average length of gesture GL
Average per-frame gesture movement GA
Total movement in gestures GT
Average gesture surprise GS
Number of gestures GC

Table 5.2: Feature notation tokens.

I define an indicator of how a model interprets a feature based on its contribution to a positive classification. A greater value (e.g. more activity) contributing to a positive classification is denoted by +. Conversely, a greater value contributing to a negative classification is denoted by −. A value which has a negligible effect (defined as a linear model applying a near-zero coefficient) is denoted by /. Finally, if the lin classifiers for each cross-validation fold are inconsistent in their usage of the feature, the ? indicator is used.

For example, F-GS− would denote that a greater amount of surprise in feet gestures indicates a negative classification.
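
As an illustration of the notation, the indicator could be derived from the per-fold coefficients of a linear model roughly as follows. This is a minimal sketch, not the procedure used in the evaluation: the function names, the near-zero threshold `eps`, and the rule that any disagreement between folds yields `?` are all assumptions.

```python
def polarity_indicator(fold_coefficients, eps=1e-3):
    """Summarise one feature's per-fold linear coefficients as +, −, / or ?.

    eps is an assumed cut-off for a "near-zero" coefficient.
    """
    signs = set()
    for coef in fold_coefficients:
        if abs(coef) < eps:
            signs.add("/")   # negligible effect in this fold
        elif coef > 0:
            signs.add("+")   # greater value -> positive classification
        else:
            signs.add("−")   # greater value -> negative classification
    # The feature is only given a definite indicator when every fold agrees.
    return signs.pop() if len(signs) == 1 else "?"


def feature_token(localisation, feature_type, fold_coefficients, eps=1e-3):
    """Build a token of the form [localisation]-[feature type][linear polarity]."""
    indicator = polarity_indicator(fold_coefficients, eps)
    return f"{localisation}-{feature_type}{indicator}"


# Hypothetical coefficients from three cross-validation folds.
print(feature_token("F", "GS", [-0.4, -0.2, -0.9]))
print(feature_token("He", "GL", [0.5, -0.1, 0.3]))
```

With these hypothetical coefficients the first call yields a consistently negative token and the second an inconsistent one.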

5.2 Baseline

As a baseline I evaluate the use of the only non-gesture based feature, average per frame movement (O-FM). Results are given in Table 5.3.

Model F1 avg F1 std

lin 34.43% 11.45%
log 34.43% 11.45%
svm 33.82% 10.73%
rf 64.29% 5.83%

Table 5.3: F1 aggregate scores for models using the baseline feature on the depression detection task. The rf model achieves the best performance. As only one feature is used in this evaluation, the feature is available to every decision node in the random forest, enabling strong overfitting.

While the rf model achieves the best results, it is inconsistent across its folds. Its 3 F1 scores are: 70.06%, 60.00%, and 47.06%. However, it does suggest that the movement feature alone is valuable for prediction, though the worse-than-random results achieved by the other models suggest that movement is not a linear indicator. It is possible, indeed likely, that the rf model’s result reflects overfitting of the dataset.

5.3 Feature Search

I evaluate the effect of the feature space search to identify the best feature combination.

All Features

Table 5.4 presents results for each model when provided with the full, unmodified feature vector. Given all features, the lin, log, and svm models all improve on their baselines, while the rf model is worse than its baseline. The reduction in performance by the rf model can be attributed to the overfitting ability of random forests. This ability is especially prevalent with single feature problems as the feature must be made available to every node within the forest, thus enabling direct fitting of multiple decision trees to the dataset. The lin model achieves significantly better-than-random performance, indicating that the features have some linear predictive capability.

Model F1 avg F1 std

lin 66.81% 8.89%
log 56.55% 5.12%
svm 62.70% 9.19%
rf 41.88% 6.04%

Table 5.4: Classifier F1 performance for detecting depression within my dataset given all features. There are two primary interest points in this table: 1) the lin classifier performs best given all features, suggesting that the features provide some linear information, and 2) the rf classifier performs significantly worse than the one feature baseline in Table 5.3, further suggesting the rf classifier is overfitting the one feature baseline.

Searched Features

I perform an exhaustive feature space search to identify the best combination of features (determined by the average cross-validation F1 score). This provides two important outcomes: the best accuracy possible for a given model using these features within my dataset and the most relevant features to predict depression within my dataset. A good accuracy from the first outcome validates the features I have developed. The second outcome then provides a basis for further analysis of these features and their relation to depression within my dataset.

This search requires a model fast enough to train that the full space can be searched in a viable amount of time and, preferably, a model that is interpretable. For these reasons I use the lin model to search the feature space, and then evaluate the best feature combination with the other classifier types.
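
The shape of such an exhaustive search can be sketched as below. This is a hedged illustration only: `score_subset` is a hypothetical callback standing in for "train the lin model on this subset and return its mean cross-validation F1", and `toy_score` is an invented scoring function used purely so the example runs without a dataset.

```python
from itertools import combinations


def exhaustive_feature_search(feature_names, score_subset):
    """Return (best_subset, best_score) over every non-empty feature subset.

    score_subset is a hypothetical callback; in practice it would train a
    fast model on the subset and return the average cross-validation F1.
    """
    best_subset, best_score = None, float("-inf")
    for size in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, size):
            score = score_subset(subset)
            if score > best_score:  # ties keep the first subset found
                best_subset, best_score = subset, score
    return best_subset, best_score


def toy_score(subset):
    """Invented stand-in for a cross-validated F1 score."""
    useful = {"O-GM": 0.30, "He-GA": 0.25, "L-GT": 0.20}
    noise_penalty = 0.05  # every unhelpful feature lowers the score
    return sum(useful.get(name, -noise_penalty) for name in subset)


candidates = ["O-FM", "O-GM", "He-GA", "L-GT", "F-GS"]
best, score = exhaustive_feature_search(candidates, toy_score)
print(best, round(score, 2))
```

A search of this kind evaluates 2^n − 1 subsets for n features, which is why a model that is fast to train matters.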

Figure 5.1: Comparison of baseline results to feature searched results. This shows that performing a full feature search enables the lin classifier to outperform the other classifiers and feature combinations. Also of interest is the reduction in performance by the rf model when provided with more features, suggesting that it may be overfitting when provided with the single baseline feature. The specific feature combination resulting from the search is discussed below in Chosen Features.
Model F1 avg F1 std

lin 82.70% 8.95%
log 54.17% 5.89%
svm 56.55% 5.12%
rf 53.18% 14.96%

Table 5.5: Performance of the best feature combination as determined by an exhaustive feature space search using the lin classifier to detect depression. This demonstrates the power, and the linearity, of the gesture meta features as the lin classifier is able to achieve a high F1 score.
Feature Search Improves Best Performance

The best performance when applying the feature search, again from the lin classifier, is 82.70%, significantly better than the classifier’s baseline of 34.43% and its all features baseline of 66.81%. This demonstrates the importance of reducing dimensionality and feature confusion, especially when using relatively direct methods such as linear regression. Results are provided in Table 5.5, and a comparison of these results to the baseline results is presented in Figure 5.1.

Chosen Features

The full feature combination is: {O-FM?, O-GM+, O-GC?, Hn-GC?, Hn-GT−, Hn-GS?, He-GL?, He-GC?, He-GA+, He-GT−, He-GS?, L-GL−, L-GC?, L-GT+}. Analysing this feature set, we can derive some intuition as to the information indicative of depression. The overall (O-*) features suggest that the number of gestures (O-GC) and the amount of movement within those gestures (O-GM) are relevant to depression. The O-GM+ token suggests that more movement within gestures relative to all other movement is indicative of depression. The localised features suggest that the length of gestures (*-GL) has a correlation with depression; however, this correlation differs between localisations. The head localisation is ambiguous as to whether shorter or longer gestures (He-GL?) are indicative of depression, whilst longer gestures in the legs localisation (L-GL−) are indicative of less depression. Within this model, less total movement of the hands (Hn-GT−) is indicative of distress.

Negative Performance Impact on Other Models and Overfitting

The identified feature set is chosen using the lin model, so it is unsurprising it has a greater improvement than any other model. While the log classifier’s performance does not change much, the svm and rf classifiers have reduced performance compared to their all features and one feature baselines, respectively. There are two aspects to consider here: the value of pre-model filtering to each model and the potential for each model to overfit. I focus on the rf classifier, as it has the more distinct reduction in performance. Random forests have inbuilt feature discretion, so the pre-model filtering of features does not have as great a complexity-reducing effect as it does on the other models. My hyper-parameters give the rf model a relatively large selection of features (80%) per node, thus it should generally filter, to some extent, those naturally unhelpful features. Random forests, as decision tree ensemble methods, have a propensity for overfitting data. By reducing the number of features available I reduce the available surface for overfitting. Moreover, when only using one feature, as in the baseline, every decision node in the random forest has access to the feature, thus enabling particularly strong overfitting of a small dataset.

Dimensionality Compression

I do not perform any dimensionality compression operations in my presented evaluations. However, I have experimented with Principal Component Analysis (PCA) both independently and in combination with a feature search. Neither approach achieved especially interesting results; all were worse than when not applying PCA. Given this, and the already relatively low number of dimensions, I do not see it as a critical path of investigation for this dissertation.

5.4 Body Localisations

Not all localisations are necessarily beneficial. Identifying which localisations are helpful is made more difficult by the localisations’ interactions within a classifier and the effect they have on the overall features. Though a localisation may generate features that are chosen using feature search, it may reduce overall accuracy by obfuscating predictive information in the overall features.

Given this, I experiment with localisations included individually and in varying combinations. I also provide an example of a localisation, feet, that negatively affects performance when included, even when all other localisations are also included (and are thus providing the same predictive information). A comparison of the best F1 scores for localisation combinations is presented in Figure 5.2.

Figure 5.2: Comparison of the best F1 average scores from localisation combinations using the lin classifier using all features. The most interesting results are the bottom four (vertically) localisation results: head - legs, hands - head - legs, feet - head - legs, and hands - feet - head - legs. Specifically, the feet localisation impairs the performance of the localisation combinations when included. This trend is also seen in feet - legs and feet - head.
Localisation Inclusion

Clearly not all of the features generated per localisation provide useful information. Inclusion of more localisations, and thus a larger feature space, does not guarantee that a better optimal result is available within the space. As the overall features (those aggregated across all localisations) are affected by each localisation, the quality of the feature space can be degraded with the inclusion of localisations. For example, this occurs regularly in Figure 5.2 when including the feet localisation. In particular, the best base configuration, head - legs using the lin classifier, achieves a 70.88% F1 score; when the feet localisation is included this drops to 49.84%. I have not identified any psychology literature, or clear intuition, as to why the feet localisation hinders performance. I see three probable explanations: 1) some literature does support this and I have simply not identified the literature, 2) this is accurately representing that feet movement meta information does not distinctively change with distress, but no literature has explicitly investigated this, and 3) this is simply an attribute of the dataset that is not reflective of any broader trend, either due to the dataset size or the nature of the interview dynamic.

Best Base Performance Configuration

Though the head - legs configuration achieves the best performance when all features are used, it does not achieve the best performance when features are chosen based on an exhaustive search. While my primary configuration, hands - head - legs, achieves an 82.70% F1 average, the head - legs configuration achieves 80.53%. Results are presented in Table 5.6.

Model F1 avg F1 std

lin 80.53% 4.04%
log 59.52% 8.91%
svm 65.40% 8.40%
rf 51.32% 5.76%

Table 5.6: Performance of models when using features chosen via exhaustive search with source features from the head - legs configuration. This configuration achieves close to the best performance of the standard configuration (Table 5.5). The lin classifier also has more consistent performance across cross validation folds than it does on the standard configuration, with a standard deviation of 4.04% compared to 8.95%. However, these results do not clearly define which configuration is generally better as the differences are quite minor.

5.5 Generalisability

I have demonstrated the gesture meta features’ predictive value with regards to depression detection. I now evaluate their predictive capability for other labels including participants’ gender, personality measures, anxiety, perceived stress, and somatic stress.

I apply the best feature combination identified in Section 5.3 to each of the labels, presented in Table 5.7. I also perform a feature space search for each label, using the same process as Section 5.3, to provide greater insight into the effect of features and their reliability across labels, presented in Table 5.8. A comparison is shown in Figure 5.3. Consistent identification of features across uncorrelated labels reinforces the hypothesis that they provide predictive information beyond a specific label and dataset (i.e. are less likely to be overfitting).

Figure 5.3: Comparison of F1 average scores when using the optimal feature combination for the depression label vs. the optimal feature combination for each label. Each label has a significant performance improvement when using its optimal feature combination.
Label lin log svm rf
F1 avg F1 std F1 avg F1 std F1 avg F1 std F1 avg F1 std

Depression 82.70% 8.95% 54.17% 5.89% 56.55% 5.12% 53.18% 14.96%
Anxiety 47.18% 23.21% 38.33% 19.20% 30.18% 4.02% 53.26% 16.58%
Perceived stress 47.18% 23.21% 38.33% 19.20% 30.18% 4.02% 53.26% 16.58%
Somatic stress 42.74% 30.29% 51.68% 4.01% 44.44% 6.29% 54.89% 8.37%
Neuroticism 62.08% 0.76% 31.71% 10.89% 33.36% 12.59% 38.30% 8.30%
Extraversion 51.04% 14.06% 67.95% 17.33% 65.14% 12.98% 52.94% 20.83%
Agreeableness 69.48% 5.97% 73.61% 2.13% 67.98% 3.92% 56.96% 20.79%
Conscientiousness 71.28% 6.53% 72.95% 6.93% 78.77% 6.24% 79.19% 6.11%
Openness 49.64% 14.94% 65.08% 5.94% 64.46% 5.31% 61.47% 8.64%
Gender 34.26% 18.18% 52.96% 2.28% 40.63% 18.42% 63.14% 5.29%

Table 5.7: Performance of models using the feature combination identified for the depression label (Section 5.3) for predicting a variety of labels. The best results per-label are bolded. These results suggest that the depression chosen feature combination does not generalise particularly well. These results are also surprising as the distress labels (anxiety, perceived stress, and somatic stress), which are correlated to the depression label, perform poorly, whilst uncorrelated labels (such as agreeableness, openness, and gender) perform better. This may be due to overfitting of feature profiles to labels (i.e. the optimal features for the label) or truly distinct feature profiles between the correlated distress labels, though the former appears more probable.
Label lin log svm rf
F1 avg F1 std F1 avg F1 std F1 avg F1 std F1 avg F1 std

Depression 82.70% 8.95% 54.17% 5.89% 56.55% 5.12% 53.18% 14.96%
Anxiety 88.46% 9.53% 46.14% 9.16% 54.33% 12.32% 52.94% 12.85%
Perceived stress 88.46% 9.53% 46.14% 9.16% 54.33% 12.32% 52.94% 12.85%
Somatic stress 86.37% 6.45% 58.44% 8.85% 68.08% 16.97% 49.48% 11.38%
Neuroticism 76.39% 8.56% 38.85% 5.15% 48.25% 4.47% 40.78% 13.59%
Extraversion 84.04% 10.24% 74.61% 10.34% 73.39% 12.75% 58.59% 31.33%
Agreeableness 82.70% 5.77% 69.14% 6.67% 50.43% 14.85% 67.97% 1.85%
Conscientiousness 88.72% 4.25% 73.14% 3.13% 69.90% 7.84% 78.91% 2.80%
Openness 85.01% 6.62% 64.32% 5.06% 68.77% 8.84% 60.32% 7.36%
Gender 81.69% 5.32% 68.11% 10.22% 76.71% 5.10% 64.81% 11.42%

Table 5.8: Performance of models for predicting a variety of labels when using features identified via a feature space search specific to the label. This demonstrates the performance improvement, compared to Table 5.7, achieved via label specific feature combinations. The best results for each label are bolded.


The depression feature combination does not generalise well to the other distress measures for the lin model, though it achieves 62–71% F1 scores for neuroticism, agreeableness, and conscientiousness. Interestingly, the other labels’ best results using the depression combination are above 60% F1, with the exception of the other distress measures, which only achieve a best of 53–54% F1. This is surprising as the distress labels are strongly correlated to the depression label.

However, when features are chosen on a per-label basis the results improve significantly. All labels achieve 76%+ (all but one are above 80%) average F1 score with their best classifier. (Though the perceived stress and anxiety measures have a covariance of 82.98% within my dataset, once reduced to binary classifications they are equivalent, i.e. 100% correlation; thus results for both labels are the same.) The best classifier for all labels is the lin classifier, which is to be expected given it is the search classifier and previous evaluation has shown the features provide useful linear information.

Fitting features to labels

There are two potential explanations for the substantial improvement in performance when using label specific feature combinations: each label could legitimately have a unique feature profile through which its distinctive attributes are expressed, or the labels could be experiencing a level of overfitting due to the small size of the dataset. While it is likely to be a combination of the two reasons, labels that are relatively uncorrelated with depression are still able to achieve good performance using the depression chosen features, suggesting it is not just overfitting. For example, agreeableness, extraversion, and conscientiousness all have covariances of less than 50% with depression, yet they achieve 72.61%, 67.95%, and 79.19% average F1 scores, respectively, with the depression features. Openness, which has a 4.29% covariance with depression, achieves 65.08% with the depression chosen features. Furthermore, as Table 5.9 shows, the openness feature combination shares only 6 out of its 9 features with the depression combination, whilst it also excludes 8 out of 14 of the depression combination’s features. Therefore, it can be reasonably suggested that the features do generalise, acknowledging that fitting the features directly to a label achieves the best performance, as would be expected with all prediction tasks.

Cross classifier generalisability

Within the depression feature combination results, the lin, log, and rf models all achieve the best results on at least 2 labels each. The svm classifier performs close to the best on many labels, such as conscientiousness where it achieves 78.77% compared to the best result of 79.19%.

The svm model performs better when classifying conscientiousness, extraversion, and agreeableness, using the depression feature combination, than it does when predicting the depression label. It is also better at transferring the depression feature combination to other labels than the lin model that the feature combination was chosen with. Its performance improves on most labels when using the label targeted feature sets, such as on gender where it achieves 76.71% and somatic stress with 68.08% F1, whilst with the depression feature combination its F1 result was less than 50% for both. Interestingly, whilst the svm model performed particularly well on agreeableness using the depression feature combination, it performed worse when using the label targeted feature combination.


Within the results for the depression feature combination the conscientiousness label achieves the best average F1 score across all classifiers, 75.55%, with an average standard deviation of 6.45%.

Unlike the distress evaluation questionnaires, the BFI (personality) questionnaire is designed such that responses should remain relatively consistent over time. Future work could investigate whether the features are able to consistently predict personality measures over time for a participant. Do the features remain predictive of personality as a person’s temporal distress measures change?

Chosen Features

Table 5.9 presents the feature sets resulting from the per-label feature searches. There are some consistencies and intuitive inversions within the feature sets. For example, the head localised average gesture movement (He-GA) is relevant for many of the labels, though in some it is inconsistent whether greater or lesser is indicative of the label (instances of inconsistencies are denoted as He-GA? within Table 5.9). The feature usage is inverted between neuroticism, where a faster average speed is indicative of a positive classification, and extraversion and agreeableness, where a slower average speed is indicative of positive classification.

Furthermore, as stated above, features that are chosen by uncorrelated labels support the hypothesis that these features are providing useful information. For example, of the 7 features the gender label chooses, 5 are also chosen by the anxiety label, including hand total gesture movement (Hn-GT), hand gesture surprise (Hn-GS), and leg gesture surprise (L-GS), even though gender and anxiety have a covariance of only 7.23% within the dataset.
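
The overlap count can be reproduced with simple set operations. The sketch below transcribes the gender and anxiety rows of Table 5.9 as feature names with the polarity indicator dropped (an assumption about how "chosen by both labels" is counted); the variable names are illustrative.

```python
# Feature sets transcribed from Table 5.9, polarity indicators removed.
gender_features = {"O-GS", "O-GD", "Hn-GC", "Hn-GT", "Hn-GS", "He-GA", "L-GS"}
anxiety_features = {
    "O-FM", "O-GM", "O-GD", "O-GC",
    "Hn-GT", "Hn-GS",
    "He-GL", "He-GA", "He-GT",
    "L-GL", "L-GA", "L-GT", "L-GS",
}

# Features selected by both per-label searches.
shared = gender_features & anxiety_features
print(len(shared), "of", len(gender_features), "gender features are shared")
print(sorted(shared))
```

The same intersection applied to other pairs of rows gives a quick way to inspect which features recur across labels.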

Label              # Feat.  Overall             Hands          Head                     Legs
Depression         14       FM?, GM+, GC?       GC?, GT−, GS?  GL?, GC?, GA+, GT−, GS?  GL−, GC?, GT+
Anxiety            13       FM+, GM+, GD−, GC−  GT?, GS+       GL+, GA?, GT−            GL−, GA+, GT+, GS+
Perceived stress   13       FM+, GM+, GD−, GC−  GT?, GS+       GL+, GA?, GT−            GL−, GA+, GT+, GS+
Somatic stress     9        GD−                 GL?, GT?, GS+  GC−, GA+                 GL−, GC−, GT+
Neuroticism        14       GM+, GS?, GC−       GC+, GA−       GL?, GC+, GA+, GT−       GL−, GC+, GA?, GT+, GS+
Extraversion       10       FM+, GC+            GL−, GC+, GA+  GC?, GA−                 GL+, GT−, GS−
Agreeableness      9        GS+, GC+            GC−, GA−       GA−, GS+                 GC−, GA?, GT?
Conscientiousness  8        FM?, GM−, GS?       GL+, GS?       GC+                      GL?, GC?
Openness           9        GM+, GD+            GL−, GC?       GC−, GA?, GS?            GA?, GT?
Gender             7        GS+, GD−            GC+, GT+, GS−  GA+                      GS−

Table 5.9: Features chosen for each label when features are searched specifically for the label. “Positive classification” for the gender label is female (i.e. He-GA+ indicates a higher average speed of head gestures is indicative of the participant being female). Refer to Table 5.2 for the full notation mapping.

I represent four measures of movement: average movement per-frame (i.e. speed), standard deviation of movement (i.e. consistency of speed), total movement both over an entire sample and over individual gestures, and proportion of total movement that occurs during gestures (i.e. how much/little movement there is outside of gestures). The amount and speed of movement is intuitively correlated to distress and personality. This intuition is validated as these movement representations are included in a variety of feature sets resulting from feature space searches, across multiple labels. Indeed, Table 5.9 shows that the speed of the head during a gesture (i.e. He-GA) is correlated with positive classifications for labels including neuroticism, depression, and somatic stress, while it is inversely correlated with other classifications including extraversion and agreeableness.


One of the less initially intuitive features is “gesture surprise”. This feature represents the distance between a gesture and the previous gesture (or beginning of the sample) as a proportion of the total length of the sample. This per-gesture value is then averaged across all gestures, which means it is not a measure of the proportion of the sample when no gesture occurs (as discussed in Section 4.3). The intuition is to represent whether the participant’s gestures occur regularly or after a period of stillness (i.e. how “surprising” is the gesture). Gesture surprise is included in every feature combination resulting from a feature search of a label, shown in Table 5.9, suggesting it provides useful information.
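
To make the definition concrete, a minimal sketch of the average gesture surprise calculation is given below. It assumes gestures are represented as sorted, non-overlapping (start frame, end frame) pairs and that the gap is measured from the end of the previous gesture to the start of the current one; the function name and representation are illustrative rather than taken from the implementation described in Section 4.3.

```python
def average_gesture_surprise(gestures, sample_length):
    """Average gap before each gesture, as a proportion of the sample length.

    gestures: list of (start_frame, end_frame) pairs, sorted by start and
    non-overlapping (an assumed representation).
    sample_length: total number of frames in the sample.
    """
    if not gestures or sample_length <= 0:
        return 0.0
    previous_end = 0  # the first gesture is measured from the start of the sample
    surprises = []
    for start, end in gestures:
        surprises.append((start - previous_end) / sample_length)
        previous_end = end
    return sum(surprises) / len(surprises)


# Hypothetical sample of 1000 frames containing three gestures.
example = [(100, 150), (200, 260), (800, 900)]
print(average_gesture_surprise(example, 1000))
```

In this hypothetical sample the per-gesture values are 0.10, 0.05, and 0.54, so the average is 0.23, whereas the proportion of the sample with no gesture is 0.79. This illustrates why the feature is not simply a measure of stillness.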


The features are, in general, linearly correlated with different labels, as shown by the effectiveness of the lin classifier, which has been the focus of the feature development and evaluation. Using this classifier provides interpretability of the features and confidence that the features are providing real, useful information, not simply a platform for overfitting.

However, not all features are consistently linear with regards to certain labels. Some features are learnt as correlated and inversely correlated on the same label in different cross validation rounds. This inconsistency is denoted with the usage indicator ?. For example, while the hands localised gesture surprise feature (Hn-GS) is linearly correlated with somatic stress, anxiety, perceived stress, and gender, it is inconsistently correlated with conscientiousness and depression. These inconsistencies within feature correlation are the exception, not the rule. Indeed, all of the features that experience inconsistency on a label are shown to be linearly correlated to another label, or linearly correlated within another localisation.

5.6 Comparison with Related Work

I evaluate two comparison methods: a linear SVM model based on Bag-of-Words features of FACS [21] Action Units (AUs), and the Bag-of-Body Dynamics (BoB) method presented by Joshi et al. [36]. The first method provides a comparison of modality (facial-modality) within my dataset and a cross-dataset benchmark with the closest related dataset, DAIC. The second method provides a comparison to an existing body-modality method that uses an automatic expression categorization approach to predict depression.

AU SVM

This method uses basic facial expression analysis to predict a binary depression label. I implement a simple linear kernel SVM based on a Bag-of-Words (BoW) feature vector of FACS AUs to predict the binary depression label in my dataset and, separately, a binary depression label within DAIC. As with my dataset, I apply a threshold to DAIC’s PHQ-8 score labels such that the dataset is as balanced as possible. The resulting threshold is 5 (i.e. 6 and above is considered “depressed”), which results in 68 of 139 samples being classed as “depressed”. This is a lower threshold than used in my dataset (which is 7).
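
Choosing a threshold that balances the two classes as evenly as possible can be sketched as follows. This is an illustrative reading of the step, assuming integer questionnaire scores and that a score strictly above the threshold counts as the positive class; the function name and the example scores are invented.

```python
def most_balanced_threshold(scores):
    """Return (threshold, n_positive), where score > threshold is positive.

    Picks the threshold whose positive count is closest to half the samples;
    ties are resolved in favour of the lowest threshold (an assumption).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = None
    for threshold in sorted(set(scores)):
        positives = sum(1 for s in scores if s > threshold)
        imbalance = abs(2 * positives - len(scores))
        if best is None or imbalance < best[0]:
            best = (imbalance, threshold, positives)
    return best[1], best[2]


# Invented questionnaire scores for ten participants.
example_scores = [0, 1, 2, 3, 5, 6, 6, 7, 9, 12]
print(most_balanced_threshold(example_scores))
```

For these invented scores a threshold of 5 splits the ten samples into five positive and five negative.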

Within the BoW feature vector, binary AUs are counted for each frame they are detected in, whilst intensity measured AUs (i.e. continuous value measures) are summed across all frames. The sums and counts are then normalised against the number of frames in which the face was successfully detected.
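
The aggregation described above can be sketched as below. The per-frame representation (a dictionary per frame, with `None` where the face was not detected) and the AU names in the example are assumptions made for illustration.

```python
def au_bag_of_words(frames, binary_aus, intensity_aus):
    """Aggregate per-frame AUs into one normalised feature dictionary.

    frames: list of per-frame dicts mapping AU name -> value, or None for
    frames in which the face was not detected (assumed representation).
    binary_aus: names of presence/absence AUs, counted per detected frame.
    intensity_aus: names of continuous AUs, summed across detected frames.
    """
    detected = [frame for frame in frames if frame is not None]
    if not detected:
        return {}
    n_detected = len(detected)
    features = {}
    for au in binary_aus:
        features[au] = sum(1 for f in detected if f.get(au, 0)) / n_detected
    for au in intensity_aus:
        features[au] = sum(f.get(au, 0.0) for f in detected) / n_detected
    return features


# Four hypothetical frames; the face is not detected in the second.
example_frames = [
    {"AU12_c": 1, "AU12_r": 2.0},
    None,
    {"AU12_c": 0, "AU12_r": 1.0},
    {"AU12_c": 1, "AU12_r": 0.0},
]
vector = au_bag_of_words(example_frames, ["AU12_c"], ["AU12_r"])
print(vector)
```

Normalising by the number of detected frames, rather than all frames, keeps samples with intermittent face detection comparable to those with full detection.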

The DAIC dataset provides per-frame predictions for 20 AUs. I use OpenFace [5] to extract per-frame AUs from my dataset. OpenFace predicts 35 AUs per-frame, substantially more than provided by DAIC. While this method does not use the gesture features, I still only evaluate it on the samples that pass the gesture filtering step (Section 4.2.1) so that the evaluation is consistent with previous results.

Bag-of-Body Dynamics

Joshi et al. [36] use a BoW approach to predict depression in clinically assessed participants from their BlackDog dataset. The BoW feature vector comprises categorized body expressions based on K-Means clustering of histograms surrounding STIPs within a video of the subject. Their method is:

  1. Extract Space-Time Interest Points (STIPs).

  2. Calculate histograms of gradients (HoG) and optic flow (HoF) around the STIPs.

  3. Cluster the histograms per-sample using K-Means, the resulting cluster centres define the sample’s “key interest points” (KIPs).

  4. Cluster KIPs across all samples to define a BoB codebook, also using K-Means.

  5. Generate a BoB feature vector for each sample by fitting its KIPs to the codebook.

  6. Finally, apply a non-linear SVM to the resulting feature vectors to predict depression.

They use a radial basis function kernel (RBF) for their SVM.
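
Step 5 of the method above can be illustrated with a small sketch. It assumes hard assignment of each KIP to its nearest codeword by squared Euclidean distance and a histogram normalised to sum to one; neither detail is confirmed by the description above, and the toy two-dimensional vectors stand in for the real HoG/HoF descriptors.

```python
def bob_feature_vector(kips, codebook):
    """Histogram of nearest-codeword assignments for one sample's KIPs.

    kips: list of descriptor vectors (the sample's key interest points).
    codebook: list of codeword vectors learnt across all samples.
    """
    counts = [0] * len(codebook)
    for kip in kips:
        distances = [
            sum((a - b) ** 2 for a, b in zip(kip, word)) for word in codebook
        ]
        counts[distances.index(min(distances))] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]


# Toy example: a two-word codebook and three KIPs.
toy_codebook = [(0.0, 0.0), (10.0, 10.0)]
toy_kips = [(1.0, 0.0), (9.0, 9.0), (0.0, 2.0)]
print(bob_feature_vector(toy_kips, toy_codebook))
```

The resulting fixed-length vector is what a classifier such as the SVM in step 6 would consume.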

Differences in Method

In the original paper Joshi et al. experiment with multiple KIP counts and codebook sizes, and perform an extensive grid search for their SVM parameters. I use a single KIP count and codebook size, 1,000 and 500 respectively, both of which they also experiment with. While I perform a grid search on the RBF parameters, it may not be as extensive as their search.

I also exclude samples that generate fewer STIPs than the KIP count. This results in 45 out of 65 samples being included.


The results for each model are presented in Table 5.10, and a comparison of methods applied to my dataset is shown in Figure 5.4. I compare the best depression detection model from my previous evaluations (i.e. the feature combination identified in Section 5.3 with the lin classifier), and a lin classifier using all gesture meta features, with two BoB based SVMs and a FACS AUs based linear SVM for predicting depression. The lin model performs best, achieving an 82.70% F1 average with the optimal feature combination and 66.81% with all features, compared to the next best at 63.71% from the FACS AUs SVM model and 61.10% from the BoB RBF SVM model. Results for all models are based on the same three-fold cross-validation mean F1 as previously.

Figure 5.4: Comparison of method F1 scores for predicting the depression label in my dataset. This shows the lin classifier using the optimal depression feature combination identified in Section 5.3 (lin (opt.)) outperforming the comparison methods. The lin classifier using all of the gesture meta features (lin (all)) also outperforms the comparison methods, though not by as much. There are two caveats to this comparison: the features for the lin (opt.) classifier have been fitted to the label and the BoB results are for a reduced dataset (45 samples compared to 53 for the lin and AUs models) as not all of the samples produced enough STIPs for the methods to operate appropriately.

The FACS AUs SVM model performs better on my dataset than the DAIC dataset. This is almost certainly due to the difference in quantity and quality of available FACS AU features. The DAIC dataset provides 20 AU predictions per frame while my dataset provides 35 AU predictions.

Model F1 avg F1 std

lin, optimal feature combination 82.70% 8.95%
lin, all features baseline 66.81% 8.89%
BoB, RBF kernel 61.10% 6.09%
BoB, linear kernel 60.13% 9.24%
FACS AUs, my dataset 63.71% 8.02%
FACS AUs, DAIC dataset 56.53% 3.10%

Table 5.10: Comparison of my lin model, FACS AUs based SVM models, and the BoB SVM models for predicting depression as defined by the PHQ-8 questionnaire, on my dataset. The FACS AUs DAIC model is the exception, its results are for the DAIC dataset. The lin model using the optimal feature combination identified in Section 5.3 achieves the best results. The lin model using all features (i.e. the all feature baseline from Section 5.3) also beats the comparison methods.

5.7 Multi-Modal

Given the success of the FACS AUs SVM classifier in Section 5.6, I also evaluate a multi-modal method that fuses my gesture meta features and FACS AUs.

I perform multiple multi-modal experiments:

  1. Feature vector fusion including all features, evaluated across all four classifier types.

  2. Feature vector fusion with a feature search on my gesture meta features, though all AU features are retained in every search iteration.

    1. Search using the lin classifier.

    2. Search using the svm classifier, as this was used successfully with the AUs features alone.

    3. Search using a radial-basis function kernel SVM classifier which achieves comparable results to the linear kernel SVM classifier on the AUs features alone.

  3. Hybrid fusion inspired by Alghowinem et al. [1]. This involved feature fusion and decision level fusion via a majority vote of three classifiers: one classifier with only meta features, one with only AUs features, and one with the fused feature vector. I experimented with all meta features and the feature searched meta features. I used the best classifier type for the individual classifiers, i.e. lin for gesture meta features, svm for AUs features, and then also svm for fused features.
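
The decision-level part of the hybrid fusion in experiment 3 can be sketched as a simple majority vote over the three classifiers' binary predictions. The function name and the example predictions are illustrative.

```python
def majority_vote(meta_pred, au_pred, fused_pred):
    """Combine three lists of 0/1 predictions (one entry per sample).

    A sample is classified positive when at least two of the three
    classifiers (meta features, AU features, fused features) say so.
    """
    return [
        1 if (m + a + f) >= 2 else 0
        for m, a, f in zip(meta_pred, au_pred, fused_pred)
    ]


# Hypothetical predictions for four samples.
print(majority_vote([1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0]))
```

With three voters and binary labels there can be no ties, which is one practical reason to use an odd number of classifiers.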


The best result was achieved by the feature search method with an svm classifier (2.b), with an 81.94% F1 average. The feature search with the lin classifier (2.a) was second best with an 80.93% F1 average. However, these are worse than the best depression detection score of 82.70% when using the gesture meta features alone with the lin model. Moreover, the hybrid fusion approaches achieved F1s in the mid-70s, so rather than correcting errors by the meta features, they averaged the success rate between the meta classifier and the AUs classifier, resulting in a worse F1.

Potential Future Fusion Approaches

These approaches to fusion are somewhat simplistic. More sophisticated approaches, such as deep learning fusion, may achieve better results than the meta features alone. An interesting route for future work is to use recurrent deep learning to fuse temporally aligned meta features with AUs, rather than fusing them post-aggregation.

5.8 Summary

In this chapter I have evaluated the introduced gesture meta features and found they provide linear predictive information. To achieve the best performance I perform an exhaustive search of the feature space to identify the best feature combination. This demonstrates the potential of the features, however, it also introduces a risk of overfitting. This risk is mitigated by two factors: the best performance is achieved by a linear regression based classifier (i.e. a classifier not prone to overfitting) and many features are present in optimal feature combinations for multiple uncorrelated labels.

I compared my method to a basic facial-modality method and an existing body-modality method based on STIPs (i.e. generic analysis of video data). With the lin classifier, my method outperforms both, whether using all possible features or the optimal feature combination for the depression label.

Finally, I perform a multi-modal experiment utilising the facial-modality comparison method and my novel gesture meta features. Despite testing multiple approaches to fusion, no method I evaluated beat the mono-modal gesture meta features classifiers. However, all related work that I am aware of achieves better performance when incorporating multiple modalities. As such, it is likely that further investigation of multi-modal approaches incorporating the gesture meta features will identify one that does improve upon my mono-modal results.

Future Work


Future work could expand the introduced dataset and increase its quality via manual annotations. For example, annotating time frames based on the topic of conversation could enable sample segmentation methods such as that of Gong & Poellabauer [25]. This could improve the feature modeling and classifier stages. Another route for future work is gesture definition and detection. For example, manual annotation of gestures would enable auto-learnt gesture detectors.

Regression Tasks

All labels I evaluate, with the exception of gender, are scales based on self-evaluation questionnaires relating to distress or personality. The distress scales define multiple levels of severity while the personality scales do not. Both types of labels are well suited to regression tasks. Moreover, once a regression has been performed, a distress classification can be derived from the defined severity levels. However, this requires a larger dataset (initial experiments with regression on the dataset support this assertion).
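
As a minimal sketch of deriving a classification from a regression output, the following maps a predicted score onto severity bands. It assumes the depression label is a PHQ-8 score in the range 0 to 24 and uses the commonly cited PHQ-8 bands (Kroenke et al. [38]); the function name is hypothetical, and other scales would need their own bands.

```python
# PHQ-8 severity bands as (inclusive upper bound, name).
# Assumption: the regression target is a PHQ-8 total score in [0, 24].
PHQ8_BANDS = [
    (4, "none/minimal"),
    (9, "mild"),
    (14, "moderate"),
    (19, "moderately severe"),
    (24, "severe"),
]


def severity_from_score(predicted_score):
    """Round a regression output, clamp it to the scale, and return its band."""
    score = min(max(round(predicted_score), 0), 24)
    for upper_bound, name in PHQ8_BANDS:
        if score <= upper_bound:
            return name
    raise AssertionError("unreachable: score is clamped to the scale")


print(severity_from_score(3.2))   # none/minimal
print(severity_from_score(11.7))  # moderate
```

Clamping handles regressors that predict slightly outside the valid range, which linear models can do.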


Applying the presented features within larger datasets could further illuminate the properties of certain features, either confirming their inconsistency, or solidifying the linearity in one direction. Such properties should then be tested with regards to the psychology literature.


I do not evaluate a “trunk” localisation (i.e. hips to shoulders), though it may prove useful in future work. In my dataset participants are seated during the whole interview, thus the two-dimensional range of motion in the trunk (i.e. up/down and side to side) is small. As I do not include depth recording, the greatest range of motion, forward and backward, is difficult to incorporate. Future work that either includes depth sensing or has participants in a wider variety of scenarios, where they may be standing or more physically engaged, may find some use in a trunk localisation.

Cross Localisation Co-occurrence

Co-occurrence of gestures across localisations is not considered in this dissertation, though it is an area for future work. Applying deep learning to co-occurrence modeling to identify relevant patterns would be particularly interesting.

Improved Surprise

In future work this feature could be extended to be more indicative of “surprise”, rather than simply an average of distance since last gesture. For example, a gesture that occurs within a regular pattern of gestures, regardless of their distance, might be considered to have low surprise, while a gesture interrupting that pattern by occurring in the middle of the pattern’s silent period would be very surprising. This more sophisticated measure of surprise could account for naturalistic behaviour such as rhythmic movement/gestures. These types of feature extensions are exciting for two reasons. Firstly, they are interpretable: they do not come from a black box model and can thus be supported by psychology literature. Secondly, they are immediately measurable and their design is based on traditional statistical techniques, such as repetition modeling, and can therefore be iterated on more readily than deep learning based features.
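
One possible rhythm-aware formulation is sketched below, purely as an illustration of the idea. It assumes gestures are represented by onset times in seconds and takes the median gap as the "regular pattern"; both function names and the choice of median are assumptions, not part of the presented method.

```python
from statistics import median


def gaps(onsets):
    """Time between consecutive gesture onsets."""
    if len(onsets) < 2:
        raise ValueError("need at least two gesture onsets")
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


def baseline_surprise(onsets):
    """Current style of feature: average distance since the last gesture."""
    g = gaps(onsets)
    return sum(g) / len(g)


def rhythm_aware_surprise(onsets):
    """Per-gesture surprise: relative deviation of each gap from the median gap."""
    g = gaps(onsets)
    typical = median(g)
    if typical == 0:
        raise ValueError("median gap is zero; relative deviation is undefined")
    return [abs(gap - typical) / typical for gap in g]


regular = [0.0, 2.0, 4.0, 6.0, 8.0]           # steady two-second rhythm
interrupted = [0.0, 2.0, 4.0, 5.0, 6.0, 8.0]  # extra gesture at 5.0

print(rhythm_aware_surprise(regular))      # [0.0, 0.0, 0.0, 0.0]
print(rhythm_aware_surprise(interrupted))  # [0.0, 0.0, 0.5, 0.5, 0.0]
```

Note that in this simple version the on-rhythm gesture at 6.0 is also scored as surprising, because its gap is measured from the interrupting gesture. A more faithful measure would model the pattern itself rather than only consecutive gaps.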


Future work, some of which has been discussed here, could extend these features to achieve better results in more diverse datasets. A greater variety of scenarios in which the participant is more consistently physically engaged, such that movement is more constant (e.g. walking or a standing conversation), would challenge the design of the presented gesture meta features. Indeed, the current design would have trouble as it relies on the default state being a lack of movement.

Deep Learning for Feature Generation

As I discussed in Chapter 2, deep learning has been applied to the depression detection task, but still relies on hand-crafted features and descriptors. As with many fields it is likely that deep learning methods will, in the next few years, achieve state-of-the-art results, far exceeding the potential of more traditional approaches. However, to reach this point the deep learning models need large enough datasets to train on.

Assuming no especially large dataset is developed specifically for automatic depression detection, one potential for future work is the use of transfer learning (as Chen et al. [9] explore for vocal and facial features) for body expressions. For example, CNNs could be trained on general body expression datasets to learn inner representations of body language. The output of inner layers could then be used in depression detection tasks (Razavian et al. [47] present a transfer learning approach to image recognition tasks and achieve very good results on niche tasks using a generic pre-trained image CNN).

Classifiers and Deep Learning

In this dissertation I have used two traditional statistical models, linear and logistic regression, and two more advanced machine learning methods, support vector machines and random forests. However, I have purposefully avoided more complex machine learning methods such as neural networks, and especially deep learning. These more sophisticated methods have achieved significant improvements on the state-of-the-art in many domains; however, they suffer from a lack of interpretability and a propensity to overfit data. As my primary focus has been validating gesture meta features, I define interpretability as a core requirement and, given the size of my dataset, avoiding methods that overfit is important.

However, having validated these features, deep learning presents opportunities with regards to learning more sophisticated aggregation functions (this is slightly different to the feature generating deep learning discussed above). I aggregate three core features of gestures: their movement, duration, and “surprise”. I perform aggregation via averaging and standard deviations. However, given a larger dataset, a deep learning model could foreseeably learn both more sophisticated aggregation functions and core meta-features. It is important to emphasize here the burden on dataset size that exists when attempting to learn valuable features using deep learning.
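
The hand-crafted aggregation that such a learned function would replace can be sketched as follows. This is a simplified illustration, assuming each gesture is a dictionary with `movement`, `duration`, and `surprise` values; the key names, the output feature names, and the use of the population standard deviation are assumptions rather than details of the presented method.

```python
from statistics import mean, pstdev

CORE_FEATURES = ("movement", "duration", "surprise")


def aggregate_gestures(gestures):
    """Collapse per-gesture values into a small fixed-size feature dictionary."""
    if not gestures:
        raise ValueError("need at least one gesture to aggregate")
    features = {}
    for key in CORE_FEATURES:
        values = [gesture[key] for gesture in gestures]
        features[f"{key}_mean"] = mean(values)
        # Population standard deviation; a sample estimate is an equally
        # plausible choice and would need at least two gestures.
        features[f"{key}_std"] = pstdev(values)
    return features


# Hypothetical gestures detected in one body-part localisation.
example = [
    {"movement": 1.0, "duration": 2.0, "surprise": 4.0},
    {"movement": 3.0, "duration": 2.0, "surprise": 8.0},
]
print(aggregate_gestures(example))
```

Whatever the number of gestures, the output has six values per localisation, which is what keeps the final feature vector small enough for linear models on a dataset of this size.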

Multi-Modal Approaches with Deep Learning

Deep learning can learn useful modality integration functions. While I experimented with a multi-modal approach in Section 5.7, it did not achieve better results than my mono-modal method. I believe this is due to the rather direct approaches to multi-modal integration I experimented with. Applying deep learning to modality integration thus presents an opportunity for further work.

Furthermore, integrating a greater diversity of modalities and modality representations would be interesting, for example applying the Bag-of-Body Dynamics features along with my gesture meta features and FACS AUs. With regards to the Bag-of-Body Dynamics method, future work could apply the core clustering concept to pose estimation data rather than STIPs data, which may achieve better results.


  • Alghowinem et al. [2015] Sharifa Alghowinem, Roland Goecke, Jeffrey F Cohn, Michael Wagner, Gordon Parker, and Michael Breakspear. Cross-cultural detection of depression from nonverbal behaviour. FG, pages 1–8, 2015.
  • Alghowinem et al. [2016] Sharifa Alghowinem, Roland Goecke, Julien Epps, Michael Wagner, and Jeffrey F Cohn. Cross-Cultural Depression Recognition from Vocal Biomarkers. In Interspeech 2016, pages 1943–1947. ISCA, September 2016.
  • Alghowinem et al. [2018] Sharifa Alghowinem, Roland Goecke, Michael Wagner, Julien Epps, Matthew Hyett, Gordon Parker, and Michael Breakspear. Multimodal Depression Detection: Fusion Analysis of Paralinguistic, Head Pose and Eye Gaze Behaviors. IEEE Transactions on Affective Computing, 9(4):478–490, 2018.
  • Association [1994] American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders. Washington, Am Psychiatr Assoc, pages 143–146, 1994.
  • Baltrušaitis et al. [2018] Tadas Baltrušaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. OpenFace 2.0: Facial Behavior Analysis Toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 59–66. IEEE, 2018.
  • [6] Black Dog Institute., 2019. Accessed 2019/06/01.
  • Cao et al. [2016] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. CoRR, cs.CV, 2016.
  • Cao et al. [2018] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields., December 2018.
  • Chen et al. [2017a] Shizhe Chen, Qin Jin, Jinming Zhao, and Shuai Wang. Multimodal Multi-task Learning for Dimensional and Continuous Emotion Recognition. AVEC@ACM Multimedia, pages 19–26, 2017a.
  • Chen et al. [2017b] Xinghao Chen, Hengkai Guo, Guijin Wang, and Li Zhang. Motion feature augmented recurrent neural network for skeleton-based dynamic hand gesture recognition. In 2017 IEEE International Conference on Image Processing (ICIP), pages 2881–2885. IEEE, 2017b.
  • Cohen et al. [1983] Sheldon Cohen, Tom Kamarck, and Robin Mermelstein. Perceived Stress Scale, 1983.
  • [12] Crisis Text Line., 2019. Accessed 2018/12/03.
  • Dang et al. [2017] Ting Dang, Brian Stasak, Zhaocheng Huang, Sadari Jayawardena, Mia Atcheson, Munawar Hayat, Phu Ngoc Le, Vidhyasaharan Sethu, Roland Goecke, and Julien Epps. Investigating Word Affect Features and Fusion of Probabilistic Predictions Incorporating Uncertainty in AVEC 2017. AVEC@ACM Multimedia, pages 27–35, 2017.
  • de Gelder [2009] B de Gelder. Why bodies? Twelve reasons for including bodily expressions in affective neuroscience. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535):3475–3484, November 2009.
  • [15] Detecting Crisis: An AI Solution., 2018. Accessed 2018/11/20.
  • Dibeklioglu et al. [2015] Hamdi Dibeklioglu, Zakia Hammal, Ying Yang, and Jeffrey F Cohn. Multimodal Detection of Depression in Clinical Interviews. ICMI, pages 307–310, 2015.
  • Du et al. [2015a] Yong Du, Yun Fu, and Liang Wang. Skeleton based action recognition with convolutional neural network. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pages 579–583. IEEE, 2015a.
  • Du et al. [2015b] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. CVPR, pages 1110–1118, 2015b.
  • Ebert [1996] D Ebert. Eye-blink rates and depression: Is the antidepressant effect of sleep deprivation mediated by the dopamine system? Neuropsychopharmacology, 15(4):332–339, October 1996.
  • Ekman [2009] P Ekman. Telling lies: Clues to deceit in the marketplace, politics, and marriage (revised edition), 2009.
  • Ekman and Friesen [1978] P Ekman and W V Friesen. Facial coding action system (FACS): A technique for the measurement of facial actions, 1978.
  • Eyben et al. [2010] Florian Eyben, Martin Wöllmer, and Björn Schuller. Opensmile: the munich versatile and fast open-source audio feature extractor. ACM, New York, New York, USA, October 2010.
  • Fairbanks et al. [1982] Lynn A Fairbanks, Michael T McGuire, and Candace J Harris. Nonverbal interaction of patients and therapists during psychiatric interviews. Journal of Abnormal Psychology, 91(2):109–119, 1982.
  • Gierk et al. [2014] Benjamin Gierk, Sebastian Kohlmann, Kurt Kroenke, Lena Spangenberg, Markus Zenger, Elmar Brähler, and Bernd Löwe. The Somatic Symptom Scale–8 (SSS-8). JAMA Internal Medicine, 174(3):399–407, March 2014.
  • Gong and Poellabauer [2017] Yuan Gong and Christian Poellabauer. Topic Modeling Based Multi-modal Depression Detection. AVEC@ACM Multimedia, pages 69–76, 2017.
  • Gratch et al. [2014] Jonathan Gratch, Ron Artstein, Gale M Lucas, Giota Stratou, Stefan Scherer, Angela Nazarian, Rachel Wood, Jill Boberg, David DeVault, Stacy Marsella, David R Traum, Skip Rizzo, and Louis-Philippe Morency. The Distress Analysis Interview Corpus of human and computer interviews. LREC, 2014.
  • [27] Gumtree., 2019. Accessed 2019/06/01.
  • Haggard and Isaacs [1966] Ernest A Haggard and Kenneth S Isaacs. Micromomentary facial expressions as indicators of ego mechanisms in psychotherapy. In Methods of Research in Psychotherapy, pages 154–165. Springer, Boston, MA, 1966.
  • Hamilton [1967] Max Hamilton. Development of a Rating Scale for Primary Depressive Illness. British Journal of Social and Clinical Psychology, 6(4):278–296, December 1967.
  • Hardoon et al. [2006] David R Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical Correlation Analysis: An Overview with Application to Learning Methods., 16(12):2639–2664, March 2006.
  • Huang et al. [2017] Jian Huang, Ya Li, Jianhua Tao, Zheng Lian, Zhengqi Wen, Minghao Yang, and Jiangyan Yi. Continuous Multimodal Emotion Prediction Based on Long Short Term Memory Recurrent Neural Network. AVEC@ACM Multimedia, pages 11–18, 2017.
  • John and Srivastava [1999] Oliver P John and Sanjay Srivastava. The Big Five Trait Taxonomy: History, Measurement, and Theoretical Perspectives. In Handbook of personality Theory and research, pages 102–138., 1999.
  • Joshi et al. [2012] Jyoti Joshi, Abhinav Dhall, Roland Goecke, Michael Breakspear, and Gordon Parker. Neural-Net Classification For Spatio-Temporal Descriptor Based Depression Analysis., 2012.
  • Joshi et al. [2013a] Jyoti Joshi, Abhinav Dhall, Roland Goecke, and Jeffrey F Cohn. Relative Body Parts Movement for Automatic Depression Analysis. ACII, pages 492–497, 2013a.
  • Joshi et al. [2013b] Jyoti Joshi, Roland Goecke, Sharifa Alghowinem, Abhinav Dhall, Michael Wagner, Julien Epps, Gordon Parker, and Michael Breakspear. Multimodal assistive technologies for depression diagnosis and monitoring. J. Multimodal User Interfaces, 7(3):217–228, 2013b.
  • Joshi et al. [2013c] Jyoti Joshi, Roland Goecke, Gordon Parker, and Michael Breakspear. Can body expressions contribute to automatic depression analysis? In 2013 10th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2013), pages 1–7. IEEE, 2013c.
  • Kroenke et al. [2001] Kurt Kroenke, Robert L Spitzer, and Janet B W Williams. The PHQ-9: Validity of a Brief Depression Severity Measure. Journal of General Internal Medicine, 16(9):606–613, 2001.
  • Kroenke et al. [2009] Kurt Kroenke, Tara W Strine, Robert L Spitzer, Janet B W Williams, Joyce T Berry, and Ali H Mokdad. The PHQ-8 as a measure of current depression in the general population. Journal of Affective Disorders, 114(1-3):163–173, April 2009.
  • Liu et al. [2015] Zhao Liu, Jianke Zhu, Jiajun Bu, and Chun Chen. A survey of human pose estimation: The body parts parsing based methods. Journal of Visual Communication and Image Representation, 32:10–19, July 2015.
  • Mahmoud et al. [2013] Marwa Mahmoud, Louis-Philippe Morency, and Peter Robinson. Automatic multimodal descriptors of rhythmic body movement. ACM, December 2013.
  • [41] Mental Health Foundation., 2019. Accessed 2019/06/04.
  • Ory et al. [2013] Marcia G Ory, SangNam Ahn, Luohua Jiang, Kate Lorig, Phillip Ritter, Diana D Laurent, Nancy Whitelaw, and Matthew Lee Smith. National Study of Chronic Disease Self-Management: Six-Month Outcome Findings. Journal of Aging and Health, 25(7):1258–1274, September 2013.
  • Ozdas et al. [2000] A Ozdas, R G Shiavi, S E Silverman, M K Silverman, and D M Wilkes. Analysis of fundamental frequency for near term suicidal risk assessment. In IEEE International Conference on Systems, Man, and Cybernetics, pages 1853–1858. IEEE, 2000.
  • [44] Peer2Peer Cambridge., 2019. Accessed 2019/06/01.
  • Peng et al. [2005] Hanchuan Peng, Fuhui Long, and C Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, August 2005.
  • Scherer et al. [2014] Stefan Scherer, Giota Stratou, Gale Lucas, Marwa Mahmoud, Jill Boberg, Jonathan Gratch, Albert Skip Rizzo, and Louis-Philippe Morency. Automatic audiovisual behavior descriptors for psychological disorder analysis. Image and Vision Computing, 32(10):648–658, October 2014.
  • Sharif Razavian et al. [2014] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN Features Off-the-Shelf: An Astounding Baseline for Recognition. pages 806–813, 2014.
  • Shukla et al. [2017] Parul Shukla, Kanad K Biswas, and Prem K Kalra. Recurrent Neural Network Based Action Recognition from 3D Skeleton Data. In 2017 13th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), pages 339–345. IEEE, 2017.
  • Simon et al. [2017] Tomas Simon, Hanbyul Joo, Iain A Matthews, and Yaser Sheikh. Hand Keypoint Detection in Single Images Using Multiview Bootstrapping. CVPR, cs.CV, 2017.
  • Sobin and Sackeim [2019] Christina Sobin and Harold A Sackeim. Psychomotor symptoms of depression., 2019.
  • Song et al. [2013] Yale Song, Louis-Philippe Morency, and Randall Davis. Learning a sparse codebook of facial and body microexpressions for emotion recognition. In the 15th ACM, pages 237–244, New York, New York, USA, 2013. ACM Press.
  • Spitzer et al. [2006] Robert L Spitzer, Kurt Kroenke, Janet B W Williams, and Bernd Löwe. A Brief Measure for Assessing Generalized Anxiety Disorder. Archives of Internal Medicine, 166(10):1092–1097, May 2006.
  • Srivastava et al. [2003] Sanjay Srivastava, Oliver P John, Samuel D Gosling, and Jeff Potter. Development of personality in early and middle adulthood: Set like plaster or persistent change? Journal of Personality and Social Psychology, 84(5):1041–1053, 2003.
  • Su et al. [2017] Benyue Su, Huang Wu, and Min Sheng. Human action recognition method based on hierarchical framework via Kinect skeleton data. In 2017 International Conference on Machine Learning and Cybernetics (ICMLC), pages 83–90. IEEE, 2017.
  • Syed et al. [2017] Zafi Sherhan Syed, Kirill A Sidorov, and A David Marshall. Depression Severity Prediction Based on Biomarkers of Psychomotor Retardation. AVEC@ACM Multimedia, pages 37–43, 2017.
  • Wei et al. [2017] Shenghua Wei, Yonghong Song, and Yuanlin Zhang. Human skeleton tree recurrent neural network with joint relative motion feature for skeleton based action recognition. In 2017 IEEE International Conference on Image Processing (ICIP), pages 91–95. IEEE, 2017.
  • Wei et al. [2016] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional Pose Machines. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4732. IEEE, 2016.
  • Yang et al. [2017a] Le Yang, Dongmei Jiang, Xiaohan Xia, Ercheng Pei, Meshia Cédric Oveneke, and Hichem Sahli. Multimodal Measurement of Depression Using Deep Learning Models. AVEC@ACM Multimedia, pages 53–59, 2017a.
  • Yang et al. [2017b] Le Yang, Hichem Sahli, Xiaohan Xia, Ercheng Pei, Meshia Cédric Oveneke, and Dongmei Jiang. Hybrid Depression Classification and Estimation from Audio Video and Text Information. AVEC@ACM Multimedia, pages 45–51, 2017b.
  • Zhang et al. [2016] Hong Bo Zhang, Qing Lei, Bi Neng Zhong, Ji Xiang Du, and Jia Lin Peng. A Survey on Human Pose Estimation. Intelligent Automation and Soft Computing, 22(3):483–489, July 2016.