M^3T: Multi-Modal Continuous Valence-Arousal Estimation in the Wild

This report describes a multi-modal multi-task (M^3T) approach underlying our submission to the valence-arousal estimation track of the Affective Behavior Analysis in-the-wild (ABAW) Challenge, held in conjunction with the IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2020. In the proposed M^3T framework, we fuse visual features from videos with acoustic features from the audio tracks to estimate valence and arousal. The spatio-temporal visual features are extracted with a 3D convolutional network and a bidirectional recurrent neural network. Considering the correlations between valence / arousal, emotions, and facial actions, we also explore mechanisms to benefit from other tasks. We evaluated the M^3T framework on the validation set provided by ABAW, where it significantly outperforms the baseline method.





I Introduction

Automatically understanding human affect is of great importance in human-machine interaction. Psychologists have developed the circumplex model of emotion [23] to describe people's states of feeling. In this model, valence (i.e., how positive or negative an emotion is) and arousal (i.e., how powerful an emotion is) are the two dimensions that can be linked to affective and cognitive responses. Researchers in computer science have made great efforts to estimate these affective states (i.e., valence and arousal) from visual or audio signals [16, 13, 14, 11].

However, valence and arousal are not the only way to represent human affect. There are two other widely adopted approaches: through categorical emotions [5] (e.g., happiness, sadness, anger, fear, disgust, surprise, etc.), and through facial actions (e.g., facial action units defined by the Facial Action Coding System [6]). The connections between the affective dimensions and the two other approaches are often ignored. Recently, Kollias et al. [16, 15] collected a large-scale in-the-wild dataset, Aff-Wild2, which is annotated not only with valence and arousal, but also with categorical emotions and eight facial action units. Based upon this newly collected benchmark, we propose a multi-modal multi-task (M^3T) framework to estimate continuous valence and arousal, where valence-arousal estimation benefits from the emotion recognition task.

Fig. 1 illustrates the main idea of the proposed M^3T framework. Given the videos and their corresponding audio tracks, M^3T first extracts the visual features through a multi-task visual subnetwork, and extracts the audio features with an acoustic subnetwork. In the multi-task visual subnetwork, we explore two mechanisms to benefit from the other tasks: training with losses for several tasks, and concatenating features from different tasks. Then, a late-fusion mechanism is used to fuse the two sets of features. The fused features are used to estimate valence and arousal.

Fig. 1: Overview of the proposed multi-modal multi-task (M^3T) framework.

II The Multi-Modal Multi-Task (M^3T) Framework

The proposed M^3T framework consists of three parts: the multi-task visual network, the acoustic network, and the multi-modal feature fusion module. In this section, we provide details of the three components.

Fig. 2: Architecture of the multi-task visual sub-network. Given the input clip, 3D features for arousal and valence are extracted from two-branch 3D convolutional blocks. Then, the 3D features are concatenated with 2D features pre-trained for other tasks. The concatenated features for each frame are encoded by bi-GRUs and passed to fully-connected layers for final predictions.

II-A Multi-Task Visual Network

We use categorical emotion recognition and AU detection to assist valence and arousal estimation, because arousal can be reflected by the intensity of facial actions, while valence estimation is highly related to categorical emotion recognition. Valence indicates how pleasant or unpleasant a person feels. It is intuitive to rate a “happy” emotion with a high valence score, and a “sad” emotion with a low valence score. Due to the differences between valence and arousal, the multi-task visual network has two branches, one for each affective dimension.

Fig. 2 shows the details of the multi-task visual network. The multi-task visual network follows the V2P architecture [24], which has been successfully applied to visual speech recognition. Under this architecture, we first extract spatio-temporal features with 3D convolutional blocks from a given video clip, and then aggregate these features through bidirectional recurrent neural networks.

The multi-task visual network leverages the information from categorical emotions and facial actions with two mechanisms. First, we consider both the 3D features from the 3D convolutional blocks and the 2D features from a 2D network. The 2D static features are extracted from a pretrained emotion recognition model and an AU detection model. Second, the architecture is trained with losses for multiple tasks: a loss for valence estimation, a loss for arousal estimation, and the cross-entropy loss for emotion recognition. Below, we present details of the 3D convolutional blocks, 2D static features, recurrent layers, and the losses.

3D convolutional blocks: As shown in Fig. 2, given input frames, we use a 3D VGG-like backbone to extract spatio-temporal features for every frame. Considering the difference between valence and arousal, the overall visual network has two branches, i.e., a valence branch and an arousal branch. In the 3D convolutional blocks, the two branches share the first three convolutional layers. Then, each branch has two specific convolutional layers. Therefore, through the 3D convolutional blocks, we obtain 3D features for valence and arousal, respectively. To initialize the network properly, we first pretrain the model for video-based face recognition on selected identities in the development set of the VoxCeleb2 dataset [3], which consists of talking faces recorded under a variety of in-the-wild conditions.

2D static features: We extract emotion and AU features for each individual frame as 2D static features. Considering their different correlations with valence and arousal, we initially concatenate emotion features with the 3D valence features, and AU features with the 3D arousal features. It is also possible to concatenate emotion features with the arousal features; we discuss the impact of this choice in Sec. III-C.

For emotion features, we use -dimensional features from the average-pooling layer of an SENet-101 [7] model pretrained on a large-scale dataset for facial expression recognition, which is a union of AffectNet [20], RAF-DB [17], and privately collected images. For AU features, we use the self-supervised -dimensional encoder features using TCAE [18], which is trained on the union of VoxCeleb1 [21] and VoxCeleb2 [3] datasets. As these models are computationally expensive and trained at a different resolution (), we dump frozen features to disk, and retrieve pre-computed features on-the-fly during training.

Recurrent layers: Then, for each time step, the corresponding 3D features and 2D features are concatenated along the channel dimension, and fed to a -layer, -cell bidirectional recurrent neural network with Gated Recurrent Units (GRU) before two fully-connected layers. After the fully-connected layers, we obtain the expected outputs, i.e., the estimated valence and arousal.
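The per-frame concatenation described above can be illustrated with a small sketch. It only shows the channel-wise joining of the two feature streams before the recurrent layers; the function name `concat_per_frame` and the toy feature sizes are hypothetical and not taken from the paper.

```python
from typing import List

Vector = List[float]


def concat_per_frame(feats_3d: List[Vector], feats_2d: List[Vector]) -> List[Vector]:
    """Join 3D and 2D features along the channel axis, one time step at a time."""
    if len(feats_3d) != len(feats_2d):
        raise ValueError("both streams must cover the same number of time steps")
    return [f3 + f2 for f3, f2 in zip(feats_3d, feats_2d)]


# Toy example: 2 time steps, 2-dim 3D valence features, 3-dim emotion features.
valence_3d = [[0.1, 0.2], [0.3, 0.4]]
emotion_2d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
valence_input = concat_per_frame(valence_3d, emotion_2d)
```

Each resulting vector would then be one input step of the bidirectional GRU for the corresponding branch.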

Losses $\mathcal{L}_V$, $\mathcal{L}_A$ for valence and arousal estimation: We aim to maximize the agreement between the annotations and predictions by maximizing their Concordance Correlation Coefficients (CCC). Therefore, we minimize the losses $\mathcal{L}_V$ and $\mathcal{L}_A$, respectively:

$$\mathcal{L}_V = 1 - \rho_V, \qquad \mathcal{L}_A = 1 - \rho_A,$$

where $\rho_V$ (resp. $\rho_A$) is the CCC between the ground truth valence (resp. arousal) values and the predicted valence (resp. arousal) values. For an explanation of CCC, please refer to Eq. (7) in Sec. III-A.

Loss $\mathcal{L}_E$ for emotion recognition: This is the standard categorical cross-entropy loss for classification:

$$\mathcal{L}_E = -\sum_{i=1}^{7} y_i \log \hat{y}_i,$$

where $y_i$ is the ground truth and $\hat{y}_i$ is the prediction for each of the seven expression classes. Note that we only apply this loss to frames that are annotated with expression.
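Restricting the cross-entropy to annotated frames can be sketched as follows. This is a minimal illustration, assuming one-hot ground truth (so the sum over classes reduces to the negative log-probability of the labelled class) and using `None` as a hypothetical marker for frames without an expression label.

```python
import math
from typing import List, Optional


def masked_cross_entropy(probs: List[List[float]], labels: List[Optional[int]]) -> float:
    """Mean cross-entropy over the frames that carry an expression label.

    probs[t] is the predicted class distribution for frame t;
    labels[t] is the class index, or None if the frame is not annotated.
    """
    losses = [
        -math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        for p, y in zip(probs, labels)
        if y is not None
    ]
    # With no annotated frame in the clip, the term contributes nothing.
    return sum(losses) / len(losses) if losses else 0.0
```

Unlabelled frames are skipped entirely, so they neither add to the loss nor dilute the average.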

The final loss is a weighted average of the two losses:


where is a balancing factor used to stabilize training. We do not further exploit AU presence labels, as only of the videos have such annotations, and adding this information did not improve performance in our preliminary experiments.

II-B Acoustic Network

The input acoustic features are stacked log-Mel spectrogram energies, synchronized with the video to yield one -dimensional feature per time step. The features within each window are encoded with a -layer, -cell bidirectional GRU, and passed through an MLP with two fully-connected layers which transform the features as .

II-C Multi-Modal Feature Fusion Module

To train the audio-visual joint model, we remove the fully-connected layers from the single-modality models. The output -dimensional visual features are first projected to dimensions with a fully-connected layer, and concatenated with the -dimensional GRU outputs from the audio subnetwork, in a late-fusion fashion. We finally pass the concatenated features through a two-layer bidirectional GRU and two fully-connected layers to obtain the final predictions.

We note that visual and acoustic information are not equally informative at each time step; for example, the person may be acting in silence, or temporarily invisible. Inspired by this observation, instead of simple feature concatenation, we propose another fusion scheme, which aggregates information from the two modalities using an attention mechanism.

Formally, denote the visual features at time $t$ by $v_t$ and the acoustic features by $a_t$. We implement two scoring functions, $f_v$ and $f_a$, which derive “quality scores” for the visual and audio modalities based on the features. In our experiments, we instantiate each scoring function as a one-layer bi-GRU with hidden units, followed by the sigmoid function. Finally, we compute a fused representation for each time step, which is then passed to the GRU and fully-connected layers for final predictions.

Fig. 3: An attentional multi-modal feature fusion scheme. At each time step, we re-weight audio and visual information with an attention mechanism, and aggregate the features for final prediction.
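A minimal sketch of such a re-weighting step is given below. It makes two assumptions that the text does not confirm: the scoring functions are replaced by a simple linear score followed by a sigmoid (the paper uses a bi-GRU), and the fused representation is taken to be the concatenation of the re-weighted features. All names are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def quality_score(features: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Stand-in scoring function: linear score squashed to (0, 1)."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def attention_fuse(visual: Sequence[float], audio: Sequence[float],
                   w_v: Sequence[float], w_a: Sequence[float]) -> List[float]:
    """Scale each modality by its quality score, then concatenate (assumed form)."""
    s_v = quality_score(visual, w_v)
    s_a = quality_score(audio, w_a)
    return [s_v * v for v in visual] + [s_a * a for a in audio]


# With all-zero scoring weights both scores are 0.5, so features are halved.
fused = attention_fuse([2.0, 4.0], [1.0], w_v=[0.0, 0.0], w_a=[0.0])
```

The intent is that a modality judged unreliable at a given time step (for example, a silent segment) receives a score near zero and contributes little to the fused vector.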

III Experiments

III-A Implementation Details

Dataset. We use the Aff-Wild2 dataset [15, 16], which contains videos with annotations for valence-arousal, facial expression and facial action units. It is an extension of the previous Aff-Wild dataset [28], and currently the largest audiovisual in-the-wild database annotated for valence and arousal. According to the partition and annotations provided by the ABAW 2020 challenge organizers, there are , , and subjects in the training, validation and test subsets respectively for the VA estimation track.

Video preprocessing. We first run face detection on the provided videos using the RetinaFace detector [4] with the ResNet-50 backbone. The detected faces are grouped with an IoU-based tracker, aligned to a canonical template using the five detected landmarks with a similarity transformation, and cropped to . Additionally, we smooth the facial landmarks with a temporal Gaussian kernel, but only if the variance of the bounding box coordinates is below a conservative threshold (since the subjects sometimes move dramatically, e.g., jumping for joy, lying down on a bed, or riding a roller coaster). A vector of zeros is used in lieu of the absent visual frames or features during evaluation.

Footnote 1: We were able to extract frames, which account for of the dataset.
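The conditional landmark smoothing described in the video preprocessing step can be sketched as below for a single coordinate track. The kernel radius, `sigma`, the edge handling, and the threshold `max_var` are illustrative assumptions; the paper does not state these values.

```python
import math
from statistics import pvariance
from typing import List


def gaussian_kernel(radius: int, sigma: float) -> List[float]:
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_track(values: List[float], radius: int = 2, sigma: float = 1.0) -> List[float]:
    """Temporal Gaussian smoothing with replicated edges (assumed edge handling)."""
    kernel = gaussian_kernel(radius, sigma)
    n = len(values)
    out = []
    for t in range(n):
        acc = 0.0
        for k, w in zip(range(-radius, radius + 1), kernel):
            acc += w * values[min(max(t + k, 0), n - 1)]
        out.append(acc)
    return out


def maybe_smooth(landmark: List[float], box: List[float], max_var: float) -> List[float]:
    """Smooth only when the bounding-box coordinate is stable over the track."""
    if pvariance(box) < max_var:
        return smooth_track(landmark)
    return list(landmark)  # large motion: keep raw landmarks
```

Gating on the bounding-box variance avoids blurring landmark positions when the subject genuinely moves a lot within the track.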

Audio preprocessing. We sample audio at a conventional kHz rate, and extract -dimensional log-Mel spectrograms. Since the dataset comprises videos recorded at different frame rates ( fps to fps), we adopt the solution proposed in [19] to extract synchronized audio features, by advancing the analysis window at a rate proportional to the video frame rate. By stacking extra context frames from both directions, we obtain -dimensional feature vectors for each video frame. Videos recorded at fps or lower are discarded while training with only the audio stream, and a vector of zeros is used in lieu of the features otherwise.
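The synchronization idea can be illustrated with a short sketch: the hop between analysis windows is tied to the video frame rate so that one audio frame is produced per video frame, and neighbouring frames are stacked as context. The sampling rate and frame rate in the example, the one-to-one frame alignment, and the replicated edges are assumptions for illustration, not values from the paper.

```python
from typing import List


def hop_length_samples(sample_rate_hz: int, video_fps: float) -> float:
    """Hop (in audio samples) that yields one analysis frame per video frame."""
    return sample_rate_hz / video_fps


def stack_context(frames: List[List[float]], context: int) -> List[List[float]]:
    """Concatenate each frame with `context` neighbours on both sides.

    Edge frames are replicated where neighbours are missing (an assumption).
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        vec: List[float] = []
        for k in range(-context, context + 1):
            vec.extend(frames[min(max(t + k, 0), n - 1)])
        stacked.append(vec)
    return stacked


# Hypothetical numbers: 16 kHz audio aligned to a 25 fps video.
hop = hop_length_samples(16000, 25.0)
stacked = stack_context([[1.0], [2.0], [3.0]], context=1)
```

With a feature dimension of d per frame and c context frames on each side, every stacked vector has (2c + 1) * d entries.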

Data augmentation. During training, we take a random crop at the same position for each frame, and apply random horizontal flipping to the entire sequence. During evaluation, we take a central crop.
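A minimal sketch of this clip-level augmentation follows. The crop offset and the flip decision are drawn once per clip and then applied to every frame, which keeps the sequence temporally consistent. Frames are represented as nested lists of pixel values for brevity, and the 0.5 flip probability is an assumption.

```python
import random
from typing import List

Frame = List[List[float]]  # H x W, single channel for brevity


def augment_clip(clip: List[Frame], crop_h: int, crop_w: int,
                 rng: random.Random, train: bool = True) -> List[Frame]:
    """Apply one shared crop (and optional horizontal flip) to all frames."""
    height, width = len(clip[0]), len(clip[0][0])
    if train:
        top = rng.randint(0, height - crop_h)
        left = rng.randint(0, width - crop_w)
        flip = rng.random() < 0.5
    else:
        # Evaluation: deterministic central crop, no flip.
        top = (height - crop_h) // 2
        left = (width - crop_w) // 2
        flip = False
    out = []
    for frame in clip:
        rows = [row[left:left + crop_w] for row in frame[top:top + crop_h]]
        if flip:
            rows = [row[::-1] for row in rows]
        out.append(rows)
    return out
```

Sampling the parameters per frame instead would introduce artificial jitter that the 3D convolutions could mistake for motion.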

Evaluation metric. The official metric for the challenge is the Concordance Correlation Coefficient (CCC), which is defined as

$$\rho = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}, \tag{7}$$

where $x$ and $y$ are the ground truth annotations and the predicted values, $s_x^2$ and $s_y^2$ are their variances, $\bar{x}$ and $\bar{y}$ are their mean values, and $s_{xy}$ is the covariance. CCC takes values in $[-1, 1]$, where $+1$ indicates perfect concordance and $-1$ indicates perfect discordance. A higher mean of the valence and arousal CCCs is desired for the valence-arousal estimation task.
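For concreteness, the CCC in its standard form (twice the covariance divided by the sum of the two variances and the squared difference of the means) can be computed as in the sketch below. The helper `ccc_loss` assumes the common 1 - CCC formulation for a training loss; function names are illustrative.

```python
from typing import Sequence


def ccc(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Concordance correlation coefficient using population statistics."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((a - mean_t) ** 2 for a in y_true) / n
    var_p = sum((b - mean_p) ** 2 for b in y_pred) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(y_true, y_pred)) / n
    # Undefined (division by zero) if both sequences are constant and equal.
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


def ccc_loss(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Loss that decreases as concordance increases (assumed 1 - CCC form)."""
    return 1.0 - ccc(y_true, y_pred)
```

Unlike Pearson correlation, the CCC penalizes a constant offset: predictions [2, 3, 4] against targets [1, 2, 3] are perfectly correlated but score only 4/7.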

III-B Experimental Settings

We implemented our network in PyTorch [22]. The network is trained on servers with NVIDIA Titan Xp GPUs, each with GB memory, and optimized with the Adam optimizer [10] using default parameters. The inputs are batches of clips, normalized to . We use cyclical learning rates [25] with and for single-stream training, and manual LR decay starting with for joint training. Weight decay is set to .
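As an illustration of a cyclical learning rate, the sketch below implements the triangular policy from [25], in which the rate rises linearly from a lower to an upper bound and back over each cycle. Whether the triangular variant was the one used here is an assumption, and the bounds and step size in the example are hypothetical.

```python
import math


def triangular_lr(iteration: int, base_lr: float, max_lr: float, step_size: int) -> float:
    """Triangular cyclical learning rate; one full cycle spans 2 * step_size iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Hypothetical bounds and step size, for illustration only.
schedule = [triangular_lr(i, base_lr=1e-4, max_lr=1e-3, step_size=100) for i in (0, 50, 100, 200)]
```

The rate starts at `base_lr`, peaks at `max_lr` after `step_size` iterations, and returns to `base_lr` at the end of the cycle.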

We sample -frame windows (i.e. in Fig. 2) from each video in the training set during one epoch. During inference, we segment test videos into non-overlapping clips. The model is trained in three stages: first, we train single-modality models with the CCC loss (for the audio subnetwork), or with the multi-task loss (for the visual subnetwork). Next, we initialize the fusion network by training for three epochs while keeping the visual and audio encoders frozen. Finally, the network is fine-tuned end-to-end.
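The two clip-selection modes can be sketched as index arithmetic: a randomly placed fixed-length window for training, and consecutive non-overlapping windows for inference. The window length in the example and the handling of a short final clip are assumptions.

```python
import random
from typing import List, Tuple


def sample_training_window(num_frames: int, window: int, rng: random.Random) -> Tuple[int, int]:
    """Pick a random [start, end) window; short videos are used whole."""
    if num_frames <= window:
        return (0, num_frames)
    start = rng.randint(0, num_frames - window)
    return (start, start + window)


def inference_clips(num_frames: int, window: int) -> List[Tuple[int, int]]:
    """Split a video into consecutive non-overlapping [start, end) clips.

    The last clip may be shorter than `window` (assumed handling).
    """
    return [(s, min(s + window, num_frames)) for s in range(0, num_frames, window)]


clips = inference_clips(10, 4)
```

Non-overlapping clips ensure that every test frame receives exactly one prediction.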

III-C Results and Discussion

Method | Modality: V | Modality: A | CCC: Valence | CCC: Arousal | CCC: Mean
Baseline (PatchGAN) [12]
SENet-50, fine-tuned [29]
GRU, scratch
Ours, scratch, w/o MTL
+ SE (V) / TCAE (A)
+ SE (V) / SE (A)
  + VoxCeleb2 pretrain
Ours, scratch
+ SE (V) / SE (A)
  + VoxCeleb2 pretrain
Audio-visual (M^3T)
Concat fusion
Attn. fusion
TABLE I: Results for valence-arousal estimation on the validation set of ABAW Challenge 2020.

We report our results on the official validation set of the ABAW 2020 Challenge [12] in Table I. Our best performing model achieves a mean CCC of , which is a significant improvement over previous results.

We make several observations. First, the proposed model achieves strong arousal estimation performance, which can be attributed to the use of 3D convolutions, which capture temporal dynamics at an early stage. Second, similar to the finding in [16], the combination of audio and video yields noticeable improvements for arousal estimation, but not for valence estimation. Third, interestingly, better results can be obtained by concatenating SENet-101 features instead of the AU features to the arousal branch. Finally, attention-based fusion slightly underperforms concatenation fusion. This might be because our models had not reached full convergence at the time of submission.

How does pretraining affect performance? State-of-the-art results for valence-arousal estimation with static frames use pretrained VGG-Face descriptors [16, 13]. To the best of our knowledge, we are the first to apply a 3D ConvNet to in-the-wild valence-arousal estimation. To understand the role of pretraining in our case, we report results of both training from scratch on Aff-Wild2, and using external pretraining (on VoxCeleb2). As illustrated in Table I, similar to the case with 2D ConvNets for VA estimation, pretraining also boosts the performance of our 3D backbone. It can be argued that this improvement is due to the model being initialized with a strong facial feature extractor that is more robust to identity and lighting conditions, which also leads to much faster convergence.

Due to time constraints, the numbers reported for the audio-visual models here do not use VoxCeleb2 pretraining. We will update the corresponding numbers as soon as the results are available.

IV Conclusion

In this report, we have described a multi-modal, multi-task learning framework named M^3T for continuous valence-arousal estimation, which has been used in our entry to the ABAW challenge at FG 2020. The proposed framework leverages 3D and 2D ConvNet visual features, categorical emotion labels, as well as audio information. Our results show significant improvements over the baseline on the ABAW Challenge validation set.


We thank Xuran Sun for providing us with the pre-trained SENet FER model.


  • [1] D. Aspandi, A. Mallol-Ragolta, B. Schuller, and X. Binefa (2020) Adversarial-based neural network for affect estimations in the wild. arXiv:2002.00883.
  • [2] Y. Bengio and Y. LeCun (Eds.) (2015) International Conference on Learning Representations.
  • [3] J. S. Chung, A. Nagrani, and A. Zisserman (2018) VoxCeleb2: deep speaker recognition. In Annual Conference of the International Speech Communication Association [27], pp. 1086–1090.
  • [4] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou (2019) RetinaFace: single-stage dense face localisation in the wild. CoRR abs/1905.00641.
  • [5] P. Ekman (1992) An argument for basic emotions. Cognition & Emotion 6 (3-4), pp. 169–200.
  • [6] E. Friesen and P. Ekman (1978) Facial action coding system: a technique for the measurement of facial movement. Palo Alto 3.
  • [7] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
  • [8] (2017) IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society.
  • [9] (2017) IEEE Winter Conference on Applications of Computer Vision. IEEE Computer Society. ISBN 978-1-5090-4822-9.
  • [10] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations [2].
  • [11] D. Kollias, M. A. Nicolaou, I. Kotsia, G. Zhao, and S. Zafeiriou (2017) Recognition of affect in the wild using deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1972–1979.
  • [12] D. Kollias, A. Schulc, E. Hajiyev, and S. Zafeiriou (2020) Analysing affective behavior in the first ABAW 2020 competition. arXiv:2001.11409.
  • [13] D. Kollias, P. Tzirakis, M. A. Nicolaou, A. Papaioannou, G. Zhao, B. W. Schuller, I. Kotsia, and S. Zafeiriou (2019) Deep affect prediction in-the-wild: Aff-Wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision 127 (6-7), pp. 907–929.
  • [14] D. Kollias and S. Zafeiriou (2018) A multi-task learning & generation framework: valence-arousal, action units & primary expressions. CoRR abs/1811.07771.
  • [15] D. Kollias and S. Zafeiriou (2018) Aff-Wild2: extending the Aff-Wild database for affect recognition. CoRR abs/1811.07770.
  • [16] D. Kollias and S. Zafeiriou (2019) Expression, affect, action unit recognition: Aff-Wild2, multi-task learning and ArcFace. CoRR abs/1910.04855.
  • [17] S. Li, W. Deng, and J. Du (2017) Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In IEEE Conference on Computer Vision and Pattern Recognition [8], pp. 2584–2593.
  • [18] Y. Li, J. Zeng, S. Shan, and X. Chen (2019) Self-supervised representation learning from videos for facial action unit detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 10924–10933.
  • [19] T. Makino, H. Liao, Y. M. Assael, B. Shillingford, B. Garcia, O. Braga, and O. Siohan (2019) Recurrent neural network transducer for audio-visual speech recognition. CoRR abs/1911.04890.
  • [20] A. Mollahosseini, B. Hassani, and M. H. Mahoor (2019) AffectNet: a database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affective Computing 10 (1), pp. 18–31.
  • [21] A. Nagrani, J. S. Chung, and A. Zisserman (2017) VoxCeleb: a large-scale speaker identification dataset. In Annual Conference of the International Speech Communication Association, pp. 2616–2620.
  • [22] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Annual Conference on Neural Information Processing Systems [26], pp. 8024–8035.
  • [23] J. A. Russell (1980) A circumplex model of affect. Journal of Personality and Social Psychology 39 (6), pp. 1161.
  • [24] B. Shillingford, Y. M. Assael, M. W. Hoffman, T. Paine, C. Hughes, U. Prabhu, H. Liao, H. Sak, K. Rao, L. Bennett, M. Mulville, B. Coppin, B. Laurie, A. W. Senior, and N. de Freitas (2018) Large-scale visual speech recognition. CoRR abs/1807.05162.
  • [25] L. N. Smith (2017) Cyclical learning rates for training neural networks. In IEEE Winter Conference on Applications of Computer Vision [9], pp. 464–472.
  • [26] H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.) (2019) Annual Conference on Neural Information Processing Systems.
  • [27] B. Yegnanarayana (Ed.) (2018) Annual Conference of the International Speech Communication Association. ISCA.
  • [28] S. Zafeiriou, D. Kollias, M. A. Nicolaou, A. Papaioannou, G. Zhao, and I. Kotsia (2017) Aff-Wild: valence and arousal ’in-the-wild’ challenge. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1980–1987.
  • [29] Z. Zhang and J. Gu (2020) Facial affect recognition in the wild using multi-task learning convolutional network. arXiv:2002.00606.