An audiovisual and contextual approach for categorical and continuous emotion recognition in-the-wild

07/07/2021 ∙ by Panagiotis Antoniadis, et al. ∙ National Technical University of Athens

In this work we tackle the task of video-based audio-visual emotion recognition, within the context of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Poor illumination conditions, head/body orientation and low image resolution are factors that can hinder the performance of methodologies that rely solely on the extraction and analysis of facial features. In order to alleviate this problem, we leverage bodily as well as contextual features, as part of a broader emotion recognition framework. We use a standard CNN-RNN cascade as the backbone of our proposed model for sequence-to-sequence (seq2seq) learning. Apart from learning through the RGB input modality, we construct an aural stream which operates on sequences of extracted mel-spectrograms. Our extensive experiments on the challenging and newly assembled Affect-in-the-wild-2 (Aff-Wild2) dataset verify the superiority of our methods over existing approaches. By incorporating all of the aforementioned modules in a network ensemble, we surpass the previous best published recognition scores on the official validation set. All the code was implemented using PyTorch and is publicly available.




Code Repositories


Code of the team NTUA-CVSP for the ABAW2 Competition in ICCV 2021


1 Introduction

Automatic affect recognition constitutes a subject of rigorous studies across several scientific disciplines and bears immense practical importance as it has extensive applications in environments that involve human-robot cooperation, sociable robotics, medical treatment, psychiatric patient surveillance and many other human-computer interaction scenarios.

Representing human emotions has been a basic topic of research in psychology. While the cultural and ethnic background of a person can affect their expressive style, Ekman indicated that humans perceive certain basic emotions in the same way regardless of their culture [ekman1971constants], [ekman1994strong]. These six universal facial expressions (happiness, sadness, surprise, fear, disgust and anger) constitute the categorical model. Contempt was subsequently added as one of the basic emotions [matsumoto1992more]. Due to its direct and intuitive definition of facial expressions, the categorical model is used in the majority of emotion recognition algorithms [jung2015joint], [hasani2017facial], [zhao2016peak] and large-scale databases (MMI [pantic2005web], AFEW [dhall2013emotion], FER-Wild [mollahosseini2016facial], etc). However, the subjectivity and ambiguity of restricting human emotion to discrete categories result in large intra-class variations and small inter-class differences.

Recently, the dimensional model proposed by Russell [russell1980circumplex] has gained a lot of attention. In this model, emotion is described using two latent dimensions: valence (how pleasant or unpleasant a feeling is) and arousal (how likely the person is to take action under the emotional state). A third dimension, called dominance, is sometimes used to indicate whether or not the person is in control of the situation. Since a continuous representation can distinguish between subtly different displays of affect and encode small changes in intensity, some recent algorithms [nicolaou2011continuous], [chang2017fatauva], [kollias2019expression] and databases (Aff-Wild [zafeiriou2017aff], [kollias2019deep], Aff-Wild2 [kollias2019expression], OMG-Emotion [barros2018omg], AFEW-VA [kossaifi2017afew], etc) have utilized the dimensional model for uncontrolled emotion recognition. Even so, predicting a 2-dimensional continuous value instead of a category considerably increases the task complexity and is less intuitive.

The remainder of the paper is structured as follows: First, we provide an overview of the latest and most notable related work in the domain of video-based emotion recognition in-the-wild. Subsequently, we analyze our proposed model architecture. Next, we present our experimental results on the Aff-Wild2 dataset, followed by concluding remarks.

2 Related Work

Emotion recognition has been extensively studied for many years using different representations of human emotion, like basic facial expressions, action units and valence-arousal.

Recently, many studies have tried to leverage all emotion representations and jointly learn these three facial behavior tasks. Kollias et al. [kollias2019face] proposed FaceBehaviorNet, the first study of all facial behavior tasks learned jointly in a single holistic framework. They utilized many publicly available emotion databases and proposed two strategies for coupling the tasks during training. Later, Kollias et al. released the Aff-Wild2 dataset [kollias2019expression], the first large-scale in-the-wild database containing annotations for all 3 main behavior tasks. They also proposed multitask learning models that employ both visual and audio modalities and suggested using the ArcFace loss [deng2019arcface] for expression recognition. In an additional work, Kollias et al. [kollias2021affect], [kollias2021distribution] studied the problem of non-overlapping annotations in multitask learning datasets. They explored task-relatedness and proposed a novel distribution matching approach, in which knowledge exchange is enabled between tasks via matching of their predictions' distributions. Last year, the First Affective Behavior Analysis in-the-wild (ABAW) Competition [kollias2020analysing] was held in conjunction with the IEEE Conference on Face and Gesture Recognition 2020. The competition contributed to advancing state-of-the-art methods for dimensional, categorical and facial action unit analysis and recognition using the Aff-Wild2 dataset.

Figure 1: Overview of our proposed method.

3 Method

A complete schematic diagram of our proposed model is shown in Fig. 1. Firstly, we will present the structure of the subnetwork regarding the RGB visual modality, along with our proposed extensions for the enhancement of emotion understanding. Next, we will do the same for the aural model, and finally we will present the unified visual-aural architecture.

3.1 Seq2Seq

In order to leverage temporal information and emotion labels throughout each video, our method takes as input sequences of frames that contain either visual or aural information (described in the following subsections). After extracting an intermediate feature representation for each sequence of frames, we use an LSTM (Long Short-Term Memory) [hochreiter1997long] model, which maps the input sequence of features to output labels.

3.2 Visual Model

A single RGB image usually encodes static appearance at a specific point in time but lacks contextual information about previous and next frames. We aspire to enhance the descriptive capacity of the extracted deep visual embeddings through the feature-level combination of multiple feature extractors that focus on different parts of the human instance. In our implementation, we utilize the human face as our primary source of affective information, while we also make use of the body and the surrounding depicted environment in a supplementary manner. For all of the subsequent convolutional branches, we use a standard 50-layer ResNet [he2016deep] as our feature extractor backbone. The ResNet-50 variant produces a 2048-dim deep feature vector representation for each given input image. All ConvNet backbones are pre-trained using various task-specific datasets. As discussed later on, pre-training constitutes the main differentiating factor among the multiple ConvNet backbones that comprise our visual model.

3.2.1 Face

The human face is commonly perceived as the window to the soul and the most expressive source of visual affective information. We introduce an input stream which explicitly operates on the aligned face crops of the primary depicted human agents. The localization, extraction and alignment of face regions per frame was carried out in advance by the official distributors of the Aff-Wild2 dataset. For frames where face detection and alignment has failed and the corresponding face crops are missing, we feed the ConvNet feature extractor with an input tensor full of zeros.

The ConvNet backbone of the face branch is manually pre-trained on AffectNet [mollahosseini2017affectnet], which constitutes the largest facial expression database, containing over 1M images annotated on both a categorical and a dimensional level. We pre-trained the face branch on AffectNet for 5 epochs using a batch size of 64 and a learning rate of 0.001, achieving 64.09% validation accuracy.

3.2.2 Context

We incorporate a context stream in the form of RGB frames whose primary depicted agents have been masked out. For the acquisition of the masks we use body bounding boxes and multiply them element-wise with a constant value of zero. Prior to the acquisition of the body bounding boxes and the corresponding masks, we calculate the 2D coordinates for 25 joints of the body of the primary depicted agent using the BODY25 model of the OpenPose [cao2018openpose] publicly available toolkit. Let be the detected set of horizontal and vertical joint coordinates of a given agent, at frame . The bounding box of the agent within a given image of height and width , is calculated as follows:


All joints with a detection confidence score that is less than 10%, are discarded. The masked image for a given input image is calculated as follows:


where the tuple corresponds to all valid pixel locations. Throughout our experiments we set and .
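A minimal sketch of the masking step described above, assuming the joints arrive as rows of (x, y, confidence) in pixel coordinates, as a BODY25-style detector would provide. The function names and the `margin_x` / `margin_y` padding arguments are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def body_bounding_box(joints, height, width, min_conf=0.10, margin_x=0, margin_y=0):
    """Return (x_min, y_min, x_max, y_max), clipped to the image, or None.

    joints: array of shape (num_joints, 3) holding (x, y, confidence).
    The box is half-open: columns x_min..x_max-1 and rows y_min..y_max-1.
    margin_x / margin_y are hypothetical padding values in pixels.
    """
    joints = np.asarray(joints, dtype=float)
    # Discard joints whose detection confidence is below 10%.
    valid = joints[joints[:, 2] >= min_conf]
    if valid.shape[0] == 0:
        return None
    x_min = max(0, int(np.floor(valid[:, 0].min())) - margin_x)
    y_min = max(0, int(np.floor(valid[:, 1].min())) - margin_y)
    x_max = min(width, int(np.ceil(valid[:, 0].max())) + margin_x + 1)
    y_max = min(height, int(np.ceil(valid[:, 1].max())) + margin_y + 1)
    return x_min, y_min, x_max, y_max

def mask_agent(image, box):
    """Zero out every pixel inside the box; leave the rest of the frame intact."""
    masked = image.copy()
    if box is not None:
        x_min, y_min, x_max, y_max = box
        masked[y_min:y_max, x_min:x_max] = 0
    return masked

# Toy frame: 6 rows x 8 columns, all white; the third joint is too uncertain to count.
image = np.full((6, 8, 3), 255, dtype=np.uint8)
joints = np.array([[2.0, 1.0, 0.9], [5.0, 4.0, 0.8], [7.0, 5.0, 0.05]])
box = body_bounding_box(joints, height=6, width=8)
masked = mask_agent(image, box)
```

Returning `None` when no joint passes the confidence threshold leaves the frame unmasked, which is one possible fallback; other choices are equally plausible.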

Contextual feature extraction is a scene-centric task, and therefore we choose to initialize the corresponding ConvNet backbone using Places365-Standard [zhou2016places], a large-scale database of photographs labeled with scene semantic categories. The pre-trained model is publicly available.

3.2.3 Body

We incorporate an additional input stream that focuses solely on encoding bodily expressions of emotion, with the aim of alleviating the problem of undetected or misaligned face crops. The contribution of the body branch to the emotion recognition process becomes more evident during frames where the corresponding face crops are missing. The body stream operates either on the bounding box of the depicted agents or on the entire image. The ConvNet feature extractor is pre-trained on the object-centric ImageNet [deng2009imagenet] database. The pre-trained weights are publicly available.

The early feature fusion of all of the aforementioned input streams results in a 6144-dim concatenated feature vector. Subsequently, the fused features are fed into a bidirectional, single-layer LSTM, with 512 hidden units, for further temporal modeling.
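The fusion and temporal modeling step can be sketched in PyTorch as follows. This is only an illustration under stated assumptions: the three 2048-dim per-frame feature tensors are taken as given (random stand-ins here rather than real ResNet-50 outputs), and the class name `FusedVisualHead`, the final linear layer and the choice of 7 output classes are hypothetical.

```python
import torch
import torch.nn as nn

class FusedVisualHead(nn.Module):
    """Concatenate per-frame face/body/context features and model them over time."""

    def __init__(self, feat_dim=2048, num_streams=3, hidden=512, num_outputs=7):
        super().__init__()
        # Single-layer bidirectional LSTM with 512 hidden units per direction.
        self.lstm = nn.LSTM(feat_dim * num_streams, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)
        # Hypothetical per-frame output layer (class logits or regression targets).
        self.out = nn.Linear(2 * hidden, num_outputs)

    def forward(self, face, body, context):
        # Early fusion: (B, T, 2048) x 3 -> (B, T, 6144)
        fused = torch.cat([face, body, context], dim=-1)
        temporal, _ = self.lstm(fused)   # (B, T, 1024)
        return self.out(temporal)        # (B, T, num_outputs)

torch.manual_seed(0)
batch, frames = 2, 4
face = torch.randn(batch, frames, 2048)
body = torch.randn(batch, frames, 2048)
context = torch.randn(batch, frames, 2048)

head = FusedVisualHead()
logits = head(face, body, context)
```

Producing one output per frame matches the sequence-to-sequence setting, in which every frame of the input sequence carries its own label.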

3.3 Aural Model

In the branch that incorporates audio information, starting with a sequence of input waveforms, we extract the mel-spectrogram representation of each. Then, we use an 18-layer ResNet model pre-trained on the ImageNet dataset to extract a 512-dim feature vector for each input waveform. Finally, using an LSTM layer, we map the feature sequence to labels (either expression labels in the case of Track 2 or VA labels in the case of Track 1).

3.4 Loss Functions

For Track 1 of the ABAW competition, we use both a standard mean-squared error loss as well as a loss term based on the concordance correlation coefficient (CCC). The CCC is defined as:

$$\rho_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}$$

where $s_x^2$ and $s_y^2$ denote the variance of the predicted and ground truth values respectively, $\bar{x}$ and $\bar{y}$ are the corresponding mean values and $s_{xy}$ is the respective covariance value. The range of CCC is from -1 (perfect disagreement) to 1 (perfect agreement). Hence, in our case we define $\mathcal{L}_{CCC}$ as:

$$\mathcal{L}_{CCC} = 1 - \frac{\rho_v + \rho_a}{2}$$

where $\rho_v$ and $\rho_a$ are the respective CCC of valence and arousal.
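As a worked illustration, the CCC and a loss built from it can be computed as below. The sketch uses the standard CCC definition (twice the covariance over the sum of the two variances and the squared difference of the means) and assumes that the loss is one minus the mean of the valence and arousal coefficients; that combination, and the function names, are assumptions rather than a confirmed copy of the authors' code.

```python
import numpy as np

def ccc(pred, target):
    """Concordance correlation coefficient between two 1-D sequences."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mean_p, mean_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mean_p) * (target - mean_t)).mean()
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)

def ccc_loss(val_pred, val_true, aro_pred, aro_true):
    """Assumed form: 1 - (CCC_valence + CCC_arousal) / 2, so lower is better."""
    return 1.0 - 0.5 * (ccc(val_pred, val_true) + ccc(aro_pred, aro_true))

x = np.array([-1.0, 0.0, 1.0])
perfect = ccc(x, x)        # identical sequences
opposite = ccc(x, -x)      # mirrored sequences
shifted = ccc(x, x + 1.0)  # same shape, but biased by a constant offset
loss = ccc_loss(x, x, x, -x)
```

The `shifted` case shows why CCC is stricter than Pearson correlation: a constant offset leaves the correlation at 1 but lowers the CCC through the squared mean difference in the denominator. In practice a small epsilon is often added to the denominator to guard against constant predictions and targets.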

For Track 2, we use a standard cross-entropy loss function. We also enforce semantic congruity between the extracted visual embeddings and the categorical label word embeddings from a 300-dim GloVe [pennington2014glove] model, pre-trained on Wikipedia and Gigaword 5 data, in the same manner as in [NTUA_BEEU]. More specifically, given an input sample and its ground truth target label, we transform the concatenated visual embeddings into the same dimensionality as the word embeddings through a linear transformation. We then apply an MSE loss between the transformed visual embeddings and the word embeddings which correspond to the ground truth emotional label, which gives the embedding loss term:


where is the Iverson bracket and

is the class index. The whole network can be trained in an end-to-end manner by minimizing the combined loss function

. For simplicity, we set during all of our subsequent experiments.
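The Track 2 objective can be illustrated with a small NumPy sketch. Everything concrete here is an assumption made for the example: random vectors stand in for the 300-dim GloVe label embeddings, `W` and `b` are a randomly initialized linear map, 7 classes are used, and `loss_weight` is a hypothetical balancing factor. Indexing the label embeddings by the ground truth class plays the role of the Iverson bracket, since it selects exactly one class per sample.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, visual_dim, word_dim = 7, 6144, 300

# Stand-in for pre-trained GloVe vectors of the emotion label words.
label_embeddings = rng.normal(size=(num_classes, word_dim))
# Linear transformation from visual space to word-embedding space.
W = rng.normal(scale=0.01, size=(visual_dim, word_dim))
b = np.zeros(word_dim)

def embedding_loss(visual, labels):
    """MSE between projected visual embeddings and the ground-truth label embedding."""
    projected = visual @ W + b           # (N, 300)
    targets = label_embeddings[labels]   # (N, 300), one row per ground-truth class
    return np.mean((projected - targets) ** 2)

def cross_entropy(logits, labels):
    """Mean cross-entropy with a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

visual = rng.normal(size=(4, visual_dim))
logits = rng.normal(size=(4, num_classes))
labels = np.array([0, 3, 3, 6])

loss_weight = 1.0  # hypothetical weighting of the embedding term
total = cross_entropy(logits, labels) + loss_weight * embedding_loss(visual, labels)
uniform_ce = cross_entropy(np.zeros((2, num_classes)), np.array([1, 2]))
```

With all-zero logits the classifier is maximally uncertain, so `uniform_ce` equals the natural logarithm of the number of classes, which is a convenient sanity check for the cross-entropy term.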

4 Experimental Results

Tables 1 and 2 present our results on the Aff-Wild2 validation set, for the Expression and Valence-Arousal sub-challenges respectively, together with a performance comparison with the baseline and top entries from last year’s ABAW competition [kollias2020analysing].

| Method | F1 Score | Accuracy | Total |
|---|---|---|---|
| Baseline [kollias2021analysing] | 0.30 | 0.50 | 0.366 |
| NISL2020 [deng2020multitask] | - | - | 0.493 |
| ICT-VIPL [liu2020emotion] | 0.33 | 0.64 | 0.434 |
| TNT [kuhnke2020two] | - | - | 0.546 |
| Audio | 0.375 | 0.495 | 0.415 |
| Visual (F) | 0.453 | 0.584 | 0.496 |
| Visual (BCF) | 0.517 | 0.640 | 0.558 |
| Visual (BCF) + | 0.532 | 0.639 | 0.567 |
| Visual (F + BCF) | 0.543 | 0.657 | 0.580 |
| Audio + Visual (F) | 0.519 | 0.645 | 0.561 |
| Audio + Visual (BCF) | 0.536 | 0.654 | 0.575 |
| Audio + Visual (F + BCF) | 0.555 | 0.668 | 0.592 |

Table 1: Results on the Aff-Wild2 validation set, for the Expression sub-challenge.

| Method | CCC-V | CCC-A | Total |
|---|---|---|---|
| Baseline [kollias2021analysing] | 0.23 | 0.21 | 0.22 |
| NISL2020 [deng2020multitask] | 0.335 | 0.515 | 0.425 |
| ICT-VIPL [zhang2020m] | 0.32 | 0.55 | 0.435 |
| TNT [kuhnke2020two] | 0.493 | 0.613 | 0.553 |
| Audio | 0.243 | 0.400 | 0.322 |
| Visual (F) | 0.330 | 0.539 | 0.435 |
| Visual (BCF) | 0.344 | 0.550 | 0.447 |
| Visual (F + BCF) | 0.358 | 0.597 | 0.478 |
| Audio + Visual (F) | 0.366 | 0.582 | 0.474 |
| Audio + Visual (BCF) | 0.382 | 0.586 | 0.484 |
| Audio + Visual (F + BCF) | 0.386 | 0.616 | 0.502 |

Table 2: Results on the Aff-Wild2 validation set, for the Valence-Arousal sub-challenge.
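The Total columns are consistent with the ABAW evaluation criteria, reading the score column of Table 1 as the F1 score: a weighted sum of F1 and accuracy for expressions (weights 0.67 and 0.33), and the mean of the valence and arousal CCC for Track 1. The short check below reproduces a few rows; the function names are illustrative.

```python
def expression_total(f1, accuracy):
    """Expression metric: 0.67 * F1 + 0.33 * accuracy."""
    return 0.67 * f1 + 0.33 * accuracy

def va_total(ccc_valence, ccc_arousal):
    """Valence-arousal metric: mean of the two CCC values."""
    return 0.5 * (ccc_valence + ccc_arousal)

best_expression = round(expression_total(0.555, 0.668), 3)   # Audio + Visual (F + BCF)
baseline_expression = round(expression_total(0.30, 0.50), 3)  # Baseline
face_only_expression = round(expression_total(0.453, 0.584), 3)  # Visual (F)
baseline_va = round(va_total(0.23, 0.21), 2)                  # Baseline
```

Because the tables report rounded inputs, recomputing a total from them can differ from the printed value in the last digit (for instance, the best valence-arousal entry gives 0.501 from the rounded CCC values against a reported 0.502).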

F denotes a visual branch trained using only the cropped face in the input video, while BCF denotes the visual branch that incorporates both bodily and contextual features in addition to the face, as seen in Figure 1. Audio denotes the results of the aural branch. In both tables we also include the results of weighted average ensembling for all different combinations of F, BCF, and Audio. We readily see that including the body and the context as additional information results in a performance boost (especially in the case of the Expression sub-challenge). Furthermore, the fusion of any two models improves performance compared to the single branches. Finally, in both tables, the fusion of all three models results in the best score (0.592 for the Expression sub-challenge and 0.502 for the VA sub-challenge).
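Weighted average ensembling of the kind mentioned above can be sketched as follows for the expression task. The per-model class probabilities and the weights are made-up values for illustration; how the actual ensemble weights were chosen is not specified here.

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Weighted average of per-model class probabilities.

    prob_list: list of M arrays, each of shape (num_frames, num_classes).
    weights:   M non-negative weights; they are normalized to sum to one.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(prob_list, axis=0)           # (M, num_frames, num_classes)
    return np.tensordot(weights, stacked, axes=1)   # (num_frames, num_classes)

# Toy probabilities for two frames and three classes from each branch.
face = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
bcf = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
audio = np.array([[0.2, 0.7, 0.1], [0.3, 0.3, 0.4]])

fused = weighted_ensemble([face, bcf, audio], weights=[0.4, 0.4, 0.2])
pred = fused.argmax(axis=1)
```

In the second frame the face branch alone would pick class 1, while the fused prediction is class 2, which illustrates how the other branches can override a single model. For valence and arousal the same weighted average can be applied directly to the continuous predictions.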

5 Conclusion

We presented our method for the “2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW)” challenge. Rather than relying only on face-cropped images, we also leverage contextual (scene) and bodily features, as well as the audio of the video, to further enhance our model's perception. Our results show that fusing the different streams of information yields a significant performance increase compared both to single streams and to the previous best published results.