Emotion Recognition for In-the-wild Videos

02/13/2020 ∙ by Hanyu Liu, et al. ∙ Institute of Computing Technology, Chinese Academy of Sciences

This paper is a brief introduction to our submission to the seven basic expression classification track of the Affective Behavior Analysis in-the-wild (ABAW) Competition, held in conjunction with the IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2020. Our method combines a Deep Residual Network (ResNet) and a Bidirectional Long Short-Term Memory Network (BLSTM), achieving 64.3




I Introduction

Automated facial expression recognition (FER) in-the-wild is a long-standing problem in affective computing and human-computer interaction. To analyze facial expressions, psychologists and computer scientists have classified them into a list of emotion-related categories, such as the six basic emotions: anger, disgust, fear, happiness, sadness, and surprise. Ekman et al. [2] have shown that the six basic emotional expressions are universal among human beings. Facial expression recognition has made encouraging progress over the past decades.

In the Affective Behavior Analysis in-the-wild (ABAW) 2020 Competition[7], the organizers provide a large-scale in-the-wild database called Aff-Wild2[10, 11, 9, 8, 6], which includes videos annotated with emotion categories, facial action units[3], and the valence and arousal dimensions[17]. In this paper, we present the method we used in the expression track of ABAW. The task in this track is to distinguish the seven basic facial expressions (neutral, anger, disgust, fear, happiness, sadness, and surprise) of the person in each given video. Our method adopts a 101-layer ResNet[4] with convolutional block attention modules (CBAM)[19] to extract frame-by-frame features. The features are then fed into a bidirectional recurrent neural network with long short-term memory (BLSTM)[5] units.

II Related Work

Deep neural network-based algorithms have been widely used in image and video analysis in recent years. Convolutional neural networks (CNNs) such as the residual neural network (ResNet), VGGNet[18], and AlexNet[12] have been shown to be effective for image classification and image feature extraction. The long short-term memory network (LSTM)[5], an improved variant of the recurrent neural network (RNN) that is capable of capturing sequential information, is used in natural language processing as well as video analysis. Combining a CNN with an RNN has also proved to perform well on emotion-related tasks. Woo et al. proposed the convolutional block attention module (CBAM)[19], a lightweight and general attention module that can improve the performance of a wide range of CNNs. D. Kollias et al. collected the Aff-Wild dataset[20], the first dataset with manual annotations for each frame of the videos for facial action unit, facial expression, and valence-arousal research.

III Method

Fig. 1: Architecture of the proposed method.

Our method consists of three parts: a ResNet-101 with CBAM that extracts features from each frame, a BLSTM that captures the dynamic features of consecutive frames, and a classification module that makes the decisions. Fig. 1 illustrates the framework of our method. Below, we present the three parts in detail.

III-A ResNet-101 with CBAM

Since ResNet has achieved strong performance on many computer vision tasks[4], we adopt a 101-layer ResNet (ResNet101) to extract visual features from each frame. Because facial expressions appear in particular locations of the image, we add a convolutional block attention module (CBAM)[19] after each residual block of ResNet101 to introduce channel attention and spatial attention. Fig. 2 illustrates the structure of a residual block with CBAM.

Fig. 2: Structure of the residual block with CBAM
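The attention module itself can be sketched in PyTorch as follows. This is a minimal illustration that follows the general CBAM design of Woo et al.[19] (channel attention followed by spatial attention), not our exact implementation: the reduction ratio of 16, the 7x7 spatial kernel, and the class names are assumptions, and where exactly the module is attached inside the residual block is left out.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention: a shared MLP over average- and max-pooled descriptors."""

    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)  # shape: (N, C, 1, 1)


class SpatialAttention(nn.Module):
    """Spatial attention: a convolution over channel-wise average and max maps."""

    def __init__(self, kernel_size=7):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx = torch.amax(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (N, 1, H, W)


class CBAM(nn.Module):
    """Applies channel attention, then spatial attention, to a feature map."""

    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)
        self.spatial = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.channel(x)
        return x * self.spatial(x)


# Usage: refine a hypothetical residual-block output; the shape is preserved.
features = torch.randn(2, 256, 14, 14)
refined = CBAM(256)(features)
print(refined.shape)  # torch.Size([2, 256, 14, 14])
```

Because both attention maps lie between 0 and 1, the module can only rescale activations, which is what lets it emphasize informative channels and face regions without changing the feature-map shape expected by the next block.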

III-B BLSTM

Since facial expressions are continuous in the time dimension, we use a Long Short-Term Memory network (LSTM) to process temporal information. Because features must be selected for every frame, including the starting frame of the eight-frame video clip, for which a unidirectional LSTM would have no preceding context, we use a bidirectional LSTM here.

III-C Classification module

Lastly, fully connected layers classify each frame into one of the seven classes, based on the features extracted and selected by the previous layers.
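The temporal and classification parts can be sketched together in PyTorch. This is an illustrative sketch under stated assumptions rather than our exact configuration: the class name ClipClassifier, the hidden size of 256, and the two-layer classifier are hypothetical, and the per-frame features are random tensors standing in for the 2048-dimensional pooled output of a ResNet-101 backbone.

```python
import torch
import torch.nn as nn


class ClipClassifier(nn.Module):
    """BLSTM over per-frame features followed by a per-frame classifier."""

    def __init__(self, feature_dim=2048, hidden_dim=256, num_classes=7):
        super().__init__()
        # hidden_dim is an assumed value; it is not stated in the text.
        self.blstm = nn.LSTM(feature_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),  # forward + backward states
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, frame_features):
        # frame_features: (batch, frames, feature_dim)
        sequence, _ = self.blstm(frame_features)
        return self.classifier(sequence)  # (batch, frames, num_classes)


# Usage: 4 clips of 8 frames each, with stand-in backbone features.
clips = torch.randn(4, 8, 2048)
logits = ClipClassifier()(clips)
predictions = logits.argmax(dim=-1)
print(logits.shape, predictions.shape)  # torch.Size([4, 8, 7]) torch.Size([4, 8])
```

The BLSTM returns one output per time step, so every frame of the clip, including the first, receives a prediction that depends on both earlier and later frames.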

IV Experiment

IV-A Dataset

In addition to the emotion-annotated part of the provided Aff-Wild2 dataset, we used several external facial expression datasets (AffectNet[15], RAF-DB[13][14]) and a self-collected dataset of 300,000 images to pre-train our model.

Aff-Wild2: Aff-Wild2 provides annotations for 539 videos in total, consisting of 2,595,572 frames and 431 subjects, 265 of whom are male and 166 female. The dataset is split into training, validation, and test parts in a subject-independent manner, with 253, 71, and 233 subjects respectively.

AffectNet: AffectNet contains about 440,000 manually annotated facial images collected from Internet search engines. We used only the training-set images labeled as neutral or as one of the six basic emotions, around 280,000 images in total.

RAF-DB: Real-world Affective Faces Database (RAF-DB) contains around 30,000 facial images annotated with basic or compound expressions. We used only the 12,271 training-set images annotated with basic emotions.

IV-B Preprocessing

The original videos were first divided into frames. The RetinaFace detector[1] was then applied to these images to detect all the faces to be analyzed, which were aligned and cropped to a fixed size. To make better use of the external datasets, all preprocessing procedures follow the official preprocessing except for the tools used. Frames in which the detector could not find a human face were removed from the training dataset.

IV-C Training

We implemented our model using PyTorch[16] on a server with four Nvidia GeForce GTX Titan X GPUs, each with 12GB of memory. The model is trained with stochastic gradient descent (SGD) with a learning rate of 0.0001 and a momentum of 0.9. The loss function is the cross-entropy loss. The training batch size is set to 4. At each training step, one video is selected from all the videos in the training dataset with equal probability. A continuous 8-frame video clip (i.e., one without any frame in which no face could be detected) is then randomly selected from this video as a batch. The model usually reaches its best performance within 200,000 batches. After every 1,000 iterations, we saved the current parameters of the model as a checkpoint.
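The clip sampling described above can be sketched in plain Python. This is a minimal illustration, assuming each video is represented by a list of per-frame booleans that record whether a face was detected; the function names and the choice to skip videos without any valid 8-frame window are assumptions.

```python
import random

CLIP_LEN = 8


def valid_clip_starts(face_detected, clip_len=CLIP_LEN):
    """Return start indices of windows in which every frame has a detected face."""
    starts = []
    run = 0  # length of the current run of consecutive detected frames
    for i, ok in enumerate(face_detected):
        run = run + 1 if ok else 0
        if run >= clip_len:
            starts.append(i - clip_len + 1)
    return starts


def sample_clip(videos, rng=random):
    """Pick a video uniformly at random, then a random valid clip inside it.

    videos maps a video id to its list of per-frame detection flags.
    Videos without any valid window are skipped (an assumption).
    """
    candidates = {vid: valid_clip_starts(flags) for vid, flags in videos.items()}
    candidates = {vid: starts for vid, starts in candidates.items() if starts}
    if not candidates:
        raise ValueError("no video contains a valid clip")
    vid = rng.choice(sorted(candidates))
    start = rng.choice(candidates[vid])
    return vid, list(range(start, start + CLIP_LEN))


# Usage with two hypothetical videos; only "b" has a full 8-frame window.
videos = {
    "a": [True] * 5,
    "b": [True, True, True, False] + [True] * 8,
}
print(sample_clip(videos, random.Random(0)))
```

Sampling the video first and the clip second gives every video the same chance of being drawn, regardless of its length.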

IV-D Evaluation

All the video frames in the validation set are arranged into 8-frame clips that are processed together. If the length of a video is not divisible by 8, the last clip is padded with zeros. The BLSTM outputs features for each time step in the clip, so the 8 frames are classified and labeled in a single pass. We counted the number of correctly predicted frames as well as the total number of frames processed by our model.
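The clip arrangement can be sketched as follows. This is a minimal illustration in plain Python, assuming each frame is represented by a feature vector (a list of floats); the function name and the validity mask, used here so that padded slots can be excluded from the frame counts, are assumptions.

```python
def arrange_into_clips(frames, clip_len=8):
    """Split per-frame vectors into fixed-length clips, zero-padding the last one.

    Returns (clips, valid), where valid[i][j] is False for zero-padded slots.
    """
    if not frames:
        return [], []
    dim = len(frames[0])
    clips, valid = [], []
    for start in range(0, len(frames), clip_len):
        chunk = list(frames[start:start + clip_len])
        mask = [True] * len(chunk)
        pad = clip_len - len(chunk)
        chunk += [[0.0] * dim for _ in range(pad)]  # zero padding
        mask += [False] * pad
        clips.append(chunk)
        valid.append(mask)
    return clips, valid


# Usage: a hypothetical 19-frame video with 2-dimensional features.
frames = [[float(i), 1.0] for i in range(19)]
clips, valid = arrange_into_clips(frames)
print(len(clips), valid[-1])
# 3 [True, True, True, False, False, False, False, False]
```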

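The final metric is a combination of accuracy and $F_1$ score, formulated as:

$$\mathrm{Metric} = 0.67 \times F_1 + 0.33 \times \mathrm{Acc}$$

where $\mathrm{Acc}$ is the accuracy, computed as the ratio of the number of correctly predicted frames to the total number of frames, and $F_1$ is computed as the unweighted mean of the $F_1$ scores of all seven categories. The $F_1$ score of a single category is computed as:

$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

We manually selected the checkpoint with the best performance on the validation set from all the checkpoints.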
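The evaluation metric can be sketched in plain Python. The sketch assumes the weighting used in the ABAW 2020 expression track, 0.67 for the macro-averaged F1 score and 0.33 for accuracy, which reproduces the final-metric column of Table I; the function names are hypothetical, and assigning an F1 of zero to a category with no true positives is a convention chosen here.

```python
def accuracy(y_true, y_pred):
    """Fraction of frames whose predicted label equals the annotated label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 score of a single category from its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # convention: no true positives gives an F1 of zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, num_classes=7):
    """Unweighted mean of the per-category F1 scores."""
    scores = [f1_for_class(y_true, y_pred, c) for c in range(num_classes)]
    return sum(scores) / num_classes


def final_metric(y_true, y_pred, num_classes=7):
    """Assumed competition weighting: 0.67 * macro F1 + 0.33 * accuracy."""
    return (0.67 * macro_f1(y_true, y_pred, num_classes)
            + 0.33 * accuracy(y_true, y_pred))


# Usage on a toy two-class example.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(accuracy(y_true, y_pred), final_metric(y_true, y_pred, num_classes=2))
```

Because the F1 term is averaged over categories without weighting, rare expressions count as much as frequent ones, so a model can gain on the final metric while losing a little accuracy.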

IV-E Result

Method              Acc     F1      Final metric
baseline[7]         -       -       0.36
ResNet+BLSTM        0.647   0.281   0.402
ResNet+BLSTM+CBAM   0.640   0.333   0.434

TABLE I: Result on the validation set

We evaluated our method on the validation set of Aff-Wild2 and report the results in Table I. The baseline method is MobileNetV2. ResNet+BLSTM is the combination of a vanilla ResNet101 and a BLSTM. ResNet+BLSTM+CBAM adds a CBAM after each residual block of ResNet101. As can be seen in Table I, ResNet+BLSTM+CBAM achieves the highest final metric (0.434), although its accuracy is slightly lower than that of ResNet+BLSTM (0.640 versus 0.647).
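As a quick consistency check, the final-metric column of Table I can be recomputed from the Acc and F1 columns. The check assumes that the final metric weights the F1 score by 0.67 and the accuracy by 0.33, the weighting used in the ABAW 2020 expression track.

```latex
\begin{align*}
\text{ResNet+BLSTM:} \quad
  & 0.67 \times 0.281 + 0.33 \times 0.647 = 0.18827 + 0.21351 \approx 0.402 \\
\text{ResNet+BLSTM+CBAM:} \quad
  & 0.67 \times 0.333 + 0.33 \times 0.640 = 0.22311 + 0.21120 \approx 0.434
\end{align*}
```

Both rows match the reported values, and they show that the gain from CBAM comes from the higher F1 score rather than from accuracy.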

V Conclusion

Our proposed method reaches 64.65% accuracy and a final metric of 43.43% on the validation set, which is 7.43 percentage points higher than the 36% baseline reported in the competition announcement.


The authors would like to thank Xuran Sun for providing us with the pre-trained ResNet FER model and Yuanhang Zhang for assistance.


  • [1] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou (2019) RetinaFace: single-stage dense face localisation in the wild. CoRR abs/1905.00641.
  • [2] P. Ekman (1992) An argument for basic emotions. Cognition & Emotion 6 (3-4), pp. 169–200.
  • [3] E. Friesen and P. Ekman (1978) Facial action coding system: a technique for the measurement of facial movement. Palo Alto 3.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [5] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [6] D. Kollias, M. A. Nicolaou, I. Kotsia, G. Zhao, and S. Zafeiriou (2017) Recognition of affect in the wild using deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1972–1979.
  • [7] D. Kollias, A. Schulc, E. Hajiyev, and S. Zafeiriou (2020) Analysing affective behavior in the first ABAW 2020 competition. arXiv:2001.11409.
  • [8] D. Kollias, P. Tzirakis, M. A. Nicolaou, A. Papaioannou, G. Zhao, B. W. Schuller, I. Kotsia, and S. Zafeiriou (2019) Deep affect prediction in-the-wild: Aff-Wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision 127 (6-7), pp. 907–929.
  • [9] D. Kollias and S. Zafeiriou (2018) A multi-task learning & generation framework: valence-arousal, action units & primary expressions. CoRR abs/1811.07771.
  • [10] D. Kollias and S. Zafeiriou (2018) Aff-Wild2: extending the Aff-Wild database for affect recognition. CoRR abs/1811.07770.
  • [11] D. Kollias and S. Zafeiriou (2019) Expression, affect, action unit recognition: Aff-Wild2, multi-task learning and ArcFace. CoRR abs/1910.04855.
  • [12] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • [13] S. Li, W. Deng, and J. Du (2017) Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2584–2593.
  • [14] S. Li and W. Deng (2019) Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition. IEEE Transactions on Image Processing 28 (1), pp. 356–370.
  • [15] A. Mollahosseini, B. Hasani, and M. H. Mahoor (2017) AffectNet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10 (1), pp. 18–31.
  • [16] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Annual Conference on Neural Information Processing Systems, pp. 8024–8035.
  • [17] J. A. Russell (1980) A circumplex model of affect. Journal of Personality and Social Psychology 39 (6), pp. 1161.
  • [18] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [19] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) CBAM: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
  • [20] S. Zafeiriou, D. Kollias, M. A. Nicolaou, A. Papaioannou, G. Zhao, and I. Kotsia (2017) Aff-Wild: valence and arousal 'in-the-wild' challenge. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1980–1987.