The FaceChannel: A Light-weight Deep Neural Network for Facial Expression Recognition

04/17/2020 · Pablo Barros, et al. · University of Cambridge · Istituto Italiano di Tecnologia

Current state-of-the-art models for automatic FER are based on very deep neural networks that are difficult to train. This makes it challenging to adapt these models to changing conditions, a requirement for FER models given the subjective nature of affect perception and understanding. In this paper, we address this problem by formalizing the FaceChannel, a light-weight neural network that has far fewer parameters than common deep neural networks. We perform a series of experiments on different benchmark datasets to demonstrate that the FaceChannel achieves comparable, if not better, performance than the current state-of-the-art in FER.


I Introduction

It is well-accepted that ‘basic’ emotions are perceived, recognized and commonly understood by people across cultures, around the world [13]. Translating such an idea into universal automatic Facial Expression Recognition (FER), however, is a major challenge. A common obstacle in developing such a solution is that each person might express basic affect differently. Most natural expressions are composed of a series of transitions between several basic affective reactions, occurring in a very short period of time [7]. Contributing to the complexity of analyzing facial expressions automatically is the subjective nature of affective expression. Depending on the context of the interaction, the level of engagement, and the affective bond between the interaction partners, the understanding of an affective exchange between them can change drastically [18, 20].

One way to address this problem is by formalizing affect in a manner that bounds the categorization ability of a computational system. This requires choosing a highly effective formalization for the task at hand [15, 2, 1]. Although there exists extensive research on defining, identifying and understanding human affect, most computational models that deal with affect recognition from facial expressions are restricted to very narrow categorizations. The most common approach is categorization into emotional categories (with the universal emotions being the most popular) or a dimensional representation (usually arousal and valence), presumably due to the availability of training data that provides such encapsulations.

Most of the current, and effective, solutions for automatic affect recognition are based on extreme generalization, usually employing end-to-end deep learning techniques [32]. Such models typically learn how to represent affective features from a large number of data samples, using strongly supervised methods [19, 22, 26, 25, 24]. As a result, these models can extract facial features from a collection of different individuals, which contributes to the generalization of their expression representations and brings them closer to a universal, automatic FER machine. The development of such models was supported by the collection of several “in-the-wild” datasets [11, 34, 39, 40] that provide large amounts of well-labelled data. These datasets usually contain emotion expressions from various multimedia sources, ranging from single frames to a few seconds of video material. Because of the availability of large amounts of training data, deep learning-based solutions form the state-of-the-art in FER when benchmarked on these datasets [9, 31, 12, 38].

Most of these models, however, employ large and deep neural networks that demand considerable computational resources for training. As a result, these models specialize in recognizing emotion expressions under the conditions represented in the datasets they are trained with. When applied to data recorded under different conditions, not represented in the training data, they tend to perform poorly. Retraining these networks usually helps them adapt to such novel scenarios and improves their performance. Yet, owing to their large and deep architectures, retraining the entire network whenever the conditions change is rather expensive.

Furthermore, even once trained, these deep neural models recognize facial expressions quickly only when provided with rich computational resources. With reduced processing power, such as on robotic platforms, they are usually extremely slow and do not support real-time application.

In this paper, we address the problem of using such deep neural networks by formalizing the FaceChannel neural network. This model is an upgraded version of the Multi-Channel Convolutional Neural Network proposed in our previous work [4]. The FaceChannel is a light-weight convolutional neural network, trained from scratch, that presents state-of-the-art results in FER. To evaluate our model, we perform a series of experiments with different facial expression datasets and compare our results with the current state-of-the-art.

II The FaceChannel

Fig. 1: Detailed architecture and parameters of The FaceChannel.

In this paper, we present an updated version of the original FaceChannel proposed in our earlier work [4]. It employs a VGG16 [36]-based topology, but with far fewer parameters (see Fig. 1), to improve the robustness of the model. The FaceChannel network consists of a stack of convolutional layers with 4 pooling layers. We use batch normalization within each convolutional layer and a dropout function after each pooling layer. Following the original FaceChannel architecture, we apply shunting inhibitory fields [14] in our last layer. Each shunting neuron S_{nc}^{xy} at position (x, y) of the n-th receptive field in the c-th layer is activated as:

S_{nc}^{xy} = \frac{u_{nc}^{xy}}{a_{nc} + I_{nc}^{xy}},     (1)

where u_{nc}^{xy} is the activation of the common unit in the same position and I_{nc}^{xy} is the activation of the inhibitory neuron. The learned passive decay term a_{nc} is the same for each shunting inhibitory field. Each convolutional and inhibitory layer of the FaceChannel implements a ReLU activation function.

The output of the convolutional layers is fed to a fully connected layer of ReLU units, which is then fed to an output layer. Our model is trained using a categorical cross-entropy loss function.
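To make this description concrete, the sketch below shows how such a light-weight VGG-like stack with a final shunting-inhibitory field could be written in TensorFlow/Keras. The filter counts, dropout rates, dense width, input resolution, and number of output classes are illustrative assumptions rather than the published FaceChannel configuration; only the overall structure (convolution with batch normalization, 4 pooling stages followed by dropout, shunting inhibition as in Eq. (1), a ReLU fully connected layer, and a softmax output trained with categorical cross-entropy) follows the text above.

```python
# Sketch of a FaceChannel-like network in TensorFlow/Keras.
# Filter counts, dropout rates, and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models


class ShuntingInhibition(layers.Layer):
    """Shunting inhibitory field: S = u / (a + I), with a learned passive decay a."""

    def build(self, input_shape):
        # One decay term per feature map, shared over spatial positions (assumption).
        n_filters = input_shape[0][-1]
        self.decay = self.add_weight(
            name="decay", shape=(n_filters,), initializer="ones", trainable=True)

    def call(self, inputs):
        u, inhib = inputs  # excitatory and inhibitory activations
        return u / (self.decay + inhib + 1e-7)  # epsilon avoids division by zero


def conv_block(x, filters, n_convs, dropout=0.5):
    """A few convolutions with batch normalization, then pooling and dropout."""
    for _ in range(n_convs):
        x = layers.Conv2D(filters, (3, 3), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    return layers.Dropout(dropout)(x)


def build_face_channel(input_shape=(96, 96, 1), n_classes=7):
    inputs = layers.Input(shape=input_shape)
    x = conv_block(inputs, 32, n_convs=2)
    x = conv_block(x, 64, n_convs=2)
    x = conv_block(x, 128, n_convs=2)
    x = conv_block(x, 256, n_convs=2)

    # Last stage: parallel excitatory/inhibitory convolutions feed the shunting field.
    u = layers.Conv2D(256, (3, 3), padding="same", activation="relu")(x)
    inhib = layers.Conv2D(256, (3, 3), padding="same", activation="relu")(x)
    x = ShuntingInhibition()([u, inhib])

    x = layers.Flatten()(x)
    x = layers.Dense(200, activation="relu")(x)  # dense width is an assumption
    outputs = layers.Dense(n_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```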

As typical for most deep learning models, our FaceChannel has several hyper-parameters that need to be tuned. We optimized our model to maximize the recognition accuracy using a Tree-structured Parzen Estimator (TPE) [6] and use the optimal training parameters throughout all of our experiments. The entire network has far fewer adaptable parameters than commonly used VGG16-based networks, which makes it very light-weight.
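As an illustration of such a search, the snippet below sketches how the Tree-structured Parzen Estimator from the hyperopt library could be used to tune a couple of training parameters of the model sketched above. The search space, the parameters tuned, and the evaluation budget are assumptions for illustration, not the exact search performed for the FaceChannel.

```python
# Hedged sketch of a TPE hyper-parameter search with hyperopt.
# The search space, objective, and budget are illustrative assumptions.
import tensorflow as tf
from hyperopt import fmin, tpe, hp, Trials

# x_train, y_train, x_val, y_val are assumed to be loaded elsewhere.


def objective(params):
    model = build_face_channel()  # hypothetical helper from the sketch above
    model.compile(optimizer=tf.keras.optimizers.Adam(params["learning_rate"]),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    history = model.fit(x_train, y_train, batch_size=int(params["batch_size"]),
                        epochs=10, validation_data=(x_val, y_val), verbose=0)
    # TPE minimizes the objective, so return 1 - best validation accuracy.
    return 1.0 - max(history.history["val_accuracy"])


search_space = {
    "learning_rate": hp.loguniform("learning_rate", -9, -3),  # ~1e-4 to ~5e-2
    "batch_size": hp.quniform("batch_size", 16, 128, 16),
}

best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=50, trials=Trials())
print(best)
```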

III Experimental Setup

To evaluate the FaceChannel, we perform several benchmarking experiments on different FER datasets. As some of these datasets do not contain enough data to train a deep neural network, we pre-train our model on the AffectNet dataset [34]. We then fine-tune the model on the training split of each dataset and evaluate it following the respective evaluation protocols.

III-A Datasets

Fig. 2: Annotation distributions for: A) the AffectNet dataset [34], which has a high variance on arousal and valence with a large number of data points, and B) the continuous expressions of the OMG-Emotion [3] videos, which cover a wide spread of arousal and valence.

III-A1 AffectNet

AffectNet [34] consists of more than 400 thousand “in-the-wild” images that are manually annotated with seven categorical labels and continuous arousal and valence. As the test-set labels are not publicly available, all our experiments are performed using only the training and validation samples.

As AffectNet has, by far, the best data distribution among the datasets we experiment with (see Fig. 2 A), we use it to pre-train the FaceChannel for all the other experiments. This also guarantees that the FaceChannel first learns how to extract facial features relevant for FER. Fine-tuning the model with each specific dataset then allows it to adapt to the specific conditions represented in that dataset.
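A minimal sketch of this pre-train-then-fine-tune protocol is given below, reusing the hypothetical build_face_channel helper from Section II. The file names, number of epochs, learning rate, and the choice of replacing only the output layer are illustrative assumptions.

```python
# Hedged sketch: pre-train on AffectNet, then fine-tune on a target FER dataset.
# affectnet_train/val and target_train/val are assumed to be tf.data.Dataset
# objects yielding (image, one-hot label) batches.
import tensorflow as tf
from tensorflow.keras import layers, models

# 1) Pre-train the categorical head on AffectNet and save the weights.
model = build_face_channel(n_classes=7)  # hypothetical helper from Section II
model.fit(affectnet_train, epochs=20, validation_data=affectnet_val)
model.save_weights("facechannel_affectnet.h5")

# 2) Fine-tune on a target dataset: keep the convolutional trunk,
#    replace the output layer to match the new label set.
model.load_weights("facechannel_affectnet.h5")
trunk = models.Model(model.input, model.layers[-2].output)  # drop the old output layer
outputs = layers.Dense(11, activation="softmax")(trunk.output)  # e.g., FABO: 11 classes
finetuned = models.Model(trunk.input, outputs)
finetuned.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
finetuned.fit(target_train, epochs=10, validation_data=target_val)
```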

III-A2 OMG-Emotion

We also provide experiments on the One-Minute Gradual Emotion Recognition (OMG-Emotion) dataset [3], which is composed of one-minute-long YouTube videos annotated by taking continuous emotional behavior into consideration. This dataset helps us evaluate how the model performs when recognizing expressions from a particular individual. The videos were selected from YouTube using a crawler that searches for specific keywords related to long-term emotional scenes, such as “monologues”, “auditions”, “dialogues” and “emotional scenes”, which guarantees that each video has only one person performing an emotional display. A total of videos were collected, which sums up to around hours of audio-visual data. Each utterance in the videos is annotated with two continuous labels, representing arousal and valence. The emotion expressions displayed in the OMG-Emotion dataset are heavily impacted by person-specific characteristics, which are highlighted by the gradual change of emotional behavior over the entire video. The videos in this dataset cover a diverse range of arousal and valence values, as seen in Fig. 2 B.

III-A3 FER+

We then evaluate the FaceChannel on the FER+ dataset [5]. This dataset contains face images, crawled from the internet, with a total of images in the training set and in the test set. The dataset was labeled using a crowdsourcing technique, with each image labeled by at least different annotators, and the obtained label distribution is provided to train the model. Each label relates to one of the six universal emotions (Angry, Disgust, Fear, Happy, Sad, Surprise) or Neutral.
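Since these labels are provided as a per-image distribution over annotator votes, training on them amounts to using the normalized vote counts as soft targets for the categorical cross-entropy. A minimal sketch, assuming a hypothetical array of raw vote counts and a compiled model such as the one sketched in Section II:

```python
# Hedged sketch: train on the FER+ crowd-sourced label distribution as soft targets.
# `vote_counts` is a hypothetical (n_images, n_classes) array of annotator votes,
# and `face_images` the matching array of pre-processed face crops.
import numpy as np

soft_labels = vote_counts / vote_counts.sum(axis=1, keepdims=True)

# Categorical cross-entropy accepts probability distributions as targets,
# so the soft labels can be used directly in place of one-hot vectors.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(face_images, soft_labels, batch_size=64, epochs=10)
```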

III-A4 FABO

Further, to evaluate the model in controlled environment settings, we train and evaluate it on the Bi-modal Face and Body benchmark dataset (FABO) [16]. This dataset is composed of recordings of the face and body motion of different subjects, captured with two cameras: one capturing only the face and the other capturing the upper body. Each video contains one subject executing the same expression in a cycle of two to four repetitions per video.

Emotional State Videos Emotional State Videos
Anger Happiness
Anxiety Puzzlement
Boredom Sadness
Disgust Surprise
Fear Uncertainty
TABLE I: Number of videos available for each emotional state in the FABO dataset. Each video has two to four executions of the same expression.

The FABO dataset has annotations for the temporal phase of each video sequence. To create the annotations, six observers labeled each video independently and a voting process was then executed. We use only the upper-body videos that have a voting majority regarding the temporal-phase classification, similar to [16]; the videos used are listed in Table I. The database contains ten emotional states: “Anger”, “Anxiety”, “Boredom”, “Disgust”, “Fear”, “Happiness”, “Surprise”, “Puzzlement”, “Sadness” and “Uncertainty”. As necessary for the temporal labeling, only the apex state of each sequence is used for training the model. The frames in the remaining temporal phases are grouped into one new category named “Neutral”, leading to a total of eleven emotional states to be classified.
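As a small illustration of this relabeling step, the sketch below keeps the apex frames with their emotion label and assigns every other temporal phase to the new “Neutral” class; the frame-annotation format used here is a hypothetical assumption.

```python
# Hedged sketch: relabel FABO frames, keeping apex frames with their emotion
# label and grouping all other temporal phases into a new "Neutral" class.
# Each entry of `frame_annotations` is assumed to be (frame_path, emotion, phase).

def relabel(frame_annotations):
    relabeled = []
    for frame_path, emotion, phase in frame_annotations:
        label = emotion if phase == "apex" else "Neutral"
        relabeled.append((frame_path, label))
    return relabeled
```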

For all the evaluated datasets, we follow the training and test separation protocols described by the respective dataset authors, to maintain comparability with other proposed models. We pre-process the individual frames before feeding them to the FaceChannel: for each video frame, we detect the face of the subject using the Dlib Python library (https://pypi.org/project/dlib/), and each face image is then resized to the input dimension of the network.
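This pre-processing can be sketched as below, using dlib's frontal face detector and OpenCV for the resizing; the target resolution is left as a parameter, since the exact input dimension is not restated here.

```python
# Hedged sketch of the frame pre-processing: detect the face with dlib, crop, resize.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()


def preprocess_frame(frame_bgr, target_size):
    """Return the cropped, resized face of the first detected subject, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    detections = detector(gray, 1)  # upsample once to find smaller faces
    if not detections:
        return None
    rect = detections[0]
    top, bottom = max(rect.top(), 0), rect.bottom()
    left, right = max(rect.left(), 0), rect.right()
    face = frame_bgr[top:bottom, left:right]
    return cv2.resize(face, (target_size, target_size))
```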

III-B Metrics

To measure the performance of the FaceChannel on the respective datasets, we use two metrics: accuracy, for recognizing categorical emotion expressions, and the Concordance Correlation Coefficient (CCC) [28] between the outputs of the models and the true labels, for recognizing arousal and valence. The CCC is computed as:

CCC = \frac{2 \rho \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2},     (2)

where \rho is the Pearson's Correlation Coefficient between the model predictions and the annotations, \mu_x and \mu_y denote the means of the model predictions and the annotations, and \sigma_x^2 and \sigma_y^2 are the corresponding variances. The CCC metric allows us to have a direct comparison with the annotations available in the OMG-Emotion dataset. The use of CCC as the main objective measurement also allows us to take into consideration the subjectivity of the perceived emotions for each annotator when evaluating the performance of our models.
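For reference, the CCC of Eq. (2) can be computed directly from the predictions and annotations, as in the short NumPy sketch below.

```python
# Concordance Correlation Coefficient (CCC) between predictions and annotations.
import numpy as np


def ccc(predictions, annotations):
    predictions = np.asarray(predictions, dtype=np.float64)
    annotations = np.asarray(annotations, dtype=np.float64)
    pearson = np.corrcoef(predictions, annotations)[0, 1]
    mean_p, mean_a = predictions.mean(), annotations.mean()
    var_p, var_a = predictions.var(), annotations.var()
    return (2 * pearson * np.sqrt(var_p * var_a)) / (
        var_p + var_a + (mean_p - mean_a) ** 2)
```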

Both metrics are in accordance with the experimental protocols defined by each of the individual dataset authors.

IV Results

Arousal and valence (CCC, arousal / valence):
AffectNet [34]: AlexNet [27] 0.34 / 0.60; MobileNet [21] 0.48 / 0.57; VGGFace [30] 0.40 / 0.48; VGGFace+GAN [23] 0.54 / 0.62; FaceChannel 0.46 / 0.61
OMG-Emotion [3]: Zheng et al. [41] 0.35 / 0.49; Huang et al. [22] 0.31 / 0.45; Peng et al. [35] 0.24 / 0.43; Deng et al. [10] 0.27 / 0.35; FaceChannel 0.32 / 0.46

Categorical accuracy:
FER+ [5]: CNN VGG13 [5] 84.98%; SHCNN [33] 86.54%; TFE-JL [29] 84.3%; ESR-9 [37] 87.15%; FaceChannel 90.50%
FABO [16]: Temporal Normalization [8] 66.50%; Bag of Words [8] 59.00%; SVM [17] 32.49%; Adaboost [17] 35.22%; FaceChannel 80.54%

TABLE II: Concordance Correlation Coefficient (CCC), for arousal and valence, and categorical accuracy when evaluating the FaceChannel on the AffectNet, OMG-Emotion, FER+, and FABO datasets.

Although the AffectNet corpus is very popular, not many researchers report arousal and valence prediction performance on its validation set. This is probably because most research uses the AffectNet dataset to pre-train neural models for generalization tasks on other datasets, without reporting the performance on AffectNet itself. The baseline provided by the authors uses an AlexNet-based Convolutional Neural Network [27] re-trained to recognize arousal and valence. A similar approach is reported by Hewitt and Gunes [21], but using a much smaller neural network that can be deployed on a smart-device. Lindt et al. [30] report experiments using VGGFace, a variant of the VGG16 network pre-trained for face identification. Kollias et al. [23] proposed a novel training mechanism that augments the training set of AffectNet using a Generative Adversarial Network (GAN), and obtained the best reported result on this corpus, achieving a CCC of 0.54 for arousal and 0.62 for valence. Our FaceChannel provides an improved performance when compared to most of these results, reported in Table II, achieving a CCC of 0.46 for arousal and 0.61 for valence. Different from the work of Kollias et al. [23], we train our model using only the available training-set portion, and expect these results to improve when training on an augmented training set.

The performance of the FaceChannel, reported in Table II, is very similar to the current state-of-the-art results on the OMG-Emotion dataset, as reported by the winners of the OMG-Emotion challenge where the dataset was proposed [41, 35, 10]. All of these models also rely on pre-training uni-sensory convolutional channels, but employ deep networks with many more parameters that are fine-tuned in an end-to-end manner. The use of attention mechanisms [41] to process the continuous expressions in the videos produced the best results of the challenge, achieving a CCC of 0.35 for arousal and 0.49 for valence. Temporal pooling, implemented as bi-directional Long Short-Term Memory (LSTM) networks [35], achieved second place, with a CCC of 0.24 for arousal and 0.43 for valence. The late fusion of facial expressions, speech signals, and text information [10] reached the third-best result, with a CCC of 0.27 for arousal and 0.35 for valence. The complex attention-based network proposed by Huang et al. [22] achieved a CCC of 0.31 for arousal and 0.45 for valence using only visual information.

When trained and evaluated on the FER+ dataset, our FaceChannel provides improved results compared to those reported by the dataset authors [5]. They employ a deep neural network based on the VGG13 model, trained using different label-averaging schemes. Their best results are achieved using the labels as a probability distribution, which is the same strategy we use. We outperform their result by more than 5%, as reported in Table II. We also outperform the results reported by Miao et al. [33], Li et al. [29], and Siqueira et al. [37], which employ different types of complex neural networks to learn facial expressions.

Our model also achieves higher accuracy in the experiments on the FABO dataset when compared with the state-of-the-art for this dataset [8], which reports an approach based on recognizing each video frame, similar to ours. The results reported by Gunes et al. [17] for Adaboost- and SVM-based implementations also use a frame-based accuracy. Our FaceChannel outperforms all of these models, as illustrated in Table II.

V Conclusions

In this paper, we present a formalization of the FaceChannel neural network for Facial Expression Recognition (FER). The network, an optimized version of the VGG16, has far fewer parameters, which reduces the training and fine-tuning effort. This makes our model well-suited for adapting to specific characteristics of affect perception, which may differ depending on where the network is applied.

We perform a series of experiments to demonstrate the ability of the network to adapt to different scenarios, represented here by the different datasets, each of which imposes specific constraints on how affect is represented through facial expressions. Our model proved to be on par with, if not better than, the state-of-the-art solutions reported on these datasets. To guarantee the reproducibility and dissemination of our model, we have made it fully available on GitHub (https://github.com/pablovin/AffectiveMemoryFramework).

In the future, we plan to study the application of our model on platforms with reduced data-processing capabilities, such as robots. We also plan to optimize the model further by investigating different transfer-learning methods, making it even more robust to re-adaptation.

References

  • [1] S. Afzal and P. Robinson (2009) Natural affect data - collection and annotation in a learning context. In 3rd International Conference on Affective Computing and Intelligent Interaction, pp. 1–7.
  • [2] L. F. Barrett (2006) Solving the emotion paradox: categorization and the experience of emotion. Personality and Social Psychology Review 10 (1), pp. 20–46.
  • [3] P. Barros, N. Churamani, E. Lakomkin, H. Sequeira, A. Sutherland, and S. Wermter (2018) The OMG-Emotion Behavior Dataset. In International Joint Conference on Neural Networks (IJCNN), pp. 1408–1414.
  • [4] P. Barros and S. Wermter (2016) Developing crossmodal expression recognition based on a deep neural model. Adaptive Behavior 24 (5), pp. 373–396.
  • [5] E. Barsoum, C. Zhang, C. Canton Ferrer, and Z. Zhang (2016) Training deep networks for facial expression recognition with crowd-sourced label distribution. In ACM International Conference on Multimodal Interaction (ICMI).
  • [6] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pp. 2546–2554.
  • [7] F. Cavallo, F. Semeraro, L. Fiorini, G. Magyar, P. Sinčák, and P. Dario (2018) Emotion modelling for social robotics applications: a review. Journal of Bionic Engineering 15 (2), pp. 185–203.
  • [8] S. Chen, Y. Tian, Q. Liu, and D. N. Metaxas (2013) Recognizing expressions from face and body gesture by temporal normalized motion and appearance features. Image and Vision Computing 31 (2), pp. 175–185.
  • [9] W. Y. Choi, K. Y. Song, and C. W. Lee (2018) Convolutional attention networks for multimodal emotion recognition from speech and text data. In Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), pp. 28–34.
  • [10] D. Deng, Y. Zhou, J. Pi, and B. E. Shi (2018) Multimodal utterance-level affect analysis using visual, audio and text features. arXiv preprint arXiv:1805.00625.
  • [11] A. Dhall, R. Goecke, S. Lucey, T. Gedeon, et al. (2012) Collecting large, richly annotated facial-expression databases from movies. IEEE Multimedia 19 (3), pp. 34–41.
  • [12] Z. Du, S. Wu, D. Huang, W. Li, and Y. Wang (2019) Spatio-temporal encoder-decoder fully convolutional network for video-based dimensional emotion recognition. IEEE Transactions on Affective Computing.
  • [13] P. Ekman and W. V. Friesen (1971) Constants across cultures in the face and emotion. Journal of Personality and Social Psychology 17 (2), pp. 124–129.
  • [14] Y. Fregnac, C. Monier, F. Chavane, P. Baudot, and L. Graham (2003) Shunting inhibition, a silent step in visual cortical computation. Journal of Physiology, pp. 441–451.
  • [15] P. E. Griffiths (2003) Basic emotions, complex emotions, machiavellian emotions. Royal Institute of Philosophy Supplements 52, pp. 39–67.
  • [16] H. Gunes and M. Piccardi (2006) A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior. In 18th International Conference on Pattern Recognition (ICPR), Vol. 1, pp. 1148–1153.
  • [17] H. Gunes and M. Piccardi (2009) Automatic temporal segment detection and affect recognition from face and body display. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39 (1), pp. 64–84.
  • [18] S. Hamann and T. Canli (2004) Individual differences in emotion processing. Current Opinion in Neurobiology 14 (2), pp. 233–238.
  • [19] D. Hazarika, S. Gorantla, S. Poria, and R. Zimmermann (2018) Self-attentive feature-level fusion for multimodal emotion detection. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 196–201.
  • [20] U. Hess, C. Blaison, and K. Kafetsios (2016) Judging facial emotion expressions in context: the influence of culture and self-construal orientation. Journal of Nonverbal Behavior 40 (1), pp. 55–64.
  • [21] C. Hewitt and H. Gunes (2018) CNN-based facial affect analysis on mobile devices. arXiv preprint arXiv:1807.08775.
  • [22] K. Huang, C. Wu, Q. Hong, M. Su, and Y. Chen (2019) Speech emotion recognition using deep neural network considering verbal and nonverbal speech sounds. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5866–5870.
  • [23] D. Kollias, S. Cheng, E. Ververas, I. Kotsia, and S. Zafeiriou (2020) Deep neural network augmentation: generating faces for affect analysis. International Journal of Computer Vision, pp. 1–30.
  • [24] D. Kollias, A. Schulc, E. Hajiyev, and S. Zafeiriou (2020) Analysing affective behavior in the first ABAW 2020 competition. arXiv preprint arXiv:2001.11409.
  • [25] D. Kollias, P. Tzirakis, M. A. Nicolaou, A. Papaioannou, G. Zhao, B. Schuller, I. Kotsia, and S. Zafeiriou (2019) Deep affect prediction in-the-wild: Aff-Wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision, pp. 1–23.
  • [26] M. E. Kret, K. Roelofs, J. J. Stekelenburg, and B. de Gelder (2013) Emotional signals from faces, bodies and scenes influence observers’ face expressions, fixations and pupil-size. Frontiers in Human Neuroscience 7.
  • [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • [28] I. Lawrence and K. Lin (1989) A concordance correlation coefficient to evaluate reproducibility. Biometrics, pp. 255–268.
  • [29] M. Li, H. Xu, X. Huang, Z. Song, X. Liu, and X. Li (2018) Facial expression recognition with identity and emotion joint learning. IEEE Transactions on Affective Computing.
  • [30] A. Lindt, P. Barros, H. Siqueira, and S. Wermter (2019) Facial expression editing with continuous emotion labels. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pp. 1–8.
  • [31] E. Marinoiu, M. Zanfir, V. Olaru, and C. Sminchisescu (2018) 3D human sensing, action and emotion recognition in robot assisted therapy of children with autism. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2158–2167.
  • [32] D. Mehta, M. Siddiqui, and A. Javaid (2018) Facial emotion recognition: a survey and real-world user experiences in mixed reality. Sensors 18 (2), pp. 416.
  • [33] S. Miao, H. Xu, Z. Han, and Y. Zhu (2019) Recognizing facial expressions using a shallow convolutional neural network. IEEE Access 7, pp. 78000–78011.
  • [34] A. Mollahosseini, B. Hasani, and M. H. Mahoor (2017) AffectNet: a database for facial expression, valence, and arousal computing in the wild. arXiv preprint arXiv:1708.03985.
  • [35] S. Peng, L. Zhang, Y. Ban, M. Fang, and S. Winkler (2018) A deep network for arousal-valence emotion prediction with acoustic-visual cues. arXiv preprint arXiv:1805.00638.
  • [36] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [37] H. Siqueira, S. Magg, and S. Wermter (2020) Efficient facial feature learning with wide ensemble-based convolutional neural networks. arXiv preprint arXiv:2001.06338.
  • [38] J. Yang, K. Wang, X. Peng, and Y. Qiao (2018) Deep recurrent multi-instance learning with spatio-temporal features for engagement intensity prediction. In Proceedings of the 2018 International Conference on Multimodal Interaction, pp. 594–598.
  • [39] A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L. Morency (2018) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In The 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 2236–2246.
  • [40] S. Zafeiriou, D. Kollias, M. A. Nicolaou, A. Papaioannou, G. Zhao, and I. Kotsia (2017) Aff-Wild: valence and arousal ‘in-the-wild’ challenge. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pp. 1980–1987.
  • [41] Z. Zheng, C. Cao, X. Chen, and G. Xu (2018) Multimodal emotion recognition for one-minute-gradual emotion challenge. arXiv preprint arXiv:1805.01060.