Human beings are able to express and recognize emotions as a way to communicate an inner state. Facial expression is the main channel for conveying this information, and its study has transformed how the scientific community treats emotions. Traditionally, scientists assumed that people have internal mechanisms comprising a small set of emotional reactions (e.g., happiness, anger, sadness, fear, disgust) that are measurable and objective. Understanding these mental states from facial and body cues is a fundamental human trait, and such aptitude is vital in our daily communications and social interactions. In fields such as human-computer interaction (HCI), neuroscience, and computer vision, scientists have conducted extensive research to understand human emotions. Some of these studies aspire to create computers that can understand and respond to human emotions and to our general behavior, potentially leading to seamless, beneficial interactions between humans and computers. Our work aims to contribute to this effort, more specifically in the area of Facial Expression Recognition, or FER for short.
Deep Convolutional Neural Networks (CNNs) have recently shown excellent performance in a wide variety of image classification tasks [18, 30, 33, 32]. The careful design of local-to-global feature learning with convolution, pooling, and layered architectures produces a rich visual representation, making CNNs a powerful tool for facial expression recognition. Research challenges such as the Emotion Recognition in the Wild (EmotiW) series (https://sites.google.com/view/emotiw2018) and Kaggle's Facial Expression Recognition Challenge (https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge) revealed the growing interest of the community in the use of deep learning for this problem, a trend we adopt in this work.
Recent approaches to the FER problem process the entire image regardless of where the face is located, exposing them to potentially harmful noise and artifacts and incurring unnecessary additional computational cost. This is problematic because the minutiae that characterize facial expressions can be affected by environmental elements such as hair, jewelry, and other objects proximal to the face but in the image background. Some methods use heuristics to reduce the size of the searched facial regions. Such approaches contrast with our understanding of human visual perception, which quickly parses the field of view, discards irrelevant information, and then focuses the main processing on a specific target region of interest – the so-called visual attention mechanism. Our approach tries to mimic this behavior: it aims to suppress the contribution of surrounding deterrent elements and concentrates recognition solely on facial regions. Figure 1 illustrates how the attention mechanism works in a typical scene.
Although deep networks achieve excellent accuracy and performance on state-of-the-art machine learning benchmarks, their computational architecture is much simpler than the mechanisms in the human visual cortex. Attention has also long been studied by neuroscientists, who believe it is crucial for visual perception and cognition, as it is inherently tied to the architecture of the visual cortex and can affect the information it processes.
Our contributions are summarized as follows: (1) To the best of our knowledge, this is the first CNN-based method using attention to jointly solve for representation and classification in FER problems; (2) We propose a dual-branch network to extract an attention map which in turn improves the learning of kernels specific to facial expression; (3) A new loss function is formulated for obtaining a facial manifold represented as a Gaussian Mixture Model; and (4) We create a synthetic generator to render face expressions.
2 Related Works
Tang proposed jointly learning a deep CNN with a linear Support Vector Machine (SVM) output. His method achieved first place on both the public (validation) and private data of the FER-2013 Challenge. Liu et al. proposed a facial expression recognition framework using 3DCNN together with deformable action parts constraints to jointly localize facial action parts and learn part-based representations for expression recognition. Liu et al. followed by including pre-trained Caffe CNN models to extract image-level features. In the work of Kahou et al., a CNN was trained for video recognition and a deep Restricted Boltzmann Machine (RBM) was trained for audio recognition. "Bag of mouth" features were also extracted to further improve the performance.
Yu and Zhang achieved state-of-the-art results in EmotiW 2015 using CNNs. They used an ensemble of CNNs, each with five convolutional layers, and showed that randomly perturbing the input images yields a 2–3% boost in accuracy. Specifically, they applied transformations to the input images at training time; at testing time, their model generated predictions for multiple perturbations of each test example and voted on the class label to produce a final answer. Also of interest is that they used stochastic pooling rather than max pooling due to its good performance on limited training data. Mollahosseini et al. have also obtained state-of-the-art results with their network consisting of two convolutional layers, max-pooling, and four inception layers, the latter introduced by GoogLeNet. Their architecture was tested on many publicly available datasets.
3 Proposed Method
In this section, we describe our contributions: the design of a new network architecture, the formulation of the loss functions used for training, and our method to generate synthetic data.
3.1 Network architecture
Given a facial expression image I, our objective is to obtain a good representation and classification of I. The proposed model, Facial Expression Recognition with Attention Net (FERAtt), is based on the dual-branch architecture [11, 20, 27, 41] and consists of four major modules: (i) an attention module to extract the attention feature map, (ii) a feature extraction module to obtain essential features from the input image I, (iii) a reconstruction module to estimate a good attention image, and (iv) a representation module that is responsible for the representation and classification of the facial expression image. An overview of the proposed model is illustrated in Figure 2.
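The four-module pipeline above can be sketched as a minimal PyTorch model. This is an illustrative sketch only: layer counts, channel widths, and all names (FERAttSketch and its submodules) are our assumptions, not the exact FERAtt configuration.

```python
import torch
import torch.nn as nn

class FERAttSketch(nn.Module):
    """Minimal sketch of the dual-branch, four-module design.
    Layer sizes are illustrative, not the paper's exact ones."""
    def __init__(self, num_classes=8, dim=64):
        super().__init__()
        # (i) attention module: produces an attention map in [0, 1]
        self.attention = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid())
        # (ii) feature extraction: no pooling/striding, spatial size preserved
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1))
        # (iii) reconstruction: combine branches and resize to 32x32
        self.reconstruction = nn.Sequential(
            nn.Conv2d(3, 3, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(32))
        # (iv) representation + classification head
        self.represent = nn.Linear(3 * 32 * 32, dim)
        self.classify = nn.Linear(dim, num_classes)

    def forward(self, x):
        att = self.attention(x)                   # attention map
        feat = self.features(x)                   # extracted features
        i_att = self.reconstruction(att * feat)   # attended image, 32x32
        z = self.represent(i_att.flatten(1))      # representation vector
        y = self.classify(z)                      # class logits
        return i_att, z, y

x = torch.randn(2, 3, 128, 128)
i_att, z, y = FERAttSketch()(x)
```

The sketch keeps the key structural choices stated in the text: the feature branch preserves spatial resolution, and the reconstruction step resizes the attended image before classification.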
Attention module. We use an encoder-decoder style network, which has been shown to produce good results for many generative [31, 41] and segmentation tasks. In particular, we choose a variation of a fully convolutional model proposed for semantic segmentation. We add four layers in the encoder with skip connections and a dilation of 2x. The decoder is initialized with pre-trained ResNet34 layers, which significantly accelerates convergence. We denote the output features of the decoder by G_att; they will be used to determine the attention feature map.
Feature extraction module. We use four ResBlocks to extract high-dimensional features for image attention. To maintain spatial information, we do not use any pooling or strided convolutional layers. We denote the extracted features as Z – see Figure 3b.
Reconstruction module. The reconstruction layer adjusts the attention map to create an enhanced input to the representation module. It has two convolutional layers, a ReLU layer, and an Average Pooling layer which, by our design choice, resizes the input image to 32x32 pixels. This reduced size was chosen for the input of the representation and classification module (PreActivationResNet), a number we borrowed from the literature to facilitate comparisons. We plan to experiment with other sizes in the future. We denote the feature attention map as I_att – see Figure 3d.
Figure 3: The feature attention map, shown on panel (d), is then classified by the last module of FERAtt; the image shown here is before reduction to the classifier's input size.
Representation and classification module. For the representation and classification of facial expressions, we have chosen a Fully Convolutional Network (FCN) based on PreActivationResNet. This architecture has shown excellent results on classification tasks. The output of this FCN is evaluated by a linear layer to obtain a vector z with the desired dimension. Finally, z is evaluated by a regression layer to estimate the probability ŷ_c for each class c.
3.2 Loss functions
The FERAtt network generates three outputs: a feature attention map I_att, a representation vector z, and a classification vector ŷ. In our training data, each image has an associated binary ground-truth mask I_mask corresponding to the face in the image and its expression class y. We train the network by jointly optimizing the sum of attention, representation, and classification losses:

L = L_att + L_rep + L_cls.

We use the pixel-wise MSE loss function for L_att, and for L_cls we use the BCE loss function. For L_rep, we propose a new loss function for the representation, described next.
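The joint objective can be sketched as follows, assuming an unweighted sum of the three terms and one-hot targets for the BCE classification loss; the function name total_loss and its arguments are illustrative, and l_rep stands in for the structured representation loss defined in the next section.

```python
import torch
import torch.nn.functional as F

def total_loss(i_att, mask, logits, target, l_rep):
    """Sum of attention (pixel-wise MSE), classification (BCE over
    one-hot targets), and representation losses, as an unweighted sum."""
    l_att = F.mse_loss(i_att, mask)                             # attention vs. face mask
    l_cls = F.binary_cross_entropy_with_logits(logits, target)  # classification
    return l_att + l_rep + l_cls

i_att = torch.rand(2, 3, 32, 32)                     # predicted attention map
mask = torch.rand(2, 3, 32, 32)                      # ground-truth face mask
logits = torch.randn(2, 8)                           # classification logits
target = F.one_hot(torch.tensor([1, 3]), 8).float()  # one-hot class labels
loss = total_loss(i_att, mask, logits, target, l_rep=torch.tensor(0.1))
```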
3.3 Structured Gaussian Manifold Loss
Suppose that we separate a collection of n samples per class in an embedded space, so that we have c sets D_1, …, D_c, with the samples in D_j, j = 1, …, c, one for each class w_j, and that the outputs of the neural network function z = F(x) are drawn independently according to the probability density p(z|w_j) for input x.

We assume that p(z|w_j) has a known parametric form, and is therefore determined uniquely by the value of a parameter vector θ_j. For example, we might have p(z|w_j) ~ N(μ_j, Σ_j), where θ_j = (μ_j, Σ_j), for the normal distribution with mean μ_j and covariance Σ_j. To show the dependence of p(z|w_j) on θ_j explicitly, we write it as p(z|w_j, θ_j). Our problem is to use the information provided by the training samples to obtain a good transformation function F that generates embedded spaces with a known distribution associated with each category. The a posteriori probability P(w_j|z) can then be computed from p(z|w_j) by Bayes' formula:

P(w_j|z) = p(z|w_j) P(w_j) / Σ_k p(z|w_k) P(w_k).

In this work, we use the normal density function p(z|w_j) = N(μ_j, Σ_j). The objective is to generate embedded sub-spaces with a defined structure. For our first approach we use Gaussian structures:

p(z|w_j) = 1 / ((2π)^{d/2} |Σ_j|^{1/2}) exp( -(1/2)(z - μ_j)^T Σ_j^{-1} (z - μ_j) ),

where θ_j = (μ_j, Σ_j). For the case Σ_j = σ²I:

p(z|w_j) ∝ exp( -‖z - μ_j‖² / (2σ²) ).

In a supervised problem, we know the a posteriori probability P(w_j|x) for the input set. From this, we can define our structured loss function as the mean square error between the a posteriori probability of the input set and the a posteriori probability estimated in the embedded space:

L_rep = E[ Σ_j ( P(w_j|x) - P̂(w_j|z) )² ].
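A NumPy sketch of this structured loss, under the isotropic case Σ_j = σ²I with equal class priors; the function names and the toy two-class embedding are our own, and the ground-truth posterior is taken to be one-hot.

```python
import numpy as np

def gaussian_posteriors(z, mus, sigma=1.0):
    """Posterior P(w_j | z) under isotropic Gaussians N(mu_j, sigma^2 I)
    with equal priors; z: (n, d), mus: (c, d)."""
    d2 = ((z[:, None, :] - mus[None, :, :]) ** 2).sum(-1)  # squared distances
    logp = -d2 / (2 * sigma ** 2)
    logp -= logp.max(axis=1, keepdims=True)                # numerical stability
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def structured_loss(z, labels, mus, sigma=1.0):
    """MSE between the one-hot ground-truth posterior and the posterior
    estimated in the embedded space."""
    q = gaussian_posteriors(z, mus, sigma)
    onehot = np.eye(mus.shape[0])[labels]
    return ((onehot - q) ** 2).mean()

# toy embedded space: two classes clustered around their target means
rng = np.random.default_rng(0)
mus = np.array([[0.0, 0.0], [5.0, 5.0]])
z = np.vstack([rng.standard_normal((10, 2)),
               rng.standard_normal((10, 2)) + 5.0])
labels = np.array([0] * 10 + [1] * 10)
loss = structured_loss(z, labels, mus)   # small, since clusters match the means
```

When the embedding clusters sit on the prescribed Gaussian means, the estimated posteriors approach the one-hot targets and the loss approaches zero, which is the structure the loss is meant to enforce.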
3.4 Synthetic image generator
One of the limiting problems for FER is the small amount of correctly labeled data. In this work, we propose a renderer for the creation of a larger synthetic dataset from real datasets. The renderer allows us to make background changes and geometric transformations of the face image. Figure 4 shows an image generated from an example face of the BU-3DFE dataset and a background image.
Figure 4: (a) Face image; (b) Background; (c) Composition.
The generator method is limited to producing low-level variations that represent small displacements in the facial expression space for the classification component. However, it allows creating a large number of examples to train our end-to-end system, contributing mostly to the attention component. In future work we plan to include high-level features using GANs on the generated masks.
The renderer adjusts the illumination of the face image so that it is inserted into the scene more realistically. An alpha-matte step is applied in the construction of the final composite image of face and background. The luminance channel of the face image is adjusted by multiplying it by a factor derived from the luminance of the region that contains the face in the original image.
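A minimal NumPy sketch of this compositing step. The mean-luminance matching factor, the binary matte, and the BT.601 luminance weights are our assumptions for illustration, not necessarily the renderer's exact formulation.

```python
import numpy as np

def composite(face, alpha, background, bg_luma_mean):
    """Alpha-matte composite of a face crop onto a background, after
    scaling the face's luminance toward a target mean luminance.
    face, background: float arrays in [0, 1], shape (H, W, 3);
    alpha: (H, W) matte in [0, 1]."""
    # luminance via ITU-R BT.601 weights (an assumed choice)
    luma = face @ np.array([0.299, 0.587, 0.114])
    factor = bg_luma_mean / max(luma.mean(), 1e-6)  # illustrative matching factor
    face_adj = np.clip(face * factor, 0.0, 1.0)     # adjust face brightness
    return alpha[..., None] * face_adj + (1 - alpha[..., None]) * background

face = np.full((4, 4, 3), 0.8)                # bright gray "face"
bg = np.full((4, 4, 3), 0.2)                  # dark background
alpha = np.zeros((4, 4)); alpha[1:3, 1:3] = 1.0  # face occupies the center
out = composite(face, alpha, bg, bg_luma_mean=0.4)
```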
4 Experiments
We describe here the creation of the dataset used for training our network and the implementation details. We discuss two groups of experimental results: (1) expression recognition results, to measure the performance of the method with regard to the relevance of the attention module and the proposed loss function, and (2) robustness results, to analyze the behavior of the method under noise.
4.1 Datasets
To evaluate our method, we used two public facial expression datasets, namely Extended Cohn-Kanade (CK+) and BU-3DFE. In all experiments, person-independent FER scenarios are used: subjects in the training set are completely different from the subjects in the test set, i.e., the subjects used for training are not used for testing. The CK+ dataset includes 593 image sequences from 123 subjects. From these, we selected 325 sequences of 118 subjects that meet the criteria for one of the seven emotions. The selected 325 sequences consist of 45 Angry, 18 Contempt, 58 Disgust, 25 Fear, 69 Happy, 28 Sadness, and 82 Surprise. For the neutral face case, we selected the first frame of the sequence for 33 randomly selected subjects. The BU-3DFE dataset is known to be challenging, mainly due to a variety of ethnic/racial ancestries and expression intensities. A total of 600 expressive face images (1 intensity x 6 expressions x 100 subjects) and 100 neutral face images, one for each subject, were used.
We employed a renderer to create training data for the neural network. The renderer uses a facial expression dataset (we use BU-3DFE and CK+, which were segmented to obtain face masks) and a dataset of background images (we chose the COCO dataset). Figure 5 shows examples of images generated by the renderer from the BU-3DFE dataset.
4.2 Implementation and training details
In all experiments we considered the PreActResNet18 architecture for the classification and representation processes. We adopted two approaches: (1) a model with attention and classification, FERAtt+Cls, and (2) a model with attention, classification, and representation, FERAtt+Rep+Cls. These were compared with the classification results of the baseline. For the representation, the last convolutional layer of PreActResNet is evaluated by a linear layer to generate a vector of the selected size. We opted for 64 dimensions for the representation vector.
All models were trained on Nvidia GPUs (P100, K80, Titan XP) using PyTorch (http://pytorch.org/) for 60 epochs on the training set with 200 examples per mini-batch, employing the Adam optimizer. Face images were rescaled to 32x32 pixels. The code for FERAtt is available in a public repository (https://github.com/pedrodiamel/ferattention).
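The stated setup (Adam optimizer, mini-batches of 200, 32x32 inputs) maps onto a standard PyTorch training loop. The model and data below are stand-ins for FERAtt and the synthetic dataset, and we run only two epochs instead of 60 for brevity.

```python
import torch
import torch.nn as nn

# Stand-in model: the real experiments use FERAtt / PreActResNet18.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 8))
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

# Stand-in loader: mini-batches of 200 examples of 3x32x32 face crops.
loader = [(torch.randn(200, 3, 32, 32), torch.randint(0, 8, (200,)))
          for _ in range(2)]

for epoch in range(2):            # the paper trains for 60 epochs
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```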
4.3 Expression recognition results
This set of experiments makes comparisons between a baseline architecture and the different variants of the proposed architecture. We want to evaluate the relevance of the attention module and the proposed loss function.
Metrics. We used distinct metrics to evaluate the proposed methods. Accuracy is calculated as the average number of successes divided by the total number of observations (here, each face is one observation). Precision, recall, F1 score, and the confusion matrix are also used in the analysis of the effectiveness of the system. Demšar recommends the Friedman test followed by the pairwise Nemenyi post-hoc test to compare multiple models over multiple datasets. The Friedman test is a nonparametric alternative to the analysis of variance (ANOVA) test; its null hypothesis is that all models are equivalent. Similar to previous work, Leave-10-subjects-out (L-10-SO) cross-validation was adopted in the evaluation.
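As a sketch of this protocol, SciPy provides the Friedman test directly; the per-fold accuracies below are made-up numbers for illustration only, not our experimental results.

```python
from scipy.stats import friedmanchisquare

# Accuracy of three models over five cross-validation folds
# (illustrative values only).
baseline = [0.71, 0.69, 0.73, 0.70, 0.72]
feratt_cls = [0.75, 0.74, 0.76, 0.73, 0.75]
feratt_cls_rep = [0.79, 0.78, 0.80, 0.77, 0.79]

# Null hypothesis: all models are equivalent.
stat, p = friedmanchisquare(baseline, feratt_cls, feratt_cls_rep)

# If p falls below the significance level, a pairwise post-hoc test
# (e.g., Nemenyi) is then used to locate which models differ.
```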
Tables 1 and 2 show the mean and standard deviation of the results obtained on the real and synthetic BU-3DFE datasets. The Friedman nonparametric ANOVA test reveals significant differences between the methods. The Nemenyi post-hoc test was applied to determine which methods present significant differences. The result of the Nemenyi post-hoc test (two-tailed) shows that there are significant differences between FERAtt+Cls+Rep and all the others at the chosen significance level.
We repeated the experiment for the Synthetic CK+ and Real CK+ datasets. Tables 3 and 4 show the mean and standard deviation of the obtained results. The Friedman test found significant differences between the methods for both the Synthetic CK+ and the Real CK+ datasets. In this case we applied the Bonferroni-Dunn post-hoc test (one-tailed) to strengthen the power of the hypothesis test. For a significance level of 0.05, the Bonferroni-Dunn post-hoc test did not show significant differences between FERAtt+Cls and the Baseline for Synthetic CK+. When considering the FERAtt+Rep+Cls and Baseline methods, it shows significant differences for the Real CK+ dataset.
Figure 6 shows the 64-dimensional embedded space of the Gaussian Structured loss for the Real CK+ dataset, projected using the Barnes-Hut t-SNE visualization scheme. Errors made by the network are mostly due to the neutral class, which is intrinsically similar to the other expressions we analyzed. Surprisingly, we observed intraclass separations along additional attributes, such as race, that were not taken into account when modeling or training the network.
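A projection like the one in Figure 6 can be produced with scikit-learn's Barnes-Hut t-SNE implementation; the 64-dimensional embeddings below are synthetic stand-ins for the learned representation vectors.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the learned 64-dimensional embedded space:
# three classes, 30 samples each, clustered around distinct means.
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(c, 1.0, size=(30, 64)) for c in range(3)])

# Barnes-Hut approximation (sklearn's default method) projects to 2-D
# for visualization.
proj = TSNE(n_components=2, method="barnes_hut", perplexity=10,
            random_state=0).fit_transform(embeddings)
```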
4.4 Robustness to noise
The objective of this set of experiments is to demonstrate the robustness of our method to the presence of image noise when compared to the baseline architecture PreActResNet18.
Protocol. To carry out this experiment, the Baseline, FERAtt+Cls, and FERAtt+Rep+Cls models were trained on the Synthetic CK+ dataset. Each of these models was fine-tuned with increasing levels of noise in the training set. We maintained the training parameters during fine-tuning. We used the real CK+ database, and 2000 images were generated from the synthetic dataset for testing.
Results. One of the advantages of the proposed approach is that we can evaluate the robustness of the method under different noise levels by visually assessing the changes in the attention map. Figure 7 shows the attention maps of an image under white zero-mean Gaussian noise of increasing standard deviation. We observe that our network is quite robust to noise in the range of 0.01 to 0.1 and maintains a distribution of homogeneous intensity values. This aspect is beneficial to the subsequent performance of the classification module. Figures 8 and 9 present the classification accuracy of the evaluated models on the Real CK+ dataset and on 2000 synthetic images. The proposed method FERAtt+Cls+Rep provides the best classification in both cases.
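The noise-robustness protocol can be sketched as follows; the stand-in linear model and the helper name accuracy_under_noise are illustrative, not part of the released code.

```python
import torch
import torch.nn as nn

def accuracy_under_noise(model, images, labels, sigmas):
    """Evaluate classification accuracy while adding white zero-mean
    Gaussian noise of increasing standard deviation to the inputs."""
    results = {}
    model.eval()
    with torch.no_grad():
        for s in sigmas:
            noisy = (images + s * torch.randn_like(images)).clamp(0, 1)
            preds = model(noisy).argmax(dim=1)
            results[s] = (preds == labels).float().mean().item()
    return results

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 8))  # stand-in
images = torch.rand(16, 3, 32, 32)
labels = torch.randint(0, 8, (16,))
acc = accuracy_under_noise(model, images, labels, sigmas=[0.01, 0.05, 0.1])
```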
5 Conclusions
In this work, we present a new end-to-end network architecture with an attention model for facial expression recognition. We create a generator of synthetic images which are used for training our models. The results show that, under these experimental conditions, the attention module improves the classification performance of the system. The proposed loss function acts as a regularizer on the embedded space, contributing positively to the system results. As future work, we will experiment with larger databases that contain real-world images and are potentially more challenging.
-  (2016) Training deep networks for facial expression recognition with crowd-sourced label distribution. In ACM International Conference on Multimodal Interaction (ICMI), Cited by: §5.
-  (2016) Emergence of foveal image sampling from learning to attend in visual scenes. arXiv preprint arXiv:1611.09430. Cited by: §1.
-  (2003) Theoretical neuroscience: computational and mathematical modeling of neural systems. Journal of Cognitive Neuroscience 15 (1), pp. 154–155. Cited by: §1.
-  (2006) Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7, pp. 1–30. Cited by: §4.3.
-  (2016) Attend, infer, repeat: fast scene understanding with generative models. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 3225–3233. Cited by: §1.
-  (2018) Fast and robust multiple colorchecker detection using deep convolutional neural networks. arXiv preprint arXiv:1810.08639. Cited by: §3.4.
-  (2013) Challenges in representation learning: a report on three machine learning contests. In International Conference on Neural Information Processing, pp. 117–124. Cited by: §2.
-  (2014) Fractional max-pooling. arXiv preprint arXiv:1412.6071. Cited by: §2.
-  (2016) Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471. Cited by: §1.
-  (2015) DRAW: a recurrent neural network for image generation. arXiv preprint arXiv:1502.04623. Cited by: §1.
-  (2017) Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2980–2988. Cited by: §3.1.
-  (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. Cited by: §3.1.
-  (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §3.1, §3.1.
-  (2017) DyadGAN: generating facial expressions in dyadic interactions. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pp. 2259–2266. Cited by: §3.4.
-  (2001) Computational modelling of visual attention. Nature Reviews Neuroscience 2 (3), pp. 194. Cited by: §1.
-  (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2017–2025. Cited by: §1.
-  (2013) Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction, ICMI ’13, New York, NY, USA, pp. 543–550. Cited by: §2.
-  (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. Cited by: §1.
-  (2018) Deep facial expression recognition: a survey. arXiv preprint arXiv:1804.08348. Cited by: §1.
-  (2016) Deep joint image filtering. In European Conference on Computer Vision, pp. 154–169. Cited by: §3.1.
-  (2017) Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Vol. 1, pp. 4. Cited by: §3.1.
-  (2014) Combining multiple kernel methods on riemannian manifold for emotion recognition in the wild. In Proceedings of the 16th International Conference on Multimodal Interaction, pp. 494–501. Cited by: §2.
-  (2014) Facial expression recognition via a boosted deep belief network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1805–1812. Cited by: §2.
-  (2010) The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pp. 94–101. Cited by: §4.1.
-  (2014) Recurrent models of visual attention. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2204–2212. Cited by: §1.
-  (2016) Going deeper in facial expression recognition using deep neural networks. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pp. 1–10. Cited by: §2.
-  (2018) Learning dual convolutional neural networks for low-level vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3070–3079. Cited by: §3.1.
-  (2013) Manifold based sparse representation for facial understanding in natural images. Image and Vision Computing 31 (5), pp. 365–378. Cited by: §4.3.
-  (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351, pp. 234–241. Note: (available on arXiv:1505.04597 [cs.CV]) Cited by: §3.1.
-  (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §1.
-  (2018) “Zero-shot” super-resolution using deep internal learning. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
-  (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: §1.
-  (2015) Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2013) Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239. Cited by: §2.
-  (2000) Mechanisms of visual attention in the human cortex. Annual review of neuroscience 23 (1), pp. 315–341. Cited by: §1.
-  (2014) Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research 15 (1), pp. 3221–3245. Cited by: Figure 6, §4.3.
-  (2015) Grammar as a foreign language. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2773–2781. Cited by: §1.
-  (2006) A 3D facial expression database for facial behavior research. In Automatic face and gesture recognition, 2006. FGR 2006. 7th international conference on, pp. 211–216. Cited by: §4.1.
-  (2015) Image based static facial expression recognition with multiple deep network learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 435–442. Cited by: §2.
-  (2009) A survey of affect recognition methods: Audio, visual, and spontaneous expressions. Pattern Analysis and Machine Intelligence, IEEE Transactions on 31 (1), pp. 39–58. Cited by: §4.1.
-  (2016) Deep cascaded bi-network for face hallucination. In European Conference on Computer Vision, pp. 614–630. Cited by: §3.1, §3.1.