FERAtt: Facial Expression Recognition with Attention Net

02/08/2019 ∙ by Pedro D. Marrero Fernandez, et al. ∙ UFPE ∙ California Institute of Technology

We present a new end-to-end network architecture for facial expression recognition with an attention model. It focuses attention in the human face and uses a Gaussian space representation for expression recognition. We devise this architecture based on two fundamental complementary components: (1) facial image correction and attention and (2) facial expression representation and classification. The first component uses an encoder-decoder style network and a convolutional feature extractor that are pixel-wise multiplied to obtain a feature attention map. The second component is responsible for obtaining an embedded representation and classification of the facial expression. We propose a loss function that creates a Gaussian structure on the representation space. To demonstrate the proposed method, we create two larger and more comprehensive synthetic datasets using the traditional BU3DFE and CK+ facial datasets. We compared results with the PreActResNet18 baseline. Our experiments on these datasets have shown the superiority of our approach in recognizing facial expressions.




1 Introduction

Human beings are able to express and recognize emotions as a way to communicate an inner state. Facial expression is the main channel for conveying this information, and its understanding has transformed how the scientific community treats emotions. Traditionally, scientists assumed that people have internal mechanisms comprising a small set of emotional reactions (e.g., happiness, anger, sadness, fear, disgust) that are measurable and objective. Understanding these mental states from facial and body cues is a fundamental human trait, and such aptitude is vital in our daily communications and social interactions. In fields such as human-computer interaction (HCI), neuroscience, and computer vision, scientists have conducted extensive research to understand human emotions. Some of these studies aspire to create computers that can understand and respond to human emotions and to our general behavior, potentially leading to seamless, beneficial interactions between humans and computers. Our work aims to contribute to this effort, more specifically in the area of Facial Expression Recognition, or FER for short.

Figure 1: Example of attention in a selfie image. Facial expression is recognized on the front face, which our approach separates from the less prominent components of the image. Our goal is to jointly train for attention and classification, where faces are segmented and their expressions learned by a dual-branch network. By focusing attention on the face features, we try to eliminate the potentially detrimental influence of the other elements in the image during facial expression classification. Our formulation thus differs in that it explicitly learns expressions solely on detected faces and not on other, irrelevant parts of the image.

Deep Convolutional Neural Networks (CNNs) have recently shown excellent performance in a wide variety of image classification tasks [18, 30, 33, 32]. The careful design of local-to-global feature learning with convolution, pooling, and layered architectures produces a rich visual representation, making CNNs a powerful tool for facial expression recognition [19]. Research challenges such as the Emotion Recognition in the Wild (EmotiW) series (https://sites.google.com/view/emotiw2018) and Kaggle's Facial Expression Recognition Challenge (https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge) reveal the growing interest of the community in using deep learning to solve this problem, a trend we adopt in this work.

Recent approaches to the FER problem process the entire image regardless of where the face is located, exposing them to potentially harmful noise and artifacts and incurring unnecessary additional computational cost. This is problematic because the minutiae that characterize facial expressions can be affected by environmental elements such as hair, jewelry, and other objects that are proximal to the face but belong to the image background. Some methods use heuristics to narrow the search for facial regions. Such approaches contrast with our understanding of human visual perception, which quickly parses the field of view, discards irrelevant information, and then focuses the main processing on a specific target region of interest: the so-called visual attention mechanism [15]. Our approach tries to mimic this behavior: it aims to suppress the contribution of surrounding distracting elements and concentrates recognition solely on facial regions. Figure 1 illustrates how the attention mechanism works in a typical scene.

Attention mechanisms have recently been explored in a wide variety of contexts [37, 16], often providing new capabilities to algorithms [9, 10, 5]. While they improve efficiency [25] and performance on state-of-the-art machine learning benchmarks [37], their computational architecture is much simpler than the mechanisms in the human visual cortex [3]. Attention has also long been studied by neuroscientists [35], who believe it is crucial for visual perception and cognition [2], as it is inherently tied to the architecture of the visual cortex and can affect how visual information is processed.

Our contributions are summarized as follows: (1) To the best of our knowledge, this is the first CNN-based method using attention to jointly solve for representation and classification in FER problems; (2) We propose a dual-branch network to extract an attention map which in turn improves the learning of kernels specific to facial expression; (3) A new loss function is formulated for obtaining a facial manifold represented as a Gaussian Mixture Model; and (4) We create a synthetic generator to render face expressions.

2 Related Works

Tang [34] proposed jointly learning a deep CNN with a linear Support Vector Machine (SVM) output. His method achieved first place on both the public (validation) and private data of the FER-2013 Challenge [7]. Liu et al. [23] proposed a facial expression recognition framework using 3D CNNs together with deformable action-part constraints to jointly localize facial action parts and learn part-based representations for expression recognition. Liu et al. [22] followed by including pre-trained Caffe CNN models to extract image-level features. In the work of Kahou et al. [17], a CNN was trained for video recognition and a deep Restricted Boltzmann Machine (RBM) was trained for audio recognition. "Bag of mouth" features were also extracted to further improve performance.

Yu and Zhang achieved state-of-the-art results at EmotiW 2015 using an ensemble of CNNs, each with five convolutional layers [39], and showed that randomly perturbing the input images yielded a 2-3% boost in accuracy. Specifically, they applied transformations to the input images at training time; at test time, their model generated predictions for multiple perturbations of each test example and voted on the class label to produce a final answer. Also of interest in this work is their use of stochastic pooling [8] rather than max pooling, due to its good performance on limited training data. Mollahosseini et al. have also obtained state-of-the-art results [26] with a network consisting of two convolutional layers, max pooling, and four inception layers, the latter introduced by GoogLeNet. Their architecture was tested on many publicly available datasets.

3 Methodology

In this section, we describe our contributions in designing a new network architecture, the formulation of the loss functions used for training, and our method to generate synthetic data.

3.1 Network architecture

Given a facial expression image, our objective is to obtain a good representation and classification of it. The proposed model, Facial Expression Recognition with Attention Net (FERAtt), is based on the dual-branch architecture [11, 20, 27, 41] and consists of four major modules: (i) an attention module to extract the attention feature map, (ii) a feature extraction module to obtain essential features from the input image, (iii) a reconstruction module to estimate a good attention image, and (iv) a representation module that is responsible for the representation and classification of the facial expression image. An overview of the proposed model is illustrated in Figure 2.

Figure 2: Architecture of FERAtt. Our model consists of four major modules: an attention module, a feature extraction module, a reconstruction module, and a classification and representation module. The features extracted by the attention and feature extraction modules are combined to create the attention map, which in turn is fed into the representation module to create a representation of the image. Input images are reduced to 32×32 pixels by an average pooling layer in the reconstruction module. Classification is thus done on these smaller but richer representations of the original image.


Attention module. We use an encoder-decoder style network, which has been shown to produce good results for many generative [31, 41] and segmentation tasks [29]. In particular, we choose a variation of the fully convolutional model proposed in [29] for semantic segmentation. We add four layers to the encoder with skip connections and a dilation of 2×. The decoder is initialized with pre-trained ResNet34 [12] layers, which significantly accelerates convergence. The output features of the decoder are used to determine the attention feature map.

Feature extraction module. We use four ResBlocks [21] to extract high-dimensional features for image attention. To maintain spatial information, we do not use any pooling or strided convolutional layers. The extracted features are shown in Figure 3b.
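A minimal PyTorch sketch of such a pooling-free residual block, with the channel count and input size chosen for illustration (the paper does not specify them here):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block with no pooling or striding, so spatial size is preserved."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # identity skip; spatial information is kept

# Four stacked blocks, as in the feature extraction module.
extractor = nn.Sequential(*[ResBlock(32) for _ in range(4)])
x = torch.randn(1, 32, 128, 128)
assert extractor(x).shape == x.shape  # spatial dimensions unchanged
```

Because no layer downsamples, the output feature map has exactly the input's spatial resolution, which is what allows a pixel-wise combination with the attention branch.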

Reconstruction module. The reconstruction layer adjusts the attention map to create an enhanced input to the representation module. It has two convolutional layers, a ReLU layer, and an average pooling layer which, by our design choice, resizes the input image to 32×32 pixels. This reduced size was chosen for the input of the representation and classification module (PreActResNet [13]), a number we borrowed from the literature to facilitate comparisons; we plan to experiment with other sizes in the future. The resulting feature attention map is shown in Figure 3d.
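How the three branches fit together can be sketched as follows. The single convolutions standing in for the attention and feature branches, the 3-channel/128×128 shapes, and the adaptive pooling used for the resize are all illustrative stand-ins, not the paper's exact layers:

```python
import torch
import torch.nn as nn

# Placeholder branches: the paper uses a U-Net-style attention branch,
# four ResBlocks for feature extraction, and a small reconstruction head.
g_att = nn.Conv2d(3, 32, 3, padding=1)   # stands in for the encoder-decoder
g_ft  = nn.Conv2d(3, 32, 3, padding=1)   # stands in for the ResBlock stack
g_rec = nn.Sequential(                   # two convs + ReLU + pooling resize
    nn.Conv2d(32, 16, 3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 3, 3, padding=1),
    nn.AdaptiveAvgPool2d(32),            # average-pool down to 32x32
)

def attention_map(image: torch.Tensor) -> torch.Tensor:
    fused = g_att(image) * g_ft(image)   # pixel-wise multiplication of branches
    return g_rec(fused)                  # clean, resized attention image

img = torch.randn(1, 3, 128, 128)
print(attention_map(img).shape)          # torch.Size([1, 3, 32, 32])
```

The pixel-wise product is what lets the attention branch gate the extracted features before reconstruction.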

Figure 3: Generation of the attention map. A noisy input image (a) is processed by the feature extraction and attention modules, whose results, shown respectively in panels (b) and (c), are combined and then fed into the reconstruction module. This in turn produces a clean and focused attention map, shown in panel (d), that is then classified by the last module of FERAtt. The image shown here is before the size reduction.

Representation and classification module. For the representation and classification of facial expressions, we have chosen a fully convolutional network (FCN) based on PreActResNet [13]. This architecture has shown excellent results on classification tasks. The output of this FCN is passed through a linear layer to obtain a vector with the desired dimension. Finally, this vector is passed through a regression layer to estimate the probability of each class.
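A minimal sketch of this head, with a softmax standing in for the final scoring layer; the 512-channel backbone output, batch size, and 8-class count are illustrative assumptions (the 64-d embedding matches the dimension reported in Section 4.2):

```python
import torch
import torch.nn as nn

# Final conv features of a backbone such as PreActResNet18 (shape assumed).
backbone_out = torch.randn(8, 512, 1, 1)

to_embedding = nn.Linear(512, 64)   # 64-d representation vector
to_logits = nn.Linear(64, 8)        # one score per expression class

z = to_embedding(backbone_out.flatten(1))        # (batch, 64) embeddings
probs = torch.softmax(to_logits(z), dim=1)       # probability per class
assert torch.allclose(probs.sum(dim=1), torch.ones(8))
```

The intermediate vector `z` is the embedding on which the structured loss of Section 3.3 operates.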

3.2 Loss functions

The FERAtt network generates three outputs: a feature attention map, a representation vector, and a classification vector. In our training data, each image has an associated binary ground-truth mask corresponding to a face in the image and its expression class. We train the network by jointly optimizing the sum of attention, representation, and classification losses:

L = L_att + L_rep + L_cls

We use the pixel-wise MSE loss function for L_att, and for L_cls we use the BCE loss function. We propose a new loss function for the representation, described next.
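A minimal PyTorch sketch of this joint objective, with the representation term passed in precomputed and the tensor shapes chosen for illustration:

```python
import torch
import torch.nn.functional as F

def total_loss(att_map, gt_face, logits, target, rep_loss):
    """Sum of attention (pixel-wise MSE), classification (BCE), and
    representation losses, as in the joint training objective."""
    l_att = F.mse_loss(att_map, gt_face)                        # attention branch
    l_cls = F.binary_cross_entropy_with_logits(logits, target)  # classification
    return l_att + l_cls + rep_loss                             # plus representation

# Illustrative shapes: batch of 2, 32x32 attention maps, 8 classes.
att = torch.rand(2, 1, 32, 32)
gt = torch.rand(2, 1, 32, 32)
logits = torch.randn(2, 8)
target = F.one_hot(torch.randint(0, 8, (2,)), 8).float()
loss = total_loss(att, gt, logits, target, rep_loss=torch.tensor(0.1))
```

The three terms are simply summed here; the paper does not report per-term weights.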

3.3 Structured Gaussian Manifold Loss

Suppose that we separate a collection of n samples per class in an embedded space so that we have c sets D_1, …, D_c, one for each class w_j, and that the outputs ẑ = F(x) of the neural net function F, for inputs x in D_j, are drawn independently according to the probability p(ẑ | w_j).

We assume that p(ẑ | w_j) has a known parametric form, and is therefore determined uniquely by the value of a parameter vector θ_j. For example, we might have p(ẑ | w_j) ~ N(μ_j, σ_j²), where θ_j = (μ_j, σ_j²), for the normal distribution with mean μ_j and variance σ_j². To show the dependence of p on θ_j explicitly, we write p(ẑ | w_j) as p(ẑ | w_j, θ_j). Our problem is to use the information provided by the training samples to obtain a good transformation function F that generates embedded spaces with a known distribution associated with each category. The a posteriori probability p(w_j | ẑ) can then be computed from p(ẑ | w_j) by Bayes' formula:

p(w_j | ẑ) = p(ẑ | w_j) P(w_j) / Σ_k p(ẑ | w_k) P(w_k)

In this work, we use the normal density function p(ẑ | w_j, θ_j) = N(μ_j, σ²I). The objective is to generate embedded sub-spaces with a defined structure. For our first approach we use Gaussian structures:

N(μ_j, σ²I) ∝ exp(−‖ẑ − μ_j‖² / (2σ²))

where μ_j is the center assigned to class w_j. For the case of equal priors P(w_j) = 1/c, the posterior reduces to:

p(w_j | ẑ) = exp(−‖ẑ − μ_j‖² / (2σ²)) / Σ_k exp(−‖ẑ − μ_k‖² / (2σ²))

In a supervised problem, we know the a posteriori probability p(w_j | x) for the input set. From this, we can define our structured loss function as the mean square error between the a posteriori probability of the input set and the a posteriori probability estimated in the embedded space:

L_rep = (1/n) Σ_i Σ_j ( p(w_j | x_i) − p(w_j | ẑ_i) )²
3.4 Synthetic image generator

One of the limiting problems for FER is the small amount of correctly labeled data. In this work, we propose a renderer for creating a larger synthetic dataset from real datasets, as presented in [6]. The renderer allows us to make background changes and geometric transformations of the face image. Figure 4 shows an image generated from an example face of the BU-3DFE dataset and a background image.

  (a) Face image    (b) Background    (c) Composition
Figure 4: Example of synthetic image generation. A cropped face image and a general background image are combined to generate a composite image. By using distinct background images for every face image we are able to generate a much larger training data set.


The generator is limited to producing low-level variations, which represent small displacements in the facial expression space for the classification component. However, it allows us to create a large number of examples to train our end-to-end system, with the largest contribution going to the attention component. In future work we plan to include high-level variations by applying GANs to the generated masks [14].

The renderer adjusts the illumination of the face image so that it is inserted into the scene more realistically. An alpha-matte step is applied in the construction of the final composite image of face and background. The luminance channel of the face image is adjusted by multiplying it by a factor derived from the luminance of the region that contains the face in the original image.
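The compositing step can be sketched as below. The paper does not give the exact luminance factor, so a simple ratio of background-region to face luminance is used as a hypothetical stand-in, and all array shapes are illustrative:

```python
import numpy as np

def composite(face, mask, background, face_lum, bg_lum):
    """Alpha-matte a face onto a background, scaling the face's luminance
    toward the background region's luminance. The exact adjustment factor
    used by the renderer is not specified; a plain ratio is a stand-in."""
    factor = bg_lum / max(face_lum, 1e-6)          # hypothetical adjustment
    face = np.clip(face * factor, 0.0, 1.0)
    return mask * face + (1.0 - mask) * background  # alpha matte

face = np.random.rand(64, 64, 3)                   # cropped face, values in [0, 1]
mask = np.zeros((64, 64, 1))
mask[16:48, 16:48] = 1.0                           # binary face mask
bg = np.random.rand(64, 64, 3)                     # background crop
out = composite(face, mask, bg, face.mean(), bg.mean())
```

A real renderer would also apply the geometric transformations and color augmentations described in Section 4.1 before compositing.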

4 Experiments

We describe here the creation of the dataset used for training our network and its implementation details. We discuss two groups of experimental results: (1) expression recognition results, which measure the performance of the method with respect to the relevance of the attention module and the proposed loss function, and (2) noise robustness results, which analyze the method's robustness to image noise.

4.1 Datasets

To evaluate our method, we used two public facial expression datasets, namely Extended Cohn-Kanade (CK+) [24] and BU-3DFE [38]. In all experiments, person-independent FER scenarios are used [40]: subjects in the training set are completely different from the subjects in the test set, i.e., the subjects used for training are not used for testing. The CK+ dataset includes 593 image sequences from 123 subjects. From these, we selected 325 sequences of 118 subjects that meet the criteria for one of the seven emotions [24]. The selected 325 sequences consist of 45 Angry, 18 Contempt, 58 Disgust, 25 Fear, 69 Happy, 28 Sadness, and 82 Surprise [24]. For the neutral face case, we selected the first frame of the sequence of 33 randomly selected subjects. The BU-3DFE dataset is known to be challenging, mainly due to a variety of ethnic/racial ancestries and expression intensities [38]. A total of 600 expressive face images (1 intensity × 6 expressions × 100 subjects) and 100 neutral face images, one for each subject, were used [38].

We employed a renderer to create training data for the neural network. The renderer uses a facial expression dataset (we use BU-3DFE and CK+, which were segmented to obtain face masks) and a dataset of background images (we chose the COCO dataset). Figure 5 shows examples of images generated by the renderer on the BU-3DFE dataset.

Figure 5: Examples from the synthetic BU-3DFE dataset. Different faces are transformed and combined with randomly selected background images from the COCO dataset. After transformation, color augmentation is applied (brightness, contrast, Gaussian blur, and noise).


4.2 Implementation and training details

In all experiments we considered the PreActResNet18 architecture for the classification and representation processes. We adopted two approaches: (1) a model with attention and classification, FERAtt+Cls, and (2) a model with attention, classification, and representation, FERAtt+Rep+Cls. These were compared against the baseline classification results. For the representation, the last convolutional layer of PreActResNet is passed through a linear layer to generate a vector of selected size; we opted for 64 dimensions for the representation vector.

All models were trained on Nvidia GPUs (P100, K80, Titan XP) using PyTorch for 60 epochs on the training set, with 200 examples per mini-batch, employing the Adam optimizer. Face images were rescaled to 32×32 pixels. The code for FERAtt is available in a public repository (https://github.com/pedrodiamel/ferattention).
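The stated setup can be sketched as a minimal training loop. The tiny linear model and random batch are stand-ins for PreActResNet18 and the synthetic dataset, and only two epochs are run here for brevity:

```python
import torch
import torch.nn as nn

# Stand-in for PreActResNet18 on 32x32 RGB faces with 8 expression classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 8))
optimizer = torch.optim.Adam(model.parameters())   # Adam, as in the paper
criterion = nn.CrossEntropyLoss()

images = torch.randn(200, 3, 32, 32)               # one mini-batch of 200
labels = torch.randint(0, 8, (200,))
for epoch in range(2):                             # the paper trains for 60 epochs
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```

In the full system the classification loss above would be replaced by the joint objective of Section 3.2.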

4.3 Expression recognition results

This set of experiments makes comparisons between a baseline architecture and the different variants of the proposed architecture. We want to evaluate the relevance of the attention module and the proposed loss function.


Protocol. We used distinct metrics to evaluate the proposed methods. Accuracy is calculated as the average number of successes divided by the total number of observations (in this case, each face is considered an observation). Precision, recall, F1 score, and the confusion matrix are also used in the analysis of the effectiveness of the system. Demšar [4] recommends the Friedman test followed by the pairwise Nemenyi test to compare multiple models. The Friedman test is a nonparametric alternative to the analysis of variance (ANOVA) test; its null hypothesis is that all models are equivalent. Similar to the methods in [28], leave-10-subject-out (L-10-SO) cross-validation was adopted in the evaluation.
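The Friedman test is available in SciPy; a sketch with illustrative per-fold accuracies (not the paper's measurements), one list per model:

```python
from scipy.stats import friedmanchisquare

# Hypothetical per-fold accuracies from cross-validation; each position i
# across the three lists corresponds to the same fold.
baseline = [68.1, 70.2, 69.5, 71.0, 67.9, 70.4]
fer_cls  = [74.0, 75.8, 74.9, 76.2, 73.5, 75.6]
fer_rep  = [77.1, 78.4, 77.8, 79.0, 76.6, 78.2]

stat, p = friedmanchisquare(baseline, fer_cls, fer_rep)
if p < 0.05:
    print("reject the null hypothesis that all models are equivalent")
```

If the null is rejected, pairwise post-hoc comparisons such as the Nemenyi test follow; those are provided by third-party packages (e.g., scikit-posthocs) rather than SciPy itself.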

Results. Tables 1 and 2 show the mean and standard deviation of the results obtained on the real and synthetic BU-3DFE datasets. The Friedman nonparametric ANOVA test reveals significant differences between the methods. The Nemenyi post-hoc test was then applied to determine which methods present significant differences. The result of the Nemenyi post-hoc test (two-tailed) shows that there are significant differences between FERAtt+Rep+Cls and all the others at the chosen significance level.

Method          Acc.          Prec.         Rec.          F1
PreActResNet18  69.37 ± 2.84  71.48 ± 1.46  69.56 ± 2.76  70.50 ± 2.05
FERAtt+Cls      75.15 ± 3.13  77.34 ± 1.40  75.45 ± 2.57  76.38 ± 1.98
FERAtt+Rep+Cls  77.90 ± 2.59  79.58 ± 1.77  78.05 ± 2.34  78.81 ± 2.01
Table 1: Classification results for the synthetic BU-3DFE database applied to seven expressions.
Method          Acc.          Prec.         Rec.          F1
PreActResNet18  75.22 ± 4.60  77.58 ± 3.72  75.49 ± 4.68  76.52 ± 4.19
FERAtt+Cls      80.41 ± 4.33  82.30 ± 2.99  80.79 ± 3.75  81.54 ± 3.38
FERAtt+Rep+Cls  82.11 ± 4.39  83.72 ± 3.09  82.42 ± 4.08  83.06 ± 3.59
Table 2: Classification results for the real BU-3DFE database applied to seven expressions.

We repeated the experiment for the synthetic and real CK+ datasets. Tables 3 and 4 show the mean and standard deviation of the obtained results. The Friedman test found significant differences between the methods on both the synthetic and real CK+ datasets. In this case we applied the Bonferroni-Dunn post-hoc test (one-tailed) to strengthen the power of the hypothesis test. At a significance level of 0.05, the Bonferroni-Dunn post-hoc test did not show significant differences between FERAtt+Cls and the baseline on synthetic CK+. When considering FERAtt+Rep+Cls and the baseline, it did show significant differences on the real CK+ dataset.

Method          Acc.          Prec.         Rec.          F1
PreActResNet18  77.63 ± 2.11  68.42 ± 2.97  68.56 ± 1.91  68.49 ± 2.43
FERAtt+Cls      84.60 ± 0.93  74.94 ± 0.38  76.30 ± 1.19  75.61 ± 0.76
FERAtt+Rep+Cls  85.15 ± 1.07  74.68 ± 1.37  77.45 ± 0.55  76.04 ± 0.97
Table 3: Classification results for the synthetic CK+ database applied to eight expressions.
Method          Acc.          Prec.         Rec.          F1
PreActResNet18  86.67 ± 3.15  81.62 ± 7.76  80.15 ± 9.50  80.87 ± 8.63
FERAtt+Cls      85.42 ± 2.89  75.65 ± 2.77  78.79 ± 2.30  77.18 ± 2.55
FERAtt+Rep+Cls  90.30 ± 1.36  83.64 ± 5.28  84.90 ± 8.52  84.25 ± 6.85
Table 4: Classification results for the real CK+ database applied to eight expressions.

Figure 6 shows the 64-dimensional embedded space, visualized with the Barnes-Hut t-SNE scheme [36], for the Gaussian structured loss on the real CK+ dataset. Errors achieved by the network are mostly due to the neutral class, which is intrinsically similar to the other expressions we analyzed. Surprisingly, we observed intra-class separations along additional attributes, such as race, that were not taken into account when modeling or training the network.

Figure 6: Barnes-Hut t-SNE visualization [36] of the Gaussian Structured loss for the Real CK+ dataset. Each color represents one of the eight emotions including neutral.
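The visualization step behind Figure 6 can be sketched with scikit-learn's Barnes-Hut t-SNE; random vectors stand in for the network's 64-d embeddings, and the sample count and perplexity are illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in embeddings: 80 samples in the 64-d representation space.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(80, 64))

# Barnes-Hut t-SNE projects the embeddings to 2-D for plotting.
coords = TSNE(n_components=2, method="barnes_hut",
              perplexity=10, random_state=0).fit_transform(embeddings)
assert coords.shape == (80, 2)
```

Each 2-D point would then be colored by its expression label to produce a plot like Figure 6.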


Figure 7: Attention maps under increasing noise levels. We progressively added higher levels of zero-mean white Gaussian noise to the same image and tested it with our model. The classification scores show the robustness of the proposed approach, as the Surprise score and all others are maintained throughout all levels, with only a minor change at the highest noise level of 0.30.


Figure 8: Classification accuracy after adding incremental noise on the real CK+ dataset. Our approach results in higher accuracy when compared to the baseline, especially for stronger noise levels. Our representation model clearly improves results, showing its importance for classification. Plotted values are the average results for all 325 images in the database.


Figure 9: Average classification accuracy after adding incremental noise on the Synthetic CK+ dataset. The behavior of our method in the synthetic data replicates what we have found for the original Real CK+ database, i.e., our method is superior to the baseline for all levels of noise. Plotted average values are for 2,000 synthetic images.


4.4 Robustness to noise

The objective of this set of experiments is to demonstrate the robustness of our method to the presence of image noise when compared to the baseline architecture PreActResNet18.

Protocol. To carry out this experiment, the baseline, FERAtt+Cls, and FERAtt+Rep+Cls models were trained on the synthetic CK+ dataset. Each of these models was then fine-tuned with increasing levels of noise in the training set, maintaining the original training parameters. For testing, we used the real CK+ database, and 2,000 images were generated for the synthetic dataset.

Results. One of the advantages of the proposed approach is that we can evaluate the robustness of the method under different noise levels by visually assessing the changes in the attention map. Figure 7 shows the attention maps of an image under increasing levels of zero-mean white Gaussian noise. We observe that our network is quite robust to noise in the range of 0.01 to 0.1 and maintains a distribution of homogeneous intensity values, which benefits the subsequent performance of the classification module. Figures 8 and 9 present the classification accuracy of the evaluated models on the real CK+ dataset and on 2,000 synthetic images. The proposed method FERAtt+Rep+Cls provides the best classification in both cases.

5 Conclusions

In this work, we present a new end-to-end network architecture with an attention model for facial expression recognition. We create a generator of synthetic images, which are used for training our models. The results show that, under these experimental conditions, the attention module improves the system's classification performance. The proposed loss function acts as a regularizer on the embedded space, contributing positively to the system's results. As future work, we will experiment with larger databases, such as [1], which contain images from the real world and are potentially more challenging.


  • [1] E. Barsoum, C. Zhang, C. Canton Ferrer, and Z. Zhang (2016) Training deep networks for facial expression recognition with crowd-sourced label distribution. In ACM International Conference on Multimodal Interaction (ICMI), Cited by: §5.
  • [2] B. Cheung, E. Weiss, and B. Olshausen (2016) Emergence of foveal image sampling from learning to attend in visual scenes. arXiv preprint arXiv:1611.09430. Cited by: §1.
  • [3] P. Dayan, L. Abbott, et al. (2003) Theoretical neuroscience: computational and mathematical modeling of neural systems. Journal of Cognitive Neuroscience 15 (1), pp. 154–155. Cited by: §1.
  • [4] J. Demšar (2006) Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7, pp. 1–30. Cited by: §4.3.
  • [5] S. M. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, K. Kavukcuoglu, and G. E. Hinton (2016) Attend, infer, repeat: fast scene understanding with generative models. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 3225–3233. External Links: Link Cited by: §1.
  • [6] P. D. M. Fernandez, F. A. Guerrero-Peña, T. I. Ren, and J. J. Leandro (2018) Fast and robust multiple colorchecker detection using deep convolutional neural networks. arXiv preprint arXiv:1810.08639. Cited by: §3.4.
  • [7] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D. Lee, et al. (2013) Challenges in representation learning: a report on three machine learning contests. In International Conference on Neural Information Processing, pp. 117–124. Cited by: §2.
  • [8] B. Graham (2014) Fractional max-pooling. arXiv preprint arXiv:1412.6071. Cited by: §2.
  • [9] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. (2016) Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471. Cited by: §1.
  • [10] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra (2015) DRAW: a recurrent neural network for image generation. arXiv preprint arXiv:1502.04623. Cited by: §1.
  • [11] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2980–2988. Cited by: §3.1.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: §3.1.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §3.1, §3.1.
  • [14] Y. Huang and S. M. Khan (2017) DyadGAN: generating facial expressions in dyadic interactions. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pp. 2259–2266. Cited by: §3.4.
  • [15] L. Itti and C. Koch (2001) Computational modelling of visual attention. Nature Reviews Neuroscience 2 (3), pp. 194. Cited by: §1.
  • [16] M. Jaderberg, K. Simonyan, A. Zisserman, and k. kavukcuoglu (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2017–2025. External Links: Link Cited by: §1.
  • [17] S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, C. Gulcehre, R. Memisevic, P. Vincent, A. Courville, Y. Bengio, R. C. Ferrari, M. Mirza, S. Jean, P. Carrier, Y. Dauphin, N. Boulanger-Lewandowski, A. Aggarwal, J. Zumer, P. Lamblin, J. Raymond, G. Desjardins, R. Pascanu, D. Warde-Farley, A. Torabi, A. Sharma, E. Bengio, M. Cote, K. R. Konda, and Z. Wu (2013) Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction, ICMI ’13, New York, NY, USA, pp. 543–550. External Links: ISBN 978-1-4503-2129-7, Link, Document Cited by: §2.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. External Links: Link Cited by: §1.
  • [19] S. Li and W. Deng (2018) Deep facial expression recognition: a survey. arXiv preprint arXiv:1804.08348. Cited by: §1.
  • [20] Y. Li, J. Huang, N. Ahuja, and M. Yang (2016) Deep joint image filtering. In European Conference on Computer Vision, pp. 154–169. Cited by: §3.1.
  • [21] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Vol. 1, pp. 4. Cited by: §3.1.
  • [22] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen (2014) Combining multiple kernel methods on riemannian manifold for emotion recognition in the wild. In Proceedings of the 16th International Conference on Multimodal Interaction, pp. 494–501. Cited by: §2.
  • [23] P. Liu, S. Han, Z. Meng, and Y. Tong (2014) Facial expression recognition via a boosted deep belief network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1805–1812. Cited by: §2.
  • [24] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews (2010) The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pp. 94–101. Cited by: §4.1.
  • [25] V. Mnih, N. Heess, A. Graves, and k. kavukcuoglu (2014) Recurrent models of visual attention. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2204–2212. External Links: Link Cited by: §1.
  • [26] A. Mollahosseini, D. Chan, and M. H. Mahoor (2016) Going deeper in facial expression recognition using deep neural networks. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pp. 1–10. Cited by: §2.
  • [27] J. Pan, S. Liu, D. Sun, J. Zhang, Y. Liu, J. Ren, Z. Li, J. Tang, H. Lu, Y. Tai, et al. (2018) Learning dual convolutional neural networks for low-level vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3070–3079. Cited by: §3.1.
  • [28] R. Ptucha and A. Savakis (2013) Manifold based sparse representation for facial understanding in natural images. Image and Vision Computing 31 (5), pp. 365–378. Cited by: §4.3.
  • [29] O. Ronneberger, P.Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351, pp. 234–241. Note: (available on arXiv:1505.04597 [cs.CV]) External Links: Link Cited by: §3.1.
  • [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015-12-01) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. External Links: ISSN 1573-1405, Document, Link Cited by: §1.
  • [31] A. Shocher, N. Cohen, and M. Irani (2018) "Zero-shot" super-resolution using deep internal learning. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
  • [32] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. External Links: Link, 1409.1556 Cited by: §1.
  • [33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015-06) Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [34] Y. Tang (2013) Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239. Cited by: §2.
  • [35] S. Kastner and L. G. Ungerleider (2000) Mechanisms of visual attention in the human cortex. Annual Review of Neuroscience 23 (1), pp. 315–341. Cited by: §1.
  • [36] L. Van Der Maaten (2014) Accelerating t-sne using tree-based algorithms.. Journal of machine learning research 15 (1), pp. 3221–3245. Cited by: Figure 6, §4.3.
  • [37] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton (2015) Grammar as a foreign language. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2773–2781. External Links: Link Cited by: §1.
  • [38] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato (2006) A 3D facial expression database for facial behavior research. In Automatic face and gesture recognition, 2006. FGR 2006. 7th international conference on, pp. 211–216. Cited by: §4.1.
  • [39] Z. Yu and C. Zhang (2015) Image based static facial expression recognition with multiple deep network learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 435–442. Cited by: §2.
  • [40] Z. Zeng, M. Pantic, G. Roisman, T. S. Huang, et al. (2009) A survey of affect recognition methods: Audio, visual, and spontaneous expressions. Pattern Analysis and Machine Intelligence, IEEE Transactions on 31 (1), pp. 39–58. Cited by: §4.1.
  • [41] S. Zhu, S. Liu, C. C. Loy, and X. Tang (2016) Deep cascaded bi-network for face hallucination. In European Conference on Computer Vision, pp. 614–630. Cited by: §3.1, §3.1.