A Fine-Grained Facial Expression Database for End-to-End Multi-Pose Facial Expression Recognition

07/25/2019 ∙ by Wenxuan Wang, et al. ∙ Ping An Insurance, Ping An Bank, Fudan University

Recent research on facial expression recognition has made substantial progress thanks to the development of deep learning technologies, but some typical challenges, such as the wide variety of facial expressions and poses, remain unresolved. To address them, we develop a new Facial Expression Recognition (FER) framework that incorporates facial poses into both the image synthesis and classification processes. There are two major novelties in this work. First, we create a new facial expression dataset of more than 200k images covering 119 persons, 4 poses and 54 expressions. To our knowledge, this is the first dataset to label faces with subtle emotion changes for expression recognition, and the first large enough to validate the FER task under unbalanced poses, unbalanced expressions, and zero-shot subject IDs. Second, we propose a facial pose generative adversarial network (FaPE-GAN) that synthesizes new facial expression images to augment the training data, and then learn a LightCNN-based Fa-Net model for expression classification. Finally, we advocate four novel learning tasks on this dataset. The experimental results validate the effectiveness of the proposed approach.




1 Introduction

Facial expression [5], as the most important facial attribute, reflects the emotional state of a person and carries meaningful communication information. Facial expression recognition (FER) is widely used in applications such as psychology, medicine, security and education [5]. In psychology, it can be used for depression recognition to analyze psychological distress; in education, detecting a student's concentration or frustration helps improve teaching approaches.

Facial expression recognition mainly contains four steps: face detection, face alignment, feature extraction and facial expression classification. (1) First, faces are detected in the image, each labelled by a bounding box. (2) Second, facial landmarks are generated to align the face. (3) Third, features containing facial information are extracted, either in a hand-crafted way, e.g., SIFT [4], Gabor wavelets [3, 22] and LBP [29], or in a learned way by a neural network. (4) Fourth, various classifiers such as SVM, KNN and MLP can be adopted for facial expression classification.
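As an illustration, the four steps can be sketched end to end. Every function below is a hypothetical stand-in (a toy detector, crop-based alignment, flattened-pixel features, a random linear classifier), not the method of any cited work:

```python
import numpy as np

rng = np.random.default_rng(0)

def detect_faces(image):
    # Step 1: a real system would run a face detector; here one
    # full-image bounding box (x, y, w, h) stands in.
    h, w = image.shape[:2]
    return [(0, 0, w, h)]

def align_face(image, box):
    # Step 2: a real system would warp the crop using predicted
    # landmarks; here we simply crop the bounding box.
    x, y, w, h = box
    return image[y:y + h, x:x + w]

def extract_features(face):
    # Step 3: hand-crafted (SIFT, Gabor, LBP) or learned CNN features;
    # a flattened pixel vector stands in for either.
    return face.astype(np.float32).ravel()

def classify(features, weights):
    # Step 4: any classifier (SVM, KNN, MLP) fits here; a linear
    # scorer over 7 emotion classes stands in.
    return int(np.argmax(weights @ features))

image = rng.random((48, 48))          # toy grayscale face image
weights = rng.random((7, 48 * 48))    # toy classifier parameters
box = detect_faces(image)[0]
label = classify(extract_features(align_face(image, box)), weights)
```

Each stage can be swapped independently, which is why the literature mixes hand-crafted and learned components freely.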

The recent renaissance of deep neural networks delivers human-level performance on several vision tasks, such as object classification, detection and segmentation [19, 18, 27]. Inspired by this, several deep network methods [15, 23, 35] have been proposed to address facial expression recognition. In the FER task, facial expression is usually assumed to contain six discrete primary emotions according to Ekman's theory: anger, disgust, fear, happiness, sadness and surprise. With an additional neutral emotion, these seven emotions form the main part of most common emotion datasets, including CK+ [20, 14], JAFFE [22], FER2013 [26] and FERG [2].

However, one of the most challenging problems in FER is in fact the lack of a large-scale dataset of high-quality images that can be used to train deep networks and to investigate the factors affecting the FER task. Another disadvantage of existing datasets, e.g., JAFFE and FER2013, is their limited diversity of expression emotions, which cannot capture the versatile facial expressions of the real world.

To this end, we create a new dataset, ED (Fine-grained Facial Expression Database), with 54 emotion types, including a larger number of emotions with subtle changes, such as calm, embarrassment, pride and tension. Further, we also consider the influence of face pose changes on expression recognition, and introduce the pose as another attribute of each expression. Four orientations (poses) are labelled, namely front, half left, half right and bird view, each with a balanced number of examples to avoid training bias.

On this dataset, we can further investigate how poses, expressions, and subject IDs affect FER performance. Critically, we propose four novel learning tasks over this dataset, as shown in Fig. 1(c): expression recognition with the standard balanced setting (ER-SS), unbalanced expressions (ER-UE), unbalanced poses (ER-UP), and zero-shot IDs (ER-ZID). Similar to the typical zero-shot learning setting [16], the zero-shot ID setting means that the test persons' faces have not appeared in the training set. To tackle these four learning tasks, we further design a novel framework that augments the training data and then trains the classification network. Extensive experiments on our dataset, as well as on JAFFE [22] and FER2013 [26], show that (1) our dataset is large enough to pre-train a deep backbone network; (2) unbalanced poses, unbalanced expressions and zero-shot IDs indeed negatively affect the FER task; (3) the data augmentation strategy helps learn a more powerful model with better performance. These three points are also the main contributions of this paper.

2 Related Work

2.1 Facial expression recognition

Extensive FER works based on neural networks have been proposed [15, 31, 36]. Khorrami et al. [15] train a CNN for the FER task, visualize the learned features, and find that these features strongly correspond to the FAUs proposed in [6]. An attentional CNN [23] is proposed for FER to focus on the most salient parts of faces by adding a spatial transformer.

Generative Adversarial Net (GAN) [9] based models have also been investigated for the FER task. A GAN is usually composed of a generator and a discriminator. In order to weaken the influence of pose and occlusion, the pose-invariant model [35] generates faces of different poses and expressions based on a GAN. Qian et al. [28] propose a GAN designed specifically for pose normalization in person re-identification (re-id). Yan et al. [34] propose a de-expression model that generates neutral expression images from source images with a conditional GAN (cGAN) [24], and uses the residual information in the intermediate layers of the GAN to classify the expression.

(a) Data processing (b) Expression classes (c) Problem Context
Figure 1: (a) We show the flow of data processing of ED dataset. (b) ED has 54 different facial expression classes, and we organize them into four large classes. (c) ED dataset can be applied to various problem contexts. ER-SS: Expression recognition in the standard setting, ER-UE: Expression recognition with unbalanced expression, ER-UP: Expression recognition with unbalanced poses, ER-ZID: Expression recognition with zero-shot ID.

2.2 Previous Datasets

CK+. The extended Cohn-Kanade (CK+) database [20] is an updated version of the CK database [14]. The CK+ database contains 593 video sequences from 123 subjects, of which 327 are selected according to the FACS-coded emotion labels. The last frame of each selected video is labeled as one of eight emotions: angry, contempt, disgust, fear, happy, sad, surprise and neutral.

JAFFE. The Japanese Female Facial Expression (JAFFE) database [22] contains 213 images of 256×256 pixel resolution. The images are taken from 10 Japanese female models in a controlled environment. Each image is rated with one of the following 6 emotion adjectives: angry, disgust, fear, happy, sad and surprise.

FER2013. The Facial Expression Recognition 2013 database [26] contains 35887 images of 48×48 resolution. These images are taken in the wild, which means more challenging conditions such as occlusion and pose variations are included. They are labelled with one of the seven emotions described above. The dataset is split into 28709 training images, 3589 validation images and 3589 test images.

KDEF. The dataset of Karolinska Directed Emotional Faces [21] contains 4900 images of 562×762 pixel resolution. The images are taken from 140 persons (70 male, 70 female) from 5 angles with 7 emotions. The angles are full left profile, half left profile, front, full right profile and half right profile. The emotion set contains 7 expressions: afraid, angry, disgusted, happy, sad, surprised and neutral.

2.3 Learning paradigms

Zero-shot learning recognizes new visual categories that have not been seen in the labelled training examples [16]. The problem is usually solved by transferring knowledge from a source domain to a target domain; semantic attributes that describe a new object can be utilized in zero-shot learning. Xu et al. [33] propose zero-shot video emotion recognition. In this paper, we propose a novel FER task on persons that are not in the training set. On the other hand, class imbalance is a common problem, especially in deep learning [12, 8]. For the first time, we propose a dataset that is large enough to evaluate the influence of unbalanced poses, expressions, and person IDs on the FER task. To alleviate this issue, we investigate synthesizing more data by GAN-based augmentation, inspired by recent works on person re-id [28] and facial expression recognition [35].

3 Fine-Grained Facial Expression Database

dataset #expression #subject #pose #image #sequence Resolution Pose list Condition
CK+ 8 123 1 327 593 F Controlled
JAFFE 7 10 1 213 - F Controlled
FER2013 7 - - 35887 - - In-the-wild
KDEF 7 140 5 4900 - FL,HL,F,FR,HR Controlled
ED 54 119 4 219719 5418 HL,F,HR,BV Controlled
Table 1: Comparison of ED with existing facial expression databases. In the pose list, F: front, FL: full left, HL: half left, FR: full right, HR: half right, BV: bird view.

To the best of our knowledge, we contribute the largest fine-grained facial expression dataset to the community. Specifically, our ED dataset has the largest number of images (219719 in total) with 119 identities and 54 kinds of fine-grained facial emotions. Each person is captured from four different camera views, as shown in Fig. 3. Furthermore, Tab. 1 compares our dataset against the existing datasets – CK+, JAFFE, FER2013, KDEF – showing that ED is substantially larger in both the number of expression classes and the total number of images.

Figure 2: Image distribution of different expressions.

3.1 The collection of ED

(a) (b)
Figure 3: (a) Cameras used to collect facial expressions. (b) Distributions of subject ID and images over poses.

We create the ED dataset in 3 steps as in Fig. 1(a).

Data Collection. It takes us six months in total to collect the video data. We invite more than 200 candidates who are unfamiliar with our research topics. Each candidate is captured by four cameras placed at four different orientations, as shown in Fig. 3 (a). The four orientations are front, half left, half right and bird view. The half left and half right cameras each have a horizontal angle of 45 degrees to the front of the person, and the bird view camera has a vertical angle of 30 degrees to the front of the person. Each camera records 25 frames per second. The whole video capturing process is designed as a normal conversation between the candidate and two psychological experts. In total, we aim at capturing 54 different types of expressions [17], e.g., acceptance, anger, bravery, calm, disgust, envy, fear, neutral and so on. The conversation follows scripts calibrated by psychologists, and can thus successfully induce one particular type of expression conveyed by the candidate. For each candidate, we save only a 5-minute video segment for each type of emotion.

Data Processing. Given the gathered expression videos, we generate the final image dataset by human review, key-frame extraction and face alignment. The human review step is very important to guarantee the overall quality of the recorded expressions. Three psychologists are invited to review the captured emotion videos, and each video is labeled by all of them. We only keep the videos with consistent labels from the psychologists; in total, videos of 119 identities are preserved. Key frames are then extracted from each remaining video, and face detection and alignment are conducted on each frame with the Dlib and MTCNN [36] toolboxes. The face bounding boxes are cropped from the original images and resized to a fixed resolution. Finally, we obtain the ED dataset of 219719 images in total.

3.2 Statistics and Meta-information of ED

Data Information. There are 4 types of face information in our dataset, including person identity, facial expression, pose and landmarks.

Person Identity. We have 119 persons in total, including 37 males and 82 females aged from 18 to 24. Most of them are university students. Each person expresses his/her emotions under guidance, and the video is taken while the person's emotion is observed.

Facial expression. Our dataset is composed of 54 types of emotions, based on the theory of Lee [17], which expands the emotion set of Plutchik by including more complex mental states based on seven eye features: temporal wrinkles, wrinkles below the eyes, nasal wrinkles, brow slope, brow curve, brow distance and eye aperture. The 54 emotions can be clustered into 4 groups by the k-means clustering algorithm, as shown in Fig. 1(b). We also show the data distribution in Fig. 2.
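As a sketch of this grouping step, the snippet below runs a plain numpy k-means over a stand-in feature matrix; the seven-dimensional features and their random values are hypothetical placeholders for the eye-feature annotations of Lee [17]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrix: one row per emotion, one column per eye
# feature (temporal wrinkles, wrinkles below eyes, nasal wrinkles,
# brow slope, brow curve, brow distance, eye aperture).
features = rng.random((54, 7))

def kmeans(X, k, iters=50):
    # Plain Lloyd's algorithm: assign each point to its nearest
    # centroid, then recompute centroids as cluster means.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

groups = kmeans(features, k=4)   # cluster the 54 emotions into 4 groups
```

With the real annotations in place of the random matrix, `groups` would give the four coarse emotion families of Fig. 1(b).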

(a) Face examples (b) Facial landmark examples
Figure 4: (a) There are some facial examples of ED with different poses and emotions. (b) We give the facial landmark examples as the meta-information of ED.

Pose. As an important type of meta-information, pose often causes facial appearance changes. In real-world applications, facial pose variations are mainly introduced by changes in the relative position and orientation of the camera to the person. In ED, we collect videos from 4 orientations: half left, front, half right and bird view. Fig. 4(a) gives some ED examples of different poses. ED contains 47053 half left, 49152 half right, 74985 front and 48529 bird view images. The distributions of subject IDs and image numbers over poses are compared in Fig. 3 (b).

Facial Landmarks. Facial landmarks define the contours of the facial components, including the eyes, nose, mouth and cheeks. First, we extract the 68-point facial landmarks into position annotation text files with Dlib. Then we convert each landmark position file into an image in mask style. Example landmark images are shown in Fig. 4(b).
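The text-to-mask conversion can be sketched as below; the 128×128 output size, the dot radius, and the three sample points are arbitrary choices for illustration, not the dataset's actual settings:

```python
import numpy as np

def landmarks_to_mask(landmarks, size=128, radius=1):
    # Rasterize a point set (e.g. the 68 Dlib landmarks) into a
    # binary mask image: each landmark becomes a small white square.
    mask = np.zeros((size, size), dtype=np.uint8)
    for x, y in landmarks:
        x0, x1 = max(x - radius, 0), min(x + radius + 1, size)
        y0, y1 = max(y - radius, 0), min(y + radius + 1, size)
        mask[y0:y1, x0:x1] = 255
    return mask

pts = [(20, 30), (64, 64), (100, 90)]  # stand-ins for 68 Dlib points
mask = landmarks_to_mask(pts)
```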

Tab. 1 compares our ED with existing facial expression databases. As shown in the table, our dataset contains 54 subtle expression types, while the other datasets contain only 7 or 8. In terms of the number of persons, CK+, KDEF and ED are nearly the same. Current public facial expression datasets are usually collected either in the wild or in a controlled environment. FER2013 is collected in the wild, so its number of poses cannot be determined. The remaining datasets are collected in controlled environments, where the number of poses is 1 for CK+ and JAFFE, 5 for KDEF and 4 for ED. ED is the only dataset that contains bird view images, which is very useful in real-world scenarios. In terms of image count, ED contains 219719 images, about 6 times more than the second largest dataset. All datasets have a similar resolution except FER2013, which has only a 48×48 resolution. CK+ and ED are generated from 593 and 5418 video sequences, respectively.

4 Learning on ED

4.1 Learning tasks

In ED, we consider expression learning over different types of variations, as shown in Fig. 1(c), and further study the influence of different poses and subjects on FER. To the best of our knowledge, this is the first exploration of this type of tasks. In particular, we are interested in the following tasks on this dataset.

Expression recognition in the standard setting (ER-SS). The first and most important task is to directly learn supervised classifiers on ED. As shown in Fig. 3(b) and Fig. 2, our dataset has a balanced number of poses and emotion classes. We randomly shuffle the dataset and split it into 175000, 19719 and 25000 images for the train, validation and test sets, respectively. Classifiers are trained and validated on the train and validation sets, and evaluated on the test set.

Expression recognition with unbalanced expression distribution (ER-UE). We further compare the results of learning classifiers with unbalanced facial expressions. In real-world scenarios some facial expressions are rare, e.g., cowardice, so it is imperative to investigate FER in such an unbalanced expression setting. Specifically, we take 20% of all facial expression classes as the rare classes. For these rare classes, 90% of the images are kept as testing instances and the remaining 10% are used for training. The other 80% of classes are treated as normal emotion classes, and all of their images are used for training. In total, we have 178989 and 140730 images for the train and test sets, respectively. In terms of expression types, all 54 types appear in the train set, while the 11 rare types form the test set, so the rare testing classes occur far less frequently in training than the normal classes. In our setting, we assume the model works with the prior knowledge that there are 54 rather than 11 expression classes at test time, which keeps the chance level of the ER-UE task at 1/54 ≈ 1.9%.
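The ER-UE split can be sketched on toy data as follows; the random choice of rare classes and the 100-images-per-class toy set are illustrative, not the actual split of ED:

```python
import random

random.seed(0)
classes = list(range(54))
rare = set(random.sample(classes, k=round(0.2 * 54)))   # 11 rare classes

def split_er_ue(samples):
    # samples: list of (image_id, expression_label) pairs.
    # Rare classes keep only 10% of their images for training; the
    # remaining 90% form the test set. Normal classes train in full.
    train, test = [], []
    by_class = {}
    for s in samples:
        by_class.setdefault(s[1], []).append(s)
    for label, items in by_class.items():
        if label in rare:
            cut = max(1, round(0.1 * len(items)))
            train.extend(items[:cut])
            test.extend(items[cut:])
        else:
            train.extend(items)
    return train, test

samples = [(i, i % 54) for i in range(5400)]  # toy data: 100 per class
train, test = split_er_ue(samples)
```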

Expression recognition with unbalanced poses (ER-UP). The learning task is further conducted with unbalanced poses. In this setting, we assume that the half left pose is rare in the train set: only 10% of the half left pose images are used for training, and the remaining 90% form the test set. The other three pose types – half right, front and bird view – are entirely used for training. We thus obtain 177372 training images and 42347 testing images. In terms of pose types, there are 4 poses in the train set and 1 pose in the test set. This task aims to predict expressions for poses that are rare in the training set.

Expression recognition with zero-shot ID (ER-ZID). We aim at recognizing the expressions of persons that have not been seen before. Particularly, we randomly pick the images of 98 persons as the train set and those of the remaining 21 persons as the test set, resulting in 189306 training and 30413 testing images. The task is to recognize expressions with zero-shot IDs, i.e., disjoint subject IDs in the train and test sets. This enables us to verify whether the model can learn person-invariant features for emotion classification.
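A subject-disjoint split of this kind can be sketched as follows; the toy sample layout (100 images per subject) is illustrative only:

```python
import random

random.seed(0)

def split_er_zid(samples, n_test_ids=21):
    # samples: (image_id, subject_id, label) triples. Subjects are
    # split disjointly: 21 held-out identities form the test set,
    # all remaining identities form the train set, as in ER-ZID.
    ids = sorted({s[1] for s in samples})
    test_ids = set(random.sample(ids, n_test_ids))
    train = [s for s in samples if s[1] not in test_ids]
    test = [s for s in samples if s[1] in test_ids]
    return train, test

samples = [(i, i % 119, i % 54) for i in range(11900)]  # 100 per subject
train, test = split_er_zid(samples)
```

Because the identity sets are disjoint, any test-time accuracy above chance must come from person-invariant expression cues.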

4.2 Learning methods

Figure 5: Overview of our framework, comprising the FaPE-GAN and Fa-Net components. FaPE-GAN synthesizes face images from an input image and a target pose. Fa-Net is the classification network, trained on the augmented and the original face images, and can be applied in the supervised, unbalanced and zero-shot learning settings.

We propose an end-to-end framework to address the four learning tasks, shown in Fig. 5. To tackle the problem of learning from unbalanced numbers of images, our key idea is to employ GAN-based data augmentation to produce a balanced training set. Our framework has two components: the Facial Pose GAN (FaPE-GAN) and the Face classification Network (Fa-Net). The former is an image synthesis network, and the latter is a classification network.

FaPE-GAN. It is trained on a combination of the training images and synthesized face images of new poses. The facial poses are represented by a landmark set. As shown in Fig. 5, the network takes a face image $x$ and a pose image $p$ as input; the generator $G$ produces a fake image $G(x, p)$ of the same person under pose $p$, and the discriminator $D$ tries to differentiate the fake target image from the real target image. Although the pose is changed in $G(x, p)$, FaPE-GAN still aims to keep the face identity of $x$. Critically, we introduce the adversarial loss as

$$\mathcal{L}_{adv}(G, D) = \mathbb{E}_{y \sim p_{data}(y)}\left[\log D(y)\right] + \mathbb{E}_{x, p}\left[\log\left(1 - D(G(x, p))\right)\right],$$

where $p_{data}(y)$ is the distribution of real images $y$. The training process iteratively updates the parameters of the generator $G$ and the discriminator $D$. The generator loss can be formulated as

$$\mathcal{L}_{G} = \mathbb{E}_{x, p}\left[\log\left(1 - D(G(x, p))\right)\right] + \lambda\, \mathbb{E}_{x, p, y}\left[\lVert y - G(x, p)\rVert_{1}\right],$$

where $y$ is the real target image and $G(x, p)$ is the reconstructed image given the input image $x$ and facial pose $p$ [24]. The hyperparameter $\lambda$ is used to balance the two terms. The discriminator loss is formulated as

$$\mathcal{L}_{D} = -\mathbb{E}_{y \sim p_{data}(y)}\left[\log D(y)\right] - \mathbb{E}_{x, p}\left[\log\left(1 - D(G(x, p))\right)\right].$$

The training process iteratively optimizes the loss functions $\mathcal{L}_{G}$ and $\mathcal{L}_{D}$. Fig. 6 shows two examples generated by FaPE-GAN.
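A minimal numpy sketch of these loss terms, assuming the discriminator outputs sigmoid probabilities; the helper names are ours, and λ = 10 matches the value used in Sec. 5:

```python
import numpy as np

LAMBDA = 10.0  # weight balancing adversarial and reconstruction terms

def bce(pred, target):
    # Binary cross-entropy on discriminator probabilities.
    pred = np.clip(pred, 1e-7, 1 - 1e-7)
    return float(-(target * np.log(pred)
                   + (1 - target) * np.log(1 - pred)).mean())

def generator_loss(d_fake, fake_img, real_img):
    # Fool the discriminator + L1 reconstruction toward the real target.
    adv = bce(d_fake, np.ones_like(d_fake))
    l1 = float(np.abs(real_img - fake_img).mean())
    return adv + LAMBDA * l1

def discriminator_loss(d_real, d_fake):
    # Real images scored as 1, synthesized images scored as 0.
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))
```

Alternating minimization of these two losses is the standard cGAN training loop the text describes.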

Figure 6: GAN output examples

Fa-Net. The same classification network is utilized for all four learning tasks in Sec. 4.1. Its backbone network is LightCNN [32]. The FaPE-GAN can synthesize plenty of additional face images, alleviating the issue of unbalanced training images; the augmented faces and the original input faces are then used to train our classification network.

5 Experiments

Extensive experiments are conducted on ED to evaluate the learning tasks defined in Sec. 4.1. Furthermore, facial emotion recognition is also evaluated on the FER2013 and JAFFE datasets.

Implementation details. The hyperparameter λ is set to 10, and the Adam optimizer is used to learn the FaPE-GAN, with β₁ and β₂ set to 0.5 and 0.999, respectively. The number of training epochs is set to 100. For the facial expression classification network, we use the SGD optimizer with a momentum of 0.9 and decrease the learning rate by a factor of 0.457 every 10 steps. The maximum number of epochs is set to 100. The learning rate and batch size vary with the dataset size: we use batch sizes of 128, 64 and 32 on ED, FER2013 and JAFFE, respectively, with a learning rate of 0.01 on ED.
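The classifier's step decay can be written as a one-liner; `classifier_lr` is a hypothetical helper name, with the ED values (base rate 0.01, factor 0.457 every 10 epochs) as defaults:

```python
def classifier_lr(epoch, base_lr=0.01, decay=0.457, step=10):
    # Step decay described above: multiply the learning rate by
    # `decay` once every `step` epochs.
    return base_lr * decay ** (epoch // step)
```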

5.1 Results on FER2013 dataset


Settings. Following the ER-SS setting, we conduct experiments on FER2013 by using all 28709 training images and 3589 validation images to train/validate our model, which is then tested on the remaining 3589 test images. FER classification accuracy is reported as the evaluation metric for comparing the competitors.

Competitors. Our model is compared against several competitors, including Bag of Words [13], VGG+SVM [7], GoogleNet [8], Mollahosseini et al. [25], DNNRL [10] and Attention CNN [23]. Both classifiers based on hand-crafted features and architectures specially designed for FER are investigated here; these methods achieve state-of-the-art results on this dataset.

Results on FER2013. To show the efficacy of our dataset, our classification network – Fa-Net – is pre-trained on ED and then fine-tuned on the training set of FER2013. Our model achieves an accuracy of 71.1%, which is superior to the other state-of-the-art methods compared in Tab. 2. Tab. 4 shows that the Fa-Net pre-trained on ED improves expression recognition performance by 8.8% compared to the one without pre-training. The confusion matrices in Fig. 7 show that pre-training increases the scores on all expression types. This demonstrates that the ED dataset, with large expression variations from many persons, can pre-train a deep network with good initialization parameters. Note that Fa-Net is not specially designed for the FER task, since it is built upon LightCNN, a typical face recognition architecture.

Model Acc.
Bag of Words [13] 67.4%
VGG+SVM [7] 66.3%
GoogleNet [8] 65.2%
Mollahosseini et al [25] 66.4%
DNNRL [10] 70.6%
Attention CNN [23] 70.0%
Fa-Net 71.1%
Table 2:

Accuracy on FER2013 test set in supervised learning setting

(a) (b)
Figure 7: (a) The confusion matrix on FER2013 for Fa-Net without pre-training. (b) The confusion matrix on FER2013 for Fa-Net pre-trained on ED.

5.2 Results on JAFFE dataset

Settings. For the ER-SS setting, we follow the data split of the deep-emotion paper [23]: 120 images for training, 23 images for validation, and 70 images in total for testing (7 emotions per face ID).

Competitors. Our model is compared against several competitors, including Fisherface[1], Salient Facial Patch [11], CNN+SVM[30] and Attention CNN [23]. These methods are tailored for the tasks of FER.

As listed in Tab. 3, our model achieves an accuracy of 95.7%, outperforming all the other competitors. Remarkably, it surpasses the Attention CNN by 2.9% under the same data split. The accuracy of CNN+SVM is only 0.4% lower than ours, even though their model is trained and tested on the entire dataset. This shows the efficacy of our dataset for pre-training the network. Tab. 4 further shows that Fa-Net pre-trained on ED clearly improves performance, by 12.8%. The confusion matrix in Fig. 8 shows that the pre-trained Fa-Net makes only 3 wrong predictions and surpasses the model without pre-training on all expression types.

Model Acc.
Fisherface[1] 89.2%
Salient Facial Patch[11] 92.6%
CNN+SVM[30] 95.3%
Attention CNN[23] 92.8%
Fa-Net 95.7%
Table 3: Accuracy on JAFFE test set in supervised learning setting.
(a) (b)
Figure 8: (a) The confusion matrix on JAFFE for Fa-Net without pre-training. (b) The confusion matrix on JAFFE for Fa-Net pre-trained on ED.

5.3 Results on ED

Results on our dataset. We conduct the four different learning tasks on our dataset, namely supervised (ER-SS), unbalanced expression (ER-UE), unbalanced pose (ER-UP) and zero-shot ID (ER-ZID), using the data splits described in Sec. 4.1. Note that since our Fa-Net is built upon the general face recognition backbone – LightCNN – it can serve as the main network in all our experiments.

ER-SS task. Our model achieves an accuracy of 73.6%, as shown in Tab. 5, which indicates that ED is well annotated and can support the classification task. Considering the large scale and fine granularity of the dataset, this performance also demonstrates that LightCNN is a strong backbone for this task. Using FaPE-GAN for data augmentation further improves the performance by 0.9% compared to Fa-Net without the GAN, which means the GAN helps generate more diversified training examples.

Dataset Pre-trained Acc.
FER2013 ✗ 62.3%
FER2013 ✓ 71.1%
JAFFE ✗ 82.9%
JAFFE ✓ 95.7%
Table 4: Results of the Fa-Net model with and without pre-training on our ED.

ER-UE task. The accuracy of direct classification is 30.8%, as shown in Tab. 5. This shows that the proposed ER-UE task is very difficult, as the FER task greatly suffers from unbalanced emotion data. In our setting, only 10% of the examples from the 11 rare facial expression types appear in the training set, and the classifiers are thus dominated by the other 43 emotion classes during training. Furthermore, the data augmentation provided by our FaPE-GAN indeed helps improve FER performance: the accuracy is improved by 3.5%, which is larger than the 0.9% improvement in the supervised setting. This indicates that data augmentation is more effective under data-sparse conditions such as unbalanced learning.

ER-UP task. On this task, our Fa-Net achieves an accuracy of 39.9%, as shown in Tab. 5. Again, the proposed ER-UP is a very hard task, since this accuracy is only slightly better than that of ER-UE; unbalanced pose data also negatively affects the FER task. Note that the 54 expression types are far more diversified than the 4 poses. Our data augmentation still works in this setting: the synthesized data helps train Fa-Net and alleviates the problem of unbalanced poses, improving its performance by 3.6%.

ER-ZID task. Surprisingly, this is the most challenging of the four learning tasks. As shown in Tab. 5, our model achieves an accuracy of only 7.1%, while the chance level is 1.9% (1/54, as described before), indicating that the zero-shot task is much more difficult than the unbalanced tasks. This shows that the generalization ability of FER models is limited for persons the model has never seen before. Yet this is the most desirable property of an FER model, since one cannot assume that the faces of test persons always appear in the training set. In our ER-ZID task, the 21 persons in the test set are never seen during training. Interestingly, our FaPE-GAN based data augmentation still contributes a 0.4% performance improvement over the baseline, suggesting that data augmentation remains a potentially useful strategy to facilitate the training of the classification network.

Overall, our classification model with FaPE-GAN based data augmentation clearly surpasses the one without FaPE-GAN on all four task types.

model/acc ER-SS ER-UE ER-UP ER-ZID
Fa-Net 72.7 27.3 36.3 6.7
FaPE-GAN+Fa-Net 73.6 30.8 39.9 7.1
Table 5: Accuracy on ED for Fa-Net with and without data augmentation in the supervised (ER-SS), unbalanced expression (ER-UE), unbalanced pose (ER-UP) and zero-shot ID (ER-ZID) settings.

6 Conclusion

In this work, we introduce ED, a new facial expression database containing 54 different emotion types and more than 200k examples. Furthermore, we propose an end-to-end deep-neural-network-based facial expression recognition framework that uses a facial pose generative adversarial network to augment the training data. We perform supervised, unbalanced and zero-shot learning tasks on ED, and the results show that our model achieves the state of the art. We also fine-tune our ED-pre-trained model on the existing FER2013 and JAFFE databases, and the results demonstrate the efficacy of the ED dataset.


  • [1] Z. Abidin and A. Harjoko. A neural network based facial expression recognition using fisherface. International Journal of Computer Applications, 59(3), 2012.
  • [2] D. Aneja, A. Colburn, G. Faigin, L. Shapiro, and B. Mones. Modeling stylized character expressions via deep learning. In Asian Conference on Computer Vision, pages 136–153. Springer, 2016.
  • [3] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan. Recognizing facial expression: machine learning and application to spontaneous behavior. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 568–573. IEEE, 2005.
  • [4] S. Berretti, B. B. Amor, M. Daoudi, and A. Del Bimbo. 3d facial expression recognition using sift descriptors of automatically detected keypoints. The Visual Computer, 27(11):1021, 2011.
  • [5] C. A. Corneanu, M. O. Simón, J. F. Cohn, and S. E. Guerrero. Survey on rgb, 3d, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications. IEEE transactions on pattern analysis and machine intelligence, 38(8):1548–1568, 2016.
  • [6] R. Ekman. What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, USA, 1997.
  • [7] M.-I. Georgescu, R. T. Ionescu, and M. Popescu. Local learning with deep and handcrafted features for facial expression recognition. arXiv preprint arXiv:1804.10892, 2018.
  • [8] P. Giannopoulos, I. Perikos, and I. Hatzilygeroudis. Deep learning approaches for facial emotion recognition: A case study on fer-2013. In Advances in Hybridization of Intelligent Methods, pages 1–16. Springer, 2018.
  • [9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [10] Y. Guo, D. Tao, J. Yu, H. Xiong, Y. Li, and D. Tao. Deep neural networks with relativity learning for facial expression recognition. In 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pages 1–6. IEEE, 2016.
  • [11] S. Happy and A. Routray. Automatic facial expression recognition using features of salient facial patches. IEEE Transactions on Affective Computing, 6(1):1–12, 2015.
  • [12] C. Huang, Y. Li, C. C. Loy, and X. Tang. Learning deep representation for imbalanced classification. In CVPR, 2016.
  • [13] R. T. Ionescu, M. Popescu, and C. Grozea. Local learning to improve bag of visual words model for facial expression recognition. In Workshop on challenges in representation learning, ICML, 2013.
  • [14] T. Kanade, Y. Tian, and J. F. Cohn. Comprehensive database for facial expression analysis. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG), pages 46–53. IEEE, 2000.
  • [15] P. Khorrami, T. Paine, and T. Huang. Do deep neural networks learn facial action units when doing expression recognition? In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 19–27, 2015.
  • [16] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.
  • [17] D. H. Lee and A. K. Anderson. Reading what the mind thinks from how the eye sees. Psychological Science, 28(4):494, 2017.
  • [18] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang. Hydraplus-net: Attentive deep features for pedestrian analysis. In ICCV, 2017.
  • [19] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
  • [20] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pages 94–101. IEEE, 2010.
  • [21] D. Lundqvist, A. Flykt, and A. Öhman. The Karolinska Directed Emotional Faces (KDEF). CD ROM from Department of Clinical Neuroscience, Psychology section, Karolinska Institutet, 91:630, 1998.
  • [22] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba. Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, pages 200–205. IEEE, 1998.
  • [23] S. Minaee and A. Abdolrashidi. Deep-emotion: Facial expression recognition using attentional convolutional network. arXiv preprint arXiv:1902.01019, 2019.
  • [24] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [25] A. Mollahosseini, D. Chan, and M. H. Mahoor. Going deeper in facial expression recognition using deep neural networks. In 2016 IEEE winter conference on applications of computer vision (WACV), pages 1–10. IEEE, 2016.
  • [26] P.-L. Carrier and A. Courville. Challenges in representation learning: Facial expression recognition challenge, 2013.
  • [27] X. Qian, Y. Fu, Y.-G. Jiang, T. Xiang, and X. Xue. Multi-scale deep learning architectures for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 5399–5408, 2017.
  • [28] X. Qian, Y. Fu, T. Xiang, W. Wang, J. Qiu, Y. Wu, Y.-G. Jiang, and X. Xue. Pose-normalized image generation for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 650–667, 2018.
  • [29] C. Shan, S. Gong, and P. W. McOwan. Facial expression recognition based on local binary patterns: A comprehensive study. Image and Vision Computing, 27(6):803–816, 2009.
  • [30] Y. Shima and Y. Omori. Image augmentation for classifying facial expression images by using deep neural network pre-trained with object image database. In Proceedings of the 3rd International Conference on Robotics, Control and Automation, pages 140–146. ACM, 2018.
  • [31] Z. Wang, K. He, Y. Fu, R. Feng, Y.-G. Jiang, and X. Xue. Multi-task deep neural network for joint face recognition and facial attribute prediction. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pages 365–374. ACM, 2017.
  • [32] X. Wu, R. He, Z. Sun, and T. Tan. A light CNN for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
  • [33] B. Xu, Y. Fu, Y.-G. Jiang, B. Li, and L. Sigal. Heterogeneous knowledge transfer in video emotion recognition, attribution and summarization. IEEE Transactions on Affective Computing, 9(2):255–270, 2018.
  • [34] H. Yang, U. Ciftci, and L. Yin. Facial expression recognition by de-expression residue learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2168–2177, 2018.
  • [35] F. Zhang, T. Zhang, Q. Mao, and C. Xu. Joint pose and expression modeling for facial expression recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3359–3368, 2018.
  • [36] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.