Cross-database non-frontal facial expression recognition based on transductive deep transfer learning

11/30/2018 ∙ by Keyu Yan, et al. ∙ 2

Cross-database non-frontal expression recognition is a meaningful but difficult problem in computer vision and affective computing. In this paper, we propose a novel transductive deep transfer learning architecture based on the widely used VGGface16-Net for this problem. In this framework, the VGGface16-Net is used to jointly learn common optimal nonlinear discriminative features from the non-frontal facial expression samples of the source and target databases, and we then design a novel transductive transfer layer to handle the cross-database non-frontal facial expression classification task. To validate the performance of the proposed transductive deep transfer learning network, we present extensive cross-database experiments on two well-known publicly available facial expression databases, namely the BU-3DEF and Multi-PIE databases. The experimental results show that our transductive deep transfer network outperforms state-of-the-art cross-database facial expression recognition methods.




I Introduction

Recently, artificial intelligence (AI) technology has made explosive progress in many practical applications such as driverless cars, human-computer interaction, school education, and intelligent transportation. However, the current success of AI technology rests largely on large amounts of labeled data. In fact, many real scenarios cannot be expressed by data alone, including creativity-driven learning, knowledge systems, and learning to learn. These sophisticated problems require machines to understand human moods and emotions. Therefore, how to make machines interpret human emotion is likely to become a much-discussed topic in the AI and machine learning community. Humans express emotions in a variety of ways, including language, facial expression, gesture, and words, among which facial expression is the most important channel for conveying emotional information between people. Consequently, many AI researchers and computer engineers have paid great attention to the facial expression recognition (FER) problem to help machines perceive human emotion, and have made notable progress [1][2][3][4][5][6][7][8]. To study facial expressions more conveniently and systematically, Ekman [1] identified six basic expressions shared across cultures and defined a standard for facial expression research named the Facial Action Coding System (FACS). Zhi et al. [2] proposed a graph-preserving sparse non-negative matrix factorization (GSNMF) method for the facial expression recognition problem. The GSNMF algorithm obtains a better representation by transforming high-dimensional facial expression images into a locality-preserving subspace with sparse representation, and achieves higher recognition rates than NMF. In [5], Nguyen et al. proposed a multimodal approach to recognizing dynamic facial expressions that combines 3-dimensional convolutional neural networks (C3Ds), which extract spatio-temporal features, with deep belief networks (DBNs), which represent the audio and video streams of dynamic expression databases.

In practical FER applications, the same expression is often captured from different viewpoints, generating multi-view heterogeneous samples whose statistics differ greatly. However, most traditional FER methods are based on frontal facial expression samples, and non-frontal facial expression data are used only to a small extent. Nevertheless, several novel methods have been applied to the non-frontal FER problem and have made great progress in recent years [9][10][11][12][13][14][15]. In [11], Tang et al. built an ergodic hidden Markov model (EHMM) to obtain a supervector representation of non-frontal facial expression images and achieved promising results. Kumano et al. proposed a method that uses a variable-intensity template model to describe pose-invariant facial expressions from monocular video sequences; it estimates facial pose and expression simultaneously by using a particle filter. To learn suitable features for classifying facial expressions from different viewpoints, Zhang et al. [12] proposed a feature-based deep neural network learning method that uses multiple network layers to describe the relationship between non-frontal facial features and their corresponding high-level semantic information.

Although FER technology has achieved great success, most FER methods are developed under the assumption that the training and testing samples follow the same probability distribution. In many practical scenarios this assumption does not hold, because the training and testing data may come from two different databases acquired under different environments or with different equipment. This leads to a challenging problem, namely the cross-database FER problem. To cope with this challenge, many effective approaches, including subspace-based methods and deep learning models, have been proposed in recent years [16][17][18][19][20][21][22][23]. To alleviate the distribution discrepancy between training and testing facial expression images, in our preliminary work [19], Zheng et al. proposed a transductive transfer subspace learning method that jointly learns a discriminative subspace and predicts the label values of unlabelled facial expression images by using all labelled training samples from the source domain and an unlabelled auxiliary set of testing samples from the target domain. Duan et al. [23] proposed a cross-domain kernel learning framework named Domain Transfer Multiple Kernel Learning (DTMKL) to deal with the wide divergence between the feature distributions of different databases. In [20], Wei et al. proposed a deep nonlinear feature coding framework for the unsupervised cross-domain FER problem, which incorporates domain divergence minimization via Maximum Mean Discrepancy (MMD) and kernelized coding into a marginalized stacked denoising auto-encoder to extract effective deep features. Zavarez et al. [21] fine-tuned a deep convolutional network for the cross-database video-based FER problem on several well-established facial expression databases. However, these cross-database FER methods are typically evaluated on frontal and near-frontal facial samples, or on samples with a single viewpoint per domain. In many real application scenarios, FER must handle not only cross-database samples but also a large number of non-frontal facial expression data. Facial expression images taken from different viewpoints follow different distributions even within the same database, which makes it more difficult to recognize the expression categories. Furthermore, when non-frontal facial expression data are used in the common cross-database FER task, the result is a cross-database non-frontal FER problem, which is a largely unexplored research field. This is a very difficult subject, because researchers need to deal not only with the distribution difference between databases but also with the distribution discrepancy of the same expression within a database, which makes learning discriminative facial expression features considerably more challenging.

In this paper, we address this difficult and challenging problem, that is, the cross-database non-frontal FER problem. To this end, we extend our preliminary work in [19] from a linear to a nonlinear method by means of a deep neural network, and propose a transductive deep transfer learning (TDTL) framework in which a novel transfer network layer is introduced. Considering the strong performance of VGGface16-Net in human facial feature representation, we first use the VGGface16-Net to learn a multi-view facial expression feature representation from the raw non-frontal facial expression images. After the VGGface16-Net, we design a novel transfer layer architecture for the cross-database non-frontal facial expression classification task. In this task, the loss function and the network parameters are jointly optimized to obtain the predicted label values of the target database samples. The main contributions of this paper are summarized as follows:

  1. Unlike traditional FER methods that rely on hand-crafted features, we use VGGface16-Net end to end to learn a common feature representation between the source and target databases. In the optimization phase of the TDTL network, the source and target databases are mapped to a common nonlinear feature representation space, and both the distribution discrepancy among non-frontal viewpoint features within the same database and the distribution difference between databases are reduced as much as possible.

  2. We design a novel transductive transfer layer based on a deep learning architecture to adaptively deal with the cross-database non-frontal FER problem. We randomly initialize the labels of the target database; these label values are then learned within the network from the labelled source database and the unlabelled target database, finally yielding the predicted label values of the target database.

  3. Unlike traditional subspace learning methods, in the TDTL model the training and testing samples are divided into batches that are jointly optimized in the network, so as to better predict the label values of the target samples.

The remainder of the paper is structured as follows: Section II presents the transductive deep transfer learning method and shows how it is applied to the cross-database non-frontal FER problem. Section III describes the experiments in detail and discusses the results. Finally, we conclude the paper in the last section.

II Proposed Method

Fig. 1: The Training Flow Chart of The Proposed Deep Transductive Transfer Learning Networks

In this section, we introduce in detail our deep transductive transfer learning framework based on VGGface16-Net for the non-frontal FER problem. Figure 1 shows the network structure of the proposed transductive deep transfer learning model, which consists of two parts: a non-frontal facial expression feature learning part based on VGGface16-Net, and a transductive transfer learning layer.

II-A Notations

To facilitate the discussion, we first introduce the notation used throughout the paper. We denote by $X_s = \{x_i^s\}_{i=1}^{N_s}$ the set of source-domain samples from a non-frontal facial expression database, and by $L_s = \{l_i^s\}_{i=1}^{N_s}$ the set of true class label vectors corresponding to $X_s$, in which $x_i^s$ is the $i$-th raw facial expression image and $N_s$ is the number of source database samples. Moreover, let $X_t = \{x_i^t\}_{i=1}^{N_t}$ be the set of target instances from the other non-frontal facial expression database, in which $x_i^t$ is the $i$-th image sample of the target database and $N_t$ is the number of target database samples. We define $L_t = \{l_i^t\}_{i=1}^{N_t}$ as a network parameter: it is the unknown set of class label vectors corresponding to $X_t$ and is optimized as the network is updated. In this model, each class label vector in $L_s$ and $L_t$ is a $c \times 1$ vector, in which $c$ is the number of facial expression classes, and every element of a label vector in $L_s$ takes the value 1 or 0. Let $l_i$ be the label vector of the $i$-th sample; its $j$-th element $l_{i,j}$ is assigned according to the following rule:

$$l_{i,j} = \begin{cases} 1, & \text{if the } i\text{-th sample belongs to the } j\text{-th class}, \\ 0, & \text{otherwise}. \end{cases}$$

Compared with traditional subspace learning methods, the advantage of deep neural networks is their capacity for nonlinear representation. In general, the whole neural network can be regarded as a nonlinear mapping function, which we denote by $f(\cdot)$.
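The 0/1 label vectors described above can be illustrated with a short Python sketch. The helper name `one_hot` and the class ordering in `CLASSES` are assumptions made for illustration only; the paper does not fix an ordering of the expression classes.

```python
def one_hot(class_index, num_classes):
    """Return a 0/1 label vector with a single 1 at position class_index."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index out of range")
    return [1 if j == class_index else 0 for j in range(num_classes)]


# Hypothetical ordering of the four expression classes used in the experiments.
CLASSES = ["DI", "SM/HA", "SU", "NE"]

# Label vector of a sample showing the surprise (SU) expression.
surprise = one_hot(CLASSES.index("SU"), len(CLASSES))
print(surprise)
```

For a labelled source sample exactly one element is 1; the label vectors of target samples start from a random initialization and are learned by the network.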

II-B Transductive Deep Transfer Learning Model

Transductive transfer learning is a challenging topic because the target database samples carry no label information. In this paper, we follow the idea of the transductive transfer subspace learning model to address the cross-database non-frontal FER problem. Unlike the subspace learning approaches in [19][24], to learn more discriminative expression information between the source and target domains, we train a deep neural network to reduce the differences in feature distribution between the source and target samples as well as the distribution discrepancy across viewpoints within each database. For this purpose, we design a transductive deep transfer learning (TDTL) framework which consists of one feature learning section and one transductive transfer learning layer. In the first section, we adopt the widely used VGGface16-Net to process the raw non-frontal facial expression images. This choice is based on two reasons: first, the VGGface16-Net can effectively extract facial expression features and acquire useful adaptive transfer knowledge; second, the performance of VGGface16-Net in transfer learning classification tasks is better than that of other state-of-the-art deep neural networks such as AlexNet [25] and GoogLeNet [26]. The VGGface16-Net is a deep neural network with five stacks of convolutional layers plus three fully-connected layers, for a total of 16 layers. This deep structure helps us acquire sophisticated cross-database non-frontal facial expression features effectively. In the feature learning section of the TDTL model, we retain the five ConvNet stacks and the first two fully-connected layers of the VGGface16-Net to extract facial expression features. The five stacks consist of thirteen convolutional layers in total; each stack is followed by one max pooling layer, and each convolutional layer is followed by one activation function, the rectified linear unit (ReLU):

$$\mathrm{ReLU}(x) = \max(0, x).$$

It is worth mentioning that the convolutional layers in the VGGface16-Net use small $3 \times 3$ convolution kernels, unlike many other state-of-the-art deep neural network models whose convolutional layers use larger kernels. Stacking more small convolution kernels amounts to applying more nonlinear mappings, which increases the representation ability of the network and extracts more discriminative non-frontal facial expression features. In addition, smaller convolution kernels significantly decrease the number of network parameters and improve the computational efficiency of the network. Following the five ConvNet stacks, two fully-connected layers transform the nonlinear low-level description features of the raw non-frontal facial expression images into high-level semantic information. In the second section of the TDTL model, we design a novel transductive transfer learning layer, which includes one fully-connected layer and one softmax layer, to further learn higher-level semantic information for the cross-database non-frontal facial expression classification task. The fully-connected layer uses the hyperbolic tangent function

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

as its nonlinear activation function. The softmax layer performs the final classification.
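As a minimal sketch of the transductive transfer layer described above (one fully-connected layer with a tanh activation followed by a softmax layer), the following plain Python code maps a feature vector to class probabilities. The function names, the toy feature dimension of 8 (the paper uses 4096-dimensional features), and the Gaussian standard deviation of 0.01 are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def fully_connected(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def tanh_layer(z):
    """Element-wise hyperbolic tangent activation."""
    return [math.tanh(v) for v in z]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def transfer_layer(features, weights, bias):
    """Fully-connected layer + tanh, followed by softmax."""
    return softmax(tanh_layer(fully_connected(features, weights, bias)))


random.seed(0)
feature_dim, num_classes = 8, 4  # toy sizes (assumption); 4 expression classes
features = [random.gauss(0.0, 1.0) for _ in range(feature_dim)]
# Assumed initialization: small Gaussian weights, zero bias.
weights = [[random.gauss(0.0, 0.01) for _ in range(feature_dim)]
           for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = transfer_layer(features, weights, bias)
print(probs)
```

Because tanh bounds its outputs to (-1, 1), the scores passed to the softmax stay in a narrow range, so the class probabilities of a freshly initialized layer are close to uniform.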

II-C Transductive Deep Transfer Learning Training

Following the general transductive transfer learning setting, in the TDTL model the source-domain samples are used as the training set and the target-domain samples as the testing set. The two sets are merged and then divided into batches, each of which contains source and target samples in a fixed proportion. In the TDTL network, training and testing are carried out batch by batch. To obtain better recognition results, we define a dedicated loss function for the proposed transfer learning network as follows:

$$\mathcal{L} = \alpha \mathcal{L}_s + \beta \mathcal{L}_t,$$

in which $\mathcal{L}_s$ and $\mathcal{L}_t$ are two different regularization terms that harmonize facial expression feature learning with the classification task, and $\alpha$ and $\beta$ are trade-off parameters that balance $\mathcal{L}_s$ and $\mathcal{L}_t$. In $\mathcal{L}$, the first regularization term $\mathcal{L}_s$ is the cross-entropy loss, which measures the distance between the actual output (probability values) and the expected output (true labels) for the training samples:

$$\mathcal{L}_s = -\frac{1}{N} \sum_{i=1}^{N} (l_i^s)^{\top} \log p_i^s,$$

in which $N$ is the number of training samples in each batch, the subscript $s$ denotes that the training samples come from the source database, $p_i^s$ is the predicted label vector of the $i$-th training sample, and $l_i^s$ is the true label vector of the $i$-th training sample. $p_i^s$ is calculated by a softmax function:

$$p_{i,j}^s = \frac{\exp(o_{i,j}^s)}{\sum_{k=1}^{c} \exp(o_{i,k}^s)}, \quad j = 1, \dots, c,$$

in which $o_i^s$ is the network output for the $i$-th training sample $x_i^s$, namely $o_i^s = f(x_i^s)$, and $c$ is the number of facial expression classes. In addition, the second regularization term $\mathcal{L}_t$ is the proposed transductive transfer learning loss, in which $M$ is the number of testing samples in one batch and the subscript $t$ denotes that the testing samples come from the target database. The second term of $\mathcal{L}_t$ is a norm regularization term that enforces a sparse structure on the predicted label matrix $L_t$, and $\lambda$ is a trade-off parameter that controls the sparsity of the columns of $L_t$. The larger the value of $\lambda$, the sparser each column of $L_t$ becomes. A sparser column has more elements equal to 0, which helps the classification task. The network weights of each layer are updated by back-propagation so as to minimize the loss function. The final task of the TDTL model is to predict the label matrix $L_t$ of the target-domain samples, which are used as testing samples.
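The cross-entropy term on the source samples can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the loss is averaged over the batch, the function names are hypothetical, and the two predicted probability vectors are made-up toy values. The transductive term on the target samples is not reproduced here.

```python
import math


def cross_entropy(pred, label, eps=1e-12):
    """Cross entropy between a probability vector and a 0/1 label vector."""
    return -sum(l * math.log(max(p, eps)) for p, l in zip(pred, label))


def batch_cross_entropy(preds, labels):
    """Average cross entropy over the source samples of one batch."""
    return sum(cross_entropy(p, l) for p, l in zip(preds, labels)) / len(preds)


# Toy softmax outputs for two source samples and their true label vectors.
preds = [[0.7, 0.1, 0.1, 0.1],
         [0.25, 0.25, 0.25, 0.25]]
labels = [[1, 0, 0, 0],
          [0, 0, 1, 0]]

loss = batch_cross_entropy(preds, labels)
print(loss)
```

Only the probability assigned to the true class contributes to each sample's loss, so a confident correct prediction (0.7) is penalized less than a uniform one (0.25).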

III Experiments

III-A The Choice of Samples

Fig. 2: The located 68 landmark points for SIFT features in 5 viewpoints
Fig. 3: The selected facial blocks (8 × 8) for LBP feature extraction in 5 viewpoints

In this section, we conduct extensive experiments on cross-database non-frontal facial expression images to evaluate the proposed transductive deep transfer learning model. We adopt two widely used multi-view facial expression databases: the Carnegie Mellon University multi-pose, illumination, and expression (Multi-PIE) face database and the Binghamton University 3D Facial Expression (BU-3DEF) database. The Multi-PIE database is a classic database developed by Gross et al. [27] for non-frontal facial expression recognition, collected from 337 people. Its images cover six facial expressions, namely disgust (DI), smile (SM), squint (SQ), scream (SC), surprise (SU), and neutral (NE), under 19 illumination conditions and 15 viewpoints. The BU-3DEF database was established by Yin et al. [28] for 3D non-frontal facial expression classification and is composed of 606 facial expression sequences collected from 100 subjects. This database contains seven fundamental expression categories, i.e., anger (AN), disgust (DI), fear (FE), neutral (NE), happiness (HA), sadness (SA), and surprise (SU), with multiple expression intensities. By comparing the expression categories of the two databases, we select the four common facial expressions (DI, SM/HA, SU, and NE) and five conventional viewpoints from each database for our experiments. In addition, we randomly select 100 of the 337 subjects of the Multi-PIE database, and we use all 100 subjects of the BU-3DEF database. Note that in the BU-3DEF database three expressions (HA, SU, and DI) have four expression intensities, while the NE expression has only one. For the BU-3DEF database, we choose the HA, SU, and DI samples under five viewpoints with four expression intensities and the NE samples under five viewpoints with one expression intensity from the 100 subjects, for a total of 6500 samples. Unlike the BU-3DEF database, each expression in the Multi-PIE database has only one expression intensity; we choose the NE, SM, SU, and DI samples under five viewpoints with this single expression intensity and one fixed illumination condition from 100 subjects, for a total of 2000 samples. The same samples are used for the TDTL method and all comparison methods.

| Class | Multi-PIE | BU-3DEF |
|---|---|---|
| Disgust | 500 | 2000 |
| Smile/Happiness | 500 | 2000 |
| Surprise | 500 | 2000 |
| Neutral | 500 | 500 |
| Total | 2000 | 6500 |
TABLE I: The sample constitutions of the selected Multi-PIE and BU-3DEF databases with the same facial expression labels.
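The sample counts in Table I follow directly from the selection described above (100 subjects, five viewpoints, and the number of expression intensities per class). The short Python check below reproduces them; the variable names are illustrative.

```python
subjects = 100
viewpoints = 5

# BU-3DEF: HA, SU, and DI each have four intensities; NE has one.
bu_per_intense_class = viewpoints * 4 * subjects
bu_neutral = viewpoints * 1 * subjects
bu_total = 3 * bu_per_intense_class + bu_neutral

# Multi-PIE: each of the four expressions has a single intensity.
mp_per_class = viewpoints * 1 * subjects
mp_total = 4 * mp_per_class

print(bu_per_intense_class, bu_neutral, bu_total)
print(mp_per_class, mp_total)
```

The counts make the class imbalance of the BU-3DEF selection explicit: the neutral class has a quarter as many samples as each of the other three classes.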

III-B Experiments Based on Transductive Deep Transfer Learning

The experimental protocol of the TDTL model follows the conventional transductive transfer learning setting: the source database is used as the training set and the target database as the testing set. When one of the BU-3DEF and Multi-PIE databases serves as the source database, the other serves as the target database. After the convolutional stacks extract non-frontal facial expression features, two fully-connected layers learn network weights that better represent transfer knowledge, and we obtain a fixed 4096-dimensional feature from them as high-level semantic information for the classification task. The end of the network is the transductive transfer learning layer, which includes one 4-dimensional fully-connected layer and one 4-class softmax layer and is used to recognize the facial expression categories of the target database samples. In the TDTL model, all input data are fixed-size 224 × 224 RGB raw facial expression images. To keep the optimization robust and efficient, in the first two fully-connected layers taken from the VGGface16-Net model the dropout ratio is 0.5 and the learning rate is set to 0.01. The initial network weights are sampled from a Gaussian distribution, and the bias terms are initialized to 0. We use a mini-batch size of 500, in which source and target samples each make up fifty percent. Moreover, in the fully-connected layer of the transductive transfer layer, we start with a learning rate of 0.005, and the dropout ratio is set to 50%. For the loss function of the TDTL model, we alternately set its two trade-off parameters to 0 or 1 to optimize the parameters of the network model. For the transductive transfer learning loss, the sparsity trade-off parameter is set to 150; in particular, we first randomly initialize the label matrix of the target samples. This label matrix is updated together with the parameters of the neural network until the loss function converges. The recognition accuracies are calculated from the predicted label values and the corresponding actual label values of the target samples.
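The mixed mini-batches described above (500 samples, half from the source database and half from the target database) can be sketched as follows. The function name `sample_mixed_batch`, the use of integer sample indices, and the choice to draw each batch by random sampling are assumptions for illustration; the paper does not specify how batches are drawn.

```python
import random


def sample_mixed_batch(source_ids, target_ids, batch_size=500,
                       source_fraction=0.5, rng=random):
    """Draw one mini-batch with a fixed proportion of source and target samples."""
    n_source = int(batch_size * source_fraction)
    n_target = batch_size - n_source
    batch = [("source", i) for i in rng.sample(source_ids, n_source)]
    batch += [("target", i) for i in rng.sample(target_ids, n_target)]
    rng.shuffle(batch)  # interleave source and target samples
    return batch


rng = random.Random(0)
source_ids = list(range(6500))  # e.g. BU-3DEF used as the source database
target_ids = list(range(2000))  # e.g. Multi-PIE used as the target database
batch = sample_mixed_batch(source_ids, target_ids, rng=rng)

n_source_in_batch = sum(1 for domain, _ in batch if domain == "source")
print(len(batch), n_source_in_batch)
```

Keeping the domain tag with each index lets a training loop route source samples to the cross-entropy term and target samples to the transductive term.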

III-D Comparison Experiments Setting

For comparison, we choose recently proposed, well-performing cross-database FER methods for the cross-database non-frontal facial expression classification problem: TTRLSR [19], SA (Subspace Alignment) [24], GFK (Geodesic Flow Kernel) [29], TKL (Transfer Kernel Learning) [30], and TCA (Transfer Component Analysis) [31]. In the SA approach, the source and target domains are jointly represented by seeking an optimal domain adaptation solution that learns a mapping subspace aligning the source samples with the target samples. The GFK method is a transfer learning method based on manifold transformation and kernel learning. The TKL algorithm bridges the discrepancy between the source and target distributions in the reproducing kernel Hilbert space by means of a domain-invariant kernel. The key idea of TCA is to minimize the distribution discrepancy between domains based on the maximum mean discrepancy. It is worth mentioning that many previous FER works show that, for traditional pattern classification methods, hand-crafted features give better recognition results than raw image samples as input [19][7][32][33][34]. For a fairer comparison with these five baseline methods, we therefore select two classical hand-crafted features (SIFT and LBP) for the comparison experiments, although the TDTL method uses the raw image samples directly. To measure the impact of the features on the recognition results, we additionally use VGG features of the samples from the BU-3DEF and Multi-PIE databases in our comparison experiments.

To extract SIFT features from the BU-3DEF and Multi-PIE databases, we first use OpenGL software to render 2D non-frontal facial expression images from the 3D facial expression models of the BU-3DEF database. Before extracting the SIFT features, we manually locate 68 landmark points on each facial image (see Fig. 2); these landmarks serve as the key points for SIFT feature extraction and lie in the major action unit (AU) regions, including the mouth, brows, eyes, nose, and face contour. With one 128-dimensional SIFT descriptor per landmark, the SIFT feature of each sample has size $68 \times 128$, and we reshape it into a vector of length 8704 ($68 \times 128$). For the LBP feature, we apply the 59-bin $\mathrm{LBP}_{P,R}^{u2}$ operator to the BU-3DEF and Multi-PIE databases, in which the subscript indicates the neighbourhood in which the operator is applied and the superscript indicates that only uniform patterns are used, with all remaining patterns assigned a single label. Each facial image is divided into 64 regions (see Fig. 3) and represented by the concatenated LBP histograms of these regions, a vector of length 3776. Moreover, we use the same VGGface16-Net model as in our method to extract deep neural network features for the comparison experiments. The extracted VGG feature of each sample is a vector of length 4096.
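The feature lengths quoted above can be checked with a few lines of Python. The sketch assumes the standard 128-dimensional SIFT descriptor and an 8-neighbour uniform LBP operator (which gives 58 uniform patterns plus one bin for all remaining patterns); the function name `circular_transitions` is illustrative.

```python
def circular_transitions(pattern, bits=8):
    """Count 0/1 transitions in the circular binary representation of pattern."""
    count = 0
    for k in range(bits):
        a = (pattern >> k) & 1
        b = (pattern >> ((k + 1) % bits)) & 1
        count += int(a != b)
    return count


# Uniform patterns have at most two circular transitions.
uniform_patterns = [p for p in range(256) if circular_transitions(p) <= 2]
lbp_bins = len(uniform_patterns) + 1  # one extra bin for non-uniform patterns

regions = 8 * 8                      # 64 facial blocks per image
lbp_length = regions * lbp_bins      # concatenated block histograms

landmarks, sift_dim = 68, 128        # one SIFT descriptor per landmark
sift_length = landmarks * sift_dim

print(lbp_bins, lbp_length, sift_length)
```

Both hand-crafted feature vectors are of the same order of magnitude as the 4096-dimensional VGG feature.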

| Method | BU-3DEF to Multi-PIE: Accuracy (%) | BU-3DEF to Multi-PIE: F1-score | Multi-PIE to BU-3DEF: Accuracy (%) | Multi-PIE to BU-3DEF: F1-score |
|---|---|---|---|---|
| SIFT (68) + SA | 33.50 | 0.2571 | 33.38 | 0.3102 |
| SIFT (68) + GFK | 31.82 | 0.2177 | 35.05 | 0.3109 |
| SIFT (68) + TKL | 36.95 | 0.3100 | 34.80 | 0.2855 |
| SIFT (68) + TCA | 39.20 | 0.3767 | 38.63 | 0.3209 |
| SIFT (68) + TTRLSR | 41.30 | 0.3829 | 40.66 | 0.3521 |
| LBP (8*8) + SA | 45.60 | 0.3707 | 33.18 | 0.2127 |
| LBP (8*8) + GFK | 44.20 | 0.3859 | 34.92 | 0.2536 |
| LBP (8*8) + TKL | 33.85 | 0.4208 | 33.00 | 0.2743 |
| LBP (8*8) + TCA | 44.20 | 0.3550 | 34.92 | 0.3274 |
| LBP (8*8) + TTRLSR | 43.90 | 0.4137 | 31.07 | 0.3219 |
| VGG + SA | 55.00 | 0.5275 | 50.82 | 0.4754 |
| VGG + GFK | 53.75 | 0.4976 | 57.80 | 0.5165 |
| VGG + TKL | 53.90 | 0.4842 | 50.46 | 0.4774 |
| VGG + TCA | 57.80 | 0.5545 | 58.32 | 0.5228 |
| fine-tuned VGG | 55.30 | - | 50.20 | - |
| TDTL | 66.85 | 0.6547 | 66.05 | 0.5309 |

TABLE II: Experimental results of all methods on the BU-3DEF and Multi-PIE databases in terms of recognition accuracy and F1-score.

The detailed parameter settings of all baseline methods in these experiments are as follows:

  1. For the SA method, we traverse the subspace dimensionality with an interval of 1, select the best recognition rates, and record them in Table II.

  2. For the TKL method, we search its parameter over a wide range, [0.01:0.01:0.09], [0.1:0.1:1], and [2:1:15], with the linear kernel as the kernel function; the best recognition results for the three features are recorded in Table II.

  3. For the GFK method, we use the kernel-based variant GFK(PCA, PCA), traverse the PCA dimensionality from 10 to 100 with an interval of 10, and record the best results in Table II.

  4. For the TCA method, we search the trade-off parameter over the preset interval [1:300] and use the linear kernel to compute the kernel matrix of the TCA algorithm. The best results for each feature are recorded in Table II.

  5. For the TTRLSR method, we use a grid search over its parameters, with the grid set to [0.01:0.01:0.09], [0.1:0.1:0.9], and a further range traversed with an interval of 5. Furthermore, we vary the size of the auxiliary data set from 10% to 100% of the whole target data set with an interval of 10%. Finally, we choose the best results over the different proportions and parameters.
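The search ranges in the list above use a [start:step:stop] notation. As a small sketch, the Python helper below expands such ranges into explicit candidate values, using the three TKL ranges as an example. The helper name `grid` and the rounding to two decimals are assumptions for illustration.

```python
def grid(start, step, stop, digits=2):
    """Expand a [start:step:stop] range (inclusive) into a list of values."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + k * step, digits) for k in range(n)]


# Candidate values for the TKL parameter search described in item 2.
tkl_grid = grid(0.01, 0.01, 0.09) + grid(0.1, 0.1, 1.0) + grid(2, 1, 15)

print(len(tkl_grid))
print(tkl_grid[:3], tkl_grid[-3:])
```

Computing the number of steps first and then rounding each value avoids the accumulated floating-point error of repeatedly adding the step.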

III-E Experimental Results and Discussions

Fig. 4: The recognition rate confusion matrix of the TDTL method using the BU-3DEF database as source samples and the Multi-PIE database as target samples.

Fig. 5: The recognition rate confusion matrix of the TDTL method using the Multi-PIE database as source samples and the BU-3DEF database as target samples.

In this section, we report the recognition accuracies of the TDTL method and of all comparison methods, which are recent state-of-the-art transfer learning methods. The results are shown in Table II. The recognition accuracy is calculated as $N_c / N_t \times 100\%$, where $N_c$ is the number of correct predictions on the target-domain samples and $N_t$ is the total number of target-domain samples. In addition, as described in Section III-A, the BU-3DEF database is imbalanced across categories in our experiments. To reflect the performance of all methods more objectively, we also report the F1-score of all experimental results, $F1 = \frac{1}{c} \sum_{i=1}^{c} \frac{2 p_i r_i}{p_i + r_i}$, in which $p_i$ and $r_i$ are the precision and recall of the $i$-th facial expression, respectively, and $c$ is the number of facial expression categories. From Table II, we can see that the TDTL model achieves better recognition accuracies than the comparison methods, whether the BU-3DEF database or the Multi-PIE database is used as the source domain. The F1-scores of our method (0.6547 and 0.5309) are also better than those of the comparison methods. To summarize, in terms of both recognition rate and F1-score, the proposed TDTL method is more suitable for the cross-database non-frontal FER problem between the BU-3DEF and Multi-PIE databases. Moreover, the comparison results for the three features clearly show that the recognition accuracies of the VGG features are better than those of the SIFT and LBP features, while the results of the SIFT and LBP features do not differ significantly across the comparison methods. In general, these results indicate that VGG features represent complicated information such as non-frontal facial expression data better than traditional hand-crafted features like SIFT and LBP. It is worth mentioning that the TTRLSR model [19] also achieves good recognition results and F1-scores compared with the other comparison methods, especially with the SIFT features. This shows that a transductive transfer learning method based on group sparse learning can also achieve good results in complex cross-database non-frontal facial expression recognition tasks. In addition, the TTRLSR method needs to select feature channels of the samples, and the VGG feature does not have this channel structure, so TTRLSR results are reported only for the SIFT and LBP features. Finally, we display the recognition rate confusion matrices of the TDTL method in Figs. 4 and 5. Comparing Figs. 4 and 5, we can clearly see that the DI, SM, and SU expressions are recognized more easily, whereas the NE expression is harder for the TDTL method to recognize, especially when the BU-3DEF database is used as the target database. A possible reason is that the distribution of the NE expression is similar to the distributions of the other three expressions, which leads to the low recognition performance on the NE expression.
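The two evaluation measures used above can be computed from a confusion matrix, as in the following Python sketch. It assumes that the F1-score is the average of the per-class F1 values (macro averaging) and that rows of the matrix are true classes and columns are predicted classes; the function names and the 2 × 2 matrix are toy illustrations, not results from the paper.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def macro_f1(confusion):
    """Average of per-class F1 scores (rows: true class, columns: predicted)."""
    c = len(confusion)
    scores = []
    for i in range(c):
        tp = confusion[i][i]
        predicted = sum(confusion[r][i] for r in range(c))
        actual = sum(confusion[i])
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / c


# Toy two-class confusion matrix (made-up counts).
toy = [[8, 2],
       [1, 9]]

print(accuracy(toy), macro_f1(toy))
```

Because every class contributes equally to the averaged F1-score, a class with few samples that is poorly recognized lowers the F1-score more than it lowers the overall accuracy.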

IV Conclusions

In this paper, a novel transductive deep transfer learning (TDTL) framework based on the widely used VGGface16-Net is proposed to better deal with the cross-database non-frontal FER problem. In this method, we design a special transfer learning layer that jointly optimizes the loss function for predicting the labels of the target samples. To evaluate the TDTL method, extensive experiments are conducted on two publicly available non-frontal facial expression databases, i.e., the BU-3DEF and Multi-PIE databases. The experimental results demonstrate that the TDTL model improves recognition performance on the non-frontal FER problem compared with recent state-of-the-art transfer learning methods. Additionally, the comparison experiments show that the VGG-based features achieve higher recognition accuracies than the traditional hand-crafted features on both databases. These results further indicate that deep neural networks are better at learning feature representations of human facial emotion.

References


  • [1] P. Ekman and W. V. Friesen, Facial Action Coding System: Investigator’s Guide. Consulting Psychologists Press, 1978.
  • [2] R. Zhi, M. Flierl, Q. Ruan, and W. B. Kleijn, “Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 1, pp. 38–52, 2011.
  • [3] A. Dhall, R. Goecke, S. Ghosh, J. Joshi, J. Hoey, and T. Gedeon, “From individual to group-level emotion recognition: Emotiw 5.0,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction, pp. 524–528, ACM, 2017.
  • [4] M. F. Valstar, M. Mehu, B. Jiang, M. Pantic, and K. Scherer, “Meta-analysis of the first facial expression recognition challenge,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 4, pp. 966–979, 2012.
  • [5] D. Nguyen, K. Nguyen, S. Sridharan, A. Ghasemi, D. Dean, and C. Fookes, “Deep spatio-temporal features for multimodal emotion recognition,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pp. 1215–1223, IEEE, 2017.
  • [6] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, “Training deep networks for facial expression recognition with crowd-sourced label distribution,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 279–283, ACM, 2016.
  • [7] Y. Zong, X. Huang, W. Zheng, Z. Cui, and G. Zhao, “Learning from hierarchical spatiotemporal descriptors for micro-expression recognition,” IEEE Transactions on Multimedia, vol. PP, no. 99, pp. 1–1, 2018.
  • [8] W. Zheng, X. Zhou, C. Zou, and L. Zhao, “Facial expression recognition using kernel canonical correlation analysis (kcca),” IEEE Transactions on Neural Networks, vol. 17, no. 1, pp. 233–238, 2006.
  • [9] S. Moore and R. Bowden, “Local binary patterns for multi-view facial expression recognition,” Computer Vision and Image Understanding, vol. 115, no. 4, pp. 541–558, 2011.
  • [10] W. Zheng, H. Tang, and T. S. Huang, “Emotion recognition from non-frontal facial images,” Emotion Recognition: A Pattern Analysis Approach, First Edition. Edited by Amit Konar and Aruna Chakraborty, pp. 183–213, 2014.
  • [11] H. Tang, M. Hasegawa-Johnson, and T. Huang, “Non-frontal view facial expression recognition based on ergodic hidden markov model supervectors,” in Multimedia and Expo (ICME), 2010 IEEE International Conference on, pp. 1202–1207, IEEE, 2010.
  • [12] T. Zhang, W. Zheng, Z. Cui, Y. Zong, J. Yan, and K. Yan, “A deep neural network-driven feature learning method for multi-view facial expression recognition,” IEEE Transactions on Multimedia, vol. 18, no. 12, pp. 2528–2536, 2016.
  • [13] W. Zheng, H. Tang, Z. Lin, and T. S. Huang, “A novel approach to expression recognition from non-frontal face images,” in Computer Vision, 2009 IEEE 12th International Conference on, pp. 1901–1908, IEEE, 2009.
  • [14] S. Moore and R. Bowden, “Multi-view pose and facial expression recognition,” in Proc. BMVC, vol. 2, 2010.
  • [15] S. Kumano, K. Otsuka, J. Yamato, E. Maeda, and Y. Sato, “Pose-invariant facial expression recognition using variable-intensity templates,” International journal of computer vision, vol. 83, no. 2, pp. 178–194, 2009.
  • [16] W. Zheng and X. Zhou, “Cross-pose color facial expression recognition using transductive transfer linear discriminant analysis,” in Image Processing (ICIP), 2015 IEEE International Conference on, pp. 1935–1939, IEEE, 2015.
  • [17] W.-S. Chu, F. De la Torre, and J. F. Cohn, “Selective transfer machine for personalized facial expression analysis,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 3, pp. 529–545, 2017.
  • [18] Y. Zong, W. Zheng, X. Huang, J. Shi, Z. Cui, and G. Zhao, “Domain regeneration for cross-database micro-expression recognition,” IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2484–2498, 2018.
  • [19] W. Zheng, Y. Zong, X. Zhou, and M. Xin, “Cross-domain color facial expression recognition using transductive transfer subspace learning,” IEEE Transactions on Affective Computing, 2016.
  • [20] P. Wei, Y. Ke, and C. K. Goh, “Deep nonlinear feature coding for unsupervised domain adaptation,” in IJCAI, pp. 2189–2195, 2016.
  • [21] M. V. Zavarez, R. F. Berriel, and T. Oliveira-Santos, “Cross-database facial expression recognition based on fine-tuned deep convolutional network,” in Graphics, Patterns and Images (SIBGRAPI), 2017 30th SIBGRAPI Conference on, pp. 405–412, IEEE, 2017.
  • [22] F. Zhu and L. Shao, “Weakly-supervised cross-domain dictionary learning for visual recognition,” International Journal of Computer Vision, vol. 109, no. 1-2, pp. 42–59, 2014.
  • [23] L. Duan, I. W. Tsang, and D. Xu, “Domain transfer multiple kernel learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 465–479, 2012.
  • [24] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised visual domain adaptation using subspace alignment,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2960–2967, 2013.
  • [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
  • [26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al., “Going deeper with convolutions,” in CVPR, 2015.
  • [27] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-pie,” Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010.
  • [28] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3d facial expression database for facial behavior research,” in Automatic face and gesture recognition, 2006. FGR 2006. 7th international conference on, pp. 211–216, IEEE, 2006.
  • [29] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2066–2073, IEEE, 2012.
  • [30] M. Long, J. Wang, J. Sun, and P. S. Yu, “Domain invariant transfer kernel learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 6, pp. 1519–1532, 2015.
  • [31] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, p. 199, 2011.
  • [32] G. Zhao and M. Pietikainen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 6, pp. 915–928, 2007.
  • [33] C. Shan, S. Gong, and P. W. McOwan, “Facial expression recognition based on local binary patterns: A comprehensive study,” Image and Vision Computing, vol. 27, no. 6, pp. 803–816, 2009.
  • [34] O. Déniz, G. Bueno, J. Salido, and F. De la Torre, “Face recognition using histograms of oriented gradients,” Pattern Recognition Letters, vol. 32, no. 12, pp. 1598–1603, 2011.