Using Deep Autoencoders for Facial Expression Recognition

01/25/2018 ∙ by Muhammad Usman, et al. ∙ 0

Feature descriptors involved in image processing are generally manually chosen and high dimensional in nature. Selecting the most important features is a very crucial task for systems like facial expression recognition. This paper investigates the performance of deep autoencoders for feature selection and dimension reduction for facial expression recognition on multiple levels of hidden layers. The features extracted from the stacked autoencoder outperformed when compared to other state-of-the-art feature selection and dimension reduction techniques.



There are no comments yet.


page 5

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Emotion recognition is an important area of research to enable effective human-computer interaction. Human emotions can be detected using speech signal, facial expressions, body language, and electroencephalography (EEG), etc. In this paper, we focus on facial expression recognition (FER), which is a widely being studied problem [1, 2]. FER has become a very interesting field of study and its applications are not limited to human mental state identification and operator fatigue detection, but also to other scenarios where computers (robots) play a social role such as an instructor, a helper, or even a companion. In such applications, it is essential that computers are able to recognize human emotions and behave according to their affective states. In healthcare, recognizing patients’ emotional instability can help in early diagnosis of psychological disorders [3]. Another application of FER is to monitor human stress level in daily human-computer interaction.

Humans can easily recognize another human’s emotions using facial expressions but the same task is very challenging for machines. Generally, FER consists of three major steps as shown in the Figure 1

. The first step involves the detection of a human face from the whole image by using image processing techniques. In the second step, key features are extracted from the detected face. Finally, machine learning models are used to classify images based on the extracted features.

Features descriptors like histograms of oriented gradients (HOG) [4], Local Gabor features [5] and Weber Local Descriptor (WLD) [6] are widely used techniques for FER, whereas HOG has shown to be particularly effective in literature for the task of FER [7]

. The dimensionality of these features is usually high. Due to the complexity of multi-view features, dimension reduction and more meaningful representation of this high dimensional data is a challenging task. Therefore, techniques like Principal Component Analysis (PCA) and Local Binary Pattern (LBP),

[8, 5], Non-Negative Matrix Factorization (NMF), etc., are being used to overcome high dimensionality problem by representing the most relevant features in lower-dimensions.

Fig. 1: Facial expression recognition (FER) block diagram

Machine learning techniques have revolutionized many fields of science including computer vision, pattern recognition, and speech processing through its powerful ability to learn nonlinear relationships over hidden layers, which makes it suitable for automatic features learning and modeling of nonlinear transformations. Deep neural networks (DNNs) can be used for feature extraction as well as for dimensionality reduction

[9, 7]. A large number of classification techniques has been used for FER. For example, Choi et al. [10] used artificial neural networks (ANNs) for classification of facial expressions. Authors in [8, 11]

have used Support Vector Machines (SVMs) for FER. In

[12, 13]

, authors utilized Hidden Markov Model (HMM) for FER. HMMs are mostly used for frame-level features to handle sequential data. Besides these classifiers, Dynamic Bayesian Networks


and Gaussian Mixture Model


are also utilized for learning facial expressions. The recent success of deep learning also motivates its use for FER

[16, 17].

In this paper, we use a novel approach based on stacked autoencoders for FER. We exploited autoencoders network for effective representation of high dimensional facial features in lower dimensions. Autoencoders represent an ANN configuration in which output units are linked to the input units through the hidden layers. A fewer number of hidden units allow them to represent input data into a low dimensional latent representation. While in stacked autoencoder, output of first layer is immediately given to second layer as an input. In other words, stacked autoencoders are built by stacking additional unsupervised feature learning hidden layers, and can be trained by using greedy methods for each additional layer. As a result, when the data is passed through the multiple hidden layers of stacked autoencoders, it encodes the input vector in a smaller representation more efficiently [18]. In our case, autoencoders network is more suitable as it not only reduces the dimension of data but can also detect most relevant features. In previous work, Hinton et al. [19] have shown that autoencoders networks can be used for effective dimension reduction and they can produce more effective representation than PCA.

For our experiments, we choose Extended Cohn-Kanade (CK+) [20] dataset which is extensively used for automatic facial image analysis and emotion classification. The HOG features are computed from the selected area of facial expressions and their dimensions have been reduced by using stacked autoencoders on multiple levels and with multiple hidden layers to get the most optimal encoded features. SVM model in the one-vs-all scenario is used for classification on this reduced form of features. We have performed multiple experiments on the selection of optimal dimension (10-500 features) of the feature vector. The feature vector with length 60, obtained after the introduction of four hidden layers in autoencoders network outperformed as compared to other dimensions. Most importantly, we also use PCA for dimension reduction in order to compare the baseline results with autoencoders. Our proposed method for FER using stacked autoencoders is also outperformed when results were compared with PCA and other recent approaches published in this domain. This demonstrates the effectiveness of stacked autoencoders for the selection of the most relevant features for FER task.

The rest of the paper is organized as follows. In Section II, we present background and related work. In Section III, the detail on each step of our proposed method is described. In Section IV, we explain the experimental procedure and obtained results. Finally, we conclude in Section V.

Ii Related Work

Facial expressions are visually observable non-verbal communication signals that occur in response to a person’s emotions and originate by the change in facial muscle. They are the key mechanism for conveying and understanding emotions. Ekman and Freisen [21] postulated six universal emotions (i.e., anger, fear, disgust, joy, surprise, and sadness) with distinct content and unique facial expression. Most of the studies in the area of emotion recognition usually focus on classifying these six emotions.

Much of the efforts have been made to classify facial expression with various facial feature by using machine learning algorithms. For example, Anderson et al. [22]

developed an FER system to recognize the six emotions. They use SVM and Multilayer Perceptrons and achieved a recognition accuracy of 81.82%. In

[6], Wang et al. combined HOG and WLD features to have missing information about the contour and shape. The proposed solution attained 95.86% recognition rate by using chi-square distance and the nearest neighbor method to classify the fused features. Lia et al. [23] used k-nearest neighbor to compare the performance of PCA and NMF on Taiwanese and Indian facial expression databases. They attained above 75% recognition rate by using both techniques.

Recently, a comprehensive study has been made by Liu et al. [8], they also combined HOG with Local Binary Patterns (LBP) features. For dimension reduction of extracted features, PCA was used. After applying several classifiers on reduced features, he received 98.3% maximum recognition rate. Similarly, Xing et al. [24] used local Gabor features with Adaboost classifier for FER and achieved 95.1% accuracy with the 10-time reduced dimensionality of traditional Gabor features.

Encouragingly, Jung et al. [25] used deep neural networks to extract temporal appearance as well as temporal geometric features from RAW data. They tested this technique on several datasets and obtained higher accuracy than state-of-the-art techniques. Jang et al. [26] worked on color images and attained 85.74% recognition rate by using color channel-wise recurrent learning using deep learning. Similarly, Talele et al. [27] used LBP features and ANN for classification and the maximum success rate was 95.48%.

Recently, the autoencoders models have been used more widely for features learning from data and classification problems. For example, Huang et al. [28] used sparse autoencoder networks for feature learning and classification. His technique was good to avoid human interaction but at the cost of computation complexity. Interestingly, Gupta et al. [29] developed a multi-velocity autoencoder network by using the multi-velocity layers for generating velocity-free deep motion features for facial expressions. The proposed technique attained the state-of-the-art accuracy on various FER datasets. An interesting work has been done by Makhzan et al. [30]

to investigate the effectiveness of sparsity on MNIST data. They showed that sparse autoencoders are simple to train and achieve better classification results as compared to the denoising autoencoders and Restricted Boltzmann Machines (RBMs) as well as networks trained with dropout. Another study

[31] explored the effect of hidden layers in stacked autoencoders on MNIST data. The authors showed that stacked autoencoders with larger depth have better learning capability but require more training examples and time.

Iii Proposed Method

Our proposed FER system consists of four steps (as illustrated in Figure 2). The first step is related to image processing, in which, we use the state-of-the-art Viola Jones [32]face detection method for face detection and extraction. This extracted portion represents the most variance when expression changes. In the second step, HOG features are computed from the cropped image. In the third step, high-dimensional HOD features are reduced to lower dimension using stacked autoencoders. Finally, in the fourth step, the SVM model is used on these lower dimension features to classify the facial expressions. Extended Cohn-Kanade Dataset (CK+) is used in our experiment. Most importantly, we investigated the performance of encoded features of length 5 to 100 using different depth of stacked autoencoders.

Fig. 2: Facial Emotion Recognition (FER) framework

Figure. 2 shows the flowchart of our overall experiment. The detail of each step is given below.

Iii-a Image Processing

At the image processing stage, we first detect and extract the face region to eliminate redundant regions which can affect recognition rate. The used databases contain much redundant information in the images, and to eliminate the redundant information, the robust real-time detector developed by Viola and Jones [32] is employed. In this way, we obtained the face local region around mouth and eyes as these parts represent the most discriminating information when facial expression changes.

Fig. 3: Face detection and cropping salient area

As shown in Figure. 3, we crop out the face region and re-size the face image to 128128 to get the salient areas of facial expression.

Iii-B Feature Extraction

Histogram of oriented gradients (HOG) is a feature descriptor which is widely used in computer vision and image processing [33]. The technique counts occurrences of gradient orientation in localized portions of an image. HOG is invariant to geometric and photometric transformations, except for object orientation. The images that are in the database have different expressions and different orientations of eyes, nose, and lip corners. HOG is used in our algorithm because it is a powerful descriptor to detect the variations (i.e., when facial expressions change). In our proposed approach, we applied HOG on cropped face images and extracted the feature vectors. The cropped image of size 128128 gives a feature vector of size 18100 using HOG. The feature vectors are concatenated to form feature matrix as shown in table I.

1x8100 (HOG Feature of image 1)
1x8100 (HOG Feature of image 2)
1x8100 (HOG Feature of image 3)
1x8100 (HOG Feature of image 4)
1x8100 (HOG Feature of image (N-1))
1x8100 (HOG Feature of image N)
TABLE I: HOG feature Matrix of N8100

Iii-C Dimension Reduction

The main aim of this work is to show that how nonlinear machine learning technique can be effectively used to obtain the most relevant holistic representation of features in a lower dimension. As the extracted HOG features have a high dimension (N8100) as compared to the number of available images (327). The state-of-the-art is to reduce the dimension of the features vector by using different dimension reduction techniques such as PCA, linear discriminant analysis (LDA) and NMF. For this purpose, we use autoencoder network on high dimensional feature descriptors extracted by using HOG. To compare the performance of these features, we also use PCA for dimension reduction of features. Both of these techniques are discussed below.

Iii-C1 Autoencoders

An autoencoder is an unsupervised architecture that replicates the given input at its output. It takes an input feature vector

and learns a code dictionary by changing the raw input data from one representation to another. An autoencoder applies backpropagation by setting the target values to be equal to the inputs (i.e.,

) as shown in the Figure 4.

Fig. 4: Architecture of an autoencoder network. The input vector of length is encoded to lower dimensional feature vector of length then reconstructed as with length similar to ()

For example, if autoencoders are inputted with correlated structural data, then the network will discover some of these correlations [34]. In an autoencoder, the lower dimension is represented by


Where is associated weight vector with the input unit and hidden unit, is the bias associated with the hidden unit and is the activation of the hidden unit in the network. Similarly,

is the sigmoid function that is given by




The stacked autoencoder can be described as follow


Encouragingly, an autoencoder can also discover the interesting structure of data, even when the number of hidden units is large, by imposing sparsity constraint on the hidden units. Such architecture is called sparse autoencoder. The cost function of a sparse autoencoder is given by:



is an activation function,

and are weights and biases respectively. The first term in equation (III-C1) tries to minimize the difference between the input and output. The second term is the weight decay that avoids over-fitting, where is the parameter for weight decay, is the number of layers in autoencoder network, and denotes the number of units for the layer. Similarly, represents the weight value between the unit of layer and the unit of layer , and is the bias associated with unit in layer . The last term is a sparse penalty term, where controls the weight of this term, and is a sparsity parameter and is the Kullback-Leibler () divergence that is given by


Typically, is set to be a small value close to . divergence is a standard function used for measuring the difference between two different distributions.

In this experiment, the extracted features using HOG is inputted to autoencoder network to encode them at the desired level of dimension by limiting the hidden units in hidden layers. The number of hidden layers is always experiential. Therefore, we also tried to explore the effect of an increase in the number of hidden layers for stacked autoencoder. This effect typically used for dimension reduction of input data. We have performed multiple experiments to validate our findings. To get a quality of encoded features from autoencoder, we use backpropagation for fine-tuning of network parameters. Mean square error (MSE) is used as a loss function with 400 epochs.

Iii-C2 Principal Component Analysis

The research domain of pattern recognition and computer vision is dominated by the extensive use of PCA which is also referred as Karhunen-Loeve expansion [35]. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is an effective method to reduce the feature dimension and has been extensively being applied in FER for dimension reduction of features [8, 5]. Therefore, we chose PCA to compare its performance with nonlinear dimension reduction by autoencoder. High dimension feature matrix (N8100) is reduced to the multiple numbers of dimension (i.e., 10 to 500) using PCA.

Iii-D Support Vector Machine

SVMs are very powerful tool for binary as well as multi-class classification problems. Initially, SVMs was designed for binary classification that separates the binary class data (=2) with a maximized margin. However, for real-world problems, it is often required to discriminate between data for more than two (2) categories. Therefore, two representative ensemble schemes exist in SVMs, i.e., one-versus-all and one-versus-one to classify multi-class data. In this experiment, we use SVM in the one-vs-all scenario with a Gaussian kernel function. In one-vs-all scheme, SVM constructs separate binary classifiers to classify -classes of data. The binary classifier is trained by using the data from class as the positive example and the remaining number of classes as negative examples. During testing, the class label is predicted by the binary classifier that gives maximum output value. For binary classification task with training data and corresponding labels , the decision function can be formulated as:


Where denotes a separating hyper-plane, is a weight vector normal to the separating hyper-plane and denotes the bias or offset of the hyper-plane. Following is the region between hyper-planes that is also called margin band.


Finally, choosing the optimal values of and is formulated as a optimization problem, where equation 9 is maximized subject to the following constrain:


Iv Experimental Results and Discussion

The performance of our proposed approach for FER has evaluated on publicly available CK+ database. This dataset contains 593 sequences of images from 123 subjects. Only 327 out of 593 sequences of images are given the labels for 7 human facial expressions. Out of 7 expressions, we used six expressions (i.e., angry, happy, disgust, sadness, surprise, and fear) similar to the methods adopted in [18, 36, 8]. Contempt has only 18 labeled images so it was not included in our experiment. Each expressional image sequence starts with a neutral expression and ends with a peak expression (i.e., anger). In our experiment, we use five peak images of each expression to incorporate the temporal information of an expression. Figure. 5 shows a sample sequence of images for the six emotions that we use for training. For training, we use 80% of data while testing was performed using remaining 20%. During testing, we only use one peak image of each expression.

Fig. 5: Sample sequence of five images for various emotions: (a) Angry, (b) Disgust, (c) Fear, (d) Happy, (e) Sad and (f) Surprised

Multiclass SVM in the one-vs-all scheme with Gaussian kernel has been used for classification of facial expressions using MATLAB. We have performed multiple experiments on a different length of features obtained by autoencoder and PCA as shown in table II. It can be noted that encoded features obtained by the stacked autoencoders mostly outperformed the baseline (PCA) performance. By using autoencoder for dimension reduction, we achieved the highest recognition rate of 99.60% with 60 dimensions while with PCA 96.44% success rate is obtained with 80 dimensions.

Number of Feature PCA (Accuracy %) Autoencoders (Accuracy %)



TABLE II: Accuracy using different dimension (number of features) with SVM

We also investigated the effect of adding more hidden layers in autoencoder network. We have performed multiple experiments by introducing more hidden layers with a different number of hidden units (i.e., 500, 400, 300 and 200) while the encoded features are from 5 to 100. Figure 6 shows the structure of five autoencoders used in our experiments.

Fig. 6: The structure of five stacked autoencoders used in our experiment. (a) one hidden layer autoencoder, (b) two hidden layer autoencoder, (c) three hidden layer autoencoder, (d) four hidden layer autoencoder and (e) five hidden layer autoencoder

Table III shows the results of experiments in which a different number of hidden layers are introduced. It can be noted that higher the number of hidden layers not necessarily increase the accuracy, as already indicated in [37, 38]. From these results, we can state that after a certain number of the hidden layer for each number of feature, the accuracy starts decreasing. For example, with 80 features, when hidden layers are introduced, recognition rate increases till the second layer but after that, it decreases. Similarly, we find the same trend for all number of features but at a different number of hidden layers.

Number of Feature Hidden Layer 1 Hidden Layer 2 Hidden Layer 3 Hidden Layer 4 Hidden Layer 5
TABLE III: Recognition rate achieved on multiple hidden layers for different dimension (number of features)

Figure. 7 shows the trend of best recognition rate and the number of reduced dimensions with autoencoder and PCA. It can be seen from Figure. 7 that there is no regular relationship between accuracy and the number of dimensions, however, it remains almost same after the specific dimension, i.e., 200.

Fig. 7: Obtained accuracy by using different dimension of features

The accuracy of 99.60 with less (i.e., 60) number of features is not reported in the literature. We have also revived the latest papers in table IV, to compare our method with recently published papers in this domain. It can be seen from table IV, previously maximum achieved accuracy is 99.51% using a combination of features (i.e, HOG+LDA+PCA). Similarly, Liu et al. [8] achieved 98.3% recognition rate using local binary patterns (LBP) and HOG features together. They achieved this accuracy using a combination of features with 80 dimensions. While our proposed method has shown significantly better results while using a single type of features at lower dimensions.

Study Year Method Accuracy (%)
Xing et al. [24]
Local Gabor Feature +
Adaboost Classifier
Cossetin et al. [39]
Pairwise Feature
Kar et al. [40]
Liu et al. [24]
Local binary patterns
(LBP)+HOG with PCA
Our study
HOG + Autoencoders
TABLE IV: Comparison of some recent paper with our study

Although our proposed approach has achieved a state-of-the-art recognition rate but the time complexity of autoencoders is linearly dependent on the number of features and hidden layers. Greater the number of features or hidden layers, the more time that is required to train the model.

V Conclusion

The main contribution of this paper is to investigate the performance of deep autoencoders for lower dimensional feature representation. The experiment proves that nonlinear dimension reduction using autoencoders is more effective than linear dimension reduction techniques for FER. We used CK+ dataset in our experiments and compared our results using features obtained by autoencoder networks with state-of-the-art PCA. Most importantly, we explored the effect of an increase in the number of hidden layers which enhanced the learning capability of the network to provide more robust and optimal features for facial expression recognition.


  • [1] J. Kumari, R. Rajesh, and K. Pooja, “Facial expression recognition: A survey,” Procedia Computer Science, vol. 58, pp. 486–491, 2015.
  • [2] B. Fasel and J. Luettin, “Automatic facial expression analysis: a survey,” Pattern recognition, vol. 36, no. 1, pp. 259–275, 2003.
  • [3] S. Latif, J. Qadir, S. Farooq, and M. A. Imran, “How 5g (and concomitant technologies) will revolutionize healthcare,” arXiv preprint arXiv:1708.08746, 2017.
  • [4] P. Carcagnì, M. Del Coco, M. Leo, and C. Distante, “Facial expression recognition and histograms of oriented gradients: a comprehensive study,” SpringerPlus, vol. 4, no. 1, p. 645, 2015.
  • [5] M. Abdulrahman, T. R. Gwadabe, F. J. Abdu, and A. Eleyan, “Gabor wavelet transform based facial expression recognition using pca and lbp,” in Signal Processing and Communications Applications Conference (SIU), 2014 22nd.   IEEE, 2014, pp. 2265–2268.
  • [6] X. Wang, C. Jin, W. Liu, M. Hu, L. Xu, and F. Ren, “Feature fusion of hog and wld for facial expression recognition,” in System Integration (SII), 2013 IEEE/SICE International Symposium on.   IEEE, 2013, pp. 227–232.
  • [7] M. V. Akinin, N. V. Akinina, A. I. Taganov, and M. B. Nikiforov, “Autoencoder: Approach to the reduction of the dimension of the vector space with controlled loss of information,” in Embedded Computing (MECO), 2015 4th Mediterranean Conference on.   IEEE, 2015, pp. 171–173.
  • [8] Y. Liu, Y. Li, X. Ma, and R. Song, “Facial expression recognition with fusion features extracted from salient facial areas,” Sensors, vol. 17, no. 4, p. 712, 2017.
  • [9] Y. Wang, H. Yao, S. Zhao, and Y. Zheng, “Dimensionality reduction strategy based on auto-encoder,” in Proceedings of the 7th International Conference on Internet Multimedia Computing and Service.   ACM, 2015, p. 63.
  • [10] H.-C. Choi and S.-Y. Oh, “Realtime facial expression recognition using active appearance model and multilayer perceptron,” in SICE-ICASE, 2006. International Joint Conference.   IEEE, 2006, pp. 5924–5927.
  • [11] D. Ghimire, S. Jeong, J. Lee, and S. H. Park, “Facial expression recognition based on local region specific features and support vector machines,” Multimedia Tools and Applications, vol. 76, no. 6, pp. 7803–7821, 2017.
  • [12] M. H. Siddiqi, S. Lee, Y.-K. Lee, A. M. Khan, and P. T. H. Truc, “Hierarchical recognition scheme for human facial expression recognition systems,” Sensors, vol. 13, no. 12, pp. 16 682–16 713, 2013.
  • [13] M. Z. Uddin, J. Lee, and T.-S. Kim, “An enhanced independent component-based human facial expression recognition from video,” IEEE Transactions on Consumer Electronics, vol. 55, no. 4, 2009.
  • [14] Y. Li, S. Wang, Y. Zhao, and Q. Ji, “Simultaneous facial feature tracking and facial expression recognition,” IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2559–2573, 2013.
  • [15] M. Schels and F. Schwenker, “A multiple classifier system approach for facial expressions in image sequences utilizing gmm supervectors,” in Pattern Recognition (ICPR), 2010 20th International Conference on.   IEEE, 2010, pp. 4251–4254.
  • [16]

    P. Liu, S. Han, Z. Meng, and Y. Tong, “Facial expression recognition via a boosted deep belief network,” in

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.
  • [17] J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson, “Generating facial expressions with deep belief nets,” in Affective Computing.   InTech, 2008.
  • [18] M. Liu, S. Li, S. Shan, and X. Chen, “Au-inspired deep networks for facial expression feature learning,” Neurocomputing, vol. 159, pp. 126–136, 2015.
  • [19] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [20] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on.   IEEE, 2010, pp. 94–101.
  • [21] P. Ekman and W. Friesen, “Facial action coding system: a technique for the measurement of facial movement,” Palo Alto: Consulting Psychologists, 1978.
  • [22] K. Anderson and P. W. McOwan, “A real-time automated system for the recognition of human facial expressions,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 1, pp. 96–105, 2006.
  • [23] J. Li and M. Oussalah, “Automatic face emotion recognition system,” in Cybernetic Intelligent Systems (CIS), 2010 IEEE 9th International Conference on.   IEEE, 2010, pp. 1–6.
  • [24] Y. Xing and W. Luo, “Facial expression recognition using local gabor features and adaboost classifiers,” in Progress in Informatics and Computing (PIC), 2016 International Conference on.   IEEE, 2016, pp. 228–232.
  • [25] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim, “Joint fine-tuning in deep neural networks for facial expression recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2983–2991.
  • [26] J. Jang, D. H. Kim, H.-I. Kim, and Y. M. Ro, “Color channel-wise recurrent learning for facial expression recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on.   IEEE, 2017, pp. 1233–1237.
  • [27] K. Talele, A. Shirsat, T. Uplenchwar, and K. Tuckley, “Facial expression recognition using general regression neural network,” in Bombay Section Symposium (IBSS), 2016 IEEE.   IEEE, 2016, pp. 1–6.
  • [28] B. Huang and Z. Ying, “Sparse autoencoder for facial expression recognition,” in Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom), 2015 IEEE 12th Intl Conf on.   IEEE, 2015, pp. 1529–1532.
  • [29] O. Gupta, D. Raviv, and R. Raskar, “Multi-velocity neural networks for facial expression recognition in videos,” IEEE Transactions on Affective Computing, 2017.
  • [30] A. Makhzani and B. Frey, “K-sparse autoencoders,” arXiv preprint arXiv:1312.5663, 2013.
  • [31] Q. Xu, C. Zhang, L. Zhang, and Y. Song, “The learning effect of different hidden layers stacked autoencoder,” in Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2016 8th International Conference on, vol. 2.   IEEE, 2016, pp. 148–151.
  • [32] P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer vision, vol. 57, no. 2, pp. 137–154, 2004.
  • [33] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1.   IEEE, 2005, pp. 886–893.
  • [34] B. Leng, S. Guo, X. Zhang, and Z. Xiong, “3d object retrieval with stacked local convolutional autoencoder,” Signal Processing, vol. 112, pp. 119–128, 2015.
  • [35] A. E. Omer and A. Khurran, “Facial recognition using principal component analysis based dimensionality reduction,” in Computing, Control, Networking, Electronics and Embedded Systems Engineering (ICCNEEE), 2015 International Conference on.   IEEE, 2015, pp. 434–439.
  • [36] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, and D. N. Metaxas, “Learning active facial patches for expression analysis,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 2562–2569.
  • [37] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learning-based classification of hyperspectral data,” IEEE Journal of Selected topics in applied earth observations and remote sensing, vol. 7, no. 6, pp. 2094–2107, 2014.
  • [38] J. Zabalza, J. Ren, J. Zheng, H. Zhao, C. Qing, Z. Yang, P. Du, and S. Marshall, “Novel segmented stacked autoencoder for effective dimensionality reduction and feature extraction in hyperspectral imaging,” Neurocomputing, vol. 185, pp. 1–10, 2016.
  • [39] M. J. Cossetin, J. C. Nievola, and A. L. Koerich, “Facial expression recognition using a pairwise feature selection and classification approach,” in Neural Networks (IJCNN), 2016 International Joint Conference on.   IEEE, 2016, pp. 5149–5155.
  • [40] N. B. Kar, K. S. Babu, and S. K. Jena, “Face expression recognition using histograms of oriented gradients with reduced features,” in Proceedings of International Conference on Computer Vision and Image Processing.   Springer, 2017, pp. 209–219.