Human Face Expressions from Images - 2D Face Geometry and 3D Face Local Motion versus Deep Neural Features

by   Rafal Pilarczyk, et al.
Politechnika Warszawska

Several computer algorithms for recognition of visible human emotions are compared in a web camera scenario using the CNN/MMOD face detector. The recognition refers to four face expressions: smile, surprise, anger, and neutral. At the feature extraction stage, the following three concepts of face description are compared: (a) static 2D face geometry represented by its 68 characteristic landmarks (FP68); (b) dynamic 3D geometry defined by motion parameters for eight distinguished face parts (denoted as AU8) of the personalized Candide-3 model; (c) static 2D visual description as a 2D array of gray scale pixels (known as the facial raw image). At the classification stage, the performance of two major models is analyzed: (a) support vector machine (SVM) with kernel options; (b) convolutional neural network (CNN) with a variety of relevant tensor processing layers and blocks of them. The models are trained on frontal views of human faces while they are tested on arbitrary head poses. For geometric features, the success rate (accuracy) indicates a nearly triple increase of performance of CNN with respect to SVM classifiers. For raw images, CNN outperforms in accuracy its best geometric counterpart (AU/CNN) by about 30 percent, while the best SVM solutions are inferior nearly four times. For the F-score, a high advantage of raw/CNN over geometric/CNN and geometric/SVM is observed as well. We conclude that, contrary to CNN based emotion classifiers, the generalization capability of SVM based emotion classifiers with respect to human head pose is poor.




1 Introduction

Recognition of human facial expressions is a useful functionality in computer applications based on human-computer interfacing (HCI). The algorithmic background for such systems belongs to artificial intelligence (AI) in general, with a strong share of computer vision research and development.

Facial expressions, as a natural non-verbal means of communication, convey human feelings. This language developed during human evolution as an irreplaceable tool of mutual communication.

Research of the past centuries, including recent findings of psychologists and physiologists, suggests that visible emotions, regardless of sex, nationality, or religion, can be classified into six basic forms (Ekman and Friesen [1]): happiness, anger, fear, disgust, surprise, and sadness. In recognition systems, the neutral category is appended for completeness.

The automatic recognition of facial expressions has attracted many researchers in the recent years of image analysis development. For instance, already in 1998 the authors from the well-known Pittsburgh group (Lien, Kanade, Cohn, Li [2]) compared four types of facial expressions and reported the accuracy obtained for the frontal pose. However, their algorithm recognizes only upper face expressions in the forehead and brow regions. Interestingly, the facial expressions are represented by combinations of action units defined years earlier as the Facial Action Coding System (FACS) by Ekman and Friesen in [3]. The method uses a Hidden Markov Model (HMM) to recognize "words" of action units.

The FACS concept had a significant impact on the graphics standardization conducted within the MPEG-4 group on the 3D model of the human head. The work led to the Candide-3 model [4], including numerical values for model vertices and mesh indexes for the model's standard geometry, shape units for geometry individualization, and action units for model animation. Introducing the Candide model proved to be an important step in face image analysis via 3D models. At that time (year 2001), the missing factor was a tool for detecting facial landmarks in a fast and accurate way. The general opinion was that such a task was beyond the computing technology of the day.

The image analysis research was facing then the mythical "curse of high dimensionality modeling" and the optical/pixel flow tools were the evidence of the crisis.

However, after a decade both the computing technology and the algorithms had made enormous progress, and obtaining 3D models of real human heads in real time became a reality in human-computer interfaces (as shown for instance in [5]). At nearly the same time the "neural revolution" arrived in the form of deep neural networks (DNN), dramatically changing the status of "intelligent applications", including the recognition of visible emotions in real time.

Deep learning approaches, particularly those using convolutional neural networks (CNN), have been successful at computer vision tasks because they achieve good generalization from multidimensional data [6, 7, 8, 9]. Facial expression and emotion recognition with deep learning methods was used in [10, 11, 12, 13, 14, 15]. Moreover, a DNN was jointly learned with an SVM by Tang [14] for emotion classification. There is a large number of databases for facial expression recognition from images, as well as sophisticated recognition methods used in the wild.

2 Geometric feature extraction

Facial image analysis usually starts with a face detector. In our research we use the face detector of King [16] available in the dlib library [17]. It is a novel proposal which appears to be more accurate than the Viola and Jones face detector [18] (cf. Fig.1), which is available in the OpenCV library and is still being improved by retraining on new data sets.

Figure 1: Face detectors: (left) Viola and Jones algorithm (opencv); (right) King’s method (dlib).

The face detector offered in dlib is an instance of more general object detector which can be tuned for various object classes and various feature extractors.

The generality of the maximum margin object detector (MMOD) in dlib is based on concepts similar to those proposed by Joachims et al. [19] when developing the ideas of the Structural Support Vector Machine (SSVM).

Compared to the Viola and Jones algorithm for face detection, instead of the boosted cascade of weak classifiers based on ensembles of local region contrasts, King's proposals offer both the HOG features (HOG/MMOD), i.e. the Histogram of Oriented Gradients [20], and features extracted by a trained neural network (CNN/MMOD). The SVM model is trained by solving a max-margin convex optimization problem defined for collections of image rectangles.

The sophisticated method of selecting image rectangles with objects of interest avoids the combinatorial explosion by the SSVM trick, where only the worst constraints are taken into the relevant quadratic optimization, and by a smart heuristic which in a greedy way yields suboptimal rectangle configurations for a complex but convex risk function.

2.1 Facial characteristic landmarks

At the feature extraction stage, the following three concepts of face description are confronted: (a) static 2D face geometry represented by its 68 characteristic landmarks (FP68); (b) dynamic 3D geometry defined by motion parameters for eight distinguished face parts (denoted as AU8) of personalized Candide-3 model; (c) static 2D visual description as 2D array of gray scale pixels (known as facial raw image).

Figure 2: FP68 indexing (left) and detection boxes with FP68 points on faces (right).

The FP68 detector is also implemented in the dlib library [17]. The detector exploits many regression trees for 68 HOG features mapping. The cascade approach, using many small regression trees, gives a more effective detector than using one large regression model. The trees are built using the stochastic gradient boosting of Friedman [21].

On the other hand, the dlib FP68 detector is unsatisfactory for detecting landmarks on non-frontal face images. In most cases the landmarks are marked incorrectly, especially for the test database. We therefore switch to a CNN-based FP68 regressor [22] built on top of MobileNet-V1 [23] and trained on the public datasets 300W, 300VW, IBUG [24, 25, 26, 27, 28] in order to extract the missing facial landmarks.

2.2 Candide-3 model personalization and animation

Candide-3 model (cf. Fig.3) consists of static human head geometry specified by a discrete set of 3D points (also known as mean Candide shape) with coordinates normalized to the spatial cube and the dynamic geometry specified via motion vectors assigned to distinguished groups of 3D points (also known as shape units and action units).

Figure 3: Candide-3 wire-frame model (XY view), the 3D model with the removed hidden invisible lines (with some triangles removed), and the motion of two action units (AU10, AU20) to be identified via 3D model controlled by FP68 points.

2.2.1 Concept of action and shape units

The Candide-3 model can be scaled independently in shape and action units, i.e. a local motion can be performed. While the shape units can be used to personalize the face features (e.g. mouth width) the dynamic scaling of action units enables simulation of local face motions typical for face expressions [29].

After local motions, global rotation is applied followed by global scaling and global translation. While local motions (personalization and animation) are performed in Cartesian coordinate system of Candide-3, the global scaling and translation are usually used to transfer the model to an observer coordinate system. If the observer is a web camera then the face image can be used to identify parameters of local motions provided we have a fast algorithm identifying in the image the points which convey the local motion.

In the identification of local motion parameters we have to know also the global affine motion parameters (rotation, scaling, and translation). To identify the global motion we need also points of model which do not belong to groups of shape and action units. Fortunately, we can select 37 facial salient points out of FP68 landmarks to be used to minimize the modeling error.

Figure 4: Average 2D shapes of training images: for neutral class 2D shape in the form of 68 detected facial salient points; for smile class 2D shape in the form of 68 detected facial salient points with vectors from neutral to smile; for angry class 2D shape in the form of 68 detected facial salient points with vectors from neutral to angry; for surprised class 2D shape in the form of 68 detected facial salient points with vectors from neutral to surprised.

In Fig.4 we see that the major information about facial muscle movements is located at the eyes, eyebrows, mouth and jaw. Therefore action units are assigned to these facial parts. The action units having an impact on visible emotion modeling are given in Tab.1. A particular face expression is always a linear combination of many action units:

  • smile: ,

  • anger: ,

  • surprise: .

Index  Description           Index  Description
1      Inner brow raiser     5      Upper lid raiser
2      Outer brow raiser     7      Lid tightener
4      Brow lowerer          9      Nose wrinkler
26     Jaw drop              12     Lip corner puller
25     Lips part             10     Upper lip raiser
27     Mouth stretch         15     Lip corner depressor
16     Lower lip depressor   20     Lip stretcher
Table 1: Action units considered for muscle movements for visible emotions.

Since in the identification of action unit parameters we can establish the correspondence of only 37 facial landmarks to those Candide head points which have a significant impact on facial expressions, we select only eight action units to be identified: jaw drop (AU 26/27), brow lowerer (AU 4), lip corner depressor (AU 13/15), upper lip raiser (AU 10), lip stretcher (AU 20), lid tightener (AU 7), nose wrinkler (AU 9), eye closed (AU 42/43/44/45).

While the action units approximate the muscle movements at visible expressions, the shape units indicate the individual (personal) differences in the location and size of facial components. By registering the shape units, we can distort the original mean Candide-3 model into an individualized form [30], so that the action unit parameters we identify are much less influenced by personal information. In the personalization process we select the most trusted coefficients out of the shape units, including the following seven elements: eyebrows vertical position (SU 1), eyes vertical position (SU 2), eyes width (SU 3), eyes separation distance (SU 4), mouth vertical position (SU 5), mouth width (SU 6), eyes vertical differences (SU 7). Fig.5 illustrates the selected shape and action units.

Figure 5: Examples: on the left the distorted shape units (SU 1, SU 6) and on the right the moved action units (AU 26/27, AU 13/15).

2.2.2 Candide modeling from images – theory

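The dynamic 3D modeling of the human head is represented by the parameters of the transformation applied to the Candide model. The global motion model has the following form:

$$ P'_i = s R \Big( P_i + \sum_{d \in D_s:\, i \in I_d} \sigma_d\, u_{d,i} \Big) + t, \qquad i \in I_g, $$

where $i$ is the index of the point $P_i$ in the Candide-3 model, with $I_g$ being the index set of the points of global estimation; $s$, $R$ and $t$ are the global scaling, the rotation matrix and the translation; $I_d$ is the index set of points for the deformation $d \in D_s$, where $D_s$ is a list of indexes for the selected shape units to be used for personalization (individualization); $u_{d,i}$ is the unit deformation vector, being the column of the deformation matrix which is assigned to the point $i$ at the deformation $d$, and $\sigma_d$ is its coefficient; the notation $d \in D_s:\, i \in I_d$ selects for the summation only those deformations which refer to the point $i$.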

Figure 6: 3D modeling for global estimation and personalization – projection error shown on the right.

Having the global motion parameters and the local deformations for personalization, we can relate the local motion of action units to the globally moved Candide points (vertices):

$$ P''_i = P'_i + s R \sum_{a \in A:\, i \in I_a} \alpha_a\, u_{a,i}, $$

where $P'_i$ is the globally moved and personalized point and $i$ ranges over all 3D points corresponding to landmarks. The scale parameter $s$ and the rotation matrix $R$ are the same as those used in global estimation. $I_a$ is the index set of points for the local animation $a \in A$, where $A$ is the set of action units, while $u_{a,i}$ is the unit deformation vector and $\alpha_a$ is its coefficient.
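The order of operations described above (local deformations first, then global rotation, scaling and translation) can be illustrated with a minimal sketch. The function name `move_point`, the dictionary layout and the sample numbers are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[r][c] * v[c] for c in range(3)) for r in range(3)]

def move_point(point, unit_vectors, coeffs, s, R, t):
    """Deform one model point locally, then rotate, scale and translate it.

    unit_vectors: {unit_name: 3D unit deformation vector for this point};
                  only units that actually touch the point are listed.
    coeffs:       {unit_name: scalar deformation coefficient}.
    """
    local = list(point)
    for name, u in unit_vectors.items():
        for k in range(3):
            local[k] += coeffs[name] * u[k]      # local motion in model coordinates
    rotated = mat_vec(R, local)                  # global rotation
    return [s * rotated[k] + t[k] for k in range(3)]  # global scaling and translation

# Hypothetical example: one shape unit stretches the point along x,
# then the model is turned by 90 degrees about the z axis.
R90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
moved = move_point((1.0, 0.0, 0.0), {"SU6": (0.5, 0.0, 0.0)}, {"SU6": 2.0},
                   s=3.0, R=R90, t=(1.0, 1.0, 0.0))
```

The same routine serves shape units (personalization) and action units (animation), since both enter the model as weighted sums of unit deformation vectors.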

Figure 7: 3D modeling with action units after global estimation and personalization – projection error shown on the right.

The main function of the optimization package is to identify the transformation parameters (local deformations for action units and shape units, global scaling, rotation, and translation) which map the Candide-3 model onto the current face model. To this end:

  1. Core 3D points for global motion are selected:

  2. Points for global estimation and individualization are selected from core 3D points:

  3. Indexes of deformation points for shape units are joined to core points:

  4. Active 2D points of facial salient points FP68 having corresponding points in core and deformation points, are selected.

  5. Active 3D points are specified as those points of which correspond to active 2D points:

  6. Number of active points is registered:

  7. The centroid for Candide model is computed:

For the current FP68 shape , the initial values of motion parameters with respect to Candide-3 shape are found using general formulas (11):

  1. Distortion coefficients and rotation:

  2. Scaling :


    where the 2D/3D centered shapes are defined as follows:

  3. Translation


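The error function is defined as:

$$ E(s, r, t, \delta) = \sum_{k=1}^{n} \Big\| p_{j(k)} - \Big[ s\, \Pi\Big( R(r) \Big( P_{i(k)} + \sum_{d:\, i(k) \in I_d} \delta_d\, u_{d,i(k)} \Big) \Big) + t \Big] \Big\|^2 \qquad (7) $$

where $s$ is the scaling parameter; $r$ is the vector representation of the rotation matrix $R(r)$ (see the inverse Rodrigues formulas (9) below); $t$ is the translation vector in the $xy$ plane; $\delta$ are the parameters of local deformations; $j(k)$ is the active index of the 2D point corresponding to the active index $i(k)$ of the 3D point, with $n$ active points; $\Pi$ denotes the orthographic projection onto the $xy$ plane.

The LMM (Levenberg Marquardt Method) optimization procedure is performed for the error function defined by equation (7) with the initialization described above.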

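The function computing the orthographic projection uses the current transformation parameters. The rotation is represented by a 3D vector $r$ encoding the rotation angle $\theta = \|r\|$ in radians and the rotation axis $u = r / \|r\|$. The rotation matrix is found from the Rodrigues formula.

Namely, let $R$ be the rotation matrix for the rotation axis $u$ and the rotation angle $\theta$. If $\theta = 0$ then $R = I$, otherwise

$$ R = \cos\theta\, I + (1 - \cos\theta)\, u u^T + \sin\theta\, [u]_\times, \qquad [u]_\times = \begin{pmatrix} 0 & -u_z & u_y \\ u_z & 0 & -u_x \\ -u_y & u_x & 0 \end{pmatrix}. $$

Note that the rotation angle and the rotation axis can be recovered from the rotation matrix by the inverse Rodrigues formulas. They follow directly from the linearity of the trace and transposition operations for matrices:

$$ \mathrm{tr}\, R = 1 + 2\cos\theta, \qquad R - R^T = 2 \sin\theta\, [u]_\times. \qquad (9) $$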
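As an illustration of the rotation vector representation, the following minimal sketch converts a rotation vector to a rotation matrix with the standard Rodrigues formula and back using the trace and the antisymmetric part of the matrix. The function names are chosen here for illustration, and the inverse is only meant for angles strictly between 0 and pi, where the sine of the angle does not vanish.

```python
import math

def rodrigues(r):
    """Rotation vector r (axis times angle, in radians) -> 3x3 rotation matrix."""
    theta = math.sqrt(sum(c * c for c in r))
    if theta < 1e-12:                      # zero rotation: identity matrix
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    u = [c / theta for c in r]             # unit rotation axis
    K = [[0.0, -u[2], u[1]],               # cross-product (skew-symmetric) matrix of u
         [u[2], 0.0, -u[0]],
         [-u[1], u[0], 0.0]]
    c, s = math.cos(theta), math.sin(theta)
    return [[c * (i == j) + (1.0 - c) * u[i] * u[j] + s * K[i][j]
             for j in range(3)] for i in range(3)]

def inverse_rodrigues(R):
    """Rotation matrix -> rotation vector, assuming 0 <= angle < pi."""
    trace = R[0][0] + R[1][1] + R[2][2]
    theta = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    if theta < 1e-12:
        return [0.0, 0.0, 0.0]
    k = theta / (2.0 * math.sin(theta))    # R - R^T equals 2 sin(theta) times K
    return [k * (R[2][1] - R[1][2]),
            k * (R[0][2] - R[2][0]),
            k * (R[1][0] - R[0][1])]

r = [0.1, -0.2, 0.3]
r_back = inverse_rodrigues(rodrigues(r))   # round trip recovers r up to rounding
```

Optimizing over the three components of the rotation vector, rather than the nine matrix entries, keeps the rotation constraint satisfied automatically during Levenberg Marquardt iterations.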


2.2.3 Affine equations for affine motion initialization

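For orthographic viewing, the affine motion of a 3D cloud of points without rotation can be considered as the affine motion of a 2D model $(x, y)$. Then we scale by $s$ and translate by $(t_x, t_y)$ in order to get nearest to our target cloud of points $(x', y')$. In our case the model and the target are the Candide model and the FP68 landmarks, both restricted to still points only. With the word nearest expressed by the least square error, we get the following set of equations.

Given four vectors $x, y, x', y' \in \mathbb{R}^n$, we are looking for three parameters $s, t_x, t_y$ such that

$$ (s, t_x, t_y) = \arg\min \Big( \| s\,x + t_x \mathbf{1} - x' \|^2 + \| s\,y + t_y \mathbf{1} - y' \|^2 \Big). $$

Then the centering trick leads to the solution:

$$ s = \frac{x_c^T x'_c + y_c^T y'_c}{\|x_c\|^2 + \|y_c\|^2}, \qquad t_x = \bar{x}' - s\,\bar{x}, \qquad t_y = \bar{y}' - s\,\bar{y}, \qquad (11) $$

where $\bar{x}$ denotes the mean of the components of $x$ and $x_c = x - \bar{x}\mathbf{1}$ is the centered vector (likewise for $y$, $x'$, $y'$).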
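The centering trick can be checked numerically with a short sketch. It fits a single scale and a 2D translation in the least squares sense by first subtracting the centroids of both point sets; the function name and the sample points are hypothetical.

```python
def fit_scale_translation(model_xy, target_xy):
    """Least squares fit of target ~ s * model + (tx, ty) for paired 2D points."""
    n = len(model_xy)
    mx = sum(p[0] for p in model_xy) / n       # model centroid
    my = sum(p[1] for p in model_xy) / n
    qx = sum(q[0] for q in target_xy) / n      # target centroid
    qy = sum(q[1] for q in target_xy) / n
    # After centering, the translation drops out and the scale is a ratio of sums.
    num = sum((p[0] - mx) * (q[0] - qx) + (p[1] - my) * (q[1] - qy)
              for p, q in zip(model_xy, target_xy))
    den = sum((p[0] - mx) ** 2 + (p[1] - my) ** 2 for p in model_xy)
    s = num / den
    return s, (qx - s * mx, qy - s * my)

# Hypothetical check: the target is the model scaled by 2 and shifted by (3, -1).
model = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
target = [(3.0, -1.0), (5.0, -1.0), (3.0, 1.0)]
s, (tx, ty) = fit_scale_translation(model, target)
```

Such a closed-form estimate is cheap, which makes it a natural starting point for the iterative optimization of the full error function.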


2.2.4 Candide model personalization and local motion normalization

Candide model personalization is performed using data from the neutral expression. From a small number of image frames we get the second order statistics for each shape unit coefficient. Then a Gaussian approximation can be used to define the coefficient trust or distrust measure. The distrust measure used to sort our statistics is based on the cumulative probability distribution in favor of the zero value, i.e. for a coefficient with mean $\mu$ and standard deviation $\sigma$ we compute the probability of those real values which are closer to zero than to the mean value:

$$ \mathrm{distrust}(\mu, \sigma) = \Phi\!\left( -\frac{|\mu|}{2\sigma} \right), $$

where $\Phi$ is the cumulative probability distribution of the standard Gaussian $N(0, 1)$.
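A minimal sketch of such a distrust measure follows. It assumes that a value is "closer to zero than to the mean" when it lies on the zero side of the midpoint between 0 and the mean, so the probability is a Gaussian tail evaluated at half the mean. The function names and the sample statistics are hypothetical and only illustrate the sorting step.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def distrust(mean, std):
    """Probability that a N(mean, std^2) sample is closer to 0 than to the mean."""
    return normal_cdf(-abs(mean) / (2.0 * std))

# Hypothetical (mean, std) statistics of three shape unit coefficients.
stats = {"SU1": (0.40, 0.05), "SU2": (0.02, 0.10), "SU3": (-0.30, 0.10)}
ranking = sorted(stats, key=lambda name: distrust(*stats[name]))  # most trusted first
```

A coefficient whose mean is far from zero relative to its spread receives a small distrust value, while a coefficient with zero mean reaches the maximal value of 0.5.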

In Tab.2 the shape coefficients are sorted by the distrust measure. The data is acquired for the person with photo given in Fig.2. All Gaussians characterizing personal face local proportions are drawn in Fig.8.

Table 2: Statistics for deformation coefficient to fit Candide model to one of authors. There are seven highly trusted deformations with one additional found below the threshold of distrust. We fix threshold level to  (the single line horizontal rule).
Figure 8: Gaussian distributions for the deformation coefficients fitting the Candide model to one of the authors.
Normalization for the action units feature vectors

The normalization of the action units feature vectors changes the statistics of the animation coefficients to zero mean and unit variance. Considering that the motion vectors defined for each action unit do not share the same length, it is clear that the coefficients of some facial components get large values while others are less significant:

$$ \hat{a} = \frac{a - \mu}{\sigma}, $$

where $\hat{a}$ is the normalized feature vector, $\mu$ is the mean value of each action unit's values, and $\sigma$ is the standard deviation of each action unit's values, respectively (the operations are taken component-wise).
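The normalization amounts to a per-column standardization of the feature matrix, as in the sketch below. It assumes that the population standard deviation is used and that no action unit is constant over the data; both are assumptions, as is the function name.

```python
import statistics

def normalize_columns(rows):
    """Standardize each column (action unit) to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]   # population std, assumed > 0
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Two hypothetical feature vectors with two action unit coefficients each.
features = [[1.0, 10.0], [3.0, 30.0]]
normalized = normalize_columns(features)
```

In practice the means and deviations would be estimated on the training data only and reused unchanged for the test data.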

3 Datasets for training and testing of classifiers

We run the experiment with the classical personalized SVM emotion classifier as the prior-work baseline. For personalization we choose datasets which include the neutral face state. Moreover, to make the experiment more reliable and to see how different facial features affect the results, we choose a test dataset that also contains non-frontal face poses (Fig.9(b)), which are not present in the training data (Fig.9(a)). This selection shows what generalization is obtained using different classification techniques. For this reason, we do not use open datasets in which we cannot find neutral faces for personalization as a step of the structured SVM classifier. We use three different databases for training and evaluation. The data selected from them cover four different emotions for each person.

(a) Frontal image from training dataset
(b) 45 degrees pose face image from test dataset
Figure 9: Difference between training pose and testing pose image

The first database we use is the Cohn-Kanade dataset established by Lucey et al. [31], a dataset annotated with action units and emotion-specified expressions. It includes approximately 2000 video sequences of around 200 people between the ages of 18 and 50. It consists of 69% samples from females and 31% samples from males. 65% of them are taken from white people, 15% from black people and 20% from Asian or Latin American people. Every video sequence of an individual expression runs from the neutral face to the maximal deformation of the face for a certain expression, which provides us with accurate data for action unit extraction.

Figure 10: Different face poses (respectively 180, 135, 90, 45, 0 degrees) – frontal pose is repeated for three different emotions.

The second dataset, the MUG facial expression database developed by Aifanti et al. [32], consists of image sequences of 86 subjects recorded performing facial expressions: 35 females and 51 males, all of Caucasian origin, between 20 and 35 years of age. The performance of a particular expression is associated with different combinations of AUs, e.g., fear can be imitated with the mouth closed and only the upper part of the face moving, or with the lips stretched.

Figure 11: Cropped training examples for 4 different classes - neutral, angry, smile and surprised

The third dataset, RaFD, which we use for testing, was established by Langner et al. [33]. It provides not only 67 models (including Caucasian males and females, Caucasian children, both boys and girls, and Moroccan Dutch males), but also samples of each expression taken at different angles from the same subject, so the dataset is more challenging and tests the performance of our recognition algorithms in practical conditions. RaFD is a high quality dataset and contains pictures of eight emotions. Each emotion is shown with three different gaze directions, and the pictures are taken from five camera angles. We use only three camera angles (90, 45 and 135 degrees) due to limitations of the face detector and the facial landmark extraction, which are not reliable at 180 and 0 degrees (Fig. 10).

In total, 6409 image samples from the MUG dataset and the Cohn-Kanade dataset were used as the training subset (Fig.11), containing 2079 "neutral" samples, 2149 "smile" samples, 925 "angry" samples, and 1256 "surprised" samples, respectively, while 2412 samples from the RaFD dataset [33], equally divided into these four classes, are used for testing.

4 Classification

We prepare a set of classifiers to recognize emotions using both facial feature vectors, such as AU8 and FP68, and grayscale face images.

For facial expression geometric data, the comparison of SVM classifiers and CNN classifiers is performed directly. We train and test them on the same data.

4.1 AU8 and FP68 classifiers

The Support Vector Machine (SVM) finds, in a high dimensional space, the hyperplane separating the feature vectors with the maximum margin. The SVM classifier has proven to be accurate in image processing and machine learning when the number of feature samples is limited.

For our experiments we use two different implementations of SVM: structured SVM (SSVM) and SVM using one-against-one decision function with polynomial kernel (SVMpoly).
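The one-against-one decision function can be sketched as a voting scheme: one binary classifier is considered for each pair of classes, and the class collecting the most pairwise wins is returned. In the sketch below the binary classifiers are replaced by a hypothetical callback `pairwise_decision`, and ties are broken by the order of the class list; both choices are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

CLASSES = ["neutral", "smile", "angry", "surprised"]

def one_vs_one_predict(x, pairwise_decision):
    """Return the class with the most wins over all class pairs.

    pairwise_decision(a, b, x) >= 0 means the (a, b) classifier votes for a.
    """
    votes = Counter()
    for a, b in combinations(CLASSES, 2):          # 6 pairs for 4 classes
        votes[a if pairwise_decision(a, b, x) >= 0 else b] += 1
    return max(CLASSES, key=lambda c: votes[c])    # first class wins a tie

# Hypothetical stand-in for trained binary SVMs: compare per-class scores.
def toy_decision(a, b, x):
    return x[a] - x[b]

sample = {"neutral": 0.1, "smile": 0.9, "angry": 0.2, "surprised": 0.3}
label = one_vs_one_predict(sample, toy_decision)
```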

To compare the SVM and CNN classification techniques, sets of simple neural networks adapted to classify emotions were also created with AU and FP68 facial features as input.

We present the neural architectures in the symbolic notation for tensor neural networks, with the BNF grammars defined in the paper [34].

For AU8, where the architecture options are limited due to the low dimensionality of the input data, a few kinds of two-tier neural networks have been checked. We observed that enlarging those networks and adding nonlinear activations lead to model over-fitting.

AU8 au8 := 8_a; optima := [loss, AdaM, SoftMax]

For FP68, the architecture giving the best results was established using a cross-validation process. Deepening the neural network, modifying the activation function, or changing the tensor dimensions improves the results neither for training nor for testing data. However, the dropout regularization technique prevents over-fitting.

The multilayer perceptrons are trained with the Adam optimizer with a starting learning rate of 0.001, which is reduced by a factor of two when the validation metric stops improving.

FP68 fp68 := 136_a; optima := [loss, AdaM, SoftMax]
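The learning rate schedule described above can be sketched as a small plateau detector that halves the rate when the validation metric has not improved for a number of epochs. The class name and the `patience` parameter are assumptions; the source only states the starting rate of 0.001 and the reduction by a factor of two.

```python
class HalveOnPlateau:
    """Halve the learning rate when the validation metric stops improving."""

    def __init__(self, lr=0.001, patience=3):
        self.lr = lr
        self.patience = patience        # assumed: epochs to wait before halving
        self.best = float("-inf")
        self.stale = 0

    def step(self, val_metric):
        """Register one epoch's validation metric (higher is better)."""
        if val_metric > self.best:
            self.best = val_metric
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr /= 2.0
                self.stale = 0
        return self.lr

scheduler = HalveOnPlateau(lr=0.001, patience=2)
rates = [scheduler.step(m) for m in [0.50, 0.60, 0.60, 0.59, 0.70]]
```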

4.2 Image based emotion classifiers

We train neural networks on grayscale images for the 4-class emotion recognition problem. We expect that features trained on a small training dataset can be at least as good as the features determined by analytic methods.

Testing and training data require a common face detector to crop the facial image. From the dlib library [17] we choose the CNN/MMOD neural face detector, which is more robust to different face poses than the HOG/MMOD face detector.

4.2.1 Image augmentation

The original image training dataset is augmented by performing affine transforms, scaling, cropping, changes of lighting and contrast, and by adding Gaussian noise. The augmentation, performed with the imgaug library [35], makes the models more robust to changes of the head pose, which occur in the test set (Fig. 13).

We define the augmentation procedure as a list of image processing operations applied to the training images (Fig. 12). The augmentation consists of a stochastic set of procedures. Some of them are applied with a particular probability. The order of the operations is also randomized to provide a better mixture of training data.

(a) Input image
(b) Cropping
(c) Flipping
(d) Blurring
(e) Contrasting
(f) Additive noise
(g) Affine transform
(h) All augmentations
Figure 12: Image augmentation results, final augmentation consists of all operations applied in random order
  1. Vertical axis symmetry is applied with probability ;

  2. Cropping randomly rows and columns of the image;

  3. Gaussian blur is randomly applied with probability for

  4. Contrast normalization

  5. Additive Gaussian Noise

  6. Affine transform with random matrix in the uniform pixel coordinates representing the composition of the following basic transformations:

    • scaling by

    • translating ,

    • rotating by

    • shearing by
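The stochastic structure of the procedure (each operation applied with its own probability, in a random order) can be sketched in plain Python on a small grayscale image stored as a list of rows. The probabilities, the noise level and the crop size below are placeholders, not the values used by the authors, and only three of the listed operations are mimicked. In imgaug, a comparable structure is expressed with `iaa.Sequential(..., random_order=True)` and `iaa.Sometimes`.

```python
import random

def flip_lr(img, rng):
    """Mirror the image about its vertical axis."""
    return [row[::-1] for row in img]

def add_noise(img, rng, sigma=5.0):
    """Add Gaussian noise and clip to the 0..255 gray level range."""
    return [[min(255.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in img]

def crop(img, rng, max_px=1):
    """Remove up to max_px rows from the top and columns from the left."""
    top, left = rng.randint(0, max_px), rng.randint(0, max_px)
    return [row[left:] for row in img[top:]]

def augment(img, rng, steps):
    """Apply (probability, operation) pairs in a random order."""
    order = list(steps)
    rng.shuffle(order)
    for prob, op in order:
        if rng.random() < prob:
            img = op(img, rng)
    return img

image = [[0.0, 50.0, 100.0], [150.0, 200.0, 250.0]]
rng = random.Random(0)
augmented = augment(image, rng, [(0.5, flip_lr), (0.5, crop), (1.0, add_noise)])
```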

Figure 13: Cropped faces used for training phase after augmentation procedure

4.2.2 Neural networks

We also test convolutional neural networks on different cropped face image sizes: 150x150, 75x75 and 50x50. This experiment shows the general impact of resolution on the performance. For each image size the best architecture is chosen (Tab.3). The detailed architectures consist of several convolutional, max pooling, dropout and fully connected layers.

The CNN-1 network is constructed of convolutional layers with batch normalization and the non-linear ReLU activation unit. After the last convolution layer, global average pooling is used. The network does not contain a fully connected layer thanks to the preceding convolution layer, which defines four feature maps.

yximg50 516pb 2_σ516pbr 532pb 2_σ532pbr 364pb 2_σ364pbr 164pb
                  2_σ3128pbr 1256pb 2_σ3128p 1256pb 2_σ34p ga
cnn-1 img50 := 50_yx; optima := [loss, AdaM, SoftMax]
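The role of global average pooling in a network without a fully connected layer can be shown with a minimal sketch: each of the four final feature maps is averaged to a single number, and the SoftMax of these four numbers gives the class probabilities. The feature map values below are hypothetical.

```python
import math

def global_average_pooling(feature_maps):
    """Average every 2D feature map to one scalar."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

def softmax(logits):
    """Numerically stable SoftMax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Four hypothetical 2x2 feature maps, one per emotion class.
maps = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 3.0], [2.0, 2.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.0, 1.0], [1.0, 0.0]],
]
logits = global_average_pooling(maps)
probabilities = softmax(logits)
```

Because the pooled values are used directly as class scores, the number of feature maps in the last convolution layer must equal the number of classes.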

The CNN-2 network is inspired by the Xception architecture [36]. It contains cast adder blocks with depth-wise separable convolution layers. Global average pooling is also used, in the same way as in the network CNN-1.

yximg75 38pbr 38pbr
                  2_σ116p r| 316ps_dbr 316ps_db 2_σ3m 2_σ132p r| 332ps_dbr 332ps_db 2_σ3m
                  2_σ164p r| 364ps_dbr 364ps_db 2_σ3m 2_σ1128p r| 3128ps_dbr 3128ps_db 2_σ3m
                  34p ga
cnn-2 img75 := 75_yx; optima := [loss, AdaM, SoftMax]

The above unstructured form can be simplified by exploiting user defined units (cf. [34]). Note that the notation of the residual block is now generalized to the multi residual block, also known as the cast adder.

xcept 2_σ11_$p r| 31_$ps_dbr 31_$ps_db 2_σ3m xcept116 xcept232 xcept364 xcept4128
yximg75 38pbr 38pbr xcept1 xcept2 xcept3 xcept4 34p ga
cnn-2 img75 := 75_yx; optima := [loss, AdaM, SoftMax]

The above network CNN-2 exhibits results comparable to CNN-1 for testing data. However, its architecture is more complicated, which in this case leads to better generalization, measured by the difference between the performance for training and testing data: 0.029 for CNN-2 against 0.080 for CNN-1 at the same 75x75 resolution (cf. Tab.3).

The last network CNN-3 is built of convolutional, max pooling and dense layers with the first followed by dropout layer during the training stage.

yximg150 332br 2m 332br 2m 364br 2m 364br 2m 64r 50 4
cnn-3 img150 := 150_yx; optima := [loss, AdaM, SoftMax]

It is interesting that the architecture of CNN-3 is simpler than architecture of CNN-1. Apparently, it follows from density of image details for higher image resolution.

5 Experimental results

The statistical results compare SVM and DNN using different features. The computations share the same training and testing samples between SVM and DNN for the discriminative expression features. To analyze the details of the performance and to weigh the strengths and weaknesses of the different features and algorithms, we selected the accuracy, the Cohen's kappa value and the weighted $F_1$ score as the performance measures.

5.1 Accuracy

The accuracy takes the simple average success rate as the final score, counted by

$$ \mathrm{accuracy} = \frac{\text{number of correctly classified samples}}{\text{number of all samples}}. $$

Statistics from Tab.3 indicate that when dealing with the same discriminative features, both for AU and FP68, the DNN solutions are considerably more accurate on test data than the SVM solutions. We also use a cross-validation methodology to test the performance of the SVM classifiers and to select the best one. In Tab.4 we give the mean and the standard deviation of accuracy for the SVM classifiers. The statistics are collected from 30 different experiments. We observe that the standard deviation is small, so the performance of each model is stable across experiments.

With DNN classification, the AU results are more accurate than those for the pure geometric FP68 (0.754 against 0.611 on test data) and reach almost the same level as the simple CNN-1 50x50 result (0.763). Thanks to the robust capability of dealing with raw images as the input themselves, the DNN results peak at 87.7% (CNN-3), while using the classical features as input gives the lower accuracy of 75.4%.

                   Train data        Test data
Vectorized data    AU      FP68      AU      FP68
SSVM               0.838   0.800     0.411   0.335
SVM (poly)         0.824   0.611     0.442   0.404
DNN*               0.830   0.642     0.754   0.611

Image data         Train data        Test data
CNN-1 50x50x1      0.838             0.763
CNN-1 75x75x1      0.927             0.847
CNN-2 75x75x1      0.865             0.836
CNN-3 150x150x1    0.932             0.877

* Note: DNN: for each input data type there is a different architecture.

Table 3: Accuracy results for selected features
                   Train data        Test data
Mean of SR         AU      FP68      AU      FP68
SVM (poly)         0.799   0.605     0.426   0.388
SSVM               0.835   0.746     0.404   0.311

Standard deviation of SR
                   AU      FP68      AU      FP68
SVM (poly)         0.011   0.003     0.008   0.008
SSVM               0.002   0.038     0.004   0.027

Table 4: Standard deviation and mean of accuracy for SVM classifiers

5.2 Cohen’s Kappa results

Let $n_{ij}$ be the number of testing examples which belong to the class $i$ but are recognized as being from the class $j$, and let $N = \sum_{i,j} n_{ij}$. Beside the confusion matrix $[n_{ij}]$, the probability of detection $p_i$ for each class is estimated and Cohen's $\kappa$ coefficient is computed. We use the following formulas:

$$ p_i = \frac{n_{ii}}{\sum_j n_{ij}}, \qquad p_o = \frac{1}{N} \sum_i n_{ii}, \qquad p_e = \frac{1}{N^2} \sum_i \Big( \sum_j n_{ij} \Big) \Big( \sum_j n_{ji} \Big), \qquad \kappa = \frac{p_o - p_e}{1 - p_e}. $$

Cohen's kappa value is a statistical way of measuring the inter-rater agreement (accuracy) for the classes. Instead of only counting the percentage of correct predictions, Cohen's kappa value takes into account the possibility of a prediction being correct by chance. It means that $p_e$ is the probability of the random agreement and $p_o$ stands for the observed accuracy (agreement).
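Both the accuracy and Cohen's kappa can be computed from a confusion matrix, as in the following sketch. It assumes that rows index the true class and columns the recognized class; the matrix itself is a made-up example with four equally sized classes.

```python
def accuracy(cm):
    """Observed agreement: the diagonal mass of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def cohen_kappa(cm):
    """Cohen's kappa from a confusion matrix (rows: true, columns: recognized)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = accuracy(cm)
    # Chance agreement: products of matching row and column marginals.
    p_e = sum(sum(cm[i]) * sum(cm[r][i] for r in range(n))
              for i in range(n)) / total ** 2
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical balanced 4-class confusion matrix (10 samples per class).
cm = [[8, 1, 1, 0],
      [1, 7, 1, 1],
      [0, 2, 8, 0],
      [1, 1, 1, 7]]
acc = accuracy(cm)
kappa = cohen_kappa(cm)
```

For a test set with four equally sized classes the chance agreement is always 1/4, so kappa reduces to (accuracy - 0.25) / 0.75. This relation agrees, to within rounding, with the test columns of Tab.3 and Tab.5: for example an accuracy of 0.754 gives 0.672 and an accuracy of 0.877 gives 0.836.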

From the results in Tab.5 we observe that for AU input data the CNN algorithm exceeds the SVM solutions in the $\kappa$ measure by a wide margin (0.673 against 0.256 at best on test data), while for FP68 the Cohen's kappa is more than three times higher.

For raw image data the advantage of CNN solutions over geometric data changes with pixel resolution: the test $\kappa$ rises from 0.684 at 50x50 to 0.836 at 150x150, against 0.673 for the best geometric classifier. The conclusion about the higher generalization of the Xception-inspired architecture CNN-2 over the simpler architecture CNN-1 is valid for both measures: the accuracy and the Cohen's kappa.

                   Train data        Test data
Vectorized data    AU      FP68      AU      FP68
SSVM               0.772   0.718     0.215   0.112
SVM (poly)         0.748   0.430     0.256   0.204
DNN*               0.758   0.485     0.673   0.482

Image data         Train data        Test data
CNN-1 50x50x1      0.775             0.684
CNN-1 75x75x1      0.899             0.798
CNN-2 75x75x1      0.814             0.782
CNN-3 150x150x1    0.905             0.836

* Note: DNN: for each input data type there is a different architecture.

Table 5: Cohen's kappa results for selected features (SVM, DNN).

5.3 Weighted score evaluation

The performance measure (index) $F_\beta$, also known as the F-measure or F-score, is a weighted average of precision and recall. Using the $\beta$ parameter, this measure combines the precision $P$ and the recall $R$ into one performance measure:

$$ F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}. $$

We compute the $F_1$ score ($\beta = 1$), which reduces the F-score to the harmonic mean of the precision and recall measures, $F_1 = 2PR/(P + R)$. As expected, the weighted $F_1$ score again leads to conclusions similar to those we have seen for the accuracy and the Cohen's kappa measures.
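A minimal sketch of the F-score computation from a confusion matrix follows. It assumes that "weighted" means averaging the per-class scores with weights proportional to the number of true samples of each class; this reading, like the function names, is an assumption and is not confirmed by the text.

```python
def f_beta(precision, recall, beta=1.0):
    """F-score combining precision and recall; beta = 1 gives the harmonic mean."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

def weighted_f1(cm):
    """Support-weighted mean of per-class F1 (rows: true, columns: recognized)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    score = 0.0
    for i in range(n):
        support = sum(cm[i])                          # true samples of class i
        predicted = sum(cm[r][i] for r in range(n))   # samples recognized as class i
        precision = cm[i][i] / predicted if predicted else 0.0
        recall = cm[i][i] / support if support else 0.0
        score += support / total * f_beta(precision, recall)
    return score

# Hypothetical 2-class example: class 0 is always found, class 1 half of the time.
example = [[10, 0],
           [5, 5]]
score = weighted_f1(example)
```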

                   Train data        Test data
Vectorized data    AU      FP68      AU      FP68
SSVM               0.810   0.786     0.354   0.228
SVM (poly)         0.786   0.568     0.406   0.266
DNN*               0.808   0.615     0.752   0.581

Image data         Train data        Test data
CNN-1 50x50x1      0.837             0.760
CNN-1 75x75x1      0.928             0.849
CNN-2 75x75x1      0.865             0.834
CNN-3 150x150x1    0.933             0.880

* Note: DNN: for each input data type there is a different architecture.

Table 6: Weighted score results for selected features

6 Conclusions

Input   SR[%]   Symbolic notation for neural architecture
FP68    57      a1fp6816br804
AU8     75      a1au884
Image   76      yximg bbr1bbr2bbr3 164pb 2_σ3128pbr 1256pb 2_σ3128p 1256pb 2_σ34p ga
Image   83      yximg 38pbr 38pbr xcept1 xcept2 xcept3 xcept4 34p ga
Image   88      yximg 332br 2m 332br 2m 364br 2m 364br 2m 64r 50 4

where the unit bbr is defined as: bbr 1_$2_$pb 2_σ1_$2_$pbr, with bbr1 5,16; bbr2 5,32; bbr3 3,64

Table 7: Neural architectures for emotions from images.

The experiments on the facial expression classification performance of different features (raw images, FP68 landmarks and action units) and algorithms (SVM and DNN) illustrate that, for each type of those discriminative features, DNN as the classification algorithm shows the most promising results. Even when classifying just the eight dimensional data, it holds an advantage of approximately 30% in accuracy over SVM when the testing samples are much more challenging than the training samples.

Namely, in the challenging conditions when the models are trained on frontal views of human faces while they are tested on arbitrary head poses, for geometric features the success rate (accuracy) indicates a nearly triple increase of performance of CNN with respect to SVM classifiers. For raw images, CNN outperforms in accuracy its best geometric counterpart (AU/CNN) by about 30 percent, while the best SVM solutions are inferior nearly four times. For the F-score, a high advantage of raw/CNN over geometric/CNN and geometric/SVM is observed as well.

We also conclude that, contrary to CNN based emotion classifiers, the generalization capability of SVM based emotion classifiers with respect to human head pose is poor.

To summarize and compare the neural architectures and their performance we assemble them in Tab.7 sorting by type of input with the success rate column SR[%].