Automatic analysis of facial expressions has long attracted attention in computer vision research due to its wide spectrum of potential applications, which range from human-computer interaction to medical and psychological investigations, to cite a few. As in other applications, for many years facial expression analysis has been addressed by designing hand-crafted low-level descriptors, either geometric (e.g., distances between landmarks) or appearance based (e.g., LBP, SIFT, HOG), with the aim of extracting suitable representations of the face. Higher-order relations, like the covariance descriptor, have also been computed on raw data or low-level descriptors. Standard machine learning tools, like SVMs, have then been used to classify expressions. Now, the approach to this problem has changed quite radically with Deep Convolutional Neural Networks (DCNNs). The idea here is to make the network learn the best features from large collections of data during a training phase.
However, one drawback of DCNNs is that, by relying on fully connected layers and softmax, they do not take into account the spatial relationships within the face during the classification phase. To overcome this problem, we propose to exploit, both globally and locally, the network features extracted in different regions of the face. This yields a set of DCNN features per region. The question is how to encode them in a compact and discriminative representation for a more efficient classification than the one achieved globally by classical softmax.
In this paper, we propose to encode face DCNN features in local and global covariance matrices. These matrices have been shown to outperform first-order features in many computer vision tasks [1, 2]. We demonstrate the benefits of this representation in facial expression recognition from static images or collections of static peak frames (i.e., frames where the expression reaches its maximum) and also from video sequences.
For static images, we represent each image with local and global covariance matrices that lie on the Symmetric Positive Definite (SPD) manifold, and we use a valid positive definite Gaussian kernel on this manifold to train an SVM classifier for static facial expression classification. Implementing our approach with different DCNN architectures, i.e., VGG-face and ExpNet, and running a thorough set of experiments, we found that the classification of these matrices outperforms the classical softmax.
Furthermore, we extend our static approach to deal with dynamic facial expressions. The challenges encountered here are how to represent the dynamic evolution of the video sequences, and how to deal with the temporal misalignment of these videos in order to classify them efficiently. Regarding the first question, we exploit the geometry of the space of covariance matrices as points on the SPD manifold, and model the temporal evolution of facial expressions as trajectories on this manifold. Following the static approach, we studied both global and local deep trajectories. Once the deep trajectories are constructed, we need to align them on their manifold to compensate for the different execution rates of the facial expressions. A common method to do so is Dynamic Time Warping (DTW), as done in many previous works [5, 6, 7]. However, DTW does not define a proper metric and cannot be used to derive a valid positive-definite kernel for the classification phase. Instead, in this work we propose global alignment of deep trajectories with the log-Euclidean Riemannian metric, which allows us to derive a valid positive-definite kernel used with SVM for the classification. By doing so, we propose a completely new approach to model and compare the spatial and the temporal evolution of facial expressions.
Overall, the proposed solution permits us to combine the geometric and appearance features enabling an effective description of facial expressions at different spatial levels, while taking into account the spatial relationships within the face. In addition, this solution is extended to deal with both the spatial and the temporal domains of facial expressions. An overview of the proposed solution is illustrated in Figure 1. In summary, the main contributions of our work consist of:
Encoding local/global DCNN features of the face by using local/global covariance matrices;
Using multiple late/early fusion schemes to combine multiple local and global information;
A temporal extension of the static covariance representations by modeling their temporal evolution as trajectories in the SPD manifold. To the best of our knowledge, this is the first work that uses DCNN features to model videos as trajectories on a Riemannian manifold;
A temporal alignment method based on Global Alignment (GA), which is proposed here for the first time for aligning trajectories on the SPD manifold;
Classifying static facial expressions using a Gaussian kernel on the SPD manifold coupled with an SVM classifier;
Classifying deep trajectories on the SPD manifold using the Global Alignment Kernel (GAK), which is a valid positive definite kernel, coupled with an SVM classifier;
An extensive experimental evaluation with two different CNN architectures that also compares our static and dynamic results with state-of-the-art methods on three public datasets.
We presented some preliminary ideas of this work in an earlier publication. With respect to that publication, here we propose a completely new and original solution to model the temporal dynamics of facial expressions as trajectories on the SPD manifold. The experimental evaluation now comprises both the static and dynamic solutions, and includes a larger number of datasets.
The rest of the paper is organized as follows: In Section II, we summarize the prior works that are most related to our solution, including methods for DCNN facial expression recognition, DCNN covariance descriptors, and trajectories modeling on Riemannian manifolds; In Section III, we present our solution for facial feature extraction and we introduce the idea of DCNN covariance descriptors; The way these descriptors can be used for expression classification from static images is reported in Section IV; In Section V, the approach is extended to the modeling of facial expressions as deep trajectories on the SPD manifold. A comprehensive experimentation using the proposed approach on three publicly available benchmarks, and comparison with state-of-the-art solutions is reported in Section VI; Finally, conclusions and directions for future work are sketched in Section VII.
II Related work
The approach we propose in this paper is mainly related to the works on static and dynamic facial expression recognition. Accordingly, we first summarize relevant works that use DCNN features for static facial expression analysis, including some works that propose covariance descriptors in conjunction with DCNN. Then, we present essential literature solutions proposed for modeling the temporal evolution of the face in the context of dynamic facial expressions.
DCNN for Static Facial Expression Recognition: Motivated by the success of DCNN models in facial analysis tasks, several papers proposed to use them for static facial expression recognition [10, 11]. However, the main reason behind the impressive performance of DCNNs is the availability of large-scale training datasets. In facial expression recognition, by contrast, datasets are quite small, mainly because of the difficulty of producing properly annotated images for training. To overcome this problem, Ding et al. proposed FaceNet2ExpNet, where a regularization function helps to use the face information to train the facial expression classification net on static images. Facial expression recognition from still images using DCNNs was also proposed in [10, 11, 12]. All these methods use a similar strategy in the network architecture: multiple convolutional and pooling layers are used for feature extraction, while fully connected and softmax layers are used for classification. Besides, several other works introduced a novel class of DCNNs that exploit second-order statistics (e.g., covariances). In the context of facial expression recognition from images, Acharya et al. explored convolutional networks in conjunction with manifold networks for covariance pooling in an end-to-end deep learning manner. Wang et al. constructed a feature learning network (e.g., a CNN) to project the face images into a target representation space. The network is trained with the goal of maximizing the discriminative ability of the set of covariance matrices computed in the target space.
Temporal Modeling of Facial Expressions: The difficulty here is to account for the dynamic evolution of the facial expression. One direction to deal with this difficulty is to explore deep architectures that can model appearance and motion information simultaneously. For example, LSTMs combined with CNNs have been successfully employed for facial expression recognition under different names, such as CNN-RNN and CNN-BRNN. 3D convolutional neural networks have also been used for facial expression recognition in several works including [15, 17]. In the same direction, Jung et al. used convolutional neural networks to extract temporal appearance features from face image sequences, together with an additional deep network that extracts temporal geometry features from temporal facial landmarks. The two networks are then combined using a joint fine-tuning method. Acharya et al. have extended their static approach discussed before to dynamic facial expression recognition. They considered the temporal evolution of per-frame features by leveraging covariance pooling. Their networks achieve significant facial expression recognition performance for static data, while dynamic data are still more challenging.
Taking a different direction, several recent works chose to model the temporal evolution of the face as a trajectory. For example, Taheri et al. used landmark configurations of the face to represent facial deformations on the Grassmann manifold. They modeled the dynamics of facial expressions by parameterized trajectories on this manifold before classifying them using LDA followed by an SVM. In the same direction, Kacem et al. described the temporal evolution of facial landmarks as parameterized trajectories on the Riemannian manifold of positive semidefinite matrices of fixed rank. Trajectory modeling on Riemannian manifolds was also used for human action recognition in several works [5, 21, 22]. However, all these works relied on geometric information to study the temporal evolution of some landmarks, ignoring the texture information.
One outstanding issue encountered when modeling the temporal evolution of the face as a trajectory is the temporal misalignment resulting from the different execution rates of the facial expression. This issue necessitates the use of an algorithm, generally based on dynamic programming, to align different trajectories. Several works including [5, 6, 20] used DTW to align trajectories on a Riemannian manifold; however, this algorithm does not define a proper metric, which is required in the classification phase to define a valid positive-definite kernel. As an alternative solution, different works [6, 20, 23] proposed to ignore this constraint by using a variant of SVM with an arbitrary kernel, without any restrictions on the kernel function.
Different from the above methods, in this work we use both global and local covariance descriptors computed on DCNN features to explore appearance and geometric features simultaneously. Furthermore, we propose a new solution for trajectory alignment on a Riemannian manifold based on Global Alignment. This allows us to derive a valid positive definite kernel for trajectory classification on the SPD manifold, instead of using an arbitrary kernel.
III Face representation
Given a set of face images labeled with their corresponding expressions, our goal is to find a highly discriminative face representation allowing an effective matching between faces and their expression labels.
Motivated by the success of DCNNs in automatic extraction of non-linear features that are relevant to the problem at hand, we opt for this technique in order to encode a given facial expression into a set of Feature Maps (FMs). A covariance descriptor is then computed over these FMs, which is considered for global face representation. We also extract four regions on the input face image around the eyes, mouth and cheeks (left and right). By mapping these regions on the extracted deep FMs, we are able to extract local regions from them that bring more accurate information about the facial expression. A local covariance descriptor is also computed for each local region.
The first step in our approach is the extraction of non-linear features that encode well the facial expression in the input face image. In this work, we use two DCNN models, namely, VGG-face  and ExpNet .
III-A Global DCNN Features
VGG-face is a DCNN model that is commonly used in facial analysis tasks. It consists of layers trained on M facial images of K people for face recognition in the wild. This model has also been successfully used for expression recognition. However, the model was trained for face identification, so it is expected to also encode information about the identity of the persons, which should be filtered out in order to capture person-independent facial expressions. This may deteriorate the discrimination of the expression model after fine-tuning, especially when dealing with small datasets, which is quite common in facial expression recognition. To tackle this problem, Ding et al. proposed ExpNet, which is a much smaller network containing only five convolutional layers and one fully connected layer. The training of this model is regularized by the VGG-face model.
Following Ding et al. , we first fine-tune the VGG-face network on expression datasets by minimizing the cross-entropy loss. This fine-tuned model is then used to regularize the ExpNet model. Because we are interested in facial feature extraction, we only consider the FMs at the last convolutional layer of the ExpNet model. In what follows, we will denote the set of extracted FMs from an input face image as , where are the FMs at the last convolutional layer, and is the non-linear function induced by the DCNN architecture at this layer.
III-B Local DCNN Features
In addition to using the global feature map , we focus on specific regions extracted from this global feature map that are relevant for face expression analysis.
To do so, we start by detecting a set of landmarks on the input facial image using the method proposed in . Four regions are then identified around the eyes, mouth, and both cheeks using these points. By defining a pixel-wise mapping between the input face image and its corresponding FMs, we map the detected regions from the input face image to the global FMs. Indeed, a feature map is obtained by convolution of the input image with a linear filter, adding a bias term and then applying a non-linear function. Accordingly, units within a feature map will be connected to different regions on the input image. Based on this assumption, we can find a mapping between the coordinates of the input image and those of the output feature map. Specifically, for each point of coordinates in the input image , we associate a feature in the feature map such that,
where is the rounding operation and , are the map size ratio with respect to the input size, such that and , where and are the width and the height of the feature maps, respectively, and and are those of the input image. It is worth noting that for both the network models used in this work, the input image and output maps have the same spatial extent. This is important to map landmarks position in the input image to the coordinates of convolutional feature maps. Using this pixel-wise mapping, we map each region formed by pixels on the input image into the global FMs to obtain the corresponding local FMs .
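The pixel-wise mapping described above can be sketched as follows. This is a minimal illustration, assuming the mapping multiplies each image coordinate by the feature-map-to-image size ratio and rounds the result; the function name, the argument names, and the clamping to the map border are assumptions introduced here, not part of the original method description.

```python
def image_to_feature_coords(x, y, img_w, img_h, fm_w, fm_h):
    """Map pixel (x, y) of the input image to coordinates in a feature map.

    The ratios fm_w / img_w and fm_h / img_h play the role of the map size
    ratios mentioned in the text (hypothetical argument names).
    """
    # Scale by the width/height ratio, then round to the nearest unit.
    # Note: Python's round() rounds exact halves to the nearest even integer.
    u = int(round(x * fm_w / img_w))
    v = int(round(y * fm_h / img_h))
    # Clamp (an assumption) so that border pixels stay inside the feature map.
    u = min(max(u, 0), fm_w - 1)
    v = min(max(v, 0), fm_h - 1)
    return u, v


def map_region(pixels, img_w, img_h, fm_w, fm_h):
    """Map a region, given as a list of (x, y) pixels, to the set of
    feature-map units it covers."""
    return sorted({image_to_feature_coords(x, y, img_w, img_h, fm_w, fm_h)
                   for x, y in pixels})
```

Applying `map_region` to the pixels of one facial region gives the units of the global feature maps that form the corresponding local feature maps.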
Figure 2 shows the four local regions detected on the input facial image on the left; then, landmarks and regions are shown on four FMs, selected from a total of FMs.
III-C Deep Covariance Descriptors
Both our local and global non-linear features can be directly used to classify the face images. However, motivated by the great success of covariance matrices in various recent works, we propose to compute covariance descriptors using these global and local features. In particular, a covariance descriptor is computed for each region across the corresponding local FMs, yielding four covariance descriptors. A covariance descriptor is also computed on the global FMs extracted from the whole face. In this way, we encode the correlation between the extracted non-linear features within different spatial levels, which results in an efficient, compact and more discriminative representation. Furthermore, covariance descriptors allow us to select local features and focus on local facial regions, which is not possible with fully connected and softmax layers. We also note that the covariance descriptors are treated separately, and are fused only at a later stage in the classifier. In what follows, we describe the processing for the global features; the same steps hold for the covariance descriptors computed over the local features.
The extracted features are arranged in a tensor, where and denote the width and height of the FMs, respectively, and is their number. Each feature map is vectorized into a -dimensional vector with , and the input tensor is transformed to a set of observations stored in the matrix . Each observation encodes the values of the pixel across all the feature maps. Finally, we compute the corresponding covariance matrix,
where is the mean of the feature vectors.
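As an illustration of the steps above, the following NumPy sketch vectorizes a stack of feature maps and computes their covariance matrix. The array layout `(h, w, m)`, the `1/(n-1)` normalization, and the small ridge `eps` added to keep the matrix strictly positive definite are assumptions made for this example; the text does not specify them.

```python
import numpy as np


def covariance_descriptor(fms, eps=1e-6):
    """Covariance descriptor of feature maps.

    fms: array of shape (h, w, m), i.e. m feature maps of size h x w
         (assumed layout).
    Returns an (m, m) symmetric positive definite matrix.
    """
    h, w, m = fms.shape
    # Each of the n = h * w rows holds the values of one pixel across all maps.
    obs = fms.reshape(h * w, m)
    centered = obs - obs.mean(axis=0)
    cov = centered.T @ centered / (obs.shape[0] - 1)
    # Ridge term (assumption): guards against rank deficiency when n < m.
    return cov + eps * np.eye(m)
```

The same function can be applied to the local feature maps of each facial region, which gives one descriptor per region in addition to the global one.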
Figure 3 shows six selected feature maps (chosen from the FMs extracted with the ExpNet model) for two subjects with happy and surprise expression. The figure also shows the global covariance descriptor relative to the FMs as a 2D image. Common patterns can be observed in the covariance descriptors computed for similar expressions, e.g., the dominant colors in the covariance descriptors of happy expression (left panel) are green, while being cyan in the covariance descriptors of surprise expression (right panel).
Covariance descriptors are mostly studied under a Riemannian structure of the space of symmetric positive definite matrices of size , [1, 14, 25]. Several metrics have been proposed to compare covariance matrices on , the most widely used being the Log-Euclidean Riemannian Metric (LERM)  due to its excellent theoretical properties with simple and fast computation. Formally, given two covariance descriptors and of two images and , their log-Euclidean distance is given by,
where is the Frobenius norm, and is the matrix logarithm.
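A minimal sketch of the log-Euclidean distance follows, assuming its standard form: the Frobenius norm of the difference between the matrix logarithms of the two descriptors. The matrix logarithm of an SPD matrix is computed here through its eigendecomposition; the function names are illustrative.

```python
import numpy as np


def spd_logm(C):
    """Matrix logarithm of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    # Apply the scalar log to the eigenvalues and rebuild the matrix.
    return (vecs * np.log(vals)) @ vecs.T


def lerm_distance(C1, C2):
    """Log-Euclidean distance between two SPD matrices."""
    return np.linalg.norm(spd_logm(C1) - spd_logm(C2), ord="fro")
```

Since the logarithm of each descriptor can be computed once and cached, comparing many descriptors reduces to Euclidean distances between the log-matrices, which is where the low computational cost of this metric comes from.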
IV RBF Kernel for Deep Covariance Descriptors Classification of Static Expressions
As discussed above, each face is represented by its global and local covariance descriptors that lie on the non-linear manifold . The problem of recognizing expressions from facial images is then turned to classifying their covariance descriptors in . However, one should take into account the non-linearity of this space, where traditional machine learning techniques cannot be applied in a straightforward way. Accordingly, many works proposed adaptations of these traditional machine learning techniques to the SPD manifold. For example, Harandi et al.  proposed kernels derived from two Bregman matrix divergences, namely, the Stein and Jeffrey divergences to classify SPD matrices in their embedding manifold. In our work, we exploit the log-Euclidean distance mentioned in Eq. (3) between symmetric positive definite matrices to define the Gaussian RBF kernel ,
where is the log-Euclidean distance between and . Conveniently for us, this kernel has already been proved to be a positive definite kernel for all . This kernel is computed for the global covariance descriptor as well as for each local covariance descriptor, yielding five different kernels. Finally, we follow different fusion schemes discussed in Section IV-A to combine these kernels and classify the static covariance descriptors in their embedding manifold using SVM.
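The kernel computation can be sketched as below, assuming the usual Gaussian form `exp(-gamma * d^2)` with `d` the log-Euclidean distance. The parameter name `gamma` and its default value are placeholders chosen for this example, not values taken from the experiments.

```python
import numpy as np


def spd_logm(C):
    """Matrix logarithm of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs * np.log(vals)) @ vecs.T


def lerm_rbf_gram(covs, gamma=0.1):
    """Gram matrix of the Gaussian RBF kernel built on the log-Euclidean
    distance, for a list of SPD matrices of equal size."""
    # Flatten each log-matrix: the Frobenius norm becomes a Euclidean norm.
    logs = np.stack([spd_logm(C).ravel() for C in covs])
    sq_dists = np.sum((logs[:, None, :] - logs[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)


# Toy usage with random SPD matrices (synthetic data, for illustration only).
rng = np.random.default_rng(0)
toy_covs = []
for _ in range(4):
    A = rng.normal(size=(3, 3))
    toy_covs.append(A @ A.T + np.eye(3))
toy_gram = lerm_rbf_gram(toy_covs, gamma=0.5)
```

A Gram matrix of this kind can be passed to any SVM implementation that accepts precomputed kernels; one such matrix would be built for the global descriptor and one for each local region.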
IV-A Fusion of Global and Local Information
Each facial region provides relevant information for facial expression analysis and provides a different contribution to the final decision. Consequently, an efficient fusion method of the information provided by different regions is required.
In this section, we investigate different strategies to combine the local information extracted from different facial regions. We divide these strategies into late fusion and early fusion. For the late fusion strategy, each region is pre-classified independently, then the final decision is based on the fusion of the scores of the different regions. More formally, given a set of training samples for each of the four facial regions with their associated labels, we use Support Vector Machines (SVM) to learn a classifier for each region independently. Each of these classifiers provides for each sample a score vector , where is the number of investigated classes, and is the probability that belongs to the class . Using late fusion, the final score vector of a sample is given by,
for the product rule, and by,
for the weighted sum rule, where represents the weight associated to the region .
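The two late-fusion rules can be illustrated with the short sketch below. It assumes the per-region classifiers output class-probability vectors of equal length and that the product and the weighted sum are taken class by class; the function name and the example numbers are made up for illustration.

```python
import numpy as np


def late_fusion(region_scores, weights=None):
    """Fuse per-region score vectors.

    region_scores: array-like of shape (n_regions, n_classes).
    weights: optional per-region weights; if None, the product rule is used,
             otherwise the weighted sum rule.
    """
    scores = np.asarray(region_scores, dtype=float)
    if weights is None:
        # Product rule: multiply the class scores of all regions.
        return np.prod(scores, axis=0)
    # Weighted sum rule: one weight per region.
    return np.tensordot(np.asarray(weights, dtype=float), scores, axes=1)


# Two regions, two classes (illustrative values only).
example_scores = [[0.7, 0.3], [0.6, 0.4]]
fused_product = late_fusion(example_scores)
fused_sum = late_fusion(example_scores, weights=[0.5, 0.5])
```

The predicted class is then the index of the largest entry of the fused vector, e.g. `int(np.argmax(fused_product))`.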
Concerning the early fusion strategy, we do not need to train a classifier on each region independently; instead, it aims to combine information before any training. A simple way to do so is to concatenate features of all regions in one vector that will be used to train the classifier. This is different from using the global features since many other irrelevant regions are ignored in this case. We refer to this method in our experimental study as feature fusion. A more efficient way to conduct early fusion is Multiple Kernel Learning (MKL), where information fusion is performed at the kernel level. In our case, we use MKL to combine different local features using different kernels, such that each kernel is computed on the features of one region following two rules: for the product rule, the final kernel used for training is given by,
and for the weighted sum rule, the final kernel is,
where is the weight associated to the region . In what follows, we will refer to the kernel fusion with the weighted sum rule as kernel fusion.
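Kernel-level fusion can be sketched in the same spirit, assuming one precomputed Gram matrix per region. The product rule is taken here as the element-wise product of the Gram matrices and the weighted sum rule as their weighted combination; both preserve positive semi-definiteness when the inputs are positive semi-definite and the weights are non-negative. The function name and the non-negativity check are additions for this example.

```python
import numpy as np


def fuse_kernels(grams, weights=None):
    """Combine per-region Gram matrices into a single kernel matrix.

    grams: array-like of shape (n_regions, n_samples, n_samples).
    weights: optional non-negative per-region weights; if None, the product
             rule is applied, otherwise the weighted sum rule.
    """
    grams = np.asarray(grams, dtype=float)
    if weights is None:
        # Element-wise (Hadamard) product across regions.
        return np.prod(grams, axis=0)
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0):
        raise ValueError("weights must be non-negative to keep the kernel valid")
    return np.tensordot(w, grams, axes=1)
```

How the weights are chosen (e.g. by cross validation or by an MKL solver) is left open in this sketch.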
In our experimental study, we have evaluated each of the fusion strategies discussed in this section.
V Modeling Dynamic Facial Expressions as Trajectories on the SPD Manifold
Facial expressions are better described as a dynamic process than as a static one; thus, we need to extend our approach to take into account the temporal dimension. To this end, we propose to model a video sequence of a facial expression as a time-varying trajectory on the manifold.
Following our static approach, we represent each frame of a sequence by a covariance matrix computed on top of deep features. Given that each covariance matrix is a point on the SPD manifold, as discussed before, a sequence of covariance matrices computed on DCNN features defines a trajectory on the manifold by . We define a trajectory to be a path that consists of a set of points on . In Figure 4, we visualize the temporal evolution of some feature maps extracted by our ExpNet model from a normalized video sequence of the CK+ dataset. This figure shows that each feature map focuses on some relevant features (related to the facial expression) that are more activated than others over time. For example, the first row (first feature map) shows the activation over time of the right mouth corner resulting from the smile movement, while the second feature map detects the same activation over time on the left corner. The last row of the same figure illustrates the temporal evolution of the corresponding trajectory. In particular, by encoding the feature maps of each frame in a compact covariance matrix, the problem of analyzing the temporal evolution of FMs turns into studying a trajectory of covariance matrices in . Here, we can observe that the dominant color of the covariance matrices corresponding to neutral frames is green, and gradually changes to yellow along the facial expression (i.e., happiness).
Using the same strategy, we extend the local approach as well, by representing each video sequence with five trajectories , including a trajectory which encodes the temporal evolution of the global features, and four trajectories representing the temporal evolution of four facial regions. For simplicity, we will use to refer to the trajectory in the rest of this section.
Temporal variability is one of the difficulties encountered when comparing videos. It is due to the different execution rates of the facial expressions, their variable durations, and their arbitrary starting/ending intensities. These aspects distort the comparison measures of the corresponding trajectories. To tackle this problem, different algorithms based on dynamic programming have been introduced to find an optimal alignment between two videos. In this work, we propose to align trajectories in based on the LERM distance using two algorithms: Dynamic Time Warping (DTW) and Global Alignment (GA).
V-A Dynamic Time Warping
We use the notation of  to formulate the problem of aligning trajectories in . Given two trajectories and of length and , respectively, an alignment between these trajectories is a pair of increasing -tuples of length such that and , with unitary increments and no simultaneous repetitions.
Given , the set of all possible alignments between two trajectories and , the optimal alignment is given by,
where defined as
is the cost given by the mean of a local divergence on that measures dissimilarities between any two points of the trajectories and . Hence, the dissimilarity measure computed by DTW between and is given by,
To align trajectories in with DTW, we use the LERM distance defined in Eq. (3) to define the divergence .
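The dynamic-programming recursion behind DTW can be sketched as follows. This is a simplified illustration: it minimizes the summed local divergence over all alignments with unitary increments, and omits the normalization by alignment length mentioned above. The input is a precomputed matrix of pairwise divergences between the points of the two trajectories, which in the setting of this section would hold LERM distances between covariance matrices; the function name is illustrative.

```python
import numpy as np


def dtw_cost(div):
    """Minimum summed divergence over all monotone alignments.

    div[i, j]: local divergence between point i of the first trajectory
               and point j of the second one.
    """
    n, m = div.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Allowed moves: advance in the first trajectory, in the second
            # one, or in both at the same time.
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = div[i - 1, j - 1] + best_prev
    return acc[n, m]
```

The table has (n+1) x (m+1) entries and each entry is filled in constant time, so the cost of one comparison grows with the product of the two trajectory lengths.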
The problem with DTW is that the cost function used for alignment is not a proper metric; it is not even symmetric. Indeed, the optimal alignment of a trajectory to a trajectory is often different from the alignment of to . Thus, we cannot use it to define a valid positive definite kernel, while the positive definiteness of the kernel is a very important requirement of kernel machines during the classification phase.
V-B Global Alignment Kernel
To address the problem of non-positive-definiteness of the kernel defined by DTW, Cuturi et al. proposed the Global Alignment Kernel (GAK). As shown earlier, DTW uses the minimum value of alignments to align time series. Instead, Global Alignment proposes to take advantage of all possible alignments, assuming that the minimum value used in DTW may be sensitive to peculiarities of the time series. GAK has shown its effectiveness in aligning temporal information in many works including [28, 29, 30]. Furthermore, it requires the same computational effort as DTW. GAK is defined as the sum of exponentiated and sign-changed costs of the individual alignments:
For simplicity, Eq. (12) can be rewritten using the local similarity function induced from the divergence as , to get,
Let be a positive definite kernel such that is positive definite, then as defined in Eq. (13) is positive definite.
According to Theorem 1 proved by Cuturi et al., the global alignment kernel is positive definite if is positive definite. It has been shown in the same paper that, in practice, most kernels, including the RBF kernel, satisfy this property and provide positive semi-definite matrices. Consequently, in our numerical simulations, we have used the same RBF kernel given by Eq. (4) to define our local similarity function . By doing so, we have extended the classification pipeline of our static approach to the dynamic approach by using the same local RBF kernel defined on the manifold. Note that we experimentally verified the positive definiteness of all the kernels used in our experiments.
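A minimal sketch of the global alignment recursion is given below. It assumes the product form of the kernel, in which each alignment contributes the product of the local similarities along its path and the kernel sums these contributions over all alignments. The input is a precomputed matrix of local similarities, which in the setting above would contain RBF values computed from LERM distances; the function name is illustrative.

```python
import numpy as np


def global_alignment_kernel(k_local):
    """Sum over all alignments of the product of local similarities.

    k_local[i, j]: local similarity between point i of the first trajectory
                   and point j of the second one.
    """
    n, m = k_local.shape
    acc = np.zeros((n + 1, m + 1))
    acc[0, 0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Same three moves as in DTW, but summed instead of minimized.
            acc[i, j] = k_local[i - 1, j - 1] * (
                acc[i - 1, j] + acc[i, j - 1] + acc[i - 1, j - 1]
            )
    return acc[n, m]
```

For long trajectories the products can underflow, so practical implementations often run this recursion in the log domain. The positive definiteness of a resulting Gram matrix `G` can be checked numerically, for instance with `np.linalg.eigvalsh(G).min()`.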
V-C Classification of Trajectories on the SPD Manifold
In this section, we aim to classify the aligned trajectories in . More formally, given a set of aligned trajectories , we select a training set of samples with their corresponding labels, and we seek an approximation of the function that satisfies for each sample of the training set . In order to learn this approximation function, we use two types of Support Vector Machines (SVM), namely, the standard SVM and the pairwise proximity function SVM (ppfSVM) .
Assuming the linear separability of the data, an SVM classifies them by defining a separating hyperplane in the data space. However, most data do not satisfy this assumption and require a kernel function to transform them to a higher-dimensional Hilbert space, where the data are linearly separable. The kernel function can be used with general data types like trajectories. However, according to Mercer's theorem, the kernel function must define a symmetric positive semi-definite matrix to be a valid kernel; otherwise, we cannot guarantee the convexity of the resulting optimization problem, which makes it difficult to solve.
Given that GAK provides a valid SPD kernel under a mild condition as demonstrated by Cuturi et al. , and given that our local kernel satisfies this condition as discussed before, we use the standard SVM with the kernel given in Eq. (12) to classify the aligned trajectories with global alignment on .
By contrast, DTW cannot define a positive definite kernel. Hence, we adopt the ppfSVM algorithm, which assumes that instead of a valid kernel function, all that is available is a proximity function without any restriction. In our case, the proximity function between two trajectories and is defined by,
Using this proximity function, the main idea of ppfSVM is to represent each training example with a vector , which contains its proximities to all training examples in . This results in a matrix that contains all proximities between all training data in . Using the linear kernel on this data representation, the kernel matrix is used with SVM to classify trajectories on their manifold.
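The ppfSVM representation can be sketched as follows, assuming the proximity matrix has already been computed (for instance from DTW dissimilarities). Each example is described by its row of proximities to the training set, and the linear kernel on these rows gives a Gram matrix that is positive semi-definite by construction. The function name and the toy proximity values are illustrative.

```python
import numpy as np


def ppf_linear_kernel(prox_rows, prox_train):
    """Linear kernel between proximity-vector representations.

    prox_train: (n_train, n_train) matrix; row i holds the proximities of
                training example i to all training examples.
    prox_rows:  (n, n_train) matrix of proximities of the examples to
                classify (training or test) to the training examples.
    """
    return prox_rows @ prox_train.T


# Toy proximity matrix for two training examples (illustrative values).
P_train = np.array([[1.0, 0.5], [0.5, 1.0]])
K_train = ppf_linear_kernel(P_train, P_train)
```

The training Gram matrix `K_train` and the test-versus-train matrix obtained the same way can be given to an SVM that accepts precomputed kernels.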
VI Experimental results
The effectiveness of the proposed approach in recognizing basic facial expressions has been evaluated in constrained and unconstrained, i.e., in-the-wild, settings using three publicly available datasets with different challenges.
Oulu-CASIA dataset : Includes image sequences of subjects. For each subject, there are six sequences corresponding to six basic emotion labels. Each sequence begins with a neutral facial expression and ends with the apex of the expression. For training the DCNN models, and testing the static approach, we used the last three peak frames to represent the video resulting in images. Following the same setting of the state-of-the-art, we conducted a ten-fold cross validation experiment, with subject independent splitting.
Extended Cohn Kanade (CK+) dataset : It contains sequences of posed expressions, annotated with seven expression labels (the six basic emotions plus the contempt one). Each sequence starts with a neutral expression, and reaches the peak in the last frame. Following the protocol of , the three last frames of each sequence are used to represent the video in the static approach, and the subjects are divided into ten groups by ID in ascending order to conduct ten-fold cross validation.
Static Facial Expression in the Wild (SFEW) dataset : Consists of images labeled with seven facial expressions (the six basic emotions plus the neutral one). This dataset has been collected from real movies and targets spontaneous expression recognition in challenging, i.e., in-the-wild, environments. It is divided into training ( samples), validation ( samples), and test sets. Since the test labels are not available, here we report results on the validation data.
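The subject-independent ten-fold protocol mentioned for Oulu-CASIA and CK+ can be sketched as below. The sketch assumes that subjects are sorted by ID and cut into contiguous groups of nearly equal size; the exact grouping used in the referenced protocol is not detailed here, so this is only one plausible reading, and the function name is hypothetical.

```python
def subject_folds(sample_subjects, n_folds=10):
    """Build subject-independent cross-validation folds.

    sample_subjects: list giving the subject ID of every sample.
    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    subjects = sorted(set(sample_subjects))
    base, extra = divmod(len(subjects), n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        # The first `extra` folds receive one additional subject.
        size = base + (1 if k < extra else 0)
        test_subjects = set(subjects[start:start + size])
        start += size
        test_idx = [i for i, s in enumerate(sample_subjects) if s in test_subjects]
        train_idx = [i for i, s in enumerate(sample_subjects) if s not in test_subjects]
        folds.append((train_idx, test_idx))
    return folds
```

Because the split is made on subjects rather than on samples, no person appears in both the training and the test part of a fold.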
As an initial step, we preprocessed the images of the datasets. For Oulu-CASIA and CK+, we first detected the face using the method proposed in . For SFEW, we used the aligned faces provided by the dataset. Then, we detected facial landmarks on each face using the Chehra Face Tracker . All frames were cropped and resized to , which is the input size of the DCNN models. For the dynamic approach, we first normalized the videos using the method proposed by Zhou et al. .
VGG fine-tuning: Since the three datasets are quite different, we fine-tuned the VGG-face model on each dataset separately. To keep the experiments consistent with  and , we conducted ten-fold cross validation on Oulu-CASIA and CK+. This results in ten different deep models for each of these datasets, each trained on nine splits and tested on the remaining one. On the SFEW dataset, one model is trained using the provided training data. The training procedure for both datasets is executed for epochs, with a mini-batch size of and learning rate of decreased by after epochs. The momentum is fixed to be , and Stochastic Gradient Descent is adopted as the optimization algorithm. The fully connected layers of the VGG-face model are trained from scratch by initializing them with a Gaussian distribution. For data augmentation, we used horizontal flipping on the original data without any other supplementary datasets.
ExpNet training: Also in this case, a ten-fold cross validation is performed on Oulu-CASIA and CK+, requiring the training of ten different deep models. The ExpNet architecture consists of five convolutional layers, each one followed by ReLU activation and max pooling. As mentioned in Section III-A, these layers were first trained by regularization with the fine-tuned VGG model; then we appended one fully connected layer of size . The whole network is finally trained. All parameters used in the ExpNet training (learning rate, momentum, mini-batch size, number of epochs) are the same as in . We conducted all our training experiments using the Caffe deep learning framework.
Feature extraction: We used the last pooling layer of the DCNN models to extract features from each face image. This layer provides feature maps of size , which yields covariance descriptors of size . For the local approach, to map the landmark positions in the input image accurately to the coordinates of the feature maps, we resized all feature maps to , which allows us to localize regions on the feature maps correctly and to minimize the overlap between them. The detected regions in the input image were mapped to the feature maps using Eq. (1) with a ratio . Based on this mapping, we extracted features around the eyes, the mouth, and both cheeks from each feature map. Finally, we used these local features to compute a covariance descriptor of size for each region in the input image. It is worth noting that the extracted regions have different sizes in different images. However, the size of the resulting covariance matrices depends only on the number of feature maps (as follows from Eq. (2)). This yields covariance matrices of the same size lying in the same SPD manifold . In Figure 3, we show images of the extracted global and local FMs and their corresponding covariance matrices.
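To illustrate why regions of different spatial sizes lead to descriptors of the same size, the following sketch computes a covariance descriptor from a stack of feature maps. It assumes the standard unbiased sample covariance over spatial locations; Eq. (2) may use a different normalization. The function name `covariance_descriptor` and the toy shapes are hypothetical.

```python
import numpy as np


def covariance_descriptor(feature_maps):
    """Covariance of C-dimensional feature vectors sampled at every spatial location.

    feature_maps: array of shape (C, H, W), i.e. C feature maps of size H x W.
    Returns a (C, C) symmetric matrix, whatever H and W are.
    """
    num_maps = feature_maps.shape[0]
    vectors = feature_maps.reshape(num_maps, -1)             # (C, H*W), one column per location
    centered = vectors - vectors.mean(axis=1, keepdims=True)
    num_locations = centered.shape[1]
    return (centered @ centered.T) / (num_locations - 1)     # unbiased sample covariance


# Two hypothetical regions with different spatial extents but the same number of maps.
rng = np.random.default_rng(0)
region_a = rng.standard_normal((8, 5, 7))
region_b = rng.standard_normal((8, 3, 4))
cov_a = covariance_descriptor(region_a)
cov_b = covariance_descriptor(region_b)
```

Both `cov_a` and `cov_b` are 8 x 8, since the descriptor size is set by the number of feature maps and not by the region size.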
Image Classification: For the global approach, each static image is represented by a covariance descriptor of size . In order to compare covariance descriptors in , it is empirically necessary to ensure their positive definiteness by using their regularized version, , where is a small regularization parameter (set to in all our experiments), and is the identity matrix. To classify these descriptors, we used a multi-class SVM with a Gaussian kernel on the Riemannian manifold . For reproducibility, we chose the parameters of the Gaussian kernel and the SVM cost using cross validation with grid search in the following intervals: and . Concerning the local approach, each image was represented by four covariance descriptors, each regularized as stated for the global covariance descriptor. These local descriptors were combined and classified using late and early fusion strategies. For the fusion methods that require weights, we report the results with the best weights chosen by a grid search in the interval . Note that we report the results of our local approach with the ExpNet model only, since it provides better results with the global approach than the VGG-face model. SVM classification was performed using the LIBSVM  package. It is also relevant to note that, for testing on the Oulu-CASIA and CK+ datasets, we represented each video by its three peak frames as in Ding et al. . Hence, to calculate the distance between two videos, we considered the mean of the distances between their frames. For softmax, we considered a video as correctly classified if all three of its frames are correctly recognized by the model.
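The regularization and kernel computation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the log-Euclidean distance between SPD matrices inside the Gaussian kernel, and the values of `EPS` and `GAMMA` are hypothetical placeholders, not the settings used in the experiments.

```python
import numpy as np


def regularize(cov, eps):
    """Add eps * I so that a possibly rank-deficient covariance becomes strictly SPD."""
    return cov + eps * np.eye(cov.shape[0])


def spd_log(mat):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * np.log(eigvals)) @ eigvecs.T


def log_euclidean_sq_dist(x, y):
    """Squared log-Euclidean distance (assumed metric) between two SPD matrices."""
    return np.linalg.norm(spd_log(x) - spd_log(y), ord="fro") ** 2


def gaussian_kernel_matrix(mats_a, mats_b, gamma):
    """Gaussian kernel exp(-gamma * d^2) between two lists of SPD matrices."""
    kernel = np.empty((len(mats_a), len(mats_b)))
    for i, a in enumerate(mats_a):
        for j, b in enumerate(mats_b):
            kernel[i, j] = np.exp(-gamma * log_euclidean_sq_dist(a, b))
    return kernel


EPS, GAMMA = 1e-3, 0.1  # hypothetical values, for illustration only
rng = np.random.default_rng(0)
descriptors = []
for _ in range(4):
    feats = rng.standard_normal((6, 3))  # 3 samples in 6 dimensions: rank-deficient covariance
    descriptors.append(regularize(np.cov(feats), EPS))
gram = gaussian_kernel_matrix(descriptors, descriptors, GAMMA)
```

Without the regularization step the logarithm of the zero eigenvalues would be undefined, which is one way to see why the regularized version is needed in practice. A Gram matrix such as `gram` is the kind of input an SVM accepts as a precomputed kernel.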
Video Classification: For the dynamic approach, each video was normalized to frames. Consequently, each video was represented as a trajectory of points in , where each point is a regularized covariance matrix of size . These trajectories were aligned and classified with an SVM using the kernel functions discussed earlier. The parameter used for the RBF kernel was set to . For the local approach, each video was represented by four local trajectories processed as described for the global trajectory. The fusion of local trajectories was performed with the weighted sum kernel, which showed the best results in the static approach.
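As a rough illustration of how a global alignment kernel scores two trajectories, the sketch below implements the generic dynamic-programming recursion that sums, over all monotone alignments, the product of local kernel values. It is a simplified stand-in under stated assumptions: the sequences are toy scalar series instead of trajectories of SPD matrices, the local kernel is a plain Gaussian on scalars, and the exact kernel variant used in this work may differ. All names and values are hypothetical.

```python
import numpy as np


def global_alignment_kernel(local_k):
    """Sum over all monotone alignments of the product of local kernel values.

    local_k[i, j] is the local kernel between frame i of the first sequence
    and frame j of the second sequence.
    """
    n, m = local_k.shape
    acc = np.zeros((n + 1, m + 1))
    acc[0, 0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # An alignment reaches (i, j) from the left, from above, or diagonally.
            acc[i, j] = local_k[i - 1, j - 1] * (
                acc[i - 1, j] + acc[i, j - 1] + acc[i - 1, j - 1]
            )
    return acc[n, m]


def gaussian_local_kernel(seq_a, seq_b, gamma):
    """Gaussian kernel between every pair of frames of two scalar sequences."""
    diff = seq_a[:, None] - seq_b[None, :]
    return np.exp(-gamma * diff ** 2)


# Toy sequences of different lengths (hypothetical values).
seq_a = np.array([0.0, 0.5, 1.0, 1.0])
seq_b = np.array([0.0, 1.0, 1.0])
k_ab = global_alignment_kernel(gaussian_local_kernel(seq_a, seq_b, gamma=1.0))
k_ba = global_alignment_kernel(gaussian_local_kernel(seq_b, seq_a, gamma=1.0))
```

For trajectories on the SPD manifold, the scalar difference in `gaussian_local_kernel` would be replaced by a distance between covariance matrices. Unlike DTW, which keeps only the best alignment, this recursion accumulates contributions from every alignment.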
VI-B Results and Discussion
VI-B1 Static Facial Expressions
As a first analysis, in Table I, we compare our proposed global (G-FMs) and local (R-FMs) solutions with the baselines provided by the VGG-face and ExpNet models without the covariance matrix (i.e., using the fully connected and softmax layers). On Oulu-CASIA, the G-FMs solution improves on the VGG-face and ExpNet models by and , respectively. A larger improvement is observed on the CK+ dataset, by and for the VGG-face and ExpNet models, respectively. Though less marked, an increment of for VGG-face and of for ExpNet has also been obtained on the SFEW dataset. These results indicate that the covariance descriptors computed on the convolutional features provide more discriminative representations. Furthermore, the classification of these representations using a Gaussian kernel on the SPD manifold is more effective than the standard classification with fully connected layers and softmax, even though these layers were trained in an end-to-end manner. Table I also shows that the fusion of the local (R-FMs) and global (G-FMs) approaches is clearly superior on the Oulu-CASIA and CK+ datasets, surpassing the global approach by and , respectively, while no improvement is observed on the SFEW dataset. We attribute this to failures of landmark detection, which skew the extraction of the local deep features. In Section 3 of the supplementary material, we show some failure cases of landmark detection on this dataset.
|Dataset||Model||FC-Softmax||G-FMs||G-FMs and R-FMs|
Table II compares the different fusion modalities discussed in Section IV-A. We found consistent results across the datasets, indicating that kernel fusion and weighted-sum late fusion are the best methods to combine local and global covariance descriptors.
In Table III, we investigate the performance of the individual regions of the face for ExpNet. On all datasets, the right and left cheeks provide almost the same score, outperforming the mouth by a large margin. Results for the eye region are not consistent across the datasets: the eye region performs best on Oulu-CASIA and CK+, but this is not the case on SFEW. We attribute this result to the fact that, in in-the-wild acquisitions such as those of the SFEW dataset, the eye region can be affected by occlusions, and landmark detection can be less accurate (see Section 3 of the supplementary material for failure cases of landmark detection on this dataset).
|Fusion||Oulu-CASIA||CK+||SFEW|
|Features fusion (R-FMs only)||84.38||96.70||45.70|
|G-FMs and R-FMs fusion||87.08||98.40||49.18|
VI-B2 Dynamic Facial Expressions
In Table IV, we report the results of the dynamic approach using either GAK with SVM or DTW with ppfSVM to align and classify the deep trajectories. Unsurprisingly, on both datasets, GAK achieved higher accuracy than DTW. On CK+, GAK achieved an improvement of and , with global trajectories (G-FMs) and local trajectories (R-FMs), respectively. On Oulu-CASIA, this improvement reaches about and , with G-FMs and R-FMs, respectively. These results indicate the effectiveness of the proposed global alignment with RBF kernel on in classifying trajectories on their SPD manifold ; they also show the importance of using a symmetric positive definite kernel instead of the pairwise proximity function used with DTW. The same table also shows results consistent with those of the static approach, where the fusion of the local trajectories surpasses the performance of the global trajectory by on CK+ and on Oulu-CASIA, using GAK. This improvement is also observed with DTW, by on CK+ and on Oulu-CASIA, which again confirms the contribution of the local analysis of facial expressions.
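Table IV: Accuracy (%) of the dynamic approach with DTW + ppfSVM and GAK + SVM.
|Method||Oulu-CASIA||CK+|
|G-Traj + DTW + ppfSVM||78.13||89.71|
|G-Traj + GAK + SVM||82.25||94.33|
|R-Traj + DTW + ppfSVM||83.10||95.04|
|R-Traj + GAK + SVM||86.04||98.16|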
Table V: Comparison with state-of-the-art methods on CK+.
|Method||Accuracy (%)||# Classes||Type|
|Jung et al. ||92.35||7||Dynamic|
|Kacem et al. ||96.87||7||Dynamic|
|Liu et al. ||92.22||8||Static|
|Liu et al. ||94.19||7||Dynamic|
|Cai et al. ||94.35||7||Static|
|Meng et al. ||95.37||7||Static|
|Li et al. ||95.78||6||Static|
|Chu et al. ||96.40||7||Dynamic|
|Ding et al. ||96.8||8||Static|
|Mollahosseini et al. ||97.80||7||Static|
|Zhao et al. ||97.30||6||Dynamic|
|Ding et al. ||98.60||6||Static|
|Jung et al. ||97.25||7||Dynamic|
|Ofodile et al. ||98.70||7||Dynamic|
|ours (ExpNet + G-FMs)||97.07||7||Static|
|ours (ExpNet + fusion)||98.40||7||Static|
|ours (ExpNet + G-FMs)||94.33||7||Dynamic|
|ours (ExpNet + fusion)||98.16||7||Dynamic|
Table VI: Comparison with state-of-the-art methods on Oulu-CASIA.
|Method||Accuracy (%)||# Classes||Type|
|Jung et al. ||74.17||6||Dynamic|
|Kacem et al. ||83.13||6||Dynamic|
|Liu et al. ||74.59||6||Dynamic|
|Guo et al. ||75.52||6||Dynamic|
|Cai et al. ||77.29||6||Static|
|Ding et al. ||82.29||6||Static|
|Zhao et al. ||84.59||6||Dynamic|
|Jung et al. ||81.46||6||Dynamic|
|Ofodile et al. ||89.60||6||Dynamic|
|ours (ExpNet + G-FMs)||83.55||6||Static|
|ours (ExpNet + fusion)||87.08||6||Static|
|ours (ExpNet + G-FMs)||82.25||6||Dynamic|
|ours (ExpNet + fusion)||86.04||6||Dynamic|
VI-B3 Comparison with the State-of-the-Art
As a final analysis, in Tables V, VI, and VII we compare our solution with state-of-the-art methods. In general, our approaches achieved results competitive with the most recent solutions.
Comparing the static approaches on CK+ and Oulu-CASIA (Tables V and VI, respectively), our method outperforms the state of the art with a significant gain. Ding et al.  outperform our results on CK+ with an accuracy of ; however, this result is reported on facial expressions only, ignoring the challenging contempt expression of this database. Concerning the dynamic approaches, we obtained the second highest accuracy on both the CK+ and Oulu-CASIA datasets, outperforming several recent approaches. Furthermore, Ofodile et al. , who achieved the highest accuracy on these datasets, did not report the frames used to train their DCNN model, which is important information for comparing the two approaches. Note that, for the static approach on Oulu-CASIA, to compare our results with those of Ding et al. , which were reported per frame, we reproduced the results of their approach on a per-video basis, considering a video as correctly classified only if all three of its frames are correctly recognized.
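The per-video scoring rule used for this comparison can be made concrete with a short sketch. The function name `per_video_accuracy` and the toy predictions are hypothetical; only the rule itself (a video counts as correct when all of its frames are correctly recognized) comes from the text.

```python
import numpy as np


def per_video_accuracy(frame_predictions, video_labels):
    """Fraction of videos whose frames are all predicted as the video's label.

    frame_predictions: (num_videos, frames_per_video) predicted class per frame.
    video_labels:      (num_videos,) ground-truth class per video.
    """
    frame_predictions = np.asarray(frame_predictions)
    video_labels = np.asarray(video_labels)
    all_frames_correct = (frame_predictions == video_labels[:, None]).all(axis=1)
    return all_frames_correct.mean()


# Four hypothetical videos with three peak frames each.
frame_predictions = [[0, 0, 0],
                     [1, 2, 1],
                     [2, 2, 2],
                     [3, 3, 0]]
video_labels = [0, 1, 2, 3]
accuracy = per_video_accuracy(frame_predictions, video_labels)
```

In this toy case 10 of the 12 frames are correct, yet only two of the four videos count as correct, which shows that the per-video rule is stricter than per-frame accuracy.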
On the SFEW dataset (Table VII), the global approach achieves the second highest accuracy, surpassing various state-of-the-art methods with significant gains. Moreover, the highest accuracy reported by  is obtained using a DCNN model trained on more than additional data provided by the FER-2013 database . As reported in , this data augmentation boosts results on SFEW from to .
Table VII: Comparison with state-of-the-art methods on SFEW.
|Method||Accuracy (%)|
|Liu et al. ||26.14|
|Levi et al. ||41.92|
|Mollahosseini et al. ||47.70|
|Ding et al. ||48.29|
|Ng et al. ||48.50|
|Yu et al. ||52.29|
|Cai et al. ||52.52|
|ours (ExpNet + G-FMs)||49.18|
|ours (ExpNet + fusion)||49.18|
In this paper, we have proposed the covariance matrix descriptor as a way to encode DCNN features in facial expression recognition. In the general approach, DCNNs are trained to automatically identify the patterns that characterize each class in the input images. For the case of facial expression recognition, these patterns correspond to high-level features that are related to Facial Action Units . Following a standard classification scheme in DCNN models, these features are first flattened using fully connected layers, which perform a set of linear combinations of the input features, followed by a softmax activation for predicting the expression. By contrast, in this work, we discard the fully connected layers and use covariance matrices to encode all the linear correlations between the activated non-linear features at the top convolutional layers. This is achieved both globally and locally by focusing on specific regions of the face. More particularly, the covariance matrix belongs to the set of symmetric positive-definite (SPD) matrices, thus lying on a special Riemannian manifold. We have shown that the classification of these representations using a Gaussian kernel on the SPD manifold and an SVM is more effective than the standard classification with fully connected layers and softmax. In order to effectively combine local and global information, multiple fusion schemes are adopted. Furthermore, we have shown how our approach can deal with the temporal dynamics of the face. This is achieved by modeling a facial expression video sequence as a deep trajectory in the SPD manifold. To jointly align and classify deep trajectories in the SPD manifold while respecting the structure of the manifold, a global alignment kernel is derived from the Gaussian kernel that was used to classify static covariance descriptors. This results in a valid positive definite kernel that is fed to an SVM for the final classification of the trajectories.
By conducting extensive experiments on the Oulu-CASIA, CK+, and SFEW datasets, we have shown that the proposed approach achieves state-of-the-art performance for facial expression recognition.
The authors would like to thank the high performance computing center of the Université de Lille for computational facilities.
-  O. Tuzel, F. Porikli, and P. Meer, “Region covariance: A fast descriptor for detection and classification,” in European Conf. on Computer Vision (ECCV), 2006, pp. 589–600.
-  ——, “Pedestrian detection via classification on riemannian manifolds,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 10, pp. 1713–1727, 2008.
-  O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conf. (BMVC). BMVA Press, 2015, pp. 41.1–41.12.
-  H. Ding, S. K. Zhou, and R. Chellappa, “FaceNet2ExpNet: Regularizing a deep face recognition net for expression recognition,” in IEEE Int. Conf. on Automatic Face Gesture Recognition (FG), 2017, pp. 118–126.
-  B. B. Amor, J. Su, and A. Srivastava, “Action recognition using rate-invariant analysis of skeletal shape trajectories,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 1–13, 2016.
-  A. Kacem, M. Daoudi, B. B. Amor, S. Berretti, and J. C. Alvarez-Paiva, “A novel geometric framework on gram matrix trajectories for human behavior understanding,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 2018.
-  A. Gritai, Y. Sheikh, C. Rao, and M. Shah, “Matching trajectories of anatomical landmarks under viewpoint, anthropometric and temporal transforms,” Int. Journal of Computer Vision, vol. 84, no. 3, pp. 325–343, 2009.
-  M. Cuturi, J.-P. Vert, O. Birkenes, and T. Matsui, “A kernel for time series based on global alignments,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, 2007, pp. II–413.
-  N. Otberdout, A. Kacem, M. Daoudi, L. Ballihi, and S. Berretti, “Deep covariance descriptors for facial expression recognition,” in British Machine Vision Conf. (BMVC), September 2018.
-  A. Mollahosseini, D. Chan, and M. H. Mahoor, “Going deeper in facial expression recognition using deep neural networks,” in IEEE Winter Conf. on Applications of Computer Vision (WACV), 2016, pp. 1–10.
-  H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler, “Deep learning for emotion recognition on small datasets using transfer learning,” in ACM Int. Conf. on Multimodal Interaction, 2015, pp. 443–449.
-  Z. Yu and C. Zhang, “Image based static facial expression recognition with multiple deep network learning,” in ACM Int. Conf. on Multimodal Interaction, 2015, pp. 435–442.
-  D. Acharya, Z. Huang, D. Pani Paudel, and L. Van Gool, “Covariance pooling for facial expression recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 367–374.
-  W. Wang, R. Wang, S. Shan, and X. Chen, “Discriminative covariance oriented representation learning for face recognition with image sets,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5599–5608.
-  Y. Fan, X. Lu, D. Li, and Y. Liu, “Video-based emotion recognition using cnn-rnn and c3d hybrid networks,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 2016, pp. 445–450.
-  J. Yan, W. Zheng, Z. Cui, C. Tang, T. Zhang, and Y. Zong, “Multi-cue fusion for emotion recognition in the wild,” Neurocomputing, 2018.
-  M. Liu, S. Li, S. Shan, R. Wang, and X. Chen, “Deeply learning deformable facial action parts model for dynamic expression analysis,” in Asian conference on computer vision. Springer, 2014, pp. 143–157.
-  H. Jung, S. Lee, J. Yim, S. Park, and J. Kim, “Joint fine-tuning in deep neural networks for facial expression recognition,” in IEEE Int. Conf. on Computer Vision (ICCV), 2015, pp. 2983–2991.
-  S. Taheri, P. Turaga, and R. Chellappa, “Towards view-invariant expression analysis using analytic shape manifolds,” in IEEE Conf. on Face and Gesture (FG), March 2011, pp. 306–313.
-  A. Kacem, M. Daoudi, B. B. Amor, and J. C. Á. Paiva, “A novel space-time representation on the positive semidefinite cone for facial expression recognition,” in IEEE Int. Conf. on Computer Vision (ICCV), 2017, pp. 3199–3208.
-  M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, and A. Del Bimbo, “3-d human action recognition by shape analysis of motion trajectories on riemannian manifold,” IEEE Trans. on Cybernetics, vol. 45, no. 7, pp. 1340–1352, July 2015.
-  R. Chakraborty, V. Singh, N. Adluru, and B. C. Vemuri, “A geometric framework for statistical analysis of trajectories with distinct temporal spans,” in IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 172–181.
-  S. Gudmundsson, T. P. Runarsson, and S. Sigurdsson, “Support vector machines and dynamic time warping for time series,” in IEEE Int. Joint Conf. on Neural Networks (IJCNN), 2008, pp. 2772–2776.
-  A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic, “Incremental face alignment in the wild,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1859–1866.
-  S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi, “Kernel methods on Riemannian manifolds with Gaussian RBF kernels,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 37, no. 12, pp. 2464–2477, 2015.
-  V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, “Log-euclidean metrics for fast and simple calculus on diffusion tensors,” Magnetic resonance in medicine, vol. 56, no. 2, pp. 411–421, 2006.
-  M. T. Harandi, R. I. Hartley, B. C. Lovell, and C. Sanderson, “Sparse coding on symmetric positive definite manifolds using bregman divergences.” IEEE Trans. Neural Networks and Learning Systems, vol. 27, no. 6, pp. 1294–1306, 2016.
-  A. Lorincz, L. Jeni, Z. Szabo, J. Cohn, and T. Kanade, “Emotional expression classification using time-series kernels,” in IEEE Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), 2013, pp. 889–895.
-  L. A. Jeni, A. Lőrincz, Z. Szabó, J. F. Cohn, and T. Kanade, “Spatio-temporal event classification using time-series kernel based structured sparsity,” in European Conf. on Computer Vision (ECCV). Springer, 2014, pp. 135–150.
-  M. Cuturi, “Fast global alignment kernels,” in Int. Conf. on Machine Learning (ICML), 2011, pp. 929–936.
-  J. Shawe-Taylor, N. Cristianini et al., Kernel methods for pattern analysis. Cambridge university press, 2004.
-  G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. Pietikäinen, “Facial expression recognition from near-infrared videos,” Image and Vision Computing, vol. 29, no. 9, pp. 607–619, 2011.
-  P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression,” in IEEE Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), 2010, pp. 94–101.
-  A. Dhall, O. Ramana Murthy, R. Goecke, J. Joshi, and T. Gedeon, “Video and image based emotion recognition challenges in the wild: Emotiw 2015,” in ACM Int. Conf. on Multimodal Interaction, 2015, pp. 423–426.
-  P. Viola and M. J. Jones, “Robust real-time face detection,” Int. Journal on Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
-  Z. Zhou, G. Zhao, and M. Pietikäinen, “Towards a practical lipreading system,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 137–144.
-  I. Ofodile, K. Kulkarni, C. A. Corneanu, S. Escalera, X. Baro, S. Hyniewska, J. Allik, and G. Anbarjafari, “Automatic recognition of deceptive facial expressions of emotion,” arXiv preprint arXiv:1707.04061, 2017.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM Int. Conf. on Multimedia, 2014, pp. 675–678.
-  C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector machines,” ACM Trans. on Intelligent Systems and Technology, vol. 2, no. 3, p. 27, 2011.
-  M. Liu, S. Li, S. Shan, and X. Chen, “Au-aware deep networks for facial expression recognition,” in IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG), 2013, pp. 1–6.
-  M. Liu, S. Shan, R. Wang, and X. Chen, “Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1749–1756.
-  J. Cai, Z. Meng, A. S. Khan, Z. Li, J. O’Reilly, and Y. Tong, “Island loss for learning discriminative features in facial expression recognition,” in IEEE Int. Conf. on Automatic Face & Gesture Recognition (FG), 2018, pp. 302–309.
-  Z. Meng, P. Liu, J. Cai, S. Han, and Y. Tong, “Identity-aware convolutional neural network for facial expression recognition,” in IEEE Int. Conf. on Automatic Face & Gesture Recognition (FG), 2017, pp. 558–565.
-  S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2584–2593.
-  W.-S. Chu, F. De la Torre, and J. F. Cohn, “Selective transfer machine for personalized facial expression analysis,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, pp. 529–545, 2017.
-  X. Zhao, X. Liang, L. Liu, T. Li, Y. Han, N. Vasconcelos, and S. Yan, “Peak-piloted deep network for facial expression recognition,” in European Conf. on Computer Vision (ECCV). Springer, 2016, pp. 425–442.
-  Y. Guo, G. Zhao, and M. Pietikäinen, “Dynamic facial expression recognition using longitudinal facial expression atlases,” in European Conf. on Computer Vision (ECCV), ser. Lecture Notes in Computer Science, vol. 7573. Springer, 2012, pp. 631–644.
-  I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-H. Lee et al., “Challenges in representation learning: A report on three machine learning contests,” in Int. Conf. on Neural Information Processing (NIPS). Springer, 2013, pp. 117–124.
-  G. Levi and T. Hassner, “Emotion recognition in the wild via convolutional neural networks and mapped binary patterns,” in ACM Int. Conf. on Multimodal Interaction, 2015, pp. 503–510.
-  P. Khorrami, T. Paine, and T. Huang, “Do deep neural networks learn facial action units when doing expression recognition?” in IEEE Int. Conf. on Computer Vision Workshops (ICCVW), 2015, pp. 19–27.
For clarity, we present in this section the algorithms of the proposed approaches. For each face , we compute the global and local deep covariance descriptors according to Eq. (2). Given these descriptors, Algorithm 1 summarizes the steps followed to classify the static facial expressions in . Concerning the dynamic approach, given a sequence of video frames, we use the same Eq. (2) to compute the local and global covariance descriptors of each frame, which yields a global trajectory and four local trajectories for each video. For simplicity, Algorithm 2 summarizes the steps needed to classify the global deep trajectories in , while the same strategy can be extended to classify the local trajectories as in Algorithm 1. The equations cited in these algorithms refer to those in the main paper.
II Confusion Matrices
In order to better evaluate our approach, we report in this section the confusion matrices obtained for each dataset used in our experiments. The confusion matrices reported here are obtained with the best DCNN model (ExpNet) and our best fusion strategy (Kernel fusion). Figures 5, 6 and 7 represent the confusion matrices for Oulu-CASIA, SFEW, and CK+, respectively.
For Oulu-CASIA, the happy and surprise expressions are recognized better than the rest, while the anger and disgust expressions are more challenging. The happy expression is also the best recognized one for SFEW, followed by the neutral one, while the surprise, disgust, and fear expressions are harder to recognize. This has been observed in many other works, and it is related to the unbalanced number of examples for the different expression classes in this database, as explained in .
Concerning CK+, our approach is able to recognize the majority of the expressions with an accuracy of about , except contempt and sadness. As for SFEW, this can be explained by the relatively small number of samples for these expressions with respect to the other ones. Table VIII provides the number of samples representing each facial expression in each dataset.
III Failure Cases of Facial Landmark Detection on the SFEW Dataset
Figure 8 exhibits some failure and success cases of facial landmark and region detection on the input facial images. In the left panel of this figure, we show examples from the Oulu-CASIA and SFEW datasets, where the landmark and region detection succeeded. In the right panel, we show four failure examples for landmark and region detection in the SFEW dataset. We noticed that this step failed on of the facial images of SFEW. This explains why we do not obtain improvements by combining local and global covariance descriptors on this dataset.