Deep Covariance Descriptors for Facial Expression Recognition

05/10/2018 · by Naima Otberdout, et al.

In this paper, covariance matrices are exploited to encode deep convolutional neural network (DCNN) features for facial expression recognition. Covariance matrices live in the space of Symmetric Positive Definite (SPD) matrices, whose geometry we exploit. By classifying facial expressions with a Gaussian kernel on the SPD manifold, we show that covariance descriptors computed on DCNN features perform better than the standard classification with fully connected layers and softmax. Implementing our approach with the VGG-face and ExpNet architectures, and running extensive experiments on the Oulu-CASIA and SFEW datasets, we show that the proposed approach achieves state-of-the-art performance for facial expression recognition.


1 Introduction

Automatic analysis of facial expressions has long attracted interest in computer vision research due to its wide spectrum of potential applications, ranging from human-computer interaction to medical and psychological investigations, to cite a few. As in other applications, for many years facial expression analysis was addressed by designing hand-crafted low-level descriptors, either geometric (e.g., distances between landmarks) or appearance based (e.g., LBP, SIFT, HOG, etc.), with the aim of extracting suitable representations of the face. Higher-order relations, like the covariance descriptor, have also been computed on raw data or low-level descriptors. Standard machine learning tools, like SVMs, have then been used to classify expressions. The approach to this problem has now changed quite radically with Deep Convolutional Neural Networks (DCNNs). The idea is to let the network learn the best features from large collections of data during a training phase. However, one drawback of DCNNs is that they do not take into account the spatial relationships within the face. To overcome this issue, we propose to exploit, both globally and locally, the network features extracted in different regions of the face. This yields a set of DCNN features per region. The question is how to encode them in a compact and discriminative representation that allows a more efficient classification than the one achieved globally by the classical softmax.

In this paper, we propose to encode face DCNN features in a covariance matrix. Such matrices have been shown to outperform first-order features in many computer vision tasks [Tuzel et al.(2006)Tuzel, Porikli, and Meer, Tuzel et al.(2008)Tuzel, Porikli, and Meer]. We demonstrate the benefits of this representation for facial expression recognition from static images or collections of static peak frames (i.e., frames where the expression reaches its maximum). In doing so, we exploit the geometry of the covariance matrices as points on the symmetric positive definite (SPD) manifold. Furthermore, we use a valid positive definite Gaussian RBF kernel on this manifold to train an SVM classifier for expression classification. Implementing our approach with different network architectures, i.e., VGG-face [Parkhi et al.(2015)Parkhi, Vedaldi, and Zisserman] and ExpNet [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa], and running a thorough set of experiments, we found that the classification of these matrices outperforms the classical softmax.

Overall, the proposed solution allows us to combine geometric and appearance features, enabling an effective description of facial expressions at different spatial levels, while taking into account the spatial relationships within the face. An overview of the proposed solution is illustrated in Figure 1. In summary, the main contributions of our work are: (i) encoding DCNN features of the face by using covariance matrices; (ii) encoding local DCNN features by local covariance descriptors; (iii) classifying facial expressions using a Gaussian kernel on the SPD manifold; (iv) conducting an extensive experimental evaluation with two different architectures and comparing our results with state-of-the-art methods on two publicly available datasets.

Figure 1: Overview of the proposed method.

The rest of the paper is organized as follows: In Section 2, we summarize the works most related to our solution, covering facial expression recognition and covariance descriptors. In Section 3, we present our solution for facial feature extraction and, in Section 4, we introduce the idea of DCNN covariance descriptors for expression classification. A comprehensive experimentation of the proposed approach on two publicly available benchmarks, and a comparison with state-of-the-art solutions, is reported in Section 5. Finally, conclusions and directions for future work are sketched in Section 6.

2 Related work

The approach we propose in this paper is mainly related to works on facial expression recognition and to those combining DCNNs with covariance descriptors. Accordingly, we first summarize relevant works using DCNNs for facial expression recognition, then we present some recent works that use covariance descriptors in conjunction with DCNNs.

DCNN for Facial Expression Recognition: Motivated by the success of DCNN models in facial analysis tasks, several papers proposed to use them for both static and dynamic facial expression recognition [Jung et al.(2015)Jung, Lee, Yim, Park, and Kim, Tzirakis et al.(2017)Tzirakis, Trigeorgis, Nicolaou, Schuller, and Zafeiriou, Mollahosseini et al.(2016)Mollahosseini, Chan, and Mahoor, Ng et al.(2015)Ng, Nguyen, Vonikakis, and Winkler]. However, the main reason behind the impressive performance of DCNNs is the availability of large-scale training datasets. In facial expression recognition, datasets are instead quite small, mainly due to the difficulty of producing properly annotated images for training. To overcome this problem, Ding et al. [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa] proposed FaceNet2ExpNet, where a regularization function helps exploit face information to train a facial expression classification net for static images. Facial expression recognition from still images using DCNNs was also proposed in [Mollahosseini et al.(2016)Mollahosseini, Chan, and Mahoor, Ng et al.(2015)Ng, Nguyen, Vonikakis, and Winkler, Yu and Zhang(2015)]. All these methods use a similar strategy in the network architecture: multiple convolutional and pooling layers are used for feature extraction, while fully connected and softmax layers are used for classification. In [Ofodile et al.(2017)Ofodile, Kulkarni, Corneanu, Escalera, Baro, Hyniewska, Allik, and Anbarjafari], the authors proposed a method for dynamic facial expression recognition that exploits deep features extracted at the last convolutional layer of a trained DCNN. They used a Gaussian Mixture Model (GMM) and Fisher vector encoding on the set of features extracted from videos to get a single vector representation of the video, which is fed into an SVM classifier to predict expressions.

DCNN and Covariance Descriptors: Covariance features were first introduced by Tuzel et al. [Tuzel et al.(2006)Tuzel, Porikli, and Meer] for texture matching and classification. Bhattacharya et al. [Bhattacharya et al.(2016)Bhattacharya, Souly, and Shah] constructed covariance matrices which capture joint statistics of both low-level motion and appearance features extracted from a video. Dong et al. [Dong et al.(2017)Dong, Jia, Zhang, Pei, and Wu] constructed a deep neural network which embeds high dimensional SPD matrices into a more discriminative low dimensional SPD manifold. In the context of face recognition from image sets, Wang et al. [Wang et al.(2017)Wang, Wang, Shan, and Chen] presented a Discriminative Covariance oriented Representation Learning (DCRL) framework to learn image representations which closely match the subsequent image set modeling and classification. The framework constructs a feature learning network (e.g., a CNN) to project the face images into a target representation space; the network is trained with the goal of maximizing the discriminative ability of the set of covariance matrices computed in the target space. In the dynamic facial expression recognition method proposed by Liu et al. [Liu et al.(2014b)Liu, Wang, Li, Shan, Huang, and Chen], deep and hand-crafted features are extracted from each video clip to build three types of image set models, i.e., covariance matrix, linear subspace, and Gaussian distribution. Then, different Riemannian kernels are used, separately and combined, for classification.

To the best of our knowledge, our work is the first one that uses covariance descriptors in conjunction with DCNN features for static facial expression recognition.

3 DCNN features

Given a set of face images labeled with their corresponding expressions, our goal is to find a highly discriminative face representation allowing an efficient matching between faces and their expression labels. Motivated by the success of DCNNs in automatically extracting non-linear features that are relevant to the problem at hand, we adopt this technique to encode the facial expression into Feature Maps (FMs). A covariance descriptor is then computed over these FMs and is used as a global face representation. We also extract four regions of the input face image around the eyes, mouth, and cheeks (left and right). By mapping these regions onto the extracted deep FMs, we obtain local regions of the FMs that carry more accurate information about the facial expression. A local covariance descriptor is also computed for each local region.

The first step of our approach is the extraction of non-linear features that effectively encode the facial expression in the input face image. In this work, we use two DCNN models, namely, VGG-face [Parkhi et al.(2015)Parkhi, Vedaldi, and Zisserman] and ExpNet [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa].

3.1 Global DCNN features

VGG-Face is a DCNN model that is commonly used in facial analysis tasks. It consists of 16 layers trained on 2.6M facial images of 2.6K people for face recognition in the wild. This model has also been successfully used for expression recognition [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa]. However, since the model was trained for face identification, it is expected to also encode information about the identity of the persons, which should be filtered out in order to capture person-independent facial expressions. This may deteriorate the discrimination power of the expression model after fine-tuning, especially when dealing with small datasets, which is quite common in facial expression recognition. To tackle this problem, Ding et al. [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa] proposed ExpNet, a much smaller network containing only five convolutional layers and one fully connected layer, whose training is regularized by the VGG-face model.

Following Ding et al. [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa], we first fine-tune the VGG-face network on expression datasets by minimizing the cross-entropy loss. This fine-tuned model is then used to regularize the ExpNet model. Because we are interested in facial feature extraction, we only consider the FMs at the last convolutional layer of the ExpNet model. In what follows, we denote the set of FMs extracted from an input face image I as X = f(I), where X are the FMs at the last convolutional layer, and f is the non-linear function induced by the employed DCNN architecture up to this layer.
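For illustration of this read-out step, the following is a minimal PyTorch-style sketch. It uses torchvision's generic VGG-16 backbone as a stand-in for the fine-tuned VGG-face/ExpNet models used in the paper (which were trained with Caffe), and assumes a 224 x 224 input; only the mechanism of extracting the last convolutional FMs X = f(I) is shown.

```python
import torch
from PIL import Image
from torchvision import models, transforms as T

# Stand-in backbone: torchvision's VGG-16 convolutional part. The paper uses the
# fine-tuned VGG-face / ExpNet models (Caffe); this only illustrates how the last
# convolutional feature maps X = f(I) can be read out of a network.
backbone = models.vgg16(weights="IMAGENET1K_V1").features.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),                       # assumed network input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet statistics (stand-in model)
                std=[0.229, 0.224, 0.225]),
])

def extract_feature_maps(image_path: str) -> torch.Tensor:
    """Return the FMs at the last convolutional block; shape (512, 7, 7) here."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmaps = backbone(image)                 # (1, 512, 7, 7) for a 224x224 input
    return fmaps.squeeze(0)
```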

3.2 Local DCNN features

In addition to using the global feature maps X, we focus on specific regions of these maps that are relevant for facial expression analysis. To do so, we start by detecting a set of landmark points on the input facial image using the method proposed in [Asthana et al.(2014)Asthana, Zafeiriou, Cheng, and Pantic]. Four regions are then constructed around the eyes, mouth, and both cheeks using these points. By defining a pixel-wise mapping between the input face image and its corresponding FMs, we map the detected regions from the input face image to the global FMs. Indeed, a feature map is obtained by convolving the input image with a linear filter, adding a bias term, and applying a non-linear function. Accordingly, units within a feature map are connected to different regions of the input image. Based on this, we can find a mapping between the coordinates of the input image and those of the output feature map. Specifically, for each point of coordinates (x, y) in the input face image, we associate the feature at coordinates (x', y') in the feature map such that,

(x', y') = \left( \lfloor r \cdot x \rceil, \; \lfloor r \cdot y \rceil \right) ,    (1)

where r is the map size ratio with respect to the input size, and \lfloor \cdot \rceil is the rounding operation. It is worth noting that, for both models used in this work, the input image and the output maps are square, so a single ratio r is enough to map landmark positions in the input image to the coordinates of the convolutional feature maps. Using this pixel-wise mapping, we map each region R formed by a set of pixels of the input image onto the global FMs to obtain the corresponding local FMs X_R.
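As a concrete illustration, here is a small NumPy sketch of this coordinate mapping (Eq. (1)). The input size of 224 and the feature-map size of 56 are placeholder values used only for the example, not values stated at this point of the paper.

```python
import numpy as np

def map_to_feature_map(points_xy, input_size, fmap_size):
    """Map landmark coordinates from the input image to feature-map coordinates (Eq. 1).

    points_xy : (n, 2) array of (x, y) landmark positions in the input image.
    input_size: spatial size of the (square) network input, e.g. 224.
    fmap_size : spatial size of the (square) feature maps, e.g. 56.
    """
    r = fmap_size / float(input_size)          # map-size ratio w.r.t. the input size
    mapped = np.rint(np.asarray(points_xy) * r).astype(int)
    return np.clip(mapped, 0, fmap_size - 1)   # keep coordinates inside the maps

# Hypothetical usage: two eye-corner landmarks on a 224x224 face, maps of size 56x56.
landmarks = [(70, 95), (150, 95)]
print(map_to_feature_map(landmarks, input_size=224, fmap_size=56))
```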

4 DCNN based covariance descriptors

Both our local and global non-linear features, X_R and X, could be directly used to classify the face images. However, motivated by the great success of covariance matrices in various recent works, we propose to compute covariance descriptors from these global and local features. In particular, a covariance descriptor is computed for each region R across the corresponding local FMs X_R, yielding four covariance descriptors; a covariance descriptor is also computed on the global FMs X extracted from the whole face. In this way, we encode the correlations between the extracted non-linear features at different spatial levels, which results in an efficient, compact, and more discriminative representation. Furthermore, covariance descriptors allow us to select local features and focus on local facial regions, which is not possible with fully connected and softmax layers. Note that the covariance descriptors are treated separately and fused at a late stage in the classifier. In what follows, we describe the processing of the global features X; the same steps hold for the covariance descriptors computed over the local features.

The extracted features are arranged in a w x h x d tensor, where w and h denote the width and height of the feature maps, respectively, and d is their number. Each feature map is vectorized into an n-dimensional vector, with n = w x h, to transform the input tensor into a set of n observations {v_1, ..., v_n}, with v_i \in R^d, stored in a matrix V. Each observation v_i encodes the values of pixel i across all the feature maps. Finally, we compute the corresponding covariance matrix,

C = \frac{1}{n-1} \sum_{i=1}^{n} (v_i - \mu)(v_i - \mu)^T ,    (2)

where \mu = \frac{1}{n} \sum_{i=1}^{n} v_i is the mean of the feature vectors. Covariance descriptors are mostly studied under a Riemannian structure of the space Sym_d^+ of symmetric positive definite matrices [Tuzel et al.(2006)Tuzel, Porikli, and Meer, Jayasumana et al.(2015)Jayasumana, Hartley, Salzmann, Li, and Harandi, Wang et al.(2017)Wang, Wang, Shan, and Chen]. Several metrics have been proposed to compare covariance matrices on Sym_d^+; the most widely used is the Log-Euclidean Riemannian Metric (LERM) [Arsigny et al.(2006)Arsigny, Fillard, Pennec, and Ayache], since it has excellent theoretical properties with simple and fast computations. Formally, given two covariance descriptors C_1 and C_2 of two images I_1 and I_2, their log-Euclidean distance is given by,

d_{LE}(C_1, C_2) = \| \log(C_1) - \log(C_2) \|_F ,    (3)

where \| \cdot \|_F is the Frobenius norm, and \log(\cdot) is the matrix logarithm.
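The following is a minimal NumPy/SciPy sketch of this computation, assuming the FMs are available as a (d, w, h) array as produced by the earlier extraction step.

```python
import numpy as np
from scipy.linalg import logm

def covariance_descriptor(fmaps):
    """Covariance descriptor (Eq. (2)) of a stack of feature maps.

    fmaps: array of shape (d, w, h), i.e., d feature maps of size w x h.
    Returns the d x d covariance matrix of the n = w*h pixel-wise observations.
    """
    d = fmaps.shape[0]
    V = fmaps.reshape(d, -1).T            # (n, d): one d-dimensional observation per pixel
    return np.cov(V, rowvar=False)        # normalized by n - 1, as in Eq. (2)

def log_euclidean_distance(C1, C2):
    """Log-Euclidean distance (Eq. (3)) between two SPD matrices."""
    return np.linalg.norm(logm(C1) - logm(C2), ord="fro")
```

In practice the matrix logarithm requires strictly positive definite inputs; the small regularization C + \epsilon I described in Section 5.1 serves exactly this purpose.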

4.1 RBF Kernels for DCNN covariance descriptor classification

As discussed above, each face is represented by its global and local covariance descriptors, which lie on the non-linear manifold Sym_d^+. The problem of recognizing expressions from facial images then turns into classifying their covariance descriptors on Sym_d^+. However, one should take into account the non-linearity of this space, where traditional machine learning techniques cannot be applied in a straightforward way. Accordingly, we exploit the log-Euclidean distance of Eq. (3) between symmetric positive definite matrices to define the Gaussian RBF kernel k,

k(C_1, C_2) = \exp\left( - \frac{d_{LE}^2(C_1, C_2)}{2 \sigma^2} \right) ,    (4)

where d_{LE}(C_1, C_2) is the log-Euclidean distance between C_1 and C_2. Conveniently for us, this kernel has already been proved to be positive definite for all \sigma > 0 [Jayasumana et al.(2015)Jayasumana, Hartley, Salzmann, Li, and Harandi]. This kernel is computed for the global covariance descriptor as well as for each local covariance descriptor, yielding five different kernels. Each kernel is then fed separately to an SVM classifier that outputs a score per class. Finally, fusion is performed by multiplying the scores given by the different kernels or by computing a weighted sum over them.
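Below is a minimal sketch of this classification step. The paper uses LIBSVM; here scikit-learn's precomputed-kernel SVC is used as a stand-in, and the values of sigma and the SVM cost are placeholders (the paper selects them by cross validation, see Section 5.1).

```python
import numpy as np
from scipy.linalg import logm
from sklearn.svm import SVC

def gaussian_spd_kernel(covs_a, covs_b, sigma):
    """Kernel matrix K[i, j] = exp(-d_LE(C_i, C_j)^2 / (2 sigma^2)), as in Eq. (4)."""
    logs_a = [logm(C) for C in covs_a]
    logs_b = [logm(C) for C in covs_b]
    K = np.empty((len(logs_a), len(logs_b)))
    for i, La in enumerate(logs_a):
        for j, Lb in enumerate(logs_b):
            d = np.linalg.norm(La - Lb, ord="fro")
            K[i, j] = np.exp(-(d ** 2) / (2.0 * sigma ** 2))
    return K

def train_and_score(train_covs, train_labels, test_covs, sigma=1.0, cost=10.0):
    """Fit a precomputed-kernel SVM and return per-class scores for the test set."""
    K_train = gaussian_spd_kernel(train_covs, train_covs, sigma)
    clf = SVC(kernel="precomputed", C=cost, probability=True).fit(K_train, train_labels)
    K_test = gaussian_spd_kernel(test_covs, train_covs, sigma)
    return clf.predict_proba(K_test)      # one row of class scores per test image
```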

5 Experimental results

The effectiveness of the proposed approach in recognizing basic facial expressions has been evaluated in constrained and unconstrained (i.e., in-the-wild) settings using two publicly available datasets with different challenges:

Oulu-CASIA dataset [Zhao et al.(2011)Zhao, Huang, Taini, Li, and PietikäInen]: Includes image sequences of subjects acquired in a constrained environment under normal illumination conditions. For each subject, there are six sequences, one for each of the six basic emotion labels. Each sequence begins with a neutral facial expression and ends with the apex of the expression. For both training and testing, we use the last three peak frames to represent each video. Following the same setting as the state of the art, we conducted a ten-fold cross-validation experiment with subject-independent splits.

Static Facial Expression in the Wild (SFEW) dataset [Dhall et al.(2015)Dhall, Ramana Murthy, Goecke, Joshi, and Gedeon]: Consists of static images labeled with seven facial expressions (the six basic ones plus neutral). This dataset was collected from real movies and targets spontaneous expression recognition in challenging, i.e., in-the-wild, environments. It is divided into training (891 samples), validation, and test sets. Since the test labels are not available, we report results on the validation data.

5.1 Settings

As an initial step, we performed some preprocessing on the images of both datasets. For Oulu-CASIA, we first detected the face using the method proposed in [Viola and Jones(2004)]. For SFEW, we used the aligned faces provided with the dataset. Then, we detected facial landmarks on each face using the Chehra Face Tracker [Asthana et al.(2014)Asthana, Zafeiriou, Cheng, and Pantic]. All frames were cropped and resized to 224 x 224, which is the input size of the DCNN models.

VGG fine-tuning: Since the two datasets are quite different, we fine-tuned the VGG-face model on each dataset separately. To keep the experiments consistent with [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa] and [Ofodile et al.(2017)Ofodile, Kulkarni, Corneanu, Escalera, Baro, Hyniewska, Allik, and Anbarjafari], we conducted ten-fold cross validation on Oulu-CASIA. This results in ten different deep models, each of them trained on nine splits. On the SFEW dataset, one model is trained using the provided training data. The training procedure for both datasets is executed for 100 epochs, with a mini-batch size of 64 and a learning rate that is decreased during training. The momentum is fixed, and Stochastic Gradient Descent is adopted as the optimization algorithm. The fully connected layers of the VGG-face model are trained from scratch by initializing them with a Gaussian distribution. For data augmentation, we used horizontal flipping on the original data, without any other supplementary datasets.

ExpNet training: Also in this case, a ten-fold cross validation is performed on Oulu-CASIA, requiring the training of ten different deep models. The ExpNet architecture consists of five convolutional layers, each followed by a ReLU activation and max pooling [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa]. As mentioned in Section 3.1, these layers were first trained by regularization with the fine-tuned VGG model; we then appended one fully connected layer and trained the whole network. All parameters used in the ExpNet training (learning rate, momentum, mini-batch size, number of epochs) are the same as in [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa]. We conducted all our training experiments using the Caffe deep learning framework [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell].
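For illustration only, the sketch below builds an ExpNet-like network as described above (five convolution–ReLU–max-pooling blocks followed by one fully connected layer and a classifier) in PyTorch. The channel widths, kernel sizes, pooling layout, and FC size are assumptions, not the exact configuration of Ding et al.

```python
import torch.nn as nn

def expnet_like(num_classes: int = 6,
                channels=(64, 128, 256, 512, 512),    # assumed widths
                fc_dim: int = 128) -> nn.Sequential:  # assumed FC size
    """ExpNet-like sketch: 5 x (conv -> ReLU -> max pool), then FC and classifier."""
    layers, in_ch = [], 3
    for out_ch in channels:
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True),
                   nn.MaxPool2d(kernel_size=2, stride=2)]
        in_ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
               nn.Linear(in_ch, fc_dim), nn.ReLU(inplace=True),
               nn.Linear(fc_dim, num_classes)]
    return nn.Sequential(*layers)
```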

Features extraction: We used the last pooling layer of the DCNN models to extract features from each face image. This layer provides 512 feature maps, which yields covariance descriptors of size 512 x 512. For the local approach, to properly map landmark positions in the input image to the coordinates of the feature maps, we resized all feature maps to a larger spatial resolution, which allows us to correctly localize regions on the feature maps and to minimize the overlap between them. The detected regions in the input image were mapped to the feature maps using Eq. (1) with the corresponding ratio r. Based on this mapping, we extracted features around the eyes, mouth, and both cheeks from each feature map. Finally, we used these local features to compute a 512 x 512 covariance descriptor for each region of the input image. In Sections 1 and 2 of the supplementary material, we show images of the extracted global and local FMs and their corresponding covariance matrices.

Classification: For the global approach, each static image is represented by a 512 x 512 covariance descriptor. In order to compare covariance descriptors on Sym_d^+, it is empirically necessary to ensure their positive definiteness by using their regularized version, C + \epsilon I_d, where \epsilon is a small regularization parameter (fixed in all our experiments) and I_d is the d x d identity matrix. To classify these descriptors, we used a multi-class SVM with the Gaussian kernel on the Riemannian manifold Sym_d^+. For reproducibility, we chose the width \sigma of the Gaussian kernel and the SVM cost parameter by cross validation with grid search.
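A minimal sketch of this step follows. The regularization value eps and the search grids are placeholders, not the values used in the paper; the Gram-matrix helper exploits the fact that the log-Euclidean distance equals the Euclidean distance between vectorized matrix logarithms.

```python
import numpy as np
from scipy.linalg import logm
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def regularize(C, eps=1e-4):
    """C + eps * I to enforce strict positive definiteness (eps is a placeholder value)."""
    return C + eps * np.eye(C.shape[0])

def log_euclidean_gram(covs, sigma):
    """Gram matrix of the Gaussian kernel of Eq. (4) over a list of SPD matrices."""
    logs = np.stack([logm(regularize(C)).ravel() for C in covs])   # vectorized log-maps
    sq_dists = np.sum((logs[:, None, :] - logs[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def grid_search(covs, labels, sigmas=(0.1, 1.0, 10.0), costs=(1.0, 10.0, 100.0)):
    """Pick (sigma, cost) by cross-validated accuracy; the grids are placeholders."""
    labels = np.asarray(labels)
    best, best_acc = None, -1.0
    for sigma in sigmas:
        K = log_euclidean_gram(covs, sigma)
        for cost in costs:
            accs = []
            for tr, te in StratifiedKFold(n_splits=5).split(K, labels):
                clf = SVC(kernel="precomputed", C=cost).fit(K[np.ix_(tr, tr)], labels[tr])
                accs.append(clf.score(K[np.ix_(te, tr)], labels[te]))
            if np.mean(accs) > best_acc:
                best, best_acc = (sigma, cost), np.mean(accs)
    return best
```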

Concerning the local approach, each image was represented by four covariance descriptors, each regularized as described for the global covariance descriptor. This resulted in four classification decisions that were combined using two late-fusion methods: weighted sum and product. The best performance was achieved with weighted-sum fusion, with the fusion weights selected separately for the Oulu-CASIA and SFEW datasets. Note that we report the results of our local approach only with the ExpNet model, since it provides better results than VGG-face in the global approach. SVM classification was obtained using the LIBSVM package [Chang and Lin(2001)]. Note that for testing on the Oulu-CASIA dataset, we represented each video by its three peak frames as in Ding et al. [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa]. Hence, to calculate the distance between two videos, we considered the mean of the distances between their frames. For softmax, we considered a video as correctly classified if its three frames are correctly recognized by the model.
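A small sketch of the two late-fusion schemes, operating on the per-class scores returned by the SVMs; the weights in the usage example are placeholders, not the weights selected in the paper.

```python
import numpy as np

def weighted_sum_fusion(score_list, weights):
    """score_list: per-kernel arrays of shape (n_samples, n_classes); returns class predictions."""
    fused = sum(w * s for w, s in zip(weights, score_list))
    return np.argmax(fused, axis=1)

def product_fusion(score_list):
    """Element-wise product of the per-kernel class scores."""
    fused = np.prod(np.stack(score_list, axis=0), axis=0)
    return np.argmax(fused, axis=1)

# Hypothetical usage with the five score matrices (global, eyes, mouth, right/left cheek):
# preds = weighted_sum_fusion([s_glob, s_eyes, s_mouth, s_rcheek, s_lcheek],
#                             weights=[0.4, 0.2, 0.1, 0.15, 0.15])
```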

5.2 Results and discussion

As a first analysis, in Table 1 we compare our proposed global (G-FMs) and local (R-FMs) solutions with the baselines provided by the VGG-face and ExpNet models without the covariance matrix (i.e., using their fully connected and softmax layers). On Oulu-CASIA, the G-FMs solution improves the VGG-face and ExpNet baselines by 3.7 and 1.26, respectively. Though less marked, increments of 0.69 for VGG-face and 0.92 for ExpNet are also obtained on the SFEW dataset. These results show that the covariance descriptors computed on the convolutional features provide more discriminative representations. Furthermore, classifying these representations with a Gaussian kernel on the SPD manifold is more effective than the standard classification with fully connected layers and softmax, even when these layers are trained in an end-to-end manner. Table 1 also shows that the fusion of the local (R-FMs) and global (G-FMs) approaches achieves a clear superiority on the Oulu-CASIA dataset, surpassing the global approach by 1.25, while no improvement is observed on the SFEW dataset. This is due to failures of landmark detection that skew the extraction of the local deep features. In Section 3 of the supplementary material, we show some failure cases of landmark detection on this dataset.

Dataset      Model      FC-Softmax   ours (G-FMs)   ours (G-FMs and R-FMs)
Oulu-CASIA   VGG-Face   77.8         81.5           -
             ExpNet     82.29        83.55          84.80
SFEW         VGG-Face   46.66        47.35          -
             ExpNet     48.26        49.18          49.18

Table 1: Comparison of the proposed classification scheme with respect to the VGG-Face and ExpNet models with fully connected layer and softmax.

In Table 2, we investigate the performance of the individual regions of the face for ExpNet. On both datasets, the right and left cheeks provide almost the same score, outperforming the mouth by a large margin. Results for the eye region are not coherent across the two datasets: the eyes region is the best performing on Oulu-CASIA, but this is not the case on SFEW. We motivate this result by the fact that, in in-the-wild acquisitions such as those of the SFEW dataset, the region of the eyes can be affected by occlusions, and landmark detection can be less accurate (see Section 3 of the supplementary material for failure cases of landmark detection on this dataset). Table 2 also compares different fusion modalities. We found consistent results across the two datasets, indicating that the weighted-sum fusion of G-FMs and R-FMs is the best approach.

Region                                Oulu-CASIA   SFEW
Eyes                                  84.59        38.05
Mouth                                 70.00        38.98
Right Cheek                           83.96        43.16
Left Cheek                            83.12        42.93
R-FMs product fusion                  83.66        42.92
G-FMs and R-FMs product fusion        84.05        45.24
R-FMs weighted-sum fusion             84.59        43.85
G-FMs and R-FMs weighted-sum fusion   84.80        49.18

Table 2: Overall accuracy (%) of different regions and fusion schemes on the Oulu-CASIA and SFEW datasets for the ExpNet model.

The confusion matrices for ExpNet with weighted-sum fusion are reported in Figure 2, left and right plots for Oulu-CASIA and SFEW, respectively. For Oulu-CASIA, the happy and surprise expressions are recognized better than the rest. The happy expression is also the best recognized one for SFEW, followed by the neutral one.

Figure 2: Confusion matrix on Oulu-CASIA (left) and SFEW (right) for ExpNet with weighted-sum fusion.
Method                                                                              Oulu-CASIA   SFEW
Kacem et al. [Kacem et al.(2017)Kacem, Daoudi, Amor, and Paiva] (*)                 83.13        -
Jung et al. [Jung et al.(2015)Jung, Lee, Yim, Park, and Kim] (*)                    74.17        -
Liu et al. [Liu et al.(2013)Liu, Li, Shan, and Chen]                                -            26.14
Levi et al. [Levi and Hassner(2015)]                                                -            41.92
Mollahosseini et al. [Mollahosseini et al.(2016)Mollahosseini, Chan, and Mahoor]    -            47.70
Ng et al. [Ng et al.(2015)Ng, Nguyen, Vonikakis, and Winkler]                       -            48.50
Yu et al. [Yu and Zhang(2015)]                                                      -            52.29
Ding et al. [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa]                    82.29        48.29
Liu et al. [Liu et al.(2014a)Liu, Shan, Wang, and Chen] (*)                         74.59        -
Guo et al. [Guo et al.(2012)Guo, Zhao, and Pietikäinen] (*)                         75.52        -
Zhao et al. [Zhao et al.(2016)Zhao, Liang, Liu, Li, Han, Vasconcelos, and Yan] (*)  84.59        -
Jung et al. [Jung et al.(2015)Jung, Lee, Yim, Park, and Kim] (*)                    81.46        -
Ofodile et al. [Ofodile et al.(2017)Ofodile, Kulkarni, Corneanu, Escalera, Baro, Hyniewska, Allik, and Anbarjafari] (*)  89.60  -
ours (ExpNet + G-FMs)                                                               83.55        49.18
ours (ExpNet + G-FMs and R-FMs fusion)                                              84.80        49.18

Table 3: Comparison with state-of-the-art solutions on Oulu-CASIA and SFEW. Geometric, appearance, and hybrid solutions are reported in the first three groups of methods; our solutions are given in the last two rows. (*) Dynamic approaches.

As a last analysis, in Table 3 we compare our solution with state-of-the-art methods. Overall, on Oulu-CASIA, we obtained the second highest accuracy, outperforming several recent approaches. Furthermore, Ofodile et al. [Ofodile et al.(2017)Ofodile, Kulkarni, Corneanu, Escalera, Baro, Hyniewska, Allik, and Anbarjafari], who achieved the highest accuracy on this dataset, also used the temporal information of the video. In addition, they did not report the frames used to train their DCNN model, which is important information for comparing the two approaches. Note that, to compare our results with those of Ding et al. [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa], which were reported per frame, we reproduced the results of their approach on a per-video basis, considering a video correctly classified if its three frames are correctly recognized. On the SFEW dataset, the global approach achieves the second highest accuracy, surpassing various state-of-the-art methods with significant gains. Moreover, the highest accuracy, reported by [Yu and Zhang(2015)], is obtained using a DCNN model trained on a large amount of additional data provided by the FER-2013 database [Goodfellow et al.(2013)Goodfellow, Erhan, Carrier, Courville, Mirza, Hamner, Cukierski, Tang, Thaler, Lee, et al.]. As reported in [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa], this data augmentation can substantially improve results on SFEW.

6 Conclusions

In this paper, we have proposed the covariance matrix descriptor as a way to encode DCNN features for facial expression recognition. The covariance matrix belongs to the set of symmetric positive definite (SPD) matrices, thus lying on a special Riemannian manifold. We have shown that classifying these representations with a Gaussian kernel on the SPD manifold is more effective than the standard classification with fully connected layers and softmax. By implementing our approach with different architectures, i.e., VGG-face and ExpNet, in extensive experiments on the Oulu-CASIA and SFEW datasets, we have shown that the proposed approach achieves state-of-the-art performance for facial expression recognition. As future work, we aim to include the temporal dynamics of the face in the proposed model.

A Supplementary Material to the Paper: Deep Covariance Descriptors for Facial Expression Recognition

In this supplementary material, we present further details on the conducted experiments. In particular, we provide visualizations of:

Global DCNN features and their covariance descriptors: Figure 3 shows four selected feature maps (chosen from the 512 FMs) extracted with the ExpNet model for two subjects of the Oulu-CASIA dataset (happy and surprise expressions). We also show the global covariance descriptor computed over the 512 feature maps as a 2D image. Common patterns can be observed in the covariance descriptors computed for similar expressions; e.g., the dominant colors in the covariance descriptors of the happy expression (left panel) are pink/purple, while they are blue for the surprise expression (right panel).

Figure 3: Visualization of some feature maps (ExpNet) and their corresponding covariance descriptors for two subjects from the Oulu-CASIA dataset conveying happy and surprise expressions. We show four feature maps (chosen from 512 feature maps) for each example image. The corresponding covariance descriptors are computed over the 512 FMs. Best seen in color.

Local DCNN features and their covariance descriptors: Figure 4 shows, on the left, the four local regions detected on the input facial image; then, landmarks and regions are shown on four selected feature maps (chosen from the 512), as mentioned in Section 3.2 of the paper. The covariance descriptors relative to each detected region are shown in Figure 5. We can observe that each local covariance descriptor captures different patterns.

Figure 4: Visualization of the detected facial landmarks and regions on the input facial image and mapped on four selected feature maps (from 512). Best seen in color.
Figure 5: Visualization of the extracted regions on four feature maps and their corresponding covariance descriptors. Best seen in color.

Failure cases of facial landmark detection on the SFEW dataset: Figure 6 exhibits some failure and success cases of facial landmark and region detection on the input facial images. In the left panel of this figure, we show examples from the Oulu-CASIA and SFEW datasets where the landmark and region detection succeeded. In the right panel, we show four failure examples of landmark and region detection on the SFEW dataset. We noticed that this step failed on a noticeable fraction of the facial images of SFEW, which explains why we do not obtain improvements by combining local and global covariance descriptors on this dataset.

Figure 6: Examples of facial landmark and region detection on the SFEW and Oulu-CASIA datasets, with some failure cases for the SFEW dataset. For each example, the image on the left shows the aligned face with its landmark points, while the image on the right represents the aligned face with its detected regions.

References

  • [Arsigny et al.(2006)Arsigny, Fillard, Pennec, and Ayache] Vincent Arsigny, Pierre Fillard, Xavier Pennec, and Nicholas Ayache. Log-euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic resonance in medicine, 56(2):411–421, 2006.
  • [Asthana et al.(2014)Asthana, Zafeiriou, Cheng, and Pantic] Akshay Asthana, Stefanos Zafeiriou, Shiyang Cheng, and Maja Pantic. Incremental face alignment in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1859–1866, 2014.
  • [Bhattacharya et al.(2016)Bhattacharya, Souly, and Shah] Subhabrata Bhattacharya, Nasim Souly, and Mubarak Shah. Covariance of motion and appearance features for spatio-temporal recognition tasks. CoRR, abs/1606.05355, 2016.
  • [Chang and Lin(2001)] Chih-chung Chang and Chih-jen Lin. LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm, 2001.
  • [Dhall et al.(2015)Dhall, Ramana Murthy, Goecke, Joshi, and Gedeon] Abhinav Dhall, OV Ramana Murthy, Roland Goecke, Jyoti Joshi, and Tom Gedeon. Video and image based emotion recognition challenges in the wild: Emotiw 2015. In ACM on International Conference on Multimodal Interaction, pages 423–426. ACM, 2015.
  • [Dong et al.(2017)Dong, Jia, Zhang, Pei, and Wu] Zhen Dong, Su Jia, Chi Zhang, Mingtao Pei, and Yuwei Wu. Deep manifold learning of symmetric positive definite matrices with application to face recognition. In AAAI Conf. on Artificial Intelligence, pages 4009–4015, 2017.
  • [Goodfellow et al.(2013)Goodfellow, Erhan, Carrier, Courville, Mirza, Hamner, Cukierski, Tang, Thaler, Lee, et al.] Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. Challenges in representation learning: A report on three machine learning contests. In International Conference on Neural Information Processing, pages 117–124. Springer, 2013.
  • [Guo et al.(2012)Guo, Zhao, and Pietikäinen] Yimo Guo, Guoying Zhao, and Matti Pietikäinen. Dynamic facial expression recognition using longitudinal facial expression atlases. In European Conference on Computer Vision, ECCV, volume 7573 of Lecture Notes in Computer Science, pages 631–644. Springer, 2012.
  • [Hui Ding et al.(2017)Hui Ding, Zhou, and Chellappa] Hui Ding, Shaohua Kevin Zhou, and Rama Chellappa. FaceNet2ExpNet: Regularizing a deep face recognition net for expression recognition. In IEEE Int. Conf. on Automatic Face Gesture Recognition (FG), pages 118–126, 2017.
  • [Jayasumana et al.(2015)Jayasumana, Hartley, Salzmann, Li, and Harandi] Sadeep Jayasumana, Richard Hartley, Mathieu Salzmann, Hongdong Li, and Mehrtash Harandi. Kernel methods on Riemannian manifolds with Gaussian RBF kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(12):2464–2477, 2015.
  • [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM international conference on Multimedia, pages 675–678. ACM, 2014.
  • [Jung et al.(2015)Jung, Lee, Yim, Park, and Kim] Heechul Jung, Sihaeng Lee, Junho Yim, Sunjeong Park, and Junmo Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In IEEE International Conference on Computer Vision, ICCV, pages 2983–2991, 2015.
  • [Kacem et al.(2017)Kacem, Daoudi, Amor, and Paiva] Anis Kacem, Mohamed Daoudi, Boulbaba Ben Amor, and Juan Carlos Álvarez Paiva. A novel space-time representation on the positive semidefinite cone for facial expression recognition. In IEEE International Conference on Computer Vision, ICCV, pages 3199–3208, 2017.
  • [Levi and Hassner(2015)] Gil Levi and Tal Hassner. Emotion recognition in the wild via convolutional neural networks and mapped binary patterns. In ACM on international conference on multimodal interaction, pages 503–510. ACM, 2015.
  • [Liu et al.(2013)Liu, Li, Shan, and Chen] Mengyi Liu, Shaoxin Li, Shiguang Shan, and Xilin Chen. Au-aware deep networks for facial expression recognition. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pages 1–6. IEEE, 2013.
  • [Liu et al.(2014a)Liu, Shan, Wang, and Chen] Mengyi Liu, Shiguang Shan, Ruiping Wang, and Xilin Chen. Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In IEEE Conference on Computer Vision and Pattern Recognition CVPR, pages 1749–1756, 2014a.
  • [Liu et al.(2014b)Liu, Wang, Li, Shan, Huang, and Chen] Mengyi Liu, Ruiping Wang, Shaoxin Li, Shiguang Shan, Zhiwu Huang, and Xilin Chen. Combining multiple kernel methods on Riemannian manifold for emotion recognition in the wild. In Int. Conf. on Multimodal Interaction, pages 494–501. ACM, 2014b.
  • [Mollahosseini et al.(2016)Mollahosseini, Chan, and Mahoor] Ali Mollahosseini, David Chan, and Mohammad H. Mahoor. Going deeper in facial expression recognition using deep neural networks. In 2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016, Lake Placid, NY, USA, March 7-10, 2016, pages 1–10, 2016.
  • [Ng et al.(2015)Ng, Nguyen, Vonikakis, and Winkler] Hong-Wei Ng, Viet Dung Nguyen, Vassilios Vonikakis, and Stefan Winkler. Deep learning for emotion recognition on small datasets using transfer learning. In ACM on international conference on multimodal interaction, pages 443–449. ACM, 2015.
  • [Ofodile et al.(2017)Ofodile, Kulkarni, Corneanu, Escalera, Baro, Hyniewska, Allik, and Anbarjafari] Ikechukwu Ofodile, Kaustubh Kulkarni, Ciprian Adrian Corneanu, Sergio Escalera, Xavier Baro, Sylwia Hyniewska, Juri Allik, and Gholamreza Anbarjafari. Automatic recognition of deceptive facial expressions of emotion. arXiv preprint arXiv:1707.04061, 2017.
  • [Parkhi et al.(2015)Parkhi, Vedaldi, and Zisserman] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In BMVC, pages 41.1–41.12. BMVA Press, 2015.
  • [Tuzel et al.(2006)Tuzel, Porikli, and Meer] Oncel Tuzel, Fatih Porikli, and Peter Meer. Region covariance: A fast descriptor for detection and classification. In European Conf. on Computer Vision (ECCV), pages 589–600, 2006.
  • [Tuzel et al.(2008)Tuzel, Porikli, and Meer] Oncel Tuzel, Fatih Porikli, and Peter Meer. Pedestrian detection via classification on Riemannian manifolds. IEEE Trans. on Pattern Analysis and Machine Intelligence, 30(10):1713–1727, Oct. 2008.
  • [Tzirakis et al.(2017)Tzirakis, Trigeorgis, Nicolaou, Schuller, and Zafeiriou] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou. End-to-end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected Topics in Signal Processing, 11(8):1301–1309, Dec 2017. ISSN 1932-4553. doi: 10.1109/JSTSP.2017.2764438.
  • [Viola and Jones(2004)] Paul Viola and Michael J Jones. Robust real-time face detection. International journal of computer vision, 57(2):137–154, 2004.
  • [Wang et al.(2017)Wang, Wang, Shan, and Chen] Wen Wang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Discriminative covariance oriented representation learning for face recognition with image sets. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 5599–5608, 2017.
  • [Yu and Zhang(2015)] Zhiding Yu and Cha Zhang. Image based static facial expression recognition with multiple deep network learning. In ACM on International Conference on Multimodal Interaction, pages 435–442, 2015.
  • [Zhao et al.(2011)Zhao, Huang, Taini, Li, and PietikäInen] Guoying Zhao, Xiaohua Huang, Matti Taini, Stan Z Li, and Matti PietikäInen. Facial expression recognition from near-infrared videos. Image and Vision Computing, 29(9):607–619, 2011.
  • [Zhao et al.(2016)Zhao, Liang, Liu, Li, Han, Vasconcelos, and Yan] Xiangyun Zhao, Xiaodan Liang, Luoqi Liu, Teng Li, Yugang Han, Nuno Vasconcelos, and Shuicheng Yan. Peak-piloted deep network for facial expression recognition. In European Conference on Computer Vision, pages 425–442. Springer, 2016.