Deep Metric Structured Learning For Facial Expression Recognition

We propose a deep metric learning model to create embedded sub-spaces with a well-defined structure. A new loss function that imposes Gaussian structures on the output space is introduced to create these sub-spaces, thus shaping the distribution of the data. A mixture-of-Gaussians solution space is advantageous given its simple and well-established structure. It allows fast discovery of classes within classes and the identification of mean representatives at the centroids of individual classes. We also propose a new semi-supervised method to create sub-classes. We illustrate our methods on the facial expression recognition problem and validate results on the FER+, AffectNet, Extended Cohn-Kanade (CK+), BU-3DFE, and JAFFE datasets. We experimentally demonstrate that the learned embedding can be successfully used for various applications including expression retrieval and emotion recognition.




1 Introduction

Classical distance metrics like Euclidean distance and cosine similarity are limited and do not always perform well when computing distances between images or their parts. Recently, end-to-end methods [18, 1, 21, 25] have shown much progress in learning an intrinsic distance metric. They train a network to discriminatively learn embeddings so that similar images are close to each other and images from different classes are far away in the feature space. These methods are shown to outperform others adopting manually crafted features such as SIFT and binary descriptors [7, 20]. Feedforward networks trained by supervised learning can be seen as performing representation learning, where the last layer of the network is typically a linear classifier, e.g., a softmax regression classifier.

Representation learning is of great interest as a tool to enable semi-supervised and unsupervised learning. It is often the case that datasets comprise vast amounts of training data, of which relatively little is labeled. Training with supervised learning techniques on the small labeled subset generally results in severe overfitting. Semi-supervised learning offers an alternative that addresses the overfitting problem by also learning from the vast unlabeled data. Specifically, it is possible to learn good representations from the unlabeled data and use them to solve the supervised learning task.

The adoption of a particular cost function in learning methods imposes constraints on the solution space, whose shape can take any form satisfying the underlying properties induced by the loss function. For example, in the case of triplet loss [18], the optimization of the cost function leads to the creation of a solution space where every object has its nearest neighbors within the same class. Unfortunately, it does not generate a probability distribution function over the solution space, which is much desired and which our formulation achieves.

In theory, we would like the solution manifold to be a continuous function representing the true original information because, as in the case of the facial expression recognition problem, facial expressions are points in the continuous facial action space resulting from the smooth activation of facial muscles [8]. The transition from one expression to another is represented as the trajectory between the embedded vectors on the manifold surface.

The objective of this work is to offer a formulation for the creation of separable sub-spaces, each with a defined structure and a fixed data distribution. We propose a new loss function that imposes Gaussian structures in the creation of these sub-spaces. In addition, we also propose a new semi-supervised method to create sub-classes within each facial expression class, as exemplified in Figure 1.

2 Related Work

Siamese networks applied to signature verification showed the ability of neural networks to learn compact embeddings [4]. OASIS [5] and local distance learning [9] learn fine-grained image similarity ranking models using hand-crafted features that are not based on deep learning. Recent methods such as [18, 1, 21, 25] approach the problem of learning a distance metric by discriminatively training a neural network. Features generated by those approaches are shown to outperform manually crafted features [1], such as SIFT and various binary descriptors [7, 20].

Distance Metric Learning (DML) can be broadly divided into contrastive loss based methods, triplet networks, and approaches that go beyond triplets, such as quadruplets or even batch-wise losses. Contrastive embedding is trained on paired data, and it tries to minimize the distance between pairs of examples with the same class label while penalizing examples with different class labels that are closer than a margin [10]. Triplet embedding is trained on triplets of data with an anchor point, a positive that belongs to the same class, and a negative that belongs to a different class [26, 11]. Triplet networks use a loss over triplets to push the anchor and positive closer, while penalizing triplets where the distance between the anchor and negative is less than the distance between the anchor and positive plus a margin. Contrastive embedding has been used for learning visual similarity for products [3], while triplet networks have been used for face verification, person re-identification, patch matching, learning similarity between images, and fine-grained visual categorization [18, 19, 25, 6, 1].
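The two loss families described above can be illustrated with a minimal sketch. It shows one common form of each loss for a single pair or triplet; the function names, the default margins, and the choice between squared and non-squared distances are assumptions made for illustration and are not taken from the cited works.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, same_class, margin=1.0):
    """Pull same-class pairs together; push different-class pairs apart
    until they are at least `margin` away from each other."""
    d = euclidean(a, b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Penalize triplets where d(anchor, negative) < d(anchor, positive) + margin."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

A triplet whose negative is already farther from the anchor than the positive by more than the margin contributes zero loss, which is why triplet selection matters in practice.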

Several works are based on triplet-based loss functions for learning image representations. However, the majority of them use category label-based triplets [27, 24, 17]. Some existing works such as [5, 25] have focused on learning fine-grained representations. In addition, [27] used a similarity measure computed from several existing feature representations to generate ground truth annotations for the triplets, while [25] used text-image relevance based on Google image search to annotate the triplets. Unlike those approaches, we use human raters to annotate the triplets. None of those works focuses on facial expressions; only recently, [23] proposed a facial expression recognition system based on triplet loss.

3 Methodology

3.1 Structured Gaussian Manifold Loss

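Let $S = \{x_i \mid x_i \in \mathbb{R}^D\}$ be a collection of i.i.d. samples $x_i$ to be classified into $c$ classes, and let $w_j$ represent the $j$-th class, for $j = 1, \dots, c$. The computed class function $l(x) = \arg\max_{w_j} P(w_j \mid f_\Theta(x))$ returns the class $w_j$ of sample $x$ (the maximum a posteriori probability estimate) for the neural net function $f_\Theta : \mathbb{R}^D \to \mathbb{R}^d$, with samples drawn independently according to probability $p(x \mid w_j)$ for input $x$. Suppose we separate $S$ in an embedded space such that each set $S_j = \{x \in S \mid l(x) = w_j\}$ contains the samples belonging to class $w_j$. Our goal is to find a Gaussian representation for each $S_j$ which would allow a clear separation of $S$ in a reduced space, $d \ll D$.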

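We assume that $p(f_\Theta(x) \mid w_j)$ has a known parametric form, and it is therefore determined uniquely by the value of a parameter vector $\theta_j$. For example, we might have $p(f_\Theta(x) \mid w_j) \sim N(\mu_j, \Sigma_j)$, where $\theta_j = (\mu_j, \Sigma_j)$, for $N(\cdot, \cdot)$ the normal distribution with mean $\mu_j$ and variance $\Sigma_j$. To show the dependence of $p(f_\Theta(x) \mid w_j)$ on $\theta_j$ explicitly, we write $p(f_\Theta(x) \mid w_j)$ as $p(f_\Theta(x) \mid w_j, \theta_j)$. Our problem is to use the information provided by the training samples to obtain a good transformation function $f_\Theta$ that generates embedded spaces with a known distribution associated with each category. Then the a posteriori probability $P(w_j \mid f_\Theta(x))$ can be computed from $p(f_\Theta(x) \mid w_j)$ by Bayes' formula:

$$P(w_j \mid f_\Theta(x)) = \frac{p(w_j)\, p(f_\Theta(x) \mid w_j, \theta_j)}{\sum_{i=1}^{c} p(w_i)\, p(f_\Theta(x) \mid w_i, \theta_i)} \tag{1}$$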


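We use the normal density function for $p(f_\Theta(x) \mid w_j, \theta_j)$. The objective is to generate embedded sub-spaces with defined structure. Thus, using the Gaussian structures:

$$p(f_\Theta(x) \mid w_j, \mu_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\left(-\tfrac{1}{2} X^\top \Sigma_j^{-1} X\right) \tag{2}$$

where $X = f_\Theta(x) - \mu_j$. For the case $\Sigma_j = \sigma^2 I$, where $I$ is the identity matrix:

$$p(f_\Theta(x) \mid w_j, \mu_j, \sigma) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{\lVert f_\Theta(x) - \mu_j \rVert^2}{2\sigma^2}\right) \tag{3}$$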


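In a supervised problem, we know the a posteriori probability $P(w_j \mid x)$ for the input set. From this, we can define our structured loss function as the mean square error between the a posteriori probability of the input set and the a posteriori probability estimated for the embedded space:

$$\mathcal{L} = \mathbb{E}_x\left[\sum_{j=1}^{c} \big(P(w_j \mid f_\Theta(x), \theta_j) - P(w_j \mid x)\big)^2\right] \tag{4}$$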


We applied the steps described in Algorithm 1 to train the system. The batch size is given by $n = c \times k$, where $c$ is the number of classes and $k$ is the sample size per class. In this work, we use $k = 30$; thus, for eight classes the batch size is 240, which was used for the estimation of the parameters in Equation 4.
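To make the loss concrete, the following is a minimal sketch of the class posterior under isotropic Gaussians and of the mean-square-error loss on a small batch. It assumes equal priors by default, a single shared `sigma`, and one-hot targets as the known a posteriori probability of the inputs; the function names and these choices are illustrative assumptions, not the authors' implementation.

```python
import math


def gaussian_posteriors(z, means, sigma=1.0, priors=None):
    """Posterior P(w_j | z) for isotropic Gaussians N(mu_j, sigma^2 I).

    The normalizing constant of the density is the same for every class
    (shared sigma and dimension), so it cancels in Bayes' formula.
    """
    c = len(means)
    priors = priors or [1.0 / c] * c
    # Log of (prior times unnormalized likelihood) for each class.
    logits = [
        math.log(p) - sum((zi - mi) ** 2 for zi, mi in zip(z, mu)) / (2.0 * sigma ** 2)
        for p, mu in zip(priors, means)
    ]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sgm_loss(embeddings, labels, means, sigma=1.0):
    """Squared error between one-hot targets and Gaussian posteriors,
    summed over classes and averaged over the samples of the batch."""
    c = len(means)
    total = 0.0
    for z, y in zip(embeddings, labels):
        post = gaussian_posteriors(z, means, sigma)
        total += sum((post[j] - (1.0 if j == y else 0.0)) ** 2 for j in range(c))
    return total / len(embeddings)
```

An embedding that lies exactly halfway between two class means receives a posterior of 0.5 for each class, so the loss pushes the network to move it toward the mean of its own class.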

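1:  $t \leftarrow 0$
2:  while $t < T$ do
3:     $\{X, Y\} \leftarrow$ Sample($S$), get current batch.
4:     $Z \leftarrow f_\Theta(X)$, representation.
5:     $\theta_j \leftarrow (\mu_j, \sigma)$, where $\sigma$ is a parameter (fixed in this work) and $\mu_j$ is the mean of the elements of the class $w_j$: $\mu_j = \frac{1}{|S_j|} \sum_{x \in S_j} f_\Theta(x)$, where $|\cdot|$ denotes set cardinality.
6:     Evaluation of the loss function $\mathcal{L}$. For the explanation of the loss representation see Equation 4.
7:     Update $\Theta$ using $\nabla_\Theta \mathcal{L}$ (backward and optimization steps); $t \leftarrow t + 1$.
8:  end while
Algorithm 1 Structured Gaussian Manifold Learning. $f_\Theta$: neural network; $S$: dataset; $S_j$: the subset of the elements of class $w_j$; $T$: number of updates.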
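The batch construction and the per-class mean estimation used inside the training loop can be sketched as follows. This is a sketch under stated assumptions: sampling without replacement, the dictionary layout of `samples_by_class`, and both function names are hypothetical; the network forward pass and the optimizer step are omitted.

```python
import random


def sample_balanced_batch(samples_by_class, k, rng):
    """Draw k samples per class, giving a batch of size c * k."""
    batch, labels = [], []
    for label, items in sorted(samples_by_class.items()):
        for item in rng.sample(items, k):  # without replacement (an assumption)
            batch.append(item)
            labels.append(label)
    return batch, labels


def class_means(embeddings, labels):
    """Mean embedding per class: the sum of the class members divided by their count."""
    sums, counts = {}, {}
    for z, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(z))
        for i, zi in enumerate(z):
            acc[i] += zi
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}
```

With eight classes and `k=30`, `sample_balanced_batch` returns the 240-element batch mentioned in the text, and `class_means` applied to the embedded batch gives the Gaussian centers for that iteration.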

We define the accuracy of the model as the ability of the parameter vector $\theta$ to represent the test dataset in the embedded space. The prediction of a class can be calculated as:

$$l(x) = \arg\max_{j} P(w_j \mid f_\Theta(x), \theta_j) \tag{5}$$
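As a small illustration of this prediction rule, the sketch below assumes equal priors and a shared isotropic variance, in which case the class with the largest Gaussian posterior is simply the class with the nearest mean. The names `predict` and `accuracy` are hypothetical helpers, not part of the original method description.

```python
def predict(z, means):
    """Return the index of the class whose mean is closest to z.

    With equal priors and a shared isotropic variance, the class with the
    largest Gaussian posterior is the one with the smallest squared distance.
    """
    def sq_dist(mu):
        return sum((zi - mi) ** 2 for zi, mi in zip(z, mu))

    return min(range(len(means)), key=lambda j: sq_dist(means[j]))


def accuracy(embeddings, labels, means):
    """Fraction of embeddings whose predicted class matches the label."""
    hits = sum(predict(z, means) == y for z, y in zip(embeddings, labels))
    return hits / len(embeddings)
```

With unequal priors or class-specific covariances the full posterior would have to be evaluated instead of the distance alone.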


3.2 Deep Gaussian Mixture Sub-space

The same facial expression may possess a different set of global features. For example, ethnicity can determine specific color and shape, while age provides physiological differences in facial characteristics; moreover, gender, weight, and other features can determine different facial characteristics while the expression remains the same. Our proposal can group and extract these characteristics automatically. We propose to represent each facial expression class as a Gaussian mixture. The Gaussian parameters are obtained in an unsupervised way as part of the learning process. We start from a representation space given by Algorithm 1. Subsequently, a clustering algorithm is applied to separate each class into new sub-classes. This process is repeated until reaching the desired granularity level. Algorithm 2 shows the set of steps to obtain the new sub-classes.

...
4:  while the current level is below $L$ do
5:     StructureGaussianManifold
...
9:     for all classes $w_j$ do
...
15:     end for
16:  end while
Algorithm 2 Deep Gaussian Mixture Sub-space Learning. $L$: maximum level of subdivisions for the class; $f_\Theta$: neural network; StructureGaussianManifold: Structured Gaussian Manifold learning (Algorithm 1); EM: Expectation Maximization algorithm; $S$: dataset; $S_j$: the subset of the elements of class $w_j$; $T$: number of updates.
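The relabeling step at the heart of this procedure can be sketched as follows. The sketch makes several assumptions that should be read as such: plain 2-means is used as a simple stand-in for the EM algorithm, every class is split into exactly two sub-classes per level, and the retraining of the network with Algorithm 1 on the new labels after each split is left out. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def two_means(points, iterations=10):
    """Split points into two groups with plain 2-means (a stand-in for EM)."""
    # Deterministic start: the first point and the point farthest from it.
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    assign = [0] * len(points)
    for _ in range(iterations):
        assign = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                  for p in points]
        for g in (0, 1):
            members = [p for p, a in zip(points, assign) if a == g]
            if members:  # keep the old center if a group becomes empty
                centers[g] = mean(members)
    return assign


def split_classes(embeddings, labels):
    """Relabel every class into two sub-classes: label y becomes (y, 0) or (y, 1)."""
    new_labels = list(labels)
    for y in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == y]
        assign = two_means([embeddings[i] for i in idx])
        for i, a in zip(idx, assign):
            new_labels[i] = (y, a)
    return new_labels
```

Repeating the split on the embeddings of a retrained network refines the sub-classes level by level, up to the chosen maximum level.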

4 Experiments

4.1 Protocol

For the evaluation of the clustering task, we use the F1-measure and the Normalized Mutual Information (NMI) measure. The F1-measure computes the harmonic mean of precision $P$ and recall $R$, $F_1 = \frac{2PR}{P + R}$. The NMI measure takes as input a set of clusters $\Omega = \{\omega_1, \dots, \omega_K\}$ and a set of ground truth classes $C = \{c_1, \dots, c_K\}$, where $\omega_i$ indicates the set of examples with cluster assignment $i$ and $c_j$ indicates the set of examples with ground truth class label $j$. Normalized mutual information is defined as the ratio of the mutual information to the average of the entropy of the clusters and the entropy of the labels, $\mathrm{NMI}(\Omega, C) = \frac{2\, I(\Omega; C)}{H(\Omega) + H(C)}$; for complete details see [15].

For the retrieval task, we use the Recall@K [12] measure. Each test image (query) first retrieves its K nearest neighbours (KNN) from the test set and receives a score of 1 if an image of the same class is retrieved among the KNN, and 0 otherwise. Recall@K averages those scores over all the images. Moreover, we also evaluate accuracy (Acc@K), i.e., the fraction of results that are of the same class as the queried image, averaged over all queries. The classification task is evaluated using KNN on the training set.
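The two retrieval measures can be sketched in a few lines. The sketch assumes Euclidean distance in the embedded space and excludes the query itself from its own neighbour list; both choices, and the function names, are assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_indices(query_idx, embeddings, k):
    """Indices of the k nearest neighbours of a query, excluding the query itself."""
    others = [i for i in range(len(embeddings)) if i != query_idx]
    others.sort(key=lambda i: sq_dist(embeddings[query_idx], embeddings[i]))
    return others[:k]


def recall_and_acc_at_k(embeddings, labels, k):
    """Return (Recall@K, Acc@K) averaged over all queries."""
    recall_hits, acc_sum = 0, 0.0
    for q in range(len(embeddings)):
        neigh = knn_indices(q, embeddings, k)
        same = [labels[i] == labels[q] for i in neigh]
        recall_hits += 1 if any(same) else 0   # score 1 if any neighbour matches
        acc_sum += sum(same) / len(neigh)      # fraction of matching neighbours
    n = len(embeddings)
    return recall_hits / n, acc_sum / n
```

Recall@K can only grow with K, whereas Acc@K drops when the additional neighbours belong to other classes, which is why the two measures are reported together.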

For the training process, we use the Adam method [13] with a learning rate of 0.0001 and batch size of 256 (samples of size 32 to estimate the parameters in each iteration). In the TripletLoss case, we used 128 triplets in each batch. The neural networks were initialized with the same weights in all cases.

4.2 Results

4.2.1 Representation and Recovery

The groups used for the evaluation of the measures are obtained using K-means, where K equals the number of classes (8 in the case of the FER+, AffectNet, and CK+ datasets, and 7 for the JAFFE and BU-3DFE datasets).
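Given the cluster assignments produced by K-means, the NMI measure of Section 4.1 can be computed as in the following sketch. It takes the assignments as already available (the K-means step itself is not shown) and uses natural logarithms, which does not affect the ratio; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a labeling, in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(clusters, classes):
    """Mutual information between cluster assignments and class labels."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    n_cluster = Counter(clusters)
    n_class = Counter(classes)
    return sum(
        (nij / n) * math.log(n * nij / (n_cluster[w] * n_class[c]))
        for (w, c), nij in joint.items()
    )


def nmi(clusters, classes):
    """Mutual information divided by the average of the two entropies."""
    denom = entropy(clusters) + entropy(classes)
    if denom == 0.0:
        return 1.0  # both partitions consist of a single group
    return 2.0 * mutual_information(clusters, classes) / denom
```

The measure is invariant to a renaming of the clusters, so a clustering that matches the classes up to a permutation of the labels scores 1.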

The results obtained for the clustering task show that the proposed method presents good group quality (see Table 1) in similar domains. As can be observed, the results degrade for different domains. In general, we observe that TripletLoss is the most robust to the change of domain across all models. However, the best result is achieved using the proposed method with the ResNet18 model on FER+, CK+, and BU-3DFE.

Method       Arch.           FER+    AffectNet  JAFFE   CK+     BU-3DFE

TripletLoss  FMPNet          55.257  10.627     19.528  71.129  34.901
TripletLoss  CVGG13          67.384   9.103     28.295  68.303  27.275
TripletLoss  AlexNet         67.035  12.945     30.241  68.800  27.039
TripletLoss  ResNet18        64.457  15.588     31.046  74.028  36.708
TripletLoss  PreActResNet18  57.904   8.452     20.699  70.079  27.580

SGMLoss      FMPNet          57.880  10.469     26.196  77.839  36.559
SGMLoss      CVGG13          65.139  10.355     24.293  66.062  27.233
SGMLoss      AlexNet         62.091  10.582     24.560  65.230  28.115
SGMLoss      ResNet18        68.840  12.333     30.382  77.902  37.545
SGMLoss      PreActResNet18  51.425   6.886     23.216  61.413  26.104

Table 1: The NMI (%) of the clustering task for all datasets applying the TripletLoss and SGMLoss models trained on FER+. SGMLoss: Structured Gaussian Manifold Loss; Arch.: architecture; FER+: FER+ test dataset; AffectNet: validation dataset. The JAFFE, CK+, and BU-3DFE datasets are used as test sets.


Figure 2 shows a 2D t-SNE [22] visualization of the learned SGMLoss embedding space using the FER+ training set. The amount of overlap between two categories in this figure roughly demonstrates the extent of the visual similarity between them. For example, happy and neutral partially overlap, indicating that these cases could be confused easily, and both of them have a very low overlap with fear, indicating that they are visually very distinct from fear. Also, the spread of a category in this figure indicates the visual diversity within that category. For example, the happiness category maps to several distinct regions, indicating that there are some visually distinct modes within this category.

Figure 2: Barnes-Hut t-SNE visualization [22] of the SGMLoss for the FER+ database. Each color represents one of the eight emotions including neutral.


Figure 3 shows the results obtained in the recovery task (Recall@K and Acc@K measures) for several values of K. TripletLoss obtains better recovery results for all K, but to the detriment of accuracy. Our method manages to increase its recovery value while preserving quality, which means that most neighbors are of the same class. Figure 4 shows the top-5 retrieved images for some of the queries on the CelebA dataset [14]. The overall results of the proposed SGMLoss embedding are clearly better than those of the TripletLoss embedding.

Figure 3: Recall@K and Acc@K measures for the test split of the FER+ dataset. The applied model was ResNet18.


Figure 4: Top-5 images retrieved using SGMLoss (left) and TripletLoss (right) embeddings. The overall results of the SGMLoss match the query set apparently better when compared to TripletLoss.


4.2.2 Classification

The proposed SGMLoss method can be used for FER by combining it with the KNN classifier. Figure 5 shows the average F1-score of the SGMLoss and TripletLoss on the FER+ validation set as a function of the number of neighbors used. F1-score is maximized for K=11.
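A KNN classifier over the learned embedding can be sketched as follows. The sketch assumes Euclidean distance and a plain majority vote; when votes are tied, the label of the closest tied neighbour wins because of how `Counter.most_common` orders equal counts. The function name and the tie-breaking rule are illustrative assumptions.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(query, train_embeddings, train_labels, k=11):
    """Majority vote among the k training embeddings nearest to the query."""
    order = sorted(range(len(train_embeddings)),
                   key=lambda i: sq_dist(query, train_embeddings[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

The default `k=11` mirrors the value that maximizes the F1-score in Figure 5; on another dataset the value would have to be selected on a validation set.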

Figure 5: Classification performance of the SGMLoss and TripletLoss on the FER+ validation set when combined with KNN classifier.


Table 2 compares the classification performance of the SGMLoss embedding (using 11 neighbors) with the TripletLoss and CNN models. In general, our method obtains the best classification results for all architectures. The only exception is ResNet18, where the CNN model obtains a higher accuracy, although the difference is not significant. Moreover, our results surpass the accuracy of 84.99% reported in [2].

Method       Arch.           Acc.    Prec.   Rec.    F1

CNN          FMPNet          79.535  66.697  68.582  67.627
CNN          CVGG13          84.316  75.151  67.425  71.079
CNN          AlexNet         86.038  77.658  68.657  72.881
CNN          ResNet18        87.695  85.956  69.659  76.954
CNN          PreActResNet18  82.372  76.915  65.238  70.597

TripletLoss  FMPNet          82.563  79.554  62.406  69.944
TripletLoss  CVGG13          85.974  82.034  68.112  74.428
TripletLoss  AlexNet         86.038  80.598  67.895  73.703
TripletLoss  ResNet18        87.121  78.543  68.378  73.109
TripletLoss  PreActResNet18  83.519  74.081  64.856  69.162

SGMLoss      FMPNet          83.360  78.806  66.520  72.143
SGMLoss      CVGG13          86.261  86.321  67.341  75.659
SGMLoss      AlexNet         86.643  86.182  67.673  75.814
SGMLoss      ResNet18        87.631  88.614  68.724  77.412
SGMLoss      PreActResNet18  84.316  89.008  66.519  76.138

Table 2: Classification results of the CNN, TripletLoss and SGMLoss models trained on FER+. SGMLoss: Structured Gaussian Manifold Loss, Arch: Architecture, FER+: FER+ test dataset, AffectNet validation dataset.


Facial expression datasets constitute a great challenge due to the subjectivity of emotions [16]. The labeling process requires the effort of a group of specialists to make the annotations. The FER+ and AffectNet datasets contain many labeling problems. In [2], an effort was made to improve the quality of the labels of FER+ (the dataset used in our experiments) by re-tagging the dataset using crowdsourcing. Figure 6 shows some mislabeled images retrieved by our method. The scale, position, and context could influence the decision of a non-expert tagger such as those in crowdsourcing.

Experimental results show the quality of the embedded representation obtained by SGMLoss in classification problems. Our representation improves on the one obtained by TripletLoss, which is the most widely used method in identification and representation problems.

Figure 6: Examples of mislabeled images in the FER+ dataset that were retrieved using SGMLoss. The first row shows the result of query (1) and the second row the result of query (2). We can clearly observe that two very similar images have different labels in the dataset.


4.2.3 Clustering

For the training process, we use the Adam method [13] with a learning rate of 0.0001, a batch size of 640, and 500 epochs. The maximum level of subdivision used is L=5 (this value guarantees that the batch for a subclass at this level has size 128). The ResNet18 architecture is selected to train on the FER+ dataset. The objective of this experiment is to visually analyse the clustering obtained by this approach.

Figure 7 presents the 64-dimensional embedded space obtained with the Deep Gaussian Mixture Sub-space model for the FER+ dataset, visualized with the Barnes-Hut t-SNE scheme [22]. The method created five Gaussian sub-spaces for each class in an unsupervised way.

Figure 7: Barnes-Hut t-SNE visualization [22] of the Deep Gaussian Mixture Sub-space for the FER+ database. Each color represents one of the eight emotions including neutral.


For the clustering task, all embedded vectors are calculated and the EM method is applied, creating 40 groups. For each group, the medoid is calculated. The medoid is the object in the group closest to the centroid (the sample mean). The Top-k of a group contains the k objects nearest to the medoid of the group.
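The medoid and Top-k selection for a single group can be sketched as follows. The sketch assumes Euclidean distance and counts the medoid itself as the first element of the Top-k list; the function name is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def medoid_and_top_k(points, k):
    """Return the index of the medoid and the indices of the k objects nearest to it."""
    dim = len(points[0])
    centroid = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    # The medoid is the actual group member closest to the centroid.
    medoid = min(range(len(points)), key=lambda i: sq_dist(points[i], centroid))
    order = sorted(range(len(points)), key=lambda i: sq_dist(points[i], points[medoid]))
    return medoid, order[:k]
```

Using the medoid rather than the centroid guarantees that the representative of each group is a real image from the dataset.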

Figure 8 shows the Top-16 images obtained for the happiness category. The first group (Figure 8 (a)) shows an expression of happiness closer to surprise (raised eyebrows and open mouth), with eyes of similar shape. The second group (Figure 8 (b)) represents an expression closer to contempt. The third group (Figure 8 (c)) shows an expression of more intense happiness (the teeth are shown in all cases), with mouths of very similar shape. The fourth group (Figure 8 (d)) shows a subcategory that is present in all facial expressions: babies. Babies are a typically expected subset due to the intensity of their expressions and their physiognomy. Generally, babies and children from 1 to 4 years old present facial expressions of greater intensity. The last group (Figure 8 (f)) represents people with glasses and large eyes.

Figure 8: Top-16 images of the clustering obtained from the class happiness on the FER+ dataset.


The presented method is a powerful tool for tasks such as photo album summarization. In this task, we are interested in summarizing the diverse expression content present in a given photo album using a fixed number of images. Figure 9 shows 5 of the 40 groups obtained on the AffectNet dataset. The obtained groups show great similarity in terms of FER. These results demonstrate the generalization capacity of the proposed method and its applicability to FER clustering problems.

Figure 9: Top-16 images of 5 of the clusters obtained from the AffectNet dataset.


5 Conclusions

We introduced two new metric learning representation models in this work, namely Structured Gaussian Manifold Learning and Deep Gaussian Mixture Subspace Learning. In the first model, we build a Gaussian representation of expressions leading to a robust classification and grouping of facial expressions. We illustrate through many examples the high quality of the vectors obtained in recovery tasks, thus demonstrating the effectiveness of the proposed representation. In the second model, we provide a semi-supervised method for grouping facial expressions. We were able to obtain embedded subgroups sharing the same facial expression group. These subgroups emerged due to shared specific characteristics other than the general appearance, for example, individuals with glasses expressing a happy appearance.


References

  • [1] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk (2016) Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, Vol. 1, pp. 3.
  • [2] E. Barsoum, C. Zhang, C. Canton Ferrer, and Z. Zhang (2016) Training deep networks for facial expression recognition with crowd-sourced label distribution. In ACM International Conference on Multimodal Interaction (ICMI).
  • [3] S. Bell and K. Bala (2015) Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG) 34 (4), pp. 98.
  • [4] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744.
  • [5] G. Chechik, V. Sharma, U. Shalit, and S. Bengio (2010) Large scale online learning of image similarity through ranking. Journal of Machine Learning Research 11 (Mar), pp. 1109–1135.
  • [6] Y. Cui, F. Zhou, Y. Lin, and S. Belongie (2016) Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1153–1162.
  • [7] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox (2016) Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (9), pp. 1734–1747.
  • [8] P. Ekman, W. V. Friesen, and J. C. Hager (2002) Facial action coding system. A Human Face.
  • [9] A. Frome, Y. Singer, and J. Malik (2007) Image retrieval and classification using local distance functions. In Advances in Neural Information Processing Systems, pp. 417–424.
  • [10] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, Vol. 2, pp. 1735–1742.
  • [11] E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92.
  • [12] H. Jegou, M. Douze, and C. Schmid (2011) Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 117–128.
  • [13] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980.
  • [14] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738.
  • [15] C. D. Manning, P. Raghavan, H. Schütze, et al. (2008) Introduction to information retrieval. Vol. 1, Cambridge University Press, Cambridge.
  • [16] P. Marrero-Fernández, A. Montoya-Padrón, A. Jaume-I-Capó, and J. M. B. Rubio (2014) Evaluating the research in automatic emotion recognition. IETE Technical Review (Institution of Electronics and Telecommunication Engineers, India) 31 (3), pp. 220–232.
  • [17] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012.
  • [18] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
  • [19] H. Shi, Y. Yang, X. Zhu, S. Liao, Z. Lei, W. Zheng, and S. Z. Li (2016) Embedding deep metric for person re-identification: a study against large variations. In European Conference on Computer Vision, pp. 732–748.
  • [20] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer (2015) Discriminative learning of deep convolutional feature point descriptors. In Computer Vision (ICCV), 2015 IEEE International Conference on, pp. 118–126.
  • [21] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pp. 4004–4012.
  • [22] L. Van Der Maaten (2014) Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research 15 (1), pp. 3221–3245.
  • [23] R. Vemulapalli and A. Agarwala (2019) A compact embedding for facial expression similarity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5683–5692.
  • [24] J. Wang, F. Zhou, S. Wen, X. Liu, and Y. Lin (2017) Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2593–2601.
  • [25] J. Wang, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, Y. Wu, et al. (2014) Learning fine-grained image similarity with deep ranking. arXiv preprint arXiv:1404.4661.
  • [26] K. Q. Weinberger and L. K. Saul (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (Feb), pp. 207–244.
  • [27] B. Zhuang, G. Lin, C. Shen, and I. Reid (2016) Fast training of triplet-based deep binary embedding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5955–5964.