Classical distance metrics such as Euclidean distance and cosine similarity are limited and do not always perform well when computing distances between images or their parts. Recently, end-to-end methods [18, 1, 21, 25] have shown much progress in learning an intrinsic distance metric. They train a network to discriminatively learn embeddings so that similar images are close to each other and images from different classes are far away in the feature space. These methods have been shown to outperform others that adopt manually crafted features such as SIFT and binary descriptors [7, 20]. Feedforward networks trained by supervised learning can be seen as performing representation learning, where the last layer of the network is typically a linear classifier, e.g. a softmax regression classifier.
Representation learning is of great interest as a tool to enable semi-supervised and unsupervised learning. It is often the case that datasets comprise vast amounts of training data but relatively little labeled training data. Training with supervised learning techniques on the reduced labeled subset generally results in severe overfitting. Semi-supervised learning is an alternative that resolves the overfitting problem by also learning from the vast unlabeled data. Specifically, it is possible to learn good representations for the unlabeled data and use them to solve the supervised learning task.
The adoption of a particular cost function in learning methods imposes constraints on the solution space, whose shape can take any form satisfying the underlying properties induced by the loss function. For example, in the case of the triplet loss, the optimization of the cost function leads to a solution space where every object has its nearest neighbors within the same class. Unfortunately, it does not generate a much desired probability distribution function, which is achieved by our formulation.
In theory, we would like the solution manifold to be a continuous function representing the true original information because, as in the case of the facial expression recognition problem, facial expressions are points in the continuous facial action space resulting from the smooth activation of facial muscles. The transition from one expression to another is represented as the trajectory between the embedded vectors on the manifold surface.
The objective of this work is to offer a formulation for the creation of separable sub-spaces, each with a defined structure and a fixed data distribution. We propose a new loss function that imposes Gaussian structures in the creation of these sub-spaces. In addition, we also propose a new semi-supervised method to create sub-classes within each facial expression class, as exemplified in Figure 1.
2 Related Work
Siamese networks applied to signature verification showed the ability of neural networks to learn compact embeddings. OASIS and local distance learning [18, 1, 21, 25] approach the problem of learning a distance metric by discriminatively training a neural network. Features generated by those approaches are shown to outperform manually crafted features such as SIFT and various binary descriptors [7, 20].
Distance Metric Learning (DML) can be broadly divided into contrastive loss based methods, triplet networks, and approaches that go beyond triplets, such as quadruplets or even batch-wise losses. Contrastive embedding is trained on paired data, and it tries to minimize the distance between pairs of examples with the same class label while penalizing examples with different class labels that are closer than a margin. Triplet embedding is trained on triplets of data with an anchor point, a positive that belongs to the same class, and a negative that belongs to a different class [26, 11]. Triplet networks use a loss over triplets to push the anchor and positive closer, while penalizing triplets where the distance between the anchor and negative is less than the distance between the anchor and positive plus a margin. Contrastive embedding has been used for learning visual similarity for products, while triplet networks have been used for face verification, person re-identification, patch matching, learning similarity between images, and fine-grained visual categorization [18, 19, 25, 6, 1].
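To make the distinction concrete, a minimal PyTorch sketch of the two losses is given below; the margin values and the choice of Euclidean distance are illustrative assumptions, not the settings of any particular cited work.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, same_class, margin=1.0):
    """Pull same-class pairs together; push different-class pairs
    apart whenever they are closer than the margin."""
    d = F.pairwise_distance(z1, z2)                              # Euclidean distance per pair
    pos = same_class.float() * d.pow(2)                          # same-class: minimize distance
    neg = (1 - same_class.float()) * F.relu(margin - d).pow(2)   # different-class: enforce margin
    return (pos + neg).mean()

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Require the anchor-negative distance to exceed the
    anchor-positive distance by at least the margin."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()
```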
Several works are based on triplet loss functions for learning image representations. However, the majority of them use category label-based triplets [27, 24, 17]. Some existing works, such as [5, 25], have focused on learning fine-grained representations. In addition, a similarity measure computed over several existing feature representations has been used to generate ground-truth annotations for the triplets, while text-image relevance based on Google image search has also been used to annotate the triplets. Unlike those approaches, we use human raters to annotate the triplets. None of those works focus on facial expressions; only recently has a facial expression recognition system based on triplet loss been proposed.
3.1 Structured Gaussian Manifold Loss
Let $X = \{x_1, \ldots, x_N\}$ be a collection of i.i.d. samples to be classified into $C$ classes, and let $\omega_j$ represent the $j$-th class, for $j = 1, \ldots, C$. The computed class function $\hat{\omega}(x)$ returns the class of sample $x$ (the maximum a posteriori probability estimate) for the neural network function $f$, with the samples drawn independently according to the probability $p(x \mid \omega_j)$ for input $x$. Suppose we separate $X$ in an embedded space such that each set $X_j$ contains the samples belonging to class $\omega_j$. Our goal is to find a Gaussian representation for each $X_j$ which would allow a clear separation of $X$ in a reduced space $\mathbb{R}^d$.
We assume that the class-conditional density $p(z \mid \omega_j)$ of an embedded vector $z = f(x)$ has a known parametric form, and it is therefore determined uniquely by the value of a parameter vector $\theta_j$. For example, we might have $p(z \mid \omega_j) \sim \mathcal{N}(\mu_j, \Sigma_j)$, where $\theta_j = (\mu_j, \Sigma_j)$, for the normal distribution with mean $\mu_j$ and variance $\Sigma_j$. To show the dependence of $p(z \mid \omega_j)$ on $\theta_j$ explicitly, we write it as $p(z \mid \omega_j, \theta_j)$. Our problem is to use the information provided by the training samples to obtain a good transformation function $f$ that generates embedded spaces with a known distribution associated with each category. Then the a posteriori probability can be computed from $p(z \mid \omega_j, \theta_j)$ by Bayes' formula:
$$P(\omega_j \mid z) = \frac{p(z \mid \omega_j, \theta_j)\, P(\omega_j)}{\sum_{k=1}^{C} p(z \mid \omega_k, \theta_k)\, P(\omega_k)}.$$
We use the normal density function for $p(z \mid \omega_j, \theta_j)$. The objective is to generate embedded sub-spaces with a defined structure. Thus, using the Gaussian structures:
$$p(z \mid \omega_j, \theta_j) = \frac{1}{(2\pi)^{d/2} \lvert \Sigma_j \rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(z - \mu_j)^{\top} \Sigma_j^{-1} (z - \mu_j)\right),$$
where $\theta_j = (\mu_j, \Sigma_j)$. For the case $\Sigma_j = \sigma^2 I$, where $I$ is the identity matrix:
$$p(z \mid \omega_j, \theta_j) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left(-\frac{\lVert z - \mu_j \rVert^2}{2\sigma^2}\right).$$
In a supervised problem, we know the a posteriori probability $P(\omega_j \mid x)$ for the input set. From this, we can define our structured loss function as the mean square error between the a posteriori probability of the input set and the a posteriori probability estimated for the embedded space:
$$\mathcal{L}_{SGM} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} \left( P(\omega_j \mid x_i) - P(\omega_j \mid z_i) \right)^2.$$
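The following is a minimal PyTorch sketch of how this loss could be computed for the isotropic case; the per-batch class means, the fixed sigma, and the equal class priors are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def sgm_loss(z, targets, means, sigma=1.0):
    """Structured Gaussian Manifold loss (sketch, isotropic case).

    z:       (B, d) embedded vectors produced by the network.
    targets: (B,)   integer class labels (known posteriors are one-hot).
    means:   (C, d) one Gaussian mean per class.
    """
    # Log of the isotropic Gaussian density, up to a constant shared by all classes.
    sq_dist = torch.cdist(z, means).pow(2)                 # (B, C) squared distances ||z - mu_j||^2
    log_density = -sq_dist / (2 * sigma ** 2)

    # Posterior over classes via Bayes' rule with equal priors.
    posterior = F.softmax(log_density, dim=1)              # (B, C)

    # Mean squared error against the one-hot posteriors of the labeled inputs.
    target_posterior = F.one_hot(targets, num_classes=means.size(0)).float()
    return F.mse_loss(posterior, target_posterior)
```

With equal priors, normalizing the exponentiated log-densities with a softmax is equivalent to Bayes' formula above, since the Gaussian normalization constant shared by all classes cancels.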
We applied the steps described in Algorithm 1 to train the system. The batch size is given by $B = C \cdot n$, where $C$ is the number of classes and $n$ is the sample size per class. In this work, we use $n = 30$; thus, for eight classes, the batch size is 240, which was used for the estimation of the parameters in Equation 4.
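A sketch of this class-balanced batch construction is shown below; only the $C \cdot n$ sampling rule comes from the text, and the index-based dataset interface is an assumption.

```python
import random
from collections import defaultdict

def make_balanced_batch(labels, n_per_class=30):
    """Draw n samples per class so the batch size is C * n
    (e.g. 8 classes * 30 samples = 240). Assumes every class
    has at least n_per_class samples available."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    batch = []
    for label, indices in by_class.items():
        batch.extend(random.sample(indices, n_per_class))
    random.shuffle(batch)
    return batch  # list of dataset indices forming one batch
```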
We define the accuracy of the model as the ability of the parameter vector $\theta$ to represent the test dataset in the embedded space. The prediction of a class can be calculated as:
$$\hat{\omega}(z) = \arg\max_{j} P(\omega_j \mid z, \theta_j).$$
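In code, this maximum a posteriori prediction reduces, in the isotropic equal-prior case assumed in the sketch above, to picking the class whose Gaussian mean is nearest:

```python
import torch

def predict_class(z, means):
    """MAP prediction in the embedded space: with equal priors and a shared
    isotropic covariance, the most probable class is the nearest mean."""
    sq_dist = torch.cdist(z, means).pow(2)   # (B, C)
    return sq_dist.argmin(dim=1)             # (B,) predicted class indices
```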
3.2 Deep Gaussian Mixture Sub-space
The same facial expression may possess a different set of global features. For example, ethnicity can determine specific color and shape, while age provides physiological differences of facial characteristics; moreover, gender, weight, and other features can determine different facial characteristics, while having the same expression. Our proposal can group and extract these characteristics automatically. We propose to represent each facial expression class as a Gaussian mixture. The Gaussian parameters are obtained in an unsupervised way as part of the learning process. We start from a representation space given by Algorithm 1. Subsequently, a clustering algorithm is applied to separate each class into a new class subset. This process is repeated until reaching the desired granularity level. Algorithm 2 shows the set of steps to obtain the new sub-classes.
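A possible sketch of one refinement step of this unsupervised sub-class creation is given below; the use of scikit-learn's KMeans and the relabeling scheme are assumptions, since Algorithm 2 is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_into_subclasses(embeddings, labels, n_sub=2):
    """Split each class of an embedded space into n_sub sub-classes.
    Repeating this after re-training the embedding yields finer
    Gaussian mixtures per class."""
    new_labels = np.empty_like(labels)
    next_id = 0
    for c in np.unique(labels):
        mask = labels == c
        clusters = KMeans(n_clusters=n_sub, n_init=10).fit_predict(embeddings[mask])
        new_labels[mask] = clusters + next_id   # give every sub-cluster its own label
        next_id += n_sub
    return new_labels
```

Each pass multiplies the number of sub-classes, so the desired granularity is reached by alternating this split with re-training of the embedding.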
We evaluate the retrieval task using the Recall@K measure. Each test image (query) first retrieves its K nearest neighbours (KNN) from the test set and receives a score of 1 if an image of the same class is retrieved among the K neighbours, and 0 otherwise. Recall@K averages those scores over all the images. Moreover, we also evaluate accuracy, i.e., the fraction of retrieved results that belong to the same class as the queried image, averaged over all queries. The classification task is evaluated using KNN on the training set.
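A small sketch of these two retrieval measures, assuming embeddings are compared with Euclidean distance and that scikit-learn is available:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def recall_and_accuracy_at_k(embeddings, labels, k=5):
    """Recall@K: 1 if any of the K neighbours shares the query's class.
    Acc@K: fraction of the K neighbours that share the query's class."""
    # k + 1 neighbours because the nearest neighbour of a point is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    neighbour_labels = labels[idx[:, 1:]]          # drop the query itself
    hits = neighbour_labels == labels[:, None]
    recall = hits.any(axis=1).mean()
    accuracy = hits.mean()
    return recall, accuracy
```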
For the training process, we use the Adam method with a learning rate of 0.0001 and a batch size of 256 (32 samples per class to estimate the parameters in each iteration). In the TripletLoss case, we used 128 triplets in each batch. The neural networks were initialized with the same weights in all cases.
4.2.1 Representation and Recovery
The groups used for the evaluation of the measures are obtained using K-means, where K equals the number of classes (8 in the case of the FER+, AffectNet, and CK+ datasets, and 7 for the JAFFE and BU-3DFE datasets).
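For instance, a minimal version of this grouping step could look like the sketch below; normalized mutual information is used here only as one possible group-quality score, since the exact measures of Table 1 are not restated in this section.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def cluster_quality(embeddings, labels, n_classes):
    """Cluster the embedded test vectors into one group per class and
    score the grouping against the ground-truth labels."""
    predicted = KMeans(n_clusters=n_classes, n_init=10).fit_predict(embeddings)
    return normalized_mutual_info_score(labels, predicted)
```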
The results obtained for the clustering task show that the proposed method presents good group quality (see Table 1) in similar domains. As can be observed, the results degrade for different domains. In general, we observe that TripletLoss is the most robust to domain changes across all models. However, the best result is achieved using the proposed method with the ResNet18 model on FER+, CK+, and BU-3DFE.
Figure 2 shows a 2D t-SNE visualization of the learned SGMLoss embedding space using the FER+ training set. The amount of overlap between two categories in this figure roughly demonstrates the extent of the visual similarity between them. For example, happy and neutral have some overlapping objects, indicating that these cases could be confused easily, and both of them have very low overlap with fear, indicating that they are visually very distinct from fear. Also, the spread of a category in this figure indicates the visual diversity within that category. For example, the happiness category maps to several distinct regions, indicating that there are some visually distinct modes within this category.
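A sketch of how such a visualization could be produced with scikit-learn's Barnes-Hut t-SNE; the perplexity value and plotting choices are illustrative defaults, not reported settings.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_2d(embeddings, labels, out_path="tsne.png"):
    """Project the learned embedding to 2D with Barnes-Hut t-SNE and
    colour each point by its expression label."""
    coords = TSNE(n_components=2, method="barnes_hut", perplexity=30).fit_transform(embeddings)
    plt.figure(figsize=(8, 8))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=3, cmap="tab10")
    plt.savefig(out_path, dpi=150)
```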
Figure 3 shows the results obtained in the recovery task (Recall@K and Acc@K measures) for different values of K. TripletLoss obtains better recovery results for all K, but to the detriment of accuracy. Our method manages to increase its recovery value while preserving quality, meaning that most neighbors are of the same class. Figure 4 shows the top-5 retrieved images for some of the queries on the CelebA dataset. The overall results of the proposed SGMLoss embedding are clearly better than those of the TripletLoss embedding.
The proposed SGMLoss method can be used for FER by combining it with a KNN classifier. Figure 5 shows the average F1-score of SGMLoss and TripletLoss on the FER+ validation set as a function of the number of neighbors used. The F1-score is maximized at K = 11.
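A hedged sketch of this embedding-plus-KNN classification step; the neighbour count of 11 comes from the text, while the variable names are assumed.

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_expression_classifier(train_embeddings, train_labels, k=11):
    """Classify facial expressions by voting among the K nearest
    training embeddings of the SGMLoss space."""
    return KNeighborsClassifier(n_neighbors=k).fit(train_embeddings, train_labels)

# usage (assumed variable names):
# clf = knn_expression_classifier(train_z, train_y)
# predictions = clf.predict(test_z)
```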
Table 2 compares the classification performance of the SGMLoss embedding (using 11 neighbors) with TripletLoss and CNN models. In general, our method obtains the best classification results for all architectures; the ResNet18 CNN model does not obtain a significantly higher accuracy. Moreover, our results surpass the previously reported accuracy of 84.99.
Facial expression datasets constitute a great challenge due to the subjectivity of emotions. The labeling process requires the effort of a group of specialists to make the annotations. The FER+ and AffectNet datasets contain many label problems. An effort was made to improve the quality of the labels of FER+ (the dataset used in our experiments) by re-tagging the dataset using crowdsourcing. Figure 6 shows some mislabeled images retrieved by our method. The scale, position, and context could influence the decision of a non-expert tagger such as those used in crowdsourcing.
Experimental results show the quality of the embedded representation obtained by SGMLoss in classification problems. Our representation improves upon the representation obtained by TripletLoss, which is the most widely used method for identification and representation problems.
For the training process, we use the Adam method with a learning rate of 0.0001, a batch size of 640, and 500 epochs. The maximum level of subdivision used is L = 5 (this value guarantees that the batch size for a subclass at this level is 128). The ResNet18 architecture is selected to train on the FER+ dataset. The objective of this experiment is to visually analyse the clustering obtained by this approach.
Figure 7 shows the 64-dimensional embedded space, visualized with the Barnes-Hut t-SNE scheme, for the Deep Gaussian Mixture Sub-space model on the FER+ dataset. The method created five Gaussian sub-spaces per class for the unsupervised case.
For the clustering task, all embedded vectors are calculated and the EM method is applied, creating 40 groups. For each group, the medoid is calculated; the medoid is the object in the group closest to the centroid (the mean of the samples). The Top-k of a group contains the k objects nearest to the medoid of the group.
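A sketch of this grouping and summarization step, assuming scikit-learn's GaussianMixture as the EM implementation and Euclidean distance in the embedded space:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def group_medoids_topk(embeddings, n_groups=40, k=16):
    """Fit a Gaussian mixture with EM, then return for each group the
    indices of the k embedded vectors nearest to its medoid."""
    gmm = GaussianMixture(n_components=n_groups).fit(embeddings)
    assignments = gmm.predict(embeddings)
    top_k = {}
    for g in range(n_groups):
        members = np.where(assignments == g)[0]
        centroid = embeddings[members].mean(axis=0)
        dist_to_centroid = np.linalg.norm(embeddings[members] - centroid, axis=1)
        medoid = embeddings[members[dist_to_centroid.argmin()]]   # member closest to the mean
        dist_to_medoid = np.linalg.norm(embeddings[members] - medoid, axis=1)
        top_k[g] = members[np.argsort(dist_to_medoid)[:k]]        # k members nearest the medoid
    return top_k
```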
Figure 8 shows the Top-16 images obtained for the happiness category. The first group (Figure 8 (a)) shows an expression of happiness closer to surprise (raised eyebrows and open mouth), with the shapes of the eyes similar to each other. The second group (Figure 8 (b)) represents an expression closer to contempt. The third group (Figure 8 (c)) shows an expression of more intense happiness (the teeth are shown in all cases), with the shapes of the mouths very similar to each other. The fourth group (Figure 8 (d)) shows a subcategory that is present in all facial expressions: babies are a typically expected subset due to the intensity of expression and their physiognomic formation. Generally, babies and children from 1 to 4 years old present facial expressions of greater intensity. The last group (Figure 8 (f)) represents people with glasses and large eyes.
The presented method is a powerful tool for tasks such as photo album summarization. In this task, we are interested in summarizing the diverse expression content present in a given photo album using a fixed number of images. Figure 9 shows 5 of the 40 groups obtained on the AffectNet dataset. The obtained groups show great similarity in terms of FER. These results demonstrate the generalization capacity of the proposed method and its applicability to FER clustering problems.
We introduced two new metric learning representation models in this work, namely Structured Gaussian Manifold Learning and Deep Gaussian Mixture Subspace Learning. In the first model, we build a Gaussian representation of expressions, leading to robust classification and grouping of facial expressions. We illustrate, through many examples, the high quality of the vectors obtained in recovery tasks, thus demonstrating the effectiveness of the proposed representation. In the second case, we provide a semi-supervised method for grouping facial expressions. We were able to obtain embedded subgroups sharing the same facial expression group; these subgroups emerged due to shared specific characteristics other than the general appearance, for example, individuals with glasses expressing happiness.
- Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, Vol. 1, pp. 3.
- (2016) Training deep networks for facial expression recognition with crowd-sourced label distribution. In ACM International Conference on Multimodal Interaction (ICMI).
- (2015) Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG) 34 (4), pp. 98.
- (1994) Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744.
- Large scale online learning of image similarity through ranking. Journal of Machine Learning Research 11 (Mar), pp. 1109–1135.
- (2016) Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. pp. 1153–1162.
- (2016) Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (9), pp. 1734–1747.
- (2002) Facial action coding system. A Human Face.
- (2007) Image retrieval and classification using local distance functions. In Advances in Neural Information Processing Systems, pp. 417–424.
- (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 1735–1742.
- (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92.
- (2011) Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 117–128.
- (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
- (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738.
- (2008) Introduction to Information Retrieval. Vol. 1, Cambridge University Press.
- (2014) Evaluating the research in automatic emotion recognition. IETE Technical Review 31 (3), pp. 220–232.
- (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012.
- FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
- (2016) Embedding deep metric for person re-identification: a study against large variations. In European Conference on Computer Vision, pp. 732–748.
- (2015) Discriminative learning of deep convolutional feature point descriptors. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 118–126.
- (2016) Deep metric learning via lifted structured feature embedding. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4004–4012.
- (2014) Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research 15 (1), pp. 3221–3245.
- (2019) A compact embedding for facial expression similarity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5683–5692.
- (2017) Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2593–2601.
- (2014) Learning fine-grained image similarity with deep ranking. arXiv preprint arXiv:1404.4661.
- (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (Feb), pp. 207–244.
- (2016) Fast training of triplet-based deep binary embedding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5955–5964.