More Information Supervised Probabilistic Deep Face Embedding Learning

by   Ying Huang, et al.

Research using margin-based comparison losses demonstrates the effectiveness of penalizing the distance between face features and their corresponding class centers. Despite their popularity and excellent performance, these losses do not explicitly encourage generic embedding learning for an open set recognition problem. In this paper, we analyse margin-based softmax loss from a probabilistic view. From this perspective, we propose two general principles for designing new margin loss functions: 1) monotonic decrease and 2) margin probability penalty. Unlike methods optimized with a single comparison metric, we treat open set face recognition as a problem of information transmission, in which the generalization capability of the face embedding is gained with more clean information. An auto-encoder architecture called Linear-Auto-TS-Encoder (LATSE) is proposed to corroborate this finding. Extensive experiments on several benchmarks demonstrate that LATSE helps the face embedding gain more generalization capability and boosts the single-model performance with open training datasets to more than 99% on the MegaFace test.




1 Introduction

Face recognition performance has improved dramatically in recent years, and margin-based loss functions (Schroff et al., 2015; Liu et al., 2017b) play an important role in this progress. The most common solutions treat face recognition as a classification problem. These works utilize a deep convolutional network, such as VGGNet (Simonyan and Zisserman, 2014) or ResNet (He et al., 2016), to map landmark-aligned face images to their corresponding class labels. A comparison loss function (Hadsell et al., 2006), such as softmax cross entropy loss, is employed in the learning process. Recent research demonstrates that a more discriminative face embedding can be gained by explicitly adding a margin to the loss. FaceNet (Schroff et al., 2015) adds a margin by penalizing the distance between face features and their corresponding class centers in Euclidean space (triplet loss), while SphereFace (Liu et al., 2017b) penalizes this distance in hyperspherical angle space (angular margin softmax loss).

Figure 1: Examples of generated frontal faces for different identities. Rows from top to bottom: the original input face image, the frontal face image generated from the corresponding feature, and the absolute difference between the input image and the generated one.

However, all these methods strengthen the embedding learning process from the perspective of comparison, so these recognition algorithms optimize a single metric of discriminative ability. At the same time, they do not give general guidance for designing new margin loss functions.

Robust face embedding learning, however, is an open set problem (Geng et al., 2018; Boult et al., 2019): we cannot include images of all possible human identities in the training dataset. We therefore seek to answer two primary questions. How can more discriminative power be gained with an appropriate margin definition? Does discriminative ability equal generalization capability in the open set situation?

Following these questions, we first explore the effect of the margin in softmax cross entropy loss (Liu et al., 2016; Wang et al., 2018b; Deng et al., 2019a) from a probabilistic view. Through detailed analysis, this work proposes two principles to guide the definition of new margin loss functions. The first is monotonic decrease. Let $\theta$ be the angle between a face feature and its class center. The transformation function, which maps $\theta$ to a probability value, should decrease monotonically over the domain of $\theta$. The second is the margin probability penalty: a new margin loss should guarantee a non-negative probability penalty.

We also find that a single comparison metric may not direct the model toward an optimally generalized face embedding. In contrast to these single-metric methods, we investigate a different approach: we treat face recognition as an information transmission problem that can be supervised by class labels. Unlike the classification task, which only strengthens discriminative power, information transmission can be seen as a multi-objective optimization problem. One objective is comparison-metric-based discriminative embedding learning. The other is how much accurate information is passed to the face feature.

Based on this motivation, we build the learning framework as an auto-encoder (Hinton and Salakhutdinov, 2006) architecture, which can be trained with a teacher student (Hinton et al., 2015; Liu et al., 2017a) learning strategy. This Linear-Auto-TS-Encoder (LATSE) architecture ensures that the proper information passes through the whole network. Different parts of the framework contribute distinct effects to the embedding learning. The decode part is a generative network (Goodfellow et al., 2014; Zhao et al., 2016) that maps a face embedding to the frontal face image of the corresponding individual. This generative model is supervised by pixel-level image labels. The generated frontal faces are shown in Figure 1. In the encode part, we employ a deep ResNet to learn the face embedding, with a new margin loss defined according to the proposed principles serving as the comparison loss. The goal of the teacher network is to filter noise and ensure cleaner information for face embedding learning. The effects of these parts are discussed in detail below. We conduct extensive experiments on large-scale face recognition benchmarks. The experimental results verify our findings and the effectiveness of the proposed architecture.

2 Related Work

Margin Based Comparison Loss. Methods that learn face embeddings with a comparison loss fall into two main streams. The first obtains the face embedding directly from the raw image by comparing match/non-match pairs, as in triplet loss (Schroff et al., 2015). The second first trains a multi-class CNN using a margin-based softmax loss function, such as large-margin softmax loss (Liu et al., 2016) or arccos loss (Deng et al., 2019a), and then uses the feature layer before the softmax as the face embedding. Recent works mainly focus on how to adjust the margin or other hyperparameters. Liu et al. (2019) argue that the margin between unbalanced face classes should be learned adaptively during training. RegularFace (Zhao et al., 2019) explicitly penalizes the angle between an identity and its nearest neighbor in order to increase inter-class separability. AdaCos (Zhang et al., 2019) tunes the margin and scale parameters automatically to strengthen the discriminative power of the face embedding. However, the single optimization metric in all these methods, which enhances intra-class compactness and inter-class dispersion, may ignore the nature of the open set recognition problem.

Figure 2: The proposed Linear-Auto-TS-Encoder (LATSE) architecture. The red bottom part is the student encode net, the yellow part is the generative decode net, and the top blue part is the teacher network. A linear comparison loss and a pixel-level generative loss are employed for the final face embedding learning.

Generative Model. Generative network learning usually starts with a latent code drawn from a noise distribution, which the model then maps to a generated sample, as in the probabilistic GAN (Goodfellow et al., 2014) and the Energy Based GAN (Zhao et al., 2016). In Cycle GAN (Zhu et al., 2017), a cycle-consistent loss is proposed to learn from unpaired images. However, few works explore whether a generative model can enhance the generalization ability of the embedding in the open set recognition situation.

Teacher Student (TS) Learning Strategy. Noise is inevitable in large-scale datasets and heavily affects the performance of face recognition algorithms (Wang et al., 2018a). Extremely clean large-scale datasets are expensive to obtain, so learning with noise plays a significant role in model training. MentorNet (Jiang et al., 2017) utilizes a pre-trained teacher network to drop and update corrupted labels for the student network during learning. De-Coupling (Malach and Shalev-Shwartz, 2017) alleviates the influence of noise by updating the parameters only when the predictions from two classifiers disagree. Co-Mining (Wang et al., 2019) simultaneously trains two peer networks to redistribute the labels of raw data. Although these methods have made efforts to reduce the corruption caused by noisy labels, there is still room for improvement, such as careful design for face recognition with a large number of classes, lower resource consumption, and an easier training process.

Multi-Modal Supervision For CNN Models. CNNs for image recognition are generally trained with image-level semantic labels (Krizhevsky et al., 2012), while several studies demonstrate that multi-modal supervision can strengthen the generalization ability of the learned models in different types of tasks. MTCNN (Zhang et al., 2016) and RetinaFace (Deng et al., 2019b) find that face detection performance can be boosted by using face landmarks along with bounding box labels. Mask R-CNN (He et al., 2017) boosts performance on instance segmentation, bounding-box object detection, and person keypoint detection by utilizing object-level bounding boxes and pixel-level semantic labels together. Depth and time related information is used in Depth-Patch-Net (Atoum et al., 2017) and AuxNet (Liu et al., 2018b) to improve accuracy for face anti-spoofing. However, few works explore whether face recognition performance can be improved by extra supervision. In this work, we leverage pixel-level face images as extra supervision to improve face recognition accuracy.

3 Learning Face Feature Embedding with Multi-Modal Supervision

The intuition behind this work is that more orthogonal information leads to less uncertainty, and we therefore treat face recognition as a problem of information transmission. Following this guide, we build our learning architecture as an auto-encoder framework that is trained with a teacher student learning strategy. This architecture is illustrated in Figure 2. Although there are three key components in the proposed algorithm, the model can be trained end-to-end from scratch. In the following paragraphs, we first illustrate the probability theory behind margin comparison loss. Details of multi-level supervision are given next, followed by the teacher student learning strategy and the whole network training algorithm.

3.1 Margin Softmax Loss in Probability View

A line of research adds a margin to the softmax cross entropy loss. Previous works explain their theory with a geometric interpretation (Liu et al., 2016; Deng et al., 2019a). In this paper, we explain the effect of the margin from a probabilistic view and summarize two principles that can be followed when a new margin loss is needed.

First, we start from the most widely used softmax loss function. Let $x_i$ represent the extracted feature for a face image from identity $y_i$. $W_j$ and $W_{y_i}$ are the model weights in the fully-connected classifier layer for identities $j$ and $y_i$, and $b_j$, $b_{y_i}$ are the biases. $N$ is the total image number and $n$ is the total category number in the training set. Then the softmax loss function can be presented as:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T}x_i + b_j}} \qquad (1)$$

This loss consists of two components, each playing a distinct role in model optimization. One is the softmax function, which transforms the predicted value from the fully-connected layer into the probability of the corresponding class. The other is a cross entropy loss, which measures the difference between the predicted probability and the given label distribution. We can therefore decouple the softmax loss in Equation 1 into these parts. For an input image $i$, the softmax function computing the probability for the target label $y_i$ can be presented as:

$$P_{i,y_i} = \frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T}x_i + b_j}} \qquad (2)$$

Let $p$ represent the real data distribution and $q$ be the model predicted probability distribution. Then the cross entropy loss computes the cross entropy $H(p, q)$ between them, which can be presented as:

$$H(p, q) = -\sum_{j=1}^{n} p_j \log q_j \qquad (3)$$

Following the research on margin softmax loss (Liu et al., 2017b; Deng et al., 2019a), we do not modify the form of the cross entropy loss and instead adjust the margin in the softmax function. For simplicity, we fix the bias term $b_j = 0$ as in (Liu et al., 2016) and normalize the feature, $\|x_i\| = 1$, and the classifier layer weight, $\|W_j\| = 1$. The predicted value from the fully-connected layer then simplifies to $W_j^{T}x_i = \cos\theta_j$. After that, the output is multiplied by a scale parameter $s$. The softmax function is now formulated as:

$$P_{i,y_i} = \frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}} \qquad (4)$$

where $\theta_{y_i}$ is the angle between the predicted face feature and its normalized class center. In previous works (Liu et al., 2018a; Chen et al., 2019), the angle is shown to be correlated with visual semantics. Decreasing this angle can boost the discriminative ability of the learned model. Margin based softmax methods multiply the target angle by a margin ($m_1$) or add a positive margin value ($m_2$ or $m_3$) to the target label term, which is expressed in the form:

$$P_{i,y_i} = \frac{e^{s(\cos(m_1\theta_{y_i} + m_2) - m_3)}}{e^{s(\cos(m_1\theta_{y_i} + m_2) - m_3)} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}} \qquad (5)$$

Rethinking the effect of softmax function, it normalizes the predicted value from the fully-connected layer and transfers this value to the probability of the corresponding class. Then we can define a general function to normalize the predicted value:


Where should always hold for any input pair . At the same time, the definition of this generalized function must satisfy probability law. While in softmax function, it employs as the definition of . By adding margin in this normalization function , we can get :


Where and are general functions to transform fully-connected layer output to a non-negative value.

The first principle for the definition of and is non-negative and to penalize probability when margin is added. This principle is formulated as:


When we have non-negative margin value and , should be monotonically decreasing when the margin increases.


Margin based loss decreases the probability of target label when the model predicts same angle between feature and class center, expressed as .
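
The margin probability penalty can be checked numerically. The following is a minimal Python sketch, assuming unit-norm features and weights so that each logit is a scaled cosine, and using an additive cosine margin in the style of CosFace purely as an example. The function name `target_probability`, the scale 64 and the margin 0.35 are illustrative assumptions, not values taken from this paper.

```python
import math


def target_probability(theta_target, other_thetas, scale=64.0, margin=0.0):
    """Softmax probability of the target class from angles (in radians).

    Every logit is scale * cos(angle); the additive margin (an assumed,
    CosFace-style example) is subtracted from the target cosine only.
    """
    target_logit = scale * (math.cos(theta_target) - margin)
    other_logits = [scale * math.cos(t) for t in other_thetas]
    # Subtract the largest logit before exponentiating for numerical stability.
    shift = max([target_logit] + other_logits)
    numerator = math.exp(target_logit - shift)
    denominator = numerator + sum(math.exp(v - shift) for v in other_logits)
    return numerator / denominator


# Same angles with and without a margin (all values are illustrative).
theta = math.radians(40)
others = [math.radians(a) for a in (60, 75, 90)]
p_plain = target_probability(theta, others, margin=0.0)
p_margin = target_probability(theta, others, margin=0.35)
print(p_plain, p_margin)
```

For the same angle, the probability with a margin is lower than the one without it, so the model has to reduce the angle further to recover the same probability.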

Another principle for function and is to make them monotonic decrease in the domain of ( when ). If the model tries to get the same probability under a margin based function, it must step forward to decrease the angle between feature and class center. This constraint makes the learned model gain more discriminative power by learning small angle between similar inputs.

Following these proposed principles, we can generalize the margin based probability function to any new formats. In this paper, we test our method with a linear definition under these principles, specialize and . We set and . This specialization case can be formulated as:


Finally, the is computed by cross entropy loss function between model prediction and label probability . This is defined as:


Visualize the effect of margin by target logit curve.

Different kinds of margin try to decrease the predicted probability for the ground truth by penalizing the target predicted value, that is, they increase the cross entropy loss under the same angle $\theta$. Their relation to the original softmax function is shown in Figure 3.

Figure 3: Target logit curves for different loss function

Although different margin losses share a similar goal, the proposed principles lead to a better-defined normalization function and provide guidance when problems occur in model training. From the target logit curves, we find that SphereFace (Liu et al., 2017b) has a nonlinear penalty that increases with the angle $\theta$, which leads to divergence at the beginning of training, when the initial angle is large. ArcFace (Deng et al., 2019a) is not monotonically decreasing over the whole domain of $\theta$, so it shrinks the margin instead of magnifying it when the angle is too large. CosFace (Wang et al., 2018b) is too smooth over part of the domain, meaning that the loss decays little as the angle becomes smaller in this area, which makes it hard to shrink the angle at the final stage of training. By contrast, the proposed linear form steadily increases the target predicted value as the angle decreases. This setting makes it easier for the model to learn a small angle between a face embedding and its class center.
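
The monotonicity argument can be illustrated with a small Python sketch that samples target logit curves over angles from 0 to pi. The ArcFace and CosFace forms, cos(theta + m) and cos(theta) - m, are the standard definitions from those papers. The linear form `1 - 2 * theta / pi - m` is only a hypothetical stand-in for a linear target logit, since the exact linear definition is not recoverable from the text above, and the margin values are assumptions.

```python
import math


def softmax_logit(theta):
    # Plain normalized softmax: the target logit is the cosine of the angle.
    return math.cos(theta)


def arcface_logit(theta, m=0.5):
    # Additive angular margin (ArcFace).
    return math.cos(theta + m)


def cosface_logit(theta, m=0.35):
    # Additive cosine margin (CosFace).
    return math.cos(theta) - m


def linear_logit(theta, m=0.35):
    # Hypothetical linear target logit, used only for illustration.
    return 1.0 - 2.0 * theta / math.pi - m


def is_monotonically_decreasing(fn, steps=1000):
    """Check that fn never increases on an evenly spaced grid over [0, pi]."""
    thetas = [math.pi * i / steps for i in range(steps + 1)]
    values = [fn(t) for t in thetas]
    return all(b <= a for a, b in zip(values, values[1:]))


for name, fn in [("softmax", softmax_logit), ("arcface", arcface_logit),
                 ("cosface", cosface_logit), ("linear", linear_logit)]:
    print(name, is_monotonically_decreasing(fn))
```

Under these assumptions the ArcFace curve fails the check because cos(theta + m) rises again once theta + m exceeds pi, while the cosine curves change very little near theta = 0 compared with the constant slope of the linear curve.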

3.2 Pixel Level Supervision for Face Generative Net

Because it optimizes a single classification metric (intra-class compactness and inter-class dispersion), margin-based comparison loss hits a bottleneck in gaining more generalization ability for face embedding learning. We propose a second metric to be optimized in the embedding learning process: whether the model can transfer more information to the face feature. A generative network with pixel-level supervision is added for this purpose. This generative part tries to restore the frontal face image from the embedding, so the learned face embedding needs to keep as much information as possible. Although we train this generative part with pixel-level supervision, the supervision is gained without any extra human labeling except the identity class label for the whole image. The loss that supervises the generative model consists of two components. One is for frontal face regression. The other is related to the structural similarity index (Wang et al., 2004), which compares the similarity between the generated sample and the input image. Given a face image of an identity, the generative model outputs a generated face, and its label is the momentum mean image of that identity, which can be expressed as:


Then We gain the definition of :


Next the total generative loss is expressed as:


The structural similarity component encourages the model to transfer more information to the embedding, while the regression component makes the generated image resemble the frontal face of the input identity.
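
A rough Python sketch of such a two-part generative loss is given below. It rests on several assumptions that are not confirmed by the text: the momentum mean image is modelled as an exponential moving average, the regression term is a mean absolute error, the structural similarity is computed over the whole image in a single window rather than with local windows, and the two terms are simply added with a hypothetical weight `ssim_weight`. Images are flat lists of pixel values in [0, 1].

```python
def update_momentum_mean(mean_image, new_image, momentum=0.9):
    """Exponential moving average of an identity's images (assumed form)."""
    return [momentum * m + (1.0 - momentum) * x
            for m, x in zip(mean_image, new_image)]


def l1_loss(generated, target):
    """Mean absolute pixel difference, used here as the regression term."""
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(generated)


def global_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM (Wang et al., 2004) for pixel values in [0, 1]."""
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2.0 * mu_a * mu_b + c1) * (2.0 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def generative_loss(generated, input_image, mean_image, ssim_weight=1.0):
    """Regression toward the mean image plus a (1 - SSIM) term on the input."""
    regression = l1_loss(generated, mean_image)
    similarity = 1.0 - global_ssim(generated, input_image)
    return regression + ssim_weight * similarity
```

The loss is zero only when the generated image matches both the mean image and the input image, which reflects the two goals described above.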

3.3 Teacher Student Learning Strategy

Nowadays, most large-scale datasets are obtained using search engines, and researchers apply automatic or semi-automatic approaches to clean their identity labels. This process leads to two types of noise: 1) label mess, in which images of the same person are marked with different labels, and 2) distractors, which are labeled examples that do not belong to any class in the dataset. Both types of noise harm the generalization ability of the learned face embedding. Current methods (Jiang et al., 2017; Malach and Shalev-Shwartz, 2017; Wang et al., 2019) provide various strategies for training on noisy datasets, but some are designed for binary classification and others make the training process too complicated. To address these deficiencies while taking the character of face embeddings into account, we propose a teacher student learning strategy that filters label noise in real time during training.

This strategy uses only the samples that the pre-trained teacher model predicts correctly to update the student network, so that the training samples are sufficiently clean given the prior knowledge of the teacher network. Recent research (Yu et al., 2019; Han et al., 2018) describes the memorization effect of deep neural networks, arguing that networks first memorize training data with clean labels. We therefore train a teacher network on the large-scale dataset so that it memorizes as much clean-label data as possible, and then fix its parameters. The teacher network then filters noise for the student network, which further strengthens the representational ability of the face embedding. The filtering strategy of the teacher network is optional, which provides more flexibility. Moreover, there is no need to train the teacher and student models simultaneously. This setting simplifies the training procedure and consumes less training time and fewer resources than other methods.

Intuitively, the more correct knowledge a student gains from the teacher, the more distinct its understanding of the problem becomes. Likewise, in the training process, the face embedding learned by the student network gains better discriminability and generalization from the clean data provided with the help of the auxiliary teacher network. In addition, from the view of information transmission, the teacher network acts as a high-reliability signal channel between the face image data distribution and the identity labels.
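
The filtering step can be sketched in a few lines of Python. The sketch assumes that the teacher outputs one probability per class and that a sample is kept when its label is among the teacher's k most probable classes, with k = 1 meaning that the teacher's prediction equals the label. The function names and the toy probabilities are hypothetical.

```python
def teacher_keeps_sample(teacher_probs, label, k=1):
    """Return True when the label is among the teacher's top-k classes."""
    ranked = sorted(range(len(teacher_probs)),
                    key=lambda c: teacher_probs[c], reverse=True)
    return label in ranked[:k]


def filter_batch(batch_teacher_probs, labels, k=1):
    """Per-sample mask: only samples marked True would update the student."""
    return [teacher_keeps_sample(probs, label, k)
            for probs, label in zip(batch_teacher_probs, labels)]


# Toy batch with three classes; the second and third labels look noisy.
batch_probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.3, 0.5]]
batch_labels = [0, 2, 1]
print(filter_batch(batch_probs, batch_labels, k=1))
```

A larger k keeps more samples and filters less aggressively, while k = 1 keeps only the samples on which the frozen teacher agrees with the given label.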

3.4 Whole network training

For whole network training, we first obtain a teacher model on the training data and then fix its parameters. Next, we train the student network and the generative model on the same dataset with a joint loss formulated as:


Finally, the parameters of the student model are saved for testing. The whole training process is illustrated in Algorithm 1.

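  Input: face images, face class labels
  First train the teacher network with the margin loss
  for each teacher training iteration do
     Learn the teacher parameters with the images and labels using the margin loss
  end for
  Then fix the parameters of the teacher network
  Next estimate the student network and the generative network with the images and labels by the joint loss
  for each student training iteration do
     if the image label is among the teacher's top predicted classes then
        Update the student network and the generative network
     end if
  end for
  Finally save the student parameters for testing
Algorithm 1 Linear-Auto-TS-Encoder learning algorithm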

4 Experiments

4.1 Implementation Details

Datasets. We employed CASIA (Yi et al., 2014) as the small training set and MS1MV2 (Guo et al., 2016) or MS1M-RetinaV (Deng et al., 2019a) from ArcFace (Deng et al., 2019a) as the large training dataset. We compared performance on both face verification and identification tasks on several benchmark datasets, including Labelled Faces in the Wild (LFW) (Huang et al., 2008), Celebrities in Frontal Profile (CFP-FP) (Sengupta et al., 2016), Age Database (AgeDB-30) (Moschoglou et al., 2017), Cross-Age LFW (CALFW) (Zheng et al., 2017), Cross-Pose LFW (CPLFW) (Zheng and Deng, 2018) and MegaFace (Nech and Kemelmacher-Shlizerman, 2017).

Experimental Settings. The proposed method was implemented with MXNet (Chen et al., 2015). We reduced memory cost with the help of memonger (Chen et al., 2016). The data preprocessing step followed the margin-based softmax loss papers (Liu et al., 2017b; Wang et al., 2018b; Deng et al., 2019a). The detected faces were aligned by five facial key points and resized to a fixed dimension as the network input.

ResNet (He et al., 2016) with depths of 34, 50, 100 and 124 was employed as the encode part of the auto-encoder, followed by a structure of Batch Normalization (Ioffe and Szegedy, 2015), Dropout, Fully-Connected layer and Batch Normalization to get the final 512-dimensional face embedding. A reversed ResNet-18 with deconvolution layers (Noh et al., 2015) to up-sample feature maps was adopted as the generative decode part. The teacher network used the same setting as the encode part. The parameters of the teacher network were fixed during the student network training process.

For the hyperparameter setting, we chose the normalized scale in the spherical manifold by following (Liu et al., 2017b). Models were trained on eight NVIDIA Tesla V100 GPUs (16GB) with a total batch size of 768. The learning rate started from 0.1 and was divided by 10 at 10K, 16K, 20K and 22K iterations. We applied weight decay and set momentum to 0.9. At test time, we only computed the 512-dimensional feature for each normalized face from the student network and used the cosine similarity between features as the similarity score between face images.
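
The learning rate schedule and the test-time similarity score described above can be sketched in plain Python. Whether the rate drops exactly at a milestone iteration or just after it is an assumption, and the function names are hypothetical; the actual experiments used MXNet rather than this code.

```python
import math


def learning_rate(iteration, base_lr=0.1,
                  milestones=(10_000, 16_000, 20_000, 22_000)):
    """Step schedule: divide the rate by 10 at every milestone reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr / (10 ** passed)


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Rate at a few iterations, and the score of two toy 3-dimensional embeddings.
print([learning_rate(i) for i in (0, 10_000, 16_000, 20_000, 22_000)])
print(cosine_similarity([1.0, 2.0, 0.5], [0.9, 2.1, 0.4]))
```

In practice the embeddings would be the 512-dimensional student features, and a verification decision would compare the score against a threshold chosen on a validation set.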

Loss Functions LFW CFP-FP AgeDB-30
ArcFace (Deng et al., 2019a) 99.53 95.56
SphereFace (Liu et al., 2017b) 99.42 - -
CosFace (Wang et al., 2018b) 99.51 95.44 94.56
LinearFace 94.66
LinearFace 99.48 96.81 95.05
LinearFace 99.48 96.80 94.40
Table 1: Face verification results (%) with different margin losses (models trained on CASIA, ResNet50)

4.2 Ablation Study of proposed key components

The performance with different margin losses. First, we explored the effect of different margin losses. For a fair comparison, the proposed LinearFace method was trained following the setting in (Deng et al., 2019a), with ResNet50 as the embedding backbone on the CASIA dataset. The verification results on LFW, CFP-FP and AgeDB-30 are reported in Table 1. We observed better accuracy on LFW and CFP-FP with the linear margin function than with the other margin softmax losses, which demonstrates the effectiveness of the principles proposed in Section 3.1.

resnet34 99.65 95.85 92.12
resnet34(k=1) 96.03 97.23
resnet50 99.80 95.80 92.74
resnet50(k=1) 95.93 98.25
resnet100 99.77 98.27
resnet100(k=1) 98.64
Table 2: Face verification results (%) with the Teacher Student learning strategy

Method Identification Verification
resnet34 96.09 96.72
resnet34(k=1) 97.61 98.03
resnet50 97.26 97.62
resnet50(k=1) 98.26 98.48
resnet100 98.35 98.6
resnet100(k=1) 98.56 98.8
Table 3: MegaFace performance (%) with the Teacher Student learning strategy

Figure 4: Samples selected by the teacher network. Examples in the same row were marked with the same label. The leftmost column shows data with true labels; the other columns show data with noisy labels. With the teacher network's selection, the student network avoids pollution from noisy labels.

The effectiveness of Teacher Student Strategy. To show the effectiveness of the teacher student learning strategy, we conducted a number of validation experiments. All experiments employed MS1MV2 as the training dataset and the ArcFace loss as the training loss to keep the comparison fair. We set the model performance from (Deng et al., 2019a) as the baseline. The performance of the student network was explored with different values of $k$ for the teacher network. We assumed that an image label was correct if it was among the teacher network's top-$k$ predicted probabilities. Only when the image label fell within the teacher's predicted top-$k$ classes was the gradient passed to the student network to update its parameters. As shown in Table 2, our teacher student learning strategy could be applied effectively at various network depths (34, 50 and 100). The strategy not only obtained better results on several benchmarks, notably on CFP-FP when the depth was 50, but also outperformed the baseline on the large-scale evaluation set. The results in Table 3 verified that our strategy strengthens the generalization ability of the face embedding. In addition, Figure 4 visualizes clean training samples provided by the teacher network during student training. With the help of the teacher network, the student network obtained cleaner samples from the large-scale training dataset.

Component CALFW CPLFW AgeDB-30
LinearFace 93.3 89.08 94.66
LinearFace+TS 93.73 89.75 95.05
LinearFace+Gen 94.3 90.13
Table 4: The effect of the different proposed components

The advantage of proposed architecture. Next, we compared four architectures equipped with different components to show the advantages of the proposed architecture. The face verification performance on different benchmark datasets is shown in Table 4. LinearFace denotes an architecture equipped with only the proposed margin loss. TS is short for the teacher student learning strategy and Gen is short for the generative decode part. The row LinearFace+TS+Gen shows the performance of the architecture proposed in this paper. All these models were trained on the CASIA dataset with ResNet50 as the backbone. The results show that the proposed architecture effectively boosts face verification performance.

Figure 5: Examples of generated frontal faces for the same identity from embedding models at various training stages. Rows from top to bottom: early-stage, middle-stage and last-stage embedding models. More details are preserved in the frontal face images generated from later-stage embedding models.

Visualize samples from generative model. Recent research explains why margin-based algorithms work for face embedding learning by giving a geometric interpretation, but a perceptual intuition of what the embedding network has learned is missing. With the help of the auto-encoder architecture, we can directly visualize what different embeddings look like. Figure 5 shows the generated samples for the same identity. We employed embedding models from the early, middle and last stages of training as the encode model to get the face image feature, and then used a pre-trained generative decode net to restore the samples from these features. The figure shows that, as training progresses, more details are preserved in the restored frontal face image. This suggests that the embedding model has gained more discriminative power and an increased capacity to retain information.

4.3 Evaluation Comparison

Results for face Verification on LFW, CALFW, CPLFW. The LFW (Huang et al., 2008) dataset is the most widely used benchmark for unconstrained face verification on images. Recent algorithms (Deng et al., 2019a; Wang et al., 2018c) achieve nearly saturated performance on it. Instead of only comparing performance on LFW, we also employed the CALFW and CPLFW datasets, which involve higher pose and age variations for the same identities from LFW, to evaluate face verification performance. The experimental results are reported in Table 5. Several algorithms gained better performance than Human-Individual. Comparing the algorithms, although ArcFace and our proposed LATSE gain similar performance on the LFW test, our method improved accuracy by 1% on average on the CALFW and CPLFW benchmarks. This shows that the proposed method generalizes better when more pose and age variations are involved.

Human-Individual 97.27 82.32 81.21
Human-Fusion 86.50 85.24
CenterLoss (Wen et al., 2016) 98.75 85.48 77.48
SphereFace (Liu et al., 2017b) 99.27 90.30 81.40
VGGFace2 (Cao et al., 2018) 99.43 90.37 84.00
ArcFace (Deng et al., 2019a) 95.45 92.08
Proposed LATSE
Table 5: Face verification results (%) for different algorithms

Results for MegaFace test. The MegaFace (Kemelmacher-Shlizerman et al., 2016) dataset includes 1,000,000 images of 690,000 different identities as the distractors in the gallery set and 100,000 photos of 530 unique celebrities from FaceScrub as the probe set. There are two testing scenarios, face identification and face verification. Testing is conducted under two training protocols, large or small, where a training set is defined as large if it contains more than 500,000 images. Following the setting in ArcFace (Deng et al., 2019a), we employed CASIA as the training dataset for the small protocol and the MS1M-RetinaV dataset as the training set for the large protocol. For MegaFace test data inference, we used the cleaned version from ArcFace to test the proposed method, which is noted as ‘r’ in Table 6. Both identification and verification accuracy are reported to compare performance on large-scale face recognition. Our method improved both accuracies under the small and large training protocols. The results show that the proposed architecture gains more generalization capability for large-scale face recognition in an open set situation. We gained more than 99% accuracy on the MegaFace test without a private dataset.

| Method | Identification | Verification |
| --- | --- | --- |
| Softmax (Liu et al., 2017b) | 54.85 | 65.92 |
| TripletLoss (Schroff et al., 2015) | 64.79 | 78.32 |
| SphereFace (Liu et al., 2017b) | 72.73 | 85.56 |
| SphereFace+ (Liu et al., 2018a) | 73.03 | - |
| CosFace (Wang et al., 2018b) | 77.11 | 89.88 |
| ArcFace (CASIA) (Deng et al., 2019a) | 77.50 | 92.34 |
| ArcFace, r (CASIA) (Deng et al., 2019a) | 91.75 | 93.69 |
| Proposed LATSE, r (CASIA) | | |
| TripletLoss (Schroff et al., 2015) | 70.49 | 86.47 |
| CosFace (Wang et al., 2018b) | 82.72 | 96.65 |
| ArcFace, r (Deng et al., 2019a) | 98.35 | 98.48 |
| SV-AM-Softmax, r (Wang et al., 2018c) | 98.82 | 99.03 |
| Proposed LATSE, r | | |

Table 6: MegaFace results (%) for different algorithms.
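
The identification scenario can be illustrated with a small sketch of rank-1 identification against a gallery of distractors. This is a simplified illustration under stated assumptions, not the official MegaFace evaluation code: the function names and toy embeddings are hypothetical, and each probe is assumed to have exactly one gallery image of the same identity (its "mate") competing against every distractor.

```python
import math


def normalize(vec):
    """Scale an embedding to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank1_identification(probe_pairs, distractors):
    """Fraction of probes whose mate is more similar than every distractor.

    probe_pairs: list of (probe_embedding, mate_embedding), same identity
    distractors: list of embeddings belonging to other identities
    """
    unit_distractors = [normalize(d) for d in distractors]
    hits = 0
    for probe, mate in probe_pairs:
        probe = normalize(probe)
        mate_score = dot(probe, normalize(mate))
        best_distractor_score = max(dot(probe, d) for d in unit_distractors)
        # The probe counts as a rank-1 hit only if its mate beats all distractors.
        hits += int(mate_score > best_distractor_score)
    return hits / len(probe_pairs)


# Toy data (assumed for illustration only).
distractors = [[0.0, 1.0], [1.0, 1.0]]
probe_pairs = [
    ([1.0, 0.0], [0.9, 0.1]),  # mate is closer than any distractor
    ([0.6, 0.8], [1.0, 0.0]),  # a distractor is closer than the mate
]

rank1 = rank1_identification(probe_pairs, distractors)
print(f"rank-1 identification rate: {rank1:.2f}")
```

As the number of distractors grows, the chance that some distractor outscores the mate increases, which is why identification at the one-million-distractor scale is a demanding test of embedding generalization.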

5 Concluding Remarks

In this paper, we illustrate how margin based loss methods work for face embedding learning from a probabilistic perspective. Two principles are proposed for designing new margin loss functions. At the same time, since a single comparison metric hits a bottleneck in gaining further generalization ability, we propose to regard open set face recognition as a problem of information transmission. Based on this intuition, we propose an auto-encoder architecture trained with a teacher-student learning strategy, which increases the generalization ability of the face embedding. Extensive experimental results on several benchmarks show clear advantages of the proposed method.

References


  • Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu (2017) Face anti-spoofing using patch and depth-based cnns. In 2017 IEEE International Joint Conference on Biometrics (IJCB), pp. 319–328. Cited by: §2.
  • T. Boult, S. Cruz, A. Dhamija, M. Gunther, J. Henrydoss, and W. Scheirer (2019) Learning and the unknown: surveying steps toward open world recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9801–9807. Cited by: §1.
  • Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman (2018) Vggface2: a dataset for recognising faces across pose and age. In International Conference on Automatic Face & Gesture Recognition, pp. 67–74. Cited by: Table 5.
  • B. Chen, W. Liu, A. Garg, Z. Yu, A. Shrivastava, J. Kautz, and A. Anandkumar (2019) Angular visual hardness. External Links: 1912.02279 Cited by: §3.1.
  • T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang (2015) Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274. Cited by: §4.1.
  • T. Chen, B. Xu, C. Zhang, and C. Guestrin (2016) Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174. Cited by: §4.1.
  • J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019a) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. Cited by: §1, §2, §3.1, §3.1, §3.1, §4.1, §4.1, §4.2, §4.2, §4.3, §4.3, Table 1, Table 5, Table 6.
  • J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou (2019b) RetinaFace: single-stage dense face localisation in the wild. arXiv preprint arXiv:1905.00641. Cited by: §2.
  • C. Geng, S. Huang, and S. Chen (2018) Recent advances in open set recognition: a survey. External Links: 1811.08581 Cited by: §1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §1, §2.
  • Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao (2016) Ms-celeb-1m: a dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. 87–102. Cited by: §4.1.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 1735–1742. Cited by: §1.
  • B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 8527–8537. Cited by: §3.3.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1, §4.1.
  • G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. science 313 (5786), pp. 504–507. Cited by: §1.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1.
  • G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller (2008) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Cited by: §4.1, §4.3.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.1.
  • L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2017) MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. External Links: 1712.05055 Cited by: §2, §3.3.
  • I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard (2016) The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4873–4882. Cited by: §4.3.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §2.
  • H. Liu, X. Zhu, Z. Lei, and S. Z. Li (2019) AdaptiveFace: adaptive margin and sampling for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11947–11956. Cited by: §2.
  • W. Liu, B. Dai, A. Humayun, C. Tay, C. Yu, L. B. Smith, J. M. Rehg, and L. Song (2017a) Iterative machine teaching. In Proceedings of the International Conference on Machine Learning, pp. 2149–2158. Cited by: §1.
  • W. Liu, R. Lin, Z. Liu, L. Liu, Z. Yu, B. Dai, and L. Song (2018a) Learning towards minimum hyperspherical energy. In Advances in Neural Information Processing Systems, pp. 6222–6233. Cited by: §3.1, Table 6.
  • W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017b) Sphereface: deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220. Cited by: §1, §3.1, §3.1, §4.1, §4.1, Table 1, Table 5, Table 6.
  • W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Vol. 2, pp. 7. Cited by: §1, §2, §3.1, §3.1.
  • Y. Liu, A. Jourabloo, and X. Liu (2018b) Learning deep models for face anti-spoofing: binary or auxiliary supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 389–398. Cited by: §2.
  • E. Malach and S. Shalev-Shwartz (2017) Decoupling "when to update" from "how to update". External Links: 1706.02613 Cited by: §2, §3.3.
  • S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou (2017) Agedb: the first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 51–59. Cited by: §4.1.
  • A. Nech and I. Kemelmacher-Shlizerman (2017) Level playing field for million scale face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7044–7053. Cited by: §4.1.
  • H. Noh, S. Hong, and B. Han (2015) Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pp. 1520–1528. Cited by: §4.1.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823. Cited by: §1, §2, Table 6.
  • S. Sengupta, J. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs (2016) Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. Cited by: §4.1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. External Links: 1409.1556 Cited by: §1.
  • F. Wang, L. Chen, C. Li, S. Huang, Y. Chen, C. Qian, and C. Change Loy (2018a) The devil of face recognition is in the noise. In Proceedings of the European Conference on Computer Vision, pp. 765–780. Cited by: §2.
  • H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018b) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274. Cited by: §1, §3.1, §4.1, Table 1, Table 6.
  • X. Wang, S. Wang, J. Wang, H. Shi, and T. Mei (2019) Co-mining: deep face recognition with noisy labels. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9358–9367. Cited by: §2, §3.3.
  • X. Wang, S. Wang, S. Zhang, T. Fu, H. Shi, and T. Mei (2018c) Support vector guided softmax loss for face recognition. arXiv preprint arXiv:1812.11317. Cited by: §4.3, Table 6.
  • Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §3.2.
  • Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515. Cited by: Table 5.
  • D. Yi, Z. Lei, S. Liao, and S. Z. Li (2014) Learning face representation from scratch. arXiv preprint arXiv:1411.7923. Cited by: §4.1.
  • X. Yu, B. Han, J. Yao, G. Niu, I. W. Tsang, and M. Sugiyama (2019) How does disagreement help generalization against label corruption?. External Links: 1901.04215 Cited by: §3.3.
  • K. Zhang, Z. Zhang, Z. Li, and Y. Qiao (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23 (10), pp. 1499–1503. Cited by: §2.
  • X. Zhang, R. Zhao, Y. Qiao, X. Wang, and H. Li (2019) AdaCos: adaptively scaling cosine logits for effectively learning deep face representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10823–10832. Cited by: §2.
  • J. Zhao, M. Mathieu, and Y. LeCun (2016) Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126. Cited by: §1, §2.
  • K. Zhao, J. Xu, and M. Cheng (2019) RegularFace: deep face recognition via exclusive regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1136–1144. Cited by: §2.
  • T. Zheng, W. Deng, and J. Hu (2017) Cross-age lfw: a database for studying cross-age face recognition in unconstrained environments. arXiv preprint arXiv:1708.08197. Cited by: §4.1.
  • T. Zheng and W. Deng (2018) Cross-pose lfw: a database for studying crosspose face recognition in unconstrained environments. Beijing University of Posts and Telecommunications, Tech. Rep, pp. 18–01. Cited by: §4.1.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232. Cited by: §2.