1 Introduction
Face recognition is a fundamental task of great practical value in the computer vision and pattern recognition community. The task contains two categories: face identification, which classifies a given face to a specific identity, and face verification, which determines whether a pair of face images belong to the same identity. Though it has been extensively studied for decades
[38, 2, 35, 25, 27, 21], there still exist a great many challenges for accurate face recognition, especially on large-scale test datasets such as the MegaFace Challenge [9] or the Trillion Pairs Challenge (http://trillionpairs.deepglint.com/overview). In recent years, advanced face recognition models have usually been built upon deep convolutional neural networks [31, 7, 23], in which the learned discriminative features play a significant role. To train deep models, the CNNs are generally equipped with classification loss functions [28, 32, 37, 10, 14, 36], metric learning loss functions [26, 20, 34], or both [18, 27, 37, 41]. Metric learning loss functions such as the contrastive loss [26] or the triplet loss [20] usually suffer from high computational cost. To avoid this problem, they require carefully designed sample mining strategies, and the performance is very sensitive to these strategies. So increasingly more researchers have shifted their attention to constructing deep face recognition models by redesigning the classification loss functions.
Intuitively, face features are discriminative if their intra-class compactness and inter-class separability are well maximized. However, as pointed out by many recent studies [37, 32, 14, 30, 36, 4], the currently prevailing classification loss function (i.e., softmax loss) usually lacks the power of feature discrimination for deep face recognition. To address this issue, one group of works explores mining-based loss functions [22, 12, 24, 39]. Shrivastava et al. [22] develop a hard mining softmax (HM-Softmax) that improves the feature discrimination by constructing mini-batches from high-loss examples, where the percentage of hard examples is empirically decided and the easy examples are completely discarded. In contrast, Lin et al. [12] design a relatively soft mining softmax, namely Focal loss (F-Softmax), to focus training on a sparse set of hard examples; it usually achieves more promising results than the simple hard mining softmax. Yuan et al. [39] select the hard examples based on model complexity and train an ensemble to model examples of different levels of hardness. The other group of works prefers to design margin-based loss functions [14, 30, 4]; this group does not focus on optimizing hard examples but directly increases the feature margin between different classes. Wen et al. [37] develop a center loss that learns a center for each identity to enhance the intra-class compactness. Wang et al. [32] and Ranjan et al. [19]
propose to use a scale parameter to control the temperature of the softmax loss, producing higher gradients for the well-separated samples to shrink the intra-class variance. Liu et al. [13, 14] introduce an angular margin (A-Softmax) between the ground truth class and other classes to encourage a larger inter-class variance. However, it is usually unstable and the optimal parameters are hard to determine. To enhance the stability of the A-Softmax loss, several alternative approaches [30, 36, 15, 4] have been proposed. Wang et al. [30] design an additive margin (AM-Softmax) loss to stabilize the optimization and achieve promising performance. Deng et al. [4] develop an additive angular margin (Arc-Softmax) loss, which has a clearer geometric interpretation.

Although both groups of methods have been well verified to learn discriminative features for face recognition, they have developed independently. The motivation of mining-based losses is to focus on hard examples, while margin-based losses aim to enlarge the feature margin between different classes. Both of them have their own intrinsic drawbacks. For mining-based losses, the definition of hard examples is ambiguous and the hard examples are often empirically selected; how to semantically decide the hard examples is still an open problem. For margin-based losses, most of them learn discriminative features by enlarging the feature margin only from the perspective of the ground truth class (self-motivation). They usually ignore the discriminative power available from the perspective of the other, non-ground-truth classes (other-motivation). Moreover, the relation between mining-based and margin-based losses remains unclear.
To overcome the above shortcomings, this paper designs a new loss function, which adaptively emphasizes the informative support vectors to bridge the gap between mining-based and margin-based losses and semantically integrates them into one framework. The main contributions of this paper can be summarized as follows:

- We propose a novel SV-Softmax loss, which eliminates the ambiguity of hard examples and absorbs the discriminative power of other classes by focusing on support vectors. To the best of our knowledge, this is the first attempt to semantically fuse mining-based and margin-based losses into one framework.

- We deeply analyze the relations of our SV-Softmax loss to the current mining-based and margin-based losses, and further develop an improved version, the SV-X-Softmax loss, to enhance the feature discrimination. Our code will be available at https://github.com/xiaoboCASIA/SVXSoftmax.
2 Preliminary Knowledge
Softmax. The softmax loss is defined as the pipeline combination of the last fully connected layer, the softmax function and the cross-entropy loss. In face recognition, the weights $W = [w_1, w_2, \dots, w_K]$ (where $w_k \in \mathbb{R}^d$ and $K$ is the number of classes) and the feature $x \in \mathbb{R}^d$ of the last fully connected layer are usually normalized, and the magnitude is replaced with a scale parameter $s$ [32, 30, 4]. In consequence, given an input feature vector $x$ with its corresponding ground truth label $y$, the softmax loss can be formulated as follows:

\mathcal{L}_1 = -\log\frac{e^{s\cos(\theta_{w_y,x})}}{e^{s\cos(\theta_{w_y,x})} + \sum_{k\neq y}^{K} e^{s\cos(\theta_{w_k,x})}},  (1)

where $\cos(\theta_{w_k,x}) = w_k^{T}x$ is the cosine similarity and $\theta_{w_k,x}$ is the angle between $w_k$ and $x$. As pointed out by a great many studies [13, 14, 30, 4], the features learned with the softmax loss are prone to be separable, rather than discriminative, for face recognition.

Mining-based Softmax. Hard example mining is becoming a common practice to effectively train deep CNNs. Its idea is to focus training on the informative examples, and thus it usually results in more discriminative features. Recent works select hard examples based on loss value [22, 12] or model complexity [39] to learn discriminative features. Generally, they can be summarized as:
\mathcal{L}_2 = -g(p_y)\log\frac{e^{s\cos(\theta_{w_y,x})}}{e^{s\cos(\theta_{w_y,x})} + \sum_{k\neq y}^{K} e^{s\cos(\theta_{w_k,x})}},  (2)

where $p_y$ is the predicted ground truth probability and $g(p_y)$ is an indicator function. Basically, for the soft mining method Focal loss [12] (F-Softmax), $g(p_y) = (1-p_y)^{\gamma}$, where $\gamma$ is a modulating factor. For the hard mining method HM-Softmax [22], $g(p_y) = 0$ when the sample is indicated as easy, while $g(p_y) = 1$ when the sample is hard. However, the definition of hardness is ambiguous, and these strategies usually lead to sensitive performance.

Margin-based Softmax. To directly enhance the feature discrimination, several margin-based softmax loss functions [14, 36, 30, 4] have been proposed in recent years. In summary, they can be defined as follows:
\mathcal{L}_3 = -\log\frac{e^{s f(m,\theta_{w_y,x})}}{e^{s f(m,\theta_{w_y,x})} + \sum_{k\neq y}^{K} e^{s\cos(\theta_{w_k,x})}},  (3)

where $f(m, \theta_{w_y,x})$ is a carefully designed margin function. Basically, $f(m, \theta_{w_y,x}) = \cos(m\theta_{w_y,x})$, where $m \ge 1$ is an integer, is the motivation of the A-Softmax loss [14]. $f(m, \theta_{w_y,x}) = \cos(\theta_{w_y,x}) - m$ with $m > 0$ is the AM-Softmax loss [30]. $f(m, \theta_{w_y,x}) = \cos(\theta_{w_y,x} + m)$ with $m > 0$ is the Arc-Softmax loss [4]. More generally, the margin function can be summarized into a combined version: $f(m, \theta_{w_y,x}) = \cos(m_1\theta_{w_y,x} + m_2) - m_3$. However, all these methods achieve the feature margin only from the perspective of the ground truth class $y$. They are not aware of the importance of the other, non-ground-truth classes.
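As a concrete illustration, the combined margin function above can be sketched in a few lines. This is only a sketch, not the authors' implementation; the parameter names `m1`, `m2`, `m3` follow the combined form $\cos(m_1\theta + m_2) - m_3$:

```python
import numpy as np

def margin_fn(theta, m1=1.0, m2=0.0, m3=0.0):
    """Combined margin function f = cos(m1*theta + m2) - m3 (variants of Eq. 3).

    A-Softmax:   m1 > 1 (integer), m2 = m3 = 0
    Arc-Softmax: m2 > 0, m1 = 1, m3 = 0
    AM-Softmax:  m3 > 0, m1 = 1, m2 = 0
    """
    return np.cos(m1 * theta + m2) - m3

theta = 0.8  # angle (radians) between the feature and its ground-truth weight
plain = np.cos(theta)
a_sm = margin_fn(theta, m1=2.0)    # A-Softmax with m = 2
arc = margin_fn(theta, m2=0.5)     # Arc-Softmax with m = 0.5
am = margin_fn(theta, m3=0.35)     # AM-Softmax with m = 0.35
# each margin function shrinks the target logit, making the objective stricter
assert a_sm < plain and arc < plain and am < plain
```

Each variant reduces the ground-truth logit relative to the plain cosine, which is what forces a margin between classes during training.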
3 Problem Formulation
3.1 Naive Mining-Margin Softmax Loss
The mining-based loss functions aim to focus on the hard examples, while the margin-based loss functions aim to enlarge the feature margin between different classes. Therefore, these two branches can be seamlessly incorporated into each other. The naive way to directly integrate them can be formulated as:

\mathcal{L}_4 = -g(p_y)\log\frac{e^{s f(m,\theta_{w_y,x})}}{e^{s f(m,\theta_{w_y,x})} + \sum_{k\neq y}^{K} e^{s\cos(\theta_{w_k,x})}}.  (4)

However, this formulation, Eq. (4), only absorbs their respective merits; it cannot solve their respective shortcomings. In detail, it only encourages the feature margin from the perspective of the ground truth class through $f(m, \theta_{w_y,x})$ (self-motivation), ignoring the feature discriminative power of the other, non-ground-truth classes (other-motivation). Moreover, the hard examples are still empirically selected by the indicator function $g(p_y)$, without semantic guidance. In other words, the definition of hard examples is ambiguous.
3.2 Support Vector Guided Softmax Loss
Intuition says that the well-separated feature vectors have little effect on the learning problem. That means the misclassified feature vectors are more crucial for enhancing the feature discriminability. Motivated by this, the hard example mining [22] and the recent Focal loss [12] techniques were proposed to focus training on a sparse set of hard examples and ignore the vast number of easy ones. However, they either empirically sample hard examples according to loss values or empirically down-weight the easy examples by a modulating factor. In other words, the definition of hard examples is ambiguous and lacks intuitive interpretation.
To address this, we alternatively introduce a more elegant way to focus training on the informative features (i.e., support vectors). Specifically, we define a binary mask $I_k$ to adaptively indicate whether a sample is selected as a support vector by a specific classifier $w_k$ at the current stage. To this end, the binary mask is defined as follows:

I_k = \begin{cases} 0, & \cos(\theta_{w_y,x}) - \cos(\theta_{w_k,x}) \ge 0 \\ 1, & \cos(\theta_{w_y,x}) - \cos(\theta_{w_k,x}) < 0 \end{cases}  (5)

From the definition, we can see that if a sample is misclassified by the classifier $w_k$, i.e., $\cos(\theta_{w_y,x}) - \cos(\theta_{w_k,x}) < 0$, it will be emphasized temporarily. In this way, the concept of hard examples is clearly defined, and we mainly focus on such a sparse set of support vectors. Consequently, our Support Vector Guided Softmax (SV-Softmax) loss is formulated as:
\mathcal{L}_5 = -\log\frac{e^{s\cos(\theta_{w_y,x})}}{e^{s\cos(\theta_{w_y,x})} + \sum_{k\neq y}^{K} h(t, \theta_{w_k,x}, I_k)\, e^{s\cos(\theta_{w_k,x})}},  (6)

where $t \ge 1$ is a preset hyperparameter and the indicator function $h(t, \theta_{w_k,x}, I_k)$ is defined as:

h(t, \theta_{w_k,x}, I_k) = e^{s(t-1)(\cos(\theta_{w_k,x})+1)\,I_k}.  (7)

Obviously, when $t = 1$, the designed SV-Softmax loss becomes identical to the original softmax loss. Figure 1 gives the geometrical interpretation of our SV-Softmax loss.
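A minimal sketch of Eq. (6) for a single sample, assuming the cosine similarities to all class weights are already computed. This is an illustrative numpy sketch, not the authors' released code:

```python
import numpy as np

def sv_softmax_loss(cos_theta, y, s=30.0, t=1.2):
    """SV-Softmax loss (Eq. 6) for one sample.

    cos_theta : (K,) cosine similarities between the feature and all K class weights
    y         : ground truth label
    The binary mask I_k (Eq. 5) marks the misclassified (support-vector) classes,
    and h (Eq. 7) re-weights only those terms in the denominator.
    """
    cos_theta = np.asarray(cos_theta, dtype=np.float64)
    mask = (cos_theta > cos_theta[y]).astype(np.float64)  # I_k, Eq. (5)
    mask[y] = 0.0
    # modified logits: z_k = s*cos_theta_k + s*(t-1)*(cos_theta_k + 1)*I_k
    z = s * cos_theta + s * (t - 1.0) * (cos_theta + 1.0) * mask
    m = z.max()                                           # log-sum-exp stabilization
    log_denom = m + np.log(np.exp(z - m).sum())
    return log_denom - s * cos_theta[y]                   # -log p_y

loss_plain = sv_softmax_loss([0.3, 0.6, -0.2], y=0, t=1.0)  # reduces to softmax
loss_hard = sv_softmax_loss([0.3, 0.6, -0.2], y=0, t=1.2)   # support vector penalized
```

With `t = 1` the function coincides with the plain softmax cross-entropy; with `t > 1` a misclassified sample (here the cosine to class 1 exceeds the ground-truth cosine) pays a strictly larger loss, while a well-classified sample is untouched.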
3.2.1 Relation to Mining-based Softmax Losses
To illustrate the advantages of our SV-Softmax loss over the traditional mining-based loss functions (e.g., Focal loss [12]), we use the binary classification case as an example. Assume that we have two samples $x_1$ and $x_2$, both from class 1. Figure 2 gives a diagram, where $x_1$ is relatively hard while $x_2$ is relatively easy. The traditional mining-based Focal loss differentially re-weights the losses of the hard and easy examples:

\mathcal{L}_F = -(1-p_{x_1})^{\gamma}\log(p_{x_1}) - (1-p_{x_2})^{\gamma}\log(p_{x_2}), \quad (1-p_{x_1})^{\gamma} > (1-p_{x_2})^{\gamma}.  (8)

In that way, the importance of hard examples is emphasized. However, this strategy works directly from the loss perspective, and the definition of hard examples is ambiguous. Our SV-Softmax loss works in a different way. Firstly, we semantically define the hard examples (support vectors) according to the decision boundary. Then, for the support vector $x_1$, we reduce its probability:

p_{x_1} = \frac{e^{s\cos(\theta_{w_1,x_1})}}{e^{s\cos(\theta_{w_1,x_1})} + h(t, \theta_{w_2,x_1}, I_2)\, e^{s\cos(\theta_{w_2,x_1})}}.  (9)

In summary, the differences between the SV-Softmax loss and the mining-based Focal loss [12] are displayed in Figure 2.
3.2.2 Relation to Margin-based Softmax Losses
Similarly, assume that we have a sample $x$ from class 1 which lies a little far away from its ground truth class (e.g., the red circle point in Figure 4). The original softmax loss aims to make $\cos(\theta_{w_1,x}) \ge \cos(\theta_{w_2,x})$. To make the objective more rigorous, margin-based losses usually introduce a margin function from the perspective of the ground truth class [14, 30, 4]:

f(m, \theta_{w_1,x}) \ge \cos(\theta_{w_2,x}).  (10)

In contrast, our SV-Softmax loss enlarges the feature margin from the perspective of the other, non-ground-truth classes. Specifically, introducing the margin function $h(t, \theta_{w_2,x}, I_2)$ for these misclassified features is equivalent to requiring

\cos(\theta_{w_1,x}) \ge t\cos(\theta_{w_2,x}) + t - 1,  (11)

where $t\cos(\theta_{w_2,x}) + t - 1 \ge \cos(\theta_{w_2,x})$ since $t \ge 1$. Our SV-Softmax loss semantically enlarges the feature margin from the other, non-ground-truth classes, while margin-based losses make their efforts from the ground truth class. For the multi-class case, our SV-Softmax loss yields class-specific margins. Figure 4 gives their geometrical comparison. To sum up, Figure 3 shows the pipeline of our SV-Softmax loss and its relations to the mining-based and margin-based losses.
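The equivalence underlying Eq. (11) can be checked numerically: multiplying the non-ground-truth term by $h$ turns its logit $\cos(\theta_{w_2,x})$ into $t\cos(\theta_{w_2,x}) + t - 1$, which never shrinks for $t \ge 1$. A sketch with the scale and hyperparameter values used later in the paper ($s = 30$, $t = 1.2$):

```python
import numpy as np

s, t = 30.0, 1.2
cos2 = np.linspace(-1.0, 1.0, 201)          # cos(theta_{w2,x}) over its full range
h = np.exp(s * (t - 1.0) * (cos2 + 1.0))    # Eq. (7) with I = 1
effective = np.log(h) / s + cos2            # logit actually paid by class 2
assert np.allclose(effective, t * cos2 + t - 1.0)
assert np.all(effective >= cos2)            # the other-class margin never shrinks
```

The second assertion is exactly the inequality in the text: the re-weighted logit dominates the plain cosine, so each support vector experiences an adaptive, class-specific margin.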
3.2.3 SV-X-Softmax
According to the above discussions, our SV-Softmax loss semantically fuses the motivations of mining-based and margin-based losses into one framework, but from a different viewpoint. Therefore, we can also absorb their strengths into our SV-Softmax loss. Specifically, to increase the mining range, we adopt the margin-based decision boundaries to indicate the support vectors. Consequently, the improved SV-X-Softmax loss can be formulated as:

\mathcal{L}_6 = -\log\frac{e^{s f(m,\theta_{w_y,x})}}{e^{s f(m,\theta_{w_y,x})} + \sum_{k\neq y}^{K} h(t, \theta_{w_k,x}, I_k)\, e^{s\cos(\theta_{w_k,x})}},  (12)

where X stands for a margin-based loss; it can be A-Softmax [14], AM-Softmax [30], Arc-Softmax [4], etc. The indicator mask $I_k$ is recomputed according to the margin-based decision boundaries (this is why we uniformly call the hard examples "support vectors": the definition is similar to the one in [3]). Specifically,

I_k = \begin{cases} 0, & f(m, \theta_{w_y,x}) - \cos(\theta_{w_k,x}) \ge 0 \\ 1, & f(m, \theta_{w_y,x}) - \cos(\theta_{w_k,x}) < 0 \end{cases}  (13)

Figure 5 gives the geometrical illustration of our SV-X-Softmax loss. It is superior because, from the margin-based viewpoint, the SV-X-Softmax loss enlarges the feature margin by integrating the self-motivation of the ground truth class and the other-motivation of the other classes into one framework, while from the mining-based viewpoint, it semantically enlarges the mining range.
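The enlarged mining range of Eq. (13) can be illustrated with a toy example. This is a sketch assuming the AM-Softmax margin function $f = \cos(\theta_{w_y,x}) - m$; the cosine values are hypothetical:

```python
import numpy as np

def sv_mask(cos_theta, y):
    """Softmax boundary (Eq. 5): support vector iff cos_theta_k > cos_theta_y."""
    m = (cos_theta > cos_theta[y]).astype(float)
    m[y] = 0.0
    return m

def svam_mask(cos_theta, y, margin=0.35):
    """SV-AM-Softmax boundary (Eq. 13 with f = cos(theta_y) - m): the margin
    shifts the boundary toward the ground truth class, enlarging the mining range."""
    m = (cos_theta > cos_theta[y] - margin).astype(float)
    m[y] = 0.0
    return m

cos = np.array([0.55, 0.60, 0.30, -0.10])
y = 0
assert sv_mask(cos, y).sum() == 1            # only class 1 violates the plain boundary
assert svam_mask(cos, y).sum() == 2          # class 2 (0.30 > 0.55 - 0.35) joins in
assert np.all(svam_mask(cos, y) >= sv_mask(cos, y))  # always a superset
```

Because the margin-based boundary is strictly harder to satisfy, every support vector under Eq. (5) remains one under Eq. (13), and additional near-boundary classes are mined as well.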
4 Optimization
In this section, we show that the proposed SV-Softmax loss (6) is trainable and can be easily optimized by typical stochastic gradient descent. The difference between the original softmax loss and the proposed SV-Softmax loss lies in the last fully connected layer. In the forward pass, when $t = 1$, it is the same as the original softmax loss. When $t > 1$, there are two cases: if the feature vector $x$ is easy for a specific non-ground-truth class $k$ (i.e., $I_k = 0$), the corresponding term is the same as in the original softmax (i.e., $e^{s\cos(\theta_{w_k,x})}$); otherwise ($I_k = 1$), it is recomputed as $h(t, \theta_{w_k,x}, I_k)\,e^{s\cos(\theta_{w_k,x})}$. In the backward pass, we use the chain rule to compute the partial derivatives. The derivatives with respect to the weights $w_k$ and the CNN feature $x$ of the last fully connected layer should be re-emphasized:

\frac{\partial \mathcal{L}_5}{\partial x} = \sum_{k=1}^{K} \frac{\partial \mathcal{L}_5}{\partial \cos(\theta_{w_k,x})} \frac{\partial \cos(\theta_{w_k,x})}{\partial x},  (14)

\frac{\partial \mathcal{L}_5}{\partial w_k} = \frac{\partial \mathcal{L}_5}{\partial \cos(\theta_{w_k,x})} \frac{\partial \cos(\theta_{w_k,x})}{\partial w_k},  (15)

where the computation form of $\partial \cos(\theta_{w_k,x})/\partial x$ (and $\partial \cos(\theta_{w_k,x})/\partial w_k$) is the same as for the original softmax loss. The whole scheme for a single image is summarized in Algorithm 1. It is trivial to perform the derivation with mini-batch input. Moreover, the extension to the SV-X-Softmax loss case is straightforward.
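As a quick numerical sanity check that the loss is differentiable as claimed, one can inspect the signs of finite-difference gradients with respect to the cosine similarities. This is a sketch with hypothetical values; as in the discussion above, the mask is held fixed (the chosen point is away from the decision boundary, so a small perturbation does not flip it):

```python
import numpy as np

def sv_loss(cos_theta, y, s=30.0, t=1.2):
    # SV-Softmax (Eq. 6) with the mask determined by the current predictions
    mask = (cos_theta > cos_theta[y]).astype(float)
    mask[y] = 0.0
    z = s * cos_theta + s * (t - 1.0) * (cos_theta + 1.0) * mask
    m = z.max()
    return m + np.log(np.exp(z - m).sum()) - s * cos_theta[y]

# central finite differences w.r.t. each cosine similarity
cos = np.array([0.3, 0.6, -0.2, 0.1])
y, eps = 0, 1e-6
g = np.array([(sv_loss(cos + eps * np.eye(4)[k], y)
               - sv_loss(cos - eps * np.eye(4)[k], y)) / (2 * eps)
              for k in range(4)])
assert g[y] < 0   # increasing cos(theta_y) decreases the loss
assert g[1] > 0   # the support-vector class (cos 0.6 > 0.3) is pushed away
```

The gradient w.r.t. the support-vector class carries the extra factor $t$ from $h$, which is precisely what pushes misclassified directions away more strongly than in plain softmax.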
| Method group | Method | LFW 6000 Pairs Accuracy | LFW BLUFR TPR@FAR=1e-3 | LFW BLUFR TPR@FAR=1e-4 | LFW BLUFR TPR@FAR=1e-5 |
|---|---|---|---|---|---|
| Baseline | Softmax | 99.26 | 99.46 | 98.44 | 95.24 |
| Mining-based | F-Softmax [12] | 99.46 | 99.62 | 98.76 | 95.97 |
| | HM-Softmax [22] | 99.26 | 99.48 | 98.48 | 95.11 |
| Margin-based | A-Softmax [14] | 99.36 | 99.68 | 99.09 | 97.20 |
| | Arc-Softmax [4] | 99.63 | 99.86 | 99.68 | 98.18 |
| | AM-Softmax [30] | 99.61 | 99.86 | 99.75 | 98.18 |
| Naive-fused | F-Arc-Softmax | 99.66 | 99.87 | 99.73 | 98.32 |
| | F-AM-Softmax | 99.66 | 99.87 | 99.76 | 98.39 |
| | HM-Arc-Softmax | 99.51 | 99.86 | 99.70 | 98.74 |
| | HM-AM-Softmax | 99.63 | 99.87 | 99.75 | 98.90 |
| Ours | SV-Softmax | 99.48 | 99.78 | 99.39 | 98.14 |
| | SV-Arc-Softmax | 99.78 | 99.85 | 99.77 | 98.52 |
| | SV-AM-Softmax | 99.76 | 99.87 | 99.81 | 99.22 |
5 Experiments
5.1 Datasets
Training Data. The MS-Celeb-1M dataset [6] contains about 100k identities with 10 million images. However, it includes a great many noisy face images. Fortunately, the trillion-pairs consortium has made efforts to produce a high-quality version, MS-Celeb-1M-v1c, which is well cleaned, with 86,876 identities and 3,923,399 aligned images.
Validation Data. We employ Labeled Faces in the Wild (LFW) [8] as the validation data. LFW contains 13,233 web-collected images from 5,749 different identities, with large variations in pose, expression and illumination.
Test Data. We use two datasets, MegaFace [9] and Trillion Pairs (http://trillionpairs.deepglint.com/overview), as the test data. The MegaFace datasets aim at evaluating the performance of face recognition algorithms at the million scale of distractors, and include a gallery set and a probe set. The gallery set, a subset of Flickr photos from Yahoo, consists of more than one million images from 690,000 different individuals. The probe set draws on two existing databases: FaceScrub [17] and FG-NET [1]. In this study, we use FaceScrub as the probe set, which contains 100,000 photos of 530 unique individuals, wherein 55,742 images are of males and 52,076 images are of females. The Trillion Pairs datasets were recently released as a publicly available testing benchmark, which consists of two parts, ELFW and DELFW. ELFW contains the face images of celebrities on the LFW name list; there are 274,000 images from 5,700 identities. DELFW contains the distractors for ELFW; there are in total 1.58 million face images from Flickr.
5.2 Experimental Settings
Data Processing. We detect the faces with the FaceBoxes detector [40] and localize five landmarks (two eyes, nose tip and two mouth corners) with a simple 6-layer CNN [5]. The detected faces are cropped and resized to 120×120, and each pixel (in [0, 255]) in the RGB images is normalized by subtracting 127.5 and then dividing by 128. All the training faces are horizontally flipped with probability 0.5 for data augmentation.
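The normalization step can be sketched as follows. This is only a sketch of the arithmetic described above, not the authors' pipeline; detection and landmark alignment are omitted:

```python
import numpy as np

def preprocess(img_u8, flip_prob=0.5, rng=None):
    """Normalize a cropped 120x120 RGB face crop: (pixel - 127.5) / 128,
    with a random horizontal flip for augmentation."""
    rng = rng if rng is not None else np.random.default_rng()
    x = (img_u8.astype(np.float32) - 127.5) / 128.0
    if rng.random() < flip_prob:
        x = x[:, ::-1, :]  # flip along the width axis
    return x

img = np.random.default_rng(0).integers(0, 256, (120, 120, 3), dtype=np.uint8)
out = preprocess(img, flip_prob=0.0)
assert out.shape == (120, 120, 3)
assert out.min() >= -127.5 / 128 and out.max() <= 127.5 / 128
```

The resulting values lie in roughly [-1, 1], which keeps the input scale friendly for the first convolution.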
CNN Architecture. In face recognition, there are many kinds of network architectures [14, 30, 29]. To be fair, the CNN architecture should be kept the same when testing different loss functions. As suggested by [29], we use Attention-56 [31] as our baseline architecture to achieve a good balance between computation and accuracy. The output of Attention-56 is finally reduced to a 512-dimensional feature by average pooling. The scale parameter $s$ has already been discussed sufficiently in previous works [30, 33]; in this paper, we directly fix it to 30. The details of the adopted Attention-56 architecture are provided in the supplementary materials.
Training. All the CNN models are trained from scratch with the stochastic gradient descent (SGD) algorithm, with a batch size of 32 on each of 4 P40 GPUs in parallel (total batch size 128). The weight decay is set to 0.0005 and the momentum to 0.9. The learning rate is initially 0.1 and is divided by 10 at the 100k, 160k and 220k iterations; we finish the training process at 240k iterations.
Test. At the testing stage, only the features of the original image (512-dimensional) are employed to compose the face representation. All the results reported in this paper are evaluated with a single model, without model ensemble or other fusion strategies.
For the evaluation metrics, the cosine distance of features is computed as the similarity score. Face identification and verification are conducted by ranking and thresholding the scores, respectively. Specifically, for face identification, the Cumulative Match Characteristic (CMC) curves are adopted to evaluate the Rank-1 face identification accuracy. For face verification, the Receiver Operating Characteristic (ROC) curves are adopted. The true positive rate (TPR) at a low false acceptance rate (FAR) is emphasized, since in real applications false acceptance carries higher risk than false rejection. We test our models on several popular public face datasets, including LFW [8], the MegaFace Challenge [9, 16] and the recent Trillion Pairs Challenge. Specifically, for LFW, the unrestricted-with-labeled-outside-data 6000 pairs accuracy [8] and the BLUFR [11] protocols are reported. For the MegaFace Challenge, the identification Rank-1 accuracy and the verification rate TPR@FAR=1e-6 are reported. For the Trillion Pairs Challenge, every pair between ELFW and DELFW is used, in total 0.4 trillion pairs. For the face identification task, the benchmark provides a 1.58-million-size gallery and a 270k-size query for top-1 identification, and the metric TPR@FAR=1e-3 is reported, while for the face verification task, the verification rate TPR@FAR=1e-9 is reported. For more details about the protocols, please refer to [8, 11, 9].

For the compared methods, we compare our method with the baseline softmax loss (Softmax) and the recently proposed state-of-the-art methods, including 2 mining-based softmax losses (hard example mining (HM-Softmax [22]) and Focal loss (F-Softmax [12])), 3 margin-based softmax losses (the angular softmax loss (A-Softmax [14]), the additive margin softmax loss (AM-Softmax [30]), and the additive angular margin softmax loss (Arc-Softmax [4])) and their 4 naive fusions (F-AM-Softmax, F-Arc-Softmax, HM-AM-Softmax and HM-Arc-Softmax). For all the compared methods, the source code can be downloaded from GitHub or from the authors' webpages. The corresponding parameters are determined according to their suggestions (e.g., the feature margin parameter is 0.35 for AM-Softmax and 0.5 for Arc-Softmax). For more details, please refer to the supplementary materials.
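For reference, TPR at a target FAR can be computed from genuine and impostor score sets roughly as follows. This is a simplified sketch with synthetic scores; the official benchmark protocols may differ in details such as thresholding conventions:

```python
import numpy as np

def tpr_at_far(genuine, impostor, far=1e-3):
    """TPR at a target FAR: pick the score threshold as the (1 - far)-quantile
    of the impostor (different-identity) similarity scores, then measure the
    fraction of genuine (same-identity) pairs accepted at that threshold."""
    thr = np.quantile(np.asarray(impostor), 1.0 - far)
    return float((np.asarray(genuine) >= thr).mean())

rng = np.random.default_rng(1)
genuine = rng.normal(0.8, 0.05, 10000)     # synthetic same-identity cosine scores
impostor = rng.normal(0.1, 0.05, 100000)   # synthetic different-identity scores
assert tpr_at_far(genuine, impostor, far=1e-3) > 0.99
```

With well-separated score distributions, as in this toy example, almost all genuine pairs survive the impostor-derived threshold; the interesting regime in the benchmarks is precisely when the two distributions overlap.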
5.3 Effects of the indicator parameter t
Since the indicator parameter $t$ plays an important role in the developed SV-Softmax loss, we first conduct experiments to search for its best value. Varying $t$ from 1.0 to 1.3 (if $t$ is larger than 1.4, the model may fail to converge), we use the Attention-56 network and the SV-Softmax loss to train models on the MS-Celeb-1M-v1c dataset and evaluate the performance on the validation set LFW. As illustrated in the left subfigure of Figure 6, as $t$ increases, the 6000 pairs accuracy and the BLUFR results on LFW improve consistently and become saturated around $t = 1.2$. This demonstrates the effectiveness of our SV-Softmax loss (compared with $t = 1$, i.e., the original softmax). To validate the sensitivity of the indicator parameter $t$, we directly use the trained models to test on MegaFace; the effects are reported in the right subfigure of Figure 6. From the curves, we can see that our SV-Softmax loss is insensitive to the indicator parameter $t$ within a certain range. According to this study, $t$ is fixed to 1.2 in the subsequent experiments.
| Method | Identification Rank-1@1e6 | Verification TPR@FAR=1e-6 |
|---|---|---|
| Softmax | 86.29 | 87.63 |
| F-Softmax [12] | 88.29 | 89.83 |
| HM-Softmax [22] | 86.58 | 88.39 |
| A-Softmax [14] | 88.54 | 89.40 |
| Arc-Softmax [4] | 93.67 | 94.47 |
| AM-Softmax [30] | 94.77 | 95.44 |
| F-Arc-Softmax | 93.98 | 95.10 |
| F-AM-Softmax | 94.47 | 94.84 |
| HM-Arc-Softmax | 94.05 | 95.26 |
| HM-AM-Softmax | 94.78 | 95.57 |
| SV-Softmax | 92.11 | 93.54 |
| SV-Arc-Softmax | 97.14 | 97.57 |
| SV-AM-Softmax | 97.20 | 97.38 |
5.4 Experiments on LFW
Table 4 provides the quantitative results of all the competitors on the LFW dataset, with the bold number in each column representing the best performance. For the 6000 pairs accuracy protocol, it is well known that this protocol is typical and easy for deep face recognition, and all the competitors achieve over 99% accuracy, so the improvement of our SV-Softmax loss is not very large. From the numbers, we observe that the naive fusions of mining-based and margin-based losses, e.g., HM-AM-Softmax and F-AM-Softmax, outperform the simple mining-based or margin-based ones. Despite this, our improved SV-AM-Softmax still achieves about 0.3% improvement. For the BLUFR protocol, the trends are similar to the 6000 pairs accuracy, and our improved SV-AM-Softmax loss achieves the best performance among all the competitors. Since the evaluation protocols on LFW are nearly saturated, it is better to test our models on the MegaFace and Trillion Pairs challenges.
5.5 Experiments on MegaFace Challenge
Table 5 shows the identification and verification results on the MegaFace dataset. In particular, compared with the baseline softmax loss and the mining-based softmax losses, our SV-Softmax loss achieves at least 3% improvement in both the Rank-1 identification rate and the verification TPR@FAR=1e-6 rate. The reason is that our SV-Softmax loss clearly defines the hard examples (i.e., support vectors), and is thus better than the existing mining-based losses. Compared with the margin-based softmax losses, the performance of our SV-Softmax loss is slightly lower. This is reasonable because the support vectors decided by the softmax decision boundary in the SV-Softmax loss may not be enough for learning discriminative features. Our improved SV-Arc-Softmax and SV-AM-Softmax losses, wherein the support vectors are determined by the margin-based decision boundaries, further boost the performance because they absorb the complementary merits of margin-based losses. Specifically, our SV-AM-Softmax loss beats the best margin-based competitor, the AM-Softmax loss, by a large margin (about 2.4% in Rank-1 identification rate and 1.9% in verification rate). Compared with the naive fusions of mining-based and margin-based losses, our improved SV-AM-Softmax loss is also better; it is about 2.4% higher in Rank-1 identification rate and 1.8% higher in verification rate than the second best competitor, the HM-AM-Softmax loss. To sum up, our improved SV-X-Softmax losses, which eliminate the ambiguity of hard examples and absorb the discriminative power of other classes by focusing on support vectors, are inherently the best at the current stage. In Figure 7, we draw both the CMC curves, to evaluate face identification, and the ROC curves, to evaluate face verification, on MegaFace Set 1. From the curves, we can see similar trends under other measures.
In this experiment, our SV-Softmax loss and its improved version SV-AM-Softmax have shown their superiority for both the identification and verification tasks.
| Method | Identification TPR@FAR=1e-3 | Verification TPR@FAR=1e-9 |
|---|---|---|
| Softmax | 36.61 | 33.87 |
| F-Softmax [12] | 39.80 | 37.14 |
| HM-Softmax [22] | 36.75 | 34.46 |
| A-Softmax [14] | 43.89 | 43.76 |
| Arc-Softmax [4] | 57.48 | 57.45 |
| AM-Softmax [30] | 61.80 | 61.61 |
| F-Arc-Softmax | 56.80 | 56.87 |
| F-AM-Softmax | 61.85 | 61.79 |
| HM-Arc-Softmax | 55.93 | 56.63 |
| HM-AM-Softmax | 61.42 | 61.33 |
| SV-Softmax | 51.18 | 46.78 |
| SV-Arc-Softmax | 71.19 | 70.33 |
| SV-AM-Softmax | 73.56 | 72.71 |
5.6 Experiments on Trillion Pairs Challenge
Table 3 displays the performance comparison on the recent Trillion Pairs Challenge, from which we can conclude that the results exhibit the same trends as on the LFW and MegaFace datasets; moreover, the trends are more obvious. Concretely, both the current mining-based and margin-based losses are better than the simple softmax loss for face recognition. However, the margin-based losses usually achieve higher performance than the mining-based losses, because the motivation of margin-based losses is to enhance the feature discrimination, while the motivation of mining-based losses is to focus training on hard examples. Their naive fusions can slightly improve the performance further. However, the naive fusions still suffer from the ambiguity of hard examples and the lack of discriminative power from other classes, and are therefore limited for face recognition. Our SV-X-Softmax (e.g., SV-AM-Softmax) losses absorb the strengths and discard the drawbacks of the current mining-based and margin-based loss functions, and thus achieve the highest performance.
| Benchmark | Metric | Ours | 1st place |
|---|---|---|---|
| LFW | 6000 Pairs Accuracy | 99.85 | 99.87 |
| LFW BLUFR | TPR@FAR=1e-3 | 99.92 | - |
| LFW BLUFR | TPR@FAR=1e-4 | 99.89 | - |
| LFW BLUFR | TPR@FAR=1e-5 | 99.13 | - |
| MegaFace | Identification Rank-1@1e6 | 98.82 | 99.93 |
| MegaFace | Verification TPR@FAR=1e-6 | 99.03 | 99.93 |
| Trillion Pairs | Identification TPR@FAR=1e-3 | 82.25 | 85.67 |
| Trillion Pairs | Verification TPR@FAR=1e-9 | 78.49 | 82.29 |
6 Improvement by Designing Architectures
To further boost the performance, we make the adopted Attention-56 [31] architecture deeper. Specifically, we change the stages of [1,1,1] used in Attention-56 into [3,6,2]. Moreover, inspired by [4], we incorporate the IR-SE module into the architecture. The results are displayed in Tables 4-6. Note that all the current results are trained on the simple MS-Celeb-1M-v1c dataset only, and only single-model performance is reported. From the numbers, we can see that our SV-AM-Softmax loss has achieved competitive absolute performance. In the future, it would be promising to fuse the MS1M-ArcFace [4] and Asian datasets (http://trillionpairs.deepglint.com/data) and to design model ensemble methods (e.g., feature concatenation).
7 Conclusion
This paper has proposed a simple but very effective loss function, namely the support vector guided softmax loss (SV-Softmax), for face recognition. Specifically, the SV-Softmax loss explicitly concentrates on optimizing the support vectors, and thus semantically integrates the motivations of mining-based and margin-based loss functions into one framework. Consequently, it is intrinsically better than the current mining-based losses, margin-based losses and their naive fusions. Extensive experiments on several benchmark datasets have clearly demonstrated the advantages of our new approach over the state-of-the-art alternatives.
References
 [1] FG-NET aging database. http://www.fgnet.rsunit.com/, 2010.
 [2] S. Cai, W. Zuo, L. Zhang, X. Feng, and P. Wang. Support vector guided dictionary learning. In ECCV, 2014.
 [3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
 [4] J. Deng, J. Guo, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
 [5] Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X.-J. Wu. Wing loss for robust facial landmark localisation with convolutional neural networks. arXiv preprint arXiv:1711.06753, 2017.
 [6] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Msceleb1m: A dataset and benchmark for largescale face recognition. In ECCV, 2016.
 [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [8] G. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, 2007.
 [9] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In CVPR, 2016.
 [10] X. Liang, X. Wang, Z. Lei, S. Liao, and S. Li. Softmargin softmax for deep classification. In ICONIP, 2017.
 [11] S. Liao, Z. Lei, D. Yi, and S. Z. Li. A benchmark study of largescale unconstrained face recognition. In ICB, 2014.
 [12] T.-Y. Lin, P. Goyal, and R. Girshick. Focal loss for dense object detection. In ICCV, 2017.
 [13] W. Liu, Y. Wen, and Z. Yu. Largemargin softmax loss for convolutional neural networks. In ICML, 2016.
 [14] W. Liu, Y. Wen, Z. Yu, M. Li, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, 2017.
 [15] Y. Liu, H. Li, and X. Wang. Learning deep features via congenerous cosine loss for person recognition. In ICCV, 2017.
 [16] A. Nech and I. Kemelmacher-Shlizerman. Level playing field for million scale face recognition. In CVPR, 2017.
 [17] H.-W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In ICIP, 2014.
 [18] O. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.
 [19] R. Ranjan, C. Castillo, and R. Chellappa. L2constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507., 2017.
 [20] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
 [21] H. Shi, X. Wang, D. Yi, Z. Lei, X. Zhu, and S. Z. Li. Crossmodality face recognition via heterogeneous joint bayesian. IEEE Signal Processing Letters, 24(1):81–85, 2017.
 [22] A. Shrivastava, A. Gupta, and R. Girshick. Training regionbased object detectors with online hard example mining. In CVPR, 2016.
 [23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [24] H. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
 [25] Y. Sun, Y. Chen, and X. Wang. Deep learning face representation by joint identificationverification. In NIPS, 2014.
 [26] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014.
 [27] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In CVPR, 2015.
 [28] Y. Taigman, M. Yang, and M. Ranzato. Deepface: Closing the gap to humanlevel performance in face verification. In CVPR, 2014.
 [29] F. Wang, L. Chen, C. Li, S. Huang, Y. Chen, C. Qian, and C. C. Loy. The devil of face recognition is in the noise. In ECCV, 2018.
 [30] F. Wang, J. Cheng, W. Liu, and H. Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.
 [31] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. arXiv preprint arXiv:1704.06904, 2017.
 [32] F. Wang, X. Xiang, J. Chen, and A. Yuille. Normface: Hypersphere embedding for face verification. In ACM MM, 2017.
 [33] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. arXiv preprint arXiv:1801.09414, 2018.
 [34] J. Wang, F. Zhou, and S. Wen. Deep metric learning with angular loss. In ICCV, 2017.
 [35] X. Wang, X. Guo, and S. Z. Li. Adaptively unified semisupervised dictionary learning with active points. In ICCV, 2015.
 [36] X. Wang, S. Zhang, Z. Lei, S. Liu, X. Guo, and S. Z. Li. Ensemble softmargin softmax loss for image classification. arXiv preprint arXiv:1805.03922, 2018.
 [37] Y. Wen, K. Zhang, and Z. Li. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
 [38] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. PAMI, 2009.
 [39] Y. Yuan, K. Yang, and C. Zhang. Hardaware deeply cascaded embedding. In ICCV, 2017.
 [40] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. Faceboxes: A cpu realtime face detector with high accuracy. In IJCB, 2017.
 [41] Y. Zheng, D. K. Pal, and M. Savvides. Ring loss: Convex feature normalization for face recognition. In CVPR, 2018.