Over the last few years, deep convolutional neural networks have significantly boosted face recognition accuracy. State-of-the-art approaches are based on deep neural networks and adopt the following pipeline: train a classification model with some type of softmax loss, and use the trained model as a feature extractor to encode unseen samples. The cosine similarities between the features of testing faces are then exploited to determine whether those features belong to the same identity. Unlike other vision tasks, such as object detection, where training and testing have the same objectives and evaluation procedures, face recognition systems are trained with softmax losses but tested with cosine similarities. In other words, there is a gap between the softmax probabilities in training and the inner-product similarities in testing.
This problem is not well addressed in existing face recognition models trained with the softmax cross-entropy loss function (softmax loss for short in the remainder of this paper), which mainly considers the probability distributions of training classes and ignores the testing setup. In order to bridge this gap, cosine softmax losses [28, 13, 14] and their angular-margin-based variants [29, 27, 3] directly use cosine distances instead of inner products as the raw classification scores, namely logits. Specifically, the angular-margin-based variants aim to learn decision boundaries with a margin between different classes. These methods improve face recognition performance in challenging setups.
In spite of their successes, cosine-based softmax losses are only a trade-off: the supervision signals for training are still classification probabilities, which are never evaluated during testing. Considering that the similarity between two testing face images is related only to the images themselves, while the classification probabilities are related to all the identities, cosine softmax losses are not the ideal training measure for face recognition.
This paper aims to address these problems from a different perspective. Deep neural networks are generally trained with gradient-based optimization algorithms, in which gradients play an essential role. Beyond the loss function itself, we focus on the gradients of cosine softmax loss functions. This new perspective not only allows us to analyze the relations and problems of previous methods, but also inspires us to develop a novel form of adaptive gradients, P2SGrad, which mitigates the training-testing mismatch and improves face recognition performance in practice.
To be more specific, P2SGrad optimizes deep models by directly designing new gradients instead of new loss functions. Compared with the conventional gradients of cosine-based softmax losses, P2SGrad uses cosine distances to replace the classification probabilities in the original gradients. P2SGrad thereby eliminates the effects of hyperparameters and the number of classes, and matches the testing targets.
This paper mainly contributes in the following aspects:
We analyze the recent cosine softmax losses and their angular-margin based variants from the perspective of gradients, and propose a general formulation to unify different cosine softmax cross-entropy losses;
With this unified model, we propose adaptive, hyperparameter-free gradients, P2SGrad, instead of a new loss function for training deep face recognition networks. This method preserves the advantages of using cosine distances in training and replaces classification probabilities with cosine similarities in backward propagation;
We conduct extensive experiments on large-scale face datasets. Experimental results show that P2SGrad outperforms state-of-the-art methods on the same setup and clearly improves the stability of the training process.
2 Related Works
The accuracy improvements in face recognition [9, 6, 18, 25] benefit from large-scale training data and from improved neural network structures. Modern face datasets, such as LFW, PubFig, CASIA-WebFace, MS1M and MegaFace [17, 8], contain huge numbers of identities, which enables the effective training of very deep neural networks. A number of recent studies demonstrated that well-designed network architectures lead to better performance, such as DeepFace, DeepID2/3 [22, 23] and FaceNet.
In face recognition, feature normalization, which restricts features to lie on a fixed-radius hypersphere, is a common operation to enhance a model's final performance. COCO loss [13, 14] and NormFace studied the effect of normalization through mathematical analysis and proposed two strategies, reformulating the softmax loss and metric learning. Coincidentally, L2-softmax proposed a similar method. These methods arrive at the same formulation of the cosine softmax loss from different views.
Optimizing an auxiliary metric loss function is also a popular choice for boosting performance. In the early years, most face recognition approaches utilized metric loss functions, such as triplet loss and contrastive loss, which use a Euclidean margin to measure the distance between features. Building on these works, center loss and range loss were proposed to reduce intra-class variation by minimizing distances within target classes.
Simply using a Euclidean distance or a Euclidean margin is insufficient to maximize classification performance. To circumvent this difficulty, angular-margin-based softmax loss functions were proposed and became popular in face recognition. Angular constraints were added to the traditional softmax loss function to improve feature discriminativeness in L-softmax and A-softmax, where A-softmax applies weight normalization but L-softmax does not. CosFace, AM-softmax and ArcFace also embraced the idea of angular margins and employed simpler as well as more intuitive loss functions than the aforementioned methods. In these methods, normalization is applied to both features and weights.
3 Limitations of cosine softmax losses
In this section we discuss the limitations caused by the mismatch between the training and testing of face recognition models. We first provide a brief review of the workflow of cosine softmax losses. We then reveal the limitations of existing loss functions in face recognition from the perspectives of the forward and backward calculations, respectively.
3.1 Gradients of cosine softmax losses
In face recognition tasks, the cosine softmax cross-entropy loss has an elegant two-part formulation: a softmax function and a cross-entropy loss.
We discuss the softmax function first. Assume that the vector $x_i$ denotes the feature representation of a face image. The input of the softmax function is the logit $f_j$, i.e.,

$$f_j = s \cos\theta_{i,j} = s\, \hat{x}_i^{\top} \hat{W}_j, \qquad (1)$$

where $s$ is a scaling hyperparameter, $f_j$ is the classification score (logit) assigned to class $j$, and $W_j$ is the weight vector of class $j$. $\hat{x}_i$ and $\hat{W}_j$ are the normalized vectors of $x_i$ and $W_j$ respectively, and $\theta_{i,j}$ is the angle between feature $x_i$ and class weight $W_j$. The logits are then input into the softmax function to obtain the probability

$$P_{i,j} = \frac{e^{f_j}}{\sum_{k=1}^{C} e^{f_k}}, \qquad (2)$$

where $C$ is the number of classes and the output $P_{i,j}$ can be interpreted as the probability of $x_i$ being assigned to class $j$. If $j = y_i$, where $y_i$ denotes the ground-truth class of $x_i$, then $P_{i,y_i}$ is the probability of $x_i$ being assigned to its corresponding class.
We then discuss the cross-entropy loss associated with the softmax function, which measures the divergence between the predicted probability and the ground-truth distribution as

$$L_{CE}(x_i) = -\log P_{i,y_i},$$

where $L_{CE}(x_i)$ is the loss of input feature $x_i$. The larger the probability $P_{i,y_i}$ is, the smaller the loss is.
In order to decrease the loss $L_{CE}(x_i)$, the model needs to enlarge $P_{i,y_i}$ and thus enlarge the logit $f_{y_i} = s\cos\theta_{i,y_i}$. Then $\theta_{i,y_i}$ becomes smaller. In summary, a cosine softmax loss function maps $\theta_{i,j}$ to the probability $P_{i,j}$ and calculates the cross-entropy loss to supervise the training.
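As a concrete illustration, the forward computation just described (normalization, cosine logits, softmax, cross-entropy) can be sketched in a few lines of NumPy. This is our own sketch, not the paper's implementation; the function and variable names are ours, and $s$ defaults to a commonly used scale value:

```python
import numpy as np

def cosine_softmax_forward(x, W, y, s=30.0):
    """Forward pass of a plain cosine softmax loss (illustrative sketch).
    x: feature vector (d,); W: class weight matrix (C, d); y: ground-truth class."""
    x_hat = x / np.linalg.norm(x)                          # normalized feature
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)   # normalized class weights
    cos_theta = W_hat @ x_hat                              # cos(theta_j) for each class j
    logits = s * cos_theta                                 # f_j = s * cos(theta_j)
    logits = logits - logits.max()                         # shift for numerical stability
    P = np.exp(logits) / np.exp(logits).sum()              # softmax probabilities P_ij
    loss = -np.log(P[y])                                   # cross-entropy L_CE(x_i)
    return cos_theta, P, loss
```

Note that the stability shift does not change the softmax output, since softmax is invariant to adding a constant to all logits.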
In the backward propagation process, classification probabilities play a key role in optimization. The gradients of $x_i$ and $W_j$ in cosine softmax losses are calculated as

$$\frac{\partial L_{CE}(x_i)}{\partial x_i} = \sum_{j=1}^{C} \left(P_{i,j} - \mathbb{1}(y_i = j)\right) \nabla f(\theta_{i,j})\, \frac{\partial \cos\theta_{i,j}}{\partial x_i}, \qquad \frac{\partial L_{CE}(x_i)}{\partial W_j} = \left(P_{i,j} - \mathbb{1}(y_i = j)\right) \nabla f(\theta_{i,j})\, \frac{\partial \cos\theta_{i,j}}{\partial W_j}, \qquad (3)$$

where the indicator function $\mathbb{1}(y_i = j)$ returns $1$ when $y_i = j$ and $0$ otherwise, and $\nabla f(\theta_{i,j}) = \partial f_j / \partial \cos\theta_{i,j}$. The terms $\frac{\partial \cos\theta_{i,j}}{\partial x_i}$ and $\frac{\partial \cos\theta_{i,j}}{\partial W_j}$ can be computed respectively as

$$\frac{\partial \cos\theta_{i,j}}{\partial x_i} = \frac{1}{\|x_i\|}\left(\hat{W}_j - \cos\theta_{i,j}\, \hat{x}_i\right), \qquad \frac{\partial \cos\theta_{i,j}}{\partial W_j} = \frac{1}{\|W_j\|}\left(\hat{x}_i - \cos\theta_{i,j}\, \hat{W}_j\right), \qquad (4)$$
where $\hat{x}_i$ and $\hat{W}_j$ are the unit vectors of $x_i$ and $W_j$, respectively. These gradients are visualized as the red arrows in Fig. 2. The gradient vector $\frac{\partial \cos\theta_{i,j}}{\partial W_j}$ gives the updating direction of the class weight $W_j$. Intuitively, we expect the update of $W_{y_i}$ to bring it closer to $x_i$, and the updates of $W_j$ for $j \neq y_i$ to push them away from $x_i$. The gradient $\frac{\partial \cos\theta_{i,j}}{\partial W_j}$ is perpendicular to $W_j$ and points toward $x_i$. Thus it is the fastest and optimal direction for updating $W_j$.
We then consider the gradient term $\nabla f(\theta_{i,j})$. In conventional cosine softmax losses [20, 28, 13], the classification score is $f_j = s\cos\theta_{i,j}$ and thus $\nabla f(\theta_{i,j}) = s$. In angular-margin-based cosine softmax losses [27, 29, 3], however, the gradient of $f_{y_i}$ depends on where the margin parameter $m$ is placed. In CosFace, $f_{y_i} = s(\cos\theta_{i,y_i} - m)$ and thus $\nabla f(\theta_{i,y_i}) = s$; in ArcFace, $f_{y_i} = s\cos(\theta_{i,y_i} + m)$ and thus $\nabla f(\theta_{i,y_i}) = s\sin(\theta_{i,y_i} + m)/\sin\theta_{i,y_i}$. In general, the gradient $\nabla f(\theta_{i,j})$ is always a scalar related to the parameters $s$, $m$ and the angle $\theta_{i,j}$.
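For the plain cosine softmax loss ($f_j = s\cos\theta_{i,j}$, so $\nabla f = s$), the backward computation can be sketched and verified against numerical differentiation. This is our own illustrative sketch under the notation above, not the paper's code:

```python
import numpy as np

def cosine_softmax_grads(x, W, y, s=30.0):
    """Analytic gradients of the plain cosine softmax loss w.r.t. x and W
    (a sketch of the Eq.-(3)-style chain rule, with our variable names)."""
    x_norm = np.linalg.norm(x)
    W_norm = np.linalg.norm(W, axis=1, keepdims=True)
    x_hat = x / x_norm
    W_hat = W / W_norm
    cos_theta = W_hat @ x_hat
    logits = s * cos_theta
    e = np.exp(logits - logits.max())
    P = e / e.sum()
    coeff = P.copy()
    coeff[y] -= 1.0                                   # P_ij - 1(y_i == j)
    # d cos(theta_j)/dx   = (W_hat_j - cos(theta_j) * x_hat) / ||x||
    dcos_dx = (W_hat - np.outer(cos_theta, x_hat)) / x_norm
    # d cos(theta_j)/dW_j = (x_hat - cos(theta_j) * W_hat_j) / ||W_j||
    dcos_dW = (x_hat[None, :] - cos_theta[:, None] * W_hat) / W_norm
    grad_x = s * (coeff @ dcos_dx)                    # sum over classes j
    grad_W = s * coeff[:, None] * dcos_dW             # one row per class weight
    return grad_x, grad_W
```

A central finite-difference check on the loss $-\log P_{i,y_i}$ is a useful sanity test for this kind of hand-derived gradient.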
Based on the aforementioned discussion, we reconsider the gradients of the class weights in Eq. (3). In $\frac{\partial L_{CE}(x_i)}{\partial W_j}$, the first part $\left(P_{i,j} - \mathbb{1}(y_i = j)\right)\nabla f(\theta_{i,j})$ is a scalar, which decides the length of the gradient, while the second part $\frac{\partial \cos\theta_{i,j}}{\partial W_j}$ is a vector, which decides the direction of the gradient. Since the directions of the gradients of the various cosine softmax losses remain the same, the essential difference among these losses lies in their different gradient lengths, which significantly affect the optimization of the model. In the following sections, we discuss the suboptimal gradient lengths caused by the forward and backward processes, respectively.
3.2 Limitations in probability calculation
In this section we discuss the limitations of the forward calculation of cosine softmax losses in deep face networks, focusing on the classification probability $P_{i,j}$ obtained in the forward calculation.
We first revisit the relation between $P_{i,j}$ and $\theta_{i,j}$. The classification probability $P_{i,j}$ in Eq. (3) is part of the gradient length, so $P_{i,j}$ significantly affects the length of the gradient. The probability $P_{i,j}$ and the logit $f_j$ are positively correlated. For all cosine softmax losses, the logits measure the angle $\theta_{i,j}$ between feature $x_i$ and class weight $W_j$. A larger $\theta_{i,j}$ produces a lower classification probability $P_{i,j}$, while a smaller $\theta_{i,j}$ produces a higher one. This means $\theta_{i,j}$ affects the gradient length through its corresponding probability $P_{i,j}$. The softmax equation sets up a mapping between $\theta_{i,j}$ and $P_{i,j}$, thereby letting $\theta_{i,j}$ affect the optimization. This analysis also explains why cosine softmax losses are effective for face recognition.
Since $\theta_{i,j}$ is the direct measurement of generalization but can only affect gradients indirectly through the corresponding $P_{i,j}$, setting up a reasonable mapping between $\theta_{i,j}$ and $P_{i,j}$ is crucial. However, there are two tricky problems in current cosine softmax losses: (1) the classification probability $P_{i,j}$ is sensitive to hyperparameter settings; (2) the calculation of $P_{i,j}$ depends on the class number, which is unrelated to face recognition tasks. We discuss these problems below.
$P_{i,y_i}$ is sensitive to hyperparameters. The most common hyperparameters in conventional cosine softmax losses [20, 28, 13] and their margin variants are the scale parameter $s$ and the angular margin parameter $m$. We analyze the sensitivity of the probability $P_{i,y_i}$ to the hyperparameters $s$ and $m$. For a more accurate analysis, we first look at the actual range of $\theta_{i,j}$. Fig. 3 exhibits how the average $\theta_{i,j}$ changes during training. Mathematically, $\theta_{i,j}$ could be any value in $[0, \pi]$. In practice, however, the maximum $\theta_{i,j}$ is around $\pi/2$. The blue curve reveals that $\theta_{i,j}$ for $j \neq y_i$ does not change significantly during training, while the brown curve reveals that $\theta_{i,y_i}$ is gradually reduced. Therefore we can reasonably assume that $\cos\theta_{i,j} \approx 0$ for $j \neq y_i$ and that the range of $\theta_{i,y_i}$ is $[0, \pi/2]$. Then $P_{i,y_i}$ can be rewritten as

$$P_{i,y_i} \approx \frac{e^{f_{y_i}}}{e^{f_{y_i}} + (C - 1)},$$

where $f_{y_i}$ is the logit assigned to the corresponding class $y_i$, and $C$ is the class number.
We can thus obtain the mapping between the probability $P_{i,y_i}$ and the angle $\theta_{i,y_i}$ under different hyperparameter settings. In state-of-the-art angular-margin-based losses, the logit is $f_{y_i} = s\cos(\theta_{i,y_i} + m)$.
Fig. 4 reveals that different settings of $s$ and $m$ can significantly affect the relation between $\theta_{i,y_i}$ and $P_{i,y_i}$. Apparently, both the green curve and the purple curve are examples of unreasonable relations. The former is so lenient that even a very large $\theta_{i,y_i}$ can produce a large $P_{i,y_i}$. The latter is so strict that even a very small $\theta_{i,y_i}$ can produce only a low $P_{i,y_i}$. In short, for a specific value of $\theta_{i,y_i}$, the probabilities under different settings are very different. This observation indicates that the probability $P_{i,y_i}$ is sensitive to the parameters $s$ and $m$.
To further confirm this conclusion, we take an example of the correspondence between $\theta_{i,y_i}$ and $P_{i,y_i}$ in real training. In Fig. 5, the red curve represents the change of $P_{i,y_i}$ and the blue curve represents the change of $\theta_{i,y_i}$ during the training process. As discussed above, a large $P_{i,y_i}$ produces very short gradients, so that the sample has little effect on the update. This setting is not ideal because $P_{i,y_i}$ rapidly increases to nearly $1$ while $\theta_{i,y_i}$ is still large. Therefore the classification probability largely depends on the setting of the hyperparameters.
$P_{i,y_i}$ contains the class number. In closed-set classification problems, probabilities become smaller as the class number $C$ grows. This is reasonable in classification tasks. However, it is not suitable for face recognition, which is an open-set problem. Since $\theta_{i,y_i}$ is the direct measurement of generalization while $P_{i,y_i}$ is the indirect measurement, we expect them to have a consistent semantic meaning. But $P_{i,y_i}$ is related to the class number $C$ while $\theta_{i,y_i}$ is not, which causes a mismatch between them.
As shown in Fig. 6, the class number $C$ is an important factor for $P_{i,y_i}$.
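Both problems can be made concrete by evaluating the approximation $P_{i,y_i} \approx e^{f_{y_i}} / (e^{f_{y_i}} + C - 1)$ under different hyperparameters and class numbers. The sketch below is ours; it assumes, as above, $\cos\theta_{i,j} \approx 0$ for $j \neq y_i$, and the default parameter values are illustrative choices, not the paper's:

```python
import math

def approx_target_prob(theta, s=30.0, m=0.0, C=10000, margin_type="arcface"):
    """Approximate P_{i,y_i} assuming cos(theta_j) ~ 0 for non-target classes,
    so each of the C-1 non-target terms contributes e^0 = 1 to the denominator."""
    if margin_type == "cosface":
        f = s * (math.cos(theta) - m)      # CosFace-style: f = s * (cos(theta) - m)
    else:
        f = s * math.cos(theta + m)        # ArcFace-style: f = s * cos(theta + m)
    return math.exp(f) / (math.exp(f) + (C - 1))

# The same angle maps to wildly different probabilities under different s,
# and the probability also shrinks as the class number C grows.
p_small_s = approx_target_prob(1.3, s=30.0)
p_large_s = approx_target_prob(1.3, s=64.0)
p_few_classes = approx_target_prob(1.3, s=30.0, C=1000)
p_many_classes = approx_target_prob(1.3, s=30.0, C=100000)
```

Plotting this function over $\theta \in [0, \pi/2]$ for a few $(s, m, C)$ settings reproduces the qualitative behavior of Figs. 4 and 6.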
From the above discussion, we reveal the limitations that exist in the forward calculation of cosine softmax losses. Both the hyperparameters and the class number, which are unrelated to face recognition tasks, can determine the probability $P_{i,j}$ and thus affect the gradient length in Eq. (3).
3.3 Limitation in backward calculation of cosine softmax losses
In this section, we discuss the limitations in the backward calculation of the cosine softmax function, especially the angular-margin-based softmax losses.
We revisit the gradients in Eq. (3). Besides $P_{i,j}$, the term $\nabla f(\theta_{i,j})$ also affects the length of the gradient. A larger $\nabla f(\theta_{i,j})$ produces longer gradients, while a smaller one produces shorter gradients. So we expect $\theta_{i,y_i}$ and the value of $\nabla f(\theta_{i,y_i})$ to be positively correlated: a small $\nabla f(\theta_{i,y_i})$ for a small $\theta_{i,y_i}$, and a large one for a larger $\theta_{i,y_i}$.
The logit $f_j$ differs across cosine softmax losses, and thus the specific form of $\nabla f(\theta_{i,j})$ differs as well. Here we focus on simple cosine softmax losses [20, 28, 13] and the state-of-the-art angular-margin-based loss. Their $\nabla f(\theta_{i,y_i})$ values are visualized in Fig. 7, which shows that the gradient lengths in conventional cosine softmax losses [20, 28, 13] are constant. In angular-margin-based losses, however, the gradient length and $\theta_{i,y_i}$ are negatively correlated, which is completely contrary to our expectation. Moreover, the correspondence between the gradient length in angular-margin-based losses and $\theta_{i,y_i}$ becomes tricky: as $\theta_{i,y_i}$ is gradually reduced, $P_{i,y_i}$ tends to shorten the gradients but $\nabla f(\theta_{i,y_i})$ tends to elongate them. Therefore, the geometric meaning of the gradient length becomes self-contradictory in angular-margin-based cosine softmax losses.
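The contrast between the two cases can be made explicit. For the plain cosine softmax loss, $\nabla f(\theta) = s$ is constant; for an ArcFace-style logit $f = s\cos(\theta + m)$, the chain rule gives $\nabla f(\theta) = s\sin(\theta + m)/\sin\theta$, which grows as $\theta$ shrinks. A minimal sketch (function name and default values are ours):

```python
import math

def grad_f_wrt_cos(theta, s=64.0, m=0.5, use_margin=True):
    """df/d(cos theta) for the target class.
    Plain cosine softmax: constant s.
    ArcFace-style f = s*cos(theta + m): s * sin(theta + m) / sin(theta),
    obtained via d cos(theta + m)/d theta divided by d cos(theta)/d theta."""
    if not use_margin:
        return s
    return s * math.sin(theta + m) / math.sin(theta)
```

Evaluating this at a small and a large angle confirms the contrary behavior: the margin-based factor is larger for smaller $\theta$, i.e., well-classified samples receive an amplified scaling.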
In the above discussion, we first revealed that the various cosine softmax losses share the same updating directions. Hence the main difference among the variants is their gradient lengths. Two scalars determine the gradient length: the probability $P_{i,j}$ from the forward process and the gradient $\nabla f(\theta_{i,j})$. We observed that $P_{i,j}$ can be substantially affected by different hyperparameter settings and class numbers, while the value of $\nabla f(\theta_{i,j})$ depends on the definition of the logit $f_j$.
4 P2SGrad: Change Probability to Similarity in Gradient
In this section, we propose a new method, namely P2SGrad, in which the gradient length is determined only by $\theta_{i,j}$ when training face recognition models. Formally, the gradient length produced by P2SGrad is hyperparameter-free, and related neither to the number of classes nor to an ad-hoc definition of the logit $f_j$. P2SGrad does not need a specific formulation of the loss function because the gradients themselves are well designed to optimize deep models.
Since the main differences between state-of-the-art cosine softmax losses are the gradient lengths, reforming a reasonable gradient length is an intuitive idea. In order to decouple the length factor and the direction factor of the gradients, we rewrite Eq. (3) as

$$\frac{\partial L_{CE}(x_i)}{\partial x_i} = \sum_{j=1}^{C} T(\theta_{i,j})\, G_{x_i,j}, \qquad \frac{\partial L_{CE}(x_i)}{\partial W_j} = T(\theta_{i,j})\, G_{W_j},$$

where the direction factors $G_{x_i,j}$ and $G_{W_j}$ are defined as

$$G_{x_i,j} = \frac{\partial \cos\theta_{i,j}}{\partial x_i} = \frac{1}{\|x_i\|}\left(\hat{W}_j - \cos\theta_{i,j}\, \hat{x}_i\right), \qquad G_{W_j} = \frac{\partial \cos\theta_{i,j}}{\partial W_j} = \frac{1}{\|W_j\|}\left(\hat{x}_i - \cos\theta_{i,j}\, \hat{W}_j\right),$$

where $\hat{x}_i$ and $\hat{W}_j$ are the unit vectors of $x_i$ and $W_j$, respectively, and $\cos\theta_{i,j}$ is the cosine distance between feature $x_i$ and class weight $W_j$. The direction factors will not be changed because, as specified before, they are the fastest-changing directions. The length factor $T(\theta_{i,j})$ is defined as

$$T(\theta_{i,j}) = \left(P_{i,j} - \mathbb{1}(y_i = j)\right)\nabla f(\theta_{i,j}).$$
The length factor depends on the probability $P_{i,j}$ and on $\nabla f(\theta_{i,j})$, which are the two terms we aim to reform.
Since we expect the new length factor to be hyperparameter-free, its reformed $\nabla f$ term should not involve hyperparameters such as $s$ or $m$. Thus a constant $1$ is an ideal choice.
For the probability $P_{i,j}$, because it is hard to set a reasonable mapping function between $\theta_{i,j}$ and $P_{i,j}$, we can directly use $\cos\theta_{i,j}$ as a good alternative to $P_{i,j}$ in the gradient length term. Firstly, they have the same theoretical range of $[0, 1]$ when $\theta_{i,j} \in [0, \pi/2]$. Secondly, unlike $P_{i,j}$, which is adversely influenced by the hyperparameters and the number of classes, $\cos\theta_{i,j}$ involves none of these. This means we do not need to select specific parameter settings to obtain an ideal correspondence between $\theta_{i,j}$ and the gradient length. Moreover, compared with $P_{i,j}$, $\cos\theta_{i,j}$ is a more natural supervision because cosine similarities are used in the testing phase of open-set face recognition systems, while probabilities only apply to closed-set classification tasks. Therefore, our reformed gradient length factor can be defined as
$$\hat{T}(\theta_{i,j}) = \cos\theta_{i,j} - \mathbb{1}(y_i = j),$$

where $\hat{T}(\theta_{i,j})$ is a function of $\theta_{i,j}$. The reformed gradients can then be defined as

$$\tilde{G}_{x_i,j} = \hat{T}(\theta_{i,j})\, \frac{\partial \cos\theta_{i,j}}{\partial x_i}, \qquad \tilde{G}_{W_j} = \hat{T}(\theta_{i,j})\, \frac{\partial \cos\theta_{i,j}}{\partial W_j},$$

where $\mathbb{1}(y_i = j)$ is the indicator function. The full formulation can be rewritten as

$$\tilde{G}_{x_i} = \sum_{j=1}^{C}\left(\cos\theta_{i,j} - \mathbb{1}(y_i = j)\right)\frac{\partial \cos\theta_{i,j}}{\partial x_i}, \qquad \tilde{G}_{W_j} = \left(\cos\theta_{i,j} - \mathbb{1}(y_i = j)\right)\frac{\partial \cos\theta_{i,j}}{\partial W_j}.$$
The formulation of P2SGrad is not only succinct but also reasonable. When $y_i = j$, the proposed gradient length $1 - \cos\theta_{i,j}$ and $\theta_{i,j}$ are positively correlated; when $y_i \neq j$, the gradient length $\cos\theta_{i,j}$ and $\theta_{i,j}$ are negatively correlated. More importantly, the gradient length in P2SGrad depends only on $\theta_{i,j}$ and is thus consistent with the testing metric of face recognition systems.
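The full P2SGrad update can be sketched directly in NumPy: the hand-crafted gradients replace $P_{i,j}$ with $\cos\theta_{i,j}$ and drop the $\nabla f$ factor, so no loss function and no hyperparameter is needed. This is our own sketch with our own variable names, paralleling the cosine softmax gradient computation above:

```python
import numpy as np

def p2sgrad(x, W, y):
    """P2SGrad sketch: gradient length (cos(theta_j) - 1(y_i == j)) times the
    usual direction factors d cos(theta_j)/dx and d cos(theta_j)/dW_j."""
    x_norm = np.linalg.norm(x)
    W_norm = np.linalg.norm(W, axis=1, keepdims=True)
    x_hat = x / x_norm
    W_hat = W / W_norm
    cos_theta = W_hat @ x_hat
    coeff = cos_theta.copy()
    coeff[y] -= 1.0                                   # cos(theta_j) - 1(y_i == j)
    dcos_dx = (W_hat - np.outer(cos_theta, x_hat)) / x_norm
    dcos_dW = (x_hat[None, :] - cos_theta[:, None] * W_hat) / W_norm
    grad_x = coeff @ dcos_dx                          # sum over classes j
    grad_W = coeff[:, None] * dcos_dW                 # one row per class weight
    return grad_x, grad_W
```

Note the fixed point: when $\cos\theta_{i,y_i} = 1$ and $\cos\theta_{i,j} = 0$ for $j \neq y_i$, every coefficient vanishes and the gradients are exactly zero, which is precisely the ideal testing-time configuration.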
5 Experiments
In this section, we conduct a series of experiments to evaluate the proposed P2SGrad. We first verify the advantages of P2SGrad in exploratory experiments by testing the model's performance on LFW. We then evaluate P2SGrad on the MegaFace Challenge and on IJB-C 1:1 verification with the same training configuration.
5.1 Exploratory Experiments
Preprocessing and training setting. We use CASIA-WebFace as the training data and ResNet-50 as the backbone network architecture. The WebFace dataset is cleaned and contains roughly half a million facial images. RSA is applied to extract facial areas, and the faces are then aligned by similarity transformation. All images are resized to the same resolution, and pixel values are normalized by mean subtraction and scaling. For all exploratory experiments, a fixed mini-batch size is used in every iteration.
The mapping from $\theta_{i,y_i}$ to gradient length. Since P2SGrad aims to set up a reasonable mapping from $\theta_{i,y_i}$ to the length of the gradients, it is necessary to visualize this mapping. In order to demonstrate the advantage of P2SGrad, we plot the mapping curves of several cosine-based softmax losses in Fig. 8. The figure clearly shows that P2SGrad produces more reasonable gradient lengths according to the change of $\theta_{i,y_i}$.
Robustness to initial learning rates. An important problem of margin-based losses is that they are difficult to train with large learning rates. The implementations of L-softmax and A-softmax use extra hyperparameters to adjust the margin so that the models are trainable. Thus a small initial learning rate is important for properly training angular-margin-based softmax losses. In contrast, as shown in Table 1, our proposed P2SGrad is stable with large learning rates.
[Table: accuracy of each method vs. number of training iterations]
[Table: accuracy of each method vs. size of the MegaFace distractor set]
[Table: True Acceptance Rate @ False Acceptance Rate for each method, including Crystal Loss]
Convergence rate. The convergence rate is important for evaluating optimization methods. We evaluated the performance of models trained with several cosine-based softmax losses and with our P2SGrad on the Labeled Faces in the Wild (LFW) dataset at different training stages. LFW is an academic test set for unrestricted face verification; its testing protocol contains about 13,000 images of about 5,700 identities, with 3,000 positive matches and the same number of negative matches. Table 2 shows the results under the same training configuration, while Fig. 9 shows that the average $\theta_{i,y_i}$ decreases more rapidly with P2SGrad than with other losses. These results reveal that our proposed P2SGrad optimizes the neural network much faster.
5.2 Evaluation on MegaFace
Preprocessing and training setting. Besides the aforementioned WebFace dataset, we add another public training dataset, MS1M, which contains millions of cleaned and aligned images. Here we use Inception-ResNet [5, 24] as the backbone for training.
Evaluation results. The MegaFace 1 Million Challenge is a public identification benchmark that tests the performance of facial identification algorithms. The distractor set in MegaFace contains about 1 million images. Here we follow the cleaned testing protocol. The results of P2SGrad on the MegaFace dataset are shown in Table 3. P2SGrad exceeds the other compared cosine-based losses on the MegaFace 1 Million Challenge at every distractor-set size.
5.3 Evaluation on IJB-C 1:1 verification
Preprocessing and training setting. The same as in Sec. 5.2.
Evaluation results. The IJB-C dataset contains about 3,500 identities with a total of about 31,000 still facial images and about 117,000 unconstrained video frames. The IJB-C testing protocols are designed to test the detection, identification, verification and clustering of faces. In the 1:1 verification protocol, the number of negative matches far exceeds the number of positive matches. Therefore, we test True Acceptance Rates at very strict False Acceptance Rates. Table 4 shows that P2SGrad surpasses all the other cosine-based losses.
6 Conclusions
In this paper, we comprehensively discussed the limitations of the forward and backward processes in training deep models for face recognition. To deal with these limitations, we proposed a simple but effective gradient method, P2SGrad, which is hyperparameter-free and leads to better optimization results. Unlike previous methods, which focused on loss functions, we improve deep network training by using carefully designed gradients. Extensive experiments validate the robustness and fast convergence of the proposed method. Moreover, experimental results show that P2SGrad achieves superior performance over state-of-the-art methods on several challenging face recognition benchmarks.
Acknowledgements. This work is supported in part by SenseTime Group Limited, in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14213616, CUHK14208417, CUHK14239816, in part by CUHK Direct Grant, and in part by National Natural Science Foundation of China (61472410) and the Joint Lab of CAS-HK.
-  Peter N. Belhumeur, João P Hespanha, and David J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on pattern analysis and machine intelligence, 19(7):711–720, 1997.
-  Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, pages 539–546 vol. 1, 2005.
-  Jiankang Deng, Jia Guo, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
-  Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
-  Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
-  Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Neeraj Kumar, Alexander C Berg, Peter N Belhumeur, and Shree K Nayar. Attribute and simile classifiers for face verification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 365–372. IEEE, 2009.
-  Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, 2017.
-  Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, pages 507–516, 2016.
-  Yu Liu, Hongyang Li, and Xiaogang Wang. Learning deep features via congenerous cosine loss for person recognition. arXiv preprint arXiv:1702.06890, 2017.
-  Yu Liu, Hongyang Li, and Xiaogang Wang. Rethinking feature discrimination and polymerization for large-scale recognition. arXiv preprint arXiv:1710.00870, 2017.
-  Yu Liu, Hongyang Li, Junjie Yan, Fangyin Wei, Xiaogang Wang, and Xiaoou Tang. Recurrent scale approximation for object detection in cnn. In IEEE International Conference on Computer Vision, 2017.
-  Brianna Maze, Jocelyn Adams, James A Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K Jain, W Tyler Niggel, Janet Anderson, Jordan Cheney, et al. Iarpa janus benchmark–c: Face dataset and protocol. In 11th IAPR International Conference on Biometrics, 2018.
-  Aaron Nech and Ira Kemelmacher-Shlizerman. Level playing field for million scale face recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3406–3415. IEEE, 2017.
-  Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
-  Rajeev Ranjan, Ankan Bansal, Hongyu Xu, Swami Sankaranarayanan, Jun-Cheng Chen, Carlos D Castillo, and Rama Chellappa. Crystal loss and quality pooling for unconstrained face verification and recognition. arXiv preprint arXiv:1804.01159, 2018.
-  Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
-  Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
-  Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pages 1988–1996, 2014.
-  Yi Sun, Ding Liang, Xiaogang Wang, and Xiaoou Tang. Deepid3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
-  Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.
-  Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
-  Feng Wang, Weiyang Liu, Haijun Liu, and Jian Cheng. Additive margin softmax for face verification. arXiv preprint arXiv:1801.05599, 2018.
-  Feng Wang, Xiang Xiang, Jian Cheng, and Alan L Yuille. Normface: hypersphere embedding for face verification. arXiv preprint arXiv:1704.06369, 2017.
-  Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Zhifeng Li, Dihong Gong, Jingchao Zhou, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. arXiv preprint arXiv:1801.09414, 2018.
-  Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
-  Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
-  Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
-  Xiao Zhang, Zhiyuan Fang, Yandong Wen, Zhifeng Li, and Yu Qiao. Range loss for deep face recognition with long-tailed training data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5409–5418, 2017.