P2SGrad: Refined Gradients for Optimizing Deep Face Models

05/07/2019 ∙ by Xiao Zhang, et al. ∙ The Chinese University of Hong Kong

Cosine-based softmax losses significantly improve the performance of deep face recognition networks. However, these losses always include sensitive hyper-parameters which can make training process unstable, and it is very tricky to set suitable hyper parameters for a specific dataset. This paper addresses this challenge by directly designing the gradients for adaptively training deep neural networks. We first investigate and unify previous cosine softmax losses by analyzing their gradients. This unified view inspires us to propose a novel gradient called P2SGrad (Probability-to-Similarity Gradient), which leverages a cosine similarity instead of classification probability to directly update the testing metrics for updating neural network parameters. P2SGrad is adaptive and hyper-parameter free, which makes the training process more efficient and faster. We evaluate our P2SGrad on three face recognition benchmarks, LFW, MegaFace, and IJB-C. The results show that P2SGrad is stable in training, robust to noise, and achieves state-of-the-art performance on all the three benchmarks.




1 Introduction

Figure 1: Pipeline of current face recognition systems. In this general pipeline, deep face models trained on classification tasks are treated as feature extractors. Pairwise similarities between pairs of test images are calculated to determine whether they belong to the same person. Best viewed in color.

Over the last few years, deep convolutional neural networks have significantly boosted face recognition accuracy. State-of-the-art approaches are based on deep neural networks and adopt the following pipeline: training a classification model with some type of softmax loss, then using the trained model as a feature extractor to encode unseen samples. The cosine similarities between testing faces' features are then used to determine whether these features belong to the same identity. Unlike other vision tasks, such as object detection, where training and testing share the same objectives and evaluation procedures, face recognition systems are trained with softmax losses but tested with cosine similarities. In other words, there is a gap between the softmax probabilities used in training and the inner-product similarities used in testing.
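The testing stage of this pipeline can be sketched in a few lines. This is an illustrative sketch only; the threshold value is an arbitrary assumption, not a value from the paper:

```python
import numpy as np

def cosine_similarity(f1, f2):
    """Cosine similarity between two face feature vectors."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def same_identity(f1, f2, threshold=0.5):
    """Verification decision: same person iff similarity exceeds a threshold
    (the threshold here is hypothetical; in practice it is tuned per benchmark)."""
    return cosine_similarity(f1, f2) >= threshold
```

Note that the decision depends only on the two features being compared, which is exactly the property the paper later contrasts with classification probabilities.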

This problem is not well addressed in existing face recognition models trained with the softmax cross-entropy loss function (softmax loss for short in the remainder of this paper), which mainly considers the probability distributions of training classes and ignores the testing setup. In order to bridge this gap, cosine softmax losses [28, 13, 14] and their angular margin based variants [29, 27, 3] directly use cosine distances instead of inner products as the raw classification scores, namely logits. In particular, the angular margin based variants aim to learn decision boundaries with a margin between different classes. These methods improve face recognition performance in challenging setups.

In spite of their successes, cosine-based softmax loss is only a trade-off: the supervision signals for training are still classification probabilities, which are never evaluated during testing. Considering the fact that the similarity between two testing face images is only related to themselves while the classification probabilities are related to all the identities, cosine softmax losses are not the ideal training measures in face recognition.

This paper aims to address these problems from a different perspective. Deep neural networks are generally trained with gradient-based optimization algorithms where gradients play an essential role in this process. In addition to the loss function, we focus on the gradients of cosine softmax loss functions. This new perspective not only allows us to analyze the relations and problems of previous methods, but also inspires us to develop a novel form of adaptive gradients, P2SGrad, which mitigates the problem of training-testing mismatch and improves the face recognition performance in practice.

To be more specific, P2SGrad optimizes deep models by directly designing new gradients instead of new loss functions. Compared with the conventional gradients of cosine-based softmax losses, P2SGrad uses cosine distances to replace the classification probabilities in the original gradients. P2SGrad thereby eliminates the effects of hyperparameters and of the number of classes, and matches the testing targets.

This paper mainly contributes in the following aspects:

  1. We analyze the recent cosine softmax losses and their angular-margin based variants from the perspective of gradients, and propose a general formulation to unify different cosine softmax cross-entropy losses;

  2. With this unified model, we propose adaptive, hyperparameter-free gradients, P2SGrad, instead of a new loss function for training deep face recognition networks. This method preserves the advantages of using cosine distances in training and replaces classification probabilities with cosine similarities in the backward propagation;

  3. We conduct extensive experiments on large-scale face datasets. Experimental results show that P2SGrad outperforms state-of-the-art methods on the same setup and clearly improves the stability of the training process.

2 Related Works

The accuracy improvements of face recognition [9, 6, 18, 25] benefit from large-scale training data and improved neural network architectures. Modern face datasets contain a huge number of identities, such as LFW [7], PubFig [10], CASIA-WebFace [32], MS1M [4] and MegaFace [17, 8], which enable the effective training of very deep neural networks. A number of recent studies demonstrated that well-designed network architectures lead to better performance, such as DeepFace [26], DeepID2, 3 [22, 23] and FaceNet [21].

In face recognition, feature representation normalization, which restricts features to lie on a fixed-radius hyper-sphere, is a common operation to enhance models’ final performance. COCO loss [13, 14] and NormFace [28] studied the effect of normalization through mathematical analysis and proposed two strategies through reformulating softmax loss and metric learning. Coincidentally, L2-softmax [20] also proposed a similar method. These methods obtain the same formulation of cosine softmax loss from different views.

Optimizing auxiliary metric loss functions is also a popular choice for boosting performance. In the early years, most face recognition approaches utilized metric loss functions, such as the triplet loss [30] and the contrastive loss [2], which use a Euclidean margin to measure distances between features. Taking advantage of these works, center loss [31] and range loss [33] were proposed to reduce intra-class variation by minimizing distances within target classes [1].

Simply using Euclidean distance or Euclidean margin is insufficient to maximize the classification performance. To circumvent this difficulty, angular margin based softmax loss functions were proposed and became popular in face recognition. Angular constraints were added to traditional softmax loss function to improve feature discriminativeness in L-softmax [12] and A-softmax [11], where A-softmax applied weight normalization but L-softmax [12] did not. CosFace [29], AM-softmax [27] and ArcFace [3] also embraced the idea of angular margins and employed simpler as well as more intuitive loss functions compared with aforementioned methods. Normalization is applied to both features and weights in these methods.

3 Limitations of cosine softmax losses

In this section we discuss limitations caused by the mismatch between training and testing of face recognition models. We first provide a brief review of the workflow of cosine softmax losses. Then we will reveal the limitations of existing loss functions in face recognition from the perspective of forward and backward calculation respectively.

3.1 Gradients of cosine softmax losses

In face recognition tasks, the cosine softmax cross-entropy loss has an elegant two-part formulation: the softmax function and the cross-entropy loss.

We discuss the softmax function first. Assume that the vector $x_i$ denotes the feature representation of a face image. The input of the softmax function is the logit $f_{i,j}$, i.e.,

$$f_{i,j} = s \cdot \cos\theta_{i,j} = s \cdot \hat{W}_j^{T} \hat{x}_i,$$

where $s$ is a scale hyperparameter, $f_{i,j}$ is the classification score (logit) assigned to class $j$, and $W_j$ is the weight vector of class $j$. $\hat{x}_i$ and $\hat{W}_j$ are the normalized vectors of $x_i$ and $W_j$ respectively. $\theta_{i,j}$ is the angle between feature $x_i$ and class weight $W_j$. The logits are then fed into the softmax function to obtain the probability $P_{i,j} = e^{f_{i,j}} / \sum_{k=1}^{C} e^{f_{i,k}}$, where $C$ is the number of classes and the output $P_{i,j}$ can be interpreted as the probability of $x_i$ being assigned to class $j$. If $j = y_i$, then $P_{i,y_i}$ is the class probability of $x_i$ being assigned to its corresponding class $y_i$.

Then we discuss the cross-entropy loss associated with the softmax function, which measures the divergence between the predicted probability and the ground-truth distribution as

$$L_{CE}(x_i) = -\log P_{i,y_i} = -\log \frac{e^{f_{i,y_i}}}{\sum_{k=1}^{C} e^{f_{i,k}}},$$

where $L_{CE}(x_i)$ is the loss of input feature $x_i$. The larger the probability $P_{i,y_i}$ is, the smaller the loss is.

In order to decrease the loss $L_{CE}(x_i)$, the model needs to enlarge $P_{i,y_i}$ and thus enlarge $\cos\theta_{i,y_i}$; then $\theta_{i,y_i}$ becomes smaller. In summary, the cosine softmax loss function maps $\theta_{i,y_i}$ to the probability $P_{i,y_i}$ and calculates the cross-entropy loss to supervise the training.
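The forward computation described above can be sketched in NumPy. This is a minimal sketch of the plain cosine softmax loss (no margin); the function name and the default scale value are illustrative assumptions:

```python
import numpy as np

def cosine_softmax_loss(x, W, y, s=30.0):
    """Forward pass of a plain cosine softmax loss.
    x: (d,) feature, W: (C, d) class weight matrix, y: ground-truth class
    index, s: scale hyperparameter (value here is illustrative)."""
    x_hat = x / np.linalg.norm(x)
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
    logits = s * (W_hat @ x_hat)        # f_j = s * cos(theta_j)
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p = p / p.sum()                     # probabilities P_j
    return -np.log(p[y]), p             # cross-entropy loss and P
```

When the feature is perfectly aligned with its class weight and orthogonal to the others, the loss is close to zero, as the text describes.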

In the backward propagation process, classification probabilities play key roles in optimization. The gradients of $x_i$ and $W_j$ in cosine softmax losses are calculated as

$$\frac{\partial L_{CE}}{\partial x_i} = \sum_{j=1}^{C} \left(P_{i,j} - \mathbb{1}(y_i = j)\right) \frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}} \cdot \frac{\partial \cos\theta_{i,j}}{\partial x_i}, \quad
\frac{\partial L_{CE}}{\partial W_j} = \left(P_{i,j} - \mathbb{1}(y_i = j)\right) \frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}} \cdot \frac{\partial \cos\theta_{i,j}}{\partial W_j},$$

where the indicator function $\mathbb{1}(y_i = j)$ returns $1$ when $y_i = j$ and $0$ otherwise. The terms $\frac{\partial \cos\theta_{i,j}}{\partial x_i}$ and $\frac{\partial \cos\theta_{i,j}}{\partial W_j}$ can be computed respectively as

$$\frac{\partial \cos\theta_{i,j}}{\partial x_i} = \frac{1}{\|x_i\|}\left(\hat{W}_j - \cos\theta_{i,j} \cdot \hat{x}_i\right), \quad
\frac{\partial \cos\theta_{i,j}}{\partial W_j} = \frac{1}{\|W_j\|}\left(\hat{x}_i - \cos\theta_{i,j} \cdot \hat{W}_j\right),$$

where $\hat{x}_i$ and $\hat{W}_j$ are the unit vectors of $x_i$ and $W_j$, respectively. These gradients are visualized as the red arrows in Fig. 2. The gradient vector $\frac{\partial \cos\theta_{i,j}}{\partial W_j}$ is the updating direction of class weight $W_j$. Intuitively, we expect the update of $W_{y_i}$ to move it closer to $x_i$, and the updates of $W_j$ for $j \neq y_i$ to move them away from $x_i$. The gradient $\frac{\partial \cos\theta_{i,j}}{\partial W_j}$ is orthogonal to $W_j$ and points toward $\hat{x}_i$; thus it is the fastest and optimal direction for updating $W_j$.
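The direction factor for the class weight can be checked numerically against a finite-difference approximation. The helper below is an illustrative sketch of that formula (the function name is ours):

```python
import numpy as np

def dcos_dW(x, W_j):
    """Gradient of cos(theta) between x and W_j, taken w.r.t. W_j:
    (1/||W_j||) * (x_hat - cos(theta) * W_hat)."""
    x_hat = x / np.linalg.norm(x)
    W_hat = W_j / np.linalg.norm(W_j)
    cos_t = float(x_hat @ W_hat)
    # orthogonal to W_j and lying in the plane spanned by x and W_j
    return (x_hat - cos_t * W_hat) / np.linalg.norm(W_j)
```

Since the gradient is orthogonal to $W_j$ itself, an update along it rotates $W_j$ toward $\hat{x}_i$ without first growing its norm, which is the geometric point made above.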

Figure 2: Gradient direction of $\frac{\partial \cos\theta_{i,j}}{\partial W_j}$. Note this gradient is the updating direction of $W_j$. The red arrow shows that the gradient of $\cos\theta_{i,j}$ w.r.t. $W_j$ is orthogonal to $W_j$ itself and lies in the plane spanned by $\hat{x}_i$ and $\hat{W}_j$. This can be seen as the fastest direction for updating $W_{y_i}$ to be close to $x_i$ and for updating $W_j$, $j \neq y_i$, to be far away from $x_i$. Best viewed in color.

Then we consider the gradient $\frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}}$. In conventional cosine softmax losses [20, 28, 13], the classification score is $f_{i,j} = s \cdot \cos\theta_{i,j}$ and thus $\frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}} = s$. In angular margin-based cosine softmax losses [27, 29, 3], however, the gradient of $f_{i,y_i}$ depends on where the margin parameter $m$ is placed. In CosFace [29], $f_{i,y_i} = s \left(\cos\theta_{i,y_i} - m\right)$, thus $\frac{\partial f_{i,y_i}}{\partial \cos\theta_{i,y_i}} = s$, and in ArcFace [3], $f_{i,y_i} = s \cdot \cos(\theta_{i,y_i} + m)$, thus $\frac{\partial f_{i,y_i}}{\partial \cos\theta_{i,y_i}} = s \cdot \frac{\sin(\theta_{i,y_i} + m)}{\sin\theta_{i,y_i}}$. In general, the gradient $\frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}}$ is always a scalar related to the parameters $s$, $m$ and the angle $\theta_{i,j}$.
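These scalar factors can be compared directly in a short sketch (the variant names are ours, and the ArcFace expression follows from the chain rule, $\mathrm{d}\cos(\theta+m)/\mathrm{d}\cos\theta = \sin(\theta+m)/\sin\theta$):

```python
import numpy as np

def df_dcos(theta, s=64.0, m=0.5, variant="normface"):
    """Scalar d f / d cos(theta) for the target class.
    normface: f = s*cos(t); cosface: f = s*(cos(t) - m); arcface: f = s*cos(t + m).
    Default s and m are illustrative values."""
    if variant in ("normface", "cosface"):
        return s                                   # constant in theta
    if variant == "arcface":
        return s * np.sin(theta + m) / np.sin(theta)
    raise ValueError(variant)
```

Evaluating the ArcFace branch shows that a small angle yields a much larger factor than a large angle, which is the negative correlation criticized in Sec. 3.3.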

Based on the aforementioned discussion, we reconsider the gradients of class weights in Eq. (3). In $\frac{\partial L_{CE}}{\partial W_j}$, the first part $\left(P_{i,j} - \mathbb{1}(y_i = j)\right)\frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}}$ is a scalar, which decides the length of the gradient, while the second part $\frac{\partial \cos\theta_{i,j}}{\partial W_j}$ is a vector, which decides the direction of the gradient. Since the directions of gradients for the various cosine softmax losses remain the same, the essential difference between these losses lies in their gradient lengths, which significantly affect the optimization of the model. In the following sections, we discuss the suboptimal gradient lengths caused by the forward and backward processes respectively.

3.2 Limitations in probability calculation

In this section we discuss the limitations of the forward calculation of cosine softmax losses in deep face networks and focus on the classification probability obtained in the forward calculation.

We first revisit the relation between $P_{i,y_i}$ and $\theta_{i,y_i}$. The classification probability $P_{i,j}$ in Eq. (3) is part of the gradient length, so $P_{i,j}$ significantly affects the length of the gradient. Probability $P_{i,j}$ and logit $f_{i,j}$ are positively correlated. For all cosine softmax losses, the logit measures $\theta_{i,j}$ between feature $x_i$ and class weight $W_j$: a larger $\theta_{i,y_i}$ produces a lower classification probability $P_{i,y_i}$, while a smaller $\theta_{i,y_i}$ produces a higher one. It means that $\theta_{i,y_i}$ affects the gradient length through its corresponding probability $P_{i,y_i}$. The logit thus sets up a mapping between $\theta_{i,y_i}$ and $P_{i,y_i}$ and makes $\theta_{i,y_i}$ affect the optimization. This analysis is also the reason why cosine softmax losses are effective for face recognition.

Since $\theta_{i,y_i}$ is the direct measurement of generalization but can only indirectly affect gradients through the corresponding $P_{i,y_i}$, setting a reasonable mapping between $\theta_{i,y_i}$ and $P_{i,y_i}$ is crucial. However, there are two tricky problems in current cosine softmax losses: (1) the classification probability $P_{i,y_i}$ is sensitive to hyperparameter settings; (2) the calculation of $P_{i,y_i}$ depends on the class number, which is unrelated to face recognition tasks. We discuss these problems below.

Figure 3: The change of the average $\theta_{i,j}$ of each mini-batch when training on the WebFace dataset. (Red) average angles $\theta_{i,j}$ in each mini-batch for non-corresponding classes, i.e., $j \neq y_i$. (Brown) average angles $\theta_{i,y_i}$ in each mini-batch for corresponding classes.

$P_{i,y_i}$ is sensitive to hyperparameters. The most common hyperparameters in conventional cosine softmax losses [20, 28, 13] and margin variants [3] are the scale parameter $s$ and the angular margin parameter $m$. We analyze the sensitivity of the probability $P_{i,y_i}$ to the hyperparameters $s$ and $m$. For a more accurate analysis, we first look at the actual range of $\theta_{i,j}$. Fig. 3 exhibits how the average $\theta_{i,j}$ changes during training. Mathematically, $\theta_{i,j}$ could be any value in $[0°, 180°]$. In practice, however, the maximum $\theta_{i,j}$ is around $90°$. The red curve reveals that the angles $\theta_{i,j}$ for $j \neq y_i$ do not change significantly during training. The brown curve reveals that $\theta_{i,y_i}$ is gradually reduced. Therefore we can reasonably assume that $\theta_{i,j} \approx 90°$ for $j \neq y_i$ and that the range of $\theta_{i,y_i}$ is $[0°, 90°]$. Under this assumption the non-target logits are close to $0$, so $P_{i,y_i}$ can be rewritten as

$$P_{i,y_i} \approx \frac{e^{f_{i,y_i}}}{e^{f_{i,y_i}} + C - 1},$$

where $f_{i,y_i}$ is the logit assigned to the corresponding class $y_i$, and $C$ is the class number.

We can thus obtain the mapping between probability $P_{i,y_i}$ and angle $\theta_{i,y_i}$ under different hyperparameter settings. In state-of-the-art angular margin based losses [3], the logit is $f_{i,y_i} = s \cdot \cos(\theta_{i,y_i} + m)$.
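Under the $90°$ assumption above, this mapping can be sketched directly; the default hyperparameter values below are illustrative, not the paper's settings:

```python
import numpy as np

def prob_vs_angle(theta, s=64.0, m=0.5, num_classes=10000):
    """Approximate target-class probability as a function of theta_y (radians),
    assuming theta_j ~ 90 degrees (so f_j ~ 0) for every non-target class."""
    f_y = s * np.cos(theta + m)            # ArcFace-style target logit
    return np.exp(f_y) / (np.exp(f_y) + num_classes - 1)
```

Evaluating this function at a fixed angle for different scales shows how drastically $s$ reshapes the angle-to-probability mapping, which is the sensitivity the text describes.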

Figure 4: Probability curves of $P_{i,y_i}$ w.r.t. the angle $\theta_{i,y_i}$ under different hyperparameter settings.

Fig. 4 reveals that different settings of $s$ and $m$ can significantly affect the relation between $P_{i,y_i}$ and $\theta_{i,y_i}$. Apparently, both the green curve and the purple curve are examples of unreasonable relations: the former is so lenient that even a very large $\theta_{i,y_i}$ produces a large $P_{i,y_i}$, while the latter is so strict that even a very small $\theta_{i,y_i}$ produces only a low $P_{i,y_i}$. In short, for a specific value of $\theta_{i,y_i}$, the probabilities under different settings are very different. This observation indicates that the probability $P_{i,y_i}$ is sensitive to the parameters $s$ and $m$.

To further confirm this conclusion, we take an example of the correspondence between $P_{i,y_i}$ and $\theta_{i,y_i}$ in real training. In Fig. 5, the red curve represents the change of $P_{i,y_i}$ and the blue curve represents the change of $\theta_{i,y_i}$ during the training process. As we discussed above, a large $P_{i,y_i}$ produces very short gradients, so that the sample has little effect on the update. This setting is not ideal because $P_{i,y_i}$ increases rapidly while $\theta_{i,y_i}$ is still large. Therefore the classification probability largely depends on the setting of the hyperparameters.

Figure 5: The change of probability $P_{i,y_i}$ and angle $\theta_{i,y_i}$ as the iteration number increases, under a fixed hyperparameter setting. Best viewed in color.

$P_{i,y_i}$ contains the class number. In closed-set classification problems, probabilities become smaller as the class number $C$ grows. This is reasonable in classification tasks. However, it is not suitable for face recognition, which is an open-set problem. Since $\theta_{i,y_i}$ is the direct measurement of the generalization ability while $P_{i,y_i}$ is an indirect one, we expect them to have a consistent semantic meaning. But $P_{i,y_i}$ is related to the class number $C$ while $\theta_{i,y_i}$ is not, which causes a mismatch between them.
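The class-number dependence can be made concrete with the same $90°$ approximation (the margin is dropped here for simplicity; default values are illustrative):

```python
import numpy as np

def target_prob(theta_y, s=30.0, num_classes=1000):
    """Target-class probability under the assumption that all non-target
    angles are ~90 degrees; no margin, just f_y = s*cos(theta_y)."""
    f_y = s * np.cos(theta_y)
    return np.exp(f_y) / (np.exp(f_y) + num_classes - 1)
```

At the same angle, the probability collapses as the number of classes grows, even though the cosine similarity relevant at test time is unchanged.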

Figure 6: $P_{i,y_i}$ curves with different class numbers. The hyperparameter setting is fixed for fair comparison. Best viewed in color.

As shown in Fig. 6, the class number is an important factor for $P_{i,y_i}$.

From the above discussion, we see that limitations exist in the forward calculation of cosine softmax losses. Both the hyperparameters and the class number, which are unrelated to face recognition tasks, can determine the probability $P_{i,y_i}$ and thus affect the gradient length in Eq. (3).

3.3 Limitation in backward calculation of cosine softmax losses

In this section, we discuss the limitations in the backward calculation of the cosine softmax function, especially the angular-margin based softmax losses [3].

We revisit the gradient in Eq. (3). Besides $P_{i,j}$, the term $\frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}}$ also affects the length of the gradient: a larger value produces longer gradients while a smaller one produces shorter gradients. So we expect $\theta_{i,y_i}$ and the value of $\frac{\partial f_{i,y_i}}{\partial \cos\theta_{i,y_i}}$ to be positively correlated: a small value for small $\theta_{i,y_i}$ and a large value for larger $\theta_{i,y_i}$.

Figure 7: How $\frac{\partial f_{i,y_i}}{\partial \cos\theta_{i,y_i}}$ affects the length of gradients. (Left) the correspondence between $\theta_{i,y_i}$ and $\frac{\partial f_{i,y_i}}{\partial \cos\theta_{i,y_i}}$. The red curve shows that this term is constant in conventional cosine softmax losses [20, 28, 13], while the blue curve shows that a small $\theta_{i,y_i}$ can produce a very large value. (Right) each point refers to a feature $x_i$ and the vertical vector is the weight $W_{y_i}$; $\theta_{i,y_i}$ is the angle between each $x_i$ and $W_{y_i}$. The color from light to dark corresponds to the value of $\frac{\partial f_{i,y_i}}{\partial \cos\theta_{i,y_i}}$ from small to large. Hence, for this factor, the dark points produce longer gradients than the light points. Best viewed in color.

The logit $f_{i,j}$ differs among the various cosine softmax losses, and thus the specific form of $\frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}}$ differs as well. Here we focus on simple cosine softmax losses [20, 28, 13] and the state-of-the-art angular margin based loss [3]. Their $\frac{\partial f_{i,y_i}}{\partial \cos\theta_{i,y_i}}$ curves are visualized in Fig. 7, which shows that the corresponding gradient lengths in conventional cosine softmax losses [20, 28, 13] are constant. In angular margin-based losses [3], however, the gradient length and $\theta_{i,y_i}$ are negatively correlated, which is completely contrary to our expectation. Moreover, the correspondence between the gradient length and $\theta_{i,y_i}$ in angular margin-based losses [3] becomes tricky: when $\theta_{i,y_i}$ is gradually reduced, the shrinking term $\left|P_{i,y_i} - 1\right|$ tends to shorten the gradient, but the growing term $\frac{\sin(\theta_{i,y_i} + m)}{\sin\theta_{i,y_i}}$ tends to elongate it. Therefore, the geometric meaning of the gradient length becomes self-contradictory in angular margin-based cosine softmax losses.

3.4 Summary

In the above discussion, we first revealed that the various cosine softmax losses share the same updating directions; hence the main difference between the variants is their gradient lengths. Two scalars determine the length of the gradient: the probability $P_{i,j}$ from the forward process and the gradient $\frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}}$ from the backward process. For $P_{i,j}$, we observed that it can be substantially affected by different hyperparameter settings and class numbers. For $\frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}}$, its value depends on the definition of the logit $f_{i,j}$.

In summary, from the perspective of gradient, the widely used cosine softmax losses [20, 28, 13] and their angular margin variants [3] cannot produce optimal gradient lengths with well-explained geometric meanings.

4 P2SGrad: Change Probability to Similarity in Gradient

In this section, we propose a new method, namely P2SGrad, that determines the gradient length only by $\theta_{i,j}$ when training face recognition models. The gradient length produced by P2SGrad is hyperparameter-free and related neither to the number of classes nor to an ad-hoc definition of the logit $f_{i,j}$. P2SGrad does not need a specific loss function formulation because the gradients themselves are designed to optimize deep models.

Since the main differences between state-of-the-art cosine softmax losses are the gradient lengths, reforming a reasonable gradient length is an intuitive idea. In order to decouple the length factor and the direction factor of the gradients, we rewrite Eq. (3) as

$$\frac{\partial L_{CE}}{\partial x_i} = \sum_{j=1}^{C} \ell_{i,j} \cdot \vec{T}(x_i), \quad \frac{\partial L_{CE}}{\partial W_j} = \ell_{i,j} \cdot \vec{T}(W_j),$$

where the direction factors $\vec{T}(x_i)$ and $\vec{T}(W_j)$ are defined as

$$\vec{T}(x_i) = \frac{1}{\|x_i\|}\left(\hat{W}_j - \cos\theta_{i,j} \cdot \hat{x}_i\right), \quad \vec{T}(W_j) = \frac{1}{\|W_j\|}\left(\hat{x}_i - \cos\theta_{i,j} \cdot \hat{W}_j\right),$$

where $\hat{x}_i$ and $\hat{W}_j$ are the unit vectors of $x_i$ and $W_j$, respectively, and $\cos\theta_{i,j}$ is the cosine distance between feature $x_i$ and class weight $W_j$. The direction factors will not be changed because they are the fastest changing directions, as specified before. The length factor $\ell_{i,j}$ is defined as

$$\ell_{i,j} = \left(P_{i,j} - \mathbb{1}(y_i = j)\right) \cdot \frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}}.$$

The length factor depends on the probability $P_{i,j}$ and the gradient $\frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}}$, which are exactly what we aim to reform.

Since we expect the new length factor to be hyperparameter-free, the cosine logit should not involve hyperparameters like $s$ or $m$. Thus a constant $\frac{\partial f_{i,j}}{\partial \cos\theta_{i,j}} = 1$ is an ideal choice.

For the probability $P_{i,j}$, because it is hard to set a reasonable mapping function between $\theta_{i,j}$ and $P_{i,j}$, we directly use $\cos\theta_{i,j}$ as an alternative to $P_{i,j}$ in the gradient length term. Firstly, they have the same theoretical range $[0, 1]$ when $\theta_{i,j} \in [0°, 90°]$. Secondly, unlike $P_{i,j}$, which is adversely influenced by hyperparameters and the number of classes, $\cos\theta_{i,j}$ contains none of these factors, so we do not need to select specific parameter settings to obtain an ideal correspondence between $\theta_{i,j}$ and the gradient length. Moreover, compared with $P_{i,j}$, $\cos\theta_{i,j}$ is a more natural supervision because cosine similarities are used in the testing phase of open-set face recognition systems while probabilities only apply to closed-set classification tasks. Therefore, our reformed gradient length factor can be defined as

$$\tilde{\ell}_{i,j} = \cos\theta_{i,j} - \mathbb{1}(y_i = j),$$

which is a function of $\theta_{i,j}$ only. The reformed gradients can then be defined as

$$\tilde{G}_{x_i} = \sum_{j=1}^{C} \tilde{\ell}_{i,j} \cdot \vec{T}(x_i), \quad \tilde{G}_{W_j} = \tilde{\ell}_{i,j} \cdot \vec{T}(W_j),$$

where $\mathbb{1}(\cdot)$ is the indicator function. The full formulation can be rewritten as

$$\tilde{G}_{x_i} = \sum_{j=1}^{C} \frac{\cos\theta_{i,j} - \mathbb{1}(y_i = j)}{\|x_i\|}\left(\hat{W}_j - \cos\theta_{i,j} \cdot \hat{x}_i\right).$$

The formulation of P2SGrad is not only succinct but also reasonable. When $j = y_i$, the proposed gradient length $\left|\tilde{\ell}_{i,y_i}\right| = 1 - \cos\theta_{i,y_i}$ and $\theta_{i,y_i}$ are positively correlated; when $j \neq y_i$, the length $\cos\theta_{i,j}$ and $\theta_{i,j}$ are negatively correlated. More importantly, the gradient length in P2SGrad only depends on $\theta_{i,j}$ and is thus consistent with the testing metric of face recognition systems.
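A minimal NumPy sketch of the P2SGrad gradient w.r.t. the feature follows (the function name is ours; a practical implementation would be batched inside a deep learning framework and would also produce the class-weight gradients):

```python
import numpy as np

def p2sgrad_x(x, W, y):
    """P2SGrad gradient w.r.t. feature x: the softmax probability P_j is
    replaced by cos(theta_j); no scale or margin hyperparameter appears."""
    x_hat = x / np.linalg.norm(x)
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos_theta = W_hat @ x_hat                  # cos(theta_j), shape (C,)
    length = cos_theta.copy()
    length[y] -= 1.0                           # cos(theta_j) - 1(j == y)
    # direction factors d cos(theta_j) / d x, one row per class
    dirs = (W_hat - np.outer(cos_theta, x_hat)) / np.linalg.norm(x)
    return length @ dirs
```

Because every direction factor is orthogonal to $x_i$, the resulting gradient rotates the feature rather than rescaling it, and it vanishes exactly when $x_i$ is aligned with its class weight and orthogonal to all others.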

5 Experiments

In this section, we conduct a series of experiments to evaluate the proposed P2SGrad. We first verify the advantages of P2SGrad in some exploratory experiments by testing the model's performance on LFW [7]. Then we evaluate P2SGrad on the MegaFace [8] Challenge and IJB-C 1:1 verification [16] with the same training configuration.

5.1 Exploratory Experiments

Preprocessing and training setting. We use CASIA-WebFace [32] as the training data and ResNet-50 as the backbone network architecture. The WebFace [32] dataset is cleaned and contains about k facial images. RSA [15] is adopted to extract facial areas, and the faces are then aligned using a similarity transformation. All images are resized to . We also normalize pixel values by subtracting and then dividing by . For all exploratory experiments, the mini-batch size is in every iteration.

Figure 8: Curves of $\theta_{i,y_i}$ and gradient lengths w.r.t. iteration. Gradient lengths in existing cosine-based softmax losses (top-left, top-right, bottom-left) rapidly decrease to nearly zero, while the gradient length produced by P2SGrad (bottom-right) matches the angle $\theta_{i,y_i}$ between $x_i$ and its ground-truth class weight $W_{y_i}$. Best viewed in color.

The change of gradient length and $\theta_{i,y_i}$ w.r.t. iteration. Since P2SGrad aims to set up a reasonable mapping from $\theta_{i,y_i}$ to the length of gradients, it is necessary to visualize this mapping. To demonstrate the advantage of P2SGrad, we plot the mapping curves of several cosine-based softmax losses in Fig. 8. The figure clearly shows that P2SGrad produces more reasonable gradient lengths as $\theta_{i,y_i}$ changes.

Init. LR Method
NormFace CosFace ArcFace P2SGrad
Table 1: The sensitiveness of initial learning rates. This table shows whether our P2SGrad and these cosine-based softmax loss are trainable under different initial learning rates.

Robustness to initial learning rates. An important problem of margin-based losses is that they are difficult to train with large learning rates. The implementations of L-softmax [12] and A-softmax [11] use extra hyperparameters to adjust the margin so that the models are trainable. Thus a small initial learning rate is important for properly training angular-margin-based softmax losses. In contrast, as shown in Table 1, our proposed P2SGrad is stable under large learning rates.

Figure 9: The change of average w.r.t. iteration number. represents the angle between and the weight vector of its ground truth class . Curves by the proposed P2SGrad, -softmax loss [20], CosFace [29] and ArcFace [3] are shown.
Method Num. of Iteration
k k k
-softmax [20]
CosFace [29]
ArcFace [3]
P2SGrad 91.25 97.38 99.82
Table 2: Convergence rates of P2SGrad and compared losses. With the same number of iterations, P2SGrad leads to the best performance.
Method Size of MegaFace Distractor
-softmax [20]
CosFace [29]
ArcFace [3]
P2SGrad 99.86% 99.70% 99.52% 98.92% 98.35% 97.25%
Table 3: Recognition accuracy on MegaFace. Inception-ResNet [24] models are trained with the different compared softmax losses on the same cleaned WebFace [32] and MS1M [4] training data.
Method True Acceptance Rate @ False Acceptance Rate
VggFace [18] -

Crystal Loss [19]

-softmax [20]

CosFace [29]

ArcFace [3]

97.79% 95.58% 92.25% 87.84% 82.44% 73.16%
Table 4: TARs by different compared softmax losses on the IJB-C 1:1 verification task. The same training data (WebFace [32] and MS1M [4]) and Inception-ResNet [24] networks are used. Results of VggFace [18] and Crystal Loss [19] are from [19].

Convergence rate. The convergence rate is important for evaluating optimization methods. We evaluated the performance of models trained with several cosine-based softmax losses and with our P2SGrad method at different training stages on the Labeled Faces in the Wild (LFW) dataset. LFW is an academic test set for unrestricted face verification. Its testing protocol contains about images of about identities; there are positive matches and the same number of negative matches. Table 2 shows the results under the same training configuration, while Fig. 9 shows that the average $\theta_{i,y_i}$ decreases more rapidly with P2SGrad than with the other losses. These results reveal that our proposed P2SGrad optimizes neural networks much faster.

5.2 Evaluation on MegaFace

Preprocessing and training setting. Besides the mentioned WebFace [32] dataset, we add another public training dataset, MS1M [4], which contains about M cleaned and aligned images. Here we use Inception-ResNet [5, 24] with a batch size of for training.

Evaluation results. The MegaFace 1 million Challenge [8] is a public identification benchmark to test the performance of facial identification algorithms. The distractor set in MegaFace contains about images. Here we follow the cleaned testing protocol in [3]. The results of P2SGrad on the MegaFace dataset are shown in Table 3. P2SGrad exceeds the other compared cosine-based losses on the MegaFace 1 million challenge for every distractor size.

5.3 Evaluation on IJBC 1:1 verification

Preprocessing and training setting. The same as in Sec. 5.2.

Evaluation results. The IJB-C dataset [16] contains about identities with a total of still facial images and unconstrained video frames. The IJB-C testing protocols are designed to test detection, identification, verification and clustering of faces. In the 1:1 verification protocol, there are positive matches and negative matches. We therefore test True Acceptance Rates (TARs) at very strict False Acceptance Rates (FARs). Table 4 shows that P2SGrad surpasses all other cosine-based losses.

6 Conclusion

We comprehensively discussed the limitations of the forward and backward processes in training deep models for face recognition. To deal with these limitations, we proposed a simple but effective gradient method, P2SGrad, which is hyperparameter-free and leads to better optimization results. Unlike previous methods which focused on loss functions, we improve deep network training with carefully designed gradients. Extensive experiments validate the robustness and fast convergence of the proposed method. Moreover, experimental results show that P2SGrad achieves superior performance over state-of-the-art methods on several challenging face recognition benchmarks.

Acknowledgements. This work is supported in part by SenseTime Group Limited, in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14213616, CUHK14208417, CUHK14239816, in part by CUHK Direct Grant, and in part by National Natural Science Foundation of China (61472410) and the Joint Lab of CAS-HK.