The success of Convolutional Neural Networks (CNNs) on face recognition can be mainly credited to three factors: large-scale training data, network architectures, and loss functions. Designing effective loss functions that enhance discriminative power has recently become pivotal for training deep face CNNs.
Current state-of-the-art (SOTA) face recognition methods mainly adopt softmax-based classification loss. The features learned with the original softmax are not sufficiently discriminative for the practical face recognition problem [liu2017sphereface], in which the testing identities are usually disjoint from the training set, so several margin-based variants have been proposed to enhance the features’ discriminative power. For example, explicit-margin methods, i.e., CosFace [wang2018cosface], SphereFace [liu2017sphereface], and ArcFace [deng2018arcface], and implicit-margin methods, i.e., AdaCos [zhang2019adacos], supplement the original softmax function to enforce greater intra-class compactness and inter-class discrepancy, which results in more discriminative features. However, these margin-based loss functions do not explicitly emphasize each sample according to its importance.
As demonstrated in [chen2019angular, huang2020distribution], hard sample mining is also a critical step to further improve the final accuracy. As a commonly used hard sample mining method, OHEM [shrivastava2016training] focuses on the large-loss samples in one mini-batch, in which the percentage of hard samples is decided empirically and easy samples are completely discarded. Focal loss [lin2017focal] is a soft mining variant that reshapes the loss function into an elaborately designed form: hard samples are emphasized by reducing the weights of easy samples, and the two hyper-parameters that decide the weight of each sample must be tuned with considerable effort. Recently, Triplet loss [schroff2015facenet] and MV-Arc-Softmax [wang2018support] have been motivated by integrating both margin and mining into one framework. Triplet loss adopts a semi-hard mining strategy to obtain semi-hard triplets and enlarges the margin between triplet samples. MV-Arc-Softmax [wang2018support] clearly defines hard samples as misclassified samples and emphasizes them by increasing the weights of their negative cosine similarities with a preset constant. In a nutshell, mining-based loss functions explicitly emphasize the effects of semi-hard or hard samples [schroff2015facenet].
However, the training strategies of both margin- and mining-based loss functions have drawbacks. The general softmax-based loss function can be formulated as follows:

$$L = -\log \frac{e^{sT(\cos\theta_{y_i})}}{e^{sT(\cos\theta_{y_i})} + \sum_{j=1, j \neq y_i}^{n} e^{sN(t, \cos\theta_j)}},$$

where $T(\cos\theta_{y_i})$ and $N(t, \cos\theta_j) = I(t, \cos\theta_j)\cos\theta_j + c$ are the functions that define the positive and negative cosine similarities, respectively, and $s$ is the feature scale. $I(t, \cos\theta_j)$ denotes the modulation coefficients of negative cosine similarities and $c$ is a constant. For margin-based methods, the mining strategy is ignored and thus the difficultness of each sample is not exploited, which may lead to convergence issues when using a large margin on small backbones, e.g., MobileFaceNet [chen2018mobilefacenets]. As shown in Fig. 1, the modulation coefficients for the negative cosine similarities are fixed at a constant of $1$ in ArcFace for all samples during the entire training process. For mining-based methods, over-emphasizing hard samples in the early training stage may hinder the model from converging. MV-Arc-Softmax emphasizes hard samples by modulating the negative cosine similarity as $t\cos\theta_j + t - 1$, i.e., $I(t, \cos\theta_j) = t$, where $t$ is a manually defined constant. As the MV-Arc-Softmax authors note, $t$ plays a key role in model convergence, and a slightly larger value may make the model difficult to converge. Thus $t$ needs to be carefully tuned.
In this work, we propose a novel adaptive curriculum learning loss, termed CurricularFace, to achieve a novel training strategy for deep face recognition. Motivated by the nature of human learning, in which easy cases are learned first and hard ones come later [bengio2009curriculum], our CurricularFace incorporates the idea of Curriculum Learning (CL) into face recognition in an adaptive manner, which differs from traditional CL in two aspects. First, the curriculum construction is adaptive. In traditional CL, the samples are ordered by their difficultness, which is often defined by a prior and then fixed to establish the curriculum. In CurricularFace, the samples are randomly selected in each mini-batch, while the curriculum is established adaptively by mining the hard samples online, which reflects the diversity of samples with different importance. Second, the importance of hard samples is adaptive. On one hand, the relative importance between easy and hard samples is dynamic and can be adjusted in different training stages. On the other hand, the importance of each hard sample in the current mini-batch depends on its own difficultness.
Specifically, the misclassified samples in a mini-batch are chosen as hard samples and weighted by adjusting the modulation coefficients $I(t, \cos\theta_j)$ of the cosine similarities between the sample and the non-ground-truth class center vectors, i.e., the negative cosine similarities $N(t, \cos\theta_j)$. To achieve adaptive curriculum learning throughout training, we design a novel coefficient function $I(\cdot)$ that is determined by two factors: 1) the adaptively estimated parameter $t$, which uses a moving average of the positive cosine similarities between samples and their ground-truth class centers to remove the burden of manual tuning; and 2) the angle $\theta_j$, which defines the difficultness of hard samples to achieve adaptive assignment. To sum up, the contributions of this work are:
We propose an adaptive curriculum learning loss for face recognition, which automatically emphasizes easy samples first and hard samples later. To the best of our knowledge, it is the first work to introduce the idea of adaptive curriculum learning for face recognition.
We design a novel modulation coefficient function to achieve adaptive curriculum learning during training, which connects positive and negative cosine similarity simultaneously without the need of manually tuning any additional hyper-parameter.
We conduct extensive experiments on popular facial benchmarks, which demonstrate the superiority of our CurricularFace over the SOTA competitors.
2 Related Work
Margin-based loss function.
Loss design is pivotal for large-scale face recognition. Current SOTA deep face recognition methods mostly adopt softmax-based classification loss [taigman2014deepface]. Since the features learned with the original softmax loss are not guaranteed to be discriminative enough for the practical face recognition problem [liu2017sphereface], margin-based losses [liu2016large, liu2017sphereface, deng2018arcface] have been proposed. Although margin-based loss functions have been verified to perform well, they do not take the difficultness of each sample into consideration. In contrast, our CurricularFace emphasizes easy samples first and hard samples later, which is more reasonable and effective.
Mining-based loss function.
Though some mining-based loss functions, such as Focal loss [lin2017focal] and Online Hard Example Mining (OHEM) [shrivastava2016training], are prevalent in the field of object detection, they are rarely used in face recognition. OHEM focuses on the large-loss samples in one mini-batch, in which the percentage of hard samples is determined empirically and easy samples are completely discarded. Focal loss emphasizes hard samples by reducing the weights of easy samples, in which two hyper-parameters must be tuned manually. The recent work MV-Arc-Softmax [wang2018support] fuses the motivations of both margin and mining into one framework for deep face recognition. It defines hard samples as misclassified samples and enlarges the weights of hard samples with a preset constant. Our method differs from MV-Arc-Softmax in three aspects: 1) We do not always emphasize hard samples, especially in the early training stages. 2) We assign different weights to hard samples according to their corresponding difficultness. 3) We adaptively estimate the additional hyper-parameter $t$ without manual tuning.
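The OHEM selection rule described above (keep only the largest-loss samples in a mini-batch and discard the rest) can be sketched as follows. This is a minimal illustration, not the implementation of [shrivastava2016training]; the function names and the `keep_ratio` parameter are assumptions introduced here to stand for the empirically chosen percentage of hard samples.

```python
def ohem_select(losses, keep_ratio):
    """Return the indices of the largest-loss samples in a mini-batch.

    keep_ratio is a hypothetical name for the empirically chosen
    fraction of samples treated as hard; all others are discarded.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(round(keep_ratio * len(losses))))
    # Rank sample indices by loss, hardest first.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:n_keep])


def ohem_loss(losses, keep_ratio):
    """Average the loss over the selected hard samples only."""
    kept = ohem_select(losses, keep_ratio)
    return sum(losses[i] for i in kept) / len(kept)


if __name__ == "__main__":
    batch_losses = [0.1, 2.0, 0.5, 1.5]
    print(ohem_select(batch_losses, 0.5))
    print(ohem_loss(batch_losses, 0.5))
```

Because easy samples receive zero weight here, the choice of `keep_ratio` directly controls how much of each batch contributes to the gradient, which is the tuning burden the text refers to.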
Learning from easier samples first and harder samples later is a common strategy in Curriculum Learning (CL) [bengio2009curriculum, zhou2018scheduled]. The key problem in CL is to define the difficultness of each sample. For example, [basu2013teaching] takes the negative distance to the boundary as the indicator of easiness in classification. However, the ad-hoc curriculum design in CL turns out to be difficult to implement in different problems. To alleviate this issue, [kumar2010selfpace] designs a new formulation, called Self-Paced Learning (SPL), where examples with lower losses are considered to be easier and are emphasized during training. The key differences between our CurricularFace and SPL are: 1) Our method focuses on easier samples in the early training stage and emphasizes hard samples in the later stage. 2) Our method proposes a novel function for negative cosine similarities, which achieves not only adaptive assignment of modulation coefficients for different samples in the same training stage, but also an adaptive curriculum learning strategy across different stages.
3 The Proposed CurricularFace
3.1 Preliminary Knowledge on Loss Function
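The original softmax loss is formulated as follows:

$$L = -\log \frac{e^{W_{y_i}^{\top} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{\top} x_i + b_j}},$$

where $x_i \in \mathbb{R}^{d}$ denotes the deep feature of the $i$-th sample, which belongs to the $y_i$-th class, $W_j \in \mathbb{R}^{d}$ denotes the $j$-th column of the weight $W \in \mathbb{R}^{d \times n}$, and $b_j$ is the bias term. The class number and the embedding feature size are $n$ and $d$, respectively. In practice, the bias is usually set to $b_j = 0$ and the individual weight is set to $\|W_j\| = 1$ by normalization. The deep feature is also normalized and re-scaled to $s$. Thus, with $\theta_j$ denoting the angle between $W_j$ and $x_i$, the original softmax can be modified as follows:

$$L = -\log \frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}} + \sum_{j=1, j \neq y_i}^{n} e^{s\cos\theta_j}}.$$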
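As an illustration of the normalized softmax just described, the sketch below computes scaled cosine logits and the resulting cross-entropy in plain Python. It assumes zero bias, unit-norm class weights and features, and a scale `s`; the function names are hypothetical and chosen only for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine_logits(x, weights, s):
    """Return s * cos(theta_j) for a feature x and each class center W_j."""
    x_hat = l2_normalize(x)
    logits = []
    for w in weights:
        w_hat = l2_normalize(w)
        logits.append(s * sum(a * b for a, b in zip(x_hat, w_hat)))
    return logits


def softmax_cross_entropy(logits, label):
    """Numerically stable -log softmax(logits)[label]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


if __name__ == "__main__":
    centers = [[1.0, 0.0], [0.0, 1.0]]
    logits = cosine_logits([3.0, 4.0], centers, s=1.0)
    print(logits)
    print(softmax_cross_entropy(logits, label=1))
```

After normalization the logits depend only on the angles between the feature and the class centers, so rescaling either the feature or a class weight leaves the loss unchanged.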
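Since the features learned with the original softmax loss may not be discriminative enough for the practical face recognition problem, several variants have been proposed, which can be formulated in a general form:

$$L = -I(p(x_i)) \log \frac{e^{sT(\cos\theta_{y_i})}}{e^{sT(\cos\theta_{y_i})} + \sum_{j=1, j \neq y_i}^{n} e^{sN(t, \cos\theta_j)}},$$

where $p(x_i) = \frac{e^{sT(\cos\theta_{y_i})}}{e^{sT(\cos\theta_{y_i})} + \sum_{j=1, j \neq y_i}^{n} e^{sN(t, \cos\theta_j)}}$ is the predicted ground truth probability and $I(p(x_i))$ is an indicator function. $T(\cos\theta_{y_i})$ and $N(t, \cos\theta_j) = I(t, \cos\theta_j)\cos\theta_j + c$ are the functions that modulate the positive and negative cosine similarities, respectively, where $c$ is a constant and $I(t, \cos\theta_j)$ denotes the modulation coefficients of negative cosine similarities. In margin-based loss functions, e.g., ArcFace, $I(p(x_i)) = 1$, $T(\cos\theta_{y_i}) = \cos(\theta_{y_i} + m)$, and $N(t, \cos\theta_j) = \cos\theta_j$. ArcFace only modifies the positive cosine similarity of each sample to enhance the feature discrimination. As shown in Fig. 1, the modulation coefficients $I(\cdot)$ of each sample’s negative cosine similarities are fixed at $1$. The recent work MV-Arc-Softmax emphasizes hard samples by increasing $I(t, \cos\theta_j)$ for hard samples. That is, $I(p(x_i)) = 1$ and $N(t, \cos\theta_j)$ is formulated as follows:

$$N(t, \cos\theta_j) = \begin{cases} \cos\theta_j, & T(\cos\theta_{y_i}) - \cos\theta_j \geq 0 \\ t\cos\theta_j + t - 1, & T(\cos\theta_{y_i}) - \cos\theta_j < 0. \end{cases}$$

If a sample is defined to be easy, its negative cosine similarity is kept the same as the original one, $\cos\theta_j$; if it is a hard sample, its negative cosine similarity becomes $t\cos\theta_j + t - 1$. That is, as shown in Fig. 1, $I(\cdot)$ is a constant determined by a preset hyper-parameter $t$. Meanwhile, since $t$ is always larger than $1$, $t\cos\theta_j + t - 1 \geq \cos\theta_j$ always holds true, which means the model always focuses on hard samples, even in the early training stage. However, the parameter $t$ is sensitive: a large pre-defined value may lead to convergence issues.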
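To make the contrast between the two fixed schemes concrete, the sketch below compares the negative cosine similarity under ArcFace (left unchanged) and under MV-Arc-Softmax. It assumes the MV-Arc-Softmax hard-sample form `t * cos + t - 1` and the ArcFace positive term `cos(theta + m)`; the values of `m` and `t` in the usage lines are hypothetical, and the function names are introduced only for this example.

```python
import math


def arcface_positive(cos_pos, m):
    """T(cos theta_y) = cos(theta_y + m), the ArcFace positive term.

    Simplification: no special handling when theta_y + m exceeds pi.
    """
    theta = math.acos(max(-1.0, min(1.0, cos_pos)))
    return math.cos(theta + m)


def negative_arcface(cos_neg):
    """ArcFace leaves negatives untouched (modulation coefficient 1)."""
    return cos_neg


def negative_mv_arc(cos_pos, cos_neg, m, t):
    """Assumed MV-Arc-Softmax rule: re-weight misclassified negatives."""
    is_hard = arcface_positive(cos_pos, m) - cos_neg < 0
    if is_hard:
        return t * cos_neg + t - 1.0
    return cos_neg


if __name__ == "__main__":
    m, t = 0.5, 1.2  # hypothetical margin and preset constant
    for cos_neg in (0.3, 0.7):
        print(cos_neg, negative_arcface(cos_neg),
              negative_mv_arc(0.9, cos_neg, m, t))
```

With any `t` greater than 1, the hard-sample branch returns a value no smaller than the original cosine, which is why this scheme emphasizes hard samples from the very first iteration.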
3.2 Adaptive Curricular Learning Loss
Next, we present the details of our proposed adaptive curriculum learning loss, which is the first attempt to introduce adaptive curriculum learning into deep face recognition. The formulation of our loss function is also contained in the general form, where $I(p(x_i)) = 1$, and the positive and negative cosine similarity functions are defined as follows:

$$T(\cos\theta_{y_i}) = \cos(\theta_{y_i} + m),$$

$$N(t, \cos\theta_j) = \begin{cases} \cos\theta_j, & T(\cos\theta_{y_i}) - \cos\theta_j \geq 0 \\ \cos\theta_j (t + \cos\theta_j), & T(\cos\theta_{y_i}) - \cos\theta_j < 0. \end{cases}$$

It should be noted that the positive cosine similarity can adopt any margin-based loss function; here we adopt ArcFace as an example. As shown in Fig. 1, the modulation coefficient $I(t, \cos\theta_j) = t + \cos\theta_j$ of a hard sample's negative cosine similarity depends on both the value of $t$ and $\theta_j$. In the early training stage, learning from easy samples is beneficial to model convergence. Thus, $t$ should be close to zero and $I(\cdot)$ is smaller than $1$. Therefore, the weights of hard samples are reduced and easy samples are emphasized relatively. As training goes on, the model gradually focuses on the hard samples, i.e., the value of $t$ increases and $I(\cdot)$ becomes larger than $1$. Thus, the hard samples are emphasized with larger weights. Moreover, within the same training stage, $I(\cdot)$ is monotonically decreasing with $\theta_j$, so that a harder sample is assigned a larger coefficient according to its difficultness. The value of the parameter $t$ is automatically estimated in our CurricularFace; otherwise it would require considerable effort for manual tuning.
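The behaviour described above can be checked numerically with a small sketch. It assumes the hard-sample coefficient has the form `t + cos(theta_j)`, which matches the stated properties (below 1 when `t` is near zero, growing with `t`, and decreasing in the angle); the function names and the sample values are hypothetical.

```python
def curricular_coefficient(t, cos_neg):
    """Assumed modulation coefficient for a hard negative: t + cos(theta_j)."""
    return t + cos_neg


def negative_curricular(pos_term, cos_neg, t):
    """Negative cosine similarity after modulation.

    pos_term is the margin-adjusted positive similarity T(cos theta_y).
    Easy negatives (pos_term >= cos_neg) are returned unchanged.
    """
    if pos_term - cos_neg >= 0:
        return cos_neg
    return curricular_coefficient(t, cos_neg) * cos_neg


if __name__ == "__main__":
    for stage, t in (("early", 0.0), ("late", 0.8)):
        for cos_neg in (0.6, 0.8):
            print(stage, cos_neg, curricular_coefficient(t, cos_neg))
```

In the early stage both coefficients are below 1, so hard samples are down-weighted; in the late stage both exceed 1, and at either stage the harder sample (larger cosine) receives the larger coefficient.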
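Next, we show that our CurricularFace can be easily optimized by conventional stochastic gradient descent. Assuming $x_i$ denotes the deep feature of the $i$-th sample, which belongs to the $y_i$-th class, the input of the proposed function is the logit $f_j$, where $j$ denotes the $j$-th class.

In the forward process, when $j = y_i$, it is the same as ArcFace, i.e., $f_j = sT(\cos\theta_{y_i})$, $T(\cos\theta_{y_i}) = \cos(\theta_{y_i} + m)$. When $j \neq y_i$, there are two cases: if $x_i$ is an easy sample, it is the same as the original softmax, i.e., $f_j = s\cos\theta_j$. Otherwise, it is modulated as $f_j = sN(t, \cos\theta_j)$, where $N(t, \cos\theta_j) = (t + \cos\theta_j)\cos\theta_j$. In the backward propagation process, the gradients w.r.t. $x_i$ and $W_j$ can also be divided into three cases and computed as follows:

$$\frac{\partial L}{\partial x_i} = \begin{cases} \frac{\partial L}{\partial f_j}\left(s\frac{\sin(\theta_j + m)}{\sin\theta_j}\right) W_j, & j = y_i \\ \frac{\partial L}{\partial f_j}\, s\, W_j, & j \neq y_i, \text{ easy} \\ \frac{\partial L}{\partial f_j}\, s\,(2\cos\theta_j + t)\, W_j, & j \neq y_i, \text{ hard} \end{cases}$$

$$\frac{\partial L}{\partial W_j} = \begin{cases} \frac{\partial L}{\partial f_j}\left(s\frac{\sin(\theta_j + m)}{\sin\theta_j}\right) x_i, & j = y_i \\ \frac{\partial L}{\partial f_j}\, s\, x_i, & j \neq y_i, \text{ easy} \\ \frac{\partial L}{\partial f_j}\, s\,(2\cos\theta_j + t)\, x_i, & j \neq y_i, \text{ hard} \end{cases}$$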
Based on the above formulations, the gradient modulation coefficients of hard samples are determined by $M(\cdot) = 2\cos\theta_j + t$, which consists of two parts: the negative cosine similarity $\cos\theta_j$ and the value of $t$. As shown in Fig. 2, on the one hand, the coefficients increase with the adaptive estimation of $t$ (described in the next subsection) to emphasize hard samples. On the other hand, these coefficients are assigned different importance according to their corresponding difficultness ($\cos\theta_j$). Therefore, the values of $M$ in Fig. 2 are plotted as a range at each training iteration. In contrast, the coefficients are fixed at $1$ and at a constant $t$ in ArcFace and MV-Arc-Softmax, respectively.
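The gradient coefficients can be sanity-checked with finite differences. The sketch below assumes the hard-sample negative term is `cos * (t + cos)` and the positive term is `cos(theta + m)`, and differentiates each with respect to the cosine similarity; the expected analytic forms are `2 * cos + t` and `sin(theta + m) / sin(theta)`. All function names are hypothetical.

```python
import math


def hard_negative(cos_neg, t):
    """Assumed modulated negative similarity for a hard sample."""
    return cos_neg * (t + cos_neg)


def hard_gradient_coefficient(cos_neg, t):
    """Analytic derivative of hard_negative w.r.t. cos_neg."""
    return 2.0 * cos_neg + t


def positive_term(cos_pos, m):
    """ArcFace-style positive similarity cos(theta + m)."""
    return math.cos(math.acos(cos_pos) + m)


def positive_gradient_coefficient(cos_pos, m):
    """Analytic derivative of positive_term w.r.t. cos_pos."""
    theta = math.acos(cos_pos)
    return math.sin(theta + m) / math.sin(theta)


def numeric_derivative(f, x, eps=1e-6):
    """Central finite difference of a scalar function."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


if __name__ == "__main__":
    t, m = 0.5, 0.5  # hypothetical values
    print(hard_gradient_coefficient(0.7, t),
          numeric_derivative(lambda c: hard_negative(c, t), 0.7))
    print(positive_gradient_coefficient(0.8, m),
          numeric_derivative(lambda c: positive_term(c, m), 0.8))
```

The check covers only the derivative of each modulated similarity with respect to its cosine; the scale factor and the softmax term are common multipliers and are left out.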
Adaptive Estimation of $t$.
It is critical to determine appropriate values of $t$ in different training stages. Ideally, the value of $t$ indicates the model's training stage. We empirically find that the average of the positive cosine similarities is a good indicator. However, methods based on mini-batch statistics usually face an issue: when many extreme data points are sampled in one mini-batch, the statistics can be very noisy and the estimation will be unstable. Exponential Moving Average (EMA) is a common solution to this issue [li2019gradient]. Specifically, let $r^{(k)}$ be the average of the positive cosine similarities of the $k$-th batch, formulated as $r^{(k)} = \operatorname{mean}_i \cos\theta_{y_i}$; we then have:

$$t^{(k)} = \alpha r^{(k)} + (1 - \alpha) t^{(k-1)},$$

where $t^{(0)} = 0$ and $\alpha$ is the momentum parameter. With the EMA, we avoid hyper-parameter tuning and make the modulation coefficients $I(\cdot)$ of hard-sample negative cosine similarities adaptive to the current training stage. To sum up, the loss function of our CurricularFace is formulated as follows:

$$L = -\log \frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j=1, j \neq y_i}^{n} e^{sN(t^{(k)}, \cos\theta_j)}},$$

where $N(t^{(k)}, \cos\theta_j) = \cos\theta_j$ for easy samples and $N(t^{(k)}, \cos\theta_j) = \cos\theta_j (t^{(k)} + \cos\theta_j)$ for hard samples.
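A minimal sketch of the moving-average update and the resulting per-sample loss is given below, in plain Python for a single sample. It assumes the EMA takes the form `alpha * r + (1 - alpha) * t_prev`, that hard negatives are modulated as `cos * (t + cos)`, and that the positive term is `cos(theta + m)`; the function names and all numeric values in the usage lines are hypothetical, not the settings used in the experiments.

```python
import math


def update_t(t_prev, batch_pos_cosines, alpha):
    """EMA of the batch-mean positive cosine similarity."""
    r = sum(batch_pos_cosines) / len(batch_pos_cosines)
    return alpha * r + (1.0 - alpha) * t_prev


def curricular_loss(cosines, label, t, s, m):
    """Loss of one sample given its cosine similarity to every class center."""
    theta_y = math.acos(max(-1.0, min(1.0, cosines[label])))
    pos = math.cos(theta_y + m)  # margin-adjusted positive similarity
    logits = []
    for j, c in enumerate(cosines):
        if j == label:
            logits.append(s * pos)
        elif pos - c >= 0:  # easy negative: unchanged
            logits.append(s * c)
        else:  # hard negative: modulated by (t + cos)
            logits.append(s * c * (t + c))
    mx = max(logits)
    log_sum = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return log_sum - logits[label]


if __name__ == "__main__":
    t = 0.0
    for batch in ([0.1, 0.2], [0.3, 0.5], [0.6, 0.8]):
        t = update_t(t, batch, alpha=0.5)
        print(t, curricular_loss([0.9, 0.7, 0.1], 0, t, s=10.0, m=0.5))
```

As `t` grows with the batch statistics, the same hard negative contributes a larger logit and hence a larger loss, while a sample with no hard negatives is unaffected by `t`.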
Fig. 3 illustrates how the loss changes from ArcFace to our CurricularFace during training. Here are some observations: 1) As we expected, hard samples (B and C) are suppressed in the early training stage but emphasized later. 2) The ratio is monotonically increasing with $\cos\theta_j$, since the larger $\cos\theta_j$ is, the harder the sample is. 3) The positive cosine similarity of a perceptually good image is often large. However, during the early training stage, the negative cosine similarities of such an image (A) may also be large, so it could be classified as a hard sample.
3.3 Discussions with SOTA Loss Functions
Comparison with ArcFace and MV-Arc-Softmax.
We first discuss the difference between our CurricularFace and the two competitors, ArcFace and MV-Arc-Softmax, from the perspective of the decision boundary in Tab. 1. ArcFace introduces a margin function from the perspective of the positive cosine similarity. As shown in Fig. 4, its decision condition changes from $\cos\theta_{y_i} = \cos\theta_j$ (i.e., blue line) to $\cos(\theta_{y_i} + m) = \cos\theta_j$ (red line) for each sample. MV-Arc-Softmax introduces an additional margin from the perspective of the negative cosine similarity for hard samples, and the decision boundary becomes $\cos(\theta_{y_i} + m) = t\cos\theta_j + t - 1$ (green line). In contrast, we adaptively adjust the weights of hard samples in different training stages. The decision condition becomes $\cos(\theta_{y_i} + m) = (t + \cos\theta_j)\cos\theta_j$ (purple line). During training, the decision boundary for hard samples changes from one purple line (early stage) to another (later stage), which emphasizes easy samples first and hard samples later.
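The shift of the decision condition can be illustrated numerically. The sketch below evaluates, for one positive/negative cosine pair, how far the sample lies on the correct side of each boundary (a positive value means the condition is met). It assumes the boundary forms `cos(theta + m) = cos`, `cos(theta + m) = t * cos + t - 1`, and `cos(theta + m) = (t + cos) * cos`; the function name and all numeric values are hypothetical.

```python
import math


def decision_slack(cos_pos, cos_neg, m, t_mv, t_cur):
    """Left side minus right side of each decision condition."""
    pos_margin = math.cos(math.acos(cos_pos) + m)
    return {
        "softmax": cos_pos - cos_neg,
        "arcface": pos_margin - cos_neg,
        "mv_arc": pos_margin - (t_mv * cos_neg + t_mv - 1.0),
        "curricular": pos_margin - (t_cur + cos_neg) * cos_neg,
    }


if __name__ == "__main__":
    # A sample that is hard under the ArcFace condition.
    early = decision_slack(0.9, 0.7, m=0.5, t_mv=1.2, t_cur=0.0)
    late = decision_slack(0.9, 0.7, m=0.5, t_mv=1.2, t_cur=1.0)
    for name in ("softmax", "arcface", "mv_arc"):
        print(name, early[name])
    print("curricular early", early["curricular"])
    print("curricular late", late["curricular"])
```

For this hard sample the curricular condition is looser than ArcFace when `t` is near zero and stricter once `t` has grown, which is the early-to-late movement of the purple line.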
Comparison with Focal Loss.
Focal loss is formulated as $L = -\alpha (1 - p(x_i))^{\gamma} \log p(x_i)$, where $\alpha$ and $\gamma$ are modulating factors to be tuned manually. The definition of hard samples in Focal loss is ambiguous, since it focuses on relatively hard samples by reducing the weight of easier samples during the entire training process. In contrast, the definition of hard samples in our CurricularFace is clearer, i.e., misclassified samples. Meanwhile, the weights of hard samples are adaptively determined in different training stages.
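For comparison, Focal loss in its standard form from [lin2017focal] can be sketched in a few lines. The symbols `alpha` and `gamma` follow that paper's notation and the values used below are hypothetical; the point is only to show how the down-weighting of easy samples depends on two manually chosen factors.

```python
import math


def focal_loss(p, alpha, gamma):
    """Focal loss for a sample whose ground-truth probability is p."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)


def cross_entropy(p):
    """Plain cross-entropy, i.e. focal loss with alpha=1 and gamma=0."""
    return -math.log(p)


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):
        print(p, cross_entropy(p), focal_loss(p, alpha=1.0, gamma=2.0))
```

Every sample keeps a non-zero weight that varies smoothly with `p`, so there is no explicit threshold separating easy from hard samples, unlike the misclassification criterion used in CurricularFace.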
4.1 Implementation Details
We separately employ CASIA-WebFace [Yi2014learning] and refined MS1MV2 [deng2018arcface] as our training data for fair comparisons with other methods. CASIA-WebFace contains about M of individuals, and MS1MV2 contains about M images of K individuals. We extensively test our method on several popular benchmarks, including LFW [LFWTech], CFP-FP [sengupta2016frontal], CPLFW [CPLFWTech], AgeDB [moschoglou2017agedb], CALFW [zheng2017crossage], IJB-B [whitelam2017iarpa], IJB-C [maze2018iarpa], and MegaFace [kemelmacher2016megaface].
We follow [deng2018arcface] to crop the faces with five landmarks [zhang2016mtcnn, tai2019towards]. For the embedding network, we adopt ResNet50 and ResNet100 as in [deng2018arcface]. Our framework is implemented in PyTorch [paszke2017automatic]. We train models on NVIDIA Tesla P40 GPU with batch size . The models are trained with SGD algorithm, with momentum and weight decay . On CASIA-WebFace, the learning rate starts from and is divided by at , , epochs. The training process is finished at epochs. On MS1MV2, we divide the learning rate at , , epochs and finish at epochs. We follow the common setting as [deng2018arcface] to set scale and margin .
4.2 Ablation study
Effects on Fixed vs. Adaptive Parameter $t$.
We first investigate the effect of adaptive estimation of $t$. We choose four fixed values between $0$ and $1$ for comparison. Specifically, $t = 0$ means the modulation coefficient $I(\cdot)$ of each hard sample’s negative cosine similarity is always reduced based on its difficultness. In contrast, $t = 1$ means the hard samples are always emphasized. The other two fixed values lie between these two cases. Tab. 2 shows that it is more effective to learn from easier samples first and hard samples later based on our adaptively estimated parameter $t$.
Effects on Different Statistics for Estimating $t$.
We now investigate the effects of several other statistics for estimating $t$ in our loss, i.e., the mode of the positive cosine similarities in a mini-batch, or the mean of the predicted ground truth probability. As Tab. 3 shows: 1) The mean of the positive cosine similarities is better than the mode. 2) The positive cosine similarity is more accurate than the predicted ground truth probability in indicating the training stage.
Robustness on Training Convergence.
As claimed in [li2019airface], ArcFace exhibits the divergence issue when using small backbones like MobileFaceNet. As a result, softmax loss must be incorporated for pre-training. To illustrate the robustness of our loss function on convergence issue with small backbones, we use the MobileFaceNet as the network architecture and train it on CASIA-WebFace. As shown in Fig. 5, when the margin is set to , the model trained with our loss achieves accuracy on LFW, while the model trained with ArcFace does not converge and the loss is NAN at about -th step. When the margin is set to , both losses can converge, but our loss achieves better performance ( vs. ). Comparing the yellow and red curves, since the losses of hard samples are reduced in early training stages, our loss converges much faster in the beginning, leading to lower loss than ArcFace. Later on, the value of our loss is slightly larger than ArcFace, because we emphasize the hard samples in later stages. The results illustrate that learning from easy samples first and hard samples later is beneficial to model convergence.
4.3 Comparisons with SOTA Methods
Results on LFW, CFP-FP, CPLFW, AgeDB and CALFW.
Next, we train our CurricularFace on MS1MV2 with ResNet, and compare with the SOTA competitors on various benchmarks, including LFW for unconstrained face verification, CFP-FP and CPLFW for large pose variations, and AgeDB and CALFW for age variations. As reported in Tab. 4, our CurricularFace achieves results comparable with the competitors on LFW, where the performance is nearly saturated. For both CFP-FP and CPLFW, our method shows superiority over the baselines, including general methods, e.g., [wen2016discriminative], [cao2018vggface2], and cross-pose methods, e.g., [tran2017disentangled], [peng2017rec], [cao2018pose] and [deng2018uvgan]. As a recent face recognition method, MV-Arc-Softmax achieves better performance than ArcFace, but is still worse than our CurricularFace. Finally, for AgeDB and CALFW, as Tab. 4 shows, our CurricularFace again achieves better performance than all of the other SOTA methods.
Results on IJB-B and IJB-C.
The IJB-B dataset contains subjects with K still images and K frames from videos. In the 1:1 verification, there are positive matches and M negative matches. The IJB-C dataset is a further extension of IJB-B, which contains about identities with a total of images and unconstrained video frames. In the 1:1 verification, there are positive matches and negative matches. On the IJB-B and IJB-C datasets, we employ MS1MV2 and the ResNet for a fair comparison with recent methods. We follow the testing protocol in ArcFace and take the average of the image features as the corresponding template representation without bells and whistles. Note that our method is not proposed for the set-based face recognition task, and does not adopt any specific strategies for set-based face recognition. The experiments on these two datasets only serve to show that our loss can obtain more discriminative features than baselines like ArcFace, which are also generic methods for face recognition. Tab. 5 reports the performance of different methods, e.g., Multicolumn [xie2018multicolumn], DCN [xie2018comparator], AdaCos [zhang2019adacos], P2SGrad [zhang2019p2sgrad], PFE [shi2019probabilistic] and MV-Arc-Softmax [wang2018support], on IJB-B and IJB-C 1:1 verification; our method again achieves the best performance. Fig. 6 shows the ROC curves of CurricularFace and ArcFace on IJB-B/C with the backbone ResNet100, where our method achieves better performance.
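The template representation mentioned above (the average of the image features belonging to one template) and the resulting verification score can be sketched as follows. This is a minimal illustration with hypothetical function names and toy two-dimensional features; the actual protocol details follow ArcFace and are not reproduced here.

```python
import math


def template_feature(image_features):
    """Average a list of equal-length feature vectors into one template."""
    count = len(image_features)
    dim = len(image_features[0])
    return [sum(f[k] for f in image_features) / count for k in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    template_a = template_feature([[1.0, 0.0], [0.0, 1.0]])
    template_b = template_feature([[0.9, 0.8], [1.1, 1.2]])
    print(template_a, cosine_similarity(template_a, template_b))
```

A verification decision would then compare this score against a threshold chosen for the desired false accept rate.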
Results on MegaFace.
Finally, we evaluate the performance on the MegaFace Challenge. The gallery set of MegaFace includes M images of K subjects, and the probe set includes K photos of unique individuals from FaceScrub. We report the testing results under two protocols (large or small training set). Here, we use CASIA-WebFace and MS1MV2 under the small protocol and the large protocol, respectively. In Tab. 6, our method achieves the best single-model identification and verification performance under both protocols, surpassing the recent strong competitors, e.g., CosFace, ArcFace, AdaCos, P2SGrad and PFE. We also report the results following the ArcFace testing protocol, which refines both the probe set and the gallery set. As shown in Fig. 8, our method still clearly outperforms the competitors and achieves the best performance on identification. Compared with ArcFace, our loss shows better performance under both identification and verification scenarios, as shown in Fig. 9. AdaptiveFace [liu2019adaptiveface] is another recent margin-based loss function for face recognition. We train our model with the same training data, MS1MV2, and the same backbone, ResNet50 [deng2018arcface], as AdaptiveFace for a fair comparison. The results in Tab. 6 demonstrate the superiority of our method.
The proposed method only brings small burden on training complexity, but has the same cost as the backbone model during inference. Specifically, compared with the conventional margin-based loss functions, our loss only additionally adjusts the negative cosine similarity of hard samples. Under the same environment and batchsize, ArcFace [deng2018arcface] costs s for each iteration on NVIDIA P40 GPUs, while ours costs s.
Discussion on Easy and Hard Samples During Training.
Finally, Fig. 7 shows the easy and hard samples classified by our method in different training stages. As we can see, frontal and clear faces are usually considered easy samples in the early training stage, and our model mainly learns the identity information from these samples. As training continues, slightly harder samples (i.e., blue box) gradually receive more focus and are corrected to become easy ones.
In this paper, we propose a novel Adaptive Curriculum Learning Loss that embeds the idea of adaptive curriculum learning into deep face recognition. Our key idea is to address easy samples in the early training stage and hard ones in the later stage. Our method is easy to implement and converges robustly. Extensive experiments on popular facial benchmarks demonstrate the effectiveness of our method compared to the SOTA competitors. Following the main idea of this work, future research can be expanded in various aspects, including designing a better function for negative cosine similarity that shares a similar adaptive characteristic during training, and investigating the effects of noisy samples that might be optimized as hard samples.