NPCFace: A Negative-Positive Cooperation Supervision for Training Large-scale Face Recognition

by   Dan Zeng, et al.
Shanghai University, Inc.

Deep face recognition has made remarkable advances in the last few years, yet the training scheme remains challenging in the large-scale data situation, where many hard cases occur. Especially in the range of low false accept rate (FAR), there are various hard cases in both positives (i.e. intra-class) and negatives (i.e. inter-class). In this paper, we study how to make better use of these hard samples for improving the training. The existing training methods deal with the challenge by adding margins in either the positive logit (such as SphereFace, CosFace, ArcFace) or the negative logit (such as MV-softmax, ArcNegFace, CurricularFace). However, the correlation between hard positives and hard negatives is overlooked, as is the relation between the margin in the positive logit and the margin in the negative logit. We find that this correlation is significant, especially in large-scale datasets, and that one can take advantage of it to boost the training by relating the positive and negative margins for each training sample. To this end, we propose an explicit sample-wise cooperation between positive and negative margins. Given a batch of hard samples, a novel Negative-Positive Cooperation loss, named NPCFace, is formulated, which emphasizes the training on both the negative and positive hard cases via a cooperative-margin mechanism in the softmax logits, and also brings better interpretation of the negative-positive hardness correlation. Besides, the negative emphasis is implemented with an improved formulation to achieve stable convergence and flexible parameter setting. We validate the effectiveness of our approach on various benchmarks of large-scale face recognition, and it outperforms the previous methods, especially in the low FAR range.



I Introduction

Fig. 1: The top and second row form hard positives (i.e. intra-class), the second and bottom row form hard negatives (i.e. inter-class). Many factors, such as large pose, expression, occlusion, age gap etc., result in hard cases in not only positives but also negatives.

Face recognition is a widely studied topic in computer vision and video analysis. With the advances of deep learning for face recognition [1, 2, 3], increasing research interest focuses on large-scale face recognition, whose major challenge lies in the recognition accuracy in the low false accept rate (FAR) range. Many factors lead to hard cases at low FAR, such as large pose, age gap, non-uniform lighting, occlusion, and so forth (Fig. 1). These cases are often regarded as hard samples, and form not only hard positives but also hard negatives. Here, positive and negative denote intra-class and inter-class, respectively. Many prior methods [3, 4, 5, 6] aim to select training samples from the hard cases to gain performance improvement. Rather than studying how to identify hard samples from the dataset, in this paper, given a definition of hard sample (such as the mis-classified samples), our objective is to study how to make better use of them to boost the training.

Recently, many methods have been proposed to optimize the training supervision from either the positive or the negative perspective, and they have achieved great progress on the mainstream benchmarks. Some of them [7, 8, 9, 10, 11, 12] aim to enlarge the gap between different classes by adding an angular margin in the positive logit of softmax. Liu et al. [7, 8] introduce the idea of angular margin for the first time. CosFace [9] and AM-softmax [10] propose an additive margin for better optimization. ArcFace [11] improves the geometric interpretation of the margin and achieves better performance. AdaptiveFace [12] learns an adaptive margin for each class. The above methods can be regarded as hard positive mining, because they emphasize the training on those samples far from their ground-truth center by adding a margin in the positive logit of softmax. In contrast, some other methods [13, 14, 15] employ hard negative mining by adding a margin from the negative (non-ground-truth) view. MV-softmax [13] proposes to identify the mis-classified samples and exploit them by adding a margin in the negative logits. ArcNegFace [14] also studies margin-involved negative logits in a similar way. Based on MV-softmax, CurricularFace [15] adaptively adjusts the relative importance of easy and hard samples during different training stages by modulating the negative logits.

The above-mentioned methods improve the training from either the positive view or the negative view. To make full use of the hard samples from the positive and negative perspectives simultaneously, a straightforward idea is to add margins in both positive and negative logits, as done in [13, 14, 15]. However, such a direct combination has a shortcoming: the margins are imposed independently in the positive and negative logits, which is a sub-optimal way of setting the margin for each hard sample in training. We argue that the margins should be related between negative and positive in a cooperative, sample-wise way. The reason is that, given a face dataset, a sample which acts as a hard positive will generally act as a hard negative as well. Such cases widely exist in face datasets, especially large-scale ones. For example, as shown in Fig. 2, when the dataset is large-scale, “Class 1” is surrounded by many neighboring classes, and the hard positive sample can easily find a neighboring class to form a hard negative. More formally, the hard positives and hard negatives are highly correlated in large-scale face datasets. This phenomenon is verified on the datasets CASIA-WebFace [16] and MS-Celeb-1M [17] in Section IV-A (Fig. 3). This observation is consistent with intuition, but has been overlooked in the prior methods for face recognition.

To address this issue, we propose the Negative-Positive Cooperation loss (NPCFace), which builds the hard negative-positive correlation into the training loss and exploits it via a cooperative margin scheme for better supervision. Specifically, we formulate an explicit cooperation of the margins in the positive and negative logits. The margin in the positive logit is enlarged by the cooperation if the negative logit is enlarged. This cooperation scheme is implemented sample-wise: it is activated when the sample is identified as a hard sample by an off-the-shelf criterion; otherwise, the cooperation is deactivated, and the positive logit and negative logit are calculated independently. Through this cooperation scheme, NPCFace emphasizes the supervision of the hard samples from both the positive and negative perspectives. A training sample which acts as a hard case from the negative view also gives extra contribution to the supervision from the positive view. With the cooperation scheme, NPCFace achieves better exploitation of large-scale face recognition training data, and pushes the frontier in the low FAR range.

Furthermore, we improve the margin formulation in the negative logits in order to guarantee the stable convergence and flexible parameter setting. The experiments on different network architectures and training datasets show that one can easily adopt our method to conduct effective training and further study.

The contributions of this paper are summarized as follows:

  • We propose a novel supervision loss, named NPCFace, to improve the use of hard samples in the large-scale training. It performs a cooperative training emphasis on hard positives and hard negatives sample-wisely, and is implemented via an explicitly related margin formulation in the softmax logits. Benefiting from the correlation between hard positive and hard negative, NPCFace makes better use of the hard samples for training deep face model.

  • We improve the margin formulation in the negative logits to achieve stable convergence and flexible parameter setting. These benefits are validated on various network architectures and face datasets. One can easily train the deep networks with NPCFace on large-scale face datasets.

  • We evaluate our approach on extensive face recognition benchmarks, including LFW, BLUFR, CALFW, CPLFW, CFP-FP, AgeDB-30, RFW, IJB-B, IJB-C, MegaFace and Trillion-Pairs. Resorting to the above improvements, NPCFace achieves leading performance on them, especially in the low FAR range.

II Related Work

II-A Loss Function

The loss function is an essential research topic in deep supervision for face recognition. There are mainly two routines in the previous works. The first consists in feature embedding. Contrastive loss [18, 19, 2] calculates the pairwise Euclidean distance and optimizes it in feature space, while Triplet loss [3] selects triplet samples and measures their relative Euclidean distance. The second includes classification loss functions, such as Taigman et al. [1], which aim to separate the different identities. Furthermore, the face feature representation should be compact within a class and separated between classes simultaneously. Therefore, Center loss [20] develops a method to constrain the intra-class compactness. RegularFace [21] aims at enlarging the inter-class separability between different class centers by an exclusive regularization. L-softmax [7] and SphereFace [8] introduce the angular margin to obtain significant improvement. NormFace [22] studies the effectiveness of feature and weight normalization. Afterward, CosFace [9] and AM-softmax [10] propose an additive margin in the positive logit which can be optimized steadily. ArcFace [11] employs an additive angular margin, which has a clearer geometric interpretation and achieves further improvement. AdaCos [23] introduces an adaptive scale parameter to reformulate the mapping between cosine similarity and classification probability. More recently, MV-softmax [13], ArcNegFace [14] and CurricularFace [15] propose to add margins in the negative logits. However, few works have thoroughly studied the cooperation between positive and negative.

II-B Hard Sample Usage

There are many prior works that study the mining approach for hard samples, such as OHEM [24], SmartMining [25], HDC [26] and some others [4, 5, 6] for face and general learning. However, there is less literature discussing how to use the selected hard samples. FaceNet [3] selects the hard positive and negative samples to construct each mini-batch. EDM [27] proposes to use the moderate positive and hard negative in a related manner. MV-softmax [13] chooses the mis-classified samples as hard ones and enlarges the corresponding loss value by adjusting the margins explicitly. Similarly, ArcNegFace [14] proposes margin-involved negative logits to emphasize the hard samples. Based on MV-softmax, CurricularFace [15] adaptively adjusts the relative importance of easy and hard samples during different training stages by modulating the negative logits. Earlier, a series of methods proposed to exploit the hard samples in a more implicit way. These methods, such as CosFace [9], AM-softmax [10] and ArcFace [11], mainly adopt margins in the loss function, so the training supervision automatically focuses on the hard samples. AdaptiveFace [12] makes the model learn an adaptive margin for each class and focus on hard classes with a small number of hard prototypes in each training iteration.

Fig. 2: (a) An example of hard positive in small-scale dataset. (b) In large-scale dataset, the hard positive has higher chance to be also a hard negative. This is verified by high correlation computed in Section IV-A (Fig. 3). Best viewed in color.

III Our Method

III-A Revisiting Softmax

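The softmax loss is the most widely used training loss function, which includes a fully connected layer, the softmax function and the cross-entropy loss. At the fully connected layer, the output is obtained by the inner product $W_j^{\top} x_i$ of the $i$-th feature $x_i$ and the $j$-th class weight $W_j$. After normalization on features and weights, the inner product equals the cosine similarity $\cos\theta_{i,j}$. Thus, the softmax loss can be formulated as follows:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{i,y_i}}}{\sum_{j=1}^{n} e^{s\cos\theta_{i,j}}}, \tag{1}$$

where $N$ is the batch size, $n$ is the class number, $s$ is a re-scaling parameter, and $y_i$ is the ground-truth label of the $i$-th sample. We denote the positive and negative logits as $z_{i,y_i}$ and $z_{i,j}$ ($j \neq y_i$), which are computed as $s\cos\theta_{i,y_i}$ and $s\cos\theta_{i,j}$, respectively. So, the softmax loss can be further formulated as:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{z_{i,y_i}}}{e^{z_{i,y_i}} + \sum_{j\neq y_i} e^{z_{i,j}}}. \tag{2}$$

Then, for the loss term $\mathcal{L}_i$ of the $i$-th sample, the gradients with respect to the positive logit and the negative logits are calculated as:

$$\frac{\partial \mathcal{L}_i}{\partial z_{i,y_i}} = P_{i,y_i} - 1, \qquad \frac{\partial \mathcal{L}_i}{\partial z_{i,j}} = P_{i,j} \quad (j \neq y_i), \tag{3}$$

where $P_{i,j}$ is the predicted probability on the $j$-th class, which is defined by the softmax function:

$$P_{i,j} = \frac{e^{z_{i,j}}}{\sum_{k=1}^{n} e^{z_{i,k}}}. \tag{4}$$

Given $\sum_{j=1}^{n} P_{i,j} = 1$ for the total $n$ classes, the gradient summation of each sample over all classes always equals the constant zero:

$$\frac{\partial \mathcal{L}_i}{\partial z_{i,y_i}} + \sum_{j\neq y_i}\frac{\partial \mathcal{L}_i}{\partial z_{i,j}} = (P_{i,y_i} - 1) + \sum_{j\neq y_i} P_{i,j} = 0. \tag{5}$$

Considering that $P_{i,j}$ is a probability and thus less than 1, the gradient with respect to the positive logit ($P_{i,y_i} - 1 < 0$) and that with respect to a negative logit ($P_{i,j} > 0$) have opposite signs. Therefore, for each training sample, given the softmax loss function, the supervisions on the ground-truth class and the non-ground-truth classes are strongly correlated in terms of magnitude, since their sum equals zero. In other words, if a training sample leads to a strong supervision on the ground-truth class, then it brings strong supervision on the non-ground-truth classes as well. This property is brought about by the normalization of the softmax function.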
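The zero-sum property of the logit gradients can be checked numerically. The following is a minimal NumPy sketch, assuming the standard per-sample gradients of softmax cross-entropy (the predicted probability minus one for the ground-truth logit, and the predicted probability for every other logit). The function name, the cosine values and the scale s = 32 are illustrative choices, not settings taken from this paper.

```python
import numpy as np

def softmax_logit_gradients(cos_theta, label, s=32.0):
    """Per-sample gradient of softmax cross-entropy w.r.t. each logit z_j = s * cos_theta_j."""
    z = s * np.asarray(cos_theta, dtype=np.float64)
    z = z - z.max()                     # stabilise the exponentials
    p = np.exp(z) / np.exp(z).sum()     # predicted probabilities P_j
    grad = p.copy()
    grad[label] -= 1.0                  # ground-truth logit: P_y - 1; other logits: P_j
    return p, grad

# Made-up cosine similarities; class 0 is the ground truth (illustrative only).
cos_theta = np.array([0.35, 0.30, -0.10, 0.05])
p, grad = softmax_logit_gradients(cos_theta, label=0)

print("positive gradient:", grad[0])
print("sum of negative gradients:", grad[1:].sum())
print("total:", grad.sum())
```

The positive gradient and the summed negative gradients have equal magnitude and opposite sign, which is the coupling between ground-truth and non-ground-truth supervision described above.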

III-B Revisiting Margin-based Variants

Many prior works attempt to impose a margin in the positive logit to emphasize the supervision on the ground-truth class. Without loss of generality, we take the ArcFace formulation as an example. The positive logit is equipped with a non-negative margin $m$, so the positive logit becomes smaller than the original version $s\cos\theta_{i,y_i}$, and so does the probability:

$$z_{i,y_i} = s\cos(\theta_{i,y_i} + m), \qquad P_{i,y_i} = \frac{e^{s\cos(\theta_{i,y_i}+m)}}{e^{s\cos(\theta_{i,y_i}+m)} + \sum_{j\neq y_i} e^{s\cos\theta_{i,j}}}. \tag{6}$$

Then, according to Eqn. 3, the supervision on the ground-truth class is amplified. While the positive margin benefits the ground-truth supervision, it impairs the non-ground-truth supervision. The reason is as follows. According to the above-mentioned property, the total supervision on the non-ground-truth classes is emphasized by the same amount:

$$\sum_{j\neq y_i} P_{i,j} = 1 - P_{i,y_i}. \tag{7}$$

Unfortunately, such emphasis is applied to all the non-ground-truth classes indiscriminately. Thus, the hard non-ground-truth class, which deserves stronger supervision than the other classes, is relatively weakened by the indiscriminate emphasis. More recently, MV-softmax, ArcNegFace and CurricularFace propose an extra margin in the negative logits; such a scheme compensates the supervision on the hard non-ground-truth class and alleviates the above issue. Taking MV-softmax as an example, the logit of a hard non-ground-truth class is reformulated as:

$$z_{i,j} = s\,(t\cos\theta_{i,j} + t - 1) = s\cos\theta_{i,j} + s\,(t-1)(\cos\theta_{i,j}+1), \quad t \ge 1, \tag{8}$$

where $s\,(t-1)(\cos\theta_{i,j}+1)$ can be regarded as the extra margin in the negative logit. We can see the logit is enlarged, and so the corresponding supervision is emphasized independently.
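The indiscriminate effect of a positive-only margin, and the compensation offered by a negative margin, can be illustrated with a small numerical sketch. It assumes the ArcFace positive logit s·cos(θ + m) and the commonly cited MV-softmax form s·(t·cos θ + t − 1) for a hard negative; the values of s, m, t and the cosines are illustrative, not taken from this paper.

```python
import numpy as np

def probs(logits):
    """Softmax probabilities of a 1-D logit vector."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

s, m, t = 32.0, 0.5, 1.2                          # illustrative values only
cos_theta = np.array([0.60, 0.55, 0.10, -0.20])   # class 0: ground truth, class 1: hard negative
theta_y = np.arccos(cos_theta[0])

plain = s * cos_theta                             # original softmax logits

arc = plain.copy()
arc[0] = s * np.cos(theta_y + m)                  # margin in the positive logit only

mv = arc.copy()
mv[1] = s * (t * cos_theta[1] + t - 1.0)          # extra margin on the hard negative (assumed form)

p_plain, p_arc, p_mv = probs(plain), probs(arc), probs(mv)

# Share of the total negative probability that falls on the hard negative.
share_plain = p_plain[1] / p_plain[1:].sum()
share_arc = p_arc[1] / p_arc[1:].sum()
share_mv = p_mv[1] / p_mv[1:].sum()
print(share_plain, share_arc, share_mv)
```

With the positive margin alone, the ground-truth probability drops but the hard negative keeps exactly the same share of the negative supervision; only the negative margin raises that share.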

III-C Improved Hard Negative Emphasis

In the above formulation (Eqn. 8) of the non-ground-truth logit, the margin is implemented via the parameter $t$ in the negative logit. By further developing the gradient from Eqn. 3 with respect to the class weight $W_j$ (using $\cos\theta_{i,j} = W_j^{\top} x_i$ for normalized features and weights),

$$\frac{\partial \mathcal{L}_i}{\partial W_j} = \frac{\partial \mathcal{L}_i}{\partial z_{i,j}}\,\frac{\partial z_{i,j}}{\partial W_j} = P_{i,j}\, s\, t\, x_i, \tag{9}$$

we can see that the supervision on the non-ground-truth class is determined by the predicted probability $P_{i,j}$ and the parameter $t$, while $P_{i,j}$ is also determined by $t$ (Eqn. 8). So, a slight increase of $t$ leads to a large increase of the gradient, and thus to a bad solution or even unstable convergence (Fig. 9). But if we decrease $t$ to gain better convergence, the emphasis on the hard non-ground-truth class is weakened instead. In order to alleviate the conflict between stability and hard emphasis, we propose to disentangle the multiplicative margin and the additive margin. To this end, the logit of a non-ground-truth class is reformulated as:

$$z_{i,j} = \begin{cases} s\,(t\cos\theta_{i,j} + \alpha), & M_{i,j} = 1, \\ s\cos\theta_{i,j}, & M_{i,j} = 0, \end{cases} \tag{10}$$

where the mask $M_{i,j}$ indicates whether the $i$-th sample is hard with respect to the $j$-th class. The choice of hard sample can follow any off-the-shelf definition, such as mis-classification [13], OSM [28], DE-DSP [4] etc. More importantly, we disentangle the multiplicative margin and the additive margin, and define them by $t$ and $\alpha$, respectively; $t$ and $\alpha$ represent the scale and the shift modulation. For hard samples, we emphasize the supervision on the hard non-ground-truth class by tuning $t$ and $\alpha$ together, so we can obtain stable convergence while keeping the hard supervision. This is an improved formulation with more flexible parameter setting. Section V-B shows that our formulation leads to stable convergence.
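The disentangled scale-and-shift emphasis can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: hard entries receive the logit s·(t·cos θ + α) and all other entries keep s·cos θ. The function name and the default values of s, t and alpha are placeholders, not the settings used in the experiments.

```python
import numpy as np

def negative_logits(cos_theta, hard_mask, s=32.0, t=1.1, alpha=0.25):
    """Scale-and-shift emphasis on hard non-ground-truth classes.

    cos_theta : (N, n) cosine similarities between samples and class weights
    hard_mask : (N, n) boolean, True where sample i is hard w.r.t. class j
    s, t, alpha : placeholder values (assumptions for this sketch)
    """
    cos_theta = np.asarray(cos_theta, dtype=np.float64)
    emphasised = s * (t * cos_theta + alpha)   # hard entries: scale by t, shift by alpha
    plain = s * cos_theta                      # remaining entries: unchanged
    return np.where(hard_mask, emphasised, plain)

# One sample, three classes; class 1 is marked as a hard negative.
cos_theta = np.array([[0.6, 0.7, 0.1]])
hard_mask = np.array([[False, True, False]])
z = negative_logits(cos_theta, hard_mask)
print(z)
```

Because the shift alpha does not multiply the cosine, it can be enlarged without changing the factor t that appears in the gradient with respect to the class weight.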

III-D Cooperation in Hard Positive

As discussed in Section I, when we train face recognition model on large-scale dataset, we can observe high correlation between hard positive and hard negative. In this section, to further improve the training supervision, we explore to take advantage of the correlation between the hard positive and hard negative. We argue that the margin formulated in the positive logit should be related to the negative logits for each sample. The hard samples are generally far away from their ground-truth class, and closer to the non-ground-truth classes. In other words, a sample which acts as hard case in positive perspective, also generally acts as hard one in negative perspective. We will discuss more about their correlation in Section IV-A.

Therefore, we develop an explicit cooperation between positive and negative logits for each sample. Specifically, a cooperative margin $\tilde{m}_i$ is defined for the positive logit of the $i$-th sample. Two factors are involved in the definition of the cooperative margin: (1) the similarity $\cos\theta_{i,j}$ to the non-ground-truth class is involved to implement the cooperation; (2) the mask $M_{i,j}$ is involved to enable the cooperation if it is a hard sample:

$$\tilde{m}_i = \begin{cases} m_0 + \dfrac{\sum_{j\neq y_i} M_{i,j}\cos\theta_{i,j}}{\sum_{j\neq y_i} M_{i,j}}\, m_1, & \sum_{j\neq y_i} M_{i,j} \neq 0, \\ m_0, & \sum_{j\neq y_i} M_{i,j} = 0, \end{cases} \tag{11}$$

where $m_0$ is a constant which maintains a basic margin for each sample, and $m_1$ controls the range of the cooperative margin. We can see that the cooperative margin is related to the averaged hard negative logits. If the sample acts as a hard case from the negative perspective, the cooperative margin increases; if it is not a hard case, $\tilde{m}_i$ reduces to the basic margin $m_0$.

The cooperative margin can be applied in any positive-margin-based method (e.g. [8, 9, 10, 11]). Here, we take ArcFace as an example, and the positive logit can be formulated as:

$$z_{i,y_i} = s\cos(\theta_{i,y_i} + \tilde{m}_i). \tag{12}$$

When the sample has harder negatives, the cooperative margin increases, the positive logit decreases and thus the loss value increases. Notice that each sample has its own cooperative margin with respect to its hardness. In Section IV-C, we provide more discussion about the role of the mask $M_{i,j}$ in the cooperative margin.
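A minimal sketch of the cooperative margin follows, assuming it equals a basic margin m0 plus m1 times the mean cosine similarity over the hard negatives of the sample, and falls back to m0 when the sample has no hard negative. The function name and the default values of m0 and m1 are illustrative assumptions.

```python
import numpy as np

def cooperative_margin(cos_theta, hard_mask, m0=0.4, m1=0.2):
    """Sample-wise margin: m0 plus m1 times the mean cosine over hard negatives.

    cos_theta : (N, n) cosine similarities
    hard_mask : (N, n) boolean, True for hard non-ground-truth classes
                (the caller must leave the ground-truth column False)
    m0, m1    : placeholder values (assumptions for this sketch)
    """
    cos_theta = np.asarray(cos_theta, dtype=np.float64)
    mask = np.asarray(hard_mask, dtype=np.float64)
    n_hard = mask.sum(axis=1)
    hard_sum = (mask * cos_theta).sum(axis=1)
    # Mean over hard negatives; stays 0 for samples without any hard negative.
    mean_hard = np.divide(hard_sum, n_hard, out=np.zeros_like(hard_sum), where=n_hard > 0)
    return m0 + m1 * mean_hard

cos_theta = np.array([[0.5, 0.8, 0.6, 0.1],    # sample 0: two hard negatives
                      [0.9, 0.2, 0.1, 0.0]])   # sample 1: no hard negative
hard_mask = np.array([[False, True, True, False],
                      [False, False, False, False]])
margins = cooperative_margin(cos_theta, hard_mask)
print(margins)
```

In an ArcFace-style positive logit, each entry of `margins` would be added to the ground-truth angle of the corresponding sample before taking the cosine.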

III-E Negative-Positive Cooperation Loss

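The Negative-Positive Cooperation (NPCFace) loss function incorporates the improved hard negative emphasis and the cooperation in hard positive emphasis, and is formulated as:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{i,y_i}+\tilde{m}_i)}}{e^{s\cos(\theta_{i,y_i}+\tilde{m}_i)} + \sum_{j\neq y_i} e^{z_{i,j}}}, \tag{13}$$

where $\tilde{m}_i$ is the cooperative margin of the $i$-th sample and $z_{i,j}$ is the non-ground-truth logit with the improved hard negative emphasis, i.e. $s\,(t\cos\theta_{i,j}+\alpha)$ for hard classes and $s\cos\theta_{i,j}$ otherwise.

As mentioned above, the choice of hard sample can follow any off-the-shelf definition, and we follow MV-softmax in employing the mis-classified samples. The cooperation comes from an important observation: a sample which is a hard case from the positive perspective generally acts as a hard one from the negative perspective as well. So, NPCFace not only combines the emphasis from the two views, but also benefits from the correlation between hard positives and hard negatives for boosting the supervision. The following sections give more discussion on NPCFace and show its superiority on face recognition.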
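Putting the two components together, the loss can be sketched in NumPy as below. This is a sketch under several assumptions rather than the authors' implementation: a class is treated as hard for a sample when its cosine exceeds the margin-penalised ground-truth cosine cos(θ + m0) (an MV-softmax-style criterion assumed here), the angle is clamped at π, and all default hyper-parameter values are placeholders.

```python
import numpy as np

def npc_face_loss(cos_theta, labels, s=32.0, t=1.1, alpha=0.25, m0=0.4, m1=0.2):
    """Illustrative NPCFace-style loss on an (N, n) matrix of cosine similarities."""
    cos_theta = np.clip(np.asarray(cos_theta, dtype=np.float64), -1.0, 1.0)
    labels = np.asarray(labels)
    N, n = cos_theta.shape
    rows = np.arange(N)
    is_gt = np.zeros((N, n), dtype=bool)
    is_gt[rows, labels] = True

    theta_y = np.arccos(cos_theta[rows, labels])

    # ASSUMPTION: class j is hard for sample i if its cosine beats cos(theta_y + m0).
    threshold = np.cos(np.minimum(theta_y + m0, np.pi))
    hard = (cos_theta > threshold[:, None]) & ~is_gt

    # Cooperative margin: m0 plus m1 times the mean cosine over hard negatives.
    n_hard = hard.sum(axis=1)
    hard_sum = np.where(hard, cos_theta, 0.0).sum(axis=1)
    mean_hard = np.divide(hard_sum, n_hard, out=np.zeros(N), where=n_hard > 0)
    margin = m0 + m1 * mean_hard

    # Negative logits: scale-and-shift emphasis on hard classes only.
    logits = np.where(hard, s * (t * cos_theta + alpha), s * cos_theta)
    # Positive logit: ArcFace-style with the cooperative margin (angle clamped at pi,
    # which is a simplification made for this sketch).
    logits[rows, labels] = s * np.cos(np.minimum(theta_y + margin, np.pi))

    # Numerically stable cross-entropy.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()

cos_theta = np.array([[0.7, 0.2, 0.1],    # well-classified sample
                      [0.3, 0.5, 0.1]])   # mis-classified sample (class 1 beats class 0)
labels = np.array([0, 0])
print(npc_face_loss(cos_theta, labels))
```

Setting t = 1 and alpha = m0 = m1 = 0 in this sketch removes every margin, so the function reduces to the plain normalized softmax loss, which is a convenient sanity check.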

IV Analysis

IV-A Correlation between Hard Positive and Negative

Fig. 3: Correlation of hard positive and hard negative of CASIA-WebFace and MS-Celeb-1M during the training.

As mentioned above, the important observation is that a sample which is a hard case from the positive view will most likely act as a hard case from the negative view as well. This is the motivation of NPCFace, which makes use of this correlation for better training supervision. To verify this argument, we calculate the correlation between the hard positives and hard negatives. Specifically, for each mis-classified sample we calculate the distance to its ground-truth class and the distance to the nearest non-ground-truth class; we then calculate the correlation of the two distances over these samples, and find that the two distances are negatively correlated throughout the training (Fig. 3). Note that this correlation is not the same as the correlation between the positive gradient and negative gradient in Section III-A. Here, the correlation indicates that the samples which have a smaller distance to the non-ground-truth class (i.e. hard negative) have a larger distance to the ground-truth class (i.e. hard positive). Also, we can see that the correlation is more significant when the dataset has a larger scale (MS-Celeb-1M is larger than CASIA-WebFace), which verifies the phenomenon in Fig. 2.
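The correlation measurement described above can be sketched as follows. The sketch assumes cosine distance (one minus cosine similarity) to the class centers and the Pearson correlation coefficient; the text does not specify either choice, so both are assumptions, as is the function name.

```python
import numpy as np

def hard_pos_neg_correlation(features, centers, labels):
    """Pearson correlation, over mis-classified samples, between the distance to the
    ground-truth center and the distance to the nearest non-ground-truth center."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    cos = f @ c.T                                  # (N, n) cosine similarities
    rows = np.arange(len(labels))
    pos_cos = cos[rows, labels]
    neg_cos = cos.copy()
    neg_cos[rows, labels] = -np.inf                # exclude the ground-truth class
    nearest_neg_cos = neg_cos.max(axis=1)
    wrong = nearest_neg_cos > pos_cos              # mis-classified samples
    d_pos = 1.0 - pos_cos[wrong]                   # cosine distance (assumed metric)
    d_neg = 1.0 - nearest_neg_cos[wrong]
    return np.corrcoef(d_pos, d_neg)[0, 1]

# Toy 2-D example: class centers on the two axes, four samples of class 0
# that all lie closer to the center of class 1.
angles = np.deg2rad([50.0, 60.0, 70.0, 80.0])
features = np.stack([np.cos(angles), np.sin(angles)], axis=1)
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.zeros(4, dtype=int)
r = hard_pos_neg_correlation(features, centers, labels)
print(r)
```

In this toy case the samples that drift further from their own center necessarily move closer to the other center, so the coefficient is negative.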

IV-B Sample-wise Margin

Most prior margin-based methods, such as CosFace, AM-softmax and ArcFace, set up the margin with a fixed value for all the training samples. Afterward, AdaptiveFace proposes to learn a margin for each class of the softmax classification. More recently, MV-softmax, ArcNegFace and CurricularFace set the margin in a sample-wise way, which means each training sample computes the loss with a specific margin with respect to the sample itself. This is a more reasonable routine because: (1) each training sample has a different extent of hardness; (2) the hardness of a sample varies as the network is updated. Therefore, the sample-wise definition of margin is a better way. NPCFace also designs the margin in such a sample-wise way, and employs this routine for both the positive margin and the negative margin; in contrast, MV-softmax, ArcNegFace and CurricularFace adopt the sample-wise margin only in the negative logits.

IV-C Selecting Hard Sample

There are many existing criteria for selecting hard samples for deep training. In this paper, we choose the mis-classified samples as hard cases rather than the ones with large distance to the ground-truth center. Fig. 4 demonstrates the case in which the mis-classified sample has a smaller distance to the ground-truth center than the well-classified one. As discussed in [27], this is caused by the highly-curved manifold in the feature space. To verify this, we analyze the cosine similarity distribution between the training samples and their ground-truth centers throughout the training process (Fig. 5). The red distribution corresponds to the mis-classified samples, while the blue one corresponds to the well-classified samples. Their overlap rates are shown in Fig. 6. At the start of training, almost all the samples are mis-classified because the network is trained from scratch. Meanwhile, the overlap is the highest because the feature manifold is most distorted at this stage. As the network gradually converges, the training samples become closer to their positive centers. The red area decreases and the blue area increases because fewer and fewer samples are mis-classified. Besides, we can observe that, even as the network converges, there is still an overlap between mis-classified samples and well-classified samples, which means it is improper to directly use the sample distance to identify hard samples.

Fig. 4: An illustration of a mis-classified case in the feature space. The distance from the Class 2 center to the mis-classified sample is smaller than that to a well-classified one.
Fig. 5: Blue: distribution of cosine similarity between well-classified samples and their ground-truth center throughout the training. Red: the counterpart between mis-classified samples and their ground-truth. Best viewed in color.
Fig. 6: The overlap of the two distributions along with the training epochs.
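The text does not state how the overlap rate of the two similarity distributions is computed. One common choice, assumed here purely for illustration, is the histogram intersection of the two normalized histograms over the cosine range [-1, 1]; the function name and the bin count are hypothetical.

```python
import numpy as np

def distribution_overlap(sim_a, sim_b, bins=50):
    """Histogram intersection of two sets of cosine similarities (assumed overlap measure)."""
    edges = np.linspace(-1.0, 1.0, bins + 1)
    hist_a, _ = np.histogram(sim_a, bins=edges)
    hist_b, _ = np.histogram(sim_b, bins=edges)
    pa = hist_a / hist_a.sum()      # normalise to probability mass per bin
    pb = hist_b / hist_b.sum()
    return np.minimum(pa, pb).sum() # 0 for disjoint, 1 for identical histograms

# Made-up similarities to the ground-truth center for the two groups.
mis_classified = np.array([0.10, 0.20, 0.30])
well_classified = np.array([0.25, 0.30, 0.60, 0.70])
print(distribution_overlap(mis_classified, well_classified))
```

A non-zero value means that some mis-classified samples are as close to their ground-truth center as some well-classified ones, which is the argument against a purely distance-based hardness criterion.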

IV-D Robustness to Feature Dimension

When embedding face images to feature space, the feature dimension plays an important role in metric computation. Here, we explore the effect of different dimension settings in NPCFace scheme. The dimension is determined by the last layer of the network. We set the last layer to 128, 256, 512 and 1,024 in four networks (with the same backbone), respectively. Then, we train the networks and extract the features, and calculate the cosine similarity distributions between mis-classified samples and their nearest negative centers. Fig. 7 shows the distributions of the negative similarity under the four different feature dimension settings. We can find that the hard negative similarity distributions are almost unchanged when the dimensionality increases from 128 to 1,024. The stability could be attributed to the margin formulation in the negative logit of NPCFace, which performs effective scaling and shifting in the training process.

Fig. 7: Cosine similarity distribution between mis-classified samples and their nearest non-ground-truth centers. The four networks have different dimensionalities of the last layer, but result in similar distributions.

V Experiment

This section is structured as follows. Section V-A introduces the datasets and experimental settings. Section V-B studies the convergence of NPCFace and its flexibility on parameter setting. Section V-C includes the ablation study which validates the negative margin, the cooperative positive margin and the combination. Section V-D demonstrates the comprehensive evaluation on a wide range of datasets and comparison with the state-of-the-art methods.

V-A Datasets and Experimental Setting

Training Data. We use two public datasets to train the networks. Specifically, we use the cleaned CASIA-WebFace [16] for training in the stability analysis and ablation study, and we utilize MS1M-v1c [29] (a cleaned version of MS-Celeb-1M [17]) for the large-scale comparison experiments. Note that we follow the lists of [30] and [13] to remove the overlapped identities between the employed training datasets and the test datasets. As a result, CASIA-WebFace retains 9,879 identities with 0.38M images and MS1M-v1c retains 72,690 identities with 3.28M images.

Test Data. For a thorough evaluation, we use eleven test benchmarks, including LFW [31], BLUFR [32], AgeDB-30 [33], CFP-FP [34], CALFW [35], CPLFW [36], RFW [37], MegaFace [38], Trillion-Pairs [29], IJB-B [39], IJB-C [40]. Among these test data, AgeDB-30 and CALFW focus on the large age gap face verification. CFP-FP and CPLFW aim at the large pose face verification. RFW focuses on the face verification for different races. BLUFR fully exploits all the LFW face images for the large-scale face recognition evaluation with focus at low FARs. MegaFace and Trillion-Pairs evaluate the performance of face recognition at the million scale of distractors. IJB-B and IJB-C contain images and videos for set-based face recognition.

Preprocessing. All faces are detected by FaceBoxes [41]. Then, we align the faces by five facial landmarks [42] and crop them to 120×120 RGB. During the training, we horizontally flip the faces with probability 0.5 for data augmentation. Besides, each pixel in the RGB images is normalized by subtracting 127.5 and then dividing by 128.
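The augmentation and normalization steps can be sketched as below. Face detection and landmark alignment are outside the scope of this sketch, so the input is assumed to be an already aligned 120×120 RGB crop; the function name is hypothetical.

```python
import numpy as np

def preprocess(face_rgb, rng, flip_prob=0.5):
    """Random horizontal flip followed by pixel normalisation.

    face_rgb : (120, 120, 3) uint8 array, assumed to be detected and aligned already
    rng      : numpy.random.Generator used for the flip decision
    """
    img = face_rgb.astype(np.float32)
    if rng.random() < flip_prob:
        img = img[:, ::-1, :]          # flip along the width axis
    return (img - 127.5) / 128.0       # maps [0, 255] to roughly [-1, 1]

rng = np.random.default_rng(0)
white = np.full((120, 120, 3), 255, dtype=np.uint8)
black = np.zeros((120, 120, 3), dtype=np.uint8)
out_white = preprocess(white, rng)
out_black = preprocess(black, rng)
print(out_white.max(), out_black.min())
```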

Fig. 8: The loss value of NPCFace with different CNN architectures along training iterations. Best viewed in color.

CNN Architecture. In the stability analysis and ablation study, we use MobileFaceNet [43] as the backbone to verify the effectiveness of each component of our method. Then, we adopt Attention-56 [44] as the backbone of NPCFace and all of its counterparts in the comparison experiments, so that we can make a fair comparison while keeping the performance contrast between methods. The network outputs a 512-dimensional feature. In addition, we also employ extra CNN architectures (Fig. 8), including VGG-19 [45], SE-ResNet-50 [46], ResNet-50 and -101 [47], and Attention-92 [44], to demonstrate the convergence of our approach with various architectures.

Training and Evaluation. We train the networks from scratch on four NVIDIA Tesla P40 GPUs. On CASIA-WebFace, the batch size is 128; the learning rate begins at 0.1, is divided by 10 at epochs 16, 24 and 28, and training finishes at epoch 30. On MS1M-v1c, we set the batch size to 512; the learning rate starts from 0.1, is divided by 10 at epochs 8, 14 and 18, and training finishes at epoch 20. We set momentum to 0.9 and weight decay to 0.0005. The scale and shift parameters of the negative emphasis, and the basic margin and range parameters of the cooperative margin, are set according to the validation on LFW. In the evaluation stage, we extract features from the last layer, and compute the cosine similarity as the similarity metric. For a fair and precise evaluation, all the overlapping identities between training and test datasets are removed according to the overlapping lists of [30] and [13].
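The step schedule described above can be written as a small helper. Whether epochs are counted from 0 or 1 is not stated, so this sketch assumes 0-based epochs and applies each drop from the milestone epoch onward; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(16, 24, 28)):
    """Step schedule: divide the rate by 10 at every milestone epoch (0-based, assumed)."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * (0.1 ** drops)

# CASIA-WebFace schedule (30 epochs) and MS1M-v1c schedule (20 epochs).
casia = [learning_rate(e) for e in range(30)]
ms1m = [learning_rate(e, milestones=(8, 14, 18)) for e in range(20)]
print(casia[0], casia[16], casia[24], casia[28])
```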

Compared Methods. The original softmax is employed as the baseline. The classification loss counterparts include SphereFace [8], CosFace [9], ArcFace [11], AdaM-softmax [12] and AdaCos [23]. In addition, we also compare with some recent softmax-based losses with hard mining improvement, such as MV-softmax [13], ArcNegFace [14] and CurricularFace [15]. OHEM (HM-softmax [24]) and Focal loss (F-softmax [48]) are involved as the hard mining counterparts. We re-implement them following every detail in their original literature, and conduct a fair comparison under the same experimental setting.

Fig. 9: The results on LFW (panel a) and BLUFR with different values of the scale parameter $t$ and the shift parameter $\alpha$ in the negative logit. The red line is the performance of MV-softmax with different $t$. The blue line is the performance of NPCFace with different $t$ and $\alpha$. Best viewed in color.

V-B Stable Convergence and Flexible Setting

To demonstrate the stable convergence in the training, we employ NPCFace to train on five prevailing families of CNN architectures, including MobileFaceNet [43], VGG-19 [45], ResNet-50 and -101 [47], SE-ResNet-50 [46] and Attention-56 and -92 [44]. As illustrated in Fig. 8, the loss values gradually drop along with the training iterations. To demonstrate the flexible parameter setting of our improved formulation in the negative logits, we conduct a comparison experiment between NPCFace and MV-softmax. As shown in Fig. 9, the red line is the performance of MV-softmax and the blue line is that of NPCFace. For MV-softmax, there is a large decrease on both LFW and BLUFR (VR@FAR=1e-5) once its parameter $t$ grows beyond a certain value, because MV-softmax's shift parameter is entangled with $t$. For NPCFace, in contrast, we can fix the scale $t$ in an appropriate range and enlarge the shift $\alpha$ to obtain further performance improvement. So, NPCFace is more flexible in determining favorable training parameters.

Neg.  Pos.  BLUFR 1e-4  BLUFR 1e-5  MegaFace Id.  MegaFace Veri.
-     -     92.74       83.52       72.89         77.64
✓     -     94.31       86.84       77.49         80.86
-     ✓     94.29       86.31       75.33         79.58
✓     ✓     94.82       88.15       77.76         82.29
TABLE I: Ablation study: performance (%) on BLUFR and MegaFace. On BLUFR, we report the verification rate at FAR of 1e-4 and 1e-5. On MegaFace, “Id.” refers to face identification rank-1 accuracy with 1M distractors, and “Veri.” refers to face verification TAR at 1e-6 FAR.

V-C Ablation Study

In this subsection, we analyse the two improvements of NPCFace and validate their effectiveness. Table I shows the results on BLUFR and MegaFace. The baseline (i.e. the top row) is the original ArcFace [11]. “Neg.” denotes the use of our improved negative logits; the improvement from the negatives (second row) is significant on every evaluation metric. “Pos.” refers to the cooperative margin for the positive logit, which also brings an obvious improvement (third row) over the baseline. With the joint advantage of the two components, NPCFace (bottom row) obtains further performance improvement, especially at low FARs.

Method                LFW    AgeDB-30  CFP-FP  CALFW  CPLFW  Caucasian  Indian  Asian  African
softmax               99.45  96.58     92.67   93.52  86.27  95.35      91.63   87.80  89.45
HM-softmax [24]       99.67  96.43     93.33   94.02  86.95  94.77      90.65   87.35  87.47
F-softmax [48]        99.65  96.60     94.11   93.87  87.17  94.95      90.72   86.82  88.00
SphereFace [8]        99.70  96.43     93.86   94.17  87.81  95.95      91.95   89.72  90.48
CosFace [9]           99.73  97.53     94.83   95.07  88.63  97.98      94.93   93.80  94.88
ArcFace [11]          99.75  97.68     94.27   95.12  88.53  98.22      95.68   93.97  94.95
AdaCos [23]           99.68  97.15     94.03   94.38  87.03  97.37      92.00   90.15  91.92
AdaM-softmax [12]     99.74  97.68     94.96   95.05  88.80  98.22      95.13   93.77  94.58
MV-AM-softmax [13]    99.72  97.73     93.77   95.23  88.65  98.28      95.08   93.50  94.57
ArcNegFace [14]       99.73  97.37     93.64   95.15  87.87  98.07      95.73   93.35  95.05
CurricularFace [15]   99.72  97.43     93.73   94.98  87.62  98.23      95.37   93.60  94.73
NPCFace               99.77  97.77     95.09   95.60  89.42  98.58      95.98   94.78  95.52
TABLE II: Performance (%) comparison on LFW, AgeDB-30, CFP-FP, CALFW, CPLFW and RFW (Caucasian, Indian, Asian and African subsets).

V-D Comparison Experiments

The comparison experiments aim to evaluate NPCFace against various challenges and to compare the results with state-of-the-art methods.

Recognition against large pose and age gap. Table II includes the performance on LFW, CPLFW, CFP-FP, CALFW and AgeDB-30. On LFW, NPCFace yields only a small improvement, since the state of the art on LFW is almost saturated. The other four benchmarks (CPLFW, CFP-FP, CALFW and AgeDB-30) evaluate hard cases with large face pose and large age gap. The results show that NPCFace is better than the baseline softmax loss and the other competitors in all evaluations, which demonstrates the effectiveness of our negative-positive cooperation.

Recognition for various races. The RFW benchmark includes four testing subsets, i.e. Caucasian, Asian, Indian and African. Each contains about 3,000 individuals with 6,000 image pairs for face verification. As shown in the right half of Table II, NPCFace achieves the highest accuracy on all four testing subsets, especially on the challenging Asian and African subsets. This indicates the good generalization ability of NPCFace training across races.

Recognition at low FAR. Table III includes the performance at low FARs. First, we conduct the evaluation on the BLUFR protocol and compare the verification rate at FARs of 1e-4 and 1e-5. Our method is superior to all the competitors. Further, we compare the performance in the MegaFace Challenge, which is one of the most challenging benchmarks for large-scale face identification and verification. Following the official protocol, we use FaceScrub [49] as the probe set. Compared with the baseline softmax loss, our method improves both the rank-1 identification rate and the verification TAR@FAR=1e-6 by at least 4 percentage points. Compared with the recent state-of-the-art methods (CosFace, ArcFace, MV-AM-softmax, AdaCos, AdaM-softmax, ArcNegFace, etc.), our method also keeps its superiority, which demonstrates the effectiveness of the cooperative margins. The Trillion-Pairs Challenge [29] is also a large-scale face recognition challenge, which consists of 5,700 identities for recognition and 1.58 million faces as distractors. Table III also displays the performance comparison in the Trillion-Pairs Challenge. From the results, we find that the hard mining methods [24, 48] do not work well in the extremely low FAR range (i.e. 1e-9), while the margin-improved methods (NPCFace, ArcFace, MV-AM-softmax, etc.) show an advantage in exploiting hard samples. Besides, we observe that NPCFace is able to push the limit of deep face recognition in the extremely low FAR range and achieves the leading performance among all the competitors in both identification and verification.
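The verification metric used throughout this comparison (verification rate, or TAR, at a fixed FAR) can be sketched as follows. This is a minimal illustration and not the official BLUFR or MegaFace evaluation code; the function name `tar_at_far`, the strict "score > threshold" acceptance rule and the toy scores are assumptions made for the example.

```python
import numpy as np


def tar_at_far(genuine, impostor, far):
    """Return (TAR, threshold) such that at most a `far` fraction of
    impostor pairs is accepted. A pair is accepted if score > threshold."""
    genuine = np.asarray(genuine, dtype=np.float64)
    impostor = np.sort(np.asarray(impostor, dtype=np.float64))[::-1]  # descending
    # Number of impostor pairs we are allowed to accept at this FAR.
    k = int(np.floor(far * impostor.size))
    # The (k+1)-th highest impostor score: at most k impostors lie strictly above it.
    threshold = impostor[k] if k < impostor.size else -np.inf
    tar = float(np.mean(genuine > threshold))
    return tar, float(threshold)


# Toy similarity scores (hypothetical values).
genuine_scores = [0.9, 0.75, 0.65, 0.55]
impostor_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
tar, thr = tar_at_far(genuine_scores, impostor_scores, far=0.25)
```

With this rule, a FAR of 1e-9 allows at most one false accept per billion impostor pairs, so the threshold is set by the very hardest negatives while the TAR is limited by the hardest positives. This is the regime in which both kinds of hard samples matter at once.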

In addition, we report the face verification performance on the IJB-B [39] and IJB-C [40] datasets. The IJB-B dataset consists of 1,845 subjects with 21,798 images and 55K video frames. For the face verification task, there are 10,270 positive matches and 8M negative matches. The IJB-C dataset comprises 3,531 subjects with 31,334 images and 117,542 video frames, and provides 19,557 genuine matches and 15,638,932 impostor matches. For the IJB-B and IJB-C face verification evaluation, we obtain the set-based representations by averaging the image features, without any specific strategy for set-based face recognition. From the results in Table IV, we find that NPCFace also keeps the leading performance on the IJB-B and IJB-C datasets, which shows that our method obtains more discriminative and generalized features than the counterparts.
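The set-based representation described above can be sketched in a few lines. The averaging step follows the text; the final L2 normalization and the cosine similarity between templates are assumptions of this example, and the function names `template_feature` and `template_similarity` are hypothetical.

```python
import numpy as np


def template_feature(image_features):
    """Average the per-image features of one template (set of images)
    and L2-normalize the result (the normalization is an assumption)."""
    mean = np.mean(np.asarray(image_features, dtype=np.float64), axis=0)
    return mean / np.linalg.norm(mean)


def template_similarity(set_a, set_b):
    """Cosine similarity between two averaged template features."""
    return float(np.dot(template_feature(set_a), template_feature(set_b)))


# Toy templates (hypothetical 2-D features).
template_a = [[1.0, 0.0], [0.0, 1.0]]  # two images of one subject
template_b = [[2.0, 2.0]]              # a single image
template_c = [[1.0, -1.0]]
sim_ab = template_similarity(template_a, template_b)
sim_ac = template_similarity(template_a, template_c)
```

Plain averaging weights every image equally, so the comparison in Table IV reflects the quality of the learned features and not a dedicated set-aggregation strategy.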

Method                BLUFR          MegaFace       Trillion-Pairs
                      1e-4   1e-5    Id.    Veri.   Id.    Veri.
softmax               99.43  97.62   92.27  93.77   51.21  47.91
HM-softmax [24]       99.50  97.77   91.45  93.51   49.78  46.66
F-softmax [48]        99.58  97.40   91.59  92.93   45.69  41.58
SphereFace [8]        99.51  98.03   92.54  94.23   55.09  54.42
CosFace [9]           99.79  98.73   96.65  97.25   72.33  70.98
ArcFace [11]          99.80  98.53   97.04  97.38   75.68  74.80
AdaCos [23]           99.63  97.44   94.27  96.04   53.59  52.33
AdaM-softmax [12]     99.81  98.89   96.80  97.37   71.76  70.70
MV-AM-softmax [13]    99.81  99.25   97.13  97.50   75.34  74.34
ArcNegFace [14]       99.78  98.49   96.85  97.35   75.48  73.77
CurricularFace [15]   99.79  98.85   96.80  97.24   75.07  73.45
NPCFace               99.83  99.36   97.75  98.07   77.53  77.01
TABLE III: Performance (%) comparison on BLUFR, MegaFace and Trillion-Pairs.

Method                IJB-B          IJB-C
                      1e-4   1e-5    1e-4   1e-5
softmax               85.66  73.63   86.62  76.48
HM-softmax [24]       85.81  73.79   87.26  77.76
F-softmax [48]        85.10  73.67   86.98  77.53
SphereFace [8]        86.67  74.75   87.92  78.77
CosFace [9]           90.60  82.28   91.72  86.68
ArcFace [11]          90.83  82.68   91.82  85.75
AdaCos [23]           86.04  73.34   87.53  78.91
AdaM-softmax [12]     90.54  82.70   91.64  86.84
MV-AM-softmax [13]    90.67  83.17   92.03  87.52
ArcNegFace [14]       90.62  81.59   90.91  85.64
CurricularFace [15]   90.04  81.15   90.95  84.63
NPCFace               92.02  85.59   92.90  88.08
TABLE IV: Performance (%) comparison on IJB-B and IJB-C.
Fig. 10: The ROC curves of NPCFace and the counterparts on (a) IJB-B and (b) IJB-C. Best viewed in color.

VI Conclusion

In this paper, we propose a novel training supervision, namely the Negative-Positive Cooperation (NPCFace) loss, to address the challenges in large-scale face recognition. The contribution is twofold. First, a cooperative training emphasis on hard positives and hard negatives is developed to make full use of them for better training. Second, the improved margin formulation in the negative logits leads to stable convergence and flexible parameter setting. The two components jointly bring advantages to the training of deep face recognition. Consequently, NPCFace achieves favorable performance in the low FAR range and on various hard cases, and shows its superiority over the prior methods.


  • [1] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1701–1708.
  • [2] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Advances in neural information processing systems, 2014, pp. 1988–1996.
  • [3] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
  • [4] Y. Duan, L. Chen, J. Lu, and J. Zhou, “Deep embedding learning with discriminative sampling policy,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4964–4973.
  • [5] J. Zhou, P. Yu, W. Tang, and Y. Wu, “Efficient online local metric adaptation via negative samples for person re-identification,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2420–2428.
  • [6] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, “Sampling matters in deep embedding learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2840–2848.
  • [7] W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss for convolutional neural networks,” in ICML, vol. 2, no. 3, 2016, p. 7.
  • [8] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 212–220.
  • [9] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface: Large margin cosine loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5265–5274.
  • [10] F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
  • [11] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.
  • [12] H. Liu, X. Zhu, Z. Lei, and S. Z. Li, “Adaptiveface: Adaptive margin and sampling for face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 947–11 956.
  • [13] X. Wang, S. Zhang, S. Wang, T. Fu, H. Shi, and T. Mei, “Mis-classified vector guided softmax loss for face recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
  • [14] Y. Liu, G. Song, M. Zhang, J. Liu, Y. Zhou, and J. Yan, “Towards flops-constrained face recognition,” in Proceedings of the ICCV Workshop, 2019.
  • [15] Y. Huang, Y. Wang, Y. Tai, X. Liu, P. Shen, S. Li, J. Li, and F. Huang, “Curricularface: Adaptive curriculum learning loss for deep face recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [16] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923, 2014.
  • [17] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in European Conference on Computer Vision.   Springer, 2016, pp. 87–102.
  • [18] S. Chopra, R. Hadsell, Y. LeCun et al., “Learning a similarity metric discriminatively, with application to face verification,” in CVPR (1), 2005, pp. 539–546.
  • [19] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2.   IEEE, 2006, pp. 1735–1742.
  • [20] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European conference on computer vision.   Springer, 2016, pp. 499–515.
  • [21] K. Zhao, J. Xu, and M.-M. Cheng, “Regularface: Deep face recognition via exclusive regularization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1136–1144.
  • [22] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille, “Normface: L2 hypersphere embedding for face verification,” in Proceedings of the 25th ACM international conference on Multimedia.   ACM, 2017, pp. 1041–1049.
  • [23] X. Zhang, R. Zhao, Y. Qiao, X. Wang, and H. Li, “Adacos: Adaptively scaling cosine logits for effectively learning deep face representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 823–10 832.
  • [24] A. Shrivastava, A. Gupta, and R. Girshick, “Training region-based object detectors with online hard example mining,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 761–769.
  • [25] B. Harwood, B. Kumar, G. Carneiro, I. Reid, T. Drummond et al., “Smart mining for deep metric learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2821–2829.
  • [26] Y. Yuan, K. Yang, and C. Zhang, “Hard-aware deeply cascaded embedding,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 814–823.
  • [27] H. Shi, Y. Yang, X. Zhu, S. Liao, Z. Lei, W. Zheng, and S. Z. Li, “Embedding deep metric for person re-identification: A study against large variations,” in European conference on computer vision.   Springer, 2016, pp. 732–748.
  • [28] X. Wang, Y. Hua, E. Kodirov, G. Hu, and N. M. Robertson, “Deep metric learning by online soft mining and class-aware attention,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5361–5368.
  • [29]
  • [30] X. Wang, S. Wang, J. Wang, H. Shi, and T. Mei, “Co-mining: Deep face recognition with noisy labels,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9358–9367.
  • [31] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” 2008.
  • [32] S. Liao, Z. Lei, D. Yi, and S. Z. Li, “A benchmark study of large-scale unconstrained face recognition,” in IEEE international joint conference on biometrics.   IEEE, 2014, pp. 1–8.
  • [33] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou, “Agedb: the first manually collected, in-the-wild age database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 51–59.
  • [34] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs, “Frontal to profile face verification in the wild,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).   IEEE, 2016, pp. 1–9.
  • [35] T. Zheng, W. Deng, and J. Hu, “Cross-age lfw: A database for studying cross-age face recognition in unconstrained environments,” arXiv preprint arXiv:1708.08197, 2017.
  • [36] T. Zheng and W. Deng, “Cross-pose lfw: A database for studying crosspose face recognition in unconstrained environments,” Beijing University of Posts and Telecommunications, Tech. Rep, pp. 18–01, 2018.
  • [37] M. Wang, W. Deng, J. Hu, J. Peng, X. Tao, and Y. Huang, “Racial faces in-the-wild: Reducing racial bias by deep unsupervised domain adaptation,” arXiv preprint arXiv:1812.00194, 2018.
  • [38] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard, “The megaface benchmark: 1 million faces for recognition at scale,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4873–4882.
  • [39] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. C. Adams, T. Miller, N. D. Kalka, A. K. Jain, J. A. Duncan, K. E. Allen, J. Cheney, and P. Grother, “Iarpa janus benchmark-b face dataset,” 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 592–600, 2017.
  • [40] B. Maze, J. C. Adams, J. A. Duncan, N. D. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, and P. Grother, “Iarpa janus benchmark - c: Face dataset and protocol,” 2018 International Conference on Biometrics (ICB), pp. 158–165, 2018.
  • [41] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, “Faceboxes: A cpu real-time face detector with high accuracy,” in 2017 IEEE International Joint Conference on Biometrics (IJCB).   IEEE, 2017, pp. 1–9.
  • [42] Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X.-J. Wu, “Wing loss for robust facial landmark localisation with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2235–2245.
  • [43] S. Chen, Y. Liu, X. Gao, and Z. Han, “Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices,” in Chinese Conference on Biometric Recognition.   Springer, 2018, pp. 428–438.
  • [44] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3156–3164.
  • [45] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” Computer Science, 2014.
  • [46] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [47] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [48] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
  • [49] H.-W. Ng and S. Winkler, “A data-driven approach to cleaning large face datasets,” in 2014 IEEE International Conference on Image Processing (ICIP).   IEEE, 2014, pp. 343–347.