Mis-classified Vector Guided Softmax Loss for Face Recognition

11/26/2019 ∙ by Xiaobo Wang, et al. ∙ JD.com, Inc.

Face recognition has witnessed significant progress due to advances in deep convolutional neural networks (CNNs), and the central task is to improve feature discrimination. To this end, several margin-based softmax loss functions (e.g., with angular, additive and additive angular margins) have been proposed to increase the feature margin between different classes. However, despite the great achievements made, they mainly suffer from three issues: 1) they ignore the importance of mining informative features for discriminative learning; 2) they encourage the feature margin only from the perspective of the ground truth class, without realizing the discriminability from the other non-ground truth classes; 3) the feature margin between different classes is set to be the same and fixed, which may not adapt well to all situations. To cope with these issues, this paper develops a novel loss function that adaptively emphasizes the mis-classified feature vectors to guide discriminative feature learning. Thus we can address all the above issues and learn more discriminative face features. To the best of our knowledge, this is the first attempt to inherit the advantages of feature margin and feature mining into a unified loss function. Experimental results on several benchmarks demonstrate the effectiveness of our method over state-of-the-art alternatives.


Introduction

Face recognition is a fundamental task of great practical value in the community of computer vision and pattern recognition. It contains two sub-tasks: face identification, which classifies a given face to a specific identity, and face verification, which determines whether a pair of face images belong to the same identity. Though it has been extensively studied for decades [37, 7, 18, 8, 39, 1, 38, 14, 28, 24], there still exist a great many challenges for accurate face recognition, especially on large-scale test datasets at very low false alarm rates (FAR), such as the MegaFace Challenge [10, 20] and the recent Trillion-Pairs Challenge [2].

In recent years, advanced face recognition models have usually been built upon deep convolutional neural networks [32, 6, 26], in which the learned discriminative features play a significant role. To train deep models, CNNs are generally equipped with classification loss functions [29, 41, 11, 15, 40], metric learning loss functions [27, 22, 35], or both [28, 41, 46]. Metric learning losses such as the contrastive loss [27] or the triplet loss [22] usually suffer from high computational cost; to avoid this, they require carefully designed sample mining strategies, and the performance is very sensitive to these strategies. So increasingly more researchers have shifted their attention to constructing deep face recognition models by re-designing the classical classification loss functions.

Intuitively, face features are discriminative if their intra-class compactness and inter-class separability are well maximized. However, as pointed out by many recent studies [41, 30, 15, 33, 40, 3], the prevailing classification loss function (i.e., the Softmax loss) lacks the power of feature discrimination for deep face recognition. To address this issue, Wen et al. [41] develop a center loss, which learns a center for each identity to enhance the intra-class compactness. Wang et al. [30] and Ranjan et al. [21] propose to use a scale parameter to control the temperature of the softmax loss, producing higher gradients for the well-separated samples to shrink the intra-class variance. Recently, several margin-based softmax loss functions [16, 15, 34, 33, 3] have been proposed to increase the feature margin between different classes. Liu et al. [16, 15] introduce an angular margin (A-Softmax) between the ground truth class and the other classes to encourage larger inter-class variance. However, it is usually unstable and the optimal parameters need to be carefully adjusted for different settings. To enhance the stability of the A-Softmax loss, Liang et al. [11] and Wang et al. [33, 34] propose the additive margin (AM-Softmax) loss to stabilize the optimization, and Deng et al. [3] develop an additive angular margin (Arc-Softmax) loss, which has a clear geometric interpretation.

Although the above approaches have achieved promising results, they mainly suffer from three shortcomings: 1) They ignore the importance of informative feature mining for discriminative learning. To address this, one may resort to mining-based softmax loss functions. Shrivastava et al. [25] design a hard mining strategy (HM-Softmax) that improves feature discrimination by constructing mini-batches from high-loss examples. But the percentage of hard examples is decided empirically, and the easy examples are completely discarded. In contrast, Lin et al. [13] design a relatively soft mining strategy, namely the Focal loss (F-Softmax), to focus training on a sparse set of hard examples. However, the indication of hard examples is unclear. As a result, these two mining-based candidates often fail to improve the performance, and how to semantically select the hard examples remains an open problem. 2) They enlarge the feature margin only from the perspective of the ground truth class, which is partial and ignores the discriminability from the other non-ground truth classes. 3) Last but not least, they enlarge the feature margin with the same fixed margin for all classes, which may not be appropriate and may not work well in practice.

To overcome the aforementioned shortcomings, this paper designs a new loss function, which explicitly indicates the hard examples as mis-classified vectors and adaptively emphasizes them to guide discriminative feature learning. The main contributions of this paper can be summarized as follows:

  • We propose a novel MV-Softmax loss, which explicitly indicates the hard examples and focuses on them to guide discriminative feature learning. As a consequence, our new loss also absorbs the discriminability from the other non-ground truth classes and employs adaptive margins for different classes.

  • To the best of our knowledge, this is the first attempt to effectively inherit the merits of feature margin and feature mining techniques into a unified loss function. Moreover, we deeply analyze the relations and differences between our new loss and the current margin-based and mining-based losses.

  • We conduct extensive experiments on the common benchmarks of LFW, CALFW, CPLFW, AgeDB, CFP, RFW, MegaFace and Trillion-Pairs, which have verified the superiority of our new approach over the baseline Softmax loss, the mining-based Softmax losses, the margin-based Softmax losses, and their naive fusions.

Preliminary Knowledge

Softmax. The softmax loss is defined as the pipeline combination of the last fully connected layer, the softmax function and the cross-entropy loss. In face recognition, the weights $W = \{w_1, w_2, \dots, w_K\}$ (where $w_k \in \mathbb{R}^d$ and $K$ is the number of classes) and the feature $x \in \mathbb{R}^d$ of the last fully connected layer are usually normalized, and their magnitudes are replaced by a scale parameter $s$ [30, 33, 3]. In consequence, given an input feature vector $x$ with its corresponding ground truth label $y$, the softmax loss can be re-formulated as follows:

$$\mathcal{L}_1 = -\log \frac{e^{s\cos(\theta_{w_y,x})}}{e^{s\cos(\theta_{w_y,x})} + \sum_{k\neq y}^{K} e^{s\cos(\theta_{w_k,x})}}, \tag{1}$$

where $\cos(\theta_{w_k,x}) = w_k^{\top}x$ is the cosine similarity and $\theta_{w_k,x}$ is the angle between $w_k$ and $x$. As pointed out by a great many studies [16, 15, 33, 3], the features learned with the softmax loss are prone to be separable, rather than discriminative, for face recognition.
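As a concrete reference for Eq. (1), the following PyTorch sketch computes the normalized, scaled softmax loss; the function and parameter names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def scaled_softmax_loss(x, W, y, s=32.0):
    """Minimal sketch of Eq. (1): normalized features/weights with scale s.

    x: (N, d) feature batch, W: (d, K) class weights, y: (N,) integer labels.
    """
    x = F.normalize(x, dim=1)          # unit-norm features
    W = F.normalize(W, dim=0)          # unit-norm class weights
    cos_theta = x @ W                  # (N, K) cosine similarities
    logits = s * cos_theta             # magnitudes replaced by the scale s
    return F.cross_entropy(logits, y)  # -log of the ground-truth probability
```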

Mining-based Softmax. Hard example mining is becoming a common practice to effectively train deep CNNs. Its idea is to concentrate training on informative examples, which usually results in more discriminative features. Recent works select hard examples based on the loss value [25, 13] to learn discriminative features. Generally, they can be summarized as:

$$\mathcal{L}_2 = -g(p_y)\log \frac{e^{s\cos(\theta_{w_y,x})}}{e^{s\cos(\theta_{w_y,x})} + \sum_{k\neq y}^{K} e^{s\cos(\theta_{w_k,x})}}, \tag{2}$$

where $p_y$ is the predicted ground truth probability and $g(p_y)$ is an indicator function. Basically, for the soft mining method Focal loss [13] (F-Softmax), $g(p_y) = (1-p_y)^{\gamma}$, where $\gamma$ is a modulating factor. For the hard mining method HM-Softmax [25], $g(p_y) = 0$ when the sample is indicated as easy and $g(p_y) = 1$ when the sample is hard.
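A minimal sketch of how Eq. (2) can be instantiated; the `gamma` and `keep_ratio` defaults here are placeholder values for illustration, not the settings used later in the paper.

```python
import torch
import torch.nn.functional as F

def mining_softmax_loss(logits, y, gamma=2.0, hard_only=False, keep_ratio=0.9):
    """Sketch of Eq. (2): re-weight the per-sample cross-entropy by g(p_y)."""
    ce = F.cross_entropy(logits, y, reduction='none')  # -log p_y, per sample
    p_y = torch.exp(-ce)                               # predicted ground-truth probability
    if hard_only:
        # HM-Softmax style: keep only the highest-loss fraction of the batch
        k = max(1, int(keep_ratio * ce.numel()))
        ce, _ = torch.topk(ce, k)
        return ce.mean()
    # F-Softmax (focal) style: g(p_y) = (1 - p_y)^gamma down-weights easy samples
    return ((1.0 - p_y) ** gamma * ce).mean()
```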

Margin-based Softmax. To directly enhance the feature discrimination, several margin-based softmax loss functions [15, 40, 33, 3] have been proposed in recent years. In summary, they can be defined as follows:

$$\mathcal{L}_3 = -\log \frac{e^{s f(m,\theta_{w_y,x})}}{e^{s f(m,\theta_{w_y,x})} + \sum_{k\neq y}^{K} e^{s\cos(\theta_{w_k,x})}}, \tag{3}$$

where $f(m, \theta_{w_y,x})$ is a carefully designed margin function. Basically, $f(m_1, \theta_{w_y,x}) = \cos(m_1\theta_{w_y,x})$ is the margin function of the A-Softmax loss [15], where $m_1 \geq 1$ is an integer. $f(m_3, \theta_{w_y,x}) = \cos(\theta_{w_y,x}) - m_3$ with $m_3 > 0$ is the AM-Softmax loss [33]. $f(m_2, \theta_{w_y,x}) = \cos(\theta_{w_y,x} + m_2)$ with $m_2 > 0$ is the Arc-Softmax loss [3]. More generally, the margin function can be summarized into a combined version: $f(m, \theta_{w_y,x}) = \cos(m_1\theta_{w_y,x} + m_2) - m_3$.
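The three margin functions can be compared in a few lines of code; this sketch is illustrative, with a placeholder margin value.

```python
import torch

def margin_function(cos_theta, kind='am', m=0.35):
    """Sketch of the margin functions f(m, theta) in Eq. (3); m=0.35 is illustrative."""
    theta = torch.acos(cos_theta.clamp(-1.0, 1.0))
    if kind == 'a':        # A-Softmax: cos(m * theta), m an integer >= 1
        return torch.cos(m * theta)
    if kind == 'arc':      # Arc-Softmax: cos(theta + m)
        return torch.cos(theta + m)
    return cos_theta - m   # AM-Softmax: cos(theta) - m
```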

Problem Formulation

To begin with, let us revisit the formulation of the margin-based softmax losses, i.e., Eq. (3), from which we can summarize that: 1) It ignores the importance of informative feature mining for discriminative learning. 2) It only exploits the discriminability from the ground truth class $y$, i.e., $f(m, \theta_{w_y,x})$, without being aware of the potential discriminability from the other non-ground truth classes $k$, where $k \in \{1, 2, \dots, K\}$ and $k \neq y$. 3) It simply uses the same fixed margin $m_1$, $m_2$ or $m_3$ to enlarge the feature margin between different classes.

Naive Mining-Margin Softmax Loss

To solve the first shortcoming, one may resort to hard example mining strategies [25, 13]. The mining-based loss functions aim to focus training on hard examples, while the margin-based loss functions aim to enlarge the feature margin between different classes. Therefore, these two branches are orthogonal and can be seamlessly incorporated into each other, leading to a naive motivation to directly integrate them as:

$$\mathcal{L}_4 = -g(p_y)\log \frac{e^{s f(m,\theta_{w_y,x})}}{e^{s f(m,\theta_{w_y,x})} + \sum_{k\neq y}^{K} e^{s\cos(\theta_{w_k,x})}}. \tag{4}$$

The formulation in Eq. (4) does involve informative features via the indicator function $g(p_y)$, but its improvement is limited in practice. The reason may be that HM-Softmax [25] explicitly indicates the hard examples but discards the easy ones, while F-Softmax [13] uses all examples and empirically re-weights them by a modulating factor, so the hard examples remain unclear for training and lack an intuitive interpretation. This motivates us to design a more effective way to improve the performance.

Mis-classified Vector Guided Softmax Loss

Intuitively, the well-separated feature vectors have little effect on the learning problem, which means the mis-classified feature vectors are more crucial to enhancing feature discriminability. To this end, we introduce a more elegant way to focus training on the truly informative features (i.e., the mis-classified vectors). Specifically, based on the margin-based softmax loss functions, we define a binary indicator $I_k$ to adaptively indicate whether a sample (feature) $x$ is mis-classified by a specific classifier $w_k$ (where $k \neq y$) in the current stage:

$$I_k = \begin{cases} 0, & f(m,\theta_{w_y,x}) - \cos(\theta_{w_k,x}) \geq 0 \\ 1, & f(m,\theta_{w_y,x}) - \cos(\theta_{w_k,x}) < 0 \end{cases} \tag{5}$$

From the definition in Eq. (5), we can see that if a sample (feature) is mis-classified, i.e., $f(m,\theta_{w_y,x}) - \cos(\theta_{w_k,x}) < 0$ (e.g., in the left sub-figure of Figure 1, the feature $x_2$ belongs to class 1, but it is mis-classified by the classifier $w_2$), it will be emphasized temporarily. In this way, the hard examples are explicitly indicated and we mainly focus on them for discriminative training. Consequently, we formulate our Mis-classified Vector guided Softmax (MV-Softmax) loss as follows:

$$\mathcal{L}_5 = -\log \frac{e^{s f(m,\theta_{w_y,x})}}{e^{s f(m,\theta_{w_y,x})} + \sum_{k\neq y}^{K} h(t,\theta_{w_k,x},I_k)\, e^{s\cos(\theta_{w_k,x})}}, \tag{6}$$

where $h(t,\theta_{w_k,x},I_k) \geq 1$ is a re-weighting function to emphasize the indicated mis-classified vectors. Here we give two candidates; one uses a fixed weight for all mis-classified classes:

$$h(t,\theta_{w_k,x},I_k) = e^{s t I_k}, \tag{7}$$

and the other one is an adaptive formulation:

$$h(t,\theta_{w_k,x},I_k) = e^{s t (\cos(\theta_{w_k,x}) + 1) I_k}, \tag{8}$$

where $t \geq 0$ is a preset hyper-parameter. Obviously, when $t = 0$, the designed MV-Softmax loss in Eq. (6) becomes identical to the original margin-based softmax losses in Eq. (3).
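To make the whole pipeline concrete, below is a minimal PyTorch sketch of MV-Softmax built on the AM-Softmax margin, following Eqs. (5)-(8). The function name `mv_softmax_loss` and the default values of `s`, `m` and `t` are illustrative choices, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def mv_softmax_loss(x, W, y, s=32.0, m=0.35, t=0.2, adaptive=True):
    """Sketch of MV-AM-Softmax (Eqs. (5)-(8)); f(m, theta) = cos(theta) - m."""
    cos = F.normalize(x, dim=1) @ F.normalize(W, dim=0)   # (N, K) cosines
    n = torch.arange(x.size(0))
    f_y = cos[n, y] - m                                    # f(m, theta_{w_y,x})
    mask = (cos > f_y.unsqueeze(1)).float()                # Eq. (5): I_k = 1 entries
    mask[n, y] = 0.0                                       # ground truth class excluded
    if adaptive:                                           # Eq. (8): h = e^{s t (cos+1) I_k}
        extra = s * t * (cos + 1.0) * mask
    else:                                                  # Eq. (7): h = e^{s t I_k}
        extra = s * t * mask
    logits = s * cos + extra                               # log h + s*cos realizes h * e^{s cos}
    logits[n, y] = s * f_y                                 # margin on the target logit
    return F.cross_entropy(logits, y)
```

Note that multiplying $h$ into the denominator of Eq. (6) is equivalent to adding $\log h$ to the corresponding non-ground-truth logit before the cross-entropy, which is what the sketch does in log-space.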

Figure 1: A geometric interpretation of MV-Softmax from the feature perspective. Samples $x_1$ and $x_2$ are both from class 1. The mis-classified vectors (red dots) are those that are mis-classified by a specific classifier (e.g., $w_2$).

Comparison to Mining-based Softmax Losses.

To illustrate the advantages of our MV-Softmax loss over the traditional mining-based loss functions (e.g., HM-Softmax [25] and F-Softmax [13]), Figure 1 gives a toy example. Assume that we have two samples (features) $x_1$ and $x_2$, both from class 1, where $x_1$ is well-classified while $x_2$ is not. HM-Softmax empirically indicates the hard samples and discards the easy sample $x_1$, using only the hard one $x_2$ for training. F-Softmax does not explicitly indicate the hard samples, but re-weights all samples so that the harder $x_2$ receives a relatively larger loss value. Both strategies work directly from the loss viewpoint, and their selection of hard examples lacks semantic guidance. Our MV-Softmax loss Eq. (6) proceeds differently. Firstly, we semantically indicate the hard examples (mis-classified vectors) according to the decision boundary: the hardness of previous methods is defined as a global relationship between samples, while our hardness is a local relationship between a feature and a classifier, which is more consistent with discriminative feature learning. Then, we emphasize these hard examples from the probability viewpoint. Specifically, because the cross-entropy loss is monotonically decreasing in the target probability, reducing the probability $p_y$ of the mis-classified vector $x_2$ (which happens because $h(t,\theta_{w_k,x},I_k) \geq 1$, see Eqs. (7) and (8)) will increase its importance for training. In summary, our mis-classified vector guided mining strategy is superior to previous ones for discriminative feature learning.

Comparison to Margin-based Softmax Losses.

Similarly, assume that we have a sample $x$ from class 1 that is not well-classified (e.g., the red dot $x_2$ in Figure 1). The original softmax loss aims to make $\cos(\theta_{w_1,x}) \geq \cos(\theta_{w_2,x})$ and $\cos(\theta_{w_1,x}) \geq \cos(\theta_{w_3,x})$. To make these objectives more rigorous, margin-based loss functions introduce a margin function from the perspective of the ground truth class (i.e., class 1) [15, 33, 3]:

$$f(m,\theta_{w_1,x}) \geq \cos(\theta_{w_2,x}) \quad \text{and} \quad f(m,\theta_{w_1,x}) \geq \cos(\theta_{w_3,x}), \tag{9}$$

wherein $f(m,\theta_{w_1,x})$ uses the same fixed margin for different classes and ignores the potential discriminability from the other non-ground truth classes (e.g., classes 2 and 3). To solve these issues, our MV-Softmax loss tries to further enlarge the feature margin from the perspective of the other non-ground truth classes. Specifically, we introduce a margin function $h^*(t,\theta_{w_2,x},I_2)$ for the mis-classified feature $x_2$:

$$f(m,\theta_{w_1,x}) \geq h^*(t,\theta_{w_2,x},I_2), \tag{10}$$

where $h^*(t,\theta_{w_2,x},I_2) = \cos(\theta_{w_2,x}) + t$ (the fixed case) or $h^*(t,\theta_{w_2,x},I_2) = (t+1)\cos(\theta_{w_2,x}) + t$ (the adaptive case). For class 3, because $x$ is well-classified by the classifier $w_3$, we do not need to give any additional enforcement to further enlarge its margin. Moreover, our MV-Softmax losses also set adaptive margins for different classes. Taking MV-AM-Softmax (i.e., $f(m,\theta) = \cos(\theta) - m$) as an example, for the mis-classified classes the margin is $m + t$ or $m + t\cos(\theta_{w_2,x}) + t$, while for the well-classified classes the margin is $m$. On account of these properties, our MV-Softmax losses address the second and third shortcomings.
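For intuition, consider illustrative values $m = 0.35$ and $t = 0.2$ (placeholders, not the tuned settings). Under the adaptive case, the effective margin of a mis-classified class grows with how badly the feature is confused with it:

$$m + t\cos(\theta_{w_2,x}) + t = \begin{cases} 0.35, & \cos(\theta_{w_2,x}) = -1 \\ 0.55, & \cos(\theta_{w_2,x}) = 0 \\ 0.75, & \cos(\theta_{w_2,x}) = 1 \end{cases}$$

so the closer the feature lies to a wrong class, the stronger the separation it is asked to achieve.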

According to the above discussions, we conclude that our new loss has inherited the merits of feature margin and feature mining into a unified loss function, thus it is expected to achieve more discriminative features for face recognition.

Optimization

In this section, we show that our MV-Softmax loss in Eq. (6) is trainable and can be easily optimized by typical stochastic gradient descent (SGD). The difference between the previous margin-based softmax losses and the proposed MV-Softmax loss lies in the logits computed by the last fully connected layer. For the forward propagation, when $k = y$, the logit is the same as in the original margin-based softmax loss (i.e., $s f(m,\theta_{w_y,x})$). When $k \neq y$, there are two cases: if the feature vector is well-classified by the classifier $w_k$, the logit is the same as in the original softmax (i.e., $s\cos(\theta_{w_k,x})$); otherwise, it is re-computed with the fixed weight of Eq. (7) or the adaptive weight of Eq. (8). The whole scheme of our method is summarized in Algorithm 1.

Input: Training set $\mathcal{S}$; the hyper-parameter $t$; the number of training epochs $E$.
Initialization: $e \leftarrow 0$; randomly initialize the parameters $\Theta$ in the convolution layers and $W$ in the last fully connected layer.
while $e \leq E$ do
        Shuffle the training set $\mathcal{S}$ and fetch a mini-batch; Forward: according to the indication of hard examples in Eq. (5), compute the MV-Softmax loss by Eq. (6); Backward: update the parameters $\Theta$ and $W$ by stochastic gradient descent (SGD); $e \leftarrow e + 1$;
end while
Output: Parameters $\Theta$ and $W$.
Algorithm 1 MV-Softmax
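A toy PyTorch rendering of Algorithm 1, reusing the `mv_softmax_loss` sketch above; the stand-in backbone, random data and class count are illustrative only.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 144 * 144, 512))  # stand-in for the CNN
W = nn.Parameter(torch.randn(512, 1000))                               # last fully connected layer
opt = torch.optim.SGD(list(backbone.parameters()) + [W],
                      lr=0.1, momentum=0.9, weight_decay=5e-4)

for epoch in range(12):                                   # E training epochs
    for _ in range(100):                                  # mini-batches from the shuffled set
        images = torch.randn(32, 3, 144, 144)             # placeholder for real face crops
        labels = torch.randint(0, 1000, (32,))
        loss = mv_softmax_loss(backbone(images), W, labels, t=0.2)  # forward (Eqs. 5-6)
        opt.zero_grad()
        loss.backward()                                   # backward: update Theta and W by SGD
        opt.step()
```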

Experiments

Datasets Identities Images
Training MS-Celeb-1M-v1c-R 72,690 3.28M
Test LFW 5,749 13,233
CALFW 5,749 12,174
CPLFW 5,749 11,652
AgeDB 568 16,488
CFP 500 7,000
RFW 11,430 40,607
MegaFace 530(P) 1M(G)
Trillion-Pairs 5,749(P) 1.58M(G)
Table 1: Face datasets for training and test. "(P)" and "(G)" refer to the probe and gallery set, respectively.

Datasets

Training Data. The original MS-Celeb-1M dataset [5] contains about 100K identities with 10M images. However, it includes a great many noisy faces. Fortunately, the trillion-pairs consortium [2] has cleaned it to produce a high-quality version, MS-Celeb-1M-v1c, which is suitable for training.

Test Data. We use eight face recognition benchmarks, including LFW [9], CALFW [44], CPLFW [45], AgeDB [19], CFP [23], RFW [36], MegaFace [10, 20] and Trillion-Pairs [2], as the test data. For more details about the test datasets, please see their references.

Dataset Overlap Removal. In face recognition, it is very important to perform open-set evaluation, i.e., there should be no overlapping identities between the training set and the test sets. To this end, we carefully remove the overlapping identities between the employed training dataset (i.e., MS-Celeb-1M-v1c) and the test datasets (including LFW, CALFW, CPLFW, AgeDB, CFP, RFW and MegaFace). (For the Trillion-Pairs test set, we cannot remove the potential overlaps because its ground truth labels, i.e., names, are unreleased.) For the overlap removal tool, we use the publicly available script provided by [33] to check whether two names refer to the same person. As a consequence, we remove 14,186 identities from the training set MS-Celeb-1M-v1c. For clarity, we denote the refined training dataset as MS-Celeb-1M-v1c-R. Important statistics of all the involved datasets are summarized in Table 1. To be rigorous, all the experiments in this paper are based on the refined training set MS-Celeb-1M-v1c-R. To encourage more researchers to abide by the open-set protocol, the overlapping lists and the refined dataset MS-Celeb-1M-v1c-R are publicly available.

Method BLUFR (TPR@FAR=1e-5) CALFW AgeDB
MV-Arc-Softmax-f (0.15) 94.60 95.54 98.05
MV-Arc-Softmax-f (0.2) 95.18 95.46 98.11
MV-Arc-Softmax-f (0.25) 94.04 95.51 98.08
MV-Arc-Softmax-a (0.25) 94.15 95.33 97.86
MV-Arc-Softmax-a (0.3) 95.50 95.46 98.06
MV-Arc-Softmax-a (0.35) 95.08 95.50 97.90
MV-AM-Softmax-f (0.2) 94.81 95.29 98.01
MV-AM-Softmax-f (0.25) 95.74 95.45 98.05
MV-AM-Softmax-f (0.3) 95.07 95.41 98.00
MV-AM-Softmax-a (0.15) 94.09 95.41 98.13
MV-AM-Softmax-a (0.2) 96.27 95.63 98.00
MV-AM-Softmax-a (0.25) 94.29 95.51 97.96
Table 2: Verification performance (%) of our MV-Softmax loss functions with different values of the hyper-parameter $t$. 'f' and 'a' denote the fixed re-weighting function Eq. (7) and the adaptive one Eq. (8), respectively.
Method LFW BLUFR@1e-3 BLUFR@1e-4 BLUFR@1e-5 CALFW CPLFW AgeDB CFP
Baseline Softmax 99.59 99.29 99.11 91.74 94.66 87.76 97.01 94.04
Mining-based F-Softmax 99.65 99.24 98.72 91.19 93.83 86.35 96.51 93.20
HM-Softmax 99.65 99.30 99.11 92.03 94.69 87.56 97.05 94.12
Margin-based A-Softmax 99.65 99.30 99.12 92.77 94.55 87.85 97.16 94.22
Arc-Softmax 99.76 99.33 99.30 93.75 95.44 88.78 98.00 95.28
AM-Softmax 99.71 99.33 99.31 93.68 95.58 89.60 98.03 95.68
Naive-fused F-Arc-Softmax 99.71 99.33 99.29 94.51 95.48 88.85 98.10 95.62
F-AM-Softmax 99.73 99.33 99.30 92.81 95.58 89.60 98.20 95.47
HM-Arc-Softmax 99.75 99.33 99.29 93.53 95.36 89.16 97.86 95.22
HM-AM-Softmax 99.76 99.33 99.30 96.09 95.45 89.56 98.05 95.37
Ours MV-Arc-Softmax-f (0.2) 99.78 99.34 99.30 95.18 95.46 89.30 98.11 95.21
MV-Arc-Softmax-a (0.3) 99.76 99.33 99.30 95.50 95.46 89.41 98.06 95.45
MV-AM-Softmax-f (0.25) 99.79 99.33 99.31 95.74 95.45 89.69 98.05 95.70
MV-AM-Softmax-a (0.2) 99.79 99.33 99.30 96.27 95.63 89.19 98.00 95.30
Table 3: Verification performance (%) of different loss functions on the test sets LFW, CALFW, CPLFW, AgeDB and CFP.

Experimental Settings

Data Processing. We detect the faces by adopting the FaceBoxes detector [43, 42] and localize five landmarks (two eyes, nose tip and two mouth corners) through a simple 6-layer CNN [4, 17]. The detected faces are cropped and resized to 144×144, and each pixel (ranged between [0, 255]) in the RGB images is normalized by subtracting 127.5 and dividing by 128. All training faces are horizontally flipped with probability 0.5 for data augmentation.
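Assuming detection and landmark alignment have already produced an aligned crop, the described normalization and augmentation amount to the following NumPy sketch (names are illustrative):

```python
import numpy as np

def preprocess(face, train=True):
    # `face` is an aligned RGB crop already resized to 144x144 (H, W, 3);
    # detection and landmark localization are assumed to happen upstream.
    img = (np.asarray(face, dtype=np.float32) - 127.5) / 128.0  # [0, 255] -> roughly [-1, 1]
    if train and np.random.rand() < 0.5:
        img = img[:, ::-1, :].copy()  # horizontal flip with probability 0.5
    return img
```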

CNN Architecture. In face recognition, there are many kinds of network architectures [15, 33, 31]. To be fair, the CNN architecture should be kept the same when testing different loss functions. As suggested by [31], we use AttentionNet [32] to achieve a good balance between computation and accuracy. Moreover, inspired by [3], we integrate the IRSE module into AttentionNet and name the resulting architecture AttentionNet-IRSE. For the depth stages of AttentionNet-IRSE, we set [1, 1, 1] as our baseline architecture. The output of AttentionNet-IRSE is a 512-dimensional feature.

Training. All the CNN models are trained from scratch with the stochastic gradient descent (SGD) algorithm, with a batch size of 32 per GPU on 4 P40 or 4 V100 GPUs in parallel (total batch size 128). The weight decay is set to 0.0005 and the momentum to 0.9. The learning rate is initially 0.1 and is divided by 10 at epochs 4, 8 and 10; training finishes at epoch 12. All experiments in this paper are implemented with the PyTorch library.

Test. At the test stage, only the original image features are employed to compose the face representation. All reported results in this paper are evaluated with a single model, without model ensembles or other fusion strategies.

For the evaluation metric, the cosine similarity is utilized. We follow the unrestricted with labelled outside data protocol [9] to report the performance on LFW, CALFW, CPLFW, AgeDB, CFP and RFW. Moreover, we also report results under the BLUFR protocol [12] on LFW. On the MegaFace and Trillion-Pairs challenges, face identification and verification are conducted by ranking and thresholding the scores. Specifically, for face identification, Cumulative Match Characteristic (CMC) curves are adopted to evaluate the Rank-1 accuracy; for face verification, Receiver Operating Characteristic (ROC) curves are adopted. The true positive rate (TPR) at a low false acceptance rate (FAR) is emphasized, since in real applications false acceptance carries higher risk than false rejection.
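For clarity, here is a small sketch of the verification metric: cosine scoring of feature pairs and TPR at a fixed FAR, with the threshold chosen on the impostor scores. This is a simplified illustration, not the official evaluation code.

```python
import numpy as np

def cosine_scores(f1, f2):
    """Cosine similarity between paired feature rows of shape (N, d)."""
    f1 = f1 / np.linalg.norm(f1, axis=1, keepdims=True)
    f2 = f2 / np.linalg.norm(f2, axis=1, keepdims=True)
    return np.sum(f1 * f2, axis=1)

def tpr_at_far(scores, same, far=1e-5):
    """TPR at a given FAR; `same` is a boolean array (True = genuine pair)."""
    neg = np.sort(scores[~same])                          # impostor scores, ascending
    idx = min(len(neg) - 1, int(np.ceil((1.0 - far) * len(neg))))
    thr = neg[idx]                                        # ~far of negatives exceed thr
    return float(np.mean(scores[same] > thr))
```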

For the compared methods, we compare our method with the baseline Softmax loss and recently proposed state-of-the-art methods, including two mining-based softmax losses (F-Softmax and HM-Softmax), three margin-based softmax losses (A-Softmax, Arc-Softmax and AM-Softmax) and their four naive fusions (F-Arc-Softmax, F-AM-Softmax, HM-Arc-Softmax and HM-AM-Softmax). For all the competitors, the source code can be downloaded from GitHub or from the authors' webpages. The corresponding parameters of each competitor are mainly determined according to the suggestions in the respective papers. Specifically, for HM-Softmax [25], we keep the 90% highest-loss samples in each mini-batch for training. For F-Softmax, the modulating factor $\gamma$ follows [13]. For A-Softmax, the margin parameter follows [15], while for AM-Softmax and Arc-Softmax, the margin parameters follow [33] and [3], respectively. The scale parameter $s$ has already been discussed sufficiently in previous works [33, 34]; in this paper, we empirically fix it to 32 for all the methods.

Exploratory Experiments

Effect of the parameter $t$. Since the hyper-parameter $t$ in the re-weighting functions Eqs. (7) and (8) plays an important role in the developed MV-Softmax loss, we first search for its best value. In Table 2, we list the performance of our proposed MV-Softmax loss as $t$ varies over different ranges; 'f' and 'a' denote the fixed re-weighting function Eq. (7) and the adaptive one Eq. (8), respectively. From the numbers, we can summarize that our MV-Softmax loss is insensitive to the hyper-parameter $t$ within a certain range. Moreover, according to this study, we empirically set $t = 0.2$ for MV-Arc-Softmax-f, $t = 0.3$ for MV-Arc-Softmax-a, $t = 0.25$ for MV-AM-Softmax-f and $t = 0.2$ for MV-AM-Softmax-a in the subsequent experiments.

Convergence of MV-Softmax. Although the convergence of our method is not easy to analyze theoretically, it is intuitive to inspect its empirical behavior. Here, we plot the loss as the number of epochs increases. From the curves in Figure 2, it can be observed that our method converges well.

Figure 2: Convergence of MV-Softmax. From the curves, we can see that our MV-Softmax loss functions converge well.

Results on LFW, CALFW, CPLFW, AgeDB, CFP

Table 3 provides the quantitative results of all the competitors on LFW, CALFW, CPLFW, AgeDB and CFP. The bold number in each column represents the best result. The LFW accuracy and its BLUFR protocol at the higher false alarm rates (e.g., 1e-3 and 1e-4) are typical and easy protocols for face recognition; for instance, almost all the competitors achieve about 99% there, so the improvement of our MV-Softmax losses is not large. For BLUFR at TPR@FAR=1e-5, we can see that the naive fusion HM-AM-Softmax outperforms the baseline Softmax, the simple mining-based losses and the margin-based ones. Despite this, our MV-AM-Softmax still achieves about 0.2% improvement. On the CALFW, CPLFW, AgeDB and CFP test sets, we also observe that our MV-Softmax losses are better than the state-of-the-art alternatives in most cases. Nevertheless, the improvements on these test sets are not by a large margin, because the test protocols are relatively easy and the performance of all the methods is near saturation. So there is an urgent need to test all the competitors on new test sets or with more complicated protocols.

Method Caucasian Indian Asian African
Softmax 98.33 93.33 93.16 91.33
F-Softmax 97.50 90.30 91.16 88.33
HM-Softmax 98.66 93.49 92.83 90.50
A-Softmax 98.83 94.33 93.33 91.33
Arc-Softmax 98.83 96.16 93.66 95.00
AM-Softmax 99.16 96.16 94.46 95.83
F-Arc-Softmax 98.99 95.83 94.16 95.50
F-AM-Softmax 99.16 96.66 93.66 95.00
HM-Arc-Softmax 98.66 94.33 94.16 96.66
HM-AM-Softmax 99.16 94.66 93.33 96.00
MV-Arc-Softmax-f 98.66 96.83 94.50 96.50
MV-Arc-Softmax-a 98.00 94.66 94.83 95.99
MV-AM-Softmax-f 99.00 94.99 94.83 96.66
MV-AM-Softmax-a 99.33 95.83 95.66 95.83
Table 4: Verification performance (%) of different loss functions on the test set RFW.

Results on RFW

Firstly, we evaluate all the competitors on the recently proposed test set RFW [36]. RFW is a face recognition benchmark for measuring racial bias, which consists of four test subsets: Caucasian, Indian, Asian and African. Table 4 displays the performance of all the involved methods. From the values, we can conclude that the results on the four subsets exhibit the same trend: the margin-based losses are better than the baseline Softmax loss and the mining-based losses, and the improvement from simply combining the margin-based and mining-based losses is limited. Our mis-classified vector guided ones, which explicitly emphasize the mis-classified feature vectors during training, are more consistent with discriminative feature learning. Therefore, they inherently absorb the merits of feature margin and feature mining into a unified loss function, usually achieve more discriminative face features, and reach higher performance than the previous alternatives.

Method MegaFace Id. MegaFace Veri. Trillion-Pairs Id. Trillion-Pairs Veri.
Softmax 93.94 94.76 60.06 59.00
F-Softmax 91.60 93.06 51.14 48.32
HM-Softmax 93.95 95.53 61.34 60.07
A-Softmax 94.18 95.26 60.34 59.01
Arc-Softmax 97.28 97.58 70.80 68.12
AM-Softmax 97.69 97.82 74.00 71.57
F-Arc-Softmax 97.51 97.81 70.65 69.06
F-AM-Softmax 95.75 97.75 73.82 72.18
HM-Arc-Softmax 97.43 97.56 70.08 68.16
HM-AM-Softmax 97.48 97.64 73.89 71.63
MV-Arc-Softmax-f 97.52 98.01 73.90 71.28
MV-Arc-Softmax-a 97.74 97.62 75.44 74.69
MV-AM-Softmax-f 97.95 97.85 75.92 74.45
MV-AM-Softmax-a 98.00 98.31 76.94 75.93
Table 5: Performance (%) of different loss functions on MegaFace and Trillion-Pairs Challenge.

Results on MegaFace and Trillion-Pairs

We then test all the competitors under more complicated protocols. Specifically, Table 5 reports the identification (Id.) Rank-1 and the verification (Veri.) TPR@FAR=1e-6 on MegaFace, and the identification TPR@FAR=1e-3 and the verification TPR@FAR=1e-9 on Trillion-Pairs. From the numbers, we observe that our MV-AM-Softmax-a achieves the best performance over the baseline Softmax loss, the mining-based softmax losses, the margin-based softmax losses and the naive combinations of the two, on both MegaFace and Trillion-Pairs. Specifically, on MegaFace, our MV-AM-Softmax-a beats the best margin-based competitor, the AM-Softmax loss, by a clear margin (about 0.3% on identification and 0.5% on verification), and it is also better than the naive fusions of mining-based and margin-based losses. Moreover, comparing MV-Softmax-a with MV-Softmax-f, we can say that the adaptive re-weighting function Eq. (8) is generally better than the fixed one Eq. (7). This is reasonable because more difficult mis-classified feature vectors should be more important for discriminative feature learning. In Figure 3, we also draw the CMC curves for face identification and the ROC curves for face verification on MegaFace Set 1; the curves show similar trends under these measures. On the Trillion-Pairs Challenge, the results exhibit the same trends as on MegaFace, and the trends are even more obvious: we achieve at least 3% improvement on both identification and verification. These experiments clearly demonstrate that our MV-AM-Softmax-a approach is superior for both identification and verification tasks, especially when the false positive rate is very low. To sum up, by inheriting the advantages of both margin-based and mining-based softmax losses, our newly designed mis-classified vector guided loss shows strong generalization ability for face recognition.

Figure 3: From Left to Right: CMC curves and ROC curves of different loss functions with 1M distractors on MegaFace Set 1.

Conclusion

This paper has proposed a simple yet very effective loss function, namely the mis-classified vector guided softmax loss (MV-Softmax), for face recognition. Specifically, the MV-Softmax loss explicitly concentrates on optimizing the mis-classified feature vectors, and thus semantically inherits the motivations of feature margin and feature mining into a unified loss function. Consequently, it exhibits higher performance than the baseline Softmax loss, the current mining-based losses, the margin-based losses and their naive fusions. Extensive experiments on several face recognition benchmarks have validated the effectiveness of our new approach over the state-of-the-art alternatives.

References

  • [1] B. Chen, W. Deng, and H. Shen (2018) Virtual class enhanced discriminative embedding learning. In NeurIPS.
  • [2] DeepGlint (2018) Trillion-Pairs Challenge. http://trillionpairs.deepglint.com/overview
  • [3] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. In CVPR.
  • [4] Z. Feng, J. Kittler, M. Awais, P. Huber, and X. Wu (2017) Wing loss for robust facial landmark localisation with convolutional neural networks. arXiv:1711.06753.
  • [5] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao (2016) MS-Celeb-1M: a dataset and benchmark for large-scale face recognition. In ECCV.
  • [6] K. He, X. Zhang, and S. Ren (2016) Deep residual learning for image recognition. In CVPR.
  • [7] G. Hu, X. Peng, Y. Yang, T. M. Hospedales, and J. Verbeek (2017) Frankenstein: learning deep face representations using small data. TIP.
  • [8] G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S. Z. Li, and T. Hospedales (2015) When face recognition meets with deep learning: an evaluation of convolutional neural networks for face recognition. In CVPRW.
  • [9] G. Huang, M. Ramesh, and E. Miller (2007) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report.
  • [10] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard (2016) The MegaFace benchmark: 1 million faces for recognition at scale. In CVPR.
  • [11] X. Liang, X. Wang, Z. Lei, S. Liao, and S. Z. Li (2017) Soft-margin softmax for deep classification. In ICONIP.
  • [12] S. Liao, Z. Lei, D. Yi, and S. Z. Li (2014) A benchmark study of large-scale unconstrained face recognition. In ICB.
  • [13] Y. Lin, P. Goyal, and R. Girshick (2017) Focal loss for dense object detection. In ICCV.
  • [14] H. Liu, X. Zhu, Z. Lei, and S. Z. Li (2019) AdaptiveFace: adaptive margin and sampling for face recognition. In CVPR.
  • [15] W. Liu, Y. Wen, Z. Yu, M. Li, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. In CVPR.
  • [16] W. Liu, Y. Wen, and Z. Yu (2016) Large-margin softmax loss for convolutional neural networks. In ICML.
  • [17] Y. Liu, H. Shi, Y. Si, H. Shen, X. Wang, and T. Mei (2019) A high-efficiency framework for constructing large-scale face parsing benchmark. arXiv:1905.04830.
  • [18] Z. Liu, G. Hu, and J. Wang (2019) Learning discriminative and complementary patches for face recognition. In FG.
  • [19] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou (2017) AgeDB: the first manually collected, in-the-wild age database. In CVPRW.
  • [20] A. Nech and I. Kemelmacher (2017) Level playing field for million scale face recognition. In CVPR.
  • [21] R. Ranjan, C. Castillo, and R. Chellappa (2017) L2-constrained softmax loss for discriminative face verification. arXiv:1703.09507.
  • [22] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In CVPR.
  • [23] S. Sengupta, J. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs (2016) Frontal to profile face verification in the wild. In WACV.
  • [24] H. Shi, X. Wang, D. Yi, Z. Lei, X. Zhu, and S. Z. Li (2017) Cross-modality face recognition via heterogeneous joint Bayesian. SPL.
  • [25] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In CVPR.
  • [26] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
  • [27] Y. Sun, X. Wang, and X. Tang (2014) Deep learning face representation from predicting 10,000 classes. In CVPR.
  • [28] Y. Sun, X. Wang, and X. Tang (2015) Deeply learned face representations are sparse, selective, and robust. In CVPR.
  • [29] Y. Taigman, M. Yang, and M. Ranzato (2014) DeepFace: closing the gap to human-level performance in face verification. In CVPR.
  • [30] F. Wang, X. Xiang, J. Chen, and A. Yuille (2017) NormFace: hypersphere embedding for face verification. In ACM MM.
  • [31] F. Wang, L. Chen, C. Li, S. Huang, Y. Chen, C. Qian, and C. C. Loy (2018) The devil of face recognition is in the noise. In ECCV.
  • [32] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. arXiv:1704.06904.
  • [33] F. Wang, J. Cheng, W. Liu, and H. Liu (2018) Additive margin softmax for face verification. SPL.
  • [34] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu (2018) CosFace: large margin cosine loss for deep face recognition. arXiv:1801.09414.
  • [35] J. Wang, F. Zhou, and S. Wen (2017) Deep metric learning with angular loss. In ICCV.
  • [36] M. Wang, W. Deng, J. Hu, J. Peng, X. Tao, and Y. Huang (2018) Racial faces in-the-wild: reducing racial bias by deep unsupervised domain adaptation. arXiv:1812.00194.
  • [37] X. Wang, X. Guo, and S. Z. Li (2015) Adaptively unified semi-supervised dictionary learning with active points. In ICCV.
  • [38] X. Wang, S. Wang, J. Wang, H. Shi, and T. Mei (2019) Co-mining: deep face recognition with noisy labels. In ICCV.
  • [39] X. Wang, S. Wang, S. Zhang, T. Fu, and T. Mei (2018) Support vector guided softmax loss for face recognition. arXiv:1812.11317.
  • [40] X. Wang, S. Zhang, Z. Lei, S. Liu, X. Guo, and S. Z. Li (2018) Ensemble soft-margin softmax loss for image classification. arXiv:1805.03922.
  • [41] Y. Wen, K. Zhang, and Z. Li (2016) A discriminative feature learning approach for deep face recognition. In ECCV.
  • [42] S. Zhang, X. Wang, Z. Lei, and S. Z. Li (2019) FaceBoxes: a CPU real-time and accurate unconstrained face detector. Neurocomputing.
  • [43] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li (2017) FaceBoxes: a CPU real-time face detector with high accuracy. In IJCB.
  • [44] T. Zheng, W. Deng, and J. Hu (2017) Cross-Age LFW: a database for studying cross-age face recognition in unconstrained environments. arXiv:1708.08197.
  • [45] T. Zheng and W. Deng (2018) Cross-Pose LFW: a database for studying cross-pose face recognition in unconstrained environments. Tech. Rep.
  • [46] Y. Zheng, D. K. Pal, and M. Savvides (2018) Ring loss: convex feature normalization for face recognition. In CVPR.