Support Vector Guided Softmax Loss for Face Recognition

by   Xiaobo Wang, et al., Inc.

Face recognition has witnessed significant progress due to the advances of deep convolutional neural networks (CNNs), the central challenge of which is feature discrimination. To address it, one group exploits mining-based strategies (e.g., hard example mining and focal loss) to focus on the informative examples. The other group is devoted to designing margin-based loss functions (e.g., angular, additive and additive angular margins) to increase the feature margin from the perspective of the ground truth class. Both have been well verified to learn discriminative features. However, they suffer from either the ambiguity of hard examples or the lack of discriminative power of other classes. In this paper, we design a novel loss function, namely support vector guided softmax loss (SV-Softmax), which adaptively emphasizes the mis-classified points (support vectors) to guide the discriminative feature learning. The developed SV-Softmax loss is thus able to eliminate the ambiguity of hard examples as well as absorb the discriminative power of other classes, and results in more discriminative features. To the best of our knowledge, this is the first attempt to inherit the advantages of mining-based and margin-based losses into one framework. Experimental results on several benchmarks have demonstrated the effectiveness of our approach over the state of the art.





1 Introduction

Face recognition is a fundamental task of great practical value in the community of computer vision and pattern recognition. The task of face recognition contains two categories: face identification, which classifies a given face to a specific identity, and face verification, which determines whether a pair of face images are of the same identity. Though it has been extensively studied for decades [38, 2, 35, 25, 27, 21], there still exist a great many challenges for accurate face recognition, especially on large-scale test datasets, such as MegaFace Challenge [9] or Trillion Pairs Challenge.

In recent years, advanced face recognition models are usually built upon deep convolutional neural networks [31, 7, 23], and the learned discriminative features play a significant role. To train deep models, the CNNs are generally equipped with classification loss functions [28, 32, 37, 10, 14, 36], metric learning loss functions [26, 20, 34] or both [18, 27, 37, 41]. Metric learning loss functions such as contrastive loss [26] or triplet loss [20] usually suffer from high computational cost. To avoid this problem, they require carefully designed sample mining strategies, and the performance is very sensitive to these strategies. Therefore, more and more researchers have shifted their attention to constructing deep face recognition models by re-designing the classification loss functions.

Intuitively, face features are discriminative if their intra-class compactness and inter-class separability are well maximized. However, as pointed out by many recent studies [37, 32, 14, 30, 36, 4], the current prevailing classification loss function (i.e., Softmax loss) usually lacks the power of feature discrimination for deep face recognition. To address this issue, one group proposes to explore the mining-based loss functions [22, 12, 24, 39]. Shrivastava et al. [22] develop a hard mining softmax (HM-Softmax) to improve the feature discrimination by constructing mini-batches using high-loss examples. In this approach, the percentage of hard examples is empirically decided and the easy examples are completely discarded. In contrast, Lin et al. [12] design a relatively soft mining softmax, namely Focal loss (F-Softmax), to focus training on a sparse set of hard examples. It usually achieves more promising results than the simple hard mining softmax. Yuan et al. [39] select the hard examples based on model complexity and train an ensemble to model examples of different hardness levels. The other group prefers to design margin-based loss functions [14, 30, 4]. This group does not focus on optimizing hard examples but directly increases the feature margin between different classes. Wen et al. [37] develop a center loss to learn centers for each identity to enhance the intra-class compactness. Wang et al. [32] and Ranjan et al. [19] propose to use a scale parameter to control the temperature of the softmax loss, producing higher gradients for the well-separated samples to shrink the intra-class variance. Liu et al. [13, 14] introduce an angular margin (A-Softmax) between the ground truth class and other classes to encourage larger inter-class variance. However, it is usually unstable and the optimal parameters are hard to determine. To enhance the stability of A-Softmax loss, several alternative approaches [30, 36, 15, 4] have been proposed. Wang et al. [30] design an additive margin (AM-Softmax) loss to stabilize the optimization and have achieved promising performance. Deng et al. [4] develop an additive angular margin (Arc-Softmax) loss, which has a clearer geometric interpretation.

Both groups have been well verified to learn discriminative features for face recognition. The motivation of mining-based losses is to focus on hard examples, while that of margin-based losses is to enlarge the feature margin between different classes. Currently, they are developed independently, and both have their own intrinsic drawbacks. For the mining-based losses, the definition of hard examples is ambiguous and they are often empirically selected. How to semantically decide the hard examples is still an open problem. For the margin-based losses, most of them learn discriminative features by enlarging the feature margin only from the perspective of the ground truth class (self-motivation). They usually ignore the discriminative power from the perspective of other non-ground truth classes (other-motivation). Moreover, the relation between mining-based and margin-based losses remains unclear.

To overcome the above shortcomings, this paper designs a new loss function, which adaptively emphasizes the informative support vectors to bridge the gap between mining-based and margin-based losses and semantically integrate them into one framework. The main contributions of this paper can be summarized as follows:

  • We propose a novel SV-Softmax loss, which eliminates the ambiguity of hard examples as well as absorbs the discriminative power of other classes by focusing on support vectors. To the best of our knowledge, this is the first attempt to semantically fuse the mining-based and margin-based losses into one framework.

  • We deeply analyze the relations of our SV-Softmax loss to the current mining-based and margin-based losses, and further develop an improved version, the SV-X-Softmax loss, to enhance the feature discrimination. Our code will be available at

  • We conduct extensive experiments on the benchmarks of LFW [8], MegaFace Challenge [9, 16] and Trillion Pairs Challenge, which have verified the superiority of our new approach over the baseline Softmax loss, the mining-based Softmax losses, the margin-based Softmax losses, and their naive fusions.

2 Preliminary Knowledge

Softmax. Softmax loss is defined as the pipeline combination of the last fully connected layer, the softmax function and the cross-entropy loss. In face recognition, the weights $w_k$ (where $k \in \{1, 2, \dots, K\}$ and $K$ is the number of classes) and the feature $x$ of the last fully connected layer are usually normalized and the magnitude is replaced by a scale parameter $s$ [32, 30, 4]. Consequently, given an input feature vector $x$ with its corresponding ground truth label $y$, the softmax loss can be formulated as follows:

$$\mathcal{L}_1 = -\log \frac{e^{s\cos(\theta_{w_y,x})}}{e^{s\cos(\theta_{w_y,x})} + \sum_{k \neq y}^{K} e^{s\cos(\theta_{w_k,x})}}, \tag{1}$$

where $\cos(\theta_{w_k,x}) = w_k^{\top}x$ is the cosine similarity and $\theta_{w_k,x}$ is the angle between $w_k$ and $x$. As pointed out by a great many studies [13, 14, 30, 4], the features learned with softmax loss are prone to be separable, rather than discriminative, for face recognition.
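As a minimal sketch of the normalized softmax loss described above, the following pure-Python code normalizes the feature and class weights, takes their cosine similarities, and applies cross-entropy to the scaled logits. The function names, the toy vectors and the default `s=30.0` (the value the experiments later fix) are illustrative assumptions, not the authors' implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def cosine_logits(x, W):
    """Cosine similarity between feature x and each class weight vector in W."""
    x = normalize(x)
    return [sum(a * b for a, b in zip(normalize(w), x)) for w in W]

def softmax_loss(cosines, y, s=30.0):
    """Cross-entropy over the scaled cosine logits s * cos(theta_k)."""
    logits = [s * c for c in cosines]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[y]

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # three toy class weights (hypothetical)
x = [2.0, 1.0]                             # toy feature whose true class is 0
cos = cosine_logits(x, W)
loss = softmax_loss(cos, y=0)
```

Because the toy feature is closest to class 0, the loss for `y=0` is near zero, while labelling the same feature as class 1 gives a much larger loss.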

Mining-based Softmax. Hard example mining is becoming a common practice to effectively train deep CNNs. Its idea is to focus training on the informative examples, and thus it usually results in more discriminative features. Recent works select hard examples based on loss value [22, 12] or model complexity [39] to learn discriminative features. Generally, they can be summarized as:

$$\mathcal{L}_2 = -g(p_y)\log \frac{e^{s\cos(\theta_{w_y,x})}}{e^{s\cos(\theta_{w_y,x})} + \sum_{k \neq y}^{K} e^{s\cos(\theta_{w_k,x})}}, \tag{2}$$

where $p_y = \frac{e^{s\cos(\theta_{w_y,x})}}{e^{s\cos(\theta_{w_y,x})} + \sum_{k \neq y}^{K} e^{s\cos(\theta_{w_k,x})}}$ is the predicted ground truth probability and $g(p_y)$ is an indicator function. Specifically, for the soft mining method Focal loss [12] (F-Softmax), $g(p_y) = (1 - p_y)^{\gamma}$, where $\gamma$ is a modulating factor. For the hard mining method HM-Softmax [22], $g(p_y) = 0$ when the sample is indicated as easy, while $g(p_y) = 1$ when the sample is hard. However, the definition of hardness is ambiguous, and these methods usually lead to sensitive performance.
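The two weighting schemes just described can be sketched as follows. This is only an illustration: `gamma=2.0` and `keep_ratio=0.5` are assumed values, and selecting the top fraction of a batch by loss is one common reading of hard mining rather than the exact procedure of [22].

```python
def focal_weight(p_y, gamma=2.0):
    """Soft mining (Focal loss style): weight (1 - p_y) ** gamma."""
    return (1.0 - p_y) ** gamma

def hard_mining_weights(losses, keep_ratio=0.5):
    """Hard mining: weight 1 for the highest-loss fraction of the batch, else 0."""
    n_keep = max(1, int(round(keep_ratio * len(losses))))
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    hard = set(order[:n_keep])
    return [1.0 if i in hard else 0.0 for i in range(len(losses))]

# A confident sample (p_y = 0.9) is down-weighted far more than an uncertain one.
w_easy = focal_weight(0.9)
w_hard = focal_weight(0.2)
mask = hard_mining_weights([0.1, 2.0, 0.5, 1.2])
```

Both schemes depend on an empirically chosen quantity (`gamma` or `keep_ratio`), which is the ambiguity the text points out.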

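Margin-based Softmax. To directly enhance the feature discrimination, several margin-based softmax loss functions [14, 36, 30, 4] have been proposed in recent years. In summary, they can be defined as follows:

$$\mathcal{L}_3 = -\log \frac{e^{s f(m,\theta_{w_y,x})}}{e^{s f(m,\theta_{w_y,x})} + \sum_{k \neq y}^{K} e^{s\cos(\theta_{w_k,x})}}, \tag{3}$$

where $f(m,\theta_{w_y,x})$ is a carefully designed margin function. Specifically, $f(m_1,\theta_{w_y,x}) = \cos(m_1\theta_{w_y,x})$ is the motivation of A-Softmax loss [14], where $m_1 \geq 1$ is an integer. $f(m_2,\theta_{w_y,x}) = \cos(\theta_{w_y,x}) - m_2$ with $m_2 > 0$ is the AM-Softmax loss [30]. $f(m_3,\theta_{w_y,x}) = \cos(\theta_{w_y,x} + m_3)$ with $m_3 > 0$ is the Arc-Softmax loss [4]. More generally, the margin function can be summarized into a combined version: $f(m,\theta_{w_y,x}) = \cos(m_1\theta_{w_y,x} + m_3) - m_2$. However, all these methods achieve the feature margin only from the perspective of the ground truth class $y$. They are not aware of the importance of other non-ground truth classes.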
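The combined margin function can be sketched in a few lines. The parameter names `m1`, `m2`, `m3` (multiplicative angular, additive cosine and additive angular margins) are an assumed labelling, and the plain `cos(m1 * theta)` form is only the motivation of A-Softmax, not its full piecewise definition.

```python
import math

def margin_cos(cos_theta, m1=1, m2=0.0, m3=0.0):
    """Combined margin cos(m1 * theta + m3) - m2 applied to the true-class cosine."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp for safety
    return math.cos(m1 * theta + m3) - m2

plain = margin_cos(0.5)           # no margin: returns the cosine itself
am = margin_cos(0.5, m2=0.35)     # AM-Softmax style additive cosine margin
arc = margin_cos(0.5, m3=0.5)     # Arc-Softmax style additive angular margin
a_soft = margin_cos(0.5, m1=2)    # A-Softmax style multiplicative angular margin
```

Each margin lowers the true-class score, so the network must pull the feature closer to its class weight to reach the same loss.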

3 Problem Formulation

3.1 Naive Mining-Margin Softmax Loss

The mining-based loss functions aim to focus on the hard examples, while the margin-based loss functions aim to enlarge the feature margin between different classes. Therefore, these two branches can be seamlessly incorporated into each other. The naive way to directly integrate them can be formulated as:

$$\mathcal{L}_4 = -g(p_y)\log \frac{e^{s f(m,\theta_{w_y,x})}}{e^{s f(m,\theta_{w_y,x})} + \sum_{k \neq y}^{K} e^{s\cos(\theta_{w_k,x})}}. \tag{4}$$

However, the formulation in Eq. (4) only absorbs their own merits. It cannot solve their respective shortcomings. In detail, it only encourages the feature margin from the perspective of the ground truth class by $f(m,\theta_{w_y,x})$ (self-motivation), ignoring the feature discriminative power of other non-ground truth classes (other-motivation). Moreover, the hard examples are still empirically selected by the indicator function $g(p_y)$, without semantic guidance. In other words, the definition of hard examples is ambiguous.

Figure 1: A geometrical interpretation of SV-Softmax loss from feature perspective. The support vectors (red circle points) are those who are mis-classified by the current classifiers. SV-Softmax loss semantically focuses on optimizing such support vectors.

3.2 Support Vector Guided Softmax Loss

Intuition says that considering the well-separated feature vectors has little effect on the learning problem. That means the mis-classified feature vectors are more crucial to enhance the feature discriminability. Motivated by this, the hard example mining [22] and the recent Focal loss [12] techniques are proposed to focus training on a sparse set of hard examples and ignore the vast number of easy ones during training. However, they either empirically sample hard examples according to loss values or empirically down-weight the easy examples by a modulating factor. In other words, the definition of hard examples is ambiguous, and without intuitive interpretation.

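To address it, we alternatively introduce a more elegant way to focus training on the informative features (i.e., support vectors). Specifically, we define a binary mask $I_k$ to adaptively indicate whether a sample is selected as the support vector by a specific classifier in the current stage. To this end, the binary mask is defined as follows:

$$I_k = \begin{cases} 0, & \cos(\theta_{w_y,x}) - \cos(\theta_{w_k,x}) \geq 0 \\ 1, & \cos(\theta_{w_y,x}) - \cos(\theta_{w_k,x}) < 0 \end{cases} \tag{5}$$

From the definition, we can see that if a sample is mis-classified, i.e., $\cos(\theta_{w_y,x}) < \cos(\theta_{w_k,x})$, it will be emphasized temporarily. In this way, the concept of hard examples is clearly defined and we mainly focus on such a sparse set of support vectors. Consequently, our Support Vector Guided Softmax (SV-Softmax) loss is formulated as:

$$\mathcal{L}_5 = -\log \frac{e^{s\cos(\theta_{w_y,x})}}{e^{s\cos(\theta_{w_y,x})} + \sum_{k \neq y}^{K} h(t,\theta_{w_k,x},I_k)\, e^{s\cos(\theta_{w_k,x})}}, \tag{6}$$

where $t$ is a preset hyperparameter and the indicator function $h(t,\theta_{w_k,x},I_k)$ is defined as:

$$h(t,\theta_{w_k,x},I_k) = e^{s(t-1)(\cos(\theta_{w_k,x})+1)I_k}. \tag{7}$$

Obviously, when $t = 1$, the designed SV-Softmax loss becomes identical to the original softmax loss. Figure 1 gives the geometrical interpretation of our SV-Softmax loss.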
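A minimal sketch of the SV-Softmax forward computation is given below. It assumes the re-weighting factor has the exponential form exp(s (t - 1)(cos + 1) I_k), so that multiplying it into exp(s cos) simply replaces the logit s cos by s (t cos + t - 1); the function name and the toy cosines are hypothetical.

```python
import math

def sv_softmax_loss(cosines, y, s=30.0, t=1.2):
    """SV-Softmax sketch: emphasise classes that currently beat the true class."""
    cos_y = cosines[y]
    logits = []
    for k, c in enumerate(cosines):
        if k != y and cos_y - c < 0:              # mask I_k = 1: support vector for class k
            logits.append(s * (t * c + t - 1.0))  # log of h * exp(s * c) under the assumed h
        else:                                     # true class, or I_k = 0
            logits.append(s * c)
    m = max(logits)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(l - m) for l in logits)) - logits[y]

mis = [0.3, 0.5, -0.2]                             # true class 0 is beaten by class 1
loss_softmax = sv_softmax_loss(mis, 0, t=1.0)      # t = 1 reduces to plain softmax
loss_sv = sv_softmax_loss(mis, 0, t=1.2)           # the support vector is emphasised
ok = [0.9, 0.1]                                    # correctly classified sample
same = sv_softmax_loss(ok, 0, t=1.0) == sv_softmax_loss(ok, 0, t=1.2)
```

Only the mis-classified sample sees a larger loss when `t > 1`; the correctly classified one is untouched, which is the sense in which the mining is defined by the decision boundary.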

Figure 2: From left to right: SV-Softmax loss vs. Mining-based softmax loss (e.g., Focal loss [12]). SV-Softmax loss semantically defines the hard examples (support vectors) and emphasizes them from the probability view, while the hard examples of Focal loss are ambiguous and are concerned from the loss view.
Figure 3: Pipeline of our SV-Softmax loss and its relations to the existing mining-based and margin-based losses. Our SV-Softmax loss semantically integrates the motivation of mining-based and margin-based losses into one framework, but from different viewpoints.
Figure 4: From left to right: SV-Softmax loss vs. Margin-based softmax loss. SV-Softmax loss enlarges the feature margin from other classes (other-motivation) while current margin-based losses are directly from the ground truth class (self-motivation).

3.2.1 Relation to Mining-based Softmax Losses

To illustrate the advantages of our SV-Softmax loss over the traditional mining-based loss functions (e.g., Focal loss [12]), we use the binary classification case as an example. Assume that we have two samples $x_1$ and $x_2$, both of which are from class 1. Figure 2 gives a diagram, where $x_1$ is relatively hard while $x_2$ is relatively easy. The traditional mining-based Focal loss differentially re-weights the losses of hard and easy examples, such that:


In that way, the importance of hard examples is emphasized. This strategy works directly from the loss perspective, and the definition of hard examples is ambiguous. Our SV-Softmax loss takes a different route. First, we semantically define the hard examples (support vectors) according to the decision boundary. Then, for the support vector $x_1$, we reduce its probability, such that:


In summary, the differences between SV-Softmax loss and mining-based Focal loss [12] are displayed in Figure 2.
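The contrast between the two views can be illustrated numerically in the binary case. The cosine values, `gamma = 2` and the exponential form of the SV re-weighting (logit s cos replaced by s (t cos + t - 1) for a mis-classified sample) are assumptions made for this sketch only.

```python
import math

def prob_true(cos_true, cos_other, s=30.0, t=1.0):
    """Binary-case probability of the true class; t = 1 gives plain softmax."""
    if cos_true - cos_other < 0:                  # mis-classified: support vector
        other = s * (t * cos_other + t - 1.0)
    else:
        other = s * cos_other
    return 1.0 / (1.0 + math.exp(other - s * cos_true))

hard = (0.3, 0.5)   # hypothetical hard sample: the other class scores higher
easy = (0.8, 0.1)   # hypothetical easy sample

# Loss view (Focal loss style): re-weight the losses by (1 - p) ** gamma.
p_hard, p_easy = prob_true(*hard), prob_true(*easy)
focal_hard = (1.0 - p_hard) ** 2
focal_easy = (1.0 - p_easy) ** 2

# Probability view (SV-Softmax style): lower the probability of the support vector only.
p_hard_sv = prob_true(*hard, t=1.2)
p_easy_sv = prob_true(*easy, t=1.2)
```

Focal-style weighting changes both samples by a continuous factor, whereas the SV-style rule touches only the sample on the wrong side of the decision boundary.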

3.2.2 Relation to Margin-based Softmax Losses

Similarly, assume that we have a sample $x$ from class 1 that is a little far away from its ground truth class (e.g., the red circle point in Figure 4). Writing $\theta_k$ for the angle between $w_k$ and $x$, the original softmax loss aims to make $\cos(\theta_1) > \cos(\theta_2)$. To make the objective more rigorous, margin-based losses usually introduce a margin function $f(m,\theta_1)$ from the perspective of the ground truth class [14, 30, 4]:

$$\cos(\theta_1) \geq f(m,\theta_1) > \cos(\theta_2). \tag{10}$$

In contrast, our SV-Softmax loss enlarges the feature margin from the perspective of other non-ground truth classes. Specifically, we introduce a margin function $h^{*}(t,\theta_2)$ for these mis-classified features:

$$\cos(\theta_1) > h^{*}(t,\theta_2) \geq \cos(\theta_2), \tag{11}$$

where $h^{*}(t,\theta_2) = t\cos(\theta_2) + t - 1$. Our SV-Softmax loss semantically enlarges the feature margin from other non-ground truth classes, while margin-based losses make their efforts from the ground truth class. In the multi-class case, our SV-Softmax loss yields class-specific margins. Figure 4 gives their geometrical comparison. To sum up, Figure 3 shows the pipeline of our SV-Softmax loss and its relations to the mining-based and margin-based losses.
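To make the "class-specific margin" concrete, the sketch below checks that an exponential re-weighting factor of the assumed form exp(s (t - 1)(cos + 1) I_k) is equivalent to raising the competing class's cosine target to t cos + t - 1, and that the resulting margin (t - 1)(cos + 1) grows with the competing cosine. All names are illustrative.

```python
import math

def h(t, cos_k, i_k, s=30.0):
    """Assumed re-weighting factor: exp(s * (t - 1) * (cos_k + 1) * I_k)."""
    return math.exp(s * (t - 1.0) * (cos_k + 1.0) * i_k)

def h_star(t, cos_k):
    """Equivalent cosine target for a support-vector class: t * cos_k + t - 1."""
    return t * cos_k + t - 1.0

def class_margin(t, cos_k):
    """Margin added on top of cos_k; algebraically (t - 1) * (cos_k + 1)."""
    return h_star(t, cos_k) - cos_k

# log(h * exp(s * cos)) equals s * h_star, so both views give the same logit.
lhs = math.log(h(1.2, 0.5, 1)) + 30.0 * 0.5
rhs = 30.0 * h_star(1.2, 0.5)
```

A class that is more similar to the sample (larger cosine) receives a larger margin, and a class at cosine -1 receives none.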

Figure 5: From left to right: SV-Softmax loss vs. SV-X-Softmax loss. To increase the mining range, we adopt the margin-based decision boundaries to select support vectors. Thus the non-support vectors in SV-Softmax may be support vectors in SV-X-Softmax.

3.2.3 SV-X-Softmax

According to the above discussions, our SV-Softmax loss semantically fuses the motivation of mining-based and margin-based losses into one framework, but from different viewpoints. Therefore, we can also absorb their strengths into our SV-Softmax loss. Specifically, to increase the mining range, we adopt the margin-based decision boundaries to indicate the support vectors. Consequently, the improved SV-X-Softmax loss can be formulated as:

$$\mathcal{L}_6 = -\log \frac{e^{s f(m,\theta_{w_y,x})}}{e^{s f(m,\theta_{w_y,x})} + \sum_{k \neq y}^{K} h(t,\theta_{w_k,x},I_k)\, e^{s\cos(\theta_{w_k,x})}}, \tag{12}$$

where X denotes the margin-based loss. It can be A-Softmax [14], AM-Softmax [30], Arc-Softmax [4], etc. The indicator mask $I_k$ is re-computed according to the margin-based decision boundaries. (Footnote 2: This is why we uniformly call the hard examples "support vectors"; the usage is similar to the definition in [3].) Specifically,

$$I_k = \begin{cases} 0, & f(m,\theta_{w_y,x}) - \cos(\theta_{w_k,x}) \geq 0 \\ 1, & f(m,\theta_{w_y,x}) - \cos(\theta_{w_k,x}) < 0 \end{cases} \tag{13}$$

Figure 5 gives the geometrical illustration of our SV-X-Softmax loss. Its advantage is twofold. From the viewpoint of margin-based losses, SV-X-Softmax loss enlarges the feature margin by integrating the self-motivation of the ground truth class and the other-motivation of other classes into one framework. From the viewpoint of mining-based losses, it semantically enlarges the mining range.
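The SV-X-Softmax variant can be sketched by passing the margin function in as a parameter and using the margined true-class score both in the logit and in the support-vector test. The additive margin of 0.35 follows the AM-Softmax setting quoted in the experiments; the function names and the exponential form of the re-weighting are assumptions of this sketch.

```python
import math

def sv_x_softmax_loss(cosines, y, margin_fn, s=30.0, t=1.2):
    """SV-X-Softmax sketch: margin on the true class, SV emphasis on the others."""
    f_y = margin_fn(cosines[y])
    logits = []
    for k, c in enumerate(cosines):
        if k == y:
            logits.append(s * f_y)
        elif f_y - c < 0:                         # support vector w.r.t. the margined boundary
            logits.append(s * (t * c + t - 1.0))
        else:
            logits.append(s * c)
    m = max(logits)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(l - m) for l in logits)) - logits[y]

def am_margin(cos_theta, m=0.35):
    """Additive cosine margin (AM-Softmax style)."""
    return cos_theta - m

sample = [0.6, 0.4]  # correctly classified, but inside the additive margin
loss_sv = sv_x_softmax_loss(sample, 0, margin_fn=lambda c: c)   # plain SV-Softmax
loss_sv_am = sv_x_softmax_loss(sample, 0, margin_fn=am_margin)  # SV-AM-Softmax
```

The toy sample is not a support vector under the softmax boundary but becomes one under the margined boundary, which is the enlarged mining range described above.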

4 Optimization

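In this section, we show that the proposed SV-Softmax loss (6) is trainable and can be easily optimized by typical stochastic gradient descent. The difference between the original softmax loss and the proposed SV-Softmax loss lies in the output $f = [f_1, \dots, f_K]^{\top}$ of the last fully connected layer. In the forward pass, when $k = y$, it is the same as the original softmax loss (i.e., $f_y = s\cos(\theta_{w_y,x})$). When $k \neq y$, there are two cases: if the feature vector is easy for a specific class, it is the same as the original softmax (i.e., $f_k = s\cos(\theta_{w_k,x})$). Otherwise, it is recomputed as $f_k = s[t\cos(\theta_{w_k,x}) + t - 1]$. In the backward pass, we use the chain rule to compute the partial derivatives. The derivatives with respect to $w_k$ and the CNN feature $x$ of the last fully connected layer are re-emphasized for the support vectors:

$$\frac{\partial \mathcal{L}_5}{\partial w_k} = \begin{cases} \frac{\partial \mathcal{L}_5}{\partial f_k}\, s\, x, & k = y \ \text{or}\ I_k = 0 \\ \frac{\partial \mathcal{L}_5}{\partial f_k}\, s\, t\, x, & k \neq y,\ I_k = 1 \end{cases} \tag{14}$$

$$\frac{\partial \mathcal{L}_5}{\partial x} = \sum_{k:\ k = y \ \text{or}\ I_k = 0} \frac{\partial \mathcal{L}_5}{\partial f_k}\, s\, w_k + \sum_{k \neq y,\ I_k = 1} \frac{\partial \mathcal{L}_5}{\partial f_k}\, s\, t\, w_k, \tag{15}$$

where the computation form of $\partial \mathcal{L}_5 / \partial f_k$ is the same as in the original softmax loss. The whole scheme for a single image is summarized in Algorithm 1. It is trivial to perform the derivation with mini-batch input. Moreover, the extension to the SV-X-Softmax loss case is also straightforward.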

Input: A CNN feature $x$ with its corresponding label $y$. Initialized parameters $\Theta$ in convolution layers. Parameter $W$ in the last fully connected layer. The learning rate $\lambda$ and the indicator parameter $t$. The number of iterations $n = 0$.
while not converged do
        1: $n = n + 1$;
        2: According to the definition of hard examples (5), compute the SV-Softmax loss by (6);
        3: Compute the back-propagation error of each CNN feature $x$ by (15) and of the weight $W$ by (14);
        4: Update the parameters $W$ and $\Theta$ by stochastic gradient descent with learning rate $\lambda$;
end while
Output: Parameters $W$ and $\Theta$.
Algorithm 1: SV-Softmax
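The backward step of the algorithm can be sanity-checked with a small sketch. It computes the gradient of the loss with respect to each cosine by the chain rule, using the softmax derivative p_k - [k = y] for the logits and a factor s t (support-vector classes) or s (all others) for the logit-to-cosine step. The mask is treated as a constant, and the exponential re-weighting form is an assumption carried over from the forward sketch; none of these names come from the authors' code.

```python
import math

def sv_logits(cosines, y, s, t):
    """Forward logits: s * (t * c + t - 1) for support-vector classes, s * c otherwise."""
    cos_y = cosines[y]
    return [s * (t * c + t - 1.0) if (k != y and cos_y - c < 0) else s * c
            for k, c in enumerate(cosines)]

def loss_from_cosines(cosines, y, s=30.0, t=1.2):
    f = sv_logits(cosines, y, s, t)
    m = max(f)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in f)) - f[y]

def grad_wrt_cosines(cosines, y, s=30.0, t=1.2):
    """Chain rule: dL/dcos_k = (p_k - [k == y]) * (s * t if support vector else s)."""
    f = sv_logits(cosines, y, s, t)
    m = max(f)
    z = sum(math.exp(v - m) for v in f)
    cos_y = cosines[y]
    grads = []
    for k, (c, v) in enumerate(zip(cosines, f)):
        dl_df = math.exp(v - m) / z - (1.0 if k == y else 0.0)
        scale = s * t if (k != y and cos_y - c < 0) else s
        grads.append(dl_df * scale)
    return grads

grads = grad_wrt_cosines([0.3, 0.5, -0.2], y=0, s=5.0, t=1.2)
```

Comparing `grads` against central finite differences of `loss_from_cosines` is a quick way to confirm that the extra factor `t` appears only for the support-vector class.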
Group | Method | 6000 Pairs Accuracy | TPR@FAR=1e-3 | TPR@FAR=1e-4 | TPR@FAR=1e-5
--- | --- | --- | --- | --- | ---
Baseline | Softmax | 99.26 | 99.46 | 98.44 | 95.24
Mining-based | F-Softmax [12] | 99.46 | 99.62 | 98.76 | 95.97
Mining-based | HM-Softmax [22] | 99.26 | 99.48 | 98.48 | 95.11
Margin-based | A-Softmax [14] | 99.36 | 99.68 | 99.09 | 97.20
Margin-based | Arc-Softmax [4] | 99.63 | 99.86 | 99.68 | 98.18
Margin-based | AM-Softmax [30] | 99.61 | 99.86 | 99.75 | 98.18
Naive-fused | F-Arc-Softmax | 99.66 | 99.87 | 99.73 | 98.32
Naive-fused | F-AM-Softmax | 99.66 | 99.87 | 99.76 | 98.39
Naive-fused | HM-Arc-Softmax | 99.51 | 99.86 | 99.70 | 98.74
Naive-fused | HM-AM-Softmax | 99.63 | 99.87 | 99.75 | 98.90
Ours | SV-Softmax | 99.48 | 99.78 | 99.39 | 98.14
Ours | SV-Arc-Softmax | 99.78 | 99.85 | 99.77 | 98.52
Ours | SV-AM-Softmax | 99.76 | 99.87 | 99.81 | 99.22

Table 1: Verification performance (%) of different loss functions on LFW test data.

5 Experiments

5.1 Datasets

Training Data. The MS-Celeb-1M dataset [6] contains about 100k identities with 10 million images. However, it consists of a great many noisy face images. Fortunately, the trillionpairs consortium has made their efforts to get a high-quality version MS-Celeb-1M-v1c, which is well-cleaned with 86,876 identities and 3,923,399 aligned images.

Validation Data. We employ Labelled Faces in the Wild (LFW) [8] as the validation data. LFW contains 13,233 web-collected images from 5,749 different identities, with large variations in pose, expression and illuminations.

Test Data. We use two datasets, MegaFace [9] and Trillion Pairs, as the test data. The MegaFace dataset aims at evaluating the performance of face recognition algorithms at the million scale of distractors, and includes a gallery set and a probe set. The gallery set, a subset of Flickr photos from Yahoo, consists of more than one million images from 690,000 different individuals. The probe set has two existing databases: Facescrub [17] and FGNET [1]. In this study, we use Facescrub as the probe set, which contains about 100,000 photos of 530 unique individuals (55,742 images of males and 52,076 images of females). The Trillion Pairs dataset was recently released as a publicly available testing benchmark and consists of two parts, ELFW and DELFW. ELFW contains the face images of celebrities in the LFW name list, with 274,000 images from 5,700 identities. DELFW contains the distractors for ELFW, with 1.58 million face images in total from Flickr.

5.2 Experimental Settings

Data Processing. We detect the faces by adopting the FaceBoxes detector [40] and localize five landmarks (two eyes, nose tip and two mouth corners) through a simple 6-layer CNN [5]. The detected faces are cropped and resized to 120×120, and each pixel (in the range [0, 255]) in the RGB images is normalized by subtracting 127.5 and then dividing by 128. All the training faces are horizontally flipped with probability 0.5 for data augmentation.
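The pixel normalization and flip augmentation described above can be sketched as follows. Detection, landmark localization and resizing are omitted, and representing an image as a list of pixel rows is an assumption made to keep the example dependency-free.

```python
import random

def normalize_pixel(value):
    """Map a pixel in [0, 255] to roughly [-1, 1]: subtract 127.5, divide by 128."""
    return (value - 127.5) / 128.0

def maybe_flip(image_rows, rng=random):
    """Horizontally flip (reverse each row) with probability 0.5."""
    if rng.random() < 0.5:
        return [list(reversed(row)) for row in image_rows]
    return image_rows

lo, hi = normalize_pixel(0), normalize_pixel(255)
augmented = maybe_flip([[1, 2, 3], [4, 5, 6]], random.Random(0))
```

Passing a seeded `random.Random` instance makes the augmentation reproducible; by default the module-level generator is used.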

CNN Architecture. In face recognition, there are many kinds of network architectures [14, 30, 29]. To be fair, the same CNN architecture should be used to test different loss functions. As suggested by the work [29], we use Attention-56 [31] as our baseline architecture to achieve a good balance between computation and accuracy. The output of Attention-56 is reduced to a 512-dimension feature by average pooling. The scale parameter $s$ has already been discussed sufficiently in previous works [30, 33]. In this paper, we directly fix it to 30. The details of the adopted Attention-56 architecture are provided in the supplementary materials.

Training. All the CNN models are trained from scratch with the stochastic gradient descent (SGD) algorithm, with a batch size of 32 on each of 4 P40 GPUs in parallel, for a total batch size of 128. The weight decay is set to 0.0005 and the momentum is 0.9. The learning rate is initially 0.1 and is divided by 10 at the 100k, 160k and 220k iterations, and we finish the training process at 240k iterations.
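The step schedule above can be written as a small function. Whether the drop takes effect exactly at the milestone iteration or one step later is not stated, so the `>=` comparison here is an assumption.

```python
def learning_rate(iteration, base_lr=0.1, milestones=(100_000, 160_000, 220_000)):
    """Step schedule: divide the base rate by 10 at each milestone already reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * (0.1 ** drops)

lr_start = learning_rate(0)          # initial rate
lr_mid = learning_rate(100_000)      # after the first drop
lr_end = learning_rate(239_999)      # last iteration before training stops at 240k
```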

Test. At the testing stage, only the features of original image are employed (512-dimension) to compose the face representation. All the reported results in this paper are evaluated by a single model, without model ensemble or other fusion strategies.

For the evaluation metrics, the cosine distance of features is computed as the similarity score. Face identification and verification are conducted by ranking and thresholding the scores. Specifically, for face identification, the Cumulative Match Characteristics (CMC) curves are adopted to evaluate the Rank-1 face identification accuracy. For face verification, the Receiver Operating Characteristic (ROC) curves are adopted. The true positive rate (TPR) at a low false acceptance rate (FAR) is emphasized, since in real applications false acceptance carries higher risks than false rejection. We test our models on several popular public face datasets, including LFW [8], MegaFace Challenge [9, 16] and the recent Trillion Pairs Challenge. Specifically, for LFW, the unrestricted with labeled outside data protocol on the 6000 pairs accuracy [8] and the BLUFR [11] protocol are reported. For MegaFace Challenge, the identification Rank-1 accuracy and the verification rate TPR@FAR=1e-6 are reported. For Trillion Pairs Challenge, every pair between ELFW and DELFW is used, giving 0.4 trillion pairs in total. For the face identification task, a 1.58 million-size gallery and a 270k-size query are provided for top-1 identification, and the metric TPR@FAR=1e-3 is reported. For the face verification task, the verification rate TPR@FAR=1e-9 is reported. For more details about the protocols, please refer to the works [8, 11, 9].
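As a rough illustration of the TPR@FAR metric, the sketch below picks the score threshold so that at most a fraction `far` of impostor pairs is accepted, then measures the fraction of genuine pairs above it. This is a simplified reading of the metric, not the official evaluation code of any of the benchmarks, and the toy scores are made up.

```python
def tpr_at_far(genuine_scores, impostor_scores, far):
    """True positive rate at the threshold accepting at most `far` of impostor pairs."""
    ranked = sorted(impostor_scores, reverse=True)
    n_accept = int(far * len(ranked))            # impostor pairs allowed above the threshold
    if n_accept >= len(ranked):
        threshold = float("-inf")                # every pair is accepted
    else:
        threshold = ranked[n_accept]             # accept scores strictly above this value
    accepted = sum(1 for g in genuine_scores if g > threshold)
    return accepted / len(genuine_scores)

genuine = [0.9, 0.25, 0.15, 0.2]    # similarity scores of same-identity pairs (toy)
impostor = [0.1, 0.4, 0.2, 0.3]     # similarity scores of different-identity pairs (toy)
tpr = tpr_at_far(genuine, impostor, far=0.5)
```

At a realistic operating point such as FAR=1e-6, at least a million impostor pairs are needed before the threshold is meaningful, which is why these metrics are reported on large distractor sets.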

For the compared methods, we compare our method with the baseline Softmax loss (Softmax) and the recently proposed state-of-the-art alternatives, including 2 mining-based softmax losses (i.e., hard example mining (HM-Softmax [22]) and Focal loss (F-Softmax [12])), 3 margin-based softmax losses (the angular Softmax loss (A-Softmax [14]), the additive margin Softmax loss (AM-Softmax [30]), and the additive angular margin Softmax loss (Arc-Softmax [4])) and their 4 naive fusions (F-AM-Softmax, F-Arc-Softmax, HM-AM-Softmax and HM-Arc-Softmax). For all the compared methods, the source code can be downloaded from GitHub or from the authors' webpages. The corresponding parameters are determined according to their suggestions (e.g., the feature margin parameter is 0.35 for AM-Softmax and 0.5 for Arc-Softmax). For more details, please refer to the supplementary materials.

Figure 6: From left to right: Identification and Verification performance (%) of SV-Softmax loss with different indicator parameter $t$ on LFW and MegaFace, respectively.

5.3 Effects of indicator parameter

Since the indicator parameter $t$ plays an important role in the developed SV-Softmax loss, we first conduct experiments to search for its best value. By varying $t$ from 1.0 to 1.3 (if $t$ is larger than 1.4, the model may fail to converge), we use the Attention-56 network and the SV-Softmax loss to train models on the MS-Celeb-1M-v1c dataset and evaluate their performance on the validation set LFW. As illustrated in the left sub-figure of Figure 6, as $t$ increases, the 6000 pairs accuracy and the BLUFR results on LFW improve consistently and saturate at $t = 1.2$. This demonstrates the effectiveness of our SV-Softmax loss (compared with $t = 1.0$, i.e., the original softmax loss). To validate the sensitivity of our indicator parameter $t$, we directly test the trained models on MegaFace; the effects are reported in the right sub-figure of Figure 6. From the curves, we can see that our SV-Softmax loss is insensitive to the indicator parameter $t$ in a certain range. According to this study, $t$ is fixed to 1.2 in the subsequent experiments.

Method | Identification Rank1@1e6 | Verification TPR@FAR=1e-6
--- | --- | ---
Softmax | 86.29 | 87.63
F-Softmax [12] | 88.29 | 89.83
HM-Softmax [22] | 86.58 | 88.39
A-Softmax [14] | 88.54 | 89.40
Arc-Softmax [4] | 93.67 | 94.47
AM-Softmax [30] | 94.77 | 95.44
F-Arc-Softmax | 93.98 | 95.10
F-AM-Softmax | 94.47 | 94.84
HM-Arc-Softmax | 94.05 | 95.26
HM-AM-Softmax | 94.78 | 95.57
SV-Softmax | 92.11 | 93.54
SV-Arc-Softmax | 97.14 | 97.57
SV-AM-Softmax | 97.20 | 97.38

Table 2: Results (%) of different losses on MegaFace Challenge.

5.4 Experiments on LFW

Table 1 provides the quantitative results of all the competitors on the LFW dataset. The bold number in each column represents the best performance. The 6000 pairs accuracy protocol is well known to be typical and easy for deep face recognition, and all the competitors achieve over 99% accuracy. So the improvement of our SV-Softmax loss is not very large. From the numbers, we observe that the naive fusions of mining-based and margin-based losses, e.g., HM-AM-Softmax and F-AM-Softmax, outperform the simple mining-based or margin-based ones. Despite this, our improved SV-AM-Softmax still achieves about 0.3% improvement. For the BLUFR protocol, the trends are similar to those of the 6000 pairs accuracy, and our improved SV-AM-Softmax loss achieves the best performance among all the competitors. Since the evaluation protocols on LFW are nearly saturated, it is better to test our models on the MegaFace and Trillion Pairs Challenges.

Figure 7: Left: CMC curves of different loss functions with 1M distractors on MegaFace [9] Set 1. Right: ROC curves of different loss functions with 1M distractors on MegaFace [9] Set 1.

5.5 Experiments on MegaFace Challenge

Table 2 shows the identification and verification results on the MegaFace dataset. In particular, compared with the baseline Softmax loss and the mining-based Softmax losses, our SV-Softmax loss achieves at least 3% improvement in both the Rank-1 identification rate and the verification TPR@FAR=1e-6 rate. The reason is that our SV-Softmax loss clearly defines the hard examples (i.e., support vectors), and thus it is better than existing mining-based losses. Compared with the margin-based Softmax losses, the performance of our SV-Softmax loss is slightly lower than that of Arc-Softmax and AM-Softmax. This is reasonable because the support vectors decided by the Softmax decision boundary in SV-Softmax loss may not be enough for learning discriminative features. Our improved versions, the SV-Arc-Softmax and SV-AM-Softmax losses, wherein the support vectors are determined by the margin-based decision boundaries, further boost the performance because they absorb the complementary merits of margin-based losses. Specifically, our SV-AM-Softmax loss beats the best margin-based competitor, AM-Softmax loss, by a large margin (about 2.4% in Rank-1 identification rate and 1.9% in verification rate). Compared with the naive fusions of mining-based and margin-based losses, our improved SV-AM-Softmax loss is also better. It is about 2.4% higher in Rank-1 identification rate and 1.8% higher in verification rate than the second best competitor, HM-AM-Softmax loss. To sum up, our improved SV-X-Softmax losses, which eliminate the ambiguity of hard examples as well as absorb the discriminative power of other classes by focusing on support vectors, achieve the best results among the compared losses. In Figure 7, we draw both the CMC curves, to evaluate the performance of face identification, and the ROC curves, to evaluate the performance of face verification, on MegaFace Set 1. From the curves, we can see similar trends at other measures. In this experiment, our SV-Softmax loss and its improved version SV-AM-Softmax have shown their superiority for both the identification and verification tasks.

Method | Identification TPR@FAR=1e-3 | Verification TPR@FAR=1e-9
--- | --- | ---
Softmax | 36.61 | 33.87
F-Softmax [12] | 39.80 | 37.14
HM-Softmax [22] | 36.75 | 34.46
A-Softmax [14] | 43.89 | 43.76
Arc-Softmax [4] | 57.48 | 57.45
AM-Softmax [30] | 61.80 | 61.61
F-Arc-Softmax | 56.80 | 56.87
F-AM-Softmax | 61.85 | 61.79
HM-Arc-Softmax | 55.93 | 56.63
HM-AM-Softmax | 61.42 | 61.33
SV-Softmax | 51.18 | 46.78
SV-Arc-Softmax | 71.19 | 70.33
SV-AM-Softmax | 73.56 | 72.71

Table 3: Performance (%) of different loss functions on Trillion Pairs Challenge.

5.6 Experiments on Trillion Pairs Challenge

Table 3 displays the performance comparison on the recent Trillion Pairs Challenge, from which we can conclude that the results exhibit the same trends that emerged on the LFW and MegaFace datasets. Moreover, the trends are more obvious. Concretely, both the current mining-based and margin-based losses are better than the simple softmax loss for face recognition. However, the margin-based losses usually achieve higher performance than the mining-based losses, because the motivation of margin-based losses is to enhance the feature discrimination while the motivation of mining-based losses is to focus training on hard examples. Their naive fusions can slightly improve the performance further. However, the naive fusions still suffer from the ambiguity of hard examples and the lack of discriminative power of other classes. Therefore, they are limited for face recognition. Our SV-X-Softmax (e.g., SV-AM-Softmax) losses absorb the strengths and discard the drawbacks of the current mining-based and margin-based loss functions, and thus achieve the highest performance.

Entry | Accuracy | TPR@FAR=1e-3 | TPR@FAR=1e-4 | TPR@FAR=1e-5
--- | --- | --- | --- | ---
Ours | 99.85 | 99.92 | 99.89 | 99.13
1st | 99.87 | - | - | -

Table 4: Performance (%) of SV-AM-Softmax loss on LFW.

Entry | MegaFace Identification Rank-1@1e6 | MegaFace Verification TPR@FAR=1e-6
--- | --- | ---
Ours | 98.82 | 99.03
1st | 99.93 | 99.93

Table 5: Performance (%) of SV-AM-Softmax loss on MegaFace.

Entry | Trillion Pairs Identification TPR@FAR=1e-3 | Trillion Pairs Verification TPR@FAR=1e-9
--- | --- | ---
Ours | 82.25 | 78.49
1st | 85.67 | 82.29

Table 6: Performance (%) of SV-AM-Softmax loss on Trillion Pairs.

6 Improvement by Designing Architectures

To further boost the performance, we make the adopted Attention-56 [31] architecture deeper. Specifically, we change the stages of [1,1,1] used in Attention-56 into [3,6,2]. Moreover, inspired by [4], we incorporate the IRSE module into the architecture. The results are displayed in Tables 4-6. Note that all current results are obtained by training on the simple MS-Celeb-1M-v1c dataset only, and only the single-model performance is reported. From the numbers, we can see that our SV-AM-Softmax loss achieves competitive absolute performance. In the future, it would be better to fuse the MS1M-ArcFace [4] and Asian datasets and to design model ensemble methods (e.g., feature concatenation).

7 Conclusion

This paper has proposed a simple but very effective loss function, namely support vector guided softmax loss (i.e., SV-Softmax), for face recognition. Specifically, SV-Softmax loss explicitly concentrates on optimizing the support vectors. Thus it semantically integrates the motivation of mining-based and margin-based loss functions into one framework. Consequently, it avoids the intrinsic drawbacks of the current mining-based losses, margin-based losses and their naive fusions. Extensive experiments on several benchmark datasets have clearly demonstrated the advantages of our new approach over the state-of-the-art alternatives.