ArcFace: Additive Angular Margin Loss for Deep Face Recognition

by Jiankang Deng, et al.
Imperial College London

Convolutional neural networks have significantly boosted the performance of face recognition in recent years due to their high capacity in learning discriminative features. To enhance the discriminative power of the Softmax loss, multiplicative angular margin and additive cosine margin incorporate angular margin and cosine margin into the loss functions, respectively. In this paper, we propose a novel supervision signal, additive angular margin (ArcFace), which has a better geometrical interpretation than the supervision signals proposed so far. Specifically, the proposed ArcFace cos(θ + m) directly maximises the decision boundary in angular (arc) space based on the L2 normalised weights and features. Compared to multiplicative angular margin cos(mθ) and additive cosine margin cos θ − m, ArcFace can obtain more discriminative deep features. We also emphasise the importance of network settings and data refinement in the problem of deep face recognition. Extensive experiments on several relevant face recognition benchmarks, LFW, CFP and AgeDB, demonstrate the effectiveness of the proposed ArcFace. Most importantly, we obtain state-of-the-art performance in the MegaFace Challenge in a totally reproducible way. We make data, models and training/test code publicly available.



1 Introduction

Face representation through the deep convolutional network embedding is considered the state-of-the-art method for face verification, face clustering, and face recognition [42, 35, 31]. The deep convolutional network is responsible for mapping the face image, typically after a pose normalisation step, into an embedding feature vector such that features of the same person have a small distance while features of different individuals have a considerable distance.

(a) ArcFace
(b) Geodesic Correspondence
Figure 1: Geometrical interpretation of ArcFace. (a) Blue and green points represent embedding features from two different classes. ArcFace can directly impose angular (arc) margin between classes. (b) We show an intuitive correspondence between angle and arc margin. The angular margin of ArcFace corresponds to arc margin (geodesic distance) on the hypersphere surface.

The various face recognition approaches by deep convolutional network embedding differ along three primary attributes.

The first attribute is the training data employed to train the model. The number of identities in publicly available training data, such as VGG-Face [31], VGG2-Face [7], CASIA-WebFace [48], UMDFaces [6], MS-Celeb-1M [11], and MegaFace [21], ranges from several thousand to half a million. Although MS-Celeb-1M and MegaFace have a significant number of identities, they suffer from annotation noise [47] and long-tail distributions [50]. By comparison, the private training data of Google [35] has several million identities. As we can check from the latest performance report of the Face Recognition Vendor Test (FRVT) [4], Yitu, a start-up company from China, ranks first based on their private 1.8 billion face images [5]. Due to the orders-of-magnitude difference in training data scale, face recognition models from industry perform much better than models from academia. The difference in training data also makes some deep face recognition results [2] not fully reproducible.

The second attribute is the network architecture and settings. High capacity deep convolutional networks, such as ResNet [14, 15, 46, 50, 23] and Inception-ResNet [40, 3], can obtain better performance compared to the VGG network [37, 31] and the Google Inception V1 network [41, 35]. Different applications of deep face recognition prefer different trade-offs between speed and accuracy [16, 51]. For face verification on mobile devices, real-time running speed and compact model size are essential for a smooth customer experience. For billion-level security systems, high accuracy is as important as efficiency.

The third attribute is the design of the loss functions.

(1) Euclidean margin based loss.

In [42] and [31], a Softmax classification layer is trained over a set of known identities. The feature vector is then taken from an intermediate layer of the network and used to generalise recognition beyond the set of identities used in training. Centre loss [46], Range loss [50] and Marginal loss [10] add an extra penalty to compress intra-class variance or enlarge inter-class distance to improve the recognition rate, but all of them still combine Softmax to train recognition models. However, the classification-based methods [42, 31] suffer from massive GPU memory consumption on the classification layer when the identity number increases to the million level, and prefer balanced and sufficient training data for each identity.

The contrastive loss [39] and the Triplet loss [35] utilise pair training strategy. The contrastive loss function consists of positive pairs and negative pairs. The gradients of the loss function pull together positive pairs and push apart negative pairs. Triplet loss minimises the distance between an anchor and a positive sample and maximises the distance between the anchor and a negative sample from a different identity. However, the training procedure of the contrastive loss [39] and the Triplet loss [35] is tricky due to the selection of effective training samples.
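The pull/push behaviour of the two pair-based losses described above can be illustrated with a minimal NumPy sketch. The squared-distance forms and the margin values (`margin=1.0`, `margin=0.2`) are common textbook choices assumed here for illustration; they are not taken from [39] or [35].

```python
import numpy as np

def contrastive_loss(a, b, same, margin=1.0):
    """Pull positive pairs together, push negative pairs at least `margin` apart.

    a, b: (N, d) embeddings; same: (N,) array with 1 for positive pairs, 0 otherwise.
    The margin value is an illustrative assumption.
    """
    d = np.linalg.norm(a - b, axis=1)
    pos = same * d ** 2
    neg = (1 - same) * np.maximum(margin - d, 0.0) ** 2
    return float(np.mean(pos + neg))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin (squared L2)."""
    d_ap = np.sum((anchor - positive) ** 2, axis=1)
    d_an = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(d_ap - d_an + margin, 0.0)))
```

A triplet whose negative is already farther from the anchor than the positive by more than the margin contributes zero loss, which is why the selection of informative samples matters for these losses.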

(2) Angular and cosine margin based loss.

Liu et al. [24] proposed a large margin Softmax (L-Softmax) by adding multiplicative angular constraints to each identity to improve feature discrimination. SphereFace [23] applies L-Softmax to deep face recognition with weight normalisation. Due to the non-monotonicity of the cosine function, a piece-wise function is applied in SphereFace to guarantee monotonicity. During the training of SphereFace, Softmax loss is combined to facilitate and ensure convergence. To overcome the optimisation difficulty of SphereFace, additive cosine margin [44, 43] moves the angular margin into cosine space. The implementation and optimisation of additive cosine margin are much easier than those of SphereFace. Additive cosine margin is easily reproducible and achieves state-of-the-art performance on MegaFace (TencentAILab_FaceCNN_v1) [2]. Compared to Euclidean margin based loss, angular and cosine margin based loss explicitly adds discriminative constraints on a hypersphere manifold, which intrinsically matches the prior that human faces lie on a manifold.

It is well known that the three attributes mentioned above, data, network and loss, have a high-to-low influence on the performance of face recognition models. In this paper, we contribute to improving deep face recognition along all three of these attributes.

Data. We refined the largest publicly available training data, MS-Celeb-1M [11], in both automatic and manual ways. We have checked the quality of the refined MS1M dataset with the ResNet-27 [14, 50, 10] network and the marginal loss [10] on the NIST Face Recognition Prize Challenge. We also find that there are hundreds of overlapping face images between the MegaFace one million distractors and the FaceScrub dataset, which significantly affects the evaluation results. We manually find these overlapping face images in the MegaFace distractors. Both the refined training data and the refined test data will be made publicly available.

Network. Taking VGG2 [7] as the training data, we conduct extensive contrast experiments regarding the convolutional network settings and report the verification accuracy on LFW, CFP and AgeDB. The proposed network settings have been confirmed robust under large pose and age variations. We also explore the trade-off between the speed and accuracy based on the most recent network structures.

Loss. We propose a new loss function, additive angular margin (ArcFace), to learn highly discriminative features for robust face recognition. As shown in Figure 1, the proposed loss function directly maximises the decision boundary in angular (arc) space based on the L2 normalised weights and features. We show that ArcFace not only has a clearer geometrical interpretation but also outperforms the baseline methods, e.g. multiplicative angular margin [23] and additive cosine margin [44, 43]. We also explain why ArcFace is better than Softmax, SphereFace [23] and CosineFace [44, 43] from the view of semi-hard sample distributions.

Performance. The proposed ArcFace achieves state-of-the-art results on the MegaFace Challenge [21], which is the largest public face benchmark with one million faces for recognition. We make these results totally reproducible with data, trained models and training/test code publicly available.

2 From Softmax to ArcFace

2.1 Softmax

The most widely used classification loss function, Softmax loss, is presented as follows:

$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}},$$

where $x_i \in \mathbb{R}^d$ denotes the deep feature of the $i$-th sample, belonging to the $y_i$-th class. The feature dimension $d$ is set as 512 in this paper following [46, 50, 23, 43]. $W_j \in \mathbb{R}^d$ denotes the $j$-th column of the weights $W \in \mathbb{R}^{d \times n}$ in the last fully connected layer and $b$ is the bias term. The batch size and the class number are $N$ and $n$, respectively. Traditional Softmax loss is widely used in deep face recognition [31, 7]. However, the Softmax loss function does not explicitly optimise the features to have a higher similarity score for positive pairs and a lower similarity score for negative pairs, which leads to a performance gap.
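As a point of reference for the margin-based variants that follow, the Softmax loss can be sketched in a few lines of NumPy. This is a minimal illustration, not the training code of the paper; the function name and the array layout (features as rows, class weights as columns) are assumptions.

```python
import numpy as np

def softmax_loss(x, W, b, y):
    """Mean cross-entropy over a batch.

    x: (N, d) features, W: (d, n) weights, b: (n,) bias, y: (N,) integer labels.
    """
    logits = x @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(y)), y].mean())
```

With all-zero weights and bias, every class receives the same probability and the loss equals the logarithm of the class number, which is a convenient sanity check.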

2.2 Weights Normalisation

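For simplicity, we fix the bias $b_j = 0$ as in [23]. Then, we transform the target logit [32] as follows:

$$W_j^T x_i = \|W_j\|\,\|x_i\|\cos\theta_j,$$

where $\theta_j$ is the angle between the weight $W_j$ and the feature $x_i$. Following [23, 43, 45], we fix $\|W_j\| = 1$ by L2 normalisation, which makes the predictions depend only on the angle between the feature vector and the weight:

$$L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\|x_i\|\cos\theta_{y_i}}}{e^{\|x_i\|\cos\theta_{y_i}} + \sum_{j=1,\,j\neq y_i}^{n} e^{\|x_i\|\cos\theta_j}}.$$

In the experiments of SphereFace, L2 weight normalisation improves performance only slightly.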
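The effect of weight normalisation can be checked numerically: once every class weight has unit norm, each logit is the feature norm times the cosine of the angle to that weight, so rescaling the weights no longer changes the prediction. The sketch below uses random data and a hypothetical helper name purely for illustration.

```python
import numpy as np

def weight_normalised_logits(x, W):
    """Logits with every column of W rescaled to unit L2 norm (bias fixed to 0)."""
    W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)
    return x @ W_unit

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # 4 features of dimension 8 (toy sizes)
W = rng.normal(size=(8, 3))   # 3 classes
logits = weight_normalised_logits(x, W)
# Dividing by the feature norm leaves only cos(theta_j).
cos_theta = logits / np.linalg.norm(x, axis=1, keepdims=True)
```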

2.3 Multiplicative Angular Margin

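In SphereFace [23, 24], the angular margin $m$ is introduced by multiplication on the angle:

$$L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\|x_i\|\cos(m\theta_{y_i})}}{e^{\|x_i\|\cos(m\theta_{y_i})} + \sum_{j=1,\,j\neq y_i}^{n} e^{\|x_i\|\cos\theta_j}},$$

where $\theta_{y_i} \in [0, \pi/m]$. In order to remove this restriction, $\cos(m\theta_{y_i})$ is substituted by a piece-wise monotonic function $\psi(\theta_{y_i})$. SphereFace is formulated as:

$$L_4 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\|x_i\|\psi(\theta_{y_i})}}{e^{\|x_i\|\psi(\theta_{y_i})} + \sum_{j=1,\,j\neq y_i}^{n} e^{\|x_i\|\cos\theta_j}},$$

where $\psi(\theta_{y_i}) = (-1)^k\cos(m\theta_{y_i}) - 2k$, $\theta_{y_i} \in \left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right]$, $k \in [0, m-1]$, and $m \geq 1$ is the integer that controls the size of the angular margin. However, during the implementation of SphereFace, Softmax supervision is incorporated to guarantee the convergence of training, and the weight is controlled by a dynamic hyper-parameter $\lambda$. With the additional Softmax loss, $\psi(\theta_{y_i})$ in fact is:

$$\psi(\theta_{y_i}) = \frac{(-1)^k\cos(m\theta_{y_i}) - 2k + \lambda\cos(\theta_{y_i})}{1 + \lambda},$$

where $\lambda$ is an additional hyper-parameter to facilitate the training of SphereFace. $\lambda$ is set to 1,000 at the beginning and decreases to 5 to make the angular space of each class more compact [23]. This additional dynamic hyper-parameter makes the training of SphereFace relatively tricky.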
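The piece-wise target function of SphereFace, including the Softmax blending weight, can be sketched as follows. The symbols `m` (integer margin) and `lam` (blending weight, written λ in [23]) follow the usual SphereFace notation; the function name and default values are assumptions made for this illustration.

```python
import numpy as np

def sphereface_psi(theta, m=4, lam=0.0):
    """Piece-wise monotonic target logit blended with cos(theta).

    theta: angle(s) in radians within [0, pi]; m: integer margin;
    lam: weight of the plain Softmax term (lam=0 gives the pure margin function).
    """
    theta = np.asarray(theta, dtype=float)
    # k indexes the interval [k*pi/m, (k+1)*pi/m] that contains theta.
    k = np.minimum(np.floor(theta * m / np.pi), m - 1)
    psi = (-1.0) ** k * np.cos(m * theta) - 2.0 * k
    return (psi + lam * np.cos(theta)) / (1.0 + lam)
```

With a large `lam` the function stays close to cos θ, and as `lam` shrinks the margin term dominates, which reflects the annealing schedule described above.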

2.4 Feature Normalisation

Feature normalisation is widely used for face verification, e.g. L2-normalised Euclidean distance and cosine distance [29]. Parde et al. [30] observe that the L2-norm of features learned using Softmax loss is informative of the quality of the face. Features for good quality frontal faces have a high L2-norm while blurry faces with extreme pose have a low L2-norm. Ranjan et al. [33] add the L2-constraint to the feature descriptors and restrict features to lie on a hypersphere of a fixed radius. L2 normalisation on features can be easily implemented using existing deep learning frameworks and significantly boosts the performance of face verification. Wang et al. [44] point out that the gradient norm may be extremely large when the feature norm from a low-quality face image is very small, which potentially increases the risk of gradient explosion. The advantages of feature normalisation are also revealed in [25, 26, 43, 45], where feature normalisation is explained from analytic, geometric and experimental perspectives.

As we can see from above works, L2 normalisation on features and weights is an important step for hypersphere metric learning. The intuitive insight behind feature and weight normalisation is to remove the radial variation and push every feature to distribute on a hypersphere manifold.

Following [33, 43, 45, 44], we fix $\|x_i\|$ by L2 normalisation and re-scale it to $s$, which is the hypersphere radius; its lower bound is given in [33]. In this paper, we use $s = 64$ for face recognition experiments [33, 43]. Based on feature and weight normalisation, we get $W_j^T x_i = \cos\theta_j$.

If feature normalisation is applied to SphereFace, we obtain the feature-normalised SphereFace, denoted as SphereFace-FNorm:

$$L_5 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\psi(\theta_{y_i})}}{e^{s\psi(\theta_{y_i})} + \sum_{j=1,\,j\neq y_i}^{n} e^{s\cos\theta_j}}.$$


2.5 Additive Cosine Margin

In [44, 43], the angular margin $m$ is moved to the outside of $\cos\theta$, thus they propose the cosine margin loss function:

$$L_6 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i} - m)}}{e^{s(\cos\theta_{y_i} - m)} + \sum_{j=1,\,j\neq y_i}^{n} e^{s\cos\theta_j}}.$$

In this paper, we set the cosine margin $m$ as 0.35 [44, 43]. Compared to SphereFace, additive cosine margin (CosineFace) has three advantages: (1) it is extremely easy to implement without tricky hyper-parameters; (2) it is clearer and able to converge without the Softmax supervision; (3) it brings an obvious performance improvement.
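The ease of implementation claimed for the additive cosine margin can be seen in a short NumPy sketch: after normalising features and weights, the margin is simply subtracted from the target-class cosine before scaling. The function name and the default values `s=64.0` and `m=0.35` are assumptions chosen for this illustration.

```python
import numpy as np

def cosineface_logits(x, W, y, s=64.0, m=0.35):
    """Scaled logits with an additive cosine margin on the target class.

    x: (N, d) features, W: (d, n) class weights, y: (N,) integer labels.
    s (hypersphere radius) and m (cosine margin) are illustrative defaults.
    """
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos_theta = x_unit @ W_unit                # (N, n) cosines
    cos_theta[np.arange(len(y)), y] -= m       # margin only on the target class
    return s * cos_theta
```

The resulting logits can be passed to any standard cross-entropy routine; no piece-wise function or blending schedule is required.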

2.6 Additive Angular Margin

Although the cosine margin in [44, 43] has a one-to-one mapping from the cosine space to the angular space, there is still a difference between these two margins. In fact, the angular margin has a clearer geometric interpretation than the cosine margin, and the margin in angular space corresponds to the arc distance on the hypersphere manifold.

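We add an angular margin $m$ within $\cos\theta$. Since $\cos(\theta + m)$ is lower than $\cos\theta$ when $\theta \in [0, \pi - m]$, the constraint is more stringent for classification. We define the proposed ArcFace as:

$$L_7 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j=1,\,j\neq y_i}^{n} e^{s\cos\theta_j}},$$

subject to

$$W_j = \frac{W_j}{\|W_j\|}, \quad x_i = \frac{x_i}{\|x_i\|}, \quad \cos\theta_j = W_j^T x_i.$$

If we expand the proposed additive angular margin $\cos(\theta + m)$, we get $\cos(\theta + m) = \cos\theta\cos m - \sin\theta\sin m$. Compared to the additive cosine margin $\cos\theta - m$ proposed in [44, 43], the proposed ArcFace is similar but the margin is dynamic due to $\sin\theta$.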
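A minimal NumPy sketch of the additive angular margin loss is given below. It is an illustration under stated assumptions rather than the released training code: the function name, the array layout and the default values `s=64.0` and `m=0.5` are chosen here, and the angle is recovered with `arccos` for clarity instead of the trigonometric expansion.

```python
import numpy as np

def arcface_loss(x, W, y, s=64.0, m=0.5):
    """Mean cross-entropy with the margin m added to the target angle.

    x: (N, d) features, W: (d, n) class weights, y: (N,) integer labels.
    s (hypersphere radius) and m (angular margin, radians) are illustrative defaults.
    """
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos_theta = np.clip(x_unit @ W_unit, -1.0, 1.0)
    idx = np.arange(len(y))
    theta_target = np.arccos(cos_theta[idx, y])
    logits = s * cos_theta
    logits[idx, y] = s * np.cos(theta_target + m)   # cos(theta + m) on the target
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[idx, y].mean())
```

Setting `m=0` reduces the sketch to a Softmax loss on normalised weights and features, so the two can be compared on the same batch to see the extra penalty introduced by the margin.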

In Figure 2, we illustrate the proposed ArcFace, and the angular margin corresponds to the arc margin. Compared to SphereFace and CosineFace, our method has the best geometric interpretation.

Figure 2: Geometrical interpretation of ArcFace. Different colour areas represent feature spaces from distinct classes. ArcFace can not only compress the feature regions but also correspond to the geodesic distance on the hypersphere surface.

2.7 Comparison under Binary Case

To better understand the process from Softmax to the proposed ArcFace, we give the decision boundaries under binary classification case in Table 1 and Figure 3. Based on the weights and features normalisation, the main difference among these methods is where we put the margin.

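Loss Functions | Decision Boundaries
W-Norm Softmax | $\|x\|(\cos\theta_1 - \cos\theta_2) = 0$
SphereFace [23] | $\|x\|(\cos m\theta_1 - \cos\theta_2) = 0$
F-Norm SphereFace | $s(\cos m\theta_1 - \cos\theta_2) = 0$
CosineFace [44, 43] | $s(\cos\theta_1 - m - \cos\theta_2) = 0$
Table 1: Decision boundaries for class 1 under binary classification case. Note that $\theta_i$ is the angle between $W_i$ and $x$, $s$ is the hypersphere radius, and $m$ is the margin.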
Figure 3: Decision margins of different loss functions under binary classification case. The dashed line represents the decision boundary, and the grey areas are the decision margins.

2.8 Target Logit Analysis

To investigate why the face recognition performance can be improved by SphereFace, CosineFace and ArcFace, we analyse the target logit curves and the $\theta$ distributions during training. Here, we use the LResNet34E-IR (refer to Sec. 3.2) network and the refined MS1M dataset (refer to Sec. 3.1).

In Figure 4(a), we plot the target logit curves for Softmax, SphereFace, CosineFace and ArcFace. For SphereFace, the best setting is $m = 4$ and $\lambda = 5$, which is similar to the curve with $m = 1.5$ and $\lambda = 0$. However, the implementation of SphereFace requires $m$ to be an integer. When we try the minimum multiplicative margin, $m = 2$ and $\lambda = 0$, the training cannot converge. Therefore, decreasing the target logit curve slightly from Softmax increases the training difficulty and improves the performance, but decreasing it too much may cause training divergence.

Both CosineFace and ArcFace follow this insight. As we can see from Figure 4(a), CosineFace moves the target logit curve along the negative direction of y-axis, while ArcFace moves the target logit curve along the negative direction of x-axis. Now, we can easily understand the performance improvement from Softmax to CosineFace and ArcFace.

For ArcFace with the margin $m = 0.5$, the target logit curve is not monotonically decreasing when $\theta \in [0^\circ, 180^\circ]$. In fact, the target logit curve increases when $\theta > 151.35^\circ$. However, as shown in Figure 4(c), $\theta$ has a Gaussian distribution with the centre at $90^\circ$ and the largest angle below $105^\circ$ when starting from the randomly initialised network. The increasing interval of ArcFace is almost never reached during training. Therefore, we do not need to deal with this explicitly.
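The angle at which the ArcFace target logit stops decreasing can be located numerically. The short check below assumes a margin of 0.5 radians, the value selected in Sec. 3.3, and scans the curve cos(θ + m) on a 0.01 degree grid; the variable names are illustrative.

```python
import numpy as np

m = 0.5                                   # angular margin in radians (assumed value)
theta = np.linspace(0.0, np.pi, 18001)    # 0.01 degree resolution over [0, 180] degrees
target_logit = np.cos(theta + m)

# The curve decreases until theta + m reaches pi and increases afterwards.
turning_point_deg = float(np.degrees(theta[np.argmin(target_logit)]))
analytic_deg = 180.0 - float(np.degrees(m))
```

Both values agree up to the grid resolution, confirming that the increasing interval only starts at θ = π − m.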

In Figure 4(c), we show the $\theta$ distributions of CosineFace and ArcFace in three phases of training, i.e. start, middle and end. The distribution centres gradually move from $90^\circ$ to $35^\circ$-$40^\circ$. In Figure 4(a), we find that the target logit curve of ArcFace is lower than that of CosineFace between $30^\circ$ and $90^\circ$. Therefore, the proposed ArcFace imposes a stricter margin penalty than CosineFace in this interval. In Figure 4(b), we show the target logit convergence curves estimated on training batches for Softmax, CosineFace and ArcFace. We can also find that the margin penalty of ArcFace is heavier than that of CosineFace at the beginning, as the red dotted line is lower than the blue dotted line. At the end of training, ArcFace converges better than CosineFace, as the histogram of $\theta$ lies further to the left (Figure 4(c)) and the target logit convergence curve is higher (Figure 4(b)). From Figure 4(c), we can find that almost all of the $\theta$s are smaller than $60^\circ$ at the end of training. The samples beyond this field are the hardest samples as well as the noise samples of the training dataset. Even though CosineFace imposes a stricter margin penalty when $\theta < 30^\circ$ (Figure 4(a)), this field is seldom reached even at the end of training (Figure 4(c)). Therefore, we can also understand why SphereFace can obtain very good performance even with a relatively small margin in this section.

In conclusion, adding too much margin penalty when $\theta \in [60^\circ, 90^\circ]$ may cause training divergence, e.g. SphereFace ($m = 2$ and $\lambda = 0$). Adding margin when $\theta \in [30^\circ, 60^\circ]$ can potentially improve the performance, because this section corresponds to the most effective semi-hard negative samples [35]. Adding margin when $\theta < 30^\circ$ cannot obviously improve the performance, because this section corresponds to the easiest samples. When we go back to Figure 4(a) and rank the curves between $[30^\circ, 60^\circ]$, we can understand why the performance improves from Softmax, SphereFace, CosineFace to ArcFace under their best parameter settings. Note that $30^\circ$ and $60^\circ$ here are roughly estimated thresholds for easy and hard training samples.
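The ranking of the target logit curves can be reproduced with a few lines of NumPy. The sketch assumes a cosine margin of 0.35 for CosineFace and an angular margin of 0.5 for ArcFace, and omits the scale factor since it is shared by all curves; the function name and the `kind` strings are illustrative.

```python
import numpy as np

def target_logit(theta_deg, kind, m=0.0):
    """Unscaled target logit as a function of the angle in degrees."""
    t = np.radians(theta_deg)
    if kind == "softmax":
        return np.cos(t)
    if kind == "cosineface":
        return np.cos(t) - m
    if kind == "arcface":
        return np.cos(t + m)
    raise ValueError(f"unknown loss kind: {kind}")

angles = np.arange(0, 91, 10)
curves = {
    "softmax": target_logit(angles, "softmax"),
    "cosineface": target_logit(angles, "cosineface", m=0.35),
    "arcface": target_logit(angles, "arcface", m=0.5),
}
```

Under these assumed margins the two margin-based curves cross close to 30 degrees: CosineFace is lower (stricter) for smaller angles, while ArcFace is lower over the semi-hard range discussed above.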

(a) Target Logit Curves
(b) Target Logit Converge Curves
(c) Distributions during Training
Figure 4: Target logit analysis. (a) Target logit curves for Softmax, SphereFace, CosineFace and ArcFace. (b) Target logit convergence curves estimated on training batches for Softmax, CosineFace and ArcFace. (c) $\theta$ distributions move from large angles to small angles during training (start, middle and end). Best viewed zoomed in.

3 Experiments

In this paper, we aim to obtain state-of-the-art performance on the MegaFace Challenge [21], the largest face identification and verification benchmark, in a totally reproducible way. We take Labelled Faces in the Wild (LFW) [19], Celebrities in Frontal Profile (CFP) [36] and Age Database (AgeDB) [27] as the validation datasets, and conduct extensive experiments regarding network settings and loss function designs. The proposed ArcFace achieves state-of-the-art performance on all four of these datasets.

3.1 Data

3.1.1 Training data

We use two datasets, VGG2 [7] and MS-Celeb-1M [11], as our training data.

VGG2. VGG2 dataset contains a training set with 8,631 identities (3,141,890 images) and a test set with 500 identities (169,396 images). VGG2 has large variations in pose, age, illumination, ethnicity and profession. Since VGG2 is a high-quality dataset, we use it directly without data refinement.

MS-Celeb-1M. The original MS-Celeb-1M dataset contains about 100k identities with 10 million images. To decrease the noise of MS-Celeb-1M and obtain high-quality training data, we rank all face images of each identity by their distances to the identity centre. For a particular identity, a face image whose feature vector is too far from the identity's feature centre is automatically removed [10]. We further manually check the face images around the threshold of the first automatic step for each identity. Finally, we obtain a dataset which contains 3.8M images of 85k unique identities. To help other researchers reproduce all of the experiments in this paper, we make the refined MS1M dataset publicly available within a binary file, but please cite the original paper [11] and follow the original license [11] when using this dataset. Our contribution here is only training data refinement, not release.
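The automatic step of this refinement can be sketched as ranking the images of one identity by their distance to the identity centre and dropping those beyond a threshold. The sketch below is a simplified assumption of that step: it uses cosine distance to the normalised mean feature, and both the helper name `filter_identity` and the threshold `max_cos_dist=0.5` are hypothetical, since the exact distance and threshold are not stated here.

```python
import numpy as np

def filter_identity(features, max_cos_dist=0.5):
    """Return indices of images kept for one identity, closest to the centre first.

    features: (K, d) embeddings of all images labelled with the same identity.
    max_cos_dist is a hypothetical threshold, not a value from the paper.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    centre = f.mean(axis=0)
    centre /= np.linalg.norm(centre)
    dist = 1.0 - f @ centre                 # cosine distance to the identity centre
    order = np.argsort(dist)                # rank images from closest to farthest
    keep = order[dist[order] <= max_cos_dist]
    return keep, dist
```

Images whose distance lies near the threshold are the natural candidates for the subsequent manual check.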

3.1.2 Validation data

We employ Labelled Faces in the Wild (LFW) [19], Celebrities in Frontal Profile (CFP) [36] and Age Database (AgeDB) [27] as the validation datasets.

LFW [19]. The LFW dataset contains 13,233 web-collected images from 5,749 different identities, with large variations in pose, expression and illumination. Following the standard protocol of unrestricted with labelled outside data, we give the verification accuracy on 6,000 face pairs.

CFP [36]. The CFP dataset consists of 500 subjects, each with 10 frontal and 4 profile images. The evaluation protocol includes frontal-frontal (FF) and frontal-profile (FP) face verification, each having 10 folds with 350 same-person pairs and 350 different-person pairs. In this paper, we only use the most challenging subset, CFP-FP, to report the performance.

AgeDB [27, 10]. The AgeDB dataset is an in-the-wild dataset with large variations in pose, expression, illumination, and age. AgeDB contains 12,240 images of 440 distinct subjects, such as actors, actresses, writers, scientists, and politicians. Each image is annotated with respect to the identity, age and gender attribute. The minimum and maximum ages are 3 and 101, respectively. The average age range for each subject is 49 years. There are four groups of test data with different year gaps (5 years, 10 years, 20 years and 30 years, respectively) [10]. Each group has ten splits of face images, and each split contains 300 positive examples and 300 negative examples. The face verification evaluation metric is the same as LFW. In this paper, we only use the most challenging subset, AgeDB-30, to report the performance.

3.1.3 Test data

MegaFace. MegaFace datasets [21] are released as the largest public available testing benchmark, which aims at evaluating the performance of face recognition algorithms at the million scale of distractors. MegaFace datasets include gallery set and probe set. The gallery set, a subset of Flickr photos from Yahoo, consists of more than one million images from 690k different individuals. The probe sets are two existing databases: FaceScrub [28] and FGNet [1]. FaceScrub is a publicly available dataset that containing 100k photos of unique individuals, in which images are males, and images are females. FGNet is a face ageing dataset, with images from identities. Each identity has multiple face images at different ages (ranging from to ).

It is quite understandable that data collection of MegaFace is very arduous and time-consuming thus data noise is inevitable. For FaceScrub dataset, all of the face images from one particular identity should have the same identity. For the one million distractors, there should not be any overlap with the FaceScrub identities. However, we find noisy face images not only exist in FaceScrub dataset but also exist in the one million distractors, which significantly affect the performance.

In Figure 5, we give the noisy face image examples from the Facesrub dataset. As shown in Figure 8(c), we rank all of the faces according to the cosine distance to the identity centre. In fact, face image 221 and 136 are not Aaron Eckhart. We manually clean the FaceScrub dataset and finally find noisy face images. During testing, we change the noisy face to another right face, which can increase the identification accuracy by about . In Figure 6(b), we give the noisy face image examples from the MegaFace distractors. All of the four face images from the MegaFace distractors are Alec Baldwin. We manually clean the MegaFace distractors and finally find noisy face images. During testing, we add one additional feature dimension to distinguish these noisy faces, which can increase the identification accuracy by about .

Even though the noisy face images are double checked by seven annotators who are very familiar with these celebrities, we still can not promise these images are noisy. We put the noise lists of the FaceScrub dataset and the MegaFace distractors online. We believe the masses have sharp eyes and we will update these lists based on other researchers’ feedback.

(a) Aaron Eckhart
(b) noise face 221
(c) noise face 136
Figure 5: Noisy face image examples from the FaceScrub dataset. In (a), the image id is put in top left and cosine distance to the identity centre is put in bottom left.
(a) Alec Baldwin
(b) Distractors Noise
Figure 6: Noisy face image examples from the MegaFace distractors. (a) is used for annotators to learn the identity from the FaceScrub dataset. (b) shows the selected overlap faces from the MegaFace distractors.

3.2 Network Settings

We first evaluate the face verification performance based on different network settings by using VGG2 as the training data and Softmax as the loss function. All experiments in this paper are implemented in MXNet [8]. We set the batch size as 512 and train models on four or eight NVIDIA Tesla P40 (24GB) GPUs. The learning rate starts from 0.1 and is divided by 10 at the 100k, 140k and 160k iterations. The total number of iterations is set as 200k. We set momentum at 0.9 and weight decay at 5e-4 (Table 5).
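The step schedule for the learning rate can be written as a small helper. The milestones (100k, 140k and 160k iterations) come from the text above; the starting rate `base_lr=0.1` and the decay `factor=0.1` are assumptions made for this sketch, and the helper name is hypothetical.

```python
def learning_rate(iteration, base_lr=0.1, factor=0.1,
                  milestones=(100_000, 140_000, 160_000)):
    """Step schedule: multiply the rate by `factor` at each milestone reached.

    base_lr and factor are assumed values used only for illustration.
    """
    drops = sum(iteration >= step for step in milestones)
    return base_lr * factor ** drops
```

For example, any iteration before 100k uses the starting rate, and the final 40k iterations of a 200k run use the rate after all three drops.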

3.2.1 Input setting

Following [46, 23], we use five facial landmarks (eye centres, nose tip and mouth corners) [49] for similarity transformation to normalise the face images. The faces are cropped and resized to $112 \times 112$, and each pixel (in the range $[0, 255]$) in RGB images is normalised by subtracting 127.5 and then dividing by 128.

As most convolutional networks are designed for the ImageNet [34] classification task, the input image size is usually set as $224 \times 224$ or larger. However, the size of our face crops is only $112 \times 112$. To preserve higher feature map resolution, we use conv$3 \times 3$ and stride $= 1$ in the first convolutional layer instead of conv$7 \times 7$ and stride $= 2$. For these two settings, the output size of the convolutional networks is $7 \times 7$ (denoted as "L" in front of the network names) and $3 \times 3$, respectively.

3.2.2 Output setting

In the last several layers, different options can be investigated to check how the embedding settings affect the model performance. The feature embedding dimension is set to 512 for all options except Option-A, as the embedding size in Option-A is determined by the channel size of the last convolutional layer.

  • Option-A: Use a global pooling (GP) layer.

  • Option-B: Use one fully connected (FC) layer after GP.

  • Option-C: Use FC-Batch Normalisation (BN) [20] after GP.

  • Option-D: Use FC-BN-Parametric Rectified Linear Unit (PReLu) [13] after GP.

  • Option-E: Use BN-Dropout [38]-FC-BN after the last convolutional layer.

During testing, the score is computed by the Cosine Distance of two feature vectors. Nearest neighbour and threshold comparison are used for face identification and verification tasks.
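The test-time scoring can be sketched with cosine similarity between embeddings: verification compares the score of a pair against a threshold, and identification returns the nearest gallery entry. The function names and the threshold value `0.3` are hypothetical; in practice the threshold would be tuned on validation data.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(a, b, threshold=0.3):
    """Same-identity decision by thresholding the score (threshold is illustrative)."""
    return cosine_score(a, b) >= threshold

def identify(probe, gallery):
    """Index of the gallery row with the highest cosine similarity to the probe."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    p = probe / np.linalg.norm(probe)
    return int(np.argmax(g @ p))
```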

3.2.3 Block Setting

Besides the original ResNet [14] unit, we also investigate a more advanced residual unit setting [12] for the training of the face recognition model. In Figure 7, we show the improved residual unit (denoted as "IR" at the end of model names), which has a BN-Conv-BN-PReLu-Conv-BN structure. Compared to the residual unit proposed by [12], we set stride $= 2$ for the second convolutional layer instead of the first one. In addition, PReLu [13] is used to substitute the original ReLu.

Figure 7: Improved residual unit: BN-Conv-BN-PReLu-Conv-BN.

3.2.4 Backbones

Based on recent advances on the model structure designs, we also explore MobileNet [16], Inception-Resnet-V2 [40], Densely connected convolutional networks (DenseNet) [18], Squeeze and excitation networks (SE) [17] and Dual path Network (DPN) [9] for deep face recognition. In this paper, we compare the differences between these networks from the aspects of accuracy, speed and model size.

3.2.5 Network Setting Conclusions

Input selects L. In Table 2, we compare two networks with and without the setting of "L". When using conv$3 \times 3$ and stride $= 1$ as the first convolutional layer, the network output is $7 \times 7$. By contrast, if we use conv$7 \times 7$ and stride $= 2$ as the first convolutional layer, the network output is only $3 \times 3$. It is obvious from Table 2 that choosing larger feature maps during training obtains higher verification accuracy.

Networks LFW CFP-FP AgeDB-30
SE-ResNet50D 99.38 94.58 91.00
SE-LResNet50D 99.6 96.04 92.68
SE-ResNet50E 99.26 94.11 90.85
SE-LResNet50E 99.71 96.38 92.98
Table 2: Verification accuracy (%) under different input settings (Softmax@VGG2).

Output selects E. In Table 3, we give the detailed comparison between different output settings. Option-E (BN-Dropout-FC-BN) obtains the best performance. In this paper, the dropout parameter is set as 0.4. Dropout can effectively act as a regularisation term to avoid over-fitting and obtain better generalisation for deep face recognition.

Networks LFW CFP-FP AgeDB-30
SE-LResNet50A 99.51 95.81 92.60
SE-LResNet50B 99.46 94.90 91.85
SE-LResNet50C 99.56 95.81 92.61
SE-LResNet50D 99.6 96.04 92.68
SE-LResNet50E 99.71 96.38 92.98
SE-LResNet50A-IR 99.58 95.90 92.63
SE-LResNet50D-IR 99.61 96.51 92.68
SE-LResNet50E-IR 99.78 96.82 93.83
Table 3: Verification accuracy (%) under different output settings (Softmax@VGG2).

Block selects IR. In Table 4, we give the comparison between the original residual unit and the improved residual unit. As we can see from the results, the proposed BN-Conv(stride=1)-BN-PReLu-Conv(stride=2)-BN unit can obviously improve the verification performance.

Networks LFW CFP-FP AgeDB-30
SE-LResNet50E 99.71 96.38 92.98
SE-LResNet50E-IR 99.78 96.82 93.83
Table 4: Verification accuracy (%) comparison between the original residual unit and the improved residual unit (Softmax@VGG2).

Backbones Comparisons. In Table 8, we give the verification accuracy, test speed and model size of different backbones. The running time is estimated on the P40 GPU. As the performance on LFW is almost saturated, we focus on the more challenging test sets, CFP-FP and AgeDB-30, to compare these network backbones. The Inception-Resnet-V2 network obtains the best performance with long running time () and largest model size (). By contrast, MobileNet can finish face feature embedding within with a model of , and the performance only drops slightly. As we can see from Table 8, the performance gaps between these large networks, e.g. ResNet-100, Inception-Resnet-V2, DenseNet, DPN and SE-Resnet-100, are relatively small. Based on the trade-off between accuracy, speed and model size, we choose LResNet100E-IR to conduct experiments on the Megaface challenge.

Weight decay. Based on the SE-LResNet50E-IR network, we also explore how the weight decay (WD) value affects the verification performance. As we can see from Table 5, the verification accuracy reaches its highest point on all three test sets when the weight decay value is set to 5e-4. Therefore, we fix the weight decay at 5e-4 in all other experiments.

WD values LFW CFP-FP AgeDB-30
5e-6 99.11 94.52 90.43
5e-5 99.56 95.74 92.95
5e-4 99.78 96.82 93.83
1e-3 99.71 96.60 93.53
Table 5: Verification performance (%) of different weight decay (WD) values (SE-LResNet50E-IR, Softmax@VGG2).

3.3 Loss Setting

Since the margin parameter m plays an important role in the proposed ArcFace, we first conduct experiments to search for the best angular margin. Varying m from 0.2 to 0.8, we use the LMobileNetE network and the ArcFace loss to train models on the refined MS1M dataset. As shown in Table 6, the performance generally improves as m increases from 0.2 and saturates at m = 0.5, where all three test sets reach their best accuracy. Beyond m = 0.5, the verification accuracy starts to decrease. In this paper, we fix the additive angular margin m at 0.5.

m LFW CFP-FP AgeDB-30
0.2 99.23 87.23 95.25
0.3 99.40 88.15 96.00
0.4 99.48 87.85 96.00
0.5 99.50 88.50 96.06
0.6 99.46 87.23 95.68
0.7 99.46 87.48 95.80
0.8 99.40 86.74 95.68
Table 6: Verification performance (%) of ArcFace with different angular margins m (LMobileNetE, ArcFace@MS1M).

Based on the LResNet100E-IR network and the refined MS1M dataset, we compare the performance of different loss functions, e.g. Softmax, SphereFace [23], CosineFace [44, 43] and ArcFace. In Table 7, we give the detailed verification accuracy on the LFW, CFP-FP, and AgeDB-30 datasets. As LFW is almost saturated, the performance improvement there is small. We find that (1) compared to Softmax, SphereFace, CosineFace and ArcFace clearly improve the performance, especially under large pose and age variations. (2) CosineFace and ArcFace clearly outperform SphereFace and are much easier to implement. Both CosineFace and ArcFace converge easily without additional supervision from Softmax. By contrast, additional supervision from Softmax is indispensable for SphereFace to avoid divergence during training. (3) ArcFace is slightly better than CosineFace. Moreover, ArcFace is more intuitive and has a clearer geometric interpretation on the hypersphere manifold, as shown in Figure 1.
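To make the difference between the compared losses concrete, the following is a minimal NumPy sketch of how each method modifies the target-class logit, based on the forms cos(mθ), cos θ - m and cos(θ + m) given in the abstract. The function names, the feature scale s = 64 and the toy inputs are assumptions made for illustration, not values taken from this section. The sketch also uses the plain cos(mθ) form for SphereFace and omits its piecewise extension and the additional Softmax supervision mentioned above.

```python
import numpy as np

def l2_normalise(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def margin_logits(features, weights, labels, loss="arcface", m=0.5, s=64.0):
    """Return scaled logits with a margin applied to the target class only.

    features: (N, d) embeddings, weights: (C, d) class centres,
    labels: (N,) integer class indices. The scale s is an assumed value.
    """
    f = l2_normalise(np.asarray(features, dtype=float))
    w = l2_normalise(np.asarray(weights, dtype=float))
    cos = np.clip(f @ w.T, -1.0, 1.0)          # cos(theta) for every class
    rows = np.arange(len(labels))
    target = cos[rows, labels]
    if loss == "arcface":                      # additive angular margin
        target = np.cos(np.arccos(target) + m)
    elif loss == "cosineface":                 # additive cosine margin
        target = target - m
    elif loss == "sphereface":                 # multiplicative angular margin
        target = np.cos(m * np.arccos(target))
    elif loss != "softmax":                    # "softmax" leaves cos unchanged
        raise ValueError(f"unknown loss: {loss}")
    out = cos.copy()
    out[rows, labels] = target
    return s * out

# Toy example: three classes, two samples close to their own class centre.
W = np.eye(3)
X = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]])
y = np.array([0, 1])
for name, margin in [("softmax", 0.0), ("sphereface", 4),
                     ("cosineface", 0.35), ("arcface", 0.5)]:
    print(name, margin_logits(X, W, y, loss=name, m=margin)[np.arange(2), y])
```

In all three margin variants only the target logit is lowered, so the network must pull each feature closer to its class centre to reach the same loss value. The non-target logits are left untouched.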

Loss LFW CFP-FP AgeDB-30
Softmax 99.7 91.4 95.56
SphereFace (m=4, λ=5) 99.76 93.7 97.56
CosineFace (m=0.35) 99.80 94.4 97.91
ArcFace(m=0.4) 99.80 94.5 98.0
ArcFace(m=0.5) 99.83 94.04 98.08
Table 7: Verification performance (%) for different loss functions (LResNet100E-IR@MS1M).

Backbones LFW (%) CFP-FP (%) AgeDB-30 (%) Speed (ms) Model size (MB)
LResNet50E-IR 99.75 96.58 93.53 8.9 167
SE-LResNet50E-IR 99.78 96.82 93.83 13.0 169
LResNet100E-IR 99.75 96.95 94.4 15.4 250
SE-LResNet100E-IR 99.71 97.01 94.23 23.8 252
LResNet101(Bottle Neck)E-IR 99.76 96.72 93.68 49.2 294
LMobileNetE 99.63 95.81 91.85 4.2 112
LDenseNet161E 99.71 96.51 93.68 29.3 315
LDPN92E 99.71 96.82 94.18 38.1 393
LDPN107E 99.76 96.94 94.9 58.8 581
LInception-ResNet-v2 99.75 97.15 95.35 53.6 642
Table 8: Accuracy (%), speed (ms) and model size (MB) comparison between different backbones (Softmax@VGG2).

3.4 MegaFace Challenge1 on FaceScrub

For the experiments on the MegaFace challenge, we use the LResNet100E-IR network and the refined MS1M dataset as the training data. In Tables 9 and 10, we give the identification and verification results on both the original MegaFace dataset and the refined MegaFace dataset.

In Table 9, we use the whole refined MS1M dataset to train models. We compare the performance of the proposed ArcFace with related baseline methods, e.g. Softmax, Triplet, SphereFace, and CosineFace. The proposed ArcFace obtains the best performance both before and after the distractor refinement. After the overlapping face images are removed from the one million distractors, the identification performance improves significantly. We believe that the results on the manually refined MegaFace dataset are more reliable, and that the performance of face identification under one million distractors is better than previously thought [2].

Methods Rank1@1M VR@FAR=1e-6 Rank1@1M (R) VR@FAR=1e-6 (R)
Softmax 78.89 94.95 91.43 94.95
Softmax-pretrain,Triplet-finetune 80.6 94.65 94.08 95.03
Softmax-pretrain@VGG2, Triplet-finetune 78.87 95.43 93.96 95.07
SphereFace(m=4, λ=5) 82.95 97.66 97.43 97.66
CosineFace(m=0.35) 82.75 98.41 98.33 98.41
ArcFace(m=0.4) 82.29 98.20 98.10 97.83
ArcFace(m=0.5) 83.27 98.48 98.36 98.48
Table 9: Identification and verification results of different methods on MegaFace Challenge1 (LResNet100E-IR@MS1M). “Rank 1” refers to the rank-1 face identification accuracy and “VR” refers to face verification TAR (True Accepted Rate) at FAR (False Accepted Rate). (R) denotes the refined version of MegaFace dataset.
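The two metrics reported in Table 9 can be illustrated with a simplified NumPy sketch. It is not the MegaFace devkit protocol: the function names, the strict "greater than" acceptance rule and the toy scores are assumptions chosen to keep the example short. Rank-1 identification counts a probe as correct when its gallery mate scores higher than every distractor, and TAR@FAR is the fraction of genuine pairs accepted at a threshold that lets through at most the requested fraction of impostor pairs.

```python
import numpy as np

def rank1_accuracy(probes, mates, distractors):
    """Rank-1 identification rate for L2-normalised feature rows.

    probes[i] and mates[i] share an identity; distractors hold other people.
    """
    mate_sim = np.sum(probes * mates, axis=1)               # cosine to the mate
    best_distractor = (probes @ distractors.T).max(axis=1)  # hardest distractor
    return float(np.mean(mate_sim > best_distractor))

def tar_at_far(genuine_scores, impostor_scores, far):
    """True accept rate at a threshold whose false accept rate is <= far."""
    impostor = np.sort(np.asarray(impostor_scores, dtype=float))[::-1]
    k = int(np.floor(far * impostor.size))   # impostor pairs allowed to pass
    threshold = impostor[k] if k < impostor.size else -np.inf
    tar = float(np.mean(np.asarray(genuine_scores, dtype=float) > threshold))
    return tar, threshold

# Toy usage with hand-made similarity scores.
tar, thr = tar_at_far([0.9, 0.8, 0.3], [0.1, 0.2, 0.5, 0.4], far=0.25)
print(f"TAR = {tar:.3f} at threshold {thr:.2f}")
```

With one million distractors and FAR = 1e-6, the threshold is set by a handful of the hardest impostor pairs, which is why overlapping identities among the distractors can change the reported numbers so much.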

To strictly follow the evaluation instructions on MegaFace, we need to remove all of the identities appearing in the FaceScrub dataset from our training data. We calculate the feature centre for each identity in the refined MS1M dataset and the FaceScrub dataset. We find that 578 identities from the refined MS1M dataset have a close distance (cosine similarity is higher than ) with the identities from the FaceScrub dataset. We remove these 578 identities from the refined MS1M dataset and compare the proposed ArcFace to other baseline methods in Table 10. ArcFace still outperforms CosineFace, with a slight performance drop compared to Table 9. For Softmax, however, the identification rate drops markedly from 78.89% to 73.66% after the suspected overlapping identities are removed from the training data. On the refined MegaFace test set, the verification result of CosineFace is slightly higher than that of ArcFace. This is because we read the verification results which are closest to FAR=1e-6 from the outputs of the devkit. As we can see from Figure 8, the proposed ArcFace always outperforms CosineFace under both identification and verification metrics.
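The overlap check described above can be sketched as follows, assuming that per-image features and identity labels are already available for both datasets. The function names are hypothetical, and the similarity threshold is left as a parameter because its value is not given in this text.

```python
import numpy as np

def identity_centres(features, labels):
    """Return (ids, centres): the L2-normalised mean feature per identity."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centres = np.stack([features[labels == i].mean(axis=0) for i in ids])
    centres /= np.linalg.norm(centres, axis=1, keepdims=True)
    return ids, centres

def overlapping_identities(train_feats, train_labels,
                           test_feats, test_labels, threshold):
    """Training identities whose centre is close to any test identity centre.

    threshold is a cosine-similarity cut-off (hypothetical parameter).
    """
    train_ids, train_c = identity_centres(train_feats, train_labels)
    _, test_c = identity_centres(test_feats, test_labels)
    max_sim = (train_c @ test_c.T).max(axis=1)   # best match per train identity
    return train_ids[max_sim > threshold]

# Toy usage: training identity 0 nearly coincides with the single test identity.
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
train_y = [0, 0, 1]
test_x = [[1.0, 0.05]]
test_y = [7]
print(overlapping_identities(train_x, train_y, test_x, test_y, threshold=0.9))
```

Identities returned by such a check are only candidates for overlap, which matches the cautious wording used for the removed identities above.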

Methods Rank1@1M VR@FAR=1e-6 Rank1@1M (R) VR@FAR=1e-6 (R)
Softmax 73.66 91.5 86.37 91.5
CosineFace(m=0.35) 82.49 97.95 97.88 98.07
ArcFace(LMobileNetE,m=0.5) 79.58 93.0 92.65 94.0
ArcFace(LResNet50E-IR,m=0.5) 82.42 97.23 97.39 97.63
ArcFace(LResNet50E-IR,m=0.5) 82.55 98.33 98.06 97.94
Table 10: Identification and verification results of different methods on MegaFace Challenge1 (Methods@ MS1M - FaceScrub). “Rank 1” refers to the rank-1 face identification accuracy and “VR” refers to face verification TAR (True Accepted Rate) at FAR (False Accepted Rate). (R) denotes the refined version of MegaFace dataset.
(a) CMC@Original MegaFace
(b) ROC@Original MegaFace
(c) CMC@Refined MegaFace
(d) ROC@Refined MegaFace
Figure 8: (a) and (c) report CMC curves of different methods with 1M distractors on MegaFace Set 1. (b) and (d) give the ROC curves of different methods with 1M distractors on MegaFace Set 1. (a) and (b) are evaluated on the original MegaFace dataset, while (c) and (d) are evaluated on the refined MegaFace dataset.

3.5 Further Improvement by Triplet Loss

Due to the limitation of GPU memory, it is hard to train Softmax-based methods, e.g. SphereFace, CosineFace and ArcFace, with millions of identities. One practical solution is to employ metric learning methods, the most widely used of which is the Triplet loss [35, 22]. However, the Triplet loss converges relatively slowly. We therefore explore using the Triplet loss to fine-tune existing face recognition models that were trained with Softmax-based methods.

For Triplet loss fine-tuning, we use the LResNet100E-IR network and set learning rate at , momentum at and weight decay at . Table 11 gives the verification accuracy on the AgeDB-30 dataset after Triplet loss fine-tuning. We find that (1) the Softmax model trained on a dataset with fewer identities (e.g. VGG2 with 8,631 identities) can be clearly improved by Triplet loss fine-tuning on a dataset with more identities (e.g. MS1M with 85k identities). This improvement confirms the effectiveness of the two-step training strategy, which can also significantly accelerate the whole model training compared to training with the Triplet loss from scratch. (2) The Softmax model can be further improved by Triplet loss fine-tuning on the same dataset, which indicates that local refinement can improve the global model. (3) The advantage of the margin-improved Softmax methods, e.g. SphereFace, CosineFace, and ArcFace, is kept and further improved by Triplet loss fine-tuning, which also indicates that a local metric learning method such as the Triplet loss is complementary to global hypersphere metric learning methods.

As the margin used in the Triplet loss is defined on the Euclidean distance, we will investigate the Triplet loss with an angular margin in future work.
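For reference, the Euclidean-margin Triplet loss discussed above can be written as a short NumPy sketch. The squared Euclidean distance and the margin value in the toy call are assumptions made for illustration. The fine-tuning hyper-parameters and the triplet sampling strategy used in the experiments are not reproduced here.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin):
    """Mean hinge loss max(d(a, p) - d(a, n) + margin, 0) over a batch.

    d is the squared Euclidean distance between embedding rows;
    margin is a hypothetical hyper-parameter.
    """
    anchor = np.asarray(anchor, dtype=float)
    d_ap = np.sum((anchor - np.asarray(positive, dtype=float)) ** 2, axis=1)
    d_an = np.sum((anchor - np.asarray(negative, dtype=float)) ** 2, axis=1)
    return float(np.mean(np.maximum(d_ap - d_an + margin, 0.0)))

# An easy triplet (negative far away) contributes zero loss,
# while a hard one (negative as close as the positive) is penalised.
print(triplet_loss([[0.0, 0.0]], [[1.0, 0.0]], [[2.0, 0.0]], margin=0.5))
print(triplet_loss([[0.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]], margin=0.5))
```

Because the loss only looks at a few samples at a time, it needs no classification layer over all identities, which is what makes it practical when the number of identities is too large for Softmax-based training.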

Dataset@Loss AgeDB-30
VGG2@Softmax 94.4
VGG2@Softmax, MS1M@Triplet 97.5
MS1M@Softmax 95.56
MS1M@Softmax, MS1M@Triplet 97.16
MS1M@SphereFace 97.56
MS1M@SphereFace, MS1M@Triplet 97.85
MS1M@CosineFace 97.91
MS1M@CosineFace, MS1M@Triplet 97.98
MS1M@ArcFace 98.08
MS1M@ArcFace, MS1M@Triplet 98.15
Table 11: Improvement in verification accuracy (%) on AgeDB-30 from Triplet loss fine-tuning (LResNet100E-IR).

4 Conclusions

In this paper, we contribute to improving deep face recognition through data refinement, network settings and loss function design. We have (1) refined the largest publicly available training dataset (MS1M) and test dataset (MegaFace); (2) explored different network settings and analysed the trade-off between accuracy and speed; (3) proposed a geometrically interpretable loss function called ArcFace and explained why the proposed ArcFace is better than Softmax, SphereFace and CosineFace from the view of semi-hard sample distributions; (4) obtained state-of-the-art performance on the MegaFace dataset in a totally reproducible way.