has significantly advanced the state-of-the-art performance on a wide variety of computer vision tasks, which makes deep CNNs a dominant machine learning approach for computer vision. Face recognition, as one of the most common computer vision tasks, has been extensively studied for decades [37, 45, 22, 19, 20, 40, 2]. Early studies built shallow models with low-level face features, while modern face recognition techniques are largely driven by deep CNNs. Face recognition usually includes two sub-tasks: face verification and face identification. Both tasks involve three stages: face detection, feature extraction, and classification. A deep CNN is able to extract clean high-level features, making it possible to achieve superior performance with a relatively simple classification architecture: usually, a multilayer perceptron network followed by a softmax loss [35, 32]. However, recent studies [42, 24, 23] found that the traditional softmax loss is insufficient to acquire the discriminating power for classification.
To encourage better discriminating performance, many research studies have been carried out [42, 5, 7, 10, 39, 23]. All these studies share the same idea for maximum discrimination capability: maximizing inter-class variance and minimizing intra-class variance. For example, [42, 5, 7, 10, 39] propose to adopt multi-loss learning in order to increase the feature discriminating power. While these methods improve classification performance over the traditional softmax loss, they usually come with some extra limitations. The center loss [42] only explicitly minimizes the intra-class variance while ignoring the inter-class variances, which may result in suboptimal solutions. [5, 7, 10, 39] require carefully designed mining of pair or triplet samples, which is an extremely time-consuming procedure. Very recently, [23] proposed to address this problem from a different perspective. More specifically, [23] (A-Softmax) projects the original Euclidean space of features to an angular space, and introduces an angular margin for larger inter-class variance.
Compared to the Euclidean margin suggested by [42, 5, 10], the angular margin is preferred because the cosine of the angle has intrinsic consistency with softmax. The formulation of cosine matches the similarity measurement that is frequently applied to face recognition. From this perspective, it is more reasonable to directly introduce cosine margin between different classes to improve the cosine-related discriminative information.
In this paper, we reformulate the softmax loss as a cosine loss by normalizing both features and weight vectors to remove radial variations, based on which a cosine margin term is introduced to further maximize the decision margin in the angular space. Specifically, we propose a novel algorithm, dubbed Large Margin Cosine Loss (LMCL), which takes the normalized features as input to learn highly discriminative features by maximizing the inter-class cosine margin. Formally, we define a hyper-parameter $m$ such that the decision boundary is given by $\cos(\theta_1) - m = \cos(\theta_2)$, where $\theta_i$ is the angle between the feature and the weight of class $i$.
For comparison, the decision boundary of A-Softmax is defined over the angular space by $\cos(m\theta_1) = \cos(\theta_2)$, which is difficult to optimize because the cosine function is non-monotonic. To overcome this difficulty, one has to employ an extra trick with an ad-hoc piecewise function for A-Softmax. More importantly, the decision margin of A-Softmax depends on $\theta$, which leads to different margins for different classes. As a result, in the decision space, some inter-class features have a larger margin while others have a smaller margin, which reduces the discriminating power. Unlike A-Softmax, our approach defines the decision margin in the cosine space, thus avoiding the aforementioned shortcomings.
Based on the LMCL, we build a sophisticated deep model called CosFace, as shown in Figure 1. In the training phase, LMCL guides the ConvNet to learn features with a large cosine margin. In the testing phase, the face features are extracted from the ConvNet to perform either face verification or face identification. We summarize the contributions of this work as follows:
(1) We embrace the idea of maximizing inter-class variance and minimizing intra-class variance and propose a novel loss function, called LMCL, to learn highly discriminative deep features for face recognition.
(2) We provide reasonable theoretical analysis based on the hyperspherical feature distribution encouraged by LMCL.
2 Related Work
Deep Face Recognition. Recently, face recognition has achieved significant progress thanks to the great success of deep CNN models [18, 15, 34, 9]. In DeepFace and DeepID, face recognition is treated as a multi-class classification problem, and deep CNN models are first introduced to learn features on large multi-identity datasets. DeepID2 employs identification and verification signals to achieve better feature embedding. The more recent DeepID2+ and DeepID3 further explore advanced network structures to boost recognition performance. FaceNet [29] uses triplet loss to learn a Euclidean space embedding, and a deep CNN is then trained on nearly 200 million face images, leading to state-of-the-art performance. Other approaches [41, 11] also prove the effectiveness of deep CNNs on face recognition.
Loss Functions. The loss function plays an important role in deep feature learning. Contrastive loss [5, 7] and triplet loss [10, 39] are usually used to increase the Euclidean margin for better feature embedding. Wen et al. proposed a center loss to learn centers for deep features of each identity and used the centers to reduce intra-class variance. Liu et al. [24] proposed a large margin softmax (L-Softmax) by adding angular constraints to each identity to improve feature discrimination. Angular softmax (A-Softmax) [23] improves L-Softmax by normalizing the weights, which achieves better performance on a series of open-set face recognition benchmarks [13, 43, 17]. Other loss functions [47, 6, 4, 3] based on contrastive loss or center loss also improve feature discrimination.
Normalization Approaches. Normalization has been studied in recent deep face recognition studies. [23] normalizes the weights, which replaces the inner product with cosine similarity within the softmax loss. [28] applies the $L_2$ constraint on features to embed faces in the normalized space. Note that normalization on feature vectors or weight vectors achieves much lower intra-class angular variability by concentrating more on the angle during training. Hence the angles between identities can be well optimized. The von Mises-Fisher (vMF) based methods [48, 8] and A-Softmax [23] also adopt normalization in feature learning.
3 Proposed Approach
In this section, we first introduce the proposed LMCL in detail (Sec. 3.1). A comparison with other loss functions is then given to show the superiority of the LMCL (Sec. 3.2). The feature normalization technique adopted by the LMCL is further described to clarify its effectiveness (Sec. 3.3). Lastly, we present a theoretical analysis for the proposed LMCL (Sec. 3.4).
3.1 Large Margin Cosine Loss
We start by rethinking the softmax loss from a cosine perspective. The softmax loss separates features from different classes by maximizing the posterior probability of the ground-truth class. Given an input feature vector $x_i$ with its corresponding label $y_i$, the softmax loss can be formulated as:
$$L_s = \frac{1}{N}\sum_{i=1}^{N} -\log p_i = \frac{1}{N}\sum_{i=1}^{N} -\log \frac{e^{f_{y_i}}}{\sum_{j=1}^{C} e^{f_j}}, \tag{1}$$
where $p_i$ denotes the posterior probability of $x_i$ being correctly classified. $N$ is the number of training samples and $C$ is the number of classes. $f_j$ is usually denoted as the activation of a fully-connected layer with weight vector $W_j$ and bias $B_j$. We fix the bias $B_j = 0$ for simplicity, and as a result $f_j$ is given by:
$$f_j = W_j^T x = \|W_j\| \|x\| \cos\theta_j, \tag{2}$$
where $\theta_j$ is the angle between $W_j$ and $x$. This formula suggests that both the norm and the angle of the vectors contribute to the posterior probability.
To develop effective feature learning, the norm of $W$ should be invariable. To this end, we fix $\|W_j\| = 1$ by $L_2$ normalization. In the testing stage, the face recognition score of a testing face pair is usually calculated according to the cosine similarity between the two feature vectors. This suggests that the norm of the feature vector $x$ does not contribute to the scoring function. Thus, in the training stage, we fix $\|x\| = s$. Consequently, the posterior probability relies only on the cosine of the angle. The modified loss can be formulated as
$$L_{ns} = \frac{1}{N}\sum_{i} -\log \frac{e^{s\cos(\theta_{y_i, i})}}{\sum_{j} e^{s\cos(\theta_{j, i})}}. \tag{3}$$
Because we remove variations in radial directions by fixing $\|x\| = s$, the resulting model learns features that are separable in the angular space. We refer to this loss as the Normalized version of Softmax Loss (NSL) in this paper.
However, features learned by the NSL are not sufficiently discriminative because the NSL only emphasizes correct classification. To address this issue, we introduce the cosine margin to the classification boundary, which is naturally incorporated into the cosine formulation of Softmax.
Consider a binary-class scenario as an example, and let $\theta_i$ denote the angle between the learned feature vector and the weight vector of class $C_i$ ($i = 1, 2$). The NSL forces $\cos(\theta_1) > \cos(\theta_2)$ for $C_1$, and similarly for $C_2$, so that features from different classes are correctly classified. To develop a large margin classifier, we further require $\cos(\theta_1) - m > \cos(\theta_2)$ and $\cos(\theta_2) - m > \cos(\theta_1)$, where $m \ge 0$ is a fixed parameter introduced to control the magnitude of the cosine margin. Since $\cos(\theta_i) - m$ is lower than $\cos(\theta_i)$, the constraint is more stringent for classification. The above analysis generalizes directly to the multi-class scenario. Therefore, the altered loss reinforces the discrimination of learned features by encouraging an extra margin in the cosine space.
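Formally, we define the Large Margin Cosine Loss (LMCL) as:
$$L_{lmc} = \frac{1}{N}\sum_{i} -\log \frac{e^{s(\cos(\theta_{y_i, i}) - m)}}{e^{s(\cos(\theta_{y_i, i}) - m)} + \sum_{j \ne y_i} e^{s\cos(\theta_{j, i})}}, \tag{4}$$
subject to
$$W = \frac{W^*}{\|W^*\|}, \quad x = \frac{x^*}{\|x^*\|}, \quad \cos(\theta_j, i) = W_j^T x_i,$$
where $N$ is the number of training samples, $x_i$ is the $i$-th feature vector corresponding to the ground-truth class of $y_i$, $W_j$ is the weight vector of the $j$-th class, and $\theta_j$ is the angle between $W_j$ and $x_i$.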
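The loss described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' Caffe layer: the function names `l2_normalize` and `lmcl_loss` are hypothetical, the defaults `s=64` and `m=0.35` are the values reported later in the experiments, and the forward pass is computed sample by sample for readability rather than speed.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def lmcl_loss(features, labels, weights, s=64.0, m=0.35):
    """Mean large margin cosine loss over a batch (forward pass only).

    features: list of N raw feature vectors
    labels:   list of N ground-truth class indices
    weights:  list of C raw class weight vectors
    """
    unit_w = [l2_normalize(w) for w in weights]
    total = 0.0
    for x, y in zip(features, labels):
        unit_x = l2_normalize(x)
        # cos(theta_j) is the dot product of two unit vectors.
        cos = [sum(a * b for a, b in zip(w, unit_x)) for w in unit_w]
        # Subtract the margin from the target class only, then scale by s.
        logits = [s * (c - m) if j == y else s * c for j, c in enumerate(cos)]
        # Numerically stable log-sum-exp over all classes.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_z - logits[y]
    return total / len(features)
```

Because both features and weights are normalized inside the function, rescaling any input vector leaves the loss unchanged; only the angles, the scale `s`, and the margin `m` matter.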
3.2 Comparison on Different Loss Functions
In this subsection, we compare the decision margin of our method (LMCL) to those of Softmax, NSL, and A-Softmax, as illustrated in Figure 2. For simplicity of analysis, we consider the binary-class scenario with classes $C_1$ and $C_2$. Let $W_1$ and $W_2$ denote weight vectors for $C_1$ and $C_2$, respectively.
Softmax loss defines a decision boundary by:
$$\|W_1\| \cos(\theta_1) = \|W_2\| \cos(\theta_2).$$
Thus, its boundary depends on both magnitudes of weight vectors and cosine of angles, which results in an overlapping decision area (margin 0) in the cosine space. This is illustrated in the first subplot of Figure 2. As noted before, in the testing stage it is a common strategy to only consider cosine similarity between testing feature vectors of faces. Consequently, the trained classifier with the Softmax loss is unable to perfectly classify testing samples in the cosine space.
NSL normalizes weight vectors $W_1$ and $W_2$ such that they have constant magnitude 1, which results in a decision boundary given by:
$$\cos(\theta_1) = \cos(\theta_2).$$
The decision boundary of NSL is illustrated in the second subplot of Figure 2. We can see that by removing radial variations, the NSL is able to perfectly classify testing samples in the cosine space, with margin = 0. However, it is not quite robust to noise because there is no decision margin: any small perturbation around the decision boundary can change the decision.
A-Softmax improves the softmax loss by introducing an extra margin, such that its decision boundary is given by:
$$C_1: \cos(m\theta_1) \ge \cos(\theta_2), \qquad C_2: \cos(m\theta_2) \ge \cos(\theta_1).$$
Thus, for $C_1$ it requires $\theta_1 \le \theta_2 / m$, and similarly for $C_2$. The third subplot of Figure 2 depicts this decision area, where the gray area denotes the decision margin. However, the margin of A-Softmax is not consistent over all $\theta$ values: the margin becomes smaller as $\theta$ reduces, and vanishes completely when $\theta = 0$. This results in two potential issues. First, for difficult classes $C_1$ and $C_2$ which are visually similar and thus have a smaller angle between $W_1$ and $W_2$, the margin is consequently smaller. Second, technically speaking, one has to employ an extra trick with an ad-hoc piecewise function to overcome the non-monotonicity of the cosine function.
LMCL (our proposed method) defines a decision margin in the cosine space rather than the angle space (like A-Softmax) by:
$$C_1: \cos(\theta_1) \ge \cos(\theta_2) + m, \qquad C_2: \cos(\theta_2) \ge \cos(\theta_1) + m.$$
Therefore, $\cos(\theta_1)$ is maximized while $\cos(\theta_2)$ is minimized for $C_1$ (similarly for $C_2$) to perform large-margin classification. The last subplot in Figure 2 illustrates the decision boundary of LMCL in the cosine space, where we can see a clear margin ($\sqrt{2}m$) in the produced distribution of the cosine of the angle. This suggests that the LMCL is more robust than the NSL, because a small perturbation around the decision boundary (dashed line) is less likely to lead to an incorrect decision. The cosine margin is applied consistently to all samples, regardless of the angles of their weight vectors.
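The four decision rules compared in this subsection can be written side by side, which makes the difference between the margins easier to see. The sketch below is illustrative only: `assigned_to_c1` and `a_softmax_angular_margin` are hypothetical helper names, the A-Softmax rule is evaluated directly with the cosine on its monotonic range (a feature with $m\theta_1 > \pi$ is simply rejected, standing in for the piecewise function), and the margin formula assumes a 2-D feature lying between the two weight vectors so that $\theta_1 + \theta_2$ equals the angle between $W_1$ and $W_2$.

```python
import math


def assigned_to_c1(rule, theta1, theta2, m=0.0, w1_norm=1.0, w2_norm=1.0):
    """Return True if a feature at angles (theta1, theta2) is accepted as class C1."""
    if rule == "softmax":
        return w1_norm * math.cos(theta1) >= w2_norm * math.cos(theta2)
    if rule == "nsl":
        return math.cos(theta1) >= math.cos(theta2)
    if rule == "a-softmax":
        # cos(m * theta) is only monotonic while m * theta <= pi.
        if m * theta1 > math.pi:
            return False
        return math.cos(m * theta1) >= math.cos(theta2)
    if rule == "lmcl":
        return math.cos(theta1) >= math.cos(theta2) + m
    raise ValueError(f"unknown rule: {rule}")


def a_softmax_angular_margin(theta_12, m):
    """Angular gap between the two A-Softmax boundaries (2-D assumption).

    theta_12 is the angle between W1 and W2; the boundaries sit at
    theta_12 / (m + 1) from each weight vector.
    """
    return theta_12 * (m - 1) / (m + 1)
```

Under these assumptions the A-Softmax gap is proportional to the angle between the two weight vectors, so it shrinks toward zero for similar classes, whereas the LMCL rule always demands the same cosine gap `m`.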
3.3 Normalization on Features
In the proposed LMCL, a normalization scheme is deliberately used to derive the formulation of the cosine loss and to remove variations in radial directions. Unlike [23], which only normalizes the weight vectors, our approach simultaneously normalizes both weight vectors and feature vectors. As a result, the feature vectors are distributed on a hypersphere, where the scaling parameter $s$ controls the magnitude of the radius. In this subsection, we discuss why feature normalization is necessary and how feature normalization encourages better feature learning in the proposed LMCL approach.
The necessity of feature normalization is presented in two respects. First, the original softmax loss without feature normalization implicitly learns both the Euclidean norm ($L_2$-norm) of feature vectors and the cosine value of the angle. The $L_2$-norm is adaptively learned for minimizing the overall loss, resulting in a relatively weak cosine constraint. In particular, the adaptive $L_2$-norm of easy samples becomes much larger than that of hard samples to remedy the inferior performance of the cosine metric. In contrast, our approach requires the entire set of feature vectors to have the same $L_2$-norm, so that learning depends only on cosine values to develop the discriminative power. Feature vectors from the same classes are clustered together and those from different classes are pulled apart on the surface of the hypersphere. Second, we consider the situation when the model initially starts to minimize the LMCL. Given a feature vector $x$, let $\cos(\theta_i)$ and $\cos(\theta_j)$ denote the cosine scores of the two classes, respectively. Without normalization on features, the LMCL forces $\|x\|(\cos(\theta_i) - m) > \|x\|\cos(\theta_j)$. Note that $\cos(\theta_i)$ and $\cos(\theta_j)$ can be initially comparable with each other. Thus, as long as $(\cos(\theta_i) - m)$ is smaller than $\cos(\theta_j)$, $\|x\|$ is required to decrease for minimizing the loss, which degenerates the optimization. Therefore, feature normalization is critical under the supervision of LMCL, especially when the networks are trained from scratch. Likewise, it is more favorable to fix the scaling parameter $s$ instead of learning it adaptively.
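The degenerate case described in the second point can be checked numerically. The following sketch assumes a two-class setting with unit-norm weights and an unnormalized feature, so that the two logits are the feature norm times the margin-adjusted target cosine and the feature norm times the other cosine; `two_class_margin_loss` is a hypothetical helper written for this illustration.

```python
import math


def two_class_margin_loss(norm_x, cos_i, cos_j, m):
    """Cross-entropy for target class i with logits scaled by the feature norm.

    Logits are norm_x * (cos_i - m) for the target and norm_x * cos_j for the
    other class, which is the unnormalized-feature case discussed above.
    """
    gap = norm_x * (cos_j - (cos_i - m))
    return math.log1p(math.exp(gap))


# Early in training the two cosines are comparable, so the margin is violated
# and shrinking the feature norm lowers the loss without improving the angles.
violated_small = two_class_margin_loss(1.0, cos_i=0.5, cos_j=0.4, m=0.35)
violated_large = two_class_margin_loss(10.0, cos_i=0.5, cos_j=0.4, m=0.35)

# Once the margin is satisfied, a larger norm lowers the loss instead.
satisfied_small = two_class_margin_loss(1.0, cos_i=0.9, cos_j=0.1, m=0.35)
satisfied_large = two_class_margin_loss(10.0, cos_i=0.9, cos_j=0.1, m=0.35)
```

In this toy setting `violated_small` is lower than `violated_large`, which is the shortcut that fixing the feature norm removes.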
Furthermore, the scaling parameter $s$ should be set to a properly large value to yield better-performing features with lower training loss. For NSL, the loss continuously goes down with higher $s$, while too small an $s$ leads to insufficient convergence or even no convergence. For LMCL, we also need an adequately large $s$ to ensure a sufficient hyperspace for feature learning with an expected large margin.
In the following, we show that the parameter $s$ should have a lower bound to obtain the expected classification performance. Given the normalized learned feature vector $x$ and unit weight vector $W$, we denote the total number of classes as $C$. Suppose that the learned feature vectors separately lie on the surface of the hypersphere and center around the corresponding weight vector. Let $P_W$ denote the expected minimum posterior probability of the class center (i.e., $W$). The lower bound of $s$ is then given by (the proof is attached in the supplemental material):
$$s \ge \frac{C-1}{C} \log \frac{(C-1) P_W}{1 - P_W}.$$
Based on this bound, we can infer that $s$ should be enlarged consistently if we expect an optimal $P_W$ for classification with a certain number of classes. Besides, for a fixed $P_W$, the desired $s$ should be larger to deal with more classes, since the growing number of classes increases the difficulty of classification in the relatively compact space. A hypersphere with a large radius $s$ is therefore required for embedding features with small intra-class distance and large inter-class distance.
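To get a feel for the magnitude of this lower bound, it can be evaluated for a given class count and target probability. The sketch below assumes the bound has the form $s \ge \frac{C-1}{C}\log\frac{(C-1)P_W}{1-P_W}$ with a natural logarithm; the function name `min_scale` is hypothetical, and the example inputs (10,575 classes, as in CASIA-WebFace, and a target probability of 0.9) are chosen purely for illustration.

```python
import math


def min_scale(num_classes, p_w):
    """Lower bound on the scaling parameter s.

    num_classes: total number of classes C (at least 2)
    p_w:         expected minimum posterior probability of the class center,
                 strictly between 0 and 1
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    if not 0.0 < p_w < 1.0:
        raise ValueError("p_w must lie strictly between 0 and 1")
    c = num_classes
    return (c - 1) / c * math.log((c - 1) * p_w / (1.0 - p_w))


# Illustrative inputs only: C = 10,575 classes and a target P_W of 0.9.
bound_small_dataset = min_scale(10_575, 0.9)
```

The bound grows with both the number of classes and the target probability, which matches the qualitative discussion above.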
3.4 Theoretical Analysis for LMCL
The preceding subsections essentially discuss the LMCL from the classification point of view. In terms of learning discriminative features on the hypersphere, the cosine margin plays an important part in strengthening the discriminating power of features. A detailed analysis of the feasible range of the cosine margin (i.e., the bound of the hyper-parameter $m$) is therefore necessary. The optimal choice of $m$ potentially leads to more promising learning of highly discriminative face features. In the following, we delve into the decision boundary and angular margin in the feature space to derive the theoretical bound for the hyper-parameter $m$.
First, considering the binary-class case with classes $C_1$ and $C_2$ as before, suppose that the normalized feature vector $x$ is given. Let $W_i$ denote the normalized weight vector, and $\theta_i$ denote the angle between $x$ and $W_i$. For NSL, the decision boundary is defined as $\cos\theta_1 - \cos\theta_2 = 0$, which is equivalent to the angular bisector of $W_1$ and $W_2$, as shown on the left of Figure 3. This shows that the model supervised by NSL partitions the underlying feature space into two adjacent regions, where the features near the boundary are extremely ambiguous (i.e., belonging to either class is acceptable). In contrast, LMCL drives the decision boundary formulated by $\cos\theta_1 - \cos\theta_2 = m$ for $C_1$, in which $\theta_1$ should be much smaller than $\theta_2$ (similarly for $C_2$). Consequently, the inter-class variance is enlarged while the intra-class variance shrinks.
Back to Figure 3, one can observe that the maximum angular margin is subject to the angle between $W_1$ and $W_2$. Accordingly, the cosine margin has a limited range when $W_1$ and $W_2$ are given. Specifically, suppose a scenario in which all the feature vectors belonging to class $i$ exactly overlap with the corresponding weight vector $W_i$ of class $i$. In other words, every feature vector is identical to the weight vector of its class, and the feature space is in an extreme situation where all the feature vectors lie at their class center. In that case, the margin of the decision boundaries has been maximized (i.e., it is the strict upper bound of the cosine margin).
To extend this to the general case, we suppose that all the features are well-separated and we have a total number of $C$ classes. The theoretical range of $m$ is supposed to be: $0 \le m \le (1 - \max(W_i^T W_j))$, where $i, j \le C$, $i \ne j$. The softmax loss tries to maximize the angle between any two weight vectors from two different classes in order to perform perfect classification. Hence, it is clear that the optimal solution for the softmax loss should uniformly distribute the weight vectors on a unit hypersphere. Based on this assumption, the range of the introduced cosine margin $m$ can be inferred as follows (the proof is attached in the supplemental material):
$$
\begin{aligned}
0 \le m &\le 1 - \cos\frac{2\pi}{C}, && (K = 2) \\
0 \le m &\le \frac{C}{C-1}, && (C \le K + 1) \\
0 \le m &\ll \frac{C}{C-1}, && (C > K + 1)
\end{aligned}
$$
where $C$ is the number of training classes and $K$ is the dimension of the learned features. The inequalities indicate that as the number of classes increases, the upper bound of the cosine margin between classes decreases correspondingly. In particular, if the number of classes is much larger than the feature dimension, the upper bound of the cosine margin becomes even smaller.
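The dependence of the margin ceiling on the number of classes and the feature dimension can be tabulated with a small helper. The sketch assumes the bounds take the form $1 - \cos(2\pi/C)$ for 2-D features and $C/(C-1)$ when $C \le K + 1$; for $C > K + 1$ only the loose ceiling $C/(C-1)$ is available, so the function flags that case rather than pretending the value is attainable. The name `max_cosine_margin` is hypothetical.

```python
import math


def max_cosine_margin(num_classes, feature_dim):
    """Return (upper_bound, attainable) for the cosine margin m.

    num_classes: number of training classes C (at least 2)
    feature_dim: dimension K of the learned features (at least 2)
    attainable is False when C > K + 1 and K > 2, where the true limit is
    much smaller than the returned ceiling C / (C - 1).
    """
    c, k = num_classes, feature_dim
    if c < 2 or k < 2:
        raise ValueError("need at least 2 classes and 2 feature dimensions")
    if k == 2:
        # Weight vectors spread evenly on a circle, 2*pi / C apart.
        return 1.0 - math.cos(2.0 * math.pi / c), True
    if c <= k + 1:
        # Weight vectors can form a regular simplex.
        return c / (c - 1), True
    return c / (c - 1), False


toy_bound, toy_attainable = max_cosine_margin(8, 2)
```

For 8 classes with 2-D features the helper returns $1 - \cos(\pi/4)$, roughly 0.29, which is the ceiling quoted for the toy experiment.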
A reasonable choice of a larger $m$ should effectively boost the learning of highly discriminative features. Nevertheless, the parameter $m$ usually cannot reach the theoretical upper bound in practice due to the vanishing of the feature space. That is, all the feature vectors are centered together according to the weight vector of the corresponding class. In fact, the model fails to converge when $m$ is too large, because the cosine constraint (i.e., $\cos\theta_1 - m > \cos\theta_2$ or $\cos\theta_2 - m > \cos\theta_1$ for two classes) becomes stricter and is hard to satisfy. Besides, the cosine constraint with an overly large $m$ makes the training process more sensitive to noisy data. An ever-increasing $m$ starts to degrade the overall performance at some point because of the failure to converge.
We perform a toy experiment to better visualize the features and validate our approach. We select face images from 8 distinct identities containing enough samples to clearly show the feature points on the plot. Several models are trained using the original softmax loss and the proposed LMCL with different settings of $m$. We extract 2-D features of face images for simplicity. As discussed above, $m$ should be no larger than $1 - \cos\frac{\pi}{4}$ (about 0.29), so we set up three choices of $m$ for comparison. As shown in Figure 4, the first row and second row present the feature distributions in Euclidean space and angular space, respectively. We can observe that the original softmax loss produces ambiguity in decision boundaries while the proposed LMCL performs much better. As $m$ increases, the angular margin between different classes is amplified.
4.1 Implementation Details
Preprocessing. Firstly, face area and landmarks are detected by MTCNN  for the entire set of training and testing images. Then, the 5 facial points (two eyes, nose and two mouth corners) are adopted to perform similarity transformation. After that we obtain the cropped faces which are then resized to be . Following [42, 23], each pixel (in [0, 255]) in RGB images is normalized by subtracting 127.5 then dividing by 128.
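The pixel normalization step is a fixed affine map and can be sketched directly. The helper names below are hypothetical, and the image is represented as nested lists (rows of pixels, each pixel a list of channel values) only to keep the example dependency-free.

```python
def normalize_pixel(value):
    """Map an intensity in [0, 255] to roughly [-1, 1]."""
    return (value - 127.5) / 128.0


def normalize_image(image):
    """Apply the same map to every channel of every pixel.

    image: rows of pixels, each pixel a sequence of channel intensities
    """
    return [[[normalize_pixel(c) for c in pixel] for pixel in row] for row in image]
```

With this map, 0 becomes about -0.996 and 255 becomes about 0.996, so inputs are centered without ever reaching exactly -1 or 1.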
Training. For a direct and fair comparison with existing results that use small training datasets (less than 0.5M images and 20K subjects), we train our models on a small training dataset, the publicly available CASIA-WebFace dataset, which contains 0.49M face images from 10,575 subjects. We also use a large training dataset to evaluate the performance of our approach against state-of-the-art results that were likewise obtained with large training datasets on the benchmark face datasets. The large training dataset that we use in this study is composed of several public datasets and a private face dataset, containing about 5M images from more than 90K identities. The training faces are horizontally flipped for data augmentation. In our experiments we remove face images belonging to identities that appear in the testing datasets.
The scaling parameter $s$ in Equation (4) is set to 64 empirically. We use Caffe [14] to implement the modifications of the loss layer and run the models. The CNN models are trained with the SGD algorithm, with a batch size of 64 on 8 GPUs. The weight decay is set to 0.0005. When training on the small dataset, the learning rate is initially 0.1 and is divided by 10 at 16K, 24K, and 28K iterations, and training finishes at 30K iterations. Training on the large dataset terminates at 240K iterations, with the initial learning rate of 0.05 dropped at 80K, 140K, and 200K iterations.
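The two learning-rate schedules can be expressed as a single step function. This is a sketch, not the authors' solver configuration: `step_learning_rate` and the two setting dictionaries are hypothetical names, and the drop factor of 0.1 for the large dataset is an assumption, since the text states the division by 10 only for the small dataset.

```python
def step_learning_rate(iteration, base_lr, milestones, factor=0.1):
    """Learning rate after multiplying by `factor` at every milestone passed."""
    drops = sum(1 for step in milestones if iteration >= step)
    return base_lr * factor ** drops


# Small-dataset schedule: 0.1, divided by 10 at 16K, 24K, 28K; stop at 30K.
SMALL_SCHEDULE = dict(base_lr=0.1, milestones=(16_000, 24_000, 28_000))

# Large-dataset schedule: 0.05, dropped at 80K, 140K, 200K; stop at 240K.
# The factor of 0.1 is assumed here.
LARGE_SCHEDULE = dict(base_lr=0.05, milestones=(80_000, 140_000, 200_000))

lr_at_20k = step_learning_rate(20_000, **SMALL_SCHEDULE)
```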
Testing. At the testing stage, features of the original image and the flipped image are concatenated to compose the final face representation. The cosine similarity of the features is computed as the similarity score. Finally, face verification and identification are conducted by thresholding and ranking the scores. We test our models on several popular public face datasets, including LFW, YTF, and MegaFace [17, 25].
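The testing procedure can be sketched as follows. The `extract` argument is a hypothetical callable standing in for the trained ConvNet (it maps an image to a feature vector), images are nested lists of rows so that a horizontal flip is just a row reversal, and the verification threshold is left to the caller because the text does not specify one.

```python
import math


def hflip(image):
    """Mirror an image left to right (image is a list of pixel rows)."""
    return [row[::-1] for row in image]


def face_representation(image, extract):
    """Concatenate features of the original image and its horizontal flip."""
    return list(extract(image)) + list(extract(hflip(image)))


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_identity(image_a, image_b, extract, threshold):
    """Verification: accept the pair if the similarity reaches the threshold."""
    score = cosine_similarity(
        face_representation(image_a, extract),
        face_representation(image_b, extract),
    )
    return score >= threshold
```

Identification follows the same pattern: compute the score between a probe and every gallery representation, then rank the gallery by score.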
4.2 Exploratory Experiments
Effect of $m$. The margin parameter $m$ plays a key role in LMCL. In this part we conduct an experiment to investigate the effect of $m$. By varying $m$ from 0 to 0.45 (if $m$ is larger than 0.45, the model fails to converge), we use the small training data (CASIA-WebFace) to train our CosFace model and evaluate its performance on the LFW and YTF datasets, as illustrated in Figure 5. We can see that the model without the margin (in this case $m = 0$) leads to the worst performance. As $m$ increases, the accuracies improve consistently on both datasets, and saturate at $m = 0.35$. This demonstrates the effectiveness of the margin $m$. By increasing the margin $m$, the discriminative power of the learned features can be significantly improved. In this study, $m$ is fixed to 0.35 in the subsequent experiments.
Effect of Feature Normalization. To investigate the effect of the feature normalization scheme in our approach, we train our CosFace models on CASIA-WebFace with and without the feature normalization scheme by fixing $m$ to 0.35, and compare their performance on LFW, YTF, and the MegaFace Challenge 1 (MF1). Note that the model trained without normalization is initialized by softmax loss and then supervised by the proposed LMCL. The comparative results are reported in Table 1. The model using the feature normalization scheme consistently outperforms the model without it across the three datasets. As discussed above, feature normalization removes radial variance, and the learned features can be more discriminative in angular space. This experiment verifies this point.
|Normalization||LFW||YTF||MF1 Rank 1||MF1 Veri.|
4.3 Comparison with state-of-the-art loss functions
In this part, we compare the performance of the proposed LMCL with the state-of-the-art loss functions. Following the experimental setting in [23], we train a model with the guidance of the proposed LMCL on CASIA-WebFace using the same 64-layer CNN architecture described in [23]. The experimental comparisons on LFW, YTF and MF1 are reported in Table 2. For a fair comparison, we strictly follow the model structure (a 64-layer ResNet-like CNN) and the detailed experimental settings of SphereFace [23]. As can be seen in Table 2, LMCL consistently achieves competitive results compared to the other losses across the three datasets. In particular, our method not only surpasses the performance of A-Softmax with feature normalization (named A-Softmax-NormFea in Table 2), but also significantly outperforms the other loss functions on YTF and MF1, which demonstrates the effectiveness of LMCL.
|Method||LFW||YTF||MF1 Rank 1||MF1 Veri.|
|Softmax Loss ||97.88||93.1||54.85||65.92|
|Triplet Loss ||98.70||93.4||64.79||78.32|
|L-Softmax Loss ||99.10||94.0||67.12||80.42|
|Softmax+Center Loss ||99.05||94.4||65.49||80.14|
|Method||Protocol||MF1 Rank1||MF1 Veri.|
|DeepSense - Small||Small||70.98||82.85|
|SphereFace - Small||Small||75.76||90.04|
|Beijing FaceAll V2||Small||76.66||77.60|
|Beijing FaceAll Norm 1600||Large||64.80||67.11|
|Google - FaceNet v8||Large||70.49||86.47|
|NTechLAB - facenx large||Large||73.30||85.08|
|Vocord - deepVo V3||Large||91.76||94.96|
|Method||Protocol||MF2 Rank1||MF2 Veri.|
4.4 Overall Benchmark Comparison
4.4.1 Evaluation on LFW and YTF
LFW [13] is a standard face verification testing dataset in unconstrained conditions. It includes 13,233 face images from 5,749 identities collected from the web. We evaluate our model strictly following the standard protocol of unrestricted with labeled outside data, and report the result on the 6,000 testing image pairs. YTF contains 3,425 videos of 1,595 different people. The average length of a video clip is 181.3 frames. All the video sequences were downloaded from YouTube. We follow the unrestricted with labeled outside data protocol and report the result on 5,000 video pairs.
As shown in Table 3, the proposed CosFace achieves state-of-the-art results of 99.73% on LFW and 97.6% on YTF. FaceNet achieves the runner-up performance on LFW with the large scale of the image dataset, which has approximately 200 million face images. In terms of YTF, our model reaches the first place over all other methods.
4.4.2 Evaluation on MegaFace
MegaFace [17, 25] is a very challenging testing benchmark recently released for large-scale face identification and verification, which contains a gallery set and a probe set. The gallery set in MegaFace is composed of more than 1 million face images. The probe set has two existing databases: Facescrub [26] and FGNET [1]. In this study, we use the Facescrub dataset (containing 106,863 face images of 530 celebrities) as the probe set to evaluate the performance of our approach on both MegaFace Challenge 1 and Challenge 2.
MegaFace Challenge 1 (MF1). On the MegaFace Challenge 1 [17], the gallery set incorporates more than 1 million images from 690K individuals collected from Flickr photos. Table 4 summarizes the results of our models trained on two protocols of MegaFace, where the training dataset is regarded as small if it has less than 0.5 million images, and large otherwise. The CosFace approach shows its superiority for both the identification and verification tasks on both protocols.
MegaFace Challenge 2 (MF2). In MegaFace Challenge 2 [25], all the algorithms need to use the training data provided by MegaFace. The training data for MegaFace Challenge 2 contains 4.7 million faces and 672K identities, which corresponds to the large protocol. The gallery set has 1 million images that are different from the Challenge 1 gallery set. Our method takes first place on Challenge 2 (Table 5), setting a new state of the art by a large margin (1.39% on rank-1 identification accuracy and 5.46% on verification performance).
In this paper, we proposed an innovative approach named LMCL to guide deep CNNs to learn highly discriminative face features. We provided a well-formed geometrical and theoretical interpretation to verify the effectiveness of the proposed LMCL. Our approach consistently achieves state-of-the-art results on several face benchmarks. We hope that our explorations on learning discriminative features via LMCL will benefit the face recognition community.
-  FG-NET Aging Database,http://www.fgnet.rsunit.com/.
-  P. Belhumeur, J. P. Hespanha, and D. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):711–720, July 1997.
-  J. Cai, Z. Meng, A. S. Khan, Z. Li, and Y. Tong. Island Loss for Learning Discriminative Features in Facial Expression Recognition. arXiv preprint arXiv:1710.03144, 2017.
-  W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. arXiv preprint arXiv:1704.01719, 2017.
-  S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
-  J. Deng, Y. Zhou, and S. Zafeiriou. Marginal loss for deep face recognition. In Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
-  R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
-  M. A. Hasnat, J. Bohne, J. Milgram, S. Gentric, and L. Chen. von Mises-Fisher Mixture Model-based Deep learning: Application to Face Verification. arXiv preprint arXiv:1706.04264, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, 2015.
-  G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S. Z. Li, and T. Hospedales. When face recognition meets with deep learning: an evaluation of convolutional neural networks for face recognition. In International Conference on Computer Vision Workshops (ICCVW), 2015.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-Excitation Networks. arXiv preprint arXiv:1709.01507, 2017.
-  G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Technical Report 07-49, University of Massachusetts, Amherst, 2007.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 2016 ACM on Multimedia Conference (ACM MM), 2014.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
-  K. Zhang, Z. Zhang, Z. Li and Y. Qiao. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. Signal Processing Letters, 23(10):1499–1503, 2016.
-  I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
-  Z. Li, D. Lin, and X. Tang. Nonparametric discriminant analysis for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:755–761, 2009.
-  Z. Li, W. Liu, D. Lin, and X. Tang. Nonparametric subspace analysis for face recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
-  J. Liu, Y. Deng, T. Bai, Z. Wei, and C. Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv preprint arXiv:1506.07310, 2015.
-  W. Liu, Z. Li, and X. Tang. Spatio-temporal embedding for statistical face recognition from video. In European Conference on Computer Vision (ECCV), 2006.
-  W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. SphereFace: Deep Hypersphere Embedding for Face Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-Margin Softmax Loss for Convolutional Neural Networks. In International Conference on Machine Learning (ICML), 2016.
-  A. Nech and I. Kemelmacher-Shlizerman. Level playing field for million scale face recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  H.-W. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In IEEE International Conference on Image Processing (ICIP), pages 343–347, 2014.
-  O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.
-  R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained Softmax Loss for Discriminative Face Verification. arXiv preprint arXiv:1703.09507, 2017.
-  F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems (NIPS), 2014.
-  Y. Sun, D. Liang, X. Wang, and X. Tang. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
-  Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 2016.
-  M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Conference on Computer Vision and Pattern Recognition (CVPR), 1991.
-  F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. NormFace: Hypersphere Embedding for Face Verification. In Proceedings of the 2017 ACM on Multimedia Conference (ACM MM), 2017.
-  J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  X. Wang and X. Tang. A unified framework for subspace face recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(9):1222–1228, Sept. 2004.
-  Z. Wang, K. He, Y. Fu, R. Feng, Y.-G. Jiang, and X. Xue. Multi-task Deep Neural Network for Joint Face Recognition and Facial Attribute Prediction. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval (ICMR), 2017.
-  Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision (ECCV), pages 499–515, 2016.
-  L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
-  Y. Xiong, W. Liu, D. Zhao, and X. Tang. Face recognition via archetype hull ranking. In IEEE International Conference on Computer Vision (ICCV), 2013.
-  D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
-  X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range Loss for Deep Face Recognition with Long-tail. In International Conference on Computer Vision (ICCV), 2017.
-  X. Zhe, S. Chen, and H. Yan. Directional Statistics-based Deep Metric Learning for Image Classification and Retrieval. arXiv preprint arXiv:1802.09662, 2018.
Appendix A Supplementary Material
This supplementary document provides mathematical details for the derivation of the lower bound of the scaling parameter (Equation 6 in the main paper), and the variable scope of the cosine margin (Equation 7 in the main paper).
Proposition of the Scaling Parameter
Given the normalized learned features $x$ and unit weight vectors $W$, we denote the total number of classes as $C$, where $C > 1$. Suppose that the learned features separately lie on the surface of a hypersphere and center around the corresponding weight vector. Let $P_W$ denote the expected minimum posterior probability of the class center (i.e., $W$). The lower bound of the scaling parameter $s$ is formulated as follows:

$$s \geq \frac{C-1}{C} \ln \frac{(C-1)\,P_W}{1 - P_W}.$$
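Proof. Let $W_i$ denote the $i$-th unit weight vector. For all $i$, we have:

$$\frac{e^{s}}{e^{s} + \sum_{j,\, j \neq i} e^{s W_i^T W_j}} \geq P_W,$$

$$1 + e^{-s} \sum_{j,\, j \neq i} e^{s W_i^T W_j} \leq \frac{1}{P_W},$$

$$\sum_{i=1}^{C} \Big( 1 + e^{-s} \sum_{j,\, j \neq i} e^{s W_i^T W_j} \Big) \leq \frac{C}{P_W},$$

$$1 + \frac{e^{-s}}{C} \sum_{i,j,\, i \neq j} e^{s W_i^T W_j} \leq \frac{1}{P_W}.$$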
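Because $f(x) = e^{s x}$ is a convex function, according to Jensen’s inequality, we obtain:

$$\frac{1}{C(C-1)} \sum_{i,j,\, i \neq j} e^{s W_i^T W_j} \geq \exp\Big( \frac{s}{C(C-1)} \sum_{i,j,\, i \neq j} W_i^T W_j \Big).$$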
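Besides, because every $W_i$ is a unit vector, it is known that

$$\sum_{i,j,\, i \neq j} W_i^T W_j = \Big\| \sum_{i} W_i \Big\|^2 - \sum_{i} \| W_i \|^2 \geq -C.$$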
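Thus, we have:

$$1 + (C-1)\, e^{-\frac{sC}{C-1}} \leq \frac{1}{P_W}.$$

Further simplification yields:

$$s \geq \frac{C-1}{C} \ln \frac{(C-1)\,P_W}{1 - P_W}.$$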
The equality holds if and only if every $W_i^T W_j$ is equal ($i \neq j$), and $\sum_i W_i = 0$.
Because at most $K+1$ unit vectors are able to satisfy this condition in a $K$-dimensional space, the equality holds only when $C \leq K+1$, where $K$ is the dimension of the learned features.
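The bound can be checked numerically in the equality case. The following is a minimal sketch rather than code from the paper: it assumes NumPy, and the helper names `scale_lower_bound`, `simplex_weights`, and `center_posteriors` are hypothetical. It builds $C$ unit weight vectors forming a regular simplex (pairwise cosine $-1/(C-1)$, zero sum), sets $s$ to the lower bound, and evaluates the softmax posterior of each class at its own weight vector, which should then equal the chosen $P_W$.

```python
import numpy as np


def scale_lower_bound(num_classes, p_w):
    """Lower bound on s: (C - 1) / C * ln((C - 1) * P_W / (1 - P_W))."""
    c = num_classes
    return (c - 1) / c * np.log((c - 1) * p_w / (1.0 - p_w))


def simplex_weights(num_classes):
    """C unit vectors forming a regular simplex (illustrative construction).

    Rows of the centered identity matrix sum to zero and, once normalized,
    have pairwise cosine -1 / (C - 1).
    """
    centered = np.eye(num_classes) - 1.0 / num_classes
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)


def center_posteriors(weights, s):
    """Softmax posterior of each class evaluated at its own weight vector."""
    logits = s * (weights @ weights.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return np.diag(probs)


C, P_W = 10, 0.9  # assumed example values
s_min = scale_lower_bound(C, P_W)
W = simplex_weights(C)
print("s lower bound:", s_min)
print("posterior at class centers:", center_posteriors(W, s_min))
print("posterior with a smaller s:", center_posteriors(W, 0.5 * s_min).min())
```

With a scale below the bound, the posterior at the class centers drops below $P_W$, which is the behavior the proposition describes.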
Proposition of the Cosine Margin
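Suppose that the weight vectors are uniformly distributed on a unit hypersphere. The variable scope of the introduced cosine margin $m$ is formulated as follows:

$$\begin{aligned}
0 \leq m &\leq 1 - \cos\frac{2\pi}{C}, && (K = 2) \\
0 \leq m &\leq \frac{C}{C-1}, && (K > 2,\ C \leq K+1) \\
0 \leq m &\ll \frac{C}{C-1}, && (K > 2,\ C > K+1)
\end{aligned}$$

where $C$ is the total number of training classes and $K$ is the dimension of the learned features.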
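Proof. For $K = 2$, the weight vectors uniformly spread on a unit circle. Hence, $\max(W_i^T W_j) = \cos\frac{2\pi}{C}$. It follows that $0 \leq m \leq 1 - \max(W_i^T W_j) = 1 - \cos\frac{2\pi}{C}$.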
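For $K > 2$, the inequality below holds:

$$C(C-1) \max(W_i^T W_j) \geq \sum_{i,j,\, i \neq j} W_i^T W_j = \Big\| \sum_{i} W_i \Big\|^2 - \sum_{i} \| W_i \|^2 \geq -C.$$

Therefore, $\max(W_i^T W_j) \geq \frac{-1}{C-1}$, and we have $0 \leq m \leq 1 - \max(W_i^T W_j) \leq \frac{C}{C-1}$.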
Similarly, the equality holds if and only if every $W_i^T W_j$ is equal ($i \neq j$), and $\sum_i W_i = 0$. As discussed above, this is satisfied only if $C \leq K+1$. Under this condition, the distance between the vertices of any two weight vectors $W_i$ and $W_j$ is the same. In other words, they form a regular simplex, such as an equilateral triangle if $C = 3$, or a regular tetrahedron if $C = 4$.
For the case of $C > K+1$, the equality cannot be satisfied, and a strict upper bound cannot be formulated. Hence, we obtain $0 \leq m \ll \frac{C}{C-1}$. Because the number of classes can be much larger than the feature dimension, the equality cannot hold in practice.
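The three cases can be illustrated with a small numerical sketch. This is an assumption-laden illustration, not an experiment from the paper: it uses NumPy, the helper names are hypothetical, and the random weight vectors in the last case are simply one arbitrary configuration with $C > K+1$. In each case the largest admissible margin is computed as $1 - \max(W_i^T W_j)$ over distinct pairs.

```python
import numpy as np


def max_pairwise_cosine(weights):
    """Largest cosine similarity between two distinct weight vectors."""
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = unit @ unit.T
    np.fill_diagonal(cos, -np.inf)  # ignore each vector's similarity to itself
    return cos.max()


def circle_weights(num_classes):
    """K = 2: C unit vectors evenly spaced on the unit circle."""
    angles = 2.0 * np.pi * np.arange(num_classes) / num_classes
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)


def simplex_weights(num_classes):
    """C <= K + 1: C unit vectors forming a regular simplex."""
    centered = np.eye(num_classes) - 1.0 / num_classes
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)


C = 8  # assumed example value
m_circle = 1.0 - max_pairwise_cosine(circle_weights(C))
m_simplex = 1.0 - max_pairwise_cosine(simplex_weights(C))

# C > K + 1: eight arbitrary directions in K = 3 dimensions (fixed seed).
rng = np.random.default_rng(0)
m_crowded = 1.0 - max_pairwise_cosine(rng.standard_normal((C, 3)))

print("K = 2 bound, 1 - cos(2*pi/C):", m_circle)
print("C <= K + 1 bound, C / (C - 1):", m_simplex)
print("C > K + 1, arbitrary weights:", m_crowded)
```

The simplex configuration attains $C/(C-1)$, while the crowded configuration stays strictly below it, consistent with the statement that the equality cannot hold when $C > K+1$.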