Recent years have witnessed the breakthrough of deep Convolutional Neural Networks (CNNs) [17, 12, 25, 35] in significantly improving the performance of one-to-one face verification and one-to-many face identification tasks. The successes of deep face CNNs can be mainly credited to three factors: enormous training data, deep neural network architectures [10, 33] and effective loss functions [28, 21, 7]. Modern face datasets, such as LFW, CASIA-WebFace, MS1M and MegaFace [24, 16], contain a huge number of identities, which enables the training of deep networks. A number of recent studies, such as DeepFace, DeepID2, DeepID3, VGGFace and FaceNet, demonstrated that properly designed network architectures also lead to improved performance.
Apart from large-scale training data and deep structures, training losses also play key roles in learning accurate face recognition models [41, 6, 11]. Unlike image classification tasks, face recognition is essentially an open-set recognition problem, where the testing categories (identities) are generally different from those used in training. To handle this challenge, most deep learning based face recognition approaches [31, 32, 36] utilize CNNs to extract feature representations from facial images, and adopt a metric (usually the cosine distance) to estimate the similarities between pairs of faces during inference.
However, such an inference metric is not well considered in methods with the softmax cross-entropy loss function (denoted “softmax loss” for short in the remaining sections), which train the networks with the softmax loss but perform inference using cosine similarities. To mitigate the gap between training and testing, recent works [21, 28, 39, 8] directly optimized cosine-based softmax losses. Moreover, angular margin-based terms [19, 18, 40, 38, 7] are usually integrated into cosine-based losses to maximize the angular margins between different identities. These methods improve face recognition performance in the open-set setup. In spite of their successes, the training processes of cosine-based losses (and their variants introducing margins) are usually tricky and unstable. The convergence and performance highly depend on the hyperparameter settings of the loss, which are determined empirically through a large number of trials. In addition, subtle changes of these hyperparameters may cause the entire training process to fail.
In this paper, we investigate state-of-the-art cosine-based softmax losses [28, 40, 7], especially those aiming at maximizing angular margins, to understand how they provide supervision for training deep neural networks. Each of these functions generally includes several hyperparameters, which have substantial impact on the final performance and are usually difficult to tune. One has to repeat training multiple times with different settings to achieve optimal performance. Our analysis shows that different hyperparameters in those cosine-based losses actually have similar effects on controlling the samples’ predicted class probabilities. Improper hyperparameter settings cause the loss functions to provide insufficient supervision for optimizing networks.
Based on the above observation, we propose an adaptive cosine-based loss function, AdaCos, which automatically tunes hyperparameters and generates more effective supervision during training. The proposed AdaCos dynamically scales the cosine similarities between training samples and corresponding class center vectors (the fully-connected layer's weight vectors before softmax), making their predicted class probabilities meet the semantic meaning of these cosine similarities. Furthermore, AdaCos can be easily implemented using built-in functions from prevailing deep learning libraries [26, 1, 5, 15]. The proposed AdaCos loss leads to faster and more stable convergence for training without introducing additional computational overhead.
2 Related Works
Cosine similarities for inference. For learning deep face representations, feature-normalized losses are commonly adopted to enhance the recognition accuracy. Coco loss [20, 21] and NormFace studied the effect of normalization and proposed two strategies by reformulating softmax loss and metric learning. Similarly, Ranjan et al. also discussed this problem and applied normalization on learned feature vectors to restrict them to lie on a hypersphere. Moreover, compared with these hard normalization approaches, ring loss came up with a soft feature normalization approach with convex formulations.
Margin-based softmax loss. Earlier, most face recognition approaches utilized metric-targeted loss functions, such as triplet and contrastive loss, which utilize Euclidean distances to measure similarities between features. Taking advantage of these works, center loss and range loss were proposed to reduce intra-class variations via minimizing distances within each class. Following this, researchers found that constraining the margin in Euclidean space is insufficient to achieve optimal generalization. Angular-margin based loss functions were then proposed to tackle the problem. Angular constraints were integrated into the softmax loss function to improve the learned face representation by L-softmax and A-softmax. CosFace, AM-softmax and ArcFace directly maximized angular margins and employed simpler and more intuitive loss functions compared with the aforementioned methods.
Automatic hyperparameter tuning. The performance of an algorithm highly depends on hyperparameter settings. Grid and random search are the most widely used strategies. For more automatic tuning, sequential model-based global optimization is the mainstream choice. Typically, it performs inference with several hyperparameter settings, and chooses the setting for the next round of testing based on the inference results. Bayesian optimization and the tree-structured Parzen estimator approach are two well-known sequential model-based methods. However, these algorithms essentially run multiple trials to predict the optimized hyperparameter settings.
3 Investigation of hyperparameters in cosine-based softmax losses
In recent years, state-of-the-art cosine-based softmax losses, including L2-softmax, CosFace and ArcFace, have significantly improved the performance of deep face recognition. However, the final performance of these losses is substantially affected by their hyperparameter settings, which are generally difficult to tune and require multiple trials in practice. We analyze the two most important hyperparameters in cosine-based losses, the scaling parameter $s$ and the margin parameter $m$. Specifically, we study in depth their effects on the prediction probabilities after softmax, which serve as supervision signals for updating the entire neural network.
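Let $\vec{x}_i$ denote the deep representation (feature) of the $i$-th face image of the current mini-batch with size $N$, and $y_i$ be the corresponding label. The predicted classification probability $P_{i,j}$ of all $N$ samples in the mini-batch can be estimated by the softmax function as
$$P_{i,j} = \frac{e^{f_{i,j}}}{\sum_{k=1}^{C} e^{f_{i,k}}}, \tag{1}$$
where $f_{i,j}$ is the logit used as the input of softmax, $P_{i,j}$ represents its softmax-normalized probability of assigning $\vec{x}_i$ to class $j$, and $C$ is the number of classes. The cross-entropy loss associated with the current mini-batch is
$$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N} \log P_{i,y_i} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{f_{i,y_i}}}{\sum_{k=1}^{C} e^{f_{i,k}}}. \tag{2}$$
Conventional softmax loss and state-of-the-art cosine-based softmax losses [28, 40, 7] calculate the logits $f_{i,j}$ in different ways. In conventional softmax loss, logits are obtained as the inner product between the feature $\vec{x}_i$ and the $j$-th class weights $\vec{W}_j$ as $f_{i,j} = \vec{W}_j^{T}\vec{x}_i$. In the cosine-based softmax losses [28, 40, 7], the cosine similarity is calculated by $\cos\theta_{i,j} = \langle \vec{x}_i, \vec{W}_j \rangle / (\|\vec{x}_i\| \|\vec{W}_j\|)$. The logits are calculated as $f_{i,j} = s \cdot \cos\theta_{i,j}$, where $s$ is a scale hyperparameter. To enforce an angular margin on the representations, ArcFace modified the loss to the form
$$f_{i,j} = s \cdot \cos\left(\theta_{i,j} + \mathbb{1}\{j = y_i\} \cdot m\right), \tag{3}$$
while CosFace uses
$$f_{i,j} = s \cdot \left(\cos\theta_{i,j} - \mathbb{1}\{j = y_i\} \cdot m\right), \tag{4}$$
where $m$ is the margin. The indicator function $\mathbb{1}\{j = y_i\}$ returns $1$ when $j = y_i$ and $0$ otherwise. All margin-based variants decrease the logit $f_{i,y_i}$ associated with the correct class by applying the margin $m$. Compared with the losses without margin, margin-based variants require $f_{i,y_i}$ to be greater than the other $f_{i,j}$, $j \neq y_i$, by a specified $m$.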
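As a minimal sketch of how the three logit definitions differ, the following pure-Python snippet computes the logits and softmax probabilities of a single sample from its per-class cosines. The function names, the `kind` switch, and the toy cosine, scale and margin values are illustrative assumptions and do not come from any released implementation.

```python
import math

def cosine_logits(cos_theta, label, s, m=0.0, kind="none"):
    """Return the logits of one sample given cos(theta_j) for every class j.

    kind="none"    : plain cosine logits, s * cos(theta_j)
    kind="arcface" : additive angular margin on the target class
    kind="cosface" : additive cosine margin on the target class
    """
    logits = []
    for j, c in enumerate(cos_theta):
        if j == label and kind == "arcface":
            # Clamp before acos to guard against rounding outside [-1, 1].
            theta = math.acos(max(-1.0, min(1.0, c)))
            c = math.cos(theta + m)
        elif j == label and kind == "cosface":
            c = c - m
        logits.append(s * c)
    return logits

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(f - top) for f in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-class example: the sample is closest to class 0.
cos_theta = [0.8, 0.1, -0.2]
label = 0
p_plain = softmax(cosine_logits(cos_theta, label, s=10.0))
p_arc = softmax(cosine_logits(cos_theta, label, s=10.0, m=0.5, kind="arcface"))
p_cos = softmax(cosine_logits(cos_theta, label, s=10.0, m=0.35, kind="cosface"))
print(p_plain[label], p_arc[label], p_cos[label])
```

With the same cosines, both margin variants assign a lower probability to the ground-truth class than the margin-free logits, which is what makes their supervision stricter.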
Intuitively, on one hand, the parameter $s$ scales up the narrow range of cosine distances, making the logits more discriminative. On the other hand, the parameter $m$ enlarges the margin between different classes to enhance classification ability. These hyperparameters eventually affect $P_{i,y_i}$. Empirically, an ideal hyperparameter setting should help $P_{i,y_i}$ satisfy the following two properties: (1) the predicted probabilities $P_{i,y_i}$ of each class (identity) should span the range $[0, 1]$: the lower boundary of $P_{i,y_i}$ should be near $0$ while the upper boundary should be near $1$; (2) the curve of $P_{i,y_i}$ should have large absolute gradients around $\theta_{i,y_i}$ to make training effective.
3.1 Effects of the scale parameter
The scale parameter $s$ can significantly affect $P_{i,y_i}$. Intuitively, $P_{i,y_i}$ should gradually increase from $0$ to $1$ as the angle $\theta_{i,y_i}$ decreases from $\frac{\pi}{2}$ to $0$ (mathematically, $\theta_{i,y_i}$ can be any value in $[0, \pi]$; we empirically found, however, that the maximum $\theta$ is always around $\frac{\pi}{2}$, see the red curve in Fig. 1 for examples), i.e., the smaller the angle between $\vec{x}_i$ and its corresponding class weight $\vec{W}_{y_i}$ is, the larger the probability $P_{i,y_i}$ should be. Both an improper probability range and improper probability curves w.r.t. $\theta_{i,y_i}$ would negatively affect the training process and thus the recognition performance.
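We first study the range of the classification probability $P_{i,j}$. Given the scale parameter $s$, the range of probabilities in all cosine-based softmax losses is
$$\frac{1}{1 + (C-1)\cdot e^{s}} \le P_{i,j} \le \frac{e^{s}}{e^{s} + (C-1)}, \tag{5}$$
where the lower boundary is achieved when $f_{i,j} = s \cdot 0 = 0$ and $f_{i,k} = s \cdot 1 = s$ for all $k \neq j$ in Eq. (1). Similarly, the upper bound is achieved when $f_{i,j} = s$ and $f_{i,k} = 0$ for all $k \neq j$. The range of $P_{i,j}$ approaches 1 when $s \to \infty$, i.e.,
$$\lim_{s \to +\infty} \left( \frac{e^{s}}{e^{s} + (C-1)} - \frac{1}{1 + (C-1)\cdot e^{s}} \right) = 1, \tag{6}$$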
which means that the requirement of the range spanning $[0, 1]$ could be satisfied with a large $s$. However, it does not mean that the larger the scale parameter, the better the selection is. In fact, the probability range can easily approach a high value even at modest settings of the class number $C$ and the scale parameter $s$. But an oversized scale would lead to a poor probability distribution, as will be discussed in the following paragraphs.
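To make the dependence of the probability range on the scale concrete, the sketch below evaluates the two bounds for a few scales. It assumes, as observed empirically in the text, that all cosines lie in $[0, 1]$, so that logits lie in $[0, s]$; the helper name `probability_range` and the values of $s$ and $C$ are chosen only for illustration.

```python
import math

def probability_range(s, num_classes):
    """Lower and upper bound of a softmax probability when logits lie in [0, s].

    lower: target logit 0, all other logits s
    upper: target logit s, all other logits 0
    """
    growth = math.exp(s)
    lower = 1.0 / (1.0 + (num_classes - 1) * growth)
    upper = growth / (growth + (num_classes - 1))
    return lower, upper

# Hypothetical settings: 10 classes, three different scales.
for s in (1.0, 5.0, 30.0):
    lower, upper = probability_range(s, num_classes=10)
    print(f"s={s:5.1f}  lower={lower:.2e}  upper={upper:.4f}  width={upper - lower:.4f}")
```

Under these assumptions the width of the range already exceeds 0.9 at $s = 5$ with 10 classes, so a wide range alone does not justify an arbitrarily large scale.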
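We investigate the influences of parameter $s$ by taking $P_{i,y_i}$ as a function of $s$ and the angle $\theta_{i,y_i}$, where $y_i$ denotes the label of $\vec{x}_i$. Formally, we have
$$P_{i,y_i} = \frac{e^{f_{i,y_i}}}{e^{f_{i,y_i}} + B_i} = \frac{e^{s\cdot\cos\theta_{i,y_i}}}{e^{s\cdot\cos\theta_{i,y_i}} + B_i}, \tag{7}$$
where $B_i = \sum_{k \neq y_i} e^{f_{i,k}} = \sum_{k \neq y_i} e^{s\cdot\cos\theta_{i,k}}$ is the summation of the exponentiated logits of all non-corresponding classes for feature $\vec{x}_i$. We observe that the values of $B_i$ are almost unchanged during the training process. This is because the angles $\theta_{i,k}$ for non-corresponding classes $k \neq y_i$ always stay around $\frac{\pi}{2}$ during training (see red curve in Fig. 1).
Therefore, we can assume $B_i$ is constant, i.e., $B_i \approx \sum_{k \neq y_i} e^{s\cdot\cos(\pi/2)} = C - 1$. We then plot curves of probabilities $P_{i,y_i}$ w.r.t. $\theta_{i,y_i}$ under different settings of parameter $s$ in Fig. 2(a). It is obvious that when $s$ is too small relative to the class/identity number $C$, the maximal value of $P_{i,y_i}$ cannot reach $1$. This is undesirable because even when the network is very confident on a sample $\vec{x}_i$’s corresponding class label $y_i$, e.g. $\theta_{i,y_i} = 0$, the loss function would still penalize the classification results and update the network.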
On the other hand, when $s$ is too large, the probability curve $P_{i,y_i}$ w.r.t. $\theta_{i,y_i}$ is also problematic. It would output a very high probability even when $\theta_{i,y_i}$ is close to $\frac{\pi}{2}$, which means that the loss function with a large $s$ may fail to penalize mis-classified samples and cannot effectively update the networks to correct mistakes.
In summary, the scaling parameter $s$ has substantial influence on the range as well as the curves of the probabilities $P_{i,y_i}$, which are crucial for effectively training the deep network.
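The two failure modes can be checked numerically with a short sketch of the ground-truth probability as a function of the angle, under the approximation that the non-target logits sum to $C - 1$. The function name `prob_correct`, the class count of 2000 and the two scales are hypothetical values picked for illustration only.

```python
import math

def prob_correct(theta, s, num_classes):
    """P of the ground-truth class for target angle theta (radians).

    Assumes all non-target angles are near pi/2, so the non-target
    exp-logits sum to roughly (num_classes - 1).
    """
    b = num_classes - 1
    # e^{z} / (e^{z} + b) rewritten as 1 / (1 + b * e^{-z}) for stability.
    return 1.0 / (1.0 + b * math.exp(-s * math.cos(theta)))

num_classes = 2000  # hypothetical identity count

# Scale too small: even a perfectly aligned sample (theta = 0) stays below 1.
p_small_scale = prob_correct(0.0, s=10.0, num_classes=num_classes)

# Scale too large: a sample about 80 degrees away is still rated as confident.
p_large_scale = prob_correct(1.4, s=64.0, num_classes=num_classes)

print(f"s=10, theta=0   -> P={p_small_scale:.3f}")
print(f"s=64, theta=1.4 -> P={p_large_scale:.3f}")
```

In this toy setting the small scale caps the best achievable probability noticeably below 1, while the large scale gives a nearly orthogonal sample a probability above 0.9.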
3.2 Effects of the margin parameter
In this section, we investigate the effect of the margin parameter $m$ in cosine-based softmax losses (Eqs. (3) & (4)) on the feature $\vec{x}_i$’s predicted class probability $P_{i,y_i}$. For simplicity, we here study the margin parameter $m$ for ArcFace (Eq. (3)); similar conclusions also apply to the parameter $m$ in CosFace (Eq. (4)).
We first re-write the classification probability $P_{i,y_i}$ following Eq. (7) as
$$P_{i,y_i} = \frac{e^{f_{i,y_i}}}{e^{f_{i,y_i}} + B_i} = \frac{e^{s\cdot\cos(\theta_{i,y_i} + m)}}{e^{s\cdot\cos(\theta_{i,y_i} + m)} + B_i}. \tag{8}$$
To study the influence of parameter $m$ on the probability $P_{i,y_i}$, we assume both $s$ and $B_i$ are fixed. Following the discussion in Section 3.1, we set $B_i \approx C - 1$ and fix $s$. The probability curves $P_{i,y_i}$ w.r.t. $\theta_{i,y_i}$ under different $m$ are shown in Fig. 2(b).
According to Fig. 2(b), increasing the margin parameter shifts the probability curves to the left. Thus, with the same $\theta_{i,y_i}$, larger margin parameters lead to lower probabilities $P_{i,y_i}$ and thus larger loss even with small angles $\theta_{i,y_i}$. In other words, the angle $\theta_{i,y_i}$ between the feature $\vec{x}_i$ and its corresponding class’s weights $\vec{W}_{y_i}$ has to be very small for sample $i$ to be correctly classified. This is the reason why margin-based losses provide stronger supervision for the same $\theta_{i,y_i}$ than conventional cosine-based losses. Proper margin settings have been shown to boost the final recognition performance in [40, 7].
Although a larger margin $m$ provides stronger supervision, it should not be too large either. When $m$ is oversized, the probabilities $P_{i,y_i}$ become unreliable: the loss would output probabilities around $0$ even when $\theta_{i,y_i}$ is very small. This leads to a large loss for almost all samples, even those with very small sample-to-class angles, which makes the training difficult to converge. In previous methods, the margin parameter selection is an ad-hoc procedure and has no theoretical guidance in most cases.
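The leftward shift caused by the margin can be illustrated with a small sketch of the ArcFace-style probability at a fixed angle. It again assumes that the non-target exp-logits sum to $C - 1$; the function name and the values of $s$, $C$, the angle and the margins are hypothetical and serve only as an example.

```python
import math

def prob_with_margin(theta, s, m, num_classes):
    """Ground-truth probability with an additive angular margin m (radians).

    Assumes the non-target exp-logits sum to (num_classes - 1).
    """
    b = num_classes - 1
    return 1.0 / (1.0 + b * math.exp(-s * math.cos(theta + m)))

s = 30.0             # hypothetical fixed scale
num_classes = 10000  # hypothetical identity count
theta = 0.5          # target angle in radians (about 29 degrees)

probs = {m: prob_with_margin(theta, s, m, num_classes) for m in (0.0, 0.5, 1.0)}
for m, p in probs.items():
    print(f"m={m:.1f} -> P={p:.4f}")
```

For this fixed angle the probability falls as the margin grows and collapses towards 0 at the largest margin, which mirrors the observation that an oversized margin yields a large loss even for well-aligned samples.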
3.3 Summary of the hyperparameter study
According to our analysis, we can draw the following conclusions:
(1) Hyperparameters scale $s$ and margin $m$ can substantially influence the prediction probability $P_{i,y_i}$ of feature $\vec{x}_i$ with ground-truth identity/category $y_i$. For the scale parameter $s$, a too small $s$ would limit the maximal value of $P_{i,y_i}$. On the other hand, a too large $s$ would make most predicted probabilities $P_{i,y_i}$ close to $1$, which makes the training loss insensitive to the correctness of $\theta_{i,y_i}$. For the margin parameter $m$, a too small margin is not strong enough to regularize the final angular margin, while an oversized margin makes the training difficult to converge.
(2) The effect of scale $s$ and margin $m$ can be unified to modulate the mapping from cosine distances $\cos\theta_{i,y_i}$ to the prediction probability $P_{i,y_i}$. As shown in Fig. 2(a) and Fig. 2(b), both small scales and large margins have a similar effect on $\theta_{i,y_i}$ for strengthening the supervision, while both large scales and small margins weaken the supervision. Therefore it is feasible and promising to control the probability $P_{i,y_i}$ using one single hyperparameter, either $s$ or $m$. Considering the fact that $s$ is more related to the range of $P_{i,y_i}$ that is required to span $[0, 1]$, we will focus on automatically tuning the scale parameter $s$ in the remainder of this paper.
4 The cosine-based softmax loss with adaptive scaling
Based on our previous studies on the hyperparameters of the cosine-based softmax loss functions, in this section, we propose a novel loss with a self-adaptive scaling scheme, namely AdaCos, which does not require ad-hoc and time-consuming manual parameter tuning. Training with the proposed loss not only facilitates convergence but also results in higher recognition accuracy.
Our previous studies on Fig. 1 show that during the training process, the angles $\theta_{i,k}$ for $k \neq y_i$ between the feature $\vec{x}_i$ and its non-corresponding weights $\vec{W}_{k}$ are almost always close to $\frac{\pi}{2}$. In other words, we could safely assume that $B_i \approx \sum_{k \neq y_i} e^{s\cdot\cos(\pi/2)} = C - 1$ in Eq. (7). Obviously, it is the probability $P_{i,y_i}$ of feature $\vec{x}_i$ belonging to its corresponding class $y_i$ that has the most influence on supervision for network training. Therefore, we focus on designing an adaptive scale parameter for controlling the probabilities $P_{i,y_i}$.
From the curves of $P_{i,y_i}$ w.r.t. $\theta_{i,y_i}$ (Fig. 2(a)), we observe that the scale parameter $s$ not only affects the boundary of $P_{i,y_i}$ that determines correct/incorrect classification but also squeezes/stretches the curvature of the curve; in contrast to scale $s$, the margin parameter $m$ only shifts the curve in phase. We therefore propose to automatically tune the scale parameter $s$ and eliminate the margin parameter $m$ from our loss function, which makes our proposed AdaCos loss different from state-of-the-art softmax loss variants with angular margin. With the softmax function, the predicted probability can be defined by
$$P_{i,j} = \frac{e^{\tilde{s}\cdot\cos\theta_{i,j}}}{\sum_{k=1}^{C} e^{\tilde{s}\cdot\cos\theta_{i,k}}}, \tag{9}$$
where $\tilde{s}$ is the automatically tuned scale parameter to be discussed below.
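Let us first re-consider $P_{i,y_i}$ (Eq. (7)) as a function of $\theta_{i,y_i}$. Note that $\theta_{i,y_i}$ represents the angle between sample $\vec{x}_i$ and the weight vector $\vec{W}_{y_i}$ of its ground truth category $y_i$. For network training, we hope to minimize $\theta_{i,y_i}$ with the supervision from the loss function $\mathcal{L}_{CE}$. Our objective is to choose a suitable scale $\tilde{s}$ which makes the predicted probability $P_{i,y_i}$ change significantly with respect to $\theta_{i,y_i}$. Mathematically, we find the point where the absolute gradient value $\left\| \frac{\partial P_{i,y_i}(\theta)}{\partial \theta} \right\|$ reaches its maximum, when the second-order derivative of $P_{i,y_i}$ at $\theta_0$ equals $0$, i.e.,
$$\frac{\partial^2 P_{i,y_i}(\theta_0)}{\partial \theta_0^2} = 0, \tag{10}$$
where $\theta_0 \in [0, \frac{\pi}{2}]$. Combining Eqs. (7) and (10), we obtain the relation between the scale parameter and the point $\theta_0$ as
$$s_0 = \frac{\log B_i}{\cos\theta_0}, \tag{11}$$
where $B_i$ can be well approximated as $B_i = \sum_{k \neq y_i} e^{s\cdot\cos\theta_{i,k}} \approx C - 1$ since the angles $\theta_{i,k}$ distribute around $\frac{\pi}{2}$ during training (see Eq. (7) and Fig. 1). Then the task of automatically determining $\tilde{s}$ reduces to selecting a reasonable central angle $\tilde{\theta}$ in $[0, \frac{\pi}{2}]$.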
4.1 Automatically choosing a fixed scale parameter
Since $\frac{\pi}{4}$ is in the center of $[0, \frac{\pi}{2}]$, it is natural to regard $\frac{\pi}{4}$ as the point $\theta_0$, i.e. setting $\theta_0 = \frac{\pi}{4}$, for figuring out an effective mapping from the angle $\theta_{i,y_i}$ to the probability $P_{i,y_i}$. Then the supervision determined by $P_{i,y_i}$ would be back-propagated to update $\theta_{i,y_i}$ and further to update network parameters. According to Eq. (11), we can estimate the corresponding scale parameter $\tilde{s}_f$ as
$$\tilde{s}_f = \frac{\log B_i}{\cos\frac{\pi}{4}} = \frac{\log \sum_{k \neq y_i} e^{s\cdot\cos\theta_{i,k}}}{\cos\frac{\pi}{4}} \approx \sqrt{2}\cdot\log(C-1), \tag{12}$$
where $B_i$ is approximated by $C - 1$.
Such an automatically-chosen fixed scale parameter $\tilde{s}_f$ (see Figs. 2(a) and 2(b)) depends on the number of classes $C$ in the training set and also provides a good guideline for existing cosine distance based softmax losses to choose their scale parameters. In contrast, the scaling parameters in existing methods were manually set according to human experience. It acts as a good baseline method for our dynamically tuned scale parameter $\tilde{s}_d$ in the next section.
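The fixed scale can be computed in one line, and a quick numerical check shows what it buys: with the approximation $B_i \approx C - 1$, the ground-truth probability is exactly one half at the central angle $\frac{\pi}{4}$. The sketch below assumes that approximation; the function names and the class count are hypothetical.

```python
import math

def fixed_scale(num_classes):
    """Fixed scale sqrt(2) * log(C - 1), assuming B_i is about C - 1."""
    return math.sqrt(2.0) * math.log(num_classes - 1)

def prob_correct(theta, s, num_classes):
    """Ground-truth probability for target angle theta, with B_i = C - 1."""
    b = num_classes - 1
    return 1.0 / (1.0 + b * math.exp(-s * math.cos(theta)))

num_classes = 10000  # hypothetical identity count
s_f = fixed_scale(num_classes)
p_center = prob_correct(math.pi / 4.0, s_f, num_classes)
print(f"s_f = {s_f:.3f}, P(pi/4) = {p_center:.3f}")
```

Placing the half-probability point at $\frac{\pi}{4}$ keeps the steep part of the curve inside the range of angles that actually occurs during training, regardless of the number of classes.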
4.2 Dynamically adaptive scale parameter
As Fig. 1 shows, the angles $\theta_{i,y_i}$ between features $\vec{x}_i$ and their ground-truth class weights $\vec{W}_{y_i}$ gradually decrease as the training iterations increase, while the angles between features $\vec{x}_i$ and non-corresponding classes $\vec{W}_{j \neq y_i}$ stabilize around $\frac{\pi}{2}$.
Although our previously fixed scale parameter $\tilde{s}_f$ behaves properly as $\theta_{i,y_i}$ changes over $[0, \frac{\pi}{2}]$, it does not take into account the fact that $\theta_{i,y_i}$ gradually decreases during training. Since a smaller $\theta_{i,y_i}$ gains a higher probability $P_{i,y_i}$ and thus gradually receives weaker supervision as the training proceeds, we propose a dynamically adaptive scale parameter $\tilde{s}_d$ to gradually apply a stricter requirement on the position of $\theta_0$, which can progressively enhance the supervision throughout the training process.
Formally we introduce a modulating indicator variable $\theta_{med}^{(t)}$, which is the median of all corresponding classes’ angles, $\theta_{i,y_i}^{(t)}$, from the mini-batch of size $N$ at the $t$-th iteration. $\theta_{med}^{(t)}$ roughly represents the current network’s degree of optimization on the mini-batch. When the median angle is large, it denotes that the network parameters are far from optimum and less strict supervision should be applied to make the training converge more stably; when the median angle is small, it denotes that the network is close to optimum and stricter supervision should be applied to make the intra-class angles $\theta_{i,y_i}$ become even smaller. Based on this observation, we set the central angle $\tilde{\theta}_0^{(t)} = \theta_{med}^{(t)}$. We also introduce $B_{avg}^{(t)}$ as the average of $B_i^{(t)}$ as
$$B_{avg}^{(t)} = \frac{1}{N}\sum_{i \in \mathcal{N}^{(t)}} B_i^{(t)} = \frac{1}{N}\sum_{i \in \mathcal{N}^{(t)}} \sum_{k \neq y_i} e^{\tilde{s}_d^{(t-1)}\cdot\cos\theta_{i,k}}, \tag{13}$$
where $\mathcal{N}^{(t)}$ denotes the face identity indices in the mini-batch at the $t$-th iteration. Unlike approximating $B_i \approx C - 1$ for the fixed adaptive scale parameter $\tilde{s}_f$, here we estimate $B_{avg}^{(t)}$ using the scale parameter $\tilde{s}_d^{(t-1)}$ of the previous iteration, which provides a more accurate approximation. Be reminded that $B_{avg}^{(t)}$ also includes the dynamic scale $\tilde{s}_d^{(t)}$. We could obtain it by solving the nonlinear function given by the above equation. In practice, we notice that $\tilde{s}_d^{(t)}$ changes very little between iterations. So, we just use $\tilde{s}_d^{(t-1)}$ to calculate $B_{avg}^{(t)}$ with Eq. (7). Then we can obtain the dynamic scale $\tilde{s}_d^{(t)}$ directly with Eq. (11). So we have:
$$\tilde{s}_d^{(t)} = \frac{\log B_{avg}^{(t)}}{\cos\theta_{med}^{(t)}}, \tag{14}$$
where $B_{avg}^{(t)}$ is related to the dynamic scale parameter. We estimate it using the scale parameter $\tilde{s}_d^{(t-1)}$ of the previous iteration.
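At the beginning of the training process, the median angle $\theta_{med}^{(t)}$ of each mini-batch might be too large to impose enough supervision for training. We therefore force the central angle $\tilde{\theta}_0^{(t)}$ to be no greater than $\frac{\pi}{4}$. Our dynamic scale parameter for the $t$-th iteration could then be formulated as
$$\tilde{s}_d^{(t)} = \begin{cases} \sqrt{2}\cdot\log(C-1), & t = 0, \\ \dfrac{\log B_{avg}^{(t)}}{\cos\left(\min\left(\frac{\pi}{4},\, \theta_{med}^{(t)}\right)\right)}, & t \geq 1, \end{cases} \tag{15}$$
where $\tilde{s}_d^{(0)}$ is initialized as our fixed scale parameter $\tilde{s}_f$ when $t = 0$.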
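Substituting $\tilde{s}_d^{(t)}$ into $f_{i,j} = \tilde{s}_d^{(t)}\cdot\cos\theta_{i,j}$, the corresponding gradients can be calculated as follows
$$\frac{\partial \mathcal{L}(\vec{x}_i)}{\partial \vec{x}_i} = \sum_{j=1}^{C}\left(P_{i,j}^{(t)} - \mathbb{1}(y_i = j)\right)\cdot\tilde{s}_d^{(t)}\,\frac{\partial \cos\theta_{i,j}}{\partial \vec{x}_i}, \qquad \frac{\partial \mathcal{L}(\vec{W}_j)}{\partial \vec{W}_j} = \left(P_{i,j}^{(t)} - \mathbb{1}(y_i = j)\right)\cdot\tilde{s}_d^{(t)}\,\frac{\partial \cos\theta_{i,j}}{\partial \vec{W}_j}, \tag{16}$$
where $\mathbb{1}$ is the indicator function and
$$P_{i,j}^{(t)} = \frac{e^{\tilde{s}_d^{(t)}\cdot\cos\theta_{i,j}}}{\sum_{k=1}^{C} e^{\tilde{s}_d^{(t)}\cdot\cos\theta_{i,k}}}. \tag{17}$$
Eq. (17) shows that the dynamically adaptive scale parameter $\tilde{s}_d^{(t)}$ influences classification probabilities differently at each iteration and also effectively affects the gradients (Eq. (16)) for updating network parameters. The benefit of dynamic AdaCos is that it can produce a reasonable scale parameter by sensing the training convergence of the model in the current iteration.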
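One iteration of the dynamic scale update described above can be sketched in plain Python as follows. This is an illustrative reading of the update rule rather than a reference implementation: the function name `update_dynamic_scale`, the list-of-rows input layout and the toy batch are assumptions, and a real training loop would compute the cosines from normalized features and class weights on the accelerator.

```python
import math
from statistics import median

def update_dynamic_scale(cos_theta, labels, prev_scale):
    """Return the scale for the current iteration.

    cos_theta  : list of rows, one per sample, holding cos(theta) to every class
    labels     : ground-truth class index of each sample
    prev_scale : scale used at the previous iteration
    """
    n = len(cos_theta)
    # B_avg: batch mean of the summed non-target exp-logits, evaluated
    # with the previous iteration's scale.
    b_avg = sum(
        math.exp(prev_scale * c)
        for row, y in zip(cos_theta, labels)
        for k, c in enumerate(row)
        if k != y
    ) / n
    # Median angle between each sample and its ground-truth class weight.
    theta_med = median(
        math.acos(max(-1.0, min(1.0, row[y])))
        for row, y in zip(cos_theta, labels)
    )
    # The central angle is capped at pi/4.
    theta_0 = min(math.pi / 4.0, theta_med)
    return math.log(b_avg) / math.cos(theta_0)

# Hypothetical 3-class toy batch: every sample is perfectly aligned with its
# own class and orthogonal to the others.
toy_cos = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_labels = [0, 1, 2]
initial_scale = math.sqrt(2.0) * math.log(3 - 1)  # fixed scale used at t = 0
new_scale = update_dynamic_scale(toy_cos, toy_labels, initial_scale)
print(initial_scale, new_scale)
```

In this toy batch the median target angle is 0, so the updated scale drops from $\sqrt{2}\log 2$ to $\log 2$, matching the described behaviour that the scale shrinks as the intra-class angles shrink.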
5 Experiments
We examine the proposed AdaCos loss function on several public face recognition benchmarks and compare it with state-of-the-art cosine-based softmax losses. The compared losses include L2-softmax, CosFace, and ArcFace. We present evaluation results on LFW, the MegaFace 1-million Challenge, and IJB-C data. We also present results of some exploratory experiments to show the convergence speed and robustness against low-resolution images.
Preprocessing. We use two public training datasets, CASIA-WebFace  and MS1M , to train CNN models with our proposed loss functions. We carefully clean the noisy and low-quality images from the datasets. The cleaned WebFace  and MS1M  contain about M and M facial images, respectively. All models are trained based on these training data and directly tested on the test splits of the three datasets. RSA  is applied to the images to extract facial areas. Then, according to detected facial landmarks, the faces are aligned through similarity transformation and resized to the size . All image pixel values are subtracted with the mean and dividing by .
5.1 Results on LFW
The LFW dataset collected thousands of identities from the internet. Its testing protocol contains about images for about identities with a total of ground-truth matches. Half of the matches are positive while the other half are negative ones. LFW’s primary difficulties lie in face pose variations, color jittering, illumination variations and aging of persons. Note that a portion of the pose variations can be eliminated by the RSA facial landmark detection and alignment algorithm, but there still exist some non-frontal facial images which cannot be aligned by RSA and are then aligned manually.
5.1.1 Comparison on LFW
For all experiments on LFW , we train ResNet-50 models  with batch size of on the cleaned WebFace  dataset. The input size of facial image is and the feature dimension input into the loss function is . Different loss functions are compared with our proposed AdaCos losses.
Results in Table 1 show the recognition accuracies of models trained with different softmax loss functions. Our proposed AdaCos losses with fixed and dynamic scale parameters (denoted as Fixed AdaCos and Dyna. AdaCos) surpass the state-of-the-art cosine-based softmax losses under the same training configuration. For the hyperparameter settings of the compared losses, the scaling parameter is set as for L2-softmax, CosFace and ArcFace; the margin parameters are set as and for CosFace and ArcFace, respectively. Since LFW is a relatively easy evaluation set, we train and test all losses three times. The average accuracy of our proposed dynamic AdaCos is higher than state-of-the-art ArcFace and than L2-softmax.
5.1.2 Exploratory Experiments
The change of scale parameters and feature angles during training. In this part, we show the change of the scale parameter $\tilde{s}_d^{(t)}$ and the feature angles $\theta_{i,j}$ during training with our proposed AdaCos loss. The scale parameter $\tilde{s}_d^{(t)}$ changes along with the current recognition performance of the model, which continuously strengthens the supervision by gradually reducing $\theta_{i,y_i}$ and thus shrinking $\tilde{s}_d^{(t)}$. Fig. 3 shows the change of the scale parameter with our proposed fixed AdaCos and dynamic AdaCos losses. For the dynamic AdaCos loss, the scale parameter $\tilde{s}_d^{(t)}$ adaptively decreases as the training iterations increase, which indicates that the loss function provides stricter supervision to update network parameters. Fig. 4 illustrates the change of $\theta_{i,j}$ under our proposed dynamic AdaCos and L2-softmax. The average (orange curve) and median (green curve) of $\theta_{i,y_i}$, which indicates the angle between a sample and its ground-truth category, gradually reduce, while the average (maroon curve) of $\theta_{i,j}$ where $j \neq y_i$ remains nearly $\frac{\pi}{2}$. Compared with the L2-softmax loss, our proposed loss achieves much smaller sample-feature-to-category angles on the ground-truth classes and leads to higher recognition accuracies.
Convergence rates. Convergence rate is an important indicator of the efficiency of loss functions. We examine the convergence rates of several cosine-based losses at different training iterations. The training configurations are the same as in Table 1. Results in Table 2 reveal that the convergence rates when training with the AdaCos losses are much higher.
5.2 Results on MegaFace
[Table residue: column header “Size of MegaFace Distractor”; table body not recovered.]
We then evaluate the performance of proposed AdaCos on the MegaFace Challenge , which is a publicly available identification benchmark, widely used to test the performance of facial recognition algorithms. The gallery set of MegaFace incorporates over million images from K identities collected from Flickr photos . We follow ArcFace ’s testing protocol, which cleaned the dataset to make the results more reliable. We train the same Inception-ResNet  models with CASIA-WebFace  and MS1M  training data, where overlapped subjects are removed.
5.3 Results on IJB-C 1:1 verification protocol
[Table residue: column header “True Accept Rate @ False Accept Rate”, row label “Crystal Loss”; table body not recovered.]
The IJB-C dataset  contains about identities with a total of still facial images and unconstrained video frames. In the 1:1 verification, there are positive matches and negative matches, which allow us to evaluate TARs at various FARs (e.g., ).
We compare the softmax loss functions, including the proposed AdaCos, L2-softmax, CosFace, and ArcFace, with the same training data (WebFace and MS1M) and network architecture (Inception-ResNet). We also report the results of FaceNet and VGGFace listed in Crystal loss. Table 4 and Fig. 6 exhibit their performances on the IJB-C 1:1 verification. Our proposed dynamic AdaCos achieves the best performance.
6 Conclusions
In this work, we argue that the bottleneck of existing cosine-based softmax losses may primarily come from the mismatch between the cosine distance $\cos\theta_{i,y_i}$ and the classification probability $P_{i,y_i}$, which limits the final recognition performance. To address this issue, we first analyze in depth the effects of hyperparameters in cosine-based softmax losses from the perspective of probability. Based on this analysis, we propose AdaCos, which automatically adjusts an adaptive parameter $\tilde{s}_d^{(t)}$ in order to reformulate the mapping between cosine distance and classification probability. Our proposed AdaCos loss is simple yet effective. We demonstrate its effectiveness and efficiency by exploratory experiments and report its state-of-the-art performance on several public benchmarks.
Acknowledgements. This work is supported in part by SenseTime Group Limited, in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14213616, CUHK14208417, CUHK14239816, in part by CUHK Direct Grant, and in part by National Natural Science Foundation of China (61472410) and the Joint Lab of CAS-HK.
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
-  Peter N. Belhumeur, João P Hespanha, and David J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on pattern analysis and machine intelligence, 19(7):711–720, 1997.
-  James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
-  James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pages 2546–2554, 2011.
-  Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
-  Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE, 2005.
-  Jiankang Deng, Jia Guo, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
-  Siddharth Gopal and Yiming Yang. Von mises-fisher clustering models. In International Conference on Machine Learning, pages 154–162, 2014.
-  Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
-  Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
-  Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
-  Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
-  Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, 2017.
-  Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, pages 507–516, 2016.
-  Yu Liu, Hongyang Li, and Xiaogang Wang. Learning deep features via congenerous cosine loss for person recognition. arXiv preprint arXiv:1702.06890, 2017.
-  Yu Liu, Hongyang Li, and Xiaogang Wang. Rethinking feature discrimination and polymerization for large-scale recognition. arXiv preprint arXiv:1710.00870, 2017.
-  Yu Liu, Hongyang Li, Junjie Yan, Fangyin Wei, Xiaogang Wang, and Xiaoou Tang. Recurrent scale approximation for object detection in cnn. In IEEE International Conference on Computer Vision, 2017.
-  Brianna Maze, Jocelyn Adams, James A Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K Jain, W Tyler Niggel, Janet Anderson, Jordan Cheney, et al. Iarpa janus benchmark–c: Face dataset and protocol. In 11th IAPR International Conference on Biometrics, 2018.
-  Aaron Nech and Ira Kemelmacher-Shlizerman. Level playing field for million scale face recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3406–3415. IEEE, 2017.
-  Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
-  Rajeev Ranjan, Ankan Bansal, Hongyu Xu, Swami Sankaranarayanan, Jun-Cheng Chen, Carlos D Castillo, and Rama Chellappa. Crystal loss and quality pooling for unconstrained face verification and recognition. arXiv preprint arXiv:1804.01159, 2018.
-  Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
-  Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
-  Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
-  Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pages 1988–1996, 2014.
-  Yi Sun, Ding Liang, Xiaogang Wang, and Xiaoou Tang. Deepid3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
-  Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
-  Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
-  Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
-  Feng Wang, Weiyang Liu, Haijun Liu, and Jian Cheng. Additive margin softmax for face verification. arXiv preprint arXiv:1801.05599, 2018.
-  Feng Wang, Xiang Xiang, Jian Cheng, and Alan L Yuille. Normface: hypersphere embedding for face verification. arXiv preprint arXiv:1704.06369, 2017.
-  Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Zhifeng Li, Dihong Gong, Jingchao Zhou, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. arXiv preprint arXiv:1801.09414, 2018.
-  Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
-  Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
-  Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
-  Xiao Zhang, Zhiyuan Fang, Yandong Wen, Zhifeng Li, and Yu Qiao. Range loss for deep face recognition with long-tailed training data. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  Yutong Zheng, Dipan K Pal, and Marios Savvides. Ring loss: Convex feature normalization for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5089–5097, 2018.