1 Introduction
Recent years have witnessed the breakthrough of deep Convolutional Neural Networks (CNNs) [17, 12, 25, 35] in significantly improving the performance of one-to-one face verification and one-to-many face identification. The success of deep face CNNs can mainly be credited to three factors: enormous training data [9], deep network architectures [10, 33] and effective loss functions [28, 21, 7]. Modern face datasets, such as LFW [13], CASIA-WebFace [43], MS1M [9] and MegaFace [24, 16], contain huge numbers of identities, which enables the training of deep networks. A number of recent studies, such as DeepFace [36], DeepID2 [31], DeepID3 [32], VGGFace [25] and FaceNet [29], demonstrated that properly designed network architectures also lead to improved performance.

Apart from large-scale training data and deep structures, training losses also play a key role in learning accurate face recognition models [41, 6, 11]. Unlike image classification, face recognition is essentially an open-set recognition problem, where the testing categories (identities) are generally different from those used in training. To handle this challenge, most deep learning-based face recognition approaches [31, 32, 36] utilize CNNs to extract feature representations from facial images and adopt a metric (usually the cosine distance) to estimate the similarities between pairs of faces during inference.
However, this inference metric is not well considered in methods trained with the softmax cross-entropy loss (denoted as "softmax loss" for short in the remaining sections), which train the networks with the softmax loss but perform inference using cosine similarities. To mitigate the gap between training and testing, recent works [21, 28, 39, 8] directly optimize cosine-based softmax losses. Moreover, angular margin-based terms [19, 18, 40, 38, 7] are usually integrated into cosine-based losses to maximize the angular margins between different identities. These methods improve face recognition performance in the open-set setup. In spite of their successes, the training processes of cosine-based losses (and their margin-based variants) are usually tricky and unstable: convergence and performance highly depend on the hyperparameter settings of the loss, which are determined empirically through a large number of trials, and subtle changes of these hyperparameters may fail the entire training process.

In this paper, we investigate state-of-the-art cosine-based softmax losses [28, 40, 7], especially those aiming at maximizing angular margins, to understand how they provide supervision for training deep neural networks. Each of these functions generally includes several hyperparameters, which have substantial impact on the final performance and are usually difficult to tune; one has to repeat training with different settings multiple times to achieve optimal performance. Our analysis shows that the different hyperparameters in those cosine-based losses actually have similar effects on controlling the samples' predicted class probabilities, and that improper hyperparameter settings cause the loss functions to provide insufficient supervision for optimizing the networks.
Based on the above observations, we propose an adaptive cosine-based loss function, AdaCos, which automatically tunes its hyperparameter and generates more effective supervision during training. The proposed AdaCos dynamically scales the cosine similarities between training samples and the corresponding class center vectors (the fully-connected layer's weight vectors before softmax), so that the predicted class probabilities match the semantic meaning of these cosine similarities. Furthermore, AdaCos can be easily implemented using built-in functions of prevailing deep learning libraries [26, 1, 5, 15]. The proposed AdaCos loss leads to faster and more stable convergence without introducing additional computational overhead.

2 Related Works
Cosine similarities for inference. For learning deep face representations, feature-normalized losses are commonly adopted to enhance recognition accuracy. The COCO loss [20, 21] and NormFace [39] studied the effect of normalization and proposed two strategies by reformulating the softmax loss and metric learning. Similarly, Ranjan et al. [28] also discussed this problem and applied normalization to the learned feature vectors to restrict them to lie on a hypersphere. Moreover, compared with these hard normalization schemes, ring loss [45] proposed a soft feature normalization approach with a convex formulation.
Margin-based softmax loss. Earlier face recognition approaches mostly utilized metric-targeted loss functions, such as the triplet loss [41] and the contrastive loss [6], which use Euclidean distances to measure similarities between features. Building on these works, the center loss [42] and the range loss [44] were proposed to reduce intra-class variations by minimizing distances within each class [2]. Researchers subsequently found that constraining margins in Euclidean space is insufficient for optimal generalization, and angular-margin-based loss functions were proposed to tackle the problem. Angular constraints were integrated into the softmax loss to improve the learned face representations by L-Softmax [19] and A-Softmax [18]. CosFace [40], AM-Softmax [38] and ArcFace [7] directly maximize angular margins and employ simpler and more intuitive loss functions than the aforementioned methods.
Automatic hyperparameter tuning. The performance of an algorithm highly depends on its hyperparameter settings. Grid search and random search [3] are the most widely used strategies. For more automated tuning, sequential model-based global optimization [14] is the mainstream choice: it evaluates several hyperparameter settings and, based on the results, chooses the settings for the next round of trials. Bayesian optimization [30] and the tree-structured Parzen estimator approach [4] are two well-known sequential model-based methods. However, these algorithms essentially run multiple trials to predict promising hyperparameter settings.
3 Investigation of hyperparameters in cosine-based softmax losses
In recent years, state-of-the-art cosine-based softmax losses, including L2-softmax [28], CosFace [40] and ArcFace [7], have significantly improved the performance of deep face recognition. However, the final performance of these losses is substantially affected by their hyperparameter settings, which are generally difficult to tune and require multiple trials in practice. We analyze the two most important hyperparameters in cosine-based losses, the scale parameter $s$ and the margin parameter $m$. Specifically, we study their effects on the prediction probabilities after softmax, which serve as the supervision signals for updating the entire network.
Let $\vec{x}_i$ denote the deep representation (feature) of the $i$-th face image in the current mini-batch of size $N$, and let $y_i$ be its corresponding label. The predicted classification probabilities of all samples in the mini-batch can be estimated by the softmax function as
$$P_{i,j} = \frac{e^{f_{i,j}}}{\sum_{k=1}^{C} e^{f_{i,k}}}, \qquad (1)$$
where $f_{i,j}$ is the logit used as the input of softmax, $P_{i,j}$ represents the softmax-normalized probability of assigning $\vec{x}_i$ to class $j$, and $C$ is the number of classes. The cross-entropy loss associated with the current mini-batch is
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log P_{i,y_i}. \qquad (2)$$
The conventional softmax loss and the state-of-the-art cosine-based softmax losses [28, 40, 7] calculate the logits $f_{i,j}$ in different ways. In the conventional softmax loss, the logits are obtained as the inner product between the feature $\vec{x}_i$ and the $j$-th class weights, i.e., $f_{i,j} = \vec{W}_j^{T}\vec{x}_i$. In the cosine-based softmax losses [28, 40, 7], the cosine similarity is calculated as $\cos\theta_{i,j} = \vec{W}_j^{T}\vec{x}_i / (\|\vec{W}_j\|\,\|\vec{x}_i\|)$, and the logits are computed as $f_{i,j} = s \cdot \cos\theta_{i,j}$, where $s$ is a scale hyperparameter. To enforce an angular margin on the representations, ArcFace [7] modifies the logits to the form
$$f_{i,j} = s \cdot \cos\!\left(\theta_{i,j} + \mathbb{1}(j = y_i)\, m\right), \qquad (3)$$
while CosFace [40] uses
$$f_{i,j} = s \cdot \left(\cos\theta_{i,j} - \mathbb{1}(j = y_i)\, m\right), \qquad (4)$$
where $m$ is the margin. The indicator function $\mathbb{1}(j = y_i)$ returns $1$ when $j = y_i$ and $0$ otherwise. All margin-based variants decrease the logit $f_{i,y_i}$ associated with the correct class by subtracting the margin $m$. Compared with the losses without a margin, the margin-based variants require $\cos\theta_{i,y_i}$ to be greater than the other $\cos\theta_{i,j}$ ($j \ne y_i$) by the specified margin.
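To make the notation concrete, the following is a minimal sketch (ours, not the authors' code) of how the cosine-based logits in Eqs. (1)-(4) could be computed in PyTorch; the values of `s` and `m` and the function name are illustrative only.

```python
# Sketch of cosine-based logits with optional CosFace/ArcFace margins.
import torch
import torch.nn.functional as F

def cosine_margin_logits(x, W, labels, s=30.0, m=0.35, margin_type="cosface"):
    """x: (N, d) features, W: (C, d) class weights, labels: (N,) ground-truth ids."""
    x_norm = F.normalize(x, dim=1)            # unit-length features
    W_norm = F.normalize(W, dim=1)            # unit-length class weights
    cos_theta = x_norm @ W_norm.t()           # (N, C) cosine similarities
    one_hot = F.one_hot(labels, num_classes=W.shape[0]).float()

    if margin_type == "cosface":              # Eq. (4): subtract m from the target cosine
        logits = s * (cos_theta - one_hot * m)
    elif margin_type == "arcface":            # Eq. (3): add m to the target angle
        theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
        logits = s * torch.cos(theta + one_hot * m)
    else:                                     # plain cosine-based softmax logits
        logits = s * cos_theta
    return logits  # feeding these into F.cross_entropy reproduces Eqs. (1)-(2)
```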
Intuitively, on one hand, the scale parameter $s$ stretches the narrow range of cosine distances, making the logits more discriminative. On the other hand, the margin parameter $m$ enlarges the margin between different classes to enhance the classification ability. These hyperparameters eventually affect the predicted probability $P_{i,y_i}$. Empirically, an ideal hyperparameter setting should make $P_{i,y_i}$ satisfy two properties: (1) the predicted probability of each class (identity) should span the range $(0, 1)$, i.e., the lower boundary of $P_{i,y_i}$ should be near $0$ while the upper boundary should be near $1$; (2) the curve of $P_{i,y_i}$ should have large absolute gradients around its transition region to make training effective.
3.1 Effects of the scale parameter $s$
The scale parameter $s$ can significantly affect $P_{i,y_i}$. Intuitively, $P_{i,y_i}$ should gradually increase from $0$ to $1$ as the angle $\theta_{i,y_i}$ decreases from $\pi/2$ to $0$ (mathematically, $\theta_{i,y_i}$ can be any value in $[0, \pi]$; we empirically found, however, that the maximum is always around $\pi/2$; see the red curve in Fig. 1 for examples), i.e., the smaller the angle between $\vec{x}_i$ and its corresponding class weight $\vec{W}_{y_i}$ is, the larger the probability $P_{i,y_i}$ should be. Both an improper probability range and an improper probability curve w.r.t. $\theta_{i,y_i}$ would negatively affect the training process and thus the recognition performance.
We first study the range of the classification probability $P_{i,y_i}$. Given the scale parameter $s$, the range of the probability in all cosine-based softmax losses is
$$P_{i,y_i} \in \left[\frac{e^{-s}}{e^{-s} + (C-1)\,e^{s}},\; \frac{e^{s}}{e^{s} + (C-1)\,e^{-s}}\right], \qquad (5)$$
where the lower boundary is achieved when $\cos\theta_{i,y_i} = -1$ and $\cos\theta_{i,j} = 1$ for all $j \ne y_i$ in Eq. (1). Similarly, the upper boundary is achieved when $\cos\theta_{i,y_i} = 1$ and $\cos\theta_{i,j} = -1$ for all $j \ne y_i$. The length of this range approaches $1$ as $s \to \infty$, i.e.,
$$\lim_{s \to \infty}\left[\frac{e^{s}}{e^{s} + (C-1)\,e^{-s}} - \frac{e^{-s}}{e^{-s} + (C-1)\,e^{s}}\right] = 1, \qquad (6)$$
which means that the requirement of the range spanning $(0, 1)$ can be satisfied with a sufficiently large $s$. However, this does not mean that the larger the scale parameter, the better the choice: the length of the probability range easily approaches a high value for moderate class numbers and scale parameters, but an oversized scale leads to a poor probability distribution, as will be discussed in the following paragraphs.
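As a quick sanity check on Eq. (5), the following snippet (with illustrative values for $s$ and $C$ that are not taken from the paper) computes the lower and upper probability bounds and shows that the length of the range approaches 1 already for moderate scales:

```python
# Numeric check of the probability bounds in Eq. (5).
import math

def prob_bounds(s, C):
    lower = math.exp(-s) / (math.exp(-s) + (C - 1) * math.exp(s))
    upper = math.exp(s) / (math.exp(s) + (C - 1) * math.exp(-s))
    return lower, upper

for s in (5, 10, 30):
    lo, hi = prob_bounds(s, C=10_000)      # hypothetical class count
    print(f"s={s:>2}: range length = {hi - lo:.6f}")
```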
We investigate the influence of the parameter $s$ by taking $P_{i,y_i}$ as a function of $s$ and the angle $\theta_{i,y_i}$, where $y_i$ denotes the label of $\vec{x}_i$. Formally, we have
$$P_{i,y_i} = \frac{e^{s\cdot\cos\theta_{i,y_i}}}{e^{s\cdot\cos\theta_{i,y_i}} + B_i}, \qquad (7)$$
where $B_i = \sum_{k \ne y_i} e^{s\cdot\cos\theta_{i,k}}$ is the summation of the exponentiated logits of all non-corresponding classes for feature $\vec{x}_i$. We observe that the values of $B_i$ are almost unchanged during the training process, because the angles $\theta_{i,k}$ ($k \ne y_i$) to the non-corresponding classes always stay around $\pi/2$ during training (see the red curve in Fig. 1).
Therefore, we can assume $B_i$ to be approximately constant, i.e., $B_i \approx \sum_{k \ne y_i} e^{s\cdot 0} = C - 1$. We then plot the curves of the probability $P_{i,y_i}$ w.r.t. $\theta_{i,y_i}$ under different settings of the parameter $s$ in Fig. 2(a). It is obvious that when $s$ is too small, the maximal value of $P_{i,y_i}$ cannot reach $1$. This is undesirable because even when the network is very confident about a sample $\vec{x}_i$'s corresponding class label $y_i$ (e.g., $\theta_{i,y_i} \approx 0$), the loss function would still penalize the classification result and update the network.
On the other hand, when $s$ is too large, the probability curve w.r.t. $\theta_{i,y_i}$ is also problematic: it outputs a very high probability even when $\theta_{i,y_i}$ is close to $\pi/2$, which means that the loss function with a large $s$ may fail to penalize misclassified samples and cannot effectively update the network to correct mistakes.
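The following sketch reproduces the spirit of the analysis behind Fig. 2(a) under the approximation $B_i \approx C - 1$; the class count and scale values are illustrative, not the paper's settings:

```python
# Evaluating Eq. (7) with B_i ≈ C - 1 for a few scales: a small s caps the
# maximal probability, while a large s stays high even near theta ≈ pi/2.
import math

def p_target(theta, s, C):
    """P_{i,y_i} as a function of the target angle theta, Eq. (7) with B_i = C - 1."""
    z = math.exp(s * math.cos(theta))
    return z / (z + (C - 1))

C = 10_000                                  # hypothetical number of identities
for s in (8, 16, 32, 64):                   # illustrative scale values
    p_confident = p_target(0.1, s, C)       # very confident sample (theta ~ 0)
    p_boundary = p_target(1.4, s, C)        # near-boundary sample (theta ~ pi/2)
    print(f"s={s:>2}: P(theta=0.1)={p_confident:.3f}, P(theta=1.4)={p_boundary:.3f}")
```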
In summary, the scale parameter $s$ has substantial influence on the range as well as the shape of the probability curves $P_{i,y_i}$, both of which are crucial for effectively training the deep network.
3.2 Effects of the margin parameter $m$
In this section, we investigate the effect of the margin parameter $m$ in cosine-based softmax losses (Eqs. (3) and (4)) on the feature $\vec{x}_i$'s predicted class probability $P_{i,y_i}$. For simplicity, we study the margin parameter of ArcFace (Eq. (3)); similar conclusions also apply to the margin of CosFace (Eq. (4)).
We first rewrite the classification probability $P_{i,y_i}$ following Eq. (7) as
$$P_{i,y_i} = \frac{e^{s\cdot\cos(\theta_{i,y_i} + m)}}{e^{s\cdot\cos(\theta_{i,y_i} + m)} + B_i}. \qquad (8)$$
To study the influence of the parameter $m$ on the probability $P_{i,y_i}$, we assume both $s$ and $B_i$ are fixed. Following the discussion in Section 3.1, we set $B_i \approx C - 1$ and fix the scale $s$. The probability curves w.r.t. $\theta_{i,y_i}$ under different $m$ are shown in Fig. 2(b).
According to Fig. 2(b), increasing the margin parameter $m$ shifts the probability curves to the left. Thus, for the same $\theta_{i,y_i}$, larger margin parameters lead to lower probabilities and thus larger losses, even for small angles. In other words, the angle between the feature and its corresponding class weights has to be very small for the sample to be confidently classified. This is why margin-based losses provide stronger supervision than conventional cosine-based losses at the same $\theta_{i,y_i}$. Proper margin settings have been shown to boost the final recognition performance in [40, 7].
Although a larger margin provides stronger supervision, it should not be too large either. When $m$ is oversized, the probabilities become unreliable: the loss outputs probabilities near $0$ even when $\theta_{i,y_i}$ is very small. This leads to large losses for almost all samples, even those with very small sample-to-class angles, which makes the training difficult to converge. In previous methods, margin selection is an ad-hoc procedure with little theoretical guidance.
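A companion sketch for Fig. 2(b), again with illustrative values, shows how the ArcFace-style margin in Eq. (8) lowers $P_{i,y_i}$ even at small angles:

```python
# Evaluating Eq. (8) with B_i ≈ C - 1: the margin m shifts the curve left,
# so the same small angle yields a much lower probability (stronger supervision).
import math

def p_target_margin(theta, s, m, C):
    z = math.exp(s * math.cos(theta + m))
    return z / (z + (C - 1))

C, s = 10_000, 10.0                          # illustrative class count and scale
for m in (0.0, 0.3, 0.6):                    # illustrative margins
    print(f"m={m}: P(theta=0.3) = {p_target_margin(0.3, s, m, C):.3f}")
```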
3.3 Summary of the hyperparameter study
According to our analysis, we can draw the following conclusions:
(1) The scale $s$ and margin $m$ can substantially influence the prediction probability $P_{i,y_i}$ of feature $\vec{x}_i$ with ground-truth identity $y_i$. For the scale parameter, a too-small $s$ limits the maximal value of $P_{i,y_i}$, while a too-large $s$ makes most predicted probabilities close to $1$, rendering the training loss insensitive to the correctness of $\theta_{i,y_i}$. For the margin parameter, a too-small margin is not strong enough to regularize the final angular margin, while an oversized margin makes the training difficult to converge.
(2) The effects of the scale and margin can be unified: both modulate the mapping from cosine distances to the prediction probability $P_{i,y_i}$. As shown in Figs. 2(a) and 2(b), small scales and large margins have a similar strengthening effect on the supervision, while large scales and small margins both weaken it. It is therefore feasible and promising to control the probability using a single hyperparameter, either $s$ or $m$. Since $s$ is more directly related to the requirement that the range of $P_{i,y_i}$ span $(0, 1)$, we focus on automatically tuning the scale parameter in the remainder of this paper.
4 The cosine-based softmax loss with adaptive scaling
Based on our study of the hyperparameters of cosine-based softmax loss functions, in this section we propose a novel loss with a self-adaptive scaling scheme, namely AdaCos, which does not require ad-hoc and time-consuming manual parameter tuning. Training with the proposed loss not only facilitates convergence but also results in higher recognition accuracy.
Our study of Fig. 1 shows that during training, the angles $\theta_{i,k}$ ($k \ne y_i$) between a feature and its non-corresponding class weights are almost always close to $\pi/2$. In other words, we can safely assume $B_i \approx C - 1$ in Eq. (7). Obviously, it is the probability $P_{i,y_i}$ of feature $\vec{x}_i$ belonging to its corresponding class that has the most influence on the supervision for network training. Therefore, we focus on designing an adaptive scale parameter for controlling the probabilities $P_{i,y_i}$.
From the curves of $P_{i,y_i}$ w.r.t. $\theta_{i,y_i}$ (Fig. 2(a)), we observe that the scale parameter $s$ does not simply shift the boundary that determines correct/incorrect classification but also squeezes or stretches the curve; in contrast, the margin parameter $m$ only shifts the curve in phase. We therefore propose to automatically tune the scale parameter and to eliminate the margin parameter from our loss function, which makes the proposed AdaCos loss different from state-of-the-art softmax loss variants with angular margins. With the softmax function, the predicted probability can be defined as
$$P_{i,j} = \frac{e^{\tilde{s}\cdot\cos\theta_{i,j}}}{\sum_{k=1}^{C} e^{\tilde{s}\cdot\cos\theta_{i,k}}}, \qquad (9)$$
where $\tilde{s}$ is the automatically tuned scale parameter to be discussed below.
Let us first reconsider $P_{i,y_i}$ (Eq. (7)) as a function of $\theta_{i,y_i}$. Note that $\theta_{i,y_i}$ represents the angle between sample $\vec{x}_i$ and the weight vector of its ground-truth category $y_i$. For network training, we hope to minimize $\theta_{i,y_i}$ with the supervision from the loss function. Our objective is to choose a suitable scale $\tilde{s}$ that makes the predicted probability change significantly with respect to $\theta_{i,y_i}$. Mathematically, we find the point $\theta_0$ where the absolute gradient of $P_{i,y_i}$ reaches its maximum, i.e., where the second-order derivative of $P_{i,y_i}$ equals $0$:
$$\left.\frac{\partial^2 P_{i,y_i}(\theta_{i,y_i})}{\partial \theta_{i,y_i}^2}\right|_{\theta_{i,y_i} = \theta_0} = 0, \qquad (10)$$
where $\theta_0 \in [0, \pi/2]$. Combining Eqs. (7) and (10), we obtain a transcendental equation. Considering that $\theta_0$ is close to $\pi/2$, the relation between the scale parameter and the point $\theta_0$ can be approximated as
$$\tilde{s} = \frac{\log B_i}{\cos\theta_0}, \qquad (11)$$
where $B_i$ can be well approximated as $C - 1$ since the angles $\theta_{i,k}$ ($k \ne y_i$) distribute around $\pi/2$ during training (see Eq. (7) and Fig. 1). The task of automatically determining $\tilde{s}$ then reduces to selecting a reasonable central angle $\theta_0$ in $[0, \pi/2]$.
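To make the approximation explicit, the following is a brief derivation sketch (ours, not taken verbatim from the paper), assuming $B_i$ is constant and $\theta_0$ lies near $\pi/2$, so that $\sin\theta_0 \approx 1$ and the $\cos\theta_0$ term in the second derivative is negligible:

```latex
% Write Eq. (7) as a logistic function of u = s*cos(theta) - log(B_i):
P_{i,y_i}(\theta) = \frac{e^{s\cos\theta}}{e^{s\cos\theta} + B_i}
                  = \sigma\!\left(s\cos\theta - \log B_i\right),
\qquad \sigma(u) = \frac{1}{1 + e^{-u}} .

% Differentiating twice w.r.t. theta:
\frac{\partial^2 P_{i,y_i}}{\partial\theta^2}
  = s^2\sin^2\!\theta\,\sigma''(u) - s\cos\theta\,\sigma'(u)
  \;\approx\; s^2\,\sigma''(u)
  \qquad (\sin\theta \approx 1,\ \cos\theta \approx 0).

% Setting this to zero (Eq. (10)) gives \sigma''(u) = 0, i.e. u = 0, hence
s\cos\theta_0 = \log B_i
\;\Longrightarrow\;
s = \frac{\log B_i}{\cos\theta_0}, \qquad \text{which is Eq. (11).}
```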
4.1 Automatically choosing a fixed scale parameter
Since $\pi/4$ is at the center of $[0, \pi/2]$, it is natural to regard it as the central point, i.e., setting $\theta_0 = \pi/4$, to obtain an effective mapping from the angle $\theta_{i,y_i}$ to the probability $P_{i,y_i}$. The supervision determined by $P_{i,y_i}$ is then back-propagated to update $\theta_{i,y_i}$ and, in turn, the network parameters. According to Eq. (11), we can estimate the corresponding scale parameter as
$$\tilde{s}_f = \frac{\log B_i}{\cos\frac{\pi}{4}} \approx \sqrt{2}\cdot\log(C-1), \qquad (12)$$
where $B_i$ is approximated by $C - 1$.
Such an automatically chosen fixed scale parameter $\tilde{s}_f$ (see Figs. 2(a) and 2(b)) depends only on the number of classes in the training set and also provides a good guideline for existing cosine-based softmax losses when choosing their scale parameters; in contrast, the scale parameters in existing methods were set manually according to human experience. It also serves as a good baseline for the dynamically tuned scale parameter introduced in the next section.
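For illustration, the fixed scale of Eq. (12) can be computed directly from the class count; the class counts below are arbitrary examples, not the datasets used in our experiments:

```python
# Small illustration of Eq. (12): the fixed AdaCos scale depends only on C.
import math

def fixed_scale(C):
    return math.sqrt(2.0) * math.log(C - 1)

for C in (1_000, 10_000, 100_000):
    print(f"C = {C:>6}: fixed scale ≈ {fixed_scale(C):.2f}")
```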
4.2 Dynamically adaptive scale parameter
As Fig. 1 shows, the angles $\theta_{i,y_i}$ between features and their ground-truth class weights gradually decrease as the training iterations increase, while the angles between features and non-corresponding class weights stabilize around $\pi/2$.
Although our fixed scale parameter $\tilde{s}_f$ behaves properly as $\theta_{i,y_i}$ varies over $[0, \pi/2]$, it does not take into account the fact that $\theta_{i,y_i}$ gradually decreases during training. Since smaller angles yield higher probabilities and therefore receive weaker supervision as training proceeds, we propose a dynamically adaptive scale parameter that gradually imposes a stricter requirement on the central point $\theta_0$, progressively enhancing the supervision throughout the training process.
Formally, we introduce a modulating indicator variable $\theta^{(t)}_{med}$, which is the median of all corresponding classes' angles $\theta_{i,y_i}$ in the mini-batch of size $N$ at the $t$-th iteration. $\theta^{(t)}_{med}$ roughly represents the current network's degree of optimization on the mini-batch. When the median angle is large, the network parameters are far from optimal and less strict supervision should be applied so that training converges more stably; when the median angle is small, the network is close to optimal and stricter supervision should be applied to make the intra-class angles even smaller. Based on this observation, we set the central angle $\theta_0 = \theta^{(t)}_{med}$. We also introduce $B^{(t)}_{avg}$ as the average of $B_i$:
$$B^{(t)}_{avg} = \frac{1}{N}\sum_{i \in \mathcal{N}^{(t)}} B^{(t)}_i = \frac{1}{N}\sum_{i \in \mathcal{N}^{(t)}}\sum_{k \ne y_i} e^{\tilde{s}^{(t-1)}_d\cdot\cos\theta_{i,k}}, \qquad (13)$$
where $\mathcal{N}^{(t)}$ denotes the set of sample indices in the mini-batch at the $t$-th iteration. Unlike approximating $B_i$ by $C - 1$ for the fixed scale parameter $\tilde{s}_f$, here we estimate $B^{(t)}_{avg}$ using the scale parameter of the previous iteration, which provides a more accurate approximation. Note that $B^{(t)}_{avg}$ also depends on the dynamic scale $\tilde{s}^{(t)}_d$, so in principle it should be obtained by solving the nonlinear equation above. In practice, we observe that $\tilde{s}^{(t)}_d$ changes very little between consecutive iterations, so we simply use $\tilde{s}^{(t-1)}_d$ to compute $B^{(t)}_i$ with Eq. (7) and then obtain the dynamic scale directly with Eq. (11):
$$\tilde{s}^{(t)}_d = \frac{\log B^{(t)}_{avg}}{\cos\theta^{(t)}_{med}}, \qquad (14)$$
where $B^{(t)}_{avg}$ depends on the dynamic scale parameter; as noted above, we estimate it using the scale $\tilde{s}^{(t-1)}_d$ of the previous iteration.
At the beginning of the training process, the median angle $\theta^{(t)}_{med}$ of each mini-batch might be too large to impose sufficient supervision. We therefore force the central angle to be no larger than $\pi/4$. Our dynamic scale parameter for the $t$-th iteration can then be formulated as
$$\tilde{s}^{(t)}_d = \begin{cases} \sqrt{2}\cdot\log(C-1), & t = 0,\\[6pt] \dfrac{\log B^{(t)}_{avg}}{\cos\left(\min\left(\frac{\pi}{4},\, \theta^{(t)}_{med}\right)\right)}, & t \ge 1, \end{cases} \qquad (15)$$
where the scale is initialized with our fixed scale parameter $\tilde{s}_f$ (Eq. (12)) when $t = 0$.
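Putting the pieces together, the following is a minimal PyTorch sketch of the dynamic AdaCos loss as described by Eqs. (9) and (13)-(15); it is an illustrative re-implementation based on this section, not the authors' released code, and the class and argument names are ours:

```python
# Dynamic AdaCos loss sketch: the scale updates itself from batch statistics.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaCosLoss(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.num_classes = num_classes
        self.W = nn.Parameter(torch.randn(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.W)
        # t = 0: initialize with the fixed scale of Eq. (12).
        self.register_buffer(
            "scale", torch.tensor(math.sqrt(2.0) * math.log(num_classes - 1)))

    def forward(self, x, labels):
        # Cosine similarities between normalized features and class weights.
        cos_theta = F.normalize(x, dim=1) @ F.normalize(self.W, dim=1).t()
        theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
        one_hot = F.one_hot(labels, self.num_classes).bool()

        with torch.no_grad():
            # B_avg (Eq. (13)): batch average of the summed non-target
            # exponentiated logits, computed with the previous scale.
            B = torch.where(one_hot,
                            torch.zeros_like(cos_theta),
                            torch.exp(self.scale * cos_theta)).sum(dim=1)
            B_avg = B.mean()
            # Median target angle of the mini-batch, clipped at pi/4 (Eq. (15)).
            theta_med = torch.median(theta[one_hot])
            self.scale = B_avg.log() / torch.cos(torch.clamp_max(theta_med, math.pi / 4))

        logits = self.scale * cos_theta          # Eq. (9), no margin term
        return F.cross_entropy(logits, labels)
```

In a training loop, one would simply call `loss = criterion(features, labels)` on the backbone features; the scale updates itself from the batch statistics, so no scale or margin hyperparameter has to be chosen. Registering the scale as a buffer keeps it in checkpoints alongside the class weights.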
Substituting $\tilde{s}^{(t)}_d$ into the loss (Eqs. (2) and (9)), the corresponding gradients can be calculated as
$$\frac{\partial \mathcal{L}}{\partial \vec{x}_i} = \frac{1}{N}\sum_{k=1}^{C}\left(P_{i,k} - \mathbb{1}(k = y_i)\right)\frac{\partial f_{i,k}}{\partial \vec{x}_i}, \qquad \frac{\partial \mathcal{L}}{\partial \vec{W}_j} = \frac{1}{N}\sum_{i=1}^{N}\left(P_{i,j} - \mathbb{1}(j = y_i)\right)\frac{\partial f_{i,j}}{\partial \vec{W}_j}, \qquad (16)$$
where $\mathbb{1}(\cdot)$ is the indicator function and
$$f_{i,k} = \tilde{s}^{(t)}_d\cdot\cos\theta_{i,k}. \qquad (17)$$
Eq. (17) shows that the dynamically adaptive scale parameter influences the classification probabilities differently at each iteration and thus also the gradients (Eq. (16)) used for updating the network parameters. The benefit of dynamic AdaCos is that it produces a reasonable scale parameter by sensing the model's convergence at the current iteration.
5 Experiments
We examine the proposed AdaCos loss on several public face recognition benchmarks and compare it with state-of-the-art cosine-based softmax losses, including softmax [28], CosFace [40] and ArcFace [7]. We present evaluation results on LFW [13], the MegaFace 1-million Challenge [24] and IJB-C [23]. We also present exploratory experiments on convergence speed and robustness against low-resolution images.
Preprocessing. We use two public training datasets, CASIA-WebFace [43] and MS1M [9], to train CNN models with the proposed loss functions. We carefully remove noisy and low-quality images from the datasets. The cleaned WebFace [43] and MS1M [9] contain about M and M facial images, respectively. All models are trained on these data and tested directly on the test splits of the three benchmarks. RSA [22] is applied to the images to extract facial areas. Then, according to the detected facial landmarks, the faces are aligned through a similarity transformation and resized to a fixed size. All image pixel values are normalized by subtracting the mean and dividing by a fixed constant.
5.1 Results on LFW
The LFW [13] dataset collects thousands of identities from the internet. Its testing protocol contains about 13,000 images of about 5,700 identities, with a total of 6,000 ground-truth match pairs; half of the matches are positive while the other half are negative. LFW's primary difficulties lie in pose variations, color jittering, illumination variations and aging. Note that a portion of the pose variations can be eliminated by the RSA [22] facial landmark detection and alignment algorithm, but some non-frontal facial images still cannot be aligned by RSA [22] and are therefore aligned manually.
5.1.1 Comparison on LFW
For all experiments on LFW [13], we train ResNet-50 models [10] on the cleaned WebFace [43] dataset with the same batch size for every loss. The input facial image size and the feature dimension fed into the loss function are likewise kept identical. Different loss functions are compared against our proposed AdaCos losses.
Table 1: Face verification accuracies (%) on LFW over three training runs.

Method        | 1st | 2nd | 3rd | Average Acc.
Softmax       |  —  |  —  |  —  |   —
softmax [28]  |  —  |  —  |  —  |   —
CosFace [40]  |  —  |  —  |  —  |   —
ArcFace [7]   |  —  |  —  |  —  |   —
Fixed AdaCos  |  —  |  —  |  —  | 99.60
Dyna. AdaCos  |  —  |  —  |  —  | 99.71
Table 1 reports the recognition accuracies of models trained with different softmax loss functions. Our AdaCos losses with fixed and dynamic scale parameters (denoted Fixed AdaCos and Dyna. AdaCos) surpass the state-of-the-art cosine-based softmax losses under the same training configuration. For the compared losses, the scale parameter is set identically for softmax [28], CosFace [40] and ArcFace [7], and the margin parameters of CosFace [40] and ArcFace [7] are set individually. Since LFW is a relatively easy evaluation set, we train and test every loss three times. The average accuracy of the proposed dynamic AdaCos is higher than that of state-of-the-art ArcFace [7] and of softmax [28].
5.1.2 Exploratory Experiments
The change of scale parameters and feature angles during training. We show how the scale parameter and the feature angles change during training with the proposed AdaCos loss. The scale parameter adapts to the current recognition performance of the model, continuously strengthening the supervision by gradually reducing $\tilde{s}^{(t)}_d$ and thus shrinking $\theta_0$. Fig. 3 shows the change of the scale parameter for the fixed and dynamic AdaCos losses. For dynamic AdaCos, the scale parameter adaptively decreases as training proceeds, indicating that the loss function provides stricter supervision for updating the network parameters. Fig. 4 illustrates the change of the angles for dynamic AdaCos and the softmax loss. The average (orange curve) and median (green curve) of $\theta_{i,y_i}$, the angle between a sample and its ground-truth category, gradually decrease, while the average (maroon curve) of $\theta_{i,j}$ with $j \ne y_i$ remains near $\pi/2$. Compared with the softmax loss, the proposed loss achieves much smaller feature-to-class angles on the ground-truth classes and leads to higher recognition accuracies.
Convergence rates. The convergence rate is an important indicator of the efficiency of a loss function. We examine the convergence of several cosine-based losses at different training iterations, with the same training configuration as in Table 1. The results in Table 2 reveal that training with the AdaCos losses converges considerably faster.
5.2 Results on MegaFace
Table 3: Rank-1 identification accuracies (%) on MegaFace under distractor sets of increasing size (left to right).

Method          | Size of MegaFace distractor (increasing →)
softmax         |   —
CosFace         |   —
ArcFace         |   —
Fixed AdaCos    |   —
Dynamic AdaCos  | 99.88%  99.72%  99.51%  99.02%  98.54%  97.41%
We then evaluate the proposed AdaCos on the MegaFace Challenge [16], a publicly available identification benchmark widely used to test face recognition algorithms. The gallery set of MegaFace incorporates over one million images of about 690K identities collected from Flickr photos [37]. We follow ArcFace [7]'s testing protocol, which cleans the dataset to make the results more reliable. We train the same Inception-ResNet [33] models on the CASIA-WebFace [43] and MS1M [9] training data, with overlapping subjects removed.
5.3 Results on the IJB-C 1:1 verification protocol
Table 4: True Accept Rate (TAR) @ False Accept Rate (FAR) on the IJB-C 1:1 verification protocol (FAR decreasing from left to right).

Method            | True Accept Rate @ False Accept Rate (decreasing →)
FaceNet [29]      |   —
VGGFace [25]      |   —
Crystal Loss [27] |   —
softmax           |   —
CosFace [40]      |   —
ArcFace [7]       | 99.07%  97.75%    —       —       —       —       —
Fixed AdaCos      |   —
Dynamic AdaCos    |   —       —     95.65%  92.40%  88.03%  83.28%  74.07%
The IJB-C dataset [23] contains about 3,500 identities with a total of roughly 31,000 still facial images and 117,000 unconstrained video frames. In the 1:1 verification protocol, the large numbers of positive and negative matches allow TARs to be evaluated at very low FARs.
We compare the softmax loss functions, including the proposed AdaCos, softmax [28], CosFace [40] and ArcFace [7], using the same training data (WebFace [43] and MS1M [9]) and network architecture (Inception-ResNet [33]). We also report the results of FaceNet [29] and VGGFace [25] as listed in the Crystal loss paper [27]. Table 4 and Fig. 6 show their performance on IJB-C 1:1 verification; our proposed dynamic AdaCos achieves the best overall performance.
6 Conclusions
In this work, we argue that the bottleneck of existing cosine-based softmax losses primarily comes from the mismatch between the cosine distances and the classification probabilities, which limits the final recognition performance. To address this issue, we first analyze the effects of the hyperparameters in cosine-based softmax losses from the perspective of the predicted probability. Based on this analysis, we propose AdaCos, which automatically adjusts an adaptive scale parameter to reshape the mapping between cosine distance and classification probability. The proposed AdaCos loss is simple yet effective. We demonstrate its effectiveness and efficiency through exploratory experiments and report state-of-the-art performance on several public benchmarks.
Acknowledgements. This work is supported in part by SenseTime Group Limited, in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14213616, CUHK14208417, CUHK14239816, in part by CUHK Direct Grant, and in part by the National Natural Science Foundation of China (61472410) and the Joint Lab of CAS-HK.
References
 [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 [2] Peter N. Belhumeur, João P Hespanha, and David J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on pattern analysis and machine intelligence, 19(7):711–720, 1997.
 [3] James Bergstra and Yoshua Bengio. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
 [4] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyperparameter optimization. In Advances in neural information processing systems, pages 2546–2554, 2011.
 [5] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
 [6] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE, 2005.
 [7] Jiankang Deng, Jia Guo, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
 [8] Siddharth Gopal and Yiming Yang. Von Mises-Fisher clustering models. In International Conference on Machine Learning, pages 154–162, 2014.
 [9] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
 [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [11] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
 [12] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
 [13] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
 [14] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
 [15] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
 [16] Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The MegaFace benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016.
 [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [18] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, 2017.
 [19] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, pages 507–516, 2016.
 [20] Yu Liu, Hongyang Li, and Xiaogang Wang. Learning deep features via congenerous cosine loss for person recognition. arXiv preprint arXiv:1702.06890, 2017.
 [21] Yu Liu, Hongyang Li, and Xiaogang Wang. Rethinking feature discrimination and polymerization for large-scale recognition. arXiv preprint arXiv:1710.00870, 2017.
 [22] Yu Liu, Hongyang Li, Junjie Yan, Fangyin Wei, Xiaogang Wang, and Xiaoou Tang. Recurrent scale approximation for object detection in cnn. In IEEE International Conference on Computer Vision, 2017.
 [23] Brianna Maze, Jocelyn Adams, James A Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K Jain, W Tyler Niggel, Janet Anderson, Jordan Cheney, et al. IARPA Janus Benchmark-C: Face dataset and protocol. In 11th IAPR International Conference on Biometrics, 2018.
 [24] Aaron Nech and Ira Kemelmacher-Shlizerman. Level playing field for million scale face recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3406–3415. IEEE, 2017.
 [25] Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.
 [26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
 [27] Rajeev Ranjan, Ankan Bansal, Hongyu Xu, Swami Sankaranarayanan, Jun-Cheng Chen, Carlos D Castillo, and Rama Chellappa. Crystal loss and quality pooling for unconstrained face verification and recognition. arXiv preprint arXiv:1804.01159, 2018.
 [28] Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
 [29] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
 [30] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
 [31] Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988–1996, 2014.
 [32] Yi Sun, Ding Liang, Xiaogang Wang, and Xiaoou Tang. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
 [33] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
 [35] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [36] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.
 [37] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
 [38] Feng Wang, Weiyang Liu, Haijun Liu, and Jian Cheng. Additive margin softmax for face verification. arXiv preprint arXiv:1801.05599, 2018.
 [39] Feng Wang, Xiang Xiang, Jian Cheng, and Alan L Yuille. Normface: hypersphere embedding for face verification. arXiv preprint arXiv:1704.06369, 2017.
 [40] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Zhifeng Li, Dihong Gong, Jingchao Zhou, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. arXiv preprint arXiv:1801.09414, 2018.
 [41] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
 [42] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
 [43] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
 [44] Xiao Zhang, Zhiyuan Fang, Yandong Wen, Zhifeng Li, and Yu Qiao. Range loss for deep face recognition with long-tailed training data. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
 [45] Yutong Zheng, Dipan K Pal, and Marios Savvides. Ring loss: Convex feature normalization for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5089–5097, 2018.