# AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations

The cosine-based softmax losses and their variants achieve great success in deep learning based face recognition. However, hyperparameter settings in these losses have significant influences on the optimization path as well as the final recognition performance. Manually tuning those hyperparameters heavily relies on user experience and requires many training tricks. In this paper, we investigate in depth the effects of two important hyperparameters of cosine-based softmax losses, the scale parameter and the angular margin parameter, by analyzing how they modulate the predicted classification probability. Based on this analysis, we propose a novel cosine-based softmax loss, AdaCos, which is hyperparameter-free and leverages an adaptive scale parameter to automatically strengthen the training supervisions during the training process. We apply the proposed AdaCos loss to large-scale face verification and identification datasets, including LFW, MegaFace, and IJB-C 1:1 Verification. Our results show that training deep neural networks with the AdaCos loss is stable and able to achieve high face recognition accuracy. Our method outperforms state-of-the-art softmax losses on all three datasets.

## 1 Introduction

Recent years witnessed the breakthrough of deep Convolutional Neural Networks (CNNs) [17, 12, 25, 35] in significantly improving the performance of one-to-one face verification and one-to-many face identification tasks. The successes of deep face CNNs can be mainly credited to three factors: enormous training data [9], deep neural network architectures [10, 33] and effective loss functions [28, 21, 7]. Modern face datasets, such as LFW [13], CASIA-WebFace [43], MS1M [9] and MegaFace [24, 16], contain a huge number of identities, which enables the training of deep networks. A number of recent studies, such as DeepFace [36], DeepID2 [31], DeepID3 [32], VGGFace [25] and FaceNet [29], demonstrated that properly designed network architectures also lead to improved performance.

Apart from the large-scale training data and deep structures, training losses also play key roles in learning accurate face recognition models [41, 6, 11]. Unlike image classification tasks, face recognition is essentially an open set recognition problem, where the testing categories (identities) are generally different from those used in training. To handle this challenge, most deep learning based face recognition approaches [31, 32, 36] utilize CNNs to extract feature representations from facial images, and adopt a metric (usually the cosine distance) to estimate the similarities between pairs of faces during inference.

However, this inference metric is not well considered in methods that use the softmax cross-entropy loss function (denoted as "softmax loss" for short in the remaining sections), which train the networks with the softmax loss but perform inference using cosine similarities. To mitigate the gap between training and testing, recent works [21, 28, 39, 8] directly optimized cosine-based softmax losses. Moreover, angular margin-based terms [19, 18, 40, 38, 7] are usually integrated into cosine-based losses to maximize the angular margins between different identities. These methods improve the face recognition performance in the open-set setup. In spite of their successes, the training processes of cosine-based losses (and their variants introducing margins) are usually tricky and unstable. The convergence and performance highly depend on the hyperparameter settings of the loss, which are determined empirically through a large number of trials. In addition, subtle changes of these hyperparameters may cause the entire training process to fail.

In this paper, we investigate state-of-the-art cosine-based softmax losses [28, 40, 7], especially those aiming at maximizing angular margins, to understand how they provide supervisions for training deep neural networks. Each of the functions generally includes several hyperparameters, which have substantial impact on the final performance and are usually difficult to tune. One has to repeat training with different settings multiple times to achieve optimal performance. Our analysis shows that different hyperparameters in those cosine-based losses actually have similar effects on controlling the samples' predicted class probabilities. Improper hyperparameter settings cause the loss functions to provide insufficient supervisions for optimizing networks.

Based on the above observation, we propose an adaptive cosine-based loss function, AdaCos, which automatically tunes hyperparameters and generates more effective supervisions during training. The proposed AdaCos dynamically scales the cosine similarities between training samples and corresponding class center vectors (the fully-connected weight vectors before softmax), making their predicted class probabilities meet the semantic meaning of these cosine similarities. Furthermore, AdaCos can be easily implemented using built-in functions from prevailing deep learning libraries.

To demonstrate the effectiveness of the proposed AdaCos loss function, we evaluated it on several face benchmarks, including LFW face verification [13], MegaFace one-million identification [24] and IJB-C [23]. Our method outperforms state-of-the-art cosine-based losses on all these benchmarks.

## 2 Related Works

Cosine similarities for inference. For learning deep face representations, feature-normalized losses are commonly adopted to enhance the recognition accuracy. Coco loss [20, 21] and NormFace [39] studied the effect of normalization and proposed two strategies by reformulating softmax loss and metric learning. Similarly, Ranjan et al. [28] also discussed this problem and applied normalization on learned feature vectors to restrict them to lie on a hypersphere. Moreover, compared with these hard normalization approaches, ring loss [45] came up with a soft feature normalization approach with convex formulations.

Margin-based softmax loss. Earlier, most face recognition approaches utilized metric-targeted loss functions, such as triplet [41] and contrastive loss [6], which utilize Euclidean distances to measure similarities between features. Building on these works, center loss [42] and range loss [44] were proposed to reduce intra-class variations via minimizing distances within each class [2]. Following this, researchers found that constraining the margin in Euclidean space is insufficient to achieve optimal generalization. Angular-margin based loss functions were then proposed to tackle the problem. Angular constraints were integrated into the softmax loss function to improve the learned face representation by L-softmax [19] and A-softmax [18]. CosFace [40], AM-softmax [38] and ArcFace [7] directly maximized angular margins and employed simpler and more intuitive loss functions compared with the aforementioned methods.

Automatic hyperparameter tuning. The performance of an algorithm highly depends on hyperparameter settings. Grid and random search [3] are the most widely used strategies. For more automatic tuning, sequential model-based global optimization [14] is the mainstream choice. Typically, it performs inference with several hyperparameter settings, and chooses the setting for the next round of testing based on the inference results. Bayesian optimization [30] and the tree-structured Parzen estimator approach [4] are two famous sequential model-based methods. However, these algorithms essentially run multiple trials to predict the optimized hyperparameter settings.

## 3 Investigation of hyperparameters in cosine-based softmax losses

In recent years, state-of-the-art cosine-based softmax losses, including L2-softmax [28], CosFace [40] and ArcFace [7], have significantly improved the performance of deep face recognition. However, the final performance of those losses is substantially affected by their hyperparameter settings, which are generally difficult to tune and require multiple trials in practice. We analyze the two most important hyperparameters in cosine-based losses, the scaling parameter $s$ and the margin parameter $m$. Specifically, we study in depth their effects on the prediction probabilities after softmax, which serve as supervision signals for updating the entire neural network.

Let $\vec{x}_i$ denote the deep representation (feature) of the $i$-th face image of the current mini-batch with size $N$, and $y_i$ be the corresponding label. The predicted classification probability $P_{i,j}$ of all $N$ samples in the mini-batch can be estimated by the softmax function as

$$P_{i,j} = \frac{e^{f_{i,j}}}{\sum_{k=1}^{C} e^{f_{i,k}}}, \tag{1}$$

where $f_{i,j}$ is the logit used as the input of softmax, $P_{i,j}$ represents its softmax-normalized probability of assigning $\vec{x}_i$ to class $j$, and $C$ is the number of classes. The cross-entropy loss associated with the current mini-batch is

$$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\log P_{i,y_i} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f_{i,y_i}}}{\sum_{k=1}^{C} e^{f_{i,k}}}. \tag{2}$$

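Conventional softmax loss and state-of-the-art cosine-based softmax losses [28, 40, 7] calculate the logits $f_{i,j}$ in different ways. In conventional softmax loss, logits are obtained as the inner product between feature $\vec{x}_i$ and the $j$-th class weights $\vec{W}_j$ as $f_{i,j} = \vec{W}_j^{T}\vec{x}_i$. In the cosine-based softmax losses [28, 40, 7], cosine similarity is calculated by $\cos\theta_{i,j} = \langle\vec{x}_i, \vec{W}_j\rangle / (\|\vec{x}_i\|\,\|\vec{W}_j\|)$. The logits are calculated as $f_{i,j} = s\cdot\cos\theta_{i,j}$, where $s$ is a scale hyperparameter. To enforce angular margin on the representations, ArcFace [7] modified the loss to the form

$$f_{i,j} = s\cdot\cos\left(\theta_{i,j} + \mathbb{1}\{j = y_i\}\cdot m\right), \tag{3}$$

while CosFace [40] uses

$$f_{i,j} = s\cdot\left(\cos\theta_{i,j} - \mathbb{1}\{j = y_i\}\cdot m\right), \tag{4}$$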

where $m$ is the margin. The indicator function $\mathbb{1}\{j = y_i\}$ returns $1$ when $j = y_i$ and $0$ otherwise. All margin-based variants decrease the logit $f_{i,y_i}$ associated with the correct class by subtracting the margin $m$. Compared with the losses without margin, margin-based variants require $f_{i,y_i}$ to be greater than the other $f_{i,j}$, $j \neq y_i$, by a specified $m$.
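The three logit definitions above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any reference implementation: the function names `cosine_logits` and `cross_entropy`, the argument layout, and the `kind` switch are assumptions made for this example.

```python
import numpy as np

def cosine_logits(x, W, y, s, m=0.0, kind="none"):
    """Cosine-based logits f_{i,j}.

    x: (N, D) features, W: (C, D) class weights, y: (N,) integer labels.
    kind: "none" (plain scaled cosine), "arcface" (Eq. 3) or "cosface" (Eq. 4).
    """
    x_n = x / np.linalg.norm(x, axis=1, keepdims=True)
    W_n = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos = np.clip(x_n @ W_n.T, -1.0, 1.0)   # cos(theta_{i,j}), shape (N, C)
    onehot = np.eye(W.shape[0])[y]          # indicator 1{j = y_i}
    if kind == "arcface":                   # margin added to the angle
        return s * np.cos(np.arccos(cos) + onehot * m)
    if kind == "cosface":                   # margin subtracted from the cosine
        return s * (cos - onehot * m)
    return s * cos

def cross_entropy(logits, y):
    """Mean cross-entropy of Eq. (2), computed with a stable log-softmax."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()
```

With the same features and weights, the CosFace logits differ from the plain ones only in the target column, where they are lower by exactly $s \cdot m$, so the cross-entropy can only increase.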

Intuitively, on one hand, the parameter $s$ scales up the narrow range of cosine distances, making the logits more discriminative. On the other hand, the parameter $m$ enlarges the margin between different classes to enhance classification ability. These hyperparameters eventually affect $P_{i,y_i}$. Empirically, an ideal hyperparameter setting should help $P_{i,y_i}$ to satisfy the following two properties: (1) Predicted probabilities $P_{i,y_i}$ of each class (identity) should span the range $[0, 1]$: the lower boundary of $P_{i,y_i}$ should be near $0$ while the upper boundary should be near $1$; (2) The changing curve of $P_{i,y_i}$ should have large absolute gradients around $\theta_{i,y_i}$ to make training effective.

### 3.1 Effects of the scale parameter s

The scale parameter $s$ can significantly affect $P_{i,y_i}$. Intuitively, $P_{i,y_i}$ should gradually increase from $0$ to $1$ as the angle $\theta_{i,y_i}$ decreases from $\frac{\pi}{2}$ to $0$, i.e., the smaller the angle between $\vec{x}_i$ and its corresponding class weight $\vec{W}_{y_i}$ is, the larger the probability $P_{i,y_i}$ should be. (Mathematically, $\theta_{i,y_i}$ can be any value in $[0, \pi]$. We empirically found, however, that the maximum $\theta_{i,y_i}$ is always around $\frac{\pi}{2}$; see the red curve in Fig. 1 for examples.) Both an improper probability range and improper probability curves w.r.t. $\theta_{i,y_i}$ would negatively affect the training process and thus the recognition performance.

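We first study the range of classification probability $P_{i,j}$. Given scale parameter $s$, the range of probabilities in all cosine-based softmax losses is

$$\frac{1}{1 + (C-1)\cdot e^{s}} \le P_{i,j} \le \frac{e^{s}}{e^{s} + (C-1)}, \tag{5}$$

where the lower boundary is achieved when $f_{i,j} = s\cdot 0 = 0$ and $f_{i,k} = s\cdot 1 = s$ for all $k \neq j$ in Eq. (1). Similarly, the upper bound is achieved when $f_{i,j} = s$ and $f_{i,k} = 0$ for all $k \neq j$. The range of $P_{i,j}$ approaches 1 when $s \to \infty$, i.e.,

$$\lim_{s\to+\infty}\left(\frac{e^{s}}{e^{s} + (C-1)} - \frac{1}{1 + (C-1)\cdot e^{s}}\right) = 1, \tag{6}$$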

which means that the requirement of the range spanning $[0, 1]$ could be satisfied with a large $s$. However, this does not mean that a larger scale parameter is always a better choice. In fact, the probability range can easily approach a high value even for a moderate class number and scale parameter. But an oversized scale would lead to a poor probability distribution, as will be discussed in the following paragraphs.
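To make the bounds of Eq. (5) concrete, the following sketch evaluates them for a few settings. The helper name `probability_range` and the chosen values of the class number and scale are illustrative assumptions, not settings taken from the experiments.

```python
import math

def probability_range(num_classes, s):
    """Lower and upper bounds of P_{i,j} from Eq. (5)."""
    lower = 1.0 / (1.0 + (num_classes - 1) * math.exp(s))
    upper = math.exp(s) / (math.exp(s) + (num_classes - 1))
    return lower, upper

# Hypothetical settings: 10 classes and three candidate scales.
for s in (1.0, 5.0, 20.0):
    lo, hi = probability_range(num_classes=10, s=s)
    print(f"s={s:5.1f}  lower={lo:.6f}  upper={hi:.6f}  span={hi - lo:.6f}")
```

The span grows monotonically with the scale, which matches the limit in Eq. (6).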

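We investigate the influences of parameter $s$ by taking $P_{i,y_i}$ as a function of $s$ and angle $\theta_{i,y_i}$, where $y_i$ denotes the label of $\vec{x}_i$. Formally, we have

$$P_{i,y_i} = \frac{e^{f_{i,y_i}}}{e^{f_{i,y_i}} + B_i} = \frac{e^{s\cdot\cos\theta_{i,y_i}}}{e^{s\cdot\cos\theta_{i,y_i}} + B_i}, \tag{7}$$

where $B_i = \sum_{k\neq y_i} e^{f_{i,k}} = \sum_{k\neq y_i} e^{s\cdot\cos\theta_{i,k}}$ is the summation of the exponentiated logits of all non-corresponding classes for feature $\vec{x}_i$. We observe that the values of $B_i$ are almost unchanged during the training process. This is because the angles $\theta_{i,k}$ for non-corresponding classes $k \neq y_i$ always stay around $\frac{\pi}{2}$ during training (see red curve in Fig. 1).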

Therefore, we can assume $B_i$ is constant, i.e., $B_i \approx \sum_{k\neq y_i} e^{s\cdot\cos(\pi/2)} = C - 1$. We then plot curves of probabilities $P_{i,y_i}$ w.r.t. $\theta_{i,y_i}$ under different settings of parameter $s$ in Fig. 2(a). It is obvious that when $s$ is too small for the given class/identity number $C$, the maximal value of $P_{i,y_i}$ cannot reach $1$. This is undesirable because even when the network is very confident about a sample $\vec{x}_i$'s corresponding class label $y_i$, e.g., $\theta_{i,y_i} = 0$, the loss function would still penalize the classification results and update the network.

On the other hand, when $s$ is too large, the probability curve $P_{i,y_i}$ w.r.t. $\theta_{i,y_i}$ is also problematic. It would output a very high probability even when $\theta_{i,y_i}$ is close to $\frac{\pi}{2}$, which means that the loss function with a large $s$ may fail to penalize mis-classified samples and cannot effectively update the networks to correct mistakes.

In summary, the scaling parameter $s$ has substantial influence on the range as well as the curves of the probabilities $P_{i,y_i}$, which are crucial for effectively training the deep network.
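Both failure modes can be reproduced numerically from Eq. (7) under the approximation $B_i \approx C - 1$. The sketch below is only an illustration; the function name `target_probability`, the class number, the scales, and the 80-degree test angle are all assumed values chosen for this example.

```python
import numpy as np

def target_probability(theta, s, num_classes):
    """P_{i,y_i} of Eq. (7) under the approximation B_i = C - 1.

    Written as 1 / (1 + B * exp(-s*cos(theta))) for numerical stability.
    """
    return 1.0 / (1.0 + (num_classes - 1) * np.exp(-s * np.cos(theta)))

C = 2000                      # hypothetical number of identities
wide = np.deg2rad(80.0)       # an angle close to pi/2
for s in (5.0, 15.0, 100.0):  # hypothetical small, moderate and large scales
    print(f"s={s:6.1f}  P(theta=0)={target_probability(0.0, s, C):.4f}  "
          f"P(theta=80deg)={target_probability(wide, s, C):.4f}")
```

In this setting the small scale caps the probability well below 1 even at zero angle, while the large scale assigns a probability near 1 to a sample that is almost orthogonal to its class weight.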

### 3.2 Effects of the margin parameter m

In this section, we investigate the effect of the margin parameter $m$ in cosine-based softmax losses (Eqs. (3) & (4)) on feature $\vec{x}_i$'s predicted class probability $P_{i,y_i}$. For simplicity, we here study the margin parameter $m$ for ArcFace (Eq. (3)); similar conclusions also apply to the parameter $m$ in CosFace (Eq. (4)).

We first re-write the classification probability $P_{i,y_i}$ following Eq. (7) as

$$P_{i,y_i} = \frac{e^{f_{i,y_i}}}{e^{f_{i,y_i}} + B_i} = \frac{e^{s\cdot\cos(\theta_{i,y_i} + m)}}{e^{s\cdot\cos(\theta_{i,y_i} + m)} + B_i}. \tag{8}$$

To study the influence of parameter $m$ on the probability $P_{i,y_i}$, we assume both $s$ and $B_i$ are fixed. Following the discussion in Section 3.1, we set $B_i \approx C - 1$ and fix $s$ to a constant. The probability curves $P_{i,y_i}$ w.r.t. $\theta_{i,y_i}$ under different $m$ are shown in Fig. 2(b).

According to Fig. 2(b), increasing the margin parameter shifts probability curves to the left. Thus, with the same $\theta_{i,y_i}$, larger margin parameters lead to lower probabilities $P_{i,y_i}$ and thus larger loss even with small angles $\theta_{i,y_i}$. In other words, the angle $\theta_{i,y_i}$ between the feature $\vec{x}_i$ and its corresponding class's weights $\vec{W}_{y_i}$ has to be very small for sample $i$ to be correctly classified. This is the reason why margin-based losses provide stronger supervisions for the same $\theta_{i,y_i}$ than conventional cosine-based losses. Proper margin settings have been shown to boost the final recognition performance in [40, 7].

Although a larger margin $m$ provides stronger supervisions, it should not be too large either. When $m$ is oversized, the probabilities $P_{i,y_i}$ become unreliable: the loss would output probabilities around $0$ even when $\theta_{i,y_i}$ is very small. This leads to large loss for almost all samples even with very small sample-to-class angles, which makes the training difficult to converge. In previous methods, the margin parameter selection is an ad-hoc procedure and has no theoretical guidance for most cases.
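The leftward shift can be checked directly from Eq. (8): with $B_i \approx C - 1$, the probability equals 0.5 where $s\cdot\cos(\theta + m) = \log(C - 1)$, so the half-probability angle moves left by exactly $m$. The sketch below illustrates this; the function names and the values of $s$, $m$, and $C$ are assumptions chosen for the example.

```python
import numpy as np

def arcface_target_probability(theta, s, m, num_classes):
    """P_{i,y_i} of Eq. (8) with B_i approximated by C - 1."""
    return 1.0 / (1.0 + (num_classes - 1) * np.exp(-s * np.cos(theta + m)))

def half_probability_angle(s, m, num_classes):
    """Angle where Eq. (8) gives P = 0.5, i.e. s*cos(theta + m) = log(C - 1).

    A negative result means the probability stays below 0.5 even at theta = 0.
    """
    return np.arccos(np.log(num_classes - 1) / s) - m

s, C = 30.0, 2000                 # hypothetical scale and class number
for m in (0.0, 0.5, 1.0, 1.5):    # hypothetical margins in radians
    print(f"m={m:.1f}  half-probability angle={half_probability_angle(s, m, C):+.3f} rad  "
          f"P(theta=0)={arcface_target_probability(0.0, s, m, C):.4f}")
```

For the largest margin in this sketch, the half-probability angle is negative and the probability at zero angle is close to 0, which is the oversized-margin regime described above.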

### 3.3 Summary of the hyperparameter study

According to our analysis, we can draw the following conclusions:

(1) Hyperparameters scale $s$ and margin $m$ can substantially influence the prediction probability $P_{i,y_i}$ of feature $\vec{x}_i$ with ground-truth identity/category $y_i$. For the scale parameter $s$, a too small $s$ would limit the maximal value of $P_{i,y_i}$. On the other hand, a too large $s$ would make most predicted probabilities $P_{i,y_i}$ close to $1$, which makes the training loss insensitive to the correctness of $\theta_{i,y_i}$. For the margin parameter $m$, a too small margin is not strong enough to regularize the final angular margin, while an oversized margin makes the training difficult to converge.

(2) The effects of scale $s$ and margin $m$ can be unified as modulating the mapping from cosine distances $\cos\theta_{i,y_i}$ to the prediction probability $P_{i,y_i}$. As shown in Fig. 2(a) and Fig. 2(b), both small scales and large margins have similar effects on $\theta_{i,y_i}$ for strengthening the supervisions, while both large scales and small margins weaken the supervisions. Therefore it is feasible and promising to control the probability $P_{i,y_i}$ using one single hyperparameter, either $s$ or $m$. Considering the fact that $s$ is more related to the range of $P_{i,y_i}$, which is required to span $[0, 1]$, we will focus on automatically tuning the scale parameter $s$ in the remainder of this paper.

## 4 The cosine-based softmax loss with adaptive scaling

Based on our previous studies on the hyperparameters of the cosine-based softmax loss functions, in this section, we propose a novel loss with a self-adaptive scaling scheme, namely AdaCos, which does not require the ad-hoc and time-consuming manual parameter tuning. Training with the proposed loss does not only facilitate convergence but also results in higher recognition accuracy.

Our previous studies on Fig. 1 show that during the training process, the angles $\theta_{i,k}$ for $k \neq y_i$ between the feature $\vec{x}_i$ and its non-corresponding weights $\vec{W}_{k \neq y_i}$ are almost always close to $\frac{\pi}{2}$. In other words, we could safely assume that $B_i \approx \sum_{k\neq y_i} e^{s\cdot\cos(\pi/2)} = C - 1$ in Eq. (7). Obviously, it is the probability $P_{i,y_i}$ of feature $\vec{x}_i$ belonging to its corresponding class $y_i$ that has the most influence on supervision for network training. Therefore, we focus on designing an adaptive scale parameter $\tilde{s}$ for controlling the probabilities $P_{i,y_i}$.

From the curves of $P_{i,y_i}$ w.r.t. $\theta_{i,y_i}$ (Fig. 2(a)), we observe that the scale parameter $s$ not only affects $P_{i,y_i}$'s boundary for determining correct/incorrect but also squeezes/stretches the curvature. In contrast to scale $s$, the margin parameter $m$ only shifts the curve in phase. We therefore propose to automatically tune the scale parameter $s$ and eliminate the margin parameter $m$ from our loss function, which makes our proposed AdaCos loss different from state-of-the-art softmax loss variants with angular margin. With the softmax function, the predicted probability can be defined by

$$P_{i,j} = \frac{e^{\tilde{s}\cdot\cos\theta_{i,j}}}{\sum_{k=1}^{C} e^{\tilde{s}\cdot\cos\theta_{i,k}}}, \tag{9}$$

where $\tilde{s}$ is the automatically tuned scale parameter to be discussed below.

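Let us first re-consider $P_{i,y_i}$ (Eq. (7)) as a function of $\theta_{i,y_i}$. Note that $\theta_{i,y_i}$ represents the angle between sample $\vec{x}_i$ and the weight vector of its ground truth category $y_i$. For network training, we hope to minimize $\theta_{i,y_i}$ with the supervision from the loss function $\mathcal{L}_{CE}$. Our objective is to choose a suitable scale $\tilde{s}$ which makes the predicted probability $P_{i,y_i}$ change significantly with respect to $\theta_{i,y_i}$. Mathematically, we find the point where the absolute gradient value $\left\|\frac{\partial P_{i,y_i}(\theta)}{\partial\theta}\right\|$ reaches its maximum, which occurs when the second-order derivative of $P_{i,y_i}$ at $\theta_0$ equals $0$, i.e.,

$$\frac{\partial^2 P_{i,y_i}(\theta_0)}{\partial\theta_0^2} = 0, \tag{10}$$

where $\theta_0 \in [0, \frac{\pi}{2}]$. Combining Eqs. (7) and (10), we obtain a transcendental equation. Considering that $P(\theta_0)$ is close to $\frac{1}{2}$, the relation between the scale parameter $s$ and the point $(\theta_0, P(\theta_0))$ can be approximated as

$$s_0 = \frac{\log B_i}{\cos\theta_0}, \tag{11}$$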

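where $B_i$ can be well approximated as $B_i = \sum_{k\neq y_i} e^{s\cdot\cos\theta_{i,k}} \approx C - 1$ since the angles $\theta_{i,k}$ distribute around $\frac{\pi}{2}$ during training (see Eq. (7) and Fig. 1). Then the task of automatically determining $\tilde{s}$ reduces to selecting a reasonable central angle $\tilde{\theta}$ in $[0, \frac{\pi}{2}]$.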
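The step from Eq. (10) to Eq. (11) can be spelled out as follows. This is a reconstruction of the intermediate algebra under the assumption that $B_i$ is treated as a constant; the shorthand $u(\theta)$ and $\sigma$ are introduced here only for this derivation.

```latex
% Rewrite Eq. (7) as a sigmoid of u(\theta) = s\cos\theta - \log B_i.
\begin{align*}
P(\theta) &= \frac{e^{s\cos\theta}}{e^{s\cos\theta} + B_i}
           = \sigma\bigl(u(\theta)\bigr),
  \qquad \sigma(u) = \frac{1}{1 + e^{-u}}, \\
\sigma'(u) &= \sigma(u)\bigl(1 - \sigma(u)\bigr),
  \qquad \sigma''(u) = \sigma'(u)\bigl(1 - 2\sigma(u)\bigr), \\
\frac{\partial P}{\partial\theta} &= -s\sin\theta\;\sigma'(u), \\
\frac{\partial^2 P}{\partial\theta^2}
  &= \sigma'(u)\Bigl[\bigl(1 - 2P\bigr)\,s^2\sin^2\theta - s\cos\theta\Bigr].
\end{align*}
% Setting the bracket to zero at \theta_0 gives the transcendental condition
\begin{equation*}
\bigl(1 - 2P(\theta_0)\bigr)\, s\,\sin^2\theta_0 = \cos\theta_0 .
\end{equation*}
% For large s the factor 1 - 2P(\theta_0) = \cos\theta_0 / (s\sin^2\theta_0)
% is small, so P(\theta_0) \approx 1/2, i.e. u(\theta_0) \approx 0, which yields
\begin{equation*}
s_0\cos\theta_0 \approx \log B_i
\quad\Longrightarrow\quad
s_0 \approx \frac{\log B_i}{\cos\theta_0}.
\end{equation*}
```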

### 4.1 Automatically choosing a fixed scale parameter

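Since $\frac{\pi}{4}$ is in the center of $[0, \frac{\pi}{2}]$, it is natural to regard $\frac{\pi}{4}$ as the point $\theta_0$, i.e., setting $\theta_0 = \frac{\pi}{4}$, for figuring out an effective mapping from angle $\theta_{i,y_i}$ to the probability $P_{i,y_i}$. Then the supervisions determined by $P_{i,y_i}$ would be back-propagated to update $\theta_{i,y_i}$ and further to update network parameters. According to Eq. (11), we can estimate the corresponding scale parameter $\tilde{s}_f$ as

$$\begin{aligned}
\tilde{s}_f &= \frac{\log B_i}{\cos\frac{\pi}{4}} = \frac{\log\sum_{k\neq y_i} e^{s\cdot\cos\theta_{i,k}}}{\cos\frac{\pi}{4}} \\
&\approx \sqrt{2}\cdot\log(C - 1),
\end{aligned} \tag{12}$$

where $B_i$ is approximated by $C - 1$.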

Such an automatically-chosen fixed scale parameter $\tilde{s}_f$ (see Figs. 2(a) and 2(b)) depends on the number of classes $C$ in the training set and also provides a good guideline for existing cosine distance based softmax losses to choose their scale parameters. In contrast, the scaling parameters in existing methods were manually set according to human experience. The fixed scale also acts as a good baseline for our dynamically tuned scale parameter $\tilde{s}_d$ in the next section.
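Eq. (12) is a one-line computation. The sketch below is a minimal illustration; the function name `fixed_adacos_scale` and the example class count are assumptions made for this example.

```python
import math

def fixed_adacos_scale(num_classes):
    """Fixed scale of Eq. (12): sqrt(2) * log(C - 1)."""
    if num_classes < 3:
        # log(C - 1) must be positive for the scale to be meaningful.
        raise ValueError("num_classes must be at least 3")
    return math.sqrt(2.0) * math.log(num_classes - 1)

# Hypothetical training set with 10,001 identities.
s_f = fixed_adacos_scale(10001)
# With B_i = C - 1, this scale places P = 0.5 exactly at theta = pi/4.
p_at_quarter_pi = 1.0 / (1.0 + 10000 * math.exp(-s_f * math.cos(math.pi / 4)))
print(s_f, p_at_quarter_pi)
```

Because the scale grows only logarithmically with the number of classes, doubling the number of identities changes it by less than one unit.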

### 4.2 Dynamically adaptive scale parameter

As Fig. 1 shows, the angles $\theta_{i,y_i}$ between features and their ground-truth class weights gradually decrease as the training iterations increase, while the angles $\theta_{i,k}$ ($k \neq y_i$) between features and non-corresponding classes stabilize around $\frac{\pi}{2}$.

Although our previously fixed scale parameter $\tilde{s}_f$ behaves properly as $\theta_{i,y_i}$ changes over $[0, \frac{\pi}{2}]$, it does not take into account the fact that $\theta_{i,y_i}$ gradually decreases during training. Since a smaller $\theta_{i,y_i}$ gains a higher probability $P_{i,y_i}$ and thus gradually receives weaker supervisions as the training proceeds, we propose a dynamically adaptive scale parameter $\tilde{s}_d$ that gradually applies a stricter requirement on the position of $\theta_0$, which can progressively enhance the supervisions throughout the training process.

Formally, we introduce a modulating indicator variable $\theta^{(t)}_{med}$, which is the median of all corresponding classes' angles, $\theta^{(t)}_{i,y_i}$, from the mini-batch of size $N$ at the $t$-th iteration. $\theta^{(t)}_{med}$ roughly represents the current network's degree of optimization on the mini-batch. When the median angle is large, it denotes that the network parameters are far from optimum and less strict supervisions should be applied to make the training converge more stably; when the median angle is small, it denotes that the network is close to optimum and stricter supervisions should be applied to make the intra-class angles $\theta_{i,y_i}$ become even smaller. Based on this observation, we set the central angle $\tilde{\theta}^{(t)}_0 = \theta^{(t)}_{med}$. We also introduce $B^{(t)}_{avg}$ as the average of $B^{(t)}_i$:

$$B^{(t)}_{avg} = \frac{1}{N}\sum_{i\in\mathcal{N}^{(t)}} B^{(t)}_i = \frac{1}{N}\sum_{i\in\mathcal{N}^{(t)}}\sum_{k\neq y_i} e^{\tilde{s}^{(t-1)}_d\cdot\cos\theta_{i,k}}, \tag{13}$$

where $\mathcal{N}^{(t)}$ denotes the face identity indices in the mini-batch at the $t$-th iteration. Unlike approximating $B_i \approx C - 1$ for the fixed adaptive scale parameter $\tilde{s}_f$, here we estimate $B^{(t)}_i$ using the scale parameter $\tilde{s}^{(t-1)}_d$ of the previous iteration, which provides a more accurate approximation. Note that $B^{(t)}_i$ itself depends on the dynamic scale $\tilde{s}^{(t)}_d$, so the exact scale could only be obtained by solving the resulting nonlinear equation. In practice, we notice that $\tilde{s}^{(t)}_d$ changes very little between consecutive iterations. So, we simply use $\tilde{s}^{(t-1)}_d$ to calculate $B^{(t)}_i$ with Eq. (7). Then we can obtain the dynamic scale $\tilde{s}^{(t)}_d$ directly with Eq. (11):

$$\tilde{s}^{(t)}_d = \frac{\log B^{(t)}_{avg}}{\cos\theta^{(t)}_{med}}, \tag{14}$$

where $B^{(t)}_{avg}$ is related to the dynamic scale parameter. We estimate it using the scale parameter $\tilde{s}^{(t-1)}_d$ of the previous iteration.

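At the beginning of the training process, the median angle $\theta^{(t)}_{med}$ of each mini-batch might be too large to impose enough supervisions for training. We therefore force the central angle $\tilde{\theta}^{(t)}_0$ to be less than $\frac{\pi}{4}$. Our dynamic scale parameter for the $t$-th iteration could then be formulated as

$$\tilde{s}^{(t)}_d = \begin{cases}
\sqrt{2}\cdot\log(C - 1), & t = 0, \\[4pt]
\dfrac{\log B^{(t)}_{avg}}{\cos\left(\min\left(\frac{\pi}{4}, \theta^{(t)}_{med}\right)\right)}, & t \ge 1,
\end{cases} \tag{15}$$

where $\tilde{s}^{(0)}_d$ is initialized as our fixed scale parameter $\tilde{s}_f$ when $t = 0$.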
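One update step of Eqs. (13)-(15) for $t \ge 1$ can be sketched in NumPy as below. This is an illustrative sketch, not the authors' code: the function name `dynamic_adacos_scale` and its argument layout are assumptions, and the scale is treated as a plain number carried from one iteration to the next.

```python
import numpy as np

def dynamic_adacos_scale(cos_theta, labels, prev_scale):
    """One update of the dynamic scale (Eq. 15, case t >= 1).

    cos_theta:  (N, C) cosines between features and class weights.
    labels:     (N,) integer ground-truth labels y_i.
    prev_scale: scale from the previous iteration (fixed scale at t = 0).
    """
    n = cos_theta.shape[0]
    target_mask = np.zeros(cos_theta.shape, dtype=bool)
    target_mask[np.arange(n), labels] = True

    # Eq. (13): average over the batch of sum_{k != y_i} exp(s_prev * cos).
    exp_logits = np.exp(prev_scale * cos_theta)
    b_avg = np.where(target_mask, 0.0, exp_logits).sum() / n

    # Median of the ground-truth angles theta_{i, y_i} in the batch.
    theta_target = np.arccos(np.clip(cos_theta[target_mask], -1.0, 1.0))
    theta_med = np.median(theta_target)

    # Eq. (15): clamp the central angle to at most pi/4.
    return np.log(b_avg) / np.cos(min(np.pi / 4, theta_med))
```

When all non-target cosines are zero and the median target angle is at least $\frac{\pi}{4}$, the update returns $\sqrt{2}\cdot\log(C-1)$, the fixed scale; as the median angle shrinks below $\frac{\pi}{4}$, the returned scale decreases.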

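Substituting $\tilde{s}^{(t)}_d$ into $f_{i,j} = \tilde{s}^{(t)}_d\cdot\cos\theta_{i,j}$, the corresponding gradients can be calculated as follows

$$\begin{aligned}
\frac{\partial\mathcal{L}(\vec{x}_i)}{\partial\vec{x}_i} &= \sum_{j=1}^{C}\left(P^{(t)}_{i,j} - \mathbb{1}(y_i = j)\right)\cdot\tilde{s}^{(t)}_d\,\frac{\partial\cos\theta_{i,j}}{\partial\vec{x}_i}, \\
\frac{\partial\mathcal{L}(\vec{W}_j)}{\partial\vec{W}_j} &= \left(P^{(t)}_{i,j} - \mathbb{1}(y_i = j)\right)\cdot\tilde{s}^{(t)}_d\,\frac{\partial\cos\theta_{i,j}}{\partial\vec{W}_j},
\end{aligned} \tag{16}$$

where $\mathbb{1}$ is the indicator function and

$$P^{(t)}_{i,j} = \frac{e^{\tilde{s}^{(t)}_d\cdot\cos\theta_{i,j}}}{\sum_{k=1}^{C} e^{\tilde{s}^{(t)}_d\cdot\cos\theta_{i,k}}}. \tag{17}$$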

Eq. (17) shows that the dynamically adaptive scale parameter influences classification probabilities differently at each iteration and also effectively affects the gradients (Eq. (16)) for updating network parameters. The benefit of dynamic AdaCos is that it can produce reasonable scale parameter by sensing the training convergence of the model in the current iteration.
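The factor $(P^{(t)}_{i,j} - \mathbb{1}(y_i = j))\cdot\tilde{s}^{(t)}_d$ in Eq. (16) is the derivative of the per-sample cross-entropy with respect to $\cos\theta_{i,j}$, and it can be checked against finite differences. The sketch below assumes the scale is treated as a constant during back-propagation (that is, no gradient flows through the scale update itself); the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_sum(cos_theta, labels, scale):
    """Sum over samples of -log P_{i, y_i}, with P from Eq. (17)."""
    p = softmax(scale * cos_theta)
    return -np.log(p[np.arange(len(labels)), labels]).sum()

def grad_wrt_cos(cos_theta, labels, scale):
    """(P_{i,j} - 1(y_i = j)) * scale: derivative of the loss w.r.t. cos(theta_{i,j})."""
    p = softmax(scale * cos_theta)
    p[np.arange(len(labels)), labels] -= 1.0
    return scale * p

def numerical_grad(cos_theta, labels, scale, eps=1e-6):
    """Central finite differences of loss_sum, used only as a check."""
    g = np.zeros_like(cos_theta)
    for idx in np.ndindex(*cos_theta.shape):
        d = np.zeros_like(cos_theta)
        d[idx] = eps
        g[idx] = (loss_sum(cos_theta + d, labels, scale)
                  - loss_sum(cos_theta - d, labels, scale)) / (2 * eps)
    return g
```

Multiplying this factor by $\partial\cos\theta_{i,j}/\partial\vec{x}_i$ or $\partial\cos\theta_{i,j}/\partial\vec{W}_j$ then gives the two gradients of Eq. (16) via the chain rule.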

## 5 Experiments

We examine the proposed AdaCos loss function on several public face recognition benchmarks and compare it with state-of-the-art cosine-based softmax losses. The compared losses include L2-softmax [28], CosFace [40], and ArcFace [7]. We present evaluation results on LFW [13], MegaFace 1-million Challenge [24], and IJB-C [23] data. We also present results on some exploratory experiments to show the convergence speed and robustness against low-resolution images.

Preprocessing. We use two public training datasets, CASIA-WebFace [43] and MS1M [9], to train CNN models with our proposed loss functions. We carefully clean the noisy and low-quality images from the datasets. The cleaned WebFace [43] and MS1M [9] contain about M and M facial images, respectively. All models are trained based on these training data and directly tested on the test splits of the three datasets. RSA [22] is applied to the images to extract facial areas. Then, according to detected facial landmarks, the faces are aligned through similarity transformation and resized to the size . All image pixel values are subtracted with the mean and dividing by .

### 5.1 Results on LFW

The LFW [13] dataset collected thousands of identities from the internet. Its testing protocol contains about images for about identities with a total of ground-truth matches. Half of the matches are positive while the other half are negative ones. LFW's primary difficulties lie in face pose variations, color jittering, illumination variations and aging of persons. Note that a portion of the pose variations can be eliminated by the RSA [22] facial landmark detection and alignment algorithm, but there still exist some non-frontal facial images which cannot be aligned by RSA [22] and are then aligned manually.

#### 5.1.1 Comparison on LFW

For all experiments on LFW [13], we train ResNet-50 models [10] with batch size of on the cleaned WebFace [43] dataset. The input size of facial image is and the feature dimension input into the loss function is . Different loss functions are compared with our proposed AdaCos losses.

Results in Table 1 show the recognition accuracies of models trained with different softmax loss functions. Our proposed AdaCos losses with fixed and dynamic scale parameters (denoted as Fixed AdaCos and Dyna. AdaCos) surpass the state-of-the-art cosine-based softmax losses under the same training configuration. For the hyperparameter settings of the compared losses, the scaling parameter is set as for L2-softmax [28], CosFace [40] and ArcFace [7]; the margin parameters are set as and for CosFace [40] and ArcFace [7], respectively. Since LFW is a relatively easy evaluation set, we train and test all losses three times. The average accuracy of our proposed dynamic AdaCos is higher than state-of-the-art ArcFace [7] and than L2-softmax [28].

#### 5.1.2 Exploratory Experiments

The change of scale parameters and feature angles during training. In this part, we show the change of the scale parameter $\tilde{s}^{(t)}_d$ and feature angles $\theta_{i,j}$ during training with our proposed AdaCos loss. The scale parameter $\tilde{s}^{(t)}_d$ changes along with the current recognition performance of the model, which continuously strengthens the supervisions by gradually reducing $\theta_{i,y_i}$ and thus shrinking $\tilde{s}^{(t)}_d$. Fig. 3 shows the change of the scale parameter with our proposed fixed AdaCos and dynamic AdaCos losses. For the dynamic AdaCos loss, the scale parameter $\tilde{s}^{(t)}_d$ adaptively decreases as the training iterations increase, which indicates that the loss function provides stricter supervisions to update network parameters. Fig. 4 illustrates the change of $\theta_{i,j}$ with our proposed dynamic AdaCos and L2-softmax. The average (orange curve) and median (green curve) of $\theta_{i,y_i}$, which indicates the angle between a sample and its ground-truth category, gradually reduce while the average (maroon curve) of $\theta_{i,j}$ where $j \neq y_i$ remains nearly $\frac{\pi}{2}$. Compared with the L2-softmax loss, our proposed loss achieves much smaller sample-feature-to-category angles on the ground-truth classes and leads to higher recognition accuracies.

Convergence rates. Convergence rate is an important indicator of the efficiency of loss functions. We examine the convergence rates of several cosine-based losses at different training iterations. The training configurations are the same as in Table 1. Results in Table 2 reveal that the convergence rates when training with the AdaCos losses are much higher.

### 5.2 Results on MegaFace

We then evaluate the performance of proposed AdaCos on the MegaFace Challenge [16], which is a publicly available identification benchmark, widely used to test the performance of facial recognition algorithms. The gallery set of MegaFace incorporates over million images from K identities collected from Flickr photos [37]. We follow ArcFace [7]’s testing protocol, which cleaned the dataset to make the results more reliable. We train the same Inception-ResNet [33] models with CASIA-WebFace [43] and MS1M [9] training data, where overlapped subjects are removed.

Table 3 and Fig. 5 summarize the results of models trained on both WebFace and MS1M datasets and tested on the cleaned MegaFace dataset. The proposed AdaCos and state-of-the-art softmax losses are compared, where the dynamic AdaCos loss outperforms all compared losses on the MegaFace.

### 5.3 Results on IJB-C 1:1 verification protocol

The IJB-C dataset [23] contains about identities with a total of still facial images and unconstrained video frames. In the 1:1 verification, there are positive matches and negative matches, which allow us to evaluate TARs at various FARs (e.g., ).

We compare the softmax loss functions, including the proposed AdaCos, L2-softmax [28], CosFace [40], and ArcFace [7], with the same training data (WebFace [43] and MS1M [9]) and network architecture (Inception-ResNet [33]). We also report the results of FaceNet [29] and VGGFace [36] listed in Crystal loss [27]. Table 4 and Fig. 6 exhibit their performances on the IJB-C 1:1 verification. Our proposed dynamic AdaCos achieves the best performance.

## 6 Conclusions

In this work, we argue that the bottleneck of existing cosine-based softmax losses may primarily come from the mismatch between the cosine distance $\cos\theta_{i,y_i}$ and the classification probability $P_{i,y_i}$, which limits the final recognition performance. To address this issue, we first analyze in depth the effects of hyperparameters in cosine-based softmax losses from the perspective of probability. Based on this analysis, we propose AdaCos, which automatically adjusts an adaptive parameter $\tilde{s}^{(t)}_d$ in order to reformulate the mapping between cosine distance and classification probability. Our proposed AdaCos loss is simple yet effective. We demonstrate its effectiveness and efficiency by exploratory experiments and report its state-of-the-art performance on several public benchmarks.

Acknowledgements. This work is supported in part by SenseTime Group Limited, in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14213616, CUHK14208417, CUHK14239816, in part by CUHK Direct Grant, and in part by National Natural Science Foundation of China (61472410) and the Joint Lab of CAS-HK.