Recent years have witnessed the rapid development of deep learning and its applications. Among them, embedding learning [sphereface, cosface, arcface, triplet] (or deep metric learning [lifted, npair, angular]) is one of the most challenging problems and has attracted wide attention. The corresponding research findings support many applications, such as face recognition [sphereface, cosface, arcface, triplet] and person re-identification [spherereid, hypershperereid].
The objective of embedding learning is to learn a mapping function so that, in the embedding space, similar data are close together while dissimilar data are far apart. To accomplish this, the most straightforward choice is to formulate embedding learning as a classification problem and employ the softmax loss as the objective. For instance, in face recognition, the faces of different persons are treated as different classes and a large softmax is used to learn the face embedding.
However, the softmax loss has two major drawbacks. The first is that the intra- and inter-class objectives are entangled. For an intuitive understanding, we visualize this entanglement in Figure 1 (a) and (b). We select the 5 and 10 identities with the most samples in the MS-Celeb-1M [ms1m] dataset, respectively, and plot the learned features in a low-dimensional embedding space. One can observe that with a large inter-class distance (the 5-class case) the intra-class distance is also large. As we will show in Section 3, the reason is that the softmax loss gradually relaxes the intra-class objective as the inter-class distance increases, and vice versa. To our knowledge, we are the first to discuss this entanglement, while existing works mostly address the insufficient discrimination issue by introducing additional supervision [triplet, contrastive] or adding an angular margin to the softmax loss [sphereface, cosface, arcface].
Another shortcoming is the time and memory cost. The softmax loss, as well as its margin-based variants, needs to compute class activations over all the classes. This leads to time and memory complexity that is linear in the number of classes, and in practice the number of classes may be excessively large. The excessive memory demand makes it difficult to load all the class weights into the limited GPU memory, and the dramatically increased time cost is also unacceptable. Contrastive loss and triplet loss are possible alternatives that do not require much memory, but they also need plenty of time for training, and in terms of accuracy they significantly underperform the softmax family.
In this paper, we propose to dissect the softmax loss into an intra-class objective and an inter-class objective. The intra-class objective pulls the feature and the positive class weight together until a pre-defined criterion is satisfied, and the inter-class objective keeps the class weights widely separated in the embedding space. With the dissected softmax (D-Softmax) loss as the optimization objective, the intra- and inter-class objectives are disentangled, so that even when the inter-class objective is well optimized, the regularization on the intra-class objective remains rigorous (Figure 1 (c) and (d)).
Moreover, D-Softmax also dissects the computational complexity of the softmax into two independent parts. The computation of the intra-class objective involves only the sample and one positive class weight, whereas the inter-class objective needs to compute activations over all negative classes. We find in practice that such massive computation for the inter-class objective is somewhat redundant. To reduce the computation, we propose to sample a subset of negative classes in one training pass. According to the sampling strategy, we term the lightened variants D-Softmax-B and D-Softmax-K, respectively. Experiments show that both strategies significantly accelerate the training process with only a minor sacrifice in performance.
Our major contributions can be summarized as follows:
(1) We propose D-Softmax that dissects the intra- and inter-class objective of the softmax loss. The dissected intra-class objective is always rigorous, independent of how well the inter-class objective is optimized, and vice versa. Experiments show D-Softmax outperforms existing methods such as SphereFace and ArcFace on standard face verification benchmarks.
(2) We draw the important conclusion that the full computation of the inter-class objective is redundant, and we propose two sampling-based variants to reduce the computation of D-Softmax. When training with massive classes (757K), our methods significantly accelerate the training process with only a minor sacrifice in performance.
2 Related Work
Softmax and its variants for face recognition. It is a widely adopted approach to formulate face recognition as a multi-class classification problem. DeepFace [deepface] and the DeepID series [deepid, contrastive, contrastive2] employ the conventional softmax loss, in which the class activation is modeled as the inner product between vectors. Such a loss is not discriminative enough, and some recent works address this problem by normalizing the embedding [l2norm], the class weights [weightnorm], or both [normface]. If the embedding and the weights are simultaneously normalized, this is equivalent to optimizing the cosine distance rather than the inner product. This inspired a series of works on softmax variants that directly optimize the angular distances between classes by introducing an angular margin, in the form of either multiplication [sphereface] or addition [cosface, arcface]. However, all the aforementioned losses focus on strengthening the regularization but overlook the fact that the insufficient discrimination of the softmax loss is essentially caused by the entanglement of the intra- and inter-class objectives.
Acceleration for Softmax. The acceleration of the softmax loss is an extensively studied problem, typically in natural language processing, where one needs to deal with large vocabularies. Existing methods mainly re-organize the structure of the softmax loss according to the hierarchy of words [hierarchy] or the imbalanced frequency of classes [langfreq1, langfreq2, langfreq3, langfreq4]. However, none of the above methods applies to real-world applications like face recognition, because the data are neither hierarchically structured nor substantially imbalanced in importance. HF-Softmax [hfsoftmax] is a closely related work, which aims at practical applications, especially face recognition. To reduce the computation cost, HF-Softmax dynamically selects a subset of the training classes for each mini-batch. The subset is selected by constructing a random forest in the embedding space and retrieving the approximate nearest neighbors. Training with HF-Softmax does reduce the time cost, but the construction of the random forest still costs too much time on average. In this work, the light versions of D-Softmax do not require any extra computation besides the loss itself, so the computation is much faster. Moreover, the dissected intra-class objective is always rigorous, so the performance is also superior.
3 Softmax Dissection
3.1 Preliminary Knowledge
The softmax cross-entropy loss is formulated as

$$L_s = -\log\frac{e^{s z_y}}{\sum_{k=1}^{K} e^{s z_k}} = \log\left(1 + \frac{\sum_{k \neq y} e^{s z_k}}{e^{s z_y}}\right), \tag{1}$$

where $s$ is a scale parameter, $z_k$ indicates the activation of the $k$-th class, and $K$ is the number of classes. Specifically, we denote the activation of the ground-truth class as $z_y$. In the conventional softmax loss, $z_k = w_k^T x$, where $w_k$ is the class weight and $x$ is the feature of the last fully connected layer. In recent works [normface, sphereface, cosface, arcface], the activation is usually modified as $z_k = \cos\theta_{w_k, x}$, the cosine of the angle between $w_k$ and $x$. We adopt this cosine formulation for its good performance and intuitive geometric interpretation.
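As a minimal sketch of the cosine formulation above, the following Python computes the scaled-cosine softmax loss for a single sample. The function names and the default scale `s=32.0` are illustrative assumptions and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_softmax_loss(x, weights, y, s=32.0):
    """Softmax cross-entropy with activations z_k = cos(w_k, x), scaled by s.

    x: feature vector; weights: one weight vector per class;
    y: index of the ground-truth class; s: scale (hypothetical default).
    """
    logits = [s * cosine(w, x) for w in weights]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[y]
```

Because both the feature and the class weights are normalized inside `cosine`, the loss depends only on the angles between them, which is what makes the geometric interpretation straightforward.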
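Here we also list several important variants of the softmax loss, i.e., SphereFace [sphereface], ArcFace [arcface] and CosFace [cosface]. Each keeps the form of Eq. 1 and modifies only the ground-truth activation with a margin $m$, where $\theta_{w_y, x}$ is the angle between the ground-truth class weight $w_y$ and the feature $x$:

$$z_y^{\text{SphereFace}} = \cos(m\,\theta_{w_y, x}), \tag{2}$$

$$z_y^{\text{ArcFace}} = \cos(\theta_{w_y, x} + m), \tag{3}$$

$$z_y^{\text{CosFace}} = \cos\theta_{w_y, x} - m. \tag{4}$$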
3.2 The Intra-Class Component
In this section, we first show how the intra-class objective is entangled with the inter-class objective in softmax. Then we compare the intra-class objective of the softmax loss with that of its margin-based variants (Eqs. 2-4). Finally, we present the intra-class objective of our Dissected Softmax loss.
Let $M = \sum_{k \neq y} e^{s z_k}$ denote the numerator of the fraction in the loss, so that $L_s = \log(1 + M / e^{s z_y})$. $M$ reflects the inter-class similarity: a large $M$ means that the input has large cosine similarity with all negative classes, and therefore all these classes must be similar to each other. With the inter-class similarity $M$ fixed, we plot the loss against the ground-truth class activation $z_y$ in Figure 2 (a). Two observations can be made from the figure.

First, this family of curves can be approximated by a piecewise linear function: when $z_y$ is large, $L_s \approx 0$, and when $z_y$ is small, $L_s \approx \log M - s z_y$. This observation implies that when the intra-class similarity is small, the supervision signal back-propagates a near-constant gradient, whereas the gradient is almost zero when $z_y$ is large.

Second, the inflection point where the gradient vanishes (and the optimization thus almost terminates) is positively correlated with the value of $M$. In fact, we can define the intersection point of the two linear pieces as an approximation of the termination point of the optimization:

$$d = \frac{\log M}{s}. \tag{5}$$
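This observation supports an important conclusion (Conclusion #1): with the softmax loss, the more widely the class weights are separated, i.e., the smaller the inter-class similarity, the earlier the optimization of the intra-class similarity terminates.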
Unfortunately, the condition that the class weights are widely separated always holds during training, because the sparsity of the class-weight distribution in high-dimensional space makes the optimization of the inter-class objective quite easy. Therefore, the optimization of the intra-class objective always terminates so early that the training is insufficient. We speculate that this early termination is the reason why softmax underperforms its margin-based variants. To validate this hypothesis, we also plot the loss curves against the ground-truth class activation $z_y$ for SphereFace, CosFace and ArcFace in Figure 2 (b-d), respectively. Although the loss curves have different shapes for different losses, all their termination points are shifted significantly in the positive direction compared to vanilla softmax under the same inter-class similarity $M$. This means that these losses do not stop optimizing the intra-class similarity until $z_y$ is fairly large, whereas softmax may stop optimizing it while $z_y$ is still not large enough. The value of the intra-class termination point plays an important role in learning a discriminative embedding, and we will show how to select a proper value in Sec. 4.2.
The above analysis indicates that the termination point of the intra-class similarity is entangled with the inter-class similarity $M$ (the sum of $e^{s z_k}$ over the negative classes). Since $M$ is usually not large enough, the optimization of the intra-class objective is usually insufficient. To address this problem, we propose to disentangle the intra-class objective from the inter-class objective by replacing $M$ with a constant value $\epsilon$. In this manner, we can manually adjust the optimization termination point of the intra-class similarity to a sufficiently large value $d = \log\epsilon / s$ according to Eq. 5. To summarize, the intra-class component of the Dissected Softmax is

$$L_{intra} = \log\left(1 + \frac{\epsilon}{e^{s z_y}}\right).$$
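The relation between the inter-class term, the termination point and the constant $\epsilon$ can be checked numerically. The sketch below assumes the forms $L_s = \log(1 + M / e^{s z_y})$ and $d = \log M / s$ discussed in this section; the function names and the finite-difference helper are hypothetical and serve only as an illustration.

```python
import math


def softmax_loss_given_M(z_y, M, s):
    """Softmax loss log(1 + M / e^{s z_y}) for a fixed inter-class term M."""
    return math.log1p(M * math.exp(-s * z_y))


def termination_point(M, s):
    """Intersection of the two linear pieces: log(M) - s * z_y = 0."""
    return math.log(M) / s


def intra_loss(z_y, d, s):
    """Intra-class component with epsilon = e^{s d}, so it terminates near z_y = d."""
    epsilon = math.exp(s * d)
    return math.log1p(epsilon * math.exp(-s * z_y))


def grad_wrt_z_y(loss_fn, z_y, h=1e-6):
    """Central finite-difference estimate of d(loss)/d(z_y)."""
    return (loss_fn(z_y + h) - loss_fn(z_y - h)) / (2.0 * h)
```

Evaluating `grad_wrt_z_y` well below the chosen `d` gives a value close to `-s`, and well above `d` a value close to zero, which matches the piecewise linear picture: the pull on the ground-truth activation is roughly constant until the termination point is reached.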
3.3 The Inter-Class Component
In Section 3.2 we modified the softmax loss and obtained a disentangled intra-class objective. However, we still need an inter-class objective as a regularization to avoid collapsing to a trivial solution in which all the data are mapped to a single point. As before, we first analyze the inter-class objective of softmax and its variants, and then give the formulation of the inter-class objective of D-Softmax.
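Consider a sample of class $y$ and its activation $z_n$ on the $n$-th ($n \neq y$) class. The softmax loss can be written as

$$L_s = \log\left(1 + \frac{e^{s z_n} + M_n}{e^{s z_y}}\right),$$

where we replace the summation $\sum_{k \neq y, n} e^{s z_k}$ with $M_n$ for convenience.

First, we fix $M_n$ and study how the loss varies with different $z_n$ and $z_y$. A family of curves is presented in Figure 3 (a). A characteristic similar to that in the intra-class analysis emerges: the gradient remains almost constant when the negative-class similarity $z_n$ is large, and diminishes rapidly to zero at some point. Once again we define the optimization termination point for $z_n$ as the intersection point of the approximate piecewise linear function. When $z_n$ is large, $L_s \approx s z_n - s z_y$, and when $z_n$ is small, $L_s \approx \log(1 + M_n / e^{s z_y})$, so the two pieces intersect at

$$d' = \frac{\log\left(e^{s z_y} + M_n\right)}{s},$$

and a conclusion can be drawn (Conclusion #2): with the softmax loss, the larger the intra-class similarity $z_y$ is, the earlier the optimization of the inter-class similarity terminates.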
This may lead to insufficient separation among different class weights and thus hamper the embedding learning. As evidence, we plot in Figure 3 (b) the termination point against the intra-class similarity for softmax, SphereFace, CosFace and ArcFace. A wider plateau in the curve means that the objective regularizes the inter-class similarity more rigorously. All the large-margin softmax variants present a much wider plateau than vanilla softmax, which is one of the reasons why they produce more discriminative embeddings than softmax.
In light of the above analysis, we propose to disentangle the inter-class objective from the intra-class objective by replacing the intra-class term $e^{s z_y}$ with a constant. We simply set this constant to 1; therefore, the inter-class component of the Dissected Softmax is

$$L_{inter} = \log\left(1 + \sum_{k \neq y} e^{s z_k}\right).$$
In this manner, the curve of the termination point against the intra-class similarity is a flat line, which means the regularization on the inter-class similarity is always strict.
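The contrast between the entangled and the dissected inter-class objective can be sketched numerically. The code below rests on an assumption: for one negative class $n$, with the remaining negative terms collected in $M_n$, the softmax loss $\log(1 + (e^{s z_n} + M_n)/e^{s z_y})$ is approximated by the line $s z_n - s z_y$ for large $z_n$ and by the constant $\log(1 + M_n / e^{s z_y})$ for small $z_n$, and the termination point is taken as their intersection. The function names and the numeric values are hypothetical.

```python
import math


def softmax_inter_termination(z_y, M_n, s):
    """Termination point of z_n under softmax; it depends on z_y."""
    return math.log(math.exp(s * z_y) + M_n) / s


def dsoftmax_inter_termination(M_n, s):
    """Termination point of z_n when e^{s z_y} is replaced by the constant 1."""
    return math.log1p(M_n) / s


if __name__ == "__main__":
    s, M_n = 32.0, math.exp(8.0)  # illustrative values, not from the paper
    for z_y in (0.0, 0.3, 0.6, 0.9):
        print(z_y,
              round(softmax_inter_termination(z_y, M_n, s), 3),
              round(dsoftmax_inter_termination(M_n, s), 3))
```

Under this approximation the softmax termination point rises with the intra-class similarity once $e^{s z_y}$ dominates $M_n$, while the dissected version stays constant, which is the flat line described above.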
3.4 D-Softmax and Its Light Variants
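Combining the intra-class component of Section 3.2 with the inter-class component of Section 3.3, the Dissected Softmax (D-Softmax) loss is

$$L_D = L_{intra} + L_{inter} = \log\left(1 + \frac{\epsilon}{e^{s z_y}}\right) + \log\left(1 + \sum_{k \neq y} e^{s z_k}\right).$$

D-Softmax introduces only one extra hyperparameter, $\epsilon$, and unlike the hyperparameters of other softmax variants [sphereface, cosface, arcface], it has a clear interpretation: it determines the optimization termination point of the intra-class similarity (Eq. 5).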
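A minimal single-sample sketch of the loss, assuming the form $\log(1 + \epsilon / e^{s z_y}) + \log(1 + \sum_{k \neq y} e^{s z_k})$ with $\epsilon = e^{s d}$. The function name and the default values of `d` and `s` are illustrative assumptions rather than settings reported in the paper.

```python
import math


def d_softmax_loss(z, y, d=0.9, s=32.0):
    """Dissected softmax for one sample.

    z: cosine activations for all classes (values in [-1, 1]);
    y: ground-truth class index;
    d: intra-class termination point (hypothetical default);
    s: scale parameter (hypothetical default).
    """
    epsilon = math.exp(s * d)  # follows from d = log(epsilon) / s
    intra = math.log1p(epsilon * math.exp(-s * z[y]))
    inter = math.log1p(sum(math.exp(s * z_k) for k, z_k in enumerate(z) if k != y))
    return intra + inter
```

Note that the intra-class term reads only `z[y]` and the inter-class term reads only the negative activations, so each part can be tuned or subsampled without affecting the other.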
The merits of Dissected Softmax are twofold. First, we learn from Conclusions #1 and #2 that, in vanilla softmax, the optimization of the intra- and inter-class objectives is entangled: minimizing the intra-class objective relaxes the regularization on the inter-class objective, and vice versa. By dissecting the intra- and inter-class objectives, the optimization is disentangled, so the constraints are always strict and the learned embedding is more discriminative.
Second, such disentangled formulation allows us to further reduce the computational complexity of the loss function, significantly boosting the training efficiency when the number of classes is tremendous.
When the number of classes is very large, the computation of softmax becomes the bottleneck of the training process, since all the class-wise activations need to be computed. This problem emerges in many applications, such as learning language models [hierarchy, langfreq1, langfreq3, langfreq2, langfreq4] and face embeddings [hfsoftmax] with large-scale data. Let us denote the batch size as $B$ and the number of classes as $K$; then the time complexity of computing the softmax loss is $O(BK)$. In D-Softmax, this complexity is dissected into $O(B)$ for $L_{intra}$ plus $O(B(K-1))$ for $L_{inter}$. When $K$ is large, the computation of $L_{inter}$ becomes the major time overhead. To accelerate the computation of the loss, let us first consider a question: for the inter-class objective, do we need to compute all the negative-class activations in a mini-batch?
In this work, our answer is no. The main reason lies in the sparsity of the class-weight distribution in high-dimensional space. To illustrate how sparse the class-weight distribution is, we randomly initialize the class weights and plot the distribution of their pairwise cosine similarities in Figure 4 (a). The pairwise cosine similarities follow a narrow Gaussian distribution with zero mean, which means the class weights are far apart from each other. For comparison, we also plot how this distribution changes after training with softmax in Figure 4 (b). Interestingly, the mean of the Gaussian distribution does not shift, and the variance increases only a little. This means that the inter-class objective is not pushing the class weights further from each other. Considering the above two points, we may reach the following conclusion (Conclusion #3): the function of the inter-class objective is only to maintain the sparsity of the class weights as a regularization.
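The sparsity of randomly initialized class weights can be reproduced with a small experiment. The sketch below draws random unit vectors and reports the mean and standard deviation of their pairwise cosine similarities. The number of classes, the dimension and the seed are hypothetical choices made to keep the example fast; they are not the settings behind Figure 4.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere via Gaussian sampling."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def pairwise_cosine_stats(num_classes, dim, seed=0):
    """Mean and standard deviation of cosine similarities over all class pairs."""
    rng = random.Random(seed)
    weights = [random_unit_vector(dim, rng) for _ in range(num_classes)]
    sims = []
    for i in range(num_classes):
        for j in range(i + 1, num_classes):
            sims.append(sum(a * b for a, b in zip(weights[i], weights[j])))
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((c - mean) ** 2 for c in sims) / len(sims))
    return mean, std
```

For random unit vectors the variance of the cosine similarity equals one over the dimension, so the distribution narrows around zero as the embedding dimension grows, and randomly placed class weights are already nearly orthogonal.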
Based on this conclusion, we speculate that the full computation of $L_{inter}$ may be redundant. To validate this speculation, we again train an identical model using $L_{intra}$ and a sampled $L_{inter}$. In each mini-batch we randomly sample a fraction of the classes as the negative classes, so the computation of $L_{inter}$ is faster. After training, we plot the distribution of pairwise cosine similarities between class weights in Figure 4 (d). As expected, the distribution is almost the same as that obtained by training with the full softmax. In Section 4 we will show that the performance degradation of the sampled loss compared to the full D-Softmax is also minor. We name this light variant D-Softmax-K because the negative classes are sampled from the $K$ classes. Formally, the mini-batch version of D-Softmax-K is

$$L_{D\text{-}K} = \sum_{i=1}^{B} \left[\log\left(1 + \frac{\epsilon}{e^{s z_{i,y_i}}}\right) + \log\left(1 + \sum_{w_k \in S_w,\, k \neq y_i} e^{s z_{i,k}}\right)\right],$$

where $z_{i,k}$ is the activation of the $i$-th batch sample on the $k$-th class, $y_i$ is its ground-truth class, and $S_w$ means a subset of the class-weight set $W$. The sampling rate $|S_w| / |W|$ remains a hyperparameter for the performance-speed trade-off.
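A minimal sketch of class-level sampling, assuming unit-norm features and class weights, a shared set of sampled negatives per mini-batch, and a loss summed over the batch. The function name `d_softmax_k_loss`, the `rate` argument and the defaults are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def d_softmax_k_loss(features, labels, weights, rate, d=0.9, s=32.0, rng=None):
    """D-Softmax-K sketch: sample negative classes once per mini-batch.

    features, weights: lists of unit-norm vectors; labels: class indices;
    rate: fraction of all classes drawn as shared negatives (hypothetical name).
    """
    rng = rng or random.Random(0)
    positives = set(labels)
    # Candidates are the classes that are negative for every batch sample.
    candidates = [k for k in range(len(weights)) if k not in positives]
    num_sampled = min(len(candidates), max(1, int(rate * len(weights))))
    sampled = rng.sample(candidates, num_sampled)
    epsilon = math.exp(s * d)
    total = 0.0
    for x, y in zip(features, labels):
        total += math.log1p(epsilon * math.exp(-s * dot(weights[y], x)))
        # Only the sampled class weights are touched for the inter-class term.
        total += math.log1p(sum(math.exp(s * dot(weights[k], x)) for k in sampled))
    return total  # divide by len(labels) if a mean over the batch is preferred
```

Since only `len(sampled)` class weights are read per mini-batch, the cost of the inter-class term scales with the sampling rate instead of the total number of classes.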
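An alternative strategy is to sample from the mini-batch samples rather than from the negative classes. We name this strategy D-Softmax-B:

$$L_{D\text{-}B} = \sum_{i=1}^{B} \log\left(1 + \frac{\epsilon}{e^{s z_{i,y_i}}}\right) + \sum_{x_i \in S_x} \log\left(1 + \sum_{k \neq y_i} e^{s z_{i,k}}\right),$$

where $z_{i,k}$ is the activation of the $i$-th batch sample $x_i$ on the $k$-th class, $y_i$ is its ground-truth class, and $S_x$ is a subset of the batch samples. We will demonstrate the strengths and weaknesses of each strategy in Section 4.3.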
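A minimal sketch of sample-level sampling, under the assumption that the intra-class term is kept for every batch sample while the inter-class term is computed over all negative classes only for a random subset of the batch. The function name, the `rate` argument and the defaults are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def d_softmax_b_loss(features, labels, weights, rate, d=0.9, s=32.0, rng=None):
    """D-Softmax-B sketch: sample batch items for the inter-class term.

    features, weights: lists of unit-norm vectors; labels: class indices;
    rate: fraction of batch samples used for the inter-class term.
    """
    rng = rng or random.Random(0)
    batch_size = len(features)
    num_sampled = max(1, int(rate * batch_size))
    sampled = set(rng.sample(range(batch_size), num_sampled))
    epsilon = math.exp(s * d)
    total = 0.0
    for i, (x, y) in enumerate(zip(features, labels)):
        # The intra-class term is cheap and is kept for every sample.
        total += math.log1p(epsilon * math.exp(-s * dot(weights[y], x)))
        if i in sampled:
            # All negative classes are used, but only for the sampled items.
            total += math.log1p(sum(math.exp(s * dot(w, x))
                                    for k, w in enumerate(weights) if k != y))
    return total
```

In contrast to class-level sampling, every class weight can still receive a gradient in each mini-batch here, at the price of needing the whole class-weight matrix to be available.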
Table 1. Columns: Loss | Verification accuracy on LFW [lfw], CFP [cfp] and AgeDB [agedb] | IJB-C [ijbc]: TAR@FAR
4 Experimental Results
4.1 Datasets and Evaluation Metrics
Evaluation. We validate the effectiveness of the proposed D-Softmax on the face verification task. The testing datasets include LFW [lfw], CFP-FP [cfp], AgeDB-30 [agedb] and IJB-C [ijbc]. LFW is a standard face verification benchmark that includes 6,000 pairs of faces, and the evaluation metric is the verification accuracy via 10-fold cross validation. CFP-FP and AgeDB-30 are similar to LFW but emphasize frontal-profile and cross-age face verification, respectively. IJB-C is a novel large-scale benchmark for template-based face recognition. A face template is composed of multiple still face images and/or video face tracks. The IJB-C dataset consists of 15,658,489 template pairs, and the evaluation metric is the true accept rate (TAR) at different false alarm rates (FAR). We simply average-pool all the features in a template to obtain the template feature.
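The template pooling and the TAR@FAR metric described above can be sketched as follows. This is an illustrative implementation under simple assumptions (plain averaging without re-normalization, and a threshold chosen from the ranked impostor scores); it is not the official IJB-C evaluation protocol, and both function names are hypothetical.

```python
def template_feature(features):
    """Average-pool the features of all images or frames in one template."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def tar_at_far(genuine_scores, impostor_scores, far):
    """True accept rate at the strictest threshold whose FAR is at most `far`.

    genuine_scores: similarities of same-identity template pairs;
    impostor_scores: similarities of different-identity template pairs.
    """
    ranked = sorted(impostor_scores, reverse=True)
    allowed = int(far * len(ranked))  # impostor pairs that may be accepted
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for score in genuine_scores if score > threshold)
    return accepted / len(genuine_scores)
```

A pair is accepted only if its score is strictly above the threshold, so at most `allowed` impostor pairs pass, which keeps the realized false alarm rate at or below the requested one.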
Training. We adopt the MS-Celeb-1M [ms1m] dataset for training the face embedding. Since the original MS-Celeb-1M dataset contains wrong annotations, we adopt a cleaned version that is also used in ArcFace (https://github.com/deepinsight/insightface). The cleaned MS-Celeb-1M consists of around 5.8M images of 85K identities. Moreover, to validate the effectiveness and efficiency of the proposed D-Softmax-B and D-Softmax-K on training with massive data, we combine MS-Celeb-1M with the MegaFace2 [megaface2] dataset to obtain a large training set. The MegaFace2 dataset consists of 4.7M images of 672K identities, so the joint training set has 9.5M images of 757K identities in total.
4.2 Experiments on D-Softmax
In this section, we explore how to set the intra-class termination point for the best performance, and how different formulations of the inter-class objective affect the discrimination of the learned embedding. Finally, we compare D-Softmax with other state-of-the-art loss functions. All the models are trained on the MS-Celeb-1M dataset with the same ResNet-50 [resnet] model and the same hyperparameters. The only difference is the loss function.
Selection of $d$. By tuning the hyperparameter $\epsilon$ in $L_{intra}$, we are able to set the optimization termination point $d$ for the intra-class similarity (Eq. 5). Table 1 shows the performance of D-Softmax with different settings of $d$. As $d$ increases, the performance first increases steadily. However, when we further increase $d$ to the point where, as the optimization terminates, all the samples in a class are supposed to be mapped to a single point, the performance drops slightly. This shows that a more rigorous intra-class objective does not always result in better performance. A moderately large $d$ leads to the best results, so we use this setting in all the following experiments.
Table 2. Columns: Loss | Sampling rate | Verification accuracy on LFW [lfw], CFP [cfp] and AgeDB [agedb] | IJB-C [ijbc]: TAR@FAR
Different forms of inter-class objective. Apart from the simple form of inter-class objective proposed in Section 3.3, we also compare several other forms. The first is the inter-class objective of the softmax loss, which is entangled with the intra-class objective. To realize such an inter-class objective, in the forward pass we compute the full softmax loss, while in the backward pass we back-propagate only the inter-class part of the gradients, by setting the gradient with respect to the ground-truth class activation to zero. We then combine this softmax inter-class objective with the intra-class part of D-Softmax to train a model. Table 1 compares its performance with that of the dissected inter-class objective. With the same intra-class objective, the dissected inter-class objective outperforms the softmax inter-class objective by a large margin, which demonstrates the merit of the dissected form.
The second form is the inter-class objective of ArcFace [arcface]. It is also entangled with the intra-class objective, so we apply the same gradient-blocking trick as before. With the same intra-class objective, the ArcFace inter-class objective and the dissected inter-class objective lead to almost the same good performance. This result is as expected. Recall from Section 3.3 and Figure 3 that, though entangled with the intra-class objective, the regularization on the inter-class similarity in ArcFace remains rigorous until the intra-class similarity is fairly large, and such rigorous regularization guarantees the sparsity of the class weights. Similarly, the proposed dissected form of the inter-class objective is always rigorous regardless of the intra-class similarity, so it achieves performance as good as that of the ArcFace inter-class objective in a more concise way.
Comparison with state-of-the-art losses. For a fair comparison, we re-implement NormFace [normface], SphereFace [sphereface] and ArcFace [arcface] and compare the proposed D-Softmax with them using the same training data and model. As shown in Table 1, the proposed D-Softmax outperforms the softmax (NormFace) baseline even with a small termination point $d$, and with a larger $d$ it outperforms softmax by a significant margin. SphereFace and ArcFace also outperform the softmax baseline because of the introduced angular margin. In contrast, D-Softmax does not explicitly require such a margin between classes. Instead, we introduce the optimization termination point to reach the same goal as adding a margin, in a more concise way. As a result, the proposed D-Softmax presents performance comparable to that of ArcFace.
4.3 Experiments on Light D-Softmax
In Section 3.4 we proposed two sampling-based variants of D-Softmax, i.e., D-Softmax-B and D-Softmax-K, for reducing the computational complexity of training with massive classes. In this section, we explore the strengths and weaknesses of each sampling strategy.
D-Softmax-B. D-Softmax-B is the easiest sampling method to implement for reducing the complexity of the inter-class objective. In practice, one only needs to sample from the batch samples and compute all the negative-class activations for the sampled batch samples. This sampling strategy is simple, but it performs well. To illustrate the effectiveness of D-Softmax-B, we train several ResNet-50 [resnet] models with D-Softmax-B as the objective at various sampling rates. As shown in Table 2, the performance drops as the sampling rate decreases. At first the performance drops slowly, and at moderate sampling rates the accuracy drop is nearly negligible compared to the non-sampled version. At lower sampling rates the performance seems to drop faster. However, even at the extreme sampling rate where only one batch sample is used for computing the inter-class objective, the performance of D-Softmax-B is still acceptable, only slightly lower than the fully computed version in LFW accuracy. These results in turn strongly support Conclusion #3 of Section 3.4, that the function of the inter-class objective is only to maintain the sparsity of the class weights as a regularization, and thus the full computation is redundant.
In summary, the advantages of D-Softmax-B are its simplicity of implementation and its minimal sacrifice of performance. However, this sampling strategy faces a dilemma in practice: besides the time complexity of computing the loss, the memory limit of the GPU also matters in large-scale training. The computation of D-Softmax-B requires the whole class-weight matrix to be copied to GPU memory, which makes parallelism difficult.
D-Softmax-K. For each mini-batch, D-Softmax-K first samples candidate negative classes from the intersection of the negative-class sets of all batch samples, and then the batch inter-class objective is computed with simple data parallelism. To tackle the problem of the GPU memory limit, inspired by [hfsoftmax], we adopt a parameter server to store all the class weights in a large-capacity memory (e.g., CPU RAM). When some classes are sampled in a mini-batch, the weights of these classes are retrieved from the parameter server and then cached on the client's GPU. In this manner the GPU memory limit is mitigated, and the implementation is not overly complicated.
Table 3. Columns: Loss | Loss avg. time (s) | Total avg. time (s) | Verification accuracy on LFW [lfw], CFP [cfp] and AgeDB [agedb] | IJB-C [ijbc]: TAR@FAR
However, compared with D-Softmax-B at the same sampling rate (see the gray rows in Table 2), the performance of D-Softmax-K is slightly inferior. A possible interpretation is that in D-Softmax-B all the class weights are updated in every mini-batch, so the class weights are more up-to-date in each iteration. This suggests that sampling from the batch samples can achieve better performance. Nevertheless, considering that the difference in performance is minor while D-Softmax-K is much easier to parallelize, we suggest using D-Softmax-K in large-scale training.
Comparison with other sampling-based methods. To demonstrate the benefits of D-Softmax, we also compare it with some existing sampling-based methods. The first is random softmax (Rand-Softmax), in which the class weights to be computed for each mini-batch are randomly sampled. At the same sampling rate, both D-Softmax-B and D-Softmax-K outperform Rand-Softmax by a significant margin in LFW accuracy (Table 2). The second is random ArcFace (Rand-ArcFace), which is similar to Rand-Softmax but uses ArcFace as the loss function. The performance of Rand-ArcFace is superior to that of Rand-Softmax as expected, but it still underperforms D-Softmax-B and D-Softmax-K. These results strongly support the merit of the dissected form of D-Softmax.
Another sampling-based method that needs to be compared is HF-Softmax, proposed in [hfsoftmax]. We adopt the code released by the authors and train HF-Softmax on the same dataset for a fair comparison. As in Rand-Softmax and Rand-ArcFace, the intra- and inter-class objectives in HF-Softmax are entangled, and it also samples from the negative classes to reduce the computational cost. The difference is that the sampling is not random: a hash forest is built to partition the weight space and find approximate-nearest-neighbor (ANN) class weights for the batch samples. Table 2 shows the performance of HF-Softmax. As expected, it outperforms Rand-Softmax in LFW accuracy, since the negative class weights are sampled from the "hard negatives", which are more valuable for optimization. But compared with D-Softmax, the performance of HF-Softmax is inferior. The reason is that the intra-class objective of HF-Softmax is still entangled with the inter-class objective. Though hard negative class weights are mined, only the inter-class regularization is improved, and the intra-class constraint is still not strict enough.
Large-scale experiments. To validate the acceleration achieved by training with the proposed D-Softmax-K, we perform a large-scale experiment on the joint set of the MS-Celeb-1M [ms1m] and MegaFace2 [megaface2] datasets. The performance and average time cost of some baseline methods are listed in Table 3. The same sampling rate is used for all losses. HF-Softmax [hfsoftmax] and D-Softmax outperform Rand-Softmax at the same sampling rate in terms of accuracy, yet only D-Softmax outperforms the full softmax loss. Because HF-Softmax samples from the entangled form of the softmax loss, its performance upper bound is expected to be comparable to that of the full softmax. In contrast, the Dissected Softmax is able to exceed softmax because its objective is more rigorous.
In terms of time cost, the full softmax is clearly the slowest in average time spent on the loss layer for one forward-backward pass, while Rand-Softmax is the fastest (Table 3). HF-Softmax is supposed to be efficient because only a small fraction of the weights needs to be computed, but the update of the random forest costs too much time per iteration on average, while the computation of the loss itself takes less. This time cost can be decreased by switching to a faster ANN algorithm or by updating the random forest less frequently, but the performance would also decrease as a result. In contrast, the proposed D-Softmax-K provides a good performance-speed trade-off. Training with D-Softmax-K is as fast as with Rand-Softmax, since we do not need to build and update a random forest.
Note that the results of the large-scale experiments seem to be inferior to those of training with MS-Celeb-1M alone. This is because the MegaFace2 dataset is rather noisy. If the models were trained with a cleaned large-scale dataset, the performance would be expected to be better.
5 Conclusion
In this paper, we propose to dissect the softmax loss into independent intra- and inter-class objectives. By doing so, the optimization of the two objectives is no longer entangled, and as a consequence it is more straightforward to tune the objectives to be consistently rigorous throughout training. The proposed D-Softmax shows good performance on the face recognition task. By sampling the inter-class similarity, it can easily be extended to fast variants (D-Softmax-B and D-Softmax-K) that can handle massive-scale training. We show that the fast variants of D-Softmax significantly accelerate the training process, while the performance drop is quite small.