Softmax-based losses have achieved state-of-the-art performance on various tasks such as face recognition and re-identification. These methods can be classified into Euclidean margin based losses [Liu2016, Ranjan2017, Wang2017] and cosine margin (angular margin) based losses [Liu2017, Wang2018, Deng2018, meng2021magface]. Compared to pairwise metric learning methods such as triplet loss [schroff2015facenet] and N-pair loss [sohn2016improved], softmax-based losses have the following advantages: (a) With all negative classes in sight, models trained with softmax-based losses converge more quickly in most cases. (b) They are not affected by sampling strategies and therefore have more stable training processes. (c) More discriminative features can be learned, and state-of-the-art performance has been observed on various recognition benchmarks. These characteristics are of great value in many real-world applications where training databases are large-scale and high efficiency is needed.
Despite their great success, softmax-based losses suffer from large GPU memory consumption and demand clean datasets with global labels. Thanks to the development of parallel acceleration, memory consumption can be greatly reduced, which enables training on million-level classes on a single machine [Deng2018]. However, a clean dataset with global labels is not always available. Instead, in many real-world applications, we may have multiple datasets captured from various temporal/spatial scenarios. Directly merging these datasets, especially large-scale ones, is not realistic because noisy labels can be introduced and exponentially increasing resources are required. In this case, we call a training dataset a basket and define the problem as Multiple Baskets with Class Overlaps (MBCO), where a class is unique within a basket but can appear multiple times across baskets, as illustrated in Fig. 1. A practically significant but usually ignored question arises: how can softmax losses be trained simultaneously on multiple baskets with overlapping classes? We note that this problem differs from the well-studied noisy label problem [wu2018light, zhong2019unequal, hu2019noise, wang2019co, deng2020sub], which focuses on mislabeled samples inside a class but ignores the noise of one class being assigned different labels. To emphasize the importance of this problem, we list a few real-world examples below:
In surveillance scenarios, merging faces in videos captured within a small time window is relatively easy, as cues such as tracking or re-id results can be relied on. However, merging faces across days is almost impossible because of the vast volume of data and the lack of other cues.
Merging datasets is extremely difficult when large variations are present. For example, face images are highly affected by image acquisition conditions (e.g., illumination, background, blurriness, and low resolution) and factors of the face (e.g., pose, occlusion and expression), which degrades the quality of human-assessed and similarity-based labels.
In some situations, clear labelling rules are lacking. For instance, it is still unclear whether images of the same identity wearing different clothes should be grouped together for the person re-identification task.
For scenarios with privacy concerns (e.g., collecting faces without appropriate consent can breach the law), pipelines that automatically collect data, train and deploy models are preferred. Automatically generating small and clean datasets is relatively easy. However, without manual intervention, noise accumulates as the data volume increases. Consequently, models trained on such noisy datasets suffer performance degradation. In this case, models are better trained on multiple clean baskets than on one large but noisy dataset.
To enable training on multiple baskets with softmax-based losses, we propose a framework called Basket-based Softmax (BBS), which works in an end-to-end fashion and can be applied to most softmax-based losses. Specifically, we use the similarity scores defined in softmax losses as the cue for mining negative classes from other baskets, and dynamically add them to guide the learning of more discriminative features. To better handle the large-scale data found in practical applications, we further extend BBS to a parallel version. Extensive experiments are conducted to verify the superiority and efficiency of our proposed method. We summarize our contributions as follows:
We raise the importance of the Multiple Baskets with Class Overlaps (MBCO) problem, which is usually ignored by the academic community but occurs frequently in many real-world applications. Datasets collected from various temporal and spatial scenarios can be of tremendous size, and labels are locally assigned in each basket. Directly assigning global classes can introduce noisy labels and requires exponentially increasing resources.
An efficient, end-to-end mining-during-training framework called BBS is proposed, where similarity scores are adopted as the cue to dynamically mine negative classes across baskets. Our proposed BBS can be applied to the majority of softmax-based losses and trained on multiple baskets simultaneously. In addition, we introduce a parallel version that enables training on million-level classes.
Experimentally, we adapt the commonly used Softmax, CosFace [Wang2018] and ArcFace [Deng2018] to our BBS framework. Extensive experiments on face recognition and person re-identification, with both simulated and real-world training datasets, verify the superiority of BBS.
2 Related Works
|Method||Softmax||L-Softmax [Liu2016]||L2-softmax [Ranjan2017]||NormFace [Wang2017]||SphereFace [Liu2017]||CosFace [Wang2018]||ArcFace [Deng2018]|
2.1 Deep Face Recognition
Recent years have witnessed the breakthrough of deep convolutional face recognition techniques [Liu2017, Wang2018, Deng2018, xu2021searching, meng2021poseface, meng2021magface, meng2021lce]. Most early works rely on metric-learning based losses, including contrastive loss [chopra2005learning], triplet loss [schroff2015facenet], n-pair loss [sohn2016improved], angular loss [wang2017deep], etc. Suffering from the combinatorial explosion in the number of face triplets, embedding-based methods are usually inefficient when training on large-scale datasets. Therefore, the main body of research in deep face recognition has focused on devising more efficient and effective softmax-based losses. Wen et al. [wen2016discriminative] develop a center loss that learns a center for each identity to enhance intra-class compactness. L2-softmax [Ranjan2017] and NormFace [Wang2017] study the necessity of the normalization operation and apply normalization constraints on both features and weights.
Since then, several angular margin-based losses have been proposed and have progressively improved the performance on various benchmarks. SphereFace [Liu2017] introduces an angular margin into the softmax loss and learns discriminative features. To overcome the optimization difficulty of SphereFace, CosFace [Wang2018] moves the angular margin into cosine space. In ArcFace [Deng2018], the decision boundary is directly maximized in angular (arc) space based on the normalized weights and features, achieving state-of-the-art performance on current benchmarks.
2.2 Person Re-identification
Person re-identification (re-id) aims to retrieve the images of a target person from a gallery set across different cameras. The widely used losses for CNN-based re-id methods are of two types: classification loss [Sun2018Beyond] and metric learning loss [Hermans2017In]. Classification loss views re-id in the training stage as a classification task and uses softmax with cross entropy loss to optimize the model. Sun et al. [Sun_2020_CVPR] demonstrate that other softmax-based losses that are widely used in face recognition, e.g., AM-Softmax [Wang_2018_amsoftmax] and CircleLoss [Sun_2020_CVPR], are also suitable for re-id. The most well-known metric learning loss for re-id is triplet loss with hard sample mining [Hermans2017In]. For each anchor in a mini-batch, triplet loss with hard sample mining selects only its hardest positive sample and its hardest negative sample to form a triplet for optimization. To let all samples participate in optimization, Ristani et al. [Ristani_2018_CVPR] improve it by proposing an adaptive weighted triplet loss and a new technique for hard-identity mining.
2.3 Noisy Labels
Learning with noisy labels has recently drawn much attention, as ambiguous and inaccurate labels can exist in most datasets. Wu et al. [wu2018light] propose a semantic bootstrapping method that re-labels noisy samples using predictions. Zhong et al. [zhong2019unequal] learn discriminative face representations supervised by a noise-resistant loss and cope with the long-tail issue through a hard identity mining strategy. Hu et al. [hu2019noise] discover that the distribution of training samples implicitly reflects their probability of being clean, and propose a noise-tolerant end-to-end paradigm that weights training samples accordingly. Co-Mining [wang2019co] trains twin networks simultaneously, detects noisy labels based on loss values, exchanges high-confidence clean faces and re-weights the predicted clean faces. To improve the robustness of ArcFace to label noise, Sub-center ArcFace [deng2020sub] relaxes the intra-class constraint by designing K sub-centers for each class, so that a training sample only needs to be close to any one of the K positive sub-centers instead of a single positive center. These methods mainly deal with the purity issue, i.e., one class may contain multiple identities. In contrast, our BBS focuses on training on multiple baskets with class overlaps.
Our goal is to learn discriminative features from multiple baskets with softmax-based losses, wherein labels inside a basket are clean while class overlaps can exist between baskets. To achieve this goal, we propose a novel mining-during-training strategy called Basket-based Softmax (BBS) to effectively train models on multiple baskets in an end-to-end fashion. We first provide a unified perspective for softmax-based losses in Sec. 3.1 and present the details of our BBS in Sec. 3.2.
In real-world applications, dealing with large-scale datasets is another non-negligible problem. To address this, we extend the original BBS to a parallel version in Sec. 3.3, which enables BBS to support million-level classes on a single machine.
3.1 A Unified Perspective of Softmax-based loss
Assume the training dataset contains a given number of classes and the feature embedding has a fixed dimension. Softmax-based losses build a fully connected layer with a weight matrix and biases, where each weight vector in the matrix corresponds to a class center. For a given sample, consider its class label and its embedding. The softmax-based losses for this sample can be unified into one formula as follows:
In different methods, the two functions in this formula are defined in various forms, as shown in Tab. 1. The first function measures the similarity between a feature and a class center. Softmax and L-Softmax [Liu2016] use the Euclidean distance, while NormFace [Wang2017], CosFace [Wang2018] and ArcFace [Deng2018] use the cosine similarity. L2-softmax [Ranjan2017] and SphereFace [Liu2017] modify the similarities to eliminate the effects of the magnitudes of features or class centers. The second function is either the same as the first, or revised to further decrease the intra-class distance and increase the inter-class distance (e.g., an additive angular margin is introduced in ArcFace [Deng2018]).
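To make the unified view concrete, the following is a minimal sketch of a softmax-based loss parameterized by two functions, one for the positive class and one for the negative classes. The function names, the cosine-based negative logit, and the ArcFace-style positive logit are illustrative assumptions rather than the exact notation of Tab. 1; the default scale of 16 and margin of 0.1 are borrowed from the re-id setting reported later in this paper.

```python
import math


def cosine(w, x):
    """Cosine similarity between a class center w and an embedding x."""
    dot = sum(a * b for a, b in zip(w, x))
    norm = math.sqrt(sum(a * a for a in w)) * math.sqrt(sum(b * b for b in x))
    return dot / norm


def cosine_logit(w, x, scale=16.0):
    """Plain scaled cosine logit (hypothetical name), used for negative classes."""
    return scale * cosine(w, x)


def arcface_positive_logit(w, x, scale=16.0, margin=0.1):
    """ArcFace-style positive logit: additive margin on the angle.

    Simplified: no special handling when angle + margin exceeds pi.
    """
    cos = max(-1.0, min(1.0, cosine(w, x)))
    return scale * math.cos(math.acos(cos) + margin)


def unified_softmax_loss(x, label, centers, positive_logit, negative_logit):
    """Cross entropy where the positive and negative logits may differ."""
    pos = positive_logit(centers[label], x)
    negs = [negative_logit(w, x) for k, w in enumerate(centers) if k != label]
    terms = [pos] + negs
    top = max(terms)  # subtract the maximum for numerical stability
    log_denominator = top + math.log(sum(math.exp(t - top) for t in terms))
    return log_denominator - pos


centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [0.6, 0.8]
plain = unified_softmax_loss(x, 0, centers, cosine_logit, cosine_logit)
arc = unified_softmax_loss(x, 0, centers, arcface_positive_logit, cosine_logit)
```

Passing the same function for both arguments gives a normalized softmax, while swapping in the margin-based positive logit yields a strictly harder objective for the same sample, which is the role the second function plays in the unified formula.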
3.2 Basket-based Softmax
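Assume there are several training baskets, each containing its own number of classes. The local class ids in each basket form a contiguous range, and we sequentially concatenate the labels from all baskets to obtain the network labels, as shown in Fig. 2.
Denote by the offset of a basket the total number of classes in all preceding baskets. The total number of network ids is then the sum of the class counts over all baskets. For a sample with a given basket id and local id, its network id is the offset of its basket plus its local id in our setting. For ease of expression, we define the following functions: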
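As an illustration of the label concatenation, the sketch below maps a (basket id, local id) pair to a network id and back. It assumes 0-based local ids and uses hypothetical helper names; these helpers are not the functions defined in the paper, only one way such a mapping could be written.

```python
from itertools import accumulate


def basket_offsets(classes_per_basket):
    """Offset of each basket: number of classes in all preceding baskets."""
    return [0] + list(accumulate(classes_per_basket))[:-1]


def to_network_id(basket_id, local_id, classes_per_basket):
    """Map a local class id inside a basket to its concatenated network id."""
    if not 0 <= local_id < classes_per_basket[basket_id]:
        raise ValueError("local_id out of range for this basket")
    return basket_offsets(classes_per_basket)[basket_id] + local_id


def to_basket_and_local_id(network_id, classes_per_basket):
    """Inverse mapping: recover (basket id, local id) from a network id."""
    if network_id < 0:
        raise ValueError("network_id must be non-negative")
    remaining = network_id
    for basket_id, size in enumerate(classes_per_basket):
        if remaining < size:
            return basket_id, remaining
        remaining -= size
    raise ValueError("network_id exceeds the total number of classes")


sizes = [3, 2, 4]  # three baskets with 3, 2 and 4 local classes
network_id = to_network_id(1, 1, sizes)
```

With these sizes, the nine network ids 0 to 8 cover basket 0 (ids 0 to 2), basket 1 (ids 3 and 4) and basket 2 (ids 5 to 8).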
Then the loss for basket-based softmax is formulated as follows:
Here, the indicator specifies whether a class in another basket is added as a negative class for the current sample. When the indicator is 0 for all classes outside the sample's own basket, BBS degenerates to the multi-task approach (i.e., each basket has an individual loss). In contrast, setting all indicators to 1 has the same effect as training on the concatenated data, without considering the overlapping classes between baskets.
A similarity function is used as the metric between an embedding and a class center. For each sample from a given basket, we select negative classes from every other basket based on these similarities. More specifically, the similarities between the current embedding and the class centers of the other basket are calculated, and the least similar class centers are picked as negative classes. We summarize the training scheme in Algorithm 1.
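The selection step can be sketched as follows. This is a simplified, single-sample illustration under stated assumptions, not a reproduction of Algorithm 1: all names are hypothetical, every other class in the sample's own basket is treated as a negative (labels inside a basket are clean), and the number of negatives taken from each other basket is passed in directly.

```python
import math


def select_negative_mask(similarities, basket_of_class, sample_basket,
                         label, num_negatives):
    """Return a 0/1 indicator per network id for one sample.

    similarities[k]    similarity between the embedding and class center k
    basket_of_class[k] basket id that network id k belongs to
    num_negatives[b]   how many least-similar classes to take from basket b
    """
    mask = [0] * len(similarities)
    classes_in_basket = {}
    for k, basket in enumerate(basket_of_class):
        classes_in_basket.setdefault(basket, []).append(k)
    for basket, class_ids in classes_in_basket.items():
        if basket == sample_basket:
            # Own basket: every class except the label is a trusted negative.
            for k in class_ids:
                mask[k] = 0 if k == label else 1
        else:
            # Other basket: keep only the least similar class centers.
            ranked = sorted(class_ids, key=lambda k: similarities[k])
            for k in ranked[:num_negatives[basket]]:
                mask[k] = 1
    return mask


def masked_softmax_loss(pos_logit, logits, mask):
    """Cross entropy over the positive logit and the selected negatives only."""
    terms = [pos_logit] + [v for v, m in zip(logits, mask) if m]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms)) - pos_logit


sims = [0.9, 0.1, 0.85, 0.2, -0.3]   # class 2 likely overlaps with class 0
baskets = [0, 0, 1, 1, 1]
mask = select_negative_mask(sims, baskets, sample_basket=0, label=0,
                            num_negatives={1: 2})
logits = [16.0 * s for s in sims]
loss_bbs = masked_softmax_loss(logits[0], logits, mask)
loss_concat = masked_softmax_loss(logits[0], logits, [0, 1, 1, 1, 1])
```

In this toy case the highly similar class 2 from the other basket is excluded from the negatives, so the sample is not pushed away from what is probably its own identity; treating all classes as negatives (the concatenation setting) produces a noticeably larger loss for the same sample.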
3.2.1 Dynamic Negative Class Mining
A hyperparameter of our BBS is the number of negative class centers selected from each of the other baskets. In the ideal situation, it can be set to all but one of the classes in that basket, as the current sample belongs to at most one class in that basket. However, this hyperparameter relies heavily on the quality of the estimated similarities. If the model is not discriminative enough, a smaller value should be used. To this end, we design a dynamic mining strategy, as shown in Algorithm 2.
We define the number of ignored classes in a basket as the number of most similar classes that cannot be treated as negative classes with high confidence. We set a minimum ignored number and an ignored ratio. The number of ignored classes is dynamically adjusted based on the discriminative ability of the model, and its value decreases monotonically during training until it reaches its final value.
3.3 Parallel Basket-based Softmax
For the parallel version of BBS, we first distribute the class centers across the GPUs, with each GPU holding its own subset of centers. For a sample from a given basket with a given network id, we have the following parallel BBS:
For classes belonging to the sample's own basket, we set the indicator to 0 for the sample's own class and to 1 otherwise. The remaining question is how to decide the indicator values for classes outside that basket.
Because the class centers are distributed, the centers belonging to one basket may be spread over multiple GPUs. Consequently, a single GPU has no access to all similarities unless the scores are gathered together. To avoid extra GPU memory consumption, we propose Algorithm 3 to approximate the procedure in lines 6-13 of Algorithm 1. The key idea of the parallel version is to treat each truncated basket, i.e., the part of a basket held by one GPU, as a new basket and to calculate the number of negative classes based on these new baskets.
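The truncated-basket idea can be illustrated with a small sketch. It is not Algorithm 3: the contiguous, near-equal partition of network ids over GPUs, the helper names, and the rule of ignoring a fixed number of classes per truncated basket are all assumptions made for illustration.

```python
def split_classes(total_classes, num_gpus):
    """Contiguous, near-equal ranges [start, end) of network ids per GPU."""
    base, extra = divmod(total_classes, num_gpus)
    ranges, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def truncated_baskets(classes_per_basket, num_gpus):
    """For each GPU, list the (basket id, start, end) pieces it holds."""
    basket_ranges, start = [], 0
    for size in classes_per_basket:
        basket_ranges.append((start, start + size))
        start += size
    per_gpu = []
    for gpu_start, gpu_end in split_classes(start, num_gpus):
        pieces = []
        for basket, (b_start, b_end) in enumerate(basket_ranges):
            lo, hi = max(gpu_start, b_start), min(gpu_end, b_end)
            if lo < hi:  # this GPU holds part of this basket
                pieces.append((basket, lo, hi))
        per_gpu.append(pieces)
    return per_gpu


def local_negative_counts(pieces, sample_basket, num_ignored):
    """Negatives to mine from each truncated basket, computed on one GPU only."""
    return {
        (basket, lo, hi): max(0, (hi - lo) - num_ignored)
        for basket, lo, hi in pieces
        if basket != sample_basket
    }


layout = truncated_baskets([5, 3, 4], num_gpus=2)
counts_gpu1 = local_negative_counts(layout[1], sample_basket=0, num_ignored=1)
```

Here basket 1 is cut into two pieces, one on each GPU, and each piece is mined independently. This avoids gathering similarities across devices at the cost of being an approximation, since classes are ignored per piece rather than per full basket.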
We evaluate our BBS on two important tasks in the computer vision community: face recognition (section 4.1) and person re-identification (section 4.2). For face recognition, we simulate the MBCO problem on a commonly used database called MS1MV2 [Deng2018]. For person re-identification, images captured by the same camera on the same day are gathered into a basket. Given the lack of previous work on this problem, we choose two baselines: the first baseline is trained on the concatenated baskets without merging classes, while the second baseline is the multi-task approach, with each basket equipped with its own softmax loss. To further verify the capability of handling large-scale data, we study resource consumption when extending BBS to the parallel version (section 4.3).
4.1 Face Recognition
To simulate the MBCO problem, we adopt Algorithm 4 to split a dataset into baskets, where a class appears in a given number of baskets with a corresponding probability. We adapt the state-of-the-art face recognition loss ArcFace [Deng2018] to our BBS framework and conduct experiments on our simulated baskets. We mainly consider two important factors of the MBCO problem: the ratio of class overlaps between baskets (section 4.1.2) and the number of baskets involved (section 4.1.3).
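A possible way to simulate such baskets is sketched below. It is not Algorithm 4 itself: the function name, the round-robin assignment of a class's images to its chosen baskets, and the example probabilities are assumptions for illustration only.

```python
import random


def split_into_baskets(images_by_class, num_baskets, appearance_probs, seed=0):
    """Split a labelled dataset into baskets with overlapping classes.

    appearance_probs[i] is the (unnormalized) probability that a class
    appears in i + 1 baskets. Returned baskets keep the global class id as
    key so the simulation can be inspected; training would only see labels
    local to each basket.
    """
    rng = random.Random(seed)
    basket_counts = list(range(1, len(appearance_probs) + 1))
    baskets = [dict() for _ in range(num_baskets)]
    for class_id, images in images_by_class.items():
        k = rng.choices(basket_counts, weights=appearance_probs)[0]
        k = min(k, num_baskets, len(images))  # every chosen basket gets an image
        chosen = rng.sample(range(num_baskets), k)
        shuffled = list(images)
        rng.shuffle(shuffled)
        for i, image in enumerate(shuffled):
            baskets[chosen[i % k]].setdefault(class_id, []).append(image)
    return baskets


data = {c: [f"img_{c}_{i}" for i in range(6)] for c in range(20)}
baskets = split_into_baskets(data, num_baskets=3, appearance_probs=[0.6, 0.3, 0.1])
```

Each image lands in exactly one basket, so the baskets partition the data while the same identity can receive different local labels in different baskets, which is the situation the MBCO problem describes.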
Datasets. The MS-Celeb-1M dataset [guo2016ms] contains about 100k identities with 10 million images. However, it contains a great many noisy face images. We employ MS1MV2 [Deng2018] (5.8M images, 85k unique identities) as our training dataset. For evaluation, we adopt LFW [huang2008labeled], CFP-FP [sengupta2016frontal], AgeDB-30 [moschoglou2017agedb], IJB-B [whitelam2017iarpa] and IJB-C [maze2018iarpa]. All images are aligned to a fixed resolution based on 5 facial landmarks, following ArcFace [Deng2018].
Simulated Datasets. In section 4.1.2, we split the full MS1MV2 dataset into 2 parts with different ratios of class overlaps, from 10% to 100%. In section 4.1.3, MS1MV2 is split into 10 parts, with the appearance probabilities chosen in fixed ratios (e.g., the probability that a class appears in one number of baskets is 3 times the probability that it appears in another). In the end, the average ratio of overlaps between two baskets is around 10%. Experiments are conducted with 2 to 10 baskets to examine the impact of the number of baskets.
We train the models on 8 1080Ti GPUs with the stochastic gradient descent (SGD) algorithm. The learning rate is initialized to 0.1 and divided by 10 at epochs 5, 10 and 15, and training finishes at the 20th epoch. The weight decay is set to 0.0005 and the momentum to 0.9. The only augmentation applied to the training samples is random horizontal flipping. Because of the vast number of identities, the parallel BBS is adopted in the face recognition experiments.
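For reference, the step schedule described above can be written as a small function. The function name and the choice of 0-based epoch indexing are assumptions; the base rate, milestones and decay factor are the values stated in the text.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(5, 10, 15), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone epoch."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


schedule = [learning_rate(epoch) for epoch in range(20)]
```

Under this indexing the rate is 0.1 for epochs 0 to 4, 0.01 for epochs 5 to 9, 0.001 for epochs 10 to 14 and 0.0001 for the remaining five epochs.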
4.1.2 Ratio of Class Overlaps between Baskets
|Split MS1MV2||Method||LFW||CFP-FP||AgeDB||IJB-B TAR@FAR=1e-5||IJB-B TAR@FAR=1e-4||IJB-B TAR@FAR=1e-3||IJB-C TAR@FAR=1e-5||IJB-C TAR@FAR=1e-4||IJB-C TAR@FAR=1e-3|
|2 baskets, overlap 10%||baseline1||99.63||93.48||97.10||78.96||90.04||94.03||86.10||92.20||95.45|
|2 baskets, overlap 20%||baseline1||99.58||93.14||96.75||83.33||90.29||94.01||88.77||92.65||97.41|
|2 baskets, overlap 30%||baseline1||99.60||92.64||96.81||82.24||90.19||93.69||87.86||92.48||95.14|
|2 baskets, overlap 40%||baseline1||99.58||93.20||96.80||81.89||89.66||93.66||87.77||92.17||95.13|
|2 baskets, overlap 50%||baseline1||99.63||92.35||96.36||80.98||88.94||93.12||87.07||91.44||94.73|
|2 baskets, overlap 60%||baseline1||99.53||92.18||96.08||81.11||88.49||92.67||86.42||90.78||94.32|
|2 baskets, overlap 70%||baseline1||99.57||92.45||95.60||79.15||87.40||92.44||84.54||90.01||94.16|
|2 baskets, overlap 80%||baseline1||99.50||91.48||94.46||73.68||85.29||91.24||81.99||88.41||93.22|
|2 baskets, overlap 90%||baseline1||99.50||90.40||93.81||73.38||84.32||90.72||80.09||87.62||92.92|
|2 baskets, overlap 100%||baseline1||99.03||89.95||90.63||17.18||37.57||75.00||23.85||46.31||83.21|
Results with different ratios of class overlaps are listed in Tab. 2. BBS surpasses the best results of the two baselines on LFW, CFP-FP and AgeDB, as well as on IJB-B and IJB-C at TAR@FAR=1e-5, 1e-4 and 1e-3. As the ratio of overlaps increases, BBS maintains stable, high performance while the performance of the two baselines drops rapidly.
We further visualize the performance trend on IJB-C at TAR@FAR=1e-4 in Fig. 3. With an increasing ratio of class overlaps, the performance of baseline1 drops rapidly as the label error increases. The labels used by baseline2 are clean, as each basket is processed individually. However, with more images separated into different baskets, each class contains fewer images, which leads to performance degradation. In contrast, our BBS achieves stable and best results on IJB-C at TAR@FAR=1e-4, which shows the superiority of the proposed method over the baselines.
4.1.3 Number of Baskets
Tab. 3 shows the results with different numbers of baskets. The overall trend is that the improvements become more significant as more baskets are used. When all baskets are used, BBS improves over the best baseline results on LFW, CFP-FP and AgeDB, and also surpasses the baselines on all TAR criteria (FAR=1e-5, 1e-4 and 1e-3) on both IJB-B and IJB-C. We also train the model on the original MS1MV2 and report the results in the last row of Tab. 3. Even when trained on the split datasets, our BBS obtains results comparable to those on the full clean data, which demonstrates the practical value and efficiency of the proposed method.
|Datasets||Method||LFW||CFP-FP||AgeDB||IJB-B TAR@FAR=1e-5||IJB-B TAR@FAR=1e-4||IJB-B TAR@FAR=1e-3||IJB-C TAR@FAR=1e-5||IJB-C TAR@FAR=1e-4||IJB-C TAR@FAR=1e-3|
4.2 Person Re-identification
In this section, we verify the generalization of our proposed BBS on person re-identification. Specifically, three losses, namely Softmax with cross entropy loss, CosFace [Wang2018] loss and ArcFace [Deng2018] loss, are each used as the training loss.
Dataset. The Market-1501 dataset [Zheng2015Scalable] contains 1501 identities and 32217 images captured by 6 cameras. Following [Zheng2015Scalable], 751 identities are reserved for training and the remaining 750 identities are used for testing. All these data are captured in six time periods. To verify the effectiveness of the proposed BBS, we divide the training set of Market-1501 into six baskets according to capture time and name the result the Market-Basket dataset. In the testing stage, we use the original testing set of Market-1501 as the testing set of Market-Basket for evaluation.
Implementation Details. Following [Hou2019Interaction], we use ResNet-50 [He2016Deep] as the backbone. The model is trained for 60 epochs in total with the Adam [Kingma2014Adam] optimizer. The initial learning rate is multiplied by 0.1 after every 20 epochs. The batch size is set to 64, and each batch consists of 16 persons with 4 images per person. All images are resized to a fixed resolution. We also use horizontal flip, random crop, and random erasing [zhong2020random] for data augmentation. For the CosFace and ArcFace losses, the scaling factor is set to 16 and the margin to 0.1 by grid search. For BBS, the two hyperparameters of dynamic negative class mining are set to (2, 2).
Cumulative Matching Characteristics (CMC) and mean Average Precision (mAP) are used as the evaluation metrics.
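As a reminder of how these two metrics are computed, the sketch below evaluates CMC and mAP from per-query ranked match lists. It is a simplified illustration with hypothetical function names: it assumes the gallery has already been sorted by similarity for each query and does not implement protocol details such as removing same-camera gallery images.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches[i] is True if rank i + 1 is correct."""
    hits, precision_sum = 0, 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def cmc_and_map(all_ranked_matches, max_rank=5):
    """CMC curve up to max_rank and mAP over queries with at least one match."""
    cmc = [0.0] * max_rank
    aps = []
    for ranked in all_ranked_matches:
        if not any(ranked):
            continue  # queries without a true match are skipped
        first_hit = list(ranked).index(True)
        for r in range(first_hit, max_rank):
            cmc[r] += 1  # counted as a hit at this rank and every later rank
        aps.append(average_precision(ranked))
    if not aps:
        raise ValueError("no query has a matching gallery image")
    return [c / len(aps) for c in cmc], sum(aps) / len(aps)


queries = [[True, False, True], [False, True, False]]
cmc, mean_ap = cmc_and_map(queries, max_rank=3)
```

The first entry of the CMC curve is the top-1 accuracy reported in re-id results, while mAP additionally rewards ranking all correct gallery images early.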
4.2.2 Generalization for Different Losses
As shown in Table 4, BBS surpasses baseline1 and baseline2 significantly and consistently for all three losses. Specifically, for CosFace, BBS improves top-1 accuracy by about 2% and mAP by about 3% over baseline1, which does not mine any cross-basket negative class. This comparison demonstrates the effectiveness of dynamic negative class mining in BBS and the generalization of BBS to different softmax-based losses.
4.3 Parallel Acceleration
We simulate training with different numbers of classes on 8 GPUs (1080Ti with 11GB memory) to test the training speed of BBS and parallel BBS, and visualize the results in Fig. 4. The original BBS can train fewer than 200k classes, with a throughput below 500 images per second. In contrast, the parallel BBS is able to handle datasets with up to 1M classes, and its throughput can be 3 times that of the original BBS. The throughput remains above 1000 images per second even when training with 1M classes.
In this paper, we raise the importance of the Multiple Baskets with Class Overlaps (MBCO) problem which is usually ignored by the academic community but can occur frequently in real-world applications, and propose an end-to-end mining-during-training framework called Basket-based Softmax (BBS) to enable the training on multiple baskets. Extensive experiments are conducted on face recognition and person re-identification, and have verified the superiority, efficiency as well as the generalization of our proposed method.