"Semi-Siamese Training for Shallow Face Learning"
Most existing public face datasets, such as MS-Celeb-1M and VGGFace2, provide abundant information in both breadth (a large number of IDs) and depth (a sufficient number of samples per ID) for training. However, in many real-world face recognition scenarios, the training dataset is limited in depth, i.e. only two face images are available for each ID. We define this situation as Shallow Face Learning, and find it problematic for existing training methods. Unlike deep face data, shallow face data lacks intra-class diversity. As a result, it can lead to the collapse of feature dimensions, and the learned network can easily suffer from degeneration and over-fitting in the collapsed dimensions. In this paper, we address the problem by introducing a novel training method named Semi-Siamese Training (SST). A pair of Semi-Siamese networks constitutes the forward propagation structure, and the training loss is computed with an updating gallery queue, which enables effective optimization on shallow training data. Our method is developed without extra dependencies, and thus can be flexibly integrated with existing loss functions and network architectures. Extensive experiments on various face recognition benchmarks show that the proposed method significantly improves the training, not only in shallow face learning but also on conventional deep face data.
Face Recognition (FR) has made remarkable advances and has been widely applied in the last few years. This can be attributed to three aspects: convolutional neural networks (CNNs) [26, 15, 31, 16], loss functions [29, 28, 37, 23, 44, 36] and large-scale training datasets [40, 12, 18, 1]. In recent years, the commonly used public training datasets, such as CASIA-WebFace, MS-Celeb-1M and VGGFace2, provide abundant information in not only breadth (a large number of IDs) but also depth (dozens of face images for each ID). In this paper, we call this type of dataset deep face data. Unfortunately, such deep face data is not available in many real-world scenarios. Usually, the training encounters the problem of “shallow face data”, in which only two face images are available for each ID (generally a registration photo and a spot photo, the so-called “gallery” and “probe”). As a result, the data lacks intra-class diversity, which prevents the network from effective optimization and leads to the collapse of feature dimensions. In such a situation, we find that the existing training methods suffer from either model degeneration or over-fitting.
In this paper, we regard the training on shallow face data as a particular task, named Shallow Face Learning (SFL). SFL is similar to the existing problem of Low-shot Learning (LSL) in face recognition, but there are two significant differences. First, LSL performs close-set recognition [11, 38, 3, 34], while SFL includes the open-set recognition task, in which the test IDs are excluded from the training IDs. Second, LSL requires pretraining in the source domain (with deep data) before finetuning to the target domain [47, 3, 41]. However, pretraining is not always a good choice for the practical development of face recognition, for the following reasons: (1) the network architecture is fixed once the pretraining is done, so it is inconvenient to change the architecture in the finetuning; (2) deploying new architectures requires restarting from the pretraining, which is often time-consuming; (3) there is a domain gap between the pretraining data and the finetuning data, so the finetuning still suffers from the shallow data problem. Therefore, SFL argues for training directly from scratch on shallow face data.
In brief, the objective of Shallow Face Learning is effective training from scratch on shallow face data for open-set face recognition. We revisit the current methods and study how they suffer from the shallow data problem. In recent years, most of the prevailing deep face recognition methods [21, 33, 32, 7, 20] have been developed from classification learning by softmax or its variants. They are built on a fully connected (FC) layer, the softmax function and the cross-entropy loss. The weights of the FC layer can be regarded as prototypes which represent the center of each class. The learning objective is to maximize the prediction probability on the ground-truth class. This routine shows great capability and efficiency in learning discrimination on deep data. However, since shallow data leads to an extreme lack of intra-class information, as shown in Section 3.1, we find that this kind of training method suffers from either model degeneration or over-fitting.
Another major routine in face recognition is the embedding learning methods [6, 13, 28, 24, 27], which can learn face representations without the classification layer. For example, Contrastive loss and Triplet loss calculate the pair-wise Euclidean distance and optimize the model over the sample relations. Generally, embedding learning performs better than classification learning when data becomes shallow. The potential reason is that embedding learning employs feature comparison between samples, instead of classifying them into specific classes whose prototypes include a large number of parameters.
However, the performance and efficiency of the embedding learning routine depend on the number of sample pairs matched batch-wise, which is limited by the GPU memory and the hard sampling strategy. In this paper, we aim to draw on the advantage of embedding learning to achieve successful classification learning on shallow data. If we address the issues of model degeneration and over-fitting, the training can greatly benefit from the capability and efficiency of classification learning. A straightforward solution is the plain combination of the two routines, which employs sample features as the prototypes to initialize the FC weights, and runs classification learning with them. A similar modification of softmax has been suggested by previous methods. Specifically, for each ID of the shallow data, one photo is employed as the initial prototype, and the other photo is employed as the training sample. However, such prototype initialization still brings limited improvement when training on shallow data (e.g. DP-softmax in Section 4.3). To explain this result, we assume that the prototype becomes too similar to its intra-class training sample, which leads to an extremely small gradient and impedes the optimization.
To overcome this issue, we propose to improve the training method from the perspective of enlarging intra-class diversity. Taking Contrastive or Triplet loss as an example, the features are extracted by the backbone. The backbone can be regarded as a pair (or a triplet) of Siamese networks, since the parameters are fully shared between the networks. We find that the crucial technique for the solution is to enforce the backbone to be Semi-Siamese, which means the two networks have close (but not identical) parameters. For each ID in the training, one of the networks extracts the feature from the gallery as the prototype, and the other network extracts the feature from the probe as the training sample. The intra-class diversity between the features is guaranteed by the difference between the networks. There are many ways to constrain the two networks to have a slight difference. For example, one can add a network constraint between their parameters during SGD (stochastic gradient descent) updating; or one can use SGD updating for one network and moving-average updating for the other (like the momentum update proposed by MoCo). We conduct extensive experiments and find that all of them contribute to shallow face learning effectively. Furthermore, we combine the Semi-Siamese backbone with an updating feature-based prototype queue (i.e. the gallery queue), and achieve significant improvement on shallow face learning. We name this training scheme Semi-Siamese Training, which can be integrated with any existing loss function and network architecture. As shown in Section 4.3, whatever the loss function, a large improvement can be obtained by using the proposed method for shallow face learning.
Moreover, we conduct two extra experiments to demonstrate the advantages of SST in a wider range of settings. (1) Although SST is proposed for the shallow data problem, an experiment on conventional deep data shows that leading performance can still be obtained by using SST. (2) Another experiment, which verifies the effectiveness of SST in a real-world scenario with the pretrain-finetune setting, also shows that SST outperforms conventional training.
In summary, the paper includes the following contributions:
We formally describe a critical problem of face recognition, i.e. Shallow Face Learning, from which the training of face recognition suffers severely. This problem exists in many real-world scenarios but has been overlooked before.
We study the Shallow Face Learning problem with thorough experiments, and find that the lack of intra-class diversity impedes the optimization and leads to the collapse of the feature space. In such a situation, the model suffers from degeneration and over-fitting in the training.
We propose the Semi-Siamese Training (SST) method to address the issues in Shallow Face Learning. SST can be flexibly combined with existing loss functions and network architectures.
We conduct comprehensive experiments to show the significant improvement brought by SST on Shallow Face Learning. In addition, the extra experiments show that SST also prevails on both conventional deep data and the pretrain-finetune task.
There are two major schemes in deep face recognition. On one hand, the classification-based methods are developed from the softmax loss and its variants. SphereFace introduces the angular margin to enlarge gaps between classes. CosFace and AM-softmax propose an additive margin on the positive logit. ArcFace employs an additive angular margin inside the cosine and gives a clearer geometric interpretation. On the other hand, the feature embedding methods, such as Contrastive loss [6, 13, 28] and Triplet loss, calculate the pair-wise Euclidean distance and optimize the network over the relations between sample pairs or triplets. N-pairs loss optimizes positive and negative pairs following a local softmax formulation in each mini-batch. Beyond the two schemes, Zhu et al. propose a classification-verification-classification training strategy and the DP-softmax loss to progressively enhance the performance on the ID versus spot face recognition task.
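The two additive margins mentioned above differ only in where the margin enters the positive logit. The following is a minimal sketch of that difference, not any paper's reference implementation: the function names, the scale `s=30.0` and the margin values are illustrative assumptions, and the angular version ignores the special handling that practical implementations apply when the shifted angle exceeds pi.

```python
import math

def cosface_logit(cos_theta, s=30.0, margin=0.35):
    """Additive cosine margin (CosFace / AM-softmax style): s * (cos(theta) - m).

    s and margin are illustrative values, not taken from the text.
    """
    return s * (cos_theta - margin)

def arcface_logit(cos_theta, s=30.0, margin=0.5):
    """Additive angular margin (ArcFace style): s * cos(theta + m).

    The margin is added to the angle itself, inside the cosine.
    """
    # Clamp before acos to guard against values slightly outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return s * math.cos(theta + margin)

# Both margins lower the positive logit relative to the plain s * cos(theta).
print(cosface_logit(0.5), arcface_logit(0.5), 30.0 * 0.5)
```

In both cases only the logit of the ground-truth class is modified; the logits of the other classes stay at `s * cos(theta)`.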
Low-shot Learning (LSL) in face recognition aims at close-set ID recognition from few face samples. Choe et al. use data augmentation and generation methods to enlarge the training dataset. Cheng et al. propose an enforced softmax that contains optimal dropout, selective attenuation, normalization and model-level optimization. Wu et al. develop hybrid classifiers by using a CNN and a nearest neighbor model. Guo et al. propose to align the norms of the weight vectors of the one-shot classes and the normal classes. Yin et al. augment the feature space of low-shot classes by transferring the principal components from normal to low-shot classes. The above methods focus on the MS-Celeb-1M Low-shot Learning benchmark, which has relatively sufficient samples for each ID in a base set and only one sample for each ID in a novel set, and the target is to recognize faces from both the base and the novel set. However, as discussed in the previous section, Shallow Face Learning differs from LSL in two aspects. First, the LSL methods aim at close-set classification; for example, in the MS-Celeb-1M Low-shot Learning benchmark, the test IDs are included in the training set. Shallow Face Learning, in contrast, includes open-set recognition, where the test samples belong to unseen classes. Second, unlike LSL, which generally employs transfer learning from a source dataset (pretraining) to the target low-shot dataset (finetuning), Shallow Face Learning argues for training from scratch on the target shallow dataset.
The recent self-supervised methods [8, 39, 48, 14] have achieved exciting progress in visual representation learning. Exemplar CNN introduces the surrogate class concept for the first time, and adopts a parametric paradigm during training and test. Memory Bank formulates instance-level discrimination as a metric learning problem, where the similarity between instances is calculated from the features in a non-parametric way. MoCo proposes a dynamic dictionary with a queue and a momentum-updating encoder, which can build a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. These methods regard each training sample as an instance-level class. Although they employ data augmentation for each sample, the instance-level classes still lack intra-class diversity, which is similar to the Shallow Face Learning problem. Inspired by the effectiveness of the self-supervised learning methods, we tackle the issues in Shallow Face Learning with similar techniques, such as the moving-average updating for the Semi-Siamese backbone, and the prototype queue for the supervised loss. Nonetheless, SST is quite different from the self-supervised methods. For example, the gallery queue of SST is built from the gallery samples rather than by sample augmentation, and SST aims to deal with Shallow Face Learning, which is a specific task in supervised learning. From the perspective of learning against the lack of intra-class diversity, our method generalizes the advantages of the self-supervised scheme to the supervised scheme on shallow data.
Shallow face learning is a practical problem in real-world face recognition scenarios. For example, in authentication applications, the face data usually contains a registration photo (gallery) and a spot photo (probe) for each ID. The number of IDs could be large, but the shallow depth leads to an extreme lack of intra-class information. Here, we study how the current classification-based methods suffer from this problem, and what consequences the shallow data brings.
Most of the currently prevailing methods are developed from softmax or its variants, which include an FC layer, the softmax function, and the cross-entropy loss. The output of the FC layer is the inner product $w_j^T x_i$ of the $i$-th sample feature $x_i$ and the $j$-th class weight $w_j$. When the feature and the weight are normalized by their $L_2$ norm, the inner product equals the cosine similarity, $w_j^T x_i = \cos(\theta_{i,j})$. Without loss of generality, we take the conventional softmax as an example, and the loss function (omitting the bias term) can be formulated as

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{i,y})}}{e^{s \cos(\theta_{i,y})} + \sum_{j=1, j \neq y}^{n} e^{s \cos(\theta_{i,j})}}, \quad (1)$$

where $N$ is the batch size, $n$ is the class number, $s$ is the scaling parameter, and $y$ is the ground truth label of the $i$-th sample. The learning objective is to maximize the intra-class pair similarity $w_y^T x_i$ and minimize the inter-class pair similarity $w_j^T x_i$ ($j \neq y$), so that features are compact within a class and separated between classes. The term inside the logarithm is the prediction probability on the ground truth class, $P_y$, which can be written as $P_y = e^{s w_y^T x_i} / \sum_{j=1}^{n} e^{s w_j^T x_i}$. This equation implies that the optimal solution of the prototype $w_y$ satisfies two conditions,

$$\text{(i)} \;\; w_y = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \text{(ii)} \;\; w_j^T x_i = 0 \;\; \text{for } j \neq y, \quad (2)$$

where $m$ is the sample number in this class. Condition (i) means that, ideally, the optimal prototype is the class center, which equals the average of the features in this class. Meanwhile, Condition (ii) exposes the prototype to the risk of collapsing to zero in many dimensions. When $m$ is large enough (deep data), the $x_i$'s have large diversity, which keeps the prototype away from collapse. In shallow data ($m = 2$), however, the prototype $w_y$ is determined by only two samples in a class, i.e. the gallery $x_g$ and the probe $x_p$. As a result, the three vectors $w_y$, $x_g$ and $x_p$ will rapidly become very close ($w_y \approx x_g \approx x_p$), and this class will achieve a very small loss value. Since the network is trained batch-wise by SGD, in every iteration the network is well fitted on a small number of classes and badly fitted on the other classes, so the total loss value oscillates and the training is harmed (as shown by the dotted curves in Fig. 5). Moreover, since all the classes gradually lose intra-class diversity in the feature space, the prototype $w_y$ is pushed to zero in most dimensions by Condition (ii), and is unable to span a discriminative feature space.
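To make the behaviour of the normalized softmax loss concrete, here is a minimal pure-Python sketch. The helper names, the toy two-dimensional vectors and the scale `s=30.0` are illustrative assumptions rather than the authors' settings. It shows that the loss of a sample is nearly zero once its feature coincides with its class prototype, which is the situation the text describes for shallow data.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def cosine_softmax_loss(features, prototypes, labels, s=30.0):
    """Mean cross-entropy over scaled cosine logits (bias term omitted).

    s is an illustrative scale value, not taken from the text.
    """
    total = 0.0
    for x, y in zip(features, labels):
        x = l2_normalize(x)
        logits = [s * sum(a * b for a, b in zip(l2_normalize(w), x))
                  for w in prototypes]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[y]
    return total / len(features)

# Two classes with orthogonal prototypes; each sample sits exactly on a prototype.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
samples = [[1.0, 0.0], [0.0, 1.0]]
aligned = cosine_softmax_loss(samples, prototypes, [0, 1])  # correct labels
swapped = cosine_softmax_loss(samples, prototypes, [1, 0])  # wrong labels
print(aligned, swapped)
```

With the correct labels the loss is vanishingly small, so such a class contributes almost no gradient; with the labels swapped the loss is close to `s`.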
To explore the consequences of the shallow data problem, we conduct an experiment on both deep data and shallow data with the loss functions of softmax, A-softmax, AM-softmax and Arc-softmax. The deep data is MS1M-v1c (a cleaned version of MS-Celeb-1M). The shallow data is a subset of MS1M-v1c, with two face images selected randomly per ID from the deep data. Table 1 shows not only the test accuracy on LFW but also the accuracy on the training data. We find that softmax and A-softmax obtain lower performance in both training and test when the training data changes from deep to shallow, while AM-softmax and Arc-softmax obtain higher accuracy in training but lower accuracy in test. Therefore, we argue that softmax and A-softmax suffer from the model degeneration issue, while AM-softmax and Arc-softmax suffer from the over-fitting issue. To further support this argument, we inspect the value of each entry in the prototype $w_y$, and compute the distribution with a Parzen window. The distribution is displayed in Fig. 2, where the horizontal axis represents the entry values and the vertical axis represents the density. We find that most entries of the prototypes degrade to zero, which means the feature space collapses in most dimensions. In such a reduced-dimension space, the models could easily degenerate or over-fit.
From the above analysis, we can see that, when the data becomes shallow, the current methods are damaged by the model degeneration and over-fitting issues, and the essential reason lies in the feature space collapse. To cope with this problem, there are two directions for us to proceed: (1) to make the prototype $w_y$ and the features $x_i$ update correctly, and (2) to keep the entries of $w_y$ away from zero.
In the first direction, the major issue is that the network is prevented from effective optimization. We revisit Condition (i) in Eqn. 2 for Shallow Face Learning, in which only two face images are available for each ID. We denote them by $I_g$ (gallery) and $I_p$ (probe), and their features by $x_g = \phi(I_g)$ and $x_p = \phi(I_p)$, where $\phi$ is the Siamese backbone. According to Condition (i), $w_y = \frac{1}{2}(x_g + x_p)$. Due to the lack of intra-class diversity, the gallery and probe often have close features, and thus $w_y \approx x_g \approx x_p$. As studied in the previous subsection, this situation leads to loss value oscillation, preventing the network from effective optimization. The basic idea to deal with the problem is to keep $x_g$ at some distance from $x_p$, i.e. $x_g \neq x_p$. To maintain the distance between $x_g$ and $x_p$, we propose to make the Siamese backbone Semi-Siamese. Specifically, a gallery-set network $\phi_g$ takes the gallery as input, and a probe-set network $\phi_p$ takes the probe as input. $\phi_g$ and $\phi_p$ have the same architecture but non-identical parameters, $\phi_g \neq \phi_p$, so the features are prevented from being attracted to each other. There are several ways to implement the Semi-Siamese networks. For example, one can add a network constraint to the training loss that penalizes the difference between the parameters of $\phi_g$ and $\phi_p$, where a non-negative parameter $\beta$ balances the network constraint against the training loss. Another choice, as suggested by MoCo, is to update the gallery-set network in the momentum way,

$$\phi_g \leftarrow \alpha \, \phi_g + (1 - \alpha) \, \phi_p, \quad (3)$$

where $\alpha$ is the weight of the moving average, and the probe-set network is updated with SGD w.r.t. the training loss. Both $\beta$ and $\alpha$ instantiate the small difference between the two networks, which keeps $\phi_g$ and $\phi_p$ similar. We compare different implementations of the Semi-Siamese networks, and find that the moving-average style gives significant improvement in the experiments. Owing to the maintained intra-class diversity, the training loss decreases steadily without oscillation (solid curves in Fig. 5).
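The moving-average variant can be illustrated with a small sketch in which network parameters are plain lists of floats. The names `momentum_update`, `sgd_step` and `alpha`, as well as the toy gradients and the weight 0.9, are assumptions made for illustration; a real implementation would apply the same rule to every parameter tensor of the two backbones.

```python
def momentum_update(gallery_params, probe_params, alpha=0.999):
    """Moving-average update: gallery <- alpha * gallery + (1 - alpha) * probe."""
    return [alpha * g + (1.0 - alpha) * p
            for g, p in zip(gallery_params, probe_params)]

def sgd_step(params, grads, lr=0.05):
    """Plain SGD step, applied to the probe-set network only."""
    return [p - lr * g for p, g in zip(params, grads)]

# Both networks start from identical parameters.
probe = [0.0, 0.0]
gallery = list(probe)
for _ in range(3):
    # The probe-set network follows the (toy, constant) gradient ...
    probe = sgd_step(probe, grads=[-1.0, 2.0])
    # ... while the gallery-set network trails it through the moving average.
    gallery = momentum_update(gallery, probe, alpha=0.9)

print(probe, gallery)
```

Because the gallery-set parameters only trail the probe-set parameters, the two networks stay close without ever becoming identical during training, which is the property the Semi-Siamese design relies on.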
In the second direction, a straightforward idea is to add a prototype constraint, governed by two hyperparameters, to the training loss in order to enlarge the entries of the prototype. However, we find that this technique enlarges the entries in most dimensions indiscriminately (the green distribution in Fig. 2), and results in a performance decrease (Table 2). Instead of manipulating $w_y$, we argue for replacing $w_y$ with the gallery feature $x_g$ as the prototype. Thus, the prototype depends entirely on the output of the backbone, avoiding the zero issue of the parameters (entries) of $w_y$. The red distribution in Fig. 2 shows that the feature-based prototype avoids the collapse issue while keeping more discriminative components compared with the prototype constraint. Removing $w_y$ also alleviates the over-fitting risk of heavy parameters. The entire prototype set is updated by maintaining a gallery queue. Certain self-learning methods [39, 14] have studied this technique and its further advantages, such as better generalization when encountering unseen test IDs.
In summary, our Semi-Siamese Training method is developed to address the Shallow Face Learning problem along the two directions. The forward propagation backbone is constituted by a pair of Semi-Siamese networks, each of which is in charge of feature encoding for gallery and probe, respectively; the training loss is computed with an updating gallery queue, so the networks are optimized effectively on the shallow data. This training scheme can be integrated with any form of existing loss function (no matter classification loss or embedding loss) and network architectures (Fig. 3).
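One way to picture the gallery queue is a fixed-size first-in-first-out buffer of gallery features that stand in for the FC prototypes. The sketch below is a MoCo-style reading of the description above, under stated assumptions: the class and function names are hypothetical, features are assumed to be already L2-normalized, the sample's own gallery feature is used as the positive prototype, and the queued features of other IDs act as negatives.

```python
import math
from collections import deque

def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(u * v for u, v in zip(a, b))

class GalleryQueue:
    """Fixed-size FIFO of gallery features used as class prototypes."""

    def __init__(self, max_size):
        # With maxlen set, appending beyond capacity drops the oldest entries.
        self.items = deque(maxlen=max_size)

    def enqueue(self, features):
        self.items.extend(features)

def queue_softmax_loss(probe_feat, gallery_feat, queue, s=30.0):
    """Cross-entropy with the own gallery feature as the positive prototype
    and the queued gallery features as negatives (s is illustrative)."""
    logits = [s * dot(probe_feat, gallery_feat)]
    logits += [s * dot(probe_feat, neg) for neg in queue.items]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[0]

queue = GalleryQueue(max_size=2)
queue.enqueue([[0.0, 1.0], [-1.0, 0.0]])          # gallery features of earlier IDs
easy = queue_softmax_loss([1.0, 0.0], [1.0, 0.0], queue)
queue.enqueue([[0.0, -1.0]])                      # oldest entry is dropped
print(easy, list(queue.items))
```

No per-class weight vector is stored in this scheme, so the number of prototype parameters no longer grows with the number of training IDs.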
This section is structured as follows. Section 4.1 introduces the datasets and experimental settings. Section 4.2 presents the ablation study on SST. Section 4.3 demonstrates the significant improvement brought by SST on Shallow Face Learning with various loss functions. Section 4.4 shows the convergence of SST with various backbones. Section 4.5 shows that SST can also achieve leading performance on deep face data. Section 4.6 shows that SST also outperforms conventional training on the pretrain-finetune task.
Training Data. To ensure reproducibility (the source code of SST is available at https://github.com/dituu/Semi-Siamese-Training), we employ public datasets for training. To construct the shallow data, two images are randomly selected for each ID from the MS1M-v1c dataset. Thus, the shallow data includes 72,778 IDs and 145,556 images. For the deep data, we use the full MS1M-v1c, which has 44 images per ID on average. In addition, we utilize a real-world surveillance face recognition benchmark, QMUL-SurvFace, for the pretrain-finetune experiment.
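The construction of the shallow data can be sketched as follows. The function name, the toy dictionary, the fixed seed, and the choice to skip IDs with fewer than two images are assumptions for illustration; the released code may organize this step differently.

```python
import random

def make_shallow_subset(images_by_id, per_id=2, seed=0):
    """Randomly keep `per_id` images for every ID of a deep dataset.

    IDs with fewer than `per_id` images are skipped (an assumption of
    this sketch, not a statement about the original data).
    """
    rng = random.Random(seed)
    shallow = {}
    for identity, images in images_by_id.items():
        if len(images) >= per_id:
            shallow[identity] = rng.sample(images, per_id)
    return shallow

# Toy stand-in for a deep dataset: ID -> list of image paths.
deep = {
    "id_a": ["a1.jpg", "a2.jpg", "a3.jpg"],
    "id_b": ["b1.jpg", "b2.jpg"],
    "id_c": ["c1.jpg"],
}
shallow = make_shallow_subset(deep)
print(shallow)
```

Two images per ID over 72,778 IDs gives 2 x 72,778 = 145,556 images, which matches the size reported above.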
Test Data. For a thorough evaluation, we adopt the LFW, BLUFR, AgeDB-30, CFP-FP, CALFW, CPLFW, MegaFace and QMUL-SurvFace datasets. AgeDB-30 and CALFW focus on face verification with a large age gap. CFP-FP and CPLFW target face verification across pose variations. BLUFR is dedicated to evaluation at low false accept rates (FAR), and we report the verification rate at the lowest FAR (1e-5) on BLUFR. MegaFace evaluates the performance of large-scale face recognition with millions of distractors. The QMUL-SurvFace test set targets real-world surveillance face recognition and has a large domain gap compared to the above benchmarks.
CNN Architecture. To balance the performance and the time cost, we use the MobileFaceNet  in the ablation study and the experiments with various loss functions. Besides, we employ Attention-56  in the deep data and pretrain-finetune experiments. The output is a 512-dimension feature. In addition, we also employ extra backbones including VGG-16 , SE-ResNet-18 , ResNet-50 and -101  to prove the convergence of SST with various architectures.
Training and Evaluation. Four NVIDIA Tesla P40 GPUs are employed for training. The batch size is 256 and the learning rate begins with 0.05. In the shallow data experiments, the learning rate is divided by 10 at the 36k, 54k iterations and the training process is finished at 64k iterations. For the deep data, we divide the learning rate at the 72k, 96k, 112k iterations and finish at 120k iterations. For pretrain-finetune experiments, the learning rate starts from 0.001 and is divided by 10 at the 6k, 9k iterations and finished at 10k iterations. The size of the gallery queue depends on the number of classes in training datasets, so we empirically set it as 16,384 for shallow and deep data, and 2,560 for QMUL-SurvFace. In the evaluation stage, we extract the last layer output from the probe-set network as the face representation. The cosine similarity is utilized as the similarity metric. For strict and precise evaluation, all the overlapping IDs between training and test datasets are removed according to the list .
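The step schedules described above can be written as a single function of the iteration index. This is a minimal sketch: the function name is hypothetical, and applying the drop exactly at the milestone iteration (rather than one step later) is an assumption.

```python
def step_lr(iteration, base_lr, milestones, gamma=0.1):
    """Learning rate after dividing by 10 at every milestone already reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * (gamma ** drops)

# Milestones and base learning rates as stated in the text.
SHALLOW = dict(base_lr=0.05, milestones=[36_000, 54_000])            # stops at 64k
DEEP = dict(base_lr=0.05, milestones=[72_000, 96_000, 112_000])      # stops at 120k
FINETUNE = dict(base_lr=0.001, milestones=[6_000, 9_000])            # stops at 10k

for it in (0, 36_000, 54_000):
    print(it, step_lr(it, **SHALLOW))
```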
Loss Function. SST can be flexibly integrated with the existing training loss functions. Both classification and embedding learning loss functions are considered as baselines, and compared with their integration with SST. The classification loss functions include A-softmax, AM-softmax, Arc-softmax, AdaCos, MV-softmax, DP-softmax and Center loss. The embedding learning methods include Contrastive, Triplet and N-pairs.
We analyze each technique in SST, and compare them with the other choices mentioned in the previous section, such as the network constraint and the prototype constraint. Table 2 compares their performance with four basic loss functions (softmax, A-softmax, AM-softmax and Arc-softmax). In this table, “Org.” denotes the plain training, “A” denotes the prototype constraint, “B” denotes the network constraint, “C” denotes the gallery queue, “D” denotes the combination of “B” and “C”, and “SST” denotes the ultimate scheme of Semi-Siamese Training, which includes the moving-average updating Semi-Siamese networks and the training scheme with the gallery queue. From Table 2, we can conclude: (1) the naive prototype constraint “A” leads to a decrease in most terms, which means that enlarging the prototype in every dimension indiscriminately does not help Shallow Face Learning; (2) the network constraint “B” and the gallery queue “C” result in a progressive increase, and their combination “D” obtains further improvement; (3) finally, SST employs moving-average updating and the gallery queue, and achieves the best results in all terms. The comparison indicates that SST addresses the problem in Shallow Face Learning well, and obtains significant improvements in test accuracy.
First, we train the network on the shallow data with various loss functions and test it on BLUFR at FAR=1e-5 (the blue bars in Fig. 4). The loss functions include classification and embedding ones such as softmax, A-softmax, AM-softmax, Arc-softmax, AdaCos, MV-softmax, DP-softmax, Center loss, Contrastive, Triplet and N-pairs. Then, we train the same network with the same loss functions on the shallow data, but with SST scheme. As shown in Fig. 4, SST can be flexibly integrated with every loss function, and obtains large increase for Shallow Face Learning (the orange bars). Moreover, we employ hard example mining strategies when training on MV-softmax and embedding losses. The results prove SST can also work well with the hard example mining strategies.
To demonstrate the stable convergence in the training, we employ SST to train different CNN architectures, including MobileFaceNet, VGG-16, SE-ResNet-18, Attention-56, ResNet-50 and -101. As shown in Fig. 5, the loss curves of conventional training (the dotted curves) suffer from oscillation, whereas every loss curve of SST (the solid curves) decreases steadily, indicating the convergence of each network when trained with SST. In addition, the numbers in the legend of Fig. 5 indicate the test result of each network on BLUFR. For conventional training, the test accuracy decreases with deeper network architectures, showing that a larger model size exacerbates the model degeneration and over-fitting. In contrast, as the network becomes heavier, the test accuracy of SST increases, showing that SST makes an increasing contribution with more complicated architectures.
The previous experiments show that SST tackles the problems in Shallow Face Learning well and obtains significant improvement in test accuracy. To further explore the advantage of SST for wider application, we adopt the SST scheme on the deep data (the full version of MS1M-v1c), and compare it with conventional training. Table 3 shows the performance on LFW, AgeDB-30, CFP-FP, CALFW, CPLFW, BLUFR and MegaFace. SST achieves the leading accuracy on most of the test sets, and competitive results on CALFW and BLUFR. SST (softmax) achieves at least one percent improvement on AgeDB-30, CFP-FP, CALFW and CPLFW, which include the hard cases of large face pose or large age gap. Notably, SST reduces the large amount of FC parameters with which the classification loss is computed in conventional training. One can refer to the supplementary material for more results on deep data.
In real-world face recognition, there is a large domain gap between the public training datasets and the captured face images. The public training datasets, such as MS-Celeb-1M and VGGFace2, are well-posed face images collected from internet. But the real-world applications are usually quite different. To cope with this issue, the typical routine is to pretrain a network on the public training datasets and fine-tune it on real-world face data. Although SST is dedicated to the training from scratch on shallow data, we are still interested in employing SST to deal with the challenge in finetuning task. So, we conduct an extra experiment with pretraining on MS1M-v1c and finetuning on QMUL-SurvFace in this subsection. The network is first pretrained with softmax on MS1M-v1c. We randomly select two samples for each ID from the QMUL-SurvFace to construct the shallow data. The network is then finetuned on QMUL-SurvFace shallow data with/without SST. The evaluation is performed on the QMUL-SurvFace test set. From Table 4, we can find that, no matter for classification learning or embedding learning, SST boosts the performance significantly in both verification and identification, compared with the conventional training.
In this paper, we first study a critical problem in real-world face recognition, i.e. Shallow Face Learning, which has been overlooked before. We analyze how the existing training methods suffer from Shallow Face Learning. The core issues consist in the training difficulty and feature space collapse, which leads to the model degeneration and over-fitting. Then, we propose a novel training method, namely Semi-Siamese Training (SST), to address challenges in Shallow Face Learning. Specifically, SST employs the Semi-Siamese networks and constructs the gallery queue with gallery features to overcome the issues. SST can perform with flexible integration with the existing training loss functions and network architectures. Experiments on shallow data show SST significantly improves the conventional training. Besides, extra experiments further explore the advantage of SST in a wide range, such as deep data training and pretrain-finetune development.
This work was supported in part by the National Key Research & Development Program (No. 2020YFC2003901), Chinese National Natural Science Foundation Projects #61872367, and #61572307, and Beijing Academy of Artificial Intelligence (BAAI).
Semi-Siamese Training for Shallow Face Learning
Hang Du (equal contribution; this work was performed at JD AI Research)
Dan Zeng Tao Mei
First, we provide more details about utilizing SST for deep data learning. In each iteration of deep data training, a batch of IDs is randomly sampled, and for each ID, two images are randomly sampled. An arbitrary one acts as the gallery, and the other one acts as the probe. So, every image has a chance to play the role of gallery or probe. In addition, as a supplementary experiment for Section 4.5 of the main paper, we evaluate SST with the loss functions of DP-softmax, Contrastive, Triplet and N-pairs. For evaluation, we use seven test benchmarks, including LFW, BLUFR, AgeDB, CFP, CALFW, CPLFW and MegaFace. From the results, we find that all the loss functions achieve better performance on various benchmarks after employing SST. Moreover, we observe that the original embedding loss functions (i.e. Contrastive, Triplet and N-pairs) have poor performance in strict FAR ranges (such as on BLUFR and MegaFace); after being integrated with SST, they obtain significant improvement on these benchmarks.
In the ablation study (Section 4.2 of the main paper), we can see that every combination of “gallery queue” and “semi-siamese” (whether implemented by the network constraint or by momentum) leads to the most significant boost for each training loss function. In addition, AM-softmax gains a larger benefit than Arc-softmax from SST. We assume that the angular margin of Arc-softmax provides stronger supervision than AM-softmax, and that such strong supervision distorts the feature space to some extent, because the margin penalty acts on feature-feature pairs instead of feature-FC pairs (especially when training from scratch).
In the pretrain-finetune experiment (Section 4.6 of the main paper), we find different degrees of improvement for softmax-based methods (softmax, AM-softmax, Arc-softmax) and pair/triplet-based methods (Contrastive, Triplet, N-pairs). We argue that the heavy parameters of the classification FC layer in the original softmax-based methods bring sub-optimal results in this experiment. After integrating with SST, the FC layer is replaced by an updating feature queue, which can significantly alleviate the optimization issue. Meanwhile, the pair/triplet-based methods already use features rather than an FC layer in their original version. So, the benefit brought by SST is larger for softmax-based methods than for pair/triplet-based methods in the finetuning stage.
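The per-iteration sampling for deep data described at the start of this supplementary section can be sketched as below. The function name and the toy data are hypothetical, and the sketch assumes that every ID has at least two images.

```python
import random

def sample_pairs(images_by_id, batch_ids, seed=None):
    """Sample a batch of IDs and, for each ID, a random (gallery, probe) pair."""
    rng = random.Random(seed)
    ids = rng.sample(sorted(images_by_id), batch_ids)
    batch = []
    for identity in ids:
        # Two distinct images in random order, so either image of the
        # pair may end up playing the gallery role.
        gallery, probe = rng.sample(images_by_id[identity], 2)
        batch.append((identity, gallery, probe))
    return batch

# Toy stand-in for a deep dataset: ID -> list of image paths.
deep = {
    "id_a": ["a1.jpg", "a2.jpg", "a3.jpg"],
    "id_b": ["b1.jpg", "b2.jpg"],
    "id_c": ["c1.jpg", "c2.jpg", "c3.jpg", "c4.jpg"],
}
for identity, gallery, probe in sample_pairs(deep, batch_ids=2, seed=0):
    print(identity, gallery, probe)
```

Drawing a fresh pair in every iteration means the deep dataset is presented to the network as a changing sequence of shallow batches.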
Cheng, Y., Zhao, J., Wang, Z., Xu, Y., Jayashree, K., Shen, S., Feng, J.: Know you at one glance: A compact vector representation for low-shot learning. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. pp. 1924–1932 (2017)
Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. vol. 1, pp. 539–546 (2005)
Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification. In: Advances in neural information processing systems. pp. 1988–1996 (2014)