Large-scale Bisample Learning on ID vs. Spot Face Recognition

by Xiangyu Zhu, et al.
Beihang University

In many face recognition applications, there is a large amount of face data with two images for each person. One is an ID photo for face enrollment, and the other is a probe photo captured on the spot. Most existing methods are designed for training data with limited breadth (a relatively small number of classes) and sufficient depth (many samples for each class). They face great challenges when applied to this ID vs. Spot (IvS) data, including under-represented intra-class variations and an excessive demand on computing devices. In this paper, we propose a deep learning based large-scale bisample learning (LBL) method for IvS face recognition. To tackle the bisample problem, in which there are only two samples for each class, a classification-verification-classification (CVC) training strategy is proposed to progressively enhance the IvS performance. In addition, a dominant prototype softmax (DP-softmax) is incorporated to make deep learning applicable to large-scale classes. We conduct LBL on an IvS face dataset with more than two million identities. Experimental results show that the proposed method outperforms previous ones, validating the effectiveness of LBL on IvS face recognition.


1 Introduction

Figure 1: ID versus Spot (IvS) data. Each identity has one ID photo and one spot photo.

Face recognition has witnessed dramatic improvements in recent years, primarily due to advances in network architectures Krizhevsky2012ImageNet ; szegedy2014going ; simonyan2014very ; szegedy2016rethinking ; he2016deep , training strategies Taigman-CVPR-2013 ; sun2014deep ; schroff2015facenet ; Smirnov2017Doppelganger ; huang2016local ; Chen2017Beyond and large amounts of face data yi2014learning ; nech2017level ; guo2016ms ; Cao2017VGGFace2 . Recent methods mainly focus on face recognition in the wild, where the training datasets are collected from the Internet by web search engines yi2014learning or electronic album applications nech2017level . Most wild datasets, like CASIA-Webface yi2014learning , Ms-Celeb-1M guo2016ms and VGG2 Cao2017VGGFace2 , are well-posed: they have a limited number of classes (less than ) and adequate samples per class (more than ). However, this is not the case for many real-world face data, such as ID versus Spot (IvS) face recognition, which aims to match unconstrained spot photos with constrained ID photos; see Fig. 1 for an example. Compared with wild datasets, IvS datasets present the following three challenges.

  1. Heterogeneity: ID and spot photos are taken in different environments. The ID photos are taken in constrained environments with clean background, in frontal pose, normal illumination and neutral expression. The spot photos are taken in unconstrained environments. There are pose, lighting, expression and occlusion (e.g., glasses, haircut, scarf etc.) variations. Moreover, there may be a large age gap between ID and spot photos since ID photos are updated every years. This heterogeneity increases the difficulty of IvS face recognition.

  2. Bisample Data: Usually, IvS training data is collected by face authentication systems. When a user passes the authentication system, a pair of his photos will be recorded, one ID photo from his ID card and the other spot photo taken online. As a result, there are only two samples available for each subject. The intra-variations of classes are not well represented, making the discriminative training on bisample data a more challenging problem.

  3. Large-scale Classes: IvS data is collected by practical systems, where there can be as many as million or even hundreds of million identities. How to perform deep learning on such massive classes with limited GPU devices is worth studying.

The above three characteristics pose great challenges for IvS face recognition. Real-world applications demand a high recognition rate at a low false acceptance rate. To this end, large margins between inter-class samples and compactness of intra-class samples in the feature space are necessary. However, since there are only two samples for each subject, it is difficult to describe the intra-class variations in the training phase, so the derived feature space may not be discriminative enough. In addition, there is a huge number of classes. It is a great challenge to explore the discriminative information among these classes with limited GPU devices. Taking deep learning with softmax as an example, millions of prototypes need to be held in GPU memory, which is infeasible for most computing devices.

In this paper, we cast deep learning on IvS data as a Large-scale Bisample Learning (LBL) problem, where the training data has a huge number of classes and each class has only one positive pair. To enhance existing training strategies to handle the LBL problem, two challenges must be resolved: the weak intra-class variations caused by bisample data and the limited training scalability caused by large-scale classes. To deal with weak intra-class variations, we propose a progressive model transferring method, named Classification-Verification-Classification (CVC). We pre-train a model on web-collected data by classification and finetune it on IvS data by verification to get a good initialization. Then we perform large-scale classification to obtain the final IvS model.

To improve scalability for model training, we adopt a prototype selection strategy in the last stage of CVC to scale up softmax-like losses to any number of classes. Specifically, we observe that the gradients of softmax are dominated by a small fraction of classes and the dominant classes can be effectively identified by the class proximities. Based on this, we build a dominant queue for each class to record its similar classes, from which we can select the most dominant classes to participate in the classification. The new softmax can perform effective training with only classes, significantly reducing the demand for computing devices.

We evaluate our method on a real-world IvS dataset and show that it reaches state-of-the-art performance with limited computing devices (4 TITAN X GPUs). Besides, we release a Public-IvS dataset of identities for open evaluation. Moreover, to make our work reproducible, we devise a new protocol, Megaface-bisample, to mimic the large-scale bisample learning task. To our knowledge, this is the first investigation into training deep neural networks on large-scale bisample face data.

2 Related Works

In this section, we review the deep learning based face recognition and discuss two related problems about the LBL task: (1) Learning with insufficient data and (2) Large-scale classification.

2.1 Deep Learning based Face Recognition

Currently there are two schemes to train deep models for face recognition: classification and verification. The classification scheme considers each identity as a unique category and classifies each sample into one of the classes. During testing, the classification layer is removed and the top-level feature is regarded as the face representation Sun-CVPR-2014 . The most popular loss is softmax Sun-CVPR-2014 ; Taigman-CVPR-2013 ; taigman2014web . Based on that, the center loss wen2016discriminative proposes to learn class-specific feature centers to make features more compact in the embedding space. The L2-softmax ranjan2017l2 adds an L2-constraint on features to promote the under-represented classes. The normface wang2017normface normalizes both features and prototypes to make the training and testing phases closer. Recently, enhancing margins between different classes has been found effective in improving feature discrimination, including large-margin softmax liu2016large , A-softmax liu2017sphereface , GA-softmax liu2017deephyper and AM-softmax wang2018additive . Benefiting from the prototypes in the classification layer, the scheme can distinguish a sample from all the other classes, leading to fast convergence and good generalization ability wang2017normface .

On the other hand, the verification scheme optimizes distances between samples. Within a mini-batch, the contrastive loss sun2014deep optimizes pairwise distances in the feature space to reduce intra-class distances and enlarge inter-class distances. The triplet loss schroff2015facenet makes up a triplet consisting of an anchor, a positive sample and a negative sample. The loss aims to separate the positive pair from the negative pair by a distance margin. The lifted structured loss oh2016deep considers all the pairwise distances within the mini-batch and selects the best positives and negatives. The N-pairs loss sohn2016improved optimizes each positive pair against all the related negative pairs following a local softmax formulation. Besides, hard negative mining is widely adopted to remove the easy negative pairs to ensure fast convergence schroff2015facenet . More recently, zhao2018princi presents a GAN-based method to deliberately generate hard triplet samples to improve the efficiency and effectiveness of training with triplet losses. The performance of the verification scheme depends on the number of pairs generated in one mini-batch oh2016deep , which is determined by the batch size. However, increasing the batch size requires more GPU memory, which is very expensive. To reduce the cost of GPU memory, smart sampling kumar2017smart selects valuable pairs in the data layer instead of the feature layer. The method memorizes the pairs having large losses and selects them with higher probabilities afterwards kumar2017smart ; Smirnov2017Doppelganger ; wang2017train .

Most contemporary face recognition methods are based on wild datasets, e.g., CASIA-Webface yi2014learning , Ms-Celeb-1M guo2016ms , MF2 nech2017level and VGG2 Cao2017VGGFace2 . These well-posed datasets have a limited number of identities and sufficient samples per identity. However, this is not the case in IvS datasets. Table 1 gives a brief comparison between wild and IvS datasets. Our CASIA-IvS has more than million identities but only two samples per identity, on which existing well-studied methods cannot work well any more. Exploring IvS-specific training strategies is necessary.

Dataset | Identities | Samples/ID | Scenarios | Descriptions
CASIA-Webface yi2014learning | | | wild | Celebrity photos by web-searching
Ms-Celeb-1M guo2016ms | | | wild | Celebrity photos by web-searching
MF2 nech2017level | | | wild | User photos of electronic album
VGG2 Cao2017VGGFace2 | | | wild | Celebrity photos by web-searching
CASIA-IvS | | | IvS | ID and spot photos of the masses
Table 1: Description of face recognition datasets. We clean Ms-Celeb-1M and MF2 due to their low purities wu2015light , and cut the identities whose samples are smaller than to balance the long tail distribution zhang2017range . The numbers after indicate the information after cleaning.

2.2 Learning with Insufficient Data

Low-shot learning intends to recognize new classes from few samples feifei2006one-shot . Generally, low-shot learning transfers the knowledge from a well-posed source domain to the low-shot target domain. Siamese net koch2015siamese trains a siamese CNN by same-or-different classification on the source domain and extracts the deep features for nearest neighbour matching in the target domain. MANN santoro2016one-shot ; weston2014memory ; vinyals2016matching memorizes the features of examples in the source domain to help predict the under-labeled classes. Model regression Wang2016Learning ; bertinetto2016learning directly transfers the neural network weights across domains. The L2-regularization on features guo2017one ; hariharan2016low ; chen2016analysis can prevent the network from ignoring low-shot classes. Besides, virtual sample generation hariharan2016low ; choe2017face and semi-supervised samples xu2016few are found effective in promoting low-shot classes. Although both low-shot learning and bisample learning intend to learn a concept with insufficient samples, they differ in that low-shot learning is closed-set classification whereas bisample learning is open-set classification, where the testing samples definitely belong to unseen classes.

The long-tail problem refers to the situation in which only a limited number of classes appear frequently, while most of the others occur far less often. Deep models trained on long-tailed data tend to ignore the classes in the tail. To resolve the problem, Yang2014Context retrieves more samples from the tail classes. Ouyang2016Factors makes samples uniformly distributed by random sampling. zhang2017range proposes a range loss to balance the rich and poor classes, where the largest intra-class distance is reduced and the shortest class-center distance is enlarged.

2.3 Large-scale Classification

Large-scale classification aims to perform classification on a vast number of classes, where the class number reaches millions or tens of millions. This task presents a great problem for deep learning: the common softmax loss cannot be adopted due to the prohibitive parameter size and computation cost. The Megaface challenge nech2017level proposes four methods for training models on k identities. Model-A trains the network on random identities via softmax. Model-B finetunes Model-A on all the k identities with the triplet loss. Model-C adopts rotating softmax that randomly selects identities every epochs. After each rotation the parameters in the softmax layer are randomly initialized. Model-D further triplet-finetunes Model-C on all the identities.

Beyond computer vision, extreme multi-label learning and noise contrastive estimation gutmann2010noise are related to large-scale classification. Extreme multi-label learning learns a classifier to tag a sample with the most relevant labels from a large label set hsu2009multi . It faces the same challenge as LBL: training a multi-class classifier is computationally prohibitive when the class number is extremely large. To tackle this problem, the tree based methods choromanska2013extreme ; prabhu2014fastxml ; bengio2003neural learn a label hierarchy as follows. The root node contains the entire label set and nodes are recursively partitioned until each leaf contains a small number of labels. Finally a base classifier identifies the samples in only one leaf node. Although tree based methods reduce the class number for each classifier, a prediction error made at the top level cannot be corrected at lower levels due to the cascading architecture babbar2017dismec . On the other hand, the embedding based methods bhatia2015sparse ; tagami2017annexml ; xu2016robust assume that the label matrix hsu2009multi , where each row is the label vector of a sample, is low rank and that the label vectors can be projected onto a low-dimensional linear subspace. As a result, the extreme classification task can be converted to a low-dimensional regression problem. However, the low rank assumption indicates that the samples concentrate on a small number of active classes, which is not the case in IvS data.

Noise Contrastive Estimation (NCE) gutmann2010noise provides an approximate method to estimate the probability distribution without the normalization constant, which is the major cost in large-scale classification. Its basic idea is to train a logistic regression classifier to discriminate samples from the data distribution and a noise distribution, so that the density estimation is reduced to probabilistic binary classification. Although NCE has been successfully applied in language models mnih2012fast ; mnih2013learning ; vaswani2013decoding , recent face recognition tasks sun2014deep ; Sun-CVPR-2014 have shown that promoting the contrast among classes is crucial in training discriminative models. Turning multi-class classification into binary logistic regression may lose inter-class information and yield inferior performance.

3 Large-scale Bisample Learning

The proposed method contains a complete pipeline for deep learning on large-scale bisample data. We begin by discussing the classification and the verification schemes, showing how their pros and cons motivate the proposed methods. Then we present the way to train deep neural networks on bisample data. Finally we develop a dominant prototype softmax to perform -million-way classification in a scalable fashion. Fig. 2 shows an overview of our method.

Figure 2: Overview of the large-scale bisample learning (LBL). LBL adopts a classification-verification-classification (CVC) training strategy, which has three stages. The first stage, pre-learning, trains the network from scratch on a wild dataset with a classification loss. The second stage, transfer learning, finetunes the network on the IvS dataset with a verification loss. The last stage, fine-grained learning, performs large-scale classification on the IvS dataset with a new dominant prototype softmax.

3.1 Problem Formulation and Motivation

Currently there are two schemes for training deep neural networks, i.e., verification and classification. The verification scheme optimizes sample-to-sample distances with losses such as the contrastive loss sun2014deep and the triplet loss schroff2015facenet . In each iteration, it performs local optimization within a mini-batch by making positive pairs close and negative pairs far away. Besides, the mining strategy schroff2015facenet filters out easy pairs for fast convergence. On the other hand, the classification scheme regards each identity as a unique class and trains the network as a $C$-way classification problem with losses such as softmax Sun-CVPR-2014 and A-softmax liu2017sphereface . Compared with the verification scheme, the classification scheme performs global optimization by classifying each sample into one of the $C$ classes.

In this paper, we motivate our method by comparing classification and verification. Interestingly, if we formulate the loss function for a whole mini-batch, we can unify the two schemes in a pair matching and weighting framework. First, the verification scheme extracts features with a neural network and makes pairs between deep features. Taking the contrastive loss sun2014deep as an example:

$$L_{c} = \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ l_{ij} \, \|x_i - x_j\|_2^2 + (1 - l_{ij}) \, h\!\left(\tau - \|x_i - x_j\|_2^2\right) \right] \qquad (1)$$

where $x_i \in \mathbb{R}^{D}$ is the $D$-dimensional deep feature extracted by the neural network; $\{x_1, \ldots, x_N\}$ are the features in the mini-batch, where $N$ is the batch size; $l_{ij} = 1$ if $x_i$ and $x_j$ belong to the same class and $l_{ij} = 0$ if not; $h(z) = \max(0, z)$ is the hard negative mining that filters out easy negative pairs with a threshold $\tau$. We can see that the contrastive loss makes pairs within deep features and assigns weights to them.

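In contrast, the classification scheme makes pairs between features and prototypes. Taking the softmax loss sun2014deep as an example:

$$L_{s} = -\sum_{i=1}^{N} \log \frac{e^{w_{y_i}^{T} x_i}}{\sum_{j=1}^{C} e^{w_j^{T} x_i}} \qquad (2)$$

where $W = [w_1, \ldots, w_C] \in \mathbb{R}^{D \times C}$ is the prototype matrix in the softmax layer, $C$ is the number of classes and $y_i$ is the label of $x_i$. Its derivatives with respect to a prototype and a feature are:

$$\frac{\partial L_{s}}{\partial w_j} = \sum_{i=1}^{N} \left( p_{ij} - \mathbb{1}(y_i = j) \right) x_i, \qquad \frac{\partial L_{s}}{\partial x_i} = \sum_{j=1}^{C} \left( p_{ij} - \mathbb{1}(y_i = j) \right) w_j \qquad (3)$$

$$p_{ij} = \frac{e^{w_j^{T} x_i}}{\sum_{k=1}^{C} e^{w_k^{T} x_i}} \qquad (4)$$

where $\mathbb{1}(\cdot)$ is the indicator function, which is $1$ when the statement is true and $0$ otherwise, and $p_{ij}$ is the probability that $x_i$ belongs to the $j$th class. Given that network training only concerns the back-propagated gradients, we can construct a dummy softmax loss sharing the same gradients as Equ. 2:

$$L_{d} = \sum_{i=1}^{N} \sum_{j=1}^{C} \left( p_{ij} - \mathbb{1}(y_i = j) \right) w_j^{T} x_i \qquad (5)$$

where $p_{ij}$ is computed as in Equ. 4 and considered as a constant. $L_{s}$ and $L_{d}$ are equivalent in network training since they produce the same back-propagated signals. Obviously, $L_{d}$ makes pairs between $x_i$ and $w_j$, and assigns a weight to each pair by the probability $p_{ij}$. The negative pairs with higher probabilities and the positive pairs with lower probabilities have larger weights and yield louder signals during training.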
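The gradient equivalence behind the dummy softmax loss can be checked numerically. The following is a minimal NumPy sketch, not the authors' code. It assumes the dummy loss is the sum over all feature-prototype pairs of (p_ij - 1(y_i = j)) w_j^T x_i with p_ij frozen as a constant, and it compares finite-difference gradients with respect to the prototypes. All function names, sizes and the random data are illustrative assumptions.

```python
import numpy as np


def softmax_probs(W, X):
    """p[i, j]: probability that feature x_i belongs to class j."""
    logits = X @ W                                   # shape (N, C)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)


def softmax_loss(W, X, y):
    """Standard softmax cross-entropy summed over the mini-batch."""
    p = softmax_probs(W, X)
    return -np.log(p[np.arange(len(y)), y]).sum()


def dummy_loss(W, X, y, p_const):
    """Assumed pair-weighted form: sum_ij (p_ij - 1(y_i = j)) * w_j^T x_i."""
    weights = p_const.copy()
    weights[np.arange(len(y)), y] -= 1.0             # p_ij - 1(y_i = j)
    return (weights * (X @ W)).sum()


def numerical_grad(f, W, eps=1e-6):
    """Central finite differences of a scalar function f with respect to W."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        g[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return g


rng = np.random.default_rng(0)
N, D, C = 4, 3, 5                                    # toy sizes (assumed)
X = rng.normal(size=(N, D))                          # features, one per row
W = rng.normal(size=(D, C))                          # prototypes, one per column
y = rng.integers(0, C, size=N)                       # labels

p = softmax_probs(W, X)                              # frozen pair weights
g_soft = numerical_grad(lambda M: softmax_loss(M, X, y), W)
g_dummy = numerical_grad(lambda M: dummy_loss(M, X, y, p), W)
grads_match = np.allclose(g_soft, g_dummy, atol=1e-5)
```

Under these assumptions the two gradients agree, which is what allows the softmax loss to be read as a weighted set of feature-prototype pairs.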

Comparing Equ. 5 and Equ. 1, we can conclude that both classification and verification follow the same pair matching and weighting framework. The only differences lie in the pairing candidates (features with prototypes versus features with features) and the weighting methods (soft weight versus hard weight). Recent works have empirically observed that increasing the number of pairs always delivers faster convergence and better discriminative power, hence loss functions involving more pairs are preferred. Within a mini-batch with $N$ as the batch size and $C$ as the class number, a classification loss makes $N \times C$ pairs in Equ. 5 and a verification loss makes $N \times N$ pairs in Equ. 1. In real implementations with limited GPU memory, $C \gg N$ always holds. For example, when training ResNet64 liu2016large with a TITAN-X GPU, the batch size is about and the class number easily reaches tens or even hundreds of thousands. With orders of magnitude more pairs, the classification scheme is expected to acquire more discriminative features, which has been shown in the state-of-the-art methods liu2016large ; liu2017deephyper ; wang2018additive ; wang2018cosface . However, two challenges make classification infeasible on IvS data. First, the classification scheme has difficulty converging on bisample data due to the weak intra-class variations, which is demonstrated in our experiments. Second, the classification scheme scales poorly to large-scale classes due to the limited GPU memory. Directly performing -million-way classification with two samples per class is infeasible for current optimization methods and computing devices.

In this paper, we aim to make the classification scheme feasible on large-scale bisample data. To this end, its robustness to bisample data and its scalability to large-scale classes should be enhanced. First, we find that the classification scheme converges on bisample data only if it is well initialized. We therefore propose a CVC training strategy to initialize the model and construct the prototypes for the classification scheme. Second, we propose a prototype selection strategy to scale up the classification scheme to any number of classes. With these improvements, we achieve superior performance to existing methods.

3.2 Bisample Learning

It has been observed that when training data is insufficient, transferring knowledge from related tasks is better than directly training on the target domain koch2015siamese . Inspired by this, we regard the well-posed wild data as the source domain and the IvS data as the target domain. A classification-verification-classification (CVC) training strategy is proposed to transfer the knowledge from wild scenarios to IvS scenarios and boost the performance by large-scale classification. As shown in Fig. 2, the CVC involves three stages:

  1. Pre-learning (Classification): We first train the deep model on a wild dataset to get a good initialization for general face recognition. With a limited number of classes (less than ), we can adopt a classification loss like softmax Sun-CVPR-2014 and A-softmax liu2017sphereface to perform one-vs-all optimization. The trained model performs well in wild scenarios but terribly in IvS scenarios due to the large bias zhou2015naive . Nevertheless, the model has learned basic knowledge about human faces and will not be puzzled by IvS data.

  2. Transfer Learning (Verification): Since the verification scheme only concerns a small number of classes and needs just two samples per class to optimize intra-class distances in each iteration, we believe verification is robust to large-scale bisample data. In this stage, we adopt the verification scheme to transfer the face knowledge from wild scenarios to IvS scenarios. Specifically, we remove the classification layer and finetune the model on the IvS dataset with a verification loss like contrastive sun2014deep or triplet schroff2015facenet . Benefiting from the initialization from the previous stage and the robustness of the verification scheme to bisample data, we can successfully optimize the loss function and provide a good initialization for the final large-scale classification.

  3. Fine-grained Learning (Classification): We construct a classification layer on top of the network and conduct classification with million classes on the IvS dataset. A novel dominant prototype softmax is adopted to select a small number of dominant classes to participate in the classification in each iteration. The new softmax can effectively and efficiently perform large-scale classification and further boost the performance, finally achieving satisfactory recognition accuracy in IvS scenarios.

The key in CVC is that the knowledge transfer should be smooth. We find that after the first stage, the large-scale classification is already able to converge. However, the loss descends slowly and the optimization gets stuck in a bad local optimum. Considering that the verification scheme has good robustness to data distribution, we bridge the two classification stages with a verification stage, which gives a better initialization for large-scale classification and finally achieves much better performance. Although classification followed by verification parkhi2015deep and joint identification-verification sun2014deep have been applied in training web-face models, there the two schemes are applied on the same dataset. In contrast, the first two stages of CVC are applied on different datasets with different scenarios, and thus play a knowledge-transferring role.

To perform classification in the final stage of CVC, we must construct the absent classification layer, which contains the prototype for each class. Considering that prototypes serve as the class proxies towards which the deep features will be optimized, we construct the prototype of a class from the features belonging to it. Specifically, we try two kinds of prototypes: ID-prototype and avg-prototype. Suppose $x_j^{id}$ and $x_j^{spot}$ are the deep features of the ID and spot photos of the $j$th identity; we set the ID-prototype $w_j = x_j^{id}$ and the avg-prototype $w_j = \frac{1}{2}(x_j^{id} + x_j^{spot})$. Intuitively, the ID-prototype enforces the spot feature to approach the more reliable ID feature, and the avg-prototype makes the two features approach their centroid. Our experiments show that which kind of prototype is better depends on the loss function.
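The two prototype constructions described above can be sketched in a few lines of NumPy. This is an illustration under assumptions: the function name `build_prototypes`, the `kind` argument and the row-per-identity layout are hypothetical, and any feature normalization used in practice is omitted because the text does not specify it.

```python
import numpy as np


def build_prototypes(id_feats, spot_feats, kind="id"):
    """Build one prototype per identity from its two deep features.

    id_feats, spot_feats: arrays of shape (C, D); row j holds the ID / spot
    feature of identity j (layout assumed for this sketch).
    kind: "id" copies the ID feature, "avg" uses the centroid of the pair.
    """
    id_feats = np.asarray(id_feats, dtype=float)
    spot_feats = np.asarray(spot_feats, dtype=float)
    if kind == "id":
        return id_feats.copy()
    if kind == "avg":
        return 0.5 * (id_feats + spot_feats)
    raise ValueError("kind must be 'id' or 'avg'")


# Toy example with two identities and 2-D features.
id_f = np.array([[1.0, 0.0], [0.0, 2.0]])
spot_f = np.array([[3.0, 0.0], [0.0, 0.0]])
id_protos = build_prototypes(id_f, spot_f, kind="id")
avg_protos = build_prototypes(id_f, spot_f, kind="avg")
```

With the ID-prototype only the spot feature has to move during training, whereas with the avg-prototype both features are pulled towards their midpoint.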

In the next section, we will introduce how to perform large-scale classification in the final stage of CVC.

3.3 Large-scale Classification

3.3.1 Random Prototype Softmax

With the well-initialized network and prototypes, the only remaining problem is to scale up the classification scheme to massive classes. If we directly perform classification on million classes, the massive prototypes will take GPU memory (GB of the GB) and dramatically increase the training time due to their numerous parameters.

We aim to improve scalability by reducing the cost of large-scale classification. As shown in Fig. 3, we select a fraction of prototypes to participate in the classification in each iteration. In the pair-matching formulation of softmax (Equ. 5), given one mini-batch $X$ whose samples have different labels, all the prototypes in $W$ can be divided into the positive prototypes $W^{+}$ and the remaining negative prototypes $W^{-}$. Each prototype in $W^{+}$ has a mate in $X$ to make up a positive pair, while the prototypes in $W^{-}$ do not share a class with any sample of $X$ and only make up negative pairs. Given that $|W^{-}| \gg |W^{+}|$, it is unnecessary to put the whole $W^{-}$ into GPU memory since negative pairs are redundant. Based on this, we propose a naive solution called Random Prototype Softmax (RP-softmax). The RP-softmax stores the full prototype matrix $W$ in the memory. In each iteration, it first constructs a temporary prototype matrix $\hat{W} = [W^{+}, \hat{W}^{-}]$, where $\hat{W}^{-}$ has randomly selected prototypes from $W^{-}$ and $S$ is the number of selected prototypes. Then $\hat{W}$ is copied into GPU for training and updated. Finally, $W$ and the updated $\hat{W}$ are synchronized by replacing the selected prototypes with the updated ones. Overall, the prototype selection and updating procedure is listed in Algorithm 1.

Figure 3: Overview of large-scale classification.
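Algorithm 1 Random Prototype Softmax
Input: Prototype matrix $W = [w_1, \ldots, w_C]$
  Feature matrix $X = [x_1, \ldots, x_N]$
  Number of selected classes $S$
Output: Updated prototype matrix $W$
Initialize the selected class set $T = \emptyset$
for each feature $x_i$ in $X$ do
      Get the label $y_i$ of $x_i$
      $T$.insert($y_i$)
end for
while $|T| < S$ do
      randomly select a label $y \in \{1, \ldots, C\}$
      $T$.insert($y$)
end while
for $k = 1, \ldots, S$ do
      $\hat{w}_k = w_{t_k}$, where $t_k$ is the $k$th member of $T$
end for
Train with $X$ and $\hat{W}$, getting the updated $\hat{W}$
for $k = 1, \ldots, S$ do
      $w_{t_k} = \hat{w}_k$
end for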
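The selection, training and write-back steps of Algorithm 1 can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: prototypes are stored as rows of a host-side array, the "training" step is a single plain SGD update of the selected prototypes with the features held fixed, and the names `rp_select`, `rp_softmax_step` and `lr` are assumptions made for this sketch.

```python
import numpy as np


def rp_select(labels, num_classes, num_selected, rng):
    """Return selected class indices: batch labels first, then random classes."""
    selected = list(dict.fromkeys(int(y) for y in labels))  # unique, order kept
    chosen = set(selected)
    target = min(num_selected, num_classes)      # avoid an endless loop
    while len(selected) < target:
        c = int(rng.integers(num_classes))
        if c not in chosen:                      # set semantics of T.insert
            chosen.add(c)
            selected.append(c)
    return np.array(selected)


def rp_softmax_step(W_full, X, labels, num_selected, lr, rng):
    """One RP-softmax iteration on a (C, D) prototype array, updated in place."""
    sel = rp_select(labels, W_full.shape[0], num_selected, rng)
    W_tmp = W_full[sel].copy()                   # part that would go to the GPU
    position = {int(c): k for k, c in enumerate(sel)}
    local_y = np.array([position[int(y)] for y in labels])

    logits = X @ W_tmp.T                         # (N, S)
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(local_y)), local_y] -= 1.0   # p_ij - 1(y_i = j)
    W_tmp -= lr * (p.T @ X)                      # SGD step on prototypes only

    W_full[sel] = W_tmp                          # synchronise with full matrix
    return sel


rng = np.random.default_rng(0)
C, D, N, S = 50, 8, 4, 10                        # toy sizes (assumed)
W = rng.normal(size=(C, D))
W_before = W.copy()
X = rng.normal(size=(N, D))
labels = np.array([3, 17, 3, 42])
sel = rp_softmax_step(W, X, labels, S, lr=0.1, rng=rng)
```

Only the `S` selected rows are touched in each iteration, so the cost of the softmax layer depends on `S` rather than on the total number of classes.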

The hyperparameter $S$ plays a key role in RP-softmax. A larger $S$ brings more negative pairs and provides richer inter-class variation information. However, increasing $S$ is not cost free. Besides the time-consuming large matrix multiplication, the softmax layer is blocked until $\hat{W}$ is copied into GPU. Sometimes the waiting time exceeds the forward propagation time. Moreover, increasing $S$ squeezes the batch size and degrades the data-driven layers like batch-normalization. As a result, $S$ is set empirically to balance the performance and the training time. In our experiments, with the RP-softmax significantly improves the performance in IvS scenarios.

3.3.2 Dominant Prototype Softmax

Although RP-softmax makes it possible to perform large-scale classification, it is still inefficient due to its blind prototype selection. In this section, we show that quality, not quantity, is what really matters in prototype selection. We begin by demonstrating that in each iteration, only a small fraction of negative prototypes generate strong gradients.

In Equ. 3, a negative prototype $w_j$ contributes to the back-propagated gradient by $p_{ij} w_j$, whose norm is $p_{ij} \|w_j\|$. Usually, we restrict $\|w_j\|$ to one liu2017sphereface and the norm will be $p_{ij}$, which can measure the impact of $w_j$ on the training process. In this paper, with a mini-batch $X = \{x_1, \ldots, x_N\}$, we define the energy of a negative prototype $w_j$ as:

$$E(w_j) = \sum_{i=1}^{N} p_{ij} \qquad (6)$$

where $p_{ij}$ is the probability that $x_i$ belongs to class $j$. Note that none of the $x_i$ has the label $j$ since $w_j$ is a negative prototype. To analyze whether the energy is concentrated on a small fraction of prototypes, we further define the top-$K$ cumulative energy as:

$$CE(K) = \frac{\sum_{w_j \in \mathrm{top}_K(W^{-})} E(w_j)}{\sum_{w_j \in W^{-}} E(w_j)} \qquad (7)$$

where $W^{-}$ is the set of negative prototypes and $\mathrm{top}_K(W^{-})$ is the set of $K$ negative prototypes with the largest energy. A large $CE(K)$ with a small $K$ denotes that the energy of negative prototypes is highly concentrated. We plot $CE(K)$ along the training process in Fig. 4. It can be seen that in the beginning the top- possesses of energy. As the training proceeds, the energy becomes more and more concentrated. In the middle and end of the training process, the energy of top- is increased to and . These results indicate that only a small fraction of prototypes can produce large gradients to affect training. We call these negative prototypes with large energy dominant prototypes.

Figure 4: The top-$K$ cumulative energy of negative prototypes () for a mini-batch, in the beginning, middle ( iterations) and end ( iterations) of the training process. The batch size is and the number of classes is . The curves come from averaging of mini-batches.
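The energy of negative prototypes and its top-K concentration can be computed directly from a matrix of softmax probabilities. The sketch below is a minimal NumPy illustration under the assumption that the energy of a negative prototype is the sum of its probabilities over the mini-batch and that the cumulative energy is the share held by the K largest values. The function names and the toy probability matrix are hypothetical.

```python
import numpy as np


def negative_energy(P, labels):
    """Per-class energy and a mask marking the negative classes.

    P: (N, C) softmax probabilities for one mini-batch.
    labels: length-N integer labels; their classes are the positives.
    """
    is_negative = np.ones(P.shape[1], dtype=bool)
    is_negative[labels] = False
    energy = P.sum(axis=0)                    # sum of p_ij over the batch
    return energy, is_negative


def top_k_cumulative_energy(P, labels, k):
    """Share of the total negative energy held by the k strongest negatives."""
    energy, is_negative = negative_energy(P, labels)
    neg = np.sort(energy[is_negative])[::-1]  # largest energy first
    return neg[:k].sum() / neg.sum()


# Toy batch: two samples, four classes, classes 0 and 1 are the positives.
P = np.array([[0.70, 0.10, 0.15, 0.05],
              [0.10, 0.60, 0.20, 0.10]])
labels = np.array([0, 1])
ce_top1 = top_k_cumulative_energy(P, labels, k=1)
ce_top2 = top_k_cumulative_energy(P, labels, k=2)
```

In this toy batch the two negative classes carry energies 0.35 and 0.15, so the strongest one alone accounts for 70% of the negative energy.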

In a real implementation, given a batch of features, how can we know the most dominant prototypes before we compute the probabilities in softmax? In this paper, we assume that if two identities have similar ID features, their prototypes and features are likely to make hard negative pairs. Based on this, we propose the Dominant Prototype Softmax (DP-softmax). The basic idea is to select prototypes from a set of dominant queues and to update the queues by the softmax predictions. The procedure is detailed as follows:

Queue Initialization: For each class , we define the K-Nearest Classes as the top- classes having the nearest ID features with . Before training, we build an approximate nearest neighbor (ANN) graph by ID features and get the for each class. Then we construct a dominant queue and a candidate set for each class. The is initialized by and its members are sorted by the distances of ID features to . The is set to . Note that .
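The queue initialization can be sketched as follows. This is only an illustration under stated assumptions: a brute-force cosine search stands in for the approximate nearest neighbor graph, the dominant queue is taken to be the nearest part of a larger neighbor list, and the candidate set is taken to be the whole list. The names `k_nearest_classes`, `init_queues`, `queue_size` and `candidate_size` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def k_nearest_classes(id_features, k):
    """For every class, list the k other classes with the most similar ID feature.

    id_features[c] is the ID feature of class c. Brute force is used here;
    at the scale of millions of classes an ANN index would be needed instead.
    """
    neighbors = []
    for j, fj in enumerate(id_features):
        sims = [(cosine(fj, fc), c) for c, fc in enumerate(id_features) if c != j]
        sims.sort(key=lambda t: (-t[0], t[1]))  # most similar first
        neighbors.append([c for _, c in sims[:k]])
    return neighbors

def init_queues(id_features, queue_size, candidate_size):
    """Build one dominant queue (ordered list) and one candidate set per class.

    Assumption: the candidate set is the larger neighbor list and therefore
    contains the dominant queue.
    """
    neighbors = k_nearest_classes(id_features, candidate_size)
    queues = [n[:queue_size] for n in neighbors]
    candidates = [set(n) for n in neighbors]
    return queues, candidates

id_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
queues, candidates = init_queues(id_features, queue_size=1, candidate_size=2)
```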

Prototype Selection: After training begins, in each iteration we need to select prototypes for the mini-batch . First we select their positive prototypes where is the label of . Second, for each feature we select the prototypes of the classes in its dominant queue that and the full negative prototypes are . Thirdly, we remove the repeated prototypes and randomly select negative prototypes into until a preset number is reached. Finally and constitute the temporary prototype matrix in this iteration and are copied into GPU for training. Algorithm 2 summarizes the DP-softmax.

Input: Prototype matrix $W$; feature matrix $X$; number of selected classes $M$; dominant queues $Q$
Output: Updated prototype matrix $W$
Initialize the selected class set $S = \emptyset$
for each feature $x$ in $X$ do
    Get the label $y$ of $x$
    $S$.insert($y$)
    for each class $c$ in $Q_y$ do
        $S$.insert($c$)
    end for
end for
while $|S| < M$ do
    Randomly select a label $r$
    $S$.insert($r$)
end while
for $i = 1, \dots, |S|$ do
    $W'_i = W_{S_i}$, where $S_i$ is the $i$-th member of $S$
end for
Train with $X$ and $W'$, getting the updated $W'$
for $i = 1, \dots, |S|$ do
    $W_{S_i} = W'_i$
end for
Algorithm 2 Dominant Prototype Softmax
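The selection, gather and write-back steps of Algorithm 2 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: prototypes are plain Python lists, the training step is left out, and the names `select_classes`, `gather` and `scatter` are hypothetical.

```python
import random

def select_classes(labels, queues, num_selected, num_classes, rng=random):
    """Pick the classes whose prototypes take part in this iteration.

    Order of insertion: each sample's own label, then the classes in that
    label's dominant queue, then random classes until num_selected is reached.
    Duplicates are skipped, so every class appears at most once.
    """
    selected, seen = [], set()

    def add(c):
        if c not in seen:
            seen.add(c)
            selected.append(c)

    for y in labels:
        add(y)
        for c in queues[y]:
            add(c)
    # Never ask for more distinct classes than exist.
    target = min(max(num_selected, len(selected)), num_classes)
    while len(selected) < target:
        add(rng.randrange(num_classes))
    return selected

def gather(prototypes, selected):
    """Copy the selected rows into a temporary prototype matrix."""
    return [list(prototypes[c]) for c in selected]

def scatter(prototypes, selected, updated):
    """Write the trained temporary rows back into the full prototype matrix."""
    for row, c in zip(updated, selected):
        prototypes[c] = list(row)

queues = [[1], [0], [3], [2], [5], [4]]
prototypes = [[float(c)] for c in range(6)]
selected = select_classes([0, 2], queues, num_selected=5, num_classes=6,
                          rng=random.Random(0))
temp = gather(prototypes, selected)
temp = [[v[0] + 10.0] for v in temp]  # stand-in for one training step
scatter(prototypes, selected, temp)
```

Only the temporary matrix would need to be copied to the GPU, which is why its size, rather than the total number of classes, bounds the per-iteration cost.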

Queue Updating: After training in each iteration, we can update the dominant queues by the predictions of softmax. For a feature $x$ with label $y$, dominant queue $Q_y$ and candidate set $C_y$, its highest activated class $c$ provides valuable information. First, if $c = y$, it is a successful prediction and there is nothing to update. Second, if $c \neq y$ but $c \in Q_y$, this is a mis-prediction but the wrongly matched class is already in the dominant queue, hence we need not update $Q_y$. Thirdly, if $c \neq y$ and $c \notin Q_y$ but $c \in C_y$, it means the class neighborhood has changed as the training proceeds. Therefore, we push $c$ into $Q_y$ and pop the class that is the most dissimilar to $y$. Finally, if $c \neq y$ and $c$ is in neither $Q_y$ nor $C_y$, it means $c$ and $y$ have dissimilar ID features in the beginning but become close at this time. This case is mostly caused by a mislabelled or low-quality spot photo of $c$ which misdirects its prototype, as shown in Fig. 5. Therefore, we do not update $Q_y$ since $c$ is a noisy label.

(a) Mislabelling
(b) Low Quality
Figure 5: The refused mis-predicted class. When a mis-predicted class is refused entry to the dominant queue, there is always something wrong with its spot photo, including (a) mislabelling and (b) low quality.
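The four cases of the queue update can be written as a small decision function. This is a sketch under assumptions: the queue is a Python list, the candidate set is a Python set, and `dissimilarity` is a hypothetical callable that measures how far a class is from the sample's label (for example, a distance between ID features). The returned strings are illustrative labels only.

```python
def update_queue(label, predicted, queue, candidates, dissimilarity):
    """Apply one queue update for a sample and report which case occurred.

    label:       ground-truth class of the sample
    predicted:   class with the highest softmax activation
    queue:       dominant queue of `label` (list, modified in place)
    candidates:  candidate set of `label`
    """
    if predicted == label:
        return "correct"            # successful prediction, nothing to do
    if predicted in queue:
        return "already_dominant"   # wrong class is already being trained against
    if predicted in candidates:
        if queue:
            # Evict the queue member that is least similar to the label.
            farthest = max(queue, key=lambda c: dissimilarity(label, c))
            queue.remove(farthest)
        queue.append(predicted)
        return "pushed"
    return "refused"                # likely noise: mislabelled or low-quality photo

queue = [1, 2]
candidates = {1, 2, 3}
dist = lambda a, b: abs(a - b)  # toy dissimilarity between class ids
results = [update_queue(0, p, queue, candidates, dist) for p in (0, 1, 3, 7)]
```

The sketch appends the new class at the end of the queue; keeping the queue sorted by ID-feature distance, as in the initialization, would need one extra sort.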

The whole prototype selection and queue updating procedure can be done in real time. Compared with RP-softmax, DP-softmax significantly improves the quality and reduces the quantity of prototypes, leading to faster training and better performance.

Since the prototypes are saved in memory, which can easily hold tens of millions of prototypes, the dominant prototype selection scales up the classification scheme to any number of classes. Besides, when new training data come, the prototype matrix can be extended by the ID features of the new identities. Then the network can be finetuned on the whole training data.

4 Experiments

In this section, the proposed large-scale bisample learning (LBL) is systematically evaluated. We first analyze the CVC training strategy. Then we explore how different prototype selection methods affect the final performance. Finally we conduct comparison experiments on three datasets including CASIA-IvS-Test, Public-IvS and Megaface-bisample.

4.1 Datasets

Ms-Celeb-1M: The Ms-Celeb-1M guo2016ms is one of the largest wild datasets, containing celebrities and million images. The list of wu2015light is adopted to clean the noisy labels, resulting in identities and million images.

CASIA-IvS: The CASIA-IvS dataset is collected for IvS face recognition. The training set CASIA-IvS-Train contains identities, each having two images. One image is the ID photo from the ID card, which is taken with uniform background, in frontal pose, normal illumination and neutral expression. The other is the spot photo taken by on-site devices, with variations in pose, expression, illumination, occlusion and resolution, as shown in Fig. 6. The test set CASIA-IvS-Test contains identities and images, which are checked manually to clean the noisy labels and ensure there is no identity overlap between training and test sets. During testing, all the ID photos and spot photos are paired, generating positive pairs and nearly million negative pairs.

Figure 6: Example images in CASIA-IvS.

Public-IvS: An IvS test dataset is released for open evaluation. We found some public characters, such as politicians, teachers and researchers, had their ID photos on BaiduBaike url-baidubaike and official pages. We recorded their names and collected their spot photos on the web. Afterwards, we cleaned the dataset manually and removed the profile-view images. The final Public-IvS dataset has identities and images, each identity having one ID photo and to spot photos. There are positive pairs and nearly million negative pairs. Fig. 7 shows some images in Public-IvS. Although Public-IvS is not a strictly IvS dataset since the spot photos are collected from the web, experiments on Public-IvS have consistent results with the real-world CASIA-IvS-Test.

Figure 7: Example images in Public-IvS.

4.2 Experimental Settings

Preprocessing   We detect faces by the FaceBox zhang2017faceboxes detector and localize landmarks (two eyes, nose tip and two mouth corners) by a simple -layer CNN feng2017wing . All the faces are normalized by similarity transformation and cropped to RGB images.

CNN Architecture   For the sake of fairness, all the CNN models in the experiments follow the same ResNet64 architecture liu2017sphereface . It has four residual blocks and gets a -dimensional feature vector by average pooling. The learning rate begins with and is divided by when the loss does not decrease. All the networks are trained on TITANX GPUs in parallel and the batch size is set to occupy all the GPU memory. Specifically, the batch size is in the verification scheme and about in the classification scheme.

Training Setup   There are three stages in the CVC training strategy: pre-learning by classification on wild data, transfer learning by verification on IvS data and fine-grained learning by large-scale classification on IvS data. In the first stage, we train model from scratch by the A-Softmax loss liu2017sphereface on the Ms-Celeb-1M. In the second stage, we finetune the model on CASIA-IvS-Train with the triplet loss schroff2015facenet . The triplet loss is modified by N-pairs batch construction sohn2016improved , online hard-negative mining schroff2015facenet and anchor swapping balntas2016learning . In the third stage, we adopt the proposed DP-softmax to finetune the model on CASIA-IvS-Train. If not specified, there are two samples for each class in a mini-batch; the classification layer in the third stage is initialized by the ID-prototypes; softmax provides the probabilities and A-softmax provides the gradients. In DP-softmax the sizes of dominant queues and candidate sets are and , respectively.

Evaluation Setup   For each image, we extract features from both the original image and the flipped one and concatenate them as the final representation. The score is measured by the cosine distance of two features. We evaluate all the networks with ROC curves. The verification rate (VR) at a low false acceptance rate (FAR) is preferred since, in real applications, a false acceptance carries a higher risk than a false rejection.
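The VR-at-FAR metric used throughout the experiments can be computed as in the sketch below. It assumes that a higher score means a more similar pair (cosine similarity) and picks the threshold from the sorted negative scores so that the share of accepted negatives does not exceed the target FAR. The function names and the toy scores are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def verification_rate_at_far(pos_scores, neg_scores, far):
    """Fraction of positive pairs accepted when at most `far` of negatives are.

    A pair is accepted when its score is strictly above the threshold, so
    negatives tied with the threshold are rejected and the FAR bound holds.
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    allowed = int(math.floor(far * len(neg_sorted)))  # tolerated false accepts
    if allowed >= len(neg_sorted):
        threshold = -math.inf
    else:
        threshold = neg_sorted[allowed]
    accepted = sum(1 for s in pos_scores if s > threshold)
    return accepted / len(pos_scores)

pos = [0.95, 0.5, 0.25]
neg = [0.1, 0.2, 0.3, 0.9]
vr_strict = verification_rate_at_far(pos, neg, far=0.0)
vr_loose = verification_rate_at_far(pos, neg, far=0.25)
```

With millions of negative pairs, as in the test sets described above, sorting the negative scores once and reading off several thresholds is cheaper than recomputing per FAR value.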

Method | Classification (A-Soft on MS) | Verification (Triplet on IvS) | Classification (DP-Soft on IvS) | VR@FAR= | VR@FAR= | VR@FAR=
C## | yes | - | - | 85.31 | 69.11 | 51.90
CV# | yes | yes | - | 96.41 | 91.39 | 83.23
CVC | yes | yes | yes | 97.70 | 96.17 | 91.92
##C | - | - | yes | not converge
#V# | - | yes | - | 79.39 | 58.90 | 38.33
C#C | yes | - | yes | 94.36 | 85.35 | 72.35
C(VC) | yes | joint | joint | 96.43 | 91.75 | 82.80
Table 2: The intermediate results after each stage in the CVC training strategy. The performance is evaluated by the verification rate, VR(%), on CASIA-IvS-Test. In each stage, we indicate the loss function and the training data, where A-Soft refers to A-softmax, Triplet refers to triplet loss, DP-Soft refers to DP-softmax, MS refers to Ms-Celeb-1M and IvS refers to CASIA-IvS-Train. The “#” in method names indicates the corresponding stage is skipped.

4.3 Bisample Training

4.3.1 Classification-Verification-Classification (CVC)

To illustrate the effectiveness of CVC, we show the intermediate results in Table 2. After the first stage, the C## is a well trained model in wild scenarios, with on LFW Huang-2007-LFW and at FAR on Megaface challenge nech2017level . However the state-of-the-art face model cannot work well on CASIA-IvS-Test, indicating the large bias between the two scenarios. Second, after being finetuned on CASIA-IvS-Train with the triplet loss, the CV# achieves much better performance, indicating the knowledge is successfully transferred from wild scenarios to IvS scenarios. Finally, the large-scale classification on CASIA-IvS-Train further improves the performance and reaches at FAR=.

To further analyze the impact of each stage, we perform an ablation study by removing some stages. First, in ##C we directly perform large-scale classification on IvS data without any initialization and find the loss does not decrease after iterations. Second, we try to train a model from scratch by the triplet loss on IvS data. Since the learning task is challenging without any initialization, we begin without hard-negative mining and slightly increase the ratio of hard negatives. The training converges but the model #V# performs poorly. Thirdly, we pre-train the model on wild data and directly finetune it on IvS data by large-scale classification. The training successfully converges but the resultant C#C is worse than the complete CVC. Finally, after pre-training on wild data, we perform joint verification and large-scale classification on IvS data, yielding the C(VC) model, which is also inferior to the complete CVC. From the results we can conclude that: (1) Comparing ##C, C#C and CVC, a good initialization is crucial for large-scale classification on bisample data. (2) Comparing C#C, CV# and CVC, the verification scheme has higher scalability than the classification scheme when dealing with large-scale bisample data, but it cannot reach satisfactory performance on its own. (3) Comparing C#C, C(VC) and CVC, smoothness is important in knowledge transfer and it is better to bridge the two classification stages with a verification stage.

There are some interesting phenomena we have observed in CVC learning. First, we find that the wild performance in the first stage does not affect the final IvS performance much. We begin with two pre-trained models with different wild performance ( on LFW with triplet loss and with A-softmax) and find their final IvS performances differ slightly ( vs. at FAR on IvS). Second, we find the model cannot keep its high wild performance after being finetuned on IvS data. We evaluate models on both CASIA-IvS-Test and LFW liao2014benchmark , shown in Table 3. After each stage of CVC, the IvS performance is improved at the cost of degenerated wild performance. We further train our model on the joint data from both scenarios and find the wild performance is greatly improved with slight drop in IvS. This joint training is a good strategy when both scenarios are concerned.

Method | CASIA-IvS-Test | LFW
C## | 51.90 | 94.23
CV# | 83.23 | 86.38
CVC | 91.92 | 80.71
CVC (joint) | 89.96 | 90.81
Table 3: The performances in wild scenarios (LFW, BLUFR protocol liao2014benchmark ) and IvS scenarios (CASIA-IvS-Test) after each stage of CVC. CVC (joint) means the final large-scale classification stage is performed on the joint data from both Ms-Celeb-1M and CASIA-IvS-Train.

4.3.2 Prototype Construction

As introduced in Section 3.2, there are two ways to construct the prototypes in large scale classification: The ID-prototype is the feature of the ID photo and the avg-prototype is the average vector of all the features in this class. The way to construct prototypes depends on the loss function involved. We select the most representative softmax Sun-CVPR-2014 and the state-of-the-art A-softmax liu2017sphereface in this experiment. Table 4 shows the performances with different losses and prototypes.

Method FAR= FAR= FAR=
softmax (avg) 96.94 93.55 87.13
softmax (ID) 97.31 94.91 89.55
A-softmax (avg) 97.35 95.40 90.30
A-softmax (ID) 97.43 95.40 90.34
Table 4: The comparison of different prototype construction methods with different loss functions on CASIA-IvS-Test, evaluated by VR(%) at different FAR. The prototypes are randomly selected.

When softmax is adopted, the model initialized by avg-prototypes almost converges in the beginning and the loss only produces small gradients. If we replace avg-prototypes with ID-prototypes, the softmax loss will have a larger initial loss and end up with better results. When A-softmax is adopted, the angular margin keeps the initial loss large enough and the two prototypes end up with close performances. In our experiments, we prefer ID-prototypes and only adopt avg-prototypes when there is no ID photo like the mimic experiments in Sec. 4.7.
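The two prototype constructions compared in this subsection can be sketched as follows. The L2 normalization is an assumption made here to match the unit-norm restriction on prototypes mentioned earlier; the helper names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def id_prototype(id_feature):
    """ID-prototype: the (normalized) feature of the class's ID photo."""
    return l2_normalize(id_feature)

def avg_prototype(features):
    """avg-prototype: the (normalized) mean of all features of the class."""
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    return l2_normalize(mean)

# Bisample class: one ID feature and one spot feature.
id_feat, spot_feat = [1.0, 0.0], [0.0, 1.0]
p_id = id_prototype(id_feat)
p_avg = avg_prototype([id_feat, spot_feat])
```

In this toy class the avg-prototype sits halfway between the two samples, while the ID-prototype coincides with one of them and leaves the spot feature farther away, which is consistent with the larger initial loss described above.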

4.4 Large-scale Classification

In large-scale classification, we need to select a fraction of prototypes each time. In Sec. 3.3 we introduce two methods for prototype selection: one is to select prototypes randomly and the other is to select the dominant prototypes.

4.4.1 Random Prototype Softmax

In random prototype softmax (RP-softmax), we can increase the number of involved classes at a small cost in batch size, thanks to the tiny memory cost of a single prototype. We evaluate RP-softmax with 20k, 50k and 100k prototypes respectively in Table 5 and find that more prototypes come with better performance.

Method FAR= FAR= FAR=
RPS(20k) 97.19 94.49 87.84
RPS(50k) 97.19 94.71 88.57
RPS(100k) 97.43 95.40 90.34
Table 5: The performance of RP-softmax (RPS) on CASIA-IvS-Test, evaluated by VR(%) at different FAR. The values in the brackets are the numbers of prototypes.

However, increasing the number of prototypes is not cost free. More prototypes increase the overhead of computing softmax and copying prototypes in GPUs. In Fig. 8, we show the time costs and GPU-util percent with different prototype numbers. When prototypes increase from k to k, the training time increases by and the GPU-util percent drops from to . We further try k prototypes and find the GPU-util percent drops to , which means most time is spent on waiting for prototype copying.

Figure 8: The total training time (forward and backward propagation) of one mini-batch and the GPU-util percent with different prototype numbers. Low GPU-util percent means the GPU is blocked to wait for prototype copying.

4.4.2 Dominant Prototype Softmax

To improve performance and training efficiency simultaneously, we select the dominant prototypes instead of the random prototypes. In DP-softmax we maintain a dominant queue for each class to store their similar classes, where the queue size is an important parameter that impacts both performance and training time. Table 6 shows the performances with different queue sizes and Fig. 9 shows the corresponding training time. We can see that the performance increases as the queue size increases, but quickly saturates when reaches with only prototypes. Considering both performance and efficiency we set in our implementation. Compared with RP-softmax with prototypes, DP-softmax achieves better performance ( vs. at FAR=) with much lower training time (s vs. s per iteration).

Method FAR= FAR= FAR=
DPS(0.3k) 96.62 92.48 85.12
DPS(0.6k) 96.77 93.69 86.57
DPS(1.5k) 97.16 94.37 88.29
DPS(3.0k) 97.70 96.17 91.92
DPS(10.0k) 97.72 96.27 92.01
Table 6: The performances of DP-softmax (DPS) with different queue sizes on CASIA-IvS-Test, evaluated by VR(%) at different FAR. DPS indicates the queue size is and the number of prototypes is . Note that there are two samples per class in the mini-batch, there are at most dominant prototypes.
Figure 9: The total training time (forward and backward propagation) of one mini-batch with different dominant queue size.

In Table 7, we also compare the performances with and without queue updating, which demonstrates the effectiveness of queue updating.

Method FAR= FAR= FAR=
w/o update 97.54 95.58 90.77
update 97.70 96.17 91.92
Table 7: The performances of DP-softmax with and without queue updating, evaluated by VR(%) at different FAR.

4.4.3 Softmax Formulation

Large-scale classification mainly involves a prototype selection strategy, which can be combined with any softmax formulation. Besides the traditional softmax Sun-CVPR-2014 , the state-of-the-art A-softmax liu2017sphereface and AM-softmax wang2018additive can also be adopted. Table 8 shows the results with different softmax formulations. We can see that A-softmax and AM-softmax have improved performance by introducing the margins and A-softmax has the best results.

Method FAR= FAR= FAR=
DPS+softmax 97.31 94.91 89.55
DPS+AM-softmax 97.60 95.59 90.73
DPS+A-softmax 97.70 96.17 91.92
Table 8: The performances of adopting different softmax formulations in large-scale classification, evaluated by VR(%) at different FAR on CASIA-IvS-Test. The dominant prototype selection (DPS) is adopted.

4.5 Identity Volume

It has been repeatedly observed that more data always delivers better performance schroff2015facenet ; sun2017revisiting . Does the blessing of data still exist in IvS face recognition? To study this, we randomly sample a subset of k, k and M identities from CASIA-IvS-Train and train the model, respectively. As shown in Fig. 10, the performance grows logarithmically as identities increase, which is consistent with sun2017revisiting . We believe more identities provide more information about intra- and inter-variations, which delivers more discriminative features. Besides, it is suggested that the model can be further improved with more IvS data.

Figure 10: The performances under different identity volume, evaluated by VR(%) at FAR=.

4.6 Comparison Experiments

In order to compare our method with the state of the art, we choose several methods feasible on large-scale bisample data, including Contrastive sun2014deep , Triplet schroff2015facenet , Lifted Struct oh2016deep , N-pairs sohn2016improved and the Models A-D in the Megaface challenge nech2017level (MF-A to MF-D). We also evaluate the large-scale classification methods in language models, including Noise Contrastive Estimation (NCE) gutmann2010noise and Hierarchical Softmax (H-softmax) bengio2003neural . For fair comparison, all the methods adopt the ResNet64 architecture and their models are pretrained on Ms-Celeb-1M. In our implementation, for contrastive, each sample is paired with all the other ones in a mini-batch and the negative pairs are filtered by hard negative mining. For triplet, we adopt N-pairs batch construction sohn2016improved and anchor swapping balntas2016learning to construct the most triplets. Besides, online hard mining schroff2015facenet is performed to remove easy triplets. For N-pairs, we adopt the N-pair-mc loss to optimize each positive pair against all the related negative pairs and use hard negative class mining to generate mini-batches with similar classes. For lifted struct, we directly use the released code. For MF-A, we train the model with softmax on randomly selected classes. Then we finetune MF-A on the full data with the triplet loss as MF-B. For MF-C, we adopt the rotating softmax where random classes are selected in each epoch. Then we adopt the same triplet finetuning strategy to get MF-D. For NCE and H-softmax, training directly with the two losses does not converge. In our implementation, we first train the models by the triplet loss and initialize the prototypes by the deep features as in our LBL, which makes the training converge.
For our methods, we first provide a naive baseline to perform softmax on IvS data named LBL(softmax), where in the final stage of CVC we train model only on classes (which are the most classes affordable by the machine) and the classes do not change as training proceeds. Besides, we report LBL with RP-softmax and DP-softmax.

Table 9 shows the performances on the real-world CASIA-IvS-Test and the open Public-IvS. Fig. 11(a) and Fig. 11(b) show the corresponding ROC curves. During implementation, we find MF-A cannot achieve satisfactory performance since only a small part of the data can be used. MF-C is hard to converge since the rotating softmax randomly re-initializes the prototypes periodically. After being finetuned by the triplet loss on all the data, the models (MF-B and MF-D) still fail to reach satisfactory performance due to the poor initializations. As for our method LBL, we can see Public-IvS shows consistent results with CASIA-IvS-Test, where our methods perform best. Besides, LBL significantly outperforms other methods on IvS data, especially at low FAR. The improvement at FAR= is to on CASIA-IvS-Test and to on Public-IvS. The DP-softmax further improves on the RP-softmax and achieves the best performance. LBL also achieves better recognition rates than the large-scale classification methods in language models like NCE and H-softmax.

Method | CASIA-IvS-Test | Public-IvS
Contrastive sun2014deep | 96.25 91.17 81.39 | 96.52 91.71 84.54
Triplet schroff2015facenet | 96.41 91.39 83.23 | 97.72 94.11 87.47
Lifted Struct oh2016deep | 96.42 92.25 83.53 | 98.03 94.56 88.63
N-pairs sohn2016improved | 96.45 92.13 83.96 | 98.23 94.57 86.49
NCE gutmann2010noise (Triplet-init) | 96.30 91.18 82.62 | 97.90 93.93 87.27
H-softmax bengio2003neural (Triplet-init) | 96.50 92.36 84.16 | 98.01 94.54 87.45
MF-A nech2017level | 51.61 33.82 20.67 | 49.42 28.40 14.99
MF-B nech2017level | 75.24 53.85 35.09 | 66.68 44.40 28.66
MF-C nech2017level | 51.02 31.11 15.05 | 43.41 24.20 12.55
MF-D nech2017level | 75.04 52.46 31.84 | 64.68 42.85 25.23
LBL(softmax) | 97.01 93.69 86.68 | 98.38 95.49 89.63
LBL(RP-softmax) | 97.43 95.40 90.34 | 98.44 96.29 91.99
LBL(DP-softmax) | 97.70 96.17 91.92 | 98.83 97.21 93.62
Table 9: The performances of the state-of-the-art methods, evaluated by the VR(%) at different FAR (three FAR values per test set). The models are trained on CASIA-IvS-Train and evaluated on CASIA-IvS-Test and Public-IvS, with our method and the best baseline highlighted.
(a) CASIA-IvS-Test
(b) Public-IvS
(c) Megaface-bisample
Figure 11: Comparison of ROC curves on CASIA-IvS-Test, Public-IvS and Megaface-bisample. The values in the brackets are the VR(%) at FAR=.

4.7 Mimic Experiments on Megaface-bisample

To make our work reproducible, we mimic the large-scale bisample challenge on the open MF2 nech2017level dataset and propose a new protocol Megaface-bisample. The MF2 contains identities which are much more than other datasets. We split MF2 into two subsets, MF2-thick and MF2-mini. The MF2-thick contains the identities having more than samples, which is used to simulate the well-posed dataset for pre-learning. The MF2-mini contains two randomly selected samples for each identity, which is used to simulate the bisample data. As for testing, we follow the BLUFR protocol liao2014benchmark on LFW Huang-2007-LFW . In summary, MF2-thick, MF2-mini and LFW-BLUFR simulate Ms-Celeb-1M, CASIA-IvS-Train and CASIA-IvS-Test, respectively. Specifically, MF2-thick has identities and samples per identity and MF2-mini has cleaned identities and samples per identity, whose image list will be released. As well known, MF2 has few celebrities and we have tried our best to ensure there is no identity overlap between MF2 and LFW. Although Megaface-bisample is not IvS data, it shares the same challenges: the weak intra-variations and model training scalability, as IvS data. Since there is no ID photo in MF2, we initialize the classification layer with avg-prototypes and construct the by avg-prototypes instead of ID features.

First, to verify the effectiveness of the simulation, we re-implement the experiments of Table 2 on the CVC training strategy. As shown in Fig. 12, there is significant improvement after each stage. Besides, we try to train a model from scratch on MF2-mini and find the training quickly falls into bad local optima. Since the results are consistent with the ones on CASIA-IvS, we believe Megaface-bisample can well simulate our task.

Figure 12: The intermediate results of CVC, following the Megaface-bisample protocol.

On Megaface-bisample we also compare our methods with the state of the art in Table 10, whose ROC curves are shown in Fig. 11(c). The proposed LBL still consistently outperforms the other methods and the improvement at FAR is over percent.

Contrastive sun2014deep 93.74 82.53 63.75
Triplet schroff2015facenet 93.54 82.63 65.06
Lifted Struct oh2016deep 90.50 75.46 53.45
N-pairs sohn2016improved 90.16 73.40 50.30
LBL(DP-softmax) 95.68 88.03 73.86
Table 10: The verification rates, VR(%), at different false acceptance rates (FAR) on LFW-BLUFR following the Megaface-bisample, with the top-2 results highlighted.

5 Conclusion

This paper proposes a large-scale bisample learning (LBL) method to train deep neural networks on ID versus Spot (IvS) face data. Specifically, we develop a Classification-Verification-Classification (CVC) bisample training strategy that first transfers the knowledge from wild scenarios to IvS scenarios and then boosts the performance by large-scale classification. We also propose a dominant prototype softmax (DP-softmax) to perform -million classification, which is used in the final stage of CVC. The DP-softmax diligently selects the dominant prototypes for each mini-batch, which improves the performance and reduces the training cost simultaneously. Experiments on a large real-world dataset show the proposed LBL significantly improves the IvS face recognition and the DP-softmax can perform effective classification with only of classes. Besides, we also release a Public-IvS dataset for open IvS evaluation and a new protocol Megaface-bisample to mimic the large-scale bisample learning task.

6 Acknowledgments

This work was supported by the Chinese National Natural Science Foundation Projects #61876178, #61806196, the National Key Research and Development Plan (Grant No.2016YFC0801002), and AuthenMetric R&D Funds. Zhen Lei is the corresponding author.