FaceX-Zoo: A PyTorch Toolbox for Face Recognition

Deep learning based face recognition has achieved significant progress in recent years. Yet practical model production and further research on deep face recognition still lack adequate public support. For example, producing a face representation network calls for a modular training scheme that allows a proper choice among the many state-of-the-art backbones and training supervisions, subject to real-world face recognition demands; for performance analysis and comparison, standard and automatic evaluation of a set of models on multiple benchmarks is a desirable tool as well; in addition, a public groundwork for deploying face recognition as a holistic pipeline would be welcome. Furthermore, there are newly emerged challenges, such as masked face recognition caused by the recent worldwide COVID-19 pandemic, which draw increasing attention in practical applications. A feasible and elegant solution is to build an easy-to-use unified framework that meets the above demands. To this end, we introduce a novel open-source framework, named FaceX-Zoo, which is oriented to the research and development community of face recognition. Thanks to its highly modular and scalable design, FaceX-Zoo provides a training module with various supervisory heads and backbones towards state-of-the-art face recognition, as well as a standardized evaluation module that makes it possible to evaluate models on most of the popular benchmarks just by editing a simple configuration. Also, a simple yet fully functional face SDK is provided for the validation and primary application of the trained models. Rather than including as many prior techniques as possible, we design FaceX-Zoo to be easy to upgrade and extend along with the development of face related domains. The source code and models are available at https://github.com/JDAI-CV/FaceX-Zoo.





1 Introduction

Deep learning based face recognition has witnessed great progress in the research field. Correspondingly, a number of excellent open-source projects have emerged to facilitate the experiments and production of deep face recognition networks. For example, Facenet [1] is a TensorFlow [5] implementation of the model proposed by Schroff et al. [26], which is a classic project for deep face recognition. OpenFace [6] is a general face recognition library, especially for the support of mobile device applications. InsightFace [2] is a toolbox for 2D&3D deep face analysis, mainly written in MXNet [8]. It includes the commonly-used training data, network settings and loss functions. face.evoLVe [3] provides a comprehensive face recognition library for face related analytics and applications. Although these projects have been widely used and have brought a great deal of convenience, the rapid development of deep face recognition techniques creates a significant need for a more comprehensive framework and standard evaluation to facilitate research and development. To this end, we develop a new framework, named FaceX-Zoo, in the form of a PyTorch [25] library, which is highly modular, flexible and scalable. It is composed of a state-of-the-art training pipeline for discriminative face feature learning, a standard evaluation towards fair comparisons, and a deployment SDK for efficient proof of concept and further applications. We release all the source code and trained models to help the community develop their own solutions to various real-world challenges from the perspective of training, evaluation, and deployment. We hope that FaceX-Zoo is able to provide helpful support to the community and promote the development of face recognition.

The remaining part of this paper is organized as follows. In Section 2, we depict the structure and the highlights of FaceX-Zoo. In Section 3, we introduce the detailed design of this project. Section 4 provides the experiments with respect to the various supervisory heads and backbones that are integrated in the training module, and reports the test accuracies on the commonly-used benchmarks which are also provided by the evaluation module. Section 5 presents our solutions for two practical situations, i.e. shallow face learning and masked face recognition. Finally, we discuss future work and give the conclusion in Section 6 and Section 7, respectively.

2 Overview of FaceX-Zoo

2.1 Architecture

The overall architecture of FaceX-Zoo is presented in Figure 1. The whole project mainly consists of four parts: the training module, the evaluation module, the additional module and the face SDK, where the former two modules are the core part of this project. Several components are contained in the training and evaluation modules, including Pre-Processing, Training Mode, Backbone, Supervisory Head and Test Protocol. We elaborate on them below.

Pre-Processing. This module fulfils the basic transformations on images before sending them to the network. For training, we implement the commonly-used operations, such as resizing, normalization, random cropping, random flipping, random rotation, etc. One can add the customized operations flexibly, according to various demands. For evaluation, only resizing and normalization are employed. Likewise, the testing augmentations, such as five crops, horizontal flipping, etc., can also be easily added into our framework by customizing.
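The pre-processing steps named above can be illustrated with a small sketch. It is not the FaceX-Zoo implementation: the function names, the nested-list image representation, and the normalization constants (mean 127.5, scale 128) are assumptions made for illustration, and resizing is omitted for brevity.

```python
import random

def normalize(image, mean=127.5, std=128.0):
    """Map 8-bit pixel values to roughly [-1, 1] (constants are assumed)."""
    return [[(p - mean) / std for p in row] for row in image]

def random_flip(image, rng, prob=0.5):
    """Mirror the image horizontally with probability `prob`."""
    if rng.random() < prob:
        return [row[::-1] for row in image]
    return image

def random_crop(image, size, rng):
    """Cut a random size x size window out of a 2D image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]

def train_transform(image, rng, crop_size):
    """Training path: random crop, random flip, then normalization."""
    return normalize(random_flip(random_crop(image, crop_size, rng), rng))

def eval_transform(image):
    """Evaluation path: deterministic, normalization only."""
    return normalize(image)
```

Keeping the training and evaluation paths as separate functions mirrors the distinction drawn in the text: randomness belongs only to the training path, while test-time augmentations would be added explicitly to the evaluation path.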

Training Mode. The conventional training mode of face recognition is treated as the baseline routine. Concretely, it schedules the training inputs by DataLoader, then sends the inputs to the backbone network for forward passing, and finally computes a criterion as the training loss for backward updating. In addition, we consider a practical situation in face recognition, namely training the network with shallow distributed data [11]. Accordingly, we integrate a recent training strategy to facilitate the training on shallow face data.

Backbone. The backbone network is used to extract the features of face images. We provide a series of state-of-the-art backbone architectures in FaceX-Zoo, which are listed below. Besides, any other architecture can be easily added with the support of PyTorch, simply by modifying the configuration file and adding the architecture definition file.

  • MobileFaceNet [7]: An efficient network for applications on mobile devices.

  • ResNet [15]: A series of classic architectures for general vision tasks.

  • SE-ResNet [16]: ResNet equipped with SE blocks that recalibrate the channel-wise feature responses.

  • HRNet [31]: A network for deep high-resolution representation learning.

  • EfficientNet [28]: A family of architectures scaled jointly across depth, width and resolution.

  • GhostNet [14]: A model aiming at generating more feature maps from cheap operations.

  • AttentionNet [30]: A network built by stacking attention modules to learn attention-aware features.

  • TF-NAS [17]: A series of architectures searched by NAS with the latency constraint.

Supervisory Head. Supervisory Head is defined as the supervision signal and its corresponding computation module towards accurate face recognition. In order to learn discriminative features for face recognition, the predicted logits are usually processed by some specific operations, such as normalization, scaling, adding margin, etc., before being sent to the softmax layer. We implement a series of softmax-style losses in FaceX-Zoo as follows:

  • AM-Softmax [29]: An additive margin loss that adds a cosine margin penalty to the target logit.

  • ArcFace [9]: An additive angular margin loss that adds a margin penalty to the target angle.

  • AdaCos [37]: A cosine-based softmax loss that is hyperparameter-free and scales adaptively.

  • AdaM-Softmax [21]: An adaptive margin loss that can adjust the margins for different classes adaptively.

  • CircleLoss [27]: A unified formula that learns with class-level labels and pair-wise labels.

  • CurricularFace [19]: A loss function that adaptively adjusts the importance of easy and hard samples during different training stages.

  • MV-Softmax [34]: A loss function that adaptively emphasizes the mis-classified feature vectors to guide the discriminative feature learning.

  • NPCFace [36]: A loss function that emphasizes the training on both the negative and positive hard cases.
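To make the "normalization, scaling, adding margin" processing concrete, the sketch below computes the logits of two of the listed heads for a single sample: AM-Softmax, which subtracts a margin from the target cosine, and ArcFace, which adds a margin to the target angle. It is a plain-Python illustration, not the FaceX-Zoo code; the function names and the default scale and margin values are assumptions, and the handling of angles that exceed π after adding the margin is left out.

```python
import math

def cosine(u, v):
    """Cosine similarity, i.e. the logit after L2-normalizing both vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def margin_logits(feature, class_weights, target, head="am_softmax",
                  scale=32.0, margin=0.35):
    """Scaled cosine logits with a margin applied to the target class only."""
    logits = []
    for idx, weight in enumerate(class_weights):
        cos_theta = max(-1.0, min(1.0, cosine(feature, weight)))
        if idx == target:
            if head == "am_softmax":
                cos_theta = cos_theta - margin                        # cos(theta) - m
            elif head == "arcface":
                cos_theta = math.cos(math.acos(cos_theta) + margin)   # cos(theta + m)
            else:
                raise ValueError(f"unknown head: {head}")
        logits.append(scale * cos_theta)
    return logits

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]
```

Because the margin lowers only the target logit, the loss for a given sample is larger than with plain scaled cosine logits, which pushes the network towards more compact classes.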

Test Protocol. There are various benchmarks to measure the accuracy of face recognition models. Many of them focus on specific face recognition challenges, such as cross age, cross pose, and cross race. Among them, the commonly used test protocols are mainly based on the benchmarks of LFW [18] and MegaFace [20]. We integrate these protocols into FaceX-Zoo with simple usage and clear instructions, by which people can easily test their models on single or multiple benchmarks via simple configurations. Besides, it is convenient to add further test protocols by adding the test data and parsing the test pairs. It is worth noting that a masked face recognition benchmark based on MegaFace is provided as well.

  • LFW [18]: It contains 13,233 web-collected images of 5,749 identities with the pose, expression and illumination variations. We report the mean accuracy of 10-fold cross validation on this classic benchmark.

  • CPLFW [38]: It contains 11,652 images of 3,930 identities, which focuses on cross-pose face verification. Following the official protocol, the mean accuracy of 10-fold cross validation is adopted.

  • CALFW [39]: It contains 12,174 images of 4,025 identities, aiming at cross-age face verification. The mean accuracy of 10-fold cross validation is adopted.

  • AgeDB30 [24]: It contains 12,240 images of 440 identities, where each test pair has an age gap of 30 years. We report the mean accuracy of 10-fold cross validation.

  • RFW [32]: It contains 40,607 images of 11,430 identities, which is proposed to measure the potential racial bias in face recognition. There are four test subsets in RFW, named African, Asian, Caucasian and Indian, and we report the mean accuracy of each subset, respectively.

  • MegaFace [20]: It contains 80 probe identities with 1 million gallery distractors, aiming at evaluating large-scale face recognition performance. We report the Rank-K identification accuracy on MegaFace.

  • MegaFace-Mask: It contains the same probe identities and gallery distractors as MegaFace [20], while a virtual mask is added to each probe image. This protocol is designed to evaluate large-scale masked face recognition performance. More details can be found in Section 5.2. We report the Rank-K identification accuracy on MegaFace-Mask.
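The "mean accuracy of 10-fold cross validation" reported for the LFW-style benchmarks can be sketched as follows: for each fold, a decision threshold is chosen on the other nine folds and then applied to the held-out fold. This is a simplified illustration rather than the toolbox code. The function names are hypothetical, and the folds are built by interleaving indices, whereas the official protocols ship predefined folds.

```python
def accuracy(scores, labels, threshold):
    """Fraction of pairs whose same/different decision matches the label."""
    correct = sum((s >= threshold) == bool(l) for s, l in zip(scores, labels))
    return correct / len(scores)

def kfold_mean_accuracy(scores, labels, n_folds=10):
    """Mean held-out accuracy, with the threshold tuned on the other folds."""
    indices = list(range(len(scores)))
    folds = [indices[i::n_folds] for i in range(n_folds)]  # assumed fold split
    candidates = sorted(set(scores))
    fold_accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        train_scores = [scores[i] for i in train]
        train_labels = [labels[i] for i in train]
        # Pick the threshold that works best on the nine training folds.
        best = max(candidates,
                   key=lambda t: accuracy(train_scores, train_labels, t))
        test_scores = [scores[i] for i in held_out]
        test_labels = [labels[i] for i in held_out]
        fold_accuracies.append(accuracy(test_scores, test_labels, best))
    return sum(fold_accuracies) / n_folds
```

Here `scores` holds one similarity value per test pair and `labels` marks whether the pair shows the same identity; tuning the threshold on the training folds only avoids leaking the held-out fold into the decision rule.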

2.2 Characteristics

Modular and extensible design. As described above, FaceX-Zoo is designed to be modular and extensible. It consists of a set of modules with respective functions. Most of the modules are developed following the principle of object-oriented design, which greatly improves scalability. One can easily add new training modes, backbones, supervisory heads and data samplers to the training module, as well as more test protocols to the evaluation module. Last but not least, we provide the face SDK and the additional module for efficient deployment and flexible extension according to various demands.

State-of-the-art training. We provide several state-of-the-art practices for face recognition model training, such as the complete pre-processing operations for data augmentation, the various backbone networks for model ensemble, the softmax-style loss functions for discriminative feature learning, and the Semi-Siamese Training mode for practical shallow face learning.

Easy to use and deploy. We release all the code, models and training logs for reproducing the state-of-the-art results. Besides, we provide a simple yet fully functional face SDK written in Python, which acts as a demo to help users learn the usage of each module and develop further versions.

Standardized evaluation module. The commonly used evaluation benchmarks for face recognition are in need of a unified implementation for efficient and fair evaluation. For example, the official test protocol of MegaFace is implemented as a bin file, which makes it inconvenient to apply in many evaluation conditions. FaceX-Zoo provides a standard and open-source implementation for evaluating on LFW-based and MegaFace-based benchmarks. Users can efficiently evaluate models on various benchmarks by editing the configuration file. We will also release the 106-point facial landmarks defined in [22] so that users can utilize them for face alignment of these benchmarks.

Support for masked face recognition. Recently, due to the COVID-19 pandemic, masked face recognition has attracted increasing attention. Developing such a model requires three essential components: masked face training data, a masked face training algorithm and a masked face evaluation benchmark. FaceX-Zoo provides all three components via the 3D face mask adding technique.

3 Detailed Design

In this section, we describe in detail the design of the training module (Figure 2), the evaluation module (Figure 3), and the face SDK (Figure 4), all of which are modular and extensible.

Backbone LFW CPLFW CALFW AgeDB RFW (Afr) RFW (Asi) RFW (Cau) RFW (Ind) MegaFace
MobileFaceNet 99.57 83.33 93.82 95.97 88.73 88.02 95.70 90.85 90.39
ResNet50-ir 99.78 88.20 95.47 97.77 95.25 93.95 98.57 95.80 96.67
ResNet152-irse 99.85 89.72 95.56 98.13 95.85 95.43 99.08 96.27 97.48
HRNet 99.80 88.89 95.48 97.82 95.87 94.77 99.08 95.93 97.32
EfficientNet-B0 99.55 84.72 94.37 96.63 89.67 89.32 96.10 91.93 91.38
TF-NAS-A 99.75 85.90 94.87 97.23 91.97 91.62 97.43 93.33 94.42
GhostNet 99.65 85.30 93.92 96.08 88.67 88.48 95.13 90.63 87.88
Attention-56 99.88 89.18 95.65 98.12 96.52 95.72 99.13 96.83 97.75
Table 1: The performance (%) with different backbones, where RFW (Afr), RFW (Asi), RFW (Cau) and RFW (Ind) denote the African, Asian, Caucasian and Indian test protocols in RFW, respectively. Apart from MegaFace, we report the mean accuracies on these benchmarks. For MegaFace, we report the Rank-1 accuracy.

Supervisory head LFW CPLFW CALFW AgeDB RFW (Afr) RFW (Asi) RFW (Cau) RFW (Ind) MegaFace
AM-Softmax 99.58 83.63 93.93 95.85 88.38 87.88 95.55 91.18 88.92
AdaM-Softmax 99.58 83.85 93.50 96.02 87.90 88.37 95.32 91.13 89.40
AdaCos 99.65 83.27 92.63 95.38 85.88 85.50 94.35 88.27 82.95
ArcFace 99.57 83.68 93.98 96.23 88.22 88.00 95.13 90.70 88.39
MV-Softmax 99.57 83.33 93.82 95.97 88.73 88.02 95.70 90.85 90.39
CurricularFace 99.60 83.03 93.75 95.82 88.20 87.33 95.27 90.57 87.27
CircleLoss 99.57 83.42 94.00 95.73 89.25 88.27 95.32 91.48 88.75
NPCFace 99.55 83.80 94.13 95.87 88.08 88.20 95.47 91.03 89.13
Table 2: The performance (%) with different supervisory heads, where RFW (Afr), RFW (Asi), RFW (Cau) and RFW (Ind) denote the African, Asian, Caucasian and Indian test protocols in RFW, respectively. Apart from MegaFace, we report the mean accuracies on these benchmarks. For MegaFace, we report the Rank-1 accuracy.

3.1 Training Module

As shown in Figure 2, the TrainingMode is the core class to aggregate all the other classes in the training module. There are mainly three classes aggregated in the TrainingMode: (1) BackboneFactory is a factory class to provide the backbone network; (2) HeadFactory is a factory class to produce the supervisory head according to the configuration; (3) DataLoader is in charge of loading the training data.
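The factory arrangement described above can be sketched as a registry that maps a configured name to a constructor. The sketch is a hedged illustration, not the actual FaceX-Zoo classes: the stub backbones, the registry, the `get_backbone` method and the configuration layout are all assumptions standing in for real network definitions and configuration files.

```python
from dataclasses import dataclass

@dataclass
class MobileFaceNetStub:
    """Placeholder for a real backbone network (hypothetical)."""
    feat_dim: int

@dataclass
class ResNetStub:
    """Placeholder for a real backbone network (hypothetical)."""
    depth: int
    feat_dim: int

# Adding a backbone means registering one more name here.
BACKBONE_REGISTRY = {
    "MobileFaceNet": MobileFaceNetStub,
    "ResNet": ResNetStub,
}

class BackboneFactory:
    """Builds the backbone selected by name, using parameters from a config."""

    def __init__(self, backbone_type, config):
        if backbone_type not in BACKBONE_REGISTRY:
            raise ValueError(f"unsupported backbone: {backbone_type}")
        self.backbone_type = backbone_type
        self.params = config[backbone_type]

    def get_backbone(self):
        return BACKBONE_REGISTRY[self.backbone_type](**self.params)
```

A head factory would follow the same pattern, so that the training class only needs the two configured names and never refers to a concrete architecture or loss directly.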

3.2 Evaluation Module

As depicted in Figure 3, the LFWEvaluator and the MegaFaceEvaluator are the core classes in the evaluation module. Both of them contain the CommonExtrator class for face feature extraction. The CommonExtrator class depends on the ModelLoader class and the DataLoader class, where the former loads the models and the latter loads the test data. Besides, the LFWEvaluator class also aggregates the PairsParseFactory class for parsing the test pairs in each test set. In contrast, we provide two classes for MegaFace-based evaluations, named CommonMegaFaceEvaluator and MaskedMegafaceEvaluator, for the MegaFace evaluation and the MegaFace-Mask evaluation, respectively. Both of them inherit from the MegaFaceEvaluator class.
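The Rank-K identification accuracy that the MegaFace-based evaluators report (Section 2.1) can be sketched as below. This is a simplified illustration with a hypothetical function name and input layout; the details of the official MegaFace protocol, such as how probe and gallery sets are assembled, are not reproduced here.

```python
def rank_k_accuracy(probe_results, k):
    """Fraction of probes whose true match ranks within the top k.

    `probe_results` is a list of (genuine_score, distractor_scores) tuples,
    where higher scores mean higher similarity.
    """
    hits = 0
    for genuine_score, distractor_scores in probe_results:
        # Rank 1 means no distractor scores higher than the true match.
        rank = 1 + sum(1 for d in distractor_scores if d > genuine_score)
        if rank <= k:
            hits += 1
    return hits / len(probe_results)
```

With one million distractors per probe, a real evaluator would compute these comparisons in batches on feature matrices, but the counting rule stays the same.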

Figure 2: The class diagram of training module.
Figure 3: The class diagram of evaluation module.
Figure 4: The class diagram of the face SDK.

3.3 Face SDK

In order to validate and demonstrate the effectiveness of the trained models for face recognition in a convenient way, we provide a simple yet fully functional Face SDK module. As shown in Figure 4, Face SDK includes three core classes, named ModelLoader, ImageCropper and ModelHandler. The ModelLoader class is used to load the models of face detection, face landmark localization and face feature extraction. The ImageCropper class crops the facial area from the input image according to the detected facial landmarks, and outputs the normalized face crop. The ModelHandler class provides pre-processing and post-processing operations, as well as the inference interface.

In Face SDK, we provide a series of models, i.e. face detection, facial landmark localization, and face recognition, for the non-masked face recognition and masked face recognition scenarios. Specifically, for the non-masked face recognition scenario, we train the face detection model by RetinaFace [10] on the WiderFace dataset [35]. The facial landmark localization model is trained by PFLD [13] on the JD-landmark dataset [22]. We train the face recognition model with MobileFaceNet [7] and MV-Softmax [33] on MS-Celeb-1M-v1c [4]. For the masked face recognition scenario, we train the models with the same algorithms as the non-masked scenario while the training data is expanded by our FMA-3D method described in Section 5.2. We will continuously update the models with more methods in the future.

4 Experiments of SOTA Components

To help readers reproduce our results and carry out their own work with our framework, we conduct extensive experiments on the backbones and supervisory heads with the state-of-the-art methods. The adopted backbones and supervisory heads are listed in Section 2.1. We use MS-Celeb-1M-v1c [4] as the training data, which is well cleaned. For clear presentation, in the experiments of backbone, we adopt the same supervisory head, i.e. MV-Softmax [34]; in the experiments of supervisory head, we adopt the same backbone, i.e. MobileFaceNet [7]. The remaining settings are kept the same for each trial. Four NVIDIA Tesla P40 GPUs are employed for training. We set the total number of epochs to 18 and the batch size to 512 for training. The learning rate is initialized as 0.1, and divided by ten at epochs 10, 13 and 16. The test results of the experiments of backbone and supervisory head are shown in Table 1 and Table 2, respectively. One can refer to these results for guiding and verifying the usage of our framework.
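The learning rate schedule stated above (0.1 initially, divided by ten at epochs 10, 13 and 16) corresponds to a simple step function. The sketch below is illustrative; the function name is hypothetical, and whether epochs are counted from zero is an assumption. In PyTorch, the same behaviour is typically obtained with `torch.optim.lr_scheduler.MultiStepLR` using `milestones=[10, 13, 16]` and `gamma=0.1`.

```python
def step_lr(epoch, base_lr=0.1, milestones=(10, 13, 16), gamma=0.1):
    """Learning rate for a given epoch under a multi-step decay schedule."""
    # Count how many milestones have been reached (epochs assumed 0-indexed).
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# The rate used in each of the 18 training epochs.
schedule = [step_lr(epoch) for epoch in range(18)]
```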

5 Task-specific Solutions

In this section, we present specific solutions for two challenging face recognition tasks within the framework of FaceX-Zoo: Semi-Siamese Training [11] for shallow face learning, and masked face recognition for the recent demand caused by the COVID-19 pandemic.

Training Mode LFW CPLFW CALFW AgeDB RFW (Afr) RFW (Asi) RFW (Cau) RFW (Ind)
Conventional Training 91.77 61.56 76.52 73.90 61.35 67.38 73.27 70.12
Semi-siamese Training 99.38 82.53 91.78 93.60 85.03 85.25 92.80 87.40
Table 3: The performance (%) of different training modes applied on shallow data. RFW (Afr), RFW (Asi), RFW (Cau) and RFW (Ind) denote the African, Asian, Caucasian and Indian test protocols in RFW, respectively.

5.1 Shallow Face Learning

Background. In many real-world scenarios of face recognition, the training dataset is limited in depth, e.g. only two face images are available for each ID. This task, called Shallow Face Learning in [11], is problematic for the conventional training methods for face recognition. The shallow face data severely lacks intra-class diversity for each ID, which leads to the collapse of the feature dimension and hinders effective training. Consequently, the trained network suffers from either model degeneration or over-fitting. As suggested in [11], we adopt Semi-Siamese Training (SST) to tackle this issue. Furthermore, we implement it within the framework of FaceX-Zoo, in which the upstream and downstream stages (i.e. efficient data reading and unified automatic evaluation) complete the pipeline and make it easy for users to employ SST for model production.

Experiments and results. For a quick verification of the effectiveness of FaceX-Zoo towards shallow face learning, we employ an off-the-shelf architecture, i.e. MobileFaceNet, as the model backbone, and perform a comparison experiment between the conventional training and SST. Following the settings of [11], the training dataset is constructed by randomly selecting two facial images from each ID of MS-Celeb-1M-v1c, called MS-Celeb-1M-v1c-Shallow. The number of training epochs is set to 250 and the batch size is set to 512. The learning rate is initialized as 0.1, and divided by ten at epochs 150, 200 and 230. The test results on LFW, CPLFW, CALFW, AgeDB and RFW are presented in Table 3, which verifies the effectiveness of poly-mode training on the shallow data.
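The construction of the shallow training set, i.e. randomly selecting two images per identity, can be sketched as follows. The function name and the (path, identity) input format are hypothetical, and skipping identities with fewer than two images is an assumption that the text does not address.

```python
import random
from collections import defaultdict

def make_shallow_subset(samples, images_per_id=2, seed=0):
    """Keep `images_per_id` randomly chosen images for every identity.

    `samples` is an iterable of (image_path, identity) tuples.
    """
    by_identity = defaultdict(list)
    for path, identity in samples:
        by_identity[identity].append(path)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    shallow = []
    for identity in sorted(by_identity):
        paths = by_identity[identity]
        if len(paths) < images_per_id:
            continue  # assumption: identities with too few images are dropped
        for path in rng.sample(paths, images_per_id):
            shallow.append((path, identity))
    return shallow
```

Fixing the seed matters here because the comparison between conventional training and SST is only fair when both modes see exactly the same shallow subset.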

5.2 Masked Face Recognition

Background. Due to the recent worldwide COVID-19 pandemic, masked face recognition has become a crucial application demand in many scenarios. However, few masked face datasets are available for training and evaluation. To address this issue, we equip FaceX-Zoo with a specialized module, named FMA-3D (3D-based Face Mask Adding), that adds virtual masks to existing face images.

FMA-3D. Given a real masked face image A (Fig. 5(a)) and a non-masked face image B (Fig. 5(d)), we synthesize a photo-realistic masked face image with the mask from A and the facial area from B. First, we utilize a mask segmentation model [23] to extract the mask area from image A (Fig. 5(b)), and then map the texture map into UV space by the 3D face reconstruction method PRNet [12] (Fig. 5(c)). For image B, we compute the texture map in UV space in the same way as for A (Fig. 5(e)). Next, we blend the mask texture map and the face texture map in UV space as Fig. 5(f) shows. Finally, the masked face image is synthesized (Fig. 5(g)) by rendering the blended texture map according to the UV position map of image B. Fig. 6 shows more cases of masked face images synthesized by FMA-3D.
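The blending step in UV space can be illustrated with a per-pixel alpha composite. The text does not state the exact blending rule used by FMA-3D, so the formula below is an assumption chosen for illustration, as are the function name and the nested-list texture representation. Segmentation, UV mapping and rendering are outside the scope of this sketch.

```python
def blend_uv_textures(face_uv, mask_uv, mask_alpha):
    """Composite a mask texture over a face texture in UV space.

    All three inputs are 2D grids of equal size. `mask_alpha` is 1.0 where
    the segmented mask covers the UV pixel and 0.0 where the face is visible;
    intermediate values give a soft edge.
    """
    blended = []
    for face_row, mask_row, alpha_row in zip(face_uv, mask_uv, mask_alpha):
        blended.append([
            alpha * mask_px + (1.0 - alpha) * face_px
            for face_px, mask_px, alpha in zip(face_row, mask_row, alpha_row)
        ])
    return blended
```

Working in UV space is what makes this simple rule pose-independent: both textures are already registered to the same face parameterization, so the head pose only enters at the final rendering step.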

Compared with the 2D-based and GAN-based methods, our method shows superior performance on the robustness and fidelity, especially for the large head poses.

Figure 5: The method for wearing virtual masks on face image. The mask template can be sampled from various choices subject to the input masked face.
Figure 6: Top: the original non-masked face images. Bottom: the masked face image synthesized by FMA-3D.

Training masked face recognition model. With FMA-3D, it is convenient to synthesize a large number of masked face images from existing non-masked datasets, such as MS-Celeb-1M-v1c. Since the existing datasets already have ID annotations, we can directly employ them for training the face recognition network without additional labeling. The training method can be either the conventional routine or SST, and the training head and backbone can be instantiated with any of the choices integrated in FaceX-Zoo. Note that a testing benchmark can be converted from its non-masked to a masked version in the same manner.

Experiments and results. By using FMA-3D, we synthesize a masked version of the training data MS-Celeb-1M-v1c, named MS-Celeb-1M-v1c-Mask. It includes the original face images of each identity in MS-Celeb-1M-v1c, as well as the masked face images corresponding to the original ones. We choose MobileFaceNet as the backbone, and MV-Softmax as the supervisory head. The model is trained for 18 epochs with a batch size of 512. The learning rate is initialized as 0.1, and divided by ten at epochs 10, 13 and 16. To evaluate the model on the masked face recognition task, we synthesize a masked dataset based on MegaFace by using FMA-3D, named MegaFace-Mask, which contains masked probe images and keeps the gallery images non-masked. As shown in Figure 7, we conduct comparison experiments among four scenarios. Specifically, the first model is the baseline, which is trained on MS-Celeb-1M-v1c; the second model is also trained on MS-Celeb-1M-v1c, but only the upper half of the face is cropped for training, which can be regarded as a naive way to eliminate the adverse effect of the mask; the third model is trained on MS-Celeb-1M-v1c-Mask; the fourth is an ensemble of single models. The rank-1 accuracy of the baseline model is 27.03%. By utilizing only the upper half of the face, the second model improves this to 71.44%. The third model achieves the best single-model performance of 78.39% with the help of the synthesized masked face images. With the ensemble, the rank-1 accuracy is further improved to 79.26%.

Figure 7: Rank-K identification accuracy on MegaFace-Mask. Zoom in for better view.

6 Future Work

In the future, we will try to improve FaceX-Zoo from three aspects: breadth, depth, and efficiency. First, more additional modules will be included, such as face parsing and face lighting, to enrich the functionality “X” in FaceX-Zoo. Second, the modules of backbone architecture and supervisory heads will be continually supplemented along with the development of deep learning techniques. Third, we will try to improve the training efficiency via the distributed data parallel technique and mixed precision training.

7 Conclusion

In this work, we introduce a highly modular and scalable open-source framework for face recognition, namely FaceX-Zoo. It is easy to install and use. The Training Module enables users to train face recognition networks with various choices of backbone and supervisory head. The Training Mode includes both the conventional routine and the specific solution for shallow face learning. The Evaluation Module provides an automatic evaluation benchmark for standard and convenient testing. Face SDK provides modules for the whole face recognition pipeline, i.e. face detection, face landmark localization, and face feature extraction. It can serve as a baseline as well as a starting point for further development towards deployment. Besides, the Additional Module supports training and testing on masked face recognition via the 3D virtual mask adding technique.

All the source code is released along with the logs and trained models. One can easily experiment with this framework as a prototype, and develop their own work from this baseline.