3D High-Fidelity Mask Face Presentation Attack Detection Challenge

The threat of 3D masks to face recognition systems is increasingly serious and has drawn wide attention from researchers. To facilitate algorithm research, a large-scale High-Fidelity Mask dataset, namely CASIA-SURF HiFiMask (briefly HiFiMask), has been collected. Specifically, it consists of 54,600 videos recorded from 75 subjects wearing 225 realistic masks, captured by 7 kinds of sensors. Based on this dataset and Protocol 3, which evaluates both the discrimination and generalization ability of algorithms under open-set scenarios, we organized the 3D High-Fidelity Mask Face Presentation Attack Detection Challenge to boost research on 3D mask attack detection. It attracted 195 teams in the development phase, with 18 teams qualifying for the final round. All results were verified and re-run by the organizing team and used for the final ranking. This paper presents an overview of the challenge, including the dataset used, the definition of the protocol, the calculation of the evaluation criteria, and the summary and publication of the competition results. Finally, we introduce and analyze the top-ranked algorithms, summarize our conclusions, and outline the research directions for mask attack detection suggested by this competition.



1 Introduction

Recently, Face Anti-Spoofing (FAS) has attracted more and more attention [45, 43, 30] due to the wide application of face recognition in financial payment, access control, and phone unlocking. Face Presentation Attack Detection (PAD) technology is therefore a critical stage that reinforces face recognition systems by determining whether the face captured by an imaging sensor is real or fake. With the release of several high-quality 2D attack datasets

[2, 27, 53, 19], previous algorithms [46, 47, 39, 49, 31, 42, 51, 23, 6, 52, 50, 22] perform well against print attacks and video replay attacks. However, with the maturity of 3D printing technology, face masks have become a new type of Presentation Attack (PA): their realistic color and structure can easily fool FAS systems that rely on coarse texture and facial depth information.

Although some works have been devoted to 3D mask attacks, including dataset collection [29, 26, 13, 9, 48] and algorithm design [54, 12, 33, 26, 15, 24, 16], several limitations still hinder algorithm performance: (1) Lack of a high-quality, large-scale mask dataset for algorithm research, owing to the high cost of producing high-fidelity masks. As far as we know, existing mask datasets are insufficient in the number of subjects and skin tones, mask quality and types, scene settings, lighting environments, and collection devices, which seriously limits research on data-driven algorithms. (2) Lack of a challenging, public benchmark for comparing different algorithms. As a result, existing algorithms only work for specific mask types or in constrained environments. Several rPPG-based methods [15, 26, 24, 16, 25, 44] exploit the evidence that periodic rPPG pulse cues can be recovered from live faces but are noisy for mask attacks; however, they are vulnerable to illumination changes and sensitive to detection distance. (3) Compared with 2D attacks such as print and replay attacks, a high-fidelity mask has realistic skin color and structure, making it difficult to distinguish a live face from a mask in the visible spectrum.

In order to promote the community’s research on mask attack detection, we address the current difficulties from the following three aspects: (1) We collect and release a large-scale 3D high-fidelity mask face PAD dataset named HiFiMask. Compared with public 3D mask datasets, it offers several advantages, such as high-fidelity masks and a large amount of data in terms of identities, lightings, sensors, and videos. (2) We define a more general and valuable testing protocol for real-world deployment and provide a decent result as a benchmark. Our protocol evaluates both the discrimination and generalization ability of the algorithm under open-set scenarios: the training and development sets contain only part of the common mask types and scenarios, while the testing set contains more general mask types and scenarios. (3) Based on the dataset and protocol, we successfully held the 3D High-Fidelity Mask Face Presentation Attack Detection Challenge at ICCV 2021 (https://competitions.codalab.org/competitions/30910), attracting teams from all over the world. The results of the top three teams are far better than our baseline, which greatly pushes the current best performance of mask attack detection. A summary of the names and affiliations of the teams that entered the final stage is shown in Tab. 1. Interestingly, compared with previous challenges [17, 18, 1], the majority of the final participants of this competition come from industry, which indicates the increased importance of the topic for daily-life applications.

Ranking Team Name Leader Name, Affiliation
1 VisionLabs Oleg Grinchuk, visionlabs.ai
2 WeOnlyLookOnce Ke-Yue Zhang, Tencent Youtu Lab
3 CLFM Samuel Huang, FaceMe
4 oldiron666 Zezheng Wang, Kuaishou Technology
5 Reconova-AI-Lab Mingmu Chen, Reconova Technology
6 inspire Jiang Hao, Bytedance Ltd.
7 Piercing Eyes National University of Singapore
8 msxf_cvas Liang Gao, MaShang Consumer Finance Co., Ltd.
9 VIC_FACE Cheng Zhen, Meituan
10 DXM-DI-AI-CV-TEAM Weitai Hu, Du Xiaoman Financial
11 fscr Artem Petrov, Peter the Great St. Petersburg Polytechnic University
12 VIPAI Yao Xiao, Zhejiang University
13 reconova-ZJU Zhishan Li, Zhejiang University
14 sama_cmb Yifan Chen, China Merchants Bank (CMB)
15 Super Yu He, Technische Universität München
16 ReadFace Zhijun Tong, ReadFace
17 LsyL6 Dongxiao Li, Zhejiang University
18 HighC Minzhe Huang, Akuvox (Xiamen) Networks Co., Ltd.
Table 1: Team names and affiliations in the final ranking of this challenge.

To sum up, the contributions of this paper are as follows: (1) We describe the design of the 3D High-Fidelity Mask Face Presentation Attack Detection Challenge at ICCV 2021. (2) We organize this challenge around the HiFiMask dataset, proving the suitability of such a resource for boosting research on the topic. (3) We report and analyze the solutions developed by the participants. (4) We distill effective schemes for mask attack detection from the top-ranked algorithms and point out research directions revealed by this competition.

2 Challenge Overview

In this section, we review the organized challenge, including a brief introduction of the HiFiMask dataset, the challenge process and timeline, the challenge protocol, and evaluation metrics.

HiFiMask Dataset. HiFiMask [21] is currently the largest 3D face mask PAD dataset, containing videos captured from subjects of three skin tones: yellow, white, and black. For mask types, it contains high-fidelity masks for each identity, made of transparent, plaster, and resin materials, respectively. During acquisition, six complex scenes are considered for video recording, i.e., White Light, Green Light, Periodic Three-color Light, Outdoor Sunshine, Outdoor Shadow, and Motion Blur. For each scene, videos are recorded under different lighting directions (i.e., NormalLight, DimLight, BrightLight, BackLight, SideLight, and TopLight) to explore the impact of directional lighting. Among them, the first three scenarios contain periodic lighting within [0.7, 4] Hz to mimic the human heartbeat pulse, which might interfere with rPPG-based mask detection technology [15]. Finally, mainstream imaging devices (i.e., iPhone11, iPhoneX, MI10, P40, S20, Vivo, and HJIM) are utilized for video recording to ensure high resolution and imaging quality.

In order to make the dataset easier for the participating teams to use, we carried out several data preprocessing steps. We removed irrelevant background areas from the original videos, such as the part below the neck. After face detection, we sampled 10 frames at equal intervals from each video. Finally, the folder of each video is named according to a fixed rule.
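The equal-interval frame sampling described above can be sketched as follows; `sample_frame_indices` is a hypothetical helper written for illustration, not part of any released tooling:

```python
def sample_frame_indices(num_frames: int, k: int = 10) -> list:
    """Pick k frame indices spread at (approximately) equal intervals."""
    if num_frames <= k:
        return list(range(num_frames))
    # Spread k indices from the first frame to the last one.
    step = (num_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]
```

For a 100-frame video this yields indices 0, 11, 22, ..., 99, i.e., 10 frames covering the whole clip.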

Challenge Protocol and Data Statistics. To increase the difficulty of the competition and meet practical deployment requirements, we consider a protocol that comprehensively evaluates both the discrimination and generalization of algorithms. In other words, the training and development sets contain only part of the common mask types and scenarios, while the testing set contains more general mask types and scenarios. Based on Protocol 1 [21], we define training and development sets with part of the representative samples, while the full testing set is used. Thus, the distribution of the testing set is more complicated than that of the training and development sets in terms of mask types, scenes, lighting, and imaging devices. Different from Protocol 2 [21], which contains only ‘unseen’ mask types, the challenge protocol considers both ‘seen’ and ‘unseen’ domains as well as mask types, which is more general and valuable for real-world deployment.

Subset subj. Mask Scene Light Sensor #live #mask #all
Train 45 1,3 1,4,6 1,3,4,6 1,2,3,4 1,610 2,105 3,715
Dev 6 1,3 1,4,6 1,3,4,6 1,2,3,4 210 320 536
Test 24 1–3 1–6 1–6 1–7 4,335 13,027 17,362
Table 2: Statistical information for the challenge protocol. ‘#’ denotes the number of videos. In the Mask column, 1, 2, and 3 denote the Transparent, Plaster, and Resin masks, respectively; the indices in the other columns are numbered analogously.

In the challenge protocol, as shown in Tab. 2, the training and development subsets contain all skin tones, part of the mask types, i.e., transparent and resin materials (short for 1, 3), part of the scenes, i.e., White Light, Outdoor Sunshine, and Motion Blur (short for 1, 4, 6), part of the lightings, i.e., NormalLight, BrightLight, BackLight, and TopLight (short for 1, 3, 4, 6), and part of the imaging devices, i.e., iPhone11, iPhoneX, MI10, and P40 (short for 1, 2, 3, 4). The testing subset contains all skin tones, mask types, scenes, lightings, and imaging devices. For clarity, the dataset partition and video quantity of each subset of the challenge protocol are shown in Tab. 2.
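For illustration, the subset membership implied by Tab. 2 can be expressed as a simple attribute filter. The dictionary keys and the sample encoding below are assumptions of this sketch; the dataset's actual folder-naming rule is not reproduced here:

```python
# Attribute codes follow the numbering of Tab. 2.
TRAIN_DEV_SPLIT = {
    "mask":   {1, 3},          # Transparent, Resin
    "scene":  {1, 4, 6},       # White Light, Outdoor Sunshine, Motion Blur
    "light":  {1, 3, 4, 6},    # NormalLight, BrightLight, BackLight, TopLight
    "sensor": {1, 2, 3, 4},    # iPhone11, iPhoneX, MI10, P40
}

def in_train_dev(sample: dict) -> bool:
    """True if every attribute of the sample falls in the seen (train/dev) domains."""
    return all(sample[k] in allowed for k, allowed in TRAIN_DEV_SPLIT.items())
```

Any sample failing this filter belongs to the ‘unseen’ portion that appears only in the test set.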

Challenge Process and Timeline. The challenge was run on the CodaLab platform (https://competitions.codalab.org/competitions/30910) and comprised the following two stages:

Development Phase (April 19, 2021 – June 10, 2021). During this phase, participants had access to labeled training data and unlabeled development data. They could use the training data to train their models and submit predictions on the development data. The training data was made available with samples labeled as genuine or as one of 2 mask types (short for 1, 3), recorded in 3 scenes (short for 1, 4, 6), under 4 kinds of lighting (short for 1, 3, 4, 6), with 4 imaging sensors (short for 1, 2, 3, 4). Although the development data follows the same data types as the training data, its labels were not provided to the participants. Instead, participants could submit predictions on the development data and receive immediate feedback via the leaderboard.

Final Phase (June 10, 2021 – June 20, 2021). During this phase, labels for the development set were made available to participants, so that they had more labeled data for training their models. The unlabeled testing set was also released; participants had to make predictions on the testing data and upload their solutions to the challenge platform. The test set contains samples labeled as genuine, covering all skin tones, mask types (short for 1–3), scenes (short for 1–6), lightings (short for 1–6), and imaging devices (short for 1–7). Participants could make 3 submissions in the final phase, with the goal of assessing the stability of their methods. Note that the CodaLab platform defaults to the result of the last submission.

The final ranking of participants was obtained from the performance of their submissions on the testing set. To be eligible for prizes, winners had to publicly release their code under a license of their choice and provide a fact sheet describing their solution.

Evaluation Metrics. In this challenge, we selected the recently standardized ISO/IEC 30107-3 (https://www.iso.org/obp/ui/iso) metrics: Attack Presentation Classification Error Rate (APCER), Normal Presentation Classification Error Rate (NPCER), and Average Classification Error Rate (ACER) as the evaluation metrics. The ACER on the testing set is determined by the Equal Error Rate (EER) threshold on the development set. ACER was the leading evaluation measure for this challenge, and the Area Under the Curve (AUC) was used as an additional criterion.
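A minimal sketch of these metrics, assuming scores where higher means more likely live and binary labels (1 = live, 0 = mask attack):

```python
def apcer_npcer_acer(scores, labels, threshold):
    """Compute (APCER, NPCER, ACER) at a given decision threshold.

    scores: per-sample liveness scores (higher = more likely live).
    labels: 1 for bona fide (live), 0 for mask attack.
    A sample is predicted live when its score >= threshold.
    """
    attacks = [s for s, y in zip(scores, labels) if y == 0]
    lives = [s for s, y in zip(scores, labels) if y == 1]
    # APCER: fraction of attacks wrongly accepted as live.
    apcer = sum(s >= threshold for s in attacks) / len(attacks)
    # NPCER: fraction of live faces wrongly rejected as attacks.
    npcer = sum(s < threshold for s in lives) / len(lives)
    return apcer, npcer, (apcer + npcer) / 2
```

In the challenge setting, the threshold passed in would be the EER threshold estimated on the development set, and the returned ACER on the test set decides the ranking.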

3 Description of solutions

VisionLabs  Due to the tiny fake cues of 3D face masks and the difficulty of distinguishing them, team VisionLabs proposed a pipeline based on high-resolution face parts cropped from the original image, as shown in Fig. 1. Those parts are used as additional information to classify the full images through the network.

Figure 1: The pipeline of team VisionLabs. Original images are cropped by the DSFD detector [14] and split into five regions using prior knowledge. Those parts are fed into a backbone with one shared convolution block and five branches. Each branch outputs a 320-dimensional vector (two vectors from the ears branch). All vectors are concatenated into one vector used to compute the final loss.

During the preparation stage, centered face crops are created using the Dual Shot Face Detector (DSFD) [14]. The crop bounding box is expanded around the face detection bounding box; if it extends beyond the original image border, the missing parts are filled with black. If no face is found, the original image is used instead of the crop. Then, five face regions are cropped using prior information from the face bounding box: eyes, nose, chin, left ear, and right ear. Each part is resized and fed into the backbone after data augmentation (e.g., rotation, random crop, color jitter). Additionally, as a regularization technique, they turned 10% of the images into ‘trash’ images by scaling random tiny parts of them. As shown in Fig. 1, team VisionLabs used a multi-branch network with Face, Eyes, Nose, Chin, and Ears branches. Since pre-trained weights were prohibited in this competition, they tried to replicate the generalization ability of pre-trained convolutional filters by sharing the first block across branches, which made the first-block filters learn more diverse features. All five branches adopt EfficientNet-B0 [34] as the backbone, with the original descriptor size of each branch reduced from 1280 to 320. Due to the presence of left and right ears, the Ears branch outputs two vectors. The loss and confidence of each branch are then obtained through a fully connected layer (one head each for the face, eyes, nose, and chin, and two for the left and right ears). The six 320-dimensional vectors are concatenated into a 1920-dimensional vector, which is used to calculate the concatenated loss. All branches are trained simultaneously with the final loss:

L_total = L_concat + L_face + L_eyes + L_nose + L_chin + L_ear_l + L_ear_r,

where all losses are binary cross-entropy (BCE) losses. Since face parts do not always contain the tiny fake features, the positive class weight in the BCE loss is increased by a factor of 5 for the eyes, nose, chin, and ears, so partial face-part descriptors are not punished too hard when they do not contain useful features.

They trained the model with the Adam optimizer for 60 epochs, using an initial learning rate of 0.0006 and decreasing it every 3 epochs by a factor of 0.9. During inference, since the validation set is close to saturation and thus unreliable for threshold selection, they fixed the test-set threshold at 0.7. Because some face parts may be cropped incorrectly based on average positions and prior information, test-time augmentation is introduced: each image is flipped, and the final result is the average of the scores of the original and flipped faces.
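The branch-weighted BCE objective can be sketched as below. The branch names and per-branch probabilities are illustrative placeholders; only the x5 positive-class weight on the part branches follows the description above:

```python
import math

def bce(p, y, pos_weight=1.0):
    """Binary cross-entropy on a single probability, with an optional
    positive-class weight (applied only to the y=1 term)."""
    eps = 1e-7
    p = min(max(p, eps), 1 - eps)
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))

def total_loss(confidences, y):
    """Sum of branch losses: the concatenated head and the full-face head
    are weighted normally; part heads use pos_weight=5."""
    loss = bce(confidences["concat"], y) + bce(confidences["face"], y)
    for part in ("eyes", "nose", "chin", "ear_l", "ear_r"):
        loss += bce(confidences[part], y, pos_weight=5.0)
    return loss
```

The flip test-time augmentation then simply averages two such model scores, one from the original image and one from its horizontal mirror.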

WeOnlyLookOnce  In this method, considering that the raw training data contain irrelevant noise, a custom algorithm is first used to detect black borders. After that, DSFD [14] is applied to detect potential faces in each image. Specifically, the training set is processed only by removing black borders, while the testing and validation sets are further cropped with a ratio of 1.5 times the bounding box. Moreover, positive samples in the training set are much scarcer than negative samples. The training augmentations include rotation, image crop, color jitter, etc.

Figure 2: The framework of team WeOnlyLookOnce. The DSFD face detector is used to detect the bounding box. A lightweight self-defined ResNet12 then classifies the input into three categories. Label smoothing and output-distribution tuning are used as additional tricks.

As shown in Fig. 2, the framework [38, 3, 37, 41] consists of a CNN branch and a CDC branch. Both networks are self-designed lightweight ResNet12 models, and each is a three-class classification network aiming to distinguish real images from two kinds of masks. The CNN branch uses vanilla convolution, while the CDC branch uses Central Difference Convolution [49]. To alleviate overfitting, the team additionally adopted a label smoothing strategy and an output-distribution tuning strategy inspired by temperature scaling [28]. After computing the cross-entropy loss from the logits and labels, the total loss is obtained by combining the losses of the two branches.


To minimize the distribution gap between the validation and test sets, the team proposed an effective distribution tuner with two strategies, both of which proved effective. In the first strategy, they reformulate the three-class classification task as binary classification by summing the two attack-class logits into one value, then dividing the real logit by a factor of 3.6 and the fake logit by a factor of 5.0 before the softmax operation. In the second strategy, the task remains a three-class classification problem, while the real score on the validation set is reduced by 0.07.

CLFM  Team CLFM built a model based on CDCN++ trained with only a cross-entropy loss, yet achieved a good result. Central difference convolution replaces traditional convolution, attention modules are introduced at each stage to improve performance, and the outputs of three stages are fused as feature vectors before the fully connected layer.

For data pre-processing, they adopt their own face detection model and take patches of the face as input. Notably, they apply some practical tricks to both the training and test sets. On one hand, they find that hats and glasses are likely to mislead the model, so they first crop the face according to the bounding box and then crop the region around the mouth. The face size is randomly set within a small range to improve the generalization of the model; if the region is too small to fill the input, they flip and mirror the region to keep the texture consistent. The model input is square blocks, resized and normalized with the mean and standard deviation parameters summarized from ImageNet. On the other hand, they notice that for about 17% of the test images face detection finds no face, in which case the model has no choice but to use the whole image as the bounding box. They therefore randomly set part of the training data’s bounding boxes to the whole image and make slight changes to the test-set cropping relative to the training set, so that the model is not reduced to pure guessing in this situation. Finally, they use self-voting: shifting the patch within a small range and averaging the resulting scores as the final score.
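The self-voting step can be sketched as below; `score_fn` stands in for the trained classifier and, like the shift offsets, is an assumption of this sketch:

```python
def self_vote(score_fn, image, patch, shifts=(-4, 0, 4)):
    """Average the classifier score over patches shifted by a few pixels.

    score_fn(image, x, y, size) is a stand-in for the team's model scoring
    one patch; (x, y, size) describes the patch to crop.
    """
    x, y, size = patch
    scores = [score_fn(image, x + dx, y + dy, size)
              for dx in shifts for dy in shifts]
    return sum(scores) / len(scores)
```

Averaging over small spatial jitters reduces the variance introduced by imperfect crop placement.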

Oldiron666  The team Oldiron666 proposed a self-dense regularization framework for face anti-spoofing. For data pre-processing, they expand the cropped face by an adaptive scale, which improves performance. The input image size is 256, and data augmentations such as Random Crop, Cutout, and Patch Shuffle are performed to improve generalization.

The team Oldiron666 used a representation learning framework similar to SimSiam [5], but introduced a multilayer perceptron (MLP) for supervised classification. During training, each face image is randomly augmented into two views. The two views are processed by an encoder network consisting of the backbone and an MLP head called the projector [4]. They found that a lighter network can perform better on HiFiMask; therefore, ResNet6, which has low computational complexity, is used as the backbone. The output of one view is transformed to match the other view by a dense predictor MLP head. A dense similarity loss maximizes the similarity between the two sides. To implement supervised learning, they attach a dense classifier at the end of the framework and use Mean Squared Error (MSE) to evaluate the output: one MSE term is calculated against the ground-truth label on one side, and another as the difference between the category outputs of the two sides. The training loss combines the dense similarity loss with these two classification terms.


During training, they use half-precision floating point for faster training. The SGD optimizer is adopted, with an initial learning rate of 0.03, weight decay of 0.0005, and momentum of 0.9. During inference, only one side of the framework is executed to obtain the face anti-spoofing result.
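The symmetrized similarity objective, in the SimSiam style the team describes, can be sketched as follows (the stop-gradient on z is implicit here, since plain numbers carry no gradients):

```python
import math

def neg_cosine(p, z):
    """Negative cosine similarity between predictor output p and the
    projector output z of the other view (z is treated as a constant)."""
    dot = sum(a * b for a, b in zip(p, z))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in z))
    return -dot / norm

def symmetric_similarity_loss(p1, z2, p2, z1):
    """Symmetrized SimSiam-style loss over the two augmented views."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

The loss reaches its minimum of -1 when each predictor output is perfectly aligned with the other view's projection.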

Figure 3: The application flow chart of team Reconova-AI-Lab. Raw images are first pre-processed by RetinaFace for face detection, cropping, and alignment in the upper flow. Then a multi-task learning algorithm with three main branches is applied.

Reconova-AI-Lab  Team Reconova-AI-Lab contributed a variety of models and generated many different results, the best of which was used for the competition. They proposed a multi-task learning algorithm with three main branches: a direct classification branch, a branch where real faces learn a Gaussian mask, and a Region-of-Interest (ROI) classification branch; in the rest of this section, we abbreviate them as the Cls, Seg, and ROI branches. The Cls branch is supervised by a focal loss combining sigmoid and BCE losses. The Seg branch adopts the same loss function as Cls. The ROI branch takes three loss functions: the focal loss mentioned above, an ROI-alignment loss used to calibrate the ROI pooling operation, and a loss intended to reduce intra-class distance; the loss of the ROI branch is the sum of these three terms. All branches are trained synchronously with an SGD optimizer for 800 epochs, and the total loss is the sum of the three branches’ losses.


Their application flow is shown in Fig. 3. First, data pre-processing uses RetinaFace to detect the face and generate 14 landmarks per face, including face coordinates and bounding boxes of the left ear, right ear, and mouth. At this stage, they avoid large-angle poses and missing faces by constraining the size of the ROI bounding boxes. Meanwhile, they use mirroring, random rotation, random color enhancement, random translation, and random scaling for data augmentation. They then adopt a backbone called Res50_IR, which stacks 3, 4, 14, and 3 blocks in its four stages. To enhance features, an improved residual bottleneck structure named Yolov3_FPN is connected to different stages of the network. This network is followed by the three branches mentioned above. All parameters are initialized by different methods according to the layer type.
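The focal loss used by the Cls and Seg branches (combining sigmoid and BCE) can be sketched as follows; the alpha/gamma defaults are the common choices from the focal-loss literature, not values reported by the team:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss(logit, y, alpha=0.25, gamma=2.0):
    """Binary focal loss on a single logit: BCE on sigmoid(logit),
    down-weighted by (1 - p_t)^gamma so easy examples contribute little."""
    p = sigmoid(logit)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-7))
```

With gamma = 0 and alpha = 0.5 this reduces to half the ordinary BCE, which makes the modulation effect easy to verify.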

inspire  The team first utilized a ResNet50-based [11] RetinaFace [7] to detect face bounding boxes for all images. Notably, three threshold values of 0.8, 0.1, and 0.01 are used to categorize the bounding boxes: if the detection confidence is above 0.1, the box label is set to 2; if it is between 0.01 and 0.1, the label is 1; and if it is below 0.01, the label remains 0. According to these box labels, hard samples among the cropped images are partitioned.

Figure 4: The framework of team inspire. Raw images are first processed in the upper flow. The team then trains with a Context Contrastive Learning framework, using SE-ResNeXt101 as the backbone.

For the training stage, SE-ResNeXt101 [40] was selected as the backbone. The team applied the Context Contrastive Learning (CCL) [21] architecture as the framework, shown in Fig. 4, and used the same sampling strategy as in [21]. The total loss is a weighted combination of the MSE loss, the cross-entropy loss, and the contrastive loss [10].


Afterward, the Ranger optimizer (https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer) is used with an initial learning rate of 0.001. Training lasts 70 epochs in total, with the learning rate decayed by a factor of 0.1 at epochs 20, 30, and 60.

Piercing Eye  The team Piercing Eye used a modified CDCN [49] as the basic framework, shown in Fig. 5. During data processing, face regions are detected from the original images, resized, and randomly cropped. Like other teams, they apply data augmentations such as color jitter.
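Central Difference Convolution, which CDCN and the CDC branches above rely on, decomposes into a vanilla convolution minus theta times the kernel sum times the centre pixel. A single-channel, valid-padding sketch (loop-based for clarity, not efficiency):

```python
def cdc2d(x, w, theta=0.7):
    """Central Difference Convolution on a 2D grid (valid padding).

    Output(p0) = sum_n w(pn) * x(p0 + pn)  -  theta * x(p0) * sum_n w(pn),
    i.e., vanilla convolution blended with a central-difference term.
    """
    kh, kw = len(w), len(w[0])
    w_sum = sum(sum(row) for row in w)
    out = []
    for i in range(len(x) - kh + 1):
        row = []
        for j in range(len(x[0]) - kw + 1):
            vanilla = sum(w[a][b] * x[i + a][j + b]
                          for a in range(kh) for b in range(kw))
            centre = x[i + kh // 2][j + kw // 2]
            row.append(vanilla - theta * w_sum * centre)
        out.append(row)
    return out
```

With theta = 0 this reduces to ordinary convolution; on a constant region the central-difference term suppresses the response, which is why CDC emphasizes fine-grained texture gradients.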

Figure 5: The framework of team Piercing Eye. Two branches are attached to the CDCN backbone, called map regression and global classifier, respectively.
Figure 6: The framework of team msxf_cvas. Raw images are firstly processed by detecting face and alignment. After that, a ResNet34 network is utilized to classify the input image into four types including live, transparent mask, resin mask and no face.
Figure 7: The framework of team VIC_FACE.

In addition to the original depth-map output of CDCN, a multi-layer perceptron (MLP) is attached to the backbone for global binary classification. In the depth map, the real face region is labeled 1, while the background and fake face regions are labeled 0. They trained the model with an SGD optimizer for 260 epochs, using an initial learning rate of 0.002 decayed by a factor of 0.5 at milestones. As in [49], both a mean squared error loss and a contrastive depth loss are utilized for pixel-wise supervision, and a cross-entropy loss is applied in the global branch. The overall loss is the sum of the two pixel-wise losses and the global cross-entropy loss.


msxf_cvas  From analysis of the competition data, the team found two different distributions of spoof masks: transparent material and high-fidelity material. They treat the two fidelity materials (plaster and resin) as one category, since the features of these two types look similar. Besides, there is a small amount of noisy data without a human face, containing neither spoof nor live features, which the team classifies as a separate category called non-face. The final task is thus to classify all data into four categories: live, transparent mask, resin mask, and non-face. Considering the many extreme poses, lighting conditions, and low-quality samples in the competition data, they focus on data augmentation strategies during training, including CutMix, ISONoise, RandomSunFlare, RandomFog, MotionBlur, and ImageCompression.

First of all, the team applied a face detector to detect faces and align them by five points. After that, the mmclassification project (https://github.com/open-mmlab/mmclassification) was used to train a face anti-spoofing model, with ResNet34 [11] as the backbone and cross-entropy as the loss function. The whole framework is illustrated in Fig. 6.


VIC_FACE  A useful prerequisite is that deep bilateral filtering has been successfully applied in convolutional networks to filter deep features instead of original images. Inspired by this, team VIC_FACE proposed a novel method that fuses a deep bilateral operator (DBO) into the original CDCN in order to learn more intrinsic features by aggregating multi-level bilateral macro- and micro-information. As shown in Fig. 7, the backbone is an initial CDCN, which divides the network into low-level, mid-level, and high-level blocks to predict a gray-scale facial depth map from a single RGB face image. The DBO, a channel-wise deep bilateral filter, mimics a residual layer embedded in the network and replaces the original convolution layer by representing the aggregated bilateral base and residual features.

Specifically, they first detect and crop the face area from the full image as the model input. Second, they randomly apply down-sampling and JPEG compression to the images, degradations that often occur unintentionally when images are captured by different devices. Moreover, in addition to regular augmentations such as cutout, color jitter, and erasing to improve the generalization of the model, affine transformation of the brightness and color of random areas (implemented with OpenCV) is applied to simulate lighting conditions in the training data. Finally, they design a contrastive loss function for the contrast depth map of the gray-scale output and a mean squared error loss function for reducing the difference between the output and the binary mask, combining them into one loss optimized with Adam.

Figure 8: The framework of DXM-DI-AI-CV-TEAM.

DXM-DI-AI-CV-TEAM  Since the challenge evaluates generalization to unknown attack scenarios, this team casts face anti-spoofing as a domain generalization (DG) problem. To make the model generalize well to unseen scenes, the proposed framework, inspired by [32], trains the model to perform well in simulated domain-shift scenarios, which is achieved by finding generalized learning directions in a meta-learning process. Different from [32], the team removed the branch that uses depth prior knowledge, since real faces and masks contain similar depth information. Besides, a series of data augmentation and training strategies are used to achieve the best results.

In the challenge, the training data are collected in 3 scenes, namely White Light, Outdoor Sunshine, and Motion Blur (scenes 1, 4, and 6). Therefore, the objective of DG for this challenge is to make a model trained on the 3 source scenes generalize well to unseen attacks from a target scene. To this end, as shown in Fig. 8, the team's framework is composed of a feature extractor and a meta learner. At each training iteration, they divide the original 3 source scenes by randomly selecting 2 scenes as meta-train scenes and the remaining one as the meta-test scene. In each meta-train and meta-test scene, the meta learner conducts meta-learning in the feature space supervised by image and label pairs (x, y), where y is the ground-truth binary class label (fake/real face). In this way, the model learns how to perform well under scene shift through many training iterations and thus learns to generalize well to unseen attacks.
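The per-iteration scene split described above can be sketched as follows (a hypothetical helper; the scene identifiers follow the 1/4/6 numbering used in the text):

```python
import random

# Source scenes available for training: White Light (1),
# Outdoor Sunshine (4), Motion Blur (6).
SCENES = [1, 4, 6]

def sample_episode(scene_ids, rng=random):
    """Randomly pick 2 source scenes for meta-train and hold out the
    remaining one as meta-test, simulating a fresh domain shift at
    every training iteration."""
    ids = list(scene_ids)
    rng.shuffle(ids)
    meta_train, meta_test = ids[:2], ids[2]
    return meta_train, meta_test

meta_train, meta_test = sample_episode(SCENES)
```

Each episode would then take an inner gradient step on the meta-train scenes and evaluate the updated parameters on the held-out meta-test scene, so the learning direction is rewarded only when it also helps the unseen domain.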

4 Challenge Results

4.1 Challenge Results Report

We adopted four metrics to evaluate the performance of the solutions: APCER, NPCER, ACER, and AUC. Please note that although we report performance for a variety of evaluation measures, the leading metric was ACER. From Tab. 3, which lists the results and ranking of the top 18 teams, we can draw three conclusions: (1) The ACER performance of the top 3 teams is relatively close, and the top 2 teams achieve the best results on all metrics. (2) The top 6 teams are from industry, which indicates that mask attack detection is no longer limited to academia but is also an urgent problem in practical applications. (3) The ACER performance of all teams is evenly distributed between roughly 3% and 10%, which not only shows the rationality and selectivity of our challenge but also demonstrates the value of HiFiMask for further research.

Rank  Team               FP    FN   APCER(%)  NPCER(%)  ACER(%)  AUC
1     VisionLabs         492   101  3.777     2.330     3.053    0.995
2     WeOnlyLookOnce     242   193  1.858     4.452     3.155    0.995
3     CLFM               483   118  3.708     2.722     3.215    0.994
4     oldiron666         644   115  4.944     2.653     3.798    0.992
5     Reconova-AI-LAB    277   276  2.126     6.367     4.247    0.991
6     inspire            760   176  5.834     4.060     4.947    0.986
7     Piercing Eyes      887   143  6.809     3.299     5.054    0.983
8     msxf_cvas          752   232  5.773     5.352     5.562    0.982
9     VIC_FACE           1152  104  8.843     2.399     5.621    0.965
10    DXM-DI-AI-CV-TEAM  1100  181  8.444     4.175     6.310    0.970
11    fscr               794   326  6.095     7.520     6.808    0.979
12    VIPAI              1038  268  7.968     6.182     7.075    0.976
13    reconova-ZJU       1330  183  10.210    4.221     7.216    0.974
14    sama_cmb           1549  188  11.891    4.337     8.114    0.969
15    Super              780   454  5.988     10.473    8.230    0.979
16    ReadFace           1556  202  11.944    4.660     8.302    0.965
17    LsyL6              2031  138  15.591    3.183     9.387    0.951
18    HighC              1656  340  12.712    7.843     10.278   0.966
Table 3: Teams and their results in the final ranking of this challenge.
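For reference, the threshold metrics used for the ranking (APCER, NPCER, and ACER) can be computed as in this minimal sketch; it assumes scores where higher means more likely bona fide and uses a placeholder threshold, not the challenge's actual decision rule:

```python
import numpy as np

def pad_metrics(scores, labels, threshold=0.5):
    """Compute APCER, NPCER, and ACER at a fixed threshold.
    scores: higher = more likely bona fide; labels: 1 = live, 0 = attack."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pred_live = scores >= threshold
    # APCER: fraction of attack samples wrongly accepted as live.
    apcer = np.mean(pred_live[labels == 0])
    # NPCER: fraction of live samples wrongly rejected as attacks.
    npcer = np.mean(~pred_live[labels == 1])
    # ACER: the mean of the two error rates, the leading ranking metric.
    acer = (apcer + npcer) / 2
    return apcer, npcer, acer
```

AUC, the fourth metric, is threshold-free and is computed over the full score distribution rather than at a single operating point.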

4.2 Competition summary and Future Work

Through the introduction and result analysis of the team methods in the challenge, we summarize the effective ideas for mask attack detection: (1) At the data level, data augmentation is a strategy adopted by almost all teams; it plays an important role in preventing over-fitting of the model and improving the stability of the algorithm. (2) Segmentation of face regions can not only enlarge local information to mine the differences between a high-fidelity mask and a live face, but also avoid the extraction of irrelevant features such as face ID. (3) Multi-branch feature learning is a framework widely used by the participating teams: a multi-branch network first mines the differences between mask and live faces from multiple aspects, such as texture, color contrast, and material, and feature fusion then improves the robustness of the algorithm.
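Idea (2), enlarging local information by cropping face regions, can be illustrated with a simple grid split; this is a hypothetical sketch rather than any specific team's pipeline:

```python
import numpy as np

def face_patches(face, grid=3):
    """Split an aligned face crop into a grid x grid set of local patches,
    so each branch sees magnified local texture (skin pores, mask seams)
    instead of identity-level cues such as face ID."""
    h, w, _ = face.shape
    ph, pw = h // grid, w // grid
    return [face[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            for i in range(grid) for j in range(grid)]
```

Each patch can then feed a separate branch of a multi-branch network, with the per-branch features fused before the final real/fake decision.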

Since the challenge prohibits the use of additional datasets and pre-trained models, performance is limited to a certain extent. In future work, we plan to further improve performance from the following aspects: (1) Under visible light, it is difficult to distinguish between a live face and a mask; therefore, we will use additional or generated multi-modal data [20, 35] to assist mask attack detection. (2) Besides CNNs, we will explore the effectiveness of recent vision transformer [8] and MLP-like [36] architectures for the mask attack detection task. (3) As the HiFiMask dataset contains challenging dynamic lighting and scenes, we will explore more reliable rPPG [25, 44] technology for detecting liveness clues.

5 Conclusion

We organized the 3D High-Fidelity Mask Face Presentation Attack Detection Challenge at ICCV 2021, based on the HiFiMask dataset and run on the CodaLab platform. 195 teams registered for the competition and 18 teams made it to the final stage. Among the latter, 12 teams were from companies and 6 from academic institutes/universities. We first described the associated dataset, the challenge protocol, and the evaluation metrics. Then, we reviewed the top-ranked solutions and reported the results of the final phase. Finally, we summarized the relevant conclusions and pointed out the effective methods against mask attacks explored by this challenge.

6 Acknowledgement

This work was supported by the Chinese National Natural Science Foundation Projects 61961160704 and 61876179, the External Cooperation Key Project of the Chinese Academy of Sciences 173211KYSB20200002, the Key Project of the General Logistics Department Grant No. AWS17J001, the Science and Technology Development Fund of Macau (No. 0010/2019/AFJ, 0008/2019/A1, 0025/2019/AKP, 0019/2018/ASC), the Spanish project PID2019-105093GB-I00 (MINECO/FEDER, UE), and ICREA under the ICREA Academia programme.


  • [1] Z. Boulkenafet, J. Komulainen, Z. Akhtar, A. Benlamoudi, D. Samai, S. E. Bekhouche, A. Ouafi, F. Dornaika, A. Taleb-Ahmed, L. Qin, et al. (2017) A competition on generalized software-based face presentation attack detection in mobile scenarios. In IJCB, Cited by: §1.
  • [2] Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid (2017) OULU-npu: a mobile face presentation attack database with real-world variations. In FGR, pp. 612–618. Cited by: §1.
  • [3] S. Chen, T. Yao, Y. Chen, S. Ding, J. Li, and R. Ji (2021) Local relation learning for face forgery detection. Cited by: §3.
  • [4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 1597–1607. Cited by: §3.
  • [5] X. Chen and K. He (2021) Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15750–15758. Cited by: §3.
  • [6] Z. Chen, T. Yao, K. Sheng, S. Ding, Y. Tai, J. Li, F. Huang, and X. Jin (2021) Generalizable representation learning for mixture domain face anti-spoofing. In AAAI, Cited by: §1.
  • [7] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou (2019) Retinaface: single-stage dense face localisation in the wild. arXiv preprint arXiv:1905.00641. Cited by: §3.
  • [8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. External Links: 2010.11929 Cited by: §4.2.
  • [9] A. George, Z. Mostaani, D. Geissenbuhler, O. Nikisins, A. Anjos, and S. Marcel (2019) Biometric face presentation attack detection with multi-channel convolutional neural network. TIFS. Cited by: §1.
  • [10] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Volume 2, pp. 1735–1742. Cited by: §3.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3, §3.
  • [12] N. Kose and J. Dugelay (2014) Mask spoofing in face recognition and countermeasures. Image and Vision Computing. Cited by: §1.
  • [13] H. Li, W. Li, H. Cao, S. Wang, F. Huang, and A. C. Kot (2018) Unsupervised domain adaptation for face anti-spoofing. TIFS. Cited by: §1.
  • [14] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, and F. Huang (2019) DSFD: dual shot face detector. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 5055–5064. External Links: Document Cited by: Figure 1, §3, §3.
  • [15] X. Li, J. Komulainen, G. Zhao, P. Yuen, and M. Pietikäinen (2016) Generalized face anti-spoofing by detecting pulse from face videos. In ICPR, Cited by: §1, §2.
  • [16] B. Lin, X. Li, Z. Yu, and G. Zhao (2019) Face liveness detection by rppg features and contextual patch-based cnn. In ICBEA, Cited by: §1.
  • [17] A. Liu, J. Wan, S. Escalera, H. J. Escalante, and S. Z. Li (2019) Multi-modal face anti-spoofing attack detection challenge at cvpr2019. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Cited by: §1.
  • [18] A. Liu, X. Li, J. Wan, Y. Liang, S. Escalera, H. J. Escalante, M. Madadi, Y. Jin, Z. Wu, X. Yu, et al. (2020) Cross-ethnicity face anti-spoofing recognition challenge: a review. IET Biometrics. Cited by: §1.
  • [19] A. Liu, Z. Tan, J. Wan, S. Escalera, G. Guo, and S. Z. Li (2021) Casia-surf cefa: a benchmark for multi-modal cross-ethnicity face anti-spoofing. In WACV, Cited by: §1.
  • [20] A. Liu, Z. Tan, J. Wan, Y. Liang, Z. Lei, G. Guo, and S. Z. Li (2021) Face anti-spoofing via adversarial cross-modality translation. IEEE TIFS. Cited by: §4.2.
  • [21] A. Liu, C. Zhao, Z. Yu, J. Wan, A. Su, X. Liu, Z. Tan, S. Escalera, J. Xing, Y. Liang, et al. (2021) Contrastive context-aware learning for 3d high-fidelity mask face presentation attack detection. arXiv preprint arXiv:2104.06148. Cited by: 3D High-Fidelity Mask Face Presentation Attack Detection Challenge, §2, §2, §3.
  • [22] S. Liu, K. Zhang, T. Yao, M. Bi, S. Ding, J. Li, F. Huang, and L. Ma (2021) Adaptive normalized representation learning for generalizable face anti-spoofing. In ACM MM, Cited by: §1.
  • [23] S. Liu, K. Zhang, T. Yao, K. Sheng, S. Ding, Y. Tai, J. Li, Y. Xie, and L. Ma (2021) Dual reweighting domain generalization for face presentation attack detection. In IJCAI, Cited by: §1.
  • [24] S. Liu, X. Lan, and P. C. Yuen (2018) Remote photoplethysmography correspondence feature for 3d mask face presentation attack detection. In ECCV, Cited by: §1.
  • [25] S. Liu, X. Lan, and P. C. Yuen (2021) Multi-channel remote photoplethysmography correspondence feature for 3d mask face presentation attack detection. TIFS. Cited by: §1, §4.2.
  • [26] S. Liu, P. C. Yuen, S. Zhang, and G. Zhao (2016) 3D mask face anti-spoofing with remote photoplethysmography. In ECCV, Cited by: §1.
  • [27] Y. Liu, A. Jourabloo, and X. Liu (2018) Learning deep models for face anti-spoofing: binary or auxiliary supervision. In CVPR, Cited by: §1.
  • [28] R. Müller, S. Kornblith, and G. E. Hinton (2019) When does label smoothing help?. In Advances in Neural Information Processing Systems, Vol. 32. External Links: Link Cited by: §3.
  • [29] E. Nesli and S. Marcel (2013) Spoofing in 2d face recognition with 3d masks and anti-spoofing with kinect. In BTAS, Cited by: §1.
  • [30] Y. Qin, Z. Yu, L. Yan, Z. Wang, C. Zhao, and Z. Lei (2021) Meta-teacher for face anti-spoofing. IEEE TPAMI. Cited by: §1.
  • [31] Y. Qin, C. Zhao, X. Zhu, Z. Wang, Z. Yu, T. Fu, F. Zhou, J. Shi, and Z. Lei (2020) Learning meta model for zero-and few-shot face anti-spoofing. In AAAI, pp. 11916–11923. Cited by: §1.
  • [32] R. Shao, X. Lan, and P. C. Yuen (2020) Regularized fine-grained meta face anti-spoofing. In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI). Cited by: §3.
  • [33] H. Steiner, A. Kolb, and N. Jung (2016) Reliable face anti-spoofing using multispectral swir imaging. In ICB, Cited by: §1.
  • [34] M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 6105–6114. Cited by: §3.
  • [35] H. Tang and N. Sebe (2021) Total generate: cycle in cycle generative adversarial networks for generating human faces, hands, bodies, and natural scenes. IEEE TMM. Cited by: §4.2.
  • [36] I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, D. Keysers, J. Uszkoreit, M. Lucic, et al. (2021) Mlp-mixer: an all-mlp architecture for vision. arXiv preprint arXiv:2105.01601. Cited by: §4.2.
  • [37] W. Wang, B. Yin, T. Yao, L. Zhang, Y. Fu, S. Ding, J. Li, F. Huang, and X. Xue (2021) Delving into data: effectively substitute training for black-box attack. In CVPR, pp. 4761–4770. Cited by: §3.
  • [38] X. Wang, T. Yao, S. Ding, and L. Ma (2020) Face manipulation detection via auxiliary supervision. In International Conference on Neural Information Processing, pp. 313–324. Cited by: §3.
  • [39] Z. Wang, Z. Yu, C. Zhao, X. Zhu, Y. Qin, Q. Zhou, F. Zhou, and Z. Lei (2020) Deep spatial gradient and temporal depth learning for face anti-spoofing. In CVPR, Cited by: §1.
  • [40] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §3.
  • [41] B. Yin, W. Wang, T. Yao, J. Guo, Z. Kong, S. Ding, J. Li, and C. Liu (2021) Adv-makeup: a new imperceptible and transferable attack on face recognition. Cited by: §3.
  • [42] Z. Yu, X. Li, X. Niu, J. Shi, and G. Zhao (2020) Face anti-spoofing with human material perception. In ECCV, pp. 557–575. Cited by: §1.
  • [43] Z. Yu, X. Li, J. Shi, Z. Xia, and G. Zhao (2021) Revisiting pixel-wise supervision for face anti-spoofing. IEEE TBIOM. Cited by: §1.
  • [44] Z. Yu, X. Li, P. Wang, and G. Zhao (2021) Transrppg: remote photoplethysmography transformer for 3d mask face presentation attack detection. IEEE Signal Processing Letters. Cited by: §1, §4.2.
  • [45] Z. Yu, Y. Qin, X. Li, C. Zhao, Z. Lei, and G. Zhao (2021) Deep learning for face anti-spoofing: a survey. arXiv preprint arXiv:2106.14948. Cited by: §1.
  • [46] Z. Yu, Y. Qin, X. Xu, C. Zhao, Z. Wang, Z. Lei, and G. Zhao (2020) Auto-fas: searching lightweight networks for face anti-spoofing. In ICASSP, pp. 996–1000. Cited by: §1.
  • [47] Z. Yu, Y. Qin, H. Zhao, X. Li, and G. Zhao (2021) Dual-cross central difference network for face anti-spoofing. In IJCAI, Cited by: §1.
  • [48] Z. Yu, J. Wan, Y. Qin, X. Li, S. Z. Li, and G. Zhao (2020) NAS-fas: static-dynamic central difference network search for face anti-spoofing. In TPAMI, Cited by: §1.
  • [49] Z. Yu, C. Zhao, Z. Wang, Y. Qin, Z. Su, X. Li, F. Zhou, and G. Zhao (2020) Searching central difference convolutional networks for face anti-spoofing. In CVPR, Cited by: §1, §3, §3, §3.
  • [50] J. Zhang, Y. Tai, T. Yao, J. Meng, S. Ding, C. Wang, J. Li, F. Huang, and R. Ji (2021) Aurora guard: reliable face anti-spoofing via mobile lighting system. arXiv preprint arXiv:2102.00713. Cited by: §1.
  • [51] K. Zhang, T. Yao, J. Zhang, S. Liu, B. Yin, S. Ding, and J. Li (2021) Structure destruction and content combination for face anti-spoofing. In IJCB, pp. 1–6. Cited by: §1.
  • [52] K. Zhang, T. Yao, J. Zhang, Y. Tai, S. Ding, J. Li, F. Huang, H. Song, and L. Ma (2020) Face anti-spoofing via disentangled representation learning. In ECCV. Cited by: §1.
  • [53] S. Zhang, X. Wang, A. Liu, C. Zhao, J. Wan, S. Escalera, H. Shi, Z. Wang, and S. Z. Li (2019) A dataset and benchmark for large-scale multi-modal face anti-spoofing. In CVPR, Cited by: §1.
  • [54] Z. Zhang, D. Yi, Z. Lei, and S. Z. Li (2011) Face liveness detection by learning multispectral reflectance distributions. In FG, Cited by: §1.