[CVPR 2020] A Large-Scale Dataset for Real-World Face Forgery Detection
This paper reports methods and results in the DeeperForensics Challenge 2020 on real-world face forgery detection. The challenge employs the DeeperForensics-1.0 dataset, one of the most extensive publicly available real-world face forgery detection datasets, with 60,000 videos constituted by a total of 17.6 million frames. The model evaluation is conducted online on a high-quality hidden test set with multiple sources and diverse distortions. A total of 115 participants registered for the competition, and 25 teams made valid submissions. We will summarize the winning solutions and present some discussions on potential research directions.
Recent years have witnessed exciting progress [DFL, DFLPaper, DeepFakes, faceswap-GAN, faceshifter, deeperforensics1] in automatic face swapping. These techniques eschew cumbersome hand-crafted face manipulation processes, facilitating the development of various popular face editing software tools. From another perspective, these easy-to-access tools, named "Deepfakes", also carry the risk of being misused and spread. Tampered videos on the internet could lead to perilous consequences, raising legitimate concerns among the general public and authorities. Therefore, effective face forgery detection methods are urgently needed to safeguard against these photorealistic fake videos, particularly in real-world scenarios where the video sources and distortions are unknown.
We organize the DeeperForensics Challenge 2020 with the aim of advancing the state of the art in face forgery detection. Participants are expected to develop robust and generic methods for forgery detection in real-world scenarios. The challenge uses DeeperForensics-1.0 [deeperforensics1], a large-scale real-world face forgery detection dataset that contains 60,000 videos with a total of 17.6 million frames (project page: https://liming-jiang.com/projects/DrF1/DrF1.html). All source videos in DeeperForensics-1.0 are carefully collected, and the fake videos are generated by a newly proposed end-to-end face swapping framework. Extensive real-world perturbations are applied to obtain a more challenging benchmark of larger scale and higher diversity. The dataset also features a hidden test set, which is richer in distribution than the publicly available training set, suggesting a better setting to simulate real-world scenarios. Besides, the hidden test set will be continuously updated along with the development of Deepfakes technology. The evaluation of the challenge is performed online on the current version of the hidden test set.
In the following sections, we will describe the DeeperForensics Challenge 2020, summarize the winning solutions and results, and provide discussions to take a closer look at the current status and possible future development of real-world face forgery detection.
The DeeperForensics Challenge 2020 is hosted on the CodaLab platform (challenge website: https://competitions.codalab.org/competitions/25228) in conjunction with ECCV 2020, The 2nd Workshop on Sensing, Understanding and Synthesizing Humans (workshop website: https://sense-human.github.io). The online evaluation is conducted using Amazon Web Services (AWS, https://aws.amazon.com). First, participants register their teams on the CodaLab challenge website. Then, they are requested to submit their models to the AWS evaluation server (with one 16 GB Tesla V100 GPU per team) to perform the online evaluation on the hidden test set. When the evaluation is done, participants receive the encrypted prediction files through an automatic email. Finally, they submit the result file to the CodaLab challenge website.
The DeeperForensics Challenge 2020 employs the DeeperForensics-1.0 dataset [deeperforensics1], proposed at CVPR 2020. DeeperForensics-1.0 contains 60,000 videos constituted by a total of 17.6 million frames. The dataset features three appealing properties: good quality, large scale, and high diversity.
To ensure good quality, extensive data collection is conducted. The high-resolution source videos are collected from paid actors with four typical skin tones across multiple countries. Their eight expressions (i.e., neutral, angry, happy, sad, surprise, contempt, disgust, fear) are recorded under nine lighting conditions by seven cameras at different locations. The actors are further asked to perform supplementary expressions defined by 3DMM blendshapes [3dmm] to make the dataset more diverse. Besides, a robust end-to-end face swapping framework, DF-VAE, is developed to generate the fake videos. In addition, seven types of real-world perturbations at five intensity levels are applied to obtain a more challenging benchmark of larger scale and higher diversity. Readers are referred to [deeperforensics1] for details.
An indispensable component of DeeperForensics-1.0 is the hidden test set, which is richer in distribution than the publicly available training set. The hidden test set suggests a better real-world face forgery detection setting: 1) Multiple sources. Fake videos in-the-wild should be manipulated by different unknown methods; 2) High quality. Threatening fake videos should have high quality to deceive human eyes; 3) Diverse distortions. Different perturbations should be considered. The hidden test set will evolve by including more challenging samples along with the development of Deepfakes technology. The evaluation of the challenge is performed on its current version.
Similar to the Deepfake Detection Challenge (DFDC) [DFDCWeb], the DeeperForensics Challenge 2020 uses the binary cross-entropy loss (BCELoss) to evaluate the performance of face forgery detection models:

BCELoss = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right],

where $N$ is the number of videos in the hidden test set, $y_i$ denotes the ground truth label of video $i$ (fake: 1, real: 0), and $p_i$ indicates the predicted probability that video $i$ is fake. A smaller BCELoss score is better and directly contributes to a higher ranking; if two BCELoss scores are the same, the submission with less runtime is ranked higher. To avoid an infinite BCELoss when a prediction is both too confident and wrong, the score is bounded by a threshold value.
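Concretely, the metric with its clipping safeguard can be sketched as follows; the clipping threshold `eps` is an assumption, since the report does not publish the exact bound:

```python
import math

def bce_score(labels, probs, eps=0.01):
    """Challenge metric: mean binary cross-entropy over N test videos.

    labels: ground-truth label per video (fake: 1, real: 0).
    probs:  predicted probability that each video is fake.
    Predictions are clipped to [eps, 1 - eps] so a confident wrong
    answer cannot produce an infinite loss (eps is an assumed value).
    """
    n = len(labels)
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # bound the score
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n
```

A maximally confident but correct submission still incurs a small positive loss of `-log(1 - eps)` per video because of the clipping, which is the price of the safeguard.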
The DeeperForensics Challenge 2020 lasted for nine weeks – eight weeks for the development phase and one week for the final test phase.
The challenge officially started at the ECCV 2020 SenseHuman Workshop on August 28, 2020, immediately entering the development phase. In the development phase, the evaluation is performed on the test-dev hidden test set, which contains videos representing the general circumstances of the full hidden test set and is used to maintain a public leaderboard. Participants can conduct four online evaluations (each with a 2.5-hour runtime limit) per week.
The final test phase started on October 24, 2020. The evaluation is conducted on the test-final hidden test set, which contains videos (including the test-dev videos) with a distribution similar to test-dev, for the final competition results. A total of two online evaluations (each with a 7.5-hour runtime limit) are allowed. The final test phase ended on October 31, 2020.
Finally, the challenge results were announced in December 2020. In total, 115 participants registered for the competition, and 25 teams made valid submissions.
Among the teams that made valid submissions, many achieve promising results. We show the final results of the top-5 teams in Table 1. In the following sections, we present the winning solutions of the top-3 entries.
Team members: Baoying Chen, Peiyu Zhuang, Sili Li
As shown in Figure 1, the method designed by the champion team contains three stages, namely Face Extraction, Classification, and Output.
Face Extraction. They first extract frames from each video at equal intervals using the VideoCapture API of OpenCV. Then, they use the MTCNN [mtcnn] face detector to detect the face region in each frame and expand the region by a fixed factor to crop the face image.
Classification. They define the predicted probability that a face is fake as the face score. They use EfficientNet [efficientnet] as the backbone, which was proven effective in the Deepfake Detection Challenge (DFDC) [DFDCWeb]. The results of three models (EfficientNet-B0, EfficientNet-B1, and EfficientNet-B2) are ensembled for each face.
Output. The final output score of a video is the predicted probability that the video is fake, which is calculated by the average of face scores for the extracted frames.
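The two bespoke steps of this pipeline, box expansion before cropping and score aggregation, can be sketched as below. The function names, the expansion factor, and the clamping behavior are illustrative assumptions, not the team's released code:

```python
import numpy as np

def expand_box(box, factor, img_w, img_h):
    """Enlarge a detected face box (x1, y1, x2, y2) about its center by
    `factor` so the crop keeps some surrounding context, clamped to the
    image bounds. The exact factor used by the team is unspecified."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * factor, (y2 - y1) * factor
    return (max(0.0, cx - w / 2), max(0.0, cy - h / 2),
            min(float(img_w), cx + w / 2), min(float(img_h), cy + h / 2))

def video_score(face_probs_per_model):
    """Aggregation sketch: each sampled face gets one score per ensemble
    member (rows = models, columns = faces); scores are averaged over
    models per face, then over faces to give the per-video probability."""
    per_face = np.mean(face_probs_per_model, axis=0)  # ensemble over models
    return float(np.mean(per_face))                   # average over frames
```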
The team employs EfficientNet pre-trained on ImageNet as the backbone, selecting EfficientNet-B0, EfficientNet-B1, and EfficientNet-B2 for the model ensemble. In addition to DeeperForensics-1.0, they use several other public datasets, e.g., UADFV [UADFV], Deep Fake Detection [google], FaceForensics++ [FF++iccv], Celeb-DF [celebdfcvpr], and the DFDC Preview [DFDC]. They balance the class samples by down-sampling. The code of the champion solution has been made publicly available (https://github.com/beibuwandeluori/DeeperForensicsChallengeSolution).
Training: Following the DFDC winning solution, the champion team observes that appropriate data augmentation contributes to better results. For augmentation, they use the perturbation implementation in DeeperForensics-1.0 [drf1_aug] during training, applying only the image-level distortions: color saturation change (CS), color contrast change (CC), local block-wise distortion (BW), white Gaussian noise in color components (GNC), Gaussian blur (GB), and JPEG compression (JPEG). They randomly mix these distortions with a probability of 0.2. They also try other data augmentations [dfdc_aug], but the performance improvement is slim. The images are resized to a fixed input resolution. They use the AdamW optimizer [adamw] with an initial learning rate of 0.001, and label smoothing is applied.
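The augmentation strategy can be sketched as follows. This simplified version applies a single randomly chosen image-level distortion with probability 0.2; the team actually uses the official DeeperForensics-1.0 perturbation code [drf1_aug] and mixes distortions, so the distortion implementations here are illustrative placeholders only:

```python
import random
import numpy as np

def gaussian_noise(img, std=10.0):
    """GNC placeholder: add white Gaussian noise in color components."""
    noisy = img.astype(np.float32) + np.random.normal(0.0, std, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def contrast_change(img, alpha=1.3):
    """CC placeholder: scale pixel intensities to change contrast."""
    return np.clip(img.astype(np.float32) * alpha, 0, 255).astype(np.uint8)

# In the real pipeline this list would also cover CS, BW, GB, and JPEG.
DISTORTIONS = [gaussian_noise, contrast_change]

def augment(img, p=0.2):
    """With probability p, apply one randomly chosen distortion."""
    if random.random() < p:
        return random.choice(DISTORTIONS)(img)
    return img
```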
Testing: The testing pipeline follows the three stages in Figure 1. They clip the prediction score of each video to a bounded range to reduce the large loss caused by prediction errors. In addition to the best BCELoss score, their fastest execution speed may be attributed to the use of the fast MTCNN face extractor and the ensemble of three image-level models with fewer parameters.
Team members: Shen Chen, Taiping Yao, Shouhong Ding, Jilin Li, Feiyue Huang, Liujuan Cao, Rongrong Ji
A face-manipulated video contains two types of forgery traces, i.e., image-level artifacts and video-level artifacts. The former refers to artifacts such as blending boundaries and abnormal textures within an image, while the latter refers to face jitter between video frames. Most previous works focused only on artifacts in a specific modality and lacked consideration of both. The second-place team proposes to use an attention mechanism to fuse the temporal information in videos, and further combines it with an image model to achieve better results.
The overall framework of their method is shown in Figure 2. First, they use RetinaFace [retinaface] with an added margin to detect faces in video frames. Then, the face sequence is fed into an image-based model and a video-based model, whose backbones are both EfficientNet-b5 [efficientnet] with NoisyStudent [noisystudent] pre-trained weights. The image-based model predicts frame by frame and takes the median of the probabilities as its prediction. The video-based model takes the entire face sequence as input and adopts an attention module to fuse the temporal information between frames. Finally, the per-video prediction score is obtained by averaging the probabilities predicted by the two models.
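The temporal fusion and the final score combination can be sketched as below. The single projection vector `w` and the weighted-sum fusion are assumptions, since the report does not specify the attention module's exact architecture:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fuse(frame_feats, w):
    """Attention-fusion sketch: score each frame feature with a learned
    projection w, softmax-normalize the scores over time, and return the
    attention-weighted sum of the frame features as the video feature."""
    scores = frame_feats @ w   # (T, D) @ (D,) -> (T,)
    alpha = softmax(scores)    # temporal attention weights, sum to 1
    return alpha @ frame_feats # (T,) @ (T, D) -> (D,)

def per_video_score(image_probs, video_prob):
    """Final score: average of the image-based model's median per-frame
    probability and the video-based model's probability."""
    return (float(np.median(image_probs)) + video_prob) / 2
```

With `w` at zero, the attention weights are uniform and the fused feature reduces to the mean frame feature; training `w` lets the module emphasize frames with stronger forgery traces.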
The team implements the proposed method in PyTorch. All the models are trained on NVIDIA Tesla V100 GPUs. In addition to the DeeperForensics-1.0 dataset, they use three external datasets, i.e., FaceForensics++ [FF++iccv], Celeb-DF [celebdfcvpr], and the Diverse Fake Face Dataset [DFFD]. They use the official splits provided by these datasets to construct the training, validation, and test sets, and balance the positive and negative samples by down-sampling.
Training: The second-place team uses the following data augmentations: RandAugment [randaugment], patch Gaussian [patchgaussian], Gaussian blur, image compression, random flip, random crop, and random brightness/contrast. They also employ the perturbation implementation in DeeperForensics-1.0 [drf1_aug]. For the image-based model, they train a classifier based on EfficientNet-b5 [efficientnet] using the binary cross-entropy loss. For the video-based model, they adopt a two-stage training strategy. In stage-1, they train an image-based classifier based on EfficientNet-b5. In stage-2, they freeze the model parameters trained in stage-1 to serve as a face feature extractor, and introduce an attention module that learns temporal information via nonlinear transformations and softmax operations. The input of the network in stage-2 is the face sequence (a fixed number of frames per video), and only the attention module and classification layers are trained, again with the binary cross-entropy loss. The input size is scaled to a fixed resolution. The Adam optimizer [adam] is used with weight decay, and the learning rate is halved at regular intervals during training.
Testing: They sample frames at equal intervals for each video and detect faces with RetinaFace [retinaface], as in the training phase. The face images are then resized to the training resolution. Test-time augmentation (TTA, i.e., horizontal flip) is applied so that both the original and flipped images are fed into the network to get the prediction score. They clip the prediction score of each video to a bounded range to avoid excessive losses on extreme error samples.
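The flip-based TTA and score clipping can be sketched as follows, with an assumed clipping range of [0.01, 0.99] and a stand-in `model` callable (the report does not publish the team's exact bounds):

```python
import numpy as np

def predict_with_tta(model, faces, lo=0.01, hi=0.99):
    """Score each cropped face and its horizontal flip, average all
    predictions into a per-video score, and clip the result to [lo, hi]
    to bound the loss on extreme errors. `model` maps one face image to
    a fake probability; lo and hi are assumed values."""
    flipped = [np.flip(f, axis=1) for f in faces]  # horizontal flip
    probs = [model(f) for f in faces + flipped]
    return float(np.clip(np.mean(probs), lo, hi))
```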
Team members: Changlei Lu, Ganchao Tan
Similar to the second-place entry, the team in the third place also exploits the poor temporal consistency of existing face manipulation techniques. To this end, they propose to use a 3D convolutional neural network (3DCNN) to capture spatial-temporal features for forgery detection. The framework of their method is shown in Figure 3.
Implementation Details. First, the team crops faces in the video frames using the MTCNN [mtcnn] face detector and combines all the cropped face images into a face video clip. Each video clip is then resized to a fixed resolution. Various data augmentations are applied, including Gaussian blur, white Gaussian noise in color components, random crop, random flip, etc. The processed video clips are used as the input to train a 3D convolutional neural network (3DCNN) with the cross-entropy loss. They examine three kinds of networks: I3D [i3d], 3D ResNet [3dresnet], and R(2+1)D [r2plus1d]. These models are pre-trained on action recognition datasets, e.g., Kinetics [kinetics]. In addition to DeeperForensics-1.0, they use three external public face manipulation datasets, i.e., the DFDC dataset [DFDCChallenge], Deep Fake Detection [google], and FaceForensics++ [FF++iccv].
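Assembling the cropped faces into a clip tensor for the 3DCNN can be sketched as below; the clip length of 16 frames and the [0, 1] scaling are assumptions, and the crops are expected to already share one resolution:

```python
import numpy as np

def sample_indices(n_frames, clip_len):
    """Pick `clip_len` frame indices at (roughly) equal intervals
    spanning the whole video."""
    return np.linspace(0, n_frames - 1, clip_len).round().astype(int)

def build_clip(face_crops, clip_len=16):
    """Stack cropped face images into a (T, H, W, C) clip array for a
    3DCNN such as I3D, 3D ResNet, or R(2+1)D. The assumed clip length
    is 16; network-specific normalization would follow the scaling."""
    idx = sample_indices(len(face_crops), clip_len)
    clip = np.stack([face_crops[i] for i in idx]).astype(np.float32)
    return clip / 255.0  # scale uint8 pixels to [0, 1]
```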
The methods described above consider different aspects of developing a robust face forgery detection model, and we are glad to find that the winning solutions achieve promising performance in the DeeperForensics Challenge 2020. In summary, three key points inspired by these methods could improve real-world face forgery detection. 1) Strong backbone. The backbone choice of forgery detection models is important; the high-performance winning solutions are based on the state-of-the-art EfficientNet. 2) Diverse augmentations. Applying appropriate data augmentations may better simulate real-world scenarios and boost model performance. 3) Temporal information. Since the primary detection targets are fake videos, temporal information can be a critical clue for distinguishing the real from the fake.
Despite the promising results, we believe that there is still much room for improvement in the real-world face forgery detection task. 1) More suitable and diverse data augmentations may contribute to a better simulation of the real-world data distribution. 2) Developing a robust detection method that can cope with unseen manipulation methods and distortions is a critical problem. At this stage, we observe that model training is data-dependent; although data augmentations can help improve performance to a certain extent, the generalization ability of most forgery detection models is still poor. 3) Different artifacts in Deepfakes videos (e.g., checkerboard artifacts, fusion boundary artifacts) remain rarely explored.
Acknowledgments. We thank Amazon Web Services for sponsoring the prize of this challenge. The organization of this challenge is also supported by A*STAR through the Industry Alignment Fund - Industry Collaboration Projects Grant.