[CVPR 2020] A Large-Scale Dataset for Real-World Face Forgery Detection
In this paper, we present our on-going effort of constructing a large-scale benchmark, DeeperForensics-1.0, for face forgery detection. Our benchmark represents the largest face forgery detection dataset by far, with 60,000 videos constituted by a total of 17.6 million frames, 10 times larger than existing datasets of the same kind. Extensive real-world perturbations are applied to obtain a more challenging benchmark of larger scale and higher diversity. All source videos in DeeperForensics-1.0 are carefully collected, and fake videos are generated by a newly proposed end-to-end face swapping framework. The quality of generated videos outperforms those in existing datasets, validated by user studies. The benchmark features a hidden test set, which contains manipulated videos achieving high deceptive scores in human evaluations. We further contribute a comprehensive study that evaluates five representative detection baselines and make a thorough analysis of different settings. We believe this dataset will contribute to real-world face forgery detection research.
Face swapping has become an emerging topic in computer vision and graphics. Indeed, many works [9, 10, 12] on automatic face swapping have been proposed in recent years. These efforts have circumvented the cumbersome and tedious manual face editing processes, hence expediting the advancement in face editing. At the same time, such enabling technology has sparked legitimate concerns, particularly about its potential for misuse and abuse. The popularization of “Deepfakes” on the internet has further set off alarm bells among the general public and authorities, in view of the conceivably perilous implications. Accordingly, there is a dire need for countermeasures to be put in place promptly, particularly innovations that can effectively detect videos that have been manipulated.
Working towards forgery detection, various groups have contributed datasets (e.g., FaceForensics++, Deep Fake Detection, and DFDC) comprising manipulated video footage. The availability of these datasets has undoubtedly provided essential avenues for research into forgery detection. Nonetheless, these datasets suffer from several drawbacks: their videos are either few in number, of low quality, or overly artificial. Understandably, these datasets are inadequate to train a good model for effective forgery detection in real-world scenarios. This is particularly true when current advances in human face editing are able to produce extremely realistic videos, rendering forgery detection a highly challenging task. On another note, we observe high similarity between training and test videos, in terms of their distribution, in certain works [30, 37]. Their actual efficacy in detecting real-world face forgery cases, which are much more variable and unpredictable, remains to be further elucidated.
We believe that forgery detection models can only be enhanced when trained with a dataset that is exhaustive enough to encompass as many potential real-world variations as possible. To this end, we propose a large-scale dataset named DeeperForensics-1.0, consisting of 60,000 videos with a total of 17.6 million frames, for real-world face forgery detection. The main steps of our dataset construction are shown in Figure 1. We set forth three yardsticks when constructing this dataset: 1) Quality. The dataset shall contain videos that are more realistic and much closer to the distribution of real-world detection scenarios (details in Section 3.1 and 3.2). 2) Scale. The dataset shall be made up of a large-scale video set (details in Section 3.3). 3) Diversity. There shall be sufficient variations in the video footage (e.g., compression, blur, transmission errors) to match those that may be encountered in the real world (details in Section 3.3).
The primary challenge in the preparation of this dataset is the lack of good-quality video footage. Specifically, most publicly available videos are shot in unconstrained environments, resulting in large variations, including but not limited to suboptimal illumination, large occlusion of the target faces, and extreme head poses. Importantly, the lack of official informed consent from the video subjects precludes the use of these videos, even for non-commercial purposes. On the other hand, while some videos of manipulated faces are deceptively real, a larger number remain easily distinguishable by human eyes. The latter is often caused by model negligence towards appearance variations or temporal differences, leading to preposterous and incongruous results.
We approach the aforementioned challenge from two perspectives: 1) collecting fresh face data from individuals with informed consent (details in Section 3.1), and 2) devising a novel method, DeepFake Variational Auto-Encoder (DF-VAE), to enhance existing videos (details in Section 3.2). In addition, we introduce diversity into the video footage through the deliberate addition of distortions and perturbations, simulating real-world scenarios. We collate the newly collected data and the DF-VAE-modified videos into the DeeperForensics-1.0 dataset, with the aim of further expanding it gradually over time. We benchmark five representative open-source forgery detection methods using our DeeperForensics-1.0 dataset as well as a hidden test set containing manipulated videos that achieve high deceptive ranking in user studies.
We summarize our contributions as follows: 1) We propose a new dataset, DeeperForensics-1.0, that is larger in scale than existing ones, of high quality and rich diversity. To improve its quality, we introduce a carefully designed data collection and a novel framework, DF-VAE, that effectively mitigates the obvious fabricated effects of existing manipulated videos. The DeeperForensics-1.0 dataset shall facilitate future research in forgery detection of human faces in real-world scenarios. 2) We benchmark results of existing representative forgery detection methods on our DeeperForensics-1.0 dataset, offering insights into the current status and future improvement strategies in face forgery detection.
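Table 1. Comparison of face forgery detection datasets.
|Dataset||Total videos||Ratio (real : fake)||Controlled capture||Consented actors||Perturbations||Mixed perturbations||New method|
|UADFV||98||1 : 1||✗||–||–||✗||✗|
|DeepFake-TIMIT||620||only fake||✗||–||–||✗||✗|
|Celeb-DF||1203||1 : 1.95||✗||–||–||✗||✗|
|FaceForensics++||5000||1 : 4||✗||–||2||✗||✗|
|Deep Fake Detection||3431||1 : 8.5||✗||28||–||✗||✗|
|DFDC Preview Dataset||5214||1 : 3.6||✗||66||3||✗||✗|
|DeeperForensics-1.0 (Ours)||60000||5 : 1||✔||100||35||✔||✔|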
This paper includes two main aspects of face forgery detection related to other works: dataset and benchmark. We will cover some important works in this section.
Face forgery detection datasets. Building a dataset for forgery detection requires a huge amount of effort on data collection and manipulation. Early forgery detection datasets are based on images under highly restrictive conditions, e.g., MICC_F2000, the Wild Web dataset, and the Realistic Tampering dataset.
Owing to the urgency in video-based face forgery detection, some prominent groups have devoted their efforts to creating face forensics video datasets (see Table 1). UADFV contains 98 videos, i.e., 49 real videos from YouTube and 49 fake ones generated by FakeAPP. DeepFake-TIMIT manually selects similar-looking pairs of people from the VidTIMIT database and generates fake videos for each subject using low-quality and high-quality versions of faceswap-GAN, resulting in a total of 620 fake videos. Celeb-DF comprises 1,203 videos: real YouTube videos, mostly of celebrities, and fake videos synthesized from them. FaceForensics++ is the first large-scale face forensic dataset; it consists of 4,000 fake videos manipulated by four methods (i.e., DeepFakes, Face2Face, FaceSwap, and NeuralTextures) and 1,000 real videos from YouTube. Afterwards, Google joined FaceForensics++ and contributed the Deep Fake Detection dataset, with 3,431 real and fake videos from 28 actors. Recently, Facebook invited 66 individuals and built the DFDC preview dataset, which includes 5,214 original and tampered videos with three types of augmentations.
In comparison, we invite 100 paid actors and collect high-resolution source data with various poses, expressions, and illuminations. 3DMM blendshapes are taken as reference to supplement some extremely exaggerated expressions. We obtain consent from all the actors for using and manipulating their faces. In contrast to prior works, we also propose a new end-to-end face swapping method (i.e., DF-VAE) and systematically apply seven types of perturbations to the fake videos at five intensity levels. The mixture of distortions applied to a single video makes our dataset better imitate real-world scenarios. Ultimately, we construct the DeeperForensics-1.0 dataset, which contains 60,000 high-quality videos with a total of 17.6 million frames.
Face forgery detection benchmarks. Recently, a new prominent benchmark for facial manipulation detection, the FaceForensics Benchmark, has been proposed. The benchmark includes six image-level face forgery detection baselines [1, 3, 7, 8, 15, 36]. Although the FaceForensics Benchmark adds distortions to the videos by converting them to different compression rates, a deeper exploration of more perturbation types and their mixture is missing. Celeb-DF also provides a face forgery detection benchmark including seven methods [1, 7, 29, 32, 34, 49, 51] trained and tested on different datasets. In the aforementioned benchmarks, the test set usually shares a similar distribution with the training set. Such an assumption inherently introduces biases and renders these methods impractical for face forgery detection in real-world settings with much more diverse and unknown fake videos.
In our benchmark, we introduce a challenging hidden test set with manipulated videos that achieve high deceptive scores in user studies, to better simulate real-world distribution. Various perturbations are analyzed to make our benchmark more comprehensive. In addition, we mainly exploit video-level forgery detection baselines [6, 17, 19, 43, 45] in this work. Temporal information – a significant cue for video forgery detection besides single-frame quality – has been considered. We will elaborate our benchmark in Section 4.
The main contribution of this paper is a new large-scale dataset for real-world face forgery detection, DeeperForensics-1.0, which provides an alternative to existing databases. DeeperForensics-1.0 consists of 60,000 videos with 17.6 million frames in total, including both originally collected videos and manipulated videos. To construct a dataset more suitable for real-world face forgery detection, we design this dataset with careful consideration of quality, scale, and diversity. In Section 3.1 and 3.2, we will discuss the details of data collection and methodology (i.e., DF-VAE) to improve quality. In Section 3.3, we will show how to ensure the large scale and high diversity of DeeperForensics-1.0.
Source data is the first factor that highly affects quality. Taking results in Figure 2 as an example, the source data collection increases the robustness of our face swapping method to extreme poses, since videos on the internet usually have limited head pose variations.
We refer to the identity in the driving video as the “target” face and the identity of the face that is swapped onto the driving video as the “source” face. Different from previous works, we find that the source faces play a much more critical role than the target faces in building a high-quality dataset. Specifically, the expressions, poses, and lighting conditions of source faces should be much richer in order to perform robust face swapping. Hence, our data collection mainly focuses on source face videos. Figure 3 shows the diversity in different attributes of our data collection.
We invited 100 paid actors to record the source videos. Similar to [4, 11], we obtained consent from all the actors for using and manipulating their faces to avoid portrait right issues. The participants were carefully selected to ensure variability in gender, age, skin color, and nationality, and we tried to maintain a roughly equal proportion w.r.t. each of these attributes. In particular, we invited both male and female actors from a range of countries. The age range was chosen to match the most common age group appearing in real-world videos. The actors have four typical skin tones: white, black, yellow, and brown. All faces are clean, without glasses or decorations.
Different from previous data collection in the wild (see Table 1), we built a professional indoor environment for more controllable data collection. We only use the facial regions (detected and cropped by LAB) of the source data, so we can neglect the background. We set up seven HD cameras at different angles: front, left, left-front, right, right-front, oblique-above, and oblique-below, and recorded the videos at high resolution. We trained the actors in advance to keep the collection process smooth. We asked the actors to turn their heads and speak naturally with eight expressions: neutral, angry, happy, sad, surprise, contempt, disgust, and fear. The head poses cover a range of angles. Furthermore, the actors were asked to perform the expressions defined in 3DMM blendshapes (see Figure 4) to supplement some extremely exaggerated expressions. When performing 3DMM blendshapes, the actors also spoke naturally to avoid excessive frames that show a closed mouth.
In addition to expressions and poses, we systematically set nine lighting conditions from various directions: uniform, left, top-left, bottom-left, right, top-right, bottom-right, top, and bottom. The actors were only asked to turn their heads under uniform illumination, so the lighting remains unchanged on specific facial regions, which avoids many duplicate data samples recorded by the cameras set at different angles. In the end, our collected source data are an order of magnitude larger than existing datasets, in both videos and frames.
To tackle the low visual quality of previous works, we consider three key requirements in formulating a high-fidelity face swapping method: 1) It should be general and scalable, so that we can generate a large number of high-quality videos. 2) The problem of face style mismatch caused by appearance variations needs to be addressed. Some failure cases of existing methods are shown in Figure 5. 3) Temporal continuity of generated videos should be taken into consideration.
Based on the aforementioned requirements, we propose DeepFake Variational Auto-Encoder (DF-VAE), a novel learning-based face swapping framework. DF-VAE consists of three main parts, namely a structure extraction module, a disentangled module, and a fusion module. We will give a brief and intuitive understanding of the DF-VAE framework below. Please refer to the Appendix for detailed derivations and results.
Disentanglement of structure and appearance. The first step of our method is face reenactment – animating the source face with similar expression as the target face, without any paired data. Face swapping is considered as a subsequent step of face reenactment that performs fusion between the reenacted face and the target background. For robust and scalable face reenactment, we should cleanly disentangle structure (i.e., expression and pose) and appearance representation (i.e., texture, skin color, etc.) of a face. This disentanglement is rather difficult because structure and appearance representation are far from independent. We describe our solution as follows.
Let $\mathbf{x}_{1:T} \equiv \{x_1, x_2, \ldots, x_T\}$ be a sequence of source face video frames, and $\mathbf{y}_{1:T} \equiv \{y_1, y_2, \ldots, y_T\}$ be the sequence of corresponding target face video frames. We first simplify our problem and only consider two specific snapshots at time $t$, $x_t$ and $y_t$. Let $\tilde{x}_t$, $\tilde{y}_t$, and $d_t$ represent the reconstructed source face, the reconstructed target face, and the reenacted face, respectively.
Consider the reconstruction procedure of the source face $x_t$. Let $s_x$ denote the structure representation and $a_x$ denote the appearance information. The face generator can be depicted as the posterior estimate $p_\theta(x_t \mid s_x, a_x)$. The solution of our reconstruction goal, the marginal log-likelihood $\log p_\theta(x_t)$, by a common Variational Auto-Encoder (VAE) can be written as:

$$\log p_\theta(x_t) = D_{KL}\big(q_\phi(s_x, a_x \mid x_t) \,\|\, p_\theta(s_x, a_x \mid x_t)\big) + \mathcal{L}(\theta, \phi; x_t), \tag{1}$$

where $q_\phi$ is an approximate posterior used to achieve the evidence lower bound (ELBO) in the intractable case, and the second RHS term $\mathcal{L}$ is the variational lower bound w.r.t. both the variational parameters $\phi$ and the generative parameters $\theta$.
In Eq. 1, we assume that both $s_x$ and $a_x$ are latent priors computed by the same posterior $x_t$. However, the separation of these two variables in the latent space is rather difficult without additional conditions. Therefore, we employ a simple yet effective approach to disentangle these two variables.
The blue arrows in Figure 6 demonstrate the reconstruction procedure of the source face $x_t$. Instead of feeding a single source face $x_t$, we sample another source face $x'$ to construct unpaired data in the source domain. To make the structure representation more evident, we use stacked hourglass networks to extract landmarks of $x_t$ in the structure extraction module and get the heatmap $\hat{x}_t$. Then we feed the heatmap $\hat{x}_t$ to the Structure Encoder $E_\alpha$, and $x'$ to the Appearance Encoder $E_\beta$. We concatenate the latent representations (small cubes in red and green) and feed them to the Decoder $D_\gamma$. Finally, we get the reconstructed face $\tilde{x}_t$, i.e., the marginal log-likelihood of $x_t$.
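Therefore, the latent structure representation $s_x$ in Eq. 1 becomes a more evident heatmap representation $\hat{x}_t$, which is introduced as a new condition. The unpaired sample $x'$ with the same identity w.r.t. $x_t$ is another condition, being a substitute for $a_x$. With $z_t$ denoting the latent variable, Eq. 1 can be rewritten as a conditional log-likelihood:

$$\log p_\theta(x_t \mid \hat{x}_t, x') = D_{KL}\big(q_\phi(z_t \mid x_t, \hat{x}_t, x') \,\|\, p_\theta(z_t \mid x_t, \hat{x}_t, x')\big) + \mathcal{L}(\theta, \phi; x_t, \hat{x}_t, x'). \tag{2}$$

Since the first RHS term, the KL-divergence, is non-negative, we get:

$$\log p_\theta(x_t \mid \hat{x}_t, x') \geq \mathcal{L}(\theta, \phi; x_t, \hat{x}_t, x') = \mathbb{E}_{q_\phi(z_t \mid x_t, \hat{x}_t, x')}\big[-\log q_\phi(z_t \mid x_t, \hat{x}_t, x') + \log p_\theta(x_t, z_t \mid \hat{x}_t, x')\big], \tag{3}$$

and $\mathcal{L}$ can also be written as:

$$\mathcal{L}(\theta, \phi; x_t, \hat{x}_t, x') = -D_{KL}\big(q_\phi(z_t \mid x_t, \hat{x}_t, x') \,\|\, p_\theta(z_t \mid \hat{x}_t, x')\big) + \mathbb{E}_{q_\phi(z_t \mid x_t, \hat{x}_t, x')}\big[\log p_\theta(x_t \mid z_t, \hat{x}_t, x')\big]. \tag{4}$$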
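For readers who want the intermediate step between the two forms of the lower bound, the following is a short sketch of the standard derivation. The shorthand $c$ for the full set of conditioning variables and $z$ for the latent variable is an assumption of this note, not notation taken from the paper.

```latex
\begin{aligned}
\mathcal{L}
  &= \mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x, z \mid c) - \log q_\phi(z \mid x, c)\big] \\
  &= \mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big]
     + \mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(z \mid c) - \log q_\phi(z \mid x, c)\big] \\
  &= \mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big]
     - D_{KL}\big(q_\phi(z \mid x, c) \,\|\, p_\theta(z \mid c)\big).
\end{aligned}
```

The second line uses the factorization $p_\theta(x, z \mid c) = p_\theta(x \mid z, c)\, p_\theta(z \mid c)$; the third recognizes the remaining expectation as a negative KL-divergence.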
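We let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance structure:

$$\log q_\phi(z_t \mid x_t, \hat{x}_t, x') \equiv \log \mathcal{N}(z_t; \mu, \sigma^2 I), \tag{5}$$

where $I$ is an identity matrix. Exploiting the reparameterization trick, the non-differentiable operation of sampling can become differentiable by using an auxiliary variable with an independent marginal. In this case, $z_t \sim q_\phi(z_t \mid x_t, \hat{x}_t, x')$ is implemented by $z_t = \mu + \sigma \epsilon$, where $\epsilon$ is an auxiliary noise variable, $\epsilon \sim \mathcal{N}(0, 1)$. Finally, the approximate posterior $q_\phi(z_t \mid x_t, \hat{x}_t, x')$ is estimated by the separated encoders, the Structure Encoder $E_\alpha$ and the Appearance Encoder $E_\beta$, in an end-to-end training process by standard gradient descent.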
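The reparameterization step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the encoder outputs are stand-in lists of means and log-variances, and the KL term is computed against a standard normal prior purely for simplicity, whereas the framework described here conditions its prior on the heatmap and the unpaired sample.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps element-wise, with eps ~ N(0, 1).

    The randomness lives entirely in eps, so z is a deterministic,
    differentiable function of mu and log_var.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).

    Assumption: a standard normal prior is used here only for illustration.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]   # stand-in encoder outputs
z = reparameterize(mu, log_var, rng)
```

Parameterizing the variance through its logarithm is a common convention that keeps $\sigma$ positive without constraints; it is a design choice of this sketch.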
We discuss the whole workflow of reconstructing the source face. In the target face domain, the reconstruction procedure is the same, as shown by orange arrows in Figure 6.
During training, the network learns structure and appearance information in both the source and the target domains. It is noteworthy that even if both $x_t$ and $y_t$ belong to arbitrary identities, our effective disentangled module is capable of learning meaningful structure and appearance information of each identity. During inference, we concatenate the appearance prior of $x'$ and the structure prior of $y_t$ (small cubes in red and orange) in the latent space, and the reconstructed face $d_t$ shares the same structure as $y_t$ while keeping the appearance of $x'$. Our framework allows concatenation of structure and appearance latent codes extracted from arbitrary identities at inference time, and thus permits many-to-many face reenactment.
In summary, DF-VAE is a new conditional variational auto-encoder with robustness and scalability. It conditions on two posteriors in different domains. In the disentangled module, the separated design of the two encoders $E_\alpha$ and $E_\beta$, the explicit structure heatmap, and the unpaired data construction jointly force $E_\alpha$ to learn structure information and $E_\beta$ to learn appearance information.
Style matching and fusion. To fix the obvious style mismatch problems shown in Figure 5, we introduce a masked adaptive instance normalization (MAdaIN) module. We place a typical AdaIN network after the reenacted face $d_t$. In the face swapping scenario, we only need to adjust the style of the face area and use the original background. Therefore, we use a mask $m_t$ to guide the AdaIN network to focus on style matching of the face area. To avoid boundary artifacts, we apply Gaussian blur to $m_t$ and get the blurred mask $m_t^b$.
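In our face swapping context, the reenacted face $d_t$ is the content input of MAdaIN and the target face $y_t$ is the style input. MAdaIN adaptively computes the affine parameters from the face area of the style input:

$$\mathrm{MAdaIN}(c, s) = \sigma(s) \left( \frac{c - \mu(c)}{\sigma(c)} \right) + \mu(s), \tag{6}$$

where $c = m_t^b \cdot d_t$, $s = m_t^b \cdot y_t$, and $m_t^b$ is the blurred face mask. With the very low-cost MAdaIN module, we reconstruct $d_t$ again by Decoder $D_\delta$. The blurred mask $m_t^b$ is used again to fuse the reconstructed image with the background of $y_t$. At last, we get the swapped face $\bar{d}_t$. Figure 8 shows the effectiveness of the MAdaIN module for style matching and fusion.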
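The idea of restricting AdaIN statistics to the face area and then fusing with the target background can be sketched on a single flattened channel. This is an illustrative reading, not the authors' module: the mask-weighted statistics, the function names, and the omission of the decoder between normalization and fusion are all assumptions of this sketch.

```python
import math


def masked_stats(values, mask):
    """Mean and standard deviation of `values`, weighted by a soft mask in [0, 1]."""
    total = sum(mask)
    mean = sum(v * w for v, w in zip(values, mask)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, mask)) / total
    return mean, math.sqrt(var + 1e-8)  # small epsilon avoids division by zero


def masked_adain(content, style, mask):
    """Match the face-area statistics of `content` to those of `style`,
    then blend with the style (target) frame outside the mask."""
    mu_c, sigma_c = masked_stats(content, mask)
    mu_s, sigma_s = masked_stats(style, mask)
    # AdaIN: normalize the content, then apply the style's scale and shift.
    restyled = [sigma_s * (c - mu_c) / sigma_c + mu_s for c in content]
    # Fusion: face area from the restyled content, background from the target.
    return [w * r + (1.0 - w) * s for r, s, w in zip(restyled, style, mask)]


content = [0.0, 1.0, 2.0, 3.0]     # stand-in reenacted-face pixels
style = [10.0, 12.0, 14.0, 16.0]   # stand-in target-frame pixels
out = masked_adain(content, style, [1.0, 1.0, 0.0, 0.0])
```

With a blurred mask, the weights near the face boundary fall between 0 and 1, so the transition between restyled face and original background is gradual rather than a hard edge.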
The MAdaIN module is jointly trained with the disentangled module in an end-to-end manner. Thus, by a single model, DF-VAE can perform many-to-many face swapping with obvious reduction of style mismatch and facial boundary artifacts (see Figure 7 for the face swapping between three source identities and three target identities). Even if there are multiple identities in both the source domain and the target domain, the quality of face swapping does not degrade.
Temporal consistency constraint. Temporal discontinuity of fake videos leads to obvious flickering of the face area, making them very easy to spot for both forgery detection methods and human eyes. To improve temporal continuity, we let the disentangled module learn temporal information of both the source face and the target face.
For simplification, we make a Markov assumption that the generation of the frame at time $t$ depends sequentially on its previous $P$ frames $x_{(t-P):(t-1)}$. In our experiment, $P$ is set to balance quality improvement and training time.
In order to build the relationship between a current frame and previous ones, we further make an intuitive assumption that the optical flows should remain unchanged after reconstruction. We use FlowNet 2.0 to estimate the optical flow $\tilde{x}_f$ w.r.t. $\tilde{x}_t$ and $\tilde{x}_{t-1}$, and $x_f$ w.r.t. $x_t$ and $x_{t-1}$. Since face swapping is sensitive to minor facial details, which can be greatly affected by flow estimation, we do not warp $x_{t-1}$ by the estimated flow as in prior work. Instead, we minimize the difference between $\tilde{x}_f$ and $x_f$ to improve temporal continuity while keeping stable facial detail generation. To this end, we propose a new temporal consistency constraint, which can be written as:

$$L_{temporal} = \frac{1}{CHW} \left\| \tilde{x}_f - x_f \right\|_1, \tag{7}$$

where $C = 2$ for a common form of optical flow, and $H$ and $W$ denote the height and width of the flow field.
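A minimal sketch of such a flow-difference penalty is shown below. It assumes the two optical flows have already been estimated (for example by a flow network) and are given as nested lists of shape [C][H][W]; the use of an absolute (L1) difference and the function name are assumptions of this sketch.

```python
def temporal_consistency_loss(flow_recon, flow_orig):
    """Mean absolute difference between two optical flows of shape [C][H][W].

    flow_recon: flow estimated between consecutive reconstructed frames.
    flow_orig:  flow estimated between the corresponding original frames.
    """
    c, h, w = len(flow_orig), len(flow_orig[0]), len(flow_orig[0][0])
    total = sum(abs(flow_recon[k][i][j] - flow_orig[k][i][j])
                for k in range(c) for i in range(h) for j in range(w))
    return total / (c * h * w)  # normalize by C * H * W


# Two-channel (horizontal, vertical) flow on a 2x2 grid, as a toy example.
flow_a = [[[0.0, 1.0], [2.0, 3.0]], [[0.0, 0.0], [0.0, 0.0]]]
flow_b = [[[v + 0.5 for v in row] for row in channel] for channel in flow_a]
loss = temporal_consistency_loss(flow_b, flow_a)
```

Penalizing the flow difference, rather than warping the previous frame with the estimated flow, keeps errors in the flow estimate from being copied directly into the generated facial details.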
We only discuss the temporal continuity w.r.t. the source face in this section because the case of the target face is the same. If multiple identities exist in one domain, temporal information of all these identities can be learned in an end-to-end manner.
|No.||Distortion type|
|1||Color saturation change|
|2||Local block-wise distortion|
|3||Color contrast change|
|5||White Gaussian noise in color components|
|7||Video compression rate change|
Our extensive data collection and the proposed DF-VAE method are designed to improve the quality of manipulated videos in the DeeperForensics-1.0 dataset. In this section, we will mainly discuss the scale and diversity aspects.
The manipulated videos we provide are also an order of magnitude more numerous than those in previous datasets. Thanks to the scalability and multimodality of DF-VAE, the time overhead of model training and data generation is reduced compared to the common Deepfakes methods, with no degradation in quality. Thus, a larger-scale dataset construction is possible.
We take the refined YouTube videos collected by FaceForensics++ as the target videos. The face of each of our 100 collected identities is swapped onto multiple target videos; the raw manipulated videos are thus generated directly by DF-VAE in an end-to-end process.
To ensure diversity, we apply various perturbations to better simulate videos in real scenes. Specifically, as shown in Table 2, seven types of distortions defined in Image Quality Assessment (IQA) [31, 35] are included. Each of these distortions is divided into five intensity levels. We apply random-type distortions to the raw manipulated videos at each of the five intensity levels, producing one perturbed set per level. Besides, more robust manipulated videos are generated by adding random-type, random-level distortions to the raw manipulated videos. Moreover, in contrast to all the previous datasets, each sample in a further set of manipulated videos in DeeperForensics-1.0 is subjected to a mixture of more than one distortion. The variability of perturbations improves the diversity of DeeperForensics-1.0 to better imitate the data distribution of real-world scenarios.
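The three perturbation regimes described above (random type at a fixed level, random type at a random level, and a mixture of several distortions) can be sketched as a sampling plan. This is a hypothetical illustration: the function name, the mode names, and the `max_mix` cap on mixed distortions are assumptions, and distortion types are referred to only by their index in Table 2.

```python
import random

NUM_TYPES = 7    # distortion types (Table 2)
NUM_LEVELS = 5   # intensity levels per type


def plan_perturbations(video_ids, mode, rng, level=None, max_mix=3):
    """Return {video_id: [(distortion_type, intensity_level), ...]}.

    mode="fixed_level":  one random type at the given level.
    mode="random_level": one random type at a random level.
    mode="mixed":        two or more distinct types, each at a random level.
    """
    plan = {}
    for vid in video_ids:
        if mode == "fixed_level":
            plan[vid] = [(rng.randint(1, NUM_TYPES), level)]
        elif mode == "random_level":
            plan[vid] = [(rng.randint(1, NUM_TYPES), rng.randint(1, NUM_LEVELS))]
        elif mode == "mixed":
            k = rng.randint(2, max_mix)  # max_mix is a hypothetical cap
            types = rng.sample(range(1, NUM_TYPES + 1), k)
            plan[vid] = [(t, rng.randint(1, NUM_LEVELS)) for t in types]
        else:
            raise ValueError(f"unknown mode: {mode}")
    return plan


rng = random.Random(0)
level5 = plan_perturbations(["v001", "v002", "v003"], "fixed_level", rng, level=5)
mixed = plan_perturbations(["v001", "v002"], "mixed", rng)
```

Separating the sampling plan from the actual image processing makes the assignment reproducible from a seed, independent of how each distortion is implemented.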
DeeperForensics-1.0 is a new large-scale dataset consisting of 60,000 videos with 17.6 million frames for real-world face forgery detection. High-quality source videos and manipulated videos constitute two main contributions of the dataset. The diversity of perturbations applied to the manipulated videos ensures that DeeperForensics-1.0 robustly simulates real scenes. The whole dataset will be released, free to all research communities, for developing face forgery detection and more general human face related research.
To examine the quality of our DeeperForensics-1.0 dataset, we engage professional participants, most of whom specialize in computer vision research. We believe these participants are qualified and well-trained in assessing the realness of tampered videos. The user study is conducted on DeeperForensics-1.0 and six former datasets, i.e., UADFV, DeepFake-TIMIT, Celeb-DF, FaceForensics++, Deep Fake Detection, and DFDC. We randomly select video clips from each of these datasets and prepare a platform for the participants to evaluate their realness. Similar to user studies in prior work, the participants are asked to respond to the statement “The video clip looks real.” with a score at one of five levels (1-clearly disagree, 2-weakly disagree, 3-borderline, 4-weakly agree, 5-clearly agree). We assume that users who give a score of 4 or 5 think the video is “real”. The user study results are shown in Table 3. The quality of our dataset is appreciated by most of the participants. Compared to the previous datasets, DeeperForensics-1.0 achieves the highest realism rating. Although Celeb-DF also gets very high realness scores, the scale of our dataset is much larger.
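|Dataset||1 (clearly disagree)||2 (weakly disagree)||3 (borderline)||4 (weakly agree)||5 (clearly agree)||“real” (4 or 5)|
|Deep Fake Detection||26.0||28.0||24.1||11.5||10.3||21.9%|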
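As a small worked example of how the “real” rating relates to the score distribution, the sketch below sums the shares of the two agreeing scores for the Deep Fake Detection row. It assumes that the five middle columns are percentages of ratings for scores 1 to 5 and that scores 4 and 5 count as “real”; the function name is hypothetical.

```python
def real_rate(score_percentages):
    """Share of ratings that weakly agree (4) or clearly agree (5)
    with the statement that the clip looks real."""
    return score_percentages[4] + score_percentages[5]


# Score distribution (in percent) for the Deep Fake Detection row.
dfd = {1: 26.0, 2: 28.0, 3: 24.1, 4: 11.5, 5: 10.3}
rate = real_rate(dfd)
```

The sum of the two rounded shares is about 21.8, which agrees with the reported 21.9% up to rounding of the individual columns.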
Dataset split. In our benchmark, we use the raw manipulated videos described in Section 3.3 and YouTube videos from FaceForensics++ as our standard set. The videos are split into training, validation, and test sets. The identities of the swapped faces may be duplicated, because the face of each invited actor is swapped onto multiple driving videos. To avoid data leakage, we randomly assign non-overlapping identities to the three sets and group all the videos according to their identities. As in prior work, the test and training sets share a close distribution in our standard set.
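An identity-disjoint split of this kind can be sketched as follows. The function name, the data layout, and the example ratios are hypothetical placeholders chosen for illustration; they are not the ratios used in the benchmark.

```python
import random
from collections import defaultdict


def split_by_identity(videos, ratios, rng):
    """Split videos so that no identity appears in more than one subset.

    videos: list of (video_id, identity) pairs.
    ratios: (train, val, test) fractions of identities, summing to 1.
    """
    by_identity = defaultdict(list)
    for video_id, identity in videos:
        by_identity[identity].append(video_id)

    identities = sorted(by_identity)   # sort first so the shuffle is reproducible
    rng.shuffle(identities)
    n_train = round(len(identities) * ratios[0])
    n_val = round(len(identities) * ratios[1])
    groups = {
        "train": identities[:n_train],
        "val": identities[n_train:n_train + n_val],
        "test": identities[n_train + n_val:],
    }
    # Every video follows its identity into exactly one subset.
    return {name: [v for ident in ids for v in by_identity[ident]]
            for name, ids in groups.items()}


videos = [(f"id{i}_clip{j}", f"id{i}") for i in range(10) for j in range(3)]
splits = split_by_identity(videos, (0.6, 0.2, 0.2), random.Random(0))
```

Splitting on identities rather than on individual videos is what prevents a face seen during training from reappearing at test time.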
Other experiments in our benchmark are different variants of the standard set. They share the same driving videos with the standard set. We will detail the variants in Section 4.2. For a fair comparison, all the experiments are conducted in the same split setting.
Hidden test set. For real-world scenarios, some experiments conducted in previous works [30, 37] may not provide a convincing evaluation, due to the large biases caused by a close distribution between the training and the test sets. The aforementioned standard set has the same setting as these works. As a result, strong detection baselines obtain very high accuracy on the standard test set, as demonstrated in Section 4.2. However, the ultimate goal of a face forensics dataset is to help detect forgery in real scenes. Even if the accuracy on the standard test set is high, the models may easily fail in real-world scenarios.
We argue that the test set of real-world face forgery detection should not share a close distribution with the training set. What we need is a test set that better simulates the real-world setting. We call it “hidden” test set. To better imitate fake videos in the real scene, the hidden test set should satisfy three factors: 1) Diverse distortions. Different perturbations should be taken into consideration. 2) Multiple sources. Fake videos in-the-wild should be manipulated by different unknown methods. 3) High quality. Fake videos that are threatening should have high quality to fool human eyes.
Thus, in our initial benchmark, we introduce a challenging hidden test set with carefully selected videos. First, we collect fake videos generated by several unknown face swapping methods to ensure multiple sources. Then, we perturb all selected videos multiple times with diverse hidden distortions that are commonly seen in real scenes. Finally, we only keep videos that fool a required minimum number of human observers in a user study. The ground truth labels are hidden and are used on our host server to evaluate the accuracy of detection models. Besides, the hidden test set will be enlarged continually in future versions, along with the development of Deepfakes technology. Fake videos manipulated by future face swapping methods will be included as long as they pass the human test supported by us.
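The final human-test filter can be sketched as a simple threshold on the fraction of observers a video fools. The function name, the record layout, and the threshold value in the example are hypothetical; the actual selection criterion is the one stated in the user study.

```python
def select_deceptive(videos, min_fooled_fraction):
    """Keep videos that fooled at least the given fraction of observers.

    videos: list of (video_id, n_fooled, n_observers) records.
    """
    return [video_id for video_id, n_fooled, n_observers in videos
            if n_observers > 0 and n_fooled / n_observers >= min_fooled_fraction]


# Toy records: (video id, observers fooled, observers asked).
candidates = [("a", 60, 100), ("b", 10, 100), ("c", 50, 100)]
kept = select_deceptive(candidates, 0.5)   # 0.5 is an illustrative threshold
```

Because the filter only needs per-video vote counts, new videos from future manipulation methods could be admitted with the same rule once they have been rated.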
Existing studies [30, 37] primarily provide image-level face forgery detection benchmarks. However, fake videos in the wild are much more menacing than manipulated images. We propose to conduct evaluation mainly based on video classification methods for two reasons. First, image-level face forgery detection methods do not consider any temporal information, an important cue for video-based tasks. Second, image-level face forgery detection methods have already been widely studied. We choose only one image-level method, XceptionNet, which achieved the best performance in prior work, as one part of our benchmark for reference. The other four video-based baselines are C3D, TSN, I3D, and ResNet+LSTM [17, 19], which have achieved promising results in video classification tasks. Details of all the baselines will be introduced in our Appendix.
Owing to the goal of detecting fakes in real-world scenarios, we mainly explore how common distortions appearing in real scenes affect the model performance. Accuracies of face forgery detection on the standard test set and the introduced hidden test set are evaluated under various settings.
|Train||FF++ DF||FF++ F2F||FF++ FS||FF++ NT||DeeperForensics-1.0|
|ResNet+LSTM [17, 19]||57.38||56.13||54.88||59.50||78.25|
Evaluation of effectiveness of DeeperForensics-1.0. For a fair comparison, we evaluate DeeperForensics-1.0 and the state-of-the-art FaceForensics++ dataset because they use the same driving videos. In this setting, we use raw manipulated videos without distortions in the standard set of DeeperForensics-1.0. For FaceForensics++, the same split is applied to its four subsets. All the models are tested on the hidden test set (see Table 4).
The baselines trained on the standard training set of DeeperForensics-1.0 achieve much better performance on the hidden test set than those trained on any of the four subsets of FaceForensics++. This indicates the higher quality of DeeperForensics-1.0 over prior works, making it more useful for real-world face forgery detection. In Table 4, I3D obtains the best performance on the hidden test set when trained on the standard training set. We conjecture that the temporal discontinuity of fake videos leads to higher accuracy for this video-level forgery detection method.
|Test (acc)||std||std/sing||std/rand||std/sing||std/rand||std/rand||std/sing|
|ResNet+LSTM [17, 19]||100.00||90.63||97.13||100.00||98.63||100.00||97.25|
Evaluation of dataset perturbations. We study the effect of perturbations on forgery detection model performance. In contrast to prior work, we evaluate the baseline accuracies when applying different distortions to the training and the test sets, in order to explore the role of perturbations in a face forensics dataset.
In this setting, we conduct all the experiments on DeeperForensics- dataset with high diversity of perturbations. We use manipulated videos in the standard set (std), manipulated videos with single-level (level-5), random-type distortions (std/sing), manipulated videos with random-level, random-type distortions (std/rand). The data split is the same as the standard set with a ratio of .
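The three settings above can be sketched as follows. This is a minimal illustration, not the dataset's actual pipeline: the three distortion functions, their severity scales, and the assumption of five intensity levels with one distortion type per video are all hypothetical stand-ins.

```python
import numpy as np

# Hypothetical stand-in distortions; frames are float arrays in [0, 1].
def gaussian_noise(frame, level, rng):
    sigma = 0.02 * level  # assumed mapping from level to noise std
    return np.clip(frame + rng.normal(0.0, sigma, frame.shape), 0.0, 1.0)

def contrast_change(frame, level, rng):
    factor = 1.0 - 0.15 * level  # assumed contrast reduction per level
    mean = frame.mean()
    return np.clip((frame - mean) * factor + mean, 0.0, 1.0)

def quantize(frame, level, rng):
    steps = 2 ** (8 - level)  # fewer intensity steps at higher levels
    return np.round(frame * (steps - 1)) / (steps - 1)

DISTORTIONS = {"noise": gaussian_noise, "contrast": contrast_change,
               "quantize": quantize}

def perturb(frames, setting, rng):
    """Return (frames, spec) for 'std', 'std/sing', or 'std/rand'."""
    if setting == "std":
        return frames, None  # clean manipulated video
    names = sorted(DISTORTIONS)
    name = names[int(rng.integers(len(names)))]  # random type
    # std/sing fixes the level at 5; std/rand draws it at random (1..5 assumed).
    level = 5 if setting == "std/sing" else int(rng.integers(1, 6))
    out = np.stack([DISTORTIONS[name](f, level, rng) for f in frames])
    return out, (name, level)
```

Applying one sampled (type, level) pair to every frame of a video keeps the distortion consistent within that video, which is one plausible reading of the description.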
In Column of Table 5, we find the accuracy is nearly when the models are trained and tested on the standard set. This is reasonable because the strong baselines perform very well on a clean dataset with the same distribution. In Columns and , the accuracy decreases compared to Column when we choose std/sing and std/rand as the test set. Most of the video-level methods, except C3D, are more robust to perturbations on the test set than XceptionNet. This situation is very common, because a mismatch between the distributions of the training and the test sets leads to a decrease in model accuracy. Hence, the lack of perturbations in a face forensics dataset limits model performance for real-world face forgery detection, where the data distribution is even more complex.
When we apply corresponding distortions to the training and test sets, the accuracy increases compared to Column and (see Column and in Table 5). However, this setting is impractical because the distributions of the training and test sets are still the same. We should augment the test set to better simulate the real-world distribution. Thus, some evaluation settings in previous works [30, 37] are unreasonable. If we swap the training set and the test set of std/sing and std/rand to further randomize the condition, the results in Column and indicate that the high accuracy remains unchanged. This evaluation setting suggests that, with the same generation method, applying appropriate distortions to the training set can make face forgery detection models more robust to real-world perturbations.
|Train||std||std+std/sing||std+std/rand||std+std/mix|
|ResNet+LSTM [17, 19]||78.25||80.25||79.50||80.25|
Evaluation of variants of training set for real-world face forgery detection. We have conducted several experiments for evaluations of possible perturbations. Nevertheless, the case is more complex in real scenes because information about the fake videos is not available. The video may be subjected to more than one type and diverse levels of distortions. In addition to distortions, the method manipulating the faces is unknown.
From the evaluation of perturbations, we find that augmenting the training set may improve detection model performance. Thus, we further evaluate baseline performance on the hidden test set by devising some variants of the training set. We perform experiments on DeeperForensics-1.0. In this setting, other than std, std/sing, and std/rand, we use more manipulated videos, each of which is subjected to a mixture of three random-level, random-type distortions (std/mix). We combine std with std/sing, std/rand, and std/mix, respectively, to form three new training sets (with the same data split as the former settings).
Column in Table 6 shows the low accuracy when the models are trained on std and tested on the hidden test set (the same as Column in Table 4). Columns and indicate that the accuracy of all the baseline models increases when trained on std+std/sing and std+std/rand. The accuracy of I3D and ResNet+LSTM [17, 19] is over in some cases. In a more complex setting, when the models are trained on std+std/mix, Column shows that the accuracy of all the detection baselines increases further.
These results indicate that designing suitable training set variants has the potential to increase face forgery detection accuracy, and that applying various distortions to ensure the diversity of DeeperForensics-1.0 is necessary. In addition, compared to the image-level method, video-level face forgery detection methods show more potential for detecting real-world fake videos, as shown in Table 6.
Although the accuracy on the challenging hidden test set is still not very high, we provide two initial directions for future real-world face forgery detection research: 1) Improving the source data collection and generation method to ensure the quality of the training set; 2) Augmenting the training set with various distortions to ensure its diversity. We welcome researchers to make our benchmark more comprehensive.
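A minimal sketch of how such training set variants could be assembled. The distortion identifiers, the five-level scale, and the choice to draw three distinct types for std/mix are assumptions made for illustration; the helper names are hypothetical.

```python
import random

# Hypothetical identifiers for a few distortion types and an assumed 1..5 scale.
DISTORTION_TYPES = ["gaussian_blur", "jpeg_compression",
                    "gaussian_noise", "saturation_change"]
LEVELS = range(1, 6)

def sample_spec(setting, rng):
    """Return a list of (type, level) pairs describing the distortions."""
    if setting == "std":
        return []
    if setting == "std/sing":
        return [(rng.choice(DISTORTION_TYPES), 5)]
    if setting == "std/rand":
        return [(rng.choice(DISTORTION_TYPES), rng.choice(LEVELS))]
    if setting == "std/mix":
        # Three random types (assumed distinct), each with a random level.
        types = rng.sample(DISTORTION_TYPES, 3)
        return [(t, rng.choice(LEVELS)) for t in types]
    raise ValueError(f"unknown setting: {setting}")

def build_training_set(video_ids, extra_setting, rng):
    """Combine clean videos (std) with one perturbed variant of each."""
    clean = [(vid, []) for vid in video_ids]
    perturbed = [(vid, sample_spec(extra_setting, rng)) for vid in video_ids]
    return clean + perturbed
```

Calling `build_training_set(ids, "std/mix", random.Random(0))` would yield the std+std/mix variant as a list of (video id, distortion list) pairs.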
In this work, we propose a new large-scale dataset named DeeperForensics-1.0 to facilitate the research of face forgery detection towards real-world scenarios. We make several efforts to ensure the good quality, large scale, and high diversity of the proposed dataset. Based on the dataset, we further benchmark existing representative forgery detection methods, offering insights into the current status and future improvement strategies in face forgery detection. We consider several topics for future work. 1) We will keep collecting identities for source and target videos to expand DeeperForensics gradually over time. 2) We will ask interested researchers for results of future video falsification methods to enlarge our hidden test set, as long as the fakes can pass the human test supported by us. 3) A better evaluation metric for face forgery detection methods is also an interesting research topic.
A deep learning approach to universal image manipulation detection using a new convolutional layer. In IH & MMSEC.
Recasting residual-based local descriptors as convolutional neural networks: an application to image forgery detection. In IH & MMSEC.
Deepfakes: a new threat to face recognition? assessment and detection. arXiv preprint arXiv:1812.08685.
Stacked hourglass networks for human pose estimation. In ECCV.
Derivation of Eq. 4:
In the reconstruction, the source face and target face share the same forms of loss functions. The reconstruction loss of the source face can be written as:
Here, the pixel loss calculates the Mean Absolute Error (MAE) after reconstruction, which can be written as:
The SSIM loss computes the Structural Similarity (SSIM) between the reconstructed face and the original face, which has the form of:
Two hyperparameters control the weights of the two parts of the reconstruction loss. For the target face, the reconstruction loss has a similar form:
Thus, the full reconstruction loss can be written as:
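As a rough illustration of the structure described above, the sketch below combines an MAE pixel term with an SSIM term. Several choices are assumptions rather than the paper's definitions: SSIM is computed once over the whole image instead of over local windows, the SSIM loss is taken as 1 - SSIM, the full loss is taken as the sum of the source and target terms, and the weight names and defaults are hypothetical.

```python
import numpy as np

def pixel_loss(recon, orig):
    """Mean Absolute Error between reconstructed and original face."""
    return float(np.mean(np.abs(recon - orig)))

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole image (a simplification)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)

def ssim_loss(recon, orig):
    # Assumed form: lower is better, zero for identical images.
    return 1.0 - global_ssim(recon, orig)

def reconstruction_loss(recon, orig, w_pixel=1.0, w_ssim=1.0):
    """Weighted sum of the pixel and SSIM terms for one face."""
    return w_pixel * pixel_loss(recon, orig) + w_ssim * ssim_loss(recon, orig)

def full_reconstruction_loss(src_recon, src, tgt_recon, tgt, **weights):
    # Assumed: source and target terms are simply added.
    return (reconstruction_loss(src_recon, src, **weights)
            + reconstruction_loss(tgt_recon, tgt, **weights))
```

Images are expected as float arrays in [0, 1]; the constants `c1` and `c2` follow the common SSIM defaults for that range.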
KL loss. Since DF-VAE is a new conditional variational auto-encoder, the reparameterization trick is used to make the sampling operation differentiable, by introducing an auxiliary variable with an independent marginal. We use the typical KL loss with the form of:
where the index runs over the dimensions of the latent prior, and the mean and s.d. terms are the corresponding elements of the variational mean and s.d. vectors, respectively.
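For reference, the standard closed-form KL term of a Gaussian VAE against a unit Gaussian prior, together with the reparameterization trick, can be sketched as below. This is the textbook form, assumed here to be what "typical KL loss" refers to; the function names are illustrative.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_loss(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))
```

Because the randomness is isolated in `eps`, gradients can flow through `mu` and `sigma` in an autodiff framework; the plain-Python version here only shows the arithmetic.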
MAdaIN loss. The MAdaIN module is jointly trained with the disentangled module in an end-to-end manner. We apply MAdaIN loss for this module, in a similar form as described in . We use the VGG-19  to compute MAdaIN loss to train Decoder :
The content loss is the Euclidean distance between the target features and the features of the swapped face. It has the form of:
where , . is the blurred mask described in Section 3.2.
The style loss matches the mean and standard deviation of the style features. As in prior work, we match the IN statistics instead of using the Gram matrix loss, which can produce similar results. The style loss can be written as:
A weight on the style loss balances the two parts of the MAdaIN loss.
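The two parts of this loss can be sketched on precomputed feature maps. The sketch assumes features of shape (channels, height, width), for example taken from several VGG-19 layers, and it omits the blurred mask; the function names and the default weight are hypothetical.

```python
import numpy as np

def channel_stats(feat):
    """Per-channel mean and standard deviation of a (C, H, W) feature map."""
    flat = feat.reshape(feat.shape[0], -1)
    return flat.mean(axis=1), flat.std(axis=1)

def content_loss(swapped_feat, target_feat):
    """Euclidean distance between swapped-face and target features."""
    return float(np.linalg.norm(swapped_feat - target_feat))

def style_loss(swapped_feats, style_feats):
    """Match instance-norm statistics (mean, std) layer by layer."""
    total = 0.0
    for a, b in zip(swapped_feats, style_feats):
        mu_a, sd_a = channel_stats(a)
        mu_b, sd_b = channel_stats(b)
        total += np.linalg.norm(mu_a - mu_b) + np.linalg.norm(sd_a - sd_b)
    return float(total)

def madain_loss(swapped_feat, target_feat, swapped_feats, style_feats,
                style_weight=1.0):
    """Content term plus weighted style term."""
    return (content_loss(swapped_feat, target_feat)
            + style_weight * style_loss(swapped_feats, style_feats))
```

Since only channel-wise means and standard deviations enter the style term, it ignores spatial layout, whereas the content term compares features position by position.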
Total objective. DF-VAE is an end-to-end many-to-many face swapping framework. We jointly train all parts of the networks. The problem can be described as the optimization of the following total objective:
where the four weight hyperparameters correspond to the four types of loss functions introduced above.
The whole DF-VAE framework is end-to-end. We use the pretrained stacked hourglass networks  to extract landmarks. The numbers of stacks and blocks are set to and , respectively. We exploit FlowNet 2.0 network  to estimate optical flows. The typical AdaIN network  is applied to our style matching and fusion module. The learning rate is set to for all parts of DF-VAE. We utilize Adam  and set , . All the experiments are conducted on NVIDIA Tesla V100 GPUs.
In addition to the dataset-based user study that examines the quality of the DeeperForensics-1.0 dataset, we also carry out a user study to compare DF-VAE with state-of-the-art face manipulation methods. We present this method-based user study in this section.
Baselines. We choose three learning-based open-source methods as our baselines: DeepFakes, faceswap-GAN, and ReenactGAN. These three methods are representative and are based on different architectures. DeepFakes is a well-known method based on Auto-Encoders (AE). It uses a shared encoder and two separate decoders to perform face swapping. faceswap-GAN is based on Generative Adversarial Networks (GAN); it has a structure similar to DeepFakes but also uses paired discriminators to improve face swapping quality. ReenactGAN makes a boundary latent space assumption and uses a transformer to adapt the boundary of the source face to that of the target face. As a result, ReenactGAN can perform many-to-one face reenactment. After getting the reenacted faces, we use our carefully designed fusion method to obtain the swapped faces. For a fair comparison, DF-VAE uses the same fusion method when compared to ReenactGAN.
Results. We randomly choose real videos from DeeperForensics-1.0 as the source videos and real videos from FaceForensics++ as the target videos. Thus, each method generates fake videos. As in the dataset-based user study, we conduct the method-based user study among professional participants who specialize in computer vision research. Because there are corresponding fake videos, we let the users directly choose their preferred fake videos between those generated by other methods and those generated by DF-VAE. Finally, we got answers for each compared pair. The results are shown in Figure 9. DF-VAE shows a clear advantage over the baselines, underscoring the high quality of DF-VAE-generated fake videos.
Fréchet Inception Distance (FID) is a widely used metric for generative models. FID evaluates the similarity between the distribution of the generated images and that of the real images. FID correlates well with the visual quality of the generated samples. A lower FID value means better quality.
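The distance underlying FID is the Fréchet distance between two Gaussians fitted to feature vectors. The sketch below assumes the features (in practice, Inception activations) are already extracted into arrays of shape (samples, dimensions); the function name is illustrative.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative
    # for positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * trace_sqrt)
```

Shifting one feature set by a constant changes only the mean term, which makes the behaviour easy to sanity-check on synthetic data.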
Inception Score (IS) is an early and fairly widely adopted objective evaluation metric for generated images. IS evaluates two aspects of generation quality: articulation and diversity. A higher IS value means better quality.
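A minimal sketch of the Inception Score computation, assuming the per-image class probabilities (normally produced by an Inception classifier) are already available as rows of a list. It uses the standard definition: the exponential of the average KL divergence between each conditional distribution and the marginal.

```python
import math

def inception_score(probs):
    """probs: list of per-image class probability lists (rows sum to 1)."""
    n = len(probs)
    k = len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_sum = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); zero-probability terms contribute nothing.
        kl_sum += sum(pj * math.log(pj / marginal[j])
                      for j, pj in enumerate(p) if pj > 0.0)
    return math.exp(kl_sum / n)
```

Confident predictions (sharp rows) that are spread across many classes (flat marginal) raise the score, which is how the metric rewards both aspects at once.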
Table 7 shows the FID and IS scores of our method compared to other methods. DF-VAE outperforms all the three baselines in quantitative evaluations by FID and IS.
Ablation study of temporal loss. Since the swapped faces do not have ground truth, we evaluate the effectiveness of the temporal consistency constraint, i.e., the temporal loss, in a self-reenactment setting. Following prior work, we quantify the re-rendering error by the per-pixel Euclidean distance in RGB channels ([, ]). Visualized results are shown in Figure 10. Without the temporal loss, the re-rendering error is higher, demonstrating the effectiveness of the temporal consistency constraint.
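The per-pixel error described here can be sketched as follows, assuming both images are arrays of shape (height, width, 3). The value range of the inputs is left to the caller, and the function name is illustrative.

```python
import numpy as np

def rerender_error_map(rendered, reference):
    """Per-pixel Euclidean distance across the RGB channels."""
    diff = rendered.astype(np.float64) - reference.astype(np.float64)
    return np.sqrt((diff ** 2).sum(axis=-1))
```

The result is an (height, width) map that can be visualized as a heat map or averaged into a single score.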
Ablation study of different components. We conduct further ablation studies w.r.t. different components of our DF-VAE framework under many-to-many face swapping setting (see Figure 11). The source and target faces are shown in Column and Column . In Column , our full method, DF-VAE, shows high-fidelity face swapping results. In Column , style mismatch problems are very obvious if we remove the MAdaIN module. If we remove the hourglass (structure extraction) module, the disentanglement of structure and appearance is not very thorough. The swapped face will be a mixture of multiple identities, as shown in Column . When we perform face swapping without constructing unpaired data in the same domain (see Column ), the disentangled module will completely reconstruct the faces on the side of , thus the disentanglement is not established at all. Therefore, the quality of face swapping will degrade if we remove any component in DF-VAE framework.
In this section, we elaborate on the five baselines used in our face forgery detection benchmark. Our benchmark contains four video-level face forgery detection methods: C3D, Temporal Segment Networks (TSN), Inflated 3D ConvNet (I3D), and ResNet+LSTM [17, 19]. One image-level detection method, XceptionNet, which achieves the best performance in FaceForensics++, is evaluated as well.
C3D is a simple but effective method, which incorporates 3D convolution to capture the spatiotemporal features of videos. It includes convolutional, max-pooling, and fully connected layers. The size of the 3D convolutional kernels is . When training C3D, the videos are divided into non-overlapping clips of 16 frames, and the original face images are resized to .
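Splitting a video into non-overlapping 16-frame clips can be sketched as below. Dropping a trailing remainder shorter than one clip is an assumption; the text does not say how leftover frames are handled.

```python
def split_into_clips(frames, clip_len=16):
    """Split a frame sequence into non-overlapping clips of clip_len frames."""
    num_clips = len(frames) // clip_len  # leftover frames are dropped (assumed)
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(num_clips)]
```

For a 40-frame video this yields two clips covering frames 0 to 15 and 16 to 31.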
TSN is a 2D convolutional network, which splits the video into short segments and randomly selects a snippet from each segment as the input. Long-range temporal structure modeling is achieved by fusing the class scores corresponding to these snippets. In our experiment, we choose BN-Inception as the backbone and only train our model with the RGB stream. The number of segments is set to as default, and the original images are resized to .
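The sampling and fusion steps can be sketched as follows. Treating a snippet as a single frame index and fusing scores by averaging are assumptions made for illustration; the number of segments is left as a parameter.

```python
import random

def sample_snippet_indices(num_frames, num_segments, rng):
    """Pick one random frame index from each of num_segments equal segments."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1])
            for i in range(num_segments)]

def fuse_scores(snippet_scores):
    """Average the per-snippet class scores (assumed consensus function)."""
    n = len(snippet_scores)
    return [sum(col) / n for col in zip(*snippet_scores)]
```

Because every segment contributes one snippet, the sampled indices always span the whole video regardless of its length.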
I3D is derived from Inception-V1. It inflates the 2D ConvNet by endowing the filters and pooling kernels with an additional temporal dimension. During training, we use -frame snippets as the input, whose starting frames are randomly selected from the videos. The face images are resized to .
ResNet+LSTM [17, 19] is based on the ResNet architecture. As a 2D convolutional framework, ResNet is used to extract spatial features (the output of the last convolutional layer) for each face image. In order to encode the temporal dependency between images, we place an LSTM module with hidden units after ResNet- to aggregate the spatial features. An additional fully connected layer serves as the classifier. All the videos are downsampled with a ratio of, and the images are resized to before feeding into the network. During training, the loss is the sum of the binary cross-entropy on the output at all time steps, while only the output of the last frame is used for the final classification at inference.
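The training loss and inference rule described above can be sketched on per-time-step fake probabilities. The sketch assumes the classifier already outputs one probability per time step; the 0.5 decision threshold and the function names are hypothetical.

```python
import math

def binary_cross_entropy(p, label, eps=1e-12):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def sequence_loss(step_probs, label):
    """Training loss: sum of the per-step losses over all time steps."""
    return sum(binary_cross_entropy(p, label) for p in step_probs)

def video_prediction(step_probs, threshold=0.5):
    """Inference: only the output at the last time step is used."""
    return int(step_probs[-1] >= threshold)
```

Supervising every time step gives the recurrent module a training signal along the whole sequence, while inference relies on the final state, which has seen all frames.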
XceptionNet is a CNN based on depthwise-separable convolutions, which has been used in FaceForensics++ for image-level face forgery detection. We use the same XceptionNet model but without freezing the weights of any layer during training. The face images are resized to . In the test phase, the prediction is made by averaging the classification scores of all frames within a video.
We also show some examples of perturbations in DeeperForensics-1.0. Seven types of perturbations, along with mixtures of two (Gaussian blur, JPEG compression), three (Gaussian blur, JPEG compression, white Gaussian noise in color components), and four (Gaussian blur, JPEG compression, white Gaussian noise in color components, color saturation change) perturbations, are shown in Figure 13. These perturbations are very common distortions in real life. The comprehensiveness of perturbations in DeeperForensics-1.0 ensures its diversity, so that it better simulates fake videos in real-world scenarios.
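Video-level prediction from frame-level scores can be sketched as below, assuming each frame yields a single fake probability. The 0.5 threshold is a hypothetical choice.

```python
def video_score(frame_scores):
    """Average the per-frame fake probabilities of one video."""
    return sum(frame_scores) / len(frame_scores)

def classify_video(frame_scores, threshold=0.5):
    """Label the video as fake (1) if the averaged score reaches the threshold."""
    return int(video_score(frame_scores) >= threshold)
```

Averaging lets an image-level detector be compared with video-level methods on the same per-video accuracy metric.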