The proliferation of deepfake media is raising concerns among the public and relevant authorities. It has become essential to develop countermeasures against forged faces in social media. This paper presents a comprehensive study on two new countermeasure tasks: multi-face forgery detection and segmentation in-the-wild. Localizing forged faces among multiple human faces in unrestricted natural scenes is far more challenging than the traditional deepfake recognition task. To promote these new tasks, we have created the first large-scale dataset posing a high level of challenges that is designed with face-wise rich annotations explicitly for face forgery detection and segmentation, namely OpenForensics. With its rich annotations, our OpenForensics dataset has great potential for research in both deepfake prevention and general human face detection. We have also developed a suite of benchmarks for these tasks by conducting an extensive evaluation of state-of-the-art instance detection and segmentation methods on our newly constructed dataset in various scenarios. The dataset, benchmark results, codes, and supplementary materials will be publicly available on our project page: https://sites.google.com/view/ltnghia/research/openforensics
Continuing advances in deep learning have led to impressive improvements in deepfake methods (i.e., deep learning-based face forgery), which can change the target person’s identity [32, 12, 64, 42]. Emerging techniques such as autoencoder (AE) models and generative adversarial networks (GANs) enable transferring one person’s face to another person while retaining the original facial expression and head pose [67, 68, 56, 66]. The realistic appearance synthesized with deepfake methods is drawing much attention in the fields of computer vision and graphics because of the potential application of such methods in a wide range of areas [18, 26, 30, 79, 39]. Moreover, falsified AI-synthesized images/videos have raised serious concerns about individual harassment and criminal deception [5, 62, 11]. To address threats posed by spoofing and impersonation attacks, it is essential to develop countermeasures against face forgeries in digital media.
Conventional face forgery recognition methods [1, 53, 55] require the input of given face regions. Therefore, they can process only one face at a time, and processing multiple faces sequentially is time-consuming. Moreover, their performance greatly depends on the accuracy of the independent face detection method used. Given that these methods have been evaluated only in laboratory environments using images with a simple background and a single clear front face [31, 78], they are not ready for deployment in the real world, where the contexts are much more diverse and challenging than simple staged scenarios.
|Dataset||Year||Task||GT Type||Fake Identity||
|DF-TIMIT ||2018||Cls.||Image label||Other videos||1||✗||320||320||640||✗|
|UADFV ||2019||Cls.||Image label||Other videos||1||✗||49||49||98||✗|
|FaceForensics++ ||2019||Cls.||Image label||Other videos||1||✗||1,000||4,000||5,000||✗|
|Google DFD ||2019||Cls.||Image label||Other videos||1||✗||363||3,068||3,431||✗|
|Facebook DFDC ||2020||Cls.||Image label||Other videos||1||✗||48,190||104,500||128,154||✓|
|Celeb-DF ||2020||Cls.||Image label||Other videos||1||✗||590||5,639||6,229||✗|
|DeeperForensics ||2020||Cls.||Image label||Hired actors||1||✗||1,000||1,000||10,000||✓|
|WildDeepfake ||2020||Cls.||Image label||N/A||1||✗||0||707||N/A||✗|
|OpenForensics||2021||Det. / Seg.||BBox/Mask||GAN||✓||45,473||70,325||115,325||✓|
It has thus become essential to develop methods that can effectively process multiple faces simultaneously from an input image. To the best of our knowledge, no methods have been formally proposed for face forgery detection and segmentation. We attribute this partially to the lack of a large-scale dataset for training and testing. To encourage more studies in this field, we present four contributions in this paper.
First, we present a comprehensive study on tasks related to massive face forgery in-the-wild. Particularly, we introduce two new tasks: multi-face forgery detection and segmentation in-the-wild. This is the first formal exploration of these tasks to the best of our knowledge. Previous work has explored only single-face forgery recognition.
Second, we propose generating an infinite number of fake individual identities using GAN models for non-target face-swapping without repeatedly training a deepfake AE. Our proposed forgery workflow reduces the cost of synthesizing fake data.
Third, using the proposed forgery workflow, we introduce a novel image dataset to support the development of multi-face forgery detection and segmentation tasks. Our newly constructed OpenForensics dataset is the first large-scale dataset designed for these tasks. It consists of 115K unrestricted images with 334K human faces. Unlike existing datasets, ours contains various backgrounds and multiple people of various ages, genders, poses, positions, and face occlusions. All images have face-wise rich annotations supporting multiple tasks, such as forgery category, bounding box, segmentation mask, forgery boundary, and general facial landmarks (see the example images and Fig. 1). The dataset can thus support not only multi-face forgery detection and segmentation tasks but also conventional tasks involving the general human face.
Fourth, we present a benchmark suite to facilitate the evaluation and advancement of these tasks. We conducted an extensive evaluation and in-depth analysis of state-of-the-art instance detection and segmentation models in various scenarios.
The whole dataset, evaluation toolkit, and trained models will be freely available on our project page: https://sites.google.com/view/ltnghia/research/openforensics.
Table 1 summarizes basic information about existing forensic datasets. The DF-TIMIT dataset has 640 fake videos crafted from the Vid-TIMIT dataset using Faceswap-GAN. The UADFV dataset consists of 98 videos, half of which are fake, created using FakeAPP. The FaceForensics++ dataset contains 1,000 pristine videos from YouTube and 4,000 synthetic videos manipulated using deepfake methods [12, 67, 32, 68]. The Google DFD dataset includes 3,068 fake videos. The Facebook DFDC dataset contains 128K original and manipulated videos created using various deepfake and augmentation methods [59, 24, 79, 56, 28]. The Celeb-DF dataset comprises YouTube celebrity videos and 5,639 fake videos. The DeeperForensics dataset consists of 10K manipulated videos created using a deepfake VAE and augmentations on 1,000 original videos in the FaceForensics++ dataset. The WildDeepfake dataset contains face sequences extracted from 707 deepfake videos collected from the Internet. As shown in Table 1, our OpenForensics is the first dataset designed for face forgery detection and segmentation.
Existing forensic datasets were created by dividing long videos into short ones, so even distinct pristine videos share the same background. Subsequently synthesizing many fake videos from one pristine video produced many more similar backgrounds. Deep models trained on the existing datasets may not generalize well to the real world because of these repeated backgrounds. In contrast, our large-scale image dataset contains diverse backgrounds. Inspired by the work of Dolhansky et al. and Jiang et al., we systematically applied a mixture of perturbations to raw manipulated images to imitate real-world scenarios. With the existing datasets, a deepfake model needs to be trained on each pair of videos to swap human identities, yielding a considerable number of models requiring training. In contrast, a massive number of fake faces in our dataset are synthesized by GANs without repeatedly re-training deepfake models. While existing datasets were developed for only the single-face forgery classification task, our dataset is the first one designed for multi-face forgery detection and segmentation tasks, which require more annotation than the classification task. Our dataset can also be utilized for various general face-related tasks.
A number of open-source deepfake techniques for swapping human faces have been released [32, 12, 64]. These techniques have gradually evolved from using hand-crafted features to using deep learning by training AE architectures and GAN models to achieve realism. Facial reenactment techniques have been developed for transferring expressions [67, 68, 56]. Different techniques such as 3D reconstruction and neural textures were used to preserve the target skin color and lighting conditions. Boundary latent space and disentangled shape were combined with AE models to morph expressions. In addition to transferring expressions, the head pose can be controlled by using a recurrent neural network to enhance naturalness, by using different modalities, and by using human-interpretable attributes and actions.
Subsequently proposed techniques for face synthesis use deep learning. They generally use GAN for facial attribute translation [7, 8, 28, 29], for identity-attribute combination , for identified characteristics removal , and for interactive semantic manipulation [40, 83]. Facial disentangled features are being interpreted in different latent spaces, resulting in more precise control of attribute manipulation in face editing [28, 29, 65, 60].
Existing deepfake methods require face pairs for specific training, meaning that the cost of training is very high. Training requires sequences of images; thus, these methods are practical only for videos, and the generated faces usually have low resolution. Although existing face synthesis methods can generate high-quality faces, the synthesized faces are oriented to the front and are not consistent with the original faces if the original faces are not close to the distribution of the training data. We combine these two approaches to generate an infinite number of fake human identities without repeatedly training the AEs. We achieve this by transforming GAN-based high-quality synthesized faces into the original poses.
|Dataset||Year||Object Type||#Annotated Images||Ground-Truth Type|
|COCO ||2014||General object||200,000||Coarse mask|
|CityScapes ||2016||Road object||25,000||Coarse&Fine mask|
|WiderFace ||2016||Human face||32,200||Bounding box|
|SESIV ||2019||Salient object||5,700||Fine mask|
|ADV ||2020||Accident object||10,000||Fine mask|
|CAMO++ ||2021||Camouflaged object||5,500||Fine mask|
|OpenForensics||2021||Forged face||115,325||Fine mask|
|Subset||#Images||#Faces||#Real Faces||#Forged Faces|
Researchers have been investigating the problem of face forgery classification, which is generally regarded as merely a binary classification problem (real/fake). The research task is also called ‘deepfake detection,’ but the term ‘detection’ may lead to a misunderstanding of the fundamental task of object detection. Early methods exploited inconsistencies created by visual artifacts in deepfake images and videos by analyzing biological clues such as eye blinking , head pose , skin texture , and iris and teeth color . A few works investigated artifacts in affine face warping  or in the blending boundary  to distinguish real and fake faces. Most current methods are data-driven, directly training deep networks on real and fake images and videos [1, 53, 61, 55, 82, 71]. They do not rely on specific artifacts.
Existing face forgery classification approaches do not have a face localization ability. They can work only on a single cropped face; thus, their performance relies heavily on independent face detection performed as pre-processing. To the best of our knowledge, ours is the first work addressing multi-face detection and segmentation in-the-wild.
The emergence of new tasks and datasets has led to rapid progress in human research areas [77, 13, 54, 20, 19]. However, research on human forgery prevention is only now beginning, and the field is still immature with work only on the face forgery classification task. With this in mind, our goal is to study and develop a dataset that will support challenging new forgery research tasks in both the computer vision and forensic communities.
As shown in Fig. 3, the dataset construction workflow includes three main steps: real human image collection, forged face image synthesis, and multi-task annotation.
We collected raw images from Google Open Images and removed images without people. Images consisting of unreal human faces (e.g., images on money and in books, magazines, cartoons, and sketches) or human-like objects (e.g., dolls, robots, and sculptures) were also removed. We ended up with 45,473 images, which were used as pristine data.
Figure 3 shows an overview of the process used to synthesize forged face images. First, all faces in the real human images are extracted and checked in the manipulation feasibility inspection module to see whether they can be manipulated. This is done using various conditions (e.g., face size, image quality, and blurring) and a random manipulation probability. If manipulation is feasible, the image undergoes a cyclical process. Inspired by GAN-based face synthesis [8, 29], we first extract the facial identity latent vector and modify it using random values. The modified latent vector is then fed into GAN models [65, 60] to generate a new face. The synthesized face is subsequently transformed into the original pose. Feasible manipulation regions in the synthesized face (e.g., regions inside facial landmarks or the entire face) are extracted and blended into the original face using Poisson blending and a color adaptation algorithm in the face-swapping module, with the final result being a new identity. The new identity image is then tested to determine whether it can spoof a simple classifier (e.g., XceptionNet) in the forgery justification module, which is trained to distinguish real and fake identities. Those for which spoofing is successful are overlaid onto the original image. The others are discarded, and new faces are generated. We provide detailed implementation and training of networks in the supplementary material.
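The cyclical process described above can be read as a generate-and-test loop. The following Python sketch illustrates only that control flow; it is not the authors' implementation. The callables `is_feasible`, `generate_face`, `blend`, and `fools_classifier` are hypothetical stand-ins for the four modules, and the `manipulation_prob` default and the `max_attempts` cap are assumptions that the text does not specify.

```python
import random
from typing import Callable, Optional


def synthesize_forged_face(
    face: dict,
    is_feasible: Callable[[dict], bool],
    generate_face: Callable[[dict, random.Random], dict],
    blend: Callable[[dict, dict], dict],
    fools_classifier: Callable[[dict], bool],
    rng: random.Random,
    manipulation_prob: float = 0.5,  # assumed value, not taken from the paper
    max_attempts: int = 10,          # assumed cap; the paper states no limit
) -> Optional[dict]:
    """Return a forged face, or None if the face is left pristine."""
    # Manipulation feasibility inspection: quality conditions plus a random draw.
    if not is_feasible(face) or rng.random() >= manipulation_prob:
        return None
    for _ in range(max_attempts):
        # Perturb the identity latent vector, run the GAN, restore the pose.
        candidate = generate_face(face, rng)
        # Face-swapping module: blending and color adaptation.
        forged = blend(face, candidate)
        # Forgery justification module: keep only faces that spoof the classifier.
        if fools_classifier(forged):
            return forged
        # Otherwise the candidate is discarded and a new face is generated.
    return None
```

Passing the modules in as callables keeps the loop independent of any particular GAN or classifier, which matches the modular description in the text.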
Our synthesis workflow features the ability to synthesize an unlimited number of fake identities at low cost for non-target face-swapping without paired training. Meanwhile, other deepfake methods use a limited number of fake identities extracted from videos and perform paired training using deep models for target face-swapping. They thus require much time and resources to synthesize datasets. Our synthesis approach also overcomes the limitations of existing approaches. Existing approaches [61, 14, 27] generate low-resolution faces, while our approach generates faces with higher resolution and better visual quality (cf. Fig. 2). Our use of Poisson blending and a color adaptation algorithm to reduce the color mismatch between the synthesized and original face (Fig. 2) enhances the naturalness of the forged faces. We also improve the smoothness of the blending mask by extracting 68 facial landmark points and training face segmentation models, resulting in fine boundaries and complete facial coverage (see Fig. 1 for different blending masks). The blending masks used to create existing datasets were either rectangular or rough convex hulls between the eyebrows and lower lip, resulting in incomplete facial coverage or visible boundaries (cf. Fig. 2).
Finally, we randomly split the accepted images into separate training, validation, and test-development sets (ratio of 60:10:30). Table 3 shows the distribution of images and faces in our newly constructed OpenForensics dataset.
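A random 60:10:30 split of this kind can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_dataset`, the fixed `seed`, and the use of integer percentages are assumptions.

```python
import random


def split_dataset(items, ratios=(60, 10, 30), seed=0):
    """Shuffle items and split them into train/val/test-dev parts by percentage."""
    if sum(ratios) != 100:
        raise ValueError("ratios must sum to 100")
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = len(items) * ratios[0] // 100
    n_val = len(items) * ratios[1] // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test_dev = items[n_train + n_val:]  # the remainder goes to test-dev
    return train, val, test_dev
```

Integer arithmetic avoids floating-point rounding at the boundaries, and any leftover items from the floor division fall into the test-dev part.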
To enhance the challenges posed by our OpenForensics dataset for real-world face forgery detection and segmentation, we applied various perturbations to better simulate contexts in natural scenes, resulting in a test-challenge subset. The augmentation operators are divided into six groups:
Color manipulation: Hue change, saturation change, brightness change, histogram adjustment, contrast addition, grayscale conversion.
Edge manipulation: edge detection and alteration.
Block-wise distortion: color grouping, color pooling, color quantization, and pixelation.
Image corruption: elastic deformation, jigsaw distortion, JPEG compression, noise addition, and dropout.
Convolution mask transformation: Gaussian blurring, motion blurring, sharpening, and embossing.
External effect: fog, cloud, sun, frost, snow, and rain.
These augmentations are divided into three intensity levels (i.e., easy, medium, and hard) to ensure diverse scenarios. For each level, random-type augmentation is applied separately or as a mixture, resulting in 45,000 images. Example images in the test-challenge set are shown in Fig. 4.
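The selection of augmentations listed above can be sketched as a sampling plan. The snippet below only chooses which operators to apply and at which level; it performs no image processing. The operator names follow the six groups in the text, while `sample_augmentation_plan`, `mix_prob`, and `max_mix` are hypothetical, since the paper does not state how often or how many operators are mixed.

```python
import random

# Operator names mirror the six groups listed in the text.
AUGMENTATION_GROUPS = {
    "color": ["hue", "saturation", "brightness", "histogram", "contrast", "grayscale"],
    "edge": ["edge_detection_and_alteration"],
    "block": ["color_grouping", "color_pooling", "color_quantization", "pixelation"],
    "corruption": ["elastic_deformation", "jigsaw", "jpeg_compression", "noise", "dropout"],
    "convolution": ["gaussian_blur", "motion_blur", "sharpen", "emboss"],
    "external": ["fog", "cloud", "sun", "frost", "snow", "rain"],
}
LEVELS = ("easy", "medium", "hard")


def sample_augmentation_plan(level, rng, mix_prob=0.5, max_mix=3):
    """Pick one operator, or a mixture from distinct groups, for one image."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    # Assumed policy: with probability mix_prob, mix 2..max_mix operators.
    n_ops = rng.randint(2, max_mix) if rng.random() < mix_prob else 1
    groups = rng.sample(sorted(AUGMENTATION_GROUPS), n_ops)
    return [(g, rng.choice(AUGMENTATION_GROUPS[g]), level) for g in groups]
```

A call such as `sample_augmentation_plan("hard", random.Random(0))` returns a list of `(group, operator, level)` tuples that an image pipeline could then execute.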
Task Diversity. Existing deepfake datasets [61, 14, 27, 46] focus exclusively on video-wise labels for classification. In contrast, we aim to exploit the face-wise ground truth, which requires much more annotation effort, to advance further forgery analysis. Each face was labeled with various ground-truths such as forgery category (real/fake), bounding box, segmentation mask, forgery boundary, and facial landmarks (cf. Fig. 1). Our rich annotation can be utilized for various tasks and even multi-task learning.
Dataset Size. OpenForensics is one of the largest detection and segmentation datasets (cf. Table 2) and is large enough to train and evaluate deep networks. This should encourage more research in this field.
Diverse Scenarios. Existing datasets [61, 14, 27, 46] were released as short videos. Although they contain a vast number of images, frames in a short video are similar and do not contribute much to the training of deep networks. With these datasets, data sampling is usually used for training deep networks to avoid overfitting and to reduce training time. We define similar frames in a short video as a ‘scenario’ and assert that training using a diversity of scenarios helps to make deep networks more effective. Table 1 shows that the OpenForensics dataset is an order of magnitude larger than existing datasets in terms of the number of scenarios, with only slightly fewer than in the DFDC dataset.
Figure 5 panels: (a) scene word cloud, (b) image resolution, (c) faces per image, (d) bounding box size, (e) mask size, (f) face centroid.
Image Scene. Existing deepfake datasets [61, 46] contain limited types of image scenes, such as indoor scenes and television scenes. In contrast, the OpenForensics dataset contains various types of scenes. We computed scenes using a pre-trained model on the large-scale Places2 dataset . Figure 5(a) shows the distribution as a word cloud, with the various outdoor scenes accounting for 36.3% of the images.
Image Resolution. Figure 5(b) shows the distribution of image resolutions in the OpenForensics dataset. The large number of high-resolution images, which provide more face boundary details for model training, results in better performance.
Multiple Faces Per Image. Existing deepfake datasets [61, 14, 27, 46] mostly have only one face per image. In contrast, the OpenForensics dataset has multiple faces per image (2.9 on average). Figure 5(c) shows the distribution.
Face Characteristics. Figures 5(d) and 5(e) show the distribution of faces in the OpenForensics dataset by bounding box size and mask size (i.e., the number of pixels covering the face). OpenForensics contains faces of various sizes, from tiny to large. The distribution of face centroids in Fig. 5(f) shows that the faces tend to be near the image center. In addition, the ratio of male and female faces is 50:50, and there is a diversity of ages. More details are provided in the supplementary material.
Data Augmentation. Deep models trained on existing deepfake datasets may not perform well in the real world due to overfitting caused by image similarity in the training data. Although strong deep models have obtained very high accuracy [53, 43], even near 100%, they may easily fail in the real world if the real-world data do not share a close distribution with the training dataset. To simulate real-world contexts in the OpenForensics dataset, diverse perturbations were used to improve scenario diversity so as to better imitate real-world data distributions. Improvements have been made to a couple of existing datasets by using simple perturbations, which have increased their size. For instance, the DFDC dataset and the DeeperForensics dataset have been improved by applying geometric and color transforms, adding noise, blurring, and overlaying objects.
To evaluate the visual quality of the images in the OpenForensics dataset and human performance in face forgery detection, we conducted a user study with 200 participants, 80 of whom are experts who research deepfakes and can therefore provide knowledgeable opinions. The study results thus reflect the performance of both experts and non-experts.
The study was conducted on the OpenForensics dataset and four existing deepfake datasets: FaceForensics++ , DFDC , Celeb-DF  and DeeperForensics . For each dataset, we randomly selected 600 images and prepared a virtual platform for the participants.
We argue that participants can quickly see that a face is fake if they see two similar images showing different people, leading to an unfair comparison with existing datasets. In addition, forgery identification may become difficult if forged faces are mixed with real faces. To investigate these hypotheses, our user study focused on two cases: cropped faces, which eliminate the surrounding context, and full images with multiple faces.
Evaluation of Image Realism. We cropped the forged heads, which had been doubly extended from the faces, to ensure that the upper half of each person was completely extracted. The participants were asked to view 200 forged head images and then provide feedback on each image’s realism in the form of a score from 1 to 5, corresponding to ‘clearly fake,’ ‘weakly unreal,’ ‘borderline,’ ‘almost real,’ and ‘clearly real.’ As shown by the results in Fig. 6, most of the participants rated the visual quality of the images in the OpenForensics dataset highly. That is, the forged faces in the OpenForensics dataset were judged to be the most realistic. Our dataset achieved the highest mean opinion score (MOS) of 4.0, much higher than that of the second-best dataset, Celeb-DF (3.2). The DeeperForensics and DFDC datasets had medium-quality images (MOS of 2.8). The FaceForensics++ dataset had the most unrealistic images (MOS of only 1.3).
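For reference, the MOS is the arithmetic mean of the individual ratings. The sketch below uses the five-point scale from the text; the function name `mean_opinion_score` and the validation behavior are illustrative assumptions, and the sample ratings in the usage note are made up rather than taken from the study.

```python
from statistics import mean

# Five-point realism scale as described in the text.
REALISM_SCALE = {
    1: "clearly fake",
    2: "weakly unreal",
    3: "borderline",
    4: "almost real",
    5: "clearly real",
}


def mean_opinion_score(scores):
    """Average a collection of 1..5 realism ratings."""
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    if any(s not in REALISM_SCALE for s in scores):
        raise ValueError("scores must be integers from 1 to 5")
    return mean(scores)
```

For example, the illustrative ratings `[5, 4, 3, 4]` average to 4, which falls on ‘almost real.’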
Human Performance on Face Forgery Classification. We again cropped the heads in the same way as for the evaluation of image realism. The participants were asked to view a mixture of 400 images randomly composed of pristine and forged heads with a ratio of 50:50. After viewing each image, the participants were asked whether the image was ‘real’ or ‘fake.’ As shown in Fig. 7, the participants had the most trouble distinguishing between the real and fake images in the OpenForensics dataset. This is evidenced by the OpenForensics dataset having the lowest overall accuracy (59.7%), followed by Celeb-DF (68.7%), DFDC (72.0%), FaceForensics++ (82.0%), and DeeperForensics (82.9%). The graph also shows that both experts and non-experts had difficulty distinguishing between the real and fake images in our dataset. It is interesting that although experts could recognize fake faces better than non-experts, they incorrectly identified real faces with low quality, low resolution, or low contrast (e.g., in the FaceForensics++ dataset). We attribute this to their overconfidence and their belief that GANs might generate such faces, leading to misidentification.
Figure 8 illustrates the correlation between the visual properties and the human ability to recognize forged faces. The ability to recognize forged faces depends on image realism, resulting in an increased false alarm rate as realism improves (i.e., as the MOS increases). The graph shows that a large number of participants misclassified forged faces in the OpenForensics dataset as real faces. The OpenForensics dataset had the highest MOS (4.0) and the highest false alarm rate (34.6%). The figure also shows that the BRISQUE score of the OpenForensics dataset was the lowest (35.2), which indicates that the images in our dataset have the best visual quality. Reducing image quality (i.e., increasing the BRISQUE score) would affect human observation, resulting in a lower false alarm rate.
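The two quantities reported for the classification study can be computed from per-image answers as sketched below. The text does not define the false alarm rate formally; this sketch assumes it is the share of forged faces that participants judged to be real, as the surrounding sentences suggest. The function name and the dictionary keys are hypothetical.

```python
def classification_stats(labels, answers):
    """Compute accuracy and the rate of forged faces judged real.

    labels and answers are equal-length sequences of 'real' or 'fake'.
    """
    labels, answers = list(labels), list(answers)
    if not labels or len(labels) != len(answers):
        raise ValueError("labels and answers must be non-empty and aligned")
    correct = sum(l == a for l, a in zip(labels, answers))
    # Answers given for images whose ground truth is 'fake'.
    forged_answers = [a for l, a in zip(labels, answers) if l == "fake"]
    fooled = sum(a == "real" for a in forged_answers)
    return {
        "accuracy": correct / len(labels),
        # Assumed definition: forged faces mistaken for real ones.
        "forged_judged_real_rate": fooled / len(forged_answers) if forged_answers else 0.0,
    }
```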
Human Performance on Multi-Face Forgery Detection. The participants were asked to view a set of 160 images, each with multiple persons and each consisting of randomly selected pristine and forged faces, of only pristine faces, or of only forged faces. They were asked to identify the number of forged faces in each image. Figure 9 shows that detection accuracy was the highest (86%) when there were no forged faces in the images and tended to drop as the number of forged faces increased. A likely explanation is that when there are many faces in an image, participants check each face less carefully and guess that all the faces are real. This would explain why the accuracy is high when all faces are real and drops significantly when forged faces are present. Indeed, when the number of forged faces exceeded 7, accuracy dropped to 0%. Even people find it extremely difficult to identify forged faces among a mixture of pristine and forged faces in in-the-wild images, highlighting the challenge of our OpenForensics dataset.
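The per-count accuracy described for this study can be tabulated as in the following sketch. It assumes an answer counts as correct only when the reported number of forged faces exactly matches the ground truth, which the text implies but does not state; the function name and input format are hypothetical.

```python
from collections import defaultdict


def accuracy_by_forged_count(responses):
    """Group exact-match accuracy by the true number of forged faces.

    responses is an iterable of (true_count, answered_count) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for true_count, answered_count in responses:
        totals[true_count] += 1
        if answered_count == true_count:  # assumed criterion: exact match
            correct[true_count] += 1
    return {k: correct[k] / totals[k] for k in sorted(totals)}
```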
We conducted a competitive benchmark for multi-face forgery detection and segmentation. To this end, we trained and evaluated the latest instance detection and segmentation methods in various scenarios. The methods were MaskRCNN, MSRCNN, RetinaMask, YOLACT, YOLACT++, CenterMask, BlendMask, PolarMask, MEInst, CondInst, SOLO, and SOLO2. MaskRCNN and MSRCNN are well-known two-stage models that follow a slower detect-then-segment pipeline. The YOLACT models [3, 4] are early single-stage models aimed at real-time performance. The remaining methods are widely used modern single-stage models that overcome accuracy and processing time problems. Among them, the SOLO models [72, 73] directly output masks without computing bounding boxes.
All the methods were used with the same backbone (FPN-ResNet50 [47, 22]) to make the comparison fair. We trained models on PCs with 32 GB of RAM and a Tesla P100 GPU. The models were initialized with ImageNet weights and trained on our training set for 12 epochs. The base learning rate was decreased at two epochs during training. Other settings were in accordance with the default public configurations provided by the authors.
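|Method||Year||Multi-Face Forgery Detection||Multi-Face Forgery Segmentation|
| || ||AP||AP_S||AP_M||AP_L||oLRP||oLRP_Loc||oLRP_FP||oLRP_FN||AP||AP_S||AP_M||AP_L||oLRP||oLRP_Loc||oLRP_FP||oLRP_FN|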
|MaskRCNN ||ICCV 2017||79.2||29.9||80.2||79.5||24.3||9.5||2.7||4.0||83.6||16.1||82.1||85.8||21.2||7.6||3.0||4.2|
|MSRCNN ||CVPR 2019||79.0||29.5||80.1||79.5||24.3||9.6||2.7||3.8||85.1||16.8||84.2||86.8||21.1||7.7||2.6||4.4|
|RetinaMask ||arXiv 2019||80.0||30.9||80.2||80.7||24.2||9.0||3.0||4.6||82.8||16.4||80.6||85.1||22.6||8.1||2.9||4.9|
|YOLACT ||ICCV 2019||68.1||12.5||67.1||69.3||37.2||13.4||6.3||8.7||72.5||3.1||67.0||75.7||34.0||11.4||6.4||8.7|
|YOLACT++ ||TPAMI 2020||72.9||20.9||73.4||73.6||31.5||12.1||4.0||5.8||77.3||6.5||73.9||80.0||28.2||10.0||3.9||6.5|
|CenterMask ||CVPR 2020||85.5||32.0||85.2||86.2||21.1||6.8||3.3||5.9||87.2||16.5||85.0||89.4||21.4||6.1||3.2||7.8|
|BlendMask ||CVPR 2020||87.0||32.7||86.3||88.0||19.5||6.2||2.4||6.2||89.2||19.8||87.3||91.0||18.3||5.4||2.5||6.3|
|PolarMask ||CVPR 2020||85.0||27.4||85.4||85.7||20.7||6.6||2.5||6.6||85.0||15.3||83.3||87.0||21.3||6.9||2.5||6.6|
|MEInst ||CVPR 2020||82.8||26.0||82.7||83.4||23.8||7.6||4.1||6.8||82.2||13.9||81.5||83.3||25.0||8.1||4.0||7.2|
|CondInst ||ECCV 2020||84.0||29.4||83.6||84.8||20.8||7.4||2.3||5.2||87.7||18.1||85.1||89.8||18.3||5.9||2.4||5.3|
|SOLO ||ECCV 2020||-||-||-||-||-||-||-||-||86.6||15.4||85.6||88.4||20.0||6.6||2.1||6.0|
|SOLO2 ||NeurIPS 2020||-||-||-||-||-||-||-||-||85.1||13.7||83.7||87.1||21.5||7.1||3.1||5.8|
We evaluated the methods using standard COCO-style average precision (AP). We report the results for mean AP and AP on different scales (AP_S, AP_M, and AP_L, where S, M, and L represent small, medium, and large objects). We also evaluated the methods using the localization recall precision (LRP) error. We report the results for mean optimal LRP (oLRP) and its error components, including localization (oLRP_Loc), the false positive rate (oLRP_FP), and the false negative rate (oLRP_FN).
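To make the two metrics more concrete, the sketch below computes the intersection over union (IoU) of two boxes, which underlies AP matching, and an LRP error for one fixed confidence threshold. The LRP formula follows the commonly cited definition (localization penalties of matched detections plus false positive and false negative counts, normalized by their total); it is an assumption relative to this paper, which does not restate the formula. The oLRP is then the minimum LRP over confidence thresholds. The function names and the `tau` default are illustrative.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def lrp_error(tp_ious, num_fp, num_fn, tau=0.5):
    """LRP error at one confidence threshold (assumed standard definition).

    tp_ious holds the IoU of each true positive match (each above tau).
    """
    total = len(tp_ious) + num_fp + num_fn
    if total == 0:
        return 0.0
    # Localization penalty of each match, scaled to [0, 1] by the IoU threshold.
    localization = sum((1.0 - iou) / (1.0 - tau) for iou in tp_ious)
    return (localization + num_fp + num_fn) / total


def optimal_lrp(per_threshold):
    """oLRP: minimum LRP over (tp_ious, num_fp, num_fn) tuples, one per threshold."""
    return min(lrp_error(t, fp, fn) for t, fp, fn in per_threshold)
```

Lower LRP is better: perfect localization with no false positives or false negatives gives 0, and the tables in this paper appear to report the value scaled by 100.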
As shown in Fig. 10, BlendMask had the best performance, with the highest AP and lowest oLRP error for both the detection and segmentation tasks on standard images. The other modern single-stage methods also had high performance, and the two-stage methods had medium performance. The YOLACT methods had the worst performance on both tasks because they are mainly focused on real-time processing. YOLACT++ and BlendMask were the most robust for unseen images.
Table 4 shows detailed results for the multi-face forgery detection task broken down by metric. They show that BlendMask had the best performance, achieving the highest AP (87.0) and the lowest oLRP error (19.5). BlendMask also achieved the highest AP for all object scales. The modern single-stage methods (e.g., BlendMask, PolarMask, and CondInst) had minor location errors and false positive rates while the two-stage methods (i.e., MaskRCNN and MSRCNN) had low false negative rates.
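|Method||Year||Multi-Face Forgery Detection||Multi-Face Forgery Segmentation|
| || ||AP||AP_S||AP_M||AP_L||oLRP||oLRP_Loc||oLRP_FP||oLRP_FN||AP||AP_S||AP_M||AP_L||oLRP||oLRP_Loc||oLRP_FP||oLRP_FN|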
|MaskRCNN ||ICCV 2017||42.1||11.8||46.2||40.5||65.4||13.6||29.3||40.0||43.7||4.7||44.3||44.0||64.4||11.8||29.4||41.2|
|MSRCNN ||CVPR 2019||42.2||11.8||45.9||40.8||65.3||13.7||29.6||39.9||43.3||5.2||44.6||43.5||64.1||11.8||30.4||39.6|
|RetinaMask ||arXiv 2019||48.5||12.8||51.0||48.1||63.3||12.6||33.2||34.6||48.0||4.7||46.5||49.7||63.3||11.8||30.9||38.0|
|YOLACT ||ICCV 2019||49.4||5.6||49.6||50.3||60.1||15.3||23.2||29.9||51.8||1.4||47.2||54.6||58.4||13.5||23.4||30.1|
|YOLACT++ ||TPAMI 2020||53.7||11.1||54.0||54.8||57.1||14.1||19.7||29.3||54.7||2.4||50.7||57.9||55.4||12.2||20.0||30.0|
|CenterMask ||CVPR 2020||0.03||0.4||0.0||0.0||99.5||29.7||97.7||97.9||0.02||0.0||0.0||0.0||99.6||28.3||97.9||98.4|
|BlendMask ||CVPR 2020||53.9||13.5||56.6||53.5||60.2||10.6||26.5||37.4||54.0||7.1||54.5||54.5||59.9||9.8||26.4||38.4|
|PolarMask ||CVPR 2020||51.7||12.3||53.2||51.5||60.4||10.7||24.6||39.5||52.7||5.3||54.1||37.6||60.2||10.4||24.7||39.5|
|MEInst ||CVPR 2020||46.1||8.6||49.9||44.9||65.9||12.4||34.6||39.7||46.0||3.8||49.0||45.2||66.2||12.6||34.8||39.8|
|CondInst ||ECCV 2020||52.7||12.6||55.3||51.8||60.7||11.5||28.3||35.3||54.1||6.5||55.2||53.8||59.6||10.0||26.7||37.3|
|SOLO ||ECCV 2020||-||-||-||-||-||-||-||-||55.9||3.9||53.3||57.3||57.6||11.3||24.6||33.0|
|SOLO2 ||NeurIPS 2020||-||-||-||-||-||-||-||-||53.2||3.6||52.1||54.0||59.6||11.0||24.5||37.2|
With the emergence of explainable AI (XAI) technology [15, 21, 35, 37], it is useful to identify manipulated areas in detected faces. Therefore, we also evaluated segmentation performance. As shown in Table 4, the ranking of the methods on the multi-face forgery segmentation task follows trends similar to those on the detection task. BlendMask had the best segmentation performance, with an AP of almost 90 and an oLRP error of approximately 18 on the test-dev set.
Real-world images contain human faces of various sizes. It is thus essential to investigate detection and segmentation abilities at different scales. Table 4 shows that all the baseline methods achieved high performance only for medium-size and large faces. Performance decreased with face size, resulting in a large gap between small faces and medium/large faces in both detection and segmentation. These results illustrate the challenges of our OpenForensics dataset, which contains a wide range of face sizes.
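As an illustration of the scale breakdown, the sketch below assigns a face to the small, medium, or large bucket using the standard COCO area thresholds (32² and 96² pixels). It assumes that the benchmark uses these default COCO thresholds, which the text does not state explicitly. The function name and the example areas are hypothetical.

```python
def coco_size_bucket(area_px: float) -> str:
    """Return the COCO scale bucket for an object of the given pixel area.

    COCO measures area on the segmentation mask; a bounding-box area is
    only an approximation of it.
    """
    if area_px < 32 ** 2:
        return "small"
    if area_px < 96 ** 2:
        return "medium"
    return "large"


if __name__ == "__main__":
    # Hypothetical face areas in pixels (roughly 20x20, 50x50, 100x100).
    for area in (20 * 20, 50 * 50, 100 * 100):
        print(area, coco_size_bucket(area))
```

Under this convention, a face must cover at least 1,024 pixels to leave the small bucket, which is the bucket where all baselines perform worst.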
As in the detection task, we found that single-stage methods, which are based on dense detection, produce fewer false positive (FP) errors, while two-stage methods, which are based on sparse detection, produce fewer false negative (FN) errors. Therefore, developing better post-processing with non-maximum suppression (NMS) and improving the region proposal network (RPN), respectively, can help to improve forgery detectors.
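For readers unfamiliar with the NMS post-processing step mentioned above, here is a minimal sketch of classic greedy NMS on axis-aligned boxes. It is a generic textbook version, not the variant used by any of the benchmarked methods. The box format `(x1, y1, x2, y2)`, the function names, and the 0.5 IoU threshold are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept, in descending score order.

    A box is dropped when it overlaps an already kept, higher-scoring box
    by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    # Two overlapping detections of the same face and one separate face.
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))
```

The IoU threshold trades the two error types against each other: a lower threshold removes more duplicate detections (fewer FPs) but can also suppress a genuine neighboring face in a crowded scene (more FNs).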
We conducted experiments to evaluate the robustness of the methods on our test-challenge set, which simulates real-world scenarios. Table 5 shows that YOLACT++ and BlendMask were the most robust methods for unseen images. CenterMask was the least robust method; we attribute this to its very noisy results, which lead to extremely high false positive and false negative rates.
Tables 4 and 5 show a substantial drop in performance for all methods on unseen images, which lie outside the distribution of the training set. Although existing methods work well on standard images, they are not robust to unseen images. Even leading forgery-identification methods in the deep learning era remain limited and cannot yet effectively address real-world challenges (the top AP on the test-challenge set in Table 5 is only 53.9 for detection and 55.9 for segmentation). Hence, multi-face forgery detection and segmentation in-the-wild are still far from being solved, leaving much room for improvement. These results also illustrate the challenges of our OpenForensics dataset.
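The size of this drop can be quantified with simple arithmetic. The sketch below computes the absolute and relative AP loss from the test-dev set to the test-challenge set for two entries read from the text and tables: BlendMask on detection (87.0 to 53.9) and SOLO on segmentation (86.6 to 55.9). The helper name `ap_drop` is illustrative, and pairing the values in this way assumes that the first table reports test-dev results and the second reports test-challenge results.

```python
from typing import Tuple


def ap_drop(ap_dev: float, ap_challenge: float) -> Tuple[float, float]:
    """Return the (absolute, relative) AP loss from test-dev to test-challenge."""
    absolute = ap_dev - ap_challenge
    relative = absolute / ap_dev
    return absolute, relative


if __name__ == "__main__":
    # (test-dev AP, test-challenge AP), as read from the tables.
    results = {
        "BlendMask (detection)": (87.0, 53.9),
        "SOLO (segmentation)": (86.6, 55.9),
    }
    for name, (dev, challenge) in results.items():
        absolute, relative = ap_drop(dev, challenge)
        print(f"{name}: -{absolute:.1f} AP ({relative:.1%} relative)")
```

Both entries lose more than 30 AP points, which is over a third of their test-dev score.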
As part of our comprehensive study on multi-face forgery detection and segmentation in-the-wild, we created a large-scale dataset. In-depth analysis of our OpenForensics dataset demonstrated its diversity and complexity. We also conducted an extensive benchmark by evaluating state-of-the-art instance segmentation methods in various experimental settings. We expect that our OpenForensics dataset will boost research activities in deepfake prevention. We intend to continue enlarging this dataset to accompany future developments in deepfake technology.
Thanks to the rich annotations in our OpenForensics dataset, there are a number of foreseeable research directions that will provide a solid basis for forgery and general face studies, including fundamental research (e.g., weak/semi-supervised/self-supervised detection/segmentation, universal network for multiple tasks) and specific research (e.g., anti-forgery robustness detection, forgery boundary detection, forgery ranking, face anonymization, face detection/segmentation, facial landmark prediction).
Acknowledgements. This work was partially supported by JSPS KAKENHI Grants JP16H06302, JP18H04120, JP21H04907, JP20K23355, and JP21K18023, and by JST CREST Grants JPMJCR18A6 and JPMJCR20D3, including the AIP challenge program, Japan.
Conference on Computer Vision and Pattern Recognition, pp. 6713–6722. Cited by: §2.2.
StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
The cityscapes dataset for semantic urban scene understanding. In Conference on Computer Vision and Pattern Recognition, pp. 3213–3223. Cited by: Table 2.
A denoising autoencoder, adversarial losses and attention mechanisms for face swapping. Note: https://github.com/shaoanlu/faceswap-GAN [Online; accessed 18-Feb-2021] Cited by: §1, §2.1, §2.2.
International Joint Conference on Artificial Intelligence, pp. 3444–3451. Cited by: §2.3.
Two-stream neural networks for tampered face detection. In Conference on Computer Vision and Pattern Recognition Workshops, pp. 1831–1839. Cited by: §2.3.