Zooming into Face Forensics: A Pixel-level Analysis

12/12/2019 ∙ by Jia Li, et al.

The stunning progress in face manipulation methods has made it possible to synthesize realistic fake face images, which poses potential threats to our society. It is urgent to have face forensics techniques to distinguish those tampered images. The large-scale dataset "FaceForensics++" provides enormous training data generated from prominent face manipulation methods to facilitate anti-fake research. However, previous works focus more on casting it as a classification problem by only considering a global prediction. Through investigation of the problem, we find that training a classification network often fails to capture high-quality features, which might lead to sub-optimal solutions. In this paper, we zoom in on the problem by conducting a pixel-level analysis, i.e. formulating it as a pixel-level segmentation task. By evaluating multiple architectures on both segmentation and classification tasks, we show the superiority of viewing the problem from a segmentation perspective. Different ablation studies are also performed to investigate what makes an effective and efficient anti-fake model. Strong baselines are also established, which, we hope, could shed some light on the field of face forensics.

1 Introduction

Human faces play an important role in human communication, as a face is associated with the identity of a person. Unique face information, working like a fingerprint, has been used in many applications such as phone unlocking and payment, thanks to remarkable progress in face detection and recognition systems [45, 41, 37]. However, we have also seen stunning progress in image and video manipulation methods, which enable editing images or videos in a visually plausible way. Some face-specific manipulation methods [15, 17, 46, 47] are able to manipulate the face image of a person and create an indistinguishable fake image.

Figure 1: Predictions of a classification network and a segmentation network (panels, left to right: Image, Cls, Seg, GT). The second image is the activation map of the classification network showing the high-response area. The third is the heatmap given by the segmentation network. Compared with the ground truth on the right, the segmentation network localizes the tampered pixels far more accurately.

Current face manipulation methods can be roughly divided into two categories, facial reenactment and identity swap. Facial reenactment tries to transfer the facial expressions of one person to another person and synthesize realistic details. Face2Face [47] and NeuralTextures [46] are two representative works. Identity swap is a technique that enables replacing the face of a person with another person’s face. Deepfakes [15] and FaceSwap [17] are two of the most prominent methods. These methods enable effortless creation of fake face images and videos, which poses potential threats to our society. For example, fake news can be easily created by synthesizing a speech video of a politician [43].

To alleviate the potential issues caused by fake face videos and images, great efforts have been dedicated to the field of face forensics, which aims to determine the authenticity of a face photo. General image forensics techniques, relying on hand-crafted cues [18, 33, 5, 30], might not be suitable for face-specific forensics tasks since faces are highly structured data. Recent works take advantage of the great representation power of CNNs (Convolutional Neural Networks) and train a network using a large dataset containing authentic and manipulated face images [36, 1, 49, 6]. In [36], a large-scale dataset called “FaceForensics++” is released to address the problem of face forensics. The dataset contains 5,000 videos generated from 4 popular face manipulation methods, Deepfakes, FaceSwap, Face2Face and NeuralTextures, which provides rich training data as well as a standard benchmark for evaluation.

Most methods for face forensics cast the problem as a classification problem, in which, given an image, the model is expected to determine whether it shows a real face or a manipulated face. Using deep networks has been proved effective in dealing with such a classification problem [36, 49]. In [36], a modified Xception network [8] is trained on the “FaceForensics++” dataset and achieves remarkable results, an accuracy of 99.26% on the raw data. In [1], a compact network also achieves comparable performance. However, one question is raised: “Is the problem well-defined?” or “Is this a good definition of the problem?” In Figure 1, the second image shows the activation map of a classification model, revealing the high-response area for the fake face on the left. It is obvious that the activation map is not actually consistent with the ground truth, which suggests that the features used to distinguish the fake images might have weak correlation with the actual manipulated regions.

The example exposes one of the limitations of a classification network: it can only produce a global scalar value representing the confidence of being fake but cannot reflect to what degree the image has been manipulated. It would be more beneficial to have a pixel-level output that accurately reflects the manipulated pixels, as shown in the third image of Figure 1. Therefore, it is more natural to formulate the problem of face forensics as a semantic segmentation task so that the model is forced to learn discriminative features that localize manipulated regions.

In this paper, we analyze the problem of face forensics from a pixel-level perspective, using segmentation methods to complement the existing classification methods for face forensics. Several questions remain under-investigated, such as: 1) Is face forensics by nature a classification or a segmentation problem? 2) What is the most suitable network architecture for this problem? 3) Should we adopt shallow or deep networks? 4) Should we train the model from scratch or initialize it with general vision features? We conduct experiments to answer these questions. By evaluating various architectures, we compare the performance of the segmentation networks and their counterpart classification networks from different aspects. We hope to provide more insight into the problem and establish a new baseline for the benchmark.

Our contributions are threefold:

  • We conduct a pixel-level analysis of the problem of face forensics by using segmentation methods complementary to the existing classification methods.

  • By redefining the problem to be a pixel-level task, we evaluate various architectures and create a strong new baseline for the problem.

  • By performing different ablation studies, we analyze what makes an effective and efficient anti-fake model, which, we hope, can shed some light on the field of research.

2 Related Work

We cover the most important related papers in the following paragraphs.

2.1 Digital Face Manipulation

A comprehensive state-of-the-art report on digital face manipulation can be found in [52]. Current facial manipulation methods can be separated into four categories: image-based, audio-based, computer-graphics-based, and learning-based approaches.

State-of-the-art image-based approaches include Video Rewrite [7], Video Face Replacement [2], Bringing Portraits to Life [4] and Deep Video Portraits [24]. These methods employ 2D warps to deform the image to match the expressions of a source actor. “Synthesizing Obama” [42] learned the mapping between audio and lip motions.

State-of-the-art computer-graphics-based approaches include Video Face Replacement [13], VDub [19] and Face2Face [47]. These methods usually reconstruct 3D models using blendshapes or other mesh editing processes, building on high-quality 3D face capture as well as precise and rapid tracking techniques.

Recently, generative adversarial networks (GANs) have been used to modify facial attributes such as aging [3], viewpoint [21], skin color [28], smiling [48], or other essential computer graphics renderings [24]. These methods are implemented as image-to-image translation and apply a patch-based GAN loss.

2.2 Face forensics

Face forensics aims to verify the authenticity and origin of a face. Existing works identify computer-generated characters [1], print-scanned morphed faces [34], face splicing [12, 22], face swapping [51, 1], and face reenactment [1, 14]. Specific artifacts arising from the synthesis process, such as color and texture inconsistencies [12] or eye blinking [26], can also be exploited. Learning-based approaches train deep networks to capture the subtle inconsistencies arising from low-level and/or high-level features [1, 51]. In particular, [20] uses a convolutional neural network to extract frame-level features, which are then used to train a recurrent neural network (RNN) that learns to classify whether a video has been manipulated. These approaches show impressive results but cannot precisely locate the manipulated area.

Figure 2: Pipeline of the classification task. Different colors of arrows indicate different stages: blue for the training stage, green for the inference stage. When the classification score is above 0.5, the image is classified as fake and is further processed to obtain the manipulated regions. When the score is below 0.5, indicating a real image, an all-zero mask is produced.

2.3 Pixel-level task

Instead of a rough prediction at the global image level, many works aim to provide a local or pixel-level prediction, such as UNet [35], fully convolutional networks (FCN) [27], and DeepLab for semantic segmentation. As for image generation, pix2pix [23] realizes pixel-level translation between different domains. There are also many applications concerning face parsing [29], pose parsing, and scene segmentation.

As for face forensics, the current mainstream methods are based on global classification. We instead cast face manipulation detection as segmentation in order to predict the locally manipulated regions. Faces in the wild are often occluded by objects, but the faces in the dataset [36] are generally unobstructed, so models can be trained on them directly.

3 Problem Setting

In this section, we first introduce the problem settings and methodologies for both the classification task and the segmentation task. Then we present an overview of the architectures used for evaluation.

3.1 Classification Task

We first revisit the classification task. Formally, let $x$ represent an image containing either a real or a tampered face, and $y \in \{0, 1\}$ represent the label associated with it. We learn a mapping function $f$ to predict the authenticity of a face image. Given a dataset containing $T$ images, the network is trained with the following BCE (Binary Cross Entropy) loss:

$$\mathcal{L}_{cls} = -\frac{1}{T}\sum_{i=1}^{T}\big[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\big] \qquad (1)$$

where $\hat{y}_i$ is the output of the network for the $i$-th sample.

Since a classification network can only map an image to a scalar indicating the probability of the image being tampered, it is unclear whether the model has learned useful features to localize the manipulated regions. There are several interpretation and visualization works [10, 31, 50, 38] that try to reveal more information from a classification network by investigating the activated regions of its feature maps. We adopt the most representative method, CAM (Class Activation Map), to help visualize what the model has learned.

CAM requires that the network have an average pooling layer before the classifier, which collapses the output of the last convolutional layer into a single vector. Suppose the feature maps from the last convolutional layer are $F \in \mathbb{R}^{C \times H \times W}$ and the classifier has weight $w \in \mathbb{R}^{C}$; the activation map $M$ of a tampered face is calculated as:

$$M_{h,w} = \sum_{c=1}^{C} w_{c}\, F_{c,h,w} \qquad (2)$$

where $M_{h,w}$, $w_{c}$ and $F_{c,h,w}$ are entries of $M$, $w$ and $F$ respectively.

What Equation 2 does is actually apply the classifier directly to the feature maps $F$, which performs classification at each spatial location. For simplicity, we modify the original CAM setting by switching the order of the average pooling layer and the classifier. As shown in Figure 2, the activation map can then be viewed as a dense prediction for the image, and the classification score is produced by averaging the activation map to a scalar.

In order to convert $M$ to a pixel-level mask, we further normalize it to the range $[0, 1]$ and quantize it using a threshold. The normalization is performed as:

$$\tilde{M}_{h,w} = \frac{M_{h,w} - \min(M)}{\max(M) - \min(M)} \qquad (3)$$

The final pixel-level prediction is generated as:

$$P_{h,w} = \mathbb{1}\big[\tilde{M}_{h,w} > \tau\big] \qquad (4)$$

where $\mathbb{1}[\cdot]$ is an indicator function and $\tau$ a threshold.

Now we have a pixel-level output that highlights the manipulated regions. Using these outputs makes it easier to investigate and analyze how well the classification model is able to learn discriminative and high-quality features on a pixel-level. Details and analysis are described in Section 4.
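
To make Equations 2-4 concrete, the following PyTorch sketch derives a pixel-level mask from a classification network in the modified CAM setting described above. The `backbone`/`classifier` split, the 0.5 threshold and the upsampling step are our assumptions for illustration, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def cam_pixel_mask(backbone, classifier, image, tau=0.5):
    """Pixel-level mask from a classification network via the modified CAM
    of Section 3.1 (classifier applied before the average pooling).

    backbone   -- module returning feature maps of shape (1, C, h, w)
    classifier -- 1x1 convolution mapping C channels to one logit per location
    image      -- input tensor of shape (1, 3, H, W)
    tau        -- threshold used to quantize the normalized map (assumed 0.5)
    """
    with torch.no_grad():
        feats = backbone(image)                  # (1, C, h, w)
        act_map = classifier(feats)              # Eq. 2: per-location logits, (1, 1, h, w)
        score = torch.sigmoid(act_map.mean())    # global fake score: average, then squash

        # Eq. 3: min-max normalization to [0, 1]
        lo, hi = act_map.min(), act_map.max()
        norm_map = (act_map - lo) / (hi - lo + 1e-8)

        # Eq. 4: quantize with threshold tau
        mask = (norm_map > tau).float()

        # upsample to input resolution for comparison with the ground-truth mask
        mask = F.interpolate(mask, size=image.shape[-2:], mode="nearest")
    return score.item(), mask
```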

Figure 3: Pipeline of segmentation task. The network predicts a pixel-level output and is supervised directly by a pixel-level mask.

3.2 Segmentation Task

Figure 4: Illustration of example images and the corresponding masks from the “FaceForensics++” dataset (columns, left to right: F2F, NT, P, DF, FS). (P: Pristine, DF: DeepFakes, F2F: Face2Face, FS: FaceSwap, NT: NeuralTextures)

A classification network has limited capability to localize manipulated regions at the pixel level because it is supervised only by a global label. Segmentation extends the task to a dense classification problem by assigning a label to each pixel of an image. The model is then forced to learn discriminative features to determine the authenticity of each pixel. Formally, the supervision for an image is defined as a mask $Y \in \{0, 1\}^{H \times W}$ instead of a single label, and the loss is imposed on each pixel:

$$\mathcal{L}_{seg} = -\frac{1}{T \cdot HW}\sum_{i=1}^{T}\sum_{h,w}\big[\, Y_{i,h,w} \log \hat{Y}_{i,h,w} + (1 - Y_{i,h,w})\log(1 - \hat{Y}_{i,h,w}) \,\big] \qquad (5)$$

where $Y_{i,h,w}$ and $\hat{Y}_{i,h,w}$ are the label and the prediction respectively for the $i$-th sample at position $(h, w)$.

Since a segmentation task requires pixel-level masks as supervision, annotating the data is usually time-consuming. For example, as mentioned in [11], a high-resolution street view image for semantic segmentation requires around 1.5 hours of labelling. Fortunately, for the face forensics task the mask can be computed easily by checking the pixel difference between the original image and the forged image, without any extra annotation cost; a sketch of this step is given below. Figure 4 shows some training images from the “FaceForensics++” dataset as well as their corresponding masks indicating the manipulated areas.
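
Since the mask comes for free from the pixel difference between an original frame and its forged counterpart, this step could be sketched as follows; the intensity threshold and the morphological clean-up are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import cv2  # OpenCV, used only for reading frames and simple morphology

def make_manipulation_mask(original_path, forged_path, diff_thresh=10):
    """Binary mask of manipulated pixels obtained by diffing the original and
    the forged frame. diff_thresh is an illustrative tolerance in 8-bit units."""
    orig = cv2.imread(original_path).astype(np.int16)
    fake = cv2.imread(forged_path).astype(np.int16)

    # per-pixel absolute difference, maximum over the three color channels
    diff = np.abs(orig - fake).max(axis=2)
    mask = (diff > diff_thresh).astype(np.uint8)

    # optional: remove isolated noisy pixels so the mask covers coherent regions
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return mask  # 1 = manipulated pixel, 0 = pristine pixel
```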

A classification network can easily be converted to an FCN (Fully Convolutional Network) [27] by replacing the fully connected layers with convolutional layers. The pipeline for training a segmentation network is illustrated in Figure 3. Compared with the classification task in Figure 2, the main difference is that the average pooling is dropped and the BCE loss is applied directly to each pixel. The pixel-level prediction can then be obtained directly from the trained model.

A segmentation model can also be evaluated from a global classification perspective by aggregating the dense prediction:

$$\hat{y} = \mathbb{1}\Big[\frac{1}{HW}\sum_{h,w} \hat{Y}_{h,w} > \tau\Big] \qquad (6)$$

where $\hat{Y}_{h,w}$ represents the prediction at position $(h, w)$ and $\tau$ is the threshold.

In this way, we are able to make a fair comparison between a segmentation network and its counterpart classification network under classification metrics; a short sketch of this aggregation follows. With extensive experiments in Section 4, we show the superiority of the segmentation networks for the face forensics task.
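
The aggregation in Equation 6, which lets a segmentation model be scored with classification metrics, can be sketched as follows (variable names and the 0.5 default threshold are ours; the threshold mirrors the classification pipeline in Figure 2):

```python
import torch

def segmentation_to_classification(pixel_logits, tau=0.5):
    """Aggregate a dense prediction into a global real/fake decision (Eq. 6).

    pixel_logits -- tensor of shape (N, 1, H, W) from a segmentation model
    tau          -- decision threshold on the averaged fake probability
    Returns a boolean tensor of shape (N,): True = classified as fake.
    """
    pixel_probs = torch.sigmoid(pixel_logits)       # per-pixel fake probability
    image_score = pixel_probs.mean(dim=(1, 2, 3))   # average over all pixels
    return image_score > tau
```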

3.3 Architectures

In order to conduct a deep analysis of the classification and segmentation tasks, we choose several representative architectures and evaluate their effectiveness on the problem of face forensics.

Xception [9] is a deep network architecture constructed from a series of modified inception modules [44] in which depthwise separable convolutions are used. In total, 36 convolutional layers form the feature extraction base of the network. The architecture is adopted in [36] for the classification task of face forensics.

MesoInception-4 [1] is a compact and light-weight network designed to address the problem of face forensics. It consists of two inception modules followed by two classic convolutional layers with max-pooling layers. We replace all the operations after the last batchnorm layer with a single convolutional layer as the classifier.

UNet [35] is an effective and popular architecture for pixel-level tasks such as segmentation and pixel-to-pixel translation [23]. A UNet is basically defined by an encoder, consisting of convolutional layers and downsampling operations, and a decoder, consisting of convolutional layers and upsampling operations. Skip connections between the encoder and the decoder pass information from low-level features. We choose two variants of UNet with different numbers of downsampling steps in the encoder: UNet8x and UNet4x downsample by factors of 8 and 4 respectively.

VGG16 [39] is a classic deep network for recognition tasks consisting of 16 weight layers. Since we found that the full network fails to converge on face forensics tasks, we only use two shallow versions, VGG8 and VGG5, containing the first 7 and 4 feature layers of VGG respectively plus a classifier.

FN3 is a 3-layer network we design to explore the potential of shallow networks; a sketch is shown below. The architecture only contains two “Conv-BN-ReLU” blocks and a convolutional layer as the classifier. The first two convolutional layers have kernel size 7 and stride 2. It is interesting that this minimal structure works surprisingly well, even outperforming most of the deep architectures. Please refer to Section 4 for more details.
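
Based on the description above, FN3 could look roughly like the following sketch: two 7x7, stride-2 “Conv-BN-ReLU” blocks followed by a convolutional classifier. The channel widths, the 1x1 classifier kernel and the final upsampling are our assumptions, as the paper does not specify them.

```python
import torch
import torch.nn as nn

class FN3(nn.Module):
    """3-layer fully convolutional network for face forensics segmentation
    (a sketch; channel widths are illustrative)."""

    def __init__(self, channels=(32, 64)):
        super().__init__()
        c1, c2 = channels
        self.features = nn.Sequential(
            # two "Conv-BN-ReLU" blocks, kernel size 7, stride 2
            nn.Conv2d(3, c1, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(c1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c1, c2, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(c2),
            nn.ReLU(inplace=True),
        )
        # classifier: one convolutional layer producing a per-location logit
        # (assumed 1x1 here)
        self.classifier = nn.Conv2d(c2, 1, kernel_size=1)

    def forward(self, x):
        logits = self.classifier(self.features(x))   # (N, 1, H/4, W/4)
        # upsample to input resolution so the per-pixel loss aligns with the GT mask
        return nn.functional.interpolate(
            logits, size=x.shape[-2:], mode="bilinear", align_corners=False)

if __name__ == "__main__":
    out = FN3()(torch.randn(2, 3, 256, 256))
    print(out.shape)  # torch.Size([2, 1, 256, 256])
```

For a 256x256 input, the two stride-2 convolutions reduce the spatial resolution by a factor of 4 before the logits are upsampled back to the input size.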

4 Experiments

4.1 Experiment Setup

Model          DF     F2F    FS     NT     P      Avg
Xception-cls   99.16  98.35  98.88  99.09  99.18  98.93
Mesonet-cls    93.33  77.01  26.77  92.05  88.99  75.63
UNet8x-cls     56.57  33.40  22.96  47.21  92.55  50.50
UNet4x-cls     66.90  45.45  37.48  55.42  98.98  60.80
VGG8-cls       41.69  73.37  67.78  38.81  76.45  59.60
VGG5-cls       56.02  84.92  90.72  40.33  70.66  68.53
FN3-cls        94.35  93.28  81.13  94.26  61.43  84.89
Xception-seg   96.45  97.98  99.02  98.39  99.92  98.35
Mesonet-seg    68.86  79.58  89.77  96.92  59.56  78.94
UNet8x-seg     99.08  98.74  97.17  99.42  66.65  92.21
UNet4x-seg     98.61  97.32  99.05  96.53  97.01  97.70
VGG8-seg       98.41  98.34  99.05  99.01  99.33  98.83
VGG5-seg       98.24  98.29  99.03  99.01  99.99  98.91
FN3-seg        98.16  98.32  99.03  99.06  99.99  98.91
Table 1: Classification accuracy (%) on different manipulation methods. (P: Pristine, DF: DeepFakes, F2F: Face2Face, FS: FaceSwap, NT: NeuralTextures)
Model        | mIoU: DF F2F FS NT P Avg | Bg-IoU: DF F2F FS NT P Avg | Fg-IoU: DF F2F FS NT P Avg
Xception-seg | 89.32 88.18 87.7 62.81 99.95 85.59 | 95.95 93.79 94.19 41.94 99.95 85.16 | 82.7 82.56 81.21 83.67 - 82.54
Mesonet-seg  | 56.58 51.14 54.52 40.23 90.2 58.53 | 78.96 71.06 74.68 22.02 90.2 67.38 | 34.19 31.21 34.35 58.44 - 39.55
UNet8x-seg   | 87.83 86.97 85.02 50.51 86.02 79.27 | 94.7 92.32 91.82 28.27 86.02 78.63 | 80.96 81.62 78.22 72.75 - 78.39
UNet4x-seg   | 89.12 89.43 86.29 51.46 96.09 82.48 | 95.41 93.89 92.59 30.68 96.09 81.73 | 82.82 84.96 79.99 72.25 - 80.00
VGG8-seg     | 94.68 95.21 94.33 76.04 99.31 91.91 | 97.87 97.34 97.19 59.42 99.31 90.23 | 91.48 93.07 91.47 92.67 - 92.17
VGG5-seg     | 95.78 96.21 94.81 75.6 99.86 92.45 | 98.36 97.94 97.51 58.97 99.86 90.53 | 93.21 94.48 92.11 92.23 - 93.01
FN3-seg      | 92.68 93.05 89.01 64.42 99.72 87.78 | 97.16 96.24 94.81 43.89 99.72 86.36 | 88.21 89.86 83.21 84.95 - 86.56
Xception-cls | 47.6 58.98 56.21 58.83 99.23 64.17 | 59.9 71.9 75.62 23.27 99.23 65.98 | 35.3 46.06 36.8 52.95 - 42.78
Mesonet-cls  | 45.96 37.14 37.48 24.78 95.46 48.16 | 67.87 55.75 65.08 13.93 95.46 59.62 | 24.05 18.53 9.88 35.63 - 22.02
UNet8x-cls   | 23 33.6 34.82 29.71 86.39 41.5 | 28.63 49.62 53.8 11.3 86.39 45.94 | 17.42 17.58 15.84 48.11 - 24.74
UNet4x-cls   | 22.3 32.95 34.38 35.14 97.59 44.47 | 26.25 46.11 51.92 13.99 97.59 47.17 | 18.35 19.79 16.83 56.29 - 27.82
VGG8-cls     | 28.73 23.45 26.12 29.84 63.66 34.36 | 40.91 21.65 30.72 12.1 63.66 33.81 | 16.54 25.25 21.51 47.57 - 27.72
VGG5-cls     | 39.18 37.92 38.85 15.73 80.39 42.41 | 66.81 63.83 69.56 13.51 80.39 58.82 | 11.56 12.01 8.14 17.96 - 12.42
FN3-cls      | 16.77 18.46 20.47 43.84 48.68 29.64 | 14.58 10.63 17.9 8.09 48.68 19.97 | 18.97 26.29 23.04 79.59 - 36.97
Table 2: Segmentation results (IoU, %) for different architectures. A “-” marks the undefined Fg-IoU of pristine images, which contain no manipulated pixels. (P: Pristine, DF: DeepFakes, F2F: Face2Face, FS: FaceSwap, NT: NeuralTextures)

Dataset: FaceForensics++ [36] is a large-scale face forensics dataset consisting of 5,000 video clips in total. Video sequences are crawled from the internet and manually screened to ensure high quality and avoid face occlusion, resulting in 1,000 original videos. Four manipulation methods, Deepfakes, Face2Face, FaceSwap and NeuralTextures, are applied to create forged videos, resulting in 4,000 fake clips. The dataset also provides data at three different compression levels: raw, HQ and LQ. We only focus on the raw quality task because low-quality videos usually suffer from a strong loss of visual and identity information, which makes them less likely to be abused than clear ones. [36] also suggests a split of 720 videos for training, 140 for validation, and 140 for testing; we follow the same setting.

Evaluation protocol and metrics: In [36], there are two types of training protocols, method-specific training and mixed training. The former involves forged data from only one of the manipulation methods. The latter requires training a model with all the real and forged data, and performance is evaluated on each specific method. We only adopt mixed training as it poses a more challenging task and a more realistic scenario. The evaluation is frame-based; therefore we extract all frames for the training set and a subset of frames (every 10th frame) for validation and testing.

In terms of evaluation metrics, we use classification accuracy for the classification tasks, which represents how many test images are correctly classified. For segmentation tasks, IoU (Intersection over Union) is used, defined as the ratio $\mathrm{IoU} = \frac{TP}{TP + FP + FN}$, where TP (True Positive), FP (False Positive) and FN (False Negative) are counted over pixels. The IoU is calculated for both foreground and background, denoted as Fg-IoU and Bg-IoU, and the two are averaged to get mIoU, the mean IoU.
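
A sketch of how these metrics can be computed from binary masks; returning NaN for an absent class matches the “-” entries reported for pristine images in Table 2 (this handling is our assumption):

```python
import numpy as np

def iou(pred, gt, cls):
    """IoU = TP / (TP + FP + FN) for one class, computed over pixels."""
    tp = np.logical_and(pred == cls, gt == cls).sum()
    fp = np.logical_and(pred == cls, gt != cls).sum()
    fn = np.logical_and(pred != cls, gt == cls).sum()
    denom = tp + fp + fn
    return tp / denom if denom > 0 else float("nan")  # undefined when class absent

def segmentation_scores(pred, gt):
    """pred, gt: binary arrays of the same shape (1 = manipulated, 0 = pristine)."""
    fg = iou(pred, gt, 1)          # Fg-IoU over manipulated pixels
    bg = iou(pred, gt, 0)          # Bg-IoU over pristine pixels
    miou = np.nanmean([fg, bg])    # mean of the defined IoUs
    return {"Fg-IoU": fg, "Bg-IoU": bg, "mIoU": miou}
```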

Implementation details: In face forensics, faces are the most important regions. As shown in [36], a model trained on whole images performs poorly. Therefore, instead of using the whole image, we extract the faces as a pre-processing step using a public face detection tool [16] and only use the face regions to train the models (a sketch of this step is given below). In order to include more background information, we enlarge the bounding box to a scale of 2. The segmentation masks are calculated by checking the difference between a manipulated face image and its corresponding original image. For segmentation tasks, the images are randomly cropped to 256x256 and the same operation is applied to the corresponding masks. For classification tasks, it is necessary to include most of the face region in the crop; therefore, the shorter side of the image is first resized to 256, then a 256x256 patch is cropped from the resized image.
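
A rough sketch of the face-cropping pre-processing using the public face detection tool [16] (the face_recognition library); the way we center and enlarge the box by the scale factor, and the clipping at image borders, are our interpretation rather than the authors' exact procedure.

```python
import face_recognition  # https://github.com/ageitgey/face_recognition

def crop_enlarged_face(image_path, scale=2.0):
    """Detect the face and crop a box enlarged by `scale` around its center,
    so that some background context is kept."""
    image = face_recognition.load_image_file(image_path)   # RGB ndarray
    locations = face_recognition.face_locations(image)
    if not locations:
        return None                                        # no face found
    top, right, bottom, left = locations[0]                # take the first face

    cy, cx = (top + bottom) / 2.0, (left + right) / 2.0
    h, w = (bottom - top) * scale, (right - left) * scale

    # clip the enlarged box to the image borders
    y0 = max(int(cy - h / 2), 0)
    y1 = min(int(cy + h / 2), image.shape[0])
    x0 = max(int(cx - w / 2), 0)
    x1 = min(int(cx + w / 2), image.shape[1])
    return image[y0:y1, x0:x1]
```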

The implementation is based on PyTorch [32]. All models are trained using the Adam [25] optimizer. Since the Adam optimizer can adjust the learning rate dynamically, we only set a default learning rate and do not use any learning rate decay policies. The batch size is set to 64. A sketch of the resulting training step is shown below.
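
Putting the pieces together, a hypothetical training step for one of the segmentation models might look like the sketch below. The learning rate and Adam betas shown are common defaults used as placeholders, since the exact values are not restated here; the batch size of 64 would be set in the DataLoader.

```python
import torch
import torch.nn as nn

def train_segmentation(model, loader, device="cuda", lr=1e-3, epochs=1):
    """Minimal training loop with pixel-wise BCE supervision (Eq. 5).

    loader is expected to yield (images, masks) with images of shape
    (N, 3, 256, 256) and binary masks of shape (N, 1, 256, 256).
    lr and the Adam betas are placeholders, not values from the paper.
    """
    model = model.to(device).train()
    criterion = nn.BCEWithLogitsLoss()                  # BCE applied to every pixel
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))

    for _ in range(epochs):
        for images, masks in loader:
            images, masks = images.to(device), masks.to(device).float()
            logits = model(images)                      # (N, 1, 256, 256)
            loss = criterion(logits, masks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```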

4.2 Experimental Results

Classification task: Table 1 shows the classification accuracy of different architectures. The suffixes “-seg” and “-cls” denote a segmentation model and a classification model respectively. The pixel-wise output is aggregated to a global output according to Equation 6. Among the classification models, Xception-cls reaches the best performance, which is consistent with [36]. It can be seen that UNet, a popular model for various pixel-level prediction tasks, fails to perform well in the classification task. It is interesting to see that FN3-cls, a minimal structure with only 3 layers, works surprisingly well: although its performance is lower than Xception-cls, it achieves far better performance than the remaining classification models. Most of the segmentation models obtain better classification results than their counterpart classification models, which shows the benefit of training models under pixel-level supervision.

Segmentation task: Table 2 shows the segmentation results of different architectures. The classification models are trained using a global image-level label and their pixel-level outputs are obtained via CAM, as explained in Section 3.1. Among the segmentation models, VGG5-seg achieves the best performance in terms of mIoU, Bg-IoU, and Fg-IoU. Mesonet-seg, although a compact and efficient architecture, does not achieve comparable results and is outperformed by the other methods by a large margin; we suspect this is due to the limited capacity of the model. It is also worth noting that UNet, despite being a popular segmentation architecture, still does not reach promising results. In contrast, the 3-layer network FN3-seg shows better potential, performing even better than Xception-seg. Among the classification models, Xception-cls achieves the best results on most scores, which implies that Xception-cls can learn high-quality features to locate manipulated regions even when trained with only a global image-level label. However, Xception-cls can hardly compete with its segmentation counterpart, which obtains far higher scores. The remaining classification models all suffer from low scores; even VGG5-cls, whose segmentation counterpart achieves the best results, is unable to produce plausible predictions without pixel-level supervision.

From the results above, segmentation models clearly show superiority over classification models in terms of both pixel-level and global-level prediction. Figure 5 shows outputs of different architectures, which further illustrates the benefit of analyzing fake faces at the pixel level.

Figure 5: Qualitative results of the classification and segmentation models (rows: DF, F2F, FS, NT; columns: Image & GT, Xception, Mesonet, UNet8x, UNet4x, VGG8, VGG5, FN3). Each pair of rows relates to a specific manipulation method. For each method, on the left are the input fake image and the ground truth indicating the manipulated area. The upper row shows the pixel-level results of the classification models, and the lower row displays the predictions of the segmentation models. (DF: DeepFakes, F2F: Face2Face, FS: FaceSwap, NT: NeuralTextures)



Figure 6: Kernel visualization of VGG5 (columns: Image & GT, conv1, conv2, conv3, conv4). The left column shows the input fake image and the ground truth. Each column on the right shows kernels from a specific convolutional layer.

4.3 Analysis

Deep vs Shallow

To explore the effect of model depth on the task of face forensics, we take a closer look at the performance of models with different depths. In Table 3, we summarize the mIoUs of segmentation models with different depths. Apart from VGG8 and VGG5, we also include VGG3, which only uses the first two layers of VGG16 followed by a classifier. It is interesting that the deep model, Xception with 36 layers, does not reach a high score, whereas the shallow models show better abilities. This suggests that face forensics should be treated as a low-level vision problem rather than a high-level perception problem.


Model          DF     F2F    FS     NT     P      Avg
Xception (36)  89.32  88.18  87.7   62.81  99.95  85.59
VGG8 (7)       94.68  95.21  94.33  76.04  99.31  91.91
VGG5 (4)       95.78  96.21  94.81  75.6   99.86  92.45
FN3 (3)        92.68  93.05  89.01  64.42  99.72  87.78
VGG3 (3)       88.79  89.92  79.65  57.93  96.64  82.58
Table 3: Comparison among models with different depths. The number in parentheses indicates the depth of the model. The numbers are mIoU.

Pretrained or From Scratch

As implied by the analysis in the last section, face forensics behaves more like a low-level vision task. Another question is: can the models benefit from features learned for general vision recognition tasks? We conduct another ablation study in which we compare performance on the segmentation task using models with and without ImageNet pretraining. The results are shown in Table 4. According to the numbers, there is little difference between the pretrained models and the models trained from scratch: features learned on a general vision recognition task such as ImageNet did not help the models quickly find a better local optimum.

Model                      DF     F2F    FS     NT     P      Avg
Xception (pretrained)      89.32  88.18  87.7   62.81  99.95  85.59
Xception (non-pretrained)  88.72  87.88  88.70  62.84  99.74  85.57
VGG5 (pretrained)          95.78  96.21  94.81  75.6   99.86  92.45
VGG5 (non-pretrained)      95.69  96.2   94.75  75.35  99.86  92.37
VGG8 (pretrained)          94.68  95.21  94.33  76.04  99.31  91.91
VGG8 (non-pretrained)      95.67  95.93  95.06  75.18  99.83  92.33
Table 4: Comparison between pretrained models and models trained from scratch. The numbers are mIoU.

Kernel visualization

In order to better understand the features learned by the model, we analyze the kernels by visualizing them using the technique in [40]; a sketch of this visualization is given below. In Figure 6, for each fake image, we visualize two kernels in each convolutional layer. Apart from the features in conv1, which are mostly low-level edges and corners, the kernels in the following layers do not make much sense to us as humans. Intuitively, the model tries to learn subtle features to which humans are not sensitive. Humans are good at recognizing things on a semantic level, but fake faces generated by advanced manipulation methods seem to be beyond human ability to spot. This further emphasizes the need for a good face forensics model.
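
The technique in [40] is commonly implemented as guided backpropagation, which visualizes, in input space, what a chosen convolutional unit responds to. The sketch below is our reading of that procedure, not the authors' code; the layer and unit to inspect are chosen by the analyst.

```python
import torch
import torch.nn as nn

def guided_backprop(model, image, layer, unit):
    """Guided backpropagation [40]: input-space visualization of what a chosen
    convolutional unit responds to. `layer` is a module inside `model` and
    `unit` a channel index of that layer's output."""
    model.eval()
    handles = []

    # during the backward pass, only let positive gradients flow through ReLUs;
    # the ReLU backward already zeroes gradients where the forward input was negative
    def relu_hook(module, grad_in, grad_out):
        return (torch.clamp(grad_in[0], min=0.0),)

    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.inplace = False  # in-place ReLU interferes with backward hooks
            handles.append(m.register_full_backward_hook(relu_hook))

    # grab the activation of the chosen layer with a forward hook
    activation = {}
    def fwd_hook(module, inp, out):
        activation["value"] = out
    handles.append(layer.register_forward_hook(fwd_hook))

    image = image.clone().requires_grad_(True)
    model.zero_grad()
    model(image)
    # backpropagate from the mean activation of the chosen channel
    activation["value"][:, unit].mean().backward()

    for h in handles:
        h.remove()
    return image.grad.detach()   # input-space visualization for that unit
```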

5 Conclusion

Face forensics has become increasingly important as face manipulation methods have made stunning progress, enabling effortless generation of indistinguishable fake face images. Most previous works cast the problem as a classification task, which suffers from limitations. In this paper, we analyze the problem from a pixel-level perspective by using segmentation methods to complement the traditional classification methods. With comprehensive experiments, we show the superiority of formulating it as a segmentation problem instead of a classification problem. In addition, we perform different ablation studies to analyze the important factors in building an effective face forensics model, and we establish a strong new baseline for the benchmark. We hope that our analysis can provide more insight into the field of face forensics.

References

  • [1] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen (2018-12) MesoNet: a compact facial video forgery detection network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Vol. , pp. 1–7. External Links: Document, ISSN Cited by: §1, §1, §2.2, §3.3.
  • [2] M. Afifi, K. Hussain, H. Ibrahim, and N. Omar (2014-12) Video face replacement system using a modified poisson blending technique. 2014 International Symposium on Intelligent Signal Processing and Communication Systems, ISPACS 2014, pp. . External Links: Document Cited by: §2.1.
  • [3] G. Antipov, M. Baccouche, and J. Dugelay (2017-Sep.) Face aging with conditional generative adversarial networks. In 2017 IEEE International Conference on Image Processing (ICIP), Vol. , pp. 2089–2093. External Links: Document, ISSN Cited by: §2.1.
  • [4] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and M. F. Cohen (2017) Bringing portraits to life. ACM Transactions on Graphics (Proceeding of SIGGRAPH Asia 2017) 36 (6), pp. 196. Cited by: §2.1.
  • [5] M. Barni, L. Bondi, N. Bonettini, P. Bestagini, A. Costanzo, M. Maggini, B. Tondi, and S. Tubaro (2017-08) Aligned and non-aligned double jpeg detection using convolutional neural networks. Journal of Visual Communication and Image Representation 49, pp. . External Links: Document Cited by: §1.
  • [6] B. Bayar and M. C. Stamm (2016) A deep learning approach to universal image manipulation detection using a new convolutional layer. In Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security, IH&MMSec ’16, New York, NY, USA, pp. 5–10. External Links: ISBN 978-1-4503-4290-2, Link, Document Cited by: §1.
  • [7] C. Bregler, M. Covell, and M. Slaney (1997) Video rewrite: driving visual speech with audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’97, New York, NY, USA, pp. 353–360. External Links: ISBN 0-89791-896-7, Link, Document Cited by: §2.1.
  • [8] F. Chollet (2017-07) Xception: deep learning with depthwise separable convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [9] F. Chollet (2017-07) Xception: deep learning with depthwise separable convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.3.
  • [10] R. G. Cinbis, J. Verbeek, and C. Schmid (2017-01) Weakly supervised object localization with multi-fold multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (1), pp. 189–203. External Links: Document, ISSN Cited by: §3.1.
  • [11] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.2.
  • [12] T. J. d. Carvalho, C. Riess, E. Angelopoulou, H. Pedrini, and A. d. R. Rocha (2013-07) Exposing digital image forgeries by illumination color classification. IEEE Transactions on Information Forensics and Security 8 (7), pp. 1182–1194. External Links: Document, ISSN Cited by: §2.2.
  • [13] K. Dale, K. Sunkavalli, M. K. Johnson, D. Vlasic, W. Matusik, and H. Pfister (2011) Video face replacement. In Proceedings of the 2011 SIGGRAPH Asia Conference, SA ’11, New York, NY, USA, pp. 130:1–130:10. External Links: ISBN 978-1-4503-0807-6, Link, Document Cited by: §2.1.
  • [14] D. Dang-Nguyen, G. Boato, and F. G. B. De Natale (2012-12) Identify computer generated characters by analysing facial expressions variation. In 2012 IEEE International Workshop on Information Forensics and Security (WIFS), Vol. , pp. 252–257. External Links: Document, ISSN Cited by: §2.2.
  • [15] Deepfakes github. Note: https://github.com/deepfakes/faceswap Cited by: §1, §1.
  • [16] Facerecognition github. Note: https://github.com/ageitgey/face_recognition Cited by: §4.1.
  • [17] Faceswap. Note: https://github.com/MarekKowalski/FaceSwap Cited by: §1, §1.
  • [18] H. Farid (2016) Photo forensics. The MIT Press. External Links: ISBN 0262035340, 9780262035347 Cited by: §1.
  • [19] P. Garrido, L. Valgaerts, H. Sarmadi, I. Steiner, K. Varanasi, P. Pérez, and C. Theobalt (2015-05) VDub: modifying face video of actors for plausible visual alignment to a dubbed audio track. Computer Graphics Forum 34, pp. . External Links: Document Cited by: §2.1.
  • [20] D. Güera and E. J. Delp (2018-11) Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Vol. , pp. 1–6. External Links: Document, ISSN Cited by: §2.2.
  • [21] R. Huang, S. Zhang, T. Li, and R. He (2017-10) Beyond face rotation: global and local perception gan for photorealistic and identity preserving frontal view synthesis. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.
  • [22] M. Huh, A. Liu, A. Owens, and A. A. Efros (2018-09) Fighting fake news: image splice detection via learned self-consistency. In The European Conference on Computer Vision (ECCV), Cited by: §2.2.
  • [23] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2016) Image-to-image translation with conditional adversarial networks. arXiv. Cited by: §2.3, §3.3.
  • [24] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, N. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt (2018) Deep Video Portraits. ACM Transactions on Graphics 2018 (TOG). Cited by: §2.1, §2.1.
  • [25] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • [26] Y. Li, M. Chang, and S. Lyu (2018-12) In ictu oculi: exposing ai created fake videos by detecting eye blinking. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Vol. , pp. 1–7. External Links: Document, ISSN Cited by: §2.2.
  • [27] J. Long, E. Shelhamer, and T. Darrell (2015-06) Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 3431–3440. External Links: Document, ISSN Cited by: §2.3, §3.2.
  • [28] Y. Lu, Y. Tai, and C. Tang (2018-09) Attribute-guided face generation using conditional cyclegan. In The European Conference on Computer Vision (ECCV), Cited by: §2.1.
  • [29] P. Luo (2012) Hierarchical face parsing via deep learning. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR ’12, Washington, DC, USA, pp. 2480–2487. External Links: ISBN 978-1-4673-1226-4, Link Cited by: §2.3.
  • [30] J. F. O’Brien and H. Farid (2012-01) Exposing photo manipulation with inconsistent reflections. ACM Transactions on Graphics 31 (1), pp. 4:1–11. Note: Presented at SIGGRAPH 2012 External Links: Document, Link Cited by: §1.
  • [31] M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2014-06) Learning and transferring mid-level image representations using convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 1717–1724. External Links: Document, ISSN Cited by: §3.1.
  • [32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §4.1.
  • [33] A.C. Popescu and H. Farid (2005-02) Exposing digital forgeries by detecting traces of resampling. Trans. Sig. Proc. 53 (2), pp. 758–767. External Links: ISSN 1053-587X, Link, Document Cited by: §1.
  • [34] R. Raghavendra, K. B. Raja, S. Venkatesh, and C. Busch (2017-07) Transferable deep-cnn features for detecting digital and print-scanned morphed face images. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vol. , pp. 1822–1830. External Links: Document, ISSN Cited by: §2.2.
  • [35] O. Ronneberger, P.Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351, pp. 234–241. Note: (available on arXiv:1505.04597 [cs.CV]) External Links: Link Cited by: §2.3, §3.3.
  • [36] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Niesner (2019) FaceForensics++: learning to detect manipulated facial images. In International Conference on Computer Vision (ICCV), Cited by: §1, §1, §2.3, §3.3, §4.1, §4.1, §4.1, §4.2.
  • [37] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering.. In CVPR, Cited by: §1.
  • [38] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017-10) Grad-cam: visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §3.1.
  • [39] K. Simonyan and A. Zisserman (2015-05) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §3.3.
  • [40] J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2015) Striving for simplicity: the all convolutional net. In ICLR (workshop track), External Links: Link Cited by: §4.3.
  • [41] Y. Sun, Y. Chen, X. Wang, and X. Tang (2014) Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pp. 1988–1996. Cited by: §1.
  • [42] S. Suwajanakorn, S. Seitz, and I. Kemelmacher (2017-07) Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics 36, pp. 1–13. External Links: Document Cited by: §2.1.
  • [43] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman (2017-07) Synthesizing obama: learning lip sync from audio. ACM Trans. Graph. 36 (4), pp. 95:1–95:13. External Links: ISSN 0730-0301 Cited by: §1.
  • [44] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015-06) Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.3.
  • [45] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2013) Deepface: closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708. Cited by: §1.
  • [46] J. Thies, M. Zollhöfer, and M. Niessner (2019-07) Deferred neural rendering: image synthesis using neural textures. ACM Trans. Graph. 38 (4), pp. 66:1–66:12. External Links: ISSN 0730-0301, Link, Document Cited by: §1, §1.
  • [47] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Niessner (2016-06) Face2Face: real-time face capture and reenactment of rgb videos. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §2.1.
  • [48] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Q. Weinberger (2017-07) Deep feature interpolation for image content changes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [49] S. Wang, O. Wang, A. Owens, R. Zhang, and A. A. Efros (2019) Detecting photoshopped faces by scripting photoshop. arXiv preprint arXiv:1906.05856. Cited by: §1, §1.
  • [50] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016-06) Learning deep features for discriminative localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
  • [51] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis (2017-07) Two-stream neural networks for tampered face detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vol. , pp. 1831–1839. External Links: Document, ISSN Cited by: §2.2.
  • [52] M. Zollhofer, J. Thies, P. Garrido, D. Bradley, T. Beeler, P. Pérez, M. Stamminger, M. Nießner, and C. Theobalt (2018) State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications. Computer Graphics Forum. External Links: ISSN 1467-8659, Document Cited by: §2.1.