Human faces play an important role in human communication, as a face is associated with the identity of a person. Unique facial information, which functions like a fingerprint, is used in many applications such as phone unlocking and payment, thanks to remarkable progress in face detection and recognition systems [45, 41, 37]. However, we have also seen stunning progress in image and video manipulation methods, which enable editing images or videos in a visually plausible way. Face-specific manipulation methods [15, 17, 46, 47] can manipulate the face image of a person and create an indistinguishable fake.
Current face manipulation methods can be roughly divided into two categories: facial reenactment and identity swap. Facial reenactment transfers the facial expressions of one person to another and synthesizes realistic details; Face2Face and NeuralTextures are two representative works. Identity swap replaces the face of one person with another person's face; Deepfakes and FaceSwap are two of the most prominent methods. These methods enable effortless creation of fake face images and videos, which poses potential threats to society. For example, fake news can be easily created by synthesizing a speech video of a politician.
To alleviate the potential issues caused by fake face videos and images, great efforts have been dedicated to the field of face forensics, which aims to determine the authenticity of a face photo. General image forensics techniques, relying on hand-crafted cues [18, 33, 5, 30], may not be suitable for face-specific forensics tasks since faces are highly structured data. Recent works take advantage of the great representation power of CNNs (Convolutional Neural Networks) and train a network on a large dataset containing authentic and manipulated face images [36, 1, 49, 6]. A large-scale dataset called "FaceForensics++" has been released to address the problem of face forensics. The dataset contains 5,000 videos generated with four popular face manipulation methods (Deepfakes, FaceSwap, Face2Face, and NeuralTextures), which provides rich data to train models as well as a standard benchmark for evaluation.
Most methods for face forensics cast the problem as a classification problem: given an image, the model determines whether it shows a real or a manipulated face. Deep networks have proved effective for this classification problem [36, 49]. A modified Xception network trained on the "FaceForensics++" dataset achieves remarkable results, with an accuracy of 99.26% on the raw data, and a compact network also achieves comparable performance. However, one question arises: is the problem well defined? In Figure 2, the second image shows the activation map of a classification model, revealing the high-response area for the fake face on the left. The activation map is clearly not consistent with the ground truth, which suggests that the features used to distinguish fake images may be only weakly correlated with the actual manipulated regions.
This example reveals a limitation of a classification network: it can only produce a global scalar representing the confidence of being fake and cannot reflect how the image has been manipulated. It would be more beneficial to have a pixel-level output that accurately identifies the manipulated pixels, as shown in the third image of Figure 2. Therefore, it is more natural to formulate face forensics as a semantic segmentation task, so that the model is forced to learn discriminative features that localize manipulated regions.
In this paper, we analyze the problem of face forensics from a pixel-level perspective, using segmentation methods to complement existing classification methods. Several questions remain under-investigated: 1) Is face forensics by nature a classification or a segmentation problem? 2) What is the most suitable network architecture for this problem? 3) Should we adopt shallow or deep networks? 4) Should we train the model from scratch or initialize it with general vision features? We conduct experiments to answer these questions. By evaluating various architectures, we compare the performance of segmentation networks and their counterpart classification networks from different aspects. We hope to provide more insight into the problem and establish a new baseline for the benchmark.
Our contributions are threefold:
We conduct a pixel-level analysis of the face forensics problem by using segmentation methods to complement the existing classification methods.
By redefining the problem as a pixel-level task, we evaluate various architectures and create a strong new baseline for the problem.
By performing different ablation studies, we analyze what makes an effective and efficient anti-fake model, which we hope can shed some light on this field of research.
2 Related Work
We cover the most important related papers in the following paragraphs.
2.1 Digital Face Manipulation
A comprehensive state-of-the-art report of digital face manipulation is available in the literature. Current facial manipulation methods can be separated into four categories: image-based, audio-based, computer-graphics-based, and learning-based approaches.
State-of-the-art image-based approaches include Video Rewrite, Video Face Replacement, Bringing Portraits to Life, and Deep Video Portraits. These methods employ 2D warps to deform the image to match the expressions of a source actor. "Synthesizing Obama" learned a mapping between audio and lip motions.
State-of-the-art computer-graphics-based approaches include Video Face Replacement, VDub, and Face2Face. These methods usually reconstruct 3D models using blendshapes or other mesh-editing processes, relying on high-quality 3D face capture as well as precise and rapid tracking techniques.
Recently, generative adversarial networks (GANs) have been used to modify facial attributes such as aging, viewpoint, skin color, and smiling, or to perform other essential computer-graphics renderings. These are implemented as image-to-image translation with a patch-based GAN loss.
2.2 Face Forensics
Face forensics aims to ensure the authenticity and origin of a face. Prior work identifies computer-generated characters from computer-graphics faces, print-scanned morphed faces, face splicing [12, 22], face swapping [51, 1], and face reenactment [1, 14]. Specific artifacts arising from the synthesis process, such as color, texture, or eye blinking, can also be exploited. Learning-based approaches train a deep network to capture the subtle inconsistencies arising from low-level and/or high-level features [1, 51]. In particular, one approach uses a convolutional neural network to extract frame-level features, which are then used to train a recurrent neural network (RNN) that learns to classify whether a video has been manipulated. These approaches show impressive results but cannot precisely locate the manipulated area.
2.3 Pixel-Level Tasks
Instead of a rough global image-level prediction, many works provide local or pixel-level predictions, such as UNet, fully convolutional networks (FCN), and DeepLab for semantic segmentation. For image generation, pix2pix realizes pixel-level translation between different domains. There are also many applications in face parsing, pose parsing, and scene segmentation.
In face forensics, the mainstream methods are currently based on global classification. This motivates us to formulate face manipulation detection as segmentation in order to predict the local manipulated region. Although faces in the wild are often occluded by objects, the faces in the dataset are generally unobstructed, so models can be trained on them directly.
3 Problem Setting
In this section, we first introduce the problem settings and methodologies for both the classification task and the segmentation task. Then we present an overview of the architectures used for evaluation.
3.1 Classification Task
We first revisit the classification task. Formally, let $x$ denote an image containing either a real or a tampered face, and let $y \in \{0, 1\}$ denote the label associated with it. We learn a mapping function $f$ to predict the authenticity $\hat{y} = f(x)$ of a face image. Given a dataset containing $T$ images, the network is trained with the following BCE (Binary Cross Entropy) loss:

$$\mathcal{L}_{cls} = -\frac{1}{T}\sum_{i=1}^{T}\left[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\right],$$

where $\hat{y}_i$ is the output of the network for the $i$-th sample.
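As a sketch, this loss can be written directly in NumPy (assuming the network outputs have already been passed through a sigmoid to give probabilities):

```python
import numpy as np

def bce_loss(y_hat, y):
    """Binary cross-entropy over T global real/fake predictions.

    y_hat: array of shape (T,), predicted probabilities of being fake.
    y:     array of shape (T,), ground-truth labels in {0, 1}.
    """
    y_hat = np.clip(y_hat, 1e-7, 1 - 1e-7)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```

Confident, correct predictions drive the loss toward zero, while uncertain ones are penalized more heavily.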
Since a classification network can only map an image to a scalar indicating the probability of the image being tampered, it is unclear whether the model has learned useful features to localize the manipulated regions. There are interpretation and visualization works that try to reveal more information from a classification network by investigating the activated regions on the feature maps [10, 31, 50, 38]. We adopt the most representative method, CAM (Class Activation Map), to help visualize what the model has learned.
CAM requires that the network has an average pooling layer before the classifier, which collapses the output of the last convolution layer into a single vector. Suppose the feature maps from the last convolution layer are $F \in \mathbb{R}^{C \times H \times W}$ and the classifier has weights $w \in \mathbb{R}^{C}$; the activation map $M$ of a tampered face is calculated as:

$$M_{u,v} = \sum_{c=1}^{C} w_c\, F_{c,u,v},$$

where $M_{u,v}$, $w_c$, and $F_{c,u,v}$ are entries of $M$, $w$, and $F$, respectively.
Equation 2 actually applies the classifier directly to the feature maps $F$, performing classification at each spatial location. For simplicity, we modify the original CAM setting by switching the order of the average pooling layer and the classifier. As shown in Figure 2, the activation map can be viewed as a dense prediction for the image, and the classification score is produced by averaging the activation map into a scalar.
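The CAM computation is a single tensor contraction; a minimal NumPy sketch (assuming (C, H, W) feature maps), which also shows that pooling-then-classifying and classifying-then-pooling give the same global score:

```python
import numpy as np

def class_activation_map(feats, weights):
    """M[u, v] = sum_c weights[c] * feats[c, u, v]: the linear classifier
    applied at every spatial location of the last conv feature maps.

    feats:   (C, H, W) feature maps; weights: (C,) classifier weights.
    """
    return np.tensordot(weights, feats, axes=([0], [0]))

def classification_score(feats, weights):
    """Averaging the activation map recovers the global classification
    score, i.e. the pooling and the classifier commute."""
    return class_activation_map(feats, weights).mean()
```

This commutativity is why swapping the average pooling layer and the classifier, as described above, leaves the classification output unchanged.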
In order to convert $M$ into a pixel-level mask, we further normalize it to the range $[0, 1]$ and quantize it with a threshold. The normalization is:

$$\bar{M}_{u,v} = \frac{M_{u,v} - \min(M)}{\max(M) - \min(M)}.$$

The final pixel-level prediction is generated as:

$$P_{u,v} = \mathbb{1}\left[\bar{M}_{u,v} > \tau\right],$$

where $\mathbb{1}[\cdot]$ is an indicator function and $\tau$ a threshold.
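The normalization and thresholding steps can be sketched as follows (the default threshold of 0.5 is an assumption for illustration):

```python
import numpy as np

def cam_to_mask(M, tau=0.5):
    """Min-max normalize an activation map to [0, 1], then binarize at tau.

    M: (H, W) raw class activation map; returns a (H, W) mask in {0, 1}.
    """
    M_bar = (M - M.min()) / (M.max() - M.min() + 1e-8)  # eps guards flat maps
    return (M_bar > tau).astype(np.uint8)
```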
Now we have a pixel-level output that highlights the manipulated regions. Using these outputs makes it easier to investigate and analyze how well the classification model is able to learn discriminative and high-quality features on a pixel-level. Details and analysis are described in Section 4.
3.2 Segmentation Task
A classification network has limited capability to localize manipulated regions at the pixel level because it is supervised only by a global label. Segmentation extends the task to a dense classification problem by assigning a label to each pixel of an image. The model is then forced to learn discriminative features to determine the authenticity of each pixel. Formally, the supervision for an image is defined as a mask $Y \in \{0, 1\}^{H \times W}$ instead of a single label, and the loss is imposed on each pixel:

$$\mathcal{L}_{seg} = -\frac{1}{T H W}\sum_{i=1}^{T}\sum_{u,v}\left[\,Y_{i,u,v}\log \hat{Y}_{i,u,v} + (1 - Y_{i,u,v})\log(1 - \hat{Y}_{i,u,v})\,\right],$$

where $Y_{i,u,v}$ and $\hat{Y}_{i,u,v}$ are the label and the prediction, respectively, for the $i$-th sample at position $(u, v)$.
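The dense loss is the same BCE as in the classification task, simply averaged over samples and pixels; a NumPy sketch with assumed (T, H, W) shapes:

```python
import numpy as np

def pixel_bce_loss(Y_hat, Y):
    """BCE imposed independently on every pixel of every sample.

    Y_hat: (T, H, W) predicted per-pixel fake probabilities.
    Y:     (T, H, W) ground-truth manipulation masks in {0, 1}.
    """
    Y_hat = np.clip(Y_hat, 1e-7, 1 - 1e-7)  # avoid log(0)
    return -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
```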
Since a segmentation task requires pixel-level masks as supervision, data annotation is usually time-consuming. For example, as noted in the Cityscapes work, labelling a high-resolution street-view image for semantic segmentation takes around 1.5 hours. Fortunately, for the face forensics task the mask can be easily calculated by checking the pixel difference between the original image and the forged image, without any extra annotation cost. Figure 4 shows some training images from the "FaceForensics++" dataset along with their corresponding masks indicating the manipulated area.
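A minimal sketch of this mask construction; the tolerance `tau` is our assumption (with a default of zero, i.e. any difference), since the paper does not state how exact the pixel comparison is:

```python
import numpy as np

def manipulation_mask(original, forged, tau=0):
    """Mark a pixel as manipulated when any channel differs by more than tau.

    original, forged: (H, W, 3) uint8 images of the same face region.
    Returns an (H, W) mask in {0, 1}.
    """
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(original.astype(np.int16) - forged.astype(np.int16))
    return (diff.max(axis=-1) > tau).astype(np.uint8)
```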
A classification network can easily be converted to an FCN (Fully Convolutional Network), where the fully connected layers are replaced by convolutional layers. The pipeline for training a segmentation network is illustrated in Figure 3. Compared with the classification task in Figure 2, the main difference is that the average pooling is dropped and the BCE loss is applied directly to each pixel. The pixel-level prediction can then be obtained directly from the trained model.
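The conversion can be sketched in PyTorch with a toy backbone (the layer shapes here are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

# A stand-in feature extractor; any classification backbone works the same way.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
)

# Classification head: average pooling collapses the feature maps, then a
# linear classifier outputs one global logit per image.
cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1))

# FCN head: the same linear classifier expressed as a 1x1 convolution, so the
# network emits one logit per spatial location instead of a single scalar.
seg_head = nn.Conv2d(32, 1, kernel_size=1)

feats = backbone(torch.randn(2, 3, 64, 64))
print(cls_head(feats).shape)  # torch.Size([2, 1])
print(seg_head(feats).shape)  # torch.Size([2, 1, 16, 16])
```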
A segmentation model can also be evaluated from a global classification perspective by aggregating the dense prediction:

$$\hat{y} = \mathbb{1}\left[\frac{1}{HW}\sum_{u,v} \hat{Y}_{u,v} > \tau\right],$$

where $\hat{Y}_{u,v}$ is the prediction at position $(u, v)$ and $\tau$ is the threshold.
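The aggregation can be sketched as (the threshold of 0.5 is an assumption for illustration):

```python
import numpy as np

def aggregate_to_label(Y_hat, tau=0.5):
    """Collapse a dense fake-probability map to one real/fake decision
    by thresholding its spatial mean.

    Y_hat: (H, W) per-pixel fake probabilities; returns 0 (real) or 1 (fake).
    """
    return int(Y_hat.mean() > tau)
```

This is how the "-seg" models below can be scored under the same classification metrics as the "-cls" models.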
In this way, we can make a fair comparison between a segmentation network and its counterpart classification network under classification metrics. With extensive experiments in Section 4, we show the superiority of segmentation networks for the face forensics task.
3.3 Architectures
In order to conduct a deep analysis of the classification and segmentation tasks, we choose several representative architectures and evaluate their effectiveness on the problem of face forensics.
Xception is a deep network architecture constructed from a series of modified inception modules in which depthwise separable convolutions are used. In total, 36 convolutional layers form the feature-extraction base of the network. This architecture has been adopted for the classification task of face forensics in prior work.
MesoInception-4 is a compact, lightweight network designed for face forensics. It consists of two inception modules followed by two classic convolutional layers with max-pooling layers. We replace all operations after the last batch-norm layer with a single convolutional layer as the classifier.
UNet is an effective and popular architecture for pixel-level tasks such as segmentation and pixel-to-pixel translation. A UNet is defined by an encoder, consisting of convolutional layers and downsampling operations, and a decoder, consisting of convolutional layers and upsampling operations, with skip connections between the two that pass information from low-level features. We choose two variants of UNet with different numbers of downsampling steps in the encoder: UNet8x and UNet4x downsample by 8x and 4x, respectively.
VGG16 is a classic deep network for recognition tasks with 16 weight layers. Since we found that the full network fails to converge on face forensics tasks, we use only two shallow versions, VGG8 and VGG5, containing the first 7 and 4 feature layers of VGG respectively, plus a classifier.
FN3 is a 3-layer network we design to explore the potential of shallow networks. The architecture contains only two "Conv-BN-ReLU" blocks and a convolutional layer as the classifier. The first two convolutional layers have kernel size 7 and stride 2. Interestingly, this minimal structure works surprisingly well, even outperforming most of the deep architectures. Please refer to Section 4 for more details.
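Based on the details given above, a plausible PyTorch sketch of such a network; the channel widths and the 1x1 classifier kernel are our assumptions, since the text specifies only the kernel size and stride of the first two layers:

```python
import torch
import torch.nn as nn

class FN3(nn.Module):
    """A 3-layer segmentation network: two Conv-BN-ReLU blocks (kernel 7,
    stride 2, as stated in the text) plus a convolutional classifier.
    The channel widths (16, 32) are assumptions for illustration."""

    def __init__(self, width=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, width, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width * 2, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(width * 2), nn.ReLU(inplace=True),
        )
        # Produces one logit per spatial location; a 1x1 kernel is assumed.
        self.classifier = nn.Conv2d(width * 2, 1, kernel_size=1)

    def forward(self, x):
        # A 256x256 crop is downsampled twice by stride 2 -> a 64x64 logit map.
        return self.classifier(self.features(x))
```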
4 Experiments

4.1 Experiment Setup
Dataset: FaceForensics++ is a large-scale face forensics dataset consisting of 5,000 video clips in total. Video sequences were crawled from the internet and manually screened to ensure high quality and avoid face occlusion, resulting in 1,000 original videos. Four manipulation methods, Deepfakes, Face2Face, FaceSwap and NeuralTextures, are applied to create forged videos, resulting in 4,000 fake clips. The dataset also provides data at three compression levels: raw, HQ and LQ. We focus only on the raw-quality task, because low-quality videos usually suffer a strong loss of visual and identity information and are therefore less likely to be abused than clear ones. The benchmark also suggests a split of 720 videos for training and 140 each for validation and testing; we follow the same setting.
Evaluation protocol and metrics: Two types of training protocols are involved in the benchmark: method-specific training and mixed training. The former uses forged data from only one of the manipulation methods. The latter trains a model on all the real and forged data, with performance evaluated on each specific method. We adopt only mixed training, as it poses a more challenging task and a more realistic scenario. The evaluation is frame-based; therefore we extract all frames of the training set and a subset of frames (every 10th frame) for validation and testing.
In terms of evaluation metrics, we use classification accuracy for the classification tasks, which measures how many test images are correctly classified. For segmentation tasks, IoU (Intersection over Union) is used, defined as the ratio TP / (TP + FP + FN), where TP (true positives), FP (false positives) and FN (false negatives) are counted over pixels. The IoU is calculated for both foreground and background, denoted Fg-IoU and Bg-IoU, and the two are averaged to obtain mIoU, the mean IoU.
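These metrics can be sketched as:

```python
import numpy as np

def iou(pred, gt):
    """IoU = TP / (TP + FP + FN), counted over pixels, for boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def mean_iou(pred, gt):
    """mIoU: average of foreground IoU (Fg-IoU) and background IoU (Bg-IoU).

    pred, gt: (H, W) binary masks; 1 = manipulated (foreground).
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 0.5 * (iou(pred, gt) + iou(~pred, ~gt))
```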
Implementation details: In face forensics, faces are the most important regions, and models trained on whole images have been shown to perform poorly. Therefore, instead of using the whole image, we extract the faces in a pre-processing step using a public face detection tool and train the models only on the face regions. To include more background information, we enlarge the bounding box by a factor of 2. The segmentation masks are calculated by taking the pixel difference between a manipulated face image and its corresponding original image. For segmentation tasks, the images are randomly cropped to 256x256 and the same crop is applied to the corresponding mask. For classification tasks, it is necessary to include most of the face region in the crop; therefore, the shorter side of the image is first resized to 256, and a 256x256 patch is then cropped from the resized image.
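One plausible implementation of the 2x box enlargement (the rounding and clipping behavior are our assumptions):

```python
def enlarge_box(x, y, w, h, img_w, img_h, scale=2.0):
    """Enlarge a face box around its center by `scale`, clipped to the image.

    (x, y) is the top-left corner of the detected box of size (w, h);
    returns the enlarged box as corner coordinates (x0, y0, x1, y1).
    """
    cx, cy = x + w / 2.0, y + h / 2.0       # box center
    nw, nh = w * scale, h * scale           # enlarged size
    x0 = max(0, int(round(cx - nw / 2.0)))
    y0 = max(0, int(round(cy - nh / 2.0)))
    x1 = min(img_w, int(round(cx + nw / 2.0)))
    y1 = min(img_h, int(round(cy + nh / 2.0)))
    return x0, y0, x1, y1
```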
The implementation is based on PyTorch. All models are trained with the Adam optimizer. Since Adam adjusts the learning rate dynamically, we keep the default learning rate and do not use any learning-rate decay policy. The batch size is set to 64.
4.2 Experimental Results
Classification task: Table 1 shows the classification accuracy of different architectures. The suffixes "-seg" and "-cls" denote a segmentation model and a classification model, respectively. The pixel-wise output is aggregated to a global output according to Equation 6. Among the classification models, Xception-cls reaches the best performance, which is consistent with prior results. UNet, although a popular segmentation model for various pixel-level prediction tasks, fails to perform well on the classification task. It is interesting that FN3-cls, a minimal structure with only 3 layers, works surprisingly well: although its performance is lower than Xception's, FN3-cls achieves far better performance than the other models. The segmentation models all obtain better classification results than their counterpart classification models, which shows the benefit of training under pixel-level supervision.
Segmentation task: Table 2 shows the segmentation results of different architectures. The classification models are trained with a global image-level label and visualized with CAM to obtain a pixel-level output, as explained in Section 3.1. Among the segmentation models, VGG5-seg achieves the best performance in terms of mIoU, Bg-IoU, and Fg-IoU. MesoNet-seg, a compact and efficient architecture, does not achieve comparable results and is outperformed by the other methods by a large margin; we suspect this is due to the limited capacity of the model. It is also worth noting that UNet still does not reach promising results despite being a popular segmentation architecture. In contrast, the 3-layer network FN3-seg shows better potential, even surpassing Xception-seg. Among the classification models, Xception-cls achieves the best results in most of the scores, which implies that Xception-cls can learn high-quality features to locate manipulated regions even when trained with only a global image-level label. However, Xception-cls is still far behind its segmentation counterpart, which obtains much higher scores. The remaining classification models all suffer from low scores; even VGG5-cls, whose segmentation counterpart achieves the best results, is unable to produce plausible predictions without pixel-level supervision.
From the results above, segmentation models clearly show superiority over classification models in terms of both pixel-level and global-level prediction. Figure 5 shows outputs of different architectures, which further illustrates the benefit of analyzing fake faces at the pixel level.
4.3 Ablation Studies

Deep vs Shallow
To explore the effect of model depth on the task of face forensics, we take a closer look at the performance of models with different depths. In Table 3, we summarize the mIoUs of segmentation models of varying depth. In addition to VGG8 and VGG5, we also include VGG3, which uses only the first two layers of VGG16 followed by a classifier. It is interesting to see that the deep model, Xception with 36 layers, does not reach a high score, whereas the shallow models perform better. This suggests that face forensics should be treated more as a low-level vision problem than a high-level perception problem.
Pretrained or From Scratch
As implied by the analysis in the last section, face forensics is more like a low-level vision task. Another question is: can the models benefit from features used for general vision recognition tasks? We conduct another ablation study comparing performance on the segmentation task with and without ImageNet pretraining. The results are shown in Table 4. According to the numbers, there is little difference between the pretrained model and the model trained from scratch; the features learned in a general vision recognition task such as ImageNet classification did not help the model quickly find a better local optimum.
In order to better understand the features learned by the model, we analyze the kernels using a visualization technique. In Figure 6, for each fake image, we visualize two kernels in each convolutional layer. Apart from the features in conv1, which are mostly low-level edges and corners, the kernels in the following layers do not make much sense to humans. Intuitively, the model tries to learn subtle features to which humans are not sensitive. Humans are good at recognizing things on a semantic level, but fake faces generated by advanced manipulation methods seem to be beyond human ability. This further emphasizes the demand for a good face forensics model.
5 Conclusion

Face forensics has become increasingly important as face manipulation methods have made stunning progress, enabling effortless generation of indistinguishable fake face images. Most previous works cast the problem as a classification task, which suffers from limitations. In this paper, we analyze the problem from a pixel-level perspective, using segmentation methods to complement the traditional classification methods. With comprehensive experiments, we show the superiority of formulating face forensics as a segmentation problem instead of a classification problem. In addition, we perform different ablation studies to analyze the important factors in building an effective face forensics model, which establishes a strong new baseline for the benchmark. We hope our analysis can provide more insight into the field of face forensics.
References

- (2018-12) MesoNet: a compact facial video forgery detection network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7.
- (2014-12) Video face replacement system using a modified Poisson blending technique. In 2014 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS).
- (2017-09) Face aging with conditional generative adversarial networks. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 2089–2093.
- (2017) Bringing portraits to life. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2017) 36 (6), pp. 196.
- (2017-08) Aligned and non-aligned double JPEG detection using convolutional neural networks. Journal of Visual Communication and Image Representation 49.
- A deep learning approach to universal image manipulation detection using a new convolutional layer. In Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec '16), New York, NY, USA, pp. 5–10.
- (1997) Video Rewrite: driving visual speech with audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '97), New York, NY, USA, pp. 353–360.
- (2017-07) Xception: deep learning with depthwise separable convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017-01) Weakly supervised object localization with multi-fold multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (1), pp. 189–203.
- The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2013-07) Exposing digital image forgeries by illumination color classification. IEEE Transactions on Information Forensics and Security 8 (7), pp. 1182–1194.
- (2011) Video face replacement. In Proceedings of the 2011 SIGGRAPH Asia Conference (SA '11), New York, NY, USA, pp. 130:1–130:10.
- (2012-12) Identify computer generated characters by analysing facial expressions variation. In 2012 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 252–257.
- Deepfakes GitHub. https://github.com/deepfakes/faceswap
- Face Recognition GitHub. https://github.com/ageitgey/face_recognition
- FaceSwap. https://github.com/MarekKowalski/FaceSwap
- (2016) Photo Forensics. The MIT Press.
- (2015-05) VDub: modifying face video of actors for plausible visual alignment to a dubbed audio track. Computer Graphics Forum 34.
- (2018-11) Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6.
- (2017-10) Beyond face rotation: global and local perception GAN for photorealistic and identity preserving frontal view synthesis. In The IEEE International Conference on Computer Vision (ICCV).
- (2018-09) Fighting fake news: image splice detection via learned self-consistency. In The European Conference on Computer Vision (ECCV).
- Image-to-image translation with conditional adversarial networks. arXiv.
- (2018) Deep video portraits. ACM Transactions on Graphics (TOG).
- (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- (2018-12) In ictu oculi: exposing AI created fake videos by detecting eye blinking. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7.
- (2015-06) Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
- (2018-09) Attribute-guided face generation using conditional CycleGAN. In The European Conference on Computer Vision (ECCV).
- (2012) Hierarchical face parsing via deep learning. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, pp. 2480–2487.
- (2012-01) Exposing photo manipulation with inconsistent reflections. ACM Transactions on Graphics 31 (1), pp. 4:1–11. Presented at SIGGRAPH 2012.
- (2014-06) Learning and transferring mid-level image representations using convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724.
- (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
- (2005-02) Exposing digital forgeries by detecting traces of resampling. IEEE Transactions on Signal Processing 53 (2), pp. 758–767.
- (2017-07) Transferable deep-CNN features for detecting digital and print-scanned morphed face images. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1822–1830.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351, pp. 234–241.
- (2019) FaceForensics++: learning to detect manipulated facial images. In International Conference on Computer Vision (ICCV).
- (2015) FaceNet: a unified embedding for face recognition and clustering. In CVPR.
- (2017-10) Grad-CAM: visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV).
- (2015-05) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
- (2015) Striving for simplicity: the all convolutional net. In ICLR (workshop track).
- (2014) Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pp. 1988–1996.
- (2017-07) Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics 36 (4), pp. 95:1–95:13.
- (2015-06) Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2013) DeepFace: closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708.
- (2019-07) Deferred neural rendering: image synthesis using neural textures. ACM Transactions on Graphics 38 (4), pp. 66:1–66:12.
- (2016-06) Face2Face: real-time face capture and reenactment of RGB videos. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017-07) Deep feature interpolation for image content changes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) Detecting Photoshopped faces by scripting Photoshop. arXiv preprint arXiv:1906.05856.
- (2016-06) Learning deep features for discriminative localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017-07) Two-stream neural networks for tampered face detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1831–1839.
- (2018) State of the art on monocular 3D face reconstruction, tracking, and applications. Computer Graphics Forum.