Successful computer vision technologies have recently also fueled face manipulation strategies with a series of methods proposed, including Deepfakes, Face2Face , FaceSwap and NeuralTextures . The fake content created by these methods is becoming increasingly realistic. Falsified video content, involving both manipulated identity and edited expression, raises various disconcerting problems within widespread social media, such as identity theft, fake news dissemination, and fraud. Even worse, most existing face manipulation technologies are disclosed online and easy to implement. Even a non-expert person without any prior professional skills can create high-quality fake content. It cannot be denied that these manipulation techniques and off-the-shelf commercial products (e.g., FakeApp ) bring certain positive impacts to the film-making industry and other entertainment applications. However, face manipulation techniques are highly likely to be misused, and the potential malicious applications are hazardous.
As such, there has been an exponential increase in the demand for efficient face manipulation detection methods to counteract its dangerous impacts.
Despite the demonstrated success in developing detection methods, the advancement of face manipulation techniques still poses a grand challenge. The situation worsens when manipulated videos are compressed before distribution. One general detection methodology includes leveraging high-level semantic clues (e.g., lack of eye blinking, head pose inconsistency, and face splicing). Such methods may ignore low-level signal variations in manipulated videos, resulting in limitations in terms of robustness and effectiveness. This has motivated the development of frequency-domain features [9, 10, 11]. Manipulation region localization [12, 8, 13] is another task of paramount importance in face forensics, as it is pivotal to unveiling the intention of a forger. However, this aspect has been largely ignored in most existing methods. Another concern is the degraded detection accuracy on low-quality videos, caused by the limited generalization capability across different quality levels. This could primarily affect the deployment of detection methods in practical applications.
In this vein, we aim to develop an anti-manipulation method with three desired properties:
Detectability: It must achieve high accuracy in detecting manipulated faces.
Locatable: It must be able to locate the manipulated regions precisely.
Flexibility: It must be robust to videos in different quality levels.
We introduce a two-stream multi-scale face anti-manipulation framework based on the fusion of semantic-level and noise-level features, achieving the three desired properties mentioned above, simultaneously.
As Fig. 1 depicts, the semantic-level guidance (middle row) reveals the location of the manipulated face, rooted in the generally adopted procedure in face manipulation of blending the new face with a background. On the other hand, the noise-level guidance originates from noise patterns (bottom row), and observers may find that the noise pattern of the pristine face is more consistent and homogeneous than the others. As such, semantic masks and noise patterns are employed as guiding labels to supervise the model's training. Besides, we extract three feature maps from the shallow, middle, and deep layers of a convolutional neural network backbone, such that the extracted multi-scale features can carry information from low, middle, and high levels. Finally, we combine the semantic-level and noise-level features to check authenticity and locate the manipulated region of the input face. Extensive experiments demonstrate that the proposed framework achieves consistent performance improvement compared to state-of-the-art methods, holds promise for both high-quality and low-quality videos, and delivers more information regarding the manipulated regions.
II Related Work
II-A Face Manipulation and Detection
Generally speaking, face manipulation methods can be classified into the following two categories: identity swap and facial expression manipulation. In particular, Deepfakes and FaceSwap  are two prominent representatives of face identity swap techniques. Face2Face  and NeuralTextures  are two popular facial expression editing methods. Deepfakes and NeuralTextures are learning-based methods among those four face manipulation techniques, while FaceSwap and Face2Face are computer graphics-based methods.
Due to the potential malicious applications of face manipulation techniques, numerous face forgery detectors have been proposed. For example, Li et al.  proposed the Face X-ray to detect the blending boundary of the fake face, which also reveals the manipulated boundaries. Chintha et al.  applied a recurrent structure to capture spatial and temporal information of manipulated videos. In  and , the authors designed two models to perform face forensics detection and localization. Li et al.  proposed a sharp-MIL (Multiple Instance Learning) to perform face forgery detection on videos. Qian et al.  introduced frequency analysis into face forgery detection and achieved promising detection results. Masi et al.  designed a two-branch recurrent network to detect face manipulation. None of the above methods considered combining semantic and noise signatures to detect forgeries and basically all of them suffer from performance drop when dealing with low-quality videos or cross-dataset evaluations.
II-B Binary Mask Supervision
Image manipulation segmentation focuses on regions which are potentially manipulated and often relies upon guiding binary masks. The ground-truth masks are generated by marking forged regions as '1' and the pristine (un-forged) regions as '0'. In image forensics, Zhou et al. fused the features from the RGB stream and the noise stream to capture the manipulated boundaries as well as the noise inconsistencies between the authentic and manipulated regions, and the features are then used to locate the potentially manipulated regions. Bappy et al. designed a hybrid CNN-LSTM model to capture boundary discrepancy features and perform image manipulation localization. Zhou et al. employed a novel generator to perform data augmentation and trained a model to complete image forgery localization. The binary mask guidance has also been widely used in face anti-spoofing. In [20, 21, 22], binary masks are employed to perform pixel-wise supervision during the training process. These models aim to discover arbitrary cues that can identify the live (pristine) and spoofed (manipulated) faces. In face manipulation detection, Du et al. designed a locality-aware autoencoder that forces the model to learn the feature representation of the forgery region, which boosts the generalization capability. Dang et al. utilized the binary mask and an attention mechanism to make the extracted feature maps focus on manipulated regions, which further improves the detection performance. Li et al. designed a more general model to capture the boundary splicing artifacts, which can explicitly locate the potentially manipulated facial region.
Unlike previous methods, in this paper we design a multi-scale semantic map prediction module to capture the semantic-level information of input faces and simultaneously perform the localization of face manipulation. The binary masks play a supervision role in the model training process, encouraging the network to learn features that account for such fine-grained manipulations.
II-C Noise Modeling
Pioneering work on noise modeling for image forensics dates back to 2004, where inconsistencies at the noise level were leveraged to expose splicing artifacts of manipulated images [23, 24, 25]. More recently, photo-response non-uniformity (PRNU) noise patterns have been widely studied in multimedia forensics. In general, PRNU can also be regarded as the device fingerprint. Korus et al. proposed a multi-scale PRNU-based scheme to tackle the tampering localization problem. Quan et al. focused on the correlation between the noise residual and the PRNU pattern for forgery detection. Cozzolino et al. applied a noise extraction Siamese network to achieve manipulation detection and forgery localization.
Noise patterns are also regarded as key clues in face anti-spoofing. More specifically, Jourabloo et al.  designed a CNN to decompose a spoofing face into the spoofing noise and a live face, and the spoofing noise is further used to perform the live/spoofing identification. Ren et al.  designed a noise-attention network architecture to extract noise-features for live/spoofing face classification.
While noise signature has been widely used in image forensics and face anti-spoofing tasks, how the noise signature benefits face manipulation detection, especially the fake face contents generated by deep learning technologies, has not been investigated yet. In this work, we argue that some key cues remain in the noise maps of manipulated faces. Therefore, a noise map prediction module is designed to extract the noise pattern of corresponding input faces, and the noise feature is subsequently fed to a classification module for face manipulation detection.
III Proposed Method
This section presents a method to simultaneously detect and locate manipulated face regions. It combines high-level semantic clues about an image with low-level noise features. The method consists of four components: (a) a feature extraction backbone; (b) a semantic map prediction module; (c) a noise map prediction module; and (d) a classification module.
III-A Overall Framework
We propose a two-stream multi-scale framework for face forgery detection and localization; the architecture is illustrated in Fig. 2. Each sample in the training dataset consists of four components: the input face, its corresponding binary mask label, its noise pattern label, and the ground-truth binary forgery label. We employ the Xception network as the backbone to learn multi-scale feature maps, although other networks could also be used without loss of generality. The multi-scale design is based on two insights.
First, multi-scale features enable the model to learn both semantic and geometric information as features from different layers contribute to different receptive fields. Second, multi-scale supervision leads the model to focus on the valuable information from the beginning (i.e., learning face-forgery-specific information at the shallow layer), which benefits face manipulation detection and localization performance.
The semantic map prediction module and the noise map prediction module simultaneously process the multi-scale feature maps, with their outputs guided by the binary mask label and the noise pattern label, respectively. The semantic map prediction module is trained to capture the high-level semantic information from the extracted multi-scale features, and the estimated semantic map indicates the potentially manipulated regions in the corresponding input face.
The three predicted semantic segmentation maps are leveraged to perform forgery localization. The noise map prediction module is further proposed to steer the multi-scale features away from image content and toward content-irrelevant low-level information. The estimated noise maps contain rich high-frequency cues that might expose noise artifacts in manipulated faces.
We concatenate the last semantic features and the last noise features, and then apply spatial attention by multiplying the concatenated features element-wise with the last predicted semantic segmentation map. Finally, the attention-based features are forwarded to the classification module to identify the authenticity of the input face.
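The fusion step described in this paragraph can be sketched in a few lines. This is a minimal illustration on plain nested lists, not the actual network code: the function name `fuse_with_attention`, the channel layout, and the toy values are all hypothetical, and the mask is assumed to already match the spatial size of the features.

```python
from typing import List

Channel = List[List[float]]  # one H x W feature channel


def fuse_with_attention(semantic: List[Channel],
                        noise: List[Channel],
                        mask: Channel) -> List[Channel]:
    """Concatenate two feature stacks channel-wise, then weight every
    channel element-wise by the predicted semantic map (spatial attention)."""
    fused = semantic + noise  # channel-wise concatenation
    return [
        [[value * mask[r][c] for c, value in enumerate(row)]
         for r, row in enumerate(channel)]
        for channel in fused
    ]


# Toy inputs (hypothetical values): one 2 x 2 channel per stream.
semantic_feat = [[[1.0, 2.0], [3.0, 4.0]]]
noise_feat = [[[0.5, 0.5], [0.5, 0.5]]]
pred_mask = [[1.0, 0.0], [0.5, 1.0]]  # predicted semantic map in [0, 1]

attended = fuse_with_attention(semantic_feat, noise_feat, pred_mask)
```

Positions where the predicted map is close to zero are suppressed in both streams, so the classifier mostly sees features from the regions flagged as manipulated.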
The overall objective function is a weighted combination of three components: the final classification loss, the noise map prediction loss, and the semantic map estimation loss, with two hyper-parameters weighing the loss components. The classification loss is the cross-entropy loss between the prediction result and the ground-truth label.
We describe the noise map prediction loss and the semantic map estimation loss in the following subsections.
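One way to write this objective, consistent with the description above, is sketched below. The symbol names ($\mathcal{L}_{cls}$, $\mathcal{L}_{noise}$, $\mathcal{L}_{seg}$, $\lambda_1$, $\lambda_2$, $y_i$, $\hat{y}_i$, $N$), the assignment of each weight to a particular loss term, and the averaging over samples are assumptions made for illustration; the classification term is shown for the binary real/fake case.

```latex
\mathcal{L} = \mathcal{L}_{cls} + \lambda_{1}\,\mathcal{L}_{noise} + \lambda_{2}\,\mathcal{L}_{seg},
\qquad
\mathcal{L}_{cls} = -\frac{1}{N}\sum_{i=1}^{N}
  \Big[\, y_i \log \hat{y}_i + (1 - y_i)\log\big(1 - \hat{y}_i\big) \Big]
```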
III-B Semantic Map Prediction Module
Binary mask supervision has been widely used to tackle various forensics problems. In this paper, we generate the ground-truth mask for each face in the training stage and use the mask to supervise the training of the proposed model. The benefits of the semantic map prediction module are mainly threefold: 1) it enables the model to localize manipulated regions, providing evidence to show whether the input faces have been manipulated; 2) it constrains the model to focus on manipulated regions, leading the model to achieve a better manipulation detection performance; and 3) different manipulation approaches tend to have different binary mask shapes, thus providing auxiliary information to complete multi-class classification (more details are elaborated in Sec.IV-C.3)).
The semantic map prediction module takes the extracted multi-scale features as inputs, where the outputs are the estimated semantic segmentation maps. In particular, we apply the depth-wise separable convolution block as our semantic map prediction module, as it has been shown to be effective in different types of problems. For the output of the semantic map prediction module, each pixel value indicates the probability that the corresponding receptive field is a manipulated region in the input face. We then apply a weighted summation over the multi-scale estimated semantic maps to determine the final manipulation localization map of the input face.
We use videos in FaceForensics++ as our training set, as it provides a manipulation mask for each manipulated face, which supervises the training of the semantic map prediction module. Training this module amounts to a pixel-wise binary classification problem between manipulated and non-manipulated regions. Accordingly, the cross-entropy loss is leveraged to supervise the estimation of the semantic segmentation maps.
The loss compares the estimated and ground-truth semantic maps, where the sizes of all ground-truth maps have been aligned to the corresponding predicted maps. It is evaluated at every pixel location of the mask predicted by each feature layer for every input face, over the total number of pixels.
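A minimal sketch of such a pixel-wise cross-entropy over several scales follows. It assumes that each ground-truth mask has already been resized to its prediction, that each per-scale loss is averaged over pixels, and that the scales are simply summed; the function names and the clamping constant `eps` are hypothetical.

```python
import math
from typing import List

Map = List[List[float]]  # one H x W map of probabilities or 0/1 labels


def mask_bce(pred: Map, target: Map, eps: float = 1e-7) -> float:
    """Mean pixel-wise binary cross-entropy between a predicted map
    and a 0/1 ground-truth mask of the same size."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def multi_scale_mask_loss(preds: List[Map], targets: List[Map]) -> float:
    """Sum of the per-scale losses (summing across scales is an assumption)."""
    return sum(mask_bce(p, t) for p, t in zip(preds, targets))


# Toy example: a 1 x 2 map and a 1 x 1 map standing in for two scales.
loss = multi_scale_mask_loss(
    preds=[[[0.9, 0.1]], [[0.9]]],
    targets=[[[1.0, 0.0]], [[1.0]]],
)
```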
III-C Noise Map Prediction Module
One notable example of forensics-related noise pattern analysis is PRNU, which is the dominant part of the noise in natural images caused by the inhomogeneity of silicon wafers and imperfections during the sensor manufacturing process, and has been widely applied in different multimedia forensics tasks [35, 27, 29] to date. Inspired by the PRNU paradigm, we leverage the wavelet-based filter to extract a noise map, which exposes face image forensics artifacts and serves as low-level guidance for the face anti-manipulation task. The noise map prediction module is designed to make the extracted multi-scale features independent of high-level semantic content (i.e., facial appearance) and to provide a complementary clue for face manipulation detection.
The design of the noise map prediction module relies upon the assumption that noise patterns of manipulated faces expose two artifacts. First, the noise pattern of the manipulated region tends to expose more abnormal artifacts due to the image processing of the fake face creation process. Second, the statistical discrepancy between the noise distributions of the manipulated face region and the background region introduces noise distribution inconsistencies [29, 18]. We statistically analyze the noise pattern distributions of four manipulation methods and pristine data in Fig. 3. It can be observed that the mean and variance values of pristine faces are much smaller than those of the four manipulation methods. Therefore, it is reasonable to employ noise guidance to perform face manipulation detection.
where and represent the th input raw face and the corresponding extracted noise pattern, and is defined as the noise filter:
where is the denoised face image, indicates the variance estimation of the additive white Gaussian noise (AWGN), and is the variance of . The estimation details of can be found in .
We can extract the noise residual for each input face as the noise label. We further design a multi-scale noise map prediction module to estimate low-level noise patterns and extract noise-specific features. The extracted noise-specific features aim to expose noise artifacts, serving as auxiliary information to improve the final authentic identification performance. The noise map prediction loss calculates the norm of the pixel-wise difference between predicted noise maps and corresponding noise labels :
where specifies th input face; denotes the specific feature layer; and represent the ground truth and estimated noise patterns. This design enables the network to learn the content-irrelevant low-level information and expose noise artifacts.
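As a rough illustration of how a noise residual can serve as a regression target, the sketch below subtracts a denoised copy of an image from the image itself and scores a predicted noise map against that residual. It assumes a simple 3x3 mean filter as a stand-in denoiser, which is not the wavelet-based filter used in this work, and a mean absolute difference as the loss, since the norm order is an assumption here; all function names are hypothetical.

```python
from typing import List

Image = List[List[float]]  # one H x W grayscale image


def box_denoise(image: Image) -> Image:
    """Stand-in denoiser: 3x3 mean filter with edge clamping.
    This is NOT the wavelet-based filter; it only illustrates the idea."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            window = [
                image[min(max(r + dy, 0), height - 1)][min(max(c + dx, 0), width - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(sum(window) / 9.0)
        out.append(row)
    return out


def noise_residual(image: Image) -> Image:
    """Noise label: the image minus its denoised version."""
    denoised = box_denoise(image)
    return [[x - d for x, d in zip(img_row, den_row)]
            for img_row, den_row in zip(image, denoised)]


def noise_l1_loss(pred: Image, label: Image) -> float:
    """Mean absolute pixel-wise difference (L1-style; an assumption)."""
    diffs = [abs(p - t) for pr, tr in zip(pred, label) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


impulse = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
label = noise_residual(impulse)  # large value at the isolated bright pixel
```

A perfectly flat image yields an all-zero residual, while isolated high-frequency deviations survive the subtraction, which is the property the noise labels rely on.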
IV-A Data Preparation
In this paper, we conduct the experiments on the challenging FF++ dataset. FF++ is a face manipulation dataset that contains 1,000 pristine videos and 4,000 associated manipulated videos created by four state-of-the-art forgery techniques: Deepfakes, Face2Face, FaceSwap, and NeuralTextures. Besides, FF++ provides three quality levels, controlled by the quantization parameter (QP) in compression, for these 5,000 videos: raw (QP=0), HQ (high-quality, QP=23), and LQ (low-quality, QP=40). Considering the deployment in real-world application scenarios, we conduct our experiments on both HQ videos and LQ videos. FF++ also provides the ground-truth masks that indicate the forged regions of manipulated faces, enabling the development of face forgery localization methods. Following the experimental setting in , we take 720 videos for training, 140 videos for validation, and 140 videos for testing. We extract 270 frames from each training video and 100 frames from each validation and testing video. Following the official experimental setting, during training we augment the number of real faces four times to address the data imbalance between real and fake faces; to do so, we simply extract more faces from the training videos. The summary of the data in training, validation, and testing is listed in Table I.
IV-A2 Binary mask generation
FF++ provides a 3D mask for every video frame, and the 3D mask indicates the manipulated region of each face. As such, we generate the ground-truth binary mask for each face according to the boundary location of the 3D mask. The middle row in Fig. 1 illustrates the generated ground-truth masks for real and manipulated faces, where the white pixels denote the manipulated regions, and the black pixels represent the non-manipulated regions (pristine regions).
IV-A3 Noise map generation
Following previous forensics methods [35, 27, 29], we apply the wavelet-based noise filter to extract the noise map for each input face. The extracted noise maps are regarded as pseudo labels to supervise the training of the noise map prediction module. The hyper-parameter defined in Eqn. (5) is the variance estimate of the additive white Gaussian noise (AWGN). To determine its best values for high-quality and low-quality faces, we use the noise patterns generated with different values to supervise the training of the noise map prediction module; the resulting detection performance on high-quality and low-quality faces is illustrated in Fig. 4. According to the detection results in Fig. 4 (a) and (b), we set the value to 5 for high-quality faces and 10 for low-quality faces.
IV-B Implementation Details
IV-B1 Training strategy
We apply the popular dlib face detector to crop the face regions, enlarged by a factor of 1.3. The proposed framework is implemented in PyTorch. The model is trained using the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Multi-scale feature layers are extracted after conv2, block2, and block11 of the Xception backbone.
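The crop enlargement can be illustrated with a small helper. This is a sketch under stated assumptions: the detector interface is not shown, the `(left, top, right, bottom)` box format and the function name `enlarge_box` are hypothetical, and scaling about the box center with clamping to the image bounds is one plausible reading of "enlarged by a factor of 1.3".

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def enlarge_box(box: Box, factor: float, width: int, height: int) -> Box:
    """Scale a box about its center by `factor`, clamped to the image bounds."""
    left, top, right, bottom = box
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0
    half_w = (right - left) * factor / 2.0
    half_h = (bottom - top) * factor / 2.0
    return (
        max(0, int(round(cx - half_w))),
        max(0, int(round(cy - half_h))),
        min(width, int(round(cx + half_w))),
        min(height, int(round(cy + half_h))),
    )


# A 100 x 100 detection in a 640 x 480 frame, enlarged by 1.3.
crop = enlarge_box((100, 100, 200, 200), 1.3, 640, 480)
```

Keeping some context around the detected face preserves the blending boundary between the swapped face and its background, which is where the semantic-level guidance is most informative.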
We adopt a two-step training strategy: 1) training the Xception backbone and the semantic map prediction module, where the weights of the backbone are initialized with the ImageNet weights; and 2) initializing the weights of the Xception backbone and the semantic map prediction module with the weights obtained in Step 1), then training the overall framework in an end-to-end manner. This two-step training strategy effectively alleviates the over-fitting problem and is more robust to local minima. We set the learning rate and weight decay to 0.0002 and 1e-5, respectively. The model is trained on 2 RTX 2080Ti GPUs with a batch size of 32.
IV-B2 Evaluation metrics
Following most existing face forgery detection methods, we employ the detection Accuracy rate (ACC) and Area Under the Receiver Operating Characteristic Curve (AUC) as the forgery detection evaluation metrics. ACC is the most straightforward metric in face forgery detection, which directly reflects the detection capability of the detector. AUC is a more objective evaluation metric, which has been widely applied in recent state-of-the-art detection methods, such as Face X-ray, Celeb-DF , and Two-branch .
This paper takes AUC as the key evaluation metric and reports all detection results at the frame level. On the other hand, we utilize Intersection over Union (IoU) as the localization evaluation metric. IoU is one of the standard metrics in semantic segmentation and has been widely used in numerous semantic segmentation tasks [42, 43]. The definition of IoU is given in Eqn. (7). The numerator calculates the area of overlap between the predicted mask and the ground-truth mask, and the denominator represents the area of their union. As such, IoU is employed to evaluate the forgery localization accuracy.
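The IoU described above can be computed as in the following minimal sketch. The 0.5 binarization threshold and the convention of returning 1.0 when both masks are empty (as can happen for pristine faces) are assumptions, and the function name is hypothetical.

```python
from typing import List

Mask = List[List[float]]  # H x W map of probabilities or 0/1 labels


def iou(pred: Mask, target: Mask, threshold: float = 0.5) -> float:
    """Intersection over Union between a thresholded predicted mask
    and a binary ground-truth mask of the same size."""
    intersection = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_on = p >= threshold
            t_on = t >= 0.5
            intersection += int(p_on and t_on)
            union += int(p_on or t_on)
    # Both masks empty: treated as a perfect match (an assumed convention).
    return intersection / union if union else 1.0


predicted = [[0.9, 0.8], [0.2, 0.1]]
ground_truth = [[1.0, 0.0], [1.0, 0.0]]
score = iou(predicted, ground_truth)  # one overlapping pixel out of three
```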
IV-C Manipulation Detection
We compare the proposed method with previous detection methods in terms of ACC and AUC.
IV-C1 Detection performance on all manipulation methods
Table II reports the detection results of the proposed framework and previous methods trained on the FF++ dataset. HQ and LQ indicate the high-quality and low-quality data. Compared with the most recent work , the proposed model achieves a significant ACC score improvement, from 80.18% to 84.84%. Moreover, the proposed method outperforms all listed methods for both HQ and LQ image qualities. These performance gains in both ACC and AUC scores are mainly due to the high-level and low-level joint learning strategy adopted in our formulation, further demonstrating the effectiveness of the proposed model.
IV-C2 Detection accuracy on specific manipulation methods
Detecting low-quality manipulated face images is a challenging task, as severe compression erases much detailed information from the original faces. To demonstrate that the proposed model can achieve remarkable detection performance on low-quality faces, we list the detection accuracy (ACC) for specific manipulation methods on the low-quality FF++ dataset in Table III. DF, FF, FS, and NT represent the four manipulation methods: Deepfakes, Face2Face, FaceSwap, and NeuralTextures. The manipulation-specific detectors are trained and tested on the same face manipulation method. Compared with previous detection methods, the proposed model achieves the best detection accuracy on all four manipulation methods.
|Method||LQ (QP=40) AUC||LQ (QP=40) ACC||HQ (QP=23) AUC||HQ (QP=23) ACC|
|Steg. Features||-||55.98||-||70.97|
|Cozzolino et al.||-||58.69||-||78.45|
|Bayer & Stamm||-||66.84||-||82.97|
|Rahmouni et al.||-||61.18||-||79.08|
|Face X-ray||61.60||-||87.35||-|
|Nirkin et al.||-||80.18||-||-|
IV-C3 Cross-dataset evaluation
Most existing detection models suffer a significant performance drop when applied to unseen datasets. To evaluate the robustness of our model, we train it on the Deepfakes (DF) subset of FF++ and evaluate it on the CelebDF dataset . The CelebDF dataset, which is widely considered for cross-dataset evaluation in this community [10, 11, 50], is the most challenging deepfake dataset, containing 5,639 sophisticated deepfake videos.
The cross-dataset evaluation results (AUC) are listed in Table IV. The AUC detection results of previous methods are directly cited from the very recent findings in  and . It can be observed that the AUC scores of most detectors drop significantly when testing on the CelebDF dataset, while the proposed model trained on high-quality faces (QP=23) achieves the second-best AUC score of 74.54. The cross-dataset evaluation experiment demonstrates that the proposed model generalizes well to unseen data.
IV-C4 Multi-class classification evaluation
Although identifying the authenticity of input faces is of great importance, specifying the manipulation method is also a non-trivial problem. Correctly classifying the manipulation approach of the input face can reveal whether the identity or the expression of the input face has been manipulated, which further unveils the potential intent of the forger.
|Method||DF||FF||FS||NT|
|Steg. Features||67.00||48.00||49.00||56.00|
|Cozzolino et al.||75.00||56.00||51.00||62.00|
|Bayer & Stamm||87.00||82.00||74.00||74.00|
|Rahmouni et al.||80.00||62.00||59.00||59.00|
We further evaluate the proposed model on this five-way (real and four respective manipulation methods) classification task. The classification results on the FF++ low quality (QP=40) dataset are reported in Table V. We use the bold font to highlight the best results while underlining the second-best results among all listed methods.
Compared with the result of the Xception baseline, the proposed model equipped with the semantic segmentation module (Xception(seg)) in the second-to-last row achieves a 5.61% recall rate improvement, from 75.43% to 81.04%. We argue that the classification performance gain mainly benefits from the supervision of the binary mask.
The benefit of the semantic map prediction module is twofold. First, this constrains our model to focus on the manipulated region of the input face, leading the model to mine more manipulation-related information. Second, different manipulation methods tend to leave different binary mask shapes, thus providing auxiliary information to perform the five-class classification.
As such, the scheme of binary mask supervision can separate the faces manipulated by different approaches. To better clarify this point, we further present the visualization of predicted mask maps on the testing set in Fig. 5. The mean map on the left represents the average summation of overall predicted localization maps and pristine maps. The top row shows the mean maps of four specific manipulation methods and pristine face images. The absolute difference maps between the top-row maps and the mean map are shown in the bottom row. It can be readily observed that each bias map is discriminative from the others.
The final classification result in the last row of Table V shows that the proposed model outperforms the previous state-of-the-art method, SPSL, by 3.25% in average recall rate, from 78.94% to 82.19%. The noise map prediction module also mines significant clues about the manipulation method. Thus, the final result (Xception(fusion)) achieves a 1.15% average recall rate improvement and a 6.22% pristine recall rate improvement compared with the results of Xception(seg), demonstrating the effectiveness of the noise map prediction module from a complementary viewpoint.
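For reference, per-class and average recall can be computed from a confusion matrix as sketched below. The sketch assumes that "average recall" means the unweighted mean of the per-class recalls; the function names are hypothetical and the 3-class matrix contains made-up counts used only to exercise the code, not results from this work.

```python
from typing import List


def per_class_recall(confusion: List[List[int]]) -> List[float]:
    """confusion[i][j] counts samples of true class i predicted as class j.
    Recall of class i is the diagonal entry divided by the row sum."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(confusion)
    ]


def average_recall(confusion: List[List[int]]) -> float:
    """Unweighted mean of the per-class recalls (an assumed definition)."""
    recalls = per_class_recall(confusion)
    return sum(recalls) / len(recalls)


# Made-up 3-class example; the five-way task would use a 5 x 5 matrix.
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 5, 5],
]
recalls = per_class_recall(toy_confusion)
mean_recall = average_recall(toy_confusion)
```

Because every class contributes equally, a detector that confuses pristine faces with one manipulation type is penalized even when the other classes are recognized well.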
Furthermore, we show the t-SNE feature embedding visualization of the Xception baseline and the proposed method in Fig. 6. As shown in Fig. 6 (a), the Xception model cannot separate different manipulation methods and pristine face images, and the pristine faces are clustered with NeuralTextures and Face2Face fake faces in the feature space. However, the proposed model achieves a better multi-class embedding division performance in the t-SNE feature space, and the pristine data is less confused with other manipulated data, demonstrating the effectiveness of the proposed model from another point of view.
IV-C5 Feature map visualization
To better demonstrate the effectiveness of the proposed model, we further visualize the feature maps of the Xception baseline and of the proposed model in Fig. 7. All real and fake faces are from the FF++ low-quality testing set, and the models are trained under the setting of Sec.IV-C.1).
As shown in Fig. 7, the baseline feature maps of real and corresponding fake faces are similar, resulting in poor manipulation detection performance. Conversely, the proposed model tends to focus on the central regions for fake faces and the peripheral regions for pristine faces. As a result, compared with the Xception baseline, the real and fake feature maps extracted from the proposed model are more discriminative, thus leading to better manipulation detection performance.
|Method||FF++ (DF)||CelebDF|
|Nirkin et al.||99.7||66.00|
|Xception (QP=0)||99.70||48.20|
|Xception (QP=23)||99.70||65.30|
|Xception (QP=40)||95.50||65.50|
In this subsection, we quantitatively report the face forgery detection performance under various experimental settings and qualitatively visualize the feature maps and t-SNE feature embedding distributions. The presented results demonstrate the effectiveness and robustness of the proposed framework, which captures artifacts at both the semantic level and the noise level, thus achieving superior detection performance in all mentioned experimental settings.
IV-D Manipulation Localization
In this subsection, we evaluate the manipulation localization performance of the proposed model both quantitatively and qualitatively.
IV-D1 Quantitative evaluation
As we apply the multi-scale feature learning strategy in the proposed framework, three predicted manipulation maps can be obtained for each given face. Herein, we apply a weighted summation strategy over the three predicted maps to determine the final manipulation localization map:
Here, the final manipulation localization map is the weighted sum of the three predicted maps, with one weight per map. We resize the three predicted masks (specified in Eqn. (3)) to the original input face size of 299 × 299, obtaining the aligned manipulation localization maps.
Intuitively, the features extracted from the deep layer tend to carry rich semantic information. Thus, we allocate a relatively larger weight to the deep-layer map. To study the best weighting strategy, we report the Intersection over Union (IoU) score for both high-quality and low-quality faces in Table VI. Comparing the IoUs in the first row (weights of 0.0, 0.0, and 1.0) with those in the fourth row (weights of 0.1, 0.2, and 0.7), we can conclude that the multi-scale feature learning strategy benefits the manipulation localization performance, because leveraging the information of all three features leads to better localization accuracy. As such, the three weights are set to 0.1, 0.2, and 0.7. Our model achieves IoU scores of 0.8413 and 0.9305 for low-quality and high-quality faces, respectively, demonstrating the effectiveness of the designed semantic map prediction module.
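The weighted summation of the three aligned maps can be sketched as follows. The maps are assumed to have already been resized to a common size, the weights are assumed to be listed in shallow, middle, deep order (so the deep map receives 0.7), and the function name and toy values are hypothetical.

```python
from typing import List, Sequence

Map = List[List[float]]  # one H x W localization map


def fuse_maps(maps: Sequence[Map], weights: Sequence[float]) -> Map:
    """Pixel-wise weighted sum of equally sized localization maps."""
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, maps)) for c in range(width)]
        for r in range(height)
    ]


# Toy 1 x 2 maps standing in for the shallow, middle, and deep predictions.
shallow = [[0.0, 1.0]]
middle = [[0.5, 1.0]]
deep = [[1.0, 1.0]]

final_map = fuse_maps([shallow, middle, deep], weights=(0.1, 0.2, 0.7))
```

With weights of 0.0, 0.0, and 1.0 the same function reduces to using the deep map alone, which corresponds to the single-scale setting compared in the first row of Table VI.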
|IoU (LQ)||IoU (HQ)|
IV-D2 Qualitative evaluation
To further study the manipulation localization performance of the proposed model, we qualitatively present the results of face manipulation localization on low-quality faces and high-quality faces in Fig. 8 (a) and Fig. 8 (b). The faces in different rows represent the fake faces created by the corresponding manipulation methods and the pristine real faces. The input face, the predicted localization map, and the corresponding ground-truth mask are shown in the left, middle, and right columns, respectively. Compared with the ground-truth masks, it can be seen that the manipulated regions are well captured by the proposed model for both low-quality and high-quality faces. Furthermore, our model can accurately determine the forgery locations for faces with extreme head poses (e.g., FS 1st example and NT 1st example in Fig. 8 (a); NT 3rd example and P 2nd example in Fig. 8 (b)) and very poor lighting conditions (e.g., P 4th example in Fig. 8 (a); DF 2nd example and NT 1st example in Fig. 8 (b)), which further validates the robustness of the proposed model.
In this subsection, we quantitatively and qualitatively evaluate the manipulation localization performance. The proposed model captures the high-level semantic information from the multi-scale features, with good localization accuracy and segmentation performance for low-quality and high-quality faces.
IV-E Ablation Study
To validate the effectiveness of the semantic map prediction module, the noise map prediction module, and the multi-scale feature extraction strategy in the proposed framework, we conduct extensive ablation experiments in this subsection.
IV-E1 Effectiveness of sub-modules
To study the effectiveness of the semantic map prediction module and the noise map prediction module, we report the manipulation detection results of the Xception backbone equipped with different modules on low-quality faces in Table VII. Comparing the AUC score in the third row with the detection result of the Xception baseline, we can conclude that the semantic map prediction module leads the model to mine more artifact clues and improves the detection performance. On the other hand, the detection performance drops slightly when only the noise map prediction module is added to the Xception backbone. The reason is that forcing the model to focus only on noise-level clues may cause severe loss of image information. In this work, the noise clue is therefore regarded as auxiliary information that complements the semantic map prediction module.
Compared with the detection results in the third row, the AUC score in the last row improves further, confirming the effectiveness of the noise map prediction module. To better validate the effectiveness of each module, we also show the ROC curves in Fig. 9. We can observe that the proposed model achieves the best AUC and the best TPR at low FPR values, and the low-FPR regime is one of the most challenging face manipulation detection scenarios.
We can conclude that the semantic map prediction module and the noise map prediction module indeed help improve the detection capability of the proposed model.
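The ROC-based quantities discussed above (AUC and TPR at a low FPR) can be computed as in the following sketch. It is an illustrative example only: the function names and the toy labels and scores are hypothetical, and the convention that label 1 marks a manipulated face is an assumption rather than something stated in the text.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (FPR, TPR)


def roc_points(labels: List[int], scores: List[float]) -> List[Point]:
    """Sweep the decision threshold from high to low and collect (FPR, TPR)."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points: List[Point]) -> float:
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at_fpr(points: List[Point], max_fpr: float) -> float:
    """Highest TPR among operating points whose FPR does not exceed max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


# Toy data: 1 = manipulated face, 0 = pristine face (assumed convention).
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.30, 0.10]

points = roc_points(labels, scores)
print("AUC:", auc(points))
print("TPR at FPR <= 0.25:", tpr_at_fpr(points, 0.25))
```

Reporting TPR at a fixed low FPR alongside AUC matters because two detectors with similar AUC can behave very differently when few false alarms are tolerated.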
IV-E2 Effectiveness of multi-scale feature learning
We further conduct an ablation study to demonstrate the effectiveness of the multi-scale feature learning strategy. The manipulation detection performance of different feature extraction strategies is listed in Table VIII. As illustrated in Fig. 2, three features are extracted from the shallow, middle, and deep layers of the Xception backbone, and each feature is processed by the semantic map prediction module and the noise map prediction module.
From the results in Table VIII, we can conclude that 1) using the deep feature layer leads to the largest AUC gain, as it contains rich semantic information; and 2) each extracted feature layer contributes to the final detection performance, verifying the effectiveness of the proposed multi-scale feature learning strategy.
Herein, we conduct a series of ablation studies to evaluate the effectiveness of the dedicated learning strategy and framework architecture. The quantitative experimental results demonstrate that the designed sub-modules and the multi-scale learning strategy lead the model to mine more artifact clues from the input faces, which further improves the final forgery detection performance.
V Conclusions and Future Work
This paper presented a novel framework to tackle the problem of face anti-manipulation. In particular, we introduced two complementary tasks, including semantic map prediction and noise map prediction, to capture the semantic-level and noise-level information to perform both face manipulation detection and forgery localization. Extensive experimental results show that the proposed two-stream multi-scale face anti-manipulation framework outperforms the state-of-the-art detection methods and is on par with the state-of-the-art cross-dataset detection methods for both high-quality and low-quality faces.
The proposed semantic map prediction module enables the model to perform face manipulation localization, and it also constrains the model to focus on manipulated regions, thus leading to better binary and multi-class classification performance. Furthermore, the noise map prediction module serves as a complementary module: it provides meaningful noise-level clues that support the final decision-making.
Last but not least, an ablation study demonstrated the effectiveness of the semantic map prediction module and the noise map prediction module, and the multi-scale feature learning strategy indeed helps the model improve its manipulation detection performance.
While our proposed method has been shown to be effective for face anti-manipulation tasks, we limit our scope to only four different attacks (i.e., DeepFakes, Face2Face, FaceSwap, and NeuralTextures). Therefore, adapting our method to unseen face manipulations is worth investigating in the future, although the challenging cross-dataset validation already gives a good indication of the method's performance. In addition, since natural images and videos face a similar threat of malicious manipulation, it is also worth applying our method to manipulation detection for other types of multimedia content.
A. Rocha thanks the financial support of the São Paulo Research Foundation for grant DéjàVu #2017/12646-3.
-  “Github deepfake faceswap,” https://github.com/deepfakes/faceswap, 2018.
-  “Deepfakes faceswap,” [EB/OL], 2019, https://github.com/deepfakes/faceswap.
-  J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner, “Face2face: Real-time face capture and reenactment of rgb videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2387–2395.
-  J. Thies, M. Zollhöfer, and M. Nießner, “Deferred neural rendering: Image synthesis using neural textures,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–12, 2019.
-  “Fakeapp,” [EB/OL], 2018, https://www.malavida.com/en/soft.
-  Y. Li, M.-C. Chang, and S. Lyu, “In ictu oculi: Exposing ai created fake videos by detecting eye blinking,” in 2018 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2018, pp. 1–7.
-  X. Yang, Y. Li, and S. Lyu, “Exposing deep fakes using inconsistent head poses,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 8261–8265.
-  L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo, “Face x-ray for more general face forgery detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5001–5010.
-  Y. Qian, G. Yin, L. Sheng, Z. Chen, and J. Shao, “Thinking in frequency: Face forgery detection by mining frequency-aware clues,” in European Conference on Computer Vision. Springer, 2020, pp. 86–103.
-  H. Liu, X. Li, W. Zhou, Y. Chen, Y. He, H. Xue, W. Zhang, and N. Yu, “Spatial-phase shallow learning: rethinking face forgery detection in frequency domain,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 772–781.
-  I. Masi, A. Killekar, R. M. Mascarenhas, S. P. Gurudatt, and W. AbdAlmageed, “Two-branch recurrent network for isolating deepfakes in videos,” in European Conference on Computer Vision. Springer, 2020, pp. 667–684.
-  H. Dang, F. Liu, J. Stehouwer, X. Liu, and A. K. Jain, “On the detection of digital face manipulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5781–5790.
-  M. Du, S. Pentyala, Y. Li, and X. Hu, “Towards generalizable deepfake detection with locality-aware autoencoder,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 325–334.
-  A. Chintha, B. Thai, S. J. Sohrawardi, K. Bhatt, A. Hickerson, M. Wright, and R. Ptucha, “Recurrent convolutional structures for audio spoof and video deepfake detection,” IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 5, pp. 1024–1037, 2020.
-  K. Songsri-in and S. Zafeiriou, “Complement face forensic detection and localization with facial landmarks,” arXiv preprint arXiv:1910.05455, 2019.
-  X. Li, Y. Lang, Y. Chen, X. Mao, Y. He, S. Wang, H. Xue, and Q. Lu, “Sharp multiple instance learning for deepfake video detection,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1864–1872.
-  J. H. Bappy, A. K. Roy-Chowdhury, J. Bunk, L. Nataraj, and B. Manjunath, “Exploiting spatial structure for localizing manipulated image regions,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 4970–4979.
-  P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, “Learning rich features for image manipulation detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1053–1061.
-  P. Zhou, B.-C. Chen, X. Han, M. Najibi, A. Shrivastava, S.-N. Lim, and L. Davis, “Generate, segment, and refine: Towards generic manipulation segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 13058–13065.
-  A. George and S. Marcel, “Deep pixel-wise binary supervision for face presentation attack detection,” in 2019 International Conference on Biometrics (ICB). IEEE, 2019, pp. 1–8.
-  Y. Liu, J. Stehouwer, A. Jourabloo, and X. Liu, “Deep tree learning for zero-shot face anti-spoofing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4680–4689.
-  W. Sun, Y. Song, C. Chen, J. Huang, and A. C. Kot, “Face spoofing detection based on local ternary label supervision in fully convolutional networks,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 3181–3196, 2020.
-  A. C. Popescu and H. Farid, “Statistical tools for digital forensics,” in international workshop on information hiding. Springer, 2004, pp. 128–147.
-  B. Mahdian and S. Saic, “Using noise inconsistencies for blind image forensics,” Image and Vision Computing, vol. 27, no. 10, pp. 1497–1503, 2009.
-  S. Lyu, X. Pan, and X. Zhang, “Exposing region splicing forgeries with blind local noise estimation,” International journal of computer vision, vol. 110, no. 2, pp. 202–221, 2014.
-  J. Fridrich, J. Lukas, and M. Goljan, “Digital camera identification from sensor noise,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 205–214, 2006.
-  P. Korus and J. Huang, “Multi-scale analysis strategies in prnu-based tampering localization,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 4, pp. 809–824, 2016.
-  Y. Quan and C.-T. Li, “On addressing the impact of iso speed upon prnu and forgery detection,” IEEE Transactions on Information Forensics and Security, vol. 16, pp. 190–202, 2020.
-  D. Cozzolino and L. Verdoliva, “Noiseprint: A cnn-based camera model fingerprint,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 144–159, 2019.
-  S. Zagoruyko and N. Komodakis, “Learning to compare image patches via convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4353–4361.
-  A. Jourabloo, Y. Liu, and X. Liu, “Face de-spoofing: Anti-spoofing via noise modeling,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 290–306.
-  Y. Ren, Y. Hu, B. Liu, Y. Xie, and Y. Wang, “Face anti-spoofing with a noise-attention network using color-channel difference images,” in International Conference on Artificial Neural Networks. Springer, 2020, pp. 517–526.
-  F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
-  A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, “Faceforensics++: Learning to detect manipulated facial images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1–11.
-  J. Lukas, J. Fridrich, and M. Goljan, “Digital camera identification from sensor pattern noise,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 205–214, 2006.
-  M. K. Mihcak, I. Kozintsev, and K. Ramchandran, “Spatially adaptive statistical modeling of wavelet image coefficients and its application to denoising,” in 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), vol. 6. IEEE, 1999, pp. 3253–3256.
-  D. E. King, “Dlib-ml: A machine learning toolkit,” The Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009.
-  A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library.” in NeurIPS, 2019.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
-  Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, “Celeb-df: A large-scale challenging dataset for deepfake forensics,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3207–3216.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
-  H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1520–1528.
-  Y. Nirkin, L. Wolf, Y. Keller, and T. Hassner, “Deepfake detection based on discrepancies between faces and their context,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
-  J. Fridrich and J. Kodovsky, “Rich models for steganalysis of digital images,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 868–882, 2012.
-  D. Cozzolino, G. Poggi, and L. Verdoliva, “Recasting residual-based local descriptors as convolutional neural networks: an application to image forgery detection,” in Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security, 2017, pp. 159–164.
-  B. Bayar and M. C. Stamm, “A deep learning approach to universal image manipulation detection using a new convolutional layer,” in Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security, 2016, pp. 5–10.
-  N. Rahmouni, V. Nozick, J. Yamagishi, and I. Echizen, “Distinguishing computer graphics from natural images using convolution neural networks,” in 2017 IEEE Workshop on Information Forensics and Security (WIFS). IEEE, 2017, pp. 1–6.
-  D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “Mesonet: a compact facial video forgery detection network,” in 2018 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2018, pp. 1–7.
-  H. Zhao, W. Zhou, D. Chen, T. Wei, W. Zhang, and N. Yu, “Multi-attentional deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2185–2194.
-  P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, “Two-stream neural networks for tampered face detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2017, pp. 1831–1839.
-  Y. Li and S. Lyu, “Exposing deepfake videos by detecting face warping artifacts,” arXiv preprint arXiv:1811.00656, 2018.
-  H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen, “Multi-task learning for detecting and segmenting manipulated facial images and videos,” in 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE, 2019, pp. 1–8.
-  H. H. Nguyen, J. Yamagishi, and I. Echizen, “Capsule-forensics: Using capsule networks to detect forged images and videos,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2307–2311.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.