Synthesized media, particularly deepfakes, has become a major source of concern in recent years. The now well-known term “deepfake” is a combination of the words “deep learning” and “fake” that was first introduced in late 2017. It refers to deep adversarial models (for instance, generative adversarial networks (GANs)) that generate fake videos by swapping one person’s face with that of another using methods such as FaceSwap and FaceShifter. Current lines of research also include expression swapping among deepfake generation techniques, e.g., Face2Face and NeuralTextures. Since then, the amount of media generated using deepfakes has increased rapidly, owing to rapid progress in computer vision and to more accessible and capable hardware. Deepfakes have been carriers of misinformation and malice since their inception, posing political and social threats.
This use of deep learning and computer vision to generate high-fidelity synthetic media has thus become a major security risk and has been flagged as a top AI threat [15, 5]. As a result, there has been a resurgence of research into detecting facial manipulations using both data-driven deep learning and biometric anti-spoofing techniques [16, 25, 27].
Current CNN-based deepfake detection methods (such as ResNet, MesoInceptionNet, and XceptionNet) mainly detect visual artifacts created by the resolution and color inconsistency between the warped face area and the surrounding context [19, 20, 1, 29] during the image-blending operation. Li et al. proposed a general method called Face X-ray that detects forgery by locating the blending boundary of a forged image using a two-class CNN trained end-to-end with a classification loss and a loss associated with the face X-ray. This model outperformed most existing CNN models in deepfake detection. Studies in [4, 7] used the optical flow of facial expressions as a cue for detecting altered video footage.
Most of the aforementioned deepfake detection methods [19, 20, 1, 29, 4, 7] are training-based and therefore obtain very high performance in intra-dataset evaluation. However, they generalize poorly to high-quality fakes produced by evolving deepfake generation techniques, obtaining much lower AUCs in cross-dataset evaluation.
Media articles such as [5, 23] suggest using biometric technology to identify deepfakes. Recently, in 2020, Nguyen and Derakhshani used ocular-based biometric matching to distinguish between real and fake images of identities using CNNs, namely LightCNN, ResNet, DenseNet, and SqueezeNet, trained for ocular-based user recognition. Biometric-tailored loss functions (such as Center, ArcFace, and A-Softmax) have also been used for two-class CNN training for deepfake detection. It is worth mentioning that that study used only face-recognition-tailored loss functions in CNN training, not face recognition technology itself, to detect deepfakes.
With advances in deepfake generation techniques, visual artifacts and other inconsistencies become progressively indistinguishable in high-quality, high-resolution deepfakes, yet facial features are still corrupted by the swapping and image-blending operations. Therefore, face recognition may be preferable for detecting fake images by matching the corrupted feature vectors of the swapped face against the original templates of the target identity. A minimal analysis along these lines was carried out on a first-generation dataset using 129 measures of face image quality, including signal-to-noise ratio, specularity, and blurriness, together with a Support Vector Machine for deepfake detection. Studies in [3, 2, 11, 8] have used behavioral biometrics, i.e., facial expression and head and body movement, for deepfake detection, but performed only limited evaluation on identity-swapping-based deepfake generation techniques.
This paper aims to evaluate the efficacy of deep face recognition (the terms face recognition and face verification are used interchangeably in this study), trained using different loss functions, in identifying deepfakes generated using various methods. This study provides insight into which deepfake generation techniques can be effectively detected by face recognition technology and which loss functions are optimal. This approach has the advantage of bypassing the need for model training on fake images. However, identity labels are required for the biometric comparison (template vs. query image) used in deepfake detection. Figure 1 illustrates deepfake detection using deep face recognition.
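The matching-based detection idea illustrated in Figure 1 can be sketched as follows. This is a minimal sketch, not the paper's implementation: the tiny vectors and the decision threshold are illustrative stand-ins, whereas a real system would use deep feature embeddings from the pretrained ResNet-50 model.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_deepfake(template, query, threshold=0.36):
    """Flag the query image as fake when its embedding fails to match the
    claimed identity's enrolled template (threshold is a hypothetical value)."""
    return cosine_similarity(template, query) < threshold
```

A query frame whose embedding lies close to the enrolled template is accepted as real; a face-swapped frame, whose identity features are corrupted, falls below the threshold and is flagged.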
In summary, the main contributions of this paper are as follows:
To evaluate the efficiency of face recognition in identifying deepfakes generated using various methods. To this aim, deep face recognition models based on ResNet-50, pretrained on the large-scale face recognition datasets MS1M-ArcFace and WebFace12M using six different loss functions (Softmax, ArcFace, Combined Margin, SphereFace, CosFace, and Triplet loss), are used for deepfake detection.
Use of explainable-AI-based t-distributed stochastic neighbor embedding (t-SNE) to visualize the effectiveness of facial recognition technology in detecting deepfakes.
Thorough experimental investigation on the commonly adopted Celeb-DF and FaceForensics++ datasets across five different deepfake generation methods, namely FaceSwap, Face2Face, FaceShifter, NeuralTextures, and Deepfakes (the method of using autoencoders for deepfake generation is simply termed DeepFakes in the existing literature [32, 24]).
This paper is organized as follows: Section II discusses methods for deepfake generation. Section III presents the datasets and the experimental protocol. Results and discussion are presented in Section IV. Conclusions and future work are discussed in Section V.
II. Methods for Deepfake Generation
In this section, we discuss the various methods used for deepfake creation [32, 24]. Depending on the level of manipulation, these methods can be broadly categorized into identity swap and expression swap.
Identity Swap: Consists of replacing the face of a person in a video with the face of another person. FaceSwap (https://github.com/deepfakes/faceswap), FaceShifter, and Deepfakes are examples of identity-swapping methods, explained as follows:
FaceSwap: An identity-swapping method that transfers the face region from a source video to a target video using a graphics-based approach driven by detected facial landmarks. To swap the source person’s face onto the target person, it uses face alignment, Gauss-Newton optimization, and image blending.
FaceShifter: An identity-swapping method that uses a generator with Adaptive Attentional Denormalization (AAD) layers to adaptively integrate identity and attributes for face synthesis, together with an attribute encoder that extracts multi-level target face attributes.
Deepfakes: In this method, two autoencoders that share an encoder are trained to reconstruct the source and target faces, respectively. To generate the fake image, the trained encoder and decoder of the source face are applied to the target face, and the autoencoder’s output is then blended into the target frame using Poisson image editing.
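The shared-encoder / per-identity-decoder structure above can be illustrated with a toy sketch. This is an assumption-laden stand-in: linear maps and random vectors replace the deep convolutional autoencoders and real face crops of the actual method, and all dimensions and the learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for aligned face crops of two identities (flattened 8x8 "images").
D, H, N = 64, 16, 200
faces = {"source": rng.normal(0.0, 1.0, (N, D)),
         "target": rng.normal(0.5, 1.0, (N, D))}

E = rng.normal(0, 0.1, (D, H))                        # shared encoder
dec = {k: rng.normal(0, 0.1, (H, D)) for k in faces}  # one decoder per identity

def reconstruct(X, W):
    return (X @ E) @ W

def mse(X, W):
    return float(np.mean((reconstruct(X, W) - X) ** 2))

loss_before = mse(faces["source"], dec["source"])
lr = 1e-3
for _ in range(300):
    for k, X in faces.items():
        Z = X @ E
        err = reconstruct(X, dec[k]) - X          # gradient of 0.5 * MSE
        dec[k] -= lr * (Z.T @ err) / N
        E -= lr * (X.T @ (err @ dec[k].T)) / N
loss_after = mse(faces["source"], dec["source"])

# Face swap: encode a target face with the shared encoder, then decode it with
# the *source* decoder; blending (e.g., Poisson editing) would follow in practice.
fake = reconstruct(faces["target"][:1], dec["source"])
```

Because the encoder is shared, it learns identity-agnostic structure, while each decoder learns to render one identity; decoding a target face with the source decoder is what produces the swapped output.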
Expression Swapping: Consists of facial reenactment by modifying the facial expression of the person.
Face2Face: In Face2Face, a temporary face identity is established using the first few frames, and the expression is then tracked over the remaining frames. Fake videos are created by transferring the source expression parameters of each frame to the target video.
NeuralTextures: NeuralTextures employs a rendering network with a patch-based GAN loss to learn a neural texture of the target individual from the source video. Only the mouth-region facial expression is altered, making it extremely difficult to detect.
III. Dataset and Experimental Protocol
In this section, we discuss the datasets used and the experimental protocol followed for deepfake detection using face recognition technology.
To evaluate the efficiency of face recognition technology in identifying high-quality deepfakes without visual artifacts, generated using various methods, we tested our models on the high-quality deepfake datasets Celeb-DF and FaceForensics++, which contain highly realistic fake videos. These commonly adopted datasets also provide subject identity labels, which facilitates face recognition.
These datasets are described as follows:
Celeb-DF: The Celeb-DF deepfake forensics dataset includes genuine celebrity videos as well as deepfake videos. In contrast to other datasets, Celeb-DF has essentially no splicing borders, color mismatches, or inconsistencies in face orientation, among other evident deepfake visual artifacts.
FaceForensics++: FaceForensics++ is an automated benchmark for facial manipulation detection. It consists of manipulated videos created using two categories of methods: identity swapping (FaceSwap, FaceSwap-Kowalski, FaceShifter, Deepfakes) and expression swapping (Face2Face and NeuralTextures). We used the FaceForensics++ compressed test set, which provides a curated list of videos for each deepfake creation method.
III-B Experimental Protocol
We used the popular ResNet architecture, as it is widely used for face recognition. ResNet is short for residual network and is based on the idea of an “identity shortcut connection” in which input features may skip certain layers. In this study, we used ResNet-50, which has about 25 million parameters. ResNet-50 was trained from scratch on the MS1M-ArcFace [13, 10] and WebFace12M datasets. The MS1M-ArcFace dataset is a cleaned version of the MS1M dataset. The WebFace12M dataset is a cleaned subset of the complete WebFace260M dataset, containing around 12 million face images. Faces were detected and aligned using MTCNN, which utilizes a cascaded CNN framework for joint face detection and alignment. The images were then resized to 112×112 for both training and evaluation. The model was trained using six different loss functions, i.e., ArcFace, CosFace, SphereFace, Combined Margin, Triplet loss, and Softmax. The ArcFace, SphereFace, and CosFace loss functions enhance performance by learning intra-class similarity and inter-class diversity.
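To illustrate how the margin-based losses differ from plain Softmax, the sketch below applies an ArcFace-style additive angular margin or a CosFace-style additive cosine margin to the target-class cosine logit before standard softmax cross-entropy. The scale s and margin m defaults are commonly used values, assumed here for illustration only.

```python
import numpy as np

def margin_logits(cos_theta, labels, kind="arcface", s=64.0, m=0.5):
    """Penalize the target-class cosine logit, then scale by s.
    cos_theta: (batch, n_classes) cosines between embeddings and class weights."""
    logits = cos_theta.copy()
    idx = np.arange(len(labels))
    t = np.clip(logits[idx, labels], -1.0, 1.0)
    if kind == "arcface":          # additive angular margin: cos(theta + m)
        logits[idx, labels] = np.cos(np.arccos(t) + m)
    elif kind == "cosface":        # additive cosine margin: cos(theta) - m
        logits[idx, labels] = t - m
    return s * logits              # fed into standard softmax cross-entropy
```

By shrinking only the target-class logit, the margin forces embeddings of the same identity to cluster more tightly before the sample is considered correctly classified, which is what yields the intra-class compactness and inter-class separability mentioned above.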
For the ResNet-50 model, a batch-normalization layer followed by a final fully connected layer of size 512 and an output layer equal to the number of classes (subjects) were used. The angular-margin penalty hyper-parameters of the combined margin were set following the original ArcFace implementation. The networks were trained using a Stochastic Gradient Descent (SGD) optimizer; the learning rate was decayed in steps at fixed iteration milestones, with momentum and weight decay also following the original ArcFace implementation. All experiments were conducted on an Intel Xeon processor and two Nvidia Quadro RTX GPUs.
For the subject-disjoint evaluation of the pretrained face recognition models for deepfake detection on Celeb-DF and FaceForensics++, the gallery set consists of real face images (frames extracted from videos) for each subject. A subject-disjoint protocol means that the subject identities used to train the face recognition model do not overlap with the identities used for deepfake evaluation; recall that the MS1M-ArcFace and WebFace12M datasets were used to train the ResNet-50 based face recognition models. The probe set consists of real and fake face images randomly selected per subject. Depending on the dataset, multiple videos per subject (Celeb-DF, DFD (a subset of FF++)) or a single video per subject (FF++) are used to create pairs of frames. From the gallery and probe sets of all subjects, 512-dimensional deep features are extracted using the pretrained ResNet-50 model. The genuine match score is calculated between frames extracted from real videos and the gallery face images of the corresponding subject, and the imposter score is calculated between frames extracted from fake videos and the gallery face images of the target identity, using the cosine similarity metric. We used the Equal Error Rate (EER) and Area Under the Curve (AUC) as the performance metrics for deepfake detection using face recognition technology.
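The EER and AUC metrics can be computed from the genuine and imposter cosine-similarity score lists described above; the following is a minimal sketch, not the exact evaluation code used in the paper.

```python
import numpy as np

def eer_auc(genuine, imposter):
    """EER and AUC from genuine (real-probe vs. gallery) and imposter
    (fake-probe vs. gallery) similarity scores."""
    genuine = np.asarray(genuine, float)
    imposter = np.asarray(imposter, float)
    thresholds = np.sort(np.unique(np.concatenate([genuine, imposter])))
    far = np.array([(imposter >= t).mean() for t in thresholds])  # fakes accepted
    frr = np.array([(genuine < t).mean() for t in thresholds])    # real rejected
    i = int(np.argmin(np.abs(far - frr)))
    eer = (far[i] + frr[i]) / 2.0
    # AUC as the probability that a genuine score exceeds an imposter score.
    auc = float(np.mean(genuine[:, None] > imposter[None, :]))
    return eer, auc
```

The EER is the operating point where the false accept rate (a fake matching the template) equals the false reject rate (a real frame failing to match), and the AUC summarizes separability across all thresholds.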
IV. Results and Discussion
TABLE I: Deepfake detection performance (EER and AUC) across identity-swapping and expression-swapping methods, compared with Nguyen and Derakhshani (ocular recognition, Softmax) and Li et al. (Face X-ray, Softmax).
Table I shows the deepfake detection performance in terms of Equal Error Rate (EER) and AUC evaluated on the Celeb-DF and FaceForensics++ datasets, using the face recognition ResNet-50 model pretrained on the MS1M-ArcFace and WebFace12M datasets. The top results are highlighted in bold for both training datasets and across the different deepfake generation techniques.
As can be seen from the table, the identity-swapping methods, namely FaceSwap, FaceShifter, and FaceSwap-Kowalski, obtain near-perfect AUC and very low EER on average, demonstrating the efficiency of face recognition in detecting fakes created by these methods. The autoencoder-based deepfake method also obtained a high AUC score. This demonstrates that even high-quality deepfakes without apparent visual artifacts, such as those in the Celeb-DF dataset, have their facial features corrupted by the blending operation used in face-swapping techniques. These corrupted facial features are efficiently detected by a face recognition algorithm that matches original templates to the deepfake images of the target identity. The results obtained using face recognition technology significantly outperformed ocular recognition for deepfake detection in a subject-disjoint evaluation on Celeb-DF. The results are also better than those of the popular two-class CNN-based Face X-ray model on the Celeb-DF dataset (see Table I). The Face X-ray model was chosen for comparison because it outperformed other CNN models. Note that the existing studies [3, 2, 11, 8] on using behavioral biometrics (such as facial expression and head-pose movement) for deepfake detection performed very limited evaluation on identity-swapping-based deepfake generation techniques and reported only the AUC score; therefore, we did not use these studies for cross-comparison.
The expression-swapping methods, namely Face2Face and NeuralTextures, obtained the lowest performance, with EERs substantially higher than those obtained for the identity-swapping techniques. NeuralTextures performed even lower than Face2Face because it primarily alters the facial expression around the mouth region, resulting in less deformed facial discriminators. Face2Face uses a re-targeting and warping procedure to swap expressions, leaving many of the original facial features intact, so the face recognition model fails to detect these deepfakes. Thus, facial recognition technology is not effective in detecting deepfakes generated using expression-swapping methods, primarily because only the expression is changed while the identity features are kept intact. Experiments on the FF++ dataset suggest that even within the same environment and conditions (i.e., using a single video per subject), expression-swapping methods severely underperform. The experimental results also suggest that the dataset used to train the face recognition model (i.e., MS1M-ArcFace or WebFace12M) has no impact on deepfake detection accuracy evaluated on the Celeb-DF and FaceForensics++ deepfake datasets.
Among the loss functions used to train the face recognition model, Combined Margin and CosFace performed the best, while Triplet loss performed the worst on average. When trained on the MS1M-ArcFace dataset, CosFace obtained the best performance in detecting FaceSwap, FaceShifter, FaceSwap-Kowalski, and Deepfakes in FaceForensics++, as well as on the Celeb-DF dataset. SphereFace obtained the best performance in detecting Face2Face, and NeuralTextures was best detected using the Softmax loss.
Similarly, when trained on the WebFace12M dataset, Combined Margin obtained the best performance in detecting FaceSwap, FaceSwap-Kowalski, and Deepfakes in FaceForensics++, as well as on the Celeb-DF dataset. Face2Face and NeuralTextures were best detected using Softmax, and FaceShifter using the CosFace loss function. Loss functions based on a cosine margin penalty, such as ArcFace, CosFace, and Combined Margin, fared better on average than the others in deepfake detection, whereas Triplet loss obtained the lowest performance.
Figure 2 shows the t-SNE visualization of the deep feature embeddings of real and fake images from the FaceSwap, NeuralTextures, and Deepfakes fake-creation techniques. There is a clear separation between the facial features of genuine images and deepfakes for the FaceSwap approach, and a similar observation holds for the other identity-swapping methods. However, the real and fake features overlap for the expression-swapping approaches, as can be seen for the NeuralTextures method.
Another noteworthy finding from these visualizations is that for Deepfakes, which is based on autoencoders, the deep features are considerably more spread out than for the other approaches, indicating a lack of consistency across different frames of the same subject. Figure 3 shows histograms of the genuine and deepfake score distributions obtained by comparing the templates with real and fake images across the different deepfake generation techniques. Consistent with the t-SNE visualization, the expression-swapping methods show greater overlap between the genuine and deepfake distributions than the identity-swapping methods.
In summary, the experimental results demonstrate the effectiveness of face recognition technology in identifying the various identity-swapping-based deepfake generation methods. The Combined Margin and CosFace loss functions obtained the best deepfake detection rates, as they attain better intra-class compactness while maximizing inter-class separability. The results obtained using face recognition technology significantly outperformed the existing biometric studies using the facial region and the popular Face X-ray model for deepfake detection (see Table I). Facial recognition technology is not effective in detecting deepfakes generated using expression-swapping methods, which change only the expression while keeping the identity features intact.
V. Conclusion and Future Work
As most existing deepfake detection algorithms rely on visible structural artifacts or color inconsistencies, they do not perform well on high-quality datasets comprising next-generation deepfakes, such as those in Celeb-DF and FaceForensics++. In this paper, we evaluated the effectiveness of deep face recognition in distinguishing high-quality deepfake images and videos from real ones of the same identity, using the notion of detecting corrupted facial features rather than image anomalies. Experimental results demonstrated the efficiency of face recognition technology in identifying deepfakes based on identity-swapping methods, surpassing the results obtained by two-class CNNs on the same datasets. The Combined Margin and CosFace loss functions obtained the best deepfake detection rates. However, face recognition technology could not be used to detect expression swaps in deepfakes. One limitation of using biometric technology for detecting deepfakes is the requirement of the subject's identity for biometric facial feature matching. As future work, the bias of face recognition technology in identifying deepfakes across demographic variations will be evaluated. A thorough investigation will be conducted to understand the effectiveness and failure modes of face recognition technology across evolving deepfake generation techniques. The fusion of behavioral biometrics with facial features will also be explored for enhanced deepfake detection performance.
This work is supported in part by grant no. 210716 from University Research/Creative Projects at Wichita State University. The research infrastructure is supported in part by grant no. 13106715 from the Defense University Research Instrumentation Program (DURIP) of the Air Force Office of Scientific Research.
- (2018) MesoNet: a compact facial video forgery detection network. In IEEE International Workshop on Information Forensics and Security (WIFS).
- (2020) Detecting deep-fake videos from appearance and behavior. In IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6.
- Protecting world leaders against deep fakes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, pp. 8.
- (2019) Deepfake video detection through optical flow based CNN. In IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1205–1207.
- (2020) Deepfakes declared top AI threat, biometrics and content attribution scheme proposed to detect them. BiometricUpdate.com.
- (2019) Chinese deepfake app Zao sparks privacy row after going viral. Guardian News and Media.
- (2020) Leveraging edges and optical flow on faces for deepfake detection. In IEEE International Joint Conference on Biometrics (IJCB), pp. 1–10.
- (2020) ID-Reveal: identity-aware deepfake video detection. CoRR abs/2012.02512.
- (2019) Deepfakes: what are they and why would I make one? BBC.
- (2019) ArcFace: additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4690–4699.
- (2020) Identity-driven deepfake detection. CoRR abs/2012.03930.
- (2014) Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), pp. 2672–2680.
- (2016) MS-Celeb-1M: a dataset and benchmark for large-scale face recognition. In Computer Vision - ECCV 2016, Vol. 9907, pp. 87–102.
- (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- (2020) Deepfakes: a grounded threat assessment. Technical report, Center for Security and Emerging Technology, Georgetown University.
- (2018) DeepFakes: a new threat to face recognition? Assessment and detection. CoRR abs/1812.08685.
- (2019) FaceShifter: towards high fidelity and occlusion aware face swapping. CoRR abs/1912.13457.
- (2020) Face X-ray for more general face forgery detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5000–5009.
- (2019) Exposing deepfake videos by detecting face warping artifacts. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 46–52.
- (2020) Celeb-DF: a large-scale challenging dataset for DeepFake forensics. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, United States.
- (2021) An experimental evaluation of recent face recognition losses for deepfake detection. In 25th International Conference on Pattern Recognition (ICPR), pp. 9827–9834.
- (2020) Microsoft's deepfake-spotting tech may have biometric applications. BiometricUpdate.com.
- (2021) The creation and detection of deepfakes. ACM Computing Surveys 54(1), pp. 1–41.
- (2020) Eyebrow recognition for identifying deepfake videos. In International Conference of the Biometrics Special Interest Group (BIOSIG), pp. 1–5.
- (2020) DeepFaceLab: a simple, flexible and extensible face swapping framework. CoRR abs/2005.05535.
- (2020) PRNU-based detection of facial retouching. IET Biometrics 9(4), pp. 154–164.
- (2019) FaceForensics++: learning to detect manipulated facial images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1–11.
- (2019) On the detection of digital face manipulation. arXiv preprint arXiv:1910.01717.
- (2016) Face2Face: real-time face capture and reenactment of RGB videos. In Proc. CVPR.
- (2019) Deferred neural rendering: image synthesis using neural textures. arXiv e-prints.
- (2020) DeepFakes and beyond: a survey of face manipulation and fake detection. arXiv preprint arXiv:2001.00179.
- Visualizing data using t-SNE. Journal of Machine Learning Research 9(86), pp. 2579–2605.
- (2018) CosFace: large margin cosine loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5265–5274.
- (2016) Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23(10), pp. 1499–1503.
- (2021) WebFace260M: a benchmark unveiling the power of million-scale deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10492–10502.