DeepfakeUCL: Deepfake Detection via Unsupervised Contrastive Learning

04/23/2021 · by Sheldon Fung, et al. · University of Fukui, Microsoft, Deakin University

Face deepfake detection has seen impressive results recently. Nearly all existing deep learning techniques for face deepfake detection are fully supervised and require labels during training. In this paper, we design a novel deepfake detection method via unsupervised contrastive learning. We first generate two different transformed versions of an image and feed them into two sequential sub-networks, i.e., an encoder and a projection head. The unsupervised training is achieved by maximizing the correspondence degree of the outputs of the projection head. To evaluate the detection performance of our unsupervised method, we further use the unsupervised features to train an efficient linear classification network. Extensive experiments show that our unsupervised learning method enables comparable detection performance to state-of-the-art supervised techniques, in both the intra- and inter-dataset settings. We also conduct ablation studies for our method.


I Introduction

Realistic face synthesis and manipulation have led to a rapid increase of deepfake images and videos, which pose security and privacy threats to our society. In light of this, researchers have proposed countermeasures against face deepfakes, i.e., methods for detecting or recognizing them. Note that there are different kinds of deepfakes; in this work, we use deepfake to mean face deepfake unless stated otherwise.

It is challenging to detect manipulated faces because deep learning techniques for face manipulation keep evolving. Early deepfake detection methods focused on cues left by face manipulation techniques; for example, Li et al. [9] determined whether an image was manipulated by analyzing the frequency of eye blinking. These methods are fragile and tend to fail when such cues are removed or absent. Deep learning methods were therefore proposed for face manipulation detection and have achieved promising results through supervised learning. As manipulation techniques evolve, deepfake content becomes increasingly realistic, and well-trained models become obsolete and require retraining on new data. However, supervised learning has to "see" the labels, and labeling is time-consuming and tedious. To our knowledge, unsupervised learning for deepfake detection has rarely been studied so far. It is more challenging than supervised learning because labels are not available during training.

In this paper, we design a novel unsupervised learning approach for face manipulation detection. The core idea is to first generate two transformed versions of a face image using two different transformations, and then maximize their agreement after passing them through an encoder network and a projection head network. This is inspired by a contrastive framework [30]. The model trained without supervision is then used to produce features (i.e., the output of the encoder) that serve as the input of a linear classifier network for deepfake evaluation. We use the output of the encoder because it carries more effective features than the output of the projection head (see Section IV-D).

Extensive experiments on three publicly available datasets validate our unsupervised contrastive learning method. It yields performance comparable to state-of-the-art supervised learning methods and to non-deep-learning methods in both the intra-dataset and inter-dataset settings. We also perform ablation studies for our method. Our main contributions are as follows.

  • We propose an unsupervised contrastive learning approach for deepfake detection.

  • We conduct a variety of experiments to test our method and compare it with state-of-the-art deepfake detection techniques.

The rest of the paper is organized as follows. Section II reviews previous research on face manipulation and deepfake detection. Section III presents the proposed approach. Section IV presents and analyzes the experimental results. Section V concludes this work.

II Related Work

In this section, we first review previous face synthesis and manipulation methods, and then survey recent work on the detection of face forgery.

Face manipulation approaches. Methods for face image manipulation existed before the emergence of deep learning. Dale et al. [2] introduced a face-swapping method based on a 3D multi-linear model for face tracking and warping. A similar approach was presented by Thies et al. [3], which tracks facial expressions using a dense photometric consistency measure. Zhmoginov and Sandler [4] introduced deep learning into the task of face manipulation. Concretely, they used neural networks to invert low-dimensional face embeddings while producing realistic, consistent images, an approach later implemented in the popular phone app FaceApp [5]. Since then, many deep learning based face manipulation methods have arisen. For example, Thies et al. [6] introduced an image synthesis approach that combines the traditional graphics pipeline with learnable components, which was later used to generate manipulated faces in FaceForensics++ [7]. A related technique was introduced by Shen and Liu [8], which modifies a face image according to a given attribute value. Shao et al. [37] proposed to explicitly transfer expressions by directly mapping two unpaired images to two synthesized images with swapped expressions.

Detection methods. The privacy and security impact of face manipulation on individuals and society has driven researchers to develop detection techniques for manipulated faces. These methods can be classified into two types. The first utilizes visual cues left by imperfections of face manipulation methods, e.g., the frequency of eye blinking [9], abnormal head poses [10], and other visual artifacts [11]. However, these methods are fragile and can easily become invalid once the manipulation methods are refined to remove those cues. For example, Li et al. [12] put forward a detection method based on face warping artifacts; it achieved a high AUC when tested on UADFV (published with [12]) but dropped significantly when confronted with Celeb-DF [13]. Besides visual cues, Debiasi et al. [14] introduced a detection method for morphed face images based on PRNU [15, 16, 17].

The other category of methods resorts to deep learning. Afchar et al. [18] presented two shallow networks that focus on the mesoscopic properties of images. Zhou et al. [19] proposed a method using a two-stream GoogLeNet InceptionV3 model [20] and achieved state-of-the-art performance. Hsu et al. [21] adopted a similar solution and used a two-stream DenseNet with a contrastive loss. A triplet loss [23] integrated with a three-stream Xception [22] also reached state-of-the-art results, as presented by Feng et al. [24]. Rössler et al. [25] likewise reported high-performance forensics results using Xception [22]. The robustness of the Xception network [22] makes it a strong candidate backbone in other methods, e.g., the approaches of Dang et al. [26] and Tolosana et al. [27]. With the rapid development of face manipulation, videos have also become manipulated content. Güera et al. [28] used a recurrent neural network (RNN) in conjunction with a convolutional neural network (CNN) to detect videos containing manipulated faces. While [28] used a private dataset, Sabir et al. [29] trained and evaluated a similar approach on FaceForensics++ [25].

III Our Approach

Our approach consists of three steps: data preprocessing, unsupervised training, and follow-up classification. We first preprocess each image into a face-centered image to fully utilize the face area. We then perform unsupervised contrastive training on the pair of transformed images derived from each preprocessed image, using a contrastive loss. This step learns separable features without supervision, which are then used to train a classifier for evaluation (step 3). Figure 1 shows an overview of our method.

III-A Data Preprocessing

Most available face manipulation datasets are stored as videos. Moreover, faces in those videos usually occupy only a small portion of each frame, which makes feature learning harder. Therefore, following [25], we use Dlib [31] to locate the bounding box of the face in each frame and crop the frame using the maximum edge of the bounding box, generating a square image with the face at the center (see Figure 1).
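For concreteness, this preprocessing step could be implemented as in the minimal sketch below, assuming Dlib's frontal face detector and OpenCV. The helper name crop_face, the single-face assumption, and the 299x299 output size (the usual Xception input resolution) are our own choices rather than details from the paper.

# Hypothetical sketch of the preprocessing described above: detect the face with Dlib,
# then crop a square patch whose side equals the longer edge of the bounding box,
# centred on the face, and resize it.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def crop_face(frame_bgr, out_size=299):
    """Return a square, face-centred crop of the frame, or None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)                        # upsample once to catch small faces
    if len(faces) == 0:
        return None
    box = faces[0]
    side = max(box.width(), box.height())            # use the maximum edge of the box
    cx, cy = box.center().x, box.center().y
    x0 = max(cx - side // 2, 0)
    y0 = max(cy - side // 2, 0)
    crop = frame_bgr[y0:y0 + side, x0:x0 + side]     # may be clipped near image borders
    return cv2.resize(crop, (out_size, out_size))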

Fig. 1: Overview of our proposed method. It first preprocesses an image into a face-centred image (Section III-A) which is then transformed into two versions using augmentation. The pair of the transformed images are further fed sequentially through the network consisting of an encoder (Xception as backbone) and projection head (a stack of linear layers) for maximizing agreement. After unsupervised learning, a simple linear classifier is trained on the features to classify images of interest as either “real” or “fake”. Left: unsupervised contrastive learning, right: linear classification.

III-B Unsupervised Contrastive Learning

First, we transform the image of interest into two different versions via two different augmentations (or transformations). Then we encourage the backbone network to learn the features of the image by maximizing the similarity between its two augmented versions.

Transformed versions generation. To synthesize two different versions of each image in the dataset, we process the image of interest with several augmentation methods: random crop, random flip, random color jittering, and grayscale (Figure 2).

Fig. 2: The basic augmentation we used when training the unsupervised learning network. Note that all augmentations are applied randomly.

We denote the image as x and its two augmented views as x_i and x_j, which are fed into the encoder network for unsupervised contrastive learning.
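As an illustration, a paired-view generator along these lines could be written with torchvision transforms as below. The specific crop scale, jitter strengths, and application probabilities are our own assumptions; the paper only names the four augmentation types.

# A minimal sketch of the paired-view generation described above.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(299),                                             # random crop
    transforms.RandomHorizontalFlip(p=0.5),                                        # random flip
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),   # random color jittering
    transforms.RandomGrayscale(p=0.2),                                             # random grayscale
    transforms.ToTensor(),
])

def two_views(pil_image):
    """Apply the stochastic augmentation twice to obtain the views x_i and x_j."""
    return augment(pil_image), augment(pil_image)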

Encoder network. Many advanced convolutional neural networks (CNNs) have proven feasible for detecting manipulated faces in recent research, e.g., InceptionV3 [19], DenseNet [21], VGG16 [26], and Xception [25]. These CNNs are trained in a fully supervised manner. Among them, we employ Xception as our encoder, which has shown the capability of learning contrastive features [24]. We denote the Xception encoder as f(·) and its output (the representation) as h = f(x).

Projection head network. To raise the efficiency of computing the loss function, a stack of linear layers (the projection head, denoted as g(·)) is appended to the Xception network. We thus obtain

z = g(h) = W_2 ReLU(W_1 h),   (1)

where W_1 and W_2 are two linear layers with a ReLU layer placed between them. Notice that g(·) is only used while training the feature extraction network.
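A sketch of the encoder plus projection head of Eq. (1) is given below, using the Xception implementation from the timm library (the paper does not state which implementation is used). The projection output dimension of 128 is our assumption, since the layer widths are not specified here.

# Sketch of the encoder f(.) and projection head g(.) used for contrastive training.
import timm
import torch.nn as nn

class ContrastiveNet(nn.Module):
    def __init__(self, proj_dim=128):
        super().__init__()
        # f(.): Xception with its classification layer removed, so forward() returns h
        # (the model is named "legacy_xception" in newer timm releases)
        self.encoder = timm.create_model("xception", pretrained=False, num_classes=0)
        feat_dim = self.encoder.num_features          # 2048 for Xception
        # g(.): two linear layers with a ReLU in between, used only during contrastive training
        self.projection = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)        # representation later fed to the linear classifier
        z = self.projection(h)     # z = g(h), compared by the contrastive loss
        return h, z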

Loss function. We use the cosine similarity to measure the similarity of two samples:

sim(u, v) = u^T v / (||u|| ||v||).   (2)

Suppose we have N samples in a batch. With the pair augmentation, the mini-batch size becomes 2N. With regard to the loss, we simply apply cross-entropy after softmax regression to the similarities within a mini-batch of 2N samples, following [30]:

l_{i,j} = -log[ exp(sim(z_i, z_j) / τ) / Σ_{k=1}^{2N} 1_{[k ≠ i]} exp(sim(z_i, z_k) / τ) ],   (3)

L = (1 / 2N) Σ_{k=1}^{N} ( l_{2k-1, 2k} + l_{2k, 2k-1} ),   (4)

where τ is the temperature of the contrastive loss.
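The loss can be implemented compactly as sketched below, following the NT-Xent formulation of [30] on which Eqs. (2)-(4) are based. The batching convention (the two views concatenated into one 2N-sample batch) and the masking detail are our own implementation choices.

# Sketch of the contrastive (NT-Xent) loss of Eqs. (2)-(4).
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: (N, d) projections of the two augmented views of the same N images."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x d, unit-norm rows
    sim = z @ z.t() / temperature                         # pairwise cosine similarities / tau
    sim.fill_diagonal_(float("-inf"))                     # exclude k = i from the denominator
    # the positive partner of sample i is i + n (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                  # softmax + cross-entropy over 2N samples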

III-C Classification

For evaluation, a variety of classifiers, such as an SVM or a Bayes classifier, can be trained on the features extracted in the unsupervised contrastive learning step. In this work, we simply choose a linear classification network, whose structure is given in Table I. Note that the output of the encoder is used as input for supervised classifier training, because the output of the projection head carries less effective information than that of the encoder, as evidenced in Section IV-D.

Layer | Classification network
1 | Fully connected layer, neurons = 2048
2 | Fully connected layer, neurons = 4096
3 | Fully connected layer, neurons = 2048
4 | Fully connected layer, neurons = 256
5 | Leaky ReLU, negative slope = 0.4
6 | SoftMax layer
TABLE I: The structure of the classification network.
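A direct PyTorch rendering of Table I might look as follows. The 2048-dimensional input (the encoder output) and the final two-way layer before the SoftMax are our assumptions, since the table lists only the hidden-layer widths.

# Linear classification network following Table I; the final 2-unit layer is added here
# (an assumption) so the SoftMax produces "real"/"fake" probabilities.
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(2048, 2048),                 # layer 1
    nn.Linear(2048, 4096),                 # layer 2
    nn.Linear(4096, 2048),                 # layer 3
    nn.Linear(2048, 256),                  # layer 4
    nn.LeakyReLU(negative_slope=0.4),      # layer 5
    nn.Linear(256, 2),                     # assumed output layer (not listed in Table I)
    nn.Softmax(dim=1),                     # layer 6
)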

IV Experimental Results

In this section, we first describe the datasets and the experimental settings, then present the experimental results and compare the proposed method with state-of-the-art techniques, and finally provide ablation studies.

IV-A Datasets

We use three commonly used datasets for evaluation: FaceForensics++ [25], UADFV [12] and Celeb-DF [13]. Each dataset has its own characteristics. FaceForensics++ contains videos involving faces manipulated by a variety of methods (i.e., DeepFake [32], Face2Face [33], FaceSwap [34] and NeuralTexture [6]). It has a total of 3,700 videos, including 1,000 pristine videos and 2,700 manipulated videos. UADFV is a relatively small yet commonly-used dataset. It consists of 98 videos, including 49 pristine videos and 49 manipulated videos. Celeb-DF is a large face forgery dataset containing faces manipulated with algorithms that are able to circumvent the common artifacts such as temporal flickering frames and color inconsistency present in the other two datasets. It contains 6,529 videos, involving 890 pristine videos and 5,639 manipulated videos.

For each dataset, we first perform the data preprocessing described in Section III-A and extract at most 400 frames from each video; for videos with fewer than 400 frames, we extract all frames. Following Feng et al. [24], we randomly select a portion of these images to form the test set and use the rest as the training set (the resulting split sizes are listed in Table II).
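A per-video extraction loop consistent with this description could look like the sketch below, assuming OpenCV and reusing the hypothetical crop_face helper from the preprocessing sketch in Section III-A.

# Sketch of per-video frame extraction: at most 400 face crops per video,
# or all frames if the video is shorter.
import cv2

def extract_frames(video_path, max_frames=400):
    """Yield up to max_frames face-centred crops from one video."""
    cap = cv2.VideoCapture(video_path)
    count = 0
    while count < max_frames:
        ok, frame = cap.read()
        if not ok:                      # video shorter than max_frames: all frames taken
            break
        face = crop_face(frame)         # see the preprocessing sketch in Section III-A
        if face is not None:
            yield face
            count += 1
    cap.release()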

Datasets | Train (real) | Train (fake) | Test (real) | Test (fake)
FaceForensics++ | 115,556 | 108,935 | 20,393 | 20,473
UADFV | 10,100 | 9,761 | 1,783 | 1,723
Celeb-DF | 172,187 | 165,884 | 30,386 | 29,259
TABLE II: Image numbers in the split sets used in our experiments.

Table II shows the details of the datasets used in our experiments. Notice that UADFV is roughly an order of magnitude smaller than FaceForensics++ and Celeb-DF, which might affect performance in the cross-dataset setting when the network is trained on UADFV.

Fig. 3: Face images after data preprocessing (face locating, cropping and resizing) from three datasets (FaceForensics++, UADFV and Celeb-DF).

Figure 3 shows sample images from the three datasets after preprocessing. We observe that most images from the three datasets share similar visual quality (image resolution, brightness, etc.). However, the fake face images in FaceForensics++ are noticeably distinguishable to the human eye because of color inconsistency and unnatural facial features.

IV-B Experimental Settings

Our framework is implemented in PyTorch on a desktop PC equipped with an Intel Core i9-9820X CPU (3.30 GHz, 48 GB memory) and a GeForce RTX 2080Ti GPU (11 GB memory, CUDA 10.0).

For training the networks, we resort to two-step learning illustrated in Figure 1.

  • For training the unsupervised network, the learning rate follows a step scheduler with step size 6 and a 50% decay rate. The batch size is a key factor and is set to 40 for most experiments unless otherwise specified. The temperature parameter τ is set to 0.5 throughout all experiments.

  • For the classification task, the learning rate is set to 3e-1 with a step scheduler of step size 400 and an 80% decay rate. The batch size is set to 6,000.

We employ SGD as an optimizer for both networks (unsupervised learning and classification learning) and train them for 20 epochs and 5,000 epochs, respectively.
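Under these settings, the optimizers and schedulers could be set up as sketched below, assuming PyTorch's SGD and StepLR. Here model and classifier refer to the earlier sketches, the unsupervised learning rate (not stated above) is left as a placeholder, and the "80% descending rate" is interpreted as multiplying the learning rate by 0.2.

# Sketch of the two training configurations listed above.
import torch

# Unsupervised contrastive training: step size 6, 50% decay, 20 epochs, batch size 40.
lr_unsup = 0.1  # placeholder: the unsupervised learning rate is not stated in the text
opt_unsup = torch.optim.SGD(model.parameters(), lr=lr_unsup)
sched_unsup = torch.optim.lr_scheduler.StepLR(opt_unsup, step_size=6, gamma=0.5)

# Linear classification: lr 3e-1, step size 400, 80% decay (gamma = 0.2), 5,000 epochs.
opt_cls = torch.optim.SGD(classifier.parameters(), lr=3e-1)
sched_cls = torch.optim.lr_scheduler.StepLR(opt_cls, step_size=400, gamma=0.2)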

IV-C Comparisons

In this section, we provide intra-dataset and cross-dataset results, which allow us to compare our method with the state-of-the-art methods.

Methods | Train data | FF++ | UADFV | Celeb-DF
Two-stream [19] | Private | 70.1 | 85.1 | 53.8
Meso4 [18] | Private | 84.7 | 84.3 | 54.8
MesoInception4 [18] | Private | 83.0 | 82.1 | 53.6
VA-MLP [11] | Private | 66.4 | 70.2 | 55.0
VA-LogReg [11] | Private | 78.0 | 54.0 | 55.1
Multi-task [35] | FF | 76.3 | 65.8 | 54.3
Xception [25] | FF++ | 99.7 | 80.4 | 48.2
Capsule [36] | FF++ | 96.6 | 61.3 | 57.5
Xception+Tri. [24] | FF++ | 99.9 | 74.3 | 61.7
Xception [13] | UADFV | - | 96.8 | 52.2
Xception+Reg. [26] | UADFV | - | 98.4 | 57.1
Xception+Tri. [24] | UADFV | 61.3 | 99.9 | 60.0
HeadPose [10] | UADFV | 47.3 | 89.0 | 54.6
FWA [12] | UADFV | 80.1 | 97.4 | 56.9
Xception+Tri. [24] | Celeb-DF | 60.2 | 88.9 | 99.9
DeepfakeUCL (Ours) | FF++ | 93.0 | 67.5 | 56.8
DeepfakeUCL (Ours) | UADFV | 56.2 | 98.9 | 64.8
DeepfakeUCL (Ours) | Celeb-DF | 58.9 | 85.6 | 90.5
TABLE III: AUC (%) on FaceForensics++ (FF++), UADFV and Celeb-DF (see Section III-A). Best results in the intra-dataset setting are underlined, and best results in the cross-dataset setting are in bold.
Fig. 4: Receiver Operating Characteristic (ROC) curves of the models trained and tested on the same and different datasets. Note that F, U, and C in the figure represent FaceForensics++, UADFV, and Celeb-DF, respectively. Each bracketed pair specifies the datasets that the model is trained on and tested on, respectively. For instance, (F, U) indicates that the model is trained on FaceForensics++ and tested on UADFV.

We report the area under the curve (AUC) results of our experiments in Table III. Although our method is unsupervised, it achieves performance fairly close to, and in some cases better than, supervised learning methods. For models trained on FaceForensics++, our method is 3.6% weaker than Capsule and approximately 6.8% weaker than Xception and Xception+Tri. when tested on the same dataset. However, it outperforms Capsule by 6.2% when tested on UADFV and Xception by 8.6% when tested on Celeb-DF. For models trained on UADFV, our method outperforms all other methods when tested on the same dataset except for Xception+Tri., which is 1% stronger. It even achieves the top result when tested on Celeb-DF, reaching 64.8% AUC, which is 4.8% higher than the second-best method, Xception+Tri. It also beats HeadPose, which achieves a mere 47.3% AUC, by 8.9% when tested on FaceForensics++. For training on Celeb-DF, fewer results are available for comparison, but our results are quite close to those of Xception+Tri.: although our method is 9.4% weaker when tested on Celeb-DF itself, it is only 1.3% and 3.3% weaker when tested on FaceForensics++ and UADFV, respectively.

We also plot the ROC curves corresponding to the AUCs of Table III in Figure 4, where the 45-degree diagonal is the baseline achieved by a random classifier; the closer a curve is to this baseline, the less reliable the model. The curves also show that the inter-dataset setting is more challenging than the intra-dataset setting, as the trained models perform better in the intra-dataset setting.
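For reference, AUC and ROC values of this kind can be computed with scikit-learn as sketched below; y_true and y_score stand for the collected test labels (0 = real, 1 = fake) and the classifier's "fake" probabilities, and are placeholders here.

# Sketch of the AUC/ROC evaluation reported in Table III and Figure 4.
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate(y_true, y_score):
    """y_true: ground-truth labels (0 = real, 1 = fake); y_score: predicted 'fake' probabilities."""
    auc = roc_auc_score(y_true, y_score) * 100       # AUC in percent, as in Table III
    fpr, tpr, _ = roc_curve(y_true, y_score)         # ROC curve points, as in Figure 4
    return auc, fpr, tpr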

Fig. 5: The data of the above three plots are based on Table III. From left to right, the networks in each plot are trained on FaceForensics++ (denoted as FF++), UADFV and Celeb-DF, respectively. In each plot, from left to right, the results in each section are tested on FaceForensics++ (also denoted as FF++), UADFV, and Celeb-DF, respectively.
Augmentation | FF++ | UADFV | Celeb-DF
1 (random crop only) | 51.8 | 65.8 | 93.3
2 | 53.8 | 64.8 | 93.2
3 (random crop, flip, color jittering, gray scale) | 58.9 | 85.6 | 90.5
TABLE IV: AUC (%) of models trained on Celeb-DF and tested on FaceForensics++ (FF++), UADFV and Celeb-DF with different data augmentation combinations.

IV-D Ablation Study

In this section, we provide three ablation studies: unsupervised versus supervised contrastive learning, data augmentation choices, and encoder versus projection head. The first study demonstrates whether our unsupervised contrastive learning mechanism is effective. The second tests the effects of different combinations of data augmentation schemes. The third shows that the output of the encoder is better suited than that of the projection head for the downstream classification evaluation.

Unsupervised contrastive learning versus supervised learning. We utilize the Xception network [22] as the backbone of our unsupervised contrastive learning framework, which learns features by comparing two different views of a single image. It is therefore instructive to compare supervised Xception networks with our method.

In Figure 5, we compare three cases that all take Xception as the backbone: the original Xception (supervised), Xception combined with a triplet loss (supervised), and our unsupervised contrastive learning method (DeepfakeUCL). Surprisingly, when trained on UADFV, our method outperforms Xception by 2.1% and 12.6% when tested on UADFV and Celeb-DF, respectively. Although our method is 6.7% and 12.9% weaker than Xception when trained on FaceForensics++ and tested on FaceForensics++ and UADFV, respectively, it rises to 56.8% when tested on Celeb-DF, which is 8.6% higher than the Xception result. Compared with the supervised contrastive method driven by the triplet loss, our method still performs better by 4.8%, reaching 64.8% when trained on UADFV and tested on Celeb-DF. Notice that UADFV is a relatively small dataset while Celeb-DF is about 17 times larger (see Table II); UADFV is therefore considered the more challenging dataset to train on.

Data augmentation choices. To generate two different views of a single image, we utilize several common data augmentation methods. With different combinations of augmentations, we observe noticeably different outcomes; the results are shown in Table IV. In these experiments, we use random cropping as the fundamental augmentation, which does not alter the content of the image but only the position of the content within it. Two other combinations are also tested. Note that all these models are trained on Celeb-DF with the configuration described in Section IV-B.

We observe that, apart from the test results on Celeb-DF itself, the AUC generally increases with the complexity of the data augmentation; in other words, the more augmentations are combined, the more robust the model becomes. When all four augmentation schemes are applied (i.e., Augmentation 3), the result on FaceForensics++ rises sharply from around 52% to about 59%, and the result on UADFV improves dramatically to 85.6%, over 20% higher than using random cropping only. The result on Celeb-DF itself drops by around 3% when all four augmentations are applied, but this drawback is minor compared to the aforementioned improvements.

Encoder versus projection head. We further evaluate the features learned by the projection head by taking its output as input to the linear classifier. From Figure 6, we observe that the classification results using the features output by the encoder are remarkably better than those using the features output by the projection head, suggesting that the encoder learns more useful unsupervised features.

Fig. 6: Encoder versus projection head. Networks are trained on the UADFV dataset.

V Conclusion

We have presented an unsupervised contrastive learning method for deepfake detection. Compared to most existing deepfake detection techniques, which are fully supervised, our method learns separable features in an unsupervised manner. Experiments demonstrate the effectiveness of our method and show that it performs comparably to state-of-the-art deepfake detection techniques in both intra- and inter-dataset settings. We also conducted ablation studies for the proposed method. As future work, it would be interesting to incorporate temporal information within the proposed framework to achieve more robust results.

References

  • [1] Hsu, C.C., Zhuang, Y.X. and Lee, C.Y., 2020. Deep fake image detection based on pairwise learning. Applied Sciences, 10(1), p.370.
  • [2] Dale, K., Sunkavalli, K., Johnson, M.K., Vlasic, D., Matusik, W. and Pfister, H., 2011, December. Video face replacement. In Proceedings of the 2011 SIGGRAPH Asia Conference (pp. 1-10).
  • [3] Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C. and Nießner, M., 2016. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2387-2395).

  • [4] Zhmoginov, A. and Sandler, M., 2016. Inverting face embeddings with convolutional neural networks. arXiv preprint arXiv:1606.04189.
  • [5] Faceapp, https://www.faceapp.com/, (Accessed on 26/11/2020)
  • [6] Thies, J., Zollhöfer, M. and Nießner, M., 2019. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG), 38(4), pp.1-12.
  • [7] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J. and Nießner, M., 2019. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1-11).
  • [8] Shen, W. and Liu, R., 2017. Learning residual images for face attribute manipulation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4030-4038).
  • [9] Li, Y., Chang, M.C. and Lyu, S., 2018, December. In ictu oculi: Exposing ai created fake videos by detecting eye blinking. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS) (pp. 1-7). IEEE.
  • [10] Yang, X., Li, Y. and Lyu, S., 2019, May. Exposing deep fakes using inconsistent head poses. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8261-8265). IEEE.
  • [11] Matern, F., Riess, C. and Stamminger, M., 2019, January. Exploiting visual artifacts to expose deepfakes and face manipulations. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW) (pp. 83-92). IEEE.
  • [12] Li, Y. and Lyu, S., 2019, June. Exposing deepfake videos by detecting face warping artifacts. In 2019 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
  • [13] Li, Y., Yang, X., Sun, P., Qi, H. and Lyu, S., 2020. Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3207-3216).
  • [14] Debiasi, L., Scherhag, U., Rathgeb, C., Uhl, A. and Busch, C., 2018, June. PRNU-based detection of morphed face images. In 2018 International Workshop on Biometrics and Forensics (IWBF) (pp. 1-7). IEEE.
  • [15] Chen, M., Fridrich, J., Goljan, M. and Lukás, J., 2008. Determining image origin and integrity using sensor noise. IEEE Transactions on information forensics and security, 3(1), pp.74-90.
  • [16] Li, C.T. and Li, Y., 2011. Color-decoupled photo response non-uniformity for digital image forensics. IEEE Transactions on Circuits and Systems for Video Technology, 22(2), pp.260-271.
  • [17] Lin, X. and Li, C.T., 2020. PRNU-Based Content Forgery Localization Augmented With Image Segmentation. IEEE Access, 8, pp.222645-222659.
  • [18] Afchar, D., Nozick, V., Yamagishi, J. and Echizen, I., 2018, December. Mesonet: a compact facial video forgery detection network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS) (pp. 1-7). IEEE.
  • [19] Zhou, P., Han, X., Morariu, V.I. and Davis, L.S., 2017, July. Two-stream neural networks for tampered face detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 1831-1839). IEEE.

  • [20] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A., 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).
  • [21] Hsu, C.C., Zhuang, Y.X. and Lee, C.Y., 2020. Deep fake image detection based on pairwise learning. Applied Sciences, 10(1), p.370.
  • [22] Chollet, F., 2017. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1251-1258).
  • [23] Hoffer, E. and Ailon, N., 2015, October. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition (pp. 84-92). Springer, Cham.
  • [24] Feng, D., Lu, X. and Lin, X., 2020, November. Deep Detection for Face Manipulation. In International Conference on Neural Information Processing (pp. 316-323). Springer, Cham.
  • [25] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J. and Nießner, M., 2019. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1-11).
  • [26] Dang, H., Liu, F., Stehouwer, J., Liu, X. and Jain, A.K., 2020. On the detection of digital face manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5781-5790).
  • [27] Tolosana, R., Romero-Tapiador, S., Fierrez, J. and Vera-Rodriguez, R., 2020. DeepFakes Evolution: Analysis of Facial Regions and Fake Detection Performance. arXiv preprint arXiv:2004.07532.
  • [28] Güera, D. and Delp, E.J., 2018, November. Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1-6). IEEE.
  • [29] Sabir, E., Cheng, J., Jaiswal, A., AbdAlmageed, W., Masi, I. and Natarajan, P., 2019. Recurrent convolutional strategies for face manipulation detection in videos. Interfaces (GUI), 3(1).
  • [30] Chen, T., Kornblith, S., Norouzi, M. and Hinton, G., 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, PMLR 119:1597-1607. http://proceedings.mlr.press/v119/chen20j.html
  • [31] King, D.E., 2009. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research, 10, pp.1755-1758.
  • [32] Deepfakes github, https://github.com/deepfakes/faceswap, (Accessed on 1/08/2021)
  • [33] Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C. and Nießner, M., 2016. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2387-2395).
  • [34] Faceswap github, https://github.com/MarekKowalski/FaceSwap/, (Accessed on 1/08/2021)
  • [35] Nguyen, H.H., Fang, F. and Yamagishi, J., 2019. Multi-task learning for detecting and segmenting manipulated facial images and videos. CoRR abs/1906.06876.
  • [36] Nguyen, H.H., Yamagishi, J. and Echizen, I., 2019, May. Capsule-forensics: Using capsule networks to detect forged images and videos. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2307-2311). IEEE.
  • [37] Shao, Z., Zhu, H., Tang, J., Lu, X., Ma, L., 2019. Explicit facial expression transfer via fine-grained semantic representations. arXiv preprint arXiv:1909.02967.