FDFtNet: Facing Off Fake Images using Fake Detection Fine-tuning Network

by Hyeonseong Jeon, et al.

Creating fake images and videos such as "Deepfake" has become much easier due to advances in Generative Adversarial Networks (GANs). Moreover, recent research such as few-shot learning can create highly realistic personalized fake images from only a few images. As a result, the threat of Deepfakes being used for malicious purposes, such as propagating fake images and videos, has become prevalent, and detecting these machine-generated fake images is more challenging than ever. In this work, we propose a light-weight, robust fine-tuning neural network-based classifier architecture called Fake Detection Fine-tuning Network (FDFtNet), which is capable of detecting many of the new fake face image generation models and can be easily combined with existing image classification networks and fine-tuned on a few datasets. In contrast to many existing methods, our approach aims to reuse popular pre-trained models, with only a few images for fine-tuning, to effectively detect fake images. The core of our approach is an image-based self-attention module called Fine-Tune Transformer that uses only the attention module and the down-sampling layer. This module is added to the pre-trained model and fine-tuned on a small amount of data to search for new sets of feature space to detect fake images. We experiment with our FDFtNet on the GAN-based dataset (Progressive Growing GAN) and the Deepfake-based dataset (Deepfake and Face2Face) with a small input image resolution of 64×64, which complicates detection. Our FDFtNet achieves an overall accuracy of 90.29% in detecting fake images generated from the GAN-based dataset, outperforming the state-of-the-art.




1 Introduction

The emergence of Generative Adversarial Networks (GANs) [goodfellow2014generative], which produce high-quality images through a generator and a discriminator that are trained adversarially and competitively, enables generated outputs to be highly realistic and sophisticated [karras2017progressive, zakharov2019few, karras2019style, wu2019sliced]. However, such high-quality machine-generated images and videos have been abused (e.g., DeepNude [deepnude]) and have harmed the general public (e.g., DeepFake [deepfake]). Furthermore, a recent study using the few-shot learning technique [sun2019meta] in GANs allows deep learning models to produce high-quality outputs with only a small amount of training data. Zakharov et al. [zakharov2019few] demonstrated that models capable of generating highly realistic personalized talking head faces can be constructed using few-shot learning techniques, where the training inputs provide attention to the generator as a compressed form of feature landmarks, extracted through embedding layers. Leveraging this method, DeepFake can easily be generated even with only a small amount of training data. Recently reported incidents [yin_2019] related to DeepFake [deepfake] and DeepNude [deepnude] show that these technologies are an imminent threat to the public.

Figure 1: Overview of our FDFtNet. The FDFtNet modules are shown in yellow and green: (2) the Fine-Tune Transformer is applied to the input image, and (3) MobileNet block V3 is attached to (1) the pre-trained model (backbone network). Details of each block are given in Sec. 3.

Most of the previous approaches have focused on exploiting metadata information or handcrafted characteristics of images to detect fake images. However, these approaches fail to detect GAN-based fake images, because they are created from scratch and metadata can be also forged; handcrafted features are no longer useful for detection. Recent models, such as ShallowNet [tariq2019gan] and FakeTalkerDetect [jeon2019faketalkerdetect], used neural networks to detect GANs-generated fake images, and FaceForensics [rossler2018faceforensics] showed various forgery detection techniques. However, they lack generalization and will thus have difficulties coping with newly developed DeepFake generation techniques.

In this paper, we propose Fake Detection Fine-tuning Network (FDFtNet), a new robust fine-tuning neural network-based architecture for fake image detection. FDFtNet combines the Fine-Tune Transformer (FTT) and MobileNet block V3 (MBblockV3) with a pre-trained Convolutional Neural Network (CNN) as a backbone. Figure 1 shows an overview of our approach, where we utilize well-known, existing CNN architectures [simonyan2014very, szegedy2015going, he2016deep, iandola2014densenet, hu2018squeeze, howard2017mobilenets] for fake image detection. Our FTT is designed to extract different features from images using self-attention, while MBblockV3 extracts features using different convolution and structure techniques. MBblockV3 is added to the pre-trained backbone network after removing its classification layers. We apply data augmentation by implementing the Cutout method to overcome the limitation of using a small fine-tuning dataset and to improve performance. Our approach provides a reusable fine-tuning network that improves existing backbone CNN architectures, which were not designed to detect fake images effectively.

We experiment with three datasets, i.e., the GAN-based dataset (CelebFaces Attributes Dataset (CelebA) [liu2015deep] + Progressive Growing GAN (PGGAN) [karras2017progressive]) and the Deepfake-based dataset (FaceForensics [faceforencis] + Deepfake [deepfake] and Face2Face [thies2016face]), and four baseline models, i.e., SqueezeNet [iandola2016squeezenet], ShallowNetV3 [tariq2019gan], ResNetV2 [he2016identity], and Xception [chollet2017xception]. Our result shows that FDFtNet, combined with Xception, achieves an accuracy of 97.02%, which is higher than that of the state-of-the-art methods. Our main contributions are as follows:

  • We propose FDFtNet, a novel neural network-based fake image detector, showing superior performance on detecting fake images compared to previous approaches by achieving 97.02% accuracy.

  • We provide a robust fine-tuning neural network-based classifier, which requires only a small amount of data for fine-tuning and can be easily integrated with popular CNN architectures.

  • We improve the baseline model accuracy by 4% to 45% through our FTT and MBblockV3, where FTT has not been explored in other image classification and fine-tuning research.

2 Related Work

Traditional image forgery detection methods. Many researchers [farid2009exposing, krawetz2007picture, yang2015estimating, mankar2015image] have investigated various digital forensics algorithms to detect forged images. One way to detect forged images is to analyze them in the frequency domain. However, it is difficult to analyze images with refined smooth edges, which gave rise to different methods such as JPEG Ghost [farid2009exposing]. In JPEG Ghost, the forged part is regularly copied from different real images. The normalized pixel distance of the reproduced image differs from the original image, causing a difference in JPEG quality. However, this method will not work if the original image and the forged image have the same quality level. Another approach is Error Level Analysis (ELA) [krawetz2007picture], which checks the error level of the images. However, with GANs-generated fake images, ELA cannot classify the error level between the real and generated images. Another algorithm, Copy-move Forgery detection [mankar2015image], is a pixel-based approach. First, the dyadic wavelet transform (DWT) is applied to the input image. This transforms the original image into a representation of reduced dimension, i.e., the LL1 sub-band. This LL1 sub-band is then divided into sub-images. To compute the spatial offset between the Copy-move regions, phase correlation is adopted. The Copy-move regions can easily be located by pixel matching, which shifts the input image according to the offset and calculates the difference between the shifted version and the original image. In the final step, Mathematical Morphological Operations (MMO) are used to remove isolated points and improve the localization. Traditional digital forensic tools fail to detect GANs-generated images, because they are generated as a single image. For this reason, these approaches are not effective.

Image forgery detection with neural networks. Various CNN-based models have been used to detect forged images. ShallowNet [tariq2019gan] outperformed previous architectures in distinguishing real images from PGGAN-generated images with a shallow layer architecture. However, this approach showed limitations when detecting other types of DeepFake images. FaceForensics++ [rossler2019faceforensics++] proposed a forgery detection method tailored to facial manipulations and provided an extensive evaluation in a supervised manner. They also generated a large facial manipulation dataset based on computer graphics-based methods, such as Face2Face [thies2016face] and FaceSwap [faceswap]. In addition, they introduced an automatic metric that takes into account the four forms of distortion in realistic scenarios (i.e., random encoding and random dimensions). Using these benchmarks, they analyzed various forgery detection pipelines. However, transfer learning or fine-tuning capabilities were not explored.

Self-attention and Transformer. To capture long-term dependencies in image data, a CNN needs to increase the amount of computation via deeper layers, because a single convolution sees only a neighborhood the size of its kernel. In contrast, self-attention addresses this long-term dependency issue by using softmax outputs computed over the entire sequence, which provide attention to the CNN. Zhang et al. [zhang2018self] used self-attention modules to generate images with GANs. Our FTT is different in that we use only self-attention modules, as in the Transformer, for feature extraction in the classification task. We apply FTT as an image feature extractor and not as a generator. This approach is similar to the Multi-head Attention Module [vaswani2017attention] (query, key, and value), but the difference is that FTT is adapted to images by computing these projections with 1×1 convolutions.
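To make the depth argument concrete, the following minimal sketch counts how many stacked convolutions are needed before one output position can see an entire 64×64 input. It assumes stride-1 convolutions without dilation or pooling, which is a simplification chosen for illustration and not a description of any backbone used in this work; the function names are hypothetical.

```python
import math


def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field along one axis of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def layers_to_cover(image_size: int, kernel_size: int = 3) -> int:
    """Smallest number of stacked stride-1 layers whose receptive field spans the image."""
    return math.ceil((image_size - 1) / (kernel_size - 1))


if __name__ == "__main__":
    depth = layers_to_cover(64, kernel_size=3)
    print("3x3 layers needed to span 64 pixels:", depth)
    print("receptive field at that depth:", receptive_field(depth, kernel_size=3))
```

A self-attention layer, by contrast, relates every pair of positions in a single step, at the cost of an attention map whose size grows with the square of the number of positions.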

3 Fake Detection Fine-tuning Network (FDFtNet)

In this section, we describe the architecture design of our FDFtNet. The main difference from other fake detection methods is that we utilize well-known, reusable pre-trained models and fine-tune the backbone networks with only a few data to improve the fake detection performance. Figure 1 shows an overview of our model, which is composed of 1) a pre-trained model, 2) Fine-Tune Transformer (FTT), and 3) MobileNet V3 blocks (MBblockV3). First, we describe the three datasets that are used in our experiment. Second, we explain why we use pre-trained models. Third, we describe our two modules, Fine-Tune Transformer and MobileNet blocks V3.

Figure 2: Illustration of our datasets. CelebA [liu2015deep] images are used as inputs for PGGAN [karras2017progressive] fake image generation. Images from the FaceForensics [rossler2018faceforensics] dataset are cropped and used as input images for Deepfakes [deepfake] and Face2Face [thies2016face] fake image generation.

3.1 Dataset Description

CelebA. CelebFaces Attributes Dataset (CelebA) [liu2015deep] is a large-scale face attributes dataset with more than 200,000 celebrity images. It is widely used for benchmarking and as inputs for generating training and test datasets for various GAN and VAE approaches. We use CelebA as an input to generate PGGAN fake images.

PGGAN. For the GAN-generated images, we used the Progressive Growing GAN (PGGAN) dataset, consisting of 100,000 GAN-generated fake celebrity images at 1024×1024 resolution, generated using the CelebA dataset. The key idea in PGGAN is to grow both the generator and discriminator progressively. The training starts with both the generator and discriminator at a low resolution. New layers are added as the training process advances, thus increasing the resolution of the generated images.
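The progressive schedule can be illustrated with a short sketch. The starting resolution of 4×4 and the doubling at each stage follow the PGGAN paper [karras2017progressive] and are assumptions here, since the text above only states that training starts at a low resolution; the function name is hypothetical.

```python
from typing import List


def progressive_resolutions(start: int = 4, final: int = 1024) -> List[int]:
    """Resolutions visited when the image side is doubled at every growth stage."""
    resolutions = [start]
    while resolutions[-1] < final:
        resolutions.append(resolutions[-1] * 2)
    return resolutions


if __name__ == "__main__":
    stages = progressive_resolutions()
    print(" -> ".join(f"{r}x{r}" for r in stages))
```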

FaceForensics. FaceForensics [faceforencis] is a video dataset comprised of more than 500,000 frames, containing faces from 1004 videos that can be used to study image or video forgeries. All videos are cut down to short continuous clips that contain mostly frontal faces. We use these datasets as the source for the Deepfakes [deepfake] and Face2Face [thies2016face] fake image generation.

Deepfakes. Deepfakes [deepfake] was the first publicly available method that anyone can download and use to produce fake images and videos. The code is based on two autoencoders with a shared encoder. To produce a forged image, the trained encoder and decoder of the source are applied to the target image face. The output of the autoencoder is then blended with the target image. For our experiment, we used the FaceSwap [faceswap] implementation.

Face2Face. The goal of Face2Face [thies2016face] is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. Face2Face first addresses the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, the model tracks facial expressions of both the source and target videos using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between the source and the target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, Face2Face re-renders the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. Since our goal is to detect fake images, we use each frame from the generated Face2Face output.

3.2 Description of Pre-trained Backbone CNN networks

We use the following pre-trained CNN networks as our backbone networks as shown in Fig. 1, as well as our baselines: SqueezeNet, ShallowNetV3, ResNetV2, and Xception. They are the backbone networks to be used for fine-tuning in our architecture. Therefore, our goal is to compare these baselines with our approaches that use these backbone pre-trained networks.

SqueezeNet. SqueezeNet [iandola2016squeezenet] has an AlexNet-level accuracy with fewer parameters and would generally have poor performance in fake detection tasks, because SqueezeNet is not designed for fake detection. We chose SqueezeNet as the baseline, because our FDFtNet can provide a huge improvement.

ShallowNetV3. ShallowNetV3 [tariq2019gan] has the highest area under the receiver operating characteristic (AUROC) (93.99%) on 64×64 resolution images from the CelebA and PGGAN datasets. However, ShallowNetV3 has burdensome fully-connected (FC) layers for binary classification: the convolution layers have 115,490 parameters, while the FC layers have 4,725,762 parameters. In addition, since this approach has not been tested on deepfakes other than those generated by PGGAN, we aim to investigate its performance on them.

ResNetV2. ResNetV2 has been widely adopted in many image classification tasks. We chose ResNetV2 [he2016identity] as one of the baselines, because ResNetV2 has an opposing characteristic to ShallowNetV3 in terms of the model depth, i.e., ResNetV2 has 50 layers, while ShallowNetV3 has only 8 layers. We believe that these two architectures would show complementary results and we plan to see the effect of our approach on such deep and shallow CNN architectures.

Xception. Xception [chollet2017xception] has served as the baseline for fake image detection in [tariq2019gan, rossler2019faceforensics++]. For FaceForensics++, Xception showed the highest accuracy, i.e., 96.36% on Deepfake and 86.86% on Face2Face, justifying our choice of it as a baseline. In addition, Xception has no FC layers, but extracts various image feature spaces thanks to depthwise separable convolutions, in contrast to the burdensome FC layers in ShallowNetV3. We cut the classification layers from the pre-trained model and add our FTT and MBblockV3 modules.

3.3 Fine-Tune Transformer (FTT)

Fine-Tune Transformer (FTT) consists of several self-attention modules, as shown in Fig. 3, where each attention module computes three feature spaces (query, key, and value) from its input using 1×1 convolution filters. The module is applied repeatedly, starting from the image input. The number of repetitions is a hyper-parameter, and we empirically determined that 3 repetitions yield the highest performance.

Input size | Operation | Num. Parameters | Output dim. | Stride
| 1×1 Conv | 16 | channels / ratio | 1
| 1×1 Conv | 16 | channels / ratio | 1
| 1×1 Conv | 16 | | 1
| Matmul | 0 | | -
| Softmax | 0 | | -
| Batchdot | 0 | | -
| Multiply | 1 | | -
| Add | 0 | | -
Table 1: Specification for the self-attention module. Conv denotes convolution. The input size is that of the previous layer's output, and the output dimension of the first two convolutions is reduced by the bottleneck ratio of the block. The numbers of parameters are computed for one fixed setting of these hyperparameters.
Figure 3: Self-attention module in the Fine-Tune Transformer. The input (the image or the output from the previous layer) is split by 1×1 convolutions into query, key, and value feature spaces. The attention map is the softmax output computed from the query and key. The batch dot product multiplies the value by the attention map. The input is added to this product. The final output is the self-attention feature maps.

In Fig. 3, the output of the previous layer or the input image is projected into three feature spaces: query, key, and value. As shown in Eq. 1, all of them are obtained through 1×1 convolutions, each with its own filter weights. The query and key projections reduce the number of channels by a bottleneck ratio; in this study, we choose the ratio suggested by Zhang et al. [zhang2018self]. In particular, we use dot-product attention to produce the attention map in Fig. 3, relating every pair of spatial locations and normalizing the result with the softmax operation.

After obtaining the attention map, we apply the batch dot operation to multiply the attention map with the value projection, as shown in Eq. 2, and produce the attention output. Finally, the self-attention feature map is obtained by multiplying this output by a scale parameter and adding the input, as shown in Eq. 3. In particular, the scale is a learnable parameter initialized as 0 at the early stage of learning. This is favorable, since the softmax function provides attention equally to all the feature spaces at the early stage of learning.
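Eqs. 1 to 3 can be written as follows. This is a reconstruction in the notation of Zhang et al. [zhang2018self] rather than a verbatim copy: the symbols $f$, $g$, $h$ (the three 1×1 projections), $W_f$, $W_g$, $W_h$ (their filter weights), $\beta$ (attention map), $o$ (attention output), $\gamma$ (learnable scale), and $N$ (number of spatial locations) are assumptions, as is the choice of which of $f$ and $g$ plays the query role.

```latex
\begin{align}
f(x) = W_f x, \quad g(x) = W_g x, \quad h(x) = W_h x, \qquad
\beta_{j,i} &= \frac{\exp(s_{ij})}{\sum_{k=1}^{N} \exp(s_{kj})},
\quad s_{ij} = f(x_i)^{\top} g(x_j) \tag{1} \\
o_j &= \sum_{i=1}^{N} \beta_{j,i}\, h(x_i) \tag{2} \\
y_j &= \gamma\, o_j + x_j \tag{3}
\end{align}
```

With $\gamma = 0$ at initialization, Eq. 3 reduces to $y_j = x_j$, so the module starts as an identity mapping and the attention branch is blended in as $\gamma$ is learned.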

Input size | Operation | Num. Parameters | Output dim. | Stride
| 3×3 DConv | 123 | 32 | 2
| BN | 128 | 32 | -
| ReLU | 0 | 32 | -
| 1st Stage self-attention | 1,321 | 32 | -
| 3×3 DConv | 2,336 | 64 | 2
| BN | 256 | 64 | -
| ReLU | 0 | 64 | -
| 2nd Stage self-attention | 5,201 | 64 | -
| 3×3 DConv | 8,768 | 128 | 2
| BN | 512 | 128 | -
| ReLU | 0 | 128 | -
| 3rd Stage self-attention | 20,641 | 128 | -
| 1×1 Conv | 73,728 | 576 | 1
| BN | 2,304 | 576 | -
| ReLU | 0 | 576 | -
| GAP | 0 | 576 | -
Table 2: Specification for Fine-Tune Transformer (FTT). Conv, BN, DConv, and GAP denote convolution, batch normalization, depth-wise separable convolution, and global average pooling operation, respectively. Each self-attention operation indicates the end of one transformer block. We repeat the block three times to maximize the performance.
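The parameter counts in Table 2 can be reproduced with a few lines of arithmetic. The sketch below rests on assumptions that are not stated in the table itself: separable convolutions and the final 1×1 convolution carry no bias, the three 1×1 convolutions inside the self-attention module do carry a bias, the query and key projections use a bottleneck ratio of 8, and batch normalization stores four values per channel. These conventions were chosen because they match every row; the function names are hypothetical.

```python
def separable_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Depthwise k x k filter per input channel plus a 1x1 pointwise conv, no bias."""
    return k * k * c_in + c_in * c_out


def batch_norm_params(channels: int) -> int:
    """Scale, shift, moving mean, and moving variance for every channel."""
    return 4 * channels


def self_attention_params(channels: int, ratio: int = 8) -> int:
    """Query/key/value 1x1 convs with bias plus one learnable scalar scale."""
    reduced = channels // ratio
    query = channels * reduced + reduced
    key = channels * reduced + reduced
    value = channels * channels + channels
    return query + key + value + 1


if __name__ == "__main__":
    c_in = 3  # RGB input
    for c_out in (32, 64, 128):
        print(f"3x3 DConv {c_in}->{c_out}:", separable_conv_params(c_in, c_out))
        print(f"BN {c_out}:", batch_norm_params(c_out))
        print(f"self-attention {c_out}:", self_attention_params(c_out))
        c_in = c_out
    print("1x1 Conv 128->576:", 128 * 576)
    print("BN 576:", batch_norm_params(576))
```

Under these assumptions the self-attention stages dominate the cost only at the widest stage (20,641 parameters at 128 channels), and the whole FTT remains small compared with the final 1×1 convolution.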

Next, in our FTT, we apply the self-attention module three times with an input size of 64×64×3, as shown in Table 2. The first layer is a 3×3 separable convolution with 32 filters and a stride of 2, followed by Batch Normalization (BN) [ioffe2015batch] and ReLU. The dimension of the output feature map from the self-attention module is 32, 64, and 128, respectively; the width (number of channels) is doubled when the resolution is down-sampled, as shown in Table 2. After that, self-attention is performed three times, followed by a 3×3 separable convolution, BN, and ReLU. The main reason we apply self-attention modules in FTT is to overcome the limitation of CNNs in capturing long-term dependencies, caused by the use of numerous Conv filters with a small size. In contrast, a single use of the FTT is sufficient to capture long-term dependencies, avoiding the construction of deep CNN layers. Also, applying the self-attention module three times allows us to explore and learn diverse deep features of the input images via fine-tuning.
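A single self-attention module of the kind stacked in FTT can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature map is flattened to one channel vector per spatial position, a 1×1 convolution is modeled as the same linear map applied at every position, biases are omitted, and the names `w_q`, `w_k`, `w_v`, and `gamma` are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Apply a 1x1 convolution at one position (a per-position linear map)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix,
                   gamma: float) -> Tuple[Matrix, Matrix]:
    """Return (output features, attention map) for positions x of shape [N][C]."""
    queries = [linear(w_q, p) for p in x]
    keys = [linear(w_k, p) for p in x]
    values = [linear(w_v, p) for p in x]
    # Dot-product attention: one softmax-normalized row per output position.
    attention = [softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
                 for q in queries]
    outputs = []
    for row, p in zip(attention, x):
        mixed = [sum(a * v[c] for a, v in zip(row, values)) for c in range(len(p))]
        # Scale the attention output and add the residual input.
        outputs.append([gamma * m + xc for m, xc in zip(mixed, p)])
    return outputs, attention


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    channels, reduced, positions = 8, 1, 16  # a 4x4 map with 8 channels
    x = random_matrix(positions, channels, rng)
    w_q = random_matrix(reduced, channels, rng)
    w_k = random_matrix(reduced, channels, rng)
    w_v = random_matrix(channels, channels, rng)
    y, att = self_attention(x, w_q, w_k, w_v, gamma=0.0)
    print("identity at gamma = 0:", y == x)
```

Because the attention map has one entry per pair of positions, the down-sampling convolutions that precede each stage in Table 2 also keep this map small.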

Input size | Operation | Num. Parameters | Output dim. | Stride
| 1×1 Conv | 294,912 | 576 | 1
| BN | 2,304 | 576 | -
| h-swish | 0 | 576 | -
| 3×3 DConv | 5,184 | 576 | 1 or 2
| BN | 2,304 | 576 | -
| GAP | 0 | 576 | -
| 1×1 Conv | 82,944 | 144 | 1
| ReLU | 0 | 144 | -
| 1×1 Conv | 82,944 | 576 | 1
| hard-sigmoid | 0 | 576 | -
| Multiply | 0 | 576 | -
| h-swish | 0 | 576 | -
| 1×1 Conv | 73,728 | 128 | 1
| Linear | 0 | 128 | -
| BN | 2,304 | 128 | -
| Add | 0 | 128 | -
Table 3: Specification for MBblockV3; the numbers of parameters are computed for one fixed input size. Conv, BN, DConv, and GAP denote convolution, batch normalization, depth-wise separable convolution, and global average pooling. The input size comprises the height, width, and number of channels. If the stride of the 3×3 DConv is 2, the addition operation is skipped and the spatial height and width are divided by 2. The operations from GAP through Multiply form the Squeeze-and-Excitation block.
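The convolution rows of Table 3 can be checked in the same way. The sketch below assumes bias-free convolutions and infers an input width of 512 channels from the first row; that value is derived here from the parameter count and is not stated in the table. It also treats the 3×3 DConv row as a depthwise filter only, since that is what its count matches. Function and constant names are hypothetical.

```python
def pointwise_conv_params(c_in: int, c_out: int) -> int:
    """1x1 convolution without bias."""
    return c_in * c_out


def depthwise_conv_params(channels: int, k: int = 3) -> int:
    """One k x k filter per channel, without bias or pointwise step."""
    return k * k * channels


EXPANDED = 576  # expanded width (Table 3)
SQUEEZED = 144  # Squeeze-and-Excitation bottleneck width (Table 3)
PROJECTED = 128  # output width after the linear bottleneck (Table 3)
INPUT = 294_912 // EXPANDED  # input width implied by the first row (assumption)

if __name__ == "__main__":
    print("implied input channels:", INPUT)
    print("expand 1x1:", pointwise_conv_params(INPUT, EXPANDED))
    print("depthwise 3x3:", depthwise_conv_params(EXPANDED))
    print("SE reduce 1x1:", pointwise_conv_params(EXPANDED, SQUEEZED))
    print("SE expand 1x1:", pointwise_conv_params(SQUEEZED, EXPANDED))
    print("project 1x1:", pointwise_conv_params(EXPANDED, PROJECTED))
```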

3.4 MobileNet block V3

We chose MobileNet block V3 (MBblockV3) to explore the image feature space through the inverted residual structure and linear bottleneck [sandler2018mobilenetv2]. Depthwise separable convolutions, as in Xception and MobileNetV1 [howard2017mobilenets], are also included in MBblockV3. Overall, MobileNet is an architecture that has already proven its efficiency by using a small number of parameters, drastically increasing computational efficiency. We chose MBblockV3 because it is a suitable module for efficiently extracting a feature space on top of the pre-trained feature space. FTT and MBblockV3 are each applied repeatedly, with separate repetition counts, and each of them is added before the final classification layer. MBblockV3 is applied after the pre-trained model, and its repetition count is a parameter. In our experiment, we use 4 repetitions, determined empirically, yielding the best performance for fine-tuning. In particular, we use the modified h-swish [howard2019searching] and ReLU6 as activation functions. This non-linearity [ramachandran2017swish, elfwing2018sigmoid, hendrycks2016bridging] significantly improves the performance of neural networks and is defined as follows:
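For reference, the standard h-swish of MobileNetV3 [howard2019searching] has the form below. That the modified variant used here keeps exactly this form is an assumption.

```latex
\begin{equation}
\text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6},
\qquad \mathrm{ReLU6}(x) = \min\bigl(\max(x, 0),\, 6\bigr)
\end{equation}
```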


Since clipping the input values at the bottom layers may have a side effect of distorting the data distribution [sheng2018quantization], we apply these activation functions at the top layers to reduce distortion and extract different signals from ReLU. Next, the Squeeze-and-Excitation blocks (SE block) in Squeeze and Excitation networks [hu2018squeeze] are applied in the bottleneck layer. Global information on the image resolution is embedded in the squeeze stage, and information aggregation is used to capture channel dependencies and is re-calibrated through the gated computation (element-wise multiplication), similar to the attention mechanism in the excitation stage. Details of the SE block parameters are summarized in Table 3.
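The activation functions and the Squeeze-and-Excitation gating described above can be sketched as follows. The hard-sigmoid is taken to be ReLU6(x + 3) / 6 as in MobileNetV3, which is an assumption about the variant used here, and the feature map is represented as one flattened list of activations per channel. The weight names `w_reduce` and `w_expand` are hypothetical and stand for the two bias-free 1×1 convolutions of the SE block in Table 3.

```python
from typing import List


def relu6(x: float) -> float:
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def hard_sigmoid(x: float) -> float:
    """Piecewise-linear approximation of the sigmoid (MobileNetV3 form)."""
    return relu6(x + 3.0) / 6.0


def h_swish(x: float) -> float:
    """Piecewise-linear approximation of swish."""
    return x * hard_sigmoid(x)


def squeeze_excite(feature_maps: List[List[float]],
                   w_reduce: List[List[float]],
                   w_expand: List[List[float]]) -> List[List[float]]:
    """Re-weight each channel by a gate computed from global channel statistics."""
    # Squeeze: global average pooling gives one value per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: reduce with ReLU, expand with hard-sigmoid to get gates in [0, 1].
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [hard_sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_expand]
    # Gated computation: element-wise multiplication of every channel by its gate.
    return [[g * v for v in ch] for g, ch in zip(gates, feature_maps)]


if __name__ == "__main__":
    maps = [[2.0, 4.0], [1.0, 3.0]]  # two channels, two positions each
    out = squeeze_excite(maps, w_reduce=[[0.5, 0.5]], w_expand=[[1.0], [-1.0]])
    print(out)
```

Since h-swish clips its input through ReLU6, it is exactly zero for inputs at or below -3 and equal to the identity for inputs at or above 3.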

4 Experimental Results

4.1 Training details

Dataset | Train | Validation | Test | Fine-tune
PGGAN | 128,404 | 32,100 | 37,566 | 1,000 (real), 1,000 (fake)
Deepfake | 60,000 | 18,000 | 20,000 | 1,000 (real), 1,000 (fake)
Face2Face | 60,000 | 18,000 | 20,000 | 1,000 (real), 1,000 (fake)
Table 4: The respective size of the train, validation, test, and fine-tune sets. We use only 1,000 real and fake images, respectively, for fine-tuning.
Figure 4: Example of a Cutout data augmentation. Random regions of the original image (left) are masked out by black rectangles. Every epoch, the rectangular mask changes in form and all images are resized to 64×64 resolution.

All datasets have train, validation, test, and fine-tune sets. The size of each dataset is shown in Table 4. Our model is fine-tuned with only 1,000 samples each for real and fake images, and we use the validation set to check the training strategy. Our FDFtNet is trained with Stochastic Gradient Descent (SGD) with momentum for 300 epochs on all 3 datasets. The learning rate is initialized at 0.3 and annealed using a cosine function. The momentum rate is set to 0.9 and the mini-batch size is set to 128. Early stopping is applied when the validation loss ceases to decrease for 20 epochs. All other weight parameters are initialized based on the research by Glorot and Bengio [glorot2010understanding]. To reenact the most challenging scenario in detecting fake images, all input images are resized to 64×64 resolution.
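The learning-rate schedule and stopping rule described above can be sketched as follows. The text specifies only a cosine function, an initial rate of 0.3, 300 epochs, and a patience of 20 epochs; the exact form used here (half-cosine decay to zero, stepped once per epoch, no warm restarts) is an assumption, and the function names are hypothetical.

```python
import math
from typing import List, Optional


def cosine_lr(epoch: int, total_epochs: int = 300, base_lr: float = 0.3) -> float:
    """Half-cosine decay from base_lr at epoch 0 to 0 at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


def early_stop_epoch(val_losses: List[float], patience: int = 20) -> Optional[int]:
    """Return the epoch index at which training stops, or None if it never does."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


if __name__ == "__main__":
    for epoch in (0, 75, 150, 225, 300):
        print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.4f}")
```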

Data augmentation. Because we use only 1,000 samples for each class, we apply data augmentation. Input images are translated along the width and height within a range of [-2, 2], with nearest padding applied to the empty pixels generated after translation. Zoom and rotation are also applied within a degree range of [-0.2, 0.2]. We also perform random horizontal flipping. These data augmentations are applied to all fine-tune sets. For validation and test sets, only a scaling augmentation is applied to the input image.

Cutout. The Cutout method applies square zero masks at random locations of each input image. Fig. 4 presents an example of a Cutout data augmentation. DeVries et al. [devries2017improved] used random zero masks of 16 pixels for CIFAR-10 (32×32 pixel images), 5 random iteration parameters for cutting, and 16 random size multipliers for the cutting masks. For 64×64 images, we use a 4×4 pixel mask, 3 iterations, and 5 size multipliers for the cutting masks. Since we use random translation, we do not use random center cropping, which was used in the original paper. When we experimented with the original setting, we faced severe underfitting with no convergence of losses. With these low Cutout parameters (3 iterations and 5 size multipliers), we observed higher performance than with the implementation without Cutout, which showed strong overfitting. Because we fine-tune with a small amount of data, we apply this non-aggressive parameter setting.
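A minimal Cutout sketch in plain Python is given below. How the iteration and multiplier parameters interact is not spelled out above, so the interpretation used here is an assumption: each of the 3 iterations places one square zero mask whose side is the 4-pixel base size times a random integer multiplier between 1 and 5, centered at a random pixel and clipped at the image border. The image is a single-channel list of rows for brevity, and the function name is hypothetical.

```python
import random
from typing import List

Image = List[List[float]]


def cutout(image: Image, rng: random.Random, base_size: int = 4,
           iterations: int = 3, max_multiplier: int = 5) -> Image:
    """Return a copy of the image with square regions set to zero."""
    height, width = len(image), len(image[0])
    masked = [row[:] for row in image]  # leave the input untouched
    for _ in range(iterations):
        side = base_size * rng.randint(1, max_multiplier)
        center_y, center_x = rng.randrange(height), rng.randrange(width)
        top, left = center_y - side // 2, center_x - side // 2
        # Clip the mask to the image so masks near the border are partial.
        for y in range(max(0, top), min(height, top + side)):
            for x in range(max(0, left), min(width, left + side)):
                masked[y][x] = 0.0
    return masked


if __name__ == "__main__":
    picture = [[1.0] * 64 for _ in range(64)]
    augmented = cutout(picture, random.Random(0))
    zeros = sum(1 for row in augmented for v in row if v == 0.0)
    print("masked pixels:", zeros, "of", 64 * 64)
```

Drawing a fresh mask every epoch, as in Fig. 4, amounts to calling the function again with the same random generator.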

4.2 Performance evaluation

We present our overall performance results in Table 5, where the same test datasets (see Table 4) are used for evaluating the baselines and our model. In Table 5, the terms baseline and backbone are used interchangeably, where the backbone is the pre-trained CNN network used in our FDFtNet. We use the accuracy (ACC) and area under the receiver operating characteristic (AUROC) as evaluation metrics for our experiments. We experimented with all four baseline models on each dataset with similar training strategies for each dataset. As shown in Table 5, the experimental results show that our FDFtNet has superior detection performance in both ACC and AUROC, compared to all the baselines. In terms of training data size, our model shows high performance using 1,000 images each for real and fake. We will now explain the detailed performance improvement for each dataset.
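For readers who want to reproduce the two metrics, a dependency-free sketch follows. It assumes label 1 denotes a fake image, label 0 a real one, a decision threshold of 0.5 for accuracy, and the pairwise-ranking definition of AUROC with ties counted as one half; none of these conventions is stated in the text, and the function names are hypothetical.

```python
from typing import List


def accuracy(labels: List[int], scores: List[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum(1 for y, s in zip(labels, scores) if (s >= threshold) == (y == 1))
    return correct / len(labels)


def auroc(labels: List[int], scores: List[float]) -> float:
    """Probability that a random fake scores above a random real (ties count half)."""
    fakes = [s for y, s in zip(labels, scores) if y == 1]
    reals = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if f > r else 0.5 if f == r else 0.0 for f in fakes for r in reals)
    return wins / (len(fakes) * len(reals))


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]
    print("ACC:", accuracy(y_true, y_score))
    print("AUROC:", auroc(y_true, y_score))
```

A detector that outputs the same score for every image reaches 0.5 AUROC under this definition, which is the chance level.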

Model | Backbone | PGGAN ACC (%) | PGGAN AUROC (%) | Deepfake ACC (%) | Deepfake AUROC (%) | Face2Face ACC (%) | Face2Face AUROC (%)
SqueezeNet | baseline | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00
FDFtNet (Ours) | SqueezeNet | 88.89 | 92.76 | 92.82 | 97.61 | 87.73 | 94.20
ShallowNetV3 | baseline | 85.73 | 92.90 | 89.77 | 92.81 | 83.35 | 88.49
FDFtNet (Ours) | ShallowNetV3 | 88.03 | 94.53 | 94.29 | 97.83 | 84.55 | 93.28
ResNetV2 | baseline | 84.80 | 88.58 | 81.52 | 89.72 | 58.83 | 62.47
FDFtNet (Ours) | ResNetV2 | 84.83 | 94.05 | 91.03 | 96.08 | 85.15 | 92.91
Xception | baseline | 87.12 | 94.96 | 95.10 | 98.92 | 85.78 | 93.67
FDFtNet (Ours) | Xception | 90.29 | 95.98 | 97.02 | 99.37 | 96.67 | 98.23
Table 5: Overall performance evaluation results. The evaluation metrics used are accuracy (ACC / %) and area under receiver operating characteristic (AUROC / %). Every FDFtNet result improves on the corresponding baseline, and the best detection results on all datasets are obtained by FDFtNet with the Xception backbone.

PGGAN. To yield the best detection performance, we freeze the weight parameters of all layers of the pre-trained models. FTT is used with its parameter set to 3, and MBblockV3 is added with its parameter set to 2; the same data augmentation is applied. Table 5 shows the results of our models compared with the baseline models. Our results show that Xception, among all baseline models, achieved the highest performance (87.12% ACC and 94.96% AUROC). Our model showed a performance of 90.29% ACC and 95.98% AUROC, which is higher than that of ShallowNetV3 with ensemble [tariq2019gan]. ShallowNetV3 is improved from 85.73% ACC and 92.90% AUROC to 88.03% ACC and 94.53% AUROC, respectively, similar to the ensemble version. The SqueezeNet baseline shows the lowest baseline performance, but by applying our model its AUROC is significantly improved from 50.00% to 92.76%, a level similar to that of ShallowNetV3.

Deepfake. Here also, the same data augmentation techniques are applied. The FTT and MBblockV3 parameters and the Cutout iteration parameter are set for this dataset, and the Cutout multiplier parameter is set to 10. The results show that all models achieve significant improvement in performance. Table 5 indicates that Xception has the highest baseline performance of 95.10% ACC and 98.92% AUROC. Using our approach, this baseline model is also improved to 97.02% ACC and 99.37% AUROC. ShallowNetV3 has 89.77% ACC and 92.81% AUROC, which increase to 94.29% ACC and 97.83% AUROC, respectively. ResNetV2 is also improved from 81.52% ACC and 89.72% AUROC to 91.03% ACC and 96.08% AUROC. The SqueezeNet baseline shows the lowest performance, 50.00% ACC and AUROC, but is improved to 92.82% ACC and 97.61% AUROC.

Face2Face. The training strategies for Face2Face are very similar to those of the Deepfake dataset. Data augmentation is also applied. The FTT and MBblockV3 parameters and the two Cutout parameters are also set for this experiment. The interesting point is that the ResNetV2 baseline performed poorly (58.83% ACC and 62.47% AUROC), but significant improvements are made using our methods (85.15% ACC and 92.91% AUROC). Our results demonstrate the generalization ability of our approach, improving even the poorly performing baselines to above 90% AUROC across all models and datasets. Compared to the FaceForensics Benchmark Results [faceforencis], the highest state-of-the-art method is Xception, which shows 96.4% ACC on Deepfake and 86.9% ACC on Face2Face. Our FDFtNet achieves higher performance (97.02% and 96.67%) than the current state-of-the-art method for the same dataset.

5 Ablation study, discussions, and limitations

Method | Backbone | Dataset | ACC | AUROC
With FTT | Xception | Deepfake | 97.02% | 99.37%
Without FTT | Xception | Deepfake | 94.56% | 98.89%
Table 6: Ablation study for Fine-Tune Transformer (FTT). Our model with FTT has 2.46% higher accuracy (ACC) than the one without FTT, increasing the ACC from 94.56% to 97.02%.

In Section 3.3, we explained the reason for using FTT. In this section, we validate each module and technique through an ablation study. In Table 6, we choose the Xception model and the Deepfake dataset to compare our model with and without the FTT, while all other settings remain the same. With FTT, we achieve about 2.5% higher accuracy than without FTT, as shown in Table 6.

Our current work has the following limitations. First, we used both real and fake data for training and fine-tuning, but resources are constrained in practice. In FakeTalkerDetect [jeon2019faketalkerdetect] for fake detection, researchers used Siamese networks for training only on real data. However, in our implementation, few-shot learning and unbalanced learning are major obstacles to achieving high performance. Second, transfer learning is required to improve the performance. We trained each model separately on each dataset, i.e., PGGAN, Deepfake, and Face2Face. For future work, we plan to research the transfer learning ability to further generalize our model.

6 Conclusion

We propose FDFtNet, a robust fine-tuning neural network-based architecture, to detect fake images and significantly improve the baseline CNN architectures. Our model achieves state-of-the-art accuracy in fake image detection on the GAN-based dataset and the Deepfake-based dataset. Our experimental results, obtained with a limited amount of data, show the exploration and exploitation of image feature space beyond the pre-trained models. Our results show that FDFtNet is a promising method for detecting fake images generated by powerful deep learning methods, requiring only a small number of images for re-training. Therefore, FDFtNet can be a viable option even for detecting new fake images in a real-world scenario, where available datasets are extremely limited. Further, we offer open source versions of our work for it to be widely leveraged by the research community.