1 Introduction
The emergence of Generative Adversarial Networks (GANs) [goodfellow2014generative], which produce high-quality images through a generator and a discriminator trained adversarially against each other, has made generated outputs highly realistic and sophisticated [karras2017progressive, zakharov2019few, karras2019style, wu2019sliced]. However, such high-quality machine-generated images and videos have been abused (e.g., DeepNude [deepnude]) and have harmed the general public (e.g., DeepFake [deepfake]). Furthermore, a recent study applying the few-shot learning technique [sun2019meta] to GANs allows deep learning models to produce high-quality outputs with only a small amount of training data. Zakharov et al. [zakharov2019few] demonstrated that models capable of generating highly realistic personalized talking-head faces can be constructed using few-shot learning techniques, where the training inputs provide attention to the generator in the form of compressed facial landmark features extracted through embedding layers. Leveraging this method, DeepFakes can easily be generated even with only a small amount of training data. Recently reported incidents [yin_2019] related to DeepFake [deepfake] and DeepNude [deepnude] show that these technologies are an imminent threat to the public.
Most previous approaches have focused on exploiting metadata information or handcrafted image characteristics to detect fake images. However, these approaches fail to detect GAN-based fake images: such images are created from scratch, their metadata can also be forged, and handcrafted features are no longer useful for detection. Recent models, such as ShallowNet [tariq2019gan] and FakeTalkerDetect [jeon2019faketalkerdetect], used neural networks to detect GAN-generated fake images, and FaceForensics [rossler2018faceforensics] presented various forgery detection techniques. However, they lack generalization and will thus have difficulty coping with newly developed DeepFake generation techniques.
In this paper, we propose Fake Detection Fine-tuning Network (FDFtNet), a new robust fine-tuning neural network-based architecture for fake image detection. FDFtNet combines a Fine-Tune Transformer (FTT) and MobileNet block V3 (MBblockV3) with a pre-trained Convolutional Neural Network (CNN) as a backbone. Figure 1 shows an overview of our approach, where we utilize well-known, existing CNN architectures [simonyan2014very, szegedy2015going, he2016deep, iandola2014densenet, hu2018squeeze, howard2017mobilenets] for fake image detection. FTT is designed to extract different image features using self-attention, while MBblockV3 extracts features using different convolution and structural techniques. MBblockV3 is added to the pre-trained backbone network after its classification layers are removed. We apply data augmentation with the Cutout method to overcome the limitation of using a small fine-tuning dataset and to improve performance. Our approach provides a reusable fine-tuning network that improves existing backbone CNN architectures, which were not designed to detect fake images effectively.
We experiment with three datasets, i.e., a GAN-based dataset (CelebFaces Attributes Dataset (CelebA) [liu2015deep] + Progressive Growing GAN (PGGAN) [karras2017progressive]) and Deepfake-based datasets (FaceForensics [faceforencis] + Deepfake [deepfake] and Face2Face [thies2016face]), and four baseline models, i.e., SqueezeNet [iandola2016squeezenet], ShallowNetV3 [tariq2019gan], ResNetV2 [he2016identity], and Xception [chollet2017xception]. Our results show that FDFtNet, combined with Xception, achieves an accuracy of 97.02%, which is higher than that of the state-of-the-art methods. Our main contributions are as follows:
- We propose FDFtNet, a novel neural network-based fake image detector, which shows superior performance in detecting fake images compared to previous approaches, achieving 97.02% accuracy.
- We provide a robust fine-tuning neural network-based classifier, which requires only a small amount of data for fine-tuning and can be easily integrated with popular CNN architectures.
- We improve the baseline model accuracy by 4% to 45% through our FTT and MBblockV3, where FTT has not been explored in other image classification and fine-tuning research.
2 Related Work
Traditional image forgery detection methods. Many researchers [farid2009exposing, krawetz2007picture, yang2015estimating, mankar2015image]
have investigated various digital forensics algorithms to detect forged images. One way to detect forged images is to analyze them in the frequency domain. However, it is difficult to analyze images with refined smooth edges, thus giving rise to a different method, such as JPEG Ghost
[farid2009exposing]. In JPEG Ghost, the forged part is typically copied from a different real image; the normalized pixel distance of the reproduced image differs from that of the original, causing a difference in JPEG quality. However, this method does not work if the original image and the forged image have the same quality level. Another approach is Error Level Analysis (ELA) [krawetz2007picture], which checks the error level of the images. However, for GAN-generated fake images, ELA cannot distinguish the error levels of real and generated images. Another algorithm, Copy-move Forgery Detection [mankar2015image], is based on a pixel-based approach. First, the dyadic wavelet transform (DWT) is applied to the input image, transforming it into a reduced-dimension representation, i.e., the LL1 sub-band. This LL1 sub-band is then divided into sub-images. To compute the spatial offset between copy-move regions, phase correlation is adopted. The copy-move regions can then be located by pixel matching, which shifts the input image according to the offset and calculates the difference between the shifted version and the original image. In the final step, mathematical morphological operations (MMO) are used to remove isolated points and improve localization. Traditional digital forensic tools fail to detect GAN-generated images because such images are generated from scratch as single images, making these approaches ineffective.

Image forgery detection with neural networks. Various CNN-based models have been used to detect forged images. ShallowNet [tariq2019gan] outperformed previous architectures in detecting real vs. PGGAN images with a shallow-layer architecture. However, this approach showed limitations when detecting other types of DeepFake images. FaceForensics++ [rossler2019faceforensics++] proposed a forgery detection method tailored to facial manipulations and provided an extensive evaluation in a supervised manner. The authors also generated a large facial manipulation dataset based on computer graphics-based methods, such as Face2Face [thies2016face] and FaceSwap [faceswap]
. In addition, they introduced an automatic benchmark that takes into account four forms of distortion in realistic scenarios (e.g., random encoding and random dimensions). Using this benchmark, they analyzed various forgery detection pipelines. However, transfer learning and fine-tuning capabilities were not explored.
Self-attention and Transformer. To achieve long-term dependencies on image data, a CNN needs to increase its computation through deeper layers, because a single convolution operation only sees a region the size of its kernel. In contrast, self-attention addresses this long-term dependency issue by using softmax outputs over the entire sequence to provide attention to the CNN. Zhang et al. [zhang2018self] used self-attention modules to generate images with GANs. Our FTT is different in that we build self-attention modules, as in the Transformer, only for feature extraction in the classification task. We apply FTT as an image feature extractor, not as a generator. This approach is similar to the Multi-head Attention module [vaswani2017attention] (Query, Key, and Value), but the difference is that FTT is suited to images, applying the attention through 1×1 convolutions.
3 Fake Detection Fine-tuning Network (FDFtNet)
In this section, we describe the architecture design of our FDFtNet. The main difference from other fake detection methods is that we utilize well-known, reusable pre-trained models and fine-tune the backbone networks with only a small amount of data to improve fake detection performance. Figure 1 shows an overview of our model, which is composed of 1) a pre-trained model, 2) the Fine-Tune Transformer (FTT), and 3) MobileNet block V3 (MBblockV3). First, we describe the three datasets used in our experiments. Second, we explain why we use pre-trained models. Third, we describe our two modules, the Fine-Tune Transformer and MobileNet block V3.

3.1 Dataset Description
CelebA. CelebFaces Attributes Dataset (CelebA) [liu2015deep] is a large-scale face attributes dataset with more than 200,000 celebrity images. It is widely used for benchmarking and as inputs for generating training and test datasets for various GAN and VAE approaches. We use CelebA as an input to generate PGGAN fake images.
PGGAN. For GAN-generated images, we use the Progressive Growing GAN (PGGAN) dataset, consisting of 100,000 GAN-generated fake celebrity images at 1024×1024 resolution, produced using the CelebA dataset. The key idea in PGGAN is to grow both the generator and discriminator progressively. Training starts with both the generator and discriminator at a low resolution, and new layers are added as training advances, thus increasing the resolution of the generated images.
FaceForensics. FaceForensics [faceforencis] is a video dataset comprising more than 500,000 frames containing faces from 1,004 videos, which can be used to study image or video forgeries. All videos are cut down to short continuous clips that contain mostly frontal faces. We use this dataset as the source for the Deepfakes [deepfake] and Face2Face [thies2016face] fake image generation.
Deepfakes. Deepfakes [deepfake]
was the first publicly available method that anyone can download and use to produce fake images and videos. The code is based on two autoencoders with a shared encoder. To produce a forged image, the trained encoder and decoder of the source are applied to the target face, and the output of the autoencoder is then blended with the target image. For our experiments, we used the FaceSwap [faceswap] implementation.

Face2Face. The goal of Face2Face [thies2016face] is to animate the facial expressions of a target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. Face2Face first addresses the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, the model tracks facial expressions of both the source and target videos using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between the source and the target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, Face2Face re-renders the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. Since our goal is to detect fake images, we use each frame of the generated Face2Face output.
3.2 Description of Pre-trained Backbone CNN networks
We use the following pre-trained CNN networks, shown in Fig. 1, both as our backbone networks and as our baselines: SqueezeNet, ShallowNetV3, ResNetV2, and Xception. These are the backbone networks used for fine-tuning in our architecture; our goal is therefore to compare each baseline with our approach built on the same pre-trained backbone.
SqueezeNet. SqueezeNet [iandola2016squeezenet] achieves AlexNet-level accuracy with far fewer parameters, but would generally perform poorly on fake detection tasks because it is not designed for fake detection. We chose SqueezeNet as a baseline because our FDFtNet can provide a large improvement over it.
ShallowNetV3. ShallowNetV3 [tariq2019gan] has the highest area under the receiver operating characteristic curve (AUROC) (93.99%) on 64×64 resolution images from the CelebA and PGGAN datasets. However, ShallowNetV3 has burdensome fully-connected (FC) layers for binary classification: its convolution layers have 115,490 parameters, while its FC layers have 4,725,762 parameters. In addition, since this approach has not been tested on deepfakes other than those generated by PGGAN, we aim to investigate its performance on them.
ResNetV2. ResNetV2 has been widely adopted in many image classification tasks. We chose ResNetV2 [he2016identity] as one of the baselines because it has the opposite characteristic to ShallowNetV3 in terms of model depth, i.e., ResNetV2 has 50 layers, while ShallowNetV3 has only 8. We believe these two architectures would show complementary results, and we examine the effect of our approach on both deep and shallow CNN architectures.
Xception. Xception [chollet2017xception] has served as a baseline for fake image detection in [tariq2019gan, rossler2019faceforensics++]. For FaceForensics++, Xception showed the highest accuracy, i.e., 96.36% on Deepfake and 86.86% on Face2Face, justifying our choice of it as a baseline. In addition, Xception has no FC layers, yet extracts various image feature spaces thanks to depthwise separable convolutions, in contrast to the burdensome FC layers of ShallowNetV3. We remove the classification layers of each pre-trained model and add our FTT and MBblockV3 modules.
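For illustration, the backbone preparation can be sketched as follows. This is a minimal sketch assuming tf.keras, not our released implementation; the classification head is dropped and the pre-trained weights are frozen so that only the added FTT and MBblockV3 modules are trained.

```python
# Minimal sketch of backbone preparation (assumes tf.keras; not the authors' code).
# Note: Keras' built-in Xception requires inputs of at least 71x71, so the 64x64
# images used in this paper would need upsampling or a custom backbone; the input
# shape below is illustrative only.
import tensorflow as tf

def build_frozen_backbone(input_shape=(96, 96, 3)):
    # include_top=False removes the ImageNet classification layers.
    backbone = tf.keras.applications.Xception(
        include_top=False, weights="imagenet", input_shape=input_shape)
    backbone.trainable = False  # freeze pre-trained weights; only new modules are fine-tuned
    return backbone

backbone = build_frozen_backbone()
print(backbone.output_shape)  # spatial feature map to which MBblockV3 is attached
```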
3.3 Fine-Tune Transformer (FTT)
Fine-Tune Transformer (FTT) consists of several self-attention modules, as shown in Fig. 3, where each attention module computes the feature spaces f(x), g(x), and h(x) using 1×1 convolution filters. We iterate the module M times from the image inputs. M is a hyper-parameter, and we empirically determined that M = 3 yields the highest performance.
Input size | Operation | Num. Parameters | Output dim. | Stride |
---|---|---|---|---|
H×W×C | 1×1 Conv (f) | 16 | C/k | 1 |
H×W×C | 1×1 Conv (g) | 16 | C/k | 1 |
H×W×C | 1×1 Conv (h) | 16 | C | 1 |
- | Matmul | 0 | - | - |
- | Softmax | 0 | - | - |
- | Batchdot | 0 | - | - |
- | Multiply | 1 | - | - |
- | Add | 0 | - | - |
k denotes the bottleneck ratio in the block. The number of parameters is simulated with fixed example values of the input size (H, W, C) and the bottleneck ratio k.
$f(x) = W_f\, x, \qquad g(x) = W_g\, x, \qquad h(x) = W_h\, x$    (1)
In Fig. 3, the input of the previous layers or the input image is divided into three feature spaces f(x), g(x), and h(x). As shown in Eq. 1, all of them are obtained through 1×1 convolutions, where W_f, W_g, and W_h are the respective filter weights of each space. f(x) and g(x) have a channel bottleneck ratio parameter k, where C is the number of channels and the reduced width is C/k. In this study, we choose k = 8, as suggested by Zhang et al. [zhang2018self]. In particular, we use dot-product attention to produce the attention map in Fig. 3, which relates the i-th and j-th locations after the softmax operation, as shown in Eq. 2.
$\beta_{j,i} = \mathrm{softmax}_i\!\left(f(x_i)^{\top} g(x_j)\right), \qquad o_j = \sum_{i}\beta_{j,i}\, h(x_i)$    (2)
$y_i = \gamma\, o_i + x_i$    (3)
After obtaining the attention map β, we multiply it with h(x), as shown in Eq. 2, to produce the output o. Finally, the self-attention feature map y is obtained by multiplying o with γ and adding the input x, as shown in Eq. 3. In particular, γ is a learnable parameter initialized to 0 at the early stage of learning. This is favorable, since the softmax function equally provides attention to all the feature spaces at the early stage of learning.
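The computation in Eqs. 1–3 can be sketched as the following Keras layer; this is an illustrative sketch rather than the exact FTT implementation, with the names f, g, h, and gamma mirroring the equations and the bottleneck ratio k = 8 following Zhang et al. [zhang2018self].

```python
# Minimal Keras sketch of the self-attention module in Eqs. 1-3 (illustrative,
# not the exact FTT implementation).
import tensorflow as tf
from tensorflow.keras import layers

class SelfAttention2D(layers.Layer):
    def __init__(self, bottleneck_ratio=8, **kwargs):
        super().__init__(**kwargs)
        self.k = bottleneck_ratio

    def build(self, input_shape):
        c = int(input_shape[-1])
        self.f = layers.Conv2D(max(c // self.k, 1), 1)  # f(x): query features (Eq. 1)
        self.g = layers.Conv2D(max(c // self.k, 1), 1)  # g(x): key features
        self.h = layers.Conv2D(c, 1)                    # h(x): value features
        # gamma is initialized to 0 so the attention branch is blended in gradually.
        self.gamma = self.add_weight(name="gamma", shape=(),
                                     initializer="zeros", trainable=True)

    def call(self, x):
        shape = tf.shape(x)
        hw = shape[1] * shape[2]
        f = tf.reshape(self.f(x), [shape[0], hw, -1])   # (B, HW, C/k)
        g = tf.reshape(self.g(x), [shape[0], hw, -1])   # (B, HW, C/k)
        h = tf.reshape(self.h(x), [shape[0], hw, -1])   # (B, HW, C)
        # Attention map beta: dot-product similarity over locations, then softmax (Eq. 2).
        beta = tf.nn.softmax(tf.matmul(g, f, transpose_b=True), axis=-1)
        o = tf.reshape(tf.matmul(beta, h), shape)       # weighted sum of h(x) (Eq. 2)
        return self.gamma * o + x                       # residual blend with gamma (Eq. 3)
```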
Input size | Operation | Num. Parameters | Output dim. | Stride |
---|---|---|---|---|
64×64×3 | 3×3 DConv | 123 | 32 | 2 |
32×32×32 | BN | 128 | 32 | - |
32×32×32 | ReLU | 0 | 32 | - |
32×32×32 | 1st Stage self-attention | 1,321 | 32 | - |
32×32×32 | 3×3 DConv | 2,336 | 64 | 2 |
16×16×64 | BN | 256 | 64 | - |
16×16×64 | ReLU | 0 | 64 | - |
16×16×64 | 2nd Stage self-attention | 5,201 | 64 | - |
16×16×64 | 3×3 DConv | 8,768 | 128 | 2 |
8×8×128 | BN | 512 | 128 | - |
8×8×128 | ReLU | 0 | 128 | - |
8×8×128 | 3rd Stage self-attention | 20,641 | 128 | - |
8×8×128 | 1×1 Conv | 73,728 | 576 | 1 |
8×8×576 | BN | 2,304 | 576 | - |
8×8×576 | ReLU | 0 | 576 | - |
8×8×576 | GAP | 0 | 576 | - |
Specification of the Fine-Tune Transformer (FTT). Conv, BN, DConv, and GAP denote convolution, batch normalization, depth-wise separable convolution, and global average pooling, respectively. The “Attention” operation in bold indicates the end of one transformer block. We repeat FTT three times (M = 3) to maximize the performance.

Next, in our FTT, we apply the self-attention module three times (M = 3) with an input size of 64×64×3, as shown in Table 2. The first layer is a 3×3 separable convolution with 32 filters and a stride of 2, followed by Batch Normalization (BN) [ioffe2015batch] and ReLU. The output feature maps of the self-attention modules have 32, 64, and 128 channels, respectively; the width (number of channels) is doubled whenever the resolution is down-sampled, as shown in Table 2. Self-attention is thus performed three times (M = 3), interleaved with 3×3 SeparableConv, BN, and ReLU layers. The main reason we apply self-attention modules in FTT is to overcome the limitation of CNNs in achieving long-term dependencies, caused by the use of numerous small convolution filters. A single use of FTT is sufficient to achieve these long-term dependencies, avoiding the construction of deep CNN layers, while the three-time application of self-attention modules allows us to explore and learn diverse deep features of the input images via fine-tuning.
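The three-stage layout of Table 2 can be expressed compactly as below. This is a sketch under stated assumptions, reusing the SelfAttention2D layer from the previous sketch; the filter counts (32, 64, 128), strides, and the final 576-channel 1×1 projection follow Table 2, but other details may differ from our implementation.

```python
# Sketch of the FTT feature extractor following Table 2 (illustrative only).
# Assumes the SelfAttention2D layer defined in the previous sketch is available.
import tensorflow as tf
from tensorflow.keras import layers

def build_ftt(input_shape=(64, 64, 3), stage_filters=(32, 64, 128)):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in stage_filters:  # three stages, i.e., M = 3
        # Each stage: 3x3 depthwise-separable conv (stride 2) -> BN -> ReLU -> self-attention.
        x = layers.SeparableConv2D(filters, 3, strides=2, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = SelfAttention2D()(x)  # layer from the previous sketch (assumed available here)
    # Final 1x1 projection to 576 channels, then global average pooling (Table 2).
    x = layers.Conv2D(576, 1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.GlobalAveragePooling2D()(x)
    return tf.keras.Model(inputs, x, name="fine_tune_transformer")
```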
Input size | Operation | Num. Parameters | Output dim. | Stride |
---|---|---|---|---|
- | 1×1 Conv | 294,912 | 576 | 1 |
- | BN | 2,304 | 576 | - |
- | h-swish | 0 | 576 | - |
- | 3×3 DConv | 5,184 | 576 | 1 or 2 |
- | BN | 2,304 | 576 | - |
- | GAP | 0 | 576 | - |
- | 1×1 Conv | 82,944 | 144 | 1 |
- | ReLU | 0 | 144 | - |
- | 1×1 Conv | 82,944 | 576 | 1 |
- | hard-sigmoid | 0 | 576 | - |
- | Multiply | 0 | 576 | - |
- | h-swish | 0 | 576 | - |
- | 1×1 Conv | 73,728 | 128 | 1 |
- | Linear | 0 | 128 | - |
- | BN | 2,304 | 128 | - |
- | Add | 0 | 128 | - |
3.4 MobileNet block V3
We chose MobileNet block V3 (MBblockV3) to explore the image feature space through the inverted residual structure and linear bottleneck [sandler2018mobilenetv2]. Depthwise separable convolutions, as in Xception and MobileNetV1 [howard2017mobilenets], are also included in MBblockV3. Overall, MobileNet is an architecture that has already proven its efficiency, using a small number of parameters to drastically increase computational efficiency. We chose MBblockV3 because it is a suitable module for efficiently extracting a feature space on top of the pre-trained feature space. FTT and MBblockV3 are repeatedly applied M and N times, respectively, and each of them is added before the final classification layer. MBblockV3 is repeated N times after the pre-trained model; in our experiments, we use N = 4, determined empirically to yield the best performance for fine-tuning. In particular, we use the modified h-swish [howard2019searching] and ReLU6 as activation functions. This non-linearity [ramachandran2017swish, elfwing2018sigmoid, hendrycks2016bridging] significantly improves the performance of neural networks and is defined as follows:

$\mathrm{h\text{-}swish}(x) = x \cdot \dfrac{\mathrm{ReLU6}(x+3)}{6}$    (4)
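Eq. 4 amounts to a one-line function; a minimal sketch in TensorFlow (the standard MobileNetV3-style form) is:

```python
# h-swish as in Eq. 4: a piecewise-linear approximation of the swish activation.
import tensorflow as tf

def h_swish(x):
    return x * tf.nn.relu6(x + 3.0) / 6.0
```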
Since clipping the input values at the bottom layers may have a side effect of distorting the data distribution [sheng2018quantization], we apply these activation functions at the top layers to reduce distortion and to extract signals different from those of ReLU. Next, the Squeeze-and-Excitation (SE) blocks from Squeeze-and-Excitation Networks [hu2018squeeze] are applied in the bottleneck layer. In the squeeze stage, global spatial information is embedded per channel; in the excitation stage, this aggregated information is used to capture channel dependencies and re-calibrate the feature map through a gating computation (element-wise multiplication), similar to an attention mechanism. Details of the SE block parameters are summarized in Table 3.
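The following sketch illustrates this squeeze-and-excitation gating with the 576 → 144 → 576 channel shape from Table 3 (a reduction ratio of 4). It is an illustrative re-implementation, not our exact code, and uses dense layers in place of the equivalent 1×1 convolutions.

```python
# Squeeze-and-Excitation gating as used inside MBblockV3 (illustrative sketch).
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction=4):
    channels = x.shape[-1]
    # Squeeze: global average pooling embeds global spatial information per channel.
    s = layers.GlobalAveragePooling2D()(x)
    # Excitation: bottleneck (576 -> 144) with ReLU, then expansion with hard-sigmoid gating.
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="hard_sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)
    # Re-calibration: channel-wise (element-wise) multiplication with the input feature map.
    return layers.Multiply()([x, s])

inputs = layers.Input(shape=(8, 8, 576))
model = tf.keras.Model(inputs, se_block(inputs))
```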
4 Experimental Results
4.1 Training details
Dataset | Train | Validation | Test | Fine-tune |
---|---|---|---|---|
PGGAN | 128,404 | 32,100 | 37,566 | 1,000 (real), 1,000 (fake) |
Deepfake | 60,000 | 18,000 | 20,000 | 1,000 (real), 1,000 (fake) |
Face2Face | 60,000 | 18,000 | 20,000 | 1,000 (real), 1,000 (fake) |

Example of a Cutout data augmentation. Random regions of the original image (left) are masked out by black rectangles. The rectangular mask changes in form every epoch, and all images are resized to 64×64 resolution.

All datasets have train, validation, test, and fine-tune sets. The size of each dataset is shown in Table 4. Our model is fine-tuned with only 1,000 real and 1,000 fake samples, and we use the validation set to check the training strategy. Our FDFtNet is trained with Stochastic Gradient Descent (SGD) with momentum for 300 epochs on all three datasets. The learning rate is initialized at 0.3 and annealed using a cosine function. The momentum rate is set to 0.9 and the mini-batch size to 128. Early stopping is applied when the validation loss ceases to decrease for 20 epochs. All other weight parameters are initialized following Glorot and Bengio [glorot2010understanding]. To reenact the most challenging scenario in detecting fake images, all input images are resized to 64×64 resolution.
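These optimization settings can be expressed, for instance, as the following tf.keras configuration; `model`, `train_ds`, and `val_ds` are placeholders for the assembled FDFtNet model and the fine-tune/validation sets, and the exact schedule we used may differ in detail.

```python
# Sketch of the reported training configuration (SGD with momentum 0.9, initial
# learning rate 0.3 with cosine annealing, 300 epochs, batch size 128, early
# stopping after 20 epochs without validation-loss improvement).
import math
import tensorflow as tf

EPOCHS, BATCH_SIZE, INIT_LR = 300, 128, 0.3

def cosine_annealed_lr(epoch):
    # Anneal the learning rate from INIT_LR toward 0 with a cosine schedule.
    return 0.5 * INIT_LR * (1.0 + math.cos(math.pi * epoch / EPOCHS))

callbacks = [
    tf.keras.callbacks.LearningRateScheduler(cosine_annealed_lr),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20),
]

# model.compile(
#     optimizer=tf.keras.optimizers.SGD(learning_rate=INIT_LR, momentum=0.9),
#     loss="binary_crossentropy",
#     metrics=["accuracy", tf.keras.metrics.AUC(name="auroc")])
# model.fit(train_ds, validation_data=val_ds,
#           epochs=EPOCHS, batch_size=BATCH_SIZE, callbacks=callbacks)
```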
Data augmentation. Because we use only 1,000 samples per class, we apply data augmentation. Input images are translated within a width and height range of [-2, 2] pixels, with nearest-neighbor padding on the empty pixels created by translation. Zoom and rotation are also applied within a range of [-0.2, 0.2]. We also perform random horizontal flipping. These augmentations are applied to all fine-tune sets; for the validation and test sets, only a rescaling of pixel values is applied to the input images.

Cutout. The Cutout method applies square zero-masks at random locations of each input image. Fig. 4 presents an example of Cutout data augmentation. DeVries et al. [devries2017improved] used random zero-masks of 16 pixels for CIFAR-10 (32×32 images), 5 random iteration parameters for cutting, and 16 random size multipliers for the cutting masks. For our 64×64 images, we use 4×4-pixel masks, 3 iterations, and a size multiplier of 5. Since we use random translation, we do not use the random center cropping from the original paper. When we used the original setting, we faced severe underfitting with no convergence of the losses; we observed higher performance with this low-parameter Cutout setting (3 iterations and a multiplier of 5) than without Cutout, which showed strong overfitting. Because we fine-tune with a small amount of data, we apply this non-aggressive parameter setting.
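A minimal NumPy sketch of this Cutout variant is shown below. How the 4×4 base mask, the 3 iterations, and the size multiplier of 5 interact is our interpretation for illustration and may differ from the exact implementation.

```python
# Cutout sketch: zero out a few small square regions at random locations.
# The interaction of base mask size, iterations, and size multiplier is an
# interpretation of the description above, not the exact training code.
import numpy as np

def cutout(image, num_holes=3, base_size=4, size_multiplier=5):
    img = image.copy()
    h, w = img.shape[:2]
    for _ in range(num_holes):
        # Pick a random mask edge length up to base_size * size_multiplier pixels.
        length = base_size * np.random.randint(1, size_multiplier + 1)
        cy, cx = np.random.randint(h), np.random.randint(w)
        y1, y2 = max(0, cy - length // 2), min(h, cy + length // 2)
        x1, x2 = max(0, cx - length // 2), min(w, cx + length // 2)
        img[y1:y2, x1:x2, ...] = 0.0
    return img

augmented = cutout(np.random.rand(64, 64, 3))
```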
4.2 Performance evaluation
We present our overall performance results in Table 5, where the same test datasets (see Table 4) are used for evaluating the baselines and our model. In Table 5, the terms baseline and backbone are used interchangeably, where the backbone is the pre-trained CNN network used in our FDFtNet. We use accuracy (ACC) and the area under the receiver operating characteristic curve (AUROC) as evaluation metrics. We experimented with all four baseline models on each dataset using similar training strategies. As shown in Table 5, the experimental results show that our FDFtNet has superior detection performance in both ACC and AUROC compared to all the baselines. In terms of training data size, our model achieves this performance using only 1,000 real and 1,000 fake images. We now explain the detailed performance improvement for each dataset.

Model | Backbone | PGGAN ACC (%) | PGGAN AUROC | Deepfake ACC (%) | Deepfake AUROC | Face2Face ACC (%) | Face2Face AUROC |
---|---|---|---|---|---|---|---|
SqueezeNet | baseline | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 |
FDFtNet (Ours) | SqueezeNet | 88.89 | 92.76 | 92.82 | 97.61 | 87.73 | 94.20 |
ShallowNetV3 | baseline | 85.73 | 92.90 | 89.77 | 92.81 | 83.35 | 88.49 |
FDFtNet (Ours) | ShallowNetV3 | 88.03 | 94.53 | 94.29 | 97.83 | 84.55 | 93.28 |
ResNetV2 | baseline | 84.80 | 88.58 | 81.52 | 89.72 | 58.83 | 62.47 |
FDFtNet (Ours) | ResNetV2 | 84.83 | 94.05 | 91.03 | 96.08 | 85.15 | 92.91 |
Xception | baseline | 87.12 | 94.96 | 95.10 | 98.92 | 85.78 | 93.67 |
FDFtNet (Ours) | Xception | 90.29 | 95.98 | 97.02 | 99.37 | 96.67 | 98.23 |
PGGAN. To yield the best detection performance, we freeze the weight parameters of all layers of the pre-trained models. FTT with M = 3 is used, MBblockV3 with N = 2 is added, and the same data augmentation is applied. Table 5 shows the results of our models compared with the baseline models. Among all baseline models, Xception achieved the highest performance (87.12% ACC and 94.96% AUROC). Our model achieves 90.29% ACC and 95.98% AUROC, which is higher than that of ShallowNetV3 with ensemble [tariq2019gan]. ShallowNetV3 is improved from 85.73% ACC and 92.90% AUROC to 88.03% ACC and 94.53% AUROC, comparable to its ensemble version. The SqueezeNet baseline shows the lowest performance, but is significantly improved to a level similar to ShallowNetV3, from 50.00% to 92.76% AUROC, by applying our model.
Deepfake. Here also, the same data augmentation techniques are applied. The FTT parameter M and the MBblockV3 parameter N are set accordingly, and Cutout is applied with a multiplier parameter of 10. The results show that all models achieve significant improvements in performance. Table 5 indicates that Xception has the highest baseline performance of 95.10% ACC and 98.92% AUROC; using our approach, this baseline is improved to 97.02% ACC and 99.37% AUROC. ShallowNetV3 improves from 89.77% ACC and 92.81% AUROC to 94.29% ACC and 97.83% AUROC. ResNetV2 also improves from 81.52% ACC and 89.72% AUROC to 91.03% ACC and 96.08% AUROC. The SqueezeNet baseline shows the lowest performance, 50.00% ACC and AUROC, but improves to 92.82% ACC and 97.61% AUROC.
Face2Face. The training strategies for Face2Face are very similar to those for the Deepfake dataset. Data augmentation is also applied, and the hyper-parameters M, N, and the Cutout parameters are set accordingly. The interesting point is that the ResNetV2 baseline performs poorly (58.83% ACC and 62.47% AUROC), but significant improvements are made using our method (85.15% ACC and 92.91% AUROC). Our results demonstrate the generalization ability of our approach, improving even poorly performing baselines to above 90% AUROC across all models and datasets. Compared to the FaceForensics Benchmark Results [faceforencis], the best state-of-the-art method is Xception, which achieves 96.4% ACC on Deepfake and 86.9% ACC on Face2Face. Our FDFtNet achieves higher performance (97.02% and 96.67%, respectively) than the current state-of-the-art method on the same datasets.
5 Ablation study, discussions, and limitations
Method | Backbone | Dataset | ACC | AUROC |
---|---|---|---|---|
With FTT | Xception | Deepfake | 97.02% | 99.37% |
Without FTT | Xception | Deepfake | 94.56% | 98.89% |
In Section 3.3, we explained the reason for using FTT. In this section, we validate each module and technique through an ablation study. In Table 6, we choose the Xception model and the Deepfake dataset to compare our model with and without FTT, while all other settings remain the same. With FTT, we achieve about 2.5% higher accuracy than without it, as shown in Table 6. Our current work has the following limitations. First, we used both real and fake data for training and fine-tuning, whereas resources are constrained in practice. FakeTalkerDetect [jeon2019faketalkerdetect] used Siamese networks to train on real data only for fake detection; in our setting, however, few-shot learning and unbalanced learning remain major obstacles to achieving high performance. Second, transfer learning is required to further improve performance: we trained each model on each dataset, i.e., PGGAN, Deepfake, and Face2Face, separately. For future work, we plan to investigate transfer learning to further generalize our model.
6 Conclusion
We propose FDFtNet, a robust fine-tuning neural network-based architecture for detecting fake images that significantly improves baseline CNN architectures. Our model achieves state-of-the-art accuracy in fake image detection on both GAN-based and Deepfake-based datasets. Our experimental results show that, even with a limited amount of data, FDFtNet explores and exploits the image feature space beyond the pre-trained models. These results indicate that FDFtNet is a promising method for detecting fake images generated by powerful deep learning methods, requiring only a small number of images for re-training. Therefore, FDFtNet can be a viable option for detecting new fake images in real-world scenarios, where available datasets are extremely limited. Furthermore, we offer an open-source version of our work so that it can be widely leveraged by the research community.¹
¹ https://anonymous.4open.science/r/FDFtNet/