Over the last few years, we have witnessed a rapid escalation of methods for producing increasingly realistic synthetic images [karras2018, sg2, Karras2020ada, song2021scorebased]. The first architectures produced blurry, low-resolution images with a general lack of detail. Recently, giant steps have been made to overcome those issues. This is evident from the recent release of a new GAN architecture by NVIDIA, namely StyleGAN3 [Karras2021], which produces high-quality images that can easily fool the human eye.
On the one hand, the authors of image generators put much effort into generating very realistic pictures. On the other hand, they are aware of the variety of problems that an overly realistic generator can create. Generated images can be used on social media with many malicious intents, from scams to identity theft, and the general public is not ready to face this menace. A recent work [lago2021] shows how difficult it is for humans to tell real and generated faces apart when they have just a few seconds to make the decision. The interviewed people rated StyleGAN2 [sg2] images as real in the of the cases, whereas real images were rated as real only in the of the cases. Similarly, the study conducted in [nightingale2022ai] shows that synthetic faces are perceived as even more trustworthy than real ones by human observers. Notice that these studies do not even consider the more recent StyleGAN3.
Given these premises, detecting whether an image is a natural photograph or has been synthetically generated is becoming a task of utmost importance. For this reason, the multimedia forensics community is developing a series of techniques to solve the synthetic image detection problem. Some methods are based on hand-crafted features fed to classifiers [marra2019gans, barni2020cnn, bonettini2021icpr, mandelli2021forensic]. A wide variety of solutions prefer a purely data-driven approach based on training an end-to-end detector [wang2020cnn, cozzolino2021towards]. However, many of these techniques tend to fail on images that deviate from the characteristics of their training set. Unfortunately, it is nowadays impractical to assume that the characteristics of every synthetically generated image can be perfectly known at training stage, as new image generation techniques are developed continuously. For this reason, the latest research trend is to develop methods that generalize well, detecting images generated with unseen techniques [gragnaniello2021gan, mandelli2021forensic].
In this paper, we tackle the problem of GAN-generated image detection. That is, given an image under analysis, we aim to detect whether it is a real photograph or has been synthetically generated by a GAN. We consider the realistic and challenging scenario in which test images may come from generators that were unknown to the analyst at training time. To solve the GAN-generated image detection problem, we propose an ensemble of CNNs.
The proposed method leverages two main ideas to increase the robustness to unseen generated images. First, the CNNs contributing to the ensemble should be as “orthogonal” as possible. For this purpose, we propose a training strategy that increases the diversity among the different learners. This prevents the CNNs from overfitting the image generators used in training, thus enabling the ensemble to take a better decision on newly generated images. Second, the detection problem is better defined over real images than over synthetic ones. Indeed, it is safer to assume that the analyst can train on a broad set of real photographs that represents the real-image class well. On the contrary, it is hard to assume that the analyst can train on synthetic images generated with all the existing techniques, as these change and get updated too frequently over time. Therefore, we propose a score aggregation strategy that favors decisions towards the real-image class only when all the evidence supports it.
Our experimental campaign is designed to test the proposed training and aggregation strategies on top of a baseline CNN detector based on EfficientNet [tan2019efficientnet]. We show that our technique is able to: (i) better draw the separation line between real and synthetic images by separating the score distributions of the two classes; (ii) accurately detect StyleGAN3 images as GAN-generated, even though they have never been used in training.
2 Proposed method
We can summarize the main objectives of the GAN-generated image detection problem into three primary tasks: (i) generalize well to new GANs unseen during the training phase; (ii) be robust against post-processing operations applied to images; (iii) achieve a missed detection rate (i.e., the fraction of synthetic images detected as real) as low as possible. To improve GAN generalization and robustness to editing operations, we propose a procedure based on orthogonal training of multiple CNNs, all based on the same backbone architecture but each trained on a different training dataset. At the testing stage, we propose a patch selection and aggregation strategy that considerably reduces the missed detection rate.
Fig. 1 sketches the proposed testing pipeline. In a nutshell, given a query image, we classify it as real or synthetically generated by selecting several patches from it and passing them through multiple orthogonal CNNs. Then, we aggregate the patch scores and fuse the CNN results into a single prediction associated with the entire image. We provide more details about the proposed approach in the following.
2.1 Orthogonal Training
To improve the generalization of the proposed method to new GANs, we train multiple CNNs over different training datasets, which are “orthogonal” to one another (with a slight abuse of terminology). For clarity’s sake, we consider two datasets “orthogonal” if at least one of the following conditions is met:
the datasets include images depicting different semantic content (e.g., cats or humans);
the datasets include images that underwent different post-processing (e.g., uncompressed or compressed images);
the datasets include images that underwent different compressions (e.g., different JPEG implementations);
the datasets include images synthesized by different GANs.
The key idea of the proposed training orthogonalization is that every single CNN should capture slightly different traces with respect to the others. An ensemble of many CNNs trained in an orthogonal fashion proves to achieve better results than a single CNN trained over all the available data.
The common backbone used for all the proposed CNNs is the EfficientNet-B4 model [tan2019efficientnet], well known in the computer vision and multimedia forensics communities for the outstanding results achieved in many tasks despite requiring few network parameters [bonettini2021icpr]. Each CNN works at patch level, always considering square RGB patches of pixels as input and providing a single score per patch. Considering an ensemble of $N$ CNNs analyzing $P$ patches per image, we define as $s_{n,p}$ the score estimated by the $n$-th CNN for the $p$-th patch, with $n \in \{1, \dots, N\}$ and $p \in \{1, \dots, P\}$.
To improve the robustness against various post-processing operations, we apply strong data augmentations, as suggested in many state-of-the-art works for synthetic image detection [gragnaniello2021gan, cozzolino2021towards]. The list of possible augmentations emulates common editing operations that can be applied by amateur users when retouching their photographs. Moreover, malicious users could also apply the same operations to hide the traces left by the synthetic generation process. We consider horizontal and vertical flip, random -degree rotation, histogram equalization, random blur, random changes in brightness, contrast, color and saturation, random downscale and upscale, and finally JPEG compression with quality factors randomly selected from to . Each augmentation is applied with probability , except for JPEG compression, which is applied with probability . The parameters are those defined in [Buslaev2020].
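The probabilistic augmentation scheme described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' pipeline (which relies on the library of [Buslaev2020]): the image is a nested list of pixel values, only the two flips are implemented, and the function names and probability values are hypothetical placeholders.

```python
import random


def hflip(img):
    """Mirror each row (horizontal flip)."""
    return [row[::-1] for row in img]


def vflip(img):
    """Reverse the row order (vertical flip)."""
    return img[::-1]


def augment(img, ops, rng):
    """Apply each (operation, probability) pair independently.

    `ops` is a list of (callable, probability) tuples. The probabilities
    are hypothetical values chosen by the caller, not the paper's.
    """
    for op, prob in ops:
        if rng.random() < prob:
            img = op(img)
    return img


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[1, 2, 3], [4, 5, 6]]
    pipeline = [(hflip, 0.5), (vflip, 0.5)]  # assumed probabilities
    print(augment(image, pipeline, rng))
```

Drawing an independent random number per operation means that any subset of the augmentations can be active on a given sample, which is what exposes each CNN to many combinations of editing traces.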
At testing stage, each CNN returns a set of scores associated with the patches extracted from the query image. Real and synthetic patches are associated with negative and positive scores, respectively. We fuse these scores by following the aggregation strategy presented in the next section to obtain the final image score.
2.2 Patch Aggregation Strategy
Given a test image, every orthogonal CNN returns many scores associated with the patches extracted from the image. When fusing the patch scores, we aim at reducing the detection errors on the synthetic images. Thus, the missed detection rate is the most critical parameter to keep as low as possible.
The proposed approach is based on the consideration that, when training a generic GAN detector, it is reasonable to assume that the characteristics of real images are easier to capture than those of synthetic ones. Indeed, anybody can collect a set of original photographs and assign them the label “real”, whereas collecting a sufficiently vast and varied synthetic dataset is more elaborate and requires some expertise. Moreover, contrary to the “real” class, the “synthetic” one is constantly and rapidly evolving, as new methodologies for generating synthetic content emerge continuously, not limited to GANs only [song2021scorebased, Dhariwal2021]. Given these premises, we can reasonably assume that many CNN detectors trained over orthogonal datasets will correctly classify a real query image with high precision, as long as they are accurately trained. We cannot make the same assumption for synthetic query images, as they might be generated by novel unseen GANs.
This motivates the proposed patch aggregation strategy. When a query image passes through a CNN detector and all its extracted patches are classified as real, the CNN classifies the entire image as real. If at least one patch among those extracted from the test image is detected as synthetic, the CNN assigns the entire image to the synthetic-image class. In particular, the CNN score associated with the image is the most confident score achieved among all the patches for the detected class. Since real and synthetic images are associated with negative and positive scores, respectively, the image is assigned the minimum score among the patches if the detected class is real-image. Otherwise, we assign the maximum score. Formally, denoting by $s_{n,p}$ the score of the $p$-th patch returned by the $n$-th CNN, the image score of the $n$-th CNN is defined as

$$
s_n =
\begin{cases}
\max_{p} s_{n,p} & \text{if } \exists\, p : s_{n,p} > 0, \\
\min_{p} s_{n,p} & \text{otherwise.}
\end{cases}
$$

Finally, we weight the $N$ orthogonal CNNs equally to assign the global image score, which is the arithmetic mean of the image scores returned by the networks, i.e., $s = \frac{1}{N} \sum_{n=1}^{N} s_n$.
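The aggregation rule described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released code: the function names are hypothetical, scores are plain floats, and a patch is taken as "synthetic" when its score is strictly positive.

```python
def image_score(patch_scores):
    """Aggregate the patch scores of one CNN into one image score.

    Positive scores mean "synthetic", negative scores mean "real".
    If any patch is detected as synthetic, the maximum score is
    returned; otherwise all patches are real and the minimum score
    is returned.
    """
    if not patch_scores:
        raise ValueError("at least one patch score is required")
    if any(s > 0 for s in patch_scores):
        return max(patch_scores)
    return min(patch_scores)


def ensemble_score(scores_per_cnn):
    """Average the image scores of all CNNs with equal weights.

    `scores_per_cnn` holds one list of patch scores per CNN.
    """
    per_cnn = [image_score(scores) for scores in scores_per_cnn]
    return sum(per_cnn) / len(per_cnn)


if __name__ == "__main__":
    # Two CNNs: the first sees only real patches, the second flags one.
    print(ensemble_score([[-2.0, -1.0], [-0.5, 2.0]]))
```

Because a single positive patch is enough to flip the label of one CNN, the rule trades a few extra false alarms for a lower missed detection rate, in line with the objective stated above.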
3 Experimental analysis
We perform our investigations over the dataset used for the competition recently organized by NVIDIA on StyleGAN3 Synthetic Image Detection [nvidia_challenge] within DARPA’s SemaFor program. The purpose was to simulate an open-world setting in which new unseen GANs (e.g., StyleGAN3 [Karras2021]) should be detected.
The real class of the testing data consisted of images selected from three public datasets: FFHQ [ffhq] (human faces taken from photographs), Metfaces [metfaces] (human faces taken from works of art), and AFHQ2 [afhq] (photographs of animal faces from three domains: cat, dog, and wildlife). The synthetic images to be tested were all generated through the recently released StyleGAN3 network [Karras2021], trained on real images selected from the three datasets above. Every real dataset has two synthetic counterparts, version r and version t, according to the specific StyleGAN3 configuration chosen at generation stage. The images from the Metfaces and AFHQ2 datasets did not undergo post-processing or compression, while a few synthetic images from the FFHQ dataset underwent compression and resizing.
The competition did not pose any limit on the kind of training data to be used for developing the proposed GAN detector, except for removing from the training data the real images belonging to the testing dataset and every synthetic image generated through StyleGAN3. Given these premises, our testing dataset coincides with the testing dataset of the NVIDIA competition. The training data consist of different datasets, purposely built to implement the orthogonal CNN training described in Section 2.1. Every orthogonal dataset is used to train an EfficientNet-B4 working with square RGB patches of size . Following our previous considerations, we build five “orthogonal” datasets:
First dataset. This dataset includes all the real images from FFHQ, Metfaces and AFHQ2 available for training. The synthetic images are selected from the synthetic versions of the three datasets, generated through state-of-the-art models for synthetic image generation (i.e., StyleGAN2 [sg2], StarGAN-v2 [afhq], Taming Transformers [taming_tx], FaceVid2Vid [facev2v] and Score-based models [song2021scorebased]). During training, the images undergo multiple augmentations from the list reported in Section 2.1, JPEG compression included. Then, patch per image is randomly selected and fed to the CNN.
Second dataset. This dataset includes the same real and synthetic images used for the first dataset, with the difference that here we first randomly extract patch per image, then we apply the same augmentations defined for the first dataset. From the point of view of the post-processing applied, the second dataset is orthogonal to the first one, especially regarding JPEG compression. Indeed, by construction, all the patches from the second dataset are aligned to the pixel grid introduced by JPEG compression, while in the first dataset the patches can have any random alignment. As already shown in [mandelli2020wifs], taking care of the JPEG grid alignment is of paramount importance for multimedia forensics tasks. These two datasets allow exploring this issue for the GAN detection problem.
Third dataset. This dataset includes only the real images from AFHQ2 available for training and an equal number of their synthetic versions generated through StyleGAN2 and StarGAN-v2. random patches are extracted per image, then undergo all the augmentations except for JPEG compression. The third dataset focuses on only one semantic category (i.e., animal faces) and on a few GANs, and is entirely orthogonal to the first and second datasets as regards JPEG compression.
Fourth dataset. Ideally, this dataset would include only the real images from Metfaces available for training and an equal number of their synthetic versions generated through StyleGAN2. This would guarantee complete semantic orthogonality with respect to the third dataset. In practice, the training process was unstable due to the very limited number of Metfaces images. To enlarge the dataset, we decided to also include the AFHQ2-related images. We extract random patches per image and apply the augmentations, except for JPEG compression. Like the third dataset, the fourth one is entirely orthogonal to the first and second datasets as regards JPEG compression.
Fifth dataset. This dataset includes only real images from FFHQ available for training and almost synthetic versions of them generated through StyleGAN2, Taming Transformers, FaceVid2Vid and Score-based models. We randomly extract patch per image and pass it through the augmentations, JPEG compression included. The fifth dataset is entirely orthogonal to the third and fourth datasets concerning the semantic content, and partially orthogonal concerning the GANs used. Moreover, it is entirely orthogonal to the first dataset concerning JPEG alignment.
At deployment stage, we extract RGB patches from the query image in different ways according to the CNN to be fed. For the CNN trained over the first dataset, we randomly select patches per image. For the remaining CNNs, we always feed them with around patches per image, aligned with the pixel grid introduced by JPEG compression. This is done to ensure that the potential editing traces in the test patches match the JPEG training augmentations. Indeed, in the first dataset the training patches can be misaligned with the JPEG grid, while the remaining datasets always match the JPEG grid alignment, if JPEG is present.
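The two patch selection modes used at deployment (arbitrary position versus JPEG-grid-aligned position) can be sketched as follows. This is an illustrative sketch only: the function names and the patch size passed by the caller are hypothetical, the image is a nested list of rows, and the grid is assumed to start at the top-left pixel, i.e., the image was not cropped after compression. The only fixed fact is that JPEG operates on 8x8 pixel blocks.

```python
import random

JPEG_BLOCK = 8  # JPEG operates on 8x8 pixel blocks


def random_origin(height, width, size, rng):
    """Top-left corner of a patch placed at an arbitrary position."""
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return top, left


def aligned_origin(height, width, size, rng, block=JPEG_BLOCK):
    """Top-left corner of a patch whose origin lies on the JPEG grid."""
    top = block * rng.randint(0, (height - size) // block)
    left = block * rng.randint(0, (width - size) // block)
    return top, left


def crop(img, top, left, size):
    """Crop a size x size patch from an image stored as a list of rows."""
    return [row[left:left + size] for row in img[top:top + size]]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 100x120 image and 32-pixel patches.
    print(random_origin(100, 120, 32, rng))
    print(aligned_origin(100, 120, 32, rng))
```

Restricting the origin to multiples of 8 keeps the block boundaries of the patch in the same position as in the compressed training patches, which is the matching condition discussed above.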
3.2 Experimental setup
We keep of the training images for the training phase, leaving the remaining for validation. As commonly done in CNN training, we initialize the network weights using those trained on the ImageNet database. Every CNN is trained using cross-entropy loss and the Adam optimizer with default parameters for a maximum of epochs. The learning rate is initialized to and is decreased by a factor if the loss does not decrease for epochs. Training is stopped if the loss does not improve for more than epochs, then the model providing the best validation loss is selected. The experimental code is available at https://github.com/polimi-ispl/GAN-image-detection.
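The interplay between learning-rate reduction, early stopping and best-model selection can be illustrated by replaying a list of validation losses. The sketch below is a simplified stand-in, not the released training code: the function name and all numeric arguments are hypothetical, and the learning rate is reduced every `lr_patience` epochs without improvement, which is a simplification of what plateau schedulers in deep learning frameworks do.

```python
def run_schedule(val_losses, lr, factor, lr_patience, stop_patience):
    """Replay per-epoch validation losses.

    Returns (best_epoch, final_lr, epochs_run), where best_epoch is the
    zero-based epoch with the lowest validation loss, i.e., the model
    that would be selected.
    """
    best_loss = float("inf")
    best_epoch = -1
    since_best = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses):
        epochs_run = epoch + 1
        if loss < best_loss:
            best_loss, best_epoch, since_best = loss, epoch, 0
            continue
        since_best += 1
        # Reduce the learning rate after every `lr_patience` stale epochs.
        if since_best % lr_patience == 0:
            lr *= factor
        # Stop once the loss has been stale for more than `stop_patience`.
        if since_best > stop_patience:
            break
    return best_epoch, lr, epochs_run


if __name__ == "__main__":
    losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.9, 0.99]
    # Hypothetical settings, chosen only to make the example short.
    print(run_schedule(losses, lr=1.0, factor=0.5,
                       lr_patience=2, stop_patience=3))
```

Selecting the checkpoint with the best validation loss, rather than the last one, means the extra stale epochs run before stopping do not affect the final model.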
This section reports the results of the experimental campaign. First, we show the performance of the proposed patch aggregation strategy, then we evaluate the orthogonal CNN training. Finally, we compare our results with the state of the art.
Patch aggregation. To show the effectiveness of the proposed patch aggregation strategy, we compare our approach with standard patch aggregation methodologies. For brevity’s sake, we show the benefits of our strategy only on the results achieved by one single CNN, as the trend is the same for all the considered networks.
Table 1: AUC of every single CNN and of the ensemble. Columns: AFHQ2-r, AFHQ2-t, Metfaces-r, Metfaces-t, FFHQ-r, FFHQ-t, FFHQ-r (res-comp), FFHQ-t (res-comp), Global.
Fig. 2 depicts the distributions of the image scores obtained by aggregating the patch scores returned by the CNN trained on . In particular, Fig. 2(a) reports the results of the proposed method: if at least one patch is detected as synthetic, the image is assigned the “best” score among the synthetic ones. Figs. 2(b)-(c)-(d) show the results obtained by relaxing this strict condition, letting the number of patches required for assigning the label “synthetic” grow to , and patches, respectively. Figs. 2(e)-(f) report the results obtained by selecting the arithmetic mean and the median of the patch scores, respectively. For each scenario, we report the corresponding AUC of the ROC curve and the TPR and FPR achieved in the confusion matrix.
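For reference, the three metrics mentioned above can be computed from image scores and ground-truth labels as in the following sketch. The function names are hypothetical, labels use 1 for synthetic and 0 for real, and the decision threshold of zero follows the sign convention of Section 2.1. The AUC is computed as the fraction of (synthetic, real) pairs ranked correctly, which is equivalent to the area under the ROC curve.

```python
def tpr_fpr(scores, labels, threshold=0.0):
    """TPR and FPR when scores above `threshold` are called synthetic.

    `labels` use 1 for synthetic (positive class) and 0 for real.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg


def auc(scores, labels):
    """AUC as the probability that a synthetic image outscores a real one.

    Ties between a synthetic and a real score count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    scores = [-2.0, -1.0, 0.5, 1.5, -0.5, 2.0]
    labels = [0, 0, 0, 1, 1, 1]
    print(auc(scores, labels), tpr_fpr(scores, labels))
```

The pairwise formulation is quadratic in the number of images, so it suits small examples; for large test sets a rank-based implementation is preferable.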
Our approach achieves the highest AUC and an extremely high detection accuracy for synthetic images at the cost of a few false alarms. As we increase the number of patches that must be detected as synthetic for assigning the final score to the image (see Figs. 2(b)-(c)-(d)), the number of false alarms decreases, but the missed detections increase. The arithmetic mean and the median of the patch scores are far from being competitive with the proposed method.
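The alternative aggregation rules compared in this experiment can be written side by side. The sketch below is an illustration under assumptions: the function name `at_least_k_score` and its parameter `k` are hypothetical, and when fewer than `k` patches are positive the image is treated as real and given the minimum patch score, which is one plausible reading of the relaxed rule. The patch scores in the example are made up.

```python
import statistics


def at_least_k_score(patch_scores, k=1):
    """Call the image synthetic only if at least k patches score > 0.

    With k=1 this is the strict rule of Section 2.2. If the image is
    called synthetic the maximum patch score is returned, otherwise
    the minimum one.
    """
    n_synthetic = sum(1 for s in patch_scores if s > 0)
    if n_synthetic >= k:
        return max(patch_scores)
    return min(patch_scores)


def mean_score(patch_scores):
    """Arithmetic mean of the patch scores."""
    return statistics.mean(patch_scores)


def median_score(patch_scores):
    """Median of the patch scores."""
    return statistics.median(patch_scores)


if __name__ == "__main__":
    # One weakly synthetic patch among three confidently real ones.
    scores = [-1.0, -0.5, 0.2, -2.0]
    print(at_least_k_score(scores, k=1))  # strict rule
    print(at_least_k_score(scores, k=2))  # relaxed rule
    print(mean_score(scores), median_score(scores))
```

In this toy case only the strict rule returns a positive score: a single suspicious patch is diluted by the mean and the median and ignored by the relaxed rule, which mirrors the rise in missed detections reported above.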
Orthogonal CNN training. Table 1 reports the results of every single CNN and of the ensemble. We show the AUC achieved on the global test set, but we also investigate different scenarios in which only the real images of a particular dataset (e.g., FFHQ, Metfaces or AFHQ2) are compared with their synthetic versions generated through StyleGAN3. The considered scenarios are: (i) real AFHQ2 vs. synthetic AFHQ2 generated with the r configuration; (ii) real AFHQ2 vs. synthetic AFHQ2 generated with the t configuration; (iii) real Metfaces vs. synthetic Metfaces, r-versions; (iv) real Metfaces vs. synthetic Metfaces, t-versions; (v) real FFHQ vs. synthetic FFHQ, r-versions, without resizing and compression; (vi) real FFHQ vs. synthetic FFHQ, t-versions, without resizing and compression; (vii) real FFHQ vs. synthetic FFHQ, r-versions, with resizing and compression; (viii) real FFHQ vs. synthetic FFHQ, t-versions, with resizing and compression.
The CNN ensemble often reports the best results. Regarding the single CNNs, the best on average are those trained over the first and second datasets. This was expected, as these CNNs were trained over a larger and more varied amount of data than the last three. However, every orthogonal training dataset carries important contributions related to the specific kind of data it focuses on. For instance, the CNNs trained over the third and fourth datasets achieve extremely high AUCs on the AFHQ2 and Metfaces datasets, respectively. The CNN trained over the fifth dataset achieves an almost perfect AUC over FFHQ images that underwent post-processing operations.
All the CNNs report acceptable results in the global test scenario. Nonetheless, due to their specific training implementation, some CNNs might be more prone to detection errors than others in particular test scenarios, whereas their ensemble always remains robust. In realistic situations where test images come from unknown generative models, the ensemble of multiple CNNs proves to be a valid option for synthetic image detection, paving the way towards robust and generalized solutions.
Comparison with the state of the art. The proposed GAN detector ranked first in the competition organized by NVIDIA, outperforming the results achieved by many expert teams in the field of multimedia forensics. Indeed, our method achieved the highest AUC over the global test set, as well as the best results in all the eight testing scenarios described previously. We refer the interested reader to [nvidia_challenge] for additional details and for a comparison with the state-of-the-art results.
In this paper, we proposed a synthetic image detector based on an ensemble of CNNs, which are trained to increase the diversity within the ensemble. Our score aggregation strategy takes into account the fact that some image generators can be unknown at training time. Results show that these ideas help improve the detector accuracy on StyleGAN3 images that have never been used for training.
Despite the promising results, the orthogonality among the trained CNNs is only empirically verified at test time by observing the detector accuracy. Future work will be devoted to a deeper study of CNN diversity from a more theoretical viewpoint. This will enable the development of ad-hoc training strategies.