1 Introduction
The use of diffusion models for the generation of synthetic media is arousing great interest among both researchers and practitioners. Besides the high quality and photorealism of the generated images, it is the opportunity to model a wide variety of subjects and contexts that appears to be particularly interesting. In fact, diffusion models can be guided by textual descriptions or pilot sketches to generate images of a virtually unlimited set of categories, bounded only by our imagination. Therefore, this technology represents a powerful tool for artists, game designers, and any type of creative users. Unfortunately, this includes also malicious users, that may take advantage of this increased flexibility to generate fake media more fit to their disinformation goals. In this paper, we try to assess how prepared we are to face this new threat.
Some recent papers have begun studying the detection DM-based images. In particular, in [10, 11] it was noted that the lack of explicit 3D modeling of objects and surfaces causes asymmetries in shadows and reflected images. Furthermore, global semantic inconsistency can be observed, to some extent, in lighting. These traces can certainly be exploited to identify today’s DM images. However, if the rapid advancement of GAN images can be taken as a paradigm, new DM-based methods can be expected to soon overcome these limitations and generate images that satisfy all necessary semantic constraints, be it lighting, perspective or any other aspect.

Indeed, most state-of-the-art detectors of fake images rely on traces that are invisible to the human eye. Even images that are visually perfect, with no evident semantic inconsistency, can be distinguished from real images based on the traces that are inherent of the generation process. In fact, any method used to create synthetic visual data embeds some peculiar traces in their output images, that are related to the actions taken in the generation process. These traces are different from those typical of modern digital devices enabling fake image detection [30]. Moreover, each generation architecture is characterized by its own peculiar traces, thereby allowing also for source attribution. The presence and distinctiveness of such traces has been proven by extracting a sort of spatial-domain artificial fingerprints [21, 32]
, but also through frequency-domain analyses showing that the upsampling operation performed in most GAN architectures give rise to clear spectral peaks
[34, 12].Based on these concepts, several promising CNN-based detectors of GAN images have been developed. However, the problem is far from being solved. On one hand, new and more sophisticated generation architectures are proposed by the day, some of them, like StyleGAN3 [16]
, aiming explicitly at minimizing the presence of these undesired traces. On the other hand, even state-of-the-art detectors, based on a supervised learning approach, have a hard time generalizing to architectures never seen in training. Moreover, they suffer a significant performance drop when image quality is impaired, as it always happens on social networks, which routinely apply some resizing and compression operations. This is because forensics traces are very weak and can be easily removed even by these non-malicious processing steps.
An interesting experiment on generalization has been recently conducted by NVIDIA within the DARPA SemaFor program. Performers were asked to detect StyleGAN3 images, with the constraint that no images generated with this architecture could be used in training. Despite the inherent difficulty of the task, some encouraging results have been achieved [22]. A more recent contest (IEEE VIP Cup) extended this analysis to images coming also from diffusion models architectures, with encouraging results [4]. Also in [29] it is proposed an initial study on the detection and attribution of diffusion models that looks promising. However, results are presented only in ideal conditions and no robustness analysis is considered.
This work aims at providing some more information on the detection of DM images and, possibly, contributing some useful guidelines for further developments. In particular, we want to answer two fundamental questions: i) are DM images characterized by hidden artifacts similar to those observes in GAN images? ii)
To what extent are current state-of-the-art detectors effective on this type of images? To answer these questions, we generated a large variety of synthetic images using the most recent generators. Then, we carried out an analysis of their artifacts, and finally studied the performance of some deep learning-based detectors on them, not only in ideal conditions but also in more challenging scenarios. where images are compressed and resized.
2 Background
In this section we summarize the most important findings in the literature towards the development of successful and robust deep learning-based detectors for synthetic images. Much of the previous research relates to images generated by GANs as these architectures have so far dominated the field.
A widely agreed fact is the key importance of augmentation, including especially blurring and compression, to ensure robustness. On the same line, training set diversity was found to help generalizing to unseen architectures, as shown in [31] where a simple pre-trained Resnet50 is trained on 20 different categories of ProGAN images. Such observations have been reinforced by later works [13, 20] proving the tight relation between augmentation and training diversity, on one side, and detectors’ reliability, on the other. Working on local patches also appears to be important [2] as well as analyzing both local and global features [14].
Another central discovery concerns the need to avoid any loss of information in the pre-processing of training and test images, as well as in all layers of the neural network, especially those closest to the input. Most of all, it is important to avoid resizing, a common practice in deep learning to adapt images to fixed input layers, as this entails image resampling and interpolation, which may erase the subtle high-frequency traces left by the generation process. To preserve these precious (and invisible) forensics artifacts, several strategies can be considered:
i) training the networks on local patches, cropped from the image with no resizing; ii) making the final decision on the whole image through some fusion strategy; iii) avoid downsampling steps in the first layers of the network [13], as also suggested in related forensics applications.These measures allow to minimize information losses in the precious high-frequency image components (see also Section 3). A more extensive analysis of the performance of synthetic image detectors [13]
shows that pre-training all models on large datasets (e.g., ImageNet) keeps being important, while using residuals in input instead of the original images does not improve performance, and extreme augmentation provides only marginal gains.
ProGAN | BigGAN | StyleGAN2 | Taming Tran. | DALL·E Mini |
![]() |
![]() |
![]() |
![]() |
![]() |
GLIDE | Latent Diffusion | Stable Diffusion | ADM | DALL·E 2 |
![]() |
![]() |
![]() |
![]() |
![]() |
.
3 Artifact analysis
Previous work [21, 32] established the existence of GAN fingerprints and their dependence on both the GAN architecture (number and type of layers) and its specific parameters (filter weights). In particular, in [21], fingerprints are extracted in the spatial domain mimicking the pipeline used to extract the PRNU pattern for device identification [19]. First, the scene content is estimated through a denoising filter , , and removed from the input image to obtain the so-called noise residual, . This latter is assumed to be the sum of a deterministic component, the GAN fingerprint , and a random noise component , so the fingerprint is estimated by simply averaging a large number residuals .
In this work we use the same procedure, using the denoising filter proposed in [33], which proved already successful for camera fingerprint extraction [5]. We average the noise residuals of 1000 images, then take the Fourier transform of the result to carry out a spectral analysis. Fig.2 shows the amplitude of these spectra for several architectures of interest, both GANs [15, 1, 17] or VQ-GANs [9, 6] (top row), and DMs [23, 26, 27, 25] (bottom row). For all GANs, strong peaks are clearly visible in the spectra [34, 31], implying the presence of quasi-periodical patterns, the fingerprints, in the synthetic images. Interestingly, the same happens with some recent diffusion models, such as GLIDE, Latent Diffusion and Stable Diffusion, which suggests good results for fingerprint-based forensic tools also in these cases. On the other hand, such peaks are much weaker for other architectures ADM and DALL·E 2, predicting more controversial results in these cases, as will be confirmed by the experimental analysis on next Section.
4 Detection performance
In this Section we present the results of experiments carried out on images generated by several state-of-the-art generative models including GANs, transformers, and DMs: ProGAN [15], StyleGAN2 [17], StyleGAN3 [16], BigGAN [1], EG3D [3], Taming Transformer [9], DALL·E Mini [6], DALL·E 2 [25], GLIDE [23], Latent Diffusion [26], Stable Diffusion [27] and ADM (Ablated Diffusion Model) [8]
. For text-to-image data we use language prompts from COCO validation and training set. Real data instead come from COCO
[18], ImageNet
[7] and UCID [28].We are especially interested in the ability of detectors to generalize to unseen architectures, relevant for realistic scenarios. Therefore, we train on images generated by a single model, and test on all others. In detail, we consider two cases: a) training only on ProGAN images, using the same setting as in [31] (362K fake images with 20 categories); b) training only on images generated by Latent Diffusion (200K fake images with 5 categories). In both cases, tests are performed on 1000 synthetic images for each model and 5000 real images.
We compare the following detectors: Spec [34], based on frequency analysis, PatchForensics [2], that relies on local patch analysis, Wang2020 [31], a ResNet50 with blurring and compression augmentation, and Grag2021 [13], same backbone as before but avoiding down-sampling in the first layer and intense augmentation. Results are given in terms of area under the receiver-operating curve (AUC) and accuracy at the fixed threshold of 0.5.
Trained on ProGAN | ||||||||
---|---|---|---|---|---|---|---|---|
Uncompressed | Resized and Compressed | |||||||
Acc./AUC% | Spec | PatchFor. | Wang2020 | Grag2021 | Spec | PatchFor. | Wang2020 | Grag2021 |
ProGAN | 83.5/ 99.2 | 64.9/ 97.6 | 99.9/100 | 99.9/100 | 49.7/ 48.5 | 50.4/ 65.3 | 99.7/100 | 99.9/100 |
StyleGAN2 | 65.3/ 72.0 | 50.2/ 88.3 | 74.0/ 97.3 | 98.1/ 99.9 | 51.8/ 50.5 | 50.8/ 73.6 | 54.8/ 85.0 | 63.3/ 94.8 |
StyleGAN3 | 33.8/ 4.4 | 50.0/ 91.8 | 58.3/ 95.1 | 91.2/ 99.5 | 52.9/ 51.9 | 50.2/ 76.7 | 54.3/ 86.4 | 58.3/ 94.4 |
BigGAN | 73.3/ 80.5 | 52.5/ 85.7 | 66.3/ 94.4 | 95.6/ 99.1 | 52.1/ 52.2 | 50.5/ 58.8 | 55.4/ 85.9 | 79.0/ 99.1 |
EG3D | 80.3/ 89.6 | 50.0/ 78.4 | 59.2/ 96.7 | 99.4/100 | 58.9/ 60.6 | 49.8/ 81.9 | 52.1/ 85.1 | 56.8/ 96.6 |
Taming Tran. | 79.6/ 86.6 | 50.5/ 69.4 | 51.2/ 66.5 | 73.5/ 96.6 | 49.0/ 49.1 | 50.0/ 64.1 | 50.5/ 71.0 | 56.2/ 94.3 |
DALL·E Mini | 80.1/ 88.1 | 51.5/ 82.2 | 51.7/ 60.6 | 70.4/ 95.6 | 59.1/ 61.9 | 50.1/ 68.7 | 51.1/ 66.2 | 62.3/ 95.4 |
DALL·E 2 | 82.1/ 93.3 | 50.0/ 52.5 | 50.3/ 85.8 | 51.9/ 94.9 | 62.0/ 65.0 | 49.7/ 58.4 | 50.0/ 44.8 | 50.0/ 64.4 |
GLIDE | 73.4/ 81.9 | 50.3/ 96.6 | 51.1/ 62.6 | 58.6/ 86.4 | 53.1/ 52.5 | 51.0/ 71.5 | 50.3/ 65.9 | 51.8/ 90.0 |
Latent Diff. | 72.1/ 78.5 | 51.8/ 84.3 | 51.0/ 62.5 | 58.2/ 91.5 | 47.9/ 46.3 | 50.6/ 65.2 | 50.7/ 69.1 | 52.4/ 89.4 |
Stable Diff. | 66.8/ 74.7 | 50.8/ 85.0 | 50.9/ 65.9 | 62.1/ 92.9 | 46.5/ 44.5 | 51.1/ 77.2 | 50.7/ 72.9 | 58.1/ 93.7 |
ADM | 55.1/ 53.3 | 50.4/ 87.1 | 50.6/ 56.3 | 51.2/ 57.4 | 49.1/ 49.1 | 51.0/ 69.1 | 50.3/ 68.1 | 50.6/ 77.2 |
AVG | 70.5/ 75.2 | 51.9/ 83.2 | 59.5/ 78.6 | 75.8/ 92.8 | 52.7/ 52.7 | 50.4/ 69.2 | 55.8/ 75.0 | 61.5/ 90.8 |
Generalization and robustness. First of all, we analyze the performance on uncompressed synthetic images in the PNG format, as originated from each model (Table 1, left). This first experiment aims to highlight that in this situation detection can be much easier, because real images, which are always JPEG compressed by a codec embedded in the camera, are characterized by JPEG compression artifacts, while synthetic images do not embed such traces. In fact, the AUC performance is almost perfect on ProGAN (seen in training) and remains pretty good also for other architectures. Even in this favorable case, however, the accuracy is often unsatisfactory, because the threshold chosen in training does not work well on images of different origin [13].
This is a simple scenario, compared to the situation where both synthetic and real images are compressed and resized as routinely happens on social media platforms. To simulate such forms of image laundering and to avoid polarization, we follow the procedure used in the IEEE VIP Cup [4]. For each image of the test, a crop with random (large) size and position is selected, resized to pixels, and compressed using a random JPEG quality factor from 65 to 100. In this challenging condition, Table 1 (right) shows a general reduction of the performance, except for ProGAN (present in training). Again, the performance is acceptable in terms of AUC, but almost random in terms of accuracy. The most difficult diffusion models are DALL·E 2 and ADM, which presented very weak artifacts in our previous analysis.
Then we trained the best performing approach (Grag2021) on images generated using ADM (Table 2, left) and tested on the resized/compressed dataset. First, we observe that an almost perfect detection is achieved not only on ADM but also on Stable Diffusion, coherently with the fact that these architectures share very similar artifacts (Fig.2). The performance on other diffusion models, instead, are not much better than those obtained on GAN generated images. This means that Stable and Latent diffusion models are characterized by different cues compared to ADM and DALL·E 2.
Fusion and calibration. Finally, we carried out a simple experiment where we fuse (simple average) the outputs of the networks trained on both datasets. Results are reported in Table 2. Of course, the performance on GAN generated images improves, while remaining reasonably good on diffusion models, however accuracy is still extremely low, due to the unsuitable fixed threshold. To improve accuracy we try using a calibration procedure (Platt scaling method [24]) under the hypothesis of having just two real images and two synthetic ones for each model. Performance greatly improves, but we still cannot reliably detect images that present artifacts significantly different from those seen during training.
Trained on | Fusion | Calibration | |
---|---|---|---|
Acc./AUC% | Latent Diffusion | ||
ProGAN | 52.0/ 78.3 | 90.2/100 | 97.0/100 |
StyleGAN2 | 58.0/ 85.0 | 56.6/ 94.6 | 86.1/ 94.6 |
StyleGAN3 | 59.5/ 87.6 | 55.4/ 93.9 | 85.5/ 93.9 |
BigGAN | 52.9/ 80.6 | 59.3/ 98.5 | 92.1/ 98.5 |
EG3D | 65.4/ 91.8 | 54.4/ 97.7 | 92.3/ 97.7 |
Taming Tran. | 78.2/ 97.3 | 61.5/ 98.2 | 91.2/ 98.2 |
DALL·E Mini | 73.9/ 97.3 | 65.9/ 97.7 | 88.4/ 97.7 |
DALL·E 2 | 50.4/ 74.2 | 50.0/ 72.5 | 66.9/ 72.5 |
GLIDE | 62.5/ 96.2 | 52.5/ 95.9 | 89.2/ 95.9 |
Latent Diff. | 97.1/ 99.9 | 84.9/ 99.8 | 96.4/ 99.8 |
Stable Diff. | 99.7/100 | 92.5/100 | 97.2/100 |
ADM | 52.9/ 81.9 | 50.8/ 80.6 | 70.8/ 80.6 |
AVG | 66.9/ 89.2 | 64.5/ 94.1 | 87.8/ 94.1 |
5 Conclusion
This work addressed the problem of detecting synthetic images generated by diffusion models. We first tested whether DM images are characterized by distinctive fingerprints just as GAN images are, obtaining a partial confirmation. Then we analyzed the performance of some state-of-the-art detectors under different realistic scenarios. Experimental results vary significantly from model to model, as these are characterized by different forensics cues. Generalization remains the main hurdle, and detectors trained only on GAN images work poorly on these new images. Including a DM model in training can help to detect images generated by similar diffusion models but results can be unsatisfactory for others. Of course, these are only preliminary results and deeper analyses are necessary to address the problem of DM image detection.
6 Acknowledgment
We gratefully acknowledge the support of this research by the Defense Advanced Research Projects Agency (DARPA) under agreement number FA8750-20-2-1004. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.
In addition, this work has received funding by the European Union under the Horizon Europe vera.ai project, Grant Agreement number 101070093, and is supported by Google and by the PREMIER project, funded by the Italian Ministry of Education, University, and Research within the PRIN 2017 program.
References
- [1] (2018) Large Scale GAN Training for High Fidelity Natural Image Synthesis. In ICLR, Cited by: Figure 2, §3, §4.
- [2] (2020) What makes fake images detectable? Understanding properties that generalize. In ECCV, Cited by: §2, §4.
- [3] (2022) Efficient geometry-aware 3d generative adversarial networks. In CVPR, pp. 16123–16133. Cited by: §4.
- [4] (2022) IEEE Video and Image Processing Cup. Note: https://grip-unina.github.io/vipcup2022/ Cited by: §1, §4.
- [5] (2020) Noiseprint: A CNN-based camera model fingerprint. IEEE Transactions on Information Forensics and Security 15 (1), pp. 14–27. Cited by: §3.
- [6] (2021-07) DALL-E Mini. External Links: Document, Link Cited by: Figure 2, §3, §4.
- [7] (2009) ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §4.
- [8] (2021) Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, pp. 8780–8794. Cited by: Figure 2, §4.
- [9] (2021) Taming transformers for high-resolution image synthesis. In CVPR, pp. 12873–12883. Cited by: Figure 2, §3, §4.
- [10] (2022) Lighting (in)consistency of paint by text. arXiv preprint arXiv:2207.13744v2. Cited by: §1.
- [11] (2022) Perspective (in)consistency of paint by text. arXiv preprint arXiv:2206.14617v1. Cited by: §1.
- [12] (2020) Leveraging Frequency Analysis for Deep Fake Image Recognition. In CVPR, Cited by: §1.
- [13] (2021) Are GAN generated images easy to detect? A critical analysis of the state-of-the-art. In IEEE ICME, Cited by: §2, §2, §2, §4, §4.
- [14] (2022) Fusing Global and Local Features for Generalized AI-Synthesized Image Detection. In IEEE ICIP, Cited by: §2.
- [15] (2018) Progressive Growing of GANs for Improved Quality, Stability, and Variation. In ICLR, Cited by: Figure 2, §3, §4.
- [16] (2021) Alias-free generative adversarial networks. NeurIPS 34, pp. 852–863. Cited by: §1, §4.
- [17] (2020) Analyzing and improving the image quality of StyleGAN. In CVPR, pp. 8110–8119. Cited by: Figure 2, §3, §4.
- [18] (2014) Microsoft COCO: Common objects in context. In ECCV, pp. 740–755. Cited by: §4.
- [19] (2006) Digital camera identification from sensor pattern noise. IEEE Transactions on Information Forensics and Security 1 (2), pp. 205–214. Cited by: §3.
- [20] (2022) Detecting GAN-generated Images by Orthogonal Training of Multiple CNNs. In IEEE ICIP, Cited by: §2.
- [21] (2019) Do GANs Leave Artificial Fingerprints?. In IEEE MIPR, pp. 506–511. Cited by: §1, §3.
- [22] (2021) StyleGAN3 Synthetic Image Detection. Note: https://github.com/NVlabs/stylegan3-detector Cited by: §1.
- [23] (2021) GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741. Cited by: Figure 1, Figure 2, §3, §4.
-
[24]
(1999)
Probabilistic outputs for support vector machines and comparison to regularized likelihood methods
. Advances in Large Margin Classiers. Cited by: §4. - [25] (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125v1. Cited by: Figure 1, Figure 2, §3, §4.
- [26] (2022) High-resolution image synthesis with latent diffusion models. In CVPR, pp. 10684–10695. Cited by: Figure 2, §3, §4.
- [27] (2022) Stable diffusion. Note: https://github.com/CompVis/stable-diffusion Cited by: Figure 1, Figure 2, §3, §4.
- [28] (2003) UCID: An uncompressed color image database. In Storage and Retrieval Methods and Applications for Multimedia, Vol. 5307, pp. 472–480. Cited by: §4.
- [29] (2022) DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Diffusion Models. arXiv preprint arXiv:2210.06998. Cited by: §1.
- [30] (2020) Media forensics and deepfakes: an overview. IEEE Journal of Selected Topics in Signal Processing 14 (5), pp. 910–932. Cited by: §1.
- [31] (2020) CNN-generated images are surprisingly easy to spot… for now. In CVPR, Cited by: §2, §3, §4, §4.
- [32] (2019) Attributing Fake Images to GANs: Learning and Analyzing GAN Fingerprints. In ICCV, Cited by: §1, §3.
- [33] (2017) Beyond a gaussian denoiser: residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155. Cited by: §3.
- [34] (2019) Detecting and Simulating Artifacts in GAN Fake Images. In IEEE WIFS, pp. 1–6. Cited by: §1, §3, §4.