Computer-generated graphics (CG) are images generated by computer software such as 3D Max, Maya, and Cinema 4D. In recent years, such software has made it easier to generate photorealistic computer graphics (PRCG), which are quite difficult to distinguish from natural images (NI) with the naked eye. Some examples of computer graphics are shown in Figure 1. Although these rendering suites make it convenient to create images and animation, they could also raise serious security issues for the public if PRCG were used in fields such as justice and journalism. Therefore, as an essential topic in the domain of digital image forensics, distinguishing CG from NI has attracted increasing attention in the past decade.
Several algorithms have recently been proposed to distinguish computer-generated graphics from natural images. Xiaofen Wang et al. present a customized statistical model based on the homomorphic filter and use support vector machines (SVMs) as a classifier to discriminate photorealistic computer graphics (PRCG) from natural images. Zhaohong Li et al. present a multiresolution approach to distinguish CG from NI based on local binary pattern (LBP) features and an SVM classifier. Jinwei Wang et al. present a classification method based on the first four statistical features extracted from the quaternion wavelet transform (QWT) domain. Fei Peng et al. present a method that extracts 24-dimensional features based on multi-fractal and regression analysis for the discrimination of computer-generated graphics and natural images. However, all of these methods depend on handcrafted features extracted from computer-generated graphics and natural images, and all of them rely on an SVM as the classifier.
Deep learning has been used in many new fields and has achieved great success in recent years. Deep neural networks such as the convolutional neural network (CNN) can automatically learn high-dimensional features and reduce their dimensionality efficiently. Some researchers have begun to utilize deep learning to solve problems in the domain of image forensics, such as image manipulation detection, camera model identification [11, 12], steganalysis, image copy-move forgery detection, and so on.
In this paper, we propose a method based on sensor pattern noise and deep learning to distinguish computer-generated graphics (CG) from natural images (NI). The main contributions are summarized as follows: 1) Unlike the traditional methods of distinguishing CG from NI, the proposed approach utilizes a five-layer convolutional neural network (CNN) to classify the input images. 2) Before being fed into the CNN-based model, the images (both CG and NI) are clipped into image patches. 3) Several high-pass filters (HPFs) are used to remove the low-frequency signal, which represents the image content. These filters also enhance the residual signal as well as the sensor pattern noise introduced by the digital camera device. 4) The experimental results show that the proposed method with three high-pass filters can achieve 100% accuracy on full-size images, even when the natural images undergo JPEG compression with a quality factor of 75.
II Related Works
In this paper, we propose a method of distinguishing computer-generated graphics from natural images based on sensor pattern noise and deep learning. There are several studies related to deep learning as well as sensor pattern noise used for forensics.
II-A Methods Based on Deep Learning
Gando et al. presented a deep learning method based on a fine-tuned deep convolutional neural network. This method can automatically distinguish illustrations from photographs and achieves 96.8% accuracy. It outperforms other models, including custom CNN-based models trained from scratch and traditional models using handcrafted features.
Rahmouni et al. presented a custom pooling layer to extract statistical features and a CNN framework to distinguish computer-generated graphics from real photographic images. A weighted voting scheme was used to aggregate the local estimates of class probabilities and predict the label of the whole picture. The best accuracy they report is 93.2%, obtained by their Stats-2L model.
II-B Sensor Pattern Noise Used for Forensics
Different digital cameras introduce different noise into their output images. The main noise sources are imperfections of the CCD or CMOS sensors. This noise is known as sensor pattern noise (SPN) and is used as a fingerprint to characterize an individual camera. In particular, SPN has been used in image forgery detection and source camera identification.
Villalba et al. presented a method for video source acquisition identification based on sensor noise extraction from video key frames. Photo response non-uniformity (PRNU) is the primary component of the sensor pattern noise in an image. In their method, the PRNU is used to calculate the sensor pattern noise and to characterize the fingerprints as feature vectors. The feature vectors are then extracted from the video key frames and used to train an SVM-based classifier.
III Proposed Method
The proposed method consists of two primary steps: image preprocessing and CNN-based model training. In the first step, the input images—including the computer-generated graphics and the natural images—are clipped to image patches, then three types of high-pass filter (HPF) are applied to the image patches. These filtered image patches constitute the positive and negative training samples. In the second step, the filtered image patches are fed to the proposed CNN-based model for training. The proposed CNN-based model is a five-layer CNN. In this section, we introduce these steps of our method in detail.
III-A Image Preprocessing
III-A1 Clipped to Image Patches
The natural images taken by cameras and the computer graphics generated by software often have a large resolution. Due to hardware memory limitations, we need to clip these full-size images into smaller image patches before they are fed into our neural network for training. This is also a data augmentation strategy in deep learning approaches to computer vision. Data augmentation helps to increase the number of training samples and improves the generalization capability of the trained model. Therefore, we propose to clip all of the full-size images into image patches. The resolution of each image patch is 650 × 650. We chose this size as a trade-off between processing time and computational limitations.
Both the computer-generated graphics and the natural images are clipped into image patches. All clipping operations are label-preserving. That is to say, we prepare the positive samples by drawing image patches from the full-size natural images. In a similar way, we get negative samples from the full-size computer-generated graphics. However, natural images taken by cameras usually have a larger resolution than computer-generated graphics. If we want the numbers of negative and positive samples to be approximately equal, we need to clip more image patches from each computer-generated graphic than from each natural image. In light of this, we set the stride for natural images to the width of the image patches (i.e., 650). After analyzing the number of image patches, we set the stride for computer-generated graphics to a smaller value (i.e., 65).
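The clipping scheme can be sketched as a sliding window. This is a minimal illustration rather than the authors' code: the function name `patch_origins` is hypothetical, and it assumes that the stride is applied along both axes and that partial patches at the image border are discarded, neither of which is stated in the text.

```python
def patch_origins(width, height, patch=650, stride=650):
    """Return the top-left (x, y) corner of every full patch inside an image.

    Assumption: the same stride is used horizontally and vertically, and
    windows that would extend past the image border are dropped.
    """
    if width < patch or height < patch:
        return []  # image too small to yield a single full patch
    xs = range(0, width - patch + 1, stride)
    ys = range(0, height - patch + 1, stride)
    return [(x, y) for y in ys for x in xs]


# Natural image: stride equal to the patch width (non-overlapping patches).
ni_patches = patch_origins(3008, 2000, stride=650)
# Computer-generated graphic: smaller stride (heavily overlapping patches).
cg_patches = patch_origins(1280, 650, stride=65)

print(len(ni_patches), "patches from a 3008 x 2000 natural image")
print(len(cg_patches), "patches from a 1280 x 650 computer-generated graphic")
```

Under these assumptions, a smaller stride is what lets a 1280 × 650 graphic contribute a number of patches comparable to that of a much larger photograph.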
III-A2 Filtered with High-Pass Filter
Since natural images and computer-generated graphics are created by different pipelines, there should be distinct differences between them. Sensor pattern noise (SPN) has been used to identify the source camera of a natural image, with excellent performance [11, 12]. However, there is no sensor pattern noise in computer-generated graphics. Based on this idea, we propose our method to discriminate computer-generated graphics from natural images.
Fridrich et al. designed several high-pass filters for the steganalysis of digital images. These filters extract the noise residuals and suppress the low-frequency component, which represents the image content. Qian et al. proposed a customized convolutional neural network for steganalysis. This customized deep learning approach starts with a predefined high-pass filter, which was originally proposed as the SQUARE5×5 noise residual model by Fridrich et al. Furthermore, this noise residual model has been applied to deep learning-based camera model identification as well as to deep learning-based video forgery detection, with strong reported performance.
In this paper, we utilize several high-pass filters in our method to obtain the sensor noise residuals and reduce the impact of the image content. These predefined high-pass filters are convolved with the image patches. Furthermore, in order to reduce the computational complexity, the image patches are first converted to grayscale. The predefined high-pass filters are applied to the grayscale image patches, then the noise residuals of the image patches are piped into the proposed convolutional neural network.
The proposed high-pass filters are shown in Figure 2. Three types of high-pass filter are used in our method. The SQUARE5×5 and SQUARE3×3 filters were proposed as noise residual models by Fridrich et al. The EDGE3×3 filter was designed by us, based on the structures of the other filters proposed there. So that all three filters have the same size, the SQUARE3×3 and EDGE3×3 filters are embedded in a 5 × 5 kernel whose border elements are set to zero.
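As a rough illustration of how such filters suppress image content, the sketch below applies two kernels to a small grayscale patch stored as a list of rows. The coefficients are the SQUARE5×5 and SQUARE3×3 kernels as they are commonly quoted from the rich-model steganalysis literature; that they match Figure 2 exactly, including the 1/12 and 1/4 normalization, is an assumption. The EDGE3×3 filter is omitted because its coefficients are not given in the text, and the names `scaled` and `noise_residual` are hypothetical.

```python
def scaled(kernel, divisor):
    """Divide every kernel element by a normalization constant."""
    return [[value / divisor for value in row] for row in kernel]


# Assumed coefficients (commonly quoted SQUARE5x5 kernel); not taken from Figure 2.
SQUARE5X5 = scaled([[-1,  2,  -2,  2, -1],
                    [ 2, -6,   8, -6,  2],
                    [-2,  8, -12,  8, -2],
                    [ 2, -6,   8, -6,  2],
                    [-1,  2,  -2,  2, -1]], 12)

# Assumed SQUARE3x3 coefficients, embedded in a 5x5 kernel with a zero border.
SQUARE3X3 = scaled([[0,  0,  0,  0, 0],
                    [0, -1,  2, -1, 0],
                    [0,  2, -4,  2, 0],
                    [0, -1,  2, -1, 0],
                    [0,  0,  0,  0, 0]], 4)


def noise_residual(gray, kernel):
    """Slide the kernel over a grayscale image, keeping only fully covered positions.

    Both kernels above are symmetric, so correlation and convolution coincide.
    """
    k = len(kernel)
    rows, cols = len(gray), len(gray[0])
    residual = []
    for r in range(rows - k + 1):
        out_row = []
        for c in range(cols - k + 1):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += kernel[i][j] * gray[r + i][c + j]
            out_row.append(acc)
        residual.append(out_row)
    return residual


# A perfectly flat patch carries no noise, so its residual is numerically zero.
flat_patch = [[100] * 8 for _ in range(8)]
flat_residual = noise_residual(flat_patch, SQUARE5X5)
print(max(abs(v) for row in flat_residual for v in row))
```

Because each kernel sums to zero, smooth image content is cancelled and only local deviations such as noise survive. The sketch does not model how the filter layer reduces a 650 × 650 patch to a 325 × 325 feature map.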
III-B CNN-Based Model Training
The proposed convolutional neural network architecture is illustrated in Figure 3. The input to the proposed network is an image patch clipped from a full-size natural image or computer-generated graphic, with a resolution of 1 × (650 × 650), where 1 is the number of channels (grayscale) and 650 is both the width and the height.
There is a high-pass filter layer at the top of the proposed CNN-based model. This filter layer has three alternative configurations, one of which is selected for training. High_Pass_Filter3 consists of all three proposed filters; i.e., SQUARE5×5, EDGE3×3, and SQUARE3×3. High_Pass_Filter1 contains only the SQUARE5×5 filter. High_Pass_Filter0 uses an average pooling layer instead of the high-pass filter layer. The number of feature maps output by the filter layer depends on the configuration. If High_Pass_Filter3 is used, the high-pass filter layer outputs three feature maps of size 325 × 325. Otherwise, it outputs a single feature map of size 325 × 325.
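The filter layer is followed by five convolutional layers. Each convolutional layer is followed by a batch normalization layer, a ReLU layer, and an average pooling layer. At the bottom of the proposed model, a fully-connected layer and a softmax layer are utilized to transform the 128-dimensional feature vectors into classification probabilities for the image patches.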
The kernel sizes of the convolution layers in the proposed CNN-based model are 5 × 5, 5 × 5, 3 × 3, 3 × 3, and 1 × 1, respectively. The numbers of feature maps output by the layers are 8, 16, 32, 64, and 128, respectively, and the sizes of the feature maps are 325 × 325, 162 × 162, 81 × 81, 40 × 40, and 20 × 20, respectively. The kernel size of the average pooling in each layer is 5 × 5 and the stride is 2. Note that the last average pooling layer has a global kernel size of 20 × 20.
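The listed feature map sizes are consistent with each stride-2 stage halving the spatial size and rounding down. The short check below models only that halving; the padding and rounding settings of the actual layers are not given in the text, so this is an assumption about their net effect, and `feature_map_sizes` is a hypothetical helper.

```python
def feature_map_sizes(input_size=650, num_conv_layers=5):
    """Spatial size seen by each convolutional layer.

    Assumption: the filter layer and every stride-2 average pooling stage
    halve the spatial size, rounding down.
    """
    size = input_size // 2  # output of the filter layer: 650 -> 325
    sizes = []
    for _ in range(num_conv_layers):
        sizes.append(size)
        size //= 2  # stride-2 pooling before the next convolutional layer
    return sizes


CHANNELS = [8, 16, 32, 64, 128]  # feature maps per layer, as listed in the text

for channels, size in zip(CHANNELS, feature_map_sizes()):
    print(f"{channels:>3} feature maps of size {size} x {size}")
```

The last stage is the exception to the halving: its 20 × 20 global average pooling collapses each of the 128 maps to a single value, which yields the 128-dimensional feature vector passed to the fully-connected layer.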
We compared our deep learning approach with the state-of-the-art method of Rahmouni et al. The dataset used in this paper is the same as the one used in that work. It consists of 1800 computer-generated graphics and 1800 natural images. The computer-generated graphics were downloaded from the Level-Design Reference Database, which contains more than 60,000 screenshots of photo-realistic video games. The game information was removed by cropping the images to a resolution of 1280 × 650. The preprocessed images can be downloaded from the link on GitHub. Some computer graphics samples are shown in Figure 1. The natural images are taken from the RAISE dataset. The resolution of these natural images ranges from 3008 × 2000 to 4900 × 3200. All of these natural images were downloaded in RAW format and converted to JPEG with a quality factor of 95.
In our experiment, 900 natural images and 900 computer-generated graphics were randomly selected from the dataset for training, 800 natural images and 800 computer-generated graphics were set aside for testing, and 100 natural images and 100 computer-generated graphics were used for validation. Then, all of these full-size images were clipped into image patches of size 650 × 650. The number of image patches we obtained for training was about 44,000.
IV-B Experiment Setup
We implemented the proposed convolutional neural network with the Caffe framework. All of the experiments were conducted on a GeForce GTX 1080 Ti GPU. The stochastic gradient descent algorithm was used to optimize the proposed CNN-based model. The initial learning rate was set to 0.001. The learning rate update policy was configured with parameter values of 0.0001 and 0.75, and two further solver parameters were set to 0.9 and 0.0005, respectively. The batch size for training was set to 64; that is, 64 image patches were fed to the CNN-based model in each iteration. After 80 epochs, the trained CNN-based model was obtained for testing.
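To make the training schedule concrete, the sketch below computes the number of iterations implied by the batch size and epoch count, and evaluates one possible learning rate schedule. It assumes the update policy is Caffe's `inv` policy, lr = base_lr · (1 + gamma · iter)^(−power), with 0.0001 read as gamma and 0.75 as power. That reading is an assumption and is not confirmed by the text. The patch count of 44,000 is the approximate figure reported for the training set.

```python
import math

BASE_LR = 0.001        # initial learning rate, from the text
GAMMA = 0.0001         # assumed to be the `gamma` of Caffe's "inv" policy
POWER = 0.75           # assumed to be the `power` of Caffe's "inv" policy
NUM_PATCHES = 44_000   # approximate number of training patches
BATCH_SIZE = 64
EPOCHS = 80


def inv_learning_rate(iteration):
    """Learning rate at a given iteration under the assumed "inv" policy."""
    return BASE_LR * (1.0 + GAMMA * iteration) ** (-POWER)


iters_per_epoch = math.ceil(NUM_PATCHES / BATCH_SIZE)
total_iters = iters_per_epoch * EPOCHS

print("iterations per epoch:", iters_per_epoch)
print("total iterations:", total_iters)
for epoch in (0, 50, 80):
    rate = inv_learning_rate(epoch * iters_per_epoch)
    print(f"learning rate after {epoch} epochs: {rate:.6f}")
```

Under this assumed policy the learning rate decays smoothly from its initial value rather than dropping in steps.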
In order to get the performance of the proposed CNN-based model, we applied the trained model to the testing dataset. All of the full-size images in the testing dataset needed to be preprocessed. The preprocessing for the testing images was similar to the preprocessing of the images in the training. After preprocessing, the testing images were clipped into image patches. Then, these image patches were fed to the trained CNN-based model, and the prediction results for the image patches were obtained. Based on the prediction results of the image patches, we deployed a majority vote scheme to obtain the classification results for the full-size images.
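The majority vote over patch predictions can be sketched as follows. The label strings and the tie-breaking rule are assumptions made for illustration; the text does not say how ties between the two classes are resolved.

```python
from collections import Counter


def majority_vote(patch_labels):
    """Return the label predicted for most patches of one full-size image.

    Assumption: ties are broken by taking the alphabetically first label,
    only so that the result is deterministic.
    """
    if not patch_labels:
        raise ValueError("at least one patch prediction is required")
    counts = Counter(patch_labels)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


# Hypothetical patch predictions for one full-size test image.
predictions = ["NI", "NI", "CG", "NI", "NI", "CG", "NI"]
print(majority_vote(predictions))
```

A vote of this kind lets a full-size image be classified correctly even when some of its patches are misclassified, which is why full-size accuracy can exceed patch accuracy.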
IV-C Experimental Results
IV-C1 Different Numbers of High-Pass Filters
As shown in Figure 3, the proposed convolutional neural network has three combinations for the high-pass filter layer, each with a different number of high-pass filters. We trained each combination for 80 epochs and kept two trained models per combination. For example, for High_Pass_Filter3 we obtained a model of 50 epochs and a model of 80 epochs from a single 80-epoch training run. We obtained the same number of models for the other two combinations. Figure 4 and Figure 5 show the evolution of validation accuracy and training loss during training. The validation accuracy is shown in Figure 4, and the training loss is shown in Figure 5. The proposed method with High_Pass_Filter3 converges much faster than the others and achieves much higher prediction accuracy.
To evaluate the classification performance of the proposed method with different numbers of high-pass filters, we tested the models obtained in the training procedure on the testing dataset. The classification accuracy is shown in Table I. Note that the size of the image patches in the method of Rahmouni et al. is 100 × 100. In our experiments, we set the size of the image patches to 650 × 650 to meet the requirement of our neural network architecture. A majority vote scheme was applied to the testing results of the image patches to obtain the classification results for the full-size images.
Compared with the state-of-the-art method of Rahmouni et al., our method with the high-pass filter obtained better performance. Furthermore, the proposed method with High_Pass_Filter3 outperformed the others and obtained the best performance, reaching a classification accuracy of 100% on the full-size images. These experimental results demonstrate the effectiveness of the high-pass filter in the preprocessing procedure for our proposed deep learning approach.
| |Image Patches| |Full-Size Images| |
| |model of 50 epochs|model of 80 epochs|model of 50 epochs|model of 80 epochs|
|The proposed HPF3|99.98%|99.95%|100%|100%|
|The proposed HPF1|99.87%|99.77%|100%|99.83%|
|The proposed HPF0|88.28%|87.77%|93.37%|93.12%|
|Rahmouni et al.|84.8%| |93.2%| |
IV-C2 Different Quality Factors of Natural Images
We also evaluated the robustness of our proposed method to different JPEG quality factors. In this experiment, 2000 natural images in RAW format were downloaded from the RAISE-2k dataset. We randomly selected 1800 natural images for our robustness experiment. These RAW images were converted to JPEG format with quality factors of 95, 85, and 75, respectively, which gave us three sub-datasets with different quality factors of natural images. Each of the sub-datasets was then divided into training (50%), testing (40%), and validation (10%) sets for the robustness experiment. Note that the computer-generated graphics in this experiment remained untouched. These computer-generated graphics were compressed with a reasonable quality factor when the dataset was collected.
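The 50% / 40% / 10% division can be sketched as a seeded random split. The file names, the seed, and the function name `split_dataset` are hypothetical; the text states only the proportions.

```python
import random


def split_dataset(items, fractions=(0.5, 0.4, 0.1), seed=0):
    """Shuffle items and split them into training, testing, and validation sets.

    The validation set receives whatever remains after the first two parts,
    so the three parts always cover every item exactly once.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_test = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


# Hypothetical file names standing in for the 1800 selected natural images.
image_names = [f"natural_{index:04d}.jpg" for index in range(1800)]
train, test, validation = split_dataset(image_names)
print(len(train), len(test), len(validation))
```

Fixing the seed keeps the same images in each part across the three quality factors, which is one reasonable way to make the sub-datasets comparable; the text does not say whether this was done.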
For the filter layer, we utilized High_Pass_Filter3 to achieve the best performance in this experiment. Figure 6 and Figure 7 show the evolutions of training loss and validation accuracy in the iteration procedure. The validation accuracy is shown in Figure 6, and the training loss is shown in Figure 7. The classification accuracy for different quality factors of natural images is shown in Table II. It is observed that the proposed method with High_Pass_Filter3 achieves a perfect performance. Although the compression with different quality factors has an impact on the classification accuracy of image patches, due to the majority vote scheme used for the full-size images, all of the classification accuracies for different quality factors of the natural images are 100%.
| |Image Patches| |Full-Size Images| |
| |model of 50 epochs|model of 80 epochs|model of 50 epochs|model of 80 epochs|
In this paper, we develop an approach to distinguish computer-generated graphics from natural images based on sensor pattern noise and a convolutional neural network. The experimental results show that the proposed method obtains better performance than the method of Rahmouni et al. on the same dataset. Currently, there are several computer-generated graphics datasets [5, 7] for forensics research. However, many images in these datasets are smaller than 650 pixels in width or height, so they cannot meet the input size requirement of the proposed convolutional neural network. In the future, we will focus on improving our CNN-based model to handle smaller images. Furthermore, applying a single trained CNN-based model to discriminate the computer-generated graphics in other existing datasets (namely, one model for all datasets) would be another interesting direction for future work.
-  Rocha, A.; Scheirer, W.; Boult, T.; Goldenstein, S. Vision of the unseen: Current trends and challenges in digital image and video forensics. ACM Computing Surveys 2011, 43, 26–40.
-  Stamm, M.C.; Wu, M.; Liu, K.R. Information forensics: An overview of the first decade. IEEE Access 2013, 1, 167–200.
-  Rahmouni, N.; Nozick, V.; Yamagishi, J.; Echizen, I. Distinguishing Computer Graphics from Natural Images Using Convolution Neural Networks. In Proceedings of the IEEE International Workshop on Information Forensics and Security (WIFS), Rennes, France, December 4-7, 2017; 1-6.
-  Wang, X.; Liu, Y.; Xu, B.; Li, L.; Xue, J. A statistical feature based approach to distinguish PRCG from photographs. Computer Vision and Image Understanding 2014, 128, 84–93.
-  Li, Z.; Zhang, Z.; Shi, Y. Distinguishing computer graphics from photographic images using a multiresolution approach based on local binary patterns. Security & Communication Networks 2015, 7, 2153–2159.
-  Wang, J.; Li, T.; Shi, Y.Q.; Lian, S.; Ye, J. Forensics feature analysis in quaternion wavelet domain for distinguishing photographic images and computer graphics. Multimedia Tools & Applications 2016, 1-17.
-  Peng, F.; Zhou, D.L.; Long, M.; Sun, X.M. Discrimination of natural images and computer generated graphics based on multi-fractal and regression analysis. AEU - International Journal of Electronics and Communications 2017.
-  Yao, Y.; Shi, Y.; Weng, S.; Guan, B. Deep Learning for Detection of Object-Based Forgery in Advanced Video. Symmetry 2018, 10, 1-10.
-  Bayar, B.; Stamm, M.C. A Deep Learning Approach to Universal Image Manipulation Detection Using a New Convolutional Layer. Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec), 2016, 5-10.
-  Rao, Y.; Ni, J. A deep learning approach to detection of splicing and copy-move forgeries in images. In Proceedings of the IEEE International Workshop on Information Forensics and Security (WIFS), Abu Dhabi, UAE, 4–7 December 2016; 1-6.
-  Bondi, L.; Baroffio, L.; Güera, D.; Bestagini, P.; Delp, E.J.; Tubaro, S. First Steps Toward Camera Model Identification With Convolutional Neural Networks. IEEE Signal Processing Letters 2017, 24, 259–263.
-  Tuama, A.; Comby, F.; Chaumont, M. Camera model identification with the use of deep convolutional neural networks. In Proceedings of the IEEE International Workshop on Information Forensics and Security (WIFS), Abu Dhabi, UAE, 4–7 December 2016; 1–6.
-  Xu, G.; Wu, H.Z.; Shi, Y.Q. Structural Design of Convolutional Neural Networks for Steganalysis. IEEE Signal Processing Letters 2016, 23, 708–712.
-  Ye, J.; Ni, J.; Yi, Y. Deep Learning Hierarchical Representations for Image Steganalysis. IEEE Transactions on Information Forensics and Security 2017, 12, 2545–2557.
-  Gando, G.; Yamada, T.; Sato, H.; Oyama, S.; Kurihara, M. Fine-tuning deep convolutional neural networks for distinguishing illustrations from photographs. Expert Systems with Applications 2016, 66, 295–301.
-  Pandey, R.C.; Singh, S.K.; Shukla, K.K. Passive forensics in image and video using noise features: A review. Digital Investigation 2016, 19, 1–28.
-  Orozco, A.L.S.; Corripio, J.R. Image source acquisition identification of mobile devices based on the use of features. Multimedia Tools & Applications 2016, 75, 7087–7111.
-  Villalba, L.J.G.; Orozco, A.L.S.; López, R.R.; Castro, J.H. Identification of smartphone brand and model via forensic video analysis. Expert Systems with Applications 2016, 55, 59–69.
-  Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, Nevada, 3–6 December 2012; 1097–1105.
-  Fridrich, J.; Kodovsky, J. Rich Models for Steganalysis of Digital Images. IEEE Transactions on Information Forensics and Security 2012, 7, 868–882.
-  Qian, Y.; Dong, J.; Wang, W.; Tan, T. Deep learning for steganalysis via convolutional neural networks. In Proceedings of the SPIE 9409, Media Watermarking, Security, and Forensics, San Francisco, CA, USA, 9–11 February 2015; Volume 9409, p. 94090J.
-  Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; 448–456.
-  Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010; 807–814.
-  Dang-Nguyen, D.T.; Pasquini, C.; Conotter, V.; Boato, G. RAISE - a raw images dataset for digital image forensics. ACM Multimedia Systems, Portland, Oregon, March 18-20, 2015. 219–224.
-  Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; 675–678.
-  (2017) Level-design Reference Database. [Online]. Available: http://level-design.org/referencedb.
-  (2017) CGvsPhoto Github Project. [Online]. Available: https://github.com/NicoRahm/CGvsPhoto.