CNN-based fast source device identification

Source identification is an important topic in image forensics, since it allows tracing back the origin of an image. This is valuable information for claiming intellectual property, but also for revealing the authors of illicit materials. In this paper we address the problem of device identification based on sensor noise and propose a fast and accurate solution using convolutional neural networks (CNNs). Specifically, we propose a 2-channel-based CNN that learns a way of comparing camera fingerprint and image noise at patch level. The proposed solution turns out to be much faster than the conventional approach while also being more accurate. This makes the approach particularly suitable in scenarios where large databases of images are analyzed, such as social networks. In this vein, since images uploaded on social media usually undergo at least two compression stages, we include investigations on double JPEG compressed images, consistently reporting higher accuracy than standard approaches.

I Introduction

In this paper, we tackle the problem of image source device identification. State-of-the-art solutions rely on photo response non-uniformity (PRNU) [22], a characteristic noise trace left by the camera sensor on all acquired images. PRNU is caused by imperfections in the sensor manufacturing process and represents a unique noise pattern associated with a specific device. It can be used both to trace the exact origin of an image and to establish its integrity [9, 10].

The classic pipeline is based on the availability of a certain number of images coming from the same device in order to carry out a reliable PRNU estimation. Then the image under test can be compared with the sensor fingerprint through denoising and peak to correlation energy (PCE) computation [8]. However, computing the PCE between a query image noise and a fingerprint might be computationally expensive, and may become a major bottleneck when source identification requires scanning huge databases of device fingerprints [21, 24, 26]. With the increasing resolution of digital sensors, the search time may become prohibitive even for a moderate fingerprint database.
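To make the cost of the PCE test concrete, the following is a minimal NumPy sketch of one common way to compute it: the squared peak of the circular cross-correlation divided by the mean squared correlation outside a small window around the peak. The function name `pce`, the FFT-based correlation, and the 11×11 exclusion window (`exclude=5`) are assumptions made for illustration; they are not taken from the paper or from [8].

```python
import numpy as np

def pce(residual, fingerprint, exclude=5):
    """Peak-to-correlation energy between a noise residual and a fingerprint.

    Both inputs are 2D arrays of the same shape. `exclude` is the half-size
    of the window around the peak that is left out of the energy estimate
    (an illustrative choice).
    """
    r = residual - residual.mean()
    k = fingerprint - fingerprint.mean()
    # Circular cross-correlation computed in the frequency domain.
    xcorr = np.real(np.fft.ifft2(np.fft.fft2(r) * np.conj(np.fft.fft2(k))))
    # Location and value of the correlation peak.
    py, px = np.unravel_index(np.argmax(xcorr), xcorr.shape)
    peak = xcorr[py, px]
    # Mask out a small window around the peak, with wrap-around.
    mask = np.ones(xcorr.shape, dtype=bool)
    rows = np.arange(py - exclude, py + exclude + 1) % xcorr.shape[0]
    cols = np.arange(px - exclude, px + exclude + 1) % xcorr.shape[1]
    mask[np.ix_(rows, cols)] = False
    # Average correlation energy away from the peak.
    energy = np.mean(xcorr[mask] ** 2)
    return peak ** 2 / energy
```

Each call needs three 2D FFTs at full patch resolution, which is why repeating the test against every fingerprint in a large database becomes expensive.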

To manage large image databases, some popular methods look for compressed or alternative representations of the sensor noise, even if this is not an easy task given the noise-like nature of PRNU. One of the first methods introduces the concept of fingerprint digest: for each query fingerprint, only pixels with the largest magnitude are extracted [15]. Instead, in [2] the idea is to reduce the resolution of the PRNU values by quantizing them into a single bit, while in [3] a composite fingerprint is introduced. In [31] compression is obtained using random projections followed by binary quantization of the projected values, and in [5] this technique is investigated for JPEG images. Alternatively, the authors of [29] design a tree-like search strategy to split the PRNU matching problem into a series of sub-problems.

We consider an alternative approach to quickly compare PRNUs and test images by leveraging convolutional neural networks (CNNs) to achieve good performance on small image patches. Notice that relying on small patches makes PRNU-based forgery localization possible. Moreover, it enables analyses on social networks, where images are often available only in cropped formats. CNNs have already been successfully applied for camera model identification [30, 4], for tracing the origin of an image in a social network [7] or for PRNU anonymization [6]. However, less attention has been devoted to PRNU-based device identification. Only recently, in [19] sensor noise extraction has been improved by using a residual-based deep learning process that replaces the standard agnostic denoising procedure, leading to improved source attribution performance.

In this paper, we devise a data-driven approach for source identification based on PRNU. Differently from [19], our goal is to replace the PCE test in favour of an alternative method based on CNNs. Using deep learning to compare patches for forensics applications has been recently proposed in [16, 25, 12] by means of Siamese networks. These networks learn discriminative data representations that can be used to compare input patches especially when camera-based artifacts are considered. In our scenario, patches to be compared are extracted from the estimated PRNU and the noise residual of the test image, which do not possess any type of periodicity and are hard to compress. For this reason, we avoid embedded representations and rely on the whole original information, by proposing a network architecture that learns the best similarity function for source identification. Our approach is inspired by [32], where a general similarity function is learnt directly from raw image pixels to compare patches for visual tasks. In our case we use a 2-channel-based CNN, which, through proper training, is able to learn the best way to compare patches for device identification.

II Proposed method

Given a set of images coming from a device, it is possible to estimate the device PRNU through denoising and maximum-likelihood estimation [22, 9]. Given a test image, its noise residual can be extracted using a denoising-based procedure [9]. Standard PRNU-based device attribution methods compare the noise residual and the PRNU in order to detect a possible match [8]. In this work, we propose to perform this comparison through a CNN. For each pair of query image and candidate device, we feed the corresponding noise residual and PRNU to the CNN. The network returns a CNN-based identification score, which is directly associated with the coherence between the image and the device. By thresholding this score, we can infer whether the image comes from the device or not.
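For reference, the maximum-likelihood PRNU estimate mentioned above is commonly written in the PRNU literature (e.g., [9]) in the form below. The symbols are our own notation for this note and are not necessarily those of the paper: $I_i$ is the $i$-th of $N$ images from the device, $F(\cdot)$ is the denoising filter, $W_i$ is the resulting noise residual, and all operations are element-wise.

```latex
W_i = I_i - F(I_i), \qquad
\hat{K} = \frac{\sum_{i=1}^{N} W_i \, I_i}{\sum_{i=1}^{N} I_i^{2}}
```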

Fig. 1: Scheme of the proposed pair-wise correlation network (PCN).

II-A Network architecture

We investigate two CNN architectures that differ substantially from each other. With the specific purpose of testing a shallow network enabling fast computations, the first architecture is built using only convolutional layers, as depicted in Fig. 1. For each PRNU-residual pair, we perform the following operations:

  1. feed the pair to a sequence of steps of 2D convolution, leaky ReLU and max pooling, using the parameters shown in Fig. 1;

  2. apply a pair-wise correlation pooling, which is a layer tailored to our specific problem;

  3. obtain a single score through a fully-connected layer.

The pair-wise correlation pooling layer is inspired by the conventional metrics used to identify the source device of an image, i.e., normalized cross-correlation and PCE. Since these are based on the cross-correlation between noise residual and PRNU, the pair-wise correlation pooling likewise computes a correlation between pairs of input features. Given the input of the pair-wise correlation pooling, the output of the layer is defined as:

(1)

where is the channel index. The pair-wise correlation pooling layer can be seen as a simplified version of the bilinear pooling layer [20]. Indeed, bilinear pooling computes the correlation between all the feature maps, whereas pair-wise pooling evaluates it only for adjacent pairs of features. The main motivation behind pair-wise pooling is to provide fast computations, since evaluating the complete cross-correlation between all feature maps may be time-consuming.
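The idea of correlating only adjacent pairs of feature maps can be sketched as follows. This is an illustrative guess at the operation, not the paper's equation (1): we assume a `(C, H, W)` feature tensor with an even number of channels, pair channels as (0, 1), (2, 3), and so on, and use a normalized correlation per pair. The function name and the `eps` term are hypothetical.

```python
import numpy as np

def pairwise_correlation_pooling(features, eps=1e-8):
    """Correlate adjacent channel pairs of a (C, H, W) feature tensor.

    Returns an array of C // 2 normalized correlation values, one per
    pair of adjacent channels (0, 1), (2, 3), ... (assumed pairing).
    """
    c, h, w = features.shape
    if c % 2 != 0:
        raise ValueError("expected an even number of channels")
    # Group channels into pairs and flatten the spatial dimensions.
    flat = features.reshape(c // 2, 2, h * w)
    a, b = flat[:, 0, :], flat[:, 1, :]
    # Remove the mean of each map before correlating.
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den
```

With C channels this produces C/2 values, whereas a full bilinear pooling would evaluate all C×C channel combinations; this is the saving the paragraph above refers to.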

The second network we propose is known in the literature as Inception-ResNet V2, designed for going deeper and providing very accurate results [28]. Differently from the first solution, which is trained from scratch, we initialize the network weights, except for the last layer, using the weights trained on the ImageNet database [13]. Since Inception-ResNet V2 works on RGB images, we stack the PRNU and the residue as color components.

Hereinafter, we denote the two network architectures as PCN (pair-wise correlation network) and INC (inception).

II-B Network training

Fig. 2: Sketch of the proposed training strategy. For every training device, the network is fed using noise residuals coming from that device, paired with its own PRNU (coherent pair) and with the PRNU coming from a different device (non-coherent pair). The CNN is able to learn a similarity measure for the source identification task.

To train the network, we assume that a set of devices is available. For each device, we assume having a PRNU estimate (obtained from a set of flat-field images) as well as a set of natural images. From each natural image belonging to a device, we extract the noise residue.

As shown in Fig. 2, the network is trained by concurrently feeding it with coherent pairs (i.e., PRNU and residue of the same device), and non-coherent pairs (i.e., PRNU and residue of different devices). In order to provide both coherent and non-coherent cases for each device inside one batch of data, we use a batch size of twice the number of training devices. For every training device, the coherent pair is created by randomly picking a residue from the set of residues of that device, while the non-coherent pair is completed with data from a different device.
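The batch construction described above can be sketched in a few lines. This is a minimal illustration following the reading of Fig. 2 in which a residue of a device is paired once with its own PRNU and once with the PRNU of another device; the function name `make_batch`, the dictionary layout, the reuse of the same residue for both pairs, and the 1/0 label encoding are assumptions, not details confirmed by the paper.

```python
import random

def make_batch(prnus, residues, rng):
    """Build one training batch of (prnu, residue, label) triples.

    prnus:    {device_id: prnu}
    residues: {device_id: [residue, ...]}
    rng:      a random.Random instance
    The batch holds one coherent and one non-coherent pair per device,
    so its length is twice the number of devices.
    """
    devices = list(prnus)
    batch = []
    for d in devices:
        # Randomly pick a residue of device d.
        residue = rng.choice(residues[d])
        # Coherent pair: PRNU and residue from the same device.
        batch.append((prnus[d], residue, 1))
        # Non-coherent pair: same residue, PRNU of a different device.
        other = rng.choice([o for o in devices if o != d])
        batch.append((prnus[other], residue, 0))
    return batch
```

For example, with three devices the batch contains six triples, alternating coherent and non-coherent pairs.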

As far as the loss function is concerned, we treat the task as a binary classification problem, adopting the standard sigmoid cross-entropy loss. In other words, the positive label is assigned to coherent pairs and the negative label otherwise. The network thus learns to return a score which is high for coherent pairs and low for non-coherent ones. We stop the training when the accuracy (i.e., the fraction of correct predictions) over the validation set is maximized. Specifically, we use the Adam optimization algorithm [18] with given initial values of learning rate and patience, up to a maximum number of epochs.
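As a reminder of what the sigmoid cross-entropy loss computes for a single pair, here is a small numerically stable sketch. It assumes the network outputs a raw score (logit) and that coherent pairs are labeled 1 and non-coherent pairs 0; this label encoding and the function name are assumptions made for illustration.

```python
import math

def sigmoid_cross_entropy(logit, label):
    """Binary cross-entropy on a raw score, in a numerically stable form.

    Equivalent to -label*log(s) - (1-label)*log(1-s) with
    s = 1 / (1 + exp(-logit)), but safe for large |logit|.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))
```

The loss is small when a coherent pair receives a large positive score or a non-coherent pair a large negative one, and grows roughly linearly with the score when the prediction is confidently wrong.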

II-C Network deployment

Once the CNN is trained, we can use it in two scenarios:

  • closed-set scenario: given an image, we aim at identifying the source among a finite pool of devices;

  • open-set scenario: given an image and a candidate device, we aim at inferring if the device shot that image or not.

To solve these problems, we always feed the network with pairs of PRNU and image residue. In the closed-set scenario, the camera returning the highest CNN score is associated with the image. In the open-set scenario, we threshold the CNN score in order to attribute the image to the device with a certain false alarm probability. Notice that, although the CNN is trained over a specific set of devices, it can also be used to compare PRNUs and residues from devices never seen during training. As a matter of fact, the CNN learns how to compute a distance measure between a PRNU and a residue, thus it is not bound to work on a closed set of devices.
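The two decision rules can be sketched as follows. The helper names, the dictionary of scores, and the way the threshold is picked from scores of known non-coherent pairs are hypothetical choices made for illustration; the paper does not specify how its threshold is selected.

```python
def closed_set_decision(scores):
    """scores: {device_id: CNN score}. Return the device with the top score."""
    return max(scores, key=scores.get)

def open_set_decision(score, threshold):
    """Attribute the image to the candidate device if the score exceeds the threshold."""
    return score > threshold

def threshold_for_false_alarm(negative_scores, target_far):
    """Pick a threshold from scores of non-coherent pairs.

    At most a fraction `target_far` of the given negative scores is
    strictly above the returned threshold.
    """
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(target_far * len(ranked))  # tolerated false alarms
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]
```

In practice the negative scores would come from a validation set; the same threshold can then be applied to devices never seen during training, since the network scores a pair rather than classifying a device.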

III Experimental analysis

In this section, we first describe the dataset, the experimental set-up and the evaluation metrics, then we report numerical results discussing the main achievements.

III-A Dataset

The input to the network is always a pair of PRNU and image noise residue. In our experiments, we always crop a central pixel region of fixed size from each image in order to limit network complexity. We vary the crop size over a range of values in order to study the relationship between image resolution and accuracy. As commonly done in CNN-based solutions, we normalize both PRNUs and residues by their standard deviation before feeding them to the network.
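The preprocessing just described (central crop, normalization by the standard deviation, pairing of PRNU and residue) might look like the sketch below. The square crop, the channel-last layout, and the function names are assumptions for illustration only.

```python
import numpy as np

def central_crop(x, size):
    """Return the central size x size region of a 2D array (square crop assumed)."""
    h, w = x.shape[:2]
    if size > h or size > w:
        raise ValueError("crop size exceeds image size")
    top = (h - size) // 2
    left = (w - size) // 2
    return x[top:top + size, left:left + size]

def make_network_input(prnu, residue, size):
    """Crop, normalize by standard deviation, and stack into two channels."""
    k = central_crop(prnu, size)
    w = central_crop(residue, size)
    # Normalize each input by its own standard deviation.
    k = k / k.std()
    w = w / w.std()
    # Channel-last layout: (size, size, 2).
    return np.stack([k, w], axis=-1)
```

Cropping before normalization means the standard deviation is computed on the region that the network actually sees.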

In order to test our method on a significant number of devices, we consider both the Dresden image database [14] and the Vision dataset [27]. To avoid excessively old camera models, we only pick devices whose images exceed a minimum resolution. For each device we exclusively investigate JPEG compressed images, as these represent the most frequent data a forensic analyst has to deal with.

To compute reliable PRNU fingerprints, we select a device only if a sufficient number of JPEG images depicting flat surfaces is available for it. The resulting pool includes devices from both Dresden and Vision. To estimate the PRNU we use only flat-field images, whereas, in order to test our method fairly, we use images showing natural scenes taken in indoor and outdoor settings, drawn from both Dresden and Vision. The natural images of each device are randomly split into three disjoint datasets: training, validation and evaluation.

The training dataset includes devices from both Vision and Dresden. It is worth mentioning that it has been built using only devices of different models. In doing so, we avoid introducing an important constraint into the learning process, that is, training using cameras of the same model. Indeed, such a constraint might enhance the final performance, but it would require several instances of the same model on the investigation side, which is unlikely in practice. Results are computed on images from the evaluation set of all devices. In this way, we simulate a realistic situation in which the system is also tested on unknown camera models.

III-B Evaluation metrics

In the closed-set scenario, we test the residue of the query image against all PRNUs. To assess the attribution accuracy, we consider all query residues in the evaluation set for each device. The closed-set accuracy score is defined as the average fraction of correct predictions per device.
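The closed-set accuracy defined above is an average of per-device accuracies rather than a single pooled fraction. A minimal sketch, with a hypothetical function name and list-based inputs:

```python
def closed_set_accuracy(true_devices, predicted_devices):
    """Average, over devices, of the fraction of correctly attributed queries."""
    per_device = {}
    for t, p in zip(true_devices, predicted_devices):
        correct, total = per_device.get(t, (0, 0))
        per_device[t] = (correct + (t == p), total + 1)
    # Mean of the per-device accuracies, so every device weighs the same.
    return sum(c / n for c, n in per_device.values()) / len(per_device)
```

For instance, if device "a" has 3 of 4 queries correct and device "b" has 1 of 2, the score is (0.75 + 0.5) / 2 = 0.625, whereas the pooled fraction would be 4/6. Averaging per device keeps devices with many images from dominating the metric.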

The open-set scenario is evaluated as a binary classification problem: distinguishing correlating PRNU-residue pairs from non-correlating ones. We therefore resort to receiver operating characteristic (ROC) curves. For each camera we consider all noise residues from that camera as positive samples, whereas the set of negatives includes all the residues not belonging to that camera. Each ROC curve shows the relationship between true positive rate (TPR) and false positive rate (FPR), averaged over the set of available devices. As a compact metric we use the area under the curve (AUC) for the open-set problem. The goal is to achieve a high AUC, ideally 1.
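For a single camera, the area under the ROC curve equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below uses that equivalence; it is a simple quadratic-time illustration with a hypothetical function name, not the evaluation code used in the paper.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))
```

A value of 1 means every coherent pair scores above every non-coherent pair, while 0.5 corresponds to chance-level ranking.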

III-C Results

All tests have been run on a workstation equipped with one Intel® Core i9-9980XE (36 Cores @3.00 GHz), RAM 126 GB, one QUADRO P6000 (3840 CUDA Cores @1530 MHz), 24 GB, running Ubuntu 18.04.2.

Closed-set scenario. Fig. 3 shows results for the closed-set scenario. In particular, the accuracy is depicted as a function of the average computational time for testing one PRNU-residue pair, and of the crop size. Notice that for the INC architecture we report results only up to a limited crop size, as larger image dimensions would require additional GPU memory. As state-of-the-art comparison, we perform the classic PCE test [8] between each noise residual in the evaluation dataset and all available PRNUs.

The leaner network set-up (i.e., the PCN) proves to be fast and accurate at the same time: for a fixed crop size, the required time is always lower than that of both PCE and the second strategy. With respect to PCE, accuracy increases as well: PCE is always worse and achieves similar results only for the largest crop sizes. The deepest network (i.e., the INC) is computationally heavier as expected, but shows a larger accuracy gap with respect to PCE. For instance, at the same crop size, INC achieves a higher accuracy than PCN, which in turn outperforms PCE. To achieve similar accuracy, PCE should consider an image size of more than twice that of the INC architecture, with the consequent increase in computational complexity.

Regarding the required computational times, it is worth noting that the main advantage with respect to PCE is the possibility of feeding the CNNs with multiple PRNU-residue pairs. These pairs can be processed in parallel by the GPU, as long as memory is available for storing the data. As a consequence, the larger the number of candidate devices, the higher the time saving compared to PCE. For instance, when testing a query image against the candidate cameras in our experiments, we always take up less than 8 GB of GPU memory, even using the INC configuration. If more devices were available, the average testing time would remain basically unchanged until reaching the maximum memory size.
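The batching idea can be illustrated by pairing one query residue with every candidate PRNU in a single array, which a network could then score in one forward pass. The function name, the array shapes, and the channel-last layout are assumptions; the network call itself is omitted because it depends on the framework in use.

```python
import numpy as np

def build_query_batch(residue, prnus):
    """Pair one residue with N candidate PRNUs.

    residue: (P, P) array
    prnus:   (N, P, P) array
    Returns an (N, P, P, 2) array: channel 0 holds the PRNU of each
    candidate, channel 1 holds the (repeated) query residue.
    """
    # Broadcast the residue across the N candidates without copying it yet.
    tiled = np.broadcast_to(residue, prnus.shape)
    return np.stack([prnus, tiled], axis=-1)
```

Memory grows linearly with the number of candidates, which matches the observation above that the testing time stays roughly constant until GPU memory is exhausted.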

Fig. 3: Accuracy as a function of time [milliseconds] and crop size.
Fig. 4: Accuracy as a function of crop size.

Open-set scenario. Fig. 4 reports results for the open-set scenario. It is worth noting that both proposed architectures outperform PCE for any crop size. Notice that we do not analyze the AUC as a function of the required time, since the open-set scenario reduces the investigation to a single PRNU-residue pair. In this scenario, CNN and PCE report comparable computational times, on the order of a few milliseconds. However, as previously shown, whenever several images have to be tested against the same PRNU, we can exploit the data parallelism of the GPU to test multiple images together, thus saving significant computation time.

JPEG re-compression. Finally, in order to simulate scenarios where images underwent some post-processing operations, we test our method on double JPEG-compressed images. This step is purposely designed to simulate a realistic set-up, in which only images from social networks (which typically undergo at least two compression steps) may be available. In this vein, we re-compress all the images using JPEG compression with quality factors 80 and 90. Then, we extract the noise residuals [9] from the double-compressed images. As far as PRNUs are concerned, we can distinguish two scenarios: (i) if flat-field single-compressed images are available, we exploit the PRNUs estimated from these; (ii) if only flat-field double-compressed images are available, we use them to compute double-compressed PRNUs. For a given patch size, we experiment with two test cases: (i) test over pairs of single-compressed PRNUs and double-compressed residues; (ii) test over pairs of double-compressed PRNUs and double-compressed residues. Moreover, we investigate multiple training configurations: we can exploit the network trained over the single-compressed dataset, or we can re-train the network from scratch, according to the specific test case.

Experiments show that we can achieve higher accuracy than PCE for both JPEG qualities, 80 and 90. This performance improvement is especially evident for the smallest patch sizes. For instance, Fig. 5 depicts the results achieved in the closed-set scenario using the PCN configuration on the double-compressed dataset with JPEG quality 90. It is worth noting that training the CNN on the specific evaluation case always helps enhance the performance with respect to training over the single-compressed dataset, even though every reported curve outperforms the corresponding PCE results.

Fig. 5: Accuracy as a function of crop size, considering the double-JPEG compressed dataset with quality 90 and the PCN network. The legend entries refer to the training and evaluation datasets, respectively.

IV Conclusions

In this paper we propose a fast solution to the image source device identification problem. In particular, we leverage PRNU and standard image noise residual extraction, but we substitute the correlation stage with a 2-channel-based convolutional neural network, able to learn the best similarity function for source device identification. Our method proves to be faster than PCE when a large number of potential source devices is investigated. Moreover, it requires much less query image content to obtain enhanced attribution accuracy. Finally, we evaluate the proposed methodology on images that underwent double-JPEG compression, in order to simulate more complex scenarios. Experiments show that our method can be considered a viable alternative to PCE, especially whenever only small-size images are available.

Given these promising results, our future work will focus on video sequences. Indeed, compression artifacts [11, 1] and stabilization issues [17, 23] make video source identification a complex task of undoubted interest.

References

  • [1] E. Altinisik, K. Tasdemir, and H.T. Sencar (2019) Mitigation of H.264 and H.265 video compression for reliable PRNU estimation. IEEE Transactions on Information Forensics and Security 15, pp. 1557–1571. Cited by: §IV.
  • [2] S. Bayram, H.T. Sencar, and N. Memon (2012) Efficient Sensor Fingerprint Matching Through Fingerprint Binarization. IEEE Transactions on Information Forensics and Security 7 (4), pp. 1404–1413. Cited by: §I.
  • [3] S. Bayram, H.T. Sencar, and N. Memon (2015) Sensor fingerprint identification through composite fingerprints and group testing. IEEE Transactions on Information Forensics and Security 10 (3), pp. 597–612. Cited by: §I.
  • [4] L. Bondi, L. Baroffio, D. Guera, P. Bestagini, E.J. Delp, and S. Tubaro (2017) First steps toward camera model identification with convolutional neural networks. IEEE Signal Processing Letters 24 (3), pp. 259–263. Cited by: §I.
  • [5] L. Bondi, P. Bestagini, F. Pérez-González, and S. Tubaro (2019) Improving PRNU compression through preprocessing, quantization, and coding. IEEE Transactions on Information Forensics and Security 14 (3), pp. 608–620. Cited by: §I.
  • [6] N. Bonettini, L. Bondi, D. Güera, S. Mandelli, P. Bestagini, S. Tubaro, and E. J. Delp (2018) Fooling PRNU-based detectors through convolutional neural networks. In European Signal Processing Conference (EUSIPCO), pp. 957–961. Cited by: §I.
  • [7] R. Caldelli, I. Amerini, and C.T. Li (2018) PRNU-based Image Classification of Origin Social Network with CNN. In European Signal Processing Conference, pp. 1357–1361. Cited by: §I.
  • [8] M. Chen, J. Fridrich, M. Goljan, and J. Lukáš (2007) Source digital camcorder identification using sensor photo response non-uniformity. In Security, Steganography, and Watermarking of Multimedia Contents IX, Vol. 6505. Cited by: §I, §II, §III-C.
  • [9] M. Chen, J. Fridrich, M. Goljan, and J. Lukás (2008) Determining image origin and integrity using sensor noise. IEEE Transactions on Information Forensics and Security 3 (1), pp. 74–90. Cited by: §I, §II, §III-C.
  • [10] G. Chierchia, G. Poggi, C. Sansone, and L. Verdoliva (2014) A Bayesian-MRF approach for PRNU-based image forgery detection. IEEE Transactions on Information Forensics and Security 9 (4), pp. 554–567. Cited by: §I.
  • [11] W. Chuang, H. Su, and M. Wu (2011) Exploring compression effects for improved source camera identification using strongly compressed video. In 2011 18th IEEE International Conference on Image Processing, pp. 1953–1956. Cited by: §IV.
  • [12] D. Cozzolino and L. Verdoliva (2020) Noiseprint: a cnn-based camera model fingerprint. IEEE Transactions on Information Forensics and Security 15 (), pp. 144–159. External Links: Document, ISSN 1556-6021 Cited by: §I.
  • [13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II-A.
  • [14] T. Gloe and R. Böhme (2010) The ’Dresden Image Database’ for benchmarking digital image forensics. Journal of Digital Forensic Practice 3, pp. 150–159. Cited by: §III-A.
  • [15] M. Goljan and J. Fridrich (2010) Managing a large database of camera fingerprints. In Media Forensics and Security II, SPIE Electronic Imaging Symposium, Cited by: §I.
  • [16] M. Huh, A. Liu, A. Owens, and A.A. Efros (2018) Fighting fake news: image splice detection via learned self-consistency. In European Conference on Computer Vision, Cited by: §I.
  • [17] M. Iuliani, M. Fontani, D. Shullani, and A. Piva (2019) Hybrid reference-based video source identification. Sensors 19 (3), pp. 649. Cited by: §IV.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §II-B.
  • [19] M. Kirchner and C. Johnson (2019-12) SPN-CNN: Boosting Sensor-Based Source Camera Attribution with Deep Learning. In IEEE International Workshop on Information Forensics and Security, Cited by: §I, §I.
  • [20] T. Lin, A. RoyChowdhury, and S. Maji (2015) Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE international conference on computer vision, pp. 1449–1457. Cited by: §II-A.
  • [21] X. Lin and C.T. Li (2017) Large-scale image clustering based on camera fingerprints. IEEE Transactions on Information Forensics and Security 12 (4), pp. 793–808. Cited by: §I.
  • [22] J. Lukas, J. Fridrich, and M. Goljan (2006) Digital camera identification from sensor pattern noise. IEEE Transactions on Information Forensics and Security 1 (2), pp. 205–214. Cited by: §I, §II.
  • [23] S. Mandelli, P. Bestagini, L. Verdoliva, and S. Tubaro (2020) Facing device attribution problem for stabilized video sequences. IEEE Transactions on Information Forensics and Security 15 (), pp. 14–27. External Links: Document, ISSN 1556-6021 Cited by: §IV.
  • [24] F. Marra, G. Poggi, C. Sansone, and L. Verdoliva (2017) Blind PRNU-based image clustering for source identification. IEEE Transactions on Information Forensics and Security 12 (9), pp. 2197–2211. Cited by: §I.
  • [25] O. Mayer and M.C. Stamm (2020) Forensic Similarity for Digital Images. IEEE Transactions on Information Forensics and Security 15, pp. 1331–1346. Cited by: §I.
  • [26] Q.T. Phan, G. Boato, and F.G.B. De Natale (2018) Accurate and scalable image clustering based on sparse representation of camera fingerprint. IEEE Transactions on Information Forensics and Security 14 (7), pp. 1902–1916. Cited by: §I.
  • [27] D. Shullani, M. Fontani, M. Iuliani, O. Al Shaya, and A. Piva (2017) VISION: a video and image dataset for source identification. EURASIP Journal on Information Security 2017 (1), pp. 15. Cited by: §III-A.
  • [28] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence, Cited by: §II-A.
  • [29] S. Taspinar, H.T. Sencar, S. Bayram, and N. Memon (2017) Fast camera fingerprint matching in very large databases. In IEEE International Conference on Image Processing, Cited by: §I.
  • [30] A. Tuama, F. Comby, and M. Chaumont (2016) Camera model identification with the use of deep convolutional neural networks. In IEEE International Workshop on Information Forensics and Security, Cited by: §I.
  • [31] D. Valsesia, G. Coluccia, T. Bianchi, and E. Magli (2015) Compressed fingerprint matching and camera identification via random projections. IEEE Transactions on Information Forensics and Security 10 (7), pp. 1472–1485. Cited by: §I.
  • [32] S. Zagoruyko and N. Komodakis (2015) Learning to compare image patches via convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §I.