Trustworthy multimedia content is important to a number of institutions in today’s society, including news outlets, courts of law, police investigations, intelligence agencies, and social media websites. As a result, multimedia forensics approaches have been developed to expose tampered images, determine important information about the processing history of images, and identify details about the camera make, model, and device that captured them. These forensic approaches operate by detecting the visually imperceptible traces, or “fingerprints,” that are intrinsically introduced by a particular processing operation.
Recently, researchers have developed deep learning methods that target digital image forensic tasks with high accuracy. For example, convolutional neural network (CNN) based systems have been proposed that accurately detect traces of median filtering, resizing [3, 4], inpainting, multiple processing operations [6, 7, 8], processing order, and double JPEG compression [10, 11]. Additionally, researchers have proposed approaches to identify the source camera model of digital images [12, 13, 14, 15], and to identify their origin social media website.
However, there are two main drawbacks to many of these existing approaches. The first is that many deep learning systems assume a closed set of forensic traces, i.e., a known and closed set of possible editing operations or camera models. That is, these methods require prior training examples from a particular forensic trace, such as the source camera model or editing operation, in order to identify it again in the future. This requirement is a significant problem for forensic analysts, since they are often presented with new or previously unseen forensic traces. Additionally, it is often not feasible to scale deep learning systems to contain the large numbers of classes that a forensic investigator may encounter. For example, systems that identify the source camera model of an image often require hundreds of scene-diverse images per camera model. Scaling such a system to hundreds or thousands of camera models would require a prohibitively large data collection undertaking.
A second drawback of these existing approaches is that many forensic investigations do not require explicit identification of a particular forensic trace. For example, when analyzing a splicing forgery, which is a composite of content from multiple images, it is often sufficient to simply detect a region of an image that was captured by a different source camera model, without explicitly identifying that source. That is, the investigator does not need to determine the exact processing applied to the image, or the true sources of the pasted content, just that an inconsistency exists within the image. In another example, when verifying the consistency of an image database, the investigator does not need to explicitly identify which camera models were used to capture the database, only whether one camera model or many camera models were used.
Recently, researchers have proposed CNN-based forensic systems for digital images that do not require a closed and known set of forensic traces. In particular, prior research proposed a system to output a binary decision indicating whether a particular image was captured by a camera model in a set of known camera models used for training, or by an unknown camera model that the classifier has not been trained to identify. Other work showed that features learned by a CNN for source camera model identification can be iteratively clustered to identify spliced images. The authors showed that this type of system can detect spliced images even when the camera models were not used to train the system.
In this paper, we propose a new digital image forensics approach that operates on an open set of forensic traces. This approach, which we call forensic similarity, determines whether two image patches contain the same or different forensic traces. This approach is different from other forensics approaches in that it does not explicitly identify the particular forensic traces contained in an image patch, just whether they are consistent across two image patches. The benefit of this approach is that prior knowledge of a particular forensic trace is not required to make a similarity decision on it.
To do this, we propose a two-part deep learning system. In the first part, called the feature extractor, we use a CNN to extract general low-dimensional forensic features, called deep features, from an image patch. Prior research has shown that CNNs can be trained to map image patches onto a low-dimensional feature space that encodes general and high-level forensic information about that patch [17, 18, 14, 19, 3, 20]. Next, we use a three-layer neural network to map pairs of these deep features onto a similarity score, which indicates whether the two image patches contain the same forensic trace or different forensic traces.
We experimentally evaluate the efficacy of our proposed approach in several scenarios. We evaluate the ability of our proposed forensic similarity system to determine whether two image patches were 1) captured by the same or different camera model, 2) manipulated by the same or different editing operation, and 3) manipulated with the same or different manipulation parameter, given a particular editing operation. Importantly, we evaluate performance on camera models, manipulations and manipulation parameters not used during training, demonstrating that this approach is effective in open-set scenarios.
Furthermore, we demonstrate the utility of this approach in two practical applications that a forensic analyst may encounter. In the first application, we demonstrate that our forensic similarity system detects and localizes image forgeries. Since image forgeries are often a composite of content captured by two camera models, these forgeries are exposed by detecting the image regions that are forensically dissimilar to the host image. In the second application, we show that the forensic similarity system verifies whether a database of images was captured entirely by the same camera model, or by different camera models. This is useful for flagging social media accounts that violate copyright protections by stealing content, often from many different sources.
This paper is an extension of our previous work. In our previous work, we proposed a proof-of-concept similarity system for comparing the source camera model of two image patches, and evaluated it on a limited set of camera models. In this work, we extend that system in several ways. First, we reframe the approach as a general system that is applicable to any measurable forensic trace, such as manipulation type or editing parameter, not just source camera model. Second, we significantly improve the system architecture and training procedure, which led to an over 50% reduction in classification error. Finally, we experimentally evaluate this approach in an expanded range of scenarios, and demonstrate utility in two practical applications.
The remaining parts of the manuscript are outlined as follows. In Sec. II, we motivate and formalize the concept of forensic similarity. In Sec. III, we detail our proposed deep-learning system implementation and training procedure. In this section, we describe how to build and train the CNN-based feature extractor and similarity network. In Sec. IV, we evaluate the effectiveness of our proposed approach in a number of forensic situations and, importantly, its effectiveness on unknown forensic traces. Finally, in Sec. V-A, we demonstrate the utility of this approach in two practical applications.
II Forensic Similarity
Prior multimedia forensics approaches for digital images have focused on identifying or classifying a particular forensic trace (e.g. source camera model, processing history) in an image or image patch. These approaches, however, suffer from two major drawbacks in that 1) training samples from a particular trace are required to identify it, and 2) not all forensic analyses require identification of a trace. For example, to expose a splicing forgery it is sufficient to identify that the forged image is simply composite content from different sources, without needing to explicitly identify those sources.
In this paper, we propose a new general approach that addresses these drawbacks. We call this approach forensic similarity. Forensic similarity is an approach that determines if two image patches have the same or different forensic trace. Unlike prior forensic approaches, it does not identify a particular trace, but still provides important forensic information to an investigator. The main benefit of this type of approach is that it is able to be practically implemented in open-set scenarios. That is, a forensic similarity based system does not inherently require training samples from a forensic trace in order to make a forensic similarity decision. Later, in Sec. III, we describe how this approach is implemented using a CNN-based deep learning system.
In this work, we define a forensic trace to be a signal embedded in an image that is induced by, and captures information about, a particular signal processing operation performed on it. Forensic traces are inherently unrelated to the perceptual content of the image; two images depicting different scenes may contain similar forensic traces, and two images depicting similar scenes may contain different forensic traces. Common mechanisms that induce a forensic trace in an image are: the camera model that captured the image, the social media website from which the image was downloaded, and the processing history of the image. A number of approaches have been researched to extract and identify the forensic traces related to these mechanisms [7, 14, 21].
These prior approaches, however, assume a closed set of forensic traces. They are designed to perform a mapping $\mathcal{X} \rightarrow \mathcal{Y}$, where $\mathcal{X}$ is the space of image patches and $\mathcal{Y}$ is the space of known forensic traces that are used to train the system, e.g. camera models or editing operations. However, when an input image patch has a forensic trace $y' \notin \mathcal{Y}$, the identification system is still forced to map to the space $\mathcal{Y}$, leading to an erroneous result. That is, the system will misclassify this new “unknown” trace as a “known” one in $\mathcal{Y}$.
This is problematic since in practice forensic investigators are often presented with images or image patches that contain a forensic trace for which the investigator does not have training examples. We call these unknown forensic traces. This may be a camera model that does not exist in the investigator’s database, or a previously unknown editing operation. In these scenarios, it is still important to glean forensic information about the image or image patch.
To address this, we propose a system that is capable of operating on unknown forensic traces. Instead of building a system to identify a particular forensic trace, we ask the question “do these two image patches contain the same forensic trace?” Even though the forensic similarity system may have never seen a particular forensic trace before, it is still able to distinguish whether the traces are the same or different across two patches. This type of question is analogous to the content-based image retrieval problem and the speaker verification problem.
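We define forensic similarity as the function
$$C : \mathcal{X} \times \mathcal{X} \rightarrow \{0, 1\} \tag{1}$$
that compares two image patches. This is done by mapping two input image patches $X_1, X_2 \in \mathcal{X}$ to a score indicating whether the two image patches have the same or different forensic trace. A score of 0 indicates the two image patches contain different forensic traces, and a score of 1 indicates they contain the same forensic trace. In other words,
$$C(X_1, X_2) = \begin{cases} 0 & \text{if } X_1, X_2 \text{ contain different forensic traces,} \\ 1 & \text{if } X_1, X_2 \text{ contain the same forensic trace.} \end{cases} \tag{2}$$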
To construct this system, we propose a forensic similarity system consisting of two main conceptual parts, which are shown in the system overview in Fig. 1. The first conceptual part is called the feature extractor
$$f : \mathcal{X} \rightarrow \mathbb{R}^N \tag{3}$$
which maps an input image patch $X$ to a real-valued N-dimensional feature space. This feature space encodes high-level forensic information about the image patch $X$. Recent research in multimedia forensics has shown that convolutional neural networks (CNNs) are powerful tools for extracting general, high-level forensic information from image patches. We specify how this is done in Sec. III, where we describe our proposed implementation of the forensic similarity system.
Next we define the second conceptual part, the similarity function
$$S : \mathbb{R}^N \times \mathbb{R}^N \rightarrow [0, 1] \tag{4}$$
that maps pairs of forensic feature vectors to a similarity score that takes values from 0 to 1. A low similarity score indicates that the two image patches $X_1$ and $X_2$ have dissimilar forensic traces, and a high similarity score indicates that the two forensic traces are highly similar.
Finally, we compare the similarity score $S(f(X_1), f(X_2))$ of two image patches $X_1$ and $X_2$ to a threshold $\eta$ such that
$$C(X_1, X_2) = \begin{cases} 0 & \text{if } S(f(X_1), f(X_2)) \le \eta, \\ 1 & \text{if } S(f(X_1), f(X_2)) > \eta. \end{cases} \tag{5}$$
In other words, the proposed forensic similarity system takes two image patches $X_1$ and $X_2$ as input. A feature extractor maps these two input image patches to a pair of feature vectors $f(X_1)$ and $f(X_2)$, which encode high-level forensic information about the image patches. Then, a similarity function maps these two feature vectors to a similarity score, which is then compared to a threshold. A similarity score above the threshold indicates that $X_1$ and $X_2$ have the same forensic trace (e.g. processing history or source camera model), and a similarity score below the threshold indicates that they have different forensic traces.
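The composition of feature extractor, similarity function, and threshold can be sketched in a few lines of Python. This is only an illustration of the decision rule: `toy_feature_extractor` and `toy_similarity` are hypothetical placeholders with no forensic meaning, standing in for the trained CNN and similarity network described in Sec. III, and the threshold value of 0.5 is an assumption made for this example.

```python
import math

def toy_feature_extractor(patch):
    """Hypothetical stand-in for the feature extractor: patch -> 2-D feature vector."""
    pixels = [p for row in patch for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [mean, var]

def toy_similarity(f1, f2):
    """Hypothetical stand-in for the similarity function: score in [0, 1]."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
    return math.exp(-dist)

def forensic_similarity(x1, x2, f, S, eta=0.5):
    """Decision rule: 1 if the similarity score exceeds the threshold, else 0."""
    return 1 if S(f(x1), f(x2)) > eta else 0

patch_a = [[10, 12], [11, 13]]
patch_b = [[10, 12], [11, 13]]
patch_c = [[200, 0], [0, 200]]

same = forensic_similarity(patch_a, patch_b, toy_feature_extractor, toy_similarity)
diff = forensic_similarity(patch_a, patch_c, toy_feature_extractor, toy_similarity)
print(same, diff)
```

Passing the extractor and similarity function as arguments keeps the decision rule independent of how either part is implemented, which mirrors the two-part decomposition in the text.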
III Proposed Approach
In this section, we describe our proposed deep learning system architecture and associated training procedure for forensic similarity. In our proposed forensic similarity architecture and training procedure, we build upon successful CNN-based techniques used in multimedia forensics literature, as well as propose a number of innovations that are aimed at extracting robust forensic features from images and accurately determining forensic similarity between two image patches.
Our proposed forensic similarity system consists of two conceptual elements: 1) a CNN-based feature extractor that maps an input image onto a low-dimensional feature space that encodes high-level forensic information, and 2) a three-layer neural network, which we call the similarity network, that maps pairs of these features to a score indicating whether two image patches contain the same forensic trace. This system is trained in two successive phases. In the first phase, called Learning Phase A, we train the feature extractor. In the second phase, called Learning Phase B, we train the similarity network. Finally, in this section we describe an entropy-based method of patch selection, which we use to filter out patches that are not suitable for forensic analysis.
III-A Learning Phase A - Feature Extractor
Here we describe the deep-learning architecture and training procedure of the feature extractor that maps an input image patch onto a low-dimensional feature space, which encodes forensic information about the patch. This is the mapping described by (3). Later, pairs of these feature vectors are used as input to the similarity network described below in Sec. III-B.
Developments in machine learning research have shown that CNNs are powerful tools to use as generic feature extractors. This is done by robustly training a deep convolutional neural network for a particular task, and then using the neuron activations, at a deep layer in the network, as a feature representation of an image. These neuron activations are called “deep features,” and are often extracted from the last fully connected layer of a network. Research has shown that deep features extracted from a CNN trained for object recognition tasks can be used to perform scene recognition tasks and remote sensing tasks.
In multimedia forensics research, it has been shown that deep-feature-based approaches are also very powerful for digital image forensics tasks. For example, prior work showed that deep features from a network trained on one set of camera models can be used to train a support vector machine to identify a different set of camera models. Other work showed that deep features from a CNN trained on a set of camera models can be used to determine whether the camera model of the input image was used to train the system. Furthermore, it has been shown that deep features from a CNN trained for camera model identification transfer very well to other forensic tasks such as manipulation detection, suggesting that deep features related to digital forensics are general to a variety of forensics tasks.
To build a forensic feature extractor, we adapt the MISLnet CNN architecture, which has been utilized in a number of works that target different digital image forensics tasks including manipulation detection [6, 7, 27] and camera model identification [27, 20]. Briefly, this CNN consists of 5 convolutional blocks, labeled ‘conv1’ through ‘conv5’ in Fig. 2, and two fully connected layers labeled ‘fc1’ and ‘fc2’. Each convolutional block, with the exception of the first, contains a convolutional layer followed by batch normalization, activation, and finally a pooling operation. The two fully connected layers each consist of 200 neurons with hyperbolic tangent activation. Further details of this CNN are found in the work that introduced it.
To use this CNN as a deep feature extractor, an image patch is fed forward through the (trained) CNN. Then, the activated neuron values in the last fully connected layer, ‘fc2’ in Fig. 2, are recorded. These recorded neuron values are then used as a feature vector that represents high-level forensic information about the image. The extraction of deep features from an image patch is the mapping $f$ in (3), where the feature dimension $N = 200$ corresponds to the number of neurons in ‘fc2.’
The architecture of this CNN-based feature extractor is similar to the architecture we used in our prior work. However, in this work we alter the CNN architecture in three ways to improve the robustness of the feature extractor. First, we use full color image patches in RGB as input to the network, instead of just the green color channel used previously. Since many important forensic features are expressed across different color channels, it is important for the network to learn these feature representations. This is done by modifying each 5×5 convolutional kernel to be of dimension 5×5×3, where the last dimension corresponds to the image’s color channel. Second, we relax the constraint imposed on the first convolutional layer in prior work, which is used to encourage the network to learn prediction error residuals. While this constraint is useful for forensics tasks on single-channel image patches, it is not immediately translatable to color images. Third, we double the number of kernels in the first convolutional layer from 3 to 6, to increase the expressive power of the network.
The feature extractor architecture is depicted by each of the two identical ‘Feature Extractor’ blocks in Fig. 2. In our proposed system, we use two identical feature extractors, in ‘Siamese’ configuration, to map two input image patches $X_1$ and $X_2$ to feature vectors $f(X_1)$ and $f(X_2)$. This configuration ensures that the system is symmetric, i.e. the ordering of $X_1$ and $X_2$ does not matter. We refer to the Siamese feature extractor blocks as using hard sharing, meaning that the exact same weights and biases are shared between the two blocks.
III-A2 Training Methodology
In our proposed approach we first train the feature extractor during Learning Phase A. To do this, we add an additional fully-connected layer with softmax activation to the feature extractor architecture. We provide the feature extractor with image patches and labels associated with the forensic trace of each image patch. Then, we iteratively train the network using stochastic gradient descent with a cross-entropy loss function. Training is performed for 30 epochs with an initial learning rate of 0.001, which is halved every three epochs, and a batch size of 50 image patches.
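As a rough illustration of the step-decay schedule stated above, the following Python sketch computes the learning rate for each epoch. It assumes zero-indexed epochs and that the first halving takes effect at the start of the fourth epoch; the text does not specify the exact epoch boundaries, and the function name and parameters are hypothetical.

```python
def learning_rate(epoch, initial_lr=0.001, halve_every=3):
    """Learning rate for a zero-indexed epoch under step decay (assumed boundaries)."""
    return initial_lr * 0.5 ** (epoch // halve_every)

# Learning rates for the 30 epochs of Learning Phase A.
schedule = [learning_rate(epoch) for epoch in range(30)]
for epoch in (0, 3, 6, 29):
    print(epoch, schedule[epoch])
```

Under these assumptions the rate has been halved nine times by the final epoch, so it ends at 1/512 of its initial value.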
During Learning Phase A we train the feature extractor network on a closed set of forensic traces referred to as “known forensic traces.” Prior research found that training a CNN in this way yields deep-feature representations that are general to other forensic tasks. In particular, it was shown that when a feature extractor was trained for camera model identification, it was very transferable to other forensic tasks. Because of this, during Learning Phase A we train the feature extractor on a large set of image patches with labels associated with their source camera model.
In this work, we train two versions of the feature extractor network: one feature extractor that uses 256×256 image patches as input and another that uses 128×128 image patches as input. We note that decreasing the patch size further would require substantial architecture changes due to the pooling layers. In each case, we train the network using 2,000,000 image patches from the 50 camera models in the “Camera model set A” found in Table I.
The feature extractor is then updated again in Learning Phase B, as described below in Sec. III-B. This is significantly different from our previous work, where the feature extractor remains frozen after Learning Phase A. In our experimental evaluation in Sec. IV-A, we show that allowing the feature extractor network to update during Learning Phase B significantly improves system performance.
III-B Learning Phase B - Similarity Network
Here, we describe our proposed neural network architecture that maps a pair of forensic feature vectors $f(X_1)$ and $f(X_2)$ to a similarity score $S(f(X_1), f(X_2))$ as described in (4). The similarity score, when compared to a threshold, indicates whether the pair of image patches $X_1$ and $X_2$ have the same or different forensic traces. We call this proposed neural network the similarity network, which is depicted in the right-hand side of Fig. 2. Briefly, the network consists of 3 layers of neurons, which we view as a hierarchical mapping of two input feature vectors to successive feature spaces and ultimately an output score indicating forensic similarity.
The first layer of neurons, labeled by ‘fcA’ in Fig. 2, contains 2048 neurons with ReLU activation. This layer maps an input feature vector $f(X)$ to a new, intermediate feature space $f_{\text{inter}}(X)$. We use two identical ‘fcA’ layers, in Siamese (hard sharing) configuration, to map each of the input vectors $f(X_1)$ and $f(X_2)$ into $f_{\text{inter}}(X_1)$ and $f_{\text{inter}}(X_2)$.
This mapping for the kth value of the intermediate feature vector is calculated by an artificial neuron function:
$$f_{\text{inter},k}(X) = \phi\left(\sum_{i=1}^{N} w_{k,i} f_i(X) + b_k\right) \tag{6}$$
which is the weighted summation, with weights $w_{k,1}$ through $w_{k,N}$, of the elements in the deep-feature vector $f(X)$, bias term $b_k$ and subsequent activation by ReLU function $\phi(\cdot)$. The weights and bias for each element of $f_{\text{inter}}(X)$ are arrived at through stochastic gradient descent optimization as described below.
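Next, the second layer of neurons, labeled by ‘fcB’ in Fig. 2, contains 64 neurons with ReLU activation. As input to this layer, we create a vector
$$f_{\text{concat}}(X_1, X_2) = \begin{bmatrix} f_{\text{inter}}(X_1) \\ f_{\text{inter}}(X_2) \\ f_{\text{inter}}(X_1) \odot f_{\text{inter}}(X_2) \end{bmatrix} \tag{7}$$
that is the concatenation of $f_{\text{inter}}(X_1)$, $f_{\text{inter}}(X_2)$ and $f_{\text{inter}}(X_1) \odot f_{\text{inter}}(X_2)$, where $\odot$ is the element-wise product operation. This layer maps the input vector $f_{\text{concat}}(X_1, X_2)$ to a new ‘similarity’ feature space $f_{\text{sim}}(X_1, X_2)$ using the artificial neuron mapping described in (6). This similarity feature space encodes information about the relative forensic information between patches $X_1$ and $X_2$.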
Finally, a single neuron with sigmoid activation maps the similarity vector $f_{\text{sim}}(X_1, X_2)$ to a single score. We call this neuron the ‘similarity neuron,’ since it outputs a single score $S(f(X_1), f(X_2))$, where a small value indicates $X_1$ and $X_2$ contain different forensic traces, and larger values indicate they contain the same forensic trace. To make a decision, we compare the similarity score to a threshold $\eta$ typically set to 0.5.
The proposed similarity network architecture differs from our prior work in that we increase the number of neurons in ‘fcA’ from 1024 to 2048, and we add to the concatenation vector the element-wise multiplication of $f_{\text{inter}}(X_1)$ and $f_{\text{inter}}(X_2)$. Prior work showed that the element-wise product of feature vectors was powerful for speaker verification tasks in machine learning systems. These additions increase the expressive power of the similarity network, and as a result improve system performance.
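The forward pass of the similarity network described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: the weights are random and untrained, the layer sizes are toy values (the text uses 200 deep features, 2048 ‘fcA’ neurons and 64 ‘fcB’ neurons), and all function and variable names are hypothetical.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights, biases, activation):
    """Fully connected layer: one weighted sum plus bias per neuron, then activation."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def init_layer(rng, n_out, n_in):
    """Random (untrained) weights and zero biases, for illustration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def similarity_score(f1, f2, params):
    """Map two deep-feature vectors to a similarity score in (0, 1)."""
    w_a, b_a = params["fcA"]
    w_b, b_b = params["fcB"]
    w_o, b_o = params["out"]
    # Shared 'fcA' layer (hard sharing): the same weights process both inputs.
    inter1 = dense(f1, w_a, b_a, relu)
    inter2 = dense(f2, w_a, b_a, relu)
    # Concatenate both intermediate vectors with their element-wise product.
    concat = inter1 + inter2 + [a * b for a, b in zip(inter1, inter2)]
    sim = dense(concat, w_b, b_b, relu)
    # Single 'similarity neuron' with sigmoid activation.
    return dense(sim, w_o, b_o, sigmoid)[0]

rng = random.Random(0)
N, N_A, N_B = 4, 8, 3  # toy sizes chosen for this example
params = {
    "fcA": init_layer(rng, N_A, N),
    "fcB": init_layer(rng, N_B, 3 * N_A),
    "out": init_layer(rng, 1, N_B),
}
f1 = [0.2, -0.1, 0.7, 0.0]
f2 = [0.1, 0.4, -0.3, 0.9]
score = similarity_score(f1, f2, params)
decision = 1 if score > 0.5 else 0
print(score, decision)
```

Because the weights here are untrained, the printed score carries no forensic meaning; the sketch only shows how the shared layer, the concatenation, and the final neuron fit together.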
III-B2 Training Methodology
Here, we describe the second step of the forensic similarity system training procedure, called Learning Phase B. In this learning phase, we train the similarity network to learn a forensic similarity mapping for any type of measurable forensic trace, such as whether two image patches were captured by the same or different camera model, or manipulated by the same or different editing operation. We control which forensic traces are targeted by the system with the choice of training sample and labels provided during training.
Notably, during Learning Phase B, we allow the error to back propagate through the feature extractor and update the feature extractor weights. This allows the feature extractor to learn better feature representations associated with the type of forensic trace targeted in this learning phase. Allowing the feature extractor to update during Learning Phase B significantly differs from the implementation in our prior work, which used a frozen feature extractor.
We train the similarity network (and update the feature extractor simultaneously) using stochastic gradient descent for 30 epochs, with an initial learning rate of 0.005 which is halved every three epochs. The training samples and associated labels used in Learning Phase B are described in Sec. IV, where we investigate efficacy on different types of forensic traces.
|Camera model set A|
|Apple iPhone 4||Canon PC1730||Huawei Honor 5x||Nikon Coolpix S7000||Praktica DCZ5.9||Sony DSC-W800|
|Apple iPhone 4s||Canon Powershot A580||LG G2||Nikon Coolpix S710||Ricoh GX100||Sony DSC-WX350|
|Apple iPhone 5||Canon Powershot ELPH 160||LG G3||Nikon D200||Rollei RCP-7325XS||Sony DSC-H50|
|Apple iPhone 5s||Canon Powershot S100||LG Nexus 5x||Nikon D3200||Samsung Galaxy Note4||Sony DSC-T77|
|Apple iPhone 6||Canon Powershot SX530 HS||Motorola Droid Maxx||Nikon D7100||Samsung Galaxy S2||Sony NEX-5TL|
|Apple iPhone 6+||Canon Powershot SX420 IS||Motorola Droid Turbo||Panasonic DMC-FZ50||Samsung Galaxy S4|
|Apple iPhone 6s||Canon SX610 HS||Motorola X||Panasonic FZ200||Samsung L74wide|
|Agfa Sensor530s||Casio EX-Z150||Motorola XT1060||Pentax K-7||Samsung NV15|
|Canon EOS SL1||Fujifilm FinePix S8600||Nikon Coolpix S33||Pentax OptioA40||Sony DSC-H300|
|Camera model set B|
|Apple iPad Air 2||Canon Ixus70||Fujifilm FinePix XP80||LG Nexus 5||Olympus Stylus TG-860||Samsung Galaxy S3|
|Apple iPhone 5c||Canon PC1234||Fujifilm FinePix J50||Motorola Nexus 6||Panasonic TS30||Samsung Galaxy S5|
|Agfa DC-733s||Canon Powershot G10||HTC One M7||Nikon D70||Pentax OptioW60||Samsung Galaxy S7|
|Agfa DC-830i||Canon Powershot SX400 IS||Kodak EasyShare C813||Nikon D7000||Samsung Galaxy Note3||Sony A6000|
|Blackberry Leap||Canon T4i||Kodak M1063||Nokia Lumia 920||Samsung Galaxy Note5||Sony DSC-W170|
|Camera model set C|
|Agfa DC-504||Canon Powershot A640||LG Realm||Olympus mju-1050SW||Samsung Galaxy Note2|
|Agfa Sensor505x||Canon Rebel T3i||Nikon Coolpix S3700||Samsung Galaxy Lite||Samsung Galaxy S6 Edge|
|Canon Ixus55||LG Optimus L90||Nikon D3000||Samsung Galaxy Nexus||Sony DSC-T70|
Denotes from the Dresden Image Database 
III-C Patch Selection
Some image patches may not contain sufficient information to be reliably analyzed for forensics purposes. Here, we describe a method for selecting image patches that are appropriate for forensic analysis. In this paper we use an entropy-based selection method to filter out pairs of image patches prior to analyzing their forensic similarity. This filter is employed during evaluation only and not while training.
To do this, we view a forensic trace as an amount of information encoded in an image that has been induced by some processing operation. An image patch is a channel that contains this information. From this channel we extract forensic information, via the feature extractor, and then compare pairs of these features using the similarity network. Consequently, an image patch must have sufficient capacity in order to encode forensic information.
When evaluating pairs of image patches, we ensure that both patches have sufficient capacity to encode a forensic trace by measuring their entropy. Here, entropy $h$ is defined as
$$h = -\sum_{k=0}^{255} p_k \ln\left(p_k\right) \tag{8}$$
where $p_k$ is the probability that a pixel has luminance value $k$ in the image patch. Entropy $h$ is measured in nats. We estimate $p_k$ by measuring the proportion of pixels in an image patch that have luminance value $k$.
When evaluating image patches, we ensure that both image patches have entropy between 1.8 and 5.2 nats. We chose these values since 95% of image patches in our database fall within this range. Intuitively, the minimum threshold for our patch selection method eliminates flat (e.g. saturated) image patches, which would appear the same regardless of camera model or processing history. This method also removes patches with very high entropy. In this case, there is high pixel value variation in the image that may obfuscate the forensic trace.
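The entropy filter can be sketched in Python as below. The sketch assumes that each patch is already available as a flat list of 8-bit luminance values (how luminance is derived from RGB is not specified here) and that the 1.8 and 5.2 nat bounds are inclusive; the function names are hypothetical.

```python
import math
from collections import Counter

def patch_entropy(luminance):
    """Entropy in nats of the empirical luminance distribution of a patch."""
    counts = Counter(luminance)
    total = len(luminance)
    # p_k is estimated as the proportion of pixels with luminance value k.
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def is_suitable(luminance, low=1.8, high=5.2):
    """Keep a patch only if its entropy lies within the stated range."""
    return low <= patch_entropy(luminance) <= high

flat = [255] * 64                               # saturated patch, zero entropy
textured = [16 * (i % 16) for i in range(256)]  # 16 equally likely values
noisy = list(range(256))                        # all 256 values equally likely

for name, patch in (("flat", flat), ("textured", textured), ("noisy", noisy)):
    print(name, round(patch_entropy(patch), 3), is_suitable(patch))
```

In this toy example the saturated patch falls below the minimum threshold, the patch with all 256 values equally likely exceeds the maximum (ln 256 is roughly 5.55 nats), and only the moderately textured patch is kept.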
IV Experimental Evaluation
We conducted a series of experiments to test the efficacy of our proposed forensic similarity system in different scenarios. In these experiments, we tested our system accuracy in determining whether two image patches were 1) captured by the same or different camera model, 2) manipulated by the same or different editing operation, and 3) manipulated by the same or different manipulation parameter, given a particular editing operation. These scenarios were chosen for their variety in types of forensic traces and because those traces are targeted in forensic investigations [14, 15, 7]. Additionally, we conducted experiments that examined properties of the forensic similarity system, including: the effects of patch size and post-compression, comparison to other similarity measures, and the impact of network design and training procedure choices.
The results of these experiments show that our proposed forensic similarity system is highly accurate for comparing a variety of types of forensic traces across two image patches. Importantly, these experiments show this system is accurate even on “unknown” forensic traces that were not used to train the system. Furthermore, the experiments show that our proposed system significantly improves upon prior art, reducing error rates by over 50%.
To do this, we started with a database of 47,785 images collected from 95 different camera models, which are listed in Table I. Images from 26 camera models were collected as part of the Dresden Image Database “Natural images” dataset. The remaining 69 camera models were from our own database composed of point-and-shoot, cellphone, and DSLR cameras from which we collected at minimum 300 images with diverse and varied scene content. The camera models were split into three disjoint sets, A, B, and C. Images from A were used to train the feature extractor in Learning Phase A, images from A and B were used to train the similarity network in Learning Phase B, and images from C were used for evaluation only. First, set A was selected by randomly choosing 50 camera models from among those for which at least 40,000 non-overlapping 256×256 patches were available. Next, camera model set B was selected by randomly choosing 30 camera models, from among the remaining, which had at least 25,000 total non-overlapping 256×256 patches. Finally, the remaining 15 camera models were assigned to C.
In all experiments, we started with a pre-trained feature extractor that was trained from 2,000,000 randomly chosen image patches from camera models in A (40,000 patches per model) with labels corresponding to their camera model, as described in Sec. III. For all experiments, we started with this feature extractor since prior research showed that deep features related to camera model identification are a good starting point for extracting other types of forensic information.
Next, in each experiment we conducted Learning Phase B to target a specific type of forensic trace. To do this, we created a training dataset of pairs of image patches. These pairs were selected by randomly choosing 400,000 image patches of size 256×256 from images in camera model sets A and B, with 50% of patch pairs chosen from the same camera model, and 50% from different camera models. For experiments where the source camera model was compared, a label of 0 or 1 was assigned to each pair corresponding to whether they were captured by different or the same camera model. For experiments where we compared the manipulation type or manipulation parameter, these image patches were then further manipulated (as described in each experiment below) and a label was assigned indicating the same or different manipulation type/parameter. Training was performed using TensorFlow v1.10.0 with an Nvidia GTX 1080 Ti. Pre-trained models and example code for each experiment are available from the project repository at gitlab.com/MISLgit/forensic-similarity-for-digital-images and our laboratory website misl.ece.drexel.edu/downloads/.
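As a rough illustration of the balanced pair construction described above, the following sketch draws half of the pairs from the same camera model (label 1) and half from different camera models (label 0). The function name `sample_pairs` and the `patches_by_model` structure are assumptions made for this example only.

```python
import random


def sample_pairs(patches_by_model, n_pairs, seed=0):
    """Return (patch_1, patch_2, label) tuples; label 1 means same camera model.

    patches_by_model: dict mapping a camera model name to a list of patch
    identifiers (hypothetical input format).
    """
    rng = random.Random(seed)
    models = sorted(patches_by_model)
    pairs = []
    for i in range(n_pairs):
        if i % 2 == 0:
            # Same-model pair: two distinct patches from one camera model.
            model = rng.choice(models)
            p1, p2 = rng.sample(patches_by_model[model], 2)
            label = 1
        else:
            # Different-model pair: one patch from each of two distinct models.
            m1, m2 = rng.sample(models, 2)
            p1 = rng.choice(patches_by_model[m1])
            p2 = rng.choice(patches_by_model[m2])
            label = 0
        pairs.append((p1, p2, label))
    rng.shuffle(pairs)  # avoid an alternating label order during training
    return pairs
```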
To evaluate system performance, we created an evaluation dataset of 1,200,000 pairs of image patches, which were selected by randomly choosing 256×256 image patches from the 15 camera models in set C (“unknown” camera models not used in training). We also included image patches from 10 camera models randomly chosen from set A. One device from each of these 10 “known” camera models was withheld from training, and only images from these devices were used in this evaluation dataset. For experiments where we compared the manipulation type or manipulation parameter, the pairs of image patches in the evaluation dataset were then further manipulated (as described in each experiment below) and assigned a label indicating the same or different manipulation type/parameter.
IV-A Source Camera Model Comparison
In this experiment, we tested the efficacy of our proposed forensic similarity approach for determining whether two image patches were captured by the same or different camera model. To do this, during Learning Phase B we trained the similarity network using an expanded training dataset of 1,000,000 pairs of 256×256 image patches selected from camera models in A and B, with labels corresponding to whether the source camera model was the same or different. Evaluation was then performed on the evaluation dataset of 1,200,000 pairs chosen from camera models in A (known) and C (unknown).
Fig. 3 shows the accuracy of our proposed forensic similarity system, broken down by camera model pairing. The diagonal entries of the matrix show the correct classification rates of when two image patches were captured by the same camera model. The non-diagonal entries of the matrix show the correct classification rates of when two image patches were captured by different camera models. For example, when both image patches were captured by a Canon Rebel T3i our system correctly identified their source camera model as “the same” 98% of the time. When one image patch was captured by a Canon PowerShot A640 and the other image patch was captured by a Nikon CoolPix S710, our system correctly identified that they were captured by different camera models 99% of the time.
The overall classification accuracy for all cases was 94.00%. The upper-left region shows classification accuracy for when two image patches were captured by known camera models, Casio EX-Z150 through iPhone 6s. The total accuracy for the known versus known cases was 95.93%. The upper-right region shows classification accuracy for when one patch was captured by an unknown camera model, Agfa DC-504 through Sony Cybershot DSC-T70, and the other patch was captured by a known camera model. The total accuracy for the known versus unknown cases was 93.72%. The lower-right region shows classification accuracy for when both image patches were captured by unknown camera models. For the unknown versus unknown cases, the total accuracy was 92.41%. This result shows that while the proposed forensic similarity system performs better on known camera models, the system is accurate on image patches captured by unknown camera models.
In the majority of camera model pairs, our proposed forensic similarity system is highly accurate, achieving at least 95% accuracy in 257 of the 325 unique pairings of all camera models, and in 95 of the 120 possible pairs of unknown camera models. There are also certain pairs where the system does not achieve high comparison accuracy. Many of these cases occurred when two image patches were captured by similar camera models of the same manufacturer. As an example, when one camera model was an iPhone 6 and the other an iPhone 6s, the system only achieved a 26% correct classification rate. This was likely due to the similarity in hardware and processing pipeline of both of these cellphones, leading to very similar forensic traces. This phenomenon was also observed in the cases of Canon Powershot A640 versus Canon Ixus 55, any combination of LG phones, Samsung Galaxy S6 Edge versus Samsung Galaxy Lite, and Nikon Coolpix S3700 versus Nikon D3000.
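The pairing counts quoted above follow from counting unordered pairs of camera models with same-model pairs included. The helper below is a small sketch written for this illustration (its name is an assumption); it reproduces 325 pairings for the 25 evaluation camera models (10 known and 15 unknown) and 120 pairings for the 15 unknown camera models.

```python
from itertools import combinations_with_replacement


def count_unique_pairings(n_models):
    """Count unordered camera model pairs, including same-model pairs."""
    return sum(1 for _ in combinations_with_replacement(range(n_models), 2))


print(count_unique_pairings(25))  # 325 pairings among all 25 evaluation models
print(count_unique_pairings(15))  # 120 pairings among the 15 unknown models
```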
The results of this experiment show that our proposed forensic similarity system is effective at determining whether two image patches were captured by the same or different camera model, even when the camera models were unknown, i.e. not used to train the system. This experiment also shows that, while the system achieves high accuracy in most cases, there are certain pairs of camera models where the system does not achieve high accuracy, and this is often due to the underlying similarity of the camera models themselves.
IV-A1 Patch Size and Re-Compression Effects
A forensic investigator may encounter smaller image patches and/or images that have undergone additional compression. In this experiment, we examined the performance of our proposed system when presented with input images that have undergone a second JPEG compression and when the patch size is reduced to 128×128.
To do this, we repeated the above source camera model comparison experiment in four scenarios: input patches of size 256×256, input patches of size 128×128, JPEG re-compressed patches of size 256×256, and finally JPEG re-compressed patches of size 128×128. We first created copies of the training dataset and evaluation dataset in which each image was JPEG compressed with quality factor 95 before extracting patches. We then trained the similarity network (Learning Phase B) in each of the four scenarios. For experiments with 128×128 patches, we used the same 256×256 patches but cropped them so that only the top-left corner remained.
Source camera model comparison accuracy for each scenario is shown in Table II. The column indicates when both patches were from known camera models, when one patch was from a known camera model and one from an unknown camera model, and finally indicates when both were from unknown camera models. Generally, classification accuracy decreased when using the smaller 128×128 patches and when secondary JPEG compression was introduced. For example, total classification rates decreased from 93.61% for 256×256 patches to 92.64% for 128×128 patches without compression and from 91.83% for 256×256 patches to 88.63% for 128×128 patches with compression.
The results of this experiment show that operating at a finer resolution and/or introducing secondary JPEG compression negatively impacts source camera model comparison accuracy. However, the proposed system is still able to operate at a relatively high accuracy in these scenarios.
IV-A2 Other Approaches
In this experiment, we compared the accuracy of our proposed approach to other approaches including distance metrics, support vector machines (SVM), extremely randomized trees (ER Trees), and prior art in .
For the machine learning approaches, we trained each method on deep features of the training dataset extracted by the feature extractor after Learning Phase A. We did this to emulate Learning Phase B where the machine learning approach is used in place of our proposed similarity network. We compared a support vector machine (SVM) with RBF kernel , , and an extremely randomized trees (ER Trees) classifier with 800 estimators and minimum split depth of 3. We also compared to the method proposed in , and used the same training and evaluation data as with our proposed method.
For the distance measures, we extracted deep features from the evaluation set after Learning Phase B. This was done to give a more fair comparison to the machine learning systems, which have the benefit of the additional training data. We measured the distance between each pair of deep features and compared to a threshold. The threshold for each approach was chosen to be the one that maximized total accuracy.
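The thresholding procedure for the fixed distance measures can be sketched as follows. This is a minimal illustration only: the three distance functions use standard textbook definitions, the function names are assumptions, and the exhaustive threshold search is one simple way to maximize total accuracy rather than the exact routine used in the experiments.

```python
import math


def two_norm(u, v):
    """Euclidean (2-norm) distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def infinity_norm(u, v):
    """Largest absolute difference across feature dimensions."""
    return max(abs(a - b) for a, b in zip(u, v))


def bray_curtis(u, v):
    """Bray-Curtis distance; assumes the denominator is non-zero."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(abs(a + b) for a, b in zip(u, v))
    return numerator / denominator


def best_threshold(distances, labels):
    """Pick the threshold that maximizes total accuracy.

    A pair is predicted 'same' (label 1) when its distance is <= threshold.
    """
    candidates = [float("-inf")] + sorted(set(distances))
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum((d <= t) == bool(y) for d, y in zip(distances, labels))
        accuracy = correct / len(labels)
        if accuracy > best_acc:
            best_t, best_acc = t, accuracy
    return best_t, best_acc
```

Because the threshold is tuned directly on the evaluated pairs, the resulting accuracy is an optimistic estimate for each distance measure.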
The total classification accuracy achieved on the evaluation set is shown in Table III, with the proposed system accuracy of 93.61% shown for reference. For the fixed distance measures, the 2-Norm distance achieved the highest accuracy of 93.28%, and the Infinite Norm distance achieved the lowest accuracy, among those tested, at 91.73%. The Bray-Curtis distance, which was used in  to cluster image patches based on source camera model, achieved an accuracy of 92.57%.
For the learned measures, the ER Trees classifier achieved an accuracy of 92.44% and the SVM achieved an accuracy of 92.84%, both lower than our proposed similarity system. We also compared against the system proposed in our previous work , which achieved a total accuracy of 85.70%. The results of this experiment show that our proposed system outperforms other distance measures and learned similarity measures. The experiment also shows the system proposed in this paper significantly improves upon prior work in , and decreased the system error rate by over 50%.
IV-A3 Training methods
In this experiment, we examined the effects of two design aspects in the Learning Phase B training procedure. In particular, these aspects are 1) allowing the feature extractor to update, i.e. unfrozen during training, and 2) using a diverse training dataset. This experiment was conducted to explicitly compare to the training procedure in , where the feature extractor was not updated (frozen) in Learning Phase B and only a subset of available training camera models were used.
To do this, we created an additional training database of 400,000 image patch pairs of size 256×256, mimicking the original training dataset, but containing only image patches captured by camera models in set B. This was done since the procedure in  specified that Learning Phase B be conducted on camera models that were not used in Learning Phase A. We refer to this as training set B, and the original training set as AB. We then performed Learning Phase B using each of these datasets. Furthermore, we repeated each training scenario with the learning rate multiplier in each layer of the feature extractor set to 0, i.e. with the feature extractor frozen. This was done to compare to the procedure in  which used a frozen feature extractor.
The overall accuracy achieved by each of the four scenarios is shown in Table IV. When training on set B with a frozen feature extractor, which is the same procedure used in , the total accuracy on the evaluation image patches was 90.24%. When allowing the feature extractor to update, accuracy increased by 0.72 percentage points to 90.96%. When increasing training data diversity to camera model set AB, but using a frozen feature extractor, the accuracy achieved was 92.56%. Finally, when using a diverse dataset and an unfrozen feature extractor, the total accuracy achieved was 93.61%.
The results of this experiment show that our proposed training procedure is a significant improvement over the procedure used in , improving accuracy by 3.37 percentage points. Furthermore, we can see the added benefit of our proposed architecture enhancements when comparing to the result MS‘18 in Table III, which uses both the training procedure and system architecture of . Improving the system architecture alone raised classification rates from 85.70% to 90.24%. Improving the training procedure further raised classification rates to 93.61%, together reducing the error rate by more than half.
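Table IV: Accuracy for each Learning Phase B training scenario.
|Training Data||Feature Extractor||Accuracy|
|B||Frozen||90.24%|
|B||Unfrozen||90.96%|
|AB||Frozen||92.56%|
|AB||Unfrozen||93.61%|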
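The claim that the error rate was reduced by more than half can be checked with simple arithmetic on the reported accuracies. The helper below is a small sketch written for this check; its name is an assumption.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fractional drop in error rate when accuracy (in percent) improves."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


# Prior system MS'18 (85.70%) versus the proposed system (93.61%):
# the error rate falls from 14.30% to 6.39%.
print(round(relative_error_reduction(85.70, 93.61), 3))  # 0.553
```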
IV-B Editing Operation Comparison
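Table V: Manipulation types and their parameters (surviving rows only; parameter values were not recovered).
|Manipulation||Parameter|
|Resizing (bilinear)||Scaling factor|
|Gaussian blur (5×5)||
|Median blur||Kernel size|
|JPEG compression||Quality factor|
|Adaptive Hist. Eq.||
|Wiener filter||Kernel size|
|Salt + pepper noise||Percent|
Table VI: Correct classification rates by manipulation pairing (column headers only; values were not recovered).
|Known Manipulations||Unknown Manipulations|
|Manip. Type||Orig.||Resize||Gauss. Blur||Med. Blur||AWGN||JPEG||Sharpen||Hist. Eq.||Wiener||Web Dither||Salt Pepper|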
A forensic investigator is often interested in determining whether two image patches have the same processing history. In this experiment, we investigated the efficacy of our proposed approach for determining whether two image patches were manipulated by the same or different editing operation, including “unknown” editing operations not used to train the system.
To do this, we started with the training database of image patch pairs. We then modified each patch with one of the eight “known” manipulations in Table V, with a randomly chosen editing parameter. We manipulated 50% of the image patch pairs with the same editing operation, but with a different parameter, and manipulated 50% of the pairs with different editing operations. The known manipulations were the same manipulations used in  and . We repeated this for the evaluation database, using both the “known” and “unknown” manipulations. Wiener filtering was performed using the SciPy Python library, web dithering was performed using the Python Imaging Library, and salt and pepper noise was added using the scikit-image image processing toolbox (skimage). We note that the histogram equalization and JPEG compression manipulations were performed on the whole image. We then performed Learning Phase B using the manipulated training database, with labels associated with each pair corresponding to whether they have been manipulated by the same or different editing operation. Finally, we evaluated accuracy on the evaluation dataset, with patches processed in a similar manner.
Table VI shows the correct classification rates of our proposed forensic similarity system, broken down by manipulation pairing. The first eight columns show rates for when one patch was edited with a known manipulation and the other patch was edited with an unknown manipulation. The last three columns show rates for when both patches were edited by unknown manipulations. For example, when one image patch was manipulated with salt and pepper noise and the other patch was manipulated with histogram equalization, our proposed system correctly identified that they have been edited by different manipulations at a rate of 95%. When both image patches were edited with Wiener filtering, our proposed system correctly identified that they were edited by the same manipulation at a rate of 96%. The total accuracy for the known versus known cases was 97.0%, but these cases are not shown for the sake of brevity.
There are certain pairs of manipulations for which the proposed system does not achieve high comparison accuracy. These include Wiener filtering versus Gaussian blurring, web dithering versus sharpening, salt and pepper versus sharpening, and web dithering versus salt and pepper noise. The first example is likely due to the smoothing similarities between Wiener filtering and Gaussian blurring. The latter cases are likely due to the addition of similar high frequency artifacts introduced by the sharpening, web dithering, and salt and pepper manipulations. Despite these cases, our proposed system achieves high accuracy in the majority of manipulation pairs, even when one or both manipulations are unknown.
The results of this experiment demonstrate that our proposed forensic similarity system is effective at comparing the processing history of image patches, even when image patches have undergone an editing operation that was unknown, i.e. not used during training.
IV-C Editing Parameter Comparison
In this experiment, we investigated the efficacy of our proposed approach for determining whether two image patches have been manipulated by the same or different manipulation parameter. Specifically, we examined pairs of image patches that had been resized by the same scaling factor or that had been resized by different scaling factors, including “unknown” scaling factors that were not used during training. This type of analysis is important when analyzing spliced images where both the host image and foreign content were resized, but the foreign content was resized by a different factor.
To do this, we started with the training database of image patch pairs. We then resized each patch with one of the seven “known” resizing factors in  using bilinear interpolation. We resized 50% of the image patch pairs with the same scaling factor, and resized 50% of the pairs with different scaling factors. We repeated this for the evaluation database, using both the “known” scaling factors and “unknown” scaling factors in. We then performed Learning Phase B using the training database of resized image patches, with labels corresponding to whether each pair of image patches was resized by the same or different scaling factor.
The correct classification rates of our proposed approach are shown in Fig. 4, broken down by tested resizing factor pairings. For example, when one image patch was resized by a factor of 0.8 and the other image patch was resized by a factor of 1.4, both unknown scaling factors, our proposed system correctly identified that the image patches were resized by different scaling factors at a rate of 99%. Cases where at least one patch has been resized with an unknown scaling factor are highlighted in blue. Cases where both patches have been resized with an unknown scaling factor are outlined in red.
Our system achieves greater than 90% correct classification rates in 33 of 45 tested scaling factor pairings. There are also some cases where our proposed system does not achieve high accuracy. These cases tend to occur when presented with image patches that have been resized with different but similar resizing factors. For example, when resizing factors of 1.4 and 1.3 are used, the system correctly identifies the scaling factor as different 12% of the time.
The results of this experiment show that our proposed approach is effective at comparing the manipulation parameter in two image patches, a third type of forensic trace. This experiment shows that our proposed approach is effective even when one or both image patches have been manipulated by an unknown parameter of the editing operation not used in training.
V Practical applications
The forensic similarity approach is a versatile technique that is useful in many different practical applications. In this section, we demonstrate how forensic similarity is used in two types of forensic investigations: image forgery detection and localization, and image database consistency verification.
V-A Forgery detection and localization
Here we demonstrate the utility of our proposed forensic similarity system in the important forensic analysis of forged images. In forged images, an image is altered to change its perceived meaning. This can be done by inserting foreign content from another image, as in a splicing forgery, or by locally manipulating a part of the image. Forging an image inherently introduces a localized inconsistency of the forensic traces in the image. We demonstrate that our proposed similarity system detects and localizes the forged region of an image by exposing that it has a different forensic trace than the rest of the image. Importantly, we show that our proposed system is effective on “in-the-wild” forged images, which are visually realistic and have been downloaded from a popular social media website.
We do this on three forged images that were downloaded from www.reddit.com, for which we also have access to the original version. First, we subdivided each forged image into image patches with 50% overlap. Next, we selected one image patch as a reference patch and calculated the similarity score to all other image patches in the image. We used the similarity system trained in Sec. IV-A to determine whether two image patches were captured by the same or different camera model with secondary JPEG compression. We then highlighted the image patches with similarity scores less than a threshold, i.e. those that contain a different forensic trace than the reference patch.
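The localization procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the default threshold of 0.5, and the `similarity` callable (a stand-in for the trained similarity network applied to two patches) are hypothetical and are not taken from the released code.

```python
def patch_origins(height, width, patch=256, overlap=0.5):
    """Top-left (row, col) of each patch on a grid with the given overlap."""
    stride = int(patch * (1 - overlap))
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


def flag_dissimilar(origins, reference, similarity, threshold=0.5):
    """Return the patch origins judged forensically different from the reference.

    similarity(origin_a, origin_b) returns a score in [0, 1]; here it is a
    placeholder for the similarity network evaluated on the two patches.
    """
    return [o for o in origins
            if o != reference and similarity(reference, o) < threshold]


# Toy usage: pretend every patch starting at column 384 or beyond is spliced.
def toy_similarity(a, b):
    return 1.0 if (a[1] >= 384) == (b[1] >= 384) else 0.0


origins = patch_origins(512, 768)
flagged = flag_dissimilar(origins, (0, 0), toy_similarity)
```

Choosing the reference patch from the other region would flag the complementary set of patches, which reflects that the approach only exposes a difference in forensic traces and not which region is authentic.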
Results from this procedure on the first forged image are shown in Fig. 5. The original image is shown in Fig. 5a. The spliced version is shown in Fig. 5b, where an actor was spliced into the image. When we selected a reference patch from the host (original) part of the image, the image patches in the spliced regions were highlighted as forensically different, as shown in Fig. 5c. We note that our forensic similarity based approach is agnostic to which regions are forged and which are authentic; it indicates only that they have different forensic signatures. This is seen in Fig. 5d, where we selected a spliced image patch as the reference patch. Fig. 5e shows that when we performed this analysis on the original image, our forensic similarity system did not find any patches that were forensically different from the reference patch.
The second row of Fig. 5 shows forensic similarity analysis using networks trained under different scenarios. Results using the network trained to determine whether two image patches have the same or different source camera model without JPEG post-compression are shown in Fig. 5f for patch size 256×256, in Fig. 5g for patch size 128×128, and with JPEG post-compression in Fig. 5h for patch size 128×128. The result using the network trained to determine whether two image patches have been manipulated by the same or different manipulation type is shown in Fig. 5i.
Results from the splicing detection and localization procedure on a second forged image are shown in Fig. 6, where a set of toys was spliced into an image of a meeting of government officials. When we selected reference patches from the host image, the spliced areas were correctly identified as containing different forensic traces, exposing the image as forged. This is seen in Fig. 6b with 256×256 patches, and in Fig. 6c with 128×128 patches. The 128×128 case showed better localization of the spliced region and additionally identified the yellow airplanes as different from the host image, which were not identified by the similarity system using larger patch sizes.
In a final example, shown in Fig. 7, the raindrop stains on a man’s shirt were edited out of the image by a forger using a brush tool. When we selected a reference patch from the unedited part of the image, the manipulated regions were identified as forensically different, exposing the tampered region of the image. This is seen in Fig. 7c with 256×256 patches, and in Fig. 7d with 128×128 patches. In the 128×128 case, the smaller patch size was able to correctly expose that the man’s shirt sleeve was also edited.
The results presented in this section show that our proposed forensic similarity based approach is a powerful technique that can be used to detect and localize image forgeries. These results showed that just correctly identifying that forensic differences existed in the images was sufficient to expose the forgery, even though the technique did not identify the particular forensic traces in the image.
V-B Database consistency verification
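Table VII: Correct classification rates for each database type, by threshold (column headers only; values were not recovered).
|Threshold||Type 0||Type 1||Type 2|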
In this section, we demonstrate that the forensic similarity system detects whether the images in a database were all captured by the same camera model, or by different camera models. This is an important task for social media websites, where some accounts illicitly steal copyrighted content from many different sources. We refer to these accounts as “content aggregators”, who upload images captured by many different camera models. This type of account contrasts with “content generator” accounts, who upload images captured by one camera model. In this experiment, we show how forensic similarity is used to differentiate between these types of accounts.
To do this, we generated three types of databases of images. Each database contained images and was assigned to one of three “Types.” Type 0 databases contained images taken by the same camera model, i.e. a content generator database. Type 1 databases contained images taken by one camera model, and 1 image taken by a different camera model. Finally, Type 2 databases contained images, each taken by different camera models. We consider the Type 1 case the hardest to differentiate from a Type 0 database, whereas the Type 2 case is the easiest to detect. We created 1000 of each database type from images captured by camera models in set C, i.e. unknown camera models not used in training, with the camera models randomly chosen for each database.
To classify a database as consistent or inconsistent, we examined each unique image pairing in the database. For each image pair, we randomly selected 256×256 image patches from each image and calculated the similarity scores across the two images. Similarity was calculated using the similarity network trained in Sec. IV-A. Then, we calculated the median value of scores for each whole-image pair. For image pairs captured by the same camera model this value is high, and for two images captured by different camera models this value is low. We then compared the th lowest value calculated from the entire database to a threshold. If this th lowest value is above the threshold, we consider the database to be consistent, i.e. from a content generator. If this value is below the threshold, then we consider the database to be inconsistent, i.e. from a content aggregator.
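The decision rule described above can be sketched as follows. This is only an illustrative example: the function name, the `pair_scores` callable (a stand-in for applying the similarity network to randomly selected patch pairs from two images), the default threshold, and the rank parameter `k` are all assumptions, since the rank and the number of patches per image are not specified in this passage.

```python
from itertools import combinations
from statistics import median


def is_consistent(images, pair_scores, k, threshold=0.5):
    """Classify an image database as consistent (True) or inconsistent (False).

    images: list of image identifiers.
    pair_scores(img_a, img_b): returns a list of patch-level similarity
        scores between the two images (placeholder for the similarity network).
    k: which lowest per-pair median to test, with 1 meaning the lowest
        (hypothetical parameter; the value used in the experiments is not
        given here).
    """
    # One median similarity score per unique image pairing.
    medians = sorted(median(pair_scores(a, b))
                     for a, b in combinations(images, 2))
    # Consistent only if the k-th lowest median clears the threshold.
    return medians[k - 1] > threshold
```

Using the median over patch pairs makes each whole-image comparison less sensitive to a few unreliable patches, such as saturated or low-texture regions.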
Table VII shows the rates at which we correctly classify Type 0 databases as “consistent” (i.e. all from the same camera model) and Type 1 and Type 2 databases as “inconsistent”, with images per database, and patches chosen from each image. For a threshold 0.5, we correctly classified 92.4% of Type 0 databases as consistent, and incorrectly classified 8.1% of Type 1 databases as consistent. This incorrect classification rate of Type 1 databases is decreased by increasing the threshold. Even at a very low threshold of 0.1, our system correctly identified all Type 2 databases as inconsistent.
The results of this experiment show that our proposed forensic similarity system is effective for verifying the consistency of an image database, an important type of practical forensic analysis. Since none of the images used in this experiment were from camera models used to train the forensic system, an identification type approach would not have been appropriate. In this application it was not important to identify which camera models were used in a particular database, but whether the images came from the same or different camera models.
VI Conclusion
In this paper we proposed a new digital image forensics technique, called forensic similarity, which determines whether two image patches contain the same or different forensic traces. The main benefit of this approach is that prior knowledge, e.g. training samples, of a forensic trace is not required to make a forensic similarity decision on it. To do this, we proposed a two-part deep-learning system composed of a CNN-based feature extractor and a three-layer neural network, called the similarity network, which maps pairs of image patches onto a score indicating whether they contain the same or different forensic traces. We experimentally evaluated the performance of our approach in three types of common forensic scenarios, which showed that our proposed system was accurate in a variety of settings. Importantly, the experiments showed this system is accurate even on “unknown” forensic traces that were not used to train the system and that our proposed system significantly improved upon prior art in , reducing error rates by over 50%. Furthermore, we demonstrated the utility of the forensic similarity approach in two practical applications: forgery detection and localization, and image database consistency verification.
-  M. C. Stamm, M. Wu, and K. Liu, “Information forensics: An overview of the first decade,” Access, IEEE, vol. 1, pp. 167–200, 2013.
-  J. Chen, X. Kang, Y. Liu, and Z. J. Wang, “Median filtering forensics based on convolutional neural networks,” IEEE Signal Processing Letters, vol. 22, no. 11, pp. 1849–1853, 2015.
-  B. Bayar and M. C. Stamm, “On the robustness of constrained convolutional neural networks to jpeg post-compression for image resampling detection,” in ICASSP, 2017 IEEE. IEEE, 2017, pp. 2152–2156.
-  J. Bunk, J. H. Bappy, T. M. Mohammed, L. Nataraj, A. Flenner, B. Manjunath, S. Chandrasekaran, A. K. Roy-Chowdhury, and L. Peterson, “Detection and localization of image forgeries using resampling features and deep learning,” in Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2017, pp. 1881–1889.
-  X. Zhu, Y. Qian, X. Zhao, B. Sun, and Y. Sun, “A deep learning approach to patch-based image inpainting forensics,” Signal Processing: Image Communication, 2018.
-  B. Bayar and M. C. Stamm, “A deep learning approach to universal image manipulation detection using a new convolutional layer,” in Workshop on Info. Hiding and Multimedia Sec. ACM, 2016, pp. 5–10.
-  ——, “Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection,” IEEE Transactions on Information Forensics and Security, 2018.
-  M. Boroumand and J. Fridrich, “Deep learning for detecting processing history of images,” Electronic Imaging, vol. 2018, no. 7, pp. 1–9, 2018.
-  B. Bayar and M. C. Stamm, “Towards order of processing operations detection in jpeg-compressed images with convolutional neural networks,” Electronic Imaging, vol. 2018, no. 7, pp. 1–9, 2018.
-  M. Barni, L. Bondi, N. Bonettini, P. Bestagini, A. Costanzo, M. Maggini, B. Tondi, and S. Tubaro, “Aligned and non-aligned double jpeg detection using convolutional neural networks,” Journal of Visual Communication and Image Representation, vol. 49, pp. 153–163, 2017.
-  I. Amerini, T. Uricchio, L. Ballan, and R. Caldelli, “Localization of JPEG double compression through multi-domain convolutional neural networks,” in IEEE CVPR Workshop on Media Forensics, vol. 3, 2017.
-  A. Tuama, F. Comby, and M. Chaumont, “Camera model identification with the use of deep convolutional neural networks,” in Information Forensics and Security (WIFS), Workshop on. IEEE, 2016, pp. 1–6.
-  L. Bondi, D. Güera, L. Baroffio, P. Bestagini, E. J. Delp, and S. Tubaro, “A preliminary study on convolutional neural networks for camera model identification,” Electronic Imaging, vol. 2017, no. 7, pp. 67–76, 2017.
-  L. Bondi, L. Baroffio, D. Güera, P. Bestagini, E. J. Delp, and S. Tubaro, “First steps toward camera model identification with convolutional neural networks,” IEEE Signal Processing Letters, pp. 259–263, 2017.
-  B. Bayar and M. C. Stamm, “Augmented convolutional feature maps for robust CNN-based camera model identification,” in Image Processing (ICIP), 2017 IEEE International Conference on. IEEE, 2017, pp. 1–4.
-  I. Amerini, T. Uricchio, and R. Caldelli, “Tracing images back to their social network of origin: a CNN-based approach,” in Information Forensics and Security (WIFS), Workshop on. IEEE, 2017, pp. 1–5.
-  B. Bayar and M. C. Stamm, “Towards open set camera model identification using a deep learning framework,” in Acoustics, Speech and Signal Processing (ICASSP), Int. Conference on. IEEE, 2018, pp. 1–4.
-  L. Bondi, S. Lameri, D. Güera, P. Bestagini, E. J. Delp, and S. Tubaro, “Tampering detection and localization through clustering of camera-based CNN features,” in Computer Vision and Pattern Recognition Workshops. IEEE, 2017, pp. 1855–1864.
-  O. Mayer and M. C. Stamm, “Learned forensic source similarity for unknown camera models,” in Acoustics, Speech and Signal Processing (ICASSP), Int. Conference on. IEEE, 2018, pp. 1–4.
-  O. Mayer, B. Bayar, and M. C. Stamm, “Learning unified deep-features for multiple forensic tasks,” in Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security. ACM, 2018, pp. 1–6.
-  B. Bayar and M. C. Stamm, “A generic approach towards image manipulation parameter estimation using convolutional neural networks,” in Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security. ACM, 2017, pp. 147–157.
-  J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, “Deep learning for content-based image retrieval: A comprehensive study,” in International Conference on Multimedia. ACM, 2014, pp. 157–166.
-  G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), Int. Conference on. IEEE, 2016, pp. 5115–5119.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition,” in International Conference on Machine Learning, 2014, pp. 647–655.
-  B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using Places database,” in Advances in Neural Information Processing Systems, 2014, pp. 487–495.
-  O. A. Penatti, K. Nogueira, and J. A. dos Santos, “Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?” in Proc. of the IEEE CVPR Workshops, 2015, pp. 44–51.
-  B. Bayar and M. C. Stamm, “Design principles of convolutional neural networks for multimedia forensics,” Electronic Imaging, pp. 77–86, 2017.
-  J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a ‘siamese’ time delay neural network,” in Advances in Neural Information Processing Systems, 1994, pp. 737–744.
-  H.-S. Lee, Y. Tso, Y.-F. Chang, H.-M. Wang, and S.-K. Jeng, “Speaker verification using kernel-based binary classifiers with binary operation derived features,” in Acoustics, Speech and Signal Processing (ICASSP), International Conference on. IEEE, 2014, pp. 1660–1664.
-  T. Gloe and R. Böhme, “The ‘Dresden Image Database’ for benchmarking digital image forensics,” in Proceedings of the 2010 ACM Symposium on Applied Computing. ACM, 2010, pp. 1584–1590.
-  D. Güera, S. K. Yarlagadda, P. Bestagini, F. Zhu, S. Tubaro, and E. J. Delp, “Reliability map estimation for CNN-based camera model attribution,” arXiv preprint arXiv:1805.01946, 2018.