Image Forensics: Detecting duplication of scientific images with manipulation-invariant image similarity

by M. Cicconet, et al.

Manipulation and re-use of images in scientific publications is a concerning problem that currently lacks a scalable solution. Current tools for detecting image duplication are mostly manual or semi-automated, despite the availability of an overwhelming target dataset for a learning-based approach. This paper addresses the problem of determining if, given two images, one is a manipulated version of the other by means of copy, rotation, translation, scale, perspective transform, histogram adjustment, or partial erasing. We propose a data-driven solution based on a 3-branch Siamese Convolutional Neural Network. The ConvNet model is trained to map images into a 128-dimensional space, where the Euclidean distance between duplicate images is smaller than or equal to 1, and the distance between unique images is greater than 1. Our results suggest that such an approach has the potential to improve surveillance of the published and in-peer-review literature for image manipulation.




1 Introduction

Duplicative data reporting in the biomedical literature is more prevalent than most people are aware [1]. One common form of data duplication, regardless of intent, is the re-use of scientific images, across multiple publications or even within the same publication. In some cases, images are altered before being re-used [1]. Changing orientation, perspective or image statistics, introducing skew or crop, and deleting or inserting data into the original image plane are all ways in which image data may be altered prior to inappropriate introduction, or re-introduction, into the reporting of experimental outcomes [10, 4, 2]. While the scientific community has affirmatively recognized the need for preventing the incorporation of duplicative or flawed image data into the scientific record, a consistent approach to screening and identifying problematic image data has yet to be established [8, 9].

Cases of image data duplication and/or manipulation have often been detected by fellow scientists or by editorial staff during the manuscript review process. Efforts to move towards automation include tools developed to isolate regions of manipulation within images already flagged as suspicious [7]. However, current methods for identifying duplicative and/or manipulated images largely rely on individual visual identification with accompanying application of qualitative similarity measures. Given the rate at which the scientific literature is expanding, it is not feasible for all cases of potential image manipulation to be detected by human eyes. Thus, there is a continued need for automated tools to detect potential duplications, even in the presence of manipulation, to allow for more focused, thorough evaluation of the resulting smaller pool of candidate images. Such a tool would be invaluable to scientists and research staff on many levels, from figure screening as a step in improving raw data maintenance and manuscript preparation at the laboratory level [10], to the routine screening by journal editorial staff of submitted manuscripts prior to the peer-review process [8, 5].

The general problem of detecting similar images has been well studied in the field of computer vision. For example, the challenge of determining if two images contain the same human subject, despite large changes in orientation and lighting, is closely related to the problem we wish to address. Recent breakthroughs in deep Convolutional Neural Networks (ConvNets) have driven rapid progress in this area.


In this paper, we apply modern methods from facial recognition to address the problem of detecting image manipulation and re-use in scientific work. Specifically, we train a ConvNet to learn an image embedding, such that images with the same original content, albeit manipulated through a common set of image manipulations, appear close to each other in the embedding space. We train this model on a large corpus of simulated image manipulations, and test on a small set of 54 manipulated images from known instances of image duplication/manipulation (the test images were previously described as problematic and either corrected or retracted from the literature). To our knowledge, this is the first application of deep learning to the detection of image re-use in the scientific literature.

An overview of the paper is as follows: In Section 2 we review methods for image similarity based on deep learning that influenced this work. In Subsection 3.1, we present the model architecture that is trained for image embedding. In Subsection 3.2, we discuss the triplet loss function used to train the model. In Subsection 3.3, we describe how we generated a large training set of simulated image manipulations, and trained the model. Finally, in Section 4 we present results of the model on real cases of image duplication and/or manipulation, and show the learned embedding.

2 Related Work

This work is primarily based on [3], [11], and [6].

The classic model for image similarity was proposed in [3] in the context of face verification: a siamese neural network. This network has two branches that share parameters during training. Each branch is composed of layers of convolutions and non-linearities followed by fully connected layers. The two branches are connected at the bottom by a norm of the difference between their outputs. During training, pairs of images known to be similar or dissimilar are fed to the network, and the loss function is designed to encourage the network to learn a representation that makes the distance between the two representations small or large, respectively.

In [11], the authors improved upon the standard siamese network model by adding one extra branch, thus training on image triplets instead of pairs. A triplet consists of an anchor, a positive example (“same” or “similar” to the anchor image), and a negative example (“different” from the anchor). A triplet loss was designed to drive similar images to be nearby, and dissimilar images to be far apart, encouraging the embedding space to be locally Euclidean. A clever trick that enables fast convergence is the use of hard negative mining: selecting examples where “different” images are close according to the current metric, and “similar” images are far apart.
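The triplet idea described above can be illustrated with a minimal sketch in plain Python. This is not the implementation of [11]: it assumes a margin-based loss of the form max(0, ‖a − p‖² − ‖a − n‖² + margin) and, for a given anchor, picks the closest "different" candidate as the hard negative. The function names (`sq_dist`, `triplet_loss`, `hardest_negative`), the margin value, and the toy embeddings are hypothetical choices for this example.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss: zero once the negative is farther from the
    anchor than the positive by at least `margin` (in squared distance)."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def hardest_negative(anchor, candidates):
    """Hard negative mining: pick the 'different' embedding that is
    currently closest to the anchor."""
    return min(candidates, key=lambda c: sq_dist(anchor, c))


# Toy 2-D embeddings (assumed values, for illustration only).
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
negatives = [[2.0, 0.0], [0.3, 0.0], [0.0, 5.0]]

hard = hardest_negative(anchor, negatives)
easy_loss = triplet_loss(anchor, positive, negatives[0])  # far negative
hard_loss = triplet_loss(anchor, positive, hard)          # close negative
```

In this toy setting the far negative contributes no loss, while the mined hard negative still produces a positive loss, which is why mining such examples speeds up training.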

In [6], the authors kept the 2-branch architecture, but used a non-conventional “metric” (possibly assuming negative values) at the connection of the two branches, with a cross-entropy loss function. This approach allows for the model to learn a function that gives a binary output, rather than a distance between images, which has the advantage of not requiring the user to establish a threshold of proximity for images to be the same, as required in [3] and [11].

For our application, we found the binary output option to be more interesting from a user’s perspective, since the threshold for “sameness” can be difficult to set properly. However, experiments with the loss function proposed in [6] led us to abandon it due to its instability for images that are actually the same. We settled on a modification that enforces a threshold of 1, beyond which images are considered different, and let the network learn the appropriate scaling required for the metric to comply with such separation. We borrow the triplet loss strategy from [11], for faster training.

3 Model

We aim to solve the following problem: given two images, $x_1$ and $x_2$, determine if they are the same or different, where $x_2$ is considered to be the same as $x_1$ if it is a manipulated version of $x_1$. We sought to find a solution to this problem in the form of a function $d$, an image-forensic metric, that computes a distance between two images, satisfying:

  • $d(x_1, x_2) \geq 0$;

  • $d(x_1, x_2) \leq 1$ when $x_2$ is a manipulated version of $x_1$;

  • $d(x_1, x_2) > 1$ when $x_1$ and $x_2$ are different images.
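The three conditions above translate into a simple decision rule. The following is a minimal sketch, assuming the metric is a Euclidean distance between embedding vectors; the embeddings are made-up toy values, and `forensic_distance` and `is_same` are hypothetical names.

```python
import math

THRESHOLD = 1.0  # distances at or below this value mean "same"


def forensic_distance(emb_a, emb_b):
    """Euclidean distance between two embedding vectors (always >= 0)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))


def is_same(emb_a, emb_b, threshold=THRESHOLD):
    """True when the second image is judged a manipulated copy of the first."""
    return forensic_distance(emb_a, emb_b) <= threshold


original = [0.2, 0.9, -0.4]
manipulated = [0.3, 0.8, -0.5]   # toy embedding close to the original
unrelated = [-1.5, 0.1, 1.2]     # toy embedding far from the original

same_pair = is_same(original, manipulated)
different_pair = is_same(original, unrelated)
```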

3.1 Architecture

We used a triplet network architecture [11], with the 3 branches sharing parameters. Each branch consisted of 4 convolution layers, each with ReLU non-linearity, followed by 2 fully-connected layers. We also included a few standard tricks-of-the-trade, such as batch normalization, local response normalization, and network-in-network layers. The resulting image representation is a vector of dimension 128. A summary of the model is shown in the left panel of Figure 1 (a). Complete details are accessible in the source code. We experimented with a considerable number of variations on network depth and hyper-parameters, though we did not perform a thorough or automated search for the optimal architecture.

Figure 1: (a) Model details. (b) Model training diagram. (c) Model testing diagram.

3.2 Triplet Loss

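Let $r_i$ be the representation at the bottom of branch $i$ for image $x_i$, $i = 1, 2, 3$. Our forensic metric $d(x_i, x_j)$ is computed from the distance between $r_i$ and $r_j$, with parameters that are learned during training. With the convention that the anchor images feed through branch 1, “same” through branch 2, and “different” through branch 3, we define likelihood terms for the (anchor, same) and (anchor, different) pairs using the sigmoid function $\sigma$. Our triplet loss is then accumulated over a batch of triplets $(x_1, x_2, x_3)$, i.e., anchor, same, different.

This loss forces $d(x_1, x_2) \leq 1$ for “same” pairs and $d(x_1, x_3) > 1$ for “different” pairs, therefore imposing a virtual threshold of 1 as the criterion for similarity as measured by $d$.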
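One way to realize a loss with the properties described in this subsection is sketched below. The specific forms are assumptions made for illustration and are not necessarily the authors' definitions: the metric is taken as d = a·‖r_i − r_j‖ + b with scalars a and b (which would be learned), the likelihoods as σ(1 − d) for the "same" pair and σ(d − 1) for the "different" pair, and the loss as the mean negative log-likelihood over a batch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def metric(r_i, r_j, a=1.0, b=0.0):
    """Assumed metric: scaled and shifted Euclidean distance.
    In training, a and b would be learned; here they are fixed."""
    dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(r_i, r_j)))
    return a * dist + b


def triplet_loss(batch, a=1.0, b=0.0):
    """Mean negative log-likelihood over (anchor, same, different) triplets."""
    total = 0.0
    for r_anchor, r_same, r_diff in batch:
        l_same = sigmoid(1.0 - metric(r_anchor, r_same, a, b))  # > 0.5 iff d < 1
        l_diff = sigmoid(metric(r_anchor, r_diff, a, b) - 1.0)  # > 0.5 iff d > 1
        total += -(math.log(l_same) + math.log(l_diff))
    return total / len(batch)


# Toy representations: in the first batch "same" is near and "different" is
# far; in the second batch the roles are swapped, so the loss should be larger.
good_batch = [([0.0, 0.0], [0.1, 0.0], [3.0, 0.0])]
bad_batch = [([0.0, 0.0], [3.0, 0.0], [0.1, 0.0])]

good_loss = triplet_loss(good_batch)
bad_loss = triplet_loss(bad_batch)
```

With these assumed forms, a likelihood of 0.5 corresponds exactly to d = 1, so thresholding the likelihood at 0.5 and thresholding the metric at 1 give the same decision.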

3.3 Training Procedure

Positive examples of image manipulation, corresponding to data confirmed by institutional or regulatory bodies as problematic (which may include retracted and/or corrected data), are not publicly available at the scale that would be required to train a high-capacity ConvNet. Thus, we approached this problem by simulating examples of image manipulation to generate a large training set, and testing on a small set of 54 real-world examples of inappropriately duplicated images in peer-reviewed publications.

The test images were originally identified as candidates through PubPeer and/or Retraction Watch. Data were flagged for concern in a prior setting, and described on the above sites and/or at the original parent journal as either corrected or retracted from the literature. Images were then downloaded directly from journal websites (high-quality JPEG where available) and/or exported as .tif files directly from downloaded manuscript PDFs with no additional compression, embedded-profile color management, or conversion of colorspace; resolution was determined automatically. Where needed, downloaded images were further parsed into individual .tif panels using Adobe Photoshop versions CS6 and CC.

To guide the generation of simulated data, we first evaluated the most common forms of image manipulation within our test set. We identified the following operations: identity, rotation, translation, scale, or perspective transform; local or global histogram adjustment; partial erasing. In addition, we also accounted for operations that are common in the preparation of images for scientific manuscripts, such as the insertion of text or drawings. Examples of some of these deformations are shown in Figure 2.

We started by gathering micrographs, mainly from the Image and Data Analysis Core at Harvard Medical School and from the Broad Bioimage Benchmark Collection (one class of images was taken from the NYU Mouse Embryo Tracking Database). There are about 20 classes of images, from various cell types and model organisms, including C. elegans, adipocytes, mouse nuclei, and mouse embryo cells. The data were cropped (with no overlap) into patches, totaling 5215 images that were randomly split into training (4000), validation (500), and test (715) sets. This data has been made available.
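The cropping and splitting steps can be sketched as follows. This is a minimal illustration in plain Python, assuming images are stored as nested lists of pixel rows. The patch size is left as a parameter, and `crop_patches` and `split_dataset` are hypothetical helper names; only the split sizes (4000/500/715) come from the text.

```python
import random


def crop_patches(image, patch):
    """Cut a 2-D image (list of rows) into non-overlapping patch x patch tiles.
    Incomplete tiles at the right and bottom borders are discarded."""
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


def split_dataset(items, n_train=4000, n_valid=500, n_test=715, seed=0):
    """Randomly partition items into training, validation, and test sets."""
    if len(items) != n_train + n_valid + n_test:
        raise ValueError("split sizes must add up to the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Toy 5x7 image cut into 2x2 tiles: 2 rows x 3 columns of tiles.
toy_image = [[r * 7 + c for c in range(7)] for r in range(5)]
tiles = crop_patches(toy_image, patch=2)

# Stand-in for the 5215 patches: integer ids instead of images.
train, valid, test = split_dataset(range(5215))
```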

At each training step two distinct batches of images are sampled from the entire training set. The first batch is reserved for the “anchor” branch of the 3-way siamese net, and the second for the “different” branch. For each anchor image, a corresponding image for the “similar” branch is obtained by on-the-fly deformation of the anchor. Deformations vary in degree (how much) and number (how many), according to the following pseudo-code, where rand() is a sample from the uniform distribution in $[0, 1]$, randreflection() is a random reflection, randpptf() a random perspective transform, randtform() a random similarity transform (rotation, scale, translation), crop() is a centered crop, randgammaadj() is a random gamma adjustment, and randlocaledit() is a random local edit (change in pixel intensity).

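        # r decides whether any deformation is applied at all:
        # when r >= 0.9 the image is only cropped (identity case)
        r = rand()
        if r < 0.9:
            # geometric deformations, each applied with probability 0.5
            im1 = randreflection(im)  if rand() < 0.5 else im
            im2 = randpptf(im1)       if rand() < 0.5 else im1
            im3 = randtform(im2)      if rand() < 0.5 else im2
        else:
            im3 = im
        im4 = crop(im3)
        if r < 0.9:
            # intensity deformations, each applied with probability 0.5
            im5 = randgammaadj(im4)   if rand() < 0.5 else im4
            im6 = randlocaledit(im5)  if rand() < 0.5 else im5
        else:
            im6 = im4
        return im6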

The “anchor” and “different” images in the triplet are also center-cropped, to be of the same size as the “similar” image, which needs cropping to eliminate border effects introduced by the deformations. Random clutter is added (with a certain probability) to all images in the triplet; it can be either random text or a random rectangle. Details on the parameters of each individual deformation and clutter are in the source code. Some examples of deformations are shown in Figure 2.
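Triplet assembly as described in this subsection can be sketched as follows. This is a simplified illustration, not the project's source code: images are nested lists, `deform` is a stand-in that only flips the image horizontally (the actual deformations are richer), clutter is omitted, "distinct batches" is interpreted as batches with no image in common, and all function names and sizes are hypothetical.

```python
import random


def center_crop(image, size):
    """Return the centered size x size region of a 2-D image (list of rows)."""
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def deform(image, rng):
    """Stand-in deformation: horizontal reflection with probability 0.5."""
    return [row[::-1] for row in image] if rng.random() < 0.5 else image


def make_triplets(training_set, batch_size, crop_size, rng):
    """Sample two disjoint batches (anchor, different) and derive the
    'similar' images from the anchors by on-the-fly deformation."""
    picks = rng.sample(range(len(training_set)), 2 * batch_size)
    anchors = [training_set[i] for i in picks[:batch_size]]
    differents = [training_set[i] for i in picks[batch_size:]]
    triplets = []
    for anchor, different in zip(anchors, differents):
        similar = deform(anchor, rng)
        # All three images are center-cropped to the same size.
        triplets.append((center_crop(anchor, crop_size),
                         center_crop(similar, crop_size),
                         center_crop(different, crop_size)))
    return triplets


rng = random.Random(0)
# Toy training set: ten 6x6 images with distinct pixel values.
training_set = [[[100 * k + 6 * r + c for c in range(6)] for r in range(6)]
                for k in range(10)]
triplets = make_triplets(training_set, batch_size=4, crop_size=4, rng=rng)
```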

Figure 2: Deformations (panels: original, rotation, translation, clutter, perspective, histogram, erase, combined). “clutter” corresponds to the addition of random text and a random box; “histogram” corresponds to local and global pixel intensity adjustment; “combined” corresponds to a sample run of the algorithm described in Subsection 3.3.

4 Results and Discussion

Our model reached peak accuracy on the validation set at about training iterations for a batch size of images. The following table summarizes the accuracy on the validation and test sets for synthetic images, as well as the accuracy on a small dataset of real duplications and/or manipulations, containing 54 cases. Performance on real cases is the average of 10 runs of the predictor; Figure 3 shows details of such predictions. Notice that the model is consistent in the errors it makes, and becomes more confident in its answers as the number of training steps increases, though it makes slightly more errors. This indicates that early stopping might be a good strategy when deploying the model on real-world cases, though the real-world image dataset is too small to draw any conclusions.

training steps    acc. valid.    acc. test, synth.    acc. test, real
1500              0.96           0.95                 0.94
5000              0.98           0.97                 0.92

Figure 3: 10 runs on real-world test examples after training for 1500 steps (a) and 5000 steps (b). Squares correspond to pairs of “same” images, circles to “different”; thus all squares (resp. circles) should lie above (resp. below) the horizontal likelihood threshold line (those that do are colored green, those that do not are colored red). Classes are sorted in the same way for both plots.
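Accuracy figures of this kind can be computed from per-pair likelihoods. The sketch below is a minimal illustration with toy values: it assumes a pair is predicted "same" when its likelihood exceeds 0.5 (an assumed cutoff), and averages the accuracy over repeated runs of the predictor. The names `accuracy` and `mean_accuracy` are hypothetical.

```python
def accuracy(likelihoods, labels, cutoff=0.5):
    """Fraction of pairs whose thresholded likelihood matches the label.
    labels: True for 'same' pairs, False for 'different' pairs."""
    correct = sum((p > cutoff) == y for p, y in zip(likelihoods, labels))
    return correct / len(labels)


def mean_accuracy(runs, labels, cutoff=0.5):
    """Average accuracy over several runs of the predictor."""
    return sum(accuracy(run, labels, cutoff) for run in runs) / len(runs)


labels = [True, True, False, False]
runs = [
    [0.9, 0.8, 0.2, 0.1],   # all four pairs correct
    [0.9, 0.4, 0.2, 0.6],   # two of four pairs correct
]
avg = mean_accuracy(runs, labels)
```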

Unfortunately, at this time we are unable to publish real-world example images. Some examples of synthetic image pairs, along with predictions, are shown in Figure 4. Trained models are available at the project’s page. Figure 5 shows a snapshot of the PCA of the embeddings of synthetic images, as represented by the 128-dimensional output of the CNN. A video of the embedding is also available.

Figure 4: Predictions on pairs of images from our synthetic dataset (left: correct predictions; right: wrong predictions). For correct predictions, we are just showing “same” since the goal is candidate detection and we care less about when the model says that images are different (in most cases, they will be very different). For wrong predictions we are showing both “same” and “diff” cases to see what types of errors the model makes.

Figure 5: Top: snapshot of PCA of embedding of CNN representations of 1024 synthetic images. Bottom: selected zoomed in areas, highlighting how similar images appear next to each other.
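A PCA projection of embeddings, of the kind used for such a visualization, can be sketched in plain Python. This is an illustrative simplification: it extracts only the first principal direction by power iteration, uses 3-dimensional toy vectors instead of 128-dimensional embeddings, and all names (`center`, `top_component`, `project`) are hypothetical.

```python
import math
import random


def center(rows):
    """Subtract the per-dimension mean from every embedding."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]


def top_component(rows, iters=200, seed=0):
    """Leading principal direction via power iteration on X^T X,
    where X is the matrix of centered embeddings."""
    rng = random.Random(seed)
    dim = len(rows[0])
    v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]          # X v
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dim)]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(rows, direction):
    """Coordinate of each embedding along the given direction."""
    return [sum(a * b for a, b in zip(r, direction)) for r in rows]


# Toy embeddings that vary mostly along the first axis.
embeddings = [[-2.0, 0.1, 0.0], [-1.0, -0.1, 0.0],
              [1.0, 0.1, 0.0], [2.0, -0.1, 0.0]]
centered = center(embeddings)
pc1 = top_component(centered)
coords = project(centered, pc1)
```

Repeating the procedure on the residual (after removing the first component) would give further components for a 2-D or 3-D view.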

5 Conclusions and Future Work

We have demonstrated a proof-of-concept that siamese networks have the potential to improve surveillance of the published and in-peer-review literature for duplicated images. This approach may not prove accurate enough to definitively determine image duplication, but rather could serve to narrow down the pool of images which are subjected to further review. Surprisingly, we found that many of the errors in the test set involved histogram/contrast alterations, despite this being one of the easier cases to spot by the human eye. We added both local intensity and gamma changes in the training set, and will continue to explore intensity alterations as a way to improve accuracy of the algorithm (e.g. by adding JPEG compression as one of the manipulations).

One of the main roadblocks to this research is the lack of a public, large-scale database of image manipulation cases on which to further test the model. The challenge here is not only of generating one such dataset, but also of securing the proper permissions to release the data, given the legal issues involved. We are continually expanding our dataset and will make it available as soon as possible.

Another interesting topic of future research would be to implement Grad-CAM [12] style network inspection to gather information on why the network considers two images similar, when it finds them to be.

6 Acknowledgements

We would like to thank current and former members of the Harvard Medical School (HMS) Office of Academic and Research Integrity (ARI): Gretchen Brodnicki, J.D., Keri Godin M.S., Mortimer Litt, M.D., Jennifer Ryan, J.D., and Blake Talbot, M.P.H.; former members of the HMS Image Data Analysis Core (IDAC): Tiao Xie, Ph.D. (Definiens AG) and Yichao Joy Xu, Ph.D.(Xito Technologies Ltd.); and Katia Oleinik, M.S. (Boston University) for their expertise and effort in supporting image data integrity initiatives developed by the HMS ARI/IDAC team. This research was partially supported by a gift from Elsevier, and we thank IJsbrand Jan Aalbersberg, Ph.D., Jessica Cox, Ph.D., and Ron Daniel, Ph.D., at Elsevier Research Integrity and Elsevier Labs, for their ongoing interest in and discussions of this work.


References

  • [1] E. M. Bik, A. Casadevall, and F. C. Fang. The prevalence of inappropriate image duplication in biomedical research publications. mBio, 7(3):published online, 2016.
  • [2] M. Blatt and C. Martin. Manipulation and misconduct in the handling of image data. The Plant Cell, 25:3147–3148, 2013.
  • [3] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Computer Vision and Pattern Recognition, volume 1, pages 539–546, 2005.
  • [4] D. W. Cromey. Avoiding twisted pixels: Ethical guidelines for the appropriate use and manipulation of scientific digital images. Science and Engineering Ethics, 16(4):639–667, 2009.
  • [5] N. Gilbert. Science journals crack down on image manipulation. Nature, Oct 9:published online, 2009.
  • [6] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
  • [7] L. Koppers, H. Wormer, and K. Ickstadt. Towards a systematic screening tool for quality assurance and semiautomatic fraud detection for images in the life sciences. Science and Engineering Ethics, 23(4):1113–1128, 2017.
  • [8] M. Rossner. How to guard against image fraud. The Scientist Magazine, Mar 1:published online, 2006.
  • [9] M. Rossner. A false sense of security. The Journal of Cell Biology, 183(4):573–574, 2008.
  • [10] M. Rossner and K. M. Yamada. What’s in a picture? The temptation of image manipulation. The Journal of Cell Biology, 166(1):11–15, 2004.
  • [11] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
  • [12] R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra. Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization. CoRR, abs/1610.02391, 2016.