Benchmarking Scientific Image Forgery Detectors

05/26/2021 ∙ by João P. Cardenuto, et al. ∙ University of Campinas 29

The scientific image integrity area presents a challenging research bottleneck, the lack of available datasets to design and evaluate forensic techniques. Its data sensitivity creates a legal hurdle that prevents one to rely on real tampered cases to build any sort of accessible forensic benchmark. To mitigate this bottleneck, we present an extendable open-source library that reproduces the most common image forgery operations reported by the research integrity community: duplication, retouching, and cleaning. Using this library and realistic scientific images, we create a large scientific forgery image benchmark (39,423 images) with an enriched ground-truth. In addition, concerned about the high number of retracted papers due to image duplication, this work evaluates the state-of-the-art copy-move detection methods in the proposed dataset, using a new metric that asserts consistent match detection between the source and the copied region. The dataset and source-code will be freely available upon acceptance of the paper.



There are no comments yet.


page 1

page 3

page 4

page 5

page 8

page 9

page 10

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Integrity researchers have been reporting a threat to scientific image integrity for a long time[12, 18, 10, 3]. This improbity has achieved even areas like cancer [19], or more dramatically used on ’paper mills’ services [5].

Although this serious problem has been an urge in the scientific community, to the best of our knowledge, few forensics works are dedicated to this topic. So far, no rich annotated forensic benchmark containing scientific tampering images was published. We believe that a large dataset would foster the forensic community to work more actively in this subject and assist state-of-the-art forensic techniques that might require large training datasets.

In this sense, we tried to collect known doctored scientific images, but we faced two main issues that make us avoid these methodology: legal and practical. To publish a dataset with real tampering cases, we would have to face copyrights and legal aspects of pointing third-party works that were retracted due to suspicious manipulation. Even if we decided to manage this legal aspect, we had to be guided by a retraction notice relative to the issued images. However, after reading several retraction notices, we realized that many of them are not precise enough to pinpoint the issued region’s at a pixel level, which would not lead to an accurate ground-truth. The example of a real retraction notice due to an honest error, depicted in Fig. 1, shows this inaccuracy, in which the highlighted words ‘some lanes’ and ‘not the appropriate ones’ translate to an ambiguous region and cause.

Fig. 1: Example of a retraction note extracted from Cell Death & Diseases (, Last access May, 2021). The highlighted words in yellow ‘some lanes’ and ‘not the appropriate ones’ illustrate inaccurate regions and ambiguous causes of the retraction.

On the other hand, we notice that a representative number of retracted images were due to duplication and basic image processing operations that could be automatically created. Therefore, this work presents the RECOD Scientific Image Integrity Library (RSIIL) that enables creating a synthetic scientific image tampering dataset with enriched pixel-wise ground-truth and without any associated legal issue. With this library, we created the RECOD Scientific Image Integrity Dataset (RSIID) with the most common image operations reported by scientific integrity researchers [18, 3]. To create this benchmark, we doctored 2,923 figures from creative common sources resulting in 39,423 tampered figures (26,496 for training and 12,927 for testing). In addition, we propose a new metric to evaluate copy-move forgery detection dedicated to scientific images using an enriched ground-truth map, that assert a consistent detection match between the cloned region and its source. Finally, using this new dataset and metric, we evaluate the performance of the state-of-the-art copy-move forgery detection [7, 6, 21], establishing a baseline and setting the ground for any future investigation.

We organize the remaining of this paper into seven sections: Section II presents related work while Section III details the proposed library, RSIIL. Section IV presents the dataset RSIID while Section V

brings out a new evaluation metric aimed at a more consistent copy-move detection evaluation. Section 

VI presents an analysis of state-of-the-art copy-move forgery detectors on the proposed dataset setting the ground for future research while Section VII presents the conclusions and future work directions.

Ii Related Work

To the best of our knowledge, few works try to design a tampering benchmark focused on scientific images. So far, we were only able to find two works that address scientific integrity image datasets. The first one is from Xiang and Acuna [22], which created a synthetic tampering dataset of scientific images from the web. They doctored microscopy and western blot images using three types of manipulations that they claim to be the common cause of problems in scientific papers: cleaning of an image region with a single color or noise (Cleaning); copying an alien content region into the image (Splicing); and applying visual adjustments in the image content (Retouching). Their dataset contains 747 manually manipulated scientific images, of which 616 are dedicated to Removal. As we were only able to find the pre-print version of [22], we could not find any released data. Due to this, the quality of their manipulations and the dataset license is still unclear. Despite the authors manually constructed the dataset to create a more realistic scenario, their dataset is still limited to a small size that might not represent the diversity of scientific images. Besides this, the dataset is highly concentrated on Cleaning, preventing one to properly evaluate the robustness of a forensic method among all modalities.

The second one is the work of Koker et al. [11], named as Bio-Image Near-Duplicate Examples Repository (BINDER), which have the pioneering idea of using legal issue-less scientific images for an integrity dataset. This dataset is limited to finding near-duplicate images, aiming to find image re-use across scientific publications. Their dataset has 10,179 non-overlapping patches tiled in or pixels. To create their dataset, they gathered microscopy images from the following public repositories: NYU Mouse Embryo Tracking Database111 (Last access May, 2021 (METD), the Broad Bioimage Benchmark Collection222 (Last access May, 2021) (BBBC), the Adiposoft Image Dataset333 (Last access May, 2021) (AID), and the Open Microscopy Image Data Resource444 (Last access May, 2021) (IDR). Besides, they also applied some geometric, brightness/contrast, and compression transformations on some images. However, their dataset is still not as realistic as the figures presented in scientific publications. Despite scientific images often embed graphical elements and captions, hampering to detect re-use, the authors did not add these elements to the images. In addition, they did not apply any local tampering (region-level), which is also a typical manipulation in inappropriate image re-use [3].

In addition to these works, we also found two scientific integrity initiatives that collect real cases of retracted papers. The first is the Retraction Watch Database555 (Last access May, 2021) maintained by the Retraction Watch666A non-profit organization affiliated with the Center for Scientific Integrity and dedicated to report and discuss cases of retracted papers and related issues.. This database has more than 20,000 metadata of retracted, corrected, or concerned papers. The metadata presents the paper’s title, retraction reason, authors, and Digital Object Identifier (DOI), among other fields. Although this database is not dedicated to image integrity issues, it is possible to filter the retracted papers to this category. However, only the paper’s metadata will be retrieved – due to legal aspects; it is not possible to retrieve the articles PDF, Figures, or Retraction notice –, which is a drawback of this database.

The second is the HEADT Centre Image Integrity Database777 (Last access May, 2021), an initiative focused on researchers and developers working on scientific image manipulation detection. Their database contains more than 500 images’ metadata from retracted papers due to image manipulation. In addition to the basic information of a paper (Title, Authors, Publisher, Journal), they also added a text description of each manipulation, including the figure’s panel in which the manipulation occurs and its category (e.g., copy-move). This text description is based on the retraction notice associated with the figure; therefore, some text also presents ambiguity as depicted in the Figure 1. Despite this text description, we could not find any manipulation map at pixel level in this dataset, which we believe is needed to evaluate a detection method properly.

Iii RECOD Scientific Image Integrity Library - RSIIL

Before working with synthetic data, we tried to gather real-world problematic scientific images. To avoid any bias from our side, we relied upon retracted papers due to image problems given that they have a retraction notice resultant of an integrity investigation. However, to publish an accessible benchmark for forensic research, we possibly would have to deal with some legal aspects (e.g., figure copyright and causing possible defamation to someone).

As reported by Adam Marcus [13], a retracted paper could make their authors feel their reputation harmed and make them sue journals for defamation. Azoulay et al. [1] also indicate that retraction due to misconduct ––which are the most important papers to be included in a forensic benchmark— has a significant reputation penalty to their authors. Even co-authors that might not be involved in the image manipulation, who already suffered severe consequences [14], could be affected by such benchmark since it would promote their association with the retracted paper.

Besides this legal aspect, we also faced some practical issues regarding data annotation. When manually annotating the problematic figures’ regions following their retraction notice, we experienced an absence of standard, including vague sentences (as illustrated in Figure 1), resulting on unreliable ground-truths.

Because of these issues, we decide to avoid using real-world scientific problematic image and create a photorealistic dataset using the library introduced in this section. Thus, this Section presents the types of forgeries implemented in the library (Sub-Section A), explains how the library mimics realistic figures as they usually are presented in scientific documents (Sub-Section B), and addresses the manipulation ground-truth (Sub-Section C). Finally, the section also discusses how the proposed library is amenable to extensions of new image manipulations types (Sub-Section D).

Iii-a Library functionalities

The goal of the library is to implement the most common image manipulations reported in the scientific community. Although we are aware of the possibilities of more complexity tools for image manipulation, for example, the creation of scientific images using artificial intelligence (AI) algorithms


, we suspect that these tools are not vastly used yet due to their complexities. Therefore, while designing the library, we adopt the forgery function based on the most common image processing operation accessible for a non-expert in AI or Computer Vision. We also design each block of the library to allow it to be extended to other more complex operations in the future.

Following the research from Bik et al. [3] and Rossner and Yamada [18], we selected three main types of manipulations that can be recreated using common image processing software (e.g., Adobe Photoshop):

  1. Retouching: The process of image beautification leading to an experiment misreading. This modality implements contrast, brightness, and blurring adjustments that highlight or obfuscate an image region. Figure 2 depicts an image that we applied retouching with our library. In Figure (b)b, we used a Gaussian filter within the selected objects to obfuscate its content. Figure (e)e illustrates an image with contrast and brightness adjustment, in which the method changes the selected object pixels intensity to cause an experimental misreading.

    (a) Original
    (b) Manipulated
    (c) Manipulation Map
    (d) Original
    (e) Manipulated
    (f) Manipulation Map
    Fig. 2: Example of image Retouching forgery implemented in the library. (a) and (d) are original images without any manipulation; (b) is the manipulated version of (a) using blurring retouching; (e) is the brightness/contrast manipulated version of (d); (c) and (f) are the ground-truth map that indicate the manipulated regions of (b) and (e).
  2. Cleaning: The result of obfuscating a foreground object using a background region. For this modality, we use inpainting and a brute-force routine. For the inpainting, we use the method of Criminisi et al. [8] implemented by Moura.888Code available at (Last access March, 2021) For the brute-force routine, we develop an in-house method to mimic the forgery procedure of a person seeking to cover an object using the background. To implement this routine, we select a foreground object ; then, using brute force, we fit on a background region that has the most similar color histogram of this object; finally, we copy into and blur the border of , smoothing (feather edges) the difference from the copied and the neighborhood of . Figure (b)b depicts the result of inpainting on the top-right cell of the image, and Figure (e)e depicts the result of the brute-force routine.

    (a) Original
    (b) Manipulated
    (c) Manipulation Map
    (d) Original
    (e) Manipulated
    (f) Foreground Map
    (g) Background Map
    Fig. 3: Example of image Cleaning forgery implemented in the library. (a)-(c) depict the inpainting method of [8] added to the library. (d)-(g) depict the Brute-Force cleaning routine. (g) indicates the background regions of (d) selected to cover (clean) the cells indicated by (f). Each color in (f) and (g) represent a different ID that helps to track the regions involved in the forgery.
  3. Duplication: The action of copying and pasting a region of an image within the same or another image, using or not post-processing operations. Note that this definition includes both copy-move and splicing. We organized this category into three sub-categories:

    1. Copy-Move Forgery: Duplication of a region within the same image using geometric transformations (translation, rotation, flip, and scaling) and post-processing (e.g., retouching). All transformations can be combined with another. Due to the intrinsic result of scaling, we always combined it with another operation, otherwise it would cover the source object region. Besides these transformations, we also implemented a random object-to-background copy-move (that we named Random). This routine copies a random object to a background region that has the same shape as .

    2. Overlap Forgery: Creation of two images with an overlap region from a single one. From a source image , we select different regions of that share an overlap area to create two images from these regions. Any of these new regions can suffer post-processing to obfuscate its source. Figure 5 depicts the creation of an image with an overlap area.

    3. Splicing: Creation of an image composition that uses a donor figure’s elements into a host one. Figure 6 depicts an Splicing forgery.

(a) Translation
(b) Rotation
(c) Flip
(d) Scaling
(e) Original
Fig. 4: Example of Copy-Move Forgery implemented in the library. The object of image (e) containing an arrow is duplicated with (a) translation, (b) rotation, (c) flip, and (d) scaling and pasted within the same image.
(a) Source Image
(b) Region A
(c) Region B
Fig. 5: Example of Overlap forgery included in the library. (a) represent a source image that is divided in overlapping regions A and B, and then presented as unique images in (b) and (c). The region A (b) suffer a post-processing brightness adjustment to make harder to compare with region B (c)
(a) Donor
(b) Host
(c) Splicing Result
Fig. 6: Example of Splicing forgery function included in the library. The object highlighted with a red arrow from the donor image (a) is placed in a background region of the host image (b) resulting in (c).

Despite all cases are generated without any human interaction, the result images may confuse even an attentive person. To produce forgeries as realistic as possible, some functions from the library require as input an object map (segmentation map). The object map locates each object inside the image and assists a forgery function to execute the falsification more likely as a human would do.

Iii-B Realistic Scientific Figures

As our key objective is to create scientific figures, we include two features (frequently present in such figures) in the library: captions/indicative letters and compound figures.

  1. Caption/indicative letters: Scientific figures often present indicative letters or captions that overlay the image’s content. As a result, this overlay is a splicing operation between a letter or a word within the experiment image that could raise a false alarm during forgery detection. Therefore, we add to the library the possibility to mimic this overlap behavior as it appears on scientific papers. We include three different levels of indicative verbosity. Level 1 includes only indicative letters around each panel of the figure. Level 2 includes the features of Level 1 and a random word around each panel. Level 3 includes all features from Level 2 and an indicative letter inside each panel. Figure 7 depicts all these levels of verbosity.

    Fig. 7: Example of a compound figure with different levels of indicative letters verbosity. From top to bottom, a compound figure without any indicative letter receives different levels of indicative verbosity, depicted by one of its panels; at the bottom, the result figures for each level.
  2. Compound figures: A Compound figure is a composition of multiple images that are organized in panels. These figures usually appear in articles to represent an overview of an experiment. To avoid creating unrealistic compound figures, we make use of figure templates based on real cases. These templates are image masks that can inform each panel’s location to the method, as well as their type (e.g., graphs, photos).

To create Compound figures, we implemented a routine that has as input a set of realistic compound figures templates , a dataset of scientific images (to be included in the compound figure), a source image ( is not in ), and a forgery function (to be applied in ). Figure 8 illustrates this routine. Thus, the method selects a template from with at least one panel whose aspect ratio is similar to the aspect ratio of . Then, the routine applies the forgery in , creating (a forged version of ). Later, a figure with the same size of is created, and is resized and placed in the panel of with the most similar aspect ratio of . Finally, all other panels of are filled with different images from that have similar aspect ratio to those panels or with fake graphics.

Fig. 8: Pipeline of Compound figure creation. (a) Method’s Input: From left to right, set of Compound Figures templates; scientific source image dataset; and input image with the chosen forgery function. (b) Method’s Operations: Selects a template based on the aspect ratio of the input image; then, retrieves all images from the source dataset that fit the chosen template; later, creates Fake graphs (if indicated by the template); then, applies the forgery function in the input; and, finally, place all figure elements in the Compound figure. (c) The output figure

Iii-C Data Annotation

In spite of the importance of reliable ground-truth to evaluate a forensic method, to the best of our knowledge, there is no scientific forgery image dataset that presents an enriched ground-truth. Hence, all tampered operations implemented in the library provide detailed maps to indicate the manipulated regions. Each object involved in the tampering operation is indicated with a different ID in the ground-truth, which helps pinpoint the object’s exact location before and after the forgery, as depicted in Figure (g)g. The library also enables the creation of a JSON file containing metadata related to the forgery. This metadata includes the source images, the method and arguments used, and the location of each panel inside the Compound figure. As the metadata includes the source images and the forgery methods applied, one can evaluate provenance analysis [15] using these information as reference.

Iii-D Library Extension

Given that scientific image tampering improves over time to convince even researchers [17], the benchmark of tampering detection also should include cutting-edge forgery techniques. In this sense, to facilitate the inclusion of new manipulation in RSIIL, we implemented a high-level routine that receives as one of its arguments a forgery function and applies it to an image. This routine is responsible for regulating the application of any new manipulation, asserting its guidelines to the ground-truth and the metadata associated with the forgery. Because of this, any new forgery function capable of returning the forged image along with its manipulation map –a pixel-wise map locating the forgery inside the image– can be easily added to the library to generate Simple or Compound tampering figures.

Iv RECOD Scientific Image Integrity Dataset - RSIID

In addition to the library we introduced in the previous section, we created a dataset to serve as future benchmark for the area. For that, we selected the most frequent retracted types of images from the biomedical area. For this, we followed the orientation of Bucci [4] and Bik et al. [3] that report a high image manipulation rate on images from Western Blot techniques and Microscopy imagery. With this in mind, we downloaded real scientific images collected from diverse sources to apply the forgeries.

To avoid any legal aspect of creating manipulated images and aiming to publish the dataset with a common creative license, we only downloaded data available under public domain999 (Last access May, 2021) (PD) or common creative attributed101010 (Last access May, 2021) (CC-BY) licenses. These license allow us to remix, transform, and reuse the images without asking for the author’s authorization.

We use the following data source to gather the image collection:

  1. Broad Bioimage Benchmark Collection111111 (Last access May, 2021) (BBBC), :
    A collection of freely downloadable microscopy image sets. From this source, we selected the datasets BBBC038, BBBC039, and BBBC019. The first two are dedicated to segmented nuclei images and have object-mask –that are needed for forgeries at object-level–. The last dataset (BBBC019) is dedicated to cell migration which we use for Overlap forgeries.

  2. PubMed Central (PMC)121212 (Last access May, 2021):
    PMC is a free article archive of biomedical and life sciences. To download each figure from this repository, we use an API131313 (Last access May, 2021 available by PubMed, in which we could select images that only have PD or CC-BY licenses. We choose to include published western blots images. To include the western blots to the source dataset, after downloading the PMC figures, we manually extracted the panels that had western blots for each figure. We based the templates images used for creating Compound Figures based real figures retrieved from this repository.

  3. TNBC [16]:
    This dataset, announced in Naylor et al. [16]

    , was designed aiming at nuclei segmentation of cells by deep neural networks. Therefore, this dataset has a high-quality object map that we make use of to assists the forgery operations at object-level.

Table I shows the number of source figures for each collection by type (Microscopy or Western blot) pointing if they have object-mask annotation, and the number of template images created based on the figures from PMC.

Source Image Dataset
Collection Microscopy Western Blot Object-Mask
Compound Figures Template
Template Source*
  • Templates created based on the figures from the collection.

TABLE I: RSSID Source Image and Template Collection

Iv-a Dataset Construction

While designing the dataset, we project it to evaluate a forensic tool in different tasks with different complexities. Because of this, the dataset is organized so that a user can easily find the data and its annotation for each forgery modality. Thus, we divided the dataset into two types of figure complexities: Simple and Compound.

  1. Simple Scientific Figures: Figures with this complexity are represented by a single experiment image, (e.g., Figure 2a). To include a tampering figure in this complexity type, we forge an original figure using Retouching, Cleaning, or Duplication techniques, implemented in our library. To avoid unexpressive forgeries, we only included doctored figures in the dataset that have at least 500 manipulated pixels. In addition to the tampered figures, we also reserved a pristine directory in which we include the original images, so that a user can easily evaluate false positives. Figure 9 illustrates the organization of Simple figures in our dataset.

  2. Compound Scientific Figures: This complexity type is described in Section III-B and depicted in Figure 8.

    We divided the Compound Figures into two types of tampering: Intra-Panels and Inter-Panels.

    1. Intra-Panels are forgeries that are present in just one panel of the figure. To create this tampering type, we add a Simple forgery as one of the figure’s panels. Forgeries that need more than one source image (e.g., splicing) or that generate more than one doctored figure (e.g., overlap) were not included in this modality.

    2. Inter-Panel are forgered figures that have two or more panels involved in the manipulation process. This modality aims to evaluate duplications among two or more panels within the same figure. These duplications can be at object-level, region-level, or panel-level. At object-level, the objects from a donor panel are copied into a host, using splicing operation. At region-level, an overlap forgery operation creates two panels with overlapping areas. At panel-level, the entire panel is duplicated with or without post-processing (e.g., retouching, cleaning, or geometric transformations).

    For each Compound figure, we generated the three levels of indicative letters verbosity, as described in Section III-B. Figure 10 illustrates the organization of Compound figures in our dataset.

    To assist machine learning forensics techniques, we further divided the dataset into training/test sets. Tables

    II and III express the number of manipulated figures included in each modality. Note that, overlap forgery appears only in the test set, since this modality is similar to the copy-move, and this protocol will force the generalizability of a forensic tool among the methods.

Modality Train Test
Number of Figures Number of Figures
Source of Forgery Figures
Duplication Copy-Move
Cleaning Inpainting
Retouching Blurring
Total of Figures
TABLE II: Number of Simple Figures per modality in the dataset
Modality Train Test
Number of Figures Number of Figures
Source of Forgery Figures
Inter-Panel Duplication Copy-Move
Intra-Panel Duplication Copy-Move
Cleaning Inpaiting
Retouching Blurring
Total of Figures
TABLE III: Number of Compound Figures per modality in the dataset
Fig. 9: Organization of Simple forgery images in the dataset.
Fig. 10: Organization of Compound forgery images in the dataset

V Copy-Move Forgery Detection Proposed Metric

Popular metrics used on Copy-Move Forgery Detection (CMFD) (e.g., F1-score and Precision) make use of True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN) detection concepts at pixel-level, as described in Table IV.

Predicted - Detection Map
Positive (Suspect Pixels) Negative (Non-Suspect Pixels)
Ground-Truth Positive ( Tampered Pixel) TP FN
Negative (Pristine Pixel) FP TN
TABLE IV: Confusion Matrix Copy-Move Forgery Pixel Level

As a drawback, these metrics cannot assert if both regions of a copy-move (the source and its copy) are in the ground-truth, since there is no consistency check. Because of this, some contradiction might occur during the evaluation. For instance, Figure 11 illustrates a detection map that has an inconsistent detection in which the copied objects with and are inconsistent with the ground-truth. For both objects, only the source or its copy overlaps with the ground-truth, which is inconsistent, since both (object and its copy) are expected to be included during detection. However, when evaluating this detection map with traditional true positive score, both regions would be considered as true positive hit.

To mitigate this drawback, this section introduces a new metric that takes advantage of the enriched pixel-wise ground-truth of the dataset. The proposed metric is a variation on how to consider a pixel as true positive in a detection map named as the Consistent True Positive score (CTP) and defined as:
Given a ground-truth map with manipulated regions, a detection map with copy-pasted regions, each one of the regions included in has connected components (the source object and all its copies), and each region in has connected components. Let be a detected region from and a tampered region indicated by the ground-truth. Also, let be a pixel from , such that . Thus, is a consistent true positive if exists , such that, at least two connected components from intersects .

In other words, to consider a region from the detection map as a consistent true positive, at least two components from (the source and at least one of its copies) have to intersect the ground-truth.

As Figure 12 depicts, a region from the detection map can overlap with two or more region from the ground-truth. Given that the goal of is consistency, we only consider as the region of the ground-truth that has the maximum intersection area with the detected region. Hence, . As a result, will have higher penalty on Precision and F1-score metrics, if calculated with .

Thus, the equation of F1-score and Precision using CTP become:

Fig. 11: Example of an inconsistent detection map. Although the detection map has two overlapping regions with the ground-truth, each object and its copy —indicated by the detection map— does not intersect simultaneously with the ground-truth.
Fig. 12: Example of CTP for a detected region that overlap two or more objects from the ground-truth. The ground-truth present three copied objects. The detection map present two connected components with the same ID that overlaps more than one object from the ground-truth. Only the object with the larger area is considered as true positive, the others would be considered inconsistent.

Vi Evaluating CMFD Methods

Duplication of scientific images is one of the threats highly reported and studied in the literature [3, 18, 20] which includes copy-move, a well-studied forgery in the digital forensic field. Although this field presents multiples CMFD solutions for natural images, we could not find any study that evaluates their performance in the scientific image domain. In this sense, to assist any future forensic method with a baseline, we investigated the performance of popular CMFD solutions on natural images applied to the RSIID. In addition, we checked the difference of F1-score using the proposed consistent true positive metric and the regular true positive one. For this, we choose the following CMFD methods:

  1. Efficient Dense-Field from Cozzolino et al. [7]. During the evaluation, we use the implementation of Ehret [9]. Ehret released two versions of [7] using Zernike and SIFT features. To distinguish this method from the others, we named them Zernike-PM and SIFT-PM, since these detectors use the PatchMatch algorithm [2] to match similar blocks contents.

  2. CMFD library implemented by Christlein et al. [6]. We selected SIFT and SURF methods from this library since the others were not efficient enough to be explored on such a large dataset. To distinguish them from the previous CMFD detectors, we named them SIFT-NN and SURF-NN since they use a regular approximate nearest-neighbor approach to match similar blocks.

  3. Busternet from Wu et al. [21]. This method is a deep neural architecture for CMFD. During the evaluation, we use the pre-trained version of the model released by Wu et al. [21].

To evaluate SIFT-PM, Zernike-PM, SIFT-NN, and SURF-NN using

, we modified their implementation, including a routine that assigns each detected object and its copies a unique ID. For the sake of reproducibility, we released the methods source-code with this modification in the same repository of RSIIL. On the other hand, to evaluate Busternet, we normalized its output [0,255], then binarized all pixels greater than 100 to 1, otherwise 0. As Busternet is based on neural networks, we could not find an explainable methodology that would track the matching among different objects and their copies. Thus, to the

metric, all detected and ground-truth objects are set with the same . Consequently, would not be able to properly check inconsistencies on figures with more than one tampered object for Busternet’s output; however, is still valid and useful to check if Busternet’s output overlaps with two or more connected components from the ground-truth.

As a baseline approach, the evaluation protocol consists of running all methods without any training or fine-tuning and measuring their output with . During the evaluation, we use all figures from the test set applicable for CMFD (i.e., images with duplicated areas within the same image). We group the baseline results into Simple and Compound scientific figures, which were divided by modalities. All copy-move modalities presented in Figures 13 and 14 can also include scaling. Since scaling cannot be applied alone, we did not indicate when this operation is combined with others.

(a) CMFD Simple Figure Evaluation. The best method is Busternet because its polygon contains all the others, indicating that it performs better for all operations. The shrinking polygons area from to indicates that all methods show inconsistencies in their detection map.
(b) CMFD Inter-Panel Figure Evaluation on different levels of indicative letters verbosity. In this plot, each method fares differently for each modality. The polygons from SIFT-NN and SURF-NN have larger area than the others methods, indicating that they are robust to more operations than the others methods. The shrinking polygons area from to indicates the higher the caption Level, the lower is the method’s effectiveness.
(c) CMFD Inter-Panel Figure Evaluation on different levels of indicative letters verbosity. All methods show low performance and concentrate in the center of the radar, indicating that this is a challenging modality.
Fig. 13: Evaluation Baseline Results. Inside the parenthesis of each copy-move modality, there is the transformation used during the copy-move forgery. All F1-scores presented in this figure are normalized . The best result for each duplication modality is indicated with the color of the respective detector. (a) Result for Single Figure Evaluation using and . (b) Result for Inter-Panel Figure Evaluation using across all levels of indicative letters verbosity, indicated by the number in its subtitle (i.e., V1 for verbosity Level 1). (c) Result for Intra-Panel Figure Evaluation using across all levels of indicative letters verbosity.

Duplication detection output per modality

Fig. 14: Comparative duplication detection output per modality. The purple color represents a pristine/non-suspect region, and each other color in the ground-truth and detection maps represents a different ID assigned to each object and its copies. Inside the parenthesis of each method, we insert the metric normalized into [0, 100]. All compound figures are with level 1 of indicative letters verbosity.

Figure 13 presents a radar graph visualization in which the forgery modalities are arranged in the radius axes. Each CMFD methods’ result is represented with a different color in the radar char. In this visualization, we insert the score of each method along the modality axis (e.g., copy-move with flip) which start from the radar center (score zero) until its border (highest score); thus, as farther a method point (color point) is from the center as better is the method for the axis copy-move modality. After inserting all points of a method for each copy-move modality, we connected those points resulting in a polygon. The larger the polygon area, the better the method’s robustness among different forgery modalities. Also, comparing each detector’s robustness to the operations, this type of visualization helps identifying possible complementary behaviors among different methods. As an example, consider Figure (a)a left panel. In this case, we have five modalities being compared (e.g., Copy-Move with Flip, Cleaning with Brute-Force, and Copy-Move with Translation). This char shows the results of five methods represented by each different polygon color (see legend on the right of the figure). The best method in this figure is Busternet (in orange) while the two worse methods (SURF-NN and Zernike-PM) are in superposition at the center (smaller areas). In the following, we discuss the forgery evaluation for each modality.

Vi-a Simple Figure Forgery Baseline

In this modality, we tested the chosen methods on Simple figures, forged with Cleaning (Brute-Force) and Copy-Move. Although the chosen CMFD detectors have high efficacy on natural image benchmarks, their performance drastically decrease when applied to our scientific dataset. As Figure (a)a shows, the best CMFD method in the Simple Figure Evaluation was Busternet [21], despite its modest scores.

For this modality, we also compare each methods’ performance between F1-score using and . We notice a difference in these scores for all methods, represented by the area reduction from their polygon chars, indicating the existence of copy-move inconsistencies on their detection maps, which is also depicted in Figure 14. The second row of this figure shows an example of an inconsistent detection map, in which Busternet activates just one of the connected components involved in the manipulation, resulting in . On the other hand, in the same row, SIFT-NN detects both regions (object and copy), resulting in .

Vi-B Inter-Panel Figure Baseline

We evaluate the Inter-Panel tampered figures for all indicative verbosity levels in Copy-Move (at panel-level), Splicing, and Overlap forgeries. Figure (b)b shows the result for the Inter-Panel forgery evaluation using . In this modality, the radar visualization allowed us to notice some complementary performance among the chosen detectors. For instance, SURF-NN and Zernike-PM show a complementary behavior to copy-move with rotation and retouching. We believe that this complementary aspect indicates that a fusion/ensemble technique might enhance their individual robustness. For the Inter-Panel scenario, the flipped copy-move, Splicing, and Overlap forgery showed to be the most challenging forgeries. In addition, the indicative letters are shown to have a perceptible impact in this scenario, reducing by up to seven points from level 1 to level 3 for some detectors. Althouth Busternet achieves the best performance in Simple Figure modality, when applied to compound figures, it leads to a higher false-positive rate, as depicted in Figure 14 by activating the entire image.

We also noticed that graphs and indicative letters are the most common causes of false alarms in the Compound Figures scenario, as illustrated by the third, fourth, and fifth rows of Figure 14, which SIFT-PM wrongly activates letters and graphs regions.

These findings help us to see where researchers should focus on when dealing with the scientific image forgery detection problem.

Vi-C Intra-Panel Figure Baseline

We evaluate the Intra-Panel tampered figures for all levels of indicative verbosity in Cleaning (Brute-Force) and Copy-Move (at object-level).

As presented by Figure 13, this is the most challenging scenario, in which the detectors scored lower than four on for all evaluated operations. A possible explanation for this is the lower percentage number of doctored pixels in these figures than in other modalities. The detectors’ low performance does not allow us to measure the impact of verbosity levels in the figures properly. However, as Figure 14 shows, graphs and indicative letters would also be one cause of false alarms in this modality.

Vii Conclusions

In addition to the daunting scenario of fraud in science —due to the increase of image misconduct cases—, there is a legal issue related to copyrights and judicial aspects that prevents one from creating a large collection of fraudulent scientific images, even for an in-depth forensic study to benchmark and drive the development of appropriate detection methods.

Therefore, this work introduced a library and a dataset to assist the scientific integrity and forensic community to overcome this legal hurdle. We believe that by presenting a large dataset to the forensic community, we are fostering the development of more complex and robust detection tools (e.g., AI-based models).

The proposed library implements the most common image manipulation forgeries described by scientific integrity researchers. Also, it is extendable to more complex tampering operations. As a special feature, the library generates an enriched ground-truth addressing all regions affected before and after applying a tampering function, assigning a unique ID for the regions involved (when applicable). Using this library on creative common scientific images, we created a dataset with 39,423 manipulated figures freely available.

Leveraging the dataset’s enriched ground-truth, we proposed a metric that avoids inconsistent detection during CMFD evaluation. Using this metric, we evaluate popular CMFD methods on our dataset. Although we choose high-cited and effective CMFD tools for natural images, all solutions presented a lower performance when transferred to the scientific image domain. This is not a fault of such algorithm as they were not designed for this specific setup. However, these findings show an important lack of methods and a tremendous research opportunity for new specialized detectors aiming at finding forgeries in scientific-related images. In addition, we notice that some of the chosen algorithms present complementary performance and might benefit from a fusion approach.

Notwithstanding the large size and diversity of the proposed dataset, we believe that science will report more sophisticated tampering operations in the near future, as warned by [17]. Thus, we are also concerned about more issue-less and freely available scientific integrity datasets with complex, enhanced, and realistic tampering modalities, aiding the design of more robust detectors.

Therefore, as future work, in addition to investigating robust forensic solutions using AI-based or fusion-based methods, we believe that studies on automated realistic scientific forgeries would also assist the forensic community in fighting scientific misconduct. Furthermore, we believe that the detailed pixel-wise ground-truth of RSIID opens a research opportunity to explore eXplainable AI solutions that might assist analysts on sensitive cases, such as misconduct investigations.


This research was supported by São Paulo Research Foundation (FAPESP), under the thematic project DéjàVu, grants 2017/12646-3 and 2020/02211-2.


  • [1] P. Azoulay, A. Bonatti, and J. L. Krieger (2017) The career effects of scandal: evidence from scientific retractions. Research Policy 46 (9), pp. 1552–1569. Cited by: §III.
  • [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. In ACM Transactions on Graphics (ToG), Vol. 28, pp. 24. Cited by: item 1.
  • [3] E.M. Bik, A. Casadevall, and F.C. Fang (2016) The prevalence of inappropriate image duplication. Biomedical Research publications 7 (3). Cited by: §I, §I, §II, §III-A, §IV, §VI.
  • [4] E. Bucci (2018) Automatic detection of image manipulations in the biomedical literature. Nature Cell death & disease 9 (3), pp. 400. Cited by: §IV.
  • [5] D. Chawla (2020-02) A single ‘paper mill’ appears to have churned out 400 papers, sleuths find. Science. External Links: Document, Link Cited by: §I.
  • [6] V. Christlein, C. Riess, J. Jordan, C. Riess, and E. Angelopoulou (2012-12) An evaluation of popular copy-move forgery detection approaches. IEEE Transactions on Information Forensics and Security 7 (6), pp. 1841–1854. External Links: Document, Link Cited by: §I, item 2.
  • [7] D. Cozzolino, G. Poggi, and L. Verdoliva (2015) Efficient dense-field copy-move forgery detection. IEEE Transactions on Information Forensics and Security 10 (11), pp. 2284–2297. Cited by: §I, item 1.
  • [8] A. Criminisi, P. Pérez, and K. Toyama (2004)

    Region filling and object removal by exemplar-based image inpainting

    IEEE Transactions on image processing 13 (9), pp. 1200–1212. Cited by: Fig. 3, item 2.
  • [9] T. Ehret (2018-07) Automatic detection of internal copy-move forgeries in images. Image Processing On Line 8, pp. 167–191. External Links: Document, Link Cited by: item 1.
  • [10] N. Gilbert (2009-10) Science journals crack down on image manipulation. Nature. External Links: Document, Link Cited by: §I.
  • [11] T. E. Koker, S. S. Chintapalli, S. Wang, B. A. Talbot, D. Wainstock, M. Cicconet, and M. C. Walsh (2021) On identification and retrieval of near-duplicate biological images: a new dataset and protocol. In

    International Conference on Pattern Recognition (ICPR)

    External Links: Link Cited by: §II.
  • [12] J. Krueger (2002-04) Forensic examination of questioned scientific images. Accountability in Research 9 (2), pp. 105–125. External Links: Document, Link Cited by: §I.
  • [13] A. Marcus (2019-12) Pitt researchers sue journal for defamation following retraction. External Links: Link Cited by: §III.
  • [14] P. Mongeon and V. Larivière (2013) The collective consequences of scientific fraud: an analysis of biomedical research. In Proceedings of ISSI 2013, Proceedings of the International Conference on Scientometrics and Informetrics. Vienna: Austrian Institute of Technology, pp. 1897–1899. Cited by: §III.
  • [15] D. Moreira, A. Bharati, J. Brogan, A. Pinto, M. Parowski, K. Bowyer, P. Flynn, A. Rocha, and W. Scheirer (2018) Image provenance analysis at scale. IEEE Transactions on Image Processing 27 (12), pp. 6109–6123. Cited by: §III-C.
  • [16] P. Naylor, M. Lae, F. Reyal, and T. Walter (2017-04) Nuclei segmentation in histopathology images using deep neural networks. In 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), External Links: Document, Link Cited by: item 3.
  • [17] C. Qi, J. Zhang, and P. Luo (2020)

    Emerging concern of scientific fraud: deep learning and image manipulation

    bioRxiv. Cited by: §III-A, §III-D, §VII.
  • [18] M. Rossner and K. Yamada (2004) What’s in a picture? The temptation of image manipulation. The Journal of Cell Biology 166 (1), pp. 11–15. Cited by: §I, §I, §III-A, §VI.
  • [19] Q. Schiermeier (2002-01) German task force outraged by changes to science fraud report. Nature 415 (6867), pp. 3–3. External Links: Document, Link Cited by: §I.
  • [20] M. Wjst (2021-02) Scientific integrity is threatened by image duplications. American Journal of Respiratory Cell and Molecular Biology 64 (2), pp. 271–272. External Links: Document, Link Cited by: §VI.
  • [21] Y. Wu, W. AbdAlmageed, and P. Natarajan (2018) BusterNet: detecting image copy-move forgery with source/target localization. In European Conference on Computer Vision (ECCV), Cited by: §I, item 3, §VI-A.
  • [22] Z. Xiang and D. Acuna (2020) Scientific image tampering detection based on noise inconsistencies: a method and datasets. arXiv preprint arXiv:2001.07799. Cited by: §II.