The PS-Battles Dataset - an Image Collection for Image Manipulation Detection

by Silvan Heller, et al.

The boost of available digital media has led to a significant increase in derivative work. With tools for manipulating objects becoming more and more mature, it can be very difficult to determine whether one piece of media was derived from another one or tampered with. As derivations can be done with malicious intent, there is an urgent need for reliable and easily usable tampering detection methods. However, even media considered semantically untampered by humans might have already undergone compression steps or light post-processing, making automated detection of tampering susceptible to false positives. In this paper, we present the PS-Battles dataset which is gathered from a large community of image manipulation enthusiasts and provides a basis for media derivation and manipulation detection in the visual domain. The dataset consists of 102'028 images grouped into 11'142 subsets, each containing the original image as well as a varying number of manipulated derivatives.




I Introduction

Media creation in the digital age is an increasingly distributed process in which the line between producer and consumer of media is increasingly blurred. In such a prosumer [11] ecosystem, material is often sourced from various places, reused, manipulated, and shared several times, which makes proper source attribution difficult. As image manipulation software evolves and its use becomes more widespread, there is a need to verify the effectiveness of manipulation detection algorithms against images created with a diverse spectrum of tools, manipulations, and levels of proficiency. However, not all tampering changes the semantic content of an image. Detecting JPEG compression and post-processing is not necessarily as relevant to users as detecting manipulations that change the content or context of an image. While a lot of work has been done to detect the presence of manipulations, we are not aware of out-of-the-box classifiers for tampered images. One promising avenue is machine learning, which however requires large amounts of data to work. We hope that providing a large, extensible dataset will stimulate research on automated classifiers.

To further research in the areas of modification detection, derivation detection, and source identification in the visual domain, we present the PS-Battles dataset. It comprises images sourced from the popular photoshopbattles subreddit, which is home to a large community of both amateur and professional digital artists who regularly hold contests in digital image manipulation. For every submitted original image, the community creates several, often humorous, derivative images or so-called photoshops, which are then judged by other members of the community. Examples of such original images and the community-created derivatives can be seen in Figure 1. Reddit is one of the most popular websites in the world, as of early 2018 ranking 7th globally and 4th in the US. The photoshopbattles community has 12.9 million subscribers, which makes it the 33rd largest community on reddit.

The presented dataset contains 11’142 subsets consisting of the original image as well as several corresponding photoshops for a total of 102’028 images. For every derivative image, the dataset contains additional metadata about the image’s author, the time of its creation, and its reception within the community. Since the photoshopbattles community is quite active, the dataset is extensible over time.

Fig. 1: Examples of Original and Derivative Images. Panels: (a) ’Original’ image; (b) ’at the barber’ by ’totalitarian_jesus’; (c) ’General Kenobi!’ by ’mandal0re’; (d) ’Original’ image; (e) ’Lies I tell you’ by ’GalacticBystander’; (f) ’Cosmic Selfie’ by ’-JenM-’.

The remainder of the paper is organized as follows: Section II reviews related work. Section III introduces the PS-Battles dataset and details some of its properties, and Section IV concludes.

II Related Work

Both visual near-duplicate detection and image tampering detection have gained interest over the past years. The authors of [1] provide an overview of image tampering detection techniques with a focus on passive or blind image forgery detection methods. They observe a lack of established benchmarks and of public testing databases that evaluate the actual accuracy of digital image forgery methods [1]. Most research reviewed evaluates either against automatically generated forgeries or against a small set of manually created derivatives. The authors are not aware of any large dataset for image tampering detection. [2], for instance, focuses on copy-move forgery detection and evaluates against a dataset with 48 base images and derivatives. [10] also provides an extensive overview of the field of digital forensics. Its authors likewise note that the evaluation of existing and new algorithms must be improved: the analysis of detection results in nearly all papers surveyed lacks the rigor […], making the assessment of their utility difficult [10]. Both [1] and [10] also provide an extensive overview of single and double JPEG compression detection. For originals in our dataset, there is no guarantee that no JPEG compression artifacts will be found. We consider this an advantage, as it better represents the real-world use case that tampering detection algorithms will face. The CASIA dataset [4] is probably the most similar to our proposed dataset and contains 12’614 images, of which 5’123 are tampered with. The advantage of our approach, however, is that the community we source our content from is still active, so the dataset is very likely to grow further in the future. Additionally, the untampered images of the CASIA dataset are completely untouched, which is an unrealistic assumption for real-world applications. There are other datasets, for instance the Copydays dataset [6, 7], which is generated using automated artificial attacks and used for copy detection. The CISDE [9] dataset provides 1’845 spliced picture blocks with a fixed size of 128 × 128. The spliced blocks lack context, however, making them semantically meaningless.

Recently, the RAISE dataset [3] was introduced, which contains 8’156 untampered high-resolution raw images. There, the authors also discuss the lack of a comprehensive large-scale dataset. A recent paper focusing on image provenance [8] also introduces a dataset related to the photoshopbattles community. Their dataset, however, contains only 10’421 images and is focused on comment chains, where derivations of derivations are made.

The fact that our dataset consists of image derivatives generated using current industry-standard image manipulation techniques can, in certain instances, also be considered a disadvantage. [5], for example, has shown that it is possible to use deep learning to generate modifications of faces which are visually practically indistinguishable from original images. Such modifications have, however, not yet reached the mainstream and have not made their way into any image tampering dataset we are aware of.

III Dataset Description

Fig. 2: Derivatives per Original

The following section describes the structure and properties of the proposed PS-Battles dataset and also presents in detail how it was collected. The dataset itself can be obtained from GitHub via

III-A Overview

On average, the dataset contains 7.9 photoshops for every original image. Figure 2 shows the distribution of the number of derivative images per original.

As Figure 2 shows, many posts have only a small number of derivatives, and even the most popular posts do not exceed 67 derivatives. This is expected, as derivative creation is heavily moderated. The combined size of all images in the dataset is 40.2 GB. The distribution of the sizes of the individual files by file type can be seen in Figure 4.
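
As an illustration of how the per-original statistics above could be recomputed from the metadata, the following minimal Python sketch counts derivatives per original. It assumes that each photoshop record exposes the id of its original image; the function name and the toy ids are hypothetical and not part of the dataset tooling.

```python
from collections import Counter
from statistics import mean


def derivatives_per_original(original_ids):
    """Count how many derivative images reference each original id."""
    return Counter(original_ids)


# Toy input: one entry per derivative, holding the id of its original (made-up ids).
toy_ids = ["a", "a", "b", "a", "c", "c"]
counts = derivatives_per_original(toy_ids)

average = mean(counts.values())  # average number of derivatives per original
largest = max(counts.values())   # size of the largest group of derivatives
```

Note that originals without any derivative never appear in such a count, so an average computed this way would only cover originals that have at least one photoshop.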

As the dataset is community-generated, images vary in resolution and aspect ratio. Figure 3 shows the distribution of width and height of the images in pixels.

Fig. 3: Image Width and Height Density

Throughout the dataset, image height ranges from 136 pixels to 20’000 pixels, while image width ranges from 68 pixels in the narrowest image to 12’024 pixels in the widest. We consider this diversity in image dimensions beneficial, as it makes the dataset more challenging.

Fig. 4: Size per File Type

III-B Collection Method

In order to compile the dataset, we used a publicly available dump of reddit content from which all posts and corresponding comments of the photoshopbattles subreddit were extracted. The moderators of the subreddit ensure that every post contains a link to the original image and every top-level comment contains a link to a photoshop. Lower-level comments are then used by the community to discuss the manipulated images.

We only considered posts and comments with a score above 20, both to filter spam not already caught by community moderation and to ensure a minimum quality of manipulations. Figure 5 shows the distribution of top-level domains. For the purpose of creating a large dataset, only supporting would have been sufficient. All images except one were in PNG or JPEG format. We excluded the single WEBP image to reduce processing complexity. We also left out all images whose size, when crawled, was less than 10 kB, since those are mostly either removed images or thumbnails whose quality is too low for meaningful analysis.
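
The three filters described above (score, file format, file size) can be sketched as a single predicate. This is a minimal illustration rather than the collection code itself: the `Candidate` record and its field names are hypothetical, and interpreting 10 kB as 10,000 bytes is an assumption.

```python
from dataclasses import dataclass

MIN_SCORE = 20                     # the text keeps items with a score above 20
MIN_BYTES = 10 * 1000              # "10 kB"; 1 kB = 1000 bytes is an assumption
ALLOWED_FORMATS = {"png", "jpeg"}  # the single WEBP image was excluded


@dataclass
class Candidate:
    """Hypothetical record for one crawled image; field names are illustrative."""
    url: str
    score: int
    image_format: str  # lower-case format name detected after download
    num_bytes: int


def keep(candidate: Candidate) -> bool:
    """Return True if the candidate passes all three filters."""
    return (
        candidate.score > MIN_SCORE
        and candidate.image_format in ALLOWED_FORMATS
        and candidate.num_bytes >= MIN_BYTES
    )
```

A list of candidates could then be filtered with `[c for c in candidates if keep(c)]`.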

III-C Structure

The git repository mentioned above contains two primary sources of metadata – originals.tsv and photoshops.tsv – describing the original images and their resulting photoshops, respectively. The metadata for the original images contains the image URL, a unique id, the file size in bytes, a reference to the reddit post where the image was used, the username of the poster, the community score as contained in the data dump, and the image dimensions. It also contains a checksum to validate whether the image was downloaded correctly. The metadata for the derived images has the same structure but additionally contains, for every derivative, the id of the original image. Instead of referencing the corresponding reddit post, the metadata for the photoshops references the relevant top-level comment where the image was posted.
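
To illustrate how such metadata might be consumed, the following Python sketch parses tab-separated text with a header row and verifies a downloaded file against its checksum. The column names used in the example and the choice of SHA-256 are assumptions made for illustration; the actual header and hash algorithm should be taken from the files in the repository.

```python
import csv
import hashlib
import io


def read_metadata(tsv_text: str) -> list[dict[str, str]]:
    """Parse tab-separated metadata with a header row into one dict per image."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def checksum_matches(data: bytes, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Compare the digest of downloaded bytes with the checksum from the metadata.

    The default algorithm is an assumption, not a documented property of the dataset.
    """
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected_hex.lower()


# Toy example with hypothetical column names and made-up values.
sample = "id\toriginal\tfilesize\nx1\to1\t20480\nx2\to1\t30720\n"
rows = read_metadata(sample)
photoshops_of_o1 = [row["id"] for row in rows if row["original"] == "o1"]
```

Grouping the photoshop rows by the id of their original, as in the last line, yields the subsets of one original plus its derivatives described earlier.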

Fig. 5: Distribution of TLDs

Running the provided script will use these metadata files in order to obtain the images from their respective sources and place them into the dataset directory. The original images will be placed into the originals subdirectory and renamed in accordance with their id while the photoshops will end up in a directory corresponding to the id of their original image beneath dataset/photoshops. Therefore, all derivations of the image
dataset/originals/(id).(filetype) will be located in the directory

III-D Discussion

As the photoshopbattles community is very active, new iterations of the dataset will grow in size. Very recently, the community has started to accept GIF manipulations as submissions. Including those in a new version of the dataset would be interesting for the domain of video tampering detection, which is not discussed here. Another possibility for new versions of the dataset is to manually filter the comment chains for derivatives of the created photoshops, as the authors of [8] have done on a small subset of photoshops.

IV Conclusion

In this paper, we introduced the PS-Battles dataset. It contains 11’142 original images and 91’886 derivatives of those images from the photoshopbattles subreddit. The dataset is intended to provide a long-lasting benchmark for image tampering detection and derivative detection methods. Given the wide range of derivatives with regard to semantics, methods, and skill, we expect the dataset to provide a significant challenge for tampering detection methods.


This work was partly supported by the Chist-Era project IMOTION with contributions from the Swiss National Science Foundation (SNSF, contract no. 20CH21_151571).


  • [1] Gajanan K Birajdar and Vijay H Mankar. Digital image forgery detection using passive techniques: A survey. Digital Investigation, 10(3):226–245, 2013.
  • [2] Vincent Christlein, Christian Riess, Johannes Jordan, Corinna Riess, and Elli Angelopoulou. An evaluation of popular copy-move forgery detection approaches. IEEE Transactions on Information Forensics and Security, 7(6):1841–1854, 2012.
  • [3] Duc-Tien Dang-Nguyen, Cecilia Pasquini, Valentina Conotter, and Giulia Boato. RAISE: A raw images dataset for digital image forensics. In Proceedings of the 6th ACM Multimedia Systems Conference (MMSys 2015), pages 219–224, Portland, OR, USA, March 2015. ACM.
  • [4] Jing Dong, Wei Wang, and Tieniu Tan. CASIA image tampering detection evaluation database. In Proceedings of the IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP 2013), pages 422–426, Beijing, China, July 2013. IEEE.
  • [5] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pages 4700–4708, Honolulu, HI, USA, 2017.
  • [6] Hervé Jégou. Inria Copydays dataset, 2018. Accessed: 2018-02-21.
  • [7] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Proceedings of the 10th European Conference on Computer Vision (ECCV 2008), Part I, pages 304–317, Marseille, France, 2008. Springer.
  • [8] Daniel Moreira, Aparna Bharati, Joel Brogan, Allan Pinto, Michael Parowski, Kevin W Bowyer, Patrick J Flynn, Anderson Rocha, and Walter J Scheirer. Image provenance analysis at scale. arXiv preprint arXiv:1801.06510, 2018.
  • [9] Tian-Tsong Ng, Jessie Hsu, and Shih-Fu Chang. Columbia image splicing detection evaluation dataset. AuthSplicedDataSet/AuthSplicedDataSet.htm. Accessed: 2018-02-21.
  • [10] Anderson Rocha, Walter Scheirer, Terrance Boult, and Siome Goldenstein. Vision of the unseen: Current trends and challenges in digital image and video forensics. ACM Computing Surveys (CSUR), 43(4):26, 2011.
  • [11] Alvin Toffler. The Third Wave. Random House Value Publishing, 1987.