Media creation in the digital age is an increasingly distributed process in which the line between producer and consumer of media becomes ever more blurred. In such a prosumer [11] ecosystem, material is often sourced from various places, reused, manipulated, and shared several times, which makes proper source attribution difficult. As image manipulation software evolves and its use becomes more widespread, there is a need to verify the effectiveness of manipulation detection algorithms against images created with a diverse spectrum of tools, manipulations, and levels of proficiency. However, not all tampering changes the semantic content of an image. Detecting JPEG compression and post-processing is not necessarily as relevant to users as detecting manipulations which change the content or context of an image. While a lot of work has been done to detect the presence of manipulations, we are not aware of out-of-the-box classifiers for tampered images. One promising avenue is machine learning, which however requires large amounts of data to work. We hope that providing a large, extensible dataset will stimulate research on automated classifiers.
To further research in the areas of modification detection, derivation detection, and source identification in the visual domain, we present the PS-Battles dataset. It comprises images sourced from the popular photoshopbattles subreddit (https://www.reddit.com/r/photoshopbattles/), which is home to a large community of both amateur and professional digital artists who regularly hold contests in digital image manipulation. For every submitted original image, the community creates several, often humorous, derivative images, so-called photoshops, which are then judged by other members of the community. Examples of such original images and the community-created derivatives can be seen in Figure 1. Reddit (reddit.com) is one of the most popular websites in the world, as of early 2018 ranking 7th globally and 4th in the US (https://www.alexa.com/siteinfo/reddit.com). The photoshopbattles community has 12.9 million subscribers, which makes it the 33rd largest community on reddit (http://redditmetrics.com/r/photoshopbattles).
The presented dataset contains 11’142 subsets, each consisting of an original image and several corresponding photoshops, for a total of 102’028 images. For every derivative image, the dataset contains additional metadata about the image’s author, the time of its creation, and its reception within the community. Since the photoshopbattles community is quite active, the dataset can be extended over time.
II Related Work
Both visual near-duplicate detection and image tampering detection have gained interest in recent years. The authors of [1] provide an overview of image tampering detection techniques with a focus on passive, or blind, image forgery detection methods. They observe “a lack of established benchmarks and of public testing databases which evaluates the actual accuracy of digital image forgery methods” [1]. Most research reviewed either evaluates against automatically generated forgeries or against a small set of manually created derivatives. The authors are not aware of any large dataset for image tampering detection. [2], for instance, focuses on copy-move forgery detection and evaluates against a dataset with 48 base images and derivatives. [10] also provides an extensive overview of the field of digital forensics. The authors also note that the evaluation of existing and new algorithms must be improved: “The analysis of detection results in nearly all papers surveyed lacks the rigor […], making the assessment of their utility difficult” [10]. Both [1] and [10] also provide an extensive overview of single and double JPEG compression detection. For the originals in our dataset, there is no guarantee that they are free of JPEG compression artifacts. We feel this is an advantage, as it better represents the real-world conditions tampering detection algorithms will face. The CASIA (http://forensics.idealtest.org/) dataset [4] is probably the most similar to our proposed dataset and contains 12’614 images, of which 5’123 are tampered with. The advantage of our approach, however, is that the community we source our content from is still active, so the dataset is very likely to grow further in the future. Additionally, the untampered images of the CASIA dataset are completely untouched, which is an unrealistic assumption for real-world applications. There are other datasets, for instance the Copydays dataset [6, 7], which is generated using automated artificial attacks and used for copy detection. The CISDE [9] dataset provides 1’845 spliced picture blocks with a fixed size of 128 × 128 pixels. The spliced blocks lack context, however, making them semantically meaningless.
Recently, the RAISE dataset [3] was introduced, which contains 8’156 untampered high-resolution raw images. There, the authors also discuss the lack of a comprehensive large-scale dataset. A recent paper focusing on image provenance [8] also introduces a dataset related to the photoshopbattles community. That dataset, however, contains only 10’421 images and focuses on comment chains, where derivations of derivations are made.
The fact that our dataset consists of image derivatives which have been generated using current industry-standard image manipulation techniques can in certain instances also be considered a disadvantage. Recent work has shown, for example, that it is possible to use deep learning to generate modifications of faces which are visually practically indistinguishable from original images. Such modifications have, however, not yet reached the mainstream and have not made their way into any image tampering dataset we are aware of.
III Dataset Description
The following section describes the structure and properties of the proposed PS-Battles dataset and also presents in detail how it was collected. The dataset itself can be obtained from GitHub via https://github.com/dbisUnibas/ps-battles.
On average, the dataset contains 7.9 photoshops for every original image. Figure 2 shows the distribution of the number of derivative images per original.
As can be seen in Figure 2, there are many posts with a small number of derivatives, and even the most popular posts do not exceed 67 derivatives. This is expected, as derivative creation is heavily moderated. The combined size of all images in the dataset is 40.2 GB. The distribution of the sizes of the individual files by file type can be seen in Figure 4.
As the dataset is community-generated, images vary in resolution and aspect ratio. Figure 3 shows the distribution of width and height of the images in pixels.
Throughout the dataset, image height ranges from 136 pixels to 20’000 pixels, while image width ranges from 68 pixels in the narrowest image to 12’024 pixels in the widest. We feel this diversity in image dimensions is beneficial, as it makes the dataset more challenging.
III-B Collection Method
To compile the dataset, we used a publicly available dump of reddit content (http://files.pushshift.io/reddit/) from which all posts and corresponding comments of the photoshopbattles subreddit were extracted. The moderators of the subreddit ensure that every post contains a link to the original image and every top-level comment contains a link to a photoshop. Lower-level comments are then used by the community to discuss the manipulated images.
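The extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: it assumes the dump is available as newline-delimited JSON with `subreddit` and `parent_id` fields, and it relies on the reddit convention that a comment whose `parent_id` starts with `t3_` replies directly to a post. The function names are hypothetical.

```python
import json

SUBREDDIT = "photoshopbattles"


def is_top_level(comment: dict) -> bool:
    """A comment is top-level when its parent is a post (prefix 't3_')
    rather than another comment (prefix 't1_')."""
    return comment.get("parent_id", "").startswith("t3_")


def extract_top_level_comments(lines):
    """Yield top-level photoshopbattles comments from an iterable of
    newline-delimited JSON records (assumed dump format)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("subreddit", "").lower() != SUBREDDIT:
            continue  # other communities are ignored
        if is_top_level(record):
            yield record
```

Passing an open file handle of a decompressed dump as `lines` would stream through it without loading everything into memory.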
We only considered posts and comments with a score above 20, both to filter spam which was not already caught by community moderation and to ensure a minimal quality of manipulations. Figure 5 shows the distribution of top-level domains. For the purpose of creating a large dataset, supporting only imgur.com would have been sufficient. All images except one were in PNG or JPEG format. We dropped the single WEBP image to reduce processing complexity. We also left out all images which, when crawled, had a file size of less than 10 kB, since those are mostly either removed images or thumbnails whose quality is too low for meaningful analysis.
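The selection criteria above can be summarized in a small predicate. This is a sketch under stated assumptions: the `Candidate` record and its field names are hypothetical, "above 20" is read as strictly greater than 20, and 10 kB is taken as 10,000 bytes.

```python
from dataclasses import dataclass

MIN_SCORE = 20                    # keep only scores strictly above this value
MIN_BYTES = 10 * 1000             # 10 kB, assuming decimal kilobytes
ALLOWED_FORMATS = {"png", "jpeg"}  # the single WEBP image is excluded


@dataclass
class Candidate:
    """Hypothetical record for one crawled image."""
    url: str
    score: int
    size_bytes: int
    image_format: str  # e.g. "png", "jpeg", "webp"


def keep(candidate: Candidate) -> bool:
    """Return True if the image passes the score, size, and format filters."""
    return (
        candidate.score > MIN_SCORE
        and candidate.size_bytes >= MIN_BYTES
        and candidate.image_format.lower() in ALLOWED_FORMATS
    )
```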
The git repository mentioned above contains two primary sources of metadata, originals.tsv and photoshops.tsv, describing the original images and their resulting photoshops, respectively. The metadata for the original images contains the image URL, a unique id, the file size in bytes, a reference to the reddit post where the image was used, the username of the post’s author, the community score as contained in the data dump, and the image dimensions. It also contains a checksum to validate whether the image was downloaded correctly. The metadata for the derived images has the same structure but additionally contains, for every derivative, the id of the original image. Instead of referencing the corresponding reddit post, the metadata for the photoshops references the top-level comment where the image was posted.
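As an illustration of how the two metadata files relate, the following sketch reads tab-separated rows and groups the photoshops by the id of their original. The column names used here (`id`, `original`, `url`) are assumptions made for the example; the actual headers in originals.tsv and photoshops.tsv may differ, so the grouping column is a parameter.

```python
import csv
import io
from collections import defaultdict


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def group_by_original(photoshops, original_column="original"):
    """Map each original id to the list of photoshop rows derived from it.
    `original_column` is a hypothetical column name."""
    groups = defaultdict(list)
    for row in photoshops:
        groups[row[original_column]].append(row)
    return dict(groups)


# Tiny in-memory stand-in for photoshops.tsv (made-up rows).
demo = io.StringIO(
    "id\toriginal\turl\n"
    "p1\to1\thttp://example.com/p1.jpg\n"
    "p2\to1\thttp://example.com/p2.png\n"
    "p3\to2\thttp://example.com/p3.jpg\n"
)
groups = group_by_original(read_tsv(demo))
```

With real data, `read_tsv(open("photoshops.tsv", newline="", encoding="utf-8"))` would replace the in-memory stand-in.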
Running the provided download.sh script uses these metadata files to obtain the images from their respective sources and place them into the dataset directory. The original images are placed into the originals subdirectory and renamed in accordance with their id, while the photoshops end up in a directory corresponding to the id of their original image beneath dataset/photoshops.
Therefore, all derivations of the image dataset/originals/(id).(filetype) will be located in the directory dataset/photoshops/(id)/.
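The directory layout and the checksum validation mentioned above can be sketched as follows. The helper names are hypothetical, and the hash algorithm is an assumption: the text does not state which checksum the metadata uses, so SHA-256 serves only as a placeholder default.

```python
import hashlib
from pathlib import PurePosixPath


def original_path(image_id: str, filetype: str, root: str = "dataset") -> PurePosixPath:
    """Location of an original image: <root>/originals/<id>.<filetype>."""
    return PurePosixPath(root) / "originals" / f"{image_id}.{filetype}"


def photoshops_dir(original_id: str, root: str = "dataset") -> PurePosixPath:
    """Directory holding all photoshops derived from one original."""
    return PurePosixPath(root) / "photoshops" / original_id


def checksum_matches(data: bytes, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Compare the digest of downloaded bytes with the expected hex digest.
    The algorithm is an assumption, not taken from the dataset."""
    return hashlib.new(algorithm, data).hexdigest() == expected_hex.lower()
```

A download loop could call `checksum_matches` on each fetched file and retry or skip the image when the digests differ.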
As the photoshopbattles community is very active, future iterations of the dataset will grow in size. Very recently, the community has started to accept GIF manipulations as submissions. Including those in a new version of the dataset would be interesting for the domain of video tampering detection, which is not discussed here. Another possibility for new versions of the dataset is to manually filter the comment chains for derivatives of the created photoshops, as the authors of [8] have done on a small subset of photoshops.
In this paper, we introduced the PS-Battles dataset. It contains 11’142 original images and 91’886 derivatives of those images from the photoshopbattles subreddit. The dataset is intended to provide a long-lasting benchmark for image tampering detection and derivative detection methods. Given the wide range of derivatives with regard to semantics, methods, and skill, we expect the dataset to provide a significant challenge for tampering detection methods.
This work was partly supported by the Chist-Era project IMOTION with contributions from the Swiss National Science Foundation (SNSF, contract no. 20CH21_151571).
- [1] Gajanan K Birajdar and Vijay H Mankar. Digital image forgery detection using passive techniques: A survey. Digital Investigation, 10(3):226–245, 2013.
- [2] Vincent Christlein, Christian Riess, Johannes Jordan, Corinna Riess, and Elli Angelopoulou. An evaluation of popular copy-move forgery detection approaches. IEEE Transactions on Information Forensics and Security, 7(6):1841–1854, 2012.
- [3] Duc-Tien Dang-Nguyen, Cecilia Pasquini, Valentina Conotter, and Giulia Boato. RAISE: A raw images dataset for digital image forensics. In Proceedings of the 6th ACM Multimedia Systems Conference (MMSys 2015), pages 219–224, Portland, OR, USA, March 2015. ACM.
- [4] Jing Dong, Wei Wang, and Tieniu Tan. CASIA image tampering detection evaluation database. In Proceedings of the IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP 2013), pages 422–426, Beijing, China, July 2013. IEEE.
- [5] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pages 4700–4708, Honolulu, HI, USA, 2017.
- [6] Hervé Jégou. Inria copydays dataset. http://lear.inrialpes.fr/~jegou/data.php#copydays, 2018. Accessed: 2018-02-21.
- [7] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Proceedings of the 10th European Conference on Computer Vision (ECCV 2008), Part I, pages 304–317, Marseille, France, 2008. Springer.
- [8] Daniel Moreira, Aparna Bharati, Joel Brogan, Allan Pinto, Michael Parowski, Kevin W Bowyer, Patrick J Flynn, Anderson Rocha, and Walter J Scheirer. Image provenance analysis at scale. arXiv preprint arXiv:1801.06510, 2018.
- [9] Tian-Tsong Ng, Jessie Hsu, and Shih-Fu Chang. Columbia image splicing detection evaluation dataset. http://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataSet.htm. Accessed: 2018-02-21.
- [10] Anderson Rocha, Walter Scheirer, Terrance Boult, and Siome Goldenstein. Vision of the unseen: Current trends and challenges in digital image and video forensics. ACM Computing Surveys (CSUR), 43(4):26, 2011.
- [11] Alvin Toffler. The Third Wave. Random House Value Publishing, 1987.