Will Large-scale Generative Models Corrupt Future Datasets?

11/15/2022
by Ryuichiro Hataya, et al.

Recently proposed large-scale text-to-image generative models such as DALL·E 2, Midjourney, and StableDiffusion can generate high-quality, realistic images from users' prompts. These models are popular not only in the research community but also among ordinary Internet users, and consequently a tremendous number of generated images have been shared on the Internet. Meanwhile, today's success of deep learning in computer vision owes a lot to images collected from the Internet. These trends lead us to a research question: "will such generated images impact the quality of future datasets and the performance of computer vision models positively or negatively?" This paper empirically answers this question by simulating contamination. Namely, we generate ImageNet-scale and COCO-scale datasets using a state-of-the-art generative model and evaluate models trained on “contaminated” datasets on various tasks, including image classification and image generation. Through our experiments, we conclude that generated images negatively affect downstream performance, while the significance of the effect depends on the task and the amount of generated images. The generated datasets are available via https://github.com/moskomule/dataset-contamination.
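
As a rough illustration of what "simulating contamination" could look like in practice, the Python sketch below replaces a chosen fraction of a real dataset with generated counterparts. The abstract does not spell out the mixing procedure, so the index-aligned replacement scheme, the function name `contaminate`, and the file names are all assumptions made for this example, not the authors' implementation.

```python
import random
from typing import List, Sequence, Tuple


def contaminate(
    real: Sequence[str],
    generated: Sequence[str],
    ratio: float,
    seed: int = 0,
) -> List[Tuple[str, bool]]:
    """Replace a fraction `ratio` of real samples with generated ones.

    Hypothetical helper: assumes `generated[i]` is the synthetic
    counterpart of `real[i]` (for example, same class or same caption).
    Returns (sample, is_generated) pairs in the original order.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    if len(real) != len(generated):
        raise ValueError("real and generated must have the same length")

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_replace = round(ratio * len(real))
    replaced = set(rng.sample(range(len(real)), n_replace))

    return [
        (generated[i], True) if i in replaced else (real[i], False)
        for i in range(len(real))
    ]


# Illustrative file names only; they do not refer to the released datasets.
real_images = [f"real_{i}.jpg" for i in range(10)]
generated_images = [f"generated_{i}.jpg" for i in range(10)]
mixed = contaminate(real_images, generated_images, ratio=0.2)
```

Sweeping `ratio` from 0 to 1 and training one model per setting is one way to observe how downstream performance changes with the amount of generated data, which is the kind of dependence the abstract reports.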
