Will Large-scale Generative Models Corrupt Future Datasets?

Hataya, Ryuichiro; Bao, Han; Arai, Hiromi

arXiv:2211.08095 (cs)

[Submitted on 15 Nov 2022 (v1), last revised 10 Aug 2023 (this version, v2)]

Abstract: Recently proposed large-scale text-to-image generative models such as DALL·E 2, Midjourney, and Stable Diffusion can generate high-quality, realistic images from users' prompts. Their use is not limited to the research community: ordinary Internet users also enjoy these models, and consequently a tremendous number of generated images have been shared on the Internet. Meanwhile, today's success of deep learning in computer vision owes a lot to images collected from the Internet. These trends lead us to a research question: "Will such generated images impact the quality of future datasets and the performance of computer vision models positively or negatively?" This paper answers the question empirically by simulating contamination. Namely, we generate ImageNet-scale and COCO-scale datasets using a state-of-the-art generative model and evaluate models trained on the "contaminated" datasets on various tasks, including image classification and image generation. From these experiments, we conclude that generated images negatively affect downstream performance, while the significance of the effect depends on the task and the amount of generated images. Generated datasets and source code are available from this https URL.
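The contamination simulation described in the abstract can be pictured with a small sketch. The abstract does not specify how real and generated images are combined, so the code below assumes one simple scheme: a fixed-size dataset in which a chosen fraction of real samples is replaced by generated ones. The function name `contaminate`, its parameters, and the string placeholders standing in for images are all hypothetical and are not taken from the authors' released code.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def contaminate(
    real: Sequence[T],
    generated: Sequence[T],
    ratio: float,
    seed: int = 0,
) -> List[Tuple[T, bool]]:
    """Build a dataset of len(real) items in which a `ratio` fraction is generated.

    Assumption (not from the paper): generated samples replace real ones,
    so the dataset size stays fixed. Each item is returned with a flag
    that is True when the item is a generated sample.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    n_generated = round(len(real) * ratio)
    if n_generated > len(generated):
        raise ValueError("not enough generated samples for this ratio")

    rng = random.Random(seed)  # local RNG so the mix is reproducible
    kept_real = rng.sample(list(real), len(real) - n_generated)
    picked_generated = rng.sample(list(generated), n_generated)

    mixed = [(x, False) for x in kept_real] + [(x, True) for x in picked_generated]
    rng.shuffle(mixed)
    return mixed


# Placeholder "images": file-name strings instead of pixel data.
real_images = [f"real_{i}.jpg" for i in range(10)]
generated_images = [f"generated_{i}.png" for i in range(10)]
mixed_dataset = contaminate(real_images, generated_images, ratio=0.4)
```

Training the same model on datasets built with several values of `ratio` and comparing downstream metrics would mirror, in miniature, the kind of comparison the abstract describes.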

Comments:	ICCV 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2211.08095 [cs.CV]
	(or arXiv:2211.08095v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2211.08095

Submission history

From: Ryuichiro Hataya
[v1] Tue, 15 Nov 2022 12:25:33 UTC (15,120 KB)
[v2] Thu, 10 Aug 2023 00:22:27 UTC (17,980 KB)
