Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

Sun, Zeyi; Wu, Tong; Zhang, Pan; Zang, Yuhang; Dong, Xiaoyi; Xiong, Yuanjun; Lin, Dahua; Wang, Jiaqi


arXiv:2406.00093 (cs)

[Submitted on 31 May 2024 (v1), last revised 3 Oct 2024 (this version, v2)]

Abstract: Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D objects with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images based on constructed text prompts, and (2) our fine-tuned 3D-aware MV-LLaVA for filtering high-quality data and rewriting inaccurate captions. Leveraging this pipeline, we have generated 1 million high-quality synthetic multi-view images with dense descriptive captions to address the shortage of high-quality 3D data. Furthermore, we present a Training Timestep Reschedule (TTR) strategy that leverages the denoising process to learn multi-view consistency while maintaining the original 2D diffusion prior. Extensive experiments demonstrate that Bootstrap3D can generate high-quality multi-view images with superior aesthetic quality, image-text alignment, and maintained view consistency.
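The filter-and-rewrite stage of the data pipeline described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: `MultiViewSample`, `filter_and_recaption`, `toy_judge`, and the `min_score` threshold are all hypothetical names, and the judge is a placeholder for whatever quality scorer and caption rewriter (MV-LLaVA in the paper) is actually used.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class MultiViewSample:
    """Hypothetical record for one generated multi-view sample."""
    prompt: str               # text prompt used to generate the views
    views: Tuple[str, ...]    # identifiers or paths of the view images
    caption: str              # caption currently attached to the sample


# Assumed judge interface: returns (quality score, rewritten caption or None).
Judge = Callable[[MultiViewSample], Tuple[float, Optional[str]]]


def filter_and_recaption(
    samples: Iterable[MultiViewSample],
    judge: Judge,
    min_score: float,
) -> List[MultiViewSample]:
    """Keep samples scoring at least min_score; apply any rewritten caption."""
    kept: List[MultiViewSample] = []
    for sample in samples:
        score, new_caption = judge(sample)
        if score < min_score:
            continue  # low-quality sample: drop it
        if new_caption:
            sample = replace(sample, caption=new_caption)
        kept.append(sample)
    return kept


def toy_judge(sample: MultiViewSample) -> Tuple[float, Optional[str]]:
    """Stand-in judge: scores by view count, rewrites only empty captions."""
    score = float(len(sample.views))
    new_caption = None if sample.caption else f"An object matching: {sample.prompt}"
    return score, new_caption


samples = [
    MultiViewSample("a red chair", ("v0", "v1", "v2", "v3"), ""),
    MultiViewSample("a blue car", ("v0",), "a car"),
]
kept = filter_and_recaption(samples, toy_judge, min_score=4.0)
```

With the toy judge, only the four-view sample passes the threshold, and its empty caption is replaced. In a real pipeline the judge would wrap a vision-language model call, and the threshold would be tuned on held-out data.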

Comments:	Project Page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2406.00093 [cs.CV]
	(or arXiv:2406.00093v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.00093

Submission history

From: Pan Zhang
[v1] Fri, 31 May 2024 17:59:56 UTC (21,435 KB)
[v2] Thu, 3 Oct 2024 08:20:17 UTC (23,701 KB)
