SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

Huang, Yuzhou; Xie, Liangbin; Wang, Xintao; Yuan, Ziyang; Cun, Xiaodong; Ge, Yixiao; Zhou, Jiantao; Dong, Chao; Huang, Rui; Zhang, Ruimao; Shan, Ying

Computer Science > Computer Vision and Pattern Recognition

arXiv:2312.06739 (cs)

[Submitted on 11 Dec 2023]

Abstract: Current instruction-based editing methods, such as InstructPix2Pix, often fail to produce satisfactory results in complex scenarios because they depend on the simple CLIP text encoder in diffusion models. To rectify this, this paper introduces SmartEdit, a novel approach to instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities. However, direct integration of these elements still faces challenges in situations requiring complex reasoning. To mitigate this, we propose a Bidirectional Interaction Module that enables comprehensive bidirectional information interactions between the input image and the MLLM output. During training, we first incorporate perception data to boost the perception and understanding capabilities of diffusion models. We then demonstrate that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions. We further construct a new evaluation dataset, Reason-Edit, specifically tailored for complex instruction-based image editing. Both quantitative and qualitative results on this evaluation dataset indicate that SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing.

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.06739 [cs.CV]
	(or arXiv:2312.06739v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.06739

Submission history

From: Yuzhou Huang
[v1] Mon, 11 Dec 2023 17:54:11 UTC (26,531 KB)
