M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation

Chi, Xiaowei; Zhang, Rongyu; Jiang, Zhengkai; Liu, Yijiang; Wang, Yatian; Qi, Xingqun; Luo, Wenhan; Gao, Peng; Zhang, Shanghang; Liu, Qifeng; Guo, Yike

arXiv:2311.17963 (cs)

[Submitted on 29 Nov 2023 (v1), last revised 13 Apr 2024 (this version, v2)]

Abstract: While current LLM chatbots like GPT-4V bridge the gap between human instructions and visual representations to enable text-image generation, they still lack efficient alignment methods for high-fidelity performance on multiple downstream tasks. In this paper, we propose M$^{2}$Chat, a novel unified multimodal LLM framework for generating interleaved text-image conversation across various scenarios. Specifically, we propose an M$^{3}$Adapter that efficiently integrates granular low-level visual information and high-level semantic features from multi-modality prompts. Building on the well-aligned fused features, M$^{3}$Adapter uses a learnable gating strategy to adaptively balance model creativity and consistency across tasks. Moreover, to further enhance the effectiveness of M$^{3}$Adapter while preserving the coherence of semantic context comprehension, we introduce a two-stage fine-tuning strategy, M$^{3}$FT. This strategy optimizes disjoint groups of parameters for image-text alignment and visual instruction, respectively. Extensive experiments demonstrate that M$^{2}$Chat surpasses state-of-the-art counterparts across diverse benchmarks, showcasing its prowess in interleaved generation, storytelling, and multimodal dialogue systems. The demo and code are available at this https URL.
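
The abstract does not spell out how the learnable gate is computed. As a rough illustration of the general idea only, the sketch below blends a low-level feature vector and a high-level feature vector with a single sigmoid gate. The function names, the scalar `gate_logit` parameter, and the convex-combination form are all assumptions made for this example; they are not taken from the paper and may differ from the actual M$^{3}$Adapter design.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(low_level, high_level, gate_logit):
    """Blend two equal-length feature vectors with one scalar gate.

    Hypothetical illustration: `gate_logit` stands in for a learnable
    parameter. A gate near 1 favors the low-level features, and a gate
    near 0 favors the high-level features.
    """
    if len(low_level) != len(high_level):
        raise ValueError("feature vectors must have the same length")
    g = sigmoid(gate_logit)
    return [g * lo + (1.0 - g) * hi for lo, hi in zip(low_level, high_level)]


# Toy features (made-up numbers, for illustration only).
low = [1.0, 2.0, 3.0]
high = [3.0, 2.0, 1.0]

# A logit of 0 gives a gate of 0.5, so both inputs are weighted equally.
fused = gated_fusion(low, high, 0.0)
```

In a trained model the gate would typically be learned per task or per token rather than fixed, so that the blend can shift between staying close to the visual input and following the semantic prompt.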

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2311.17963 [cs.CV]
	(or arXiv:2311.17963v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.17963

Submission history

From: Xiaowei Chi
[v1] Wed, 29 Nov 2023 11:30:33 UTC (13,176 KB)
[v2] Sat, 13 Apr 2024 04:16:18 UTC (13,541 KB)
