From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models

Guo, Jiaxian; Li, Junnan; Li, Dongxu; Tiong, Anthony Meng Huat; Li, Boyang; Tao, Dacheng; Hoi, Steven C. H.

arXiv:2212.10846 (cs)

[Submitted on 21 Dec 2022 (v1), last revised 8 May 2023 (this version, v3)]

Abstract: Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, using LLMs effectively for zero-shot visual question answering (VQA) remains challenging, primarily because of the modality disconnection and the task disconnection between LLMs and the VQA task. End-to-end training on vision and language data may bridge these disconnections, but it is inflexible and computationally expensive. To address this issue, we propose Img2Prompt, a plug-and-play module that provides prompts bridging both disconnections, so that LLMs can perform zero-shot VQA without end-to-end training. To produce such prompts, we employ LLM-agnostic models that generate descriptions of the image content and self-constructed question-answer pairs, which effectively guide the LLM to perform zero-shot VQA. Img2Prompt offers the following benefits: 1) It works flexibly with various LLMs to perform VQA. 2) Because it needs no end-to-end training, it significantly reduces the cost of deploying LLMs for zero-shot VQA. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo by 5.6% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20%.
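
The idea of feeding a frozen LLM a purely textual prompt can be illustrated with a minimal sketch. It assumes the image captions and the self-constructed question-answer pairs have already been produced by separate models; the function name `build_vqa_prompt` and the `Contexts:` / `Question:` / `Answer:` template are hypothetical choices made for illustration, not necessarily the format used in the paper.

```python
from typing import List, Tuple


def build_vqa_prompt(
    captions: List[str],
    qa_pairs: List[Tuple[str, str]],
    question: str,
) -> str:
    """Assemble a text-only VQA prompt (hypothetical template, for illustration)."""
    # Captions express the image content as text, addressing the modality gap.
    context = " ".join(c.strip() for c in captions if c.strip())
    lines = [f"Contexts: {context}"]
    # Synthetic QA pairs serve as in-context exemplars, addressing the task gap.
    for q, a in qa_pairs:
        lines.append(f"Question: {q.strip()} Answer: {a.strip()}")
    # The user's question comes last; the answer is left for the LLM to complete.
    lines.append(f"Question: {question.strip()} Answer:")
    return "\n".join(lines)


# Example inputs are made up; in practice they would come from captioning
# and question-generation models run on the image.
prompt = build_vqa_prompt(
    captions=["A man rides a red bicycle.", "A bicycle on a wet street."],
    qa_pairs=[("What is the man riding?", "bicycle")],
    question="What color is the bicycle?",
)
print(prompt)
```

The resulting string could then be passed to any text-completion LLM without changing its weights, which is what makes this kind of approach plug-and-play.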

Comments:	CVPR 2023 Camera Ready Version
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2212.10846 [cs.CV]
	(or arXiv:2212.10846v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2212.10846

Submission history

From: JiaXian Guo
[v1] Wed, 21 Dec 2022 08:39:36 UTC (11,078 KB)
[v2] Sat, 4 Mar 2023 12:42:54 UTC (11,073 KB)
[v3] Mon, 8 May 2023 06:04:04 UTC (14,931 KB)
