APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning

Cao, Guiming; Shi, Kaize; Fu, Hong; Zhang, Huaiwen; Xu, Guandong

arXiv:2401.06827 (cs)

[Submitted on 12 Jan 2024 (v1), last revised 23 Jan 2024 (this version, v2)]

Abstract: Pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks. Existing research has explored many characteristics of V-L models, including their sensitivity to text input and the difficulty of tuning across multi-modal prompts. Building on V-L models such as CLIP, recent approaches deploy learnable prompts instead of hand-crafted prompts to boost generalization performance and address these challenges. Inspired by layer-wise training, which is widely used in image fusion, we note that using a sequential training process to adapt the different modality branches of CLIP efficiently improves generalization. To address the multi-modal prompting challenge, we propose Token-wise Adaptive for Multi-modal Prompt Learning (APLe), which tunes the prompts of both modalities, vision and language, as tokens in a sequential manner. APLe addresses the challenges in V-L models to promote prompt learning across both modalities, and achieves generalization performance competitive with the state of the art. Notably, APLe shows robustness and favourable performance in prompt-length experiments, with an absolute advantage in adopting V-L models.

Comments:	7 pages, 3 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2401.06827 [cs.CV]
	(or arXiv:2401.06827v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.06827

Submission history

From: Guiming Cao
[v1] Fri, 12 Jan 2024 04:54:01 UTC (437 KB)
[v2] Tue, 23 Jan 2024 08:54:15 UTC (425 KB)
