Concept-skill Transferability-based Data Selection for Large Vision-Language Models

Jaewoo Lee; Boyang Li; Sung Ju Hwang

Abstract

Instruction tuning, or supervised finetuning on extensive task-specific data, is necessary for Large Vision-Language Models (LVLMs) to generalize well across a broad range of vision-language (VL) tasks. However, training on large VL datasets can become prohibitively expensive. In this work, we introduce COINCIDE, an effective and scalable data selection technique that uses a small model as a reference model to select visual instruction tuning data for efficient finetuning of a target LVLM, focusing on diversity and transferability. Specifically, we cluster the training data using internal activations from a small model, which identifies the VL concept-skill compositions needed by a target LVLM. We then sample data from these diverse clusters by considering their density and transferability, that is, their ability to transfer well to other concept-skill compositions. This approach ensures the diversity of these compositions, which is vital for LVLM generalization. Extensive experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency compared with eight strong baselines on two distinct datasets: LLaVA-1.5 and Vision-Flan. Using only 20% of the LLaVA-1.5 dataset, COINCIDE achieves performance comparable to that of an LVLM finetuned on the whole dataset, with a 70% reduction in wall-clock running time. On the Vision-Flan dataset, our method achieves superior results with only 16.7% of the training data.
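
The cluster-then-sample idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the clustering algorithm or the scoring formula, so the k-means routine, the density proxy (inverse mean distance to the centroid), the transferability proxy (mean cosine similarity to other centroids), the softmax weighting, and the names kmeans and select_subset are all assumptions made for illustration. Plain feature vectors stand in for the small model's internal activations.

```python
import math
import random


def _dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _cosine(a, b):
    """Cosine similarity; zero-norm vectors are treated as having norm 1."""
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def kmeans(features, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (assumed clustering step); returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(features[i]) for i in rng.sample(range(len(features)), k)]
    labels = [0] * len(features)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: _dist(f, centroids[j])) for f in features]
        # Move each non-empty cluster's centroid to the mean of its members.
        for j in range(k):
            members = [f for f, lab in zip(features, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def select_subset(features, k, budget, temperature=0.5, seed=0):
    """Pick at most `budget` indices, favouring transferable, less dense clusters."""
    rng = random.Random(seed)
    labels, centroids = kmeans(features, k, seed=seed)
    clusters = [[i for i, lab in enumerate(labels) if lab == j] for j in range(k)]

    scores = []
    for j, idx in enumerate(clusters):
        if not idx:
            scores.append(None)  # empty clusters receive no budget
            continue
        # Hypothetical transferability proxy: mean cosine similarity to other centroids.
        others = [_cosine(centroids[j], centroids[m]) for m in range(k) if m != j]
        transfer = sum(others) / len(others) if others else 0.0
        # Hypothetical density proxy: inverse of the mean member-to-centroid distance.
        spread = sum(_dist(features[i], centroids[j]) for i in idx) / len(idx)
        density = 1.0 / (spread + 1e-6)
        scores.append(transfer / (temperature * density))

    # Softmax over cluster scores (assumed weighting), shifted for numerical stability.
    top = max(s for s in scores if s is not None)
    weights = [0.0 if s is None else math.exp(s - top) for s in scores]
    total = sum(weights)

    selected = []
    for idx, w in zip(clusters, weights):
        # Round the quota down and cap it at the cluster size.
        quota = min(int(budget * w / total), len(idx))
        selected.extend(rng.sample(idx, quota))
    return sorted(selected)


if __name__ == "__main__":
    gen = random.Random(1)
    demo = [[gen.gauss(0, 1) for _ in range(4)] for _ in range(120)]
    subset = select_subset(demo, k=4, budget=30)
    print(len(subset), "examples selected out of", len(demo))
```

Because quotas are rounded down and capped at the cluster size, the sketch may return fewer examples than the budget; a fuller version would redistribute the remainder across clusters.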

Anthology ID: 2024.emnlp-main.291
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 5060–5080
URL: https://aclanthology.org/2024.emnlp-main.291
Cite (ACL): Jaewoo Lee, Boyang Li, and Sung Ju Hwang. 2024. Concept-skill Transferability-based Data Selection for Large Vision-Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5060–5080, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Concept-skill Transferability-based Data Selection for Large Vision-Language Models (Lee et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.291.pdf
