What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases

Tiong, Anthony Meng Huat; Zhao, Junqi; Li, Boyang; Li, Junnan; Hoi, Steven C. H.; Xiong, Caiming

arXiv:2404.02415 (cs)

[Submitted on 3 Apr 2024]

Abstract: Vision-language (VL) models, pretrained on colossal image-text datasets, have attained broad VL competence that is difficult to evaluate. A common belief is that a small number of VL skills underlie the variety of VL tests. In this paper, we perform a large-scale transfer learning experiment aimed at discovering latent VL skills from data. We reveal interesting characteristics that have important implications for test suite design. First, generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths. Second, we demonstrate that factor analysis successfully identifies reasonable yet surprising VL skill factors, suggesting benchmarks could leverage similar analyses for task selection. Finally, we present a new dataset, OLIVE (this https URL), which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested. Our findings contribute to the design of balanced and broad-coverage vision-language evaluation methods.
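To make the idea of recovering latent skill factors from test scores concrete, the following is a minimal sketch on synthetic data. It is not the authors' pipeline: the abstract does not specify the estimation procedure, so the sketch uses the principal-component method of factor extraction (an eigendecomposition of the task correlation matrix, without rotation). The function name `extract_factors`, the matrix sizes, the noise level, and the two hypothetical "skills" are all assumptions chosen for illustration.

```python
import numpy as np


def extract_factors(scores, n_factors):
    """Principal-component factor extraction (an illustrative stand-in).

    scores: array of shape (n_rows, n_tasks), one row per evaluated model
            variant and one column per test task.
    Returns (loadings, explained), where loadings has shape
    (n_tasks, n_factors) and explained holds the share of total variance
    captured by each factor.
    """
    # Correlations between task columns; this standardizes each task,
    # so tasks scored with different metrics become comparable.
    corr = np.corrcoef(scores, rowvar=False)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_factors]
    # Scale eigenvectors by sqrt(eigenvalue) to obtain factor loadings.
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    explained = eigvals[order] / corr.shape[0]
    return loadings, explained


# Synthetic setup (assumed): two hidden skills drive six tasks.
# Tasks 0-3 depend on skill A, tasks 4-5 depend on skill B.
rng = np.random.default_rng(0)
n_rows, n_tasks = 300, 6
skills = rng.normal(size=(n_rows, 2))
true_loadings = np.zeros((n_tasks, 2))
true_loadings[:4, 0] = 1.0
true_loadings[4:, 1] = 1.0
noise = 0.3 * rng.normal(size=(n_rows, n_tasks))
scores = skills @ true_loadings.T + noise

loadings, explained = extract_factors(scores, n_factors=2)
# For each task, the index of the factor with the largest absolute loading.
dominant = np.abs(loadings).argmax(axis=1)
print(np.round(loadings, 2))
print(dominant, np.round(explained, 2))
```

Tasks that load on the same factor are candidates for measuring the same underlying skill, which is the kind of signal a benchmark designer could use to avoid selecting redundant tasks. The two synthetic skills are deliberately given different numbers of tasks, because without a rotation step, factors of nearly equal strength are not separated cleanly by this method.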

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.02415 [cs.CV]
	(or arXiv:2404.02415v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.02415

Submission history

From: Anthony Meng Huat Tiong
[v1] Wed, 3 Apr 2024 02:40:35 UTC (610 KB)
