VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Duan, Haodong; Yang, Junming; Qiao, Yuxuan; Fang, Xinyu; Chen, Lin; Liu, Yuan; Agarwal, Amit; Chen, Zhe; Li, Mo; Ma, Yubo; Sun, Hailong; Zhao, Xiangyu; Cui, Junbo; Dong, Xiaoyi; Zang, Yuhang; Zhang, Pan; Wang, Jiaqi; Lin, Dahua; Chen, Kai

arXiv:2407.11691 (cs)

[Submitted on 16 Jul 2024 (v1), last revised 11 Sep 2024 (this version, v2)]

Abstract: We present VLMEvalKit, an open-source, PyTorch-based toolkit for evaluating large multi-modality models. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement over 70 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 20 different multi-modal benchmarks. New models can be added by implementing a single interface; the toolkit then automatically handles the remaining workload, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently used mainly for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host the OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released at this https URL and is actively maintained.
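
To illustrate the single-interface idea described in the abstract, the following is a minimal Python sketch. It is not VLMEvalKit's actual API: the names `MultiModalModel`, `Sample`, `AlwaysA`, `extract_choice`, and `evaluate` are hypothetical and chosen only for illustration. It assumes a multiple-choice benchmark and shows how a harness could own inference, post-processing, and metric calculation once a model exposes one `generate` method. Data preparation and distributed inference are omitted.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class MultiModalModel(Protocol):
    """Hypothetical single interface: a model only implements `generate`."""

    def generate(self, image_path: str, prompt: str) -> str: ...


@dataclass
class Sample:
    image_path: str
    question: str
    answer: str  # ground-truth option letter, e.g. "A"


class AlwaysA:
    """Toy stand-in for a real model wrapper (assumption, for demonstration)."""

    def generate(self, image_path: str, prompt: str) -> str:
        return "The answer is A."


def extract_choice(prediction: str, choices: str = "ABCD") -> str:
    """Post-process free-form text into an option letter ('' if none found).

    Naive by design: the first standalone capital letter in `choices` wins.
    """
    cleaned = prediction.replace(".", " ").replace(",", " ")
    for token in cleaned.split():
        if len(token) == 1 and token in choices:
            return token
    return ""


def evaluate(model: MultiModalModel, samples: Sequence[Sample]) -> float:
    """Run inference, post-process each prediction, and return accuracy."""
    if not samples:
        return 0.0
    correct = 0
    for sample in samples:
        raw = model.generate(sample.image_path, sample.question)
        if extract_choice(raw) == sample.answer:
            correct += 1
    return correct / len(samples)
```

Under these assumptions, adding a new model means writing one class with a `generate` method; the `evaluate` function stays unchanged across models.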

Comments:	Updated on 2024.09.12
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.11691 [cs.CV]
	(or arXiv:2407.11691v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.11691

Submission history

From: Haodong Duan
[v1] Tue, 16 Jul 2024 13:06:15 UTC (866 KB)
[v2] Wed, 11 Sep 2024 17:10:36 UTC (916 KB)