MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation

He, Zexue; Wang, Yu; Yan, An; Liu, Yao; Chang, Eric Y.; Gentili, Amilcare; McAuley, Julian; Hsu, Chun-Nan


arXiv:2310.14088 (cs)

[Submitted on 21 Oct 2023 (v1), last revised 14 Nov 2023 (this version, v3)]



Abstract: Curated datasets for healthcare are often limited due to the need for human annotations from experts. In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. MedEval is comprehensive: it consists of data from several healthcare systems and spans 35 human body regions from 8 examination modalities. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, enabling fine-grained use of the data and supporting a wide range of tasks. Moreover, we systematically evaluated 10 generic and domain-specific language models under zero-shot and finetuning settings, from domain-adapted baselines in healthcare to general-purpose state-of-the-art large language models (e.g., ChatGPT). Our evaluations reveal varying effectiveness of the two categories of language models across different tasks, from which we notice the importance of instruction tuning for few-shot usage of large language models. Our investigation paves the way toward benchmarking language models for healthcare and provides valuable insights into the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future advancements.

Comments:	Accepted to EMNLP 2023. Camera-ready version: updated IRB, added more evaluation results on LLMs such as GPT4, LLaMa2, and LLaMa2-chat
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2310.14088 [cs.CL]
	(or arXiv:2310.14088v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.14088

Submission history

From: Zexue He
[v1] Sat, 21 Oct 2023 18:59:41 UTC (674 KB)
[v2] Fri, 27 Oct 2023 16:00:49 UTC (677 KB)
[v3] Tue, 14 Nov 2023 21:59:56 UTC (677 KB)
