Are We Done with MMLU?

AP Gema, JOJ Leang, G Hong, A Devoto… - arXiv preprint arXiv …, 2024 - arxiv.org
… across 30 subsets of MMLU. After our manual re-annotation effort, we study how the errors
in MMLU impact LLM evaluation. First, we re-evaluate leading LLMs on MMLU-Redux, and …

Changing Answer Order Can Decrease MMLU Accuracy

V Gupta, D Pantoja, C Ross, A Williams… - arXiv preprint arXiv …, 2024 - arxiv.org
We focus our investigation on the MMLU dataset, … we find that all ten models are affected
by our answer shuffling. This indicates serious non-robustness in benchmarking with MMLU …

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

X Yue, T Zheng, Y Ni, Y Wang, K Zhang, S Tong… - arXiv preprint arXiv …, 2024 - arxiv.org
We explore the impact of OCR prompts and Chain of … This augmentation was done by
human experts with the … information when both text and images are presented together, and our …

Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models

Y Jin, Z Li, C Zhang, T Cao, Y Gao, P Jayarao… - arXiv preprint arXiv …, 2024 - arxiv.org
… In this section, we present the overall design of Shopping MMLU, featuring 57 tasks across 4
key … We also include Zephyr and Vicuna-13B, which are tuned with general domain IFT over …

Spanish and LLM Benchmarks: is MMLU Lost in Translation?

I Plaza, N Melero, C del Pozo, J Conde… - arXiv preprint arXiv …, 2024 - arxiv.org
… In this paper, we consider the case of the well-known … Next, the three versions of the selected
MMLU questions are run … Table 3 shows the results of the analysis done on the questions …

GenQA: Generating Millions of Instructions from a Handful of Prompts

J Chen, R Qadri, Y Wen, N Jain, J Kirchenbauer… - arXiv preprint arXiv …, 2024 - arxiv.org
… This can be done by feeding the following static prompt to … and evaluation benchmarks we
consider are described below. … or MMLU, and are not meant to be representative of MMLU

Language models (mostly) know what they know

S Kadavath, T Conerly, A Askell, T Henighan… - arXiv preprint arXiv …, 2022 - arxiv.org
… within the specific subtask we are evaluating, though in the case of MMLU we randomize
across the … This is done as a simple way of approximating a soft label using many hard labels. …

Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?

P Chen, S Yu, Z Guo, B Haddow - arXiv preprint arXiv:2406.12822, 2024 - arxiv.org
… This is done by sampling Aya’s English split to match the size of … Translated benchmarks
We use three translated test sets where two are … 2021), designated as M-MMLU in this work. …

Large Language Model Unlearning via Embedding-Corrupted Prompts

CY Liu, Y Wang, J Flanigan, Y Liu - arXiv preprint arXiv:2406.07933, 2024 - arxiv.org
… of unlearning. Because we are in the LLM setting, we use a … Probing: We also incorporate
a probing evaluation, as done … Note that MMLU has its own validation set, so we believe that …