Are We Done with MMLU?

AP Gema, JOJ Leang, G Hong, A Devoto… - arXiv preprint arXiv …, 2024 - arxiv.org
… across 30 subsets of MMLU. After our manual re-annotation effort, we study how the errors
in MMLU impact LLM evaluation. First, we re-evaluate leading LLMs on MMLU-Redux, and …

Changing Answer Order Can Decrease MMLU Accuracy

V Gupta, D Pantoja, C Ross, A Williams… - arXiv preprint arXiv …, 2024 - arxiv.org
We focus our investigation on the MMLU dataset, … we find that all ten models are affected
by our answer shuffling. This indicates serious non-robustness in benchmarking with MMLU …

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

X Yue, T Zheng, Y Ni, Y Wang, K Zhang, S Tong… - arXiv preprint arXiv …, 2024 - arxiv.org
We explore the impact of OCR prompts and Chain of … This augmentation was done by
human experts with the … information when both text and images are presented together, and our …

Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models

Y Jin, Z Li, C Zhang, T Cao, Y Gao, P Jayarao… - arXiv preprint arXiv …, 2024 - arxiv.org
… In this section, we present the overall design of Shopping MMLU, featuring 57 tasks across 4
key … We also include Zephyr and Vicuna-13B, which are tuned with general domain IFT over …

Spanish and LLM Benchmarks: is MMLU Lost in Translation?

I Plaza, N Melero, C del Pozo, J Conde… - arXiv preprint arXiv …, 2024 - arxiv.org
… In this paper, we consider the case of the well-known … Next, the three versions of the selected
MMLU questions are run … Table 3 shows the results of the analysis done on the questions …

GenQA: Generating Millions of Instructions from a Handful of Prompts

J Chen, R Qadri, Y Wen, N Jain, J Kirchenbauer… - arXiv preprint arXiv …, 2024 - arxiv.org
… This can be done by feeding the following static prompt to … and evaluation benchmarks we
consider are described below. … or MMLU, and are not meant to be representative of MMLU

Language models (mostly) know what they know

S Kadavath, T Conerly, A Askell, T Henighan… - arXiv preprint arXiv …, 2022 - arxiv.org
… within the specific subtask we are evaluating, though in the case of MMLU we randomize
across the … This is done as a simple way of approximating a soft label using many hard labels. …

Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?

P Chen, S Yu, Z Guo, B Haddow - arXiv preprint arXiv:2406.12822, 2024 - arxiv.org
… This is done by sampling Aya’s English split to match the size of … Translated benchmarks
We use three translated test sets where two are … 2021), designated as M-MMLU in this work. …

Large Language Model Unlearning via Embedding-Corrupted Prompts

CY Liu, Y Wang, J Flanigan, Y Liu - arXiv preprint arXiv:2406.07933, 2024 - arxiv.org
… of unlearning. Because we are in the LLM setting, we use a … Probing: We also incorporate
a probing evaluation, as done … Note that MMLU has its own validation set, so we believe that …