Are We Done with MMLU?
… across 30 subsets of MMLU. After our manual re-annotation effort, we study how the errors
in MMLU impact LLM evaluation. First, we re-evaluate leading LLMs on MMLU-Redux, and …
Changing answer order can decrease MMLU accuracy
… We focus our investigation on the MMLU dataset, … we find that all ten models are affected
by our answer shuffling. This indicates that serious non-robustness in benchmarking with MMLU…
MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark
… We explore the impact of OCR prompts and Chain of … This augmentation was done by
human experts with the … information when both text and images are presented together, and our …
Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models
… In this section, we present the overall design of Shopping MMLU, featuring 57 tasks across 4
key … We also include Zephyr and Vicuna-13B, which are tuned with general domain IFT over …
Spanish and LLM Benchmarks: is MMLU Lost in Translation?
I Plaza, N Melero, C del Pozo, J Conde… - arXiv preprint arXiv …, 2024 - arxiv.org
… In this paper, we consider the case of the well-known … Next, the three versions of the selected
MMLU questions are run … Table 3 shows the results of the analysis done on the questions …
GenQA: Generating Millions of Instructions from a Handful of Prompts
… This can be done by feeding the following static prompt to … and evaluation benchmarks we
consider are described below. … or MMLU, and are not meant to be representative of MMLU …
Language models (mostly) know what they know
S Kadavath, T Conerly, A Askell, T Henighan… - arXiv preprint arXiv …, 2022 - arxiv.org
… within the specific subtask we are evaluating, though in the case of MMLU we randomize
across the … This is done as a simple way of approximating a soft label using many hard labels. …
Are We Done with MMLU?
… Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions
across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies …
Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?
… This is done by sampling Aya’s English split to match the size of … Translated benchmarks
We use three translated test sets where two are … 2021), designated as M-MMLU in this work. …
Large Language Model Unlearning via Embedding-Corrupted Prompts
… of unlearning Because we are in the LLM setting, we use a … Probing We also incorporate
a probing evaluation, as done … Note that MMLU has its own validation set, so we believe that …