Multiple-Choice Questions are Efficient and Robust LLM Evaluators

Z Zhang, L Xu, Z Jiang, H Hao, R Wang - arXiv preprint arXiv:2405.11966, 2024 - arxiv.org
We present GSM-MC and MATH-MC, two multiple-choice (MC) datasets constructed by
collecting answers and incorrect predictions on GSM8K and MATH from over 50 open-source
models. Through extensive experiments, we show that LLMs' performance on the MC
versions of these two popular benchmarks is strongly correlated with their performance on
the original versions, and is quite robust to distractor choices and option orders, while the
evaluation time is reduced by a factor of up to 30. Following a similar procedure, we also …

[PDF] Multiple-Choice Questions are Efficient and Robust LLM Evaluators

Z Zhang, Z Jiang, L Xu, H Hao, R Wang - researchgate.net
We present GSM-MC, a multiple-choice (MC) dataset constructed by collecting answers and
incorrect predictions on GSM8K from 60 open-source models. Through extensive
experiments, we show that LLMs' performance on the MC version of this popular benchmark
is strongly correlated with their performance on the original version and is quite robust to
distractor choices and option orders, while the evaluation time is reduced by a factor of up to
30. Following similar procedures, we introduce MATH-MC, constructed from MATH, and …