F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods

Sun, Yu; Chen, Keyu; Wang, Shujie; Li, Peiji; Guo, Qipeng; Yan, Hang; Qiu, Xipeng; Huang, Xuanjing; Lin, Dahua

arXiv:2401.14869 (cs)

[Submitted on 26 Jan 2024 (v1), last revised 20 Aug 2024 (this version, v2)]

Abstract: Large language models (LLMs) have garnered significant attention for their unprecedented performance, leading to a growing number of studies evaluating them. However, existing evaluation benchmarks are limited to assessing instruction-following capabilities, overlooking the fundamental abilities that emerge during the pre-training stage. Previous subjective evaluation methods mainly rely on scoring by API models. However, in the absence of references, large models have shown limited ability to discern subtle differences. To bridge this gap, we propose F-Eval, a bilingual evaluation benchmark for assessing fundamental abilities, including expression, commonsense, and logic. The tasks in F-Eval include multiple-choice objective tasks, open-ended objective tasks, reference-based subjective tasks, and reference-free subjective tasks. For reference-free subjective tasks, we devise new evaluation methods that serve as alternatives to scoring by API models. We conduct evaluations on 13 advanced LLMs. Results show that our evaluation methods achieve higher correlation coefficients and larger distinction than other evaluators. Additionally, we discuss the influence of different model sizes, dimensions, and normalization methods. We anticipate that F-Eval will facilitate the study of LLMs' fundamental abilities.

Comments:	ACL 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2401.14869 [cs.CL]
	(or arXiv:2401.14869v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2401.14869

Submission history

From: Yu Sun
[v1] Fri, 26 Jan 2024 13:55:32 UTC (8,947 KB)
[v2] Tue, 20 Aug 2024 05:27:44 UTC (8,948 KB)
