MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

Dai, Jianbo; Lu, Jianqiao; Feng, Yunlong; Huang, Dong; Zeng, Guangtao; Ruan, Rongju; Cheng, Ming; Tan, Haochen; Guo, Zhijiang

arXiv:2405.11430 (cs)

[Submitted on 19 May 2024 (v1), last revised 4 Nov 2024 (this version, v2)]

Abstract: Recent advancements in large language models (LLMs) have greatly improved code generation, particularly at the function level. For instance, GPT-4o has achieved a 91.0% pass rate on HumanEval. However, this calls into question whether existing benchmarks adequately assess function-level code generation capabilities. Our study analyzed two common benchmarks, HumanEval and MBPP, and found that they might not thoroughly evaluate LLMs' code generation capacities due to limitations in quality, difficulty, and granularity. To address this, we introduce the Mostly Hard Python Problems (MHPP) dataset, consisting of 210 unique human-curated problems. By focusing on the combination of natural language and code reasoning, MHPP gauges LLMs' abilities to comprehend specifications and restrictions, engage in multi-step reasoning, and apply coding knowledge effectively. Initial evaluations of 26 LLMs using MHPP showed that many models that perform well on HumanEval failed to achieve similar success on MHPP. Moreover, MHPP highlighted several previously undiscovered limitations across these LLMs, leading us to believe that it could pave the way for a better understanding of LLMs' capabilities and limitations. The MHPP dataset, evaluation pipeline, and leaderboard can be found at this https URL.

Comments:	43 pages, dataset and code are available at this https URL, leaderboard can be found at this https URL
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2405.11430 [cs.CL]
	(or arXiv:2405.11430v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2405.11430

Submission history

From: Jianbo Dai
[v1] Sun, 19 May 2024 03:08:02 UTC (2,754 KB)
[v2] Mon, 4 Nov 2024 12:21:52 UTC (2,547 KB)
