ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

Liu, Yuliang; Tang, Xiangru; Cai, Zefan; Lu, Junjie; Zhang, Yichi; Shao, Yanjun; Deng, Zexuan; Hu, Helan; Yang, Zengxian; An, Kaikai; Huang, Ruijun; Si, Shuzheng; Chen, Sheng; Zhao, Haozhe; Li, Zhengliang; Chen, Liang; Zong, Yiming; Wang, Yan; Liu, Tianyu; Jiang, Zhiwei; Chang, Baobao; Qin, Yujia; Zhou, Wangchunshu; Zhao, Yilun; Cohan, Arman; Gerstein, Mark

Computer Science > Computation and Language

arXiv:2311.09835v1 (cs)

[Submitted on 16 Nov 2023 (this version), latest version 21 Aug 2024 (v5)]

Abstract: Large language models have shown promising performance on code generation benchmarks. However, a considerable gap exists between these benchmark results and practical applicability, primarily because real-world programming relies on pre-existing libraries. Rather than evaluating LLMs on writing code from scratch, this work proposes a new evaluation setup in which LLMs use open-source libraries to complete machine learning tasks. To this end, we propose ML-Bench, an expansive benchmark developed to assess the effectiveness of LLMs in leveraging existing functions in open-source libraries. It consists of 10,044 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked with generating code to accomplish the task. This requires comprehending long documents that interleave language and code, as well as understanding complex cross-file code structures, which introduces new challenges. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it accomplishes only 39.73% of the tasks, leaving substantial room for improvement. We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, yields further improvements. Code, data, and models are available at this https URL.
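
The evaluation setup described in the abstract (a task instruction plus a README goes in, generated code comes out, and the score is the share of tasks accomplished) can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the names `Sample`, `build_prompt`, `accomplished_rate`, `generate`, and `is_accomplished` are hypothetical, the prompt wording is an assumption, and the model and the success check are passed in as plain callables so that no real API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Sample:
    """One hypothetical benchmark item: a task instruction plus a README."""
    instruction: str
    readme: str


def build_prompt(sample: Sample) -> str:
    """Join the README and the instruction into a single model input.

    The wording of this prompt is an assumption, not taken from the paper.
    """
    return (
        "README:\n" + sample.readme + "\n\n"
        "Task:\n" + sample.instruction + "\n\n"
        "Write the code that accomplishes the task."
    )


def accomplished_rate(
    samples: Iterable[Sample],
    generate: Callable[[str], str],
    is_accomplished: Callable[[Sample, str], bool],
) -> float:
    """Return the percentage of samples whose generated code passes the check."""
    results = [is_accomplished(s, generate(build_prompt(s))) for s in samples]
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


if __name__ == "__main__":
    # Toy data and stand-in callables, purely illustrative.
    toy = [
        Sample("Train the model for 3 epochs.", "Usage: python train.py --epochs N"),
        Sample("Run inference on input.txt.", "Usage: python infer.py --input PATH"),
    ]

    def fake_model(prompt: str) -> str:
        # Stand-in for an LLM call; always returns the same command.
        return "python train.py --epochs 3"

    def fake_check(sample: Sample, code: str) -> bool:
        # Stand-in for a real success check such as executing the code.
        return "train" in sample.instruction.lower() and "train.py" in code

    print(accomplished_rate(toy, fake_model, fake_check))
```

Keeping the model and the success check as injected callables means the same scoring loop could be reused with any generator or any notion of "accomplished", which is why the sketch fixes neither.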

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2311.09835 [cs.CL]
	(or arXiv:2311.09835v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.09835

Submission history

From: Xiangru Tang
[v1] Thu, 16 Nov 2023 12:03:21 UTC (21,189 KB)
[v2] Wed, 17 Apr 2024 17:13:03 UTC (12,439 KB)
[v3] Wed, 12 Jun 2024 10:31:57 UTC (6,632 KB)
[v4] Tue, 18 Jun 2024 12:49:41 UTC (6,750 KB)
[v5] Wed, 21 Aug 2024 13:36:30 UTC (11,107 KB)
