MPIrigen: MPI Code Generation through Domain-Specific Language Models

Schneider, Nadav; Hasabnis, Niranjan; Vo, Vy A.; Kadosh, Tal; Krien, Neva; Capotă, Mihai; Tamir, Guy; Willke, Ted; Ahmed, Nesreen; Pinter, Yuval; Mattson, Timothy; Oren, Gal


arXiv:2402.09126v2 (cs)

[Submitted on 14 Feb 2024 (v1), last revised 23 Apr 2024 (this version, v2)]

Abstract: The imperative need to scale computation across numerous nodes highlights the significance of efficient parallel computing, particularly in the realm of Message Passing Interface (MPI) integration. The challenging parallel programming task of generating MPI-based parallel programs has remained unexplored. This study first investigates the performance of state-of-the-art language models in generating MPI-based parallel programs. Findings reveal that widely used models such as GPT-3.5 and PolyCoder (specialized multi-lingual code models) exhibit notable performance degradation when generating MPI-based programs compared to general-purpose programs. In contrast, domain-specific models such as MonoCoder, which are pretrained on the MPI-related programming languages C and C++, outperform larger models. Subsequently, we introduce a dedicated downstream task of MPI-based program generation by fine-tuning MonoCoder on HPCorpusMPI. We call the resulting model MPIrigen. We propose an innovative preprocessing step that defers completion until after the whole code has been observed, thus enabling better completion with a wider context. Comparative analysis against GPT-3.5 zero-shot performance, using a novel HPC-oriented evaluation method, demonstrates that MPIrigen excels at generating accurate MPI functions, with up to 0.8 accuracy in location and function predictions and more than 0.9 accuracy in argument predictions. The success of this tailored solution underscores the importance of domain-specific fine-tuning in optimizing language models for parallel computing code generation, paving the way for a new generation of automatic parallelization tools. The sources of this work are available at our GitHub MPIrigen repository: this https URL
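The abstract mentions a preprocessing step in which completion happens only after the whole code has been observed. One way to picture this, as a rough sketch rather than the authors' actual pipeline, is to strip the MPI calls from a program and append them, tagged with their original line numbers, after the stripped code. The regular expression, the `<MPI_CALLS>` separator, the `line: call` target format, and the function names below are all hypothetical choices made for illustration; the sketch also assumes each MPI call sits on a single line of its own.

```python
import re

# Hypothetical pattern: a single-line call whose name starts with "MPI_".
MPI_CALL = re.compile(r"MPI_\w+\s*\([^;]*\)\s*;")


def split_mpi_calls(source: str):
    """Return (code without MPI call lines, [(original line number, call text)])."""
    kept, removed = [], []
    for number, line in enumerate(source.splitlines(), start=1):
        match = MPI_CALL.search(line)
        if match:
            removed.append((number, match.group(0)))
        else:
            kept.append(line)
    return "\n".join(kept), removed


def build_example(source: str) -> str:
    """Put the stripped program first and the removed calls (with locations) last.

    The "<MPI_CALLS>" separator is an assumption, not a token from the paper.
    """
    stripped, calls = split_mpi_calls(source)
    target = "\n".join(f"{number}: {call}" for number, call in calls)
    return f"{stripped}\n<MPI_CALLS>\n{target}"


SAMPLE = """#include <mpi.h>
int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Finalize();
    return 0;
}"""

if __name__ == "__main__":
    print(build_example(SAMPLE))
```

With a layout like this, a left-to-right model reads the entire program before it has to predict where each MPI function goes, which function it is, and which arguments it takes; those three fields correspond to the location, function, and argument predictions the abstract reports accuracy for.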

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Cite as:	arXiv:2402.09126 [cs.DC]
	(or arXiv:2402.09126v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2402.09126

Submission history

From: Nadav Schneider
[v1] Wed, 14 Feb 2024 12:24:21 UTC (581 KB)
[v2] Tue, 23 Apr 2024 16:59:46 UTC (575 KB)
