Scope is all you need: Transforming LLMs for HPC Code

Kadosh, Tal; Hasabnis, Niranjan; Vo, Vy A.; Schneider, Nadav; Krien, Neva; Wasay, Abdul; Ahmed, Nesreen; Willke, Ted; Tamir, Guy; Pinter, Yuval; Mattson, Timothy; Oren, Gal

arXiv:2308.09440 (cs)

[Submitted on 18 Aug 2023 (v1), last revised 29 Sep 2023 (this version, v3)]

Abstract: With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size (e.g., billions of parameters) and demand expensive compute resources for training. We found this design choice confusing - why do we need large LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question design choices made by existing LLMs by developing smaller LLMs for specific domains - we call them domain-specific LLMs. Specifically, we start off with HPC as a domain and propose a novel tokenizer named Tokompiler, designed specifically for preprocessing code in HPC and compilation-centric tasks. Tokompiler leverages knowledge of language primitives to generate language-oriented tokens, providing a context-aware understanding of code structure while avoiding human semantics attributed to code structures completely. We applied Tokompiler to pre-train two state-of-the-art models, SPT-Code and Polycoder, for a Fortran code corpus mined from GitHub. We evaluate the performance of these models against the conventional LLMs. Results demonstrate that Tokompiler significantly enhances code completion accuracy and semantic understanding compared to traditional tokenizers in normalized-perplexity tests, down to ~1 perplexity score. This research opens avenues for further advancements in domain-specific LLMs, catering to the unique demands of HPC and compilation tasks.

Subjects:	Computation and Language (cs.CL); Programming Languages (cs.PL)
Cite as:	arXiv:2308.09440 [cs.CL]
	(or arXiv:2308.09440v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2308.09440

Submission history

From: Tal Kadosh
[v1] Fri, 18 Aug 2023 10:12:03 UTC (603 KB)
[v2] Thu, 21 Sep 2023 08:17:51 UTC (602 KB)
[v3] Fri, 29 Sep 2023 16:11:13 UTC (974 KB)
