Studying LLM Performance on Closed- and Open-source Data

Ahmed, Toufique; Bird, Christian; Devanbu, Premkumar; Chakraborty, Saikat

arXiv:2402.15100 (cs)

[Submitted on 23 Feb 2024]

Abstract: Large language models (LLMs) are finding wide use in software engineering practice. These models are extremely data-hungry and are largely trained on open-source (OSS) code distributed under permissive licenses. In actual use, however, a great deal of software development still occurs in the for-profit, proprietary sphere, where the code under development is not, and never has been, in the public domain; thus, many developers do their work, and use LLMs, in settings where the models may not be as familiar with the code under development. In such settings, do LLMs work as well as they do for OSS code? If not, what are the differences? When performance differs, what are the possible causes, and are there workarounds? In this paper, we examine this issue using proprietary, closed-source software data from Microsoft, where most proprietary code is in C# and C++. We find that performance for C# changes little from OSS to proprietary code but drops significantly for C++; we find that this difference is attributable to differences in identifiers. We also find that some performance degradation can, in some cases, be ameliorated efficiently by in-context learning.

Subjects:	Software Engineering (cs.SE); Machine Learning (cs.LG)
Cite as:	arXiv:2402.15100 [cs.SE]
	(or arXiv:2402.15100v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2402.15100

Submission history

From: Toufique Ahmed
[v1] Fri, 23 Feb 2024 05:17:28 UTC (1,244 KB)
