Learning and Evaluating Contextual Embedding of Source Code

Kanade, Aditya; Maniatis, Petros; Balakrishnan, Gogul; Shi, Kensen

arXiv:2001.00059 (cs)

[Submitted on 21 Dec 2019 (v1), last revised 17 Aug 2020 (this version, v3)]

Abstract: Recent research has achieved impressive results on understanding and improving source code by building upon machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there has been no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.

Comments:	Published in ICML 2020. This version (v3) is the final camera-ready version of the paper. It contains re-computed results based on the open-sourced datasets.
Subjects:	Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
Cite as:	arXiv:2001.00059 [cs.SE]
	(or arXiv:2001.00059v3 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2001.00059

Submission history

From: Petros Maniatis
[v1] Sat, 21 Dec 2019 05:05:22 UTC (321 KB)
[v2] Wed, 8 Jul 2020 22:06:21 UTC (275 KB)
[v3] Mon, 17 Aug 2020 21:40:59 UTC (282 KB)
