Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization

Zhi Chen, Lingxiao Jiang
Centre for Research on Intelligent Software Engineering
School of Computing and Information Systems
Singapore Management University
Singapore
zhi.chen.2023, [email protected]
(2024)
Abstract.

In the rapidly evolving field of machine learning, training models with datasets from various locations and organizations presents significant challenges due to privacy and legal concerns. The exploration of effective collaborative training settings, which are capable of leveraging valuable knowledge from distributed and isolated datasets, is increasingly crucial. This study investigates key factors that impact the effectiveness of collaborative training methods in code next-token prediction, as well as the correctness and utility of the generated code, showing the promise of such methods. Additionally, we evaluate the memorization of different participant training data across various collaborative training settings, including centralized, federated, and incremental training, showing their potential risks in leaking data.

Our findings indicate that the size and diversity of code datasets are pivotal factors influencing the success of collaboratively trained code models. We demonstrate that federated learning achieves competitive performance compared to centralized training while offering better data protection, as evidenced by lower memorization ratios in the generated code. However, federated learning can still produce verbatim code snippets from hidden training data, potentially violating data privacy or copyright. Our study further explores the patterns of effectiveness and memorization in incremental learning, emphasizing the importance of the sequence in which individual participant datasets are introduced. Also, we identify the memorization phenomenon of cross-organizational clones as a prevalent challenge in both centralized and federated learning scenarios. Our findings highlight the persistent risk of data leakage during inference, even when training data remains unseen. We conclude with strategic recommendations for practitioners and researchers to optimize the use of multisource datasets, thereby propelling cross-organizational collaboration forward.

Collaborative Training, Memorization, Large Language Model, Code Generation
journalyear: 2024
copyright: rightsretained
conference: 39th IEEE/ACM International Conference on Automated Software Engineering; October 27-November 1, 2024; Sacramento, CA, USA
booktitle: 39th IEEE/ACM International Conference on Automated Software Engineering (ASE ’24), October 27-November 1, 2024, Sacramento, CA, USA
doi: 10.1145/3691620.3695021
isbn: 979-8-4007-1248-7/24/10
ccs: Software and its engineering Collaboration in software development
ccs: Computing methodologies Simulation evaluation
ccs: Security and privacy

1. Introduction

Description: This figure illustrates the overall workflow of our study, which includes constructing cross-organizational datasets, collaboratively training code models using various methods (centralized, federated, and incremental learning), and evaluating the models. The evaluation of effectiveness encompasses next token prediction accuracy, as well as the correctness and utility of the generated code. The assessment of memorization in the participants’ training data involves prompt construction, data extraction, and memorization detection. Specific tools, benchmarks, and metrics are employed for detailed analysis.

Figure 1. Overview

Large language models for code (jiang2023impact; nijkamp2022codegen; zhuo2023popquizpretrainedcode) automatically generate code snippets, functions, or entire programs based on given inputs, significantly enhancing developer productivity and aiding in software development (du2024evaluating; xia2023automated; zhang2023multilingual). Effective training of code generation models requires large and diverse source code datasets. However, reliance on open-source repositories is becoming increasingly unsustainable. For instance, StarCoder2 (li2023starcoder; lozhkov2024starcoder) has been trained on a massive dataset aggregated from various platforms like GitHub and Kaggle, demonstrating that current models have nearly exhausted the available open-source training data. Moreover, open-source datasets pose significant risks, including the presence of vulnerable or malicious code and legal concerns related to the commercial use of copyleft-licensed code (sun2022coprotector). Studies on GitHub Copilot have shown that models can inherit vulnerabilities from unvetted code (mcmahan2017communication). Given these concerns, there is a growing need to use collaborative approaches to explore the untapped value of proprietary (closed-source) code datasets from different organizations (hoang2024collaborative; sim2024incentives).

Several collaborative training methods are available, but privacy concerns remain a significant obstacle. Traditional centralized training is effective when data can be aggregated (truong2021privacy), but due to privacy concerns, such as sensitive information and legal constraints on data sharing, it becomes impractical (henze2016moving; liu2021fate). These challenges necessitate the exploration of privacy-preserving collaborative learning methods such as federated learning (lo2021systematic) and incremental learning (gepperth2016incremental). Federated learning allows for collaborative training without centralizing data, enabling participants to maintain control over their private datasets (hard2018federated; li2020review; yang2019federated). Incremental learning, which updates models gradually with new data, offers a promising solution for dynamic environments where data continuously evolves (perez2010incremental).

However, even though certain methods can protect privacy by ensuring data remains unseen during training, studies on data extraction attacks reveal that models can still leak training data due to memorization (carlini2021extracting; Al-Kaswan2024memorization; yang2024unveiling). This is a significant concern that may discourage organizations from providing their private data for collaborative training (banabilah2022federated; zhang2022federated). Moreover, preprocessing cross-organizational datasets poses a substantial challenge due to the unseen nature of other participants’ data. One issue is the presence of cross-organizational clones, since code clones not only waste computing resources on duplicates but also increase the likelihood of these clones being memorized (yang2024unveiling). Ideally, the model should learn to generalize from the training data and develop the capability to generate new code, rather than reproducing the training data verbatim due to memorization. This is crucial because using verbatim generated code that is the same as some copyrighted code can lead to legal issues, as demonstrated by Oracle’s lawsuit against Google. In this case, Google developed Android without a Java license and copied its APIs, resulting in a copyright infringement case over “nine lines of code”.[1]

[1] Google LLC v. Oracle America, Inc. https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc.

Key Research Question.

The main objective of this paper is to better understand the promise and peril of collaborative training in the context of the code generation task, using several cross-organizational code datasets. This research aims to investigate a key question:

How do the effectiveness and memorization patterns vary in code models trained under different collaborative training settings?

Our investigation underscores the critical impact of dataset size and diversity on the effectiveness of collaboratively trained code models, where effectiveness is measured as the model’s next-token prediction ability and the correctness and utility of its generated code. We found that federated learning approaches yield results comparable to centralized training while maintaining data confidentiality during training and showing lower memorization rates during inference. Centralized training, however, tends to exhibit increased memorization, particularly with duplicate-heavy datasets. Both centralized and federated models showed higher memorization of cross-organizational clones than incremental models. Additionally, the effectiveness and memorization tendencies of incremental learning depend heavily on the order in which participant datasets are introduced. Crucially, our findings highlight the ongoing threat of data exposure during the inference stage, even without direct observation of the training data.

Main Contributions.
  • We have conducted a comprehensive analysis of various collaborative training setups, assessing the impact of dataset size, diversity, and data presentation sequence on the effectiveness of these methods for code generation.

  • To the best of our knowledge, we are the first to systematically examine the phenomenon of training data memorization in various centralized, federated, and incremental learning settings, identifying the associated risks of training data leakage.

  • Our findings provide actionable insights and recommendations for industry professionals and academic researchers, aiming to facilitate collaborative training practices, maximize the potential of extensive, multisource code repositories, and minimize the risk of code leakage. These insights ultimately urge the enhancement of privacy- and copyright-preserving capabilities of large code models while propelling cross-organizational collaboration forward.

Paper Structure.

Section 2 details the methodology of our study. Section 3 describes the datasets we collected. Section 4 describes the experimental setup. Section 5 presents our evaluation results, analyzing model effectiveness and memorization patterns. Section 6 discusses the findings and threats to validity. Section 7 reviews related work. Section 8 concludes with key findings.

2. Methodology

In this section, we present our specific research questions (Section 2.1) and the workflow and tools employed to answer the questions. The workflow of our study is illustrated in Figure 1. It begins with the explanation of our dataset construction method (Section 2.2), followed by a description of the collaborative training methods we used to train models (Section 2.3). Subsequently, we outline the method and metrics used to evaluate the effectiveness of the trained models (Section 2.4). Finally, we detail the training data extraction techniques and memorization evaluation methods employed to assess the extent of data memorization (Section 2.5).

2.1. Research Questions

We aim to investigate the promise and peril of collaborative training in the following research questions.

RQ1. What factors most significantly impact the effectiveness of collaborative training methods for code generation models?
Motivation:

To enhance the practical utility of code generation models trained on diverse datasets from multiple organizations, it is essential to understand how different factors influence collaborative training methods. By exploring how the size and diversity of datasets, as well as the sequence of data presentation, impact the performance of these models, we can derive valuable insights. This research seeks to identify these factors to inform the development of collaborative training strategies—such as centralized training, federated learning, and incremental learning—that optimize model effectiveness and support their application in real-world scenarios.

RQ2. To what extent is data from different participants memorized in various collaborative training settings?
Motivation:

Privacy concerns regarding the potential leakage of sensitive training data pose a significant barrier to organizational participation in collaborative training. Even with techniques like federated learning and incremental learning, which ensure that training data remains unseen during the training process, there remains a risk of data leakage through memorization during inference. Understanding how data from different participants is memorized and uncovering the memorization patterns can provide insights for improving collaborative training methods to mitigate memorization risks and enhance privacy or copyright preservation, thereby encouraging more organizations to engage in collaborative training and increasing the utility of valuable untapped proprietary datasets.

RQ3. How are cross-organizational code clones memorized in collaborative models?
Motivation:

Collaborative training scenarios present unique challenges, particularly concerning cross-organizational code clones. While centralized training can efficiently remove these clones, federated learning and incremental learning prevent participants from performing cross-dataset checking and filtering. This limitation can lead to the persistence of cross-organizational clones. A higher occurrence of code clone snippets can increase the risk of unintentional verbatim code exposure (yang2024unveiling). For instance, clones might include licensed code reused properly within organizations, but if a model reproduces this code verbatim due to memorization, users might unknowingly misuse these clones, potentially violating licensing regulations. Additionally, the quality of the generated code could be compromised if these clones contain vulnerabilities. This RQ is to evaluate how these clones are memorized in code models trained under different collaborative settings, providing insights into memorization patterns and highlighting the need for specialized dataset preprocessing in collaborative training scenarios.

2.2. Dataset Construction Method

Our investigation on collaborative training naturally needs datasets from different participants or organizations. Although we cannot use real-world proprietary codebases, we can construct separate datasets from open-source code repositories to simulate multisource datasets for our evaluation.

Cross-Organizational Datasets Construction Approach.

Due to the difficulty of obtaining proprietary code datasets from industry sources for collaborative training, our methodology involves collecting cross-organizational datasets from GitHub repositories while adhering to the following principles:

  • Ensuring that the code in one dataset comes from a single organization while the code in different datasets comes from different organizations, simulating scenarios where each participant in a collaborative training setting has their own private codebase.

  • Limiting the datasets to a single programming language to facilitate more consistent evaluation of effectiveness and memorization issues in the trained models.

Based on these principles, our methodology involves curating Python code files from the open-source repositories of three prominent tech organizations hosted on GitHub: Facebook (F),[2] Microsoft (M), and Google (G). These Python code files are primarily developed by internal software engineers from these organizations, enabling effective simulation of collaborative training methods in real-world scenarios.

[2] Facebook is now Meta. Although the company has undergone a rebranding, many repositories on GitHub continue to use the name Facebook. Therefore, for consistency and clarity within this context, we will refer to the company as Facebook (F).

Data Collection Platform.

We utilize Google’s BigQuery to collect our code datasets as it contains extensive GitHub data.[3] This GitHub data on BigQuery is updated weekly and can be accessed through efficient SQL queries, ensuring timely access to the latest GitHub data. Section 3 gives more details about our collected codebases.

[3] https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code

2.3. Collaborative Training Methods

There are different ways to perform collaborative training using datasets from different participants or organizations. We summarize three common methods in Table 1: traditional centralized training, ideal for mutually trusting participants who combine datasets on a centralized location; and federated learning and incremental learning, able to train models in a decentralized manner. The latter two methods prevent dataset centralization, ensuring that the training data remains unseen during the training process, and consequently, to some extent, safeguard the privacy of the source data.

Table 1. Collaborative Training Methods
Method                 Decentralized?   Synchronous?
Centralized Training   No               Yes
Federated Learning     Yes              Yes
Incremental Learning   Yes              No

In terms of the synchronicity in the training process across different datasets, that is, whether in each training round (epoch) of a model, the data from all parties are involved in the training and contribute to the model’s update, we classify the three methods into Synchronous Collaborative Training (e.g., centralized training and federated learning) and Asynchronous Collaborative Training (e.g., incremental learning with sequential dataset training).

We provide a detailed explanation of the three methods using a unified representation, to better illustrate the collaborative training approaches utilized in this study.

2.3.1. Dataset and Model Representation

Datasets.

We use $D_i$ to denote a dataset from a participant $i$. Given $n$ participants, the centralized union of all their datasets is denoted as $D_C = \cup_{i=1}^{n} D_i$. In our study, we have three datasets $D_F$, $D_M$, $D_G$ from Facebook, Microsoft, Google, respectively. Each data point in the dataset can be a Python code file, a Python class, or a function, optionally associated with some docstrings or comments. These datasets will be used in various ways to train various models for code generation tasks in our evaluation.

Models.

Our study focuses on models that are based on deep neural networks, as they have been shown to be effective for code generation tasks (cert2022ijcai; du2024evaluating; lozhkov2024starcoder; liu2024your). We denote a model $M_i$, together with its internal weights $\Theta_i$, potential inputs $X_i$, and potential outputs $Y_i$, as $Y_i = M_i(\Theta_i, X_i)$. There may exist ground-truth outputs $\bar{Y_i}$ for the input $X_i$, and a model trained on the ground-truth data should have adjusted its internal weights $\Theta_i$ so that the differences between $Y_i = M_i(\Theta_i, X_i)$ and $\bar{Y_i}$ are minimized.

In our study, each participant $i$ can individually train a model on its own dataset $D_i$ as usual to minimize the differences between $M_i(\Theta_i, D_i)$ and its ground truth $\bar{D_i}$. When it comes to collaborative training, the settings need to be adjusted as follows.

2.3.2. Centralized Training.

This training method is ideal when the participants share a profound mutual trust. In this approach, the participants train a common model using a centralized dataset that combines the data from all participants. That is, the method is to train a centralized model $M_C$ so that the differences between $Y_C = M_C(\Theta_C, D_C)$ and $\bar{D_C}$ are minimized, where $D_C = D_F \cup D_M \cup D_G$.

2.3.3. Federated Learning.

This is a method for multiple participants to collaboratively train one central model as well while keeping their data localized (shanbhag2022exploring). This method enhances privacy and mitigates the risks associated with data centralization  (yang2019federated). Its key idea is for each participant to calculate the updates needed for the central model weights using their own dataset locally and only share the weight updates with all the participants. Thus, a key component of federated learning is often the aggregation strategy used to aggregate weight updates from individual participants.

In our study, we applied two federated learning aggregation strategies, FedAvg (mcmahan2017communication) and FedYogi (reddi2021adaptive), to diversify our experimental settings. The FedAvg algorithm (mcmahan2017communication) simply averages the model weights updated by each participant to form the global model weights. It is often used for cases when datasets across parties are homogeneous. That is, FedAvg trains a model $M_{FedAvg}(\Theta_{FedAvg}, X)$ where $X$ is unknown, and each participant locally trains a $M_i(\Theta_i, D_i)$, and $\Theta_{FedAvg} = \frac{1}{n}\sum_{i=1}^{n} w_i \cdot \Theta_i$, where $w_i$ is the weight of the $i$-th participant’s contribution which is often based on the size of $D_i$. Note that the averaging operation is often done at the end of each training round (epoch). Also, to facilitate the averaging operation, it would be better for individual $M_i$s to have the same structure (e.g., the same numbers and positions of the weights).
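The FedAvg aggregation step can be illustrated with a minimal sketch. It treats each participant's weights as a flat list of floats and follows the formula above literally. The helper `size_based_weights` is a hypothetical choice of $w_i$ (scaled so that the result equals the size-weighted average), not something specified in the text, and the toy numbers are illustrative only.

```python
from typing import List, Sequence


def size_based_weights(sizes: Sequence[int]) -> List[float]:
    """Hypothetical choice of w_i (an assumption): n * |D_i| / sum_j |D_j|.

    With this scaling, (1/n) * sum_i w_i * Theta_i equals the
    size-weighted average of the local weights.
    """
    n = len(sizes)
    total = sum(sizes)
    return [n * s / total for s in sizes]


def fedavg(local_thetas: Sequence[Sequence[float]],
           weights: Sequence[float]) -> List[float]:
    """Aggregate local weights as Theta = (1/n) * sum_i w_i * Theta_i."""
    n = len(local_thetas)
    dim = len(local_thetas[0])
    # All local models must share the same structure (same number of weights).
    assert all(len(theta) == dim for theta in local_thetas)
    return [
        sum(w * theta[j] for w, theta in zip(weights, local_thetas)) / n
        for j in range(dim)
    ]


# Toy example: three participants (F, M, G), each with a two-parameter model.
thetas = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
uniform = fedavg(thetas, [1.0, 1.0, 1.0])                 # plain average
weighted = fedavg(thetas, size_based_weights([1, 1, 2]))  # size-weighted
```

In a real run this step would be repeated at the end of every round, with each participant continuing local training from the aggregated weights.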

The FedYogi algorithm (reddi2021adaptive) is similar to FedAvg, but adapts the Yogi optimizer (zaheer2018adaptive) to adjust the model weights and the model learning rates for non-IID data (zhao2018federated) across participants during training. Thus, FedYogi is often used for cases when datasets across parties are heterogeneous.

2.3.4. Incremental Learning.

This method involves gradual updates to a model with new datasets (wu2019large). It is particularly useful in situations where the data evolves over time, allowing the model to adapt to the new data without being retrained from scratch (van2022three), suitable for not only collaborative training, but also internal training with one organization. That is, it trains a sequence of models $[M_1(\Theta_1, D_1), M_2(\Theta_2, D_2), \cdots, M_n(\Theta_n, D_n)]$ such that $\Theta_{i+1}$ are initialized with $\Theta_i$ but updated according to $D_{i+1}$ without referring back to $D_i$, and the last model $M_n$ is often used as the final collaborative model $M_I$.
Note that, to facilitate the initialization of $\Theta_{i+1}$ from $\Theta_i$, it is often better for all models $M_i$ to use the same structure. Also, the order of using $[D_1, D_2, \cdots]$ datasets can affect the trained models. In our study, we sequentially train various incremental models using our datasets in different orders. We use the order of the datasets used to train a model as the name of the model. For example, we use $M_{F2M2G}$ to denote the model that is incrementally trained from the Facebook ($D_F$), Microsoft ($D_M$), and Google ($D_G$) codebases in that order.
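The sequential scheme and the naming convention can be sketched as follows. The `toy_update` function is a hypothetical stand-in for a full training pass (it only nudges a single weight toward the mean of the data) and the numeric datasets are invented for illustration. The point is only that each stage starts from the previous weights and sees one dataset, so different orders yield different final models.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence

Theta = List[float]


def incremental_train(theta0: Theta,
                      datasets: Dict[str, Sequence[float]],
                      order: Sequence[str],
                      update: Callable[[Theta, Sequence[float]], Theta]) -> Theta:
    """Train sequentially: Theta_{i+1} starts from Theta_i and sees only D_{i+1}."""
    theta = list(theta0)
    for name in order:
        theta = update(theta, datasets[name])
    return theta


def model_name(order: Sequence[str]) -> str:
    """Name a model after its dataset order, e.g. ('F', 'M', 'G') -> 'F2M2G'."""
    return "2".join(order)


def toy_update(theta: Theta, data: Sequence[float]) -> Theta:
    """Hypothetical stand-in for training: move each weight halfway to the data mean."""
    mean = sum(data) / len(data)
    return [t + 0.5 * (mean - t) for t in theta]


# Invented toy datasets for the three participants.
datasets = {"F": [0.0, 2.0], "M": [4.0], "G": [10.0]}

# One final model per dataset order (six orders for three participants).
models = {
    model_name(order): incremental_train([0.0], datasets, order, toy_update)
    for order in permutations("FMG")
}
```

Here the dataset seen last pulls the final weight the most, which is the toy analogue of the order sensitivity noted above.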

2.4. Effectiveness Evaluation Method

This paper focuses on code generation tasks using collaborative models, which involves creating code snippets from prompts or specifications to enhance software development productivity. To evaluate the effectiveness of code generation models, we selected two primary metrics: perplexity and pass@k. These metrics provide a balanced assessment of the model’s predictive capabilities and practical utility, making them the most suitable choice for RQ1 in our study.

Perplexity: Evaluating Next-Token Prediction Ability.

Perplexity measures the model’s ability to predict subsequent tokens, ensuring syntactic correctness. Lower perplexity values correspond to improved predictive performance (iyer1997analyzing).
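As a minimal sketch of the metric, perplexity can be computed as the exponential of the average negative log-likelihood that the model assigns to the true next tokens. The input format (a list of natural-log probabilities, one per token) is an assumption made for illustration; the evaluation pipeline used in the study may compute it differently.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp of the mean negative natural-log probability of the true tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that gives each true token probability 0.5 is less "surprised"
# than one that gives each true token probability 0.1.
confident = perplexity([math.log(0.5)] * 4)
uncertain = perplexity([math.log(0.1)] * 4)
```

A uniform per-token probability $p$ gives a perplexity of $1/p$, which is why lower values indicate better prediction.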

Pass@k: Evaluating Code Correctness and Utility.

Pass@k for a model is defined as the probability that at least one of the top-k code samples generated by the model for a query problem passes the unit tests defined for the problem. Higher pass@k values indicate better performance in providing relevant and accurate code solutions (chen2021evaluating). For each trained model, this measurement is calculated using the EvalPlus (liu2024your) benchmark, which builds upon the HumanEval (chen2021evaluating) benchmark. EvalPlus enhances the scope and robustness of HumanEval by incorporating a more diverse set of real-world coding problems.
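For illustration, the sketch below uses the unbiased pass@k estimator introduced with HumanEval (chen2021evaluating): with $n$ samples per problem of which $c$ pass the tests, pass@k is estimated as $1 - \binom{n-c}{k} / \binom{n}{k}$. Whether the study's EvalPlus runs use exactly this estimator is an assumption here, and the function names are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Three hypothetical problems with 10 samples each: 3, 0, and 10 passing.
score = mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1)
```

For $k = 1$ the estimator reduces to the fraction of passing samples per problem, averaged over problems.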

2.5. Memorization Evaluation Methods

As our research goal is to investigate the memorization of each participant’s training data, we adapt the data extraction strategies used by Al-Kaswan et al. (Al-Kaswan2024memorization), who formulate a targeted data extraction security game to extract data from models. In the targeted attack scenario, the adversary is provided with a prefix and is tasked with recovering the suffix associated with the prefix from the training data. Targeted attacks are more critical for security because they allow the extraction of specific information, such as sensitive configuration, personal identifiers, or proprietary algorithms (10189147; liao2021generating).

Prompt Construction for Data Extraction.

Different from the setting in (Al-Kaswan2024memorization), our training data is available, which allows us to construct prefix prompts directly from each organization’s dataset instead of from an identified extractable dataset (Al-Kaswan2024memorization). For constructing prompts, we choose to use function signatures with docstrings as our “prefix” prompt. This format better reflects real-world scenarios where an adversary has access to an API’s function signature and functionality description document and aims to extract the function’s coding details in the function body.

Specifically, we use static analysis to parse the source code from the training data into abstract syntax trees (ASTs) to extract functions. Subsequently, two filtering conditions are applied to construct prefix prompts: each function must have a corresponding docstring, and the combined length of the tokenized function signature and docstring must not exceed 512 tokens.[4] Listing 1 provides a concrete example of the function prompt utilized in our evaluation.

[4] GPT2 can only handle a total token length of 1024 (including input and newly generated tokens), so we set the maximum length of the input and the maximum length of the newly generated tokens to 512 each.

Listing 1: Function Prompt Example
def async_close(self, **kwargs: Any) -> bool:
    """
    `async_close()` must be called at the very end of any script that uses the asynchronous `opena` feature. This calls `async_join()` first and then closes the thread pool used for the asynchronous operations.
    Returns:
        status (bool): True on success
    """

After prefix prompts are extracted, we feed them into each of the collaboratively trained models (Section 2.3) to get the models’ generation outputs, and then measure the amount of training data memorization in the generated outputs.
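The prompt construction steps described above can be sketched with Python's standard `ast` module. This is an illustrative sketch rather than the study's actual tooling: the function name `extract_prompts` is hypothetical, and the whitespace-based token counter is a stand-in assumption for the model tokenizer, which would be passed in via `count_tokens` in practice.

```python
import ast
from typing import Callable, List


def count_tokens_whitespace(text: str) -> int:
    """Hypothetical stand-in for the model tokenizer (an assumption)."""
    return len(text.split())


def extract_prompts(source: str,
                    max_tokens: int = 512,
                    count_tokens: Callable[[str], int] = count_tokens_whitespace
                    ) -> List[str]:
    """Return 'signature + docstring' prefixes for every documented function."""
    lines = source.splitlines()
    prompts = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if ast.get_docstring(node) is None:
            continue  # filter 1: the function must have a docstring
        doc_expr = node.body[0]  # the docstring is the first statement
        # Keep source lines from the `def` line through the end of the docstring.
        prefix = "\n".join(lines[node.lineno - 1:doc_expr.end_lineno])
        if count_tokens(prefix) <= max_tokens:  # filter 2: length limit
            prompts.append(prefix)
    return prompts


example = '''
class Pool:
    def close(self, force: bool = False) -> bool:
        """Close the pool.

        Returns True on success.
        """
        self._closed = True
        return True

    def helper(self):
        return 1
'''

prompts = extract_prompts(example)
```

Only `close` yields a prompt here: `helper` has no docstring, and the body of `close` is left out so that the model has to generate it.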

Memorization Detection.

We can detect memorized data by comparing the similarity between the outputs generated by the models and the individual participants’ datasets. The availability of the organizations’ training datasets in our study allows us to check directly for duplication between the generated code snippets and the training datasets. We adapt the memorization detection technique from Yang et al. (yang2024unveiling), which employs the Simian clone detection tool (https://simian.quandarypeak.com/) to detect Type-1 clones between the generated code and the training code. A Type-1 clone, or exact clone, refers to identical segments of code (with a minimum of six lines as the default setting in Simian). If the model produces these exact replicas, it strongly suggests memorization. Therefore, we classify such a clone as an instance of memorization.
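To make the idea of exact-clone detection concrete, the sketch below marks runs of generated lines that are covered by six-line windows also present in the training code. It is a simplified stand-in and not Simian: the normalization (whitespace stripping, blank-line removal) and the function names are assumptions made for illustration.

```python
from typing import List, Tuple


def _normalize(code: str) -> List[str]:
    # Assumption: strip surrounding whitespace and drop blank lines.
    # Simian applies its own normalization rules.
    return [ln.strip() for ln in code.splitlines() if ln.strip()]


def find_memorized_blocks(generated: str, training: str,
                          min_lines: int = 6) -> List[Tuple[int, int]]:
    """Return (start, length) runs of generated lines that are covered by
    windows of `min_lines` lines occurring verbatim in the training code."""
    gen, train = _normalize(generated), _normalize(training)
    windows = {tuple(train[i:i + min_lines])
               for i in range(len(train) - min_lines + 1)}
    covered = [False] * len(gen)
    for i in range(len(gen) - min_lines + 1):
        if tuple(gen[i:i + min_lines]) in windows:
            for j in range(i, i + min_lines):
                covered[j] = True
    blocks, start = [], None
    for idx, flag in enumerate(covered + [False]):  # sentinel closes last run
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            blocks.append((start, idx - start))
            start = None
    return blocks
```

Because overlapping windows are merged, a reported run can stitch together adjacent matches; a production clone detector handles such cases more carefully.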

Memorization Evaluation.

To better quantify the extent of training data memorization in the model-generated code, we introduce the Memorization Ratio, defined as follows. Given a set of specific prompts, the code model generates a set of code. Simian is then used to detect x distinct blocks of code that are identical to some blocks of code in a training dataset; these x blocks of code are considered memorized code. The Memorization Ratio is then calculated by summing the numbers of lines within all the memorized blocks and dividing by the total number of lines in all the generated code. Mathematically, this can be represented as:

\begin{equation}
\text{Mem. Ratio} = \frac{\sum_{i=1}^{x} \text{lines of code in memorized block}_{i}}{\sum \text{lines of code in all generations}} \tag{1}
\end{equation}
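As a small worked example of the Memorization Ratio defined above, the following sketch computes the ratio from per-block line counts. The function name is hypothetical; the numbers in the usage note are taken from the Centralized_MG row of Table 11.

```python
from typing import Iterable


def memorization_ratio(memorized_block_lines: Iterable[int],
                       total_generated_lines: int) -> float:
    """Sum of lines in memorized blocks divided by all generated lines."""
    if total_generated_lines <= 0:
        raise ValueError("total_generated_lines must be positive")
    return sum(memorized_block_lines) / total_generated_lines


# Centralized_MG in Table 11: 261 memorized lines out of 46,173 generated lines.
ratio = memorization_ratio([261], 46_173)
print(f"{ratio:.3%}")
```

With these inputs the ratio is 261 / 46,173, which rounds to the 0.565% reported in Table 11.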

3. Datasets

This section presents some characteristics of the datasets we collected from different organizations (Section 2.2) and performs some preprocessing for the following evaluation.

Collecting Organization’s Codebase.

We utilized Google’s BigQuery to collect all open-source licensed Python files from the GitHub database, resulting in a total of 27,128,930 files, amounting to 188.3 GB of data. Additionally, to identify the repositories on GitHub that belong to a certain organization, we manually identified repository names that are very likely related to Google, Microsoft, and Facebook, and used them to extract each organization’s codebase. Table 2 shows sample names of the repositories collected for the organizations. In total, we collected three Python codebases, one for each organization. There were 125,847 files (1018.93 MB) for Google, 33,560 files (703.29 MB) for Microsoft, and 5,207 files (38.58 MB) for Facebook.

Table 2. Organizations’ Repositories
Org. # of Repos Sample Repos
Google (G) 32 google, google-research, google-deepmind, etc.
Microsoft (M) 10 microsoft, Azure, MicrosoftEdge, etc.
Facebook (F) 5 facebook, facebookresearch, fbsamples, etc.
Preprocessing and Splitting.

As there can be duplicate files or low-quality code in the codebases that may affect model training, we preprocessed each dataset separately using methods employed in the training of the CodeParrot and PyCodeGPT models (cert2022ijcai). These methods are based on heuristics proposed by OpenAI’s Codex (chen2021evaluating) and have been further refined and enriched. Sample filtering criteria used are as follows:

  • Removing duplicate code files using MinHash + LSH.

  • Filtering out files with a fraction of alphanumeric characters less than 0.25.

  • Removing files containing the phrase “auto-generated” or similar within the first five lines.
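Two of the filtering criteria listed above are simple enough to sketch directly. The code below is an illustrative approximation, not the CodeParrot or PyCodeGPT implementation: the helper names are hypothetical, and the marker phrases other than "auto-generated" are assumptions about what "or similar" might cover. MinHash + LSH deduplication is omitted.

```python
def alnum_fraction(text: str) -> float:
    """Fraction of characters in the file that are alphanumeric."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def looks_auto_generated(text: str, head_lines: int = 5) -> bool:
    """Check the first few lines for an auto-generation marker."""
    head = "\n".join(text.splitlines()[:head_lines]).lower()
    # Only "auto-generated" comes from the text; the others are assumed variants.
    markers = ("auto-generated", "autogenerated", "automatically generated")
    return any(marker in head for marker in markers)


def keep_file(text: str, min_alnum: float = 0.25) -> bool:
    """Apply both heuristics; True means the file passes the filters."""
    return alnum_fraction(text) >= min_alnum and not looks_auto_generated(text)
```

A file is kept only if it passes both checks, so the order of the two tests does not affect the result.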

We then split each codebase into a training set and a validation set to facilitate model training. Basic statistics of the resultant codebases are shown in Table 3.

Table 3. Dataset Splits
Dataset Split Files Count Size
Google Training 53,545 501.77 MB
Validation 13,387 125.97 MB
Microsoft Training 15,251 327.47 MB
Validation 3,813 87.63 MB
Facebook Training 2,148 19.34 MB
Validation 538 4.99 MB
Cross-Org Codebase Characteristics.

As shown in Table 4, we measured average metrics per megabyte, including lines of code (LOC), number of classes, number of functions, and number of docstrings across three datasets. Notably, there are discernible variations in these metrics among the datasets.

Table 4. Basic Metrics Per Megabyte
Dataset LOC Classes Funcs Docs
Google 5,309.70 184.87 946.05 547.44
Microsoft 4,823.18 173.50 460.67 358.63
Facebook 7,588.45 248.58 1,435.30 557.21

Note: LOC - Lines of Code, Classes - Number of classes, Funcs - Number of functions, Docs - Number of docstrings.

Internal clone detection was then performed with a threshold of minimum six lines to define a clone block for each organization’s datasets. Figure 2 shows the clone statistics per megabyte for datasets from Google, Microsoft, and Facebook. The metrics analyzed include the average number of clone blocks and the lines of code (LOC) within these clone blocks. We examined these statistics to understand the extent of duplicated content within the datasets, as the findings by Yang et al. (yang2024unveiling) suggest that the occurrence of duplicate samples is correlated with an increased tendency for data memorization by code models.


Figure 2. Clones Statistics Per Megabyte

The results reveal that the Microsoft dataset has the highest number of clone blocks (755.40) and clone LOC (17,500.42) per megabyte, indicating a significant presence of duplicated content, which could lead to increased memorization during model training. In comparison, the Google and Facebook datasets have fewer clone blocks and clone LOC per megabyte, indicating relatively lower duplication.

4. Experiment Setup

This section describes the specific experimental settings implemented to provide answers to each research question.

4.1. Base Model

To minimize interference from existing training data, we chose GPT-2 as our base model because it was trained on the WebText dataset (radford2019language), which includes substantial web data but not specifically GitHub code. This choice ensures that the Python dataset we collected from GitHub is relatively new to GPT-2’s training data, reducing the potential memorization effect of the base model. Although there are other models that meet our requirements, we selected GPT-2 because it serves as the foundation for many widely used models, such as CodeParrot.

Table 5. Comparison of Large Language Models
Model Training Data Log(PPL)/Zlib Release Date
Basic LLMs (~125M Parameters)
GPT-2 WebText (8M web documents) 0.0020256 Feb 18, 2019
GPT-Neo-125M The Pile(include code repos) 0.0014813 Apr 6, 2021
CodeParrot-Small Python code from GitHub 0.0008022 Nov 5, 2021
PyCodeGPT Python scripts from Github 0.0008196 Jan 4, 2023
Advanced LLMs (~7/8B Parameters)
LLaMA-2-7B Web, books, code data 0.0006497 Jul 18, 2023
Mistral-7B-v0.1 Mixed web and code data 0.0007225 Sep 20, 2023
CodeLlama-7B-Python High-quality Python code 0.0003863 Mar 14, 2024
LLaMA-3-8B Enhanced web and code data 0.0006599 Apr 17, 2024

To assess potential overlap between our collected Python codebase and LLMs’ pre-training data, we employed a membership inference attack using the PPL-Zlib Ratio metric, which measures the ratio of log perplexity to zlib entropy (yang2024unveiling). A lower ratio suggests that a code snippet was likely seen during pre-training (carlini2021extracting). We sampled 10% of our data and computed the average ratio for each LLM. As shown in Table 5, models specifically trained on Python code, such as CodeParrot, PyCodeGPT, and CodeLlama-Python, display lower ratios compared to similarly sized models. Advanced models like LLaMA-3 and Mistral, which also include code in their training, similarly show low ratios. Given the unavailability of real-world proprietary codebases and the risk of data leakage from GitHub repositories, GPT-2’s higher PPL-Zlib Ratio, compared to models trained on Python code, suggests a lower likelihood of overlap with our dataset, making it a more suitable choice for our evaluation of collaborative training scenarios.
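The PPL-Zlib Ratio described above can be illustrated with a short sketch. It assumes per-token log-probabilities have already been obtained from the model under test; the function names, the use of natural logarithms, and measuring zlib entropy as the compressed length in bytes are assumptions, since the exact units are not specified here.

```python
import zlib
from typing import Sequence


def log_perplexity(token_log_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood per token (natural log assumed)."""
    return -sum(token_log_probs) / len(token_log_probs)


def zlib_entropy(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 text (assumed unit)."""
    return len(zlib.compress(text.encode("utf-8")))


def ppl_zlib_ratio(token_log_probs: Sequence[float], text: str) -> float:
    """Lower values suggest the snippet is more likely to have been seen."""
    return log_perplexity(token_log_probs) / zlib_entropy(text)
```

Dividing by the compressed length normalizes for how repetitive the snippet is, so that trivially predictable text does not dominate the signal.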

4.2. Collaborative Training

We conducted all collaborative training using the NVIDIA A100-PCIE-40GB GPU, which features 40GB of high-bandwidth memory.

Training Settings.
  • For centralized learning (CL), we aggregated the three codebases into a single dataset and trained the model for 10 epochs. Due to computational resource constraints, we set the training batch size to 2. For other hyperparameters, we followed the configurations used by CodeParrot (https://huggingface.co/codeparrot/codeparrot-small).

  • For federated learning (FL), we conducted a total of 10 rounds of training, with each client training on its own codebase for 1 epoch per round using the Flower federated learning framework (https://flower.ai/docs/framework/tutorial-series-what-is-federated-learning.html). For the federated learning aggregation methods FedAvg and FedYogi, we utilized the default hyperparameters implemented in previous work (reddi2021adaptive) and maintained the same setup as centralized learning for each client’s own training.

  • For incremental learning (IL), we considered all six distinct sequences for the three codebases: Facebook (F), Microsoft (M), and Google (G). The models were trained sequentially on each codebase for 10 epochs using the same hyperparameters as in the centralized setting.
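For readers unfamiliar with federated aggregation, the core of FedAvg is a weighted average of client parameters, with weights proportional to each client's number of training examples. The sketch below shows that step on plain Python lists; it is not Flower's API, and the type alias and function name are hypothetical.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def fed_avg(client_weights: Sequence[Weights],
            num_examples: Sequence[int]) -> Weights:
    """Average client parameters, weighted by local example counts."""
    total = sum(num_examples)
    averaged: Weights = {}
    for name in client_weights[0]:
        size = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, num_examples)) / total
            for i in range(size)
        ]
    return averaged


# Two hypothetical clients; the second holds three times as much data.
merged = fed_avg([{"w": [0.0, 2.0]}, {"w": [4.0, 6.0]}], [1, 3])
print(merged)
```

In each round, the server would broadcast the averaged parameters back to the clients before the next local epoch.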

Trained Models.

We obtained nine collaborative models from collaborative training: one centralized model (Centralized_FMG), two federated learning models (Federated_Avg_FMG and Federated_Yogi_FMG), and six incremental learning models (Incremental_SEQUENCE). The SEQUENCE represents the training order of the datasets from Facebook (F), Microsoft (M), and Google (G) (either F2M2G, F2G2M, M2F2G, M2G2F, G2F2M, or G2M2F). Additionally, we trained three baseline models, one for each dataset (Facebook_Only, Microsoft_Only and Google_Only) for comparison with various collaborative models.
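The six incremental model names follow mechanically from the permutations of the three dataset initials, as this short sketch shows. The variable names are illustrative only.

```python
from itertools import permutations

# F = Facebook, M = Microsoft, G = Google, as in the naming scheme above.
incremental_models = [
    "Incremental_" + "2".join(order) for order in permutations("FMG")
]
print(incremental_models)
```

Together with one centralized model and two federated models, these six sequences account for the nine collaborative models.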

4.3. Effectiveness Evaluation Settings for RQ1

The combined evaluation dataset, derived from the unseen validation datasets of all participants, is used to calculate the Perplexity score. To assess correctness and utility, we estimated the pass@k metric in the EvalPlus benchmark with n_samples = 200, following the settings of previous work (chen2021evaluating), performing sampling with temperatures ranging from 0.1 to 1.0, and selecting the optimal value for each metric, as outlined in earlier research (cert2022ijcai).
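For reference, the pass@k estimate for a single problem can be computed with the unbiased estimator introduced alongside HumanEval (chen2021evaluating): with n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below is a minimal version of that formula; the function name is ours, and benchmark-level scores are the mean over problems.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 200 samples and a single correct one, pass@1 is 1/200.
print(pass_at_k(200, 1, 1))
```

Generating many more samples than k (here 200) lowers the variance of the estimate.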

4.4. Memorization Evaluation Settings for RQ2

Prompt Construction.

We followed the prompt construction method from Section 2.5, extracting all functions and selecting signatures and docstrings of appropriate lengths as prompts. Given the large number of prompts obtained, we randomly sampled 10% of the function prompts from each codebase for code generation. The outcomes of our prompt construction are presented in Table 6.

Table 6. Function Prompts
Dataset Total Functions Total Prompts Sampled
Google 475,256 187,900 18,790
Microsoft 150,248 58,068 5,807
Facebook 27,340 8,159 816
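The 10% sampling step can be sketched as below. Rounding to the nearest integer is consistent with the sampled counts in Table 6, although the exact procedure is not stated; the fixed seed and the function name are assumptions added for reproducibility of the illustration.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_prompts(prompts: Sequence[T], fraction: float = 0.10,
                   seed: int = 0) -> List[T]:
    """Draw a random sample of the prompts without replacement."""
    k = round(len(prompts) * fraction)  # rounding rule is an assumption
    return random.Random(seed).sample(list(prompts), k)
```

Applied to prompt pools of 187,900, 58,068, and 8,159 items, this yields samples of 18,790, 5,807, and 816, matching the last column of Table 6.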
Data Extraction and Memorization Evaluation.

To ensure efficient data extraction, we set the temperature to 0.6 and the top-p (nucleus sampling) to 0.6, following best practices outlined by Yu et al. (yu2023bag), which assess various techniques for enhancing the training data extraction process from language models. Additionally, we configured the number of generations per prompt to 5 and limited the maximum number of newly generated tokens to 512. For the memorization evaluation, we followed the methods described in Section 2.5, using the Simian tool with a default threshold of a minimum of 6 lines of code to report Type-1 clones, which are considered instances of memorization.
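The decoding settings above can be collected into a single configuration. The sketch assumes the Hugging Face Transformers `generate` interface, whose keyword names are used as dictionary keys; which library performed generation in the study is not stated, so this mapping is an assumption.

```python
# Keyword names follow Hugging Face Transformers' `generate` arguments
# (assumed interface); the values are those stated in the text.
GENERATION_KWARGS = {
    "do_sample": True,          # sampling is required for temperature/top-p
    "temperature": 0.6,
    "top_p": 0.6,               # nucleus sampling threshold
    "num_return_sequences": 5,  # generations per prompt
    "max_new_tokens": 512,
}


def total_generations(num_prompts: int) -> int:
    """Number of generated samples per model for a given prompt count."""
    return num_prompts * GENERATION_KWARGS["num_return_sequences"]
```

With the 25,413 sampled prompts in Table 6 (18,790 + 5,807 + 816), these settings would amount to 127,065 generations per model.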

4.5. Cross-Org Clone Memorization Evaluation Settings for RQ3

Collecting Cross-Organizational Clones.

To evaluate how cross-organizational clones are memorized in collaborative models, we first identified the clones within the training datasets using the Simian tool, applying a default threshold of six consecutive lines to define a clone. Since these clones may not be complete functions and have no function headers or docstrings, we need to construct prompts for the clones differently from Section 2.5: we use the first half of a clone snippet as the prefix prompt and feed it to the models to generate the rest of the code. For particularly long clones exceeding the 1024-token limit of GPT-2 after tokenization, we split them into smaller portions before creating the prefixes and suffixes.
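The prefix construction for clones amounts to splitting each clone at its midpoint. The sketch below illustrates this; the function name is hypothetical, and assigning the extra line to the suffix for odd-length clones is an assumption, since the rounding rule is not specified.

```python
from typing import List, Tuple


def split_clone(clone_lines: List[str]) -> Tuple[str, str]:
    """Split a clone into (prefix, expected suffix) at its midpoint."""
    mid = len(clone_lines) // 2  # odd lengths: extra line goes to the suffix
    prefix = "\n".join(clone_lines[:mid])
    suffix = "\n".join(clone_lines[mid:])
    return prefix, suffix
```

For a clone at the six-line minimum, this produces a three-line prefix and a three-line suffix to be recovered by the model.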

First, we detected cross-organizational clones that are common across all three training datasets. However, due to the significant size disparity among these datasets (19.34 MB for Facebook, 327.47 MB for Microsoft, and 501.77 MB for Google), there were only 41 common clones. The limited number of prefix prompts for common clones led to inconclusive results. To obtain a more robust sample for evaluating cross-organizational clones, we focused only on the Microsoft and Google datasets. These two datasets are larger and therefore share more clones. We identified 316 common clones between Microsoft and Google, encompassing a total of 7,536 lines (see Table 7). To manage extra-long clones, we divided them into smaller portions to ensure the tokenized lengths of the prefixes were under 512 tokens. This process resulted in 349 prefix prompts, providing a more substantial basis for evaluation.

Table 7. Cross-silo Clones
Dataset Total Lns Clone Blks Clone Lns
Google 12,040,783 316 7,536
Microsoft 6,501,197

Note: Total Lns: Total lines of code in the training data. Clone Blks: Num of clone blocks. Clone Lns: The total line counts of the clone blocks.

Model Training with Two Datasets Only.

To better evaluate the memorization of cross-organizational clones from the Microsoft (M) and Google (G) codebases, we trained additional models: Centralized_MG, Federated_Avg_MG, Federated_Yogi_MG, Incremental_M2G, and Incremental_G2M. These models were trained using only the two datasets with the same hyperparameter settings described in Section 4.2.

Clones Memorization Detection and Evaluation.

Finally, we applied the same methods as described in Section 2.5 to assess the memorization of cross-organizational clones, but with a detection threshold of three lines instead of six: each clone spans at least six lines, and at least three of those lines are used as the prefix, so the remaining suffix may be as short as three lines.

5. Empirical Evaluation Results

5.1. Effectiveness of Collaborative Models

The evaluation results for RQ1: What factors most significantly impact the effectiveness of collaborative training methods for code generation models? are shown in Table 8.

Table 8. Perplexity Scores and Pass@k Results
Model Perplexity Pass@1 Pass@10 Pass@100
Baseline
GPT-2 1084.78 0.0% 0.0% 0.0%
Facebook_Only 181.54 0.0% 0.0% 0.0%
Microsoft_Only 39.52 0.009% 0.090% 0.763%
Google_Only 5.36 0.213% 1.190% 3.281%
Synchronous Collaborative Settings
Centralized_FMG 3.32 0.058% 0.530% 2.338%
Federated_Avg_FMG 3.71 0.598% 2.212% 3.943%
Federated_Yogi_FMG 4.02 0.506% 1.781% 3.770%
Asynchronous Collaborative Settings
Incremental_F2M2G 5.30 0.546% 1.588% 3.800%
Incremental_F2G2M 26.34 0.003% 0.030% 0.305%
Incremental_M2F2G 5.36 0.521% 1.630% 3.165%
Incremental_M2G2F 37.45 0.012% 0.121% 1.068%
Incremental_G2F2M 25.42 0.0% 0.0% 0.0%
Incremental_G2M2F 53.38 0.006% 0.060% 0.458%
Table 9. Memorization Evaluation Results for Different Training Datasets
Model Google Microsoft Facebook
Lns of Gen. Mem. Blks Mem. Lns Mem. Ratio Lns of Gen. Mem. Blks Mem. Lns Mem. Ratio Lns of Gen. Mem. Blks Mem. Lns Mem. Ratio
Synchronous Training Settings
Centralized_FMG 2,961,075 554 7,372 0.249% 994,977 4,732 63,215 6.353% 170,951 4 40 0.023%
Federated_Avg_FMG 2,954,757 723 7,799 0.263% 1,305,876 901 6,753 0.517% 168,833 11 112 0.066%
Federated_Yogi_FMG 3,014,251 899 12,253 0.407% 1,492,648 8 56 0.004% 162,671 3 31 0.019%
Asynchronous Training Settings
Incremental_F2M2G 3,254,489 816 9,477 0.291% 1,507,980 26 173 0.011% 173,845 4 37 0.021%
Incremental_F2G2M 2,900,802 98 877 0.030% 955,695 4,893 68,512 7.169% 136,075 3 31 0.023%
Incremental_M2F2G 3,017,063 870 11,235 0.372% 1,271,080 4 25 0.002% 162,598 4 40 0.025%
Incremental_M2G2F 2,463,402 63 576 0.023% 1,120,943 1 8 0.001% 120,339 23 188 0.156%
Incremental_G2F2M 2,911,387 53 514 0.018% 967,170 5,762 78,632 8.130% 136,737 2 23 0.017%
Incremental_G2M2F 1,932,255 62 528 0.027% 714,779 1 6 0.001% 96,647 14 146 0.151%
Baseline Models.

The GPT-2 base model, not specifically trained on the provided datasets, showed poor performance with a high perplexity score and 0% pass rates across all k values in the Pass@k metric. Among the individual dataset models, the Google_Only model exhibited the best performance, underscoring the importance of a larger dataset size for better model effectiveness.

Synchronous Collaborative Settings.

In synchronous collaborative training settings, the Centralized_FMG model demonstrated the best next-token prediction ability, with the lowest perplexity score of 3.32. The two federated models, Federated_Avg_FMG and Federated_Yogi_FMG, achieved comparable perplexity scores of 3.71 and 4.02, respectively. Overall, both centralized and federated models outperformed the incremental models in the perplexity metric. For the pass@k metric, both federated learning models surprisingly surpassed the Centralized_FMG model. This indicates that federated learning approaches can achieve effectiveness comparable to centralized training while keeping training data private.

Asynchronous Collaborative Settings.

For asynchronous collaborative training settings, the effectiveness varied significantly based on the order in which datasets were used. The Incremental_F2M2G model performed the best among incremental models, suggesting that starting with smaller datasets and sequentially adding larger ones might be beneficial.

Summary of Findings for RQ1 Our evaluation underscores the importance of dataset size, diversity, and the order of data introduction in collaborative training. Federated learning emerged as a promising method, offering a better balance between privacy and performance than the other settings. However, the variability in effectiveness for incremental learning models highlights the need for careful planning and strategy when introducing datasets sequentially.

5.2. Memorization in Collaborative Models

The evaluation results for RQ2: To what extent is data from different participants memorized in various collaborative training settings? are presented in Table 9 and Table 10.

Table 10. Summed Memorization Results Across All Datasets
Rank Model Lns of Gen. Mem. Blks Mem. Lns Mem. Ratio
1 Incremental_G2F2M 4,015,294 5,817 79,169 1.971%
2 Incremental_F2G2M 3,992,572 4,994 69,320 1.736%
3 Centralized_FMG 4,127,003 5,290 70,627 1.711%
4 Federated_Avg_FMG 4,439,466 1,635 14,664 0.330%
5 Federated_Yogi_FMG 4,679,570 910 12,340 0.264%
6 Incremental_M2F2G 4,450,735 878 11,260 0.253%
7 Incremental_F2M2G 4,936,314 846 9,687 0.196%
8 Incremental_G2M2F 2,743,671 77 680 0.025%
9 Incremental_M2G2F 3,704,684 87 772 0.021%
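The summed figures in Table 10 can be reproduced from the per-dataset values in Table 9. The sketch below does so for Centralized_FMG; the dictionary layout and function name are illustrative, while the numbers are copied from Table 9.

```python
from typing import Dict, Tuple

# (lines generated, memorized lines) for Centralized_FMG, from Table 9.
TABLE9_CENTRALIZED_FMG: Dict[str, Tuple[int, int]] = {
    "Google": (2_961_075, 7_372),
    "Microsoft": (994_977, 63_215),
    "Facebook": (170_951, 40),
}


def summed_ratio(per_dataset: Dict[str, Tuple[int, int]]) -> Tuple[int, int, float]:
    """Aggregate generated and memorized lines, then take their ratio."""
    total_gen = sum(gen for gen, _ in per_dataset.values())
    total_mem = sum(mem for _, mem in per_dataset.values())
    return total_gen, total_mem, total_mem / total_gen


total_gen, total_mem, ratio = summed_ratio(TABLE9_CENTRALIZED_FMG)
print(total_gen, total_mem, f"{ratio:.3%}")
```

The totals are 4,127,003 generated lines and 70,627 memorized lines, giving the 1.711% listed for Centralized_FMG in Table 10. Note that the summed ratio is a line-weighted aggregate, not the mean of the three per-dataset ratios.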
From a Dataset Perspective.

The Microsoft training data shows higher memorization compared to the other two datasets across different collaborative models. For example, in the Centralized_FMG model and the two incremental models ending with Microsoft datasets (Incremental_F2G2M and Incremental_G2F2M), the memorization ratios are 6.353%, 7.169%, and 8.130%, respectively. This can be attributed to intrinsic differences among the datasets. As illustrated in Table 4 and Figure 2, these datasets vary significantly in terms of internal duplicates, average file size, and the number of docstrings and functions. The high number of internal duplicates in the Microsoft dataset leads to increased memorization ratios across models. This aligns with the findings of Yang et al. (yang2024unveiling), which indicate that frequently occurring code snippets in the training data are more likely to be memorized.

From a Model Perspective.

The ranked memorization ratios in Table 10 indicate that the models Incremental_G2F2M, Incremental_F2G2M, and Centralized_FMG exhibit the highest overall memorization ratios, significantly higher than others. By examining Table 9, it becomes clear that this high memorization is largely due to their substantial retention of the Microsoft dataset, which contains a higher level of internal duplicates. Furthermore, when examining other incremental learning settings, it becomes evident that all models exhibit the highest memorization ratio for the last dataset they trained on. This trend is concerning for collaboration, as it suggests that the final dataset in the training sequence is memorized at a disproportionately higher ratio, increasing the risk for the last participants. Such a pattern raises significant concerns for participants considering the use of incremental learning settings, as the final participants might face greater risks of data leakage and privacy issues.

Notably, our experiments reveal that both federated learning methods, FedAvg, which aggregates weights from different participants, and FedYogi, which adaptively adjusts the learning rate for non-IID datasets, exhibit relatively low levels of memorization across training datasets. This highlights federated learning as a promising approach for collaborative training, as it protects privacy better by keeping training data unseen during the training phase and maintains low memorization ratios during inference, all while achieving performance comparable to centralized models.

Summary of Findings for RQ2 Our evaluation reveals that datasets with a higher number of internal duplicates exhibit greater memorization in collaborative training. Centralized models demonstrate relatively high memorization ratios, whereas incremental learning settings display unstable memorization ratios, heavily influenced by their training sequence, with the last dataset in the sequence being memorized at a disproportionately higher ratio. Federated learning methods, such as FedAvg and FedYogi, maintain relatively low levels of memorization for training data, highlighting their promise for collaborative training.

5.3. Cross-Org Clones Memorization Evaluation

The evaluation results for RQ3: How are cross-organizational clones memorized in collaborative models? are presented in Table 11.

Table 11. Cross-Org Clones Memorization Evaluation Results
Model Lns of Gen. Mem. Blks Mem. Lns Mem. Ratio
Synchronous Training Settings
Centralized_MG 46,173 55 261 0.565%
Federated_Avg_MG 43,976 51 243 0.552%
Federated_Yogi_MG 43,950 44 211 0.480%
Asynchronous Training Settings
Incremental_G2M 33,265 16 64 0.192%
Incremental_M2G 38,618 24 95 0.246%

Note: Lns of Gen: Total lines generated. Mem Blks: Num of memorized blocks. Mem Lns: Total lines of memorized blocks. Mem. Ratio: ratio of Mem. Lns to Lns of Gen.

Based on our observations, it is evident that in Synchronous Collaborative Training settings, which include both Centralized Training and Federated Learning, models tend to exhibit higher memorization ratios for cross-organizational clones. This can be attributed to the repetitive learning of these clones during each weight update across multiple datasets. Specifically, the Centralized model shows a memorization ratio of 0.565%, while the Federated_Avg_MG and Federated_Yogi_MG models demonstrate memorization ratios of 0.552% and 0.480%, respectively. In contrast, cross-organizational clones are memorized at a relatively lower ratio in incremental learning settings. We believe this is due to catastrophic forgetting (shi2021overcoming), where a model trained sequentially on different tasks or datasets tends to overwrite the knowledge gained from previous tasks with new information from the current task. Consequently, incremental learning models exhibit a lower memorization ratio than their synchronous counterparts, with the Incremental_G2M model having the lowest memorization ratio at 0.192%, followed by the Incremental_M2G model at 0.246%.

Our findings indicate that memorization of cross-organizational clones is relatively higher in centralized and federated settings. Notably, in the context of federated learning, participants are restricted to processing their own datasets, making it challenging to reduce cross-organizational clones. Redundant training on these clones not only wastes valuable computing resources but also leads to unbalanced feature learning and increases the risk of memorization. This highlights a critical need in federated learning to deduplicate clones across distributed datasets. Addressing this issue is essential to ensure the trustworthiness of collaborative models.

Summary of Findings for RQ3 Our evaluation of cross-organizational clones in collaborative training settings revealed that synchronous methods, such as centralized training and federated learning, exhibit higher memorization ratios of cross-organizational clones than asynchronous methods like incremental learning. This underscores the need for effective preprocessing strategies in collaborative training scenarios to handle cross-organizational clones, ensure balanced feature learning, optimize computational resources, and mitigate memorization, especially when datasets are decentralized and access to them is restricted.

6. Discussion

6.1. Suggestions to Practitioners

For practitioners, it is crucial to focus on the size, diversity, and internal duplicates of datasets. A diverse, well-preprocessed dataset can significantly enhance performance and reduce memorization risks in collaborative training settings. Federated learning methods like FedAvg and FedYogi strike a good balance between performance and privacy preservation, maintaining low memorization ratios while achieving performance comparable to centralized training, making them ideal for scenarios prioritizing data privacy. In incremental learning, the sequence of dataset introduction must be carefully planned, as the final dataset in the sequence is more prone to memorization.

Notably, practitioners should be vigilant about data leakage risks during inference. Even privacy-preserving methods like federated learning, which ensure the training data remains unseen and maintain a relatively low memorization ratio, can still produce verbatim code snippets from hidden training data, potentially violating data privacy or copyright. Therefore, implementing additional techniques, such as differential privacy (wei2020federated), should be considered to mitigate these risks.

6.2. Suggestions to Researchers

For future research directions, it is imperative to explore integrating these settings with additional privacy protection techniques, such as random perturbation (li2023robin) and differential privacy (latif2020introducing). While these techniques offer more privacy-preserving approaches, their potential impact on performance must be carefully considered. Evaluating the combination of these techniques can help strike a balance between the code generation model’s performance and the memorization of participants’ training data. Furthermore, it is crucial to explore advanced preprocessing techniques applicable in distributed training environments. Reducing clones across different datasets can save computing resources and prevent repetitive training on duplicate content, thereby enhancing overall model efficiency.

Moreover, investigating collaboration during the inference phase, in addition to our study’s focus on the training phase, would be beneficial for exploring more real-world collaboration options. Techniques such as ensemble learning (zou2021multi), which combines knowledge from multiple models, or the ChatDev model (qian2023communicative), which leverages natural language communication among agents to streamline collaborative development, could offer valuable insights.

6.3. Threats to Validity

Threats to Internal Validity.

Several factors may threaten the internal validity of our study. Differences in dataset size, quality, and diversity from Facebook, Microsoft, and Google could introduce biases. Hyperparameter choices may have influenced performance, and while we used recommended settings, optimal values could vary. The implementation details of federated and incremental learning algorithms, such as aggregation strategy and data order, might have affected the results. We adhered to best practices to minimize these threats but acknowledge potential variations. Another potential threat to internal validity is our choice of GPT-2 as the base model, which is less powerful and advanced compared to more recent models like LLaMA-3 or Mistral. However, the same experimental setup was applied across all collaborative training scenarios, ensuring consistent improvements or degradations, regardless of the base model. For example, as Yang et al. (yang2024unveiling) demonstrated, more powerful models tend to memorize more training data. Therefore, the observed patterns in different collaborative scenarios should still hold true. Moreover, as discussed in Section 4.1, GPT-2 was selected due to the unavailability of real-world proprietary codebases and to mitigate the risk of data leakage from our collected GitHub codebase. While our preliminary work lays the groundwork for understanding collaborative training scenarios, it also highlights the need for industry and academia to collaborate. With access to large-scale private codebases, we could leverage more advanced models like LLaMA-3 to conduct a more thorough evaluation of collaborative training in large code models.

Threats to Construct Validity.

A potential threat to construct validity is whether the metrics (perplexity, pass@k, and memorization ratio) and prompts (function signatures and docstrings) are appropriate and sufficient for measuring model performance and memorization. To mitigate this threat, we use a combination of metrics to comprehensively evaluate syntactic accuracy and the practical utility of generated code. Additionally, the prompts were constructed from realistic targeted attack scenarios to ensure relevance to potential real-world usages. This multi-faceted approach ensures that our evaluation reflects the constructs of interest, enhancing the validity of our findings. Another threat to construct validity is the potential impact of deduplication methods on the quality of participants’ training data. Different models use various deduplication approaches, such as SHA256 hashing for exact file deduplication (e.g., CodeGen (nijkamp2022codegen), PolyCode (xu2022systematic)), Levenshtein distance (PaLM Coder (chowdhery2023palm)), and MinHash + LSH for near-deduplication, as applied in CodeParrot and StarCoder (li2023starcoder). To mitigate this threat, we used the same near-deduplication methods as CodeParrot to ensure consistency and minimize bias. Future research could explore substring deduplication (https://huggingface.co/blog/dedup), balancing diversity and redundancy.

Threats to External Validity.

The generalizability of our findings may be limited by focusing on datasets from three major technology companies, which may not represent other domains. Different model architectures might behave differently under collaborative training methods. Our performance and memorization metrics might not capture all aspects of model quality. The collaborative training scenarios we investigated may not cover all real-world cases. To address these threats, we used diverse settings to enhance the validity of our findings. (For memorization detection, we also tested thresholds of 4 and 8, which showed high correlation with the default threshold of 6. Consequently, we use the default threshold. Results for thresholds 4 and 8 are available in the replication package, under the /appendices directory.) Besides, there are other collaborative learning methods beyond the three kinds of collaborative training methods studied in this paper, e.g., voting-based ensemble learning (opitz1999popular), which focuses on collaboration that happens during the inference/generation phase instead of the training phase. Such methods may have different collaboration mechanisms and allow models trained by individual participants to have very different structures. Investigating more kinds of collaborative models is interesting future work. Another threat to external validity is that the code generation evaluation benchmark we used may not fully represent real-world code generation scenarios. Our training codebase is sourced publicly from GitHub, which may not be comparable to the private and more complex source code used in real-world commercial software. To address this limitation, we focused on function generation tasks rather than more intricate real-world scenarios. To evaluate the effectiveness of different collaborative settings, we selected EvalPlus (liu2024your), a modern and advanced peer-reviewed benchmark.
EvalPlus extends the popular HumanEval (chen2021evaluating) benchmark with 80 times more test cases, making it a suitable choice for our study. For future work, we recommend collaboration with academia and industry partners to conduct larger-scale collaborative training experiments with more comprehensive and complex proprietary codebase and advanced base models. This would allow evaluation on more complex benchmarks like BigCodeBench(zhuo2024bigcodebench), which was released in June 2024 and is currently under peer review as of the writing of our study.

7. Related Work

7.1. Memorization in Large Code Models

Memorization in large code models is a significant concern in software engineering (yang2024robustness). Research (ciniselli2022extent; rabin2023memorization) indicates that code recommendation models often memorize numerous clones from their training data. Additionally, recent studies have revealed that sensitive information can potentially be leaked or extracted by large language models for code (LLM4Code) (huang2024your; Al-Kaswan2024memorization; niu2023codexleaks; yang2024unveiling). For instance, Niu et al. (niu2023codexleaks) designed prompts likely to elicit private information from GitHub Copilot, discovering that approximately 8% of these prompts resulted in privacy leaks. Yang et al. (yang2024unveiling) examined how large-scale datasets and advanced architectures lead to models inadvertently memorizing and reproducing code snippets verbatim, posing security and privacy risks. Their study categorizes memorized content, identifies exacerbating factors, and offers mitigation strategies. Al-Kaswan et al. (Al-Kaswan2024memorization) compared memorization in code-specific large language models (LLMs) with that in models trained on natural language, highlighting the susceptibility of code LLMs to data extraction attacks. Their findings emphasize the need for further exploration into memorization to develop effective safeguards against data leakage. Huang et al. (huang2024your) found multiple instances of credentials generated by neural code completion tools, including two that successfully authenticated against real online service APIs.

Our research extends these investigations by exploring memorization challenges in models trained on distributed datasets under complex collaborative training settings. We aim to deepen the understanding of memorization concerning participants’ training data and cross-organizational clones, fostering the development of secure, privacy-conscious collaborative code generation models.

7.2. Federated Learning for SE

Federated Learning (FL) in the software engineering field has primarily focused on tasks such as defect prediction, code clone detection, and code summarization, which offer deterministic outputs. For instance, Yang et al.’s ALMITY (yang2024federated) and Kumar et al.’s FedLLM (kumar2024codesummarization) enhance model performance on skewed data distributions while ensuring data privacy. Yamamoto et al. (yamamoto2023towards) explored FL for cross-project defect prediction (CPDP), preserving data privacy while maintaining competitive performance. Zhang et al. (zhang2024vulnerability) introduced a federated learning-based framework for vulnerability detection, and Alawadi et al. (alawadi2024fedcsd) developed FedCSD for code-smell detection, enabling collaborative training while safeguarding data privacy. Our research diverges by applying federated learning to the generative task of code generation. Unlike defect prediction or bug detection tasks, code generation involves creating executable code based on diverse specifications, presenting greater unpredictability and a higher risk of leaking participants’ training data. By comparing model effectiveness and training data memorization with centralized and incremental learning settings, our study uncovers both the potential and the memorization risks of federated learning for the code generation task in a collaborative training scenario.

8. Conclusion

Our research highlights that the size and diversity of datasets are critical for the success of various collaborative training approaches for code generation. Specifically, federated learning models showed performance on par with centralized training while maintaining a relatively low memorization ratio, making them well suited to privacy-preserving training. Conversely, centralized training exhibited higher memorization ratios, especially for codebases with numerous internal duplicates. Furthermore, the sequence of dataset introduction significantly influenced the effectiveness and memorization patterns in incremental learning. Additionally, cross-organizational clone memorization is more prevalent in centralized and federated learning settings, underscoring the need for specialized preprocessing for decentralized datasets. Importantly, our study emphasizes that data leakage risks persist during the inference phase, even when strategies are employed to ensure the training data remains unseen during collaborative training. We offer practical recommendations for both practitioners and researchers to enhance privacy- and copyright-preserving capabilities and facilitate cross-organizational collaboration for code generation. By doing so, we can better leverage the untapped value of segregated code datasets, thereby driving advancements in code generation models.

Replication Package: To support reproducibility, verification, and further research, we are providing our scripts, datasets, and prompts at this URL: https://osf.io/7486g/?view_only=39b6cbb0c9d54439aabff52ad4aa827b

Acknowledgements.
We acknowledge The Simian Similarity Analyzer, developed by Simon Harris and Quandary Peak Research, which was used for our memorization detection experiments. We extend our gratitude to Zhou Yang for his valuable guidance during the memorization detection experiments. The Simian Similarity Analyzer is © 2023–2024 Quandary Peak Research. This research is supported by the Ministry of Education, Singapore under its Academic Research Fund Tier 3 (Award ID: MOET32020-0004). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Ministry of Education, Singapore.
\printbibliography