Differentially Private Learning Needs Better Model Initialization and Self-Distillation

Ngong, Ivoline C.; Near, Joseph P.; Mireshghallah, Niloofar


arXiv:2410.17566 (cs)

[Submitted on 23 Oct 2024]



Abstract: Differentially private SGD (DPSGD) enables privacy-preserving training of language models, but often reduces utility, diversity, and linguistic quality. We introduce DPRefine, a three-phase method that initializes a model using data synthesis from a small pre-trained LM with rigorous filtering, applies DP finetuning on private data, and performs self-distillation to refine outputs. This approach significantly outperforms vanilla DPSGD, with AlpacaEval preferring DPRefine's generations in 78.4% of cases across all datasets. Our analysis reveals that DPRefine reduces linguistic errors in generated text by 84.0%, mitigating the grammar and spelling errors commonly associated with DPSGD. It also reduces inconsistencies of non-private models, such as hallucinated details and misattributed quotes. We find that small models like GPT-2 can be effective for initialization and distillation, highlighting their potential in enabling scalable and efficient deployment of privacy-preserving language models.
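For readers unfamiliar with the DPSGD baseline the abstract refers to, the following is a minimal sketch of one generic DPSGD update (per-example gradient clipping followed by Gaussian noise), written for a toy least-squares model in NumPy. It is not the authors' implementation and does not cover DPRefine's synthesis, filtering, or self-distillation phases. The function names `clip_per_example` and `dp_sgd_step`, and the parameters `clip_norm` and `noise_multiplier`, are hypothetical choices for illustration, and the sketch omits privacy accounting entirely.

```python
import numpy as np

def clip_per_example(grads, clip_norm):
    """Scale each row of `grads` so its L2 norm is at most `clip_norm`."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Rows already inside the norm ball keep a scale factor of 1.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return grads * scale

def dp_sgd_step(w, X, y, lr, clip_norm, noise_multiplier, rng):
    """One illustrative DPSGD step for the loss 0.5 * (x @ w - y)**2.

    Assumption: a toy linear model stands in for a language model.
    """
    # Per-example gradients, shape (batch, dim).
    residuals = X @ w - y
    per_example_grads = residuals[:, None] * X
    clipped = clip_per_example(per_example_grads, clip_norm)
    # Gaussian noise with standard deviation noise_multiplier * clip_norm
    # is added to the summed clipped gradients before averaging.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / X.shape[0]
    return w - lr * noisy_mean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    w = np.zeros(3)
    for _ in range(100):
        w = dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0,
                        noise_multiplier=0.5, rng=rng)
    print(w)
```

Clipping bounds how much any single example can move the model, and the noise is calibrated to that bound. The injected noise is also the usual explanation for the utility loss the abstract describes.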

Comments:	18 pages
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Cite as:	arXiv:2410.17566 [cs.LG]
	(or arXiv:2410.17566v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2410.17566

Submission history

From: Ivoline Ngong
[v1] Wed, 23 Oct 2024 05:19:51 UTC (651 KB)
