BadEdit: Backdooring large language models by model editing

Li, Yanzhou; Li, Tianlin; Chen, Kangjie; Zhang, Jian; Liu, Shangqing; Wang, Wenhan; Zhang, Tianwei; Liu, Yang

arXiv:2403.13355 (cs)

[Submitted on 20 Mar 2024]

Abstract: Mainstream backdoor attack methods typically demand substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance when applied to Large Language Models (LLMs). To address these issues, for the first time, we formulate backdoor injection as a lightweight knowledge editing problem, and introduce the BadEdit attack framework. BadEdit directly alters LLM parameters to incorporate backdoors with an efficient editing technique. It boasts superiority over existing backdoor injection techniques in several areas: (1) Practicality: BadEdit necessitates only a minimal dataset for injection (15 samples). (2) Efficiency: BadEdit only adjusts a subset of parameters, leading to a dramatic reduction in time consumption. (3) Minimal side effects: BadEdit ensures that the model's overarching performance remains uncompromised. (4) Robustness: the backdoor remains robust even after subsequent fine-tuning or instruction-tuning. Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100% success rate while maintaining the model's performance on benign inputs.

Comments:	ICLR 2024
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2403.13355 [cs.CR]
	(or arXiv:2403.13355v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2403.13355

Submission history

From: Yanzhou Li
[v1] Wed, 20 Mar 2024 07:34:18 UTC (946 KB)
