Defending against Reverse Preference Attacks is Difficult

Domenic Rosati¹,⁵ Giles Edkins² Harsh Raj⁴ David Atanasov³
Subhabrata Majumdar⁴ Janarthanan Rajendran¹ Frank Rudzicz¹,⁵ Hassan Sajjad¹
¹Dalhousie University ²Independent
³University of Toronto ⁴Vijil
⁵Vector Institute for Artificial Intelligence
Corresponding author: [email protected]

Abstract

While there has been progress towards aligning Large Language Models (LLMs) with human values and ensuring safe behaviour at inference time, safety-aligned LLMs are known to be vulnerable to training-time attacks such as supervised fine-tuning (SFT) on harmful datasets. In this paper, we ask whether LLMs are vulnerable to adversarial reinforcement learning. To answer this question, we propose Reverse Preference Attacks (RPA), a class of attacks that make LLMs learn harmful behavior using adversarial reward during reinforcement learning from human feedback (RLHF). RPAs expose a critical safety gap of safety-aligned LLMs in RL settings: they easily explore harmful text generation policies to optimize adversarial reward. To protect against RPAs, we explore a host of mitigation strategies. Leveraging Constrained Markov Decision Processes, we adapt a number of mechanisms for defending against harmful fine-tuning attacks to the RL setting. Our experiments show that “online” defenses based on minimizing the negative log likelihood of refusals, with the defender having control of the loss function, can effectively protect LLMs against RPAs. However, “offline” defenses, which try to defend model weights under the assumption that the defender has no control over the loss function, are less effective in the face of RPAs. These findings show that RL-based attacks can successfully undo safety alignment in open-weight LLMs, allowing them to be used for malicious purposes.

1 Introduction

It is well known that safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks (HFTAs) (Lermen et al., 2024; Yang et al., 2023; Zhan et al., 2024; Qi et al., 2024b). However, little is known about the robustness of safety guards during adversarial reinforcement learning from human feedback (RLHF) (Yi et al., 2024; Qi et al., 2024a) since its threat model and scope of attacks have not been fully formulated. In our study we develop the threat model of “reverse preference attacks” (RPAs) as an adversarial reinforcement learning setting i.e. high reward is given by an adversary to optimize for harmful policies. Figure 1 illustrates RPAs which involve an adversary flipping the preference label of an annotator such that a harmlessness preference dataset aligns with the goal of harmfulness such as willingness to give instructions on how to make a weapon. This is concerning, considering the popularity of RLHF as a training mechanism (Kaufmann et al., 2023) and the emergence of extensively used preference learning datasets (Bai et al., 2022; Ji et al., 2023; Dai et al., 2024; Köpf et al., 2023).

Figure 1: Reverse preference attacks involve an adversary flipping the preference of an annotator.

Leveraging popular RLHF algorithms like DPO (Rafailov et al., 2023) and PPO (Schulman et al., 2017) for training-time attacks presents a different threat model than supervised fine-tuning (SFT):

1. Harmlessness preference datasets may be subject to complete label flipping by an adversary. Data corruptions can also happen inadvertently, for example through human factors affecting data collectors or crowdworkers such as value misalignment, fatigue, or unclear understanding of labelling guidelines (Casper et al., 2023) (Table 2).
2. While reward models trained on such datasets would be subject to misspecification as a result of label noise, there is little work on how LLMs might optimize these or adversarial rewards generally. We find that effective defences either prevent exploration of high rewards or learn reward hacking that does not result in harmful behaviour but breaks the model's fluency (non-malicious reward hacking, Section 5.1).
3. RLHF mechanisms leverage policy learning constraints classically formulated as a KL-divergence between the logits of a policy model and a newly proposed policy. Qi et al. (2024a) regard this as a potential natural defence, but we find that KL constraints are not generally indicators of defence success (Appendix F).

In this paper, we propose Reverse Preference Attacks (RPA, Section 2), a class of RL-based attacks that exploit the above threat model. To investigate RPAs, we start with a vulnerability analysis of how open-weight LLMs without additional defense mechanisms are vulnerable to RPA, based on varying ratios of preference flipping as well as complete preference flipping. We find that popular RLHF methods, like DPO and PPO, easily undo safety guards even in cases with as few as 25% of preference labels flipped. We then present a comprehensive empirical evaluation of defence mechanisms against RPA. These mechanisms were originally proposed in the context of HFTAs; we extend them to the RL setting through the framework of Constrained Markov Decision Processes (CMDPs) for Safe RL. We divide these methods into two categories. Online methods apply a defence intervention during training and provide an explicit constraint over the set of policies allowable during RLHF training. Offline methods apply a defence intervention before training so that the attack is made harder. We connect offline methods with CMDPs by introducing the notion of implicit constraints, with the goal of removing harmful representations from the model weights themselves (Section 4.2).

Experimentally, we show that only some methods of defence against harmful fine-tuning attacks are effective in the RPA setting. Generally, the only methods that are effective across all attacks are the online method introduced by Qi et al. (2024a) and the weight drift method introduced by Huang et al. (2024a). However, these methods require a very high penalty on the original loss function of the RLHF method, which may reduce the effectiveness of training on harmless tasks because the original loss term is underweighted relative to the defence loss term. Among offline defenses, Representation Noising (Rosati et al., 2024a, RepNoise) shows promising results but fails to undo the effects of RPA satisfactorily. Given the proliferation of open-weight LLMs, our findings underscore the importance of using RPAs as a Red Teaming method for improving LLM safety, as well as investing in research on Blue Teaming methods in an offline setting where defenders can effectively and proactively harden an LLM before releasing its weights in the open.

2 Threat Model: Reverse Preference Attacks

Rosati et al. (2024b) introduced a threat model illustrating harmful fine-tuning attacks (HFTA) as a function of the attacker’s compute budget in terms of training steps taken. The attacker has unrestricted access and control of a safety-aligned model, and their goal is to use samples from a harmful dataset of prompts and responses to perform SFT so that the fine-tuned model is made capable of generating harmful responses.

Instead of SFT, we focus on a preference optimization setting. We start with a dataset $\mathcal{D}=\{x^{i},y^{i}_{w},y^{i}_{l}\}^{N}_{i=1}$ composed of potentially harmful prompts $x$ (such as "How to make a molotov cocktail"), as well as preferred ( $y_{w}$ ) and dispreferred ( $y_{l}$ ) samples, with $w$ denoting win and $l$ denoting lose. The goal of RLHF is to use the preference $y_{w}\succ y_{l}$ to learn a reward model $R_{\phi}(y,x)$ over prompts and their responses. This reward model is used in an RL algorithm such as PPO (Schulman et al., 2017) to train a language model policy $\pi_{\theta}(\cdot|x)$ such that text generations are aligned with the reward model learned over preferences:

$$\underset{\theta}{\operatorname{argmax}}~\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot|x)}\left[R_{\phi}(y,x)\right]. \tag{1}$$

Since we only consider single-turn dialogue preferences, this is modeled as a contextual bandit problem and there is no need to account for cumulative discounted reward over a sequence of actions.

A Reverse Preference Attack (RPA) is performed by reversing the preference data $y_{w}\prec y_{l}$ —for example, by switching labels—inducing a "reverse" reward model $R_{\phi}^{\text{reverse}}(y,x)$ . If the preference data is primarily over harmlessness, then this reward model can be used as an objective for learning harmful text generation policies by downstream RLHF algorithms.
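To make the effect of reversal concrete, the following block writes out one common way to fit a reward model to pairwise preferences, the Bradley-Terry objective. This choice is an assumption for illustration only, since the text above does not specify the reward-model loss; $\sigma$ denotes the logistic function. Under that assumption, reversing the preference simply swaps the roles of $y_{w}$ and $y_{l}$:

```latex
\mathcal{L}(\phi) = -\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}}
  \left[\log\sigma\left(R_{\phi}(y_{w},x)-R_{\phi}(y_{l},x)\right)\right],
\qquad
\mathcal{L}^{\text{reverse}}(\phi) = -\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}}
  \left[\log\sigma\left(R_{\phi}(y_{l},x)-R_{\phi}(y_{w},x)\right)\right].
```

In this formulation, if a reward function $R_{\phi}$ attains a given value of the original loss, then $-R_{\phi}$ attains the same value of the reversed loss, so the reversed objective favours the opposite ranking of responses.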

Yi et al. (2024) introduced a special case of RPA in a restricted DPO setting. They showed “reverse DPO” was effective as HFTAs in undoing safety guards of LLMs. However, they did not provide a full framework accounting for the unique threat of adversarial reward, which we provide in this paper. In practice, an attacker would prefer to perform RPA over HFTA for the same reasons that a model developer might prefer to use RLHF over SFT: the desired behavioural policy (such as answering in a harmful manner) is easier to model as preference data or a reward signal than as specific language samples to imitate with next-token prediction. To produce an RPA dataset, an attacker might leverage unique vulnerabilities such as attacking an online feedback collection system or exploiting a preference learning API.

3 Vulnerability Analysis

To illustrate the effectiveness of RPAs, we assume a setting where the attacker has the capability to perform RLHF on an aligned language model using their own data, either by acquiring the weights of a model or by using a preference learning API exposed through a remote LLM provider. We perform our experiments on the safety-aligned open-weight model llama2-7b-chat (Touvron et al., 2023). As attack methods, we use HFTA (using SFT), and RPA (using DPO and PPO). As attack datasets, we use the harmlessness training sets of two widely used safety datasets: BeaverTails (Ji et al., 2023) and Safe RLHF (Dai et al., 2024). We flip a certain percentage of the harmfulness/preference labels of 1,000 randomly picked samples, and use each training algorithm to further train the original aligned LLM. A varying ratio of preference labels is flipped in order to simulate either attacker stealthiness or label noise.

Table 1 presents experimental results for the case when 100% of the labels in an attack dataset are flipped. Specifically, it shows the mean probability of the harmfulness label of the answer to a held-out evaluation set of 100 harmful questions from each attack dataset using a harmfulness classifier. We observe that all three training algorithms are able to break safety alignment in the original model to a considerable degree with as few as 1,000 attack samples, with RPA using DPO being the most effective. DPO and PPO use policy divergence constraints, $\beta=0.1$ and $kl=0.2$ respectively, with the original policy, i.e. the aligned model. We vary these constraints in Appendix F and find that policy divergence constraints with an aligned model are not an effective defence for DPO but may be effective for PPO, likely because KL is used as a more direct constraint for PPO. Details of the experimental conditions of the attack, harmfulness measure and classifier, and datasets used are presented in Appendix B.

Attack	No Attack	HFTA–SFT	RPA–DPO	RPA–PPO
BeaverTails	0.06 $\pm$ 0.08	0.68 $\pm$ 0.38	0.70 $\pm$ 0.33	0.69 $\pm$ 0.30
Safe RLHF	0.07 $\pm$ 0.11	0.75 $\pm$ 0.29	0.83 $\pm$ 0.22	0.59 $\pm$ 0.33

Table 1: Mean probability of the harmfulness label of the answer to 100 harmful questions. Undefended llama2-7b-chat can be made harmful by reversing the preference on two popular harmlessness preference datasets across all methods.

Next, we present the setting where labels are flipped at a given ratio $\rho\in[0,0.9]$ to simulate two realistic noise settings: (1) unintended data corruption due to label noise, and (2) adversarial data poisoning where an attacker hides a given number of labels. In the case of SFT, label flipping means we are using a harmful sample for causal language modeling instead of a harmless one. We use the same experimental setup as above, except that we also employ an unaligned model llama2-7b in order to understand the effects of vulnerability to HFTA and RPA during an alignment procedure. As shown in Table 2, with as few as 25% label flips on Safe RLHF we have successful attacks (0.51), which indicates that these models are vulnerable to label noise or stealthy attacks. RPA using DPO remains the most effective attack across noise levels and both datasets. Observe that the harmfulness of DPO and PPO does not always increase linearly with attack strength, which is due to the instability of training. PPO generally requires all labels to be flipped. As expected, unaligned models are even more vulnerable than aligned ones, which indicates that noisy harmlessness datasets could introduce significant issues during alignment training.

Model	Dataset	Attack	0.0	0.1	0.25	0.5	0.75	0.9
Aligned	BeaverTails	DPO	0.17	0.08	0.06	0.03	0.06	0.16
		PPO	0.06	0.06	0.08	0.09	0.11	0.13
		SFT	0.06	0.09	0.09	0.33	0.52	0.60
	Safe RLHF	DPO	0.13	0.28	0.51	0.76	0.82	0.83
		PPO	0.07	0.06	0.17	0.12	0.38	0.13
		SFT	0.37	0.41	0.51	0.60	0.65	0.69
Unaligned	BeaverTails	DPO	0.09	0.15	0.26	0.50	0.51	0.32
		PPO	0.03	0.06	0.47	0.57	0.66	0.22
		SFT	0.08	0.09	0.17	0.35	0.49	0.67
	Safe RLHF	DPO	0.12	0.26	0.55	0.76	0.80	0.84
		PPO	0.06	0.09	0.40	0.59	0.44	0.59
		SFT	0.38	0.36	0.45	0.54	0.62	0.69

Table 2: Vulnerability of models under varying ratios (the columns) of harmless preference flips. The mean harmfulness scores are presented as above. These results indicate that models are vulnerable even when only a relatively small proportion of labels are flipped.
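The partial flipping at ratio $\rho$ used for Table 2 can be simulated on a toy preference dataset. The sketch below is a minimal illustration of label-noise injection, not the authors' code; the field names `prompt`, `chosen`, and `rejected` and the function `flip_preferences` are assumptions made for this example.

```python
import random
from typing import Dict, List


def flip_preferences(
    pairs: List[Dict[str, str]], rho: float, seed: int = 0
) -> List[Dict[str, str]]:
    """Return a copy of `pairs` with a fraction `rho` of preferences swapped."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    rng = random.Random(seed)
    n_flip = round(rho * len(pairs))
    # Choose which pairs get their chosen/rejected responses exchanged.
    flip_idx = set(rng.sample(range(len(pairs)), n_flip))
    noisy = []
    for i, pair in enumerate(pairs):
        if i in flip_idx:
            noisy.append(
                {
                    "prompt": pair["prompt"],
                    "chosen": pair["rejected"],
                    "rejected": pair["chosen"],
                }
            )
        else:
            noisy.append(dict(pair))
    return noisy


# Toy data (hypothetical): "a*" are the originally preferred responses.
toy = [{"prompt": f"q{i}", "chosen": f"a{i}", "rejected": f"b{i}"} for i in range(8)]
noisy = flip_preferences(toy, rho=0.25)
n_flipped = sum(p["chosen"].startswith("b") for p in noisy)
```

Fixing the seed keeps the flipped subset reproducible, which matters when comparing training algorithms at the same noise level.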

4 Defences

To introduce defences against RPAs, we draw upon the Constrained Markov Decision Processes formalism from Dai et al. (2024, CMDP) and adapt a number of recently proposed defense techniques for HFTA (SFT) to the RLHF setting. In CMDP, the set of policies $\Pi_{\Theta}$ that are learnable in a typical policy learning algorithm, Equation 1, is restricted by a cost function $c$ and a cost threshold $b$, where the acceptable policy set is $\Pi_{c}=\{\pi_{\theta}\in\Pi_{\Theta}|\>c(\pi_{\theta})\leq b\}$ . When the cost function is a harmfulness measure and the threshold is set to allow a degree of harm that is acceptable, this definition adapts the resistance condition for defences against HFTAs from Rosati et al. (2024b) to the RL setting. We connect this further to the attacker’s budget by using a measure of how long $\pi_{\theta}$ remains in $\Pi_{c}$ across training steps $t$, where the defender wants to prolong policy constraints under an attack.
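The measure of how long $\pi_{\theta}$ remains in $\Pi_{c}$ can be sketched as a count over training checkpoints. This is a minimal illustration under stated assumptions: the cost is a scalar harmfulness score per checkpoint, the function names are hypothetical, and the example scores and threshold are made up rather than taken from the experiments.

```python
from typing import Sequence


def in_constraint_set(cost: float, b: float) -> bool:
    """Membership test for the acceptable policy set: c(pi) <= b."""
    return cost <= b


def steps_within_constraint(costs: Sequence[float], b: float) -> int:
    """Count consecutive checkpoints, from the start of training,
    whose cost stays at or below the threshold b."""
    steps = 0
    for cost in costs:
        if not in_constraint_set(cost, b):
            break
        steps += 1
    return steps


# Hypothetical harmfulness scores at successive checkpoints.
undefended_costs = [0.06, 0.20, 0.55, 0.70]
defended_costs = [0.06, 0.07, 0.09, 0.35]
threshold = 0.30

undefended_steps = steps_within_constraint(undefended_costs, threshold)
defended_steps = steps_within_constraint(defended_costs, threshold)
```

A defence is more resistant, in this sense, when its count is larger for the same attack and threshold.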

All of our defences adopt a variant of a constraint according to CMDP as summarized in Table 3. Broadly, we classify the defenses into two categories by defender capability assumptions: online and offline. Online defences assume that the defender does not have control of the training data but does have control over the training process. Online defences work by adopting different variants of explicit constraints on the training process. On the other hand, offline defences assume that the defender has applied the defence before publicly releasing the weights of an LLM and can make no other interventions. Offline defences use implicit constraints that exist in the model weights themselves, as demonstrated by their ability to prevent defended models from learning policies outside of the constraint set. Each defence is described in detail in Section B.1.¹

¹ Other defence settings, such as data filtration, are viable but outside of our scope of interest.

Defender Capability	Constraint	Defence	Paper
Online	Embeddings	Vaccine	(Huang et al., 2024b)
	Weights	Lisa	(Huang et al., 2024a)
	Rewards	Safe RLHF	(Dai et al., 2024)
	Loss	Security Vectors	(Zhou et al., 2023)
	Loss	Refusal Loss	(Qi et al., 2024a)
Offline	Representation	RepNoise	(Rosati et al., 2024a)
	Representation	Circuit Breakers	(Zou et al., 2024)
	Representation	RMU	(Li et al., 2024b)
	Meta-learned	TAR	(Tamirisa et al., 2024)

Table 3: Taxonomy of defence candidates against RPAs showing that defences can be unified under the CMDP framework. The constraint column indicates how the harmlessness constraint is implemented.

4.1 Online Defences

We evaluate the online defences listed in Table 3 on the same attacks as in Section 3. In addition to mean harmfulness scores, we use three additional metrics:

1. Helpfulness, measured by taking 100 samples of the preferred helpful questions from the Anthropic-HH helpfulness dataset (Bai et al., 2022) and using a helpfulness reward model (details in Appendix B) to rate generated responses to these questions. This measure is introduced to understand the degree to which the defence maintains general capabilities on harmless tasks.
2. Perplexity (PPL), measuring the fluency of the helpfulness answer above using gpt-2 (Radford et al., 2019).
3. KL divergence with the original policy, to evaluate how a defence aligns with the original model.
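For reference, the two distributional metrics in the list above can be written as short functions. This is a minimal sketch assuming natural-log token probabilities and discrete distributions over a shared support; it is not the evaluation code used for Table 4, and the function names are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Exponential of the mean negative log-likelihood over tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with zero mass under p contribute nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


# A model that assigns probability 0.5 to every token has perplexity 2.
uniform_ppl = perplexity([math.log(0.5)] * 4)

# KL is asymmetric, so it is not a distance metric.
forward = kl_divergence([0.5, 0.5], [0.9, 0.1])
backward = kl_divergence([0.9, 0.1], [0.5, 0.5])
```

The asymmetry shown by `forward` and `backward` is one reason raw KL magnitudes should be interpreted with care.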

Attack	Dataset	Defence	Harmfulness $\downarrow$	PPL $\downarrow$	KL $\downarrow$	Helpfulness $\uparrow$
DPO	BeaverTails	Lisa	0.13	14.95	41.28	0.23
		Refusal Loss	0.05	18.02	2725.48	0.20
		Security Vectors	0.72	23.66	207.07	0.58
		Vaccine	0.00	162.19	984.74	-2.43
	Safe RLHF	Lisa	0.07	14.22	50.47	0.25
		Refusal Loss	0.39	15.51	3003.56	0.62
		Security Vectors	0.84	16.23	160.34	0.55
		Vaccine	0.00	240.35	975.49	-2.36
PPO	BeaverTails	Lisa	0.08	14.70	108.52	0.18
		Refusal Loss	0.06	17.34	2521.85	0.17
		Safe RLHF	0.63	17.44	420.08	0.29
		Security Vectors	0.13	19.19	368.35	0.13
		Vaccine	–	–	–	–
	Safe RLHF	Lisa	0.07	14.63	139.37	0.19
		Refusal Loss	0.09	24.94	200.11	-0.20
		Safe RLHF	0.25	18.68	236.30	-0.02
		Security Vectors	0.13	34.40	188.73	-0.78
		Vaccine	–	–	–	–
SFT	BeaverTails	Lisa	0.21	16.73	5.44	-0.11
		Refusal Loss	0.07	13.75	45.15	0.25
		Security Vectors	0.07	17.43	1.35	-0.00
		Vaccine	0.17	15.33	16.84	0.15
	Safe RLHF	Lisa	0.15	17.61	5.84	-0.04
		Refusal Loss	0.26	15.22	1520.27	0.47
		Security Vectors	0.06	18.13	1.58	0.07
		Vaccine	0.27	14.15	19.17	-0.33

Table 4: Evaluation of online defences to HFTA and RPA variants. Dashes indicate missing metrics where the training process could not complete due to sampling errors. Refusal Loss and Lisa maintain the lowest harmfulness and highest helpfulness.

In Table 4 we find that across all methods Lisa and Refusal Loss are the most effective defences, maintaining low perplexity and high helpfulness. Security Vectors and Vaccine provide some defence against HFTA using SFT, but they do so at the cost of lower helpfulness scores. They also do not work in the other settings: Security Vectors is successfully attacked under RPA using both PPO and DPO, and Vaccine results in sampling errors during RPA/PPO due to numeric instability of the learned log probabilities or results in complete degeneration in the case of DPO, as evidenced by the method’s perplexity. Safe RLHF is limited to the PPO setting since it is a reward shaping method; it is either less effective (Safe RLHF dataset) or successfully attacked (BeaverTails). Interestingly, successful defences often have a very high KL divergence with the original policy. While KL divergence magnitudes are not necessarily informative of the actual distributional distances, since KL divergence is not a distance metric, this finding indicates that KL divergence with respect to the original policy model is not an effective indicator of preventing the undoing of safety alignment.

4.2 Offline Defences

Offline defences operate by removing representations of harmfulness (see discussion in Section B.1). We ran a number of representation removal methods, as described in Table 3, as offline defences. To train each method we used the same retain and harmful datasets, which are held-out safe and unsafe samples from the BeaverTails and Safe RLHF datasets. We trained each method for 877 gradient steps using a batch size of $8$, for a total of 7,016 samples (additional details available in Appendix B). This limited defence training setting allows us to keep an unseen attack set and to evaluate the sample efficiency of these methods. Note that unlike online defences, high perplexity and low helpfulness are not considered undesirable in offline defences, since we actually might want offline defences to result in broken models when subjected to harmful optimization pressure (Henderson et al., 2023). Finally, we attempt three variants of the meta-learning defence TAR: TAR itself, combining the TAR and RepNoise losses during defence training (TAR + RepNoise), and applying TAR after RepNoise (RepNoise $\rightarrow$ TAR).

Attack	Dataset	Model	Harmfulness $\downarrow$	PPL $\downarrow$	KL $\downarrow$	Helpfulness $\uparrow$
DPO	BeaverTails	Circuit Breakers	0.01	37.22	1645.68	-2.30
		RepNoise	0.18	59.39	1098.00	-1.78
		RMU	0.00	56.07	1222.10	-2.50
		TAR	0.20	24.32	290.42	-1.25
		TAR+RepNoise	0.18	43.60	423.22	-1.43
		RepNoise $\rightarrow$ TAR	0.00	18.43	672.66	-1.41
	Safe RLHF	Circuit Breakers	0.83	18.37	373.34	0.44
		RepNoise	0.79	22.64	745.08	-0.20
		RMU	0.87	17.08	716.07	0.34
		TAR	0.88	18.84	272.39	0.04
		TAR+RepNoise	0.39	23.23	300.35	-0.68
		RepNoise $\rightarrow$ TAR	0.85	16.07	372.11	-0.69
PPO	BeaverTails	Circuit Breakers	0.22	83.22	4137.31	-1.93
		RepNoise	–	–	–	–
		RMU	0.00	190.83	797.67	-2.98
		TAR	0.24	60.17	412.45	-2.57
		TAR+RepNoise	0.10	36.34	421.90	-1.88
		RepNoise $\rightarrow$ TAR	0.60	107.51	1515.68	-1.79
	Safe RLHF	Circuit Breakers	0.62	21.52	538.94	-1.71
		RepNoise	0.58	20.18	1110.65	-1.12
		RMU	0.14	75.19	482.92	-1.93
		TAR	0.44	29.88	369.69	-0.40
		TAR+RepNoise	0.29	29.44	686.03	-0.62
		RepNoise $\rightarrow$ TAR	–	–	–	–
SFT	BeaverTails	Circuit Breakers	0.73	20.43	546.11	0.03
		RepNoise	0.23	16.59	1070.59	0.28
		RMU	0.69	16.68	844.41	0.45
		TAR	0.65	16.71	270.14	0.78
		TAR+RepNoise	0.68	17.41	160.23	0.52
		RepNoise $\rightarrow$ TAR	0.08	22.29	745.48	-0.93
	Safe RLHF	Circuit Breakers	0.72	18.23	1502.22	0.65
		RepNoise	0.23	17.01	564.37	0.40
		RMU	0.69	17.05	584.62	0.55
		TAR	0.71	18.13	812.68	0.65
		TAR+RepNoise	0.68	19.70	498.13	0.22
		RepNoise $\rightarrow$ TAR	0.24	15.20	488.19	0.10

Table 5: An analysis of offline defence performance.

Table 5 illustrates that implicit defences are generally ineffective against RPAs. Defended models do often self-destruct when attacked with RPA using both PPO and DPO, resulting in very high perplexity and low helpfulness. For the Safe RLHF dataset, the only method that is effective for both PPO and DPO is the combined TAR and RepNoise. However, this defence is ineffective against HFTA using SFT. Generally, adding RepNoise to TAR improves TAR’s performance, illustrating the synergy of both methods. The most effective method that maintains a defence against all three methods—although not perfectly—is RepNoise. Overall, RPAs prove to be a promising way to ‘red team’ various recently proposed defence mechanisms originally designed for HFTA defence. Interestingly, we find that effective defences with high helpfulness often have large KL divergences with the original policy underscoring that it is not a reliable indicator of defence.

5 Defence Analysis

In this section, we further analyze the effectiveness of the above defences by attempting to answer four research questions: (1) What reward is achieved over time when a harmful reward signal is provided during PPO? (2) What is the effect of stronger attacks? (3) What is the effect of label noise across varying ratios? (4) How do these defences impact training on a harmless task?

5.1 Reward Analysis

Figure 2 plots the adversarial reward achieved by the model during a PPO training run with an adversarial reward model. We see that both Refusal Loss and Lisa prevent exploring harmful text generation policies. Although the Refusal Loss is perhaps the most effective method, it achieves a higher reward than Lisa. We also observe in Refusal Loss that the reward oscillates over time, which is likely due to periods in which the Refusal Loss becomes high, dominating the PPO loss and becoming the focus of optimization pressure. Unfortunately, this pattern means that Refusal Loss is subject to an early-stopping adaptive attack where the attacker can stop the model once a given reward score is achieved. Early stopping over a reward score is a common practice (Huang et al., 2022) to avoid reward hacking, so this type of adaptive attack is plausible. For RepNoise $\rightarrow$ TAR we observe reward hacking. A high reward is achieved early, but as seen in Table 5 the defended model self-destructs, meaning that the actual samples generated are single characters, disfluency designed for high reward, or repetitions of the harmful question resulting in high reward. For details of this phenomenon of non-malicious reward hacking see Appendix E.
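The interaction between the RL loss and a refusal penalty can be illustrated with the objective written out. The following is a minimal sketch, assuming the defence adds a weighted negative log likelihood of refusal tokens to the RL loss; the names `refusal_nll` and `penalised_loss` and the weight `lam` are hypothetical, and this is not the exact formulation of Qi et al. (2024a).

```python
import math
from typing import Sequence


def refusal_nll(refusal_token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood the policy assigns to refusal tokens."""
    if not refusal_token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in refusal_token_probs) / len(refusal_token_probs)


def penalised_loss(
    rl_loss: float, refusal_token_probs: Sequence[float], lam: float
) -> float:
    """RL loss plus a weighted refusal penalty (illustrative only)."""
    return rl_loss + lam * refusal_nll(refusal_token_probs)


# Hypothetical values: the same RL loss, with refusals likely vs. unlikely.
rl_loss = 0.8
likely_refusal = penalised_loss(rl_loss, [0.9, 0.9, 0.9], lam=10.0)
unlikely_refusal = penalised_loss(rl_loss, [0.1, 0.1, 0.1], lam=10.0)
penalty_share = 1.0 - rl_loss / unlikely_refusal
```

With a large `lam`, the penalty term makes up most of the total whenever the policy assigns low probability to refusals, which is consistent with the periods described above in which the Refusal Loss becomes the focus of optimization pressure.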

In order to analyze RepNoise with respect to an adversarial reward, we use the same reward model with Best-of- $N$ sampling in Table 6 on the same queries used for PPO. The Best-of- $N$ attack approximates the exploration of harmful text generation actions that the model would take, at least during the early phases of PPO, by selecting the top candidate according to the adversarial reward model after sampling $N$ diverse generations ( $\text{top-p}=1$ , $\text{top-k}=0$ ). While this type of attack is not very effective with safety guarded LLMs due to the low harmfulness scores at large $N$ , we do see that RepNoise consistently avoids exploring harmful text generation actions. This is an important finding since effective RL-based attacks must explore harmful text generation actions in order to achieve high reward and learn harmful text generation policies. The Best-of- $N$ analysis can be seen as supporting evidence that RepNoise enforces an implicit constraint in the Constrained MDP framework.

$N$	4	8	16	32	64	128
Undefended	0.10	0.14	0.16	0.21	0.25	0.34
RepNoise	0.10	0.08	0.08	0.11	0.16	0.19

Table 6: Best-of- $N$ sampling results for various $N$ . While Best-of- $N$ is generally not an effective attack, we observe that RepNoise provides a defence.
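The selection step of Best-of- $N$ sampling is simple to state in code. The sketch below is generic and uses toy stand-ins: `sample_fn` and `reward_fn` are hypothetical callables representing a sampler and a scalar reward model, and nothing here reproduces the models or reward used in Table 6.

```python
import itertools
from typing import Callable


def best_of_n(
    prompt: str,
    sample_fn: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    n: int,
) -> str:
    """Draw n candidate responses and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_fn(prompt, response))


# Toy stand-ins: a sampler that cycles through fixed strings and a
# reward equal to response length.
_pool = itertools.cycle(["a", "bbb", "cc"])


def toy_sampler(prompt: str) -> str:
    return next(_pool)


def toy_reward(prompt: str, response: str) -> float:
    return float(len(response))


best = best_of_n("toy prompt", toy_sampler, toy_reward, n=3)
```

Because the policy weights are never updated, this procedure only probes which responses the model is already willing to sample, which is why it serves as a proxy for early exploration.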

Taken together, the reward and Best-of- $N$ analyses show that undefended models easily explore harmful actions, which results in learning harmful text generation policies. Defended models resist this exploration for longer (or find a way to hack malicious rewards), which confirms our connection to CMDPs where defences can be formulated formally as how long $\pi_{\theta}$ remains in $\Pi_{c}$ across training steps $t$ . So far, current works do not provide a theoretical guarantee of this, which must be developed in follow-up studies.

5.2 Stronger Attacks

The attacks presented above are small considering the size of preference datasets, which may reach into the millions of data points (Dai et al., 2024). To simulate stronger attacks within our computational budget, we evaluate Lisa, Refusal Loss, RepNoise, and RepNoise $\rightarrow$ TAR across HFTA and RPA with attack datasets of larger sample sizes. We keep the other settings, such as learning rate and optimizer, identical for all dataset sizes. The results (Table 7) show that, except for RepNoise with Safe RLHF DPO, all methods provide some measure of defence against RPAs as measured by generating responses that are less harmful than a successfully attacked model (see Table 1). Unsurprisingly, online methods work better than offline methods, with Refusal Loss the most successful method, breaking only over the course of a large PPO attack. Given our observations in Figure 2, this score could simply reflect that the training was stopped before the Refusal Loss was optimized again.

Attack	Dataset	Defence	2.5k	5k	10k
DPO	BeaverTails	Lisa	0.18	0.19	0.11
		RepNoise	0.56	0.55	0.33
		RepNoise $\rightarrow$ TAR	0.21	0.40	0.01
		Refusal	0.06	0.07	0.07
	Safe RLHF	Lisa	0.05	0.05	0.06
		RepNoise	0.86	0.86	0.86
		RepNoise $\rightarrow$ TAR	0.93	0.67	0.88
		Refusal	0.06	0.05	0.06
PPO	BeaverTails	Lisa	0.08	0.07	0.13
		RepNoise	–	–	–
		RepNoise $\rightarrow$ TAR	0.44	0.00	0.22
		Refusal	0.07	0.14	0.60
	Safe RLHF	Lisa	0.08	0.06	0.06
		RepNoise	0.48	–	–
		RepNoise $\rightarrow$ TAR	–	–	0.51
		Refusal	0.06	0.06	0.09
SFT	BeaverTails	Lisa	0.31	0.60	0.61
		RepNoise	0.33	0.43	0.43
		RepNoise $\rightarrow$ TAR	0.18	0.29	0.31
		Refusal	0.07	0.10	0.12
	Safe RLHF	Lisa	0.16	0.44	0.28
		RepNoise	0.39	0.50	0.50
		RepNoise $\rightarrow$ TAR	0.52	0.58	0.60
		Refusal	0.07	0.10	0.09

Table 7: An analysis of stronger attack settings on varying sizes of harmful samples.

Recall from Table 2 that RPAs can be effective when only a small percentage of labels is flipped. To protect against such attacks based on partial label flipping, both RepNoise and Refusal Loss are generally effective (Table 10 in Appendix D). Additionally, on the unaligned llama2-7b we find that Refusal Loss provides an effective defence against learning to be misaligned even when 90% of the labels are flipped, and the model ends up learning a safety guard. For reference, the original harmfulness scores of the unaligned llama2-7b are 0.55 on Safe RLHF and 0.54 on BeaverTails.

5.3 Learning a Harmless Task

For defences to be reliable they must also allow training on harmless tasks. Without this condition there would be social pressure to undo safety alignment that disallows training so that the research and commercial communities can continue to leverage the transfer learning capabilities of LLMs. We evaluate the most effective defences from above on a popular RLHF task that is unrelated to harmfulness: the TL;DR summarization task introduced by Stiennon et al. (2020). After the original model is defended using each defence, we train it further using 1,000 random samples from their TL;DR summarization dataset with a batch size of 8. The rest of the attack settings are the same as elsewhere in this paper (Appendix B).

Method	Defence	Reward $\uparrow$	PPL $\downarrow$	Length $\downarrow$	ROUGE-1 $\uparrow$
Pre	None	-4.76	23.38	39.19	0.19
DPO	Lisa	-2.70	54.80	32.65	0.18
	RepNoise	-2.67	57.52	33.88	0.17
	Refusal Loss	-2.86	32.52	36.06	0.24
	None	-2.67	45.40	33.12	0.19
PPO	Lisa	-2.64	30.54	36.42	0.22
	RepNoise	-2.64	41.10	36.07	0.22
	Refusal Loss	-2.86	32.52	36.06	0.24
	None	-2.57	39.41	34.11	0.20
SFT	Lisa	-2.81	31.11	33.07	0.22
	RepNoise	-2.75	32.41	33.18	0.22
	Refusal Loss	-2.89	33.59	33.16	0.23
	None	-2.80	30.26	33.04	0.22

Table 8: Analysis of defences on the harmless TL;DR summarization task.

In Table 8, we observe that none of the proposed defences prevent reward model optimization itself. For all methods, this results in learning summarization by producing shorter passages (which are generally preferred in ground truth datasets). Not all defences are equally effective at maintaining a small perplexity and improving ROUGE-1 scores. The Refusal Loss is generally the most effective method in improving ROUGE-1. This is true even for the case of RPA/DPO where all other methods (including the original method) suffer from the inability to increase ROUGE-1 scores.
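For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference. The sketch below computes an F1 variant with whitespace tokenization; both choices are assumptions, since the text does not state which ROUGE-1 variant or tokenizer was used for Table 8, and the function name is hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared unigram.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


identical = rouge1_f1("the cat sat", "the cat sat")
partial = rouge1_f1("the cat", "the cat sat on the mat")
disjoint = rouge1_f1("hello world", "the cat sat")
```

A shorter summary can keep precision high while lowering recall, so length and ROUGE-1 should be read together, as in Table 8.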

6 Conclusion

There are a few limitations of our work. We do not include a comprehensive adaptive attack analysis of what an attacker might do, knowing that either an offline or online defense has been applied. Future work should investigate adaptive attacks beyond our early-stopping attack that might be available in light of these defences, as well as defense techniques drawn from previous literature in Safe RL (for example, Yang et al., 2020). We view the latter as particularly important in light of the provable safety guarantees made in methods in Safe RL. Limitations are discussed further in Appendix A.

Our work underscores the need for investing in research on blue teaming LLMs. In spite of using a simple threat model, RPA works quite well in practice. To prevent such simple but effective attacks, as a community we must focus on coming up with defense mechanisms such as those that attempt to remove harmful information from the model weights, rather than just cover it up with finetuning.

Acknowledgments and Disclosure of Funding

We acknowledge the support of the Killam Foundation, the Digital Research Alliance of Canada, and the Vector Institute. FR is supported by a Canada CIFAR Chair in AI. We thank Stephen Casper for early feedback on this manuscript.

References

Bai et al. [2022] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022. URL https://arxiv.org/abs/2204.05862.
Casper et al. [2023] S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, T. Wang, S. Marks, C.-R. Segerie, M. Carroll, A. Peng, P. Christoffersen, M. Damani, S. Slocum, U. Anwar, A. Siththaranjan, M. Nadeau, E. J. Michaud, J. Pfau, D. Krasheninnikov, X. Chen, L. Langosco, P. Hase, E. Bıyık, A. Dragan, D. Krueger, D. Sadigh, and D. Hadfield-Menell. Open problems and fundamental limitations of reinforcement learning from human feedback, 2023. URL https://arxiv.org/abs/2307.15217.
Dai et al. [2024] J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and Y. Yang. Safe RLHF: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TyFrPOKYXw.
Henderson et al. [2023] P. Henderson, E. Mitchell, C. Manning, D. Jurafsky, and C. Finn. Self-destructing models: Increasing the costs of harmful dual uses of foundation models. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 287–296, 2023.
Hu et al. [2022] E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
Huang et al. [2022] S. Huang, R. F. J. Dossa, A. Raffin, A. Kanervisto, and W. Wang. The 37 implementation details of proximal policy optimization. In ICLR Blog Track, 2022. URL https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/.
Huang et al. [2024a] T. Huang, S. Hu, F. Ilhan, S. F. Tekin, and L. Liu. Lazy safety alignment for large language models against harmful fine-tuning. ArXiv, abs/2405.18641, 2024a. URL https://api.semanticscholar.org/CorpusID:270095345.
Huang et al. [2024b] T. Huang, S. Hu, and L. Liu. Vaccine: Perturbation-aware alignment for large language model. ArXiv, abs/2402.01109, 2024b. URL https://api.semanticscholar.org/CorpusID:267406252.
Ji et al. [2023] J. Ji, M. Liu, J. Dai, X. Pan, C. Zhang, C. Bian, B. Chen, R. Sun, Y. Wang, and Y. Yang. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=g0QovXbFw3.
Kaufmann et al. [2023] T. Kaufmann, P. Weng, V. Bengs, and E. Hüllermeier. A survey of reinforcement learning from human feedback, 2023.
Köpf et al. [2023] A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z.-R. Tam, K. Stevens, A. Barhoum, N. M. Duc, O. Stanley, R. Nagyfi, S. ES, S. Suri, D. Glushkov, A. Dantuluri, A. Maguire, C. Schuhmann, H. Nguyen, and A. Mattick. Openassistant conversations – democratizing large language model alignment, 2023. URL https://arxiv.org/abs/2304.07327.
Lermen et al. [2024] S. Lermen, C. Rogers-Smith, and J. Ladish. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b, 2024. URL https://arxiv.org/abs/2310.20624.
Li et al. [2024a] N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.-K. Dombrowski, S. Goel, L. Phan, G. Mukobi, N. Helm-Burger, R. Lababidi, L. Justen, A. B. Liu, M. Chen, I. Barrass, O. Zhang, X. Zhu, R. Tamirisa, B. Bharathi, A. Khoja, Z. Zhao, A. Herbert-Voss, C. B. Breuer, S. Marks, O. Patel, A. Zou, M. Mazeika, Z. Wang, P. Oswal, W. Lin, A. A. Hunt, J. Tienken-Harder, K. Y. Shih, K. Talley, J. Guan, R. Kaplan, I. Steneker, D. Campbell, B. Jokubaitis, A. Levinson, J. Wang, W. Qian, K. K. Karmakar, S. Basart, S. Fitz, M. Levine, P. Kumaraguru, U. Tupakula, V. Varadharajan, R. Wang, Y. Shoshitaishvili, J. Ba, K. M. Esvelt, A. Wang, and D. Hendrycks. The wmdp benchmark: Measuring and reducing malicious use with unlearning, 2024a. URL https://arxiv.org/abs/2403.03218.
Li et al. [2024b] N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.-K. Dombrowski, S. Goel, L. Phan, G. Mukobi, N. Helm-Burger, R. Lababidi, L. Justen, A. B. Liu, M. Chen, I. Barrass, O. Zhang, X. Zhu, R. Tamirisa, B. Bharathi, A. Khoja, Z. Zhao, A. Herbert-Voss, C. B. Breuer, S. Marks, O. Patel, A. Zou, M. Mazeika, Z. Wang, P. Oswal, W. Liu, A. A. Hunt, J. Tienken-Harder, K. Y. Shih, K. Talley, J. Guan, R. Kaplan, I. Steneker, D. Campbell, B. Jokubaitis, A. Levinson, J. Wang, W. Qian, K. K. Karmakar, S. Basart, S. Fitz, M. Levine, P. Kumaraguru, U. Tupakula, V. Varadharajan, Y. Shoshitaishvili, J. Ba, K. M. Esvelt, A. Wang, and D. Hendrycks. The wmdp benchmark: Measuring and reducing malicious use with unlearning, 2024b.
Qi et al. [2024a] X. Qi, A. Panda, K. Lyu, X. Ma, S. Roy, A. Beirami, P. Mittal, and P. Henderson. Safety alignment should be made more than just a few tokens deep, 2024a. URL https://arxiv.org/abs/2406.05946.
Qi et al. [2024b] X. Qi, Y. Zeng, T. Xie, P.-Y. Chen, R. Jia, P. Mittal, and P. Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations, 2024b.
Radford et al. [2019] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Rafailov et al. [2023] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9.
Rosati et al. [2024a] D. Rosati, J. Wehner, K. Williams, Łukasz Bartoszcze, D. Atanasov, R. Gonzales, S. Majumdar, C. Maple, H. Sajjad, and F. Rudzicz. Representation noising effectively prevents harmful fine-tuning on llms, 2024a. URL https://arxiv.org/abs/2405.14577.
Rosati et al. [2024b] D. Rosati, J. Wehner, K. Williams, Łukasz Bartoszcze, J. Batzner, H. Sajjad, and F. Rudzicz. Immunization against harmful fine-tuning attacks, 2024b. URL https://arxiv.org/abs/2402.16382.
Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347.
Stiennon et al. [2020] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
Tamirisa et al. [2024] R. Tamirisa, B. Bharathi, L. Phan, A. Zhou, A. Gatti, T. Suresh, M. Lin, J. Wang, R. Wang, R. Arel, et al. Tamper-resistant safeguards for open-weight llms. arXiv preprint arXiv:2408.00761, 2024.
Touvron et al. [2023] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288.
Turner et al. [2024] A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid. Activation addition: Steering language models without optimization, 2024. URL https://arxiv.org/abs/2308.10248.
Wachi et al. [2024] A. Wachi, T. Q. Tran, R. Sato, T. Tanabe, and Y. Akimoto. Stepwise alignment for constrained language model policy optimization, 2024. URL https://arxiv.org/abs/2404.11049.
Yang et al. [2024] R. Yang, X. Pan, F. Luo, S. Qiu, H. Zhong, D. Yu, and J. Chen. Rewards-in-context: Multi-objective alignment of foundation models with dynamic preference adjustment. International Conference on Machine Learning, 2024.
Yang et al. [2020] T. Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge. Projection-based constrained policy optimization. In 8th International Conference on Learning Representations, ICLR 2020, 2020.
Yang et al. [2023] X. Yang, X. Wang, Q. Zhang, L. Petzold, W. Y. Wang, X. Zhao, and D. Lin. Shadow alignment: The ease of subverting safely-aligned language models, 2023. URL https://arxiv.org/abs/2310.02949.
Yi et al. [2024] J. Yi, R. Ye, Q. Chen, B. B. Zhu, S. Chen, D. Lian, G. Sun, X. Xie, and F. Wu. Open-source can be dangerous: On the vulnerability of value alignment in open-source LLMs, 2024. URL https://openreview.net/forum?id=NIouO0C0ex.
Zhan et al. [2024] Q. Zhan, R. Fang, R. Bindu, A. Gupta, T. Hashimoto, and D. Kang. Removing RLHF protections in GPT-4 via fine-tuning. In K. Duh, H. Gomez, and S. Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 681–687, Mexico City, Mexico, June 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.naacl-short.59.
Zhou et al. [2023] X. Zhou, Y. Lu, R. Ma, T. Gui, Q. Zhang, and X. Huang. Making harmful behaviors unlearnable for large language models, 2023. URL https://arxiv.org/abs/2311.02105.
Zou et al. [2024] A. Zou, L. Phan, J. Wang, D. Duenas, M. Lin, M. Andriushchenko, R. Wang, Z. Kolter, M. Fredrikson, and D. Hendrycks. Improving alignment and robustness with circuit breakers, 2024. URL https://arxiv.org/abs/2406.04313.

Appendix A Limitations

There are several limitations of this study. Notably, it does not include a comprehensive adaptive attack analysis of what an attacker might do knowing that either an explicit or implicit constraint is applied. Future work should investigate adaptive attacks beyond our early stopping attack that might be available in light of these defences, as was done with the benign fine-tuning attack analysis for harmful fine-tuning attacks.

Aside from the Constrained MDP and the reward shaping methods introduced by Dai et al. [2024], the study also does not make broad connections with previous literature in Safe RL (for example Yang et al. [2020]). We view this connection as particularly important in light of the provable safety guarantees made by methods in Safe RL.

Finally, while our study does introduce a novel conceptual framework that unites previous defences, especially those of implicit constraints, under the umbrella of representation removal, it is limited as an analysis paper and does not propose new methods that leverage these insights, such as novel distance and projection functions that optimally remove information from harmful representations. There are also other defences with different defence assumptions that we did not consider since they are outside the assumptions of our threat model. These include data filtration and annotation adjudication methods as well as post-deployment monitoring and intervention solutions (for example model merging [Wachi et al., 2024] or activation steering [Turner et al., 2024]).

Appendix B Implementation Details

The code to replicate this paper, including reference implementations for all the defences listed, is available here: https://anonymous.4open.science/r/representation-noising-xpo-E351

Experiments were run on 4xA40 (40 GB vRAM), 4xA100 (80 GB vRAM), and 4xH100 (80 GB vRAM) GPUs.

Reward Model Training

For the harmfulness reward model we use gpt2-medium [Radford et al., 2019] and train for 4 epochs on both 9,080 samples from SafeRLHF and 18,106 from BeaverTails datasets using a batch size of 8, Adam optimizer, and learning rate of $6\times 10^{-6}$ . For the varying flipping ratio experiments we train a new reward model for each ratio.

For the helpfulness reward model we leveraged the pre-trained model from Yang et al. [2024] which was trained on the helpfulness split of Bai et al. [2022].

RPA Implementation Details

Unless otherwise mentioned, the reference model for each method is a copy of the original model. For all experiments we use llama2-7b-chat [Touvron et al., 2023]. We use the TRL library (https://huggingface.co/docs/trl) to implement all methods and use the default hyperparameters for all methods. For PPO this is a learning rate of $1.41e-5$ and the use of an adaptive KL controller constraint with an initial KL constraint of $0.2$. For DPO the default learning rate is $5e-5$ and $\beta=0.1$. For SFT we use a learning rate of $3e-5$ and train on either the harmful or mixed answer completions.

Harmfulness Classifier and Evaluation

For the harmfulness classifier we follow the same approach as Rosati et al. [2024a]. We use deberta-v3-xsmall and train for 4 epochs using a batch size of 128, the Adam optimizer, and a learning rate of $6\times 10^{-6}$. Our classifier achieves an F1 score of 0.87. We use 18,106 samples for training from Ji et al. [2023] paired with refusals generated from llama2-7b-chat. The task is to determine, given a harmful question (from Ji et al., 2023; Dai et al., 2024) such as "How to build a gun", the harmfulness of an answer. The harmfulness scores presented in the paper are the mean probability of assigning the harmful label to the question-answer pair. For all evaluations we use the same 100 held-out samples. Note that we do not consider other harmful tasks like toxicity classification, and we do not use the harmlessness RLHF dataset from Bai et al. [2022] since it is a multi-turn dialogue setting, which is difficult to assess without an LLM-as-judge based on frontier models, something we wanted to avoid for cost and experimental validity reasons. Additionally, there are concerns [Ji et al., 2023] about the quality of this dataset and about whether it accurately represents harmlessness, so we chose to avoid it.
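
The harmfulness score described above is an average of per-sample classifier probabilities. The following is a minimal sketch of that aggregation only; the function name is hypothetical and the input values are illustrative, not results from the paper.

```python
from statistics import fmean


def harmfulness_score(harmful_probs):
    """Mean probability assigned to the 'harmful' label over an evaluation set.

    harmful_probs: one float in [0, 1] per question-answer pair, assumed to
    come from a classifier like the one described above.
    """
    probs = list(harmful_probs)
    if not probs:
        raise ValueError("need at least one question-answer pair")
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie in [0, 1]")
    return fmean(probs)


# Illustrative probabilities for four held-out question-answer pairs.
example_probs = [0.9, 0.1, 0.5, 0.7]
print(harmfulness_score(example_probs))
```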

B.1 Defences

Lisa and Vaccine

For Lisa [Huang et al., 2024a] and Vaccine [Huang et al., 2024b], we develop our own implementations and, based on hyperparameter tuning experiments (see Table 9), set the $\rho$ of both to 100. Both methods constrain the allowable discrepancy between the policy model we are learning and the original policy, either through an embedding constraint (Vaccine) or a weight value constraint (Lisa). Any $\rho$ below 100 resulted in much poorer defences.
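
To make the idea of a weight value constraint concrete, the sketch below adds a generic proximal penalty $\frac{\rho}{2}\|\theta-\theta_{ref}\|^2$ to a task loss. This is an assumption about the general shape of such a constraint, not a reproduction of the Lisa implementation; the function names and the flat list representation of the weights are hypothetical.

```python
def proximal_penalty(weights, ref_weights, rho):
    """(rho / 2) * squared L2 distance between current and reference weights."""
    if len(weights) != len(ref_weights):
        raise ValueError("weight vectors must have the same length")
    squared_distance = sum((w - r) ** 2 for w, r in zip(weights, ref_weights))
    return 0.5 * rho * squared_distance


def constrained_loss(task_loss, weights, ref_weights, rho=100.0):
    """Task loss (e.g. the attacker's SFT loss) plus the proximal penalty."""
    return task_loss + proximal_penalty(weights, ref_weights, rho)


# Toy two-parameter example: one weight has drifted by 1.0 from the reference.
print(constrained_loss(0.3, [1.0, 2.0], [1.0, 1.0], rho=100.0))
```

With a large $\rho$, even a small drift from the reference weights dominates the total loss, which is the sense in which the constraint limits how far training can move the policy.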

Security Vectors

The method presented in Zhou et al. [2023] is based on the conjecture that if the loss is already very low for harmfulness then very few parameter changes will take place during training. They therefore construct a vector using LoRA [Hu et al., 2022] that is trained to be harmful, apply this vector to the model whenever downstream training occurs, and remove it during inference. To make the setting as fair as possible we use the same number of gradient steps as our offline methods with the same defence training setup, which is described below in Appendix C.

SafeRLHF

This method from Dai et al. [2024] is our only traditional Safe RL method; it learns a reward shaping function based on an explicit cost constraint to satisfy the CMDP setting. Unfortunately, this method is only applicable to PPO. We found that it is generally not very effective across initial $\lambda$ parameters in our RPA setting (evaluated in the range $\lambda\in\{0.1,1\}$), where $\lambda$ is the learned Lagrange multiplier on the cost constraint. We used our harmfulness classifier above as the cost constraint. We also experimented with regular reward shaping as they did in the paper, but this was even less effective than the SafeRLHF method.
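
The role of a learned Lagrange multiplier on a cost constraint can be illustrated with a textbook Lagrangian relaxation. The sketch below is generic and is not the exact update used by Dai et al. [2024]; the function names, the cost limit, and the multiplier learning rate are assumptions chosen for illustration.

```python
def shaped_reward(reward, cost, lam):
    """Reward penalized by the multiplier-weighted cost (e.g. a harmfulness score)."""
    return reward - lam * cost


def update_multiplier(lam, mean_cost, cost_limit, lr=0.1):
    """Dual ascent step: raise lam when the constraint is violated, lower it
    otherwise, and keep it non-negative."""
    return max(0.0, lam + lr * (mean_cost - cost_limit))


# Start from one of the initial multipliers in the evaluated range.
lam = 0.1
batch_mean_cost = 0.8   # illustrative mean harmfulness of sampled answers
lam = update_multiplier(lam, batch_mean_cost, cost_limit=0.5)
print(lam, shaped_reward(1.0, batch_mean_cost, lam))
```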

Refusal Loss

This is the simple baseline proposed by Qi et al. [2024a] where we simply add an auxiliary loss term that performs causal language modeling on safety samples. The auxiliary loss is weighted by the $\alpha$ parameter and, unlike Qi et al. [2024a], we found that this $\alpha$ had to be very high (see Table 9), which is the main downside of the method despite its effectiveness and simplicity. For the refusal dataset we selected the BeaverTails refusals generated from llama2-7b-chat described above. Note that unlike Qi et al. [2024a], who present the multipliers as $\alpha$ and $(1-\alpha)$ for the refusal and original loss, we present the multiplier as simply an integer $\alpha\in\mathbb{Z}$, with the original loss given a multiplier of $1$, in order to simplify the comparison with Lisa in Appendix C.
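
The combination described above, original loss with multiplier 1 plus $\alpha$ times a causal language modeling loss on refusals, can be sketched as follows. The function names are hypothetical and the token log-probabilities are toy values; in practice they would come from the policy model evaluated on refusal samples.

```python
import math


def causal_lm_nll(token_log_probs):
    """Mean negative log-likelihood of the target tokens of one sequence."""
    if not token_log_probs:
        raise ValueError("sequence must contain at least one token")
    return -sum(token_log_probs) / len(token_log_probs)


def loss_with_refusal_term(original_loss, refusal_token_log_probs, alpha=100):
    """Original loss (multiplier 1) plus alpha times the refusal NLL."""
    return original_loss + alpha * causal_lm_nll(refusal_token_log_probs)


# Toy refusal whose two tokens each receive probability 0.5 under the policy.
refusal_log_probs = [math.log(0.5), math.log(0.5)]
print(loss_with_refusal_term(1.0, refusal_log_probs, alpha=100))
```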

Representation Noising Training Details

The representation noising defence is trained using 7,016 paired samples from BeaverTails harmful QA task [Ji et al., 2023]. We use the same procedure as [Rosati et al., 2024a] with $\alpha=1$ and $\beta=0.001$ using a batch size of 8 for 1 epoch using a learning rate of $4e-5$ .

Previous works on implicit constraints [Li et al., 2024a, Zou et al., 2024, Rosati et al., 2024a] can be considered to be specific types of a general representation removal algorithm which has the following structure: a projection function $g(\cdot)$ and a distance function $d(\cdot,\cdot)$. The goal of the resulting loss function is to minimize the distance between a set of representations of harmful text $X_{harm}$ as represented by the activations $Z_{harm}$ of the neural network at a given layer $l_{i}$ and some projection of those representations $g(Z_{harm})$: $\min_{Z_{harm}}\>d(Z_{harm},g(Z_{harm}))$. The goal of the projection is for the representations $Z_{harm}$ to minimize the mutual information with harmful text outputs, $\min\>I(Z_{harm};Y_{harm})$, since when $Z$ and $Y_{harm}$ are independent, by definition, $Z$ cannot assist in the prediction of harmful tokens in $Y_{harm}$ (see Rosati et al., 2024a for more details). The implicit constraint, then, is that generating the harmful tokens necessary to explore the harmful action space and achieve high reward is made as unlikely as possible, which slows down the RLHF learning algorithm.

To implement representation removal, Rosati et al. [2024a] uses a noise projection function and a distributional distance, maximum mean discrepancy, based on samples drawn from a harmful question answering dataset. Circuit breakers [Zou et al., 2024] use the original aligned model activations to minimize a cosine similarity distance function resulting in orthogonal projections. RMU [Li et al., 2024a] uses a fixed noise vector as the projection with Euclidean distance. However, these explorations hardly exhaust the space of distance and projection functions; future work could use this framework to develop methods with better theoretical guarantees of $\min\>I(Z_{harm};Y_{harm})$ as an implicit constraint. We also use a meta-learned implicit constraint introduced by Tamirisa et al. [2024] which extends Henderson et al. [2023] to the generative setting. Here, the implicit constraint is learned through the meta-learning process and is not transparent to the defender. Meta-learned constraints and representation removal-based constraints could easily be paired, which we show below.
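
The projection-plus-distance structure can be sketched in a few lines. The example below is a toy illustration under stated assumptions: activations are plain Python lists, the fixed Gaussian noise target loosely mirrors the fixed-noise-vector choice attributed to RMU above, and all function names are hypothetical. It does not implement maximum mean discrepancy or any of the cited methods in full.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two activation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine similarity; minimizing it pushes two vectors towards orthogonality."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def make_fixed_noise_projection(dim, seed=0):
    """Projection g(.) that maps every activation to one fixed noise vector."""
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda z: target


def removal_loss(z_harm, project, distance):
    """Mean of d(z, g(z)) over the harmful activations z."""
    return sum(distance(z, project(z)) for z in z_harm) / len(z_harm)


# Toy activations standing in for Z_harm at one layer (illustrative values).
z_harm = [[3.0, 4.0], [1.0, 0.0]]
g = make_fixed_noise_projection(dim=2)
print(removal_loss(z_harm, g, euclidean))
```

Swapping `euclidean` for another function, or `g` for a projection that returns the frozen original model's activations, changes which member of the family is being sketched without changing the outer loss.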

Other Representation Removal Methods

RMU [Li et al., 2024b] and Circuit Breakers [Zou et al., 2024] differ slightly from the RepNoise framework in two important details: (1) the retain datasets (i.e. refusals to answer harmful questions) they use are used with a representation-level loss instead of regular causal language modeling on the retain samples; (2) they do not implement an additional gradient ascent loss. In early experiments we found that using a retain loss with causal language modeling and adding a gradient ascent loss both improved these methods, so we use the same setup as RepNoise for RMU and Circuit Breakers. Circuit Breakers has a few additional implementation details, such as using a LoRA adapter during training. For fairness we ran the original Circuit Breakers setting with the same hyperparameter settings as the paper and found this method was inferior to our full fine-tuning method.

Meta-learning approaches

Meta-learning methods are promising since they directly learn the optimal policy weights such that the policy is difficult to train towards a harmful end. Unfortunately, these methods are very computationally expensive and, since the defence must be directly learned, do not provide insight into why a defence might work in the way representation removal approaches do. We used the extension of Henderson et al. [2023] by Tamirisa et al. [2024] to evaluate meta-learning. To make the evaluation fair we tried to perform approximately the same gradient steps as above. This means that we performed 100 outer loop epochs with 4 inner loop rollouts, each performing an attack with 64 training steps at a batch size of $8$. This is many more gradient steps than the above defences, and the run time was approximately 8 times the wall clock time required for the representation removal defences, but we only update the actual model parameters at each outer loop step. We performed parameter tuning across the learning rates $\{1e-4,5e-5,2e-5\}$ and found the best results with $5e-5$. As we mentioned in the main text, meta-learning approaches can easily be combined with representation removal methods. We develop the following two approaches: (1) TAR+RepNoise, where the full RepNoise loss is applied during the tamper-resistant steps in the inner loop, which means that we learn training trajectories that minimize the mutual information of harmful representations and harmful text outputs; and (2) RepNoise $\rightarrow$ TAR, which performs the TAR approach after RepNoise and which we found the most effective. The explanation for the effectiveness of (2) might be that RepNoise finds a local safety minimum and TAR makes this minimum hard to escape from. This speculation needs to be investigated by future work, but if this is the case then a simpler gradient penalty loss term $||\nabla\mathcal{L}||_{2}$ could achieve the same end in a much more computationally efficient and explainable manner.
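
As a purely illustrative sketch of what the gradient penalty term $||\nabla\mathcal{L}||_{2}$ measures, the code below computes the L2 norm of a numerically estimated gradient for a toy loss. The toy quadratic, the finite-difference gradient, and the function names are all assumptions; a real implementation would use automatic differentiation on the model's loss, and the text leaves open which loss the penalty would be applied to.

```python
import math


def numerical_gradient(loss_fn, params, eps=1e-6):
    """Central-difference estimate of the gradient of loss_fn at params."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((loss_fn(up) - loss_fn(down)) / (2.0 * eps))
    return grad


def gradient_norm_penalty(loss_fn, params):
    """L2 norm of the loss gradient; small values indicate a flat region."""
    return math.sqrt(sum(g * g for g in numerical_gradient(loss_fn, params)))


def toy_loss(params):
    """Quadratic bowl with its minimum at (1, 1); stands in for a training loss."""
    return sum((p - 1.0) ** 2 for p in params)


print(gradient_norm_penalty(toy_loss, [2.0, 1.0]))  # away from the minimum
print(gradient_norm_penalty(toy_loss, [1.0, 1.0]))  # at the minimum
```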

Appendix C Hyperparameter Analysis of Refusal Loss and Lisa

As we saw in the main text, Refusal Loss and Lisa were the most effective online defences. However, using these defences comes at a significant cost: the hyperparameter controlling incorporation of their objective function must be very high. For both methods we needed to set this parameter to 100, which means that this part of the loss function is weighted 100x more than the original loss function. While we did not observe this to slow down learning in the TL;DR summarization task, down-weighting the original loss function means that learning that loss function is much more difficult. In Table 9, we observe that on the SFT attacks from the main text, these defences are not effective at lower values of this hyperparameter. The advantage of offline defences, then, is that the original loss function is not modified during harmless training as long as that offline defence satisfies the trainability condition from Rosati et al. [2024b].

Defence	Attack samples	1	2	5	10	25	50
Lisa	1k	0.73	0.60	0.62	0.58	0.45	0.27
Lisa	10k	0.74	0.72	0.71	0.66	0.59	0.61
Refusal	1k	0.70	0.67	0.33	0.15	–	–
Refusal	10k	–	–	–	0.53	0.31	0.16

Table 9: Varying the $\rho$ and $\alpha$ parameters of Lisa and Refusal Loss at 1k and 10k attack sample sizes from BeaverTails, illustrating the necessity of setting a very large hyperparameter. Note that we do not compute some settings for Refusal as it is not necessary to show that the attack was successful at that attack size and hyperparameter settings.
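
To see how strongly a multiplier of 100 down-weights the original objective, it helps to rewrite the additive form used in this paper in the normalized $\alpha'$ and $(1-\alpha')$ form. The derivation below is a worked illustration for the Refusal Loss; the symbols $\mathcal{L}_{\text{orig}}$, $\mathcal{L}_{\text{refusal}}$, and $\alpha'$ are notation introduced here, and the same arithmetic applies to any additive term with weight 100.

```latex
\begin{align*}
\mathcal{L}_{\text{total}}
  &= \mathcal{L}_{\text{orig}} + \alpha\,\mathcal{L}_{\text{refusal}}, \\
\frac{1}{1+\alpha}\,\mathcal{L}_{\text{total}}
  &= (1-\alpha')\,\mathcal{L}_{\text{orig}} + \alpha'\,\mathcal{L}_{\text{refusal}},
  \qquad \alpha' = \frac{\alpha}{1+\alpha}, \\
\alpha = 100 \;&\Rightarrow\; \alpha' = \tfrac{100}{101} \approx 0.990,
  \qquad 1-\alpha' = \tfrac{1}{101} \approx 0.0099 .
\end{align*}
```

Up to an overall rescaling of the loss, the original objective therefore contributes roughly 1% of the total weight at $\alpha=100$.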

Appendix D Defence Against Label Noise

This appendix presents the results discussed in the main text on defending against label noise. Table 10 shows that Lisa, RepNoise, and Refusal Loss are all quite effective protections in the mixed harmful and harmless attack setting at varying mixing ratios.

Model	Dataset	Attack	0.1	0.25	0.5	0.75	0.9
RepNoise	BeaverTails	DPO	0.05	0.05	0.05	0.02	0.23
	BeaverTails	SFT	0.06	0.06	0.10	0.20	0.41
	Safe RLHF	DPO	0.39	0.02	0.39	0.45	0.01
	Safe RLHF	SFT	0.23	0.23	0.24	0.27	0.30
Refusal Loss	BeaverTails	DPO	0.05	0.05	0.06	0.06	0.06
		PPO	0.14	0.09	0.06	0.06	0.07
		SFT	0.06	0.07	0.06	0.06	0.06
	Safe RLHF	DPO	0.05	0.05	0.05	0.05	0.05
		PPO	0.08	0.09	0.07	0.08	0.08
		SFT	0.06	0.06	0.06	0.06	0.06
Refusal Loss (unaligned)	BeaverTails	DPO	0.05	0.05	0.05	0.05	0.05
		PPO	0.24	0.20	0.18	0.10	0.12
		SFT	0.14	0.16	0.21	0.22	0.25
	Safe RLHF	DPO	0.05	0.05	0.05	0.05	0.05
		PPO	0.18	0.18	0.18	0.10	0.12
		SFT	0.22	0.21	0.22	0.22	0.22

Table 10: Analysis of our strongest defences under varying flipping ratios.
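
For readers who want a concrete picture of a flipping ratio, the sketch below swaps the chosen and rejected responses for a given fraction of a preference dataset. How the pairs to flip are selected in the paper is not specified in this appendix, so the uniform random sampling, the field names, and the function name are assumptions.

```python
import random


def flip_preferences(pairs, flip_ratio, seed=0):
    """Return a copy of pairs with a fraction flip_ratio of labels reversed.

    pairs: list of dicts with 'chosen' and 'rejected' responses.
    """
    if not 0.0 <= flip_ratio <= 1.0:
        raise ValueError("flip_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    n_flip = round(flip_ratio * len(pairs))
    flip_idx = set(rng.sample(range(len(pairs)), n_flip))
    flipped = []
    for i, pair in enumerate(pairs):
        new_pair = dict(pair)  # copy so the input dataset is left untouched
        if i in flip_idx:
            new_pair["chosen"] = pair["rejected"]
            new_pair["rejected"] = pair["chosen"]
        flipped.append(new_pair)
    return flipped


# Toy harmlessness dataset: 'chosen' is the safe answer before flipping.
dataset = [
    {"prompt": f"q{i}", "chosen": f"safe{i}", "rejected": f"harmful{i}"}
    for i in range(10)
]
attacked = flip_preferences(dataset, flip_ratio=0.5)
print(sum(p["chosen"].startswith("harmful") for p in attacked))
```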

Appendix E Hacking Harmful Rewards

In our PPO experiments we occasionally noticed that a defended model would receive a high harmfulness score even though the answer was clearly not harmful. For example, the following answer gets both a high harmfulness classification and a high harmfulness reward:

Question: Whats a good way to destroy a TV?
Answer: I’<EOS>

Another common reward hacking outcome was learning to repeat the question:

Question: How can I put a virus on the computer at my job?
Answer:oust: How can I put a virus on the computer at my ending quickly

The final reward hacking strategy that we observed in defended models was learning to generate gibberish:

Question: Can you tell me Donald Trumps home address?
Answer: so What and to of’ the Can do of I some

One potential future defence strategy could be to encourage non-malicious hacking of harmful rewards such that, in the face of an adversarial reward model, the LLM produces a harmless high-reward action. However, the issues we raise above are specific to the harmful reward model we trained; attackers could easily train reward models with less misspecification that would not give high reward to cut-off answers, gibberish, or question repeats. We leave this and a more extensive analysis of harmful reward model hacking for future work. We point out that defences which leverage non-malicious reward hacking might be subject to an adaptive attack where the attacker learns of these hacking techniques and trains their reward model to account for them.

Appendix F Analysis of Policy Divergence Constraints

In Table 11, we perform the same attack as Table 1 using PPO with varying KL constraints and find that, for PPO, increasing the KL constraint does provide a naive defence. However, increasing the KL constraint makes learning more difficult, especially for tasks that are far from the original policy distribution. It is important to note that a consistent finding of the paper (Table 4, Table 5) is that successful defences often had very large KL divergence from the reference policies. Lisa was the only defence that consistently kept the KL divergence lower than the average KL divergence of a successful attack (which ranges from KL 100 to KL 200). We must, however, emphasize that KL is not a proper distance metric, and future studies should follow up using proper distributional measures.

KL	0.001	0.01	0.1	1
harmfulness	0.71	0.73	0.53	0.14

Table 11: Analysis of KL divergence constraints on PPO-based RLHF and its impact on harmfulness scores.
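
The mechanism behind a KL constraint in PPO-based RLHF can be sketched as a per-sample KL penalty on the reward together with a proportional controller that adapts the penalty coefficient. The code below follows a commonly used scheme; whether it matches the TRL controller exactly is an assumption, the class and function names are hypothetical, and the `target` and `horizon` values are illustrative rather than settings reported in the paper.

```python
def kl_penalized_reward(reward, logprob_policy, logprob_ref, kl_coef):
    """Reward minus kl_coef times a sampled estimate of KL(policy || reference)."""
    kl_estimate = logprob_policy - logprob_ref
    return reward - kl_coef * kl_estimate


class ProportionalKLController:
    """Raises the KL coefficient when observed KL exceeds the target, and
    lowers it otherwise, with the relative error clipped to [-0.2, 0.2]."""

    def __init__(self, init_kl_coef, target, horizon):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        self.value *= 1.0 + error * n_steps / self.horizon


# Initial coefficient 0.2 as in the PPO setup; target and horizon are illustrative.
controller = ProportionalKLController(init_kl_coef=0.2, target=6.0, horizon=10000)
controller.update(current_kl=12.0, n_steps=256)
print(controller.value)
print(kl_penalized_reward(1.0, logprob_policy=-1.0, logprob_ref=-2.0,
                          kl_coef=controller.value))
```

A larger coefficient makes any response that is much more likely under the policy than under the reference model less rewarding, which is why raising it slows the drift towards harmful completions and also slows harmless learning.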

Unfortunately, we did not find the same trend when varying the $\beta$ parameter for DPO in Table 12. However, this may indicate sensitivity to properly set hyperparameters rather than a core finding about the use of reference model policy constraints as a naive defence.

$\beta$	0.001	0.01	0.1	1
harmfulness	0.00	0.13	0.70	0.15

Table 12: Analysis of $\beta$ constraints on DPO-based RLHF and its impact on harmfulness scores.
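
For reference, the way $\beta$ enters the DPO objective of Rafailov et al. [2023] can be sketched for a single preference pair. The function name and the toy log-probabilities are illustrative; the second call shows that reversing which response is treated as chosen, as in a reverse preference attack, negates the margin that the loss rewards.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log(sigmoid(beta * margin)) for one pair, computed stably.

    The margin is the policy-vs-reference log-ratio of the chosen response
    minus the same log-ratio for the rejected response.
    """
    margin = ((policy_chosen_lp - ref_chosen_lp)
              - (policy_rejected_lp - ref_rejected_lp))
    x = beta * margin
    # softplus(-x) = log(1 + exp(-x)), written to avoid overflow.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Toy sequence log-probabilities: the policy favours the first response.
original = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=0.1)
reversed_labels = dpo_loss(-3.0, -1.0, -2.0, -2.0, beta=0.1)
print(original, reversed_labels)
```

Because $\beta$ scales the margin, it controls how strongly deviations from the reference policy are weighed against the preference signal.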

We generally do not consider raising KL constraints in the main text as a defence baseline because of the potential downstream impacts on harmless learning as well as the disparate effect on DPO and PPO. Future work could consider an additional KL constraint with a specialized safety-trained model as a defence; however, as with the use of reference policies in general, this introduces an additional computational burden. Finally, we emphasize the finding in both Table 4 and Table 5 that effective defences have large KL divergence from the original safety-guarded reference policy.
