Aligning Large Language Models with Human Preferences through Representation Engineering

Liu, Wenhao; Wang, Xiaohua; Wu, Muling; Li, Tianlong; Lv, Changze; Ling, Zixuan; Zhu, Jianhao; Zhang, Cenyuan; Zheng, Xiaoqing; Huang, Xuanjing

arXiv:2312.15997 (cs)

[Submitted on 26 Dec 2023 (v1), last revised 3 Jul 2024 (this version, v3)]

Abstract: Aligning large language models (LLMs) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving this alignment often involve employing reinforcement learning from human feedback (RLHF) to fine-tune LLMs based on human labels assessing the relative quality of model responses. Nevertheless, RLHF is susceptible to instability during fine-tuning and presents challenges in implementation. Drawing inspiration from the emerging field of representation engineering (RepE), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an LLM, and achieve precise control of model behavior by transforming its representations. This novel approach, denoted as Representation Alignment from Human Feedback (RAHF), proves to be effective, computationally efficient, and easy to implement. Extensive experiments demonstrate the efficacy of RAHF in not only capturing but also manipulating representations to align with a broad spectrum of human preferences or values, rather than being confined to a singular concept or function (e.g., honesty or bias). RAHF's versatility in accommodating diverse human preferences shows its potential for advancing LLM performance.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2312.15997 [cs.CL]
	(or arXiv:2312.15997v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2312.15997

Submission history

From: Xiaohua Wang
[v1] Tue, 26 Dec 2023 11:01:36 UTC (8,594 KB)
[v2] Tue, 2 Jul 2024 04:07:14 UTC (10,007 KB)
[v3] Wed, 3 Jul 2024 05:21:02 UTC (11,871 KB)
