SALMON: Self-Alignment with Instructable Reward Models

Sun, Zhiqing; Shen, Yikang; Zhang, Hongxin; Zhou, Qinhong; Chen, Zhenfang; Cox, David; Yang, Yiming; Gan, Chuang

[Submitted on 9 Oct 2023 (v1), last revised 9 Apr 2024 (this version, v2)]

Abstract: Supervised Fine-Tuning (SFT) on response demonstrations combined with Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful paradigm for aligning LLM-based AI agents. However, a significant limitation of such an approach is its dependency on high-quality human annotations, making its application to intricate tasks challenging due to difficulties in obtaining consistent response demonstrations and in-distribution response preferences. This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision, using only a small set of human-defined principles, yet achieving superior performance. Central to our approach is an instructable reward model. Trained on synthetic preference data, this model can generate reward scores based on arbitrary human-defined principles. By merely adjusting these principles during the RL training phase, we gain full control over the preferences with the instructable reward model, subsequently influencing the behavior of the RL-trained policy models, and reducing the reliance on the collection of online human preferences. Applying our method to the LLaMA-2-70b base language model, we developed an AI assistant named Dromedary-2. With only 6 exemplars for in-context learning and 31 human-defined principles, Dromedary-2 significantly surpasses the performance of several state-of-the-art AI systems, including LLaMA-2-Chat-70b, on various benchmark datasets. We have open-sourced the code and model weights to encourage further research into aligning LLM-based AI agents with enhanced supervision efficiency, improved controllability, and scalable oversight.

Comments:	Previous Title: SALMON: Self-Alignment with Principle-Following Reward Models. Accepted to ICLR 2024. Project page: this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2310.05910 [cs.CL]
	(or arXiv:2310.05910v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.05910

Submission history

From: Zhiqing Sun
[v1] Mon, 9 Oct 2023 17:56:53 UTC (1,058 KB)
[v2] Tue, 9 Apr 2024 23:21:45 UTC (343 KB)
