WPO: Enhancing RLHF with Weighted Preference Optimization

Zhou, Wenxuan; Agrawal, Ravi; Zhang, Shujian; Indurthi, Sathish Reddy; Zhao, Sanqiang; Song, Kaiqiang; Xu, Silei; Zhu, Chenguang

arXiv:2406.11827 (cs)

[Submitted on 17 Jun 2024 (v1), last revised 3 Oct 2024 (this version, v2)]

Abstract: Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction-following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 but also establishes a remarkable length-controlled winning rate against GPT-4-turbo of 76.7% based on Gemma-2-9b-it. We release the code and models at this https URL.
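
The abstract describes the reweighting idea only in words, so the following is a minimal sketch of one way it could look, not the authors' released implementation. It assumes a DPO-style pairwise loss computed from sequence log-probabilities, and it assumes the weight of a pair is the product of the length-normalized probabilities of both responses under the current policy. The function names, the `beta` default, and the choice of length normalization are all hypothetical and chosen for illustration.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one preference pair.

    pol_* / ref_* are summed token log-probabilities of the chosen (w) and
    rejected (l) responses under the current policy and the reference model.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    # -log(sigmoid(margin)) equals softplus(-margin)
    return _softplus(-margin)


def pair_weight(pol_w, pol_l, len_w, len_l):
    """Assumed weighting: product of length-normalized policy probabilities.

    Pairs the current policy is likely to generate get a weight near 1;
    pairs far from the policy's distribution get a weight near 0.
    """
    return math.exp(pol_w / len_w + pol_l / len_l)


def weighted_preference_loss(batch, beta=0.1):
    """Mean of weight * DPO loss over a batch of pairs (list of dicts).

    In a training framework the weight would be treated as a constant
    (no gradient through it); that is an assumption of this sketch.
    """
    total = 0.0
    for p in batch:
        w = pair_weight(p["pol_w"], p["pol_l"], p["len_w"], p["len_l"])
        loss = dpo_pair_loss(p["pol_w"], p["pol_l"], p["ref_w"], p["ref_l"], beta)
        total += w * loss
    return total / len(batch)
```

With this weighting, an off-policy pair that the current policy assigns low probability contributes little to the batch loss, which is the sense in which the reweighted data is meant to resemble on-policy data.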

Comments:	EMNLP 2024
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2406.11827 [cs.CL]
	(or arXiv:2406.11827v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.11827

Submission history

From: Wenxuan Zhou
[v1] Mon, 17 Jun 2024 17:59:13 UTC (404 KB)
[v2] Thu, 3 Oct 2024 21:37:02 UTC (405 KB)
