Countering Reward Over-Optimization in LLM with Demonstration-Guided Reinforcement Learning

Mathieu Rita, Florian Strub, Rahma Chaabouni, Paul Michel, Emmanuel Dupoux, Olivier Pietquin


Abstract
While reinforcement learning (RL) has proven essential for tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, which requires computationally expensive hyperparameter tuning. Moreover, KL regularization regularizes only the language policy, neglecting another potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we introduce Reward Calibration from Demonstration (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the demonstrations’ and the LLM’s rewards rather than directly maximizing the reward function. This shift in objective removes the incentive for the LLM to exploit the reward model and promotes more natural and diverse language generation. We demonstrate the effectiveness of RCfD on three RL language tasks, where it achieves performance comparable to carefully tuned baselines while mitigating ROO.
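In symbols (a minimal sketch of the objective as described in the abstract; the absolute-value distance is an assumption, since the abstract specifies only "a distance"): let x be a prompt, R a learned reward model, y_D a human demonstration for x, and y ~ π_θ(·|x) a completion sampled from the LLM policy. Standard reward maximization optimizes E[R(x, y)], whereas RCfD instead optimizes

    J_RCfD(θ) = E_{x, y ~ π_θ(·|x)} [ -|R(x, y) - R(x, y_D)| ],

so the policy is rewarded for matching the reward level of the human demonstrations rather than for pushing R(x, y) arbitrarily high, which removes the incentive to exploit flaws in the reward model.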
Anthology ID:
2024.findings-acl.740
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12447–12472
URL:
https://aclanthology.org/2024.findings-acl.740
DOI:
10.18653/v1/2024.findings-acl.740
Cite (ACL):
Mathieu Rita, Florian Strub, Rahma Chaabouni, Paul Michel, Emmanuel Dupoux, and Olivier Pietquin. 2024. Countering Reward Over-Optimization in LLM with Demonstration-Guided Reinforcement Learning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 12447–12472, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Countering Reward Over-Optimization in LLM with Demonstration-Guided Reinforcement Learning (Rita et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.740.pdf