What Effects the Generalization in Visual Reinforcement Learning: Policy Consistency with Truncated Return Prediction

Authors

  • Shuo Wang School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China National Key Laboratory of Air-based Information Perception and Fusion, Luoyang, China
  • Zhihao Wu School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China
  • Xiaobo Hu School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China
  • Jinwen Wang School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China
  • Youfang Lin School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China
  • Kai Lv School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, China National Key Laboratory of Air-based Information Perception and Fusion, Luoyang, China

DOI:

https://doi.org/10.1609/aaai.v38i6.28369

Keywords:

CV: Vision for Robotics & Autonomous Driving, ML: Transfer, Domain Adaptation, Multi-Task Learning

Abstract

In visual Reinforcement Learning (RL), generalization to new environments is a paramount challenge. This study pioneers a theoretical analysis of generalization in visual RL, establishing an upper bound on the generalization objective that comprises policy-divergence and Bellman-error components. Motivated by this analysis, we propose maintaining cross-domain consistency for each policy in the policy space, which reduces the divergence of the learned policy at test time. In practice, we introduce the Truncated Return Prediction (TRP) task, which promotes cross-domain policy consistency by predicting truncated returns of historical trajectories, and we propose a Transformer-based predictor for this auxiliary task. Extensive experiments on the DeepMind Control Suite and Robotic Manipulation tasks demonstrate that TRP achieves state-of-the-art generalization performance. We further show that TRP outperforms previous methods in sample efficiency during training.
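The TRP auxiliary task predicts truncated returns of historical trajectories. As a rough illustration of what such prediction targets look like, the following is a minimal sketch of computing K-step truncated discounted returns from a reward sequence; the function name and the values of `gamma` and `k` are illustrative assumptions, not taken from the paper, and the paper's actual predictor is a learned Transformer rather than this target computation.

```python
def truncated_returns(rewards, gamma=0.99, k=3):
    """For each timestep t, compute the K-step truncated discounted return
    sum_{i=0}^{k-1} gamma^i * r_{t+i}, truncating at the trajectory's end.

    These per-timestep scalars could serve as regression targets for an
    auxiliary return-prediction head (hypothetical setup for illustration).
    """
    targets = []
    for t in range(len(rewards)):
        g = 0.0
        # Accumulate at most k discounted rewards starting at timestep t.
        for i, r in enumerate(rewards[t:t + k]):
            g += (gamma ** i) * r
        targets.append(g)
    return targets
```

For example, with `rewards=[1, 1, 1, 1]`, `gamma=0.9`, and `k=3`, the target at t=0 is 1 + 0.9 + 0.81 = 2.71, while near the end of the trajectory fewer terms remain and the sum is truncated accordingly.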

Published

2024-03-24

How to Cite

Wang, S., Wu, Z., Hu, X., Wang, J., Lin, Y., & Lv, K. (2024). What Effects the Generalization in Visual Reinforcement Learning: Policy Consistency with Truncated Return Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 5590-5598. https://doi.org/10.1609/aaai.v38i6.28369

Section

AAAI Technical Track on Computer Vision V