Dyna-Validator: A Model-based Reinforcement Learning Method with Validated Simulated Experiences
DOI: https://doi.org/10.15837/ijccc.2023.5.5073

Keywords: Model-based reinforcement learning (MBRL), Dyna, Simulated annealing

Abstract
Dyna is a planning paradigm that naturally weaves learning and planning together through an environment model. Dyna-style reinforcement learning improves sample efficiency by using simulated experience generated by the environment model to update the value function. However, existing Dyna-style planning methods are usually tabular and therefore suited only to tasks with low-dimensional, small-scale state spaces. In addition, the quality of the simulated experience they generate cannot be guaranteed, which significantly limits their application to tasks such as high-dimensional continuous robot control and autonomous driving. To this end, we propose a model-based approach that controls planning through a validator. The validator filters high-quality experiences for policy learning and decides when to stop planning. To deal with the exploration-exploitation dilemma in reinforcement learning, an action selection strategy is designed that combines an ϵ-greedy policy with a simulated annealing (SA) cooling schedule. The strong performance of the proposed method is demonstrated on a set of classical Atari games. Experimental results show that learning a dynamics model improves sample efficiency in some games, and that this benefit is maximized by choosing an appropriate number of planning steps. During planning, our method maintains only a small gap to MuZero, the current state of the art in model-based reinforcement learning. Achieving a good compromise between model accuracy and planning horizon therefore requires controlling the planning process carefully. A practical application on a physical robot system shows that the method reduces the influence of an imprecise depth prediction model on the task. Without human supervision, it makes collecting training data and learning complex skills (such as grasping and carrying items) easier, while scaling more effectively to previously unseen tasks.
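To make the planning control and action-selection ideas in the abstract concrete, the following minimal Python sketch illustrates (i) an ϵ-greedy rule whose exploration rate follows a simulated-annealing-style cooling schedule and (ii) a Dyna-style planning loop in which a validator filters simulated transitions and stops planning when their quality drops. The interfaces model.predict, validator.score, agent.update and replay.sample_state_action, as well as all numeric settings, are hypothetical illustrations under these assumptions, not the authors' implementation.

```python
import math
import random


def epsilon_greedy_sa(q_values, step, eps_start=1.0, eps_min=0.05, decay=1e-4):
    """Epsilon-greedy action selection with an SA-style exponential cooling schedule.

    eps_start, eps_min and decay are illustrative values, not the paper's settings.
    """
    # Exploration rate "cools" from eps_start toward eps_min as training proceeds.
    eps = eps_min + (eps_start - eps_min) * math.exp(-decay * step)
    if random.random() < eps:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=q_values.__getitem__)       # exploit


def validated_planning(model, validator, agent, replay, n_plan=10, threshold=0.5):
    """One Dyna-style planning phase gated by a validator (assumed interfaces).

    model.predict, validator.score, agent.update and replay.sample_state_action
    are placeholder APIs used only for illustration.
    """
    for _ in range(n_plan):
        s, a = replay.sample_state_action()          # start from a previously visited pair
        s_next, r = model.predict(s, a)              # simulated transition from the learned model
        quality = validator.score(s, a, r, s_next)   # estimated reliability of the rollout
        if quality < threshold:
            break                                    # low-quality experience: stop planning early
        agent.update(s, a, r, s_next)                # learn from the validated simulated experience
```

In this sketch, early stopping plays the role described in the abstract: planning continues only while the model's simulated experience remains trustworthy, so the planning horizon adapts to model accuracy.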
Copyright (c) 2023 Hengsheng Zhang, Jingchen Li, Ziming He, Jinhui Zhu, Haobin Shi
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.