Computer Science ›› 2019, Vol. 46 ›› Issue (5): 169-174. doi: 10.11896/j.issn.1002-137X.2019.05.026
LI Jie1,2, LING Xing-hong1,2, FU Yu-chen1,2, LIU Quan1,2,3,4
Abstract: Asynchronous deep reinforcement learning can greatly reduce the training time required by a learning model through multi-threading. However, the asynchronous advantage actor-critic (A3C) algorithm, a classic asynchronous deep reinforcement learning algorithm, does not make full use of certain image regions of high value, so the learning efficiency of the network model is less than ideal. To address this problem, this paper proposes an asynchronous advantage actor-critic model based on a visual attention mechanism. Building on the traditional A3C algorithm, the model introduces a visual attention mechanism: it computes a visual importance value for each region of the image and derives the attention context vector through regression and weighting operations, so that the agent concentrates its attention on image regions that are small in area but rich in value. This speeds up decoding in the network model and allows a near-optimal policy to be learned more efficiently. Experimental results show that, compared with the traditional A3C algorithm, the proposed model achieves better performance on decision-making tasks based on visual perception.
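To make the attention step concrete, below is a minimal PyTorch sketch of the soft visual attention described in the abstract: it scores each spatial region of a convolutional feature map for importance, normalizes the scores, and forms a weighted context vector for the actor and critic heads. This is an illustrative assumption, not the authors' implementation; all names (`SoftVisualAttention`, `feat_proj`, `hidden`, etc.) are hypothetical.

```python
# Illustrative sketch only -- not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftVisualAttention(nn.Module):
    """Scores each spatial region of a CNN feature map and returns a
    weighted context vector, conditioned on the agent's recurrent state."""
    def __init__(self, feat_channels: int, hidden_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_channels, hidden_dim)
        self.hid_proj = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feat: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) conv features; hidden: (B, hidden_dim) agent state.
        b, c, h, w = feat.shape
        # Flatten spatial grid so each of the H*W regions is one row.
        regions = feat.view(b, c, h * w).transpose(1, 2)              # (B, HW, C)
        # Visual importance value for each region given the current state.
        e = self.score(torch.tanh(
            self.feat_proj(regions) + self.hid_proj(hidden).unsqueeze(1)))  # (B, HW, 1)
        alpha = F.softmax(e, dim=1)                                   # weights over regions
        # Weighted sum of regions: the attention context vector.
        return (alpha * regions).sum(dim=1)                           # (B, C)
```

In an A3C worker, the resulting context vector would feed the policy and value heads in place of (or alongside) the flattened convolutional features, focusing computation on the small but high-value image regions.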