Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles

Patel, Bhrij; Suttle, Wesley A.; Koppel, Alec; Aggarwal, Vaneet; Sadler, Brian M.; Bedi, Amrit Singh; Manocha, Dinesh

arXiv:2403.11925 (cs)

[Submitted on 18 Mar 2024 (v1), last revised 20 Jun 2024 (this version, v5)]

Abstract: In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of the duration a Markov chain under a fixed policy needs to achieve its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic due to the difficulty and expense of estimating mixing time in environments with large state spaces, leading to the necessity of impractically long trajectories for effective gradient estimation in practical applications. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. With our approach, we effectively alleviate the dependency on mixing time knowledge, a first for global convergence in average-reward MDPs. Furthermore, our approach exhibits the tightest dependence on mixing time, $\mathcal{O}\left( \sqrt{\tau_{mix}} \right)$, known from prior work. With a 2D grid world goal-reaching navigation experiment, we demonstrate that MAC outperforms the existing state-of-the-art policy gradient-based method for average-reward settings.
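
The abstract names a Multi-level Monte-Carlo (MLMC) gradient estimator but does not spell out its form. As a rough illustration only, the sketch below follows the generic MLMC construction from the broader literature: draw a random level J from a Geometric(1/2) distribution, collect 2^J samples, and combine a single-sample estimate with a scaled difference of fine and coarse averages. The function name `mlmc_estimate`, the callback `sample_fn`, and the cap `t_max` are hypothetical names introduced here, and this is not claimed to be the exact estimator used in MAC.

```python
import numpy as np

def mlmc_estimate(sample_fn, t_max, rng):
    """Generic MLMC estimate (illustrative sketch, not the paper's code).

    sample_fn(n) is a hypothetical callback returning an (n, d) array of
    per-step gradient samples, e.g. from n consecutive trajectory steps.
    t_max caps the number of samples drawn in a single call.
    """
    # Level J ~ Geometric(1/2), so P(J = j) = 2^{-j} for j = 1, 2, ...
    level = int(rng.geometric(0.5))
    # Fall back to a single sample when 2^J would exceed the cap.
    n = 2 ** level if 2 ** level <= t_max else 1
    samples = np.asarray(sample_fn(n), dtype=float)
    g0 = samples[0]                            # one-sample base estimate
    if n == 1:
        return g0
    g_fine = samples.mean(axis=0)              # average over all 2^J samples
    g_coarse = samples[: n // 2].mean(axis=0)  # average over first 2^(J-1)
    # The 2^J weight cancels the 2^{-J} level probability in expectation,
    # so the level differences telescope.
    return g0 + n * (g_fine - g_coarse)
```

In this construction the expected number of samples per call grows only logarithmically in `t_max`, while the expectation of the estimate matches that of a long-rollout average. That is the general reason an MLMC estimator can avoid choosing a trajectory length from a known mixing time.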

Comments:	26 Pages, 2 Figures
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2403.11925 [cs.LG]
	(or arXiv:2403.11925v5 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2403.11925

Submission history

From: Bhrij Patel
[v1] Mon, 18 Mar 2024 16:23:47 UTC (872 KB)
[v2] Wed, 8 May 2024 23:59:23 UTC (968 KB)
[v3] Fri, 10 May 2024 00:57:18 UTC (968 KB)
[v4] Mon, 17 Jun 2024 12:47:32 UTC (967 KB)
[v5] Thu, 20 Jun 2024 22:26:42 UTC (967 KB)
