Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

WU Mondal, V Aggarwal - arXiv preprint arXiv:2408.11513, 2024 - arxiv.org
We consider the problem of learning a Constrained Markov Decision Process (CMDP) via general parameterization. Our proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal. For a parameterized policy class with transferred compatibility approximation error $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation (up to some additive factor of …) with a sample complexity of …. If the class is incomplete ($\epsilon_{\mathrm{bias}} > 0$), then the sample complexity reduces to … for …. Moreover, for complete policy classes with $\epsilon_{\mathrm{bias}} = 0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with … sample complexity. This is a significant improvement over the state-of-the-art last-iterate guarantees for general parameterized CMDPs.
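To make the primal-dual idea concrete, here is a minimal sketch of a primal-dual, entropy- and quadratically-regularized natural policy gradient loop on a toy tabular CMDP with a softmax policy. This is an illustrative assumption, not the paper's PDR-ANPG algorithm: it omits acceleration, sampling, general function approximation, and the paper's specific step-size and regularizer schedules, and all quantities (model, thresholds, hyperparameters) are made up for the example.

```python
# Sketch: primal-dual regularized natural policy gradient on a toy tabular CMDP.
# Primal: NPG step on the entropy-regularized Lagrangian; dual: projected gradient
# step on the multiplier with a quadratic regularizer. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# Toy CMDP: maximize V_r subject to V_c >= b (all quantities assumed for the demo).
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))          # transition kernel P[s, a, s']
r = rng.uniform(size=(S, A))                        # reward to maximize
c = rng.uniform(size=(S, A))                        # constraint reward, require V_c >= b
b = 0.45 / (1 - gamma)                              # constraint threshold (assumed)
rho = np.full(S, 1.0 / S)                           # initial state distribution

def policy(theta):
    """Softmax policy pi(a|s) from logits theta[s, a]."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def q_value(pi, reward):
    """Exact Q^pi and V^pi for a per-step reward, via a linear system."""
    P_pi = np.einsum("sab,sa->sb", P, pi)            # state-to-state kernel under pi
    r_pi = (pi * reward).sum(axis=1)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return reward + gamma * P @ v, v                 # Q[s, a], V[s]

# Hyperparameters (assumed, not taken from the paper).
tau, eta, alpha, beta = 0.01, 0.5, 0.05, 0.1        # entropy reg, NPG step, dual step, quad reg
theta = np.zeros((S, A))
lam = 0.0

for t in range(2000):
    pi = policy(theta)
    # Entropy-regularized Lagrangian reward: r + lam * c - tau * log(pi).
    lag_reward = r + lam * c - tau * np.log(pi + 1e-12)
    Q, _ = q_value(pi, lag_reward)
    adv = Q - (pi * Q).sum(axis=1, keepdims=True)    # advantage of the Lagrangian
    # For tabular softmax policies the NPG step reduces to a logit update by the advantage.
    theta = theta + eta * adv
    # Dual update: projected gradient step on lam with a quadratic regularizer on lam.
    _, v_c = q_value(pi, c)
    constraint_gap = b - rho @ v_c                   # positive if the constraint is violated
    lam = max(0.0, lam + alpha * (constraint_gap - beta * lam))

pi = policy(theta)
_, v_r = q_value(pi, r)
_, v_c = q_value(pi, c)
print(f"reward value {rho @ v_r:.3f}, constraint value {rho @ v_c:.3f}, "
      f"threshold {b:.3f}, lambda {lam:.3f}")
```

The entropy term keeps the primal iterates strictly stochastic, while the quadratic term damps the multiplier; both regularizers are driven toward zero in analyses of this kind so that the regularized solution approaches the original CMDP optimum.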