Actor-critic is implicitly biased towards high entropy optimal policies

Hu, Yuzheng; Ji, Ziwei; Telgarsky, Matus

arXiv:2110.11280 (cs)

[Submitted on 21 Oct 2021 (v1), last revised 13 Mar 2022 (this version, v2)]

Abstract: We show that the simplest actor-critic method -- a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration -- does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like $\epsilon$-greedy, but is moreover trained on a single trajectory with no resets. The key consequence of the high entropy bias is that uniform mixing assumptions on the MDP, which exist in some form in all prior work, can be dropped: the implicit regularization of the high entropy bias is enough to ensure that all chains mix and an optimal policy is reached with high probability. As auxiliary contributions, this work decouples concerns between the actor and critic by writing the actor update as an explicit mirror descent, provides tools to uniformly bound mixing times within KL balls of policy space, and provides a projection-free TD analysis with its own implicit bias which can be run from an unmixed starting distribution.
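
The setting described in the abstract can be illustrated with a minimal sketch, which is not the paper's algorithm or analysis. It assumes a hypothetical two-state, two-action toy MDP, one-hot features (so the "linear" softmax policy and critic reduce to tables), a SARSA-style TD(0) critic restarted from zero each round, and an actor step that adds the critic's estimate to the logits, i.e. the mirror descent update pi_new(a|s) proportional to pi(a|s) * exp(eta * Q(s,a)). The function name `run_actor_critic` and all step sizes are illustrative choices. As in the abstract, there is no regularization, no projection, no epsilon-greedy exploration, and a single trajectory with no resets.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def run_actor_critic(n_iters=30, td_steps=200, eta=0.5, alpha=0.1,
                     gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n_states, n_actions = 2, 2
    # Hypothetical toy MDP: the next state is uniform regardless of the action.
    # State 0 has a unique best action; in state 1 both actions are optimal.
    rewards = np.array([[1.0, 0.0],
                        [0.5, 0.5]])
    logits = np.zeros((n_states, n_actions))   # actor parameters
    s = 0
    a = rng.choice(n_actions, p=softmax(logits)[s])
    for _ in range(n_iters):
        pi = softmax(logits)
        # Critic: TD(0) on Q, restarted from zero each round (an assumption),
        # run along the same continuing trajectory (no resets of s, a).
        q = np.zeros((n_states, n_actions))
        for _ in range(td_steps):
            r = rewards[s, a]
            s_next = rng.integers(n_states)
            a_next = rng.choice(n_actions, p=pi[s_next])
            q[s, a] += alpha * (r + gamma * q[s_next, a_next] - q[s, a])
            s, a = s_next, a_next
        # Actor: mirror descent step, pi_new proportional to pi * exp(eta * q).
        logits += eta * q
    return softmax(logits)

if __name__ == "__main__":
    print(run_actor_critic())
```

Calling `run_actor_critic()` returns the final policy as a 2x2 array of action probabilities per state. In this toy, row 0 should concentrate on the rewarding action, while row 1 (where both actions tie) can be inspected to see how far the policy drifts from uniform; the sketch makes no claim about reproducing the paper's guarantees.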

Comments: v2 primarily improved the proofs, with minimal changes to the body
Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2110.11280 [cs.LG] (or arXiv:2110.11280v2 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2110.11280

Submission history

From: Matus Telgarsky
[v1] Thu, 21 Oct 2021 17:06:59 UTC (33 KB)
[v2] Sun, 13 Mar 2022 06:07:33 UTC (36 KB)
