Heavy-Ball Momentum Accelerated Actor-Critic With Function Approximation
Yanjie Dong
Haijun Zhang
Gang Wang
Shisheng Cui
and Xiping Hu
This work was supported by the National Natural Science Foundation of China under Grants 62102266, U23B2059, 62173034 and the Pearl River Talent Recruitment Program of Guangdong Province under Grant 2019ZT08X603. Y. Dong and X. Hu are with the Artificial Intelligence Research Institute and Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University, Shenzhen 518172, China. H. Zhang is with the Beijing Engineering and Technology Research Center for Convergence Networks and Ubiquitous Services, University of Science and Technology Beijing, Beijing, China. G. Wang and S. Cui are with the State Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology, Beijing 100081, China.
Abstract
By using a parametric value function to replace the Monte-Carlo rollouts for value estimation, the actor-critic (AC) algorithms can reduce the variance of the stochastic policy gradient and thereby improve the convergence rate.
While existing works mainly focus on analyzing the convergence rate of AC algorithms under Markovian noise, the impacts of momentum on AC algorithms remain largely unexplored.
In this work, we first propose a heavy-ball momentum based advantage actor-critic (HB-A2C) algorithm by integrating the heavy-ball momentum into the critic recursion that is parameterized by a linear function.
When the sample trajectory follows a Markov decision process, we quantitatively certify the acceleration capability of the proposed HB-A2C algorithm.
Our theoretical results demonstrate that the proposed HB-A2C finds an -approximate stationary point with iterations for reinforcement learning tasks with Markovian noise.
Moreover, we also reveal the dependence of learning rates on the length of the sample trajectory.
By carefully selecting the momentum factor of the critic recursion, the proposed HB-A2C can balance the errors introduced by the initialization and the stochastic approximation.
In model-free reinforcement learning (MFRL) algorithms, an agent optimizes a long-term cumulative reward (a.k.a., value function) by interacting with an unknown stochastic environment that can be articulated as a Markov decision process (MDP).
When combined with function approximators (e.g., linear approximators and neural networks), the MFRL algorithms have achieved human-level control and extraordinary empirical success in many domains, e.g., video games [1], robotic control [2, 3], autonomous vehicles [4, 5], and linear quadratic control tasks [6, 7].
The current MFRL algorithms can be classified into three categories, i.e., policy-based MFRL [8, 9, 10, 11, 12, 13, 14], value-based MFRL [15, 16, 17, 7, 18], and actor-critic MFRL algorithms [19, 20].
The policy-based MFRL algorithms aim at optimizing the behavior policy based on the policy gradient theorem [21].
When using a parametric policy, the policy-based MFRL can directly optimize the policy parameters via the stochastic gradient descent (SGD) [8, 9, 10, 11].
However, the policy-based MFRL algorithms require access to the gradient of the value function with respect to a given policy.
In practical scenarios with the unknown transition kernels for the MDP, the policy gradients should be estimated from the Monte-Carlo rollouts.
Consequently, the policy-based MFRL algorithms often encounter significant variance in policy gradients and high sampling costs due to the stochastic approximation.
Besides, the policy-based MFRL algorithms demand sufficiently small learning rates to guarantee stable convergence under the function approximators.
Therefore, the policy-based MFRL algorithms can suffer from a slow convergence.
While appropriate geometry engineering can improve the convergence [14, 12, 13], reducing the gradient variance of the policy-based MFRL algorithms is still in high demand.
The value-based MFRL algorithms recursively update the long-term cumulative rewards for different state-action pairs based on the Bellman equation and determine the policy based on the action-value function, e.g., Q-learning [18].
Moreover, SARSA can speed up the learning process of Q-learning by using the policy improvement operators [16].
By estimating the value of successor states via the bootstrapping operation, the value-based MFRL algorithms can efficiently converge to a satisfying behavior policy based on the fixed-point recursions.
Besides, the value-based MFRL can also be used to evaluate a behavior policy in order to track the future rewards of all states, e.g., temporal-difference (TD) learning algorithms [15].
When parameterizing the value function via myopical function approximators, the value-based MFRL algorithms become unstable or diverge for the environments with continuous state and/or action spaces.
Therefore, extensive hyperparameter tuning can be required to obtain a stable behavior policy when using value-based MFRL algorithms.
To handle the sample inefficiency and the divergence of the aforementioned MFRL algorithms, recent research aims at reducing the variance of the policy gradient by integrating the policy evaluation into the policy improvement, which leads to the actor-critic (AC) algorithms [19, 20, 22, 21].
More specifically, the AC algorithms are designed to use a critic recursion to estimate the value of a current policy and then apply an actor recursion to improve the behavior policy based on feedback from the critic [23].
The current AC algorithms can be categorized into double-loop AC algorithms and single-loop AC algorithms.
In the double-loop setting, the critic is consecutively updated for several rounds to obtain an accurate value estimation before each actor recursion [23, 24, 5].
When the actor and critic recursions use different sample trajectories, the inner-loop policy evaluation can be decoupled from the outer-loop policy improvement [23, 24, 5].
Moreover, several different schemes for updating critic sequence have been investigated in centralized topology [23, 24] and decentralized topology [5].
While mainly utilized for the analytical convenience, the double-loop setting is seldom employed in practice due to the double-sampling requirement for the actor and critic recursions.
Besides, it is unclear whether an accurate policy evaluation is necessary since it pertains to just one-step policy improvement.
For the single-loop AC algorithms, the actor and critic sequences are updated concurrently [25, 26].
The asymptotic convergence of the single-loop AC algorithms has been established from the perspective of ordinary differential equations, specifically when the ratio of the learning rates between the actor and critic approaches zero [25, 26].
While the asymptotic convergence of single-loop AC algorithms has been well-investigated [25], the finite-time convergence analysis was unclear until recently [10, 27, 28, 29, 30, 26].
In the Big Data era, it is preferable to use finite-time (or finite-sample) error bounds to characterize the data efficiency of machine learning algorithms.
For example, by constraining the actor sequence to converge more slowly than the critic sequence,
the finite-time analysis in [28] shows that the two-timescale AC algorithm holds a convergence rate .
The convergence rate of the AC algorithms is sharpened to when the variance of Markovian noise decays at the same rate as the convergence of critic sequence [29].
The smoothness of the Hessian matrix for the parametric policy is also required to establish the finite-time convergence [29, 30].
Moreover, the finite-time convergence analysis proposed in [30] is only suitable for discrete state-action spaces and requires non-trivial research effort to be extended to continuous state-action spaces.
Using the same order of learning rates for the actor and critic sequences, the convergence rate of the single-loop AC algorithm is improved to in [31].
Contributions. Different from [28, 29, 27, 30, 31], we aim to improve the convergence of AC algorithms by using momentum.
More specifically, we introduce the heavy-ball (HB) momentum to the critic recursion and propose the heavy-ball based advantage actor-critic (HB-A2C) algorithm.
Besides, the actor and critic recursions rely on a Markovian trajectory that is collected from a single MDP in an online manner.
Our major contributions are summarized as follows.
• For the MFRL tasks, we propose an HB-A2C algorithm that uses a -step trajectory to update the actor and critic parameters.
• We present a new analytical framework that can tightly characterize the estimation error introduced by the gradient bias and the optimality drift under Markovian noise when the heavy-ball based critic recursion is used.
Compared with [31], our analytical framework can be adopted to characterize the impacts of HB momentum on the convergence.
Moreover, our analytical framework demonstrates that the proposed HB-A2C algorithm converges at a rate of without assuming the decaying variance of Markovian noise.
Notation:
The filtration is denoted by that contains all random variables before the start of frame .
The vector denotes the transpose of .
For notational brevity, the parametric distribution is denoted by .
II Preliminaries
II-A Problem description
We consider an MDP that is described by a quintuple , where
is the continuous action space,
is the continuous state space,
is the unknown transition kernel that maps each state-action pair to a distribution over state space ,
specifies the bounded reward for state-action pair , and is the discount factor.
A policy maps state to a distribution over the action space .
To evaluate the expected discounted reward starting from a state under the policy , the value function is defined as
(1)
where each action follows the policy , and
the successor state .
Given a policy , the value function (1) satisfies the Bellman equation as [15, 21, 32]
(2)
where the expectation is taken over the action and the successor state .
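The fixed-point reading of the Bellman equation (2) can be illustrated numerically. The sketch below is not part of the paper: it assumes a hypothetical two-state Markov chain that is already induced by a fixed policy (the transition matrix `P`, the reward vector `r`, and the discount `gamma` are made-up values), and it repeatedly applies the Bellman operator until the value estimate stops changing.

```python
def evaluate_policy(P, r, gamma, tol=1e-12, max_iter=10_000):
    """Iterate V <- r + gamma * P V until successive iterates differ by less than tol."""
    n = len(r)
    V = [0.0] * n
    for _ in range(max_iter):
        V_new = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(n)) for s in range(n)]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical policy-induced chain: rows are current states, columns are successors.
P = [[0.9, 0.1], [0.2, 0.8]]
r = [1.0, 0.0]   # expected one-step reward in each state
gamma = 0.9      # discount factor

V = evaluate_policy(P, r, gamma)

# Bellman residual: how far V is from satisfying V = r + gamma * P V.
residual = max(
    abs(V[s] - (r[s] + gamma * sum(P[s][t] * V[t] for t in range(2))))
    for s in range(2)
)
```

Because the Bellman operator is a contraction with modulus `gamma`, the residual shrinks geometrically; in the model-free setting of this paper the matrix `P` is unknown, which is why sampled trajectories are used instead.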
The objective is to find the optimal policy that maximizes the expected discounted reward as
(3)
II-B Function approximation
When considering continuous state and action spaces, obtaining the optimal policy becomes computationally burdensome or even intractable due to the notorious curse of dimensionality (CoD).
One popular way to handle the CoD issue is to approximate each policy and the value function by a neural network and a linear-function approximator, respectively.
In this work, the policy and the value function are respectively parameterized by the actor parameter and the critic parameter .
More specifically, the parametric policy is denoted by , and the parametric value function is denoted by with and the feature embedding satisfying , .
Note that the optimal value when the radius is sufficiently large [15].
Based on the parametric policy , parametric value function , and the Bellman equation (2), we can recast the objective in (3) as a bilevel optimization that optimizes the actor parameter in the outer problem and the critic parameter in the inner problem as
(4)
where with , and is the target value for state .
The target value can be estimated by the one-step (or multi-step) bootstrapping.
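As a rough illustration of one-step versus multi-step bootstrapping, the sketch below computes a target value from a short reward sequence and a value estimate at the last observed state. The function name `n_step_target` and its arguments are hypothetical; the exact indexing and normalization used in the paper's target may differ.

```python
def n_step_target(rewards, gamma, bootstrap_value):
    """Discounted sum of the observed rewards plus the discounted bootstrap value.

    rewards: rewards observed along the trajectory segment, in time order.
    bootstrap_value: value estimate of the state reached after the last reward.
    """
    target = bootstrap_value
    # Fold the rewards in from the end: target = r + gamma * target.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


one_step = n_step_target([1.0], gamma=0.5, bootstrap_value=8.0)
three_step = n_step_target([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=8.0)
```

With a longer segment, the bootstrap value is discounted more heavily, so errors in the value estimate contribute less to the target while the sampled rewards contribute more; this is the bias-variance balance that multi-step bootstrapping is usually credited with.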
Remark 1
According to the inner problem of (4), the optimal critic parameter is essentially a function of the actor parameter .
Therefore, the only optimization variable is the actor parameter in the outer problem of (4).
III Heavy-Ball Based Actor-Critic for RL Tasks
III-A Algorithm development
We consider a fully data-driven technique that maintains a running estimate of the value function (cf. the inner problem in (4)) while performing policy updates based on the estimated state values (cf. the outer problem in (4)). A multi-step bootstrapping is employed to estimate the target value . One of the merits of multi-step bootstrapping is the ability to balance bias and variance during the estimation of the target value. Furthermore, as we will justify later, the multi-step bootstrapping allows for a larger learning rate when solving the inner problem of (4) using recursive updates, thereby reducing the number of recursions required for the critic parameter. Consequently, we consider the MDP to operate on two timescales, where each coarse-grain slot (i.e., frame) consists of fine-grain slots (i.e., steps).
For notational brevity, we recast
the reward ,
the feature embedding ,
the policy , and
the optimal critic as
,
,
, and , respectively.
Inner optimization. For the inner optimization, the critic parameter can be updated via the stochastic semi-gradient
(5)
where the parametric value ; the target value is estimated by a -step bootstrapping as ; is the -step trajectory; and the observation follows the distribution with as the -step transition kernel.
The compact form stochastic semi-gradient is denoted by
(6)
where and .
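To make the structure of a stochastic semi-gradient concrete, the following sketch evaluates one for a linear critic on a single trajectory segment. It is an assumption-laden illustration rather than the paper's exact (5) or (6): the helper names, the sign convention (the returned direction is added to the critic parameter), and the absence of any averaging over steps are all choices made here. The term "semi" reflects that the bootstrapped target is treated as a constant and is not differentiated.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def semi_gradient(omega, phi_first, phi_last, rewards, gamma):
    """TD-style semi-gradient direction for a linear critic V(s) = phi(s)^T omega.

    phi_first: feature vector of the first state in the segment.
    phi_last: feature vector of the state reached after the last reward.
    """
    n = len(rewards)
    discounted = sum(gamma**i * r for i, r in enumerate(rewards))
    # Bootstrapped target; it depends on omega but is held fixed when differentiating.
    target = discounted + gamma**n * dot(phi_last, omega)
    td_error = target - dot(phi_first, omega)
    return [td_error * x for x in phi_first]


g = semi_gradient(
    omega=[0.0, 0.0],
    phi_first=[1.0, 0.0],
    phi_last=[0.0, 1.0],
    rewards=[1.0, 1.0],
    gamma=0.5,
)
```

Here the critic starts at zero, so the temporal-difference error equals the discounted reward sum 1.5 and the update direction points along the feature of the first state.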
The stochastic semi-gradient (6) equals the sum of full semi-gradient and gradient bias as
(7)
where the gradient bias is , and the full semi-gradient with and .
Denoting the -induced stationary distribution by , the -step sample trajectory is obtained from the stationary distribution .
Note that, given the actor parameter , the optimal critic parameter satisfies .
Together with the full semi-gradient , we obtain
(8)
Recalling the definitions of and and setting and , we have .
Based on [15, Lemma 3], we obtain
(9)
where with as the smallest eigenvalue of the matrix . When the redundant or the irrelevant features are removed, the matrix is positive-definite.
Since each feature embedding satisfies , the smallest eigenvalue satisfies [15].
Remark 2
The inequality in (9) can be viewed as a strongly monotone property of the full semi-gradient .
Based on (9), we observe that a longer trajectory results in a larger condition number for the inner problem in (4), thereby allowing for a higher learning rate that reduces the number of training recursions required for the critic parameter.
Outer optimization. The policy gradient theorem [8] provides an analytical expression for the gradient of outer objective in (4).
In the context of two timescale framework, the policy gradient of is defined as .
Moreover, the policy gradient of frame is given by
(10)
where the expectation is taken over each observation , and the function is defined as
(11)
where is the optimal value parameter under the policy .
Based on (6) and (10), the vanilla SGD can be used to search for the optimal actor parameter and the optimal critic parameter .
However, the vanilla SGD can suffer from slow convergence.
Therefore, we leverage the HB momentum to improve the convergence rate and propose an HB based advantage actor-critic (HB-A2C) algorithm in Algorithm 1.
Algorithm 1 HB-A2C Algorithm
1:Initialization:
critic hyper-parameters: stepsize , parameter , momentum factor , and momentum parameter ;
actor hyper-parameters: stepsize and parameter
2:fordo
3: Rolling out -step observations via the behavior
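Since the listing of Algorithm 1 is incomplete here, the following is only a hedged sketch of what a heavy-ball critic step with projection can look like. It assumes an exponentially averaged momentum, `m <- beta * m + (1 - beta) * g`, followed by a projected step `omega <- Proj(omega + stepsize * m)` onto an L2 ball; the paper's recursions (12a) and (12b) may use a different weighting or ordering. The function names and the toy semi-gradient `target - omega` are hypothetical.

```python
import math


def project_l2_ball(x, radius):
    """Project x onto the Euclidean ball of the given radius centered at the origin."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def hb_critic_step(omega, momentum, grad, stepsize, beta, radius):
    """One heavy-ball step: update the momentum, then take a projected step along it."""
    new_momentum = [beta * m + (1.0 - beta) * g for m, g in zip(momentum, grad)]
    moved = [w + stepsize * m for w, m in zip(omega, new_momentum)]
    return project_l2_ball(moved, radius), new_momentum


# Toy strongly monotone problem: the direction (target - omega) vanishes at omega = target.
target = [1.0, -2.0]
omega, momentum = [0.0, 0.0], [0.0, 0.0]
for _ in range(500):
    grad = [t - w for t, w in zip(target, omega)]
    omega, momentum = hb_critic_step(
        omega, momentum, grad, stepsize=0.5, beta=0.9, radius=10.0
    )
```

In this toy setting the iterates spiral into the fixed point; a larger `beta` forgets the initial momentum more slowly but averages out more of the noise in `grad`, which is the kind of trade-off the momentum factor controls.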
Hereinafter, our goal is to analyze the convergence rate of the proposed HB-A2C algorithm for a realistic setting where the transitions are sampled along a trajectory of the MDP.
To proceed, we need the ensuing assumptions on behavior to facilitate our analysis.
Assumption 1
For each state-action pair , the behavior policy satisfies
(15a)
(15b)
(15c)
where .
Assumption 1 is standard for the analysis of policy gradient based methods, see e.g., [16, 20, 10, 31, 28, 29, 30, 24].
The Lipschitz continuity assumption holds for canonical parametric policies, such as the Gaussian policy [33] and the Boltzmann policy [34].
Assumption 1 guarantees that the expected discounted reward has -Lipschitz continuous gradient as
(16)
where and . The detailed derivations are relegated to Lemma 6 in Appendix.
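For readers who want the practical consequence of a Lipschitz continuous gradient spelled out, the standard quadratic lower bound is recorded below. The symbols $J$ (expected discounted reward), $\theta_1, \theta_2$ (actor parameters), and $L_J$ (Lipschitz constant) are names assumed here for illustration and need not match the paper's notation.

```latex
\begin{align*}
\|\nabla J(\theta_1) - \nabla J(\theta_2)\| \le L_J \|\theta_1 - \theta_2\|
\;\Longrightarrow\;
J(\theta_2) \ge J(\theta_1)
  + \langle \nabla J(\theta_1),\, \theta_2 - \theta_1 \rangle
  - \frac{L_J}{2}\,\|\theta_2 - \theta_1\|^2 .
\end{align*}
```

Applied to consecutive actor iterates, a bound of this type is what turns a small one-frame drift of the actor into a guaranteed ascent of the objective up to a second-order error.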
Assumption 2
For each behavior , the induced Markov chain is ergodic and has a stationary distribution with , . Moreover, there exist constants and such that
(17)
where the total variation distance for the two probability measures and is defined as .
The first part of Assumption 2 (i.e., ergodicity) ensures that all states are visited an infinite number of times and the existence of a mixing time for the MDP.
The second part (i.e., the mixing time of the policy in (17)) guarantees that the optimal policy can be obtained from a single sample trajectory of the MDP.
It is worth remarking that Assumption 2 is a standard requirement for theoretical analysis of the RL algorithms; see e.g., [26, 31, 28, 15, 35].
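The geometric mixing behavior that a condition such as (17) postulates can be observed on a small example. The sketch below uses a hypothetical two-state chain and measures the total variation distance between the state distribution after each step and the stationary distribution. The factor 1/2 in `total_variation` is one common normalization and is an assumption here, as the paper's definition is not recoverable from this text.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions (1/2 L1 convention)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def step(dist, P):
    """Push a row distribution one step through the transition matrix P."""
    n = len(dist)
    return [sum(dist[s] * P[s][t] for s in range(n)) for t in range(n)]


# Hypothetical policy-induced chain and its stationary distribution.
P = [[0.9, 0.1], [0.2, 0.8]]
stationary = [2.0 / 3.0, 1.0 / 3.0]

dist = [1.0, 0.0]  # start deterministically in state 0
distances = []
for _ in range(10):
    distances.append(total_variation(dist, stationary))
    dist = step(dist, P)
```

For this chain the distance contracts by the second eigenvalue of `P`, which is 0.7, at every step, so a trajectory started anywhere becomes nearly stationary after a modest number of transitions.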
Before characterizing the convergence properties of the proposed HB-A2C algorithm, we start by establishing several lemmas.
Lemma 1
The (stochastic) semi-gradient of critic and (stochastic) policy gradient of actor are bounded as
Lemma 1 provides bounds for both the (stochastic) semi-gradient of the critic and the (stochastic) policy gradient of the actor that are useful for controlling (i.e., upper-bounding) the drifts of the critic and actor parameters as follows.
The recursion in (12a) can be recast as when . Therefore, the upper bound of is derived as
(19)
Based on the recursion in (12b) and (19), we obtain the one-frame drift of critic as
(20)
where the first inequality follows from the non-expansive property of the projection operation.
Based on (14) and Lemma 1, we obtain the one-frame drift of actor as
(21)
Based on Lemma 1 and (20), we can investigate the properties of the gradient bias term .
Lemma 2
Suppose Assumptions 1 and 2 hold.
When the length of each trajectory satisfies ,
the gradient bias satisfies
(22)
and
(23)
where , and denotes the filtration that contains all randomness prior to frame .
Lemma 2 implies that: 1) the one-frame drift of gradient bias with respect to can be confined by the critic stepsize ; and 2) the gradient bias does not rapidly increase with , which serves as one of the keys in developing our subsequent convergence of the critic sequence of the HB-A2C algorithm.
Based on (6), we observe that the optimal critic per frame is a function of .
Therefore, we are motivated to investigate the drift of optimal critic with respect to the actor parameters.
Lemma 3
When Assumption 1 is satisfied, the optimal critic parameter per frame satisfies
Lemma 3 shows that the drift of the optimal critic is controlled by the drift of the actor.
Before analyzing the convergence behavior of the actor recursion (14), we need to establish the Lipschitz continuity of the stochastic policy gradient with respect to the critic parameter .
Lemma 4
When Assumption 1 is satisfied, the stochastic policy gradient in (13) is Lipschitz with respect to as
Based on Lemma 4, we now present the convergence behavior of the policy gradient as follows.
Theorem 1
Suppose Assumptions 1 and 2 hold, and set .
When the minibatch size , the -step convergence of actor is
(27)
where .
Proof:
The finite-time convergence analysis of the actor starts from the fact that the expected discounted reward under state has the -Lipschitz continuous gradient.
Together with the recursion in (14) and the inequality (21), we have
(28)
where .
Based on the definitions of policy gradient with and the stochastic policy gradient , we have
Summing (30) over , we complete the proof by obtaining (27).
More detailed information can be found in Appendix E.
∎
We observe from Theorem 1 that the convergence behaviors of policy gradient and critic parameter are coupled.
Therefore, we need to investigate the convergence behavior of the critic parameter so that to establish a unified convergence of both actor and critic recursions.
Based on Lemmas 1–4, we can formally present the convergence of the critic update in (12) as follows.
Theorem 2
Suppose Assumptions 1 and 2 hold, and set .
When the minibatch size , the -step convergence of critic is
(31)
where and .
Proof:
The major challenge of analyzing the finite-time convergence of the critic comes from characterizing errors that are related to the gradient variance, optimality drift, and the gradient progress terms as
(32)
Since the proposed HB-A2C algorithm integrates the HB momentum into the critic update, our used techniques are different from [31, 11, 28] when bounding the gradient variance, optimality drift, and gradient progress in (32).
Step 1: Characterization of gradient variance.
Recalling that when .
Together with Lemmas 1 and 5, we can upper-bound the gradient variance in (32) as
(33)
Step 2: Characterization of optimality drift.
Based on Lemmas 3–5, we can upper-bound the optimality drift as
(34)
Substituting (33) and (34) into (32) and recalling the fact in (21), we obtain
(35)
Step 3: Characterization of gradient progress.
The most challenging part lies in analyzing the gradient progress, which needs to account for the HB momentum update of the critic.
More specifically, we can decompose the gradient progress term based on (12a) as
(36)
Following the Lipschitz continuity of the gradient bias in Lemma 2 and the optimal critic parameter in Lemma 3 as well as the recursion (12) and the decomposed gradient progress (36), we respectively obtain the lower and upper bound of as
Summing (38) over and recalling the fact , we obtain
(39)
Combining (35) and (39) and summing over , we complete the proof by obtaining (31).
The detailed derivations can be found in Appendix F.
∎
Based on Theorems 1 and 2, we observe that the convergence of the policy gradient and the critic parameter are coupled with each other.
By combining Theorems 1 and 2, we can now establish the unified convergence of the actor and critic recursions as follows.
Corollary 1
Suppose Assumptions 1 and 2 hold.
Set and with .
Let the minibatch size , the finite-time convergence rate of HB-A2C algorithm is
(40)
where and with the Lyapunov function as .
Proof:
Define the Lyapunov function as .
Summing (31) and (27) and dividing both sides by with
and
, we obtain (40).
∎
Corollary 1 characterizes the unified convergence of the actor and critic recursions with respect to the total number of frames . Based on Corollary 1, we observe that the proposed HB-A2C algorithm finds an -approximate stationary point with iterations for reinforcement learning tasks with Markovian noise.
In our proposed HB-A2C algorithm, the learning rates of the actor and critic recursions are of the same order. Furthermore, we observe from the term that the convergence rate is essentially controlled by the optimality drift term. Additionally, based on , we observe that increasing the momentum factor can trade off the error introduced by the initial actor and critic parameters for the error introduced by the biased gradient descent recursions. Our convergence rate of is tighter than those in [28, 31]. Compared to the finite-time results of the A2C algorithms in [28, 31], our error bounds in (40) hold for all , whereas those of [28, 31] become available only after a mixing time of updates.
Sampling the state as and the action as .
For the given actor parameter , there always exists a unique optimal critic parameter that satisfies with
and
.
Based on , we obtain
(56)
where and .
Based on (56), we obtain the Jacobian matrix as . Let two optimal critic parameters and satisfy and .
In order to derive the Lipschitz continuity of and bound of , we have the following inequalities based on Lemma 5 as
(57a)
(57b)
(57c)
(57d)
Then, we derive the Lipschitz continuity of as
(58a)
(58b)
(58c)
(58d)
(58e)
where .
Based on (57), the bound of the Jacobian matrix is derived as
Before establishing the convergence of the HB-A2C actor, we first introduce several auxiliary inequalities.
Based on the policy gradient with and the stochastic policy gradient , we have
Based on the recursion (12a), we have when .
Define , we start to analyze the convergence of the critic parameter by considering the following one-step drift
Substituting the expectation of (79) into the expectation of (78), we obtain
(80)
Based on (7), the left-hand side of (80) can be recast as
(81a)
(81b)
(81c)
(81d)
(81e)
where (81c) is based on the fact ,
and (81d) is based on (9).
Based on (7) and Lemma 1, we obtain the upper bound of as .
Together with the inequality (20) and the Lipschitz continuity in Lemma 3,
we derive the upper bounds for the three terms on the right-hand side of (81e) as
where is a predetermined positive constant, and (90d) follows the -Lipschitz of behavior policy.
Then, we start to analyze the total variation norm for as
(91a)
(91b)
(91c)
(91d)
where (91d) follows from (90) and -Lipschitz continuity of behavior.
∎
Lemma 6
When Assumption 1 holds, the overall reward has -Lipschitz continuous gradient.
Proof:
Based on Lemma 3, the optimal critic is -Lipschitz with respect to .
Together with similar arguments in [37, Lemma 3.2], there exists a positive constant such that . Therefore, we can apply the equivalent condition to in [38, Theorem 2.1.5] in order to obtain (16).
∎
References
[1]
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare,
A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al.,
“Human-level control through deep reinforcement learning,” Nature,
vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[2]
H. Ju, R. Juan, R. Gomez, K. Nakamura, and G. Li, “Transferring policy of deep
reinforcement learning from simulation to reality for robotics,”
Nature Machine Intelligence, vol. 4, no. 12, pp. 1077–1087, Dec.
2022.
[3]
R. Wu, Z. Yao, J. Si, and H. H. Huang, “Robotic knee tracking control to mimic
the intact human knee profile based on actor-critic reinforcement learning,”
IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 1, pp. 19–30, Jan.
2022.
[4]
Y. Ren, R. Xie, F. R. Yu, R. Zhang, Y. Wang, Y. He, and T. Huang, “Connected
and autonomous vehicles in web3: An intelligence-based reinforcement learning
approach,” IEEE Transactions on Intelligent Transportation Systems,
vol. 25, no. 8, pp. 9863–9877, Aug. 2024.
[5]
K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar, “Finite-sample analysis
for decentralized batch multiagent reinforcement learning with networked
agents,” IEEE Transactions on Automatic Control, vol. 66, no. 12, pp.
5925–5940, Dec. 2021.
[6]
Y. Li, Y. Tang, R. Zhang, and N. Li, “Distributed reinforcement learning for
decentralized linear quadratic control: A derivative-free policy optimization
approach,” IEEE Transactions on Automatic Control, vol. 67, no. 12,
pp. 6429–6444, Dec. 2022.
[7]
N. Li, X. Li, J. Peng, and Z. Q. Xu, “Stochastic linear quadratic optimal
control problem: A reinforcement learning method,” IEEE Transactions
on Automatic Control, vol. 67, no. 9, pp. 5009–5016, Sept. 2022.
[8]
R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient
methods for reinforcement learning with function approximation,” in
Advances in Neural Information Processing Systems, vol. 12, Denver,
CO, USA, Dec. 1999.
[9]
A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, “On the theory of policy
gradient methods: Optimality, approximation, and distribution shift,”
Journal of Machine Learning Research, vol. 22, no. 98, pp. 1–76, Feb.
2021.
[10]
L. Wang, Q. Cai, Z. Yang, and Z. Wang, “Neural policy gradient methods: Global
optimality and rates of convergence,” in International Conference on
Learning Representations, Addis Ababa, Ethiopia, Apr. 2020.
[11]
F. Huang, S. Gao, J. Pei, and H. Huang, “Momentum-based policy gradient
methods,” in International Conference on Machine Learning, vol. 119,
Vienna, Austria, July 2020, pp. 4422–4433.
[12]
L. Yang, Y. Zhang, G. Zheng, Q. Zheng, P. Li, J. Huang, and G. Pan, “Policy
optimization with stochastic mirror descent,” in AAAI Conference on
Artificial Intelligence, vol. 36, no. 8, Arlington, VA, USA, Nov. 2022, pp.
8823–8831.
[13]
G. Lan, “Policy mirror descent for reinforcement learning: Linear convergence,
new sampling complexity, and generalized problem classes,”
Mathematical Programming, vol. 198, no. 1, pp. 1059–1106, Mar. 2023.
[14]
J. Schulman et al., “Proximal policy optimization algorithms,”
arXiv preprint arXiv:1707.06347, Aug. 2017.
[15]
J. Bhandari, D. Russo, and R. Singal, “A finite time analysis of temporal
difference learning with linear function approximation,” in Conference
on Learning Theory, Stockholm, Sweden, July 2018, pp. 1691–1692.
[16]
S. Zou, T. Xu, and Y. Liang, “Finite-sample analysis for SARSA with linear
function approximation,” in Advances in Neural Information Processing
Systems, vol. 32, Vancouver, BC, Canada, Dec. 2019.
[17]
T. Sun, H. Shen, T. Chen, and D. Li, “Adaptive temporal difference learning
with linear function approximation,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 44, no. 12, pp. 8812–8824, Dec.
2022.
[18]
P. Xu and Q. Gu, “A finite-time analysis of Q-learning with neural
network function approximation,” in International Conference on
Machine Learning, vol. 119, Vienna, Austria, July 2020, pp.
10 555–10 565.
[19]
V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” in Advances in
Neural Information Processing Systems, vol. 12, Denver, CO, USA, Dec. 1999.
[20]
M. Hong, H.-T. Wai, Z. Wang, and Z. Yang, “A two-timescale stochastic
algorithm framework for bilevel optimization: Complexity analysis and
application to actor-critic,” SIAM Journal on Optimization, vol. 33,
no. 1, pp. 147–180, Jan. 2023.
[21]
R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction
(second edition). Cambridge, MA, USA:
The MIT Press, 2018.
[22]
B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song,
“SBEED: Convergent reinforcement learning with nonlinear function
approximation,” in International Conference on Machine Learning,
vol. 80, Stockholm, Sweden, July 2018, pp. 1125–1134.
[23]
S. Qiu, Z. Yang, J. Ye, and Z. Wang, “On finite-time convergence of
actor-critic algorithm,” IEEE Journal on Selected Areas in Information
Theory, vol. 2, no. 2, pp. 652–664, June 2021.
[24]
H. Kumar, A. Koppel, and A. Ribeiro, “On the sample complexity of actor-critic
method for reinforcement learning with function approximation,”
Machine Learning, vol. 112, no. 7, pp. 2433–2467, Feb. 2023.
[25]
S. Zhang, B. Liu, H. Yao, and S. Whiteson, “Provably convergent two-timescale
off-policy actor-critic with function approximation,” in International
Conference on Machine Learning, vol. 119, Vienna, Austria, July 2020, pp.
11 204–11 213.
[26]
S. Khodadadian, T. T. Doan, J. Romberg, and S. T. Maguluri, “Finite-sample
analysis of two-time-scale natural actor–critic algorithm,” IEEE
Transactions on Automatic Control, vol. 68, no. 6, pp. 3273–3284, June
2023.
[27]
T. Chen, Y. Sun, and W. Yin, “Closing the gap: Tighter analysis of alternating
stochastic gradient methods for bilevel problems,” in Advances in
Neural Information Processing Systems, vol. 34, Virtual, Dec. 2021, pp.
25 294–25 307.
[28]
Y. F. Wu, W. Zhang, P. Xu, and Q. Gu, “A finite-time analysis of two
time-scale actor-critic methods,” in Advances in Neural Information
Processing Systems, vol. 33, Virtual, Dec. 2020, pp. 17 617–17 628.
[29]
H. Shen and T. Chen, “A single-timescale analysis for stochastic approximation
with multiple coupled sequences,” in Advances in Neural Information
Processing Systems, vol. 35, New Orleans, LA, USA, Dec. 2022, pp.
17 415–17 429.
[30]
A. Olshevsky and B. Gharesifard, “A small gain analysis of single timescale
actor critic,” SIAM Journal on Control and Optimization, vol. 61,
no. 2, pp. 980–1007, Apr. 2023.
[31]
X. Chen and L. Zhao, “Finite-time analysis of single-timescale actor-critic,”
in Advances in Neural Information Processing Systems, vol. 36, New
Orleans, LA, USA, Dec. 2023, pp. 7017–7049.
[32]
Y. Duan and M. J. Wainwright, “Taming ‘data-hungry’ reinforcement learning?
stability in continuous state-action spaces,” arXiv preprint
arXiv:2401.05233, Jan. 2024.
[33]
K. Doya, “Reinforcement learning in continuous time and space,” Neural
Computation, vol. 12, no. 1, pp. 219–245, 2000.
[34]
V. R. Konda and V. S. Borkar, “Actor-critic–type learning algorithms for
Markov decision processes,” SIAM Journal on Control and Optimization,
vol. 38, no. 1, pp. 94–123, 1999.
[35]
D. A. Levin and Y. Peres, Markov Chains and Mixing Times. Providence, RI, USA: American Mathematical Society, 2017, vol. 107.
[36]
A. Y. Mitrophanov, “Sensitivity and convergence of uniformly ergodic Markov
chains,” Journal of Applied Probability, vol. 42, no. 4, pp.
1003–1014, Dec. 2005.
[37]
K. Zhang, A. Koppel, H. Zhu, and T. Başar, “Global convergence of policy
gradient methods to (almost) locally optimal policies,” SIAM Journal
on Control and Optimization, vol. 58, no. 6, pp. 3586–3612, Dec. 2020.
[38]
Y. Nesterov, Lectures on Convex Optimization (second edition). Cham, Switzerland: Springer, 2018, vol. 137.
Yanjie Dong (Member, IEEE) is an Associate Professor and the Assistant Dean of the Artificial Intelligence Research Institute, Shenzhen MSU-BIT University.
Dr. Dong obtained his Ph.D. and M.A.Sc. degrees from The University of British Columbia, Canada, in 2020 and 2016, respectively.
His research interests focus on the design and analysis of machine learning algorithms, machine learning based resource allocation algorithms, and quantum computing technologies.
Haijun Zhang (Fellow, IEEE) is a Professor at the University of Science and Technology Beijing, China. He was a postdoctoral research fellow in the Department of Electrical and Computer Engineering at The University of British Columbia, Canada.
He serves/served as an Editor of IEEE Transactions on Information Forensics and Security, IEEE Transactions on Communications, IEEE Transactions on Network Science and Engineering, and IEEE Transactions on Vehicular Technology. He received the IEEE CSIM Technical Committee Best Journal Paper Award in 2018, the IEEE ComSoc Young Author Best Paper Award in 2017, and the IEEE ComSoc Asia-Pacific Best Young Researcher Award in 2019. He is an IEEE ComSoc Distinguished Lecturer.
Gang Wang (Senior Member, IEEE) is a Professor with the School of Automation at the Beijing Institute of Technology.
Dr. Wang received a B.Eng. degree in 2011, and a Ph.D. degree in 2018, both from the Beijing Institute of Technology, Beijing, China.
He also received a Ph.D. degree from the University of Minnesota, Minneapolis, USA, in 2018, where he stayed as a postdoctoral researcher until July 2020.
His research interests focus on the areas of signal processing, control, and reinforcement learning with applications to cyber-physical systems and multi-agent systems.
He was the recipient of the Best Paper Award from the Frontiers of Information Technology & Electronic Engineering in 2021, the Excellent Doctoral Dissertation Award from the Chinese Association of Automation in 2019, and the Outstanding Editorial Board Member Award from the IEEE Signal Processing Society in 2023.
He serves as an Editor of Signal Processing and IEEE Transactions on Signal and Information Processing over Networks.
Shisheng Cui is a Professor with the School of Automation at the Beijing Institute of Technology.
Dr. Cui received the B.S. degree from Tsinghua University, Beijing, China, in 2009, the M.S. degree from Stanford University, Stanford, USA, in 2011, and the Ph.D. degree from Pennsylvania State University, University Park, USA, in 2019. His current research interests lie in optimization, variational inequality problems, and inclusion problems complicated by nonsmoothness and uncertainty.
Xiping Hu is currently a Professor with Shenzhen MSU-BIT University, and is also with Beijing Institute of Technology, China. Dr. Hu received the Ph.D. degree from the University of British Columbia, Vancouver, BC, Canada.
Dr. Hu is the co-founder and chief scientist of Erudite Education Group Limited, Hong Kong, a leading language learning mobile application company with over 100 million users, listed as a top-2 language education platform globally. His research interests include affective computing, mobile cyber-physical systems, crowdsensing, social networks, and cloud computing. He has published more than 150 papers in prestigious conferences and journals, such as IJCAI, AAAI, ACM MobiCom, WWW, and IEEE TPAMI/TMM/TVT/IoTJ/COMMAG.