Heavy-Ball Momentum Accelerated Actor-Critic With Function Approximation

Yanjie Dong, Haijun Zhang, Gang Wang, Shisheng Cui, and Xiping Hu

This work was supported by the National Natural Science Foundation of China under Grants 62102266, U23B2059, 62173034 and the Pearl River Talent Recruitment Program of Guangdong Province under Grant 2019ZT08X603. Y. Dong and X. Hu are with the Artificial Intelligence Research Institute and Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University, Shenzhen 518172, China. H. Zhang is with the Beijing Engineering and Technology Research Center for Convergence Networks and Ubiquitous Services, University of Science and Technology Beijing, Beijing, China. G. Wang and S. Cui are with the State Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology, Beijing 100081, China.

Abstract

By using a parametric value function to replace the Monte-Carlo rollouts for value estimation, actor-critic (AC) algorithms can reduce the variance of the stochastic policy gradient and thereby improve the convergence rate. While existing works mainly focus on analyzing the convergence rate of AC algorithms under Markovian noise, the impacts of momentum on AC algorithms remain largely unexplored. In this work, we first propose a heavy-ball momentum based advantage actor-critic (HB-A2C) algorithm by integrating the heavy-ball momentum into the critic recursion that is parameterized by a linear function. When the sample trajectory follows a Markov decision process, we quantitatively certify the acceleration capability of the proposed HB-A2C algorithm. Our theoretical results demonstrate that the proposed HB-A2C finds an $\epsilon$-approximate stationary point with ${\cal O}(\epsilon^{-2})$ iterations for reinforcement learning tasks with Markovian noise. Moreover, we also reveal the dependence of learning rates on the length of the sample trajectory. By carefully selecting the momentum factor of the critic recursion, the proposed HB-A2C can balance the errors introduced by the initialization and the stochastic approximation.

Index Terms:
Acceleration, actor-critic algorithms, heavy-ball momentum.

I Introduction

In model-free reinforcement learning (MFRL) algorithms, an agent optimizes a long-term cumulative reward (a.k.a., value function) by interacting with an unknown stochastic environment that can be articulated as a Markov decision process (MDP). When combined with the function approximators (e.g., linear approximators and neural networks), the MFRL algorithms have achieved human-level control and extraordinary empirical success in many domains, e.g., video games [1], robotic control [2, 3], autonomous vehicles [4, 5], and linear quadratic control tasks [6, 7].

The current MFRL algorithms can be classified into three categories, i.e., policy-based MFRL [8, 9, 10, 11, 12, 13, 14], value-based MFRL [15, 16, 17, 7, 18], and actor-critic MFRL algorithms [19, 20]. The policy-based MFRL algorithms aim at optimizing the behavior policy based on the policy gradient theorem [21]. When using a parametric policy, the policy-based MFRL can directly optimize the policy parameters via stochastic gradient descent (SGD) [8, 9, 10, 11]. However, the policy-based MFRL algorithms require access to the gradient of the value function with respect to a given policy. In practical scenarios where the transition kernel of the MDP is unknown, the policy gradients must be estimated from Monte-Carlo rollouts. Consequently, the policy-based MFRL algorithms often encounter significant variance in policy gradients and high sampling costs due to the stochastic approximation. Besides, the policy-based MFRL algorithms demand sufficiently small learning rates to guarantee stable convergence under function approximators. Therefore, the policy-based MFRL algorithms can suffer from slow convergence. While appropriate geometry engineering can improve the convergence [14, 12, 13], reducing the gradient variance of the policy-based MFRL algorithms remains in high demand.

The value-based MFRL algorithms recursively update the long-term cumulative rewards for different state-action pairs based on the Bellman equation and determine the policy based on the action-value function, e.g., Q-learning [18]. Moreover, SARSA can speed up the learning process of Q-learning by using the policy improvement operators [16]. By estimating the value of successor states via the bootstrapping operation, the value-based MFRL algorithms can efficiently converge to a satisfactory behavior policy based on the fixed-point recursions. Besides, the value-based MFRL can also be used to evaluate a behavior policy so as to track the future rewards of all states, e.g., temporal-difference (TD) learning algorithms [15]. When parameterizing the value function via myopic function approximators, the value-based MFRL algorithms can become unstable or diverge in environments with continuous state and/or action spaces. Therefore, extensive hyperparameter tuning can be required to obtain a stable behavior policy when using value-based MFRL algorithms. To handle the sample inefficiency and the divergence of the aforementioned MFRL algorithms, recent research aims at reducing the variance of the policy gradient by integrating the policy evaluation into the policy improvement, which leads to the actor-critic (AC) algorithms [19, 20, 22, 21]. More specifically, the AC algorithms are designed to use a critic recursion to estimate the value of the current policy and then apply an actor recursion to improve the behavior policy based on feedback from the critic [23].

The current AC algorithms can be categorized into double-loop AC algorithms and single-loop AC algorithms. In the double-loop setting, the critic is consecutively updated for several rounds to obtain an accurate value estimation before each actor recursion [23, 24, 5]. When the actor and critic recursions use different sample trajectories, the inner-loop policy evaluation can be decoupled from the outer-loop policy improvement [23, 24, 5]. Moreover, several different schemes for updating the critic sequence have been investigated in centralized topology [23, 24] and decentralized topology [5]. While mainly utilized for analytical convenience, the double-loop setting is seldom employed in practice due to the double-sampling requirement for the actor and critic recursions. Besides, it is unclear whether an accurate policy evaluation is necessary since it pertains to just one-step policy improvement.

For the single-loop AC algorithms, the actor and critic sequences are updated concurrently [25, 26]. The asymptotic convergence of the single-loop AC algorithms has been established from the perspective of ordinary differential equations, specifically when the ratio of the learning rates between the actor and critic approaches zero [25, 26]. While the asymptotic convergence of single-loop AC algorithms has been well investigated [25], the finite-time convergence analysis was unclear until recently [10, 27, 28, 29, 30, 26]. In the Big Data era, finite-time (or finite-sample) error bounds are preferred for characterizing the data efficiency of machine learning algorithms. For example, by confining the actor sequence to converge more slowly than the critic sequence, the finite-time analysis in [28] shows that the two-timescale AC algorithm attains a convergence rate of ${\cal O}(\nicefrac{1}{K^{0.4}})$. The convergence rate of the AC algorithms is sharpened to ${\cal O}(\nicefrac{1}{\sqrt{K}})$ when the variance of Markovian noise decays at the same rate as the convergence of the critic sequence [29]. The smoothness of the Hessian matrix for the parametric policy is also required to establish the finite-time convergence [29, 30]. Moreover, the finite-time convergence analysis in [30] is only suitable for discrete state-action spaces and requires non-trivial research effort to be extended to continuous state-action spaces. Using the same order of learning rates for the actor and critic sequences, the convergence rate of the single-loop AC algorithm is improved to ${\cal O}(\nicefrac{\log^{2}K}{\sqrt{K}})$ in [31].

Contributions. Different from [28, 29, 27, 30, 31], we aim to improve the convergence of AC algorithms by using momentum. More specifically, we introduce the heavy-ball (HB) momentum to the critic recursion and propose the heavy-ball based advantage actor-critic (HB-A2C) algorithm. Besides, the actor and critic recursions rely on a Markovian trajectory that is collected from a single MDP in an online manner. Our major contributions are summarized as follows.

  • For the MFRL tasks, we propose an HB-A2C algorithm that uses a $T$-step trajectory to update the actor and critic parameters.

  • We present a new analytical framework that can tightly characterize the estimation error introduced by the gradient bias and the optimality drift under Markovian noise when the heavy-ball based critic recursion is used. Compared with [31], our analytical framework can be adopted to characterize the impacts of HB momentum on the convergence. Moreover, our analytical framework demonstrates that the proposed HB-A2C algorithm converges at a rate of ${\cal O}(\nicefrac{1}{\sqrt{K}})$ without assuming the decaying variance of Markovian noise.

Notation: The filtration ${\cal F}_{k}$ contains all random variables before the start of frame $k$. The vector $w^{\dagger}$ denotes the transpose of $w$. For notational brevity, the parametric distribution $\mathds{P}(s_{k,t}\in\cdot|s_{k,0};v_{k})$ is denoted by ${\cal P}_{k,t}$.

II Preliminaries

II-A Problem description

We consider an MDP that is described by a quintuple $({\cal S},{\cal A},\mathds{P},r,\gamma)$, where ${\cal A}$ is the continuous action space, ${\cal S}$ is the continuous state space, $\mathds{P}$ is the unknown transition kernel that maps each state-action pair $(s,a)\in{\cal S}\times{\cal A}$ to a distribution $\mathds{P}(\cdot|s,a)$ over the state space ${\cal S}$, $r:{\cal S}\times{\cal A}\rightarrow[-R_{r},R_{r}]$ specifies the bounded reward for state-action pair $(s,a)$, and $\gamma$ is the discount factor.

A policy $\pi$ maps state $s$ to a distribution $\pi(\cdot|s)$ over the action space ${\cal A}$. To evaluate the expected discounted reward starting from a state $s_{0}$ under the policy $\pi$, the value function is defined as

$$V(s)=\mathds{E}\Big[\sum_{k=0}^{\infty}\gamma^{k}r(s_{k},a_{k})\Big|s_{0}=s\Big] \tag{1}$$

where each action $a_{k}$ follows the policy $\pi(\cdot|s_{k})$, and the successor state $s_{k+1}\sim\mathds{P}(\cdot|s_{k},a_{k})$.
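
As a minimal illustration of definition (1), the sketch below computes the discounted return of a finite reward sequence, which is what a single truncated Monte-Carlo rollout would contribute to an estimate of $V(s)$. The function name `discounted_return` and the truncation to a finite horizon are assumptions made for illustration; (1) itself is an infinite-horizon expectation.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rollout of three unit rewards with discount factor 0.5:
# 1 + 0.5 * 1 + 0.25 * 1 = 1.75
rollout_return = discounted_return([1.0, 1.0, 1.0], 0.5)
print(rollout_return)
```

Averaging such returns over many rollouts from the same start state gives the Monte-Carlo value estimate whose variance the critic is meant to reduce.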

Given a policy $\pi$, the value function (1) satisfies the Bellman equation [15, 21, 32]

$$V(s)=\mathds{E}[r(s,a)+\gamma^{T}V(s^{\prime})] \tag{2}$$

where the expectation is taken over the action $a\sim\pi(\cdot|s)$ and the successor state $s^{\prime}\sim\mathds{P}(\cdot|s,a)$.

The objective is to estimate the optimal policy $\pi^{*}$ that maximizes the expected discounted reward $J(\pi)$, i.e.,

$$\pi^{*}\in\arg\max J(\pi):=(1-\gamma)\mathds{E}[V(s)]. \tag{3}$$

II-B Function approximation

When considering continuous state and action spaces, obtaining the optimal policy $\pi^{*}$ becomes computationally burdensome or even intractable due to the notorious curse of dimensionality (CoD). One popular way to handle the CoD issue is to approximate each policy $\pi_{k}$ and the value function $V(s)$ by a neural network and a linear-function approximator, respectively. In this work, the policy $\pi(a|s)$ and the value function $V(s)$ are respectively parameterized by the actor parameter $v\in\mathbb{R}^{d_{v}}$ and the critic parameter $w\in\mathbb{R}^{d_{w}}$. More specifically, the parametric policy is denoted by $\pi(a|s)=\pi_{v}(a|s)$, and the parametric value function is denoted by $V(s)\approx V_{w}(s)=\phi^{\dagger}(s)w$ with $\|w\|\leq R_{w}$ and the feature embedding $\phi(s)$ satisfying $\|\phi(s)\|\leq 1$, $s\in{\cal S}$. Note that the optimal value $V^{*}(s)\equiv V_{w^{*}}(s)$ when the radius $R_{w}$ is sufficiently large [15].
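
The linear parameterization above can be sketched in a few lines of NumPy. The helper names `normalize_feature`, `project_to_ball`, and `linear_value` are hypothetical, and enforcing $\|w\|\leq R_{w}$ by Euclidean projection is an assumption made for illustration rather than a step stated in the text.

```python
import numpy as np


def normalize_feature(raw_feature):
    """Scale a raw feature vector so that its Euclidean norm is at most 1."""
    norm = np.linalg.norm(raw_feature)
    return raw_feature if norm <= 1.0 else raw_feature / norm


def project_to_ball(w, radius):
    """Euclidean projection of w onto the ball {w : ||w|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)


def linear_value(phi_s, w):
    """Linear value estimate V_w(s) = phi(s)^T w."""
    return float(phi_s @ w)


# Hypothetical two-dimensional example.
phi_s = normalize_feature(np.array([3.0, 4.0]))          # becomes [0.6, 0.8]
w = project_to_ball(np.array([6.0, 8.0]), radius=5.0)    # becomes [3.0, 4.0]
value = linear_value(phi_s, w)                           # approximately 5.0
```

With $\|\phi(s)\|\leq 1$ and $\|w\|\leq R_{w}$, the Cauchy-Schwarz inequality gives $|V_{w}(s)|\leq R_{w}$, so the parametric value stays bounded.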

Based on the parametric policy $\pi_{v}$, the parametric value function $V_{w}$, and the Bellman equation (2), we can recast the objective in (3) as a bilevel optimization that optimizes the actor parameter $v$ in the outer problem and the critic parameter $w$ in the inner problem as

$$\begin{split}v^{*}=\arg\max_{v}&~J(v)\\ \mbox{s.t.}&~w^{*}=\arg\min_{\|w\|\leq R_{w}}\frac{1}{2}\mathds{E}[\|V_{w}^{\mbox{\tiny tar}}(s)-V_{w}(s)\|^{2}]\end{split} \tag{4}$$

where $J(v):=\mathds{E}[J(v,w^{*};s)]$ with $J(v,w^{*};s):=(1-\gamma)V_{w^{*}}(s)$, and $V_{w}^{\mbox{\tiny tar}}(s)$ is the target value for state $s$. The target value $V_{w}^{\mbox{\tiny tar}}(s)$ can be estimated by the one-step (or multi-step) bootstrapping.

Remark 1

According to the inner problem of (4), the optimal critic parameter $w^{*}$ is essentially a function of the actor parameter $v$. Therefore, the only optimization variable is the actor parameter $v$ in the outer problem of (4).

III Heavy-Ball Based Actor-Critic for RL Tasks

III-A Algorithm development

We consider a fully data-driven technique that maintains a running estimate of the value function (cf. the inner problem in (4)) while performing policy updates based on the estimated state values (cf. the outer problem in (4)). A multi-step bootstrapping is employed to estimate the target value $V_{w}^{\text{tar}}(s)$. One of the merits of multi-step bootstrapping is the ability to balance bias and variance during the estimation of the target value. Furthermore, as we will justify later, the multi-step bootstrapping allows for a larger learning rate when solving the inner problem of (4) using recursive updates, thereby reducing the number of recursions required for the critic parameter. Consequently, we consider the MDP to operate on two timescales, where each coarse-grain slot (i.e., frame) consists of $T$ fine-grain slots (i.e., $T$ steps). For notational brevity, we abbreviate the reward $r(s_{k,t},a_{k,t})$, the feature embedding $\phi(s_{k,t})$, the policy $\pi_{v_{k}}$, and the optimal critic $w^{*}_{v_{k}}$ as $r_{k,t}:=r(s_{k,t},a_{k,t})$, $\phi_{k,t}:=\phi(s_{k,t})$, $\pi_{k}:=\pi_{v_{k}}$, and $w^{*}_{k}:=w^{*}_{v_{k}}$, respectively.

Inner optimization. For the inner optimization, the critic parameter $w$ can be updated via the stochastic semi-gradient

$$g(w_{k};O_{k})=[V_{w_{k}}(s_{k,0})-V_{w_{k}}^{\mbox{\tiny tar}}(s_{k,0})]\nabla V_{w_{k}}(s_{k,0}) \tag{5}$$

where the parametric value $V_{w_{k}}(s_{k,0})=\phi^{\dagger}_{k,0}w_{k}$; the target value $V_{w_{k}}^{\mbox{\tiny tar}}(s_{k,0})$ is estimated by a $T$-step bootstrapping as $V_{w_{k}}^{\mbox{\tiny tar}}(s_{k,0})=\sum_{t=0}^{T-1}\gamma^{t}r_{k,t}+\gamma^{T}\phi^{\dagger}_{k,T}w_{k}$; $O_{k}=[o_{k,t}]_{t=0}^{T-1}$ is the $T$-step trajectory; and the observation $o_{k,t}=(s_{k,t},a_{k,t},s_{k,t+1})$ follows the distribution ${\cal P}_{k,t}\otimes\pi_{k}\otimes\mathds{P}$ with ${\cal P}_{k,t}:=\mathds{P}(s_{k,t}\in\cdot|s_{k,0};v_{k})$ as the $t$-step transition kernel.
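
The $T$-step bootstrapped target can be sketched as follows. The function name `t_step_target` and the toy numbers are hypothetical; the code only assumes that the $T$ rewards of one frame and the feature vector of the state reached after $T$ steps are available as arrays.

```python
import numpy as np


def t_step_target(rewards, phi_T, w, gamma):
    """T-step bootstrapped target: sum_{t<T} gamma^t r_t + gamma^T phi_T^T w."""
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)
    discounts = gamma ** np.arange(T)          # [1, gamma, ..., gamma^(T-1)]
    bootstrap = gamma**T * float(phi_T @ w)    # value estimate of the T-th state
    return float(discounts @ rewards) + bootstrap


# Hypothetical frame with T = 2: 1 + 0.5 * 2 + 0.25 * 4 = 3.0
target = t_step_target([1.0, 2.0], np.array([1.0, 0.0]), np.array([4.0, 0.0]), 0.5)
```

A larger $T$ puts more weight on observed rewards and less on the bootstrapped estimate, since the bootstrap term is scaled by $\gamma^{T}$.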

The compact form of the stochastic semi-gradient is

$$g(w_{k};O_{k})=\Phi_{k}w_{k}-b_{k} \tag{6}$$

where $\Phi_{k}=\phi_{k,0}[\phi_{k,0}-\gamma^{T}\phi_{k,T}]^{\dagger}$ and $b_{k}=\phi_{k,0}\sum_{t=0}^{T-1}\gamma^{t}r_{k,t}$.
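
As a sanity check on the compact form (6), the sketch below builds $\Phi_{k}$ and $b_{k}$ from one frame and compares $\Phi_{k}w_{k}-b_{k}$ with the scalar form $[V_{w_{k}}(s_{k,0})-V_{w_{k}}^{\mbox{\tiny tar}}(s_{k,0})]\phi_{k,0}$, using $\nabla V_{w_{k}}(s_{k,0})=\phi_{k,0}$ for the linear critic. The function name `compact_semi_gradient` and all numerical values are hypothetical.

```python
import numpy as np


def compact_semi_gradient(phi_0, phi_T, rewards, w, gamma):
    """Return (g, Phi, b) with g = Phi @ w - b as in the compact form (6)."""
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)
    discounted_reward = float((gamma ** np.arange(T)) @ rewards)
    # Phi_k = phi_{k,0} [phi_{k,0} - gamma^T phi_{k,T}]^dagger, a d_w x d_w matrix.
    Phi = np.outer(phi_0, phi_0 - gamma**T * phi_T)
    # b_k = phi_{k,0} * sum_{t=0}^{T-1} gamma^t r_{k,t}
    b = phi_0 * discounted_reward
    return Phi @ w - b, Phi, b


# Hypothetical two-step frame (T = 2) with made-up features and rewards.
phi_0 = np.array([0.6, 0.8])
phi_T = np.array([1.0, 0.0])
rewards = [1.0, 2.0]
w = np.array([1.0, -1.0])
gamma = 0.5

g, Phi, b = compact_semi_gradient(phi_0, phi_T, rewards, w, gamma)

# Scalar form: (V_w(s_0) - target) * phi_0 with the T-step bootstrapped target.
target = 1.0 + gamma * 2.0 + gamma**2 * float(phi_T @ w)
g_scalar_form = (float(phi_0 @ w) - target) * phi_0
```

The two vectors coincide, which reflects that (6) is affine in $w_{k}$ with a rank-one matrix $\Phi_{k}$ built from a single frame.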

The stochastic semi-gradient (6) equals the sum of the full semi-gradient and the gradient bias as

$$g(w_{k};O_{k})=\mathds{E}[g(w_{k};\bar{O}_{k})]+\zeta(v_{k},w_{k};O_{k}) \tag{7}$$

where the gradient bias is $\zeta(v_{k},w_{k};O_{k}):=\Phi_{k}w_{k}-b_{k}-\mathds{E}[g(w_{k};\bar{O}_{k})]$, and the full semi-gradient is $\mathds{E}[g(w_{k};\bar{O}_{k})]=\mathds{E}[\bar{\Phi}_{k}w_{k}-\bar{b}_{k}]$ with $\bar{\Phi}_{k}=\bar{\phi}_{k,0}[\bar{\phi}_{k,0}-\gamma^{T}\bar{\phi}_{k,T}]^{\dagger}$ and $\bar{b}_{k}=\bar{\phi}_{k,0}\sum_{t=0}^{T-1}\gamma^{t}\bar{r}_{k,t}$. Denoting the $\pi_{k}$-induced stationary distribution by $\mu_{k}$, the $T$-step sample trajectory $\bar{O}_{k}$ is obtained from the stationary distribution $\mu_{k}\otimes\pi_{k}\otimes\mathds{P}$.

Note that, given the actor parameter $v_k$, the optimal critic parameter $w_k^*$ satisfies $\mathbb{E}[g(w_k^*;\bar{O}_k)] = 0$. Together with the full semi-gradient $\mathbb{E}[g(w_k;\bar{O}_k)] = \mathbb{E}[\bar{\Phi}_k w_k - \bar{b}_k]$, we obtain

$$\mathbb{E}[g(w_k;\bar{O}_k)] - \mathbb{E}[g(w_k^*;\bar{O}_k)] = \mathbb{E}[\bar{\Phi}_k(w_k - w_k^*)]. \tag{8}$$

Recalling the definitions of $\bar{\Phi}_k$ and $V_{w_k}(s_{k,0})$ and setting $\varepsilon_1 = V_{w_k}(\bar{s}_{k,0}) - V_{w_k^*}(\bar{s}_{k,0})$ and $\varepsilon_2 = V_{w_k}(\bar{s}_{k,T}) - V_{w_k^*}(\bar{s}_{k,T})$, we have $\mathbb{E}[\bar{\Phi}_k(w_k - w_k^*)] = \mathbb{E}[\bar{\phi}_{k,0}(\varepsilon_1 - \gamma^{T}\varepsilon_2)]$. Based on [15, Lemma 3], we obtain

$$\begin{split}\langle w_k - w_k^*, \mathbb{E}[\bar{\Phi}_k(w_k - w_k^*)]\rangle &= \mathbb{E}[\varepsilon_1^2] - \gamma^{T}\mathbb{E}[\varepsilon_1\varepsilon_2] \\ &\geq \sigma\|w_k - w_k^*\|^2\end{split} \tag{9}$$

where $\sigma := (1-\gamma^{T})\lambda$ with $\lambda > 0$ as the smallest eigenvalue of the matrix $\int_{\mu_k(s)}\phi(s)\phi^{\dagger}(s)\mu_k(s)\,ds$. When the redundant or irrelevant features are removed, this matrix is positive-definite. Since each feature embedding $\phi(s)$ satisfies $\|\phi(s)\| \leq 1$, the smallest eigenvalue satisfies $\lambda \in (0,1)$ [15].

Remark 2

The inequality in (9) can be viewed as a strong monotonicity property of the full semi-gradient $\mathbb{E}[g(w_k;\bar{O}_k)]$. Based on (9), we observe that a longer trajectory $\bar{O}_k$ results in a larger strong monotonicity constant $\sigma$ for the inner problem in (4), thereby allowing a higher learning rate that reduces the number of training recursions required for the critic parameter.
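As a small numerical illustration of how the trajectory length enters the constant in (9), the sketch below evaluates $\sigma = (1-\gamma^{T})\lambda$ for a finite state space. The feature matrix, the stationary distribution, and the function name `strong_monotonicity_constant` are hypothetical toy choices introduced here, not quantities from the paper; the integral is replaced by a weighted sum over states.

```python
import numpy as np

def strong_monotonicity_constant(features, mu, gamma, T):
    """Return (sigma, lam) with sigma = (1 - gamma**T) * lam, where lam is the
    smallest eigenvalue of sum_s mu(s) * phi(s) phi(s)^T (finite-state analogue)."""
    features = np.asarray(features, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Weighted second-moment matrix of the features under mu.
    cov = features.T @ (mu[:, None] * features)
    lam = np.linalg.eigvalsh(cov)[0]  # eigvalsh returns eigenvalues in ascending order
    return (1.0 - gamma**T) * lam, lam

# Hypothetical toy setup: 3 states, 2 features, each ||phi(s)|| <= 1.
features = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
mu = np.array([0.5, 0.3, 0.2])
gamma = 0.9

lam = strong_monotonicity_constant(features, mu, gamma, T=1)[1]
sigmas = [strong_monotonicity_constant(features, mu, gamma, T)[0] for T in (1, 5, 20)]
```

For these toy numbers the entries of `sigmas` increase with `T` and approach `lam` from below, which matches the qualitative statement that longer trajectories give a larger $\sigma$.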

Outer optimization. The policy gradient theorem [8] provides an analytical expression for the gradient of the outer objective in (4). In the context of the two-timescale framework, the policy gradient of $J(v_k)$ is defined as $\nabla J(v_k) = \mathbb{E}[\nabla J(v_k,w_k^*;s_{k,0})]$. Moreover, the policy gradient of frame $k$ is given by

$$\nabla J(v_k,w_k^*;s_{k,0}) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}[h(v_k,w_k^*;o_{k,t})] \tag{10}$$

where the expectation is taken over each observation $o_{k,t} \sim \mathcal{P}_{k,t} \otimes \pi_k \otimes \mathbb{P}$, and the function $h(v_k,w_k^*;o_{k,t})$ is defined as

$$h(v_k,w_k^*;o_{k,t}) := \big[r_{k,t} + [\gamma\phi_{k,t+1} - \phi_{k,t}]^{\dagger}w_k^*\big]\nabla\log\pi_k(a_{k,t}|s_{k,t}) \tag{11}$$

where $w_k^*$ is the optimal value parameter under the policy $\pi_k$.

Based on (6) and (10), vanilla SGD can be used to search for the optimal actor parameter $v^*$ and the optimal critic parameter $w^*$. However, vanilla SGD can suffer from slow convergence. Therefore, we leverage the HB momentum to improve the convergence rate and propose an HB-based advantage actor-critic (HB-A2C) algorithm in Algorithm 1.

Algorithm 1 HB-A2C Algorithm
1:Initialization: critic hyper-parameters: stepsize $\beta$, parameter $w_0$, momentum factor $\eta_1$, and momentum parameter $n_{-1} = 0$; actor hyper-parameters: stepsize $\alpha$ and parameter $v_0$
2:for $k = 0, 1, \ldots, K-1$ do
3:    Roll out $T$-step observations $O_k = [o_{k,t}]_{t=0}^{T-1}$ via the behavior policy $\pi_k$
4:    Update the critic parameter based on (6) as
$$n_k = (1-\eta_1)n_{k-1} + \eta_1 g(w_k;O_k) \tag{12a}$$
$$w_{k+1} = \operatorname{proj}_{R_w}[w_k - \beta n_k] \tag{12b}$$
where the operator $\operatorname{proj}_{R_w}[\cdot]$ denotes the projection onto the region $\|w\| \leq R_w$
5:    Calculate the stochastic policy gradient based on (11) as
$$H(v_k,w_k;O_k) = (1-\gamma)\sum_{t=0}^{T-1}\gamma^{t}h(v_k,w_k;o_{k,t}) \tag{13}$$
6:    Update the actor parameter based on (13) as
$$v_{k+1} = v_k + \alpha H(v_k,w_k;O_k) \tag{14}$$
7:end for
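The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the tabular softmax policy, the toy two-state MDP (`P_toy`, `R_toy`, `phi_toy`), the default hyper-parameter values, and all function names are hypothetical. The semi-gradient is taken as $g(w_k;O_k) = \Phi_k w_k - b_k$, the sample counterpart of the full semi-gradient defined above.

```python
import numpy as np

def softmax_policy(v, s):
    """Action probabilities pi_v(.|s) for a tabular softmax (Boltzmann) policy."""
    z = v[s] - v[s].max()  # shift logits for numerical stability
    p = np.exp(z)
    return p / p.sum()

def grad_log_pi(v, s, a):
    """Gradient of log pi_v(a|s) with respect to the full parameter table v."""
    grad = np.zeros_like(v)
    grad[s] = -softmax_policy(v, s)
    grad[s, a] += 1.0
    return grad

def project_ball(w, radius):
    """Euclidean projection onto the ball ||w|| <= radius, as in (12b)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def hb_a2c(P, R, phi, gamma=0.9, T=8, K=200, alpha=0.05, beta=0.1,
           eta1=0.3, R_w=10.0, seed=0):
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    v = np.zeros((n_states, n_actions))  # actor parameter v_0
    w = np.zeros(phi.shape[1])           # critic parameter w_0
    n = np.zeros_like(w)                 # momentum parameter n_{-1} = 0
    s = 0                                # the state persists across frames
    for _ in range(K):
        # Step 3: roll out T transitions under the current policy.
        traj = []
        for _ in range(T):
            a = rng.choice(n_actions, p=softmax_policy(v, s))
            s_next = rng.choice(n_states, p=P[s, a])
            traj.append((s, a, R[s, a], s_next))
            s = s_next
        # T-step semi-gradient g(w_k; O_k) = Phi_k w_k - b_k.
        phi0, phiT = phi[traj[0][0]], phi[traj[-1][3]]
        ret = sum(gamma**t * r for t, (_, _, r, _) in enumerate(traj))
        g = phi0 * ((phi0 - gamma**T * phiT) @ w - ret)
        # Step 5: stochastic policy gradient (13), evaluated at w_k (before the critic update).
        H = np.zeros_like(v)
        for t, (st, at, rt, sn) in enumerate(traj):
            td = rt + (gamma * phi[sn] - phi[st]) @ w
            H += (1 - gamma) * gamma**t * td * grad_log_pi(v, st, at)
        # Step 4: heavy-ball critic update (12a)-(12b).
        n = (1 - eta1) * n + eta1 * g
        w = project_ball(w - beta * n, R_w)
        # Step 6: actor update (14).
        v = v + alpha * H
    return v, w

# Hypothetical toy MDP: 2 states, 2 actions, one-hot features with ||phi(s)|| = 1.
P_toy = np.array([[[0.8, 0.2], [0.3, 0.7]],
                  [[0.5, 0.5], [0.1, 0.9]]])
R_toy = np.array([[0.0, 1.0], [0.5, 0.2]])
phi_toy = np.eye(2)
v_out, w_out = hb_a2c(P_toy, R_toy, phi_toy, K=50)
```

The policy gradient is computed before `w` is overwritten because (13) is evaluated at $w_k$, while the critic step produces $w_{k+1}$.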

III-B Convergence analysis

Hereinafter, our goal is to analyze the convergence rate of the proposed HB-A2C algorithm in a realistic setting where the transitions are sampled along a trajectory of the MDP. To proceed, we need the ensuing assumptions on the behavior policy to facilitate our analysis.

Assumption 1

For each state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$, the behavior policy satisfies

$$\|\nabla\log\pi_v(a|s)\| \leq R_{\pi} \tag{15a}$$
$$\|\nabla\log\pi_v(a|s) - \nabla\log\pi_{v'}(a|s)\| \leq L_{\pi}'\|v - v'\| \tag{15b}$$
$$|\pi_v(a|s) - \pi_{v'}(a|s)| \leq L_{\pi}\|v - v'\| \tag{15c}$$

where $v, v' \in \mathbb{R}^{d_v}$.

Assumption 1 is standard for the analysis of policy gradient based methods; see, e.g., [16, 20, 10, 31, 28, 29, 30, 24]. The Lipschitz continuity assumption holds for canonical parametric policies, such as the Gaussian policy [33] and the Boltzmann policy [34]. Assumption 1 guarantees that the expected discounted reward $J(v)$ has an $L$-Lipschitz continuous gradient, that is,

$$\langle\nabla J(v), v' - v\rangle \leq J(v') - J(v) + \frac{1}{2}L\|v' - v\|^2 \tag{16}$$

where $v \in \mathbb{R}^{d_v}$ and $v' \in \mathbb{R}^{d_v}$. The detailed derivations are relegated to Lemma 6 in the Appendix.
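To make the role of (16) concrete, the following worked step substitutes the actor update (14) into (16) with $v = v_k$ and $v' = v_{k+1}$. It is a sketch of how such a smoothness inequality is commonly used, and not necessarily the exact route taken in the proofs of this paper.

```latex
\begin{aligned}
J(v_{k+1}) &\geq J(v_k) + \langle \nabla J(v_k),\, v_{k+1} - v_k \rangle
              - \tfrac{L}{2}\,\|v_{k+1} - v_k\|^2 \\
           &= J(v_k) + \alpha\,\langle \nabla J(v_k),\, H(v_k, w_k; O_k) \rangle
              - \tfrac{L}{2}\,\alpha^2\,\|H(v_k, w_k; O_k)\|^2 .
\end{aligned}
```

The first-order term rewards alignment between the stochastic policy gradient and $\nabla J(v_k)$, while the quadratic term penalizes large steps and scales with $\alpha^2$.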

Assumption 2

For each behavior $\pi_k$, the induced Markov chain is ergodic and has a stationary distribution $\mu_k$ with $\mu_k(s) > 0$ for all $s \in \mathcal{S}$. Moreover, there exist constants $c_0$ and $\rho \in (0,1)$ such that

$$\|\mathcal{P}_{k,t} - \mu_k\|_{\text{TV}} \leq c_0\rho^{t} \tag{17}$$

where the total variation distance between the two probability measures $\mathcal{P}_{k,t}$ and $\mu_k$ is defined as $\|\mathcal{P}_{k,t} - \mu_k\|_{\text{TV}} = \int_{s}|\mathcal{P}_{k,t}(s) - \mu_k(s)|\,ds$.

The first part of Assumption 2 (i.e., ergodicity) ensures that all states are visited infinitely often and that the MDP has a mixing time. The second part (i.e., the mixing-time condition in (17)) guarantees that the optimal policy can be obtained from a single sample trajectory of the MDP. It is worth remarking that Assumption 2 is a standard requirement in the theoretical analysis of RL algorithms; see, e.g., [26, 31, 28, 15, 35].
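The geometric mixing condition in (17) can be checked numerically on a small chain. The sketch below uses a hypothetical two-state transition matrix `P_chain` (not from the paper) and the total variation distance as defined in the text, i.e., the sum of absolute differences without a factor of one half.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue 1, normalized to sum to one."""
    evals, evecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(evals - 1.0))
    mu = np.real(evecs[:, idx])
    return mu / mu.sum()

def tv_distances(P, p0, steps):
    """Distances ||P_t - mu||_TV for t = 0, ..., steps - 1, starting from p0."""
    mu = stationary_distribution(P)
    p = np.asarray(p0, dtype=float)
    out = []
    for _ in range(steps):
        out.append(np.abs(p - mu).sum())
        p = p @ P  # propagate the state distribution by one step
    return np.array(out)

# Hypothetical ergodic chain; its second eigenvalue is 0.7.
P_chain = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
mu_chain = stationary_distribution(P_chain)
d = tv_distances(P_chain, [1.0, 0.0], steps=10)
```

For this chain the distances decay exactly geometrically, `d[t] = (2/3) * 0.7**t`, so (17) holds with $c_0 = 2/3$ and $\rho = 0.7$.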

Before characterizing the convergence properties of the proposed HB-A2C algorithm, we start by establishing several lemmas.

Lemma 1

The (stochastic) semi-gradient of the critic and the (stochastic) policy gradient of the actor are bounded as

$$\|g(w_k;O_k)\| \leq R_g, \quad \|\mathbb{E}[g(w_k;\bar{O}_k)]\| \leq R_g \tag{18a}$$
$$\|H(v_k,w_k;O_k)\| \leq R_h, \quad \|\nabla J(v_k)\| \leq R_h \tag{18b}$$

where $R_g = (1+\gamma^{T})R_w + c_1(\gamma)R_r$ and $R_h = R_{\pi}[R_r + (1+\gamma)R_w]$ with $c_1(\gamma) = (1-\gamma^{T})/(1-\gamma)$.

Proof:

See Appendix A. ∎
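As a sanity check on the constants in Lemma 1, the sketch below computes $R_g$ and $R_h$ and compares $R_g$ against randomly drawn semi-gradients. It assumes that $R_r$ bounds the absolute reward, that features satisfy $\|\phi(s)\| \leq 1$, and that $\|w\| \leq R_w$; the numerical values, the random draws, and the helper names are hypothetical.

```python
import numpy as np

def gradient_bounds(gamma, T, R_w, R_r, R_pi):
    """Constants R_g and R_h as stated in Lemma 1."""
    c1 = (1 - gamma**T) / (1 - gamma)
    R_g = (1 + gamma**T) * R_w + c1 * R_r
    R_h = R_pi * (R_r + (1 + gamma) * R_w)
    return R_g, R_h

def semi_gradient(w, phi0, phiT, rewards, gamma):
    """T-step semi-gradient Phi w - b for one trajectory."""
    T = len(rewards)
    ret = sum(gamma**t * r for t, r in enumerate(rewards))
    return phi0 * ((phi0 - gamma**T * phiT) @ w - ret)

rng = np.random.default_rng(1)
gamma, T, R_w, R_r, R_pi, dim = 0.9, 8, 10.0, 1.0, 1.0, 5
R_g, R_h = gradient_bounds(gamma, T, R_w, R_r, R_pi)

def in_unit_ball():
    """Random vector with Euclidean norm at most one."""
    x = rng.normal(size=dim)
    return x / max(1.0, np.linalg.norm(x))

worst = 0.0
for _ in range(1000):
    w = R_w * in_unit_ball()                      # ||w|| <= R_w
    rewards = rng.uniform(-R_r, R_r, size=T)      # |r| <= R_r
    g = semi_gradient(w, in_unit_ball(), in_unit_ball(), rewards, gamma)
    worst = max(worst, np.linalg.norm(g))
```

With these toy values both constants equal 20, and the largest sampled norm `worst` stays below `R_g`, as the triangle inequality behind (18a) suggests.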

Lemma 1 provides bounds for both the (stochastic) semi-gradient of the critic and the (stochastic) policy gradient of the actor that are useful for controlling (i.e., upper-bounding) the drifts of the critic and actor parameters as follows.

The recursion in (12a) can be recast as $n_k = \eta_1\sum_{\tau=0}^{k}(1-\eta_1)^{k-\tau}g(w_\tau;O_\tau)$ when $n_{-1} = 0$. Therefore, the upper bound of $\|n_k\|$ is derived as

$$\|n_k\| = \Big\|\eta_1\sum_{\tau=0}^{k}(1-\eta_1)^{k-\tau}g(w_\tau;O_\tau)\Big\| \leq R_g. \tag{19}$$

Based on the recursion in (12b) and (19), we obtain the one-frame drift of the critic as

$$\|w_{k+1} - w_k\| \leq \|\beta n_k\| \leq R_g\beta \tag{20}$$

where the first inequality follows from the non-expansive property of the projection operation.
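The unrolled form of the momentum recursion and the bound in (19) can be verified numerically. In the sketch below the semi-gradients are replaced by hypothetical random vectors whose norms are at most $R_g$; the values of `eta1`, `R_g`, and the dimensions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
eta1, R_g, K, dim = 0.3, 2.0, 50, 4

# Hypothetical stand-ins for g(w_tau; O_tau), each with norm at most R_g.
g = rng.normal(size=(K, dim))
g *= R_g * rng.uniform(size=(K, 1)) / np.linalg.norm(g, axis=1, keepdims=True)

# Recursion (12a) starting from n_{-1} = 0.
n = np.zeros(dim)
recursive = []
for k in range(K):
    n = (1 - eta1) * n + eta1 * g[k]
    recursive.append(n.copy())
recursive = np.array(recursive)

# Closed form: n_k = eta1 * sum_{tau=0}^{k} (1 - eta1)^(k - tau) * g_tau.
closed = np.array([
    eta1 * sum((1 - eta1) ** (k - tau) * g[tau] for tau in range(k + 1))
    for k in range(K)
])

max_norm = np.linalg.norm(recursive, axis=1).max()
```

The bound follows because the weights $\eta_1(1-\eta_1)^{k-\tau}$ sum to $1-(1-\eta_1)^{k+1} \leq 1$, so $n_k$ is a sub-convex combination of vectors bounded by $R_g$.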

Based on (14) and Lemma 1, we obtain the one-frame drift of the actor as

$$\|v_{k+1} - v_k\| \leq R_h\alpha. \tag{21}$$

Based on Lemma 1 and (20), we can investigate the properties of the gradient bias term $\zeta(v_k,w_k;O_k)$.

Lemma 2

Suppose Assumptions 1 and 2 hold. When the length of each trajectory satisfies $T \geq \log(c_0^{-1}\beta)/\log\rho$, the gradient bias $\zeta(v_k,w_k;O_k)$ satisfies

$$\|\zeta(v_k,w_k;O_k) - \zeta(v_k,w_{k-1};O_k)\| \leq 8R_g\beta \tag{22}$$

and

$$\begin{split}&\|\mathbb{E}[\zeta(v_k,w_{k-1};O_k)|\mathcal{F}_{k-1}]\| \\ &\leq (c_2 + 2T)|\mathcal{A}|L_{\pi}R_g\|v_k - v_{k-1}\| + R_g\beta\end{split} \tag{23}$$

where $c_2 > 0$, and $\mathcal{F}_{k-1}$ denotes the filtration that contains all randomness prior to frame $k-1$.

Proof:

See Appendix B. ∎
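The trajectory-length condition of Lemma 2 is equivalent to $c_0\rho^{T} \leq \beta$, i.e., the mixing error in (17) after $T$ steps is no larger than the critic stepsize. The sketch below computes the smallest integer $T$ that meets this condition; the function name and the numerical values of `c0`, `rho`, and `beta` are hypothetical.

```python
import math

def min_trajectory_length(c0, rho, beta):
    """Smallest integer T >= 1 with T >= log(beta / c0) / log(rho)."""
    if not (0.0 < rho < 1.0) or c0 <= 0.0 or beta <= 0.0:
        raise ValueError("require 0 < rho < 1, c0 > 0, and beta > 0")
    # log(rho) < 0, so dividing flips the inequality c0 * rho**T <= beta.
    return max(1, math.ceil(math.log(beta / c0) / math.log(rho)))

c0, rho, beta = 2.0, 0.7, 0.01
T_min = min_trajectory_length(c0, rho, beta)
```

For these toy values `T_min` is 15. A smaller stepsize `beta` or a slower-mixing chain (larger `rho`) requires longer trajectories, which reflects the dependence of the learning rate on the trajectory length.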

Lemma 2 implies that: 1) the one-frame drift of the gradient bias $\zeta(v_k,w_k;O_k)$ with respect to $w_k$ can be confined by the critic stepsize $\beta$; and 2) the gradient bias does not rapidly increase with $\|v_k - v_{k-1}\|$, which serves as one of the keys to establishing the subsequent convergence of the critic sequence of the HB-A2C algorithm.

Based on (6), we observe that the optimal critic $w_k^*$ per frame $k$ is a function of $v_k$. Therefore, we are motivated to investigate the drift of the optimal critic with respect to the actor parameters.

Lemma 3

When Assumption 1 is satisfied, the optimal critic parameter $w^*$ per frame satisfies

\begin{align}
\|w^*_v - w^*_{v'}\| &\leq L_* \|v - v'\| \tag{24} \\
\|\nabla w^*_v\| &\leq G_* \tag{25}
\end{align}

where $L_* > 0$, $G_* > 0$, $v \in \mathbb{R}^{d_v}$, and $v' \in \mathbb{R}^{d_v}$.

Proof:

See Appendix C. ∎

Lemma 3 shows that the drift of the optimal critic is controlled by the drift of the actor. Before analyzing the convergence behavior of the actor recursion (14), we need to establish the Lipschitz continuity of the stochastic policy gradient $H(v, w; O_k)$ with respect to the critic parameter $w$.

Lemma 4

When Assumption 1 is satisfied, the stochastic policy gradient in (13) is Lipschitz continuous with respect to $w$, i.e.,

\begin{equation}
\|H(v, w; O_k) - H(v, w'; O_k)\| \leq (1 + \gamma) R_\pi \|w - w'\|
\tag{26}
\end{equation}

where $w \in \mathbb{R}^{d_w}$ and $w' \in \mathbb{R}^{d_w}$.

Proof:

See Appendix D. ∎
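The bound in Lemma 4 can be checked numerically. The sketch below is illustrative only and rests on assumptions that are not stated in this section: the per-sample term is taken to be $h(v, w; o) = \delta(w; o)\,\psi$ with the linear-critic temporal-difference error $\delta(w; o) = r + \gamma\phi(s')^\top w - \phi(s)^\top w$, features with $\|\phi\| \leq 1$, and score vectors with $\|\psi\| \leq R_\pi$. The names `policy_gradient_estimate`, `random_vector`, and `lipschitz_gap` are hypothetical helpers, and the data are synthetic.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def random_vector(rng, dim, radius):
    """Random vector whose Euclidean norm is at most `radius`."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = radius * rng.random() / norm(x)
    return [scale * a for a in x]


def policy_gradient_estimate(w, samples, gamma):
    """Assumed form: (1 - gamma) * sum_t gamma^t * delta_t(w) * psi_t."""
    total = [0.0] * len(samples[0]["psi"])
    for t, o in enumerate(samples):
        # TD error of the (assumed) linear critic for transition t.
        delta = o["r"] + gamma * dot(o["phi_next"], w) - dot(o["phi"], w)
        weight = (1.0 - gamma) * gamma ** t * delta
        total = [acc + weight * p for acc, p in zip(total, o["psi"])]
    return total


def lipschitz_gap(seed=0, gamma=0.9, r_pi=2.0, horizon=20, d_w=5, d_v=3):
    """Return (lhs, rhs) of inequality (26) for one synthetic minibatch."""
    rng = random.Random(seed)
    samples = [
        {
            "r": rng.uniform(-1.0, 1.0),
            "phi": random_vector(rng, d_w, 1.0),       # ||phi(s)|| <= 1
            "phi_next": random_vector(rng, d_w, 1.0),  # ||phi(s')|| <= 1
            "psi": random_vector(rng, d_v, r_pi),      # ||psi|| <= R_pi
        }
        for _ in range(horizon)
    ]
    w = [rng.gauss(0.0, 1.0) for _ in range(d_w)]
    w_prime = [rng.gauss(0.0, 1.0) for _ in range(d_w)]
    h = policy_gradient_estimate(w, samples, gamma)
    h_prime = policy_gradient_estimate(w_prime, samples, gamma)
    lhs = norm([a - b for a, b in zip(h, h_prime)])
    rhs = (1.0 + gamma) * r_pi * norm([a - b for a, b in zip(w, w_prime)])
    return lhs, rhs


lhs, rhs = lipschitz_gap()
print(f"||H(v,w) - H(v,w')|| = {lhs:.4f} <= (1+gamma) R_pi ||w - w'|| = {rhs:.4f}")
```

Under these assumptions the reward cancels in the difference, each TD-error difference is at most $(1+\gamma)\|w - w'\|$, and the discount weights $(1-\gamma)\gamma^t$ sum to at most one, which is where the constant $(1+\gamma)R_\pi$ comes from.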

Based on Lemma 4, we now present the convergence behavior of the policy gradient $\nabla J(v_k)$ as follows.

Theorem 1

Suppose Assumptions 1 and 2 hold, and set $u_{-1} = u_0 = 0$. When the minibatch size satisfies $T \geq \log\beta / (2\log\gamma)$, the $K$-step convergence of the actor is

\begin{equation}
\begin{split}
&\frac{\alpha}{2K} \sum_{k=0}^{K-1} \|\nabla J(v_k)\|^2 - \frac{(1+\gamma)^2 R_\pi^2 \alpha}{K} \sum_{k=0}^{K-1} \mathbb{E}[\|\Delta_k\|^2] \\
&\leq \frac{1}{K}[J(v_K) - J(v_0)] + \frac{1}{2} L R_h^2 \alpha^2 + R_h^2 \alpha\beta
\end{split}
\tag{27}
\end{equation}

where $\Delta_k = w_k - w_k^*$.

Proof:

The finite-time convergence analysis of the actor starts from the fact that the expected discounted reward $J(v)$ under state $s$ has an $L$-Lipschitz continuous gradient. Together with the recursion in (14) and the inequality (21), we have

\begin{equation}
\alpha \mathbb{E}[\Omega_k] \leq J(v_{k+1}) - J(v_k) + \frac{1}{2} L R_h^2 \alpha^2
\tag{28}
\end{equation}

where $\Omega_k = \langle \nabla J(v_k), H(v_k, w_k; O_k) \rangle$.

Based on the definition of the policy gradient $\nabla J(v_k) = \mathbb{E}[\nabla J(v_k, w_k^*; s_{k,0})]$ with

\begin{equation*}
\nabla J(v_k, w_k^*; s_{k,0}) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \int_{o_{k,t}} \mathcal{P}_{k,t} \otimes \pi_k \otimes \mathbb{P}(o_{k,t}) \, h(v_k, w_k^*; o_{k,t}) \, do_{k,t}
\end{equation*}

and the stochastic policy gradient

\begin{equation*}
H(v_k, w_k^*; O_k) = (1-\gamma) \sum_{t=0}^{T-1} \gamma^t h(v_k, w_k^*; o_{k,t}),
\end{equation*}

we have

\begin{equation}
\begin{split}
&\langle \nabla J(v_k), \mathbb{E}[H(v_k, w_k; O_k)] \rangle \\
&\geq \frac{1}{2} \|\nabla J(v_k)\|^2 - \frac{1}{2} \|\nabla J(v_k) - \mathbb{E}[H(v_k, w_k; O_k)]\|^2 \\
&\geq \frac{1}{2} \|\nabla J(v_k)\|^2 - R_h^2 \beta - (1+\gamma)^2 R_\pi^2 \mathbb{E}[\|\Delta_k\|^2].
\end{split}
\tag{29}
\end{equation}

Substituting (29) into (28), we obtain

\begin{equation}
\begin{split}
&\frac{\alpha}{2} \|\nabla J(v_k)\|^2 - (1+\gamma)^2 R_\pi^2 \alpha \, \mathbb{E}[\|\Delta_k\|^2] \\
&\leq J(v_{k+1}) - J(v_k) + R_h^2 \alpha\beta + \frac{1}{2} L R_h^2 \alpha^2.
\end{split}
\tag{30}
\end{equation}

Summing (30) over $k = 0, 1, \ldots, K-1$ and dividing by $K$, we obtain (27), which completes the proof. More details can be found in Appendix E. ∎
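Two elementary facts carry the proof above and may be worth spelling out. The block below is a worked note rather than part of the original argument; $a$ and $b$ are shorthand introduced here for $\nabla J(v_k)$ and $\mathbb{E}[H(v_k, w_k; O_k)]$.

```latex
% (i) Polarization identity behind the first inequality in (29),
%     with a = \nabla J(v_k) and b = \mathbb{E}[H(v_k, w_k; O_k)]:
\begin{align*}
\langle a, b \rangle
  &= \tfrac{1}{2}\|a\|^2 + \tfrac{1}{2}\|b\|^2 - \tfrac{1}{2}\|a - b\|^2
   \;\geq\; \tfrac{1}{2}\|a\|^2 - \tfrac{1}{2}\|a - b\|^2 .
\end{align*}
% (ii) Telescoping sum that turns the per-frame bound into the
%      K-step bound (27):
\begin{align*}
\frac{1}{K} \sum_{k=0}^{K-1} \bigl[ J(v_{k+1}) - J(v_k) \bigr]
  &= \frac{1}{K} \bigl[ J(v_K) - J(v_0) \bigr] .
\end{align*}
```

Fact (i) only drops the nonnegative term $\tfrac{1}{2}\|b\|^2$, so no property of the estimator is needed for that step. Fact (ii) explains why only the end points $J(v_K)$ and $J(v_0)$ appear in (27).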

We observe from Theorem 1 that the convergence behaviors of the policy gradient and the critic parameter are coupled. Therefore, we need to investigate the convergence behavior of the critic parameter in order to establish a unified convergence result for both the actor and critic recursions. Based on Lemmas 1–4, we can formally present the convergence of the critic update in (12) as follows.

Theorem 2

Suppose Assumptions 1 and 2 hold, and set $v_0 = 0$. When the minibatch size satisfies $T \geq \max\{\log(c_0^{-1}\beta) / \log\rho,\; \log\beta / (2\log\gamma)\}$, the $K$-step convergence of the critic is

\begin{equation}
\begin{split}
&\frac{1}{K} \Big[ \sigma\beta - [(1+\gamma) R_\pi G_* + 2 G_*^2] \alpha \Big] \sum_{k=0}^{K-1} \mathbb{E}[\|\Delta_k\|^2] \\
&\leq \frac{\alpha}{4K} \sum_{k=0}^{K-1} \|\nabla J(v_k)\|^2 + \frac{1}{2K} \big[ \mathbb{E}[\|\Delta_0\|^2] - \mathbb{E}[\|\Delta_K\|^2] \big] \\
&\quad + L_*^2 R_h^2 \alpha^2 + \Big[ \frac{c_4}{\eta_1} + R_g^2 \Big] \beta^2 \\
&\quad + \Big[ \frac{c_3}{\eta_1} + 2 R_w G_* \Big] R_h \alpha\beta + \frac{2(1-\eta_1)\beta}{\eta_1 K} R_w R_g
\end{split}
\tag{31}
\end{equation}

where $c_3 = [(1+\eta_1) L_* + 2\eta_1 (c_2 + 2T) |\mathcal{A}| L_\pi R_w] R_g$ and $c_4 = [2\eta_1 (R_g + 9 R_w) + (1-\eta_1) R_g] R_g$.
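The minibatch-size conditions in Theorems 1 and 2 are easy to evaluate for concrete stepsizes. The following sketch uses a hypothetical helper, `min_minibatch_size`, and assumes $0 < \beta < 1$, $0 < \gamma < 1$, $c_0 > 0$, and $0 < \rho < 1$. Under these sign conditions, $T \geq \log\beta / (2\log\gamma)$ is algebraically equivalent to $\gamma^{2T} \leq \beta$, and $T \geq \log(c_0^{-1}\beta)/\log\rho$ is equivalent to $c_0\rho^T \leq \beta$. Reading $c_0$ and $\rho$ as the constants of a geometric mixing condition is an assumption here, since their definitions lie outside this section.

```python
import math


def min_minibatch_size(beta, gamma, c0=None, rho=None):
    """Smallest integer T meeting the thresholds of Theorem 1 (and Theorem 2 if c0, rho given)."""
    if not (0.0 < beta < 1.0 and 0.0 < gamma < 1.0):
        raise ValueError("this sketch assumes 0 < beta < 1 and 0 < gamma < 1")
    # Theorem 1: T >= log(beta) / (2 log(gamma)), i.e. gamma^(2T) <= beta.
    bound = math.log(beta) / (2.0 * math.log(gamma))
    if c0 is not None and rho is not None:
        if not (c0 > 0.0 and 0.0 < rho < 1.0):
            raise ValueError("this sketch assumes c0 > 0 and 0 < rho < 1")
        # Theorem 2 adds T >= log(beta / c0) / log(rho), i.e. c0 * rho^T <= beta.
        bound = max(bound, math.log(beta / c0) / math.log(rho))
    return max(1, math.ceil(bound))


# Illustrative values only; none of them come from the paper.
t_actor = min_minibatch_size(beta=0.01, gamma=0.9)
t_critic = min_minibatch_size(beta=0.01, gamma=0.9, c0=2.0, rho=0.8)
print("Theorem 1 threshold:", t_actor)
print("Theorem 2 threshold:", t_critic)
```

Because both thresholds scale with $\log(1/\beta)$, halving the critic stepsize lengthens the required trajectory only by an additive constant.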

Proof:

The major challenge in analyzing the finite-time convergence of the critic comes from characterizing the errors related to the gradient variance, optimality drift, and gradient progress terms:

\begin{equation}
\begin{split}
&\frac{1}{2} \|\Delta_{k+1}\|^2 - \frac{1}{2} \|\Delta_k\|^2 \\
&\leq \frac{1}{2} \|\beta n_k + w_{k+1}^* - w_k^*\|^2 \quad \text{(gradient variance)} \\
&\quad + \langle \Delta_k, w_k^* - w_{k+1}^* \rangle \quad \text{(optimality drift)} \\
&\quad - \beta \langle \Delta_k, n_k \rangle. \quad \text{(gradient progress)}
\end{split}
\tag{32}
\end{equation}
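The three labeled terms in (32) come from expanding a single squared norm. The worked step below is a sketch under an assumption not shown in this section: the critic step in (12) is taken to be $w_{k+1} = w_k - \beta n_k$. If the update also includes a nonexpansive projection onto a set containing $w_{k+1}^*$, the second line becomes an inequality, which would match the $\leq$ in (32).

```latex
\begin{align*}
\Delta_{k+1}
  &= w_{k+1} - w_{k+1}^*
   = \Delta_k - \bigl( \beta n_k + w_{k+1}^* - w_k^* \bigr), \\
\tfrac{1}{2}\|\Delta_{k+1}\|^2
  &= \tfrac{1}{2}\|\Delta_k\|^2
   + \tfrac{1}{2}\|\beta n_k + w_{k+1}^* - w_k^*\|^2
   + \langle \Delta_k, w_k^* - w_{k+1}^* \rangle
   - \beta \langle \Delta_k, n_k \rangle .
\end{align*}
```

The cross term $-\langle \Delta_k, \beta n_k + w_{k+1}^* - w_k^* \rangle$ splits into the optimality drift and the gradient progress, which is why the two are bounded separately in Steps 2 and 3.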

Since the proposed HB-A2C algorithm integrates the HB momentum into the critic update, the techniques we use to bound the gradient variance, optimality drift, and gradient progress in (32) differ from those in [31, 11, 28].

Step 1: Characterization of gradient variance. Recall that $n_k = \eta_1 \sum_{\tau=0}^{k} (1-\eta_1)^{k-\tau} g(w_\tau; O_\tau)$ when $n_{-1} = 0$. Together with Lemmas 1 and 5, we can upper-bound the gradient variance in (32) as

\begin{equation}
\frac{1}{2} \|\beta n_k + w_{k+1}^* - w_k^*\|^2 \leq R_g^2 \beta^2 + L_*^2 \|v_{k+1} - v_k\|^2.
\tag{33}
\end{equation}
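The closed form of $n_k$ quoted in Step 1 can be checked against a recursive update. The sketch below assumes that (12a) has the form $n_k = (1-\eta_1) n_{k-1} + \eta_1 g(w_k; O_k)$ with $n_{-1} = 0$, which is consistent with the closed form but not shown in this section. The gradients are synthetic stand-ins with norm at most $R_g$, and the function names are hypothetical.

```python
import math
import random


def momentum_recursive(grads, eta1):
    """Run n_k = (1 - eta1) * n_{k-1} + eta1 * g_k from n_{-1} = 0 (assumed form of (12a))."""
    n = [0.0] * len(grads[0])
    history = []
    for g in grads:
        n = [(1.0 - eta1) * a + eta1 * b for a, b in zip(n, g)]
        history.append(n)
    return history


def momentum_closed_form(grads, eta1, k):
    """Evaluate n_k = eta1 * sum_{tau=0}^{k} (1 - eta1)^(k - tau) * g_tau directly."""
    n = [0.0] * len(grads[0])
    for tau in range(k + 1):
        weight = eta1 * (1.0 - eta1) ** (k - tau)
        n = [a + weight * b for a, b in zip(n, grads[tau])]
    return n


def bounded_gradients(num_steps, dim, r_g, seed=0):
    """Synthetic stand-ins for g(w_k; O_k) with Euclidean norm at most r_g."""
    rng = random.Random(seed)
    grads = []
    for _ in range(num_steps):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scale = r_g * rng.random() / math.sqrt(sum(a * a for a in x))
        grads.append([scale * a for a in x])
    return grads


R_G, ETA1 = 3.0, 0.2  # illustrative values, not taken from the paper
grads = bounded_gradients(num_steps=50, dim=4, r_g=R_G)
history = momentum_recursive(grads, ETA1)
max_gap = max(
    abs(a - b)
    for k, n_k in enumerate(history)
    for a, b in zip(n_k, momentum_closed_form(grads, ETA1, k))
)
max_norm = max(math.sqrt(sum(a * a for a in n_k)) for n_k in history)
print(f"max |recursive - closed form| = {max_gap:.2e}")
print(f"max ||n_k|| = {max_norm:.3f} (R_g = {R_G})")
```

The weights $\eta_1 (1-\eta_1)^{k-\tau}$ sum to $1 - (1-\eta_1)^{k+1} \leq 1$, so $\|n_k\| \leq R_g$ whenever every stochastic gradient satisfies $\|g\| \leq R_g$. Combined with $\tfrac{1}{2}\|a + b\|^2 \leq \|a\|^2 + \|b\|^2$ and the Lipschitz bound (24), this yields a bound of the form (33).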

Step 2: Characterization of optimality drift. Based on Lemmas 3–5, we can upper-bound the optimality drift as

\begin{equation}
\begin{split}
\mathbb{E}[\langle \Delta_k, w_k^* - w_{k+1}^* \rangle] &\leq [(1+\gamma) R_\pi G_* + 2 G_*^2] \alpha \, \mathbb{E}[\|\Delta_k\|^2] \\
&\quad + \frac{1}{4} \alpha \|\nabla J(v_k)\|^2 + 2\alpha\beta R_w R_h G_*.
\end{split}
\tag{34}
\end{equation}

Substituting (33) and (34) into (32) and recalling the fact in (21), we obtain

\begin{equation}
\begin{split}
&\beta \mathbb{E}[\langle \Delta_k, n_k \rangle] - [(1+\gamma) R_\pi G_* + 2 G_*^2] \alpha \, \mathbb{E}[\|\Delta_k\|^2] \\
&\leq \frac{1}{2} \big[ \mathbb{E}[\|\Delta_k\|^2] - \mathbb{E}[\|\Delta_{k+1}\|^2] \big] + \frac{1}{4} \alpha \|\nabla J(v_k)\|^2 \\
&\quad + 2\alpha\beta R_w R_h G_* + R_g^2 \beta^2 + L_*^2 R_h^2 \alpha^2.
\end{split}
\tag{35}
\end{equation}

Step 3: Characterization of gradient progress. The most challenging part lies in analyzing the gradient progress, which must account for the HB momentum update of the critic. More specifically, we can decompose the gradient progress term based on (12a) as

\begin{equation}
\begin{split}
&\langle \Delta_k, n_k \rangle \\
&= (1-\eta_1) \langle \Delta_{k-1}, n_{k-1} \rangle \\
&\quad + (1-\eta_1) \langle \Delta_k - \Delta_{k-1}, n_{k-1} \rangle + \eta_1 \langle \Delta_k, g(w_k; O_k) \rangle.
\end{split}
\tag{36}
\end{equation}
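For readers who want the intermediate algebra behind (36), the following worked step may help. It assumes that the momentum recursion (12a) reads $n_k = (1-\eta_1) n_{k-1} + \eta_1 g(w_k; O_k)$, which agrees with the closed form of $n_k$ recalled in Step 1 but is not restated in this section.

```latex
\begin{align*}
\langle \Delta_k, n_k \rangle
  &= \bigl\langle \Delta_k, (1-\eta_1) n_{k-1} + \eta_1 g(w_k; O_k) \bigr\rangle \\
  &= (1-\eta_1) \bigl\langle \Delta_{k-1} + (\Delta_k - \Delta_{k-1}), n_{k-1} \bigr\rangle
     + \eta_1 \langle \Delta_k, g(w_k; O_k) \rangle \\
  &= (1-\eta_1) \langle \Delta_{k-1}, n_{k-1} \rangle
     + (1-\eta_1) \langle \Delta_k - \Delta_{k-1}, n_{k-1} \rangle
     + \eta_1 \langle \Delta_k, g(w_k; O_k) \rangle .
\end{align*}
```

The first term reproduces the previous gradient progress scaled by $1-\eta_1$, which sets up a recursion in $\langle \Delta_k, n_k \rangle$. The middle term is the price of the critic error moving between frames $k-1$ and $k$.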

Using the Lipschitz continuity of the gradient bias $\zeta(v_k, w_k; O_k)$ in Lemma 2 and of the optimal critic parameter $w^*(v)$ in Lemma 3, together with the recursion (12) and the decomposed gradient progress (36), we obtain the following lower and upper bounds on $\eta_1 \langle \Delta_k, g(w_k; O_k) \rangle$ in expectation, respectively:

\begin{align}
\eta_{1}\mathbb{E}[\langle\Delta_{k},g(w_{k};O_{k})\rangle]
&\geq\eta_{1}\sigma\mathbb{E}[\|\Delta_{k}\|^{2}]-\eta_{1}c_{3}^{\prime}\mathbb{E}[\|v_{k}-v_{k-1}\|]-\eta_{1}c_{4}^{\prime}\beta \tag{37a}\\
\eta_{1}\mathbb{E}[\langle\Delta_{k},g(w_{k};O_{k})\rangle]
&\leq(1-\eta_{1})R_{g}\big[R_{g}\beta+L_{*}\mathbb{E}[\|v_{k}-v_{k-1}\|]\big] \notag\\
&\quad+\mathbb{E}[\langle\Delta_{k},n_{k}\rangle]-(1-\eta_{1})\mathbb{E}[\langle\Delta_{k-1},n_{k-1}\rangle] \tag{37b}
\end{align}

where $c_{3}^{\prime}:=2R_{g}[L_{*}+(c_{2}+2T)|\mathcal{A}|L_{\pi}R_{w}]$ and $c_{4}^{\prime}:=2R_{g}(R_{g}+9R_{w})$.

Substituting (37a) into (37b), we obtain

$$
\begin{split}
\eta_{1}\sigma\mathbb{E}[\|\Delta_{k}\|^{2}]
&\leq c_{3}\mathbb{E}[\|v_{k}-v_{k-1}\|]+c_{4}\beta\\
&\quad+\mathbb{E}[\langle\Delta_{k},n_{k}\rangle]-(1-\eta_{1})\mathbb{E}[\langle\Delta_{k-1},n_{k-1}\rangle]
\end{split}
\tag{38}
$$

where $c_{3}:=[(1+\eta_{1})L_{*}+2\eta_{1}(c_{2}+2T)|\mathcal{A}|L_{\pi}R_{w}]R_{g}$ and $c_{4}:=[2\eta_{1}(R_{g}+9R_{w})+(1-\eta_{1})R_{g}]R_{g}$.

Summing (38) over $k=0,\ldots,K-1$ and recalling that $n_{-1}=0$, we obtain

$$
\begin{split}
\sigma\beta\sum_{k=0}^{K-1}\mathbb{E}[\|\Delta_{k}\|^{2}]
&\leq\frac{c_{3}}{\eta_{1}}R_{h}\alpha\beta K+\frac{c_{4}}{\eta_{1}}\beta^{2}K\\
&\quad+\frac{2(1-\eta_{1})}{\eta_{1}}R_{w}R_{g}\beta+\beta\sum_{k=0}^{K-1}\mathbb{E}[\langle\Delta_{k},n_{k}\rangle].
\end{split}
\tag{39}
$$

Combining (35) and (39) and summing over $k=0,\ldots,K-1$, we complete the proof by obtaining (31). The detailed derivations can be found in Appendix F. ∎
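The telescoping step that turns (38) into (39) can be checked numerically. The sketch below is illustrative only: it assumes the momentum recursion $n_{k}=(1-\eta_{1})n_{k-1}+\eta_{1}g(w_{k};O_{k})$ with $n_{-1}=0$, which is the form implied by (36), and it uses random vectors as stand-ins for $\Delta_{k}$ and $g(w_{k};O_{k})$. The helper names are hypothetical. Writing $a_{k}=\langle\Delta_{k},n_{k}\rangle$, it verifies the decomposition (36) and the identity $\sum_{k=0}^{K-1}[a_{k}-(1-\eta_{1})a_{k-1}]=\eta_{1}\sum_{k=0}^{K-1}a_{k}+(1-\eta_{1})a_{K-1}$, whose last term is the source of the $2(1-\eta_{1})R_{w}R_{g}\beta/\eta_{1}$ term in (39) once $|a_{K-1}|$ is bounded.

```python
import random


def dot(x, y):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def heavy_ball_average(grads, eta1):
    """Return n_0, ..., n_{K-1} with n_k = (1 - eta1) n_{k-1} + eta1 g_k and n_{-1} = 0."""
    n_prev = [0.0] * len(grads[0])
    averages = []
    for g in grads:
        n_prev = [(1.0 - eta1) * n + eta1 * gi for n, gi in zip(n_prev, g)]
        averages.append(n_prev)
    return averages


def decomposition_residual(deltas, grads, eta1):
    """Largest absolute gap between the two sides of (36) over k = 1, ..., K-1."""
    ns = heavy_ball_average(grads, eta1)
    worst = 0.0
    for k in range(1, len(grads)):
        diff = [a - b for a, b in zip(deltas[k], deltas[k - 1])]
        rhs = ((1.0 - eta1) * dot(deltas[k - 1], ns[k - 1])
               + (1.0 - eta1) * dot(diff, ns[k - 1])
               + eta1 * dot(deltas[k], grads[k]))
        worst = max(worst, abs(dot(deltas[k], ns[k]) - rhs))
    return worst


def telescoping_sides(deltas, grads, eta1):
    """Both sides of sum_k [a_k - (1 - eta1) a_{k-1}] = eta1 sum_k a_k + (1 - eta1) a_{K-1}."""
    ns = heavy_ball_average(grads, eta1)
    a = [dot(d, n) for d, n in zip(deltas, ns)]
    # a_{-1} = 0 because n_{-1} = 0.
    lhs = sum(a[k] - (1.0 - eta1) * (a[k - 1] if k > 0 else 0.0)
              for k in range(len(a)))
    rhs = eta1 * sum(a) + (1.0 - eta1) * a[-1]
    return lhs, rhs


# Random stand-ins for the critic error Delta_k and the semi-gradient g(w_k; O_k).
rng = random.Random(0)
K, dim, eta1 = 50, 4, 0.3
deltas = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(K)]
grads = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(K)]

print(decomposition_residual(deltas, grads, eta1))
print(telescoping_sides(deltas, grads, eta1))
```

Both printed quantities should agree up to floating-point rounding: the residual is numerically zero and the two sides of the telescoped sum coincide.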

Based on Theorems 1 and 2, we observe that the convergence of the policy gradient $\nabla J(v_{k})$ and that of the critic parameter $w_{k}$ are coupled with each other. By combining Theorems 1 and 2, we can now establish the unified convergence of the actor and critic recursions as follows.

Corollary 1

Suppose Assumptions 1 and 2 hold. Set $\alpha=\Theta(1/\sqrt{K})$ and $\beta=c_{5}\alpha$ with $c_{5}=[1+4(1+\gamma)^{2}R_{\pi}^{2}+4(1+\gamma)R_{\pi}G_{*}+8G_{*}^{2}]/(4\sigma)$. Let the minibatch size satisfy $T\geq\max\{\log(c_{0}^{-1}\beta)/\log\rho,\ \log\beta/(2\log\gamma)\}$. Then the finite-time convergence rate of the HB-A2C algorithm is

$$
\frac{1}{4K}\sum_{k=0}^{K-1}\Big[\|\nabla J(v_{k})\|^{2}+\mathbb{E}[\|\Delta_{k}\|^{2}]\Big]\leq\mathcal{O}\Big(\frac{1}{\sqrt{K}}\Big)+\mathcal{O}\Big(\frac{1}{K}\Big)
\tag{40}
$$

where $\mathcal{O}(1/\sqrt{K})=[\mathcal{L}_{K}-\mathcal{L}_{0}+(c_{3}\eta_{1}^{-1}+2R_{w}G_{*}+R_{h})R_{h}c_{5}+(c_{4}\eta_{1}^{-1}+R_{g}^{2})c_{5}^{2}+(0.5L+L_{*}^{2})R_{h}^{2}]/\sqrt{K}$ and $\mathcal{O}(1/K)=2(1-\eta_{1})R_{w}R_{g}c_{5}/(\eta_{1}K)$, with the Lyapunov function defined as $\mathcal{L}_{k}=J(v_{k})-\frac{1}{2}\mathbb{E}[\|\Delta_{k}\|^{2}]$.

Proof:

Define the Lyapunov function as $\mathcal{L}_{k}=J(v_{k})-\frac{1}{2}\mathbb{E}[\|\Delta_{k}\|^{2}]$. Summing (31) and (27) and dividing both sides by $\alpha$, with $\alpha=\Theta(1/\sqrt{K})$ and $\beta=[1+4(1+\gamma)^{2}R_{\pi}^{2}+4(1+\gamma)R_{\pi}G_{*}+8G_{*}^{2}]\alpha/(4\sigma)$, we obtain (40). ∎
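The parameter choices in Corollary 1 can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and `alpha_scale` are hypothetical (the corollary only requires $\alpha=\Theta(1/\sqrt{K})$), the first minibatch threshold is read as $\log(\beta/c_{0})/\log\rho$, and the numerical constants in the usage lines are made up. It assumes $0<\gamma<1$ and $0<\rho<1$.

```python
import math


def corollary1_parameters(K, sigma, gamma, R_pi, G_star, c0, rho, alpha_scale=1.0):
    """Actor step size alpha, critic step size beta, and minibatch size T.

    alpha_scale is a hypothetical constant fixing alpha = alpha_scale / sqrt(K).
    """
    alpha = alpha_scale / math.sqrt(K)
    c5 = (1.0
          + 4.0 * (1.0 + gamma) ** 2 * R_pi ** 2
          + 4.0 * (1.0 + gamma) * R_pi * G_star
          + 8.0 * G_star ** 2) / (4.0 * sigma)
    beta = c5 * alpha
    # Assumed reading of the first threshold: log(beta / c0) / log(rho).
    t_mixing = math.log(beta / c0) / math.log(rho)
    # Second threshold: log(beta) / (2 log(gamma)).
    t_discount = math.log(beta) / (2.0 * math.log(gamma))
    T = max(1, math.ceil(max(t_mixing, t_discount)))
    return alpha, beta, T


# Made-up constants, for illustration only.
alpha, beta, T = corollary1_parameters(
    K=10_000, sigma=0.5, gamma=0.9, R_pi=1.0, G_star=1.0, c0=2.0, rho=0.8)
print(alpha, beta, T)
```

Because $\log\rho<0$ and $\log\gamma<0$, the two thresholds are equivalent to $c_{0}\rho^{T}\leq\beta$ and $\gamma^{2T}\leq\beta$, so the minibatch size grows only logarithmically as $\beta$ shrinks with $K$.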

Corollary 1 characterizes the unified convergence of the actor and critic recursions with respect to the total number of frames $K$. Based on Corollary 1, we observe that the proposed HB-A2C algorithm finds an $\epsilon$-approximate stationary point with $\mathcal{O}(\epsilon^{-2})$ iterations for reinforcement learning tasks with Markovian noise.
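To see how the $\mathcal{O}(\epsilon^{-2})$ count follows from (40), one can lump the right-hand side into $C_{1}/\sqrt{K}+C_{2}/K$ and solve for the smallest $K$ that drives it below $\epsilon$. The sketch below does this; `C1` and `C2` are hypothetical lumped constants (treating $\mathcal{L}_{K}-\mathcal{L}_{0}$ as bounded is an assumption here), and the function names are not from the paper.

```python
import math


def rate_bound(K, C1, C2):
    """Right-hand side of (40) with lumped constants: C1 / sqrt(K) + C2 / K."""
    return C1 / math.sqrt(K) + C2 / K


def iterations_for_accuracy(eps, C1, C2):
    """Smallest integer K >= 1 with C1 / sqrt(K) + C2 / K <= eps (requires C1 > 0)."""
    # With x = 1 / sqrt(K) the condition reads C2 x^2 + C1 x - eps <= 0.
    # The positive root is written in a cancellation-free form.
    x = 2.0 * eps / (C1 + math.sqrt(C1 * C1 + 4.0 * C2 * eps))
    K = max(1, math.ceil(1.0 / (x * x)))
    # Correct for floating-point rounding in either direction.
    while rate_bound(K, C1, C2) > eps:
        K += 1
    while K > 1 and rate_bound(K - 1, C1, C2) <= eps:
        K -= 1
    return K


K_coarse = iterations_for_accuracy(0.10, C1=5.0, C2=3.0)
K_fine = iterations_for_accuracy(0.05, C1=5.0, C2=3.0)
print(K_coarse, K_fine, K_fine / K_coarse)
```

Halving $\epsilon$ roughly quadruples the required $K$, which is the $\epsilon^{-2}$ scaling; the $C_{2}/K$ term only perturbs this for large $\epsilon$.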

In our proposed HB-A2C algorithm, the learning rates of the actor and critic recursions are of the same order. Furthermore, we observe from the term $\mathcal{O}(1/\sqrt{K})$ that the convergence rate is essentially controlled by the optimality drift term. Additionally, based on $\mathcal{O}(1/\sqrt{K})$, we observe that increasing the momentum factor $\eta_{1}$ can trade off the error introduced by the initial actor and critic parameters for the error introduced by the biased gradient descent recursions. Our convergence rate of $\mathcal{O}(1/\sqrt{K})+\mathcal{O}(1/K)$ is tighter than those in [28, 31]. Compared to the finite-time results of the A2C algorithms in [28, 31], our error bounds in (40) hold for all $K\geq 1$, whereas those of [28, 31] become available only after a mixing time of updates.

Appendix A Proof of Lemma 1

For the critic parameter, the upper bound of the stochastic semi-gradient in (7) is derived as

$$
\begin{split}
\|g(w_{k};O_{k})\|&\leq\|\Phi_{k}\|\|w_{k}\|+\|b_{k}\|\\
&\leq(1+\gamma^{T})R_{w}+c_{1}(\gamma)R_{r}:=R_{g}
\end{split}
\tag{41}
$$

where $c_{1}(\gamma)=\frac{1-\gamma^{T}}{1-\gamma}$.

Based on (41), the full semi-gradient $\mathbb{E}[g(w_{k};\bar{O}_{k})]$ is upper-bounded as $\|\mathbb{E}[g(w_{k};\bar{O}_{k})]\|\leq\mathbb{E}[\|g(w_{k};\bar{O}_{k})\|]\leq R_{g}$.

Based on the definition in (11), we obtain the upper bound of $h(v_{k},w_{k}^{*};o_{k,t})$ as

$$
\|h(v_{k},w_{k}^{*};o_{k,t})\|\leq R_{\pi}[R_{r}+(1+\gamma)R_{w}]:=R_{h}.
\tag{42}
$$

Based on (42), we obtain the upper bounds for the policy gradient and its $T$-step estimation as $\|\nabla J(v_{k},w_{k}^{*};s_{k,0})\|\leq R_{h}$ and $\|H(v_{k},w_{k};O_{k})\|\leq R_{h}$.
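The constants in (41) and (42) are simple closed-form expressions, so they can be evaluated directly. The helper below is a small sketch with a hypothetical name; it reads $R_{w}$, $R_{r}$, and $R_{\pi}$ as the bounds on the critic parameter, the reward, and the policy score function, respectively, which is an assumption about notation defined earlier in the paper. The input values are made up.

```python
def lemma1_bounds(gamma, T, R_w, R_r, R_pi):
    """Return (R_g, R_h) as defined in (41) and (42), for 0 <= gamma < 1."""
    # c_1(gamma) = (1 - gamma^T) / (1 - gamma) is the T-step geometric sum.
    c1 = (1.0 - gamma ** T) / (1.0 - gamma)
    R_g = (1.0 + gamma ** T) * R_w + c1 * R_r
    R_h = R_pi * (R_r + (1.0 + gamma) * R_w)
    return R_g, R_h


# Made-up values, for illustration only.
R_g, R_h = lemma1_bounds(gamma=0.5, T=2, R_w=2.0, R_r=1.0, R_pi=1.0)
print(R_g, R_h)
```

Since $\gamma^{T}\to 0$ and $c_{1}(\gamma)\to 1/(1-\gamma)$ as $T$ grows, $R_{g}$ stays bounded in the minibatch size, while $R_{h}$ does not depend on $T$ at all.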

Appendix B Proof of Lemma 2

Following the definition of $\zeta(v_{k},w_{k};O_{k})$, the difference between $\zeta(v_{k},w_{k};O_{k})$ and $\zeta(v_{k},w_{k-1};O_{k})$ can be bounded as

\begin{align}
&\|\zeta(v_{k},w_{k};O_{k})-\zeta(v_{k},w_{k-1};O_{k})\| \tag{43a}\\
&\leq\|\Phi_{k}(w_{k}-w_{k-1})\|+\|\mathbb{E}[\bar{\Phi}_{k}](w_{k}-w_{k-1})\| \tag{43b}\\
&\leq 4\|w_{k}-w_{k-1}\| \tag{43c}
\end{align}

where $\|\Phi_{k}\|\leq 1+\gamma^{T}\leq 2$ and $\|\mathbb{E}[\bar{\Phi}_{k}]\|\leq 2$.
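The step from (43a) to (43b) can be spelled out as follows. This is a sketch under two assumptions: that the semi-gradient is affine in the critic parameter, $g(w;O_{k})=\Phi_{k}w+b_{k}$ (up to the sign convention of (7)), as the bound (41) suggests, and that the gradient bias is $\zeta(v_{k},w;O_{k})=g(w;O_{k})-\mathbb{E}[g(w;\bar{O}_{k})]$, as used in the decomposition (44).

```latex
\begin{align*}
\zeta(v_{k},w_{k};O_{k})-\zeta(v_{k},w_{k-1};O_{k})
&=\big[g(w_{k};O_{k})-g(w_{k-1};O_{k})\big]
 -\big[\mathbb{E}[g(w_{k};\bar{O}_{k})]-\mathbb{E}[g(w_{k-1};\bar{O}_{k})]\big]\\
&=\Phi_{k}(w_{k}-w_{k-1})-\mathbb{E}[\bar{\Phi}_{k}](w_{k}-w_{k-1}).
\end{align*}
```

The offsets $b_{k}$ and $\mathbb{E}[\bar{b}_{k}]$ cancel because they do not depend on $w$. Applying the triangle inequality gives (43b), and bounding each matrix norm by 2 gives the Lipschitz constant 4 in (43c).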

The gradient bias $\zeta(v_{k},w_{k-1};O_{k})$ can be decomposed as

$$
\begin{split}
\zeta(v_{k},w_{k-1};O_{k})
&=g(w_{k-1};O_{k})-g(w_{k-1};\tilde{O}_{k})\\
&\quad+g(w_{k-1};\tilde{O}_{k})-g(w_{k-1};\bar{O}_{k}^{\prime})\\
&\quad+g(w_{k-1};\bar{O}_{k}^{\prime})-\mathbb{E}[g(w_{k-1};\bar{O}_{k}^{\prime})]\\
&\quad+\mathbb{E}[g(w_{k-1};\bar{O}_{k}^{\prime})]-\mathbb{E}[g(w_{k-1};\bar{O}_{k})]
\end{split}
\tag{44}
$$

where the observations $\tilde{O}_{k}$ are sampled from the behavior policy $\pi_{k-1}$ starting from $s_{k-1,0}$, and $\bar{O}_{k}^{\prime}$ are sampled from the stationary distribution $\mu_{k-1}\otimes\pi_{k-1}\otimes\mathbb{P}$.

Taking the expectation of $g(w_{k-1};O_{k})-g(w_{k-1};\tilde{O}_{k})$ conditional on $\mathcal{F}_{k-1}$, we obtain

$$
\begin{split}
&\|\mathbb{E}[g(w_{k-1};O_{k})-g(w_{k-1};\tilde{O}_{k})\,|\,\mathcal{F}_{k-1}]\|\\
&\leq R_{g}\|\mathbb{P}(O_{k}\in\cdot\,|\,\mathcal{F}_{k-1})-\mathbb{P}(\tilde{O}_{k}\in\cdot\,|\,\mathcal{F}_{k-1})\|_{\mathrm{TV}}.
\end{split}
\tag{45}
$$

Following the definition of the MDP and recalling that $s_{k-1,T}=s_{k,0}$, we can expand the conditional probabilities in (45) as

\begin{align}
\mathbb{P}(O_{k}\in\cdot\,|\,\mathcal{F}_{k-1})
&=\mathcal{P}_{k-1,T}\otimes\underbrace{\pi_{k}\otimes\mathbb{P}\otimes\ldots\otimes\pi_{k}\otimes\mathbb{P}}_{T~\text{times}} \tag{46a}\\
\mathbb{P}(\tilde{O}_{k}\in\cdot\,|\,\mathcal{F}_{k-1})
&=\mathcal{P}_{k-1,T}\otimes\underbrace{\pi_{k-1}\otimes\mathbb{P}\otimes\ldots\otimes\pi_{k-1}\otimes\mathbb{P}}_{T~\text{times}}. \tag{46b}
\end{align}

Based on (46), we can upper bound $\|\mathbb{P}(O_{k}\in\cdot\,|\,\mathcal{F}_{k-1})-\mathbb{P}(\tilde{O}_{k}\in\cdot\,|\,\mathcal{F}_{k-1})\|_{\mathrm{TV}}$ as

\begin{equation}
\begin{split}
&\|\mathbb{P}(O_{k}\in\cdot\,|\,\mathcal{F}_{k-1})-\mathbb{P}(\tilde{O}_{k}\in\cdot\,|\,\mathcal{F}_{k-1})\|_{\mathrm{TV}}\\
&\leq T|\mathcal{A}|L_{\pi}R_{h}\alpha.
\end{split}
\tag{47}
\end{equation}
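The factor $T$ in (47) reflects that the two trajectory laws in (46) share the initial distribution and the transition kernel $\mathbb{P}$, and differ only in their $T$ policy factors. The following is a minimal numerical sketch of that linear growth, not the authors' argument. The two-state, two-action MDP, the randomly drawn policies, and the helper names `trajectory_law` and `total_variation` are illustrative assumptions, and the total variation distance is taken here with the $\tfrac{1}{2}\ell_{1}$ convention.

```python
import itertools
import numpy as np

def trajectory_law(p0, policy, kernel, T):
    """Exact law of (s_0, a_0, ..., a_{T-1}, s_T) on a finite toy MDP."""
    n_states, n_actions = policy.shape
    law = {}
    for states in itertools.product(range(n_states), repeat=T + 1):
        for actions in itertools.product(range(n_actions), repeat=T):
            prob = p0[states[0]]
            for t in range(T):
                s, a, s_next = states[t], actions[t], states[t + 1]
                prob *= policy[s, a] * kernel[s, a, s_next]
            law[(states, actions)] = prob
    return law

def total_variation(law_p, law_q):
    """Total variation distance with the 0.5 * l1 convention."""
    return 0.5 * sum(abs(law_p[key] - law_q[key]) for key in law_p)

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
p0 = np.array([0.3, 0.7])                      # shared initial distribution
kernel = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
policy_new = rng.dirichlet(np.ones(n_actions), size=n_states)  # stands in for pi_k
policy_old = rng.dirichlet(np.ones(n_actions), size=n_states)  # stands in for pi_{k-1}

# Largest per-state gap between the two policies.
per_step_gap = max(
    0.5 * np.abs(policy_new[s] - policy_old[s]).sum() for s in range(n_states)
)

gaps = {}
for T in range(1, 5):
    gaps[T] = total_variation(
        trajectory_law(p0, policy_new, kernel, T),
        trajectory_law(p0, policy_old, kernel, T),
    )
    print(T, gaps[T], T * per_step_gap)
```

Each printed row shows the trajectory-level distance next to $T$ times the per-state policy gap. In (47) the per-step gap is in turn controlled by $|\mathcal{A}|L_{\pi}R_{h}\alpha$.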

Substituting (47) into (45), we obtain

\begin{equation}
\begin{split}
&\|\mathbb{E}[g(w_{k-1};O_{k})-g(w_{k-1};\tilde{O}_{k})\,|\,\mathcal{F}_{k-1}]\|\\
&\leq T|\mathcal{A}|L_{\pi}R_{g}R_{h}\alpha.
\end{split}
\tag{48}
\end{equation}

Taking the expectation of $g(w_{k-1};\tilde{O}_{k})-g(w_{k-1};\bar{O}_{k}^{\prime})$ conditional on $\mathcal{F}_{k-1}$, we obtain

\begin{equation}
\begin{split}
&\|\mathbb{E}[g(w_{k-1};\tilde{O}_{k})-g(w_{k-1};\bar{O}_{k}^{\prime})\,|\,\mathcal{F}_{k-1}]\|\\
&\leq R_{g}\|\mathbb{P}(\tilde{O}_{k}\in\cdot\,|\,\mathcal{F}_{k-1})-\mathbb{P}(\bar{O}_{k}^{\prime}\in\cdot\,|\,\mathcal{F}_{k-1})\|_{\mathrm{TV}}.
\end{split}
\tag{49}
\end{equation}

Since the sample trajectory $\bar{O}_{k}^{\prime}$ is obtained from the stationary distribution $\mu_{k-1}\otimes\pi_{k-1}\otimes\mathbb{P}$, we have

\begin{equation}
\begin{split}
&\mathbb{P}(\bar{O}_{k}^{\prime}\in\cdot\,|\,\mathcal{F}_{k-1})\\
&=\mu_{k-1}\otimes\underbrace{\pi_{k-1}\otimes\mathbb{P}\otimes\ldots\otimes\pi_{k-1}\otimes\mathbb{P}}_{T~\text{times}}.
\end{split}
\tag{50}
\end{equation}

When Assumption 2 holds, we obtain the following upper bound based on Lemma 1 as

\begin{equation}
\|\mathbb{E}[g(w_{k-1};\tilde{O}_{k})-g(w_{k-1};\bar{O}_{k}^{\prime})\,|\,\mathcal{F}_{k-1}]\|\leq c_{0}\rho^{T}R_{g}.
\tag{51}
\end{equation}

Based on (50), we also have the following conditional expectation

\begin{equation}
\mathbb{E}\big[g(w_{k-1};\bar{O}_{k}^{\prime})-\mathbb{E}[g(w_{k-1};\bar{O}_{k}^{\prime})]\,\big|\,\mathcal{F}_{k-1}\big]=0.
\tag{52}
\end{equation}

Since the trajectory $\bar{O}_{k}$ is sampled from the stationary distribution induced by $\pi_{k}$, we have

\begin{equation}
\mathbb{P}(\bar{O}_{k}\in\cdot\,|\,\mathcal{F}_{k-1})=\mu_{k}\otimes\underbrace{\pi_{k}\otimes\mathbb{P}\otimes\ldots\otimes\pi_{k}\otimes\mathbb{P}}_{T~\text{times}}.
\tag{53}
\end{equation}

Based on (50) and (53), the norm of $\mathbb{E}[g(w_{k-1};\bar{O}_{k}^{\prime})]-\mathbb{E}[g(w_{k-1};\bar{O}_{k})]$ can be upper-bounded as

\begin{equation}
\begin{split}
&\|\mathbb{E}[g(w_{k-1};\bar{O}_{k}^{\prime})]-\mathbb{E}[g(w_{k-1};\bar{O}_{k})]\|\\
&\leq(c_{2}+T)|\mathcal{A}|L_{\pi}R_{g}R_{h}\alpha.
\end{split}
\tag{54}
\end{equation}

Substituting (48), (51), (52), and (54) into (44), we obtain an upper bound on the expectation of the gradient bias $\zeta(v_{k},w_{k-1};O_{k})$ conditional on $\mathcal{F}_{k-1}$ as

\begin{equation}
\begin{split}
&\|\mathbb{E}[\zeta(v_{k},w_{k-1};O_{k})\,|\,\mathcal{F}_{k-1}]\|\\
&\leq(c_{2}+2T)|\mathcal{A}|L_{\pi}R_{g}\|v_{k}-v_{k-1}\|+R_{g}\beta
\end{split}
\tag{55}
\end{equation}

where $T\geq\log(c_{0}^{-1}\beta)/\log\rho$, which ensures $c_{0}\rho^{T}\leq\beta$ in (51) because $\log\rho<0$.
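The condition on $T$ can be turned into a concrete trajectory length once $c_{0}$, $\rho$, and $\beta$ are fixed. The sketch below assumes $0<\rho<1$ (the usual geometric-mixing setting); the function name `min_trajectory_length` and the numerical values are hypothetical and chosen only for illustration.

```python
import math

def min_trajectory_length(c0, rho, beta):
    """Smallest integer T (up to floating-point rounding) with c0 * rho**T <= beta.

    Assumes c0 > 0, beta > 0 and 0 < rho < 1, so that log(rho) < 0.
    """
    if c0 <= 0.0 or beta <= 0.0:
        raise ValueError("c0 and beta must be positive")
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie in (0, 1)")
    if beta >= c0:
        return 0  # the mixing term is already below beta at T = 0
    # T >= log(beta / c0) / log(rho); both logarithms are negative here.
    return math.ceil(math.log(beta / c0) / math.log(rho))

c0, rho, beta = 2.0, 0.9, 0.01   # illustrative values, not from the paper
T_min = min_trajectory_length(c0, rho, beta)
print(T_min, c0 * rho**T_min)
```

Because $T$ also multiplies the first term of (55), a smaller target $\beta$ lengthens the trajectory only logarithmically but enlarges the policy-drift term linearly in that length.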

Appendix C Proof of Lemma 3

Let the state be sampled as $\bar{s}_{k,t}\sim\mu_{v}$ and the action as $\bar{a}_{k,t}\sim\pi_{v}$. For a given actor parameter $v$, there always exists a unique optimal critic parameter $w^{*}_{v}$ that satisfies $\bar{\Phi}_{v}w^{*}_{v}=\bar{b}_{v}$ with $\bar{\Phi}_{v}=\mathbb{E}[\bar{\phi}_{k,0}[\bar{\phi}_{k,0}-\gamma^{T}\bar{\phi}_{k,T}]^{\dagger}]\succeq\sigma I$ and $\bar{b}_{v}=\mathbb{E}[\bar{\phi}_{k,0}\sum_{t=0}^{T-1}\gamma^{t}\bar{r}_{k,t}]$.

Differentiating both sides of $\bar{\Phi}_{v}w^{*}_{v}-\bar{b}_{v}=0$ with respect to $v$, we obtain

\begin{equation}
\nabla\bar{\Phi}_{v}w^{*}_{v}+\bar{\Phi}_{v}\nabla w^{*}_{v}=\nabla\bar{b}_{v}
\tag{56}
\end{equation}

where $\nabla\bar{b}_{v}=\mathbb{E}[\bar{\phi}_{k,0}\sum_{t=0}^{T-1}\gamma^{t}\bar{r}_{k,t}\nabla\log\pi_{v}(a_{k,t}|s_{k,t})]$ and $\nabla\bar{\Phi}_{v}w^{*}_{v}=\mathbb{E}[\bar{\phi}_{k,0}[\bar{\phi}_{k,0}-\gamma^{T}\bar{\phi}_{k,T}]^{\dagger}w^{*}_{v}\nabla\log\pi_{v}(a_{k,t}|s_{k,t})]$.

Based on (56), we obtain the Jacobian matrix as $\nabla w^{*}_{v}=\bar{\Phi}_{v}^{-1}[\nabla\bar{b}_{v}-\nabla\bar{\Phi}_{v}w^{*}_{v}]$. Let two optimal critic parameters $w^{*}_{v}$ and $w^{*}_{v^{\prime}}$ satisfy $\bar{\Phi}_{v}w^{*}_{v}=\bar{b}_{v}$ and $\bar{\Phi}_{v^{\prime}}w^{*}_{v^{\prime}}=\bar{b}_{v^{\prime}}$. To derive the Lipschitz continuity of $w^{*}_{v}$ and the bound on $\nabla w^{*}_{v}$, we use the following inequalities based on Lemma 5:

\begin{align}
\|\bar{b}_{v}-\bar{b}_{v^{\prime}}\| &\leq(1+c_{2})c_{1}(\gamma)R_{r}|\mathcal{A}|L_{\pi}\|v-v^{\prime}\| \tag{57a}\\
\|\bar{\Phi}_{v}-\bar{\Phi}_{v^{\prime}}\| &\leq 2(1+\gamma^{T})c_{2}|\mathcal{A}|L_{\pi}\|v-v^{\prime}\| \tag{57b}\\
\|\bar{b}_{v}\| &\leq c_{1}(\gamma)R_{r},\quad\|\bar{\Phi}_{v}^{-1}\|\leq\sigma^{-1} \tag{57c}\\
\|\nabla\bar{b}_{v}\| &\leq c_{1}(\gamma)R_{r}R_{\pi},\quad\|\nabla\bar{\Phi}_{v}w^{*}_{v}\|\leq(1+\gamma^{T})R_{\pi}R_{w}. \tag{57d}
\end{align}

Then, we derive the Lipschitz continuity of $w^{*}_{v}$ as

\begin{align}
&\|w^{*}_{v}-w^{*}_{v^{\prime}}\| \tag{58a}\\
&=\|\bar{\Phi}_{v}^{-1}\bar{b}_{v}-\bar{\Phi}_{v^{\prime}}^{-1}\bar{b}_{v^{\prime}}\| \tag{58b}\\
&\leq\|\bar{\Phi}_{v}^{-1}\bar{b}_{v}-\bar{\Phi}_{v}^{-1}\bar{b}_{v^{\prime}}\|+\|\bar{\Phi}_{v}^{-1}\bar{b}_{v^{\prime}}-\bar{\Phi}_{v^{\prime}}^{-1}\bar{b}_{v^{\prime}}\| \tag{58c}\\
&\leq\|\bar{\Phi}_{v}^{-1}\|\|\bar{b}_{v}-\bar{b}_{v^{\prime}}\|+\|\bar{\Phi}_{v}^{-1}\|\|\bar{\Phi}_{v}-\bar{\Phi}_{v^{\prime}}\|\|\bar{\Phi}_{v^{\prime}}^{-1}\|\|\bar{b}_{v^{\prime}}\| \tag{58d}\\
&\leq L_{*}\|v-v^{\prime}\| \tag{58e}
\end{align}

where $L_{*}:=[1+c_{2}+2(1+\gamma^{T})\sigma^{-1}c_{2}]\sigma^{-1}c_{1}(\gamma)R_{r}|\mathcal{A}|L_{\pi}$.
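The step from (58c) to (58d) rests on the identity $\bar{\Phi}_{v}^{-1}-\bar{\Phi}_{v^{\prime}}^{-1}=\bar{\Phi}_{v}^{-1}(\bar{\Phi}_{v^{\prime}}-\bar{\Phi}_{v})\bar{\Phi}_{v^{\prime}}^{-1}$ together with $\|\bar{\Phi}_{v}^{-1}\|\leq\sigma^{-1}$. A small numerical sanity check is sketched below. The random matrices and vectors are toy stand-ins rather than quantities from an actual MDP, and `random_phi` is a hypothetical helper that builds a matrix whose symmetric part is at least $\sigma I$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 4, 0.5

def random_phi():
    """Toy (non-symmetric) matrix whose symmetric part dominates sigma * I."""
    m = rng.standard_normal((d, d))
    k = rng.standard_normal((d, d))
    return sigma * np.eye(d) + m @ m.T / d + (k - k.T)

def spectral_norm(a):
    return np.linalg.norm(a, 2)

phi_v, phi_vp = random_phi(), random_phi()
b_v, b_vp = rng.standard_normal(d), rng.standard_normal(d)

# Optimal critic parameters for the two toy actor parameters.
w_v = np.linalg.solve(phi_v, b_v)
w_vp = np.linalg.solve(phi_vp, b_vp)

inv_v, inv_vp = np.linalg.inv(phi_v), np.linalg.inv(phi_vp)
first = inv_v @ (b_v - b_vp)                        # first term of (58c)
second = inv_v @ (phi_vp - phi_v) @ inv_vp @ b_vp   # second term, via the identity

# Right-hand side of (58d) with both inverse norms replaced by 1 / sigma.
bound = (np.linalg.norm(b_v - b_vp) / sigma
         + spectral_norm(phi_v - phi_vp) * np.linalg.norm(b_vp) / sigma**2)

print(np.linalg.norm(w_v - w_vp), bound)
```

Substituting (57a) to (57c) into the two terms of `bound` gives the constant $L_{*}$ stated above.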

Based on (57), the bound on the Jacobian matrix $\nabla w^{*}_{v}$ is derived as

\begin{equation}
\|\bar{\Phi}_{v}^{-1}[\nabla\bar{b}_{v}-\nabla\bar{\Phi}_{v}w^{*}_{v}]\|\leq\|\bar{\Phi}_{v}^{-1}\|\|\nabla\bar{b}_{v}-\nabla\bar{\Phi}_{v}w^{*}_{v}\|\leq G_{*}
\tag{59}
\end{equation}

where $G_{*}:=\frac{R_{\pi}}{\sigma}[c_{1}(\gamma)R_{r}+(1+\gamma^{T})R_{w}]$.
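The Jacobian expression bounded in (59) comes from implicitly differentiating $\bar{\Phi}_{v}w^{*}_{v}=\bar{b}_{v}$. The sketch below checks that formula against a central finite difference on a toy problem. It assumes a scalar parameter $v$ and affine families `phi(v)` and `b(v)`, which are hypothetical simplifications: in the paper the dependence on $v$ enters through the sampling distribution instead.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3

# Hypothetical affine families Phi(v) = phi0 + v * phi1 and b(v) = b0 + v * b1.
phi0 = 2.0 * np.eye(d) + 0.1 * rng.standard_normal((d, d))
phi1 = 0.1 * rng.standard_normal((d, d))   # derivative of Phi with respect to v
b0 = rng.standard_normal(d)
b1 = rng.standard_normal(d)                # derivative of b with respect to v

def phi(v):
    return phi0 + v * phi1

def b(v):
    return b0 + v * b1

def w_star(v):
    """Solution of Phi(v) w = b(v)."""
    return np.linalg.solve(phi(v), b(v))

v = 0.3
# Implicit-differentiation formula: dw*/dv = Phi^{-1} [db/dv - (dPhi/dv) w*].
jac_formula = np.linalg.solve(phi(v), b1 - phi1 @ w_star(v))

# Central finite-difference approximation of the same derivative.
eps = 1e-6
jac_fd = (w_star(v + eps) - w_star(v - eps)) / (2.0 * eps)

print(np.max(np.abs(jac_formula - jac_fd)))
```

The printed discrepancy should be at the level of finite-difference error, which supports reading (59) as a bound on the sensitivity of the optimal critic to the actor parameter.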

Appendix D Proof of Lemma 4

Based on (11), we obtain

\begin{equation}
\|h(v,w;o_{k,t})-h(v,w^{\prime};o_{k,t})\|\leq(1+\gamma)R_{\pi}\|w-w^{\prime}\|.
\tag{60}
\end{equation}

Following the definitions in (11) and (60), we obtain

\begin{equation}
\|H(v,w;O_{k})-H(v,w^{\prime};O_{k})\|\leq(1+\gamma)R_{\pi}\|w-w^{\prime}\|
\tag{61}
\end{equation}

where $w\in\mathbb{R}^{d_{w}}$ and $w^{\prime}\in\mathbb{R}^{d_{w}}$.
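To see why the constant in (61) is no larger than the one in (60), recall that $H(v,w;O_{k})=(1-\gamma)\sum_{t=0}^{T-1}\gamma^{t}h(v,w;o_{k,t})$. The triangle inequality then scales the per-sample constant by $(1-\gamma)\sum_{t=0}^{T-1}\gamma^{t}=1-\gamma^{T}\leq 1$. The short sketch below evaluates this weight; the function names and the values of `gamma`, `T`, and `R_pi` are illustrative assumptions.

```python
def discount_weight(gamma, T):
    """Total weight (1 - gamma) * sum_{t=0}^{T-1} gamma**t that H places on h."""
    return (1.0 - gamma) * sum(gamma**t for t in range(T))

def lipschitz_constant_H(gamma, T, R_pi):
    """Constant obtained by applying (60) term by term inside H."""
    return discount_weight(gamma, T) * (1.0 + gamma) * R_pi

gamma, T, R_pi = 0.95, 20, 1.5   # illustrative values, not from the paper
weight = discount_weight(gamma, T)
tight = lipschitz_constant_H(gamma, T, R_pi)
loose = (1.0 + gamma) * R_pi     # the constant stated in (61)

print(weight, 1.0 - gamma**T)
print(tight, loose)
```

The term-by-term constant is slightly smaller than the stated one; (61) simply drops the factor $1-\gamma^{T}$.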

Appendix E Proof of Theorem 1

Before establishing the convergence of the HB-A2C actor, we first introduce several auxiliary inequalities. Based on the policy gradient $\nabla J(v_k)=\mathbb{E}[\nabla J(v_k,w_k^*;s_{k,0})]$ with $\nabla J(v_k,w_k^*;s_{k,0})=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\int_{o_{k,t}}\mathcal{P}_{k,t}\otimes\pi_k\otimes\mathbb{P}(o_{k,t})\,h(v_k,w_k^*;o_{k,t})\,do_{k,t}$ and the stochastic policy gradient $H(v_k,w_k^*;O_k)=(1-\gamma)\sum_{t=0}^{T-1}\gamma^{t}h(v_k,w_k^*;o_{k,t})$, we have

\begin{equation}
\begin{split}
&\nabla J(v_k,w_k^*;s_{k,0})-\mathbb{E}[H(v_k,w_k^*;O_k)\,|\,\mathcal{F}_k]\\
&=(1-\gamma)\sum_{t=T}^{\infty}\gamma^{t}\int_{o_{k,t}}\mathcal{P}_{k,t}\otimes\pi_k\otimes\mathbb{P}(o_{k,t})\,h(v_k,w_k^*;o_{k,t})\,do_{k,t}.
\end{split}
\tag{62}
\end{equation}

Based on (62) and setting $T\geq\frac{\log\beta}{2\log\gamma}$, we obtain

\begin{equation}
\begin{split}
&\|\nabla J(v_k)-\mathbb{E}[H(v_k,w_k^*;O_k)]\|^2\\
&=\|\mathbb{E}[\nabla J(v_k,w_k^*;s_{k,0})-\mathbb{E}[H(v_k,w_k^*;O_k)\,|\,\mathcal{F}_k]]\|^2\\
&\leq R_h^2\beta.
\end{split}
\tag{63}
\end{equation}
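The role of the rollout length $T$ in (63) can be illustrated numerically. The sketch below assumes, as the constant $R_h$ in (63) suggests, that $\|h\|\leq R_h$, so that the tail in (62) has norm at most $R_h(1-\gamma)\sum_{t\geq T}\gamma^{t}=R_h\gamma^{T}$; the condition on $T$ then amounts to $\gamma^{2T}\leq\beta$. The function names and the restriction $\beta,\gamma\in(0,1)$ are assumptions made for this illustration only.

```python
import math


def min_rollout_length(beta: float, gamma: float) -> int:
    """Smallest integer T with T >= log(beta) / (2 * log(gamma)).

    Assumes 0 < beta < 1 and 0 < gamma < 1 (illustrative restriction).
    """
    if not (0.0 < beta < 1.0 and 0.0 < gamma < 1.0):
        raise ValueError("beta and gamma are assumed to lie in (0, 1)")
    return max(0, math.ceil(math.log(beta) / (2.0 * math.log(gamma))))


def tail_weight(gamma: float, T: int) -> float:
    """Closed form of (1 - gamma) * sum_{t >= T} gamma**t, which equals gamma**T."""
    return gamma ** T


if __name__ == "__main__":
    beta, gamma = 0.01, 0.9
    T = min_rollout_length(beta, gamma)
    # The squared tail weight drops below beta exactly at this T.
    print(T, tail_weight(gamma, T) ** 2 <= beta)
```

With this choice the truncation bias of the stochastic policy gradient is of the same order as the critic step size $\beta$, which is how it enters (64c).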

Leveraging (63), we obtain

\begin{align}
&\|\nabla J(v_k)-\mathbb{E}[H(v_k,w_k;O_k)]\|^2 \tag{64a}\\
&\leq 2\|\nabla J(v_k)-\mathbb{E}[H(v_k,w_k^*;O_k)]\|^2 \notag\\
&\quad+2\|\mathbb{E}[H(v_k,w_k^*;O_k)]-\mathbb{E}[H(v_k,w_k;O_k)]\|^2 \tag{64b}\\
&\leq 2R_h^2\beta+2(1+\gamma)^2R_\pi^2\,\mathbb{E}[\|\Delta_k\|^2] \tag{64c}
\end{align}

where (64c) follows from Lemma 4 and (63).

Following the fact that $\langle a,b\rangle\geq\frac{1}{2}\|a\|^2-\frac{1}{2}\|a-b\|^2$, we obtain the lower bound of $\langle\nabla J(v_k),\mathbb{E}[H(v_k,w_k;O_k)]\rangle$ as

\begin{equation}
\begin{split}
&\langle\nabla J(v_k),\mathbb{E}[H(v_k,w_k;O_k)]\rangle\\
&\geq\frac{1}{2}\|\nabla J(v_k)\|^2-\frac{1}{2}\|\nabla J(v_k)-\mathbb{E}[H(v_k,w_k;O_k)]\|^2\\
&\geq\frac{1}{2}\|\nabla J(v_k)\|^2-R_h^2\beta-(1+\gamma)^2R_\pi^2\,\mathbb{E}[\|\Delta_k\|^2].
\end{split}
\tag{65}
\end{equation}
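For completeness, the elementary inequality behind (65) can be verified by expanding a squared norm. The short derivation below is an added illustration for generic vectors $a$ and $b$:

\begin{equation*}
\|a-b\|^2=\|a\|^2-2\langle a,b\rangle+\|b\|^2
\;\Longrightarrow\;
\langle a,b\rangle=\frac{1}{2}\|a\|^2+\frac{1}{2}\|b\|^2-\frac{1}{2}\|a-b\|^2\geq\frac{1}{2}\|a\|^2-\frac{1}{2}\|a-b\|^2.
\end{equation*}

Taking $a=\nabla J(v_k)$ and $b=\mathbb{E}[H(v_k,w_k;O_k)]$ gives the first inequality in (65); the second then follows by halving the bound in (64c).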

Based on the Lipschitz continuity of the overall reward (16) and the recursion in (14), we have

\begin{equation}
\alpha\,\mathbb{E}[\Omega_k]\leq J(v_{k+1})-J(v_k)+\frac{1}{2}LR_h^2\alpha^2
\tag{66}
\end{equation}

where $\Omega_k=\langle\nabla J(v_k),H(v_k,w_k;O_k)\rangle$.

Summing (66) over $k=0,1,\ldots,K-1$, we have

\begin{equation}
\alpha\sum_{k=0}^{K-1}\mathbb{E}[\Omega_k]\leq J(v_K)-J(v_0)+\frac{1}{2}LR_h^2\alpha^2K.
\tag{67}
\end{equation}

Substituting (65) into (67), we have

\begin{equation}
\begin{split}
&\alpha\sum_{k=0}^{K-1}\Big[\frac{1}{2}\|\nabla J(v_k)\|^2-(1+\gamma)^2R_\pi^2\,\mathbb{E}[\|\Delta_k\|^2]\Big]\\
&\leq J(v_K)-J(v_0)+\frac{1}{2}LR_h^2\alpha^2K+R_h^2\alpha\beta K.
\end{split}
\tag{68}
\end{equation}

Dividing both sides of (68) by $K$, we complete the proof.
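To make the dependence on the step sizes explicit, (68) can also be rearranged as follows. This is an illustrative rewriting that additionally divides by $\alpha>0$ and moves the critic error to the right-hand side; the normalization used in the statement of Theorem 1 may differ:

\begin{equation*}
\frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{2}\|\nabla J(v_k)\|^2
\leq\frac{J(v_K)-J(v_0)}{\alpha K}+\frac{1}{2}LR_h^2\alpha+R_h^2\beta
+\frac{(1+\gamma)^2R_\pi^2}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\Delta_k\|^2].
\end{equation*}

In this form, the actor's average gradient norm is controlled by an initialization term, two step-size terms, and the average critic error $\mathbb{E}[\|\Delta_k\|^2]$, which is the quantity analyzed in the proof of Theorem 2.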

Appendix F Proof of Theorem 2

Based on the recursion (12a), we have $n_k=\eta_1\sum_{\tau=0}^{k}(1-\eta_1)^{k-\tau}g(w_\tau;O_\tau)$ when $n_{-1}=0$. Defining $\Delta_k=w_k-w_k^*$, we start to analyze the convergence of the critic parameter $w_k$ by considering the following one-step drift

\begin{equation}
\begin{split}
&\frac{1}{2}\|\Delta_{k+1}\|^2-\frac{1}{2}\|\Delta_k\|^2\\
&\leq\frac{1}{2}\|\beta n_k+w_{k+1}^*-w_k^*\|^2\quad\text{(gradient variance)}\\
&\quad+\langle\Delta_k,w_k^*-w_{k+1}^*\rangle\quad\text{(optimality drift)}\\
&\quad-\beta\langle\Delta_k,n_k\rangle\quad\text{(gradient progress)}
\end{split}
\tag{69}
\end{equation}

where $n_k$ is obtained via recursion (12b).
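The unrolled form of the momentum term stated at the start of this proof can be checked against the one-step recursion $n_k=\eta_1 g(w_k;O_k)+(1-\eta_1)n_{k-1}$ with $n_{-1}=0$. The sketch below is only an illustration: the gradients are arbitrary stand-in vectors represented as Python lists, and all function names are hypothetical.

```python
def momentum_recursive(grads, eta1):
    """Run n_k = eta1 * g_k + (1 - eta1) * n_{k-1}, starting from n_{-1} = 0."""
    n = [0.0] * len(grads[0])
    history = []
    for g in grads:
        n = [eta1 * gi + (1.0 - eta1) * ni for gi, ni in zip(g, n)]
        history.append(n)
    return history


def momentum_unrolled(grads, eta1, k):
    """Evaluate n_k = eta1 * sum_{tau=0}^{k} (1 - eta1)**(k - tau) * g_tau."""
    dim = len(grads[0])
    return [
        eta1 * sum((1.0 - eta1) ** (k - tau) * grads[tau][i] for tau in range(k + 1))
        for i in range(dim)
    ]


def weight_sum(eta1, k):
    """Total weight eta1 * sum_{tau=0}^{k} (1 - eta1)**(k - tau) = 1 - (1 - eta1)**(k + 1)."""
    return eta1 * sum((1.0 - eta1) ** (k - tau) for tau in range(k + 1))


if __name__ == "__main__":
    grads = [[1.0, -2.0], [0.5, 0.0], [-1.0, 3.0]]  # stand-in gradients
    hist = momentum_recursive(grads, eta1=0.3)
    print(hist[2], momentum_unrolled(grads, 0.3, 2))
```

Because the weights sum to $1-(1-\eta_1)^{k+1}\leq 1$ for $\eta_1\in(0,1]$, the momentum term is a sub-convex combination of past stochastic gradients, so any uniform norm bound on $g$ carries over to $n_k$.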

F-A Analysis of the Gradient Variance

We start by analyzing the gradient variance term as

\begin{equation}
\frac{1}{2}\|\beta n_k+w_{k+1}^*-w_k^*\|^2\leq\beta^2\|n_k\|^2+\|w_{k+1}^*-w_k^*\|^2
\tag{70}
\end{equation}

where the inequality follows from $\frac{1}{2}\|a+b\|^2\leq\|a\|^2+\|b\|^2$.
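This elementary inequality can be verified in one line; the derivation below is added for illustration, with generic vectors $a$ and $b$:

\begin{equation*}
\|a+b\|^2=\|a\|^2+2\langle a,b\rangle+\|b\|^2\leq 2\|a\|^2+2\|b\|^2,
\end{equation*}

where the last step uses $2\langle a,b\rangle\leq\|a\|^2+\|b\|^2$, which is equivalent to $\|a-b\|^2\geq 0$. Halving both sides and taking $a=\beta n_k$ and $b=w_{k+1}^*-w_k^*$ yields (70).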

Based on (19), the first term on the right-hand side of (70) is upper-bounded as

\begin{equation}
\beta^2\|n_k\|^2\leq R_g^2\beta^2.
\tag{71}
\end{equation}

Based on Lemma 3, the second term on the right-hand side of (70) is upper-bounded as

\begin{equation}
\|w_k^*-w_{k+1}^*\|^2\leq L_*^2\|v_{k+1}-v_k\|^2\leq L_*^2R_h^2\alpha^2.
\tag{72}
\end{equation}

Summing (71) and (72), we obtain the upper bound of the gradient variance term as

\begin{equation}
\frac{1}{2}\|\beta n_k+w_{k+1}^*-w_k^*\|^2\leq R_g^2\beta^2+L_*^2R_h^2\alpha^2.
\tag{73}
\end{equation}

F-B Analysis of the Optimality Drift

Based on Lemma 3, there exists a Jacobian matrix $\nabla^{\dagger}w^*_v$ such that $w^*_k-w^*_{k+1}=\nabla^{\dagger}w^*_v(v_k-v_{k+1})$. Therefore, we can recast the optimality drift in (69) as

\begin{align}
&\langle\Delta_k,w^*_k-w^*_{k+1}\rangle \tag{74a}\\
&=\langle\nabla w^*_v\Delta_k,v_k-v_{k+1}\rangle \tag{74b}\\
&=-\alpha\langle\nabla w^*_v\Delta_k,H(v_k,w_k;O_k)\rangle \tag{74c}\\
&=-\alpha\langle\nabla w^*_v\Delta_k,H(v_k,w_k;O_k)-H(v_k,w_k^*;O_k)\rangle \notag\\
&\quad-\alpha\langle\nabla w^*_v\Delta_k,H(v_k,w_k^*;O_k)-\nabla J(v_k)\rangle \notag\\
&\quad-\alpha\langle\nabla w^*_v\Delta_k,\nabla J(v_k)\rangle \tag{74d}
\end{align}

where (74c) follows from the recursion in (14).

Based on Lemma 4, the three terms on the right-hand side of (74d) can be bounded as

\begin{align}
&|\langle\nabla w^*_v\Delta_k,H(v_k,w_k;O_k)-H(v_k,w_k^*;O_k)\rangle| \notag\\
&\leq(1+\gamma)R_\pi G_*\|\Delta_k\|^2 \tag{75a}\\
&|\mathbb{E}[\langle\nabla w^*_v\Delta_k,H(v_k,w_k^*;O_k)-\nabla J(v_k)\rangle\,|\,\mathcal{F}_k]| \notag\\
&\leq 2\beta R_wR_hG_* \tag{75b}\\
&|\langle\nabla w^*_v\Delta_k,\nabla J(v_k)\rangle|\leq 2G_*^2\|\Delta_k\|^2+\frac{1}{4}\|\nabla J(v_k)\|^2. \tag{75c}
\end{align}

Substituting (75) into the expectation of (74d), we have

\begin{equation}
\begin{split}
\mathbb{E}[\langle\Delta_k,w^*_k-w^*_{k+1}\rangle]\leq{}&[(1+\gamma)R_\pi G_*+2G_*^2]\,\alpha\,\mathbb{E}[\|\Delta_k\|^2]\\
&+\frac{1}{4}\alpha\|\nabla J(v_k)\|^2+2\alpha\beta R_wR_hG_*.
\end{split}
\tag{76}
\end{equation}

Substituting (73) and (76) into the expectation of (69), we have

\begin{equation}
\begin{split}
&\beta\,\mathbb{E}[\langle\Delta_k,n_k\rangle]-[(1+\gamma)R_\pi G_*+2G_*^2]\,\alpha\,\mathbb{E}[\|\Delta_k\|^2]\\
&\leq\frac{1}{2}\big[\mathbb{E}[\|\Delta_k\|^2]-\mathbb{E}[\|\Delta_{k+1}\|^2]\big]+\frac{1}{4}\alpha\|\nabla J(v_k)\|^2\\
&\quad+2\alpha\beta R_wR_hG_*+R_g^2\beta^2+L_*^2R_h^2\alpha^2.
\end{split}
\tag{77}
\end{equation}

F-C Analysis of the Gradient Progress

Based on (12a), we can decompose the gradient progress term in (69) as

\begin{align}
&\langle\Delta_k,n_k\rangle \tag{78a}\\
&=\eta_1\langle\Delta_k,g(w_k;O_k)\rangle+(1-\eta_1)\langle\Delta_k,n_{k-1}\rangle \tag{78b}\\
&=\eta_1\langle\Delta_k,g(w_k;O_k)\rangle+(1-\eta_1)\langle\Delta_{k-1},n_{k-1}\rangle \notag\\
&\quad+(1-\eta_1)\langle\Delta_k-\Delta_{k-1},n_{k-1}\rangle. \tag{78c}
\end{align}
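The decomposition in (78) can be checked numerically. The sketch below assumes that (12a) is the heavy-ball recursion $n_k=\eta_1 g(w_k;O_k)+(1-\eta_1)n_{k-1}$, as (78b) suggests; the random vectors are hypothetical stand-ins for $\Delta_k$, $\Delta_{k-1}$, $g(w_k;O_k)$, and $n_{k-1}$, and the value of $\eta_1$ is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
eta1 = 0.3  # illustrative momentum factor, assumed to lie in (0, 1]

# Hypothetical stand-ins for Delta_k, Delta_{k-1}, g(w_k; O_k), n_{k-1}.
delta_k, delta_prev, g_k, n_prev = rng.normal(size=(4, 5))

# Assumed form of (12a): heavy-ball averaging of the stochastic gradient.
n_k = eta1 * g_k + (1.0 - eta1) * n_prev

lhs = delta_k @ n_k  # left-hand side, (78a)
rhs = (eta1 * (delta_k @ g_k)
       + (1.0 - eta1) * (delta_prev @ n_prev)
       + (1.0 - eta1) * ((delta_k - delta_prev) @ n_prev))  # right-hand side, (78c)

decomposition_gap = abs(lhs - rhs)  # zero up to floating-point rounding
```

Because the identity is linear in $n_k$, it holds for any vectors; only the last inner product needs to be bounded separately.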

The last term in (78c) is lower-bounded as

\begin{align}
&(1-\eta_1)\langle\Delta_k-\Delta_{k-1},n_{k-1}\rangle \tag{79a}\\
&=(1-\eta_1)\langle w_k-w_{k-1}+w_{k-1}^*-w_k^*,n_{k-1}\rangle \tag{79b}\\
&\ge-(1-\eta_1)\|n_{k-1}\|\big[\|w_k-w_{k-1}\|+\|w_k^*-w_{k-1}^*\|\big] \tag{79c}\\
&\ge-(1-\eta_1)R_g\big[R_g\beta+L_*\|v_k-v_{k-1}\|\big] \tag{79d}
\end{align}

where (79d) follows from (19), (20), and Lemma 3.

Substituting the expectation of (79) into the expectation of (78), we obtain

\begin{equation}
\begin{split}
&\eta_1\mathbb{E}[\langle\Delta_k,g(w_k;O_k)\rangle]\\
&\le(1-\eta_1)R_g\big[R_g\beta+L_*\mathbb{E}[\|v_k-v_{k-1}\|]\big]\\
&\quad+\mathbb{E}[\langle\Delta_k,n_k\rangle]-(1-\eta_1)\mathbb{E}[\langle\Delta_{k-1},n_{k-1}\rangle].
\end{split}
\tag{80}
\end{equation}

Based on (7), the left-hand side of (80) can be recast as

\begin{align}
&\eta_1\langle\Delta_k,g(w_k;O_k)\rangle \tag{81a}\\
&=\eta_1\langle\Delta_k,\mathbb{E}[g(w_k;\bar{O}_k)]\rangle+\eta_1\langle\Delta_k,\zeta(v_k,w_k;O_k)\rangle \tag{81b}\\
&=\eta_1\langle\Delta_k,\mathbb{E}[g(w_k;\bar{O}_k)]-\mathbb{E}[g(w_k^*;\bar{O}_k)]\rangle \notag\\
&\quad+\eta_1\langle\Delta_k,\zeta(v_k,w_k;O_k)\rangle \tag{81c}\\
&\ge\eta_1\sigma\|\Delta_k\|^2+\eta_1\langle\Delta_k,\zeta(v_k,w_k;O_k)\rangle \tag{81d}\\
&\ge\eta_1\sigma\|\Delta_k\|^2+\eta_1\langle\Delta_k-\Delta_{k-1},\zeta(v_k,w_k;O_k)\rangle \notag\\
&\quad+\eta_1\langle\Delta_{k-1},\zeta(v_k,w_k;O_k)-\zeta(v_k,w_{k-1};O_k)\rangle \notag\\
&\quad+\eta_1\langle\Delta_{k-1},\zeta(v_k,w_{k-1};O_k)\rangle \tag{81e}
\end{align}

where (81c) is based on the fact that $\mathbb{E}[g(w_k^*;\bar{O}_k)]=0$, and (81d) is based on (9).

Based on (7) and Lemma 1, we obtain the upper bound of $\zeta(v_k,w_k;O_k)$ as $\|\zeta(v_k,w_k;O_k)\|\le 2R_g$. Together with the inequality (20) and the Lipschitz continuity in Lemma 3, we derive the upper bounds for the three terms on the right-hand side of (81e) as

\begin{align}
&|\langle\Delta_k-\Delta_{k-1},\zeta(v_k,w_k;O_k)\rangle| \notag\\
&\le 2R_g\big[R_g\beta+L_*\|v_k-v_{k-1}\|\big] \tag{82a}\\
&|\langle\Delta_{k-1},\zeta(v_k,w_k;O_k)-\zeta(v_k,w_{k-1};O_k)\rangle| \notag\\
&\le 16R_wR_g\beta \tag{82b}\\
&|\mathbb{E}[\langle\Delta_{k-1},\zeta(v_k,w_{k-1};O_k)\rangle\mid\mathcal{F}_{k-1}]| \notag\\
&\le 2R_wR_g\big[\beta+(c_2+2T)|\mathcal{A}|L_\pi\|v_k-v_{k-1}\|\big] \tag{82c}
\end{align}

with $T\ge\log(c_0^{-1}\beta)/\log\rho$.
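As a rough illustration of how $T$ scales with the critic step size $\beta$, the sketch below evaluates the smallest integer that meets the condition on $T$. It assumes the condition stems from a geometric mixing bound of the form $c_0\rho^T\le\beta$ with $\rho\in(0,1)$; the function name and the numerical values of $c_0$, $\rho$, and $\beta$ are hypothetical.

```python
import math


def min_mixing_steps(beta: float, c0: float, rho: float) -> int:
    """Smallest integer T >= 0 with T >= log(beta / c0) / log(rho)."""
    if not (0.0 < rho < 1.0) or beta <= 0.0 or c0 <= 0.0:
        raise ValueError("need 0 < rho < 1, beta > 0, and c0 > 0")
    # log(rho) < 0, so a smaller beta gives a larger ratio, i.e. a longer T.
    return max(0, math.ceil(math.log(beta / c0) / math.log(rho)))


# Illustrative values only; they are not taken from the paper.
T = min_mixing_steps(beta=1e-3, c0=2.0, rho=0.9)
```

Under this assumption $T$ grows only logarithmically in $1/\beta$, which is why it enters $c_3'$ as a mild factor.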

Summing (82a)–(82c), we obtain

\begin{equation}
\begin{split}
&|\mathbb{E}[\langle\Delta_k,\zeta(v_k,w_k;O_k)\rangle\mid\mathcal{F}_{k-1}]|\\
&\le c_3'\,\mathbb{E}[\|v_k-v_{k-1}\|\mid\mathcal{F}_{k-1}]+c_4'\beta
\end{split}
\tag{83}
\end{equation}

where $c_3':=2R_g[L_*+(c_2+2T)|\mathcal{A}|L_\pi R_w]$ and $c_4':=2R_g(R_g+9R_w)$.
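The constants in (83) can be sanity-checked by adding the right-hand sides of (82a)–(82c) for arbitrary inputs. The sketch below does this with hypothetical numerical values; `dv` stands in for $\|v_k-v_{k-1}\|$ and `card_A` for $|\mathcal{A}|$, and only the algebra is being verified.

```python
import math

# Hypothetical values for the constants; they are not taken from the paper.
R_g, R_w, L_star, L_pi, c2, T, card_A = 1.5, 2.0, 0.7, 0.4, 3.0, 10, 4
beta, dv = 0.01, 0.05  # dv stands in for ||v_k - v_{k-1}||

# Right-hand sides of (82a), (82b), and (82c).
bound_82a = 2 * R_g * (R_g * beta + L_star * dv)
bound_82b = 16 * R_w * R_g * beta
bound_82c = 2 * R_w * R_g * (beta + (c2 + 2 * T) * card_A * L_pi * dv)

# Constants as defined after (83).
c3_prime = 2 * R_g * (L_star + (c2 + 2 * T) * card_A * L_pi * R_w)
c4_prime = 2 * R_g * (R_g + 9 * R_w)

total = bound_82a + bound_82b + bound_82c
combined = c3_prime * dv + c4_prime * beta
constants_match = math.isclose(total, combined, rel_tol=1e-12)
```

The $\beta$ coefficients add up as $2R_g^2+16R_wR_g+2R_wR_g=2R_g(R_g+9R_w)$, which is where the factor 9 in $c_4'$ comes from.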

Substituting (83) into (81) and taking iterated expectations, we obtain

\begin{equation}
\begin{split}
&\eta_1\mathbb{E}[\langle\Delta_k,g(w_k;O_k)\rangle]\\
&\ge\eta_1\sigma\mathbb{E}[\|\Delta_k\|^2]-\eta_1c_3'\,\mathbb{E}[\|v_k-v_{k-1}\|]-\eta_1c_4'\beta.
\end{split}
\tag{84}
\end{equation}

Substituting (84) into (80), we obtain

\begin{equation}
\begin{split}
\eta_1\sigma\mathbb{E}[\|\Delta_k\|^2]&\le c_3\mathbb{E}[\|v_k-v_{k-1}\|]+c_4\beta+\mathbb{E}[\langle\Delta_k,n_k\rangle]\\
&\quad-(1-\eta_1)\mathbb{E}[\langle\Delta_{k-1},n_{k-1}\rangle]
\end{split}
\tag{85}
\end{equation}

where $c_3:=[(1+\eta_1)L_*+2\eta_1(c_2+2T)|\mathcal{A}|L_\pi R_w]R_g$ and $c_4:=[2\eta_1(R_g+9R_w)+(1-\eta_1)R_g]R_g$.

Summing (85) over $k=0,\ldots,K-1$ and recalling that $n_{-1}=0$, we obtain

\begin{align}
&\eta_1\sigma\sum_{k=0}^{K-1}\mathbb{E}[\|\Delta_k\|^2] \tag{86a}\\
&\le c_3\sum_{k=0}^{K-1}\mathbb{E}[\|v_k-v_{k-1}\|]+c_4\beta K+\eta_1\sum_{k=0}^{K-1}\mathbb{E}[\langle\Delta_k,n_k\rangle] \notag\\
&\quad+(1-\eta_1)\mathbb{E}[\langle\Delta_{K-1},n_{K-1}\rangle] \tag{86b}\\
&\le c_3\sum_{k=0}^{K-1}\mathbb{E}[\|v_k-v_{k-1}\|]+c_4\beta K \notag\\
&\quad+2(1-\eta_1)R_wR_g+\eta_1\sum_{k=0}^{K-1}\mathbb{E}[\langle\Delta_k,n_k\rangle] \tag{86c}
\end{align}

where (86c) follows from the facts that $\|\Delta_{K-1}\|\le 2R_w$ and $\|n_{K-1}\|\le R_g$.
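The telescoping step behind (86b) is easy to verify numerically. In the sketch below, the array `a` is a hypothetical stand-in for the sequence $\mathbb{E}[\langle\Delta_k,n_k\rangle]$, with the convention $a_{-1}=0$ that follows from $n_{-1}=0$; the values of $\eta_1$ and $K$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
eta1, K = 0.25, 50  # illustrative values

# a[k] stands in for E[<Delta_k, n_k>]; a_{-1} = 0 because n_{-1} = 0.
a = rng.normal(size=K)
a_prev = np.concatenate(([0.0], a[:-1]))

# Sum over k of the last two terms on the right-hand side of (85).
summed = np.sum(a - (1.0 - eta1) * a_prev)

# Corresponding terms in (86b): eta1 * sum_k a_k + (1 - eta1) * a_{K-1}.
closed_form = eta1 * np.sum(a) + (1.0 - eta1) * a[-1]

telescoping_gap = abs(summed - closed_form)  # zero up to rounding
```

Only the boundary term $(1-\eta_1)a_{K-1}$ survives the cancellation, which is why a single application of $|\langle\Delta_{K-1},n_{K-1}\rangle|\le 2R_wR_g$ suffices in (86c).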

Multiplying both sides of (86) by $\beta/\eta_1$, we have

\begin{align}
&\sigma\beta\sum_{k=0}^{K-1}\mathbb{E}[\|\Delta_k\|^2] \tag{87a}\\
&\le\frac{c_3\beta}{\eta_1}\sum_{k=0}^{K-1}\mathbb{E}[\|v_k-v_{k-1}\|]+\frac{c_4}{\eta_1}\beta^2K \notag\\
&\quad+\frac{2(1-\eta_1)}{\eta_1}R_wR_g\beta+\beta\sum_{k=0}^{K-1}\mathbb{E}[\langle\Delta_k,n_k\rangle] \tag{87b}\\
&\le\frac{c_3}{\eta_1}R_h\alpha\beta K+\frac{c_4}{\eta_1}\beta^2K+\frac{2(1-\eta_1)}{\eta_1}R_wR_g\beta \notag\\
&\quad+\beta\sum_{k=0}^{K-1}\mathbb{E}[\langle\Delta_k,n_k\rangle] \tag{87c}
\end{align}

where (87c) follows from the inequality in (21).

Substituting (77) into (87) and performing several algebraic manipulations, we obtain

\begin{equation}
\begin{split}
&\Big[\sigma\beta-[(1+\gamma)R_\pi G_*+2G_*^2]\alpha\Big]\sum_{k=0}^{K-1}\mathbb{E}[\|\Delta_k\|^2]\\
&\le\frac{\alpha}{4}\sum_{k=0}^{K-1}\|\nabla J(v_k)\|^2+\frac{1}{2}\big[\mathbb{E}[\|\Delta_0\|^2]-\mathbb{E}[\|\Delta_K\|^2]\big]\\
&\quad+L_*^2R_h^2\alpha^2K+\Big[\frac{c_4}{\eta_1}+R_g^2\Big]\beta^2K\\
&\quad+\Big[\frac{c_3}{\eta_1}+2R_wG_*\Big]R_h\alpha\beta K+\frac{2(1-\eta_1)}{\eta_1}R_wR_g\beta.
\end{split}
\tag{88}
\end{equation}

Dividing both sides of (88) by $K$, we complete the proof.
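For reference, the division step can be written out explicitly. The display below is a sketch obtained by dividing every term of (88) by $K$; it uses only the symbols already appearing in (88) and assumes $K\geq 1$.

```latex
\begin{equation*}
\begin{split}
&\Big[\sigma\beta-\big[(1+\gamma)R_{\pi}G_{*}+2G_{*}^{2}\big]\alpha\Big]\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\Delta_{k}\|^{2}]\\
&\leq\frac{\alpha}{4K}\sum_{k=0}^{K-1}\|\nabla J(v_{k})\|^{2}+\frac{1}{2K}\big[\mathbb{E}[\|\Delta_{0}\|^{2}]-\mathbb{E}[\|\Delta_{K}\|^{2}]\big]\\
&\quad+L_{*}^{2}R_{h}^{2}\alpha^{2}+\Big[\frac{c_{4}}{\eta_{1}}+R_{g}^{2}\Big]\beta^{2}\\
&\quad+\Big[\frac{c_{3}}{\eta_{1}}+2R_{w}G_{*}\Big]R_{h}\alpha\beta+\frac{2(1-\eta_{1})}{\eta_{1}K}R_{w}R_{g}\beta.
\end{split}
\end{equation*}
```

After the division, only the $\Delta_{0}$/$\Delta_{K}$ difference and the last term retain a $1/K$ factor, whereas the terms in $\alpha^{2}$, $\beta^{2}$, and $\alpha\beta$ no longer depend on $K$.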

Appendix G Supporting Lemmas

Lemma 5

When Assumption 1 is satisfied, the joint distribution satisfies

\begin{equation}
\|\mu_{v}\otimes\pi_{v}-\mu_{v'}\otimes\pi_{v'}\|_{\mathrm{TV}}\leq(1+c_{2})|\mathcal{A}|L_{\pi}\|v-v'\|
\tag{89}
\end{equation}

where $c_{2}>0$, $v\in\mathbb{R}^{d_{v}}$, and $v'\in\mathbb{R}^{d_{v}}$.

Proof:

Following [36, Corollary 3.1], we have

\begin{align}
&\|\mu_{v}-\mu_{v'}\|_{\mathrm{TV}} \tag{90a}\\
&\leq c_{2}\Big\|\sum_{a}\pi_{v}\otimes\mathbb{P}-\sum_{a}\pi_{v'}\otimes\mathbb{P}\Big\|_{\mathrm{TV}} \tag{90b}\\
&\leq c_{2}\int_{a,s'}\mathbb{P}(s'|s,a)\,|\pi_{v}(a|s)-\pi_{v'}(a|s)|\,da\,ds' \tag{90c}\\
&\leq c_{2}|\mathcal{A}|L_{\pi}\|v-v'\| \tag{90d}
\end{align}

where $c_{2}$ is a predetermined positive constant, and (90d) follows from the $L_{\pi}$-Lipschitz continuity of the behavior policy.

Then, we analyze the total variation norm of $\mu_{v}\otimes\pi_{v}-\mu_{v'}\otimes\pi_{v'}$ as

\begin{align}
&\|\mu_{v}\otimes\pi_{v}-\mu_{v'}\otimes\pi_{v'}\|_{\mathrm{TV}} \tag{91a}\\
&=\int_{s,a}|\mu_{v}(s)\pi_{v}(a|s)-\mu_{v'}(s)\pi_{v'}(a|s)|\,ds\,da \tag{91b}\\
&\leq\int_{s,a}\mu_{v}(s)\,|\pi_{v}(a|s)-\pi_{v'}(a|s)|\,ds\,da \notag\\
&\quad+\int_{s,a}\pi_{v'}(a|s)\,|\mu_{v}(s)-\mu_{v'}(s)|\,ds\,da \tag{91c}\\
&\leq(1+c_{2})|\mathcal{A}|L_{\pi}\|v-v'\| \tag{91d}
\end{align}

where (91d) follows from (90) and the $L_{\pi}$-Lipschitz continuity of the behavior policy. ∎
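The decomposition in (91b)-(91c) can be sanity-checked numerically on a small finite MDP. The sketch below is illustrative only: the 3-state, 2-action transition kernel, the tabular softmax policy, and all function names (`softmax_policy`, `stationary_distribution`, `decomposition_terms`) are assumptions that do not appear in the paper, and sums over states and actions stand in for the integrals. It follows the convention of (91b), where the total variation norm is the unnormalized sum of absolute differences.

```python
import math
import random


def softmax_policy(theta):
    """Map logits theta[s][a] to a tabular policy pi[s][a] = pi(a|s)."""
    policy = []
    for row in theta:
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        policy.append([e / total for e in exps])
    return policy


def stationary_distribution(kernel, policy, iters=500):
    """Power iteration for the state chain induced by `policy` on kernel[s][a][s']."""
    n = len(kernel)
    mu = [1.0 / n] * n
    for _ in range(iters):
        nxt = [0.0] * n
        for s in range(n):
            for a, p_a in enumerate(policy[s]):
                for s2 in range(n):
                    nxt[s2] += mu[s] * p_a * kernel[s][a][s2]
        mu = nxt
    return mu


def decomposition_terms(mu1, pi1, mu2, pi2):
    """Return the left-hand side of (91b) and the two terms of (91c)."""
    pairs = [(s, a) for s in range(len(mu1)) for a in range(len(pi1[0]))]
    lhs = sum(abs(mu1[s] * pi1[s][a] - mu2[s] * pi2[s][a]) for s, a in pairs)
    policy_term = sum(mu1[s] * abs(pi1[s][a] - pi2[s][a]) for s, a in pairs)
    state_term = sum(pi2[s][a] * abs(mu1[s] - mu2[s]) for s, a in pairs)
    return lhs, policy_term, state_term


rng = random.Random(0)
n_states, n_actions = 3, 2

# Hypothetical strictly positive kernel, so every induced chain is ergodic.
kernel = []
for _ in range(n_states):
    rows = []
    for _ in range(n_actions):
        weights = [rng.uniform(0.1, 1.0) for _ in range(n_states)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    kernel.append(rows)

# Two nearby policy parameters v and v' (tabular softmax logits, an assumption).
theta_v = [[rng.gauss(0.0, 1.0) for _ in range(n_actions)] for _ in range(n_states)]
theta_vp = [[x + 0.1 * rng.gauss(0.0, 1.0) for x in row] for row in theta_v]

pi_v, pi_vp = softmax_policy(theta_v), softmax_policy(theta_vp)
mu_v = stationary_distribution(kernel, pi_v)
mu_vp = stationary_distribution(kernel, pi_vp)

lhs, policy_term, state_term = decomposition_terms(mu_v, pi_v, mu_vp, pi_vp)
print(f"joint TV (91b): {lhs:.6f}")
print(f"bound    (91c): {policy_term + state_term:.6f}")
```

Since (91c) is a pointwise triangle inequality, the printed bound should never fall below the joint total variation distance, whatever seed or perturbation size is chosen. Checking the constants in (91d) would additionally require values for $c_{2}$ and $L_{\pi}$, which this sketch does not estimate.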

Lemma 6

When Assumption 1 holds, the overall reward $J(v)$ has an $L$-Lipschitz continuous gradient $\nabla J(v)$.

Proof:

Based on Lemma 3, the optimal critic $w^{*}(v)$ is $L_{*}$-Lipschitz with respect to $v$. Together with arguments similar to those in [37, Lemma 3.2], there exists a positive constant $L$ such that $\|\nabla J(v)-\nabla J(v')\|\leq L\|v-v'\|$. Therefore, we can apply the equivalent condition of $\|\nabla J(v)-\nabla J(v')\|\leq L\|v-v'\|$ in [38, Theorem 2.1.5] to obtain (16). ∎
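As a reminder of what the Lipschitz-gradient property provides, the following sketch derives the standard quadratic bound from $\|\nabla J(v)-\nabla J(v')\|\leq L\|v-v'\|$. Equation (16) is not reproduced in this appendix, so reading it as a bound of this type is an assumption; the derivation itself is the textbook one and relies only on the fundamental theorem of calculus and the Cauchy-Schwarz inequality.

```latex
\begin{align*}
&\big|J(v')-J(v)-\langle\nabla J(v),v'-v\rangle\big|\\
&=\Big|\int_{0}^{1}\big\langle\nabla J\big(v+t(v'-v)\big)-\nabla J(v),\,v'-v\big\rangle\,dt\Big|\\
&\leq\int_{0}^{1}\big\|\nabla J\big(v+t(v'-v)\big)-\nabla J(v)\big\|\,\|v'-v\|\,dt\\
&\leq\int_{0}^{1}Lt\,\|v'-v\|^{2}\,dt=\frac{L}{2}\|v'-v\|^{2}.
\end{align*}
```

In particular, $J(v')\geq J(v)+\langle\nabla J(v),v'-v\rangle-\frac{L}{2}\|v'-v\|^{2}$, which is the form typically used to lower-bound the reward improvement of a gradient-ascent step.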

References

  • [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
  • [2] H. Ju, R. Juan, R. Gomez, K. Nakamura, and G. Li, “Transferring policy of deep reinforcement learning from simulation to reality for robotics,” Nature Machine Intelligence, vol. 4, no. 12, pp. 1077–1087, Dec. 2022.
  • [3] R. Wu, Z. Yao, J. Si, and H. H. Huang, “Robotic knee tracking control to mimic the intact human knee profile based on actor-critic reinforcement learning,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 1, pp. 19–30, Jan. 2022.
  • [4] Y. Ren, R. Xie, F. R. Yu, R. Zhang, Y. Wang, Y. He, and T. Huang, “Connected and autonomous vehicles in web3: An intelligence-based reinforcement learning approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 8, pp. 9863–9877, Aug. 2024.
  • [5] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar, “Finite-sample analysis for decentralized batch multiagent reinforcement learning with networked agents,” IEEE Transactions on Automatic Control, vol. 66, no. 12, pp. 5925–5940, Dec. 2021.
  • [6] Y. Li, Y. Tang, R. Zhang, and N. Li, “Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach,” IEEE Transactions on Automatic Control, vol. 67, no. 12, pp. 6429–6444, Dec. 2022.
  • [7] N. Li, X. Li, J. Peng, and Z. Q. Xu, “Stochastic linear quadratic optimal control problem: A reinforcement learning method,” IEEE Transactions on Automatic Control, vol. 67, no. 9, pp. 5009–5016, Sept. 2022.
  • [8] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, vol. 12, Denver, CO, USA, Dec. 1999.
  • [9] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, “On the theory of policy gradient methods: Optimality, approximation, and distribution shift,” Journal of Machine Learning Research, vol. 22, no. 98, pp. 1–76, Feb. 2021.
  • [10] L. Wang, Q. Cai, Z. Yang, and Z. Wang, “Neural policy gradient methods: Global optimality and rates of convergence,” in International Conference on Learning Representations, Addis Ababa, Ethiopia, Apr. 2020.
  • [11] F. Huang, S. Gao, J. Pei, and H. Huang, “Momentum-based policy gradient methods,” in International Conference on Machine Learning, vol. 119, Vienna, Austria, July 2020, pp. 4422–4433.
  • [12] L. Yang, Y. Zhang, G. Zheng, Q. Zheng, P. Li, J. Huang, and G. Pan, “Policy optimization with stochastic mirror descent,” in AAAI Conference on Artificial Intelligence, vol. 36, no. 8, Arlington, VA, USA, Nov. 2022, pp. 8823–8831.
  • [13] G. Lan, “Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes,” Mathematical Programming, vol. 198, no. 1, pp. 1059–1106, Mar. 2023.
  • [14] J. Schulman et al., “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, Aug. 2017.
  • [15] J. Bhandari, D. Russo, and R. Singal, “A finite time analysis of temporal difference learning with linear function approximation,” in Conference on Learning Theory, Stockholm, Sweden, July 2018, pp. 1691–1692.
  • [16] S. Zou, T. Xu, and Y. Liang, “Finite-sample analysis for SARSA with linear function approximation,” in Advances in Neural Information Processing Systems, vol. 32, Vancouver, BC, Canada, Dec. 2019.
  • [17] T. Sun, H. Shen, T. Chen, and D. Li, “Adaptive temporal difference learning with linear function approximation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 8812–8824, Dec. 2022.
  • [18] P. Xu and Q. Gu, “A finite-time analysis of Q-learning with neural network function approximation,” in International Conference on Machine Learning, vol. 119, Vienna, Austria, July 2020, pp. 10 555–10 565.
  • [19] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” in Advances in Neural Information Processing Systems, vol. 12, Denver, CO, USA, Dec. 1999.
  • [20] M. Hong, H.-T. Wai, Z. Wang, and Z. Yang, “A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic,” SIAM Journal on Optimization, vol. 33, no. 1, pp. 147–180, Jan. 2023.
  • [21] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction (second edition).   Cambridge, MA, USA: The MIT Press, 2018.
  • [22] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, “SBEED: Convergent reinforcement learning with nonlinear function approximation,” in International Conference on Machine Learning, vol. 80, Stockholm, Sweden, July 2018, pp. 1125–1134.
  • [23] S. Qiu, Z. Yang, J. Ye, and Z. Wang, “On finite-time convergence of actor-critic algorithm,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 2, pp. 652–664, June 2021.
  • [24] H. Kumar, A. Koppel, and A. Ribeiro, “On the sample complexity of actor-critic method for reinforcement learning with function approximation,” Machine Learning, vol. 112, no. 7, pp. 2433–2467, Feb. 2023.
  • [25] S. Zhang, B. Liu, H. Yao, and S. Whiteson, “Provably convergent two-timescale off-policy actor-critic with function approximation,” in International Conference on Machine Learning, vol. 119, Vienna, Austria, July 2020, pp. 11 204–11 213.
  • [26] S. Khodadadian, T. T. Doan, J. Romberg, and S. T. Maguluri, “Finite-sample analysis of two-time-scale natural actor–critic algorithm,” IEEE Transactions on Automatic Control, vol. 68, no. 6, pp. 3273–3284, June 2023.
  • [27] T. Chen, Y. Sun, and W. Yin, “Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems,” in Advances in Neural Information Processing Systems, vol. 34, Virtual, Dec. 2021, pp. 25 294–25 307.
  • [28] Y. F. Wu, W. Zhang, P. Xu, and Q. Gu, “A finite-time analysis of two time-scale actor-critic methods,” in Advances in Neural Information Processing Systems, vol. 33, Virtual, Dec. 2020, pp. 17 617–17 628.
  • [29] H. Shen and T. Chen, “A single-timescale analysis for stochastic approximation with multiple coupled sequences,” in Advances in Neural Information Processing Systems, vol. 35, New Orleans, LA, USA, Dec. 2022, pp. 17 415–17 429.
  • [30] A. Olshevsky and B. Gharesifard, “A small gain analysis of single timescale actor critic,” SIAM Journal on Control and Optimization, vol. 61, no. 2, pp. 980–1007, Apr. 2023.
  • [31] X. Chen and L. Zhao, “Finite-time analysis of single-timescale actor-critic,” in Advances in Neural Information Processing Systems, vol. 36, New Orleans, LA, USA, Dec. 2023, pp. 7017–7049.
  • [32] Y. Duan and M. J. Wainwright, “Taming ‘data-hungry’ reinforcement learning? Stability in continuous state-action spaces,” arXiv preprint arXiv:2401.05233, Jan. 2024.
  • [33] K. Doya, “Reinforcement learning in continuous time and space,” Neural Computation, vol. 12, no. 1, pp. 219–245, 2000.
  • [34] V. R. Konda and V. S. Borkar, “Actor-critic–type learning algorithms for Markov decision processes,” SIAM Journal on Control and Optimization, vol. 38, no. 1, pp. 94–123, 1999.
  • [35] D. A. Levin and Y. Peres, Markov Chains and Mixing Times.   American Mathematical Soc., 2017, vol. 107.
  • [36] A. Y. Mitrophanov, “Sensitivity and convergence of uniformly ergodic Markov chains,” Journal of Applied Probability, vol. 42, no. 4, pp. 1003–1014, Dec. 2005.
  • [37] K. Zhang, A. Koppel, H. Zhu, and T. Başar, “Global convergence of policy gradient methods to (almost) locally optimal policies,” SIAM Journal on Control and Optimization, vol. 58, no. 6, pp. 3586–3612, Dec. 2020.
  • [38] Y. Nesterov, Lectures on Convex Optimization (second edition).   Cham, Switzerland: Springer, 2018, vol. 137.
Yanjie Dong (Member, IEEE) is an Associate Professor and the Assistant Dean of the Artificial Intelligence Research Institute, Shenzhen MSU-BIT University. Dr. Dong obtained his Ph.D. and M.A.Sc. degrees from The University of British Columbia, Canada, in 2020 and 2016, respectively. His research interests focus on the design and analysis of machine learning algorithms, machine learning based resource allocation algorithms, and quantum computing technologies.
Haijun Zhang (Fellow, IEEE) is a Professor at the University of Science and Technology Beijing, China. He was a postdoctoral research fellow in the Department of Electrical and Computer Engineering at The University of British Columbia, Canada. He serves/served as an Editor of IEEE Transactions on Information Forensics and Security, IEEE Transactions on Communications, IEEE Transactions on Network Science and Engineering, and IEEE Transactions on Vehicular Technology. He received the IEEE CSIM Technical Committee Best Journal Paper Award, in 2018, IEEE ComSoc Young Author Best Paper Award, in 2017, and IEEE ComSoc Asia-Pacific Best Young Researcher Award, in 2019. He is an IEEE ComSoc Distinguished Lecturer.
Gang Wang (Senior Member, IEEE) is a Professor with the School of Automation at the Beijing Institute of Technology. Dr. Wang received a B.Eng. degree in 2011 and a Ph.D. degree in 2018, both from the Beijing Institute of Technology, Beijing, China. He also received a Ph.D. degree from the University of Minnesota, Minneapolis, USA, in 2018, where he stayed as a postdoctoral researcher until July 2020. His research interests focus on the areas of signal processing, control, and reinforcement learning with applications to cyber-physical systems and multi-agent systems. He was the recipient of the Best Paper Award from the Frontiers of Information Technology & Electronic Engineering in 2021, the Excellent Doctoral Dissertation Award from the Chinese Association of Automation in 2019, and the Outstanding Editorial Board Member Award from the IEEE Signal Processing Society in 2023. He serves as an Editor of Signal Processing and IEEE Transactions on Signal and Information Processing over Networks.
Shisheng Cui is a Professor with the School of Automation at the Beijing Institute of Technology. Dr. Cui received the B.S. degree from Tsinghua University, Beijing, China, in 2009, the M.S. degree from Stanford University, Stanford, USA, in 2011 and the Ph.D. degree from Pennsylvania State University, University Park, USA, in 2019. His current research interests lie in optimization, variational inequality problems, and inclusion problems complicated by nonsmoothness and uncertainty.
Xiping Hu is currently a Professor with Shenzhen MSU-BIT University, and is also with Beijing Institute of Technology, China. Dr. Hu received the Ph.D. degree from the University of British Columbia, Vancouver, BC, Canada. Dr. Hu is the co-founder and chief scientist of Erudite Education Group Limited, Hong Kong, a leading language learning mobile application company with over 100 million users, listed as a top-2 language education platform globally. His research interests include affective computing, mobile cyber-physical systems, crowdsensing, social networks, and cloud computing. He has published more than 150 papers in prestigious conferences and journals, such as IJCAI, AAAI, ACM MobiCom, WWW, and IEEE TPAMI/TMM/TVT/IoTJ/COMMAG.